Planet Code4Lib

Crossing the country / Galen Charlton

As some of you already know, Marlene and I are moving from Seattle to Atlanta in December. We’ve moved many (too many?) times before, so we’ve got most of the logistics down pat. Movers: hired! New house: rented! Mail forwarding: set up! Physical books: still too dang many!

We could do it in our sleep! (And the scary thing is, perhaps we have in the past.)

One thing that is different this time is that we’ll be driving across the country, visiting friends along the way.  3,650 miles, one car, two drivers, one Keurig, two suitcases, two sets of electronic paraphernalia, and three cats.

Cross-country route

Who wants to lay odds on how many miles it will take each day for the cats to lose their voices?

Fortunately Sophia is already testing the cats’ accommodations:

Sophie investigating the crate

I will miss the friends we made in Seattle, the summer weather, the great restaurants, being able to walk down to the water, and decent public transportation. I will also miss the drives up to Vancouver for conferences with a great bunch of librarians; I’m looking forward to attending Code4Lib BC next week, but I’m sorry that our personal tradition of American Thanksgiving in British Columbia is coming to an end.

As far as Atlanta is concerned, I am looking forward to being back in MPOW’s office, having better access to a variety of good barbecue, the winter weather, and living in an area with less de facto segregation.

It’s been a good two years in the Pacific Northwest, but much to my surprise, I’ve found that the prospect of moving back to Atlanta feels a bit like a homecoming. So, onward!

Bookmarks for November 21, 2014 / Nicole Engard

Today I found the following resources and bookmarked them.

Digest powered by RSS Digest

The post Bookmarks for November 21, 2014 appeared first on What I Learned Today....

Free webinar: The latest on Ebola / District Dispatch

Photo by Phil Moyer

As the Ebola outbreak continues, the public must sort through all of the information being disseminated via the news media and social media. In this rapidly evolving environment, librarians are providing valuable services to their communities as they assist their users in finding credible information sources on Ebola, as well as other infectious diseases.

On December 12, 2014, library leaders from the U.S. National Library of Medicine will host the free webinar “Ebola and Other Infectious Diseases: The Latest Information from the National Library of Medicine.” As a follow-up to the webinar they presented in October, the National Library of Medicine librarians will discuss how to provide effective services in this environment and will give an update on information sources that can be of assistance to librarians.

Speakers

  • Siobhan Champ-Blackwell is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center. Champ-Blackwell selects material to be added to the NLM disaster medicine grey literature database and is responsible for the Center’s social media efforts. Champ-Blackwell has over 10 years of experience in providing training on NLM products and resources.
  • Elizabeth Norton is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center where she has been working to improve online access to disaster health information for the disaster medicine and public health workforce. Norton has presented on this topic at national and international association meetings and has provided training on disaster health information resources to first responders, educators, and librarians working with the disaster response and public health preparedness communities.

Date: December 12, 2014
Time: 2:00 PM–3:00 PM Eastern
Register for the free event

If you cannot attend this live session, a recorded archive will be available to view at your convenience. To view past webinars also done in collaboration with iPAC, please visit Lib2Gov.org.

The post Free webinar: The latest on Ebola appeared first on District Dispatch.

Library as Digital Consultancy / M. Ryan Hess

As faculty and students delve into digital scholarly works, they are tripping over the kinds of challenges that libraries specialize in overcoming, such as questions about digital project planning, improving discovery, or using quality metadata. Indeed, nobody is better suited than librarians to help scholars decide how to organize and deliver their digital works.

At my institution, we have not marketed our expertise in any meaningful way (yet), but we receive regular requests for help from faculty and campus organizations who are struggling with publishing digital scholarship. For example, a few years ago a team of librarians at my library helped researchers from the University of Ireland at Galway migrate and restructure their online collection of annotations from the Vatican Archive to a more stable home on Omeka.net. Our expertise in metadata standards, OAI harvesting, digital collection platforms and digital project planning turned out to be invaluable in saving their dying collection and giving it a stable, long-term home. You can read more in my Saved by the Cloud post.

These kinds of requests have continued since. In recognition of this growing need, we are poised to launch a digital consultancy service on our campus.

Digital Project Planning

A core component of our jobs is planning digital projects. Over the past year, in fact, we’ve developed a standard project planning template that we apply to each digital project that comes our way. This has done wonders at keeping us all up to date on what stage each project is in and who is up next in terms of the workflow.

Researchers are often experts at planning out their papers, but they don’t normally have much experience with planning a digital project. Because metadata and preservation don’t normally come up for them, for example, they tend to overlook planning for those aspects. More generally, I’ve found that just having a template to work with can help them understand how the experts do digital projects and give them a sense of the issues they need to consider when planning their own, whether that’s building an online exhibit or organizing their selected works in ways that will get the biggest bang for the buck.

We intend to begin formally offering project planning help to faculty very soon.

Platform Selection

It’s also our job to keep abreast of the various technologies available for distributing digital content, whether that is harvesting protocols, web content management systems, new plugins for WordPress or digital humanities exhibit platforms. Sometimes researchers know about some of these, but in my experience, their first choice is not necessarily the best for what they want to do.

It is fairly common for me to meet with campus partners whose existing online collection has been published on a platform that is ill-suited to what they are trying to accomplish. Currently, we have many departments moving old content based in SQL databases to plain HTML pages with no database behind them whatsoever. When I show them some of the other options, such as our Digital Commons-based institutional repository or Omeka.net, they often state they had no idea that such options existed and are very excited to work with us.

Metadata

I think people in general are becoming more aware of metadata, but there are still many technical considerations that your typical researcher may not be aware of. At our library, we have helped out with all aspects of metadata. We have helped researchers clean up their data to conform to authorized terms and standard vocabularies. We have explained Dublin Core. We have helped re-encode their data so that diacritics display online. We have done crosswalking and harvesting. It’s a deep area of knowledge and one that few people outside of libraries know well.
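
As a concrete illustration of what a simple crosswalk looks like in practice, here is a minimal sketch in Python. The local field names and the sample record are hypothetical placeholders, not drawn from any particular collection we have worked with.

    import unicodedata

    # Hypothetical mapping from a local spreadsheet's column names to Dublin Core.
    CROSSWALK = {
        "item_title": "dc:title",
        "creator_name": "dc:creator",
        "subject_terms": "dc:subject",
        "date_created": "dc:date",
    }

    def to_dublin_core(record):
        """Map a local record (a dict) onto Dublin Core elements, normalizing
        text to NFC so that diacritics display consistently online."""
        dc_record = {}
        for local_field, dc_element in CROSSWALK.items():
            value = record.get(local_field, "")
            if value:
                dc_record[dc_element] = unicodedata.normalize("NFC", value)
        return dc_record

    example = {
        "item_title": "Cartas de relación",
        "creator_name": "Cortés, Hernán",
        "subject_terms": "Mexico--History",
        "date_created": "1520",
    }
    print(to_dublin_core(example))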

One recommendation I would share with any budding metadata consultants is that you really need to be the Carl Sagan of metadata. This is pretty technical stuff and most people don’t need all the details. Stick to discussing the final outcome rather than the technical details, and your help will be far better understood and appreciated. For example, I once presented to a room of researchers on all the technical fixes we made to a database to enhance and standardize the metadata, but this went over terribly. People later came up to me and joked that whatever it was we did, they were sure it was important, and thanked us for being there. I guess that was a good outcome since they acknowledged our contribution. But it would have been better had they understood the practical benefits for the collection and the users of that content.

SEO

Search Engine Optimization is not hard, but it is likely that few people outside of the online marketing and web design world know what it is. I often find people can understand it very quickly if you simply define it as “helping Google understand your content so it can help people find you.” Simple SEO tricks like defining and then using keywords in your headers will do wonders for your collection’s visibility in the major search engines. You can go deep with this stuff too, so I like to gauge my audience’s interest and provide only as much detail as I think they have an appetite for.

Discovery

It’s a sad statement on the state of libraries, but the real discovery game is in the major search engines…not in our siloed, boutique search interfaces. Most people begin their searches (whether academic or not) in Google, and this is really bad news for our digital collections since, by and large, library collections sit in the deep web, beyond the reach of the search robots.

I recently tried a search in Google.com for the title of a digital image in one of our collections and found it. Yay! Then I tried the same search in Google Images. No dice.

More librarians are coming to terms with this discovery problem now, and we need to share it with digital scholars as they begin considering their own online collections so that they don’t make the mistakes libraries made (and continue to make…sigh) with our own collections.
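
One simple, widely used way to get collection items within reach of the crawlers is to publish an XML sitemap. The sketch below (plain Python, with hypothetical URLs) shows the idea; it is not a description of any particular platform’s built-in feature.

    import xml.etree.ElementTree as ET

    # Hypothetical item URLs for a digital collection.
    item_urls = [
        "https://collections.example.edu/items/1",
        "https://collections.example.edu/items/2",
    ]

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=NS)
    for url in item_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "changefreq").text = "monthly"

    # Write sitemap.xml, which can then be referenced from robots.txt or
    # submitted through the search engines' webmaster tools.
    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)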

We had one department at my institution that was sitting on a print journal they were considering putting online. Behind this was a desire to bring the publication back to life, since one researcher in Europe had told them she thought the journal had been discontinued years ago. In fact, it was still being published; it just wasn’t being indexed in Google. We offered our repository as an excellent home for it, especially because it would increase the journal’s visibility worldwide. Unfortunately, they opted for a very small, non-profit online publisher whose content, we demonstrated, was not surfacing in Google or Google Scholar. Well, you can lead a horse to water…

Still, I think this kind of understanding of the discovery universe does resonate with many. Going back to our somewhat invisible digital images, we will be pushing many to social media like Flickr with the expectation that this will boost visibility in the image search engines (and social networks) and drive more traffic to our digital collections.

Usability

This one is a tough one because people often come with pre-conceived notions of how they want their content organized or the site designed. For this reason, sometimes usability advice does not go over well. But for those instances when our experiences with user studies and information architecture can influence a digital scholarship project, it’s time well spent. In fact, I often hear people remark that they “never thought of it that way” and they’re willing to try some of the expert advice that we have to share.

Such advice includes things like:

  • Best practices for writing for the web
  • Principles of information architecture
  • Responsive design
  • Accessibility support
  • User Experience design

Marketing

It’s fitting to end on marketing. This is usually the final step in any digital project and one that often gets dropped. And yet, why do all the work of creating a digital collection only to let it go unnoticed? As digital project experts, librarians are familiar with the various channels available to promote collections and build followers, such as social networking sites, blogs and the like.

With our own digital projects, we discuss marketing at the very beginning so we are sure all the hooks, timing and planning considerations are understood by everyone. In fact, marketing strategy will impact some of the features of your exhibit, your choice of keywords used to help SEO, the ultimate deadlines that you set for completion and the staffing time you know you’ll need post launch to keep the buzz buzzing.

Most importantly, though, marketing plans can greatly influence the decision for which platform to use. For example, one of the benefits of Omeka.net (rather than self-hosted Omeka) is that any collection hosted with them becomes part of a network of other digital collections, boosting the potential for serendipitous discovery. I often urge faculty to opt for our Digital Commons repository over, say, their personal website, because anything they place in DC gets aggregated into the larger DC universe and has built-in marketing tools like email subscriptions and RSS feeds.

The bottom line here is that marketing is an area where librarians can shine. Online marketing of digital collections really pulls together all of the other forms of expertise that we can offer (our understanding of metadata, web technology and social networks) to fulfill the aim of every digital project: to reach other people and teach them something.


Steve Hetzler's "Touch Rate" Metric / David Rosenthal

Steve Hetzler of IBM gave a talk at the recent Storage Valley Supper Club on a new, scale-free metric for evaluating storage performance that he calls "Touch Rate". He defines this as the proportion of the store's total content that can be accessed per unit time. This leads to some very illuminating graphs that I discuss below the fold.

Steve's basic graph is a log-log plot with performance increasing up and to the right. Response time for accessing an object (think latency) decreases to the right on the X-axis and the touch rate, the proportion of the total capacity that can be accessed by random reads in a year (think bandwidth) increases on the Y-axis. For example, a touch rate of 100/yr means that random reads could access the entire contents 100 times a year. He divides the graph into regions suited to different applications, with minimum requirements for response time and touch rate. So, for example, transaction processing requires response times below 10ms and touch rates above 100 (the average object is accessed about once every 3 days).

The touch rate depends on the size of the objects being accessed. If you take a specific storage medium, you can use its specifications to draw a curve on the graph as the size varies. Here Steve uses "capacity disk" (i.e. commodity 3.5" SATA drives) to show the typical curve, which varies from being bandwidth limited (for large objects, on the left, horizontal side) to being response limited (for small objects, on the right, vertical side).
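
To make the shape of that curve concrete, here is a rough back-of-the-envelope calculation in Python. The drive specifications used (4TB capacity, 150MB/s sustained transfer, about 12.5ms average random access) are illustrative assumptions, not figures from Steve's talk; the point is just to show the response-limited and bandwidth-limited ends of the curve.

    SECONDS_PER_YEAR = 365 * 24 * 3600

    # Illustrative specs for a single commodity 3.5" SATA drive (assumptions).
    capacity_bytes = 4e12   # 4 TB
    bandwidth = 150e6       # 150 MB/s sustained transfer
    seek_time = 0.0125      # ~12.5 ms average random access

    for obj_bytes in (4e3, 1e6, 1e9, 100e9):
        response_time = seek_time + obj_bytes / bandwidth
        objects_per_year = SECONDS_PER_YEAR / response_time
        # Fraction of total capacity touched per year by continuous random reads.
        touch_rate = objects_per_year * obj_bytes / capacity_bytes
        print(f"object {obj_bytes / 1e6:10.3f} MB: "
              f"response {response_time * 1000:9.1f} ms, "
              f"touch rate {touch_rate:10.1f} /yr")

Small objects are pinned near the drive's access latency but yield a tiny touch rate; large objects approach the bandwidth limit, where the touch rate flattens out at roughly bandwidth times seconds-per-year divided by capacity.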

As an example of the use of these graphs, Steve analyzed the idea of MAID (Massive Array of Idle Drives). He used HGST MegaScale DC 4000.B SATA drives, and assumed that at any time 10% of them would be spun-up and the rest would be in standby. With random accesses to data objects, 9 out of 10 of them will encounter a 15sec spin-up delay, which sets the response time limit. Fully powering-down the drives as Facebook's cold storage does would save more power but increase the spin-up time to 20s. The system provides only (actually somewhat less than) 10% of the bandwidth per unit content, which sets the touch rate limit.

Then Steve looked at the fine print of the drive specifications. He found two significant restrictions:
  • The drives have a life-time limit of 50K start/stop cycles.
  • For reasons that are totally opaque, the drives are limited to a total transfer of 180TB/yr.
Applying these gives this modified graph. The 180TB/yr limit is the horizontal line, reducing the touch rate for large objects. If the drives have a 4-year life, we would need 8M start/stop cycles to achieve a 15sec response time. But we only have 50K. To stay within this limit, the response time has to increase by a factor of 8M/50K, or 160, which is the vertical line. So in fact a traditional MAID system is effective only in the region below the horizontal line and left of the vertical line, much smaller than expected.
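
The arithmetic behind those two limit lines can be reproduced in a few lines of Python (a sketch of the reasoning only; the talk rounds the numbers to roughly 8M cycles and a factor of 160):

    SECONDS_PER_YEAR = 365 * 24 * 3600

    drive_life_years = 4
    spin_up_delay = 15                 # seconds per random access that wakes a drive
    lifetime_start_stop_limit = 50e3   # from the drive spec sheet

    # If nearly every random access costs ~15 s of spin-up, a drive kept busy for
    # its whole life would see roughly this many start/stop cycles:
    cycles_needed = drive_life_years * SECONDS_PER_YEAR / spin_up_delay
    print(f"cycles needed at 15 s response: {cycles_needed:,.0f}")   # ~8.4 million

    # To stay within the 50K lifetime limit, response time must stretch by:
    slowdown = cycles_needed / lifetime_start_stop_limit
    print(f"slowdown factor: {slowdown:.0f}x "
          f"(response time grows to ~{spin_up_delay * slowdown / 60:.0f} minutes)")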

This analysis suggests that traditional MAID is not significantly better than tapes in a robot. Here, for example, Steve examines configurations varying from one tape drive for 1600 LTO6 tapes, or 4PB per drive, to a quite unrealistically expensive 1 drive per 10 tapes, or 60TB per drive. Tape drives have a 120K lifetime load/unload cycle limit, and the tapes can withstand at most 260 full-file passes, so tape has a similar pair of horizontal and vertical lines.

The reason that Facebook's disk-based cold storage doesn't suffer from the same limits as traditional MAID is that it isn't doing random I/O. Facebook's system schedules I/Os so that it uses the full bandwidth of the disk array, raising the touch rate limit to that of the drives, and reducing the number of start-stop cycles. Admittedly, the response time for a random data object is now a worst-case 7 times the time for which a group of drives is active, but this is not a critical parameter for Facebook's application.

Steve's metric seems to be a major contribution to the analysis of storage systems.

Townhall, not Shopping Mall! Community, making, and the future of the Internet / Jenny Rose Halperin

I presented a version of this talk at the 2014 Futurebook Conference in London, England. They also kindly featured me in the program. Thank you to The Bookseller for a wonderful conference filled with innovation and intelligent people!

A few days ago, I was in the Bodleian Library at Oxford University, often considered the most beautiful library in the world. My enthusiastic guide told the following story:

After the Reformation (when all the books in Oxford were burned), Sir Thomas Bodley decided to create a place where people could go and access all the world’s information at their fingertips, for free.

“What does that sound like?” she asked. “…the Internet?”

While this is a lovely conceit, the part of the story that resonated with me for this talk is the other big change that Bodley made, which was to work with publishers, who were largely a monopoly at that point, to fill his library for free by turning it into a copyright library. While this seemed antithetical to the way publishers worked, by giving away a copy of their very expensive books they left an indelible and permanent mark on the face of human knowledge. It was not only preservation, but self-preservation.

Bodley was what people nowadays would probably call “an innovator” and maybe even in the parlance of my field, a “community manager.”

By thinking outside of the scheme of how publishing works, he joined together with a group of skeptics and created one of the greatest knowledge repositories in the world, one that still exists 700 years later. This speaks to a few issues:

Sharing economies, community, and publishing should and do go hand in hand and have since the birth of libraries. By stepping outside of traditional models, you are creating a world filled with limitless knowledge and crafting it in new and unexpected ways.

The bound manuscript is one of the most enduring technologies. This story remains relevant because books are still books and people are still reading them.

At the same time, things are definitely changing. For the past 1000 years, books and manuscripts were, for the most part, readily identifiable as books and manuscripts.

But what if I were to give Google Maps to a 16th Century Map Maker? Or what if I were to show Joseph Pulitzer Medium? Or what if I were to hand Gutenberg a Kindle? Or Project Gutenberg for that matter? What if I were to explain to Thomas Bodley how I shared the new Lena Dunham book with a friend by sending her the file instead of actually handing her the physical book? What if I were to try to explain Lena Dunham?

These innovations have all taken place within the last twenty years, and I would argue that we haven’t even scratched the surface in terms of the innovations that are to come.

We need to accept that the printed word may range from words on paper to an ereader or computer in 500 years, but I want to emphasize that in the 500 years to come, it will more likely range from the ereader to a giant question mark.

International literacy rates have risen rapidly over the past 100 years and companies are scrambling to be the first to reach what they call “developing markets” in terms of connectivity. In the vein of Mark Surman’s talk at the Mozilla Festival this year, I will instead call these economies post-colonial economies.

Because we (as people of the book) are fundamentally idealists who believe that the printed word can change lives, we need to be engaged with rethinking the printed word in a way that recognizes power structures and does not settle for the limited choices that the corporate Internet provides (think Facebook vs WhatsApp). This is not a panacea to fix the world’s ills.

In the Atlantic last year, Phil Nichols wrote an excellent piece that paralleled Web literacy and early 20th century literacy movements. The dualities between “connected” and “non-connected,” he writes, impose the same kinds of binaries and blind cure-all for social ills that the “literacy” movement imposed in the early 20th century. In equating “connectedness” with opportunity, we are “hiding an ideology that is rooted in social control.”

Surman, who is director of the Mozilla Foundation, claims that the Web, which had so much potential to become a free and open virtual meeting place for communities, has started to resemble a shopping mall. While I can go there and meet with my friends, it’s still controlled by cameras that are watching my every move and its sole motive is to get me to buy things.

85 percent of North America is connected to the Internet and 40 percent of the world is connected. Connectivity has increased by 676% over the past 13 years. Studies show that literacy and connectivity go hand in hand.

How do you envision a fully connected world? How do you envision a fully literate world? How can we empower a new generation of connected communities to become learners rather than consumers?

I’m not one of these technology nuts who’s going to argue that books are going to somehow leave their containers and become networked floating apparatuses, and I’m not going to argue that the ereader is a significantly different vessel than the physical book.

I’m also not going to argue that in twenty years we’re going to have a world of people who are only Web literate and not reading books. To make any kind of future prediction would be a false prophecy, elitist, and perhaps dangerous.

Although I don’t know what the printed word will look like in the next 500 years,

I want to take a moment to think outside the book,

to think outside traditional publishing models, and to embrace the instantaneousness, randomness, and spontaneity of the Internet as it could be, not as it is now.

One way I want you to embrace the wonderful wide Web is to try to at least partially decouple your social media followers from your community.

Twitter and other forms of social media are certainly a delightful and fun way for communities to communicate and get involved, but your viral campaign, if you have it, is not your community.

True communities of practice are groups of people who come together to think beyond traditional models and innovate within a domain. For a touchstone, a community of practice is something like the Penguin Labs internal innovation center that Tom Weldon spoke about this morning and not like Penguin’s 600,000 followers on Twitter. How can we bring people together to allow for innovation, communication, and creation?

The Internet provides new and unlimited opportunities for community and innovation, but we have to start managing communities and embracing the people we touch as makers rather than simply followers or consumers.

The maker economy is here— participatory content creation has become the norm rather than the exception. You have the potential to reach and mobilize 2.1 billion people and let them tell you what they want, but you have to identify leaders and early adopters and you have to empower them.

How do you recognize the people who create content for you? I don’t mean authors, but instead the ambassadors who want to get involved and stay involved with your brand.

I want to ask you, in the spirit of innovation from the edges

What is your next platform for radical participation? How are you enabling your community to bring you to the next level? How can you differentiate your brand and make every single person you touch psyched to read your content, together? How can you create a community of practice?

Community is conversation. Your users are not your community.

Ask yourself the question Rachel Fershleiser asked when building a community on Tumblr: Are you reaching out to the people who want to hear from you and encouraging them or are you just letting your community be unplanned and organic?

There comes a point where unplanned, organic growth reaches its limit. Know when you reach it.

Target, plan, be upbeat, and encourage people to talk to one another without your help and stretch the creativity of your work to the upper limit.

Does this model look different from when you started working in publishing? Good.

As the story of the Bodleian Library illustrated, sometimes a totally crazy idea can be the beginning of an enduring institution.

To repeat, the book is one of the most durable technologies and publishing is one of the most durable industries in history. Its durability has been put to the test more than once, and it will surely be put to the test again. Think of your current concerns as a minor stumbling block in a history filled with success, a history that has documented and shaped the world.

Don’t be afraid of the person who calls you up and says, “I have this crazy idea that may just change the way you work…” While the industry may shift, the printed word will always prevail.

Publishing has been around in some shape or form for 1000 years. Here’s hoping that it’s around for another 1000 more.

ALA Washington Office copyright event “too good to be true” / District Dispatch

(Left to right) ALA Washington Office Executive Director Emily Sheketoff, Jonathan Band, Brandon Butler and Mary Rasenberger.

On Tuesday, November 18th, the American Library Association (ALA) held a panel discussion on recent judicial interpretations of the doctrine of fair use. The discussion, entitled “Too Good to be True: Are the Courts Revolutionizing Fair Use for Education, Research and Libraries?” is the first in a series of information policy discussions to help us chart the way forward as the ongoing digital revolution fundamentally changes the way we access, process and disseminate information. This event took place at Arent Fox, a major Washington, D.C. law firm that generously provided the facility for our use.

These events are part of the ALA Office for Information Technology Policy’s broader Policy Revolution! initiative—an ongoing effort to establish and maintain a national public policy agenda that will amplify the voice of the library community in the policymaking process and position libraries to best serve their patrons in the years ahead.

Tuesday’s event convened three copyright experts to discuss and debate recent developments in digital fair use. The experts—ALA legislative counsel Jonathan Band; American University practitioner-in-practice Brandon Butler; and Authors Guild executive director Mary Rasenberger—engaged in a lively discussion that highlighted some points of agreement and disagreement between librarians and authors.

The library community is a strong proponent of fair use, a flexible copyright exception that enables use of copyrighted works without prior authorization from the rights holder. Whether a particular use is fair is determined by weighing four factors. A number of court decisions issued over the last three years have affirmed uses of copyrighted works by libraries as fair, including the mass digitization of books held by some research libraries, as in Authors Guild v. HathiTrust.

Band and Butler disagreed with Rasenberger on several points concerning recent judicial fair use interpretations. Band and Butler described judicial rulings on fair use in disputes like the Google Books case and the HathiTrust case as on-point, and rejected arguments that the reproductions of content at issue in these cases could result in economic injury to authors. Rasenberger, on the other hand, argued that repositories like HathiTrust and Google Books can in fact lead to negative market impacts for authors, and therefore do not represent a fair use.

Rasenberger believes that licensing arrangements should be made between authors and members of the library, academic and research communities who want to reproduce the content to which they hold rights. She takes specific issue with judicial interpretations of market harm that require authors to demonstrate proof of a loss of profits, suggesting that such harm can be established by showing that future injury is likely to befall an author as a result of the reproduction of his or her work.

Despite their differences of opinion, the panelists provided those in attendance at Tuesday’s event with some meaningful food for thought, and offered a thorough overview of the ongoing judicial debates over fair use. We were pleased that Washington Internet Daily published an article about our session, “Georgia State Case Highlights Fair Use Disagreement Among Copyright Experts,” on November 20, 2014. ALA continues to fight for public access to information as these debates play out.

Stay tuned for the next event, planned for early 2015!


The post ALA Washington Office copyright event “too good to be true” appeared first on District Dispatch.

Deployment and Development workflows at Cherry Hill / Cherry Hill Company

Last year, we reached a milestone at Cherry Hill when we moved all of our projects into a managed deployment system. We have talked about Jenkins, one of the tools we use to manage our workflow, and there has been continued interest in what our "recipe" consists of. Since we are using open source tools, and we think of ourselves as part of the (larger than Drupal) open source community, I want to share a bit more of what we use and how it is stitched together. Our hope is that this helps to spark a larger discussion of the tools others are using, so we can all learn from each other.

Git

Git is a distributed code revision control system. While we could use any revision control system, such as CVS or Subversion (and even though this is a given with most agencies, we strongly suggest you use *some* system over nothing at all), git is fairly easy to use, has great...

Read more »

10 Historic Mustache Must-haves / DPLA

In a continuation of our weekly facial hair inspiration (check out last week’s list of Civil War mustached men), we recognize that the “Movember” challenge isn’t easy. Growing an impressive beard or mustache, even for a good cause, can be a struggle. Let us help!

This week: A collection of historic mustache must-haves.

  • A “mustache-guard” best used “with drinking-cups or goblets, tumblers, and other drinking-vessels.”
  • A support group: The “Mustache Club,” 1893.
  • A little synthetic help (like this woman wearing a fake ‘stache in a skit).
  • A Japanese “mustache-lifter” from the 1920s.
  • Or this stick, which Japanese men used to raise their mustaches while drinking wine.
  • A little bit of dye, to keep your mustache a “natural brown or black,” as this advertisement promises.
  • A steady reflection.
  • A sense of humor (or not, if you aren’t a fan of clowns).
  • A nice ride, for regular trips to the barber.
  • A theme song.

Weekly user tests: Finding subject guides / Shelley Gullikson

This week we did a guerrilla-style test to see how (or if) people find our subject guides, particularly if they are not in our main listing. We asked: “Pretend that someone has told you there is a really great subject guide on the library website about [subject]. What would you do to find it?” We cycled through three different subjects not listed on our main subject guide page: Canadian History, Ottawa, and Homelessness.

Some Context

Our subject guides use a template created in-house (not LibGuides) and we use Drupal Views and Taxonomy to create our lists. The main subject guide page has  an A-Z list, an autocomplete search box, a list of broad subjects (e.g. Arts and Social Sciences) and a list of narrower subjects (e.g. Sociology). The list of every subject guide is on another page. Subject specialists were not sure if users would find guides that didn’t correspond to the narrower subjects (e.g. Sociology of Sport).

Results

The 21 students we saw did all kinds of things to find subject guides. We purposely used the same vocabulary as the site because it wasn’t supposed to be a test of the label “subject guide.” However, fewer than 30% clicked on the Subject Guides link; the majority used some sort of search.

Here you can see the places people went to on our home page most (highlighted in red), next frequently (in orange) and just once (yellow).

When people used our site search, they had little problem finding the guide (although a typo stymied one person). However, a lot of participants used our Summon search. I think there are a couple of reasons for this:

  • Students didn’t know what a subject guide was and so looked for guides the way they look for articles, books, etc.
  • Students think the Summon search box is for everything

Of the 6 students who did click on the Subject Guides link:

  • 2 used broad subjects (and neither was successful with this strategy)
  • 2 used narrow subjects (both were successful)
  • 1 used the A-Z list (with success)
  • 1 used the autocomplete search (with success)

One person thought that she couldn’t possibly find the Ottawa guide under “Subject Guides” because she thought those were only for courses. I found this very interesting because a number of our subject guides do not map directly to courses.

The poor performance of the broad subjects on the subject guide page is an issue, and the Web Committee will look at how we might address that. Making our site search more forgiving of typos is also going to move up the to-do list. But I think the biggest takeaway is that we really have to figure out how to get our guides indexed in Summon.
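
On the typo-forgiveness point, here is a quick illustration of the idea using Python’s standard difflib. It is only a sketch, not our actual Drupal search; the guide titles are just the ones mentioned in this post.

    import difflib

    # Guide titles mentioned above (test subjects plus one narrower guide).
    guides = ["Canadian History", "Ottawa", "Homelessness", "Sociology of Sport"]

    def find_guide(query, guides):
        """Return the closest-matching guide titles for a possibly misspelled query."""
        return difflib.get_close_matches(query, guides, n=3, cutoff=0.6)

    print(find_guide("Candian Histroy", guides))  # -> ['Canadian History']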


ALA welcomes Simon & Schuster change to Buy It Now program / District Dispatch

ALA President Courtney Young

ALA President Courtney Young

Today, the American Library Association (ALA) and its Digital Content Working Group (DCWG) welcomed Simon & Schuster’s announcement that it will allow libraries to opt into the “Buy It Now” program. The publisher began offering all of its ebook titles for library lending nationwide in June 2014, with required participation in the “Buy It Now” merchandising program, which enables library users to directly purchase a title rather than check it out from the library. Simon & Schuster ebooks are available for lending for one year from the date of purchase.

In an ALA statement, ALA President Courtney Young applauded the move:

From the beginning, the ALA has advocated for the broadest and most affordable library access to e-titles, as well as licensing terms that give libraries flexibility to best meet their community needs.

We appreciate that Simon & Schuster is modifying its library ebook program to provide libraries a choice in whether or not to participate in Buy It Now. Providing options like these allow libraries to enable digital access while also respecting local norms or policies. This change also speaks to the importance of sustaining conversations among librarians, publishers, distributors and authors to continue advancing our shared goals of connecting writers and readers.

DCWG Co-Chairs Carolyn Anthony and Erika Linke also commented on the Simon & Schuster announcement:

“We are still in the early days of this digital publishing revolution, and we hope we can co-create solutions that expand access, increase readership and improve exposure for diverse and emerging voices,” they said. “Many challenges remain, including high prices, privacy concerns, and other terms under which ebooks are offered to libraries. We are continuing our discussions with publishers.”

For more library ebook lending news, visit the American Libraries magazine E-Content blog.

The post ALA welcomes Simon & Schuster change to Buy It Now program appeared first on District Dispatch.

Introducing Library Pipeline / In the Library, With the Lead Pipe

Surfing a Pipeline

South Coast Pipe by Colm Walsh (CC-BY)

In Brief: We’re creating a nonprofit, Library Pipeline, that will operate independently from In the Library with the Lead Pipe, but will have similar and complementary aims: increasing and diversifying professional development; improving strategies and collaboration; fostering more innovation and start-ups; and encouraging LIS-related publishing and publications. In the Library with the Lead Pipe is a platform for ideas; Library Pipeline is a platform for projects.

At In the Library with the Lead Pipe, our goal has been to change libraries, and the world, for the better. It’s on our About page: We improve libraries, professional organizations, and their communities of practice by exploring new ideas, starting conversations, documenting our concerns, and arguing for solutions. Those ideas, conversations, concerns, and solutions are meant to extend beyond libraries and into the societies that libraries serve.

What we want to see is innovation–new ideas and new projects and collaborations. Innovative libraries create better educated citizens and communities with stronger social ties.

Unfortunately, libraries’ current funding structures and the limited professional development options available to librarians make it difficult to introduce innovation at scale. As we started talking about a couple of years ago, in our reader survey and in a subsequent editorial marking our fourth anniversary, we need to extend into other areas, besides publication, in order to achieve our goals. So we’re creating a nonprofit, Library Pipeline, that will operate independently from In the Library with the Lead Pipe, but will have similar and complementary aims.

Library Pipeline is dedicated to supporting structural changes by providing opportunities, funding, and services that improve the library as an institution and librarianship as a profession. In the Library with the Lead Pipe, the journal we started in 2008, is a platform for ideas; Library Pipeline is a platform for projects. Although our mission is provisional until our founding advisory board completes its planning process, we have identified four areas in which modest funding, paired with guidance and collaboration, should lead to significant improvements.

Professional Development

A few initiatives, notably the American Library Association’s Emerging Leaders and Spectrum Scholars programs, increase diversity and provide development opportunities for younger librarians. We intend to expand on these programs by offering scholarships, fellowships, and travel assistance that enable librarians to participate in projects that shift the trajectory of their careers and the libraries where they work.

Collaboration

Organized, diverse groups can solve problems that appear intractable if participants have insufficient time, resources, perspective, or influence. We would support collaborations that last a day, following the hack or camp model, or a year or two, like task forces or working groups.

Start-ups

We are inspired by incubators and accelerators, primarily YCombinator and SXSW’s Accelerator. The library and information market, though mostly dormant, could support several dozen for-profit and nonprofit start-ups. The catalyst will be mitigating founders’ downside risk by funding six months of development, getting them quick feedback from representative users, and helping them gain customers or donors.

Publishing

Librarianship will be stronger when its practitioners have as much interest in documenting and serving our own field as we have in supporting the other disciplines and communities we serve. For that to happen, our professional literature must become more compelling, substantive, and easier to access. We would support existing open access journals as well as restricted journals that wish to become open access, and help promising writers and editors create new publications.

These four areas overlap by design. For example, we envision an incubator for for-profit and nonprofit companies that want to serve libraries. In this example, we would provide funding for a diverse group of library students, professionals, and their partners who want to incorporate, and bring this cohort to a site where they can meet with seasoned librarians and entrepreneurs. After a period of time, perhaps six months, the start-ups would reconvene for a demo day attended by potential investors, partners, donors, and customers.

Founding Advisory Board

We were inspired by the Constellation Model for our formation process, as adapted by the Digital Public Library of America and the National Digital Preservation Alliance (see: “Using Emergence to Take Social Innovation to Scale”). Our first step was identifying a founding advisory board, whose members have agreed to serve a two-year term (July 2014-June 2016), at the end of which the board will be dissolved and replaced with a permanent governing board. During this period, the advisory board will formalize and ratify Library Pipeline’s governance and structure, establish its culture and business model, promote its mission, and define the organizational units that will succeed the advisory board, such as a permanent board of trustees and paid staff.

The members of our founding advisory board are:

The board will coordinate activity among, and serve as liaisons to, the volunteers on what we anticipate will eventually be six subcommittees (similar to DPLA’s workstreams). This is going to be a shared effort; the job is too big for ten people. Those six subcommittees and their provisional charges are:

  • Professional Development within LIS (corresponding to our “Professional Development” area). Provide professional development funding, in the form of scholarships, fellowships, or travel assistance, for librarians or others who are working on behalf of libraries or library organizations, with an emphasis on participation in cross-disciplinary projects or conferences that extend the field of librarianship in new directions and contribute to increased diversity among practitioners and the population we serve.
  • Strategies for LIS (corresponding to “Collaboration”). Bring together librarians and others who are committed to supporting libraries or library-focused organizations. These gatherings could be in-person or online, could last a day or could take a year, and could be as basic as brainstorming solutions to a timely, significant issue or as directed as developing solutions to a specific problem.
  • Innovation within LIS (corresponding to “Start-Ups”). Fund and advise library-related for-profit or nonprofit startups that have the potential to help libraries better serve their communities and constituents. We believe this area will be our primary focus, at least initially.
  • LIS Publications (corresponding with “Publishing”). Fund and advise LIS publications, including In the Library with the Lead Pipe. We could support existing open access journals or restricted journals that wish to become open access, and help promising writers and editors create new publications.
  • Governance. This may not need to be a permanent subcommittee, though in our formative stages it would be useful to work with people who understand how to create governance structures that provide a foundation that promotes stability and growth.
  • Sustainability. This would include fundraising, but it also seems to be the logical committee for creating the assessment metrics we need to have in place to ensure that we are fulfilling our commitment to libraries and the people who depend on them.

How Can You Help?

We’re looking for ideas, volunteers, and partners. Contact Brett or Lauren if you want to get involved, or want to share a great idea with us.

ALA and E-rate in the press / District Dispatch

Children and library computers

For nearly a year-and-a-half, the FCC has been engaged in an ongoing effort to update the E-rate program for the digital age. The American Library Association (ALA) has been actively engaged in this effort, submitting comments and writing letters to the FCC and holding meetings with FCC staff and other key E-rate stakeholders.

Our work on the E-rate modernization has drawn the attention of several media outlets over the past week, as the FCC prepares to consider an order that we expect to help libraries from the most populated cities to the most rural areas meet their needs related to broadband capacity and Wi-Fi:

The FCC Plans to Increase Your Phone Bill to Build Better Internet in Schools (ALA quoted)
E-Rate Funding Would Get Major Boost Under FCC Chair’s Plan
FCC’s Wheeler Draws Fans With E-Rate Cap Hike
Is expanding Wi-Fi to 10 million more students worth a cup of coffee?

ALA was also mentioned in articles from CQ Roll Call and PoliticoPro on Monday.

The new E-rate order is the second in the E-rate modernization proceeding. The FCC approved a first order on July 11th, which focuses on Wi-Fi and internal connections. ALA applauds the FCC for listening to our recommendations throughout the proceeding. Its work reflects an appreciation for all that libraries do to serve community needs related to Education, Employment, Entrepreneurship, Empowerment, and Engagement—the E’s of Libraries.

The post ALA and E-rate in the press appeared first on District Dispatch.

Torus - 2.30 / FOSS4Lib Recent Releases

Package: Torus
Release Date: Thursday, November 20, 2014

Last updated November 20, 2014. Created by Peter Murray on November 20, 2014.

2.30 Thu 20 Nov 2014 11:34:12 CET

- MKT-168: fix parent's 'created' lost during update

- MKT-170: bootstrap 'originDate' for non-inherit records

Learning Linked Data: SPARQL / OCLC Dev Network

One thing you realize pretty quickly is that it is very hard to work with Linked Data and confine your explorations to a single site or data set. The links inevitably lead you on a pilgrimage from one data set to another and another. In the case of the WorldCat Discovery API, my pilgrimage led me from WorldCat to id.loc.gov, FAST and VIAF and from VIAF on to dbpedia. Dbpedia is an amazingly fun data set to play with. Using it to provide additional richness and context to the discovery experience has been enlightening.
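
For readers who have not tried this, here is a small example of the kind of DBpedia exploration described above. It uses the SPARQLWrapper Python client (pip install sparqlwrapper), which is one common way to query a public SPARQL endpoint; the query itself is illustrative and is not taken from the WorldCat Discovery API work.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # DBpedia's public SPARQL endpoint.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
          <http://dbpedia.org/resource/Herman_Melville> dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        # Print the first 200 characters of the English abstract.
        print(binding["abstract"]["value"][:200])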

Libraries & Research: Changes in libraries / HangingTogether

[This is the fourth in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first, second, and third posts and also refer to the event webpage that contains links to slides, videos, photos, and a Storify summary.]

And now, onward to the final session of the meeting, which focused, appropriately enough, on changes in libraries, including taking on new roles and preparing to support future service demands. Libraries are engaging in new alliances and restructuring themselves to prepare for change in accordance with their strategic plans.

[Paul-Jervis Heath, Lynn Silipigni Connaway, and Jim Michalko]

Lynn Silipigni Connaway (Senior Research Scientist, OCLC Research) [link to video] shared the results of several studies that identify the importance of user-centered assessment and evaluation. Lynn has been working actively in this area since 2003, looking at not only researchers but also future researchers (students!). In interviews on virtual reference, focusing on prospective users, Lynn and her team found that students use Google and Wikipedia but also rely on human resources — other students, advisers, graduate students and faculty. In looking through years of data, interviewees tend to use generic terms like “database” and refer to specific tools and sources only when they are further along in their career — this doesn’t mean they don’t use them; rather, they get used to using more sophisticated terminology as they go along. No surprise, convenience trumps everything; researchers at all levels are eager to optimize their time, so many “satisfice” if the assignment or task doesn’t warrant extra time spent. From my perspective, one of the most interesting findings from Lynn’s studies relates to students’ somewhat furtive use of Wikipedia, which she calls the Learning Black Market (students look up something in Google, find sources in Wikipedia, and copy and paste the citation into their paper!). Others use Facebook to get help. Some interesting demographic differences — more established researchers use Twitter, and use of Wikipedia declines as researchers get more experience. With regard to the library, engagement around new issues (like data management) causes researchers to think anew about ways the library might be useful. Although researchers of all stripes will reach out to humans for help, librarians rank low on that list. Given all of these challenges, there are opportunities for librarians and library services — be engaging and be where researchers are, both physically and virtually. We should always assess what we are doing — keep doing what’s working, cut or reinvent what is not. Lynn’s presentation provides plenty of links and references for you to check out.

Paul-Jervis Heath (Head of Innovation & Chief Designer, University of Cambridge) [link to video] spoke from the perspective of a designer, not a librarian (he has worked on smart homes, for example). He shared findings from recent work with the Cambridge University libraries. Because of disruption, libraries face a perfect storm of change in teaching, funding, and scholarly communications. User expectations are formed by consumer technology. While we look for teachable moments, Google and tech companies do not — they try to create intuitive experiences. Despite all the changes, libraries don’t need to sit on the sidelines; they can be engaged players. Design research is important and distinguished from market research in that it doesn’t measure how people think but how they act. From observation studies, we can see that students want to study together in groups, even if they are doing their own thing. The library needs to be optimized for that. Another technique employed was asking students to keep diaries documenting their days. Many students prefer the convenience of studying in their room, but what propels them to the library is the desire to be with others in order to focus. At Cambridge, students have a unique geographic triangle defined by where they live, the department where they go to class, and the market they prefer to shop in. Perceptions about how far something (like the library) is outside of the triangle are relative. Depending on how far apart your triangle points are, life can be easy or hard. Students are not necessarily up on technology, so don’t make assumptions. It turns out that books (the regular, paper kind) are great for studying! But students use ebooks to augment their paper texts, or will use them when all paper copies are gone. Shadowing (with permission) is another technique which allows you to immerse yourself in a researcher’s life and understand their mental models. Academics wear a lot of different hats, play different roles within the university and are too pressed for time to learn new systems. It’s up to the library to create efficiencies and make life easier for researchers. Paul closed by emphasizing six strategic themes: transition from physical to digital; library spaces; sustainable classic library services; supporting research and scholarly communications; making special collections more available; and creating touchpoints that will bring people back to the library seamlessly.

Jim Michalko (Vice President, OCLC Research Library Partnership) [link to video] talked about his recent work looking at library organizational structures and restructuring. (Jim will be blogging about this work soon, so I won’t give more than a few highlights.) For years, libraries have been making choices about what to do and how to do it, and libraries have been reorganizing themselves to get this (new) work done. Jim gathered feedback from 65 institutions in the OCLC Research Library Partnership and conducted interviews with a subset of those, in order to find out if structure indeed follows strategy. Do new structures represent markets or adjacent strategies (in business speak)? We see libraries developing capacities in customer relationship management, and we see this reflected in user-focused activities. Almost all institutions interviewed were undertaking restructuring based on changes external to the library, such as new constituencies and expectations. Organizations are orienting themselves to be more user centered, and to align themselves with a new direction taken by the university. We see many libraries bringing in skill sets beyond those normally found in the library package. Many institutions charged a senior position with helping to run a portion of a regional or national service. Other similarities: all had a lot of communication about restructuring. Almost all restructurings also related to a space plan.

This session was followed by a discussion, which I invite you to watch, along with this lovely summary of our meeting delivered by colleague Titia van der Werf (less than 7 minutes long and worth watching!):

If you attended the meeting or were part of the remote viewing audience for all or part of it, or if you watched any of the videos, I hope you will leave some comments with your reactions. Thanks for reading!

All the News That’s Fit to Archive / Library of Congress: The Signal

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The Library has had a web archiving program since the early 2000s. As with other national libraries, the Library of Congress web archiving program started out harvesting the web sites of its national election campaigns, followed by collections that harvested sites for a period of time connected with events (for example, an Iraq War web archive and a papal transition 2005 web archive), along with collecting the sites of the U.S. House and Senate and the legislative branch of government more broadly.

An American of the 1930s getting his news by reading a newspaper. These days he’d likely be looking at a computer screen. Photo courtesy of the Library of Congress Prints and Photographs division.

The question for the Library of Congress of “what else” to harvest beyond these collections is harder to answer than one might think because of the relatively small web archiving capacity of the Library of Congress (which is influenced by our permissions approach) compared to the sheer immensity of the Internet.  About six years ago we started a collection now known as the Public Policy Topics, for which we would acquire sites with content reflecting different viewpoints and research on a broad selection of public policy questions, including the sites of national political parties, selected advocacy organizations and think tanks, and other organizations with a national voice in America’s policy discussions that could be of interest to future researchers.  We are adding more sites to Public Policy Topics continuously.

Eventually I decided to include some news web sites that contained significant discussion of policy issues from particular points of view – sites ranging from DailyKos.com to Townhall.com, from TruthDig.com to Redstate.com.  We started crawling these sites on a weekly basis to try to assure complete capture over time and to build a representation of how the site looked as different news events came and went in the public consciousness (and on these web sites).  We have been able to assess the small number of such sites that we have crawled and have decided that the results are acceptable.  But this was obviously not a very large-scale effort compared to the increasing number of sites presenting general news on the Internet; for many people, these sites are their current equivalent of a newspaper.

Newspapers – they are a critical source for historical research and the Library of Congress has a long history of collecting and providing access to U.S. (and other countries’) newspapers.  Having started to collect a small number of “newspaper-like” U.S. news sites for the Public Policy Topics collection, I began a conversation with three reference librarian colleagues from the Newspaper & Current Periodical Reading Room – Amber Paranick, Roslyn Pachoca and Gary Johnson – about expanding this effort to a new collection, a “General News on the Internet” web archive.  They explained to me:

Our newspaper collections are invaluable to researchers.  Newspapers provide a first-hand draft of history.  They provide supplemental information that cannot be found anywhere else.  They ‘fill in the gaps,’ so to speak. The way people access news has been changing and evolving ever since newspapers were first being published. We recognized the need to capture news published in another format.  It is reasonable to expect us to continue to connect these kinds of resources to our current and future patrons. Websites tend to be ephemeral and may disappear completely.  Without a designated archive, critical news content may be lost.

In short, my colleagues shared my interest, concern and enthusiasm for starting a larger collection of Internet-only general news sites as a web archiving collection.  I’ll let them explain their thinking further:

When we first got started on the project, we weren’t sure how to proceed.  Once we established clear boundaries on what to include, what types of news sites would be within scope for this collection, our selection process became easier. We asked for help in finding websites from our colleagues. 

We felt it was important to include sites that focus on general news with significant national presence where there are articles that have an author’s voice, such as with HuffingtonPost.com or BuzzFeed.com (even as some of these sites also contain articles that are meant to attract visitors, so-called “click bait”).  We wanted to include a variety of sites that represent more cutting edge ways of presenting general news, such as Vox.com and TheVerge, and we felt sites that focus on parody such as TheOnion.com were also important to have represented.  Of course, these sites are not the only sources from which people obtain their news, but we tried to choose a variety that included more trendy or popular sources as well as the conventional or traditional types.  Again, the idea is to assure future users have access to a significant representation of how Americans accessed news at this time using the Internet.

The Library of Congress has an internal process for proposing new web archiving collections.  I worked with Amber, Roslyn and Gary and they submitted a “General News on the Internet” project proposal and it was approved.  Yay!  Then the work began – Amber, Roslyn and Gary describe some of the hurdles:

We understand that archiving video content is a problem. We thought websites like NowThisNews.com could be great candidates but in effect, because they contained so much video and a kind of Tumblr-like portal entry point for news, we had to reject them.  Since we do not do “one hop out” crawling, the linked-to content that is the substantive content (i.e., the news) would be entirely missed.   Also, websites like Vice.com change their content so frequently, it might be impossible to capture all of its content.

In addition, it was decided that sites chosen would not include general news sites associated primarily with other delivery vehicles, such as CNN.com or NYTimes.com.  Many of these types also have paywalls and therefore obviously would create limitations when trying to archive.

We also encountered another type of challenge with Drudgereport.com.  Since it is primarily a news-aggregator with most of the site consisting of links to news on other sites it would be tough to include the many links with the limitations in crawling (again, the “one hop” limitation – we don’t harvest links that are on a different URL).  In the end we decided to proceed in archiving The Drudge Report site since it is well known for the content that is original to that site.
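
The “one hop” scoping described above boils down to a predicate applied to every link the crawler discovers. What follows is a minimal, hypothetical Python sketch of that idea, not the Library’s actual crawler configuration, and the example URLs are placeholders: a link is harvested only if it stays on the seed’s host.

# scope.py - hypothetical sketch of seed-host-only crawl scoping
from urllib.parse import urlparse

def in_scope(seed_url, candidate_url):
    # harvest a discovered link only if it stays on the seed's host
    return urlparse(candidate_url).netloc == urlparse(seed_url).netloc

seed = 'http://www.drudgereport.com/'
for link in ('http://www.drudgereport.com/archives.htm',      # original content: harvested
             'http://www.example-newspaper.com/story.html'):  # off-site link: skipped
    print(link, '->', 'harvest' if in_scope(seed, link) else 'skip')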

The harvesting for this collection has now been underway for several months; we are examining the results.  We look forward to making an archived version of today’s news as brought to you by the Internet available to Library of Congress patrons for many tomorrows.

What news sites do you think we should collect?

Talk "Costs: Why Do We Care?" / David Rosenthal

Investing in Opportunity: Policy Practice and Planning for a Sustainable Digital Future sponsored by the 4C project and the Digital Preservation Coalition featured a keynote talk each day. The first, by Fran Berman, is here.

Mine was the second, entitled Costs: Why Do We Care? It was an update and revision of The Half-Empty Archive, stressing the importance of collecting, curating and analyzing cost data. Below the fold, an edited text with links to the sources.

Introduction

I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries, and I have two reasons for being especially happy to be here today. First, I'm a Londoner. Second, under the auspices of JISC the UK has been a very active participant in the LOCKSS program since 2006. As with all my talks, you don't need to take notes or ask for the slides. The text of the talk, with links to the sources, will go up on my blog shortly.

Why do I think I am qualified to stand here and pontificate about preservation costs? The LOCKSS Program develops, and supports users of, the LOCKSS digital preservation technology. This is a peer-to-peer system designed to let libraries collect and preserve copyright content published on the Web, such as e-journals and e-books. LOCKSS users participate in a number of networks customized for these and other forms of content including government documents, social science datasets, library special collections, and so on. One of these networks, the CLOCKSS archive, a community-managed dark archive of e-journals and e-books, was recently certified to the Trusted Repository Audit Criteria, equalling the previous highest score and gaining the first-ever perfect score for technology. The LOCKSS software is free open source; the LOCKSS team charges for support and services. On that basis, with no grant funding, for more than 7 years we have covered our costs and accumulated some reserves.

Because understanding and controlling our costs is very important for us, and because the LOCKSS system's Lots Of Copies approach trades more disk space for less of other resources (especially lawyers), I have been researching the costs of storage for some years.

Like all of you, the LOCKSS team has to plan and justify our budget each year. It is clear that economic failure is one of the most significant threats to the content we preserve, as it is even for the content national libraries preserve. For each of us individually the answer to "Costs: Why Do We Care?" is obvious. But I want to talk about why the work we are discussing over these two days, of collecting, curating, normalizing, analyzing and disseminating cost information about digital curation and preservation, is important not just at an individual level but for the big picture of preservation. What follows is in three sections:
  • The current situation.
  • Cost trends.
  • What can be done?

The Current Situation

How well are we doing at the task of preservation? Attempts have been made to measure the probability that content is preserved in some areas; e-journals, e-theses and the surface Web:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Luis Faria and co-authors (PDF) compare information extracted from journal publisher's web sites with the Keepers Registry and conclude:
    We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
  • The Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL.
    Their results are somewhat difficult to interpret, but for their two more random samples they report:
    URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
So, are we preserving half the stuff that should be preserved? Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.

An Optimistic Assessment

First, the assessment isn't risk-adjusted:
  • As regards the scholarly literature librarians, who are concerned with post-cancellation access not with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation, so what have those resources achieved for future readers?
  • As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
Second, the assessment isn't adjusted for difficulty:
  • A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
  • The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, where negotiating once with a small publisher gains a small amount of high-risk content.
  • Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking:
  • As regards scholarly communication it looks only at the traditional forms, books and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly both at much higher risk of loss than the traditional forms that are being preserved, because they lack well-established and robust business models, and much more difficult to preserve, since the legal framework is unclear and the content is either much larger, or much more dynamic, or in some cases both.
  • As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.
Fourth, the assessment is likely to suffer measurement bias:
  • The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
  • As regards Web content, Ainsworth et al describe various forms of bias in their paper.
As Cliff Lynch pointed out in his summing-up of the 2014 IDCC conference, the scholarly literature and the surface Web are genres of content for which the denominator of the fraction being preserved (the total amount of genre content) is fairly well known, even if it is difficult to measure the numerator (the amount being preserved). For many other important genres, even the denominator is becoming hard to estimate as the Web enables a variety of distribution channels:
  • Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
  • YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
  • Much music now happens on YouTube (e.g. Pomplamoose).
  • Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories much still resides on grad students' laptops.
Of course, "what we should be preserving" is a judgement call, but clearly even purists who wish to preserve only stuff to which future scholars will undoubtedly require access would be hard pressed to claim that half that stuff is preserved.

Preserving the Rest

Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?
  • We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be that we couldn't afford to preserve it.
  • We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels.
  • We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.
Such a radical re-think requires understanding where the costs go in our current preservation methodology, and how they can be funded. As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the research into past costs is that ingest takes half the lifetime cost, preservation takes one third, and access takes one sixth.

On this basis, one would think that the most important thing to do would be to reduce the cost of ingest. It is important, but not as important as you might think. The reason is that ingest is a one-time, up-front cost. As such, it is relatively easy to fund. In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways that ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, for "ever". Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will have a much bigger effect on sustainability than equivalent reductions in ingest costs.

Cost Trends

We've been able to ignore this problem for a long time, for two reasons. From at least 1980 to 2010 storage costs followed Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.

Can we continue to ignore these problems?

Preservation

Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.

This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.

These are big numbers, but do they matter? After all, preservation is only about one-third of the total, and only about one-third of that is media costs.

Our models of the economics of long-term storage compute the endowment, the amount of money that, deposited with the data and invested at interest, would fund its preservation "for ever". This graph, from my initial rather crude prototype model, is based on hardware cost data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years, against the Kryder rate (the annual percentage drop in $/GB). The different curves represent policies of keeping the drives for 1,2,3,4,5 years. Up to 2010, we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everyone believed that long-term storage was effectively free. But suppose the Kryder rate were to drop below about 20%/yr. We would be in the steep part of the graph, where the endowment needed is both much higher and also strongly dependent on the exact Kryder rate.
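
To make the shape of this curve concrete, here is a drastically simplified, deterministic sketch of the endowment idea in Python. It ignores the stochastic parts of the real model (the 95% confidence level, drive replacement policies, failures) and simply computes the present value of a storage bill that shrinks at a given Kryder rate; the 3% real interest rate and the Kryder rates in the loop are illustrative assumptions, not the model's inputs.

# endowment.py - a drastically simplified, deterministic endowment sketch
# (illustrative assumptions only, not the prototype model described above)

def endowment(first_year_cost, kryder_rate, interest_rate, years=100):
    # present value of 'years' of storage costs that drop by kryder_rate annually
    return sum(first_year_cost * (1.0 - kryder_rate) ** year /
               (1.0 + interest_rate) ** year for year in range(years))

for kryder in (0.40, 0.30, 0.20, 0.10, 0.05):
    print('Kryder rate %2.0f%%/yr -> endowment %4.1f x first-year cost'
          % (kryder * 100, endowment(1.0, kryder, 0.03)))

Even this toy version shows the behaviour described above: once the Kryder rate drops toward 10-20%/yr, the required endowment both grows and becomes much more sensitive to the exact rate.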

We don't need to suppose. Preeti's graph and industry projections show that now and for the foreseeable future we are in the steep part of the graph. What happened to slow Kryder's Law? There are a lot of factors, we outlined many of them in a paper for UNESCO's Memory of the World conference (PDF). Briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive market into one with much better margins. Each successive technology generation requires a much bigger investment in manufacturing, so requires bigger margins, so drives consolidation. And the technology needs to stay in the market longer to earn back the investment, reducing the rate of technological progress.

Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken in to account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.

You may think that the cloud is a competitive market; in fact it is dominated by Amazon.
Jillian Mirandi, senior analyst at Technology Business Research Group (TBRI), estimated that AWS will generate about $4.7 billion in revenue this year, while comparable estimated IaaS revenue for Microsoft and Google will be $156 million and $66 million, respectively.
When Google recently started to get serious about competing, they pointed out that while Amazon's margins may have been minimal at introduction, by then they were extortionate:
cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."
Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.

In fact commercial cloud storage is a trap. It is free to put data in to a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years. If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.

Unrealistic expectations that we can collect and store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at great risk of economic failure. Here are three numbers that illustrate the looming crisis in long-term storage, its cost:
  • Industry projections (such as IHS iSuppli's) have storage media prices dropping no more than 20%/yr.
  • IT budgets (per computereconomics.com) are growing at about 2%/yr.
  • The amount of data to be stored (per IDC) is growing at about 60%/yr.
Here's a graph that projects these three numbers out for the next 10 years. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at computereconomics.com's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget.
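
Here is a minimal sketch of that projection under one reading of the assumptions: the amount of data added each year grows 60%/yr and everything added is retained, the per-byte cost of storage falls 20%/yr, and the IT budget grows 2%/yr. It is an illustration of the reasoning, not the source of the graph.

# projection.py - a minimal sketch of the 10-year projection described above
# assumptions (one reading): new data grows 60%/yr and is all retained,
# per-byte storage cost falls 20%/yr, the IT budget grows 2%/yr
data_stored, cost_per_byte, budget = 1.0, 1.0, 1.0
year0_cost = data_stored * cost_per_byte
for year in range(1, 11):
    data_stored += 1.6 ** year    # this year's new data, added to everything kept so far
    cost_per_byte *= 0.8          # Kryder rate of 20%/yr
    budget *= 1.02                # IT budget growth of 2%/yr
cost_ratio = data_stored * cost_per_byte / year0_cost
print('year-10 storage bill is about %.0f times the year-0 bill' % cost_ratio)
print('if storage was 5%% of the budget, it is now about %.0f%% of it'
      % (100 * 0.05 * cost_ratio / budget))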

Dissemination

The storage part of preservation isn't the only on-going cost that will be much higher than people expect, access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this can be a real difficulty; the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.

But the advent of "Big Data" techniques mean that, going forward, scholars increasingly want not to access a few individual items in a collection, but to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library could afford to do with the feed is to write two copies to tape. There's no way they could afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.

Ingest

Almost everyone agrees that ingest is the big cost element. Where does the money go? The two main cost drivers appear to be the real world, and metadata.

In the real world it is natural that the cost per unit content increases through time, for two reasons. The content that's easy to ingest gets ingested first, so over time the difficulty of ingestion increases. And digital technology evolves rapidly, mostly by adding complexity. For example, the early Web was a collection of linked static documents. Its language was HTML. It was reasonably easy to collect and preserve. The language of today's Web is Javascript, and much of the content you see is dynamic. This is much harder to ingest. In order to find the links much of the collected content now needs to be executed as well as simply being parsed. This is already significantly increasing the cost of Web harvesting, both because executing the content is computationally much more expensive, and because elaborate defenses are required to protect the crawler against the possibility that the content might be malign.

It is worth noting, however, that the very first US web site in 1991 featured dynamic content, a front-end to a database!

The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.

Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

Metadata in the real world is widely known to be of poor quality, both format and bibliographic kinds. Efforts to improve the quality are expensive, because they are mostly manual and, inevitably, reducing entropy after it has been generated is a lot more expensive than not generating it in the first place.

What can be done?

We are preserving less than half of the content that needs preservation. The cost per unit content of each stage of our current processes is predicted to rise. Our budgets are not predicted to rise enough to cover the increased cost, let alone more than doubling to preserve the other more than half. We need to change our processes to greatly reduce the cost per unit content.

Preservation

It is often assumed that, because it is possible to store and copy data perfectly, only perfect data preservation is acceptable. There are two problems with this expectation.

To illustrate the first problem, let's examine the technical problem of storing data in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
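
The arithmetic behind that claim is short enough to check. Modelling each bit as decaying independently with half-life T, a 50% chance that all N bits survive t years requires (2^(-t/T))^N = 1/2, which gives T = N × t:

# petabyte.py - back-of-the-envelope check of the "Petabyte for a Century" half-life
bits = 8e15                  # one Petabyte, in bits
years = 100.0                # the required survival period
age_of_universe = 1.38e10    # years, approximately
half_life = bits * years     # T = N * t, from (2**(-t/T))**N == 0.5
print('required bit half-life: %.1e years' % half_life)
print('which is about %.0f million times the age of the universe'
      % (half_life / age_of_universe / 1e6))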

Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10⁻¹¹. If the average object is 10K bytes, the bit half-life is about a million years, way too short to meet the requirement but still really hard to measure.

Note that the 10⁻¹¹ is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.

Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure
To illustrate the second problem, consider that building systems to defend against all these threats combined is expensive, and can't ever be perfectly effective. So we have to resign ourselves to the fact that stuff will get lost. This has always been true, it should not be a surprise. And it is subject to the law of diminishing returns. Coming back to the economics, how much should we spend reducing the probability of loss?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
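
One simple model reproduces the decade figure: the reliable system adds one unit of content per year, the cheaper system adds two units per year, and the cheaper system loses 1% of everything it holds (including the current year's additions) each year. The sketch below is one plausible reading of that comparison, not necessarily the exact model behind the numbers above.

# lossy.py - sketch of the reliable-vs-cheap comparison over a decade
# (one plausible reading of the model, not necessarily the exact one)
reliable, cheap = 0.0, 0.0
for year in range(10):
    reliable += 1.0                # lossless system: 1 unit of content per year
    cheap = (cheap + 2.0) * 0.99   # cheap system: 2 units per year, then 1% loss
print('after a decade the cheap system holds %.2f times as much' % (cheap / reliable))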

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 150th most visited site, whereas loc.gov is the 1519th. For UK users archive.org is currently the 131st most visited site, whereas bl.uk is the 2744th.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.

Unrealistic expectations for how well data can be preserved make the best be the enemy of the good. We spend money reducing even further the small probability of even the smallest loss of data that could instead preserve vast amounts of additional data, albeit with a slightly higher risk of loss.

Within the next decade all current popular storage media, disk, tape and flash, will be up against very hard technological barriers. A disruption of the storage market is inevitable. We should work to ensure that the needs of long-term data storage will influence the result. We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity, and custom hardware based on mostly spun-down disks and DVDs to achieve major cost savings for cold data at scale. 

Every few months there is another press release announcing that some new, quasi-immortal medium such as fused silica glass or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than magnetic or solid state storage. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the archive problem. But access will always be slow enough that the data would have to be really cold before being committed to DNA.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:

  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze, points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)

Dissemination

The real problem here is that scholars are used to having free access to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data with no charge, charging scholars accessing the data for the computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requests would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost, the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud, they aren't locked-in.
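
The cost-recovery arithmetic is simple; here is a quick check using the estimates quoted above (which are themselves approximate):

# s3_access.py - quick check of the cost-recovery arithmetic above
monthly_storage = 4100.0   # estimated S3 RRS cost for the collection, $/month
researchers = 400          # the initial access requests
annual = monthly_storage * 12
print('annual storage cost: about $%.0fK' % (annual / 1000.0))
print('charged back, each researcher pays about $%.0f per year' % (annual / researchers))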

In the near term, separating the access and preservation copies in this way is a promising way not so much to reduce the cost of access, but to fund it more realistically by transferring it from the archive to the user. In the longer term, architectural changes to preservation systems that closely integrate limited amounts of computation into the storage fabric have the potential for significant cost reductions to both preservation and dissemination. There are encouraging early signs that the storage industry is moving in that direction.

Ingest

There are two parts to the ingest process, the content and the metadata.

The evolution of the Web that poses problems for preservation also poses problems for search engines such as Google. Where they used to parse the HTML of a page into its Document Object Model (DOM) in order to find the links to follow and the text to index, they now have to construct the CSS object model (CSSOM), including executing the Javascript, and combine the DOM and CSSOM into the render tree to find the words in context. Preservation crawlers such as Heritrix used to construct the DOM to find the links, and then preserve the HTML. Now they also have to construct the CSSOM and execute the Javascript. It might be worth investigating whether preserving a representation of the render tree rather than the HTML, CSS, Javascript, and all the other components of the page as separate files would reduce costs.
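
As a concrete illustration of the extra work, the sketch below uses a headless browser (via Selenium) instead of a plain HTML parser, so the page's Javascript is executed before the links and rendered text are extracted. This is a generic sketch of the technique, not how Heritrix or any particular preservation crawler is implemented, and the URL is a placeholder.

# render_capture.py - sketch: execute Javascript before extracting links and text
# (a generic illustration of the technique, not a preservation crawler)
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')         # run the browser without a display
driver = webdriver.Chrome(options=options)

driver.get('http://example.com/')          # scripts execute as part of page load
links = [a.get_attribute('href') for a in driver.find_elements(By.TAG_NAME, 'a')]
body_text = driver.find_element(By.TAG_NAME, 'body').text   # words in rendered context
print(len(links), 'links found after rendering')
driver.quit()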

It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In these cases where we simply can't ingest it, preserving it in place may be the best we can do; creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.

Metadata is commonly regarded as essential for preservation. For example, there are 52 criteria for ISO 16363 Section 4. Of these, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
  • Manually creating metadata is impractical at scale.
  • Extracting metadata from the content scales better, but it is still expensive.
  • In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
We need less metadata so we can have more data. Two questions need to be asked:
  • When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
  • How important is the metadata to the task of preservation? Generating metadata because it is possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it is a waste of resources. If the reason to validate the formats of incoming content using error-prone tools is to reject allegedly non-conforming content, it is counter-productive. The majority of content in formats such as HTML and PDF fails validation but renders legibly.
The LOCKSS and CLOCKSS systems take a very parsimonious approach to format metadata. Nevertheless, the requirements of ISO 16363 forced us to expend resources implementing and using FITS, whose output does not in fact contribute to our preservation strategy, and whose binaries are so large that we have to maintain two separate versions of the LOCKSS daemon, one with FITS for internal use and one without for actual preservation. Further, the demands we face for bibliographic metadata mean that metadata extraction is a major part of ingest costs for both systems. These demands come from requirements for:
  • Access via bibliographic (as opposed to full-text) search, For example, OpenURL resolution.
  • Meta-preservation services such as the Keepers Registry.
  • Competitive marketing.
Bibliographic search, preservation tracking and bragging about exactly how many articles and books your system preserves are all important, but whether they justify the considerable cost involved is open to question. Because they are cleaning up after the milk has been spilt, digital preservation systems are poorly placed to improve metadata quality.

Resources should be devoted to avoiding spilling milk rather than to cleanup. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of publications, it is an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on.

The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint, and stop wasting resources on it. One of the reasons full-text search dominates bibliographic search is that it handles the messiness of the real world better.

Conclusion

Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions:
  • Will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process current projections of cost per unit content are higher than they were a few years ago. Projections for future preservation budgets are at best no higher. So clearly the answer is no.
  • If not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content. I have set out some ideas, others will have different ideas. But the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes.
Unfortunately, any way of making preservation cheaper can be spun as "doing worse preservation". Jeff Rothenberg's Future Perfect 2012 keynote is an excellent example of this spin in action. Even if we make large cost reductions, institutions have to decide to use them, and "no-one ever got fired for choosing IBM".

We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS underwent to obtain this check-off item was well in excess of 10% of its annual budget.

Making the tradeoff of preserving more stuff using "worse preservation" would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.

To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data.

Stump The Chump: D.C. Winners / SearchHub

Last week was another great Stump the Chump session at Lucene/Solr Revolution in DC. Today, I’m happy to announce the winners:

  • First Prize: Jeff Wartes ($100 Amazon gift certificate)
  • Second Prize: Fudong Li ($50 Amazon gift certificate)
  • Third Prize: Venkata Marrapu ($25 Amazon gift certificate)

Keep an eye on the Lucidworks YouTube page to watch the video as soon as it is available and see the winning questions.

I want to thank everyone who participated — either by sending in your questions, or by being there in person to heckle me. But I would especially like to thank the judges, and our moderator Cassandra Targett, who had to do all the hard work preparing the questions.

See you next year!

The post Stump The Chump: D.C. Winners appeared first on Lucidworks.

Lucidworks Fusion v1.1 Now Available / SearchHub

Hot on the heels of the v1 release of Lucidworks Fusion, we’re back with a whole new set of features and improvements to help you design, build, and deploy search apps with lightning speed. Here’s what’s new in Fusion v1.1:

Windows Support: We know some of you were a little miffed that Fusion didn’t support Windows out of the gate. With the release of v1.1, Fusion now supports Windows 7, Windows 8.1, Windows 2008 Server, and Windows 2012 Server.

Enhanced Signal Processing Framework: Fusion’s signal processing framework has added several improvements to allow more complex interactions between signal types, giving your users higher relevancy and insights, including co-occurrence aggregations, extensive new math options, and alternative integration options.

UI Updates: A new streamlined interface lets you edit and configure schemas right in the browser – without using the command line or editing a config file. This lets non-technical users access the power and flexibility of Fusion.

Quick Start: Getting started with Fusion is easier than ever with our new Quick Start, which walks you through creating your first collection, indexing data, and getting your first app up and running.

Relevancy Workbench: Our new relevancy workbench provides a more intuitive interface to make it easier than ever to increase relevancy and fine-tune results – even for non-technical users.

Connector Bonanza: Fusion v1.0 shipped with over 25 connectors so you can index your data no matter where it lives. Fusion v1.1 now ships with connectors for Sharepoint 2010 and 2013, Subversion 1.8 and greater, Google Drive, Couchbase, and Jive.

Grab it now! Lucidworks Fusion v1.1 is now available for download.

The post Lucidworks Fusion v1.1 Now Available appeared first on Lucidworks.

New Uses for Old Advertising / DPLA

3 Feeds One Cent, International Stock Food Company, Minneapolis, Minnesota, ca.1905. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

Digitization efforts in the US have, to date, been overwhelmingly dominated by academic libraries, but public libraries are increasingly finding a niche by looking to their local collections as sources for original content. The Hennepin County Library has partnered with the Minnesota Digital Library (MDL)—and now the Digital Public Library of America—to bring thousands of items to the digital realm from its extensive holdings in the James K. Hosmer Special Collections Department. These items include maps, atlases, programs, annual reports, photographs, diaries, advertisements, and trade catalogs.

Our partnership with MDL has not only provided far greater access to these hidden parts of our collections, it has also made patrons much more aware of the significance of our collections and the large number of materials that we could be digitizing. The link to DPLA has further increased our awareness of the potential reach of our collections: DPLA is already the second largest source of referrals to our digital content on MDL. All this has motivated us to increase our digitization activities and place greater emphasis on the role of digital content in our services.

Recently, we have been contributing hundreds of items related to local businesses in the form of large advertising posters, trade catalogs, and over 300 business trade cards from Minneapolis companies. These vividly illustrated materials provide a fascinating view of advertising techniques, local businesses, consumer and industrial goods, social mores and popular culture from the late 19th and early 20th centuries.

Hennepin County Library is committed to serving as Hennepin County’s partner in lifelong learning with programs for babies to seniors, new immigrants, small business owners and students of all ages. It comprises 41 libraries, and has holdings of more than five million books, CDs, and DVDs in 40 world languages. It manages around 1,750 public computers, has 11 library board members, and is one great system serving 1.1 million residents of Hennepin County.

Featured image credit: Detail of 1893 Minneapolis Industrial Exposition Catalog, Minneapolis, Minnesota. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.


All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

My second Python script, dispersion.py / Eric Lease Morgan

This is my second Python script, dispersion.py, and it illustrates where common words appear in a text.

#!/usr/bin/env python2

# dispersion.py - illustrate where common words appear in a text
#
# usage: ./dispersion.py <file>

# Eric Lease Morgan <emorgan@nd.edu>
# November 19, 2014 - my second real python script; "Thanks for the idioms, Don!"


# configure
MAXIMUM = 25
POS     = 'NN'

# require
import nltk
import operator
import sys

# sanity check
if len( sys.argv ) != 2 :
  print "Usage:", sys.argv[ 0 ], "<file>"
  quit()
  
# get input
file = sys.argv[ 1 ]

# initialize
with open( file, 'r' ) as handle : text = handle.read()
sentences = nltk.sent_tokenize( text )
pos       = {}

# process each sentence
for sentence in sentences : 
  
  # POS the sentence and then process each of the resulting words
  for word in nltk.pos_tag( nltk.word_tokenize( sentence ) ) :
    
    # check for configured POS, and increment the dictionary accordingly
    if word[ 1 ] == POS : pos[ word[ 0 ] ] = pos.get( word[ 0 ], 0 ) + 1

# sort the dictionary
pos = sorted( pos.items(), key = operator.itemgetter( 1 ), reverse = True )

# do the work; create a dispersion chart of the MAXIMUM most frequent pos words
text = nltk.Text( nltk.word_tokenize( text ) )
text.dispersion_plot( [ p[ 0 ] for p in pos[ : MAXIMUM ] ] )

# done
quit()

I used the program to analyze two works: 1) Thoreau’s Walden, and 2) Emerson’s Representative Men. From the dispersion plots displayed below, we can conclude a few things:

  • The words “man”, “life”, “day”, and “world” are common between both works.
  • Thoreau discusses water, ponds, shores, and surfaces together.
  • While Emerson seemingly discussed man and nature in the same breath, none of his core concepts are discussed as densely as Thoreau’s.
Thoreau's Walden

Thoreau’s Walden

Emerson's Representative Men

Emerson’s Representative Men

Python’s Natural Language Toolkit (NLTK) is a good library for digital humanists to get started with. I have to learn more though. My jury is still out regarding which is better, Perl or Python. So far, they have more things in common than differences.

Jobs in Information Technology: November 19 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Assistant Director of Digital Strategies, Houston Public Library, Houston, TX

Director of Library Technology, Central Michigan University, Mount Pleasant, MI

Information Technology Manager, Library System of Lancaster County, Lancaster, PA

Information Technology Technical Associate: User Interface Designer, Milner Library, Illinois State University, Normal, IL

IT Operations Specialist, Gwinnett County Public Library, Lawrenceville, GA

Library Creative Learning Spaces Coordinator, Multnomah County Library, Portland, OR

Web Manager, UC San Diego Library, San Diego, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Link roundup November 19, 2014 / Harvard Library Innovation Lab

Yipes cripes we’ve got our winter coats on today. Sit down with a hot beverage and enjoy these internet finds.

The FES Watch Is an E-Ink Chameleon – Design Milk

The Ingenuity and Beauty of Creative Parchment Repair in Medieval Books | Colossal

A Brief History of Failure

Lost At The Museum? This Ingenious 3-D Map Makes Navigation A Cinch

Genre, gender and agency analysis using Parts of Speech in Watson Content Analytics. A simple demonstration. / John Miedema

Genre is often applied as a static classification: fiction, non-fiction, mystery, romance, biography, and so on. But the edges of genre are “blurry” (Underwood). The classification of genre can change over time and situation. Ideally, genre and all classifications could be modeled dynamically during content analysis. How can IBM’s Watson Content Analytics (WCA) help analyze genre? Here is a simple demonstration.

In WCA I created a collection of 1368 public domain novels from Open Library. For this demonstration, I obtained author metadata and expressed it as a WCA facet. I did not obtain existing genre metadata. I will demonstrate that I can use author gender to dynamically classify genre for a specific analytical question. In particular, I follow the research of Matthew Jockers and the Nebraska Literary Lab. Can genre be distinguished by the gender of the author? How is action and agency treated differently in male and female genres? This simple demonstration does not answer these questions, but shows how WCA can be used to give insight into literature.
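
WCA does this through facets and its built-in parts-of-speech annotators. For readers without WCA, roughly the same question can be asked with open-source tools; the sketch below uses NLTK to tally the most frequent verbs in two groups of plain-text novels, one per author gender. It is a stand-in for the facet-plus-POS workflow demonstrated below, not how WCA itself works, and the file names are hypothetical placeholders.

# verbs_by_gender.py - an NLTK stand-in for the WCA facet + parts-of-speech workflow
# (the file names below are hypothetical placeholders)
from collections import Counter
import nltk

male_texts   = ['scott_waverley.txt', 'stevenson_kidnapped.txt']
female_texts = ['austen_emma.txt', 'edgeworth_belinda.txt']

def frequent_verbs(paths, top=10):
    # tally every verb form (POS tags beginning with VB) across the given files
    counts = Counter()
    for path in paths:
        with open(path) as handle:
            text = handle.read()
        for sentence in nltk.sent_tokenize(text):
            for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
                if tag.startswith('VB'):
                    counts[word.lower()] += 1
    return counts.most_common(top)

print('male-authored:  ', frequent_verbs(male_texts))
print('female-authored:', frequent_verbs(female_texts))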

In Figure 1, the WCA Author facet is used to filter the collection to ten male authors: Walter Scott, Robert Louis Stevenson, and others. The idea is to dynamically generate a male genre by the selection of male authors. (Simple, but note that a complex array of facets could be used to quickly define a male genre.)

[Figure 1]

In Figure 2, the WCA Parts-of-Speech analysis lists frequently used verbs in the collection subset, the male genre: tempt, condemn, struggle. Some values might be considered action verbs, but further analysis is required.

genre gender 2

 

In Figure 3, the verb “struggle” is seen in the context of its source, the Waverley novels: “the Bohemian struggled to detain Quentin”, “to struggle with the sea”. This view can be used to determine the gender of characters, the actions they are performing, and to interpret agency.

genre gender 3

 

In Figure 4, a new search is performed, this time filtering for female authors: Jane Austen, Maria Edgeworth, Susan Ferrier, and others. In this case, the idea is to dynamically generate a female genre by selecting female authors.

genre gender 4

 

In Figure 5, the WCA Parts-of-Speech analysis lists frequently used verbs in the female genre: mix, soothe, furnish. At a glance, there is an obvious difference in quality from the verbs in the male genre.

genre gender 5

Finally in Figure 6, the verb “furnish” is seen in the context of its source in Jane Austen’s Letters, “Catherine and Lydia … a walk to Meryton was necessary to amuse their morning hours and furnish conversation.” In this case, furnish does not refer to the literal furnishing of a house, but to the facilitation of dialog. As before, detailed content inspection is needed to analyze and interpret agency.

genre gender 6
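
WCA is a commercial product, so as a rough open-source parallel to the verb-frequency step shown above (and emphatically not a description of how WCA works internally), here is a sketch using Python’s NLTK; the file names and the two-texts-per-gender grouping are placeholder assumptions.

    from collections import Counter

    import nltk  # assumes the tokenizer and POS-tagger models are already downloaded

    def frequent_verbs(texts, top=20):
        # tag every token and count only verbs (Penn Treebank tags beginning with 'VB')
        counts = Counter()
        for text in texts:
            tagged = nltk.pos_tag(nltk.word_tokenize(text))
            counts.update(word.lower() for word, tag in tagged if tag.startswith('VB'))
        return counts.most_common(top)

    # placeholder file names standing in for the male- and female-authored subsets
    male_texts = [open(p).read() for p in ('scott_waverley.txt', 'stevenson_kidnapped.txt')]
    female_texts = [open(p).read() for p in ('austen_pride.txt', 'edgeworth_belinda.txt')]

    print('male genre:', frequent_verbs(male_texts))
    print('female genre:', frequent_verbs(female_texts))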

Libraries & Research: Supporting change in the university / HangingTogether

[This is the third in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first and second posts and also refer to the event webpage that contains links to slides, videos, photos, and a Storify summary.]

[Driek Heesakkers, Paolo Manghi, Micah Altman, Paul Wouters, and John Scally]

As if changes in research are not enough, changes are also coming at the university level and at the national level. The new imperatives of higher education around Open Access, Open Data and Research Assessment are impacting the roles of libraries in managing and providing access to e-research outputs, in helping define the university’s data management policies, and demonstrating value in terms of research impact. This session explored these issues and more!

John MacColl (University Librarian at University of St Andrews) [link to video] opened the session, speaking briefly about the UK context to illustrate how libraries are taking up new roles within academia. John presented this terse analysis of the landscape (and I thank him for providing notes!):

  • Professionally, we live increasingly in an inside-out environment. But our academic colleagues still require certification and fixity, and their reputation is based on a necessarily conservative world view (tied up with traditional modes of publishing and tenure).
  • Business models are in transition. The first phase of transition was from publisher print to publisher digital. We are now in a phase which he terms deconstructive, based on a reassessment of the values of scholarly publishing, driven by the high cost of journals.
  • There are several reasons for this: among the main ones are the high costs of publisher content, and our responsibility as librarians for the sustainability of the scholarly record; another is the emergence of public accountability arguments – the public has paid for this scholarship, they have the right to access outputs.
  • What these three new areas of research library activity have in common is the intervention of research funders into the administration of research within universities, although the specifics vary considerably in different nations.

John Scally (Director of Library and University Collections, University of Edinburgh) [link to video] added to the conversation, speaking about the role of the research library in research data management (RDM) at the University of Edinburgh. From John’s perspective, the library is a natural place for RDM work to happen because the library has been in the business of managing and curating stuff for a long time and services are at the core of the library. Naturally, making content available in different ways is a core responsibility of the library. Starting research data conversations around policy and regulatory compliance is difficult — it’s easier to frame as a problem around storage, discovery and reuse of data. At Edinburgh they tried to frame discussions around how can we help, how can you be more competitive, do better research? If a researcher comes to the web page about data management plans (say at midnight, the night before a grant proposal is due) that webpage should do something useful at the time of need, not direct researchers to come to the library during the day. Key takeaways: Blend RDM into core services, not a side business. Make sure everyone knows who is leading. Make sure the money is there, and you know who is responsible. Institutional policy is a baby step along the way, implementation is most important. RDM and open access are ways of testing (and stressing) your systems and procedures – don’t ignore fissures and gaps. An interesting correlation between RDM and the open access repository – since RDM has been implemented at Edinburgh, deposits of papers have increased.

Driek Heesakkers (Project Manager at the University of Amsterdam Library) [link to video] told us about RDM at the University of Amsterdam and in the Netherlands. The Netherlands differs from other landscapes, characterized as “bland” – not a lot of differences between institutions in terms of research outputs. A rather complicated array of institutions for humanities, social science, health science, etc, all trying to define their roles in RDM. For organizations who are mandated to capture data, it’s vital that they not just show up at the end of the process to scoop up data, but that they be embedded in the environment where the work is happening, where tools are being used. Policy and infrastructure need to be rolled out together. Don’t reinvent the wheel – if there are commercial partners or cloud services that do the work well, that’s all for the good. What’s the role of the library? We are not in the lead with policy but we help to interpret and implement — similarly with technology. The big opportunity is in the support – if you have faculty liaisons, you should be using them for data support. Storage is boring but necessary. The market for commercial solutions is developing, which is good news – he’d prefer to buy, not build, when appropriate. This is a time for action — we can’t be wary or cautious.

Switching gears away from RDM, Paul Wouters (Director of the Centre for Science and Technology Studies at the University of Leiden) [link to video] spoke about the role of libraries in research assessment. His organization combines fundamental research and services for institutions and individual researchers. With research becoming increasingly international and interdisciplinary, it’s vital that we develop methods of monitoring novel indicators. Some researchers have become, ironically and paradoxically, fond of assessment (may be tied up with the move towards the quantified self?). However, self-assessment can be nerve-wracking and may not return useful information. Managers are also interested in individual assessment because it may help them give feedback. Altmetrics do not correlate closely to citation metrics, and can vary considerably across disciplines. It’s important to think about the meaning of various ways of measuring impact. As an example of other ways of measuring, Paul presented the ACUMEN (Academic Careers Understood through Measurement and Norms) project, which allows researchers to take the lead and tell a story given evidence from their portfolios. An ACUMEN profile includes a career narrative supported by expertise, outputs, and influence. Giving a stronger voice to researchers is more positive than researchers not being involved in or misunderstanding (and resenting) indicators.

Micah Altman (Director of Research, Massachusetts Institute of Technology Libraries) [link to video] discussed the importance of researcher identification and the need to uniquely identify researchers in order to manage the scholarly record and to support assessment. Micah spoke in part as a member of a group that OCLC Research colleague Karen Smith-Yoshimura led, the Registering Researchers Task Group (their report, Registering Researchers in Authority Files, is now available). It explored motivations, state of the practice, observations and recommendations. The problem is that there is more stuff, more digital content, and more people (the average number of authors on journal articles has gone up, in some cases way up). To put it mildly, disambiguating names is not a small problem. A researcher may have one or more identifiers, which may not link to one another and may come from different sources. The task group looked at the problem not only from the perspective of the library, but also from the perspective of various stakeholders (publishers, universities, researchers, etc.). Approaches to managing name identifiers result in some very complicated (and not terribly efficient) workflows. Normalizing and regularizing this data has big potential payoffs in terms of reducing errors in analytics, and creating a broad range of new (and more accurate) measures. Fortunately, with a recognition of the benefits, interoperability between identifier systems is increasing, as is the practice of assigning identifiers to researchers. One of the missing pieces is not only identifying researchers but also their roles in a given piece of work (this is a project that Micah is working on with other collaborators). What are steps that libraries can take? Prepare to engage! Work across stakeholder communities; demand more than PDFs from publishers. And prepare for more (and different) types of measurement.

Paolo Manghi (Researcher at Institute of Information Science and Technologies “A. Faedo” (ISTI), Italian National Research Council) [link to video] talked about the data infrastructures that support access to the evolving scholarly record and the requirements needed for different data sources (repositories, CRIS systems, data archives, software archives, etc.) to interoperate. Paolo spoke as a researcher, but also as the technical manager of the EU funded OpenAIRE project. This project started in 2009 out of a strong open access push from the European Commission. The project initially collected metadata and information about access to research outputs. The scope was expanded to include not only articles but also other research outputs. The work is done by human input and also technical infrastructure. They rely on input from repositories, also use software developed elsewhere. Information is funneled via 32 national open access desks. They have developed numerous guidelines (for metadata, for data repositories, and for CRIS managers to export data to be compatible with OpenAIRE). The project fills three roles — a help desk for national agencies, a portal (linking publications to research data and information about researchers) and a repository for data and articles that are otherwise homeless (Zenodo). Collecting all this information into one place allows for some advanced processes like deduplication, identifying relationships, demonstrating productivity, compliance, and geographic distribution. OpenAIRE interacts with other repository networks, such as SHARE (US), and ANDS (Australia). The forthcoming Horizon 2020 framework will cause some significant challenges for researchers and service providers because it puts a larger emphasis on access for non-published outputs.

The session was followed by a panel discussion.

I’ll conclude tomorrow with a final posting, wrapping up this series.

Cataloging Board Games / LITA

Since September, I have been immersed in the world of games and learning.  I co-wrote a successful grant application to create a library-based Center for Games and Learning.

IMLS

The project is being  funded through a Sparks Ignition! Grant from the Institute of Museum and Library Services.

One of our first challenges has been to decide how to catalog the games.  I located this presentation on SlideShare.  We have decided to catalog the games as Three Dimensional Objects (Artifact) and use the following MARC fields:

  • MARC 245  Title Statement
  • MARC 260  Publication, Distribution, Etc.
  • MARC 300  Physical Description
  • MARC 500  General Note
  • MARC 508  Creation/Production Credits
  • MARC 520  Summary, Etc.
  • MARC 521  Target Audience
  • MARC 650  Topical Term
  • MARC 655  Index Term—Genre/Form

There are many other fields that we could use, but we decided to keep it as simple as possible. We decided not to interfile the games and instead create a separate collection for the Center for Games and Learning. Due to this, we will not be assigning a Library of Congress Classification to them, but will instead be shelving the games in alphabetical order. We also created a material type of “board games.”
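
For illustration, here is roughly what such a stripped-down record could look like when built with the pymarc Python library; the game details are invented for the example, and the flat subfield syntax assumes a pre-5.0 pymarc release.

    from pymarc import Record, Field  # flat subfield lists assume pymarc < 5.0

    record = Record()
    record.add_field(
        # 245: title statement (values are invented for illustration)
        Field(tag='245', indicators=['0', '0'],
              subfields=['a', 'Ticket to ride :', 'b', 'a cross-country train adventure.']),
        # 260: publication, distribution, etc.
        Field(tag='260', indicators=[' ', ' '],
              subfields=['a', '[United States] :', 'b', 'Days of Wonder,', 'c', '2004.']),
        # 300: physical description of the game components
        Field(tag='300', indicators=[' ', ' '],
              subfields=['a', '1 game board, 240 plastic train cars, 144 cards, 5 scoring markers.']),
        # 521: target audience
        Field(tag='521', indicators=[' ', ' '], subfields=['a', 'Ages 8 and up.']),
        # 655: genre/form term
        Field(tag='655', indicators=[' ', '7'],
              subfields=['a', 'Board games.', '2', 'lcgft']),
    )
    print(record)  # prints the record in pymarc's text (mnemonic) form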

For the Center for Games and Learning we are also working on a website that will be live in the next few months.  The project is still in its infancy and I will be sharing more about this project in upcoming blog posts.

Do any LITA blog readers have board games in your libraries? If so, what MARC fields do you use to catalog the games?

SB IT Preservation at ApacheCon Europe 2014 in Budapest / State Library of Denmark

Ok, actually only two of us are here. It would be great to have the whole department at the conference; then we could cover more tracks and start discussing what we will be using next week ;-)


The first keynote was mostly an introduction to The Apache Software Foundation along with some key numbers. The second keynote (in direct extension of the first) was an interview with best-selling author Hugh Howey, who self-published ‘Wool’ in 2011. A very inspiring interview! Maybe I could be an author too – with a little help from you? One of the things he talked about was how he thinks

“… the future looks more and more like the past”

in the sense that storytelling in the past was collaborative storytelling around the camp fire. Today open source software projects are collaborative, and maybe authors should try it too? Hugh Howey’s book has grown with help from fans and fan fiction.

The coffee breaks and lunches have been great! And the cake has been plentiful!

Cake

Time to celebrate the Apache Software Foundation’s 15th birthday!

More cake!

Did someone say that Hungary was known for cakes?

And yes, there have also been lots and lots of interesting presentations of lots and lots of interesting Apache tools. Where to start? There is one that I want to start using on Monday: Apache Tez. The presentation was by Hitesh Shah from Hortonworks, and the slides are available online.

There are quite a few that I want to look into a bit more and experiment with, such as Spark and Cascading, and I think my colleague can add a few more. There are some that we will tell our colleagues at home about, and hope that they have time to experiment… And now I’ll go and hear about Quadrupling your Elephants!

Note: most of the slides are online. Just look at http://events.linuxfoundation.org/events/apachecon-europe/program/slides.


The Public Domain Review brings out its first book / Open Knowledge Foundation

Open Knowledge project The Public Domain Review is very proud to announce the launch of its very first book! Released through the newly born spin-off project the PDR Press, the book is a selection of weird and wonderful essays from the project’s first three years, and shall be (we hope) the first of an annual series showcasing in print form essays from the year gone by. Given that there’s three years to catch up on, the inaugural incarnation is a special bumper edition, coming in at a healthy 346 pages, and jam-packed with 146 illustrations, more than half of which are newly sourced especially for the book.

Spread across six themed chapters – Animals, Bodies, Words, Worlds, Encounters and Networks – there is a total of thirty-four essays from a stellar line up of contributors, including Jack Zipes, Frank Delaney, Colin Dickey, George Prochnik, Noga Arikha, and Julian Barnes.

What’s inside? Volcanoes, coffee, talking trees, pigs on trial, painted smiles, lost Edens, the social life of geometry, a cat called Jeoffry, lepidopterous spying, monkey-eating poets, imaginary museums, a woman pregnant with rabbits, an invented language drowning in umlauts, a disgruntled Proust, frustrated Flaubert… and much much more.

Order by 26th November to benefit from a special reduced price and delivery in time for Christmas.

If you want to get the book in time for Christmas (and we do think it is a fine addition to any Christmas list!), then please make sure to order before midnight (PST) on 26th November. Orders placed before this date will also benefit from a special reduced price!

Please visit the dedicated page on The Public Domain Review site to learn more and also buy the book!

PERICLES Extraction Tool / FOSS4Lib Updated Packages

Last updated November 18, 2014. Created by Peter Murray on November 18, 2014.

The PERICLES Extraction Tool (PET) is an open source (Apache 2 licensed) Java software for the extraction of significant information from the environment where digital objects are created and modified. This information supports object use and reuse, e.g. for a better long-term preservation of data. The Tool was developed entirely for the PERICLES EU project http://www.pericles-project.eu/ by Fabio Corubolo, University of Liverpool, and Anna Eggers, Göttingen State and University Library.


Quick Links and Search Frequency / Library Tech Talk (U of Michigan)

Does adding links to popular databases change user searching behavior? An October 2013 change to the University of Michigan Library’s front page gave us the opportunity to conduct an empirical study and shows that user behavior has changed since the new front page design was launched.

REGISTER: ADVANCED DSPACE TRAINING RESCHEDULED / DuraSpace News

Winchester, MA  We are happy to announce the re-scheduled dates for the in-person, 3-day Advanced DSpace Course in Austin March 17-19, 2015. The total cost of the course is being underwritten with generous support from the Texas Digital Library and DuraSpace. As a result, the registration fee for the course for DuraSpace Members is only $250 and $500 for Non-Members (meals and lodging not included). Seating will be limited to 20 participants.
 

Yaffle: Memorial University’s VIVO-Based Solution to Support Knowledge Mobilization in Newfoundland and Labrador / DuraSpace News

One particular VIVO project that demonstrates the spirit of open access principles is Yaffle. Many VIVO implementations provide value to their host institutions, ranging from front-end access to authoritative organizational information to highlights of works created in the social sciences and arts and humanities. Yaffle extends beyond its host institution and provides a cohesive link between Memorial University and citizens from Newfoundland and Labrador. The prospects for launching Yaffle in other parts of Canada will be realized in the near future. 

On Forgetting / Ed Summers

After writing about the Ferguson Twitter archive a few months ago, three people have emailed me out of the blue asking for access to the data. One was a principal at a small, scaryish defense contracting company, and the other two were from a prestigious university. I’ve also had a handful of people interested where I work at the University of Maryland.

I ignored the defense contractor. Maybe that was mean, but I don’t want to be part of that. I’m sure they can go buy the data if they really need it. My response to the external academic researchers wasn’t much more helpful since I mostly pointed them to Twitter’s Terms of Service which says:

If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.

You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.

Any Content provided to third parties via non-automated file download remains subject to this Policy.

It’s my understanding that I can share the data with others at the University of Maryland, but I am not able to give it to the external parties. What I can do is give them the Tweet IDs. But there are 13,480,000 of them.

So that’s what I’m doing today: publishing the tweet ids. You can download them from the Internet Archive:

https://archive.org/details/ferguson-tweet-ids

I’m making it available using the CC-BY license.

Hydration

On the one hand, it seems unfair that this portion of the public record is unshareable in its most information rich form. The barrier to entry to using the data seems set artificially high in order to protect Twitter’s business interests. These messages were posted to the public Web, where I was able to collect them. Why are we prevented from re-publishing them since they are already on the Web? Why can’t we have lots of copies to keep stuff safe? More on this in a moment.

Twitter limits users to 180 requests every 15 minutes. A user is effectively a unique access token. Each request can hydrate up to 100 Tweet IDs using the statuses/lookup REST API call.

180 requests * 100 tweets = 18,000 tweets/15 min 
                          = 72,000 tweets/hour

So hydrating all 13,480,000 tweets will take about 7.8 days. This is a bit of a pain, but realistically it’s not so bad. I’m sure people doing research have plenty of work to do before running any kind of analysis on the full data set. And they can use a portion of it for testing as it is downloading. But how do you download it?

Gnip, who were recently acquired by Twitter, offer a rehydration API. Their API is limited to tweets from the last 30 days, and similar to Twitter’s API you can fetch up to 100 tweets at a time. Unlike the Twitter API you can issue a request every second. So this means you could download the results in about 1.5 days. But these Ferguson tweets are more than 30 days old. And a Gnip account costs some indeterminate amount of money, starting at $500…

I suspect there are other hydration services out there. But I adapted twarc, the tool I used to collect the data, which already handled rate-limiting, to also do hydration. Once you have the tweet IDs in a file, you just need to install twarc and run it. Here’s how you would do that on an Ubuntu instance:

    
    # install pip and twarc, then hydrate the tweet ids into full tweet JSON
    sudo apt-get install python-pip
    sudo pip install twarc
    twarc.py --hydrate ids.txt > tweets.json
    

After a week or so, you’ll have the full JSON for each of the tweets.
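
If you are curious what a hydration tool has to do under the hood, here is a rough sketch of the batching and rate-limit arithmetic described above; lookup() is a placeholder for whatever Twitter client call you use, not a real function.

    import time

    BATCH_SIZE = 100           # statuses/lookup accepts up to 100 ids per request
    REQUESTS_PER_WINDOW = 180  # per-user limit on lookup requests
    WINDOW_SECONDS = 15 * 60   # the 15 minute rate-limit window

    def hydrate(ids, lookup):
        """Yield full tweets for a list of ids; lookup() stands in for a real API call."""
        requests_made = 0
        window_start = time.time()
        for start in range(0, len(ids), BATCH_SIZE):
            if requests_made == REQUESTS_PER_WINDOW:
                # sleep until the current 15 minute window resets, then start a new one
                time.sleep(max(0, WINDOW_SECONDS - (time.time() - window_start)))
                requests_made, window_start = 0, time.time()
            for tweet in lookup(ids[start:start + BATCH_SIZE]):
                yield tweet
            requests_made += 1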

Archive Fever

Well, not really. You will have most of them. But you won’t have the ones that have been deleted. If a user decided to remove a Tweet they made, or decided to remove their account entirely, you won’t be able to get their Tweets back from Twitter using their API. I think it’s interesting to consider Twitter’s Terms of Service as what Katie Shilton would call a value lever.

The metadata rich JSON data (which often includes geolocation and other behavioral data) wasn’t exactly posted to the Web in the typical way. It was made available through a Web API designed to be used directly by automated agents, not people. Sure, a tweet appears on the Web but it’s in with the other half a trillion Tweets out on the Web, all the way back to the first one. Requiring researchers to go back to the Twitter API to get this data and not allowing it circulate freely in bulk means that users have an opportunity to remove their content. Sure it has already been collected by other people, and it’s pretty unlikely that the NSA are deleting their tweets. But in a way Twitter is taking an ethical position for their publishers to be able to remove their data. To exercise their right to be forgotten. Removing a teensy bit of informational toxic waste.

As any archivist will tell you, forgetting is an essential and unavoidable part of the archive. Forgetting is the why of an archive. Negotiating what is to be remembered and by whom is the principal concern of the archive. Ironically it seems it’s the people who deserve it the least, those in positions of power, who are often most able to exercise their right to be forgotten. Maybe putting a value lever back in the hands of the people isn’t such a bad thing. If I were Twitter I’d highlight this in the API documentation. I think we are still learning how the contours of the Web fit into the archive. I know I am.

If you are interested in learning more about value levers you can download a pre-print of Shilton’s Value Levers: Building Ethics into Design.

Social networking for researchers: ResearchGate and their ilk / Dan Scott

The Centre for Research in Occupational Safety and Health asked me to give a lunch'n'learn presentation on ResearchGate today, which was a challenge I was happy to take on... but I took the liberty of stretching the scope of the discussion to focus on social networking in the context of research and academics in general, recognizing four high-level goals:

  1. Promotion (increasing citations, finding work positions)
  2. Finding potential collaborators
  3. Getting advice from experts in your field
  4. Accessing others' work

I'm a librarian, so naturally my take veered quickly into the waters of copyright concerns and the burden (to the point of indemnification) that ResearchGate, Academia.edu, Mendeley, and other such services put on their users to ensure that they are in compliance with copyright and the researchers' agreements with publishers... all while heartily encouraging their users to upload their work with a single click. I also dove into the darker waters of r/scholar, LibGen, and SciHub, pointing out the direct consequences that our university has suffered due to the abuse of institutional accounts at the library proxy.

Happily, the audience opened up the subject of publishing in open access journals--not just from a "covering our own butts" perspective, but also from the position of the ethical responsibility to share knowledge as broadly as possible. We briefly discussed the open access mandates that some granting agencies have put in place, particularly in the States, as well as similar Canadian initiatives that have occurred or are still emerging with respect to public funds (SSHRC and the Tri-Council). And I was overjoyed to hear a suggestion that, perhaps, research funded by the Laurentian University Research Fund should be required to publish in an open access venue.

I'm hoping to take this message back to our library and, building on Kurt de Belder's vision of the library as a Partner in Knowledge, help drive our library's mission towards assisting researchers in not only accessing knowledge, but also most effectively sharing and promoting the knowledge they create.

That leaves lots of work to do, based on one little presentation :-)

Classes in RDF / Karen Coyle

RDF allows one to define class relationships for things and concepts. The RDFS1.1 primer describes classes succinctly as:
Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class.
This seems simple, but it is in fact one of the primary areas of confusion about RDF.

If you are not a programmer, you probably think of classes in terms of taxonomies -- genus, species, sub-species, etc. If you are a librarian you might think of classes in terms of classification, like Library of Congress or the Dewey Decimal System. In these, the class defines certain characteristics of the members of the class. Thus, with two classes, Pets and Veterinary science, you can have:
Pets
- dogs
- cats

Veterinary science
- dogs
- cats
In each of those, dogs and cats have different meaning because the class provides a context: either as pets, or information about them as treated in veterinary science.

For those familiar with XML, it has similar functionality because it makes use of nesting of data elements. In XML you can create something like this:
<drink>
    <lemonade>
        <price>$2.50</price>
        <amount>20</amount>
    </lemonade>
    <pop>
        <price>$1.50</price>
        <amount>10</amount>
    </pop>
</drink>
and it is clear which price goes with which type of drink, and that the bits directly under the <drink> level are all drinks, because that's what <drink> tells you.

Now you have to forget all of this in order to understand RDF, because RDF classes do not work like this at all. In RDF, the "classness" is not expressed hierarchically, with a class defining the elements that are subordinate to it. Instead it works in the opposite way: the descriptive elements in RDF (called "properties") are the ones that define the class of the thing being described. Properties carry the class information through a characteristic called the "domain" of the property. The domain of the property is a class, and when you use that property to describe something, you are saying that the "something" is an instance of that class. It's like building the taxonomy from the bottom up.

This only makes sense through examples. Here are a few:
1. "has child" is of domain "Parent".

If I say "X - has child - 'Fred'" then I have also said that X is a Parent because every thing that has a child is a Parent.

2. "has Worktitle" is of domain "Work"

If I say "Y - has Worktitle - 'Der Zauberberg'" then I have also said that Y is a Work because every thing that has a Worktitle is a Work.

In essence, X or Y is an identifier for something that is of unknown characteristics until it is described. What you say about X or Y is what defines it, and the classes put it in context. This may seem odd, but if you think of it in terms of descriptive metadata, your metadata describes the "thing in hand"; the "thing in hand" doesn't describe your metadata. 

Like in real life, any "thing" can have more than one context and therefore more than one class. X, the Parent, can also be an Employee (in the context of her work), a Driver (to the Department of Motor Vehicles), a Patient (to her doctor's office). The same identified entity can be an instance of any number of classes.
"has child" has domain "Parent"
"has licence" has domain "Driver"
"has doctor" has domain "Patient"

X - has child - "Fred"  = X is a Parent 
X - has license - "234566"  = X is a Driver
X - has doctor - URI:765876 = X is a Patient
Classes are defined in your RDF vocabulary, as are the domains of properties. The above statements require an application to look at the definition of the property in the vocabulary to determine whether it has a domain, and then to treat the subject, X, as an instance of the class described as the domain of the property. There is another way to provide the class as context in RDF - you can declare it explicitly in your instance data, rather than, or in addition to, having the class characteristics inherent in your descriptive properties when you create your metadata. The term used for this, based on the RDF standard, is "type," in that you are assigning a type to the "thing." For example, you could say:
X - is type - Parent
X - has child - "Fred"
This can be the same class as you would discern from the properties, or it could be an additional class. It is often used to simplify the programming needs of those working in RDF because it means the program does not have to query the vocabulary to determine the class of X. You see this, for example, in BIBFRAME data. The second line in this example gives two classes for this entity:
<http://bibframe.org/resources/FkP1398705387/8929207instance22>
a bf:Instance, bf:Monograph .
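
As a small sketch of these two routes to a class (my own illustration, not from BIBFRAME), here is the Parent example in Python's rdflib; note that rdflib does not apply RDFS domain inference by itself, so the sketch performs that step by hand.

    from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

    EX = Namespace('http://example.org/')  # a made-up vocabulary namespace
    g = Graph()

    # vocabulary: the property ex:hasChild has the class ex:Parent as its domain
    g.add((EX.hasChild, RDFS.domain, EX.Parent))

    # instance data: X has a child, and is also explicitly typed as a Driver
    x = URIRef('http://example.org/X')
    g.add((x, EX.hasChild, Literal('Fred')))
    g.add((x, RDF.type, EX.Driver))

    # apply the rdfs:domain rule by hand: using a property implies the domain class
    for prop, domain in list(g.subject_objects(RDFS.domain)):
        for subject in list(g.subjects(prop, None)):
            g.add((subject, RDF.type, domain))

    print(sorted(g.objects(x, RDF.type)))  # X is now both ex:Driver and ex:Parent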

One thing that classes do not do, however, is to prevent your "thing" from being assigned the "wrong class." You can, however, define your vocabulary to make "wrong classes" apparent. To do this you define certain classes as disjoint, for example a class of "dead" would logically be disjoint from a class of "alive." Disjoint means that the same thing cannot be of both classes, either through the direct declaration of "type" or through the assignment of properties. Let's do an example:
"residence" has domain "Alive"
"cemetery plot location" has domain "Dead"
"Alive" is disjoint "Dead" (you can't be both alive and dead)

X - is type - "Alive"                                         (X is of class "Alive")
X - cemetery plot location - URI:9494747      (X is of class "Dead")
Nothing stops you from creating this contradiction, but some applications that try to use the data will be stumped because you've created something that, in RDF-speak, is logically inconsistent. What happens next is determined by how your application has been programmed to deal with such things. In some cases, the inconsistency will mean that you cannot fulfill the task the application was attempting. If you reach a decision point where "if Alive do A, if Dead do B" then your application may be stumped and unable to go on.

All of this is to be kept in mind for the next blog post, which talks about the effect of class definitions on bibliographic data in RDF.

Current Learning Opportunities with LITA / LITA

LITA has multiple learning opportunities available over the next several months.  Hot topics to keep your brain warm over the winter.

Re-Drawing the Map Series

Presenters: Mita Williams and Cecily Walker
Offered: November 18, 2014, December 9, 2014, and January 6, 2015
All: 1:00 pm – 2:00 pm Central Time

Top Technologies Every Librarian Needs to Know

Presenters: Brigitte Bell, Steven Bowers, Terry Cottrell, Elliot Polak and Ken Varnum
Offered: December 2, 2014
1:00 pm – 2:00 pm Central Time

Getting Started with GIS

Instructor: Eva Dodsworth, University of Waterloo
Offered: January 12 – February 9, 2015

For details and registration check out the fuller descriptions below and follow the links to their full web pages

Re-Drawing the Map Series

Join LITA Education and instructors Mita Williams and Cecily Walker in “Re-drawing the Map”–a webinar series! Pick and choose your favorite topic. Can’t make all the dates but still want the latest information? Registered participants will have access to the recorded webinars.

Here are the individual sessions.

 Web Mapping: moving from maps on the web to maps of the web
Tuesday Nov. 18, 2014
1:00 pm – 2:00 pm Central Time
Instructor: Mita Williams
(completed)

Get an introduction to web mapping tools and learn about the stories they can help you to tell!

OpenStreetMaps: Trust the map that anyone can change
Tuesday December 9, 2014,
1:00 pm – 2:00 pm Central Time
Instructor: Mita Williams

Ever had a map send you the wrong way and wished you could change it?  Learn how to add your local knowledge to the “Wikipedia of Maps.”

Coding maps with Leaflet.js
Tuesday January 6, 2015,
1:00 pm – 2:00 pm Central Time
Instructor: Cecily Walker

Ready to make your own maps and go beyond a directory of locations? Add photos and text to your maps with Cecily as you learn to use the Leaflet JavaScript library.

Register Online page arranged by session date (login required)

Top Technologies Every Librarian Needs to Know

We’re all awash in technological innovation. It can be a challenge to know what new tools are likely to have staying power — and what that might mean for libraries. The recently published Top Technologies Every Librarian Needs to Know highlights a selected set of technologies that are just starting to emerge and describes how libraries might adapt them in the next few years.

In this webinar, join the authors of three chapters as they talk about their technologies and what they mean for libraries.
December 2, 2014
1:00 pm – 2:00 pm Central Time

Hands-Free Augmented Reality: Impacting the Library Future
Presenters: Brigitte Bell & Terry Cottrell

The Future of Cloud-Based Library Systems
Presenters: Elliot Polak & Steven Bowers

Library Discovery: From Ponds to Streams
Presenter: Ken Varnum

Register Online page arranged by session date (login required)

Getting Started with GIS

Getting Started with GIS is a three-week course modeled on Eva Dodsworth’s LITA Guide of the same name. The course provides an introduction to GIS technology and GIS in libraries. Through hands-on exercises, discussions and recorded lectures, students will acquire skills in using GIS software programs, social mapping tools, map making, digitizing, and researching for geospatial data. This three-week course provides introductory GIS skills that will prove beneficial in any library or information resource position.

No previous mapping or GIS experience is necessary. Some of the mapping applications covered include:

  • Introduction to Cartography and Map Making
  • Online Maps
  • Google Earth
  • KML and GIS files
  • ArcGIS Online and Story Mapping
  • Brief introduction to desktop GIS software

Instructor: Eva Dodsworth, University of Waterloo
Offered: January 12 – February 9, 2015

Register Online page arranged by session date (login required)

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty, mbeatty@ala.org.

The heartbeat of budget transparency / Open Knowledge Foundation

budget_heartbeat_1

Every two years the International Budget Partnership (IBP) runs a survey, called the Open Budget Survey, to evaluate formal oversight of budgets, how transparent governments are about their budgets and whether there are opportunities to participate in the budget process. To easily measure and compare transparency among the countries surveyed, IBP created the Open Budget Index, where the participating countries are scored and ranked using about two-thirds of the questions from the Survey. The Open Budget Index has already established itself as an authoritative measurement of budget transparency, and is, for example, used as an eligibility criterion for the Open Government Partnership.

However, countries do not release budget information every two years; they should do so regularly, on multiple occasions in a given year. There is, however, as stated above, a two-year gap between the publication of consecutive Open Budget Survey results. This means that if citizens, civil society organisations (CSOs), media and others want to know how governments are performing in between Survey releases, they have to undertake extensive research themselves. It also means that if they want to pressure governments into releasing budget information and increase budget transparency before the next Open Budget Index, they can only point to ‘official’ data, which can be up to two years old.

To combat this, IBP, together with Open Knowledge, has developed the Open Budget Survey Tracker (the OBS Tracker), http://obstracker.org: an online, ongoing budget data monitoring tool, which is currently a pilot and covers 30 countries. The data are collected by researchers selected from IBP’s extensive network of partner organisations, who regularly monitor budget information releases and provide monthly reports. The information included in the OBS Tracker is not as comprehensive as the Survey, because the latter also looks at the content/comprehensiveness of budget information — not only the regularity of its publication. The OBS Tracker, however, does provide a good proxy of increasing or decreasing levels of budget transparency, measured by the release to (or withholding from) the public of key budget documents. This is valuable information for concerned citizens, CSOs and media.

With the Open Budget Survey Tracker, IBP has made it easier for citizens, civil society, media and others to monitor, in near real time (monthly), whether their central governments release information on how they plan to and how they spend the public’s money. The OBS Tracker allows them to highlight changes and facilitates civil society efforts to push for change when a key document has not been released at all, or not in a timely manner.

Niger and the Kyrgyz Republic have improved the release of essential budget information after the latest Open Budget Index results, something which can be seen from the OBS Tracker without having to wait for the next Open Budget Survey release. This puts pressure on other countries to follow suit.

budget_heartbeat_2

The budget cycle is a complex process which involves creating and publishing specific documents at specific points in time. IBP covers the whole cycle, by monitoring in total eight documents which include everything from the proposed and approved budgets, to a citizen-friendly budget representation, to end-of-the-year financial reporting and the auditing from a country’s Supreme Audit Institution.

In each of the countries included in the OBS Tracker, IBP monitors all these eight documents showing how governments are doing in generating these documents and releasing them on time. Each document for each country is assigned a traffic light color code: Red means the document was not produced at all or published too late. Yellow means the document was only produced for internal use and not released to the general public. Green means the document is publicly available and was made available on time. The color codes help users quickly skim the status of the world as well as the status of a country they’re interested in.

budget_heartbeat_3

To make monitoring even easier, the OBS Tracker also provides more detailed information about each document for each country, a link to the country’s budget library and more importantly the historical evolution of the “availability status” for each country. The historical visualisation shows a snapshot of the key documents’ status for that country for each month. This helps users see if the country has made any improvements on a month-by-month basis, but also if it has made any improvements since the last Open Budget Survey.

Is your country being tracked by the OBS Tracker? How is it doing? If they are not releasing essential budget documents or not even producing them, start raising questions. If your country is improving or has a lot of green dots, be sure to congratulate the government; show them that their work is appreciated, and provide recommendations on what else can be done to promote openness. Whether you are a government official, a CSO member, a journalist or just a concerned citizen, OBS Tracker is a tool that can help you help your government.

WMS Web Services Install November 23 / OCLC Dev Network

The new date for the November WMS Web services install is this Sunday, November 23rd. This install will include changes to two of our WMS APIs.

Losing and Finding Legal Links / Library of Congress: The Signal

Imagine you’re a legal scholar and you’re examining the U.S. Supreme Court decisions of the late nineties to mid-two thousands and you want to understand what resources were consulted to support official opinions. A study in the Yale Journal of Law and Technology indicates you would find that only half of the nearly 555 URL links cited in Supreme Court opinions since 1996 would still work. This problem has been widely discussed in the media and the Supreme Court has indicated it will print all websites cited and place the printouts in physical case files at the Supreme Court, available only in Washington, DC.

Old Georgetown Law School building, photo taken between 1910 - 1925

Georgetown Law School, Washington, DC. Negative, part of National Photo Company Collection, 1910-1925.   http://www.loc.gov/pictures/item/npc2008012259/.

On October 24, 2014 Georgetown University Law Library hosted a one-day symposium on this problem which has been studied across legal scholarship and other academic works. The meeting, titled 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent, presented a broad overview of why websites disappear, why this is particularly problematic in the legal citation context and the proposal of actual solutions and strategies to addressing the problem.

The keynote address was given by Jonathan Zittrain, George Bemis Professor of Law at Harvard Law School. A video of his presentation is now available from the meeting website. In it he details a service created by Harvard Law School Libraries and other law libraries called Perma.cc that allows those with an account to submit links that can be archived at a participating library. The use case for Perma.cc is to support links in new forms of academic and legal writing. Today, over 26,000 links have been archived.

Herbert Van de Sompel of the Los Alamos National Laboratory also demonstrated the Memento browser plug-in that allows users who’ve downloaded the plug-in to see archived versions of a website (if that website has been archived) while they are using the live web. The Internet Archive, The British Library, the UK National Archives and other archives around the world all provide archived versions of websites through Memento. The Memento protocol has been widely implemented, integrated in MediaWiki sites and supports “time travel” to old websites that cover all topics.
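
To get a feel for the protocol underneath the plug-in, here is a sketch of a raw Memento request using Python's requests library; the timetravel.mementoweb.org TimeGate URL is my assumption about a public aggregator endpoint, so treat the whole thing as illustrative rather than authoritative.

    import requests

    # ask a Memento TimeGate for the archived copy of a page closest to a given date
    target = 'http://www.supremecourt.gov/'
    timegate = 'http://timetravel.mementoweb.org/timegate/' + target  # assumed endpoint

    response = requests.get(
        timegate,
        headers={'Accept-Datetime': 'Mon, 24 Nov 2014 12:00:00 GMT'},  # RFC 7089 header
        allow_redirects=True,
    )
    print(response.url)                              # URL of the selected memento
    print(response.headers.get('Memento-Datetime'))  # when that copy was captured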

Both solutions, Perma.cc and Memento, depend on action by, and coordination of, organizations and individuals who are affected by the linkrot problem. At the end of his presentation Van de Sompel reiterated that technical solutions exist to deal with linkrot; what is still needed is broad participation in the selection, collection and archiving of web resources and a sustainable and interoperable infrastructure of tools and services, like Memento and Perma.cc, that connect the archived versions of websites with the scholars, researchers and users that want to access them today and into the future.

Michael Nelson of Old Dominion University, a partner in developing Memento, posted notes on the symposium presentations. For even more background and documentation on the problem of linkrot, the meeting organizers collected a list of readings. The symposium demonstrated the ability of a community, in this case, law librarians, to come together to address a problem in their domain, the results of which benefit the larger digital stewardship community and serve as models for coordinated action.

Two weeks left to submit your GIF IT UP entries! / DPLA

It’s been a little over a month since we launched GIF IT UP, an international competition to find the best GIFs reusing public domain and openly licensed digital video, images, text, and other material available via DPLA and DigitalNZ. Since then we’ve received dozens of wonderful submissions from all over the world, all viewable in the competition gallery.

The winners of GIF IT UP will have their work featured and celebrated online at the Public Domain Review and Smithsonian.com. Haven’t submitted an entry yet? Well, what are you waiting for? Submit a GIF!


About GIF IT UP


Cat Galloping (1887). The still images used in this GIF come from Eadweard Muybridge’s “Animal locomotion: an electro-photographic investigation of consecutive phases of animal movements” (1872-1885). Courtesy USC Digital Library, 2010. View original record (item is in the public domain). GIF available under a CC-BY license.

How it works. The GIF IT UP competition has seven categories:

  1. Animals
  2. Planes, trains, and other transport
  3. Nature and the environment
  4. Your hometown, state, or province
  5. WWI, 1914-1918
  6. GIF using a stereoscopic image
  7. Open category (any reusable material from DigitalNZ or DPLA)

A winner will be selected in each of these categories and, if necessary, a winner will be awarded in two fields: use of an animated still public domain image, and use of video material.

To view the competition’s official homepage, visit http://dp.la/info/gif-it-up/.

Judging. GIF IT UP will be co-judged by Adam Green, Editor of the Public Domain Review and by Brian Wolly, Digital Editor of Smithsonian.com. Entries will be judged on coherence with category theme (except for the open category), thoroughness of entry (correct link to source material and contextual information), creativity, and originality.

Gallery. All entries that meet the criteria outlined below in the Guidelines and Rules will be posted to the GIF IT UP Tumblr Gallery. The gallery entries with the most Tumblr “notes” will receive the people’s choice award and will appear online at the Public Domain Review and Smithsonian.com alongside the category winners.

Submit. To participate, please first take a moment to read “How it Works” and the guidelines and rules on the GIF IT UP homepage, and then submit your entry by clicking here.

Deadline. The competition deadline is December 1, 2014 at 5:00 PM EST / December 2, 2014 at 10:00 AM GMT+13.

GIFtastic Resources. You can find more information about GIF IT UP–including select DPLA and DigitalNZ collections available for re-use and a list of handy GIF-making tips and tools–over on the GIF IT UP homepage.

Questions. For questions or other inquiries, email us at info@digitalnz.org or info@dp.la, or tweet us @digitalnz or @dpla. Good luck and happy GIFing!