Another note to my future self about upgrading Ubuntu.
Ubuntu 14.04 was released yesterday. I have two laptops that run it. I did the unimportant one first, and everything went fine. Then I did the important one, the one where I do all my work, and after restarting it came up with a boot error:
error: symbol 'grub_term_highlight_color' not found
I had two reactions. First, boot errors are solvable. The boot stuff is on one part of my hard drive, and my real stuff is on another part, and it’s fine where it is, I just need to fix the boot stuff. Besides, I have backups. So with a bit of fiddling, I’ll be able to fix it. Second, cripes, what the hell? I’ve been using this laptop for six months or a year or more since a major upgrade, and now it’s telling me there’s some problem with how it boots up? That is a load of shite.
Searching turned up evidence other people had the same problem, and they were being blamed for having an improper boot sector or some such business. For a few minutes I felt like non-geeks feel when presented with problems like this: despair … annoyance … frustration … the first pangs of hate.
But such is life. When upgrading a system we must be prepared for possible problems. We cannot expect it to always go smoothly. Even in the face of such technical problems we must try to remain tranquil.
It’s solvable, I remembered. So I downloaded a Boot-Repair Disk image—this is a very useful tool, and it works even though it’s a year old—and put it on a USB drive with Startup Disk Creator, then booted up, ran sudo boot-repair, used all the default answers, let it do its work, and everything was all right. Phew.
Aside from that, everything about the upgrade went perfectly fine. This time I did it at the command line with sudo do-release-upgrade. It took a while to download all the upgraded packages, but the actual update went quickly and smoothly. My thanks to everyone involved with Debian, Ubuntu, GNU/Linux, and everything else.
(However, I’m glad I had another machine available where I could do the download and set up the boot disk. Without it, I would have been in trouble. I don’t know if a similar problem might have arisen when Windows or MacOS users do an upgrade.)
It seems to me I have done this once (or twice) before, but I feel like it is time to continue blogging on Loomware. My Loomware blog started in February of 2004, so I guess I can call this the 10th anniversary and just get on with it!
One of the drivers for me was the incredible interest in the Islandora digital asset management system, which had its genesis in 2007 just after I joined UPEI. In the last 7 years Islandora has seen adoption in countries all over the world, and for a wide range of functions. I will start the posts next week with a series on the coming version of Islandora - 7.x-1.3, which is our way of saying the 3rd release of Islandora for Drupal 7 and Fedora 3. This new series will describe all the awesome goodness in the upcoming release, solution pack by solution pack, module by module, and include some shoutouts to friends and colleagues who are giving their time and expertise to build a great open source ecosystem!
Larra Clark of ALA’s Office for Information Technology Policy speaks on panel.
The ongoing digital revolution continues to create new opportunities for education, entrepreneurship, job skills training and more. Those of us with home broadband, smartphones or both can easily take advantage of these opportunities. However, for millions of Americans currently living without personal access to high-capacity internet or who lack digital literacy skills, libraries serve as the on-ramp to the digital world. With a growing number of people turning to libraries to avail themselves of broadband-enabled technologies, library networks are being strained more than ever before. Yesterday, the Institute for Library and Museum Services (IMLS) held a public hearing to discuss the importance of high-speed connectivity in libraries and outline strategies for helping libraries expand bandwidth to accommodate growing network use.
Federal Communications Commission (FCC) Chairman Thomas Wheeler’s opening remarks set the tone for the day: “Andrew Carnegie built 2,500 libraries in a public-private partnership, defining information access for millions of people for more than a century,” he said. “We stand on the precipice of being able to have the same kind of seminal impact on the flow of information and ideas in the 21st century…That’s why reform of the E-rate program is so essential. The library has always been the on-ramp to the world of information and ideas, and now that on-ramp is at gigabit speeds.”
The hearing convened three expert panels, each of which discussed a different dimension of library connectivity. The first panel propounded strategies for helping libraries procure the resources they need to build network capacity. Chris Jowaisas of the Gates Foundation urged libraries to underscore the ways in which their activities advance the goals of top giving foundations. “[Libraries should]…package their services to meet foundation needs,” Jowaisas said. “With a robust and reliable broadband connection, libraries and communities can move into more areas of exploration and innovation. The foundation hopes the network of supporters of this vision grows because we have seen and learned first-hand from investments in public libraries that they are key organizations for growing opportunity.”
Following his remarks, Clarence Anthony of the National League of Cities stressed the need for the library community to ramp up its efforts to make government leaders aware of the extent to which urban communities rely on libraries for broadband access.
The second panel analyzed current library connectivity data and identified areas where the data falls short in assessing broadband capacity. Larra Clark of ALA’s Office for Information Technology Policy drew on 20 years of research to illustrate that the progress libraries have made in expanding bandwidth—while meaningful—has generally not proven sufficient to accommodate the growing needs of users. About 9 percent of public libraries reported speeds of 100 Mbps or greater in the 2012 Public Library Funding & Technology Access Study, and the forthcoming Digital Inclusion Survey shows this number has only climbed to 12 percent. More than 66 percent of public libraries report they would like to increase their broadband connectivity speeds. “Libraries aren’t standing still, but too many are falling behind,” Clark said.
Researcher John Horrigan also gave the audience a preview of forthcoming research looking at Americans’ levels of digital readiness, which finds significant variations in digital skills even among people who are highly connected to digital tools. Of the 80 percent of Americans with home broadband or a smartphone, nearly one-fifth (or 34 million adults) has a low level of digital skills. “(Libraries) are the vanguard in the forces we bring to bear to bolster digital readiness,” Horrigan noted. “Libraries will have more demands placed upon them, which makes the case for ensuring they have the resources to meet these demands compelling.”
The final panel built on the capacity-building strategies offered by Jowaisas and Anthony by providing real-world examples of successful efforts to expand library bandwidth. Gary Wasdin of the Omaha Public Library System discussed ways in which his libraries are leveraging federal dollars to engage private funders in efforts to build broadband capacity, and Eric Frederick of Connect Michigan described how public-private synergies are improving library connectivity in his state. The final panelist was Linda Lord, Maine state librarian and chair of ALA’s E-rate Task Force. Lord discussed ALA’s efforts to inform the FCC’s ongoing E-rate modernization proceeding; the E-rate program provides schools and libraries with telecommunications services at discounted rates. “ALA envisions that all libraries will be at a gig (1 Gbps) by 2018,” Lord said, and she went on to clearly articulate ALA’s commitment to updating the program to help libraries address 21st century challenges.
I was recently selected by the Code4Lib community to receive a diversity scholarship to attend the Code4Lib conference in Raleigh, North Carolina. The Code4Lib conference was the perfect place to make new connections with people who aim to make information more accessible through technology. As someone who works closely with technology and usability, I was interested in the new strides taking place in this area. At this conference, I made new contacts for future collaboration and attended talks on topics ranging from Linked Open Data to Google Analytics.
Coming to my first Code4Lib was significant because when I first began connecting with the group and its resources, I was a freshly-minted graduate in the middle of a career change. By the time I landed in Raleigh, three months into a new job, I was an information professional--more or less.
After graduating last May from library school, I admit to using the Code4Lib website obsessively during my quest for employment; I quickly found the site, wiki, listserv and journal invaluable. There was a level of energy and involvement by users that made it stand out from other, more conventional professional organizations. Plus, the job postings often described exactly the kinds of emerging, interdisciplinary positions I was most interested in. Code4Lib was a network I wanted to be a part of. Miraculously, my search worked out: I was offered a position, though I had not yet started when I finally applied for the diversity scholarship.
As a recipient of the Diversity Scholarship for the 9th annual Code4Lib conference in Raleigh, North Carolina, I had an enlightening and incredible experience. I learned a great deal about library system usability, emerging coding frameworks, and applying social justice to user-centered design. Throughout the conference, I asked myself how I could use these concepts and coding techniques in my daily work at my institution. As a “one-man shop” I have limited support for implementing many of these technologies. However, having networked with the diverse members of the code4lib community, I know that it will be a bit easier to experiment with these techniques.
My time at the conference revealed that many libraries are passionately striving to make end-user systems usable, accessible, and transparent. There were numerous presentations that revolved around these ideas, such as using APIs to create data visualizations for displaying library statistics, real-time interactive discovery systems and interfaces, moving away from “list” type listings of holdings to network-node maps, web accessibility for differently abled patrons, and much more. The numerous lightning talks also provided a great wealth of information (all within 5 minutes!).
Jennifer Maiko Kishi
Code4Lib 2014 Conference Report
1 April 2014
As a new professional in the field, a lone digital archivist, and a first-timer at the Code4Lib conference, I found the experience incredibly inspiring and enriching. I value Code4Lib’s collective mission of teaching and learning through community, collaboration, and a free exchange of ideas. The conference was unique and unlike any other library or archives conference I have attended. I appreciate the thoughtfulness of planning events to specifically welcome new attendees. The newcomer dinner was not only a great way to meet fellow newbies (and oldtimers) on the evening before the conference, but also provided familiar faces to say hello to the following day. Moreover, Code4Lib resolved my session-selection anxieties, where I always feel like I’ve missed out on yet another important session: the conference is set up so that all attendees have an equal opportunity to view the sessions together in a continuous fashion, with live streams made available to those unable to attend. The conference was jam-packed with back-to-back presentations, lightning talks, and breakout sessions. There was a good balance of interesting topics by insightful speakers, mixed in with scheduled breaks with copious coffee and tea to stay alert and focused throughout the day.
Code4Lib 2014: Conference Review
J. (Jenny) Gubernick
I was fortunate to receive a diversity scholarship to help defray the costs of attending Code4Lib 2014 in Raleigh, NC. Although I am still processing the somewhat overwhelming amount of information I absorbed, I suspect that I will look back at this past week as a transformative experience. I pivoted from thinking of myself as “not a real programmer,” “lucky to have any job,” and that “maybe someday I can do something cool,” to thinking of myself as being in a position of great empowerment to learn and do, and being ready to apply my skills to more complex work. I look forward to continuing to be part of this community in months and years to come.
As a diversity scholarship recipient, I was afforded the opportunity to attend the 2014 Code4Lib conference in Raleigh, NC. The conference consisted of two and a half days of presentations and one day of preconference workshops. Looking back on the experience, I am impressed by the content of the presentations, the openness of the community, and the overall sense of curiosity and exploration. I learned a great deal and am looking forward to applying the inspiration and motivation that I took away from the conference in my daily work.
Prior to the start of the conference itself, I attended the “Archival Discovery and Use” pre-conference session. True to its name, Code4Lib has historically been more library-focused, but this session covered topics like the modern relevance of archival finding aids, archival crowdsourcing, and presentation methods for digitized materials. Because librarians and archivists have so many intertwined concerns, I was glad to see the archival community represented.
I had an enjoyable and educational time at Code4Lib 2014. It was my first time attending any Code4Lib event, and I am grateful to have had the opportunity to be there, thanks to the Diversity Scholarship sponsored by the Council on Library and Information Resources/Digital Library Federation, EBSCO, ProQuest, and Sumana Harihareswara. Thank you to the sponsors, the scholarship and organizing committees, and everyone else involved with the conference for this amazing learning experience!
In March 2014, I attended my first (and definitely not only) Code4Lib National Conference. I had been following the Code4Lib group via their website, journal, wiki and local NYC chapter for some time; but being a metadata/cataloging person, I was hesitant to jump into a meeting of programmers, coders, systems librarians, and others. I am immensely glad that I did not let this hesitation hold me back this year, as the 2014 Code4Lib Conference was the best and most inviting conference that I have ever attended.
After all the conferences and the craziness at work, LibTechConf seems like ages ago and though it’s been a little while, I wanted to write the usual reflection that I do. I wish I had done it sooner now, but I’m finally getting to it. Great Keynotes I normally prefer getting keynote speakers from outside […]
If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice.
Linked data in archival practice is not new. Others have been here previously. You can benefit from their experience and begin publishing linked data right now using tools with which you are probably already familiar. For example, you probably have EAD files, sets of MARC records, or metadata saved in database applications. Using existing tools, you can transform this content into RDF and put the result on the Web, thus publishing your information as linked data.
If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not because of a lack of technology but because of the inherent purpose and structure of EAD files.
A few years ago an organisation in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. The project was called LOCAH. One of the outcomes of this effort was the creation of an XSL stylesheet (ead2rdf) transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration and customization the stylesheet can transform a generic EAD file into valid RDF/XML for use by anybody. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way toward publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.
For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow:
implement a content negotiation solution
create and maintain EAD files
transform EAD into RDF/XML
transform EAD into HTML
save the resulting XML and HTML files on a Web server
go to step #2
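Step #1, the content negotiation, is the only genuinely new piece for most shops. Its core decision can be sketched in a few lines of Python; this is a simplification (a real implementation would parse q-values in the Accept header, and could live in Apache configuration instead), and the file extensions are assumptions for illustration:

```python
# Minimal sketch of content negotiation between the RDF/XML and HTML
# renditions of the same archival description. A real server would
# parse q-values in the Accept header; this version just looks for
# the RDF media type. The file extensions are assumptions.

def negotiate(accept_header):
    """Pick which rendition to serve for a given Accept header."""
    accept = (accept_header or "").lower()
    if "application/rdf+xml" in accept:
        return ".rdf"   # RDF user agents get the serialized RDF
    return ".html"      # browsers get the human-readable page

# A browser asking for HTML gets HTML; an RDF crawler gets RDF/XML.
print(negotiate("text/html,application/xhtml+xml"))  # .html
print(negotiate("application/rdf+xml"))              # .rdf
```

The same function could be wired into any small web framework or CGI script, so the rest of the workflow stays a matter of static files on a Web server.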
EAD is a combination of narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.
The common practice of using literals to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named items will not exist in standardized authority lists.
Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just information that is not as linkable as possible. This process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most complete linked data possible will probably not want to use EAD files as the root of its publishing system. Instead, some sort of database application is probably the best solution.
In some ways MARC lends itself very well to being published via linked data, but in the long run it is not really a feasible data structure.
Converting MARC into serialized RDF through XSLT is at least a two-step process. The first step is to convert MARC into MARCXML and then MARCXML into MODS. This can be done with any number of scripting languages and toolboxes. The second step is to use a stylesheet such as the one created by Stefano Mazzocchi (mods2rdf.xsl) to transform the MODS into RDF/XML. From there a person could save the resulting XML files on a Web server, enhance access via content negotiation, and call it linked data.
Unfortunately, this particular approach has a number of drawbacks. First and foremost, the MARC format has no place to denote URIs; MARC records are made up almost entirely of literals. Sure, URIs can be constructed from various control numbers, but things like authors, titles, subject headings, and added entries will most certainly be literals (“Mark Twain”, “Adventures of Huckleberry Finn”, “Bildungsroman”, or “Samuel Clemens”), not URIs. This issue can be overcome if the MARCXML were first converted into MODS and URIs were inserted into id or xlink attributes of bibliographic elements, but this is extra work. If an archive were to take this approach, then it would also behoove them to use MODS as their data structure of choice, not MARC. Continually converting from MARC to MARCXML to MODS would be expensive in terms of time. Moreover, with each new conversion the URIs from previous iterations would need to be re-created.
Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) goes a long way toward implementing a named authority database that could be linked from archival descriptions. These XML files could easily be transformed into serialized RDF and therefore linked data. The resulting URIs could then be incorporated into archival descriptions, making the descriptions richer and more complete. For example, the FindAndConnect site in Australia uses EAC-CPF under the hood to disseminate information about people in its collection. Similarly, “SNAC aims to not only make the [EAC-CPF] records more easily discovered and accessed but also, and at the same time, build an unprecedented resource that provides access to the socio-historical contexts (which includes people, families, and corporate bodies) in which the records were created.” More than a thousand EAC-CPF records are available from the RAMP project.
METS, MODS, OAI-PMH service providers, and perhaps more
If you have archival descriptions in either the METS or MODS format, then transforming them into RDF is as far away as your XSLT processor and a content negotiation implementation. As of this writing there do not seem to be any METS-to-RDF stylesheets, but there are a couple of stylesheets for MODS. The biggest issue with these sorts of implementations is the URIs. It will be necessary for archivists to include URIs in as many MODS id or xlink attributes as possible. The same thing holds true for METS files, except the id attribute is not designed to hold pointers to external sites.
Some archives and libraries use a content management system called CONTENTdm. Whether they know it or not, CONTENTdm comes complete with an OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) interface. This means you can send a REST-ful URL to CONTENTdm, and you will get back an XML stream of metadata describing digital objects. Some of the digital objects in CONTENTdm (or any other OAI-PMH service provider) may be worth exposing as linked data, and this can easily be done with a system called oai2lod. It is a particular implementation of D2RQ, described below, and works quite well. Download the application, feed oai2lod the “home page” of the OAI-PMH service provider, and oai2lod will publish the OAI-PMH metadata as linked open data. This is another quick & dirty way to get started with linked data.
Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.
Databases — specifically, relational databases — are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined together with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables. The downside of relational databases as a data model is the infinite variety of field/table combinations, which makes them difficult to share across the Web.
Not coincidentally, relational database technology is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be “Adventures Of Huckleberry Finn” or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, the publishing of linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be optimal, but it is more expensive.
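The parallel between database keys and URIs can be made concrete with a toy version of the book example. All the URIs and predicate names below are invented placeholders, not drawn from any real vocabulary or authority file:

```python
# Toy illustration of the analogy above: rows joined by keys in a
# relational database become triples whose subjects (and some objects)
# are URIs. All URIs here are invented placeholders.

books = {1: {"title": "Adventures of Huckleberry Finn", "author_id": 10}}
authors = {10: {"name": "Mark Twain", "uri": "http://example.org/person/twain"}}

def to_triples(book_id):
    """Turn one book row (plus its joined author row) into triples."""
    book = books[book_id]
    subject = "http://example.org/book/%d" % book_id  # think: primary key
    author = authors[book["author_id"]]               # think: foreign-key join
    return [
        (subject, "title", book["title"]),    # object is a literal
        (subject, "creator", author["uri"]),  # object is another URI
    ]

for triple in to_triples(1):
    print(triple)
```

A report against the database would flatten these joins into rows; an RDF serialization flattens the same joins into triples, which is why the two models line up so naturally.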
Thankfully, many archivists probably use some sort of behind-the-scenes database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult, as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-application-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.
D2RQ is a very powerful software system. It is supported, well-documented, executable on just about any computing platform, open source, focused, functional, and at the same time does not try to be all things to all people. Using D2RQ it is more than possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple:
download the software
use a command-line utility to map the database structure to a configuration file
edit the configuration file to meet your needs
run the D2RQ server using the configuration file as input thus allowing people or RDF user-agents to search and browse the database using linked data principles
alternatively, dump the contents of the database to an RDF serialization and ingest the result into your favorite RDF triple store
The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies & vocabularies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.
If you are going to be in Rome for only a few days, you will want to see the major sights, and you will want to adventure out & about a bit, but at the same time it will be wise to follow the lead of somebody who has been there previously. Take the advice of these people. It is an efficient way to see some of the sights.
First, some assume that the price of MLC NAND flash will continue to decrease at a rapid and predictable rate that will make it competitive with HDDs for bandwidth, and nearly for capacity, by 2014 or 2015. This downward trend, it is assumed, will make flash a viable alternative for large storage and to act as a memory or “buffer” to improve performance.
Second, there is a general assumption that prices for bandwidth ($/GB/s) for SSDs are much lower than for HDDs, and that enterprises will measure costs in these terms instead of capacity.
Third, there is no distinction made between flash in general, such as consumer SSDs, and enterprise storage SSDs. It is assumed that MLC NAND will not only reduce in price ($/GB) but also that it will increase in density and larger capacity drives will be developed.
Fourth, it is assumed that the quality of MLC NAND will either remain constant or increase as prices decrease and densities increase, allowing it to improve not only performance, but also reliability and power consumption of the systems it is used in.
Fifth, it is assumed that power consumption for SSDs is, or will shortly be, significantly lower than that of HDDs overall, on a per GB basis and on a per GB/s basis.
Sixth, they assume disk performance will grow at a constant rate of about 20 percent per generation and not improve beyond that.
Seventh, they assume file system data layout will not improve to allow better disk utilization.
Henry is looking at the market for performance storage, not for long-term storage, but given that limitation I agree with nearly everything he writes. However, I think there is a simpler argument that ends up at the same place that Henry did:
Flash can do everything that hard disk can, but there are many markets where hard disk cannot do what flash can do.
The supply of both flash and hard disk is constrained. Flash is constrained because investing in new flash fabs would not be profitable, especially given the obviously limited scope for shrinking flash cells. Hard disk is constrained because the market is effectively a duopoly, and both players are struggling to transition from the current PMR technology to HAMR.
Thus flash will command a premium over hard disk prices so that the market directs the limited supply of flash to those applications, such as tablets, smartphones, and high-performance servers, where its added value is highest.
Today we’re open-sourcing two internal projects from The Times:
PourOver.js, a library for fast filtering, sorting, updating and viewing large (100k+ item) categorical datasets in the browser, and
Tamper, a companion protocol for compressing categorical data on the server and decompressing in your browser. We’ve achieved a 3–5x compression advantage over gzipped JSON in several real-world applications.
…Collections are important to developers, especially news developers. We are handed hundreds of user submitted snapshots, thousands of archive items, or millions of medical records. Filtering, faceting, paging, and sorting through these sets are the shortest paths to interactivity, direct routes to experiences which would have been time-consuming, dull or impossible with paper, shelves, indices, and appendices….
…The genesis of PourOver is found in the 2012 London Olympics. Editors wanted a fast, online way to manage the half a million photos we would be collecting from staff photographers, freelancers, and wire services. Editing just hundreds of photos can be difficult with the mostly-unimproved, offline solutions standard in most newsrooms. Editing hundreds of thousands of photos in real-time is almost impossible.
Yep, those sorts of tasks sound like things libraries are involved in, or would like to be involved in, right?
The actual JS does some neat things, figuring out how to incrementally and just-in-time send deltas of data, and includes some good UI tools. Look at the page for more.
I am increasingly interested in what ‘digital journalism’ is up to these days. They are an enterprise with some similarities to libraries, in that they are an information-focused business which is having to deal with a lot of internet-era ‘disruption’. Journalistic enterprises are generally for-profit (unlike most of the libraries we work in), but still with a certain public service ethos. And some of the technical problems they deal with overlap heavily with our area of focus.
It may be that the grass is always greener, but I think the journalism industry is rising to the challenges somewhat better than ours is, or at any rate is putting more resources into technical innovation. When was the last time something that probably took as many developer-hours as this stuff, and is of potential interest outside the specific industry, came out of libraries?
I have seen several different approaches to division of labor in developing, deploying, and maintaining web apps.
The one that seems to work best to me is when the team responsible for developing an app is also the team responsible for deploying it and keeping it up, as well as for maintaining it. The same team — and ideally the same individual people (at least at first; job roles and employment change over time, of course).
If the people responsible for writing the app in the first place are also responsible for deploying it with good uptime stats, then they have an incentive to create software that can be easily deployed and can stay up reliably. If it can’t at first, then the people who feel that pain are the same people best placed to improve the software, because they are most familiar with its structure and how it might be altered.
Software is always a living organism; it’s never simply “done”. It’s going to need modifications in response to what you learn from how its users use it, as well as to changing contexts and environments. Software is always under development; the first time it becomes public is just one marker in its development lifecycle, not a clear boundary between “development” and “deployment”.
Compare this to other divisions of labor, where maybe one team does “R&D” on a nice prototype, then hands their code over to another team to turn it into a production service, or to figure out how to get it deployed and keep it deployed reliably and respond to trouble tickets. Sometimes these teams may be in entirely different parts of the organization. If the software doesn’t deploy as easily or reliably as the ‘operations’ people would like, do they need to convince the ‘development’ people that the problem is legit and something should be done? And when it needs additional enhancements or functional changes, maybe it’s the crack team of R&Ders who do it, even though they’re on to newer and shinier things; or maybe it’s the operations people who are expected to do it, even though they’re not familiar with the code since they didn’t write it; or maybe there’s nobody to do it at all, because the organization is operating on the mistaken assumption that developing software is like constructing a building: when it’s done, it’s done.
I just don’t find that this division of labor works well for creating robust, reliable software that can evolve to meet changing requirements.
There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
In this world of silos, development threw releases at the ops or release team to run in production.
The ops team makes sure everything works, everything’s monitored, everything’s continuing to run smoothly.
When something breaks at night, the ops engineer can hope that enough documentation is in place for them to figure out the dial and knobs in the application to isolate and fix the problem. If it isn’t, tough luck.
Putting developers in charge of not just building an app, but also running it in production, benefits everyone in the company, and it benefits the developer too.
It fosters thinking about the environment your code runs in and how you can make sure that when something breaks, the right dials and knobs, metrics and logs, are in place so that you yourself can investigate an issue late at night.
The responsibility of maintaining your own code in production should encourage any developer to make sure that it breaks as little as possible, and that when it breaks you know what to do and where to look.
That’s a good thing.
None of this means you can’t have people who focus on ops and other people who focus on dev; but I think it means they should be situated organizationally close to each other, on the same teams, and that the dev people have to share some ops responsibilities, so they feel some pain from products that are hard to deploy, hard to keep running reliably, or hard to maintain or change.
Note: some people think even constructing a building shouldn’t be “when it’s done, it’s done”; buildings too should be constructed in a way that allows continual modification by those who inhabit them, in response to changing needs or understandings of those needs.
If you sell an ebook through Amazon's Kindle Direct program, Amazon doesn't want you to offer it for less somewhere else. It's easy to understand why; if you're a consumer, you hate to pay $10 for an ebook on Amazon and then find that you can get it direct from the author for $5. But is it legal for Amazon to enjoin a publisher from offering better prices in other channels? In other words, is Amazon allowed to insist on a "Most Favored Nation" (MFN) provision?
4. Setting Your List Price You must set your Digital Book's List Price (and change it from time-to-time if necessary) so that it is no higher than the list price in any sales channel for any digital or physical edition of the Digital Book. But if you choose the 70% Royalty Option, you must further set and adjust your List Price so that it is at least 20% below the list price in any sales channel for any physical edition of the Digital Book.
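Read literally, the quoted clause imposes two ceilings: your Kindle list price may not exceed the lowest list price in any other channel, and under the 70% royalty option it must also sit at least 20% below the lowest physical-edition list price. A toy Python check of my reading of the rule, with hypothetical numbers (emphatically not legal or contractual advice):

```python
def kdp_price_ok(list_price, channel_prices, physical_prices, royalty_70=False):
    """Check an ebook list price against the KDP clause as quoted above.

    channel_prices:  list prices for any edition in any other sales channel.
    physical_prices: list prices for physical editions specifically.
    """
    # Ceiling 1: no higher than the lowest price in any other channel.
    if channel_prices and list_price > min(channel_prices):
        return False
    # Ceiling 2 (70% royalty only): at least 20% below the cheapest
    # physical edition, i.e. no more than 80% of its list price.
    if royalty_70 and physical_prices and list_price > 0.8 * min(physical_prices):
        return False
    return True
```

On this reading, a $0 price elsewhere (as with a Creative Commons download) makes every positive Kindle price fail ceiling 1, which is exactly the worry the post goes on to describe.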
I really don't know the answer, but I do know that Apple's MFN provision was a focus of the Department of Justice's successful prosecution of Apple and 5 colluding publishers for violations of the Sherman Antitrust Act. If Apple couldn't have a MFN, then how can Amazon insist on it, given their dominant market position in ebooks?
Although the judge found that the MFN clause in this instance was critical to Apple’s ability to orchestrate the unlawful conspiracy, Judge Cote explicitly held that MFN clauses are not, in and of themselves, “inherently illegal.” Judge Cote explained that “entirely lawful contracts may contain an MFN …. The issue is not whether an entity … used an MFN, but whether it conspired to raise prices.” This determination, she stated, must be based on consideration of the “totality of the evidence,” rather than on the language of the agency agreement or MFN alone. Examining the facts in this particular case, Judge Cote found that Apple’s use of the MFN clause to facilitate the e-book conspiracy with the publishers constituted a “per se” violation of the antitrust laws.
depending on the economic and commercial circumstances, MFN clauses have on occasion caused concern to competition authorities. In particular:
They can act as a disincentive to price cutting. If a supplier knows that, by offering a discount to any third-party customer, the supplier must also offer the customer benefiting from the MFN clause a discount to ensure that the latter enjoys the most favourable price, that is a "double cost" to price cutting, and therefore could have the effect of deterring price cuts and keeping prices higher than they might otherwise be.
In the European Union, Amazon has run into problems with a similar "Price Parity" provision for the Amazon Marketplace. After inquiries by European Union regulatory agencies, Amazon agreed NOT to enforce Price Parity; that agreement has been in effect since August 31, 2013. The Bookseller reported on the effect of this agreement in the (print) book market.
In the U.S., there's further confusion about distribution channel pricing because of the Robinson-Patman Act, which prevents publishers from pricing print books to favor one distributor over another. But according to the Federal Trade Commission, "The Act applies to commodities, but not to services, and to purchases, but not to leases." Since ebooks are licensed, not sold, it seems to this non-lawyer that Robinson-Patman shouldn't apply to ebooks.
The particular situation that has drawn my attention is the case of authors and publishers that make their ebooks available under Creative Commons licenses. Many of these authors also make their ebooks available via the Kindle Direct Publishing Program. There's nothing at all wrong with that - many readers prefer to get these ebooks onto their Kindles via Amazon, and are happy to know that some money ends up with the creators of the ebook. Amazon offers convenience, reliable customer service and wireless delivery.
At Unglue.it, we're starting to offer Creative Commons creators the ability to ask people who download their ebooks for support (the program officially launches on April 30). The top concern these authors have expressed to us about this program is the "setting your list price" clause for their Kindle Direct channel. If they participate in our "Thanks for Ungluing" program they worry that Amazon will kick them out of the KDP program and the corresponding revenue stream.
We've done a few things to address this concern. Creators can set a "list price" in Unglue.it: it's the suggested contribution for the pay-what-you-want download. And that's the price we report in our schema.org metadata.
But what if Amazon sees Unglue.it offering free downloads of books they're offering for $3.99 on the Kindle? Would they delist the book from the Kindle platform and kill that revenue stream? Or maybe delist the publisher entirely?
It seems to me that if Amazon did this, it could be running afoul of Judge Cote's guidelines for MFN provisions. Enforcing the MFN would amount to a retaliation against creators who offer lower prices (including zero) in other channels. Amazon doesn't even let you set your price to zero.
If you go to Rome for a day, walk to the Colosseum and Vatican City. Everything you see along the way will be extra.
Linked data is not a fad. It is not a trend. It makes a lot of computing sense, and it is a modern way of fulfilling some of the goals of archival practice. Just like Rome, it is not going away. An understanding of what linked data has to offer is akin to experiencing Rome first hand. Both will ultimately broaden your perspective. Consequently it is a good idea to make a concerted effort to learn about linked data, as well as visit Rome at least once. Once you have returned from your trip, discuss what you learned with your friends, neighbors, and colleagues. The result will enlighten everybody.
The previous sections of this book described what linked data is and why it is important. The balance of the book describes more of the how’s of linked data. For example, there is a glossary to help reinforce your knowledge of the jargon. You can learn about HTTP “content negotiation” to understand how actionable URIs can return HTML or RDF depending on the way you instruct remote HTTP servers. RDF stands for “Resource Description Framework”, and the “resources” are represented by URIs. A later section of the book describes ways to design the URIs of your resources. Learn how you can transform existing metadata records like MARC or EAD into RDF/XML, and then learn how to put the RDF/XML on the Web. Learn how to exploit your existing databases (such as the ones under Archon, Archivists’ Toolkit, or ArchivesSpace) to generate RDF. If you are the Do It Yourself type, then play with and explore the guidebook’s tool section. Get the gentlest of introductions to searching RDF using a query language called SPARQL. Learn how to read and evaluate ontologies & vocabularies. They are manifested as XML files, and they are easily readable and visualizable using a number of programs. Read about and explore applications using RDF as the underlying data model. There are a growing number of them. The book includes a complete publishing system written in Perl, and if you approach the code of the publishing system as if it were a theatrical play, then the “scripts” read like scenes. (Think of the scripts as if they were a type of poetry, and they will come to life. Most of the “scenes” are less than a page long. The poetry even includes a number of refrains. Think of the publishing system as if it were a one-act play.) If you want to read more, and you desire a vetted list of books and articles, then a later section offers suggestions for further reading.
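The content negotiation mentioned above boils down to a server inspecting the HTTP Accept header and choosing a representation for the same URI. A simplified sketch of that choice as a pure function — real servers follow the fuller RFC 7231 algorithm, with more careful wildcard and specificity handling:

```python
def negotiate(accept_header,
              available=("text/html", "application/rdf+xml", "text/turtle")):
    """Pick the representation a client prefers, from an Accept header.

    Parses media ranges and q-values, then returns the available media
    type with the highest client preference, or None if nothing matches.
    """
    prefs = {}
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        mtype = fields[0].strip()
        q = 1.0
        for f in fields[1:]:
            name, _, value = f.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs[mtype] = q

    def score(t):
        # Fall back from exact type to type/* to */* wildcards.
        return prefs.get(t, prefs.get(t.split("/")[0] + "/*",
                                      prefs.get("*/*", 0.0)))

    best = max(available, key=score)
    return best if score(best) > 0 else None
```

So a browser asking for `text/html` gets a human-readable page, while a harvester asking for `application/rdf+xml` gets triples back from the same actionable URI.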
After you have spent some time learning a bit more about linked data, discuss what you have learned with your colleagues. There are many different aspects of linked data publishing, such as but not limited to:
allocating time and money
analyzing your own RDF as well as others’
cleaning and improving RDF
collecting and harvesting the RDF of others
deciding what ontologies & vocabularies to use
designing local URIs
enhancing RDF triples stores by asserting additional relationships
finding and identifying URIs for the purposes of linking
making RDF available on the Web (SPARQL, RDFa, data dumps, etc.)
provisioning value-added services against RDF (catalogs, finding aids, etc.)
storing RDF in triple stores
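To make the “making RDF available on the Web” item concrete: the simplest possible output is N-Triples, one statement per line. A pure-Python sketch, where the archival URIs are hypothetical placeholders rather than any real institution’s:

```python
def ntriple(subject, predicate, obj):
    """Serialize one RDF statement as an N-Triples line.

    URIs are wrapped in angle brackets; anything else is treated as a
    literal, quoted and escaped.
    """
    if obj.startswith("http://") or obj.startswith("https://"):
        o = "<%s>" % obj
    else:
        o = '"%s"' % obj.replace("\\", "\\\\").replace('"', '\\"')
    return "<%s> <%s> %s ." % (subject, predicate, o)

# A finding aid's title, using a hypothetical local URI and Dublin Core:
line = ntriple(
    "http://archives.example.org/id/papers-001",
    "http://purl.org/dc/terms/title",
    "Guide to the Example Family Papers",
)
```

A file of such lines, dumped at a stable URL, is already linked data in its plainest form; triple stores and SPARQL endpoints build on the same statements.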
In archival practice, each of these things would be done by different sets of people: archivists & content specialists, administrators & managers, computer programmers & systems administrators, metadata experts & catalogers. Each of these sets of people has a piece of the publishing puzzle and something significant to contribute to the work. Read about linked data. Learn about linked data. Bring these sets of people together to discuss what you have learned. At the very least you will have a better collective understanding of the possibilities. If you don’t plan to “go to Rome” right away, you can reconsider the “vacation” at another time.
Even Michelangelo, when he painted the Sistine Chapel, worked with a team of people, each possessing a complementary set of skills. Each had something different to offer, and the discussion among themselves was key to their success.
Comprehensive social search on the Internet remains an unsolved problem. Social networking sites tend to be isolated from each other, and the information they contain is often not fully searchable outside the confines of the site. EgoSystem, developed at Los Alamos National Laboratories (LANL), explores the problems associated with automated discovery of public online identities for people, and the aggregation of the social, institution, conceptual, and artifact data connected to these identities. EgoSystem starts with basic demographic information about former employees and uses that information to locate person identities in various popular online systems. Once identified, their respective social networks, institutional affiliations, artifacts, and associated concepts are retrieved and linked into a graph containing other found identities. This graph is stored in a Titan graph database and can be explored using the Gremlin graph query/traversal language and with the EgoSystem Web interface.
This article describes how the University of North Texas Libraries' Digital Projects Unit used simple, freely-available APIs to add place names to metadata records for over 8,000 maps in two digital collections. These textual place names enable users to easily find maps by place name and to find other maps that feature the same place, thus increasing the accessibility and usage of the collections. This project demonstrates how targeted large-scale, automated metadata enhancement can have a significant impact with a relatively small commitment of time and staff resources.
In late 2012, OSU Libraries and Press partnered with Maria’s Libraries, an NGO in Rural Kenya, to provide users the ability to crowdsource translations of folk tales and existing children's books into a variety of African languages, sub-languages, and dialects. Together, these two organizations have been creating a mobile optimized platform using open source libraries such as Wink Toolkit (a library which provides mobile-friendly interaction from a website) and Globalize3 to allow for multiple translations of database entries in a Ruby on Rails application. Research regarding successes of similar tools has been utilized in providing a consistent user interface. The OSU Libraries & Press team delivered a proof-of-concept tool that has the opportunity to promote technology exploration, improve early childhood literacy, change the way we approach foreign language learning, and to provide opportunities for cost-effective, multi-language publishing.
In this article, we present a case study of how the main publishing format of an Open Access journal was changed from PDF to EPUB by designing a new workflow using JATS as the basic XML source format. We state the reasons and discuss advantages for doing this, how we did it, and the costs of changing an established Microsoft Word workflow. As an example, we use one typical sociology article with tables, illustrations and references. We then follow the article from JATS markup through different transformations resulting in XHTML, EPUB and MOBI versions. In the end, we put everything together in an automated XProc pipeline. The process has been developed on free and open source tools, and we describe and evaluate these tools in the article. The workflow is suitable for non-professional publishers, and all code is attached and free for reuse by others.
The Valley Library at Oregon State University Libraries & Press supports access to technology by lending laptops and e-readers. As a newcomer to tablet lending, The Valley Library chose to implement its service using Google Nexus tablets and an open source custom firmware solution, CyanogenMod, a free, community-built Android distribution. They created a custom build of CyanogenMod featuring wireless updates, website shortcuts, and the ability to quickly and easily wipe devices between patron uses. This article shares code that simplifies Android tablet maintenance and addresses Android application licensing issues for shared devices.
As the archival horizon moves forward, optical media will become increasingly significant and prevalent in collections. This paper sets out to provide a broad overview of optical media in the context of archival migration. We begin by introducing the logical structure of compact discs, providing the context and language necessary to discuss the medium. The article then explores the most common data formats for optical media: Compact Disc Digital Audio, ISO 9660, the Joliet and HFS extensions, and the Universal Data Format (with an eye towards DVD-Video). Each format is viewed in the context of preservation needs and what archivists need to be aware of when handling said formats. Following this, we discuss preservation workflows and concerns for successfully migrating data away from optical media, as well as directions for future research.
Digital signage has been used in the commercial sector for decades. As display and networking technologies become more advanced and less expensive, it is surprisingly easy to implement a digital signage program at a minimal cost. In the fall of 2011, the University of Florida (UF), Health Sciences Center Library (HSCL) initiated the use of digital signage inside and outside its Gainesville, Florida facility. This article details UF HSCL’s use and evaluation of DigitalSignage.com signage software to organize and display its digital content.
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
We are happy to announce the v1.0 release of the OCLC Python 2.7 Authentication Library via Github. This code library is the fourth implementation that the OCLC Developer Network is releasing to assist developers working with our web services protected by our API key system.
Sign up below to get the Open Access Button, a safe, easy to use browser bookmarklet that you can use to show the global effects of research paywalls – and to help get access to the research you need. Every time you hit a paywall blocking your research, click the button. Fill out a short form, add your experience to the map along with thousands of others. Then use our tools to search for access to papers, and spread the word with social media. Every person who uses the Open Access Button brings us closer to changing the system.
Before Whatson accepts a search query it will first ingest, analyze and index documents so that searches don’t take forever. I have shown how Whatson will use Apache Tika to extract metadata and convert different content types into plain text. After that, the plain text will be split up into words, called tokens, so that queries can later be matched up to documents. Here is a simple example:
The tokenizer analyzes whitespace and punctuation to produce a list of tokens. Partial example, with pipes inserted by me:
Dr. | Lanyon | sat | alone | over | his | wine | . | This | was | a | hearty | , | healthy …
The tokenizer was smart enough to keep the period with “Dr.” but separate it out when it was used to end a sentence. This is why you don’t want to build one from scratch.
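To see why, here is a naive from-scratch sketch that reproduces the example above by special-casing a tiny, assumed abbreviation list. Every abbreviation, contraction, hyphenation, and numeric form needs judgment like this, which is exactly what a mature analyzer (the kind a system like Whatson leans on) already encodes:

```python
import re

# A tiny, assumed abbreviation list; a real analyzer ships a much larger one.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "St."}

def tokenize(text):
    """Split on whitespace, then peel punctuation off each chunk,
    keeping the period attached to known abbreviations."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Separate word characters from punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

Running `tokenize("Dr. Lanyon sat alone over his wine.")` keeps “Dr.” whole while splitting the final period into its own token, matching the piped example above.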
The library is the heart of the University. From it, the lifeblood of scholarship flows to all parts of the University; to it the resources of scholarship flow to enrich the academic body. With a mediocre library, even a distinguished faculty cannot attain its highest potential; with a distinguished library, even a less than brilliant faculty may fulfill its mission. For the scientist, the library provides an indispensable backstop to his laboratory and field work. For the humanist, the library is not only his reference centre; it is indeed his laboratory and the field of his explorations. What he studies, researches and writes is the product of his reading in the library. For these reasons, the University library must be one of the primary concerns for those responsible for the development and welfare of the institution. At the same time, the enormous cost of acquisitions, the growing scarcity of older books, the problem of storage and cataloguing make the library one of the most painful headaches of the University administrator.
“If we can put a man on the moon and we can transplant a heart, we surely can say when something shows up ‘free’ and do something about that.” Rep. Tom Marino (R-PA).
In March, the U.S. House Judiciary Subcommittee on Courts, Intellectual Property and the Internet held a hearing on Section 512, the provision that provides protection for internet service providers from liability for the infringing actions of network users. The Library Copyright Alliance (LCA) submitted comments (pdf) in support of no changes to the existing law, holding that this provision helps libraries provide online services in good faith without liability for the potentially illegal actions of a third party.
Though libraries were not specifically represented in the hearing, one line of questioning directed at both Google and Automattic Inc.—owner of WordPress—stands out as relevant to both present and future methods of delivering content and services to library patrons: “free” as the opposite of “legal” or “legitimate.”
Several representatives focused on witnesses Katherine Oyama, senior copyright policy counsel for Google, and Paul Sieminski, general counsel for Automattic Inc., expressing significant confusion about how Google creates and modifies indexing and search algorithms, as well as the nuances of copyright protection on a blogging platform. “Free” was the watchword, and many subcommittee members expressed the same basic concerns.
Rep. Judy Chu (D-CA) asked about autocomplete results in Google that include “free” and “watch online,” saying that such results “induce infringement” on the part of searchers. Rep. Cedric Richmond (D-LA) further echoed worries that unsophisticated Internet users like his grandmother would be “induced to infringe” by seeing an autocomplete result for “watch 12 Years a Slave free online.”
But the most colorful exchange began with Rep. Tom Marino (R-PA) expressing disbelief that Google could not simply ban or remove terms such as “watch X movie online for free” from the engine.
Oyama rightly pointed out that “we are not going to ban the word ‘free’ from search…there are many legitimate sources for music and films that are available for free.” She also promoted YouTube’s ContentID software as an effective answer to alleged infringement, though there are certainly reasons to remain wary of the “software savior” in addressing takedown notices (more on ContentID coming soon).
As libraries begin exploring ways to deliver legally obtained and responsibly monitored content to patrons, we will have to offer a counterpoint to the concept of “free” as the automatic enemy of rights holders. While we know that it is anything but free to provide these services (no-fee or no-charge is perhaps a better description), the public often perceives it as such, and simply banning phrases like “read for free” or “watch for free” from the world’s largest Internet index will not reduce infringement. Instead, it removes a responsible and reliable source from top page results, which is the exact opposite of what the lawmakers above support.
If you go to Rome for a day, walk to the Colosseum and Vatican City. Everything you see along the way will be extra. If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice. For a week, do everything you would do in a few days, and make one or two day-trips outside Rome in order to get a flavor of the wider community. If you can afford two weeks, then do everything you would do in a week, and in addition befriend somebody in the hopes of establishing a life-long relationship.
When you read a guidebook on Rome — or any travel guidebook — there are simply too many listed things to see & do. Nobody can see all the sites, visit all the museums, walk all the tours, nor eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data has to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:
design the structure of your URIs
select/design your ontology & vocabularies — model your data
map and/or migrate your existing data to RDF
publish your RDF as linked data
create a linked data application
harvest other people’s data and create another application
Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different “itineraries” for linked data implementation. Think of them as strategies. They are ordered from least costly and most modest to most expensive and most complete:
Rome in a day – Maybe you can’t afford to do anything right now, but if you have gotten this far in the guidebook, then you know something about linked data. Discuss (evaluate) linked data with your colleagues, and consider revisiting the topic in a year.
Rome in three days – If you want something relatively quick and easy, but with the understanding that your implementation will not be complete, begin migrating your existing data to RDF. Use XSLT to transform your MARC or EAD files into RDF serializations, and publish them on the Web. Use something like OAI2RDF to make your OAI repositories (if you have them) available as linked data. Use something like D2RQ to make your archival description stored in databases accessible as linked data. Create a triple store and implement a SPARQL endpoint. As before, discuss linked data with your colleagues.
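Once the three-day itinerary’s SPARQL endpoint exists, even the first query pays off. A hedged illustration of what such a query might look like; the Dublin Core title property is an assumption about your chosen vocabulary, not a requirement:

```sparql
# List ten resources and their titles from the triple store.
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?resource ?title
WHERE { ?resource dct:title ?title . }
LIMIT 10
```

Queries like this are how the “discuss it with your colleagues” step becomes concrete: you can show actual records coming back out of the RDF you just published.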
Rome in a week – Begin publishing RDF, but at the same time think hard about and document the structure of your future URIs as well as the ontologies & vocabularies you are going to use. Discuss it with your colleagues. Migrate and re-publish your existing data as RDF using the documentation as a guide. Re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice.
Rome in two weeks – First, do everything you would do in one week. Second, supplement your triple store with the RDF of others. Third, write an application against the triple store that goes beyond search. In short, tell stories, and you will be discussing linked data with the world, literally.
Last week, the American Library Association (ALA) joined an amicus brief calling for reconsideration of a 9th Circuit court decision in Garcia v. Google, a case where actress Cindy Lee Garcia sued Google for not removing a YouTube video in which she appears. Garcia appears for five seconds in “Innocence of Muslims,” the radical anti-Islamic video that fueled the attack on the American embassy in Benghazi. The video was uploaded to YouTube, exposing Garcia to threats and hate mail. Garcia did not know that her five-second performance would be used in a controversial video.
Garcia turned to the copyright law for redress, arguing that her five second performance was protected by copyright, and therefore, as a rights holder she could ask that the video be removed from YouTube. While we empathize with Garcia’s situation, the copyright law does not protect performances in film—instead these performances are works-for-hire. This ruling, if taken to its extreme, would hold that anyone who worked on a film—from the editor to the gaffer—could claim rights, creating a copyright permissions nightmare.
On appeal, the judge agreed that the copyright argument was weak, but nonetheless ruled for Garcia. The video currently is not available for public review. This decision needs to be reheard en banc—the copyright ruling is mistaken, and perhaps more importantly, the copyright law cannot be used to restrain speech. While the facts of this case are not at all appealing, we agree that rules of law need to be upheld. Fundamental values of librarianship—including intellectual freedom, fair use, and preservation of the cultural record—are in serious conflict with the existing court ruling.
On 26 March 2014 I gave a short talk at the March 2014 AR Standards Community Meeting in Arlington, Virginia. The talk was called “Stuff, Standards and Sites: Libraries and Archives in AR.” My slides and the text of what I said are online:
I struggled with how best to talk to non-library people, all experts in different aspects of augmented reality, about how our work can fit with theirs. The stuff/standards/sites components gave me something to hang the talk on, but it didn’t all come together as well as I’d hoped and in the heat of actually speaking I forgot to mention a couple of important things. Ah well.
I made the slides a new way. They are done with reveal.js, but I wrote them in Emacs with Org and then used org-reveal to export them. It worked beautifully! The diagrams in the slides are done in text in Org with ditaa and turned into images on export.
What I write in Org looks like this (here I turned image display off, but one keystroke makes them show):
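Here is an illustrative sketch of that kind of Org source, not my actual slides (the reveal.js path and title settings are assumptions for the example):

```org
#+Title: Stuff, Standards and Sites
#+REVEAL_ROOT: ./reveal.js

* Libraries and Archives in AR

** A diagram, drawn as text with ditaa

#+BEGIN_SRC ditaa :file stuff.png
+-------+    +-----------+    +-------+
| Stuff |--->| Standards |--->| Sites |
+-------+    +-----------+    +-------+
#+END_SRC
```

On export, org-reveal turns each heading into a slide and ditaa turns the ASCII box drawing into an image.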
When turned into slides, that looks like this:
Working this way was a delight. No more nonsense about dragging boxes and stuff around like in PowerPoint. I get to work with pure text, in my favourite editor, and generate great-looking slides, all with free software.
CHICAGO — Tablets, desktops, smartphones, laptops, minis: we live in a world of screens, all of different sizes. Library websites need to work on all of them, but maintaining separate sites or content management systems is resource intensive and still unlikely to address all the variations. By using responsive Web design, libraries can build one site for all devices—now and in the future. In “Responsive Web Design for Libraries: A LITA Guide,” published by ALA TechSource, experienced responsive Web developer Matthew Reidsma, named “a web librarian to watch” by ACRL’s TechConnect blog, shares proven methods for delivering the same content to all users using HTML and CSS. His practical guidance will enable Web developers to save valuable time and resources by working with a library’s existing design to add responsive Web design features. With both clarity and thoroughness, and firmly addressing the expectations of library website users, this book:
shows why responsive Web design is so important, and how its flexibility can meet the needs of both today’s users and tomorrow’s technology;
provides in-depth coverage of implementing responsive Web design on an existing site, steps for taking traditional desktop CSS and adding breakpoints for site responsiveness and ways to use grids to achieve a visual layout that’s adaptable to different devices;
includes valuable tips and techniques from Web developers and designers, such as how to do more with fewer resources and improving performance by designing a site that sends fewer bytes over fewer connections;
offers advice for making vendor sites responsive;
features an abundance of screen captures, associated code samples and links to additional resources.
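As a minimal illustration of the breakpoint technique the book covers (a sketch with made-up class names, not code from the book):

```css
/* Traditional desktop-first layout: two floated columns */
.sidebar { float: left; width: 25%; }
.content { float: left; width: 75%; }

/* A breakpoint: on narrow screens, stack the columns full-width */
@media (max-width: 600px) {
  .sidebar, .content { float: none; width: 100%; }
}
```

The same HTML is sent to every device; only the CSS decides how it flows.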
Reidsma is Web services librarian at Grand Valley State University, in Allendale, Mich. He is the cofounder and editor in chief of Weave: Journal of Library User Experience, a peer-reviewed, open-access journal for library user experience professionals. He speaks frequently about library websites, user experience and responsive design around the world. Library Journal named him a “Mover & Shaker” in 2013. He writes about libraries and technology at Matthew Reidsma.
The Library and Information Technology Association (LITA), a division of ALA, educates, serves and reaches out to its members, other ALA members and divisions, and the entire library and information community through its publications, programs and other activities designed to promote, develop, and aid in the implementation of library and information technology.
ALA Store purchases fund advocacy, awareness and accreditation programs for library professionals worldwide. Contact us at (800) 545-2433 ext. 5418 or firstname.lastname@example.org.
“Put down the marker, step away from the whiteboard.” I joked that once in a design session. A picture can represent a rich array of information in a single frame — that is its strength and weakness. “A picture paints a thousand words. Stop all the talking!” It can take a while to assimilate all the information in a diagram. Here is my first cut at an architecture diagram for Whatson, my home basement attempt at building Watson using public knowledge and open source technology. I will detail the components in future posts as the build proceeds.
1. Data Source to Index. In order for Whatson to be able to answer questions in a timely fashion, data sources must be pre-processed. Data sources must be crawled and indexed. The index is the structured target for searches.
1.1. Data Sources. I will download data sources including public domain literature and Wikipedia. Other sources may be added. The more data sources, the smarter Whatson will be.
1.2. Crawl. In a previous post, I showed how I can use Apache Tika to convert different content types (e.g., HTML, PDF) into plain text and extract metadata. This is the crawl stage. The common plain-text format makes further processing much easier.
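The idea of the crawl stage can be sketched in Python (using only the standard library as a stand-in for Tika; a real crawl would hand PDF, Word and the rest to Tika the same way):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def crawl_to_text(content, content_type):
    """Normalize one document to plain text (the 'crawl' stage).

    Only HTML and plain text are handled here; other formats would be
    dispatched the same way.
    """
    if content_type == "text/html":
        parser = TextExtractor()
        parser.feed(content)
        return " ".join(c.strip() for c in parser.chunks if c.strip())
    return content  # already plain text

print(crawl_to_text("<html><body><p>Call me Ishmael.</p></body></html>",
                    "text/html"))  # prints "Call me Ishmael."
```

Whatever the input format, the output is the same: plain text ready for the index stage.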
1.3. Index. Using OpenNLP I will process text along a UIMA pipeline. UIMA is an open, industry standard architecture for Natural Language Processing (NLP) stages. A UIMA pipeline is a series of text-processing steps including: parsing document content into individual words or tokens; analyzing the tokens into parts of speech like nouns and verbs; and identification of entities like people, locations and organizations.
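As a rough sketch of the pipeline idea (the tagging logic here is a toy stand-in for OpenNLP's trained models, and the stage names are my own):

```python
# Each stage takes the analysis so far and adds a layer of annotations,
# the way annotators are chained in a UIMA pipeline.

def tokenize(doc):
    # Stage 1: parse document content into individual words or tokens
    doc["tokens"] = doc["text"].replace(".", "").split()
    return doc

def tag_entities(doc):
    # Stage 2: toy named-entity step -- treat capitalized non-initial
    # tokens as entities (a real model is far smarter than this)
    doc["entities"] = [t for t in doc["tokens"][1:] if t[0].isupper()]
    return doc

def run_pipeline(text, stages):
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("The novelist Herman Melville lived in Pittsfield",
                   [tokenize, tag_entities])
print(doc["entities"])  # prints ['Herman', 'Melville', 'Pittsfield']
```

The annotated output, not the raw text, is what gets written to the index.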
2. Question to Answer. Once the data sources have been crawled and indexed, a question may be submitted to Whatson. The output must be Whatson’s single best answer.
2.1. Question. A user interface will accept a question in natural language.
2.2. Cognitive Analysis. Whatson will analyze the question text. The analysis first submits the text to the UIMA pipeline built for step 1. The pipeline outputs are used here to make the question easier to analyze for the next step: deciding the question type. Is the question seeking a person or a place? Is the context literal or figurative? Current or historical? Based on the question type, modules will be enlisted to answer the question. This modular approach simulates the human brain, with different modules dedicated to different kinds of knowledge and cognitive processing. The modules use domain-specific logic to search for answers in the index prepared in step 1. For example, a literature module will have domain-specific rules for analyzing literature. This approach keeps Whatson from going on wild goose chases and speeds up processing. The output of the cognitive analysis is a candidate answer and confidence level from each enlisted module.
2.3. Dialog. Whatson needs to decide which answer from the cognitive analysis is best. If the stakes are low, it will simply select the answer with the highest confidence level. The questioner can respond whether the answer is satisfactory, and a dialog may continue with additional questions. If Whatson is used in a context that has penalties, like playing Jeopardy, it might not risk giving its best answer if the confidence level is low. If the context permits, Whatson could ask for hints or prompt for a restatement of the question.
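The selection logic in 2.3 can be sketched in a few lines (my own toy logic and threshold, not Watson's):

```python
def choose_answer(candidates, high_stakes=False, threshold=0.7):
    """Pick the best (answer, confidence) pair from the enlisted modules.

    Low stakes: just take the highest-confidence answer.
    High stakes: abstain below the threshold rather than risk a penalty.
    """
    answer, confidence = max(candidates, key=lambda c: c[1])
    if high_stakes and confidence < threshold:
        return None  # too risky: ask for a hint or a restatement instead
    return answer

candidates = [("Toronto", 0.4), ("Chicago", 0.9)]
print(choose_answer(candidates))  # prints "Chicago"
```

The threshold would itself be tuned: in a Jeopardy-like setting the cost of a wrong answer determines how cautious Whatson should be.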
Baker. Final Jeopardy: Man vs. Machine and the Quest to Know Everything.
Ingersoll, Morton & Farris. Taming Text.
Every day, through secret contracts being carried out within public institutions, there is confirmation that the interest of the public is not served. A few days ago, young Nigerians in Abuja were arrested for protesting against the reckless conduct of the recruitment exercise at the Nigerian Immigration Service (NIS) that led to the death of 19 applicants.
Although the protesters were later released, the irony still stings that whilst no one has been held for the resulting deaths from the reckless recruitment conduct, the young voices protesting against this grave misconduct are being silenced by security forces. Most heart-breaking is the reality that the deadly outcomes of the recruitment exercise could have been avoided with more conscientious planning, through an adherence to due process and diligence in the selection of consultants to carry out the exercise.
A report released by Premium Times indicates that the recruitment exercise was conducted exclusively by the Minister of Interior, who hand-picked the consultant that carried out the recruitment exercise at the NIS. The non-responsiveness of the Ministry in providing civic organizations including BudgIT and PPDC with requested details of the process through which the consultant was selected gives credence to the reports of due process being flouted.
The non-competitive process through which the consultant was selected is in sharp breach of the public procurement law, and its results have undermined the concept of value for money in the award of contracts for public services. Although a recruitment website was built and deployed by the hired consultant, the information gathered by the website does not seem to have informed the plan for the conduct of the recruitment exercise across the country, which left Nigerians dead in its wake. Whilst the legality of the revenue generated from over 710,000 applicants is questioned, it is appalling that these resources were not used to ensure a better organized recruitment exercise.
This is not the first time that public institutions in Nigeria have displayed reckless conduct in the supposed administration of public services to the detriment of Nigerians. The recklessness with which the Ministry of Aviation took a loan to buy highly inflated vehicles, the difficulty faced by BudgIT and PPDC in tracking the exact amount of SURE-P funds spent, and the $20 billion unaccounted for by the NNPC are a few of the cases where nation building and development are undermined by public institutions.
In the instance of the NIS recruitment conducted three weeks ago, some of the consequences have been immediate and fatal, yet there is foot dragging in apportioning liability and correcting the injustice that has been dealt to Nigerians. On the same issue, public resources have been speedily deployed to silence protesters.
It is time that our laws which require due process and diligence are fully enforced. Peaceful protests should no longer be clamped down because Nigerians are justified for being outraged by any form of institutional recklessness. The Nigerian Immigration Service recruitment exercise painfully illustrates that the outcomes of secret contracts could be deadly and such behaviour cannot be allowed to continue. We must stop institutional recklessness, we must stop secret contracts.
Ms. Seember Nyager coordinates procurement monitoring in Nigeria. Follow Nigerian Procurement Monitors at @Nig_procmonitor.
Amidst a flurry of congressional hearings and treaty negotiations, it is important to remember that statistics often tell half of the story. As I catch up on recent U.S. House subcommittee hearings, I continue to marvel at how often both committee members and witnesses conflate a total number of takedown notices with actual cases of infringement. This is not a new problem; the “Chilling Effect” is a well-documented (pdf) result of widespread abuse of Section 512 takedown notices. In 2009, Google reported that over a third of DMCA takedown notices were invalid:
Google notes that more than half (57%) of the takedown notices it has received under the US Digital Millennium Copyright Act 1998, were sent by business targeting competitors and over one third (37%) of notices were not valid copyright claims.
And that doesn’t even include YouTube or Blogger takedown statistics! The numbers aren’t much better today. Google’s latest Transparency Report shows over 27 million removal requests over the past three years, with nearly a million of those requests denied (requests cited as “improper” or “abusive”) in 2011 alone. Many rights holders will continue to point to takedown notice numbers as evidence of widespread infringement, but this simply bolsters a landscape in which everybody is guilty until proven innocent of violating copyright.
A new but still “pre-published” version of the Linked Archival Metadata: A Guidebook is available. From the introduction:
The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about universally making accessible and repurposing sets of facts about you and your collections. As you publish these facts, you will be able to maintain a more flexible Web presence as well as a Web presence that is richer, more complete, and better integrated with complementary collections.
A few weeks back, I dropped Google search in favor of DuckDuckGo, an alternative search engine that does not log your searches. Today, I’m here to report on that experience and suggest two even better secure search tools: StartPage and Ixquick.
The problem with DuckDuckGo
As I outlined in my initial blog post, DuckDuckGo falls down probably as a consequence of its emphasis on privacy. Google results are based on an array of personal variables that tie specific result sets to your social graph: a complex web of data points collected on you through your Chrome browser, Android apps, browser cookies, location data, possibly even the contents of your documents and emails stored on Google’s servers (that’s a guess, but totally within the scope of reason). DuckDuckGo, which collects none of that, operates at a considerable handicap.
But moreover, Google’s algorithm remains superior to everything else out there.
The benefits of using DuckDuckGo, of course, are that you are far more anonymous, especially if you are searching in private browser mode, accessing the Internet through a VPN or Tor, etc.
Again, given the explosive revelations about aggressive NSA data collection and even of government programs that hack such social graphs, and the potential leaking of that data to even worse parties, many people may decide that, on balance, they are better off dealing with poor search precision rather than setting themselves up for a cataclysmic breach of their data.
I’m one such person, but to be quite honest, I was constantly turning back to Google because DuckDuckGo just wouldn’t get me what I knew was out there.
Fortunately, I found something better: StartPage and Ixquick.
There are two important things to understand about StartPage and Ixquick:
Both StartPage and Ixquick use proxy services to query other search engines, so your searches stay private and free of the data-mining intrigue that plagues the major search engines. The difference lies in what they query: Ixquick queries multiple search engines and returns the results with the highest average rank, while StartPage queries only Google.
Still some shortcomings remain
But, like DuckDuckGo, neither Ixquick nor StartPage is able to source your social graph, so they will never get results as closely tailored to you as Google’s. By design, they are not looking at your cookies or building their own database about you, so they won’t be able to guess your location or political views, and therefore will never skew results around those variables. On the other hand, your results will be more broadly relevant and serendipitous, saving you from the personal echo chamber that you may have found in Google.
It’s been over a month since I switched from DuckDuckGo to StartPage and so far it’s been quite good. StartPage even has a passable image and video search. I almost never go to Google anymore. In fact, I’ve used a browser plugin called Stylish to re-skin Google’s search interface with the NSA logo just as a humorous reminder that every search is being collected by multiple parties.
For that matter, I’ve used the same plugin to re-skin StartPage, since while they get high marks for privacy and search results, their interface design needs major work…but I’m just picky that way.
So, with my current setup, I’ve got StartPage as my default search engine, set in my omnibar in Firefox. Works like a charm!
As part of the redesign for the new site, the main thing that I really wanted to change in terms of the look was the front page.Based on my experience and discussions with staff about what our users look for when they arrive at the site, I had an idea of what information should be […]
After the last post, Seb got me wondering if there were any differences between libraries, archives and museums when looking at upload and comment activity in Flickr Commons in Aaron’s snapshot of the Flickr Commons metadata.
First I had to get a list of Flickr Commons organizations and classify them as either a library, museum or archive. It wasn’t always easy to pick, but you can see the result here. I lumped galleries in with museums. I also lumped historical societies in with archives. Then I wrote a script that walked around in the Redis database I already had from loading Aaron’s data.
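In practice I classified the organizations by hand, but a first pass could be automated with keyword heuristics like these (the keywords are made up for illustration; the lumping rules match what I actually did):

```python
def classify(name):
    """Guess whether a Flickr Commons organization is a library,
    museum or archive from its name. Galleries count as museums,
    historical societies as archives, matching the manual grouping."""
    n = name.lower()
    if "librar" in n:  # matches "library" and "libraries"
        return "library"
    if "museum" in n or "gallery" in n:
        return "museum"
    if "archive" in n or "historical society" in n:
        return "archive"
    return "unknown"  # these are the ones that need a human

print(classify("Oregon Historical Society"))  # prints "archive"
```

The "unknown" bucket is why it still took hand-checking: plenty of organization names give no hint at all.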
In doing this I noticed there were some Flickr Commons organizations that were missing from Aaron’s snapshot:
I didn’t do any research to see if these organizations had significant activity. Also, since there were close to a million files, I didn’t load the British Library activity yet. If there’s interest in adding them into the mix I’ll splurge for the larger ec2 instance.
Anyhow, below are the results. You can find the spreadsheet for these graphs up in Google Docs.
This was all done rather quickly, so if you notice anything odd or that looks amiss please let me know. Initially it seemed a bit strange to me that libraries, archives and museums trended so similarly in each graph, even if the volume was different.
I was in New York a couple of weeks ago, and I went to the Strand Bookstore, that multistory heaven of used and new books. I wandered around a while and got some things I’d been wanting. I wanted to read something set in New York so I looked first at Lawrence Block’s books and got The Burglar in the Closet, which opens with Bernie Rhodenbarr sitting in Gramercy Park, which I’d just passed by on the walk down, and then at Donald E. Westlake and got Get Real, the last of the Dortmunder series, and mostly set in the Lower East Side. Welcome to New York.
While I was standing near a table in the main aisle on the ground floor an older woman carrying some bags passed behind me and accidentally knocked some books to the floor. “Oh, I’m sorry, did I do that?” she said in a thick local accent. A young woman and I both leaned over to pick up the books. I was confused for a moment, because it looked like the cover had ripped, but it hadn’t, the rip was printed.
It’s a fine book, a gripping history and biography, covering in full something I only knew a tiny bit about. Seneca wrote a good amount of philosophy, including the Epistles, a series of letters full of Stoic advice to a younger friend, but the editions of his philosophy (or his tragedies) don’t go much into the details of Seneca’s life. They might mention he was a senator and advisor to Nero, and rich (as rich as a billionaire today), but then they get on to analyzing the subtleties of his thoughts on nature or equanimity.
Seneca led an incredible life: he was a senator in Rome, he was banished by the emperor Claudius on trumped-up charges of an affair with Caligula’s sister, but was later called back to Rome at the behest of Agrippina, Nero’s mother, to act as an advisor and tutor to the young man. Five years later, Agrippina poisoned Claudius, and Nero became emperor.
Seneca was very close to Nero and stayed as his advisor for years. It worked fairly well at first, but Nero was Nero. This is the main matter of the book: how Seneca, the wise Stoic, stayed close to Nero, who gradually went out of control: wild behaviour, crimes, killings, and eventually the murder of his mother Agrippina. An attempt to kill her on a boat failed, and then:
None of Seneca’s meditations on morality, Virtue, Reason, and the good life could have prepared him for this. Before him, as he entered Nero’s room, stood a frightened and enraged youth of twenty-three, his student and protégé for the past ten years. For the past five, he had allied with the princeps against his dangerous mother. Now the path he had first opened for Nero, by supporting his dalliance with Acte, had led to a botched murder and a political debacle of the first magnitude. It was too late for Seneca to detach himself. The path had to be followed to its end.
Every word Seneca wrote, every treatise he published, must be read against his presence in this room at this moment. He stood in silence for a long time, as though contemplating the choices before him. There were no good ones. When he finally spoke, it was to pass the buck to Burrus. Seneca asked whether Burrus could dispatch his Praetorians to take Agrippina’s life.
Seneca supported Nero’s matricide.
It’s impossible to match that, and other things Seneca did, with his Stoic writings, but it was all the same man. It’s a remarkable and paradoxical life.
Romm’s done a great job of writing this history. It’s full of detail (especially drawing on Tacitus), with lots of people and events to follow, but it’s all presented clearly and with a strong narrative. If you liked I, Claudius you’ll like this, and I see similar comments about House of Cards and Game of Thrones.
I especially recommend this to anyone interested in Stoicism. Thrasea Paetus is a minor figure in the book, another senator and also a Stoic, but one who acted like a Stoic should have, by opposing Nero. He was new to me. Seneca’s Stoic nephew Lucan, who wrote the epic poem The Civil War, also appears. He was friends with Nero but later took part in a conspiracy to kill the emperor. It failed, and Lucan had to commit suicide, as did Seneca, who wasn’t part of the plot.
There’s a nice chain of philosophers at the end of the book. After Nero’s death, Thrasea’s Stoic son-in-law Helvidius Priscus returns to Rome, as does the great Stoic Musonius Rufus and Demetrius the Cynic. The emperor Vespasian later banished philosophers from Rome (an action that seems very puzzling these days; I’m not sure what the modern equivalent would be), but for some reason let Musonius Rufus stay. One of his students was Epictetus, who had been a slave belonging to Epaphroditus, who in turn had been Nero’s assistant and had been with him when Nero, on the run, committed suicide—in fact, Epaphroditus helped his master by opening up the cut in his throat.
Later the Stoics were banished from Rome again, and Epictetus went to Greece and taught there. He never wrote anything himself, but one of his students, Arrian, wrote down what he said, which is why we now have the very powerful Discourses. And years later this was read by Marcus Aurelius, the Stoic emperor, a real philosopher king.
The following guest post is by Nicole Valentinuzzi, from our Stop Secret Contracts campaign partner Publish What You Fund.
A new campaign to Stop Secret Contracts, supported by the Open Knowledge Foundation, Sunlight Foundation and many other international NGOs, aims to make sure that all public contracts are made available in order to stop corruption before it starts.
As transparency campaigners ourselves, Publish What You Fund is pleased to be a supporter of this new campaign. We felt it was important to lend our voice to the call for transparency as an approach that underpins all government activity.
We campaign for more and better information about aid, because we believe that by opening development flows, we can increase the effectiveness and accountability of aid. We also believe that governments have a duty to act transparently, as they are ultimately responsible to their citizens.
This includes publishing all public contracts that governments put out for tender, from school books to sanitation systems. These publicly tendered contracts are estimated to top nearly US$ 9.5 trillion each year globally, yet many are agreed behind closed doors.
These secret contracts often lead to corruption, fraud and unaccountable outsourcing. If the basic facts about a contract aren’t made publicly available – for how much and to whom to deliver what – then it is not possible to make sure that corruption and abuses don’t happen.
But what do secret contracts have to do with aid transparency, which is what we campaign for at Publish What You Fund?
Well, consider the recent finding by the campaign that each year Africa loses nearly a quarter of its GDP to corruption…then consider what that money could have been spent on instead – things like schools, hospitals and roads.
This is money that in many cases is intended to be spent on development. It should be published – through the International Aid Transparency Initiative (IATI), for example – so that citizens can follow the money and hold governments accountable for how it is spent.
But corruption isn’t just a problem in Africa – the Stop Secret Contracts campaign estimates Europe loses an estimated €120 billion to corruption every year.
At Publish What You Fund, we tell the world’s biggest providers of development cooperation that they must publish their aid information to IATI because it is the only internationally-agreed, open data standard. Information published to IATI is available to a wide range of stakeholders for their own needs – whether people want to know about procurement, contracts, tenders or budgets. More than that, this is information that partner countries have asked for.
Governments use tax-payer money to award contracts to private companies in every sector, including development. We believe that any companies that receive public money must be subject to the same transparency requirements as governments when it comes to the goods and services they deliver.
Greater transparency and clearer understanding of the funds that are being disbursed by governments or corporates to deliver public services can only be helpful in building trust and supporting accountability to citizens. Whether it is open aid or open contracts, we need to get the information out of the hands of governments and into the hands of citizens.
Ultimately for us, the question remains how transparency will improve aid – and open contracts are another piece of the aid effectiveness puzzle. Giving citizens full and open access to public contracts is a crucial first step in increasing global transparency. Sign the petition now to call on world leaders to make this happen.
FCC Chairman Tom Wheeler will speak at the IMLS hearing.
On Thursday, April 17, 2014, from 9:30–11:30 a.m., leaders from the American Library Association (ALA) will participate in “Libraries and Broadband: Urgency and Impact,” a public hearing hosted by the Institute for Museum and Library Services (IMLS) that will explore the need for high-speed broadband in American libraries. Larra Clark, director of the ALA Program on Networks, and Linda Lord, ALA E-rate Task Force Chair and Maine State Librarian, will present on two panels.
Federal Communications Commission Chairman Thomas Wheeler will make opening remarks at the hearing, and expert panelists from across the library, technology, and public policy spectrum will explore the issue of high-speed broadband in America’s libraries. IMLS Director Susan H. Hildreth will chair the hearing along with members of the National Museum Services Board, including Christie Pearson Brandau of Iowa, Charles Benton of Illinois, Winston Tabb of Maryland, and Carla Hayden, also of Maryland.
Interested participants may register to attend the event in person at D.C.’s Martin Luther King Jr. Memorial Library. Alternatively, participants can tune into the event virtually, as IMLS will stream the hearing live on YouTube or Google+. Library staff may also participate by submitting written comments sharing their successes, challenges or other input related to library broadband access and use into the hearing record on or before April 24, 2014. Each comment must include the author’s name and organizational affiliation, if any, and be sent to email@example.com. Guidance for submitting testimony is available here (pdf).
Toronto Public Library has introduced a new service that allows customers to download or stream a wide variety of music and video content. With a library card, customers can access music albums from a wide variety of genres, movies, educational television and documentaries. More information is available at tpl.ca/hoopla.
“We’re happy to now offer customers a great selection of music and videos that they can easily stream or download. E-content is our fastest area of growth, with customers borrowing more than 2 million ebooks, eaudio-books and emagazines in 2013. We expect we’ll see even more growth this year with the introduction of online music and video,” said Vickery Bowles, Director of Collections Management at Toronto Public Library.
With just a library card, customers can listen to a wide selection of music albums and watch a variety of video content. Content may be borrowed via a browser, smartphone or tablet and instantly streamed or downloaded with no waiting lists or late fees. Customers may borrow up to five items per month.
Seems like a very nice service. I’m happy to see my local library system working to get more streaming media to people in Toronto. I’m unhappy with the privacy implications of this, however. (As is Kate Johnson, a professor at the library school at the University of Western Ontario, who’s interviewed in that video clip: she raises the privacy question, but the reporter completely drops the issue). Here are my speculations based on a brief examination of what I see.
None of that bothered me particularly, so I went to sign up for an account to try it out. This is the third step in the process:
“Enter your library card number,” it says. “If your library gave you a PIN to use with your library card, please enter it.” I have a PIN, but I stopped here. (I don’t know what happens to people without a PIN; I’d guess they’re asked to set one up.)
Certainly Hoopla needs to be sure that anyone claiming to be a Toronto Public Library user actually is. But it looks like they’re doing it by asking the user for their library card number and password and then asking TPL if that is a valid account.
This is not right. There’s no need for any third party to know my library card number. OAuth would be a better way to do it: as it says, it’s “an open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications.” This is what they say to anyone offering services online: “If you’re storing protected data on your users' behalf, they shouldn’t be spreading their passwords around the web to get access to it. Use OAuth to give your users access to their data while protecting their account credentials.”
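For the curious, the start of an OAuth-style flow looks roughly like this (the endpoint, client id, redirect URI and scope are all hypothetical): the user is sent to the library’s own authorization page, logs in there, and the third party only ever sees a token, never the card number or PIN.

```python
from urllib.parse import urlencode

def authorization_url(auth_endpoint, client_id, redirect_uri, scope):
    """Build the URL the user is redirected to at the start of an
    OAuth authorization-code flow. The user authenticates with the
    library directly; the third party never handles the credentials."""
    params = {
        "response_type": "code",   # ask for an authorization code
        "client_id": client_id,    # identifies the third-party app
        "redirect_uri": redirect_uri,  # where the user is sent back to
        "scope": scope,            # what access is being requested
    }
    return auth_endpoint + "?" + urlencode(params)

url = authorization_url(
    "https://auth.library.example/authorize",  # hypothetical library endpoint
    "hoopla-app", "https://app.example/callback", "borrow")
```

After the user approves, the app exchanges the returned code for a token, and the library card number never leaves the library’s own site.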
Who’s behind Hoopla, anyway? It’s a service run by Midwest Tape, who on their Twitter account say “Midwest Tape is a full service DVD, Blu-ray, music CD, audiobook, and Playaway distributor, conducting business exclusively with public libraries since 1989.” They’re run out of Holland, Ohio, in the United States.
I suspect this means the Toronto Public Library is offering a service that requires users to give their library card number and password to an American company that will store it on American servers, which means the data is available to the US government through the PATRIOT Act. (Of course, we also need to assume that all library data can be accessed by our spy agencies, but we need to do what we can.)
I may be wrong. I’ll ask Hoopla and TPL and update this with what I find.
I’m on my way to a meeting at Intersect about the next phase of the Cr8it data packaging and publishing project. Cr8it is an ownCloud plugin, and ownCloud is widely regarded as THE open source dropbox-like service, but it is not without its problems.
Dropbox has been a huge hit, a killer app with what I call powers to "Share, Sync & See". Syncing between devices, including mobile (where it’s not really syncing) is what made Dropbox so pervasive, giving us a distributed file-system with almost frictionless sharing via emailed requests, with easy signup for new users. The see part refers to the fact that you can look at your stuff via the web too. And there is a growing ecosystem of apps that can use Dropbox as an underlying distributed filesystem.
ownCloud is (amongst other things) an open source alternative to Dropbox.com’s file-sync service. A number of institutions and service providers in the academic world are now looking at it because it promises some of the killer-app qualities of Dropbox in an open source form, meaning that, if all goes well, it can be used to manage research data, on local or cloud infrastructure, at scale, with the ease of use and virality of Dropbox. If all goes well.
There are a few reasons Dropbox and other commercial services are not great for a university:
We need to be able to control where data are stored and have the flexibility to bring data close to large facilities. This is why CERN have the largest ownCloud test lab in the world, or so I’ve heard.
It is important to be able to write applications such as Cr8it without being beholden to a company like Dropbox.com, Apple, Google or Microsoft, who can approve or deny access to their APIs at their pleasure, and can change or drop the underlying product. (Google seem to pose a particular risk in this department; they play fast and loose with products like Google Docs, dumping features when it suits them.)
But ownCloud has some problems. The ownCloud forum is full of people saying, "tried this out for my company/workgroup/school. Showed promise but there’s too many bugs. Bye." At UWS eResearch we have been using it more or less successfully for several months, and have experienced some fairly major issues to do with case-sensitivity and other incompatibilities between various file systems on Windows, OS X and Linux.
From my point of view as an eResearch manager, I’d like to see the emphasis at ownCloud be on getting the core share-sync-see stuff working, and then on getting a framework in place to support plugins in a robust way.
Last week, the first version of ownCloud Documents was released as part of ownCloud 6. This incorporates a subset of editing features from the upstream WebODF project that is considered stable and well-tested enough for collaborative editing.
We tried this editor at eResearch UWS as a shared scratchpad in a strategy session, and it was a complete disaster: our browsers kept losing contact with the document, and when we tried to copy-paste the text to safety it turned out that copying text is not supported. In the end we had to rescue our content by copying HTML out of the browser and stripping out the tags.
In my opinion, ownCloud is not going to reach its potential while the focus remains on getting shiny new stuff out all the time. Far from making ownCloud shine, every broken app like this editor tarnishes its reputation substantially. By all means release these things for people to play with, but the ownCloud team needs to have a very hard think about what they mean by "stable and well tested".
Along with others I’ve talked to in eResearch, I’d like to see work at owncloud.com focus on:
Define sync behaviour in detail, complete with automated tests, and have a community-wide push to get the ongoing sync problems sorted. For example, fix this bug reported by a former member of my team, along with several others to do with differences between file systems.
Create a standard way to generate and store file derivatives such as image thumbnails or HTML document previews, as well as additional file metadata. At the moment plugins are left to their own devices, so there is no way for apps to reliably access each other’s data. I have put together a simple alpha-quality framework for generating web views of things via the file system, Of the Web, but I’d really like to be able to hook it into ownCloud properly.
Get the search onto a single index rather than the current approach of having an index per user. Something like Elasticsearch, Solr or Lucene could easily handle a single metadata-and-text index with information about sharing, with changes to files on the server fed to the indexer as they happen.
[Update 2014-04-11] Get the sync client to handle connecting to multiple ownCloud servers. In academia we will definitely have researchers wanting to use more than one service, e.g. AARNet’s Cloudstor+ and an institutional ownCloud. (Not to mention proper Dropbox integration.)
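To illustrate the single-index idea in the list above, here is a toy in-memory stand-in. A real deployment would use Elasticsearch or Solr; the field names and the shape of the change feed here are my assumptions, not ownCloud's actual internals.

```python
# Toy single metadata-and-text index for all users, standing in for an
# Elasticsearch/Solr index. Field names and change-feed shape are assumed.
class SingleIndex:
    def __init__(self):
        self.docs = {}  # path -> {"owner", "shared_with", "text"}

    def on_file_changed(self, path, owner, text, shared_with=()):
        # Fed from the server as changes happen, rather than per-user crawls.
        self.docs[path] = {
            "owner": owner,
            "shared_with": set(shared_with),
            "text": text,
        }

    def search(self, user, term):
        # One query over one index; sharing info filters visibility.
        return [p for p, d in self.docs.items()
                if term.lower() in d["text"].lower()
                and (d["owner"] == user or user in d["shared_with"])]

idx = SingleIndex()
idx.on_file_changed("/alice/report.txt", "alice",
                    "quarterly sync report", shared_with={"bob"})
idx.on_file_changed("/alice/private.txt", "alice", "sync notes")
print(idx.search("bob", "sync"))  # bob sees only the file shared with him
```

The design point is that sharing becomes a filter at query time over one index, instead of maintaining and querying an index per user.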
Several of us here at OCLC have spent considerable time over the last decade trying to pull bibliographic records into work clusters. Lately we've been making considerable progress along these lines and thought it would be worth sharing some of the results.
Probably our biggest accomplishment is that the work we have done to refine the worksets is now visible in WorldCat.org (as well as in an experimental view of the works as RDF). This is a big step for us, involving a number of people in research, development and production. In addition to making the new work clusters visible in WorldCat, this gives us in Research the opportunity to use the same work IDs in other services such as Classify. We also expect to move the production work IDs into services such as WorldCat Identities.
One of the numbers we keep track of is the ratio of records to works. When we first started, the record-to-work ratio was something like 1.2:1; that is, every work cluster averaged 1.2 records. The ratio is now close to 1.6:1, and for the first time the majority of records in WorldCat are now in work clusters with other records, primarily because of better matching.
Of records that have at least one match, we find the average workset size is 3.9 records. In terms of holdings we have 10.6 holdings/workset and over 43 holdings/non-singleton workset (worksets with more than one record). Another way to look at this is that 84% of WorldCat's holdings are in non-singleton worksets, and over 1.5 billion of WorldCat's 2.1 billion holdings are in worksets of 3 or more records, so collecting them together has a big impact on many displays.
As the worksets become larger and more reliable we are finding many uses for them, not the least in improving the work-level clustering itself. We find the clustering helps find variations in names, which in turn helps find title variations. We are also learning how to connect our manifestation and expression level clustering with our work-level algorithms, improving both. The Multilingual WorldCat work reported here is also an exciting development growing out of this.
There is still more to do of course. One of our latest approaches is to build on the Multilingual WorldCat work by creating new authority records in the background that can be used to guide the automated creation of authority records from WorldCat, which in turn help generate better clusters. We are applying this technique first to problem works such as Twain's Adventures of Huckleberry Finn and his Adventures of Tom Sawyer, which are published together so often and cataloged in so many ways that it is difficult to separate the two. These generated title authority records are starting to show up in VIAF as 'xR' records.
So, we've been working on this off and on for a decade, but WorldCat and our computational capabilities have changed dramatically and it still seems like a fresh problem to us as we pull in VIAF to help and use matching techniques that just would not have been feasible a decade ago.
While many of us, both in and out of OCLC Research, have worked on this over the years, no one has done more than Jenny Toves who both designs and implements the matching code.
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
pae·an: a joyous song or hymn of praise, tribute, thanksgiving, or triumph – Merriam-Webster
I returned from the Code4Lib Conference recently chock-full of things I want to investigate. I was also reminded about just how much I love the Unix filesystem. Yes, really.
I’ve long thought that the simplest solution to a problem is often the best. That is, complicated solutions tend to have more things that can go wrong. Plus they can be more difficult to learn, manage, and replace. That is why I’ve developed quite a bit of skepticism towards throwing databases at every problem.
I remember the time I was backing up my server and neglected to dump the database behind one of my web sites. Yes, I really did. And yes, I had to rebuild it from scratch when I went to restore from the backup and realized my error. Running a tar job on the directory is SO inadequate when you lack the database that is required to make sense of it all. Sure, that is a stupid mistake, I’ll admit, but it would have been so much easier and less complicated had the site all been sitting on the filesystem.
And the thing is, often it can. Many web sites that are supported by a database like MySQL don’t really need a database at all. Mostly all they need is a way to search. And you don’t need a database for that. For that, all you need is an index. There are a lot of options out there for indexing, from the simple (such as my go-to favorite Swish-e) to the more complex (for example, XTF or Solr, which both support some very sophisticated sites).
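As a sketch of the filesystem-plus-index idea, here is a toy stdlib-only indexer. In practice you would reach for a real tool like Swish-e or Solr; this stand-in just shows that "search over plain files" needs nothing more than a derived word-to-path map.

```python
# Toy inverted index over a directory tree: walk the files, map each word
# to the paths containing it. A stand-in for a real indexer like Swish-e.
import os
import re
from collections import defaultdict

def build_index(root):
    index = defaultdict(set)  # word -> set of file paths
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    for word in re.findall(r"\w+", fh.read().lower()):
                        index[word].add(path)
            except OSError:
                pass  # unreadable file: skip it, the site still works

    return index

def search(index, term):
    # Case-insensitive lookup; returns matching paths in a stable order.
    return sorted(index.get(term.lower(), ()))
```

The files stay plain and transparent on disk; the index is derived data you can throw away and rebuild from the filesystem at any time.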
Some benefits of relying on the filesystem include:
A tried and true technology that is as old as time. Well, maybe not time, but you get the idea. Filesystem technology has been around as long as there have been computers. You can take the word of an old-timer on that.
Drop-dead easy backup. Tar up the directory tree, gzip it, and throw the file on something else. Done and done.
Complete transparency. If you want to see a file, just look at it. You don’t have to figure out some complicated SQL query to pull something back out. It’s sitting right there where you can see it.
Slower obsolescence. Filesystems age at the rate of mountains. Databases age at the rate of flowers. Pick one to rely upon. No, seriously.
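The drop-dead-easy backup from the list above can be sketched in a few lines of standard-library Python; the paths and names here are illustrative.

```python
# Back up a site that lives entirely on the filesystem: one gzipped
# tarball captures everything. Paths and prefix are illustrative.
import tarfile
import time

def backup_tree(src_dir, dest_prefix):
    stamp = time.strftime("%Y%m%d")
    out = f"{dest_prefix}-{stamp}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:  # tar + gzip in one step
        tar.add(src_dir, arcname="site")    # store the tree under "site/"
    return out
```

Copy the resulting file somewhere else and you are done; restoring the whole site, content and structure, is just unpacking it.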
I understand that a number of open source applications such as Drupal and Omeka have made it relatively easy for people to set up a web site that needs to support a variety of user interactions by using the classic stack of Linux/Apache/MySQL/PHP. That is a good thing, and I support it. All I’m saying is that not everything needs to go into such a stack, and using that method comes with real consequences that should be understood from the beginning.
So what does this all mean? For me, it means that I will think long and hard before I set up another instance of the classic stack. I’m actually totally cool with that stack if you remove MySQL. I really don’t want to be a database administrator. I don’t. Just give me the filesystem and a decent indexer. That’s frequently all I need. And it may be all you need for at least some projects. Because, you know, the filesystem rocks.
With all that’s happened in Washington in the past year—threats to eliminate the federal agency that administers funding to libraries, legislation to stifle open access and the government shutdown—now is the time, more than ever, to stand up for libraries. If you appreciate the critical roles that libraries play in creating an informed and engaged citizenry, register now for this year’s National Library Legislative Day (NLLD), a two-day advocacy event where hundreds of library supporters, leaders and patrons will meet with their legislators to advocate for library funding.
May 5 & 6, 2014
National Library Legislative Day, which is hosted by the American Library Association (ALA), will be held May 5-6, 2014, in Washington, D.C. Now in its 40th year, National Library Legislative Day focuses on the need to fund the Library Services and Technology Act, support legislation that gives people who use libraries access to federally-funded scholarly journal articles and continue funding that provides school libraries with vital materials.
As part of the event, participants will receive training and briefings to prepare them for meetings with their members of Congress. Participants who register for National Library Legislative Day will connect with their state’s coordinator, who then arranges the meetings with legislators, communicates with the ALA Washington Office and serves as the contact person for the state delegation.
Advocate from Home
Advocates who cannot travel to Washington for National Library Legislative Day can still make a difference and speak up for libraries. As an alternative, the American Library Association sponsors Virtual Library Legislative Day, which takes place on May 6, 2014. To participate in Virtual Library Legislative Day, register now for American Library Association policy action alerts.
For the next month, the ALA Washington Office will share National Library Legislative Day resources on the District Dispatch. Keep up with the conversation by using the hashtag #nlld14.
In Brief: In recent years, student staff have become essential to the success of library operations, particularly within higher education. Student library employment offers a unique opportunity for students to integrate library-specific knowledge and skills with their academic and personal development. This article will discuss the importance of developing an integrated student staff development approach.
There is an old Peanuts cartoon in which Lucy derisively comments on Linus’s desire to become a doctor, focusing particularly on the fact that Linus could never be a doctor because he doesn’t love mankind. In the last panel, Linus yells in protest, “I love mankind, it’s just the people I can’t stand.”
In April 2012, a colleague and I attended a local consortia meeting. During a post-lunch panel of various academic librarians, the discussion turned to student staff. A particular librarian commented negatively on the abilities of their library’s student staff, indicating that only the librarians were doing real library work. This feeling seemed to be shared, to some degree, by other librarians in the room. While this did not sit right with my colleague or me, to our own failing neither of us responded. On the drive home we began to flesh out what exactly bothered us about the comment as well as our own lack of response. We concluded that if librarians are not happy with the performance of their student staff, then the fault lies with the librarians. This conversation drove us to re-work our entire student staff approach.
It is quite easy to take Linus’s response and tweak it to fit attitudes that we as librarians can hold: “I love mankind, it’s just the patrons, or this patron, I can’t stand.” “I love mankind, it’s just the volunteer staff I can’t stand.” “I love mankind, it’s just the student staff I can’t stand.” Whether these sentiments are stated aloud in a consortia meeting or kept locked in one’s thoughts, they are going to impact the ways in which we as library staff relate to student staff. The foundational impact of student staff on the day-to-day functioning of the library cannot be overstated. “Without the student workers the library could not remain open as long; costs for staffing the circulation desk would increase; document delivery and interlibrary loan services would take too long; materials would not be re-shelved in a timely manner; and processing new books would be slowed.”1 Recognizing these tasks are essential for library success is to also recognize reliance on student staff performing those tasks.
Recognizing the Role of Student Staff
Reliance on student staff has significantly increased in recent years. Consider that in the 1950s, professional librarians comprised 50 to 90 percent of the staff in college and university libraries. By the late 1980s, student staff members outnumbered librarians by a ratio of two to one.2 During the 1990s, libraries passed the point where students were viewed merely as a “…labor reserve for the monotonous and repetitive tasks that are necessary for successful library operation.”3 This is particularly true in higher education, where the library is often perceived as a desirable place to work. The increased number of student staff in conjunction with the learning environment engendered by a collegiate atmosphere provides a unique opportunity; namely, “…library employment would seem to provide students with the opportunity to apply what they learn on the job to their academic studies.”4 The library as an employer is uniquely poised to help student staff synthesize a variety of skills due to the eclectic skill set that library work can require. “It is professional staff members’ responsibility to provide student employees with an opportunity for involvement that is both meaningful and educational while assisting them in becoming successful members of an increasingly global society.”5
What does it mean to provide involvement that is meaningful and educational? For student staff, working in the library should not be disconnected from other areas of life and study. Library employment is another avenue to support students as they work to integrate academic, professional and personal skill sets. In order to support this integration, the library must create developmental and assessment processes that will deliberately engage the student staff members, recognizing that “…work that is more firmly linked to academically purposeful behaviors and conditions would presumably have greater positive effects for the students.”6
In order to hire and train student staff effectively, libraries need to establish comprehensive and structured hiring and training processes. The specifics of these processes are outside of this article’s focus. However, there are many resources in library scholarship and trade publications which can provide assistance in developing robust hiring and training procedures.7
As colleges and universities work to develop successful and well-rounded students, there is particular focus on deliberately linking the student’s learning in the classroom with their experience outside of it. One of the areas highlighted to help students realize success is that of on-campus employment. If the academic library is to lead change in higher education, academic librarians should be working to develop processes that incorporate student development as part of the broader learning experience. To an extent, libraries and librarians are already doing this. I have had two opportunities over the last year to teach an online Continuing Education Unit on student staff development. In those units I have interacted with librarians across the country who are actively thinking, working, and wrestling with the development and assessment of their student staff. The mutual benefit of these interactions came through the discussions and exchanges of meaningful and educational ways to improve how we are working with our student staff. As a profession we must be more deliberately active in our approach to these areas of student staff development and assessment. This article will argue for an integrated approach to student staff training that works to tie student staff development to other areas of students’ growth and development during their time as undergraduates. Examples from the library literature and my own library will be examined. In the conclusion I will provide what I think are some good beginning steps to this process of integration.
Many of the articles or books dealing with supervising student staff deal with development in one of two categories: professional staff development or developing student staff skills as related to library positions.8 Focus on both of these areas is essential and should continue. Typically “development training usually refers to long-term growth: training to improve performance…For student employees development training is usually limited to preparing them for supervisory duties within the department…”9 There’s nothing wrong with this. It is part of our job as professional librarians to improve our libraries. But to view development solely in this light impoverishes the library’s potential to challenge and grow its student staff. Developmental training should be intentionally designed with particular opportunities and planned tasks that allow student staff to practice and work on their own problem-solving, evaluative, and critical thinking skills. Several examples of how I’ve tried to realize this in my library will be discussed below.
An essential part of development is regular assessment. “A library that recognizes the need for and benefits of assessment of performance and service presents rewarding opportunities for staff to become more engaged in their work and to identify more strongly with the library’s mission and goals.”10 The goal of assessment is simple: holding student staff accountable to the work they are supposed to be doing, with the expectation of a particular level of quality. In order for assessment to succeed, clearly communicated expectations and requirements must be in place to communicate what successful work in the library looks like. Much of the assessment writing on libraries focuses on how the library is performing within the institution. This focus does not necessarily include how specific subsets of the library staff are helping the library meet institutional goals. “Student employment is an important service provided by libraries and it should be included in library assessment plans.”11 Some practical examples of assessment of student staff will be provided later in this article.
Development Requires Care
Practically speaking, how does the library become a place that provides space for students to exercise their skills in their roles in the library? As librarians we need to care about all of our staff, student or otherwise. Linus reaches an important truth in his response, as it is much easier to care for mankind in the abstract than in the messy day-to-day negotiations of human relationships. It is impossible to develop your student staff if you don’t care about them. Wendell Berry sums this up nicely: “I think that the ideal of loving your neighbor has to take on the possibility that he may be somebody you’re going to have great difficulty loving or liking or even tolerating.”12 This is not to say that there are not consequences for mistakes or that students should not ever be fired or released from library employment. Rather the library’s approach to its student staff should recognize that students are in a process of maturation and growth. By employing them, the library has the opportunity to positively participate in those processes.
Students in general are a pretty fascinating bunch. Cultivating care for your staff gives you the opportunity to get to know them at an individual level. Getting to know your staff requires spending consistent amounts of time with them, using that time (staff meetings, periodic evaluations, interactions during shifts) to learn more about their strengths and how students might bring those strengths to bear on their staff roles. For example, this past fall my library implemented LibGuides and was in the process of trying to figure out ways of highlighting the library’s curriculum manipulatives.13 The morning shift supervisor at the time was a graduate student who had some experience with photography. We had talked about her various photo shoots and efforts to start a website, so her interest in photography was something we chatted about semi-regularly. I do not exactly remember how the idea came up, but through our conversation about how to best use LibGuides we came up with the idea of holding a photo-shoot for the curriculum manipulatives. We improvised a backdrop and she artistically arranged the different elements of the manipulatives to highlight their usage. She then uploaded the images to the LibGuide, along with the item’s description, to allow library users to see exactly what the manipulatives look like. She and I had to work through some issues of communication and expectation together, but the end product turned out well. This type of project is meaningful and provided a significant contribution to the library. Caring about student staff is the first step to planning and allowing for meaningful work that contributes to the library’s ability to provide information resources to the campus community.
Development Requires Flexibility and Time
In addition to caring, flexibility and time are needed to avoid a Linus-like response to disliking people due to spending time with them. For example, I have found it to be extremely helpful to view student staff training as ongoing and not as a one-time or first-semester approach. Training is an iterative process that may have to occur in the middle of whatever work I’m doing, requiring me to be flexible and responsive to student staff needs. “Training does not end with instructions. It must include the supervisor’s setting an example of the work ethic encouraged by the library culture, and of the sense of fair play, encompassing both positive and negative feedback, that each library promotes for its employees.”14 Without time to invest there will be no student staff development. There needs to be time to plan, time to prepare, and time to spend with your student staff as well as time to show that you care. This needs to be planned into your schedule. Otherwise, unplanned expenditures of time with your student staff are going to seem like interruptions and hassles. There should also be time given, within reason, for student staff to develop into their roles. Students are not sea monkeys that hit the water and start swimming and growing. My library currently has student staff in their second or third year of library employment who had rough beginnings but are now some of the library’s most valued employees. A particular senior student, early in her library employment, often missed meetings, was flustered easily by patron questions, and lacked confidence in her library role. She was given time to develop in her library position and has taken on leadership roles within the library. I can confidently assign her complex tasks with basic instructions, being sure that she will proceed as far as she is able, attempt some problem-solving, and contact me with any questions.
As she is majoring in communications, she took ownership of updating the student staff handbook, allowing her to utilize skills and knowledge gleaned from her major to benefit her library role.
Practically Applying Assessment
The library staff responsible for supervising students need to communicate a shared standard of what a successful staff member looks like. Assessment is an integral part of this communication process. There are two levels on which assessment should occur. Assessment serves to examine quality of the job performed as well as the individual performing it. One of the articles that was particularly helpful to the overhaul of my library’s student staff development process was “Gone Fishing” by Carol Anne Chouteau and Mary Heinzman, in which they narrate the process by which they sought to motivate as well as assess student shelving.15 The authors used paper fish to help motivate, train, and track student staff as they shelved books. We derived our own approach from this article. Instead of fish, we use approximately 250 8” tall die-cut owls, cut out of bright yellow paper and laminated. These owls reside in a box at the circulation desk. When books are returned, the student staff member writes their initials on an owl with a dry erase marker and after shelving the book places an owl to the left of that book. I (or the library’s part-time staff member) will, throughout the day, review the stacks to pull the owls. We keep track of the total number of owls shelved as well as mistakes. As a result, the precision of student staff shelving has improved. This process also highlights any consistent shelving issues and allows us to meet directly with the student to address them. The student staff member and I can walk back to the shelf, examine the issue, and they can fix it. This provides direct evaluation and ownership of the shelving process and gives opportunity for praise and recognition of students who are doing exemplary work.
Assessment should also focus on the individual. If working in the library is to contribute to student development, then individual assessment is necessary to communicate how a student can grow in character as well as in skills. In my library we have adopted a rubric-based approach taken from Linda Lemery’s article “Student Assistant Management: Using an Evaluation Rubric”.16 The hardest aspect I’ve found in the rubric-based approach is presenting it to the student staff in a way that they can retain the categories and expectations without the rubric being perceived as onerous. The rubric is used to clearly state what is expected from the student staff and staff supervisors.17
In order to set a baseline of expectations, I meet with each student at the beginning of the year to set goals for the year. We discuss the strengths that they bring to the library and some areas of growth that they can focus on for the upcoming year. We also meet at the end of the year to review their progress. At that point the possibility of continued library employment is also considered. We also conduct regular staff meetings, typically occurring once a month throughout the semester. This helps to keep staff on the same page and offers an opportunity to address any questions or staff-wide trainings that need to be accomplished. Student schedules can pose some difficulty. I will follow up directly with students who miss the meetings, using Doodle to help in the scheduling process.
In our staff meetings, because not all of the students work together, we play a modified version of Cranium, breaking up the students into teams. Having students hum, draw, or act while trying to beat the clock or the other teams has been one of the most helpful aspects of establishing the feeling of a team and sense of cohesion and unity. That being said, a recent reduction in my library from two professional librarians to one has added a layer of difficulty. There is less time to spend with student staff, the extent of training has suffered, and team meetings have been sporadically scheduled.
Example One: Building Stacks
What does the application of integrated student development look like in real life? I have two examples that illustrate ways of helping student staff connect their learning outside of the library with the successful completion of library tasks. We recently updated the layout of our curriculum lab, which required book shifting and stack adjustment. There are two particular staff who share two evening shifts during the week. I took this opportunity to hand the specific project of adjusting the stacks to these two student staff individuals. Before the stacks could be built, the shelves had to be emptied of books and removed. The two students did a good job of removing the books in such a way as to allow them to still be largely usable while they completed the stack adjustment project. Granted, they missed a few things in the shelf re-building process that we had to go back and fix together. I might have been able to bypass this, but I wanted to give them an opportunity to practice some of the mechanical and problem-solving skills I had observed. I had a fair amount of confidence in their abilities but wanted to confirm that they could work together, problem-solve effectively, and inform me of any issues. Library employment should offer students the opportunity to experiment with solutions to various issues. The development process is not clear-cut or a step-by-step program to success. Evaluating and assessing are not in place just to tell the student staff whether or not they are hitting the mark, but also to highlight accomplished work so that the value of that work can be recognized. “Student success is promoted by setting and holding students to standards that stretch them to perform at high levels, inside and outside the classroom.”18
Example 2: Video Project
A second example of how a library can work to develop its student staff can be found in projects that are not explicitly related to library employment. This semester a student staff member and I are working together on a series of short videos featuring professors from around the school talking about books they enjoy. The video series was the student’s idea. We tossed the idea back and forth, developed a loose script, emailed a handful of professors, and dove into the project. As the project continues, I contact the professors, the student staff member oversees the shooting and editing, and we collaborate on the other details. This takes time, time I could be spending on other library work. However, a project like this not only benefits the library but also gives this student a chance to hone his interests, abilities, and skills as a filmmaker while crafting some great short videos. He is also working with our campus videographer on lighting, graphics, and layout, so there is a level of interdepartmental interaction and support. I deliberately try to make sure I’m not taking over the project. As questions about direction, shot angles, time limits, and so on come up, I make a conscious effort to push those questions back to him so that he is responsible for the final decision.
The idea for this video series developed because this student works in the library. If he had not been hired, we would not have crossed paths. While the planning, shooting and editing of the videos are outside of his regular library employment, the library has provided a platform from which he can grow this particular skill set. Additionally, these videos will serve as helpful marketing tools for the library. Creating the videos has also been very fun. It has provided interaction with professors on another level, helping them to remember the value of the library for students in their classes and their particular discipline.
One of the hardest parts of student staff development, in my mind, is transition. Student staff are eventually going to leave. They graduate, transfer, or find other employment. There needs to be mental preparation for this because, whether you realize it or not, you most likely have an expectation for the work that was done and now need to communicate that expectation to the individual who is going to fill the departing student’s shoes. For example, this fall my library had to hire for a maintenance position. This student staff member is responsible for emptying trash, filling the printers, changing light bulbs, etc. The student in that position and I work together to address the physical plant issues in the library. The previous student was fantastic. I relied on his responsibility and initiative. He was extremely consistent and followed through with each task. He graduated, and thus a replacement needed to be hired. About two weeks into the semester, tasks were going unfinished, and I realized that I had communicated the requirements of the job but not the expectations. I sat down with the new hire, and together we worked out a schedule and set expectations for how and when he was going to get his work done. It is very easy to expect new workers to simply be clones of previous excellent workers. Instead, student staff need to be held to an objective set of requirements that is clearly presented to them. This is why a rubric-based approach is especially helpful: it makes students aware of our expectations.
I am not writing this article because I believe my library has the best student staff development approach. If you ever visited, you would find a competent and effective staff, but we are not without issues. Student staff development is not about creating a perfect student staff but rather about helping students develop an integrated, holistic view of their work and education, so that they are better equipped for whatever they end up doing after college. However, “…while the vision and potential of collaborative learning are enticing, the reality of implementation is much more challenging.”19 Realizing student staff development as collaboration with student learning is hard work, and there is no silver bullet to ensure success. Still, I do believe that “supervising student staff is an amazing, exhausting and exhilarating experience.”20 I strive to operate with the assumption that my student staff are fantastic, and I try to demonstrate that through my interactions with them, the tasks they are given, and the way the hiring, training, development, and assessment processes are conducted.
This article is not meant to be merely illustrative of what one library is doing. As a profession we need to add the topic of student staff development to the conversations we are already having about the library’s future role in academia and public life. We need to recognize the value that student staff bring to their library positions. Recognizing that value will change how we talk about our student staff and how we talk with them. What are your student staff majoring in? What are they good at? What do they enjoy doing? How does what makes your student staff members interesting and unique contribute to the library’s impact on campus? Let’s collectively evaluate our current student staff development processes to determine how well they integrate with students’ learning outside of library employment. If a library does not have concrete, evaluative processes for student staff in place, those need to be established. We need to consider student staff development as something that not only improves our libraries but is also significant in the holistic development of the library’s student staff. By taking these steps, we will realize the value of our student staff, the value of the work they do, and, ultimately, the value of the library.
Some of these conversations and discussions are already happening, but on a limited scale. I understand that this can be a sensitive area for a librarian to discuss. In sharing what you are doing with your student staff, you may feel as though you are stating, “I have arrived and my student staff are flawless!” We should not wait for our student staff to reach perfection before we start sharing our processes and ideas with each other. The comment section of this article is a great place to start. I look forward to the discussion.
My deep and sincere thanks to the eminently capable Lead Pipe editors (Erin, Ellie, Emily, and Hugh) who gave copious insight and detailed feedback to direct and guide this article. My thanks as well to Josh Michael, external editor, for his erudite input and our time together as colleagues.
Choutea, Carol Anne; Mary Heinzman. “Gone Fishing: Using the FISH! Business Model to Motivate Student Workers.” Technical Services Quarterly Vol. 24, No. 3, 2007. Pp. 41-49.
Jacobson, Heather A., Shuyler, Kristen S. “Student perceptions of academic and social effects of working in a university library”. Reference Services Review Vol. 41 No. 3, 2013.
Kuh, George D., Jillian Kinzie, et al. Student Success in College: Creating Conditions that Matter. Jossey-Bass, San Francisco. 2005.
Lemery, Linda D. “Student Assistant Management: Using an Evaluation Rubric.” College & Undergraduate Libraries, Vol. 15 (4), 2008. Pp. 451-462.
Perozzi, Brett. Enhancing Student Learning Through College Employment. Dog Ear Publishing, Bloomington, IN, 2009.
Slagell, Jeff; Langendorfer, Jeanne M. “Don’t Tread on Me: The Art of Supervising Student Assistants.” The Serials Librarian Vol. 44, Nos. 3-4, 2003. Pp. 279-284.
P. 148. Maxey-Harris, Charlene; Cross, Jeanne; McFarland, Thomas. “Student Workers: The Untapped Resource for Library Professions.” Library Trends 59, Nos. 1-2, 2010.
P. 635. Stanfield, Andrea G.; Palmer, Russell L. “Peer-ing into the information commons: Making the most of student assistants in new library spaces.” Reference Services Review Vol. 38, No. 4, 2011.
P. 87. Clark, Charlene K. “Motivating and Rewarding Student Workers.” Journal of Library Administration 21, No. 3/4, 1995.
P. 547. Jacobson, Heather A.; Shuyler, Kristen S. “Student perceptions of academic and social effects of working in a university library.” Reference Services Review 41, No. 3, 2013.
P. 199. Scrogham, Eve; McGuire, Sara Punksy. “Orientation Training and Development” in Perozzi, Brett (Ed.), Enhancing Student Learning Through College Employment. Dog Ear Publishing, Bloomington, IN, 2009.
P. 549. Jacobson, Heather A., Shuyler, Kristen S. “Student perceptions of academic and social effects of working in a university library.” Reference Services Review 41, no. 3, 2013.
For a brief list please see Richard McKay’s “Inspired Hiring: Tools for Success in Interviewing and Hiring Student Staff.” Library Administration & Management 20, no. 3, 2006: 128-134. See also the beginning of Nora Murphy’s “When the Resources Are Human: Managing Staff, Students, and Ourselves.” Journal of Archival Organization 7, no. 1-2, 66-73. See also David Baldwin and Daniel Barkley’s Supervisors of Student Employees in Today’s Academic Libraries. Libraries Unlimited, 2007.
For professional staff development see as example Elaine Z. Jennerich’s “The long-term view of library staff development.” College and Research Library News 67, no. 10. 2006: 612-614.
P. 170. Baldwin, David; Barkley, Daniel. Supervisors of Student Employees in Today’s Academic Libraries. Libraries Unlimited, 2007.
P. 156. Oltmanns, Gail V. “Organization and Staff Renewal using Assessment.” Library Trends 53, No. 1, Summer 2004.
P. 560. Jacobson, Heather A.; Shuyler, Kristen S. “Student perceptions of academic and social effects of working in a university library.” Reference Services Review 41, No. 3, 2013.
P. 10. Williamson, Bruce. “The Plowboy Interview” in Grubbs, Morris Allen (Ed.), Conversations with Wendell Berry. University Press of Mississippi, 2007.
Curriculum manipulatives are hands-on items, aimed at kindergarten through elementary age students, used to teach particular concepts. For example, if you were teaching a class on currency or mathematics, you could check out a bunch of cardboard coins. If you were teaching a class on counting, proportions or weight, you could check out brass weights, several different kinds of scales, etc.
P. 83. Burrows, Janice H. “Training Student Workers in Academic Libraries: How and Why.” Journal of Library Administration 21, No. 3/4, 1995.
See Carol Anne Choutea; Mary Heinzman. “Gone Fishing: Using the FISH! Business Model to Motivate Student Workers.” Technical Services Quarterly 24, No. 3, 2007. Pp. 41-49.
See Linda D. Lemery’s “Student Assistant Management: Using an Evaluation Rubric.” College & Undergraduate Libraries 15, No. 4, 2008. Pp. 451-462.
For a particularly helpful article on rubric use and writing see Megan Oakleaf’s “Using Rubrics to Collect Evidence for Decision-Making: What do Librarians Need to Learn?” Evidence Based Library and Information Practice 2, No. 3, 2007. Pp. 27-42.
P. 269 Kuh, George D., Jillian Kinzie, et al. Student Success in College: Creating Conditions that Matter. Jossey-Bass, San Francisco. 2005.
P. 101. Arum, Richard; Josipa Roksa. Academically Adrift: Limited Learning on College Campuses. University of Chicago Press, Chicago. 2011.
P. 218. Scrogham, Eve; McGuire, Sara Punksy. “Orientation Training and Development” in Perozzi, Brett (Ed.), Enhancing Student Learning Through College Employment. Dog Ear Publishing, Bloomington, IN, 2009.
The OKFestival 2014 Team is happy to announce that we are launching our Financial Aid Programme today!
We’re delighted to support and ensure the attendance of those with great ideas who are actively involved in the open movement, but whose distance or finances make it difficult for them to get to this year’s festival in Berlin. Diversity and inclusivity are a huge part of our festival ethos and we are committed to ensuring broad participation from all corners of the world. We’re striving to create a forum for all ideas and all people and our Financial Aid Programme will help us to do just that.
Our Travel Grants cover travel and accommodation costs, and our aim is to get you to Berlin if you can’t quite make it there yourself. For more information on what we’ll cover – and what we won’t – how to apply, and what to expect if you do, have a look at our Financial Aid page.
ZBW Labs now uses DBpedia resources as tags/categories for articles and projects. The new Web Taxonomy plugin for the DBpedia Drupal module (developed at ZBW) integrates DBpedia labels, stemming from Wikipedia page titles, into the authoring process via a comfortable autocomplete plugin. On the term page (example), further information about a keyword can be obtained via a link to the DBpedia resource. At the same time, this connects ZBW Labs to the Linked Open Data Cloud.
The plugin is the first one released for Drupal Web Taxonomy, which makes LOD resources and web services easily available for site builders. Plugins for further taxonomies are to be released within our Economics Taxonomies for Drupal project.
As a follow-up to my last post, I added a script to my fork of Aaron’s py-flarchive that will load up a Redis instance with comments, notes, tags and sets for Flickr images that were uploaded by Brooklyn Museum. The script assumes you’ve got a snapshot of the archived metadata, which I downloaded as a tarball. It took several hours to unpack the tarball on a medium EC2 instance, so if you want to play around and just want the Redis database, let me know and I’ll get it to you.
Once I loaded up Redis I was able to generate some high level stats:
machine tags: 933
Given how many images there were, this represents an astonishing number of authors: unique people who added tags, comments or notes. If you are curious, I generated a list of the tags and saved them as a Google Doc. The machine tags were particularly interesting to me. The majority (849) of them look like Brooklyn Museum IDs of some kind, for example:
But there were also 51 geotags, and what looks like 23 links to items in Pleiades, for example:
If I had to guess, I’d say this particular machine tag indicated that the Brooklyn Museum image depicted Abu Simbel. There weren’t tons of these machine tags, but it’s important to remember that other people use Flickr as a scratch space for annotating images this way.
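Machine tags are what make this kind of annotation usable by software, since they conventionally follow a namespace:predicate=value pattern that is easy to split apart. A minimal parsing sketch (the regex is my reading of the common convention, not Flickr's official grammar):

```python
import re

# Machine tags conventionally look like namespace:predicate=value,
# e.g. "geo:lat=22.34". Pattern and example values are illustrative.
MACHINE_TAG = re.compile(
    r'^(?P<ns>[a-zA-Z]\w*):(?P<pred>[a-zA-Z]\w*)=(?P<value>.+)$'
)

def parse_machine_tag(tag):
    """Return (namespace, predicate, value), or None for a plain tag."""
    m = MACHINE_TAG.match(tag)
    return m.group('ns', 'pred', 'value') if m else None
```

Grouping parsed tags by namespace is then enough to separate the geotags and Pleiades links from the rest.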
If you aren’t familiar with them, Flickr notes are annotations of an image, where the user has attached a textual note to a region in the image. Just eyeballing the list, it appears that there is quite a bit of diversity in them, ranging from the whimsical:
cool! they look soo surreal
teehee somebody wrote some graffiti in greek
Lol are these painted?
Steaks are ready!
to the seemingly useful:
Ramesses III Temple
Napoleon’s troops are often accused of destroying the nose, but they are not the culprits. The nose was already gone during the 18th century.
Similarly the general comments run the gamut from:
always wanted to visit Egypt
Just a few points. This is not ‘East Jordan’ it is in the Hauran region of southern Syria. Second it is not Qarawat (I guess you meant Qanawat) but Suweida. Third there is no mention that the house is enveloped by the colonnade of a Roman peripteral temple.
The fire that destroyed the buildings was almost certainly arson. it occurred at the height of the Pullman strike and at the time, rightly or wrongly, the strikers were blamed.
You can see in the background, the TROCADERO with two towers .. This “medieval city” was built on the right bank where are now buildings in modern art style erected for the exposition of 1937.
Brooklyn Museum migrated just 48 tags over from Flickr before they deleted the account. That’s only 0.7% of the tags that were there. None of the comments or notes were moved over.
In the data that Aaron archived there was one indicator of user engagement: the datetime included with comments. Combined with the upload time for the images it was possible to create a spreadsheet that correlates the number of comments with the number of uploads per month:
I’m guessing the drop-off in December of 2013 is due to that being the last time Aaron archived Brooklyn Museum’s metadata. You can see that there was a decline in user engagement: the peak in late 2008 / early 2009 was never matched again. I was half expecting to see that user engagement fell off when Brooklyn Museum’s interest in the platform (uploads) fell off. But you can see that they continued to push content to Flickr without seeing much of a reward, at least in the shape of comments. It’s impossible now to tell whether tagging, notes or sets trended differently.
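The per-month bucketing behind a spreadsheet like that is straightforward once you have the datetimes; here is a sketch using only the standard library (the timestamps are illustrative, not the real archive data):

```python
from collections import Counter
from datetime import datetime, timezone

def counts_by_month(timestamps):
    """Bucket unix timestamps into 'YYYY-MM' counts (UTC)."""
    return Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).strftime('%Y-%m')
        for t in timestamps
    )
```

Run once over comment datetimes and once over upload datetimes, and the two Counters line up into the columns of the spreadsheet.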
Since Flickr includes the number of times each image was viewed, it’s possible to look across all the images and total up the views. The answer?
Not a bad run for 5,697 images. I don’t know if Brooklyn Museum downloaded their metadata prior to removing their account. But luckily Aaron did.
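Totaling the views is then just a sum over the archived photo records; a sketch, where the 'views' field name is my assumption about how the archived metadata is laid out:

```python
def total_views(photos):
    """Sum the view counts over an iterable of photo-info dicts.

    Flickr serializes counts as strings, so coerce to int and treat
    a missing field as zero.
    """
    return sum(int(p.get('views', 0)) for p in photos)
```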
A couple of weeks ago Kevin Phaup took the lead in facilitating a 3D printing workshop here in the Libraries’ Center For Digital Scholarship. More than a dozen students from across the University participated. Kevin presented them with an overview of 3D printing, pointed them towards an online 3D image editing application (Shapeshifter), and everybody created various objects, which Matt Sisk has been diligently printing. The event was deemed a success, and there will probably be more specialized workshops scheduled for the Fall.
Since the last blog posting there has also been another Working Group meeting. A short dozen of us got together in Stinson-Remick, where we discussed the future possibilities for the Group. The consensus was to create a more formal mailing list, maybe create a directory of people with 3D printing interests, and see about doing something more substantial — with a purpose — for the University.
To those ends, a mailing list has been created. Its name is 3D Printing Working Group. The list is open to anybody, and its purpose is to facilitate discussion of all things 3D printing around Notre Dame and the region. To subscribe, address an email message to firstname.lastname@example.org, and in the body of the message include the following command:
Finally, the next meeting of the Working Group has been scheduled for Wednesday, May 14. It will be sponsored by Bob Sutton of Springboard Technologies, will be located in Innovation Park across from the University, and will take place from 11:30 to 1 o’clock. I’m pretty sure lunch will be provided. The purpose of the meeting will be to continue outlining the future directions of the Group, as well as to see a demonstration of a printer called the Isis3D.
Recently my colleague Karen Smith-Yoshimura noted a blog post that demonstrates effective traits for using social media on behalf of an organization. Titled “Social Change”, the post documents the choices that Brooklyn Museum staff made recently to pare down their social media participation to venues that they find most effective. As they put it:
There comes a moment in every trajectory where one has to change course. As part of a social media strategic plan, we are changing gears a bit to deploy an engagement strategy which focuses on our in-building audience, closely examines which channels are working for us, and aligns our energies in places where we feel our voice is needed, but allows for us to pull away where things are happening on their own.
This clearly indicates that it doesn’t make a lot of sense to simply get an account on every social media site out there and let ’er rip. For one thing, it is highly unlikely that your organization has the bandwidth to engage effectively on every platform. For another, without the ability to engage effectively, it’s best not to even attempt it. Having a moribund presence on a social platform is worse than having no presence at all.
Therefore, being a savvy social media user means consciously reviewing your social media use periodically to:
Identify venues that are no longer useful to you and either shut down the account or put it on ice.
Identify venues that you find useful and maintain or increase your use of those venues.
Consider whether the nature of your engagement should change. For example, should you use more pictures to make your posts more engaging? Should you craft messages that are more intriguing than informative, thus potentially increasing visits to your site?
Kudos to the Brooklyn Museum for doing this right. Read the post, and understand what it means to be a thoughtful social media user. We should all be so savvy.
SSL private keys, and thus certificates, can be compromised using a new vulnerability that shipped in currently supported versions of Debian, Ubuntu, CentOS, Fedora, the BSDs, etc.
Time to update your servers, regenerate certs and, if you are being rigorous about it, go through the certificate revocation process for your old ones. BUT be careful that you have OpenSSL 1.0.1g (or newer, should there be one) available. Versions previous to 1.0.1 are NOT vulnerable to heartbleed. Though many of these old versions are vulnerable to other bugs, you would not want to update from 1.0.0 for the sole purpose of avoiding heartbleed if you are only going to land on 1.0.1e, thereby introducing the problem.
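As a quick, rough sanity check, you can at least see which OpenSSL your Python interpreter is linked against; this heuristic only inspects the version string and says nothing about other services on the box:

```python
import ssl

def looks_vulnerable(version_string):
    """Rough heartbleed check on a string like 'OpenSSL 1.0.1e 11 Feb 2013'.

    Vulnerable releases are 1.0.1 through 1.0.1f; 1.0.0 and earlier
    never had the heartbeat bug, and 1.0.1g fixes it.
    """
    parts = version_string.split()
    if len(parts) < 2 or parts[0] != 'OpenSSL':
        return False
    v = parts[1]
    return v.startswith('1.0.1') and v[5:6] in list('abcdef') + ['']

# Only covers the library Python itself was built against.
print(ssl.OPENSSL_VERSION, '->', looks_vulnerable(ssl.OPENSSL_VERSION))
```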
Considering the widespread deployment of OpenSSL, it is hard to overstate how common this bug is online.
My first build of ‘Whatson’ left me wanting. I felt I needed to better define how cognitive technology differed from good-old-fashioned-search, like Google. On one level, cognitive technology is, well, more mental. It uses more than keyword matching and regular expressions; but then so does Google. It uses language analysis; so does Google. It succeeds using very large unstructured data sets. So too Google. So what distinguishes cognitive technology like Watson, and must be wired into the bone of my Whatson?
I benefited from reading Final Jeopardy: Man vs. Machine and the Quest to Know Everything, by Stephen Baker. The difference between search and cognitive technology is the difference between a set of search results and a single correct answer, between looking and finding, seeking versus knowing. Google provides a “vague pointer” to the answer. Watson provides a single, precise answer. Many versus one.
There is rarely one right answer to a question. The essence of critical thinking is the ability to find other ways of thinking about a problem. Google stacks up a list of results and assigns a confidence level to each one. So do cognitive technologies. Unlike Google, cognitive technologies have to be good enough that the top answer is right most of the time. Watson made its public debut playing the game of Jeopardy. Part of its smarts was knowing when to pass a turn, but it had to be able to answer quickly and correctly most of the time or it would lose the game. Cognitive technology raises the bar. It must use more sophisticated language analysis to really understand a human question. It has to be better at pattern recognition. It must employ more thoughtful decision making and follow a big picture strategy.
We have become so used to Google that we are content with a list of search results. What would it be like if we could answer a question on the first try? Would that be it? Done? Not quite. A game can have one right answer, but not the real world. What cognitive technologies can do is eliminate the silly amount of time we spend sifting through search results. We could ask a question and get a satisfactory answer, and then, just like in a dialog with a person, we would ask another question. Beautiful.
During my presentation on WebSockets, there were a couple points where folks in the audience could enter text in an input field that would then show up on a slide. The data was sent to the slides via WebSockets. It is not often that you get a chance to incorporate the technology that you’re talking about directly into how the presentation is given, so it was a lot of fun. At the end of the presentation, I allowed folks to anonymously submit questions directly to the HTML slides via WebSockets.
I ran out of time before I could answer all of the questions that I saw. I’ll try to answer them now.
Questions From Slides
You can see in the YouTube video at the end of my presentation (at 1h38m26s) the following questions came in. ([Full presentation starts here](https://www.youtube.com/watch?v=_8MJATYsqbY&feature=share&t=1h25m37s).) Some lines that came in were not questions at all. For those that are really questions, I’ll answer them now, even if I already answered them.
Are you a trained dancer?
No. Before my presentation I was joking with folks about how little of a presentation I’d have, at least for the interactive bits, if the wireless didn’t work well enough. Tim Shearer suggested I just do an interpretive dance in that eventuality. Luckily it didn’t come to that.
When is the dance?
There was no dance. Initially I thought the dance might happen later, but it didn’t. OK, I’ll admit it, I was never going to dance.
Did you have any efficiency problems with the big images and chrome?
On the big video walls in Hunt Library we often use Web technologies to create the content and Chrome for displaying it on the wall. For the most part we don’t have issues with big images or lots of images on the wall. But there’s a bit of a trick happening here. For instance, when we display images for My #HuntLibrary on the wall, they’re just images from Instagram, so only 600x600px. We initially didn’t know how these would look blown up on the video wall, but they end up looking fantastic. So you don’t necessarily need super high resolution images to make a very nice looking display.
Upstairs on the Visualization Wall, I display some digitized special collections images. While the possible resolution on the display is higher, the current effective resolution is only about 202px wide for each MicroTile. The largest image is then only 404px wide. In this case we are also using a Djatoka image server to deliver the images. Djatoka has an issue with the quality of its scaling between quality levels, where the algorithm chosen can make the images look very poor. The way I usually work around this is to pick the quality level that is just above the width required to fit the design. Then the browser scales the image down and does a better job of making it look OK than the image server would. I don’t know which of these factors affects the look on the Visualization Wall the most, but some images have a stair-stepping look on some lines. This especially affects line drawings with diagonal lines, while photographs can look totally acceptable. We’ll keep looking at how to improve the look of images on these walls, especially in the browser.
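That work-around, picking the quality level just above the width the design needs so the browser only ever scales down, can be sketched as a small helper (the level widths here are illustrative; the real ones depend on the JPEG2000 file):

```python
def pick_level(level_widths, target_width):
    """Choose the smallest available level width that still covers the
    target, so the browser scales the image DOWN, never up."""
    for w in sorted(level_widths):
        if w >= target_width:
            return w
    return max(level_widths)  # nothing big enough: use the largest
```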
I don’t currently have solid plans for developing other content for any of the walls. Some of the work that I and others in the Libraries have done early on has been to help see what’s possible in these spaces and begin to form the cow paths for others to produce content more easily. We answered some big questions. Can we deliver content through the browser? What templates can we create to make this work easier? I think the next act is really for the NCSU Libraries to help more students and researchers to publish and promote their work through these spaces.
Is it lunchtime yet?
In some time zone somewhere, yes. Hopefully during the conference lunch came soon enough for you and was delicious and filling.
Could you describe how testing worked more?
I wish I could think of some good way to test applications that are destined for these kinds of large displays. There’s really no automated testing that is going to help here. BrowserStack doesn’t have a big video wall that they can take screenshots on. I’ve also thought that it’d be nice to have a webcam trained on the walls so that I could make tweaks from a distance.
But Chrome does have its screen emulation developer tools which were super helpful for this kind of work. These kinds of tools are useful not just for mobile development, which is how they’re usually promoted, but for designing for very large displays as well. Even on my small workstation monitor I could get a close enough approximation of what something would look like on the wall. Chrome will shrink the content to fit to the available viewport size. I could develop for the exact dimensions of the wall while seeing all of the content shrunk down to fit my desktop. This meant that I could develop and get close enough before trying it out on the wall itself. Being able to design in the browser has huge advantages for this kind of work.
I work at DH Hill Library while these displays are in Hunt Library. I don’t get over there all that often, so I would schedule some time to see the content on the walls when I happened to be over there for a meeting. This meant that there’d often be a lag of a week or two before I could get over there. This was acceptable as this wasn’t the primary project I was working on.
By the time I saw it on the wall, though, we were really just making tweaks for design purposes. We wanted the panels to the left and right of the Listen to Wikipedia visualization to fall along the bezel. We would adjust font sizes for how they felt once you’re in the space. The initial, rough cut work of modifying the design to work in the space was easy, but getting the details just right required several rounds of tweaks and testing. Sometimes I’d ask someone over at Hunt to take a picture with their phone to ensure I’d fixed an issue.
While it would have been possible for me to bring my laptop and sit in front of the wall to work, I personally didn’t find that to work well for me. I can see how it could work to make development much faster, though, and it is possible to work this way.
Race condition issues between devices?
Some spaces could allow you to control a wall from a kiosk and completely avoid any possibility of a race condition. When you allow users to bring their own device as a remote control to your spaces you have some options. You could allow the first remote to connect and lock everyone else out for a period of time. Because of how subscriptions and presence notifications work this would certainly be possible to do.
For Listen to Wikipedia we allow more than one user to control the wall at the same time. Then we use WebSockets to try to keep multiple clients in sync. Even though we attempt to quickly update all the clients, it is certainly possible that there could be race conditions, though it seems unlikely. Because we’re not dealing with persisting data, I don’t really worry about it too much. If one remote submits just after another but before it is synced, then the wall will reflect the last to submit. That’s perfectly acceptable in this case. If a client were to get out of sync with what is on the wall, then any change by that client would just be sent to the wall as is. There’s no attempt to make sure a client had the most recent, freshest version of the data prior to submitting.
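That last-to-submit-wins behavior can be modeled in a few lines; this is an illustrative sketch of the idea, not the actual Listen to Wikipedia code:

```python
import time

# A toy last-write-wins model of the shared wall state: each remote
# submits its settings, and the wall keeps whichever submission is
# newest. Class and field names are assumptions for illustration.
class WallState:
    def __init__(self):
        self.updated_at = 0.0
        self.settings = {}

    def submit(self, settings, at=None):
        """Apply a submission; stale (older) submissions are dropped."""
        at = time.time() if at is None else at
        if at >= self.updated_at:  # last submission wins
            self.updated_at = at
            self.settings = dict(settings)
        return self.settings
```

Since no data is persisted, a dropped or overwritten submission costs nothing; the wall simply reflects the most recent client.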
While this could be an issue for other use cases, it does not adversely affect the experience here. We do an alright job keeping the clients in sync, but don’t shoot for perfection.
How did you find the time to work on this?
At the time I worked on these I had at least a couple other projects going. When waiting for someone else to finish something before being able to make more progress or on a Friday afternoon, I’d take a look at one of these projects for a little. It meant the progress was slow, but these also weren’t projects that anyone was asking to be delivered on a deadline. I like to have a couple projects of this nature around. If I’ve got a little time, say before a meeting, but not enough for something else, I can pull one of these projects out.
I wonder, though, if this question isn’t more about why I did these projects. There were multiple motivations. A big motivation was to learn more about WebSockets and how the technology could be applied in the library context. I always like to have a reason to learn new technologies, especially Web technologies, and to see how to apply them to other types of applications. And now that I know more about WebSockets, I can see other ways to improve the performance and experience of other applications in ways that might not be as overt in their use of the technology as these projects were.
For the real-time digital collections view this is integrated into an application I’ve developed and it did not take much to begin adding in some new functionality. We do a great deal of business analytic tracking for this application. The site has excellent SEO for the kind of content we have. I wanted to explore other types of metrics of our success.
The video wall projects allowed us to explore several different questions. What does it take to develop Web content for them? What kinds of tools can we make available for others to develop content? What should the interaction model be? What messaging is most effective? How should we kick off an interaction? Is it possible to develop bring your own device interactions? All of these kinds of questions will help us to make better use of these kinds of spaces.
Speed of an unladen swallow?
I think you’d be better off asking a scientist or a British comedy troupe.
This question was in response to how 80% of the interactions with the Listen to Wikipedia application are via QR code. We placed a URL and QR code on the wall for Listen to Wikipedia not knowing which would get the most use.
Unfortunately there’s no simple way I know of to kick off an interaction in these spaces when the user brings their own device. Once, when there was a stable exhibit for a week, we used a kiosk iPad to control a wall so that the visitor did not need to bring a device. We are considering how a kiosk tablet could be used more generally for this purpose. In cases where the visitor brings their own device it is more complicated. The visitor either must enter a URL or scan a QR code. We try to make the URLs short, but because we wanted to use some simple token authentication they’re at least 4 characters longer than they might otherwise be. I’ve considered using geolocation services as the authentication method, but they are not as exact as we might want for this purpose, especially if the device uses campus wireless rather than GPS. We also did not want the further hurdle of asking the user for permission and potentially being rejected. For the QR code, the visitor must already have a QR code reader on their device. The QR code includes the changing token. Either the URL or the QR code sends the visitor to a page in their browser.
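A simple rotating token of this kind could be derived from a shared secret and the current time window. This is a sketch with hypothetical names and parameters, not the code we actually run:

```python
import hashlib
import time

SECRET = "wall-secret"   # hypothetical shared secret known to the wall server
TOKEN_LEN = 4            # the "at least 4 characters" the URLs grow by

def wall_token(period_s=300, now=None):
    """Short token that changes every period_s seconds.

    Any device showing the wall URL or QR code regenerates the token
    each period; the server accepts only the current window's value."""
    now = time.time() if now is None else now
    window = int(now // period_s)
    digest = hashlib.sha256(f"{SECRET}:{window}".encode()).hexdigest()
    return digest[:TOKEN_LEN]

def wall_url(base="https://example.org/wall"):
    """URL to print beside the wall and embed in the QR code."""
    return f"{base}/{wall_token()}"
```

The token is what the QR code and printed URL both carry; a QR library would simply encode the string `wall_url()` returns.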
Because the walls I’ve placed content on are in public spaces, there is no good way to know how many visitors there are compared to the number of interactions. One interesting thing about the Immersion Theater is that I’ll often see folks standing outside the opening to the space looking in, so even if there were some way to track folks going in and out of the space, it would not include everyone who has viewed the content.
If you have other questions about anything in my presentation, please feel free to ask. (If you submit them through the slides I won’t ever see them, so better to email or tweet at me.)
Today, the American Library Association (ALA) called on (pdf) the Federal Communications Commission (FCC) to deploy newly identified E-rate program funding to boost library broadband access and alleviate historic shortfalls in funding for internal connections. In response to the FCC’s March Public Notice, the ALA seeks to leverage existing high-speed, scalable networks to increase library broadband speeds, improve area networks and further explore cost efficiencies that could be enabled through new consortium approaches.
Supporting school-library wide-area network partnerships to better leverage local E-rate investments and support community use of high-capacity connections during non-school hours;
Providing short-term funding focused on deployment where libraries are in close proximity to providers that can ensure scalable broadband at affordable construction charges and recurring costs over time; and
Advancing cost-efficient library network development with new diagnostic and technical support provided at the state level.
“ALA welcomes this new $2 billion investment to support broadband networks in our nation’s libraries and schools so we may meet growing community demand for services ranging from interactive online learning to videoconferencing to downloading and streaming increasingly digital collections,” said ALA President Barbara Stripling. “This infusion can provide ‘two-for-one’ benefits by advancing library broadband to and within our buildings immediately and continuing to improve the E-rate program in the near future.”
You may have noticed Brooklyn Museum’s recent announcement that they have pulled out of Flickr Commons. Apparently they’ve seen a “steady decline in engagement level” on Flickr, and decided to remove their content from that platform, so they can focus on their own website as well as Wikimedia Commons.
Brooklyn Museum announced three years ago that they would be cross-posting their content to the Internet Archive and Wikimedia Commons. Perhaps I’m not seeing their current bot, but they appear to have two, neither of which has done an upload since March of 2011, based on their user activity. It’s kind of ironic that content like this was uploaded to Wikimedia Commons by Flickr Uploader Bot and not by one of their own bots.
The announcement stirred up a fair bit of discussion about how an institution devoted to the preservation and curation of cultural heritage material could delete all the curation that had happened at Flickr. The concern is that all the comments, tagging and annotation that happened on Flickr have not been migrated to Wikimedia Commons. I’m not even sure there’s a place where this structured data could live at Wikimedia Commons. Perhaps some sort of template could be created, or it could live in Wikidata?
Fortunately, Aaron Straup Cope has a backup copy of the Flickr Commons metadata, which includes a snapshot of the Brooklyn Museum’s content. He’s been harvesting this metadata out of concern for Flickr’s future, but surprise, surprise — it was an organization devoted to the preservation of cultural heritage material that removed it. It would be interesting to see how many comments there were. I’m currently unpacking a tarball of Aaron’s metadata on an EC2 instance just to see if it’s easy to summarize.
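Summarizing such a snapshot might look something like this. The member names and JSON layout here are assumptions for illustration; Aaron’s actual dump may be organized quite differently:

```python
import collections
import json
import tarfile

def summarize_comments(tar_path):
    """Tally Flickr comments in a metadata tarball.

    Hypothetical layout: assumes each member named '*_comments.json'
    holds a JSON list of comment objects for one photo."""
    totals = collections.Counter()
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if member.name.endswith("_comments.json"):
                data = json.load(tar.extractfile(member))
                totals["photos_with_comments"] += 1
                totals["comments"] += len(data)
    return totals
```

Streaming straight out of the tarball avoids unpacking millions of small files onto the instance’s disk just to get a count.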
It would help if we had a bit more method to the madness of our own Web presence. Too often the Web is treated as a marketing platform instead of our culture’s predominant content delivery mechanism. Brooklyn Museum deserves a lot of credit for talking about this issue openly. Most organizations just sweep it under the carpet and hope nobody notices.
What do you think? Is it acceptable that Brooklyn Museum discarded the user contributions that happened on Flickr, and that all the people who happened to be pointing at that content from elsewhere now have broken links? Could Brooklyn Museum instead have left the content there, with a banner of some kind indicating that it is no longer actively maintained? Don’t lots of copies keep stuff safe?
Or perhaps having too many copies detracts from the perceived value of the currently endorsed places for finding the content? Curators have too many places to look, none of them synchronized, which adds confusion and duplication. Maybe it’s better to have one place where people can focus their attention?
Perhaps these two positions aren’t at odds, and what’s actually at issue is a framework for thinking about how to migrate Web content between platforms. And different expectations about content that is self hosted, and content that is hosted elsewhere?
I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.
I'm David Rosenthal and I'm an engineer. I'm about two-thirds of a century old. I wrote my first program almost half a century ago, in Fortran for an IBM 1401. Eric Allman invited me to talk; I've known Eric for more than a third of a century. About a third of a century ago Bob Sproull recruited me for the Andrew project at CMU, where I worked on the user interface with James Gosling. I followed James to Sun to work on window systems, both X, which you've probably used, and a more interesting one called NeWS that you almost certainly haven't. Then I worked on operating systems with Bill Shannon, Rob Gingell and Steve Kleiman. More than a fifth of a century ago I was employee #4 at NVIDIA, helping Curtis Priem architect the first chip. Then I was an early employee at Vitria, the second company of JoMei Chang and Dale Skeen, founders of the company now called Tibco. One seventh of a century ago, after doing 3 companies, all of which IPO-ed, I was burnt out and decided to ease myself gradually into retirement.
Academic Journals and the Web
It was a total failure. I met Vicky Reich, the wife of the late Mark Weiser, CTO of Xerox PARC. She was a librarian at Stanford, and had been part of the team which, nearly a fifth of a century ago, started Stanford's HighWire Press and pioneered the transition of academic journals from paper to the Web.
In the paper world, librarians saw themselves as having two responsibilities, to provide current scholars with the materials they needed, and to preserve their accessibility for future scholars. They did this through a massively replicated, loosely coupled, fault-tolerant, tamper-evident system of mutually untrusting but cooperating peers that had evolved over centuries. Libraries purchased copies of journals, monographs and books. The more popular the work, the more replicas were stored in the system. The storage of each replica was not very reliable; libraries put them in the stacks and let people take them away. Most times the replicas came back, sometimes they had coffee spilled on them, and sometimes they vanished. Damage could be repaired via inter-library loan and copy. There was a market for replicas; as the number of replicas of a work decreased, the value of a replica in this market increased, encouraging librarians who had a replica to take more care of it, by moving it to more secure storage. The system resisted attempts at censorship or re-writing of history precisely because it was a loosely coupled peer-to-peer system; although it was easy to find a replica, it was hard to find all the replicas, or even to know exactly how many there were. And although it was easy to destroy a replica, it was fairly hard to modify one undetectably.
The transition of academic journals from paper to the Web destroyed two of the pillars of this system, ownership of copies, and massive replication. In the excitement of seeing how much more useful content on the Web was to scholars, librarians did not think through the fundamental implications of the transition. The system that arose meant that they no longer purchased a copy of the journal, they rented access to the publisher's copy. Renting satisfied their responsibility to current scholars, but it couldn't satisfy their responsibility to future scholars.
Librarians' concerns reached the Mellon Foundation, who funded exploratory work at Stanford and five other major research libraries. In what can only be described as a serious failure of systems analysis, the other five libraries each proposed essentially the same system, in which they would take custody of the journals. Other libraries would subscribe to this third-party archive service. If they could not get access from the original publisher and they had a current subscription to the third-party archive they could access the content from the archive. None of these efforts led to a viable system because they shared many fundamental problems including:
Libraries such as Harvard were reluctant to outsource a critical function to a competing library such as Yale. On the other hand, funders were reluctant to pay for more than one archive.
Publishers were reluctant to deliver their content to a library in order that the library might make money by re-publishing the content to others. This made the contract negotiations necessary to obtain content from the publishers time-consuming and expensive.
The concept of a subscription archive was not a solution to the problem of post-cancellation access; it was merely a second instance of exactly the same problem.
One of the problems I had been interested in at Sun and then again at Vitria was fault-tolerance. To a computer scientist, it was a solved problem. Byzantine Fault Tolerance (BFT) could prove that 3f+1 replicas could survive f simultaneous faults. To an engineer, it was not a solved problem. Two obvious questions were:
What is the probability that my system will encounter f simultaneous faults?
How could my system recover if it did?
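To see why the first question matters, here is a back-of-the-envelope sketch (not from any BFT paper): if faults were independent with some per-replica probability, the chance of exceeding the f-fault budget of a 3f+1 system is a binomial tail. Real faults are correlated, which only makes things worse.

```python
from math import comb

def p_unsurvivable(f, p):
    """Probability that more than f of the 3f+1 replicas are faulty
    at the same time -- the case BFT provably cannot survive.
    Assumes independent faults with per-replica probability p."""
    n = 3 * f + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(f + 1, n + 1))
```

With p = 1% per replica, a 4-replica (f = 1) system blows its budget with probability around 6 in 10,000; a 10-replica (f = 3) system drops that to a few in a million. The model says nothing, however, about the second question: what to do when the budget is exceeded anyway.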
There's a very good reason why suspension bridges use stranded cables. A solid rod would be cheaper, but the bridge would then have the same unfortunate property as BFT. It would work properly up to the point of failure, which would be sudden, catastrophic and from which recovery would be impossible.
I have long thought that the fundamental challenge facing system architects is to build systems that fail gradually, progressively, and slowly enough for remedial action to be effective, all the while emitting alarming noises to attract attention to impending collapse. In a post-Snowden world it is perhaps superfluous to say that these properties are especially important for failures caused by external attack or internal subversion.
The LOCKSS System
As Vicky explained the paper library system to me, I came to see two things:
It was a system in the physical world that had a very attractive set of fault-tolerance properties.
An analog of the paper system in the Web world could be built that retained those properties.
With a small grant from Michael Lesk, then at the NSF, I built a prototype system called LOCKSS (Lots Of Copies Keep Stuff Safe), modelled on the paper library system. By analogy with the stacks, libraries would run what you can think of as a persistent Web cache with a Web crawler which would pre-load the cache with the content to which the library subscribed. The contents of each cache would never be flushed, and would be monitored by a peer-to-peer anti-entropy protocol. Any damage detected would be repaired by the Web analog of inter-library copy. Because the system was an exact analog of the existing paper system, the copyright legalities were very simple.
The Mellon Foundation, and then Sun and the NSF funded the work to throw my prototype away and build a production-ready system. The interesting part of this started when we discovered that, as usual with my prototypes, the anti-entropy protocol had gaping security holes. I worked with Mary Baker and some of her students in CS, Petros Maniatis, Mema Roussopoulos and TJ Giuli, to build a real P2P anti-entropy protocol, for which we won Best Paper at SOSP a tenth of a century ago.
The interest in this paper is that it shows a system, albeit in a restricted area of application, that has a high probability of failing slowly and gradually, and of generating alarms in the case of external attack, even from a very powerful adversary. It is a true P2P system with no central control, because that would provide a focus for attack. It uses three major defensive techniques:
Effort-balancing, to ensure that the computational cost of requesting a service from a peer exceeds the computational cost of satisfying the request. If this condition isn't true in a P2P network, the bad guy can wear the good guys down.
Rate-limiting, to ensure that the rate at which the bad guy can make bad things happen can't make the system fail quickly.
Lots of copies, so that the anti-entropy protocol can work with samples of the population of copies. Randomly sampling the peers makes it hard for the bad guy to know which peers are involved in which operations.
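A stripped-down sketch of the sampling idea, omitting the effort-balancing and rate-limiting that the real protocol depends on (all names here are hypothetical, not the LOCKSS implementation):

```python
import hashlib
import random

def anti_entropy_round(my_copy, peers, sample_size=3):
    """One sketch round of sampled anti-entropy.

    Poll a random sample of peers for the hash of their copy; if a
    majority of the sample disagrees with ours, repair from a peer in
    the agreeing majority. Random sampling means an attacker cannot
    predict which peers take part in which operations."""
    def h(data):
        return hashlib.sha256(data).hexdigest()

    sample = random.sample(peers, min(sample_size, len(peers)))
    votes = {}
    for peer_copy in sample:
        votes.setdefault(h(peer_copy), []).append(peer_copy)
    winner_hash, holders = max(votes.items(), key=lambda kv: len(kv[1]))
    if h(my_copy) != winner_hash and len(holders) > len(sample) // 2:
        return holders[0]   # repair: adopt the majority's copy
    return my_copy          # our copy agrees, or no clear majority
```

A damaged peer converges back to the population's copy over repeated rounds, which is the slow, gradual failure mode the talk argues for.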
The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at current prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk. The median data item in the Global LOCKSS network has copies at a couple of dozen peers.
I doubt that copyright is high on your list of design problems. You may be wrong about that, but I'm not going to argue with you. So, the rest of this talk will not be about the LOCKSS system as such, but about the lessons we've learned in the last 15 years that are applicable to everyone who is trying to store digital information for the long term. The title of this talk is the question that you have to keep asking yourself over and over again as you work on digital preservation, "what could possibly go wrong?" Unfortunately, once I started writing this talk, it rapidly grew far too long for lunch. Don't expect a comprehensive list, you're only getting edited low-lights.
Stuff is going to get lost
Let's start by examining the problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.
Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
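The arithmetic behind that figure, taking a petabyte as 8×10^15 bits and the age of the universe as roughly 1.38×10^10 years:

```python
import math

PB_BITS = 8 * 10**15        # bits in a petabyte
YEARS = 100                 # the century in the black box
AGE_OF_UNIVERSE = 1.38e10   # years, approximately

# For a 50% chance that no bit flips, the per-bit per-year flip rate
# lam must satisfy exp(-lam * YEARS * PB_BITS) = 0.5, so
lam = math.log(2) / (YEARS * PB_BITS)

# The implied per-bit half-life is ln(2)/lam, which collapses to
half_life_years = YEARS * PB_BITS        # = 8e17 years
ratio = half_life_years / AGE_OF_UNIVERSE
```

`ratio` comes out around 5.8×10^7: you would need to show that no process with a half-life shorter than roughly 60 million times the age of the universe was operating in your system, which no feasible benchmark can do.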
At scale, storing realistic amounts of data for human timescales is an unsolvable problem. Some stuff is going to get lost. This shouldn't be a surprise; even in the days of paper, stuff got lost. But the essential information needed to keep society running, to keep science progressing, to keep the populace entertained was stored very robustly, with many copies on durable, somewhat tamper-evident media in a fault-tolerant, peer-to-peer, geographically and administratively diverse system.
This is no longer true. The Internet has, in the interest of reducing costs and speeding communication, removed the redundancy, the durability and the tamper-evidence from the system that stores society's critical data. It's now all on spinning rust, with hopefully at least one backup on tape covered in rust.
a rapid succession of coronal mass ejections ... sent a pulse of magnetized plasma barreling into space and through Earth’s orbit. Had the eruption come nine days earlier, when the ignition spot on the solar surface was aimed at Earth, it would have hit the planet, potentially wreaking havoc with the electrical grid, disabling satellites and GPS, and disrupting our increasingly electronic lives. ... A study last year estimated that the cost of a solar storm like [this] could reach $2.6 trillion worldwide.
Most of the information needed to recover from such an event exists only in digital form on magnetic media. These days, most of it probably exists only in "the cloud", which is this happy place immune from the electromagnetic effects of coronal mass ejections and very easy to access after the power grid goes down.
How many of you have read the science fiction classic The Mote In God's Eye by Larry Niven and Jerry Pournelle? It describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles, the best they can do is to try to ensure that each collapse starts from a higher level than the one before by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.
Larry and Jerry were writing a third of a century ago, but in the light of this week's IPCC report, they are starting to look uncomfortably prophetic. The problem we face is that, with no collective memory of a societal collapse, no-one is willing to pay either to fend it off or to build the museums to pass knowledge to the successor society.
Why is stuff going to get lost?
One way to express the "what could possibly go wrong?" question is to ask "against what threats are you trying to preserve data?" The threat model of a digital preservation system is a very important aspect of the design which is, alas, only rarely documented. In 2005 we did document the LOCKSS threat model. Unfortunately, we didn't consider coronal mass ejections or societal collapse from global warming.
We observed that most discussion of digital preservation focused on these threats:
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
How much stuff is going to get lost?
The more we spend per byte, the safer the bytes are going to be. Unfortunately, this is subject to the Law of Diminishing Returns; each successive nine of reliability is exponentially more expensive than the last. We don't have an unlimited budget, so we're going to have to trade off cost against the probability of data loss. To do this we need models to predict the cost of storing data using a given technology, and models to predict the probability of that technology losing data. I've worked on both kinds of model and can report that they're both extremely difficult.
Media reliability claims are based on a model of the failure mechanisms and on data from accelerated life testing, in which batches of media are subjected to unrealistically high temperature and humidity. The model is used to extrapolate from these unrealistic conditions to the conditions to be encountered in service. There are two problems: the conditions in service typically don't match those assumed by the models, and the models only capture some of the failure mechanisms.
These problems are much worse when we try to model not just failures of individual media, but of the entire storage system. Research has shown that media failures account for less than half the failures encountered in service; other components of the system such as buses, controllers, power supplies and so on contribute the other half. But even models that include these components exclude many of the threats we identified, from operator errors to coronal mass ejections.
Even more of a problem is that the threats, especially the low-probability ones, are highly correlated. Operators are highly likely to make errors when they are stressed coping with, say, an external attack. The probability of economic failure is greatly increased by, say, insider abuse. Modelling these correlations is a nightmare.
It turns out that economics are by far the largest cause of data failing to reach future readers. A month ago I gave a seminar in the I-school entitled The Half-Empty Archive, in which I pulled together the various attempts to measure how much of the data that should be archived is being collected by archives, and assessed that it was much less than half. No-one believes that archiving budgets are going to double, so we can be confident that the loss rate from being unable to afford to collect is at least 50%. This dwarfs all other causes of data loss.
Let's Keep Everything For Ever!
Digital preservation has three cost areas: ingest, preservation and dissemination. In the seminar I looked at the prospects for radical cost decreases in all three, but I assume that the one you are interested in is storage, which is the main cost of preservation. Everyone knows that, if not actually free, storage is so cheap that we can afford to store everything for ever. For example, Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."
Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely.
They believe this because they lived through a third of a century of Kryder's Law, the analog of Moore's Law for disks. Kryder's Law predicted that the bit density on the platters of disk drives would more than double every 18 months, leading to a consistent 30-40%/yr drop in cost per byte. Thus, long-term storage was effectively free. If you could afford to store something for a few years, you could afford to store it for ever. The cost would have become negligible.
As Randall Munroe points out, in the real world exponential growth can't continue for ever. It is always the first part of an S-curve. One of the things that most impressed me about Krste Asanović's keynote on the ASPIRE Project at this year's FAST conference was that their architecture took for granted that Moore's Law was in the past. Kryder's Law is also flattening out.
Here's a graph, from Preeti Gupta at UCSC, showing that in 2010, even before the floods in Thailand doubled $/GB overnight, the Kryder curve was flattening. Currently, disk is about 7 times as expensive as it would have been had the pre-2010 Kryder's Law continued. Industry projections are for 10-20%/yr going forward - the red lines on the graph show that in 2020 disk is now expected to be 100-300 times more expensive than pre-2010 expectations.
Industry projections have a history of optimism, but if we believe that data grows at IDC's 60%/yr, disk density grows at IHS iSuppli's 20%/yr, and IT budgets are essentially flat, the annual cost of storing a decade's accumulated data is 20 times the first year's cost. If at the start of the decade storage is 5% of your budget, at the end it is more than 100% of your budget. So the Big Data zealot has an affordability problem.
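The arithmetic, under those assumed growth rates (a rough check, not a forecast):

```python
DATA_GROWTH = 1.60      # IDC: data volume grows 60%/yr
DENSITY_GROWTH = 1.20   # IHS iSuppli: disk density grows 20%/yr

def relative_annual_cost(years):
    """Annual cost of storing the accumulated data after `years`,
    relative to the first year, with flat IT budgets: data up 1.6x/yr
    while $/GB falls only as fast as density rises."""
    return (DATA_GROWTH / DENSITY_GROWTH) ** years

decade = relative_annual_cost(10)   # cost multiple after a decade
```

`decade` comes out near 18, in line with the roughly 20x figure above: a line item that starts at 5% of the budget ends the decade consuming essentially all of it.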
Why Is Kryder's Law Slowing?
It is easy to, and we often do, conflate Kryder's Law, which describes the increase in the areal density of bits on disk platters, with the cost of disk storage in $/GB. We wave our hands and say that it roughly mapped one-for-one into a decrease in the cost of disk drives. We are not alone in using this approximation, Mark Kryder himself does (PDF):
Density is viewed as the most important factor ... because it relates directly to cost/GB and in the HDD marketplace, cost/GB has always been substantially more important than other performance parameters. To compare cost/GB, the approach used here was to assume that, to first order, cost/GB would scale in proportion to (density)^-1
My co-author Daniel Rosenthal (no relation) has investigated the relationship between bits/in² and $/GB over the last couple of decades. Over that time, it appears that about 3/4 of the decrease in $/GB can be attributed to the increase in bits/in². Where did the rest of the decrease come from? I can think of three possible causes:
Economies of scale. For most of the last two decades the unit shipments of drives have been increasing, resulting in lower fixed costs per drive. Unfortunately, unit shipments are currently declining, so this effect has gone into reverse. In 2005 Mark Kryder was quoted as predicting "In a few years the average U.S. consumer will own 10 to 20 disk drives in devices that he uses regularly," but what is in those devices now is flash. The remaining market for disks is the cloud; they are no longer a consumer technology.
Manufacturing technology. The technology to build drives has improved greatly over the last couple of decades, resulting in lower variable costs per drive. Unfortunately HAMR, the next generation of disk drive technology, has proven to be extraordinarily hard to manufacture, so this effect has gone into reverse.
Vendor margins. Over the last couple of decades disk drive manufacturing was a very competitive business, with numerous competing vendors. This gradually drove margins down and caused the industry to consolidate. Before the Thai floods, there were only two major manufacturers left, with margins in the low single digits. Unfortunately, the lack of competition and the floods have led to a major increase in margins, so this effect has gone into reverse.
But these three factors account for only the remaining 1/4 of the cost decrease. What about the density increases that drove the other 3/4? Here is a 2008 graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology. Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. Instead, the industry has resorted to stretching PMR by shingling (which increases the density) and helium (which increases the number of platters).
Each technology generation has to stay in the market long enough to earn a return on the cost of the transition from its predecessor. There are two problems:
The return it needs to earn is, in effect, the margins the vendors enjoy. The higher the margins, the longer the technology needs to be in the market. Margins have increased.
As technology advances, the easier problems get solved first. So each technology transition involves solving harder and harder problems, so it costs more. The transition from PMR to HAMR has turned out to be vastly more expensive than the industry expected. Getting the laser and the magnetics in the head assembly to cooperate is very hard, the transition involves a huge increase in the production of the lasers, and so on.
According to Dave's 6-year-old graph, we should now be almost done with HAMR and starting the transition to Bit Patterned Media (BPM). It is already clear that the HAMR-BPM transition will be even more expensive and thus even more delayed than the PMR-HAMR transition. So the projected 20%/yr Kryder rate is unlikely to be realized. The one good thing, if you can call it that, about the slowing of the Kryder rate for disk is that it puts off the day when the technology hits the superparamagnetic limit. This is when the shrinking magnetic domains become unstable at the temperatures encountered inside an operating disk, which are quite warm.
We'll Just Use Tape Instead of Disk
About 70% of all bytes of storage produced each year are disk, the rest being tape and solid state. Tape has been the traditional medium for long-term storage. Its recording technology lags about 8 years behind disk; it is unlikely to run into the problems plaguing disk for some years. We can expect its relative cost per byte advantage over disk to grow in the medium term. But tape is losing ground in the market. Why is this?
In the past, the access patterns to archived data were stable. It was rarely accessed, and accesses other than integrity checks were sparse. But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until last week it was 5.5 times.
An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the considerable costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year. Recently, Twitter issued a call for a "small number of proposals to receive free datasets", but even Twitter can't support 400.
Thus future archives will need to keep at least one copy of their content on low-latency, high-bandwidth storage, not tape.
We'll Just Use Flash Instead
Flash memory's advantages, including low power, physical robustness and low access latency have overcome its higher cost per byte in many markets, such as tablets and servers. But there is no possibility of flash replacing disk in the bulk storage market; that would involve trebling the number of flash fabs. Even if we ignore the lead time to build the new fabs, the investment to do so would not pay dividends. Everyone understands that shrinking flash cells much further will impair their ability to store data. Increasing levels, stacking cells in 3D and increasingly desperate signal processing in the flash controller will keep density going for a little while, but not long enough to pay back the investment in the fabs.
We'll Just Use Non-volatile RAM Instead
There are many technologies vying to be the successor to flash, and most can definitely keep scaling beyond the end of flash provided the semiconductor industry keeps on its road-map. They all have significant advantages over flash, in particular they are byte- rather than block-addressable. But analysis by Mark Kryder and Chang Soo Kim (PDF) at Carnegie-Mellon is not encouraging about the prospects for either flash or the competing solid state technologies beyond the end of the decade.
We'll Just Use Metal Tape, Stone DVDs, Holographic DVDs or DNA Instead
Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.
The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
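The compounding here is easy to check; a quick back-of-the-envelope sketch using the rates from the paragraph above:

```python
# How much space the same data needs after n years, if bytes per
# unit of space (at constant cost) improve at a given annual
# Kryder rate. Purely illustrative arithmetic.
def space_fraction(kryder_rate, years):
    return (1 - kryder_rate) ** years

# At 10%/yr, after a decade the data fits in about 1/3 the space.
print(round(space_fraction(0.10, 10), 3))  # 0.349

# At 30%/yr, the same space holds roughly 30-35 times as much data.
print(round(1 / space_fraction(0.30, 10)))  # 35
```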
There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than Kryder's or Moore's Laws. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the cold storage problem. But DNA access will always be slow enough that it can't store the only copy.
The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Double the reliability is only worth 1/10th of 1 percent cost increase. ...
Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
Note that this analysis assumes that the drives fail under warranty. One thing the drive vendors did to improve their margins after the floods was to reduce the length of warranties.
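The arithmetic in the quote above checks out; here is a quick sketch using the quote's own figures (30,000 drives, 2% annual failures, 15 minutes per swap, ~$5,000 of salary saved, $4M of drives):

```python
# Sanity-check of the drive-replacement arithmetic in the quote.
drives = 30_000
failure_rate = 0.02       # 2% of drives fail per year
hours_per_swap = 0.25     # 15 minutes of work per replacement

swap_hours = drives * failure_rate * hours_per_swap
print(swap_hours)  # 150.0 hours, about one month of 8-hour days

# Halving the failure rate saves half that labor, ~$5,000 of salary,
# against a $4M fleet: the premium worth paying for reliability.
salary_saved = 5_000
fleet_cost = 4_000_000
print(f"{salary_saved / fleet_cost:.3%}")  # 0.125%
```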
Does Kryder's Law Slowing Matter?
Figures from SDSC suggest that media cost is about 1/3 of the lifecycle cost of storage, although figures from BackBlaze suggest a much higher proportion. As a rule of thumb, the research into digital preservation costs suggests that ingesting the content costs about 1/2 the total lifecycle costs, preserving it costs about 1/3 and disseminating it costs about 1/6. So why are we worrying about a slowing of the decrease in 1/9 of the total cost?
Different technologies with different media service lives involve spending different amounts of money at different times during the life of the data. To make apples-to-apples comparisons we need to use the equivalent of Discounted Cash Flow to compute the endowment needed for the data. This is the capital sum which, deposited with the data and invested at prevailing interest rates, would be sufficient to cover all the expenditures needed to store the data for its life.
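As a toy illustration of the endowment idea only (this is not our model; it assumes a constant interest rate and a constant Kryder rate, and the numbers are made up):

```python
# Toy endowment calculation: the capital needed now to fund storage
# whose annual cost falls at `kryder_rate` per year, discounted at
# `interest_rate`. An illustrative sketch, not the model in the text.
def endowment(initial_annual_cost, kryder_rate, interest_rate, years):
    total = 0.0
    for t in range(years):
        cost_t = initial_annual_cost * (1 - kryder_rate) ** t
        total += cost_t / (1 + interest_rate) ** t
    return total

# At a 40%/yr Kryder rate, costs vanish quickly and a few years'
# spending suffices; at a 0%/yr rate the endowment is many times larger.
print(round(endowment(1000, 0.40, 0.02, 100)))
print(round(endowment(1000, 0.00, 0.02, 100)))
```

The gap between the two printed figures is the point: the endowment is acutely sensitive to the Kryder rate.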
We built an economic model of the cost of long-term storage. Here it is, from 15 months ago, plotting the endowment needed for 3 replicas of a 117TB dataset to have a 98% chance of not running out of money over 100 years, against the Kryder rate, using costs from Backblaze. Each line represents a policy of keeping the drives for 1, 2, ... 5 years before replacing them.
In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph, where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
The endowment needed increases sharply.
The endowment needed becomes harder to predict, because it depends strongly on the precise Kryder rate.
The reason to worry is that the cost of storing data for the long term depends strongly on the Kryder rate if it falls much below 20%, which it has. Everyone's storage expectations, and budgets, are based on their pre-2010 experience, and on a belief that the effect of the floods was a one-off glitch and that the industry will quickly get back to historic Kryder rates. It wasn't, and it won't.
Does Losing Stuff Matter?
Consider two storage systems with the same budget over a decade: one with a loss rate of zero, the other half as expensive per byte but losing 1% of its bytes each year. At first glance, the cheaper system's loss rate looks unacceptable.
However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
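That decade figure can be verified with a short simulation of the two systems described above (units are arbitrary; the cheaper system stores twice as much per year):

```python
# Accumulate content year by year; the lossy system loses a fraction
# of its accumulated holdings annually.
def preserved(years, yearly_amount, loss_rate):
    total = 0.0
    for _ in range(years):
        total = (total + yearly_amount) * (1 - loss_rate)
    return total

lossless = preserved(10, 1.0, 0.00)  # 10 units after a decade
cheaper = preserved(10, 2.0, 0.01)   # twice the intake, 1%/yr loss
print(round(cheaper / lossless, 2))  # 1.89
```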
Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2752nd.
Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.
Can We Do Better?
In the short term, the inertia of manufacturing investment means that things aren't going to change much. Bulk data is going to be on disk, it can't compete with other uses for the higher-value space on flash. But looking out to the end of the decade and beyond, we're going to be living in a world of much lower Kryder rates. What does this mean for storage system architectures?
The reason disks have a five-year service life isn't an accident of technology. Disks are engineered for a five-year service life because, at a 40%/yr Kryder rate, it is uneconomic to keep data on a drive for longer than 5 years; by then the data would occupy only about 8% of its replacement's capacity.
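The arithmetic behind that 8% figure, using the 40%/yr rate from the sentence above:

```python
# After 5 years of 40%/yr capacity-per-dollar improvement, the old
# drive's contents as a fraction of its replacement's capacity:
kryder_rate = 0.40
fraction = (1 - kryder_rate) ** 5
print(f"{fraction:.1%}")  # 7.8%
```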
At lower Kryder rates the media, whatever they are, will be in service longer. That means that running cost will be a larger proportion of the total cost. It will be worthwhile to spend more on purchasing the media in order to spend less on running them. Three years ago Ian Adams, Ethan Miller and I were inspired by the FAWN paper from Carnegie-Mellon to do an analysis we called DAWN: Durable Array of Wimpy Nodes. In it we showed that, despite the much higher capital cost, a storage fabric consisting of a very large number of very small nodes, each with a very low-power system-on-chip and a small amount of flash memory, would be competitive with disk.
The reason was that DAWN's running cost would be so much lower, and its economic media life so much longer, that it would repay the higher initial investment. The more the Kryder rate slows, the better our analysis looks. DAWN's better performance was a bonus. To the extent that successors to flash behave like RAM, and especially if they can be integrated with the system-on-chip, they strengthen the case further with lower costs and an even bigger performance edge.
Expectations for future storage technologies and costs were built up during three decades of extremely rapid cost per byte decrease. We are now 4 years into a period of much slower cost decrease, but expectations remain unchanged. Some haven't noticed the change; others believe it is temporary and that the industry will return to the good old days of 40%/yr Kryder rates.
Industry insiders are projecting no more than 20%/yr rates for the rest of the decade. Technological and market forces make it likely that, as usual, they are being optimistic. Lower Kryder rates greatly increase both the cost of long-term storage, and the uncertainty in estimating it.
The idea that archived data can live on long-latency, low-bandwidth media no longer holds. Future archival storage architectures must deliver adequate performance to sustain data-mining as well as low cost. Bundling computation into the storage medium is the way to do this.
As usual, I was too busy answering questions to remember most of them. Here are the ones I remember, rephrased, with apologies to the questioners whose contributions slipped my memory:
Won't the evolution of flash technology drive its price down more quickly than disk? The problem is that the manufacturing capacity doesn't, and won't, exist for flash to displace disk in the bulk storage space. Flash is a better technology than disk for many applications, so it is likely always to command a premium over disk.
Isn't DNA a really noisy technology to build long-term memory from? At the raw media level, all storage technologies are unpleasantly noisy. The signal processing that goes on inside your disk or flash controller is amazing. DNA has the advantage that the signal processing has a vast number of replicas to work with.
Doesn't experience with flash suggest that it isn't capable of storing data reliably for the long term? The way current flash controllers use the raw medium optimizes things other than data retention, such as performance (for SSDs) and low cost (for SD cards, see Bunnie Huang and xobs' talk at the Chaos Communication Congress). That doesn't mean it isn't possible, with alternate flash controller technology, to optimize for data retention.
Google Fellows visit the ALA Washington Office for a luncheon last year.
The American Library Association’s Washington Office is calling for graduate students, especially those in library and information science-related academic programs, to apply for the 2014 Google Policy Fellows program. Applications are due by Monday, April 14, 2014.
For the summer of 2014, the selected fellow will spend 10 weeks in residence at the ALA Washington Office to learn about national policy and complete a major project. Google provides the $7,500 stipend for the summer, but the work agenda is determined by the ALA and the selected fellow.
The Google Washington office provides an educational program for all of the fellows, such as lunchtime talks and interactions with Google Washington staff.
The fellows work in diverse areas of information policy that may include digital copyright, e-book licenses and access, future of reading, international copyright policy, broadband deployment, telecommunications policy (including e-rate and network neutrality), digital divide, access to information, free expression, digital literacy, online privacy, the future of libraries generally, and many other topics.
Jamie Schleser, a doctoral student at American University, served as the ALA 2013 Google Policy Fellow. Schleser worked with OITP to apply her dissertation research regarding online-specific digital libraries to articulate visions and strategies for the future of libraries.
Further information about the program and host organizations is available at the Google Public Policy Fellowship website.
Rebecca has rocks in her head and they are not coming out. This will not be a post on library technology.
Rebecca is the five-year-old daughter of Kat and Eric, friends in Cleveland that I met while I worked at Case Western Reserve University. This week Kat and Eric told Rebecca that another tumor had grown in her brain, that it could not be removed, and that their search for a drug or a technique to shrink it would probably be fruitless. It was Rebecca who knew this meant she was going to die. Earlier than any child should.
To Kat and Eric: thank you for living this chapter of your life in the open. I hope putting your thoughts and feelings onto the internet has been a helpful form of therapy. I hope the comments from around the world have been a source of buoyancy and strength to you and, by extension, your family.
Please know your loving response is an inspiration. I’m writing this post in part to add my voice to the chorus of support, and in part to celebrate you as a role model for parents in similar situations. You are in my thoughts, hopes and dreams.