Blogs and feeds of interest to the Code4Lib community, aggregated.


July 02, 2009

Dempsey, Lorcan

In English?

By: dempsey

Categories: Analytics and measurement • OCLC

I thought I would post some numbers here which were prepared by my colleague Brian Lavoie for another purpose. The question was: how many of the books in US libraries are in English?

First of all, what is a book? Deciding what a book is involves some choices (are theses in or out, for example?). This analysis uses the definition of 'print books' given in the Google 5 analysis published in DLib Magazine a while back [1].

a. All of WorldCat (Apr 09):
135.3 million records
Cataloged as "eng": 46 percent (so 54 percent non-English)

b. Print books only (Apr 09):
91.2 million
Cataloged as "eng": 40 percent (so 60 percent non-English)

c. Print books in US libraries (Jan 09)
42.5 million
Cataloged as "eng": 57 percent (so 43 percent non-English)

d. Print books representing combined collections of three academic research libraries participating in GBS (April 2009):
7.2 million
Cataloged as "eng": 54 percent (so 46 percent non-English)


Note - c is calculated on a slightly earlier version of the database as we had already pulled out US library holdings. The data in d is being looked at for another purpose: hence the slightly arbitrary selection of 3 libraries.

Note - these numbers are for records in the database, which represent 'manifestations' in FRBR terms. If one were to count holdings or actual copies the numbers would be different. The proportion of 'eng' would go up as English titles will be more widely held and in greater numbers of copies.
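To make the shares above concrete, here is a minimal Python sketch that turns each percentage into an approximate count of "eng" records. The totals and percentages are the ones listed above; the function and variable names are mine, and the results are record counts only, not holdings or copies.

```python
# Totals (millions of records) and "eng" shares (percent) as listed in the post.
SUBSETS = {
    "All of WorldCat": (135.3, 46),
    "Print books only": (91.2, 40),
    "Print books in US libraries": (42.5, 57),
    "Three GBS research libraries": (7.2, 54),
}


def english_records(total_millions: float, eng_percent: float) -> float:
    """Approximate number of records cataloged as 'eng', in millions."""
    return total_millions * eng_percent / 100


# Round to one decimal place, matching the precision of the published totals.
estimates = {
    name: round(english_records(total, percent), 1)
    for name, (total, percent) in SUBSETS.items()
}

for name, millions in estimates.items():
    print(f"{name}: about {millions} million 'eng' records")
```

Because the published percentages are themselves rounded, these counts are only good to roughly the nearest half million.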


[1] Here is how the definition of a 'print book' was decided upon and operationalised for the Google 5 analysis. "Although there is no unambiguous bibliographic definition of a book, libraries have often used monographic language materials as a proxy for books, and this practice is adopted for this study. More specifically, in the context of a MARC21 record, a book is defined as a language-based monograph, identified by the codes "a" and "m" in bytes 6 and 7 of the leader, respectively. For the purposes of this study, theses/dissertations and government documents are excluded from the analysis, since these materials are usually acquired and managed as separate segments of the library collection. Records describing books in print format were identified by eliminating all non-print formats, such as digital, microform, Braille, and so on."
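The leader test quoted in the definition can be sketched in a few lines of Python. This covers only the "a"/"m" check on leader bytes 6 and 7; the exclusions for theses, government documents and non-print formats need other fields and are not attempted here. The function name and the sample leaders are illustrative, not taken from the study.

```python
def is_language_monograph(leader: str) -> bool:
    """Return True when a MARC21 leader marks a language-based monograph."""
    if len(leader) < 8:
        return False
    # Leader byte 6 is the type of record ("a" = language material);
    # leader byte 7 is the bibliographic level ("m" = monograph).
    return leader[6] == "a" and leader[7] == "m"


# Made-up leaders for illustration only.
book_leader = "00000nam a2200000 a 4500"
serial_leader = "00000nas a2200000 a 4500"

print(is_language_monograph(book_leader))
print(is_language_monograph(serial_leader))
```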


July 02, 2009 10:46 PM

Styles, Rob

OTTO - Controllerism Instrument at djtechtools.com

from OTTO - Controllerism Instrument at djtechtools.com: Controllerism continues to take small leaps forward as the software and techniques improve but the giant steps are going to happen in the realm of performance interfaces. Without a solid controller surface that has been designed to play like an instrument we won't be able to leave the realm [...]

by Rob Styles at July 02, 2009 08:31 PM

Singer, Ross

Linked Open LibraryThing

For Ian Davis’ birthday, Danny Ayers sent out an email asking people to make some previously unavailable datasets accessible as linked data as Ian’s present. It was a pretty neat idea, one that I wish I had thought of.

Given that Ian is my boss (prior to about a month ago, Ian was just nebulously “above me” somewhere in the Talis hierarchy, but I now report to him directly) one could cynically make the claim that by providing Ian a ‘linked data gift’ that I would just be currying favor by being a kiss-ass.  You could make that claim, sure, but evidently you are not aware of how I hurt the company.

Anyway, as my contribution, I decided to take the data dumps from LibraryThing that Tim Spalding pretty graciously makes available [whoa, in the time that I first started this post until now, the data has gone AWOL, I suppose I did this just in time].  The data isn’t always very current and not all of the files are terribly useful (the tags one, for example, doesn’t offer much since the tags aren’t associated with anything — it’s just words and their counts), but it’s data and between ThingISBN and the WikipediaCitations I thought it would be worth it.

I wanted to take a very pragmatic approach to this: no triple store, no search, no rdf libraries, minimal interface. Mostly this was inspired by Ed Summers’ work with the Library of Congress Authorities, but, also, if Tim (or whoever at LibraryThing) saw that making LibraryThing linked data was as easy as a few template tweaks (as opposed to a major change in their development stack), this exercise would be much more likely to actually make its way into LibraryThing.

What I ended up with (the first pass released before the end of Ian’s birthday, I might add) was LODThing: a very simple application written in Ruby’s Sinatra framework, DataMapper and SQLite.  The entire application is less than 230 lines of Ruby (including the web app and data loader) plus 2 HAML templates and 2 builder templates for the HTML/RDFa and RDF/XML, respectively.  The SQL database has three tables, including the join table.  This is really simple stuff.  The only real reason it took a couple days to create was trying to get the data loaded into SQLite from these huge XML files.  Nokogiri is fast (well, Ruby fast), but a 200 MB XML file is pretty big.  It was nice to get acquainted with Nokogiri’s pull parser, though.
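For readers who want to try the same loading pattern, here is a rough sketch of streaming a large XML dump into SQLite without holding the whole tree in memory. It uses Python's standard library as a stand-in for Nokogiri's pull parser and DataMapper, so it is not the LODThing code. The element and attribute names (work, workcode, isbn) and the table layout are assumptions made for the example.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_works(xml_source, conn) -> int:
    """Stream <work> elements into SQLite; return the number of ISBN rows added."""
    conn.execute("CREATE TABLE IF NOT EXISTS isbns (work_id TEXT, isbn TEXT)")
    count = 0
    # iterparse yields each element as its closing tag is read,
    # so the document is processed incrementally.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "work":
            work_id = elem.get("workcode")
            rows = [(work_id, isbn.text) for isbn in elem.findall("isbn")]
            conn.executemany("INSERT INTO isbns VALUES (?, ?)", rows)
            count += len(rows)
            # Drop the finished element's children so they can be freed.
            elem.clear()
    conn.commit()
    return count


# Tiny made-up sample in the assumed layout.
SAMPLE = b"""<dump>
  <work workcode="1"><isbn>0000000001</isbn><isbn>0000000002</isbn></work>
  <work workcode="2"><isbn>0000000003</isbn></work>
</dump>"""

conn = sqlite3.connect(":memory:")
loaded = load_works(io.BytesIO(SAMPLE), conn)
print(f"loaded {loaded} ISBN rows")
```

For a 200 MB file the same function would take a file path instead of the in-memory sample; batching the inserts in one transaction, as here, is usually what keeps SQLite loading tolerable.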

There are a few things to take away from this exercise.

  1. When data is freely available, it’s really quite simple to reconstitute it into linked data without any need to depart from your traditional technology stack.  There is nothing even remotely semantic-webby about LODThing except its output.
  2. We now have an interesting set of URIs and relationships to start to express and model FRBR relationships.
  3. The Wikipedia citations data is extremely useful and could certainly be fleshed out more.  One could imagine querying DBpedia or Freebase on these concepts and identifying if the Wikipedia article is actually referring to the work itself and use that.  Right now LODThing makes no claims about the relationships except that it’s a reference from Wikipedia.

LODThing isn’t really intended for human consumption, so there’s no real “default way in”.  The easiest way to use it is to make a URI from an ISBN:

If you know the LibraryThing ‘work ID’, you can get in that way, too:

Also, you can get all of these resources as RDF/XML by replacing the .html with .rdf.
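The suffix swap is simple enough to script. A minimal Python sketch follows; the example.org URI is a placeholder I made up, not a real LODThing address.

```python
def rdf_variant(html_uri: str) -> str:
    """Return the RDF/XML URI for a resource whose HTML URI ends in .html."""
    suffix = ".html"
    if not html_uri.endswith(suffix):
        raise ValueError("expected a URI ending in .html")
    return html_uri[: -len(suffix)] + ".rdf"


# Placeholder URI for illustration only.
print(rdf_variant("http://example.org/thing.html"))
```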

So, Tim, you wrote on the LT API page that you would love to see what people are doing with your data. Here you go. It would be even more awesome if it made its way back into LT; after all, it would alleviate some of the need for you to have a special API for this stuff.

Also, special thanks to Toby Inkster for providing a ton of help in getting this to resemble something that a linked data aware agent would actually want and finally turning the httpRange-14 light bulb on over my head.  He also immediately linked to it from his Amazon API LODifiier, which is sort of cool, too.

I’ll be happy to throw the sources into a github repository if anybody’s interested in them.

by Ross at July 02, 2009 04:39 PM

threepress

¿Qué es Bookworm?

Bookworm is now available in Spanish!

I’m thrilled to finally have this up as Spanish was one of the languages I was most interested in adding.

by liza at July 02, 2009 03:16 PM

Bigwood, David

Report and Recommendations for Moving Image Works, Part 3a: Operational Definitions

News from OLAC.

CAPC's Moving Image Work-Level Records Task Force has completed a draft of its report and recommendations for operational definitions for a sample of five attributes of or roles needed for moving image work/primary expression records.

We started out with the intention to simply write definitions for each term. However, while thinking about these pieces of information in the context of a shared, online database, we decided that it would be useful to investigate at least some types of "data about data" and to consider how we might be able to accommodate different types of data (e.g., both identifiers and textual strings) and deal with different levels of data reliability. We have tried to explain our reasoning and process in the introductory section. We do not believe that this draft has reached its final form yet, but we do think that we have come to a point where it would be useful to get feedback from a larger group on the perceived viability of our general approach. To evaluate the document, you may find it helpful to attempt to create a few sample records using these guidelines.

This section will also include an annotated list of potential sources for work-level information. The secondary sources section is not quite complete, but we hope to issue a draft in the near future.

The draft report is available on the OLAC web site as Part 3a. We will take comments and suggestions on the draft through Friday, July 31.
Comments are sought.

by David (noreply@blogger.com) at July 02, 2009 04:04 PM

JISC Information Environment Team

Do library catalogues and repositories talk to each other?

The Centre for Digital Library Research at the University of Strathclyde is currently investigating the links between university library catalogues and digital repositories as part of a JISC funded study.

Can library users find resources in their university’s digital repository through the library catalogue? Do library catalogues and repositories share records for the same items?

If catalogues and repositories generally aren’t linked in these sorts of ways at the moment, could they be?

These are some of the issues being explored by the study. The team are surveying repository managers and others about now - so, if you receive a request to respond to their online survey, your response would be much appreciated.

For further information about the study, please visit the project’s Web site and the project’s Web page on the JISC site.

by Ben Wynne at July 02, 2009 01:27 PM

How do people use electronic information resources?

Research funded by the JISC, RIN and others over recent years has helped to increase understanding of how students and researchers use electronic information resources. Analysis of Web logs - such as the work done for the e-Books Observatory Study by CIBER at UCL - has proved a fruitful line of inquiry.

A new study - which has now been underway for a few months (so apologies for this late post) - seeks to add to this evidence through detailed observation of how individual students and researchers in Business and Economics use a number of information resources in their area (such as Business Source Premier).

The aim is to observe how individuals react to and use particular interfaces and then to explore those behaviours through structured interviews.

The work is being conducted by Middlesex University and is being complemented by an analysis of Web logs for a selection of Business and Economics e-books and e-journals by CIBER.

A report of the findings is expected during the autumn.

For further information, please visit the project Web page on the JISC Web site.

by Ben Wynne at July 02, 2009 01:08 PM

Spalding, Tim (Thingology)

Categories for your LTFL Reviews

Teen reviews from Seattle Public Library
We've added a new feature to LibraryThing for Libraries, suggested by Lare over at the Seattle Public Library. He was looking for a way to show off just some of their reviews: those for their summer reading program.

Libraries can now add "categories" for their reviewers to check off—library book club books, Big Read books, reviews by library staff, etc. And the library can show off just one category of reviews in their LTFL blog widget.

Seattle has made blog-widget pages for their kids section, teen section, and even their adult section of the site. By categorizing the reviews into age-related groups, they can feature items in their catalog that would interest the patrons for each demographic.

We'll be releasing some more cool features at the American Library Association meeting in Chicago next week.

by Sonya (noreply@blogger.com) at July 02, 2009 12:46 PM

Powell, Andy and Johnston, Pete

Investigating the "Scott Cantor is a member of the IEEE problem"

The UK Access Management Federation and other similar initiatives worldwide provide a SAML-based single sign-on solution for access to online resources for the education and research community. Typically, a user must sign on to their home institution, using their local username and password, before being granted access to a remote online resource. In the main, this spares the user from having to remember a separate username and password for each online resource that they wish to access. However, there is a perceived problem: some users have several affiliations (their university, their employer, the NHS, their professional body, etc.), each of which may grant access to a different set of online resources, and online services currently cannot make seamless decisions about which resources a given user is entitled to access because they lack knowledge of these multiple affiliations.
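The gap described above can be pictured with a small Python model. This is an illustration of the problem only, not of SAML or Shibboleth; the organisation and resource names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affiliation:
    organisation: str
    resources: frozenset


def visible_at_login(login_org: str, affiliations) -> frozenset:
    """Resources a service can infer when it only sees the sign-on institution."""
    for affiliation in affiliations:
        if affiliation.organisation == login_org:
            return affiliation.resources
    return frozenset()


def full_entitlement(affiliations) -> frozenset:
    """Resources the user is entitled to across every affiliation."""
    entitled = frozenset()
    for affiliation in affiliations:
        entitled |= affiliation.resources
    return entitled


# Invented example: one user, two affiliations.
user_affiliations = [
    Affiliation("university", frozenset({"journal-a", "journal-b"})),
    Affiliation("professional-body", frozenset({"standards-library"})),
]

seen = visible_at_login("university", user_affiliations)
missed = full_entitlement(user_affiliations) - seen
print(sorted(missed))
```

The set difference is the access the user holds but the service cannot see when the user signs on through the university alone.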

We have recently funded Simon McLeish at LSE to undertake an investigation into this area, commonly known as the Scott Cantor is a member of the IEEE problem. (Scott Cantor is lead developer of the Shibboleth software and an editor of the SAML 2.0 specification).  This investigation will try to discover the extent of this problem in UK HE - who is affected, how serious stakeholders perceive it to be, and what is expected from a solution - in order to inform future work in this area.

More information about this study can be found through the project's Wiki. As usual, the final report will be made openly available to the community under a Creative Commons licence.

by Andy Powell at July 02, 2009 10:07 AM

Bigwood, David

Funding

The Office for Information Technology Policy (OITP) of the American Library Association (ALA) has released Fiber to the Library: How Public Libraries Can Benefit from Using Fiber Optics for their Broadband Internet Connections. It "articulates the benefits of fiber optic technology for public libraries and strategies to obtain such fiber connectivity. An important goal of this policy brief is to help applicants include “fiber to the library” in their federal broadband stimulus funding proposals under the American Recovery and Reinvestment Act (ARRA)."

My local library, Helen Hall, is receiving Energy Efficiency and Conservation Block Grant funds to get a new cooling system.

by David (noreply@blogger.com) at July 02, 2009 10:02 AM

State Library of Denmark


As part of our quest to optimize the speed of our search front end, I recently tried out the Yahoo js and css minifier, YUI Compressor.

At first glance the nice things about the YUI Compressor are that it is Java based (we are a Java friendly team), open source and fairly easy to work with. The YUI Compressor handles both javascript and css, but in this post I have chosen to focus on the js part.

The test integration into my IDE (IntelliJ IDEA) and the project was quite easy because somebody has taken the time to write YUIAnt. I just downloaded the YUI Compressor version 2.4.2 and the YUIAnt.jar, added them to the project, and modified my build scripts to run the compressor when the website is deployed to the web server. The beauty of this is that you don’t have to look at the minified javascript when editing, and if you for some reason want to debug the code at run time you can easily set up a debug option in your build script and bypass the compressor for on-the-fly debugging. If you aren’t into all this build script stuff or have a simple project, there are lots of online YUI Compressor sites out there where you can paste your js code or css and get a compressed version in return.
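The deploy step with a debug bypass might look roughly like the following Python sketch. It calls the YUI Compressor jar from the command line rather than through the YUIAnt task used here, so treat it as an illustration of the idea only. The file names are placeholders, and passing --nomunge (minify without renaming variables) is my assumption about the configuration.

```python
import shutil
import subprocess
from pathlib import Path


def compressor_command(jar: Path, source: Path, target: Path) -> list:
    """Build the command line for minifying one js file without obfuscation."""
    return [
        "java", "-jar", str(jar),
        "--type", "js",
        "--nomunge",          # minify only, keep variable names
        "-o", str(target),
        str(source),
    ]


def deploy_script(source: Path, target: Path, jar: Path,
                  debug: bool = False, run=subprocess.run):
    """Copy the file untouched in debug mode, otherwise run the compressor."""
    if debug:
        # Debug deployments skip the compressor so the readable source is served.
        shutil.copyfile(source, target)
        return None
    command = compressor_command(jar, source, target)
    run(command, check=True)
    return command


# Placeholder paths for illustration only.
print(compressor_command(Path("yuicompressor-2.4.2.jar"),
                         Path("search.js"), Path("search.min.js")))
```

The `run` parameter exists only so the function can be exercised without Java installed; a real build would leave it at the default.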

The version 2.4.2 of the YUI Compressor nearly worked without problems. For some reason – I didn’t bother to investigate further – the YUI Compressor had some issues with unterminated Strings in the jscalendar-1.0 library. I just excluded the directory and went on with my small non scientific test using Firebug as my test environment.

The first screen shot shows the size and load times for our js files. Business as usual – the YUI Compressor is disabled.

[Screenshot: Firebug view of js file sizes and load times, YUI Compressor disabled]

The next screen shot shows the size and load times for the same js files now with the compression enabled.

[Screenshot: Firebug view of js file sizes and load times, YUI Compressor enabled]

The file sizes have been reduced and the overall load time has shrunk by approximately half a second. When the file sizes are very small the load times are very sensitive to queuing effects, but the file size is in most cases reduced. In the case of bigger js files the improvement in speed as well as size is clear. I have tried to compensate for caching effects in both cases (compressed/not compressed). It seems that there is about a 20-25% reduction in file size and approximately the same reduction in load time for the js. These numbers are without using the obfuscation option (reduction of variable names to the shortest possible length and other tricks), simply because I don’t think we will be comfortable with this knowing that it might cause errors.
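For anyone repeating the comparison, the reduction figure is just the relative difference between the two measurements. A small Python sketch, with byte counts that are invented for the example and not read from the screenshots:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative saving, in percent, when a value drops from before to after."""
    if before <= 0:
        raise ValueError("the original measurement must be positive")
    return (before - after) / before * 100


# Hypothetical sizes in bytes for one js file, before and after minifying.
original_size = 40000
minified_size = 31000

saving = percent_reduction(original_size, minified_size)
print(f"{saving:.1f}% smaller")
```

The same function works for load times, as long as both runs are measured under comparable caching conditions.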

As I am new to this I am interested to hear about any major drawbacks compressing/minifying may have.

This is of course a small step and not something which alone makes the difference between a slow and a fast site but I am hoping that attention to a number of different optimization issues will make a big difference in the long run.

by Jørn Thøgersen at July 02, 2009 06:12 AM

Lederman, Sol (Federated Search Blog)

Science source selection

My fur was raised when I saw Serials Solutions’ claim that their discovery service was an evolutionary step beyond federated search. I raised my concerns a couple of times: here and here. My beef isn’t with Serials Solutions as a business, it’s with their position that it’s fine to not search content that they don’t provide access to. There’s no room (yet) in their discovery service model to include access to quality content that can only be searched live, i.e. via federated search. Carl Grant joined the conversation and various people commented, making the topic a very lively one.

My concern was, and is, that libraries and research organizations would consider giving away their responsibility to select quality sources for their patrons for what I imagine to be two primary reasons: (1) library patrons don’t like to wait 30 seconds for federated search results, and (2) (possibly) cost savings. I don’t have a lot of sympathy for the Google generation. Even though I’m an American and my culture has taught me that immediate gratification is a good thing I think 30 seconds is a small price to pay to see better results. Cost I can’t speak to as I don’t have any figures.

One of my colleagues pointed me to an article by scientist and writer Michael Nielsen, Is scientific publishing about to be disrupted?, which only strengthens my belief that access to content from aggregators only supplements access via other methods such as federated search.

Michael Nielsen is a very accomplished scientist. His bio lists some of his impressive credentials:

Michael Nielsen is one of the pioneers of quantum computation. Together with Ike Chuang of MIT, he wrote the standard text on quantum computation. This is the most highly cited physics publication of the last 25 years, and one of the ten most highly cited physics books of all time (Source: Google Scholar, December 2007). He is the author of more than fifty scientific papers, including invited contributions to Nature and Scientific American. His research contributions include involvement in one of the first quantum teleportation experiments (related), named as one of Science Magazine’s Top Ten Breakthroughs of the Year for 1998, quantum gate teleportation, quantum process tomography, the fundamental majorization theorem for comparing entangled quantum states, and critical contributions to the formula for the quantum channel capacity. A full list of papers is here.

Nielsen’s article argues that there is impending disruption of scientific publishing. The article is fascinating, Nielsen is a compelling and well-informed writer, and I recommend you read the fairly long article and, if you have time, follow at least some of the numerous links. I also want to add that I had the opportunity to spend some time with Nielsen at a conference he helped to organize at the Perimeter Institute, and I very much appreciate how incredibly down to earth the man is.

What I found most valuable in Nielsen’s writing were various examples of science being published in non-traditional ways.

One example is Nielsen’s response to a New York Times editorial about the death of newspapers. Here’s a snippet from the editorial:

There’s a great deal of good commentary out there on the Web, as you say. Frankly, I think it is the task of bloggers to catch up to us, not the other way around… Our board is staffed with people with a wide and deep range of knowledge on many subjects. Phil Boffey, for example, has decades of science and medical writing under his belt and often writes on those issues for us… Here’s one way to look at it: If the Times editorial board were a single person, he or she would have six Pulitzer prizes…

And here’s Nielsen’s poignant response:

[The New York Times editorial piece] demonstrates a deep commitment to high-quality journalism, and the other values that have made the New York Times great. In ordinary times this kind of commitment to values would be a sign of strength. The problem is that as good as Phil Boffey might be, I prefer the combined talents of Fields medallist Terry Tao, Nobel prize winner Carl Wieman, MacArthur Fellow Luis von Ahn, acclaimed science writer Carl Zimmer, and thousands of others. The blogosphere has at least four Fields medalists (the Nobel of math), three Nobelists, and many more luminaries. The New York Times can keep its Pulitzer Prizes.

Nielsen’s point is clear. The blogosphere is a tremendous resource to scientists. Libraries and research organizations miss huge amounts of valuable and current resources if they only provide access to content from major publishers (or their aggregators). I do realize that the writings of probably all of the bloggers that Nielsen mentioned are available through Google and might not make sense to federate. The problem with searching Google for excellent science is that you need the time and discernment to find the good stuff. But, however one might access science content, the power of traditional publishers is waning, which is a really good reason not to depend on them for all the science worth reading.

Here’s another excerpt from Nielsen’s article, this one on innovative ways to communicate science that are sprouting up everywhere:

What’s new today is the flourishing of an ecosystem of startups that are experimenting with new ways of communicating research, some radically different to conventional journals. Consider Chemspider, the excellent online database of more than 20 million molecules, recently acquired by the Royal Society of Chemistry. Consider Mendeley, a platform for managing, filtering and searching scientific papers, with backing from some of the people involved in Last.fm and Skype. Or consider startups like SciVee (YouTube for scientists), the Public Library of Science, the Journal of Visualized Experiments, vibrant community sites like OpenWetWare and the Alzheimer Research Forum, and dozens more. And then there are companies like Wordpress, Friendfeed, and Wikimedia, that weren’t started with science in mind, but which are increasingly helping scientists communicate their research.

These Web 2.0 science offerings, at least the ones that provide an API or other mechanism for efficient search, are prime candidates for federation as they constantly generate new content.

One last quote from Nielsen. I very much enjoyed the great examples Nielsen packed into this paragraph of outstanding science being found in blogs of all places.

It’s easy to miss the impact of blogs on research, because most science blogs focus on outreach. But more and more blogs contain high quality research content. Look at Terry Tao’s wonderful series of posts explaining one of the biggest breakthroughs in recent mathematical history, the proof of the Poincare conjecture. Or Tim Gowers’ recent experiment in “massively collaborative mathematics”, using open source principles to successfully attack a significant mathematical problem. Or Richard Lipton’s excellent series of posts exploring his ideas for solving a major problem in computer science, namely, finding a fast algorithm for factoring large numbers. Scientific publishers should be terrified that some of the world’s best scientists, people at or near their research peak, people whose time is at a premium, are spending hundreds of hours each year creating original research content for their blogs, content that in many cases would be difficult or impossible to publish in a conventional journal. What we’re seeing here is a spectacular expansion in the range of the blog medium. By comparison, the journals are standing still.

At SLA 2009, Abe delivered a presentation: A Journey to 10,000 sources. The talk was about (this blog’s sponsor) Deep Web Technologies’ efforts to search initially hundreds, then thousands, and eventually 10,000 sources. The accompanying paper makes this important argument for making a wider range of science information available to researchers:

By relying on only the content available from the major publishers and aggregators, researchers miss other important content, in particular the output of scientists who do not publish in mainstream journals. The world is shrinking, the brain pool is growing, and the output of science is everywhere.

While one may argue about the merits of federation vs. crawling and indexing vs. discovery services, those arguments frequently focus on the technological merits of particular approaches. The more important question, I think, is what information is worth your while to see. For most of us that information can’t all be federated, or all indexed, or all provided to us by a discovery service. I think federated search will continue to evolve into this hybrid being where multiple technologies are enlisted to give scientists what they need.

by Sol at July 02, 2009 03:21 AM

Chudnov, Dan

THATCamp 2009

Another THATCamp has come and gone and it was, again, a lot of fun. I've grown used to the dynamics of an unconference in the past five years or so because that's the kind of event I attend most of the time, now. JCDL 2009 was the first academic conference I'd attended in years, and though I enjoyed it as well and met a lot of interesting people and learned some useful stuff, it was missing the energy the mix of people at a good unconference can generate. And, though I feel like a self-important prig as I write this, I hated that though I'd made the effort to attend, there was no chance for me to get up and show off some stuff I'd worked on in front of the group. I use software that lets a user become a committer; I value friendships that let a student become a teacher; I attend conferences that let an attendee become a presenter. Take out that dynamic and it's nowhere near as compelling.

Because it features this principle, as any good unconference does, the best part of THATCamp is the people. Both years I've met so many fascinating people and learned about so much amazing work that it's taken the whole week following for my brain to settle back down and follow up on all the threads left dangling on Sunday afternoon like so many thesis topics. There's talk of franchised THATCamps to be staged in Austin and London among other places, and that's exciting. There's a #thatcamp channel on freenode that threatens to become a regular hangout. I've got about 50 more people I'm following on twitter, all of whom already fill my screen with fascinating stuff to read and look at all day, and some of them are even following me, too. What more could you ask for?

Well, there are a few things. I think there are a few tweaks to the formula that could improve the event a bit. I offer these only in the hopes of making THATCamp even better, not to complain or kill anybody's leftover buzz.

  1. Shorter sessions. This year the sessions were 1:15 long; for intense topics that engage everybody in the room that's what you need to give everybody a chance to go deep. But for open-ended discussions where there's as much airing of concerns about how "this needs to happen" and "we have to do that", 1:15 is about 25 minutes too long. It might have just been the sessions I chose this year, but it seemed like I was in more of the latter type sessions than the former, and that was a bit of a let down. Also, there were as many as five or six sessions running concurrently in several slots on the first day, any three or four of which I would've liked to sit in on. Tightening the schedule could allow for more time blocks and cut down on the number of simultaneous tracks.
  2. More hacking. When you go from having Bill Turkel teaching people how to fire code into an Arduino and the Omeka developers teaching how to write plugins and even me doing a simple tutorial on how to make little colorful balls dance around on screen with Processing one year to basically none of that the next year, it's a bit of a drop off to somebody like me who likes to learn by doing, especially in realtime at a moment when I'm jazzed up by all the amazing people and ideas in the air.

    We talked about this a bit in #thatcamp on IRC last night - maybe if the sessions were a bit shorter and there were fewer concurrent tracks, one of the extra rooms could be a "hackin' room" or some such. Sorta like the chillout room at a rave with plenty of water and comfy couches where people can take a breather but, er, well, the exact opposite of that.

    It might just be that I'm a little bit disappointed in myself for not prepping a hackier topic myself. I put a lot of time into hacks just for THATCamp last year and it was great fun pulling them off. I'd like to think that it was fun for the people in the room with me, too, and either way I learned a lot from the experience and I hope that was mutual. This year I was burned out on conference travel and work and didn't have the extra cycles to put something fun and new together, and I'm sorry I didn't. If I get to go again, I promise to do whatever I can to bring the hackin' back in!

  3. Let us do our own scheduling. This is probably the biggest one. At the Foo Camp I went to the intro evening session ended with everybody mingling around big schedule boards where times, topics, and rooms get worked out among the attendees in realtime. It's messy and takes a while but it ends with drinks and everybody's just happy to be bumping into all the other fascinating people around them anyway so it serves as a nice icebreaker, too. At THATCamp, CHNM staff instead comb through ideas posted in advance to the blog and group and sort and lump and split topics into sessions with titles that don't necessarily match what the idea-posters had in mind. I wanted to talk about improving web sites with linked data but where do I go to talk about that in this schedule? "Standards"? "Publishing"? "Software Development"? "Libraries and Web 2.0"? (that's where I went, and did a bit of the talk, but I'm not sure my topic was what everybody else there had in mind, and I know I wasn't alone in this mode of confusion).

    By cutting out this dynamic let-the-people-do-it-themselves step you minimize opportunities for catchy titles to draw people in, for people to negotiate whether or not they should merge their own topics, and for people to simply get to know each other and decide which other people they want to be sure to hear from and hang out with right off the bat. And imho you maximize confusion about which sessions to go to and where you can find the people you want to hear from.

    I'd advocate for filling out a big whiteboard with a schedule with people putting the names of their talks and their names with it and leaving a good 60-90 minutes to work it all out. On a real board or on paper (vs. online), so we'd have to occupy the same physical space. With drinks nearby.

    I know Jeremy put a ton of work into scheduling because I caught him in the act when I arrived late so I know it was no trivial feat. I just think opening it up would be easier on @clioweb and @digitalhumanist and better for the rest of us too.

  4. Three word intros. Another nice thing they did at Foo was *very* brief intros of everybody in the room: your name, your affiliation, and *just* three words about who you are or what you're into. Mine would be: "Dan Chudnov, Library of Congress, One Big Library". It's a chance to put names to faces, it's another friendly icebreaker, and it's a chance for all of us 140-charsmiths to be clever.
  5. The schedule. Maybe it might help to have an evening meeting the night before for the welcoming session, the scheduling, and maybe one or two lead talks to kick things off. Then everybody can go get dinner or drinks and talk and think about what's coming the next morning and maybe work on their slides or demos or whatever overnight. You'd know when your slot is the next day, and which sessions you want to be sure to get to.

I don't want to be all "they do it better at Foo Camp" but these last few points really do reflect things that Foo Camp does a little better that I think THATCamp could adopt to make it just that much better.

And not to repeat myself, but I offer all this up with the hope of leading folks to think about various ways to make a great event even greater. I ain't complaining - the organizers do a great job making a lot of people with diverse backgrounds comfortable in a terrific space with plenty of coffee and wifi and surprisingly good food and nicely designed t-shirts and as long as they'll have me, I'll keep applying to attend again. It's just that I'm a bit of a hacker at heart and I'm always thinking about little optimizations, so take this as nothing more than that.

I hope to see y'all again next year, or even sooner - and next time you're in DC please stop by LC to say hi if you like.

by dchud at July 02, 2009 03:16 AM

del.icio.us

almost.at - Following People at Real-World Events in Real-Time

"Following People at real-world events in real-time". Pulls in from twitter, flickr, youtube, twitpic, tinyurl, bitly. Looks useful for remote-following of conferences!

by keyvowel at July 02, 2009 01:25 AM

The Code4Lib Journal - Using a Web Services Architecture with Me, Myself and I

by laurientaylor at July 02, 2009 01:13 AM

July 01, 2009

del.icio.us

The Archivist: Save And Export Twitter Searches Before They Go Away - Opinions - MIX Online

"The Archivist is a Windows application that runs on your local system and allows you to archive tweets for later data-mining and analysis for a given search.""If you leave The Archivist open, it will update with the latest results every 10 minutes. You can also close The Archivist and open it later. The Archivist will save the tweets and get all the tweets it can since that search."

by keyvowel at July 01, 2009 11:24 PM

Hellman, Eric

Linked Data Heresy? Under the Hood at AdaptiveBlue

Have you ever watched a web server log? Thirteen years ago, I was starting up a scientific e-journal, and it was very gratifying to watch the monitor and see the traffic coming in from all over the world. Occasionally I would turn on the referrer log to see where people were coming from. One time, I was surprised to see that somebody in Poland was coming to my e-journal site from a Russian web site with "xxx" in the URL. Curious about what sort of site might be linking to my e-journal, I checked out the site, and found it to be about blond, naked women. I wasn't sure about what this indicated about my e-journal. Perhaps the Polish scientists found the e-journal and the xxx site equally stimulating? Perhaps their boss had just walked into the room, and they needed a work-oriented internet site to cover their other browsing?

My perspective on the privacy of my internet browsing changed that day. I've become mildly paranoid about things that might spy on me. I am very selective about the Facebook apps that I load, for example, but I don't bother to flush my browsing history or block web bugs or things like that. I enjoyed finding out "what Google knows about me" (post it to Facebook and tag your friends to do the same!). I really worry about Firefox extensions (or "Add-ons"), because I know how extremely powerful and/or intrusive they can be. Even so, the 3 or 4 things I add to Firefox are the main reason I don't use Safari, despite its integration advantages. I'm not surprised that IE and Safari have declined to support practical extension mechanisms; they're sort of scary. On the other hand, Firefox Add-ons have presented very few spyware-related problems; this is due in part to the fact that they must be written in Javascript and delivered as source. It's relatively easy to go and open an Add-on and inspect its code, so if an Add-on does something other than what it says it does, it's likely that sooner or later someone will discover the truth.

A really interesting Firefox Add-on called "Glue" is being offered by a venture-funded company called AdaptiveBlue. (no relation whatsoever to my company, Gluejar, Inc.) Glue watches you browse the internet and when it sees you on one of a set of sites that it knows about, it reports the pages you're on to AdaptiveBlue, enabling them to construct a "Social Network of Things", where the Things might be Books, Music, Products, Wine, Companies, etc.


Overall there are over 300 sites that the Glue Add-on does something with. A lot goes on in Glue, and I didn't take the time to sort everything out. For example, when you go to a topic page in Wikipedia or a book page in WorldCat, or a stock page in Yahoo Finance, the url that you visited is reported to AdaptiveBlue. Usually, the Add-on then slides down a Glue header which tells you about what the Glue Social Network thinks about the Thing you are looking at. Personally, I find this very distracting, and I don't plan to continue using Glue, but I can imagine that many people will appreciate the consistent interface to the social network and other services that is presented. Other sites handled by glue include LibraryThing, Epicurious, Last.fm, ESPN, theStreet, ToysRUs, Expedia, GameSpy, Metacritic, WineLibrary, Flixster, Connotea, Flickr, Technorati, Walmart and eBay, just to name a few. It was very difficult to find the official list of sites that Glue works with on the GetGlue web site; I wish the AdaptiveBlue people were more upfront about exactly what they do on these sites. Nonetheless, the Add-on appears to do what it says it does. I also would like to see the user given more control over the sorts of things that are reported to AdaptiveBlue- I'm much more relaxed about sharing my Wine and Sports browsing than I am about my Wikipedia and Stocks browsing. And I really don't want to share my Russian XXX site browsing!

It's interesting to compare Glue to the OpenURL linking services that have been almost universally adopted in libraries. (I developed one of the first OpenURL link servers, which is now owned by OCLC, Inc.) Like Glue, the OpenURL link servers present users with relevant information and links to services surrounding "things" which are typically journal articles or books. One library that I worked with even used a social network to connect users to other users who had viewed the same item, just like Glue. There was even a Firefox Add-on developed that routed "thing" links to link servers. The link server vendor community worked with publishers closely to enable OpenURL linking; although AdaptiveBlue promotes its "SmartLinks", I doubt that many of the sites Glue is aware of understand what they are doing.

Glue makes heavy use of Amazon web services, including the product information web service, the SimpleDB service and the S3 simple storage service. It's smart these days to outsource scalability and concentrate on your application's functions. Glue also makes nice use of the Dojo and Mochikit Javascript toolkits. In browsing the code, I noticed that many of the problems it addressed were exactly the same ones we encountered developing Linkbaton 9 years ago, and the solutions look quite similar (in other words, I think the developers have done a pretty good job!) except that the tools available today are so much more advanced than what we had to work with 9 years ago.

Given that AdaptiveBlue makes a big deal about the Semantic-ness of its technology, I was surprised to find out how it identifies "Things". The canonical way to identify a Thing on the semantic web is to give it a URI, and then attach properties to it. When I spoke with AdaptiveBlue founder and CEO Alex Iskold at the Semantic Technology Conference, he told me that they only use title and author strings to define book Things. In fact, they bundle these strings into keys (such as books/cryptonomicon/neal_stephenson), then use the keys as if they identified a book, when in the real world, it's more complicated. So the "Things" in the AdaptiveBlue "Social Network of Things" are entities that do not correspond to books, but rather correspond to descriptions of books. Interestingly, this is exactly the approach taken in OpenURL URIs, which are really descriptive metadata packages, not entity URIs.
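To make the description-versus-entity point concrete, here is a minimal Python sketch of how a key of this kind might be built from title and author strings. Only the example key books/cryptonomicon/neal_stephenson comes from the conversation above; the helper names slug and thing_key and the normalisation rules are assumptions for illustration, not AdaptiveBlue's actual code.

```python
import re


def slug(text):
    """Lowercase the text and collapse anything non-alphanumeric into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def thing_key(kind, title, author):
    """Build a description-based key such as books/cryptonomicon/neal_stephenson.

    Hypothetical helper: the normalisation rules here are guesses.
    """
    return "/".join([kind, slug(title), slug(author)])


# Two different real-world objects (say, a hardcover and a paperback edition)
# share a title and an author, so they collapse into a single key.
hardcover = thing_key("books", "Cryptonomicon", "Neal Stephenson")
paperback = thing_key("books", "Cryptonomicon", "Neal Stephenson")
print(hardcover)               # books/cryptonomicon/neal_stephenson
print(hardcover == paperback)
```

The key is computed entirely from the description, which is why it cannot tell apart two things that happen to be described the same way.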

The first of Tim Berners-Lee's "Four Rules" for Linked Data is "Use URIs as names for things". Both Glue and OpenURL, which were designed separately as practical solutions for linking to things, seem to break this rule. Instead they build URIs using descriptions of the things, and don't bother naming the things themselves. Maybe Tim BL's first rule is wrong!
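For comparison, the following Python sketch packs the same kind of description into an OpenURL-style link. The keys url_ver, rft_val_fmt, rft.btitle and rft.au are standard OpenURL key/value fields for books; the resolver address and the function name book_openurl are hypothetical, made up for this illustration.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical link-server address; any real resolver would have its own.
RESOLVER = "http://resolver.example.org/openurl"


def book_openurl(title, author):
    """Pack a description of a book into an OpenURL-style query string."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.btitle", title),
        ("rft.au", author),
    ]
    return RESOLVER + "?" + urlencode(params)


link = book_openurl("Cryptonomicon", "Neal Stephenson")
print(link)

# Decoding the link gives back a description and nothing else:
# there is no identifier for the book itself anywhere in it.
description = parse_qs(link.split("?", 1)[1])
print(description["rft.btitle"], description["rft.au"])
```

As with the Glue key, the URI is built from what is known about the thing rather than from a name assigned to it.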

by Eric Hellman (noreply@blogger.com) at July 01, 2009 10:50 PM

Clarke, Kevin

Howto: Saving an XCF with Layers to a PDF with Pages

I’m surprised that there isn’t an easier way to go from a Gimp file (.xcf) to a PDF.  Sure, you can always “print to pdf” if you are working with a single layer image, but what if you have a multi-layer image that you want to turn into a PDF with multiple pages (each page being a layer from the image)?

Here is one way that I’ve found to accomplish this.  I’m using Ubuntu so any install stuff will be specific to that distribution, but the software I’m using should work on any Linux distro.

First, you’ll need Gimp.  I’m assuming that’s already installed.

Gimp won’t save a multi-layer image to a .ps, .tif, or .pdf by itself, though, so you need to install a script called “Save Layers as Individual Files” (this script can be downloaded for Gimp 2.4 or newer from Panotools).

Once you download this script it needs to be put in your Gimp scripts directory.

unzip -d ~/.gimp-2.6/scripts Save-layers-tiff-24.zip

Your scripts directory may be named something else if you are using another version of Gimp (other than 2.6). Once the script is in that directory, it will appear in the Script-Fu > Utils menu within Gimp (and can be applied to any open image).

Next, you need to install imagemagick.  If you don’t already have it installed, it’s as easy (on Ubuntu) as:

sudo aptitude install imagemagick

Once that is installed, you’ll be able to use the mogrify program which comes with ImageMagick. From within the directory that contains all your TIF files, type:

mogrify -format pdf *.tif

This will generate PDFs for each of your TIF files. You can then merge all the PDF files into one using a program called PDFTK. To install that, just type:

sudo aptitude install pdftk

Running that program is as easy as typing:

pdftk filename*.pdf cat output singlename.pdf

The filename*.pdf argument will catch all the individually named files created by the mogrify program (filename1.pdf, filename2.pdf, filename3.pdf, filename4.pdf, etc.)

And, that’s it! You can open your new singlename.pdf file and have all those XCF layers now represented by individual pages within the PDF. This is the easiest way that I’ve found to accomplish this task, but if you know of a better/easier way I’d love to hear it!
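One caveat with the filename*.pdf glob: the shell typically expands it in plain alphabetical order, so with ten or more layers filename10.pdf lands before filename2.pdf and the pages come out shuffled. Below is a minimal Python sketch of one way around that, assuming the files are numbered as in the example above. The helper names natural_key and pdftk_command are made up for illustration, and the script only builds the pdftk command line rather than running it.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


def pdftk_command(pdf_names, output="singlename.pdf"):
    """Build the pdftk argument list with the pages in numeric order."""
    return ["pdftk", *sorted(pdf_names, key=natural_key), "cat", "output", output]


names = ["filename10.pdf", "filename2.pdf", "filename1.pdf"]
print(" ".join(pdftk_command(names)))
# pdftk filename1.pdf filename2.pdf filename10.pdf cat output singlename.pdf
```

If pdftk is installed, the returned list can be handed to subprocess.run as-is; zero-padding the layer names (filename01, filename02, ...) before exporting would avoid the problem without any script.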

by ksclarke at July 01, 2009 09:00 PM

Morgan, Eric Lease

Mass Digitization Mini-Symposium: A Reverse Travelogue

The Professional Development Committee of the Hesburgh Libraries at the University of Notre Dame sponsored a “mini-symposium” on the topic of mass digitization on Thursday, May 21, 2009. This text documents some of what the speakers had to say. Given the increasingly wide availability of free full text information provided through mass digitization, the forum offered an opportunity for participants to learn how such a thing might affect learning, teaching, and scholarship. *

Setting the Stage

Presenters and organizers

After introductions by Leslie Morgan, I gave a talk called “Mass digitization in 15 minutes” where I described some of the types of library services and digital humanities processes that could be applied to digitized literature. “What might libraries be like if 51% or more of our collections were available in full text?”

Maura Marx

The Symposium really got underway with the remarks of Maura Marx (Executive Director of the Open Knowledge Commons) in a talk called “Mass Digitization and Access to Books Online.” She began by giving an overview of mass digitization (such as the efforts of the Google Books Project and the Internet Archive) and compared it with large-scale digitization efforts. “None of this is new,” she said, and gave examples including Project Gutenberg, the Library of Congress Digital Library, and the Million Books Project. Because the Open Knowledge Commons is an outgrowth of the Open Content Alliance, she was able to describe in detail the mechanical digitizing process of the Internet Archive with its costs approaching 10¢/page. Along the way she advocated the HathiTrust as a preservation and sharing method, and she described it as a type of “radical collaboration.” “Why is mass digitization so important?” She went on to list and elaborate upon six reasons: 1) search, 2) access, 3) enhanced scholarship, 4) new scholarship, 5) public good, and 6) the democratization of information.

The second half of Ms. Marx’s presentation outlined three key issues regarding the Google Books Settlement. Specifically, the settlement will give Google a sort of “most favored nation” status because it prevents Google from getting sued in the future, but it does not protect other possible digitizers the same way. Second, it circumvents, through contract law, the problem of orphan works; the settlement sidesteps many of the issues regarding copyright. Third, the settlement is akin to a class action suit, but in reality the majority of people affected by the suit are unknown since they fall into the class of orphan works holders. To paraphrase, “How can a group of unknown authors and publishers pull together a class action suit?”

She closed her presentation with a more thorough description of the Open Knowledge Commons agenda, which includes: 1) the production of digitized materials, 2) the preservation of said materials, and 3) the building of tools to make the materials increasingly useful. Throughout her presentation I was repeatedly struck by the idea of the public good the Open Knowledge Commons was trying to create. At the same time, her ideas were not so naive as to ignore the new business models that are coming into play and the necessity for libraries to consider new ways to provide library services. “We are a part of a cyber infrastructure where the key word is ‘shared.’ We are not alone.”

Gary Charbonneau

Gary Charbonneau (Systems Librarian, Indiana University - Bloomington) was next and gave his presentation called “The Google Books Project at Indiana University”.

Indiana University, in conjunction with a number of other CIC (Committee on Institutional Cooperation) libraries, has begun working with Google on the Google Books Project. Like many previous Google Book Partners, Charbonneau was not authorized to share many details regarding the Project; he was only authorized “to paint a picture” with the metaphoric “broad brush.” He described the digitization process as rather straightforward: 1) pull books from a candidate list, 2) charge them out to Google, 3) put the books on a truck, 4) wait for them to return in a few weeks or so, and 5) charge the books back into the library. In return for this work they get: 1) attribution, 2) access to snippets, and 3) sets of digital files which are in the public domain. About 95% of the works are still under copyright and none of the books come from their rare book library — the Lilly Library.

Charbonneau thought the real value of the Google Book search was the deep indexing, something mentioned by Marx as well.

Again, not 100% of the library’s collection is being digitized, but there are plans to get closer to that goal. For example, they are considering plans to digitize their “Collections of Distinction” as well as some of their government documents. Like Marx, he advocated the HathiTrust but he also suspected commercial content might make its way into its archives.

One of the more interesting things Charbonneau mentioned was in regards to URLs. Specifically, there are currently no plans to insert the URLs of digitized materials into the 856 $u field of MARC records denoting the location of items. Instead they plan to use an API (application programmer interface) to display the location of files on the fly.

Indiana University hopes to complete their participation in the Google Books Project by 2013.

Sian Meikle

The final presentation of the day was given by Sian Meikle (Digital Services Librarian, University of Toronto Libraries) whose comments were quite simply entitled “Mass Digitization.”

The massive (no pun intended) University of Toronto library system consisting of a whopping 18 million volumes spread out over 45 libraries on three campuses began working with the Internet Archive to digitize books in the Fall of 2004. With their machines (the “scribes”) they are able to scan about 500 pages/hour and, considering the average book is about 300 pages long, they are scanning at a rate of about 100,000 books/year. Like Indiana and the Google Books Project, not all books are being digitized. For example, they can’t be too large, too small, brittle, tightly bound, etc. Of all the public domain materials, only 9% or so do not get scanned. Unlike the output of the Google Book Project, the deliverables from their scanning process include images of the texts, a PDF file of the text, an OCRed version of the text, a “flip book” version of the text, and a number of XML files complete with various types of metadata.
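As a back-of-the-envelope check on those figures, the short Python calculation below works out what the quoted rates imply. The 500 pages/hour, 300 pages/book, and 100,000 books/year numbers are from the talk; reading 500 pages/hour as a per-machine rate, and the 2,000 working hours per machine per year, are assumptions made here, so the machine count is only illustrative.

```python
PAGES_PER_HOUR = 500            # quoted scanning rate (assumed to be per machine)
PAGES_PER_BOOK = 300            # quoted average book length
BOOKS_PER_YEAR = 100_000        # quoted yearly throughput
HOURS_PER_MACHINE_YEAR = 2_000  # assumption: roughly one full-time shift

pages_per_year = BOOKS_PER_YEAR * PAGES_PER_BOOK          # 30,000,000 pages
machine_hours = pages_per_year / PAGES_PER_HOUR           # 60,000 hours
machines_needed = machine_hours / HOURS_PER_MACHINE_YEAR  # 30 machines

print(f"{pages_per_year:,} pages/year")
print(f"{machine_hours:,.0f} machine-hours/year")
print(f"about {machines_needed:.0f} machines at {HOURS_PER_MACHINE_YEAR:,} hours each")
```

Under those assumptions the quoted throughput works out to about 30 million pages a year, which is more than a single machine could manage at that rate.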

Considering Meikle’s experience with mass digitized materials, she was able to make a number of observations and distinctions. For example, we — the library profession — need to understand the difference between “born digital” materials and digitized materials. Because of formatting, technology, errors in OCR, etc, the different manifestations have different strengths and weaknesses. Some things are more easily searched. Some things are displayed better on screens. Some things are designed for paper and binding. Another distinction is access. According to some of her calculations, materials that are in electronic form get “used” more than their printed form. In this case “used” means borrowed or downloaded. Sometimes the ratio is as high as 300-to-1. There are three hundred downloads to one borrow. Furthermore, she has found that proportionately, English language items are not used as heavily as materials in other languages. One possible explanation is that material in other languages can be harder to locate in print. Yet another difference is the type of reading one format offers over another; compare and contrast “intentional reading” with “functional reading.” Books on computers make it easy to find facts and snippets. Books on paper tend to lend themselves better to the understanding of bigger ideas.

Lastly, Meikle alluded to ways the digitized content will be made available to users. Specifically, she imagines it will become a part of an initiative called the Scholar’s Portal — a single index of journal article literature, full text books, and bibliographic metadata. In my mind, such an idea is the heart of the “next generation” library catalog.

Summary and Conclusion

The symposium was attended by approximately 125 people. Most were from the Hesburgh Libraries of the University of Notre Dame. Some were from regional libraries. There were a few University faculty in attendance. The event was a success in that it raised the awareness of what mass digitization is all about, and it fostered communication during the breaks as well as after the event was over.

The opportunities for librarianship and scholarship in general are almost boundless considering the availability of full text content. The opportunities are even greater when the content is free of licensing restrictions. While the idea of complete collections totally free of restrictions is a fantasy, the idea of significant amounts of freely available full text content is easily within our grasp. During the final question and answer period, someone asked, “What skills and resources are necessary to do this work?” The answer was agreed upon by the speakers, “What is needed? An understanding that the perfect answer is not necessary prior to implementation.” There were general nods of agreement from the audience.

Now is a good time to consider the possibilities of mass digitization and to be prepared to deal with them before they become the norm as opposed to the exception. This symposium, generously sponsored by the Hesburgh Libraries Professional Development Committee, as well as library administration, provided the opportunity to consider these issues. “Thank you!”

Notes

* This posting was originally “published” as a part of the Hesburgh Libraries of the University of Notre Dame website, and it is duplicated here because “Lots of copies keep stuff safe.”

by Eric Lease Morgan at July 01, 2009 05:23 PM

Bigwood, David

Cataloging & Classification Quarterly

Call for Papers....

Cataloging & Classification Quarterly

CCQ welcomes the submission of research, theory, and practice papers relevant to the broad field of bibliographic organization.

This journal, published now 8 times a year by Taylor & Francis, LLC, is respected as an international forum that emphasizes research and review articles, description of new programs and technologies relevant to cataloging and classification, and considered speculative articles on improved methods of bibliographic control for the future.

Articles are particularly welcome in areas dealing with research-based cataloging practice, including user behavior, user needs and benefits.

Authors are encouraged to submit manuscripts via email with attached word document to the Editor, Sandra K. Roe, Bibliographic Services Librarian, Illinois State University (email: skroe@ilstu.edu).

Special Issues
Colleagues interested in guest editing a special issue or expanded double issue are invited to contact the Editor with a general proposal, tentative schedule, and CVs. Previous special issues have included:

Annual Best Paper Award
Taylor & Francis sponsors an annual prize for CCQ with a small financial stipend for the Best Paper of the Year.

Free Print Sample
A free print specimen copy may be obtained by sending an email to marisa.starr@taylorandfrancis.com.

For More Details
Further details may be found at the CCQ home page.

by David (noreply@blogger.com) at July 01, 2009 05:12 PM

Rochkind, Jonathan

What librarians do


So I just gave (or co-gave) a presentation here on Umlaut as deployed here as our Find It service.

One of the most exciting parts to me was that various (non-IT) librarians in the room, un-prompted, started throwing out ideas of what it could do in the future. Quite good ideas. I had to resist the techie’s urge to respond to them with “Well, yeah, but see, that’s harder than it might seem to make work like that…”, and instead try to be encouraging and positive, because it was great to have such a conversation. We hardly ever have such conversations.

Why? I think because usually a non-technical librarian has absolutely no way to put such innovative thoughts into practice. As Karen Schneider talked about in her 2007 Code4Lib Keynote, libraries have ended up outsourcing a significant part of their core business to vendors, in a way that we pay for it, and we get it, and we pretty much take what we get.

My experience made me realize today that one of the (many) negative side effects of this is that librarians have lost the opportunity (and thus been implicitly ‘trained’ not to even bother trying) of doing what librarians should be doing in this era when so many of our services are delivered over the web: Figuring out how to make these services meet our users’ needs better!

Contrary to popular belief, you can’t just let your users tell you what your services will be. Sure, of course you need to listen to your users. And if you listen and observe very carefully, you can figure out what your users’ needs are, some of which they may not even be able to articulate themselves, but others of which they most certainly can. But you can’t count on your users to identify the best solutions to these needs. That’s what we’re for, that’s why we’re professionals!

And, to me at least, it’s one of the most interesting and rewarding parts of our jobs.

But the outsourcing of much of the library’s business to vendors has taken the opportunity to do that away from most of us — an IT geek like me in a library that lets him get away with it still has some. Most non-IT librarians have had it reinforced that they shouldn’t even bother. And while you have to be an IT type to implement new online services or features, you shouldn’t have to be one to be engaged in dreaming up and planning them.

One thing open source can do is return this power to us. I’m pretty pleased that Umlaut (and my ability to explain it) is finally at the point where its future potential can be seen enough to encourage non-technical librarians to start suggesting “Hey, but what if it could do this and that too? Wouldn’t that be great?”

And, if I can somehow find the time amongst the way too many really great things that I’d like to do if I had time, maybe soon it will!

Posted in General

by jrochkind at July 01, 2009 03:52 PM

Bigwood, David

von Braun Collection

Digital librarians and archivists might be interested in this. NASA is seeking ideas on how to analyze and catalog notes from Wernher von Braun into an electronic system.

On the eve of the 40th anniversary of the historic first moon landing, NASA is seeking ideas from the public, academia, and industry about how to analyze and catalog notes from spaceflight pioneer Wernher von Braun into an electronic, searchable database or other system.

Von Braun was the first director of NASA's Marshall Space Flight Center in Huntsville, Ala., and a key figure in the development of the Saturn V rocket and NASA's Apollo program. NASA has a full collection of "Weekly Notes" von Braun wrote during the 1960s and 1970s. These notes were used to track programmatic and institutional issues at Marshall, and are considered by many historians to be a valuable source of data.

NASA has issued a request for information and is looking for concepts that will provide an innovative resource for agency engineers and scientists, as well as researchers in academia

by David (noreply@blogger.com) at July 01, 2009 04:01 PM

Powell, Andy and Johnston, Pete

RESTful Design Patterns, httpRange-14 & Linked Data

Stefan Tilkov recently announced the availability of the video of a presentation he gave a few months ago on design patterns (& anti-patterns) for REST. I recommend having a look at it, as it covers a lot of ground and has lots of useful examples, and I find his presentational style strikes a nice balance of technical detail and reflection. If you haven't got time to listen, the slides are also available in PDF (though I do think hearing the audio clarifies quite a lot of the content).

One of the questions that this presentation (and other similar ones) planted at the back of my mind is that of how some of the patterns presented might be impacted by the W3C TAG's httpRange-14 resolution and the Cool URIs conventions for distinguishing between what it calls "real world objects" and "Web documents", some of which describe those "real world objects". The Cool URIs document focuses on the implications of this distinction on the use of the HTTP protocol to request representations of resources, using the GET method, but does not touch on the question of whether/how it affects the use of HTTP methods other than GET.

In the early part of his presentation, Stefan introduces the notion of "representation" and the idea that a single resource may have multiple representations. Some of the resources referred to in his examples, like "customers" (slide 16 in the PDF; slide 16 in the video presentation), when seen from the perspective of the Cool URIs document, fall, I think, into the category of "real world objects" - things which may be described (by distinct resources) but are not themselves represented on the Web. So, following the Cool URIs guidelines, the URI of a customer would be a "hash URI" (URI with fragment id) or a URI for which the response to an HTTP GET request is a 303 redirect to the (distinct) URI of a document describing the customer.

But what about non-"read-only" interactions, and using methods other than GET? The third "design pattern" in the presentation is one for "resource creation" (slide 55 in the PDF; slide 98 in the video presentation). Here a client POSTs a representation of a resource to a "collection resource" (slide 50 in the PDF; slide 93 in the video presentation). The example of a "collection resource" used is a collection of customers, with the implication, I think, that the corresponding "resource creation" example would involve the posting of a representation of a customer, and the server responding 201 with a new URI for the customer.

I think (but I'm not sure, so please do correct me!) that the implication of the httpRange-14 resolution is that in this example, the "collection resource", the resource to which a POST is submitted, would be a collection of "customer descriptions", and the thing posted would be a representation of a customer description for the new customer, and the URI returned for the newly created resource would be the URI of a new customer description. And a GET for the URI of the description would return a representation which included the URI of the new customer.

Restcool

(In the diagram above, http://example.org/customers/123 is the URI of a customer; http://example.org/docs/customers/123 is the URI of a document describing that customer.)

And, finally, a GET for the URI of the customer (assuming it isn't a "hash URI") would - following the Cool URIs conventions - return a 303 redirect to the URI of the description.
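The interactions described above can be sketched as a small Python model. It is only an illustration of this reading of httpRange-14, not a definitive implementation: the two URIs are the ones used in the diagram, while the collection URI http://example.org/docs/customers, the handle function and the "about" property are assumptions made for the example.

```python
import itertools

BASE = "http://example.org"
descriptions = {}                 # description path -> stored representation
_next_id = itertools.count(123)   # start at 123 to match the example URIs


def handle(method, path, body=None):
    """Return (status, headers, body) for a request, per the reading above."""
    if method == "POST" and path == "/docs/customers":
        # The collection holds customer descriptions, so the new resource
        # is a description (a Web document), not the customer.
        customer_path = f"/customers/{next(_next_id)}"
        doc_path = "/docs" + customer_path
        descriptions[doc_path] = dict(body, about=BASE + customer_path)
        return 201, {"Location": BASE + doc_path}, None
    if method == "GET" and path in descriptions:
        # A description is a Web document: serve a representation directly.
        return 200, {}, descriptions[path]
    if method == "GET" and "/docs" + path in descriptions:
        # The customer is a "real world object": redirect to its description.
        return 303, {"Location": BASE + "/docs" + path}, None
    return 404, {}, None


status, headers, _ = handle("POST", "/docs/customers", {"name": "Alice"})
print(status, headers["Location"])
print(handle("GET", "/customers/123")[:2])
print(handle("GET", "/docs/customers/123")[2])
```

In this sketch the POST returns 201 with the URI of the new description, a GET on the description returns a representation that includes the customer's URI, and a GET on the customer's URI returns a 303 pointing back to the description.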

There is some discussion of this in a short post by Richard Cyganiak, and I think the comments there bear out what I'm suggesting here, i.e. that POST/PUT/DELETE are applied to "Web documents" and not to "real-world objects".

The comment by Leo Sauermann on that post refers to the use of a SPARQL endpoint for updates - the SPARQL Update specification certainly addresses this area. It talks in terms of adding/deleting triples to/from a graph, and adding/deleting graphs to/from a "graph store". I think the "adding a graph to a graph store" case is pretty close to the requirement that is being addressed by the "post representation to Collection Resource" pattern. But I admit I struggle slightly to reconcile the SPARQL Update approach with Stefan's design pattern - and indeed, he highlights the "endpoint" notion, with different methods embedded in the content of the representation, as part of one of his "anti-patterns", their presence typically being an indicator that an architecture is not really RESTful.

I should emphasise that I'm trying to avoid seeming to adopt a "purist" position here: I recognise that "RESTfulness" is a choice rather than an absolute requirement. However, interest in the RESTful use of HTTP has grown considerably in recent years (to the extent that some developers seem keen to apply the label "RESTful", regardless of whether their application meets the design constraints specified by the architectural style or not). And now the "linked data" approach - which of course makes use of the httpRange-14 conventions - also seems to be gathering momentum, not least following the announcement by the UK government that Tim Berners-Lee would be advising them on opening up government data (and his issuing of a new note in his Design Issues series focussed explicitly on government data). It seems to me it would be helpful to be clear about how/where these two approaches intersect, and how/where they diverge (if indeed they do!). Purely from a personal perspective, I would like to be clearer in my own mind about whether/how the sort of patterns recommended by Stefan apply in the post-httpRange-14/linked data world.

by PeteJ at July 01, 2009 02:00 PM

Nodalities (Talis)

Interesting semantic web stuff

By Tom Scott
| This guest post originally appeared on Tom Scott’s blog; republished under CreativeCommons License, and with kind permission of the author.

It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing, and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other things are going on, all rather exciting. Below is a round-up of some of the best. But if you don’t know what I’m talking about you might like to start off with TimBL’s talk at TED.

TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he’s working on:

The Guardian has an article on the appointment.

Closer to home there have been a few interesting developments

Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009) looks at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. It won best paper of the in-use track – congratulations to Silver and Georgie.

The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink [welcomebackstage.com]
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.

A social semantic BBC?
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” all built on people’s existing social graph. People meet content via activity.

PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
Including an interview with me!

You should also check out…

sameas.org a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly it’s also licensed under CC0, which means all copyright and related or neighboring rights are waived.

Image: “Semantic Web Rubik’s Cube” by dullhunk, CC License, via flickr

by admin at July 01, 2009 01:45 PM

eIFL-FOSS Blog

ACCESS2008: Library Technology Conference, 2-4 October 2008 - report

Maybe it's just me but sometimes I need to recharge my batteries. Here is my solution: spend a couple of days with energized library technologists, FOSS developers, and systems librarians. Well, I did say that maybe it's just me. Fortunately my batteries got a full charge this week at Access2008, Canada's première library technology conference, which was being hosted just down the road from me by McMaster University. The librarians attending Access2008 totally get the need to take a holistic approach to ICT in libraries. And they mostly get FOSS as well. In fact I think I met more dedicated proponents of FOSS in libraries over the course of this conference than I had ever known existed.

One of the highlights for me was the opportunity to see keynote speaker Karen Schneider, whose blog has long been a must for librarians concerned with technology. Karen is now Community Librarian for Equinox Software which is the principal support company for Evergreen, a FOSS ILS. I thoroughly enjoyed her talk entitled Open++ - dispatches from the OSS frontlines. Karen was sharing some of the pluses (or "++" - which signals praise and potential karma points in the IRC channels that library technology geeks frequent) and a few minuses of her task of explaining open source on the ground in libraries. It is no small task to set out to demystify the FOSS community and ethos, but it is all part of the effort to spread the word about Evergreen.

Perhaps it is just the nature of the Access conferences, or maybe it is a reflection of the state of libraries in North America at the moment, but I found FOSS everywhere I turned. Dale Askey of Kansas State University gave a great talk about the anxieties some of us have about letting people see our code, and the real need to get it out there. Eric Lease Morgan spoke about his MyLibrary project at the University of Notre Dame. Walter Lewis and Slavkio Manojlovich spoke about the partnership between AlouetteCanada and OurOntario.ca. All of these are FOSS efforts, naturally.

Other FOSS-relevant talks were given by a whole panel of librarians demonstrating their various uses of the Drupal content management system, and I was astounded by the simplicity and elegance of LibX (which started as a Firefox plugin but is also available for IE). Karen Coombs from the University of Houston gave a great presentation on the extremely modular approach she takes there for library services, disavowing monolithic solutions and instead knitting her library web space together with contributions from both proprietary and FOSS components. And of course one of the talks I was most keen on hearing was that of John Fink of McMaster University and Dan Scott of Laurentian University on progress in the Conifer project, which brings together a number of Canadian university libraries in one very large Evergreen instantiation. Dan, of course, is no stranger to eIFL.net, having led the Evergreen training component of the eIFL-FOSS ILS project working in Armenia earlier this year. The news on Conifer is that progress is going well and the current expected date for all of these libraries to "go live" with Evergreen is the spring of 2009.

Evergreen did tend to be ever present at this conference. But other FOSS ILSs were also heard from. At least one group of public libraries located in the Ontario hinterland have decided to band together and share expertise on Koha.

Of course this conference wasn't entirely about or for FOSS in libraries. Access2008 is a conference for library technologists and there were lots of other solutions being canvassed. But perhaps it is only human for the most exciting buzz to come from IT solutions that librarians are creating for themselves that they can share with their peers. Thus one appeal of open source, perhaps.

I haven't followed news of mass digitization projects closely so perhaps I was the only one astounded by the talk by Jonathan Bengston and Sian Meikle of the University of Toronto on the mass digitization project going on there. I confess I had no idea of the scale of this. It is immense. Literally thousands of books are being digitized on a daily basis. This is impressive even as merely a feat of organization. But the results were also impressive. Sadly this mammoth effort has a shaky foundation now that Microsoft has decided to end its funding. But it certainly gave us food for thought about what is possible with sufficient resources.

The conference was rounded off after two and a half days with an inspiring talk from Bob Young, famed local entrepreneur and co-founder of RedHat.com and Lulu.com. Bob is always good value as a speaker but I found him especially insightful today as he contrasted his life in technology firms with one of his current roles as owner of a professional sports franchise, the Hamilton Tiger Cats.

I should finish this conference report with a mention of something that happened the day before the conference began: Hackfest. Hackfest is a day-long event in which librarians and programmers gather, divvy up a problem set, and set to work. You might say, it is the very spirit of what Access2008 is all about. You might also be wondering just how much real development work can actually get done in a day. The answer: lots! I was consistently impressed as the various groups that had worked together reported back during the conference. Here, for example, is Dan Scott's blog post on his Hackfest activity in which he was sorting out how to use Zotero with Evergreen. Cool!

My thanks to the organisers of Access2008. My batteries are re-charged. Full steam ahead!

by randy-m at July 01, 2009 11:41 AM

Greenstone training workshop, Nairobi, Kenya, 22-26 September 2008 - report

(Guest blog post from Amos Kujenga, National University of Science and Technology (NUST) Library, Zimbabwe)

The week before last, Misheck Nyaluso and I were at the University of Nairobi where we conducted a 5-day Greenstone Workshop (from Monday 22 - Friday 26 September). The event was sponsored by UNESCO (Nairobi Cluster Office), organised by the Kenya Information Preservation Society (KIPS), and held at the Jomo Kenyatta Memorial Library. Several people spoke at the opening ceremony and of interest was the presence of Mrs Jacinta Were, the eIFL country coordinator for Kenya. She was also part of a Steering Committee which in 2005 was involved in the initial feasibility study for the establishment of a Greenstone Support Organisation for Africa.
 
A total of about 24 participants (mostly librarians) were trained and given attendance certificates on the closing day. We borrowed a bit from the Lesotho workshop style by concentrating on the general DL issues on the first day. This was of great benefit to some of the participants who (believe it or not) thought Greenstone was some scanning software! It was also interesting to note how digital libraries have been so closely associated with scanning that people sometimes fail to realize that there are many, many collections that they can build from "born digital" material.

KIPS has to date produced a Greenstone CD-ROM of a list of abstracts of Theses and Dissertations about Kenya. In fact, most of the participants were drawn from organisations that contributed content towards this collection. There was also a demo of a Greenstone CD-ROM of articles on Gender Issues from Kenyan newspapers by the Kenya Indexing Project.

With KIPS playing a leading role, there's much potential for big-time Greenstone projects in Kenya, more so since they've already set the pace by virtue of their existing collection. They expressed great zeal to establish a network of "Greenstoners" in Kenya and, judging from the performance of some of the participants, the future looks bright. If KIPS can work closely with other organisations, e.g., the local eIFL consortium (Kenyan Libraries and Information Services Consortium - KLISC), much can be achieved to build an effective user and support network. We also continually encouraged the participants to play an active role on the sagreenstone discussion list, in addition to using the other technical support resources.

We also had Zoe Cormack from the Rift Valley Institute giving a brief talk on their work on the Sudan Open Archive project. This was an eye opener for many, who picked up ideas on how to handle complex scanning/digitisation issues.

The UNESCO representative, Mr Hezekiel Dlamini (to whom we're quite grateful for inviting us to assist in running this workshop) also indicated willingness to have an advanced workshop - which, however, would only be for those institutions that would have evidence of some work with Greenstone.

On the whole, we had an interesting week in Nairobi, not to mention the confusion on the roads! To quote one taxi driver, some tourist once exclaimed, "Anyone who can drive in Nairobi for a month without a scratch deserves an international driving license!"

by randy-m at July 01, 2009 11:37 AM

State Library of Denmark

wwws


The code for our search front end has over time grown to a considerable size and we have started to suspect that the web site’s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end – especially when the underlying search engine Summa has proven to be blazingly fast.

There are a lot of things we could do better such as:

1. Optimizing the JavaScript code by trawling through the lot and removing redundancy as well as rewriting some of the methods to be more efficient.

2. A thorough cleanup of the CSS. There is a lot we can do here as we have loads of redundancy, classes which are not in use anymore and declarations which could be handled way cooler. Another thing I noticed is we like divs – loads of divs.

3. Taking a critical look at our numerous DOM transformations. Some of them are downright unnecessary.

4. General optimizing of the server side code. In fact this part isn’t all that bad but a general clean up once in a while doesn’t hurt anybody.

Because my summer holiday is coming up soon I have chosen to start with some lightweight stuff. I have tried out the newest version of the YUI Compressor, a tool to compress/minify JavaScript and CSS. As we don’t use minifying at the moment we should be able to benefit from it performance-wise. In order not to clutter up this post I will write up my experience with this in a separate post soonish.

by Jørn Thøgersen at July 01, 2009 11:15 AM

Styles, Rob

Why hash tags are broken, and ideas for what to do instead.

I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness. Andy was looking for discussion on how to solve the problem of ambiguity in hash tags [...]

by Rob Styles at July 01, 2009 10:30 AM

Future Archives (Bodleian Library)

Waxwork Accessions

I decided the other day that it would be useful to have a representative accession or two to play with. This way we could test for scalability and robustness (in dealing with different file formats, crazy filenames, and the like) of the various tools that will make up BEAM and also try out some of our ideas regarding packaging, disk images and such.

It isn't really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!

The way I saw it, I needed three things to create the accession:

  1. A list of files and folders that formed a real accession
  2. A set of data that could be used - real documents, images, sound files, system files, etc.
  3. Some way of tying these together to create an accession modelled on a real one but containing public data
Fortunately Susan already had a list of the files that made up a 2GB hard drive from a donor - created from the forensic machine - which I thought would be a good starting point. Point 1 covered!

Next question was where to get the data. My first thought was to use public (open licensed) content from the Web - obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn't just dealing with nice, obvious "documents". The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities - "~$oto ", "~~S", "!.bk!" are just some examples.

It occurred to me that I needed a more diverse source of files - most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!

In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I've nothing against XP, it just isn't so good a software development environment as Linux), but it struck me it'd make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just "document" content.

(I also thought I could try creating a file system with a representative set of files to choose from - dlls from Windows 95 disks, etc. - but that would mean some manual collation of said files. This may be where I go next!).

So, 1 and 2 covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:

C:\WINDOWS\SYSTEM32\ABC.DLL

I wanted to grab any file with a ".DLL" extension from my data source (the XP disk). Any random file would do, rather than one that matched the accession: the randomness is likely to cause problems when it comes to processing this artificial accession, and problems are what we really need in order to test something.
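The matching rule itself is small enough to sketch. The snippet below is in Python rather than the Java I actually used, and is only an illustration of the idea: group the source files by lower-cased extension, then pick one at random for each path in the accession list. The function names are made up for this example, and an in-memory dictionary stands in for whatever index does the lookup.

```python
import random
from collections import defaultdict
from pathlib import PureWindowsPath


def index_by_extension(source_paths):
    """Group candidate source files by lower-cased extension (e.g. '.dll')."""
    index = defaultdict(list)
    for path in source_paths:
        # PureWindowsPath accepts both '\\' and '/' separators, so it copes with
        # Windows-style accession paths and Linux mount paths alike.
        index[PureWindowsPath(path).suffix.lower()].append(path)
    return index


def pick_stand_in(accession_path, index, rng=random):
    """Return a random source file with the same extension, or None if none exists."""
    ext = PureWindowsPath(accession_path).suffix.lower()
    candidates = index.get(ext)
    return rng.choice(candidates) if candidates else None
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which is handy if the same artificial accession has to be rebuilt later.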

This suggested I needed a way to ask my file system "What do you have that has '.dll' at the end of the path?". There were lots of ways to do this - and here is where Linux shines. We have 'find', 'locate', 'which', etc. on the command line to discover files. There is also 'tracker' that I could have set indexing the XP filesystem. In the end I opted for Solr.

Solr provides a very quick and easy way to build an index of just about anything - it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you're unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.
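For anyone who hasn't seen one, a Solr update message is just an `<add>` element wrapping one or more `<doc>` elements made of named fields. Here is a rough Python sketch of building one for a single file. The field names ("id", "filename", "extension") are assumptions for illustration and would have to match whatever the Solr schema actually declares; the POST itself is left out because the update URL depends on the installation.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def solr_add_document(path):
    """Build a Solr <add><doc>...</doc></add> message describing one file path."""
    p = PurePosixPath(path)
    # Hypothetical field names; they must exist in the Solr schema being used.
    fields = (
        ("id", path),
        ("filename", p.name),
        ("extension", p.suffix.lower().lstrip(".")),
    )
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

Building the message with an XML library rather than string concatenation means odd filenames (ampersands, angle brackets) get escaped properly.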

I installed the Solr Web app., tweaked the configuration (including setting the data directory because no matter what I did with the environment and JNDI variables, it kept writing data to the directory from which Tomcat was started!), and then started posting documents for indexing to it. The document creation and POSTs were done with a simple Java "program" (really a script, and I could've used just about any language, but since we're mostly using Java and I'm trying to de-rust my Java skills, I figured why not do this with Java too). Indexing around 140,000 files took about 15 minutes (I've no idea if that is good or not).

(Renhart suggested an offshoot of this indexing too - namely the creation of a set of OS profiles, so that we can have a service that can be asked things like "What OS(es) does a file with SHA-1 hash XYZ belong to?" - enabling us to profile OSes and remove duplicates from our accessions).
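As a thought experiment, such a profile service could start out as nothing more than a map from SHA-1 digest to the set of OSes a file with that digest ships with. The Python sketch below shows that idea only; the function names and the shape of the input (OS name mapped to a list of file contents) are assumptions, and a real version would read files from disk images rather than take bytes in memory.

```python
import hashlib
from collections import defaultdict


def sha1_of(data: bytes) -> str:
    """Hex SHA-1 digest of a file's contents."""
    return hashlib.sha1(data).hexdigest()


def build_profiles(os_files):
    """Map each digest to the OS names it appears in.

    os_files: dict of OS name -> iterable of file contents (bytes).
    """
    profiles = defaultdict(set)
    for os_name, contents in os_files.items():
        for data in contents:
            profiles[sha1_of(data)].add(os_name)
    return profiles


def which_os(profiles, digest):
    """Answer 'What OS(es) does a file with this SHA-1 hash belong to?'."""
    return sorted(profiles.get(digest, ()))
```

An empty answer would mean the file is not known OS material and so is a candidate for keeping; a non-empty one flags it as a likely duplicate of stock system files.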

The final step was to use another Java "program" to kludge together the list of files in the accession with a lookup on the Solr index and some file copying. Then it is done - one accession that mirrors a real live file structure and contains real live files, but none of those files are "private" or a problem if they're lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, aligning more with what we'd expect to get in the future.
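The copying part boils down to turning each Windows-style path from the file list into a location under an output directory and dropping a stand-in file there, however that stand-in was chosen. A rough Python equivalent (not the Java source, and with hypothetical function names) might look like this:

```python
import shutil
from pathlib import Path, PureWindowsPath


def destination_for(accession_path, output_root):
    """Map a path such as C:\\WINDOWS\\SYSTEM32\\ABC.DLL to a location under output_root."""
    win_path = PureWindowsPath(accession_path)
    parts = win_path.parts
    # When the path has an anchor (e.g. 'C:\\'), it is the first part; drop it so
    # the remainder is relative and can be re-rooted under output_root.
    relative = parts[1:] if win_path.anchor else parts
    return Path(output_root).joinpath(*relative)


def copy_stand_in(stand_in, accession_path, output_root):
    """Copy the stand-in file to where the original sat in the accession."""
    target = destination_for(accession_path, output_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(stand_in, target)
    return target
```

The stand-in keeps its own bytes but takes on the name and folder of the file it replaces, which is what makes the result mirror the original structure.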

Hooray! Now gotta pack it into disk images and start exploring processing!

Should anyone be interested, the source code is available for download.

by pixelatedpete (pixelatedpete@gmail.com) at July 01, 2009 10:30 AM

Bigwood, David

Geospatial Information in MODS

The MODS Editorial Committee is looking for community input on geospatial information in MODS.

In considering changes for future versions of MODS, the MODS/MADS Editorial Committee is starting to think about how to better handle geospatial information. Detailed geospatial information in the form of coordinates, etc. is becoming more and more common and can promote many innovative user interactions with resources. Currently MODS has poor support for this information.

The committee would like to bring together use cases for supporting geospatial access to resources from MODS and/or MADS implementations. We are interested both in use cases that you already have in your MODS/MADS implementation and that any local geospatial experts you have access to can provide, to help us inform how MODS and/or MADS should evolve to better handle this information. It should be noted this discussion came to the Editorial Committee from the more specific geospatial elements (latitude/longitude, equinox/epoch) in RDA, although we want to look beyond RDA for guidance in this decision.

So far, we have identified the following use cases for geospatial data: What others can you provide?

Are there more specific use cases both for geospatial *coverage* (what a resource is about or represents) and geospatial *origin* (where a resource is from, for example, a soil sample)? This distinction seems important but it would be useful to understand what is done differently in each case.

There is some question as to whether the appropriate place for this information is MODS or MADS - thoughts on this issue? Should MODS/MADS be looking to embedding or referencing other standards for this information, and, if so, which and where? What is the best balance between functionality (and potentially complexity) and ease of creation/maintenance/use?

We look forward to hearing discussion on this issue - it's a complex but important one that will benefit from community contribution.

by David (noreply@blogger.com) at July 01, 2009 10:17 AM

Tennant: Digital Libraries

"The Flow" Revisited: The Professional Angle

A few years ago I wrote one of my Library Journal "Digital Libraries" columns on the phenomenon of "flow" ("Hustle and Flo...

July 01, 2009 03:45 AM

June 30, 2009

del.icio.us

The Code4Lib Journal

by casstrevino at June 30, 2009 09:58 PM

Brinley, Jonathan

Retirement

A brief personal note: I have left my job at Ball State University so I can have more time to pursue my career as a freelance web developer.

For about two years, now, my wife and I have been building up our business, Adelie Design, Stephanie doing the design, me doing the coding. Business is good—good enough that working nights and weekends isn’t enough anymore (and doesn’t leave enough time to see the family!). So I’m “retiring”, which is to say that I’ll be staying home and working 40 fewer hours each week.

I’m glad to be leaving, but it’s been a pretty good four years at Ball State. To my colleagues and friends there: thank you. I hope to remain involved in libraries, and especially the code4lib community. Hopefully I’ll have some free time I can devote to “fun” projects.

If you, reader, know of anyone needing a web developer or designer, please have them contact me.

by Jonathan Brinley at June 30, 2009 08:00 PM

Murray, Peter

Federal Research Public Access Act Reintroduced

New legislation was introduced in the U.S. Senate last week to support the publication of federally-sponsored research results under open access terms.
Sponsored by Senator Lieberman of Connecticut and co-sponsored by Senator Cornyn of Texas, it mandates open access to author pre-print versions with peer review changes in federally-run repositories within six months of publication. Called S.1373, it is nearly identical to the bill of the same name that these two senators introduced in 2006, which ultimately died in committee. The 2006 version was supported by a wide variety of organizations including the American Library Association, as tracked by the Alliance for Taxpayer Access (ATA).

In his statement on the floor of the Senate introducing the bill, Senator Cornyn described the benefits of the legislation:

Our bill will ask all Federal departments and agencies that invest $100 million or more annually in research to develop a public access policy. Our goal is to have the results of all government-funded research to be disseminated and made available to the largest possible audience. By speeding access to this research, we can help promote the advancement of science, accelerate the pace of new discoveries and innovations, a