Blogs and feeds of interest to the Code4Lib community, aggregated.
February 09, 2010
Prompted by a recent entry, I got a note from Koen Calis, Librarian at Bruges Public Library, about their catalogue, Cabrio. Here is quite a full presentation which covers a range of interesting features:
I was interested in their adaptation of my colleague Robin Murray's synthesise/specialise/mobilise framework to frame the discussion. In his note, Koen remarked that "Bruges Public library considers the horizontal discovery of local resources (heritage collections, community information, courses and events, local advisory data...) to be a very important starting point for redeveloping our library into a local knowledge hub and to enhance participation of the local community".
by dempseyl@oclc.org (Lorcan Dempsey) at February 09, 2010 07:11 PM
John Kirriemuir has issued a request for updated information for his eighth Virtual World Watch "snapshot" survey of the use of virtual worlds in UK Higher and Further Education.
Previous survey reports can be found on the VWW site.
For further information about the sort of information John is after, see his post. He would like responses by the end of February 2010.
Our period of funding for this work is approaching its end, so this will be the last survey funded under the Eduserv Research Programme. John is planning to continue some Virtual World Watch activity, at least through 2010, as he indicates in this presentation which he gave to the recent "Where next for Virtual Worlds?" (wn4vw) meeting in London:
The slides from the other presentations from the wn4vw meeting (including a video of the opening presentation by Ralph Schroeder) are also available here, and you can find an archive of tagged Twitter posts from the day here.
I enjoyed the meeting (even if I'm not sure we really arrived at many concrete answers to the question of "where next?"), but it also felt quite sad. It marked the end of the projects Eduserv funded in 2007 on the use of virtual worlds in education. That grants call was the first one I was involved with after joining Eduserv in 2006, and although it was an area that was completely new to me, the response we got, both in terms of the number of proposals and their quality, seemed very exciting. And I still look back on the 2007 Symposium as one of the most successful (if rather nerve-wracking at the time!) events I've been involved in. As things worked out, I wasn't able to follow the progress of the projects as closely as I'd have liked, but the recent meeting reminded me again of the strong sense of community that seems to have built up amongst researchers, learning technologists and educators working in this area, which seems to have outlived particular projects and programmes. Of course we only funded a handful of projects, and other funding agencies helped develop that community too (I'm thinking particularly of JISC with its Open Habitat project, and the EU MUVEnation project), but it's something I'm pleased we were able to contribute to in a small way.
by PeteJ at February 09, 2010 04:38 PM

(Click image for full size graphic.)
Following the JISC seminar last week on persistent identifiers (#jiscpid on Twitter) there was some discussion about DOI and its role within a Linked Data context. John Erickson has responded with a very thoughtful post, DOIs, URIs and Cool Resolution, which ably summarizes the current problem with DOI: the way the DOI is implemented by the Handle HTTP proxy may not have kept pace with actual HTTP developments. (For example, John notes that the proxy is not capable of dealing with 'Accept' headers.) He has proposed a solution, and the post has attracted several comments.
I just wanted to offer here the above diagram in an attempt to corral some of the various facets relating to DOI that I am aware of. I realize that this may seem like an open invitation to flame on - and this is a very preliminary draft - but ... be kind!
So, this may be totally off the wall but it represents my best understanding of DOI as used by CrossRef.
I have distinguished three main contexts:
- Generic Data - A generalized information context where an object is identified with a DOI, an identifier system that is currently being ratified through the ISO process. This is the raw DOI number. (This definitely is not a first class object on the Web as it has no URI.)
- Web Data - An online information context (here I use the term 'Web' in its widest sense) where resources are identified by URI (not necessarily an HTTP URI). Here DOI is represented under two URI schemes: 'doi:' (unregistered but preferred by CrossRef), and 'info:' (registered and available for general URI use). Also it has a presence on the Web via an HTTP proxy (dx.doi.org) URL where it is used as a slug to create a permalink (as listed at 'A'). A simple HTTP redirect is used (with status code 302) to turn this permalink into the publisher response page http://example/1. (Note that typically a second redirect will occur on the publisher platform, here shown by the redirect to http://example/2.)
- Linked Data - An online information context where resources are identified by HTTP URI and conform to Linked Data principles. Now this is where a tension arises between the common publisher perspective and the strict semantic viewpoint. Implicit in the general Web context given above was the notion that the permalink ('A') was somehow related to the abstract object and the redirection service applied to it associated the abstract resource with concrete representations of the object.
So how do we relate the DOI HTTP URI with the abstract ('work') identifier listed at 'D' in the diagram?
Well the Architecture of the World Wide Web recognizes two distinct classes of resources: Information Resources (IR) and Non-Information Resources (NR). (Note: Only the term 'information resource' is used in AWWW.) IR are those that can be directly retrieved using HTTP, whereas NR are not directly retrievable but have an associated description which is retrievable and is itself a proxy for the real world object.
Either the HTTP URI denotes an IR (as listed at 'B') and is resolved (through HTTP status code '302 Found') to a default representation, which is the view that the Linked Data community would currently have of DOI; but this is at odds with the CrossRef position, which regards DOI as identifying the abstract work. Or, to better fit the CrossRef model of DOI, the HTTP URI would denote an NR (as listed at 'A') which would be resolved (through HTTP status code '303 See Other') to an associated description - a publisher response page.
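The two readings can be made concrete with a small sketch. It assumes only the two status codes named above; the function name `interpret_redirect` and the labels it returns are hypothetical, chosen for illustration, and are not part of any DOI or CrossRef interface.

```python
def interpret_redirect(status_code: int) -> str:
    """Map a redirect status from the DOI proxy to the reading it suggests.

    Hypothetical helper for illustration only; the labels are this sketch's own.
    """
    if status_code == 302:
        # '302 Found': the URI is read as an information resource (IR),
        # and the redirect leads to a default representation.
        return "information resource"
    if status_code == 303:
        # '303 See Other': the URI is read as a non-information resource (NR),
        # and the redirect leads to an associated description.
        return "non-information resource"
    # Any other status says nothing about the IR/NR question in this sketch.
    return "undetermined"


print(interpret_redirect(302))
print(interpret_redirect(303))
```

The point of the sketch is that the status code alone carries the semantic commitment: the same permalink and the same target page read differently depending on whether the proxy answers 302 or 303.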
There will be those self-appointed URI czars who will bemoan the fact of there being multiple URIs. But frankly there is nothing inherently wrong with that. Just as in the real world there are many languages so in the online world there are multiple contexts and histories. We can attempt to make some sense of this by making use of the well-known semantic properties owl:sameAs and ore:similarTo and declare (as also shown in the diagram) the following assertions:
info:doi/D owl:sameAs doi:D .
http://dx.doi.org/D ore:similarTo info:doi/D .
http://dx.doi.org/D ore:similarTo doi:D .
Note that ore:similarTo (stemming from the OAI-ORE work) is a weaker kind of relationship than owl:sameAs (which comes from OWL) and may be appropriate in this usage.
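As a rough illustration, the three URI forms and the assertions above could be generated from a raw DOI as follows. This is a minimal sketch: the function names and the DOI `10.1000/example` are hypothetical placeholders, and no escaping of special characters in the DOI is attempted.

```python
def doi_uris(doi: str) -> dict:
    """Return the three URI forms discussed in the post for a raw DOI."""
    return {
        "doi": f"doi:{doi}",                 # unregistered 'doi:' scheme
        "info": f"info:doi/{doi}",           # registered 'info:' scheme
        "http": f"http://dx.doi.org/{doi}",  # permalink via the HTTP proxy
    }


def doi_assertions(doi: str) -> list:
    """Build the three assertions in the same notation used above."""
    uris = doi_uris(doi)
    return [
        f"{uris['info']} owl:sameAs {uris['doi']} .",
        f"{uris['http']} ore:similarTo {uris['info']} .",
        f"{uris['http']} ore:similarTo {uris['doi']} .",
    ]


# '10.1000/example' is a made-up DOI used only for this sketch.
for line in doi_assertions("10.1000/example"):
    print(line)
```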
In sum, scenario 'A' is what we have currently implemented, scenario 'B' is what might be commonly perceived as being implemented, and scenario 'C' may be a more correct semantic position.
Your comments (and not unkind comments, please;) are more than welcome.
February 09, 2010 02:31 PM
On Wednesday, February 10th, 2-3pm Eastern Standard Time, Chris Kenneally of the Copyright Clearance Center will be hosting a special "Beyond t...
February 09, 2010 12:16 PM
February 08, 2010
For years, I’ve regularly gotten requests from authors and publishers for licenses to reproduce images in books listed on The Online Books Page, or included in the local collection of A Celebration of Women Writers. Sometimes these requests relate to copyrighted books that I list but don’t control rights for; in those cases, I do my best to refer the request to the book’s copyright holder. But often, they’re for images in our own collections, from books published over 100 years ago. In those cases, I respond that the image is in the public domain (and our digitization, which adds no originality, is also in the public domain), so no license is necessary or appropriate.
Usually that response receives a thankful reply, sometimes with signs of surprise that an image can be reused without permission. But sometimes I’ll get back a more alarmed reply. “My publisher says I need a license for every image in my book, or I can’t use it,” it might say, followed by a plea for help in tracking down some long-defunct 19th century publisher.
I wish I could say this was an atypical anecdote. But, if you look around the Web, you’ll find that there are huge numbers of historic images– paintings, photographs, figures, and the like– that are behind access barriers, or closed off altogether from online access, when they don’t have to be. Artstor has over a million images of thousands of years of art that you can’t look at unless you’re at an institution that has a subscription. The fine arts image catalog at my own library has over 100,000 digital images, none of which can be seen online by the public outside of Penn, except in thumbnails. Neither Artstor nor Penn want to keep art away from the public; both are nonprofit educational institutions. But clearing images for free public access on a large scale has to date been impractical for these institutions.
Restrictions on images also create holes in other works. For instance, under the proposed Google Books settlement, images in books that might be under copyright would be blanked out unless the rightsholder to the book also asserted they held the rights in the images. These sorts of omissions can cut the heart out of many works. In a recent New Republic article, “For the Love of Culture”, Lawrence Lessig described how a critical table was omitted in an otherwise free article about his daughter’s possible illness, due to rights-clearance issues. “I could not believe that we were this far down the path to insanity already,” he wrote of the incident.
Part of the insanity is that many of these images from our cultural heritage are actually in the public domain. Many people are aware that copyrights prior to 1923 have expired in the US. But so have many copyrights from later in the 20th century. Pre-1964 copyrights generally had to be renewed 28 years after the start of their term, or they would expire. (Exceptions and further details are described here.) But most copyrights were never renewed; and that’s especially true for images.
In 1923, there were copyright registrations for 3,059 works of art, 1,149 scientific and technical drawings, 7,533 photographs, and 11,289 prints and pictorial illustrations, making a total of 23,030 copyright registrations for these classes of image. In 1951, 28 years later, there were 198 copyright renewals for all of these image classes combined. This represents a renewal rate of less than 1%.
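The arithmetic behind that figure can be checked in a few lines. The numbers are the ones quoted above; the variable names are just for this sketch.

```python
# Copyright registrations for image classes in 1923, as quoted above.
registrations_1923 = {
    "works of art": 3059,
    "scientific and technical drawings": 1149,
    "photographs": 7533,
    "prints and pictorial illustrations": 11289,
}

# Renewals filed in 1951 for all of these image classes combined.
renewals_1951 = 198

total_registrations = sum(registrations_1923.values())
renewal_rate = renewals_1951 / total_registrations

print(total_registrations)    # 23030
print(f"{renewal_rate:.2%}")  # 0.86%
```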
We have just completed posting scans that make all active copyright renewals for artwork viewable online. In fact, once we finish scanning one last batch of renewals for maps and for commercial prints (meaning images created for product packaging and promotion) all active copyright renewals for any type of still image will be viewable online. In later years, the number of image copyright renewals grows slightly, but not by much. But the number of images published in those years grows substantially.
Images without a copyright registration of their own might still be under copyright if they were first published as part of a copyrighted book, newspaper, magazine, or other larger work. Fortunately, we have complete online renewal records for those kinds of works too. It becomes much easier to establish the public domain status of a newspaper photograph, for instance, if you know (as I previously revealed) that no newspaper outside New York renewed copyright for any issue published before the end of World War II.
Having copyright renewals online for artwork is an important step towards freeing the public domain in images. But there’s more needed to make copyright clearance practical at a large scale. Putting scanned renewal records into a searchable database (perhaps combined with fair use image thumbnails) will make it easier to find any copyright renewals that might exist for a particular image. (A similar database for book renewals already exists, and there are more book renewals than image renewals.) Making original copyright registrations available as well (as we now have for artwork through 1949, and soon will have for later years) lets us determine when the copyright for an image began, and whether it was renewed in time to prevent it from expiring.
Furthermore, establishing the history and provenance of images will let us determine when unregistered artwork enters the public domain. Registered or not, the copyright to an image created before 1964 began no later than its first US publication, and the copyright for many such images therefore ended after 28 years due to a lack of renewal. And the mostly-frozen American public domain still includes more work each year that was never published before 2003. On Public Domain Day last month, all such work by artists who died in 1939 entered the public domain in the US. (I won’t get now into the rather baroque rules for establishing “publication” of an artwork, but you can determine it if the history of the image is documented.)
So we have a rich treasure trove of images in the public domain that’s been largely buried under presumptions and uncertainties about copyright. By finding and sharing information about their copyrights, we can protect and enjoy these images in the commons of the public domain, where they can be viewed freely, included in new works, and reused in any way we can imagine. If you find this prospect intriguing, I hope you’ll help bring these images to light.
by John Mark Ockerbloom at February 08, 2010 08:05 PM
If I wasn't an indoctrinated corporate drone I would be a scientist, and indeed, back when I was a wee boy I dreamed of becoming a geologist. Boy, did I know my gray rocks from the slightly lighter gray rocks and so on. I took great delight in walks in nature finding moraines and tills and other long-gone remnants of geological implication (glaciers, mostly), and I could tell rombeporfyr from feltspat and point out the probable processes involved in creating the shapes and colors. It was a glorious time, and I've still got it I think (and I've passed it on to my kids who always make me carry tons of rocks back home ... there's poetic justice if I ever heard it), but nowadays mostly through the local geography (which is interesting in its own right, as the Kiama area holds remnants of several epochs of volcanic activity on top of sandstone, with a strong iron presence. I'll probably make a post about all this sometime in the future).
Knowing something about geology makes you somewhat aware of what's known as geological time, a time frame that spans billions of years. And, as some might suspect, trying to get a grip on what 'billions of years' means for a mere human is a daunting and often failed task. But a rudimentary understanding of geological time and processes also rendered me immune to a lot of otherwise human misunderstanding and nonsense that our cultures have built up over time to explain all that which we didn't understand. So if you understand unstable (i.e. radioactive) isotopes in rocks and their half-life, how they break down (as a figure of speech) from an unstable to a stable form, you have no problem understanding other processes that also run across billions of years, and indeed, run parallel to geological time and processes. And to someone who not only knows a few things about rocks but also those things which you find inside rocks, evolution is not hard to grasp, at least not the tenet that it is right there, in front of you, staring back at you after you chipped that piece of rock off from the rock wall. For me, it was the most natural thing, and indeed sparked my deep interest in all things biological as well.
So for me to read Dawkins' book "Greatest show on earth" was more like reading a dumbed-down defense of something that I thought no one was stupid enough to refute. But, there it was, in the first chapter, a fleshing out that there were indeed idiots out there who just couldn't grasp the most basic notions and evidence, people who actually thought everything we see now has been unchanging for all the time the earth and the universe have existed; about 10,000 years. Huh? *blink* Maybe the sub-title should have tipped me off; "The evidence for evolution", as if we needed more evidence than what was taught in school.
Then I realized that not all the kids I went to school with paid too much attention when such big issues came up. They probably passed the tests and all, but I did not see them engage with (or annoy with too many questions) the teacher the way I think I did, they didn't go out into the woods to climb rocks and find fossils themselves, they didn't study the layers in the side of a deep canyon with a river at the bottom and deduce what was responsible for the canyon, what dug it, how the shape came to be. I guess they ended up not knowing as much, at least not on these subjects.
And that was the greatest shock for me; the world really needs to be convinced that evolution is real!?
It was like someone punched me in the gut; here I was thinking our human species was going places, and then I found out that the truth somehow is in question. Their argument against it is nothing short of a laughing matter, all attributed to the fact that their faith is in disagreement with the science. Ouch. So who do we think is right? The people of faith and no facts, or thousands of scientists working together for hundreds of years on the greatest Utopian adventure humankind has ever ventured on? Oh, the irony.
For me, in short, the book is great; it's well-written, perhaps two notches too intelligent in places (c'mon, references to poetry? Who reads poetry anymore? And it uses a lot of big words), but a tad bit too apologetic as there is nothing excusable about being ignorant by choice (although I understand that this angle is mostly for the US market) and, I feel, just way too soft on the "opposition." These people are clearly not just history deniers, they are outright dishonest about their thirst for truth and knowledge, probably wouldn't know epistemology if it hit them over the head, and cannot fathom that human traits and physiology only make sense in evolutionary terms (have you checked your vestigial parts lately?) or, since the discovery of genetics, the huge amount of science that only works if evolution is true over geological time. I agree that thinking evolution is not true is crazy on a scale of, err, biblical proportions, and as much as this book wasn't for me, I guess there is a strong need for it if there truly are this many nut cases out there who will deny anything if it doesn't sync with their faith or holy book. Weird.
by Alexander Johannesen (noreply@blogger.com) at February 08, 2010 05:56 PM
In order to prepare an upcoming migration, the GWDG scheduled a few hours downtime for the MPG/SFX server on Sunday between 01:00 a.m. and 06:00 a.m.
Please be aware that MPG/SFX services will be unavailable during that time, while the upcoming system is synchronized with the MPG/SFX server in production.
On Wednesday, February 17, after thorough testing, the new system will be switched to production during our regular maintenance hours, which will involve another short downtime.
We apologize for any inconvenience.
by eia at February 08, 2010 04:29 PM
A number of places asked me to provide my first impressions of the survey I did on Cloud Computing in Higher Education.
My first impression is that survey respondents don’t understand what the cloud is, but that shouldn’t have been a shock to me, most people don’t know what it is. Admittedly this lack of understanding could be due to the way the survey was structured; but even after a couple of tweaks to the order of sections (I moved Software to the front of the survey and Platforms and Infrastructure to the end), respondents were still having a hard time understanding what was what. I’ll most likely need to do some clean up on the survey, but I think for now I have a great understanding of what is going on in Higher Education in terms of the cloud.
In the survey I tried to define the three main components of the cloud: Software-as-a-Service, Platform-as-a-Service, and Infrastructure-as-a-Service. After the survey was completed, I realized that I’ll need to do a better job of defining things for the audience members in my NERCOMP presentation. Something that I thought I could spend 10 minutes on will probably need a solid 20 minutes so that everyone can be on the same page. I think diagrams and other visual aids might really help people understand what these different components are, and how they correspond to computing they are already using.
My second impression is that institutions are very comfortable with using Software-as-a-Service. Below is a graph showing SaaS usage among respondents. Facebook is of course the leader in the SaaS cloud race, with Twitter and Google Docs coming in right behind them. What I think is the most interesting though is that overall, Google has the highest share of the marketplace.

As far as PaaS and IaaS, most institutions are not using these services. I’m hesitant to show results from these sections of the survey since so many respondents confused software for platforms or infrastructure. Once I clean things up, I’ll provide more information. Suffice to say though, few if any institutions are using Infrastructures or Platforms in the cloud. Those that are are using Amazon Web Services (for infrastructure) and Google Code (for platform).
More to come later. And thank you to everyone that took the time to take my survey. It has been eye opening and hopefully will make me a more informed speaker.
by Rosalyn Metz at February 08, 2010 03:57 PM
Roy, Bruce and Don have promised to blog their thoughts and post photos from the OCLC API Mashathon that happened today at VALA 2010, but the jet lag was likely catching up with them after a full day of mashing. In the meantime, you can read the tweetstream from the event (and for the rest of the conference) from the Libraries Interact blog.
Excited to see what new ideas Australian and NZ librarians and library developers come up with as a result of this session. If you're attending VALA 2010 and didn't get a chance to attend the Mashathon, stop by the OCLC stand to get your complimentary WorldCat Search API temporary access key. There are 200 available, first come first served! Plus Don, Bruce and Roy will also do short demos at the stand during the conference itself. So even if you couldn't mash today, you can still see what all the excitement is about!
by Alice Sneary at February 08, 2010 03:14 PM
At the closing session of Electronic Resources and Libraries 2010 I had the chance to ask Ross Singer and John Blyberg about the place of microformats and COinS in information organization. Ross had just finished speaking about the importance of linked data. As I recall, John said that microformats, COinS and other semantic markup are important even if they lack links. Providing a machine-readable understanding of a text string is good, and it can lead to links. Ross said that without links, markup is useful today but not a way to move forward. It is a tool for today but not the future. RDFa is the way forward.
The talk was the end of an excellent conference. Well worth attending.
by David (noreply@blogger.com) at February 08, 2010 09:50 AM
This morning DCMI tweeted about the Bibliographic Ontology Specification. New to me.
The Bibliographic Ontology describes bibliographic things on the semantic Web in RDF. This ontology can be used as a citation ontology, as a document classification ontology, or simply as a way to describe any kind of document in RDF. It has been inspired by many existing document description metadata formats, and can be used as a common ground for converting other bibliographic data sources.
by David (noreply@blogger.com) at February 08, 2010 09:35 AM
February 07, 2010
Support for Data Stewardship
A brief article drawing attention to the issue of poor (or non-existent) data stewardship and what happens without open data and peer review of research elements like computer programs.
by mleggott at February 07, 2010 08:55 PM
Main Drivers for Successful Re-use of Research Data
Summary from the report:
On 23-24 September 2009 an international discussion workshop on “Main Drivers for Successful Re-Use of Research Data” was held in Berlin, prepared and organised by the Knowledge Exchange Working Group on Primary Research Data. The main focus of the workshop was on the benefits, challenges and obstacles of re-using data from a researcher’s perspective. The use cases presented by researchers from a variety of disciplines (13 presentations) were supplemented by two keynotes and selected presentations by specialists from infrastructure institutions, publishers, and funding bodies (national and European level, 8 presentations). By choosing this design the workshop was able to provide a critical evaluation of what lessons have been learned concerning sharing and re-using research data from a researcher’s perspective and what actions might be taken to encourage and facilitate more successful re-use. Despite the individual differences characterising the diverse disciplines, it became clear that important issues are comparable. Apart from several technical challenges such as metadata exchange standards and quality assurance it was obvious that the most important obstacles to re-using research data more efficiently are socially determined. It was agreed that in order to overcome this problem more effort should be made 1) to raise awareness and 2) to encourage stakeholders to combine forces in order to support and stimulate sharing and re-use of research data on all levels (researchers, institutions, publishers, funders, governments).
by mleggott at February 07, 2010 08:51 PM
Interesting Research Tool
I had been meaning to look at Diigo for some time now and there are a few things to like. The main one is how it brings together elements from a number of useful research tools, providing a single point of entry. I also like that it brings back what I thought was a brilliant idea many years ago when it surfaced with a Web app called Third Voice - the ability to mark up webpages with your own notes. Worth some time to play despite a few rough edges.
by mleggott at February 07, 2010 08:46 PM
New Report from University of California
A recent report from UC on the scholarly publishing landscape. I haven't had a chance to read it yet but a quick read of the executive summary (see below for an excerpt from the press release) seems to say that academics will go forward (or not) at their own pace, thank you very much. One statement suggests that the new "tech-savvy" grad students and post-grads are making no more use of new technologies for scholarly publishing than their older colleagues. Duh. How else are they going to get tenure? This is evidence of nothing but the same problem that vexes universities from all angles, whether online learning, open data or scholarship - the status quo suits those in the system just fine.
The final report brings together the responses of 160 interviewees across 45, mostly elite, research institutions in seven selected academic fields: archaeology, astrophysics, biology, economics, history, music, and political science. Our premise has always been that disciplinary conventions matter and that social realities (and individual personality) will dictate how new practices, including those under the rubric of Web 2.0 or cyberinfrastructure, are adopted by scholars. That is, the academic values embodied in disciplinary cultures, as well as the interests of individual players, have to be considered when envisioning new schemata for the communication of scholarship at its various stages.
P.S. Note the BePress engine for the report.
by mleggott at February 07, 2010 08:40 PM
I love the CERN library’s message of “Raw bibliographic book data available now!”, framed
1989: TimBL invented WWW at CERN
2009: TimBL calls for “Open Data Now” at TED
CERN is the latest library to share their book data, as CERN emerging technologies librarian Patrick Danowski announced on twitter. The Open Book Data Project is further described on their website and in a youtube video (below) purpose-made for the occasion. The data is dual-licensed as CC0 and PDDL.
This isn’t the first time that library data has been shared with a splash.
After speaking at Code4Lib 2008 (my first Code4Lib conference), Brewster Kahle was presented with MARC records from the Oregon Summit consortium.
In 2007, a number of Library of Congress records were deposited in connection with Scriblio, a faceted catalog Casey Durfee described at Code4Lib2007. Scriblio has gone through several incarnations; the open source Kochief project is the latest.
Further, as Jonathan Gorman and I were discussing in #code4lib earlier this week, there are several collections of MARC records and more donated to Open Library hosted at the Internet Archive. A few are misclassified so also consider keyword searches (‘MARC’ and ‘MARC libraries’) if you’re trying to find all the MARC records that archive.org has.
Linked data in libraries is coming along more slowly; fruit, perhaps, for another post.
Where do you look for bibliographic records? Feel free to leave tips in the comments!
by jodi at February 07, 2010 04:21 PM
We were pleased to welcome Dr Michelle Alexopoulos from the University of Toronto to OCLC last week. Michelle is an economist whose recent research has focused on creating and analyzing new measures of technical change for developed economies.
The abstract of her talk gives a flavor of some of this work, and why it was of interest to us:
Can the patterns of library collections be used to measure economic growth and technological shifts? In this talk, Dr. Alexopoulos will unveil new indicators of technical change that, she argues, resolve many of the problems associated with traditional ones (e.g., research and development (R&D) intensity and patents). Dr. Alexopoulos' measures are primarily derived from previously unutilized information contained in MARC21 records (available from the Library of Congress and OCLC's WorldCat database) on new book titles in various fields of technology over the last century. Further, Dr. Alexopoulos will discuss how the indices are related to inputs into knowledge production (such as scientific advances and R&D), and demonstrate that the measures are closely correlated with the commercialization date of new technologies. Finally, she will highlight a number of questions that the new indicators can help answer. [Presentation splashpage]
We are very interested to see WorldCat data used in this way, alongside other sources of data about book publication and use (books in print data and sales data). It was interesting hearing Michelle describe some of the reasons why books - and library catalog data - were good candidates as an indicator:
- Book publication is linked to changes in knowledge (consider the appearance of manuals, how-to books, ...)
- The timing is right: there is a good correspondence between the date of commercialization of a technology or process and the date of books published about it. This is supported by commercial interests of publishers in catching interest at the right time.
- Library catalogs group books into subject classifications which can be useful for analysis purposes.
We will make the slides and audio of the presentation available soon. Some further details of the approach can be found in these publications:
Incidentally, it was also quite interesting for OCLC colleagues to see an economist talk knowledgeably about the MARC format ;-)
by dempseyl@oclc.org (Lorcan Dempsey) at February 07, 2010 12:57 AM
February 06, 2010
The last few weeks have seen a tremendous increase in interest about ePub. Many new blog posts have been written trying to explain the format. We’ve also seen a big jump in the number of publishers coming to Threepress for help with tricky ePub problems or just asking for guidance about the format. While I’d like to pretend that the growth is due, in part, to a long-anticipated awareness about the benefits of open standards among consumers, publishers, and suppliers, I think it’s more likely that it was Steve Jobs’ explicit mention of ePub support in iBooks on the iPad that drove most of the excitement. What makes me most excited about this groundswell is the sudden interest in ePub from a number of clever developers.
Just in the last few days, details emerged of two new JavaScript ePub readers, rePublish from Blaine Cook (@blaine) and JSEpub (screenshot) from August Lilleaas (@augustl). These two new readers join @liza’s epubjs, which will be a year old on Tuesday. An improved version of epubjs powers the ePub Zen Garden, which helps “dispel the myth that digital books can’t also be crafted works of visual design.”
Why are JavaScript ePub readers interesting? They’re interesting to me for three reasons:
- JavaScript is the most popular programming language in the world and it might be the best way to get more developers interested in creating and tweaking ePub readers.
- JavaScript ePub readers challenge publishers, developers, and book readers to think about what’s most important in delivering a compelling reading experience in a browser. We’ve spent a lot of time thinking about these choices while developing Ibis Reader, which will launch later this month, so I’m eager to see more opinions.
- Building a pure-JavaScript ePub reader requires unzipping in JavaScript, which had no open source implementations until just recently. August has written about and open sourced his critical breakthrough for unzipping files in JavaScript. [Edit: Oops! I was wrong about this one. See the comments for more details.]
Colin Hazlehurst has also published some impressive introductions, tutorials, and code for the .NET/C# crowd at his InsideEpub project and on his blog.
Do you know of other techies making waves with ePub? Please let us know!
(And if you’re one of those publishers who is looking for help, contact us.)
by Keith Fahlgren at February 06, 2010 09:09 PM
Hi. I usually get this out on Fridays, but I hope you don’t miss it because it’s coming out on Saturday this week. Seems like it was a slowish week in FRBRania. The first couple of pieces involve the RDA-L mailing list archives (RDA being, of course, the new cataloguing rules Resource Description and Access) and also Karen Coyle.
Mix and Match: Mashups of Bibliographic Data
Mix and Match: Mashups of Bibliographic Data at the recent American Library Association conference had people from Google talking about Google Books metadata, OCLC talking about ONIX, and the Open Library talking about the Open Library. Eric Hellman was there and wrote it up in Google Exposes Book Metadata Privates at ALA Forum, which a lot of people have been pointing out, including on RDA-L.
Karen Coyle, who was the Open Library person at the session, brought the four FRBR user tasks into a discussion of alphabetical ordering of titles:
In FRBR we have the four user tasks: find, identify, select, obtain. These are fully imbued with the assumption of user knowledge.
“to find entities that correspond to the user’s stated search criteria (i.e., to locate either a single entity or a set of entities in a file or database as the result of a search using an attribute or relationship of the entity);”
This seems to eliminate the possibility that the user could be successful in the library catalog with a need like: “I just finished Twilight and loved it. What else might I like?” Yet that is a legitimate query to bring to the library, and even to the library catalog. Perhaps we should spend some time re-writing the FRBR user tasks, expanding them to meet a wider variety of user needs. Then we could look at our catalogs and say: “What does this mean in terms of catalog functionality?” I maintain that alphabetical order will not be at the top of our list, but will probably appear along some user tasks.
Peter Murray was also there, and wrote it up in Mashups of Bibliographic Data: A Report of the ALCTS Midwinter Forum:
[From the OCLC section.] If there is an exact match for the incoming ONIX record in WorldCat, the WorldCat record is enhanced with certain fields from the ONIX record (descriptions, author biographies, web links) — being careful not to override authority work being done by libraries, but adding enhancements that libraries may not otherwise input. In turn, enhancements from exact match record and FRBR work set records (hardcover versus softcover versus audiobook, etc.) are added to the ONIX record (non-English subject headings, adding a Dewey Decimal Classification (DDC) field from another similar record if one doesn’t already exist, change the author field to an authority-controlled version). If there is not an exact match for the ONIX record in WorldCat, a new WorldCat record is built from the ONIX record and it is subsequently enhanced by metadata found in the FRBR work set records.
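The enrichment rule described in that passage (add publisher-supplied fields only where the library record has nothing, and never overwrite what is already there) can be sketched roughly as follows. This is a minimal illustration in Ruby, not OCLC's actual process: records are plain hashes, and the field names and the `enrich_from_onix` helper are hypothetical.

```ruby
# Hypothetical field names; real ONIX and MARC mappings are far richer.
ENRICHMENT_FIELDS = %w[description author_biography web_links].freeze

# Return a copy of the WorldCat record with gaps filled from the ONIX record.
# Values already present (e.g. library authority work) are never overwritten.
def enrich_from_onix(worldcat_record, onix_record)
  enriched = worldcat_record.dup
  ENRICHMENT_FIELDS.each do |field|
    current  = enriched[field]
    incoming = onix_record[field]
    next if incoming.nil? || incoming.empty?
    enriched[field] = incoming if current.nil? || current.empty?
  end
  enriched
end

worldcat = { 'author' => 'Doe, Jane, 1950-', 'description' => nil }
onix     = { 'author' => 'Jane Doe',
             'description' => 'A publisher blurb.',
             'author_biography' => 'A short biography.' }

enriched = enrich_from_onix(worldcat, onix)
puts enriched['author']       # the library's form of the name is kept
puts enriched['description']  # the gap is filled from ONIX
```

The reverse direction (adding subject headings or a DDC number to the ONIX record) would follow the same fill-the-gaps shape with a different field list.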
RDA-L thread on RDA and granularity
Coyle began the RDA and Granularity thread prompted by a chat at a library conference. As you can see from the archives, it started a big long discussion that changed Subject. Somewhere in there John Myers posted in the Systems v Cataloging subthread:
[C]onsider the FRBR expression entity. A significant aspect in textual works between expressions is translation. We do have a 240 field to record that, but since the application of the rules for Uniform titles were left to the discretion of the cataloging agency, indication of an expression for a translation can also appear in a translation note recorded in tag 500, sometimes in conjunction with the 240 but oftentimes alone (as several thousand records in my catalog will attest). Now, if this data were consistently recorded in the 240 (both with respect to the format and to the application of use of the 240), then machine FRBR-ization of these records for translations would be relatively simple.
There was more FRBR discussion in the replies.
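Myers's point about inconsistent recording can be made concrete with a small sketch. It assumes, purely for illustration, that a record is a Ruby hash mapping MARC tags to lists of subfield hashes (this is not the ruby-marc API), and it treats the language subfield ($l) of the 240 as the reliable signal and a "Translation of" note in the 500 as the fallback that needs text matching. The `translation_evidence` helper is hypothetical.

```ruby
# Classify how (or whether) a record signals that it is a translation.
# `record` is a hypothetical structure, e.g.
#   { '240' => [{ 'a' => 'Title.', 'l' => 'English' }], '500' => [{ 'a' => 'Note.' }] }
def translation_evidence(record)
  in_uniform_title = Array(record['240']).any? { |field| !field['l'].to_s.strip.empty? }
  in_note = Array(record['500']).any? { |field| field['a'].to_s =~ /\btranslat(ion|ed)\b/i }

  if in_uniform_title && in_note
    :both
  elsif in_uniform_title
    :uniform_title
  elsif in_note
    :note_only
  else
    :none
  end
end

records = [
  { '240' => [{ 'a' => 'Prestuplenie i nakazanie.', 'l' => 'English' }] },
  { '500' => [{ 'a' => 'Translation of: Prestuplenie i nakazanie.' }] },
  { '245' => [{ 'a' => 'An untranslated work' }] }
]

records.each { |record| puts translation_evidence(record) }
```

Records that come back as `:note_only` are the ones Myers describes: the information is there, but only in free text, which is what makes machine FRBR-ization harder.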
RDA National Test Update
Jennifer Eustis’s RDA National Test: Update points to Testing Resource Description and Access (RDA) at the Library of Congress, which sketches out how a bunch of libraries are going to test RDA before committing to use it. Because FRBR is fundamental to RDA, this will also be the biggest test so far of how FRBR helps bibliographic organization.
RDA vs. AACR2: Implications for Social Justice
On 11 January the New York City Radical Reference Collective ran RDA vs. AACR2: Implications for Social Justice, with Rick Block from Columbia University.
Jessica Lingel wrote notes on the session, which are worth reading. It looks like there was a good review of FRBR and RDA and where things are at, and then some interesting questions about that and the social justice and progressive side of cataloguing.
Question – what aspects of cataloging relate to issues of social justice?
It’s mostly a matter of subject headings. But even in descriptive cataloging, what gets included and what doesn’t has implications. RDA won’t so much change that, although it raises the question of personal archiving.
I’d never thought about this angle on FRBR and RDA. Very interesting subject. The first thing that strikes me is that in the linked data and Semantic Web approach anyone can say anything about anything. It will be much easier for people to apply their own sets or subsets of terminology to a group of things while still keeping connected with the rest of the universe, and for anyone else who wants to use that vocabulary to mix it in with their own system. This is a big improvement.
by William Denton (wtd@pobox.com) at February 06, 2010 04:47 PM
February 05, 2010

The invaluable epubcheck has officially been at version 1.0.3 for months, but the latest incremental build (1.0.5) has significant improvements. I’ve been seeing a number of ebooks entering the marketplace which pass epubcheck 1.0.3 but have serious flaws that are caught in 1.0.5.
At Threepress we’ve been using 1.0.5 internally for some time, as I suspect many organizations have, so I’ve upgraded the public epubcheck validation service to use the latest code. I’ll keep it up to date periodically until version 1.0.5 becomes final.
(If you prefer to use an earlier version you should download the code directly from the main site, but I strongly recommend against doing so as many serious errors may be bypassed.)
by Liza Daly at February 05, 2010 10:03 PM
Yea! My first gem ever released!
[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. And I didn't really understand it. So I dug in today and wrote this.]
I’ve just pushed to Gemcutter my first gem: a JRuby-only producer/consumer class, called jruby_producer_consumer, that works with anything that provides #each.
It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.
There’s no testing there because I’m not sure how to test threaded stuff.
It is, I hope, easy to use:
require 'rubygems'
require 'jruby_producer_consumer'
# Create a ProducerConsumer. Arguments are anything that implements #each
# and the size for the underlying queue. For the former, I'll just use a Range object.
eachable = 1..10
queuesize = 3
pc = ProducerConsumer.new(eachable, queuesize)
# Just a method to show what happens
def sample(consumerid, x)
  puts "Consumer #{consumerid}: consuming #{x}"
  sleep 1 # otherwise this'll finish before I can create multiple consumers
end
# Create three consumers. You can pass any number of args to
# #consumer, and must pass a block whose arguments are the
# object returned by eachable#each and those args back.
['A', 'B', 'C'].each do |consumerid|
  pc.consumer(consumerid) do |x, consumerid|
    sample(consumerid, x)
  end
end
# OUTPUT
# Consumer A: consuming 1
# Consumer B: consuming 2
# Consumer C: consuming 3
# Consumer A: consuming 4
# Consumer B: consuming 5
# Consumer C: consuming 6
# Consumer B: consuming 7
# Consumer A: consuming 8
# Consumer C: consuming 9
# Consumer B: consuming 10
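For readers who want to see the pattern itself, here is a minimal sketch of a bounded-queue producer/consumer in plain Ruby, using the standard library's SizedQueue in place of the Java blocking queue. It is not the gem's implementation: the `consume_in_parallel` helper and its arguments are hypothetical, and under regular Ruby the threads will not give the speed-up that JRuby's native threads can.

```ruby
require 'thread' # Queue and SizedQueue (already loaded on modern Rubies)

# Feed every item from `eachable` through a bounded queue to
# `consumer_count` threads, and collect whatever the block returns.
def consume_in_parallel(eachable, queue_size, consumer_count, &work)
  queue   = SizedQueue.new(queue_size) # push blocks while the queue is full
  results = Queue.new
  done    = Object.new                 # sentinel telling a consumer to stop

  producer = Thread.new do
    eachable.each { |item| queue.push(item) }
    consumer_count.times { queue.push(done) } # one sentinel per consumer
  end

  consumers = (1..consumer_count).map do |id|
    Thread.new do
      loop do
        item = queue.pop               # blocks until something is available
        break if item.equal?(done)
        results.push(work.call(id, item))
      end
    end
  end

  producer.join
  consumers.each(&:join)
  Array.new(results.size) { results.pop }
end

squares = consume_in_parallel(1..10, 3, 3) { |_id, x| x * x }
puts squares.sort.inspect
```

The results are sorted before printing because the order in which consumers finish is not deterministic, which is also why the OUTPUT listing above is only one possible interleaving.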
by Bill at February 05, 2010 07:46 PM
The Code4Lib Conference Planning Group (anyone can join) is putting out a call for proposals to host the 2011 Code4Lib Conference. Information on the kind of venue we seek and the delineation of responsibilities between the host organization and the Planning Group can be found at the conference hosting web page.
by rtennant at February 05, 2010 07:28 PM
[ Editor's note: Charles Knight received honorable mention in the second annual Federated Search Blog contest. In recognition of this honor, the Blog is publishing his essay and contest sponsor Deep Web Technologies is awarding Charles a $100 prize.
Charles Knight is a blogger and leading authority on alternative search engines. Recently the editor of AltSearchEngines, he now blogs about "All things Search" for TheNextWeb at http://thenextweb.com/search. He lives in Charlottesville, VA. ]
How Federated Search can Make You RICH! by Charles Knight
Unlike all of the other entries to this contest, this is the ONLY one that will MAKE YOU RICH.
Yes, you heard me right, this is a COMMERCIAL application of Federated Search technology.
Feel free to PATENT, BUILD, and SELL this product and then WATCH THE MONEY ROLL IN!
The FIRST STEP is to print out and tape in front of you this image of a beautiful Gemstone Globe.
As you prepare for the 2010 Christmas shopping season, note that you would never want to actually sell one of these globes!! Why not? Because they are LIMITED and STATIC.
For the moment, just imagine the difference between a Microsoft multi-touch SURFACE table and an ETCH-A-SKETCH.
The globe that you will be developing will fix those two problems using Federated Search technology. IT’S THAT SIMPLE!
So the NEXT STEP is to make them multi-functional. The surface of the globe is actually a curved screen. Don’t worry, flexible screen technology will soon be available, but you would want to contact those vendors ASAP.
This change in your globes will allow you to touch (or SPEAK!) the globe and change it from one mode to another, e.g. political boundaries, topographical features, global weather (or warming for the GREENIES), population statistic icons and so forth, as many as you can dream up.
Finally, add the FEDERATED SEARCH technology. As soon as you select a mode (by touch or voice), the GUI runs a search query against specialized databases in real-time and sends the search results to the surface of the globe.
IMAGINE seeing political boundaries as they are at that minute, and always up to date! IMAGINE touching this magic globe and seeing all of the weather patterns around the world in real time, as they are that minute! SEE the clouds swirling around, the darkness falling, the precise areas of rain and sunshine, and all MOVING!!
Researchers can work at their desks while population statistics change in each country, e.g. birth and death rates. These globes are both BEAUTIFUL and PRACTICAL, and one in your home will be the ultimate conversation piece!!
This is a PUBLIC CONTEST and this idea is NOT PROTECTED, so HURRY and DON’T DELAY! Get started on it now!
by Sol at February 05, 2010 07:26 PM
In conjunction with the OCLC API Mashathon at VALA next week, I'm posting a tutorial on how to use the xISSN service to enhance local information about serials. This tutorial discusses how to use the xISSN services to enhance an existing library user interface--such as a catalog--by adding information from the xISSN service. It demonstrates how to add two things on the fly to a library catalog full record display screen:
1. whether or not a journal is peer reviewed
2. a link to the latest journal table of contents
What have you done with xISSN? Leave a comment with a link to your catalog or a description of your mashup. We can feature your work in the xISSN Application Gallery.
by Karen Coombs at February 05, 2010 03:02 PM
The PIRUS2 (Publisher and Institutional Repository Usage Statistics) project - which I blogged about briefly in September and which is exploring technical, organisational and economic issues in collecting and aggregating article usage statistics from repositories and publishers - has now been underway for a few months.
The project plan is available from the JISC Web site, with further information available from the project Web site. The primary partners in this project are MIMAS, Cranfield University, COUNTER, CrossRef and Oxford University Press - which means that it is well placed to consider the many issues to which the collection, aggregation and use of article level statistics gives rise.
PIRUS2 is not alone in considering these issues and is in contact with the Open Access Statistik and SURFSure projects in Germany and the Netherlands respectively which are also working on collecting article level usage data from repositories. The projects are taking similar technical approaches. One key decision - which is in line with a recommendation of the JISC usage statistics review of 2008 - has been to format log data as OpenURL context objects. One explanation of OpenURL context objects can be found on the SURF Web site. Other standards being used are OAI-PMH and SUSHI for harvesting the usage data.
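To give a flavour of what "log data as OpenURL context objects" means, the sketch below turns one download event into a key/encoded-value (KEV) context object. It is an illustration only: the `usage_event_to_kev` helper, the choice of keys and the example identifiers are assumptions, not the profile agreed by PIRUS2 or its partner projects, which may well use the XML serialisation instead.

```ruby
require 'uri'
require 'time'

# Describe a single usage event as an OpenURL (Z39.88-2004) KEV string.
# `event` is a hypothetical hash built from one web server log line.
def usage_event_to_kev(event)
  URI.encode_www_form(
    'url_ver' => 'Z39.88-2004',
    'ctx_ver' => 'Z39.88-2004',
    'ctx_tim' => event[:time].utc.iso8601, # when the request happened
    'rft_id'  => event[:article_id],       # referent: the article requested
    'rfr_id'  => event[:referrer_id],      # referrer: where the request came from
    'req_id'  => event[:requester_id]      # requester: who asked (anonymised)
  )
end

event = {
  time: Time.utc(2010, 2, 5, 14, 59, 0),
  article_id: 'info:doi/10.1000/example',
  referrer_id: 'info:sid/example.org:search',
  requester_id: 'urn:example:session:abc123'
}

puts usage_event_to_kev(event)
```

Because every event ends up with the same small set of keys, events from different repositories and publishers can be harvested and aggregated without each side needing to understand the other's raw log format.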
PIRUS2 continues to the end of 2010.
by Ben Wynne at February 05, 2010 02:59 PM
How long has it been since you read something that came from a government agency and thought: "Wow! Brilliant!" Kudos to the Department of Justice for their Statement of Interest in the AAP/AG v. Google suit. Summed up, in their words:
In general, the project is a "good thing" -
Breathing life into millions of works that are now effectively dormant, allowing users to search the text of millions of books at no cost, creating a rights registry, and enhancing the accessibility of such works for the disabled and others are all worthy objectives.
However, the settlement goes beyond the original dispute, and is trying to use class action to create a new market that is unrelated to the copyright-related lawsuit -
Although the United States believes the parties have approached this effort in good faith and the ASA is more circumscribed in its sweep than the original Proposed Settlement, the ASA suffers from the same core problem as the original agreement: it is an attempt to use the class action mechanism to implement forward-looking business arrangements that go far beyond the dispute before the Court in this litigation. As a consequence, the ASA purports to grant legal rights that are difficult to square with the core principle of the Copyright Act that copyright owners generally control whether and how to exploit their works during the term of copyright. Those rights, in turn, confer significant and possibly anticompetitive advantages on a single entity – Google.
Not only that, but the DOJ seems to lend some weight to the "fair use" defense originally claimed by Google (and by the participating libraries) -
There has not been – and simply could not be – any allegation in this litigation that Google has sold full access to works for which it lacks the right to do so, or even that such activity was threatened. Indeed, selling such access would have been legally indefensible, and thus would have been at odds with Google’s entire pre-settlement book search strategy, which was premised upon staying within colorable “fair use” grounds. With very good reason, therefore, Google consciously avoided creating precisely the factual predicate that might support the settlement of book- and subscription-selling claims. The business models that the ASA authorizes therefore relate to activities in which Google never engaged or threatened to engage, and thus claims of copyright infringement that could not have been brought.
The anti-trust issues brought up by the suit are unchanged in this amended settlement agreement. This leaves the judge in an even tougher spot than he seemed to be in before: if he decides that the suit is a valid class-action then he has to address the anti-trust issues. However, I have seen no clear description anywhere of how those could be addressed, so the judge is being asked to be very clever indeed -
Finally, the United States recognizes that if, as discussed supra, class representatives lack the power under Rule 23 to grant Google the power to exploit broadly the digital rights of class members to sell books, create subscription libraries, etc., then neither the class representatives nor Google possesses the power to authorize such activity by third parties. However, if the Court determines that the class representatives possess such rights as to Google, then the Court should carefully examine whether there exists a means for rival distributors to access orphan and rights-uncertain works consistent with Rule 23.
The DOJ suggests the following:
- Some issues could be resolved by turning the "opt out" into "opt in" for rights holders. (That would essentially be exactly what we have today under copyright law.)
- A "waiting period" before Google can make use of out-of-print works, to give rights holders a chance to surface. (This option seems to contradict #1)
- More effort should go into finding rights holders.
- A periodic reassessment of the marketplace for the out of print works (which, because of exposure, could have changed in market value)
The big question is: Is this the death knell for the settlement? And if so, where do we go next? I predict that if the suit is rejected we will have orphan works legislation sooner rather than later, since this suit has clearly highlighted the need for such legislation. The copyright violation lawsuit against Google, however, remains. I fear that the settlement has poisoned the air for a fair use decision. We've seen the sausage being made, and it will be harder than ever to approach this project with an open and fair mind.
What can be done? Well, in France, when faced with a take-over of their cultural heritage by Google (their words, not mine), the government responded by giving libraries a large sum so that they can do the digitizing themselves; a kind of "by the people, for the people" digitization project. Is it too much to hope that could happen here?
by Karen Coyle (noreply@blogger.com) at February 05, 2010 06:11 AM
Publishing idea-man Mike Shatzkin recently wrote a provocative blog post, "Why are you for killing bookstores?"
He lays out the uncomfortable facts:
"Although there are probably few people reading this blog who expect bookstores to be around in 15 or 20 years (and those who do will undoubtedly leave a comment!), there are many who would like to keep them around as long as possible. There is a magic to being in a building surrounded by 40,000, 60,000, 100,000 different books. Bookstores are inherently community centers. They make possible the wide dissemination and promotion of great writing. They enable people to see heavily-illustrated books before they purchase them.
But have you thought about this? If you are for bookstores lasting as long as possible, you want to slow down the uptake of ebooks."
He goes on to explain the broad dynamics of the situation—the way Amazon, the big physical retailers and publishing look at the future, and which side they're on—faster ebooks or not. It's a stimulating read. And a depressing one.
Particularly depressing for me is the fact that Shatzkin never mentions libraries. (As one commenter on his post wrote, "Those buildings with 1000s of books that you speak so fondly of are called libraries.") It's not his fault, really. It's a short blog post. But I think it shows the extent of the problem for libraries. When a top industry analyst looks at the book world, libraries don't figure very prominently. There is a war going on, and libraries are going to be collateral damage.
They don't deserve it. US libraries circulated some 2.1 billion books last year, compared to 3.1 billion books sold. But they don't have much of a profile in the commercial world.(1) Though responsible for something like 39% of reading, libraries account for only about 4% of book sales.(2)
The difference is, of course, that libraries don't pay every time they circulate a book. Under the First Sale doctrine—the idea that you, well, own the things you own—libraries can pay once, and lend a book out multiple times.
Ebooks change this. As ebooks advance, libraries are going to lose their "First Sale" advantage. Publishers will never allow a library to "own" an ebook absolutely, just as consumers don't really own their ebooks. Libraries are going to be renting them, in fact or in effect, and they're going to be paying a lot more to do it. They're going to be paying for the use they get out of them, not spending what consumers spend and getting more use. (I've written on the economics here before, so check that out first if you disagree with me.)
As the logic takes hold, libraries will be transformed into "simple" book-subsidy machines, not the special, advantaged ones they are now. That means they'll either be forced to subscribe to fewer books, invest a lot more in their holdings or, for public libraries, convince voters to give them a lot more money. Those are bad options.
Other factors exacerbate the problem. Libraries are losing the "aggregation advantage." When every book is available anywhere, why go to the library to get it? And piracy hurts. Digitization has cut the music industry in half in the last decade, and there's no reason to believe books will become the first digital medium to avoid it. When you can not only get a book anywhere, but get it for free, why go to the library?
There are some reasons. Unlike bookstores, of course, libraries do other solid, valuable things. They employ librarians, who help you find and understand things. They provide free internet access. They hold story times and author readings. They lend out other things, although, excepting tools and people, digitization is going to wipe those markets out too.(3) And they're funded indirectly. Bookstores monetize their community value—whether it's an author reading or just the value of meeting cool people—by selling valuable objects. They create more value than they can realize. Public libraries, by contrast, monetize through government taxation, which is to say by periodically asking voters if they value them. As of now, despite some budgetary cuts, voters mostly do.
But, overall, I think libraries are headed in the same direction as bookstores and in obedience to the same logic—falling in tandem with the rise of ebooks. If they survive, it'll be for everything else they offer and so, for me at least, apart from the librarians, whose value won't fall, ebook libraries won't be full-fledged libraries anymore.
Shatzkin concludes:
"I don’t think anybody would want to be accused of being in favor of killing bookstores faster. And very few of us would be comfortable having it said we were trying to slow down the progress of digital technology, strategizing to slow down ebook uptake. But you are for one or the other, unless you don’t have any opinion at all."
Isn't the same thing true for libraries and ebooks?
Update 1: If you want to reply, you can leave a comment, but I also started a topic in Talk about the topic.
Well, that's about the most depressing thing I've written. I hope I'm wrong. And I even have some hopeful, positive things to say too. But I'll save them for another day.
1. These numbers are all very wiggly. Eric Hellman, formerly of OCLC, has been working on them for a while. Start with this, this and this.
2. As founder of LibraryThing, which doesn't cede the term "library" to institutional collections of books alone, I need to mention that "lending" isn't just an institutional library phenomenon. Regular people lend and share books too, probably in numbers to rival libraries. That phenomenon will be largely ended by ebook DRM—and revived by piracy.
3. It's actually digitization plus virtualization. CDs are digital, but they're also physical objects, so libraries can own them for real. When CDs are gone—and they're going—libraries will have to contract with digital music services. The dynamics are similar to the ebook dynamics.
by Tim (noreply@blogger.com) at February 05, 2010 12:56 AM
February 04, 2010
Do students like serialized parodies of popular action shows? I sure hope so.
by tim at February 04, 2010 09:01 PM
So a while back I posted code which added peer reviewed indicators to a Serial Solutions E-Journal list. Never being quite satisfied with how stuff works, and wanting to make things better, I’ve rewritten and expanded the script. Now it adds Peer Reviewed indicators to Serial Solutions and an Innovative catalog full record display screen. It also adds links to display the most current table of contents for a given journal if it exists (in both the Serial Solutions and Innovative UI).
Adding Peer Review indicators
- Grab the ISSN from the page (Innovative, Serial Solutions)
- Send ISSN to xISSN service and retrieve whether or not the journal is peer reviewed
- Add Peer Reviewed indicator to the page
The hardest part of this script involves obtaining the ISSN. Serial Solutions luckily tags this in a span. Innovative puts it in a table structure, so using JQuery I can use the following:
$("#fullSection td.bibInfoLabel:contains('ISSN')").next().text()
What this does is find the td with the text ISSN in it and then get the text in the next tag.
Adding the Peer Reviewed indicator is a matter of finding the place in the HTML structure where you want to add the new code and appending it. For simplicity’s sake, in Innovative I’m just adding a new row to the table which contains the bibliographic data.
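Whatever selector is used, the scraped text is worth cleaning and validating before it is sent anywhere. Here is a minimal sketch in Ruby (the script discussed in this post is JavaScript; the helper names below are hypothetical) that pulls an ISSN out of display text and verifies its check digit with the standard modulus-11 rule.

```ruby
# Find the first ISSN-shaped token (NNNN-NNNC) in a string, or nil.
def extract_issn(text)
  match = text.to_s.upcase.match(/\b(\d{4})-?(\d{3}[\dX])\b/)
  match && "#{match[1]}-#{match[2]}"
end

# Validate the check digit: weights 8 down to 2 over the first seven
# digits, modulus 11, with "X" standing for a check value of 10.
def valid_issn?(issn)
  chars = issn.to_s.upcase.delete('-').chars
  return false unless chars.length == 8
  return false unless chars[0, 7].all? { |c| c =~ /\d/ } && chars[7] =~ /[\dX]/

  sum = chars[0, 7].each_with_index.inject(0) { |acc, (c, i)| acc + c.to_i * (8 - i) }
  check = (11 - sum % 11) % 11
  chars[7] == (check == 10 ? 'X' : check.to_s)
end

issn = extract_issn('ISSN 0378-5955 (print)')
puts issn
puts valid_issn?(issn)
```

Rejecting malformed values at this point avoids a wasted request to the lookup service when the page layout changes and the selector starts returning something else.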
Adding a link to the table of contents
- Grab the ISSN from the page (Innovative, Serial Solutions)
- Send ISSN to xISSN service and retrieve whether or not the journal has a table of contents RSS feed available
- If ISSN has an RSS feed available, add a link which says See Latest Table of Contents and executes the TOC script
This script builds on what the Peer Review section of the script does: in addition to requesting the peer review field, it also gets the rssurl field from xISSN. If there is an rssurl field then a link is created and added to the page.
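The decision logic in these steps might look like the following sketch. The JSON shape is an assumption made for illustration (a `records` list whose entries carry `peerreview` and `rssurl` keys, mirroring the field names mentioned above, with "Y" meaning peer reviewed); check the xISSN documentation for the actual response format. `journal_enhancements` is a hypothetical helper.

```ruby
require 'json'

# Decide what to add to the page from a (hypothetically shaped) lookup response.
def journal_enhancements(json_body)
  record  = JSON.parse(json_body).fetch('records', []).first || {}
  rss_url = record['rssurl'].to_s.strip
  {
    peer_reviewed: record['peerreview'].to_s.upcase == 'Y',
    toc_feed: rss_url.empty? ? nil : rss_url # nil means: add no TOC link
  }
end

# Assumed response body, for illustration only.
response = '{"records":[{"issn":"0378-5955","peerreview":"Y","rssurl":"http://example.org/toc.rss"}]}'

enhancements = journal_enhancements(response)
puts 'Add Peer Reviewed indicator' if enhancements[:peer_reviewed]
puts "Add TOC link to #{enhancements[:toc_feed]}" if enhancements[:toc_feed]
```

Keeping the "should we add anything?" decision separate from the page manipulation means the same check can serve both the Serial Solutions and the Innovative display code.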
The tricky part of this script is the portion which brings up the Table of Contents in a popup window. What makes it tricky is that the RSS feed lives on a different server and that it’s XML which needs to be manipulated. The XML itself isn’t the difficulty, since JQuery is capable of handling XML. However, we don’t know in advance which format the feed uses (RSS 1.0, RSS 2.0 or Atom), which makes parsing much more difficult. Additionally, because the data being retrieved isn’t JSON, we can’t fetch it from another server without creating a cross-site scripting issue. To resolve both these issues, I’ve created a PHP script which retrieves the feed and parses it into JSON which I can access. I’m using the SimplePie library to parse the feed, which saves me lots of time because it takes care of the multiple types of feeds issue.
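The server-side half of that arrangement can be sketched as follows. The original uses PHP with SimplePie; this illustration uses Ruby and REXML instead and handles only titles and links, so treat it as a sketch of the idea rather than a replacement. `feed_to_json` is a hypothetical name, and fetching the feed over HTTP is left out.

```ruby
require 'json'
require 'rexml/document'

# Reduce an RSS 1.0, RSS 2.0 or Atom feed to a JSON array of {title, link}.
def feed_to_json(xml)
  doc = REXML::Document.new(xml)
  entries = []
  doc.root.each_recursive do |element|
    # RSS calls its entries <item>; Atom calls them <entry>.
    next unless %w[item entry].include?(element.name)

    children = element.elements.to_a
    title = children.find { |child| child.name == 'title' }
    link  = children.find { |child| child.name == 'link' }
    # RSS puts the URL in the element text; Atom puts it in an href attribute.
    url = link && (link.attributes['href'] || link.text)
    entries << { 'title' => title ? title.text.to_s.strip : '', 'link' => url.to_s.strip }
  end
  JSON.generate(entries)
end

rss = <<~XML
  <rss version="2.0"><channel><title>Journal</title>
    <item><title>Article one</title><link>http://example.org/1</link></item>
  </channel></rss>
XML

atom = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom"><title>Journal</title>
    <entry><title>Article two</title><link href="http://example.org/2"/></entry>
  </feed>
XML

puts feed_to_json(rss)
puts feed_to_json(atom)
```

Only the first link of each entry is used here; a real feed library such as SimplePie also deals with multiple links, encodings and malformed feeds, which is the time saving the post refers to.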
This is my 2.0 solution to the problem. My initial solution used a PHP script that just built the popup HTML content and then configured Apache to proxy the PHP script to avoid the cross site scripting issue. I gave up on this solution because it is predicated on the person installing the Javascript being able to configure Apache on the server with the Javascript to act as a proxy. This makes the solution more complicated to configure which was unacceptable. If you want to explore the code in more depth feel free to view the full javascript and the PHP code.
This post is a holdover from before I started working for OCLC, which I didn’t get published until now. I’m posting it here so that folks who saw the original content can follow up. Future posts on OCLC Web Services will be at the OCLC DevNet Blog.
by Karen at February 04, 2010 08:17 PM
So as I previously mentioned I created a script that crosslists print books and ebooks in Serial Solutions and our library catalog. The mechanics behind this script are pretty simple.
- Screenscrape the ISBN from the web page using JQuery
- Send the ISBN to a PHP page which queries the WorldCat Search API for that ISBN and holdings at the UH Library, or to a PHP page which queries Serial Solutions to see if UH has electronic holdings for that item
- PHP script returns a JSON object with the OCLC Number
- Use JQuery to parse the JSON, retrieve the OCLC Number and build a link to be inserted into the desired spot on the web page.
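The first and last of these steps can be sketched as follows, leaving out the network lookup in the middle. This is an illustration in Ruby, not the post's JavaScript and PHP: the helper names, the `oclcnum` key in the JSON and the catalog URL template are all assumptions.

```ruby
require 'json'
require 'cgi'

# Step 1: pull the first ISBN-10 or ISBN-13 out of scraped page text.
def extract_isbn(text)
  text.to_s.scan(/[0-9][0-9Xx-]{8,16}/).each do |candidate|
    digits = candidate.delete('-').upcase
    return digits if digits =~ /\A(\d{9}[\dX]|\d{13})\z/
  end
  nil
end

# Last step: read the OCLC Number from the lookup response and build a link.
# Returns nil when no number came back, i.e. there is nothing to cross-list.
def crosslink_html(json_body, url_template, label)
  oclc_number = JSON.parse(json_body)['oclcnum'].to_s.strip
  return nil if oclc_number.empty?

  url = url_template.sub('{oclc}', CGI.escape(oclc_number))
  %(<a href="#{CGI.escapeHTML(url)}">#{CGI.escapeHTML(label)}</a>)
end

isbn = extract_isbn('ISBN 0-306-40615-2 (pbk.)')
puts isbn

# Assumed response and URL pattern, for illustration only.
puts crosslink_html('{"oclcnum":"12345"}',
                    'http://catalog.example.edu/search/o{oclc}',
                    'Also available in print')
```

Returning nil when there is no match lets the page code skip the insertion entirely instead of adding a dead link.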
The steps are the same for both Serial Solutions and the catalog. The big differences in the code? The code which grabs the ISBN and the code which inserts the link in the right place. This is because the UIs are different, so it takes different JQuery code to get the ISBN and then insert the link.
Here is the Javascript which works to insert crosslinking into an Innovative catalog and Serial Solutions. I’ve commented it so you can see which part corresponds to each.
In addition, to make this work you have to have the PHP scripts on your server. There is one for WorldCat and one for Serial Solutions. I created these to solve the cross server scripting problem and to get the data into JSON format, which is easier to manipulate. I’ve made these available for download as well as examples (Serial Solutions / WorldCat). The code isn’t as abstracted as I’d like. For example, if I had the time I would have coded it so that the PHP builds the link back to the catalog based on the OCLC Symbol submitted. I can do this if I tap the OCLC Registry, but I was in a rush and didn’t take the time to code it this way on the first round.
This post is a holdover from before I started working for OCLC, which I didn’t get published until now. I’m posting it here so that folks who saw the original content can follow up. Future posts on OCLC Web Services will be at the OCLC DevNet Blog.
by Karen at February 04, 2010 08:16 PM
It seems to be the one event that people think is important enough to go to, even though they fear in their hearts that, yet again, not a lot of progress will be made. Most of those at yesterday’s JISC-funded Persistent Identifiers workshop had been to several such meetings before. For my part, I learned quite a lot, but the slightly flat outcome was not all that unexpected. It’s not quite Groundhog Day, as things do move forward slightly from one meeting to the next.
Part of the trouble is in the name. There is this tendency to think that persistent identifiers can be made persistent by some kind of technical solution. To my mind this is a childish belief in the power of magic, and a total abdication of responsibility; the real issues with “persistent” identifiers are policy and social issues. Basically, far too many people just don’t get some simple truths. If you have a resource which has been given some kind of identifier that resolves to its address (so people can use it), and you change that address without telling those who manage the identifier/resolution, then the identifier will be broken. End of, as they say!
This applies whether you have an externally managed identifier (DOI, Handle, PURL) or an internally managed identifier (eg a well-designed HTTP URI… Paul Walk threatened to throw a biscuit at the first person to mention “Cool URLs”, but had to throw it at himself!).
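The failure mode described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how DOI, Handle or PURL resolution is actually implemented: the `Resolver` class, the `dereference` helper, the identifier `id:example/1` and the example.org addresses are all hypothetical.

```python
class Resolver:
    """Toy resolution service: remembers the last address it was told about."""

    def __init__(self):
        self._table = {}

    def update(self, identifier, address):
        """Record (or correct) where an identifier should lead."""
        self._table[identifier] = address

    def resolve(self, identifier):
        """Return the recorded address, or None for an unknown identifier."""
        return self._table.get(identifier)


def dereference(resolver, identifier, live_addresses):
    """Follow an identifier; return its address only if a resource still lives there."""
    address = resolver.resolve(identifier)
    return address if address in live_addresses else None


# Hypothetical identifier and addresses, for illustration only.
resolver = Resolver()
resolver.update("id:example/1", "http://old.example.org/paper")
live = {"http://old.example.org/paper"}
works_before_move = dereference(resolver, "id:example/1", live)

# The resource moves, but nobody tells the resolver: the identifier breaks.
live = {"http://new.example.org/paper"}
broken_after_move = dereference(resolver, "id:example/1", live)

# Telling the resolver is the policy step that restores the identifier.
resolver.update("id:example/1", "http://new.example.org/paper")
fixed_after_update = dereference(resolver, "id:example/1", live)
```

Nothing in the lookup table can detect the move by itself; persistence comes from someone calling `update`, which is a matter of policy rather than technology.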
Now clearly some identifiers have traction in some areas. Thanks to the efforts of CrossRef and its member publishers, the DOI is extremely useful in the scholarly journal literature world. You really wouldn’t want to invent a new identifier for journal articles now, and if you have a journal that doesn’t use DOIs (ahem!), you would be well-advised to sign up. It looks very affordable for a small publisher: $275 per year plus $1 per article.
Even for such a well-established identifier, with well-defined policies and a strong set of social obligations, things do go wrong. I give you Exhibit A, for example, in which Bryan Lawrence discovers that dereferencing a DOI for a 2001 article on his publications list leads to "Content not found" (apologies for the “acerbic” nature of my comment there). It looks like this was due to a failure of two publishers to handle a journal transfer properly; the new publisher made up a new DOI for the article, and abandoned the old one. Aaaaarrrrrrggggghhhhhhh! Moving a resource and giving it a new DOI is a failure of policy and social underpinning (let alone competence) that no persistent identifier scheme can survive! CrossRef does its best to prevent such fiascos occurring, but see social issues above. People fail to understand how important this is, or simple things like: the DOI prefix is not part of your brand!
Whether a DOI is the right identifier to use for research data seems to me a much more open question. The issue here is whether the very different nature of (at least some kinds of) research data would make the DOI less useful. The DataCite group is committed to improving the citability of research data (which I applaud), but also seems to be committed to use of the DOI, which is a little more worrying. While the DOI is clearly useful for a set of relatively small, unchanging digital objects published in relatively small numbers each year (eg articles published in the scholarly literature), is it so useful for a resource type which varies by many orders of magnitude in terms of numbers of objects, rate of production, size of object, granularity of identified subset, and rate of change? In particular, the issue of how a DOI should relate to an object that is constantly changing (as so many research datasets do) appears relatively un-examined.
There was some discussion, interesting to me at least, on the relationships of DOIs to the Linked Data world. If you remember, in that world things are identified by URIs, preferably HTTP URIs. We were told (via the twitter backchannel, about which I might say more later) that DOIs are not URIs, and that the dx.doi.org version is not a DOI (nor presumably is the INFO URI version). This may be fact, but seems to me rather a problem, as it means that "real DOIs" don't work as first-class citizens of a Linked Data world. If the International DOI Foundation were to declare that the HTTP version was equivalent to a DOI, and could be used wherever a DOI could be used, then the usefulness of the DOI as an identifier in a Linked Data world might be greatly increased.
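To make the three forms under discussion concrete, here is a small Python sketch that renders one DOI name as a bare DOI, a dx.doi.org HTTP URI, and an INFO URI. The function name `doi_forms` and the DOI `10.1234/example.2010` are made up for illustration, the `info:doi/` prefix is assumed to follow the info URI registry's DOI namespace, and the validity check is deliberately crude.

```python
from urllib.parse import quote


def doi_forms(doi):
    """Return a bare DOI name alongside its HTTP and info-URI renderings."""
    bare = doi.strip()
    # Crude sanity check only: DOI names start with "10." and contain a slash.
    if not bare.startswith("10.") or "/" not in bare:
        raise ValueError(f"not a DOI name: {doi!r}")
    encoded = quote(bare, safe="/")  # percent-encode characters unsafe in a URI path
    return {
        "doi": bare,
        "http": "http://dx.doi.org/" + encoded,
        "info": "info:doi/" + encoded,
    }


forms = doi_forms("10.1234/example.2010")  # hypothetical DOI
```

Only the `http` form can be dereferenced directly by Linked Data tooling, which is why a declaration of equivalence between it and the bare DOI would matter.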
A question that’s been bothering me for a while is when an “arm’s-length” scheme, like PURL, Handle, DOI etc is preferable to a well-managed local HTTP identifier. We know that such well-managed HTTP identifiers can be extremely persistent; as far as I know all of the eLib programme URIs established by UKOLN in 1995 still work, even though UKOLN web infrastructure has completely changed (and I suspect that those identifiers have outlasted the oldest extant DOI, which must have happened after 1998). Such a local identifier remains under your control, free of external costs, and can participate fully in the Linked Data world; these are quite significant advantages. It seems to me that the main advantage of the set of “arm’s-length” identifiers is that they are independent of the domain, so they can be managed even if the original domain is lost; at that point, an HTTP URI redirect table could not be set up. So I’m afraid I joked on twitter that perhaps “use of a DOI was a public statement of lack of confidence in the future of your organisation”. Sadly I missed waving the irony flag on this, so it caused a certain amount of twitter outrage that was unintentional!
In fact the twitter backchannel was extremely interesting. Around a third or so of the twits were not actually at the meeting, which of course was not apparent to all. And it is in the nature of a backchannel to be responding to a heard discourse, not apparent to the absent twits; in other words, the tweets represent a flawed and extremely partial view of the meeting. Some of those who were not present (who included people in the DOI world, the IETF and big publishers) seemed to get quite the wrong end of the stick about what was being said. On the other hand, some external contributions were extremely useful and added value for the meat-space participants!
I will end with one more twitter contribution. We had been talking a bit about the publishing world, and someone asked how persistent academic publishers are. The tweet came back from somewhere: “well, their salespeople are always ringing us up ;-) !”
by noreply@blogger.com (Chris Rusbridge) at February 04, 2010 07:46 PM
I'm told, by way of my own imagination based on loose rumors put out by flying pink fairies, that Topic Maps is a waning technology, poorly supported by the IT industry at large, hard to wrap your head around, and generally icky to deal with.
All of this is, unfortunately, true.
But, as in all stories told by only one side, there is another side just waiting to come out into the light, just one day, real soon now. This day may never come, but here is my own little attempt to shed some light on a few of the issues with the Topic Maps world. It was about 10 years ago I first got a whiff of Topic Maps, so it seems fitting for my first post in 2010 to take some Topic Maps rumors, loose observations and vague statements, and make some comments along the way. Here we go:
1. Topic Maps are hard
Why, yes, to a commoner or some person with a somewhat traditional approach to computing, Topic Maps can indeed seem like an alien concept at first. The first time I started reading up on it I was mesmerized and frightened at the same time, wondering where the magic would bring me and just how painful it would be for me when reality would kick in (and me); there were new notions and concepts, new words, new paradigms everywhere! Reification, role types, associations, occurrences, occurrence type, typified information, subjects and topics, ontologies (upper, lower, specialized ones), the list goes on. It is terrifying indeed, and for many, many people they are so terrifying that SQL and C# and .Net and C and PHP seem like a comforting auntie lulling you back into things we know and know well, no hard thinking required (just lots of hair to pull out).
Until you realize a few things, that is. For example, the vocabulary is anchored in information science, and with a bit of research or learning it shouldn't take that long to get familiar with it. Even the complex issues of reification and ontologies after some time will be as normal and self-explainable as second-cousins and language. (And yes, there is a correlation between the examples given! See if you can find them!) And perhaps more importantly, the problems you can solve with Topic Maps can completely and utterly eradicate the major problems those traditional methods give us, one of the biggest bug-bears that I'd ever had! (Anyone wish to offer me a book deal on how to solve most of the main IT development problems in seriously interesting ways? :)
Can I just mention that having a small epiphany about Topic Maps has the effect that you never return to the real world and look at it the same way, ever again? I have never met a person who got Topic Maps and then returned to the old ways, at least not without making huge compromises. Getting it will change you in good ways, and is most definitely worth the effort despite the pain.
Tips to newbies: It's not really hard, even if it seems hard. But it requires you to change your mind on some key issues.
2. Topic Maps are poorly supported in the real-world
Oh yes, indeed. If you talk to anyone, any company in your immediate serenity (yes, a tautological pun) and ask them about their use of Topic Maps, you'd most likely get a blank stare back and a careful "What would we need maps for?"
There's the odd technically inclined person who might know a bit about what these fabled Topic Maps are all about, but very, very few people understand what they are, and even fewer have implemented them into something useful. (The exception to this is, oddly enough, the country of Norway, and some scantily-clad areas of southern Germany.) No mainstream software package comes with the stuff wrapped in, no word-processor touts its amazingness, no operating system comes with support for it, and no popular software of any kind uses it.
But then, there's the odd system that uses it. You'll find it also in the odd Norwegian government portal, which is bizarre in its own right, and perhaps deep down in some academic underfunded project or perhaps some commercial project where parts of the data-model masquerade as it. My old website uses it. I have a framework or two. There's the odd other open-source project, a few APIs, and a host of other well-meaning but obscure projects that perhaps have got it, albeit well hidden and kept away from children.
For a technology that stands out as something that can fix it all, I find it bizarre that it is found so seldom, but then bizarre is not the same as surprising. And when you look at the "competition", the well-funded, well-marketed, well-established world of the Semantic Web, championed by none other than the W3C and Tim Berners-Lee, well you have to concede that it shouldn't be much of a surprise at all, really. Topic Maps is a tiny group of enthusiasts (a few hundred, being liberal with statistics) who'd saw off their right leg if it meant we could get the specs done in time, while the Semantic Web world is littered with academia, organisations and companies (we're talking thousands upon thousands of people actively working on it), so no, you should not be surprised.
Tips to newbies: As the saying goes, if a million flies eat it ... surely, it has some nutritional value or greater worth over, say, that green grass the cows are dumping it on?
3. Topic Maps is dying and obsolete; use RDF instead
There was a period about 10 years ago which I regard as the Topic Maps time of bloom; the trees had beautiful flowers on, the pink and purple petals falling over the world of IT like a slow-motion rainfall of beauty. Everywhere you turned there were people talking about it and potential projects popping up all the time.
But times went by. Topic Maps was too hard for most (see point 1 and 2), and not just the technical implications themselves and the language and terms used, but also the philosophy of it, the very idea of why we should be using it over, say, any relational database or traditional software stack. I mean, what's the point, really?
The point is easy to miss, admittedly. A technology that can be used for everything is hard to pin down as being good for something in particular. And we have focused just too damn much on knowledge management systems, and not only that, but used our own special language in the process which often is quite remote from knowledge management speech in the enterprise arena (but you find it rife in academia). When the world looks to Topic Maps, all they see is a difficult way to do knowledge management. Ugh.
Myself, I'm using Topic Maps in highly non-traditional ways. I use maps for my application (definitions, actions and functionality), for functional topology (generic functionality in hyper-systems based on typification), for business logic (rules, conditions, interactions) and, perhaps just as important, for the actual development itself (modules and plugins, deployment, versioning, services) which makes for a highly (and this "highly" is quite higher than any normally used "highly") customizable and flexible framework for making great semantic applications. But more on the details at some later stage.
Tips for newbies: No, it's neither dead nor dying, just not as popular as stuff that's easier or more accessible.
4. Topic Maps is nothing new
Well, given its roughly 20 year history (and I'm counting from early days of HyTime), in Internet years it's an old, old dog, so by that alone we can't say there's anything new, but most people would take "new" here to mean something like "we've been doing X for years, so why do we need this?", where X usually points to some bit of the Topic Maps paradigm that indeed has been done before. Of course it has. There is nothing new in Topic Maps except, of course, putting it all together and standardizing one cohesive and complete way of doing pretty damn most of what you would need for your complex data-model, identity management, semantic or otherwise relational, interoperable information and/or structural need, chucking in knowledge management, too, for good measure.
There is of course nothing new with Topic Maps, except that all that old stuff is bundled into a new thing, if you allow a 20 year old standard to be called "new." But then again, "the standard" is really a family of standards, all evolving and changing with the times. There's always a sub-standard (no pun intended ... well, not a lot of pun intended) in the woodwork, always some half-baked document to explain something or other, always something that is so damn specific and concise that the overall grooviness and funky bits are pushed to the side-lines.
Topic Maps is new and old at the same time, but it really is groovy and funky once you overcome the technical jargon and the concise nature of the standards.
Tips to newbies: The king is dead. Long live the king!
5. The Topic Maps community is, um, a bit tricky
Oh, yes indeed. And this one is the hardest to write about as I'm part of this community and know pretty much everyone, some more than others.
So let's say it this way; I'm a difficult person in certain ways, for example I talk a lot, I overflow with ideas rather than code, I don't care too much about political correctness, and I speak my mind and use language that could alienate people with too strong attachments to their ties or their social buckets.
And the core of the Topic Maps community is loaded with weirdos like me; highly opinionated, rough ideas, hard on woo, and soft on business. But the problem isn't the weirdos, but the low number of them. Any successful community with such a wide-ranging and all-encompassing area of what Topic Maps is all about (which is, uh, almost anything) going from epistemology to identity management to ontology work, well, you need a lot of personalities to match them all to make it seem like a lively place. We, on the other hand, have a handful of people, and the contrast between us all is sometimes just too great. And, I've noticed, we're not very good with newbies, either, so even if we answer their questions, quite often our answers are just too far out there for normal people to comprehend (and I've got a ton of circumstantial and anecdotal evidence to back it up).
I'm part of many different communities on the web, but there is only one champion of how fast an online discussion goes private (and it's not of the good kind; it's the kind where we need to express our frustrations in private [because, ultimately, we're nice people who don't want to offend anyone even when they deserve it, those bastards], lest we blow up and our eyes will bleed!), and that's the community which is located on a private server where you must write to the list owner in an email to be added. *sigh*
I tried my "question of the week" thing on the mailing-list for a while, and some of those went well, but too many of those questions quickly descended into nothing or private arenas. So, I'm officially giving up on it for now. Maybe I'll come back stronger once my spine grows back, who knows?
Tips for newbies: Be strong, keep at it, ask for clarification! We don't know just how alien we are. And please join in as we need more weirdos.
6. What, exactly, is Topic Maps, anyways? I don't get it!
Yes, indeed, what exactly is this darn Topic Maps thing? The funny thing is that there is no correct answer to that question. First of all, it's a family of standards that we collectively call "Topic Maps", but it could also mean either the TMDM (Topic Maps Data Model) standard or the XTM (Topic Maps XML exchange format) XML standard, depending on your non-sexual preferences. Some might even go out on a limb (obviously not the limb cut off in point no. 2) and claim that it means the TMRM (Topic Maps Reference Model) which is a more abstract framework, or possibly even just the philosophical direction - or, dare I say it, zeitgeist? - of the thing, like a blueprint for how to build a key-value recursive property framework with identity- and knowledge management system. Your mileage may vary.
But then we have a problem as it is not a technology nor a format. It is more akin to a language, a model or a direction of sorts. No, not a language like SQL (even though the TMQL (Topic Maps Query Language) could be said to hold that place) that is to be parsed by a computer, nor a language like Norwegian or English. No, we're talking about a language that sits right in the middle between the computer and the human, a kind of mediator or translator, a model in which both machine and human can do things that each part understands equally well, a model which is defined through information science, math and human language.
So what is it? It's a language that both computers and humans can use without pulling too much in either direction, a language in the middle that, if spoken by many parties (computers and humans both), they can all join hands and sing beautiful knowledge management songs together, share and propagate with ease. But of course, Topic Maps isn't limited to just knowledge management, oh no. You can solve unsurmountable things with it as you can make it represent whatever you want it to, and I really, truly mean anything. If you want a topic to represent your thing, off you go. It's that flexible.
It can work as the basis for pretty much any system that has structures in it of any kind or shape, and that, by and large, is pretty much any system ever built. So it's actually quite hard to explain just what you can use it for, even though traditionally it's content management, portals and knowledge management.
Tips to newbies: It's only a model ...
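To make some of the vocabulary from point 1 (topics, occurrences, associations, role types) a little less alien, here is a deliberately tiny Python sketch. It is only a model of the model: it is not an implementation of the TMDM, XTM or any other standard in the family, and every class, method and identifier in it (`TopicMap`, `players`, `person:newbie` and so on) is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A topic stands in for one subject, which can be anything at all."""
    identifier: str
    names: list = field(default_factory=list)
    occurrences: dict = field(default_factory=dict)  # occurrence type -> value


@dataclass
class Association:
    """A typed relationship in which each topic plays a named role."""
    type: str
    roles: dict  # role type -> topic identifier


class TopicMap:
    def __init__(self):
        self.topics = {}
        self.associations = []

    def add_topic(self, identifier, name):
        """Create the topic if needed; a repeated identifier just adds a name."""
        topic = self.topics.setdefault(identifier, Topic(identifier))
        if name not in topic.names:
            topic.names.append(name)
        return topic

    def associate(self, association_type, roles):
        """Relate existing topics; raises KeyError if a role player is unknown."""
        for player in roles.values():
            if player not in self.topics:
                raise KeyError(player)
        self.associations.append(Association(association_type, dict(roles)))

    def players(self, association_type, role_type):
        """List the topics playing role_type in associations of the given type."""
        return [
            self.topics[a.roles[role_type]]
            for a in self.associations
            if a.type == association_type and role_type in a.roles
        ]


# Hypothetical content, for illustration only.
tm = TopicMap()
tm.add_topic("person:newbie", "A newbie")
tm.add_topic("tech:topic-maps", "Topic Maps")
tm.add_topic("tech:topic-maps", "TM")  # same subject, so the same topic
tm.topics["tech:topic-maps"].occurrences["description"] = "a family of standards"
tm.associate("uses", {"user": "person:newbie", "technology": "tech:topic-maps"})
```

Because a topic can represent anything, the same three structures could just as well describe application modules, business rules or deployment steps, which is the flexibility the post is pointing at.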
So there you go, a quick summary of bits and bobs about Topic Maps. In my next installment, I'll summarize my navel fluff collection, next the train-table changes of Minnamurra station of the last 10 years, and finally I thought I'd summarize all the redundant technology that's gathering dust in my garage. Stay tuned for exciting times ahead!
by Alexander Johannesen (noreply@blogger.com) at February 04, 2010 05:39 PM
So I was talking to Stephen diFilipo today and he mentioned OraTweet, Oracle's new enterprise Twitter-like system, and it made me wonder how many institutions have set up their own internal Twitter systems (enterprise micro-blogging) like OraTweet (what a name). I did know about Yammer-like systems for use among internal enterprise staff. However, I did not know about Blue Twit (IBM) and Laconica. Are there others out there that we should be aware of, and how do these systems feed back into public-facing micro-blogging apps like Twitter?
by rmcdonal at February 04, 2010 04:20 PM
This article will appear in Nodalities Magazine, Issue 9.
by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons
In the emerging data web, there have been multiple efforts working towards the same broad goal of data sharing (e.g., the NeuroCommons, Linked Open Data, efforts of the World Wide Web Consortium), but they are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but is still at a very early stage.
This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.
Lessons Learned
Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.
When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?
In short, it’s complicated.
This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?
The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.
The success of these licensing approaches has changed the social ethos of licensing: licenses are now used to make something more free, rather than less, in place of the traditional “all rights reserved” model.
But from our research, this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions on scientific data is increasing, but with the undesired consequence of limiting the downstream use of this information, and even at times blocking interoperability. The costs are high, the terms are not always clear, nor the protections always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry to do meaningful analysis, annotation, search, etc. on the mass of data available currently that’s continuing to grow exponentially, and integrating with the literature available.
We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.
Attribution v. Citation
Under US Copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.
Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.
But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.
In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation in its sole form is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this, which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due.
Implications of FLOSS toggles and directives on data sharing
These issues emerge when instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. And in the digital world, that tendency can have quite limiting ramifications to future use of the information, as technology continues to outpace the social components to data sharing.
Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license is ensuring the data can be freely shared, reused, and built upon. But without existing precedent or involving a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for the users, as well as data providers.
Let’s look at a few examples to gain a better understanding.
Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?
Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.
Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then be under the same terms and no other license. This leads to compatibility issues.
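The stacking effect in the last two examples can be put into numbers with a short Python sketch. It is a toy model of the toggles as described above, not a statement of how any particular license operates: the function name `stacked_obligations`, the term labels and the database names are all assumptions made for illustration.

```python
def stacked_obligations(sources):
    """Summarise what a result drawing on every source would owe.

    Each source is a (name, terms) pair, where terms is a set of toggle
    names such as {"attribution", "share-alike"}.
    """
    credit = [name for name, terms in sources if "attribution" in terms]
    share_alike = [name for name, terms in sources if "share-alike" in terms]
    return {
        "attributions_required": len(credit),
        # One share-alike source is enough to bind the whole derivative.
        "bound_by_share_alike": bool(share_alike),
        "share_alike_sources": share_alike,
    }


# Hypothetical query touching 30,000 databases, exactly one of them share-alike.
sources = [(f"db{i}", {"attribution"}) for i in range(29_999)]
sources.append(("db-sa", {"attribution", "share-alike"}))
summary = stacked_obligations(sources)
```

In this model the attribution burden grows linearly with the number of sources, while the share-alike constraint is all-or-nothing: a single source flips it for the entire result.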
There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.
After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.
Converging on the Public Domain (+ Norms)
We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – that waives all intellectual property rights and asserts that the provider makes no claims on the data. These tools put the work as closely into the public domain as possible.
This approach calls for data providers to waive all rights necessary for data extraction and re-use (i.e., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires that the provider place no additional obligations such as copyleft or share-alike on the information, which could limit downstream use, as discussed above.
Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.
The Protocol recommends one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.
We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. With requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.
Final Thoughts
In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allow for network effects, and to the success of the Web.
Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.
This work is licensed under a Creative Commons Attribution 3.0 License.
by admin at February 04, 2010 02:56 PM
“When you treat people like idiots, they’ll behave like idiots.”
Have a quick read of The Traffic Guru. It describes how removing traffic rules and signs in a small village prompted people to be more social and become better drivers.
“A year after the change, the results of this “extreme makeover” were striking: Not only had congestion decreased in the intersection—buses spent less time waiting to get through, for example—but there were half as many accidents, even though total car traffic was up by a third.”
Could such a strategy work in your library? Would you be willing to try?
by Aaron Schmidt at February 04, 2010 02:06 PM
Join Troy Linker from ALA Publishing for an introductory guided tour of the RDA Toolkit website. If you were at ALA Midwinter in Boston, you may already have taken this tour at the RDA Update Forum, the CC:DA meeting, or on the exhibit floor--but please feel free to join us again.
The webinar will be recorded and posted for anyone who is unable to participate live. Details for accessing the recorded webinar video will be emailed to registrants and posted widely.
The tour includes:
- Description of the RDA Toolkit
- Overview of the RDA Toolkit contents at launch and beyond
- Tour of the RDA Toolkit interface including Search, Browse, Bookmarks, Workflows, Maps, and more
- Launch timeline
- Details of the Complimentary Open Access period
- RDA Toolkit pricing for the US
- Linking from external products to the RDA Toolkit
Join us on
February 8, - 21:00-22:00 GMT | 4:00pm-5pm EST | 3:00pm-4pm CST | 1:00pm-2pm PST
OR
Join us on
February 9, - 16:00-17:00 GMT | 11:00am-12pm EST | 10:00am-11am CST | 8:00am-9am PST
Adapted from an e-mail widely distributed.
by David (noreply@blogger.com) at February 04, 2010 11:32 AM
As the February 18 hearing on the revised Google Books Settlement Agreement draws near, I think it's timely to explore some issues surrounding full-text indexing of books. It's important to realize that when Google began its program of scanning books in libraries, it chose to do so in a way that entered the gray zone of fair use. Google continues to maintain that its scanning activities are perfectly legal, and fair use advocates welcomed the Publishers' and Authors' lawsuit because it had the potential to clarify ambiguities around fair use. No matter where the court decided to draw the line, both fair use and rightsholder control would be able to extend into the zone of current uncertainty.
Overlooked in the controversy is the fact that Google could have chosen a safer course in its effort to make full-text indices of books. In this article, I'll argue that it's possible to make full-text indices of books in a way that steers well clear of copyright infringement. But first, I should note that playing it safe would not have been a good plan for Google. By pushing fair use to its limits, Google assured itself a favorable competitive position. In a lawsuit, Google could have lost on 90% of the fair use they were claiming and would still have ended up 10% ahead of where a safe course would have taken them. Google is large enough that even a 10% victory in court would have paid off in the long run. As it is, Google chose to settle the lawsuit under terms that put them in a better position than they would have occupied by playing it safe, and potential competitors don't gain the benefits of a fair-use precedent.
I make two assumptions about copyright in devising a copyright-safe indexing method:
- You can't infringe the copyright to a work if you don't copy the work.
- If you can't reconstruct a work from its index, then distributing copies of the index doesn't infringe on the work's copyright.
Just in case these assumptions are weak, my fall-back position is that indexing is clearly a fair use under US copyright law.
First, the fall-back assumption: full-text indexing is allowed as fair use under US copyright law. Indices are allowed as "transformative uses". Judge Robert Patterson's decision (pdf, 195K) in the "Harry Potter Lexicon" case gives an excellent background of this jurisprudence and concludes:
The purpose of the Lexicon’s use of the Harry Potter series is transformative. Presumably, Rowling created the Harry Potter series for the expressive purpose of telling an entertaining and thought provoking story centered on the character Harry Potter and set in a magical world. The Lexicon, on the other hand, uses material from the series for the practical purpose of making information about the intricate world of Harry Potter readily accessible to readers in a reference guide. To fulfill this function, the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
The author of the Lexicon lost his case not because his indexing was not allowed, but rather because he copied too much of J. K. Rowling's creative expression in doing so.
Second, you have to copy to infringe copyright. A more accurate statement is this: You have to either make a copy or a derivative work to infringe copyright. The second piece of this can be a bit more confusing, because "derivative work" has a specific meaning in copyright law. A translation into another language is an example of a derivative work. Indices are not derivative works. The law considers indices to be more akin to metadata. I might need access to a book to count the number of figures it contains, but a report of the number of figures in a book and what page they're on is in no way a derivative work. The copyright act defines a derivative work as
a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.
If you make copies by scanning, however, as Google is doing, you must also establish that your use is allowed as fair use. If you don't make copies, then you don't even need to reach the fair use provision.
The last assumption gets more technical. The simplest form of a word index is a sorted list of words with pointers to the occurrence of the word within the text. So an index of that last sentence might look like this:
a 5,9
form 3
index 7
is 8
list 11
occurrence 18
of 4,12,19
pointers 15
simplest 2
sorted 10
text 24
the 1,17,20,23
to 16
with 14
within 22
word 6,21
words 13
It doesn't take a computer science degree to see that it's easy to reconstruct the sentence from this index. For that reason this form of index is equivalent to a copy. If you remove the position pointers, however, the index loses enough information that the sentence cannot be reconstructed. So if we take the words on a page of text and sort the words in each sentence, then sort the word-sorted sentences, we get an index of a page that can't be used to reconstruct text, but can be used to build a useful full-text index of a book.
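The two kinds of index described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are made up for the example, words are split with a simple regular expression, and sentence boundaries are guessed from punctuation. It is not how Google or anyone else actually indexes books.

```python
import re

WORD = re.compile(r"[A-Za-z']+")

def positional_index(sentence):
    """Map each lowercased word to its 1-based positions in the sentence."""
    index = {}
    for pos, word in enumerate(WORD.findall(sentence.lower()), start=1):
        index.setdefault(word, []).append(pos)
    return dict(sorted(index.items()))

def reconstruct(index):
    """Rebuild the word sequence from a positional index."""
    by_pos = {pos: word for word, positions in index.items() for pos in positions}
    return " ".join(by_pos[pos] for pos in sorted(by_pos))

def page_index(page_text):
    """Sort the words inside each sentence, then sort the sentences.

    No positions are kept, so the original word order is lost.
    """
    sentences = [s for s in re.split(r"[.!?]+", page_text) if s.strip()]
    return sorted(sorted(WORD.findall(s.lower())) for s in sentences)

SENTENCE = ("The simplest form of a word index is a sorted list of words "
            "with pointers to the occurrence of the word within the text.")

if __name__ == "__main__":
    idx = positional_index(SENTENCE)
    print(idx["the"])        # positions of "the"
    print(reconstruct(idx))  # the sentence comes straight back
    print(page_index("The cat sat. A dog ran."))
```

With positions, `reconstruct` recovers the sentence word for word; `page_index` keeps only which words share a sentence, which is enough to answer "does this page mention X near Y?" but not to read the page.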
The trickiest step of completely copyright-safe indexing is producing the page index from a book without producing intermediate copies of the pages. In a conventional scanning process, a digital image of a page is stored to disk and the copy is passed to OCR software. Indexing software then works on the OCR text. A scanning process that was fastidious about copyright, however, could scan lines of text word by word and never acquire an image large enough to be subject to copyright.
US courts have considered the loading of a copyrightable work into a computer's RAM storage to constitute copying, but scanning sufficient to produce an index can in principle be done without requiring that to occur. (For an excellent law review article on the RAM-copying situation, read Jonathan Band and Jeny Marcinko's article in Stanford Technology Law Review.) Also, even sentences of more than a few words can be considered copyrightable works, as I discussed in an article from November.
Another possible way to avoid copying is to build a black-box indexer. A closer look at the RAM-copying precedent, MAI SYSTEMS v. PEAK COMPUTER, suggests that a non-copying scanning indexer can be built even if page images exist somewhere in RAM. In that case, the court reasoned that the software copy could be viewed via terminal readouts, system logs, and that sort of thing. If a closed-box indexing system were built so that page images resident in RAM could never be "perceived, reproduced, or otherwise communicated", then there is a fair chance that a court would find that copying was not occurring.
I'm a technologist, not a lawyer. I would welcome comment and criticism from experts of all stripes on this analysis. For example, I've not considered international aspects at all. There are many technical aspects of copyright-safe indexing that would need to be sorted out, but doing so could open the way to countless transformative uses of all the books in the world.
by Eric Hellman (noreply@blogger.com) at February 04, 2010 10:37 AM
February 03, 2010
The BBC News Web site had an interesting column by Bill Thompson yesterday titled “Open Societies need open systems.” The subtitle, “Openness, like democracy, must be constantly defended, says Bill Thompson” basically acts as a partial abstract as well. In this article he looks at Amazon’s disagreement with Macmillan, which resulted in Amazon briefly de-listing all Macmillan stock and removing it from its indexes, and at the Apple/Adobe kerfuffle over Flash on the iPhone and the soon to be released iPad.
I’m not quite sure how the Amazon/Macmillan dispute affects Democracy, or Openness for that matter, but it does go to show that highly successful retailers such as Amazon and Walmart can make it more or less difficult for a producer of a product to get it into the hands of consumers. Amazon, no doubt, felt that by trying to prevent different pricing for e-books it was helping the consumer (and thus itself), but obviously authors like Charlie Stross, quoted in the article as saying “Amazon [has] screwed me, and I tend to take that personally, because they didn’t need to do that”, saw it differently.
The Apple and Adobe situation I see differently. While I do believe that Apple is looking out for its own corporate interests, Apple also does want more Openness on the Web. As a company with a minority operating system share, the more open the Web is the better chance they have to compete. Adobe, on the other hand, wants to, as Thompson puts it, “close off the web to non-Flash content.” While Apple, with its stance on DRM and other issues, has not always been a strong supporter of Openness, I believe in this case they are squarely on the side of Openness by supporting HTML5 and H.264 over continuing to enable the proprietary Adobe Flash format to be the de facto standard for video on the Web. Thus I find it a bit odd that Thompson appears to be supporting Adobe on this issue. Thompson says:
Just as we must work to retain our democratic forms of government in the face of adversity, so we must constantly be alert for those who would remove open systems in the name of efficiency and effectiveness.
He may be right that not installing Flash on the iPhone and iPad is in Apple’s best interest but I don’t see it as anti-Openness. Sometimes Openness and corporate interests can align, and I believe in this particular case Apple is on the side of Openness and Adobe is on the side of a closed, proprietary Web. At the very least, even if Apple is not a friend of Openness, neither is Adobe. Proprietary technologies and formats as de facto Web standards are a much greater threat to Openness than devices that don’t support them.
In looking at this issue from a Democracy 2.0 and access to information perspective, libraries need to be aware of potential problems with proprietary formats and what devices can and will support them. If librarians believe that access to information is important for democracy, we need to make sure, when we acquire content (via licensing or purchasing), that it is in a format that will be accessible to our patrons now and into the future.
by ecorrado at February 03, 2010 11:01 PM
"We outsourced the usability testing when we built our ebook and journal platforms. The results were invaluable and not only did the company do the tests and take care of the logistics of recruiting the participants, they also provided us with wireframes and lots of documentation, which saved us a lot of time when we got around to building the interfaces.
We posted all of the findings on our wiki:
http://spotdocs.scholarsportal.info/pages/viewrecentblogposts.action?key=client
"
by Tom.Pasley at February 03, 2010 11:00 PM
Yesterday I read a great post by Nat Torkington over at Radar O’Reilly. It really got my juices flowing. I’ve been thinking a lot about Data, especially data in the cloud/open data in advance of my presentation with Michael Klein at Code4Lib at the end of the month (holy crap it’s coming fast).
When I saw the post I printed it out (I know…the trees) and started marking it up. This morning when I came back into the office I took a look at the paper and realized that most of the ideas that I liked were basically screaming “I’m a project!” Some of the highlights from the post were:
- create a user-base for your data
- market to that user-base
- think about publishing your data at the beginning of the project
- consider the sustainability of publishing your data
- think about what you’re hoping to accomplish with your open data
- who are you targeting by opening up your data
- build your project based on what you want to accomplish and who you’re targeting
None of these points are really that new to me (or to anyone that works in systems). You need to plan before starting on a project, and if you plan right you can have an awesome project. I think planning is what makes data.gov.uk so much better than data.gov. It was well thought out. They considered user-bases in advance. They incorporated RDF into the data catalogue (yes that’s the British spelling…we are talking about a British site after all). These little subtleties are what people are the most excited about.
At the same time I recognize that a lot of times we’re working with retrospective data, data that we thought no one would ever want to take a look at. But that doesn’t mean you can’t make it useful now, and data.gov.uk proves that. Through great planning they created a very useful tool.
So my point is PLAN PLAN PLAN. A well thought out plan can make a project succeed or fall flat.
by Rosalyn Metz at February 03, 2010 07:29 PM
I wrote a blog post on my other, Blipfoto, blog this morning, More famous than Simon Cowell, looking at some of the issues around persistent identifiers from the perspective of a non-technical audience. (You'll have to read the post to understand the title).
I used the identifier painted on the side of a railway bridge just outside Bath as my starting point.
It's certainly not an earth-shattering post, but it was quite interesting (for me) to approach things from a slightly different perspective:
What makes the bridge identifier persistent? It's essentially a social construct. It's not a technical thing (primarily). It's not the paint the number is written in, or the bricks of the bridge itself, or the computer system at head office that maps the number to a map reference. These things help... but it's mainly people that make it persistent.
I wrote the piece because the JISC have organised a meeting, taking place in London today, to consider their future requirements around persistent identifiers. For various reasons I was not able to attend - a situation that I'm pretty ambivalent about to be honest - I've sat thru a lot of identifier meetings in the past :-).
Regular readers will know that I've blown hot and cold (well, mainly cold!) about the DOI - an identifier that I'm sure will feature heavily in today's meeting. Just to be clear... I am not against what the DOI is trying to achieve, nor am I in any way negative about the kinds of services, particularly CrossRef, that have been able to grow up around it. Indeed, while I was at UKOLN I committed us to joining CrossRef and thus assigning DOIs to all UKOLN publications. (I have no idea if they are still members). I also recognise that the DOI is not going to go away any time soon.
I am very critical of some of the technical decisions that the DOI people have made - particularly their decision to encourage multiple ways of encoding the DOI as a URI and the fact that the primary way (the 'doi' URI scheme) did not use an 'http' URI. Whilst persistence is largely a social issue rather than a technological one, I do think that badly used technology can get in the way of both persistence and utility. I also firmly believe in the statement that I have made several times previously... that "the only good long term identifier is a good short term identifier". The DOI, in both its 'doi' URI and plain-old string of characters forms, is not a good short term identifier.
My advice to the JISC? Start from the principles of Linked Data, which very clearly state that 'http' URIs should be used. Doing so sidesteps many of the cyclic discussions that otherwise occur around the benefits of URNs and other URI schemes and allows people to focus on the question of, "how do we make http URIs work as well and as persistently as possible?" rather than always starting from, "http URIs are broken, what should we build instead?".
by AndyP at February 03, 2010 01:58 PM
Yesterday we had one of our monthly meetings – this one about handing over Koha assets to the Horowhenua Library Trust, the group we all voted should look out for the interests of Koha for the community. One of the assets we’re still waiting on is the koha.org domain name. Due to recent events we understand that we now have to wait a bit longer for a decision in this matter, but we can’t leave koha.org so very out of date (which it is right now), and so we made a community decision to open up koha-community.org and use that for official releases, documentation (coming soon), news and the support page.
I love when we all come together and solve a problem as a group – showing that all of the silly comments about the community being in turmoil are complete nonsense. We (the community) know who we are and where we want the software to go – and we know how to work together to get things done!
- Who Owns Koha? (go-to-hellman.blogspot.com)
- The Koha Community Rocks! (library-matters.blogspot.com)
by Nicole at February 03, 2010 01:09 PM
I was fascinated to read about the Library of Congress' new Optical Properties Laboratory and what those upgrades have allowed them to achieve in...
February 03, 2010 07:16 AM
Here at the Developer Network, we were excited to learn yesterday that a mashup featuring the WorldCat and New York Times APIs was highlighted on the First Look New York Times blog. The mashup--created by Wade Guidry, the Library Technology Coordinator at the Collins Memorial Library at the University of Puget Sound, and a WorldCat Mashathon Seattle attendee--is done using Yahoo Pipes and draws data (ISBN) from the NYT best seller API to create links to each of the best-selling books in WorldCat.org.
Because the New York Times has several Best Seller lists, Mr. Guidry has created several Pipes.
The library at University of Puget Sound is also incorporating the links to the best sellers lists into their library catalog.
The project just goes to show what a little coding know-how can do to improve experiences for library users.
One thing I'd like to be able to do is to build on this idea and write a script that uses the WorldCat Search API to check and see if a library has holdings for a particular item. If so, then create a link to that library's particular catalog. If not, then create a link to WorldCat.org. Covers would also be a great addition and could be pulled from Open Library or the LibraryThing API.
I've got the basic framework sketched out but we're still getting ready for the VALAtech Boot Camp OCLC API Mashathon next Monday. Keep checking back--I should have a working version posted in the next couple of weeks.
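The link-choosing logic described above could be sketched roughly like this in Python. It is not the script mentioned in the post: `has_holdings` is a hypothetical callable standing in for whatever WorldCat Search API lookup ends up being used, and the catalog URL template is an invented example. Only the WorldCat.org ISBN link pattern is taken as given.

```python
WORLDCAT_ISBN_URL = "https://www.worldcat.org/isbn/{isbn}"

def best_link(isbn, has_holdings, catalog_url_template):
    """Link to the local catalog when the library holds the item, else to WorldCat.org.

    has_holdings: hypothetical callable(isbn) -> bool; in a real script this
        would wrap a holdings lookup against the WorldCat Search API.
    catalog_url_template: assumed local catalog URL containing "{isbn}".
    """
    if has_holdings(isbn):
        return catalog_url_template.format(isbn=isbn)
    return WORLDCAT_ISBN_URL.format(isbn=isbn)

if __name__ == "__main__":
    # A stub set of held ISBNs replaces the real API call for demonstration.
    held = {"9780316769488"}
    template = "https://catalog.example.edu/search?isbn={isbn}"
    for isbn in ("9780316769488", "9780143116387"):
        print(best_link(isbn, held.__contains__, template))
```

Passing the holdings check in as a function keeps the link logic testable without network access.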
by Karen Coombs at February 03, 2010 05:07 AM
For the past few months I’ve neglected to reblog in this space the availability of fresh new Digital Campus podcasts for your listening pleasure. Below is a list of the major topics of each of those episodes—if you’re new to the podcast, pick one that sounds interesting and give it a listen. Or just subscribe to the podcast to have fresh episodes delivered automatically to iTunes or your favorite podcatcher.
Important changes have arrived in this span of podcasts as well. After I served as the “show runner” for the first fifty episodes (doing the voice-overs and guiding the discussion in my best impression of a late-night jazz host), the other regulars on the podcast, Tom Scheinfeldt and Mills Kelly, will assume these duties (along with me) on a rotating basis starting with Digital Campus #51, “The Inevitable iPad.” In addition, we’ve been joined by a rotation of “irregulars” who greatly liven up the proceedings and actually have intelligent things to say.
Episode 51 – The Inevitable iPad: Inevitably, we obsess over what the iPad means for academia, museums, and libraries.
Episode 50 – The Crystal Ball Returns: Our popular year-end/beginning-of-the-year wrap-up and predictions of what’s to come.
Episode 49 – The Twouble with Twecklers: Twitter at academic conferences; speeding up the web.
Episode 48 – Balkanization of the Web?: The revised Google Books settlement; News Corp. v. Google; Wikipedia in its maturity.
Episode 47 – Publishers Bleakly: As publishing business models erode, we look at new models in their infancy.
Episode 46 – Theremin Dreams: How people adopt new technologies; Nook; Droid.
by Dan Cohen at February 03, 2010 02:24 AM
Note: This post is a modified version of a comment I originally posted on Carl Grant’s blog. If you already read it, move on. Nothing new to see here.
Carl Grant recently made a post about Balancing innovation and focus that had a huge bent towards the question of investing in Open Source Software (OSS).
I agree with Carl that many libraries could use more focus when implementing new technology but I strongly disagree that this is any different when it comes to OSS versus proprietary applications. None of his critique is specific to OSS, and singling out OSS is, to me, a bit of a non sequitur. Many proprietary applications, including some of Ex Libris’ offerings, need a great deal of customization and often just as much, if not more, staff to implement and maintain as Open Source. I was talking to a proprietary ILS administrator from another University last year and they have twice as many systems people working on their ILS than Georgia PINES had to originally develop Evergreen. Another example: about three years ago a University had four new job advertisements to help them implement a new proprietary discovery layer. People like David Walker have put a lot of work into implementing a custom interface on top of Metalib. Are these wasted, redundant efforts? Why is this different than focusing efforts on OSS? It’s not any different. Or if it is, one could argue that at least a library would have the software to change and modify, like University of Rochester did with DSpace in creating IR+, which they couldn’t do if they put all their previous efforts into a proprietary product that ended up not suiting their needs. This is not an OSS issue, it is a technology issue and a management issue. It is just as easy to say that Ex Libris building Primo Central (or whatever product you want to name) is “redundant and poorly coordinated investments” considering other vendors are in this space.
Carl’s underlying point “that librarianship is in need of a clear definition of the future of the profession and to examine how technology (open source or proprietary) will move that definition to fruition and, at the same time, leverage librarianship” is well taken and I agree. Libraries should evaluate each technology acquisition carefully considering need, budget, skill level, mission, etc. This evaluation may or may not lead to an existing OSS or proprietary solution, developing a new OSS or home-grown solution, partnering with a vendor on a new product (such as the URM development partners are doing with Ex Libris), or not implementing anything at all. But dividing the world between Open Source and proprietary applications only serves to muddy the water and weaken this message.
That’s a lot of text for a non sequitur, no?
by ecorrado at February 03, 2010 01:24 AM
Salmon, an aggregation protocol, is championed by Google’s John Panzer, and described as “an open, simple, standards-based solution” for “unifying the conversations”.
‘Conversations’ is deliberately plural, I think, to evoke the many conversations, invisible to one another: “The comments, ratings, and annotations increasingly happen at the aggregator and are invisible to the original source.”
Using Salmon, an aggregator pushes comments back to a “Salmon endpoint” (via POST). These can be published (or moderated) upstream at the original source. See also the summary of the Salmon protocol.
Comments swimming upstream…
by jodi at February 03, 2010 12:05 AM
In a post at ZDNet, Dion Hinchcliffe delineates 7 problems of today’s social web:
- Fragmentation of conversation.
- Disconnects between older and newer generations of social media
- Lack of control of identity, contacts, and data.
- A better social Web on mobile devices.
- Poor integration between social media and location services.
- Difficulty of coherently engaging in social activity across many channels.
- Coping with and getting value from the expanding information volume of social media.
from “The social Web in 2010: The emerging standards and technologies to watch” encountered via Ed H. Chi’s post at the PARC Augmented Social Cognition blog.
The trends? Openness, portability, aggregation of distributed content. Hopefully we’ll see more on all these fronts in 2010 and beyond. Hinchcliffe also suggests that we want “Better social and location capabilities added to the core of mobile devices.”
See the full post at ZDNet for more discussion and references to a number of standards, formats, and related developments. In the next post, I’ll highlight Salmon, a protocol for distributed commenting, which I’d neither encountered nor heard of.
by jodi at February 03, 2010 12:03 AM
February 02, 2010
I want to thank everyone who submitted an entry for the Federated Search Blog contest. I also want to thank the judges who read each of the entries and assigned them scores. Prizes will be awarded to the people whose entries earned the three highest scores from the judges. I have contacted the winners so you know who you are! As soon as I get the OK from the Computers In Libraries Conference and Magazine managers I will announce the winners. I don’t want to steal their thunder since CIL Magazine will be publishing the winning essay in its entirety and CIL Conference will be having the first place winner on their panel.
Our distinguished judges: Abe Lederman, Todd Miller, Helen Mitchell, Richard Tong, Walt Warnick
by Sol at February 02, 2010 11:31 PM
Perhaps shockingly, I don't plan to so much as try to wade through all seven-hundred-odd pages of this report on scholarly-publishing practices. It's thorough, it's well-documented, it's decently-written… and based on the executive summary (itself weighing in at a hefty 20 pages), it won't tell me a thing I don't already know.
Academia is conservative. Academia thinks its current scholarly-production system is just fine and dandy, thank you. Academia has a love-hate relationship with peer review. Academia wants to outsource its tenure and promotion decisions any way that is convenient and looks just barely irreproachable enough.
None of this is news. It's dispiriting, but it's not news.
I invite you, however, to take a look at the survey population. "45, mostly elite, research institutions" (p. i) they drew their sample from. Just on the face of it—if we're looking for change in scholarly communication, especially disruptive change, elite researchers in well-established disciplines at elite institutions are the wrong place to look.
Of course such researchers don't want the hill disturbed—they're king of it, aren't they? They're the people for whom "sustaining innovation" is designed, in Clayton Christensen's parlance. They're the very tippy-top of the academic prestige market; they are the last to notice, much less use, a disruptive innovation.
For similar reasons, we don't want to look at the big, established journals and publishers for disruptive innovation. Sustaining innovation, yes, plenty of it. But once again, the king of the hill doesn't allow mining underneath him when he can prevent it.
"But there's better light over here under the streetlamp!" goes the old joke. So where might we look instead, despite the darkness? Well, I have some ideas.
Interdisciplinary, inchoate novelties like the "digital humanities." Young, impecunious disciplines. New journals—what is the proportion of OA to TA journal launches these days, and how is that ratio changing? Disciplines where data need a place to live and thrive. Disruptive innovations start where there's a need that the existing market can't or simply won't address.
That's where the action is likely to be—and to be blunt, most of the reason I'm not wading through that Berkeley report is that it doesn't tell me a thing about where I believe the action is.
Still, there are some good bits about data in there, so the executive summary is worth a skim.
February 02, 2010 10:42 PM
This is not your typical ‘why MARC must die’ post. It’s instead about very low level structural problems in a Marc21 binary file that my ILS outputs. It’s not about the semantics of MARC at all, it’s about the structural features of the Marc21 format.
I never had to know much about low-level Marc21 format details before, and wish I still didn’t, but I had to because my ILS (Horizon) is outputting certain bibs as MARC that the Marc4J Java library used by SolrMarc refused to read, claiming they were structurally invalid in various ways. (Never would have figured this stuff out without the invaluable help of sesuncedu, robcaSSon, and others in #code4lib).
But this may help someone else figuring out why Marc4J can’t read their MARC.
1. Invalid leader bytes
In the leader of a Marc21 record, byte 10 is always ascii ‘2’, byte 11 is always ‘2’ as well, and bytes 20-23 are always ‘4500’. At least they’re supposed to be. Theoretically these bytes allow a record to specify details about the nature of its binary format, but these details are fixed in Marc21 and in all other Marc variants we know of; it’s a flexibility that was rarely or never taken advantage of in any Marc format.
However, Horizon actually stores most of its leader bytes in a db column. And if the leader bytes are something other than these invariants in that db column, Horizon’s marc export will include those leader bytes — even if they are invalid, even if they do _not_ accurately describe the Marc record they are attached to (which wouldn’t be a valid Marc21 record if it was true).
Since these values are invariant in Marc21 and most (all) other Marc formats, most Marc parsers ignore them.
However, Marc4J doesn’t, it actually treats them as gospel. So if those bytes were wrong, Marc4J will try reading the record improperly. And if those bytes weren’t ascii decimal digits at all, Marc4J will claim it can’t read the leader.
So I just had to fix those in our production ILS. And figure out where they’re coming from, and try to stop them from coming in again. Really, I blame our ILS here for even allowing such completely wrong bytes to be in its internal db.
(A perhaps better solution from the other end is fixing the Marc4J PermissiveReader to not pay attention to those bad bytes, assuming the invariant values. sesuncedu has prepared a patch doing some of that for Marc4J, hopefully it’ll get in there.)
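To make the invariants concrete, here is a minimal Python sketch that checks a 24-byte leader and, like the permissive approach described above, simply assumes the fixed values. It is an illustration only, not the Marc4J patch; the function names are made up for this example.

```python
LEADER_LENGTH = 24  # a Marc21 leader occupies bytes 0-23

def leader_problems(leader: bytes) -> list:
    """Return descriptions of the invariant leader bytes that are wrong."""
    if len(leader) != LEADER_LENGTH:
        return [f"leader is {len(leader)} bytes, expected {LEADER_LENGTH}"]
    problems = []
    if leader[10:11] != b"2":
        problems.append("byte 10 (indicator count) is not '2'")
    if leader[11:12] != b"2":
        problems.append("byte 11 (subfield code length) is not '2'")
    if leader[20:24] != b"4500":
        problems.append("bytes 20-23 (entry map) are not '4500'")
    return problems

def force_invariants(leader: bytes) -> bytes:
    """Overwrite the invariant positions, keeping every other byte as found."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError("leader must be exactly 24 bytes")
    return leader[:10] + b"22" + leader[12:20] + b"4500"

if __name__ == "__main__":
    bad = b"00714cam a  00205 a     "  # invariant bytes blanked out
    print(leader_problems(bad))
    print(force_invariants(bad))
```

A reader that calls something like `force_invariants` before parsing never trusts those bytes, which is the behaviour most Marc parsers get by ignoring them.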
2. Bib Records Too Long For Marc
Because of the nature of MARC21’s ‘directory’ structure, there is a maximum length that a MARC record can be. If it’s above this length, the MARC directory doesn’t have enough bytes in it to describe where the fields beyond this length are in the record, and the MARC record is unreadable. (Incidentally, it’s very odd that MARC records its internal byte offsets as ascii decimal chars, rather than ordinary binary data. If it used a more typical simple binary encoding of integers for byte offsets, the maximum length of a MARC record would be quite a bit larger. But it doesn’t. Oh well.)
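The arithmetic behind that limit can be sketched in a few lines of Python. The field widths are the standard Marc21 ones (a 5-digit record length in the leader, and 12-byte directory entries made of a 3-character tag, a 4-digit field length and a 5-digit starting position); the helper names are invented for this example.

```python
LEADER_LENGTH = 24
RECORD_LENGTH_DIGITS = 5   # leader bytes 0-4
TAG_WIDTH = 3              # directory entry: field tag
FIELD_LENGTH_DIGITS = 4    # directory entry: length of the field
FIELD_OFFSET_DIGITS = 5    # directory entry: starting position of the field
DIRECTORY_ENTRY_WIDTH = TAG_WIDTH + FIELD_LENGTH_DIGITS + FIELD_OFFSET_DIGITS


def max_decimal(digits):
    """Largest value that fits in `digits` ASCII decimal characters."""
    return 10 ** digits - 1


def max_binary(num_bytes):
    """Largest value that would fit if the same bytes held a binary integer."""
    return 256 ** num_bytes - 1


def check_sizes(field_lengths):
    """Return (total_length, problems) for fields of the given byte lengths.

    Each length is assumed to include the field's own terminator byte.
    """
    problems = []
    for i, length in enumerate(field_lengths):
        if length > max_decimal(FIELD_LENGTH_DIGITS):
            problems.append("field %d is %d bytes, too long for the directory" % (i, length))
    # leader + directory + directory terminator + fields + record terminator
    total = (LEADER_LENGTH + DIRECTORY_ENTRY_WIDTH * len(field_lengths) + 1
             + sum(field_lengths) + 1)
    if total > max_decimal(RECORD_LENGTH_DIGITS):
        problems.append("record is %d bytes, too long for the leader" % total)
    return total, problems
```

With these widths, `max_decimal(5)` is 99999 bytes for a whole record and `max_decimal(4)` is 9999 bytes for a single field, whereas `max_binary(5)` would be 1099511627775.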
So what does Horizon marc export do if it has a record which has too much data, which will go over the maximum record length in MARC? It outputs it anyway. But the marc record it outputs is seriously messed up. It’s got a MARC directory which may be entirely illegal (not a multiple of 12), it’s got a wrong leader bytes 0-4 ‘length’, possibly other problems. Depending on the individual record and exactly how Horizon ended up outputting it, Marc4J might just skip it as a bad record and go on. That’s the best that could be expected.
However, more often, Marc4J gets entirely confused because of the bad leader bytes 0-4 length, and doesn’t understand where the subsequent record in the marc file actually begins. So every other record after this too long one in the marc file is a loss to Marc4J/SolrMarc indexing. Either every subsequent record can’t be indexed at all, or even worse, every subsequent record is indexed by Marc4J/SolrMarc, but completely wrong, because Marc4j/SolrMarc got the wrong data.
I need to work out a patch for Marc4J PermissiveReader so when encountering such a record, Marc4J can at least recover by properly finding the beginning of the NEXT record, using the Marc Record Separator character.
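The recovery idea can be sketched outside Marc4J. The Python below is only an illustration of the technique, not the patch itself: it ignores the length declared in leader bytes 0-4 and finds record boundaries by scanning for the terminator byte (hex 1D), then reports which records have a declared length that disagrees with their real length. The function names are hypothetical.

```python
RECORD_TERMINATOR = b"\x1d"


def split_records(data):
    """Yield raw records by scanning for 0x1D, ignoring the leader length."""
    start = 0
    while start < len(data):
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            # Trailing bytes with no terminator: hand them back as-is.
            yield data[start:]
            return
        yield data[start:end + 1]
        start = end + 1


def length_matches(raw_record):
    """True if leader bytes 0-4 agree with the record's actual length."""
    declared = raw_record[:5]
    return declared.isdigit() and int(declared) == len(raw_record)
```

A reader built this way could skip any record where `length_matches` is false and carry on with the next one. It would still be fooled by a stray terminator byte inside a field value, which is the separate problem described further down.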
3. Blank/null tags
This one might be Horizon-specific. Horizon allows the operator to accidentally add a tag to a record that has a null tag value. Not 100, 245, or something else, but just null. This accident could have been made manually, or could have been made by some sort of automated import script when we batch loaded records into Horizon. When the Horizon marc exporter encounters such a record, it does output marc21 for it, but completely invalid and wrong marc21.
I blame Horizon for this: it ought not to allow null tag values to exist in the db at all, and if they do exist, they ought to be ignored on export rather than producing an invalid marc record.
This is another problem that often results in Marc4J getting completely confused about where one record ends and the next starts, making the entire rest of the Marc file after such a record un-readable. Probably because of a bad leader bytes 0-4 length value, so perhaps if I can work out the patch described above, it will at least result in Marc4J successfully skipping such a record and going on to the rest of the file.
4. Illegal chars in Marc values?
This one I haven’t completely gotten to the bottom of yet, because I made the mistake of fixing the couple examples I found in the Horizon Staff Client, where it didn’t really show me exactly what was going on.
But I think some Marc control characters (Field Terminator or Record Terminator) wound up in some of my record values in the db. (No doubt as the result of an import gone wrong at some point in the past). The Horizon marc exporter simply included them unescaped in its marc output, resulting in special marc control characters in illegal places, or places where they don’t mean what they are supposed to mean, in the marc file. This also messed up Marc4J something awful.
Again, I kind of blame Horizon here, for allowing bad data in its internal store, and then for writing bad data out in marc export when such bad data is in its internal store.
NEW! 4 Feb 2010:
Marc control character in internal data value.
I’ll describe this one in Horizon-specific terminology, cause it’s clearer.
The horizon “bib” table holds an individual marc field in the ‘text’ column. Every ‘text’ column ENDS in the Marc Field Terminator character (decimal 30, hex 1E, sometimes displayed as “^^”).
However, some of our values have that Marc Field Terminator character _not_ as the last character, but internally. This creates problems in marc export, where the marc created by marcout is invalid unparseable marc. (as it includes marc Field Terminator control character in illegal position).
This problem is not visible in Horizon Staff Client, the control character is not shown. But it’s hiding there in the database anyway. If you open an individual record in Horizon Staff Client and then simply re-save it, it SEEMS to fix the problem in at least some cases (not sure about all), but probably makes more sense to fix it in bulk through an automated process anyway.
As a technical note: I used this SQL against hzdev db to find the number of bibs which contained char(30) as some char OTHER than the last in dbo.bib. It takes quite a while to run. This would have to be re-done for dbo.bib_longtext.longtext, another table where data destined for marc export can hide. You could base an automated fix off of this SQL technique.
select count(distinct bib#) from dbo.bib where (charindex(char(30), text) != char_length(text))
Correction: that SQL will also find values that do not end in the FT at all. Horizon ordinarily does end every value with one, so a missing FT is an error of sorts, but it doesn’t cause any problems. Here’s a query that finds only values with an internal FT, excluding those with no FT whatsoever:
select count(distinct bib#) from dbo.bib where (charindex(char(30), text) not in(0, char_length(text)))
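The same test is easy to express outside the database, for instance when checking values pulled out by a script. This is a small Python sketch mirroring the logic of the second query (position of the first FT compared with the length of the value); the function name and the three labels are made up for illustration.

```python
FIELD_TERMINATOR = "\x1e"  # decimal 30


def classify_field_text(text):
    """Classify a stored field value by where its first FT character sits."""
    first = text.find(FIELD_TERMINATOR)
    if first == -1:
        return "missing"    # no FT at all: odd, but harmless on export
    if first == len(text) - 1:
        return "ok"         # FT only as the final character
    return "internal"       # FT somewhere before the end: breaks marc export
```

Only the values classified as "internal" correspond to the rows counted by the corrected SQL.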
A note on MARC control character terminology
One confusing thing in dealing with this stuff that took me a while to figure out is how MARC uses its own special weird names for certain control characters.
MARC has a “Field Terminator” (which is sometimes called ‘field separator’ in marc docs instead of ‘terminator’) and a “Record Terminator” (also sometimes called ‘record separator’ in docs instead of ‘terminator’).
But the ascii values used for these special MARC control codes already had names in ascii, and they are confusingly similar but not the same names! This certainly leads to confusion.
Marc “Field Terminator” == Hex 1E == Decimal 30 == Ascii “Record Separator” == “control-^” or “^^”, which is how stock vim will display it.
Marc “Record Terminator” == Hex 1D == Decimal 29 == Ascii “Group Separator” == “control-]” or “^]” which is how stock vim will display it.
(Correction 3 Feb, this next is also part of the marc standard):
Marc “Subfield Delimiter” == Hex 1F == Decimal 31 == Ascii “Unit Separator” == “control-_” or “^_” which is how it will show up in vim.
Also update 3 Feb 2010. I made this little sign and now keep it on my wall next to my desk, so I can refer to it ha.
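For anyone who would rather keep the sign in code than on the wall, the same mapping can be written as a small Python table. The byte values and names are the ones listed above; the caret notation is computed by flipping bit 0x40, which is the usual convention for displaying control characters. The function names are just for this example.

```python
# byte value -> (MARC name, ASCII name)
MARC_CONTROL_CHARS = {
    0x1D: ("Record Terminator", "Group Separator"),
    0x1E: ("Field Terminator", "Record Separator"),
    0x1F: ("Subfield Delimiter", "Unit Separator"),
}


def caret_notation(byte_value):
    """Return the ^X form that editors such as vim use for a control byte."""
    return "^" + chr(byte_value ^ 0x40)


def describe(byte_value):
    """Return one line of the 'sign' for a MARC control byte."""
    marc_name, ascii_name = MARC_CONTROL_CHARS[byte_value]
    return "MARC %s = hex %02X = decimal %d = ASCII %s = %s" % (
        marc_name, byte_value, byte_value, ascii_name, caret_notation(byte_value))
```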
by jrochkind at February 02, 2010 09:41 PM
There are several options for describing APIs in a way that machines and/or people can read: WSDL files (mostly used with SOAP), OpenSearch description files, YQL Open Data Tables, etc.
I had a theory that REST APIs could be sufficiently described — in a way that both machines and people can understand — using HTML5 forms.
Here's an example, describing the NCBI's ESearch API (part of EUtils):
This makes use of several new or modified HTML5 attributes on input and select elements: "type", "required", "pattern", "placeholder" and "autofocus".
The description needs to be able to define:
- which fields are available ("name")
- their default values ("selected")
- whether values are required ("required")
- hints for when required fields are not filled ("title")
- what type of data is in each field ("type")
- patterns to validate the data in each field ("pattern")
- possible options for each field, when the set of options is limited ("select/option")
- human-readable descriptions of each field ("label")
- suggested values or hints for the formatting of fields ("placeholder")
- dependencies between different fields ("optgroup", "class")
Still needed: a way to express dependencies between fields (e.g. "either this or these are required").
Deliberately missing: any definition of the structure or semantics of the response.
Here's another example, for EFetch. This one is missing optgroup elements, while I investigate the different combinations of parameters.
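To illustrate the "machines can read it" half of the theory, here is a rough Python sketch that pulls a field description out of such a form using only the standard library. The form markup is a made-up stand-in, not the ESearch description referred to above, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative form only; the field choices are assumptions for this sketch.
EXAMPLE_FORM = """
<form action="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" method="get">
  <label for="db">Database</label>
  <select name="db" id="db" required>
    <option value="pubmed" selected>PubMed</option>
    <option value="pmc">PubMed Central</option>
  </select>
  <label for="term">Search term</label>
  <input name="term" id="term" type="search" required placeholder="e.g. asthma" autofocus>
  <label for="retmax">Maximum results</label>
  <input name="retmax" id="retmax" type="text" pattern="[0-9]+" value="20">
</form>
"""


class FormDescription(HTMLParser):
    """Collect the action URL and per-field constraints from a form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._select = None  # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # boolean attributes appear with value None
        if tag == "form":
            self.action = attrs.get("action")
        elif tag in ("input", "select") and "name" in attrs:
            field = {
                "type": attrs.get("type", "select" if tag == "select" else "text"),
                "required": "required" in attrs,
                "pattern": attrs.get("pattern"),
                "placeholder": attrs.get("placeholder"),
                "default": attrs.get("value"),
                "options": [],
            }
            self.fields[attrs["name"]] = field
            if tag == "select":
                self._select = field
        elif tag == "option" and self._select is not None:
            self._select["options"].append(attrs.get("value"))
            if "selected" in attrs:
                self._select["default"] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def describe_form(html):
    """Parse a form and return the populated FormDescription."""
    parser = FormDescription()
    parser.feed(html)
    return parser
```

A client could use the resulting `fields` dictionary to validate parameters before building a request URL from `action`. Labels and optgroup dependencies are not handled in this sketch.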
February 02, 2010 08:31 PM

I had a great time speaking in front of the audience at Digital Book World 2010. Actually, that’s not true, it was pretty scary — big audience! But the talk seemed well-received, and the conference was very well-organized, especially given that this was its first year.
You can follow the slides and notes on the Digital Book World page for the talk (sorry about my giant head). The slides are also embedded below.
by Liza Daly at February 02, 2010 08:30 PM
I wrote about RDF-encoding contact information a little earlier and had some very helpful comments. On reflection, and after exploring the “View Source” options for a couple of institutional contact pages, I’ve had some further thoughts.
- Contacts pages are rarely authored, they are nearly always created on the fly from an underlying database. This makes them natural for expressing in RDF (or microformats). It’s just a question of tweaking the way the HTML wrapper is assembled. Bath University’s Person Finder pages do encode their data in microformats.
- I wondered why more universities don’t encode their data in microformats or (even better) in RDF for Linked Data. One possible answer is that the contact pages were probably one of the earliest examples of constructing web pages from databases. It works, it ain’t broke, so they haven’t needed to fix it! If so, a reasonable case would need to be made for any change, but once made it would be comparatively cheap to carry out.
- A second problem is that it is not at all clear to me what the best encoding and vocabulary for institutional (or organisational unit) contact pages might be. So maybe it’s even less surprising that things have not changed. To say I'm confused is putting it mildly! So what follows lists some of the options after further (but perhaps not complete) investigation...
One approach is the hCard microformat, based on the widely used vCard specification, RFC2426 (this is what Bath uses). That’s fine as far as it goes, but microformats don’t seem to fit directly in the Linked Data world. I’m no expert (clearly!), but in particular, microformats don’t use URIs for the names of things, and don’t use RDF. They appear useful for extracting information from a web page, but not much beyond that (I guess I stand to be corrected here!).
Looking at RDF-based encodings, there are options based on vCard, there are FOAF and SIOC (both really coming from a social networking view point), and there’s the Portable Contacts specification.
Given that vCard is a standard for contact information, it would seem sensible to look for a vCard encoding in RDF. It turns out that there are two RDF encodings of vCard, one supposedly deprecated, and the other apparently unchanged since 2006. I now discover an activity to formalise a W3C approach in this area, with a draft submission to W3C edited by Renato Ianella and dating only from last December (2009), but I would need a W3C username and password to see the latest version, so I can't tell how it's going.
Someone asked me a while ago who sets the standards for Linked Data vocabularies. My response at the time was that the users did, by choosing which specification to adopt. At the time, FOAF seemed to have most mentions in this general area, and I rather assumed (see the previous post) that it would have the appropriate elements. However, the “Friend of a Friend” angle really does seem to dominate; this vocabulary does seem to be more about relationships, and to be lacking in some of the elements needed for a contacts page. I suspect this might have stemmed from a desire to stop people compromising their privacy in a spam-laden world, yet those of us in public service posts often need to expose our contact details. FOAF does at least have email as foaf:mbox, which apparently includes phone and fax as well, as you can see from the sample FOAF extract in my earlier post.
In a tweet Dan Brickley suggested: “We'll probably round out FOAF’s address book coverage to align with Portable Contacts spec”, so I had a look at the latter. The main web site didn’t answer, but Google’s cache provided me with a draft spec, which does appear to have the elements I need.
What elements do I need for a contact page? Roughly I would want some or all of:
- Name
- Job title/role in DCC (my virtual organisation)
- (Optional job title/role in home organisation)
- Organisational unit/Organisation
- Address/location
- Phone/fax numbers
- Email address
So what could I do if this information were expressed in RDF in the contact pages for a partner institution (say UKOLN at Bath)? Well, presumably the DCC contact pages would be based on a database showing the staff who work on the DCC, with the contact information directly extracted from the remote pages (either linked in real time or perhaps cached in some way). And if Bath changed their telephone numbers again, our contact details would remain up to date. But more. Given that there are some staff members who have roles in several projects, it would be easy to see what the linkages were between the DCC and the other project (eg RSP in the past, or I2S2 now). Part of the point of Linked Data (rather than microformats) is that one can reason with it; follow the edges of the great global graph…
And perhaps I would be able to find a simple app that extracts a vCard from the contact page to import into my Mac’s Address Book, which is where I started this search from! You wouldn’t think it would be hard, would you? I mean, this isn’t rocket science, surely?
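The last step, at least, is not rocket science. Assuming the contact fields listed above have already been extracted from a page by some means (that extraction is the hard part and is not shown), a vCard 3.0 file can be written with a few lines of Python. This is only a sketch: the dictionary keys and the sample person are invented for illustration, and only a handful of vCard properties are covered.

```python
def escape(value):
    """Escape characters that are special in vCard 3.0 text values."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def to_vcard(contact):
    """Build a vCard 3.0 string from a dict of already-extracted fields."""
    given = escape(contact["given_name"])
    family = escape(contact["family_name"])
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        "N:%s;%s;;;" % (family, given),   # family;given;additional;prefix;suffix
        "FN:%s %s" % (given, family),
    ]
    if contact.get("title"):
        lines.append("TITLE:" + escape(contact["title"]))
    if contact.get("organisation"):
        parts = [contact["organisation"]] + contact.get("units", [])
        lines.append("ORG:" + ";".join(escape(p) for p in parts))
    if contact.get("phone"):
        lines.append("TEL;TYPE=WORK,VOICE:" + escape(contact["phone"]))
    if contact.get("fax"):
        lines.append("TEL;TYPE=WORK,FAX:" + escape(contact["fax"]))
    if contact.get("email"):
        lines.append("EMAIL;TYPE=INTERNET:" + escape(contact["email"]))
    lines.append("END:VCARD")
    return "\r\n".join(lines) + "\r\n"


# Entirely fictional sample data.
SAMPLE_CONTACT = {
    "given_name": "Ada",
    "family_name": "Example",
    "title": "Data Curator",
    "organisation": "Example University",
    "units": ["Library"],
    "phone": "+44 0000 000000",
    "email": "ada@example.org",
}
```

Saving the output of `to_vcard(SAMPLE_CONTACT)` to a `.vcf` file should give something an address book application can import.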
by noreply@blogger.com (Chris Rusbridge) at February 02, 2010 07:38 PM

QR-Code pointing to DLTJ
This morning I attended a presentation on “Using QR Codes and Mobile Phones for Learning” at the Ohio Educational Technology Conference. Presented by Thomas McNeal and Mark van’t Hooft from Kent State University, the example used in the presentation was their GeoHistorian Project from the 2009 ISTE conference. By using a pamphlet of 2-D barcodes labeled with strategic locations at the World War II Memorial in Washington, DC, participants using barcode scanners on smartphones were able to call up text and media from various websites while walking around the memorial. They put together a video showing participants walking through the space and their impressions of the 2-D barcode-enhanced experience.
Tom emphasized the need to have an activity that is relevant to the technology. As he put it, “Use the technology to amplify the activity.” In this specific case, the 2-D barcodes pointed to text, pictures, and videos that provide additional background to the components depicted in the World War II Memorial. As participants mentioned in the video, it is a way to add context to the experience of walking through the memorial.
I had one minor quibble with the execution of the project. The presenters were using EZcodes, a 2-D barcode format licensed exclusively to ScanBuy rather than the emerging de facto QR codes. The problem with EZcodes is that what is encoded is an identifier that is translated by the ScanLife smartphone app to the final destination. By contrast, a QR code has the actual content — a short snippet of text, a URL, a phone number, etc. — actually encoded in the barcode. With an EZcode, the application on one’s smartphone has to look up the value of the identifier at ScanLife’s service before going to the final destination. With a QR code, the smartphone application can go right to the destination website.
The EZcode/ScanBuy scheme has privacy and sustainability issues. First, according to the terms-of-service for the ScanLife reader, each reader application is assigned a unique identifier; because the application must contact ScanLife with the 2-D barcode identifier to find the value behind the identifier, ScanLife knows everything you scan. Secondly, the ScanLife server is a mandatory intermediary in the process, so if ScanLife goes away all of the barcodes become worthless. This is somewhat similar to the problem with music encumbered with digital rights management; if the server is unavailable, the music file is worthless. QR codes, though, do not have either of these problems, since the data is encoded in the barcode itself.
Post from: Disruptive Library Technology Jester
Experiential Learning Enhanced with 2-D Barcodes
by Peter Murray at February 02, 2010 06:52 PM
In a blog post at Creative Commons, UK moves towards opening government data, Jane Park notes that the UK Government have taken a significant step towards the use of Creative Commons licences by making the terms and conditions for the data.gov.uk website compatible with CC-BY 3.0:
In a step towards openness, the UK has opened up its data to be interoperable with the Attribution Only license (CC BY). The National Archives, a department responsible for “setting standards and supporting innovation in information and records management across the UK,” has realigned the terms and conditions of data.gov.uk to accommodate this shift. Data.gov.uk is “an online point of access for government-held non-personal data.” All content on the site is now available for reuse under CC BY. This step expresses the UK’s commitment to opening its data, as they work towards a Creative Commons model that is more open than their former Click-Use Licenses.
This feels like a very significant move - and one that I hadn't fully appreciated in the recent buzz around data.gov.uk.
Jane Park ends her piece by suggesting that "the UK as well as other governments move in the future towards even fuller openness and the preferred standard for open data via CC Zero". Indeed, I'm left wondering about the current move towards CC-BY in relation to the work undertaken a while back by Talis to develop the Open Data Commons Public Domain Dedication and Licence.
As Ian Davis of Talis says in Linked Data and the Public Domain:
In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.
He suggests instead that waivers (of which CC Zero and the Public Domain Dedication and License (PDDL) are examples) are a better approach:
Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explicitly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be pursued for license fees in the future.
Ian Davis' post gives detailed technical information about how such waivers can be used.
by AndyP at February 02, 2010 06:09 PM
Interesting article about the scalability issues around Second Life, What Second Life can teach your datacenter about scaling Web apps. (Note: this is not about the 3-D virtual world aspects of Second Life but about how the infrastructure to support it is delivered.)
Plenty of pixels have been spilled on the subject of where you should be headed: to single out one resource at random, Microsoft presented a good paper ("On Designing and Deploying Internet-Scale Services" [PDF]) with no less than 71 distinct recommendations. Most of them are good ("Use production data to find problems"); few are cheap ("Document all conceivable component failure modes and combinations thereof"). Some of the paper's key overarching principles: make sure all your code assumes that any component can be in any failure state at any time, version all interfaces such that they can safely communicate with newer and older modules, practice a high degree of automated fault recovery, auto-provision all resources. This is wonderful advice for very large projects, but herein lies a trap for smaller ones: the belief that you can "do it right the first time." (Or, in the young-but-growing scenario, "do it right the second time.") This is unlikely to be true in the real world, so successful scaling depends on adapting your technology as the system grows.
by AndyP at February 02, 2010 05:31 PM
I wish library I can buy book.
I wish we had a permanent library.
I wish to be happy and proud of my accomplishments.
In the window of the Chinatown Storefront Library in Boston stood a Wish Tree. Modeled after Yoko Ono's Wish Tree Project, the tree was meant to allow patrons to pass on a spirit of energy and hope. The instructions were:
Make a wish. Write it down on a piece of paper. Fold it and tie it around a branch of a wish tree. Ask your friend to do the same. Keep wishing until the branches are covered.
The Chinatown Storefront Library closed its doors on January 17, 2010, the Sunday that ALA Midwinter was in town. Always meant to be a temporary library, the Storefront Library was an expression by Boston's Chinatown community of its need and support for a library of its own. The Chinatown neighborhood of Boston has been without a branch of the Boston Public Library since 1956, when the branch was closed and demolished to make way for a highway.
Without a local branch, Chinatown residents needing library services have to go to the main library in Copley Square, which, though a beautiful building, may seem rather imposing and hard to navigate for someone looking for Chinese language materials.
The founders of Chinatown Storefront Library, Sam and Leslie Davol, had been involved in community meetings surrounding the proposed design and construction of a new branch of Boston Public Library, and in that process had gotten to know faculty at Harvard's Graduate School of Design. With a new branch on hold due to budgetary reasons, the Davols decided to take action. A local developer offered to let them use a vacant storefront for free. Design students made some gorgeous, modernistic shelving pieces for the library, enabling it to create an inviting environment in a bare commercial space. Library students from Simmons paired with Cantonese- and Mandarin-speaking community volunteers to staff the facility. Donations of over 5,000 books were solicited, and for twelve weeks, a community library came into existence. The operating budget for the entire project was about $10,000.

The day before the closing, I had a chance to tour the Storefront Library and sit down with Sam Davol. Formerly a legal-aid lawyer in New York, he and his wife moved back to Boston with their two children, partly so that Sam could devote more time to music. The Library project was an outgrowth of their involvement in the community and other cultural programming they've produced.
In just a few short months, the Storefront Library has had a clear impact on its neighborhood. People who used to avoid the block because of its vacant, spooky feel began to feel welcomed by the activity surrounding the library. Cultural activities, language classes and storytimes attracted people from the community and passersby.
Initially, the Storefront Library did not plan to circulate books, but in the first week of operation patrons told them that they really wanted to take books home with them. A makeshift paper-based circulation system was implemented, and 1,374 books were circulated in 11 weeks of operation, over half of them in Chinese. Over 4,000 books were catalogued using LibraryThing.
In talking to librarians in general about the storefront library concept, I've gotten a consistent reaction that small storefront spaces could not provide sufficient room to provide internet access; terminals take up more room than books. At the Storefront Library, the computers tended to be lightly used. When I was there, some older gentlemen were reading newspapers, some children were reading books, but no one was using the computers or internet access. This could be because the Library did not subscribe to electronic resources.
I think the most important lesson that can be learned from the Storefront Library experiment is that even small temporary libraries can be powerful agents of community development. In Boston, this role was accentuated by a location in close proximity to people's everyday lives. While I've written that the future of public libraries may be in smaller locations, the Chinatown Storefront Library reminded me that many public libraries began as grassroots efforts to promote knowledge and culture.
Now that the Storefront Library has closed, its books will be going to a new reading room, to local schools, and a few to the Chinese Historical Society of New England. The furniture will be going to local schools and daycare facilities. Information about the project will be published on the storefrontlibrary.org website so that similar projects in other communities can learn from their experiences.
As for Sam Davol, he goes on tour. He plays cello with the indie-pop band "The Magnetic Fields", which has a new CD out, Realism. I just got my tickets for one of the shows at New York's Town Hall in March.
I wish there were more people experimenting with libraries.
by Eric Hellman (noreply@blogger.com) at February 02, 2010 01:03 PM
Wayne Bivens-Tatum, a Princeton librarian and blogger, wrote an excellent post, called "Nothing is the Future." It attacks a certain sort of insipid library futurism—and is going all over the "Twittersphere":
The kindest interpretation of statements like "the future is mobile" or "the future of reference is SMS" or "the future is librarians in pods" or whatever is that the librarians are trying to create that future by speaking it. The incantation will somehow make it so.... The less kind interpretation is that the authors of such statements are reductionist promoters, reducing a complex field to whatever marginal utility they're focused on and claiming that this is the future, while simultaneously promoting themselves as seers.
The obvious and most likely statement is that nothing is the future, as in no thing is the future, period. Anyone who tells you different is just plain wrong. With technology, it should be clear to anyone who bothers to see past their obsessions that formats and tools die hard. Some people like to imply that if librarians don't take up every new trend they'll become like buggy whip makers. I should point out that there are still people who make buggy whips. Buggy whips aren't as popular as they once were, but they're still around. There are even buggies to accompany them.
I started to reply in comments, but my words added up. So here they are:
Though a purveyor of "Web 2.0" ideas (I founded LibraryThing, what can I say?), I think it's a great post.
The rhetoric you describe rings true. It starts, I think, from the popularizers and enthusiasts who take up new technologies and communicate them to the great mass of librarians whose life revolves around other things. To get through the clutter—to be one of the things you take back from a weekend of ALA or PLA talks—the message is simplified and the rhetoric ratchets up. "This is useful" loses out to "this will save you." As it passes through libraryland the cycle repeats in spirals of simplification and amplification. Over and over I see broader intellectual discussions of technology and the future of libraries reduced to trivial and ephemeral exhortations like "every library needs to be on Meebo!" or "the future is SMS!"
It's depressing, but it's not unique to library technology. You see it in other trends, like "green libraries" (they're the future, didn't you get the memo?). It's in the dynamics of communication. Your post is a good corrective to it.
At the same time, you're missing something. I don't know if you're missing it for real, or just in this focused expression. But there's a powerful "yes but" here, and it needs saying—shouting even!—lest people take the wrong thing from your post.
For all the nonsense and hype, libraries are subject to an extraordinary and rapid cultural change. They have already changed drastically, especially if "libraries" means what libraries mean to culture generally, and people who don't work in them.
Libraries are in the "information business" and this business is in one of the most profound transformations in human history. This isn't buggies vs. Stanley Steamers, different ways of getting to the haberdasher. It's horse-and-buggy culture vs. everything the car has brought: mass production, suburban living, the Blitzkrieg, the global economy, global warming and the sexual revolution. Certainly, as you say, carriages continue to exist as objects that convey people, but their meaning has been utterly transformed. If libraries end up as a way for rich people to indulge children on a visit to a big city (what carriages mean today), well, crap! How did that happen?!
The world is changing, and for all the noise about this or that technology, I don't think libraries are dealing with it squarely. (Forget Web 2.0; libraries haven't really ingested Web 1.0 yet.) "The future is X" isn't the best response to that change, but it's a response.
I expect your post will get wide circulation. It says something that hasn't been said before as well. But if it prompts librarians to dismiss technology's impact on the future of libraries, it will do great harm. Instead, I hope people use your essay as a way to "kick it up a notch" intellectually, get past the small stuff and confront the very real changes ahead.
PS: By the way, LibraryThing is releasing a universal mobile catalog. It's the future. No, really! :)
by Tim (noreply@blogger.com) at February 02, 2010 10:57 AM
This is the second post about the ANDS-funded metadata store work we’re doing at ADFI. The project now has a Trac site where we will be tracking progress and keeping notes; the site will be open to the public to read, to make it as easy as possible to reach a wide range of stakeholders, although there will be a few documents we have to keep under tighter control. The Trac site will be mainly used as a project wiki – but I will put in some job tickets and use the milestones feature to track what happens between our (mostly) weekly project meetings. This week’s milestone is up now.
In this post I’ll reveal some detail about our starting point (though not the name of the institution(s) we’ll be working with) and follow up on some of the feedback I got from ANDS staffers to my previous post. Scott Yeadon raised quite a few points, some of which I will get to in future posts and/or project plans and wiki pages.
ADFI staff met with ANDS stakeholders late in 2009, and we have agreed that a good starting point will be to focus on one of the ‘additional’ deliverables first. The core deliverable is a project plan to build a stand-alone metadata store, with an option to write extra plans for add-ons or customisations to existing repository software such as DSpace or Eprints should any organisations want to keep metadata about data collections in their IR. It happens that there has been a fair bit of work at at least one institution (University X) in Australia where they do plan to keep metadata about collections (and maybe parties, and so on) in their IR; they are running the VITAL software which was associated with the ARROW project. So that’s our starting point: a project plan to write some open source software and supply configuration files and customisation, and documentation for an ARROW repository so that University X and other sites running the ARROW suite of software can participate in the data commons, and submit data to Research Data Australia via their IR.
The VITAL software is a web interface to Fedora, with configuration to index a Fedora repository and display. You can see it in action at the home of ARROW, Monash University. Some sites use a bit of free software called VALET to put things into the repository.
VALET has a simple design which I like – it allows you to design a form or forms to capture as much metadata as you like and configure a set of really simple approval steps. When a user starts adding metadata about a new object, the system saves the in-progress data by the simple expedient of serialising the form data to XML. Moving to the next step of a workflow just requires the application to put the saved data back into the form. When a user with the correct rights adds the item to the repository, pre-configured XSLT stylesheets run automatically to transform the serialised form data to whatever is required, usually MARCXML and Dublin Core.
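The save-and-restore step can be sketched in a few lines of Python. This is only an illustration of the idea, not VALET's actual code: the element names (form, field) and the field names are assumptions made up for the example, and the XSLT step that runs at deposit time is left out.

```python
import xml.etree.ElementTree as ET


def serialise_form(fields):
    """Serialise a flat dict of form fields to an XML string."""
    root = ET.Element("form")  # hypothetical element name
    for name, value in fields.items():
        field = ET.SubElement(root, "field", {"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")


def restore_form(xml_text):
    """Rebuild the dict of form fields from the saved XML."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.text or "") for f in root.findall("field")}


# Save the in-progress data, then reload it for the next workflow step.
saved = serialise_form({"title": "Reef survey 2009", "creator": "A. Researcher"})
restored = restore_form(saved)
```

Because the saved form is just XML, the same document can later be handed to whatever stylesheets produce the MARCXML and Dublin Core records.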
The ARROW project sponsored a replacement for VALET called Squire which I reported on in 2008, but so far nobody has used it in anger. For this project I think it might be good to use Squire, or something like it, rather than VALET; we’re discussing the pros and cons with ANDS and University X.
Over the next week I will be putting together a skeletal draft of a project plan based around the proposed architecture ready for stakeholders at ANDS and the IR and eResearch communities to comment now that people are back from their summer holidays.
Some of the things which will need to be resolved are:
- Which OAI-PMH provider to use?
Metadata will get from the VITAL repository to Research Data Australia via an OAI-PMH feed, but there are a few open source toolkits to choose from. We will need to support at least one, maybe more. As ANDS staffer Xiaobin Shen reminded me in the comments of my last post, one consideration will be whether or not the provider supports deletions; not all do. This will require careful testing before we commit to one provider or another.
- What data model will be used?
At the moment, all the VITAL repositories in Australia that I know about have a very simple data model. Each item in the repository has a ‘master’ metadata record usually in MARCXML, sometimes MODS (they’re effectively the same) with derived metadata in Dublin Core. There may also be datastreams, usually PDF files. At this stage there is no formal content model, so the datastreams could be called anything and it’s up to humans to make sense of them; there’s no guaranteed way to tell whether a PDF is an abstract, or the whole record, a preprint or a published version of a paper, for example.
It’s a bit hard to get information about VITAL unless you’re a customer; the product page is currently sporting a copyright statement from 2008 and the brochure (PDF) is big on tropical fish and short on specifications, so what I report here may need correcting.
The version of VITAL which I think most sites are running in Australia is 3.x. It uses Fedora 2, which doesn’t have formal content models. Fedora 3 has a formal mechanism for describing content models, which means that you could describe the parts of an object and their role in the object. From what I can gather VITAL 4, which I saw demonstrated in late 2008 and which was released in March 2009, has content models too, but they are more about how to display an object than describing the relations between its parts. Perhaps someone could elaborate or correct this in the comments?
My working assumption is that for this development, the idea will be to stick with the way VITAL 3.x works, without worrying about content modelling which is fine here because this is not about complex objects with lots of data, it’s about metadata about data collections where the collections themselves will usually reside elsewhere, which brings us to the sub-question.
What goes in the repository and what is stored elsewhere?
There’s a real chicken and egg problem here. I gather that eventually the NLA will be running a party-identifier service based on People Australia, so when that’s established we won’t be typing names into metadata forms any more, we’ll be linking to an ID. So in the abstract model behind RIF-CS the party management just goes elsewhere. I wonder if the same could happen with activities (every project has some kind of web site now, so why not point to an RDF endpoint hosted on the project website or just to the project web site as an identifier) and services (not sure what to do about these, but then I’m not really sure yet what services are in this context).
Question is, if we want to get a system running now, what’s the best way to identify parties in a future-proof way? I discussed a related issue in a blog post for CAIRSS about NicNames and People Australia. Maybe NicNames can play a role here in the short term?
One of the design patterns I mentioned in that post, using an index to associate names in a repository with an identity service like NicNames, is expanded in a paper I wrote for the New Review of Information Networking. At the moment I can only link to this version, which is not open, as the publisher has not responded to my questions about the OA policy, so I have yet to deposit a version in Eprints.
- Which metadata format to use?
Scott Yeadon made it clear in the comments to my last post that RIF-CS was designed as an interchange format only and that it is not yet stable, which sound like good reasons not to use it as a storage format. But I have confirmed reports that others in ANDS are thinking otherwise, and are encouraging IR managers to put RIF-CS in the repository; I’d like to hear their side of the story too. Stability aside, if RIF-CS has what it takes to describe a collection of data then it might be an OK storage format.
I’m not aware of all the alternatives, but one that I have heard mentioned by an ANDS person is the Dublin Core Collections Application Profile. What else is out there in use for describing data collections, and are there other data-collections registries harvesting from those descriptions in the rest of the world?
So this is an open issue for now; I hope we can get some consensus on a good data model for storing metadata about data collections (and the other entities).
- What configuration is needed in VITAL? I think we need the following:
And if the data collection resides in the repository, should there be a collection object and a collection-description object, or just one object with both collection and description?
- What kinds of APIs do we need?
VALET can be used to integrate with other systems via XML files which are deposited in a directory and picked-up by the ingest workflow of VALET, so they can be curated by data librarians, a technique which I think was developed by Simon McMillan at UNE in the RUBRIC days. We can certainly implement this, but should we have a web (or other) API for a system such as a grants database to add a new item as well? Should it be AtomPub or a simple post, or SWORD or something else?
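On the OAI-PMH question above, one thing a test harness could check is whether a provider actually marks deleted records. In OAI-PMH 2.0 a deleted record is signalled by status="deleted" on its header element. The sketch below only parses a response that has already been fetched; the sample response and the identifiers in it are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
OAI_NS = {"oai": OAI}

# Invented sample response; a real one would come from the provider under test.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2010-01-15</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:2</identifier>
      <datestamp>2010-01-20</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def deleted_identifiers(response_xml):
    """Return the identifiers of records the provider reports as deleted."""
    root = ET.fromstring(response_xml)
    deleted = []
    for header in root.iter("{%s}header" % OAI):
        if header.get("status") == "deleted":
            deleted.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return deleted


gone = deleted_identifiers(SAMPLE)
```

A provider that never emits a deleted header after an item is removed from the repository would fail this kind of check, which is the case Xiaobin Shen's comment warns about.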
Copyright Peter Sefton, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project and published to WordPress using The Fascinator.
by ptsefton at February 02, 2010 05:41 AM
In a campus IT meeting during a discussion about strategic planning one of the faculty members brought up the idea of a technology-free zone. Apparently he heard about some other college implementing such a thing. The committee decided to think about it and discuss it at a future meeting. I did think a campus-wide technology planning committee coming up with the idea of a technology-free zone a bit ironic. Anyway, I posted a brief tweet about this irony on Facebook, Twitter, and identi.ca and I got some good responses about why this might be a good or bad idea. Dan Scott did point out we better “[g]et the level of technology right for those zones; otherwise, no clothes.” With that warning, let’s look at technology-free zones when technology is defined to not include clothing.
After reading some comments (mostly on Facebook), I am thinking about this more. I did a quick Google search and when limited to .edu domains, it appears not many universities have such an area (I’m sure more than the few I found do, but they probably call it something different). I think if a campus is going to do this, a library makes a logical choice. Setting up such a zone shouldn’t cost too much money. Mostly some furniture: maybe with some comfy chairs and plants like the UW-Parkside Teaching and Learning Center? I think the bigger issues are 1) Space, and 2) Will they use it.
Space: I don’t know many libraries that have too much space. So, with limited space, is a technology-free zone a good use of space? That obviously would vary library-by-library and campus-by-campus.
Will they use it: On Facebook I mentioned that no one is forcing people to use technology and most libraries have quiet study areas. So, why make a technology-free zone? A former colleague mentioned that in most quiet areas there is still “residual noise” such as music from earbuds and keyboard chatter. So really, a technology-free zone does offer something that a quiet study does not. That still doesn’t mean people would use it though.
Personally, I think if a library has a space it would be worth trying, or at least worth surveying students to see if they were interested. What’s the worst that can happen? No one comes, so after a year you re-purpose the space as a quiet study or group study or anything else. However, I’m not sure it would be worth trying this if it meant eliminating other spaces (such as quiet-study) or services. I say this mostly because, with so much of the information libraries are providing requiring technology to access, it could cause issues. Does anyone work in a library that has or had a technology-free zone? I’d love to hear how it worked.
by ecorrado at February 02, 2010 12:58 AM
February 01, 2010
Happy Groundhog's Day Eve! Or something.
If you've got a link that belongs in a Trogool tidbits round up, drop me a comment or tag it "trogool" on del.icio.us. Thanks!
February 01, 2010 10:43 PM
Cindy Hepfer, hardworking ALA Voting Representative to NISO, has forwarded to us a group of announcements related to ISO/DIS 16175, Information and documentation–Principles and functional requirements for records in electronic office environments. This is a fast-track ballot, used to create an ISO standard from an existing standard, in this case the International Council on Archives and the Australasian Digital Recordkeeping Initiative standard of the same title. Fast-track standards are submitted for their first ballot at the enquiry (DIS) stage; if there are no negative votes, the standard can proceed directly to publication.
This ballot is in three parts:
Part 1: Overview and statement of principles
Part 2: Guidelines and functional requirements for records in electronic office environments
Part 3: Guidelines and functional requirements for records in business systems
As a reminder of the process: ALA is a voting member of NISO, while NISO is the official US voting member of other International Organization for Standardization (ISO) groups. On behalf of ALA, Cindy will be providing feedback to NISO as to whether ALA believes that NISO should approve or disapprove the standard. NISO staff will review and consider our feedback along with that received from numerous other voting members.
Because this is an ISO standard, access to the text for review is only available via Cindy (her email is: HSLcindy@buffalo.edu). Any ALA member who wishes to see a copy of the draft standard must explicitly state to Cindy that he/she is a current ALA member. (It helps me to provide activity information to LITA if you also copy me on your request at metadata.maven@gmail.com).
Deadline for comments to Cindy is Monday, May 17, 2010.
Diane I. Hillmann
LITA Standards Coordinator
by Diane Hillmann at February 01, 2010 10:30 PM
To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr’s distributed search feature, which allows us to split up an index into a number of separate indexes (called “shards”). The shards are searched in parallel and the results are then aggregated, so performance is better than with a single very large index.
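As a rough illustration of what a distributed query looks like from the client side: Solr's distributed search is driven by a shards request parameter listing the indexes to consult, and the node receiving the request merges the results. The host names, field name and query below are made up for the example and are not our production setup.

```python
from urllib.parse import urlencode


def distributed_query_url(coordinator, shards, query, rows=10):
    """Build a Solr select URL that fans the query out to several shards."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated list of shard addresses, without the scheme.
        "shards": ",".join(shards),
    }
    return "http://%s/select?%s" % (coordinator, urlencode(params))


url = distributed_query_url(
    "solr1.example.org:8983/solr",  # hypothetical node receiving the request
    ["solr1.example.org:8983/solr", "solr2.example.org:8983/solr"],
    "ocr:whale",  # hypothetical field and term
)
```

The client still talks to a single node; which shards take part is just a property of the request.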
Sizing the shards
by Tom Burton-West at February 01, 2010 08:56 PM
In the latest Perceptions survey, the most popular library management system is from a relatively new supplier to libraries and is available exclusively on a Software as a Service basis. The survey also reveals that interest in open source library management systems is weak outside the community of libraries that has already adopted one.
The Perceptions series of surveys is three years old now, and is part of Marshall Breeding’s armoury of library technology commentaries, the most well-used of which is Library Technology Guides. Meanwhile, Perceptions 2009: An international survey of library automation, like its predecessors, aims to ascertain levels of satisfaction within libraries with their library management system and suppliers thereof. Despite disruption in the library software arena, the library management system (LMS), or integrated library system (ILS) as it’s known to Marshall Breeding in the US, remains important:
The integrated library system (ILS) for most libraries represents the most critical component of its technology infrastructure and can do the most to help or hinder a library in fulfilling its mission to serve its patrons and in operating efficiently.
Interest may be waning in open source
One of Marshall’s central aims this year is to gauge interest in open source ILS products, which he describes as “one of the major issues brewing in the industry”.
A key overall finding was that companies supporting proprietary library management systems tend to receive higher satisfaction scores than companies involved with open source library management systems. Marshall notes explicitly that LibLime received particularly low marks in customer satisfaction, whilst libraries that undertook to implement Koha without external support were highly satisfied with this arrangement.
Respondents who had made use of other support firms such as PTFS, Nusoft and ByWater Solutions (it should be noted that support companies servicing open source products are still not prevalent in the UK) were not sufficiently numerous to be included in the report’s summary tables. Likewise, Talis only had 14 respondents and therefore does not figure in the main tables, although as a UK supplier, we are happy to be positioned in 10th place in terms of satisfaction with LMS in an international survey.
As Marshall told the audience at the SCONUL conference here in the UK in June 2009, there are low levels of interest registered in open source library management systems apart from the community of libraries already using one. Even those libraries that are dissatisfied with their current proprietary system fail to demonstrate interest in open source.
But Software as a Service is top of the pops
Biblionix, described by Marshall as a relatively new company, gained the top satisfaction scores in the following categories – ILS product, company, and support for its product, Apollo. This is interesting not just because it’s a relatively new entrant in the library software marketplace, but because the product is offered exclusively through Software as a Service. As Marshall comments:
The responses for Apollo were overwhelmingly positive, the only product to receive 9 as either the mode or median response. The comments offered gave effusive praise for the company, the product, the ease of migration and for support.
It should be noted that takeup of Apollo is currently limited to small public libraries in the US.
Although UK suppliers don’t feature strongly in this international survey, it remains an important source in terms of looking at the key trends in our world.
by Sarah Bartlett at February 01, 2010 05:59 PM
It’s doubtful that anybody reading this blog missed the news that Apple finally took the wraps off their much rumored tablet: the iPad. Trouble is, a bunch of folks seem to be upset about the features and specs, or something that made the buzz machine go meh. It’s just a bigger iPhone, complain the privileged tech pundits.
They apparently missed the recent Pew Internet Project report on internet usage by demographic. While it shows white users most frequently access the internet from home, black and hispanic users more frequently get online from mobile devices. Further, internet use by hispanics jumped dramatically in recent years, far exceeding the growth among whites.

The report further notes that while 83% of US adults have cell phones, only 60% use the internet from home. I’ve said it before: our notions of what a “computer” is have to change. The age of ubiquitous connectivity, Twitter, Facebook, and uncounted other tiny miracles has already changed the reasons we use technology and shown us the difference between what it’s for and what it does.
The Pew stats show our computers as historical artifacts of a different age, built for a different purpose. The iPad is built for the ubiquitous social internet. The iPad is built for everybody who enjoys mobile internet access and the remaining 40% of users who don’t have any, though I’m quite certain that experienced internet users will eventually fall in love with the device too. Remember, the then leading tech news site Slashdot panned the original iPod in 2001: “No wireless. Less space than a Nomad. Lame.”
You might have to check Wikipedia today to remember what the Nomad was, though the manufacturer once enjoyed 65% market share. The market for MP3 players in 2001 was just under $2 billion; by 2006 it had tripled to almost $6 billion. iPod sales continued to grow, much to the annoyance of iPod haters, until Apple released the iPhone and started cannibalizing their traditional iPod sales. Convergence had finally arrived.
Apple’s plan with the iPad is to dramatically expand the market for internet connected devices. Do you really want to bet against them?
by Casey Bisson at February 01, 2010 04:53 PM
This might only be a gotcha if you’re a lazy guy like me.
But in ordinary Solr, if you want to completely clear out your indexes, you can just delete the ‘data’ directory, no problem. It sounds weird, but several solr guru types told me I could do it, and it was certainly convenient to be able to do when my data (still in development) got all messed up and I just wanted to start over, and it did indeed work.
But when you’ve set up ‘multi core’ in Solr (which has nothing to do with CPUs, it’s Solr’s term for having multiple entirely separate solr indexes in one running solr process)… don’t try to just go and delete the ‘data’ directory in one of the cores. It messes everything up horribly and you have to repair/rebuild your cores.
So I guess in a solr multi-core world, if you want to delete all your solr data, you have to do it the normal way with a ‘delete’ operation.
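For reference, the ‘normal way’ is to post a delete-by-query message matching everything to the core’s update handler, then a commit. Here’s a rough sketch that only builds the two requests without sending them; the core name and URL are made up, so adjust for your own setup.

```python
import urllib.request


def delete_all_requests(core_update_url):
    """Build the POST requests that clear a core: delete everything, then commit."""
    bodies = ["<delete><query>*:*</query></delete>", "<commit/>"]
    return [
        urllib.request.Request(
            core_update_url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
            method="POST",
        )
        for body in bodies
    ]


# Hypothetical core called "core0" on a local Solr.
requests_to_send = delete_all_requests("http://localhost:8983/solr/core0/update")

# To actually run them against a live Solr:
# for req in requests_to_send:
#     urllib.request.urlopen(req)
```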

by jrochkind at February 01, 2010 04:12 PM
I recently received a query about the encoding of Dublin Core metadata in HTML5, the revision of the HTML language being developed jointly by the W3C HTML Working Group and the Web Hypertext Application Technology Working Group (WHATWG). It has also been the topic of some recent discussion on the dc-general mailing list. While I've been aware of some of the discussions around metadata features in HTML5, until now I haven't looked in much detail at the current drafts.
There are various versions of the specification(s), all drafts under development and still changing (at times, quite quickly):
- The WHATWG has a working draft titled HTML5 (including next generation additions still in development). This document is constantly changing; the content at the time I'm writing is dated 30 January 2010, but will no doubt have changed by the time you read this. Of this spec, the WHATWG says:
This draft is a superset of the HTML5 work that is being published at the W3C: everything that is in HTML5 is also in the WHATWG HTML spec. Some new experimental features are being added to the WHATWG HTML draft, to continue developing extensions to the language while the W3C work through their milestones for HTML5. In other words, the WHATWG HTML specification is the next generation of the language, while HTML5 is a more limited subset with a narrower scope.
- The W3C has a "latest public version" of HTML 5: A vocabulary and associated APIs for HTML and XHTML, currently the version dated 25 August 2009. (The content of that "date-stamped" version should continue to be available.)
- The W3C always has a "latest editor's draft" of that document, which at the time of writing is dated 30 January 2010, but also continues to change at frequent intervals. Note that, compared to the previous "latest public version", this draft incorporates some element of restructuring of the content, with some content separated out into "modules".
I can't emphasise too strongly that HTML5 is still a draft and liable to change; as the spec itself says in no uncertain terms: "Implementors should be aware that this specification is not stable. Implementors who are not taking part in the discussions are likely to find the specification changing out from under them in incompatible ways."
For the purposes of this discussion I've looked primarily at the third document above, the W3C latest editor's draft. This post is really an attempt to raise some initial questions (and probably to expose my own confusion) rather than to provide any definitive answers. It is based on my (incomplete and very probably imperfect) reading of the drafts as they stand at this point in time - and it represents a personal view only, not a DCMI view.
1. Dublin Core metadata in HTML4 and XHTML
(This section covers DCMI's current recommendations for embedding metadata in X/HTML, so feel free to skip it if you are already familiar with this.)
To date, DCMI's specifications for embedding metadata in X/HTML documents have concerned themselves with representing metadata "about" the document as a whole, "document metadata", if you like. And in HTML4/XHTML, the principal source of document metadata is the head element (HTML4, 7.4). Within the head element:
- the meta element (HTML4, 7.4.4.2) provides for the representation of "property name" (the value of the @name attribute)/"property value" (the value of the @content attribute) pairs which apply to the document. It also offers the ability to supplement the value with the name of a "scheme" (the value of the @scheme attribute) which is used "to interpret the property's value".
- the link element (HTML4, 12.3) provides a means of representing a relationship between the document and another resource. It also, in attributes like @hreflang and @title, supports the provision of some metadata "about" that second resource.
(I should note here that the above text uses the terminology of the HTML4 specification, not of RDF or the DCMI Abstract Model (DCAM).)
The current DCMI recommendation for embedding document metadata in X/HTML is Expressing Dublin Core metadata using HTML/XHTML meta and link elements, which from here on I'll just refer to as "DC-HTML". Although the current recommendation is dated 2008, that version is only a minor "modernisation" of conventions that DCMI has recommended since the late 1990s. The specification describes a convention for representing what the DCAM calls a description (of the document) - a set of RDF triples - using the HTML meta and link elements and their attributes (and conversely, for interpreting a sequence of HTML meta and link elements as a set of RDF triples/DCAM description set). Contrary to some misconceptions, the convention is not limited to the use of DCMI-owned "terms"; indeed it does not assume the use of any DCMI-owned terms at all.
Consider the example of the following two RDF triples:
@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2007-07-22"^^xsd:date ;
   ex:commentsOn <http://example.org/docs/123> .
Aside: from the perspective of the DCMI Abstract Model, these would be equivalent to the following description set, expressed using the DC-Text syntax, but for the rest of this post, to keep things simple, I'll just refer to the RDF triples.
@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2007-07-22"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
    Statement (
      PropertyURI ( ex:commentsOn )
      ValueURI ( <http://example.org/docs/123> )
    )
  )
)
Following the conventions of DC-HTML, those triples are represented in XHTML as:
Example 1: DC-HTML profile in XHTML
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
<title>Document 001</title>
<link rel="schema.DC"
href="http://purl.org/dc/terms/" />
<link rel="schema.EX"
href="http://example.org/terms/" />
<link rel="schema.XSD"
href="http://www.w3.org/2001/XMLSchema#" />
<meta name="DC.modified"
scheme="XSD.date" content="2007-07-22" />
<link rel="EX.commentsOn"
href="http://example.org/docs/123" />
</head>
</html>
A few points to highlight:
- The example is provided in XHTML but essentially the same syntax would be used in HTML4.
- The triple with literal object is represented using a meta element.
- The triple with the URI as object is represented using a link element
- The predicate (property URI) may be any URI; the DC-HTML convention is not limited to DCMI-owned URIs, i.e. DC-HTML seeks to support the sort of URI-based vocabulary extensibility provided by RDF. There is no "registry" of a bounded set of terms to be used in metadata represented using DC-HTML; or, rather, "the Web is the registry". All an implementer requires to introduce a new property is the authority to assign a URI in some URI space they own (or in which they have been delegated rights to assign URIs).
- A convention for representing property URIs and datatype URIs as "prefixed names" is used, and in this example three other link elements (with @rel="schema.{prefix}") are introduced to act as "namespace declarations" to support the convention. When a document using DC-HTML is processed, no RDF triples are generated for those link elements. (Aside: I have occasionally wondered whether this is abusing the rel attribute, which is intended to capture the type of relationship between the document and the target resource, i.e. it is using a mechanism which does carry semantics for an essentially syntactic end (the abbreviation of URIs). But I'll suspend those misgivings for now...)
The prefixes used in these "prefixed names" are arbitrary, and DC-HTML does not specify the use/interpretation of a fixed set of @name or @rel attribute values. In the example above, I chose to associate the "DC" prefix with the "namespace URI" http://purl.org/dc/terms/, though "traditionally" it has been more commonly associated with the "namespace URI" http://purl.org/dc/elements/1.1/. Another document creator might associate the same prefix with a quite different URI again.
The DC-HTML profile generates triples only for those meta and link elements where the values of the @name and @rel attributes contain a prefixed name with a prefix for which there is a corresponding "namespace declaration".
The datatype of the typed literal is represented by the value of the meta/@scheme attribute.
There is no support for RDF blank nodes.
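To make the points above concrete, here is a rough sketch of the interpretation in Python. It is my own simplification for illustration, not the DC-HTML GRDDL transform: it ignores the head/@profile check, treats each @rel value as a single token, and uses class and method names that are invented for the example.

```python
from html.parser import HTMLParser


class DCHTMLReader(HTMLParser):
    """Collect meta/link elements, then resolve prefixed names to URIs."""

    def __init__(self):
        super().__init__()
        self.namespaces = {}  # prefix -> namespace URI
        self.candidates = []  # (prefixed name, kind, value, scheme)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel = a.get("rel") or ""
        name = a.get("name") or ""
        if tag == "link" and rel.startswith("schema."):
            # "Namespace declaration": no triple is generated for it.
            self.namespaces[rel[len("schema."):]] = a.get("href") or ""
        elif tag == "link" and rel:
            self.candidates.append((rel, "uri", a.get("href") or "", None))
        elif tag == "meta" and name:
            self.candidates.append(
                (name, "literal", a.get("content") or "", a.get("scheme"))
            )

    def expand(self, prefixed_name):
        """Map PREFIX.local to a URI, or None if the prefix is undeclared."""
        prefix, dot, local = prefixed_name.partition(".")
        if dot and prefix in self.namespaces:
            return self.namespaces[prefix] + local
        return None

    def triples(self):
        """Return (predicate, kind, object, datatype); the subject is the document."""
        result = []
        for name, kind, value, scheme in self.candidates:
            predicate = self.expand(name)
            if predicate is None:
                continue  # no declared prefix, so no triple
            datatype = self.expand(scheme) if scheme else None
            result.append((predicate, kind, value, datatype))
        return result


HEAD = """<head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
<link rel="schema.DC" href="http://purl.org/dc/terms/" />
<link rel="schema.EX" href="http://example.org/terms/" />
<link rel="schema.XSD" href="http://www.w3.org/2001/XMLSchema#" />
<link rel="stylesheet" href="style.css" />
<meta name="DC.modified" scheme="XSD.date" content="2007-07-22" />
<meta name="ABC.title" content="No declaration for this prefix" />
<link rel="EX.commentsOn" href="http://example.org/docs/123" />
</head>"""

reader = DCHTMLReader()
reader.feed(HEAD)
triples = reader.triples()
```

Run over the head of Example 1 (plus a stylesheet link and an undeclared "ABC" prefix), this yields just the two triples from the start of this section.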
For the purposes of this discussion, perhaps the main point to make is that this use/interpretation of meta and link elements is specific to DC-HTML, not a general interpretation defined by the HTML4 specification. The mapping of prefixed names to URIs using link[@rel="schema.ABC"] "namespace declarations" is a DCMI convention, not part of X/HTML. And this is made possible through the use of a feature of HTML4 and XHTML called a "meta data profile": the document creator signals (by providing a specific URI as value of the head/@profile attribute) that they are applying the DC-HTML conventions and the presence of that attribute value licences a consumer to apply that interpretation of the data in a document. And, further, under that profile, as I noted for the example of the "DC" prefix, there is no single "meaning" assigned to meta/@name or link/@rel values.
In the XHTML case, the profile-dependent interpretation is made accessible in machine-processable form through the use of GRDDL, more specifically of a GRDDL profile transformation. i.e. a GRDDL processor uses the profile URI to access an XHTML "profile document" which provides a pointer to an XSLT transform which, when applied to an XHTML document using the DC-HTML profile, generates an RDF/XML document representing the appropriate RDF triples.
It may also be worth noting at this point that the profile attribute actually supports not just a single URI as value but a space-separated list of URIs i.e. within a single document, multiple profiles may be "applicable". And, potentially, those multiple profiles may specify different interpretations of a single @name or @rel value. I think the intent is that in that case all the interpretations should be applied - and in the case that multiple GRDDL profile transformations are provided, the output should be the result of merging the RDF graphs output from each individual transformation.
Now then, having laboured the point about the importance of the concept of the profile, I strongly suspect - though I don't have concrete evidence to support my suspicion - that it is not widely used by applications that provide and consume data using the other conventions described in the DC-HTML document.
It is certainly easy to find many providers of document metadata in X/HTML that follow some of the syntactic conventions of DC-HTML but do not include the @profile attribute. This is (currently, at least) the case even for many documents on DCMI's own site. And I suspect only a (small?) minority of applications consuming/processing DC metadata embedded in X/HTML documents do so by applying the DC-HTML GRDDL profile transform in this way. I suspect the majority of DC metadata embedded in X/HTML documents is processed without reference to the GRDDL transform, probably without using the @profile attribute value as a "trigger", possibly without generating RDF triples, and perhaps even without applying the "prefixed name"-to-URI mapping at all - i.e. these applications are "on level 1" in terms of the DC "interoperability levels" document. I suspect there are tools which use meta elements to generate simple property/(literal) value indexes, and do so on the basis of a fixed set of meta/@name values, i.e. they index on the basis that the expected values of the meta/@name attribute are "DC.title", "DC.date" (etc) and those tools would ignore values like "ABC.title", even if the "ABC" prefix was associated (via a link[@rel="schema.ABC"] "namespace declaration") with the URI http://purl.org/dc/elements/1.1/ (or http://purl.org/dc/terms/). But yes, I'm entering the realms of speculation here, and we really need some concrete evidence of how applications process such data.
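The kind of "level 1" processing I have in mind might look something like the following sketch. To be clear, this is a caricature written to illustrate the suspicion, not code taken from any real tool, and the fixed list of names is my own assumption.

```python
from html.parser import HTMLParser

# Hypothetical fixed list of meta/@name values the indexer recognises.
FIXED_NAMES = {"DC.title", "DC.date", "DC.creator"}


class FixedNameIndexer(HTMLParser):
    """Index meta/@content by meta/@name, ignoring profiles and prefixes."""

    def __init__(self):
        super().__init__()
        self.index = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") in FIXED_NAMES:
            self.index.setdefault(a["name"], []).append(a.get("content") or "")


indexer = FixedNameIndexer()
indexer.feed(
    '<link rel="schema.ABC" href="http://purl.org/dc/elements/1.1/">'
    '<meta name="DC.title" content="Document 001">'
    '<meta name="ABC.title" content="Also a title, but never indexed">'
)
index = indexer.index
```

Here the "ABC.title" value is silently dropped even though, under the DC-HTML profile, it names the same property.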
2. RDFa in XHTML and HTML4
Since that DCMI document was published, the W3C has published the RDFa in XHTML specification, RDFa in XHTML: Syntax and Processing, as a W3C Recommendation. RDFa provides a syntax for embedding RDF triples in an XHTML document using attributes (a combination of pre-existing XHTML attributes and additional RDFa-specific attributes). Unlike the conventions defined by DC-HTML, RDFa supports the representation of any RDF triple, not only triples "about" the document (i.e. with the document URI as subject), and RDFa attributes can be used anywhere in an XHTML document.
Any "document metadata" that could be encoded using the DC-HTML profile could also be represented using RDFa. DCMI has not yet published any guidance on the use of RDFa - not because it doesn't consider RDFa important, I hasten to add, but only because of a lack of available effort. Having said that, (it seems to me) it isn't an area where DCMI would need a new "recommendation", but it may be useful to have some primer-/tutorial-style materials and examples highlighting the use of common constructs used in Dublin Core metadata.
I don't intend to provide a full summary of RDFa, but it is worth noting that, at the syntax level, RDFa introduces the use of a datatype called CURIE which supports the abbreviation of URI references as prefixed names. In XHTML, at least, the prefixes are associated with URIs via XML Namespace declarations. The use of CURIEs in RDFa might be seen as a more generalised, standardised approach to the problem that DC-HTML seeks to address through its own "prefixed name"/"namespace declaration" convention.
It is perhaps worth highlighting one aspect of the RDFa in XHTML processing model here. In RDFa the XHTML link/@rel attribute is treated as supporting both XHTML link types and CURIEs, and any value that matches an entry in the list of link types in the section The rel attribute, MUST be treated as if it was a URI within the XHTML vocabulary, and all other values must be CURIEs. So, the XHTML link types are treated as "reserved keywords", if you like, and a @rel attribute value of "next" is mapped to an RDF predicate of http://www.w3.org/1999/xhtml/vocab#next. For the case of XHTML, those "reserved keywords" are defined as part of the XHTML+RDFa document. They are also listed in the "namespace document" http://www.w3.org/1999/xhtml/vocab, which itself is an XHTML+RDFa document (though, N.B., there are other terms "in that namespace" which are not intended for use as link/@rel values). For a @rel value that is neither a member of the list nor a valid CURIE (e.g. rel="foobar" or rel="DC.modified" or rel="schema.DC"), no RDF triple is generated by an RDFa processor. As a consequence, RDFa "co-exists" well with the DC-HTML profile, by which I mean that an RDFa processor should generate no unanticipated triples from DC-HTML data in an XHTML+RDFa document.
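The three-way treatment of @rel values described above can be sketched as a small Python function. This is a simplification rather than a conforming RDFa processor: `RESERVED` is only an illustrative subset of the XHTML link types, and `rel_to_predicate` is a hypothetical name.

```python
XHV = "http://www.w3.org/1999/xhtml/vocab#"

# Illustrative subset of the reserved XHTML link types; the full list is in the spec.
RESERVED = {"alternate", "next", "prev", "stylesheet"}


def rel_to_predicate(value, namespaces):
    """Return the predicate URI for one @rel token, or None if no triple results."""
    if value in RESERVED:
        # Reserved keyword: treated as a term in the XHTML vocabulary.
        return XHV + value
    prefix, sep, reference = value.partition(":")
    if sep and prefix in namespaces:
        # CURIE with a declared prefix: expand it to a full URI.
        return namespaces[prefix] + reference
    # Neither reserved nor a usable CURIE: no triple is generated.
    return None


ns = {"ex": "http://example.org/terms/"}
```

Here `rel_to_predicate("next", ns)` and `rel_to_predicate("ex:commentsOn", ns)` both yield predicate URIs, while DC-HTML style values such as "DC.modified" or "schema.DC" yield `None`.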
Using RDFa in XHTML, then, the two example triples above could be represented as follows:
Example 2: RDFa in XHTML
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:ex="http://example.org/terms/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
version="XHTML+RDFa 1.0">
<head>
<title>Document 001</title>
<meta property="dc:modified"
datatype="xsd:date" content="2007-07-22" />
<link rel="ex:commentsOn"
href="http://example.org/docs/123" />
</head>
</html>
And of course document metadata could be embedded elsewhere in the XHTML+RDFa document, e.g. instead of using the meta and link elements, the data above could be represented in the body of the document:
Example 3: RDFa in XHTML (2)
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:ex="http://example.org/terms/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
version="XHTML+RDFa 1.0">
<head>
<title>Document 001</title>
</head>
<body>
<p>
Last modified on:
<span property="dc:modified"
datatype="xsd:date" content="2007-07-22">22 July 2007</span>
</p>
<p>
Comments on:
<a rel="ex:commentsOn"
href="http://example.org/docs/123">Document 123</a>
</p>
</body>
</html>
These examples do not make use of a head/@profile attribute. According to Appendix C of the RDFa in XHTML specification, the use of @profile is optional: a @profile value of http://www.w3.org/1999/xhtml/vocab may be included to support a GRDDL-based transform, but it is not required by an RDFa processor. (Having said that, looking at the profile document http://www.w3.org/1999/xhtml/vocab, I can't see a reference to a GRDDL profile transformation in that document.)
The initial RDFa in XHTML specification covered the case of XHTML only. But RDFa is intended as an approach to be used with other markup languages too, and recently a working draft HTML+RDFa has been published. Again, this is a draft which is liable to change. This document describes how RDFa can be used in HTML5 (in both the XML and non-XML syntax), but the rules are also intended to be applicable to HTML4 documents interpreted through the HTML5 parsing rules. For the most part, it provides a set of minor changes to the syntax and processing rules specified in the RDFa in XHTML document.
I think (but I'm not sure!) the above example in HTML4 would look like the following, the only differences (for this example) being the change in the empty element syntax and the use of a different DTD for validation:
Example 4: RDFa in HTML4
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/html401-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:ex="http://example.org/terms/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
version="HTML+RDFa 1.0">
<head>
<title>Document 001</title>
<meta property="dc:modified"
datatype="xsd:date" content="2007-07-22" >
<link rel="ex:commentsOn"
href="http://example.org/docs/123" >
</head>
</html>
3. HTML5
The document HTML5 differences from HTML4 offers a summary of the principal differences between HTML4 and HTML5. One general point to note here is that HTML5 is defined as an "abstract language" - it is defined in terms of the HTML Document Object Model - which can be serialised in a format which is compatible with HTML4 and also in an XML format. The "differences" document has little to say on issues specifically related to "document metadata", but it does highlight the removal from the language of some elements and attributes, a topic I'll return to below.
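As I mentioned above, the current editor's draft version of HTML5 separates some content out into modules. In the current drafts, three items would seem to be of interest when considering conventions for representing metadata "about" a document:
- the "Document metadata" section of the "core" HTML5 specification
- the microdata specification
- the draft HTML+RDFa specification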
I'll discuss each of these sources in turn (though I think there is some interdependency in the first two).
3.1. Document Metadata in HTML5
The "Document metadata" section defines the meta and link elements in HTML5. In terms of evaluating how the DC-HTML conventions might be used within HTML5, the following points seem significant:
- For the @name attribute of the meta element, the spec defines some values, and it provides for a wiki-based registry of other values (HTML5ED, 4.2.5.2).
- The @scheme attribute of the meta element has been made obsolete and "must not be used by authors".
- In the property/value pairs represented by meta elements, the value must not be a URI.
- For the @rel attribute of the link element, the spec defines some values (strictly speaking, tokens that can occur in the attribute value), and it provides for a wiki-based registry of other values (HTML5ED, 5.12.3.19).
- The @profile attribute of the head element has been made obsolete and "must not be used by authors".
On the validation of values for the meta/@name attribute, HTML5 says:
Conformance checkers must use the information given on the WHATWG Wiki MetaExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).
When an author uses a new metadata name not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.
So I think this means that, in order to pass this conformance check as valid, all values of the meta/@name attribute must be registered. The registry currently contains an entry (with status "proposed") for all names beginning "DC.", though I'm not sure whether the registration process is really intended to support such "wildcard" entries. The entry does not indicate whether the intent is that the names correspond to properties of the Dublin Core Metadata Element Set (i.e. with URIs beginning http://purl.org/dc/elements/1.1/) or of the DC Terms collection (i.e. with URIs beginning http://purl.org/dc/terms/). Further, as noted above, the current DC-HTML spec does not prescribe a bounded set of @name values; rather it allows for an open-ended set of prefixed name values, not just names referring to the "terms" owned by DCMI. In HTML5, the expectation seems to be that all such values should be registered. So, for example, when DCMI worked with the Library of Congress to make available a set of RDF properties corresponding to the MARC Relator Codes, identified by LoC-owned URIs, I think the expectation would be that, for data using those properties to be encoded, a corresponding set of @name values would need to be registered. Similarly if an implementer group coins a new URI for a property they require, a new @name value would be required.
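As I read it, the conformance rule quoted above boils down to a simple lookup, which can be sketched in Python. The data here is made up for illustration: `SPEC_DEFINED` is only a subset of the names defined in the spec, the `REGISTRY` entries are hypothetical rather than real wiki content, and `name_is_valid` is a hypothetical helper.

```python
# Illustrative subset of the names defined in the spec itself.
SPEC_DEFINED = {"description", "generator"}

# Hypothetical snapshot of the wiki registry: name -> status.
REGISTRY = {
    "x-proposed-name": "proposed",
    "x-old-name": "discontinued",
}


def name_is_valid(name, spec_defined=SPEC_DEFINED, registry=REGISTRY):
    """Accept spec-defined names and registry names marked proposed or ratified."""
    if name in spec_defined:
        return True
    # Unlisted names give None here, so they are rejected like discontinued ones.
    return registry.get(name) in {"proposed", "ratified"}
```

Under this reading, an unregistered prefixed name such as "ABC.title" would be rejected as invalid, whatever "namespace declaration" accompanies it.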
If the registration process for HTML5 turns out to be relatively "permissive" (which the text above suggests it may be), it may be that this is not an issue, but it does seem to create a new requirement not present in HTML4/XHTML. However, I note that the registration page currently includes a note that suggests a "high bar" for terms to be "Accepted": For the "Status" section to be changed to "Accepted", the proposed keyword must be defined by a W3C specification in the Candidate Recommendation or Recommendation state. If it fails to go through this process, it is "Unendorsed".
Having said that, the microdata specification refers to the possibility that @name values are URIs, and I think that the implication is that such URI values are exempt from the registration requirement (though this does not seem clear from the discussion of registration in the "core" HTML5 spec).
The meta/@scheme attribute, used in DC-HTML to represent datatype URIs for typed literals, is no longer permitted in HTML5. Section 10.2, which offers suggestions for alternatives for some of the features that have been made obsolete, suggests "Use only one scheme per field, or make the scheme declaration part of the value", which I think is suggesting either using a different meta/@name value for each potential scheme value (e.g. "date-W3CDTF", "date-someOtherDateFormat") or using some sort of structured string for the @content value with the scheme name embedded (e.g. "2007-07-22|http://purl.org/dc/terms/W3CDTF").
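The two workarounds guessed at above could look something like the following in Python. Neither convention is defined by HTML5 or by DCMI; the "-" and "|" separators and the function names are assumptions made purely for illustration.

```python
def name_with_scheme(name, scheme):
    """Option 1: fold the scheme into the meta/@name value."""
    return f"{name}-{scheme}"


def split_scheme(content, sep="|"):
    """Option 2: split a 'value|scheme' @content string into (value, scheme).

    The scheme is None when the separator is absent.
    """
    value, found, scheme = content.partition(sep)
    return (value, scheme if found else None)
```

Either way, a consuming application has to know the convention in advance, which is exactly the information the scheme attribute used to carry explicitly.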
The section on the registration of meta/@name attribute values includes the paragraph:
Metadata names whose values are to be URLs must not be proposed or accepted. Links must be represented using the link element, not the meta element
This constraint appears to rule out the use of meta/@name to represent the property in cases where (in RDF terms) the object is a literal URI. (This is different from the case where the object is an RDF URI reference, which in DC-HTML is covered by the use of the link element.) For example, the DCMI dc:identifier and dcterms:identifier properties may be used in this way to provide a URI which identifies the document - that may be the document URI itself, or it may be another URI which identifies the same document.
A similar issue to that above for the registration of meta/@name attribute values arises for the case of the link/@rel attribute, for which HTML5 says:
Conformance checkers must use the information given on the WHATWG Wiki RelExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted when used on the elements for which they apply as described in the "Effect on..." field, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).
When an author uses a new type not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.
AFAICT, the registry currently contains no entries related specifically to DC-HTML or the DCMI vocabularies.
As for the case of name, the microdata specification refers to the possibility that @rel values are URIs, and again I think the implication is that such URI values are exempt from the registration requirement (though, again, this does not seem clear from the discussion in the "core" HTML5 spec).
Finally, the head/@profile attribute is no longer available in HTML5, and Section 10.2 says:
When used for declaring which meta terms are used in the document, unnecessary; omit it altogether, and register the names.
When used for triggering specific user agent behaviors: use a link element instead.
I think DC-HTML's use of head/@profile places it into the second of these categories: the profile doesn't "declare" a bounded set of terms, but it specifies how a (potentially "open-ended") set of attribute values are to be interpreted/processed.
Furthermore, the draft HTML+RDFa document proposes the (optional) use of a link/@rel value of "profile", and there is a corresponding entry in the registry for @rel values. This seems to be a mechanism for (re-)introducing the HTML4 concept of the meta data profile, using a different syntactic form i.e. using link/@rel in place of the profile attribute. I'm not clear about the extent to which this has support within the wider HTML5 community. If it was adopted, I imagine the GRDDL specification would also evolve to use this mechanism, but that is guesswork on my part.
Julian Reschke summarises most of these issues related to DC-HTML in a recent message to the public-html mailing list here.
3.2. Microdata
Microdata is a new specification, specific to HTML5. The "latest editors draft" version is described as "a module that forms part of the HTML5 series of specifications published at the W3C". The content was previously a part of the "core" HTML5 specification, but the decision was taken recently to separate it from the main spec.
Microdata offers similar functionality to that offered by RDFa in that it allows for the embedding of data anywhere in an HTML5 document. Like RDFa, microdata is a generalised mechanism, not one tied to any particular set of terms, and also like RDFa, microdata introduces a new set of attributes, to be used in combination with existing HTML5 attributes. The syntactic conventions used in microdata are inspired principally by the conventions used in various microformats.
As for the case of RDFa, my purpose here is not to provide a full description of microdata, but to examine whether and how microdata can express the data that in HTML4/XHTML is expressed using the conventions of the DC-HTML profile.
The model underlying microdata is one of nested lists of name-value pairs:
The microdata model consists of groups of name-value pairs known as items.
Each group is known as an item. Each item can have an item type, a global identifier (if the item type supports global identifiers for its items), and a list of name-value pairs. Each name in the name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item).
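The model quoted above can be pictured as a small recursive data structure. The following Python sketch is only an illustration of that description, not anything defined by the microdata spec; the `Item` class and the `reviewer` and `name` properties are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Item:
    """One microdata item: optional type and global id, plus name -> values."""
    item_type: Optional[str] = None
    item_id: Optional[str] = None
    properties: Dict[str, List[Union[str, "Item"]]] = field(default_factory=dict)

    def add(self, name, value):
        """Append a value (a string or a nested Item) to a property."""
        self.properties.setdefault(name, []).append(value)


doc = Item(item_id="http://example.org/doc/001")
doc.add("http://purl.org/dc/terms/modified", "2007-07-22")

# A value may itself be an item (hypothetical property names).
reviewer = Item()
reviewer.add("name", "A. Reviewer")
doc.add("http://example.org/terms/reviewer", reviewer)
```

Note that nothing in the structure records a datatype for a string value, and property names may or may not be URIs.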
The microdata model is independent of the RDF model, and is not designed to represent the full RDF model. In particular, microdata does not require the use of URIs as identifiers for properties, though it does allow for the use of URIs. Microdata does not offer - as many RDF syntaxes, including RDFa, do - a mechanism for abbreviating property URIs. But the microdata spec does include an algorithm that provides a (partial, I think?) mapping from microdata to a set of RDF triples.
Probably the main feature of RDF that has no correspondence in microdata is literal datatyping - see the discussion by Jeni Tennison here - though there is a distinct element/attribute for date-time values.
Given this constraint, I don't think it is possible to express the first triple of my example above using microdata. If the typed literal is replaced with a plain literal (i.e. the object is "2007-07-22", rather than "2007-07-22"^^xsd:date), then I think the two triples could be encoded (using the XML serialisation) as follows, i.e. the property URIs appear in full as attribute values:
Example 5: Microdata in HTML5
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Document 001</title>
<meta name="http://purl.org/dc/terms/modified"
content="2007-07-22" />
<link rel="http://example.org/terms/commentsOn"
href="http://example.org/docs/123" />
</head>
</html>
As for the case of RDFa, microdata supports the embedding of data in the body of the document, so the triples could (I think!) also be represented as:
Example 6: Microdata in HTML5 (2)
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Document 001</title>
</head>
<body>
<div itemscope="" itemid="http://example.org/doc/001">
<p>
Last modified on:
<time itemprop="http://purl.org/dc/terms/modified"
datetime="2007-07-22">22 July 2007</time>
</p>
<p>
Comments on:
<a rel="http://example.org/terms/commentsOn"
href="http://example.org/docs/123">Document 123</a>
</p>
</div>
</body>
</html>
My understanding is that the itemid attribute is necessary to set the subject of the triple to the URI of the document, but I could be wrong on this point.
Also I think it's worth noting that the microdata-to-RDF algorithm specifies an RDF interpretation for some "core" HTML5 elements and attributes. For example:
- the head/title element is mapped to a triple with the predicate http://purl.org/dc/terms/title and the element content as literal object
- meta elements with name and content attributes are mapped to triples where the predicate is either (if the name attribute value is a URI) the value of the name attribute, or the concatenation of the string "http://www.w3.org/1999/xhtml/vocab#" and the value of the name attribute. So, e.g., a name attribute value of "DC.modified" would generate a predicate http://www.w3.org/1999/xhtml/vocab#DC.modified.
- A similar rule applies for the link element. So, e.g., a rel attribute value of "EX.commentsOn" would generate a predicate http://www.w3.org/1999/xhtml/vocab#EX.commentsOn and a rel attribute value of "schema.DC" would generate a predicate http://www.w3.org/1999/xhtml/vocab#schema.DC
As far as I can tell, these are rules to be applied to any HTML5 document - there is no "flag" to say that they apply to document A but not to document B - so would need to be taken into consideration in any DCMI convention for using meta/@name and link/@rel attributes in HTML5. For example, given the following HTML5 document (and leaving aside for a moment the registration issue, and assuming that "EX.commentsOn", "schema.DC" and "schema.EX" are registered values for @name and @rel)
Example 7: Microdata in HTML5 (3)
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Document 001</title>
<link rel="schema.DC"
href="http://purl.org/dc/terms/" />
<link rel="schema.EX"
href="http://example.org/terms/" />
<meta name="DC.modified"
content="2007-07-22" />
<link rel="EX.commentsOn"
href="http://example.org/docs/123" />
</head>
</html>
the microdata-to-RDF algorithm would generate the following five RDF triples:
@prefix dc: <http://purl.org/dc/terms/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
<> dc:title "Document 001" ;
xhv:schema.DC <http://purl.org/dc/terms/> ;
xhv:schema.EX <http://example.org/terms/> ;
xhv:DC.modified "2007-07-22" ;
xhv:EX.commentsOn <http://example.org/docs/123> .
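As a rough check on that reading, the head-element rules can be sketched in Python. This is an approximation of the rules as summarised in this post, not the algorithm from the microdata draft: the URI test via `urlparse` is a simplification, and the names `predicate_for`, `HeadTriples` and `head_triples` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

XHV = "http://www.w3.org/1999/xhtml/vocab#"
DC_TITLE = "http://purl.org/dc/terms/title"


def predicate_for(token):
    """Use URI tokens as-is; append anything else to the XHTML vocab URI."""
    return token if urlparse(token).scheme else XHV + token


class HeadTriples(HTMLParser):
    """Collect (predicate, object, object_is_uri) for title, meta and link."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._in_title = False
        self._title = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") and a.get("content") is not None:
            self.triples.append((predicate_for(a["name"]), a["content"], False))
        elif tag == "link" and a.get("rel") and a.get("href"):
            for token in a["rel"].split():
                self.triples.append((predicate_for(token), a["href"], True))

    def handle_data(self, data):
        if self._in_title:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.triples.append((DC_TITLE, "".join(self._title), False))


def head_triples(html):
    parser = HeadTriples()
    parser.feed(html)
    parser.close()
    return parser.triples


example_head = """<head>
<title>Document 001</title>
<link rel="schema.DC" href="http://purl.org/dc/terms/" />
<link rel="schema.EX" href="http://example.org/terms/" />
<meta name="DC.modified" content="2007-07-22" />
<link rel="EX.commentsOn" href="http://example.org/docs/123" />
</head>"""

triples = head_triples(example_head)
```

Run over the head of the example document, this yields five predicate/object pairs, with the "schema.DC" and "schema.EX" declarations turned into triples like any other link.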
It's probably worth emphasising that although the URIs generated here are not DCMI-owned URIs, it would be quite possible to assert an "equivalence" between a property with a URI beginning http://www.w3.org/1999/xhtml/vocab# and a corresponding DCMI-owned URI, which would imply a second triple using that DCMI-owned URI (e.g. <http://www.w3.org/1999/xhtml/vocab#DC.modified> owl:equivalentProperty <http://purl.org/dc/terms/modified>) - though, AFAICT, no such equivalence is suggested at the moment.
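The effect of such an equivalence can be sketched in a few lines of Python. This is not an OWL reasoner: the `EQUIVALENT` table is hypothetical (no such assertion is published, as far as I can tell), the sketch applies it in one direction only, and `with_equivalents` is an invented name.

```python
XHV_MODIFIED = "http://www.w3.org/1999/xhtml/vocab#DC.modified"
DC_MODIFIED = "http://purl.org/dc/terms/modified"

# Hypothetical equivalence table: generated predicate -> DCMI-owned predicate.
EQUIVALENT = {XHV_MODIFIED: DC_MODIFIED}


def with_equivalents(triples, equivalent=EQUIVALENT):
    """Add a copy of each triple whose predicate has a listed equivalent."""
    inferred = {(s, equivalent[p], o) for (s, p, o) in triples if p in equivalent}
    return set(triples) | inferred


doc = "http://example.org/doc/001"
result = with_equivalents({(doc, XHV_MODIFIED, "2007-07-22")})
```

The result contains both the generated triple and a second one using the DCMI-owned property URI.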
3.3. RDFa in HTML5
I noted above that, although the initial RDFa syntax specification had focused on the case of XHTML, a recent draft sought to extend this by describing the use of RDFa in HTML, including the case of HTML5.
As I already discussed, using RDFa, it is quite possible to represent any data that could be represented in HTML4/XHTML using the DC-HTML profile. So, using RDFa in HTML5, my two example triples could be represented (using the XML serialisation) as:
Example 8: RDFa in HTML5
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:ex="http://example.org/terms/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
version="XHTML+RDFa 1.0">
<head>
<title>Document 001</title>
<meta property="dc:modified"
datatype="xsd:date" content="2007-07-22" />
<link rel="ex:commentsOn"
href="http://example.org/docs/123" />
</head>
</html>
Note that the RDFa in HTML draft requires the use of the html/@version attribute which, in the current draft, is obsoleted by the "core" HTML5 specification. As noted above, RDFa in HTML also proposes the (optional) use of a link/@rel value of "profile".
In the initial discussion of RDFa above, I noted the existence of a list of "reserved keyword" values for the link/@rel attribute in XHTML, to which an RDFa processor prefixes the URI to generate RDF predicate URIs. In HTML5, that list of reserved values is defined not by the HTML5 specification but by the WHATWG registry of @rel values. So there may be cases where a value used in a link/@rel attribute in an HTML4/XHTML document does not result in an RDFa processor generating a triple (because that value is not included in the HTML4/XHTML reserved list), but the same value in a link/@rel attribute in an HTML5 document does cause an RDFa processor to generate a triple (because that value is included in the HTML5 @rel registry). My understanding is that the RDFa/CURIE processing model is designed to cope with such host-language-specific variations, but it is something document creators will need to be aware of.
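A small Python sketch may make the host-language dependence concrete. The two lists below are illustrative subsets rather than the real reserved list or registry contents, and `reserved_predicate` is a hypothetical helper covering only non-CURIE tokens.

```python
XHV = "http://www.w3.org/1999/xhtml/vocab#"

# Illustrative subsets only, not the real lists.
RESERVED_BY_HOST = {
    "xhtml": {"next", "prev", "stylesheet"},
    "html5": {"next", "prev", "stylesheet", "profile"},
}


def reserved_predicate(rel, host):
    """Return the predicate for a non-CURIE @rel token, or None if not reserved."""
    return XHV + rel if rel in RESERVED_BY_HOST[host] else None
```

Under these assumed lists, `reserved_predicate("profile", "xhtml")` gives `None` (no triple), while `reserved_predicate("profile", "html5")` gives a predicate URI, so the same markup produces different triples depending on the host language.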
4. Some concluding thoughts
DCMI's specifications for embedding metadata in X/HTML have focused on "document metadata", data describing the document. The current DCMI Recommendation for encoding DC metadata in HTML was created in 2008, and is based on the DCMI Abstract Model and on RDF. The syntactic conventions are largely those developed by DCMI in the late 1990s. The current document was developed with reference to HTML4 and XHTML only, and it does not take into consideration the changes introduced by HTML5. The conventions described are not limited to the use of a fixed set of DCMI-owned properties, but support the representation of document metadata using any RDF properties.
Looking at the HTML5 drafts raises various issues:
- HTML5 removes some of the syntactic components used by the DC-HTML profile in HTML4/XHTML, namely the scheme and profile attributes.
- HTML5 introduces a requirement for the registration of meta/@name and link/@rel attribute values; the current DC-HTML specification makes the assumption that an "open-ended" set of values is available for those attributes.
- The status of the concept of the meta data profile in HTML5 seems unclear. On the one hand, the profile attribute has been removed, but the proposed registration of a link/@rel value suggests that the profile approach is still available in HTML5.
- The microdata specification provides "global" RDF interpretations for meta and link elements in HTML5.
- Microdata offers a mechanism for embedding data in HTML5 documents, and that mechanism can be used for embedding some RDF data in HTML5 documents. Microdata has some limitations (the absence of support for typed literals), but microdata could be used to express a large subset of the data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The microdata specification is still a working draft and liable to change.
- RDFa also offers a mechanism for embedding RDF data in HTML5 documents. RDFa is designed to support the RDF model, and RDFa could be used in HTML5 to express the same data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The specification for using RDFa in HTML5 is still a working draft and liable to change.
There seems to be a possible tension between HTML5's requirement for the registration of meta/@name and link/@rel values and the assumption in DC-HTML that an "open-ended" set of values is available. Also, the microdata specification's mapping by simple concatenation of registered meta/@name and link/@rel values to URIs differs from DC-HTML's use of a prefix-to-URI mapping. However, as I suggested above, it seems quite probable that at least some applications using Dublin Core metadata in HTML do indeed operate on the basis of a small set of invariant meta/@name and link/@rel values corresponding to (a subset of the) DCMI-owned properties, i.e. they use a subset of the conventions of DC-HTML to represent document metadata using only a limited set of DCMI-owned properties - to represent data conforming to what DCMI would call a single "description set profile". With the addition of assertions of equivalence between properties (see above), then it would be possible to represent data conforming to the version of "Simple Dublin Core" that I described a little while ago - i.e. using only the properties of the Dublin Core Metadata Element Set with plain literal values - using HTML5 meta/@name (with a registered set of 15 values) and meta/@content attributes.
Both microdata and RDFa, on the other hand, are extensions to HTML5 that are designed to provide generalised, "vocabulary-agnostic" conventions for embedding data in HTML5 documents. Using both microdata and RDFa, data may include, but is not limited to, "document metadata". RDFa is designed to represent RDF data; and microdata can be used to represent some RDF data (it lacks support for typed literals). RDFa includes abbreviation mechanisms for URIs that are broadly similar to those used in the DC-HTML profile (in the sense that they both use a sort of "prefixed name" to abbreviate URIs); microdata does not provide such a mechanism and (I think) would require the use of URIs in full.
Both microdata and RDFa address the problem that DCMI seeks to address via the DC-HTML profile, in the context of a more generalised mechanism for embedding data in HTML5 documents. Both microdata and RDFa could be used in HTML5 to represent document metadata that in HTML4/XHTML is represented using the DC-HTML profile (partially, for the case of microdata, because of the absence of datatyping support). Currently, the documents describing RDFa in HTML5 and microdata are both still in draft and the subjects of vigorous debate, and it remains to be seen how/whether they progress through the W3C process, and how implementers respond. But it seems to me the use of either would offer an opportunity for DCMI to move away from the maintenance of a DCMI-defined convention and to align with more generalised approaches.
by PeteJ at February 01, 2010 03:51 PM
In this Nodalities Podcast, I talk with blogger and Guardian information architect Martin Belam. I’ve run into Martin at a few Linked Data events where the news and media industries have had a high profile (including the recent News Media Summit, and News Innovation conference last year). Martin has an interest in Linked Data, and an interesting perspective on where it fits in with News, both as a tool for journalism and research and as a resource for the industry.
Also mentioned:
Guardian Open Platform
by richard.wallis@talis.com at February 01, 2010 02:00 PM
Shelf Browse—which we announced last week—is now live in High Plains Library District's catalog. As we mentioned in our brief ALA announcement, Shelf Browse lets you browse your library's shelves visually, just as you would do in the physical library.
Shelf Browse lets your patrons see where a book sits on your actual shelves, and what's near it. It includes a "mini-browser" that sits on your detail pages, and a full-screen version, launched from the detail page.
See it in action at High Plains Library District. Some jumping off points:
Scroll back and forth, serendipitously browsing through the shelves. If lists are more your speed, in the full-screen version, you can switch between shelf and list mode.
For ordering information contact Peder Christensen at Bowker—toll-free at 877-340-2400 or email Peder.Christensen@bowker.com.
by Abby (noreply@blogger.com) at February 01, 2010 11:03 AM
Today, ByWater Solutions & Nelsonville Public Library (NPL) have announced that NPL will be using ByWater for support for their Koha installation. Now that’s what I’m talking about. You want to stay silent and let your actions speak for you – awesome!!
I couldn’t be happier for the folks at NPL who are awesome people who are dedicated to staying open!
by Nicole at February 01, 2010 07:08 AM
January 31, 2010
WordPress has some simple built-in support for posting by email, but that didn’t stop a couple people from developing plugins that might do better. Postie and PostMaster both claim to support attached photos (though neither appears to use WP’s built-in media management). But if your goal is to post photos, you might consider posting through Flickr.
by Casey Bisson at January 31, 2010 11:45 PM
I took a webinar on Zotero taught by Jason Puckett earlier this month and since then I have been playing a bit more with it. I installed the beta release of version 2.0 which includes the ability to share libraries and store data online. This means that you can see my public library by visiting my page on Zotero. It also means that I can join groups and share resources with those who have similar interests to me.
If I have one complaint – it’s that it’s not easy to find my friends and colleagues who have shared their resources on Zotero.org. I’d like a find friends connection to Twitter or Facebook or something that allows me to find people with ease (like many other websites these days). I also find that many groups allow you to join and view resources – but not add resources – which seems silly for a group.
To learn more about Zotero, check out Jason’s guide. Also, if you want to join a group or two – check out the Koha Group and the Open Source for Libraries Group.
[update] If you manage a group you can allow members to add items by going to Manage Group > Library and then changing the permissions. [/update]
by Nicole at January 31, 2010 03:42 PM
In July last year I noted that the terminology around Linked Data was not necessarily as clear as we might wish it to be. Via Twitter yesterday, I was reminded that my colleague, Mike Ellis, has a very nice presentation, Don't think websites, think data, in which he introduces the term MRD - Machine Readable Data.
It's worth a quick look if you have time:
We also used the 'machine-readable' phrase in the original DNER Technical Architecture, the work that went on to underpin the JISC Information Environment, though I think we went on to use both 'machine-understandable' and 'machine-processable' in later work (both even more of a mouthful), usually with reference to what we loosely called 'metadata'. We also used 'm2m - machine to machine' a lot, a phrase introduced by Lorcan Dempsey I think. Remember that this was back in 2001, well before the time when the idea of offering an open API had become as widespread as it is today.
All these terms suffer, it seems to me, from emphasising the 'readability' and 'processability' of data over its 'linkedness'. Linkedness is what makes the Web what it is. With hindsight, the major thing that our work on the JISC Information Environment got wrong was to play down the importance of the Web, in favour of a set of digital library standards that focused on sharing 'machine-readable' content for re-use by other bits of software.
Looking at things from the perspective of today, the terms 'Linked Data' and 'Web of Data' both play up the value in content being inter-linked as well as it being what we might call machine-readable.
For example, if we think about open access scholarly communication, the JISC Information Environment (in line with digital libraries more generally) promotes the sharing of content largely through the harvesting of simple DC metadata records, each of which typically contains a link to a PDF copy of the research paper, which, in turn, carries only human-readable citations to other papers. The DC part of this is certainly MRD... but, overall, the result isn't very inter-linked or Web-like. How much better would it have been to focus some effort on getting more Web links between papers embedded into the papers themselves - using what we would now loosely call a 'microformat'? One of the reasons I like some of the initiatives around the DOI, CrossRef springs to mind (though I don't like the DOI much as a technology), is that they potentially enable a world where we have the chance of real, solid, persistent Web links between scholarly papers.
RDF, of course, offers the possibility of machine-readability, machine-processable semantics, and links to other content - which is why it is so important and powerful and why initiatives like data.gov.uk need to go beyond the CSV and XML files of this world (which some people argue are good enough) and get stuff converted into RDF form.
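To make the contrast concrete, here is a minimal sketch (Python, standard library only) of what converting a CSV row into RDF might look like, with the output written as N-Triples. The example.org URIs, the column names and the `row_to_ntriples` helper are all invented for illustration, and the mapping is just one plausible choice. The Dublin Core terms namespace is real, but nothing here reflects how data.gov.uk or anyone else actually publishes their data.

```python
import csv
import io

# Hypothetical CSV export: plain strings, no links to anything else.
CSV_DATA = """id,title,creator_id,cites
paper1,Linked Data Basics,person7,paper2
"""

BASE = "http://example.org/"          # assumed namespace for our own things
DCT = "http://purl.org/dc/terms/"     # Dublin Core terms namespace


def row_to_ntriples(row):
    """Return N-Triples lines for one CSV row (illustrative mapping only).

    Note: literals are not escaped here; a real converter would need to
    escape quotes and backslashes in the title.
    """
    subject = f"<{BASE}paper/{row['id']}>"
    return [
        # A literal value: machine-readable, but a dead end.
        f'{subject} <{DCT}title> "{row["title"]}" .',
        # URI values: these are links a client can follow.
        f"{subject} <{DCT}creator> <{BASE}person/{row['creator_id']}> .",
        f"{subject} <{DCT}references> <{BASE}paper/{row['cites']}> .",
    ]


def convert(csv_text):
    """Convert every row of a CSV string into a flat list of triples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [line for row in reader for line in row_to_ntriples(row)]


if __name__ == "__main__":
    for triple in convert(CSV_DATA):
        print(triple)
```

The point of the sketch is the last two triples: the CSV cells `person7` and `paper2` are just strings, whereas in the RDF they become URIs that other data can point at and software can follow.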
As an aside, DCMI have done some interesting work on Interoperability Levels for Dublin Core Metadata. While this work is somewhat specific to DC metadata I think it has some ideas that could be usefully translated into the more general language of the Semantic Web and Linked Data (and probably to the notions of the Web of Data and MRD).
Mike, I think, would probably argue that this is all the musing of a 'purist' and that purists should be ignored - and he might well be right. I certainly agree with the main thrust of the presentation that we need to 'set our data free', that any form of MRD is better than no MRD at all, and that any API is better than no API. But we also need to remember that it is fundamentally the hyperlink that has made the Web what it is and that those forms of MRD that will be of most value to us will be those, like RDF, that strongly promote the linkability of content, not just to other content but to concepts and people and places and everything else.
The labels 'Linked Data' and 'Web of Data' are both helpful in reminding us of that.
by AndyP at January 31, 2010 09:26 AM
My former OCLC colleague Eric Hellman has become one of the more interesting bloggers in our space. A little while ago he wrote about the acquisition of Liblime by PTFS. He made a general opening comment ...:
The library industry has likewise been troubled by misalignment of interests between the owners of the companies and their customers. That's why it's important for libraries to pay close attention to the frequent mergers and acquisitions of the companies that serve them. [PTFS to Acquire LibLime and Move to Library Systems Premier League]
He goes on to talk about the rationale for open source (primarily to avoid vendor lock-in, Eric argues) and the positions of PTFS and Liblime in the market.
Here, for example, he talks about aspects of the library/vendor transaction from the vendor perspective ...
From the vendor's point of view, the sales process is very expensive. Promises to customize the system to address customer peculiarities are common, and these add to the cost of system maintenance. Once the system has been sold, a proprietary system vendor has a guarantee of continuing profits from support contracts. Only the vendor has the system knowledge (and sometimes even the system access) to make even the most trivial changes. It's in the support phase that the vendor and customer interests can become misaligned. The vendor has every incentive to do the least work at the highest price possible. The customer is locked into whatever system they have chosen. [PTFS to Acquire LibLime and Move to Library Systems Premier League]
.... and here he talks about open source ..
The recent popularity of open source library management systems is in large part a search for business models that better align the interests of vendor and customer during the support phase. If the support vendor doesn't perform to the library's expectations, the library can hire a new support vendor without ditching their automation system. If a library wants to add a new feature to their system, or integrate it with a system from another vendor, they can hire a developer based on qualifications rather than access to source. The important thing to the library is not so much the access to source or the cost of the license, it's the absence of vendor lock-in. [PTFS to Acquire LibLime and Move to Library Systems Premier League]
The entry was informative and interesting. I may disagree with detail or emphasis (other factors are clearly in play in the current interest in open source for example) but - importantly - my thinking has been influenced by it.
When I finished reading it I was also struck by how unusual it is to read something like this in the sources where you might expect it, in the library 'journalism'. In general we are not well-served by library journalism (I am thinking of what is published in our 'trade magazines': American Libraries, Library Journal, CILIP Update, ...) when it comes to this type of 'business' analysis. Our discussions are poorer for it.
by dempseyl@oclc.org (Lorcan Dempsey) at January 31, 2010 04:01 AM
January 30, 2010
My first purchase was one I hadn’t intended on making. It wasn’t until I worked for a week at home that I realized how wonderful it was to have dual monitors at work. Over the last 4+ years I’ve become accustomed to using dual monitors when I work. It is particularly helpful when working on code. But honestly it’s just as important when you are working virtually and want to have your communication tools and what you are working on visible simultaneously. After a week without them I was at my wits’ end. So I went looking for another monitor. In the end, I sort of splurged on the monitor because I watch a lot of Hulu and have a limited amount of desk space. As a result, my main screen is now a new Samsung 23″ wide screen LED monitor.
My second purchase revolved around the fact that I have to be able to scan and fax stuff easily for the new job. So I’ve acquired an HP Office Jet Pro 8500 to scan/fax/print/copy. I was skeptical about the HP but was convinced by the ease with which folks were able to set up the printer’s wireless capabilities (at least according to reviews/comments). Getting the basic setup done was pretty easy. The only road bump I hit involved the scan to email functionality, which refused to connect to my Gmail account to send the info. A quick search of the HP user forums located the answer though. One thing I like about the HP is the built-in web-based configuration tool. I found this WAY easier to use than the HP Setup Wizard software, which seemed lame and crippled by comparison. Hey, I’m a geek; I want to tweak network masks, etc.
The third addition wasn’t really a necessity but rather my birthday gift. My husband knows me all too well. Realizing that I was going to be working from home, that an important part of that is being able to focus and that I use music to focus, he purchased me Bose speakers for my computer system. The new speakers make for a much improved listening experience when I work, write, code. Much easier to tune out the NASA T-1 trainers as they zoom overhead.
All in all I’m pretty happy with the additions and should have everything for my home office now. The only other thing I’m getting is an iPhone earphone/mic. One of my colleagues says that they work fine with a Mac Air. So I’ll be getting those to use when I’m on the road. Still have to deal with cell phone upgrades, but I’m waiting on those for a variety of reasons.
by Karen at January 30, 2010 11:24 PM
Since Apple announced its iPad the web has been consumed with discussion about the device. My husband, knowing I’m a Mac lover, sent me a link and though I swore I wasn’t going to get drawn into looking at it, I couldn’t help but take a peek.
From my perspective there are lots of things to like about the iPad, screen size for one. The fact it runs iWork is another.
My biggest question involves how to input stuff into the device. Looking at the pics it seems like this might be a little more like using a laptop. Since I write a lot when I’m on the road, this is important to me. The other question is how good its ebook reader capabilities will be. Several friends have various devices to read ebooks on, from Kindle to iPod Touch. What I didn’t like about reading on the iPod Touch was the screen size; the iPad overcomes that issue, but concerns have been raised about how easy the screen technology would be on the eyes. Another issue for me is that I have to use VPN to get into any resources at my new job. I’ve got no idea how I would do this with the iPad. (Which is annoying and sad because after carrying my laptop around the office to meetings for a week, I really sort of would have liked a tablet.)
Ultimately, I’m on the fence about my desire to have an iPad. I have a Mac Air, which I LOVE beyond words. It was worth the extra money over a netbook. However, I’m not sure how I feel about replacing it with an iPad. Particularly since I’m not sure how I’d VPN to the network at work. Which means I’d end up with yet another device to carry around. I’ve been looking at smart phones and ereaders. I’m likely to get a smart phone, which means that the iPad would end up being a souped-up ereader for me. Considering the price tag, I’m not sold. If anyone gets one and figures out how to VPN on them, let me know. I’d seriously be interested.
That being said, if I didn’t have my Mac Air, I’d be very likely to buy it.
by Karen at January 30, 2010 10:52 PM
I'm getting quite a few more comments here than when I started, which is lovely! To keep the conversation lively and civil, I've put together a comment policy, which you can find on the blog's About page. (I'll link to it from the sidebar momentarily.)
It's mostly common sense. Moreover, I haven't had to edit or delete a non-spam comment here yet.
Still, I'd rather have a policy and not need it than need it and not have it. So now it's there.
January 30, 2010 07:42 PM
In 2006, after weblogging for some 6 years while working at UC Berkeley, I took on a new role as a data architect on campus. I felt it important to keep blogging about my professional interests but to do so under a new moniker. I came up with "data unbound" to name the passion I had for the myriad possibilities latent in data, some of which I have strived to reveal.
A lot has happened since I started dataunbound.com, the weblog. I left my staff position at UC Berkeley so that I could devote myself more fully to the task of teaching others about the world of web APIs and mashups. I wrote my book on the subject, Pro Web 2.0 Mashups: Remixing Data and Web Services, which has been very well-received, I'm pleased to say. Right now, I'm teaching my course Mixing and Remixing Information for the fifth time at the School of Information at UC Berkeley. This year, I'm focusing the course on the rapidly expanding area of open government and the web.
And now, I (in partnership with my wife, Laura Shefler) have taken the next step of formally starting Data Unbound LLC:
Data Unbound LLC is a training and consulting company that helps organizations access and share data effectively. The value of your data, when it is scattered throughout multiple databases and applications, grows if you can make it all work together. This value increases further when you leverage your information resources with the vast world of data on the Web. Our specialty is helping you to use APIs (application programming interfaces) to integrate data across your organization and beyond.
We're open for business, ready to work with clients to solve their data problems. Our training will enable their organizations to integrate data, both their own and that of others, through APIs and data standards. I encourage you to read more of what we have written on dataunbound.com, in which we detail our approach and our offerings. In the next months, I'll be describing how general principles behind data integration and web APIs can solve your problems in your specific context. And if you know anyone who could make use of Data Unbound, by all means, put them in touch with us.
by Raymond Yee at January 30, 2010 05:19 PM
The unveiling of Apple’s iPad this week provoked seemingly everyone to prognosticate the future of the device and the future of computing in general. I was instead prodded to revisit the past—specifically, the original design goals for the Mac spelled out by the brilliant (and humorous) Jef Raskin. Just read the principles Raskin lays out in 1979 in “Design Considerations for an Anthropophilic Computer“:
This is an outline for a computer designed for the Person In The Street (or, to abbreviate: the PITS); one that will be truly pleasant to use, that will require the user to do nothing that will threaten his or her perverse delight in being able to say: “I don’t know the first thing about computers,” and one which will be profitable to sell, service and provide software for.
You might think that any number of computers have been designed with these criteria in mind, but not so. Any system which requires a user to ever see the interior, for any reason, does not meet these specifications. There must not be additional ROMS, RAMS, boards or accessories except those that can be understood by the PITS as a separate appliance. For example, an auxiliary printer can be sold, but a parallel interface cannot. As a rule of thumb, if an item does not stand on a table by itself, and if it does not have its own case, or if it does not look like a complete consumer item in [and] of itself, then it is taboo.
If the computer must be opened for any reason other than repair (for which our prospective user must be assumed incompetent) even at the dealer’s, then it does not meet our requirements.
Seeing the guts is taboo. Things in sockets is taboo (unless to make servicing cheaper without imposing too large an initial cost). Billions of keys on the keyboard is taboo. Computerese is taboo. Large manuals, or many of them (large manuals are a sure sign of bad design) is taboo. Self-instructional programs are NOT taboo.
There must not be a plethora of configurations. It is better to offer a variety of case colors than to have variable amounts of memory. It is better to manufacture versions in Early American, Contemporary, and Louis XIV than to have any external wires beyond a power cord.
And you get ten points if you can eliminate the power cord.
Any differences between models that do not have to be documented in a user’s manual are OK. Any other differences are not.
It is most important that a given piece of software will run on any and every computer built to this specification…
It is expected that sales of software will be an important part of the profit strategy for the computer.
It only took 31 years (not especially a long time in the history of technology), but I think the iPad is the device Raskin envisioned (given, as Raskin would have agreed, that “the interior” and “the guts” now includes the software interior/guts as well as the hardware interior/guts).
Fraser Speirs has called the tech community’s negative reaction to the iPad “future shock” (via Daring Fireball); but it’s really the shockwave of the past (the radical vision of computing Raskin and Steve Jobs always had) finally catching up to the present.
by Dan Cohen at January 30, 2010 03:42 PM
I’m in a motel in Oxnard, resting up before a funeral tomorrow. My uncle Bob died. I didn’t know him well — our family has a lot of gaps in its attachments — but he led a good strong life and died with his boots on, felled by a series of strokes that began hours after he worked his last Friday at his clinic. He was a doctor — a dermatologist — and he had spent more than half a century getting up and going to work with a smile on his face.
A couple of weeks ago I climbed the Moraga stairs. Not the fancy stairs with the lovely mosaic tiles, but the prosaic eastern stairs, mere concrete steps leading up to a perch on top of the world.
I generally don’t do heights. I’m fine with planes, but on my own two legs, or in a car, heights make me queasy. No miracle happened on my climb. I stayed queasy, eyes-down, creeping to the top and then down again with my hands and arms wound round the bannisters.
“Nice view, yes?” said a dapper man striding past me.
“Yes,” I squeaked, eyes downward. But I had seen the view, when I reached the top. It was an amazing 360 view of San Francisco near sunset on a chilly day, a well-water-clear view that spread out before me across city and ocean. It was the most amazing view, and I, a San Francisco native, had never seen it before.
Really, before I looked at the place we would rent, I had never heard of Golden Gate Heights (what I think of as the “lonely goatherd” section of the Inner Sunset). I had never climbed this hill. Seen this view. Walked these steps.
I think a lot these days about how to introduce people who have never known great libraries to this experience. It’s an interesting problem. If you have never climbed those stairs or seen those heights, what are your expectations?
by K.G. Schneider at January 30, 2010 06:45 AM
January 29, 2010
My radar (Google Alerts) pointed me this morning to this article by Barbara Quint at Information Today. My first response to “EBSCO Exclusives Trigger Turmoil” was “What a mess!” Quint shares the saga of EBSCO and Gale lobbing volleys at each other during the ALA Midwinter meeting. EBSCO announced new acquisitions that were ‘exclusive to EBSCO for the library “marketspace.”’ Major competitor Gale issued a letter to the library community urging “librarians to get involved in opposing publishers granting exclusives, at least to EBSCO.” Read Quint’s article for all the gory details.
If you’re a librarian running or contemplating a discovery service, how do you feel? EBSCO has some new content that I assume is going to become available via their EBSCO Discovery Service, and some content is going to disappear from Gale’s holdings, which I assume means it will disappear from Serials Solutions’ Summon discovery service, which includes Gale as a major participant.
I’m not the smartest person when it comes to understanding relationships between publishers and aggregators but I did raise the concern about such a mess:
Yes - source lock-in. I’ve written, perhaps ad nauseam, about my concern that discovery services, if not integrated with federated search, force organizations that want a single search tool to choose one service or the other. Federated search is very important for organizations that have particular sources they want to search that are not available from one of the discovery services.
Even if an organization is happy with the set of sources provided through a discovery service, the availability of sources is dependent on the relationship with the publishers (and/or aggregators.) Discovery services are too new to know how publisher relationships will evolve, especially given the competition.
Choosing a discovery service causes a library to “cede control of selection,” to steal Carl Grant’s words. The relationship between publishers, aggregators, libraries, and patrons is an ever-shifting one. Unfortunately, it’s the patrons who are left to scramble when the content they care about is no longer available from the search service their library subscribes to. Sigh!
by Sol at January 29, 2010 11:29 PM
29 January is the 5th anniversary of my public blogging. I had a Bloglines private blog for about 9 days before I got fed up with its lack of capabilities. That 1st proto-blog was called In My Secret Life… via Leonard Cohen.
The 1st public-facing blog debuted on 29 January 2005 at bookmark.typepad.com and was called …the thoughts are broken…, which is from Ripple by the Grateful Dead. This would have been the beginning of my 2nd full semester of library school.
On 20 July 2006 I flipped the switch on Off the Mark on my own domain and hosted by LISHost after some tribulations with Typepad over many months. The story of the name is at that post.
On 19 July 2009 I again changed the name of the blog; reasons listed at the post. It is now known as habitually probing generalist.
I will make no promises as to what will or will not happen on this blog in the future. I have not been writing much for quite a while now—some of the reasons are interspersed in posts over the last 18 months or so—and I do not know if or when I will pick up the virtual pen again or how frequently. But I do appreciate having this space as an outlet and knowing that thanks to RSS anyone who truly cares what I might have to say can simply wait on that eventuality to arrive.
Thanks to all who have been here with me any of this time. Hopefully you’ll see me around here some more and I certainly hope to see you (and your feedback/comments/critiques/cries of BS/etc.).
by Mark at January 29, 2010 09:49 PM
(My apologies; this post inadvertently went up prematurely. If you were wondering where I was going with it, please read on!)
I met Steve Koch at Science Online 2010, where he wowed me showing off his students' open-notebook-science work. I love, just love, teachers who do that. I wish the sort of work I typically assign students was appropriate to it.
Because of the interactions Steve had with librarians at that conference, he's going back to talk with the digital librarian at his institution to see what they can do for each other.
I love that, too, though it makes me nervous. Consider a comment I got on a previous post:
I'am afraid the greater part of librarians are staring to their belly-buttons, and do not have the attitude or communication skills necessary to connect with their customers.
Ouch. Nor am I prepared to say that's incorrect. So when I send someone like Steve to meet with a librarian, I have to hope for a fruitful interaction. I can't rely on it.
Wondering where the commenter got that impression? Well, let's consider Steve Koch again. In a comment to another FriendFeed post, he said (quoted with permission; paragraph breaks mine, as FF doesn't let commenters paragraph their own comments):
I'm stoked about partnering with librarians going forward. I'm meeting with our digital initiatives librarian next week to learn what we can do regarding open data / open access / open science.
But a year ago, I was clueless about what university libraries were doing. Definitely a lot of that ignorance was my fault. But it makes sense if you think about my trajectory to current position as faculty. As an undergraduate and graduate student, most of my interactions with the library were moderately helpful at best, and sometimes completely hostile. For example, I had a comical (but infuriating at the time) battle over a $25 fine for using a 2-hour reserve textbook overnight (while the library was closed). And then all the frustration with copy machines & copy cards, etc. Basically, it sucked going to the library, and library & librarian were almost the same word.
So, with the advent of PDF, I was pretty much delighted that I never had to go to the library anymore. I discovered Inter-Library Loan and was proud that I didn't even know where the library was. Clearly all prejudices and a not clever on my part. However, I suspect that similar prejudices are shared by many faculty and other scientists.
I can think of two things that can be done: (1) educate current faculty, and (2) make things more pleasant for current grads and undergrads. In regards to (1), it's pretty tough to achieve. One idea would be to put advertisements in emails that deliver PDFs for ILL: "Do you like ILL? Your library can help you way more than that! email: ___"
Method (2) is likely more productive, IMO. I don't know a lot about it, but I suspect that undergrads and grads still have unpleasant relationships with the library. Making those more pleasant and collaborative will make for better partners in the future. Like I said, I don't know a lot about current state of affairs, and if indeed conditions have improved for students, maybe better advertising of that fact is called for?
What are we to take from this, we librarians, if we wish to regain ground among scientists?
- We need to address three market segments: young proto-scientists, practicing scientists who have no idea what we do, and practicing scientists such as Steve who have been actively turned off by libraries and librarians. By and large, it seems to me, we're doing quite a bit to address the first group's needs, not much at all for the second—and nothing whatever for the third.
- It's not enough to "be a library" any more. It has been enough for quite a long time—among other things, libraries were an important source of institutional prestige—but no more. The boundaries of science librarianship in the research institution are becoming the boundaries of the research enterprise. If we're not contributing to the research enterprise, we can expect to be in the gunsights.
- Patron service matters, if we are not to mint more Steve Kochs by the dozen. Every patron turned away from a library by sticklers for rules or unhelpful service is a spadeful of earth from our own grave.
- Our sixth column? Information-literacy instruction. Love your library instructors! They mint future academic-library patrons.
- One more time: we're not going to fix this situation sitting behind desks in a library our target populations don't visit. What Stephanie and Christina and John and Bonnie and Hope and Molly and Paolo and I did to advance librarianship, we did at a science conference.
January 29, 2010 09:15 PM
Readers of this humble blog (notice I didn't say the author of the blog was humble) already know about the HathiTrust, since I've written about i...
January 29, 2010 07:45 PM
So, you all know I’m a strong WordPress supporter – and you may not know – but I’m no real fan of TypePad because it seems so restrictive for my purposes. That said, I have friends who do like and use TypePad and the point of this site is to share things I learned – and today I learned that there is now a free version of TypePad called TypePad Micro. So if you want to use TypePad you may want to try it out first with TypePad Micro.
by Nicole at January 29, 2010 06:59 PM
So the folks here at PLAN learned about Poken from Helen Blowers who learned about it at the UGame, ULearn conference in Delft. A Poken is basically a little 4 fingered USB character that stores your virtual business card. You then go to Poken.com and enter in your contact info and when you ‘high 4′ (remember it’s 4 fingered) your friend’s Poken you exchange business cards.
So, as a gift for speaking today I got me my Poken – but as cool as this is – it’s only useful if more people start using Poken – so spread the word and maybe at the next library conference we can all ‘high 4′
by Nicole at January 29, 2010 06:47 PM
This will be my last Day in the Life, as Reed and I got sick with RSV (and him with bronchiolitis as well) so I’m feverish, wiped out, and confined to bed. I wrote this Thursday evening before the worst of the illness had hit (and man, it hit like a ton of bricks during the night!)
Soooooooo tired this morning. Since we’d had such a bad night’s sleep last night, I let Reed sleep until he woke up on his own (Adam too). Reed woke up very stuffy, kind of crabby, and not really into eating much in the way of solid foods. I dropped him off at daycare and he seemed pretty happy there playing with his favorite toys. Ended up getting to work around 8:20. This is one of those days that I wish I actually liked coffee.
Fortunately, it’s a teaching day, so I know that’ll wake me up. I really love teaching, because it gets me working with students and faculty, it gets my energy levels up, and, well, it’s just fun most of the time. I used to be terrified of teaching, but over time I’ve not only become comfortable with it, but I really enjoy doing it.
Met with the Distance Learning Librarian (who I supervise) to catch up on what she’s been working on and the progress of some of the committees she’s a member of. She is a very self-directed and highly competent employee, so sometimes it’s easy to forget that she’s only been here since August and still needs plenty of support and advice. I talked to her about presenting on a committee we’re co-chairing at the Library Council meeting tomorrow morning since she could use more experience taking the reins in committee work.
Prepped for the International Studies senior seminar I’m teaching this afternoon. I’ve been trying to find the happy medium between over-preparing (which leads to boring) and under-preparing (which leads to screw-ups) for my instruction sessions and I think I’m getting closer to a happy medium. I’m trying a new instructional technique with this class to get the students more involved, so we’ll see if it’s a success or a major flop.
Did some collection development work as I’m woefully behind in the spending of my liaison funds.
Discussed the website redesign with the Systems Librarian and saw some graphical elements that the university webmaster had made for us. They look completely awesome and I’m so glad he was willing to work with the library on this since graphical design skills are something seriously lacking amongst the library staff.
At 1:45, the International Studies seminar showed up (15 minutes early — damn I’m glad I always start setting up early!). It’s a small class of 11 students, so an ideal one to try out new ideas with. Their assignment for the semester is to write a major research paper on some political, economic or historical topic relating to the country in which they’d studied abroad the year before, so there is a huge range of library resources that could be helpful depending on the topic. Fortunately, I had two hours with the students, so we covered a lot of ground. I’d gone in assuming that since they were seniors who’d taken plenty of history and political science classes (International Studies is an interdisciplinary major), they would already have lots of experience using resources like JSTOR, CIAO, WorldCat, etc. After asking the students a few questions at the beginning of the session, I realized how wrong I was. Only half had used JSTOR and none had used CIAO or WorldCat. Wow! So, that required a bit of readjustment in how I’d planned to teach the class. The one thing I really wanted to try with this class is to have students come up to my computer and do searches on their research topic. I guessed that students would pay more attention if it was their classmate up there, and I thought I could offer suggestions and search tips that they might be more likely to remember if they were the ones doing the searching. It also just makes more sense to do searches on their topics than on canned ones I came up with.
The class ended up being the best one I’ve ever taught. The students actually clapped for me at the end, which was a hoot. The students and the professor were even taking notes during the session, which is not something I often see. I had to do a little more demo-ing of the databases than I’d planned originally, but I still had them doing the searching most of the time. They really responded well to coming up to the computer to do their searching. I chose people to come up to do different searches based on the nature of their topic (economic, current political, historical, historical political, etc.). And it worked out nicely, because some students had very few results and needed to broaden their search, while others had too many and needed to narrow their topic. There were lots of nice examples to use as teaching moments. Not only was I giving them suggestions as they were searching, but the other students were as well. They were asking all sorts of questions about the databases. I fed off the students’ energy and definitely was more energetic and animated than I am with a class where the students don’t seem engaged. I came out of class feeling completely excited, awake and happy.
It’s experiences like this that remind me of why I love my job so much. Some days I’m mired in meetings, paperwork, creating tutorials and other activities that pretty much have me sitting in a chair all day. I like some of those activities (especially creating tutorials), but if that was all there was in my job, it wouldn’t be for me. But then there are those days when I get a lot of reference questions at the desk or I teach, where I really get to help students and faculty. That’s the stuff I love most about my job. Fortunately, as the semester gets going (it’s only week 2), I’ll have more and more interactions like these that will leave me energized and grateful to have the job I do.
by Meredith Farkas at January 29, 2010 06:23 PM
I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.
Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.
Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby instead of java would be nice.
The Big Question: Is the MARC parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?
The Answer (see below): Yes, if I use marc4j.
On our ridiculously-awesome hardware, right now we’re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.
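As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python (chosen here only for illustration; the function name is made up, and the numbers are the ones quoted above):

```python
def reindex_hours(total_records, records_per_second):
    """Hours needed to index total_records at a steady rate."""
    return total_records / records_per_second / 3600


FULL_INDEX = 6_500_000  # records in the full index, as quoted above

print(round(reindex_hours(FULL_INDEX, 250), 1))  # full-index rate -> 7.2
print(round(reindex_hours(FULL_INDEX, 300), 1))  # short-file rate -> 6.0
```

At 250 records/second the full index works out to a little over 7 hours, which fits the 7-8 hour figure.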
I’ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn’t installed on the server) and on the server where we usually run these things.
- The machines are my desktop OSX machine and the beefy linux server where we usually do this stuff
- The platforms are jruby 1.4 --server and MRI ruby 1.8.7
- The libraries are marc4j and ruby-marc 0.3.3
- The parsers are
- The standard binary parsers all around
- A home-grown AlephSequential format reader for the ‘seq’ type. AlephSequential is a MARC representation that uses one line for each field. We use it because it doesn’t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to MARC-XML.
- Whatever marc4j uses internally for MARC-XML
- ruby-marc’s ‘jstax’ xml parser under jruby (which I wrote and apparently needs some love, see below)
- ruby-marc’s ‘libxml’ xml parser under MRI ruby
- Seconds is the average of two rounds, with measurements taken after a warmup run in each case.
The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.
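The measurement procedure described above (one untimed warmup run, then the average of two timed rounds) could be sketched roughly as follows. This is not the script used for these numbers; `benchmark` and the `parse` callable are hypothetical stand-ins for "read every record from the file and do nothing else".

```python
import time


def benchmark(parse, path, record_count, rounds=2):
    """Warm up once, then return (average seconds, records per second)."""
    parse(path)  # warmup run, not timed
    elapsed = []
    for _ in range(rounds):
        start = time.perf_counter()
        parse(path)
        elapsed.append(time.perf_counter() - start)
    avg = sum(elapsed) / len(elapsed)
    # Guard against a zero reading from a trivially fast parse function.
    rate = record_count / avg if avg > 0 else float("inf")
    return avg, rate
```

The warmup matters most on the JVM, where code tends to get faster once the JIT compiler has seen it run for a while.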
MACHINE  PLATFORM  LIBRARY    PARSER    SECONDS  REC/SECOND
desktop  jruby     marc4j     binary       4.06        4650
desktop  jruby     marc4j     xml          5.55        3401
desktop  jruby     ruby-marc  binary      17.35        1088
desktop  jruby     ruby-marc  jstax       80.11         236
desktop  ruby      ruby-marc  binary      33.54         562
desktop  ruby      ruby-marc  libxml      46.87         402
server   jruby     marc4j     binary       2.29        8245
server   jruby     marc4j     xml          3.36        5619
server   jruby     marc4j     AlephSeq     3.68        5130
server   jruby     ruby-marc  binary       9.93        1901
server   jruby     ruby-marc  jstax       44.56         424
The quick takeaways, with all the obvious caveats:
- jruby with ruby-marc is twice as fast at binary and twice as slow at xml compared with MRI
- marc4j is four times as fast for binary and about an order of magnitude faster for xml compared with ruby-marc.
- The server is fast.
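For anyone who wants to check the takeaways against the desktop numbers, here is a small sketch that recomputes the ratios from the timings in the table. The dictionary layout and the `speedup` helper are just illustrative.

```python
# (platform, library, parser) -> seconds, desktop rows copied from the table
SECONDS = {
    ("jruby", "marc4j", "binary"): 4.06,
    ("jruby", "marc4j", "xml"): 5.55,
    ("jruby", "ruby-marc", "binary"): 17.35,
    ("jruby", "ruby-marc", "jstax"): 80.11,
    ("ruby", "ruby-marc", "binary"): 33.54,
    ("ruby", "ruby-marc", "libxml"): 46.87,
}


def speedup(slow, fast):
    """How many times faster the `fast` run is than the `slow` run."""
    return SECONDS[slow] / SECONDS[fast]


# jruby vs MRI, both using ruby-marc
print(round(speedup(("ruby", "ruby-marc", "binary"), ("jruby", "ruby-marc", "binary")), 1))  # 1.9
print(round(speedup(("jruby", "ruby-marc", "jstax"), ("ruby", "ruby-marc", "libxml")), 1))   # 1.7

# marc4j vs ruby-marc, both under jruby
print(round(speedup(("jruby", "ruby-marc", "binary"), ("jruby", "marc4j", "binary")), 1))    # 4.3
print(round(speedup(("jruby", "ruby-marc", "jstax"), ("jruby", "marc4j", "xml")), 1))        # 14.4
```

So "twice" is about 1.9x for binary and 1.7x for xml, and "an order of magnitude" is about 14x.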
We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can’t use marc-binary format because our records are too big.
If I’m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I’m gonna need to use marc4j and just wrap it up in some nicer syntax.
by Bill at January 29, 2010 03:51 PM
In my previous blog post I was trying to demonstrate the virtues of data.gov.uk making the descriptions of their datasets available as RDFa. Just this morning I learned from Mark Birbeck that the folks down under at data.australia.gov.au did this last October!
For example this page describing a dataset for public Internet locations has this RDF metadata inside it:
<http://data.australia.gov.au/80> cc:attributionName "http://www.centrelink.gov.au/"@en-au ;
cc:attributionURL <http://www.centrelink.gov.au/> ;
dc:coverage.geospatial "Australia"@en-au ;
dc:coverage.temporal "Not specified"@en-au ;
dc:creator "Centrelink"@en-au ;
dc:date.modified "2009-08-31"^^xsd:date ;
dc:date.published "2009-08-31"^^xsd:date ;
dc:description """<p xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Location of Centrelink Offices</p>
"""^^rdf:XMLLiteral ;
dc:identifier "80"@en-au ;
dc:keywords "<a href=\"http://data.australia.gov.au/tag/social-security\" rel=\"tag\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Social Security</a>"^^rdf:XMLLiteral ;
dc:license "<a href=\"http://creativecommons.org/licenses/by/2.5/au/\" rel=\"licence\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\"><img alt=\"Creative Commons License\" class=\"licence\" src=\"http://i.creativecommons.org/l/by/2.5/au/88x31.png\"/>Creative Commons - Attribution 2.5 Australia (CC-BY)</a>"^^rdf:XMLLiteral ;
dc:source "<a href=\"http://www.centrelink.gov.au/\" rel=\"dc:source\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\"/>"^^rdf:XMLLiteral ;
dc:subject "<a href=\"http://data.australia.gov.au/catalogue/community\" rel=\"category tag\" title=\"View all posts in Community\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Community</a>, <a href=\"http://data.australia.gov.au/catalogue/employment\" rel=\"category tag\" title=\"View all posts in Employment\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Employment</a>, <a href=\"http://data.australia.gov.au/catalogue/government\" rel=\"category tag\" title=\"View all posts in Government\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Government</a>"^^rdf:XMLLiteral ;
dc:title "Location of Centrelink Offices"@en-au ;
dc:type <http://purl.org/dc/dcmitype/Text> ;
agls:jurisdiction "[Commonwealth of] Australia (AU)"@en-au .
<http://www1.australia.gov.au/datasets/Federal/Centrelink/Location%20of%20Centrelink%20offices%2031_08_09/centrelink_offices_31_08_2009.CSV> dc:format "CSV"@en-au .
Now this data isn’t without problems: notice the XML literals as objects in the assertions involving dc:subject, dc:keywords, dc:license and dc:source? But it’s a Beta after all, and lots of us are learning this as we go, so Australia deserves a ton of credit. One really nice thing they are doing is making assertions about the format and URL location of the dataset itself. It would be even better if the dataset description were linked up with the dataset files using OAI-ORE or some other vocabulary.
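To make the XML literal issue a bit more concrete: a consumer who wants the tag or license URL has to dig it out of markup inside the literal, rather than just following a URI object. A minimal sketch using only the Python standard library, where `links_in_literal` is a made-up helper and not part of any data.australia.gov.au tooling:

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def links_in_literal(literal):
    """Return (href, text) pairs for the <a> elements in an XHTML fragment."""
    # Wrap the fragment in a root element so that literals holding several
    # top-level nodes (like the dc:subject value above) still parse.
    root = ET.fromstring("<root>" + literal + "</root>")
    return [(a.get("href"), (a.text or "").strip())
            for a in root.iter(XHTML + "a")]


# The dc:keywords literal from the listing above, unescaped.
keywords = ('<a href="http://data.australia.gov.au/tag/social-security" '
            'rel="tag" xml:lang="en-au" '
            'xmlns="http://www.w3.org/1999/xhtml">Social Security</a>')

print(links_in_literal(keywords))
# [('http://data.australia.gov.au/tag/social-security', 'Social Security')]
```

Had the object been the tag URI itself, none of this unpacking would be needed.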
In about 5 minutes I adapted the simplistic data.gov.uk crawler to crawl the data.australia.gov.au data. There aren’t as many datasets, so the crawler only pulled down 1725 triples (minus the xhtml triples)…but perhaps I missed some in my simplistic crawl.
Seeing both the data.gov.uk and data.australia.gov.au efforts to make dataset descriptions available makes me wonder if it could be useful for the W3C eGov Working Group to provide some lightweight guidance on how to make dataset descriptions available: what sorts of vocabularies to use, the kinds of assertions that are important, etc. It’s hard not to daydream of trying to provide an aggregated view of both pools of data, which is kept in sync using the web, and which perhaps could pull down aggregated datasets and archive them, etc. Perhaps a little spot checking tool that took a look at your HTML and let you know if it can work as a dataset description would be useful too?
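A very rough sketch of what such a spot checker might look like, using only the Python standard library. It is not a real RDFa parser (it does not resolve prefixes or track subjects); it only collects the values of `property` and `rel` attributes. The `REQUIRED` set is purely an assumption for illustration, since deciding which assertions matter is exactly the guidance that doesn’t exist yet.

```python
from html.parser import HTMLParser

# Hypothetical minimum for a dataset description; not from any spec.
REQUIRED = {"dc:title", "dc:description", "dc:license"}


class RDFaPropertyCollector(HTMLParser):
    """Collect the tokens found in RDFa-style property and rel attributes."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("property", "rel") and value:
                self.found.update(value.split())


def missing_properties(html, required=REQUIRED):
    """Return the required properties that the page never mentions, sorted."""
    collector = RDFaPropertyCollector()
    collector.feed(html)
    return sorted(set(required) - collector.found)


page = ('<div><h1 property="dc:title">Location of Centrelink Offices</h1>'
        '<a rel="dc:source" href="http://www.centrelink.gov.au/"></a></div>')
print(missing_properties(page))  # ['dc:description', 'dc:license']
```

A real tool would want a proper RDFa parser underneath, but even a check this crude would catch a page that forgot its license or title.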
by ed at January 29, 2010 02:36 PM