Some quick sketches at the Monday sessions (3-4 mins max). You can see when I get tired: I get distracted and start drawing cartoon figures.
Blogs and feeds of interest to the Code4Lib community, aggregated.
Whatson is a cognitive search application that will answer your natural language question with a single correct answer. It is a basement build, using open source technology and open web knowledge.
In previous posts I showed how Apache Tika can be used to crawl content in different formats and extract metadata, and how OpenNLP methods can be used to tokenize plain text. In this post I put everything together, replacing OpenNLP with Apache Solr methods to tokenize and otherwise modify the text for search. The following screenshot shows the completed process for my small sample of classic literature. The search term “whale” returns both Moby Dick and The Call of the Wild. Okay, that’s two answers, not one. Like Captain Ahab’s, the journey will be arduous and the goal elusive. The current build is just standard search so far. The cognitive part will begin soon, as per the Whatson architecture.
The following code is a clean, complete, and well-documented example in Java. It’s a bit long for embedding in a blog post, but GitHub Gist does not currently support a height limit with a vertical scrollbar. I may need to look at other methods of embedding code.
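The Gist embed did not survive aggregation here, so the Java is not shown. As a stand-in, here is a minimal, illustrative sketch of the tokenization step the post describes: roughly what a Solr analysis chain with a standard tokenizer and a lowercase filter does, hand-coded in plain Java. The class and method names are our own; this is not Whatson's actual code, which would configure analyzers in the Solr schema rather than writing them by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a Solr analysis chain: split on runs of
// non-letter/non-digit characters (roughly what a standard tokenizer
// does for plain prose), then lowercase each token (lowercase filter).
public class SimpleAnalyzer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Call me Ishmael. The WHALE!"));
        // [call, me, ishmael, the, whale]
    }
}
```

With this normalization, the query term “whale” matches “WHALE!” in the source text, which is why a plain-text search can surface both novels.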
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
Visit the LITA Job Site for more available jobs and for information on submitting a job posting.
Code4lib was big this year, in many ways. While not the biggest in numbers (2013 had 384 attendees vs. ~350 this year), it felt larger because we were in a much smaller room, which made it more crowded. Presenting: This year was also a big one for me, because for […]
This article presents an analysis of 111 Library and Information Science journals based on measurements of “openness” including copyright policies, open access self-archiving policies and open access publishing options. We propose a new metric to rank journals, the J.O.I. Factor (Journal Openness Index), based on measures of openness rather than perceived rank or citation impact. Finally, the article calls for librarians and researchers in LIS to examine our scholarly literature and hold it to the principles and standards that we are asking of other disciplines. [Also available as an EPUB for reading on mobile devices, or as a PDF.]
January 2014 saw the launch of Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3), which was the first major disciplinary or field-specific shift toward open access. Considerable numbers of journals and publishers are moving to embrace open access, exploring a variety of business models, but SCOAP3 represents a significant and new partnership between libraries, publishers and researchers.1 Simply, 10 journals under the SCOAP3 program were converted to open access overnight and are being supported financially by libraries paying article processing charges through a consortium rather than purchasing subscriptions. The Physics field has been at the forefront of open access for more than 20 years, beginning with the foundation of arXiv.org and followed by their premier society, American Physical Society (APS), actively evolving their publications to provide efficient open access options for authors. There has yet to be any such movement in the professional literature of Library and Information Sciences (LIS), despite the fact that the library world is inextricably linked to “open access” both in principle and in practice. The authors note this disciplinary discrepancy, and through an analysis of LIS journals and professional literature hope to inspire those researching and publishing in the LIS field to take control of our professional research practices. We conducted this analysis by grading 111 select LIS journals using a metric we propose to call the “J.O.I Factor” (Journal Openness Index), judging “How Open Is it?” based on a simplified version of the open access spectrum proposed by Public Library of Science (PLOS), the Scholarly Publishing and Academic Resources Coalition (SPARC), and the Open Access Scholarly Publishers Association (OASPA). It is our hope that doing so will lead to the shifts in the scholarly communication system that libraries are necessarily pursuing.2
Scholarly publishing is evolving in many ways, as anyone connected to academia knows. Discussions about publishing often center on the potential that digital technology offers to disseminate the results of scholarly research, a role traditionally filled by scholarly associations, societies, university presses, and commercial publishers. Scholars and researchers at institutions ranging from Ivy League universities to state colleges are raising questions about how non-traditional “digital” scholarship will be evaluated, what criteria and credence should be given to new, openly accessible online journals, and what role open access repositories have in disseminating and preserving the scholarly record. Reaching even into public policy, the Office of Science and Technology Policy (OSTP) convened a Scholarly Publishing Roundtable in 2009. That group’s final report offered the recommendation that each federal research agency (National Science Foundation, National Endowment for the Humanities, etc.) should expeditiously develop and implement public access policies, offering free access to results of federally funded research. 2013 saw OSTP revisit that recommendation and, in response to an overwhelming petition, issue a directive to all federal funding agencies with more than $100 million in R&D funding to develop and implement open access policies, similar to the National Institute of Health’s Public Access Policy, in effect since May 2, 2005.
Popular media are also taking up the question of how scholarly publishing will evolve. The Guardian regularly features pieces in its Higher Education Network calling for redefinition of the publishing cycle that earns large publishing companies significant financial gains off of the gift economy of intellectual content and peer-reviewing in which faculty participate. One opinion piece went so far as to say, “Academic publishers make Murdoch look like a socialist… down with the knowledge monopoly racketeers.” The Economist coined the term “Academic Spring” in a February 2012 piece, referring to faculty’s rising discontent with the current system. They cite the example of Timothy Gowers, an award-winning Cambridge mathematician who called for a boycott of Elsevier, a large STEM publisher, for its unsatisfactory business practices. As of April 22nd, 2014 that boycott, thecostofknowledge.com, had 14,602 signatories. Finally, US News and World Report published a piece in July 2012 that opened with Harvard University’s Faculty Advisory Council stating “many large journal publishers have made the scholarly communication environment financially unsustainable and academically restrictive.”
Responding to these “tectonic shifts in publishing,” university libraries and academic librarians are undergirding a system that is shaky at best. Budgets remain flat, while subscription costs continue to rise; all the while many libraries are investing in staff and infrastructure in the area of scholarly communication, supporting open access initiatives, or moving directly into publishing themselves.3 While the primary push for adapting this system has been working through disciplinary faculty to change research culture, academic librarians are slowly engaging the idea that publishing practices within our own journals and professional writing could be an effective way to mold the future of academic publishing. The scope of this article is to engage our own community, librarians who publish in professional or academic literature, and target pressure points in our subset of academic publishing that could be capitalized upon to push the whole system forward. We are approaching this topic with the goal of plainly sketching out what LIS publishing looks like currently, in terms of scholarly communication practices like copyright assignment, journal policies for open access self-archiving and open access publishing.
Studies of this magnitude have been conducted in the recent past, although they have primarily focused on the attitudes of individual librarian authors toward publishing practices more than analyzing the publishing practices and policies of the journals themselves. Elaine Peterson, in 2006, produced an exploration of “Librarian Publishing Preferences and Open-Access Electronic Journals”, in which she conducted a brief survey. The results show that academic librarians often consider open access journals as a means of sharing their research but hold the same reservations about them as many other disciplines, i.e. concerns about peer review and valuation by administration in terms of promotion and tenure.4 This line of thought is continued in Snyder, Imre and Carter’s 2007 study, which focused more specifically on intellectual property concerns of academic librarian authors and allowable self-archiving practices. They quote Peter Suber, author of Open Access and director of Harvard’s Open Access Project, writing, “‘There is a serious problem [serials pricing and permission crisis], known best to librarians, and a beautiful solution [open access] within the reach of scholars.’ One can draw the conclusion from Suber’s statement that librarians as authors should be the most prominent supporters of open access and that, as scholars, they would practice self-archiving.”5 This study in particular lays an unsettling foundation: 50% of respondents cared mostly about publication without considering the copyright policies of the journals in which they published, and only 16% had exercised the right to self-archive in an institutional repository. (Ibid.) These and other similar studies highlight the simple fact that concerns about changing publishing habits are the same within librarianship as they are in many other disciplines.
College and Research Libraries (C&RL), a well-regarded journal for academic librarianship, published four articles between 2009 and 2013 that studied the publishing practices of academic librarians through surveys.6 Each has contributed valuable insights while reaching very similar conclusions across the board. Palmer, Dill and Christie conclude that in attitude, “Librarians are in favor of seeing their profession take some actions toward open access [...] yet this survey found that agreement with various open access–related concepts does not constitute actual action.”7 Mercer, focused on the publishing and archiving behaviors rather than attitudes of academic librarians, highlights the substantial differences between the dual role many academic librarians inhabit; library professional first and academic researcher second. She writes, “…librarians may be risk takers in their professional roles, where they are actively encouraging changes in the system of scholarly communication and adoption of new technologies but are risk-averse as faculty in their roles as researchers and authors.”8 Taken together, the research could lead one to think that academic librarians are invested in changes to the scholarly publishing system about as little as disciplinary faculty and are just as cautious about evolving their own publishing habits.
Many academic authors write and publish out of passion for their research and to contribute to the progression of knowledge in society. Unfortunately, because of the system of measurement in which academia is mired, credentials, merit and perception can also play a substantial role in the publishing decisions of faculty. Without delving too deep into the discussion of tenure for librarians, the expectations for publishing in certain journals, or at all, are slightly different for librarians than other university faculty. Both the h-index and journal impact factor are measurements of supposed “impact,” based on the citations an article receives, which have in turn been equated with quality.9 The h-index is an impact measure for an individual, whereas impact factor applies to the journal level. Two recent studies follow Mercer’s line of argument and look at the journals in the LIS field, rather than the authors, using these two traditional measures of “impact.”
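For readers less familiar with the mechanics, the h-index is simple to compute: an author (or a journal's authors, as in the studies below) has index h when h of their papers have at least h citations each. A minimal sketch, with made-up citation counts purely for illustration:

```java
import java.util.Arrays;

// Computes the h-index from an array of per-paper citation counts:
// the largest h such that at least h papers have >= h citations each.
public class HIndex {
    public static int hIndex(int[] citations) {
        int[] sorted = citations.clone();
        Arrays.sort(sorted); // ascending
        int n = sorted.length;
        int h = 0;
        for (int i = 0; i < n; i++) {
            // n - i papers have at least sorted[i] citations, so the
            // smaller of the two bounds a candidate value of h.
            h = Math.max(h, Math.min(sorted[i], n - i));
        }
        return h;
    }

    public static void main(String[] args) {
        // Five papers with these counts give an h-index of 3:
        // three papers have at least 3 citations each.
        System.out.println(hIndex(new int[]{10, 8, 5, 2, 1})); // 3
    }
}
```

Note how the measure rewards a sustained body of cited work rather than a single highly cited paper: one paper with 10,000 citations still yields an h-index of only 1.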
Jingfeng Xia conducted a fascinating study proposing that the h-index of authors published in a journal, as opposed to that journal’s impact factor, could provide an efficient method of ranking LIS journals, especially those that are open access and not listed in Journal Citation Reports. Xia’s article also underscores some of the complications that arise when lumping together all journals in the Library and Information Science field; Library and Information Science Research (LISR), a researcher-focused journal published by Elsevier (h-index = 21, impact factor = 1.4, not open access) is judged alongside D-Lib Magazine published by the Corporation for National Research Initiatives (h-index = 33, impact factor = 0.7, open access), a journal aimed at the practice of digital librarianship. LISR’s impact factor (1.4) is high for LIS journals (median 0.74), but when compared to the h-index of D-Lib’s authors LISR seems to have less “impact.” Xia’s employment of the h-index as a measurement, illustrated in this example, shows the breadth and depth that alternative metrics may introduce, the complications of judging journal quality based on citations, and the potential inversion of perceived impact depending on how one looks at it.
Expanding on the idea that acknowledging the perceived quality of journals is a valuable practice within librarianship, Judith Nixon’s “Core Journals in Library and Information Science: Developing a Methodology for Ranking LIS Journals” was published in 2013 by C&RL. She proposes, based on successful practices at Purdue University Libraries, that “Top LIS journals can be identified and ranked into tiers by compiling journals that are peer-reviewed and highly rated by the experts, have low acceptance rates and high circulation rates, are journals that local faculty publish in, and have strong citation ratings as indicated by an ISI impact factor and a high h-index using Google Scholar data.”10 The production of a ranked list like this aligns perfectly with the type of study we performed, and our conclusions will highlight some similarities and differences between Nixon’s list and our findings, pitting the Journal Openness Index (J.O.I) Factor against the Top Tier journals she presents.
Whereas some of these studies in LIS publishing focused on the “people” angle, studying librarians and their attitudes and practices around publishing, we chose to follow more recent research and widen the lens to look at the journals in which librarians might publish. A challenge presents itself when broadening to this scale: there is the ever-present blurred line between the publishing habits of working librarians and those of teaching/research faculty in library schools and academic departments — Library and Information Science Research vs. D-Lib Magazine for example. There are obvious differences between these groups, so pairing analysis on the specific journals where professional librarians typically publish with the more specific studies on that same group’s publishing habits will present the most accurate portrait of the scholarly communication landscape as it has been studied to date. We leave the extension of this research for future study.
The journals that we began with came from an internal list compiled as part of a professional development initiative at Florida State University Libraries. A student worker in the Assessment department compiled the original list of 74 journals, and then the co-authors of this piece expanded that list to 111 after consulting the LIS Publications Wiki. The journals were ingested into a spreadsheet with columns for impact factor, scope, instructions for authors, indexing information and other common details. Our first task was to add columns for copyright policy, open access archiving policy, and open access publishing options. Our journal list includes an extraordinarily broad range of journals including research focused journals and those in subfields of librarianship like archives and technical services. This decision was made so as to gather data from the broadest possible representation of LIS scholarship.
After compiling and organizing the journal list, we collected each journal’s standard policies on copyright assignment, open access self-archiving (“green open access”), and open access publishing (“gold open access”). We began gathering these data by searching the SHERPA/RoMEO database for commercial journals and the Directory of Open Access Journals (DOAJ) for open access journals. After searching these databases, we double checked policies and open access options on the journal and/or publisher’s website using the following workflow: locate the policies section of the website, which is commonly labeled “Policies,” “Policies and Guidelines,” “Author’s Rights,” or “Author’s Guidelines”; identify the copyright policy of the journal; identify the open access self-archiving policy or “green open access” options that the journal permits; identify the open access publishing or “gold open access” options of the journal, which may be listed in the policies section or a specific “Open Access Options” section; and finally view the copyright transfer agreement or other author agreement, if available. All details were entered into the spreadsheet and coded for consistency.
Grading journals based on how “open” they are, as opposed to citation impact or h-index, is a novel approach, and one that, to our knowledge, has not been applied to LIS literature. In fact, it is not clear that this measurement has been used extensively in any field or practice aside from the production of the spectrum and some supporting documentation by PLOS, SPARC, and OASPA. Potentially then, as further research is done using the J.O.I Factor, the grades we apply to journals herein may be different based on how many measures of openness are used and how they are counted. Our proposed enumeration of the J.O.I Factor is indicated on the image below, superimposed over the “How Open Is It?” scale produced by SPARC/PLOS. The application of J.O.I Factors to specific journals is contained to our Conclusion section for purposes of clarity and emphasis.
The original spectrum breaks openness down to six categories, three of which overlap neatly with the criteria we used in our analysis: 1) Copyrights, 2) Reuse Rights, and 3) Author Posting Rights. The remaining categories, Reader Rights, Automatic Posting, and Machine Readability were mostly ancillary to our focus, and so the J.O.I Factor numbers that we apply only account for the three criteria we researched. The “Reader Rights” category does include some details about embargoes but typically refers to embargoes on the final published PDF released after that term expires by the publisher. Our use of the embargo data point was in terms of Author Posting Rights, so we chose not to include Reader Rights as a category in our J.O.I Factor calculations.
Also, the spectrum lumps open access publishing options, another of our data points, in with Reader Rights as “immediate access to some, but not all, articles (including the ‘hybrid’ model)” — “hybrid” meaning the business model where articles can be made open access on a one-by-one basis for a fee. We decided to add a “-” for journals that offer open access publishing for a fee, illustrating the negative connotation that might have for authors. Journals that are fully open access without any publishing fees will have a J.O.I number and a “+” illustrating positive connotations. Information Technology and Libraries, for example, published by Library and Information Technology Association/ALA, would have a J.O.I Factor of 12+; four points for author retention of copyrights, four points for broad reuse rights (CC-BY), four points for the author being allowed to post any version of the article in a repository and “+” for the journal being fully open access without imposing any publication fees.
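Read this way, the scoring is mechanical enough to express in a few lines. A sketch of the calculation as we understand it — three criteria worth up to four points each, plus a suffix for the gold open access model; the class, method, and enum names here are our own illustration, not part of the proposed metric:

```java
// Illustrative calculator for the J.O.I. Factor as described above:
// copyright, reuse, and author-posting criteria each scored 0-4, with
// a suffix flagging the gold open access model ("+" = fully open
// access with no fee, "-" = open access offered for a fee, "" = no
// gold open access option).
public class JoiFactor {
    public enum GoldOA { FEE_FREE, FEE_BASED, NONE }

    public static String score(int copyright, int reuse, int posting, GoldOA gold) {
        int total = copyright + reuse + posting;
        String suffix = switch (gold) {
            case FEE_FREE -> "+";
            case FEE_BASED -> "-";
            case NONE -> "";
        };
        return total + suffix;
    }

    public static void main(String[] args) {
        // Information Technology and Libraries, per the example above:
        // 4 (author retains copyright) + 4 (CC-BY reuse) + 4 (any
        // version may be posted), fully open access with no fee.
        System.out.println(score(4, 4, 4, GoldOA.FEE_FREE)); // 12+
    }
}
```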
We hope that the application of the J.O.I Factor in this article serves merely as a proof of concept, and we invite colleagues to use our data, apply J.O.I Factors to all the journals we listed there, and extend this work to account for the full range of possible factors of openness.
The most common major publishers from our sample were Taylor & Francis (25 journals), Emerald (12 journals), and Elsevier (8 journals). Society and Association publishers followed closely with 23 journals, and Universities and University Presses had 18. The remainder were either unknown, other types of organizations, smaller publishing houses or “self-affiliated.” The three clearly self-affiliated journals, First Monday, Code4Lib and In the Library with the Lead Pipe are all fully open access but have a range of difference in their copyright policies, illustrating the variety of publishing options within the LIS field.11
Each journal was assigned a corresponding code for its copyright, open access archiving, and open access publication policies. These codes were used primarily for organizing the information in our spreadsheet, and are not conflated with our proposed J.O.I Factors which are applied after all data were collected, organized, and analyzed. The codes represent the range of possible options under each category, based on the variety of options we identified in the journals we reviewed. For example, the Copyright field could range from (1) required full transfer of copyright to (4) copyright jointly shared between author and publisher. Self-archiving policies ranged from Not Permitted (0) to allowing the final published PDF (6), with a range of embargo periods for each category in between. (See Table 1 for all codes)
Despite librarianship’s ongoing waltz with copyright complications, 43 of the LIS journals we reviewed still require the author to transfer all copyrights to the publisher, “during the full term of copyright and any extensions or renewals… including but not limited to the right to publish, republish, transmit, sell, distribute and otherwise use the [article] in whole or in part… in derivative works throughout the world, in all languages and in all media of expression now known or later developed” (emphasis our own).12 However, leaning toward a more expansive rights agreement, 61 journals allow the author to retain copyright, 38 of which require a License to Publish be granted to the publisher.13 21 of the 38 that require a license granted to the publisher are Taylor & Francis journals, which fall under their new author rights for LIS journals. Taylor and Francis shows leadership in adapting their rights agreements for LIS journals, although one co-author of this article sought to push them further, with success. The remaining 23 journals that allow the author to retain copyright also offer the article to be published under a Creative Commons (CC) license, ranging from Attribution-Non-Commercial-No Derivatives (Collaborative Librarianship) to Public Domain (First Monday). The boldest and most progressive copyright policy goes to First Monday, which offers total author choice, from copyright transfer (©), through every possible Creative Commons license, to releasing the work in the public domain (CC0).
This category provided the broadest range of possibilities, mostly due to the fact that different publishers assign different terms of embargo for self-archiving. Assuming that well-informed LIS authors who submit to these journals desire the simplest and broadest open access options, 24 journals allow the pre-print (submitted version), post-print (accepted version) and final published PDF to be archived in an open access institutional repository, with no stated embargoes. 22 of these 24 are fully open access journals, and they are all published by societies, associations, universities or self-affiliated groups. Common thought in academic publishing tends to say that society/association publishers lose the most when going open access; it is heartening to see this is absolutely untrue in LIS literature. The strictest embargo on self-archiving in an institutional repository is 18 months for 10 of the Taylor and Francis journals. University of Texas Press and University of Chicago Press both allow archiving after 12 months, while ironically, given the topic of the journal, the Journal of Scholarly Publishing, published by University of Toronto Press, only allows archiving of the pre-print with no policies for post-prints.
An important point to consider when discussing self-archiving policies is the farce that they truly are. Kevin Smith, Duke University’s Scholarly Communication Officer, stated it most plainly in his February 5 blog post titled It’s the Content, Not the Version! He writes,
…this notion of versions is, at least in part, an artificial construction that publishers use to assert control while also giving the appearance of generosity in their licensing back to authors of very limited rights to use earlier versions. The versions are artificially based on steps in the publication permission process (before submission, peer-review, submission, publication), not on anything intrinsic to the work itself that would justify a change in copyright status.14
The practice of self-archiving is totally dependent on copyright transfer agreements, and based on the representative sample of LIS journals we reviewed, all but 8% had direct or implied policies regarding what the author is allowed to do with specific versions of the same work. The author’s false sense of control over their work, and the publisher’s exploitation of that sense, deserve a study of their own. Suffice it to say that if the field of Library and Information Studies considers a green open access policy a good deal, there is much work to be done.
A common misconception about achieving open access is that it always requires a fee on the part of the author. While this is mostly true for traditional commercial publishers attempting to retain their income stream while “acquiescing” to the desires of their authors, it is broadly false, as our analysis shows. 56 journals offer open access on an article-by-article basis and require an article processing charge (APC) ranging from $300 to $3,000. 52 of these are published by commercial publishers (Elsevier, Sage, Springer, Wiley, Taylor and Francis, and Emerald). In stark contrast, 35 journals on our list are fully open access and all articles are published without a fee. A significant number, 20 journals, either do not offer a “gold open access” publication option or do not publicize it. A number of these 20 journals are published by University Presses (6) and associations/societies (7).
As noted above, within these LIS journals there is considerable diversity in policies. We wanted to further explore that depth of difference by looking specifically at the fully open access journals in our sample. This section reiterates some of the analyses from previous sections, but we thought it still important to enumerate the complexities of publishing within this subset of a subset. 38 of the 111 journals that we looked at are open access, and only two (The International Journal of Library Science and IFLA Journal) have a publication charge, $300 and $1500 respectively. While two of the 38 open access journals require a full copyright transfer (International Journal of Library Science and Student Research Journal) a little more than half of them (21) allow the author to keep copyright AND attach a Creative Commons license to the work.15 27 of these fully open access journals allow the author to deposit the final published PDF in a repository, meaning that 11 fully open access journals either place some restrictions on the reuse of open access content or have poorly defined reuse policies.
Even though these are open access journals, the data suggests that what qualifies as “open access” even within our own field is still loosely defined, a point we attempt to illustrate by applying the J.O.I. Factor at the close of this article. Some might make the argument that any restriction of authors’ rights (copyright) and readers’ rights (reuse via licenses) toes the line of not achieving pure open access. Emily Drabinski, a reviewer of this article, made the salient point that the policies we discuss as needing to change are under the purview of journal editorial boards who are often in the complicated position of being between authors (colleagues) and publishers. To that end, we encourage journal Editors as well as authors to lead by taking action. Regardless, as the measures of openness are more effectively discussed within our communities of practice, the LIS field is making slow progress toward public access (readability) and open access (re-usability), a trend we expect to broaden and deepen.
This article illustrates something with which every researcher in the field of Library and Information Studies must contend. A significant percentage of our professional literature is still owned and controlled by commercial publishers whose role in scholarly communication is to maintain “the scholarly record,” yes, but also to generate profits at the expense of library budgets by selling our intellectual property back to us. Conversely, there is much to be proud of, including the many association, society and University-sponsored journals that are well-respected and proving important points about the viability of open access as a business model, a dissemination mechanism, and a principle to which librarians hold — our “free to all” heritage. It is our hope that this article inspires the activism that the earlier articles from our review of the literature pointed out as a disturbing discrepancy in our professional practice. Simply, this is our call for librarians to practice what we preach, regardless of, or even in the face of, tenure and promotion “requirements,” long-held professional norms, and the unnecessary fear, uncertainty and doubt that control academic publishing. We already have models for activism on the collections side of our work; we call our colleagues to echo those impulses on the production side of scholarship, as editors, authors, bloggers, library publishers, and consumers of research.
There are three practical means of seeding this change: 1) exercise the right to self-archive every piece of scholarship published in LIS journals, or better yet never give those rights away in the first place; 2) move the “prestige” to open access, meaning offering our best work to journals that are invested in a more benevolent scholarly communication system; and 3) as editors, work diligently to adapt the policies and procedures of the journals we control to align with our professional principles of access, expansive understanding of copyrights, fair use, and broad reusability.
Returning to “Nixon’s list,” which proposed a possible ranking system for LIS journals, it is interesting to grade her list in terms of the “openness” criteria we’ve employed in this article, and in light of the practical actions we propose. Nixon’s findings present 18 journals that were determined to be the “Tier One” journals, based on the criteria she and her colleagues developed.16 11 of those 18 were also identified as top LIS journals from her literature review. Table 2 shows those 11 “prestige” journals, as graded by our applied J.O.I Factor.
The results are striking. College and Research Libraries, widely regarded as a top journal for practicing librarians, received a J.O.I Factor of 9+, whereas Information Technology and Libraries (ITAL) measures at 12+, all because of ITAL’s generous Reuse Rights policy (CC-BY). JASIST is tied for last place (J.O.I Factor 2-) with Elsevier and Emerald journals because of copyright transfer requirements, no reuse rights and middling author posting allowances. Library Trends and Library Quarterly (university press journals) sit solidly in the middle, entirely due to their author posting policies which allow posting the Publisher’s PDF.
Based on this, in closing, we submit these final questions to the LIS research community: are these the journals we want on a top tier list, and what measure of openness will we define as acceptable for our prestigious journals? Further, how long will we tolerate measurements like impact factor and h-index guiding our criteria for advancement, while accounting for very little that matters to how we principle ourselves and our work? Finally, has the time come and gone for LIS to lead the shifts in scholarly communication? It is our hope that this article prompts furious and fair debate, but mostly that it produces real, substantive evolution within our profession, how we research, how we assign value to scholarship, and how we share the products of our intellectual work.
Our thanks and gratitude go to Emily Drabinski for her thoughtful, helpful and engaging comments as the external reviewer of this article. Thanks also to Lead Pipe colleagues and editors, Ellie, Erin, and Hugh, for challenging our ideas, correcting our bad grammar and making this lump of coal into a diamond. Most of all, thanks to Brett for proposing the term “Journal Openness Index” to replace our not creative and weird-sounding original concept.
Peterson, E. (2006) Librarian Publishing Preferences and Open-Access Electronic Journals. Electronic Journal of Academic and Special Librarianship, 7(2). Accessible at http://southernlibrarianship.icaap.org/content/v07n02/peterson_e01.htm
Carter, H., Carolyn Snyder, and Andrea Imre. (2007) “Library Faculty Publishing and Self-Archiving: A Survey of Attitudes and Awareness.” portal: Libraries and the Academy, 7(1). Open access version at http://opensiuc.lib.siu.edu/morris_articles/1/
Palmer, K., Emily Dill, and Charlene Christie. (2009) “Where There’s a Will There’s a Way?: Survey of Academic Librarian Attitudes about Open Access.” College and Research Libraries, 70. Accessible at http://crl.acrl.org/content/70/4/315.full.pdf+html
Mercer, H. (2011) Almost Halfway There: An Analysis of the Open Access Behaviors of Academic Librarians. College and Research Libraries, 72. Accessible at http://crl.acrl.org/content/72/5/443.full.pdf+html
Nixon, J. (2014) Core Journals in Library and Information Science: Developing a Methodology for Ranking LIS Journals. College and Research Libraries, 75. Accessible at http://crl.acrl.org/content/75/1/66.full.pdf+html
Smith, K. (2014) It’s the content, not the version! Scholarly Communications @ Duke [blog], posted on February 5. Accessible at http://blogs.library.duke.edu/scholcomm/2014/02/05/its-the-content-not-the-version/
Vandegrift, M. and Chealsye Bowley. (2014) LIS Journals measured for “openness.” Accessible at http://dx.doi.org/10.6084/m9.figshare.994258
Malenfant, K. J. (2010) Leading Change in the System of Scholarly Communication: A Case Study of Engaging Liaison Librarians for Outreach to Faculty. College & Research Libraries, 71. Accessible at http://crl.acrl.org/content/71/1/63.full.pdf+html
Sugimoto, C. R., Tsou, A., Naslund, S., Hauser, A., Brandon, M., Winter, D., … Finlay, S. C. (2012) Beyond gatekeepers of knowledge: Scholarly communication practices of academic librarians and archivists at ARL institutions. College & Research Libraries, 75. Accessible at http://crl.acrl.org/content/75/2/145.full.pdf+html
Xia, J. (2012) Positioning Open Access Journals in a LIS Journal Ranking. College & Research Libraries, 73. Accessible at http://crl.acrl.org/content/73/2/134.full.pdf+html
Henry, D. and Tina M. Neville. (2004) Research, Publication, and Service Patterns of Florida Academic Librarians. The Journal of Academic Librarianship, 30. Open access version at http://hdl.handle.net/10806/200. Published version at http://dx.doi.org/10.1016/j.acalib.2004.07.006
Joswick, K. (1999) Article Publication Patterns of Academic Librarians: An Illinois Case Study. College & Research Libraries, 60. Accessible at http://crl.acrl.org/content/60/4/340.full.pdf+html
Hart, R. (1999) Scholarly Publication by University Librarians: A Study at Penn State. College & Research Libraries, 60. Accessible at http://crl.acrl.org/content/60/5/454.full.pdf+html
Wiberley, S., Jr., Julie M. Hurd, and Ann C. Weller (2006) Publication Patterns of U.S. Academic Librarians from 1998 to 2002. College & Research Libraries, 67. Accessible at http://crl.acrl.org/content/67/3/205.full.pdf+html
Harley, D.; Acord, Sophia Krzys; Earl-Novell, Sarah; Lawrence, Shannon; & King, C. Judson. (2010). Assessing the Future Landscape of Scholarly Communication: An Exploration of Faculty Values and Needs in Seven Disciplines. UC Berkeley: Center for Studies in Higher Education. Accessible at http://www.escholarship.org/uc/item/15x7385g
Frass, W., Jo Cross, and Victoria Gardener (2013) Taylor and Francis Open Access Survey – Supplement 1-8 Data Breakdown by Subject Area. Accessible at http://www.tandfonline.com/page/openaccess/opensurvey
Priego, E. (2012) Fieldwork: Mentions of Library Science Journals Online. Accessible at http://www.altmetric.com/blog/fieldwork-mentions-library-journals-online/
The Price of Information (Feb. 2012) http://www.economist.com/node/21545974
A (free) roundup of content on the Academic Spring (April 2012) http://www.guardian.co.uk/higher-education-network/blog/2012/apr/12/blogs-on-the-academic-spring
Academic Publishers make Murdoch look like a Socialist (Aug. 2011) http://www.guardian.co.uk/commentisfree/2011/aug/29/academic-publishers-murdoch-socialist
Is the Academic Publishing Industry on the Verge of Disruption? (July 2012) http://www.usnews.com/news/articles/2012/07/23/is-the-academic-publishing-industry-on-the-verge-of-disruption
(In the previous post, Better ways of using R on LibStats (1), I explain the background for this reference desk statistics analysis with R, and I set up the data I use. This follows on, showing another example of how I figured out how to do something more cleanly and quickly.)
In Ref desk 4: Calculating hours of interactions (from almost exactly two years ago) I explained in laborious detail how I calculated the total hours of interaction at the reference desks. I quote myself:
Another fact we record about each reference desk interaction is its duration, which in our libstats data frame is in the time.spent column. As I explained in Ref Desk 1: LibStats, these are the options:
- NA (“not applicable,” which I’ve used, though I can’t remember why)
- 0-1 minute
- 1-5 minutes
- 5-10 minutes
- 10-20 minutes
- 20-30 minutes
- 30-60 minutes
- 60+ minutes
We can use this information to estimate the total amount of time we spend working with people at the desk: it’s just a matter of multiplying the number of interactions by their duration. Except we don’t know the exact length of each interaction; we only know it within a range. If we say an interaction took 5-10 minutes then it could have taken 5, 6, 7, 8, 9, or 10 minutes. 10 is 100% more than 5: relatively that’s a pretty big range. (Of course, mathematically it makes no sense to have a 5-10 minute range and a 10-20 minute range, because if something took exactly 10 minutes it could go in either category.)
Let’s make some generous estimates about a single number we can assign to the duration of reference desk interactions.
- NA → 0 minutes
- 0-1 minute → 1 minute
- 1-5 minutes → 5 minutes
- 5-10 minutes → 10 minutes
- 10-20 minutes → 15 minutes
- 20-30 minutes → 25 minutes
- 30-60 minutes → 40 minutes
- 60+ minutes → 65 minutes
This means that if we have 10 transactions of duration 1-5 minutes we’ll call it 10 * 5 = 50 minutes total. If we have 10 transactions of duration 20-30 minutes we’ll call it a 10 * 25 = 250 minutes total. These estimates are arguable but I think they’re good enough. They’re on the generous side for the shorter durations, which make up most of the interactions.
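As a side note, one way to sketch that label-to-minutes lookup in R is with a named vector, so each duration label maps straight to its single estimate (the vector name est here is just for illustration; the post itself uses a match-based approach later):

```r
# Map each duration label to a single estimated number of minutes
est <- c("0-1 minute"    = 1,  "1-5 minutes"   = 5,
         "5-10 minutes"  = 10, "10-20 minutes" = 15,
         "20-30 minutes" = 25, "30-60 minutes" = 40,
         "60+ minutes"   = 65)

# 10 transactions of "1-5 minutes": 10 * 5 = 50 minutes total
10 * est[["1-5 minutes"]]

# 10 transactions of "20-30 minutes": 10 * 25 = 250 minutes total
10 * est[["20-30 minutes"]]
```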
To do all those calculations I made a function, then a data frame of sums, then I looped through all the library branches, built up a new data frame for each by applying the function to the sums, then put all those data frames together into a new one. Ugly! And bad!
When I went back to the problem and tackled it with dplyr I realized I’d made a mistake right off the bat back then: I shouldn’t have added up the number of “20-30 minute” durations (e.g. 10) and then multiplied by 25 to get 250 minutes total. It’s much easier to use the time.spent column in the big data frame to generate a new column of estimated durations and then add those up. For example, in each row that has a time.spent of “20-30 minutes” put 25 in the est.duration column, then later add up all those 25s. Doing it this way means only ever having to deal with vectors, and R is great at that.
Here’s the data I’m interested in. I want to have a new est.duration column with numbers in it.
> head(subset(l, select=c("day", "question.type", "time.spent", "library.name")))
         day                 question.type    time.spent library.name
1 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
2 2011-02-01             4. Strategy-Based 10-20 minutes        Scott
3 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
4 2011-02-01 3. Skill-Based: Non-Technical  5-10 minutes        Scott
5 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
6 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
I’ll do it with these two vectors and the match command, which the documentation says “returns a vector of the positions of (first) matches of its first argument in its second.” Here I set them up and show an example of using them to convert the words to an estimated number.
> possible.durations <- c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-20 minutes", "20-30 minutes", "30-60 minutes", "60+ minutes")
> duration.times <- c(1, 4, 8, 15, 25, 40, 65)
> match("20-30 minutes", possible.durations)
[1] 5
> duration.times[5]
[1] 25
> duration.times[match("20-30 minutes", possible.durations)]
[1] 25
That’s how to do it for one line, and thanks to the way R works, if we say we want this to be done on a column, it will do the right thing.
> l$est.duration <- duration.times[match(l$time.spent, possible.durations)]
> head(subset(l, select=c("day", "question.type", "time.spent", "library.name", "est.duration")))
         day                 question.type    time.spent library.name est.duration
1 2011-02-01             4. Strategy-Based  5-10 minutes        Scott            8
2 2011-02-01             4. Strategy-Based 10-20 minutes        Scott           15
3 2011-02-01             4. Strategy-Based  5-10 minutes        Scott            8
4 2011-02-01 3. Skill-Based: Non-Technical  5-10 minutes        Scott            8
5 2011-02-01             4. Strategy-Based  5-10 minutes        Scott            8
6 2011-02-01             4. Strategy-Based  5-10 minutes        Scott            8
With dplyr it’s easy to make a new data frame that lists, for each month, how many ref desk interactions happened and an estimate of their total duration. First I’ll take a fresh sample of 10,000 rows to work with.
> l.sample <- l[sample(nrow(l), 10000),]
> sample.durations.pm <- l.sample %.% group_by(library.name, month) %.% summarise(minutes = sum(est.duration, na.rm = TRUE), count = n())
> sample.durations.pm
Source: local data frame [274 x 4]
Groups: library.name

   library.name      month minutes count
1           ASC 2011-09-01      77     7
2           ASC 2011-10-01      66     2
3           ASC 2011-11-01      13     7
4           ASC 2012-01-01      41     3
5           ASC 2012-02-01      11     5
6           ASC 2012-03-01       1     1
7           ASC 2012-04-01       4     1
8           ASC 2012-05-01      23     3
9           ASC 2012-06-01       8     2
10          ASC 2012-07-01       4     1
..          ...        ...     ...   ...
> ggplot(sample.durations.pm, aes(x=month, y=minutes/60)) + geom_bar(stat="identity") + facet_grid(library.name ~ .) + labs(x="", y="Hours", title="Estimated total interaction time (based on a small sample only)")
The count column is made the same way as last time, and the minutes column uses the sum function to add up all the durations in each grouping of the data. (na.rm = TRUE removes any NA values before adding; without that R would say 5 + NA = NA.)
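That NA behaviour is easy to demonstrate at the console:

```r
sum(c(5, NA))               # a single NA poisons the whole sum: result is NA
sum(c(5, NA), na.rm = TRUE) # the NA is dropped before adding: result is 5
```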
So easy compared to all the confusing stuff I was doing before.
Finally, finding the average duration is just a matter of dividing minutes by count; this is where mutate comes in:
> sample.durations.pm <- mutate(sample.durations.pm, average.length = minutes/count)
> sample.durations.pm
Source: local data frame [274 x 5]
Groups: library.name

   library.name      month minutes count average.length
1           ASC 2011-09-01      77     7      11.000000
2           ASC 2011-10-01      66     2      33.000000
3           ASC 2011-11-01      13     7       1.857143
4           ASC 2012-01-01      41     3      13.666667
5           ASC 2012-02-01      11     5       2.200000
6           ASC 2012-03-01       1     1       1.000000
7           ASC 2012-04-01       4     1       4.000000
8           ASC 2012-05-01      23     3       7.666667
9           ASC 2012-06-01       8     2       4.000000
10          ASC 2012-07-01       4     1       4.000000
..          ...        ...     ...   ...            ...
> ggplot(sample.durations.pm, aes(x=month, y=average.length)) + geom_bar(stat="identity") + facet_grid(library.name ~ .) + labs(x="", y="Minutes", title="Estimated average interaction time (based on a small sample only)")
Don’t take those numbers as reflecting the actual real activity going on at YUL. It’s just a sample, and it conflates all kinds of questions, from directional (“where’s the bathroom”), which take 0-1 minutes, to specialized (generally the deep and time-consuming upper-year, grad and faculty questions, or ones requiring specialized subject knowledge), which can take hours. Include the usual warnings about data gathering, analysis, visualization, interpretation, problem(at)ization, etc.
A couple of years ago I wrote some R scripts to analyze the reference desk statistics that we keep at York University Libraries with LibStats. I wrote five posts here about what I found; the last one, Ref desk 5: Fifteen minutes for under one per cent, links to the other four.
Those scripts did their job, but they were ugly, and there were some more things I wanted to do. Because of my recent Ubuntu upgrade, I’m running R version 3.0.2 now, which means I can use the new dplyr package by R wizard Hadley Wickham and others. (It doesn’t work on 3.0.1.) The vignette for dplyr has lots of examples, and I’ve been seeing great posts about it, and I was eager to try it. So I’m going back to the old work and refreshing it and figuring out how to do what I wanted to do in 2012—or couldn’t because we only had one year of data; now that we have four, year-to-year comparisons are interesting.
This first post is about how I used to do things in an ugly and slow way, and how to do them faster and better.
I begin with a CSV file containing a slightly munged and cleaned dump of all the information from LibStats.
$ head libstats.csv
timestamp,question.type,question.format,time.spent,library.name,location.name,initials
02/01/2011 09:20:11 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 09:43:09 AM,4. Strategy-Based,In-person,10-20 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:00:56 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:05:05 AM,3. Skill-Based: Non-Technical,Phone,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:17:20 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:30:07 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:54:41 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 11:08:00 AM,4. Strategy-Based,In-person,10-20 minutes,Scott,Drop-in Desk,AA
02/01/2011 11:32:00 AM,3. Skill-Based: Non-Technical,In-person,10-20 minutes,Scott,Drop-in Desk,AA
I read the CSV file into a data frame, then fix a couple of things. The date is a string and needs to be turned into a Date, and I use a nice function from lubridate to find the floor of the date, which aggregates everything to the month it’s in.
> l <- read.csv("libstats.csv")
> library(lubridate)
> l$day <- as.Date(l$timestamp, format="%m/%d/%Y %r")
> l$month <- floor_date(l$day, "month")
> str(l)
'data.frame':   187944 obs. of  9 variables:
 $ timestamp      : chr  "02/01/2011 09:20:11 AM" "02/01/2011 09:43:09 AM" "02/01/2011 10:00:56 AM" "02/01/2011 10:05:05 AM" ...
 $ question.type  : chr  "4. Strategy-Based" "4. Strategy-Based" "4. Strategy-Based" "3. Skill-Based: Non-Technical" ...
 $ question.format: chr  "In-person" "In-person" "In-person" "Phone" ...
 $ time.spent     : chr  "5-10 minutes" "10-20 minutes" "5-10 minutes" "5-10 minutes" ...
 $ library.name   : chr  "Scott" "Scott" "Scott" "Scott" ...
 $ location.name  : chr  "Drop-in Desk" "Drop-in Desk" "Drop-in Desk" "Drop-in Desk" ...
 $ initials       : chr  "AA" "AA" "AA" "AA" ...
 $ day            : Date, format: "2011-02-01" "2011-02-01" "2011-02-01" "2011-02-01" ...
 $ month          : Date, format: "2011-02-01" "2011-02-01" "2011-02-01" "2011-02-01" ...
> head(l)
               timestamp                 question.type question.format    time.spent library.name location.name initials
1 02/01/2011 09:20:11 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
2 02/01/2011 09:43:09 AM             4. Strategy-Based       In-person 10-20 minutes        Scott  Drop-in Desk       AA
3 02/01/2011 10:00:56 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
4 02/01/2011 10:05:05 AM 3. Skill-Based: Non-Technical           Phone  5-10 minutes        Scott  Drop-in Desk       AA
5 02/01/2011 10:17:20 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
6 02/01/2011 10:30:07 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
The columns are: timestamp, question.type, question.format, time.spent, library.name, location.name, and initials, plus the day and month columns I just added.
Now I have these fields in the data frame that I will use:
> head(subset(l, select=c("day", "month", "question.type", "time.spent", "library.name")))
         day      month                 question.type    time.spent library.name
1 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
2 2011-02-01 2011-02-01             4. Strategy-Based 10-20 minutes        Scott
3 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
4 2011-02-01 2011-02-01 3. Skill-Based: Non-Technical  5-10 minutes        Scott
5 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
6 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
But I’m going to just take a sample of all of this data, because this is just for illustrative purposes, not real analysis. Let’s grab 10,000 random entries from this data frame and put that into l.sample:
> l.sample <- l[sample(nrow(l), 10000),]
An easy thing to ask first is: How many questions are asked each month in each library?
Here’s how I did it before. I’ll run the command and show the resulting data frame. I used the plyr package, which is (was) great, and its ddply function, which applies a function to a data frame and gives a data frame back. Here I have it collapse the sample data frame along the two columns specified (month and library.name) and use nrow to count how many rows result. Then I check how long it would take to perform that operation on the entire data set.
> library(plyr)
> sample.allquestions.pm <- ddply(l.sample, .(month, library.name), nrow)
> head(sample.allquestions.pm)
       month      library.name  V1
1 2011-02-01          Bronfman  63
2 2011-02-01             Scott  60
3 2011-02-01 Scott Information 183
4 2011-02-01              SMIL  57
5 2011-02-01           Steacie  57
6 2011-03-01          Bronfman  46
> system.time(allquestions.pm <- ddply(l, .(month, library.name), nrow))
   user  system elapsed
  2.812   0.518   3.359
The system.time line there shows how long the previous command takes to run on the entire data frame: almost 3.5 seconds! That is slow. Do a few of those, chopping and slicing the data in various ways, and it will really add up.
This is a bad way of doing it. It works! But it’s slow and I wasn’t thinking about the problem the right way. Using nrow was wrong: I should have been using count (also from plyr), which I wrote up a while back, with some examples. That’s a much faster and more sensible way of counting up the number of rows in a data set.
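For comparison, the count-based version would look something like this (a sketch on a toy stand-in data frame, since plyr’s count(df, vars) returns the grouping columns plus a freq column):

```r
library(plyr)

# Toy stand-in for the real libstats data frame
demo <- data.frame(month        = c("2011-02-01", "2011-02-01", "2011-03-01"),
                   library.name = c("Scott", "Scott", "Steacie"))

# One row per (month, library.name) combination, with a freq count
questions.pm <- count(demo, vars = c("month", "library.name"))
questions.pm
```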
But now that I can use dplyr, I can approach the problem in a whole new way.
First, I’ll clear plyr out of the way, then load dplyr. Doing it this way means no function names collide.
> search()
 [1] ".GlobalEnv"        "package:plyr"      "package:lubridate" "package:ggplot2"
 [5] "ESSR"              "package:stats"     "package:graphics"  "package:grDevices"
 [9] "package:utils"     "package:datasets"  "package:methods"   "Autoloads"
[13] "package:base"
> detach("package:plyr")
> library(dplyr)
See how nicely you can construct and chain operations with dplyr:
> l.sample %.% group_by(month, library.name) %.% summarise(count=n())
Source: local data frame [277 x 3]
Groups: month

        month      library.name count
1  2011-02-01          Bronfman    63
2  2011-02-01              SMIL    57
3  2011-02-01             Scott    60
4  2011-02-01 Scott Information   183
5  2011-02-01           Steacie    57
6  2011-03-01          Bronfman    46
7  2011-03-01              SMIL    59
8  2011-03-01             Scott    71
9  2011-03-01 Scott Information   220
10 2011-03-01           Steacie    61
..        ...               ...   ...
The %.% operator lets you chain together different operations, and just for the sake of clarity of reading, I like to arrange things so first I specify the data frame on its own and then walk through the things I do to it. First, group_by breaks down the data frame by columns and does some magic. Then summarise collapses the different chunks of resulting data into one line each, and I use count=n() to make a new column, count, which contains the count of how many rows there were in each chunk, calculated with the n() function. In English I’m saying, “take the l.sample data frame, group it by month and library.name, and count how many rows are in each grouping.” (Also, notice I didn’t need to use the head command to stop it running off the screen; it made it nicely readable on its own.)
It’s easier to think about, it’s easier to read, it’s easier to play with … and it’s much faster. How long would this take to run on the entire data set?
> system.time(l %.% group_by(month, library.name) %.% summarise(count=n()))
   user  system elapsed
  0.032   0.000   0.033
0.03 seconds elapsed time! That is about 1% of the 3.36 seconds the old way took.
Graphing it is easy, using Hadley Wickham’s marvellous ggplot2 package.
> library(ggplot2)
> sample.allquestions.pm <- l.sample %.% group_by(month, library.name) %.% summarise(count=n())
> ggplot(sample.allquestions.pm, aes(x=month, y=count)) + geom_bar(stat="identity") + facet_grid(library.name ~ .) + labs(x="", y="", title="All questions")
> ggsave(filename="20140422-all-questions-1.png", width=8.33, dpi=72, units="in")
You can see the ebb and flow of the academic year: September, October and November are very busy, then things quiet down in December, then January, February and March busy, then it cools off in April and through the summer. (Students don’t ask a lot of questions close to and during exam time—they’re studying, and their assignments are finished.)
What about comparing year to year? Here’s a nice way of doing that.
First, pick out the numbers of the months and years. The format command knows all about how to handle dates and times. See the man page for strptime or your favourite language’s date manipulation commands for all the options possible. Here I use %m to find the month number and %Y to find the four-digit year. Two examples, then the commands:
> format(as.Date("2014-04-22"), "%m")
[1] "04"
> format(as.Date("2014-04-22"), "%Y")
[1] "2014"
> sample.allquestions.pm$mon <- format(as.Date(sample.allquestions.pm$month), "%m")
> sample.allquestions.pm$year <- format(as.Date(sample.allquestions.pm$month), "%Y")
> head(sample.allquestions.pm)
Source: local data frame [6 x 5]
Groups: month

       month      library.name count mon year
1 2011-02-01          Bronfman    63  02 2011
2 2011-02-01              SMIL    57  02 2011
3 2011-02-01             Scott    60  02 2011
4 2011-02-01 Scott Information   183  02 2011
5 2011-02-01           Steacie    57  02 2011
6 2011-03-01          Bronfman    46  03 2011
> ggplot(sample.allquestions.pm, aes(x=year, y=count)) + geom_bar(stat="identity") + facet_grid(library.name ~ mon) + labs(x="", y="", title="All questions")
> ggsave(filename="20140422-all-questions-2.png", width=8.33, dpi=72, units="in")
This plot changes the x-axis to the year, and facets along two variables, breaking the chart up vertically by library and horizontally by month. It’s easy now to see how months compare to each other across years.
With a little more work we can rotate the x-axis labels so they’re readable, and put month names along the top. The month function from lubridate makes this easy.
> sample.allquestions.pm$month.name <- month(sample.allquestions.pm$month, label = TRUE)
> head(sample.allquestions.pm)
Source: local data frame [6 x 6]
Groups: month

       month      library.name count mon year month.name
1 2011-02-01          Bronfman    63  02 2011        Feb
2 2011-02-01              SMIL    57  02 2011        Feb
3 2011-02-01             Scott    60  02 2011        Feb
4 2011-02-01 Scott Information   183  02 2011        Feb
5 2011-02-01           Steacie    57  02 2011        Feb
6 2011-03-01          Bronfman    46  03 2011        Mar
> ggplot(sample.allquestions.pm, aes(x=year, y=count)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="", title="All questions") + theme(axis.text.x = element_text(angle = 90))
> ggsave(filename="20140422-all-questions-3.png", width=8.33, dpi=72, units="in")
By August I will have published the current awareness newsletter Current Cites every month for twenty-four years — with all but the first of those years (1990-1991) freely available on the Internet. My children, now in college, aren’t even that old. In fact, my only absence from its publication was the period shortly after their birth. Time well spent, I have to say.
Although the publication was born at UC Berkeley, it outgrew its host and has long been hosted elsewhere and no longer has any contributors from Berkeley. From its first day it was written by volunteers — first by employees who volunteered to be a part of the entity that gave it birth, then by people who truly had no compensation for keeping it alive except the love of doing it. When I’m ready to pass it on, or should I die suddenly, I’m sure someone who loves it like I do will step up and keep it going. That’s what commitment is made of.
Meanwhile, I have witnessed almost every other Internet-born publication go down to dust — whether sponsored by an organization or not. The only Internet-based open access publication I can think of that equals or exceeds (I’m not arguing) our longevity is TidBITS, by Adam Engst. And guess what? We share something in common. Commitment. Adam has been just as committed or more to publishing TidBITS as I have been to Current Cites.
So here’s the thing: the only effective strategy for preserving things for the future is commitment. I don’t mean to suggest it must be the commitment of an individual — far from it. There are many examples of institutional commitment. But the mere involvement of an individual or an organization does not by itself signal sufficient commitment for long-term preservation. I have personally saved web sites from certain neglect or destruction by moving them from an institutional host to either another institutional host or my personal server.
Therefore, I’ve long thought that what we really need for digital preservation is a digital preservation marketplace. For example, let’s say my doctor has said that I have roughly six months to live. After picking myself up off the floor and drying my eyes, eventually I would get around to finding someone to carry the loves of my web life forward. I would need a commitment marketplace. Someplace where I could go to say “I have this. It consists of X. It requires Y to keep it going. You must love it, like you would love a rescue dog.” And individuals or organizations could apply to take it over.
Either it has value or it doesn’t, and the digital preservation marketplace would decide. But without true commitment there is no technology, no metadata standard, no prayer, that will save it. Believe me, I’ve lived — and am living — it. Just call it a commitment. Do not look to technology to save anything. Look only to your heart. It is the only thing that has ever saved anything worth saving or ever will.
Photo by Hector Alejandro, Creative Commons Attribution 2.0 Generic License.
Today, the American Library Association (ALA) named Mary Lynn Collins, a library trustee from Frankfort, Ky., the winner of the 2014 White House Conference on Library and Information Services (WHCLIST) Award. The award, which is given to a non-librarian participating in National Library Legislative Day, covers hotel fees in addition to a $300 stipend to reduce the cost of attending the event.
During this year’s National Library Legislative Day, to be held May 5–6, 2014, hundreds of librarians and library supporters from across the country will gather in the nation’s capital to meet with members of Congress to discuss key library issues. As a champion for libraries, Collins incorporates her first-hand knowledge of the Kentucky legislature into her advocacy strategies. Before Collins became a founding member and current president of the Friends of Kentucky Libraries, she served for nearly 30 years on the staff of the Kentucky legislature as a legislative analyst.
Collins has used her legislative experience to gain support for Kentucky libraries that have faced harmful lawsuits in the past few years. In the future, she plans to lead her library group in increasing advocacy efforts with congressional representatives.
“As a member of the Friends of Kentucky Libraries, I have seen advocacy at the state and local level become more important each year,” said Collins. “In the last three sessions of our state legislature we have seen legislation that was deemed detrimental to libraries, and through the advocacy of library professionals, trustees and friends, we have been able to defeat those efforts.”
The White House Conference on Library and Information Services—an effective force in library advocacy nationally, statewide and locally—turned its assets over to the ALA Washington Office after the last conference was held in 1991 in order to transmit the spirit of committed, passionate library support to a new generation of advocates. Leading up to National Library Legislative Day each year, the ALA seeks nominations for the award. Representatives of WHCLIST and the ALA Washington office choose the recipient.
The OKFestival team is launching our call for volunteers today, and we are excited to bring on board amazing members of our community who will help us to make this festival the huge success we are anticipating. Apply now!
Volunteers are integral to our ability to run OKFestival – without you, we wouldn’t have enough hands to get everything done over the days of the festival!
If you want to come to Berlin this July 15th-17th and help us to create the best Open festival there has ever been, please apply today at the link above, and then spread the word to ensure others know about the festival too!
There is no hard deadline on applying, but the sooner you apply the better your chance of being selected to come and make Open history with us at this year’s OKFestival. We can’t wait to see you there!
Linked data is a process for embedding the descriptive information of archives into the very fabric of the Web. By transforming archival description into linked data, an archivist will enable other people as well as computers to read and use their archival description, even if the others are not a part of the archival community. The process goes both ways. Linked data also empowers archivists to use and incorporate the information of other linked data providers into their local description. This enables archivists to make their descriptions more thorough, more complete, and more value-added. For example, archival collections could be automatically supplemented with geographic coordinates in order to make maps, images of people or additional biographic descriptions to make collections come alive, or bibliographies for further reading.
Publishing and using linked data does not represent a change in the definition of archival description, but it does represent an evolution of how archival description is accomplished. For example, linked data is not about generating a document such as an EAD file. Instead it is about asserting sets of statements about an archival thing, and then allowing those statements to be brought together in any number of ways for any number of purposes. A finding aid is one such purpose. Indexing is another purpose. Use by a digital humanist is another purpose. While EAD files are encoded as XML documents and therefore very computer readable, the reader must know the structure of EAD in order to make the most out of the data. EAD is archives-centric. The way data is manifested in linked data is domain-agnostic.
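To make the “sets of statements” idea concrete, here is a toy sketch in R, with invented identifiers and values, representing each statement as a subject-predicate-object row that different consumers could recombine for their own purposes:

```r
# Three statements about one hypothetical archival collection
triples <- data.frame(
  subject   = rep("collection/42", 3),
  predicate = c("title", "creator", "latitude"),
  object    = c("Smith Family Papers", "Jane Smith", "43.77"),
  stringsAsFactors = FALSE
)

# A map-making consumer only needs the geographic statements...
subset(triples, predicate == "latitude")

# ...while a finding aid would pull together a different slice
subset(triples, predicate %in% c("title", "creator"))
```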
The objectives of archives include collection, organization, preservation, description, and oftentimes access to unique materials. Linked data is about description and access. By taking advantage of linked data principles, archives will be able to improve their descriptions and increase access. This will require a shift in the way things get done but not in what gets done. The goal remains the same.
Many tools already exist for transforming data in existing formats into linked data. This data can reside in Excel spreadsheets, database applications, MARC records, or EAD files. There are tiers of linked data publishing, so one does not have to do everything all at once. But to transform existing information or to maintain information over the long haul requires the skills of many people: archivists & content specialists, administrators & managers, metadata specialists & catalogers, computer programmers & systems administrators.
Moving forward with linked data is a lot like traveling to Rome. There are many ways to get there, and there are many things to do once you arrive, but the result will undoubtedly improve your ability to participate in the discussion of the human condition on a worldwide scale.
The article below was written by library advocate Anthony Chow, Ph.D., who is an assistant professor in the Department of Library and Information Studies at the University of North Carolina at Greensboro and the co-chair of the North Carolina Library Association’s Legislative and Advocacy Committee.
Knowledge is power. I have always believed that. As a professional educator and father of three, the gift of literacy is a gift for the future. My wife and I read to all three of our kids every day for years until one day our youngest, Emma, said she did not want to be read to anymore. She wanted and could do it now on her own. Emma and her brother and sister were empowered with the gift of reading—a door to endless possibilities, a pathway towards knowledge about whatever they wanted and needed. This is a wonderful feeling for any parent or educator. This is freedom and independence personified.
Both school and public libraries have played a pivotal role in helping build the joy and love of reading in our children. For this, my wife and I will be forever grateful. I am a library advocate and wish the same feeling of joy and empowerment for all Americans. I want to give back what they have given to me.
I am also a professor at The University of North Carolina at Greensboro’s Library and Information Studies Department. My job is to prepare future librarians and a significant part of my teaching philosophy is to lead by example and be extremely active in service as part of my own pathway of life-long learning. This is how I became involved in North Carolina’s library advocacy efforts five years ago.
My passion for libraries and library advocacy derives from my personal and professional conviction that they are indeed an essential part of the American story—past, present, and future. As a member of the North Carolina delegation attending National Library Legislative Day (NLLD) for the past four years, I had the honor and privilege of meeting with our state’s legislators to tell them my story and let them know unequivocally that libraries are a fundamental part of our life and the lives of many North Carolinians and Americans across the country.
As a grizzled veteran learning under accomplished mentors Carol Walters, retired director of the Sandhills Regional Library System, and Brandy Hamilton, Regional Library Manager of the East Regional Library in Wake County, I was asked to help lead our 2014 delegation.
As we planned for this year’s NLLD we had two primary goals: 1) Allow our youth a voice to speak directly to legislators about how important libraries are to them personally, and 2) Find unique ways to make a splash and have people pay attention to us and our message of strong libraries for everyone.
The North Carolina Library Association (NCLA) created the NCLA student ambassador program and this year we are bringing 20 K-12 students to personally meet with their legislators and tell them first-hand how important libraries are to them. The creativity, energy, and diversity of their winning entries were refreshing and breathtaking in their depth and breadth. The youth are our future and their support of libraries could not be more authentically stated.
Coinciding with our focus on youth was the emergence of Pharrell William’s academy award nominated song “Happy” and the emergence of “Happy Dances” across the world on YouTube. We decided that doing a “Happy Dance” as part of our advocacy efforts made perfect sense and dancing was the perfect, positive, and fun way of expressing our support for libraries.
Like any large event, an idea must be supported by passionate, talented, and brave people willing to dedicate their time and expertise and put their pride on the line by doing something different. One of our new faculty members, Dr. Rebecca Morris, was a majorette at Pennsylvania State University, and it was her brilliant idea to choose the song “Happy” for our flash mob dance and her willingness to take the lead on the choreography and instructional video that allowed the idea to become a reality. Our idea was quickly supported by North Carolina’s State Librarian Cal Shepard, and our movement was born and off and running.
Our initial NCLA video also prompted several North Carolina schools, including West Wilkes Middle School and Smithfield-Selma High School, to film their own videos, as did the Charlotte-Mecklenburg Public Library System.
In reserving the location for the flash mob Happy Dance, I was told by the U.S. Capitol Police that just dancing was not a clear enough expression of our First Amendment rights. So, in collaboration with the American Library Association (ALA), our dance turned into a full blown rally, which will take place from 2:30-4:00 p.m. right in front of the U.S. Capitol on Site 10 (across from the Library of Congress at the intersection of Independence and First Street). The flash mob will start promptly at 3:00 p.m. led by Dr. Morris, myself, and the majority of the North Carolina delegation including our State Librarian and many of our Student Ambassadors.
This is not really my story but our story and our future. Library advocacy is one clear-cut way for me to give back in some small way what they have given to me, my family, North Carolina, and our nation. Knowledge is power. Literacy is a gift that keeps on giving. Libraries also do so much more for other people—youth programming, access to technology, work force development, a place for the community to meet, and books—lots of books—in all different formats.
When I asked our State Librarian what was the overarching message she wanted to convey this year, she told me in no uncertain terms that this year is a celebration as North Carolina libraries are booming and we need the continuing support of our legislators to help us keep growing and providing vital services to our communities. Libraries make me so happy that I will dance for libraries. Do libraries make you happy? I sincerely hope you will join us.
Also, at the recent EverCloud workshop Mahadev Satyanarayanan, my colleague from the long-gone days of the Andrew Project, gave an impressive live demo of CMU's Olive emulation technology. The most impressive part was that the emulations started almost instantly, despite the fact that they were demand-paging over the hotel's not super-fast Internet.
Think of two trends in the development of the library's network presence. These have emerged successively and continue to operate together.
The decentered library network presence is an important component of library service although it still appears to be an emergent interest in strategic or organizational terms.
In the centripetal trend, the focus was on integration around a singular, 'centered' network presence: the library website. The library website was the principal de facto network manifestation of the library, and the integration of library services in the website was a goal. Early examples of this were discussions around 'portals', one-stop shops, and metasearch. Later, this trend continued with more consolidated approaches which overcame some of the costs and inefficiencies of integration. Included here were the use of unified search systems often deploying integrated discovery layer products, the use of resource guides to manage resources with a consistent approach, and the adoption of content management systems. Service consolidation, a stronger focus on providing a coherent user experience, a move to cloudsourced applications (discovery and resource guides, for example), and an emerging emphasis on full library discovery help create a more unified experience at the library website.
In the centrifugal trend, the library network presence is decentered, unbundled, or decoupled into an evolving ecosystem of services, each with a particular focus or scope. Think, for example, of how aspects of user engagement have been unbundled to various social networking sites (Facebook, Twitter, Pinterest, Flickr, ...), of how parts of the discovery experience have been unbundled to Google Scholar or PubMed or to a cloud-based discovery layer, or of how some library services are atomized and delivered as mobile apps, toolbar applications, or 'widgets' in learning management systems and other web environments external to the library's own. The library website is now a part, albeit an important part, of this evolving network presence. In this way, the library network presence has been decentered, subject to a centrifugal trend toward multiple network locations potentially closer to user workflows. There are two important drivers here. One is the desire to reach into user workflows, acknowledging that potential library users may not always come to the library website. A second is the desire to make institutional resources (digitized special collections, research and learning materials, for example) available to external audiences in more effective ways. This is an aspect of the 'inside-out library'.
Here are some strands of the 'decentered' library network presence.
Now, despite the fact that there is quite a bit of activity supporting what I am calling here the decentered network presence, it has not crystallized as a service or organizational category for the library. It is an area of emergent interest. There seem to be at least three factors at play.
Clearly, there are different dynamics at play in the components of the decentered network presence of the library. However, we can expect a more holistic view to emerge in coming years.
Another note to my future self about upgrading Ubuntu.
Ubuntu 14.04 was released yesterday. I have two laptops that run it. I did the unimportant one first, and everything went fine. Then I did the important one, the one where I do all my work, and after restarting it came up with a boot error:
error: symbol 'grub_term_highlight_color' not found
I had two reactions. First, boot errors are solvable. The boot stuff is on one part of my hard drive, and my real stuff is on another part, and it’s fine where it is, I just need to fix the boot stuff. Besides, I have backups. So with a bit of fiddling, I’ll be able to fix it. Second, cripes, what the hell? I’ve been using this laptop for six months or a year or more since a major upgrade, and now it’s telling me there’s some problem with how it boots up? That is a load of shite.
Searching turned up evidence other people had the same problem, and they were being blamed for having an improper boot sector or some such business. For a few minutes I felt like non-geeks feel when presented with problems like this: despair … annoyance … frustration … the first pangs of hate.
But such is life. When upgrading a system we must be prepared for possible problems. We cannot expect it to always go smoothly. Even in the face of such technical problems we must try to remain tranquil.
It’s solvable, I remembered. So I downloaded a Boot-Repair Disk image—this is a very useful tool, and it works even though it’s a year old—and put it on a USB drive with Startup Disk Creator, then booted up, ran sudo boot-repair, used all the default answers, let it do its work, and everything was all right. Phew.
Aside from that, everything about the upgrade went perfectly fine. This time I did it at the command line with sudo do-release-upgrade. It took a while to download all the upgraded packages, but the actual update went quickly and smoothly. My thanks to everyone involved with Debian, Ubuntu, GNU/Linux, and everything else.
(However, I’m glad I had another machine available where I could do the download and set up the boot disk. Without it, I would have been in trouble. I don’t know if a similar problem might have arisen when Windows or MacOS users do an upgrade.)
It seems to me I have done this once (or twice) before, but I feel like it is time to continue blogging on Loomware. My Loomware blog started in February of 2004, so I guess I can call this the 10th anniversary and just get on with it!
One of the drivers for me was the incredible interest in the Islandora digital asset management system, which had its genesis in 2007, just after I joined UPEI. In the last 7 years Islandora has seen adoption in countries all over the world, and for a wide range of functions. I will start the posts next week with a series on the coming version of Islandora, 7.x-1.3, which is our way of saying the 3rd release of Islandora for Drupal 7 and Fedora 3. This new series will describe all the awesome goodness in the upcoming release, solution pack by solution pack, module by module, and include some shoutouts to friends and colleagues who are giving their time and expertise to build a great open source ecosystem!
The ongoing digital revolution continues to create new opportunities for education, entrepreneurship, job skills training and more. Those of us with home broadband, smartphones or both can easily take advantage of these opportunities. However, for millions of Americans currently living without personal access to high-capacity internet or who lack digital literacy skills, libraries serve as the on-ramp to the digital world. With a growing number of people turning to libraries to avail themselves of broadband-enabled technologies, library networks are being strained more than ever before. Yesterday, the Institute for Library and Museum Services (IMLS) held a public hearing to discuss the importance of high-speed connectivity in libraries and outline strategies for helping libraries expand bandwidth to accommodate growing network use.
Federal Communications Commission (FCC) Chairman Thomas Wheeler’s opening remarks set the tone for the day: “Andrew Carnegie built 2,500 libraries in a public-private partnership, defining information access for millions of people for more than a century,” he said. “We stand on the precipice of being able to have the same kind of seminal impact on the flow of information and ideas in the 21st century…That’s why reform of the E-rate program is so essential. The library has always been the on-ramp to the world of information and ideas, and now that on-ramp is at gigabit speeds.”
The hearing convened three expert panels, each of which discussed a different dimension of library connectivity. The first panel propounded strategies for helping libraries procure the resources they need to build network capacity. Chris Jowaisas of the Gates Foundation urged libraries to underscore the ways in which their activities advance the goals of top giving foundations. “[Libraries should]…package their services to meet foundation needs,” Jowaisas said. “With a robust and reliable broadband connection, libraries and communities can move into more areas of exploration and innovation. The foundation hopes the network of supporters of this vision grows because we have seen and learned first-hand from investments in public libraries that they are key organizations for growing opportunity.”
Following his remarks, Clarence Anthony of the National League of Cities stressed the need for the library community to ramp up its efforts to make government leaders aware of the extent to which urban communities rely on libraries for broadband access.
The second panel analyzed current library connectivity data and identified areas where the data falls short in assessing broadband capacity. Larra Clark of ALA’s Office for Information Technology Policy drew on 20 years of research to illustrate that the progress libraries have made in expanding bandwidth—while meaningful—has generally not proven sufficient to accommodate the growing needs of users. About 9 percent of public libraries reported speeds of 100 Mbps or greater in the 2012 Public Library Funding & Technology Access Study, and the forthcoming Digital Inclusion Survey shows this number has only climbed to 12 percent. More than 66 percent of public libraries report they would like to increase their broadband connectivity speeds. “Libraries aren’t standing still, but too many are falling behind,” Clark said.
Researcher John Horrigan also gave the audience a preview of forthcoming research looking at Americans’ levels of digital readiness, which finds significant variations in digital skills even among people who are highly connected to digital tools. Of the 80 percent of Americans with home broadband or a smartphone, nearly one-fifth (or 34 million adults) has a low level of digital skills. “(Libraries) are the vanguard in the forces we bring to bear to bolster digital readiness,” Horrigan noted. “Libraries will have more demands placed upon them, which makes the case for ensuring they have the resources to meet these demands compelling.”
The final panel built on the capacity-building strategies offered by Jowaisas and Anthony by providing real-world examples of successful efforts to expand library bandwidth. Gary Wasdin of the Omaha Public Library System discussed ways in which his libraries are leveraging federal dollars to engage private funders in efforts to build broadband capacity, and Eric Frederick of Connect Michigan described how public-private synergies are improving library connectivity in his state. The final panelist was Linda Lord, Maine state librarian and chair of ALA’s E-rate Task Force. Lord discussed ALA’s efforts to inform the FCC’s ongoing E-rate modernization proceeding. “ALA envisions that all libraries will be at a gig (1 Gbps) by 2018,” Lord said. The E-rate program provides schools and libraries with telecommunications services at discounted rates. Lord went on to clearly articulate ALA’s commitment to updating the program to help libraries address 21st-century challenges.
The post Library broadband takes center stage at IMLS hearing appeared first on District Dispatch.
March 28, 2014
I was recently selected by the Code4Lib community to receive a diversity scholarship to attend the Code4Lib conference in Raleigh, North Carolina. The Code4Lib conference was the perfect place to make new connections with people who aim to make information more accessible through technology. As someone who works closely with technology and usability, I was interested in the new strides taking place in this area. At this conference, I made new contacts for future collaboration and attended talks ranging from Linked Open Data to Google Analytics.
Diversity Scholarship Trip Report
Coming to my first Code4Lib was significant because when I first began connecting with the group and its resources, I was a freshly-minted graduate in the middle of a career change. By the time I landed in Raleigh, three months into a new job, I was an information professional--more or less.
After graduating last May from library school, I admit to using the Code4Lib website obsessively during my quest for employment; I quickly found the site, wiki, listserv and journal invaluable. There was a level of energy and involvement by users that made it stand out from other, more conventional professional organizations. Plus, the job postings often described exactly the kinds of emerging, interdisciplinary positions I was most interested in. Code4Lib was a network I wanted to be a part of. Miraculously, my search worked out: I was offered a position, though I had not yet started when I finally applied for the diversity scholarship.
As a recipient of a Diversity Scholarship for the 9th annual Code4Lib conference in Raleigh, North Carolina, I had an enlightening and incredible experience. I learned a great deal of information that revolved around library system usability, emerging coding frameworks, and applying social justice to user-centered design. Throughout the conference, I asked myself how I could use these concepts and coding techniques in my daily work at my institution. As a “one-man shop” I have limited support for implementing many of these technologies. However, as I have networked with the diverse members of the code4lib community, I know that it will be a bit easier to experiment with these techniques.
My time at the conference revealed that many libraries are passionately striving to make end-user systems usable, accessible, and transparent. There were numerous presentations that revolved around these ideas, such as using APIs to create data visualizations for displaying library statistics, real-time interactive discovery systems and interfaces, moving away from “list” type listings of holdings to network-node maps, web accessibility for differently abled patrons, and much more. The numerous lightning talks also provided a great wealth of information (all within 5 minutes!)
Jennifer Maiko Kishi
Code4Lib 2014 Conference Report
1 April 2014
As a new professional in the field, lone digital archivist, and a first timer to the Code4Lib Conference, my experience was incredibly inspiring and enriching. I value Code4Lib’s collective mission of teaching and learning through community, collaboration, and a free exchange of ideas. The conference was unique and unlike any other library or archives conference I have attended. I appreciate the thoughtfulness of planning events to specifically welcome new attendees. The newcomer dinner was not only a great way to meet fellow newbies (and oldtimers) on the evening before the conference, but also provided familiar faces to say hello to the following day. Moreover, Code4Lib resolved my session selecting anxieties, where I always feel like I’ve missed out on yet another important session. The conference is set up so that all attendees will have equal opportunities to view the sessions together in a continuous fashion, in addition to live streams made available to those unable to attend. The conference was jam packed with back to back presentations, lightning talks, and breakout sessions. There was a good balance of interesting topics by insightful speakers, mixed in with scheduled breaks with copious coffee and tea to stay alert and focused throughout the day.
Code4Lib 2014: Conference Review
J. (Jenny) Gubernick
I was fortunate to receive a diversity scholarship to help defray the costs of attending Code4Lib 2014 in Raleigh, NC. Although I am still processing the somewhat overwhelming amount of information I absorbed, I suspect that I will look back at this past week as a transformative experience. I pivoted from thinking of myself as “not a real programmer,” “lucky to have any job,” and “maybe someday I can do something cool,” to thinking of myself as being in a position of great empowerment to learn and do, and being ready to apply my skills to more complex work. I look forward to continuing to be part of this community in months and years to come.
Code4Lib trip report
31 March 2014
As a diversity scholarship recipient, I was afforded the opportunity to attend the 2014 Code4Lib conference in Raleigh, NC. The conference consisted of two and a half days of presentations and one day of preconference workshops. Looking back on the experience, I am impressed by the content of the presentations, the openness of the community, and the overall sense of curiosity and exploration. I learned a great deal and am looking forward to applying the inspiration and motivation that I took away from the conference in my daily work.
Prior to the start of the conference itself, I attended the “Archival Discovery and Use” pre-conference session. True to its name, Code4Lib has historically been more library-focused, but this session covered topics like the modern relevance of archival finding aids, archival crowdsourcing, and presentation methods for digitized materials. Because librarians and archivists have so many intertwined concerns, I was glad to see the archival community represented.
Coral Sheldon Hess
I had an enjoyable and educational time at Code4Lib 2014. It was my first time attending any Code4Lib event, and I am grateful to have had the opportunity to be there, thanks to the Diversity Scholarship sponsored by the Council on Library and Information Resources/Digital Library Federation, EBSCO, ProQuest, and Sumana Harihareswara. Thank you to the sponsors, the scholarship and organizing committees, and everyone else involved with the conference for this amazing learning experience!
Things that went well
A Newbie, Troublesome Cataloger at Code4Lib
In March 2014, I attended my first (and definitely not only) Code4Lib National Conference. I had been following the Code4Lib group via their website, journal, wiki and local NYC chapter for some time; but being a metadata/cataloging person, I was hesitant to jump into a meeting of programmers, coders, systems librarians, and others. I am immensely glad that I did not let this hesitation hold me back this year, as the 2014 Code4Lib Conference was the best and most inviting conference that I have ever attended.
After all the conferences and the craziness at work, LibTechConf seems like ages ago and though it’s been a little while, I wanted to write the usual reflection that I do. I wish I had done it sooner now, but I’m finally getting to it. Great Keynotes I normally prefer getting keynote speakers from outside […]
If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice.
Linked data in archival practice is not new. Others have been here previously. You can benefit from their experience and begin publishing linked data right now using tools with which you are probably already familiar. For example, you probably have EAD files, sets of MARC records, or metadata saved in database applications. Using existing tools, you can transform this content into RDF and put the result on the Web, thus publishing your information as linked data.
If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not for lack of technology but rather because of the inherent purpose and structure of EAD files.
A few years ago an organisation in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. The project was called LOCAH. One of the outcomes of this effort was the creation of an XSL stylesheet (ead2rdf) transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration and customization the stylesheet can transform a generic EAD file into valid RDF/XML for use by anybody. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way toward publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.
For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow:
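The content negotiation step is the only genuinely new moving part in that workflow. Here is a minimal sketch in Python of the decision the server has to make; the file names and the media types checked are illustrative assumptions, not a prescription for any particular server setup.

```python
# Minimal sketch of content negotiation between the HTML and RDF/XML
# renderings of a finding aid. File names and media types are illustrative.

def negotiate(accept_header: str) -> str:
    """Return which representation to serve for a given Accept header.

    Real servers parse quality values (q=) and many more media types;
    this sketch only looks for an RDF/XML preference.
    """
    accept = (accept_header or "").lower()
    if "application/rdf+xml" in accept or "application/xml" in accept:
        return "finding-aid.rdf"   # the serialized RDF (e.g. from ead2rdf)
    return "finding-aid.html"      # the human-readable rendering

print(negotiate("application/rdf+xml"))              # finding-aid.rdf
print(negotiate("text/html,application/xhtml+xml"))  # finding-aid.html
```

In practice the same logic is usually expressed as rewrite rules in the Web server configuration rather than in application code, but the decision being made is the same.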
EAD is a combination of narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.
The common practice of using literals to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named items will not exist in standardized authority lists.
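To make the payoff concrete, here is a small Python sketch of what one supplemented component could yield as N-Triples. The component and subject URIs below are placeholders (the subject URI is merely shaped like an id.loc.gov identifier); only the Dublin Core property URIs are real terms.

```python
# Sketch: what inserting URIs into an EAD component buys you. The component
# and subject URIs below are placeholders; dcterms:title and dcterms:subject
# are real Dublin Core terms.

DCTERMS = "http://purl.org/dc/terms/"

def triples_for_component(component_uri, title, subject_uri):
    """Return N-Triples statements describing one inventory component."""
    return [
        f'<{component_uri}> <{DCTERMS}title> "{title}" .',
        f'<{component_uri}> <{DCTERMS}subject> <{subject_uri}> .',
    ]

for t in triples_for_component(
        "http://archive.example.org/collection-1/item-42",   # placeholder
        "Letter to Samuel Clemens",
        "http://id.loc.gov/authorities/subjects/sh00000000"  # placeholder
):
    print(t)
```

Note the distinction the sketch makes visible: the title remains a literal, while the subject becomes a URI that other linked data can point at — exactly the literal-versus-URI trade-off discussed above.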
Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way toward publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just information that is not as linkable as possible. This process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most complete linked data possible will probably not want to use EAD files as the root of its publishing system. Instead some sort of database application is probably the best solution.
In some ways MARC lends itself very well to being published via linked data, but in the long run it is not really a feasible data structure.
Converting MARC into serialized RDF through XSLT is at least a two-step process. The first step is to convert MARC into MARCXML and then MARCXML into MODS. This can be done with any number of scripting languages and toolboxes. The second step is to use a stylesheet such as the one created by Stefano Mazzocchi (mods2rdf.xsl) to transform the MODS into RDF/XML. From there a person could save the resulting XML files on a Web server, enhance access via content negotiation, and call it linked data.
Unfortunately, this particular approach has a number of drawbacks. First and foremost, the MARC format has no place to denote URIs; MARC records are made up almost entirely of literals. Sure, URIs can be constructed from various control numbers, but things like authors, titles, subject headings, and added entries will most certainly be literals (“Mark Twain”, “Adventures of Huckleberry Finn”, “Bildungsroman”, or “Samuel Clemens”), not URIs. This issue can be overcome if the MARCXML were first converted into MODS and URIs were inserted into id or xlink attributes of bibliographic elements, but this is extra work. If an archive were to take this approach, then it would also behoove it to use MODS as its data structure of choice, not MARC. Continually converting from MARC to MARCXML to MODS would be expensive in terms of time. Moreover, with each new conversion the URIs from previous iterations would need to be re-created.
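The control-number case is the one bright spot, since those URIs can be built mechanically. A hedged sketch, assuming the Library of Congress lccn.loc.gov permalink pattern and using a deliberately simplified normalization (the full LCCN normalization rules handle more cases than just stripping blanks):

```python
# Sketch: building a URI from a MARC control number (field 010, the LCCN).
# Dropping blanks is a simplification of the full LCCN normalization rules;
# the example LCCN value itself is illustrative.

def lccn_uri(field_010: str) -> str:
    """Turn a raw MARC 010 value like '  85010350 ' into a permalink URI."""
    lccn = field_010.replace(" ", "")
    return f"http://lccn.loc.gov/{lccn}"

print(lccn_uri("  85010350 "))  # http://lccn.loc.gov/85010350
```

Nothing comparable exists for headings and added entries, which is why those fields stay literals without the extra MODS work described above.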
Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) goes a long way to implementing a named authority database that could be linked from archival descriptions. These XML files could easily be transformed into serialized RDF and therefore linked data. The resulting URIs could then be incorporated into archival descriptions, making the descriptions richer and more complete. For example, the FindAndConnect site in Australia uses EAC-CPF under the hood to disseminate information about people in its collection. Similarly, “SNAC aims to not only make the [EAC-CPF] records more easily discovered and accessed but also, and at the same time, build an unprecedented resource that provides access to the socio-historical contexts (which includes people, families, and corporate bodies) in which the records were created”. More than a thousand EAC-CPF records are available from the RAMP project.
If you have archival descriptions in either the METS or MODS format, then transforming them into RDF is as close as your XSLT processor and a content negotiation implementation. As of this writing there do not seem to be any METS-to-RDF stylesheets, but there are a couple of stylesheets for MODS. The biggest issue with these sorts of implementations is the URIs. It will be necessary for archivists to include URIs in as many MODS id or xlink attributes as possible. The same thing holds true for METS files, except that the id attribute is not designed to hold pointers to external sites.
Some archives and libraries use a content management system called ContentDM. Whether they know it or not, ContentDM comes complete with an OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) interface. This means you can send a REST-ful URL to ContentDM, and you will get back an XML stream of metadata describing digital objects. Some of the digital objects in ContentDM (or any other OAI-PMH service provider) may be worth exposing as linked data, and this can easily be done with a system called oai2lod. It is a particular implementation of D2RQ, described below, and works quite well. Download the application, feed oai2lod the “home page” of the OAI-PMH service provider, and oai2lod will publish the OAI-PMH metadata as linked open data. This is another quick & dirty way to get started with linked data.
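The OAI-PMH interaction above amounts to building a URL with a couple of query parameters and parsing the XML that comes back. A hedged sketch follows: the base URL is a placeholder, and the response is a canned, heavily truncated example rather than a live request.

```python
# Build an OAI-PMH ListRecords request and parse a (canned) response.
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

base_url = "http://example.org/oai"  # placeholder OAI-PMH provider
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
request_url = base_url + "?" + urlencode(params)
print(request_url)  # http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc

# A truncated OAI-PMH response; a real one would come from
# urllib.request.urlopen(request_url).
response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Adventures of Huckleberry Finn</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

tree = ET.fromstring(response)
titles = [t.text for t in tree.iter("{http://purl.org/dc/elements/1.1/}title")]
print(titles)  # ['Adventures of Huckleberry Finn']
```

Tools like oai2lod do essentially this harvesting for you, then re-expose the harvested metadata as linked open data.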
Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.
Databases — specifically, relational databases — are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined together with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables. The downside of relational databases as a data model is the infinite variety of field/table combinations, which makes them difficult to share across the Web.
Not coincidentally, relational database technology is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be “Adventures of Huckleberry Finn” or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, publishing linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be optimal, but it is more expensive.
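The same book, recast as triples, shows the analogy directly: database keys become URIs, column names become predicates, and objects are either literals or further URIs. All URIs below are illustrative placeholders, and the by-hand query stands in for what SPARQL would do against a real triple store.

```python
# The book example as RDF-style (subject, predicate, object) triples.
book = "http://example.org/book/1"
twain = "http://example.org/person/twain"

triples = [
    (book, "http://purl.org/dc/terms/title", "Adventures of Huckleberry Finn"),
    (book, "http://purl.org/dc/terms/extent", "366 p."),
    (book, "http://purl.org/dc/terms/creator", twain),   # a URI, not a literal
    (twain, "http://xmlns.com/foaf/0.1/name", "Mark Twain"),
]

# A SPARQL-like query done by hand: which subjects have Twain as creator?
written = [s for (s, p, o) in triples
           if p == "http://purl.org/dc/terms/creator" and o == twain]
print(written)  # ['http://example.org/book/1']
```

Notice that the creator triple points at a URI rather than the literal string "Mark Twain"; that pointer is what makes the data linkable.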
Thankfully, many archivists probably use some sort of behind-the-scenes database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult, as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.
D2RQ is a very powerful software system. It is supported, well-documented, executable on just about any computing platform, open source, focused, functional, and at the same time does not try to be all things to all people. Using D2RQ it is more than possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple: download the application, run its generate-mapping tool against the database to create a mapping file, customize the mapping as desired, and start the D2R Server to expose the content as browsable RDF and through a SPARQL endpoint.
The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies & vocabularies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.
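Once D2R Server is running, the published data can be queried over HTTP like any SPARQL endpoint. A sketch of building such a request follows; the endpoint URL assumes D2RQ's usual default of port 2020, and the catch-all query is only a starting point, since the field names in a real installation come from D2RQ's auto-generated vocabulary and will differ per database. Only the request construction is shown, not a live call.

```python
# Build a SPARQL request against a (locally running) D2R Server endpoint.
from urllib.parse import urlencode
from urllib.request import Request

endpoint = "http://localhost:2020/sparql"   # assumed D2RQ default
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

request = Request(
    endpoint + "?" + urlencode({"query": query}),
    headers={"Accept": "application/sparql-results+json"},
)
print(request.full_url)
# urllib.request.urlopen(request) would return JSON-formatted result bindings.
```

This is also a quick way to sanity-check what ontology terms D2RQ has invented for your tables before deciding whether to remap them to standard vocabularies.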
If you are going to be in Rome for only a few days, you will want to see the major sights, and you will want to venture out & about a bit, but at the same time it will be wise to follow the lead of somebody who has been there previously. Take the advice of these people. It is an efficient way to see some of the sights.
Henry Newman has an excellent post entitled SSD vs. HDD Pricing: Seven Myths That Need Correcting. His seven myths are:
Developers from the New York Times have released some open source software meant for displaying and managing large digital content collections, and doing so client-side, in the browser with JS.
Developed for journalism, this has some obvious potential relevance to the business of libraries too, right? Large collections (increasingly digital), that’s what we’re all about, ain’t it?
Today we’re open-sourcing two internal projects from The Times:
- PourOver.js, a library for fast filtering, sorting, updating and viewing large (100k+ item) categorical datasets in the browser, and
- Tamper, a companion protocol for compressing categorical data on the server and decompressing in your browser. We’ve achieved a 3–5x compression advantage over gzipped JSON in several real-world applications.
…Collections are important to developers, especially news developers. We are handed hundreds of user submitted snapshots, thousands of archive items, or millions of medical records. Filtering, faceting, paging, and sorting through these sets are the shortest paths to interactivity, direct routes to experiences which would have been time-consuming, dull or impossible with paper, shelves, indices, and appendices….
…The genesis of PourOver is found in the 2012 London Olympics. Editors wanted a fast, online way to manage the half a million photos we would be collecting from staff photographers, freelancers, and wire services. Editing just hundreds of photos can be difficult with the mostly-unimproved, offline solutions standard in most newsrooms. Editing hundreds of thousands of photos in real-time is almost impossible.
Yep, those sorts of tasks sound like things libraries are involved in, or would like to be involved in, right?
The actual JS does some neat things with figuring out how to incrementally and just-in-time send deltas of data, etc., and some good UI tools. Look at the page for more.
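The core trick behind fast categorical filtering of this kind can be sketched in a few lines: precompute one bitmask per category value, then answer any filter combination with bitwise AND/OR instead of rescanning the items. This is a loose illustration of the general technique, not PourOver's actual API, and the photo records are invented.

```python
# Bitmask-based categorical filtering, in the spirit of PourOver.
items = [
    {"id": 0, "photographer": "staff",     "event": "swimming"},
    {"id": 1, "photographer": "freelance", "event": "swimming"},
    {"id": 2, "photographer": "staff",     "event": "cycling"},
]

def build_masks(items, field):
    """One integer bitmask per distinct value: bit i is set if item i matches."""
    masks = {}
    for i, item in enumerate(items):
        masks[item[field]] = masks.get(item[field], 0) | (1 << i)
    return masks

by_photog = build_masks(items, "photographer")
by_event = build_masks(items, "event")

# "staff AND swimming", answered with a single bitwise operation:
hits = by_photog["staff"] & by_event["swimming"]
matching = [items[i]["id"] for i in range(len(items)) if hits >> i & 1]
print(matching)  # [0]
```

Because the masks are built once, each new filter combination costs a couple of machine-word operations per 64 items, which is what makes interactive faceting over 100k+ records feasible in a browser.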
I am increasingly interested in what ‘digital journalism’ is up to these days. They are an enterprise with some similarities to libraries, in that they are an information-focused business which is having to deal with a lot of internet-era ‘disruption’. Journalistic enterprises are generally for-profit (unlike most of the libraries we work in), but still with a certain public service ethos. And some of the technical problems they deal with overlap heavily with our area of focus.
It may be that the grass is always greener, but I think the journalism industry is rising to the challenges somewhat better than ours is, or at any rate is putting more resources into technical innovation. When was the last time something that probably took as many developer-hours as this stuff, and is of potential interest outside the specific industry, came out of libraries?
Presentation of PeerLibrary at iAnnotate 2014 conference in San Francisco. Including the demo of current version in development, 0.2.
I have seen several different approaches to division of labor in developing, deploying, and maintaining web apps.
The one that seems to work best to me is when the same team responsible for developing an app is the team responsible for deploying it and keeping it up, as well as for maintaining it. The same team — and ideally the same individual people (at least at first; job roles and employment changes over time, of course).
If the people responsible for writing the app in the first place are also responsible for deploying it with good uptime stats, then they have an incentive to create software that can be easily deployed and can stay up reliably. If it isn’t at first, then the people who receive the pain of this are the same people best placed to improve the software to deploy better, because they are most familiar with its structure and how it might be altered.
Software is always a living organism; it’s never simply “done”. It is going to need modifications in response to what you learn from how its users use it, as well as to changing contexts and environments. Software is always under development; the first time it becomes public is just one marker in its development lifecycle, not a clear boundary between “development” and “deployment”.
Compare this to other divisions of labor, where maybe one team does “R&D” on a nice prototype, then hands their code over to another team to turn it into a production service, or to figure out how to get it deployed and keep it deployed reliably and respond to trouble tickets. Sometimes these teams may be in entirely different parts of the organization. If it doesn’t deploy as easily or reliably as the ‘operations’ people would like, do they need to convince the ‘development’ people that this is legit and something should be done? And when it needs additional enhancements or functional changes, maybe it’s the crack team of R&Ders who do it, even though they’re on to newer and shinier things; or maybe it’s the operations people expected to do it, even though they’re not familiar with the code since they didn’t write it; or maybe there’s nobody to do it at all, because the organization is operating on the mistaken assumption that developing software is like constructing a building: when it’s done, it’s done.
I just don’t find that this model works well for creating robust, reliable software which can evolve to meet changing requirements.
Recently I ran into a quote from an interview with Werner Vogels, Chief Technology Officer at Amazon, expressing these benefits of “You build it, you run it.”:
There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
I was originally directed to that quote by this blog post on the need for shared dev and ops responsibility, which I recommend too.
In this world of silos, development threw releases at the ops or release team to run in production.
The ops team makes sure everything works, everything’s monitored, everything’s continuing to run smoothly.
When something breaks at night, the ops engineer can hope that enough documentation is in place for them to figure out the dial and knobs in the application to isolate and fix the problem. If it isn’t, tough luck.
Putting developers in charge of not just building an app, but also running it in production, benefits everyone in the company, and it benefits the developer too.
It fosters thinking about the environment your code runs in and how you can make sure that when something breaks, the right dials and knobs, metrics and logs, are in place so that you yourself can investigate an issue late at night.
As Werner Vogels put it on how Amazon works: “You build it, you run it.”
The responsibility to maintaining your own code in production should encourage any developer to make sure that it breaks as little as possible, and that when it breaks you know what to do and where to look.
That’s a good thing.
None of this means you can’t have people who focus on ops and other people who focus on dev; but I think it means they should be situated organizationally close to each other, on the same teams, and that the dev people have to share some ops responsibilities, so they feel some pain from products that are hard to deploy, hard to keep running reliably, or hard to maintain or change.
 Note some people think even constructing a building shouldn’t be “when it’s done it’s done”, but that buildings too should be constructed in such a way that allows continual modification by those who inhabit them, in response to changing needs or understandings of needs.
In a previous post, we examined why Open Science is necessary to take advantage of the huge corpus of data generated by modern science. In our project Detection of Archaeological residues using Remote sensing Techniques, or DART, we adopted Open Science principles and made all the project’s extensive data available through a purpose-built data repository built on the open-source CKAN platform. But with so many academic repositories, why did we need to roll our own? A final post will look at how the portal was implemented.
DART’s overall aim is to develop analytical methods to differentiate archaeological sediments from non-archaeological strata, on the basis of remotely detected phenomena (e.g. resistivity, apparent dielectric permittivity, crop growth, thermal properties etc). DART is a data rich project: over a 14 month period, in-situ soil moisture, soil temperature and weather data were collected at least once an hour; ground based geophysical surveys and spectro-radiometry transects were conducted at least monthly; aerial surveys collecting hyperspectral, LiDAR and traditional oblique and vertical photographs were taken throughout the year, and laboratory analyses and tests were conducted on both soil and plant samples. The data archive itself is in the order of terabytes.
Analysis of this archive is ongoing; meanwhile, this data and other resources are made available through open access mechanisms under liberal licences and are thus accessible to a wide audience. To achieve this we used the open-source CKAN platform to build a data repository, DARTPortal, which includes a publicly queryable spatio-temporal database (on the same host), and can support access to individual data as well as mining or analysis of integrated data.
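CKAN repositories like the one described here expose their dataset metadata through CKAN's action API, so the archive is queryable by script as well as by browser. A hedged sketch follows: the portal URL is a placeholder, the dataset name is invented, and the response is canned rather than fetched live.

```python
# Query a CKAN portal's dataset catalogue via the action API.
import json
from urllib.parse import urlencode

portal = "http://example.org/dartportal"   # placeholder for the real portal
url = portal + "/api/3/action/package_search?" + urlencode({"q": "soil moisture"})
print(url)

# Shape of a (heavily truncated) CKAN response; a live call would use
# urllib.request.urlopen(url).
response = json.loads("""
{"success": true,
 "result": {"count": 1,
            "results": [{"name": "soil-moisture-hourly",
                         "notes": "In-situ soil moisture, hourly readings"}]}}
""")
for dataset in response["result"]["results"]:
    print(dataset["name"])
```

Machine access of this kind is what lets notebooks (such as the IPython notebook mentioned below) pull data straight from the repository instead of from files passed around by hand.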
This means we can share the data analysis and transformation processes and demonstrate how we transform data into information and synthesise this information into knowledge (see, for example, this Ipython notebook which dynamically exploits the database connection). This is the essence of Open Science: exposing the data and processes that allow others to replicate and more effectively build on our science.
Pleased though we are with our data repository, it would have been nice not to have to build it! Individual research projects should not bear the burden of implementing their own data repository framework. This is much better suited to local or national institutions where the economies of scale come into their own. Yet in 2010 the provision of research data infrastructure that supported what DART did was either non-existent or poorly advertised. Where individual universities provided institutional repositories, these were focused on publications (the currency of prestige and career advancement) and not on data. Irrespective of other environments, none of the DART collaborating partners provided such a data infrastructure.
Data sharing sites like Figshare did not exist when DART began, and once they did, the size of our hyperspectral data in particular was quite rightly a worry. This situation is slowly changing, but it is still far from ideal. The positions taken by Research Councils UK and the Engineering and Physical Science Research Council (EPSRC) on improving access to data are key catalysts for change. The EPSRC statement is particularly succinct:
Two of the principles are of particular importance: firstly, that publicly funded research data should generally be made as widely and freely available as possible in a timely and responsible manner; and, secondly, that the research process should not be damaged by the inappropriate release of such data.
This has produced a simple economic issue – if research institutions can not demonstrate that they can manage research data in the manner required by the funding councils then they will become ineligible to receive grant funding from that council. The impact is that the majority of universities are now developing their own, or collaborating on communal, data repositories.
DART was generously funded through the Science and Heritage Programme supported by the UK Arts and Humanities Research Council (AHRC) and the EPSRC. This means that these research councils will pay for data archiving in the appropriate domain repository, in this case the Archaeology Data Service (ADS). So why produce our own repository?
Deposition to the ADS would only have occurred after the project had finished. With DART, the emphasis has been on re-use and collaboration rather than primarily on archiving. These goals are not mutually exclusive: the methods adopted by DART mean that we produced data that is directly suitable for archiving (well documented ASCII formats, rich supporting description and discovery metadata, etc) whilst also allowing more rapid exposure and access to the ‘full’ archive. This resulted in DART generating much richer resource discovery and description metadata than would have been the case if the data was simply deposited into the ADS.
The point of the DART repository was to produce an environment which would facilitate good data management practice and collaboration during the lifetime of the project. This is representative of a crucial shift in thinking, where projects and data collectors consider re-use, discovery, licences and metadata at a much earlier stage in the project life cycle: in effect, to create dynamic and accessible repositories that have impact across the broad stakeholder community rather than focussing solely on the academic community. The same underpinning philosophy of encouraging re-use is seen at both FigShare and DataHub. Whilst formal archiving of data is to be encouraged, if it is not re-useable, or more importantly easily re-useable, within orchestrated scientific workflow frameworks, then what is the point?
In addition, it is unlikely that the ADS will take the full DART archive. It has been said that archaeological archives can produce lots of extraneous or redundant ‘stuff’. This can be exacerbated by the unfettered use of digital technologies – how many digital images are really required for the same trench? Whilst we have sympathy with this argument, there is a difference between ‘data’ and ‘pretty pictures’: as data analysts, we consider that a digital photograph is normally a data resource and rarely a pretty picture. Hence, every image has value.
This is compounded when advances in technology mean that new data can be extracted from ‘redundant’ resources. For example, Structure from Motion (SfM) is a Computer Vision technique that extracts 3D information from 2D objects. From a series of overlapping photographs, SfM techniques can be used to extract 3D point clouds and generate orthophotographs from which accurate measurements can be taken. In the case of SfM there is no such thing as redundancy, as each image becomes part of a ‘bundle’ and the statistical characteristics of the bundle determine the accuracy of the resultant model. However, one does need to be pragmatic, and it is currently impractical for organisations like the ADS to accept unconstrained archives. That said, it is an area that needs review: if a research object is important enough to have detailed metadata created about it, then it should be important enough to be archived.
For DART, this means that the ADS is hosting a subset of the archive in long-term re-use formats, which will be available in perpetuity (which formally equates to a maximum of 25 years), while the DART repository will hold the full archive in long-term re-use formats until we run out of server money. We are in discussion with Leeds University to migrate all the data objects over to the new institutional repository with sparkling new DOIs, and we can transfer the metadata held in CKAN over to Open Knowledge’s public repository, the dataHub. In theory nothing should be lost.
The point on perpetuity is interesting. Collins Dictionary defines perpetuity as ‘eternity’. However, the ADS defines ‘digital’ perpetuity as 25 years. This raises the question: is it more effective in the long term to deposit in ‘formal’ environments (with an intrinsic focus on preservation format over re-use), or in ‘informal’ environments with a focus on re-use and engagement over preservation (Flickr, Wikimedia Commons, the DART repository based on CKAN, etc.)? Both Flickr and Wikimedia Commons have been around for over a decade. Distributed peer-to-peer sharing, as used in Git, produces more robust and resilient environments which are equally suited to longer term preservation. Whilst the authors appreciate that the situation is much more nuanced, particularly with the introduction of platforms that facilitate collaborative workflow development, this does have an impact on long-term deployment.
Licences are fundamental to the successful re-use of content. Licences describe who can use a resource, what they can do with this resource and how they should reference any resource (if at all).
Two lead organisations have developed legal frameworks for content licensing, Creative Commons (CC) and Open Data Commons (ODC). Until the release of CC version 4, published in November 2013, the CC licence did not cover data. Between them, CC and ODC licences can cover all forms of digital work.
At the top level the licences are permissive public domain licences (CC0 and PDDL respectively) that impose no restrictions on the licensees use of the resource. ‘Anything goes’ in a public domain licence: the licensee can take the resource and adapt it, translate it, transform it, improve upon it (or not!), package it, market it, sell it, etc. Constraints can be added to the top level licence by employing the following clauses:
Each of these clauses decreases the ‘open-ness’ of the resource. In fact, the NC and ND clauses are not intrinsically open (they restrict both who can use the resource and what can be done with it). These restrictive clauses have the potential to produce licence incompatibilities which may introduce profound problems in the medium to long term. This is particularly relevant to the SA clause. Share-alike means that any derived output must be licensed under the same conditions as the source content. If content is combined (or mashed up) — which is essential when one is building up a corpus of heritage resources — then content created under an SA clause cannot be combined with content that includes a restrictive clause (BY, NC or ND) that is not in the source licence. This licence incompatibility has a significant impact on the nature of the data commons. It has the potential to fragment the data landscape, creating pockets of knowledge which are rarely used in mainstream analysis, research or policy making. This will be further exacerbated when automated data aggregation and analysis systems become the norm. A permissive licence without clauses like Non-commercial, Share-alike or No-derivatives removes such licence and downstream re-user fragmentation issues.
For completeness, specific licences have been created for Open Government Data. The UK Government Data Licence for public sector information is essentially an open licence with a BY attribution clause.
At DART we have followed the guidelines of The Open Data Institute and separated out creative content (illustrations, text, etc.) from data content. Hence, the DART content is either CC-BY or ODC-BY respectively. In the future we believe it would be useful to drop the BY (attribution) clause. This would stop attribution stacking (if the resource you are using is a derivative of a derivative of a derivative of a ….. (you get the picture), at what stage do you stop attributing?), and anything which requires bureaucracy, such as attributing an image in a PowerPoint presentation, inhibits re-use (one should always assume that people are intrinsically lazy). There is a post advocating ccZero+ by Dan Cohen. However, impact tracking may mean that the BY clause becomes a default for academic deposition.
The ADS uses a more restrictive bespoke default licence which does not map to national or international licence schemes (they also don’t recognise non-CC licences). Resources under this licence can only be used for teaching, learning, and research purposes. Of particular concern is their use of the NC clause and possible use of the ND clause (depending on how you interpret the licence). Interestingly, policy changes mean that the use of data under the bespoke ADS licence becomes problematic if university teaching activities are determined to be commercial. It is arguable that the payment of tuition fees represents a commercial activity. If this is true, then resources released under the ADS licence cannot be used within university teaching which is part of a commercial activity. Hence, the policy change in student tuition and university funding has an impact on the commercial nature of university teaching, which has a subsequent impact on what data or resources universities are licensed to use. Whilst it may never have been the intention of the ADS to produce a licence with this potential paradox, it is a problem when bespoke licences are developed, even if they were originally perceived to be relatively permissive licences. To remove this ambiguity it is recommended that submissions to the ADS are provided under a CC licence, which renders the bespoke ADS licence void.
In the case of DART, these licence variations with the ADS should not be a problem. Our licences are permissive (by attribution is the only clause we have included). This means the ADS can do anything they want with our resources as long as they cite the source. In our case this would be the individual resource objects or collections on the DART portal. This is a good thing, as the metadata on the DART portal is much richer than the metadata held by the ADS.
Christopher Gutteridge (University of Southampton) and Alexander Dutton (University of Oxford) have collated a Google doc entitled ‘Concerns about opening up data, and responses which have proved effective’. This document describes a number of concerns commonly raised by academic colleagues about increasing access to data. For DART two issues became problematic that were not covered by this document:
The former point is interesting — does the process of undertaking open science, or at least providing open data, undermine the novelty of the resultant scientific process? With open science it could be difficult to directly attribute the contribution, or novelty, of a single PhD student to an openly collaborative research process. However, that said, if online versioning tools like Git are used, then it is clear who has contributed what to a piece of code or a workflow (the benefits of the BY clause). This argument is less solid when we are talking solely about open data. Whilst it is true that other researchers (or anybody else for that matter) have access to the data, it is highly unlikely that multiple researchers will use the same data to answer exactly the same question. If they do ask the same question (and making the optimistic assumption that they reach the same conclusion), it is still highly unlikely that they will have done so by the same methods; and even if they do, their implementations will be different. If multiple methods using the same source data reach the same conclusion then there is an increased likelihood that the conclusion is correct and that the science is even more certain. The underlying point here is that 21st-century scientific practice will substantially benefit from people showing their working. Exposure of the actual process of scientific enquiry (the algorithms, code, etc.) will make the steps between data collection and publication more transparent, reproducible and peer-reviewable — or, quite simply, more scientific. Hence, we would argue that open data and research novelty is only a problem if plagiarism is a problem.
The journal publication point is equally interesting. Publications are the primary metric for academic career progression and kudos. In this instance it was the policy of the ‘leading journal in this field’ that they would not publish a paper from a dataset that was already published. No credible reasons were provided for this clause – which seems draconian in the extreme. It does indicate that no one size fits all approach will work in the academic landscape. It will also be interesting to see how this journal, which publishes work which is mainly funded by EPSRC, responds to the EPSRC guidelines on open data.
This is also a clear demonstration that the academic community needs to develop new metrics better suited to 21st-century research and scholarship, directly linking academic career progression to sources of impact that go beyond publications. Furthermore, academia needs some high-profile exemplars that demonstrate clearly how to deal with such change. The policy shift and ongoing debate concerning ‘Open access’ publications in the UK is changing the relationship between funders, universities, researchers, journals and the public — a similar debate needs to occur about open data and open science.
The altmetrics community is developing new metrics for “analyzing, and informing scholarship” and has described its ethos in its manifesto. The Research Councils and Governments have taken a much greater interest in the impact of publicly funded research. Importantly, public, social and industry impact are as important as academic impact. It is incumbent on universities to respond by directly linking academic career progression to impact and by encouraging improved access to the underlying data and processing outputs of the research process through data repositories and workflow environments.
If you sell an ebook through Amazon's Kindle Direct program, Amazon doesn't want you to offer it for less somewhere else. It's easy to understand why; if you're a consumer, you hate to pay $10 for an ebook on Amazon and then find that you can get it direct from the author for $5. But is it legal for Amazon to enjoin a publisher from offering better prices in other channels? In other words, is Amazon allowed to insist on a "Most Favored Nation" (MFN) provision?
4. Setting Your List Price
You must set your Digital Book's List Price (and change it from time-to-time if necessary) so that it is no higher than the list price in any sales channel for any digital or physical edition of the Digital Book.
But if you choose the 70% Royalty Option, you must further set and adjust your List Price so that it is at least 20% below the list price in any sales channel for any physical edition of the Digital Book.
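Read literally, the two clauses above define a simple price cap. As a rough sketch (my own illustration of the quoted terms, not Amazon's code; the class and method names are hypothetical), the maximum permissible List Price under the 70% Royalty Option is the lower of the cheapest other-channel price and 80% of the cheapest physical edition's list price:

```java
import java.util.List;

public class KdpListPrice {
    // Illustrative sketch of the pricing rules quoted above (not Amazon's code):
    // the List Price may be no higher than the lowest price in any other sales
    // channel, and under the 70% royalty option it must additionally be at
    // least 20% below the lowest physical-edition list price.
    static double maxListPrice(List<Double> otherChannelPrices,
                               List<Double> physicalPrices,
                               boolean seventyPercentRoyalty) {
        double cap = otherChannelPrices.stream()
                .mapToDouble(Double::doubleValue).min().orElse(Double.MAX_VALUE);
        if (seventyPercentRoyalty && !physicalPrices.isEmpty()) {
            double physicalMin = physicalPrices.stream()
                    .mapToDouble(Double::doubleValue).min().getAsDouble();
            cap = Math.min(cap, 0.80 * physicalMin);  // "at least 20% below"
        }
        return cap;
    }

    public static void main(String[] args) {
        // Digital edition sold elsewhere at $9.99; paperback listed at $10.00.
        System.out.println(maxListPrice(List.of(9.99), List.of(10.00), true));
        // 8.0 — the 70% option forces 20% below the $10 paperback
    }
}
```

Under the 35% option only the first clause applies, so the same book could be listed at $9.99; the 70% option is what creates the extra downward pressure that an MFN clause then locks in across channels.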
Although the judge found that the MFN clause in this instance was critical to Apple’s ability to orchestrate the unlawful conspiracy, Judge Cote explicitly held that MFN clauses are not, in and of themselves, “inherently illegal.” Judge Cote explained that “entirely lawful contracts may contain an MFN …. The issue is not whether an entity … used an MFN, but whether it conspired to raise prices.” This determination, she stated, must be based on consideration of the “totality of the evidence,” rather than on the language of the agency agreement or MFN alone. Examining the facts in this particular case, Judge Cote found that Apple’s use of the MFN clause to facilitate the e-book conspiracy with the publishers constituted a “per se” violation of the antitrust laws.
Depending on the economic and commercial circumstances, MFN clauses have on occasion caused concern to competition authorities. In particular:
- They can act as a disincentive to price cutting. If a supplier knows that, by offering a discount to any third-party customer, the supplier must also offer the customer benefiting from the MFN clause a discount to ensure that the latter enjoys the most favourable price, that is a "double cost" to price cutting, and therefore could have the effect of deterring price cuts and keeping prices higher than they might otherwise be.
If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra.
Linked data is not a fad. It is not a trend. It makes a lot of computing sense, and it is a modern way of fulfilling some of the goals of archival practice. Just like Rome, it is not going away. An understanding of what linked data has to offer is akin to experiencing Rome first hand. Both will ultimately broaden your perspective. Consequently it is a good idea to make a concerted effort to learn about linked data, as well as visit Rome at least once. Once you have returned from your trip, discuss what you learned with your friends, neighbors, and colleagues. The result will enlighten everybody.
The previous sections of this book described what linked data is and why it is important. The balance of the book describes more of the how’s of linked data. For example, there is a glossary to help reinforce your knowledge of the jargon. You can learn about HTTP “content negotiation” to understand how actionable URIs can return HTML or RDF depending on the way you instruct remote HTTP servers. RDF stands for “Resource Description Framework”, and the “resources” are represented by URIs. A later section of the book describes ways to design the URIs of your resources. Learn how you can transform existing metadata records like MARC or EAD into RDF/XML, and then learn how to put the RDF/XML on the Web. Learn how to exploit your existing databases (such as the ones under Archon, Archivists’ Toolkit, or ArchivesSpace) to generate RDF. If you are the Do It Yourself type, then play with and explore the guidebook’s tool section. Get the gentlest of introductions to searching RDF using a query language called SPARQL. Learn how to read and evaluate ontologies & vocabularies. They are manifested as XML files, and they are easily readable and visualizable using a number of programs. Read about and explore applications using RDF as the underlying data model. There are a growing number of them. The book includes a complete publishing system written in Perl, and if you approach the code of the publishing system as if it were a theatrical play, then the “scripts” read like scenes. (Think of the scripts as if they were a type of poetry, and they will come to life. Most of the “scenes” are less than a page long. The poetry even includes a number of refrains. Think of the publishing system as if it were a one-act play.) If you want to read more, and you desire a vetted list of books and articles, then a later section lists a set of further reading.
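For readers who want a concrete picture of “content negotiation,” here is a deliberately tiny sketch (my illustration only, not from the guidebook; real servers also honor q-values and many more media types) of how a server might choose to return HTML or RDF for the same URI, based on the client's Accept header:

```java
public class ContentNegotiation {
    // A very small sketch of HTTP content negotiation for a linked-data URI:
    // an RDF-aware client sends "Accept: text/turtle" (or application/rdf+xml)
    // and receives RDF; an ordinary browser sends "Accept: text/html" and
    // receives an HTML page describing the same resource.
    static String chooseRepresentation(String acceptHeader) {
        if (acceptHeader == null) return "text/html";          // sensible default
        if (acceptHeader.contains("text/turtle")) return "text/turtle";
        if (acceptHeader.contains("application/rdf+xml")) return "application/rdf+xml";
        return "text/html";
    }

    public static void main(String[] args) {
        System.out.println(chooseRepresentation("text/turtle"));                    // text/turtle
        System.out.println(chooseRepresentation("text/html,application/xhtml+xml")); // text/html
    }
}
```

The point is that one URI can serve both people and machines; which representation comes back is a negotiation, not a property of the URI itself.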
After you have spent some time learning a bit more about linked data, discuss what you have learned with your colleagues. There are many different aspects of linked data publishing, such as but not limited to:
In archival practice, each of these things would be done by different sets of people: archivists & content specialists, administrators & managers, computer programmers & systems administrators, metadata experts & catalogers. Each of these sets of people has a piece of the publishing puzzle and something significant to contribute to the work. Read about linked data. Learn about linked data. Bring these sets of people together to discuss what you have learned. At the very least you will have a better collective understanding of the possibilities. If you don’t plan to “go to Rome” right away, you can always reconsider the “vacation” at another time.
Even Michelangelo, when he painted the Sistine Chapel, worked with a team of people each possessing a complementary set of skills. Each had something different to offer, and the discussion between themselves was key to their success.
Making the Journal the best that it can be.
Comprehensive social search on the Internet remains an unsolved problem. Social networking sites tend to be isolated from each other, and the information they contain is often not fully searchable outside the confines of the site. EgoSystem, developed at Los Alamos National Laboratory (LANL), explores the problems associated with automated discovery of public online identities for people, and the aggregation of the social, institutional, conceptual, and artifact data connected to these identities. EgoSystem starts with basic demographic information about former employees and uses that information to locate person identities in various popular online systems. Once identified, their respective social networks, institutional affiliations, artifacts, and associated concepts are retrieved and linked into a graph containing other found identities. This graph is stored in a Titan graph database and can be explored using the Gremlin graph query/traversal language and with the EgoSystem Web interface.
This article describes how the University of North Texas Libraries' Digital Projects Unit used simple, freely-available APIs to add place names to metadata records for over 8,000 maps in two digital collections. These textual place names enable users to easily find maps by place name and to find other maps that feature the same place, thus increasing the accessibility and usage of the collections. This project demonstrates how targeted large-scale, automated metadata enhancement can have a significant impact with a relatively small commitment of time and staff resources.
In late 2012, OSU Libraries and Press partnered with Maria’s Libraries, an NGO in Rural Kenya, to provide users the ability to crowdsource translations of folk tales and existing children's books into a variety of African languages, sub-languages, and dialects. Together, these two organizations have been creating a mobile optimized platform using open source libraries such as Wink Toolkit (a library which provides mobile-friendly interaction from a website) and Globalize3 to allow for multiple translations of database entries in a Ruby on Rails application. Research regarding successes of similar tools has been utilized in providing a consistent user interface. The OSU Libraries & Press team delivered a proof-of-concept tool that has the opportunity to promote technology exploration, improve early childhood literacy, change the way we approach foreign language learning, and to provide opportunities for cost-effective, multi-language publishing.
In this article, we present a case study of how the main publishing format of an Open Access journal was changed from PDF to EPUB by designing a new workflow using JATS as the basic XML source format. We state the reasons and discuss advantages for doing this, how we did it, and the costs of changing an established Microsoft Word workflow. As an example, we use one typical sociology article with tables, illustrations and references. We then follow the article from JATS markup through different transformations resulting in XHTML, EPUB and MOBI versions. In the end, we put everything together in an automated XProc pipeline. The process has been developed on free and open source tools, and we describe and evaluate these tools in the article. The workflow is suitable for non-professional publishers, and all code is attached and free for reuse by others.
The Valley Library at Oregon State University Libraries & Press supports access to technology by lending laptops and e-readers. As a newcomer to tablet lending, The Valley Library chose to implement its service using Google Nexus tablets and an open source custom firmware solution, CyanogenMod, a free, community-built Android distribution. They created a custom build of CyanogenMod featuring wireless updates, website shortcuts, and the ability to quickly and easily wipe devices between patron uses. This article shares code that simplifies Android tablet maintenance and addresses Android application licensing issues for shared devices.
As the archival horizon moves forward, optical media will become increasingly significant and prevalent in collections. This paper sets out to provide a broad overview of optical media in the context of archival migration. We begin by introducing the logical structure of compact discs, providing the context and language necessary to discuss the medium. The article then explores the most common data formats for optical media: Compact Disc Digital Audio, ISO 9660, the Joliet and HFS extensions, and the Universal Data Format (with an eye towards DVD-Video). Each format is viewed in the context of preservation needs and what archivists need to be aware of when handling said formats. Following this, we discuss preservation workflows and concerns for successfully migrating data away from optical media, as well as directions for future research.
Digital signage has been used in the commercial sector for decades. As display and networking technologies become more advanced and less expensive, it is surprisingly easy to implement a digital signage program at a minimal cost. In the fall of 2011, the University of Florida (UF) Health Sciences Center Library (HSCL) initiated the use of digital signage inside and outside its Gainesville, Florida facility. This article details UF HSCL’s use and evaluation of DigitalSignage.com signage software to organize and display its digital content.
We are happy to announce the v1.0 release of the OCLC Python 2.7 Authentication Library via Github. This code library is the fourth implementation that the OCLC Developer Network is releasing to assist developers working with our web services protected by our API key system.
Today I found the following resources and bookmarked them:
Before Whatson accepts a search query it will first ingest, analyze and index documents so that searches don’t take forever. I have shown how Whatson will use Apache Tika to extract metadata and convert different content types into plain text. After that, the plain text will be split up into words, called tokens, so that queries can later be matched up to documents. Here is a simple example:
The tokenizer analyzes whitespace and punctuation to produce a list of tokens. Partial example, with pipes inserted by me:
Dr. | Lanyon | sat | alone | over | his | wine | . | This | was | a | hearty | , | healthy …
The tokenizer was smart enough to keep the period with “Dr.” but separate it out when it was used to end a sentence. This is why you don’t want to build a tokenizer from scratch.
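To see why, here is a deliberately naive, rule-based tokenizer sketch (my own illustration, not how OpenNLP works — OpenNLP's tokenizer is a trained statistical model). It can only handle abbreviations on a hard-coded list, which hints at how many edge cases a hand-rolled tokenizer would have to enumerate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class NaiveTokenizer {
    // A deliberately naive rule-based tokenizer. Its one "smart" decision:
    // is a trailing period part of an abbreviation ("Dr.") or
    // sentence-ending punctuation that should become its own token?
    static final Set<String> ABBREVIATIONS = Set.of("Dr.", "Mr.", "Mrs.", "St.");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (ABBREVIATIONS.contains(word)) {
                tokens.add(word);                               // keep the period attached
            } else if (word.matches(".+[.,;:!?]")) {
                tokens.add(word.substring(0, word.length() - 1));
                tokens.add(word.substring(word.length() - 1));  // split punctuation off
            } else {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Dr. Lanyon sat alone over his wine."));
        // [Dr., Lanyon, sat, alone, over, his, wine, .]
    }
}
```

Even this toy version fails on abbreviations it has never seen, possessives, hyphenation, and nested punctuation; a statistical model learns those distinctions from training data instead of an ever-growing rule list.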
The Apache Software Foundation. Apache OpenNLP Developer Documentation.
dpdearing. Getting started with OpenNLP 1.5.0 – Sentence Detection and Tokenizing.
The library is the heart of the University. From it, the lifeblood of scholarship flows to all parts of the University; to it the resources of scholarship flow to enrich the academic body. With a mediocre library, even a distinguished faculty cannot attain its highest potential; with a distinguished library, even a less than brilliant faculty may fulfill its mission. For the scientist, the library provides an indispensable backstop to his laboratory and field work. For the humanist, the library is not only his reference centre; it is indeed his laboratory and the field of his explorations. What he studies, researches and writes is the product of his reading in the library. For these reasons, the University library must be one of the primary concerns for those responsible for the development and welfare of the institution. At the same time, the enormous cost of acquisitions, the growing scarcity of older books, the problem of storage and cataloguing make the library one of the most painful headaches of the University administrator.
From the Report to the Committee on University Affairs and the Committee of Presidents of Provincially-Assisted Universities, by the Commission to Study the Development of Graduate Programmes in Ontario Universities, chaired by Gustave O. Arlt, F. Kenneth Hare, and J.W.T. Spinks, published in 1966. I think I found this in Evolution of the Heart: A History of the University of Toronto Library Up to 1981 by Robert Blackburn, which is a fine book, and very interesting. Blackburn was the chief librarian there for about 25 years.
In March, the U.S. House Judiciary Subcommittee on Courts, Intellectual Property and the Internet held a hearing on Section 512, the provision that provides protection for internet service providers from liability for the infringing actions of network users. The Library Copyright Alliance (LCA) submitted comments (pdf) in support of no changes to the existing law, holding that this provision helps libraries provide online services in good faith without liability for the potentially illegal actions of a third party.
Though libraries were not specifically represented in the hearing, one line of questioning directed at both Google and Automattic Inc.—owner of WordPress—stands out as relevant to both present and future methods of delivering content and services to library patrons: “free” as the opposite of “legal” or “legitimate.”
Several representatives focused on witnesses Katherine Oyama, senior copyright policy counsel for Google, and Paul Sieminski, general counsel for Automattic Inc., expressing significant confusion about how Google creates and modifies indexing and search algorithms, as well as the nuances of copyright protection on a blogging platform. “Free” was the watchword, and many subcommittee members expressed the same basic concerns.
Rep. Judy Chu (D-CA) asked about autocomplete results in Google that include “free” and “watch online,” saying that such results “induce infringement” on the part of searchers. Rep. Cedric Richmond (D-LA) further echoed worries that unsophisticated Internet users like his grandmother would be “induced to infringe” by seeing an autocomplete result for “watch 12 Years a Slave free online.”
But the most colorful exchange began with Rep. Tom Marino (R-PA) expressing disbelief that Google could not simply ban or remove terms such as “watch X movie online for free” from the engine.
Oyama rightly pointed out that “we are not going to ban the word ‘free’ from search…there are many legitimate sources for music and films that are available for free.” She also promoted YouTube’s ContentID software as an effective answer to alleged infringement, though there are certainly reasons to remain wary of the “software savior” in addressing takedown notices (more on ContentID coming soon).
As libraries begin exploring ways to deliver legally obtained and responsibly monitored content to patrons, we will have to offer a counterpoint to the concept of “free” as the automatic enemy of rights holders. While we know that it is anything but free to provide these services (no-fee or no-charge is perhaps a better description), the public often perceives it as such, and simply banning phrases like “read for free” or “watch for free” from the world’s largest Internet index will not reduce infringement. Instead, it removes a responsible and reliable source from top page results, which is the exact opposite of what the lawmakers above support.
If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra. If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice. For a week, do everything you would do in a few days, and make one or two day-trips outside Rome in order to get a flavor of the wider community. If you can afford two weeks, then do everything you would do in a week, and in addition befriend somebody in the hopes of establishing a life-long relationship.
When you read a guidebook on Rome — or any travel guidebook — there are simply too many listed things to see & do. Nobody can see all the sights, visit all the museums, walk all the tours, or eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data had to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:
Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different “itineraries” for linked data implementation. Think of them as strategies. They are ordered from least costly and most modest to most expensive and most complete:
Last week, the American Library Association (ALA) joined an amicus brief calling for reconsideration of a 9th circuit court decision in Garcia v. Google, a case where actress Cindy Lee Garcia sued Google for not removing a YouTube video in which she appears. Garcia appears for five seconds in “Innocence of Muslims,” the radical anti-Islamic video that fueled the attack on the American embassy in Benghazi. The video was uploaded on YouTube, exposing Garcia to threats and hate mail. Garcia did not know that her five-second performance would be used in a controversial video.
Garcia turned to the copyright law for redress, arguing that her five-second performance was protected by copyright, and therefore, as a rights holder she could ask that the video be removed from YouTube. While we empathize with Garcia’s situation, the copyright law does not protect performances in film—instead these performances are works-for-hire. This ruling, if taken to its extreme, would hold that anyone who worked on a film—from the editor to the gaffer—could claim rights, creating a copyright permissions nightmare.
On appeal, the judge agreed that the copyright argument was weak, but nonetheless ruled for Garcia. The video currently is not available for public review. This decision needs to be reheard en banc—the copyright ruling is mistaken, and perhaps more importantly, the copyright law cannot be used to restrain speech. While the facts of this case are not at all appealing, we agree that rules of law need to be upheld. Fundamental values of librarianship—including intellectual freedom, fair use, and preservation of the cultural record—are in serious conflict with the existing court ruling.
The post Appeals court decision undermines free speech, misinterprets copyright law appeared first on District Dispatch.
One of the most useful resources I found when developing a child theme in the WordPress Thematic theme framework was the theme structure document formerly found on the bluemandala.com website.
With permission from Deryk, I am reproducing it here: http://e-records.chrisprom.com/manualuploads/thematic-structure.html.
And, here is another great resource for development using Thematic: http://visualizing.thematic4you.com/
On 26 March 2014 I gave a short talk at the March 2014 AR Standards Community Meeting in Arlington, Virginia. The talk was called “Stuff, Standards and Sites: Libraries and Archives in AR.” My slides and the text of what I said are online:
I struggled with how best to talk to non-library people, all experts in different aspects of augmented reality, about how our work can fit with theirs. The stuff/standards/sites components gave me something to hang the talk on, but it didn’t all come together as well as I’d hoped and in the heat of actually speaking I forgot to mention a couple of important things. Ah well.
I made the slides a new way. They are done with reveal.js, but I wrote them in Emacs with Org and then used org-reveal to export them. It worked beautifully! The diagrams in the slides are done in text in Org with ditaa and turned into images on export.
What I write in Org looks like this (here I turned image display off, but one keystroke makes them show):
When turned into slides, that looks like this:
Working this way was a delight. No more nonsense about dragging boxes and stuff around like in Power Point. I get to work with pure text, in my favourite editor, and generate great-looking slides, all with free software.
To turn all the slides into little screenshots, I used this little script I found in a GitHub gist: PhantomJS script to capture/render screenshots of the slides of a Reveal.js powered slideshow. I had to install phantom.js first, but on my Ubuntu system that was just a simple
sudo apt-get install phantomjs.
For Immediate Release
CHICAGO — Tablets, desktops, smartphones, laptops, minis: we live in a world of screens, all of different sizes. Library websites need to work on all of them, but maintaining separate sites or content management systems is resource intensive and still unlikely to address all the variations. By using responsive Web design, libraries can build one site for all devices—now and in the future. In “Responsive Web Design for Libraries: A LITA Guide,” published by ALA TechSource, experienced responsive Web developer Matthew Reidsma, named “a web librarian to watch” by ACRL’s TechConnect blog, shares proven methods for delivering the same content to all users using HTML and CSS. His practical guidance will enable Web developers to save valuable time and resources by working with a library’s existing design to add responsive Web design features. With both clarity and thoroughness, and firmly addressing the expectations of library website users, this book:
Reidsma is Web services librarian at Grand Valley State University, in Allendale, Mich. He is the cofounder and editor in chief of Weave: Journal of Library User Experience, a peer-reviewed, open-access journal for library user experience professionals. He speaks frequently about library websites, user experience and responsive design around the world. Library Journal named him a “Mover & Shaker” in 2013. He writes about libraries and technology at Matthew Reidsma.
The Library and Information Technology Association (LITA), a division of ALA, educates, serves and reaches out to its members, other ALA members and divisions, and the entire library and information community through its publications, programs and other activities designed to promote, develop, and aid in the implementation of library and information technology.
Italian Lectures on Semantic Web and Linked Data: Practical Examples for Libraries, Wednesday May 7, 2014 at The American University of Rome – Auriana Auditorium (Via Pietro Roselli, 16 – Rome, Italy)
Please RSVP to f.wallner at aur.edu by May 5.
This event is generously sponsored by regesta.exe, Bucap Document Imaging SpA, and SOS Archivi e Biblioteche.
“Put down the marker, step away from the whiteboard.” I joked that once in a design session. A picture can represent a rich array of information in a single frame — that is its strength and weakness. “A picture paints a thousand words. Stop all the talking!” It can take a while to assimilate all the information in a diagram. Here is my first cut at an architecture diagram for Whatson, my home basement attempt at building Watson using public knowledge and open source technology. I will detail the components in future posts as the build proceeds.
1. Data Source to Index. In order for Whatson to be able to answer questions in a timely fashion, data sources must be pre-processed. Data sources must be crawled and indexed. The index is the structured target for searches.
1.1. Data Sources. I will download data sources including public domain literature and Wikipedia. Other sources may be added. The more data sources, the smarter Whatson will be.
1.2. Crawl. In a previous post, I showed how I can use Apache Tika to convert different content types (e.g., html, pdf) into plain text, and extract metadata. This is the crawl stage. The common plain text format makes further processing much easier.
1.3. Index. Using OpenNLP I will process text along a UIMA pipeline. UIMA is an open, industry-standard architecture for Natural Language Processing (NLP) stages. A UIMA pipeline is a series of text-processing steps, including: parsing document content into individual words, or tokens; tagging the tokens with parts of speech like nouns and verbs; and identifying entities like people, locations and organizations.
2. Question to Answer. Once the data sources have been crawled and indexed, a question may be submitted to Whatson. The output must be Whatson’s single best answer.
2.1. Question. A user interface will accept a question in natural language.
2.2. Cognitive Analysis. Whatson will analyze the question text. The analysis first submits the text to the UIMA pipeline built for step 1. The pipeline outputs are used here to make the question easier to analyze for the next step, deciding the question type. Is the question seeking a person or a place? Is the context literal or figurative? Current or historical? Based on the question type, modules will be enlisted to answer the question. This modular approach simulates the human brain, with different modules dedicated to different kinds of knowledge and cognitive processing. The modules use domain-specific logic to search for answers in the index prepared in step 1. For example, a literature module will have domain-specific rules for analyzing literature. This approach keeps Whatson from going on wild-goose chases and speeds up processing. The output of the cognitive analysis is a candidate answer and confidence level from each enlisted module.
2.3. Dialog. Whatson needs to decide which answer from the cognitive analysis is best. If the stakes are low, it will simply select the answer with the highest confidence level. The questioner can respond whether the answer is satisfactory. A dialog may continue with additional questions. If Whatson is used in a context that has penalties, like playing Jeopardy, it might not risk giving its best answer if the confidence level is low. If the context permits, Whatson could ask for hints or prompt for a restatement of the question.
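The decision rule in the dialog step might look something like the following sketch (illustrative names and thresholds of my own, not Whatson's actual code): take the highest-confidence candidate from the enlisted modules, but abstain when the stakes are high and confidence falls below a threshold.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class AnswerSelector {
    // Sketch of the dialog step described above: each enlisted module returns
    // a candidate answer plus a confidence level; we pick the best candidate,
    // but in high-stakes contexts (e.g., Jeopardy penalties) we abstain when
    // even the best confidence is below a threshold.
    record Candidate(String answer, double confidence) {}

    static Optional<String> select(List<Candidate> candidates,
                                   boolean highStakes, double threshold) {
        Optional<Candidate> best = candidates.stream()
                .max(Comparator.comparingDouble(Candidate::confidence));
        if (best.isEmpty()) return Optional.empty();
        if (highStakes && best.get().confidence() < threshold) {
            return Optional.empty();   // better to ask for a hint than to guess
        }
        return best.map(Candidate::answer);
    }

    public static void main(String[] args) {
        List<Candidate> c = List.of(
                new Candidate("Moby Dick", 0.62),
                new Candidate("Call of the Wild", 0.35));
        System.out.println(select(c, false, 0.8)); // Optional[Moby Dick]
        System.out.println(select(c, true, 0.8));  // Optional.empty
    }
}
```

An empty result is the hook for the rest of the dialog: it is the point where the system would ask for a hint or a restatement of the question rather than commit to a low-confidence answer.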
Baker. Final Jeopardy: Man vs. Machine and the Quest to Know Everything.
Ingersoll, Morton & Farris. Taming Text.
Every day, through secret contracts being carried out within public institutions, there is confirmation that the interest of the public is not served. A few days ago, young Nigerians in Abuja were arrested for protesting against the reckless conduct of the recruitment exercise at the Nigerian Immigration Service (NIS) that led to the death of 19 applicants.
Although the protesters were later released, the irony still stings that whilst no one has been held for the resulting deaths from the reckless recruitment conduct, the young voices protesting against this grave misconduct are being silenced by security forces. Most heart-breaking is the reality that the deadly outcomes of the recruitment exercise could have been avoided with more conscientious planning, through an adherence to due process and diligence in the selection of consultants to carry out the exercise.
A report released by Premium Times indicates that the recruitment exercise was conducted exclusively by the Minister of Interior, who hand-picked the consultant that carried out the recruitment exercise at the NIS. The non-responsiveness of the Ministry in providing civic organizations including BudgIT and PPDC with requested details of the process through which the consultant was selected gives credence to the reports of due process being flouted.
The non-competitive process through which the consultant was selected is in sharp breach of the Public procurement law and its results have undermined the concept of value for money in the award of contracts for public services. Although a recruitment website was built and deployed by the hired consultant, the information gathered by the website does not seem to have informed the plan for the conduct of the recruitment exercise across the country which left Nigerians dead in its wake. Whilst the legality of the revenue generated from over 710,000 applicants is questioned, it is appalling that these resources were not used to ensure a better organized recruitment exercise.
This is not the first time that public institutions in Nigeria have displayed reckless conduct in the supposed administration of public services to the detriment of Nigerians. The recklessness with which the Ministry of Aviation took a loan to buy highly inflated vehicles, the difficulty faced by BudgIT and PPDC in tracking the exact amount of SURE-P funds spent, the 20 billion Dollars unaccounted for by the NNPC are a few of the cases where Nation building and development is undermined by public institutions.
In the instance of the NIS recruitment conducted three weeks ago, some of the consequences have been immediate and fatal, yet there is foot dragging in apportioning liability and correcting the injustice that has been dealt to Nigerians. On the same issue, public resources have been speedily deployed to silence protesters.
It is time that our laws requiring due process and diligence are fully enforced. Peaceful protests should no longer be clamped down on; Nigerians are justified in their outrage at any form of institutional recklessness. The Nigerian Immigration Service recruitment exercise painfully illustrates that the outcomes of secret contracts can be deadly, and such behaviour cannot be allowed to continue. We must stop institutional recklessness; we must stop secret contracts.
Ms. Seember Nyager coordinates procurement monitoring in Nigeria. Follow Nigerian Procurement Monitors at @Nig_procmonitor.
Amidst a flurry of congressional hearings and treaty negotiations, it is important to remember that statistics often tell only half of the story. As I catch up on recent U.S. House subcommittee hearings, I continue to marvel at how often both committee members and witnesses conflate a total number of takedown notices with actual cases of infringement. This is not a new problem; the “Chilling Effect” is a well-documented (pdf) result of widespread abuse of Section 512 takedown notices. In 2009, Google reported that over a third of DMCA takedown notices were invalid:
Google notes that more than half (57%) of the takedown notices it has received under the US Digital Millennium Copyright Act 1998 were sent by businesses targeting competitors, and over one third (37%) of notices were not valid copyright claims.
And that doesn’t even include YouTube or Blogger takedown statistics! The numbers aren’t much better today. Google’s latest Transparency Report shows over 27 million removal requests over the past three years, with nearly a million of those requests denied (requests cited as “improper” or “abusive”) in 2011 alone. Many rights holders will continue to point to takedown notice numbers as evidence of widespread infringement, but this simply bolsters a landscape in which everybody is guilty until proven innocent of violating copyright.
A new but still “pre-published” version of the Linked Archival Metadata: A Guidebook is available. From the introduction:
The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about making sets of facts about you and your collections universally accessible and reusable. As you publish these facts you will be able to maintain a more flexible Web presence, as well as a Web presence that is richer, more complete, and better integrated with complementary collections.
And from the table of contents:
There are a number of versions:
Feedback desired and hoped for.
A few weeks back, I dropped Google search in favor of DuckDuckGo, an alternative search engine that does not log your searches. Today, I’m here to report on that experience and suggest two even better secure search tools: StartPage and Ixquick.
As I outlined in my initial blog post, DuckDuckGo probably falls down as a consequence of its emphasis on privacy. Google results are based on an array of personal variables that tie specific result sets to your social graph: a complex web of data points collected on you through your Chrome browser, Android apps, browser cookies, location data, possibly even the contents of your documents and emails stored on Google’s servers (that’s a guess, but well within the scope of reason). Not having that data is a considerable handicap for DuckDuckGo.
But moreover, Google’s algorithm remains superior to everything else out there.
The benefits of using DuckDuckGo, of course, are that you are far more anonymous, especially if you are searching in private browser mode, accessing the Internet through a VPN or Tor, etc.
Again, given the explosive revelations about aggressive NSA data collection and even of government programs that hack such social graphs, and the potential leaking of that data to even worse parties, many people may decide that, on balance, they are better off dealing with poor search precision rather than setting themselves up for a cataclysmic breach of their data.
I’m one such person, but to be quite honest, I was constantly turning back to Google because DuckDuckGo just wouldn’t get me what I knew was out there.
Fortunately, I found something better: StartPage and Ixquick.
There are two important things to understand about StartPage and Ixquick:
But, like DuckDuckGo, neither Ixquick nor StartPage is able to source your social graph, so they will never get results as closely tailored to you as Google. By design, they are not looking at your cookies or building their own database of you, so they won’t be able to guess your location or political views, and therefore will never skew results around those variables. On the other hand, your results will be more broadly relevant and serendipitous, saving you from the personal echo-chamber that you may have found in Google.
It’s been over a month since I switched from DuckDuckGo to StartPage and so far it’s been quite good. StartPage even has a passable image and video search. I almost never go to Google anymore. In fact, I’ve used a browser plugin called Stylish to re-skin Google’s search interface with the NSA logo just as a humorous reminder that every search is being collected by multiple parties.
For that matter, I’ve used the same plugin to re-skin StartPage: while it gets high marks for privacy and search results, its interface design needs major work…but I’m just picky that way.
So, with my current setup, I’ve got StartPage as my default search engine, set in the omnibar in Firefox. Works like a charm!
As part of the redesign for the new site, the main thing that I really wanted to change in terms of the look was the front page. Based on my experience and discussions with staff about what our users look for when they arrive at the site, I had an idea of what information should be […]
After the last post, Seb got me wondering if there were any differences between libraries, archives and museums when looking at upload and comment activity in Aaron’s snapshot of the Flickr Commons metadata.
First I had to get a list of Flickr Commons organizations and classify them as either a library, museum or archive. It wasn’t always easy to pick, but you can see the result here. I lumped galleries in with museums. I also lumped historical societies in with archives. Then I wrote a script that walked around in the Redis database I already had from loading Aaron’s data.
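The classify-then-tally step described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the real version walked per-account activity stored in Redis, while the account names, sector assignments, and counts here are made-up stand-ins.

```python
# Hand-classified mapping of Flickr Commons accounts to sectors
# (galleries lumped in with museums, historical societies with archives).
# These entries and numbers are illustrative, not from the real snapshot.
SECTOR = {
    "library_of_congress": "library",
    "nationaal_archief": "archive",
    "brooklyn_museum": "museum",
    "powerhouse_museum": "museum",
}

# Illustrative per-account activity as (uploads, comments) pairs; the real
# script would read these counts out of the Redis database instead.
ACTIVITY = {
    "library_of_congress": (19500, 71000),
    "nationaal_archief": (1200, 3400),
    "brooklyn_museum": (2100, 5200),
    "powerhouse_museum": (2600, 9800),
}

def tally_by_sector(sector_map, activity):
    """Aggregate upload and comment counts per sector."""
    totals = {}
    for account, (uploads, comments) in activity.items():
        sector = sector_map[account]
        up, co = totals.get(sector, (0, 0))
        totals[sector] = (up + uploads, co + comments)
    return totals

print(tally_by_sector(SECTOR, ACTIVITY))
```

The per-sector totals are what feed the graphs below; swapping the toy dictionaries for real Redis reads is the only structural change the actual analysis would need.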
In doing this I noticed there were some Flickr Commons organizations that were missing from Aaron’s snapshot:
Update: Aaron quickly fixed this.
I didn’t do any research to see if these organizations had significant activity. Also, since there were close to a million files, I didn’t load the British Library activity yet. If there’s interest in adding them into the mix I’ll splurge for the larger ec2 instance.
Anyhow, below are the results. You can find the spreadsheet for these graphs in Google Docs.
This was all done rather quickly, so if you notice anything odd or that looks amiss please let me know. Initially it seemed a bit strange to me that libraries, archives and museums trended so similarly in each graph, even if the volume was different.
I was in New York a couple of weeks ago, and I went to the Strand Bookstore, that multistory heaven of used and new books. I wandered around a while and got some things I’d been wanting. I wanted to read something set in New York so I looked first at Lawrence Block’s books and got The Burglar in the Closet, which opens with Bernie Rhodenbarr sitting in Gramercy Park, which I’d just passed by on the walk down, and then at Donald E. Westlake and got Get Real, the last of the Dortmunder series, and mostly set in the Lower East Side. Welcome to New York.
While I was standing near a table in the main aisle on the ground floor an older woman carrying some bags passed behind me and accidentally knocked some books to the floor. “Oh, I’m sorry, did I do that?” she said in a thick local accent. A young woman and I both leaned over to pick up the books. I was confused for a moment, because it looked like the cover had ripped, but it hadn’t, the rip was printed.
Then I saw what the book was: Dying Every Day: Seneca at the Court of Nero, by James Romm. A new book about Seneca, the Roman senator and Stoic philosopher! Fate had actually put this book in my hand. “It is destined to be,” I thought, and immediately bought it.
It’s a fine book, a gripping history and biography, covering in full something I only knew a tiny bit about. Seneca wrote a good amount of philosophy, including the Epistles, a series of letters full of Stoic advice to a younger friend, but the editions of his philosophy (or his tragedies) don’t go much into the details of Seneca’s life. They might mention he was a senator and advisor to Nero, and rich (as rich as a billionaire today), but then they get on to analyzing the subtleties of his thoughts on nature or equanimity.
Seneca led an incredible life: he was a senator in Rome, he was banished by the emperor Claudius on trumped-up charges of an affair with Caligula’s sister, but was later called back to Rome at the behest of Agrippina, Nero’s mother, to act as an advisor and tutor to the young man. Five years later, Agrippina poisoned Claudius, and Nero became emperor.
Seneca was very close to Nero and stayed as his advisor for years. It worked fairly well at first, but Nero was Nero. This is the main matter of the book: how Seneca, the wise Stoic, stayed close to Nero, who gradually went out of control: wild behaviour, crimes, killings, and eventually the murder of his mother Agrippina. An attempt to kill her on a boat failed, and then:
None of Seneca’s meditations on morality, Virtue, Reason, and the good life could have prepared him for this. Before him, as he entered Nero’s room, stood a frightened and enraged youth of twenty-three, his student and protégé for the past ten years. For the past five, he had allied with the princeps against his dangerous mother. Now the path he had first opened for Nero, by supporting his dalliance with Acte, had led to a botched murder and a political debacle of the first magnitude. It was too late for Seneca to detach himself. The path had to be followed to its end.
Every word Seneca wrote, every treatise he published, must be read against his presence in this room at this moment. He stood in silence for a long time, as though contemplating the choices before him. There were no good ones. When he finally spoke, it was to pass the buck to Burrus. Seneca asked whether Burrus could dispatch his Praetorians to take Agrippina’s life.
Seneca supported Nero’s matricide.
It’s impossible to match that, and other things Seneca did, with his Stoic writings, but it was all the same man. It’s a remarkable and paradoxical life.
Romm’s done a great job of writing this history. It’s full of detail (especially drawing on Tacitus), with lots of people and events to follow, but it’s all presented clearly and with a strong narrative. If you liked I, Claudius you’ll like this, and I see similar comments about House of Cards and Game of Thrones.
I especially recommend this to anyone interested in Stoicism. Thrasea Paetus is a minor figure in the book, another senator and also a Stoic, but one who acted like a Stoic should have, by opposing Nero. He was new to me. Seneca’s Stoic nephew Lucan, who wrote the epic poem The Civil War, also appears. He was friends with Nero but later took part in a conspiracy to kill the emperor. It failed, and Lucan had to commit suicide, as did Seneca, who wasn’t part of the plot.
There’s a nice chain of philosophers at the end of the book. After Nero’s death, Thrasea’s Stoic son-in-law Helvidius Priscus returns to Rome, as does the great Stoic Musonius Rufus and Demetrius the Cynic. The emperor Vespasian later banished philosophers from Rome (an action that seems very puzzling these days; I’m not sure what the modern equivalent would be), but for some reason let Musonius Rufus stay. One of his students was Epictetus, who had been a slave belonging to Epaphroditus, who in turn had been Nero’s assistant and had been with him when Nero, on the run, committed suicide—in fact, Epaphroditus helped his master by opening up the cut in his throat.
Later the Stoics were banished from Rome again, and Epictetus went to Greece and taught there. He never wrote anything himself, but one of his students, Arrian, wrote down what he said, which is why we now have the very powerful Discourses. And years later this was read by Marcus Aurelius, the Stoic emperor, a real philosopher king.
For a good introduction to the book, listen to this interview with James Romm on WNYC in late March. It’s just twenty minutes.
The following guest post is by Nicole Valentinuzzi, from our Stop Secret Contracts campaign partner Publish What You Fund.
A new campaign to Stop Secret Contracts, supported by the Open Knowledge Foundation, Sunlight Foundation and many other international NGOs, aims to make sure that all public contracts are made available in order to stop corruption before it starts.
As transparency campaigners ourselves, Publish What You Fund is pleased to be a supporter of this new campaign. We felt it was important to lend our voice to the call for transparency as an approach that underpins all government activity.
We campaign for more and better information about aid, because we believe that by opening development flows, we can increase the effectiveness and accountability of aid. We also believe that governments have a duty to act transparently, as they are ultimately responsible to their citizens.
This includes publishing all public contracts that governments put out for tender, from school books to sanitation systems. These publicly tendered contracts are estimated at nearly US$9.5 trillion each year globally, yet many are agreed behind closed doors.
These secret contracts often lead to corruption, fraud and unaccountable outsourcing. If the basic facts about a contract aren’t made publicly available – for how much and to whom to deliver what – then it is not possible to make sure that corruption and abuses don’t happen.
But what do secret contracts have to do with aid transparency, which is what we campaign for at Publish What You Fund? Well, consider the recent finding by the campaign that each year Africa loses nearly a quarter of its GDP to corruption, then consider what that money could have been spent on instead – things like schools, hospitals and roads.
This is money that in many cases is intended to be spent on development. It should be published – through the International Aid Transparency Initiative (IATI), for example – so that citizens can follow the money and hold governments accountable for how it is spent.
But corruption isn’t just a problem in Africa – the Stop Secret Contracts campaign estimates that Europe loses €120 billion to corruption every year.
At Publish What You Fund, we tell the world’s biggest providers of development cooperation that they must publish their aid information to IATI because it is the only internationally-agreed, open data standard. Information published to IATI is available to a wide range of stakeholders for their own needs – whether people want to know about procurement, contracts, tenders or budgets. More than that, this is information that partner countries have asked for.
Governments use tax-payer money to award contracts to private companies in every sector, including development. We believe that any companies that receive public money must be subject to the same transparency requirements as governments when it comes to the goods and services they deliver.
Greater transparency and clearer understanding of the funds that are being disbursed by governments or corporates to deliver public services can only be helpful in building trust and supporting accountability to citizens. Whether it is open aid or open contracts, we need to get the information out of the hands of governments and into the hands of citizens.
Ultimately for us, the question remains how transparency will improve aid – and open contracts are another piece of the aid effectiveness puzzle. Giving citizens full and open access to public contracts is a crucial first step in increasing global transparency. Sign the petition now to call on world leaders to make this happen.
On Thursday, April 17, 2014, from 9:30–11:30 a.m., leaders from the American Library Association (ALA) will participate in “Libraries and Broadband: Urgency and Impact,” a public hearing hosted by the Institute for Museum and Library Services (IMLS) that will explore the need for high-speed broadband in American libraries. Larra Clark, director of the ALA Program on Networks, and Linda Lord, ALA E-rate Task Force Chair and Maine State Librarian, will present on two panels.
The hearing, which takes place during National Library Week (April 13–19, 2014), will explore innovative library practices, partnerships and strategies for serving our communities; share available research on library broadband connections and services; and discuss solutions for improving library connectivity to drive education, community and economic development. During her discussion, Clark will share findings from relevant library research managed by the ALA Office for Research & Statistics, including the IMLS-funded Digital Inclusion Survey and the Public Libraries Funding Technology Access Study, funded by the Bill & Melinda Gates Foundation. Lord will discuss ALA e-rate policy recommendations for boosting libraries toward gigabit broadband speeds.
Federal Communications Commission Chairman Thomas Wheeler will make opening remarks at the hearing, and expert panelists from across the library, technology, and public policy spectrum will explore the issue of high-speed broadband in America’s libraries. IMLS Director Susan H. Hildreth will chair the hearing along with members of the National Museum and Library Services Board, including Christie Pearson Brandau of Iowa, Charles Benton of Illinois, Winston Tabb of Maryland, and Carla Hayden, also of Maryland.
Interested participants may register to attend the event in person at D.C.’s Martin Luther King Jr. Memorial Library. Alternatively, participants can tune into the event virtually, as IMLS will stream the hearing live on YouTube or Google+. Library staff may also participate by submitting written comments sharing their successes, challenges or other input related to library broadband access and use into the hearing record on or before April 24, 2014. Each comment must include the author’s name and organizational affiliation, if any, and be sent to firstname.lastname@example.org. Guidance for submitting testimony is available here (pdf).
The post ALA to participate in IMLS hearing on libraries and broadband appeared first on District Dispatch.
Cartoons made for the Fiesole conference in Cambridge UK Filed under: Doodles Tagged: cambridge, fiesole, ghent, library, scanning