We asked our LITA Midwinter Workshop Presenters to tell us a little more about themselves and what to expect from their workshops in January. This week, we’re hearing from Elizabeth Wickes, who will be presenting the workshop:
Introduction to Practical Programming (For registration details, please see the bottom of this blog post)
LITA: We’ve seen your formal bio but can you tell us a little more about you?
Elizabeth: I once wrote an entire Python program just so I could have a legitimate reason to say “for skittle in skittles.” Attendees will meet this program during the workshop. I can also fix pretty much anything with hot glue.
LITA: Who is your target audience for this workshop?
Elizabeth: This workshop speaks to the librarian or library student who is curious about programming and wants to explore it within a very library-centric context. So many of the existing books and resources on programming are for people with extensive math backgrounds. This workshop will present the core concepts and basic workflows with a humanities voice.
LITA: How much experience with programming do attendees need to succeed in the workshop?
Elizabeth: Any amount is helpful, but nothing is required. I’ll be presenting the topics from the ground up, presuming that folks have never seen any code before.
LITA: If your workshop was a character from the Marvel or Harry Potter universe, which would it be, and why?
Elizabeth: I would say Snape, if I had to pick a character. But hear me out! The topic might seem moody and unapproachable, but on the inside just wants to love! Also, programming is really like potions class, where you are combining lots of little pieces very precisely to somehow produce something shiny and beautiful. My final argument: Alan Rickman.
LITA: Name one concrete thing your attendees will be able to take back to their libraries after participating in your workshop.
Elizabeth: Attendees will leave the workshop with a greater understanding of assessment strategies for material selection and a solid structure on which to build as a self-taught programmer.
LITA: What kind of gadgets/software do your attendees need to bring?
Elizabeth: Participants should bring a laptop (not a tablet) with an operating system they are comfortable using. Macs are easiest to set up but any current computer will work.
LITA: Respond to this scenario: You’re stuck on a desert island. A box washes ashore. As you pry off the lid and peer inside, you begin to dance and sing, totally euphoric. What’s in the box?
Elizabeth: Perhaps I’m singing because the box brought me a singing voice. But seriously, I’d be super excited to get sunscreen in that situation.
LITA members get one third off the cost of Mid-Winter workshops. Use the discount promotional code: LITA2015 during online registration to automatically receive your member discount. Start the process at the ALA web sites:
When you start the registration process and BEFORE you choose the workshop, you will encounter the Personal Information page. On that page there is a field to enter the discount promotional code: LITA2015
If you do so, then when you reach the workshop selection page the discounted price of $235 is automatically displayed and entered. The discounted total will be reflected in the Balance Due line on the payment page.
Please contact the LITA Office if you have any registration questions.
Last week Chris Zammarelli asked Amanda Etches and me for some library website inspiration. So we decided to compile a short list of some sites that we’re liking right now. If we missed one that you really like, please holler!
The huge search box. The visual design of the site is pleasant, but the best part of the HCL website is the catalog integration. Totally into it. Search results are legible, and bib records aren’t filled with junk that people don’t want to see (though additional information is available below).
At 1440 x 900, there’s some odd white space on the left of most pages. (A somewhat minor gripe, to be sure.)
Wish the search box was a bit bigger, but it is in a conventional location so maybe that’s okay. Also, the site uses the classic and ever popular public library audience segmentation of kids/teens/adults. We understand the problem that this solves but think there’s probably a better solution out there somewhere.
The following is a guest post by NDIIPP summer intern Elizabeth Tobey. Liz is a graduate student in the Masters of Library Science program at the University of Maryland.
Along with the fall weather, food, activities and the new layer of clothes that are now necessary, this season also brings us a new and improved Viewshare. The new Viewshare has all the capabilities of the previous version but has a simplified workflow; an improved, streamlined look; and larger and more legible graphics in its views.
Originally launched in 2011, Viewshare is visualization software that libraries, archives and museums can use for free to generate “views” of their digital collections. Users have discovered a multitude of applications for Viewshare, including visualizations of LAM (Library, Archives and Museum) collections’ data, representation of data sets in academic scholarship and student use of Viewshare in library science classwork.
The new version of Viewshare has streamlined the workflow so that users can proceed directly from uploading data sets into creating views. The old Viewshare divided this process into three distinct stages: uploading records, augmenting data fields and creating/sharing views. While all these functions are still part of the Viewshare workflow, the new Viewshare accelerates the process by creating your first view for you directly from the imported data.
Once you have uploaded your data from the web or from a file on your computer, the fields will immediately populate records in a List View of your collection. You can immediately start reviewing the uploaded records in the List View, and if you choose, can begin creating additional views once you save your data set.
List View of uploaded data set.
Once you save your data set, you can start adding new views immediately.
As in the old version of Viewshare, you will need to augment some of your data fields in order to get the best results in creating certain types of views, such as maps based upon geographical location or timelines based upon date. Viewshare still needs to generate latitudinal/longitudinal coordinates for locations and standardize dates, but the augmentation process has been simplified.
In the new Viewshare, you can create an augmented field by clicking on the blue “Add a Property” button and entering information into the dialog box about the augmented field and the fields you wish to base it upon. Here, the user is creating an augmented date field for use in a timeline:
Augmenting fields has also been streamlined.
Once you hit the “Create Property” button, Viewshare automatically starts augmenting the data. A status bar at the top of the window alerts the user when the field has been created successfully. The new field appears at the very top of the field list:
A status bar alerts users to the progress of augmenting fields.
Another great feature of the new Viewshare is that whenever you make changes to a record field (such as changing a field type from text to date), Viewshare saves those changes automatically. (However, you still need to remember to hit the “Save” button for any new views or widgets you create!)
The views in the new Viewshare have larger, more readable graphics than in the previous version. Here is an example of a pie chart showing conference participation data in the old Viewshare:
Old pie chart view.
The pie chart takes up only about a third of the screen width and is tilted at an angle. Here is the same view in the new Viewshare:
Charts and graphics are larger and more legible in the new version.
Here, the pie chart occupies more than half of the screen and is displayed flat rather than tilted. This new style of view renders Viewshare graphics much more legible, especially when projected onto a screen.
Lastly, Viewshare has been redesigned with a simplified, streamlined interface that is as pleasing to the eye as it is easy to use. Unlike the old Viewshare, where a user’s data sets and views were listed under different tabs, the new Viewshare consolidates the list of views into one dashboard:
Navigation has also been streamlined. Instead of multiple navigation options (a top menu and two sets of tabs) in the old Viewshare, the navigation options have been consolidated into a dropdown menu at the upper right-hand corner of the browser window. Thus, it is easier for users to find the information they need.
Some users may wonder whether the new Viewshare will affect existing data sets and views they have created. Viewshare’s designers have already thought of this, and, rest assured, all existing accounts, data sets and views will be migrated from the old version to the new version. Users will still be able to access, view, embed and share data sets that they uploaded in the past.
Many of the changes to Viewshare were influenced directly by user feedback about the older version. Here at the Library of Congress we are eager to hear your suggestions about improving Viewshare and about any problems you encounter in its use. Please feel free to report your problems and suggestions by clicking on the green “Feedback” tab on the Viewshare website. You should also feel free to add your comments and contact information in the comment form below.
Enjoy the rest of fall, and make sure to take time to check out Viewshare’s new features and look!
An important consideration in creating really professional looking closed captions is placing them correctly. I don’t rely on captions, but I do increasingly turn them on to improve my viewing experience. I’ve come to appreciate some attributes of really well done captions. Accuracy is certainly important. The captions should match the words spoken. As someone who can hear, I see inaccurate captions all too often. Thoroughness is another factor. Are all the sounds important for the action represented in captions? Captions will also include a “music” caption, but other sounds, especially those off screen, are often omitted. But accuracy and thoroughness aren’t the only factors to consider when evaluating caption quality.
Placement of captions can be equally important. The captions should not block other important content. They should not run off the edge of the screen. If two speakers are on screen you want the appropriate captions to be placed near each speaker. If a sound or voice is coming from off screen, the caption is best placed as close to the source as possible. These extra clues can help with understanding the content and action. These are the basics. There are other style guidelines for producing good captions. Producing good captions is something of an art form. More than two rows long is usually too much, and rows ought to be split at phrase breaks. Periods should be used to end sentences and are usually the end of a single cue. There’s judgment necessary to have pleasing phrasing.
While there are tools for doing this proper placement for television and burned-in captions, I haven’t found a tool for this for Web video. Even without such a tool, in the following I’ll show you how to:
Control placement of captions for your HTML5 video using cue settings.
Play around with different cue settings to better understand how they work.
Style captions with CSS.
The <video> element has an API which allows you to get a list of all tracks for that video.
Let’s say we have the following video markup which is the only video on the page. This video is embedded far below, so you should be able to run these in the console of your developer tools right now.
Here we get the first video on the page:
You can then get all the tracks (in this case just one) with the following:
Alternately, if your track element has an id you can get it more directly:
Once you have the track you can see the kind, label, and language:
You can also get all the cues as a TextTrackCueList:
In our example we have just two cues. We can also get just the active cues (in this case only one so far):
Now we can see the text of the current cue:
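The console steps above might look roughly like the following. This is a sketch rather than the post's original snippets: the helper name `describeTracks` is made up for illustration, and it relies only on the standard `textTracks`, `kind`, `label`, `language`, `cues`, and `activeCues` properties. If your track element has an id, `document.getElementById(...)` followed by `.track` should reach the same TextTrack object.

```javascript
// Summarize the text tracks of a video element (or any object shaped like one).
// describeTracks is a hypothetical helper name, not part of any browser API.
function describeTracks(video) {
  // textTracks is an array-like TextTrackList; Array.from makes it a real array.
  return Array.from(video.textTracks).map((track) => ({
    kind: track.kind,         // e.g. "captions" or "subtitles"
    label: track.label,       // human-readable label from the track element
    language: track.language, // e.g. "en"
    // cues and activeCues can be null (for example when the track is disabled),
    // so fall back to an empty list in that case.
    cueCount: track.cues ? track.cues.length : 0,
    activeText: Array.from(track.activeCues || []).map((cue) => cue.text),
  }));
}

// In a browser console, run it against the first video on the page.
if (typeof document !== 'undefined') {
  const video = document.querySelector('video');
  if (video) console.log(describeTracks(video));
}
```

Because the helper only reads properties, it can also be tried against a plain object with the same shape, which is handy when no video is on the page.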
Now the really interesting part is that we can change the text of the caption dynamically and it will immediately change:
We can also then change the position of the cue using cue settings. The following will move the first active cue to the top of the video.
The cue can also be aligned to the start of the line position:
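As a rough illustration of the two changes just described, the sketch below rewrites a cue's text and then moves it to the top with start alignment. The function name `pinCueToTop` and the replacement text are assumptions for the example; `text`, `snapToLines`, `line`, and `align` are standard VTTCue properties.

```javascript
// Rewrite a cue's text and move it to the top of the video, start-aligned.
// `cue` is expected to be a VTTCue, for example track.activeCues[0].
// pinCueToTop is a hypothetical helper name used only for this sketch.
function pinCueToTop(cue, newText) {
  if (typeof newText === 'string') {
    cue.text = newText; // the rendered caption should update right away
  }
  cue.snapToLines = true; // treat `line` as a line number, not a percentage
  cue.line = 0;           // line 0 is the first line at the top of the video
  cue.align = 'start';    // align the text to the start of the cue box
  return cue;
}

// Browser console usage (assumes the first track currently has an active cue).
if (typeof document !== 'undefined') {
  const video = document.querySelector('video');
  const track = video && video.textTracks[0];
  const cue = track && track.activeCues && track.activeCues[0];
  if (cue) pinCueToTop(cue, 'Hello from the console!');
}
```

Negative line numbers count up from the bottom of the video instead, so `cue.line = -1` would put the cue on the last line.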
Now for one last trick we’ll add another cue with the arguments of start time and end time in seconds and the cue text:
We’ll set a position for our new cue before we place it in the track:
Then we can add the cue to the track:
And now you should see your new cue for most of the duration of the video.
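Putting those three steps together, a minimal sketch might look like this. The `VTTCue(startTime, endTime, text)` constructor and `TextTrack.addCue()` are standard; the helper name `addTopCue`, the injected `CueClass` parameter (there only so the function can be exercised outside a browser), and the example times and text are assumptions.

```javascript
// Create a cue, position it at the top, then add it to a text track.
// addTopCue and the CueClass parameter are hypothetical, for illustration only.
function addTopCue(track, startSeconds, endSeconds, text, CueClass = globalThis.VTTCue) {
  // Arguments are start time and end time in seconds, then the cue text.
  const cue = new CueClass(startSeconds, endSeconds, text);
  cue.line = 0;      // set the position before placing the cue in the track
  track.addCue(cue); // from here on the browser renders it with the others
  return cue;
}

// Browser console usage: show a cue for the first 30 seconds of the video.
if (typeof document !== 'undefined' && typeof VTTCue !== 'undefined') {
  const video = document.querySelector('video');
  const track = video && video.textTracks[0];
  if (track) addTopCue(track, 0, 30, 'This cue was added from the console.');
}
```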
Playing with Cue Settings
The other settings you can play with include position and size. Position is the text position as a percentage of the width of the video. The size is the width of the cue as a percentage of the width of the video.
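To experiment with these two settings from the console, something like the sketch below could be used. Both `position` and `size` are standard VTTCue properties expressed as percentages; the helper names `clampPercent` and `setCueBox` are invented for this example, and the clamping is an added convenience because browsers generally reject values outside 0 to 100.

```javascript
// Keep a value inside the 0-100 range that percentage cue settings accept.
// clampPercent and setCueBox are hypothetical helper names.
function clampPercent(value) {
  return Math.min(100, Math.max(0, Number(value)));
}

// Set where the cue text sits (position) and how wide the cue box is (size),
// both as percentages of the video width.
function setCueBox(cue, positionPercent, sizePercent) {
  cue.position = clampPercent(positionPercent);
  cue.size = clampPercent(sizePercent);
  return cue;
}

// Browser console usage: a half-width cue positioned a quarter of the way across.
if (typeof document !== 'undefined') {
  const video = document.querySelector('video');
  const track = video && video.textTracks[0];
  const cue = track && track.cues && track.cues[0];
  if (cue) setCueBox(cue, 25, 50);
}
```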
In browsers that support styling of cues (Chrome, Opera, Safari), the demonstration also allows you to apply styling to cues in a few different ways. This CSS code is included in the demo to show some simple examples of styling.
Then the following cue text can be added to show red text with a yellow background. The
In the demo you can see which text styles are supported by which browsers for styling the ::cue pseudo-element. There’s a text box at the bottom that allows you to enter any arbitrary styles and see what effect they have.
A genre comprises a class of communicative events, the members of which share some set of communicative purposes. These purposes are recognized by the expert members of the parent discourse community and thereby constitute the rationale for the genre. This rationale shapes the schematic structure of the discourse and influences and constrains choice of content and style. Communicative purpose is both a privileged criterion and one that operates to keep the scope of a genre as here conceived narrowly focused on comparable rhetorical action. In addition to purpose, exemplars of a genre exhibit various patterns of similarity in terms of structure, style, content and intended audience. If all high probability expectations are realized, the exemplar will be viewed as prototypical by the parent discourse community. The genre names inherited and produced by discourse communities and imported by others constitute valuable ethnographic communication, but typically need further validation.1
As we well know, data is only data until you use it for storytelling and insights. Some people are super talented and can use D3 or other amazing visual tools; just see this great list of resources on Visualising Advocacy. In this 1-hour Community Session, Nika Aleksejeva of Infogr.am shares some easy ways that you can get started with simple data visualizations. Her talk also includes tips for telling a great story and some thoughtful comments on when to use various data viz techniques.
We’d love you to join us and do a skillshare on tools and techniques. Really, we are tool agnostic and simply want to share with the community. Please do get in touch and learn more about Community Sessions.
The Future of Scholarship project aims to build a stronger, better connected network of people interested in open access in the humanities and social sciences. It will serve as a central point of reference for leading voices, examples, practical advice and critical debate about the future of humanities and social sciences scholarship on the web.
If you’d like to join us and hear about new resources and developments in this area, please leave us your details and we’ll be in touch.
For now we’ll leave you with some thoughts on why open access to humanities and social science scholarship matters:
“Open access is important because it can give power and resources back to academics and universities; because it rightly makes research more widely and publicly available; and because, like it or not, it’s beginning and this is our brief chance to shape its future so that it benefits all of us in the humanities and social sciences” – Robert Eaglestone, Professor of Contemporary Literature and Thought, Royal Holloway, University of London.
“For scholars, open access is the most important movement of our times. It offers an unprecedented opportunity to open up our research to the world, irrespective of readers’ geographical, institutional or financial limitations. We cannot falter in pursuing a fair academic landscape that facilitates such a shift, without transferring prohibitive costs onto scholars themselves in order to maintain unsustainable levels of profit for some parts of the commercial publishing industry.” – Dr Caroline Edwards, Lecturer in Modern & Contemporary Literature, Birkbeck, University of London and Co-Founder of the Open Library of Humanities
“If you write to be read, to encourage critical thinking and to educate, then why wouldn’t you disseminate your work as far as possible? Open access is the answer.” – Martin Eve, Co-Founder of the Open Library of Humanities and Lecturer, University of Lincoln.
“Our open access monograph The History Manifesto argues for breaking down the barriers between academics and wider publics: open-access publication achieved that. The impact was immediate, global and uniquely gratifying–a chance to inject ideas straight into the bloodstream of civic discussion around the world. Kudos to Cambridge University Press for supporting innovation!” — David Armitage, Professor and Chair of the Department of History, Harvard University and co-author of The History Manifesto
“Technology allows for efficient worldwide dissemination of research and scholarship. But closed distribution models can get in the way. Open access helps to fulfill the promise of the digital age. It benefits the public by making knowledge freely available to everyone, not hidden behind paywalls. It also benefits authors by maximizing the impact and dissemination of their work.” – Jennifer Jenkins, Senior Lecturing Fellow and Director, Center for the Study of the Public Domain, Duke University
“Unhappy with your current democracy providers? Work for political and institutional change by making your research open access and joining the struggle for the democratization of democracy” – Gary Hall, co-founder of Open Humanities Press and Professor of Media and Performing Arts, Coventry University
I will pat myself on the back (somebody has to). I wrote in the 2004 edition of Complete Copyright, “Fair use cannot be reduced to a checklist. Fair use requires that people think.” This point has been affirmed (pdf) by the Eleventh Circuit Court of Appeals in the long-standing Georgia State University (GSU) e-reserves copyright case. The appeals court rejected the lower court’s use of quantitative fair use guidelines in making its fair use ruling, stating that fair use should be determined on a case-by-case basis and that the four factors of fair use should be evaluated and weighed.
Lesson: Guidelines are arbitrary and silly. Determine fair use by considering the evidence before you. (see an earlier District Dispatch article).
The lower court decision was called a win for higher education and libraries because only five assertions of infringement (out of 99) were actually infringing. Hooray for us! But most stakeholders on both sides of the issue felt that the use of guidelines in weighing the third factor—amount of the work—was puzzling to say the least (but no matter, we won!).
Now that the case has been sent back to the lower court, some assert that GSU has lost the case. But not so fast. This decision validates what the U.S. Supreme Court has long held that fair use is not to be simplified with “bright line rules, for the statute, like the doctrine it recognizes, calls for case-by-case analysis. . . . Nor may the four statutory factors be treated in isolation, one from another. All are to be explored, and the results weighed together, in light of the purposes of copyright.” (510 U.S. 569, 577–78).
Thus, GSU could prevail. Or it might not. But at least fair use will be applied in the appropriate fashion.
That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Philip Schreur of Stanford. We were fortunate that staff from several BIBFRAME testers participated: Columbia, Cornell, George Washington University, Princeton, Stanford and University of Washington. They shared their experiences and tips with others who are still monitoring BIBFRAME developments.
Much of the testers’ focus has been on data evaluation and identifying problems or errors in converting MARC records to BIBFRAME using either the BIBFRAME Comparison Service or Transformation Service. Some have started to create BIBFRAME data from scratch using the BIBFRAME Editor. This raised a concern among managers about how much time and staffing were needed to conduct this testing. Several institutions have followed Stanford’s advice and enrolled staff in the Library Juice Academy series to gain competency in XML and RDF-based systems, a good skill set to have for digital library and linked data work, not just for BIBFRAME. Others are taking Zepheira’s Linked Data and BIBFRAME Practical Practitioner Training course. The Music Library Association’s Bibliographic Control Committee has created a BIBFRAME Task Force focusing on how LC’s MARC-to-BIBFRAME converter handles music materials.
Rather than looking at what MARC data looks like in BIBFRAME, people should be thinking about how RDA (Resource Description and Access) works with BIBFRAME. We shouldn’t be too concerned if BIBFRAME doesn’t handle all the MARC fields and subfields, as many are rarely used anyway. See for example Roy Tennant’s “MARC Usage in WorldCat”, which shows the fields and subfields that are actually used in WorldCat, and how they are used, by format. (Data is available by quarters in 2013 and for 1 January 2013 and 1 January 2014, now issued annually.) Caveat: A field/subfield might be used rarely, but is very important when it occurs. For example, a Participant/Performer note (511) is mostly used in visual materials and recordings; for maps, scale is incredibly important. People agreed the focus should be on the most frequently used fields first.
Moving beyond MARC gives libraries an opportunity to identify entities as “things not strings”. RDA was considered “way too stringy” for linked data. The metadata managers mentioned the desire to use various identifiers, including id.loc.gov, FAST, ISNI, ORCID, VIAF and OCLC WorkIDs. Sometimes transcribed data would still be useful, e.g., a place of publication that has changed names. Many still questioned how authority data fits into BIBFRAME (we had a separate discussion earlier this year on Implications of BIBFRAME Authorities). Core vocabularies need to be maintained and extended in one place so that everyone can take advantage of each other’s work.
Several noted “floundering” due to insufficient information about how the BIBFRAME model was to be applied. In particular, it is not always clear how to differentiate FRBR “works” from BIBFRAME “works”. There may never be a consensus on what a “work” is between “FRBR and non-FRBR people”. Concentrate instead on identifying the relationships among entities. If you have an English translation linked to a German translation linked to a work originally published in Danish, does it really matter whether you consider the translations separate works or expressions?
Will we still have the concept of “database of record”? Stanford currently has two databases of record, one for the ILS and one for the digital library. A triple store will become the database of record for materials not expressed in MARC or MODS. This raised the question of developing a converter for MODS used by digital collections. Melanie Wacker of Columbia is working with a MODS Editorial Committee subgroup that is mapping MODS to BIBFRAME. LC has also been working on a mapping. Colorado College has done some sample MODS to BIBFRAME transformations.
How do managers justify the time and effort spent on BIBFRAME testing to administrators and other colleagues? Currently we do not have new services built upon linked data to demonstrate the value of this investment. The use cases developed by the Linked Data for Libraries project offer a vision of what could be done in a linked data environment that can’t be done now. A user interface is needed to show others what the new data will look like; pulling data from external resources is the most compelling use case.
Semifinalists for the Knight News Challenge will be chosen tomorrow and the refinement period will begin. This is your last chance to show your support for our submission before the next stage of the competition. The Knight Foundation is asking “How might we leverage libraries as a platform to build more knowledgeable communities?” We believe that PeerLibrary closely parallels the theme of the challenge and provides an answer to the foundation’s question. By facilitating a community of independent learners and promoting collaborative reading and discussion of academic resources, PeerLibrary is modernizing the concept of a library in order to educate and enrich the global community. Please help us improve our proposal, give us feedback, and wish PeerLibrary good luck in the next stage of the Knight News Challenge.
LYRASIS has published three open source software case studies on FOSS4LIB.org as part of its continuation of support and services for libraries and other cultural heritage organizations interested in learning about, evaluating, adopting, and using open source software systems.
With support from a grant from The Andrew W. Mellon Foundation, LYRASIS asked academic and public libraries to share their experiences with open source systems, such as content repositories, integrated library systems, and websites. Of the submitted proposals, LYRASIS selected three concepts for development into case studies from Crawford County Federated Library System (Koha), Fenway Libraries Online (Coral), and the University of Chicago Library (Kuali OLE). The three selected organizations then prepared narrative descriptions of their experience and learning, to provide models, advice, and ideas for others.
Each case study details how the organization handled the evaluation, selection, adoption, conversion, and implementation of the open source system. They also include the rationale for going with an open source solution. The case studies all provide valuable information and insights, including:
Actual experiences, both good and bad
Steps, decision points, and processes used in evaluation, selection, and implementation
Factors that led to selection of an open source system
Organization-wide involvement of and impact on staff and patrons
Useful tools created or applied to enhance the open source system and/or expand its functionality, usefulness, or benefit
Plans for ongoing support and future enhancement
Key takeaways from the process, including what worked well, what didn’t work as planned, and what the organization might do differently in the future
The goal of freely offering these case studies to the public is to help cultural heritage organizations use firsthand experience with open source to inform their evaluation and decision-making process, the same objective of FOSS4LIB.org. While open source software is typically available at no cost, these case studies provide tangible examples of the associated costs, time, energy, commitment and resources required to effectively leverage open source software and participate in the community.
“These three organizations expertly outline the in-depth process of selecting and implementing open source software with insight, humor, candor and clarity. LYRASIS is honored to work with these organizations to share this invaluable information with the larger community,” stated Kate Nevins, Executive Director of LYRASIS. “The case studies exemplify the importance of understanding the options and experiences necessary to fully utilize open source software solutions.”
This project took over 7 years and went through a few big iterations. I was just finishing library school when it started and learned a lot from the other advisory board members. I appreciate how the much more experienced folks on the advisory board helped bring me up to speed on issues I was less familiar with, and how they treated me, even though I was just a student.
At the Gender and Sexuality in Information Studies Colloquium the program session I was the most excited about was Porn in the library. There were 3 presentations in this panel exploring this theme.
First, Joan Beaudoin and Elaine Ménard presented The P Project: Scope Notes and Literary Warrant Required! Their study looked at 22 websites that are aggregators of free porn clips. Most of these sites were in English, but a few were in French. Ménard acknowledged that it is risky and sometimes uncomfortable to study porn in the academy. They looked at the terminology used to describe porn videos, specifically the categories available to access porn videos. They described their coding manual which outlined various metadata facets (activity, age, cinematography, company/producers, age, ethnicity, gender, genre, illustration/cartoon, individual/stars, instruction, number of individuals, objects, physical characteristics, role, setting, sexual orientation). I learned that xhamster has scope notes for their various categories (mouseover the lightbulb icon to see).
While I appreciate that Beaudoin and Ménard are taking a risk to look at porn, I think they made the mistake of using very clinical language to legitimize and sanitize their work. I’m curious why they are so interested in porn, but realize that it might be too risky for them to situate themselves in their research.
It didn’t seem like they understood the difference between production company websites and free aggregator sites. Production company sites have very robust and high quality metadata and excellent information architecture. Free aggregator sites have variable quality metadata and likely have a business model that is based on ads or referring users to the main production company websites. Porn is, after all, a content business, and most porn companies are invested in making their content findable, and making it easy for the user to find more content with the same performers, same genre, or by the same director.
Beaudoin and Ménard expressed disappointment that porn companies didn’t want to participate in their study. As these two researchers don’t seem to understand the porn industry or have relationships with individuals in it, I don’t think it’s surprising at all. For them to successfully build on this line of inquiry I think they need to have some skin in the game and clearly articulate what they offer their research subjects in exchange for building their own academic capital.
It was awesome to have a quick Twitter conversation with Jiz Lee and Chris Lowrance, the web manager for feminist porn company Pink and White productions, about how sometimes the terms a consumer might be looking for are prioritized over the performers’ own gender identity.
Jiz Lee is a genderqueer porn performer who uses the pronouns they/them and is sometimes misgendered by mainstream porn and by feminist porn. I am a huge fan of their work.
I think this is the same issue that Amber Billey, Emily Drabinski and K.R. Roberto raise in their paper What’s gender got to do with it? A critique of RDA rule 9.7. They argue that it is regressive for a cataloguer to assign a binary gender value to an author. In both these cases someone (porn company or consumer, or cataloguer) is assigning gender to someone else (porn performer or content creator). This process can be disrespectful, offensive and inaccurate, and it highlights a power dynamic where the consumer’s (porn viewer or researcher/student/librarian) desires/politics/needs/worldview are put above someone’s own identity.
Next, Lisa Sloniowski and Bobby Noble presented Fisting the Library: Feminist Porn and Academic Libraries (which is the best paper title ever). I’ve been really excited about their SSHRC-funded porn archive research. This research project has become more of a conceptual project, rather than building a brick and mortar porn archive. Bobby talked about the challenging process of getting his porn studies class going at York University. Lisa talked about how they initially hoped to start a porn collection as part of York University Library’s main collection, not as a reading room or a marginal collection. Lisa spoke about the challenges of drafting a collection development policy and some of the labour issues, presumably about staff who were uncomfortable with porn having to order, catalogue, process and circulate porn. They also talked about the Feminist Porn Awards and the second feminist porn conference that took place before the Feminist Porn Awards last year.
Finally, Emily Lawrence and Richard Fry presented Pornography, Bomb Building and Good Intentions: What would it take for an internet filter to work? They presented a philosophical argument against internet filters. They argued that for a filter to neither overblock nor underblock it would need to be mind reading and fortune telling. A filter would need to be able to read an individual’s mind and note factors like the person viewing, their values, their mood, etc., and be fortune telling by knowing exactly what information the user was seeking before they looked at it. I’ve been thinking about internet filtering a lot lately, because of Vancouver Public Library’s recent policy change that forbids “sexually explicit images”. I was hoping to get a new or deeper understanding of filtering but was disappointed.
This colloquium was really exciting for me. The conversations that people on the porn in the library panel were having are discussions I haven’t heard elsewhere in librarianship. I look forward to talking about porn in the library more.
Relevance is a complex concept which reflects aspects of a query, a document, and the user as well as contextual factors. Relevance involves many factors such as the user's preferences, task, stage in their information-seeking, domain knowledge, intent, and the context of a particular search. Tom Burton-West, one of the HathiTrust developers, has been working on practical relevance ranking for all the volumes in HathiTrust for a number of years.
This week is Open Access Week all around the world, and from Open Knowledge’s side we are following up on last year’s tradition by putting together a blog post series to highlight great Open Access projects and activities in communities around the world. Every day this week will feature a new writer and activity.
Open Access Week, a global event now entering its eighth year, is an opportunity for the academic and research community to continue to learn about the potential benefits of Open Access, to share what they’ve learned, and to help inspire wider participation in helping to make Open Access a new norm in scholarship and research.
This past year has seen lots of great progress, and with the Open Knowledge blog we want to help amplify this amazing work done in communities around the world:
Tuesday, Jonathan Gray from Open Knowledge: “Open Knowledge work on Open Access in humanities and social sciences”
We’re hoping that this series can inspire even more work around Open Access in the year to come and that our community will use this week to get involved both locally and globally.
A good first step is to sign up at http://www.openaccessweek.org for access to a plethora of support resources, and to connect with the worldwide Open Access Week community. Another way to connect is to join the Open Access working group.
Open Access Week is an invaluable chance to connect the global momentum toward open sharing with the advancement of policy changes on the local level. Universities, colleges, research institutes, funding agencies, libraries, and think tanks use Open Access Week as a platform to host faculty votes on campus open-access policies, to issue reports on the societal and economic benefits of Open Access, to commit new funds in support of open-access publication, and more. Let’s add to their brilliant work this week!
Georgia State University Library. Photo by Jason Puckett via flickr.
On Friday, the U.S. Court of Appeals for the 11th Circuit handed down an important decision in Cambridge University Press et al. v. Carl V. Patton et al. concerning the permissible “fair use” of copyrighted works in electronic reserves for academic courses. Although publishers sought to bar the uncompensated excerpting of copyrighted material for “e-reserves,” the court rejected all such arguments and provided new guidance in the Eleventh Circuit for how “fair use” determinations by educators and librarians should best be made. Remanding to the lower court for further proceedings, the court ruled that fair use decisions should be based on a flexible, case-by-case analysis of the four factors of fair use rather than rigid “checklists” or “percentage-based” formulae.
Courtney Young, president of the American Library Association (ALA), responded to the ruling by issuing a statement.
The appellate court’s decision emphasizes what ALA and other library associations have always supported—thoughtful analysis of fair use and a rejection of highly restrictive fair use guidelines promoted by many publishers. Critically, this decision confirms the importance of flexible limitations on publishers’ rights, such as fair use. Additionally, the appeals court’s decision offers important guidance for reevaluating the lower court’s ruling. The court agreed that the non-profit educational nature of the e-reserves service is inherently fair, and that teachers’ and students’ needs should be the real measure of any limits on fair use, not any rigid mathematical model. Importantly, the court also acknowledged that educators’ use of copyrighted material would be unlikely to harm publishers financially when schools aren’t offered the chance to license excerpts of copyrighted work.
Moving forward, educational institutions can continue to operate their e-reserve services because the appeals court rejected the publishers’ efforts to undermine those e-reserve services. Nonetheless, institutions inside and outside the appeals court’s jurisdiction—which includes Georgia, Florida and Alabama—may wish to evaluate and ultimately fine-tune their services to align with the appeals court’s guidance. In addition, institutions that employ checklists should ensure that the checklists are not applied mechanically.
In 2008, publishers Cambridge University Press, Oxford University Press, and SAGE Publications sued Georgia State University for copyright infringement. The publishers argued that the university’s use of copyright-protected materials in course e-reserves without a license was a violation of copyright law. Previously, in May 2012, Judge Orinda Evans of the U.S. District Court ruled in favor of the university in a lengthy 350-page decision that reviewed the 99 alleged infringements, finding all but five infringements to be fair uses.
This is the third of three posts about the workshop.
Part 1 introduced the Evolving Scholarly Record framework. Part 2 described the two plenary discussions. This part summarizes the breakout discussions.
Following the presentations, attendees divided into breakout groups. There were a variety of suggested topics, but the discussions took on lives of their own. The breakout discussions surfaced many themes that may merit further attention:
Support for researchers
It may be the institution’s responsibility to provide infrastructure to support compliance with mandates, but it is certainly the library’s role to assist researchers in depositing their content somewhere and to ensure that deposits are discoverable. We should establish trust by offering our expertise and familiarity with reliable external repositories, deposit, compliance with mandates, selection, description … and support the needs of researchers during and after their projects. Access to research outcomes involves both helping researchers find and access information they need as inputs to their work and helping them to ensure that their outputs are discovered and accessible by others. We should also find ways to ensure portability of research outputs throughout a researcher’s career. We need to partner with faculty and help them take the long view. We cannot do this by making things harder for the researcher, but by making it seamless, building on the ways they prefer to work.
Adapting to the challenge
We need to retool and reskill to add new expertise: ensuring that processes are retained along with data, promoting licenses that allow reusability, thinking about what repositories can deliver back to research, and adding developers to our teams. When we extend beyond existing library standards, we need to look elsewhere to see what we can adopt rather than create. We need to leverage and retain the trust in libraries, but need resources to do the work. While business models don’t exist yet, we need to find ways to rebalance resources and contain costs. One of the ways we might do that is to build library expertise and funding into the grant proposal process, becoming an integral part of the process from inception to dissemination of results.
Academic libraries should first collect, preserve, and provide access to materials created by those at their institution. How do libraries put a value on assets (to the institution, to researchers, to the wider public)? Not just outputs but also the evidence-base and surrounding commentary. What should proactively be captured from active research projects? How many versions should be retained? What role should user-driven demand play? What is needed to ensure we have evidence for verification and retain results of failed experiments? What need not be saved (locally or at all)? When is sampling called for? What about deselection? While we can involve researchers in identifying resources for preservation, in some cases we may have to be proactive and hunt them down and harvest them ourselves.
Competitiveness (regarding tenure, reputation, IP, and scooping) can inhibit sharing. Timing of data sharing can be important, sometimes requiring an embargo. Privacy issues regarding research subjects must be considered. Researchers may be sensitive about sharing “personal” scientific notes – or sharing data before their research is published. Different disciplines have different traditions about sharing.
Collaboration with others in the university
Policy and financial drivers (mandates, ROI expectations, reputation and assessment) will motivate a variety of institutional stakeholders in various ways. How can expertise be optimized and duplication be minimized? Libraries can’t change faculty behaviors, so need to join together with those with more influence. When Deans see that libraries can address parts of the challenge, they will welcome involvement. When multiple units are employing different systems and services, IT departments and libraries may become key players. There are limits to institutional capacity, so cooperating with other institutions is also necessary.
Collaboration with other stakeholders in a distributed archive across publishers, subjects, nations
We need to understand various solutions for fixity, versioning, and citation. We need to accommodate persistent object identifiers and multiple researcher name identifiers. We need to explore ways to link the various research materials related to the same project. We need to coordinate metadata in objects (e.g., an instrument’s self-generated metadata) with metadata about the objects and metadata about the context. Embedded links need to be maintained. Campus systems may need to interoperate with external systems (such as SHARE). We should help find efficient metrics for assessing researcher impact and enhancing institutional reputation. We should consider collaborating on processes to capture content from social media. In doing these things we should be contributing to developing standards, best practices, and tools.
Attendees of the workshop feel that stewardship efforts will evolve from informal to more formal. Mandates, cost-savings, and scale will motivate this evolution. It is a common good to have a demonstrable historical record to document what is known, to protect against fraud, and for future research to build upon. Failure to act is a risk for libraries, for research, and for the scholarly record.
Future Evolving Scholarly Record workshops will expand the discussion and contribute to identifying topics for further investigation. The next scheduled workshops will be in Washington DC on December 10, 2014 and in San Francisco on June 4, 2015. Watch for more details and for announcements of other workshops on the OCLC Research events page.
Acharya et al. attempt to answer two questions. First, what fraction of the top-cited articles are published in non-elite journals, and how has this changed over time? Second, what fraction of the total citations are to non-elite journals, and how has this changed over time?
For the first question they observe that:
The number of top-1000 papers published in non-elite journals for the representative subject category went from 149 in 1995 to 245 in 2013, a growth of 64%. Looking at broad research areas, 4 out of 9 areas saw at least one-third of the top-cited articles published in non-elite journals in 2013. For 6 out of 9 areas, the fraction of top-cited papers published in non-elite journals for the representative subject category grew by 45% or more.
and for the second that:
Considering citations to all articles, the percentage of citations to articles in non-elite journals went from 27% in 1995 to 47% in 2013. Six out of nine broad areas had at least 50% of citations going to articles published in non-elite journals in 2013.
They summarize their method as:
We studied citations to articles published in 1995-2013. We computed the 10 most-cited journals and the 1000 most-cited articles each year for all 261 subject categories in Scholar Metrics. We marked the 10 most-cited journals in a category as the elite journals for the category and the rest as non-elite.
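The quoted method can be sketched for a single subject category and year as follows. This is a minimal illustration, not the authors' code: the function name, the `(journal, citations)` record layout, and the choice to rank journals by the summed citations of their articles are all assumptions made for the example.

```python
from collections import Counter


def non_elite_shares(articles, n_elite=10, n_top=1000):
    """Compute the two fractions for one category and one year.

    articles: list of (journal, citations) pairs (hypothetical layout).
    Returns (share of top-cited articles in non-elite journals,
             share of all citations going to non-elite journals).
    """
    # Rank journals by total citations (assumption: summed over their articles).
    per_journal = Counter()
    for journal, cites in articles:
        per_journal[journal] += cites
    elite = {journal for journal, _ in per_journal.most_common(n_elite)}

    # Question 1: of the n_top most-cited articles, how many are non-elite?
    top = sorted(articles, key=lambda a: a[1], reverse=True)[:n_top]
    top_non_elite = sum(1 for journal, _ in top if journal not in elite)

    # Question 2: of all citations, how many go to non-elite journals?
    total = sum(cites for _, cites in articles)
    non_elite_cites = sum(c for journal, c in articles if journal not in elite)

    return top_non_elite / len(top), non_elite_cites / total
```

Running this once per category and per year, and then comparing the two fractions across years, would give trends of the kind the paper reports.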
Any thoughts about the validity of the findings? Google has access to high-quality data, so it is unlikely that they are significantly mis-characterizing journals or papers. They examine the questions separately in each of their 261 subject categories, and re-evaluate the top-ranked papers and journals each year.
Do they take into account the overall growth of article publishing in the time frame examined? Their method excludes all but the most-cited 1000 papers in each year, so they consider a decreasing fraction of the total output each year:
The first question asks what fraction of the top-ranked papers appear in top-ranked journals, so the total volume of papers is irrelevant.
The second question asks what fraction of all citations (from all journals, not just the top 1000) are to top-ranked journals. Increasing the number of articles published doesn't affect the proportion of them in a given year that cite top-ranked journals.
What's really going on here? Across all fields, the top-ranked 10 journals in their respective fields contain a gradually but significantly decreasing fraction of the papers subsequently cited. Across all fields, a gradually but significantly decreasing fraction of citations are to the top-ranked 10 journals in their respective fields. This means that authors of cite-worthy papers are decreasingly likely to publish in, read from, and cite papers in their field's top-ranked journals. In other words, whatever value that top-ranked journals add to the papers they publish is decreasingly significant to authors.
Much of the subsequent discussion on liblicense misinterprets the paper, mostly by assuming that when the paper refers to "elite journals" it means Nature, NEJM, Science and so on. As revealed in the quote above, the paper uses "elite" to refer to the top-ranked 10 journals in each of the individual 261 fields. It seems unlikely that a broad journal such as Nature would publish enough articles in any of the 261 fields to be among the top-ranked 10 in that field. Looking at Scholar Metrics, I compiled the following list, showing all the categories (Scholar Metrics calls them subcategories) which currently have one or more global top-10 journals among their "elite journals" in the paper's sense:
Life Sciences & Earth Sciences (general): Nature, Science, PNAS
Health & Medical Sciences (general): NEJM, Lancet, PNAS
Cell Biology: Cell
Molecular Biology: Cell
Oncology: Journal of Clinical Oncology
Chemical & Material Sciences (general): Chemical Reviews, Journal of the American Chemical Society
Only 7 of the 261 categories currently have one or more global top-10 journals among their "elite". Only 3 categories are specific, the other 4 are general. The impact of the global top-10 journals on the paper's results is minimal.
Let's look at this another way. No matter how well their work is regarded by others in their field, researchers in the vast majority of fields have no prospect of ever publishing in a global top-10 journal because those journals effectively don't publish papers in those fields. And if they ever did, the paper is likely to be junk, as illustrated by my favorite example, because the global top-10 journal's stable of reviewers doesn't work in that field. The global top-10 journals are important to librarians, because they look at scholarly communication from the top down; to publishers, because they are important to librarians and so anchor the "big deals"; and to researchers in a small number of important fields. To everyone else, they may be interesting but they are not important.
Acharya et al. conclude:
First, the fraction of top-cited articles published in non-elite journals increased steadily over 1995-2013. While the elite journals still publish a substantial fraction of high-impact articles, many more authors of well-regarded papers in diverse research fields are choosing other venues.
Second, now that finding and reading relevant articles in non-elite journals is about as easy as finding and reading articles in elite journals, researchers are increasingly building on and citing work published everywhere.
Both seem right to me, which reinforces the message that, even on a per-field basis, highly rated journals are not adding as much value as they did in the past (which was much less than commonly thought). Authors of other papers are the ultimate judge of the value of a paper (they are increasingly awarding citations to papers published elsewhere), and of the value of a journal (they are increasingly publishing work that other authors value elsewhere).
On October 27, 2014, the American Library Association (ALA) will host “$2.2 Billion Reasons to Pay Attention to WIOA,” an interactive webinar that will explore ways that public and community college libraries can receive funding for employment skills training and job search assistance from the recently-passed Workforce Innovation and Opportunity Act. The no-cost webinar, which includes speakers from the U.S. Departments of Education and Labor, takes place Oct 27, 2014, from 2:00–3:00 p.m. EDT.
The Workforce Innovation and Opportunity Act allows public and community college libraries to be considered additional One-Stop partners and authorizes adult education and literacy activities provided by public and community college libraries as an allowable statewide employment and training activity. Additionally, the law defines digital literacy skills as a workforce preparation activity.
Moderator: Sari Feldman, president-elect, American Library Association and executive director, Cuyahoga County Public Library
Susan Hildreth, director, Institute of Museum and Library Services
Heidi Silver-Pacuilla, team leader, Applied Innovation and Improvement, Office of Career, Technical, and Adult Education, U.S. Department of Education
Kimberly Vitelli, chief of Division of National Programs, Employment and Training Administration, U.S. Department of Labor
It was a hot, dusty day in Moab, Utah. I drove into town from my beautiful campsite overlooking the La Sal Mountains, where I’d been cycling and exploring the beautiful country. I was taking a few days off from work, and even though I was relaxing, I had a phone call I didn’t want to reschedule. So back to town I went, straight to—naturally—the public library. I had fond memories of the library from a previous visit a few years back: a beautiful building with reliable Wi-Fi. Aside from not being allowed to bring coffee inside, it would be a great place to check email and take a call on the bench outside.
As I entered the library, I decided that transitioning from adventure mode to work mode required, at least, washing some of Moab’s ample sand and dust off of my hands. I washed my hands and what happened next I did automatically, without consideration or contemplation: I cupped my hands and splashed some water on my face. Refreshing! I then wet a paper towel to wipe the sunscreen off of the back of my neck.
It was at about this point that I realized just what was going on; I was the guy bathing in the library restroom!
Half shocked, half amused by my actions, I quickly made sure I didn’t drip anywhere and sully the otherwise very clean and pleasant basin.
I can’t say I’m proud of my mindless act, but it did get me thinking about the very sensitive issue of appropriate behavior in libraries.
I’m not going on a campaign encouraging libraries to offer showers to their patrons, but not because I think the idea is ridiculous. I actually think it is a legitimate potential service offering. That such a service would likely be useful for only a very small segment of library users is one reason why it isn’t worth pursuing.
But as a theoretical concept, I find nothing inherently wrong or illogical with the idea of a library offering showers. It is simply an idea that hasn’t found many appropriate contexts.
Even so, with the smallest amount of imagination I can think of contexts in which this could work. What about a multiuse facility that houses a restaurant, a gym, a coworking space, and a library? Seems like an amazing place. And don’t forget that the new central library in Helsinki, Finland—to be completed in 2017—will feature sauna facilities. These will be contextually and culturally appropriate.
This is about more than showers and saunas. It is about our long-held assumptions and how we react to new ideas. When we’re closed off to concepts without examining them fully, or without exploring the frameworks in which they exist, we’re unlikely truly to innovate or create any radically meaningful experiences. When evaluating new initiatives, we should consider the library less and our communities more. Without this sort of thinking, we’d have never realized libraries with popular materials, web access, and instructional classes, let alone cafés, gaming nights, and public health nurses.
Learning about our contexts—our communities—takes more than facilitating surveys and leading focus groups. After all, those techniques put less emphasis on people and more on their opinions. Even though extra work is required, the techniques aren’t mysterious. There are well-established methods we can use to learn about the individuals in our areas and then design contextually appropriate programs and services.
To the Grand County Public Library in Moab, my apologies for the slight transgression. I did leave the restroom in the same shape as I found it. To everyone else, if you’re in Moab, visit the library. But if you need a place to clean up in that city, try the aquatic center. It has nice pools and clean showers.
As archives increasingly process born-digital collections, one thing is clear: processing digital collections often involves working with tons of email. There is already some great work exploring how to deal with email, but given that it is such a significant problem area it is great to see work focused on developing tools to make sense of this material. Of particular concern is how email is simultaneously so ubiquitous and so messy. I’ve heard cases of repositories needing to deal with hundreds of millions of email objects in a single collection. Beyond that, in actual practice people use email for just about everything, so email records are often a messy mixture of public, private, personal and professional material.
Email? from user tamaleaver on Flickr.
To this end, the ePADD project at Stanford, with the help of an NHPRC grant, is working to produce an open-source tool that will allow repositories and individuals to interact with email archives before and after they have been transferred to a repository. I was lucky enough to sit in on a presentation from the project’s technical advisor, Dr. Sudheendra Hangal, on the status of the project and am thrilled to have this opportunity to discuss work on it with him and his colleagues, Glynn Edwards and Peter Chan, as part of our Insights Interview series. Glynn is the Head of Technical Services in the Manuscripts division in Stanford Libraries and the Manager of the Born-Digital Program, and Peter is a digital archivist at Stanford Libraries.
Trevor: Could you briefly describe the scope and objective of the ePADD project? Specifically, what problem are you working to solve and how are you going about solving it?
Glynn: The ePADD project grew out of earlier experimentation during the Mellon-funded AIMS grant. One of the collections contained 50,000 unique email messages. Peter, who was our new digital archivist, experimented with Gephi (exporting header information to create social network graphs) and Forensic Toolkit (FTK). Neither option managed to provide a suitable tool for processing or to facilitate discovery. FTK did not allow us to flag individual messages that contained personal identifying information for restriction, and neither tool provided a view of the entities within the corpus or exposed them to remote researchers.
Most of the previous email projects and tools we researched were focused specifically on acquisition and preservation. They did not address other core functions of stewardship – appraisal, processing and access (discovery & delivery).
During this experimental period, Peter discovered MUSE (Memories USing Email), a research project in the Mobisocial group of Stanford’s Computer Science Dept. Using NLP and a built-in lexicon, it allowed us to extract entities and to view messages by correspondent or as a graphical visualization of sentiments based on the lexical terms. This was a step in the right direction and we began a multiyear collaboration with Sudheendra Hangal, MUSE’s creator.
The objectives are to create an open-source, Java-based software program built on MUSE that supports different activities aligned with core functions of the digital curation lifecycle: appraisal, accessioning, processing, discovery and delivery. In effect it would allow any user, whether creator, donor, curator, archivist or researcher, to use the collection both before and after transfer to a repository.
Stanford, along with our collaborating partners (NYU, Smithsonian, Columbia, and Bodleian @ Oxford), created and prioritized a set of specifications for the initial development cycle, funded by NHPRC. We also developed and published a beta site to demonstrate our concept for exporting entities and correspondents to facilitate discovery. We have been steadily receiving more email collections. Our most recent acquisition contains over 600,000 unique messages. The grant states that the program will handle at least 250,000 messages – so this latest archive will be more than adequate as a stress test!
Trevor: Could you tell us a bit about the design of the workflow in the tool? How are you envisioning donors and processing archivists working with it?
Peter: The workflow is designed as follows:
Creators of email archives use the appraisal module to scan their archives and identify messages they don’t want to transfer. They can also flag messages as “restricted” and enter annotations to specify the terms of the restriction. The files exported from ePADD will NOT contain the messages flagged as “Do Not Transfer.”
After receiving the files from donors, processing archivists will then identify messages to be restricted according to the policy of their institutions and communication with the donor. Depending on the resources available, processing archivists may want to confirm the email addresses of correspondents suggested by ePADD. Archivists may also want to reconcile the correspondents/person entities extracted with authority records suggested by ePADD. After they finish processing, archivists will output two versions of the archive from ePADD. Neither set contains any restricted messages.
The first set is designed for the discovery module, with all messages redacted barring identified entities (people, places, organizations) and email addresses with masked domain names. This version will be stored on the web server to be used for the discovery module. Public researchers with internet access can browse and search the archives using the discovery module. They will only see a redacted version of the original messages containing extracted entities, but this is still useful to them to get a sense of the entities present in the archive, without being able to see what is said about them.
The second version, designed for the delivery module, will be stored in the reading room computer designated for email delivery. Researchers using the designated computer in the reading room will be able to browse and search the archives. The messages, when displayed, will be the full messages without redaction. Researchers can define their own lexicon to analyze the collection. They may request copies by flagging the messages they need. Public service archivists/librarians can then give the researchers the files according to the policy of their institutions.
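The redaction Peter describes for the discovery set can be illustrated with a small sketch. It assumes the entities have already been extracted and are passed in as plain strings; the function names and the `@...` masking style are hypothetical and do not describe ePADD's actual output format.

```python
import re


def mask_domain(address):
    """Keep the local part of an email address and hide the domain."""
    local, _, _domain = address.partition("@")
    return f"{local}@..."


def redact_for_discovery(body, entities):
    """Return only the known entities that occur in the body.

    Everything else in the message is dropped, so a reader can see
    which entities are present but not what is said about them.
    Entities are returned in order of first appearance.
    """
    found = []
    for entity in entities:
        match = re.search(re.escape(entity), body)
        if match:
            found.append((match.start(), entity))
    return [entity for _, entity in sorted(found)]
```

For example, `redact_for_discovery("Met Jane Doe at Stanford.", ["Stanford", "Jane Doe"])` keeps the two entity names and discards the rest of the sentence.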
Glynn: I would only add that the appraisal module is meant to make it possible for a creator/donor to review their email archives, to create their own lexicon if desired, and to prepare the files for export and final transfer to a repository. During this process they may flag specific messages (individually) or sets of messages (in bulk, by topic or correspondent) as restricted, or elect not to transfer them. We felt this functionality was important to offer a donor for two reasons. First, in the hope that they weed out irrelevant messages or spam! Second, there may be individuals they correspond with who do not want their messages archived – this is the case for one of our collections.
Trevor: How do you imagine archivists using this tool? Further, how do you see it fitting in with the ecosystem of other open-source tools and platforms that act as digital repository platforms and other tools for processing and working with born-digital archival materials like BitCurator and Archivematica?
The social life of email at Enron – a new study from user chieftech on Flickr.
Peter: I consider “processing” of born-digital materials to include both identifying restricted materials AND arranging/describing the intellectual content of the materials. My understanding of BitCurator and Archivematica is that neither offers tools to arrange/describe the intellectual content of the materials. ePADD offers four tools to arrange/describe the intellectual content of email archives. First, it uses a natural language processing library to extract personal names, organizational names and locations in email archives to give researchers a sense of people, organizations and locations in the archives. Second, it gathers all image files in one place for researchers to browse and if necessary go to the messages containing the images.
Third, it offers user-definable lexicons which contain categories of words the system will use to search against the emails so that researchers/archivists can browse emails according to the lexicons they defined. Finally, ePADD reconciles the correspondents and personal names mentioned in messages with the FAST (Faceted Application of Subject Terminology) dataset, which is derived from the Library of Congress Subject Headings. Archivists can then confirm the matches suggested by ePADD. If none of the suggestions are correct, they can enter their own links to the authority records.
I can see people using ePADD to appraise, process, discover and deliver emails and sending the files generated for delivery and discovery to systems using Archivematica for long term preservation.
Trevor: In Sudheendra’s presentation I saw some really interesting things happening with approaches to identifying different distinct email addresses that are associated with the same individual over time in a collection, and some interesting approaches to associating the names of individuals with canonical data for names of people. I think he also illustrated ways that the content of the messages could be identified and associated with subjects. Could you tell us a bit about how this works and how you are thinking about the possibilities and impact on things like archival description that these approaches could have?
Glynn: With email archives – or any born-digital materials – archivists need automated methods to get through large amounts of data. ePADD incorporates several methods of automation to assist with processing of email. Here are three:
1. Correspondents & name resolution
During ingest, ePADD gathers all correspondents and recipients from email headers and performs basic name resolution tasks. When your cursor rolls over a name, different versions that were aggregated appear in a pop-up window. The archivist can go into the back end and override or edit the addresses that are associated with a specific name.
I would direct you to the wonderful documentation on processing and using email archives on the MUSE website. Regarding the resolution of correspondents, Sudheendra states (PDF) in the “MUSE: Reviving Memories Using Email Archives” report that “MUSE performs entity resolution by unifying names and email addresses in email headers when either the name or email address (as specified in the RFC-822 email header) is equivalent. This is essential since email addresses and even name spellings for a person are likely to change in a long-term archive.”
ePADD performs this process during ingest and allows the donor (appraisal module) or archivist (processing module) to correct or edit the email aliases that are automatically bundled together by ePADD at ingestion.
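The unification rule quoted above (two header entries belong together when either the name or the address matches) can be sketched with a small union-find structure. This is only an illustration under stated assumptions: `resolve_correspondents`, the `HEADERS` sample data and the lowercase normalization are all hypothetical, and nothing here is taken from the ePADD or MUSE code.

```ruby
# Toy entity resolution: header entries are merged into one correspondent
# when they share either a display name or an email address.
# Illustration only; not ePADD's or MUSE's actual implementation.

# Takes [name, email] pairs and returns them grouped by resolved correspondent.
def resolve_correspondents(entries)
  parent = {}

  # Find the representative key of a set, compressing the path as we go.
  find = lambda do |key|
    parent[key] ||= key
    parent[key] = find.call(parent[key]) unless parent[key] == key
    parent[key]
  end

  # Link each entry's name and its address into the same set.
  entries.each do |name, email|
    name_key  = [:name, name.strip.downcase]
    email_key = [:email, email.strip.downcase]
    parent[find.call(name_key)] = find.call(email_key)
  end

  entries.group_by { |_name, email| find.call([:email, email.strip.downcase]) }.values
end

# Hypothetical sample headers.
HEADERS = [
  ["Bob Creeley",    "bob@example.edu"],
  ["Robert Creeley", "bob@example.edu"],  # same address, different name
  ["Robert Creeley", "rc@example.org"],   # same name, new address
  ["Bob Dylan",      "dylan@example.com"]
].freeze

resolve_correspondents(HEADERS).each do |group|
  puts group.map { |name, email| "#{name} <#{email}>" }.join(" | ")
end
```

The first three entries end up bundled as one correspondent because they are chained together by a shared address and then a shared name. Automatic bundling like this can also merge people who merely share a name, which is why the manual correction step matters.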
2. Entity extraction & disambiguation
ePADD extracts entities from the email corpus using Apache’s OpenNLP library and checks them against OCLC’s FAST database to identify authorities. In the case of multiple hits on a name, it shows all the matching records and can read data from DBpedia to automatically rank the likelihood of each record being the correct one. The archivist finally confirms which authority record is correct.
Example of how ePADD connects to third party data to disambiguate, aggregate and link names & identities.
Algorithms are also used to help the archivist or researcher understand context while reading a message. For example, suppose a conversation mentions Bob, which could refer to any number of Bobs present in the archive. ePADD analyzes the occurrences of Bob throughout the archive with respect to the text and headers of this message, and thinks: “Hmm…when the name Bob is used with the people copied on this email, and when these other names appear in the message, it’s more likely to be Bob Creeley than other Bobs in the archive like Dylan or Woodward.” It displays a popup with the ranked list of possibilities (see image).
The colored bar underneath each full name indicates the likelihood of that association. This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive’s contents better. If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.
Example of identification of named entities in the Robert Creeley email archive
3. Lexical searches & review
The archivist can use the built-in lexicons or create one in order to tease out the subjects or topics in the archive. MUSE came with a “sentiment” lexicon and ePADD will include another default lexicon based on searching for Personally Identifiable Information and sensitive material. This will include the ability to use regular expressions to identify patterns such as credit card or social security numbers, as well as material that may be governed by FERPA or HIPAA. These lexicons are editable, or one could start from scratch and create a specialized one. The beauty of this is that once the terms are indexed by ePADD, the user can view the messages individually or in a visualization graph.
An ePADD discovery module visualization graph.
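To make the lexicon idea concrete, here is a minimal sketch of a lexicon as categories of patterns that are run against message bodies. The category names, the patterns, the `MESSAGES` sample and the `index_by_lexicon` helper are all assumptions made up for illustration; ePADD’s real lexicon format and matching rules may differ.

```ruby
# Hypothetical lexicon: category name => patterns to search for.
LEXICON = {
  "ssn"         => [/\b\d{3}-\d{2}-\d{4}\b/],
  "credit_card" => [/\b(?:\d{4}[ -]?){3}\d{4}\b/],
  "ferpa"       => [/\btranscript\b/i, /\bgrades?\b/i]
}.freeze

# Returns { category => [ids of messages matching at least one pattern] }.
def index_by_lexicon(messages, lexicon)
  lexicon.each_with_object({}) do |(category, patterns), hits|
    hits[category] = messages
      .select { |_id, body| patterns.any? { |pattern| body.match?(pattern) } }
      .map { |id, _body| id }
  end
end

# Hypothetical sample messages, keyed by message id.
MESSAGES = {
  1 => "My SSN is 123-45-6789, please keep it private.",
  2 => "Lunch on Friday?",
  3 => "Her grades and transcript are attached.",
  4 => "Card number 4111 1111 1111 1111, expires next year."
}.freeze

index_by_lexicon(MESSAGES, LEXICON).each do |category, ids|
  puts "#{category}: messages #{ids.inspect}"
end
```

Patterns this loose will flag any sixteen-digit run as a card number, so a real lexicon would probably need tighter terms to keep false positives manageable during review.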
Trevor: As a follow-up to that question, how is the project conceptualizing the role of the archivist engaging with some of these automated processes for description? Sudheendra showed how an archivist could intervene and accept/reject or tweak the resulting bundling of email addresses and associate them with named entities. With that said, I imagine it would be a huge undertaking, and one that seems inconsistent with an MPLP approach, to have an archivist review all of this metadata. To that end, are there ways the project can enable some level of review of particularly important figures and still communicate which part is automated and which part has been reviewed? Or are there other ways the team is thinking about this kind of issue?
Peter: In view of the large number of correspondents and personal names mentioned in an email archive, reviewing ALL name entities is usually not feasible. Depending on resources we have for each archive, we can review, say the top 1000 most mentioned names in an archive.
Glynn: Agreed. This is similar to processing the analog or paper correspondence in a collection. The archivist usually selects correspondents that are either well known or have substantive letters, whether in form or extent. Not all correspondents in a collection make it into the finding aid as added entries, into folder-level description, or even into a detailed index. With ePADD the top 50 or 100 correspondents (in extent) are easily and automatically identified.
However, because researchers may be interested in entities/correspondents that we do not “process,” we are considering allowing them much of the same functionality in the full text access module in the reading room. One example would be allowing the researcher to create a new lexicon and search by their own terms.
Identifying what’s been processed is a work in progress. We still need to build in some administrative features – such as scope and content notes – to let the researcher know the types/depth of actions performed.
Trevor: How are you thinking about authenticity of records in the context of this project? That is, what constitutes the original and authentic format of these records and how does the project work to ensure the integrity of those records over time. Similarly, how are you thinking about documenting decisions and actions taken in the appraisal process on the records?
Peter: According to “ISO 15489-1, Information and documentation–Records Management,” an authentic (electronic) record is one that can be proven:
a) to be what it purports to be,
b) to have been created or sent by the person purported to have created or sent it, and
c) to have been created or sent at the time purported.
Format is not part of the requirements for an authentic electronic record. One of the reasons is that electronically produced documents actually are not objects at all but rather, by their nature, products that have to be processed each time they are used. There is no transfer, no reading without a re-creation of the information. Furthermore, electronic records are at risk because of technical obsolescence as newer formats replace older ones.
ePADD does not address the issue of authenticity in this round of funding. This issue is definitely important and complicated and I would like to address it in the future.
Trevor: What lessons has the team learned so far about working with email archives? Are there any assumptions or thoughts you had about working with email as records that have evolved or changed while working on the project?
Peter: Conversion of old archived emails can be tricky. Even though normalization is not within the scope of ePADD, people still need to convert emails to mbox format before ePADD can work on them. One of our partners found headers missing from emails when looking at them in ePADD. The emails came from old GroupWise accounts that were migrated into Outlook and then converted to mbox. Is this a conversion error when converting GroupWise emails to Outlook? Or when converting the Outlook emails to mbox? Or when ePADD parses the emails?
Attachment files come in diverse file formats. The ability to view files in attachments is an important feature for a system like ePADD. Apache OpenOffice can read files in ~50 file formats. On the other hand, QuickView Plus can view files in ~500 file formats. Should we integrate commercial software into ePADD in order to view files in the 450 file formats that Apache OpenOffice cannot read? If yes, ePADD will not be an open-source project anymore. If no, ePADD users have to face the fact that there are files they are not able to view.
Glynn: The sheer volume of data to review can be very daunting. The more specific the lexicon terms used to automatically index messages, the better. You want to discover messages that should be restricted but not have too many false positives to wade through during review.
The ability to process in bulk cannot be stressed too much. When you perform actions on a set of messages, whether from a lexical result, a correspondent or a user-defined search, ePADD allows you to apply any action to that entire subset. You can also apply actions to original folders. For example, if messages are organized into a folder marked “human resources,” the archivist or donor may choose to flag all the messages in that folder as “restricted until 2050.”
Trevor: What are the next steps for the project? What sorts of things are you exploring for the future?
Peter: I would like to look at the topics/concepts exchanged in emails (and match them against the Library of Congress Subject Headings – Topical). It would be interesting to know what books and movies were mentioned in emails. Publishing extracted entities as linked open data is definitely one thing I would like to do as well. However, it all depends on funding.
Glynn: This is the fun part – envisioning what else is needed or desired in future iterations. It is, however, reliant on funding and collaboration. Input is needed across different types of institutions – museums, government, academic, corporate to name a few. While many of the use cases would be similar, there are unique aspects or goals for different institutions.
Over the past few weeks, we’ve taken part in the NDSA-sponsored meeting (see Chris Prom’s blog post) and held ePADD’s first Advisory Group meeting. These sparked some wonderful discussions and ideas about next steps for greater discovery, delivery and collaboration.
There is a definite need in the profession to begin defining and documenting use cases, to analyze and document life cycles of email archives and existing tools in order to evaluate gaps and future needs, to further discovery through exporting correspondents and extracted entities from ePADD and publishing them with a dynamic search interface across archives. Other avenues we would like to explore are the ability to process and deliver other document types (beyond email), including social media.
The final delivery or access module is intended for reading room access, and we hope to provide more robust tools to allow user interaction with the archives. Additionally, we would like to offer data dumps for text mining/analysis or extractions of header information for social network analysis. Currently these are managed by correspondence through Special Collections.
One suggestion from our Advisory Group was to broaden use of ePADD before final release in the summer (2015). By allowing other repositories to use ePADD for processing we would expose more email collections for researcher use and hopefully get more feedback for the development and specification teams. This will be a better demonstration and test of the program. To this end we plan to release ePADD beyond our grant collaborators to other institutions that have already expressed strong interest.
Sudheendra: Glynn and Peter have answered your questions wonderfully, so I’ll just jump in with a little bit of speculation. In the last couple of decades, we’re seeing that a lot of our lives are reflected in our online activities, be it email, blogs, Facebook, Twitter or any other medium. A small example: a cousin of mine spent almost a year organizing a major dance performance for her daughter. She was reflecting on the effort, and exclaimed to me: “That was so much work. You know, I should save all those emails!” I think that is very telling. All of us have wonderful stories in our lives. There are moments of joy and exasperation, love and sorrow, accomplishment and failure, and they are often captured in our electronic communications. We should be able to preserve them, reflect on them, and hand them over to future generations. We already do this with photographs, which are wonderful. However, text-based communications are complementary to images because they capture thoughts, feelings and intentions in a way that images do not.
Unfortunately, the misuse of personal data for commercial or surveillance reasons is causing many people to be wary of preserving their own records, and even to go out of their way to delete them. This is a pity, because there is so much value buried in archives, if only users could keep their data under their own control, and have good tools with which to make sense of it. So in the next decade, I predict that individuals and families will routinely use tools like ePADD to preserve history important to them. We’re all archivists in that sense.
Online education has extended its presence to public libraries. Online learning and career training, by services such as Ed2Go and Lynda, are usually offered free to college and university students. Similar services such as Gale Courses, Universal Class and Treehouse are geared toward public library use. Gale Courses is a subscription service of Cengage Learning. It is a hybrid of Ed2Go, offering courses that range from GED preparation to PC Security. Courses are six weeks in length and are instructor led. Universal Class offers hundreds of courses on a variety of topics, including dog obedience training, to patrons of diverse interests. Courses are self-paced and users can begin a course at any time. Treehouse is uniquely geared toward web design, development and programming for personal computers and mobile device applications. Users can select self-paced educational Tracks that are focused on a specific development area.
An alternative to MOOCs
A considerable population of the general public cannot afford to pursue a formal education. Extending the services of the library into web-based learning, online courses provide access to continuing education for the general public. The mention of free online education is not complete without a nod to massive open online courses (MOOC). MOOCs can be non-profit or commercial. They offer free or affordable online education, of varying course structure, to students around the world. Though MOOCs and open courseware are comparable alternatives, library-hosted continuing education offers additional incentives from those of most freely available online courses.
Education as a service
One advantage to using a service provided by the public library is that patrons can use the computers available on site. Patrons lacking home computer access can thereby incorporate another library service into their education. Continuing education courses are free to library card holders at participating libraries. If your regional library does not offer the service, you can always purchase a library card from a participating library. Considering that each course can range from $50 to the mid $100s, the benefit of access to hundreds of courses outweighs the cost of purchasing a library card. Patrons will receive a certificate of completion for each completed course, and in the case of Universal Class they will receive continuing education units that are approved by the International Association for Continuing Education and Training (IACET). Treehouse instead uses a point-based system and Badges, digital awards that signify a user’s progress. Online education also helps to highlight the public library as an evolving source of public information.
All three continuing education providers offer free trials and demo courses for anyone interested in their services.
My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.
I’m in the middle of updating my app that uses multi-threaded concurrency in unusual ways to Rails4. The good news is that the significant bugs I ran into in Rails 3.1 etc, reported in the earlier post have been fixed.
However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.
Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.
You can check out a connection explicitly using `checkout` and `checkin` methods. Or, better yet use the `with_connection` method to wrap database use. So far so good.
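To make the checkout/checkin contract concrete, here is a minimal sketch of a toy pool in plain Ruby. `ToyPool` is a made-up stand-in, not ActiveRecord’s ConnectionPool; the connections are just strings, and only the method names `checkout`, `checkin` and `with_connection` mirror the ones mentioned above.

```ruby
# Toy stand-in for a connection pool; NOT ActiveRecord's implementation.
class ToyPool
  def initialize(size)
    @available = Queue.new
    size.times { |i| @available << "conn-#{i}" }
  end

  # Blocks until a connection is free, then hands it to the caller.
  def checkout
    @available.pop
  end

  # Returns a connection to the pool so another thread can use it.
  def checkin(conn)
    @available << conn
  end

  # Checks out, yields, and always checks back in, even on an exception.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end

  def available_count
    @available.size
  end
end

pool = ToyPool.new(2)
begin
  pool.with_connection { |conn| raise "query failed on #{conn}" }
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end
puts "available after error: #{pool.available_count}"
```

The `ensure` is the whole point: the block form cannot forget the checkin, which is why it is the safer choice than a bare `checkout`. In ActiveRecord itself the equivalent call is `ActiveRecord::Base.connection_pool.with_connection { ... }`.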
But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.
And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)
And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before. A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.
And after the request had been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back connections, if any, that were checked out.
The danger of leaked connections
So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).
But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do. If you proceed blithely to use AR like you are used to in Rails, but have created Threads yourself, then connections will be automatically checked out to you when needed… and never checked back in.
The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.
And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it. If you leak one connection like this, that’s one less connection available in the ConnectionPool. If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.
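The failure mode is easy to reproduce with a toy pool. The sketch below is an assumption-laden illustration, not ActiveRecord code: `LeakyPool`, its `ConnectionTimeout` class and the 0.1 second timeout are all made up so the demo finishes quickly.

```ruby
require "timeout"

# Toy pool with a checkout timeout; NOT ActiveRecord's implementation.
class LeakyPool
  class ConnectionTimeout < StandardError; end

  def initialize(size, checkout_timeout)
    @checkout_timeout = checkout_timeout
    @available = Queue.new
    size.times { |i| @available << "conn-#{i}" }
  end

  # Waits up to checkout_timeout seconds for a free connection, then gives up.
  def checkout
    Timeout.timeout(@checkout_timeout) { @available.pop }
  rescue Timeout::Error
    raise ConnectionTimeout, "no connection free after #{@checkout_timeout}s"
  end

  def checkin(conn)
    @available << conn
  end
end

LEAK_POOL = LeakyPool.new(2, 0.1)

# Two threads each take a connection and die without checking it back in.
2.times.map { Thread.new { LEAK_POOL.checkout } }.each(&:join)

begin
  LEAK_POOL.checkout
rescue LeakyPool::ConnectionTimeout => e
  puts "pool exhausted: #{e.message}"
end
```

Both connections are still "checked out" to threads that no longer exist, so every later caller waits out the timeout and fails.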
In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed. You could call this method from time to time yourself to try and clean up after yourself.
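The cross-referencing idea can be sketched in a few lines. This is a hypothetical toy (`TrackingPool` and `reclaim_stale!` are invented names), written only to show the technique of comparing connection owners against live threads; it is not the Rails 3.x source.

```ruby
# Toy pool that remembers which thread owns each connection.
# Illustration only; NOT the Rails 3.x implementation.
class TrackingPool
  def initialize(size)
    @lock = Mutex.new
    @free = Array.new(size) { |i| "conn-#{i}" }
    @owners = {} # connection => Thread that checked it out
  end

  def checkout
    @lock.synchronize do
      conn = @free.pop
      raise "pool exhausted" if conn.nil?
      @owners[conn] = Thread.current
      conn
    end
  end

  # Returns to the pool every connection whose owning thread is dead,
  # and reports how many were reclaimed.
  def reclaim_stale!
    @lock.synchronize do
      stale = @owners.select { |_conn, thread| !thread.alive? }.keys
      stale.each do |conn|
        @owners.delete(conn)
        @free.push(conn)
      end
      stale.size
    end
  end

  def free_count
    @lock.synchronize { @free.size }
  end
end

STALE_POOL = TrackingPool.new(2)
Thread.new { STALE_POOL.checkout }.join # this thread dies holding a connection
STALE_POOL.checkout                     # the main thread holds one legitimately

puts "free before reclaim: #{STALE_POOL.free_count}"
puts "reclaimed: #{STALE_POOL.reclaim_stale!}"
puts "free after reclaim: #{STALE_POOL.free_count}"
```

Only the dead thread’s connection comes back; the one held by the still-running main thread is left alone. The scan takes the pool lock and walks every checked-out connection, which hints at why doing this routinely is not cheap.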
But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually. As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.
So this makes it pretty important to avoid leaking connections.
(Note: There is still a method `clear_stale_cached_connections!` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup. That it uses the same method name, I think, is based on misunderstanding by Rails devs of what it’s doing. See Fear the Reaper below.)
Monkey-patch AR to avoid leaked connections
I understand where Rails is coming from with the ‘implicit checkout’ thing. For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).
So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.
The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection, that never gets checked back in, gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.
That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, and so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).
So, here, in a gist, is a couple-dozen-line monkey patch to ActiveRecord that lets you, on a thread-by-thread basis, disable the “implicit checkout”. Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do in that thread is call `forbid_implicit_checkout_for_thread!`.
Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.
If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.
This way you can force your code to use only `with_connection`, like it should.
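The idea behind the patch can be shown on a toy pool without touching ActiveRecord. Everything below is a hypothetical sketch: `GuardedPool`, its error class and the `Thread.current` keys are invented for illustration, and the real patch in the gist works on ActiveRecord’s own classes and raises its own error.

```ruby
# Toy pool with lazy "implicit" checkout that a thread can opt out of.
# Illustration only; NOT the monkey patch itself.
class GuardedPool
  class ImplicitCheckoutForbidden < StandardError; end

  def initialize
    @free = Queue.new
    @free << "conn-0"
  end

  # Marks the calling thread as not allowed to check out implicitly.
  def forbid_implicit_checkout_for_thread!
    Thread.current[:toy_implicit_forbidden] = true
  end

  # Explicit, scoped checkout: always checks the connection back in.
  def with_connection
    conn = @free.pop
    Thread.current[:toy_conn] = conn
    yield conn
  ensure
    Thread.current[:toy_conn] = nil
    @free << conn if conn
  end

  # What model code would call: reuse the thread's connection if it has one,
  # otherwise fall back to a lazy, implicit checkout unless forbidden.
  def connection
    current = Thread.current[:toy_conn]
    return current if current
    if Thread.current[:toy_implicit_forbidden]
      raise ImplicitCheckoutForbidden, "use with_connection in this thread"
    end
    Thread.current[:toy_conn] = @free.pop
  end
end

GUARDED = GuardedPool.new
worker = Thread.new do
  GUARDED.forbid_implicit_checkout_for_thread!
  inside = GUARDED.with_connection { GUARDED.connection } # allowed
  outside = begin
    GUARDED.connection # fails fast instead of leaking
  rescue GuardedPool::ImplicitCheckoutForbidden => e
    "refused: #{e.message}"
  end
  [inside, outside]
end
puts worker.value.inspect
```

The error surfaces at the exact line that tried to use a connection outside `with_connection`, instead of showing up much later as an exhausted pool.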
Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.
DO fear the Reaper
In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections. In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”
The problem is, as far as I can tell by reading the code, it simply does not do this.
What does the reaper do? As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.
A leaked connection hasn’t necessarily dropped its network connection. That really depends on the database and its settings: most databases will drop unused connections after a certain idle timeout, by default often hours long. A leaked connection probably hasn’t yet had its network connection closed, and a properly checked out not-leaked connection can have its network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).
The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem: connections whose network link has dropped, not connections that were checked out but never checked back in. Dropped network is a legit problem you want to be handled gracefully; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, and I have not put it through its paces myself). But it’s got nothing to do with leaked connections.
Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)
So, yeah, Rails4 has redefined the existing `clear_stale_cached_connections!` method to do something entirely different than it did in Rails3, and it’s triggered in entirely different circumstances. Yeah, kind of confusing.
Oh, maybe fear ruby 1.9.3 too
When I was working on upgrading the app, I was occasionally getting a mysterious deadlock exception:
ThreadError: deadlock; recursive locking:
In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, that my errors resulted in that exception rather than a more meaningful one may possibly have been a bug in ruby 1.9.3 that’s fixed in ruby 2.0.
If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.
Can you use an already loaded AR model without a connection?
Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?
Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.
Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.
But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).
Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection, it doesn’t really consider that one way or another.
Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.
Concurrency Patterns to Avoid in ActiveRecord?
Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).
So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.
If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.
I think you’re probably fairly safe doing that too, and it is the way background task pools are often set up.
That’s not what my app does. I wouldn’t necessarily design my app the same way today if I was starting from scratch (the app was originally written for Rails 1.0, which gives you a sense of how old some of its design choices are; although the concurrency related stuff really only dates from relatively recent rails 2.1 (!)).
My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote APIs, which is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communications through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)
So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine, just wrap all the AR access in a `with_connection`. But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and just parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.
I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point, I’d recommend avoiding using ActiveRecord concurrency this way.
What to do?
What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.
At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.” DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).
Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).
I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel). I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.
So what would I do differently? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other safer main thread, convert it to a pure business object without any AR, and pass that into my worker threads. Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for its entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated threadpool of writers.
That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I use now, which it officially supports but isn’t very happy about.
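As a rough illustration of that shape, here is a minimal sketch. It is not code from the app, and it is in Python rather than Ruby only to keep it self-contained. Short-lived workers take plain data in and push plain results onto a thread-safe queue, and a small fixed pool of writers each hold one connection for their whole lifetime. `FakeConnection`, `worker`, `writer`, and `run` are all hypothetical names, and the "connection" just appends to a list.

```python
import queue
import threading

SENTINEL = object()  # placed on the queue to tell one writer to shut down


class FakeConnection:
    """Hypothetical stand-in for a database connection owned by one writer."""

    def __init__(self, store, lock):
        self._store = store
        self._lock = lock

    def save(self, record):
        with self._lock:
            self._store.append(record)


def writer(results, store, lock):
    conn = FakeConnection(store, lock)  # acquired once, kept until the thread exits
    while True:
        item = results.get()
        if item is SENTINEL:
            break
        conn.save(item)


def worker(job, results):
    # Plain data in, plain data out: no database access in this thread.
    results.put({"job_id": job["id"], "output": job["url"].upper()})


def run(jobs, writer_count=2):
    results = queue.Queue()
    store, lock = [], threading.Lock()

    writers = [threading.Thread(target=writer, args=(results, store, lock))
               for _ in range(writer_count)]
    for t in writers:
        t.start()

    # An indeterminate number of short-lived workers, one per job.
    workers = [threading.Thread(target=worker, args=(job, results)) for job in jobs]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    # All results are queued by now; one sentinel per writer ends the pool.
    for _ in writers:
        results.put(SENTINEL)
    for t in writers:
        t.join()
    return store
```

The point of the design is that only the fixed-size writer pool ever needs a connection, so the connection pool can be sized to the number of writers no matter how many workers come and go.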
The time rail or progress bar on video players gives the viewer some indication of how much of the video they’ve watched, what portion of the video remains to be viewed, and how much of the video is buffered. The time rail can also be clicked on to jump to a particular time within the video. But figuring out where in the video you want to go can feel kind of random. You can usually hover over the time rail and move from side to side and see the time that you’d jump to if you clicked, but who knows what you might see when you get there.
Some video players have begun to use the time rail to show video thumbnails on hover in a tooltip. For most videos these thumbnails give a much better idea of what you’ll see when you click to jump to that time. I’ll show you how you can create your own thumbnail previews using HTML5 video.
We usually follow agile practices in our archival processing. This style of processing was popularized by the article More Product, Less Process: Revamping Traditional Archival Processing by Mark A. Greene and Dennis Meissner. For instance, we don’t read every page of every folder in every box of every collection in order to describe it well enough for us to make the collection accessible to researchers. Over time we may decide to make the materials for a particular collection or parts of a collection more discoverable by doing the work to look closer and add more metadata to our description of the contents. But we try not to let the perfect be the enemy of the good enough. Our goal is to make the materials accessible to researchers and not hidden in some box no one knows about.
At NCSU Libraries we have begun digitizing more archival videos. And for these videos we’re much more likely to treat them like other archival materials. We’re never going to watch every minute of every video about cucumbers or agricultural machinery in order to fully describe the contents. Digitization gives us some opportunities to automate the summarization that would be manually done with physical materials. Many of these videos don’t even have dialogue, so even when automated video transcription is more accurate and cheaper we’ll still be left with only the images. In any case, the visual component is a good place to start.
Video Thumbnail Previews
When you hover over the time rail on some video viewers, you see a thumbnail image from the video at that time. YouTube does this for many of its videos. I first saw that this would be possible with HTML5 video when I saw the JW Player page on Adding Preview Thumbnails. From there I took the idea to use an image sprite and a WebVTT file to structure which media fragments from the sprite to use in the thumbnail preview. I’ve implemented this as a plugin for Mediaelement.js. You can see detailed instructions there on how to use the plugin, but I’ll give the summary here.
1. Create an Image Sprite from the Video
This uses ffmpeg to take a snapshot every 5 seconds in the video and then uses montage (from ImageMagick) to stitch them together into a sprite. This means that only one file needs to be downloaded before you can show the preview thumbnail.
2. Create a WebVTT metadata file
This is just a standard WebVTT file except the cue text is metadata instead of captions. The URL is to an image and uses a spatial Media Fragment for what part of the sprite to display in the tooltip.
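To make the file format concrete, here is a minimal sketch in Python that generates such a metadata file. It is not the tooling used for the plugin. The function names, the 150x100 thumbnail size, the five-column sprite layout, and the example URL are all assumptions for illustration; the 5-second interval follows the snapshot interval mentioned above.

```python
def timestamp(seconds):
    """Format whole seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return "%02d:%02d:%02d.000" % (hours, minutes, secs)


def thumbnail_vtt(sprite_url, count, interval=5, width=150, height=100, columns=5):
    """Build WebVTT text with one cue per thumbnail in a grid-shaped sprite.

    Each cue's text is the sprite URL plus a spatial media fragment
    (#xywh=x,y,w,h) that selects one thumbnail from the sprite.
    """
    lines = ["WEBVTT", ""]
    for i in range(count):
        x = (i % columns) * width    # column position within the sprite
        y = (i // columns) * height  # row position within the sprite
        lines.append("%s --> %s" % (timestamp(i * interval),
                                    timestamp((i + 1) * interval)))
        lines.append("%s#xywh=%d,%d,%d,%d" % (sprite_url, x, y, width, height))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(thumbnail_vtt("http://example.org/sprite.jpg", 3))
```

The grid arithmetic has to match however the sprite was actually tiled, so `columns`, `width`, and `height` would need to agree with the montage settings.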
3. Add the Video Thumbnail Preview Track
Put the following within the <video> element.
4. Initialize the Plugin
The following assumes that you’re already using Mediaelement.js, jQuery, and have included the vtt.js library.
One of the DOM API features I hadn’t used before is MutationObserver. One thing the thumbnail preview plugin needs to do is know what time is being hovered over on the time rail. I could have calculated this myself, but I wanted to rely on MediaElement.js to provide the information. Maybe there’s a callback in MediaElement.js for when this is updated, but I couldn’t find it. Instead I use a MutationObserver to watch for when MediaElement.js changes the DOM for the default display of a timestamp on hover. Looking at the time code there then allows the plugin to pick the correct cue text to use for the media fragment. MutationObserver is more performant than the now deprecated MutationEvents. I’ve experienced very little latency using a MutationObserver which allows it to trigger lots of events quickly.
The plugin currently only works in the browsers that support MutationObserver, which is most current browsers. In browsers that do not support MutationObserver the plugin will do nothing at all and just show the default timestamp on hover. I’d be interested in other ideas on how to solve this kind of problem, though it is nice to know that plugins that rely on another library have tools like MutationObserver around.
This plugin is brand new and works for me, but there are some caveats. All the images in the sprite must have the same dimensions. The durations for each thumbnail must be consistent. The timestamps currently aren’t really used to determine which thumbnail to display; instead the lookup is faked by relying on the consistent durations. The plugin just does some simple addition and plucks out the correct thumbnail from the array of cues. Hopefully in future versions I can address some of these issues.
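The consistent-duration shortcut can be sketched as follows. This is an illustration in Python, not the plugin's actual JavaScript, and it uses integer division where the plugin is described as using simple addition; `cue_index` and its parameters are hypothetical names.

```python
def cue_index(hover_seconds, cue_duration, cue_count):
    """Pick which cue to show, assuming every cue covers the same duration."""
    index = int(hover_seconds // cue_duration)
    # Clamp so hovering at or past the end still returns a valid cue.
    return max(0, min(index, cue_count - 1))
```

With 5-second cues, a hover at 12.3 seconds falls in the third cue (index 2). The shortcut breaks as soon as cue durations vary, which is why using the real cue timestamps would be the more robust approach.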
Since we already had the sprite images for the time rail hover preview, I created another interface to allow a user to jump through a video. Under the video player is a control button that shows a modal with the thumbnail sprite. The sprite alone provides a nice overview of the video that allows you to see very quickly what might be of interest. I used an image map so that the rather large sprite images would only have to be in memory once. (Yes, image maps are still valid in HTML5 and have their legitimate uses.) jQuery RWD Image Maps allows the map area coordinates to scale up and down across devices. Hovering over a single thumb will show the timestamp for that frame. Clicking a thumbnail will set the current time for the video to be the start time of that section of the video. One advantage of this feature is that it doesn’t require the kind of fine motor skill necessary to hover over the video player time rail and move back and forth to show each of the thumbnails.
This feature was added and deployed to production just this week, so I’m looking for feedback on whether folks find it useful, how to improve it, and any bugs that are encountered.
I expect that automated summarization services will become increasingly important for researchers as archives do more large-scale digitization of physical collections and collect more born digital resources in bulk. We’re already seeing projects like fondz which autogenerates archival description by extracting the contents of born digital resources. At NCSU Libraries we’re working on other ways to summarize the metadata we create as we ingest born digital collections. As we learn more what summarization services and interfaces are useful for researchers, I hope to see more work done in this area. And this is just the beginning of what we can do with summarizing archival video.
Most of the conferences I go to are technology ones that are focused on practical applications and knowledge sharing on how we have solved specific technical problems or figured out new, more efficient ways to do old things. It’s been a long time since I’ve been to a conference that’s about broader ideas and a much longer time since I’ve been to an academic conference. This was outside my comfort zone and it was an extremely worthwhile experience.
There were 100 attendees. I’d estimate that library and information studies professors and PhD students made up 50%, library school grad students made up 25%, and the other 25% of us were practitioners, who work almost exclusively in academic settings. The conference participants had the best selection of glasses, and I was inspired to document some of them.
The program was great and I had a very hard time picking which of the 3 streams I wanted to attend. A few people scampered between rooms to catch papers in different streams. Program highlights for me were the panel on porn in the library and the panel on gender and content. My thoughts on the porn in the library panel became a bit long, so I’ll post those tomorrow.
In my opinion it was a shame that most of the presenters defaulted to a traditional academic style of conference presentation, that is, they stood at the front of the room and read their papers to the audience without making much eye contact. For me the language was sometimes unnecessarily dense, and many of the theoretical concepts discussed would’ve been more successful if expressed in plain English.
I was also disappointed that there wasn’t a plan to post the papers online. Lisa explained to me that for those librarians and scholars in a university environment publications are important to tenure and promotion. Conference presentations count, but not as much as peer reviewed publications, which don’t count as much as book publications. I know there’s a plan in the works for an edition of Library Trends that will be published in 2 years. Also, I know from the interest on Twitter that there are many people who weren’t able to travel to Toronto and attend in person who are very hungry to read these papers. For the technology conferences I go to it is standard to share as much as possible: to livestream the conference, to archive the Twitter stream, to post presentations online, and to make code public too. I hope that most of the presenters will figure out a way to share their work openly without it costing them in academic prestige. There’s got to be a way to do this.
There was a really magical feeling at this first colloquium on gender and sexuality in LIS. Everyone brought their smarts, ideas and generous spirits. I think a lot of us have been starved for this kind of environment, engagement and community.
My brain, heart and sinuses are full. I’m exhausted and heading home to Vancouver. This one day of connections and ideas will keep me going for another year. Kudos to the organizers Emily Drabinski, Patrick Keilty and Litwin Books for organizing this. I’m hungry for more.
[Note to readers: sick and tired of it all, I am going to report these "incidents" publicly because I just can't hack it anymore.]
I was in a meeting yesterday about RDF and application profiles, in which I made some comments, and was told by the co-chair: "we don't have time for that now", and the meeting went on.
Today, a man who was not in the meeting but who listened to the audio sent an email that said:
"I agree with Karen, if I correctly understood her point, that this is "dangerous territory". On the call, that discussion was postponed for a later date, but I look forward to having that discussion as soon as possible because I think it is fundamental."
And he went on to talk about the issue, how important it is, and at one point referred to it as "The requirement is that a constraint language not replace (or "hijack") the original semantics of properties used in the data."
The co-chair (I am the other co-chair, although reconsidering, as you may imagine) replied:
"The requirement of not hijacking existing formal specification languages for expressing constraints that rely on different semantics has not been raised yet."
"Has not been raised?!" The email quoting me stated that I had raised it the very day before. But an important issue is "not raised" until a man brings it up. This in spite of the fact that the email quoting me made it clear that my statement during the meeting had indeed raised this issue.
Later, this co-chair posted a link to a W3C document in an email to me (on list) and stated:
"I'm going on holidays so won't have time to explain you, but I could, in theory (I've been trained to understand that formal stuff, a while ago)"
That is so f*cking condescending. This happened after I quoted from W3C documents to support my argument, and I believe I had a good point.
So, in case you haven't experienced it, or haven't recognized it happening around you, this is what sexism looks like. It looks like dismissing what women say, but taking the same argument seriously if a man says it, and it looks like purposely demeaning a woman by suggesting that she can't understand things without the help of a man.
I can't tell you how many times I have been subjected to this kind of behavior, and I'm sure that some of you know how weary I am of not being treated as an equal no matter how equal I really am.
Quiet no more, friends. Quiet no more.
(I want to thank everyone who has given me support and acknowledgment, either publicly or privately. It makes a huge difference.)
So there are two retirements in our household today. Sandy is giving her last sermon as a regular, full-time UCC pastor. She isn’t going to stop pastoring, but she’s stepping down and looking forward to consultancies, supply preaching, and interim positions.
I was a child bride… ok, perhaps that’s a mild exaggeration… but I’m a long time from retirement myself. I’m a Boomer with not the greatest retirement portfolio, plenty of years in front of me, and lots of vim and vigor, and LibraryLand will have me around in the full-time regular workforce for a very long time. (And my Uncle Bob, may he rest in peace, worked into his mid-80s, had a stroke on a Friday night, and left this world on Sunday. I may want to kick back in a couple of decades and do other things, like travel and write–but go you, Uncle Bob.)
However, I do get to retire from my role as the pastor’s wife. I do use the word “wife” deliberately, because I think if your spouse is a minister, even if you are the husband, you are the Wife, as in Judy Brady’s wife–the docile, compliant shadow behind the Main Event.
This is not a role I have embraced perhaps as fully as I could have, but in my two decades in this role, it has been a learning experience. I have some very good memories, such as the holiday reception in Albany, New York where I had purchased this well-known locally-smoked ham, and it was so good people stopped being polite and just stood in a circle around this huge joint of meat, hacking away at it and gobbling with abandon. I remember the Christmas open house in Palo Alto; our rental home, a fake Eichler, was so packed I had to slither sideways into the kitchen to refresh the mulled cider. I also have any number of heartwarming moments with children, elderly people, and the sort of folks who end up in churches these days, which is to say people who feel a need for something much larger and older and more organized than themselves.
LibraryLand is all a-buzz these days with the notion of threshold concepts. As I dutifully make an effort to understand this concept, I see it describing a point at which you do not know something, and then you do. And like a bride, you are carried over the threshold, to be forever transformed.
I don’t have any serious objections to freshening up our concepts of how we teach information literacy with this model — it’s certainly better than arguing against library instruction per se, as Michael Gorman did in 1991. (Oh yes, he did! The things you learn skimming bibliographies.) But–and I’m guessing this isn’t antithetical to the whole threshold idea–I do think some thresholds are more like train tracks you walk along for a good long while until the town you were looking for begins to slowly swim into focus on the horizon.
I don’t recall when the threshold for my awareness of being the minister’s Wife emerged. It’s an interesting place to be. It’s not simply a matter of being that person who sits in the back pew and will do what is asked of her — serving cookies, showing up to help make the holiday jam, folding bulletins, or showing up in a dressy dress and looking interested about the wedding of two people I don’t know and will never hear from again. I am the person people remember to chat with, though never in great depth. I am the one who will not talk back if spoken to sharply; I have bit my tongue so often I’m surprised I still have one. I am the person most parishioners will forget as soon as we move on.
Threshold theory includes the idea of troublesome knowledge: “the process of crossing the threshold commonly causes some mental and emotional discomfort (troublesome).” I am the person whom a parishioner once asked, “So, you’re the one making the real salary, eh?” and that startling moment caught me because it was an assumption that made so many things clearer to me, and was also, frighteningly for a librarian, true. While these days the trend is to pay the pastor not in “free” housing and a few chickens now and then but in wages with pension plans, it’s still a profession that usually requires a two-salary household.
Despite the need to make a “real salary,” some unchurched people assume I have the time, and even the obligation, to be the Wife. As noted above, I do within limits, but I also need to focus on doing those things that ensure we have enough money to live on, which means I am not available to help organize meals for the homeless at 3 pm on weekday afternoons or to join the knitting group on Thursday mornings. When I do volunteer for something like coffee hour, it is usually squeezed into a day that began at 6 AM with doctoral homework that will be resumed once I have wiped down the church kitchen counters and folded the tablecloths.
A threshold I crossed many years ago that can also be lost on unchurched people was the need to have my own spiritual life. In Olden Days, the (male) minister married some darling parishioner, who then moved into the helpmate role — quite a bargain for the church to get a twofer, but in addition to the labor issues, it left these women in a strange place. I have often wondered about the private worlds of these Wives. Did they really see their husbands as their spiritual muses? The patriarchal implications of this arrangement make my toes curl in discomfort (talk about troublesome knowledge!). Who did these women turn to when they needed pastoral care?
Additionally, unchurched people — and some churched people — don’t get the nuance that when I attend Sandy’s church, I am essentially visiting her workplace. Work — even other people’s work — is not a stress-free experience. It’s a worldly place full of personalities and interactions, the stories for which spill over into my life enough to make what you call a sanctuary often feel to me like an office, with all that entails. I always try to have a spiritual home elsewhere, someplace I am not the Wife but just me, another parishioner. It feels so different, so unburdened.
The part I have liked about being the Wife has been its narrative stance. I watch church life unfold on its little tableaux, one of the last big volunteer activities in American life. Just like in a good novel, its inhabitants are both predictable and surprising. The Christmas play features an adorable child who will make everyone laugh. A parishioner will die, and the corner of the pew she sat in will remain empty until a clueless new person sits there, breaking the spell. Parishioners will stand up during Joys and Concerns to share stories of illness, death, life, and global sadness. The same group of elderly women found in every church, temple, and mosque will meet to knit blankets for homeless people and gossip. A baptism, the child held aloft like a prize, will make everyone breathe with hope.
The decades I have spent as the Wife have given me a privileged observer status, one that will continue as Sandy’s ministry continues in new, different ways. I won’t miss the decades where Saturday night was a “school night” for Sandy; only on vacations do we experience secular Sunday life, and I can see its attraction. And for the most part, I won’t miss being the Wife. But I will miss observing people trying to connect with something larger than themselves, with all the awkwardness and challenge and beauty that entails.
In the many talks about schema.org, it seems that one topic that isn't covered, or isn't covered sufficiently, is "where do you do it?" That is, where does it fit into your data flow? I'm going to give a simple, typical example. Your actual situation may vary, but I think this will help you figure out your own case.
The typical situation is that you have a database with your data. Searches go against that database, the results are extracted, a program formats these results into a web page, and the page is sent to the screen. Let's say that your database has data about authors, titles and dates. These are stored in your database in a way that you know which is which. A search is done, and let's say that the results of the search are:
author: Williams, R
title: History of the industrial sewing machine
date: 1996
This is where you are in your data flow:
The next thing that happens (and remember, I'm speaking very generally) is that the results then are fed into a program that formats them into HTML, probably within a template that has all your headers, footers, sidebars and branding and sends the data to the browser. The flow now looks like
Let's say that you will display this as a citation, that looks like:
Williams, R. History of the industrial sewing machine. 1996.
Without any fancy formatting, the HTML for this is:
<p>Williams, R. History of the industrial sewing machine. 1996.</p>
Now we can see the problem that schema.org is designed to fix. You started with an author, a title and date, but what you are showing to the world is a string of characters that are undifferentiated. You have lost all the information about what these represent. To a machine, this is just another of many bazillions of paragraphs on the web. Even if you format your data like this:
<p>Author: Williams, R.</p>
<p>Title: History of the industrial sewing machine</p>
<p>Date: 1996</p>
Those labels are still just text to a machine. What we want is for the program that is formatting the HTML to also include some metadata from schema.org that retains the meaning of the data you are putting on the screen. So rather than just adding HTML formatting, it will also add markup from schema.org. Schema.org has metadata elements for many different types of data. Using our example, let's say that this is a book, and here's how you could mark that up in schema.org:
<div vocab="http://schema.org/">
  <div typeof="Book">
    <p>
      <span property="author">Williams, R.</span>
      <span property="name">History of the industrial sewing machine</span>.
      <span property="datePublished">1996</span>.
    </p>
  </div>
</div>
Again, this is a very simple example, but when we test this code in the Google Rich Snippet tool, we can see that even this very simple example has added rich information that a search engine can make use of:
To see a more complex example, this is what Dan Scott and I have done to enrich the files of the Bryn Mawr Classical Reviews.
From these you can see a couple of things. The first is that the schema.org markup does not change how your pages look to a user viewing your data in a browser. The second is that hidden behind that simple page is a wealth of rich information that was not visible before.
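To place this in the data flow described earlier, here is a minimal sketch in Python of the formatting step: the program that turns a search result into HTML is where the schema.org attributes get added. The function name `book_citation_html` and the shape of the `record` dictionary are assumptions for illustration; your own templating system will look different.

```python
from html import escape


def book_citation_html(record):
    """Render one search result as a citation with schema.org (RDFa) markup."""
    return (
        '<div vocab="http://schema.org/">\n'
        '  <div typeof="Book">\n'
        '    <p>\n'
        '      <span property="author">{author}</span>\n'
        '      <span property="name">{title}</span>.\n'
        '      <span property="datePublished">{date}</span>.\n'
        '    </p>\n'
        '  </div>\n'
        '</div>'
    ).format(
        # Escape values so characters like & or < in the data stay valid HTML.
        author=escape(record["author"]),
        title=escape(record["title"]),
        date=escape(record["date"]),
    )


if __name__ == "__main__":
    result = {
        "author": "Williams, R.",
        "title": "History of the industrial sewing machine",
        "date": "1996",
    }
    print(book_citation_html(result))
```

Because the database already knows which value is the author, which is the title, and which is the date, the formatter only has to carry that knowledge through to the page instead of discarding it.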
Now you are probably wondering: well, what's that going to do for me? Who will use it? At the moment, the users of this data are the search engines, and they use the data to display all of that additional information that you see under a link:
In this snippet, the information about stars, ratings, type of film and audience comes from schema.org mark-up on the page.
Because the data is there, many of us think that other users and uses will evolve. The reverse of that is that, of course, if the information isn't there then those as yet undeveloped possibilities cannot happen.
Semantic web technologies can support the rapid and transparent validation of scientific claims by interconnecting the assumptions and evidence used to support or challenge assertions. One important application domain is medication safety, where more efficient acquisition, representation, and synthesis of evidence about potential drug-drug interactions is needed. Exposure to potential drug-drug interactions (PDDIs), defined as two or more drugs for which an interaction is known to be possible, is a significant source of preventable drug-related harm. The combination of poor quality evidence on PDDIs, and a general lack of PDDI knowledge by prescribers, results in many thousands of preventable medication errors each year. While many sources of PDDI evidence exist to help improve prescriber knowledge, they are not concordant in their coverage, accuracy, and agreement. The goal of this project is to research and develop core components of a new model that supports more efficient acquisition, representation, and synthesis of evidence about potential drug-drug interactions. Two Semantic Web models—the Micropublications Ontology and the Open Annotation Data Model—have great potential to provide linkages from PDDI assertions to their supporting evidence: statements in source documents that mention data, materials, and methods. In this paper, we describe the context and goals of our work, propose competency questions for a dynamic PDDI evidence base, outline our new knowledge representation model for PDDIs, and discuss the challenges and potential of our approach.
“Information had very strong geographical boundaries,” he says. “I come from a place where those boundaries are very, very apparent. They are in your face. To be able to make a dent in that is a very attractive proposition.”
Acharya’s continued leadership of a single, small team (now consisting of nine) is unusual at Google, and not necessarily seen as a smart thing by his peers. By concentrating on Scholar, Acharya in effect removed himself from the fast track at Google…. But he can’t bear to leave his creation, even as he realizes that at Google’s current scale, Scholar is a niche.
…But like it or not, the niche reality was reinforced after Larry Page took over as CEO in 2011, and adopted an approach of “more wood behind fewer arrows.” Scholar was not discarded — it still commands huge respect at Google which, after all, is largely populated by former academics—but clearly shunted to the back end of the quiver.
…Asked who informed him of what many referred to as Scholar’s “demotion,” Acharya says, “I don’t think they told me.” But he says that the lower profile isn’t a problem, because those who do use Scholar have no problem finding it. “If I had seen a drop in usage, I would worry tremendously,” he says. “There was no drop in usage. I also would have felt bad if I had been asked to give up resources, but we have always grown in both machine and people resources. I don’t feel demoted at all.”
Emacs 24.4 has a new feature: prettify-symbols-mode. Bozhidar Batsov wrote about it and I’ve seen other mentions of it in discussions of what’s coming when this new version is released. It’s not out yet, but I decided to stop waiting and compile from source so I can run the latest development version because there was something I wanted to do.
I use R a lot, and I use Emacs to edit it with ESS mode. In R I’m a great user of Hadley Wickham’s dplyr package (the vignette has examples that show its power), and so I spend a lot of time writing %>% (“x %>% f(y) turns into f(x, y)” say the docs).
As great as dplyr is, I thought its %>% wasn’t very eyepleasing, so I made Emacs show me the pipe symbol | when actually the R operator is there. On top of that, the built-in prettifying already takes care of turning <- into ← (Unicode LEFTWARDS ARROW).
So when I write this:
usage <- read.csv("refworks-usage.csv")
usage <- tbl_df(usage)
## Which accounts have logged in the most?
usage %>% select(user_id, name, number_of_logins) %>%
    arrange(desc(number_of_logins))
What I see is:
usage ← read.csv("refworks-usage.csv")
usage ← tbl_df(usage)
## Which accounts have logged in the most?
usage | select(user_id, name, number_of_logins) |
    arrange(desc(number_of_logins))
I think that’s much nicer. It shows a lot better in a screenshot:
It also works in the interactive mode, at an R session with a REPL, and does the prettifying on the fly there. Copying and pasting from one to the other works perfectly because what’s copied and executed is the real underlying text, not the prettified symbols I see.
How I got it working
As I said, this requires Emacs 24.4, which will be released soon, so you can wait for that or compile from source. Xah Lee’s How to build Emacs from git Repository and How to build Emacs on Linux are helpful; all I had to do to prepare to get it to compile was add a few missing packages with sudo apt-get install libgtk-3-dev libxpm-dev libgif-dev.
With that done, I was now running 24.4.50.1 (they’re moving from 24.4 to 25), and I added this to ~/.emacs.d/init.el in my Emacs config:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of BIBFRAME, it was the season of RDA, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to metadata Heaven, we were all going direct the other way– in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
There were a MARC with a large set of tags and an RDA with a plain face, on the throne of library metadata; there were a Schema.org with a large following and a JSON-LD with a fair serialization, on the throne of all else. In both camps it was clearer than crystal to the lords of the Library preserves of monographs and serials, that things in general were settled for ever.
But they weren’t. Oh were they not. It mayhaps would have been pleasant, back in 2014, to have settled everything for all time, but such things were not to be.
The library guilds united behind the RDA wall, where they frantically ran MARC records through the furnace to forge fresh new records of RDA, employed to make the wall ever thicker and higher.
The Parliamentary Library assaulted their ramparts with the BIBFRAME, but the stones flung by that apparatus were insufficient to breach the wall of RDA.
Meanwhile, the vast populace in neither camp employed Schema.org, to garner the attention of the monster crawlers and therefore their many minions, ignoring the internecine squabbles over arcane formats.
Eventually warfare settled down to a desultory, almost emotionless flinging of insults and the previous years of struggle were rendered meaningless.
So now we, the occupants of mid-century modernism, are left to contemplate the apparent fact that formats never really mattered at all. No, dear reader, they never did. What mattered was the data, and the parsing of it, and its ability to be passed from hand to hand without losing meaning or value.
One wonders what those dead on the Plain of Standards would say if they could have lived to see this day.
My humble and abject apologies to Mr. Charles Dickens, for having been so bold as to damage his fine work with my petty scribblings.
Reference librarian assisting readers. Photo by the Library of Congress.
Every day, public library staff are asked to answer legal questions. Since these questions are often complicated and confusing, and because there are frequent warnings about not offering legal advice, reference staff may be uncomfortable addressing legal reference questions. To help reference staff build confidence in responding to legal inquiries, the American Library Association (ALA) and iPAC will host the free webinar “Lib2Gov.org: Connecting Patrons with Legal Information” on Wednesday, November 12, 2014, from 2:00–3:00 p.m. EDT.
The session will offer information on laws, legal resources and legal reference practices. Participants will learn how to handle a law reference interview, including where to draw the line between information and advice, key legal vocabulary and citation formats. During the webinar, leaders will offer tips on how to assess and choose legal resources for patrons. Register now as space is limited.
Catherine McGuire, head of Reference and Outreach at the Maryland State Law Library, will lead the free webinar. McGuire currently plans and presents educational programs to Judiciary staff, local attorneys, public library staff and members of the public on subjects related to legal research and reference. She currently serves as Vice Chair of the Conference of Maryland Court Law Library Directors and the co-chair of the Education Committee of the Legal Information Services to the Public Special Interest Section (LISP-SIS) of the American Association of Law Libraries (AALL).
This is the third installment in our deep dive series on the WorldCat Discovery API. This week we will be taking a close look at some of the demo code we have written ourselves to exercise the API throughout its development process. We have decided to share our work through our OCLC Developer Network GitHub account.
This is the second of three posts about the workshop.
Part 1 introduced the Evolving Scholarly Record framework. This part summarizes the two plenary discussions.
Research Records and Artifact Ecologies
Natasa Milić-Frayling, Principal Researcher, Microsoft Research Cambridge
Natasa illustrated the diversity and complexity of digital research information by comparing it to a rainbow and asking, how do we preserve a rainbow? She began with the question: how can we support the reuse of scientific data, tools, and resources to facilitate new scientific discoveries? We need to take a sociological point of view because scientific discovery is a social enterprise within communities of practice, and the information takes a complex journey from the lab to the paper, evolving en route. When teams consist of distributed scientists, notions of ownership and sharing are challenged. We need to be attuned to the interplay between technology and collaborative practices as it affects the information artifacts.
Natasa encouraged a shift in thinking from the record to the ecology as she shared her study of the artifact ecology of a particular nanotechnology endeavor. That ecosystem has electronic lab books, includes tools, ingests sensor data, and incorporates analysis and interpretation. It provides context for understanding the data and other artifacts, but scientists want help linking these artifacts and overcoming the limitations of physical interaction. They want content extraction and format transformation services. They want to create project maps and overviews that support their work and convey enough meaning to guide third-party reuse of the artifacts. Preservation is not just persistence; it requires a connection with the contemporary ecosystem. A file and an application can persist and be completely unusable. Files need to be processed and displayed to be experienced, and this requires preserving them in their original state and virtualising the old environments on future platforms. She acknowledged the challenges in supporting research, but implored libraries to persevere.
A Perspective on Archiving the Evolving Scholarly Record
Herbert Van de Sompel, Scientist, Los Alamos National Laboratory
Herbert took a web-focused view, saying that not only is nearly everything digital, it is nearly all networked, which must be taken into account when we talk about archiving. His presentation reflected thinking in progress with Andrew Treloar, of the Australian National Data Service. Herbert highlighted the “collect” and “fix” roles and how the materials will be obtained by archives. He used Roosendaal and Geurts’s functions of scholarly communication to structure his talk: Registration (the claim, with its related objects), Certification (peer review and other validation), Awareness (alerts and discovery of new claims), and Archiving (preserving over time), emphasizing that there is no scholarly record without archiving. The four functions had been integrated in print journal publishing, but now the functions are disaggregated and distributed among many entities.
Herbert then characterized the future environment as the Web of Objects. Scholarly communication is becoming more visible, continuous, informal, instant, and content-driven. As a result, research objects are more varied, compound, diverse, networked, and open. He discussed several challenges this presents to libraries. Archiving must take into account that objects are often hosted on common web platforms (e.g., GitHub, SlideShare, WordPress), which are not necessarily dedicated to scholarship. We archive only 50% of journal articles and they tend to be the easy, low-risk titles. “Web at Large” resources are seldom archived. Today’s approach to archiving focuses on atomic objects and loses context. We need to move toward archiving compound objects in various states of flux, as resources on the web rather than as files in file systems. He distinguished between recording (short-term, no guarantees, many copies, and tied to the scholarly process) and archiving (longer-term, guarantees, one copy, and part of the scholarly record). Curatorial decisions need to be made to transfer materials from the recording infrastructures to an archival infrastructure through collaborations, interoperability, and web-scale processes.
I’ve used different HTML slideshow tools in the past, but was never satisfied with them. I don’t like having to run a server just for a slideshow. I don’t like it when a slideshow requires external dependencies that make it difficult to share the slides. I don’t want to have to write a lot of HTML.
I want to write my slides in a single Markdown file. As a backup I always like to have my slides available as a PDF.
For my latest presentations I came up with a workflow that I’m satisfied with. Once all the little pieces were stitched together it worked really well for me. I’ll show you how I did it.
I had looked at DZSlides before but had always passed it by after seeing what a default slide deck looked like. It wasn’t as flashy as others and doesn’t immediately have all the same features readily available. I looked at it again because I liked the idea that it is a single file template. I also saw that Pandoc will convert Markdown into a DZSlides slideshow.
To convert my Markdown to DZSlides it was as easy as:
What is even better is that Pandoc has settings to embed images and any external files as data URIs within the HTML. So this allows me to maintain a single Markdown file and then share my presentation as a single HTML file including images and all–no external dependencies.
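To make the data URI idea concrete, here is a minimal Python sketch of how a file’s bytes can be turned into a data URI that sits directly inside an HTML attribute. It illustrates the general data URI format only and makes no claim about how Pandoc implements embedding internally; the function name to_data_uri and the sample bytes are hypothetical.

```python
import base64


def to_data_uri(data, mime_type):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Hypothetical stand-in for the bytes of an image file
# (just the PNG signature, not a complete image).
fake_image = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(fake_image, "image/png")
img_tag = '<img src="' + uri + '">'
print(img_tag)
```

With every image written this way, the HTML file carries its own assets, so nothing else needs to travel with it.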
Just follow these naming conventions:
Presentation Markdown should be named presentation.md
Output presentation HTML will be named presentation.html
Create a stylesheet in styles.css
You can put images wherever you want, but I usually place them in an images directory.
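As a hedged sketch of a build step that follows these naming conventions, the Python below assembles a Pandoc command line for a self-contained DZSlides deck. It is not the original build script. The options used (--standalone, --self-contained, --to, --css, --output) are real Pandoc options, but newer Pandoc releases deprecate --self-contained in favour of --embed-resources, so adjust for your version. The function names pandoc_command and build are hypothetical.

```python
import subprocess


def pandoc_command(source="presentation.md", output="presentation.html",
                   stylesheet="styles.css"):
    """Build the argument list for converting Markdown to a DZSlides deck."""
    return [
        "pandoc",
        "--standalone",       # emit a complete HTML document
        "--self-contained",   # embed images and CSS as data URIs
        "--to", "dzslides",   # DZSlides output format
        "--css", stylesheet,
        "--output", output,
        source,
    ]


def build():
    """Run the conversion; requires pandoc to be installed and on the PATH."""
    subprocess.run(pandoc_command(), check=True)


# Show the command that build() would run, without running it.
print(" ".join(pandoc_command()))
```

Keeping the command in one function means the file names from the conventions above are defined in a single place.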
Automate the build
Now what I wanted was for this script to run any time the Markdown file changed. I used Guard to watch the files and set off the script to convert the Markdown to slides. While I was at it I could also reload the slides in my browser. One trick with guard-livereload is to allow your browser to watch local files so that you do not have to have the page behind a server. Here’s my Guardfile:
Add the following to a Gemfile and bundle install:
Now I have a nice automated way to build my slides, continue to work in Markdown, and have a single file as a result. Just run this:
Now when any of the files change your HTML presentation will be rebuilt. Whenever the resulting presentation.html is changed, it will trigger livereload and a browser refresh.
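The watch-and-rebuild loop can be approximated without Guard. The Python sketch below swaps in simple polling of file modification times, which is not how Guard itself works and does not include the livereload step. The function names snapshot, changed_files, and watch are hypothetical, and rebuild stands for whatever build step you use.

```python
import os
import time


def snapshot(paths):
    """Map each existing path to its last-modified time."""
    return {p: os.path.getmtime(p) for p in paths if os.path.exists(p)}


def changed_files(before, after):
    """Return the paths that are new or whose modification time differs."""
    return sorted(p for p, mtime in after.items() if before.get(p) != mtime)


def watch(paths, rebuild, interval=1.0):
    """Poll the watched paths forever and call rebuild() when any of them changes."""
    previous = snapshot(paths)
    while True:
        time.sleep(interval)
        current = snapshot(paths)
        if changed_files(previous, current):
            rebuild()
        previous = current
```

A call such as watch(["presentation.md", "styles.css"], rebuild=my_build_function) would then rebuild the slides on every save, with my_build_function being your own build step.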
Slides to PDF
The last piece I needed was a way to convert the slideshow into a PDF as a backup. I never know what kind of equipment will be set up or whether the browser will be recent enough to work well with the HTML slides. I like being prepared. It makes me feel more comfortable knowing I can fall back to the PDF if need be. Also, some slide deck services will accept a PDF but won’t take an HTML file.
The Webkit driver is used to take a snapshot of each slide, save it to a screenshots directory, and then ImageMagick’s convert is used to turn the PNGs into a PDF. You could just as well use other tools to stitch the PNGs together into a PDF. The quality of the resulting PDF isn’t great, but it is good enough. Also the capybara-webkit browser does not evaluate @font-face so the fonts will be plain. I’d be very interested if anyone gets better quality using a different browser driver for screenshots.
At this point I did have to set this up to be behind a web server. On my local machine I just made a symlink from the root of my Apache htdocs to my working directory for my slideshow. The script can be called with the following.
I start with adding the following markup to the presentation Markdown file.
Add some CSS to hide the notes by default but allow for them to display at the bottom of the slide.
Finally, I like to have an outline I can see of my presentation as I’m writing it. Since the Markdown just uses h1 elements to separate slides, I just use the following simple script to output the outline for my slides.
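An outline script along these lines can be very small. The Python sketch below is one possible version, not the original script. It assumes slides are introduced by lines starting with "# " and that no code block in the file contains such a line; the function name outline and the sample text are hypothetical.

```python
def outline(markdown):
    """Return the text of every top-level heading, one per slide."""
    titles = []
    for line in markdown.splitlines():
        # "# " marks an h1; "## " and deeper headings do not match.
        if line.startswith("# "):
            titles.append(line[2:].strip())
    return titles


sample = "# Introduction\n\nSome text\n\n# Demo\n\n## Not a slide\n\n# Questions\n"
for number, title in enumerate(outline(sample), start=1):
    print(f"{number}. {title}")
```

To use it on real slides, read presentation.md into a string and pass that to outline instead of the sample text.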
If you add an image to an HTML document you can style it with CSS. You can add borders, change its opacity, use CSS animations, and lots more. HTML5 video is just as easy to add to your pages and you can style video too. Lots of tutorials will show you how to style video controls, but I haven’t seen anything that will show you how to style the video itself. Read on for an extreme example of styling video just to show what’s possible.
Here’s a simple example of a video with a single source wrapped in a div:
Add some buttons under the video to style and play the video and then to stop the madness.
Using the class that gets added we can then style and animate the video element with CSS. This is a simplified version without vendor prefixes.
Stupid Video Styling Tricks
OK, maybe there aren’t a lot of practical uses for styling video with CSS, but it is still fun to know that we can. Do you have a practical use for styling video with CSS that you can share?