Visualized word frequencies, while often considered sophomoric, can be quite useful when it comes to understanding a text, especially when the frequencies are focused on things like parts-of-speech, named entities, or co-occurrences. Wordle visualizes such frequencies very well. For example, the 100 most frequent words in the Iliad and the Odyssey, the 100 most frequent nouns in the Iliad and the Odyssey, or the statistically significant words associated with the word ship from the Iliad and the Odyssey.
Simple word frequencies
Frequency of nouns
Significant words related to “ship”
Here is a generic Wordle recipe where Wordle will calculate the frequencies for you:
Download and install Wordle. It is a Java application, so you may need to download and install Java along the way, though Java is probably already installed on your computer.
Use your text editor to open reader.txt which is located in the etc directory/folder. Once opened, copy all of the text.
Open Wordle, select the “Your Text” tab, and paste the whole of the text file into the window.
Click the “Wordle” tab and your word cloud will be generated. Use Wordle’s menu options to customize the output.
Congratulations, you have just visualized the whole of your study carrel.
Here is another recipe, a recipe where you supply the frequencies (or any other score):
Download and install AntConc.
Use the “Open File(s)…” menu option to open any file in the txt directory.
Click the “Word list” tab, and then click the “Start” button. The result will be a list of words and their frequencies.
Use the “Save Output to Text File…” menu option, and save the frequencies accordingly.
Open the resulting file in your spreadsheet.
Remove any blank rows, and remove any columns other than the words and their frequencies.
Invert the order of the remaining two columns; make the words the first column and the frequencies the second column.
Copy the whole of the spreadsheet and paste it into your text editor.
Use the text editor’s find/replace function to find all occurrences of the tab character and replace them with the colon (:) character. Copy the whole of the text editor’s contents.
Open Wordle, click the “Your Text” tab, and paste the frequencies into the resulting window.
Finally, click the “Wordle” tab to generate the word cloud.
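The spreadsheet and text-editor steps above can also be scripted. Here is a minimal Python sketch, assuming a tab-separated AntConc export whose columns are rank, frequency, and word (check your own export; the column order may differ), that produces the colon-separated word/frequency lines directly:

```python
# Convert an AntConc word-list export into "word:frequency" lines.
# Assumes tab-separated columns in the order rank, frequency, word;
# adjust the indices if your export differs.

def antconc_to_wordle(lines):
    pairs = []
    for line in lines:
        fields = line.strip().split("\t")
        # Skip blank rows, headers, and malformed lines
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        frequency, word = fields[1], fields[2]
        pairs.append(f"{word}:{frequency}")
    return "\n".join(pairs)

# Example with two rows of the kind AntConc produces:
sample = ["1\t120\tship", "2\t95\tsea", ""]
print(antconc_to_wordle(sample))
```

Pasting the resulting lines into Wordle is equivalent to the colon-separated text built by hand in the recipe above.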
Notice how you used a variety of generic applications to achieve the desired result. The word/value pairs given to Wordle do not have to be frequencies. Instead they can be any number of different scores or weights. Keep your eyes open for word/value combinations. They are everywhere. Word clouds have been given a bad rap; Wordle is a very useful tool.
I was always a bit confused about the inclusion of "pamflets" in the subtitle of the Decimal System, such as this title page from the 1922 edition:
Did libraries at the time collect numerous pamphlets? For them to be the second-named type of material after books was especially puzzling.
I may have discovered an answer to my puzzlement, if not THE answer, in Andrea Costadoro's 1856 work:
A "pamphlet" in 1856 was not (necessarily) what I had in mind, which was a flimsy publication of the type given out by businesses, tourist destinations, or public health offices. In the 1800s it appears that a pamphlet was a literary type, not a physical format. Costadoro says:
"It has been a matter of discussion what books should be considered pamphlets and what not. If this appellation is intended merely to refer to the SIZE of the book, the question can be scarecely worth considering ; but if it is meant to refer to the NATURE of a work, it may be considered to be of the same class and to stand in the same connexion with the word Treatise as the words Tract ; Hints ; Remarks ; &c, when these terms are descriptive of the nature of the books to which they are affixed." (p. 42)
To be on the shelves of libraries, and cataloged, it is possible that these pamphlets were indeed bound, perhaps by the library itself.
The Library of Congress genre list today has a cross-reference from "pamphlet" to "Tract (ephemera)". While Costadoro's definition doesn't give any particular subject content to the type of work, LC's definition says that these are often issued by religious or political groups for proselytizing. So these are pamphlets in the sense of the political pamphlets of our revolutionary war. Today they would be blog posts, or articles in Buzzfeed or Slate or any one of hundreds of online sites that post such content.
Churches I have visited often have short publications available near the entrance, and there is always the Watchtower, distributed by Jehovah's Witnesses at key locations throughout the world, and which is something between a pamphlet (in the modern sense) and a journal issue. These are probably not gathered in most libraries today. In Dewey's time the printing (and collecting by libraries) of sermons was quite common. In a world where many people either were not literate or did not have access to much reading material, the Sunday sermon was a "long form" work, read by a pastor who was probably not as eloquent as the published "stars" of the Sunday gatherings. Some sermons were brought together into collections and published, others were published (and seemingly bound) on their own. Dewey is often criticized for the bias in his classification, but what you find in the early editions serves as a brief overview of the printed materials that the US (and mostly East Coast) culture of that time valued.
What now puzzles me is what took the place of these tracts between the time of Dewey and the Web. I can find archives of political and cultural pamphlets in various countries, and they all seem to end around the 1920s–30s, although some specific collections, such as the Samizdat publications in the Soviet Union, exist in other time periods.
Of course the other question now is: how many of today's tracts and treatises will survive if they are not published in book form?
The Association for Library Collections and Technical Services (ALCTS), the Library Information Technology Association (LITA) and the Library Leadership and Management Association (LLAMA) have announced that Emily Drabinski and Rebekkah Smith Aldrich will deliver keynote addresses at the Exchange Virtual Forum. The theme for the Exchange is “Building the Future Together,” and it will take place on the afternoons of May 4, 6 and 8. Each day has a different focus, with day 1 exploring leadership and change management; day 2 examining continuity and sustainability; and day 3 focusing on collaborations and cooperative endeavors. Drabinski’s keynote will be on May 4, and Smith Aldrich’s will be on May 8.
Emily Drabinski is the Critical Pedagogy Librarian at Mina Rees Library, Graduate Center, City University of New York (CUNY). She is also the liaison to the School of Labor and Urban Studies and other CUNY masters and doctoral programs. Drabinski’s research includes critical approaches to information literacy instruction, gender and sexuality in librarianship and the intersections of power and library systems and structures. She currently serves as the series editor for Gender and Sexuality in Information Studies (Library Juice Press/Litwin Books). Additionally, Drabinski serves on the editorial boards of College & Research Libraries, The Journal of Critical Library & Information Studies and Radical Teacher, a socialist, feminist and anti-racist journal devoted to the theory and practice of teaching. Drabinski has given numerous keynote addresses and presentations at the Big XII Teaching and Learning Conference, the Association of College and Research Libraries (ACRL) Conference and the Digital Library Federation.
Rebekkah Smith Aldrich, a powerful advocate for public libraries, is the executive director of the Mid-Hudson Library System (MHLS) in Hudson, New York. In addition, she is a certified sustainable building advisor (CSBA) and is a Leadership in Energy and Environmental Design Accredited Professional (LEED AP). Smith Aldrich holds an advanced certificate in Public Library Administration from the Palmer School of Library and Information Science at Long Island University, where she is also an adjunct professor. She is a founding member of ALA’s Sustainability Roundtable and helped to pass the ALA Resolution on the Importance of Sustainable Libraries in 2015. Smith Aldrich was named a Library Journal Mover and Shaker in 2010 and writes a sustainability column for the journal. A prolific writer, Smith Aldrich is the author of Sustainable Thinking: Ensuring Your Library’s Future in an Uncertain World. She has also contributed chapters to Better Library Design (Rowman and Littlefield) and The Green Library (Library Juice Press). Like Drabinski, Smith Aldrich has given numerous presentations and keynote addresses at venues that include IFLA, the Association of Rural and Small Libraries Conference, the US Embassy in Peru, the American Library Association Annual Conference and the New York Library Association Conference.
The Exchange is presented by the Association for Library Collections & Technical Services (ALCTS), the Library and Information Technology Association (LITA) and the Library Leadership & Management Association (LLAMA), divisions of the American Library Association. To get more information about the proposed future for joint projects such as the Exchange, join the conversation about #TheCoreQuestion.
Contact: Brooke Morris-Chott, Program Officer, Communications, ALCTS, firstname.lastname@example.org
As an exercise, I used Perma's public API to look up the URLs for the Perma links cited in these two documents; here are CSV files listing the 148 links in the House Memorandum (one ill-formed) and the 129 links in the President's memorandum. (Note that both CSV files include duplicates, as some links are repeated in each document; I'm leaving the duplicates in place in case you want to read along with the documents.)
The Evergreen Community is celebrating the new year with two new releases: Evergreen 3.4.2 and Evergreen 3.3.6. These two releases look to the past to find and resolve troublesome, long-standing bugs. They also look to the future, including bug fixes for Evergreen’s new and emerging features, like the experimental staff catalog and carousel features. Twenty-four individuals from 11 different organizations contributed to these releases.
Evergreen 3.4.2 includes fixes related to the Acquisitions, Administration, Cataloging, Circulation, and Hatch portions of Evergreen. Evergreen 3.3.6 includes fixes related to the Acquisitions, Cataloging, and Circulation portions of Evergreen. To learn more about the specific improvements in these releases, please refer to the 3.4.2 release notes, 3.3.6 release notes, or the consolidated tabular release notes.
I’ve recently shifted gears from data collection to data analysis in my dissertation research, so of course what better time to navel-gaze on the tools I’m using, rather than doing the work itself, right?
Well, not really.
Early in my PhD studies I went through a phase where I spent a significant chunk of time trying different citing and writing tools, before settling on collecting materials with BibDesk and DropBox, writing in Markdown with WriteRoom (sometimes Vim or VSCode), and then publishing with Pandoc. It hasn’t been perfect, but it has let me collect about 2k citations to the research literature, and to be able to easily draw on them in my written work in articles, my dissertation proposal, and here on this blog, which has been super handy.
Markdown kind of sucks for workshopping text with others. There are some tools for collaborating on Markdown text, but it can be challenging to get academics to use them when their time is so limited. So I’ve tended to give people Word docs or PDFs (generated with Pandoc) and solicit their feedback that way. The challenge then is integrating those suggestions back into the original source, which means the track changes feature in Word or Google Docs doesn’t carry over, but if people are interested in that level of granularity they can look at the Git repository…yep, using Markdown files means my manuscripts can all happily be versioned in a Git repository. While the collaborative lure of live editing in a Google Doc is a strong pull, I’ve actually come to appreciate the degree of separation that editing Markdown provides. It gives me a protective space in which to think.
Now I’m in the process of converting hundreds of pages of handwritten field notes in a couple of notebooks into text documents that I can use as source material for open coding. For my last project I ended up using MAXQDA for qualitative data analysis after experimenting with Dedoose and NVivo. I was happy with it, but it wasn’t cheap, even with the educational license, which expired after a semester.
Close reading field notes and interviews is key to my method, and being able to annotate the text with codes is extremely important for coming to understand, and use, my data. But honestly all the bells and whistles built into the tools were not so important for me. I liked to see my codes, and bring up chunks of text for particular codes. But that was the extent of my analysis. I built themes out of the codes and used the codes as an index into my data. But I didn’t tend to use any of the other statistical tools that were available.
So this time around I’m trying something different. Because I am transcribing my own handwritten notes, I’m going to write them as Markdown (basically text files), and then “mark up” regions of text with codes using HTML, which is perfectly legit to use in Markdown files. I could have done this keying/annotating work in MAXQDA, which provides a way to compose documents, but I was kind of surprised to find that NVivo didn’t have this capability–it really likes to be pointed at completed documents, instead of providing an environment for creating them.
So I created a very simple VSCode plugin I’m calling Anselm (named after Anselm Strauss) that makes it easy (with a few keystrokes) to highlight an arbitrary region of text in a Markdown document and mark it up with a code, expressed as a <mark> tag.
So for example, assume I have this bit of text in my field notes:
Some heavy machinery, bulldozers mostly were also there. To the right was a cement structure with multiple large diagonal entrances. I missed the sign pointed to the right for Yard Waste and ended up on a road that led out of the facility. I thought about turning around but it was a one way road. There was another person in a truck at the exit who was watching people leave. I felt a bit like my movement through the facility was controlled.
Anselm will let you quickly (faster if you assign a keyboard shortcut) assign simple codes to pieces of the text:
Some <mark class="technology">heavy machinery, bulldozers mostly were also there</mark>. To the right was a <mark class="architecture">cement structure with multiple large diagonal entrances</mark>. I missed the sign pointed to the right for Yard Waste and ended up on a road that led out of the facility. I thought about turning around but it was a one way road. <mark class="surveillance">There was another person in a truck at the exit who was watching people leave. I felt a bit like my movement through the facility was controlled.</mark>
The plugin is so simple it seems barely worth mentioning. You can see it in operation in the video at the top of this blog post. I am writing about it here really just to encourage you to develop the tools you need in your research. There is significant value in understanding how your research tools operate, and in being able to shape them to do what you need. There is also value in getting the research done, and not focusing on the tools themselves. So it’s a balancing act to be sure.
Text is the oldest and most stable communication technology (assuming we treat speech/signing as a natural phenomenon – there are no human societies without it – whereas textual capability has to be transmitted, taught, acquired), and it’s incredibly durable. We can read texts from five thousand years ago, almost from the moment they started being produced. It’s (literally) “rock solid” – you can readily inscribe it in granite that will likely outlast the human species.
Just using text for my field notes means I can analyze it as text and not be limited by what MAXQDA or NVivo lets me do. I’m going to give it a try for a bit. If you want to try it out yourself you can find it in the VSCode plugin app store, just search for “Anselm”.
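Because the codes are just flat <mark> tags in plain text files, tallying them doesn't require a dedicated QDA package at all. As a sketch of the kind of analysis this opens up (a simple regex suffices here only because the markup is flat, with no nested tags; the sample text is hypothetical), here is how you might count codes and pull out the passages for one code:

```python
import re
from collections import Counter

# Match Anselm-style annotations: <mark class="CODE">passage</mark>
MARK = re.compile(r'<mark class="([^"]+)">(.*?)</mark>', re.DOTALL)

def count_codes(text):
    """Tally how many passages carry each code."""
    return Counter(code for code, _ in MARK.findall(text))

def passages(text, code):
    """Return the passages tagged with a given code."""
    return [body for c, body in MARK.findall(text) if c == code]

notes = ('Some <mark class="technology">heavy machinery</mark> were there. '
         '<mark class="surveillance">There was another person in a truck '
         'at the exit who was watching people leave.</mark>')
print(count_codes(notes))
print(passages(notes, "technology"))
```

From here the codes can be used as an index into the notes, graphed, or fed into whatever analysis the project calls for.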
Finally, research is an opportunity to not only generate new findings (or knowledge) but also to develop new instruments and methods (practice). I certainly haven’t done anything revolutionary in creating this tiny VSCode plugin; but it has granted me the chance to reflect on what is happening when I code text in this way, and opened up other possibilities for its analysis later that would have been foreclosed by the limits of someone else’s tool.
But yeah, on with the analysis! 📙💻📊 I’ll probably be adding little bits and pieces to Anselm as I go. If you have ideas for it, drop them in here.
I have an increasingly complicated relationship with the broader "open movement". I strongly believe in open access to knowledge - not just academic knowledge, though it's a good start. Yet the 'opening' of Indigenous peoples' knowledge around the world has been a key weapon of capitalism and colonialism. 'Open source' software underpins the operations of multi-billion dollar corporations, yet they not only pay no tax to governments, they also rarely provide financial support to the basic 'open' software systems upon which they rely. Indeed the awful similarities between 'open markets' and 'open software' are becoming more and more apparent to me. No wonder some say that Open is cancelled.
It wasn't just the catchy acronym, therefore, that made us call the miniconf "Generous and Open GLAM". Indeed, if I'd been able to get away with it within the broader conference theme I'd have simply called it Generous GLAM. The idea of 'Generous' GLAM came from the 2018 miniconf and discussions about extending the concept of generous interfaces coined by Mitchell Whitelaw. What, we wondered, might a generous library, a generous gallery, or a generous archive look like? In particular, I wondered what it might mean for a GLAM catalogue API to be 'generous'. In the last part of our GO GLAM miniconf we workshopped this as a group, though I'm not sure we came to any definitive conclusions.
After Pia's talk we did a little workshop for about an hour thinking about some questions related to the talks and how we might build in 'generosity' to a theoretical API standard for GLAM institutions. The point wasn't really to design an actual API standard: apart from anything else, the current Australian context means that the Trove API is effectively the national standard, and we were fewer than twenty people who happened to be at a Linux Australia miniconf. But this sort of thinking is useful even if you're not actually building the thing you imagine. Every piece of technology, exhibition, collection, or process we create in memory institutions should be generous.
Here's the whiteboard notes, if you're interested:
The Islandora 8 committers have asked Alan Stanley to join their ranks, and we are pleased to announce that he has accepted.
Alan has a long history with Islandora, as a developer, instructor, mentor, and all-around great member of our community. Alan's recent contributions to Islandora 8 include support for ORCID integration, OCR text extraction, and FITS metadata — and stay tuned for some upcoming improvements to content modeling, courtesy of Alan's work at UPEI.
We are extremely grateful for all of the time and effort that Alan has put into Islandora 8, and we couldn't be more excited to have him on board!
Further details of the rights and responsibilities of being an Islandora committer can be found here: https://github.com/Islandora/islandora/wiki/Islandora-Committers
This article explores the concept of information privilege and how it can be used with first-year students to teach about information literacy and privilege. It builds on a credit-bearing first-year seminar taught on this topic and a survey conducted after the class ended. The purpose of this study was to identify what my students learned about information privilege and how they define the concept. The article makes recommendations for others wanting to incorporate a lens of information privilege into their library instruction.
This paper will share a case study of research data and best practices from my experience teaching a credit-bearing first-year seminar on the topic of information privilege. First described by Char Booth in 2014, the “concept of information privilege situates information literacy in a sociocultural context of justice and access” (Booth, 2014).
Prior to encountering this concept, I wrote my personal statement for graduate school on my belief that information should be freely available and that access to information is a human right. My undergraduate degree is in women’s studies, and bringing this lens with me to my library science classes helped meld the two worlds together. bell hooks’ Feminist Theory: From Margin to Center is a book that I come back to repeatedly because of its intersectional approach to feminist theory that centers the experiences of Black women, and it reminds me that even for large groups of people who are oppressed (for example, women), oppression is experienced differently based on intersectional characteristics (Crenshaw, 1989) like race, class, or religion (hooks, 1984).
An aspect of information privilege that particularly sparked my interest while in graduate school was the intersection of privileges that allow an individual to have internet access at home (Pew Research Center, 2019). In the first year of my library science program, I went on a two-week study abroad trip to Bangalore and Mysore, India, to study library and information organizations. On this trip in 2013, I found out that only about 10% of India had access to the internet at that time, and this inspired me to look into internet access and digital divide issues in my own country. Now, as a teacher and a librarian, I carry that perspective into my classroom and interact with students without assumptions about what they do or don’t know about searching.
First-year students1 typically enter college or university and experience a new level of access to information through taking courses and what the university libraries offer. While academic libraries and universities pay for access to journals and databases, students are privileged in being able to use these resources. This is also a level of access to information most people don’t have outside of a higher academic setting, and that students will cease to have when they leave the higher education context. Investigating privilege through access to information helps students realize the myths they may have held depending on their background and perspective, such as thinking that everyone in the United States has access to the internet or that everyone has access to finding the health information they need. This article will go over how the class was taught, what topics were covered, and the assignments used. Topics included access to information in archives and museums, health and financial information, as well as open access in general. I will close with suggestions for incorporating information privilege into other aspects of library teaching.
Defining Information Privilege
The term “information privilege” was first used in 2013, but it was later defined by Char Booth in 2014:
The concept of information privilege situates information literacy in a sociocultural context of justice and access. Information as the media and messages that underlie individual and collective awareness and knowledge building; privilege as the advantages, opportunities, rights, and affordances granted by status and positionality via class, race, gender, culture, sexuality, occupation, institutional affiliation, and political perspective (Booth, 2014).
This definition was the first in the field of librarianship, and in naming the concept, Booth gave the impetus to investigate information privilege further. The definition is powerful because it explicitly connects information literacy with justice. Talking about information access by itself ignores why some individuals might not have access while others do. The added lens of privilege is the key to furthering research and conversations about information access.
Sarah Hare and Cara Evanson have also described information privilege as “a term that carries assumptions about who has power, who does not, and what types of information are valuable” in their article about this work with undergraduate students (Hare & Evanson, 2018).
In 2017, in writing about information literacy threshold concepts, Johnson and Smedley-López defined the concept another way with more examples:
One type of privilege is information privilege, or unequal access to information due to paywalls, and this is a prevalent and persistent issue and injustice in our society, with paywalls blocking the general public from accessing potentially life-changing information (Johnson & Smedley-López, 2017).
This third definition is different in that it explicitly frames information privilege in terms of open access to information and gives examples of this.
These definitions are distinctive and bring different angles to understanding the term. Booth explicitly breaks down the two words (information and privilege) and defines each of them to show the understanding of the words together. Hare and Evanson (2018) speak more to the world we live in that interacts with information, a world that is riddled with systemic inequity and commodification of information. Johnson and Smedley-López (2017) give examples of how access to information can be cut off or limited.
Because my class was designed to help students explore their own personal identities and privileges, I am most drawn to Booth’s definition as it puts these two terms, information and privilege, in conjunction with each other. Later in this article, I’ll share the personal definitions my students provided for information privilege.
A prior and related concept that influences information privilege is information poverty. Elfreda A. Chatman articulated a theory of information poverty that includes six propositional statements. Proposition two states that class distinction correlates with information poverty and that “the condition of information poverty is influenced by outsiders who withhold privileged access to information” (Chatman, 1996). The concepts of insiders and outsiders come from the sociology of knowledge (Merton, 1972). Applying a lens of privilege, we can think about who are typically insiders (those with privilege) and who are outsiders (those who are marginalized) to information.
In terms of open access, Aaron Swartz’s “Guerilla Open Access Manifesto” in 2008 calls for those with privileged access to information to use their privilege to help others who do not have access to the same information:
Those with access to these resources – students, librarians, scientists – you have been given a privilege. You get to feed at this banquet of knowledge while the rest of the world is locked out. But you need not – indeed, morally, you cannot – keep this privilege for yourselves. You have a duty to share it with the world (Swartz, 2008).
The manifesto has influenced the concept of information privilege because it defines clearly who has access (students, librarians, scientists) to information and their responsibility to share it. Swartz also names those who are in the wrong in actively keeping information behind paywalls, such as large corporations and politicians (Swartz, 2008).
ACRL Framework for Information Literacy
It is important to note that while the ACRL Framework for Information Literacy for Higher Education, published in 2016, mentions information privilege in the frame “Information Has Value,” it does so only once, providing neither a definition nor any parameters for putting the concept into practice in the field of information literacy. One of the opportunities this gives the field is the flexibility to set these parameters for ourselves.
The term “privilege” appears in three other places within this document. The Frame “Authority Is Constructed and Contextual” reads:
Experts understand the need to determine the validity of the information created by different authorities and to acknowledge biases that privilege some sources of authority over others, especially in terms of others’ worldviews, gender, sexual orientation, and cultural orientations (ACRL, 2016).
Under the Frame “Scholarship as Conversation,” privilege is mentioned twice:
While novice learners and experts at all levels can take part in the conversation, established power and authority structures may influence their ability to participate and can privilege certain voices and information (ACRL, 2016).
Learners who are developing their information literate abilities…recognize that systems privilege authorities and that not having a fluency in the language and process of a discipline disempowers their ability to participate and engage (ACRL, 2016).
The commonality in these examples is an examination of the ways privilege can affect information literacy at both a systemic level and an individual one. The lens of privilege can guide our field of librarianship in how we work with students by making us cognizant of the power structures at play.
Laura Saunders explores social justice and information literacy by proposing another frame for the ACRL Framework called “Information Social Justice,” which more explicitly states different ways for information professionals to incorporate this lens into their information literacy practices (Saunders, 2017). I found this proposed frame to be a missing piece from the ACRL Framework, especially since the ACRL Framework mentions information privilege a few times.
I work at The University of Tennessee, Knoxville (UTK), a public university and the flagship campus in the University of Tennessee system. In Fall 2018, the semester I taught my class, there were 22,815 undergraduate students and 6,079 graduate students for a total enrollment of 28,894 (Gardner & Long, 2019). Looking at first-time first-year student data, there were 5,215 students; 76.5% of them were Tennessee residents, and a mere 43 were international students (Gardner & Long, 2019). In terms of race, UTK is a predominantly white institution, with 78% of the Fall 2018 first-year class self-identifying as white (Gardner & Long, 2019). In January of 2018, Rankin & Associates, Consulting published their Campus Climate Assessment Project for the University of Tennessee, Knoxville. They found that 51.3% of undergraduate students in the study considered leaving the university because of a “lack of a sense of belonging” and that 31.8% considered leaving because the “climate was not welcoming” (Rankin & Associates, 2018). The study also found that students in “minority” groups (for example, sexual orientation, race, disability, religion) considered leaving the university at higher rates than those in “majority” groups (Rankin & Associates, 2018). This led me to propose a first-year seminar titled “Information Privilege” because I wanted an opportunity to hold conversations about privilege with first-year students over the course of a semester.
First-Year Studies (FYS) 129 – Information Privilege – Fall 2018
UTK has an office of First-Year Studies, which offers an optional course, “First-Year Studies (FYS) 129,” a one-credit, pass/fail course (First-Year Studies, n.d.). The goal of offering these courses is to foster connections between first-year students and faculty members in a smaller class setting. The course can be taught by any tenure-track or tenured faculty member at UTK. As a tenure-track faculty librarian, I saw this as an excellent opportunity to spend sixteen weeks with a group of first-year students talking about a topic that I find important: information privilege. My course proposal was approved, and by the time Fall 2018 rolled around, I had the minimum number of students for the course to take place: twelve.
Course Description and Format
My course description read as follows, and was used by students to determine whether or not to sign up for the course:
Information Privilege: Have you ever searched on Google and hit a paywall for an article? Have you ever been frustrated while searching for a particular piece of information? This seminar will explore the valuable impact access to information has on your quality of life and in your community. By the end of this class, you’ll be a more savvy and conscious searcher.
Reflecting on the description now, the course itself became much more than this, but I was struggling to think about how to write a course description that would entice students to sign up but not scare them away. The description doesn’t even mention the word privilege; it is only in the title. Thinking back now, I was nervous as an instructor that no first-year student would want to sign up for a class that had them think and reflect on their own privilege. In so doing, I made certain assumptions about what kind of classes a first-year student might want to take. In further iterations of this class I plan on being more direct about the content of the course and not assuming what type of classes students do or don’t want to take.
My course was discussion-based and built in time for reflection on both my part and the students’. We explored the intersections of information privilege with internet access, archives, museums, open access, financial information, and health information.
Every week we either read an article or watched a video before coming to class and would then discuss it in class (Appendix A). As this was a pass/no-pass, one-credit course, I wanted to select readings or videos that would not take much time to read or watch and would promote discussion in class.2 My students seemed surprised by, but also enjoyed, the fact that I would assign Wikipedia entries for them to skim before certain weeks. In preparing for a future iteration of this class, I would cut the broad topics down to about four. I found that by the time we moved on to a new topic, there were always a few students who still weren’t sure about the prior concept, which showed up as continued questions from students and in a mid-semester evaluation. Certain readings were not entry-level enough for my students and would have required in-class time, or an extra class period, to fully delve into the concept. One example was the topic of access to archival information: I discovered that none of my students had visited a physical archive before, but only after we had spent a whole class period talking about archives. This was a good reminder to me, as a teacher, to check in with students about their prior knowledge of course concepts and content, so that students have a better idea of what we are talking about before moving on to larger concepts.
Students responded to weekly reflective discussion prompts about the reading or video before coming to class. Reflection papers were private and read only by me, in hopes that students would be more forthcoming in their reflections than they might be on a discussion board read by the whole class. Reflections were due before the in-person discussion of that week’s topic so that students had time to read or watch the material and react to it on paper. Because every topic involved reflecting on a different aspect of privilege, this format gave students time to reflect on and process information and feelings before coming to class. It also gave me a heads-up as to where the in-class discussion might go. For example, if I read that multiple students were having a difficult time processing that not everyone has access to the internet in the United States (Pew Research Center, 2019), it allowed me to better facilitate a conversation on that topic.
In the Fall 2018 iteration of this course, I brought in three guest speakers over the course of the semester, and we went on one field trip. For weeks when a guest speaker was coming to class, I would reach out to my guest and ask what they would like students to read or watch as homework for that week. I also asked my guests if they had reflective prompts for my students to complete before their week. This was a class full of students brand new to the university, and I wanted to introduce them to other experts on campus. Lizeth Zepeda, my colleague at the time, came in to talk about archives and who gets to “see themselves” in an archive (Zepeda has a background in archives and is now a Research Instruction Librarian at California State University, Monterey Bay). Rachel Caldwell gave a historical overview of how we came to have the publishing industry we have today. Melanie Allen came to speak about accessing health information. The museum on campus had a fantastic temporary exhibit that semester, titled “For All the World to See: Visual Culture and the Struggle for Civil Rights,” and we visited as a class with Academic Programs GA Sadie Counts (McClung Museum of Natural History & Culture, n.d.). This enhanced the class by bringing different voices and perspectives to the topic.
Assessment is an integral part of my teaching process, so there were weekly, informal, reflective assessments. At the end of every class period, I set aside two to three minutes for students to write down one thing that went well during that class and one thing that was still confusing. This let me check the pulse of the class and see where students were with the material after every session. If something pressing was still confusing, I would write a response in the learning management system to answer it for everyone. If it could wait until the next class period, I would make announcements at the beginning of class.
At the end of the semester, I realized that I wanted to know more about what students got out of the class, the impact the class had on them, and what they thought about the term “information privilege.” Since it was my first time teaching this class, and I wasn’t familiar with any other credit-bearing information privilege classes, I wanted to take this opportunity to capture and give voice to my students’ experiences with the class. I planned on teaching this class again and wanted to make evidence-based decisions in any changes I made to the course.
To gather this data, I created a survey (Appendix B). The survey consisted of five multiple-choice questions and two open-text questions. After approval from the Institutional Review Board (IRB), I emailed the survey to students after the semester was over and final grades had been turned in, so that students were not pressured or influenced to take the survey. Before starting the survey, students were provided with an informed consent statement that explained why the research was being done and that participation was optional. The survey was hosted in Qualtrics through my campus’s office of information technology. Responses were collected anonymously, and students were not asked to supply any identifying information, including demographic data. Multiple-choice questions were analyzed for areas where many students answered similarly, to look for themes. Narrative questions were coded for similarities as well. I kept the survey open for one month, and after it closed, I downloaded the data from Qualtrics and put it in Google Sheets.
I had a few expectations going into this survey. I was expecting maybe half of my class to fill the survey out, especially since the semester was over, and it was winter break. I also made all of the questions optional and wasn’t sure that students would take the time to fill out the two open-text questions.
The study population is limited to the students who took FYS 129: Information Privilege in the Fall 2018 semester. Because there were only twelve students in my class, there is a small sample size. Eleven students started the survey, and ten students completed the survey. All the students were over eighteen. No demographic data was collected due to the small sample size and wanting to protect the anonymity of students.
Discussion and Analysis
Multiple Choice Questions
The five multiple-choice questions gave the respondents a statement and asked them to choose “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree” to say how they felt about each statement.
Question one asks students to “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree” to the statement: “I had little or no prior knowledge of information privilege before taking this class.” 90% of students “Strongly agree” with this statement, 10% “Agree,” and 0% “Disagree” or “Strongly disagree.” These answers reveal that the students had little or no prior understanding of the term “information privilege” before taking the class, which surprised me. I had originally thought that students signed up thinking maybe that they knew a little bit about the topic or were interested in it, but this revealed that it was truly a new topic to the students. This is understandable, as it’s a term from the field of library science. The response provides context for the rest of the survey questions by giving the lens that this was new information to students.
Question two asks students to “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree” to the statement: “After taking this class, I have the working knowledge of information privilege to explain the concept to someone else.” Answers show that 90% of the students agree or strongly agree that they could explain information privilege to someone else and that only 10% strongly disagree. I phrased the question this way because I was interested in how much of a grasp students felt they had of the concept at the end of the course. The responses suggest that the course gave students more confidence in their ability to discuss information privilege.
Question three asks students to “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree” to the statement: “This class helped me reflect on my own privilege.” Answers demonstrate that 90% of students agree or strongly agree that this class helped them reflect on their own privilege, while 10% disagreed. I was particularly interested in the answers to this question, since it was such a motivating factor for wanting to propose and teach the course.
Question four asks students to “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree” to the statement: “After taking this class, I have a better understanding of how information privilege impacts my life.” 80% of students “Strongly agree” with this statement and 20% “Agree.” All of the students who took the survey found that the course helped them better understand how information privilege impacts their own lives, which tells me that even the one respondent who, in the question above, did not believe the course helped them reflect on their own privilege is aware of how access to information impacts their life.
Question five asks students to “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree” to the statement: “Learning about information privilege was valuable to my life.” 80% of students thought that learning about information privilege was valuable to their lives, while 20% did not. I was a little surprised by the answers to this question in comparison to the ones above: while all students knew how information privilege impacted their lives, 20% did not find this information valuable. This could be for several reasons; for example, those students may not feel the negative impact of information privilege in their own lives.
There were two optional open-text questions for students to respond to, which were the questions I was most excited to delve into and read. I was interested to see how my students framed things in their own words. The first open-text question asked students to define “information privilege.” After coding their definitions, I identified the following themes: access to information, status, individuals, groups, and inequity. This question was optional, and only eight of the ten students filled it out. Because of the small sample size, some of these themes have as few as two respondents.
Access to information
The theme of access to information appears in 100% of the definitions.
“A person’s access and understanding to information.”
“The access to information i have on a daily basis”
“The access you are born with or are given to information you need in your own life.”
“Your access to information based on location, status, or technology.”
“information privilege is idea that access to information is based on an individual’s status, affiliation, or power.”
“Individuals’ ablities [sic] to access to information are not equal, and some have advantages over others”
Individuals vs Groups
“Individuals’ ablities [sic] to access to information are not equal, and some have advantages over others”
“information privilege is idea that access to information is based on an individual’s status, affiliation, or power.”
“The idea that certain groups of people have access to more information than others and that this difference in access to information makes it difficult for those who don’t have access to move up in the world”
Two themes stood out: access and information. The two words appeared in all eight responses from the respondents who chose to define the term. The term “status” in two of these definitions is not a term we used in class, and it would be interesting to ask follow-up questions about what students mean by it. While this question was not mandatory, I was impressed that eight of the ten students were able to create their own definition of a term that was unfamiliar to them sixteen weeks prior.
This survey data informs how I will teach the course in the future: I need to spend more time defining the term in the first or second class period. The way students defined terms made me think about what I emphasized throughout the semester. I think that I personally relied on “access to information” as a way to describe information privilege, and this survey made me want to include terms like “inequity” more in my course readings and assignments.
The second open text question was, “Is there anything else you want to share?” and was a place for students to say anything else that might be on their minds. This question was not required, and only two students put down other comments:
“I enjoyed the class and thought it was useful”
“This class was very helpful as it related to my English classes and Businesses class. It gave me a better understanding of how information barriers effect people.”
I am glad I included this question, because I did not ask any questions about how the class helped students think about or relate to their other classes, and this was valuable information to have from this student.
I plan on doing a follow-up focus group study with this same group of students. I want to ask what they thought the class was going to be about, or what they thought the term meant when signing up for the class. I want to explore further how first-year students interacted with the topic and find out how this class interacted with other classes they took. Only two students used the open-ended survey question, but one noted: “This class was very helpful as it related to my English classes and Businesses class.” I am interested to know more about how what students learned in this class helped them in other courses.
While the class was a credit-bearing, semester-long class, there are takeaways and implications for different instruction settings. Using the lens of information privilege and privilege when teaching about information literacy allows for a deeper and more meaningful way of learning about these topics.
Information Privilege and Primary Sources / Archives
There are opportunities to talk about information privilege if teaching a library instruction session where students are looking for primary or archival information. Merely asking questions such as: “Why do you think it is important to have access to archival materials?” or “What intersections of privilege would allow you to see or not see yourself reflected in an archive?” or “Whose materials are often collected in archives?” can allow for discussion about privilege on this topic. I was lucky to have Lizeth Zepeda visit my classroom to talk about this subject and would recommend her article (Zepeda, 2018).
Information Privilege and Open Access
Using the Open Access model to talk about information privilege is an excellent way to incorporate this topic into a class in a small or large way. In my experience teaching the class, this was a topic students were engaged in discussing. A great example of how to include it in a session is Jessea Young’s assignment on ProjectCORA titled “Open Access: Strategies and Tools for Life after College” (Young, 2018).
I am thankful for the time, energy, and expertise of Char Booth, external reviewer, Amy Koester, internal reviewer, and Ian Beilin, publishing editor. Thank you also to Hailley Fargo and Suzy Wilson, who read very rough first drafts of this article to help me work through the initial writing process. Thank you to Regina Mays, who read over and gave me feedback on my survey questions before they went out to my students. Lastly, I am deeply grateful to my students who came to class every day with an open mind for this topic.
Chatman, E. A. (1996). The impoverished life-world of outsiders. Journal of the American Society for Information Science, 47(3), 193–206.
Crenshaw, K. (1989). Demarginalizing the intersection of race and sex: A Black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. University of Chicago Legal Forum, 1989(1), 139–167.
hooks, b. (1984). Feminist theory: From margin to center. Boston, MA: South End Press.
Johnson, H. R., & Smedley-López, A. C. (2017). Information privilege in the context of community engagement in sociology. In S. Godbey, S. B. Wainscott, & X. Goodman (Eds.), Disciplinary applications of information literacy threshold concepts (pp. 123–134).
I use “first-year students” as a term for first-year, first-time students at a university or college. I do not use the term to indicate the age or experience of the student.
I did ask my students halfway through the semester if they preferred readings or videos, and it was evenly split amongst the class.
I just upgraded from Xubuntu 17.10 to 18.04. It was a smooth upgrade and most everything appeared to be working. Below are the few issues I ran into. I’ll update this post when I uncover (and hopefully resolve) other issues.
The curl CLI was removed for some reason. I use it often and various scripts I use also rely on it being present. So I reinstalled it.
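If you hit the same issue, reinstalling it with apt is quick (assuming an Ubuntu-based system like Xubuntu):

```shell
# curl was removed during the upgrade; put it back
sudo apt update
sudo apt install curl
```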
I’m using the Pop!_OS theme with Xubuntu. It has a really nice dark theme. In order to get the new version to install I had to purge all of the packages and reinstall them.
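For the record, the purge-and-reinstall looked roughly like this. The package names here are my guess at the Pop theme set; check what you actually have installed first with dpkg -l | grep pop-:

```shell
# Remove the old theme packages completely, then reinstall
sudo apt purge pop-theme pop-gtk-theme pop-icon-theme
sudo apt install pop-theme pop-gtk-theme pop-icon-theme
```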
The power indicator I was using in my notification area was removed during installation. I had to fiddle with different panel and panel notification widget settings to get it right again. The only helpful part I can note is that running xfce4-panel --restart is useful to get settings applied.
The trackpoint mouse acceleration became super fast. First I had to uninstall the libinput driver and install the evdev driver instead; I got that from this issue. After that I was able to go into settings and adjust the mouse speed down again.
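The driver swap amounted to something like the following (package names assumed from the Ubuntu repositories; you need to restart X afterward):

```shell
# Replace the libinput X driver with evdev
sudo apt remove xserver-xorg-input-libinput
sudo apt install xserver-xorg-input-evdev
# Log out and back in (or restart the display manager) so X
# picks up the new driver, then lower the speed in mouse settings.
```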
SSH agent issue
I had an issue where I couldn’t connect to a remote server via ssh anymore. Running ssh-add fixed it up.
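In case it helps someone else, the fix was just making sure an agent was running and the key was loaded (the key path here is an example; use your own):

```shell
# Start an agent for this shell if one isn't running, then load the key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa   # path to your actual private key may differ
```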
I’m developing a script to process some digitized and born digital video into adaptive bitrate formats. As I’ve gone along trying different approaches for creating these streams, I’ve wanted reliable ways to test them on a Linux desktop. I keep forgetting how I can effectively test DASH and HLS adaptive bitrate streams I’ve created, so I’m jotting down some notes here as a reminder. I’ll list both the local and online players that you can try.
While I’m writing about testing both DASH and HLS adaptive bitrate formats, really we need to consider three formats, since HLS can be delivered as MPEG-2 TS segments or as fragmented MP4 (fMP4). Since mid-2016 and iOS 10+, HLS segments can be delivered as fMP4, which allows you to use the same fragmented MP4 files for both DASH and HLS streaming. Until uptake of iOS 10 is greater, you likely still need to deliver video with HLS-TS as well (or go with an HLS-TS-everywhere approach). While DASH can use any codec, I’ll only be testing fragmented MP4s (though maybe not fully conformant to DASH-IF AVC/264 interoperability points). So I’ll break down testing by DASH, HLS-TS, and HLS-fMP4 when applicable.
The important thing to remember is that you’re not playing back a video file directly. Instead, these formats use a manifest file that lists the various adaptations (different resolutions and bitrates) that a client can choose to play based on bandwidth and other factors. So what we want is the ability to play back video by referring to the manifest file instead of any particular video file. In some cases the video files are self-contained, with muxed video and audio, and byte-range requests are used to serve up segments; in other cases the video is segmented, with the audio either in a separate single file or segmented similarly to the video. In fact, depending on how the video files are created, they may even lack the data necessary to play back independently of another file. For instance, it is possible to create a separate initialization MP4 file containing the metadata that tells a client how to play back each of the segment files that lack this information. All of these files are intended to be served over HTTP. They can also include links to text tracks like captions and subtitles, though support for captions in these formats is lacking in many HTML5 players.
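To make that concrete, here is a minimal sketch of what an HLS master playlist looks like; the variant playlist names and bandwidth numbers are invented for illustration:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p.m3u8
```

The client reads this file, picks a variant based on measured bandwidth, then requests that variant playlist, which in turn lists the actual media segments.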
Also note that all this testing is being done on Ubuntu 16.04.1 LTS though the Xubuntu variant and it is possible I’ve compiled some of these tools myself (like ffmpeg) rather than using the version in the Ubuntu repositories.
Playing Manifests Directly
I had hoped that it would be fairly easy to test these formats directly without putting them behind a web server. Here’s what I discovered about playing the files without a web server.
VLC and other desktop players have limited support for these formats, so even when a stream doesn’t work in these players, that doesn’t mean it won’t play in a browser or on a mobile device. I’ve had very little luck using these directly from the file system. Assume for this post that I’m already in a directory with the video manifest files: cd /directory/with/video
So this doesn’t work for a DASH manifest (Media Presentation Description): vlc stream.mpd
Neither does this for an HLS-TS manifest: vlc master.m3u8
In the case of HLS it looks like VLC is not respecting relative paths the way it needs to. Some players appear like they’re trying to play HLS, but I haven’t found a Linux GUI player that can play the stream directly from the file system like this yet. Suggestions?
Command Line Players
Local testing of DASH can be done with the GPAC MP4Client: MP4Client stream.mpd
This works and can tell you whether the stream is basically working and a separate audio file is synced, but it only appears to show the first adaptation. There are also times when it will not play a DASH stream that plays just fine elsewhere. It will not show you whether the sidecar captions are working, and I’ve not been able to use MP4Client to figure out whether the adaptations are set up correctly. Will the video sources actually switch with restricted bandwidth? There’s a command line option for this, but I can’t see that it works.
For HLS-TS it is possible to use the ffplay media player that uses the ffmpeg libraries. It has some of the same limitations as MP4Client as far as testing adaptations and captions. The ffplay player won’t work though for HLS-fMP4 or MPEG-DASH.
Other Command Line Players
The mpv media player is based on MPlayer and mplayer2 and can play back both HLS-TS and HLS-fMP4 streams, but not DASH. It also has some nice overlay controls for navigating through a video including knowing about various audio tracks. Just use it with mpv master.m3u8. The mplayer player also works, but seems to choose only one adaptation (the lowest bitrate or the first in the list?) and does not have overlay controls. It doesn’t seem to recognize the sidecar captions included in the HLS-TS manifest.
Behind a Web Server
One simple solution to be able to use other players is to put the files behind a web server. While local players may work, these formats are really intended to be streamed over HTTP. I usually do this by installing Apache and allowing symlinks. I then symlink from the web root to the temporary directory where I’m generating various ABR files. If you don’t want to set up Apache you can also try web-server-chrome which works well in the cases I’ve tested (h/t @Bigggggg_Al).
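My setup is roughly the following (paths are examples; the default Ubuntu Apache configuration already allows FollowSymLinks under /var/www):

```shell
# Install Apache and symlink the working ABR directory into the web root
sudo apt install apache2
sudo ln -s /tmp/abr-test /var/www/html/abr
# Streams are then reachable at http://localhost/abr/...
```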
GUI Players & HTTP
I’ve found that the GStreamer based Parole media player included with XFCE can play DASH and HLS-TS streams just fine. It does appear to adapt to higher bitrate versions as it plays along, but Parole cannot play HLS-fMP4 streams yet.
To play a DASH stream: parole http://localhost/pets/fmp4/stream.mpd
To play an HLS-TS stream: parole http://localhost/pets/hls/master.m3u8
Are there other Linux GUIs that are known to work?
Command Line Players & HTTP
ffplay and MP4Client also work with localhost URLs. ffplay can play HLS-TS streams. MP4Client can play DASH and HLS-TS streams, but for HLS-TS it seems to not play the audio.
And once you have a stream already served up from a local web server, there are online test players that you can use. No need to open up a port on your machine, since all the requests are made by the browser to the local server, which it already has access to. This is more cumbersome with copy/paste work, but it is probably the best way to determine if the stream will play in Firefox and Chromium. The main thing you’ll need to do is set CORS headers appropriately. If you have any problems with this, check your browser console to see what errors you’re getting. Besides the standard Access-Control-Allow-Origin “*”, for some players you may also need to accept pre-flight request headers via Access-Control-Allow-Headers, such as “Range” for byte-range requests.
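As a sketch, the Apache version of those headers might look like this (requires mod_headers, enabled with something like sudo a2enmod headers; a blanket “*” origin is fine for local testing but too permissive for production):

```
<IfModule mod_headers.c>
  Header set Access-Control-Allow-Origin "*"
  Header set Access-Control-Allow-Headers "Range"
</IfModule>
```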
The Bitmovin MPEG-DASH & HLS Test Player requires that you select whether the source is DASH or HLS-TS (or progressive download). Even though Linux desktop browsers do not natively support playing HLS-TS this player can repackage the TS segments so that they can be played back as MP4. This player does not work with HLS-fMP4 streams, though. Captions that are included in the DASH or HLS manifests can be displayed by clicking on the gear icon, though there’s some kind of double-render issue with the DASH manifests I’ve tested.
The hls.js player has a great demo site for each version that has a lot of options to test quality control and show other metrics. Change the version number in the URL to the latest version. The other nice part about this demo page is that you can just add a src parameter to the URL with the localhost URL you want to test. I could not get hls.js to work with HLS-fMP4 streams, though there is an issue to add fMP4 support. Captions do not seem to be enabled.
There is also the JW Player Stream Tester. But since I don’t have a cert for my local server, I need to use the JW Player HTTP stream tester instead of the HTTPS one. I was able to successfully test DASH and HLS-TS streams with this tool. Captions only displayed for the HLS stream.
The commercial Radiant media player has a DASH and HLS tester that can be controlled with URL parameters. I’m not sure why the streaming type needs to be selected first, but otherwise it works well. It knows how to handle DASH captions but not HLS ones, and it does not work with HLS-fMP4.
The commercial THEOplayer HLS and DASH testing tool only worked for my HLS-TS stream and not the DASH or HLS-fMP4 streams I’ve tested. Maybe it was the test examples given, but even their own examples did not adapt well and had buffering issues.
Wowza has a page for video test players but it seems to require a local Wowza server be set up.
What other demo players are there online that can be used to test ABR streams?
One gap in my testing so far is the Shaka player. They have a demo site, but it doesn’t allow enabling an arbitrary stream.
Other Tools for ABR Testing
In order to test automatic bitrate switching it is useful to test that bandwidth switching is working. Latest Chromium and Firefox nightly both have tools built into their developer tools to simulate different bandwidth conditions. In Chromium this is under the network tab and in Firefox nightly it is only accessible when turning on the mobile/responsive view. If you set the bandwidth to 2G you ought to see network requests for a low bitrate adaptation, and if you change it to wifi it ought to adapt to a high bitrate adaptation.
There are decent tools to test HLS and MPEG-DASH while working on a Linux desktop. I prefer using command line tools like MP4Client (DASH) and mpv (HLS-TS, HLS-fMP4) for quick tests that the video and audio are packaged correctly and that the files are organized and named correctly. These two tools cover both formats and can be launched quickly from a terminal.
I plan on taking a DASH-first approach, and for desktop testing I prefer to test in video.js if caption tracks are added as track elements. With contributed plugins it is possible to test DASH and HLS-TS in browsers. I like testing with Plyr (with my modifications) if the caption file is included in the DASH manifest, since Plyr was easy to hack to make this work. For HLS-fMP4 (and even HLS-TS) there’s really no substitute for testing on an iOS device (and for HLS-fMP4, an iOS 10+ device), as the native player may be used in full-screen mode.
This is just a snapshot look at some features for one resource on the Wellcome Library site, the example may not be good or correct, and could be changed by the time that you read this. I’m also thinking out loud here and asking lots of questions about my own gaps in knowledge and understanding. In any case I hope this might be helpful to others.
If you visit the native page and scroll down, you’ll see some other information. They’ve taken some care to expose tables within the text and show snippets of those tables. This is a really nice example of how older texts might be used in new research. And even though Universal Viewer provides some download options, they provide a separate download option for the tables. Clicking on one of the tables displays the full table. Are they using OCR to extract tables? How do they recognize tables and then ensure that the text is correct?
Otherwise the page includes the barest amount of metadata and the “more information” panel in UV does not provide much more.
There has been some discussion about where the right place is to put a link back from the manifest to an HTML page for humans to see the resource. The related property may not be the right one, and YKK should be used instead.
What other types of seeAlso links are different manifests providing?
Several different services are given. Included are standard services for Content Search and autocomplete. (I'll come back to those in a bit.) There are also a couple of services outside of the iiif.io context.
The first is an extension around access control. Looking at the context you can see the different levels of access: open, clickthrough, credentials. Not having looked closely at the Authentication specification, I don't know yet whether these are aligned with it or not. The other URLs here don't resolve to anything, so I'm not certain what their purpose is.
There’s also a service that appears to be around analytics tracking. From the context document http://universalviewer.io/context.json it appears that there are other directives that can be given to UV to turn off/on different features. I don’t remember seeing anything in the UV documentation on the purpose and use of these though.
One thing I'm interested in is how organizations name and give ids (usually HTTP URLs) to resources that require them. In this case the id of the only sequence is a URL that ends in "/s0" and resolves to an error page. The label for the sequence is "Sequence s0," which could have been automatically generated when the manifest was created. This lack of a meaningful id and label for a sequence is understandable, since these types of things wouldn't regularly get named in a metadata workflow.
This leaves me with the question of which ids ought to have something at the other end of the URI? Should every URI give you something useful back? And is there the assumption that HTTP URIs ought to be used rather than other types of identifiers–ones that might not have the same assumptions about following them to something useful? Or are these URIs just placeholders for good intentions of making something available there later on?
Both PDF and raw text renderings are made available. It was examples like this one that helped me to see where to place renderings so that UniversalViewer would display them in the download dialog. The PDF is oddly extensionless, but it does work, and it has a nice cover page that includes a persistent URL and a statement on conditions of use. The raw text would be suitable for indexing the resource, but it is one long string of text without breaks, so not really for reading.
The viewingHint given here is "paged." I'll admit that one of the more puzzling things to me about the specification is exactly what viewing experience is being hinted at with the different viewing hints. How does each of these affect, or not, different viewers? Are there examples of what folks expect to see with each of the different viewing hints?
The canvases don't dereference and seem to follow the same sort of pattern as sequences by adding "/canvas/c0" to the end of an identifier. There's a seeAlso that points to OCR as ALTO XML. Making this OCR data available with all of the details of bounding boxes of lines and words on the text is potentially valuable. The unfortunate piece is that there's no MIME type for ALTO XML, so the format here is generic and does not unambiguously indicate that the file will contain ALTO OCR.
Even more interesting for each canvas they deliver otherContent that includes an annotation list of the text of the page. Each line of the text is a separate annotation, and each annotation points to the relevant canvas including bounding boxes. Since the annotation list is on the canvas the on property for each annotation has the same URL except for a different fragment hash for the bounding box for the line. I wonder if there is code for extracting annotation lists like this from ALTO? Currently each annotation is not separately dereferenceable and uses a similar approach as seen with sequences and canvases of incrementing a number on the end of the URL to distinguish them.
In looking closer at the images on the canvas, I learned more about how to think about the width/height of the canvas as opposed to the width/height of the image resource, and what the various ids within a canvas, image, and resource ought to mean. I'm getting two important bits wrong currently. The id for the image should point to the image as an annotation and not to the Image API service. Similarly the id for the image resource ought to actually be an image (with format given) and again not the URL to the Image API. For this reason the dimensions of the canvas can be different from (and in many cases larger than) the dimensions of the single image resource given, as it can be expensive to load very large images.
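As a hedged sketch of that distinction (all URLs here are hypothetical), a Presentation 2.x canvas might carry the image as an annotation with its own id, with the Image API appearing only as a service on the image resource:

```json
{
  "@id": "https://example.org/iiif/book1/canvas/c0",
  "@type": "sc:Canvas",
  "height": 4000,
  "width": 3000,
  "images": [{
    "@id": "https://example.org/iiif/book1/annotation/a0",
    "@type": "oa:Annotation",
    "motivation": "sc:painting",
    "resource": {
      "@id": "https://example.org/iiif/image1/full/full/0/default.jpg",
      "@type": "dctypes:Image",
      "format": "image/jpeg",
      "height": 2000,
      "width": 1500,
      "service": {
        "@context": "http://iiif.io/api/image/2/context.json",
        "@id": "https://example.org/iiif/image1",
        "profile": "http://iiif.io/api/image/2/level1.json"
      }
    },
    "on": "https://example.org/iiif/book1/canvas/c0"
  }]
}
```

Note how the annotation id, the resource id (an actual JPEG with a format), and the service id are three different URLs, and the canvas dimensions are larger than the image resource's.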
Content Search API
Both content search and autocomplete are provided for this resource. Both provide a simple URL structure where you can just add “?q=SOMETHING” to the end and get a useful response.
First thing I noticed about the content search response is that it is delivered as "text/plain" rather than JSON or JSON-LD, which is something I'll let them know about. Otherwise this looks like a nice implementation. The hits include both the before and after text, which could be useful for highlighting the text off of the canvas.
The annotations themselves use ids that include the bounding box as a way to distinguish between them. Again they're fine identifiers but don't have anything at the other end. In an example annotation, "1697,403,230,19" is used for both the media fragment as well as the annotation URL.
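Those xywh values can be pulled back out of such a URL with a few lines of JavaScript; this is a sketch of my own, not code from their implementation:

```javascript
// Parse a spatial media fragment like "#xywh=1697,403,230,19"
// into numeric coordinates. Returns null when no fragment is present.
function parseXywh(uri) {
  const match = uri.match(/xywh=(\d+),(\d+),(\d+),(\d+)/);
  if (!match) return null;
  const [x, y, w, h] = match.slice(1).map(Number);
  return { x: x, y: y, w: w, h: h };
}

const frag = parseXywh("https://example.org/canvas/c5#xywh=1697,403,230,19");
// frag.x === 1697, frag.y === 403, frag.w === 230, frag.h === 19
```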
It sounds like client-side search inside may at some point be feasible for a IIIF-compatible viewer, so I wanted to test the idea a bit further. This time I’m not going to try to paint a bounding box over an image like in my last post, but just use client-side search results to create IIIF Content Search API JSON that could be passed to a more capable viewer.
This page is a test for that. Some of what I need in a Presentation manifest I’ve only deployed to staging. From there this example uses an issue from the Nubian Message. First, you can look at how I created the lunr index using this gist. I did not have to use the manifest to do this, but it seemed like a nice little reuse of the API since I’ve begun to include seeAlso links to hOCR for each canvas. The manifest2lunr tool isn’t very flexible right now, but it does successfully download the manifest and hOCR, parse the hOCR, and create a data file with everything we need.
In the data file are included the pre-created lunr.js index and the documents including the OCR text. What was extracted into documents and indexed is the text of each paragraph. This could be changed to segment by lines or some other unit depending on the type of content and use case. The id/ref/key for each paragraph combines the identifier for the canvas (shortened to keep index size small) and the x, y, w, h that can be used to highlight that paragraph. We can just parse the ref that is returned from lunr to get the coordinates we need. We can't get back from lunr.js which words actually matched our query, so we have to fake it some. This limitation also means that at this point there is no reason to go back to our original text just for hit highlighting. The documents with original text are still in the data file should the client-side implementation evolve in the future.
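As a sketch of that ref parsing (the delimiter and exact ref layout here are assumptions for illustration, not necessarily what manifest2lunr emits):

```javascript
// Hypothetical ref format: "<short-canvas-id>/<x>,<y>,<w>,<h>".
// Turn a lunr result ref plus the base canvas URL from the data file
// into a canvas URI with an xywh fragment for highlighting.
function refToFragment(ref, baseCanvasUrl) {
  const [shortId, xywh] = ref.split("/");
  return baseCanvasUrl + shortId + "#xywh=" + xywh;
}

refToFragment("0001/150,275,820,98", "https://example.org/iiif/nubian-message/canvas/");
// → "https://example.org/iiif/nubian-message/canvas/0001#xywh=150,275,820,98"
```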
Also included with the data file is the URL for the original manifest the data was created from and the base URLs for creating canvas and image URLs. These base URLs could have a better, generic implementation with URL templates but it works well enough in this case because of the URL structure I’m using for canvases and images.
base canvas URL:
base image URL:
Now we can search and see the results in the textareas below.
Raw results that lunr.js gives us are in the following textarea. The ref includes everything we need to create a canvas URI with a xywh fragment hash.
Resulting IIIF Content API JSON-LD:
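On the live page the JSON-LD fills a textarea. As a hedged illustration, a minimal Content Search style response built this way might look roughly like the following; the URLs and match text are made up:

```json
{
  "@context": "http://iiif.io/api/search/1/context.json",
  "@id": "https://example.org/search?q=education",
  "@type": "sc:AnnotationList",
  "resources": [{
    "@id": "https://example.org/canvas/0001#xywh=150,275,820,98",
    "@type": "oa:Annotation",
    "motivation": "sc:painting",
    "resource": {
      "@type": "cnt:ContentAsText",
      "chars": "education"
    },
    "on": "https://example.org/canvas/0001#xywh=150,275,820,98"
  }]
}
```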
Since I use the same identifier part for canvases and images in my implementation, I can even show matching images without going back to the presentation manifest. This isn’t necessary in a fuller viewer implementation since the content search JSON already links back to the canvas in the presentation manifest, and each canvas already contains information about where to find images.
I’ve not tested if this content search JSON would actually work in a viewer, but it seems close enough to begin fiddling with until it does. I think in order for this to be feasible in a IIIF-compatible viewer the following would still need to happen:
Some way to advertise this client-side service and data/index file via a Presentation manifest.
A way to turn on the search box for a viewer and listen to events from it.
A way to push the resulting Content Search JSON to the viewer for display.
What else would need to be done? How might we accomplish this? I think it’d be great to have something like this as part of a viable option for search inside for static sites while still using the rest of the IIIF ecosystem and powerful viewers like UniversalViewer.
I wanted to push out these examples before the IIIF Hague working group meetings and I’m doing that at the 11th hour. This post could use some more editing and refinement of the examples, but I hope it still communicates well enough to see what’s possible with video in the browser.
IIIF solved a lot of the issues with working with large images on the Web. None of the image standards or Web standards were really developed with very high resolution images in mind. There’s no built-in way to request just a portion of an image. Usually you’d have to download the whole image to see it at its highest resolutions. Image tiling works around a limitation of image formats by just downloading the portion of the image that is in the viewport at the desired resolution. IIIF has standardized and image servers have implemented how to make requests for tiles. Dealing with high resolution images in this way seems like one of the fundamental issues that IIIF has helped to solve.
This differs significantly from the state of video on the web. Video only more recently came to the web. Previously Flash was the predominant way to deliver video within HTML pages. Since there was already so much experience with video and the web before HTML5 video was specified, it was probably a lot clearer what was needed when specifying video and how it ought to be integrated from the beginning. Also video formats provide a lot of the kinds of functionality that were missing from still images. When video came to HTML it included many more features right from the start than images.
As we’re beginning to consider what features we want in a video API for IIIF, I wanted to take a moment to show what’s possible in the browser with native video. I hope this helps us to make choices based on what’s really necessary to be done on the server and what we can decide is a client-side concern.
Crop a video on the spatial dimension (x,y,w,h)
It is possible to crop a video in the browser. There's no built-in way to do this, but with how video is integrated into HTML and all the other APIs that are available, cropping can be done. You can see one example below where the image of the running video is snipped and added to a canvas of the desired dimensions. In this case I display both the original video and the canvas version. We do not even need to have the video embedded on the page to play it and copy the images over to the canvas. The full video could have been completely hidden and this still would have worked. While no browser implements it, a spatial media fragment could let a client know what's desired.
Also, in this case I’m only listening for the timeupdate event on the video and copying over the portion of the video image then. That event only triggers so many times a second (depending on the browser), so the cropped video does not display as many frames as it could. I’m sure this could be improved upon with a simple timer or a loop that requests an animation frame.
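The heart of the canvas approach is drawImage with a source rectangle and a destination rectangle. A sketch, with names of my own choosing, of computing those arguments from the desired crop:

```javascript
// Given a desired crop (x, y, w, h) on the source video, build the
// nine arguments for CanvasRenderingContext2D.drawImage, which copies
// the source rectangle onto the canvas at its natural size.
function drawImageArgs(video, crop) {
  return [video,
          crop.x, crop.y, crop.w, crop.h, // source rect on the video
          0, 0, crop.w, crop.h];          // destination rect on the canvas
}

// Browser-only usage sketch: repaint on every animation frame instead
// of only on timeupdate, for a smoother cropped playback.
// function paint(video, ctx, crop) {
//   ctx.drawImage(...drawImageArgs(video, crop));
//   requestAnimationFrame(() => paint(video, ctx, crop));
// }
```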
Something similar could be done solely by creating a wrapper div around a video. The div is the desired width with overflow hidden, and the video is positioned relative to the div to give the desired crop.
This is probably the hardest one of these to accomplish with video, but both of these approaches could probably be refined and developed into something workable.
Truncate a video on the temporal dimension (start,end)
Scale the video on the temporal dimension (play at 1.5x speed)
This video plays back at 3 times the normal speed:
This video plays back at half the normal speed:
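Playback speed is a client-side property on the media element, and the wall-clock arithmetic it implies is simple; the helper function name here is my own:

```javascript
// Browser-only: set the rate directly on the element.
// video.playbackRate = 3;   // 3x speed
// video.playbackRate = 0.5; // half speed

// Wall-clock seconds needed to play `duration` seconds of media at `rate`.
function wallClockSeconds(duration, rate) {
  return duration / rate;
}

wallClockSeconds(60, 3);   // → 20
wallClockSeconds(60, 0.5); // → 120
```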
Change the resolution (w,h)
Two of the questions I’ll have about any feature being considered for IIIF A/V APIs are:
What’s the use case?
Can it be done in the browser?
I’m not certain what the use case for some of these transformations of video would be, but would like to be presented with them. But even if there are use cases, what are the reasons why they need to be implemented via the server rather than client-side? Are there feasibility issues that still need to be explored?
I do think that if there are use cases for some of these and the decision is made that they are a client-side concern, I am interested in the ways in which the Presentation API and Web Annotations can support those use cases. How would you let a client know that a particular video ought to be played at 1.2x the default playback rate? Or that the video (for some reason I have yet to understand!) needs to be rotated when it is placed on the canvas? In any case I wonder to what extent deciding that something is a client-side concern might affect the Presentation API.
IIIF is working to bring AV resources into IIIF. I have been thinking about how to bring to AV resources the same benefits we have enjoyed for the IIIF Image and Presentation APIs. The initial intention of IIIF, especially with the IIIF Image API, was to meet a few different goals to fill gaps in what the web already provided for images. I want to consider how video works on the web and what gaps still need to be filled for audio and video.
This is a draft and as I consider the issues more I will make changes to better reflect my current thinking.
When images were specified for the web the image formats were not chosen, created, or modified with the intention of displaying and exploring huge multi-gigabyte images. Yet we have high resolution images that users would find useful to have in all their detail. So the first goal was to improve performance of delivering high resolution images. The optimization that would work for viewing large high resolution images was already available; it was just done in multiple different ways. Tiling large images is the work around that has been developed to improve the performance of accessing large high resolution images. If image formats and/or the web had already provided a solution for this challenge, tiling would not have been necessary. When IIIF was being developed there were already tiling image servers available. The need remained to create standardized access to the tiles to aid in interoperability. IIIF accomplished standardizing the performance optimization of tiling image servers. The same functionality that enables tiling can also be used to get regions of an image and manipulate them for other purposes. In order to improve performance smaller derivatives can be delivered for use as thumbnails on a search results page.
The other goal for the IIIF Image API was to improve the sharing of image resources across institutions. The situation before was both too disjointed for consumers of images and too complex for those implementing image servers. IIIF smoothed the path for both. Before IIIF there was not just one way of creating and delivering tiles, and so trying to retrieve image tiles from multiple different institutions could require making requests to multiple different kinds of APIs. IIIF solves this issue by providing access to technical information about an image through an info.json document. That information can then be used in a standardized way to extract regions from an image and manipulate them. The information document delivers the technical properties necessary for a client to create the URLs needed to request the given sizes of whole images and tiles from parts of an image. Having this standard accepted by many image servers has meant that institutions can have their choice of image servers based on local needs and infrastructure while continuing to interoperate for various image viewers.
So it seems as if the main challenges the IIIF Image API were trying to solve were about performance and sharing. The web platform had not already provided solutions so they needed to be developed. IIIF standardized the pre-existing performance optimization pattern of image tiling. Through publishing information about available images in a standardized way it also improved the ability to share images across institutions.
What other general challenges were trying to be solved with the IIIF Image API?
Video and Audio
The challenges of performance and sharing are the ones I will take up below with regard to AV resources. How do audio and video currently work on the web? What are the gaps that still need to be filled? Are there performance problems that need to be solved? Are there challenges to sharing audio and video that could be addressed?
The web did not gain native support for audio and video until later in its history. For a long time the primary ways to deliver audio and video on the web used Flash. By the time video and audio did become native to the web many of the performance considerations of media formats already had standard solutions. Video formats have such advanced lossy compression that they can sometimes even be smaller than an image of the same content. (Here is an example of a screenshot as a lossless PNG being much larger than a video of the same page including additional content.) Tweaks to the frequency of full frames in the stream and the bitrate for the video and audio can further help improve performance. A lot of thought has been put into creating AV formats with an eye towards improving file size while maintaining quality. Video publishers also have multiple options for how they encode AV in order to strike the right balance for their content between compression and quality.
In addition video and audio formats are designed to allow for progressive download. The whole media file does not need to be downloaded before part of the media can begin playing. Only the beginning of the media file needs to be downloaded before a client can get the necessary metadata to begin playing the video in small chunks. The client can also quickly seek into the media to play from any arbitrary point in time without downloading the portions of the video that have come before or after. Segments of the media can be buffered to allow for smooth playback. Requests for these chunks of media can be done with a regular HTTP web server like Apache or Nginx using byte range requests. The web server just needs minimal configuration to allow for byte range requests that can deliver just the partial chunk of bytes within the requested range. Progressive download means that a media file does not have to be pre-segmented–it can remain a single whole file–and yet it can behave as if it has been segmented in advance. Progressive download effectively solves many of the issues with the performance of the delivery of very long media files that might be quite large in size. Media files are already structured in such a way that this functionality of progressive download is available for the web. Progressive download is a performance optimization similar to image tiling. Since these media formats and HTTP already effectively solve the issue of quick playback of media without downloading the whole media file, there is no need for IIIF to look for further optimizations for these media types. Additionally there is no need for special media servers to get the benefits of the improved performance.
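A byte range request is just an HTTP header; here is a hedged sketch of building one from a client, where the URL and offsets are placeholders and the helper is my own:

```javascript
// Build the Range header value for a chunk starting at `offset` bytes
// and running `length` bytes. HTTP byte ranges are inclusive on both ends.
function rangeHeader(offset, length) {
  return "bytes=" + offset + "-" + (offset + length - 1);
}

// e.g. fetch(mediaUrl, { headers: { Range: rangeHeader(1048576, 65536) } })
// A correctly configured server answers 206 Partial Content.
rangeHeader(0, 1024);        // → "bytes=0-1023"
rangeHeader(1048576, 65536); // → "bytes=1048576-1114111"
```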
Quality of Service
While progressive download solves many of the issues with delivery of AV on the web based on how the media files are constructed, it is a partial solution. The internet does not provide assurances on quality of service. A mobile device at the edge of the range of a tower will have more latency in requesting each chunk of content than a wired connection at a large research university. Even over the same stable network the time it takes for a segment of media to be returned can fluctuate based on network conditions. This variability can lead to media playback stuttering or stalling while retrieving the next segment or taking too much time to buffer enough content to achieve smooth playback. There are a couple different solutions to this that have been developed.
With only progressive download at your disposal one solution is to allow the user to manually select a rendition to play back. The same media content is delivered as several separate files at different resolutions and/or bitrates. Lower resolutions and bitrates mean that the segments will be smaller in size and faster to deliver. The media player is given a list of these different renditions with labels and then provides a control for the user to choose the version they prefer. The user can then select whether they want to watch a repeatedly stalling, but high quality, video or would rather watch a lower resolution video playing back smoothly. Many sites implement this pattern as a relatively simple way to take into account that different users will have different network qualities. The problem I have found with this solution for progressive download video is that I am often not the best judge of network conditions. I have to fiddle with the setting until I get it right if I ever do. I can set it higher than it can play back smoothly or select a much lower quality than what my current network could actually handle. I have also found sites that set my initial quality level much lower than my network connection can handle which results in a lesser experience until I make the change to a higher resolution version. That it takes me doing the switching is annoying and distracting from the content.
Adaptive Bitrate Formats
To improve the quality of the experience while providing the highest quality rendition of the media content that the network can handle, other delivery mechanisms were developed. I will cover in general terms a couple I am familiar with, that have the largest market share, and that were designed for delivery over HTTP. For these formats the client measures network conditions and delivers the highest quality version that will lead to smooth playback. The client monitors how long it takes to download each segment as well as the duration of the current buffer. (Sometimes the client also measures the size of the video player in order to select an appropriate resolution rendition.) The client can then adapt on the fly to network conditions to play the video back smoothly without user intervention. This is why it is called “smooth streaming” in some products.
For adaptive bitrate formats like HLS and MPEG-DASH what gets initially delivered is a manifest of the available renditions/adaptations of the media. These manifests contain pointers for where (which URL) to find the media. These could be whole media files for byte range requests, media file segments as separate files, or even in the case of HLS a further manifest/playlist file for each rendition/stream. While the media is often referred to in a manifest with relative URLs, it is possible to serve the manifest from one server and the media files (or further manifests) from a different server like a CDN.
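To make this concrete, here is a sketch of what an HLS master playlist with three renditions might look like; the relative URLs, bandwidth numbers, and codec strings are made up for illustration:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.4d401e,mp4a.40.2"
low/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2400000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
mid/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
high/playlist.m3u8
```

The client reads the BANDWIDTH values to decide which rendition playlist to request next as network conditions change.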
How the media files are encoded is important for the success of this approach. For these formats the different representations can be pre-segmented into the same duration lengths for each segment across all representations. In a similar way they can also be carefully generated single files that have full frames relatively close together within a file and all have these full frames synchronized between all the renditions of the media. For instance all segments could be six seconds with an iframe every 2 seconds. This careful alignment of segments allows for switching between representations without having glitchy moments where the video stalls, without the video replaying or skipping ahead a moment, and with the audio staying synchronized with the video.
It is also possible in the case of video to have one or more audio streams separate from the video streams. Separate audio streams aligned with the video representations will have small download sizes for each segment which can allow a client to decide to continue to play the audio smoothly even if the video is temporarily stalled or reduced in quality. One use case for this audio stream performance optimization is the delivery of alternative language tracks as separate audio streams. The video and audio bitrates can be controlled by the client independently.
In order for adaptive formats like this to work, all of the representations need to have the next required segment ready on the server in case the client decides to switch up or down bitrates. While the cultural heritage use cases that IIIF considers do not include live streaming broadcasts, the number of representations that all need to be encoded and available at the same time affects the "live edge"–how close to real-time the stream can get. If segments are available in only one high bitrate rendition then the client may not be able to keep up with a live broadcast. If all the segments are not available for immediate delivery then it can lead to playback issues.
The manifests for adaptive bitrate formats also include other helpful technical information about the media. (For HLS the manifest is called a master playlist and for MPEG-DASH a Media Presentation Description.) Included in these manifests can be the duration of the media, the maximum/minimum height and width of the representations, the mimetype and codecs (including MP4 level) of the video and audio, the framerate or sampling rate, and lots more. Most importantly for quality of experience switching, each representation includes a number for its bandwidth. There are cases where content providers will deliver two video representations with the same height and width and different bitrates to switch between. In these cases it is a better experience for the user to maintain the resolution and switch down a bandwidth than to switch both resolution and bandwidth. The number of representations–the ladder of different bandwidth encodes–can be quite extensive for advanced cases like Netflix over-the-top (OTT aka internet) content delivery. These adaptive bitrate solutions are meant to scale for high demand use cases. The manifests can even include information about sidecar or segmented subtitles and closed captions. (One issue with adaptive formats is that they may not play back across all devices, so many implementations will still provide progressive download versions as a fallback.) Manifests for adaptive formats include the kind of technical information that is useful for clients.
Because there are existing standards for the adaptive bitrate pattern that have broad industry and client support, there is no need to attempt to recreate these formats.
AV Performance Solved
All except the most advanced video on demand challenges have current solutions through ubiquitous video formats and adaptive bitrate streaming. As new formats like VP9 increase in adoption the situation for performance will improve even further. These formats have bitrate savings through more advanced encoding that greatly reduces file sizes while maintaining quality. This will mean that adaptive bitrate formats are likely to require fewer renditions than are typically published currently. Note though that in some cases smaller file sizes and faster decoding comes at the expense of much slower encoding when trying to keep a good quality level.
There is no need for the cultural heritage community to try to solve performance challenges when the expert AV community and industry has developed advanced solutions.
Parameterized URLs and Performance
One of the proposals for providing a IIIF AV API alongside the Image API involves mirroring the existing Image API by providing parameters for segmenting and transforming of media. I will call this the “parameterized approach.” One way of representing this approach is this URL:
You can see more about this type of proposal here and here. The parameters after the identifier and before the quality would all be used to transform the media.
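Without reproducing any one proposal exactly, a parameterized AV URL in the style of the Image API might look something like the following; the segment names and their ordering here are my own illustration, not the actual proposal:

```
{scheme}://{server}/{prefix}/{identifier}/{time}/{region}/{size}/{rotation}/{quality}.{format}

https://example.org/av/lecture01/30,90/full/640,360/0/default.mp4
```

Here a hypothetical time parameter of "30,90" would ask the server to cut the media down to the span between 30 and 90 seconds before applying the other transformations.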
For the Image API the parameterized approach for retrieving tiles and other derivatives of an image works as an effective performance optimization for delivery. In the case of AV having these parameters does not improve performance. It is already possible to seek into progressive download and adaptive bitrate formats. There is not the same need to tile or zoom into a video as there is for a high definition image. A good consumer monitor will show you as full a resolution as you can get out of most video.
And these parameters do not actually solve the most pressing media delivery performance problems. The parameterized approach probably is not optimizing for bitrate, which is one of the most important settings to improve performance. Having a bitrate parameter within a URL would be difficult to implement well. Set poorly, bitrate could significantly increase the size of the media or increase visible artifacts in the video or audio beyond usability. Would the audio and video bitrates be controlled separately in the parameterized approach? Bitrate is a crucially important parameter for performance and not one I think you would put into the hands of consumers. It will be especially difficult as bitrate optimization for video on demand is slow and getting more complicated. In order to optimize variable bitrate encoding, 2-pass encoding is used, and slower encoding settings can further improve quality. With new formats with better performance for delivery, bitrate is reduced for the same quality while encoding is much slower. Advanced encoding pipelines have been developed that perform metrics on perceptual difference so that each video, or even section of a video, can be encoded at the lowest bitrate that still maintains the desired quality level. Bitrate is where performance gains can be made.
The only functionality proposed for IIIF AV that I have seen that might be helped by the parameterized approach is download of a time segment of the video. This is specific to download of just that time segment. Is this use case big enough to be seriously considered for the amount of complexity it adds? Why is download of a time segment crucial? Why would most cases not be met with just skipping to that section to play? Or can the need be met with downloading the whole video in those cases where download is really necessary? If needed any kind of time segment download use case could live as a separate non-IIIF service. Then it would not have any expectation of being real-time. I doubt most would really see the need to implement a download service like this if the need can be met some other way. In those cases where real-time performance to a user does not matter those video manipulations could be done outside of IIIF. For any workflow that needs to use just a portion of a video the manipulation could be a pre-processing step. In any case if there is really the desire for a video transformation service it does not have to be the IIIF AV API but could be a separate service for those who need it.
Most of the performance challenges with AV have already been solved via progressive download formats and adaptive bitrate streaming. Remaining challenges not fully solved with progressive download and adaptive bitrate formats include live video, server-side control of quality of service adaptations, and greater compression in new codecs. None of these are the types of performance issues the cultural heritage sector ought to try to take on, and the parameterized approach does not contribute solutions to these remaining issues. Beyond these rather advanced issues, performance is a solved problem that has had a lot of eyes on it.
If the parameterized approach is not meant to help with optimizing performance, what problem is it trying to solve? The community would be better off steering clear of this trap of trying to optimize for performance and instead focus on problems that still need to be solved. The parameterized approach sticks with a performance optimization pattern that does not add anything for AV. It has a detrimental fixation on the bitstream that does not work for AV, especially where adaptive bitrate segmented formats are concerned. It appears motivated by some kind of purity of approach rather than taking into account the unique attributes of AV and solving these particular challenges well.
The other challenge a standard can help with is sharing of AV across institutions. If the parameterized approach does not solve a performance problem, then what about sharing? If we want to optimize for sharing and have the greatest number of institutions sharing their AV resources, then there is still no clear benefit for the parameterized approach. What about this parameterized approach aids in sharing? It seems to optimize for performance, which as we have seen above is not needed, at the expense of the real need to improve and simplify sharing. There are many unique challenges for sharing video across institutions on the web that ought to be considered before settling on a solution.
One of the big barriers to sharing is the complexity of AV. Compared to delivery of still images, video is much more complicated. I have talked to a few institutions that have digitized video and have none of it online yet because of the hurdles. Some of the complication is technical, and because of this institutions are quicker to use easily available systems just to get something done. As a result many fewer institutions will have as much control over AV as they have over images, and it will be much more difficult to gain that kind of control. For instance, with some media servers an institution may not have much control over how the video is served or over the URL for a media file.
Video is expensive. Even large libraries often make choices about technology and hosting for video based on campus providing the storage for it. Organizations should be able to make the choices that work for their budget while still being able to share as much as they desire and as is possible.
One argument made is that many institutions were delivering images in a variety of formats before the IIIF Image API, so asking for similar changes to how AV is delivered should not be a barrier to pursuing a particular technical direction. The difficulty institutions have in dealing with AV cannot be minimized in this way; any comparable change to AV would be much greater and would ask much more of them. The complexity and costs of AV, and the choices they force, should be taken into consideration.
An important question to ask is whom you want to help by standardizing an API for sharing. Is it only the well-resourced institutions that self-host video and have the technical expertise? If resources are required to live in a particular location and only certain formats may be used, fewer institutions will gain the sharing benefits of the API because of the significant barriers to entry. If the desire is to enable wide sharing of AV resources across as many institutions as possible, then that ought to lead to a different consideration of the issues of complexity and cost.
One issue that has plagued HTML5 video from the beginning is the inability of the browser vendors to agree on formats and codecs. Early on, open formats like WebM with VP8 were not adopted by some browsers in favor of MP4 with H.264. It became common practice out of necessity to encode each video in a variety of formats in order to reach a broad audience. Each source would be listed on the page (in a source element within a video element) and the browser picks the first one it can play. HTML5 media was standardized around this pattern to accommodate the situation where no single format could be played across all browsers. It is only recently that MP4 with H.264 has become playable across all current browsers, and only after Cisco open sourced its licensed version of H.264 was this possible. Note that while the licensing situation for playback has improved, there are still patent/licensing issues which mean that some institutions still will not create or deliver any MP4 with H.264.
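The pattern looks like this (the file names are hypothetical); the browser works down the list and plays the first source it supports:

```html
<!-- The browser tries each source in order and plays the first it can decode. -->
<video controls>
  <source src="https://example.org/media/lecture.webm" type="video/webm">
  <source src="https://example.org/media/lecture.mp4" type="video/mp4">
  Your browser does not support the video element.
</video>
```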
But now, even as H.264 can be played across all current browsers, there are still changes coming that mean a variety of formats will be present in the wild. New codecs like VP9 that provide much better compression are taking off and have been adopted by most, but not all, modern browsers. The advantage of VP9 is that it reduces file size such that storage and bandwidth costs can be lowered significantly; encoding time increases while delivery performance improves. And still other new, open formats like AV1 using the latest technologies are being developed. Even audio is seeing some change, as Firefox and Chrome are implementing FLAC, which will make a lossless codec an option for audio delivery.
As the landscape for codecs continues to change, the decision on which formats to provide should be left to each institution. Some will want to continue to use a familiar H.264 encoding pipeline. Others will want to take advantage of the cost savings of new formats and migrate. There ought to be allowance for each institution to pick which formats best meet its needs. Since sources in HTML5 media can be listed in order of preference, a standard ought, as much as possible, to support the ability of a client to respect the preferences of the institution. So if WebM VP9 is the first source and the browser can play that format, it should play it even if an MP4 H.264 that it can also play is available. The institution may make decisions about the quality to provide for each format to optimize for its particular content and intended uses.
Then there is the choice to implement adaptive bitrate streaming. Again, institutions could decide to implement these formats for a variety of reasons. Delivering the appropriate adaptation for the situation has benefits beyond just enabling smooth playback: by delivering only the rendition a client can use, based on network conditions and sometimes player size, the segments can be much smaller, lowering bandwidth costs. An institution can decide, depending on its implementation and use patterns, whether its costs lie more with storage or bandwidth, and use the formats that work best. Delivering smaller segment sizes can also be a courtesy to mobile users. Then there are delivery platforms where an adaptive bitrate format is required: Apple requires iOS applications to deliver HLS for any video over ten minutes long. Any of these considerations might nudge an AV provider to use ABR formats. They add complexity but also come with attractive performance benefits.
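For instance, an HLS presentation is driven by a master playlist that lists the available renditions; a minimal sketch (the paths, bandwidths, and resolutions here are hypothetical) might look like:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
high/index.m3u8
```

The player measures network conditions and switches between renditions at segment boundaries, which is exactly the kind of adaptation a fixed URL parameter scheme cannot express.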
Any solution for an API for AV media should not try to pick winners among codecs or formats. The choice should be left to the institution while still allowing them to share the media in these formats with other institutions. It should allow for sharing AV in whatever formats an institution chooses. An approach which restricts which codecs and formats can be shared does harm and closes off important considerations for publishers. Asking them to deliver too many duplicate versions will also mean forcing certain costs. Will this variety of codecs allow for complete interoperability from every institution to every other institution and user? Probably not, but the tendency will be for institutions to do what is needed to support a broad range of browsers while optimizing for their particular needs. Guidelines and evolving best practices can also be part of any community built around the API. A standard for AV sharing should not shut off options while allowing for a community of practice to develop.
If an institution is able to deliver any of their video on the web, then that is an accomplishment. What could be provided to allow them to most easily share their video with other institutions? One simple approach would be for them to create a URL where they can publish information about the video. Some JSON with just enough technical information could map to the properties an HTML5 video player uses. Since it is still the case that many institutions are publishing multiple versions of each video in order to cover the variety of new and old browsers and mobile devices, it could include a list of these different video sources in a preferred order. Preference could be given to an adaptive bitrate format or newer, more efficient codec like VP9 with an MP4 fallback further down the list. Since each video source listed includes a URL to the media, the media file(s) could live anywhere. Hybrid delivery mechanisms are even possible where different servers are used for different formats or the media are hosted on different domains or use CDNs.
This ability to just list a URL to the media would mean that as institutions move to cloud hosting or migrate to a new video server, they only need to change a little bit of information in a JSON file. This greatly simplifies the kind of technical infrastructure that is needed to support the basics of video sharing. The JSON information file could be a static file. No need even for redirects for the video files since they can live wherever and change location over time.
Here is an example of what part of a typical response might look like where a WebM and an MP4 are published:
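This is only a sketch; the property names are hypothetical, but each source maps onto the attributes an HTML5 source element and player would need:

```json
{
  "sources": [
    {
      "id": "https://example.org/media/oral-history.webm",
      "format": "video/webm",
      "codecs": "vp9,opus",
      "width": 1280,
      "height": 720,
      "duration": 1805.5
    },
    {
      "id": "https://example.org/media/oral-history.mp4",
      "format": "video/mp4",
      "codecs": "avc1.42E01E,mp4a.40.2",
      "width": 1280,
      "height": 720,
      "duration": 1805.5
    }
  ]
}
```

The order expresses the institution's preference: a client that can play the WebM should use it, and fall back to the MP4 otherwise.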
An approach that simply lists the sources an institution makes available for delivery ought to be easier for more institutions than other options for sharing AV. It would allow them to effectively share the whole range of audio and video they already have, no matter what technologies they are currently using. In the simplest cases there would be no need even for redirects. If you are optimizing for the widest possible sharing from the most institutions, then an approach along these lines ought to be considered.
Straight to AV in the Presentation API?
One interesting option has been proposed for IIIF to move forward with supporting AV resources. This approach is presented in What are Audio and Video Content APIs?. The mechanism is to list out media sources similar to the above approach but on a canvas within a Presentation API manifest. The pattern appears clear for how to provide a list of resources in a manifest in this way. It would not require a specific AV API that tries to optimize for the wrong concerns. The approach still has some issues that may impede sharing.
Requiring an institution to go straight to implementing the Presentation API means that nothing is provided for sharing AV resources outside of a manifest or a canvas that can be referenced separately from a Presentation manifest. Not every case of sharing and reuse requires the complexity of a Presentation manifest just to play back a video. There are many use cases that do not need a sequence with a canvas with media with an annotation with a body with a list of items (a whole highly nested structure) just to get to the AV sources needed to play back some media. This breaks the pattern from the Image API, where it is easy and common to view an image without implementing Presentation at all. Only providing access to AV through a Presentation manifest lacks the simplicity that would allow an institution to level up over time. What is the path for an institution to incrementally adopt IIIF standards? Even if a canvas could be used as the AV API as a simplification over a manifest, requiring a dereferenceable canvas would further complicate what it takes to implement IIIF. Even some institutions that have implemented IIIF and see the value of a dereferenceable canvas have not gotten that far yet in their implementations.
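To make the nesting concrete, here is a sketch in the direction of the draft Presentation model (identifiers and labels are hypothetical); note how many layers sit between the manifest and the actual media URL:

```json
{
  "id": "https://example.org/manifest/1",
  "type": "Manifest",
  "label": "Oral history interview",
  "items": [
    {
      "id": "https://example.org/canvas/1",
      "type": "Canvas",
      "duration": 1805.5,
      "items": [
        {
          "type": "AnnotationPage",
          "items": [
            {
              "type": "Annotation",
              "motivation": "painting",
              "body": {
                "id": "https://example.org/media/interview.mp4",
                "type": "Video",
                "format": "video/mp4"
              },
              "target": "https://example.org/canvas/1"
            }
          ]
        }
      ]
    }
  ]
}
```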
One of the benefits I have found with the Image API is the ability to view images without needing to have the resource described and published to the public. This allows me to check on the health of images, do cache warming to optimize delivery, and use the resources in other pre-publication workflows. I have only implemented manifests and canvases within my public interface once a resource has been published, so I would effectively be forced to publish the resource prematurely or otherwise change the workflow. I am guessing that others have also implemented manifests in ways that are tied to their public interfaces.
Coupling of media access with a manifest has some other smaller implications. Requiring a manifest or canvas leads to unnecessary boilerplate when an institution does not have the information yet and still needs access to the resources to prepare the resource for publication. For instance a manifest and a canvas MUST have a label. Should they use “Unlabeled” in cases where this information is not available yet?
In my own case sharing with the world is often the happy result rather than the initial intention of implementing something. For instance there is value in an API that supports different kinds of internal sharing. Easy internal sharing enables us to do new things with our resources more easily regardless of whether the API is shared publicly. That internal sharing ought to be recognized as an important motivator for adopting IIIF and other standards. IIIF thus far has enabled us to more quickly develop new applications and functionality that reuse special collections image resources. Not every internal use will need or want the features found in a manifest, but just need to get the audio or video sources to play them.
If there is no IIIF AV API that optimizes for the sharing of a range of different AV formats and instead relies on manifests or canvases, then there is still a gap that could be filled. For at least local use I would want some kind of AV API in order to get the technical information I would need to embed in a manifest or canvas. This seems like it could be a common desire to decouple technical information about video resources from the fuller information needed for a manifest including attributes like labels needed for presentation with context to the public. Coupling AV access too tightly to Presentation does not help to solve the desire to decouple these technical aspects. It is a reasonable choice to consider this technical information a separate concern. And if I am already going through the work to create such an internal AV API, I would like to be able to make this API available to share my AV resources outside of a manifest or canvas.
Then there is also the issue of AV players. In the case of images, many pan-zoom image viewers were modified to work with the Image API. One of the attractions of delivering images via IIIF or adopting a IIIF image server is that there is choice in viewers. Is the expectation that any AV player would need to read in a Presentation manifest or canvas in order to support IIIF and play media? The complexity of the manifest and canvas documents may hinder adoption of IIIF in media players. These are rather complicated documents that take some time to understand. A simpler API than Presentation would have a better chance of being widely adopted by players and would be easier to maintain. We only have a couple of featureful client-side applications for presenting manifests (UniversalViewer and Mirador), but we already have many basic viewers for the Image API. Even though not all of those basic viewers are used within the likes of UniversalViewer and Mirador, the simpler viewers have still been of value for other use cases. For instance, a simple image viewer can be used in a metadata management interface where UniversalViewer features like the metadata panel and download buttons are unnecessary or distracting. Would the burden of maintaining plugins and shims for various AV players to understand a manifest or canvas rest with the relatively small IIIF community rather than with the larger group of maintainers of AV players? Certainly having choice is part of the benefit of having the Image API supported in many different image viewers. Would IIIF still have the goal of being supported by a wide range of video players? Broad support within foundational pieces like media players allows for better experimentation on top of them.
My own implementation of the Image API has shown how having a choice of viewers can be of great benefit. When I was implementing the IIIF APIs I wanted to improve the viewing experience by using a more powerful viewer. I chose UniversalViewer even though it did not have a very good mobile experience at the time, and we did not want to give up the decent mobile experience we had previously developed. So that we could still have a good mobile interface while UV was in the middle of improving its mobile view, we implemented a Leaflet-based viewer alongside UV and toggled each viewer on/off with CSS media queries. Interoperability at this lower level allowed us to take advantage of multiple viewers while providing a better experience for our users. You can read more about this in Simple Interoperability Wins with IIIF. As AV players are uneven in their support of different features, this ability to swap out one player for another, say based on video source type, browser version, or other features, may be particularly useful. We have also seen new tools for tasks like cropping grow up around the Image API, and it would be good to have a similar situation for AV players.
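The toggle itself can be as simple as a pair of media queries (the element ids here are hypothetical):

```css
/* Mobile first: show the Leaflet-based viewer, hide UniversalViewer */
#leaflet-viewer { display: block; }
#uv-viewer { display: none; }

/* On wider screens, swap the two viewers */
@media (min-width: 768px) {
  #leaflet-viewer { display: none; }
  #uv-viewer { display: block; }
}
```

Because both viewers speak the same Image API, the page can render either one without any change to the underlying image service.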
So while listing out sources within a manifest or canvas would allow for institutions with heterogeneous formats to share their distributed AV content, the lack of an API that covers these formats results in some complication, open questions, and less utility.
IIIF ought to focus on solving the right challenges for audio and video. There is no sense in trying to solve the performance challenges of AV delivery. That work has been well done already by the larger AV community and industry. The parameterized approach to an AV API does not bring significant delivery performance gains though that is the only conceivable benefit to the approach. The parameterized approach does not sufficiently help make it easier for smaller institutions to share their video. It does not provide any help at all to institutions that are trying to use current best practices like adaptive bitrate formats.
Instead IIIF should focus on achieving ubiquitous sharing of media across many types of institutions. Focusing on the challenges of sharing media, and on the complexity and costs of delivering AV resources, leads to meeting institutions where they are. A simple approach to an AV API that lists out the sources would more readily solve the challenges institutions will face with sharing.
Optimizing for sharing leads to different conclusions than optimizing for performance.
Since writing this post I’ve reconsidered some questions and modified my conclusions.
Update 2017-02-04: Canvas Revisited
Since I wrote this post I got some feedback on it, and I was convinced to try the canvas approach. I experimented with creating a canvas, and it looks more complex and nested than I would like, but it isn’t terrible to understand and create. I have a few questions I’m not sure how I’d resolve, and there are some places where there could be less ambiguity.
I’d eventually like to have an image service that can return frames from the video, but for now I’ve just included a single static poster image as a thumbnail. I’m not sure how I’d provide a service like that yet, though I had prototyped something in my image server. One way to start would be creating an image service that just provides full images at the various sizes that are provided with the various adaptations. Or could a list of poster image choices with width and height just be provided somehow? I’m not sure what an info.json would look like for non-tiled images. Are there any Image API examples out in the wild that only provide a few static images?
I’ve included a width and height for the adaptive bitrate formats, but what I really mean is the maximum height and width that are provided for those formats. It might be useful to have those values available.
I haven’t included duration for each format, though there would be slight variations. I don’t know how the duration of the canvas would be reconciled with the duration of each individual item. It might just be close enough to not matter.
How would I also include an audio file alongside a video? Are all the items expected to be a video and the same content? Would it be alright to also add an audio file or two to the items? My use case is that I have a lot of video oral histories. Since they’re mostly talking heads, some may prefer to just listen to the audio rather than play the video. How would I say that this is the audio content for the video?
I’m uncertain how, with the seeAlso WebVTT captions, I could say that they are captions rather than subtitles, descriptions, or chapters. Would it be possible to add a “kind” field that maps directly to an HTML5 track element attribute? Otherwise it could be ambiguous what the proper use for any particular WebVTT (or other captions format) file is.
Have you ever looked back at a graph of fluorescence change in neurons or gene expression data in C. elegans from years ago and wondered how exactly you got that result? Would you have enough findable notes at hand to repeat that experiment? Do you have a quick, repeatable method for preparing your data to be published with your manuscripts (as required by many journals and funders)? If these questions give you pause, we are interested in helping you!
For many data users, getting insight from data is not always a straightforward process. Data is often hard to find, archived in difficult to use formats, poorly structured or incomplete. These issues create friction and make it difficult to use, publish, and share data. The Frictionless Data initiative aims to reduce friction in working with data, with a goal to make it effortless to transport data among different tools and platforms for further analysis.
Over the last several years, Frictionless Data has produced specifications, software, and best practices that address identified needs for improving data-driven research such as generalized, standard metadata formats, interoperable data, and open-source tooling for data validation.
For researchers, Frictionless Data tools, specifications, and software can be used to:
Improve the quality of your dataset
Quickly find and fix errors in your data
Put your data collection and relevant information that provides context about your data in one container before you share it
Write a schema – a blueprint that tells others how your data is structured, and what type of content is to be expected in it
Facilitate data reuse by creating machine-readable metadata
Make your data more interoperable so you can import it into various tools like Excel, R, or Python
Read more about how to get started with our Field Guide tutorials
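A Table Schema, one of the Frictionless Data specifications, is just a small JSON document describing each column. A minimal sketch for a hypothetical worm-tracking table might look like:

```json
{
  "fields": [
    {"name": "strain", "type": "string", "description": "C. elegans strain identifier"},
    {"name": "timepoint", "type": "number", "description": "Seconds since stimulus"},
    {"name": "fluorescence", "type": "number", "description": "Normalized dF/F"}
  ],
  "missingValues": ["", "NA"]
}
```

Shared alongside the CSV it describes, a schema like this lets collaborators (and validation tools) know exactly what each column should contain.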
Importantly, these tools can be used on their own, or adapted into your own personal and organisational workflows. For instance, neuroscientists can use Frictionless Data tooling and specs to help keep track of imaging metadata from the microscope to analysis software to publication; to optimize an ephys data workflow from voltage recording, to tabular data, to analyzed graph; or to make data more easily shareable for smoother publishing with a research article.
We want to learn about your multifaceted workflow and help make your data more interoperable between the various formats and tools you use.
We are looking for researchers and research-related groups to join Pilots, and are particularly keen to work with: scientists creating data, data managers in a research group, statisticians and data scientists, data wranglers in a database, publishers, and librarians helping researchers manage their data or teaching data best practices. The primary goal of this work is to collaborate with scientists and scientific data to enact exemplary data practice, supported by Frictionless Data specifications and software, to deliver on the promise of data-driven, reproducible research. We will work with you, integrating with your current tools and methodologies, to enhance your workflows and provide increased efficiency and accuracy in your data-driven research.
Want to know more? Through our past Pilots, we worked directly with organisations to solve real problems managing data:
In an ongoing Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO), we helped BCO-DMO develop a data management UI, called Laminar, which incorporates Frictionless Data Package Pipelines on the backend. BCO-DMO’s data managers are now able to receive data in various formats, import the data into Laminar, and perform several pipeline processes, and then host the clean, transformed data for other scientists to (re)use. The next steps in the Pilot are to incorporate GoodTables into the Laminar pipeline to validate the data as it is processed. This will help ensure data quality and will also improve the processing experience for the data managers.
In a Pilot with the University of Cambridge, we worked with Stephen Eglen to capture complete metadata about retinal ganglion cells in a data package. This metadata included the type of ganglion cell, the species, the radius of the soma, citations, and raw images.
Collaborating with the Cell Migration Standard Organization (CMSO), we investigated the standardization of cell tracking data. CMSO used the Tabular Data Package to make it easy to import their data into a Pandas dataframe (in Python) to allow for dynamic data visualization and analysis.
To find out more about Frictionless Data, visit frictionlessdata.io or email the team at email@example.com.
The University Library’s digital collections, encompassing more than 300 collections with over a million items, are now discoverable through the library’s Articles discovery tool, powered by Summon. Read on to learn about searching this trove of images and text, and how to add it to your library’s Summon instance.
Recently I've read and watched some interesting stuff about technology adoption and real life workflows, which luckily fits in very well with my new paid work. Build what you need using what you have to collaborate and empower and change explores what can happen when the outcome of a tech project is prioritised ahead of the technology, and also what can happen if the metrics are prioritised above everything else. It reminded me of Matthew MacDonald's piece Microsoft Access: the database software that won’t die. MacDonald points out that regardless of what professional software developers think of Access, for the average "power user" (a largely ignored group in recent times) it is incredibly useful and works pretty well most of the time. Indeed these articles taken together are a good primer for anyone wanting to move a large group of people from using one technology to another. Why are we still using MARC in libraries, for example? There has been no shortage of proposals to move to something else, whether some kind of Linked Data or, at a bare minimum, storing MARC field data in XML or JSON formats by default rather than persisting with the esoteric and outdated ISO 2709. There are many answers to this, but a large part of the problem is simply that MARC still does the job adequately for the small, suburban public libraries that make up the vast majority of the world's libraries. Sure, it's far from perfect, but it's also not so completely unusable that there is enough incentive to move away from it, especially with several generations of librarians having learned to read, use, and manipulate metadata stored as MARC, and a marketplace (albeit shrinking) of library software built around MARC records.
We can also see the reverse of this when looking at institutional repository software. Libraries in Australia and elsewhere appear to be moving out of an initial exploratory phase of diverse solutions using mostly open-source software, to a consolidation phase with the institutional repository 'market' shrinking to a very small number of proprietary systems. There are many reasons for this, and I'm by no means an expert on the ins and outs of repository software, but it seems this is primarily a response to the costs of DIY system maintenance and troubleshooting. Whilst open source advocates often speak about the freedom provided by FLOSS, we all know that with freedom comes responsibility. In the case of Australia, a very diverse range of software and versions of software has meant that the mythical "open source user community", let alone commercial support, simply hasn't existed in sufficient numbers for any given system, especially when we're in the "wrong" timezone compared to most potential overseas supporters. This fuels the perception that using open source means needing full-time tech support on staff, which is sometimes possible but often tenuous in the higher education funding environment. I've spoken about this before in the context of local government: there's a reason managers often ask "who else is using it?"
But here's the thing: just because someone else is doing the maintenance and writing all the code doesn't magically make the software more sophisticated and reliable. Most software is chewing-gum-and-string when you peek behind the curtain, and it's the boring old stuff that is the most robust. This is the case for all technologies: trains kept exploding, derailing, and colliding for decades before adequate safety systems were invented and rolled out. The first commercial passenger jet, the de Havilland Comet, regularly disintegrated mid-air. Microfiche is arguably the most reliable dense storage technology for text and two-dimensional images.
Creating a hybrid of the factory and the putting-out system is feasible because networked digital technologies enable employers to project their authority farther than before. They enable discipline at a distance. The elastic factory, we could call it: the labor regime of Manchester, stretched out by fiber optic cable until it covers the whole world.
Tarnoff is primarily (though not exclusively) talking about services like Uber or Deliveroo, where workers have 'flexibility' as 'independent contractors' but are subject to the discipline of and receive much the same pay - and to an extent, conditions - as a nineteenth-century factory worker.
But to return to the issue of technology adoption in less brutal employment conditions, the most important consideration is not really the inherent 'sophistication' of a technology but rather how easy it is for a given organisation to maintain it, given the resources the organisation is willing and able to access.
The trick here is to be able to identify the point at which it is better to invest in enabling staff to learn a new tool vs continuing to use "what you have". If you can increase "what you have" (i.e. by expanding what staff know) then options about what you can use become broader. This is essentially what my current job is mostly about: Putting the ‘tech’ back into technical services and the rest of the library.
"What you have" is not only contextual to individuals or organisations but also reasonable expectations vary by industry and profession. This is one reason why so many workflows designed by programmers use git - it's become an industry-standard tool allowing a reasonable expectation that most if not all professional software developers will know how to use it. But this runs into reality once we move into other domains, particularly where there is cross-over: for example documentation workflows. git is a powerful tool for distributed version control of written texts: but it's also notoriously difficult to learn, can be frustrating and stressful to use, and has such torturous syntax and poor documentation that it's difficult to tell the difference between the parody docs and the real ones.
The following is an announcement of a Web-based demonstration to the Distant Reader:
Please join us for a web-based demo and Q&A on The Distant Reader, a web-based text analysis toolset for reading and analyzing texts that removes the hurdle of acquiring computational expertise. The Distant Reader offers a ready way to onboard scholars to text analysis and its possibilities. Eric Lease Morgan (Notre Dame) will demo his tool and answer your questions. This session is suitable for digital textual scholars at any level, from beginning to expert.
When: February 12, 2020 @ 1-2pm Pacific Standard Time
The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of just about any size (hundreds of books or thousands of journal articles), the Distant Reader analyzes the corpus, and outputs a myriad of reports enabling the researcher to use and understand the corpus. Designed with college students, graduate students, scientists, or humanists in mind, the Distant Reader is intended to supplement the traditional reading process.
This presentation outlines the problems the Reader is intended to address as well as the way it is implemented on the Jetstream platform with the help of both software and personnel resources from XSEDE. The Distant Reader is freely available for anybody to use at https://distantreader.org.
Other Distant Reader links of possible interest include:
As part of our ongoing effort to be transparent and informative, the Leadership of the NDSA approved the creation of a new working group to assist with communications about NDSA activities with the NDSA membership and beyond. We are looking for people to join in the effort! Any interest can be expressed by completing the following form by the end of January 2020.
Group Purpose: To develop, enhance, and refine the communications strategy with the NDSA membership.
Group Activities: The main expected activities of the group are listed below by public facing and operational categories.
Public Facing Activities
Website: Review and update website for accuracy and timely information. Edit web pages as requested by the Leadership team.
Social media: Posting to NDSA Twitter and other channels about NDSA and member activities, publications, etc.
Blog: Coordinating and posting blog posts about activities, publications, etc
Update the shared online events and activities calendar
Management of Working Groups: Help monitor the processes of the multiple working groups that run on cycles (e.g. Fixity, Staffing Survey). Facilitate group timelines and work with the groups to produce their final published product.
Set up a schedule of known groups that run on cycles.
Contact past/present co-chairs of the group prior to next iteration
Monitor work (if necessary)
Help spin the group back down and collect files in the NDSA Google Drive.
Survey Coordination: There are many groups that routinely survey the NDSA membership (or greater). This group would help coordinate the timing of these surveys across all groups so the membership is not overwhelmed with surveys.
Set up a schedule of known upcoming surveys
Publications: This group would be responsible for managing and finalizing NDSA publications
Working with groups to know what publications are in the works
Recommendations for where/how to store working files, data sets, etc
Understanding the desired timeline for publication from the group and setting deadlines.
Putting publication into the official Publishing Template format
Reviewing publication (basic run through)
Publishing on NDSA OSF site
In coordination with public facing team:
Putting info on NDSA webpage about new publication (if appropriate)
Producing a blog post about new publication
Producing social media posts about new publication
To complete this work we are looking for individuals from NDSA member organizations to join us. Specific roles that are needed include:
Social Media posters/writers
Managing working groups and their output
Any interest can be expressed by completing the following form by the end of January 2020. It is not expected that each individual will be doing all tasks – unless they choose to do so! Participating in NDSA in this manner is a great way to understand and interact with all of NDSA's activities.
I am at the two-year mark of being in my role as systems librarian at Jacksonville University, and I continue to love what I do. I am working on larger-scale projects and continuing to learn new things every week. There has not been a challenge or new skill to learn yet that I have been afraid of.
My first post in this series highlighted groups and departments that may be helpful in learning your new role. Now that I’m a little more seasoned, I have had the opportunity to work with even more departments and individuals at my institution on various projects. Some of these departments may be unique to me, but I would imagine you would find counterparts where you work.
The Academic Technology (AT) department.
This department is responsible for classroom technology, online and hybrid
course software, and generally any technology that enhances student learning.
In the last year, I have worked with AT on finding and working with a vendor
for new technology for OPAC access and developing a digital media lab within
the library. I have worked with almost every individual in this department at
this writing. Many of them have technology experience that surpasses mine, so I
soak up all that I can when there is an opportunity to learn from them. So far,
my friends in AT have taught me more about topics such as accessibility and
various hardware.
Faculty. I work with
faculty in my role as a subject liaison and to teach information literacy
classes, but I have recently worked with faculty on setting up specific
databases. A recent database-linking project required participation from the
library and IT for different access points. The faculty member had not worked
on such a project before, so it was an excellent chance to flex my growing
systems muscle to explain the library’s role in the process. I also work with
faculty on linking electronic resources from our catalog in Blackboard and
Canvas. These activities are excellent ways to not only build relationships with
faculty, but also communicate the variety of resources available to faculty.
Center for Teaching and Learning. Speaking of faculty and communication, get to know your institution’s hub for faculty teaching and learning. Our Center for Teaching and Learning hosts workshops and lectures for new and seasoned faculty. Developing a relationship with this department affords the opportunity to grow faculty knowledge about library technology and what it can do for them and their students. I am in the planning stages of developing a workshop to show faculty how to use the digital media lab mentioned above. For new faculty, find out if there is a new faculty orientation and see if you can present about the library’s website, electronic resources, or anything that may fall under your realm of responsibilities. For me, I am more behind-the-scenes than I used to be in my previous position, so it is nice to be visible and let people meet the person behind the name that may pop up in their e-mail.
The marketing department (again).
I’m repeating my mention of the marketing department because of social media. I
manage my library’s social media presence, and my institution’s marketing
department has been an asset when it comes to social media. Find out if there
are social media trainings to enhance your knowledge of social media platforms.
Are you using the right platforms to reach your audience? Are there specific
hashtags you should be using? Make sure you know your institution’s brand and
that you are developing content that aligns with the brand.
I remain a department of one in my library, but it
continues to take a village to accomplish all the things.
What other departments are important to
supporting the needs of a library’s systems department?
Applications for the mini-grant scheme must be submitted before midnight GMT on Sunday 9th February 2020 by filling in this form.
To be awarded a mini-grant, your event must fit into one of the four tracks laid out below. Event organisers can only apply once and for just one track.
Mini-grant tracks for Open Data Day 2020
Each year, the Open Data Day mini-grant scheme looks to highlight and support particular types of open data events by focusing applicants on a number of thematic tracks. This year’s tracks are:
Environmental data: Use open data to illustrate the urgency of the climate emergency and spur people into action to take a stand or make changes in their lives to help the world become more environmentally sustainable.
Tracking public money flows: Expand budget transparency, dive into public procurement, examine tax data or raise issues around public finance management by submitting Freedom of Information requests.
Open mapping: Learn about the power of maps to develop better communities.
Data for equal development: How can open data be used by communities to highlight pressing issues on a local, national or global level? Can open data be used to track progress towards the Sustainable Development Goals or SDGs?
What is a mini-grant?
A mini-grant is a small fund of between $200 and $300 USD to help support groups organising Open Data Day events.
The mini-grants cannot be used to fund government events, whether national or local. We can only support civil society actions. We encourage governments to find local groups and engage with them if they want to organise events and apply for a mini-grant.
The funds will only be delivered to the successful grantees after their event takes place and once the Open Knowledge Foundation team receives a draft blogpost about the event for us to publish on blog.okfn.org. In case the funds are needed before 7th March 2020, we will assess whether or not we can help on a case-by-case basis.
About Open Data Day
Open Data Day is the annual event where we gather to reach out to new people and build new solutions to issues in our communities using open data. The tenth Open Data Day will take place on Saturday 7th March 2020.
If you have started planning your Open Data Day event already, please add it to the global map on the Open Data Day website using this form.
You can also connect with others and spread the word about Open Data Day using the #OpenDataDay or #ODD2020 hashtags. Alternatively you can join the Google Group to ask for advice or share tips.
To get inspired with ideas for events, you can read about some of the great events which took place on Open Data Day 2019 in our wrap-up blog post.
As well as sponsoring the mini-grant scheme, Datopian will be providing technical support on Open Data Day 2020. Discover key resources on how to publish any data you’re working with via datahub.io and how to reach out to the Datopian team for assistance via Gitter by reading their Open Data Day blogpost.
Need more information?
If you have any questions, you can reach out to the Open Knowledge Foundation team by emailing firstname.lastname@example.org or on Twitter via @OKFN. There’s also the Open Data Day Google Group where you can connect with others interested in taking part.
Just a quickie to say that I’ve replaced the comment section at the bottom of each post with webmentions, which allows you to comment by posting on your own site and linking here. It’s a fundamental part of the IndieWeb, which I’m slowly getting to grips with having been a halfway member of it for years by virtue of having my own site on my own domain.
I’d already got rid of Google Analytics to stop forcing that tracking on my visitors, and I wanted to get rid of Disqus too because I’m pretty sure the only way it’s free for me is if they’re selling my data and yours to third parties. Webmention is a nice alternative because it relies only on open standards, has no tracking and allows people to control their own comments. While I’m currently using a third-party service to help, I can switch to self-hosting at any point in the future, completely transparently.
Thanks to webmention.io, which handles incoming webmentions for me, and webmention.js, which displays them on the site, I can keep it all static and not have to implement any of this myself, which is nice. It’s a bit harder to comment because you have to be able to host your own content somewhere, but then almost no-one ever commented anyway, so it’s not like I’ll lose anything! Plus, if I get Bridgy set up right, you should be able to comment just by replying on Mastodon, Twitter or a few other places.
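For the curious, the protocol itself is simple: a receiving site advertises a webmention endpoint, and a sender POSTs `source` and `target` URLs to it. Here is a minimal, illustrative Python sketch of that flow; the function names are mine, the discovery is deliberately naive (a real client also checks HTTP `Link` headers and uses a proper HTML parser), and the endpoint URL is made up:

```python
# A toy webmention client: discover the endpoint, then POST source/target.
import re
import urllib.parse
import urllib.request

def discover_endpoint(target_html, base):
    """Naive discovery: find <link rel="webmention" href="..."> in the HTML.
    Returns an absolute URL, or None if no endpoint is advertised."""
    m = re.search(r'<(?:link|a)[^>]+rel="webmention"[^>]+href="([^"]*)"',
                  target_html)
    return urllib.parse.urljoin(base, m.group(1)) if m else None

def send_webmention(endpoint, source, target):
    """POST the form-encoded source/target pair; a receiver replies 201/202."""
    data = urllib.parse.urlencode({"source": source, "target": target}).encode()
    with urllib.request.urlopen(urllib.request.Request(endpoint, data=data)) as r:
        return r.status

# Example: a page that delegates its webmentions to a hosted service.
html = '<link rel="webmention" href="https://webmention.io/example.com/webmention">'
print(discover_endpoint(html, "https://example.com/some-post"))
```

This is why a third-party receiver like webmention.io is so easy to swap out later: the only coupling is that one advertised endpoint URL.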
A spot of web searching shows that I’m not the first to make the Disqus -> webmentions switch (yes, I’m putting these links in blatantly to test outgoing webmentions with Telegraph…):
Here are some scattered notes from reading Governing the Soul by Nikolas Rose with an eye towards my field study (Rose, 1999). It’s quite a remarkable book given that it was originally written some 30 years ago, and the topic of how the sciences of the psy* disciplines (psychology, psychiatry, anthropology, sociology, etc) have shaped our ideas about what it means to be a modern subject still seems fresh and relevant.
In the section on work he devotes quite a bit of space to describing the Tavistock Institute which was instrumental to England during WW2, for developing methods for measuring how fit for particular combat jobs soldiers were, and for controlling morale. After the war the Tavistock Clinic became the Tavistock Institute, and it started translating these methods of measurement and analysis to the workplace, through partnerships with Unilever, Glacier Metals, and others. According to Rose it was during this time as they focused on the productivity of organizations that the theory of the “sociotechnical system” was created by Eric Trist, Ken Bamforth and A. K. Rice.
Trist and his colleagues thus claimed that the technology did not determine the relations of work – there were social and psychological properties that were independent of technology. Hence organizations could choose how tasks should be organized to promote the psychological and social processes that were conducive to efficient, productive, and harmonious relations. The analytic procedure could, as it were, be reversed to conceptualize and construct the details of a labour process that would be in line with both technological and psychological requirements.
It was not merely that a new language was being formulated for speaking about the internal world of the enterprise, factory, plant, mine, or hospital. It was rather that the microstructures of the internal world of the enterprise (the details of the technical organization, roles, responsibilities, machinery, shifts, and so forth) were opened to systematic analysis and intervention in the name of a psychological principle of health that was simultaneously a managerial principle of efficiency. (p. 93)
It’s significant that the origins of sociotechnical theory are in warfare, and that the ideas were almost seamlessly translated to manufacturing and labor processes once the war was over.
The chapters concerned with the psy* disciplines’ production of ideas of the child and the family are also relevant for my own research, because part of the rationale of my informants is that their work is being done for the benefit and protection of children. Rose traces the emergence of social welfare and rights for children alongside the emergence of the public space of the community and the private space of the family in the 19th century. He argues that these moves weren’t so much about the rights of children as they were moves to provide “antidotes to social unrest” while creating modes of social control:
Further, it appeared that the extension of social regulation to the lives of children actually had little to do with recognition of their rights. Children came to the attention of social authorities as delinquents threatening property and security, as future workers requiring moralization and skills, as future soldiers requiring a level of physical fitness – in other words on account of the threat which they posed now or in the future to the welfare of the state. The apparent humanity, benevolence, and enlightenment of the extension of protection to children in their homes disguised the extension of surveillance and control over the family. Reformers arguing for such legislative changes were moral entrepreneurs, seeking to symbolize their values in the law and, in doing so, to extend their powers and authority over others. The upsurges of concern over the young – from juvenile delinquency in the nineteenth century to sexual abuse today – were actually moral panics: repetitive and predictable social occurrences in which certain persons or phenomena come to symbolize a range of social anxieties concerning threats to the established order and traditional values, the decline of morality and social discipline, and the need to take firm steps in order to prevent a downward spiral into disorder. Professional groups – doctors, psychologists, and social workers – used, manipulated, and exacerbated such panics in order to establish and increase their empires. The apparently inexorable growth of welfare surveillance over the families of the working class had arisen from an alignment between the aspirations of the professionals, the political concerns of the authorities, and the social anxieties of the powerful. (p. 125)
Later, Rose has this great Foucault quote that brings to mind how our current moment is different in terms of the documentation we are generating in social media and other spaces thanks to the Internet. I’ve included the full quote here, not just the excerpt from Rose:
For a long time ordinary individuality – the everyday individuality of everybody – remained below the threshold of description. To be looked at, observed, described in detail, followed from day to day by an uninterrupted writing was a privilege. The chronicle of a man, the account of his life, his historiography, written as he lived out his life formed part of the rituals of his power. The disciplinary methods reversed this relation, lowered the threshold of describable individuality and made of this description a means of control and a method of domination. It is no longer a monument for future memory, but a document for possible use. And this new describability is all the more marked in that the disciplinary framework is a strict one: the child, the patient, the madman, the prisoner, were to become, with increasing ease from the eighteenth century and according to a curve which is that of the mechanisms of discipline, the object of individual descriptions and biographical accounts. This turning of real lives into writing is no longer a procedure of heroization; it functions as a procedure of objectification and subjection. The carefully collated life of mental patients or delinquents belongs, as did the chronicle of kings or the adventures of the great popular bandits, to a certain political function of writing; but in a quite different technique of power. (Foucault, 2012, p. 191-192)
Rose goes on to talk about the institutions that mechanize this new form of description:
Michel Foucault argued that the disciplines “make” individuals by means of some rather simple technical procedures. On the parade ground, in the factory, in the school and in the hospital, people were gathered together en masse, but by this very fact they could be observed as entities both similar to and different from one another. These institutions function in certain respects like telescopes, microscopes, or other scientific instruments: they established a regime of visibility in which the observed was distributed within a single common plane of sight. Second, these institutions operated according to a regulation of detail. These regulations, and the evaluation of conduct, manners, and so forth entailed by them, established a grid of codeability of person attributes. They act as norms, enabling the previously aleatory and unpredictable complexities of human conduct to be charted and judged in terms of conformity and deviation, to be coded and compared, ranked and measured. (p. 135-136)
Rose says that in addition to Foucault he relies on Lynch (1985) for this analysis. The focus on the role of description, codes and inscription seems to echo Bruno Latour’s idea of the “immutable mobile”; Latour is cited on the next page. An immutable mobile is an object produced by inscription that can then be accumulated and put side by side for analysis (Latour, 1986). Of course, the connection for me here is the way that computer file fixity, or digital signatures for software, are deployed out of web archives, basically functioning as immutable mobiles that enable a new way of seeing, or as Rose says, “technologies are revolutions in consciousness” (p. 153). As I analyze my field notes and interviews, the centrality of fixity to my story cannot be overstated. It is both a means of preservation and identification. There is a slippage when atomic files become containers for other files. But most of all these fixity values are gathered and deployed to enable a forensic vision of software. I think it’s also interesting to connect the work being done on this site with network infrastructures such as IPFS and Dat that use file fixity to enable new distribution protocols for “files” on the web, as well as records of transaction (blockchain). It’s fascinating how prevalent this idea of file fixity is to so much of our computing infrastructure. And not just a little scary.
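At a technical level a fixity value is just a cryptographic digest of a file’s bytes: the same bytes always yield the same digest, and any change yields a different one, which is what makes it usable for both preservation and identification. A minimal sketch in Python (the file name is invented for the example):

```python
# Compute a fixity value (cryptographic digest) for a file.
import hashlib
from pathlib import Path

def fixity(path, algorithm="sha256", chunk_size=65536):
    """Digest the file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Any future copy of these exact bytes will produce this exact digest.
p = Path("example.txt")
p.write_bytes(b"hello web archive")
print(fixity(p))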
I’m not done reading yet, so I suspect there might be a Rose Notes 2 coming.
Foucault, M. (2012). Discipline & punish: The birth of the prison. Vintage.
Latour, B. (1986). Visualization and cognition: Drawing things together. In H. Kuklick (Ed.), Knowledge and society: Studies in the sociology of culture past and present (Vol. 6, pp. 1–40). JAI.
Lynch, M. (1985). Discipline and the material form of images: An analysis of scientific visibility. Social Studies of Science, 15, 37–66.
Rose, N. (1999). Governing the soul: The shaping of the private self. Free Association Books.
This blogpost is part of a series showcasing projects developed during the 2019 Frictionless Data Tool Fund.
The 2019 Frictionless Data Tool Fund provided four mini-grants of $5,000 to support individuals or organisations in developing an open tool for reproducible research built using the Frictionless Data specifications and software. This fund is part of the Frictionless Data for Reproducible Research project, which is funded by the Sloan Foundation. This project applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts.
Open Referral creates standards for health, human, and social services data – the data found in community resource directories used to help find resources for people in need. In many organisations, this data lives in a multitude of formats, from handwritten notes to Excel files on a laptop to Microsoft SQL databases in the cloud.
For community resource directories to be maximally useful to the public, this disparate data must be converted into an interoperable format. Many organisations have decided to use Open Referral’s Human Services Data Specification (HSDS) as that format. However, to accurately represent this data, HSDS uses multiple linked tables, which can be challenging to work with. To make this process easier, Greg Bloom and Shelby Switzer from Open Referral decided to implement datapackage bundling of their CSV files using the Frictionless Data Tool Fund.
In order to accurately represent the relationships between organisations, the services they provide, and the locations where they are offered, Open Referral’s Human Services Data Specification (HSDS) makes sense of disparate data by linking multiple CSV files together with foreign keys. Open Referral used Frictionless Data’s datapackage to specify the tables’ contents and relationships in a single machine-readable file, so that this standardised format could transport HSDS-compliant data in a way that all of the teams who work with this data can use: CSVs of linked data.
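As an illustration of what that single machine-readable file looks like, here is a hypothetical `datapackage.json` descriptor (the table and field names are invented, not actual HSDS) in which a foreign key links a `services` CSV back to an `organizations` CSV, following the Frictionless Data Table Schema conventions:

```json
{
  "name": "hsds-example",
  "resources": [
    {
      "name": "organizations",
      "path": "organizations.csv",
      "schema": {
        "fields": [
          {"name": "id", "type": "string"},
          {"name": "name", "type": "string"}
        ],
        "primaryKey": "id"
      }
    },
    {
      "name": "services",
      "path": "services.csv",
      "schema": {
        "fields": [
          {"name": "id", "type": "string"},
          {"name": "organization_id", "type": "string"},
          {"name": "name", "type": "string"}
        ],
        "foreignKeys": [
          {
            "fields": "organization_id",
            "reference": {"resource": "organizations", "fields": "id"}
          }
        ]
      }
    }
  ]
}
```

The CSVs stay plain CSVs; the descriptor is what carries the relational structure between them.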
In the Tool Fund, Open Referral worked on their HSDS Transformer tool, which enables a group or person to transform data into an HSDS-compliant data package, so that it can then be combined with other data or used in any number of applications.
The HSDS-Transformer is a Ruby library that can be used during the extract, transform, load (ETL) workflow of raw community resource data. This library extracts the community resource data, transforms that data into HSDS-compliant CSVs, and generates a datapackage.json that describes the data output. The Transformer can also output the datapackage as a zip file, called HSDS Zip, enabling systems to send and receive a single compressed file rather than multiple files. The Transformer can be spun up in a docker container — and once it’s live, the API can deliver a payload that includes links to the source data and to the configuration file that maps the source data to HSDS fields. The Transformer then grabs the source data and uses the configuration file to transform the data and return a zip file of the HSDS-compliant datapackage.
Example of a demo app consuming the API generated from the HSDS Zip
The Open Referral team has also been working on projects related to the HSDS Transformer and HSDS Zip. For example, the HSDS Validator checks that a given datapackage of community service data is HSDS-compliant. Additionally, they have used these tools in the field with a project in Miami. For this project, the HSDS Transformer was used to transform data from a Microsoft SQL Server into an HSDS Zip. Then that zipped datapackage was used to populate a Human Services Data API with a generated developer portal and OpenAPI Specification.
Further, as part of this work, the team also contributed to the original source code for the datapackage-rb Ruby gem. They added a new feature to infer a datapackage.json schema from a given set of CSVs, so that you can generate the json file automatically from your dataset.
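The idea of inferring a descriptor from data can be sketched with the standard library alone. This is a toy version of what such an inference feature does (the real datapackage libraries detect many more types and constraints); the sample data here is invented:

```python
# Toy schema inference: guess a Table Schema-style field list from a CSV.
import csv
import io
import json

def infer_schema(csv_text):
    """Return a minimal schema dict, typing a column as integer only if
    every value in it parses as one, and string otherwise."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = []
    for name in rows[0].keys():  # assumes at least one data row
        values = [r[name] for r in rows]
        ftype = "integer" if all(v.lstrip("-").isdigit() for v in values) else "string"
        fields.append({"name": name, "type": ftype})
    return {"schema": {"fields": fields}}

print(json.dumps(infer_schema("id,name\n1,Food Bank\n2,Shelter"), indent=2))
```

Automating this step matters because hand-writing a descriptor for every CSV is exactly the kind of friction these tools exist to remove.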
Greg and Shelby are eager for the Open Referral community to use these new tools and provide feedback. To use these tools currently, users should either be a Ruby developer who can use the gem as part of another Ruby project, or be familiar enough with Docker and HTTP APIs to start a Docker container and make an HTTP request to it. You can use the HSDS Transformer as a Ruby gem in another project or as a standalone API. In the future, the project might expand to include hosting the HSDS Transformer as a cloud service that anyone can use to transform their data, eliminating many of these technical requirements.
Interested in using these new tools? Open Referral wants to hear your feedback. For example, would it be useful to develop an extract-transform-load API, hosted in the cloud, that enables recurring transformation of nonstandardised human service directory data source into an HSDS-compliant datapackage? You can reach them via their GitHub repos.
Today we're pleased to announce that North Carolina has joined Illinois, Arkansas, and New Mexico as the latest jurisdiction to make all of its appellate court decisions openly available online, in an authoritative, machine-readable format that is citable using a vendor-neutral citation.
As a result of North Carolina taking this important step, the Caselaw Access Project has removed all use and access restrictions from the North Carolina cases in its collection. You now can view or download the full text of all North Carolina cases without restriction. You can read individual cases with the CAP Case Browser or access many cases at once with the CAP API and Bulk Data. Here's an example!
We're delighted and inspired by the work of the North Carolina Supreme Court and the many dedicated professionals on the Court's staff. We hope that many other states will follow North Carolina’s example as soon as possible. Here's how to help make it happen.
Facebook and Alphabet (Google’s parent), which rely on advertising for, respectively, 97% and 88% of their sales, depend on the idea that targeted advertising, exploiting as much personal information about users as possible, results in enough increased sales to justify its cost. This is despite the fact that both experimental research and the experience of major publishers and advertisers show the opposite. Now, The new dot com bubble is here: it’s called online advertising by Jesse Frederik and Maurits Martijn provides an explanation for this disconnect. Follow me below the fold to find out about it and enjoy some wonderful quotes from them.
There are three kinds of problem that mean dollars spent on Web advertising are wasted. First, the Web advertising ecosystem is rife with fraud. Much of it is click fraud, which means the ads are seen by bots not people. Second, ad targeting just doesn't work well to put ads in front of potential buyers. Third, because of the way ad targeting works, most of the potential buyers who see ads are people who would buy the product without seeing the ad.
'It's about 60 to 100 per cent fraud, with an average of 90 per cent, but it is not evenly distributed,' said Augustine Fou, an independent ad fraud researcher, in a report published this month. ... Among quality publishers, Fou reckons $1 spent buys $0.68 in ads actually viewed by real people. But on ad networks and open exchanges, fraud is rampant.
With ad networks, after fees and bots – which account for 30 per cent of traffic – are taken into account, $1 buys $0.07 worth of ad impressions viewed by real people. With open ad exchanges – where bots make up 70 per cent of traffic – that figure is more like $0.01. In other words, web adverts displayed via these networks just aren't being seen by actual people, just automated software scamming advertisers.
The third annual Bot Baseline Report reveals that the economic losses due to bot fraud are estimated to reach $6.5 billion globally in 2017. This is down 10 percent from the $7.2 billion reported in last year's study.
Today, fraud attempts amount to 20 to 35 percent of all ad impressions throughout the year, but the fraud that gets through and gets paid for is now much smaller. We project losses to fraud to reach $5.8 billion globally in 2019. In our prior study, we projected losses of $6.5 billion for 2017. That 11 percent decline in two years is particularly impressive considering that digital ad spending increased by 25.4 percent between 2017 and 2019. ... Absent those measures, losses to fraud would have grown to at least $14 billion annually.
The bad guys, for very little investment, are reaping income of $5.8B/yr. Their ROI is vastly better than the advertisers, or the platforms. No-one cares.
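Stacking Fou’s per-channel figures (quoted above) into a toy calculation makes the scale of the waste concrete; the campaign budget here is a hypothetical round number:

```python
# Share of each ad dollar that ends up as an impression viewed by a real
# person, per channel (Fou's estimates as quoted in the text above).
channels = {
    "quality publishers": 0.68,   # $1 buys $0.68 of human-viewed ads
    "ad networks":        0.07,   # after fees and ~30% bot traffic
    "open exchanges":     0.01,   # where bots make up ~70% of traffic
}

spend = 1_000_000  # hypothetical campaign budget, in dollars

for name, yield_per_dollar in channels.items():
    wasted = spend * (1 - yield_per_dollar)
    print(f"{name}: ${wasted:,.0f} of ${spend:,} never reaches a human")
```

On these numbers, a million-dollar open-exchange buy delivers ten thousand dollars of actual human attention, which is the advertisers' problem, not the platforms'.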
Of the 400,000 web addresses JPMorgan’s ads showed up on in a recent 30-day period, said Ms. Lemkau, only 12,000, or 3 percent, led to activity beyond an impression. An intern then manually clicked on each of those addresses to ensure that the websites were ones the company wanted to advertise on. About 7,000 of them were not, winnowing the group to 5,000. The shift has been easier to execute than expected, Ms. Lemkau said, even as some in the industry warned the company that it risked missing out on audience “reach” and efficiency.
The cull apparently had no effect on the traffic to JPMorgan's website.
The publisher blocked all open-exchange ad buying on its European pages, followed swiftly by behavioral targeting. Instead, NYT International focused on contextual and geographical targeting for programmatic guaranteed and private marketplace deals and has not seen ad revenues drop as a result, according to Jean-Christophe Demarta, svp for global advertising at New York Times International.
Currently, all the ads running on European pages are direct-sold. Although the publisher doesn’t break out exact revenues for Europe, Demarta said that digital advertising revenue has increased significantly since last May and that has continued into early 2019.
Acquisti said the research showed that behaviourally targeted advertising had increased the publisher’s revenue but only marginally. At the same time they found that marketers were having to pay orders of magnitude more to buy these targeted ads, despite the minuscule additional revenue they generated for the publisher.
“What we found was that, yes, advertising with cookies — so targeted advertising — did increase revenues — but by a tiny amount. Four per cent. In absolute terms the increase in revenues was $0.000008 per advertisement,” Acquisti told the hearing. “Simultaneously we were running a study, as merchants, buying ads with a different degree of targeting. And we found that for the merchants sometimes buying targeted ads over untargeted ads can be 500% as expensive.”
Importantly, as we made those decisions and put our money where our mouth has been in terms of the need to increase the efficiency of that supply chain, ensure solid and strong placement of individual ads, we didn’t see a reduction in the growth rate. So as you know, we’ve delivered over 2% organic sales growth on 2% volume growth in the quarter. And that — what that tells me is that, that spending that we cut was largely ineffective.
The fact that targeting doesn't work shouldn't be a surprise to any Web user. Two universal experiences:
I buy something on Amazon. For days afterwards, many of the ads that make it past my ad blockers are for the thing I just bought.
For weeks now, about half the videos I watch on YouTube start by showing me exactly the same ad for Shen Yun. Every time I see it I click "skip ad", so YouTube should have been able after all this time to figure out that I'm not interested and show me a different ad.
But note that the ad platforms don't care. They can't be bothered to enhance their targeting algorithms to exploit what they know. Why? They get paid in both cases.
Frederik and Martijn focus on yet another problem with the numbers used to justify spending on Web ads, selection bias:
Economists refer to this as a "selection effect." It is crucial for advertisers to distinguish such a selection effect (people see your ad, but were already going to click, buy, register, or download) from the advertising effect (people see your ad, and that’s why they start clicking, buying, registering, downloading).
People really do click on the paid-link to eBay.com an awful lot. But if that link weren’t there, presumably they would click on the link just below it: the free link to eBay.com. The data consultants were basing their profit calculations on clicks they would be getting anyway.
There was a clash going on between the marketing department at eBay and the MSN network (Bing and Yahoo!). Ebay wanted to negotiate lower prices, and to get leverage decided to stop ads for the keyword ‘eBay’.
Tadelis got right down to business. Together with his team, he carefully analysed the effects of the ad stop. Three months later, the results were clear: all the traffic that had previously come from paid links was now coming in through ordinary links. Tadelis had been right all along. Annually, eBay was burning a good $20m on ads targeting the keyword ‘eBay’.
Economists at Facebook conducted 15 experiments that showed the enormous impact of selection effects. A large retailer launched a Facebook campaign. Initially it was assumed that the retailer’s ad would only have to be shown 1,490 times before one person actually bought something.
But the experiment revealed that many of those people would have shopped there anyway; only one in 14,300 found the webshop because of the ad. In other words, the selection effects were almost 10 times stronger than the advertising effect alone!
And this was no exception. Selection effects substantially outweighed advertising effects in most of these Facebook experiments. At its strongest, the selection bias was even 50 (!) times more influential.
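The arithmetic behind those numbers can be checked directly from the two rates quoted above:

```python
# Sketch of the selection-effect arithmetic from the Facebook experiments
# quoted above. The two rates come from the text; everything else follows.

naive_rate = 1 / 1_490    # purchases per impression, as credited to the ad
causal_rate = 1 / 14_300  # purchases per impression the ad actually caused

# Purchases that would have happened anyway (the selection effect).
selection_rate = naive_rate - causal_rate

# How much stronger is selection than the true advertising effect?
ratio = selection_rate / causal_rate
print(f"selection effect is {ratio:.1f}x the advertising effect")

# Naive attribution overstates the ad's impact by almost 10x.
print(f"naive/causal = {naive_rate / causal_rate:.1f}")
```

So even before considering bot fraud, naive click- and conversion-counting credits the ad with roughly ten times its causal effect.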
Many publishers and platforms have done experiments showing that spending on Web advertising is wasted. But they don't care.
Why doesn't anyone bar a few researchers care that the Web advertising system is broken and wastes gigantic amounts of money?
It might sound crazy, but companies are not equipped to assess whether their ad spending actually makes money. It is in the best interest of a firm like eBay to know whether its campaigns are profitable, but not so for eBay’s marketing department.
Its own interest is in securing the largest possible budget, which is much easier if you can demonstrate that what you do actually works. Within the marketing department, TV, print and digital compete with each other to show who’s more important, a dynamic that hardly promotes honest reporting. ... Marketers are often most successful at marketing their own marketing.
"Bad methodology makes everyone happy,” said David Reiley, who used to head Yahoo’s economics team and is now working for streaming service Pandora. "It will make the publisher happy. It will make the person who bought the media happy. It will make the boss of the person who bought the media happy. It will make the ad agency happy. Everybody can brag that they had a very successful campaign."
We want certainty. We used to find it in the Don Drapers of the world, the ones with the best one-liners up their sleeves. Today we look for certainty from data analysts who are supposed to just show us the numbers.
"What Randall is trying to say," the former eBay economist interjected, "is that marketeers actually believe that their marketing works, even if it doesn’t. Just like we believe our research is important, even if it isn’t."
The Islandora Foundation could not be happier to welcome our newest member-supporter: The University of Nevada, Las Vegas. One of the earliest adopters of Islandora 8, and a vital supporter who helped to build the platform they are moving to, UNLV has made a huge impact on the Islandora community since they first engaged in 2018.
We hope you will attend later this month on January 21 when UNLV joins our webinar series to give you a tour of the amazing things they're doing with Islandora 8. Register now for free.
DLF is excited to announce that we’re sending community member Justine Thomas to the Visual Resources Association Annual Conference in Baltimore, Maryland this March with support from our GLAM Cross-Pollinator Registration Awards. The conference brings together around 200 colleagues from diverse workplaces, including higher education, the corporate sector, museums, and archives to explore digital asset management, intellectual property rights, digital humanities, metadata standards, coding, imaging best practices and so much more.
About the Awardee
Justine Thomas (@JustineThomasM) is currently a Digital Programs Contractor at the National Museum of American History (NMAH) focusing on digital asset management and collections information support. Prior to graduating in 2019 with a Master’s in Museum Studies from the George Washington University, Justine worked at NMAH as a collections processing intern in the Archives Center and as a Public Programs Facilitator encouraging visitors to discuss American democracy and social justice issues.
The Cross-Pollinator Awards & Upcoming Opportunities
Since 2015 (and initially with support from the Kress Foundation), the GLAM Cross-Pollinator Registration Awards have aimed to foster communication and conversation among the GLAM communities. Each year, a member from each of our partner organizations receives free registration to attend the DLF Forum and, in exchange, a DLF member affiliate attends each partner conference. Learn more about those who attended the 2019 DLF Forum in Tampa, Florida and keep an eye out for their fellow reflections on the DLF blog in the coming weeks.
Students, faculty, and staff from DLF member institutions are eligible to apply for upcoming opportunities! Registration awards are still available for the upcoming annual meetings of ARLIS/NA and AIC in 2020. Apply now!
Upcoming GLAM Cross-Pollinator Registration Award deadlines:
Art Libraries Society of North America (Registration Award): 14 February 2020
American Institute for Conservation (Registration Award): 13 March 2020
Our second Islandora event for 2020 will take Islandora home to where it was born: The University of Prince Edward Island, in beautiful Charlottetown, PE. Come join us to learn all about Islandora, network with your community, and maybe tack on a few days to explore one of Canada's most beautiful destinations.
The camp will run from July 20 - 22, following the usual Islandora Camp format with one day of general sessions, a day of hands-on training workshops with tracks for front-end users and developers, and a day of sessions from the community. We hope to see you there!
The fundamental problem in the design of the LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:
Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyright, and the copyright was owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
Unreliable, mutually untrusting systems was a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation and subjected us to criminal liability under the DMCA.
Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many for obvious economic reasons. Back then there were integrity audit systems developed that didn't need an excess of replicas, including work by Mehul Shah et al, and Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.
A decade and a half later the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's Multiple‐replica integrity auditing schemes for cloud data storage provides a useful review of the current state-of-the-art. Below the fold, a discussion of their, and some related work. Their abstract reads:
Cloud computing has been an essential technology for providing on-demand computing resources as a service on the Internet. Not only enterprises but also individuals can outsource their data to the cloud without worrying about purchase and maintenance cost. The cloud storage system, however, is not fully trustable. Cloud data integrity auditing is crucial for defending against the security threats of data in the untrusted multicloud environment. Storing multiple replicas is a commonly used strategy for the availability and reliability of critical data. In this paper, we summarize and analyze the state-of-the-art multiple-replica integrity auditing schemes in cloud data storage. We present the system model and security threats of outsourcing data to the cloud with classification of ongoing developments. We also summarize the existing data integrity auditing schemes for multicloud data storage. The important open issues and potential research directions are addressed.
There are three possible system architectures for auditing the integrity of multiple replicas:
As far as I'm aware, LOCKSS is unique in using a true peer-to-peer architecture, in which nodes storing content mutually audit each other.
In another possible architecture the data owner (DO in Yangfei Lin et al's nomenclature) audits the replicas.
Yangfei Lin et al generally consider an architecture in which a trusted third party audits the replicas on behalf of the DO.
Proof-of-Possession vs. Proof-of-Retrievability
There are two kinds of audit:
A Proof-of-Retrievability (PoR) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed and every bit was intact.
A Proof-of-Possession (PoP) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed, but not that every bit was intact. The paper uses the acronym PDP for Provable Data Possession.
Immutable, Trustworthy Storage
The reason integrity audits are necessary is that storage systems are neither reliable nor trustworthy, especially at scale. Some audit systems depend on storing integrity tokens, such as hashes, in storage which has to be assumed reliable. If the token storage is corrupted, it may be possible to detect but not recover from the corruption. It is generally assumed that, because the tokens are much smaller than the content to whose integrity they attest, they are correspondingly more reliable. But it is easy to forget that both the tokens and the content are made of the same kind of bits, and that even storage protected by cryptographic hardware has vulnerabilities.
In many applications of cloud storage it is important that confidentiality of the data is preserved by encrypting it. In the digital preservation context, encrypting the data adds a significant single point of failure, the loss or corruption of the key, so is generally not used. If encryption is used, some means for ensuring that the ciphertext of each replica is different is usually desirable, as is the use of immutable, trustworthy storage for the decryption keys. The paper discusses doing this via probabilistic encryption using public/private key pairs, or via symmetric encryption using random noise added to the plaintext.
If the replicas are encrypted they are not bit-for-bit identical and thus their hashes will be different whether they are intact or corrupt. Thus a homomorphic encryption algorithm must be used:
Homomorphic encryption is a form of encryption with an additional evaluation capability for computing over encrypted data without access to the secret key. The result of such a computation remains encrypted.
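For a concrete, if toy, illustration of such an evaluation capability (my example, not one of the paper's schemes): textbook RSA without padding is multiplicatively homomorphic, so multiplying two ciphertexts yields a ciphertext of the product of the plaintexts:

```python
# Toy demonstration of a homomorphic property: unpadded RSA is
# multiplicatively homomorphic. Tiny, insecure parameters for clarity only;
# this illustrates the concept, not the paper's auditing schemes.

p, q = 61, 53
n = p * q       # 3233, the public modulus
e = 17          # public exponent
d = 2753        # private exponent (e*d ≡ 1 mod 3120)

def encrypt(m): return pow(m, e, n)
def decrypt(c): return pow(c, d, n)

a, b = 5, 7
# Multiply the ciphertexts without ever seeing the plaintexts...
c = (encrypt(a) * encrypt(b)) % n
# ...and the result decrypts to the product of the plaintexts.
assert decrypt(c) == a * b   # 35
```

The computation (here, multiplication) happens entirely on ciphertexts; only the key holder learns the result.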
In Section 3.3 Yangfei Lin et al discuss two auditing schemes based on homomorphic encryption.
If an audit operation is not to involve downloading the entire content, it must involve the auditor requiring the system storing the replica to perform a computation that:
The storage system does not know the result of ahead of time.
Takes as input part (PoP) or all (PoR) of the replica.
Thus, for example, asking the replica store for the hash of the content is not adequate, since the store could have pre-computed and stored the hash, rather than the content.
PoP systems can, for example, satisfy these requirements by requesting the hash of a random range of bytes within the content. PoR systems can, for example, satisfy these requirements by providing a random nonce that the replica store must prepend to the content before hashing it. It is important that, if the auditor pre-computes and stores these random values, they be kept secret from the replica stores. If the replica store discovers them, it can pre-compute the responses to future audit requests and discard the content without detection.
Unfortunately, it is not possible to completely exclude the possibility that a replica store, or a conspiracy among the replica stores, has compromised the storage holding the auditor's pre-computed values. An ideal auditor would generate the random values at each audit time, rather than pre-computing them. Alas, this is typically possible only if the auditor has access to a replica stored in immutable, trustworthy storage (see above). In the mutual audit architecture used by the LOCKSS system, the nodes do have access to a replica, albeit not in reliable storage, so the random nonces the system uses are generated afresh for each audit.
It is an unfortunate reality of current systems that, over long periods, preventing secrets from leaking and detecting in a timely fashion that they have leaked are both effectively impossible.
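The byte-range PoP challenge and the nonce-based PoR challenge can be sketched in a few lines. This is my illustration of the idea, not code from LOCKSS or the schemes in the paper; the function names are invented:

```python
import hashlib
import secrets

# Minimal sketch of the two challenge styles described above.

def pop_response(replica: bytes, start: int, length: int) -> str:
    """PoP-style: hash a challenger-chosen random byte range."""
    return hashlib.sha256(replica[start:start + length]).hexdigest()

def por_response(replica: bytes, nonce: bytes) -> str:
    """PoR-style: hash a fresh nonce prepended to the whole replica."""
    return hashlib.sha256(nonce + replica).hexdigest()

# An auditor with its own copy (as in LOCKSS) can generate the nonce
# fresh at audit time, so nothing need be pre-computed or kept secret.
auditor_copy = b"some preserved content" * 1000
remote_copy = bytes(auditor_copy)       # an intact remote replica
nonce = secrets.token_bytes(32)         # fresh for each audit

assert por_response(remote_copy, nonce) == por_response(auditor_copy, nonce)

# A store that discarded the content and kept only a pre-computed hash
# cannot answer, because the nonce changes the input to the hash.
precomputed = hashlib.sha256(auditor_copy).hexdigest()
assert precomputed != por_response(auditor_copy, nonce)
```

The final assertion is the crux: since the store cannot predict the nonce, it must still hold the content at audit time to answer correctly.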
Auditing Dynamic vs. Static Data
In the digital preservation context, the replicas being audited can be assumed to be static, or at least append-only. The paper addresses the much harder problem of auditing replicas that are dynamic, subject to updates through time. In Section 3.2 Yangfei Lin et al discuss a number of techniques for authenticated data structures (ADS) to allow efficient auditing of dynamic data:
There are three main ADSs: rank-based authenticated skip list (RASL), Merkle hash tree (MHT), and map version table (MVT).
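Of the three, the Merkle hash tree is the most familiar. A minimal sketch (my illustration, not the paper's RASL or MVT constructions) of why it suits dynamic data: the auditor holds only the root hash, and any block can be verified, or updated, with a proof containing just the O(log n) sibling hashes on its path to the root:

```python
import hashlib

# Minimal Merkle hash tree sketch. Leaf count is assumed to be a
# power of two, purely to keep the illustration short.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root: O(log n) of them."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        sibling = index ^ 1
        # Record the sibling hash and whether it sits to the left.
        proof.append((level[sibling], sibling < index))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, is_left in proof:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root

blocks = [b"block0", b"block1", b"block2", b"block3"]
root = merkle_root(blocks)
proof = merkle_proof(blocks, 2)
assert verify(b"block2", proof, root)
assert not verify(b"tampered", proof, root)
```

Updating one block changes only the hashes on its root path, so the auditor can track a dynamic replica by updating a single root hash per change.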
Cloud storage adoption, due to the growing popularity of IoT solutions, is steadily on the rise, and ever more critical to services and businesses. In light of this trend, customers of cloud-based services are increasingly reliant, and their interests correspondingly at stake, on the good faith and appropriate conduct of providers at all times, which can be misplaced considering that data is the "new gold", and malicious interests on the provider side may conjure to misappropriate, alter, hide data, or deny access. A key to this problem lies in identifying and designing protocols to produce a trail of all interactions between customers and providers, at the very least, and make it widely available, auditable and its contents therefore provable. This work introduces preliminary results of this research activity, in particular including scenarios, threat models, architecture, interaction protocols and security guarantees of the proposed blockchain-based solution.
Everything they want from the Ethereum blockchain could be provided by the same kind of verifiable logs as are used in Certificate Transparency, thereby avoiding the problems of public blockchains. But doing so would face insuperable scaling problems under the transaction rates of industrial cloud deployments.
What library technology topics are you passionate about? Have something you can help others learn?
LITA invites you to share your expertise with an international audience! Our courses and webinars are based on topics of interest to library technology workers and technology managers at all levels in all types of libraries. Taught by experts, they reach beyond physical conferences to bring high quality continuing education to the library world.
We deliberately seek and strongly encourage submissions from underrepresented groups, such as women, people of color, the LGBTQA+ community, and people with disabilities.
All topics related to the intersection of technology and libraries are welcomed, including:
IT Project Management
Change management in technology
Big Data, High Performance Computing
Python, R, GitHub, OpenRefine, and other programming/coding topics in a library context
Supporting Digital Scholarship/Humanities
Virtual and Augmented Reality
Implementation or Participation in Open Source Technologies or Communities
Open Educational Resources, Creating and Providing Access to Open Ebooks and Other Educational Materials
Managing Technology Training
Diversity/Inclusion and Technology
Accessibility Issues and Library Technology
Technology in Special Libraries
Ethics of Library Technology (e.g., Privacy
Concerns, Social Justice Implications)
Library/Learning Management System
Instructors receive a $500 honorarium for an online course or $150 for a webinar, split among instructors. Check out our list of current and past course offerings to see what topics have been covered recently.
Be part of another slate of compelling and useful online education programs.
Questions or Comments?
For questions or comments related to teaching for LITA, contact us at (312) 280-4268 or email@example.com.
The Library of Congress has finally posted the presentations from the 2019 Designing Storage Architectures for Digital Collections workshop that took place in early September. I've greatly enjoyed the earlier editions of this meeting, so I was sorry I couldn't make it this time. Below the fold, I look at some of the presentations.
[Slide 5] The total amount of storage manufactured each year continues its exponential growth at around 20%/yr. The vast majority (76%) of it is HDD, but the proportion of flash (20%) is increasing. Tape remains a very small proportion (4%).
[Slide 12] They contrast this 20% growth in supply with the traditionally ludicrous 40% growth in "demand". Their analysis assumes one byte of storage manufactured in a year represents one byte of data stored in that year, which is not the case (see my 2016 post Where Did All Those Bits Go? for a comprehensive debunking). So their supposed "storage gap" is actually a huge, if irrelevant, underestimate. But they hit the nail on the head with:
“Key Point: HDD 75% of bits and 30% of revenue, NAND 20% of bits and 70% of revenue.”
[Slide 9] The Kryder rates for NAND Flash, HDD and Tape are comparable;
$/GB decreases are competitive with all technologies.
$/GB decreases are in the 19%/yr range and not the classical Moore’s Law projection of 28%/yr associated with areal density doubling every 2 years.
As my economic model shows, this slower rate of cost decrease makes long-term data storage significantly more expensive.
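A toy version of the model's arithmetic shows why. Assuming, purely for illustration, that media are replaced every four years and that $/GB falls at a constant Kryder rate, the media spend over twenty years compares as follows:

```python
# Toy version of the endowment arithmetic: a drastic simplification of
# the real economic model. Assumes media are replaced every 4 years and
# $/GB falls at a constant Kryder rate.

def media_spend(initial_cost, kryder_rate, years, media_life=4):
    return sum(initial_cost * (1 - kryder_rate) ** t
               for t in range(0, years, media_life))

slow = media_spend(100.0, 0.19, 20)   # 19%/yr, today's rate
fast = media_spend(100.0, 0.28, 20)   # 28%/yr, the classical projection

# At the slower Kryder rate, 20 years of media cost roughly 27% more.
print(f"{slow:.0f} vs {fast:.0f}: {slow / fast - 1:.0%} more")
```

Even a nine-point drop in the Kryder rate compounds into a substantially larger up-front endowment for keeping data forever.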
[Slide 11] In 2017 flash was 9.7 times as expensive as HDD. In 2018 the ratio was 9 times. Thus, despite recovering from 2017's supply shortages, flash has not made significant progress in eroding HDD's $/GB advantage. Extrapolating current trends, they project that by 2026 flash will ship more bytes than HDD. But they project it will still be 6 times as expensive per byte. So they ask a good question:
In 2026 is there demand for 7X more manufactured storage annually and is there sufficient value for this storage to spend $122B more annually (2.4X) for this storage?
Jon Trantham of Seagate confirmed that, as it has been for a decade, the date for volume shipments of HAMR drives is still slipping in real time; "Seagate is now shipping HAMR drives in limited quantities to lead customers".
His presentation is interesting in that he provides some details of the extraordinary challenges involved in manufacturing HAMR drives, with pictures showing how small everything is:
The height from the bottom of the slider to the top of the laser module is less than 500 µm
The slider will fly over the disk with an air-gap of only 1-2 nm
As usual, I will predict that the industry is far more likely to achieve the 15% CAGR in areal density line on the graph than the 30% line. Note the flatness of the "HDD Product" curve for the last five years or so.
The topic of tape provided a point-counterpoint balance.
Tape, unlike HDD, has consistently achieved published capacity roadmaps
For the last 8 years, the ratio of manufactured EB of tape to manufactured EB of HDD has remained constant in the 5.5% range
Unlike HDD, tape magnetic physics is not the limiting issue since tape bit cells are 60X larger than HDD bit cells ... The projected tape areal density in 2025 (90 Gbit/in²) is 13x smaller than today’s HDD areal density and has already been demonstrated in laboratory environments.
Carl Watts' Issues in Tape Industry needed only a few bullets to make his counterpoint that the risk in tape is not technological:
IBM is the last of the hardware manufacturers:
IBM is the only builder of LTO8
IBM is the only vendor left with enterprise class tape drives
If you only have one manufacturer how do you mitigate risk?
These cloud archival solutions all use tape:
Amazon AWS Glacier and Glacier Deep ($1/TB/month)
Azure General Purpose v2 storage Archive ($2/TB/month)
Google GCP Coldline($7/TB/month)
If it's all the same tape, how do we mitigate risk?
If, as Decad and Fontana claim:
Tape storage is strategic in public, hybrid, and private “Clouds”
then IBM has achieved a monopoly, which could have implications for tape's cost advantage. Jon Trantham's presentation described Seagate's work on robots, similar to tape robots and the Blu-Ray robots developed by Facebook, but containing hard disk cartridges descended from those we studied in 2008's Predicting the Archival Life of Removable Hard Disk Drives. We showed that the bits on the platters had similar life to bits on tape. Of course, tape has the advantage of being effectively a 3D medium, whereas disk is effectively a 2D medium.
This table, note, is an over-simplification. The pricing is complex; operations are broken down more precisely than read and write; the exact features vary; and there may be discounts for reserved storage. Costs for data transfer within your cloud infrastructure may be less. The only way to get a true comparison is to specify your exact requirements (and whether the cloud provider can meet them), and work out the price for your particular case.
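As a sketch of what "work out the price for your particular case" means, consider a deliberately over-simplified model. The storage prices are the ones quoted above; the retrieval prices are invented placeholders, not actual provider quotes:

```python
# Deliberately over-simplified cost comparison. Storage prices
# ($/TB/month) are the ones quoted in the text; retrieval prices ($/TB)
# are PLACEHOLDERS chosen for illustration, not real quotes. Real
# pricing also includes per-request charges, tiers, and discounts.

TIERS = {
    "AWS Glacier Deep": {"store": 1.0, "retrieve": 20.0},
    "Azure Archive":    {"store": 2.0, "retrieve": 20.0},
    "GCP Coldline":     {"store": 7.0, "retrieve": 50.0},
}

def annual_cost(tier, tb_stored, tb_retrieved_per_year):
    p = TIERS[tier]
    return p["store"] * tb_stored * 12 + p["retrieve"] * tb_retrieved_per_year

# 100TB archive, 5TB/yr retrieved; rerun with your own workload numbers.
for name in TIERS:
    print(name, annual_cost(name, 100, 5))
```

The point is not the specific numbers but that the ranking can shift with the workload: a heavily-accessed archive rewards a different tier than a write-once, read-never one.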
I've been writing enthusiastically about the long-term potential, but skeptically about the medium-term future, of DNA as an archival storage medium for more than seven years. I've always been impressed by the work of the Microsoft/UW team in this field, and Karin Strauss and Luis Ceze's DNA data storage and computation is no exception. It includes details of their demonstration of a complete write-to-read automated system (see also video), and discussion of techniques for performing "big data" computations on data stored in DNA.
The good Dr. Pangloss loved Henry Newman's enthusiasm for 5G networking, but I'm a lot more skeptical. It is true that early 5G phones can demo nearly 2Gb/s in very restricted coverage areas in some US cities. But 5G phones are going to be more expensive to buy, more expensive to use, have less battery life, overheat, have less consistent bandwidth and almost non-existent coverage. In return, you get better peak bandwidth, which most people don't use. Customers are already discovering that their existing phone is "good enough". 5G is such a deal!
The reason the carriers are building out 5G networks isn't phones, it is because they see a goldmine in the Internet of Things. But combine 2Gb/s bandwidth with the IoT's notoriously non-existent security, and you have a disaster the carriers simply cannot allow to happen.
The IoT has proliferated for two reasons: the Things are very cheap, and connecting them to the Internet is unregulated, so ISPs cannot impose hassles. But connecting a Thing to the 5G Internet will require a data plan from the carrier, so carriers will be able to impose requirements, and thus costs. Among the requirements will have to be that the Things have UL certification, adequate security and support, including timely software updates for their presumably long connected life. It is precisely the lack of these expensive attributes that has made the IoT so ubiquitous and such a security dumpster-fire!
Two presentations discussed fixity checks. Mark Cooper reported on an effort to validate both the inventory and the checksums of part of LC's digital collection. The conclusion was that the automated parts were reliable, the human parts not so much:
Content on storage is correct, inventory is not
Content custodians working around system limitations, resulting in broken inventory records
Content in the digital storage system needs to be understood as potentially dynamic, in particular for presentation and access
System needs to facilitate required actions in ways that are logged and versioned
The other presentation addressed checking the fixity of content in cloud storage, where there are two options:
Read the data back and hash it, which at scale gets expensive in access and bandwidth charges.
Hash the data in the cloud that stores it, which involves trusting the cloud to actually perform the hash rather than simply remember the hash computed at ingest.
I have yet to see a cloud API that implements the technique published by Mehul Shah et al twelve years ago, allowing the data owner to challenge the cloud provider with a nonce, thus forcing it to compute the hash of the nonce and the data at check time. See also my Auditing The Integrity Of Multiple Replicas.
Blockchain distributed ledger functionality presents a new way to ensure electronic systems provide electronic record authenticity / integrity.
May not help with preservation or long term access and may make these issues more complicated.
It is important to note that what NARA means by "government records" is quite different from what is typically meant by "records", and the legislative framework under which they operate may make applying blockchain technology tricky.
This is always a fascinating meeting. But, please, on the call for participation next year make it clear that anyone using projections for "data generated" in their slides as somehow relevant to "data storage" and archival data storage in particular will be hauled off stage by the hook.
We're gearing up for the second release of Islandora 8, which boasts a bevy of new community requested and developed features:
Full text extraction for Images and PDFs
Technical metadata extraction using FITS
Drupal revisions generate Fedora versions
That's a lot of community contributions! And that means we need a lot of testing for all those new features, too. We'll be freezing the code and holding a two-week testing sprint from January 27th to February 7th to make sure everything is good to go for the next release. So if you're interested in getting your hands on the latest Islandora 8 and don't mind providing us with valuable feedback, please consider signing up to be a tester. You can find the sign-up sheet here. Commitment is minimal. We have a spreadsheet of test cases where you can put your name down for as much or as little as you like. Just run through the test cases you like and let us know if it worked for you or not. Plus, there will be plenty of people around on Slack or the mailing list to help out if you get stuck or have any questions. We hope to see you there!
Towards the end of 2019, we were made aware of two new Samvera-based repositories that you may like to take a look at:
The Southeastern Baptist Theological Seminary has produced an innovative site which can be driven from a scrolling timeline (as well as a conventional search). The SEBTS site can be found here.
The British Library has developed a joint research database in collaboration with a number of partners: National Museums Scotland, Tate, MOLA, British Museum and Royal Botanic Gardens, Kew. The system is built in Samvera Hyku and was developed in collaboration with Ubiquity Press. Each partner has their own repository but there is a single search portal which interrogates all of them. The portal is here, and there’s a very full blog post about the system here.
Samvera Connect 2020 will take place during the week beginning Monday 26th October, hosted by the University of California at Santa Barbara. The week will include a one-day Partner Meeting, and in parallel with that, a Developer Congress. Whilst the dates are fixed, you should not make any assumptions at this stage about what will happen on any given day. It has been pointed out that Halloween is at the end of the week and people with young families may wish to get home for it – the conference organizers are considering how best to facilitate that; this will probably involve changing the order of things from the ‘traditional’ timetable.
The curriculum for the upcoming Islandora and Fedora Camp at Arizona State University, February 24-26, 2020 is now available here.
Islandora and Fedora Camp, hosted by Arizona State University Libraries, offers everyone a chance to dive in and learn all about the latest versions of Islandora and Fedora. Training will begin with the basics and build toward more advanced concepts–no prior Islandora or Fedora experience is required. Participants can expect to come away with a deep-dive learning experience in Islandora and Fedora, coupled with multiple opportunities to apply hands-on techniques while working with experienced trainers from both communities.
The curriculum will be delivered by a knowledgeable team of instructors from the Islandora and Fedora communities, including:
David Wilcox, Fedora Program Leader, LYRASIS
Melissa Anez, Islandora Project and Community Manager, Islandora Foundation
Bethany Seeger, Digital Library Software Developer, Amherst College
Seth Shaw, Application Developer, University of Nevada, Las Vegas
Danny Lamb, Technical Lead, Islandora Foundation
Register today and join us in Arizona! Register by January 13, 2020 to receive a $50 early bird discount using the promo code: FC20EB.