Planet Code4Lib

GitHub Giveth; Wikipedia Taketh Away / Eric Hellman


One of the joys of administering Free-Programming-Books, the second most popular repo on GitHub, has been accepting pull requests (edits) from new contributors, including contributors who have never contributed to an open source project before. I always say thank you. I imagine that these contributors might go on to use what they've learned to contribute to other projects, and perhaps to start their own projects. We have some hoops to jump through: there's a linter run by Travis CI that demands alphabetical order, even for Cyrillic and CJK names, and I'm not entirely sure how those get "alphabetized". But I imagine that new and old contributors get some satisfaction when their contribution gets "merged into master", no matter how much that sounds like yielding to the hierarchy.

Contributing to Wikipedia is a different experience. Wikipedia accepts whatever edits you push to it, unless the topic has been locked down. No one says thank you. It's a rush to see your edit live on the most consulted and trusted site on the internet. But then someone comes and reverts or edits your edit. And instantly the emotional state of a new Wikipedia editor changes from enthusiasm to bitter disappointment and annoyance at the legalistic (and typically white male) Wikipedian.

Psychologists know that rewards are more effective motivators than punishments, so maybe the workflow used on GitHub is kinder than that used on Wikipedia. Vandalism and spam are a difficult problem for truly open systems, and contention is even harder. Wikipedia wastes a lot of energy on contentious issues. The GitHub workflow simplifies the avoidance of contention and vandalism but sacrifices a bit of openness by depending a lot on the humans with merge privileges. There are still problems - every programmer has had the horrible experience of a harsh or petty code review, but at least there are tools that facilitate and document discussion.

The saving grace of the GitHub workflow is that if the maintainers of a repo are mean or incompetent, you can just fork the repo and try to do better. In Wikipedia, controversy gets pushed up a hierarchy of privileged clerics. The Wikipedia clergy does an amazingly good job, considering what they're up against, and their workings are in the open for the most part, but the lowly wiki-parishioner rarely experiences joy when they get involved. In principle, you can fork Wikipedia, but what good would it do you?

The miracle of Wikipedia has taught us a lot; as we struggle to modernize our society's methods of establishing truth, we need to also learn from GitHub.

Update 1/19: It seems this got picked up by Hacker News. The comment by @avian is worth noting. The flip side of my post is that Wikipedia offers immediate gratification, while a poorly administered GitHub repo can let contributions languish forever, resulting in frustration and disappointment. That's something repo admins need to learn from Wikipedia!

Mother Teresa and Margaret Sanger do not mix / District Dispatch

The American Library Association gets hundreds of calls a year from libraries tackling book challenges and other forms of censorship. Heck, we even celebrate with Banned Books Week. Our Office of Intellectual Freedom (OIF) takes these calls and advises librarians on their options.

One library director in the small town of Trumbull in Connecticut called OIF when people objected to a painting on display at the Trumbull Public Library. It was part of a series of works by Robin Morris called the Great Minds Collection. Richard Resnick, a citizen of Trumbull, commissioned the works and gave the collection of 33 artworks to the library to exhibit.

One painting—Onward We March—in the collection depicts several famous women at a rally. Mother Teresa is there, representing the Missionaries of Charity, along with Gloria Steinem, Clara Barton, Susan B. Anthony and others, including Margaret Sanger, the founder of Planned Parenthood.

The citizen complained about the juxtaposition of Mother Teresa and Margaret Sanger. Their argument was that Mother Teresa would never march with the likes of Sanger; it was offensive. The Missionaries of Charity, an organization founded by Mother Teresa, said the painting had to be removed because they held intellectual property rights to the image of Saint Teresa. The Library Board of Trustees stood firm and maintained that the painting should remain. They noted the library’s support of free expression and diversity of opinion, and that the copyright infringement claim seemed dubious. Was it just an excuse for removing the painting?

Enter the attorneys, religious leaders, the ACLU and First Selectman Tim Herbst, who represents the district in the state legislature. Herbst, with political ambitions, struggled with a decision he thought was his to make (the Library Board of Trustees thought it was theirs). Despite the bogus copyright claims, the story about potential liability for the city was a convenient excuse to remove the painting from the library.

Against the decision by the Library Board of Trustees to keep the painting on display, Herbst removed the painting from the exhibit saying: “After learning that the Trumbull Library Board did not have the properly written indemnification for the display of privately owned artwork in the town’s library, and also being alerted to allegations of copyright infringement and unlawful use of Mother Teresa’s image, upon the advice of legal counsel, I can see no other respectful and responsible alternative than to temporarily suspend the display until the proper agreements and legal assurances.”

Less than a week later, the painting was back up after Richard Resnick, against the advice of his attorney, signed a document that he would take responsibility if the library or city was sued.

Herbst announced his decision to replace the painting at a town library meeting. While he was giving his remarks, there was a loud commotion in the library room next door. When people ran to look at what was happening, they saw that a woman had defaced the painting, using a black marker to cross out the face of Margaret Sanger. The woman fled the scene. Police were called and people were questioned, but the culprit was never found. Those at the meeting agreed that, in spite of their differences of opinion, none of them wanted the painting vandalized.

Since then, the library has tried to put the situation behind them. The Great Minds Collection is still being exhibited, alongside the restored Onward We March painting. Robin Morris’s art has gained widespread recognition and popularity; images of her work are now available on cups, posters, shirts and shopping bags.

The post Mother Teresa and Margaret Sanger do not mix appeared first on District Dispatch.

JANICE: a prototype re-implementation of JANE, using the Semantic Scholar Open Research Corpus / Alf Eaton, Alf

JANE

For many years, JANE has provided a free service to users who are looking to find experts on a topic (usually to invite them as peer reviewers) or to identify a suitable journal for manuscript submission.

The source code for JANE has recently been published, and the recommendation process is described in a 2008 paper: essentially the algorithm takes some input text (title and/or abstract), queries a Lucene index of PubMed metadata to find similar papers (with some filters for recency, article type and journal quality), then calculates a score for each author or journal by summing up the relevance scores over the most similar 50 articles.
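
As a minimal sketch of that scoring idea (not JANE's actual implementation, which is in the published source and the 2008 paper), the aggregation step can be written in a few lines of Python, assuming the similarity search has already returned (score, authors, journal) tuples for the top hits:

from collections import defaultdict

def rank_candidates(similar_papers):
    """Sum relevance scores per author and per journal over the top hits.

    `similar_papers` is assumed to be a list of (relevance_score, authors, journal)
    tuples for the 50 most similar articles returned by the similarity query.
    """
    author_scores = defaultdict(float)
    journal_scores = defaultdict(float)
    for score, authors, journal in similar_papers:
        journal_scores[journal] += score
        for author in authors:
            # The author name is the key, so (rarely) namesakes get merged.
            author_scores[author] += score
    by_score = lambda d: sorted(d.items(), key=lambda kv: kv[1], reverse=True)
    return by_score(author_scores), by_score(journal_scores)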

JANE produces a list of the most relevant authors of similar work, and does some extra parsing to extract their published email addresses. As PubMed doesn't disambiguate authors (apart from the relatively recent inclusion of ORCID identifiers), the name is used as the key for each author, so it's possible (but unusual) that two authors with the same name could be combined in the search results.

Semantic Scholar

The latest release of Semantic Scholar's Open Research Corpus contains metadata for just over 20 million journal articles published since 1991, covering computer science and biomedicine. The metadata for each paper includes title, abstract, year of publication, authors, citations (papers that cited this paper) and references (papers that were cited by this paper). Importantly, authors and papers are each given a unique ID.

JANICE

JANICE is a prototype re-implementation of the main features of JANE: taking input text and finding similar authors or journals. It runs a More Like This query with the input text against an Elasticsearch index of the Open Research Corpus data, retrieves the 100 most similar papers (optionally filtered by publication date), then calculates a score for each author or journal by summing up their relevance scores.
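
A rough sketch of that retrieval step in Python, using the requests library against an Elasticsearch index like the one described in the indexing post later in this digest (the field names "title" and "paperAbstract" are assumptions about the corpus schema, and this is not JANICE's actual code):

import requests

SEARCH_URL = "http://localhost:9200/scholar/paper/_search"

def similar_papers(text, size=100):
    # More Like This query with the input text against the indexed corpus.
    query = {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": ["title", "paperAbstract"],
                "like": text,
                "min_term_freq": 1,
                "min_doc_freq": 2,
            }
        },
    }
    hits = requests.post(SEARCH_URL, json=query).json()["hits"]["hits"]
    # Each hit carries an Elasticsearch relevance _score; summing those per author
    # or per journal (as in the JANE sketch above) yields the ranked suggestions.
    return [(hit["_score"], hit["_source"]) for hit in hits]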

The results of this algorithm are promising: using open peer review data from one of PeerJ's published articles, JANICE returned a list of suggested reviewers containing 2 of the 3 actual reviewers within the top 10; the other reviewer was only missing from the list because although they had authored a relevant paper, it happened to not use the same keywords as the input text (using word vectors would help here).

Coko

This prototype was built as part of the development of xpub, a journal platform produced by the Collaborative Knowledge Foundation and partner organisations.

Desire / Ed Summers

I recently reviewed an article draft that some EDGI folks were putting together that examines their work to date. The draft is quite useful if you are interested in how EDGI’s work to archive potentially at risk environmental scientific data fits in with related efforts such as Data Rescue, Data Refuge and Data Together. The article is also quite interesting because it positions their work by thinking of it in terms of an emerging framework for environmental data justice.

Environmental data justice is a relatively new idea that sits at the intersection of environmental justice and critical data studies (note I didn’t link to the Wikipedia entry because it needs quite a bit of improvement IMHO). I think it could be useful for ideas of environmental data justice to also draw on a long strand of thinking about archives as the embodiment-of and a vehicle-for social justice (Punzalan & Caswell, 2016), which goes back some 40 years. I think it could also be useful to think of it in terms of emerging ideas around data activism that are popping up in activities such as the Responsible Data Forum.

At any rate, this post wasn’t actually meant to be about any of that, but just meant to be a note to myself about a reference in the EDGI draft to a piece by Eve Tuck entitled Suspending Damage: A Letter to Communities (Tuck, 2009).

In this open letter, published in the Harvard Educational Review, Tuck calls on researchers to put a moratorium on what she calls damaged centered research:

In damaged-centered research, one of the major activities is to document pain or loss in an individual, community, or tribe. Though connected to deficit models—frameworks that emphasize what a particular student, family, or community is lacking to explain underachievement or failure–damage-centered research is distinct in being more socially and historically situated. It looks to historical exploitation, domination, and colonization to explain contemporary brokenness, such as poverty, poor health, and low literacy. Common sense tells us this is a good thing, but the danger in damage-centered research is that it is a pathologizing approach in which the oppression singularly defines a community. Here’s a more applied definition of damage-centered research: research that operates, even benevolently, from a theory of change that establishes harm or injury in order to achieve reparation.

Instead Tuck wants to re-orient research around a theory of change that documents desire instead of damage:

As I will explore, desire-based research frameworks are concerned with understanding complexity, contradiction, and the self-determination of lived lives … desire-based frameworks defy the lure to serve as “advertisements for power” by documenting not only the painful elements of social realities but also the wisdom and hope. Such an axiology is intent on depathologizing the experiences of dispossessed and disenfranchised communities so that people are seen as more than broken and conquered. This is to say that even when communities are broken and conquered, they are so much more than that, so much more that this incomplete story is an act of aggression.

Tuck points out that she isn’t suggesting that desire-based research should replace damage-centered research, but that instead it is part of an epistemological shift: how knowledge is generated and understood, or how we know what we know. This is a subtle point, but Tuck does a masterful job of providing real examples in this piece, so it’s well worth a read if this sounds at all interesting.

I thought it was also interesting that Tuck draws on the work of Deleuze & Guattari (1987) in developing this idea of desire-based research:

Poststructuralist theorists Gilles Deleuze and Felix Guattari teach us that desire is assembled, crafted over a lifetime through our experiences. For them, this assemblage is the picking up of distinct bits and pieces that, without losing their specificity, become integrated into a dynamic whole. This is what accounts for the multiplicity, complexity, and contradiction of desire, how desire reaches for contrasting realities, even simultaneously. Countering theorists that posit desire as a hole, a gap, or that which is missing (such as, and somewhat famously, Foucault) Deleuze and Guattari insist that desire is not lacking but “involution” (p. 164). Exponentially generative, engaged, engorged, desire is not mere wanting but our informed seeking. Desire is both the part of us that hankers for the desired and at the same time the part that learns to desire. It is closely tied to, or may even be, our wisdom.

It’s interesting to consider how this focus on desire works in the EDGI piece. On the one hand, EDGI is quite focused on the damage that the Trump Administration poses to the environment. Trump and Pruitt’s arrogant dismissal of climate change, and US withdrawal from the Paris Agreement represents a dangerous departure from engagement in pressing world issues. But EDGI are also working on new forms of organizational collaboration, and experimenting with new architectures for data on the web that not only respond to the Trump election, but move us forward into thinking about how to manage this data in a distributed setting. It represents a desire for new modes of sharing and living together.

Finally, this piece by Tuck is of interest to me because of my own participation in the Documenting the Now project. We started out the project with the goal of documenting the damage that occurred when a police officer, Darren Wilson, used his gun to kill teenager Michael Brown in Ferguson, Missouri. We started it to bring archival attention to the Black Lives Matter movement, which played out so significantly in social media, and in the streets of cities and towns across the United States and around the world.

Black Lives Matter’s focus on police violence is, by necessity, about damage. Damage has happened and is happening. Damage has been denied. Damage must be acknowledged, and repaired. But damage is not the whole story for Black Lives Matter. As we progressed in the Documenting the Now project, and met with activists in Ferguson we learned that our specific challenge was to document the actual complex lived experiences of the people involved. Their activism was not a simple and static thing that lends itself to traditional archival representation. The archive represented opportunities to be remembered, but also posed as a risk for being remembered frozen in a particular moment in time.

Most of all, it became apparent that the story of the activism in Ferguson was a story about desire: the desire the activists had for each other, and for a way of living together in a community that celebrates love and respect for differences, and survival. If you are interested in listening, the two meetings with activists are available online. You can also see this theme of desire at work in the Whose Streets documentary about Ferguson.

References

Deleuze, G., & Guattari, F. (1987). A thousand plateaus: Capitalism and schizophrenia. Bloomsbury Publishing.

Punzalan, R. L., & Caswell, M. (2016). Critical directions for archival approaches to social justice. Library Quarterly, 86(1), 25–42.

Tuck, E. (2009). Suspending damage: A letter to communities. Harvard Educational Review, 79(3), 409–428.

Indexing Semantic Scholar's Open Research Corpus in Elasticsearch / Alf Eaton, Alf

Semantic Scholar publishes an Open Research Corpus dataset, which currently contains metadata for around 20 million research papers published since 1991.

  1. Create a DigitalOcean droplet using a "one-click apps" image for Docker on Ubuntu (3GB RAM, $15/month) and attach a 200GB data volume ($20/month).
  2. SSH into the instance and start an Elasticsearch cluster running in Docker.
  3. Create a new index with a single shard: curl -XPUT 'http://localhost:9200/scholar' -H 'Content-Type: application/json' -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0, "codec": "best_compression" } } }'
  4. Install esbulk: VERSION=0.4.8; curl -L https://github.com/miku/esbulk/releases/download/v${VERSION}/esbulk_${VERSION}_amd64.deb -o esbulk.deb && dpkg -i esbulk.deb && rm esbulk.deb
  5. Fetch, unzip and import the Open Research Corpus dataset (inside the zip archive is a license.txt file and a gzipped, newline-delimited JSON file): VERSION=2017-10-30; curl -L https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/${VERSION}/papers-${VERSION}.zip -o papers.zip && unzip papers.zip && rm papers.zip && esbulk -index scholar -type paper -id id -verbose -z < papers-${VERSION}.json.gz && rm papers-${VERSION}.json.gz
  6. While importing, index statistics can be viewed at http://localhost:9200/scholar/_stats?pretty
  7. After indexing, optimise the Elasticsearch index by merging into a single segment: curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
  8. (recommended) Use ufw to prevent external access to the Elasticsearch service and put a web service (e.g. an Express app) in front of it, mapping routes to Elasticsearch queries.
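
Once the import completes, a quick sanity check from Python confirms the index is queryable. A minimal sketch (the index and type names match steps 3 and 5 above; the "title" field name is an assumption about the corpus schema):

import requests

# Simple search against the index built above (adjust host/port if needed).
resp = requests.post(
    "http://localhost:9200/scholar/paper/_search",
    json={"query": {"match": {"title": "semantic scholar"}}, "size": 5},
)
for hit in resp.json()["hits"]["hits"]:
    print(round(hit["_score"], 2), hit["_source"].get("title"))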

January Innovator-in-Residence Update: Experiments with Jer Thorp / Library of Congress: The Signal

We’ve been delighted to have Library of Congress Innovator-in-Residence Jer Thorp with us since October. During the first three months of his residency he has connected with staff, visited collections, and explored forms of data to make better sense of the inner workings of the Library. Jer has been weaving together those threads with experiments and other works in progress.

Turning the Process on its Ear 

Jer has made a record of his activity from the start via interviews with Library staff and from within the Library of Congress main reading room and stacks, while reflecting on what he has encountered. The result is the podcast “Artist in the Archive,” on a roll with two episodes so far. The podcast follows a format that includes detailed discussions with National Digital Initiatives Chief Kate Zwaard; John Hessler, Curator of the Jay I. Kislak Collection of the Archaeology and History of the Early Americas; and Director for Acquisitions and Bibliographic Access Beacher Wiggins. These longer discussions are framed by segments with Library of Congress curators and archivists such as Meg McAleer and Todd Harvey sharing vignettes of unique collections, from Sputnik’s launch to the folk revival in New York City, bringing the perspectives of the past to life. Listen to the first two episodes of “Artist in the Archive” and share your thoughts and questions with Jer. You can also find transcripts for episode one and episode two, as well as finding aids with images of objects described in episodes one and two.

Arranging Appellations

Sometimes the language of the Library can be on the tip of your tongue; other times, you’d need a glossary to define the experience. For example, are you on the hunt for Hapax legomenon? Misplaced your best Volvelle? Learn more about unique and obscure terminology from the library world in this crowdsourced glossary Jer compiled in October. Finding your favorite library term missing? Let Jer know in the comments or on Twitter.

Experiments and Exploring Collections 

In October we shared details of Jer’s Library of Names app here on the Signal. Built with the name authority files from the Library of Congress MARC records, the Library of Names carves out the first names of authors at five-year intervals; exploring with the app allows one to imagine the mix of creators across time.

A polymath, “a person of encyclopedic learning” according to Merriam-Webster, is an individual whose expertise spans diverse subject matter or disciplines. If you’ve listened to episode two of Jer’s podcast, you’ll have learned that the subject matter expertise of creators is captured in the name authority field in MARC records. What can these records tell us about the careers of authors? Armed with this data and a handful of questions, it is possible to probe the edges and overlaps of expertise, such as those of the painter-pianist-composer Ann Wyeth McCoy (sister of artist Andrew Wyeth).

While gathering stories from within collections here at the Library of Congress, Jer has also been making queries within the 25 million MARC records. As a starting exercise, Jer created a network map from approximately 9 million name authority files in the MARC records. Next, he returned to those same people and calculated their movement across the map. He shared this exercise on Twitter, along with the code and reflections on the promise and potential problems of this approach. See this Twitter thread for more details and examples, such as a poet-diplomat-composer (Oswald von Wolkenstein) and a soldier-shoemaker-postmaster-teacher-surveyor-civil engineer-photographer-deacon in one Samuel Chase Hodgman.

 

Network map of listed occupations for creators in Library of Congress MARC records

Polymaths mapped from name authority files from Library of Congress MARC records by Jer Thorp

Since MARC records also offer a lens on migration of creators, can those records tell us more about the centers of thought, publication, and social life? In the work-in-progress BirthyDeathy, Jer has used the locations and dates of birth and death in name authorities files to map the movement of historical figures, politicians, and authors. We’re looking forward to seeing more about what relocation signals over time as people arrived in and departed from cities like Boston, Berlin, Madrid, St. Petersburg, Vienna, Chicago, and San Francisco. 

Digital map outline with many points showing the locations of birth and death in MARC Name Authority metadata

Screencap of Jer Thorp’s BirthyDeathy experiment demonstrating locations of births and deaths by year in name authority files from Library of Congress MARC records

Here’s what Jer had to say about what lies ahead with his work:

“The Library is of course well known for its grand holdings – Jefferson’s draft of the Declaration of Independence, the Waldseemüller map, the Gutenberg Bible. What has been striking to me is to find out how much everyday humanity is documented in the archives: children’s drawings of Sputnik, oral histories of tug boat captains, photographs of farm workers during the Great Depression. I’m eager to continue exploring novel ways to activate these collections.”

You can learn more about Jer’s work on his Library of Congress Labs Innovator-in-Residence page and explore his documentation and code.

MarcEdit 7: The great [Normalization] escape / Terry Reese

working out some thoughts here — this will change as I continue working through some of these issues.

If you follow MarcEdit development, you’ll know that last week I posted a question in a number of venues about the effects of Unicode normalization and its potential impacts for our community.  I’ve been doing a little bit of work in MarcEdit, having a number of discussions with vendors and folks that work with normalizations regularly – and have started to come up with a plan.  But I think there is a teaching opportunity here as well: an opportunity to discuss how we find ourselves having to deal with this particular problem, where the issue is rooted, and the impacts that I see right now in ILS systems and for users of tools like MarcEdit.  This isn’t going to be an exhaustive discussion, but hopefully it helps folks understand a little bit more about what’s going on, and why this needs to be addressed.

Background

So, let’s start at the beginning.  What exactly are Unicode normalizations, and why is this something that we even need to care about….

Unicode Normalizations are, in my opinion, largely an artifact of our (the computing industry’s) transition from a non-Unicode world to Unicode, especially in the way that the extended Latin character sets ended up being supported.

So, let’s talk about character sets and code pages.  Character sets define the language that is utilized to represent a specific set of data.  Within the operating system and programming languages, these character sets are represented as code pages. For example, Windows provides support for the following code pages: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx.    Essentially, code pages are lists of numeric values that tell the computer how to map a  representation of a letter to a specific byte.  So, let’s use a simple example, “A”.  In ASCII and UTF8 (and other) code pages, the A that we read, is actually represented as a byte of data.  This byte is 0x41.  When the browser (or word processor) sees this value, it checks the value against the defined code page, and then provides the appropriate value from the font being utilized.  This is why, in some fonts, some characters will be represented as a “?” or a block.  These represent bytes or byte sequences that may (or may not) be defined within the code page, but are not available in the font.

Prior to Unicode implementations, most languages had their own code pages.  In Windows, the U.S. English code page would default to 1252.  In Europe, if ISO-8859 was utilized, the code page would default to 28591.  In China, the code page could be one of many: maybe “Big-5”, or code page 950, or what is referred to as Simplified Chinese, or code page 936.  The gist here is that prior to the Unicode standard, languages were represented by different values, and the keyboards, fonts, and systems would take the information about a specific code page and interpret the data so that it could be read.  Today, this is why catalogers may still encounter confusion if they get records from Asia, where the vendor or organization makes use of “Big-5” as the encoding.  When they open the data in their catalog (or editor), the data will be jumbled.  This is because MARC doesn’t include information about the record’s code page – rather, it defines values as Unicode, or something else.  So, it is on catalogers and systems to know the character set being utilized, and to use tools to convert the data from a character encoding that they might not be able to use to one that is friendly for their systems.
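
To make the code page discussion concrete, here is a small Python sketch (illustrative only) showing the same characters mapping to different bytes under different code pages, and the jumbled text that results when bytes are read with the wrong code page:

text = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# The same character maps to different bytes under different code pages.
print(text.encode("cp1252"))      # b'\xe9'       (Windows code page 1252)
print(text.encode("iso-8859-1"))  # b'\xe9'       (ISO-8859-1, code page 28591)
print(text.encode("utf-8"))       # b'\xc3\xa9'   (UTF-8, two bytes)

# Reading bytes with the wrong code page produces jumbled text, which is what a
# cataloger sees when, say, a Big-5 file is interpreted as Latin-1.
big5_bytes = "圖書館".encode("big5")   # "library" in Chinese, encoded as Big-5
print(big5_bytes.decode("latin-1"))    # gibberish: the bytes survive, the meaning doesn't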

So, let’s get back to this idea of normalization forms.  My guess is that much of the normalization mess we find ourselves in is related to ISO-8859.  This code page and standard has been widely utilized in European countries, and provides a standard method of representing extended Latinate characters [those between 129-255], though normalizations affect other languages as well.  Essentially, the Unicode specification included ISO-8859 to ease the transition, but also provided new, composed code points for many of the characters.  And normalizations were born.

Unicode Normalizations, very basically, define how characters are represented.  There are 4 primary normalization forms that I think we need to care about in libraries.  These are (https://en.wikipedia.org/wiki/Unicode_equivalence):

  1. NFC – The canonical composition form, which replaces decomposed characters with composed code points.
  2. NFD – The canonical decomposition form, in which data is fully decomposed.
  3. NFKC – A normalization that applies a full compatibility decomposition, followed by the replacement of sequences with their primary composites, where possible.
  4. NFKD – A normalization that applies a full compatibility decomposition.

 

Practically, what does this mean?  Well, it means that a value like é can be represented in multiple ways.  In fact, this is a good example of the problems that differing Unicode normalization forms are causing in the library community.  In the NFC and NFKC forms, the value é is represented by a single code point that represents the letter and its diacritic together.  In the NFD and NFKD forms, the character is represented by code points that correspond to the “e” and the diacritic separately.  This has definite implications, as composed characters make indexing of data with diacritical marks easier, whereas decomposed characters must be composed to index correctly.
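
A quick way to see this difference is with Python's unicodedata module (an illustration only, nothing MarcEdit-specific):

import unicodedata

nfc = unicodedata.normalize("NFC", "é")  # composed: single code point U+00E9
nfd = unicodedata.normalize("NFD", "é")  # decomposed: U+0065 "e" + U+0301 combining acute

print([hex(ord(c)) for c in nfc])  # ['0xe9']
print([hex(ord(c)) for c in nfd])  # ['0x65', '0x301']
print(nfc == nfd)                  # False: same glyph on screen, different byte sequences
print(len(nfc), len(nfd))          # 1 2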

And how does this affect the library community?  Well, we have this made-up character encoding known as MARC8 (https://en.wikipedia.org/wiki/MARC-8).  MARC8 is a library-specific character set (it doesn’t have a code page value, so all rendering is done by applications that understand MARC8) that has no equivalent outside of the library world.  Like many character sets that need to represent wide characters (those with diacritics), MARC8 represented characters with diacritics by utilizing decomposed characters (though this decomposition was MARC8-specific).  For librarians, this matters because the U.S. Library of Congress, when providing instructions on support for Unicode in MARC records, provided for the ability to round-trip data between MARC8 and UTF8 (http://www.loc.gov/marc/specifications/speccharucs.html).  This round-tripability comes at a cost, and that cost is that data, to be in sync with the recommendations, should only be provided in the NFKD notation.  This has implications, however, as current-generation operating systems generally use NFC as the internal representation for string data, and for programmers, who have to navigate challenges within their languages: in most cases, language functions that deal with concepts like in-string searching or regular expressions use settings that make them culturally aware (i.e., they allow searching across data in different normalizations), but the replacement and manipulation of that data is almost always done using ordinal (binary) matching, which means that data in different normalization forms is not compatible.  And quite honestly, this is confusing the hell out of metadata people.  Using our “é” character as an example – a user may be able to open a program or work in a programming language and find this value regardless of the underlying data normalization, but when it comes to making changes, the data will need to match the underlying normalization; otherwise, no changes are actually made.  And if you are a user that is just looking at the data on the screen (without the ability to see the underlying binary data or without knowledge of what normalization is being used), you’d rightly start to wonder why the changes didn’t complete.  This is the legacy that round-trip support for MARC-8 has left us within the library community, and the implications of having data move fluidly between different normalizations are having real consequences today.
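
To make the editing problem concrete: in .NET, a culture-aware search can find text that an ordinal replace then fails to change; Python's string operations are all ordinal, but the underlying mismatch, and the fix, look the same. A small illustrative sketch:

import unicodedata

record = unicodedata.normalize("NFD", "Médecine")  # decomposed, as MARC-8-derived data often is
term = unicodedata.normalize("NFC", "é")           # what most keyboards and editors produce today

print(term in record)             # False: the forms differ at the byte level
print(record.replace(term, "e"))  # unchanged; the "edit" silently does nothing

# The fix: normalize both sides to the same form before comparing or replacing.
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(record).replace(nfc(term), "e"))  # Medecine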

Had We Listened to Gandalf

[Image: cat holding on, captioned “Run you fool”]
(Source: http://quicklol.com/wp-content/uploads/2012/03/run-you-fools-cat-lol.jpg)

The ability to round-trip data from MARC-8 to UTF8 and back seemed like such a good idea at the time.  And the specifications that the U.S. Library of Congress laid out were (and are) easy enough to understand and implement.  But we should have known that it wasn’t going to be that easy, and that in creating this kind of backward compatibility, we were just looking for trouble down the road.

Probably the first indication that this was going to be problematic was the use of the Numeric Character Reference (NCR) form to represent characters that exist outside of the MARC-8 repertoire.  Once UTF8 became allowed and a standard for representation of bibliographic data, the frequency with which MARC-8 records were littered with NCR representations (i.e., &#xXXXX; notation) increased exponentially, as did the number of questions on the MarcEdit list asking for ways to find better substitutions for that data – primarily because most ILS providers never fully adopted support for NCR-encoded data.  Looking back now, what is interesting is that many of the questions related to the substitution of NCR notations can be traced to the use of NFC-normalized data and the rise of “smart” characters generated by our text editing systems.  Looking at the MarcEdit archive, I can find multiple entries from users looking to replace NCR data elements simply because those elements represented composed code points, and were thus incompatible with MARC-8.  So, we probably should have seen this coming… and quite honestly, should have made a break.  Data created in UTF8 will almost always result in some level of data change when being converted back to MARC8… we should probably have just accepted that as a likely outcome, and not worried about the importance of round-tripability.

But… we have, and did, and now we have to find a way to make the data that we have work within the limitations of our systems.  But what are the limitations or consequences when thinking about the normalization form of data?  The data should render the same, right?  Search the same?  Export the same?  The answer is that none of this would matter if the local system standardized the normalization of data as it is added to or exported from the system, but in practice it appears that few (if any) systems do that, so the normalization form of the data can have significant impacts on what the user sees, can discover, or can export.

What the user sees

Probably the most perplexing issues related to the normalization form of data arise in how the data is rendered to the user.  While normalization forms differ at the binary level, the system should be able to accommodate those differences so they aren’t visible to the user.  Throughout this document, I’ve been using different normalized forms of the letter “é”, but if the browser and the operating system are working like they are supposed to, you as the reader shouldn’t be aware of these differences.  But we know that this isn’t always the case.  Here’s one such example:

[Image: the same record as displayed in an ILS before export and after reimport]

The top set of data represents the data seen in an ILS prior to export.  The bottom shows the data once reimported, after the normalization form had shifted from NFC to NFKD.  The interface presented to the user has chosen to render the data as bytes to flag that it contains decomposed characters.  But this is jarring to the user, who shouldn’t have to care.

The above example is actually not as uncommon as you might think.  In experimenting with a variety of ILS systems, I found that changes in normalization form can often have unintended effects for the user… and since it is impossible to know which normalization form is in use without looking at the data at the binary level, how would one know when changes to records will result in significant changes to the user experience?

The short answer is that you can’t.  I started to wonder how OCLC treats Unicode data, and whether, internally, OCLC normalizes the data coming into and out of its system.  And the answer is no – as long as the data is valid, the characters, in whatever normalization, are accepted into the system.  To test this, I made changes to the following record: http://osu.worldcat.org/title/record-builder-added-this-test-record-on-06262013-130714/oclc/850940559.  First, I was interested in whether any normalization was happening when interacting with OCLC’s Metadata API, and second, I was wondering if data brought in with different normalizations would impact searching of the resource.  And the answers to these questions are interesting.  First, I wanted to confirm that OCLC accepted data in any normalization provided (as was relayed to me by OCLC), and indeed that is the case.  OCLC doesn’t do any normalization, as far as I can tell, of data going into the system.  This means that a user could download a master record, make no other change to the record but updating the normalized form, and replace that record.  From the user’s perspective, the change wouldn’t be noticeable – but at the data level, the changes could be profound.  Given the variety of differences in how ILS systems handle data in the different Unicode normalization forms, this likely explains some of the “diacritic display issue” questions that periodically make their way onto the MarcEdit listserv.  Users expect their data to be compatible with their system because the OCLC data they downloaded is in UTF8 and their system supports UTF8.  However, unknown to the cataloger, a system’s reliance on data existing in a specific normalized form may cause issues.

The second question I was interested in, as it related to OCLC, was indexing.  Would a difference in normalization form cause indexing issues?  We know that in some systems it does.  For many European users, I have long recommended using MarcEdit’s normalization options to ensure that data converted to UTF8 uses the NFC normalization – as it enables local systems to index data correctly (i.e., index the letter + diacritic, rather than the letter, then the diacritic, then other data).  I was wondering if OCLC would demonstrate this kind of indexing behavior, but curiously, I found OCLC had trouble indexing any data with diacritical values.  Since I’m sure that isn’t the expected result, I’ve reached out to see exactly what the expectation for the user is.

Indexing implications

As noted above, for years now I’ve recommended that users who run Koha as their ILS configure MarcEdit to use the NFC normalization as the standard output when converting data between MARC-8 and UTF-8.  The reason has been to ensure that data indexes correctly rather than flatly.  But maybe this recommendation should have been made more broadly.  While I didn’t look at every system, one thing many of the systems I did look at have in common is that data normalized as NFKD tends not to be indexed with its diacritical values.  They either normalize all diacritical data away, or they index the data as it appears in the binary – so, for example, a value like évery would index as e + acute accent + very, i.e., the indexed value would begin with a plain “e”; but if the data appeared in NFC notation, it would be indexed with the é (the combined character), allowing users to search using the letter + diacritic.  How does your system index its data?  It’s a question I’m asking today, and I’m wondering how much of an impact normalization form has within the ILS, as well as outside the ILS (as we reuse data in a variety of contexts).  Since each system may make different assumptions and indexing decisions based on the UTF8 data presented, it’s an interesting question to consider.

Export implications

The best case scenario is that a system would export data the same way that it’s represented in the system.  This is what OCLC does – and while it likely helps to exacerbate some of the problems I see upstream with systems that expect specific normalizations, it’s regular and expected.  Is this behavior the rule?  Unfortunately, it is not.  I see many examples where data is altered on export, and often, when the issue is diacritic related, it can be traced to the normalized form of the original data.  Again, the system probably shouldn’t care which form is provided (in a perfect world), but if the system is implementing the MARC specification as written (see the LC guidance above), then developing operations around the expectation of NFKD-formed data would likely lead to complications.  But again, you’d likely never know until you tried to take the data out of the system.

Thinking about this in MarcEdit

So if you’ve stayed with me this long, you may be wondering if there is anything we can do about these problems, short of getting everyone to agree to normalize data the same way (good luck).  In MarcEdit, I’ve been looking at this question in order to address the following problems that I get asked about regularly:

  1. When I try to replace x diacritic, I can find the instances, but when I try to replace, only some (or none) are replaced
  2. When I import my data back into my system, diacritics are decomposed
  3. How can I ensure my records can index diacritics correctly

 

The first two issues come up periodically, and are especially confusing to users because the differences in the data are at a binary level – and so hard to see.  The last issue is one MarcEdit has provided a half answer for: it has always offered a way to set the normalization when converting data to UTF8, but once there, it assumes that the user will provide data in the form that they require (I’m realizing this is a bad assumption).

To address this problem, I’m providing a method in MarcEdit that will allow the user to force UTF8 data into a specific normalization, and will enable the application to support search and replace of data regardless of the normalized form of a character that a user might use.  This will show up in the MarcEdit preferences.  Under the MARCEngine settings, there are options related to data normalization.  These show up as:

[Image: MarcEdit preferences showing the MARCEngine normalization options]

MarcEdit has for some time included support for setting the normalization when compiling data.  But this doesn’t solve the problem when trying to edit, search, etc. records in the MarcEditor or within the other areas of the program.  So, a new option will be available – Enforce Defined Normalization.  This will enable the application to save data in the preferred normalization and also force all user-submitted data through a wrapper that will enable edit operations to be completed, regardless of the normalized form a user may use when searching for data or the underlying normalization form of the individual records.  Internally, MarcEdit will make this process invisible, but the output created will be records that place all UTF8 characters into the specified normalization.  This seems to be a good option, and it’s very unlikely that tomorrow the systems that we use will suddenly all start to use UTF8 data the same way – taking this approach, they don’t have to.  MarcEdit will work as a bridge, taking data in any UTF8 normalization and ensuring that the output meets the criteria specified by the user.
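
Stripped to its essentials, the idea behind the option is a wrapper that pushes record text and all user-supplied search/replace strings through one configured normalization before any edit operation runs. A hedged sketch of that approach in Python (illustrative only; MarcEdit itself is a .NET application and this is not its actual code):

import unicodedata

PREFERRED_FORM = "NFC"  # configurable: NFC, NFD, NFKC or NFKD

def enforce(text, form=PREFERRED_FORM):
    # Coerce any UTF8 string into the preferred normalization form.
    return unicodedata.normalize(form, text)

def replace_in_record(record_text, find, replace_with):
    # Find/replace that behaves the same regardless of the normalization the
    # record or the user-supplied strings happened to arrive in.
    return enforce(record_text).replace(enforce(find), enforce(replace_with))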

Sounds good – I think so.  But it makes me a little nervous as well.  Why?  Because OCLC takes any data provided to it.  In theory, a record could switch normalizations multiple times if users pulled data down, edited it using this option, and uploaded the data back to the database.  Does this matter?  Will it cause unforeseen issues?  I don’t know – I’m asking OCLC.  I also worry that allowing users to specify the normalization form could have cascading issues when it comes to record sharing.  Not everyone uses MarcEdit (nor should they), and it’s hard to know what impact this makes on other coding tools, etc.  This is why this function won’t be enabled by default – it will need to be turned on by the user – as I continue to inquire and have conversations about the larger implications of this work.  The short answer is that this is a pain point, and a problem that needs to be addressed somehow.  I see too many questions and too many records where the normalization form of the data plays a role in providing confusing data to the user, confusing data to the cataloger, or difficulties in reusing or sharing the data with other systems and processes.  At the same time, this feels like a band-aid fix until we reach a point in the evolution of our systems and metadata where we can free ourselves from MARC-8 and begin to think only about our data in UTF8.

Conclusions

So what should folks take away from all this?  Let’s start with the obvious: just because your data is in UTF8 doesn’t mean that it’s the same as my data in UTF8.  Normalization forms, a tool initially used to ease the transition from non-Unicode to Unicode data, can have other implications as well.  The information that I’ve provided is just a sample of the challenges that make their way to me through my work with MarcEdit.  I’m sure other folks have had different experiences… and I’d love to hear them if you want to provide them below.

Best,

–tr

Python/Django warnings / Brown University Library Digital Technologies Projects

I recently updated a Django project from 1.8 to 1.11. In the process, I started turning warnings into errors. The Django docs recommend resolving any deprecation warnings with the current version before upgrading to a new version of Django. In this case, I didn’t start my upgrade work by resolving warnings, but I did run the tests with warnings enabled for part of the process.

Here’s how to enable all warnings when you’re running your tests:

  1. From the CLI
    • use -Werror to raise Exceptions for all warnings
    • use -Wall to print all warnings
  2. In the code
    • import warnings; warnings.filterwarnings('error') – raise Exceptions on all warnings
    • import warnings; warnings.filterwarnings('always') – print all warnings
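
As a concrete sketch of the in-code option above, something like the following near the top of manage.py (or in a test-only settings module) prints all warnings while failing the test run on deprecation warnings; the placement and the exact categories are judgment calls, not the only way to do it:

import warnings

# Print every warning each time it occurs...
warnings.simplefilter("always")

# ...but turn deprecation warnings into errors so the test run fails on them.
# (Django's RemovedInDjangoXXWarning classes subclass one of these two categories.)
warnings.filterwarnings("error", category=DeprecationWarning)
warnings.filterwarnings("error", category=PendingDeprecationWarning)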

If a project runs with no warnings on a Django LTS release, it’ll (generally) run on the next LTS release as well. This is because Django intentionally tries to keep compatibility shims until after an LTS release, so that third-party applications can more easily support multiple LTS releases.

Enabling warnings is nice because you see warnings from python or other packages, so you can address whatever problems they’re warning about, or at least know that they will be an issue in the future.

attachment filename downloads in non-ascii encodings, ruby, s3 / Jonathan Rochkind

You tell the browser to force a download, and pick a filename for the browser to ‘save as’ with a Content-Disposition header that looks something like this:

Content-Disposition: attachment; filename="filename.tiff"

Depending on the browser, it might open up a ‘Save As’ dialog with that being the default, or might just go ahead and save to your filesystem with that name (Chrome, I think).

If you’re having the user download from S3, you can deliver an S3 pre-signed URL that specifies this header — it can be a different filename than the actual S3 key, and even different for different users, for each pre-signed URL generated.

What if the filename you want is not strictly ascii? You might just stick it in there in UTF-8, and it might work just fine with modern browsers — but I was doing it through the S3 content-disposition download, and it was resulting in S3 delivering an XML error message instead of the file, with the message “Header value cannot be represented using ISO-8859-1.response-content-disposition”.

Indeed, my filename in this case happened to have a Φ (Greek phi) in it, and indeed this does not seem to exist as a codepoint in ISO-8859-1 (how do I know? In Ruby, try `"Φ".encode("ISO-8859-1")` and watch it raise an error). ISO-8859-1 is perhaps the (standard? de facto?) default for HTTP headers, as well as what S3 expects. If it was unicode that could be trans-coded to ISO-8859-1, would S3 have done that for me? Not sure.

But what’s the right way to do this?  Googling/Stack-overflowing around, I got different answers, including “There’s no way to do this, HTTP headers have to be ascii (and/or ISO-8859-1)”, “Some modern browsers will be fine if you just deliver UTF-8 and change nothing else” [maybe so, but S3 was not], and a newer form that looks like filename*=UTF-8''#{uri-encoded utf8} [no double quotes allowed, even though they ordinarily are in a content-disposition filename] — but which will break older browsers (maybe just leading to them ignoring the filename rather than actually breaking hard?).

The golden answer appears to be in this stackoverflow answer — you can provide a content-disposition header with both a filename=$ascii_filename (where $ascii_filename is ascii, or maybe can be ISO-8859-1?), followed by a filename*=UTF-8'' sub-header. Modern browsers will use the UTF-8 one, and older browsers will use the ascii one. At this point, are any of these “older browsers” still relevant? Don’t know, but why not do it right.

Here’s how I do it in ruby, taking input and preparing a) a version that is straight ascii, replacing any non-ascii characters with _, and b) a version that is UTF-8, URI-encoded.

# ASCII fallback for older browsers: replace any non-ASCII characters with "_"
ascii_filename = file_name.encode("US-ASCII", undef: :replace, replace: "_")

# Percent-encoded UTF-8 version for the filename* parameter (URI.encode was removed in Ruby 3; ERB::Util.url_encode is one alternative)
utf8_uri_encoded_filename = URI.encode(file_name)

something["Content-Disposition"] = "attachment; filename=\"#{ascii_filename}\"; filename*=UTF-8''#{utf8_uri_encoded_filename}"

Seems to work. S3 doesn’t complain. I admit I haven’t actually tested this on an “older browser” (not sure how old one has to go, IE8?), but it does the right thing (include the “Φ” in the filename) on every modern browser I tested on MacOS, Windows (including IE10 on Windows 7), and Linux.

Web Advertising and the Shark, revisited / David Rosenthal

There's a lot to add to Has Web Advertising Jumped The Shark? (which is a violation of  Betteridge's Law). Follow me below the fold for some of it.



First, I should acknowledge that, as usual, Maciej Cegłowski was ahead of the game. He spotted this more than two years ago and described it in The Advertising Bubble, based on a talk he gave in Sydney. The short version is:
There's an ad bubble. It's gonna blow.
Money flows in ad ecosystem
The longer version is worth reading, but here is a taste:
Right now, all the ad profits flow into the pockets of a few companies like Facebook, Yahoo, and Google. ... You'll notice that the incoming and outgoing arrows in this diagram aren't equal. There's more money being made from advertising than consumers are putting in.

The balance comes out of the pockets of investors, who are all gambling that their pet company or technology will come out a winner. They provide a massive subsidy to the adtech sector. ... The only way to make the arrows balance at this point will be to divert more of each consumer dollar into advertising (raise the ad tax), or persuade people to buy more stuff. ... The problem is not that these companies will fail (may they all die in agony), but that the survivors will take desperate measures to stay alive as the failure spiral tightens. ... The only way I see to avert disaster is to reduce the number of entities in the swamp and find a way back to the status quo ante, preferably through onerous regulation. But nobody will consider this.
What Doc Searls Saw
What Ev Williams Saw
Cegłowski was right that things would get bad. Last December Doc Searls, in After Peak Marketing, reported on the ads he and Ev Williams saw on Facebook when they read this post from one Mark Zuckerberg:
“Of all the content on Facebook, more than 99% of what people see is authentic. Only a very small amount is fake news and hoaxes. The hoaxes that do exist are not limited to one partisan view, or even to politics. Overall, this makes it extremely unlikely hoaxes changed the outcome of this election in one direction or the other.”
Searls points out that, despite Zuckerberg's "99% authentic" claim:
All four ads are flat-out frauds, in up to four ways apiece:
  1. All are lies (Tiger isn’t gone from Golf, Trump isn’t disqualified, Kaepernick is still with the Niners, Tom Brady is still playing), violating Truth in Advertising law.
  2. They were surely not placed by ESPN and CNN. This is fraud.
  3. All four of them violate copyright or trademark laws by using another company’s name or logo. (One falsely uses another’s logo. Three falsely use another company’s Web address.)
  4. All four stories are bait-and-switch scams, which are also illegal. (Both of mine were actually ads for diet supplements.)
Mark Zuckerberg announced changes to Facebook's News Feed to de-prioritize paid content, but Roger McNamee is skeptical of the effect:
Zuckerberg’s announcement on Wednesday that he would be changing the Facebook News Feed to make it promote “meaningful interactions” does little to address the concerns I have with the platform.
So am I. Note that the changes:
will de-prioritize videos, photos, and posts shared by businesses and media outlets, which Zuckerberg dubbed “public content”, in favor of content produced by a user’s friends and family.
They don't address the ads that Searls and Williams saw. But they do have the effect of decreasing traffic to publishers' content:
Publishers, on the other hand, were generally freaked out. Many have spent the past 5 years or so desperately trying to "play the Facebook game." And, for many, it gave them a decent boost in traffic (if not much revenue). But, in the process, they proceeded to lose their direct connection to many readers. People coming to news sites from Facebook don't tend to be loyal readers. They're drive-bys.
And thus divert advertising dollars to Facebook from other sites. The other sites have been hit by another of the FAANGs:
advertising firms are losing hundreds of millions of dollars following the introduction of a new privacy feature from Apple that prevents users from being tracked around the web.

Advertising technology firm Criteo, one of the largest in the industry, says that the Intelligent Tracking Prevention (ITP) feature for Safari, which holds 15% of the global browser market, is likely to cut its 2018 revenue by more than a fifth compared to projections made before ITP was announced.
AdBlock trending
Apple is responding to its customers. Back in 2015 Doc Searls wrote Beyond ad blocking — the biggest boycott in human history:
Ad blocking didn’t happen in a vacuum. It had causes. We start to see those when we look at how interest hockey-sticked in 2012. That was when ad-supported commercial websites, en masse, declined to respect Do Not Track messages from users ... As we see, interest in Do Not Track fell, while interest in ad blocking rose. (As did ad blocking itself.)
As blissex wrote in this comment, we are living:
In an age in which every browser gifts a free-to-use, unlimited-usage, fast VM to every visited web site, and these VMs can boot and run quite responsive 3D games or Linux distributions
This means that, as Brannon Dorsey demonstrated, ad blockers have become an essential way to defend against cryptojacking and botnets:
Anyone can make an account, create an ad with god-knows-what Javascript in it, then pay to have the network serve that ad up to thousands of browsers.

So that's what Dorsey did -- very successfully. Within about three hours, his code (experimental, not malicious, apart from surreptitiously chewing up processing resources) was running on 117,852 web browsers, on 30,234 unique IP addresses. Adtech, it turns out, is a superb vector for injecting malware around the planet.

Some other fun details: Dorsey found that when people loaded his ad, they left the tab open an average of 15 minutes. That gave him huge amounts of compute time -- 327 full days, in fact, for about $15 in ad purchase. To see what such a botnet could do, he created one to run a denial-of-service attack (against his own site, just to see if it worked: It did pretty well). He got another to mine the cryptocurrency Monero, at rates that will be profitable if Monero goes much higher.

The most interesting experiment was in writing an adtech-botnet to store and serve Bittorrent files, via Webtorrent. That worked pretty well too: He got 180,175 browsers to run his torrent file in 24 hours, with a 702 Mbps upload speed for the entire network.
What Google could steal
Brannon Dorsey's post describing his experiments is a must-read. He computes that, for example, Google could limit itself to 10% CPU utilization and still have about 3 million cores for free, continuously. He concludes:
please, please, please BLOCK ADS. If you’ve somehow made it all the way to 2018 without using an ad blocker, 1) wtf… and 2) start today. In all seriousness, I don’t mean to be patronizing. An ad blocker is a necessary tool to preserve your privacy and security on the web and there is no shame in using one. Advertising networks have overstepped their bounds and its time to show them that we won’t stand for it.
If that isn't shark-jumped, I don't know what is.

Evergreen 3.0.3 and 2.12.9 released / Evergreen ILS

The Evergreen community is pleased to announce two maintenance releases of Evergreen 3.0.3 and 2.12.9.

Evergreen 3.0.3 has the following changes improving on Evergreen 3.0.2:

  • Fixes several issues related to the display of located URIs and records with bib sources in search results.
  • Setting opac_visible to false for a copy location group now hides only the location group itself, rather than also hiding every single copy in the group.
  • Fixes a bug that prevented the copy editor from displaying the fine level and loan duration fields.
  • The “Edit Items” grid action in the Item Status interface will now open in the combined volume/copy editor in batch. This makes the behavior consistent with the “Edit Selected Items” grid action in the copy buckets interface.
  • Staff members are now required to choose a billing type when creating a bill on a user account.
  • The Web client now provides staff users with an alert and option to override when an item with the Lost and Paid status is checked in.
  • Fixes a bug where the Web client offline circ interface was not able to set its working location.
  • Fixes an issue that prevented the ADMIN_COPY_TAG permission from being granted.
  • The MARC editor in the Web staff client now presents bib sources in alphabetical order.
  • Both circulation and grocery bills are now printed when a staff user selects a patron account and clicks “Print Bills”.
  • Fixes an issue in the XUL serials interface that prevented the “Receive move/selected” action from succeeding.
  • Fixes a typo in the user password testing interface.

Please note that the upgrade script for 3.0.3 contains a post-transaction command to forcibly update the visibility attributes of all bibs that make use of Located URIs or bib sources. This script may take a while to run on large datasets. If it is running too long, it can be canceled, and administrators can use a psql command detailed in the Release Notes to perform the same action serially over time without blocking writes to bibs.

Evergreen 2.12.9 has a fix that installs NodeJs from source, allowing the web staff client to build without failure.

Please visit the Evergreen downloads page to download the upgraded software and to read full release notes. Many thanks to everyone who contributed to the releases!

Publication: A Field Guide to “Fake News” and Other Information Disorders / Open Knowledge Foundation

This blog has been reposted from http://jonathangray.org/2018/01/08/field-guide-to-fake-news/

Last week saw the launch of A Field Guide to “Fake News” and Other Information Disorders, a new free and open access resource to help students, journalists and researchers investigate misleading content, memes, trolling and other phenomena associated with recent debates around “fake news”.

The field guide responds to an increasing demand for understanding the interplay between digital platforms, misleading information, propaganda and viral content practices, and their influence on politics and public life in democratic societies.

It contains methods and recipes for tracing trolling practices, the publics and modes of circulation of viral news and memes online, and the commercial underpinnings of this content. The guide aims to be an accessible learning resource for digitally-savvy students, journalists and researchers interested in this topic.

The guide is the first project of the Public Data Lab, a new interdisciplinary network to facilitate research, public engagement and debate around the future of the data society – which includes researchers from several universities in Europe, including King’s College London, Sciences Po Paris, Aalborg University in Copenhagen, Politecnico of Milano, INRIA, École Normale Supérieure of Lyon and the University of Amsterdam. It has been undertaken in collaboration with First Draft, an initiative dedicated to improving skills and standards in the reporting and sharing of information that emerges online, which is now based at the Shorenstein Center on Media, Politics, and Public Policy at the John F. Kennedy School of Government at Harvard University.

Claire Wardle, who leads First Draft, comments on the release: “We are so excited to support this project as it provides journalists and students with concrete computational skills to investigate and map these networks of fabricated sites and accounts. Few people fully recognize that in order to understand the online disinformation ecosystem, we need to develop these computational mechanisms for monitoring this type of manipulation online. This project provides these skills and techniques in a wonderfully accessible way.”

A number of universities and media organisations have been testing, using and exploring a first sample of the guide which was released in April 2017. Earlier in the year, BuzzFeed News drew on several of the methods and datasets in the guide in order to investigate the advertising trackers used on “fake news” websites.

The guide is freely available on the project website at fakenews.publicdatalab.org (direct PDF link here), as well as on Zenodo at doi.org/10.5281/zenodo.1136271. It is released under a Creative Commons Attribution license to encourage readers to freely copy, translate, redistribute and reuse the book. A translation is underway into Japanese. All the assets necessary to translate and publish the guide in other languages are available on the Public Data Lab’s GitHub page. Further details about contributing researchers, institutions and collaborators are available on the website.

The project is being launched at the Digital Methods Winter School 2018 organised by the Digital Methods Initiative at the University of Amsterdam, a year after we first started working on the project at the Winter School 2017. We are also in discussion with Sage about a book drawing on this project.

Formatting a LaCie external drive for Time Machine / Alf Eaton, Alf

  1. Plug in the drive and open Disk Utility.
  2. If only the 256MB setup volume is visible rather than the whole 2TB drive, select View > Show All Devices.
  3. Select the 2TB device and press "Erase".
  4. Choose a name, select "Mac OS Extended (Journaled)" (Time Machine doesn’t support APFS as it needs hard links to directories) and "GUID Partition Map" (Time Machine prefers GUID Partition Map, MBR is for Windows, Apple Partition Map is for old PowerPC Macs), then press "Erase". A command-line sketch of these choices follows after this list.
  5. When Time Machine pops up, check "Encrypt backups" and accept the dialog.
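
If you prefer to script these steps, here is a minimal sketch using Python's subprocess module to drive diskutil and tmutil. It is an illustration rather than a tested recipe: the device identifier and volume name below are assumptions, so confirm them with diskutil list before running anything, since eraseDisk wipes the whole device.

    # Sketch only: erase the external drive and point Time Machine at it.
    # DEVICE and VOLUME_NAME are assumptions -- check `diskutil list` first.
    import subprocess

    DEVICE = "/dev/disk2"        # assumed identifier of the 2TB LaCie drive
    VOLUME_NAME = "TimeMachine"  # whatever name you chose in step 4

    # "JHFS+" is Mac OS Extended (Journaled); "GPT" is GUID Partition Map,
    # matching the choices made in Disk Utility above.
    subprocess.run(
        ["diskutil", "eraseDisk", "JHFS+", VOLUME_NAME, "GPT", DEVICE],
        check=True,
    )

    # Designate the freshly erased volume as the Time Machine destination
    # (needs administrator rights).
    subprocess.run(
        ["sudo", "tmutil", "setdestination", f"/Volumes/{VOLUME_NAME}"],
        check=True,
    )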

DuraSpace Board of Directors Changes Leadership / DuraSpace News

As we begin a new year, DuraSpace welcomes new leadership to our Board of Directors. The Board helps set DuraSpace’s priorities to ensure that our digital heritage is widely discoverable and accessible over the long term with a community-based open source technology portfolio.

Telling VIVO Stories at The Marine Biological Laboratory Woods Hole Oceanographic Institution (MBLWHOI) with John Furfey / DuraSpace News

VIVO is member-supported, open source software and ontology for representing scholarship.

“Telling VIVO Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing VIVO implementation details for the VIVO community and beyond. The following interview includes personal observations that may not represent the opinions and views of the Marine Biological Laboratory / Woods Hole Oceanographic Institution Library (MBLWHOI) or the VIVO Project.

NEW: The Realities of Research Data Management: Part Three Now Available! / HangingTogether

A new year heralds a new RDM report! Check out Incentives for Building University RDM Services, the third report in OCLC Research’s four-part series exploring the realities of research data management. Our new report explores the range of incentives catalyzing university deployment of RDM services. Our findings in brief: RDM is not a fad, but instead a rational response by universities to powerful incentives originating from both internal and external sources.

The Realities of Research Data Management, an OCLC Research project, explores the context and choices research universities face in building or acquiring RDM capacity. Findings are derived from detailed case studies of four research universities: University of Edinburgh, University of Illinois at Urbana-Champaign, Monash University, and Wageningen University and Research. Previous reports examined the RDM service space, and the scope of the RDM services deployed by our case study partners. Our final report will address sourcing and scaling choices in acquiring RDM capacity.

Incentives for Building University RDM Services continues the report series by examining the factors which motivated our four case study universities to supply RDM services and infrastructure to their affiliated researchers. We identify four categories of incentives of particular importance to RDM decision-making: compliance with external data mandates; evolving scholarly norms around data management; institutional strategies related to researcher support; and researcher demand for data management support. Our case studies suggest that the mix of incentives motivating universities to act in regard to RDM differs from university to university. Incentives, ultimately, are local.

RDM is both an opportunity and a challenge for many research universities. Moving beyond the recognition of RDM’s importance requires facing the realities of research data management. Each institution must shape its local RDM service offering by navigating several key inflection points: deciding to act, deciding what to do, and deciding how to do it. Our Realities of RDM report series examines these decisions in the context of the choices made by the case study partners.

Visit the Realities of Research Data Management website to access all the reports, as well as other project outputs.

 

 

Jobs in Information Technology: January 17, 2018 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

City of El Segundo, Library Services Director, El Segundo, CA

New York University, Division of Libraries, Metadata Librarian, New York, NY

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Digital Scholarship Resource Guide: So now you have digital data… (part 3 of 7) / Library of Congress: The Signal

This is part three of our Digital Scholarship Research Guide created by Samantha Herron. See parts one about digital scholarship projects and two about how to create digital documents.

So now you have digital data…

Great! But what to do?

Regardless of what your data are (sometimes it’s just pictures and documents and notes, sometimes it’s numbers and metadata), storage, organization, and management can get complicated.

Here is an excellent resource list from the CUNY Digital Humanities Resource Guide that covers cloud storage, password management, note storage, calendar/contacts, task/to-do lists, citation/reference management, document annotation, backup, conferencing & recording, screencasts, posts, etc.

From the above, I will highlight:

  • Cloud-based secure file storage and sharing services like Google Drive and Dropbox. Both services offer some storage space free, but increased storage costs a monthly fee. With Dropbox, users can save a file to a folder on their computer, and access it on their phone or online. Dropbox folders can be collaborative, shared and synced. Google Drive is a web-based service, available to anyone with a Google account; any file can be uploaded, stored, and shared with others through Drive. Drive will also store Google Documents and Sheets that can be written in browser, and collaborated on in real time.
  • Zotero, a citation management service. Zotero allows users to create and organize citations using collections and tags. Zotero can sense bibliographic information in the web browser, and add it to a library with the click of a button. It can generate citations, footnotes, endnotes, and in-text citations in any style, and can integrate with Microsoft Word.

If you have a dataset:

Here are some online courses from School for Data about how to extract, clean, and explore data.

OpenRefine is a popular tool for working with and organizing data. It’s like a very fancy Excel sheet.

It looks like this:

Screenshot of the OpenRefine tool.

Here is an introduction to OpenRefine from Owen Stephens on behalf of the British Library, 2014. Programming Historian also has a tutorial for cleaning data with OpenRefine.
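
OpenRefine itself is point-and-click, but for readers who would rather stay in code, here is a rough pandas sketch of the same kind of clean-up, trimming whitespace and collapsing variant spellings. This is my own illustration rather than anything the guide prescribes, and the file and column names are placeholders.

    # Illustrative only: OpenRefine-style clean-up with pandas.
    # "records.csv" and the "publisher" column are placeholder names.
    import pandas as pd

    df = pd.read_csv("records.csv")

    # Trim stray whitespace and normalise capitalisation.
    df["publisher"] = df["publisher"].str.strip().str.title()

    # Cluster-like fix: map known variant spellings onto one canonical form.
    variants = {"Lib. Of Congress": "Library of Congress",
                "Libr. Of Congress": "Library of Congress"}
    df["publisher"] = df["publisher"].replace(variants)

    df.to_csv("records_clean.csv", index=False)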

Some computer-y basics

A sophisticated text editor is good to have. Unlike a word processor like Microsoft Word, text editors are used to edit plaintext–text without other formatting like font, size, page breaks, etc. Text editors are important for writing code and manipulating text. Your computer probably has one preloaded (e.g. Notepad on Windows computers), but there are more robust ones that can be downloaded for free, like Notepad++ for Windows, TextWrangler for Mac OS X, or Atom for either.

The command line is a way of interacting with a computer program through text instructions (commands) instead of point-and-click GUIs (graphical user interfaces). For example, instead of clicking on your Documents folder and scrolling through to find a file, you can type text commands into a command prompt to do the same thing. Knowing the basics of the command line helps to understand how a computer thinks, and can be a good introduction to code-ish things for those who have little experience. This Command Line Crash Course from Learn Python the Hard Way gives a quick tutorial on how to use the command line to move through your computer’s file structure.

Codecademy has free, interactive lessons in many different coding languages.

Python seems to be the code language of choice for digital scholars (and a lot of other people). It’s intuitive to learn and can be used to build a variety of programs.
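
As a small illustration of why Python is so popular for this kind of work, the short script below counts the ten most frequent words in a plain-text file in about a dozen lines. The filename is a placeholder; any text file you have handy will do.

    # Count the ten most common words in a plain-text file.
    # "my_document.txt" is a placeholder filename.
    import re
    from collections import Counter

    with open("my_document.txt", encoding="utf-8") as f:
        text = f.read().lower()

    words = re.findall(r"[a-z']+", text)
    for word, count in Counter(words).most_common(10):
        print(f"{count:6d}  {word}")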

Screenshot of a command line interface.

Next week we will dive into Text Analysis. See you then!

#LITAchat – LITA at ALA Midwinter 2018 / LITA

Attending the 2018 ALA Midwinter conference? Curious about what LITA is up to?

Join us on Friday, January 26, 1:00-2:00pm EST on Twitter to discuss and ask questions about the LITA events, activities, and more happening at this year’s 2018 ALA Midwinter Meeting in Denver, CO, February 9-13.

To participate, launch your favorite Twitter mobile app or web browser, search for the #LITAchat hashtag, and select “Latest” to follow along and reply to questions asked by the moderator or other participants. When replying to discussion or asking questions, add or incorporate the hashtags #alamw18 and #litachat.

See you there!

Bridging the Spectrum symposium at CUA/LIS highlights public policy directions in Washington / District Dispatch

On Friday, February 2, The Catholic University of America (CUA) Library and Information Sciences Department will host its Tenth Annual Bridging the Spectrum: A Symposium on Scholarship and Practice in Library and Information Science (LIS). A one-day event, Bridging the Spectrum provides attendees with a knowledge-sharing forum and meeting place for practitioners, students, and faculty in Library and Information Sciences and Services to share work and to foster unexpected connections across the spectrum of the information professions.

Dr. Alan Inouye will be the keynote speaker at CUA’s 10th annual Bridging the Spectrum symposium on February 2, 2018.

The keynote address this year will be given by American Library Association Washington Office Director Dr. Alan Inouye. In Making Sense of the Headlines: Advancing Public Policy for the LIS Community, Dr. Inouye looks at the interplay between forming national policy on LIS issues such as net neutrality, federal funding for libraries and education policy, and larger trends in government, technology, commerce and society, asking, “What is the more fundamental change taking place? What is really happening beneath the surface and over time—policy-wise? And how can the library and information science community best influence policy and move our interests higher on the political agenda—or at least defend ourselves as much as possible?”

This year, Bridging the Spectrum continues this tradition with a varied program that covers and discusses a diverse set of trends and challenges faced within the LIS fields. Both the morning and afternoon sessions feature presentations and speakers focusing on topics from the impact of digitization and establishing credible news sources, to conducting outreach to minority groups and reinventing programming for the digital natives of the Millennial Generation and Generation Z. Beyond this, the symposium also features a poster lightning round, with posters discussing emerging trends and pedagogy in archival and librarian services.

“Since 2009, Catholic University of America has been proud to have established a community of learning and knowledge-sharing through our annual Bridging the Spectrum: Symposium on Scholarship and Practice,” says Dr. Renate Chancellor, Associate Professor and University Pre-Law Advisor. Chancellor, who serves on the Symposium Committee, went on to say that the impetus for the symposium was to create an opportunity for practitioners, students and faculty in LIS to come together to showcase the wide range of research taking place throughout the DC/VA/MD region. “It’s exciting to know that we are celebrating our 10th anniversary and all of the wonderful speakers, panels, and poster sessions we have seen over the years,” says Chancellor, “and to know that we have been instrumental in fostering a forum for dialogue on the important issues relevant to the LIS community.”

Bridging the Spectrum: A Symposium on Scholarship and Practice in Library and Information Science is open to the public and will be held in the Great Room of the Pryzbala Student Center on CUA’s campus. For more information about how to register to attend, please visit http://lis.cua.edu/symposium/2018/. We look forward to seeing you there!

This guest post was contributed by Babak Zarin, an LIS candidate at CUA and research assistant for Dr. Renate Chancellor.

The post Bridging the Spectrum symposium at CUA/LIS highlights public policy directions in Washington appeared first on District Dispatch.

Educators ask for a better copyright / Open Knowledge Foundation

This blog has been reposted from the Open Education Working Group page.

 

Today we, the Open Education Working Group, publish a joint letter initiated by Communia Association for the Public Domain that urgently requests improvements to the education exception in the proposal for a Directive on Copyright in the Digital Single Market (DSM Directive). The letter is supported by 35 organisations representing schools, libraries and non-formal education, as well as individual educators and information specialists.

 

In September 2016 the European Commission published its proposal for a DSM Directive that included an education exception aimed at improving the legal landscape. The technological age has created new possibilities for educational practices, and we need copyright law that enables teachers to provide the best education they are capable of and that fits the needs of teachers in the 21st century. The Directive is an opportunity to improve copyright.

However, the proposal does not live up to the needs of education. In the letter we explain the changes needed to facilitate the use of copyrighted works in support of education. Education communities need an exception that covers all relevant providers, and which permits a diversity of educational uses of copyrighted content. We listed four main problems with the Commission’s proposal:

#1:  A limited exception instead of a mandatory one

The European Commission proposed a mandatory exception, but one that can be overridden by licenses. As a consequence, the educational exception will still be different in each Member State. Moreover, educators will need help from a lawyer to understand what they are allowed to do.

#2 Remuneration should not be mandatory

Currently most Member States have exceptions for educational purposes that are completely or largely unremunerated. Mandatory payments will change the situation for those educators (or their institutions), who will have to start paying for materials they are now using for free.

#3: Excluding experts

The European Commission’s proposal does not include all important providers of education, as only formal educational establishments are covered by the exception. We note that the European lifelong-learning model underlines the value of informal and non-formal education conducted in the workplace. All of these are excluded from the education exception.

#4: Closed-door policy

The European Commission’s proposal limits digital uses to secure institutional networks and to the premises of an educational establishment. As a consequence, educators will not be able to develop and conduct educational activities in other facilities such as libraries and museums, nor use modern means of communication such as email and the cloud.

To endorse the letter, send an email to education@communia-associations.org. If you want to receive updates on developments around copyright and education, sign up for Communia’s newsletter, Copyright Untangled.

You can read the full letter in this blog on the Open Education website or download the PDF.

Registration Open for Fedora Camp at NASA / DuraSpace News

Fedora is the robust, modular, open source repository platform for the management and dissemination of digital content. Fedora 4, the latest production version of Fedora, features vast improvements in scalability, linked data capabilities, research data support, modularity, ease of use and more. Fedora Camp offers everyone a chance to dive in and learn all about Fedora.
 
The Fedora team will offer a Camp from Wednesday May 16 - Friday May 18, 2018 at the NASA Goddard Space Flight Center  in Greenbelt, Maryland outside of Washington, D.C.

From Code to Colors: Working with the loc.gov JSON API / Library of Congress: The Signal

The following is a guest post by Laura Wrubel, software development librarian with George Washington University Libraries, who has joined the Library of Congress Labs team during her research leave.

The Library of Congress website has an API (“application programming interface”) which delivers the content for each web page. What’s kind of exciting is that in addition to providing HTML for the website, all of that data–including the digitized collections–is available publicly in JSON format, a structured format that you can parse with code or transform into other formats. With an API, you can do things like:

  • build a dataset for analysis, visualization, or mapping
  • dynamically include content from a website in your own website
  • query for data to feed a Twitter bot

This opens up the possibility for a person to write code that sends queries to the API in the form of URLs or “requests,” just like your browser makes. The API returns a “response” in the form of structured data, which a person can parse with code. Of course, if there were already a dataset available to download that would be ideal. David Brunton explains how bulk data is particularly useful in his talk “Using Data from Historical Newspapers.” Check out LC for Robots for a growing list of bulk data currently available for download.
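
As a rough illustration of what such a request looks like, here is a minimal Python sketch using the requests library. It assumes the fo=json parameter and the "results" key described in the documentation mentioned below; since the API is unofficial and evolving, treat the exact parameters and field names as subject to change.

    # Minimal sketch: ask loc.gov for JSON and print a few item titles.
    # The collection URL and response fields may change over time.
    import requests

    url = "https://www.loc.gov/collections/baseball-cards/"
    response = requests.get(url, params={"fo": "json"})  # fo=json asks for JSON, not HTML
    response.raise_for_status()
    data = response.json()

    # Collection and search responses include a "results" list of items.
    for item in data.get("results", [])[:5]:
        print(item.get("title"), "-", item.get("id"))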

I’ve spent some of my time while on research leave creating documentation for the loc.gov JSON API. It’s worth keeping in mind that the loc.gov JSON API is a work in progress and subject to change. But even though it’s unofficial, it can be a useful access point for researchers. I had a few aims in this documentation project: make more people aware of the API and the data available from it, remove some of the barriers to using it by providing examples of queries and code, and demonstrate some ways to use it for analysis. I approached this task keeping in mind a talk I heard at PyCon 2017, Daniele Procida’s “How documentation works, and how to make it work for your project” (also available as a blog post), which classifies documentation into four categories: reference, tutorials, how-to, and explanation. This framing can be useful in making sure your documentation is best achieving its purpose. The loc.gov JSON API documentation is reference documentation, and points to Jupyter notebooks for Python tutorials and how-to code. If you have ideas about what additional “how-to” guides and tutorials would be useful, I’d be interested to hear them!

At the same time that I was digging into the API, I was working on some Jupyter notebooks with Python code for creating image datasets, for both internal and public use. I became intrigued by the possibilities of programmatic access to thumbnail images from the Library’s digitized collections. I’ve had color on my mind as an entry point to collections since I saw Chad Nelson’s DPLA Color Browse project at DPLAfest in 2015.

So as an experiment, I created Library of Congress Colors.


View of colors derived from the Library of Congress Baseball Cards digital collection

The app displays six color swatches, based on cluster analysis, from each of the images in selected collections. Most of the collections have thousands of images, so it’s striking to see the patterns that emerge as you scroll through the color swatches (see Baseball Cards, for example). It also reveals how characteristics of the images can affect programmatic analysis. For example, many of the digitized images in the Cartoons and Drawings collection include a color target, which was a standard practice when creating color transparencies. Those transparencies were later scanned for display online. While useful for assessing color accuracy, the presence of the target interferes with color analysis of the cartoon, so you’ll see colors from that target pop up in the color swatches for images in that collection. Similarly, mattes, frames, and other borders in the image can skew the analysis. As an example, click through the color bar below to see the colors in the original cartoon by F. Fallon in the Prints and Photographs Division.


A color swatch impacted by the presence of the color bar photographed near the cartoon in the Prints and Photographs collection
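
The cluster-analysis step can be sketched in a few lines of Python. The snippet below is an illustrative reimplementation using Pillow and scikit-learn's k-means, not the project's actual code, and the thumbnail URL is a placeholder.

    # Illustrative sketch: derive six dominant colors from one image
    # with k-means clustering. The URL below is a placeholder.
    from io import BytesIO

    import numpy as np
    import requests
    from PIL import Image
    from sklearn.cluster import KMeans

    url = "https://example.org/thumbnail.jpg"  # placeholder thumbnail URL
    img = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
    img.thumbnail((100, 100))  # shrink the image so clustering stays fast

    pixels = np.array(img).reshape(-1, 3)
    kmeans = KMeans(n_clusters=6, n_init=10).fit(pixels)

    # Each cluster center is one swatch; print them as hex colors.
    for r, g, b in kmeans.cluster_centers_.round().astype(int).tolist():
        print(f"#{r:02x}{g:02x}{b:02x}")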

This project was a fun way to visualize the collection while testing the API, and I’ve benefited from working with the National Digital Initiatives team as I developed the project. They and their colleagues have been a source of ideas for how to improve the visualization, connected me with people who understand the image formats, and provided LC Labs Amazon Web Services storage for making the underlying data sets downloadable by others. We’ve speculated about the patterns that emerge in the colors and have dozens more questions about the collections from exploring the results.


View of colors derived from the Library of Congress Works Progress Administration (WPA) poster digital collection

There’s something about color that is delightful and inspiring. Since I’ve put the app out there, I’ve heard ideas from people about using the colors to inspire embroidery, select paint colors, or think about color in design languages. I’ve also heard from people excited to see Python used to explore library collections and view an example of using a public API. I, myself, am curious to see what people may find as they explore Library of Congress collections as data and use the loc.gov JSON API or one of the many other APIs to create their own data sets. What could LC Labs do to help with this? What would you like to see?

UPDATE: 50 Senators support CRA to restore Net Neutrality / District Dispatch

Senate legislation to restore 2015’s strong, enforceable net neutrality rules now has the bipartisan support from 50 of 100 senators and would be assured of passage if just one more Republican backs the effort. The bill is a Congressional Review Act (CRA) resolution from Sen. Ed Markey (D-MA), which would block the Federal Communications Commission’s (FCC) December repeal of net neutrality rules.

The measure is backed by all 49 members of the Senate Democratic caucus, including 47 Democrats and two independents who caucus with Democrats. Sen. Susan Collins (R-ME) is the only Republican to support the bill so far, and supporters are trying to secure one more Republican vote. A successful CRA vote, in this case, would invalidate the FCC’s net neutrality repeal and prevent the FCC from issuing a similar repeal in the future. But the Senate action needs a counterpart in the House, and this Congressional action would be subject to Presidential approval.

ALA is working with allies to encourage Congress to overturn the FCC’s egregious action. Email your members of Congress today and ask them to use a Joint Resolution of Disapproval under the CRA to repeal the December 2017 FCC action and restore the 2015 Open Internet Order protections.

We will continue to update you on the activities above and other developments as we continue to work to preserve a neutral internet.

The post UPDATE: 50 Senators support CRA to restore Net Neutrality appeared first on District Dispatch.

Not Really Decentralized After All / David Rosenthal

Here are two more examples of the phenomenon that I've been writing about ever since Economies of Scale in Peer-to-Peer Networks more than three years ago, centralized systems built on decentralized infrastructure in ways that nullify the advantages of decentralization:

A lookback on 2017 with OK Brazil / Open Knowledge Foundation

This blog has been written by Natalia Mazotte and Ariel Kogan, co-directors of Open Knowledge Brazil (OKBR). It has been translated from the original version at https://br.okfn.org/2017/12/29/como-foi-o-ano-de-2017-para-a-okbr by Juliana Watanabe, volunteer of OKBR.

For us at Open Knowledge Brazil (OKBR), the year 2017 was filled with partnerships, support for and participation in events, and projects and campaigns for mobilisation. In this blog we have selected some of the highlights. There was also news for the team: the journalist Natália Mazotte, who was already leading Escola de Dados (School of Data) in Brazil, became co-director alongside Ariel Kogan (executive director since July 2016).

Photo: Engin_Akyurt / Creative Commons CC0

Mobilisation

At the beginning of the year, OKBR and several other organizations introduced the Manifesto for Digital Identification in Brazil. The purpose of the Manifesto is to give society a tool to take a stand on the privacy and security of citizens' personal data and to make digital identification a safe, fair and transparent process.

We monitored one of the main challenges facing the city of São Paulo and contributed to the mobilisation around it. Along with other civil society organisations, we urged the City Hall of São Paulo to be transparent about mobility data. The reason: on 25 January 2017, the first day of the new increase in speed limits on the Marginais Pinheiros and Tietê, we noticed that several news items about the decrease in traffic accidents linked to the policy of reducing speeds in certain parts of the city were unavailable on the site of the Traffic Engineering Company (CET).

For a few months, we conducted a series of webinars, the OKBR Webinar Series, about open knowledge around the world. We had the participation of the following experts: Bart Van Leeuwen, entrepreneur; Paola Villareal, Fellow at the Berkman Klein Center and designer/data scientist; Fernanda Campagnucci, journalist and public policy analyst; and Rufus Pollock, founder of Open Knowledge International.

We took part in a major victory for society! Along with the Movimento pela Transparência Partidária (Movement for Partisan Transparency), we mobilised against the proposal by the rapporteur of the political reform, congressman Vicente Cândido (PT-SP), to allow hidden campaign contributions, and the result was very positive. Besides us, a variety of organisations and movements took part in this initiative against hidden donations: we published and handed out a public statement. The impact was huge: as a consequence, the rapporteur announced the withdrawal of the secret donations proposal.

We also participated in #NãoValeTudo, a collective effort to discuss the correct use of technology for electoral purposes, along with AppCívico, Instituto Update and Instituto Tecnologia e Equidade.

Projects

We ran two cycles of OpenSpending. The first cycle began in January and involved 150 municipalities; in July, we published the report on cycle 1. In August, we started the second cycle of the game with something new: Guaxi, a robot that served as a digital assistant to competitors. It is an expert bot built with chatbot technology that simulates human interaction with users, which made navigating the OpenSpending page on Facebook easier. The report on the second cycle is available here.

Together with the public policy analysis unit at FGV (FGV/DAPP), we released the Brazilian edition of the Open Data Index (ODI). In total, we built three surveys: ODI Brazil at the national level, and ODI São Paulo and ODI Rio de Janeiro at the municipal level. Months later, we closed the survey “Do you want to build the Open Data Index for your city?” and the result was very positive: 216 people expressed interest in carrying out the survey voluntarily in their towns!

In this first cycle of decentralising and expanding the ODI to Brazilian municipalities, we conducted an experiment with a first group: Arapiraca/AL, Belo Horizonte/MG, Bonfim/RR, Brasília/DF, Natal/RN, Porto Alegre/RS, Salvador/BA, Teresina/PI, Uberlândia/MG and Vitória/ES. We offered training for the local leaders, provided by the Open Data Index team (FGV/DAPP – OKBR), so that they could carry out the survey required to build the index. In 2018, we’ll publish the results and present reports with concrete opportunities for each town to move forward on the agenda of transparency and open data.

We launched LIBRE – a microfinance project for journalism – a partnership between Open Knowledge Brazil and Flux Studio, with involvement from AppCívico as well. It is a microfinance tool for content that aims to give the public a digital way to value and sustain quality journalism. A first group of portals is currently testing the platform in a pilot phase.

Events

We supported Open Data Day events in many Brazilian cities, as well as the Hackathon da Saúde (Health Hackathon), an initiative of the São Paulo City Hall in partnership with SENAI and AppCívico, and we participated in the Hack In Sampa event at the City Council of São Paulo.

Natália Mazotte, co-director of OKBR, participated in AbreLatam and ConDatos, annual events that have become the main meeting point for open data in Latin America and the Caribbean and a moment to discuss its status and impact across the region. We also participated in the 7th edition of the Web Forum in Brazil with the workshop “Open standards and access to information: prospects and challenges of government open data”. Along with other organizations, we organized the Brazilian Open Government meeting.

The School of Data, in partnership with Google News Lab, organised the second edition of the Brazilian Conference on Data Journalism and Digital Methods (Coda.Br). We were one of the partner organisations of the first Open Government Course for leaders in Climate, Forests and Farming, initiated by Imaflora and supported by the Climate and Land Use Alliance (CLUA).

We were the focal point for the research project “Foundations of open code as social innovators in emerging economies: a case study in Brazil”, by Clément Bert-Erboul, a specialist in economic sociology, and professor Nicholas Vonortas.

And more to come in 2018

We would like to thank you for following and taking part in OKBR during 2017; we’re counting on you in 2018. Beyond our plans for the coming year, we have the challenge and the responsibility of contributing during the election period so that Brazil moves forward on the agendas of transparency, open public information, democratic participation, integrity and the fight against corruption.

If you want to stay updated on our news and the progress of our projects, you can follow us on our blog, Twitter and Facebook.

A wonderful 2018 for all of us!

The Open Knowledge Brazil team.

Programmed Visions / Ed Summers

I’ve been meaning to read Wendy Hui Kyong Chun for some time now. Updating to Remain the Same is on my to-read list, but I recently ran across a reference to Programmed Visions: Software and Memory in Rogers (2017), which I wrote about previously, and thought I would give it a quick read beforehand.

Programmed Visions is a unique mix of computing history, media studies and philosophy that analyzes the ways in which software has been reified or made into a thing. I’ve begun thinking about using software studies as a framework for researching the construction and operation of web archives, and Chun lays a theoretical foundation that is useful for critiquing the very idea of software and investigating its performative nature.

Programmed Visions contains a set of historical case studies that it draws on as sites for understanding computing. She looks at early modes of computing involving human computers (ENIAC) which served as a prototype for what she calls “bureaucracies of computing” and the psychology of command and control that is built into the performance of computing. Other case studies involving the Memex, the Mother of All Demos, and John von Neumann’s use of biological models of memory as metaphors for computer memory in the EDVAC are described in great detail, and connected together in quite a compelling way. The book is grounded in history but often has a poetic quality that is difficult to summarize. On the meta level Chun’s use of historical texts is quite thorough and it’s a nice example of how research can be conducted in this area.

There are two primary things I will take away from Programmed Visions. The first is how software, the very idea of source code, is itself achieved through metaphor, where computing is a metaphor for metaphor itself. Using higher level computer programming languages gives software the appearance of commanding the computer; however, the source code is deeply entangled with the hardware itself: the source code is interpreted and compiled by yet more software, and ultimately reduced to fluctuations of voltage in circuitry. The source code and software cannot be extracted from this performance of computing. This separation of software from hardware is an illusion that was achieved in the early days of computing. Any analysis of software must include the computing infrastructures that make the metaphor possible. Chun chooses an interesting passage from Dijkstra (1970) to highlight the role that source code plays:

In the remaining part of this section I shall restrict myself to programs written for a sequential machine and I shall explore some of the consequences of our duty to use our understanding of a program to make assertions about the ensuing computations. It is my (unproven) claim that the ease and reliability with which we can do this depends critically upon the simplicity of the relation between the two, in particular upon the nature of sequencing control. In vague terms we may state the desirability that the structure of the program text reflects the structure of the computation. Or, in other terms, “What can we do to shorten the conceptual gap between the static program text (spread out in “text space”) and the corresponding computations (evolving in time)?” (p. 21)

Here Dijkstra is talking about the relationship between text (source code) and a performance in time by the computing machinery. It is interesting to think not only about how the gap can be reduced, but also how the text and the performance can fall out of alignment. Of course bugs are the obvious way that things can get misaligned: I instructed the computer to do X but it did Y. But as readers of source code we have expectations about what code is doing, and then there is the resulting complex computational performance. The two are one, and it’s only our mental models of computing that allow us to see a thing called software. Programmed Visions explores the genealogy of those models.

The other striking thing about Programmed Visions is what Chun says about memory. Von Neumann popularizes the idea of computer memory using work by McCulloch that relates the nervous system to voltages through the analogy of neural nets. On a practical level, what this metaphor allowed was for instructions that were previously on cards, or in the movements of computer programmers wiring circuits, to be moved into the machine itself. The key point Chun makes here is that Von Neumann’s use of biological metaphors for computing allows him to conflate memory and storage. It is important that this biological metaphor, the memory organ, was science fiction – there was no known memory organ at the time.

The discussion is interesting because it connects with ideas about memory going back to Hume and forward to Bowker (2005). Memories can be used to make predictions, but cannot be used to fully reconstruct the past. Memory is a process of deletion, but always creates the need for more:

If our machines’ memories are more permanent, if they enable a permanence that we seem to lack, it is because they are constantly refreshed–rewritten–so that their ephemerality endures, so that they may “store” the programs that seem to drive them … This is to say that if memory is to approximate something so long lasting as storage, it can do so only through constant repetition, a repetition that, as Jacques Derrida notes, is indissociable from destruction (or in Bush’s terminology, forgetting). (p. 170)

In the elided section above Chun references Kirschenbaum (2008) to stress that she does not mean to imply that software is immaterial. Instead Chun describes computer memory as undead, neither alive nor dead but somewhere in between. The circuits need to be continually electrically performed for the memory to be sustained and alive. The requirement to keep the bits moving reminds me of Kevin Kelly’s idea of movage, and anticipates (I think?) Chun (2016). This (somewhat humorous) description of the computer memory as undead reminded me of the state that archived web content is in. For example when viewing content in the Wayback machine it’s not uncommon to run across some links failing, missing resources, lack of interactivity (search) that was once there. Also, it’s possible to slip around in time as pages are traversed that have been stored at different times. How is this the same and different from traditional archives of paper, where context is lost as well?

So I was surprised in the concluding chapter when Chun actually talks about the Internet Archive’s Wayback Machine (IWM) on pp 170-171. I guess I shouldn’t have been surprised, but the leap from Von Neumann’s first articulation of modern computer architecture forwards to a world with a massively distributed Internet and World Wide Web was a surprise:

The IWM is necessary because the Internet, which is in so many ways about memory, has, as Ernst (2013) argues, no memory–at least not without the intervention of something like the IWM. Other media do not have a memory, but they do age and their degeneration is not linked to their regeneration. As well, this crisis is brought about because of this blinding belief in digital media as cultural memory. This belief, paradoxically, threatens to spread this lack of memory everywhere and plunge us negatively into a way-wayback machine: the so-called “digital dark age.” The IWM thus fixes the Internet by offering us a “machine” that lets us control our movement between past and future by regenerating the Internet at a grand scale. The Internet Wayback Machine is appropriate in more ways than one: because web pages link to, rather than embed, images, which can be located anywhere, and because link locations always change, the IWM preserves only a skeleton of a page, filled with broken–rendered–links and images. The IWM, that is, only backs up certain data types. These "saved" are not quite dead, but not quite alive either, for their proper commemoration requires greater effort. These gaps not only visualize the fact that our constant regenerations affect what is regenerated, but also the fact that these gaps–the irreversibility of this causal programmable logic– are what open the World Wide Web as archive to a future that is not simply stored upgrades of the past. (p. 171-172)

I think some things have improved somewhat since Chun wrote those words, but her essential observation remains true: the technology that furnishes the Wayback Machine is oriented around a document based web, where representations of web resources are stored at particular points in time and played back at other points in time. The software infrastructures that generated those web representations are not part of the archive, and so the archive is essentially in an undead state–seemingly alive, but undynamic and inert. It’s interesting to think about how traditional archives have similar characteristics though: the paper documents that lack adequate provenance, or media artifacts that can be digitized but no longer played. We live with the undead in other forms of media as well.

One of my committee members recently asked for my opinion on why people often take the position that since content is digital we can now keep it all. The presumption being that we keep all data online or in near or offline storage and then rely on some kind of search to find it. I think Chun hits on part of the reason this might be when she highlights how memory has been conflated with storage. For some, the idea that data is stored is equivalent to its having been remembered. But it’s actually in the exercise of the data, in its use or access, that memory is activated. This position that everything can be remembered because it is digital has its economic problems, but it is an interesting little philosophical conundrum that will be important to keep in the back of my mind as I continue to read about memory and archives.

References

Bowker, G. C. (2005). Memory practices in the sciences (Vol. 205). Cambridge, MA: MIT Press.

Chun, W. H. K. (2016). Updating to remain the same: Habitual new media. MIT Press.

Dijkstra, E. W. (1970). Notes on structured programming. Technological University, Department of Mathematics.

Ernst, W. (2013). Digital memory and the archive (J. Parikka, Ed., pp. 113–140). University of Minnesota Press.

Kirschenbaum, M. G. (2008). Mechanisms: New media and the forensic imagination. MIT Press.

Rogers, R. (2017). Doing web history with the internet archive: Screencast documentaries. Internet Histories, 1–13.

The Internet Society Takes On Digital Preservation / David Rosenthal

Another worthwhile initiative comes from The Internet Society, through its New York chapter. They are starting an effort to draw attention to the issues around digital preservation. Shuli Hallack has an introductory blog post entitled Preserving Our Future, One Bit at a Time. They kicked off with a meeting at Google's DC office labeled as being about "The Policy Perspective". It was keynoted by Vint Cerf with respondents Kate Zwaard and Michelle Wu. I watched the livestream. Overall, I thought that the speakers did a good job despite wandering a long way from policies, mostly in response to audience questions.

Vint will also keynote the next event, at Google's NYC office February 5th, 2018, 5:30PM – 7:30PM. It is labeled as being about "Business Models and Financial Motives" and, if that's what it ends up being about, it should be very interesting and potentially useful. I hope to catch the livestream.

Delete Notes / Ed Summers

I recently finished reading Delete by Viktor Mayer-Schönberger and thought I would jot down some brief notes for my future self, since it is a significant piece of work for my interests in web archiving. If you are interested in memory and information and communication technologies (ICT) then this book is a must read. Mayer-Schönberger is a professor at the Oxford Internet Institute where he has focused on issues at the intersection of Internet studies and governance. Delete is a particularly good tonic for the widespread idea that electronic records are somehow fleeting, impermanent artifacts–a topic that Kirschenbaum (2008) explored so thoroughly a year earlier from a materialist perspective.

Delete functions largely in two modes. The first (and primary) is to give an overview of how our ideas of permanence have shifted with the wide availability of computer storage and the Internet. The focus isn’t so much on these technologies themselves, but on the impact that storage capabilities and wide distribution has had on privacy and more generally on our ability to think. Mayer-Schönberger observes that for much of human history remembering has been difficult and has required concerted effort (think archival work here). The default was to forget information. But today’s information technologies allow the default to be set to remember, and it now requires effort to forget.

Examining the potential impacts upon cognition and our ability to think is where this book shines brightest. If the default is set to remember, how does this shape public discourse? How will ever-present and unlimited storage combined with a surveillance culture work to cement self-censorship that suppresses free expression and identity formation? The book contends that ICT allow Bentham’s Panopticon to be extended not only in space, but also in time, recalling Orwell’s popular quote, as used by Samuels (1986):

Who controls the past controls the future. Who controls the present controls the past.

It is easy to see this mechanism at work in large social media companies like Google and Facebook, where efforts to delete our data can get interpreted instead as deactivate, which simply renders the content inaccessible to those outside the corporate walls. The data that these companies collect is core to their business, and extremely valuable. They often aren’t deleting it, even when we tell them to, and are incentivized not to. What’s more the parties that the data has been sold to, or otherwise shared with, probably aren’t deleting it either.

While it’s true that computer storage has greatly enabled the storage of information, it has also, at the same time, accelerated our ability to generate information. I feel like Mayer-Schönberger could have spent more time addressing this side of the equation. There are huge costs associated with remembering at the scales that Google, Facebook and other companies are working at. There are also large costs associated with preserving information for the long term (Rosenthal et al., 2012). Are large technology companies invested in saving data for the long term? Or are they more oriented to extracting market value out of our personal data while it is valuable as a predictive economic indicator? Examples like Google’s preservation of Usenet are in short supply. It is important to think about not only how things are being remembered, but also who is doing the remembering. If it does become valuable to remember for the long term, then we face a situation where only very large organizations are able to do it, which raises the problem of information power that Mayer-Schönberger describes:

As others gain access to our information (especially when we do not approve or even know of it), we lose power and control. Because of the accessibility and durability of digital memory, information power not only shifts from the individual to some known transactional party, but to unknown others as well. This solidifies and deepens existing power differentials between the information rich and the information poor, and may even deny the latter their own conception of the past.

This future where access to our past could be increasingly denied is a disturbing thought indeed. It also foregrounds the importance of memory work as a source of power and vehicle for social justice (Jimerson, 2009; Punzalan & Caswell, 2016).

The second mode that Delete engages in is exploring possible responses to the problem of deletion and memory in ICT. One of the more useful contributions that Delete offers is a framework for thinking about the range of responses:

             Information Power (Privacy)   Cognition
Individuals  Digital abstinence            Cognitive adjustment
Laws         Privacy rights                Information ecology
Technology   Privacy + DRM                 Full contextualization

This framework largely operates to carve out the area of cognition as a new territory for experimentation. Delete does do a nice job of talking about privacy laws and digital rights management (DRM) systems. In particular Halderman, Waters, & Felten (2004) is cited as an example of work on protocols for informed consent that could hold promise. I think we have seen these play out more recently in affordances of social media where, for example, in Facebook you have the ability to control whether other people can tag you in photos. It is an area we are looking to more fully develop in the Documenting the Now project.

But in general Mayer-Schönberger asserts that DRM and privacy rights tend to be oriented around binary operations (rights, no rights), and once rights are obtained the party can do whatever they want with the content. Power differentials are also at work: large powerful companies obtain rights from weaker individuals, who ultimately have little bargaining power. Even in countries that have privacy laws around personal data, there has been very little exercise of them. This was back in 2009, and it would be interesting to know if it’s still the case.

At any rate, all of these solutions are examined but ultimately put aside for some failing or another in order to promote the proposed solution that the book has been leading up to: reintroducing forgetting. Specifically, the last few chapters make the case for adding controls into applications that let users control how long content lives. You do get the feeling that this is the horse that the book has been betting on since the beginning, and that other options have been summarily dispatched to leave it as the only contender. But Mayer-Schönberger’s analysis of the other options is really quite convincing, and is really meant to paint a full picture, not to invalidate the other approaches.

The idea of an HCI approach to forgetting must have been quite novel at the time, and I’d argue that it still is. We now see the ability to delete your account from social media platforms like Facebook, for example. Snapchat has popularized the idea of ephemerality of personal data, which has been picked up by Instagram in their stories feature, where content expires after 24 hours. Instagram also offers an archive function which makes previous posts only available to you and not to the public or those who follow you. It is interesting to see these affordances being built into tools, and as far as I know not much research has been done into why and how users are choosing to use them. Also, it would be interesting to see how understandings of deletion match up to reality. If you have run across research (or are doing it yourself) that has, I’d love to hear from you.

The idea of expiration dates outlined in Delete is expressed as a negotiation between parties that is reminiscent of a gift agreement in archives. This makes me wonder if it could be useful to draw on the research literature around privacy in archives (MacNeil, 1992) and the impacts of digitization on archival preservation (Iacovino & Todd, 2007).

Which brings me to my last point, and one of the main drawbacks of Delete. Unfortunately there is very little mention throughout the book of the role of libraries and archives as memory institutions. Really the only mention of archives and libraries comes near the end (p. 190), when discussing the potential negative impacts of expiration dates:

… expiration dates may be accused of impeding the work, and even calling into question the very existence, of archives and libraries. This belief is unfounded. Expiration dates let individuals decide how long they want information to be remembered. But societies have the power to override such individual decisions if necessary–for example, by mandating information retention, and by maintaining libraries, archives, and other special institutions to preserve the information about a particularly important event of the past. Nothing of this proposal would alter or change that–except perhaps the need to make such information retention exceptions explicit and transparent.

The book would have benefited from a look at records management processes and at processes of appraisal, which work to select material for long-term preservation and access. In addition to expiration there is the related idea of content that is embargoed for access until a certain time has passed. By what mechanisms would society override an expiration date? And what of situations where powerful institutions are able to override the archive (George, 2013)?

One interesting idea that Delete mentions is a digital version of "rusting" that mimics partial as well as gradual forgetting.

We could envision, for example, that older information would take longer to be retrieved from digital memory, much like how our brain sometimes requires extra time to retrieve events from the distant past. Or digital memory might require more query information to retrieve older information, mimicking how our brain sometimes needs additional stimuli for us to remember. A piece of digital information–say a document–could also become partly obfuscated and erased over time, and not in one fell swoop, thus resembling more closely human forgetting and partial recall.

This discussion of "rusting" reminds me of the ways in which cards can appear to age and are then archived in Trello, or pictures are archived in Instagram. It also recalls the way tape archive systems keep frequently used data in nearline storage, a state in between online (readily accessible) and archived (taking a long time to access). Amazon's Glacier operates in a similar way, saving the cost of keeping data by pushing it offline into an archive.

This rusting could be methodically introduced into systems. For example, it's interesting to consider what it would mean if the Library of Congress Twitter Archive were made available only on site (not over the Internet) and queries took up to 24 hours to complete. While these limitations could be criticized as an access bug, they could also serve as a rusting feature that operates as a limit or bounding function for memory. I think it's also possible to look at the perceived clunkiness and imperfection of accessing the Internet Archive's Wayback Machine by URL, and its current lack of search, as a feature that allows this very special archive to exist. A total-recall web archive (full coverage with search) could present serious cultural and cognitive challenges to our concept of memory, and to the role of technology in mediating memories.
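As a thought experiment, here is a small Python sketch of what a "rusting" retrieval policy might look like: older records cost more time to retrieve and are progressively (and randomly) obscured. The decay parameters and the obfuscation rule are invented for illustration; nothing like this is specified in Delete.

    import random
    from datetime import datetime, timezone

    def retrieval_delay_seconds(age_days: float, base: float = 0.1) -> float:
        """Older records take longer to come back, loosely mimicking slow recall."""
        return base * (1 + age_days / 30)

    def rust(text: str, age_days: float, horizon_days: float = 3650) -> str:
        """Progressively obscure characters as a record approaches the horizon."""
        fade = min(age_days / horizon_days, 1.0)   # 0.0 = fresh, 1.0 = fully rusted
        rng = random.Random(42)                    # fixed seed so the example is repeatable
        return "".join("·" if rng.random() < fade else ch for ch in text)

    created = datetime(2010, 1, 1, tzinfo=timezone.utc)
    age = (datetime.now(timezone.utc) - created).days
    print(retrieval_delay_seconds(age))            # a noticeably longer wait for old material
    print(rust("a tweet from 2010 about a conference", age))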

Delete is ultimately a call to do more research into the impacts of ICT on memory, and to explore new affordances for memory in our tools and practices.

We do not know well enough yet how human forgetting and recall work to replicate them with sufficient precision in digital code. But that should not keep us from trying. Perhaps it is possible to craft an expiration date mechanism that is a bit closer to how human memory and forgetting work, in return for only a modicum of added complexity.

I'm about to embark on more reading about memory and technologies, so if you have recommendations please let me know! As an aside, I've started tagging stuff I run across on the web with the tag delete.


PS. Delete also teaches the value of a succinct and evocative title for your work. It was fun using the book title interchangeably with the author's name, as if the book had somehow become a persona or character and taken on a life of its own.

References

George, C. (2013). Archives beyond the pale: Negotiating legal and ethical entanglements after the Belfast Project. The American Archivist, 76(1), 47–67.

Halderman, J. A., Waters, B., & Felten, E. W. (2004). Privacy management for portable recording devices. In Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society (pp. 16–24). ACM.

Iacovino, L., & Todd, M. (2007). The long-term preservation of identifiable personal data: A comparative archival perspective on privacy regulatory models in the European Union, Australia, Canada and the United States. Archival Science, 7(1), 107–127.

Jimerson, R. C. (2009). Archives power: Memory, accountability, and social justice. Society of American Archivists.

Kirschenbaum, M. G. (2008). Mechanisms: New media and the forensic imagination. MIT Press.

MacNeil, H. (1992). Without consent: The ethics of disclosing personal information in public archives. Scarecrow Press.

Punzalan, R. L., & Caswell, M. (2016). Critical directions for archival approaches to social justice. Library Quarterly, 86(1), 25–42.

Rosenthal, D. S., Rosenthal, D. C., Miller, E. L., Adams, I. F., Storer, M. W., & Zadok, E. (2012). The economics of long-term digital storage. Memory of the World in the Digital Age, Vancouver, BC.

Samuels, H. W. (1986). Who controls the past. The American Archivist, 49(2), 109–124. Retrieved from http://americanarchivist.org/doi/abs/10.17723/aarc.49.2.t76m2130txw40746

MarcEdit Unicode Question [also posted on the listserv] / Terry Reese

** This was posted on the listserv, but I’m putting this out there broadly **
** Updated to include a video demonstrating how Normalization currently impacts users **

Video demonstrating the question at hand:

 

So, I have an odd Unicode question and I'm looking for some feedback.  I had someone working with MarcEdit who was searching for é.  This character (and a few others) presents some special problems when doing replacements, because it can be represented by multiple code points: either as a letter plus a diacritic (like you'd find in MARC8) or as a single code point.

Here's the rub.  On Windows 10, if you do a find and replace using either type of normalization (.NET supports four major normalization forms), the program will find the string and replace the data.  The problem is that it replaces the data in the normalization that is presented — meaning that if your file contains data represented as multiple code points (the traditional practice with MARC21 — what is called the KD normalization) and your replacement string uses a single code point, the replacement will replace the multiple code points with a single code point.  This is, apparently, a Windows 10 behavior.  But I find this behaves differently on Mac (and Linux) systems — which is problematic and confusing.
At the same time, most folks don't realize that characters like é have multiple representations. MarcEdit can find them, but it won't replace them unless they are ordinally equivalent (unless you do a case-insensitive search).  So the tool may tell you it has found fields with this value, and report that replacements have been made, when no data has actually changed (because ordinally, the strings are *not* the same).
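The underlying issue is easy to demonstrate outside MarcEdit. MarcEdit itself is .NET, so the following Python sketch is only an illustration of the concept: the precomposed and decomposed forms of é render identically but compare as unequal until both sides are normalized to the same form.

    import unicodedata

    precomposed = "\u00e9"     # é as a single code point (what Form C produces)
    decomposed = "e\u0301"     # e + combining acute accent (what Form D / KD produce)

    print(precomposed, decomposed)            # both render as é
    print(precomposed == decomposed)          # False: ordinally different
    print(len(precomposed), len(decomposed))  # 1 vs 2 code points

    # Normalizing both sides to the same form makes them compare equal again.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True
    print(unicodedata.normalize("NFKD", precomposed) == decomposed)   # True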
So, I've been thinking about this.  There is something I could do.  In the preferences, I allow users to define which Unicode normalization they want to use when converting data to Unicode.  This value is only used by the MarcEngine.  However, I could extend it to the editing functions.  Using this method, I could force data that comes through the search to conform to the desired normalization — but you would still have cases where, say, you are looking for data normalized in Form C, you've told me you want all data in Form KD, and so again é may not be found, because ordinally the strings are not the same.
The other option — and this seems like the least confusing, though it has other impacts — would be to modify the functions so that the tool tests the Find string and, based on the data present, normalizes all data to match that normalization.  This way, replacements would always happen appropriately.  Of course, this means that if your data started in KD notation, it may end up (and would likely end up, if you enter these diacritics from a keyboard) in C notation.  I'm not sure what the impact would be for ILS systems, as they may expect one notation and get another.  They should support all Unicode normalizations, but given that MARC21 assumes KD notation, they may be lazy and default to that set.  To prevent normalization switching, I could have the program, on save, ensure that all Unicode data matches the normalization specified in the preferences.  That would be possible — it comes with a small speed cost, probably not a big one — but I'd have to see what the trade-off would be.
I'm bringing this up because on Windows 10 it looks as though the Replace functionality in the system is doing these normalizations automatically.  From the user's perspective, this is likely desired, but from a final-output perspective, that's harder to say.  And since you'd never be able to tell that the normalization had changed unless you looked at the data in a hex editor (because honestly, it shouldn't matter — but again, if your ILS only supports a single normalization, it very much would) — this could be a problem.
My initial inclination, given that Windows 10 appears to be doing normalization on the fly and allowing users to search and replace é in multiple normalizations, is to normalize all data that is recognized as UTF8.  That would allow me to filter all strings going into the system and then, when saving, push out the data using the normalization that was requested.  But I'm not sure if this is still a big issue, or if knowing whether the data is in single or multiple code points (from a find-and-replace perspective) is actually desired.
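A minimal sketch of that approach, again in Python rather than .NET, might look like the following: coerce everything entering the editor into a single working form so that find/replace behaves consistently, then re-normalize to the user's preferred form on save. The function names and the choice of working form are hypothetical.

    import unicodedata

    WORKING_FORM = "NFC"      # internal form while editing (an assumption for this sketch)

    def normalize_incoming(text: str) -> str:
        """Coerce UTF8 data entering the editor into one working form."""
        return unicodedata.normalize(WORKING_FORM, text)

    def normalize_on_save(text: str, preferred_form: str = "NFKD") -> str:
        """Re-normalize to the form chosen in preferences (e.g. KD for MARC21)."""
        return unicodedata.normalize(preferred_form, text)

    record = normalize_incoming("Caf\u00e9 / Cafe\u0301")   # mixed forms in one field
    edited = record.replace("\u00e9", "e")                  # the replacement now matches both
    print(normalize_on_save(edited))                        # data written out in Form KD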
So, I’m pushing this question out to the community, especially as UTF8 is becoming the rule, and not the exception.

Tax season is here: How libraries can help communities prepare / District Dispatch

This blog post, written by Lori Baux of the Computer & Communications Industry Association, is one in a series of occasional posts contributed by leaders from coalition partners and other public interest groups that ALA’s Washington Office works closely with. Whatever the policy – copyright, education, technology, to name just a few – we depend on relationships with other organizations to influence legislation, policy and regulatory issues of importance to the library field and the public.

It’s hard to believe, but as the holiday season comes to an end, tax season is about to begin.

For decades, public libraries have been unparalleled resources in their communities, going far beyond their traditional literary role. Libraries assist those who need it most by providing free Internet access and offering financial literacy classes, job training, employment assistance and more. And for decades, libraries have served as a critical resource during tax season.

Each year, more and more Americans feel as though they lack the necessary resources to confidently and correctly file their taxes on time. This is particularly true for moderate and lower-income individuals and families who are forced to work multiple jobs just to make ends meet. The question is “where is help available?”

Libraries across the country are stepping up their efforts to assist local taxpayers in filing their taxes for free. Many libraries offer in-person help, often serving as a Volunteer Income Tax Assistance (VITA) location or AARP Tax-Aide site. But appointments often fill up quickly, and many communities are without much, if any, free in-person tax assistance.

There is an option for free tax prep that libraries can provide—and with little required from already busy library staff. The next time that a local individual or family comes looking for a helping hand with tax preparation, libraries can guide them to a free online tax preparation resource—IRS Free File:

  • Through the Free File Program, those who earned $66,000 or less last year—over 70 percent of all American taxpayers—are eligible to use at least one of 12 brand-name tax preparation software products to file their Federal (and in many cases, state) taxes completely free of charge. More information is available at www.irs.gov/freefile. Free File starts on January 12, 2018.
  • Free File complements local VITA programs, where people can get in-person help from IRS certified volunteers. There are over 12,000 VITA programs across the country to help people in your community maximize their refund and claim all the credits that they deserve, including the Earned Income Tax Credit (EITC). Any individual making under $54,000 annually may qualify. More information on VITAs is available at www.irs.gov/vita. More information about AARP Tax-Aide can be found here.

With help from libraries and volunteers across the nation, we can work together to ensure that as many taxpayers as possible have access to the resources and assistance that they need to file their returns.

The Computer & Communications Industry Association (CCIA) hosts a website – www.taxtimeallies.org – that provides resources to inform and assist eligible taxpayers with filing their taxes including fact sheets, flyers and traditional and social media outreach tools. CCIA also encourages folks to download the IRS2Go app on their mobile phone.

Thanks to help from libraries just like yours, we can help eligible taxpayers prepare and file their tax returns on time and free of charge.

Lori Baux is Senior Manager for Grassroots Programs, directing public education and outreach projects on behalf of the Computer & Communications Industry Association (CCIA), an international not-for-profit membership organization dedicated to innovation and enhancing society’s access to information and communications.

The post Tax season is here: How libraries can help communities prepare appeared first on District Dispatch.

New edition of Data Journalism Handbook to explore journalistic interventions in the data society / Open Knowledge Foundation

This blog has been reposted from http://jonathangray.org/2017/12/20/new-edition-data-journalism-handbook/

The first edition of The Data Journalism Handbook has been widely used and widely cited by students, practitioners and researchers alike, serving as both textbook and sourcebook for an emerging field. It has been translated into over 12 languages – including Arabic, Chinese, Czech, French, Georgian, Greek, Italian, Macedonian, Portuguese, Russian, Spanish and Ukrainian – and is used for teaching at many leading universities, as well as teaching and training centres around the world.

A huge amount has happened in the field since the first edition in 2012. The Panama Papers project undertook an unprecedented international collaboration around a major database of leaked information about tax havens and offshore financial activity. Projects such as The Migrants Files, The Guardian’s The Counted and ProPublica’s Electionland have shown how journalists are not just using and presenting data, but also creating and assembling it themselves in order to improve data journalistic coverage of issues they are reporting on.

The Migrants’ Files saw journalists in 15 countries work together to create a database of people who died in their attempt to reach or stay in Europe.

Changes in digital technologies have enabled the development of formats for storytelling, interactivity and engagement with the assistance of drones, crowdsourcing tools, satellite data, social media data and bespoke software tools for data collection, analysis, visualisation and exploration.

Data journalists are not simply using data as a source, they are also increasingly investigating, interrogating and intervening around the practices, platforms, algorithms and devices through which it is created, circulated and put to work in the world. They are creatively developing techniques and approaches which are adapted to very different kinds of social, cultural, economic, technological and political settings and challenges.

Five years after its publication, we are developing a revised second edition, which will be published as an open access book with an innovative academic press. The new edition will be significantly overhauled to reflect these developments. It will complement the first edition with an examination of the current state of data journalism which is at once practical and reflective, profiling emerging practices and projects as well as their broader consequences.

“The Infinite Campaign” by Sam Lavigne (New Inquiry) repurposes ad creation data in order to explore “the bizarre rubrics Twitter uses to render its users legible”.

Contributors to the first edition include representatives from some of the world’s best-known newsrooms and data journalism organisations, including the Australian Broadcasting Corporation, the BBC, the Chicago Tribune, Deutsche Welle, The Guardian, the Financial Times, Helsingin Sanomat, La Nacion, the New York Times, ProPublica, the Washington Post, the Texas Tribune, Verdens Gang, Wales Online, Zeit Online and many others. The new edition will include contributions from both leading practitioners and leading researchers of data journalism, exploring a diverse constellation of projects, methods and techniques in this field from voices and initiatives around the world. We are working hard to ensure a good balance of gender, geography and themes.

Our approach in the new edition draws on the notion of “critical technical practice” from Philip Agre, which he formulates as an attempt to have “one foot planted in the craft work of design and the other foot planted in the reflexive work of critique” (1997). Similarly, we wish to provide an introduction to a major new area of journalism practice which is at once critically reflective and practical. The book will offer reflection from leading practitioners on their experiments and experiences, as well as fresh perspectives on the practical considerations of research on the field from leading scholars.

The structure of the book reflects different ways of seeing and understanding contemporary data journalism practices and projects. The introduction highlights the renewed relevance of a book on data journalism in the current so-called “post-truth” moment, examining the resurgence of interest in data journalism, fact-checking and strengthening the capacities of “facty” publics in response to fears about “alternative facts” and the speculation about a breakdown of trust in experts and institutions of science, policy, law, media and democracy. As well as reviewing a variety of critical responses to data journalism and associated forms of datafication, it looks at how this field may nevertheless constitute an interesting site of progressive social experimentation, participation and intervention.

The first section on “data journalism in context” will review histories, geographies, economics and politics of data journalism – drawing on leading studies in these areas. The second section on “data journalism practices” will look at a variety of practices for assembling data, working with data, making sense with data and organising data journalism from around the world. This includes a wide variety of case studies – including the use of social media data, investigations into algorithms and fake news, the use of networks, open source coding practices and emerging forms of storytelling through news apps and data animations. Other chapters look at infrastructures for collaboration, as well as creative responses to disappearing data and limited connectivity. The third and final section on “what does data journalism do?”, examines the social life of data journalism projects, including everyday encounters with visualisations, organising collaborations across fields, the impacts of data projects in various settings, and how data journalism can constitute a form of “data activism”.

As well as providing a rich account of the state of the field, the book is also intended to inspire and inform “experiments in participation” between journalists, researchers, civil society groups and their various publics. This aspiration is partly informed by approaches to participatory design and research from both science and technology studies as well as more recent digital methods research. Through the book we thus aim to explore not only what data journalism initiatives do, but how they might be done differently in order to facilitate vital public debates about both the future of the data society as well as the significant global challenges that we currently face.

This is Jeopardy! Or, How Do People Actually Get On That Show? / LITA

This past November, American Libraries published a delightful article on librarians that have appeared on the iconic game show Jeopardy! It turns out one of our active LITA members also recently appeared on the show. Here’s her story…

On Wednesday, October 18th, one of my lifelong dreams will come true: I’ll be a contestant on Jeopardy!

It takes several steps to get onto the show: first, you must pass an online exam, but you don’t really learn the results unless you make it to the next stage: the invitation to audition. This step is completed in person, comprising a timed, written test, playing a mock game with other aspiring players in front of a few dozen other auditionees, and chatting amiably in a brief interview, all while being filmed. If you make it through this gauntlet, you go into “the pool”, where you remain eligible for a call to be on the show for up to 18 months. Over the course of one year of testing and eligibility, around 30,000 people take the first test, around 1500 to 1600 people audition in person, and around 400 make it onto the show each season.

For me, the timeline was relatively quick. I tested online in October 2016, auditioned in January 2017, and thanks to my SoCal address, I ended up as a local alternate in February. Through luck of the draw, I was the leftover contestant that day. I didn’t tape then, but was asked back directly to the show for the August 3rd recording session, which airs from October 16th to October 20th.

The call is early – 7:30am – and the day’s twelve potential contestants take turns with makeup artists while the production team covers paperwork, runs through those interview stories one-on-one, and pumps up the contestants to have a good time. Once you’re in, you’re sequestered. There’s no visiting with family or friends who accompanied you to the taping and no cellphones or internet access allowed. You do have time to chat with your fellow contestants, who are all whip smart, funny, and generally just as excited as you are to get to be on this show. There’s also no time to be nervous or worried: you roll through the briefing onto the stage for a quick run-down on how the podiums work (watch your elbows for the automated dividers that come up for Final Jeopardy!), how to buzz in properly (there’s a light around the big game board that you don’t see at home that tells you when you can ring in safely), and under no circumstances are you to write on the screen with ANYTHING but that stylus!

Next, it’s time for your Hometown Howdy, the commercial blurb that airs on the local TV station for your home media market. Since I’d done it before when I almost-but-not-quite made it on the air in February, I knew they were looking for maximum cheese. My friends and family tell me that I definitely delivered.

Immediately before they let in the live studio audience for seating, contestants run through two quick dress rehearsal games to get out any final nerves, test the equipment for the stage crew, and practice standing on the risers behind the podiums without falling off.

Then it’s back to the dressing room, where the first group is drawn. They get a touch-up on makeup, the rest of the contestant group sits down in a special section of the audience, and it’s off to the races! There are three games filmed before the lunch break, then the final two are filmed. The contestants have the option to stay and watch the rest of the day if they’re defeated, but most choose to leave if it’s later on in the filming cycle. The adrenaline crash is pretty huge, and some people may need the space to let out their mixed feelings. If you win, you are whisked back to the dressing room for a quick change, a touch-up again, and back out to the champion’s podium to play again.

You may be asking, when do contestants meet Alex? Well, it happens exactly twice, and both times, the interactions are entirely on film and broadcast in (nearly) their entirety within the show. To put all of those collusion rumors around the recent streak of Austin Rogers to rest, the interview halfway through the first round and the hand-shaking at the end of the game are the only times that Alex and the contestants meet or speak with one another; there is no “backstage” where the answer-giver and the question-providers could possibly mingle. Nor do the contestants ever get to do more than wave “hello” to the writers for the show. Jeopardy! is very careful to keep its two halves very separated. The energy and enthusiasm of the contestant team – Glenn, Maggie, Corina, Lori, and Ryan – is genuine, and when your appearance is complete, you feel as though you have joined a very special family of Jeopardy! alumni.

Once you’ve been a contestant on Jeopardy!, you can never be on the show again. The only exception is if you do well enough to be asked back to the Tournament of Champions. While gag rules prohibit me from saying more about how I did, I can say that the entire experience lived up to the hype I had built around it since I was a child, playing along in my living room and dreaming of the chance to respond in the form of a question.

Islandora Camp - Call for Proposals / Islandora

Doing something great with Islandora that you want to share with the community? Have a recent project that the world just needs to know about? Send us your proposals to present at Islandora Camp! Presentations should be roughly 20-25 minutes in length (with time after for questions) and deal with Islandora in some way. Want more time or to do a different format? Let us know in your proposal and we'll see what we can do.

You can see examples of previous Islandora camp sessions on our YouTube channel.

The Call for Proposals for iCampEU in Limerick will be open until March 1st.


Islandora Camp EU 2018 - Registration / Islandora

Islandora Camp is heading to Ireland June 20 - 22, 2018, hosted by the University of Limerick. Early Bird rates are available until March 1st, 2018, after which the rate will increase to €399,00.

The Early Bird registration rate is €360,00. When registering, attendees choose one of two curricula:
Admin: For repository and collection managers, librarians, archivists, and anyone else who deals primarily with the front-end experience of Islandora and would like to learn how to get the most out of it, or developers who would like to learn more about the front-end experience.
Developer: For developers, systems people, and anyone dealing with Islandora at the code-level, or any front-end Islandora users who are interested in learning more about the developer side.

It Isn't About The Technology / David Rosenthal

A year and a half ago I attended Brewster Kahle's Decentralized Web Summit and wrote:
I am working on a post about my reactions to the first two days (I couldn't attend the third) but it requires a good deal of thought, so it'll take a while.
As I recall, I came away from the Summit frustrated. I posted the TL;DR version of the reason half a year ago in Why Is The Web "Centralized"? :
What is the centralization that decentralized Web advocates are reacting against? Clearly, it is the domination of the Web by the FANG (Facebook, Amazon, Netflix, Google) and a few other large companies such as the cable oligopoly.

These companies came to dominate the Web for economic not technological reasons.
Yet the decentralized Web advocates persist in believing that the answer is new technologies, which suffer from the same economic problems as the existing decentralized technologies underlying the "centralized" Web we have. A decentralized technology infrastructure is necessary for a decentralized Web but it isn't sufficient. Absent an understanding of how the rest of the solution is going to work, designing the infrastructure is an academic exercise.

It is finally time for the long-delayed long-form post. I should first reiterate that I'm greatly in favor of the idea of a decentralized Web based on decentralized storage. It would be a much better world if it happened. I'm happy to dream along with my friend Herbert Van de Sompel's richly-deserved Paul Evan Peters award lecture entitled Scholarly Communication: Deconstruct and Decentralize?. He describes a potential future decentralized system of scholarly communication built on existing Web protocols. But even he prefaces the dream with a caveat that the future he describes "will most likely never exist".

I agree with Herbert about the desirability of his vision, but I also agree that it is unlikely. Below the fold I summarize Herbert's vision, then go through a long explanation of why I think he's right about the low likelihood of its coming into existence.

Herbert identifies three classes of decentralized Web technology and explains that he decided not to deal with these two:
  • Distributed file systems. Herbert is right about this. Internet-scale distributed file systems were first prototyped in the late 90s with Intermemory and Oceanstore, and many successors have followed in their footsteps. None have achieved sustainability or Internet platform scale. The reasons are many, the economic one of which I wrote about in Is Distributed Storage Sustainable? Betteridge's Law applies, so the answer is "no".
  • Blockchains. Herbert is right about this too. Even the blockchain pioneers have to admit that, in the real world, blockchains have failed to deliver any of their promised advantages over centralized systems. In particular, as we see with Bitcoin, maintaining decentralization against economies of scale is a fundamental, unsolved problem:
    Trying by technical means to remove the need to have viable economics and governance is doomed to fail in the medium- let alone the long-term. What is needed is a solution to the economic and governance problems. Then a technology can be designed to work in that framework.
    And, as Vitalik Buterin points out, the security of blockchains depends upon decentralization:
    In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently.
Herbert's reason for disregarding distributed file systems and blockchains is that they both involve entirely new protocols. He favors the approach being pursued at MIT in Sir Tim Berners-Lee's Solid project, which builds on existing Web protocols. Herbert's long experience convinces him (and me) that this is a less risky approach. My reason is different; they both reduce to previously unsolved problems.

The basic idea of Solid is that each person would own a Web domain, the "host" part of a set of URLs that they control. These URLs would be served by a "pod", a Web server controlled by the user that implemented a whole set of Web API standards, including authentication and authorization. Browser-side apps would interact with these pods, allowing the user to:
  • Export a machine-readable profile describing the pod and its capabilities.
  • Write content for the pod.
  • Control others access to the content of the pod.
Pods would have inboxes to receive notifications from other pods, so that, for example, if Alice writes a document in her pod and Bob writes a comment in his pod that links to it, a notification announcing that event appears in the inbox of Alice's pod. Alice can then link from the document in her pod to Bob's comment in his pod. In this way, users are in control of their content which, if access is allowed, can be used by Web apps elsewhere.
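Solid builds on existing W3C work for this notification flow, such as Linked Data Notifications, in which one pod POSTs a small JSON-LD document to another pod's inbox. The Python sketch below shows roughly what Bob's pod notifying Alice's pod could look like; the pod URLs are hypothetical, and inbox discovery and authentication are omitted.

    import json
    import urllib.request

    # Hypothetical pod URLs, for illustration only.
    ALICE_INBOX = "https://alice.example.org/inbox/"
    DOCUMENT = "https://alice.example.org/notes/draft-paper"
    COMMENT = "https://bob.example.org/comments/42"

    # A notification: a small JSON-LD document describing the event.
    notification = {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Announce",
        "actor": "https://bob.example.org/profile#me",
        "object": COMMENT,
        "target": DOCUMENT,
    }

    request = urllib.request.Request(
        ALICE_INBOX,
        data=json.dumps(notification).encode("utf-8"),
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )
    # Alice's pod would store the notification and surface it to her apps.
    # urllib.request.urlopen(request)   # not executed here: the example URLs do not exist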

In Herbert's vision, institutions would host their researchers "research pods", which would be part of their personal domain but would have extensions specific to scholarly communication, such as automatic archiving upon publication.

Herbert demonstrates that the standards and technology needed to implement his pod-based vision for scholarly communication exist, if the implementation is currently a bit fragile. But he concludes by saying:
By understanding why it is not feasible we may get new insights into what is feasible.
I'll take up his challenge, but in regard to the decentralized Web that underlies and is in some respects a precondition for his vision. I hope in a future post to apply the arguments that follow to his scholarly communication vision in particular.

The long explanation for why I agree with Herbert that the Solid future "will most likely never exist" starts here. Note that much of what I link to from now on is a must-read, flagged (MR). Most of these pieces are long and cover many issues that are less directly related to the reason I agree with Herbert than the parts I cite, but still relevant.

Cory Doctorow introduces his post about Charlie Stross' keynote for the 34th Chaos Communications Congress (MR) by writing (MR):
Stross is very interested in what it means that today's tech billionaires are terrified of being slaughtered by psychotic runaway AIs. Like Ted Chiang and me, Stross thinks that corporations are "slow AIs" that show what happens when we build "machines" designed to optimize for one kind of growth above all moral or ethical considerations, and that these captains of industry are projecting their fears of the businesses they nominally command onto the computers around them. 
Stross uses the Paperclip Maximizer thought experiment to discuss how the goal of these "slow AIs", which is to maximize profit growth, makes them a threat to humanity. The myth is that these genius tech billionaire CEOs are "in charge", decision makers. But in reality, their decisions are tightly constrained by the logic embedded in their profit growth maximizing "slow AIs".

Here's an example of a "slow AI" responding to its Prime Directive and constraining the "decision makers".  Dave Farber's IP list discussed Hiroko Tabuchi's New York Times article How Climate Change Deniers Rise to the Top in Google Searches, which described how well-funded climate deniers were buying ads on Google that appeared at the top of search results for climate change. Chuck McManis (Chuck & I worked together at Sun Microsystems. He worked at Google then built Blekko, another search engine.) contributed a typically informative response. As previously, I have Chuck's permission to quote him extensively:
publications, as recently as the early 21st century, had a very strict wall between editorial and advertising. It compromises the integrity of journalism if the editorial staff can be driven by the advertisers. And Google exploited that tension and turned it into a business model.
How did they do that?
When people started using Google as an 'answer this question' machine, and then Google created a mechanism to show your [paid] answer first, the stage was set for what has become a gross perversion of 'reference' information.
Why would they do that? Their margins were under pressure:
The average price per click (CPC) of advertisements on Google sites has gone down for every year, and nearly every quarter, since 2009. At the same time Microsoft's Bing search engine CPCs have gone up. As the advantage of Google's search index is eroded by time and investment, primarily by Microsoft, advertisers have been shifting budget to be more of a blend between the two companies. The trend suggests that at some point in the not to distant future advertising margins for both engines will be equivalent.
And their other businesses weren't profitable:
Google has scrambled to find an adjacent market, one that could not only generate enough revenue to pay for the infrastructure but also to generate a net income . Youtube, its biggest success outside of search, and the closest thing they have, has yet to do that after literally a decade of investment and effort.
So what did they do?
As a result Google has turned to the only tools it has that work,  it has reduced payments to its 'affiliate' sites (AdSense for content payments), then boosted the number of ad 'slots' on Google sites, and finally paying third parties to send search traffic preferentially to Google (this too hurts Google's overall search margin)
And the effect on users is:
On the search page, Google's bread and butter so to speak, for a 'highly contested' search (that is what search engine marketeers call a search query that can generate lucrative ad clicks) such as 'best credit card' or 'lowest home mortgage', there are many web browser window configurations that show few, if any organic search engine results at all!
In other words, for searches that are profitable, Google has moved all the results it thinks are relevant off the first page and replaced them with results that people have paid to put there. Which is pretty much the definition of "evil" in the famous "don't be evil" slogan notoriously dropped in 2015. I'm pretty sure that no-one at executive level in Google thought that building a paid-search engine was a good idea, but the internal logic of the "slow AI" they built forced them into doing just that.

Another example is that Mark Zuckerberg's "personal challenge" for 2018 is to "fix Facebook". In Facebook Can't Be Fixed (MR) John Battelle writes:
You cannot fix Facebook without completely gutting its advertising-driven business model.

And because he is required by Wall Street to put his shareholders above all else, there’s no way in hell Zuckerberg will do that.

Put another way, Facebook has gotten too big to pivot to a new, more “sustainable” business model.
...
If you’ve read “Lost Context,” you’ve already been exposed to my thinking on why the only way to “fix” Facebook is to utterly rethink its advertising model. It’s this model which has created nearly all the toxic externalities Zuckerberg is worried about: It’s the honeypot which drives the economics of spambots and fake news, it’s the at-scale algorithmic enabler which attracts information warriors from competing nation states, and it’s the reason the platform has become a dopamine-driven engagement trap where time is often not well spent.
John Battelle's “Lost Context” is also (MR).

I have personal experience of this problem. In the late 80s I foresaw a bleak future for Sun Microsystems. Its profits were based on two key pieces of intellectual property, the SPARC architecture and the Solaris operating system. In each case they had a competitor (Intel and Microsoft) whose strategy was to make owning that kind of IP too expensive for Sun to compete. I came up with a strategy for Sun to undergo a radical transformation into something analogous to a combination of Canonical and an App Store. I spent years promoting and prototyping this idea within Sun.

One of the reasons I have great respect for Scott McNealy is that he gave me, an engineer talking about business, a very fair hearing before rejecting the idea, saying "Its too risky to do with a Fortune 100 company". Another way of saying this is "too big to pivot to a new, more “sustainable” business model". In the terms set by Sun's "slow AI" Scott was right and I was wrong. Sun was taken over by Oracle in 2009; their "slow AI" had no answer for the problems I identified two decades earlier. But in those two decades Sun made its shareholders unbelievable amounts of money.

In Herbert's world of scholarly communication, a similar process can be seen at work in the history of open access (MR, my comments here). In May 1995 Stanford Libraries' HighWire Press pioneered the move of scholarly publishing to the Web by putting the Journal of Biological Chemistry on-line. Three years later, Vitek Tracz was saying:
with the Web technology available today, publishing can potentially happen independently of publishers. If authors started depositing their papers directly into a central repository, they could bypass publishers and make it freely available.
He started the first commercial open-access publisher, BioMed Central, in 2000 (the Springer "slow AI" bought it in 2008). In 2002 came the Budapest Open Access Initiative:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
Sixteen years later, the "slow AIs" which dominate scholarly publishing have succeeded in growing profits so much that Roger Schonfeld can tweet:
I want to know how anyone can possibly suggest that Elsevier is an enemy of open access. I doubt any company today profits more from OA and its growth!
What Elsevier means by "open access" is a long, long way from the Budapest definition. The Open Access advocates, none of them business people, set goals which implied the demise of Elsevier and the other "slow AIs" without thinking through how the "slow AIs" would react to this existential threat. The result was that the "slow AIs" perverted the course of "open access" in ways that increased their extraction of monopoly rents, and provided them with even more resources to buy up nascent and established competitors.

Elsevier's Research Infrastructure
Now the "slow AIs" dominate not just publishing, but the entire infrastructure of science. If I were Elsevier's "slow AI" I would immediately understand that Herbert's "research pods" needed to run on Elsevier's infrastructure. Given university IT departments' current mania for outsourcing everything to "the cloud", this would be trivial to arrange. They've already done it to institutional repositories. Elsevier would then be able to, for example, use a Microsoft-like "embrace, extend and extinguish" strategy to exploit its control over researchers' pods.

Open access advocates point to the rise in the proportion of papers that are freely accessible. They don't point to the rise in payments to the major publishers, the added costs to Universities of dealing with the fragmented system, the highly restrictive licenses that allow "free access" in many cases, the frequency with which author processing charges are paid without resulting in free access, and all the other ills that the "slow AIs" have visited upon scholarly communication in the pursuit of profit growth.

What people mean by saying "the Web is centralized" is that it is dominated by a small number of extremely powerful "slow AIs", the FAANGs (Facebook, Apple, Amazon, Netflix, Google) and the big telcos. None of the discussion of the decentralized Web I've seen is about how to displace them, its all about building a mousetrap network infrastructure so much better along some favored axes that, magically, the world will beat a path to their door.

This is so not going to happen.

For example, you could build a decentralized, open source social network system. In fact, people did. It is called Diaspora and it launched in a blaze of geeky enthusiasm in 2011. Diaspora is one of the eight decentralization initiatives studied by the MIT Media Lab's Defending Internet Freedom through Decentralization (MR) report:
The alpha release of the Diaspora software was deeply problematic, riddled with basic security errors in the code. At the same time, the founders of the project received a lot of pressure from Silicon Valley venture capitalists to “pivot” the project to a more profitable business model. Eventually the core team fell apart and the Diaspora platform was handed over to the open source community, who has done a nice job of building out a support website to facilitate new users in signing up for the service. Today it supports just under 60,000 active participants, but the platform remains very niche and turnover of new users is high.
Facebook has 1.37×10⁹ daily users, so it is about 22,800 times bigger than Diaspora. Even assuming Diaspora was as good as Facebook, an impossible goal for a small group of Eben Moglen's students, no-one had any idea how to motivate the other 99.996% of Facebook users to abandon the network where all their friends were and restart building their social graph from scratch. The fact that after 6 years Diaspora has 60K active users is impressive for an open source project, but it is orders of magnitude away from the scale needed to be a threat to Facebook. We can see this because Facebook hasn't bothered to react to it.

Suppose the team of students had been inspired, and built something so much better than Facebook along axes that the mass of Facebook users cared about (which don't include federation, censorship resistance, open source, etc.) that they started to migrate. Facebook's "slow AI" would have reacted in one of two ways. Either the team would have been made a financial offer they couldn't refuse, which wouldn't have made a dent in the almost $40B in cash and short-term investments on Facebook's balance sheet. Or Facebook would have tasked a few of their more than 1000 engineers to replicate the better system. They'd have had an easy job because (a) they'd be adding to an existing system rather than building from scratch, and (b) because their system would be centralized, so wouldn't have to deal with the additional costs of decentralization.

Almost certainly Facebook would have done both. Replicating an open source project in-house is very easy and very fast. Doing so would reduce the price they needed to pay to buy the startup. Hiring people good enough to build something better than the existing product is a big problem for the FAANGs. The easiest way to do it is to spot their startup early and buy it. The FAANGs have been doing this so effectively that it no longer makes sense to do a startup in the Valley with the goal of IPO-ing it; the goal is to get bought by a FAANG.

Let's see what happens when one of the FAANGs actually does see something as a threat. Last January Lina M. Khan of the Open Markets team at the New America Foundation published Amazon's Antitrust Paradox (MR) in the Yale Law Journal. Her 24,000-word piece got a lot of well-deserved attention for describing how platforms evade antitrust scrutiny. In August, Barry Lynn, Khan's boss, and the entire Open Markets team were ejected from the New America Foundation. Apparently, the reason was this press release commenting favorably on Google's €2.5 billion loss in an antitrust case in the EU. Lynn claims that:
hours after his press release went online, [New America CEO] Slaughter called him up and said: “I just got off the phone with Eric Schmidt and he is pulling all of his money,”
The FAANGs' "slow AIs" understand that antitrust is a serious threat. €2.5 billion checks get their attention, even if they are small compared to their cash hoards. The PR blowback from defenestrating the Open Markets team was a small price to pay for getting the message out that advocating for effective antitrust enforcement carried serious career risks.

This was a FAANG reacting to a law journal article and a press release. "All of his money" had averaged about $1M/yr over two decades. Imagine how FAANGs would react to losing significant numbers of users to a decentralized alternative!

Khan argued that:
the current framework in antitrust—specifically its pegging competition to “consumer welfare,” defined as short-term price effects—is unequipped to capture the architecture of market power in the modern economy. We cannot cognize the potential harms to competition posed by Amazon’s dominance if we measure competition primarily through price and output. Specifically, current doctrine underappreciates the risk of predatory pricing and how integration across distinct business lines may prove anticompetitive. These concerns are heightened in the context of online platforms for two reasons. First, the economics of platform markets create incentives for a company to pursue growth over profits, a strategy that investors have rewarded. Under these conditions, predatory pricing becomes highly rational—even as existing doctrine treats it as irrational and therefore implausible. Second, because online platforms serve as critical intermediaries, integrating across business lines positions these platforms to control the essential infrastructure on which their rivals depend. This dual role also enables a platform to exploit information collected on companies using its services to undermine them as competitors.
In the 30s antitrust was aimed at preserving a healthy market by eliminating excessive concentration of market power. But:
Due to a change in legal thinking and practice in the 1970s and 1980s, antitrust law now assesses competition largely with an eye to the short-term interests of consumers, not producers or the health of the market as a whole; antitrust doctrine views low consumer prices, alone, to be evidence of sound competition. By this measure, Amazon has excelled; it has evaded government scrutiny in part through fervently devoting its business strategy and rhetoric to reducing prices for consumers.
Shop, Ikebukuro, Tokyo
The focus on low prices for "consumers" rather than "customers" is especially relevant for Google and Facebook; it is impossible to get monetary prices lower than those they charge "consumers". The prices they charge the "customers" who buy ad space from them are another matter, but they don't appear to be a consideration for current antitrust law. Nor is the non-monetary price "consumers" pay for the services of Google and Facebook in terms of the loss of privacy, the spam, the fake news, the malvertising and the waste of time.

Perhaps the reason for Google's dramatic reaction to the Open Markets team was that they were part of a swelling chorus of calls for antitrust action against the FAANGs from both the right and the left. Roger McNamee (previously) was an early investor in Facebook and friend of Zuckerberg's, but in How to Fix Facebook — Before It Fixes Us (MR) even he voices deep concern about Facebook's effects on society. He and ethicist Tristan Harris provide an eight-point prescription for mitigating them:
  1. Ban bots.
  2. Block further acquisitions.
  3. "be transparent about who is behind political and issues-based communication"
  4. "be more transparent about their algorithms"
  5. "have a more equitable contractual relationship with users"
  6. Impose "a limit on the commercial exploitation of consumer data by internet platforms"
  7. "consumers, not the platforms, should own their own data"
Why would the Facebook "slow AI" do any of these things when they're guaranteed to decrease its stock price? The eighth is straight out of Lina Khan:
we should consider that the time has come to revive the country’s traditional approach to monopoly. Since the Reagan era, antitrust law has operated under the principle that monopoly is not a problem so long as it doesn’t result in higher prices for consumers. Under that framework, Facebook and Google have been allowed to dominate several industries—not just search and social media but also email, video, photos, and digital ad sales, among others—increasing their monopolies by buying potential rivals like YouTube and Instagram. While superficially appealing, this approach ignores costs that don’t show up in a price tag. Addiction to Facebook, YouTube, and other platforms has a cost. Election manipulation has a cost. Reduced innovation and shrinkage of the entrepreneurial economy has a cost. All of these costs are evident today. We can quantify them well enough to appreciate that the costs to consumers of concentration on the internet are unacceptably high.
McNamee understands that the only way to get Facebook to change its ways is the force of antitrust law.

Another of the initiatives studied by the MIT Media Lab's Defending Internet Freedom through Decentralization (MR) is Solid. They describe the project's goal thus:
Ultimately, the goal of this project is to render platforms like Facebook and Twitter as merely “front-end” services that present a user’s data, rather than silos for millions of people’s personal data. To this end, Solid aims to support users in controlling their own personal online datastore, or “pod,” where their personal information resides. Applications would generally run on the client-side (browser or mobile phone) and access data in pods via APIs based on HTTP.
In other words, to implement McNamee's #7 prescription.

Why do you think McNamee's #8 talks about the need to "revive the country’s traditional approach to monopoly"? He understands that having people's personal data under their control, not Facebook's, would be viewed by Facebook's "slow AI" as an existential threat. Exclusive control over the biggest and best personal data of everyone on the planet, whether or not they have ever created an account, is the basis on which Facebook's valuation rests.

The Media Lab report at least understands that there is an issue here:
The approach of Solid towards promoting interoperability and platform-switching is admirable, but it begs the question: why would the incumbent “winners” of our current system, the Facebooks and Twitters of the world, ever opt to switch to this model of interacting with their users? Doing so threatens the business model of these companies, which rely on uniquely collecting and monetizing user data. As such, this open, interoperable model is unlikely to gain traction with already successful large platforms. While a site like Facebook might share content a user has created–especially if required to do so by legislation that mandates interoperability–it is harder to imagine them sharing data they have collected on a user, her tastes and online behaviors. Without this data, likely useful for ad targeting, the large platforms may be at an insurmountable advantage in the contemporary advertising ecosystem.
The report completely fails to understand the violence of the reaction Solid will face from the FAANGs’ “slow AIs” if it ever gets big enough for them to notice.

Note that the report fails to understand that you don't have to be a Facebook user to have been extensively profiled. Facebook's "slow AI" is definitely not going to let go of the proprietary data it has collected (and in many cases paid other data sources for) about a person. Attempts to legislate this sharing in isolation would meet ferocious lobbying, and might well be unconstitutional. Nor is it clear that, even if legislation passed, the data would be in a form usable by the person, or by other services. History tends to show that attempts to force interoperability upon unwilling partners are easily sabotaged by them.

McNamee points out that, even if sharing were forced upon Facebook, it would likely do little to reduce their market power:
consumers, not the platforms, should own their own data. In the case of Facebook, this includes posts, friends, and events—in short, the entire social graph. Users created this data, so they should have the right to export it to other social networks. Given inertia and the convenience of Facebook, I wouldn’t expect this reform to trigger a mass flight of users. Instead, the likely outcome would be an explosion of innovation and entrepreneurship. Facebook is so powerful that most new entrants would avoid head-on competition in favor of creating sustainable differentiation. Start-ups and established players would build new products that incorporate people’s existing social graphs, forcing Facebook to compete again.
After all, allowing users to export their data from Facebook doesn't prevent Facebook maintaining a copy. And you don't need to be a Facebook user for them to make money from data they acquire about you. Note that, commendably, Google has for many years allowed users to download the data they create in the various Google systems (but not the data Google collects about them) via the Data Liberation Front, now Google TakeOut. It hasn't caused their users to leave.

No alternate social network can succeed without access to the data Facebook currently holds. Realistically, if this is to change, there will be some kind of negotiation. Facebook's going-in position will be "no access". Thus the going-in position for the other side needs to be something that Facebook's "slow AI" will think is much worse than sharing the data.

We may be starting to see what the something much worse might be. In contrast to the laissez-faire approach of US antitrust authorities, the EU has staked out a more aggressive position. It fined Google the €2.5 billion that got the Open Markets team fired. And, as Cory Doctorow reports (MR):
Back in 2016, the EU passed the General Data Protection Regulation, a far-reaching set of rules to protect the personal information and privacy of Europeans that takes effect this coming May.
Doctorow explains that these regulations require that:
Under the new directive, every time a European's personal data is captured or shared, they have to give meaningful consent, after being informed about the purpose of the use with enough clarity that they can predict what will happen to it. Every time your data is shared with someone, you should be given the name and contact details for an "information controller" at that entity. That's the baseline: when a company is collecting or sharing information about (or that could reveal!) your "racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, … [and] data concerning health or data concerning a natural person’s sex life or sexual orientation," there's an even higher bar to hurdle.
Pagefair has a detailed explanation of what this granting of granular meaningful consent would have to look like. It is not a viable user interface to the current web advertising ecosystem of real-time auctions based on personal information.
All of these companies need to get consent
Here is Pagefair's example of what is needed to get consent from each of them.

The start of a long, long chain of dialog boxes
Doctorow's take on the situation is:
There is no obvious way the adtech industry in its current form can comply with these rules, and in the nearly two years they've had to adapt, they've done virtually nothing about it, seemingly betting that the EU will just blink and back away, rather than exercise its new statutory powers to hit companies for titanic fines, making high profile examples out of a few sacrificial companies until the rest come into line.

But this is the same institution that just hit Google with a $2.73 billion fine. They're spoiling for this kind of fight, and I wouldn't bet on them backing down. There's no consumer appetite for being spied on online ... and the companies involved are either tech giants that everyone hates (Google, Facebook), or creepy data-brokers no one's ever heard of and everyone hates on principle (Acxiom). These companies have money, but not constituencies.

Meanwhile, publishers are generally at the mercy of the platforms, and I assume most of them are just crossing their fingers and hoping the platforms flick some kind of "comply with the rules without turning off the money-spigot" switch this May.
Pagefair's take is:
Websites, apps, and adtech vendors, should switch from using personal data to monetize direct and RTB advertising to “non-personal data”. Using non-personal, rather than personal, data neutralizes the risks of the GDPR for advertisers, publishers, and adtech vendors. And it enables them to address the majority (80%-97%) of the audience that will not give consent for 3rd party tracking across the web.
The EU is saying “it is impractical to monetize personal information”. Since Facebook’s and Google’s business models depend on monetizing personal information, this certainly looks like “something worse” than making it portable.

I remember at Esther Dyson's 2001 conference listening to the CEO of American Express explain how they used sophisticated marketing techniques to get almost all their customers to opt-in to information sharing. If I were Facebook's or Google's "slow AI" I'd be wondering if I could react to the GDPR by getting my users to opt-in to my data collection, and structuring things so they wouldn't opt-in to everyone else's. I would be able to use their personal information, but I wouldn't be able to share it with anyone else. That is a problem for everyone else, but for me it's a competitive advantage.

It is hard to see how this will all play out:
  • The Chinese government is enthusiastic about enabling companies to monetize personal information. That way the companies fund the government's surveillance infrastructure:
    WeChat, the popular mobile application from Tencent Holdings, is set to become more indispensable in the daily lives of many Chinese consumers under a project that turns it into an official electronic personal identification system.
  • The US has enabled personal information to be monetized, but seems to be facing a backlash from both right and left.
  • The EU seems determined to eliminate, or at least place strict limits on, monetizing of personal information.
Balkanization of the Web seems more likely than decentralization.

If a decentralized Web doesn't achieve mass participation, nothing has really changed. If it does, someone will have figured out how to leverage antitrust to enable it. And someone will have designed a technical infrastructure that fits with and builds on that discovery, not a technical infrastructure designed to scratch the itches of technologists.




ALA to Congress in 2018: Continue to #FundLibraries / District Dispatch

2017 was an extraordinary year for America’s libraries. When faced with serious threats to federal library funding, ALA members and library advocates rallied in unprecedented numbers to voice their support for libraries at strategic points throughout the year*. Tens of thousands of phone calls and emails to Congress were registered through ALA’s legislative action center. ALA members visited Congress in Washington and back home to demonstrate the importance of federal funding.

The challenge to #FundLibraries in 2018 is great: not only is Congress late in passing an FY 2018 budget, it’s time to start working on the FY 2019 budget.

ALA members have a lot to be proud of. Thanks to library advocates, Congress did not follow the administration’s lead in March 2017, when the president made a bold move to eliminate the Institute of Museum and Library Services (IMLS) and virtually all federal library funding. In every single state and congressional district, ALA members spoke up in support for federal library funding. We reminded our senators and representatives how indispensable libraries are for the communities they represent. And our elected leaders listened. By the time FY 2018 officially began in October 2017, the Appropriations Committees from both chambers of Congress had passed bills that maintained (and in the Senate, increased by $4 million) funding for libraries.

Despite our strong advocacy, we have not yet secured library funding for FY 2018. We’re more than three months into the fiscal year, and the U.S. government still does not have an FY 2018 budget. Because the House and Senate have not reconciled their FY 2018 spending bills, the government is operating under a “continuing resolution” (CR) of the FY 2017 budget. What happens when that CR expires on January 19, 2018 is a matter of intense speculation; options include a bipartisan budget deal, another CR, or a possible government shutdown.

While government may seem to be paralyzed, this is no time for library advocates to take a break. The challenge in 2018 is even greater than 2017: not only is Congress late in passing an FY 2018 budget, it’s time to start working on the FY 2019 budget. The president is expected to release his FY 2019 budget proposal in February, and we have no reason to believe that libraries have moved up on the list of priorities for the administration.

2018 is a time for all of us to take our advocacy up a notch. Over the coming weeks, ALA’s Washington Office will roll out resources to help you tell your library story and urge your members of Congress to #FundLibraries. In the meantime, here’s what you can do:

Stay informed. The U.S. budget and appropriations process is more dynamic than ever this year. There is a strong chance that we will be advocating for library funding for FY 2018 and FY 2019 at the same time. Regularly visit DistrictDispatch.org, the Washington Office blog, where we will post the latest information on ALA’s #FundLibraries campaign and sign up for ALA’s Legislative Action Center.

Stay involved. What you show your decision-makers at home is an important part of our year-round advocacy program because it helps supplement the messages that your ALA Washington team is sharing with legislators and their staff on the Hill. Keep showing them how your library – and IMLS funding – is transforming your community. Plan to attend National Library Legislative Day 2018 in Washington (May 7-8) or participate virtually from home.

Stay proud of your influence. Every day you prove that libraries are places of innovation, opportunity and learning – that libraries are a smart, high-return investment for our nation. When librarians speak, decision-makers listen!


*2017: Federal appropriations and library advocacy timeline

March: The president announced in his first budget proposal that he wanted to eliminate IMLS and virtually all federal funding for libraries.
April: ALA members asked their representatives to sign two Dear Appropriator letters sent from library champions in the House to the Chair and Ranking Members of the House Appropriations Subcommittee that deals with library funding (Labor, Health & Human Services, Education and Related Agencies, or “Labor-HHS”). One letter was in support of the Library Services and Technology Act (LSTA), and one letter was for the Innovative Approaches to Literacy program (IAL).

House Results: One-third of the entire House of Representatives, from both parties, signed each Dear Appropriator letter, and nearly 170 Members signed at least one.

May: More than 500 ALA members came to Washington, D.C. to meet their members of Congress for ALA’s 2017 National Library Legislative Day.
Nearly identical Dear Appropriator letters were sent to Senate Labor-HHS Approps Subcommittee leaders.

Senate Results: 45 Senators signed the LSTA letter, and 37 signed the IAL letter.

July: The House Labor-HHS Subcommittee and then the full Committee passed their appropriations bill, which included funding for IMLS, LSTA and IAL at 2017 levels.
September: The House passed an omnibus spending package, which included 12 appropriations bills. The Senate Labor-HHS Subcommittee and then the full Committee passed their appropriations bill, which included a $4 million increase for LSTA above the 2017 level. Unable to pass FY 2018 funding measures, Congress passed a continuing resolution, averting a government shutdown.
December: Congress passed two additional CRs, which run through January 19, 2018.

The post ALA to Congress in 2018: Continue to #FundLibraries appeared first on District Dispatch.

2017: A Year to Remember for OK Nepal / Open Knowledge Foundation

This blog has been cross-posted from the OK Nepal blog as part of our blog series of Open Knowledge Network updates.

Best wishes for 2018 from OK Nepal to all of the Open Knowledge family and friends!!

The year 2017 was one of the best years for Open Knowledge Nepal. We started our journey by registering Open Knowledge Nepal as a non-profit organization under the Nepal Government, and as we reflect on 2017, it has been “A Year to Remember”. We were able to achieve many things, and we promise to continue our hard work to improve the State of Open Data in South Asia in 2018 as well.

Some of the key highlights of 2017 are:

  1. Organizing Open Data Day 2017

For the 5th time in a row, the Open Knowledge Nepal team led the effort of organizing International Open Data Day at Pokhara, Nepal. This year it was a collaborative effort of Kathmandu Living Labs and Open Knowledge Nepal. It was also the first official Open Knowledge Nepal event held outside the Kathmandu Valley.

  2. Launching Election Nepal Portal

On 13th April 2017 (31st Chaitra 2073), a day before Nepalese New Year 2074, we officially released the Election Nepal Portal in collaboration with Code for Nepal and made it open for contribution. Election Nepal is a crowdsourced citizen engagement portal that includes the Local Elections data. The portal will have three major focus areas: visualizations, datasets, and Twitter feeds.

  3. Contributing to Global Open Data Index

On May 2nd, 2017, Open Knowledge International launched the 4th edition of the Global Open Data Index (GODI), a global assessment of open government data publication. Nepal has been part of this global assessment continuously for four years, with lots of ups and downs, and we have been leading it since the very beginning. With 20% openness, Nepal was ranked 69th in the 2016 Global Open Data Index. This year we also helped Open Knowledge International by coordinating the South Asia region, and for the first time we were able to get contributions from Bhutan and Afghanistan.

  4. Launching Local Boundaries

To help journalists and researchers visualize the geographical data of Nepal on a map, we built Local Boundaries, where we share shapefiles of Nepal's federal structure and other administrative units. Local Boundaries brings the detailed geodata of administrative units, with maps of all administrative boundaries defined by the Nepal Government, in an open and reusable format, free of cost. The boundaries are available in two formats (TopoJSON and GeoJSON) and can easily be reused to map local authority data onto OpenStreetMap, Google Maps, Leaflet, or Mapbox interactively.

  5. Launching Open Data Handbook Nepali Version

After a year of work, followed by a series of discussions and consultations, on 7 August 2017 Open Knowledge Nepal launched the first version of the Nepali Open Data Handbook – an introductory guidebook used by governments and civil society organizations around the world as an introduction and blueprint for open data projects. The handbook was translated through the collaborative effort of volunteers and contributors. The Nepali Handbook is now available at http://handbook.oknp.org

  6. Developing Open Data Curriculum and Open Data Manual

To organize the open data awareness program in a structured format and to generate resources which can be further used by civil society and institutions, Open Knowledge Nepal prepared an Open Data Curriculum and Open Data Manual. It contains basic aspects of open data, like an introduction, importance, principles, and application areas, as well as technical aspects of open data like extraction, cleaning, analysis, and visualization of data. It works as a reference and a recommended guide for university students, the private sector, and civil society.

  7. Running Open Data Awareness Program

The Open Data Awareness Program, the first of its kind conducted in Nepal, was run in 11 colleges and 2 youth organizations, reaching more than 335 youths. Representatives of Open Knowledge Nepal visited 7 districts of Nepal with the Open Data Curriculum and the Open Data Manual to train youths on the importance and use of open data.

  8. Organizing Open Data Hackathon

The Open Data Hackathon was organized with the theme “Use data to solve local problems faced by Nepali citizens” at Yalamaya Kendra (Dhokaima Cafe), Patan Dhoka on November 25th, 2017. In this hackathon, we brought students and youths from different backgrounds under the same roof to work collaboratively on different aspects of open data.

  9. Co-organizing Wiki Data-a-thon

On 30th November 2017, we co-organized a Wiki Data-a-thon with Wikimedians of Nepal at Nepal Connection, Thamel, on the occasion of Global Legislative Openness Week (GLOW). During the event, we scraped data from the last CA election and pushed it to Wikidata.

  10. Supporting Asian Regional Meeting

On 2nd and 3rd December 2017, we supported Open Access Nepal in organizing the Asian Regional Meeting on Open Access, Open Education and Open Data, with the theme “Open in Action: Bridging the Information Divide”. Delegates came from countries including the USA, China, South Africa, India, Bangladesh, and Nepal. We managed the Nepali delegates and participants.

2018 Planning

We are looking forward to a prosperous 2018, in which we plan to reach out across South Asia to improve the state of open data in the region through focused open data training, research, and projects. For this, we will collaborate with all possible CSOs working in Asia and will serve as an intermediary for international organizations who want to promote or increase their activities in Asian countries. This will help the Open Knowledge Network in the long run, and we will also get opportunities to learn from each others’ successes and failures, promote each other’s activities, brainstorm collaborative projects, and make the relationships between countries stronger.

Besides this, we will also continue our data literacy work, like the Open Data Awareness Program, to make Nepalese citizens more data-demanding and data-savvy, and we will launch a couple of new projects to help people understand the available data.

To stay updated about our activities, please follow us on our various media channels:

 

MarcEdit Updates (All versions) / Terry Reese

I’ve posted updates for all versions of MarcEdit, including MarcEdit MacOS 3.

MarcEdit 7 (Windows/Linux) changelog:
  • Bug Fix: Export Settings: Export was capturing both MarcEdit 6.x and MarcEdit 7.x data.
  • Enhancement: Task Management: added some continued refinements to improve speed and processing
  • Bug Fix: OCLC Integration: Corrected an issue occurring when trying to post bib records using previous profiles.
  • Enhancement: Linked Data XML Rules File Editor completed
  • Enhancement: Linked Data Framework: Formal support for local linked data triple stores for resolution

One of the largest enhancements is the updated editor to the Linked Data Rules File and the Linked Data Framework. You can hear more about these updates here:

MarcEdit MacOS 3:

Today also marks the availability of MarcEdit MacOS 3. You can read about the update here: MarcEdit MacOS 3 has Arrived!

If you have questions, please let me know.

–tr

MarcEdit MacOS 3 has Arrived! / Terry Reese

MarcEdit MacOS 3 is the latest branch of the MarcEdit 7 family. MarcEdit MacOS 3 represents the next generational update for MarcEdit on the Mac and is functionally equivalent to MarcEdit 7. MarcEdit MacOS 3 introduces the following features:

  1. Startup Wizard
  2. Clustering Tools
  3. New Linked Data Framework
  4. New Task Management and Task Processing
  5. Task Broker
  6. OCLC Integration with OCLC Profiles
  7. OCLC Integration and search in the MarcEditor
  8. New Global Editing Tools
  9. Updated UI
  10. More

 

There are also a couple of things that are currently missing that I’ll be filling in over the next couple of weeks. Presently, the following elements are missing in the MacOS version:

  1. OCLC Downloader
  2. OCLC Bib Uploader (local and non-local)
  3. OCLC Holdings update (update for profiles)
  4. Task Processing Updates
  5. Need to update Editor Functions
    1. Dedup tool – Add/Delete Function
    2. Move tool — Copy Field Function
    3. RDA Helper — 040 $b language
    4. Edit Shortcuts — generate paired ISBN-13
    5. Replace Function — Exact word match
    6. Extract/Delete Selected Records — Exact word match
  6. Connect the search dropdown
    1. Add to the MARC Tools Window
    2. Add to the MarcEditor Window
    3. Connect to the Main Window
  7. Update Configuration information
  8. XML Profiler
  9. Linked Data File Editor
  10. Startup Wizard

Rather than hold the update till these elements are completed, I’m making the MarcEdit MacOS version available now so that users can be testing and interacting with the tooling, and I’ll finish adding these remaining elements to the application. Once completed, all versions of MarcEdit will share the same functionality, save for elements that rely on technology or practices tied to a specific operating system.

Updated UI

MarcEdit MacOS 3 introduces a new UI. While the UI is still reflective of MacOS best practices, it also shares many of the design elements developed as part of MarcEdit 7. This includes new elements like the StartUp wizard with the Fluffy Install agent:

 

The Setup Wizard provides users the ability to customize various application settings, as well as import previous settings from earlier versions of MarcEdit.

 

Updates to the UI

New Clustering tools

MarcEdit MacOS 3 provides MacOS users more tools, more help, more speed…it gives you more, so you can do more.

Downloading:

Download the latest version of MarcEdit MacOS 3 from the downloads page at: http://marcedit.reeset.net/downloads

-tr

Digital Scholarship Resource Guide: Making Digital Resources, Part 2 of 7 / Library of Congress: The Signal

This is part two in a seven-part resource guide for digital scholarship by Samantha Herron, our 2017 Junior Fellow. Part one is available here, and the full guide is available as a PDF download.

Creating Digital Documents


Internet Archive staff members such as Fran Akers, above, scan books from the Library’s General Collections that were printed before 1923. The high-resolution digital books are made available online at www.archive.org within 72 hours of scanning.

The first step in creating an electronic copy of an analog (non-digital) document is usually scanning it to create a digitized image (for example, a .pdf or a .jpg). Scanning a document is like taking an electronic photograph of it–now it’s in a file format that can be saved to a computer, uploaded to the Internet, or shared in an e-mail. In some cases, such as when you are digitizing a film photograph, a high-quality digital image is all you need. But in the case of textual documents, a digital image is often insufficient, or at least inconvenient. In this stage, we only have an image of the text; the text isn’t yet in a format that can be searched or manipulated by the computer (think: trying to copy & paste text from a picture you took on your camera–it’s not possible).

Optical Character Recognition (OCR) is an automated process that extracts text from a digital image of a document to make it readable by a computer. The computer scans through an image of text, attempts to identify the characters (letters, numbers, symbols), and stores them as a separate “layer” of text on the image.
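For readers who want to try this themselves, here is a minimal sketch using the open-source Tesseract engine via the pytesseract Python wrapper. This is just one common tool (the guide does not prescribe any particular software), and "page.png" is a placeholder for a scanned page image:

```python
# A minimal OCR sketch: extract the text "layer" from a scanned page image
# so it can be searched and manipulated like any other digital text.
from PIL import Image       # pip install pillow
import pytesseract          # pip install pytesseract (requires Tesseract installed)

image = Image.open("page.png")   # placeholder path to a scanned page

# Run the OCR engine over the image and get back plain text.
text = pytesseract.image_to_string(image)
print(text[:500])
```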

Example Here is a digitized copy of Alice in Wonderland in the Internet Archive. Notice that though this ebook is made up of scanned images of a physical copy, you can search the full text contents in the search bar. The OCRed text is “under” this image, and can be accessed if you select “FULL TEXT” from the Download Options menu. Notice that you can also download a .pdf, .epub, or many other formats of the digitized book.

Though the success of OCR depends on the quality of the software and the quality of the photograph–even sophisticated OCR has trouble navigating images with stray ink blots or faded type–these programs are what allow digital archives users to not only search through catalog metadata, but through the full contents of scanned newspapers (as in Chronicling America) and books (as in most digitized books available from libraries and archives).

ABBYY FineReader, an OCR software.

As noted, the automated OCR text often needs to be “cleaned” by a human reader. Especially with older, typeset texts that have faded or mildewed or are otherwise irregular, the software may mistake characters or character combinations for others (e.g. the computer might take “rn” to be “m” or “cat” to be “cot” and so on). OCR is often left “dirty,” though, and unchecked OCR prevents comprehensive searches: if one were searching a set of OCRed texts for every instance of the word “happy,” the computer would not return any of the instances where “happy” had been read as “hoppy” or “hoopy” (and conversely, would inaccurately return places where the computer had read “hoppy” to be “happy”). Humans can clean OCR by hand and “train” the computer to interpret characters more accurately (see: machine learning).
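A tiny, invented illustration of why dirty OCR breaks search, using Python's standard difflib module to show both the missed match and a fuzzy-matching workaround:

```python
# Why "dirty" OCR breaks exact search: a made-up list of OCR output words.
import difflib

ocr_words = ["the", "hoppy", "prince", "was", "very", "hoopy", "indeed"]

# An exact search for "happy" finds nothing, even though the page says "happy".
print([w for w in ocr_words if w == "happy"])          # []

# Fuzzy matching can recover likely OCR errors (at the risk of false positives).
print(difflib.get_close_matches("happy", ocr_words))   # ['hoppy', 'hoopy']
```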

In this image of some OCR, we can see some of the errors: the “E”s in the title were interpreted as “Q”s, and in the third line, a “t’” was interpreted by the computer as an “f”.

Example of raw OCR text.

Even with imperfect OCR, digital text is helpful for both close reading and distant reading. In addition to more complex computational tasks, digital text allows users to, for instance, find the page number of a quote they remember, or find out if a text ever mentions Christopher Columbus. Text search, enabled by digital text, has changed the way that researchers use databases and read documents.

Metadata + Text Encoding

Bibliographic search–locating items in a collection–is one of the foundational tasks of libraries. Computer-searchable library catalogs have revolutionized this task for patrons and staff, enabling users to find more relevant materials more quickly.

Metadata is “data about data”. Bibliographic metadata is what makes up catalog records, from the time of card catalogs to our present-day electronic databases. Every item in a library’s holdings has a bibliographic record made up of this metadata–key descriptors of an item that help users find it when they need it. For example, metadata about a book might include its title, author, publishing date, ISBN, shelf location, and so on. In an electronic catalog search, this metadata is what allows users to increasingly narrow their results to materials targeted to their needs: rich, accurate metadata, produced by human catalogers, allows users to find in a library’s holdings, for example, 1. any text material, 2. written in Spanish, 3. about Jorge Luis Borges, 4. between 1990-2000.


Washington, D.C. Jewal Mazique [i.e. Jewel] cataloging in the Library of Congress. Photo by John Collier, Winter 1942. //hdl.loc.gov/loc.pnp/fsa.8d02860

Metadata needs to be in a particular format to be read by the computer. A markup language is a system for annotating text to give the computer instructions about what each piece of information is. XML (eXtensible Markup Language) is one of the most common ways of structuring catalog metadata, because it is legible to both humans and machines.

XML uses tags to label data items. Tags can be embedded inside each other as well. In the example below, <recipe> is the first tag. All of the tags between <recipe> and its end tag </recipe> (<title>, <ingredient list>, and <preparation>) are components of <recipe>. Further, <ingredient> is a component of <ingredient list>.
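Here is a minimal sketch of the recipe markup described above, written out and parsed with Python's standard library. The values are invented, and because XML element names cannot contain spaces, the "ingredient list" tag is written as <ingredientList> here:

```python
# The recipe example described above, written out as XML and parsed by machine.
import xml.etree.ElementTree as ET

xml_doc = """
<recipe>
  <title>Lemon Cake</title>
  <ingredientList>
    <ingredient>flour</ingredient>
    <ingredient>sugar</ingredient>
    <ingredient>lemons</ingredient>
  </ingredientList>
  <preparation>Mix the ingredients and bake for 40 minutes.</preparation>
</recipe>
"""

recipe = ET.fromstring(xml_doc)
print(recipe.find("title").text)                                   # Lemon Cake
print([i.text for i in recipe.findall("ingredientList/ingredient")])
```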

MARC (MAchine Readable Cataloging), a set of standards developed in the 1960s by Henriette Avram at the Library of Congress, is the international standard data format for the description of items held by libraries. Here are the MARC tags for one of the hits from our Jorge Luis Borges search above:

https://catalog.loc.gov/vwebv/staffView?searchId=9361&recPointer=0&recCount=25&bibId=11763921

The three numbers in the left column are “datafields” and the letters are “subfields”. Each field-subfield combination refers to a piece of metadata. For example, 245$a is the title, 245$b is the subtitle, 260$a is the place of publication, and so on. The rest of the fields can be found here.


MARCXML is one way of reading and parsing MARC information, popular because it’s an XML schema (and therefore readable by both human and computer). For example, here is the MARCXML file for the same book from above: https://lccn.loc.gov/99228548/marcxml

The datafields and subfields are now XML tags, acting as ‘signposts’ for the computer about what each piece of information means. MARCXML files can be read by humans (provided they know what each datafield means) as well as computers.
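To see how a computer follows those signposts, here is a minimal sketch using Python's standard library that fetches the MARCXML record linked above (assuming it is still served at that address) and prints the 245 subfields; the namespace URI is the standard MARCXML ("MARC21 slim") namespace:

```python
# Read a MARCXML record and pull out the 245 (title statement) subfields.
import urllib.request
import xml.etree.ElementTree as ET

URL = "https://lccn.loc.gov/99228548/marcxml"
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

with urllib.request.urlopen(URL) as response:
    root = ET.parse(response).getroot()

# Each <datafield> carries a three-digit tag; each <subfield> a one-letter code.
for field in root.iter("{http://www.loc.gov/MARC21/slim}datafield"):
    if field.get("tag") == "245":
        for sub in field.findall("marc:subfield", NS):
            print(sub.get("code"), sub.text)
```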

The Library of Congress has made available their 2014 Retrospective MARC files for public use: http://www.loc.gov/cds/products/marcDist.php

Examples The Library of Congress’s MARC data could be used for cool visualizations like Ben Schmidt’s visual history of MARC cataloging at the Library of Congress. Matt Miller used the Library’s MARC data to make a dizzying list of every cataloged book in the Library of Congress.


An example of the uses of MARC metadata for non-text materials is Yale University’s Photogrammar, which uses the location information from the Library of Congress’ archive of US Farm Security Administration photos to create an interactive map.

TEI (Text Encoding Initiative) is another important example of XML-style markup. In addition to capturing metadata, TEI guidelines standardize the markup of a text’s contents. Text encoding tells the computer who’s speaking, when a stanza begins and ends, and which parts of the text are stage instructions in a play, for example.

Example Here is a TEI file of Shakespeare’s Macbeth from the Folger Shakespeare Library. Different tags and attributes (the further specifiers within the tags) describe the speaker, what word they are saying, in what scene, what part of speech the word is, etc. An encoded text like this can easily be manipulated to tell you which character says the most words in the play, which adjective is used most often across all of Shakespeare’s works, and so on. If you were interested in the use of the word ‘lady’ in Macbeth, an un-encoded plaintext version would not allow you to distinguish between references to “Lady” Macbeth vs. when a character says the word “lady”. TEI versions allow you to do powerful explorations of texts–though good TEI copies take a lot of time to create.
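To make that concrete, here is a deliberately simplified, TEI-flavored sketch in Python. The two-speech snippet is invented and far poorer than the Folger Library's actual encoding, but it shows how tagged speeches make a question like "who speaks the most words?" a few lines of code:

```python
# Count spoken words per speaker in a tiny, invented TEI-flavored snippet.
import xml.etree.ElementTree as ET
from collections import Counter

tei = """
<body>
  <sp who="LadyMacbeth">
    <speaker>LADY MACBETH</speaker>
    <l>Out, damned spot! out, I say!</l>
  </sp>
  <sp who="Macbeth">
    <speaker>MACBETH</speaker>
    <l>I have done the deed. Didst thou not hear a noise?</l>
  </sp>
</body>
"""

root = ET.fromstring(tei)
words_by_speaker = Counter()
for sp in root.findall("sp"):
    spoken = " ".join(l.text for l in sp.findall("l"))
    words_by_speaker[sp.get("who")] += len(spoken.split())

print(words_by_speaker.most_common())
```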

Understanding the various formats in which data is entered and stored allows us to imagine what kinds of digital scholarship are possible with library data.

Example The Women Writers Project encodes texts by early modern women writers in TEI and includes some text analysis tools.

Next week’s installment in the Digital Scholarship Resource Guide will show you what you can do with digital data now that you’ve created it. Stay tuned!

Jobs in Information Technology: January 10, 2018 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

University of Arkansas, Assistant Head of Special Collections, Fayetteville, AR

West Chester University, Electronic Resources Librarian, West Chester, PA

Miami University Libraries, Web Services Librarian, Oxford, OH

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Anxious Anger – or: why does my profession want to become a closed club / Peter Murray

I’m in the Austin, Texas, airport – having just left the closing session of the Re-Think It conference – and I’m wondering what the heck is happening to my chosen profession. When did we turn into an exclusive members-only club with unrealistic demands on professionalism and a secret handshake?

The closing keynote featured current president of the American Library Association (ALA) Jim Neal and past president Julie Todaro on the topic Library Leadership in a Period of Transformation. The pair were to address questions like “What trends are provoking new thinking about the 21st century library?” and “Do 20th century visions and skills still matter?” I expected to be uplifted and inspired. Instead, I came away feeling anxious and angry about their view of the library profession and the premier library association, ALA.

To start with a bit of imposter syndrome exposure: I’ve been working in and around libraries for 25 years, but I don’t follow the internal workings and the politics of the principal librarian professional organization(s) in the United States. I read about the profession — enough to know that primary school librarians are under constant threat of elimination in many school districts and that usage of public libraries, particularly public libraries that are taking an expansive view of their role in the community, is through the roof. I hear the grumbles about how library schools are not preparing graduates of masters programs for “real world” librarianship, but in my own personal experience, I am indebted to the faculty at Simmons College for the education I received there. The pay inequity sucks. The appointment of a professional African American librarian to head the Library of Congress is to be celebrated, and the general lack of diversity in the professional ranks is a point to be worked on. My impression of ALA is of an unnecessarily large and bureaucratic organization with some seriously important bright spots (the ALA Washington Office for example), and that the governance of ALA is plodding and cliquish, but for which some close colleagues find professional satisfaction for their extra energies. I’m pretty much hands off ALA, particularly in the last 15 years, and view it (in the words of Douglas Adams) as Mostly Harmless.

So anxious and angry are unexpected feelings for this closing keynote. I don’t think there is a recording of Jim’s and Julie’s remarks, so here in the airport, the only thing I have to go on are my notes. I started taking notes at the beginning of their talks expecting there would be uplifting ideas and quotes that I could attribute to them as I talk with others about the aspirations of the FOLIO project (a crucial part of my day job). Instead, Julie kicked things off by saying the key task that she works on at her day job is maintaining faculty status for librarians. She emphasized the importance of credentialing and using the usefulness of skills to a library’s broader organization as a measure of value. Jim spoke of the role of library schools and library education to define classes of people: librarians, paraprofessionals, students, and the like, and that the ALA should be at the heart of minting credentials to be used (I think) as gatekeepers into “professional” jobs.

Hogwash. If I were to identify my school of thought, I’d say I come from the big melting pot of professional librarianship. I started in libraries just out of college with a degree in systems analysis working for my alma mater as they were bringing up their first automation system. In the first decade of my career, I worked in three academic libraries — each of which did a fantastic job of instilling in me the raw knowledge and the embedded ethos of the library profession — before choosing to get a library degree. Some of the best librarians I know are not classically trained librarians, and in quiet voices will timidly offer that they do not have a degree. In fact, I met one such person during the Re-Think It conference who I hope becomes a close colleague. They are drawn to the profession from other disciplines and bring a wealth of skills and insights that make the library profession stronger. I’ve hired too many people using the phrase “or equivalent experience” to know that a library degree is not the only gateway to a successful team member. The value of the people I hired came from the skills they earned through experience and their outlook to grow as the library itself wanted to grow.

Julie closed her initial remarks by saying that “success — world domination — begins with attention to detail.” Jim spoke wistfully at the lack of an uber-OCLC that would be at the heart of all library technical services work. Such statements make me think that raving megalomania is a prerequisite for ALA president. I’m not sure this is the profession I want to be in.

Julie and Jim both had statements that I wholeheartedly agree with…at least in content if not delivery. As a profession “we’re going to be more unseen as things go digital” (Julie) and that is a challenge to take on. “Stop strategic planning; it is a waste of time” (Jim) and that our organizations need to be a loosely coupled structure of maverick units to move at a pace demanded by our users. Cooperation between libraries and removing duplicate effort is a key sustainability strategy and one that I take to heart in the FOLIO project. (I’m just not convinced that a national strategy of technical centers is desired, if even possible.)

I had no idea that such a panel could stir up such feelings, but there you go. In many ways, I hope that I misinterpreted the intent of Jim’s and Julie’s remarks, but the forcefulness with which they spoke them and the bodily reaction I had to hearing them leaves little room for doubt.

2018 WHCLIST award accepting nominations / District Dispatch

Those interested in participating in National Library Legislative Day 2018 take note – nominations are now being accepted for the 2018 WHCLIST award. The award, sponsored by the White House Conference on Library and Information Services Taskforce (WHCLIST) and the ALA Washington Office, is open to non-librarian, first-time participants of National Library Legislative Day (NLLD). WHCLIST winners receive a stipend of $300 and two free nights at the Liaison Hotel, where NLLD 2018 will be hosted.

WHCLIST 2017 winner Lori Rivas and Past President Julie Todaro during the 2017 NLLD briefing day events.

Over the years, WHCLIST has been an effective force in library advocacy on the national stage, as well as statewide and locally. To transmit its spirit of dedicated, passionate library support to a new generation of advocates, WHCLIST provided its assets to the ALA Washington Office to fund this award. Both ALA and WHCLIST are committed to ensuring the American people get the best library services possible.

To apply for the WHCLIST Award, nominees must meet the following criteria:

  • The recipient should be a library supporter (trustee, friend, general advocate, etc.), and not a professional librarian (this includes anyone currently employed by a library).
  • Recipient should be a first-time attendee of NLLD.
  • Should have a history of supporting librarians and library work in their community.

Representatives of WHCLIST and the ALA Washington Office will choose the recipient. The winner of the WHCLIST Award will be announced at National Library Legislative Day by the President of the American Library Association.

The deadline for applications is April 2, 2018.

To apply for the WHCLIST award, please submit a completed NLLD registration form; a letter explaining why you should receive the award; and a letter of reference from a library director, school librarian, library board chair, Friend’s group chair, or other library representative to:

Lisa Lindle
Manager, Advocacy and Grassroots Outreach
American Library Association
1615 New Hampshire Ave., NW
First Floor
Washington, DC 20009
llindle@alawash.org

Note: Applicants must register for NLLD and pay all associated costs. Applicants must make their own travel arrangements. The winner will be reimbursed for two free nights in the NLLD hotel in D.C. and will receive the $300 stipend to defray the costs of attending the event.

The post 2018 WHCLIST award accepting nominations appeared first on District Dispatch.

Vocational Awe and Librarianship: The Lies We Tell Ourselves / In the Library, With the Lead Pipe

In Brief

Vocational awe describes the set of ideas, values, and assumptions librarians have about themselves and the profession that result in notions that libraries as institutions are inherently good and sacred, and therefore beyond critique. I argue that the concept of vocational awe directly correlates to problems within librarianship like burnout and low salary. This article aims to describe the phenomenon and its effects on library philosophies and practices so that they may be recognized and deconstructed.

by Fobazi Ettarh

Author’s note: I use “librarians” here very broadly. I am not limiting the term to those who have the MLIS because vocational awe affects those who work in libraries at every level. I would argue that it often affects staff more than it does librarians due to the sociodemographics of people in staff level positions as well as the job precarity that many staff positions hold.

Introduction

On June 1st, Mike Newell wrote about Chera Kowalski and other librarians administering the anti-overdose drug Naloxone (more commonly known as Narcan) to patrons in and around McPherson Square Branch in Philadelphia.1 The article went viral and was shared sixteen thousand times. Since then, Kowalski has saved dozens more lives through the administration of Naloxone. More libraries have since followed Philadelphia’s lead in Narcan training. Representative Patrick Maloney of New York introduced the Life-saving Librarians Act2 giving the Secretary of Health and Human Services the authority to award grants for Naloxone rescue kits in public libraries. To Representative Maloney, and many librarians, training librarians to be literal life-savers makes sense because it serves the needs of patrons in our communities, and society as a whole. In addition to this core value of service, democracy is another value many believe libraries bring to society. Hillary Clinton, at the 2017 ALA Annual Conference in Chicago, commended Kowalski’s work and also stated, “…You are guardians of the First Amendment and the freedom to read and to speak. The work you do is at the heart of an open, inclusive, diverse society [and] I believe that libraries and democracy go hand in hand.”3

On its face, it seems natural that libraries and librarians should celebrate these stories. Indeed, these librarians are working to save the democratic values of society as well as going above and beyond to serve the needs of their neighbors and communities. However, when the rhetoric surrounding librarianship borders on vocational and sacred language rather than acknowledging that librarianship is a profession or a discipline, and as an institution, historically and contemporarily flawed, we do ourselves a disservice.

“Vocational awe” refers to the set of ideas, values, and assumptions librarians have about themselves and the profession that result in beliefs that libraries as institutions are inherently good and sacred, and therefore beyond critique. In this article, I would like to dismantle the idea that librarianship is a sacred calling; thus requiring absolute obedience to a prescribed set of rules and behaviors, regardless of any negative effect on librarians’ own lives. I will do this by demonstrating the ways vocational awe manifests. First, I will describe the institutional mythologies surrounding libraries and librarians. Second, I will dismantle these mythologies by demonstrating the role libraries play in institutional oppression. Lastly, I will discuss how vocational awe disenfranchises librarians and librarianship. By deconstructing some of these assumptions and values so integrally woven into the field, librarianship can hopefully evolve into a field that supports and advocates for the people who work in libraries as much as it does for physical buildings and resources.

Part One: The Mythos of Libraries and Librarianship

Librarianship as Vocation

The word “vocation” (from the Latin vocatio) is defined as “a call, summons,”4 and stemmed from early Christian tradition, where it was held that the calling required a monastic life under vows of chastity, poverty, and obedience.5 Indeed, from its earliest biblical instantiations, a vocation refers to the way one lives in response to God’s call. Although the word has since become used in more secular contexts, my use of the word “vocation” to describe contemporary views of librarianship skews closer to its original religious context, especially concerning the emphasis on poverty and obedience. Many librarians refer to the field of librarianship as a calling.6, 7 Their narratives of receiving the “call” to librarianship often fall right in line with Martin Luther’s description of vocation as the ways a person serves God and his neighbour through his work in the world. The links between librarianship and religious service are not happenstance. Indeed, the first Western librarians were members of religious orders,8 serving the dual functions of copying and maintaining book collections.

The Library as a Sacred Place

The physical space of a library, like its work, has also been seen as a sacred space. One could argue that it is treated like a sanctuary, both in its original meaning (keeper of sacred things and people), and in its more contemporary meaning as a shelter or refuge. Again, the original libraries were actual monasteries, with small collections of books stuffed in choir lofts, niches, and roofs.9 The carrels still prevalent in many libraries today are direct descendants of these religious places. The word “carrel” originally meant “working niche or alcove” and referred to a monastery cloister area where monks would read and write. Reflecting their conjoined history, churches and libraries had similar architectural structures. These buildings were built to inspire awe or grandeur,10, 11 and their materials meant to be treated with care. Even now the stereotypical library is often portrayed as a grandiose and silent space where people can be guided to find answers. The Bodleian Library, one of the oldest and largest libraries in Europe, still requires those who wish to use the library to swear an oath to protect the library: “I hereby undertake not to remove from the Library, or to mark, deface, or injure in any way, any volume, document, or other object belonging to it or in its custody; not to bring into the Library or kindle therein any fire or flame, and not to smoke in the Library; and I promise to obey all rules of the Library.”

Although contemporary architectural designs of libraries may not evoke the same feelings of awe they once did, libraries continue to operate as sanctuaries in the extended definition as a place of safety. Many libraries open their spaces to the disadvantaged and displaced populations in the community such as the homeless or the mentally ill. In the protests and civil unrest following the shooting death of unarmed black teenager Michael Brown in Ferguson, Missouri, the Ferguson Municipal Public Library (FMPL) became a makeshift school for children in the community. When the story went viral, there was an outpouring of books, supplies, and lunches for the children. The hashtag #whatlibrariesdo became a call to action and resulted in a huge spike in PayPal donations to FMPL. In addition, the sign on the library’s door stated, “During difficult times, the library is a quiet oasis where we can catch our breath, learn, and think about what to do next.” In this way, the library becomes a sanctuary threefold, a place where one can listen to the “still, small, voice,”12 a shelter for displaced populations, and a source of humanitarian aid. Since Ferguson, similar responses have occurred in libraries after major events in other areas such as Charlottesville, Virginia. And, in the current sociopolitical climate, much of the discourse surrounding these libraries center them as “safe spaces.”

Librarians as Priests and Saviors

If libraries are sacred spaces, then it stands to reason that its workers are priests. As detailed above, the earliest librarians were also priests and viewed their work as a service to God and their fellow man. Out of five hundred librarians surveyed, ninety-five percent said the service orientation of the profession motivated them to become librarians.13 Another study found that the satisfaction derived by serving people is what new librarians thrive on.14 Similarly, many Christians describe their religious faith as “serving God,” and to do so requires a life spent in service. Christians often reference Mark 10:45 to describe the gravity of a call to service: “For even the Son of Man did not come to be served, but to serve, and to give his life as a ransom for many.” Considering their conjoined history, it should come as no surprise that librarians, just like monks and priests, are often imagined as nobly impoverished as they work selflessly for the community and God’s sake. One study of seasoned librarians noted that, “surprisingly, for a profession as notoriously underpaid as librarianship, not a single respondent mentioned salary” as a negative feature of the profession.15 As with a spiritual “calling,” the rewards for such service cannot be monetary compensation, but instead spiritual absolution through doing good works for communities and society.

If librarians are priests, then their primary job duty is to educate and to save. Bivens-Tatum notes that public libraries “began as instruments of enlightenment, hoping to spread knowledge and culture broadly to the people.”16 The assumption within librarianship is that libraries provide the essential function of creating an educated, enlightened populace, which in turn brings about a better society. Using that logic, librarians who do good work are those who provide culture and enlightenment to their communities. Saint Lawrence, the Catholic Church’s official saint of librarians and archivists, is revered for being dangled over a charcoal fire rather than surrender the Church archives. Today, librarians continue to venerate contemporary “saints” of librarianship. One example is the “Connecticut Four,” four librarians who fought a government gag order when FBI agents demanded library records under the Patriot Act.17 And now Kowalski joins the ranks as a library “saint” through the literal saving of lives with Naloxone. All of these librarians set the expectation that the fulfillment of job duties requires sacrifice (whether that sacrifice is government intimidation or hot coals), and only through such dramatic sacrifice can librarians accomplish something “bigger than themselves.”

Part Two: Locating the Library in Institutional Oppression18

It is no accident that librarianship is dominated by white women.19 Not only were white women assumed to have the innate characteristics necessary to be effective library workers due to their true womanhood,20 characteristics which include missionary-mindedness, servility, and altruism and spiritual superiority and piety, but libraries have continually been “complicit in the production and maintenance of white privilege.”21 These white women librarians in public libraries during the turn-of-the-century U.S. participated in selective immigrant assimilation and Americanization programs, projects “whose purpose was to inculcate European ethnics into whiteness.”22 Librarianship, like the criminal justice system and the government, is an institution. And like other institutions, librarianship plays a role in creating and sustaining hegemonic values, as well as contributing to white supremacy culture. James and Okun define white supremacy culture as the ways that organizations and individuals normalize, enact, and reinforce white supremacy.23 Cultural representations of libraries as places of freedoms (like freedom of access and intellectual freedom), education, and other democratic values do not elide libraries’ white supremacy culture with its built-in disparity and oppression. In fact, each value on which librarianship prides itself is inequitably distributed amongst society. Freedom of access is arguably the most core value of librarianship. It runs throughout the entire Library Bill of Rights and is usually defined as the idea that all information resources provided by the library should be distributed equally, and be equitably accessible to all library users.

There have been, however, vast exceptions to this ideal. Quantitatively, the most significant of these exceptions was the exclusion of millions of African Americans from public libraries in the American South during the years before the civil rights movement.24 White response to desegregation efforts in public libraries varied. While some libraries quietly and voluntarily integrated, other libraries enforced “stand-up integration,” removing all of the tables and chairs from the building to minimize the interaction of the races in reading areas, or shut down the branch entirely. The result of these segregationist practices in libraries was a massive form of censorship, and this history demonstrates that access to materials is often implicated in larger societal systems of (in)equality. This should then hold true for other library values as well.

Protecting user privacy and confidentiality is necessary for intellectual freedom, and both are considered core values in librarianship.25 As mentioned earlier, when the Patriot Act passed in 2001, many librarians fought against handing over patron data, and there is a great deal of history of librarian activism around intellectual freedom. For example, the ALA’s Office for Intellectual Freedom coordinates the profession’s resistance efforts through the Freedom to Read Foundation. There are also multiple roundtables and committees focused on local, state, national,26 and international conflicts over intellectual freedom. However, similarly to freedom of access, there have been exceptions. And, as libraries grapple with justifying their existence, many have turned to gathering large amounts of patron data in order to demonstrate worth. Further, while often resisting government intrusions, libraries also commonly operate as an arm of the state. For example, Lexis-Nexis, a library vendor used in many libraries, is participating in a project to assist in building ICE’s Extreme Vetting surveillance system.27 This system would most likely gather data from public use computers and webpages in public, academic, and private libraries across the nation, and determine and evaluate one’s probability of becoming a positively contributing member of society, or whether they intend to commit criminal or terrorist acts after entering the United States. Although the erosion of privacy is not limited to libraries, other fields do not claim to hold the information needs and inquiries of their constituents quite as dearly.

Part Three: Martyrdom is not a long-lasting career

Up until this point, it might seem like I believe librarians should not take pride in their very important work. Or that librarians who love their work and have a passion for library values possess some inherent flaw. This is not my intent. Rather, I challenge the notion that many have taken as axiomatic that libraries are inherently good and democratic, and that librarians, by virtue of working in a library, are responsible for this “good” work. This sets up an expectation that any failure of libraries is largely the fault of individuals failing to live up to the ideals of the profession, rather than understanding that the library as an institution is fundamentally flawed. Below, I mention the primary ways vocational awe negatively impacts librarians.

Awe

We’ve now uncovered the roots of vocation within librarianship and its allusions to religiosity and the sacred. The vocational metaphor helps us understand cause. However, it is important not to forget awe, which represents the effect. Merriam-Webster defines awe as “an emotion variously combining dread, veneration, and wonder that is inspired by authority or by the sacred.”28 As mentioned earlier, libraries were created with the same architectural design as churches in order to elicit religious awe. Awe is not a comforting feeling, but a fearful and overwhelming one. One of its earliest uses was within the Hindu epic Mahabharata. The god Krishna inspired awe in the protagonist Arjuna and commanded him: “Do works for Me, make Me your highest goal, be loyal-in-love to Me, cut all [other] attachments…”29 A more modern, secular example of awe is the military doctrine of “shock and awe,” characterized as rapid dominance that relies on overwhelming power and spectacular displays of force to paralyze the enemy’s perception of the battlefield and destroy their will to fight. In both cases, awe is used as a method of eliciting obedience from people in the presence of something bigger than themselves.

As part of vocational awe in libraries, awe manifests in response to the library as both a place and an institution. Because the sacred duties of freedom, information, and service are so momentous, the library worker is easily paralyzed. In the face of grand missions of literacy and freedom, advocating for your full lunch break feels petty. And tasked with the responsibility of sustaining democracy and intellectual freedom, taking a mental health day feels shameful. Awe is easily weaponized against the worker, allowing anyone to deploy a vocational purity test in which the worker can be accused of not being devout or passionate enough to serve without complaint.

Burnout

With the expansion of job duties and the expectation of “whole-self” librarianship, it is no surprise that burnout is a common phenomenon within libraries. Harwell defines burnout as prolonged exposure to workplace stressors that drain an employee’s vitality and enthusiasm and often lead to less engagement and productivity.30 And being overworked is not the sole cause of burnout. In a study of academic librarians,31 participants said they are forced to regulate their emotions in their work and that they often feel an incongruity between the emotions they have to show and what they really feel. Librarians who interact with the public on a regular basis must deal with uncooperative and unwilling patrons, patrons who want preferential treatment, and so on. In the memorable phrasing of Nancy Fried Foster, patrons often approach the reference desk looking for a “Mommy Librarian”: someone who can offer “emotional support, reassurance, sociality, answers, and interventions at points of pain or need.”32 That the profession is majority female only exacerbates the gendered expectations placed on these interactions. Ironically, the institutional response to burnout is to call for more “love and passion,” through the vocational impulses noted earlier and a championing of techniques like mindfulness and “whole-person” librarianship.

Under-compensation

“One doesn’t go into librarianship for the money” is a common refrain amongst library workers, and the lack of compensation for library work is not a recent phenomenon. A 1929 report summarized that “improvement in these conditions has not yet reached a point where librarianship may be said to receive proper recognition and compensation.” And in the 2017 Library Journal‘s Placements and Salaries survey, graduates overwhelmingly pointed to underemployment issues as a source of unhappiness, including low wages; lack of benefits; having to settle for part-time, temporary, or nonprofessional positions; or having to piece together two or three part-time positions to support themselves. Librarians’ salaries continue to remain lower than those for comparable jobs in professions requiring similar qualifications and skills. Statistics like these point to the very secular realities of librarians. Librarianship is a job, often paid hourly. It’s not even everyone’s primary job. It comes with sick time and vacation, or it should, and pretending these facts don’t matter because of the importance of the library’s mission only serves the institution itself.

By enforcing awe through the promotion of dramatic and heroic narratives, the institution gains free, or reduced-price, labor. Through vocational mythologies that reinforce themes of sacrifice and struggle, librarianship sustains itself on the labor of librarians who reap only the immaterial benefits of having “done good work.”

Job creep

Job creep refers to the “slow and subtle expansion of job duties” which is not recognized by supervisors or the organization.33 As this article argues, librarians are often expected to place the profession and their job duties before their personal interests. With such expectations, job creep can become a common phenomenon, and its problems manifest in multiple ways. First, what employees originally did voluntarily is no longer considered “extra” but is simply viewed as in-role job performance, which leads to more and more responsibilities and less time in which to accomplish them. Employees who cannot do more than what is in the job description, perhaps for personal or health reasons, are consequently seen as not doing even the minimum, and management may come to believe that workers are not committed to the organization, or its mission, if they don’t take on extra tasks. Returning to Chera Kowalski and all of the other librarians currently training to administer, or already administering, anti-overdose medication: this expectation has gone so far as to create a precedent for Representative Maloney to introduce the Life-saving Librarians Act. No longer are these trainings voluntary “extra” professional development; they will likely soon become part of the expected responsibilities of librarians across the country.

Adding duties like life-or-death medical interventions to already overstrained job requirements is an extreme but very real example of job creep. And with the upholding of librarianship as purely service-oriented and self-sacrificing, what is a librarian to do who may not feel equipped to intervene as a first responder? Or a librarian who is dedicated to, say, a library value of children’s literacy or freedom of information, but because of past traumas cannot cope with regular exposure to loss of life on the job? Librarianship as a religious calling would answer that such a librarian has failed in her duties and demonstrated a lack of the purity required of the truly devout. And without the proper training and institutional support that first responders, social workers, and other clinicians have, librarians, through such job creep, are being asked to do increasingly dangerous emotional and physical labor without the tools and support provided to other professions traditionally tasked with these duties. As newspapers, Clinton, and librarians around the nation celebrate Kowalski and others like her, we must ask whether those voices will chime in to also demand the therapy and medical services typically needed for PTSD and other common ailments of those working in such severe conditions. Do we expect those benefits to materialize, or do we expect librarians to again quietly suffer the consequences of their holy calling, saving society at the expense of their own emotional well-being?

Diversity

By the very nature of librarianship being an institution, it privileges those who fall within the status quo. Librarians who exist outside librarianship’s center can therefore often see more clearly the disparities between the espoused values and the reality of library work. But because vocational awe refuses to acknowledge the library as a flawed institution, when people of color and other marginalized librarians speak out, their accounts are often discounted or erased. Recently, Lesley Williams of Evanston, Illinois, made headlines after being fired from her library over comments on her personal social media accounts that highlighted her library’s hypocrisy regarding the lack of equitable access to information. Although she was advocating for the core library value of equitable access, much like the “Connecticut Four,” her actions were regarded as unprofessional.

As I mentioned earlier, vocational awe ties into the phenomena of job creep and undercompensation in librarianship through the professional norms of service-oriented and self-sacrificing workplaces. But building professional norms around self-sacrifice and low pay also selects for who can become a librarian in the first place. If entry-level library jobs expect prior library experience, often unpaid, then there are class barriers built into the profession. Those who are unable to work for free because of financial instability are forced either to take out loans to cover the expenses accrued or to switch careers entirely. Librarians with heavy family responsibilities are unable to work long nights and weekends. Librarians with disabilities are unable to make librarianship a whole-self career.

Conclusions

Considering the conjoined history of librarianship and faith, it is not surprising that so much of the discourse surrounding librarians and their job duties carries religious undertones. Through the language of vocational awe, libraries have been positioned as a higher authority and the work done in their service as a sacred duty. Vocational awe has developed along with librarianship, from Saint Lawrence to Chera Kowalski. It is so saturated within librarianship that people like Nancy Kalikow Maxwell can write a book, Sacred Stacks: The Higher Purpose of Libraries and Librarianship, that not only details connections between librarianship and faith but concludes by advising librarians to nurture the religious image conferred upon them. The ideals of librarianship are not ignoble, and having an emotional attachment to one’s work is not negative in itself; it is often a valued goal in most careers. What I have tried to do with this article is illustrate that history and expose its problematic underpinnings. Because vocational awe is so endemic and connected to so many aspects of librarianship, the term gives the field a way to name and expose dynamics that are otherwise so amorphous they can be explained or guilted away, much like microaggressions. Through the power of naming, it can hopefully provide a shield librarians can use to protect themselves.

The problem with vocational awe is that the perceived efficacy of one’s work is tied directly to one’s passion (or lack thereof) rather than to the fulfillment of core job duties. If the language around being a good librarian is directly tied to struggle, sacrifice, and obedience, then the more one struggles for their work, the “holier” that work (and institution) becomes. It then becomes less likely that people will feel empowered, or even able, to fight for a healthier workplace. A healthy workplace is one where working around the clock is not seen as a requirement, and where one is sufficiently compensated for the work done, not a workplace where “the worker [is] taken for granted as a cog in the machinery.”34

Libraries are just buildings. It is the people who do the work. And we need to treat these people well. You can’t eat on passion. You can’t pay rent on passion. Passion, devotion, and awe are not sustainable sources of income. The story of Saint Lawrence may be a noble one, but martyrdom is not a long-lasting career, and if all librarians follow in his footsteps, librarianship will cease to exist. You might save a life while wandering outside for lunch, but you deserve the emotional support you will no doubt need after that traumatic event. You may impress your supervisor by working late, but will that supervisor come to expect you to continually neglect your own family’s needs in the service of library patrons? The library’s purpose may be to serve, but is that purpose so holy when it fails to serve those who work within its walls every day? We need to keep asking these questions and demanding answers, and we need to stop treating vocational awe as the only way to be a librarian.


Thanks and Acknowledgments

I’m very much indebted to the amazing and knowledgeable editors at In the Library with the Lead Pipe, and in particular to Sofia Leung and nina de jesus for diligently removing all traces of footnote inconsistency, tense changes, and rogue commas, as well as helping me create the best possible version of this article. I would also like to thank Amy Koester for keeping us all on track when necessary, and also being incredibly flexible when life inevitably got in the way. Finally, I would like to thank my partner for providing copious support as she listened to me whine and provided a steady stress writing diet of Salt and Vinegar Pringles and Uncrustables. Any mistakes left in this document are most definitely my own.


Works Cited

“Core Values of Librarianship of the American Library Association.” Accessed December 4, 2017. http://www.ala.org/advocacy/intfreedom/corevalues

Anonymous. “Who would be a librarian now? You know what, I’ll have a go.” The Guardian. March, 2016.

Biddle, Sam, and Spencer Woodman. “These are the technology firms lining up to build Trump’s ‘Extreme Vetting Program.’” The Intercept. August 7, 2017.

Bivens-Tatum, Wayne. Libraries and the Enlightenment. Library Juice Press, 2012.

Clinton, Hillary Rodham. “Closing General Session” (speech, Chicago, Illinois, June 27, 2017), American Library Association Annual Conference, https://americanlibrariesmagazine.org/wp-content/uploads/2017/06/HRC-Transcript.pdf

de jesus, nina. Locating the Library in Institutional Oppression. In the Library with the Lead Pipe. September 24, 2014.

Emmelhainz, Celia, Seale, Maura, and Erin Pappas. “Behavioral Expectations for the Mommy Librarian: The Successful Reference Transaction as Emotional Labor.” The Feminist Reference Desk: Concepts, Critiques and Conversations, edited by Maria T. Accardi, 27-45. Library Juice Press: Sacramento, CA, 2017. escholarship.org/uc/item/2mq851m0

Emmet, Dorothy. “Vocation.” Journal of Medical Ethics 4, no. 3:(1978): 146-147.

Easwaran, Eknath. The Bhagavad Gita. Tomales: Nilgiri Press, 2009.

Foster, Nancy Fried. “The Mommy Model of Service.” In Studying Students: The Undergraduate Research Project at the University of Rochester, edited by Nancy Fried Foster and Susan Gibbons, 72-78. Chicago: Association of College and Research Libraries, 2007.

Frankenberg, R. White women, race matters: The social construction of whiteness. Minneapolis: University of Minnesota Press, 1993.

Garrison, D. The tender technicians: The feminization of public librarianship, 1876– 1905. Journal of Social History, (1972). 6 no. 2, 131–159.

Garrison, D. Apostles of culture: The public librarian and American society, 1876–1920. Madison: University of Wisconsin Press, 1979.

Graham, Patterson Toby. A Right to Read : Segregation and Civil Rights in Alabama’s Public Libraries, 1900-1965. Tuscaloosa: University of Alabama Press, 2002.

Harwell, Kevin. “Burnout Strategies for Librarians.” Journal of Business & Finance Librarianship 13, no. 3 (2008): 379-90.

Hildenbrand, S. Reclaiming the American library past: Writing the women in. Norwood, NJ: Ablex, 1996.

Houdyshell, Mara, Patricia A. Robles, and Hua Yi. “What Were You Thinking: If You Could Choose Librarianship Again, Would You?” Information Outlook, July 3, 1999, 19– 23.

Hunter, Gregory. Developing and Maintaining Practical Archives. New York: Neal Schuman, 1997.

Inklebarger, Timothy. 2014. “Ferguson’s Safe Haven.” American Libraries 45, no. 11/12: 17-18.

Jacobsen, Teresa L. “Class of 1988.” Library Journal, July 12, 2004, 38– 41.

Jones, Kenneth, and Okun, Tema. Dismantling Racism: A Workbook for Social Change Groups, ChangeWork, 2001 http://www.cwsworkshop.org/PARC_site_B/dr-culture.html

Julien, Heidi, and Shelagh Genuis. “Emotional Labour in Librarians’ Instructional Work.” Journal of Documentation 65, no. 6 (2009): 926-37.

Kaser, David. The Evolution of the American Academic Library Building. Lanham, MD: Scarecrow Press, 1997.

Keltner, D, and Haidt, J. “Approaching awe, a moral, spiritual, and aesthetic emotion.” Cognition and Emotion 17, no. 2 (2003): 297–314.

Linden, M., I. Salo, and A. Jansson. “Organizational Stressors and Burnout in Public Librarians.” Journal of Librarianship and Information Science, 2016.

Maxwell, Nancy Kalikow. Sacred Stacks: The Higher Purpose of Libraries and Librarianship. Chicago: American Library Association, 2006.

Mukherjee, A. K. Librarianship: Its Philosophy and History. Asia Publishing House, 1966.

Newell, Mike. “For these Philly librarians, drug tourists and overdose drills are part of the job” The Inquirer (Philadelphia, PA), June 1, 2017.

Newhouse, Ria, and April Spisak. “Fixing the First Job.” Library Journal, Aug. 2004, 44– 46.

Pawley, Christine, and Robbins, Louise S. Libraries and the Reading Public in Twentieth-Century America. Print Culture History in Modern America. Madison, WI: U of Wisconsin P, 2013.

Peet, Lisa. “Ferguson Library: a community’s refuge: library hosts children, teachers during school closing.” Library Journal, January 1, 2015.

Pevsner, Nikolaus. A History of Building Types. Princeton, NJ: Princeton University Press, 1976.

Pitcavage, Mark. “With Hate in their Hearts: The State of White Supremacy in the United States” last modified July 2015. https://www.adl.org/education/resources/reports/state-of-white-supremacy

Rosen, Ellen. Improving Public Sector Productivity: Concepts and Practice. Thousand Oaks, CA, USA: Sage Publications, 1993.

Rubin, Richard E. Foundations of Library and Information Science. Neal-Schuman Publishers, Inc, 2010.

Scholes, Jefferey. “Vocation.” Religion Compass, 4 (2010): 211–220. doi: 10.1111/j.1749-8171.2010.00215.x

Schlesselman-Tarango, Gina. “The Legacy of Lady Bountiful: White Women in the Library.” Library Trends, (2016) 667–86. Retrieved from: http://scholarworks.lib.csusb.edu/library-publications/34.

Van Dyne and Ellis. “Job creep: A reactance theory perspective on organizational citizenship behavior as overfulfillment of obligations.” In The employment relationship: examining psychological and contextual perspectives, edited by Phillip Appleman. New York: Oxford University Press.

  1. Mike Newell. “For these Philly librarians, drug tourists and overdose drills are part of the job” The Inquirer (Philadelphia, PA), June 1, 2017.
  2. Life-saving Librarians Act, H.R.4259 (2017-2018).
  3. Hillary Rodham Clinton, “Closing General Session” (speech, Chicago, Illinois, June 27, 2017), American Library Association Annual Conference, https://americanlibrariesmagazine.org/wp-content/uploads/2017/06/HRC-Transcript.pdf
  4. Jeffrey Scholes. “Vocation.” Religion Compass, 4 (2010): 211–220.
  5. Dorothy Emmet. “Vocation.” Journal of Medical Ethics, 4, no. 3:(1978): 146-147.
  6. Anonymous. “Who would be a librarian now? You know what, I’ll have a go.” The Guardian. March, 2016.
  7. Jamie Baker. Librarianship As Calling. The Ginger (Law) Librarian. March, 6.
  8. Richard E. Rubin. Foundations of Library and Information Science. Neal-Schuman Publishers, Inc. (2010) p. 36.
  9. A. K. Mukherjee. Librarianship: Its Philosophy and History. Asia Publishing House (1966) p. 88.
  10. Nikolaus Pevsner. A History of Building Types. (Princeton, NJ: Princeton University Press, 1976) p. 98.
  11. David Kaser. The Evolution of the American Academic Library Building. (Lanham, MD: Scarecrow Press, 1997) p. 5-16, 47-60.
  12. 1 Kings 19:11-13, KJV.
  13. Mara Houdyshell, Patricia A. Robles, and Hua Yi. “What Were You Thinking: If You Could Choose Librarianship Again, Would You?” Information Outlook, July 3, 1999, 19–23.
  14. Ria Newhouse and April Spisak. “Fixing the First Job.” Library Journal, Aug. 2004, 44– 46.
  15. Teresa L. Jacobsen “Class of 1988.” Library Journal, July 12, 2004, 38–41.
  16. Wayne Bivens-Tatum. Libraries and the Enlightenment. Library Juice Press, 2012.
  17. Doe v. Gonzalez, 386 F. Supp. 2d 66 (D.Conn. 2005).
  18. nina de jesus. Locating the Library in Institutional Oppression. In the Library with the Lead Pipe. September 24, 2014.
  19. Gina Schlesselman-Tarango. “The Legacy of Lady Bountiful: White Women in the Library.” Library Trends, (2016) 667–86. Retrieved from: http://scholarworks.lib.csusb.edu/library-publications/34.
  20. See Garrison, 1972, 1979; Hildenbrand, 1996.
  21. Todd Honma. Trippin’ over the color line: The invisibility of race in library and information studies. InterActions: UCLA Journal of Education and Information Studies, (2005)1 no.2, 1–26. Retrieved from http://escholarship.org/uc/item/4nj0w1mp
  22. R. Frankenberg. White women, race matters: The social construction of whiteness. Minneapolis: University of Minnesota Press, 1993.
  23. Kenneth Jones and Tema Okun. Dismantling Racism: A Workbook for Social Change Groups. ChangeWork, 2001 http://www.cwsworkshop.org/PARC_site_B/dr-culture.html
  24. Toby Patterson Graham. A Right to Read : Segregation and Civil Rights in Alabama’s Public Libraries, 1900-1965. (Tuscaloosa: University of Alabama Press, 2002).
  25. “Core Values of Librarianship of the American Library Association.” Accessed December 4, 2017. http://www.ala.org/advocacy/intfreedom/corevalues
  26. E.g. ALA Intellectual Freedom Committee (IFC), Intellectual Freedom Round Table (IFRT), Freedom to Read Foundation (FTRF), etc.
  27. Sam Biddle and Spencer Woodman. “These are the technology firms lining up to build Trump’s ‘Extreme Vetting Program.’” The Intercept. August 7, 2017.
  28. Merriam-Webster’s Collegiate Dictionary. 11th ed. Springfield, MA: Merriam-Webster, 2003. Continually updated at https://www.merriam-webster.com/.
  29. Eknath Easwaran. The Bhagavad Gita. Tomales: Nilgiri Press, 2009.
  30. Kevin Harwell. “Burnout Strategies for Librarians.” Journal of Business & Finance Librarianship 13, no. 3 (2008): 379-90.
  31. Julien, Heidi, and Shelagh Genuis. “Emotional Labour in Librarians’ Instructional Work.” Journal of Documentation 65, no. 6 (2009): 926-37.
  32. Emmelhainz, Celia, Seale, Maura, and Erin Pappas. “Behavioral Expectations for the Mommy Librarian: The Successful Reference Transaction as Emotional Labor.” The Feminist Reference Desk: Concepts, Critiques and Conversations, edited by Maria T. Accardi, 27-45. Library Juice Press: Sacramento, CA, 2017. escholarship.org/uc/item/2mq851m0
  33. Van Dyne, and Ellis, “Job creep: A reactance theory perspective on organizational citizenship behavior as overfulfillment of obligations,” in The employment relationship: examining psychological and contextual perspectives ed. Phillip Appleman (New York : Oxford University Press).
  34. Ellen Rosen. Improving Public Sector Productivity: Concepts and Practice, (Thousand Oaks, CA, USA: Sage Publications, 1993) p. 139.


Offline Sites with React / Ed Summers

This post contains some brief notes about building offline, static web sites using React, in order to further the objectives of minimal computing. But before I go there, first let me give you a little background…

The Lakeland Community Heritage Project is an effort to collect, preserve, and interpret the heritage and history of African Americans who have lived in the Lakeland community of Prince George’s County, Maryland since the late 19th century. This effort has been led by members of the Lakeland community, with help from students from the University of Maryland working with Professor Mary Sies. As part of the work they’ve collected photographs, maps, deeds, and oral histories and published them in an Omeka instance at lakeland.umd.edu. As Mary is wrapping up the UMD side of the project she has become increasingly interested in making these resources available and useful to the community of Lakeland, rather than leaving them embedded in a software application that is running on servers owned by UMD.

Sneakernet

Recently MITH has been in conversation with LCHP to help explore ways that the data stored in Omeka could be meaningfully transferred to the Lakeland community. This has involved first getting the Omeka site back online, since it partially fell offline as the result of some infrastructure migrations at UMD. We have also been collecting and inventorying the disk drives and transfer devices that students have used as they collected content over the years.

One relatively small experiment I tried recently was to extract all the images and their metadata from Omeka to create a very simple visual display of the images that could run in a browser without an Internet connection. The point was to provide a generous interface from which community members attending a meeting could browse content quickly and potentially take it away with them. Since this meeting was in an environment without stable network access, it was important for the content to be browsable without an Internet connection. We also wanted to be able to put the application on a thumb drive and move it around as a zip file, which could ultimately allow us to make it available to community members without the files needing to be kept online at a particular location on the Internet. Basically we wanted the site to be on the Sneakernet instead of the Internet.

Static Data

The first step was getting all the data out of Omeka. This was a simple matter with Omeka’s very clean, straightforward and well documented REST API. Unfortunately, LCHP was running an older version of Omeka (v1.3.1) that needed to be upgraded to 2.x before the API was available. The upgrade process itself leapfrogged a bunch of versions so I wasn’t surprised to run into a small snag, which I was fortunately able to fix myself (go team open source).

I wrote a small utility named nyaraka that talks to Omeka and downloads all the items (metadata and files), as well as the collections they are a part of, and places them on the filesystem. This was a fairly straightforward process because Omeka’s database ensures the one-to-many relationships between a site and its collections, items, and files, which means they can be written to the filesystem in a structured way (a rough sketch of this kind of harvester follows the listing below):

lakeland.umd.edu
lakeland.umd.edu/site.json
lakeland.umd.edu/collections
lakeland.umd.edu/collections/1
lakeland.umd.edu/collections/1/collection.json
lakeland.umd.edu/collections/1/items
lakeland.umd.edu/collections/1/items/1
lakeland.umd.edu/collections/1/items/1/item.json
lakeland.umd.edu/collections/1/items/1/files
lakeland.umd.edu/collections/1/items/1/files/1
lakeland.umd.edu/collections/1/items/1/files/1/fullsize.jpg
lakeland.umd.edu/collections/1/items/1/files/1/original.jpg
lakeland.umd.edu/collections/1/items/1/files/1/file.json
lakeland.umd.edu/collections/1/items/1/files/1/thumbnail.jpg
lakeland.umd.edu/collections/1/items/1/files/1/square_thumbnail.jpg
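For anyone curious about what such a harvester involves, here is a minimal sketch in Node.js. It is not nyaraka itself: it assumes the standard Omeka Classic 2.x API endpoints (/api/items and /api/files?item=ID) and the file_urls structure they return, it assumes Node 10+ with the node-fetch 2.x package, and it flattens the layout rather than nesting items under collections.

// harvest.js -- a rough sketch of an Omeka harvester, not nyaraka itself.
// Assumes the Omeka Classic 2.x REST API (/api/items, /api/files?item=ID),
// Node 10+ (for fs.mkdirSync's recursive option) and node-fetch 2.x.
const fs = require('fs')
const path = require('path')
const fetch = require('node-fetch')

const site = 'https://omeka.example.org'   // hypothetical Omeka install
const outDir = 'omeka.example.org'

async function getJSON(url) {
  const resp = await fetch(url)
  return resp.json()
}

async function saveBinary(url, dest) {
  const resp = await fetch(url)
  fs.writeFileSync(dest, await resp.buffer())
}

async function harvest() {
  // Page through every item, 50 at a time, until an empty page comes back.
  for (let page = 1; ; page++) {
    const items = await getJSON(`${site}/api/items?page=${page}&per_page=50`)
    if (items.length === 0) break

    for (const item of items) {
      const itemDir = path.join(outDir, 'items', String(item.id))
      fs.mkdirSync(itemDir, { recursive: true })
      fs.writeFileSync(path.join(itemDir, 'item.json'), JSON.stringify(item, null, 2))

      // Save each attached file's metadata and its image derivatives.
      const files = await getJSON(`${site}/api/files?item=${item.id}`)
      for (const file of files) {
        const fileDir = path.join(itemDir, 'files', String(file.id))
        fs.mkdirSync(fileDir, { recursive: true })
        fs.writeFileSync(path.join(fileDir, 'file.json'), JSON.stringify(file, null, 2))
        for (const [name, url] of Object.entries(file.file_urls || {})) {
          if (url) await saveBinary(url, path.join(fileDir, `${name}.jpg`))
        }
      }
    }
  }
}

harvest().catch(console.error)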

This post was really meant to be about building a static site with React, and not about extracting data from Omeka. But this filesystem data is kinda like a static site, right? It was really just laying the foundation for the next step of building the static site application, since I didn’t really want to keep downloading content from the API as I was developing the application. Having all the content local made it easier to introspect with command line tools like grep, find and jq as I was building the static site.

React

Before I get into a few of the details here’s a short video that shows what the finished static site looked like:

Lakeland Static Site Demo from Ed Summers on Vimeo.

You can see that content is loaded dynamically as the user scrolls down the page. Lots of content is presented at once in random orderings each time to encourage serendipitous connections between items. Items can also be filtered based on type (buildings, people and documents). If you want to check it out for yourself download and unzip this zip file and open up the index.html in the directory that is created. Go ahead and turn off your wi-fi connection so you can see it working without an Internet connection.

When building static sites in the past I’ve often reached for Jekyll but this time I was interested in putting together a small client side application that could be run offline. This shouldn’t be seen as an either/or situation: it would be quite natural to create a static site using Jekyll that embeds a React application within it. But for the sake of experimentation I wanted to see how far I could go just using React.

Ever since I first saw Twitter’s personal archive download (aka Grailbird) I’ve been thinking about the potential of offline web applications to function as little time capsules for web content that can live independently of the Internet. Grailbird lets you view your Twitter content offline in a dynamic web application where you can view your tweets over time. Over the past few years the minimal computing movement has been gaining traction in the digital humanities community, as a way to ethically and sustainably deliver web content without needing to make promises about keeping it online forever, or 25 years (whichever comes first).

React seemed like a natural fit because I’ve been using it for the past year on another project. React offers a rich ecosystem of tools, plugins and libraries like Redux for building complex client side apps. The downside of using React is that it is not as easy to set up out of the box, or to change over time, if you aren’t an experienced software developer. With Jekyll it’s not simple either, but at least it’s relatively easy to dive in and edit HTML and CSS. On the plus side for React, if you really want to deliver an unchanging, finished (static) artifact, then maybe these things don’t really matter so much?

At any rate it seemed like a worthwhile experiment. So here are a few tidbits I learned when bending React to the purposes of minimal computing:

Static Database

The first is to build a static representation of your data. Many React applications rely on an external REST API being available. This type of dependency is an obvious no-no for minimal computing applications, because an Internet connection is needed, and someone needs to keep the REST API service up and running constantly, which requires infrastructure and costs money.

One way of getting around this is to take all the structured data your application needs and bundle it up as a single file. You can see the one I created for my application here. As you can see it contains metadata for all the photographs expressed as JSON. But the JSON itself is part of a global JavaScript variable declaration, which allows it to be loaded by the browser without relying on an asynchronous HTTP call. Browsers need to limit the ability of JavaScript to fetch files from the filesystem for security reasons. This JavaScript file is loaded immediately by your web browser when it loads the index.html, and the app can access it globally as window.DATA. Think of it like a static, read-only, in-memory database for your application. The wrapping HTML will look as simple as something like this:
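Something along these lines, assuming the webpack build produces a file named bundle.js, data.js lives in the static directory, and the app mounts into a div with the id app (the exact names depend on your build configuration):

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Lakeland Images</title>
  </head>
  <body>
    <div id="app"></div>
    <!-- data.js simply assigns the item metadata to a global,
         e.g. window.DATA = { items: [ ... ] } -->
    <script src="static/data.js"></script>
    <!-- the webpack-built React application -->
    <script src="bundle.js"></script>
  </body>
</html>

Note that both script paths are relative, which matters for reasons discussed below.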

Update: Another more scalable approach to this suggested by Alex Gil after this post went live, is to try using an in browser database like PouchDB. When combined with Lunr for search this could make for quite a rich and extensible data layer for minimal computing browser apps.
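To give a flavor of what that might look like, here is a rough sketch in browser JavaScript. It assumes the PouchDB and Lunr libraries have been loaded, and that window.DATA has the items array described above; the field names are assumptions for the sake of the example.

// Load the bundled metadata into an in-browser PouchDB database,
// then build a Lunr index over it for full-text search.
const db = new PouchDB('lakeland')

async function setup() {
  await db.bulkDocs(window.DATA.items.map(item => ({
    _id: String(item.id),
    title: item.title,
    description: item.description
  })))

  const { rows } = await db.allDocs({ include_docs: true })

  // Lunr 2.x builder: index the title and description fields.
  const index = lunr(function () {
    this.ref('_id')
    this.field('title')
    this.field('description')
    rows.forEach(row => this.add(row.doc))
  })

  return { db, index }
}

// Usage: search the index, then pull the full records back out of PouchDB.
setup().then(async ({ db, index }) => {
  const hits = index.search('school')
  const docs = await Promise.all(hits.map(hit => db.get(hit.ref)))
  console.log(docs)
})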

Static Images

Similarly, the image files need to be available locally. I took all the images and saved them into a directory I named static, and named the file using a unique item id (from Omeka) which allowed the metadata and data to be conceptually linked:

lakeland-images/static/{omeka-id}/fullsize.jpg

My React application has an Image component that simply renders the image along with a caption using the <figure>, <img>, and <figcaption> elements.

class Image extends Component {
  render() {
    // Link each image to its item page; the image itself is loaded with a
    // relative path from the local static/ directory, keyed by Omeka item id.
    return (
      <Link to={'/item/' + this.props.item.id + '/'}>
        <figure className={style.Image}>
          <img src={'static/' + this.props.item.id + '/fullsize.jpg'} />
          <figcaption>
            {this.props.item.title}
          </figcaption>
        </figure>
      </Link>
    )
  }
}

It’s pretty common to use webpack to build React applications, and the copy-webpack-plugin will handle copying the files from the static directory into the distribution directory during the build.
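As a rough illustration, the relevant piece of the webpack configuration might look something like this, using the plugin’s array-of-patterns API from that era; the entry and output names here are assumptions rather than the project’s actual settings.

// webpack.config.js (sketch)
const path = require('path')
const CopyWebpackPlugin = require('copy-webpack-plugin')

module.exports = {
  entry: './src/index.js',
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: 'bundle.js'
  },
  plugins: [
    // Copy static/ (data.js plus the images keyed by Omeka id) into the
    // build directory, alongside index.html and bundle.js.
    new CopyWebpackPlugin([{ from: 'static', to: 'static' }])
  ]
}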

URLs

You may have noticed that in both cases the data.js and images are being loaded using a relative URL (without a leading slash, or a protocol/hostname). This is a small but important detail that allows the application to be moved around from zip file, to thumb drive to disk drive, without needing paths to be rewritten. The images and data are loaded relative to where the index.html was initially loaded from.

In addition many React applications these days use the new History API in modern browsers. This lets your application have what appear to be normal URLs structured with slashes, which you can manage with react-router. However slash URLs are problematic in an offline static site for a couple of reasons. The first is that there is no server, so you can’t tweak it to respond to any request with the HTML file I included above that bootstraps your application. This means that if you reload a page you will get a 404 Not Found.

The other problem is that while the History API works fine for an offline application, the relative links to bundle.js, data.js and the images will break because they will be relative to the new URL.

Fortunately there is a simple solution to this: manage the URLs the way we did before the History API, using hash fragments. So instead of:

file:///lakeland-images/index.html/items/123

you’ll have:

file:///lakeland-images/index.html#/items/123

This way the browser will look to load static/data.js from file:///lakeland-images/ instead of file:///lakeland-images/index.html/items/. Luckily react-router lets you simply import and use createHashHistory in your application initialization, and it will write these URLs for you.
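A minimal version of that wiring might look like the sketch below, assuming react-router v4 with its companion history package; Home and Item are hypothetical components standing in for the real ones.

// index.js (sketch) -- boot the app with hash-based URLs so that it keeps
// working when opened straight from a file:// path or a thumb drive.
import React from 'react'
import ReactDOM from 'react-dom'
import { Router, Route, Switch } from 'react-router-dom'
import { createHashHistory } from 'history'

import Home from './Home'   // hypothetical components for this sketch
import Item from './Item'

const history = createHashHistory()

ReactDOM.render(
  <Router history={history}>
    <Switch>
      <Route exact path="/" component={Home} />
      <Route path="/item/:id" component={Item} />
    </Switch>
  </Router>,
  document.getElementById('app')
)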

Minimal?

It’s important to reiterate that this was an experiment. We don’t know if the LCHP is interested in us developing this approach further. But regardless I thought it was worth just jotting down these notes for others considering similar approaches with React and minimal computing applications.

I’ll just close by saying in some ways it seems counter-intuitive to refer to a React application as an example of minimal computing. As Alex Gil says:

In general we can say that minimal computing is the application of minimalist principles to computing. In reality, though, minimal computing is in the eye of the beholder.

After working with React off and on for a couple of years it still seems quite complicated, especially when you throw Redux into the mix. Assembling the boilerplate needed to get started is still tedious, unless you use create-react-app, which is a smart way to start. By comparison it’s much easier to get Jekyll out of the box and start using it. But, if the goal is truly to deliver something static and unchanging, then perhaps this up-front investment in time is not so significant.

Static sites, thus conceived, ultimately rely on a web browser, which is an insanely complicated piece of code. With a few exceptions (e.g. Flash) browsers have been pretty good at maintaining backwards compatibility as they’ve evolved along with the web. JavaScript is so central to a functioning web that it’s difficult to imagine it going away. So really this approach is a bet on the browser and the web remaining viable. Whatever happens to the web and the Internet we can probably rely on some form of browser continuing to exist as functioning software, either natively or in some sort of emulator, for a good while to come…or at least longer than the typical website is kept online.


Many thanks to Raff Viglianti, Trevor Muñoz and Stephanie Sapienza who helped frame and explore many of the ideas expressed in this post.

How to Handle Meltdown and Spectre for Solr / Lucidworks

Recent news reports have revealed that most Intel processors are vulnerable to a security flaw that allows processes to read the memory of other processes running on the same Intel CPU. At this time it appears that some of the flaws affect AMD CPUs as well, but the more serious, performance-impacting flaws do not. Because cloud providers use Intel CPUs and virtualization to support multiple clients on the same physical hardware, this is especially troubling for multi-tenant hosting environments such as Amazon Web Services. However, Google has stated that it believes it has successfully mitigated the flaw in its Google Cloud Platform, although some user patches are required.

It is important to understand the risk of this bug, but not to overestimate it. To operate, the exploit already needs to be running inside software on your computer. It does not, for instance, allow anyone on the internet to take control of your server over HTTP. But if there is an existing vulnerability, this flaw makes it worse, since the compromised process might be used to read memory belonging to other processes.

There are already operating system patches out for this bug. Unfortunately, the patch requires creating a software isolation layer that has a significant impact on performance; estimates put the impact at between 5% and 30%. Every piece of software running in application space may be affected. The impact will vary, and each application will need to be performance and load tested.

Some customers running on their own internal hardware may decide, given the vector of the exploit and the performance cost of the fix, to delay applying it. Other customers running in more vulnerable environments or with more specific security concerns may need to apply it and deal with the performance implications.

Fortunately for Lucidworks customers, Fusion and its open source Solr core are especially adept at scale. For high-capacity systems, the most cost-effective solution may be to add additional nodes to absorb the increased overhead of the operating system. Additionally, by tuning the Fusion pipeline it may be possible to reduce the number of calls needed to perform queries, or to parallelize some calls, compensating for the lost performance through optimization in other areas.

In either case Lucidworks is here for our customers. If you’re considering applying the fix, please reach out to your account manager to understand ways that we can help mitigate any issues you may have. If you do not currently have or know your account manager, please file a support request or use the Lucidworks contact us page.

The post How to Handle Meltdown and Spectre for Solr appeared first on Lucidworks.

Improving Digital Equity: The civil rights priority libraries and school technology leaders share / District Dispatch

This blog post, written by Consortium for School Networking (CoSN) CEO Keith Krueger, is the first in a series of occasional posts contributed by leaders from coalition partners and other public interest groups that ALA’s Washington Office works closely with. Whatever the policy – copyright, education, technology, to name just a few – we depend on relationships with other organizations to influence legislation, policy and regulatory issues of importance to the library field and the public.

Learning has gone digital. Students access information, complete their homework, take online courses, and communicate using technology and the internet.

The Consortium for School Networking is a longtime ally of the American Library Association on issues related to education and telecommunications, especially in advocating for a robust federal E-rate program.

Digital equity is one of today’s most pressing civil rights issues. Robust broadband and Wi-Fi, both at school and at home, are essential learning tools. Addressing digital equity – sometimes called the “homework gap” – is core to CoSN’s vision, and a shared value with our colleagues at ALA.

That is why the E-rate program has been so important for the past 20 years, connecting classrooms and libraries to the internet. Two years ago the Federal Communications Commission (FCC) modernized E-rate by increasing funding by 60 percent and focusing it on broadband and Wi-Fi. This action has made a difference. CoSN’s 2017 Infrastructure Survey found that the majority of U.S. school districts (85 percent) are fully meeting the FCC’s short-term goal for broadband connectivity of 100 Mbps per 1,000 students.

While this is tremendous progress, we have not completed the job. Recurring costs remain the most significant barrier for schools in their efforts to increase connectivity. More than half of school districts reported that none of their schools met the FCC’s long-term broadband connectivity goal of 1 Gbps per 1,000 students. The situation is more critical in rural areas where nearly 60 percent of all districts receive one or no bids for broadband services. This lack of competition remains a significant burden for rural schools.

And learning doesn’t stop at the school door. CoSN has demonstrated how school systems can work with mayors, libraries, the business community and other local partners to address digital equity. In CoSN’s Digital Equity Action Toolkit, we show how communities are putting Wi-Fi on school buses, mapping out free Wi-Fi homework access from area businesses, loaning Wi-Fi hotspots to low-income families and working to ensure that broadband offerings are not redlining low-income neighborhoods. A great example is the innovative partnership that Charlotte Mecklenburg Schools has established with the Mecklenburg Library System in North Carolina. CoSN also partners with ALA to fight the FCC’s misguided plans to roll back the Lifeline broadband offerings.

Of course, the most serious digital gap is ensuring that all students, regardless of their family or zip code, have the skills to use these new tools effectively. We know that digital literacy and citizenship are essential skills for a civil society and safer world. Librarians have always been on the vanguard of that work, and our education technology leaders are their natural allies. Learn about these efforts and what more we can do by attending CoSN/UNESCO’s Global Symposium on Educating for Digital Citizenship in Washington, DC on March 12, 2018.

As we start 2018, I am often asked to predict the future. What technologies or trends are most important in schools? CoSN annually co-produces the Horizon K-12 report, and I strongly encourage you to read the 2017 Horizon K-12 Report to see how emerging technologies are impacting learning in the near horizons.

However, my top recommendation is that education and library leaders focus on “inventing” the future. Working together, let’s focus on enabling learning where each student can personalize their education – and where digital technologies close gaps rather than make them larger.

Keith R. Krueger, CAE, has been CEO of the Consortium for School Networking for the past twenty-three years. He has a strong background in working with libraries, including serving as the first Executive Director of the Friends of the National Library of Medicine at NIH.

The post Improving Digital Equity: The civil rights priority libraries and school technology leaders share appeared first on District Dispatch.