Planet Code4Lib

News Regarding the Future of LITA after the Core Vote / LITA

Dear LITA members,

We’re writing about the implications of LITA’s budget for the upcoming 2020-21 fiscal year, which starts September 1, 2020. We have reviewed the budget and affirmed that LITA will need to disband if the Core vote does not succeed.

Since the Great Recession, membership in professional organizations has been declining consistently. LITA has followed the same pattern and as a result, has been running at a deficit for a number of years. Each year, LITA spends more on staff, events, equipment, software, and supplies than it takes in through memberships and event registrations. We were previously able to close our budgets through the use of our net asset balance which is, in effect, like a nest egg for the division.

Of course, that could not continue indefinitely. Our path towards sustainability has culminated in the proposal to form Core: Leadership, Infrastructure, Futures. The new division would come with significant efficiencies for staff and programming that make it possible to continue our activities.

Starting at Midwinter 2020, we began learning about significant financial challenges at the ALA level. ALA used the net asset balances of the divisions to meet current expenses. In the assessment of LITA’s leadership, ALA is unlikely to be able to restore these funds. In forming a budget for the 2020-21 fiscal year, it is clear to LITA staff and leadership that we cannot operate without the net asset balance; it is not possible to compose a budget that both breaks even and includes the activities that make LITA, LITA.

Therefore we anticipate that fiscal year 2020-21 will be a transitional year: either completing the merger with ALCTS and LLAMA to create the new division Core, or phasing out LITA entirely. In the event that all three divisions do not vote in favor of Core, we will spend the year working through an orderly wind-down that includes transferring some key activities to other ALA units, where possible.

We regret to be the bearers of news that will surely sadden any of us who value LITA’s spirit and the friendships we’ve built within our division. We hope this will not be a surprise to longtime members, and we want to be as transparent as possible as we approach this important vote starting on March 9.

The LITA Board is fully supportive of the exciting possibility of Core, and we hope you are as well. We know this has been a disorienting time while we work towards a possible merger. We want you to know that we value your work, and we’re doing our best to make sure it will continue as part of Core. We believe the vote will pass and that we’ll spend the next year working together to expand support and resources for you to do even more in the future. We will share more details about the path we’ve chosen (including finances) after the vote.

Sincerely,
Emily G. Morton-Owens, LITA President
Bohyun Kim, LITA Past President
Evviva Weinraub Lajoie, LITA Vice President/President Elect
Lindsay Cronk, LITA Director at Large
Tabatha Farney, LITA Director at Large
Jodie Gambill, LITA Division Councilor
Amanda L. Goodman, LITA Director at Large
Margaret Heller, LITA Director at Large
Hong Ma, LITA Director at Large
Berika Williams, LITA Director at Large
Topher Lawton, LITA Parliamentarian

Infrastructure for heritage institutions – first results / Lukas Koster

Permalink: https://purl.org/cpl/3017


In July 2019 I published the post Infrastructure for heritage institutions, in which I described our plans to realise a “coherent and future proof digital infrastructure” for the Library of the University of Amsterdam. Time to look back: how far have we come? And time to look forward: what’s in store for the near future?

 

Ongoing activities

I mentioned three “currently ongoing activities”: 

  • Monitoring and advising on infrastructural aspects of new projects
  • Maintaining a structured dynamic overview of the current systems and dataflow environment
  • Communication about the principles, objectives and status of the programme

So, what is the status of these ongoing activities?

 

Monitoring and advising

We have established a small dedicated “governance” team that is charged with assessing, advising and monitoring large and small projects that impact the digital infrastructure, and with creating awareness among stakeholders about the role of the larger core infrastructure team. The person managing the institutional project portfolio has agreed to take on the role of governance team coordinator, which is a perfect combination of responsibilities.

 

Dynamic overview

Until now we have had a number of unrelated instruments for describing infrastructural components and relations, each with different objectives. The two main ones are a huge static diagram that tries to capture all internal and external systems and relationships without detailed specifications, and the dynamic DataMap repository describing all dataflows between systems and datastores. The latter uses a home-made extended version of the Data Flow Diagram (DFD) methodology, as described in an earlier post, Analysing library data flows for efficient innovation (see also my ELAG 2015 presentation Datamazed). In that post I already mentioned Archimate as a possible future way to go, and this is exactly what we are going to do now. DFD is OK for describing dataflows, but not for documenting the entire digital infrastructure, including digital objects, protocols, etc. Archimate version 3.1 can be used for digital and physical structures as well as for data, application and business structures. We are currently deciding on the templates and patterns to use (Archimate is very flexible and can be used in many different ways). The plan is to collaborate with the central university architecture community and document our infrastructure in the tool that they are already using.

 

Communication

This series of posts is one of the ways we communicate about the programme externally. For internal communication we have set up a section on the university library intranet.

 

Projects

I mentioned thirteen short term projects. How are they coming along? For all projects we are adopting a pragmatic approach: use what is already available, set realistic short term goals, and avoid solutions that are too complicated.

 

Object PIDs

I did some research into persistent identifiers (PIDs) and documented my findings in an internal memo. It consists of a general theoretical description of PIDs (what they are, their administration and use, a characterization of existing PID systems, the object types PIDs can be assigned to, and linked data requirements), and a practical part describing current practices, the pros and cons of existing PID systems, a list of requirements, practical considerations and recommendations. A generic English version of this document is published in Code4Lib Journal issue 47 under the title “Persistent identifiers for heritage objects”.

In January 2020 we started testing the different scenarios that are possible for implementing PIDs.
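To make the PID principle concrete before the test results are in, here is a minimal, hypothetical sketch (in Python; not part of the memo or of any chosen scenario) of the core mechanism every PID system shares: an opaque identifier is minted once and resolved, through a lookup the institution maintains, to whatever the object’s current location happens to be. The ARK-style prefix, the NAAN and the resolver hostname are placeholders, not our actual namespace.

# Minimal sketch of PID minting and resolution; illustrative only.
# The NAAN "99999" and the resolver hostname are placeholder values.
import uuid

NAAN = "99999"                                    # placeholder Name Assigning Authority Number
RESOLVER_BASE = "https://resolver.example.org/"   # placeholder resolver service
pid_registry = {}                                 # PID -> current location; the only thing kept forever

def mint_pid():
    """Mint a new opaque ARK-style identifier."""
    return f"ark:/{NAAN}/{uuid.uuid4().hex[:10]}"

def bind(pid, current_url):
    """Record (or update) the current location of the object behind a PID."""
    pid_registry[pid] = current_url

def resolve(pid):
    """Return the URL a resolver would redirect to for this PID."""
    return pid_registry[pid]

pid = mint_pid()
bind(pid, "https://objects.example.org/viewer/abc123")   # the location may change later; the PID does not
print(RESOLVER_BASE + pid, "->", resolve(pid))

The point of the exercise is that only the identifier and the lookup table need to be persistent; the underlying platforms and URLs can change without breaking citations.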

 

Object platform/Digital objects/IIIF

The library is currently carrying out an exploratory study into options for a digital object platform. There have been conversations with a number of institutions similar to the university library (university libraries, archives, museums) to discuss their existing and future solutions. There will also be discussions with vendors, including Ex Libris, the supplier of Alma, our current central collection management platform. This study will result in a recommendation in the first half of 2020, after which an implementation project will be started.

The Digital Objects and IIIF topics are part of this comprehensive project, and obviously Alma is considered as a candidate. The library has already developed an IIIF test environment as a separate pilot project.
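As a rough illustration of what such an IIIF environment exposes, the sketch below builds IIIF Image API request URLs for a digitised object. The server hostname and the object identifier are invented, but the {identifier}/{region}/{size}/{rotation}/{quality}.{format} pattern is the standard Image API syntax.

# Sketch: constructing IIIF Image API 2.x request URLs (hostname and identifier are placeholders).
IIIF_BASE = "https://iiif.example.org/image/v2"   # hypothetical image server

def iiif_image_url(identifier, region="full", size="full", rotation=0, quality="default", fmt="jpg"):
    """Build an Image API URL: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}"""
    return f"{IIIF_BASE}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

def iiif_info_url(identifier):
    """The info.json document describes the sizes and tiles the server offers for an image."""
    return f"{IIIF_BASE}/{identifier}/info.json"

print(iiif_image_url("map-1867-sheet-03"))                              # full image
print(iiif_image_url("map-1867-sheet-03", size="1000,"))                # 1000px-wide derivative
print(iiif_image_url("map-1867-sheet-03", region="2048,1024,512,512"))  # a detail region
print(iiif_info_url("map-1867-sheet-03"))

Any IIIF-compliant viewer can work against URLs like these, which is what makes the pilot useful independently of the eventual platform choice.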

 

Licensing

We are taking the first steps in setting up a dedicated team for deciding on default standard licences and regulations for collections, metadata and digital objects, per type when relevant. Furthermore, the team will assess dedicated licences and regulations in cases where the default ones do not apply. We are currently thinking along the lines of the Public Domain Mark or Creative Commons CC0 for content that is not subject to copyright, CC-BY or CC-BY-SA for copyrighted content, and rightsstatements.org for content for which the copyright status is unclear.

For metadata, the corresponding Open Data Commons licences are being considered. For the part of the metadata in our central cataloguing system Alma that originates in WorldCat, OCLC recommends applying an ODC-BY licence, per the OCLC WorldCat Rights and Responsibilities. For the remaining metadata we are considering a public domain mark or ODC-BY.

If it is feasible, the assigned licences and regulations for objects may be added to the metadata of the relevant digital objects in the collection management systems, both as text and in machine-readable form. In any case, the licences and regulations will be published in all online end user interfaces and in all machine/application interfaces.
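As a sketch of what “machine-readable form” could look like in practice (the property choices and identifiers below are my own illustrations, not decisions that have been made), the snippet attaches both a human-readable rights statement and a licence URI to an object record as simple JSON-LD using dcterms:license.

# Sketch: recording the assigned licence both as text and as a machine-readable URI.
# The dcterms:license property choice and the object URI are illustrative assumptions.
import json

LICENSE_URIS = {
    "CC0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "CC-BY": "https://creativecommons.org/licenses/by/4.0/",
    "CC-BY-SA": "https://creativecommons.org/licenses/by-sa/4.0/",
    "UND": "http://rightsstatements.org/vocab/UND/1.0/",   # rightsstatements.org: copyright undetermined
}

record = {
    "@context": {"dcterms": "http://purl.org/dc/terms/"},
    "@id": "https://objects.example.org/id/abc123",                  # placeholder object URI/PID
    "dcterms:title": "Atlas sheet 3",
    "dcterms:rights": "Creative Commons Attribution 4.0 (CC-BY)",    # human-readable text
    "dcterms:license": {"@id": LICENSE_URIS["CC-BY"]},               # machine-readable URI
}
print(json.dumps(record, indent=2))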

 

Metadata set/Controlled vocabularies

Both defining the standard minimum required metadata for various use cases and selecting and implementing controlled vocabularies/authority files are aspects of data quality assurance. Both issues will be addressed simultaneously.

Defining the metadata sets required for the various use cases and target audiences is a long term process, which will have to be carried out in smaller batches focused on specific audiences and use cases. Moreover, because of the large number of catalogued objects it is practically impossible to extend and enrich the metadata for all objects manually. New items are catalogued using RDA Core Elements, which define the minimum elements required for describing resources by type. There is also a huge base of legacy metadata records with many non-standard descriptions. Hopefully automated tools can be employed in the future for improving and extending metadata for specific use cases. This will be explored in the Data enrichment and Georeference projects.

 

For the controlled vocabularies, by contrast, there are short term practical solutions available. Libraries have been using authority files for cataloguing for a long time, especially for people and organisations (creators, contributors) and subjects. In most cases, besides the string values, the identifiers of the terms in the authority files have also been recorded in our cataloguing records. In the past we used national authority files for the Netherlands; currently we are using international authority files: the Library of Congress Name Authority File and FAST. Fortunately, all these authority files have been published on the web as open and linked data, with persistent URIs for each term. This means that we can dynamically construct and publish these persistent URIs through human and machine readable interfaces for all vocabulary terms that we have registered. We are currently testing the options.
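A minimal sketch of the option being tested: because the authority identifiers are already stored next to the string values, persistent URIs for LCNAF and FAST terms can be derived directly from them. The example record and its identifiers are invented placeholders; the URI patterns are the ones published by id.loc.gov and OCLC FAST.

# Sketch: deriving persistent linked data URIs from authority identifiers
# already present in cataloguing records (the record and ids below are placeholders).
AUTHORITY_URI_PATTERNS = {
    "lcnaf": "http://id.loc.gov/authorities/names/{id}",   # LC Name Authority File
    "fast": "http://id.worldcat.org/fast/{id}",            # OCLC FAST
}

def authority_uri(scheme, identifier):
    """Turn a stored authority identifier into its published persistent URI."""
    return AUTHORITY_URI_PATTERNS[scheme].format(id=identifier.strip())

record = {
    "creator": {"label": "Doe, Jane, 1800-1870", "scheme": "lcnaf", "id": "n00000000"},  # placeholder id
    "subject": {"label": "Cartography", "scheme": "fast", "id": "0000000"},              # placeholder id
}
for field, term in record.items():
    print(field, term["label"], "->", authority_uri(term["scheme"], term["id"]))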

 

Data enrichment/Georeference

The Data enrichment and Georeference projects are closely related to the Open Maps pilot, in which a set of digitised maps from a unique 19th century atlas serves as a practical test bed and implementation for the Digital Infrastructure programme. As such, these projects do not contribute to the improvement of the digital infrastructure in the narrow sense. However, they demonstrate the extended possibilities of such an improved digital infrastructure. Both projects are directly related to all other projects defined in the programme, and offer valuable input for them.

Essentially both projects are aimed at creating additional object metadata on top of the basic metadata set, targeted at specific audiences, derived from the objects themselves.

An initial draft action plan was created for both projects, to be executed simultaneously, in collaboration with a university digital humanities group and the central university ICT department. For the Data enrichment project the idea is to use OCR, text mining and named entity recognition methods to derive valuable metadata from the various types of text printed on maps. The Georeference project is targeted at obtaining various georeferences for the maps themselves and for selected items on the maps. All new data should have resolvable identifiers/URIs so that they can be used as linked data.
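To make the enrichment idea slightly more concrete, here is a hypothetical sketch of the kind of pipeline step we have in mind: named entity recognition over OCR output to pull out candidate place names that the Georeference project could then look up. spaCy and the model name are my assumptions for the illustration, not decisions of the project.

# Hypothetical sketch: extracting candidate place names from OCR'd map text with spaCy.
# spaCy and the "en_core_web_sm" model are assumptions, not project decisions.
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
ocr_text = ("Map of the province of North Holland, surveyed 1867. "
            "Printed in Amsterdam; depths in the Zuiderzee given in Amsterdam feet.")
doc = nlp(ocr_text)

# GPE (geopolitical entities) and LOC (other locations) are candidate georeferences;
# DATE entities feed the basic descriptive metadata instead.
places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
print("candidate place names:", places)   # to be geocoded and given resolvable URIs later
print("candidate dates:", dates)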

 

Other projects

The remaining projects (ETL Hub, Digitisation Workflow, Access/Reuse, Linked Data) are dependent on the other activities carried out in the programme.

 

An Extract-Transform-Load platform for streamlining data flows and data conversions can only be effectively implemented when a more or less complete overview of the system and dataflow environment is available, and the extent of Alma’s role as central data hub has become clear. Moreover, the standardisation of the basic metadata set, controlled vocabularies and persistent identifiers is required. In the end it could also turn out that an ETL Hub is not necessary at all.

 

The Digitisation Workflow can only be defined when a Digital Object Platform is up and running, and digital object formats are sorted out. It is also dependent on functioning PID and License workflows and established metadata sets and controlled vocabularies.

 

Access and Reuse of metadata and digital objects depend on the availability of a Digital Object Platform, standardised metadata sets, controlled vocabularies, PIDs and licence policies.

 

Last but not least, linked data can only be published once PIDs in the form of linked data URIs, open licences, standardised metadata sets and controlled vocabularies with URIs are in place. For Linked Data an ETL Hub might be required.
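Once those building blocks are in place, publishing linked data is largely a matter of serialising what is already there. Below is a minimal sketch (using rdflib, which is my own assumption, not a project choice) that combines a PID-based object URI, an authority URI and a licence URI, all placeholders, into one publishable RDF description.

# Minimal sketch: serialising one object description as RDF, reusing the kinds of
# PID, authority and licence URIs discussed above (all values are placeholders).
# pip install rdflib
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
g.bind("dcterms", DCTERMS)
obj = URIRef("https://objects.example.org/id/abc123")   # object PID as linked data URI
g.add((obj, DCTERMS.title, Literal("Atlas sheet 3")))
g.add((obj, DCTERMS.creator, URIRef("http://id.loc.gov/authorities/names/n00000000")))  # placeholder authority URI
g.add((obj, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.serialize(destination="object.ttl", format="turtle")  # Turtle file ready for publication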

 
 

French Translation of the 2013 Levels of Preservation Now Available / Digital Library Federation

While all of the recent news has been about the Levels of Digital Preservation V2.0, we have an update to share about the 2013 version of the Levels.  Our colleagues from the National Library of France (BnF) have translated the original Levels of Preservation into French.  

Stay tuned for the release of the Levels of Digital Preservation V2.0 in French which more colleagues from the French watch group on formats are currently drafting!  

If you are interested in translating the Levels of Digital Preservation V2.0 into another language, please contact us at ndsa.digipres@gmail.com.

 

La traduction en français des Niveaux de Préservation Numérique (NDSA Levels) 2013 est maintenant disponible

Bien que les nouvelles récentes concernent essentiellement les Niveaux de Préservation Numérique V2.0, nous avons une mise à jour à partager sur la version 2013 des Niveaux. Nos collègues de la Bibliothèque nationale de France (BnF) ont traduit la première version des Niveaux de Préservation Numérique en français.

Restez à l’écoute pour la sortie de Niveaux de Préservation Numérique V2.0 en français : les collègues de la cellule nationale de veille sur les formats en ont déjà une première ébauche !

Si vous souhaitez traduire les niveaux de préservation numérique V2.0 dans une autre langue, veuillez nous contacter à ndsa.digipres@gmail.com.


Get that bread / Coral Sheldon-Hess

Quite some time ago, I committed to combining my professional blog posts here with the ones I make about crafts and recipes and things like that. And yet, I sometimes go multiple years without making a craft post, when I never ever go multiple years without crafting something. I’m just bad about documenting it, not only here, but on Ravelry.

It’s easier when I feel like I have something worth documenting! In this case, I want to tell you about my take on the New Artisan Bread in Five Minutes a Day recipe. It’s worth saying at the outset: they make their main recipe available online without the purchase of the book. There are enough useful tips and other recipes in the book that I still found the new edition worth buying (at Half Price Books, admittedly), on top of the original I already had, but your mileage may vary. Also worth saying: they tell you how to do their bread in a Dutch oven online, too. (And not, that I’ve found, in any of their books.)

The things I want to add to the discussion: 1) a couple of hacks for people who, like me, do not have a kitchen fan that vents outdoors (I promise I’ll explain why this matters) and who like at least a little bit of whole grain in their bread, plus 2) photos of some of the steps they don’t show as clearly in the book, that I felt like I had to figure out on my own. I’m still experimenting (always!), but I have a base recipe/approach that I like and that I think is good enough to share.

My approach

The first part of the recipe starts their way: if it isn’t your first batch, keep back just a little bit of dough from the previous batch (a couple of ounces is all I bother with), to help it develop the sourdoughy flavor more quickly, and add it to your 3 cups of ~100 degree F water, 1 Tbsp of yeast, and 1 Tbsp of Kosher salt. Like they say, you’ll want 32 ounces of flour, but here’s the mix I use: 10-12 ounces of whole grain (my last batch was 4 ounces of oat flour and 9 ounces of King Arthur whole wheat flour) and the rest is King Arthur Unbleached All-Purpose flour. The brand of the white flour is only important if you’re going to put in more than, oh, maybe about a cup of whole grain (and especially if any of it is non-gluten-containing, like oat): King Arthur has a fairly high gluten content, enough that you can get a nicely-textured bread, even with a third of your flour being whole wheat, without requiring any additives.

Side note about the mixing process: for anyone reading this who, like me, has hand or shoulder pain, a KitchenAid stand mixer (not the very smallest one, but yes, their standard model) with the paddle (not the dough hook!) will mix all of it up really nicely. My mixer was a hand-me-down from my mom, who didn’t use it anymore, and even if I only ever use it for bread, it’s worth the counter space! (I use it for other things, though.) I made the previous version of this recipe for over a year, back when I lived in Alaska, but I eventually stopped because I dreaded the process of mixing the dough by hand. The mixer has been a game changer.

Once it’s all mixed, you put the dough into a large container (if it’s designed to be airtight, use a nail to poke a hole in the top of the container, or else leave the lid loose so that gases can escape) and let it rise until it flattens out.

After it’s done rising, it goes in the fridge. When you’re ready to bake, you take it back out. It’ll look like this:

You’ll dust with flour, like I did above, and then you’ll reach in, pull out about a grapefruit-sized piece of dough (you’re aiming for 4 loaves per batch), probably using a serrated knife to cut it so you can form a ball.

pulling up a ball of very wet dough

The book talks at more length than I will about the forming of the ball and the process of building a “gluten cloak,” but I’ll show you what the before and after look like, anyway:

You’ll note the silicone baking mat. You can’t use that if you’re following their directions, because they bake at 500 degrees F; I do not. If you go by the main recipe, you’ll put cornmeal on the pizza peel before you lay the loaf down to rise, and you’ll place a pizza stone and an empty metal pan into the oven when you go to preheat; I don’t do that anymore, because I use the Dutch oven method. But also! If you go by their Dutch oven recipe, you’ll still set your oven to 500 degrees, which I argue is unnecessary. At 500 degrees, a number of things burn, and unless you have a very good kitchen fan, you’ll have a smoky kitchen. My fan vents right back into my kitchen, so a 500 degree oven is a non-starter for me.

I did use parchment paper for a while, instead of the baking mat. In some ways, it was easier? But for some reason, my loaves started sticking to the paper after a couple of weeks of doing it that way. When I realized my little cheapie baking mat (an Aldi find!) was rated for above 450, I tried that, and I like it a lot! It sure beats scraping paper off the bottom of a loaf of bread. (I specify that it’s rated for “above 450,” rather than 450 flat, because I haven’t used an oven thermometer in this oven yet. I don’t actually know how hot it gets, so I wanted to leave wiggle room, just in case. Oven temperatures are really just an approximation, did you know that? Yeah, fun fact.)

Anyway, yes. I let the dough sit for at least 15 minutes, and then I pre-heat the oven and the Dutch oven at 450 degrees F. Once I know they’re fully heated (30 minutes), I sprinkle more flour on top of the dough, slash it with a serrated knife, and gently place it into the very hot Dutch oven.

At 450 F, I let the bread steam for 17 minutes, instead of the recipe’s suggested 15. I think it looks really nice at that point:

A steamy boi.

And then I bake it for about 15 more minutes with the lid off, until it’s good and crispy on the outside. It makes a crackling sound as it cools! And you have to let it cool completely before you can cut it, unless you’re eating the whole thing right away. (I won’t judge.) It messes it up if you cut it while it’s hot and then try to store it.

Now, my method isn’t perfect. The bottom gets more brown than I want, in a couple of spots where the baking mat touches the bottom of the Dutch oven. (It doesn’t taste burnt, but it sure looks bad.) Also, I burned the heck out of my forearm on the Dutch oven, one time, so I recommend buying very long baking gloves that are rated up to 500 degrees. So. Maybe there’ll be an updated post where I’ve figured out the burning issue and … still wear tall gloves, honestly, because hot ceramic-covered metal hurts a lot.

Just a side note, but if you find you need to use up the dough and don’t need more loaves of bread, it makes a pretty good pizza!

a large, unevenly-shaped pizza with broccoli on it

Trading for images / Galen Charlton

Let’s search a Koha catalog for something that isn’t at all controversial:

Screenshot of results from a catalog search of a Koha system for "anarchist"

What you search for in a library catalog ought to be only between you and the library — and that, only briefly, as the library should quickly forget. Of course, between “ought” and “is” lies the Devil and his details. Let’s poke around with Chrome’s DevTools:

  1. Hit Control-Shift-I (on Windows)
  2. Switch to the Network tab.
  3. Hit Control-R to reload the page and get a list of the HTTP requests that the browser makes.

We get something like this:

Screenshot of Chrome DevTool's Network tab showing requests made when doing the "anarchist" Koha catalog search.

There’s a lot to like here: every request was made using HTTPS rather than HTTP, and almost all of the requests were made to the Koha server. (If you can’t trust the library catalog, who can you trust? Well… that doesn’t have an answer as clear as we would like, but I won’t tackle that question here.)

However, the two cover images on the result’s page come from Amazon:

https://images-na.ssl-images-amazon.com/images/P/0974458902.01.TZZZZZZZ.jpg
https://images-na.ssl-images-amazon.com/images/P/1849350949.01.TZZZZZZZ.jpg

What did I trade in exchange for those two cover images? Let’s click on the request and see:

:authority: images-na.ssl-images-amazon.com
:method: GET
:path: /images/P/0974458902.01.TZZZZZZZ.jpg
:scheme: https
accept: image/webp,image/apng,image/*,*/*;q=0.8
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
dnt: 1
pragma: no-cache
referer: https://catalog.libraryguardians.com/cgi-bin/koha/opac-search.pl?q=anarchist
sec-fetch-dest: image
sec-fetch-mode: no-cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36

Here’s what was sent when I used Firefox:

Host: images-na.ssl-images-amazon.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0
Accept: image/webp,*/*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Referer: https://catalog.libraryguardians.com/cgi-bin/koha/opac-search.pl?q=anarchist
DNT: 1
Pragma: no-cache

Amazon also knows what my IP address is. With that, it doesn’t take much to figure out that I am in Georgia and am clearly up to no good; after all, one look at the Referer header tells all.

Let’s switch over to using Google Books’ cover images:

https://books.google.com/books/content?id=phzFwAEACAAJ&printsec=frontcover&img=1&zoom=5
https://books.google.com/books/content?id=wdgrJQAACAAJ&printsec=frontcover&img=1&zoom=5

This time, here are the request headers in Chrome:

:authority: books.google.com
:method: GET
:path: /books/content?id=phzFwAEACAAJ&printsec=frontcover&img=1&zoom=5
:scheme: https
accept: image/webp,image/apng,image/*,*/*;q=0.8
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
dnt: 1
pragma: no-cache
referer: https://catalog.libraryguardians.com/
sec-fetch-dest: image
sec-fetch-mode: no-cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36
x-client-data: CKO1yQEIiLbJAQimtskBCMG2yQEIqZ3KAQi3qsoBCMuuygEIz6/KAQi8sMoBCJe1ygEI7bXKAQiNusoBGKukygEYvrrKAQ==

and in Firefox:

Host: books.google.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0
Accept: image/webp,*/*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Referer: https://catalog.libraryguardians.com/
DNT: 1
Pragma: no-cache
Cache-Control: no-cache

On the one hand… the Referer now contains only the base URL of the catalog. I believe this is due to a difference in how Koha figures out the correct image URL. When using Amazon for cover images, the ISBN of the title is normalized and used to construct a URL for an <img> tag. Koha doesn’t currently set a Referrer-Policy, so the default of no-referrer-when-downgrade is used and the full referrer is sent. Google Books cover image URLs cannot be directly constructed like that, so a bit of JavaScript queries a web service and gets back the image URLs, and for reasons that are unclear to me at the moment, doesn’t send the full URL as the referrer. (Cover images from OpenLibrary are fetched in a similar way, but the full Referer header is sent.)
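As an illustration of that first pattern (this is not Koha’s actual code, which is Perl), the normalization step amounts to converting an ISBN-13 to an ISBN-10 and interpolating it into the Amazon URL template seen above:

# Illustration only -- not Koha's implementation.
# Shows how a cover URL like the two Amazon URLs above can be built from an ISBN.
def isbn13_to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 to an ISBN-10 by recomputing the check digit."""
    digits = isbn13.replace("-", "")
    if len(digits) != 13 or not digits.startswith("978"):
        raise ValueError("only 978-prefixed ISBN-13s map to ISBN-10")
    core = digits[3:12]
    check = (11 - sum(int(d) * (10 - i) for i, d in enumerate(core)) % 11) % 11
    return core + ("X" if check == 10 else str(check))

def amazon_cover_url(isbn):
    """Interpolate the (normalized) ISBN-10 into Amazon's cover image URL pattern."""
    digits = isbn.replace("-", "")
    isbn10 = isbn13_to_isbn10(digits) if len(digits) == 13 else digits
    return f"https://images-na.ssl-images-amazon.com/images/P/{isbn10}.01.TZZZZZZZ.jpg"

print(amazon_cover_url("978-1-84935-094-5"))  # -> .../images/P/1849350949.01.TZZZZZZZ.jpg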

As a side note, the x-client-data header sent by Chrome to books.google.com is… concerning.

There are some relatively simple things that can be done to limit leaking the full referring URL to the likes of Google and Amazon, including

  • Setting the Referrer-Policy header via web server configuration or meta tag to something like origin or origin-when-cross-origin.
  • Setting referrerpolicy for <script> and <img> tags involved in fetching book jackets.

This would help, but only up to a point: fetching https://books.google.com/books/content?id=wdgrJQAACAAJ&printsec=frontcover&img=1&zoom=5 still tells Google that a web browser at your IP address has done something to fetch the book jacket image for The Anarchist Cookbook. Suspicious!

What to do? Ultimately, if we’re going to use free third-party services to provide cover images for library catalogs, our options to do so in a way that preserves patron privacy boil down to:

  • Only use sources that we trust to not broadcast or misuse the information that gets sent in the course of requesting the images. The Open Library might qualify, but ultimately isn’t beholden to any particular library that uses its data.
  • Proxy image requests through the library catalog server. Evergreen does this in some cases, and it wouldn’t be much work to have Koha do something similar (a rough sketch of the idea follows this list). It should be noted that Coce does not help in the case of Koha, as all it does is proxy image URLs, meaning that it’s still the user’s web browser fetching the actual images.
  • Figure out a way to obtain local copies of the cover images and serve them from the library’s web server. Sometimes this is necessary anyway for libraries that collect stuff that wasn’t commercially sold in the past couple decades, but otherwise this is a lot of work.
  • Do nothing and figure that Amazon and Google aren’t trawling through their logs to correlate cover image retrieval with potential reading interests. I actually have a tiny bit of sympathy for that approach — it’s not beyond the realm of possibility that cover image access logs are simply getting ignored, unlike, say, direct usage data from Kindle or Google Books — but ostriches sticking their heads in the sand are not known as a good model for due diligence.
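To make the proxying option concrete, here is a hypothetical sketch (Flask and requests in Python, rather than Koha’s Perl) of the general shape such a cover proxy could take: the catalog server fetches the image itself, so the patron’s browser never contacts Amazon or Google, and neither a Referer header nor the patron’s IP address leaks.

# Hypothetical sketch of a cover-image proxy -- not code from Koha, Evergreen, or Coce.
# The patron's browser only ever requests /cover/<isbn> from the catalog server.
# pip install flask requests
from flask import Flask, Response, abort
import requests

app = Flask(__name__)
UPSTREAM = "https://images-na.ssl-images-amazon.com/images/P/{isbn}.01.TZZZZZZZ.jpg"

@app.route("/cover/<isbn>")
def cover(isbn):
    if not isbn.isalnum():   # crude guard against junk identifiers
        abort(400)
    # The only party the upstream provider sees is the library's server,
    # with no Referer and no patron IP address. Caching would cut traffic further.
    upstream = requests.get(UPSTREAM.format(isbn=isbn), timeout=5)
    if upstream.status_code != 200:
        abort(404)
    return Response(upstream.content,
                    mimetype=upstream.headers.get("Content-Type", "image/jpeg"))

if __name__ == "__main__":
    app.run()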

Non-free book jacket and added content services are also an option, of course — and at least unlike Google and Amazon, it’s plausible that libraries could insist on contracts (with teeth) that forbid misuse of patron information.

My thanks to Callan Bignoli for the tweet that inspired this ramble.

Marginalia in the mail / Hugh Rundle

I quit Twitter a little over a week ago. I'm very good at quitting Twitter - I've done it several times. When I first started ausglam.space as an alternative, I realised that whilst I wanted to get out of the toxic and addictive time suck that is the Twitter experience, I still wanted somewhere to share interesting things I've read or encountered. This was the genesis of the marginalia series on this blog. A trend of the last couple of years has been an upsurge in personal email newsletters, and I've started to see the attraction. I subscribe to Dan Cohen's Humane Ingenuity and Audrey Watters' Hack Education News Weekly, and Seb Chan also has a newsletter called Fresh and new. The common denominator is that this is regular, relatively brief writing, expected to be delivered to and read in an email client. I've read various things about what is driving the move to email newsletters, but for myself it's fairly simple: I don't want to be on Twitter, I don't expect people to subscribe to my blog via RSS, and my Marginalia posts are to a fair degree contextual to a particular time. They're also conceptually quite different to a 'normal' blog post, which for me tends to be more of an essay arguing a point. In Marginalia I might be arguing one or more points, but primarily it's a vehicle to say "Check out this interesting thing, here is my off-the-cuff response to it".

All of which is to say I'm moving Marginalia to an email newsletter. You can sign up at marginalia.hugh.run, and since I'm publishing it using write.as you can also subscribe via RSS and Mastodon/ActivityPub by following @share@marginalia.hugh.run

You can read the first (eighth) post at marginalia.hugh.run/marginalia-8-reality-kindness-and-the-machine

Folding / Ed Summers

This is just a mental bookmark for the metaphor of "folding" for studying algorithmic processes, which I encountered in Lee et al. (2019). The motivation for studying folding is to refocus attention on how algorithms are relationally situated rather than simply examining them as black boxes that need to be opened, made transparent, and (hopefully) made accountable and just. The paper's main argument centers on dispelling the idea that algorithms can be made "fair" in their instrumentation without looking at how they are related to non-algorithmic systems.

Speaking about the extensive attention that has been paid to research into the bias of algorithms (Burrell, 2016; Diakopoulos, 2016; Eubanks, 2018; Noble, 2018; Pasquale, 2015), Lee et al. point out:

One line of reasoning in this critical research implies that if only algorithms were designed in the optimal and correct way, they would generate results that were objective and fair. It is precisely this rule-bound and routinised nature of algorithms that seems to promise unbiased and fair sentencing. We find this reasoning misleading as it hides the multitude of relations algorithms are part of and produce. In a sense, the very notion of biased algorithms is linked to an objectivist understanding of how knowledge is produced, and worryingly sidesteps decades of research on the practices of knowledge production. In this article, we instead want to stress that algorithms cannot offer uniquely objective, fair and logical alternatives to the social structures of our worlds.

This is a pretty strong point that they are making in the last sentence. One reason why this paper's orientation appeals to me is that it draws the concept of "folding" out of the work of STS scholars such as Law, Mol, Serres and Latour to act as a method for studying algorithmic systems as agents that participate in a larger network of shifting relations:

Rather than thinking about objects, relations and concepts as stable entities with fixed distances and properties, we might attend to how different topologies produce different nearness and rifts. In this way, technologies, such as algorithms, can be understood as folding time and space as much as social, political and economic relations. By analysing algorithms in this manner, we argue that we can gain a better understanding of how they become part of ordering the world: sometimes superimposing things that might seem distant and sometimes tearing apart things that might seem close. To be more concrete, using operations of folding to understand algorithms allows us to pay attention to how diverse things such as values, computations, datasets or analytical methodologies are algorithmically brought together to produce different versions of the social and natural orders.

The paper uses four case studies involving AIDS, Zika virus, and financial metrics to highlight three types of questions that help tease out the various types of folding that algorithms participate in:

  • What people, objects or relations are produced as proximate or far away by algorithms?
  • What is made part of the universal and what becomes invisible?
  • How do assumptions about the normal become folded into algorithms?

The topological idea of proximity is particularly salient for me because it helps talk about how algorithms can pull disconnected things into relation and thus pull them closer together. Processes that are separate in physical space may be closely aligned in an algorithmic space. Also useful is the idea of "the normal" and how algorithms often have hidden away within them an argument about what is normal. He isn't cited, but I'm reminded of Foucault's work on governmentality here, not just for his critique of power, but also for his methods for examining how knowledge and power go hand in hand to produce practices and norms.

This way of thinking about algorithms appeals to me as I'm analyzing the results of my field study, in which I examined how a federal agency's choices about what to collect from the web (appraisal) were expressed as part of a sociotechnical system. This system has a particular set of algorithms at its core: in this case fixity algorithms. Fixity algorithms in themselves are as close as we can imagine to an objective measure, since they are mathematical procedures for summarizing the contents of distinct bytestreams. This neutrality is the means by which their archive is assembled and deployed. But all sorts of phenomena factor into which collected data files are "fixed", and it is in the ways that fixity algorithms are folded into other processes (forensics, preservation, data packaging, law enforcement, surveillance) that the story gets interesting.
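For readers unfamiliar with the mechanics, a fixity check is nothing more exotic than hashing a bytestream and comparing the digest to one recorded earlier; a minimal sketch:

# Minimal illustration of a fixity check: hash the bytestream, compare to the stored digest.
import hashlib

def sha256_fixity(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(path, recorded_digest):
    """True if the file's current digest matches the one recorded at ingest."""
    return sha256_fixity(path) == recorded_digest

The interesting part, in the paper's terms, is not this arithmetic but everything the check gets folded into.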

The process of folding seems like a useful way to talk about how human and non-human actors are brought into collaboration in algorithmic systems. It accommodates some measure of intentional design along with social and political contingency. In my own work it is helpful for decentering my own tendency to zoom in on the technical details of how an algorithm is implemented in order to unpack the various design decisions, and instead look laterally at how the algorithm enables and disables participation in other social, political, and technical activities.

References

Burrell, J. (2016). How the machine thinks: Understanding opacity in machine learning algorithms. Big Data & Society, 3(1).

Diakopoulos, N. (2016). Accountability in algorithmic decision making. Communications of the ACM, 59(2), 58–62.

Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.

Lee, F., Bier, J., Christensen, J., Engelmann, L., Helgesson, C.-F., & Williams, R. (2019). Algorithms as folding: Reframing the analytical focus. Big Data & Society.

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce oppression. New York University Press.

Pasquale, F. (2015). The black box society: The secret algorithms that control money and information. Harvard University Press.

#LISMentalHealth: That time my brain and job tried to kill me / Meredith Farkas


Happy LIS Mental Health Week friends! I want to start this post by recognizing someone who has done a great deal to support library workers’ mental health in the face of toxic workplaces, Kaetrena Davis Kendrick. Kaetrena has done some incredibly valuable research on low morale and toxic workplaces in librarianship and has created an awesome supportive community on Facebook (awesome praxis, Kaetrena!). I love the generosity she exhibits in sharing her insights and observations on Twitter as she is conducting research. This particular observation, however, stopped me in my tracks —

I wasn’t the person she met with, but it hit me hard because I came very close to the same thing when I worked in a toxic environment. Even though writing this is bringing up a lot of trauma, I want to share this experience because I’m sure there are people experiencing similar things in libraries right now who might be blaming themselves for systemic issues that would be toxic and problematic whether they were there or not. I want you to know that it’s not your fault. And that there is hope, even when it doesn’t feel that way.

I had worked as a social worker (psychotherapist) with children and teens in extremely stressful (sometimes life-and-death) situations and I’d worked in a small, cash-strapped library that required constant ingenuity, so I felt like I had a pretty good handle on work stress. But I’d never previously worked with people who actively disliked me. In fact, I’ve always been one of those people who works really hard to get along with everyone (thanks anxiety brain). I remember living in a house my sophomore year of college that was basically broken up into two factions who couldn’t stand each other. I was the person who was friends with people on both sides and somehow managed to skirt all the conflicts. But in this previous job, I was placed in a pretty impossible situation; one I know many other “coordinators” can understand.

My position was created because library administration wanted to build a culture of instructional assessment and they weren’t making any headway because there was significant distrust of administration. The idea of sending someone in to achieve this who not only had no authority over the liaisons (I had four library faculty reporting to me, but they were not the majority of liaisons) but was also on the tenure track and at the mercy of those same colleagues was laughable. It’s no surprise that my colleagues largely saw me as the enemy since I was trying to achieve the things administration wanted. And when I came back to administration, they balked at the idea that they should need to provide any support to build a culture of assessment. It felt like it was all on me. Everything, from developing learning outcomes for our program to getting people to document doing any assessment work at all, was a battle. I spent so much time lobbying colleagues one-on-one to allay concerns about things only to find them resisting the very things we’d discussed in group meetings. My supervisor (the AUL) and I would come up with a game plan in a meeting together and then when I presented it, he’d sit there with his mouth shut while I got pummeled for it. I will fully admit that I wasn’t perfect, I reacted badly to things sometimes, and I would have approached this work differently knowing what I do now, but the resistance was extreme.

And it was more than just the resistance. There was also the backstabbing, which was like something from a TV show — not something I’d ever seen in real life. This was something I was totally unprepared for. I’m not particularly politic. I wear my heart on my sleeve. I’m incapable of manipulating or sucking up to people. Artifice is not my jam. But I had a colleague who was expert at making friends with administrators and then using those relationships to bad-mouth people and try to harm those they didn’t like. They had a habit of writing what a few of us called “gotcha emails,” where they’d write the email in a way designed to make us look bad and would cc: our common supervisor even though there was no reason to do so. Even though they were not on the committee, one of their complaints about me (which was demonstrably false) ended up in my third year review letter for tenure and I had to spend tons of time (with deep anxiety) gathering evidence and asking colleagues to write letters on my behalf to rebut it. Trying to call the person on the things they did pretty much blew up in my face. By the time I left the job, that person had pretty much been ostracized by most of the liaison librarians for their bad behavior, but administration somehow managed to see them as the victim.

I think the worst part was how people would support me in private, and then throw me under a bus in public. They’d come to me before meetings and tell me they supported my ideas and then sit silently while I was pummeled in public or even side with others. Or they’d come to me after a meeting and tell me how awful it was and how badly they felt for me. But NEVER, NOT ONCE, did anyone have my back in those meetings. No one stood up for me. The message I got was that I wasn’t worth standing up for. It made me feel so small. Last summer, I ran into a colleague I’d worked with there who told me that they’d felt really badly for how I was treated and hoped I would consider coming back to work there in the future as the culture had changed. At the time, I’d had literally no idea they were sympathetic. This person had tenure and little to lose in defending me, but they didn’t.

In spite of it all, I didn’t give up. After two years of this, I felt like I’d finally gotten to an okay place. Instead of trying to get people to do things, I’d repositioned myself as a resource to the liaison librarians; here to help them with teaching or assessment. I created meetings where we could workshop instructional issues. I’d developed a guide full of different classroom assessment techniques they could try. I’d also developed or got involved in large-scale assessment projects and invited colleagues to join in scoring so that they could get their feet wet with assessment without having to do a lot. I rebranded my team as the Instructional Design Team, supporting the creation of tutorials and other learning objects for our colleagues. That same team had just started work on my baby, Library DIY. We’d just hired a new AUL for Public Services (my boss) who was really into instruction and it felt like we might finally be ready to make headway on things.

Less than one month after my new boss started her job, I was summoned to a meeting with her and our AUL who handled library HR. The meeting seemed ominous, but I was told by my boss “don’t worry, it’s nothing bad.” So imagine my surprise when I arrived and was handed a brand-new job description: General Education Instruction Coordinator and Social Sciences Librarian. My new AUL said she felt that it made more sense for her to be in charge of the overall instruction program and for all of the public services librarians to report directly to her. But I knew how things worked at this place. If you didn’t meet expectations, no one worked with you. No one coached you. You were shunted off to something less important in the hopes that you’d take the hint and leave. (A year later, ironically, the same thing would happen to the boss who did this to me.) But I’d uprooted my family and made them move across the country to a much more expensive city (and one in which finding another library job was next to impossible). I had a young child who I wanted to grow up rooted in a community. We’d just bought a house! I felt trapped. I felt like I’d let my family down. I felt like a failure. I felt like I’d destroyed my entire career; that it was over. All I could feel was intense shame and I couldn’t stop perseverating over what happened and blaming myself for all of it. In spite of the fact that I could clearly see that I was working in a toxic situation, I blamed myself 100% for all of it. I internalized everything.

I spent the next year in a deep and relentless depression. It was, without question, the worst I’ve ever felt in my life. I felt worthless. I felt like I didn’t deserve to be happy; like I didn’t deserve to exist. I couldn’t sleep. I obsessed about dying. I felt like I was already dead. I went to work like a robot and did my job and felt like I was rearranging deck chairs on the Titanic because the world had already ended and why couldn’t everyone else see that? I spent my weekends lying in bed staring at the ceiling or crying hysterically or yelling at my poor husband. I talked to my husband about quitting, trying to do consulting, teaching more online, ANYTHING, but it wasn’t really economically feasible. I wanted out of the pain and out of that job and I couldn’t see any way forward. I am pretty sure I wouldn’t be here if I hadn’t had my husband and son to anchor me, but I still felt like I had let them down utterly and ruined all their lives by coming here and failing so spectacularly. My husband was the only one who really knew what I was going through. I didn’t even tell my work colleagues who I was friends with, because I was afraid that if I told them and they didn’t care or didn’t try to help me, I’d feel even more lost. I just went through the motions.

I don’t know what I would have done had I not gotten my current job. I couldn’t have survived another year there. Even six years later, I feel tremendously lucky to be here, not only because I’m not there anymore, but because I get to work with amazing and dedicated colleagues and the great students we have here. Looking back now, I can see how impossible the situation was and it’s telling that I was the first and last head of instruction that library had. But at the time, I was absolutely demoralized and I took everything as a referendum on my worthlessness. As someone who experienced trauma growing up, my brain has been primed to see failures as being all my fault (and successes as being caused by external factors beyond my control). Self-blame comes very naturally to me.

While I was grateful to get out of that work environment, I recognized that there were a lot of things about how I positioned work in my life that weren’t healthy and needed to change. I’m sharing some of the work I’ve been doing on myself here in case others could benefit:

I recognize that most issues in libraries are systemic in nature, NOT individual – instead of blaming myself for everything, I try to see the big picture; how the problems I’m facing might be related to forces bigger than me. While I love my current job, every library has baggage and I am better now at seeing now how resistance I sometimes face is related to that baggage more than it is to me personally. I think library workers who get burned out feel even worse because they blame themselves for feeling that way. It’s like the Buddhist concept of the second arrow — we’re already in pain and we increase our pain by blaming ourselves for it. Things like burnout are very much a systemic failure, not a personal one, and seeing that, instead of blaming yourself, is one of the keys to emerging from burnout.

I don’t let achievement culture tell me what I’m worth – I’ve written about not chasing success anymore before, but I promise you, letting go of achievement culture is one of the most freeing things you can do. So much of what I do now is really valuable, but deeply unflashy. A lot of it is that maintenance work that makes a library run — like instruction scheduling or running our learning assessment project. I’m not going to win awards for anything I’m doing now, but I also don’t really care. If I’m happy with what I’m doing and I feel like I’m doing good, that’s enough.

I try to step back and look mindfully at situations – people who have survived trauma tend to get stuck in a lot of knee-jerk cycles of self-blame and self-harm. Mindfulness can help us look beneath anger, hurt, fear, and self-blame at the assumptions about others and ourselves that underlie those feelings. Meditation and tools like RAIN have really helped me to slow down and stop beating myself up all the time. It’s helped me deepen my compassion towards myself and others. I try to avoid the fight-or-flight thinking that makes me feel like I have to react, respond to that email, do, do, do right away.

Lately, I’ve been doing a lot of talking to my (imagined) future self. Looking back, I have so much compassion for me in my previous job and hindsight helps me to see the big picture and what really matters. I’ve started trying to conjure my future self when I’m struggling with something, both for compassion and to ask “is this something that is going to matter a year from now? Is this worth getting worked up over?” More often than not, it isn’t. If you’re uncomfortable imagining a conversation with your future self, you could imagine talking to a person who you feel loves you unconditionally. In my case, it’s my Abuela, who passed away several years ago. Seeing myself the way she saw me always helps to increase my compassion for myself.

I try to be less attached to outcomes — the negative part of being really dedicated to and passionate about work is that you also get attached to the projects you’re working on, especially things you believe will really be important for patrons. And it can viscerally hurt when people stomp all over the things you’ve built. Over the past few years, I have learned how to let go, which has definitely changed my work style, but it’s decreased my anxiety enormously. It’s made accepting no’s and negative changes that I have absolutely no control over much easier. I still care deeply about our students and I still advocate for things, but I see my place in a much larger system and recognize that there is only so much I can control.

I try to see people as whole people with their own insecurities and fears — When we are caught in anger, hurt, or insecurity, it becomes much more difficult to see people we perceive as harming us as whole people. A brilliant book I read recently, Radical Compassion by the incredible psychologist and meditation teacher Tara Brach, called this phenomenon “unreal others:”

When we’re caught in the trance of anger and blame, our survival brain shapes every dimension of our experience. Our bodies are tense, our hearts numb or constricted; our thinking is agitated and rigid… This cutoff from our whole brain dramatically impacts how we perceive others. Rather than real beings with subjective feelings like ourselves, they become what I call Unreal Others. Our attention focuses on their faults, their differences from us, on how they are threatening or impeding us.

By stepping back from survival brain mode, I’m now better able to let go of hurt and anger. When I’ve felt harmed by someone at work, I’ve made an effort to try and understand why they did or said what they did. More often than not, it has little to do with me or my worth and a lot more to do with them and their own insecurities. I used to really hold onto my hurts because it felt like people should be held accountable for the harmful things they did, but I’ve learned that it doesn’t serve me. Being angry and hurt just makes me feel like crap. Feeling compassion, forgiving, and letting go feels freeing, even if sometimes it means not holding people accountable.

I try to cultivate gratitude and appreciation – this summer, I listened to an episode of my favorite podcast, Hurry Slowly that featured the organizational psychologist Adam Grant. In it, he talked about the benefits of explicitly expressing gratitude towards people and how it not only makes the recipient (who likely didn’t know how you felt about them) feel great, but it also has positive impacts for the sender. And he’s right; it feels great! I’ve been making this a part of my practice bit-by-bit and it has led to some really special moments and heart-to-hearts with my colleagues. I don’t know what it is about our profession that there isn’t a lot of formal or informal recognition provided (as if it’s not an infinitely renewable resource, come on!), but I’m not going to wait for a culture change to start behaving the way I know we should.

I’d love to hear what strategies you’ve used to improve your relationships with work and with yourself. I won’t be able to make the #LISMentalHealth chat tonight because I’m going to be at the Buddhist meditation class my husband and I started attending last Spring; one of my self-care and community-care strategies. Whoever you are, I wish you peace and greater compassion for yourself and others.

Image credit: The awesome #LISMentalHealth Zine. Order your copy today!

Combating other people’s data / Open Knowledge Foundation

This blog is the first in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here http://fellows.frictionlessdata.io/.

By Monica Granados

Follow the hashtag #otherpeoplesdata on Twitter and you will find a trove of data users trying to make sense of data they did not collect. While the data may be open, having no metadata or information about what the variables mean does not make it very accessible. As a staunch advocate of open science, I have made my data open and accessible by providing context and the data on GitHub.

A screengrab of my GitHub repository with data files

Data files are separate in GitHub.

And while all the data and a ReadMe are available, and the R code allows you to download the data through the R console – a perfect setting for reproducing the analysis – the reuse of the data is questionable. Without definitions and an explanation of the data, taking the data out of the context of my experiment and adding it to something like a meta-analysis is difficult. Enter Data Packages.

What are Data Packages?

Data Packages are a tool created by the Open Knowledge Foundation for bundling your raw data with your metadata so that they become more usable and shareable.

For my first data package I will use data from my paper in Freshwater Biology on variation in benthic algal growth rate in experimental mesocosms with native and non-native omnivores. I will use the Data Package Creator online tool to create this package. The second package will be created in the R programming language.

Presently, the data is spread across separate files in my GitHub repo, but the Data Package Creator will allow me to bring the algae, snail and tile sampling data together in one place.

Write a Table Schema

A schema is a blueprint that tells us how your data is structured and what type of content is to be expected in it. I will start by loading the algae data. The data is already available on GitHub, so I will use the hyperlink option in the Data Package Creator and load the data using the Raw link from GitHub. Since the first row of my data contains the column headings, the Data Package Creator recognizes them automatically. Once the data is loaded, I can add additional information about my variables in the "Title" and "Description" fields. For example, for the variable "Day" I added the more explicit title "Experimental day" and put more information about the length of the experiment in the Description.
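For anyone who prefers to skip the web form, here is a rough sketch of the same step done by hand in Python: it writes the Table Schema portion of a datapackage.json, with the "Day" field carrying the title and description mentioned above. The field list, titles and the GitHub path are placeholders abbreviated from my repository, not the exact values.

# Sketch: hand-writing the Table Schema part of a datapackage.json
# (field list is abbreviated; paths and titles are placeholders for my real files).
import json

algae_schema = {
    "fields": [
        {"name": "Day", "type": "integer",
         "title": "Experimental day",
         "description": "Sampling day of the experiment."},
        {"name": "Tank", "type": "string",
         "title": "Mesocosm ID",
         "description": "Identifier of the experimental mesocosm."},
        {"name": "Chlorophyll", "type": "number",
         "title": "Benthic algal chlorophyll-a",
         "description": "Proxy for benthic algal growth."},
    ]
}

resource = {
    "name": "algae",
    "path": "https://raw.githubusercontent.com/<user>/<repo>/master/algae.csv",  # placeholder raw GitHub link
    "profile": "tabular-data-resource",
    "schema": algae_schema,
}

with open("datapackage.json", "w") as f:
    json.dump({"name": "mesocosm-algae", "profile": "tabular-data-package",
               "resources": [resource]}, f, indent=2)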

To add the snail and tile sampling datasets I will click on “Add a resource” for each and add the titles and descriptions.

Add dataset’s metadata

Next I will add the dataset’s metadata. I added a title, author, and description, and chose “tabular data package” since it’s just CSV files. I also added a CC-BY license so that anyone can reuse the data.

Then I validated the data (see note below) and downloaded the package, which is available here.
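For anyone who would rather script this step than use the web form, the sketch below (plain Python, standard library only) assembles an equivalent descriptor – package-level metadata, a CC-BY license, and the resources – and writes it to datapackage.json. The package name, GitHub paths, and abbreviated schema are placeholders, and the exact keys the Data Package Creator emits may differ slightly.

import json

# Abbreviated Table Schema; see the fuller sketch in the Table Schema
# section above. Field details are illustrative.
algae_schema = {"fields": [{"name": "Day", "type": "integer", "title": "Experimental day"}]}

# Assemble an illustrative Tabular Data Package descriptor. Names and
# paths are placeholders, not the Creator's exact output.
descriptor = {
    "profile": "tabular-data-package",
    "name": "mesocosm-omnivore-experiment",
    "title": "Benthic algal growth with native and non-native omnivores",
    "contributors": [{"title": "Monica Granados"}],
    "licenses": [{"name": "CC-BY-4.0", "title": "Creative Commons Attribution 4.0"}],
    "resources": [
        {
            "profile": "tabular-data-resource",
            "name": "algae",
            "path": "https://raw.githubusercontent.com/<user>/<repo>/master/algae.csv",
            "format": "csv",
            "schema": algae_schema
        }
        # The snail and tile sampling resources would be added here in the
        # same way, each with its own path and schema.
    ]
}

# Write the descriptor next to the data so the whole package can be shared.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)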

Tu data es mi data

The Golden Rule states: Do unto others as you would have them do unto you. I think we have all been subjected to other people’s data – the frustration and disappointment that follow when we determine that the data is unusable. By adopting data packages, you can make your data more reusable and accessible. Most importantly, you can prevent another #otherpeoplesdata tweet.

A screen grab of the pilot project blog

Find out more about the scientist’s pilot.

Are you interested in learning more about data packages and Frictionless Data? The Open Knowledge Foundation is looking for pilot collaborations with scientists now. Find out more here.

Boards of ALCTS, LITA and LLAMA put Core on March 2020 ballot / LITA

The Boards of the Association for Library Collections & Technical Services (ALCTS), the Library and Information Technology Association (LITA) and the Library Leadership & Management Association (LLAMA) have all voted unanimously to send to members their recommendation that the divisions form a new division, Core: Leadership, Infrastructure, Futures.

ALCTS, LITA and LLAMA members will vote on the recommendation during the upcoming American Library Association (ALA) election. If approved by all three memberships, and the ALA Council, the three long-time divisions will end operations on August 31, 2020, and merge into Core on September 1.

Members of the three Boards emphasized that Core will continue to support the groups in which members currently find their professional homes while also creating new opportunities to work across traditional division lines. It is also envisioned that Core would strengthen member engagement efforts and provide new career-support services. If one or more of the division memberships do not approve Core on the ballot, the three divisions will remain separate but continue to face membership declines and financial pressures.  

In order to share pertinent information and facilitate discussion regarding the possible merger, ALCTS, LITA and LLAMA will each lead separate town halls, as well as focus groups. In addition, the divisions invite feedback and discussion through Twitter using the hashtags #TheCoreQuestion and #EndorseCore. Visit the Core website for full details for all upcoming events and past Core discussions.  

Join us Monday, February 24th, 12:30 – 1:30 p.m. CST for a LITA Town Hall meeting to learn details about the ALA election vote to combine the three divisions, LITA, ALCTS, LLAMA into the new proposed division Core. The election runs March 9 – April 1, and all members of the three divisions are encouraged to vote. Register for the town hall meeting today!

The Association for Library Collections & Technical Services (ALCTS) is the national association for information providers who work in collections and technical services, such as acquisitions, cataloging, collection development, preservation and continuing resources in digital and print formats. ALCTS is a division of the American Library Association.

The Library and Information Technology Association (LITA) is the leading organization reaching out across types of libraries to provide education and services for a broad membership. The membership includes new professionals, systems librarians, library administrators, library schools, vendors and anyone else interested in leading edge technology and applications for librarians and information providers.

The Library Leadership and Management Association (LLAMA) advances outstanding leadership and management practices in library and information services by encouraging and nurturing individual excellence in current and aspiring library leaders.

Meet the organisations who have been awarded Open Data Day 2020 mini-grants / Open Knowledge Foundation

The Open Knowledge Foundation is happy to announce that dozens of organisations from all over the world have been awarded mini-grants to support the running of events that celebrate Open Data Day on Saturday 7th March 2020.

Thanks to the generous support of this year’s mini-grant funders – Datopian, the Foreign & Commonwealth Office, Hivos, the Latin American Open Data Initiative (ILDA), Mapbox, Open Contracting Partnership and Resource Watch – the Open Knowledge Foundation will be giving out a total of 67 mini-grants to the organisations listed below in order to help them run great events on or around Open Data Day.

We received 246 mini-grant applications this year and were greatly impressed by the quality of the events being organised all over the world.

Learn more about Open Data Day, discover events taking place and find out how to get technical assistance or connect with the global open data community by checking out the information at the bottom of this blogpost.

Here are the organisations whose Open Data Day events will be supported by mini-grants divided up by the tracks their events are devoted to:

Environmental data

  • Escuela de Fiscales’ event in Argentina will promote the use of open data for the training, dissemination and development of civic activism in the preservation of the environment in the community
  • Nigeria’s Adamawa Agricultural Development Programme will sensitise fishery stakeholders – especially fishermen –  on the importance of stock taking to prevent overfishing in our water bodies and how to update the fisheries database using open data
  • Afonte Jornalismo de Dados (Afonte Data Journalism) in Brazil will provide awareness about environmental politics and empower the community to use public and open data
  • The Department of Agriculture at the Asuogyaman District Assembly in Ghana will host local farming organisations to create awareness on the need for data to be open and to show the effect of climate change on agriculture and related livelihoods using rainfall data 
  • An Open Data Day event planned at Tangaza University College in Kenya will discuss how to tackle climate change challenges with data
  • The University of Dodoma in Tanzania will invite girls from a local school to a geospatial open data networking event to instill environmental thinking among young girls
  • The Open Internet for Democracy team and Creative Commons Venezuela chapter will join forces to train a group of environmental journalists about open and reliable data sources they can use to develop stories
  • iWatch Africa will host a forum in Ghana on leveraging the power of public domain satellite and drone imagery to track deforestation and water pollution in West Africa
  • Ghana’s Africa Open Data and Internet Research Foundation will run a hackathon on how local communities can use open data for sustainable development especially to improve sanitation issues
  • Sustenta in Mexico will share knowledge about sustainable development, climate change and sustainability
  • Grafoscopio / HackBo (Colombia) will bring together two citizen science communities working on air quality issues and reproducible research, data activism, visualisation and storytelling
  • Youth for Environmental Development (Malawi) will inspire university students to take action and contribute to environmental protection through mapping
  • WikiRate from Germany will engage the public in the research and collection of open data about how companies are impacting climate change
  • Liga de Defensa del Medio Ambiente (LIDEMA) from Bolivia plan to identify open data sources that can help address socio-environmental conflicts
  • The Open Cities Lab team in South Africa will create an open and accessible space for community scientists to meet, network and collaborate on an air quality project
  • Costa Rica’s ACCESA will help attendees identify and visualise new and unexpected relationships and connections between land-use and territorial planning, on the one hand, and climate change and decarbonisation, on the other
  • Young Volunteers for the Environment from Togo will promote the use of open data in environmental protection
  • Técnicas Rudas will collectively explore Mexico’s mandated public data on construction projects and their environmental impacts

Tracking public money flows

  • Spotlight for Transparency and Accountability Initiative in Nigeria will host an event to increase understanding of and access to local budget data
  • EldoHub will hold a hackathon to develop tools and systems which can facilitate county governments’ involvement in Kenya’s transparency, accountability and public participation
  • FollowTheMoney Kaduna will use contracting data, including responses to FOI letters and on-the-spot assessments of projects and infrastructure across communities in Kaduna state, Nigeria
  • The Alliance of Independent Journalists in Bandung, Indonesia will use open contracting data to encourage collaboration among civil society groups to access and monitor public budgets
  • Afroleadership in Cameroon will organise a training on the analysis of budget data by civil society using open data
  • The event run by the Construction Sector Transparency Initiative (CoST Malawi) will call for greater transparency and accountability in public budget management through the Open Contracting for Infrastructure Data Standard
  • Somalia’s Bareedo Platform will encourage uptake of local public contracting data
  • The Perkumpulan Inisiatif in Indonesia plan to host a youth open budget hack clinic building on the principles of public participation in fiscal policy from the Global Initiative for Fiscal Transparency
  • Dataphyte in Nigeria will support change agents to track and use budget, procurement and revenue data to demand accountability
  • The Collective of Journalists for Peace and Freedom of Expression from Mexico will design a workshop to explain all the contracts of the City Council of Mazatlan, Sinaloa
  • Diálogos will visualise the volume of public procurement of the main ministries of the Government of Guatemala 
  • The 1991 Open Data Incubator will facilitate a workshop and discussions to share the experiences of many parties working with or producing open data in Ukraine
  • LEAD University in Costa Rica will organise an event for data science students to meet public officials behind the National Public Procurement Portal
  • Russia’s Infoculture will hold a conference on open data and information transparency
  • The Kikandwa Rural Communities Development Organisation will showcase the Uganda Budget Information website and how to use it to report, track and monitor public funds
  • The Centre for Information, Peace and Security in Africa will work with journalists in Tanzania to promote openness in public contracting in terms of transparency and integrity in public expenditure and value for public money
  • Bolivia’s CONSTRUIR Foundation will organise a data camp to advocate for more and better public contracting data
  • Ojoconmipisto in Guatemala will teach students and journalists how to investigate and tell stories from public budget and contracting data
  • Sluggish Hackers will use their event to investigate how to track public money flows from the National Assembly or local assemblies in South Korea

Open mapping

  • Exegetic Analytics in South Africa will expose the South African R community to a range of resources for working with open spatial data
  • OpenMap Development Tanzania will spread awareness on the usefulness of open data for development among participants through workshops, trainings, break-out sessions and a mapathon 
  • Spain’s TuTela Learning Network will map the housing situation of migrant women in Granada
  • ODI Leeds in the UK will host a data surgery to assist attendees with their data, converting the data into GeoJSON files and mapping it
  • Girolabs from Paraguay will show initiatives using and producing open data
  • Open Knowledge Belgium will use open data to build a map visualising the streets names of Hasselt by gender
  • BloGoma (Blogosphère Gomatracienne) in the Democratic Republic of the Congo will use open mapping solutions to increase young people’s knowledge of free local HIV-related services
  • OpenStreetMap Kenya and Map Kibera will empower young people in Kibera, Africa’s largest urban slum, with skills in open mapping
  • The University of Pretoria in South Africa will develop a complete map of minibus taxi routes in Mamelodi East with the local knowledge of school learners
  • Brazil’s Federal University of Bahia wants to popularise open data mapping systems, especially OpenStreetMap, among undergraduate students and young people from vulnerable areas of Salvador
  • Youth Innovation Lab in Nepal will showcase crowdsourced streetlights data for Kathmandu collected by digital volunteers to influence policy for the maintenance of streetlights
  • Transparência Hackday Portugal / Open Knowledge Portugal will host a morning of hacking and learning, followed by an afternoon of quick talks and networking

Data for equal development

  • Footprints Bridge International will focus on how open data can help create jobs for rural youth and women in Ghana
  • The Bangladesh chapter of Creative Commons will host a mini conference to discuss the benefits of open source projects and open government data in the country
  • Nigeria’s National Institute for Freshwater Fisheries Research is developing an event to sensitise agricultural stakeholders on the need and benefits of data for equal development
  • MapBeks will organise a mapping party in the Philippines to highlight HIV facilities and LGBT-friendly spaces on OpenStreetMap
  • Khumbo Bangala Chirembo is a librarian at the Lilongwe University of Agriculture and Natural Resources in Malawi. He will host a workshop for other librarians to raise awareness of open data and its benefits
  • Mexico’s Future Lab aims to give visibility to women and the LGBT community in local decision making within government, business and civil society using open data
  • Zimbabwe Library Association’s Open Data Day event will highlight the importance of open data in promoting and supporting the girl child, as well as raising awareness of the negative effects of gender-based violence against women and the role that libraries can play in providing current awareness to communities
  • Young Professionals for Agricultural Development (YPARD) in the Democratic Republic of the Congo will help young and female agricultural entrepreneurs explore how they can use open data to create new businesses
  • Datasketch in Colombia will organise a series of lightning talks from social entrepreneurs and journalists to share their work using open data
  • Youths in Technology and Development Uganda plan to share innovative data tools and a FAIR open data road map to measure progress against the SDGs in the country
  • Mexico’s Ligalab will open a space for local speakers to present their open data projects and for the community to gather and engage with local issues towards equal development
  • The Association SUUDU ANDAL in Burkina Faso plan to emphasise the importance of open data for development and accountability during their event
  • Argentina’s Fundación Conocimiento Abierto will run a Data Camp on gender and diversity before spending the next few months developing local apps using open data
  • NaimLab Peru are organising an event for undergraduate students to share the open data work being done by local and national organisations
  • YWCA Honduras’ event will host a focus group for local women from middle and low income backgrounds to discuss and generate data on female migration in Honduras
  • We Are Capable Data for Good Namibia (WACDGN) will train young Namibians in using data science skills for sustainable development projects
  • Tutator from Bolivia will use their event to understand the impact of open data in the livelihood of the beneficiaries of social services

About Open Data Day

Open Data Day is the annual event where we gather to reach out to new people and build new solutions to issues in our communities using open data. The tenth Open Data Day will take place on Saturday 7th March 2020.

If you have started planning your Open Data Day event already, please add it to the global map on the Open Data Day website using this form.

You can also connect with others and spread the word about Open Data Day using the #OpenDataDay or #ODD2020 hashtags. Alternatively you can join the Google Group to ask for advice or share tips.

To get inspired with ideas for events, you can read about some of the great events which took place on Open Data Day 2019 in our wrap-up blog post.

Technical support

As well as sponsoring the mini-grant scheme, Datopian will be providing technical support on Open Data Day 2020. Discover key resources on how to publish any data you’re working with via datahub.io and how to reach out to the Datopian team for assistance via Gitter by reading their Open Data Day blogpost.

Need more information?

If you have any questions, you can reach out to the Open Knowledge Foundation team by emailing network@okfn.org or on Twitter via @OKFN. There’s also the Open Data Day Google Group where you can connect with others interested in taking part.

Jobs in Information Technology: February 19, 2020 / LITA

New this week

Visit the LITA Jobs Site for additional job listings and information on submitting your own job posting.

Early-bird registration ends March 1st for the Exchange / LITA

With stimulating programming, including discussion forums and virtual poster sessions, the Exchange will engage a wide range of presenters and participants, facilitating enriching conversations and learning opportunities in a three-day, fully online, virtual forum. Programming includes keynote presentations from Emily Drabinski and Rebekkah Smith Aldrich, and sessions focusing on leadership and change management, continuity and sustainability, and collaborations and cooperative endeavors. The Exchange will take place May 4, 6, and 8.

In addition to these sessions, the Exchange will offer lightning rounds and virtual poster sessions. For up-to-date details on sessions, be sure to check the Exchange website as new information is being added regularly.

Early-bird registration rates are $199 for ALCTS, LITA, and LLAMA members, $255 for ALA individual members, $289 for non-members, $79 for student members, $475 for groups, and $795 for institutions. Early-bird registration ends March 1.

Want to register your group or institution? Groups watching the event together from one access point will receive single (1) user access to the live stream over all three days and unlimited user account creation on the Exchange event site. Institutions with a maximum of six (6) concurrent user access points receive access to the live stream over all three days and unlimited user account creation on the Exchange event site.

Group and institutional members are encouraged to create their own user accounts and participate in the event’s discussions and non-streaming content. To learn more, contact ALCTS Program Officer Tom Ferren at tferren@ala.org or call (312) 280-5038.

Interested in submitting a poster for the virtual poster session? The deadline for submitting virtual poster proposals is March 6. Learn more and submit your proposal today.

Open to What? A Critical Evaluation of OER Efficacy Studies / In the Library, With the Lead Pipe

In Brief
This selective literature review evaluates open educational resources (OER) efficacy studies through the lens of critical pedagogy. OER have radical potential as transformative tools for critical pedagogy, or they can serve as a cost-free version of the status quo, inclined toward propagating austerity. This review analyzes studies published since 2008 with regard to cost, access, pedagogy, commercialization, and labor. These criteria are used to make explicit subjects indirectly addressed, if not ignored completely, in the existing literature. Typically, ample attention is paid to a study’s design and methodology, but the underlying institutional infrastructure and decision-making processes go unexamined. What emerges is an incomplete picture of how OER are adopted, developed, and sustained in higher education. Measurables like student outcomes, while important, are too often foregrounded to appeal to administrators and funding organizations. The review concludes with suggestions for how to utilize critical pedagogy for future studies and grassroots OER initiatives.

By Ian McDermott

Introduction

Open Educational Resources (OER) are misunderstood and underutilized in higher education (higher ed). In part, this situation can be traced to definitions. What are OER? In 2002, the United Nations Educational, Scientific, and Cultural Organization (UNESCO) coined the term OER and defined them as non-commercial learning materials. In 2012, UNESCO refined their definition to include “any type of educational materials that are in the public domain or introduced with an open license” (UNESCO, 2012). These educational materials encompass everything from textbooks and curricula to lecture notes and animation. The William and Flora Hewlett Foundation, a charitable foundation that supports OER initiatives, states OER are “high-quality teaching, learning, and research materials that are free for people everywhere to use and repurpose” (2018). David Wiley, Founder and Chief Academic Officer at Lumen Learning, argues that it is flexible licensing and permissions, in opposition to conventional, restrictive copyright, that are central to OER. Wiley cites the 5 Rs of OER as the most important features: the ability to retain, reuse, revise, remix, and redistribute (Hilton, Wiley, Stein & Johnson, 2010; Wiley, Bliss & McEwen, 2014).

These definitions, while useful, hint at the motivations of the organizations and individuals behind them. As an international, aspirational organization, UNESCO’s broad definition is inclusive and emphasizes the public domain and open licenses. The Hewlett Foundation’s definition signals an interesting shift, emphasizing “high quality” OER, which is not surprising since Hewlett, as an OER funder, has a financial stake in OER development. In 2019 Hewlett granted nearly $8 million to 18 OER initiatives at universities and organizations, including the University of California at Berkeley, University of Cape Town, Creative Commons, and the Wiki Education Foundation (Hewlett Foundation, Grants). Wiley’s 5 Rs model is arguably the preeminent OER definition. It is clear and concise while articulating a broad set of practices. One critique points out that several of the 5 Rs require access to technology and the requisite skills (Lambert, 2018). Like the Hewlett Foundation, Wiley has a vested interest in the success of OER. Lumen Learning is a company that provides a suite of educational technology products that colleges and universities pay to use; Lumen’s Candela, Waymaker and OHM provide the infrastructure for many instructors teaching with OER. While their products are often less expensive than commercial textbooks and platforms, some argue their business model betrays the ethos of open access initiatives (Downes, 2017; see Wiley, 2017, for counterpoint). Critically, Wiley’s initial definition of 2007 only included 4 Rs (Wiley, 2007). He added retain as the fifth R in 2014. As a practice, creators of any work should retain certain rights. Coincidentally or not, the right to retain is critical to the Lumen business model. It enables Lumen to monetize OER materials by packaging them in a proprietary, fee-based system.

These definitions vary enough to preclude a shared understanding of OER. In fact, a majority of college and university faculty are not familiar with OER (Seaman & Seaman, 2017, p. 16). Current OER practice varies depending on the practitioner’s affiliation (e.g. professor at a public university, academic librarian, Lumen employee, adjunct faculty member, student). Beyond sharing resources, higher ed lacks a common OER practice and existing OER practices lack an explicit social justice mission.

This situation presents an excellent opportunity to define, develop, implement, and advocate for OER in critical ways that address social justice issues facing higher ed: cost and access, pedagogical practice, and academic labor. Studies that assess OER’s impact on higher ed tend to focus on efficacy and perceptions. When compared to commercial textbooks and learning materials, these studies measure whether OER are effective at producing positive student outcomes and whether they are perceived favorably by students and instructors. To develop a social justice-oriented analysis of OER, I use critical pedagogy as a theoretical lens to review OER efficacy studies. Listed below are the criteria, with examples, used to evaluate these studies. This literature review examines OER efficacy case studies based on how they address these criteria; subsequent studies should be judged on how well they remedy the gaps identified here.

Critical Pedagogy Criteria 1: Cost & Access

OER adoption eliminates textbook costs and democratizes access; online books are available in multiple formats and accessible to all learners, including formats that do not rely on consistent internet access (e.g. PDF download); acknowledges that high-priced textbooks are a barrier to learning because many students do not purchase expensive textbooks; cost of and access to textbooks and learning materials are connected to student outcomes: course grade, enrollment intensity, withdrawal rates, etc.

Critical Pedagogy Criteria 2: Pedagogical Practice

Replacing commercial textbooks with OER is a pedagogical decision, beyond cost and access; details are provided about commercial textbooks and OER; faculty are making pedagogical decisions and are transparent about materials adopted, including relevant software (e.g. learning management software); open and critical pedagogy is used to involve and reflect students’ voices.

Critical Pedagogy Criteria 3: Academic Labor

Labor required for OER initiative is described, including work done by faculty, educational technologists, graduate assistants, librarians, undergraduate students, and others; price of academic labor and funding sources included.

Critical Pedagogy

Critical pedagogy has been used to analyze and reimagine education for over 50 years. OER have this potential when put into critical pedagogy practice. For this review, I define critical pedagogy via two foundational texts: Paulo Freire’s Pedagogy of the Oppressed (1968) and bell hooks’s Teaching to Transgress: Education as the Practice of Freedom (1994). Brazilian educator and theorist Freire (1968) argues for liberatory education, “[w]here knowledge emerges only through invention and re-invention, through the restless, impatient, continuing, hopeful inquiry human beings pursue in the world, with the world, and with each other” (p. 72). Teachers minimize their authoritative role through a reconciliation of the teacher-student contradiction, “so that both are simultaneously teachers and students” (Freire, 1968, p. 72). This model of education combats what Freire (1968) termed the banking concept of education, “in which the students are the depositories and the teacher is the depositor” (p. 72). For Freire, education is one site in the struggle against larger forces of oppression. Leveling hierarchy as much as possible in the student-teacher relationship is fundamental to the struggle.

Though his ideas have influenced educators throughout the world, Freire’s early writings emerged from his experience teaching the illiterate poor in Brazil how to read. In the United States, feminist educator and author bell hooks has explored critical pedagogy for decades in higher ed, as it intersects with race, gender, and class. In Teaching to Transgress hooks (1994) contrasts her ecstatic experience of education as “the practice of freedom” when she was a child in all-black schools in the south with the oppressive, racist schools she attended during integration that strove to “reinforce domination” (p. 4). For hooks, critical pedagogy means “creating [a] democratic setting where everyone feels a responsibility to contribute” (1994, p. 39). This practice requires a desire to transgress, to empower the oppressed through critical pedagogy: students of color, queer students, poor students.

More recently, classroom faculty, librarians, instructional designers, and others in higher education have examined OER with a critical pedagogy perspective (e.g. Darder, Torres, Baltodano, 2017; Accardi, Drabinski, and Kumbier, 2010). In her analysis of OER and the open access movement in libraries and higher education, Crissinger warns “how openness, when disconnected from its political underpinnings, could become as exploitative as the traditional system it had replaced” (2015). In analyzing key texts of open educational practice (OEP), Lambert finds little explicit social justice (2018).

Critical pedagogy must be a part of OER practice. If students cannot afford a textbook, they are already oppressed. Faced with this contradiction, how can we possibly create the “democratic setting” hooks strives for? Replacing an expensive textbook with a free one is not critical pedagogy, because expensive textbooks are one symptom of higher ed’s disease. Eliminating expensive textbooks is a first step toward confronting the contradictions students and faculty face in higher ed. For example, five publishers control 80% of the textbook market (Senack and Donoghue, 2016, p. 4) and over 70% of faculty members hold contingent positions (American Association of University Professors, n.d.). Can the strategic use of OER effect the kind of change in higher education that places critical pedagogy at its center and eschews the austerity mindset that currently governs the field?

Background

Some broader context on the unaffordability of higher education is necessary to understand why OER are a pressing topic. First and foremost, the price of higher education continues to increase as the cost burden has been shifted to students and their families. According to the State Higher Education Executive Officers Association (SHEEO), 2017 was the first year a majority of states relied on tuition and fees more than state and local educational appropriations (SHEEO, 2018, pp. 8-9) to fund public higher education. Nationwide, spending per student by public higher educational institutions has decreased by 8% since 1992 while per student tuition has increased 96% (SHEEO, 2018; Brownstein, 2018). Student debt now exceeds individual credit card debt (Johnson, 2019).

Focusing on textbooks, the Bureau of Labor Statistics (BLS) reports that the price of college textbooks increased 88% between 2006 and 2016, far outpacing tuition, fees and college housing over the same period (2016); a study by the U.S. Government Accountability Office (GAO) reached similar conclusions (2013, p. 6). In some cases, the price of textbooks is greater than the price of tuition (Fischer, Hilton, Robinson, & Wiley, 2015, p. 160; Goodwin, 2011, p. 15). Sixty-five percent of surveyed students admitted that high cost had prevented them from buying a textbook (Senack, 2014, p. 11). Students specifically cite textbook prices as an impediment preventing them from passing, completing, or even enrolling in classes (Florida Virtual Campus, 2019, pp. 31-32). Therefore, reducing or eliminating the cost of textbooks is one step toward lowering the barriers to higher education.

Scope, Methods, Objectives

This review is limited to OER efficacy studies in higher education published in North America between 2008 and 2019. Books, news articles, reports issued by governmental and non-profit organizations, and blogs are included as secondary sources. The body of literature on OER efficacy is not voluminous, but it is growing. A comprehensive article-length review is not possible or desirable.

As much as possible, the studies, reports, and articles selected for this review are published in open access journals and websites, though articles from the following databases and search engines were used: Education Resources Information Center (ERIC), Library and Information Science Source (EBSCOhost), Education Source (EBSCOhost), and Google Scholar. The Open Education Group’s The Review Project is an indispensable resource, which “provides a summary of all known empirical research on the impacts of OER adoption (including our own)” (Open Education Group, 2019). To date, this ongoing literature review includes 48 peer-reviewed articles, theses/dissertations, and white papers.

The studies were chosen as a representative sample and for their ability to meet the criteria discussed above: cost and access, pedagogical practice, and academic labor. Past and current literature reviews on OER efficacy (Hilton III, 2016; Abri and Dabbagh, 2018; Hilton III, 2019) emphasize quantitative and qualitative data and survey design. Following Crissinger’s (2015) and Lambert’s (2018) analyses, the objective for this study is to search for evidence of critical pedagogy and social justice in OER efficacy studies.

Analysis and Commentary

This section organizes OER efficacy case studies into three subsections. These subsections are organized in descending order by the frequency with which they are addressed in the studies under review.

1. Cost Reduction, Increased Access, and Student Outcomes
Every study addresses how OER help reduce the cost of higher education and increase access to textbooks and learning materials. The studies measure student outcomes in classes using OER (test group) compared with classes using commercial textbooks (control group); student outcomes include rates of A, B, and C grades; rates of Ds, Fs, and withdrawals; enrollment intensity; final exam grades; and others. Often, student outcomes are similar across the test and control groups, though some studies present a case for a correlation between cost and access and improved student outcomes.

2. OER and Pedagogy
Some studies provide details about the pedagogical decisions made with regard to OER adoption – for example, which OER textbook replaced the commercial option – but studies rarely name the commercial textbook. Even fewer studies discuss how OER intersect with pedagogical theories or faculty/student/staff collaborations.

3. OER and Academic Labor
Rarest of all is the study that provides details about the academic labor required for OER initiatives. Adopting an OER textbook may require a significant amount of work for a single professor teaching a single section. The number of people only increases for large, multi-section courses reliant on course management software. Very few studies detail the personnel involved or the costs required.

Some studies are discussed in more than one subsection, though each subsection foregrounds one of the above topics. While I use critical pedagogy as a lens to analyze OER efficacy studies, I am not primarily concerned with how critical pedagogy is used in specific OER textbooks or learning materials. The below studies do not provide such granular detail. Instead, I am analyzing these studies for evidence, or lack thereof, of critical approaches to OER adoption and survey design as it relates to cost and access, pedagogy, and academic labor.

Cost Reduction, Increased Access, and Student Outcomes

Many OER studies identify cost reduction and increased access as the initial motivation for OER adoption. The authors and investigators then track student outcomes for test and control groups across a variety of metrics. In nearly all studies, student outcomes are the same or better in classes taught with OER.

The University of California, Davis (UC Davis), created the STEMWiki Hyperlibrary to provide students with a no-cost replacement for existing commercial textbooks (Allen, Guzman-Alvarez, Molinaro, & Larsen, 2015, p. 3). ChemWiki (part of the Hyperlibrary) was used as the exclusive textbook in seven chemistry classes at UC Davis, Purdue University, Sacramento City College, and Howard University. Allen, et al. (2015) claim that ChemWiki implementation saved students approximately $500,000 in textbook expenditures (p. 3), though the commercial textbook replaced by ChemWiki is not mentioned by name. It is not clear how the authors arrived at this figure; perhaps it is based on an estimate assuming all students purchased the commercial textbooks. All available research indicates many students do not purchase expensive textbooks. Such opacity is not helpful. For OER to flourish, it is important to name the resources being replaced, and their cost. Readers, especially those considering adopting OER, deserve to know these details to help them make informed decisions at their own institutions.

The Virginia State University School of Business turned to OER in hopes of reducing inequality in the classroom and improving student outcomes. Prior to this study, only 47% of VSU students purchased textbooks for their courses. Students cited affordability as the primary barrier; many VSU students struggle financially and work at least one job in addition to their full-time courseload (Feldstein, Hilton III, Hudson, Martin, & Wiley, 2012, p. 1). VSU faculty investigated ebook alternatives in order to lower costs and ensure students would have ongoing access to course materials. They contracted Flat World Knowledge (FWK), then an OER provider, and paid for per-student seat licenses. VSU faculty purposely avoided commercial and proprietary platforms that would restrict access for students without regular internet access. Therefore, students could read the textbooks online or download and retain all materials in several formats (Feldstein, et al., 2012, pp. 1-2).

However, working with commercial entities on OER initiatives has considerable drawbacks. One year after the VSU study, FWK “evolved from open education resources to fair pricing,” according to their website. This means that the textbooks VSU faculty had hoped to make available for free were now subject to “fair pricing.” FWK and VSU students and faculty may have divergent ideas of what counts as a fair price for a textbook. At the time of this writing, the FWK website lists most e-textbooks at $25-$30 and most print copies (ebook included) at $55. This price is much lower than many commercial alternatives, but it is a lot more than free.

The percentage of African American and Latinx students who receive a bachelor’s degree or higher lags far behind that of white students. In a 2018 study at the University of Georgia (UGA), Colvard, Watson, and Park sought to address the attainment gap through OER adoption in eight general education courses. The authors point to the connection between public disinvestment in higher education and rising costs for students. They argue that shifting the cost burden away from taxpayers and onto students exacerbates ethnic and racial disparities in educational attainment. Students saved over $3 million as a result of these OER adoptions. Cleverly, this study disaggregated student data in order to determine whether OER have a greater impact on students eligible for Pell Grants, part-time students, and non-white students (Colvard, Watson, & Park, 2018, p. 264). The results are promising, as the percentage of students receiving grades A, A-, and B+ in OER test courses increased dramatically for all three populations (Colvard, et al., 2018, pp. 269-271).

The last study in this subsection presents a convincing argument for cost reduction as a contributor to student outcomes. Fischer, Hilton, Robinson, and Wiley designed what was, at its publication in 2015, the largest efficacy study to date. It is a quasi-experimental study that analyzed efficacy results across four four-year colleges and six community colleges for approximately 16,000 students in fifteen undergraduate courses: approximately 5,000 in the test group and 11,000 in the control group (Fischer, Hilton, Robinson, & Wiley, 2015, p. 164). The study measured outcomes in four categories: course completion, passing courses with at least a C- grade, credit hours during the semester tested (enrollment intensity), and credit hours in the following semester.

Fischer, et al. claim that cost is more impactful on student outcomes than instructional design and mode of delivery (Fischer, et al., 2015, p. 169). In this study and others like it, student outcomes are similar when using OER or commercial textbooks. However, the authors see a correlation between saving money on textbooks and enrollment intensity. The test group (those using OER) enrolled in more credits in the surveyed semester, and the following semester, than the control group (Fischer, et al., 2015, pp. 167-168). Their argument is that students use their savings to enroll in more classes. Causation is impossible to prove but this hypothesis is provocative.

The refrain that student outcomes are the same or better when using OER is increasingly common. This argument is used to encourage OER adoption. But OER need practitioners committed to critical pedagogy to move beyond a free version of the status quo. Fischer, et al. (2015) admit that future studies should analyze textbook quality and teacher effects (p. 170). They do not provide any details about the learning materials used in their study. This omission is too common in OER efficacy studies. These issues are taken up in the following subsections.

OER and Pedagogy

The fact that the vast majority of OER efficacy studies show that student outcomes are the same or better when using OER is promising. However, most studies lack an in-depth analysis of the pedagogical choices driving OER initiatives. This section examines case studies for evidence of critical pedagogy with regard to OER adoption.

Though never specifically mentioned, critical pedagogy is at the center of the UC Davis ChemWiki study discussed above. Allen, et al. stress the importance of faculty and student engagement in authoring and reviewing ChemWiki teaching materials. As the name suggests, ChemWiki utilizes a model similar to Wikipedia, a comparison the authors embrace (Allen, et al., 2015, p. 2). Teaching modules are created by many instructors and can be hyperlinked within each course’s instance of ChemWiki. In other words, labor is distributed horizontally in an effort to draw on collective expertise and avoid the centralization of expertise used in authoring traditional textbooks.

Colvard, et al. argue that their study, and OER by extension, addresses all three of the great challenges facing higher education: affordability; retention and completion; quality of student learning (Colvard, et al., 2018, p. 273). Quality of student learning is measured by academic performance, which improved in the test group. But the study reveals little about pedagogy. Most of the classes adopted OpenStax textbooks, a major OER textbook publisher based out of Rice University. UGA’s Center for Teaching and Learning assisted with some OER adoptions but no further details are provided. As a result, pedagogy and academic labor are hinted at but never discussed. One study cannot cover all topics and this one does a remarkable job of situating OER in a social justice context. Perhaps a future study could widen the aperture of social justice to better account for pedagogy and the academic labor required to adopt OER at a large, public university.

Hendricks, Reinsberg, and Rieger acknowledge that most studies ignore pedagogy, and they address the gap by providing “a very specific description of how the open textbook used in the course we are studying has been adapted to fit into that course” (2017, p. 82). In this study at the University of British Columbia (UBC), the authors adopted an OpenStax physics textbook and edited out sections of the textbook that were not relevant to the course (Hendricks, Reinsberg, & Rieger, 2017, p. 90). Professors also stopped using a commercial software package for homework. Instead, they added the textbook’s review questions to the course website in an attempt to reduce cost, simplify administration, and simplify students’ experience (Hendricks, et al., 2017, pp. 83-84). In this instance, getting rid of the commercial homework system, rather than the textbook, generated the greatest savings.

Hendricks, et al. found that students’ problem-solving abilities were slightly negatively impacted by the new homework system. The previous commercial system provided hints and tutorials as students completed their homework, whereas the new system simply provided correct/incorrect feedback. However, their transparency demonstrates that moving away from commercial entities in higher education may not be painless. Critical pedagogies are necessarily difficult because the intention is to leave behind pre-existing approaches. In this regard, the authors show that there is much more to student outcomes than “the same or better results.”

Critical approaches factored into the decision-making process in the Virginia State University study. Feldstein, et al. do not provide details on pedagogical methods used in the courses, but VSU Business School faculty identified the value in adopting OER with Creative Commons licenses. This way, materials are relatively easy to revise and remix and their teaching materials can reflect current events and different points of view (Feldstein, et al., 2012, pp. 1-2). As one professor put it, “Since students now had permanent access to content, the value was in the information and not in the textbook as a commodity” (Feldstein, et al., 2012, p. 8).

Pawlyshyn, Braddlee, Casper, and Miller document OER adoption for ten high enrollment courses at seven institutions, part of the Project Kaleidoscope Open Course Initiative (KOCI). Their writing is reflective to an extent rarely found in OER efficacy studies. They dedicate just as much space to pedagogical decision-making as to costs and student outcomes. This fact may be connected to the project design. Participants collaborated across institutions (and held weekly Skype calls!), which surfaced important differences at the respective institutions. For example, student populations varied from remedial to college entry (Pawlyshyn, Braddlee, Casper, & Miller, 2013). Consequently, faculty developed targets for their specific student populations. For OER initiatives to succeed, the authors make the following recommendation: “Introduce and facilitate OER efforts through faculty initiative rather than making a top-down institutional directive. Eventually, institutional policy must support emergent practice” (Pawlyshyn, et al., 2013).

Even when documenting KOCI’s shortcomings, Pawlyshyn et al. provide critical reflections. Some faculty resisted KOCI based on perceived limitations to academic freedom and of “corporate interference” since KOCI used Lumen Learning and received funding from the Bill and Melinda Gates Foundation and the Hewlett Foundation (Pawlyshyn, et al., 2013).

The following section examines how decisions regarding academic labor, which can include collaborating with commercial vendors, are discussed in OER efficacy studies—when academic labor is discussed at all.

OER and Academic Labor

Academic labor is rarely covered in these studies. This is understandable insofar as the focus of most studies is cost savings and student outcomes. However, academic labor is central to any OER initiative. Who is doing the work? Are they getting paid? Is this work acknowledged for promotion and tenure? Based on the available literature, it is difficult to answer these questions. Calling attention to the matter will hopefully help remedy this glaring omission in the literature.

Hendricks, et al. (2017) acknowledge the costs of adopting OER: “the literature on open textbooks related to cost focuses on cost savings to students, but it’s important to keep in mind the possible costs for faculty and institutions in terms of time and support when using open textbooks” (p. 94). Faculty and graduate assistants worked together during the summer months to prepare the course. The latter were paid with a teaching and learning grant of C$20,000 from University of British Columbia. Ensuring fair compensation for graduate assistants and contingent workers is crucial from a critical pedagogy perspective. However, there is no indication the grant covered the time and effort spent by faculty planning the project, securing funds, selecting materials, and learning new systems. Are these tasks considered part of their job, were they paid a stipend for extra labor, or given course release time, to name a few payment options? Transparency on the working conditions of all faculty and staff, contingent and full time, is necessary as we use critical pedagogy to implement and document just labor practices for OER initiatives.

Pawlyshyn, et al. directly address payment and incentives in a section called “Motivations.” In addition to a small stipend, faculty participants received travel funds to attend OER conferences. The authors claim this was an even greater motivator than the stipend and they make explicit recommendations for other OER initiatives to allocate funds for conference attendance (Pawlyshyn, et al., 2013). Though the authors do not explain why professional development funds were so popular, the implication is that faculty relished the opportunity to share their work and learn from others in a community of practice. One shortcoming of their report is it does not include any information about how Lumen Learning was involved in KOCI, especially with regard to MyOpenMath (MOM), a free, online course management system. It would be helpful to know if KOCI used the free version of MOM or the Lumen-supported version, Lumen OHM. Each option presents distinct cost and maintenance issues, namely vendor fees versus local maintenance expenses.

Allen, et al. contrast the commercial textbook publishing process–a small group of experts deciding on relevant content–with the horizontal crowdsourcing of ChemWiki. The infrastructure of ChemWiki is developed and maintained by professors, research assistants, and students who regularly review and update content for difficulty (Allen, et al., 2015, p. 3). The authors do not discuss how, or if, in the case of students, this labor is compensated or otherwise supported.

The final example in this subsection examines a study that looks to OER as an institutional cost saving measure. Bowen, Chingos, Lack, and Nygren (2012) examine an OER hybrid learning environment (a mix of in-person and online). Published by Ithaka S+R, a consulting non-profit, the study tested traditional and hybrid classes for a basic statistics course designed at Carnegie Mellon University and taught at six public universities. Like most studies, Bowen, et al. (2012) found the hybrid format produced the same or better results than traditional classroom instruction (pp. 18-21).

Unlike most OER studies, Bowen, et al. also tested whether or not the OER/hybrid method can lower instructor costs. In their model, the hybrid course would be supervised by tenure-track faculty, with in-person sections led by “teaching assistants” and administrative work handled by a “part-time instructor” (Bowen, et al., 2012, p. 25). Admittedly, this is one line of inquiry in a lengthy report, but using OER as a way to lower operating costs is anathema to critical pedagogy and social justice. The authors estimate large scale implementation could reduce instructor costs 36%-57% (Bowen, et al., 2012, p. 26). They do not include how they reach these numbers, likely because they would be perceived as controversial, if not incendiary.

Conclusion and Future Considerations

OER efficacy studies are just as revealing for what they omit as for what they include. It is challenging to design a methodologically sound study, especially under tight timelines and tight(er) budgets. Given this reality, OER efficacy studies tell the tidiest story: saving students money is good and OER may improve student learning. In this respect, these studies conform to the logics of funders and administrators, not students, faculty, librarians, and staff working at colleges and universities. But this story elides an inconvenient truth: if students are not buying expensive textbooks to begin with (Florida Virtual Campus, 2019; Feldstein, et al., 2012), are they saving money or are they not spending money they do not have in the first place?

This is not to say that well-designed OER efficacy studies are irrelevant. The above studies are valuable for their analysis of and advocacy for OER initiatives. But the desire to quantify all aspects of higher ed is reflected in the literature. The statistics are given primacy over pedagogy. Can an education committed to measuring “student success” ever be liberatory? Critical pedagogy does not reduce students to their letter grades or how many dollars they saved. Rather, students and faculty engage in dialog about defining academic success. In contrast to the above OER efficacy studies, qualitative approaches used in OER perception studies could be incorporated more often to center students’ voices. Action research is another approach. According to Sagor, action research, “is a disciplined process of inquiry conducted by and for those taking the action. The primary reason for engaging in action research is to assist the ‘actor’ in improving and/or refining his or her actions” (Sagor, 2000). Action research on OER initiatives would be a welcome addition to the literature, as the method aligns nicely with critical pedagogy.

Bowen, et al. (2012) seem to accept the divestment of public funds from higher education as a permanent reality, instead of an ongoing struggle (pp. 4-6). Their solutions address the perspective of administrators, not faculty or students. Moreover, how are OER being commercialized? David Wiley, a co-author of several of the above studies and many others, is the Chief Academic Officer at Lumen Learning. A deeper investigation into “open washing” – proprietary practices disguised as open access/licensing, as defined by Watters (2014) – in OER initiatives is needed.

Alternative perspectives abound. Brier and Fabricant decry austerity and commercialization in their full-throated defense of public higher education, Austerity Blues: Fighting for the Soul of Public Higher Education (2016). Winn’s (2012) Marxist analysis of OER in higher education cautions against administrators’ attempts to exploit OER for surplus value in the form of increased enrollment, lower teaching costs, and cultural prestige (pp. 143-144). Farrow (2017) criticizes the austerity mindset, obsessed with efficiencies that “promote the idea that technological innovation can offer neat solutions to challenges faced by educational institutions” (p. 131).

As the title of this article asks, open to what? A free version of the status quo? The above analysis shows that OER efficacy studies would benefit from greater transparency. This transparency applies to pedagogy, technology, and the financial and emotional costs for students, faculty, and staff. It is one thing to use critical pedagogy to diagnose the problem with the above studies. It is a far more important challenge to address higher ed’s contradictions and power struggles: teacher/student, faculty/administrator, proprietary/open access, banking education/open pedagogy. Critical pedagogy opens the door.


Acknowledgements

I would like to thank peer reviewers Ryan Randall and Nicole Williams for their insightful, critical, and encouraging comments. Thank you to Ian Beilin for serving as publishing editor. A very special thank you to Professor Maria Jerskey at LaGuardia Community College, who runs the Literacy Brokers writing group. I never would have finished this article without her guidance, along with other LaGuardia colleagues who participate in the writing group. In particular, many thanks to Professors Dominique Zino and Derek Stadler for their invaluable feedback on multiple drafts of this article.


References

Abri, M.A., & Dabbagh, N. (2018). Open Educational Resources: A Literature Review. Journal of Mason Graduate Research 6(1), 83-104. doi: https://doi.org/10.13021/G8jmgr.v6i1.2386.

Accardi, M., Drabinski, E., & Kumbier, A (eds). (2010). Critical library instruction: Theories and methods. Duluth, MN: Library Juice Press.

Allen, G., Guzman-Alvarez, A., Molinaro, M., & Larsen, D. (2015). Assessing the Impact and Efficacy of the Open-Access ChemWiki Textbook Project. Educause Learning Initiative Brief. Retrieved from https://library.educause.edu/resources/2015/1/assessing-the-impact-and-efficacy-of-the-openaccess-chemwiki-textbook-project.

American Association of University Professors. (n.d.). Background Facts on Contingent Faculty Positions. Retrieved from https://www.aaup.org/issues/contingency/background-facts.

Bowen, W. G., Chingos, M. M., Lack, K. A., & Nygren, T. I. (2012). Interactive Learning Online at Public Universities: Evidence from Randomized Trials. Ithaka S+R. https://doi.org/10.18665/sr.22464.

Brownstein, R. (2018, May 3). American Higher Education Hits a Dangerous Milestone. The Atlantic. Retrieved from https://www.theatlantic.com/politics/archive/2018/05/american-higher-education-hits-a-dangerous-milestone/559457/.

Bureau of Labor Statistics, U.S. Department of Labor. (2016, August 30). College Tuition and Fees Increase 63 Percent Since January 2006. TED: The Economics Daily. Retrieved from https://www.bls.gov/opub/ted/2016/college-tuition-and-fees-increase-63-percent-since-january-2006.htm.

Colvard, N., Watson, E. C., & Park, H. (2018). The Impact of Open Educational Resources on Various Student Success Metrics. International Journal of Teaching and Learning in Higher Education 30(2), 262-276. Retrieved from http://www.isetl.org/ijtlhe/pdf/IJTLHE3386.pdf.

Crissinger, S., (2015). A Critical Take on OER Practices: Interrogating Commercialization, Colonialism, and Content. In the Library With a Lead Pipe. Retrieved from http://www.inthelibrarywiththeleadpipe.org/2015/a-critical-take-on-oer-practices-interrogating-commercialization-colonialism-and-content/.

Downes, J. (2017). If We Talked About the Internet Like We Talk About OER. Half an Hour. Retrieved from https://halfanhour.blogspot.com/2017/11/if-we-talked-about-internet-like-we.html.

Farrow, R. (2017). Open Education and Critical Pedagogy. Learning, Media and Technology 42, 130-146. http://oro.open.ac.uk/id/eprint/46662.

Feldstein, A., Martin, M., Hudson, A., Warren, K., Hilton III, J., & Wiley, D. (2012). Open Textbooks and Increased Student Access and Outcomes. European Journal of Open, Distance and E-Learning, 2. Retrieved from https://www.eurodl.org/.

Fischer, L., Hilton, J., Robinson, T.J., & Wiley, D. (2015). A Multi-institutional Study of the Impact of Open Textbook Adoption on the Learning Outcomes of Post-secondary Students. Journal of Computing Higher Education 27(3), 159-172. https://doi.org/10.1007/s12528-015-9101-x.

Florida Virtual Campus. (2019). 2018 Florida Student Textbook and Course Material Survey: Results and Findings. Retrieved from https://dlss.flvc.org/colleges-and-universities/research/textbooks.

Goodwin, M. A. L. (2011). The Open Course Library: Using Open Education Resources to Improve Community College Access. (Unpublished doctoral dissertation). Pullman: Washington State University. Retrieved from http://hdl.handle.net/2376/3497.

Hendricks, C., Reinsberg, S., & Rieger, G. (2017). The Adoption of an Open Textbook in a Large Physics Course: An Analysis of Cost, Outcomes, Use, and Perceptions. The International Review Of Research In Open And Distributed Learning 18(4), 78-99. http://dx.doi.org/10.19173/irrodl.v18i4.3006.

Hilton III, J., Fischer, L., Wiley, D., & William, L. (2016). Maintaining Momentum Toward Graduation: OER and the Course Throughput Rate. The International Review Of Research In Open And Distributed Learning 17(6), 18-27. http://dx.doi.org/10.19173/irrodl.v17i6.2686.

Hilton, J. (2019, August 6). Open Educational Resources, Student Efficacy, and User Perceptions: A Synthesis of Research Published Between 2015 and 2018. Educational Technology Research and Development. https://doi.org/10.1007/s11423-019-09700-4.

Johnson, D. M. (2019, September 23). What Will It Take to Solve the Student Debt Crisis? Harvard Business Review. Retrieved from https://hbr.org/.

Lambert, S. R. (2018). Changing Our (Dis)Course: A Distinctive Social Justice Aligned Definition of Open Education. Journal of Learning for Development 5(3), 225-244. Retrieved from https://jl4d.org/index.php/ejl4d/article/view/290/334.

Paulsen, M. B., & St. John, E. P. (2002). Social Class and College Costs: Examining the Financial Nexus Between College Choice and Persistence. The Journal of Higher Education 73(2), 189–236. https://doi.org/10.1080/00221546.2002.11777141.

Pawlyshyn, N., Braddlee, D., Casper, L., & Miller, H. (2013, November 4). Adopting OER: A Case Study of Cross-institutional Collaboration and Innovation. Educause Review. Retrieved from http://er.educause.edu/articles/2013/11/adopting-oer-a-case-study-of-crossinstitutional-collaboration-and-innovation.

Sagor, R. (2000). What is Action Research? In R. Sagor (author), Guiding School Improvement with Action Research. Retrieved from http://www.ascd.org/publications/books/100047/chapters/What-Is-Action-Research%C2%A2.aspx.

Senack, E. (2014). Fixing the Broken Textbook Market: How Students Respond to High Textbook Costs and Demand Alternatives. The Student Public Interest Research Groups. Retrieved from https://uspirg.org/reports/usp/fixing-broken-textbook-market.

Senack, E., & Donoghue, R. (2016). Covering the Cost: Why We Can No Longer Afford to Ignore High Textbook Prices. The Student Public Interest Research Groups. Retrieved from https://studentpirgs.org/2016/02/03/covering-cost/.

State Higher Education Executive Officers Association. (2018). State Higher Education Finance (SHEF) Fiscal Year 2017. Retrieved from https://sheeomain.wpengine.com/wp-content/uploads/2019/02/SHEEO_SHEF_FY2017_FINAL-1.pdf.

UNESCO. (2002). Forum on the Impact of Open Courseware for Higher Education in Developing Countries: Final report. Paris. Retrieved from http://www.unesco.org:80/iiep/eng/focus/opensrc/PDF/OERForumFinalReport.pdf.

UNESCO. (2012). 2012 Paris OER Declaration. Retrieved from http://www.unesco.org/new/fileadmin/MULTIMEDIA/HQ/CI/WPFD2009/English_Declaration.html.

U.S. Government Accountability Office. (2013, June 6). College Textbooks: Students Have Greater Access to Textbook Information (GAO-13-368). Retrieved from https://www.gao.gov/products/GAO-13-368.

Watters, A. (2014, August 16). From “Open” to Justice. Hack Education. Retrieved from http://hackeducation.com/2014/11/16/from-open-to-justice.

Wiley, D. (2007, August 8). Open Education License Draft. Iterating Toward Openness. Retrieved from https://opencontent.org/blog/archives/355.

Wiley, D. (2017, November 8). If We Talked About the Internet Like We Talk About OER: The Cost Trap and Inclusive Access. Iterating Toward Openness. Retrieved from https://opencontent.org/blog/archives/5219.

Wiley, D. (2017, November 13). More on the Cost Trap and Inclusive Access. Iterating Toward Openness. https://opencontent.org/blog/archives/5244.

William and Flora Hewlett Foundation. (2020). Hewlett Foundation: Grants. Retrieved from https://hewlett.org/grants/?sort=date&grant_strategies=31557&current_page=1.

Cloud-hosted web archive data: The winding path to web archive collections as data / Archives Unleashed Project

Web archives are hard to use, and while the past activities of Archives Unleashed have helped to lower these barriers, they’re still there. But what if we could process our collections, place the derivatives in a data repository, and then allow our users to directly work with them in a cloud-hosted notebook? 🤔

Last year around this time, the Archives Unleashed team was working on what can now be referred to as our first iteration of notebooks for web archive analysis. This initial set of notebooks used the derivatives created by the Archives Unleashed Cloud, and made use of a “Madlibs” approach of tweaking variables to work through analysis of a given web archive collection. The team wrote this up, and presented a poster on it at JCDL 2019, “The Archives Unleashed Notebook: Madlibs for Jumpstarting Scholarly Exploration of Web Archives.” We also deployed a variation of them for the D.C. Datathon in the Spring of 2019. They were successful in helping teams jump-start their research at the datathon. But looking back, I can see they were a challenge to use. Just look at this “Getting Started” section of the README.

One of the biggest lessons that I keep relearning is that I need to make the tools I help create easier for the user. Our users come from a wide variety of backgrounds, and if we want to hit our goal of making “petabytes of historical internet content accessible to scholars and others interested in researching the recent past,” we must iterate on our tools and processes to make them easier and simpler to use, so that web archives can serve as source material or data across a wide variety of academic disciplines. Looking back at the slides from a presentation Ian Milligan and I gave at RESAW 2019, the graphic Ian used really hits home for me.

Let’s move towards users graphic from RESAW 2019 presentation.

Right now, a couple of parts of the Archives Unleashed Project are culminating as we approach the final two datathons in New York and Montreal, and they should, I hope, move us a lot closer to a large segment of our users: a new release of the Archives Unleashed Toolkit, and an updated method for producing datasets for the datathons.

Earlier this month, we released version 0.50.0 of the Archives Unleashed Toolkit. If you check out the release notes, you’ll notice a lot of new functionality with DataFrames, and a new approach to our user documentation. We take a “cookbook” approach to working with the Toolkit, offering a number of examples of how to generate categories of results and what to do with those results. Still, as much time and effort as we have put into improving the Toolkit over the last few years, only a small portion of our users will jump into using it as is: a library for Apache Spark. That presents a large challenge to many of our users, and potential users. You need Java and Apache Spark, you need to decide how you want to deploy Spark, and you will potentially need above-average command line skills just to get up and running with processing and analyzing web archive collections.

Over the past few years, we’ve not only iterated on our various software tools, we’ve also continuously tweaked our datathons. Since the attendees have just under two days to form teams, formulate research questions, and investigate the web archive collections, we’ve experimented with kick-starting the analysis for participants. For earlier datathons, attendees were given some homework that served as a crash course in the Toolkit, and then we handed teams some raw collections — just the WARCs! — and let them go to town. The datathons have evolved quite a bit since: we moved to providing teams with Cloud derivatives along with the raw collection data, and most recently, we provided teams with raw collection data, derivatives, and notebooks.

For our final two datathons coming up this spring, we decided to try one last iteration on the datasets and utilities we provide to attendees. We’re collaborating closely with our host institutions to not only provide access to the raw collection data of a few of their collections, but also process their web archives. We’re producing our standard set of Cloud derivatives as we have before, and we’re introducing DataFrame equivalents of the Cloud derivatives, along with some entirely new derivatives. These DataFrame derivatives are written out in the Apache Parquet format, which is a columnar storage format. They can be easily loaded in a notebook using common data science tools and utilities, such as pandas, that we hope more users may already be familiar with.

So, how does this all work?

Collaborating with our great hosts at Columbia University (Columbia University Libraries, and Ivy Plus Libraries Confederation) for our New York datathon, and Bibliothèque et Archives nationales du Québec and the International Internet Preservation Consortium for our Montreal datathon, we’re almost done processing all of their collections. We’re providing access to the collection derivatives in Zenodo and Scholars Portal Dataverse. Each collection’s derivatives now have a citable DOI, and Zenodo supports DOI versioning. So, as a web archive collection grows, newer versions of the derivatives can be added to the record, allowing each version to have a DOI, as well as a DOI for the entire dataset. So, we can now start to see these web archive collections as data! An important note about these datasets is that they are truly a collaboration, and you may notice the multiple authors on each. Each collection’s selectors, curators, and web archivists are co-authors, along with Archives Unleashed team members who processed the data.

So now that we have lots of web archive collection data, how can we improve the notebooks?

The answer here is to take what we learned in our previous notebooks, and throw pretty much everything else out and start from scratch. Why? If the end goal is to move closer to our users, then what if our users didn’t have to install Anaconda and all the dependencies locally to just fire up a notebook? What if they could just open a browser, and go from there? That’d be great, right? And, we can do that out of the box with Google Colaboratory! For free, you get up to 35G of RAM, ~100G of disk space, and a few CPU cores.

Example notebook dataset setup.

Let’s take a moment here to quickly walk through an example web archive collection as data, in a fairly simple notebook. Above, we see a brief description of the collection and an acknowledgement to Bibliothèque et Archives nationales du Québec (Merci beaucoup BAnQ!), and then we pull down the dataset to our Google Colaboratory environment to work with.
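In notebook terms, that download step can be as small as a few lines. Here is a minimal sketch, assuming a Zenodo-hosted zip of derivatives; the record ID and filename in the URL are placeholders, not a real dataset link.

import urllib.request
import zipfile

# Hypothetical download link; substitute the file URL from the Zenodo record
# (or Dataverse dataset) you actually want to analyze.
DERIVATIVES_URL = "https://zenodo.org/record/XXXXXXX/files/collection-derivatives.zip"

urllib.request.urlretrieve(DERIVATIVES_URL, "derivatives.zip")

with zipfile.ZipFile("derivatives.zip") as z:
    z.extractall("derivatives")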

Example notebook environment setup.

From there, we can load in some standard Python data science libraries, then load our Parquet derivatives as pandas DataFrames, and we’re off to the races. We don’t need the Toolkit at all!
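A minimal sketch of that loading step, assuming the derivatives were extracted to a local “derivatives” directory; the path and the choice of a domains derivative are assumptions, and the real layout is described in each dataset’s record.

import pandas as pd

# Read one Parquet derivative as a pandas DataFrame. pandas (via pyarrow)
# can read either a single .parquet file or a directory of part files.
domains = pd.read_parquet("derivatives/domains")

print(len(domains), "rows")
print(domains.head())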

Using tldextract to work with web archive data.
Bar graph of tld data.

Above, we pull in a handy library, tldextract, that allows us to examine the top occurring domains and suffixes, and in the next step we plot them!
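Sketching that kind of analysis in code looks roughly like the following; the “url” column name is an assumption about the derivative’s schema, and the toy DataFrame stands in for a real derivative.

import pandas as pd
import matplotlib.pyplot as plt
import tldextract

# A toy stand-in for a derivative with one row per crawled URL.
urls = pd.DataFrame({"url": [
    "https://www.banq.qc.ca/accueil/",
    "http://example.org/page.html",
    "https://archive-it.org/collections/",
]})

# tldextract splits each URL into subdomain, registered domain, and suffix.
extracted = urls["url"].apply(tldextract.extract)
urls["domain"] = extracted.apply(lambda e: e.domain)
urls["suffix"] = extracted.apply(lambda e: e.suffix)

# Plot the most frequent suffixes as a bar graph, much like the screenshot above.
urls["suffix"].value_counts().head(10).plot(kind="bar")
plt.ylabel("count")
plt.show()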

We have a few more example notebooks in the repository, and hope to add more in the future that are not only created by the Archives Unleashed team, but from users! That’d be pretty great, eh?

Is this the best we can do? Definitely not. There will hopefully be more improvements in the future. But, for the time being, we are getting there. I believe it is a huge improvement from where we started in July of 2017 in terms of working with web archive collections. We’re helping to lower the barrier to access web archives by creating derivative files, depositing them into data repositories so that those collections can truly be used as citable data, and we’re also providing a gentle on-ramp to analysis with our example notebooks.

The next items to work on and solve are further lowering the barriers to accessing web archive collections. The librarian and archivist in me wants to make sure we’re not gatekeepers, and wants to see these really wonderful collections used. The researcher in me wants to get at this data for research with the least amount of interaction and friction possible.

So, what if there was a pipeline for processing a given collection, or segment of a collection, and publishing the derivatives in a given data repository along with a sample notebook?

🤔🤔🤔

I really think we could make this happen one day.


Cloud-hosted web archive data: The winding path to web archive collections as data was originally published in Archives Unleashed on Medium, where people are continuing the conversation by highlighting and responding to this story.

Welcome to 2020! News from your NDSA! / Digital Library Federation

Welcome to 2020! News from your NDSA

Greetings one and all. For starters, happy (belated) new year! We have had a lot on the burners this past year and 2020 will be no different. Here are some highlights (past and present):


Governance

We took a long and deep look at our overall governance structure. But what does that mean, exactly? Well, this time last year parts of the Coordinating Committee were meeting weekly to refocus our documentation and structure of our various components: Coordinating Committee, Interest Groups, Working Groups, etc. What we realized is that though good work was getting done, there was not coordination among all the various efforts. In other words, we were missing opportunities to leverage all the great talent from our Membership. We set out to create a structure that changes all that. In 2020, you will see new, publicly available, documentation of the various groups, their charges, and their focus for 2020. This will help us be accountable to you for all the great work that is going on. Stay tuned for more information on that!


NDSA Agenda

The long-awaited Agenda will be published in Q1 2020. I want to thank each and every one of you that contributed your thoughts and feedback to it! We think it is a great representation of digital preservation practice since the last one [2015] came out.


Interest and Working Groups (IG/WG)

So much has been going on in this area that sometimes it is hard to keep track. The NDSA Leadership (Coordinating Committee, IG/WG chairs, DLF) meets monthly to keep each other up to date and accountable. Each IG/WG has its minutes available for others to review for past and present topics and research areas. We have also created a new group – the Communications, Outreach, and Publications Working Group – that will be streamlining our actual products (like the Agenda, Levels of Preservation, survey results, etc.) as well as focusing on a new website structure. Our updated Storage Survey will also be coming out in Q1. Lots will be happening there in 2020.


Levels of Digital Preservation

Many of you have been involved in this work. The Levels of Preservation is an ongoing Working Group (chaired by yours truly) and will be rolling out updates on a more frequent basis. For 2020, the Curatorial Team will be finishing up its work and the new Training and Advocacy Subgroup will be all spun up. I want to thank all of you (there were almost 100 people) who participated on the various reboot teams—from across the globe! Version 2’s matrix and the associated Assessment tool are available, with more resources coming soon.


New Members

One of the best parts of this work is welcoming new members. We had a banner year last year with a bumper crop of new members. Here they are!

  • Clemson University
  • Illinois Institute of Technology
  • Roper Center for Public Opinion and Research (renewal)
  • University of Miami Libraries
  • University of Cincinnati Libraries
  • LIBNOVA, SL
  • University of Connecticut Library
  • The University of Washington Libraries
  • University of Louisville Libraries / William F. Ekstrom Library
  • University of the Balearic Islands
  • University of Colorado Boulder

We could always do more to highlight what each of you does and will work to do more of that in 2020.


DigiPres Conference

We had an amazing conference and turnout this past year in Tampa! Thanks to the entire planning team and DLF for all their hard work! The 2020 conference will be held in Baltimore, 11-12 November. We hope to see you there. Keep an eye out for the CFP – coming soon!

If you are interested in learning more or participating in any part of the NDSA, let me know! Send an email to NDSA@diglib.org. If you are not a member, please consider becoming one. The process is painless, I promise!


Wishing you all the best for 2020!

Bradley J. Daigle, Chair NDSA Leadership Team

The post Welcome to 2020! News from your NDSA! appeared first on DLF.

The Scholarly Record At The Internet Archive / David Rosenthal

The Internet Archive has been working on a Mellon-funded grant aimed at collecting, preserving and providing persistent access to as much of the open-access academic literature as possible. The motivation is that much of the "long tail" of academic literature comes from smaller publishers whose business model is fragile, and who are at risk of financial failure or takeover by the legacy oligopoly publishers. This is particularly true if their content is open access, since they don't have subscription income. This "long tail" content is thus at risk of loss or vanishing behind a paywall.

The project takes two opposite but synergistic approaches:
  • Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether each article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
  • Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index.
Below the fold, a discussion of the progress that has been made so far.

Top-down

The top-down approach is obviously the easier of the two, and has already resulted in fatcat.wiki, a fascinating search engine now in beta:
Fatcat is a versioned, user-editable catalog of research publications including journal articles, conference proceedings, and datasets

Features include archival file-level metadata (verified digests and long-term copies), an open, documented API, and work/release indexing (eg, distinguishing between and linking pre-prints, manuscripts, and version-of-record).
In contrast to the Wayback Machine's URL search, Fatcat allows searches for:
  • A Release, a published version of a Work, for example a journal article, a pre-print, or a book. Search terms include a DOI.
  • A Container, such as a journal or a serial. Search terms include an ISSN.
  • A Creator, such as an author. Search terms include an author name or an ORCID.
  • A File, based on a hash.
  • A File Set, such as a dataset.
  • A Work, such as an article, its pre-print, its datasets, etc.
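For programmatic access, the catalog also exposes an API. A hedged sketch of a DOI lookup follows; the base URL and endpoint path are assumptions based on my reading of the public API documentation, and the DOI is a placeholder rather than a real identifier.

import requests

API_BASE = "https://api.fatcat.wiki/v0"  # assumed base URL for the catalog API

def lookup_release_by_doi(doi: str) -> dict:
    # Assumed lookup endpoint: resolve an external identifier to a release record.
    resp = requests.get(f"{API_BASE}/release/lookup", params={"doi": doi})
    resp.raise_for_status()
    return resp.json()

release = lookup_release_by_doi("10.1000/example-doi")  # placeholder DOI
print(release.get("title"), release.get("release_year"))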
Because "David Rosenthal" is a relatively common name (there were at least two of us at Stanford), I am a fairly difficult test case. But here is a FatCat search for releases by David S. H. Rosenthal. The first 30 include:
  • Articles from 1985 and 2011 co-authored by David S. Rosenthal, an Australian oncologist.
  • A 1955 paper on pneumonia by S. David Sternberg and Joseph H. Rosenthal.
  • Papers from 1970, 1971 and 1972 whose authors include H. David and S. Rosenthal.
  • Psychology papers from 1968, 1975 and 1976 co-authored by David Rosenthal with no middle initial.
What does this show, apart from the fact that handling of middle initials could be improved? Only 21/30 are me. So are 17/30 on page 2, and 15/30 on page 3, but only 4/30 on page 4. FatCat has most of what it has on me in the first 100 entries. Not too bad for a beta-test, but it could be improved.

That is where the "wiki" part of fatcat.wiki comes in. FatCat is user-editable, able to support Wikipedia-style crowd-sourcing of bibliographic metadata. It is too soon to tell how effective this will be, but it is an attempt to address one of the real problems of the "long tail", that the metadata is patchy and low-quality.

Bottom-up

The bottom-up approach is much harder. At the start of the project the Internet Archive estimated that the Wayback Machine had perhaps 650M PDFs, and about 6% of them were academic papers. Whether a PDF is an academic paper is only obvious even to humans some of the time, and it is likely that the PDFs that aren't obvious include much of the academic literature that is at significant risk of loss.

The grant supported Paul Baclace to use machine learning techniques to build a classifier capable of distinguishing academic articles from other PDFs in the Internet Archive's collection. As he describes in Making A Production Classifier Ensemble, he faced two big problems. First, cost:
Internet Archive presented me with this situation in early 2019: it takes 30 seconds to determine whether a PDF is a research work, but it needs to be less than 1 second. It will be used on PDFs already archived and on PDFs as they are found during a Web crawl. Usage will be batch, not interactive, and it must work without GPUs. Importantly, it must work with multiple languages for which there are not many training examples.
Second, accuracy based on the available training sets:
The positive training set is a known collection of 18 million papers which can be assumed to be clean, ground truth. The negative training set is simply a random selection from the 150 million PDFs which is polluted with positive cases. An audit of 1,000 negative examples showed 6–7% positive cases. These were cleaned up by manual inspection and bootstrapping was used to increase the size of the set.
Since some PDFs contained an image of the text rather than the text itself, Baclace started by creating an image of the first page:
The image produced is actually a thumbnail. This works fine because plenty of successful image classification is done on small images all the time. This has a side benefit that the result is language independent: at 224x224, it is not possible to discern the language used in the document.
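The post does not include the project's code, but the first-page-thumbnail idea is easy to sketch. The following assumes the pdf2image package (which wraps the poppler utilities) and Pillow; the library choice and the 72 dpi setting are my assumptions, not Baclace's.

from pdf2image import convert_from_path  # requires the poppler utilities

def first_page_thumbnail(pdf_path: str, size: int = 224):
    # Render only the first page, then shrink it to a 224x224 thumbnail,
    # small enough that the language of the text is no longer discernible.
    pages = convert_from_path(pdf_path, first_page=1, last_page=1, dpi=72)
    return pages[0].resize((size, size))

thumb = first_page_thumbnail("example.pdf")
thumb.save("example-thumbnail.png")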
The image was run through an image classifier. Then the text was extracted:
After much experimentation, the pdftotext utility was found to be fast and efficient for text. Moreover, because it is run as a subprocess, a timeout is used to avoid getting stuck on parasitic cases. When exposing a system to millions of examples, bad apple PDFs which cause the process to be slow or stuck are possible.
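A minimal sketch of that guarded text extraction, running pdftotext as a subprocess with a timeout (the 10-second timeout is an illustrative value):

import subprocess

def extract_text(pdf_path: str, timeout: int = 10) -> str:
    # "-" tells pdftotext to write the extracted text to stdout. The timeout
    # prevents a pathological ("bad apple") PDF from stalling the pipeline.
    try:
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True, timeout=timeout, check=True,
        )
        return result.stdout.decode("utf-8", errors="replace")
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return ""  # treat failure as "no text"; the image model can still vote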
How good did the text classification have to be?
The classic computer science paper has the words “abstract” and “references”. How well would it work to simply require both of these words? I performed informal measurements and found that requiring both words had only 50% accuracy, which is dismal. Requiring either keyword has 84% accuracy for the positive case. Only 10% of the negative cases had either of these keywords. This gave me an accuracy target to beat.
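That keyword baseline is simple enough to write down explicitly, which makes the 84%/50% comparison above concrete:

def keyword_baseline(text: str, require_both: bool = False) -> bool:
    # Classify a document as a research work based on two telltale words.
    lowered = text.lower()
    has_abstract = "abstract" in lowered
    has_references = "references" in lowered
    if require_both:
        return has_abstract and has_references   # ~50% accuracy in Baclace's informal test
    return has_abstract or has_references        # ~84% accuracy on positive cases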
Baclace used two text classifiers. First, fastText from Facebook:
It’s a well-written C++ package with python bindings that makes no compromises when it comes to speed. “Fast” means 1msec., which is 2 orders of magnitude faster than a deep neural network on a CPU. The accuracy on a 20,000 document training and test set reached 96%. One of the big advantages is that it is full text, but a big disadvantage is that this kind of bag-of-words classifier does no generalization to languages with little or no training examples.
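In Python, a fastText supervised classifier of this kind takes only a few lines. A sketch follows; the training-file name, hyperparameters, and label names are illustrative assumptions, and the training data must be in fastText's one-document-per-line "__label__..." format.

import fasttext

# Train on a file of lines like: "__label__paper <full text>" / "__label__other <full text>".
model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)

# fastText predictions choke on newlines, so flatten the text first.
labels, probs = model.predict("abstract we present a study of ...".replace("\n", " "))
print(labels[0], probs[0])   # e.g. ('__label__paper', 0.97)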
Second, BERT:
By now, everyone knows that the BERT model was a big leap ahead for “self-supervised” natural language. The multilingual model was trained with over 100 languages and it is a perfect choice for this project.

Using the classifier mode for BERT, the model was fine-tuned on a 20,000 document training and test set. The result was 98% accuracy.
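For comparison, scoring a document with a multilingual BERT sequence classifier looks roughly like this. It is a sketch using the Hugging Face transformers library, which is my assumption; the classification head shown here is untrained and would need the fine-tuning step described above before its outputs mean anything.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model_bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)  # fine-tune on the 20,000-document set before using in earnest

def bert_paper_probability(text: str) -> float:
    # BERT sees at most 512 tokens, so long documents are truncated.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model_bert(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of the "paper" class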
System Structure
The results of the three classifiers were combined as shown in the diagram:
Each model described above computes a classification confidence. This makes it possible to create an ensemble classifier that selectively uses the models to emphasize speed or accuracy. For a speed example, if text is available and the fastText linear classifier confidence is high, BERT could be skipped. To emphasize accuracy, all three models can be run and then the combined confidence values and classification predictions make an overall classification.
The result is a fast and reasonably accurate classifier. Its output can be fed to standard tools for extracting bibliographic metadata from the PDFs of academic articles, and the results of that extraction added to FatCat's index.
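Tying the pieces together, the selective ensemble might look like the following sketch, which builds on the snippets above; the 0.95 confidence threshold and the image_classifier_score helper are assumptions, not the project's actual code.

def classify_pdf(pdf_path: str) -> bool:
    # 1. Cheap text classifier first (fastText), if any text can be extracted.
    text = extract_text(pdf_path)                      # pdftotext sketch above
    if text.strip():
        labels, probs = model.predict(text.replace("\n", " "))
        if probs[0] >= 0.95:                           # confident: skip the slow model
            return labels[0] == "__label__paper"
        # 2. Low confidence: let the slower BERT model weigh in on the same text.
        return bert_paper_probability(text) >= 0.5
    # 3. No extractable text: fall back to the image model on a first-page thumbnail.
    # image_classifier_score is a hypothetical stand-in for the trained image model.
    return image_classifier_score(first_page_thumbnail(pdf_path)) >= 0.5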

Synergy

Now the Internet Archive has two independent ways to find open access academic articles in the "long tail":
  • by looking for them where the metadata says they should be,
  • and by looking at the content collected by their normal Web crawling and by use of "Save Page Now".
These two approaches could work together. For example, the bottom-up approach could be used to validate user edits to FatCat, or to flag examples of metadata that was missing from the metadata services, or inconsistent with the metadata extracted from the content.

Islandoracon 2021 - Seeking Expressions of Interest / Islandora

The Islandora Foundation is seeking Expressions of Interest to host the 2021 Islandora Conference (Islandoracon). 

This will be the fourth iteration of our community’s largest gathering. Previous Islandoracon locations include:

  • 2015: University of Prince Edward Island, Charlottetown, PE
  • 2017: McMaster University, Hamilton, ON
  • 2019: Simon Fraser University, Vancouver, BC


Deadline for submissions: July 31, 2020


Requirements:

  • The host must cover the cost of the venue (whether by offering your own space or paying the rent on a venue). All other costs (transportation, catering, supplies, etc) will be covered by the Islandora Foundation. The venue must have:

    • Space for up to 150 attendees total, with room for at least two simultaneous tracks, and additional pre-conference workshop facilities, with appropriate A/V equipment. Laptop-friendly seating is a requirement.

    • On-site quiet room or sensory friendly room (a small room away from the bustle of the conference; this can be a regular office or conference room) for attendee use.

    • Wireless internet capable of supporting 150+ simultaneous connections, at no extra charge for conference attendees.

    • A location convenient to an airport and hotels (or other accommodations, such as student housing)

    • A local planning committee willing to help with organization.

    • A location that is accessible and inclusive. This includes, but isn’t limited to:

      • Available gender neutral or trans-friendly bathrooms.
      • Available accessible and/or family bathrooms. 
      • Catering options that can accommodate different dietary needs.
      • Rooms and entrances that are accessible for those with limited mobility
  • The host is not responsible for developing the Islandoracon program, pre-conference events, sponsorships, or social events.

The EOI must include:

  • The name of the institution(s)
  • Primary contact (with email)
  • Proposed location, with a brief description of amenities, travel, and other considerations that would make it a good location for the conference.
  • A proposed time of year. We do not have a fixed schedule, so if there is a season when your venue is particularly attractive, the conference dates can move accordingly.

The location will be selected by Islandora staff in consultation with the Islandora Coordinating Committee.

The Islandora Foundation is committed to being a welcoming and inclusive space for our entire community. As such, Islandoracon will not be held in a location where local laws may be in conflict with our Code of Conduct.

Twitter / pinboard

New issue of the #Code4Lib Journal published. Some terrific looking papers, including a review of PIDs for heri…

Editorial / Code4Lib Journal

on diversity and mentoring

Scraping BePress: Downloading Dissertations for Preservation / Code4Lib Journal

This article will describe our process developing a script to automate downloading of documents and secondary materials from our library's BePress repository. Our objective was to collect the full archive of dissertations and associated files from our repository into a local disk for potential future applications and to build out a preservation system. Unlike at some institutions, our students submit directly into BePress, so we did not have a separate repository of the files; and the backup of BePress content that we had access to was not in an ideal format (for example, it included "withdrawn" items and did not effectively isolate electronic theses and dissertations). Perhaps more importantly, the fact that BePress was not SWORD-enabled and lacked a robust API or batch export option meant that we needed to develop a data-scraping approach that would allow us to both extract files and have metadata fields populated. Using a CSV of all of our records provided by BePress, we wrote a script to loop through those records and download their documents, placing them in directories according to a local schema. We dealt with over 3,000 records and about three times that many items, and now have an established process for retrieving our files from BePress. Details of our experience and code are included.

Persistent identifiers for heritage objects / Code4Lib Journal

Persistent identifiers (PID’s) are essential for getting access and referring to library, archive and museum (LAM) collection objects in a sustainable and unambiguous way, both internally and externally. Heritage institutions need a universal policy for the use of PID’s in order to have an efficient digital infrastructure at their disposal and to achieve optimal interoperability, leading to open data, open collections and efficient resource management. Here the discussion is limited to PID’s that institutions can assign to objects they own or administer themselves. PID’s for people, subjects etc. can be used by heritage institutions, but are generally managed by other parties. The first part of this article consists of a general theoretical description of persistent identifiers. First of all, I discuss the questions of what persistent identifiers are and what they are not, and what is needed to administer and use them. The most commonly used existing PID systems are briefly characterized. Then I discuss the types of objects PID’s can be assigned to. This section concludes with an overview of the requirements that apply if PIDs should also be used for linked data. The second part examines current infrastructural practices, and existing PID systems and their advantages and shortcomings. Based on these practical issues and the pros and cons of existing PID systems a list of requirements for PID systems is presented which is used to address a number of practical considerations. This section concludes with a number of recommendations.

Dimensions & VOSViewer Bibliometrics in the Reference Interview / Code4Lib Journal

The VOSviewer software provides easy access to bibliometric mapping using data from Dimensions, Scopus and Web of Science. The properly formatted and structured citation data, and the ease in which it can be exported open up new avenues for use during citation searches and reference interviews. This paper details specific techniques for using advanced searches in Dimensions, exporting the citation data, and drawing insights from the maps produced in VOS Viewer. These search techniques and data export practices are fast and accurate enough to build into reference interviews for graduate students, faculty, and post-PhD researchers. The search results derived from them are accurate and allow a more comprehensive view of citation networks embedded in ordinary complex boolean searches.

Automating Authority Control Processes / Code4Lib Journal

Authority control is an important part of cataloging since it helps provide consistent access to names, titles, subjects, and genre/forms. There are a variety of methods for providing authority control, ranging from manual, time-consuming processes to automated processes. However, the automated processes often seem out of reach for small libraries when it comes to using a pricey vendor or expert cataloger. This paper introduces ideas on how to handle authority control using a variety of tools, both paid and free. The author describes how their library handles authority control; compares vendors and programs that can be used to provide varying levels of authority control; and demonstrates authority control using MarcEdit.

Managing Electronic Resources Without Buying into the Library Vendor Singularity / Code4Lib Journal

Over the past decade, the library automation market has faced continuing consolidation. Many vendors in this space have pushed towards monolithic and expensive Library Services Platforms. Other vendors have taken "walled garden" approaches which force vendor lock-in due to lack of interoperability. For these reasons and others, many libraries have turned to open-source Integrated Library Systems (ILSes) such as Koha and Evergreen. These systems offer more flexibility and interoperability options, but tend to be developed with a focus on public libraries and legacy print resource functionality. They lack tools important to academic libraries such as knowledge bases, link resolvers, and electronic resource management systems (ERMs). Several open-source ERM options exist, including CORAL and FOLIO. This article analyzes the current state of these and other options for libraries considering supplementing their open-source ILS either alone, hosted or in a consortial environment.

Shiny Fabric: A Lightweight, Open-source Tool for Visualizing and Reporting Library Relationships / Code4Lib Journal

This article details the development and functionalities of an open-source application called Fabric. Fabric is a simple to use application that renders library data in the form of network graphs (sociograms). Fabric is built in R using the Shiny package and is meant to offer an easy-to-use alternative to other software, such as Gephi and UCInet. In addition to being user friendly, Fabric can run locally as well as on a hosted server. This article discusses the development process and functionality of Fabric, use cases at the New College of Florida's Jane Bancroft Cook Library, as well as plans for future development.

Analyzing and Normalizing Type Metadata for a Large Aggregated Digital Library / Code4Lib Journal

The Illinois Digital Heritage Hub (IDHH) gathers and enhances metadata from contributing institutions around the state of Illinois and provides this metadata to the Digital Public Library of America (DPLA) for greater access. The IDHH helps contributors shape their metadata to the standards recommended and required by the DPLA in part by analyzing and enhancing aggregated metadata. In late 2018, the IDHH undertook a project to address a particularly problematic field, Type metadata. This paper walks through the project, detailing the process of gathering and analyzing metadata using the DPLA API and OpenRefine, data remediation through XSL transformations in conjunction with local improvements by contributing institutions, and the DPLA ingestion system’s quality controls.

Scaling IIIF Image Tiling in the Cloud / Code4Lib Journal

The International Archive of Women in Architecture, established at Virginia Tech in 1985, collects books, biographical information, and published materials from nearly 40 countries that are divided into around 450 collections. In order to provide public access to these collections, we built an application using the IIIF APIs to pre-generate image tiles and manifests which are statically served in the AWS cloud. We established an automatic image processing pipeline using a suite of AWS services to implement microservices in Lambda and Docker. By doing so, we reduced the processing time for terabytes of images from weeks to days. In this article, we describe our serverless architecture design and implementations, elaborate the technical solution on integrating multiple AWS services with other techniques into the application, and describe our streamlined and scalable approach to handle extremely large image datasets. Finally, we show the significantly improved performance compared to traditional processing architectures along with a cost evaluation.

Where Do We Go From Here: A Review of Technology Solutions for Providing Access to Digital Collections / Code4Lib Journal

The University of Toronto Libraries is currently reviewing technology to support its Collections U of T service. Collections U of T provides search and browse access to 375 digital collections (and over 203,000 digital objects) at the University of Toronto Libraries. Digital objects typically include special collections material from the university as well as faculty digital collections, all with unique metadata requirements. The service is currently supported by IIIF-enabled Islandora, with one Fedora back end and multiple Drupal sites per parent collection (see attached image). Like many institutions making use of Islandora, UTL is now confronted with Drupal 7 end of life and has begun to investigate a migration path forward. This article will summarise the Collections U of T functional requirements and lessons learned from our current technology stack. It will go on to outline our research to date for alternate solutions. The article will review both emerging micro-service solutions, as well as out-of-the-box platforms, to provide an overview of the digital collection technology landscape in 2019. Note that our research is focused on reviewing technology solutions for providing access to digital collections, as preservation services are offered through other services at the University of Toronto Libraries.

MarcEdit 7.2.x Updates / Terry Reese

I posted two updates today

  1. Version 7.2.100 (which the auto update will pull down).
  2. Version 7.2.95 (which won’t be pulled down, but can be picked up from the main downloads page)

So why the two versions?  Version 7.2.100 shifts the .NET framework that the program targets from 4.6.1 to 4.7.2.  There are differences, and I had to make a handful of changes to the UI to correct some unsafe UI updating practices which 4.7.2 won’t allow.  I believe I’ve caught them all.  If I’ve missed something and you get a UI error, let me know where and how to recreate it so I can fix it.  The primary difference for users will be that 4.7.2 is faster, is part of the current .NET development branches (so I got to remove a number of workarounds I’d been keeping because I was using 4.6.1), and no longer supports Vista.  If you need to support Vista, you can go to the download page and pick up 7.2.95.  This is the last version built against 4.6.1 that will support Vista.

See the downloads page: https://marcedit.reeset.net/downloads if you want to see how I provide access to older versions should you ever have a need to downgrade (due to supported operating system or problematic bug or regression).

–tr

Books read, January-February 2020 / Mark Matienzo

I’m trying to do a better job tracking what I’ve been reading. Here’s a start.

Key: ✅ finished; 👍🏽 recommended; 🤯 blew my mind; 🔀 non-linear reading; 😒 meh

  • Cynthia Cruz, Disquieting: Essays on Silence (Book*hug, 2019) ✅👍🏽🤯
  • Margarita García Robayo, Fish Soup (Charco Press, 2018, tr. Charlotte Coombe) 😒
  • The Desert Fathers: Sayings of the Early Christian Monks (Penguin Books, 2003, tr. Benedicta Ward) 🔀🤯
  • Jenny Odell, How to Do Nothing: Resisting the Attention Economy (Melville House, 2019) ✅👍🏽
  • Elinor Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action (Cambridge UP, 1990)

Announcing Incoming NDSA Coordinating Committee Members for 2020 / Digital Library Federation

Please join me in welcoming the three new 2020 elected Coordinating Committee members: Courtney Mumma, Daniel Noonan, and Nathan Tallman.

Courtney C. Mumma is an archivist and a librarian. She is the Deputy Director of the Texas Digital Library consortium, a collective of university libraries working towards open, sustainable, and secure digital heritage and scholarly communications. She has over a decade of experience in open source software development and maintenance, infrastructure support, and digital preservation good practice and education.

Daniel Noonan, Associate Professor/Digital Preservation Librarian at The Ohio State University, plays a key role in developing a trusted digital preservation ethos and infrastructure at The Ohio State University Libraries (OSUL). This position contributes strategy and expertise, and provides leadership through close collaboration with faculty, staff, and other leaders in OSUL’s Information Technology, Preservation & Reformatting, Special Collections & Archives, Archival Description and Access, and Publishing and Repository Services groups. Previously, he was OSUL’s Electronic Records/Digital Resources Archivist and Electronic Records Manager/Archivist. Simultaneously, Dan was an adjunct faculty member for Kent State University, teaching an archives foundations course.

Nathan Tallman, Digital Preservation Librarian at Penn State University, coordinates policies, workflows and practices to ensure the long-term preservation and access of the University Libraries’ born-digital and born-analog collections. He advises on equipment, infrastructure, and vendors for Penn State digital content. Nathan also helps manage access systems by coordinating local practices and support for digital collections.

Members of the NDSA Coordinating Committee serve staggered terms. We thank our outgoing Coordinating Committee members, Carol Kussmann and Helen Tibbo, for their service and many contributions. We are also grateful to the very talented, qualified individuals who participated in this election.

To sustain a vibrant, robust community of practice, we rely on and deeply value the contributions of all members, including those who took part in voting.
Best wishes to all as we welcome Courtney, Dan, and Nathan to their new roles within NDSA!

~ Bradley J. Daigle, Chair NDSA Leadership Team

The post Announcing Incoming NDSA Coordinating Committee Members for 2020 appeared first on DLF.

Economic Limits Of Proof-of-Stake Blockchains / David Rosenthal

In 2018's Cryptocurrencies Have Limits I discussed Eric Budish's The Economic Limits Of Bitcoin And The Blockchain, an important analysis of the economics of two kinds of "51% attack" on Bitcoin and other cryptocurrencies based on "Proof-of-Work" (PoW) blockchains. Among other things, Budish shows that, for safety, the value of transactions in a block must be low relative to the fees in the block plus the reward for mining the block. In last year's The Economics Of Bitcoin Transactions I discussed Raphael Auer's Beyond the doomsday economics of “proof-of-work” in cryptocurrencies, in which Auer shows that:
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
Follow me below the fold for a discussion of a fascinating recent paper that extends Budish's analysis.
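Before getting to the new paper, a toy illustration of the kind of safety comparison Budish formalizes may help; all of the numbers and the simplified one-line inequality below are my illustrative assumptions, not figures or formulas from either paper.

# Toy Budish-style check: a block is only "safe" if the value an attacker could
# gain by rewriting it is small relative to honest miners' per-block income.
block_reward_btc = 12.5      # assumed block subsidy at the time of writing
avg_fees_btc = 0.5           # assumed total transaction fees per block
btc_usd = 10_000             # assumed exchange rate

per_block_income_usd = (block_reward_btc + avg_fees_btc) * btc_usd

def roughly_safe(attackable_value_usd: float, safety_multiple: float = 1.0) -> bool:
    # The safety_multiple stands in for the model's attack-duration and
    # repeated-game details, which this sketch deliberately ignores.
    return attackable_value_usd <= safety_multiple * per_block_income_usd

print(roughly_safe(50_000))       # True: modest transfers are cheap to secure
print(roughly_safe(10_000_000))   # False: huge transfers exceed the security budget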

Both Budish and Auer assumed PoW blockchains underlay the cryptocurrencies they analyzed. PoW has long been criticized for its severe environmental impacts (The top 5 cryptocurrencies have at times been estimated to use as much energy as The Netherlands).

Cohen's Critique
In response, nearly five years ago, Ethereum (the #2 cryptocurrency by "market cap") started work on a Proof-of-Stake (PoS) mining algorithm. Technically, as Vitalik Buterin wrote, it turned out to be non-trivial; Ethereum still uses PoW. Bram Cohen, the creator of BitTorrent, included a detailed critique of PoS in an EE380 talk entitled Stopping grinding attacks in proofs of space. The major problem is that, given the very high Gini coefficients of cryptocurrencies, PoS networks would inevitably be even more centralized than PoW networks are in practice.

Now, via Yves Smith, we find a post entitled More (or less) economic limits of the blockchain by Prof. Joshua Gans (U. Toronto) and Prof. Neil Gandal (Tel Aviv U.) in which they summarize their paper with the same title (the most recent version of the paper is paywalled, I am working from the earlier open access version at SSRN). The importance of this paper is that it extends the economic analysis of Budish to PoS blockchains. Their abstract reads:
Cryptocurrencies such as Bitcoin rely on a ‘proof of work’ scheme to allow nodes in the network to ‘agree’ to append a block of transactions to the blockchain, but this scheme requires real resources (a cost) from the node. This column examines an alternative consensus mechanism in the form of proof-of-stake protocols. It finds that an economically sustainable network will involve the same cost, regardless of whether it is proof of work or proof of stake. It also suggests that permissioned networks will not be able to economise on costs relative to permissionless networks.

Permissionless Networks

Gans & Gandal's analysis of permissionless PoS blockchains asks:
The economic question is whether PoS type systems can perform more efficiently than PoW systems.
They show,
using the methodology for examining blockchain sustainability developed by Budish (2018), that the (perhaps) surprising answer is no! In the case of Permissionless blockchains (i.e. free entry,) the cost of PoW schemes are identical to the cost of PoS schemes.
If PoS delivered the same functionality as PoW at the same cost, it should be preferred as lacking PoW's environmental impact. Alas, implementing a practical, attack-resistant PoS system has proven so non-trivial that it has consumed five years of work by the Ethereum team without success.

Gans & Gandal assume that PoS nodes are rational economic actors, accounting for the interest foregone by the staked cryptocurrency. As we see with Bitcoin's Lightning Network, true members of the cryptocurrency cult are not concerned that the foregone interest on capital they devote to making the system work is vastly greater than the fees they receive for doing so. The reason is that, as David Gerard writes, they believe that "number go up". In other words, they are convinced that the finite supply of their favorite coin guarantees that its value will in the future "go to the moon", providing capital gains that vastly outweigh the foregone interest.

Permissioned Networks

Permissioned networks are those in which some central authority controls the set of nodes forming the network, granting or withholding permission to participate. In a footnote, Gans & Gandal correctly point out that:
Formally, PoW and PoS  are “Sybil” control mechanisms, rather than consensus protocols. These mechanisms need to be combined with consensus protocols to make the system work. For example, in the case of Bitcoin, the longest chain in the blockchain is the consensus rule.
Sybil attacks involve the attacker creating enough spurious nodes to overwhelm the honest network nodes. In another footnote, they note that:
Another class of methods is based on Byzantine Fault Tolerance. In these methods, a node is chosen at random to be a validator but a block is only considered final if a supermajority (A) other staked nodes agree that it is valid. The advantage is that the block can be relied upon without having to wait t periods of time. In order to compare PoS and PoW, we do not examine this alternative in the paper.
There have been permissionless PoS implementations using BFT, but they should have examined it in the context of permissioned networks. In a permissioned network, a Sybil attack could only be mounted by the central authority that controls membership, so expending resources through PoW, PoS or some other scheme to make participation expensive enough to deter such attacks is pointless. Permissioned networks have no need for PoW or PoS; they can use the canonical means of establishing consensus in a fixed-size network of unreliable and potentially malign nodes, Byzantine Fault Tolerance.

Citations

The authors cite a couple of papers I should have found earlier:
  • Budish showed that Bitcoin was unsafe unless the value of transactions in a block was less than the sum of the mining reward and the fees for the transactions it contains. The mining reward is due to decrease to zero, at which point safety requires fees larger than the value of the transactions, not economically viable. In 2016 Arvind Narayanan's group at Princeton published a related instability in Carlsten et al's On the instability of bitcoin without the block reward. Narayanan summarized the paper in a blog post:
    Our key insight is that with only transaction fees, the variance of the miner reward is very high due to the randomness of the block arrival time, and it becomes attractive to fork a “wealthy” block to “steal” the rewards therein.
    Note that:
    We model transaction fees as arriving at a uniform rate. The rate is non-uniform in practice, which is an additional complication.
    The rate is necessarily non-uniform, because transactions are in a blind auction for inclusion in the next block, which leads to over-payment. As Izabella Kaminska wrote:
    In the world of bitcoin, urgent transactions subsidise non-urgent transactions.

    This might be justifiable if payment urgency was somehow a reflection of status, wealth or hierarchy, but it's not. Poor people need to make urgent payments just as often as the wealthy. A payment network which depends on gouging the desperate to run efficiently, while giving free gifts to the non-desperate is no basis for a system of money.
  • Evangelos Deirmentzoglou et al's A Survey on Long-Range Attacks for Proof of Stake Protocols provides a comprehensive overview of PoS. Their abstract reads:
    Despite common arguments about the prevalence of blockchain technology, in terms of security, privacy, and immutability, in reality, several attacks can be launched against them. This paper provides a systematic literature review on long-range attacks for proof of stake protocols. If successful, these attacks may take over the main chain and partially, or even completely, rewrite the history of transactions that are stored in the blockchain. To this end, we describe how proof of stake protocols work, their fundamental properties, their drawbacks, and their attack surface. After presenting long-range attacks, we discuss possible countermeasures and their applicability.
    This paper will repay further study.

Jobs in Information Technology: February 13, 2020 / LITA

New this week

Visit the LITA Jobs Site for additional job listings and information on submitting your own job posting.

LITA Blog Call for Contributors / LITA

Blog from anywhere, even your favorite cafe!

We’re looking for new contributors for the LITA Blog!

Do you have just a single idea for a post or a series of posts? No problem! We’re always looking for guest contributors with new ideas.

Do you have thoughts and ideas about technology in libraries that you’d like to share with LITA members? Apply to be a regular contributor!

If you’re a member of LITA, consider either becoming a regular contributor for the next year or submitting a post or two as a guest.

Apply today!

New position at NC Cardinal: Project Manager / Evergreen ILS

NC Cardinal is a growing consortium providing access to Evergreen for approximately half of the public libraries throughout North Carolina. Our team is located in Raleigh, close to the Research Triangle Park, in a beautiful part of the country. We’re looking to add another member to our expanding team to focus on project management. Feel free to contact me if you have questions.

Responsibilities include:

55% Project Management

· Works with the Program Manager, teammates, clients and vendors to design and implement ILS related projects from requirements gathering to completion. Assesses outcome and impact of projects.

· Communicates with Directors and staff of member libraries to understand their needs, explain functionality and configurability of ILS software and ensure satisfaction with project outcomes.

· Researches and tests the functionality of the Integrated Library System (ILS) and works with colleagues and community members to understand and document the functionality of the ILS.

· Administers organizational settings, circulation / hold policies, and other system configurations.

· Monitors resource sharing activities and assists member libraries with inquiries related to resource sharing policies and procedures

· Provides libraries with assistance with policies and implementation of the Student Access program

· Assists with process of migrating libraries into the consortium, helping to specify and explain configuration options and implement configuration decisions.

40% Technical Support

· Provides professional support and assistance to NC Cardinal member libraries. Troubleshoots and resolves support issues in an efficient manner.

· Responds to telephone calls, email and requests for help tickets, bug reporting, or technical support and refers issues to the appropriate NC Cardinal team member in a timely manner.

· Reports bugs to the larger Evergreen community and participates in development and bug patches of the Evergreen open source software.

· Acts as liaison between the libraries and vendors to resolve system performance issues.

· Serves as an expert resource and technical advisor to the NC Cardinal program.

· Provides training activities related to NC Cardinal functions and services.

· Creates documentation for program policies and procedures, including software functionality and internal policies and procedures.

· Works with the Cardinal team to administer union catalog of public library collections; including technical assistance, continuing education, and resource sharing.

· Monitors daily operations of the NC Cardinal system.

· Evaluates programs and services regularly; identifies unmet needs.

5% Community Engagement

· Seeks resources/opportunities for staff training to enhance technical skills.

· Supports networks and task forces when necessary to facilitate the accomplishment of NC Cardinal goals.

· Attends seminars, conferences, job-related training and other staff development programs, and serves on committees and task forces; shares new knowledge with library development team through emails, meetings and in-house training opportunities.

· Participates in larger library and Evergreen community; presents at local, state, and national conferences and workshops.

· Maintains knowledge of current trends and developments in the field by reading appropriate books, journals, and other literature and attending related seminars and conferences; applies trends and technologies where appropriate.

· Monitors communications from outside sources as well as throughout the NC Cardinal system, and disseminates as appropriate.

· Establishes and maintains relationships by participating in state and national committees.

· Participates as a library development team member to establish and accomplish division goals and objectives.

Apply at: https://www.governmentjobs.com/careers/northcarolina/jobs/2708248/nc-cardinal-project-manager

More On The Ad Bubble / David Rosenthal

Google UI Timeline
Two weeks ago a firestorm erupted over a seemingly insignificant change to the UI of Google's search engine. It was enough to get Google to backtrack. A week later Daisuke Wakabayashi and Tiffany Hsu had the details in Why Google Backtracked on Its New Search Results Look, including this informative timeline graphic of the history of such changes since 2007. Their explanation for why Google made the change was:
Users complained that Google was trying to trick people into clicking on more paid results, while marketing executives said it was yet another step in blurring the line between ads and unpaid search results, forcing them to spend more money with the internet company.
Well, yes, but follow me below the fold for the bigger picture.

Why was this change so important that Google had to push the envelope of the FTC's guidelines?
Josh Zeitz, another Google spokesman for the ads team, said the design changes were in line with guidelines from the Federal Trade Commission. In 2013, the F.T.C. made recommendations for how search engines should label ads, but stopped short of specific requirements other than that paid results should be “noticeable and understandable to consumers.”

Google’s recent changes adhered to some of the guidelines but ignored others. Google did not follow what the F.T.C. prescribed for “visual cues” with paid results marked by “prominent shading that has a clear outline,” by a “border that distinctly sets off advertising” from unpaid search results or by both. But the new ad icon met the F.T.C.’s recommendation for ad labels to appear before the paid result on the upper left-hand side.
Two years ago on Dave Farber's IP list my friend Chuck McManis provided a lengthy and informative response to a post discussing Hiroko Tabuchi's article entitled How Climate Change Deniers Rise to the Top in Google Searches. He wrote:
To put this in context, consider the challenge of Google's eroding search advertising margins[1]. Google traditionally reported something called 'CPC' (or Cost Per Click) in their financial reports. This was the price that an advertiser paid Google when that advertiser's ad was 'clicked' on by a user. In a goods economy, it might be equivalent to the 'average selling price' or ASP for the widget. But unlike a goods economy, Google has so far been able to make up for this price erosion by increased ad volume.
...
[1] The average price per click (CPC) of advertisements on Google sites has gone down every year, and nearly every quarter, since 2009. At the same time Microsoft's Bing search engine CPCs have gone up. As the advantage of Google's search index is eroded by time and investment, primarily by Microsoft, advertisers have been shifting budget to be more of a blend between the two companies. The trend suggests that at some point in the not too distant future advertising margins for both engines will be equivalent.
Another way of looking at the erosion of Google's CPC illustrates the "advertising is a bubble" theme. Since ads are sold via an auction mechanism, advertisers are gradually figuring out that the value of an ad is less than they thought it was, so what they're willing to bid for it is less. They are still wildly optimistic about the value, as I discussed in Advertising Is A Bubble, but the trend is there.

Now, Tom Foremski elaborates on McManis' observation in The mysterious disappearance of Google's click metric:
Google's recent end-of-year 2019 financial report was a stunner: it included new financial details, but it removed several more.

For the first time, the revenues for YouTube and the cloud IT business were disclosed, but without any cost of operations, and Google removed key metrics that have been included for more than 15 years: How much money it makes per click (Cost-per-Click (CPC)) and the growth of paid clicks.

These monetization metrics are typically found on the second page of every quarterly earnings release from Google -- which underscores their importance in a 10-page document. Yet they are missing from the latest Google 2019 Q4 report with no explanation.
YoY drop in CPC each quarter
What does Google want to prevent investors focusing on?
The seemingly unstoppable revenue per click decline is the most concerning. Look at these past 19 quarters, but it's been going on for far longer.

Google has a rapidly deflating advertising product, sometimes 29% less revenue per click, every quarter, year-on-year, year after year.
As McManis pointed out, this trend started a decade ago, long before the chart starts in 2015 Q1.

Change in CPC vs. change in Clicks
Foremski's second chart shows that, up to now, Google has kept growing income despite falling CPC (the red bars) by growing the number of clicks (the blue bars). The mismatch between the red and blue lengths looks OK for Google, but the mismatch between their slopes doesn't. This is the reason why they keep trying to hide the fact that something is an ad, because they desperately need people to click on them.
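
To put rough numbers on the slope mismatch: ad revenue is CPC times paid clicks, so a 29% year-on-year drop in CPC means clicks have to grow by roughly 41% (1 / 0.71 ≈ 1.41) just to hold revenue flat, and by even more for revenue to grow.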

Clearly, either the missing 2019 Q4 CPC number was bad, or Google can see bad CPC numbers coming. Otherwise it wouldn't be missing. Foremski sums up thus:
what does this say about the effectiveness of Google's ads? They aren't very good and their value is declining at an astounding and unstoppable pace.
And points to the inevitable, Google doubling down on their decaying business model:
To survive, Google must find ways of showing even more ads. This is the future with Google -- more ads in more places. Or rather, more ineffective ads in more places. This is an unsustainable business model.
More and more of a less and less effective product is pretty much the definition of a bubble.



Learn the latest in Library UX with this LITA Webinar / LITA

There’s a seat waiting for you… Register for this LITA webinar today!

How to Talk About Library UX – Redux

Presenter: Michael Schofield

Librarian / Director of Engineering, WhereBy.Us

Wednesday, March 11, 2020

12:00 – 1:00 pm Central Time

The last time we did this webinar was in 2016 – and a lot’s changed. The goal then was to help establish some practical benchmarks for how to think about the user experience and UX design in libraries, which suffered from a lack of useful vocabulary and concepts: while we might be able to evangelize the importance of UX, LibUXers struggled with translating their championship into the kinds of bureaucratic goals that unlocked real budget for our initiatives.

It’s one thing to say, “the patron experience is critical!” It’s another thing to say, “the experience is critical – so pay for OptimalWorkshop, or hire a UX Librarian, or give me a department.”

And let’s be real, this is still a real obstacle. But now, there are more examples than ever of successful UX programs in libraries, and models for how even whole UX departments might be structured. The hill you have to climb to pitch UX is a little less steep.

What’s changed is twofold: the collective level of UX maturity in librarianship (it’s gone up!), and the increasing pace of practical thinking in the “fields” of service design and researchOps.

This 60-minute webinar – “How to talk about UX Redux” – is benchmark 2.0.

Learning objectives for this program include:

  • Understand a higher baseline of UX, its role in the organizational mission, and its part in a larger ecosystem of “products,” services, and policies.
  • Learn a high-level working vocabulary for UX and service design.
  • Learn new insights into the practice of “researchOps,” and sound arguments for allocating more of the library budget into this kind of thinking.

This course is geared toward librarians and librarifriends who are invested — at least in spirit! — in improving the library UX. This webinar might be especially good for LibUXers who have already seen the concepts and practice of UX change weirdly and are looking for a hard reset.

View details and Register here.

Can’t attend the live event? No problem! Register and you’ll receive a link to the recording.

Unveiling the new okfn.org website, blog and logo / Open Knowledge Foundation



Today the Open Knowledge Foundation is launching its revamped website, updated blog and new logo.

Our vision is for a future that is fair, free and open. This will be our guiding principle in everything we do.

Our mission is to create a more open world – a world where all non-personal information is open, free for everyone to use, build on and share; and creators and innovators are fairly recognised and rewarded.

We understand that phrases like ‘open data’ and ‘open knowledge’ are not widely understood. It is our job to change that.

Our strategy, continuum and animated video aim for us to reach a wider and more mainstream audience with relatable and practical interventions.

This renewed mission has limitless possibilities and the Board and team are excited about our organisation’s next steps and hopeful for the future.

Please let us know any thoughts you have about our website, blog, animated video or new logo by emailing info@okfn.org.

Joint Working Group on eBooks and Digital Content in Libraries / LITA

John Klima, the LITA Representative to the Working Group on eBooks and Digital Content, recently agreed to an interview about the latest update from ALA Midwinter 2020. Watch the blog for more updates from John about the Working Group in the coming months!

What is the mission and purpose of the Working Group on eBooks and Digital Content?

Quoting from the minutes of the ALA Executive Board Fall meeting in October of 2019:

[The purpose of this working group is] to address library concerns with publishers and content providers specifically to develop a variety of digital content license models that will allow libraries to provide content more effectively, allowing options to choose between one-at-a-time, metered, and other options to be made at point of sale; to make all content available in print and for which digital variants have been created to make the digital content equally available to libraries without moratorium or embargo; to explore all fair options for delivering content digitally in libraries; and to urge Congress to explore digital content pricing and licensing models to ensure democratic access to information.

It’s a big charge to be sure.

Can you tell us more about how the group came to be? What are the most important long-term goals for the group?

Again, paraphrasing from the minutes of the ALA Executive Board Fall meeting in October of 2019:

At the 2019 ALA Conference Council iii, the ALA Council approved a resolution calling for establishment of a Joint Working Group on eBook and Digital Content Pricing in Libraries consisting of representatives from ALA, ULC, ASGCLA, COSLA, PLA, LITA, ALCTS, RUSA, SLA and other members to be determined.

There are more than 30 members of the working group with representatives from internal ALA organizations like LITA, PLA, YALSA, ALCTS, RUSA, and more, as well as representatives from outside of ALA including people like Steve Potash from OverDrive, Brian O’Leary from the Book Industry Study Group, and Sandra DeGroote from MLA. That means there are a lot of smart people together in a room. The group is an attempt to put together representatives of all types of libraries, patrons, and econtent vendors.

In my opinion our main long-term goal is to put together a proposal for fair delivery of content to libraries that incorporates pricing and access that can be used by libraries of all sizes when working with econtent vendors. That isn’t a simple thing, but I’m confident this group will come up with a good proposal.

What is the group working on next?

Right now our co-chairs—Leah Dunn from UNC-Asheville and Kelvin Brown from Broward County Libraries (FL)—are pulling together minutes from our Midwinter meeting which will include a concept for how the group wants to proceed with our charge and what type of deliverables we will provide to ALA at the end of our term. At this time the group is focusing on how to organize itself to best tackle our charge.

Give us an overview of your meeting during ALA Midwinter and the future of this new working group.

Something that I talked about is that since I work in a public library, I don’t necessarily know all the concerns that academic or special libraries face with regards to econtent. This was echoed by people around the table. For many public libraries the Macmillan ebook embargo has a direct impact on providing content to patrons but often academic libraries aren’t customers of Macmillan. For them the conversation needed to be about something other than just one single vendor.

As a group we agreed that we needed to learn what libraries are already doing with econtent that works for them (and thereby provide a way to share those practices among all libraries) and to elevate the conversation above individual vendors in order to include all types of libraries and all types of patrons/users.

What else should LITA members know about this working group? How can they get involved or help?

I’ll provide an update via the blog each time we meet as a group to provide LITA members with our progress. In the meantime, LITA members can email me with their concerns about ebooks. What is working for them right now? What access do they not want to lose? Are there technological concerns that could be addressed? What types of access do they not have?

Frictionless Data Pipelines for Ocean Science / Open Knowledge Foundation

This blog post describes a Frictionless Data Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by BCO-DMO team members Adam Shepherd, Amber York, and Danie Kinkade, with development by Conrad Schloer.

Scientific research is implicitly reliant upon the creation, management, analysis, synthesis, and interpretation of data. When properly stewarded, data hold great potential to demonstrate the reproducibility of scientific results and accelerate scientific discovery. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a publicly accessible earth science data repository established by the National Science Foundation (NSF) for the curation of biological, chemical, and biogeochemical oceanographic data from research in coastal, marine, and laboratory environments. With the groundswell surrounding the FAIR data principles, BCO-DMO recognized an opportunity to improve its curation services to better support reproducibility of results, while increasing process efficiencies for incoming data submissions. In 2019, BCO-DMO worked with the Frictionless Data team at Open Knowledge Foundation to develop a web application called Laminar for creating Frictionlessdata Data Package Pipelines that help data managers process data efficiently while recording the provenance of their activities to support reproducibility of results.

 

The mission of BCO-DMO is to provide investigators with data management services that span the full data lifecycle from data management planning, to data publication, and archiving.

BCO-DMO provides free access to oceanographic data through a web-based catalog with tools and features facilitating assessment of fitness for purpose. The result of this effort is a database containing over 9,000 datasets from a variety of oceanographic and limnological measurements including those from: in situ sampling, moorings, floats and gliders, sediment traps; laboratory and mesocosm experiments; satellite images; derived parameters and model output; and synthesis products from data integration efforts. The project has worked with over 2,600 data contributors representing over 1,000 funded projects. 

As the catalog of data holdings continued to grow in both size and the variety of data types it curates, BCO-DMO needed to retool its data infrastructure with three goals. First, to improve the transportation of data to, from, and within BCO-DMO’s ecosystem. Second, to support reproducibility of research by making all curation activities of the office completely transparent and traceable. Finally, to improve the efficiency and consistency across data management staff. Until recently, data curation activities in the office were largely dependent on the individual capabilities of each data manager. While some of the staff were fluent in Python and other scripting languages, others were dependent on in-house custom developed tools. These in-house tools were extremely useful and flexible, but they were developed for an aging computing paradigm grounded in physical hardware accessing local data resources on disk. While locally stored data is still the convention at BCO-DMO, the distributed nature of the web coupled with the challenges of big data stretched this toolset beyond its original intention. 

In 2015, we were introduced to the idea of data containerization and the Frictionless Data project in a Data Packages BoF at the Research Data Alliance conference in Paris, France. After evaluating the Frictionless Data specifications and tools, BCO-DMO developed a strategy to underpin its new data infrastructure on the ideas behind this project.

While the concept of data packaging is not new, the simplicity and extendibility of the Frictionless Data implementation made it easy to adopt within an existing infrastructure. BCO-DMO identified the Data Package Pipelines (DPP) project in the Frictionless Data toolset as key to achieving its data curation goals. DPP implements the philosophy of declarative workflows, which trade imperative code in a specific programming language, telling a computer how a task should be completed, for declarative, structured statements that detail what should be done. These structured statements abstract the user writing the statements from the actual code executing them, and are useful for reproducibility over long periods of time in which programming languages age or change and algorithms improve. This flexibility was appealing because it meant the intent of the data manager could be translated into many varying programming (and data) languages over time without having to refactor older workflows. In data management, that means that one of the languages a DPP workflow captures is provenance – a common need across oceanographic datasets for reproducibility. DPP workflows translated into records of provenance explicitly communicate to data submitters and future data users what BCO-DMO did during the curation phase. Secondly, because workflow steps need to be interpreted by computers into code that carries out the instructions, it helped data management staff converge on a declarative language they could all share. This convergence meant cohesiveness, consistency, and efficiency across the team if we could implement DPP in a way they could all use.
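
To make the declarative idea concrete, here is a minimal sketch in Python. It is not DPP's actual syntax or API, just an illustration of the pattern: the workflow is data naming what should happen, and a small runner maps each statement onto whatever code currently implements it.

    # Illustrative sketch only; not the Data Package Pipelines API.
    # The workflow is data (what should be done), not code (how to do it).
    workflow = [
        {"step": "rename_field", "old": "Lat (deg N)", "new": "latitude"},
        {"step": "strip_spaces"},
    ]

    def rename_field(rows, old, new):
        return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

    def strip_spaces(rows):
        return [{k.replace(" ", "_"): v for k, v in row.items()} for row in rows]

    PROCESSORS = {"rename_field": rename_field, "strip_spaces": strip_spaces}

    def run(workflow, rows):
        # Because the statements are declarative, the code behind each step
        # can be rewritten later without touching old workflows.
        for statement in workflow:
            params = {k: v for k, v in statement.items() if k != "step"}
            rows = PROCESSORS[statement["step"]](rows, **params)
        return rows

    print(run(workflow, [{"Lat (deg N)": "41 31.5 N", "Station ID": "6"}]))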

In 2018, BCO-DMO formed a partnership with Open Knowledge Foundation (OKF) to develop a web application that would help any BCO-DMO data manager use the declarative language they had developed in a consistent way. Why develop a web application for DPP? As the data management staff evaluated DPP and Frictionless Data, they found that there was a learning curve to setting up the DPP environment, and that a deep understanding of the Frictionlessdata ‘Data Package’ specification was required. The web application abstracted this required knowledge to achieve two main goals: 1) consistently structured Data Packages (datapackage.json) with all the required metadata employed at BCO-DMO, and 2) efficiencies of time by eliminating typos and syntax errors made by data managers. Thus, the partnership with OKF focused on meeting the needs of scientific research data within the Frictionless Data ecosystem of specs and tools.

Data Package Pipelines is implemented in Python and comes with some built-in processors that can be used in a workflow. BCO-DMO took its own declarative language and identified gaps in the built-in processors. For these gaps, BCO-DMO and OKF developed Python implementations for the missing declarations to support the curation of oceanographic data, and the result was a new set of processors made available on Github.

Some notable BCO-DMO processors are:

boolean_add_computed_field – Computes a new field to add to the data whether a particular row satisfies a certain set of criteria.
Example: Where Cruise_ID = ‘AT39-05’ and Station = 6, set Latitude to 22.1645.

convert_date – Converts any number of fields containing date information into a single date field with display format and timezone options. Often date information is reported in multiple columns such as `year`, `month`, `day`, `hours_local_time`, `minutes_local_time`, `seconds_local_time`. For spatio-temporal datasets, it’s important to know the UTC date and time of the recorded data to ensure that searches for data with a time range are accurate. Here, these columns are combined to form an ISO 8601-compliant UTC datetime value.

convert_to_decimal_degrees –  Convert a single field containing coordinate information from degrees-minutes-seconds or degrees-decimal_minutes to decimal_degrees. The standard representation at BCO-DMO for spatial data conforms to the decimal degrees specification.

reorder_fields –  Changes the order of columns within the data. This is a convention within the oceanographic data community to put certain columns at the beginning of tabular data to help contextualize the following columns. Examples of columns that are typically moved to the beginning are: dates, locations, instrument or vessel identifiers, and depth at collection. 

The remaining processors used by BCO-DMO can be found at https://github.com/BCODMO/bcodmo_processors
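
As a rough illustration of the arithmetic behind the coordinate and date processors described above (this is a hedged sketch, not the BCO-DMO processor code itself):

    from datetime import datetime, timedelta, timezone

    def dms_to_decimal(degrees, minutes, seconds, hemisphere):
        # e.g. 41 degrees 31' 30" N -> 41.525; S and W hemispheres are negative
        value = degrees + minutes / 60 + seconds / 3600
        return -value if hemisphere in ("S", "W") else value

    def to_iso_utc(year, month, day, hour, minute, second, utc_offset_hours):
        # combine separate date/time columns into one ISO 8601-compliant UTC value
        local = timezone(timedelta(hours=utc_offset_hours))
        return datetime(year, month, day, hour, minute, second,
                        tzinfo=local).astimezone(timezone.utc).isoformat()

    print(dms_to_decimal(41, 31, 30, "N"))         # 41.525
    print(to_iso_utc(2017, 8, 15, 10, 30, 0, -4))  # 2017-08-15T14:30:00+00:00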

How does Laminar work?

In our collaboration with OKF, BCO-DMO developed use cases based on real-world data submissions. One such example is a recent Arctic Nitrogen Fixation Rates dataset.

 

 

The original dataset shown above needed the following curation steps to make the data more interoperable and reusable:

  • Convert lat/lon to decimal degrees
  • Add timestamp (UTC) in ISO format
  • ‘Collection Depth’ with value “surface” should be changed to 0
  • Remove parenthesis and units from column names (field descriptions and units captured in metadata).
  • Remove spaces from column names
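
Purely for illustration, the remaining clean-up steps could be expressed in a few lines of pandas; the file and column names below are made up, and the coordinate and timestamp conversions would use arithmetic like the sketch in the previous section:

    import pandas as pd

    df = pd.read_csv("nfix_original.csv")  # hypothetical export of the submitted data

    # 'Collection Depth' values of "surface" become 0
    df["Collection Depth (m)"] = df["Collection Depth (m)"].replace("surface", 0)

    # remove parentheses/units and spaces from column names
    # (field descriptions and units are captured in the metadata instead)
    df.columns = (df.columns
                    .str.replace(r"\s*\(.*\)", "", regex=True)
                    .str.strip()
                    .str.replace(" ", "_"))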

The web application, named Laminar and built on top of DPP, helps Data Managers at BCO-DMO perform these operations in a consistent way. First, Laminar prompts us to name and describe the pipeline being developed; it then assumes that the data manager wants to load some data to start the pipeline and prompts for a source location.

After providing a name and description of our DPP workflow, we provide a data source to load, and give it the name, ‘nfix’. 

In subsequent pipeline steps, we refer to ‘nfix’ as the resource we want to transform. For example, to convert the latitude and longitude into decimal degrees, we add a new step to the pipeline, select the ‘Convert to decimal degrees’ processor, a proxy for our custom processor ‘convert_to_decimal_degrees’, select the ‘nfix’ resource, select a field from that ‘nfix’ data source, and specify the Python regex pattern identifying where the values for the degrees, minutes and seconds can be found in each value of the latitude column.

Similarly, in step 7 of this pipeline, we want to generate an ISO 8601-compliant UTC datetime value by combining the pre-existing ‘Date’ and ‘Local Time’ columns. This step is depicted below:

After the pipeline is completed, the interface displays all steps, and lets the data manager execute the pipeline by clicking the green ‘play’ button at the bottom. This button then generates the pipeline-spec.yaml file, executes the pipeline, and can display the resulting dataset.

 

The resulting DPP workflow contained 223 lines across this 12-step operation, and for a data manager, the web application reduces the chance of error compared to generating this pipeline by hand. Ultimately, our work with OKF helped us develop processors that follow the DPP conventions.

Our goal for the pilot project with OKF was to have BCO-DMO data managers using Laminar to process 80% of the data submissions we receive. The pilot was so successful that data managers have processed 95% of new data submissions to the repository using the application.

This is exciting from a data management processing perspective because the use of Laminar is more sustainable, and it brought the team together to determine the best strategies for processing, documentation, and more. This increase in consistency and efficiency is welcomed from an administrative perspective and helps with the training of any new data managers coming to the team.

The OKF team are excellent partners who were the catalysts for a successful project. The next steps for BCO-DMO are to build on the success of the Frictionlessdata Data Package Pipelines by implementing the Frictionlessdata Goodtables specification for data validation, which will help us develop submission guidelines for common data types. Special thanks to the OKF team – Lilly Winfree, Evgeny Karev, and Jo Barrett.

New Committer: Eli Zoller / Islandora

The Islandora 8 committers have nominated and approved Eli Zoller to join the crew, and we are pleased to announced that he has accepted!

Eli is a recent newcomer to Islandora, but in his short tenure has managed to provide the upcoming versioning features in Islandora 8 v1.1.0. While working for Arizona State University, he has engaged the project in countless ways. Bug reporting, bug fixing, testing, documentation... you name it. He's even got a PR out there for the Mirador viewer :)

We are very lucky to have him focusing his intellect on Islandora 8, and we are very happy to have him joining the project in this new capacity. Further details of the rights and responsibilities of being an Islandora committer can be found here.

OpenRefine and the Distant Reader / Eric Lease Morgan

The student, researcher, or scholar can use OpenRefine to open one or more different types of delimited files. OpenRefine will then parse the file(s) into fields. It makes many things easy, such as finding/replacing, faceting (think “grouping”), filtering (think “searching”), sorting, clustering (think “normalizing/cleaning”), counting & tabulating, and finally, exporting data. OpenRefine is an excellent go-between when spreadsheets fail and full-blown databases are too hard to use. OpenRefine eats delimited files for lunch.

Many (actually, most) of the files in a study carrel are tab-delimited files, and they will import into OpenRefine with ease. For example, after all a carrel’s part-of-speech (pos) files are imported into OpenRefine, the student, researcher, or scholar can very easily count, tabulate, search (filter), and facet on nouns, verbs, adjectives, etc. If the named entities files (ent) are imported, then it is easy to see what types of entities exist and who might be the people mentioned in the carrel:

Screenshots: facets (counts & tabulations) of parts-of-speech; most frequent nouns; types of named-entities; who is mentioned in a file and how often.

OpenRefine recipes

Like everything else, using OpenRefine requires practice. The problem to solve is not so much learning how to use OpenRefine. Instead, the problem to solve is to ask and answer interesting questions. That said, the student, researcher, or scholar will want to sort the data, search/filter the data, and compare pieces of the data to other pieces to articulate possible relationships. The following recipes endeavor to demonstrate some such tasks. The first is to simply facet (count & tabulate) on parts-of-speech files:

  1. Download, install, and run OpenRefine
  2. Create a new project and as input, randomly chose any file from a study carrel’s part-of-speech (pos) directory
  3. Continue to accept the defaults, and continue with “Create Project »”; the result ought to be a spreadsheet-like interface
  4. Click the arrow next to the POS column and select Facet/Text facet from the resulting menu; the result ought to be a new window containing a column of words and a column of frequencies — counts & tabulations of each type of part-of-speech in the file
  5. Go to Step #4, until you get tired, but this time facet by other values

Faceting is a whole lot like “grouping” in the world of relational databases. Faceting alphabetically sorts a list and then counts the number of times each item appears in the list. Different types of works have different parts-of-speech ratios. For example, it is not uncommon for there to be a preponderance of past-tense verbs in stories. Counts & tabulations of personal pronouns as well as proper nouns give senses of genders. A more in-depth faceting against adjectives alludes to sentiment.
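
For readers who prefer a scripted equivalent, a text facet amounts to a group-and-count. A minimal pandas sketch, assuming a tab-delimited pos file with a POS column as in the recipe above (the file name is made up):

    import pandas as pd

    pos = pd.read_csv("homer.pos", sep="\t")  # any file from the pos directory
    print(pos["POS"].value_counts())          # counts & tabulations, like a text facet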

This recipe outlines how to filter (“search”):

  1. Click the “Remove All” button, if it exists; this ought to reset your view of the data
  2. Click the arrow next to the “token” column and select “Text filter” from the resulting menu
  3. In your mind, think of a word of interest, and enter it into the resulting search box
  4. Take notice of how the content in the spreadsheet view changes
  5. Go to Step #3 until you get tired
  6. Click the “Remove All” button to reset the view
  7. Text filter on the “token” column but search for “^N” (which is code for any noun) and make sure the “regular expression” check box is… checked
  8. Text facet on the “lemma” column; the result ought to be a count & tabulation of all the nouns
  9. Go to Step #6, but this time search for “^V” or “^J”, which are the codes for any verb or any adjective, respectively

By combining the functionalities of faceting and filtering the student, researcher, or scholar can investigate the original content more deeply or at least in different ways. The use of OpenRefine in this way is akin to leafing through a book or a back-of-the-book index. As patterns & anomalies present themselves, they can be followed up more thoroughly through the use of a concordance, where the patterns & anomalies can literally be seen in context.
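
The filter-then-facet combination has an equally compact scripted equivalent, again assuming the lemma and POS columns described above and a made-up file name:

    import pandas as pd

    pos = pd.read_csv("homer.pos", sep="\t")       # hypothetical file name
    nouns = pos[pos["POS"].str.match(r"^N")]       # filter: POS codes beginning with N
    print(nouns["lemma"].value_counts().head(25))  # facet: the most frequent noun lemmas

The same filter-then-facet pattern underlies the “who is mentioned” recipe that follows.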

This recipe answers the question, “Who is mentioned in a corpus, and how often?“:

  1. Download, install, and run OpenRefine
  2. Create a new project and as input, select all of the files in the named-entity (ent) directory
  3. Continue to accept the defaults, but remember, almost all of the files in a study carrel are tab-delimited, so import them as “CSV / TSV / separator-based files”, not Excel files
  4. Continue to accept the defaults, and continue with “Create Project »”; the result ought to be a spreadsheet-like interface
  5. Click the arrow next to “type” column and select Facet/Text facet from the resulting menu; the result ought to be a new window containing a column of words and a column of frequencies — counts & tabulations of each type of named-entity in the whole of the study carrel
  6. Select “PERSON” from the list of named entities; the result ought to be a count & tabulation of the names of the people mentioned in the whole of the study carrel
  7. Go to Step #5 until tired, but each time select a different named-entity value

This final recipe is a visualization:

  1. Create a new parts-of-speech or named-entity project
  2. Create any sort of meaningful set of faceted results
  3. Select the “choices” link; the result ought to be a text area containing the counts & tabulation
  4. Copy the whole of the resulting text area
  5. Paste the result into your text editor, find all tab characters and change them to colons (:), copy the whole of the resulting text
  6. Open Wordle and create a word cloud with the contents of your clipboard; word clouds may only illustrate frequencies, but sometimes frequencies are telling.
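
Step #5’s find-and-replace can also be done with a couple of lines of Python, assuming the copied facet choices have been saved to a file (the file names here are made up):

    # turn tab-delimited "word<TAB>count" lines into "word:count" lines for Wordle
    text = open("choices.txt").read()
    open("wordle.txt", "w").write(text.replace("\t", ":"))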

A study carrel’s parts-of-speech (pos) and named-entities (ent) files enumerate each and every word or named-entity in each and every sentence of each and every item in the study carrel. Given a question relatively quantitative in nature and pertaining to parts-of-speech or named-entities, the pos and ent files are likely to be able to address the question. The pos and ent files are tab-delimited files, and OpenRefine is a very good tool for reading and analyzing such files. It does much more than was outlined here, but enumerating those features is beyond scope. Such is left up to the… reader.

2020 Forum Call for Proposals / LITA

LITA, ALCTS and LLAMA are now accepting proposals for the 2020 Forum, November 19-21 at the Renaissance Baltimore Harborplace Hotel in Baltimore, MD.

Intention and Serendipity: Exploration of Ideas through Purposeful and Chance Connections

Submission Deadline: March 30, 2020

Our library community is rich in ideas and shared experiences. The 2020 Forum Theme embodies our purpose to share knowledge and gain new insights by exploring ideas through an interactive, hands-on experience. We hope that this Forum can be an inspiration to share, finish, and be a catalyst to implement ideas…together.

We invite those who choose to lead through their ideas to submit proposals for sessions or preconference workshops, as well as to nominate keynote speakers. This is an opportunity to share your ideas or unfinished work, sparking collaboration and advancing the library profession through meaningful dialogue.

We encourage diversity in presenters from a wide range of backgrounds, libraries, and experiences. We deliberately seek and strongly encourage submissions from underrepresented groups, such as women, people of color, the LGBTQA+ community, and people with disabilities. We also strongly encourage submissions from public, school, and special libraries.

Vendors wishing to submit a proposal should partner with a library representative who is testing/using the product.

Presenters will submit final presentation slides and/or electronic content (video, audio, etc.) to be made available online following the event. Presenters are expected to register and participate in the Forum as attendees.

For additional information about the 2020 LITA/ALCTS/LLAMA Forum, please visit https://forum.lita.org.

For questions, contact Berika Williams, Forum Planning Committee Chair, at berika.williams@tufts.edu.

Real World of Technology (MP3s) / Ed Summers

TL;DR I uploaded my MP3s of Ursula Franklin's The Real World of Technology to the Internet Archive where you can listen to it, or add it to your podcast player using this URL: https://archive.org/download/the-real-world-of-technology/podcast.xml


Back when I was starting out in the PhD program at UMD I ran across the work of Ursula Franklin. News of her passing was circulating in social media at the time. This is when I encountered her Massey Lectures, which I listened to, and enjoyed greatly. I can see looking back that listening to these lectures is what made me double down on the Science and Technology Studies angle in my own research.

At any rate, a colleague and friend of mine (hey Adam) recently found himself looking for the lectures because they disappeared from the CBC website. He ran across an old tweet that mentioned I had extracted MP3s from the Flash that the CBC published:



I still had these stashed away and was about to zip them up to send to Adam, but then thought that maybe it would be good to make them publicly available now that they are no longer available on the CBC website. This didn't seem like too much of a risk since the CBC is a public broadcaster, and was making the content available for free before.

So, I uploaded the audio to the Internet Archive, and was in the middle of writing a description for the Internet Archive item when I noticed something weird. The book says that there are six lectures and the CBC website only made five recordings available. I thought maybe some of the recordings could include more than one night? Or perhaps there was one missing? But after spending a few hours correlating the text of the 2nd edition with the recordings I came to realize a few things:

  1. The 2nd edition of the book supposedly only adds new content in Chapters 7-10, and leaves Chapters 1-6 from the 1st edition as is. But chapters 1-6 greatly depart from the lecture recordings in multiple places. It would be a major bit of work to analyze all the places where they diverge. Although come to think of it, that would be an interesting research project. Perhaps the 1st edition is closer to the recordings--but I wasn't able to get a hold of one quickly. If I had to guess I would say chapters 1-6 are identical in both editions, and that the differences between the recordings and the books were introduced when the lectures were first put into print. It could also be that Franklin wrote up the lectures, and then improvised a bit when reading them?

  2. The Wikipedia page says that the Massey Lectures are an "annual five-part series" which corroborates that there were only five recordings on the CBC website. Of course Wikipedia is Wikipedia, but a bit of looking around at other Massey lectures seemed to confirm that they are five episodes instead of six. I listened to how each lecture started and ended, and heard the announcer say at the end which number in the series it was. The 4th seemed to flow directly into the final one. So perhaps when creating the book manuscript Franklin restructured the text so significantly that it created an additional chapter, and whoever wrote the preface aligned the number of lectures with the number of chapters, making five lectures into six lectures? It's actually kind of hard to say. If you have any knowledge please drop it on me.

So I am pretty uncomfortable saying that the recordings are complete, even though I'm sure they represented what was on the CBC website. There are no recordings on the website now so it doesn't lend much confidence.

Here is a little taste of what was cut from the book (at least the 2nd edition). It's from Lecture 4 and starts at 49:07. I'm including it because it's quite pessimistic about the current state of human technology and its historical arc, and points to a revolution in thinking that needs to happen if the human race is to survive. Remember, this is coming from a scientist:

I think that it is there that I would like to close. I think when we look at this real world of technology, and how we can conduct ourselves as human beings, responsibly, and compassionately, and intelligently, then we see a number of very real tasks, both within our communities, within the relationship between people, but also in the relationship between our community with nature. As I said at the beginning of these lectures, I do feel quite strongly, and you don't need to share that feeling, that we have come to the end of a historical period in which technology was helpful and useful to meet human needs over the last two or three hundred years. Kenneth Boulding once said "Nothing fails as well as success", and in that he meant that one risks failure when something that works is carried on beyond the point of its appropriate usefulness. And I think in terms of technology we have come to that point.

Although it appears that Boulding was citing Chesterton?

On the subject of nothing failing like success: if you like documentaries definitely check out Honeyland. It's about a beekeeper in Macedonia, who understands not to be too successful at her trade, in order to continue to live.

Anyhow, that's probably a lot of details about the lectures that you didn't want to know. But if you haven't listened to them before and want to, maybe I've made it a bit easier. You can find them here:

https://archive.org/details/the-real-world-of-technology/

The Internet Archive does lots of cool stuff with your media uploads but I couldn't seem to find a way to get it to emit Podcast RSS. So I hand-rolled an RSS file which you could drop into your Podcast App of choice (if it still lets you add a podcast by URL):

https://archive.org/download/the-real-world-of-technology/podcast.xml

Meta: Slow Blogging / David Rosenthal

Blogging is slow right now because my physical therapist wants me standing up and moving around at least every 15 minutes. Long-form blogging in 15-minute increments is hard.

LITA announces the 2020 Excellence in Children’s and Young Adult Science Fiction Notable Lists / LITA

The LITA Committee Recognizing Excellence in Children’s and Young Adult Science Fiction presents the 2020 Excellence in Children’s and Young Adult Science Fiction Notable Lists. The lists are composed of notable children’s and young adult science fiction published between November 2018 and October 2019 and organized into three age-appropriate categories. The annotated lists will be posted on the website at www.sfnotables.org.

The Golden Duck Notable Picture Books List is selected from books intended for pre-school children and very early readers, up to 6 years old. Recognition is given to the author and the illustrator:

Field Trip to the Moon by John Hare. Margaret Ferguson Books

Hello by Aiko Ikegami. Creston Books

How to be on the Moon by Viviane Schwarz. Candlewick Press

Out There by Tom Sullivan. Balzer + Bray

The Babysitter From Another Planet by Stephen Savage. Neal Porter Books

The Space Walk by Brian Biggs. Dial Books for Young Readers

Ultrabot’s First Playdate by Josh Schneider. Clarion Books

Good Boy by Sergio Ruzzier. Atheneum Books

Llama Destroys the World, written by Jonathan Stutzman, illustrated by Heather Fox. Henry Holt & Co

The Eleanor Cameron Notable Middle Grade Books List titles are chapter books or short novels that may be illustrated. They are written for ages 7 – 11. This list is named for Eleanor Cameron, author of the Mushroom Planet series.

Awesome Dog 5000 by Justin Dean. Random House Books for Young Readers 

Cog by Greg van Eekhout. HarperCollins

Field Trip (Sanity and Tallulah #2) by Molly Brooks. Disney-Hyperion 

Friendroid by M. M. Vaughan. Margaret K. McElderry Books

Klawde: Evil Alien Warlord Cat by Johnny Marciano & Emily Chenoweth. Penguin Workshop

Maximillian Fly by Angie Sage. Katherine Tegen Books

The Owls Have Come to Take Us Away by Ronald L. Smith. Clarion Books

The Greystone Secrets #1: The Strangers by Margaret Peterson Haddix. Katherine Tegen Books

We’re Not From Here by Geoff Rodkey. Crown Books for Young Readers

The Unspeakable Unknown by Eliot Sappingfield. G.P. Putnam’s Sons Books for Young Readers

Seventh Grade vs the Galaxy by Joshua S. Levy. Carolrhoda Books

The Hal Clement Notable Young Adult Books List contains science fiction books written for ages 12 – 18 with a young adult protagonist. This list is named for Hal Clement, a well-known science fiction writer and high school science teacher who promoted children’s science fiction.

Alien: Echo by Mira Grant. Imprint

Aurora Rising by Amie Kaufman and Jay Kristoff. Knopf Books for Young Readers

Girls With Sharp Sticks by Suzanne Young. Simon Pulse

The Hive by Barry Lyga and Morgan Baden. Kids Can Press

The Pioneer by Bridget Tyler. HarperTeen

How We Became Wicked by Alexander Yates. Atheneum/Caitlyn Dlouhy Books

The Waning Age by S.E. Grove. Viking Books for Young Readers

The Fever King by Victoria Lee. Skyscape

War Girls by Tochi Onyebuchi. Razorbill

I Hope You Get This Message by Farah Rishi. HarperTeen

Honor Bound by Rachel Caine and Ann Aguirre. Katherine Tegen Books

Contact:

Jenny Levine

Executive Director

Library and Information Technology Association

jlevine@ala.org

Topic Modeling Tool – Enumerating and visualizing latent themes / Eric Lease Morgan

Technically speaking, topic modeling is an unsupervised machine learning process used to extract latent themes from a text. Given a text and an integer, a topic modeler will count & tabulate the frequency of words and compare how often those words occur near one another. The words form “clusters” when they are both frequent and near each other, and these clusters can sometimes represent themes, topics, or subjects. Topic modeling is often used to denote the “aboutness” of a text or compare themes between authors, dates, genres, demographics, other topics, or other metadata items.
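
Most topic modelers implement a statistical model such as latent Dirichlet allocation (LDA). For the curious, here is a minimal, illustrative sketch of the idea in Python using the gensim library and toy data; it is not the Tool described below:

    from gensim import corpora, models

    # each document is reduced to a bag of words (toy data for illustration)
    docs = [
        ["achilles", "rage", "ships", "battle"],
        ["odysseus", "ship", "sea", "home"],
        ["gods", "olympus", "zeus", "battle"],
    ]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # "given a text and an integer": ask for, say, two latent topics
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic in lda.print_topics(num_words=4):
        print(topic)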

Topic Modeling Tool is a GUI/desktop topic modeler based on the venerable MALLET suite of software. It can be used in a number of ways, and it is relatively easy to use it to: list five distinct themes from the Iliad and the Odyssey, compare those themes between books, and, assuming each chapter occurs chronologically, compare the themes over time.

Screenshots: a simple list of topics; topics distributed across a corpus; comparing the two books of Homer; topics compared over time.

Topic Modeling Tool Recipes

These few recipes are intended to get you up and running when it comes to Topic Modeling Tool. They are not intended to be a full-blown tutorial. This first recipe merely divides a corpus into the default number of topics and dimensions:

  1. Download and install Topic Modeling Tool
  2. Copy (not move) the whole of the txt directory to your computer’s desktop
  3. Create a folder/directory named “model” on your computer’s desktop
  4. Open Topic Modeling Tool
  5. Specify the “Input Dir…” to be the txt folder/directory on your desktop
  6. Specify the “Output Dir…” to be the folder/directory named “model” on your desktop
  7. Click “Learn Topics”; the result ought to be a list of ten topics (numbered 0 to 9), and each topic is denoted with a set of scores and twenty words (“dimensions”), and while functional, such a result is often confusing

This recipe will make things less confusing:

  1. Change the number of topics from the default (10) to five (5)
  2. Click the “Optional Settings…” button
  3. Change the “The number of topic words to print” to something smaller, say five (5)
  4. Click the “Ok” button
  5. Click “Learn Topics”; the result will include fewer topics and fewer dimensions, and the result will probably be more meaningful, if not less confusing

There is no correct number of topics to extract with the process of topic modeling. “When considering the whole of Shakespeare’s writings, what is the number of topics it is about?” This being the case, repeat and re-repeat the previous recipe until you: 1) get tired, or 2) feel like the results are at least somewhat meaningful.

This recipe will help you make the results even cleaner by removing nonsense from the output:

  1. Copy the file named “stopwords.txt” from the etc directory to your desktop
  2. Click “Optional Settings…”; specify “Stopword File…” to be stopwords.txt; click “Ok”
  3. Click “Learn Topics”
  4. If the results contain nonsense words of any kind (or words that you just don’t care about), edit stopwords.txt to specify additional words to remove from the analysis
  5. Go to Step #3 until you get tired; the result ought to be topics with more meaningful words

Adding individual words to the stopword list can be tedious, and consequently, here is a power-user’s recipe to accomplish the same goal:

  1. Identify words or regular expressions to be excluded from analysis, and good examples include all numbers (\d+), all single-letter words (\b\w\b), or all two-letter words (\b\w\w\b)
  2. Use your text editor’s find/replace function to remove all occurrences of the identified words/patterns from the files in the txt folder/directory; remember, you were asked to copy (not move) the whole of the txt directory, so editing the files in the txt directory will not affect your study carrel
  3. Run the topic modeling process
  4. Go to Step #1 until you: 1) get tired, or 2) are satisfied with the results

Now that you have somewhat meaningful topics, you will probably want to visualize the results, and one way to do that is to illustrate how the topics are dispersed over the whole of the corpus. Luckily, the list of topics displayed in the Tool’s console is tab-delimited, making it easy to visualize. Here’s how:

  1. Topic model until you get a set of topics which you think is meaningful
  2. Copy the resulting topics, and this will include the labels (numbers 0 through n), the scores, and the topic words
  3. Open your spreadsheet application, and paste the topics into a new sheet; the result ought to be three columns of information (labels, scores, and words)
  4. Sort the whole sheet by the second column (scores) in descending numeric order
  5. Optionally replace the generic labels (numbers 0 through n) with a single meaningful word, thus denoting a topic
  6. Create a pie chart based on the contents of the first two columns (labels and scores); the result will appear similar to an illustration above and it will give you an idea of how large each topic is in relation to the others
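
The spreadsheet steps above can also be scripted. A sketch, assuming the copied topics were saved to a tab-delimited file with the three columns described above (labels, scores, and words); the file name is made up:

    import pandas as pd
    import matplotlib.pyplot as plt

    topics = pd.read_csv("topics.tsv", sep="\t", names=["label", "score", "words"])
    topics = topics.sort_values("score", ascending=False)
    topics.set_index("label")["score"].plot.pie()  # relative size of each topic
    plt.ylabel("")
    plt.show()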

Because of a great feature in Topic Modeling Tool it is relatively easy to compare topics against metadata values such as authors, dates, formats, genres, etc. To accomplish this goal the raw numeric information output by the Tool (the actual model) needs to be supplemented with metadata, the data then needs to be pivoted, and subsequently visualized. This is a power-user’s recipe because it requires: 1) a specifically shaped comma-separated values (CSV) file, 2) Python and a few accompanying modules, and 3) the ability to work from the command line. That said, here’s a recipe to compare & contrast the two books of Homer:

  1. Copy the file named homer-books.csv to your computer’s desktop
  2. Click “Optional Settings…”; specify “Metadata File…” to be homer-books.csv; click “Ok”
  3. Click “Learn Topics”; the result ought to pretty much like your previous results, but the underlying model has been enhanced
  4. Copy the file named pivot.py to your computer’s desktop
  5. When the modeling is complete, open up a terminal application and navigate to your computer’s desktop
  6. Run the pivot program (python pivot.py); the result ought to be an error message outlining the input pivot.py expects
  7. Run pivot.py again, but this time give it input; more specifically, specify “./model/output_csv/topics-metadata.csv” as the first argument (Windows users will specify .\model\output_csv\topics-metadata.csv), specify “barh” for the second argument, and “title” as the third argument; the result ought to be a horizontal bar chart illustrating the differences in topics across the Iliad and the Odyssey, and ask yourself, “To what degree are the books similar?”
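
For the curious, here is a rough idea of what such a pivot-and-plot step involves. This is not the author’s pivot.py, and it assumes (perhaps incorrectly) that topics-metadata.csv holds one row per document with a “title” column followed by one numeric column per topic:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("./model/output_csv/topics-metadata.csv")
    pivoted = df.groupby("title").mean(numeric_only=True)  # average topic weight per title
    pivoted.plot.barh()                                    # or .plot.line() for a chapter-by-chapter view
    plt.show()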

The following recipe is very similar to the previous recipe, but it illustrates the ebb & flow of topics throughout the whole of the two books:

  1. Copy the file named homer-chapters.csv to your computer’s desktop
  2. Click “Optional Settings…”; specify “Metadata File…” to be homer-chapters.csv; click “Ok”
  3. Click “Learn Topics”
  4. When the modeling is complete, open up a terminal application and navigate to your computer’s desktop
  5. Run pivot.py and specify “./model/output_csv/topics-metadata.csv” as the first argument (Windows users will specify .\model\output_csv\topics-metadata.csv), specify “line” for the second argument, and “title” as the third argument; the result ought to be a line chart illustrating the increase & decrease of topics from the beginning of the saga to the end, and ask yourself “What topics are discussed concurrently, and what topics are discussed when others are not?”

Topic modeling is an effective process for “reading” a corpus “from a distance”. Topic Modeling Tool makes the process easier, but the process requires practice. Next steps are for the student to play with the additional options behind the “Optional Settings…” dialog box, read the Tool’s documentation, take a look at the structure of the CSV/metadata file, and take a look under the hood at pivot.py.