Planet Code4Lib

Are you talking to Heroku redis in cleartext or SSL? / Jonathan Rochkind

In a “typical” Redis installation, you might be talking to redis on localhost or on a private network, and clients typically talk to redis in cleartext. Redis doesn’t even natively support communication over SSL.

However, the Heroku redis add-on (the one from Heroku itself) supports SSL connections via “Stunnel”, a tool other redis users also commonly use to get SSL redis connections.

There are heroku docs on all of this which say:

While you can connect to Heroku Redis without the Stunnel buildpack, it is not recommended. The data traveling over the wire will be unencrypted.

Perhaps that’s especially important because on heroku your app does not talk to redis via localhost or a private network, but over a public network.

But I think I’ve worked on heroku apps before that missed this advice and are still talking to heroku redis in the clear. I just happened to run across it when I got curious about the REDIS_TLS_URL env/config variable I noticed heroku setting.

Which brings us to another thing: that heroku doc is out of date. It doesn’t mention the REDIS_TLS_URL config variable, just the REDIS_URL one. The difference? The TLS version will be a URL beginning with rediss:// instead of redis:// (note the extra s), which many redis clients use as a convention for “SSL connection to redis, probably via stunnel, since redis itself doesn’t support it”. The heroku docs provide Ruby and Go examples which instead use REDIS_URL and write code to swap redis:// for rediss://, and even hard-code port number adjustments, which is silly!

(While I continue to be very impressed with heroku as a product, I keep running into weird things like this outdated documentation, that does not match my experience/impression of heroku’s all-around technical excellence, and makes me worry if heroku is slipping…).

The docs also mention a weird driver: ruby argument for initializing the Redis client; I’m not sure what it does, and it doesn’t seem necessary.

The docs are correct that you have to tell the ruby Redis client not to try to verify the SSL certificate against trusted root certs, because this implementation uses a self-signed cert. Otherwise you will get an error that looks like: OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain)

So, it can be as simple as:

# VERIFY_NONE because the Heroku Redis stunnel endpoint presents a self-signed cert
redis_client = Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })

$redis = redis_client
# and/or
Resque.redis = redis_client

I don’t use sidekiq on this project currently, but to get the SSL connection with VERIFY_NONE, looking at the sidekiq docs, you might have to do something like(?):

redis_conn = proc {
  Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })
}

Sidekiq.configure_client do |config|
  config.redis = ConnectionPool.new(size: 5, &redis_conn)
end

Sidekiq.configure_server do |config|
  config.redis = ConnectionPool.new(size: 25, &redis_conn)
end

(Not sure what values you should pick for connection pool size).

While the sidekiq docs mention heroku in passing, they don’t mention the need for SSL connections — I think awareness of this heroku feature, and Heroku’s recommendation that you use it, may not actually be common!

Static-Dynamic / Ed Summers

At work we’ve been moving lots of previously dynamic web sites over to being static sites. Many of these sites have been Wordpress, Omeka or custom sites for projects that are no longer active, but which retain value as a record of the research. This effort has been the work of many hands, and has largely been driven (at least from my perspective) by an urge to make the websites less dependent on resources like databases, and custom code to render content. But we are certainly not alone in seeing the value of static site technology, especially in humanities scholarship.

The underlying idea here is that a static site makes the content more resilient, or less prone to failure over time, because there are fewer moving pieces involved in keeping it in and on the web. The pieces that are moving are tried-and-true software like web browsers and web servers, instead of pieces of software that have had fewer eyes on them (my code). The web has changed a lot over its lifetime, but the standards and software of the web have stayed remarkably backwards compatible (Rosenthal, 2010). There is long term value in mainstreaming your web publishing practices and investing in these foundational technologies and standards.

Since they have fewer moving pieces, static site architectures are also (in theory) more efficient. There is a hope that static sites lower the computational resources needed to keep your content on the web, which is good for the pocketbook and, more importantly, good for the environment. While there have been interesting experiments like this static site driven by solar energy, it would be good to get some measurements on how significant these changes are. As I will discuss in this post, I think static site technology can often push dynamic processing out of the picture frame, where it is less open to measurement.

Migrating completed projects to static sites has been fairly uncontroversial when the project is no longer being added to or altered. Site owners have been amenable to the transformation, especially when we tell them that this will help ensure that their content stays online in a meaningful and useful way. It’s often important to explain that “static” doesn’t mean their website will be rendered as an image or a PDF, or that the links won’t be navigable. Sometimes we’ve needed to negotiate the disabling of a search box. But it usually has sufficed to show how much the search is used (server logs are handy for this), and to point out that most people search for things elsewhere (e.g. Google) and arrive at their website. So it has been a price worth paying. But I think this may say more about the nature of the projects we were curating than it does about web content in general. YMMV. Cue the “There are no silver bullets” soundtrack. Hmm, what would that soundtrack sound like? Continuing on…

For ongoing projects where the content is changing, we’ve also been trying to use static site technology. Understandably there is a bit of an impedance mismatch between static site technology and dynamic curation of content. If the curators are comfortable with things like editing Markdown and doing Git commits/pushes, things can go well. But if people aren’t invested in learning those things, there is a danger that making a site static can, well, make it too static, which can have significant consequences for an active project that uses the web as a canvas.

One way of adjusting for this is to make the static site more dynamic. (It’s ok if you lol here :) This tacking back can be achieved in a multitude of ways, but they mostly boil down to:

  1. Curating content in another dynamic platform and having an export of that data integrated statically into the website at build time (not at runtime).
  2. Curating content in another dynamic platform and including it at runtime in your web application using the platform’s API. For example think of how Google Maps is embedded in a webpage.

The advantage to #1 is that the curation platform can go away, and you still have your site, and your data, and are able to build it. But the disadvantage is that curatorial changes to the data do not get reflected in the website until the next build.

The advantage to #2 is that you don’t need to manage the data assets, and changes made in the platform are immediately available in the web application. But the downside is that your website is totally dependent on this external platform. The platform could choose to change their API, put the service behind a paywall, shut down the service or completely go out of business. Actually, it is a certainty that one of these will eventually happen. So, depending on the specifics, this kind of static site is arguably more vulnerable than the previous dynamic web application.

We’ve mostly been focused on scenario #1 in our recent work creating static sites for dynamic projects. For example we’ve been getting some good use out of Airtable as a content curation platform in conjunction with Gatsby and its gatsby-source-airtable plugin, which makes the Airtable data available via a GraphQL query that gets run at build time. Once built, the pages have the data cooked into them, so Airtable can go away and our site will be fine. But the build is still quite dependent on Airtable.

In the (soon to be released) Unlocking the Airwaves project we are also using Airtable, but we didn’t use the GraphQL plugin, mostly because I didn’t know about it at the time. On that project there is a fetch-data utility that talks to our Airtable base and persists the data into the Git repository as JSON files. These JSON files are then queried as part of the build process. This means the build is insulated from changes in Airtable, and that you can choose when to pull the latest Airtable data down.
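
For a rough sense of what that kind of fetch-data step involves, here is a minimal sketch in Python against Airtable’s standard REST API. The base id, table names, and output paths are made up for illustration; the project’s actual utility is its own (JavaScript) code.

import json
import os
import requests

API_KEY = os.environ["AIRTABLE_API_KEY"]   # assumes the key lives in the environment
BASE_ID = "appXXXXXXXXXXXXXX"              # hypothetical base id

def fetch_table(table_name):
    """Page through an Airtable table and return all of its records."""
    url = f"https://api.airtable.com/v0/{BASE_ID}/{table_name}"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    records, offset = [], None
    while True:
        params = {"offset": offset} if offset else {}
        resp = requests.get(url, headers=headers, params=params)
        resp.raise_for_status()
        data = resp.json()
        records.extend(data["records"])
        offset = data.get("offset")        # present only while more pages remain
        if not offset:
            return records

if __name__ == "__main__":
    # Persist each table as JSON in the repo so the site build never has to talk to Airtable.
    for table in ["Programs", "Episodes"]:  # hypothetical table names
        with open(f"data/{table.lower()}.json", "w") as f:
            json.dump(fetch_table(table), f, indent=2)

Run whenever you decide to pull fresh data down; the JSON it writes is what the build then queries.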

But the big downside to this approach is that when curators make changes in Airtable they want to see them reflected in the website quickly. They understandably don’t want to have to ask someone else for a build, or figure out how to build and push the site themselves. They just want their work to be available to others.

One stopgap between 1 and 2 that we’ve developed recently is to create a special table in Airtable called Releases. A curator can add a row with their name, a description, and a tagged version of the site to use, which will cause a build of the website to be deployed.

We have a program that runs from cron every few minutes and looks at this table and then decides whether a build of the site is needed. It is still early days so I’m not sure how well this approach will work in practice. It took a little bit of head scratching to come up with an initial solution where the release program uses this Airtable row as a lock to prevent multiple deploy processes from starting at the same time. Experience will show if we got this right. Let me know if you are interested in the details.
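
To make those moving parts concrete, here is a minimal sketch of what such a cron-driven release checker could look like, in Python. The Releases table, its Status and Tag fields, the deploy script, and the claim-the-row-first locking are all hypothetical stand-ins for however the real program works.

import os
import subprocess
import requests

API_KEY = os.environ["AIRTABLE_API_KEY"]   # assumed to be set in the cron environment
BASE_ID = "appXXXXXXXXXXXXXX"              # hypothetical base id
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
RELEASES_URL = f"https://api.airtable.com/v0/{BASE_ID}/Releases"   # hypothetical table name

def set_status(record_id, status):
    """Write the Status field back to Airtable (PATCH the record)."""
    requests.patch(f"{RELEASES_URL}/{record_id}", headers=HEADERS,
                   json={"fields": {"Status": status}}).raise_for_status()

def main():
    """Run from cron every few minutes: deploy any requested release nobody has claimed yet."""
    releases = requests.get(RELEASES_URL, headers=HEADERS).json()["records"]
    for record in releases:
        fields = record.get("fields", {})
        if fields.get("Status"):               # anything in Status means already claimed or deployed
            continue
        # Claim the row first, so an overlapping cron run won't start a second deploy.
        set_status(record["id"], "deploying")
        tag = fields["Tag"]                    # hypothetical field holding the tagged version
        subprocess.run(["./deploy-site.sh", tag], check=True)   # hypothetical deploy script
        set_status(record["id"], "deployed")

if __name__ == "__main__":
    main()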

But stepping back a moment, I think it’s interesting how static site web technology simultaneously creates a site of static content, while also pushing dynamic processing outwards into other places, like this little Airtable Releases table, or one of the many so-called Jamstack service providers. This is where the legibility of the processing gets lost, and it becomes even more difficult to ascertain the environmental impact of the architectural shift. What is the environmental impact of our use of NetlifyCMS or Airtable? It’s almost impossible to say, whereas before, when we were running all the pieces ourselves, it was conceivable that we could audit them. These platforms are also sites of extraction. What economies are our data participating in?

This is all to say that the motivations to go static are complicated, and more work could be done to strategically think through some of these design decisions in a repeatable and generative way.

References

Rosenthal, D. (2010). The half-life of digital formats. Retrieved from https://blog.dshr.org/2010/11/half-life-of-digital-formats.html

I Rest My Case / David Rosenthal

Jeff Rothenberg's seminal 1995 Ensuring the Longevity of Digital Documents focused on the threat of the format in which the documents were encoded becoming obsolete, rendering their content inaccessible. This was understandable; it was a common experience in the preceding decades. Rothenberg described two different approaches to the problem: migrating the document's content from the doomed format to a less doomed one, and emulating the software that accessed the document in a current environment.

The Web has dominated digital content since 1995, and in the Web world formats go obsolete very slowly, if at all, because they are in effect network protocols. The example of IPv6 shows how hard it is to evolve network protocols. But now we are facing the obsolescence of a Web format that was very widely used, as the long effort to kill off Adobe's Flash comes to fruition. Fortunately, Jason Scott's Flash Animations Live Forever at the Internet Archive shows that we were right all along. Below the fold, I go into the details.

Preservationists inspired by Rothenberg's article seized on migration as the only viable approach, perhaps because of emulation's greater technical challenges. They built systems that ingested content by preemptively migrating it to one of a small set of formats they assumed were unlikely to become obsolete. There were a number of problems with this "aggressive" approach, some of which I set out in the third post to this blog, Format Obsolescence: the Prostate Cancer of Preservation, such as:
Many digital preservation systems define levels of preservation; the higher the level assigned to a format, the stronger the "guarantee" of preservation the system offers. For example, PDF gets a higher level than Microsoft Word. Essentially, the greater the perceived difficulty of migrating a format, the lower the effort that will be devoted to preserving it. But the easier the format is to migrate, the lower the risk it is at. So investment, particularly in the "aggressive" approach, concentrates on the low-hanging fruit. This is neither at significant risk of loss, nor at significant risk of format obsolescence.
The idea that it was possible to assess the degree of doom a format would encounter in the future was suspect, to say the least. Right from the start of the LOCKSS Program in 1998 we disagreed with the "aggressive" approach, arguing that the most important thing was to collect and preserve the original bits, and work out how to provide access later, when it was requested. Our arguments fell on deaf ears, so in 2005 we implemented and demonstrated a technique by which on-access format migration was completely transparent to the user (see Transparent Format Migration of Preserved Web Content). This allowed the decision about the less-doomed format to be postponed until the answer was clear.

But underlying this approach was an assumption that some less-doomed format into which it was possible to migrate the doomed format without suffering catastrophic loss of information actually existed. In the case of Adobe Flash, even as obsolescence loomed, no-one identified such a format. The migration approach could only work for Flash when in 2016 Adobe Animate could convert it to HTML5 (only specified in 2014), an expensive and fragile migration. Flash content was less like a "document" in Rothenberg's sense, and more like a program.

Fortunately, as I detailed in my 2015 report Emulation and Virtualization as Preservation Strategies:
Recent developments in emulation frameworks make it possible to deliver emulations to readers via the Web in ways that make them appear as normal components of Web pages. This removes what was the major barrier to deployment of emulation as a preservation strategy.
Perhaps the most important such framework is the Internet Archive's Emularity, which injects an emulator into the reader's browser to process the preserved content. Now, Jason Scott writes:
Utilizing an in-development Flash emulator called Ruffle, we have added Flash support to the Internet Archive’s Emularity system, letting a subset of Flash items play in the browser as if you had a Flash plugin installed. While Ruffle’s compatibility with Flash is less than 100%, it will play a very large portion of historical Flash animation in the browser, at both a smooth and accurate rate.

We have a showcase of the hand-picked best or representative Flash items in this collection. If you want to try your best at combing through a collection of over 1,000 flash items uploaded so far, here is the link.

You will not need to have a flash plugin installed, and the system works in all browsers that support Webassembly.
The fact that it is now possible to access preserved Flash content is important, especially for the history of the Web. As Scott writes:
From roughly 2000 to 2005, Flash was the top of the heap for a generation of creative artists, animators and small studios. Literally thousands and thousands of individual works were released on the web. Flash could also be used to make engaging menu and navigation systems for webpages, and this was used by many major and minor players on the Web to bring another layer of experience to their users.
...
This period was the height of Flash. Nearly every browser could be expected to have a “Flash Plugin” to make it work, thousands of people were experimenting with Flash to make art and entertainment, and an audience of millions, especially young ones, looked forward to each new release.
Unfortunately, the misplaced priorities resulting from migration-based preservation strategies meant that formal preservation systems mostly failed to preserve Flash content, so much has probably been lost. But kudos to everyone who worked on making it possible to experience this important period in the development of the Web.

The Long Now / David Rosenthal

A talk by Stewart Brand and Danny Hillis about 25 years ago, explaining the concept of the "Long Now" and the idea of building a 10,000-year clock to illustrate it, was what started me thinking about long-term digital preservation. The idea of Lots Of Copies Keep Stuff Safe (LOCKSS), and the acronym, came a couple of years later.

Hōryū-ji by Nekosuki, CC-BY-SA

Now, in The Data of Long-lived Institutions on the Long Now Foundation's blog, Alexander Rose refers to Hōryū-ji:
At about 1,400 years old, these are the two oldest continuously standing wooden structures in the world. And they’ve replaced a lot of parts of them. They keep the roofs on them, and even in a totally humid and raining environment, the central timbers of these buildings have stayed true. Interestingly, this temple was also the place where, over a thousand years ago, a Japanese princess had a vision that she needed to send a particular prayer out to the world to make sure that it survived into the future. And so she had, literally, a million wooden pagodas made with the prayer put inside them, and distributed these little pagodas as far and wide as she could. You can still buy these on eBay right now. It’s an early example of the philosophy of “Lots of Copies Keep Stuff Safe” (LOCKSS).
Below the fold, more on Rose's interesting post.
Hōryū-ji in itself is an example of the opposite approach to the long term, a single copy of durable materials carefully maintained:
This is Horyuji, an ancient Japanese temple, built in 607 AD. It is the world’s oldest surviving wooden structure. Horyuji was constructed from Japanese cypress that were roughly 2,000 years old. It has been 1,300 years since the cypress were cut down, and the wood still stands firm. “And they will be fine for the next 700 years,” said late-great chief Miya-daiku, Tsunekazu Nishioka. Miya-daiku is a specially trained carpenter who builds/maintains ancient temples and shrines. Their technical skills are usually passed down from father to son, through many years of hard, and skill developing, on-the-job training. According to Nishioka, “2000 year-old Japanese cypress is so robust yet resilient that it can maintain its great quality for another 2000 years, even after it’s cut down,” and that “people who initially build Horyuji in the 7th century knew about it. They knew that the building would last for another thousands of years, so they built it accordingly.”
Rebuilding Ise Shrine
by Utagawa Kuniyoshi
But it is a great example of the kind of thinking that animates the Long Now Foundation. The Ise Shrine provides an example of a different but equally effective technique:
These temples made of thatch and wood—totally ephemeral materials—have lasted for 2,000 years. They have documented evidence of exact rebuilding for 1,600 years, but this site was founded in 4 AD—also by a visionary Japanese princess. And every 20 years, with the Japanese princess in attendance, they move the treasures from one temple to the other. And the kami—the spirits—follow that. And then they deconstruct the temple, the old one, and they send all those parts out to various Shinto temples in Japan as religious artifacts.

I think the most important thing about this particular example is that each generation gets to have this moment where the older master teaches the apprentice the craft. So you get this handing off of knowledge that’s extremely physical and real and has a deliverable. It’s got to ship, right? “The princess is coming to move the stuff; we have to do this.” It’s an eight year process with tons of ritual that goes into rebuilding these temples.
Rose summarizes the Long Now's conceptual framework:
We use the 10,000 year time frame to orient our efforts at Long Now because global civilization arose when the last Interglacial period ended 10,000 years ago. It was only then, around 8,000 BC, that we had the emergence of agriculture and the first cities. If we can look back that far, we should be able to look forward that far. Thinking about ourselves as in the middle of a 20,000 year story is very different than thinking about ourselves as at the end of a 10,000 year story.
Rose presents a scrolling list of long-lived institutions, starting with the Catholic Church sometime in the 1st century, followed by the Order of St. Benedict in 529AD. There are a few universities; Cambridge, my alma mater, is way down the list at 1209AD. But:
As you look at these, there’s a few things that stick out. Notice: brewery, brewery, winery, hotel, bar, pub, right? And also notice that a lot of them are in Japan. There’s been a rough system of government there for over 2,000 years (the Royal Family) that’s held together enough to enable something like the Royal Confectioner of Japan to be one of the oldest companies in the world. But there’s also temple builders and things like that.
...
In a survey of 5,500 companies over 200 years old, 3,100 are based in Japan. The rest are in some of the older countries in Europe.

But—and this was a fact I found curious, and one that speaks to the cyclical nature of things—90% of the companies that are over 200 years old have 300 employees or less; they’re not mega companies.
The most interesting part of Rose's post is at the end where he looks forward rather than back:
If any of us are to build an institution that’s going to last for the next few hundred or 1,000 years, the changes in demographics and the climate are a big part of it. This is the projected impact of climate change on agricultural yields until the 02080s. And as you can see, agricultural yields in the global south are going to be going down. In the global north and the much further north, more like Canada and Russia, they’re going to be getting a lot better. And this is going to change the world markets, world populations, and what we’re warring over for the next 100 years.
Note that the map is from 2007 and assumes that increased CO2 will improve crop yields. Climate gets a lot of attention right now, but demographics will have a huge impact:
There’s no scenario that I’ve seen where the world population doesn’t start going down by at least a hundred years from now, if not less than 50 years from now. ... And the world has really never lived through a time, except for a few short plague seasons, where the world population was going down—and, by extension, where the customer base was going down. ... Even more dangerous than the population going down is that the population is changing. ... If the world is made up largely of older people who hoard wealth, don’t work hard, and don’t make huge contributions of creativity to the world the way 20 year olds do, that world is a world that I don’t think we’re prepared to live in right now.

When We Look Back on 2020, What Will We See? / Dan Cohen

It is far too early to understand what happened in this historic year of 2020, but not too soon to grasp what we will write that history from: data—really big data, gathered from our devices and ourselves.

Sometimes a new technology provides an important lens through which a historical event is recorded, viewed, and remembered. When the September 11 Digital Archive gathered tens of thousands of stories and photographs from 9/11 (I was involved with the project twenty years ago), it became clear that in addition to the mass medium of television, this tragic day was experienced in a more personal way by many Americans through the earpieces of cellphones and the tiny screens of low-resolution digital cameras. These technologies had only recently reached widespread adoption, but they were quickly pressed into service for communication and documentation, for frantic calls and messages, and as repositories of grainy photographs snapped in the moment.

Over the last two decades, of course, these nascent technologies matured and merged into the smartphone, added GPS and other sensors, and then hosted apps that helped themselves, with our consent and without, to location data, photos, and text. All of this information was then stored and aggregated in ways that were only vaguely conceivable in 2001.

Our year of 2020—somehow simultaneously overstuffed but also stretched thin, a year of Covid and protests against racism and a momentous election—will thus have a commensurately unwieldy digital historical record, densely packed with every need, opinion, and stress that our devices and sensors have captured and transmitted. That the September 11 Digital Archive collected 150,000 born-digital objects will strike future historians as confusingly slight, a desaturated daguerreotype compared to today’s hi-def canvas of data, teeming with vivid pixels. This year we will have generated billions of photographs, messages, and posts. Our movement through time and space has been etched as trillions of bytes about where we went and ate and shopped, or how much we hunkered down at home instead. But even if we hid from the virus, none of us will have been truly hidden. It’s all there in the data.

And it is not just the glowing rectangles we carry with us, through which we see and are seen, that will have produced and received an almost incalculable mass of data. In the testing and treatment of Covid, and the quest for a cure, scientists and doctors will have produced a detailed medical almanac from tens of millions of people, storing biological samples of blood and mucus and DNA for analysis, not just in the present, but also in decades to come. “For life scientists, the freezer is the archive,” Joanna Radin, a historian of medicine at Yale, recently noted on a panel on “Data Histories of Health” at the Northeastern University Humanities Center.

Databases in the cloud and on ice: this is the record of 2020.

Some of the data we have collected in the present will form the basis for future investigations and understanding. One of those critical and lasting data sets, the Covid Tracking Project, led not by technologists but by humanists, will undoubtedly tell us a great deal about how different states approached the novel coronavirus with caution or carelessness. Contact tracing has created the possibility of network analyses of the interactions of people at a scale never seen before. The Documenting the Now project forged tools to allow for the ethical archiving of social media posts, which was used to gather the collective outpouring of social movements like Black Lives Matter. If the President’s tweets dominated the national news, DocNow collections will present a more democratic expressive history.

While each of these data sets contains vast information, in novel combinations they will prove especially revealing, as correlations between activity and illness, sentiments and social movements, become more apparent. Databases are structured so as to be joined; there will be debates over such syntheses and who gets to do them.

We also learned this year that our privacy is repeatedly violated to create darker archives. Code hidden within seemingly innocuous software such as weather apps tracked us and handed that information over to unknown third parties. The location pings of smartphones may present an atlas of our mobility, but at what cost? Thorny questions about privacy and ethics will only grow over time, and may rightly occlude the use of some data sets.

Other narratives await, embedded in the data like fossils in amber. My colleagues at the Boston Area Research Institute (BARI) at Northeastern, anticipating the importance of this year, began collecting posts to sites like Craigslist, Airbnb, and Yelp early on, and then preserved these compilations for future researchers. Those researchers will be able to discern which furniture we acquired to work at home, and which furniture we cast off to the curb as relics of the Before Times. They will map where some of us fled to, and the locations we shunned. They will see the kinds of foods that gave us comfort in a takeout bag, and the countless family restaurants that went out of business after surviving for generations through recessions and wars.

The data will uncover, even more than we already know, a great deal about the inequalities of modern America. Data will reveal, as a new report by BARI, the Center for Survey Research, and the Boston Public Health Commission, shows, who had to go to work and who could stay home; who had to take public transportation and who had access to a car; and who had safe access to food, and enough of it.

Appropriately, data was also the lens through which we experienced 2020. Every day we encountered numbers of all shapes and sizes, gazed obsessively at charts of rising cases and grim projections of future deaths, or read polls and forecasts of voting patterns. Like supplicants at Delphi, we strained to understand what these numbers were telling us. We quickly learned new statistical concepts, like R0 — and then just as quickly ignored them. 

One of the great ironies of 2020 is likely to be this: In this year in which the record of our existence was encoded in big data, that very same data was opaque to most of us, or was met by disbelief and distrust. We can only hope that those looking back on 2020, many years from now, can make sense of the chaos, using a dense historical record unlike anything that has come before.

Fedora 6 Alpha Release is Here / DuraSpace News

Today marks a milestone in our progress toward Fedora 6 – the Alpha Release is now available for download and testing! Over the past year, our dedicated Fedora team, along with an extensive list of active community members and committers, have been working hard to deliver this exciting release to all of our users.

So what does Alpha mean? Fedora 6.0 Alpha-1 is our initial release of the newly updated software. The 3 primary goals for Fedora 6 are robust migration support, enhanced digital preservation features, and improved performance and scale. We have been actively working on a strong feature set that we hope to release with the full version in early 2021.

Features Available in Alpha Release

For now, we are happy to deliver the following features with the Alpha:

Alpha Migration Tooling for Fedora 3, 4 & 5 to Fedora 6
Native Amazon S3 Support
Support for Oxford Common File Layout persistence
Built-in search service
Metrics collection available
Docker deployment option

To showcase what’s in store for Fedora 6, we also created this mini Highlight Reel featuring the Top 4 Features we thought you would be most excited about.

 

In the coming weeks, as we work toward Beta release, we want to ensure the broader community has ample opportunity to confirm the functionality of the software against local needs and use cases. We cannot emphasize enough how valuable your feedback will be. It is only through the feedback of those within our own community that we can help guide the development efforts and deliver a product you are proud to use.

Feedback and Testing

Please feel free to use the fedora-tech mailing list, the #fedora-6-testing channel in Fedora Slack or reach out to David Wilcox at david.wilcox@lyrasis.org to provide any and all feedback on this release. Even letting us know that you are testing the software would be greatly appreciated!

Stay tuned on the Road to Fedora 6 for more exciting updates as we move toward Beta and eventually full production release. Thanks to all involved, we couldn’t have done it without your support and dedication.


Empathy Daleks / Hugh Rundle

The annual CRIG Seminar is a highlight of the calendar in Victorian academic libraries, and like everything else in 2020, this year's program was a bit different to normal. Entirely online, free, and open to all, the program the organising committee came up with comprised three fantastic sessions spread over a week and a half. The first session on OER and open pedagogy with Sarah Lambert and Rajiv Jhangiani was outstanding. Dr Jhangiani showed himself to be an extremely effective communicator about the importance and pedagogical benefits of open educational practices, and I learned a lot about how to do that better. But unfortunately for him, it was Dr Lambert's talk that struck me more forcefully. Lambert's research focusses on social justice actions and discourse within open educational practice. She spoke among other things about the effect of using and revising openly licensed educational resources to make them, well, less uniformly male and white. Her research shows some interesting effects of diversifying text books that were completely obvious in retrospect but that I had never properly considered.

Validation as a pedagogical tool

Her studies indicate that diversifying the authors, perspectives, representations and examples in standard textbooks is not simply "more inclusive" or "just" in an abstract way (though that would be good anyway). Students who feel they belong — who feel validated as members or potential members of a profession or academic discipline — are more likely to succeed and complete their degrees. That is, Lambert suggests that diversifying the authors and even the examples or hypothetical actors in university textbooks by itself has a positive effect on completion rates, engagement, and student satisfaction with courses. Amy Nusbaum shows in a recent article that OER is an effective way to accelerate this, because with licenses allowing "remixing" of content the examples used within open textbooks can be updated to suit local needs without having to rewrite the entire text.

It's possible I have never felt more White and male than I did listening to this. The thought "that's amazing" was immediately followed by "that's obvious". Because of course there is more cognitive load required by someone trying to learn about a field of study that is new to them, if they also have to deal with the (perhaps correct) impression that people like them are not really welcome in that field. Of course if one sees oneself as potentially part of "the field" because one sees oneself in the literature, engaging with that literature will be easier. Women and racialised people have been saying this for decades or longer in all sorts of contexts.

But it was Lambert uttering the magic words about diverse texts improving "student success" that suddenly felt quite subversive. To understand why, we need to interrogate what universities usually mean when they talk about "student success", and particularly the infrastructures universities have been building around it.

Universities as a site of discipline

Education systems are sites of discipline. There are "canons", "standards", "traditions" and examinations. In an interview for The New Inquiry about his book, Beyond education: radical studying for another world, Eli Meyerhoff describes the standard education system:

Its key elements include a vertical imaginary of individualized ascent up levels of education, a pedagogical mode of accounting with a system of honor and shame that eventually took the form of graded exams, hierarchical relationships of teacher over student, separations of students from the means of studying, the commodification of access to the means of studying through tuition, and opposed figures of educational waste (e.g., the dropout) and value (e.g., the graduate). This mode of study shapes subjects for their participation in governance and work within the dominant mode of world-making.

Universities are obsessed above all with correct forms. Heaven help the unwitting new recruit who accidentally refers to Professor Smith as Ms Smith, or mixes up the Pro-Vice Chancellor with the Deputy Provost. Enormous amounts of time and money are dedicated to ensuring that students italicise the correct words and use particular punctuation for any of half a dozen referencing systems that manage to somehow be incredibly prescriptive in the specifics whilst relatively vague, undefined and incomplete, never quite covering all possibilities. Avoiding plagiarism is now defined as whether your paper passes the test set by commercial black box software. The march of progress has inevitably led to what John Warner has described as a "plagiarism singularity":

Paper mills are now using Turnitin/WriteCheck to certify to their customers that the essays they’ve purchased from the paper mill will successfully pass the Turnitin/WriteCheck report at their institution.
...Turnitin now sits at the center of a perfect little marketplace, a plagiarism singularity if you will, where they get paid coming and going to certify work as “original,” even though the very circular nature of the arrangement means the software itself is worthless when it comes to detecting originality.

The point of all this isn't really to "certify originality" at all, but in fact the opposite. Ironically, the fake obsession with originality hinders actual research, to the point that researchers are reduced to begging scientific publishers not to reject articles on the basis that they use standardised methods.

Above all else, a university degree certifies the holder's ability to follow particular rules, only some of which are explicitly stated. Those who have difficulty following — or even recognising — these rules are a problem for which the university seeks a solution. They are referred to by an ever-growing list of euphemisms: "first in family", "at risk", "non-traditional background", "diverse", "low SES", "disengaged". Students are told in more and less subtle ways that they don't belong, that they're here under sufferance, and when, inevitably, they fail, it will be their fault. The educational experience of the average undergraduate today is a multi-year hazing by what Jeffrey Moro calls, simply, "cop shit".

Empathy Daleks

"Cop shit" is evolving from the merely Kafkaesque (the Turnitin black box) to the Orwellian (The Intelligent Campus). Rather than waiting for students to fail classes, universities are now eagerly signing up for prediction machines to identify precrime students at risk of failing in order to "intervene". Jisc has spruiked their "Intelligent Campus" product as part of a more widespread push to use "learning analytics" to measure student "engagement". One of the more troubling aspects of this project is the narrative that it is a potential solution to the growing number of students experiencing acute mental distress. Instead of recognising that universities are particular site of trauma within a broader crisis of community and collective care in hyper-capitalist societies, and responding with increased human interaction and human connection, Jisc's "Intelligent Campus" doubles down to present the very disembodied, machinic cause of the problem as its own "solution".

Rather than rebuilding universities as the sites of genuine human care, connection, and community claimed in their glossy brochures and award winning websites, the business model of today's university produces empathy Daleks. Instead of the emotional intelligence of humans, universities embrace the artificial intelligence of machines to identify and neutralise non-compliance. Using analytics and automation to "nudge" "at-risk" students works on a kind of infection model: "test and trace" for the inability to match up to the university's model of a "good" student.

What do we mean by "success"?

What does it mean to say a student has "failed"? Rola Ajjawi tells us that 40% of students will fail at least one unit in the course of their degree. This is an extraordinary number. If four in ten students "fail", then surely it would be more accurate to say that the higher education system is broken. But then again, perhaps universities don't see this as a failure at all. Economists like to talk about "stated preferences" versus "revealed preferences". In perhaps the most under-stated sentence in her book, Generous thinking: a radical approach to saving the university, Kathleen Fitzpatrick notes:

The inability of institutions of higher education to transform their internal structures and processes in order to fully align with their stated mission and values may mean that the institutions have not in fact fully embraced that mission or those values.

Fitzpatrick's hope is for nothing less than a complete reconstruction of not only the role, but the culture and structures of the university:

It is not just a matter of making it possible for more kinds of people to achieve conventionally coded success within the institution, but instead of examining what constitutes success, how it is measured, and why.

So this is why I see Sarah Lambert's framing of the impact of diversifying university textbooks as so subversive. Lambert completely flips the assumption about what and who needs to change for students to have more success and engagement with education and learning. It is not merely that universities should diversify their teaching material because it's nice, or politically correct, or just. It's not simply that they should do so because it would go some way to resolve the misalignment that Fitzpatrick has identified between the university's stated and practiced values. Far more dangerously, universities should do this because it is shown to improve their own stated measures of their own success. Improving "student success" in this way requires a change in the university, not in the students.

And whilst it's certainly not enough on its own, diversifying teaching material addresses "student success" much more cheaply, effectively, and ethically than any empathy Dalek ever will.


Vitaminwater's #nophoneforayear contest / Casey Bisson

Back in the before times, Vitaminwater invited applicants to a contest to go a full year without a smartphone or tablet. It was partly in response to rising concerns over the effect of all those alerts on our brains. Over 100,000 people clamored for the chance, but author Elana A. Mugdan’s entry stood out with an amusing video, and in February 2019 the company took away her iPhone 5s and handed her a Kyocera flip phone.

Proud boy librarians and racism in libraries / Meredith Farkas

blindfolded woman

This week, people learned that a librarian (currently working in university IT rather than a library) was outed as a proud boy who ran a virulently racist, misogynistic, and homophobic Twitter account that had doxxed antifa activists (including some in my city). Back in the aughts, I knew him. He was part of the social circle of librarian techies with blogs that I was part of. We went to some of the same conferences and socialized at them. We weren’t close. I didn’t know him well. We emailed back and forth a few times about tech stuff. But he seemed to be a really kind and generous person. I never got sketchy vibes from him like I did around some male librarians. Our friend group, while pretty white, did include BIPOC librarians and queer librarians and he seemed pretty tight with everyone. One black librarian I know wrote about how as a woman of color she often has her guard up around white people (on alert for racism) and that she never felt any concern about him and had considered him a friend. I remember him singing karaoke at a gay bar at Computers in Libraries. I drifted away from that group when I had a baby and went through postpartum depression (and had the sad realization that when most of us disappear from social media people don’t even notice or care). I don’t recall staying in touch with him beyond 2009 or 2010.

Just a few years after he himself drifted away from that group, his friends discovered that he’s part of a terrorist organization and has been doing absolutely abhorrent things. I think the shock that people who knew him feel is related to the fact that he felt safe to so many people, even people from historically marginalized groups. (Now a story has come out that makes him sound like a creep who preyed on a vulnerable female librarian, so even then he clearly was showing different faces to different people.) I don’t think it’s shocking in the least that there are Proud Boys and people from other hate groups who work (or have worked) in libraries. But I can understand the shock that people feel when a friend makes a choice like that. As a member of several groups that have been targeted by Proud Boys, I will say that I’m surprised that I read him so completely wrong.

There have been a lot of people this week who didn’t know him offering hot takes that seem like they are blaming the people who were his friends and were blindsided by this. Unless those friends are still defending him, I can’t imagine why people would want to throw salt in a wound like that. These friends feel betrayed because they truly believed he shared their values. I think it’s wrong to make his former friends who feel absolutely gutted feel bad for expressing shock. They are not denying the current reality; they are processing their shock and pain. Unless you know of situations in which he did or said racist things that those people ignored, equating their shock about their former friend to denying the existence of racism in the field is a leap. At the same time, I fully understand and agree that much of our profession is too quick to dismiss the idea that racism is endemic in our field. And I also agree that it’s easy for white librarians to not see racism or to see things and not call it racism. But in this case, even BIPOC librarians are coming forward and expressing shock and people are really hurting. Let them grieve.

But once people have grieved, I hope they will take action. I’ve seen how silence works to enable racism to thrive. And usually that racism is much more subtle and insidious than being a member of the Proud Boys. I used to work with a librarian — let’s call him “Ned.” He cared deeply about students and faculty and was dedicated to his work. He was outgoing, a nice guy to most people, and was generous with many of his colleagues. But he was also a loose cannon and often said things that were inappropriate (at best) and sometimes really offensive. All this was well-known across our institution and yet he was well-liked. People would roll their eyes or say “oh that’s just Ned” in the way you might about an older Uncle who is both delightful and inappropriate in equal measure. Many of the racist or sexist things he’d do or say were subtle or were under the guise of jokes; the sorts of things that leave enough room for interpretation that he could pretend he wasn’t being racist or sexist or elitist. He would “jokingly” refer to an Asian-American colleague with a play on words that referred to a dictator from an Asian country that was not hers (and would have been offensive even if it had been from her country of origin). He would constantly ask women — especially female part-time librarians — to do work for him as if they were his personal assistants. In the minutes he took for a meeting, he referred to our part-time library faculty colleagues as “minions.” The first thing he told a Latinx colleague when they told him they were going to library school was “D’s still get degrees” as if he were automatically expected not to achieve. There were so many examples. But that was just “Ned being Ned.” How many of us have known a Ned? How many of us actually stood up to our Ned; held him accountable for his behavior?

For a while, I was complicit in encouraging his behavior by not calling him out. At first, it was because I was new and he’d been there for a long time and so many people I knew who I thought of as deeply antiracist never called him on the things he said. But then I just became another person rolling my eyes indulgently at him while he did whatever the hell he wanted. As I became more aware of the ways I was complicit in white supremacy, I recognized how our silence in not confronting him contributed to an environment in which BIPOC colleagues and my colleagues working in precarity didn’t feel safe or valued. So I started actively confronting him on things — in-person, on our email lists, and on our statewide library mailing list. And at first I was one of very few voices at the college pushing back on him (at least publicly), but then others started doing the same. And when he retired and sent a (again subtly for maximum deniability) racist and sexist screed to the whole college with his “suggestions” for us, lots of us responded publicly to make it clear how unacceptable his behavior was. In all the times I and others confronted him, he never really listened. Never acknowledged the pain he was causing people, even when BIPOC library workers told him directly how much his words hurt them. He just kept arguing on topics that were abstract to him but existential to others.

I still think about all the years he worked at my place of work and how our silence in the face of his behavior must have felt to our BIPOC colleagues and to others he denigrated in his always subtle or “humorous” ways. I wonder how his decreased expectations of our students impacted how he taught and worked with students of color (students he, at the same time, professed to care deeply about, which I think was also true). Academia is a safe haven for people like Ned because they can smugly stand behind academic freedom and their tenure. But those who work with people like Ned have a choice in whether we publicly confront them. Because we should — even if we know they won’t change and we can’t get them to stop and HR won’t act. At least we can make it clear to everyone that we find their behavior abhorrent. Our silence? That was racism. That was acceptance. And I own and feel shame for my role in encouraging him through my silence.

And at least with the Neds in the world, they are pretty public with their behavior. For every Ned, there are way more people who undermine, bully, exclude, and otherwise harm BIPOC library workers in ways significantly more subtle — and sometimes in ways even they are unconscious of. Those people who enforce elitist and white supremacist standards of professionalism. Those people who offer to mentor only those who remind them of themselves. Those people who hire for “fit.” Those people who make their BIPOC colleagues teach them how to not be racist. Those people who use the tenure process as a tool of oppression. When I read Tema Okun’s description of how white supremacy shows up in organizations, there was so much I recognized from the places where I’ve worked. White supremacy is foundational in our organizations and sometimes we are the ones perpetrating it, even if we don’t see it. But when our profession has tons of articles, books, and blog posts detailing the abuse, exclusion, microaggressions, and more that BIPOC library workers deal with regularly in our white supremacist organizations, ignorance is not an acceptable excuse. We have to make it a priority to see where we promote or are complicit in allowing white supremacy to thrive and to try to dismantle structures that exclude, marginalize, undermine, and harm.

Another lesson we can take from this mess is how easy it is for people to hide parts of themselves online for a wide variety of reasons. We all pick and choose the parts of ourselves we reveal through social media, but some people do it systematically in order to build trust that is not deserved. Several months ago (I think… does time even exist during ‘rona?) someone who ran a super-popular pseudonymous librarian twitter handle was revealed to have sexually harassed lots of women and specifically targeted women who were vulnerable. It’s so easy to deceive or be deceived on social media. It’s so easy to believe that online people are your close friends, that you really know them. And I hate that I’ve become so much more jaded about “Internet friendships,” especially now when we’re all so isolated. It sucks to feel like I can’t trust people.

But I honestly don’t want to go through life not being surprised if someone I was friendly with becomes a Proud Boy, because what would that mean I thought of every other human being?

 

Image credit: Pickpic

 

 

ISLE Sprint Wrapup / Islandora


Our first quarterly ISLE sprint has wrapped up and I'd like to thank everyone that participated.  From the community contributions we received, we can now move repository data between ISLE instances. This allows us to move a repository from development to testing and finally a production environment. It was pretty crucial that ISLE 8 be able to handle this, and the community really stepped up and delivered. Plus, as a bonus, we're also ready for composer 2.

The work's been done, but there are still a few pull requests under review:

Once those are merged, we can begin documenting the process of how to move your data around.  We'll also be seeking user feedback from both new and existing ISLE users as they set up their repositories.

As always, this was a successful sprint. That really speaks to what we're capable of when we work together on common ground to solve shared problems.  It never ceases to amaze me what can be accomplished if we put our heads together.  Thanks so much to everyone who signed up.

  • Nigel Banks - Lyrasis

  • Cary Gordon - Cherry Hill

  • Noah Smith - Born Digital

  • Gavin Morris - Born Digital

  • Hertzel Armengol - Born Digital

  • Andrija Sagic - Library Milutin Bojic

  • Alan Stanley - Agile Humanities

  • Aaron Birkland - Johns Hopkins University

  • Seth Shaw - University of Nevada Las Vegas

  • Rosie Le Faive - University of Prince Edward Island

  • Janice Banser - Simon Fraser University

  • Jeffery Antoniuk - University of Alberta

  • Yamil Suarez - Berklee College of Music

We couldn't do it without awesome people like you and organizations that understand and value open source software. Thank you so much for your commitment to make Islandora the best it can be.

Weeknote 47 (2020) / Mita Williams

I had a staycation last week. It took me two days just to catch up on email I received while I was gone. And the only reason I was able to do that in two days is because I had booked the days off as meeting-free so I could attend an online conference.

Said conference was the 2020 Indigenous Mapping Workshop. I was not able to attend many of the sessions but the ones that I did rekindled my affection for web-maps and inspired me to make two proof-of-concept maps.

The first one is of bike parking in my neighbourhood. The location and photos were collected using a web form through Kobotoolbox.

I then downloaded a csv from the site and paired it with this leaflet-omnivore powered map.

Bike parking in my neighbourhood

The second map I made was a more mischievous creation in which I used Mapbox Studio to rename the world.


Other things I did this week: chaired our monthly Information Services Department meeting, selected a set of duplicate books as part of a larger weeding project, ordered a lot of books using ExLibris’ Rialto, did a LibraryChat shift, contributed to some collection management work, did some OJS support, attended several meetings, and wrote many emails.


One day I would like to write a piece that applies the concept of technical debt to library services / the library as an organization.


I didn’t do much reading this week but I did read one article which I think has an exceptional title: Public Libraries Are Doing Just Fine, Thank You: It’s the “Public” in Public Libraries That is Threatened


This is a project that is close to my civic interests:

Introducing the Civic Switchboard Data Literacy Project!

We're pleased to announce the receipt of an IMLS Laura Bush 21st Century Librarian Program Grant, which will support the next piece of the Civic Switchboard project – the Civic Switchboard Data Literacy project! This project builds on the Civic Switchboard project's exploration of civic data roles for libraries and will develop instructional materials to prepare MLIS students and current library workers for civic data work.

Through the Civic Switchboard project, we’ve learned about common barriers to entry that libraries are navigating with civic data work. We regularly heard library workers say that they feel unqualified to participate in their civic data ecosystems. With this barrier in mind, the Civic Switchboard Data Literacy project will build and pilot instructional material that MLIS instructors can integrate in coursework and that can be used in professional development training in library settings.

Let’s visualize some HAMLET data! Or, d3 and t-SNE for the lols. / Andromeda Yelton

In 2017, I trained a neural net on ~44K graduate theses using the Doc2Vec algorithm, in hopes that doing so would provide a backend that could support novel and delightful discovery mechanisms for unique library content. The result, HAMLET, worked better than I hoped; it not only pulls together related works from different departments (thus enabling discovery that can’t be supported with existing metadata), but it does a spirited job on documents whose topics are poorly represented in my initial data set (e.g. when given a fiction sample it finds theses from programs like media studies, even though there are few humanities theses in the data set).

That said, there are a bunch of exploratory tools I’ve had in my head ever since 2017 that I’ve not gotten around to implementing. But here, in the spirit of tossing out things that don’t bring me joy (like 2020) and keeping those that do, I’m gonna make some data viz!

There are only two challenges with this:

  1. By default Doc2Vec embeds content in a 100-dimensional space, which is kind of hard to visualize. I need to project that down to 2 or 3 dimensions. I don’t actually know anything about dimensionality reduction techniques, other than that they exist.
  2. I also don't know JavaScript much beyond a copy-paste level. I definitely don't know d3, or indeed the pros and cons of various visualization libraries. Also art. Or, like, all that stuff in Tufte's book, which I bounced off of.

(But aside from that, Mr. Lincoln, how was the play?)

I decided I should start with the pages that display the theses most similar to a given thesis (shout-out to Jeremy Brown, startup founder par excellence) rather than with my ideas for visualizing the whole collection, because I’ll only need to plot ten or so points instead of 44K. This will make it easier for me to tell visually if I’m on the right track and should let me skip dealing with performance issues for now. On the down side, it means I may need to throw out any code I write at this stage when I’m working on the next one. 🤷‍♀️

And I now have a visualization on localhost! Which you can’t see because I don’t trust it yet. But here are the problems I’ve solved thus far:

  1. It’s hard to copy-paste d3 examples on the internet. d3’s been around for long enough there’s substantial content about different versions, so you have to double-check. But also most of the examples are live code notebooks on Observable, which is a wicked cool service but not the same environment as a web page! If you just copy-paste from there you will have things that don’t work due to invisible environment differences and then you will be sad. 😢 I got tipped off to this by Mollie Marie Pettit’s great Your First d3 Scatterplot notebook, which both names the phenomenon and provides two versions of the code (the live-editable version and the one you can actually copy/paste into your editor).
  2. If you start googling for dimensionality reduction techniques you will mostly find people saying “use t-SNE”, but t-SNE is a lying liar who lies. Mind you, it’s what I’m using right now because it’s so well-documented it was the easiest thing to set up. (This is why I said above that I don’t trust my viz.) But it produces different results for the same data on different pageloads (obviously different, so no one looking at the page will trust it either), and it’s not doing a good job preserving the distances I care about. (I accept that anything projecting from 100d down to 2d will need to distort distances, but I want to adequately preserve meaning — I want the visualization to not just look pretty but to give people an intellectually honest insight into the data — and I’m not there yet.)

Conveniently this is not my first time at the software engineering rodeo, so I encapsulated my dimensionality reduction strategy inside a function, and I can swap it out for whatever I like without needing to rewrite the d3 as long as I return the same data structure.
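
As a rough illustration of that kind of encapsulation (a sketch of the idea in Python, assuming scikit-learn and umap-learn; this is not HAMLET's actual code and the names are invented):

# Sketch of a swappable dimensionality-reduction step (not HAMLET's actual code).
# Assumes: pip install numpy scikit-learn umap-learn
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

def reduce_to_2d(vectors, method="tsne", random_state=42):
    """Project n x 100 Doc2Vec vectors down to n x 2 for plotting."""
    X = np.asarray(vectors)
    if method == "pca":
        return PCA(n_components=2, random_state=random_state).fit_transform(X)
    if method == "umap":
        # For a ~10-point plot, n_neighbors must stay below the sample count.
        return umap.UMAP(n_components=2, n_neighbors=min(5, len(X) - 1),
                         random_state=random_state).fit_transform(X)
    # t-SNE: pinning random_state at least makes repeated runs repeatable;
    # perplexity must also be below the sample size for tiny samples.
    return TSNE(n_components=2, perplexity=min(5, len(X) - 1),
                random_state=random_state).fit_transform(X)

# Usage: coords = reduce_to_2d(similar_thesis_vectors, method="umap")
# The d3 side only ever sees the same [n, 2] shape, whatever method is chosen.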

So that’s my next goal — try out UMAP (hat tip to Matt Miller for suggesting that to me), try out PCA, fiddle some parameters, try feeding it just the data I want to visualize vs larger neighborhoods, see if I’m happier with what I get. UMAP in particular alleges itself to be fast with large data sets, so if I can get it working here I should be able to leverage that knowledge for my ideas for visualizing the whole thing.

Onward, upward, et cetera. 🎉

Comparing performance of a Rails app on different Heroku formations / Jonathan Rochkind

I develop a “digital collections” or “asset management” app, which manages and makes digitized historical objects and their descriptions available to the public, from the collections here at the Science History Institute.

The app receives a relatively low level of traffic (according to Google Analytics, around 25K pageviews a month), although we want it to be able to handle spikes without falling down. It is not the most performance-optimized app; it does have some relatively slow responses and can be RAM-hungry. But it works adequately on our current infrastructure: web traffic is handled on a single AWS EC2 t2.medium instance, with 10 passenger processes (free version of passenger, so no multi-threading).

We are currently investigating the possibility of moving our infrastructure to heroku. After realizing that heroku standard dynos did not seem to have the performance characteristics I had expected, I decided to approach performance testing more methodically, to compare different heroku dyno formations to each other and to our current infrastructure. Our basic research question is: what heroku formation do we need to have similar performance to our existing infrastructure?

I am not an expert at doing this — I did some research, read some blog posts, did some thinking, and embarked on this. I am going to lead you through how I approached this and what I found. Feedback or suggestions are welcome. The most surprising result I found was much poorer performance from heroku standard dynos than I expected, and specifically that standard dynos would not match performance of present infrastructure.

What URLs to use in test

Some older load-testing tools only support testing one URL over and over. I decided I wanted to test a larger sample list of URLs — to be a more "realistic" load, and also because repeatedly requesting only one URL might accidentally use caches in ways you aren't expecting, giving you unrepresentative results. (Our app does not currently use fragment caching, but caches you might not even be thinking about include postgres's built-in automatic caches, or passenger's automatic turbocache, which I don't think we have turned on.)

My initial thought was to get a list of such URLs from our already-in-production app's production logs, to get a sample of what real traffic looks like. There were a couple of barriers to using production logs as a URL source:

  1. Some of those URLs might require authentication, or be POST requests. The bulk of our app's traffic is GET requests available without authentication, and I didn't feel like the added complexity of setting up anything else in a load test was worthwhile.
  2. Our app on heroku isn’t fully functional yet. Without having connected it to a Solr or background job workers, only certain URLs are available.

In fact, a large portion of our traffic is an “item” or “work” detail page like this one. Additionally, those are the pages that can be the biggest performance challenge, since the current implementation includes a thumbnail for every scanned page or other image, so response time unfortunately scales with number of pages in an item.

So I decided a good list of URLs was simply a representative sample of those "work detail" pages. In fact, rather than a completely random sample, I took the 50 largest/slowest work pages, and then added in another 150 randomly chosen from our current ~8K pages. And gave them all a randomly shuffled order.

In our app, every time a browser requests a work detail page, the JS on that page makes an additional request for a JSON document that powers our page viewer. So for each of those 200 work detail pages, I added the JSON request URL, for a more "realistic" load and 400 total URLs.
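
For illustration only, building that URL list could look something like the sketch below (this is not the script actually used; the CSV export and the URL/JSON path patterns are invented placeholders):

# Sketch of building the 400-URL sample file (illustrative, not the original script).
import csv
import random

BASE = "https://staging-digital.sciencehistory.org"   # staging host used in the tests

with open("works.csv") as f:            # hypothetical export with slug and page count
    works = list(csv.DictReader(f))

works.sort(key=lambda w: int(w["num_pages"]), reverse=True)
sample = works[:50]                                    # the 50 largest/slowest works
sample += random.sample(works[50:], 150)               # plus 150 randomly chosen others

urls = []
for w in sample:
    urls.append(f"{BASE}/works/{w['slug']}")                 # HTML work detail page (guessed path)
    urls.append(f"{BASE}/works/{w['slug']}/viewer_images")   # JSON powering the viewer (guessed path)

random.shuffle(urls)                                   # 400 URLs in shuffled order

with open("sample_works.txt", "w") as out:
    out.write("\n".join(urls) + "\n")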

Performance: “base speed” vs “throughput under load”

Thinking about it, I realized there were two kinds of “performance” or “speed” to think about.

You might just have a really slow app; to exaggerate, let's say typical responses are 5 seconds. That's under low/no traffic: a single browser is the only thing interacting with the app, it makes a single request, and has to wait 5 seconds for a response.

That number might be changed by optimizations or performance regressions in your code (including your dependencies). It might also be changed by moving or changing hardware or virtualization environment — including giving your database more CPU/RAM resources, etc.

But that number will not change by horizontally scaling your deployment — adding more puma or passenger processes or threads, scaling out hosts with a load balancer or heroku dynos. None of that will change this base speed, because it's just how long the app takes to prepare a response when not under load: how slow it is in a test with only one web worker, where adding more web workers won't matter because they won't be used.

Then there's what happens to the app actually under load by multiple users at once. The base speed is kind of a lower bound on throughput under load — page response time is never going to get better than 5s for our hypothetical very slow app (without changing the underlying base speed). But it can get a lot worse if it's hammered by traffic. This throughput under load can be affected not only by changing base speed, but also by various forms of horizontal scaling — how many puma or passenger processes you have with how many threads each, and how many CPUs they have access to, as well as the number of heroku dynos or other hosts behind a load balancer.
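
A back-of-envelope way to see the distinction (a sketch, not a measurement): horizontal scaling raises the ceiling on throughput, but it never lowers the base response time.

# Rough ceiling: throughput <= effective parallel workers / base response time.
base_response_time = 5.0  # seconds, the hypothetical very slow app above
for workers in (1, 2, 4, 8):
    ceiling = workers / base_response_time
    print(f"{workers} worker(s): ~{ceiling:.1f} req/sec ceiling, best-case latency still {base_response_time:.0f}s")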

(I had been thinking about this distinction already, but Nate Berkopec's great blog post on scaling Rails apps gave me the "speed" vs "throughput" terminology to use.)

In our case, we are not changing the code at all. But we are changing the host architecture from a manual EC2 t2.medium to heroku dynos (of various possible types) in a way that could affect base speed, and we're also changing our scaling architecture in a way that could change throughput under load on top of that — from one t2.medium with 10 passenger processes to possibly multiple heroku dynos behind heroku's load balancer, and also (for Reasons) switching from free passenger to trying puma with multiple threads per process. (We are running puma 5 with new experimental performance features turned on.)

So we’ll want to get a sense of base speed of the various host choices, and also look at how throughput under load changes based on various choices.

Benchmarking tool: wrk

We’re going to use wrk.

There are LOTS of choices for HTTP benchmarking/load testing, with really varying complexity and from different eras of web history. I got a bit overwhelmed by it, but settled on wrk. Some other choices didn’t have all the features we need (some way to test a list of URLs, with at least some limited percentile distribution reporting). Others were much more flexible and complicated and I had trouble even figuring out how to use them!

wrk does need a custom lua script in order to handle a list of URLs. I found a nice script here, and modified it slightly to take the filename from an ENV variable, and to not randomly shuffle the input list.

It’s a bit confusing understanding the meaning of “threads” vs “connections” in wrk arguments. This blog post from appfolio clears it up a bit. I decided to leave threads set to 1, and vary connections for load — so -c1 -t1 is a “one URL at a time” setting we can use to test “base speed”, and we can benchmark throughput under load by increasing connections.

We want to make sure we run the test for long enough to touch all 400 URLs in our list at least once, even in the slower setups, to have a good comparison — ideally it would go through the list more than once, but for my own ergonomics I had to get through a lot of tests, so I ended up with less than ideal run lengths. (Should I have put fewer than 400 URLs in? Not sure.)

Conclusions in advance

As benchmarking posts go (especially when I’m the one writing them), I’m about to drop a lot of words and data on you. So to maximize the audience that sees the conclusions (because they surprise me, and I want feedback/pushback on them), I’m going to give you some conclusions up front.

Our current infrastructure has the web app on a single EC2 t2.medium, which is a burstable EC2 type — our relatively low-traffic app does not exhaust its burst credits. Measuring base speed (just one concurrent request at a time), we found that performance dynos seem to have about the CPU speed of a bursting t2.medium (just a hair slower).

But standard dynos are as a rule 2 to 3 times slower than our current infrastructure; additionally they are highly variable, and that variability plays out over hours and days. A 3-minute period can have measured response times 2 or more times slower than another 3-minute period a couple of hours later.

Under load, they scale about how you’d expect if you knew how many CPUs are present, no real surprises. Our existing t2.medium has two CPUs, so can handle 2 simultaneous requests as fast as 1, and after that degrades linearly.

A single performance-L ($500/month) has 4 CPUs (8 hyperthreads), so scales under load much better than our current infrastructure.

A single performance-M ($250/month) has only 1 CPU (!), so scales pretty terribly under load.

Testing scaling with 4 standard-2x's ($200/month total), we see that it scales relatively evenly, although lumpily because of variability; and it starts out performing so much worse that even as it scales "evenly" it's still out-performed by all the other architectures. :( (At these relatively fast median response times you might say it's still fast enough, who cares; but in our fat tail of slower pages it gets more distressing.)

Now we’ll give you lots of measurements, or you can skip all that to my summary discussion or conclusions for our own project at the end.

Let’s compare base speed

OK, let’s get to actual measurements! For “base speed” measurements, we’ll be telling wrk to use only one connection and one thread.

Existing t2.medium: base speed

Our current infrastructure is one EC2 t2.medium. This EC2 instance type has two vCPUs and 4GB of RAM. On that single EC2 instance, we run passenger (free not enterprise) set to have 10 passenger processes, although the base speed test with only one connection should only touch one of the workers. The t2 is a "burstable" type, and we do always have burst credits (this is not a high traffic app; I verified we never exhausted burst credits in these tests), so our test load may be taking advantage of burst CPU.

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://[current staging server]
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://staging-digital.sciencehistory.org
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   311.00ms  388.11ms   2.37s    86.45%
     Req/Sec    11.89      8.96    40.00     69.95%
   Latency Distribution
      50%   90.99ms
      75%  453.40ms
      90%  868.81ms
      99%    1.72s
   966 requests in 3.00m, 177.43MB read
 Requests/sec:      5.37
 Transfer/sec:      0.99MB

I’m actually feeling pretty good about those numbers on our current infrastructure! 90ms median, not bad, and even 453ms 75th percentile is not too bad. Now, our test load involves some JSON responses that are quicker to deliver than corresponding HTML page, but still pretty good. The 90th/99th/and max request (2.37s) aren’t great, but I knew I had some slow pages, this matches my previous understanding of how slow they are in our current infrastructure.

90th percentile is ~9 times the 50th percentile.

I don't have an understanding of why the two different Req/Sec and Requests/sec values are so different, and don't totally understand what to do with the Stdev and +/- Stdev values, so I'm just going to stick to looking at the latency percentiles. I think "latency" could also be called "response time" here.

But ok, this is our baseline for this workload. And doing this 3 minute test at various points over the past few days, I can say it's nicely regular and consistent; occasionally I got a slower run, but the 50th percentile was usually 90ms–105ms, right around there.

Heroku standard-2x: base speed

From previous mucking about, I learned I can only reliably fit one puma worker in a standard-1x, and heroku says “we typically recommend a minimum of 2 processes, if possible” (for routing algorithmic reasons when scaled to multiple dynos), so I am just starting at a standard-2x with two puma workers each with 5 threads, matching heroku recommendations for a standard-2x dyno.

So one thing I discovered is that benchmarks from a heroku standard dyno are really variable, but here are typical ones:

$ heroku dyno:resize
 type     size         qty  cost/mo
 ───────  ───────────  ───  ───────
 web      Standard-2X  1    50

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   645.08ms  768.94ms   4.41s    85.52%
     Req/Sec     5.78      4.36    20.00     72.73%
   Latency Distribution
      50%  271.39ms
      75%  948.00ms
      90%    1.74s
      99%    3.50s
   427 requests in 3.00m, 74.51MB read
 Requests/sec:      2.37
 Transfer/sec:    423.67KB

I had heard that heroku standard dynos would have variable performance, because they are shared multi-tenant resources. I had been thinking of this like during a 3 minute test I might see around the same median with more standard deviation — but instead, what it looks like to me is that running this benchmark on Monday at 9am might give very different results than at 9:50am or Tuesday at 2pm. The variability is over a way longer timeframe than my 3 minute test — so that’s something learned.

Running this here and there over the past week, the above results seem to me typical of what I saw. (To get better than “seem typical” on this resource, you’d have to run a test, over several days or a week I think, probably not hammering the server the whole time, to get a sense of actual statistical distribution of the variability).

I sometimes saw tests that were quite a bit slower than this, up to a 500ms median. I rarely if ever saw results much faster than this on a standard-2x. 90th percentile is ~6x median, a smaller multiple than on my current infrastructure, but that still gets up there to 1.74s instead of 864ms.

This typical run is quite a bit slower than our current infrastructure: its median response time is about 3x ours, with the 90th percentile and max being around 2x. This was worse than I expected.

Heroku performance-m: base speed

Although we might be able to fit more puma workers in RAM, we're running a single-connection base speed test, so it shouldn't matter, and we won't adjust it.

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-M  1    250

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   377.88ms  481.96ms   3.33s    86.57%
     Req/Sec    10.36      7.78    30.00     37.03%
   Latency Distribution
      50%  117.62ms
      75%  528.68ms
      90%    1.02s
      99%    2.19s
   793 requests in 3.00m, 145.70MB read
 Requests/sec:      4.40
 Transfer/sec:    828.70KB

This is a lot closer to the ballpark of our current infrastructure. It's a bit slower (117ms median instead of 90ms median), but in running this now and then over the past week it was remarkably, thankfully, consistent. Median and 99th percentile are both about 28% slower (it comforts me that those numbers match across the two runs!); that doesn't bother me so much if it's predictable and regular, which it appears to be. The max still appears to me a little bit less regular on heroku for some reason; since performance dynos are supposed to be non-shared AWS resources, you wouldn't expect that, but slow requests are slow, ok.

90th percentile is ~9x median, about the same as my current infrastructure.

heroku performance-l: base speed

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-L  1    500

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   471.29ms  658.35ms   5.15s    87.98%
     Req/Sec    10.18      7.78    30.00     36.20%
   Latency Distribution
      50%  123.08ms
      75%  635.00ms
      90%    1.30s
      99%    2.86s
   704 requests in 3.00m, 130.43MB read
 Requests/sec:      3.91
 Transfer/sec:    741.94KB

No news is good news, it looks very much like performance-m, which is exactly what we expected, because this isn’t a load test. It tells us that performance-m and performance-l seem to have similar CPU speeds and similar predictable non-variable regularity, which is what I find running this test periodically over a week.

90th percentile is ~10x median, about the same as current infrastructure.

The higher max latency is just more evidence of what I mentioned: the speed of the slowest request did seem to vary more than on our manual t2.medium, and I can't really explain why.

Summary: Base speed

Not sure how helpful this visualization is, charting 50th, 75th, and 90th percentile responses across architectures.
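
(As a stand-in for that chart, here is a rough matplotlib sketch built from the base-speed percentiles reported above; an approximation, not the original chart code.)

# Rough re-creation of the base-speed percentile comparison from the wrk runs above.
import matplotlib.pyplot as plt

percentiles = ["50th", "75th", "90th"]
formations = {
    "t2.medium (current)": [91, 453, 869],    # ms, from the wrk output above
    "standard-2x":         [271, 948, 1740],
    "performance-m":       [118, 529, 1020],
    "performance-l":       [123, 635, 1300],
}

for name, values in formations.items():
    plt.plot(percentiles, values, marker="o", label=name)

plt.ylabel("response time (ms)")
plt.title("Base speed (wrk -c 1 -t 1), latency percentiles")
plt.legend()
plt.show()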

But basically: performance dynos perform similarly to my (bursting) t2.medium. Can’t explain why performance-l seems slightly slower than performance-m, might be just incidental variation when I ran the tests.

The standard-2x is about twice as slow as my (bursting) t2.medium. Again recall standard-2x results varied a lot every time I ran them, the one I reported seems “typical” to me, that’s not super scientific, admittedly, but I’m confident that standard-2x are a lot slower in median response times than my current infrastructure.

Throughput under load

Ok, now we're going to test using wrk with more connections. In fact, I'll test each setup with various numbers of connections, and graph the results, to get a sense of how each formation handles throughput under load. (This means a lot of minutes to get all these results, at 3 minutes per connection-count test, per formation!)

An additional thing we can learn from this test, on heroku we can look at how much RAM is being used after a load test, to get a sense of the app’s RAM usage under traffic to understand the maximum number of puma workers we might be able to fit in a given dyno.
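
For a sense of the mechanics, a connection-count sweep like that could be scripted along these lines (a sketch, not the harness actually used for these numbers; the output parsing is deliberately simplistic):

# Sketch of driving wrk over increasing connection counts and pulling out the median.
import os
import re
import subprocess

TARGET = "https://scihist-digicoll.herokuapp.com/"
env = dict(os.environ, URLS="./sample_works.txt")

for connections in (1, 2, 3, 4, 6, 8, 10, 12):
    cmd = ["wrk", "-c", str(connections), "-t", "1", "-d", "3m",
           "--timeout", "20s", "--latency",
           "-s", "load_test/multiplepaths.lua.txt", TARGET]
    out = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout
    median = re.search(r"^\s*50%\s+(\S+)", out, re.MULTILINE)
    print(f"{connections} connections: median {median.group(1) if median else '?'}")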

Existing t2.medium: Under load

A t2.medium has 4G of RAM and 2 CPUs. We run 10 passenger workers (no multi-threading, since we are free, rather than enterprise, passenger). So what do we expect? With 2 CPUs and more than 2 workers, I’d expect it to handle 2 simultaneous streams of requests almost as well as 1; 3-10 should be quite a bit slower because they are competing for the 2 CPUs. Over 10, performance will probably become catastrophic.

2 connections are exactly flat with 1, as expected for our two CPUs, hooray!

Then it goes up at a strikingly even line. Going over 10 (to 12) simultaneous connections doesn’t matter, even though we’ve exhausted our workers, I guess at this point there’s so much competition for the two CPUs already.

The slope of this curve is really nice too, actually. Without load, our median response time is 100ms, but even at a totally overloaded 12 connections, it's only 550ms, which actually isn't too bad.

We can make a graph that in addition to median also has 75th, 90th, and 99th percentile response time on it:

It doesn’t tell us too much; it tells us the upper percentiles rise at about the same rate as the median. At 1 simultaneous connection 90th percentile of 846ms is about 9 times the median of 93ms; at 10 requests the 90th percentile of 3.6 seconds is about 8 times the median of 471ms.

This does remind us that under load, when things get slow, this has more of a disastrous effect on already slow requests than on fast requests. When not under load, even our 90th percentile was kind of sort of barely acceptable at 846ms; under load, at 3.6 seconds, it really isn't.

Single Standard-2X dyno: Under load

A standard-2X dyno has 1G of RAM. The (amazing, excellent, thanks schneems) heroku puma guide suggests running two puma workers with 5 threads each. At first I wanted to try running three workers, which seemed to fit into available RAM — but under heavy load-testing I was getting Heroku R14 Memory Quota Exceeded errors, so we’ll just stick with the heroku docs recommendations. Two workers with 5 threads each fit with plenty of headroom.

A standard-2x dyno runs on shared (multi-tenant) underlying Amazon virtual hardware. So while it is running on hardware with 4 CPUs (each of which can run two "hyperthreads"), the puma doc suggests "it is best to assume only one process can execute at a time" on standard dynos.

What do we expect? Well, if it really only had one CPU, it would immediately start getting bad at 2 simulataneous connections, and just get worse from there. When we exceed the two worker count, will it get even worse? What about when we exceed the 10 thread (2 workers * 5 threads) count?

You'd never run just one dyno if you were expecting this much traffic, you'd always horizontally scale. This very artificial test is just to get a sense of its characteristics.

Also, we remember that standard-2x's are just really variable; I could get much worse or better runs than this, but I graphed numbers from a run that seemed typical.

Well, it really does act like 1 CPU, 2 simultaneous connections is immediately a lot worse than 1.

The line isn’t quite as straight as in our existing t2.medium, but it’s still pretty straight; I’d attribute the slight lumpiness to just the variability of shared-architecture standard dyno, and figure it would get perfectly straight with more data.

It degrades at about the same rate as our baseline t2.medium, but when you start out slower, that's more disastrous. Our t2.medium at an overloaded 10 simultaneous requests is 473ms (pretty tolerable actually), 5 times the median at one request only. This standard-2x has a median response time of 273ms at only one simultaneous request, and at an overloaded 10 requests has a median response time also about 5x worse, but that becomes a less tolerable 1480ms.

Does also graphing the 75th, 90th, and 99th percentile tell us much?

Eh, I think the lumpiness is still just standard shared-architecture variability.

The rate of “getting worse” as we add more overloaded connections is actually a bit better than it was on our t2.medium, but since it already starts out so much slower, we’ll just call it a wash. (On t2.medium, 90th percentile without load is 846ms and under an overloaded 10 connections 3.6s. On this single standard-2x, it’s 1.8s and 5.2s).

I'm not sure how much these charts with various percentiles on them tell us, so I won't include them for every architecture from here on.

standard-2x, 4 dynos: Under load

OK, realistically we already know you shouldn’t have just one standard-2x dyno under that kind of load. You’d scale out, either manually or perhaps using something like the neat Rails Autoscale add-on.

Let’s measure with 4 dynos. Each is still running 2 puma workers, with 5 threads each.

What do we expect? Hm, treating each dyno as if it has only one CPU, we'd expect it to be able to handle traffic pretty levelly up to 4 simultaneous connections, distributed to 4 dynos. It's going to do worse after that, but up to 8 there is still one puma worker per connection, so maybe it gets even worse after 8?

Well… I think that actually is relatively flat from 1 to 4 simultaneous connections, except for lumpiness from variability. But lumpiness from variability is huge! We’re talking 250ms median measured at 1 connection, up to 369ms measured median at 2, down to 274ms at 3.

And then maybe yeah, a fairly shallow slope up to 8 simultaneous connections, then steeper.

But it's all a fairly shallow slope compared to our base t2.medium. At 8 connections (after which we pretty much max out), the standard-2x median of 464ms is only 1.8 times the median at 1 connection, compared to the t2.medium's increase of 3.7 times.

As we’d expect, scaling out to 4 dynos (with four cpus/8 hyperthreads) helps us scale well — the problem is the baseline is so slow to begin (with very high bounds of variability making it regularly even slower).

performance-m: Under load

A performance-m has 2.5 GB of memory. It only has one physical CPU, although two "vCPUs" (two hyperthreads); and these are all yours, the dyno is not shared.

By testing under load, I demonstrated I could actually fit 12 workers on there without any memory limit errors. But is there any point to doing that with only 1 CPU/2 hyperthreads? Under a bit of testing, it appeared not.

The heroku puma docs recommend only 2 processes with 5 threads. You could do a whole little mini-experiment just trying to measure/optimize process/thread count on performance-m! We’ve already got too much data here, but in some experimentation it looked to me like 5 processes with 2 threads each performed better (and certainly no worse) than 2 processes with 5 threads — if you’ve got the RAM just sitting there anyway (as we do), why not?

I actually tested with 6 puma processes with 2 threads each. There is still a large amount of RAM headroom we aren’t going to use even under load.

What do we expect? Well, with the 2 “hyperthreads” perhaps it can handle 2 simultaneous requests nearly as well as 1 (or not?); after that, we expect it to degrade quickly same as our original t2.medium did.

It can handle 2 connections slightly better than you'd expect if there really was only 1 CPU, so I guess a hyperthread does give you something. Then the slope picks up, as you'd expect; and it looks like it does get steeper after 4 simultaneous connections, yup.

performance-l: Under load

A performance-l ($500/month) costs twice as much as a performance-m ($250/month), but has far more than twice as much resources. performance-l has a whopping 14GB of RAM compared to performance-m's 2.5GB; and performance-l has 4 real CPUs/8 hyperthreads available to use (visible using the nproc technique in the heroku puma article).

Because we have plenty of RAM to do so, we're going to run 10 worker processes to match our original t2.medium's passenger count. We still ran with 2 threads each, just because it seems like maybe you should never run a puma worker with only one thread? But who knows, maybe 10 workers with 1 thread each would perform better; plenty of room (but not plenty of my energy) for yet more experimentation.

What do we expect? The graph should be pretty flat up to 4 simultaneous connections, then it should start getting worse, pretty evenly as simultaneous connections rise all the way up to 12.

It is indeed pretty flat up to 4 simultaneous connections. Then up to 8 it’s still not too bad — median at 8 is only ~1.5 median at 1(!). Then it gets worse after 8 (oh yeah, 8 hyperthreads?).

But the slope is wonderfully shallow all the way. Even at 12 simultaneous connections, the median response time of 266ms is only 2.5x what it was at one connection. (In our original t2.medium, at 12 simultaneous connections median response time was over 5x what it was at 1 connection).

This thing is indeed a monster.

Summary Comparison: Under load

We showed a lot of graphs that look similar, but they all had different scales on the y-axis. Let's plot median response times under load of all architectures on the same graph, and see what we're really dealing with.

The blue t2.medium is our baseline, what we have now. We can see that there isn’t really a similar heroku option, we have our choice of better or worse.

The performance-l is just plain better than what we have now. It starts out performing about the same as what we have now for 1 or 2 simultaneous connections, but then scales so much flatter.

The performance-m also starts out about the same, but scales so much worse than even what we have now (it's that 1 real CPU instead of 2, I guess?).

The standard-2x scaled to 4 dynos… has its own characteristics. Its baseline is pretty terrible; it's 2 to 3 times as slow as what we have now even not under load. But then it scales pretty well, since it's 4 dynos after all; it doesn't get worse as fast as performance-m does. But it started out so bad that it remains far worse than our original t2.medium even under load. Adding more dynos to standard-2x will help it remain steady under even higher load, but won't help its underlying problem that it's just slower than everyone else.

Discussion: Thoughts and Surprises

  • I had been thinking of a t2.medium (even with burst) as "typical" (it is after all much slower than my 2015 Macbook), and had been assuming (in retrospect with no particular basis) that a heroku standard dyno would perform similarly.
    • Most discussion and heroku docs, as well as the naming itself, suggest that a ‘standard’ dyno is, well, standard, and performance dynos are for “super scale, high traffic apps”, which is not me.
    • But in fact, heroku standard dynos are much slower and more variable in performance than a bursting t2.medium. I suspect they are slower than other options you might consider non-heroku “typical” options.



  • My conclusion is honestly that “standard” dynos are really “for very fast, well-optimized apps that can handle slow and variable CPU” and “performance” dynos are really “standard, matching the CPU speeds you’d get from a typical non-heroku option”. But this is not how they are documented or usually talked about. Are other people having really different experiences/conclusions than me? If so, why, or where have I gone wrong?
    • This of course has implications for estimating your heroku budget if considering switching over. :(
    • If you have a well-optimized fast app, say even a 95th percentile of 200ms (on a bursting t2.medium), then you can handle standard slowness — so what if your 95th percentile is now 600ms (and during some time periods even much slower, 1s or worse, due to variability)? That's not so bad for a 95th percentile.
    • One way to get a very fast app is of course caching. There is lots of discussion of using caching in Rails; sometimes the message (explicit or implicit) is "you have to use lots of caching to get reasonable performance cause Rails is so slow." What if many of these people are on heroku, and it's really "you have to use lots of caching to get reasonable performance on a heroku standard dyno"??
    • I personally don't think caching is maintenance free; in my experience, properly doing cache invalidation and dealing with the significant processing spikes needed when you choose to invalidate your entire cache (cause cached HTML needs to change) leads to real maintenance/development cost. I have not needed caching to meet my performance goals on our present architecture.
    • Everyone doesn't necessarily have the same performance goals/requirements. Mine, for a low-traffic non-commercial site, are maybe more modest: I just need users not to be super annoyed. But whatever your performance goals, you're going to have to spend more time on optimization on a heroku standard dyno than on something with a much faster CPU — like a standard affordable mid-tier EC2. Am I wrong?


  • One significant factor on heroku standard dyno performance is that they use shared/multi-tenant infrastructure. I wonder if they’ve actually gotten lower performance over time, as many customers (who you may be sharing with) have gotten better at maximizing their utilization, so the shared CPUs are typically more busy? Like a frog boiling, maybe nobody noticed that standard dynos have become lower performance? I dunno, brainstorming.
    • Or maybe there are so many apps that start on heroku instead of switching from somewhere else, that people just don't realize that standard dynos are much slower than other low/mid-tier options?
    • I was expecting to pay a premium for heroku; even standard-2x's are a significant premium over paying for a t2.medium EC2 yourself, though one I found quite reasonable. Performance dynos are of course even more of a premium.


  • I had a sort of baked-in premise that most Rails apps are "IO-bound", that they spend more time waiting on IO than using CPU. I don't know where I got that idea; I heard it once a long time ago and it became part of my mental model. I now do not believe this is true of my app, and I do not in fact believe it is true of most Rails apps in 2020. I would hypothesize that most Rails apps today are in fact CPU-bound.

  • The performance-m dyno only has one CPU. I had somehow also been assuming that it would have two CPUs — I’m not sure why, maybe just because at that price! It would be a much better deal with two CPUs.
    • Instead we have a huge jump from $250 performance-m to $500 performance-l that has 4x the CPUs and ~5x the RAM.
    • So it doesn't make financial sense to have more than one performance-m dyno; you might as well go to performance-l. But this really complicates auto-scaling, whether using Heroku's feature or the awesome Rails Autoscale add-on. I am not sure I can afford a performance-l all the time, and a performance-m might be sufficient most of the time. But if 20% of the time I'm going to need more (or even 5%, or even unexpectedly-mentioned-in-national-media), it would be nice to set things up to autoscale up…. I guess to a financially irrational 2 or more performance-m's? :(

  • The performance-l is a very big machine, significantly beefier than my current infrastructure. And it has far more RAM than I need/can use with only 4 physical cores. If I consider standard dynos to be pretty effectively low tier (as I do), heroku to me is kind of missing mid-tier options. A 2 CPU option at 2.5G or 5G of RAM would make a lot of sense to me, and actually be exactly what I need… really I think performance-m would make more sense with 2 CPUs at its existing already-premium price point, to be worth calling a "performance" dyno. Maybe heroku is intentionally trying to set options to funnel people to the highest-priced performance-l.

Conclusion: What are we going to do?

In my investigations of heroku, my opinion of the developer UX and general service quality only increases. It’s a great product, that would increase our operational capacity and reliability, and substitute for so many person-hours of sysadmin/operational time if we were self-managing (even on cloud architecture like EC2).

But I had originally been figuring we’d use standard dynos (even more affordably, possibly auto-scaled with Rails Autoscale plugin), and am disappointed that they end up looking so much lower performance than our current infrastructure.

Could we use them anyway? Response time going from 100ms to 300ms — hey, 300ms is still fine, even if I'm sad to lose those really nice numbers I got from a bit of optimization. But this app has a wide long tail; our 75th percentile going from 450ms to 1s, our 90th percentile going from 860ms to 1.74s, and our 99th going from 2.3s to 4.4s is a lot harder to swallow. Especially when we know that due to standard dyno variability, a slow-ish page that on my present architecture is reliably 1.5s could really be anywhere from 3 to 9 seconds(!) on heroku.

I would anticipate having to spend a lot more developer time on optimization on heroku standard dynos — or, in this small over-burdened non-commercial shop, not prioritizing that (or not having the skills for it), and having our performance just get bad.

So I’m really reluctant to suggest moving our app to heroku with standard dynos.

A performance-l dyno is going to let us not have to think about performance any more than we do now, while scaling under high-traffic better than we do now — I suspect we’d never need to scale to more than one performance-l dyno. But it’s pricey for us.

A performance-m dyno has a base-speed that’s fine, but scales very poorly and unaffordably. Doesn’t handle an increase in load very well as one dyno, and to get more CPUs you have to pay far too much (especially compared to standard dynos I had been assuming I’d use).

So I don’t really like any of my options. If we do heroku, maybe we’ll try a performance-m, and “hope” our traffic is light enough that a single one will do? Maybe with Rails autoscale for traffic spikes, even though 2 performance-m dynos isn’t financially efficient? If we are scaling to 2 (or more!) performance-m’s more than very occasionally, switch to performance-l, which means we need to make sure we have the budget for it?

Storage Media Update / David Rosenthal

My last post on storage media was After A Decade, HAMR Is Still Nearly Here back in July. Below the fold, I look at some of the developments since then.

First, courtesy of Tom Coughlin, three items based on graphs from the latest issue of his fascinating newsletter.

Kryder's law still good

This graph shows that, despite the pessimism I have repeatedly expressed since 2012's Storage Will Be A Lot Less Free Than It Used To Be, the hard disk industry has managed to keep reducing the cost per byte of their products.

Note that the graph is in three sections, initially dropping rapidly, then flat due to the 2011 floods in Thailand, then dropping much more slowly. As I showed back in 2012, the exact Kryder rate, especially in the early years, has a big effect on the cost of preserving data for the long term (see here). So the slowing after the floods has significantly increased the cost of storing data for the long-term above what would have been expected before the floods.

Flash still can't kill HDDs

This graph shows that the result of the hard disk vendors' efforts is that, despite the technological advances by the flash industry (see below), the vast majority of bytes shipped continues to be in the form of hard disk.

I'm somewhat skeptical of Coughlin's projections for rapid increases in total exabytes shipped over the next few years, which would please the good Dr. Pangloss. Although HAMR and MAMR should allow significant increases in disk areal density, and advances in 3D flash should increase bits per wafer, I doubt that these will drive total exabytes as fast as Coughlin projects.

Approaching the limits

A major reason for my continuing (if perhaps futile) pessimism about Kryder's Law is that the closer you get to the physical limits of your technology, the slower and more expensive it is to make further progress. This graph shows that, for about the last four years in the hard disk industry, staying on the Kryder's Law curve has had no help from making the bits smaller. They have had to do it by other cost reductions.

Technologies progress in S-curves, as shown in Dave Anderson's 2009 slide. The areal density graph clearly shows that the current disk technology is at the top of its S-curve. The replacement technologies, HAMR and MAMR, have still not impacted the volume market after being imminent for a decade.

Dave's graph makes it look like the effect of the series of S-curves is a straight-line graph of progress. But the recent S-curves are much slower than earlier ones, so the overall graph has become S-shaped.

Micron improves 3D NAND

Jim Salter reports in Micron announces new 3D NAND process—denser, faster, less expensive that:
On Monday, memory and storage vendor Micron announced that its new 176-layer 3D NAND (the storage medium underlying most SSDs) process is in production and has begun shipping to customers. The new technology should offer higher storage densities and write endurance, better performance, and lower costs.
Building a chip with 176 layers is a truly remarkable feat, a 37.5% improvement over their previous 128-layer product.

There are two different technologies for making the cells of flash memory, floating-gate and charge-trap. Micron's earlier 96-layer technology used floating-gate, their 128- and 176-layer replacement gate (RG) technologies use charge-trap. Charge-trap:
is a type of floating-gate MOSFET memory technology, but differs from the conventional floating-gate technology in that it uses a silicon nitride film to store electrons rather than the doped polycrystalline silicon typical of a floating-gate structure. This approach allows memory manufacturers to reduce manufacturing costs five ways:
  • Fewer process steps are required to form a charge storage node
  • Smaller process geometries can be used (therefore reducing chip size and cost)
  • Multiple bits can be stored on a single flash memory cell.
  • Improved reliability
  • Higher yield since the charge trap is less susceptible to point defects in the tunnel oxide layer
Micron describes their RG technology in a white paper entitled Micron Transitions to Next-Generation 3D NAND Replacement-Gate Technology. They cite the following advantages, beyond the obvious increase in areal density:
  • Reduced capacitance between the cells in a stack, which allows simpler, faster algorithms in the flash controller's write path implementing fewer, sharper voltage pulses to program the cell. This reduces write latency and increases write bandwidth.
  • The use of metal rather than polysilicon for the NAND control gate, which reduces resistance and allows the write voltage pulse to ramp faster. Again, this reduces write latency and increases write bandwidth.
  • Shorter write pulses mean that:
    The strength and time of electric fields applied to the cell material and other NAND structures relate directly to the endurance of the NAND storage cell. The longer the electric field is applied, the more stress is created on the NAND, which reduces endurance.
  • Shorter write pulses use less power, reducing overall power consumption and heat generation
So this technology reduces cost per bit and write latency, increases write bandwidth, and increases write endurance. Salter writes:
If Micron's claims of greatly increased write endurance pan out, it might become possible to replace incredibly expensive SLC (Single Level Cell) enterprise/data center SSDs with much cheaper 3D NAND devices in demanding applications. Meanwhile—assuming no large increase in per-wafer manufacturing cost—the roughly one-third increase in storage density per chip could mean similarly less expensive consumer devices.
Salter concludes:
We don't expect this to be the death knell for traditional hard drives yet. Even in the best possible case—no increase in manufacturing cost whatsoever—this would put the cost per terabyte of TLC NAND somewhere around $85. The cost per TB of conventional hard drives runs about $27, so there's still plenty of air between the two technologies when it comes to price.

Zoned Name Spaces

Anton Shilov's Western Digital's Ultrastar DC ZN540 Is the World's First ZNS SSD starts:
Western Digital is one of the most vocal proponents of the Zoned Namespaces (ZNS) storage initiative, so it is not surprising that the company this week became the first SSD maker to start sampling of a ZNS SSD. When used properly, the Ultrastar DC ZN540 drive can replace up to four conventional SSDs, provide higher performance and improve quality of service (QoS).
The Zoned Name Spaces initiative defines Zoned Storage:
Zoned Storage is a class of storage devices that enables host and storage devices to cooperate to achieve higher storage capacities, increased throughput, and lower latencies. The zoned storage interface is available through the SCSI Zoned Block Commands (ZBC) and Zoned Device ATA Command Set (ZAC) standards for Shingled Magnetic Recording (SMR) hard disks and with the NVMe Zoned Namespaces (ZNS) standard for NVMe Solid State Disks.
The Initiative's Zoned Storage Overview explains:
The zones of zoned storage devices must be written sequentially. Each zone of the device address space has a write pointer that keeps track of the position of the next write. Data in a zone cannot be directly overwritten. The zone must first be erased using a special command (zone reset).
So what is really going on here is an effort to expose the underlying limitations of SMR hard disks and flash storage to drivers and applications. It isn't surprising that Western Digital is "one of the most vocal proponents" of this:
Shingling, which means moving the tracks so close together that writing a track partially overwrites the adjacent track. Very sophisticated signal processing allows the partially overwritten data to be read. Shingled drives come in two forms. WD's drives expose the shingling to the host, requiring the host software to be changed to treat them like append-only media. Seagate's drives are device-managed, with on-board software obscuring the effect of shingling, at the cost of greater variance in performance.
We discovered the "greater variance in performance" when using Seagate's SMR "Archive" drives in A Cost-Effective DIY LOCKSS Box. By exposing the medium to the device driver as a set of append-only, erasable zones, the file system can more closely conform its write operations to the underlying device's capabilities, thus reducing the need to block write traffic while data is moved or erased on the medium. The lack of a standard way to do this has limited adoption of SMR, and has led to greater complexity in SSD firmware.
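
To make the zone rules concrete, here is a toy model of the behavior described above (purely illustrative Python, not any vendor's interface):

# Toy illustration of the zoned-storage contract: sequential writes tracked by a
# write pointer, no overwrite in place, and an explicit whole-zone reset.
class Zone:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.write_pointer = 0                  # next block that may be written
        self.blocks = [None] * capacity_blocks

    def write(self, lba, data):
        if lba != self.write_pointer:
            raise IOError("zone requires sequential writes at the write pointer")
        if self.write_pointer >= self.capacity:
            raise IOError("zone full; reset required before rewriting")
        self.blocks[lba] = data
        self.write_pointer += 1

    def reset(self):
        # The only way to make written blocks writable again (zone reset / erase).
        self.write_pointer = 0
        self.blocks = [None] * self.capacity

zone = Zone(capacity_blocks=4)
zone.write(0, b"a")
zone.write(1, b"b")        # ok: sequential, at the write pointer
# zone.write(0, b"c")      # would raise: data in a zone cannot be overwritten in place
zone.reset()               # reclaim the whole zone, then start writing again from 0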

Changes to System.Diagnostics.Process in .NET Core / Terry Reese

In .NET Core, one of the changes that caught me by surprise is the change related to starting processes. In the .NET Framework, you can open a web site, file, etc. just by using the following:

System.Diagnostics.Process.Start(path);

However, in .NET Core – this won’t work.  When trying to open a file, the process will fail – reporting that a program isn’t associated with the file type.  When trying to open a folder on the system, the process will fail with a permission error unless the application is running with administrator permissions (which you don’t want to be doing).  The change is related to a change in a property default – specifically:

System.Diagnostics.ProcessStartInfo.UseShellExecute

In the .NET Framework, this property is set to true by default. In .NET Core, it is set to false. The difference here probably makes sense – .NET Core is meant to be more portable and you do need to change this value on some systems. To fix this, I'd recommend removing any direct calls to this assembly and running them through a function like this:

public static void OpenURL(string url)
  {
    // UseShellExecute = true restores the .NET Framework behavior of letting
    // the OS open the URL with its default handler.
    var psi = new System.Diagnostics.ProcessStartInfo
    {
      FileName = url,
      UseShellExecute = true
    };
    try {
      System.Diagnostics.Process.Start(psi);
    } catch {
      // Fall back to the .NET Core default if shell execution isn't available.
      psi.UseShellExecute = false;
      System.Diagnostics.Process.Start(psi);
    }
  }

public static void OpenFileOrFolder(string spath, string sargs = "")
  {
    var psi = new System.Diagnostics.ProcessStartInfo
    {
      FileName = spath,
      UseShellExecute = true
    };
    try {
      System.IO.FileAttributes attr = System.IO.File.GetAttributes(spath);
      if ((attr & System.IO.FileAttributes.Directory) == System.IO.FileAttributes.Directory) {
        // Folders open in the system file manager; arguments don't apply.
        System.Diagnostics.Process.Start(psi);
      } else {
        if (sargs.Trim().Length != 0) {
          psi.Arguments = sargs;
        }
        System.Diagnostics.Process.Start(psi);
      }
    } catch {
      // Fall back to the .NET Core default if shell execution fails.
      psi.UseShellExecute = false;
      System.IO.FileAttributes attr = System.IO.File.GetAttributes(spath);
      if ((attr & System.IO.FileAttributes.Directory) == System.IO.FileAttributes.Directory) {
        System.Diagnostics.Process.Start(psi);
      } else {
        if (sargs.Trim().Length != 0) {
          psi.Arguments = sargs;
        }
        System.Diagnostics.Process.Start(psi);
      }
    }
  }

Since this vexed me for a little bit – I’m putting this here so I don’t forget.

tr

10 Additions to NDSA Membership in Summer and Fall 2020 / Digital Library Federation

Since the spring of 2020, the NDSA Leadership has unanimously voted to welcome 10 new members. Each of these new members brings a host of skills and experience to our group. Please help us welcome:

  • Arizona State University Library: With many of their materials from local Indigenous and LatinX communities, the Library is working with researchers from these communities to archive and preserve collections and artifacts unique to our region, making them accessible for generations to come.
  • Arkevist: A civil society organization that specializes in historical and genealogical research
  • discoverygarden: For more than a decade, discoverygarden has been building trusted repositories and digital asset management systems for organizations around the world.
  • Global Connexions: For two decades Frederick Zarndt has provided consulting services to cultural heritage organizations and has contributed to NDSA, ALA, IFLA and ALTO.
  • LYRASIS: They are the non-profit organizational home of several open source projects that are focused on collecting, organizing, and ensuring long-term access to digital content including DSpace, ArchivesSpace, CollectionSpace, Islandora, Fedora Repository, and DuraCloud. 
  • Michigan Digital Preservation Network: MDPN is an IMLS-grant funded initiative to build a member-run statewide distributed digital preservation network with members ranging from libraries, archives, museums, and historical societies with the primary purpose of preserving cultural heritage materials
  • Robert L. Bogomolny Library – University of Baltimore: Robert L. Bogomolny Library is in the midst of a five year digital preservation implementation based upon results derived from conducting Institutional Readiness and Digital Preservation Capability Maturity Model exercises. Their Special Collections and Archives hold sizable digital collection materials, including 700TBs of digitized local TV news.
  • University of Pennsylvania Libraries: The Penn Libraries are working on many digital preservation activities, including but not limited to the ongoing development of a Samvera repository, web archiving initiatives, conducting a pilot of two preservation storage systems, and developing governance for workflows and policies in order to have robust and programmatic digital preservation practices.
  • University of Victoria Libraries: The UVic Libraries are currently involved in a number of digital preservation-related infrastructure projects, including Council of Prairie and Pacific University Libraries (COPPUL) Archivematica-as-a-Service and WestVault (a LOCKSS-based preservation storage network), and serve as infrastructure hosts for the Canadian Government Information Preservation Network (CGI-PN), the Public Knowledge Project Preservation Network (PKP-PN), and perma.cc. 
  • University of Wisconsin-Milwaukee: Over the past five years UWM has formed a Digital Preservation Community of Practice whose aim is to identify common digital preservation issues across departments and shared tools and workflows.  UWM also co-founded the Digital Preservation Expertise Group (DPEG), a University of Wisconsin System-wide group that shares digital preservation expertise, develops training, and investigates shared resources across all thirteen UW System Libraries.

Each organization has participants in one or more of the various NDSA interest and working groups – so keep an eye out for them on your calls and be sure to give them a shout out. Please join me in welcoming our new members. You can review our full list of members here.

~ Dan Noonan, Vice Chair of the Coordinating Committee

The post 10 Additions to NDSA Membership in Summer and Fall 2020 appeared first on DLF.

Dryad and Frictionless Data collaboration / Open Knowledge Foundation

By Tracy Teal; originally posted in the Dryad blog: https://blog.datadryad.org/2020/11/18/frictionless-data/

Guided by our commitment to make research data publishing more seamless and also re-usable, we are thrilled to partner with Open Knowledge Foundation and the Frictionless Data team to enhance our submission processes. By integrating the Frictionless Data toolkit, Dryad will be able to provide feedback to authors directly on the structure of the tabular files they upload. This will also allow automated file-level metadata to be created at upload and made available for download for published datasets.

We are excited to get moving on this project and with support from the Sloan Foundation, Open Knowledge Foundation has just announced a job opening to contribute to this work. Please check out the posting and circulate it to any developers who may be interested in building out this functionality with us: https://okfn.org/about/jobs/

Announcing Spanish Translations for the 2019 and 2013 Levels Matrix / Digital Library Federation

The NDSA is pleased to announce that both the original (2013) and Version 2 (2019) of the Levels Matrix have been translated into Spanish by our colleagues from Mexico and Spain, Dr. David Leija (Universidad Autónoma de Tamaulipas) and Dr. Miquel Térmens (Universitat de Barcelona). Drs. Leija and Térmens are academic researchers and founders of APREDIG (Ibero-American association for digital preservation), a non-profit organization focused on spreading the importance of good practices of digital preservation for the Spanish-speaking community.

Links to these documents are found below, as well as on the Levels of Digital Preservation OSF project pages: 2019 (https://osf.io/qgz98/) and 2013 (https://osf.io/9ya8c/).

In addition, Miquel Térmens and David Leija have written a report analyzing and documenting the use of the NDSA Levels in 8 public and private organizations in Spain, Mexico, Brazil and Switzerland. The report, “Methodology of digital preservation audits with NDSA Levels,” can be found in Spanish here and should be cited as shown below.

  • Térmens, Miquel; Leija, David (2017). “Methodology of digital preservation audits with NDSA Levels”. El profesional de la información, v. 26, n. 3, pp. 447-456. https://doi.org/10.3145/epi.2017.may.11 | https://fima.ub.edu/pub/termens/docs/EPI-v26n3.pdf 

If you are interested in translating the Levels of Digital Preservation V2.0 into another language, please contact us at ndsa.digipres@gmail.com.

 

Traducciones al español de la Matriz de Niveles de Preservación Digital 2019 y 2013

La NDSA se complace en anunciar que tanto la versión original como la versión 2 de la Matriz de Niveles de Preservación Digital han sido traducidas al español por nuestros colegas investigadores de México y España, el Dr. David Leija (Universidad Autónoma de Tamaulipas) y el Dr. Miquel Térmens (Universitat de Barcelona). Térmens y Leija son investigadores académicos fundadores de APREDIG (Asociación Iberoamericana de Preservación Digital), una organización sin ánimo de lucro enfocada en difundir la importancia de las buenas prácticas de preservación digital para la comunidad hispanohablante.

Los enlaces a estos documentos traducidos se encuentran a continuación, así como en las páginas del proyecto OSF de Niveles de Preservación Digital: 2019 (https://osf.io/qgz98/) y 2013 (https://osf.io/9ya8c/).

Adicionalmente, Miquel Térmens y David Leija han escrito un reporte analizando y documentando el uso de los niveles NDSA en 8 organizaciones públicas y privadas de España, México, Brasil y Suiza. La Auditoría de Preservación Digital con NDSA Levels, se puede encontrar en español aquí y debe citarse como se encuentra a continuación.  

  • Térmens, Miquel; Leija, David (2017). “Auditoría de Preservación Digital con NDSA Levels”. El profesional de la información, v. 26, n. 3, pp. 447-456.      https://doi.org/10.3145/epi.2017.may.11 | https://fima.ub.edu/pub/termens/docs/EPI-v26n3.pdf 

Si está interesado en traducir los niveles de Preservación Digital V2.0 en otros idiomas por favor póngase en contacto en ndsa.digipres@gmail.com. 

 

The post Announcing Spanish Translations for the 2019 and 2013 Levels Matrix appeared first on DLF.

Thank you for your feedback about Open Data Day. Here’s what we learned. / Open Knowledge Foundation

Open Data Day is an annual celebration of open data all over the world facilitated by the Open Knowledge Foundation. Each year, groups from around the world create local events on the day where they will use open data in their communities. It is an opportunity to show the benefits of open data and encourage the adoption of open data policies in government, business and civil society.

With Open Data Day 2021 less than four months away, we asked the Open Data community to tell us how you think we can better support Open Data Day.

It’s not too late to have your say. Just visit the survey here.  

The responses we had were very encouraging. We received lots of feedback – both positive and negative. And many of you offered to help with Open Data Day 2021. Thank you so much!

We’ve gone through all the feedback and…

Here is a summary of what we learned

You told us that Open Data Day 2021 will be better if we …

  • present all the events together in a searchable directory, to show the amazing scale and variety of Open Data Day events
  • focus less on the geographic location of the events because some events are online and can be attended by anyone with an internet connection
  • give more mini-grants to more Open Data Day events 
  • confirm who has won the mini-grants at an earlier date – to help with event planning
  • focus on one data track – not four. Recommendations included climate change data, disaster risk management data, gender data, election data and Covid-19 data
  • give support, advice and opportunities for Covid Safe events and activities 
  • get better press coverage of Open Data Day events, and better connections with data journalists 
  • publish reports on Open Data Day events on the Open Data Day website, with more photos and videos
  • improve the mini-grant methodology to increase the measurable impact of Open Data Day mini-grants 
  • reduce bank charges by using innovative money transfer systems 
  • help funding partners create better connections with event organisers and attendees

Here at Open Knowledge Foundation we will spend the next few weeks digesting all these great ideas and working out how best to respond to make sure Open Data Day 2021 is better than ever. Thanks again to everyone who already responded to our survey! 

MarcEdit 7.5/MarcEdit Mac 3.5 Work / Terry Reese

Every year, around this time, I try to dedicate significant time to address any large project work that may have been percolating around MarcEdit. This year will be no different. Over the past 4 months, I’ve been working on moving MarcEdit away from the .NET 4.7.2 Framework to .NET Core 3.1. There are a lot of reasons for looking at this, the most important being that this is the direction Microsoft is taking the framework – a move to unify the various .NET development platforms to make distribution and maintenance easier. Well, with the release of .NET 5 this November, all the tools I need to officially make this transition are now in place.

So, over the next two months, I’ll be working on shifting MarcEdit away from Framework 4.7.2 and to .NET 5. I believe this will be possible – I only have concerns about two libraries that I rely on – and if I have to, both are open source, so I can look at potentially spending time helping the project maintainers target a non-framework build. My hope is to have a working version of MarcEdit using .NET 5 by Thanksgiving that I can start unit testing and testing locally. 

Of course, with this change, I’ll also have to change the installer process. The reason is that this transition will remove the necessity of having .NET installed on one’s machine. One of the changes to the framework is the ability to publish self-contained applications – allowing for faster startup and lower memory usage. This is something I’m excited about, as I have been slow to update build frameworks due to the need to have those frameworks installed locally. By removing that dependency, I’m hoping to be able to take advantage of changes to the C# language that make programming easier and more efficient, while also allowing me to remove some of the workaround code I’ve had to develop to account for bugs or limitations in previous frameworks.

Finally, this change is going to simplify a lot of cross platform development – and once the initial transition has occurred, I’ll be spending time working on expanding the MarcEdit MacOS version.  There are a couple of areas where this program still lacks parity in relation to the Windows version, and these changes will give me the opportunity to close many of these gaps. 

–tr

Journal Map: developing an open environment for accessing and analyzing performance indicators from journals in economics / ZBW German National Library of Economics

by Franz Osorio, Timo Borst

Introduction

Bibliometrics, scientometrics, informetrics and webometrics have been both research topics and practical guidelines for publishing, reading, citing, measuring and acquiring published research for a while (Hood 2001). Citation databases and measures had been introduced in the 1960s, becoming benchmarks both for the publishing industry and academic libraries managing their holdings and journal acquisitions that tend to be more selective with a growing number of journals on the one side, budget cuts on the other. Due to the Open Access movement triggering a transformation of traditional publishing models (Schimmer 2015), and in the light of both global and distributed information infrastructures for publishing and communicating on the web that have yielded more diverse practices and communities, this situation has dramatically changed: While bibliometrics of research output in its core understanding still is highly relevant to stakeholders and the scientific community, visibility, influence and impact of scientific results has shifted to locations in the World Wide Web that are commonly shared and quickly accessible not only by peers, but by the general public (Thelwall 2013). This has several implications for different stakeholders who are referring to metrics in dealing with scientific results:
 
  • With the rise of social networks, platforms and their use also by academics and research communities, the term 'metrics' itself has gained a broader meaning: while traditional citation indexes only track citations of literature published in (other) journals, 'mentions', 'reads' and 'tweets', albeit less formal, have become indicators and measures for (scientific) impact.
  • Altmetrics has influenced research performance, evaluation and measurement, which formerly had been exclusively associated with traditional bibliometrics. Scientists are becoming aware of alternative publishing channels and both the option and need of 'self-advertising' their output.
  • In particular academic libraries are forced to manage their journal subscriptions and holdings in the light of increasing scientific output on the one hand, and stagnating budgets on the other. While editorial products from the publishing industry are exposed to a global competing market requiring a 'brand' strategy, altmetrics may serve as additional scattered indicators for scientific awareness and value.

Against this background, we took the opportunity to collect, process and display some impact or signal data with respect to literature in economics from different sources, such as 'traditional' citation databases, journal rankings and community platforms resp. altmetrics indicators:

  • CitEc. The long-standing citation service maintained by the RePEc community provided a dump of both working papers (as part of series) and journal articles, the latter with significant information on classic impact measures such as the impact factor (2 and 5 years) and the h-index.
  • Rankings of journals in economics, including the Scimago Journal Rank (SJR) and two German journal rankings that are regularly released and updated (VHB Jourqual, Handelsblatt Ranking).
  • Usage data from Altmetric.com that we collected for those articles that could be identified via their Digital Object Identifier.
  • Usage data from the scientific community platform and reference manager Mendeley.com, in particular the number of saves or bookmarks on an individual paper.

Requirements

A major consideration for this project was finding an open environment in which to implement it. Finding an open platform to use served a few purposes. As a member of the "Leibniz Research Association", ZBW has a commitment to Open Science, and in part that means making use of open technologies to as great an extent as possible (The ZBW - Open Science Future). This open system should allow direct access to the underlying data so that users are able to use it for their own investigations and purposes. Additionally, if possible, the user should be able to manipulate the data within the system.
The first instance of the project was created in Tableau, which offers a variety of means to express data and create interfaces for the user to filter and manipulate data. It also can provide a way to work with the data and create visualizations without programming skills or knowledge. Tableau is one of the most popular tools to create and deliver data visualization, in particular within academic libraries (Murphy 2013). However, the software is proprietary and has a monthly fee to use and maintain, and it closes off the data, making only the final visualization available to users. It was able to provide a starting point for how we wanted the data to appear to the user, but it is in no way open.

Challenges

The first technical challenge was to consolidate the data from the different sources, which had varying formats and organizations. Broadly speaking, the bibliometric data (CitEc and journal rankings) existed as a spreadsheet with multiple pages, while the altmetrics and Mendeley data came from database dumps with multiple tables that were presented as several CSV files. In addition to these different formats, the data needed to be cleaned and gaps filled in. The sources also had very different scopes: the altmetrics and Mendeley data covered only 30 journals; the bibliometric data, on the other hand, covered more than 1,000 journals.
Transitioning from Tableau to an open platform was a big challenge. While there are many ways to create data visualizations and present them to users, the decision was made to use R to work with the data and Shiny to present it. R is used widely to work with data and to present it (Kläre 2017). The language has lots of support for these kinds of tasks across many libraries. The primary libraries used were R Plotly and R Shiny. Plotly is a popular library for creating interactive visualizations. Without too much work, Plotly can provide features including information popups while hovering over a chart and on-the-fly filtering. Shiny provides a framework to create a web application to present the data without requiring a lot of work to create HTML and CSS. The transition required time spent getting to know R and its libraries, to learn how to create the kinds of charts and filters that would be useful for users. While Shiny alleviates the need to create HTML and CSS, it does have a specific set of requirements and structures in order to function.
The final challenge was in making this project accessible to users such that they would be able to see what we had done, have access to the data, and have an environment in which they could explore the data without needing anything other than what we were providing. In order to achieve this we used Binder as the platform. At its most basic, Binder makes it possible to share a Jupyter Notebook stored in a GitHub repository with a URL, by running the Jupyter Notebook remotely and providing access through a browser with no requirements placed on the user. Additionally, Binder is able to run a web application using R and Shiny. To move from a locally running instance of R Shiny to one that can run in Binder, instructions for the runtime environment need to be created and added to the repository. These include information on what version of the language to use, which packages and libraries to install for the language, and any additional requirements there might be to run everything.

Solutions

Given the disparate sources and formats for the data, there was work that needed to be done to prepare it for visualization. The largest dataset, the bibliographic data, had several identifiers for each journal but no journal names. Having the journal names is important because, in general, the names are how users will know the journals. Adding the names to the data would allow users to filter on specific journals or pull up two journals for a comparison. Providing the names of the journals is also a benefit for anyone who may repurpose the data, and saves them from having to look the names up. In order to fill this gap, we used metadata available through Research Papers in Economics (RePEc). RePEc is an organization that seeks to "enhance the dissemination of research in Economics and related sciences". It contains metadata for more than 3 million papers available in different formats. The bibliographic data contained RePEc Handles, which we used to look up the journal information as XML and then parse the XML to find the title of the journal. After we wrote a small Python script to go through the RePEc data and fill in the missing names, only 6 journals’ names were still missing.
For the data that originated in a MySQL database, the major work that needed to be done was to correct the formatting. The data was provided as CSV files, but it was not formatted such that it could be used right away. Some of the fields had double quotation marks, and when the CSV file was created those quotes were wrapped in other quotation marks, resulting in doubled quotation marks which made machine parsing difficult without intervening directly on the files. The work was to go through the files and quickly remove the doubled quotation marks.
In addition to that, it was useful for some visualizations to provide a condensed version of the data. The data from the database was at the article level, which is useful for some things but could be time consuming for other actions. For example, the altmetrics data covered only 30 journals but had almost 14,000 rows. We could use the Python library pandas to go through all those rows and condense the data down so that there are only 30 rows, with the value of each column being the sum across that journal’s articles. In this way, there is a dataset that can be used to easily and quickly generate summaries at the journal level.
Shiny applications require a specific structure and files in order to do the work of creating HTML without needing to write the full HTML and CSS. At its most basic, there are two main parts to a Shiny application. The first defines the user interface (UI) of the page. It says what goes where, what kind of elements to include, and how things are labeled. This section defines what the user interacts with by creating inputs and also defining the layout of the output. The second part acts as a server that handles the computations and processing of the data that will be passed on to the UI for display. The two pieces work in tandem, passing information back and forth to create a visualization based on user input. Using Shiny allowed almost all of the time spent on creating the project to be concentrated on processing the data and creating the visualizations. The only difficulty in creating the frontend was making sure all the pieces of the UI and server were connected correctly.
Binder provided a solution for hosting the application, making the data available to users, and making it shareable, all in an open environment. Notebooks and applications hosted with Binder are shareable in part because the source is often a repository like GitHub. By passing a GitHub repository to Binder, say one that has a Jupyter Notebook in it, Binder will build a Docker image to run the notebook and then serve the result to the user without them needing to do anything. Out of the box, the Docker image will contain only the most basic functions. The result is that if a notebook requires a library that isn't standard, it won't be possible to run all of the code in the notebook. In order to address this, Binder allows for the inclusion in a repository of certain files that can define what extra elements should be included when building the Docker image. This can be very specific, such as what version of the language to use, and listing various libraries that should be included to ensure that the notebook can be run smoothly. Binder also has support for more advanced functionality in the Docker images, such as creating a Postgres database and loading it with data. These kinds of activities require using different hooks that Binder looks for during the creation of the Docker image to run scripts.
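As an illustration only (the exact files depend on the repository), one common way to declare such a runtime for an R/Shiny project on Binder is a conda environment.yml listing the R packages to install; the file name is a Binder convention, but the package pins below are assumptions rather than this project's actual configuration:

    # environment.yml — hypothetical Binder runtime for an R/Shiny app
    name: journal-map
    channels:
      - conda-forge
    dependencies:
      - r-base=4.0     # R version baked into the Docker image
      - r-shiny        # web application framework
      - r-plotly       # interactive visualizations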

Results and evaluation

The final product has three main sections that divide the data categorically into altmetrics, bibliometrics, and data from Mendeley. There are additionally some sections that exist as areas where something new could be tried out and refined without potentially causing issues with the three previously mentioned areas. Each section has visualizations that are based on the data available.
Considering the requirements for the project, the result goes a long way toward meeting them. The most apparent goal that the Journal Map succeeds at is presenting the data that we have collected. The application serves as a dashboard for the data that can be explored by changing filters and journal selections. By presenting the data as a dashboard, the barrier to entry for users to explore the data is low. However, there exists a way to access the data directly and perform new calculations, or create new visualizations. This can be done through the application's access to an R-Studio environment. Access to R-Studio provides two major features. First, it gives direct access to all the underlying code that creates the dashboard and the data used by it. Second, it provides an R terminal so that users can work with the data directly. In R-Studio, the user can also modify the existing files and then run them from R-Studio to see the results. Using Binder and R as the backend of the application allows us to provide users with different ways to access and work with data without any extra requirements on the part of the user. However, anything changed in R-Studio won't affect the dashboard view and won't persist between sessions. Changes exist only in the current session.
All the major pieces of this project were able to be done using open technologies: Binder to serve the application, R to write the code, and Github to host all the code. Using these technologies and leveraging their capabilities allows the project to support the Open Science paradigm that was part of the impetus for the project.
The biggest drawback to the current implementation is that Binder is a third-party host, and so there are certain things that are out of our control. For example, Binder can be slow to load: it takes on average a minute or more for the Docker image to load. There's not much, if anything, we can do to speed that up. The other issue is that if there is an update to the Binder source code that breaks something, then the application will be inaccessible until the issue is resolved.

Outlook and future work

The application, in its current state, has parts that are not finalized. As we receive feedback, we will make changes to the application to add or change visualizations. As mentioned previously, there are a few sections that were created to test different visualizations independently of the more complete sections; those can be finalized.
In the future it may be possible to move from BinderHub to a locally created and administered version of Binder. There is support and documentation for creating local, self-hosted instances of Binder. Going that direction would give more control, and may make it possible to get the Docker image to load more quickly.
While the application runs stand-alone, the data that is visualized may also be integrated in other contexts. One option we are already prototyping is integrating the data into our subject portal EconBiz, so users would be able to judge the scientific impact of an article in terms of both bibliometric and altmetric indicators.
 

References

  • William W. Hood, Concepcion S. Wilson. The Literature of Bibliometrics, Scientometrics, and Informetrics. Scientometrics 52, 291–314 Springer Science and Business Media LLC, 2001. Link

  • R. Schimmer. Disrupting the subscription journals’ business model for the necessary large-scale transformation to open access. (2015). Link

  • Mike Thelwall, Stefanie Haustein, Vincent Larivière, Cassidy R. Sugimoto. Do Altmetrics Work? Twitter and Ten Other Social Web Services. PLoS ONE 8, e64841 Public Library of Science (PLoS), 2013. Link

  • The ZBW - Open Science Future. Link

  • Sarah Anne Murphy. Data Visualization and Rapid Analytics: Applying Tableau Desktop to Support Library Decision-Making. Journal of Web Librarianship 7, 465–476 Informa UK Limited, 2013. Link

  • Christina Kläre, Timo Borst. Statistic packages and their use in research in Economics | EDaWaX - Blog of the project ’European Data Watch Extended’. EDaWaX - European Data Watch Extended (2017). Link

 

さようなら (Sayōnara) / HangingTogether

Maneki-neko, from Wikimedia Commons

This is my 116th—and last—blog post. I’m retiring at the end of November, something I’ve deferred as I’ve had such a great time hanging with all of you—staff at our Partner institutions, professionals from all corners of the library, archival, and information technology worlds, and my OCLC colleagues. But it’s time. Of all the ways I know to say “good-bye”, the Japanese sayonara is the most wistful: it literally means “if it must be so” (shorter than “parting is such sweet sorrow.”)

You’ve inspired me and taught me so much! I hope I’ve contributed meaningfully to the evolving discussions around metadata, linked data, and multilingual support to improve access to the information our communities want.

I am proud to have been part of the foundation of the Unicode Consortium. My work with the East Asian Character Set (Z39.64) proved that “Han Unification” was feasible— just as we have one code for the character “a” whether it’s used in English, French, German, Tagalog, Indonesian, etc. with different pronunciations, we can have one code for each Chinese (“Han”, or 漢) character common to Chinese, Japanese, and Korean. I advocated strongly for “Han Unification” and wrote a position paper on it in 1991. The Unicode Chronology highlights the stages of incorporating Han Unification into Unicode 1987-1992; Unicode became an international standard (ISO 10646) in 1993.

Unicode represented an “infrastructure revolution” for all of us. Those of you of a certain age may recall the days when there was a separate character set used in library systems, which included a range of diacritics to be used with other characters, often for use in transliterations of non-Latin scripts. But the data could only be shared and used by other library systems—if you copied/pasted data into another application it came out as gibberish. Non-Latin scripts were each defined by separate national character sets, and unless you used the same national character set, you could not read the text. Unicode, the result of a consortium including major computer corporations, software companies, and research institutions, changed all that. The scope included a far wider range of scripts than any other character set (the latest version includes over 140,000 characters). Because Unicode significantly decreased the costs of developing products for a global market, it was very quickly implemented in software applications. And Unicode included all the “combining diacritics” that libraries had used for decades.  We take our ability to read non-Latin scripts in different applications and on websites for granted now.

Library catalogs still do not yet take advantage of the full range of scripts available in Unicode, however. Library users who read languages written in non-Latin scripts should be able to search and retrieve the metadata describing the resources written in those languages using the metadata in that script. Unfortunately, many of these non-Latin script resources are represented in catalogs only by transliteration, a barrier to access. (See my 2015 blog post, “Transcription vs. Transliteration.”)  I’m pleased that OCLC has taken steps to remedy that situation, starting with the languages written in Cyrillic script. My colleagues Jenny Toves, Bryan Baldus, and Mary Haessig blogged about this work earlier this year in “кириллица в WorldCat”.

Soon after the adoption of Unicode and (to me) amazingly quick implementation, I started bringing together the managers of technical services to discuss common issues and identify work that was needed to guide future developments that would improve the metadata underpinning the discovery of all the resources curated and managed by libraries, archives, and other cultural heritage organizations.  Over the last 27 years this group evolved into the OCLC Research Library Partners Metadata Managers Focus Group, which at one point included representatives from 63 Partner institutions in 12 countries spanning four continents. It spawned six working groups or task forces focused on particular issues and published reports of their investigations, such as Registering Researchers in Authority Files in 2014 and Addressing the Challenges with Organizational Identifiers and ISNI in 2016. My meta-synthesis of the Focus Group’s discussions over the last six years was recently published as an OCLC Research Report, Transitioning to the Next Generation of Metadata. The recordings of our November 2020 discussions about the report are available as “past webinars” on the Works in Progress Webinars web page.

The Focus Group’s intense interest in who was implementing linked data and for what purposes led to a series of “International Linked Data Surveys for Implementers” I conducted between 2014 and 2018. A total of 143 institutions in 23 countries reported one or more linked data project or service. The results of these surveys are shared for the benefit of others wanting to undertake similar efforts on the OCLC Research Linked Data Survey web page.

One of the most rewarding highlights of my career was collaborating with my OCLC colleagues and OCLC members on “Project Passage,” a linked data Wikibase prototype which served as a sandbox in which librarians from 16 institutions could experiment with creating linked data to describe resources. The project was stimulating, educational, and fun! I enjoyed writing up what we learned with some of the participants in the 2019 report, Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage. This work generated another working group, Archives and Special Collections Linked Data Review Group, drawn from the OCLC Research Library Partnership’s rare book, archives, and special collections communities, which explored key issues of concern and opportunities for archives and special collections in transitioning to a linked data environment, summarized in the 2020 OCLC Research Report, Archives and Special Collections Linked Data: Navigating between Notes and Nodes.

I leave behind a set of publications and presentations. The relationships I’ve enjoyed with so many talented, inspiring staff within the OCLC Research Library Partnership I’ll treasure. I look forward to seeing what you all do in the coming years to leverage metadata and embed multilingualism into everything you do!

Sayōnara!

The post さようなら (Sayōnara) appeared first on Hanging Together.

Deep Dive: Moving ruby projects from Travis to Github Actions for CI / Jonathan Rochkind

So this is one of my super wordy posts, if that’s not your thing abort now, but some people like them. We’ll start with a bit of context, then get to some detailed looks at Github Actions features I used to replace my travis builds, with example config files and examination of options available.

For me, by “Continuous Integration” (CI), I mostly mean “Running automated tests automatically, on your code repo, as you develop”, on every PR and sometimes with scheduled runs. Other people may mean more expansive things by “CI”.

For a lot of us, our first experience with CI was when Travis-ci started to become well-known, maybe 8 years ago or so. Travis was free for open source, and so darn easy to set up and use — especially for Rails projects; it was a time when it still felt like most services focused on docs and a smooth fit for ruby and Rails specifically. I had heard of doing CI, but as a developer in a very small and non-profit shop, I wanted to spend time writing code, not setting up infrastructure, and would have had to get any for-cost service approved up the chain from our limited budget. But it felt like I could almost just flip a switch and have Travis working on ruby or rails projects — and for free!

Free for open source wasn’t entirely selfless, I think it’s part of what helped Travis literally define the market. (Btw, I think they were the first to invent the idea of a “badge” URL for a github readme?) Along with an amazing Developer UX (which is today still a paragon), it just gave you no reason not to use it. And then once using it, it started to seem insane to not have CI testing, nobody would ever again want to develop software without the build status on every PR before merge.

Travis really set a high bar for ease of use in a developer tool: you didn’t need to think about it much, it just did what you needed, and told you what you needed to know in its read-outs. I think it’s an impressive engineering product. But then.

End of an era

Travis will no longer be supporting open source projects with free CI.

The free open source travis projects originally ran on travis-ci.org, with paid commercial projects on travis-ci.com. In May 2018, they announced they’d be unifying these on travis-ci.com only, but with no announced plan that the policy for free open source would change. This migration seemed to proceed very slowly though.

Perhaps because it was part of preparing the company for a sale, in Jan 2019 it was announced private equity firm Idera had bought travis. At the time the announcement said “We will continue to maintain a free, hosted service for open source projects,” but knowing what “private equity” usually means, some were concerned for the future. (HN discussion).

While the FAQ on the migration to travis-ci.com still says that travis-ci.org should remain reliable until projects are fully migrated, in fact over the past few months travis-ci.org projects largely stopped building, as travis apparently significantly reduced resources on the platform. Some people began manually migrating their free open source projects to travis-ci.com where builds still worked. But, while the FAQ also still says “Will Travis CI be getting rid of free users? Travis CI will continue to offer a free tier for public or open-source repositories on travis-ci.com” — in fact, travis announced that they are ending the free service for open source. The “free tier” is a limited trial (available not just to open source), and when it expires, you can pay, or apply to a special program for an extension, over and over again.

They are contradicting themselves enough that, while I’m not sure exactly what is going to happen, I no longer trust them as a service.

Enter Github Actions

I work mostly on ruby and Rails projects. They are all open source, almost all of them use travis. So while (once moved to travis-ci.com) they are all currently working, it’s time to start moving them somewhere else, before I have dozens of projects with broken CI and still don’t know how to move them. And the new needs to be free — many of these projects are zero-budget old-school “volunteer” or “informal multi-institutional collaboration” open source.

There might be several other options, but the one I chose is Github Actions — my sense is that it has gotten mature enough to start approaching travis’s level of polish, all of my projects are github-hosted, and Github Actions is free for unlimited use for open source. (pricing page; Aug 2019 announcement of free for open source). And we are really fortunate that it became mature and stable in time for travis to withdraw open source support (if travis had been a year earlier, we’d be in trouble).

Github Actions is really powerful. It is built to do probably WAY MORE than travis does, definitely way beyond “automated testing”: various flows for deployment and artifact release, and really just about any kind of process for managing your project you want. The logic you can write is almost unlimited, all running on github’s machines.

As a result though…. I found it a bit overwhelming to get started. The Github Actions docs are just overwhelmingly abstract; there is so much there, you can do almost anything — but I don’t actually want to learn a new platform, I just want to get automated test CI for my ruby project working! There are some language/project-specific Guides available, for node.js, python, a few different Java setups — but not for ruby or Rails! My how Rails has fallen, from when most services like this would be focusing on Rails use cases first. :(

There are some third-party guides available that might focus on ruby/rails, but one of the problems is that Actions has been evolving for a few years with some pivots, so it’s easy to find outdated instructions. One orientation I found helpful was this Drifting Ruby screencast. This screencast showed me there is a kind of limited web UI with an integrated docs searcher — but I didn’t end up using it, I just created the text config file by hand, same as I would have for travis. Github provides templates for “ruby” or “ruby gem”, but the Drifting Ruby screencast said “these won’t really work for our ruby on rails application so we’ll have to set up one manually”, so that’s what I did too. ¯\_(ツ)_/¯

But the cost of all the power github Actions provides is… there are a lot more switches and dials to understand and get right (and maintain over time and across multiple projects). I’m not someone who likes copy-paste without understanding it, so I spent some time trying to understand the relevant options and alternatives; in the process I found some things I might have otherwise copy-pasted from other people’s examples that could be improved. So I give you the results of my investigations, to hopefully save you some time, if wordy comprehensive reports are up your alley.

A Simple Test Workflow: ruby gem, test with multiple ruby versions

Here’s a file for a fairly simple test workflow. You can see it’s in the repo at .github/workflows. The name of the file doesn’t matter — while this one is called ruby.yml, I’ve since moved over to naming the file to match the name: key in the workflow for easier traceability, so I would have called it ci.yml instead.

Triggers

You can see we say that this workflow should be run on any push to the master branch, and also for any pull_request at all. Many other examples I’ve seen define pull_request: branches: ["main"], which seems to mean only run on Pull Requests with main as the base. While that’s most of my PRs, if there is ever a PR that uses another branch as a base for whatever reason, I still want to run CI! While hypothetically you should be able to leave branches out to mean “any branch”, I only got it to work by explicitly saying branches: ["**"]
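For reference, a minimal sketch of what that trigger section can look like (the workflow name and branch name here are just illustrative):

    name: CI

    on:
      push:
        branches: [ master ]      # run on every push to master
      pull_request:
        branches: [ "**" ]        # run on PRs against any base branch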

Matrix

For this gem, we want to run CI on multiple ruby versions. You can see we define them here. This works similarly to travis matrixes. If you have more than one matrix variable defined, the workflow will run for every combination of variables (hence the name “matrix”).

      matrix:
        ruby: [ '2.4.4', '2.5.1', '2.6.1', '2.7.0', 'jruby-9.1.17.0', 'jruby-9.2.9.0' ]

In a given run, the current value of the matrix variables is available in the github actions “context”, which you can access as eg ${{ matrix.ruby }}. You can see how I use that in the name, so that the job will show up with its ruby version in it.

    name: Ruby ${{ matrix.ruby }}

Ruby install

While Github itself provides an action for ruby install, it seems most people are using this third-party action. Which we reference as `ruby/setup-ruby@v1`.

You can see we use the matrix.ruby context to tell the setup-ruby action what version of ruby to install, which works because our matrix values are the correct values recognized by the action. Which are documented in the README, but note that values like jruby-head are also supported.

Note, although it isn’t clearly documented, you can say 2.4 to mean “latest available 2.4.x” (rather than it meaning “2.4.0”), which is hugely useful, and I’ve switched to doing that. I don’t believe that was available via travis/rvm ruby install feature.

For a project that isn’t testing under multiple rubies, if we left out the with: ruby-version, the action will conveniently use a .ruby-version file present in the repo.

Note you don’t need to put a gem install bundler into your workflow yourself. While I’m not sure it’s clearly documented, I found the ruby/setup-ruby action would do this for you (installing the latest available bundler, instead of using whatever was packaged with your ruby version), regardless of whether you are using the bundler-cache feature (see below).
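Putting those pieces together, a minimal sketch of the job’s steps might look something like this (the checkout step, bundle install flags, and rspec test command are my own assumptions for illustration, not copied from the project):

    steps:
      - uses: actions/checkout@v2        # check out the repo being tested
      - uses: ruby/setup-ruby@v1         # install the matrix's ruby version
        with:
          ruby-version: ${{ matrix.ruby }}
      - name: Bundle install
        run: bundle install --jobs 4 --retry 3
      - name: Run tests
        run: bundle exec rspec           # assumes an rspec test suite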

Note on How Matrix Jobs Show Up to Github

With travis, testing for multiple ruby or rails versions with a matrix, we got one (or, well, actually two) jobs showing up on the Github PR:

Each of those lines summarizes a collection of matrix jobs (eg different ruby versions). If any of the individual jobs within the matrix failed, the whole build would show up as failed. Success or failure, you could click on “Details” to see each job and its status:

I thought this worked pretty well — especially for “green” builds I really don’t need to see the details on the PR, the summary is great, and if I want to see the details I can click through, great.

With Github Actions, each matrix job shows up directly on the PR. If you have a large matrix, it can be… a lot. Some of my projects have way more than 6. On PR:

Maybe it’s just because I was used to it, but I preferred the Travis way. (This also makes me think maybe I should change the name key in my workflow to say eg CI: Ruby 2.4.4 to be more clear? Oops, tried that, it just looks even weirder in other GH contexts, not sure.)

Oh, also, that travis way of doing the build twice, once for “pr” and once for “push”? Github Actions doesn’t seem to do that, it just does one, I think corresponding to travis “push”. While the travis feature seemed technically smart, I’m not sure I ever actually saw one of these builds pass while the other failed in any of my projects, I probably won’t miss it.

Badge

Did you have a README badge for travis? Don’t forget to swap it for equivalent in Github Actions.

The image url looks like: https://github.com/$OWNER/$REPOSITORY/workflows/$WORKFLOW_NAME/badge.svg?branch=master, where $WORKFLOW_NAME of course has to be URL-escaped if it contains spaces etc.

The github page at https://github.com/owner/repo/actions, if you select a particular workflow/branch, does, like travis, give you a badge URL/markdown you can copy/paste if you click on the three-dots and then “Create status badge”. Unlike travis, what it gives you to copy/paste is just image markdown, it doesn’t include a link.

But I definitely want the badge to link to viewing the results of the last build in the UI. So I do it manually. Limit to the specific workflow and branch that you made the badge for in the UI, then just copy and paste the URL from the browser. It’s a bit of confusing markdown to construct manually; here’s what it ended up looking like for me:

I copy and paste that from an existing project when I need it in a new one. :shrug:

Require CI to merge PR?

However, that difference in how jobs show up to Github, the way each matrix job shows up separately now, has an even more negative impact on requiring CI success to merge a PR.

If you want to require that CI passes before merging a PR, you configure that at https://github.com/acct/project/settings/branches under “Branch protection rules”. When you click “Add Rule”, you can/must choose WHICH jobs are “required”.

For travis, that’d be those two “master” jobs, but for the new system, every matrix job shows up separately — in fact, if you’ve been messing with job names trying to get it right as I have, you’ll see any job name that was ever used in the last 7 days, and they don’t have the Github workflow name appended to them or anything (another reason to put the github workflow name in the job name?).

But the really problematic part is that if you edit your list of jobs in the matrix — adding or removing ruby versions as one does, or even just changing the name that shows up for a job — you have to go back to this screen to add or remove jobs as a “required status check”.

That seems really unworkable to me, I’m not sure how it hasn’t been a major problem already for users. It would be better if we could configure “all the checks in the WORKFLOW, whatever they may be”, or perhaps best of all if we could configure a check as required in the workflow YML file, the same place we’re defining it, just a required_before_merge key you could set to true or use a matrix context to define or whatever.

I’m currently not requiring status checks for merge on most of my projects (even though i did with travis), because I was finding it unmanageable to keep the job names sync’d, especially as I get used to Github Actions and kept tweaking things in a way that would change job names. So that’s a bit annoying.

fail-fast: false

By default, if one of the matrix jobs fails, Github Actions will cancel all remaining jobs, and not bother to run them at all. After all, you know the build is going to fail if one job fails, so what do you need those others for?

Well, for my use case, it is pretty annoying to be told, say, “Job for ruby 2.7.0 failed, we can’t tell you whether the other ruby versions would have passed or failed or not” — the first thing I want to know is if failed on all ruby versions or just 2.7.0, so now I’d have to spend extra time figuring that out manually? No thanks.

So I set `fail-fast: false` on all of my workflows, to disable this behavior.
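In the workflow file, that setting lives under the job’s strategy block, next to the matrix; a minimal sketch (the ruby versions here are just placeholders):

    strategy:
      fail-fast: false       # run every matrix job even if one of them fails
      matrix:
        ruby: [ '2.5', '2.6', '2.7' ]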

Note that travis had a similar (opt-in) fast_finish feature, which worked subtly differently: Travis would report failure to Github on the first failure (and notify, I think), but would actually keep running all jobs. So when I saw a failure, I could click through to ‘details’ to see which (eg) ruby versions passed, from the whole matrix. This did work for me, so I’d chosen to opt in to that travis feature. Unfortunately, the Github Actions subtle difference in effect makes it not desirable to me.

Note: You may see some people referencing a Github Actions continue-on-error feature. I found the docs confusing, but after experimentation what this really does is mark a job as successful even when it fails. It shows up in all GH UI as succeeded even when it failed; the only way to know it failed would be to click through to the actual build log to see the failure in the logged console. I think “continue on error” is a weird name for this; it is not useful to me with regard to fine-tuning fail-fast, or honestly in any other use case I have.

Bundle cache?

bundle install can take 60+ seconds, and be a significant drag on your build (not to mention a lot of load on rubygems servers from all these builds). So when travis introduced a feature to cache: bundler: true, it was very popular.

True to form, Github Actions gives you a generic caching feature you can try to configure for your particular case (npm, bundler, whatever), instead of an out-of-the-box “just do the right thing for bundler” feature — you figure it out.

The ruby/setup-ruby third-party action has a built-in feature to cache bundler installs for you, but I found that it does not work right if you do not have a Gemfile.lock checked into the repo (ie, for most any gem, rather than app, project). It will end up re-using cached dependencies even if there are new releases of some of your dependencies, which is a big problem for how I use CI for a gem — I expect it to always be building with the latest releases of dependencies, so I can find out if one breaks the build. This may get fixed in the action.

If you have an app (rather than gem) with a Gemfile.lock checked into repo, the bundler-cache: true feature should be just fine.
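For the app case, that option is just another input to the setup-ruby action; a minimal sketch (the version number is illustrative):

      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: 2.7
          bundler-cache: true   # runs bundle install and caches gems, keyed on Gemfile.lock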

Otherwise, Github has some suggestions for using the generic cache feature for ruby bundler (search for “ruby – bundler” on this page) — but I actually don’t believe they will work right without a Gemfile.lock checked into the repo either.

Starting from that example, and using the restore-keys feature, I think it should be possible to design a use that works much like travis’s bundler cache did, and works fine without a checked-in Gemfile.lock. We’d want it to use a cache from the most recent previous (similar job), and then run bundle install anyway, and then cache the results again at the end always to be available for the next run.

But I haven’t had time to work that out, so for now my gem builds are simply not using bundler caching. (my gem builds tend to take around 60 seconds to do a bundle install, so that’s in every build now, could be worse).
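If you want to experiment with that idea yourself, a rough, untested sketch of the sort of thing I have in mind (the action version, install path, and cache keys are all assumptions) might look like:

      - uses: actions/cache@v2
        with:
          path: vendor/bundle
          # unique key per run, so a fresh cache is saved at the end of every build...
          key: bundle-${{ runner.os }}-${{ matrix.ruby }}-${{ github.run_id }}
          # ...while restore-keys pulls in the most recent previous cache as a starting point
          restore-keys: |
            bundle-${{ runner.os }}-${{ matrix.ruby }}-
      - name: Bundle install
        run: |
          bundle config set path vendor/bundle
          bundle install --jobs 4 --retry 3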

Notifications: Not great

Travis has really nice defaults for notifications: The person submitting the PR would get an email generally only on status changes (from pass to fail or fail to pass) rather than on every build. And travis would even figure out what email to send to based on what email you used in your git commits. (Originally perhaps a workaround to lack of Github API at travis’ origin, I found it a nice feature). And then travis has sophisticated notification customization available on a per-repo basis.

Github notifications are unfortunately much more basic and limited. The only notification settings available are for your entire account, at https://github.com/settings/notifications, under “GitHub Actions”. So they apply to all github workflows in all projects; there are no workflow- or project-specific settings. You can set it to receive notifications via web push or email or both or neither. You can receive notifications for all builds or only failed builds. That’s it.

The author of a PR is the one who receives the notifications, same as in travis. You will get notifications for every single build, even repeated successes or failures in a series.

I’m not super happy with the notification options. I may end up just turning off Github Actions notifications entirely for my account.

Hypothetically, someone could probably write a custom Github action to give you notifications exactly how travis offered — after all, travis was using a public GH API that should be available to any other author, and I think should be usable from within an action. But when I started to think through it, while it seemed an interesting project, I realized it was definitely beyond the “spare hobby time” I was inclined to give to it at present, especially not being much of a JS developer (the language of custom GH actions, generally). (While you can list third-party actions on the github “marketplace”, I don’t think there’s a way to charge for them.)

There are custom third-party actions available to do things like notify slack for build completion; I haven’t looked too much into any of them, beyond seeing that I didn’t see any that would be “like travis defaults”.

A more complicated gem: postgres, and Rails matrix

Let’s move to a different example workflow file, in a different gem. You can see I called this one ci.yml, matching its name: CI, to have less friction for a developer (including future me) trying to figure out what’s going on.

This gem does have rails as a dependency and does test against it, but isn’t actually a Rails engine as it happens. It also needs to test against Postgres, not just sqlite3.

Scheduled Builds

At one point travis introduced a feature for scheduling (eg) weekly builds even when no PR/commit had been made. I enthusiastically adopted this for my gem projects. Why?

Gem releases are meant to work on a variety of different ruby versions and different exact versions of dependencies (including Rails). Sometimes a new release of ruby or rails will break the build, and you want to know about that and fix it. With CI builds happening only on new code, you find out about this with some random new code that is unlikely to be related to the failure; and you only find out about it on the next “new code” that triggers a build after a dependency release, which on some mature and stable gems could be a long time after the actual dependency release that broke it.

So scheduled builds for gems! (I have no purpose for scheduled test runs on apps).

Github Actions does have this feature. Hooray. One problem is that you will receive no notification of the result of the scheduled build, success or failure. :( I suppose you could include a third-party action to notify a fixed email address or Slack or something else; I’m not sure how you’d configure that to apply only to the scheduled builds and not the commit/PR-triggered builds, if that’s what you wanted. (Or make a custom action to file a GH issue on failure??? But make sure it doesn’t spam you with issues on repeated failures.) I haven’t had the time to investigate this yet.
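The trigger itself is just another entry in the workflow’s on: block; a minimal sketch (the cron expression here, a weekly Monday-morning run, is only an example):

    on:
      schedule:
        - cron: '0 6 * * 1'   # every Monday at 06:00 UTC, alongside the usual push/pull_request triggers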

Also oops just noticed this: “In a public repository, scheduled workflows are automatically disabled when no repository activity has occurred in 60 days.” Which poses some challenges for relying on scheduled builds to make sure a stable slow-moving gem isn’t broken by dependency updates. I definitely am committer on gems that are still in wide use and can go 6-12+ months without a commit, because they are mature/done.

I still have it configured in my workflow; I guess even without notifications it will affect the “badge” on the README, and… maybe I’ll notice? Very far from ideal, work in progress. :(

Rails Matrix

OK, this one needs to test against various ruby versions AND various Rails versions. A while ago I realized that an actual matrix of every ruby combined with every rails was far too many builds. Fortunately, Github Actions supports the same kind of matrix/include syntax as travis, which I use.

     matrix:
        include:
          - gemfile: rails_5_0
            ruby: 2.4

          - gemfile: rails_6_0
            ruby: 2.7

I use the appraisal gem to handle setting up testing under multiple rails versions, which I highly recommend. You could use it for testing variant versions of any dependencies, I use it mostly for varying Rails. Appraisal results in a separate Gemfile committed to your repo for each (in my case) rails version, eg ./gemfiles/rails_5_0.gemfile. So those values I use for my gemfile matrix key are actually portions of the Gemfile path I’m going to want to use for each job.

Then we just need to tell bundler, in a given matrix job, to use the gemfile we specified in the matrix. The old-school way to do this is with the BUNDLE_GEMFILE environment variable, but I found it error-prone to make sure it stayed consistently set in each workflow step. I found that the newer (although not that new!) bundle config set gemfile worked swimmingly! I just set it before the bundle install, and it stays set for the rest of the run, including the actual test run.

steps:
    # [...]
    - name: Bundle install
      run: |
        bundle config set gemfile "${GITHUB_WORKSPACE}/gemfiles/${{ matrix.gemfile }}.gemfile"
        bundle install --jobs 4 --retry 3

Note that single braces are used for ordinary bash syntax to reference the ENV variable ${GITHUB_WORKSPACE}, but double braces for the github actions context value interpolation ${{ matrix.gemfile }}.

Works great! Oh, note how we set the name of the job to include both the ruby and rails matrix values, important for it showing up legibly in the Github UI: name: ${{ matrix.gemfile }}, ruby ${{ matrix.ruby }}. Because of how we constructed our gemfile matrix, that shows up with job names like rails_5_0, ruby 2.4.
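Putting those pieces together, the top of the job ends up looking roughly like this sketch (the runner image, the fail-fast setting, and the checkout/setup-ruby steps are my illustrative choices here, not necessarily exactly what this project’s workflow does):

jobs:
  tests:
    runs-on: ubuntu-latest
    name: ${{ matrix.gemfile }}, ruby ${{ matrix.ruby }}
    strategy:
      # don't cancel the other matrix jobs when one of them fails
      fail-fast: false
      matrix:
        include:
          - gemfile: rails_5_0
            ruby: 2.4
          - gemfile: rails_6_0
            ruby: 2.7
    steps:
      - uses: actions/checkout@v2
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: ${{ matrix.ruby }}
      # ... then the "Bundle install" step shown above, and the actual test run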

Still not using bundler caching in this workflow. As before, we’re concerned about the ruby/setup-ruby built-in bundler-cache feature not working as desired without a Gemfile.lock in the repo. This time, I’m also not sure how to get that feature to play nicely with the variant gemfiles and bundle config set gemfile. Github Actions makes you put a lot more pieces together yourself compared to travis; there are still things I just postponed figuring out for now.

Postgres

This project needs to build against a real postgres. That is relatively easy to set up in Github Actions.

Postgres normally allows connections on localhost without a username/password set, and my past builds (on travis or locally) took advantage of this by not bothering to set one, so the app didn’t have to know about it. But the postgres image used for Github Actions doesn’t allow this; you have to set a username/password. So the section of the workflow that sets up postgres looks like:

jobs:
   tests:
     services:
       db:
         image: postgres:9.4
         env:
           POSTGRES_USER: postgres
           POSTGRES_PASSWORD: postgres
         ports: ['5432:5432']

5432 is the default postgres port; we need to set and map it so it will be available as expected. Note you can also specify whatever version of postgres you want; this one is intentionally testing against a somewhat old version.

OK, now our Rails app that will be executed under rspec needs to know that username and password to use in its postgres connection, where before it connected without one. The env under the postgres service image is not actually available to the job steps. I didn’t find any way to DRY the username/password in one place; I had to repeat it in another env block, which I put at the top level of the workflow so it would apply to all steps.
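The top-level env block then looks something like this (the variable names here are illustrative; they just need to match whatever your database.yml reads):

# top level of the workflow file, a sibling of jobs:, so it applies to every step
env:
  DATABASE_USERNAME: postgres
  DATABASE_PASSWORD: postgres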

And then I had to alter my database.yml to use those ENV variables, in the test environment. On a local dev machine, if your postgres doesn’t have a username/password requirement and you don’t set the ENV variables, it keeps working as before.

I also needed to add host: localhost to the database.yml; before, the absence of the host key meant it used a unix-domain socket (filesystem-located) to connect to postgres, but that won’t work in the Github Actions containerized environment.
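Putting that together, a sketch of the test section of database.yml, assuming the illustrative ENV variable names from the env block above (the database name is also just an example):

test:
  adapter: postgresql
  database: my_app_test
  host: localhost
  # if these ENV variables aren't set (eg on a local dev machine), they come
  # through as nil and postgres connects without a username/password as before
  username: <%= ENV["DATABASE_USERNAME"] %>
  password: <%= ENV["DATABASE_PASSWORD"] %>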

Note, there are things you might see in other examples that I don’t believe you need:

  • No need for an apt-get of pg dev libraries. I think everything you need is on the default GH Actions images now.
  • Some examples I’ve seen do a thing with options: --health-cmd pg_isready, my builds seem to be working just fine without it, and less code is less code to maintain.

allow_failures

In travis, I took advantage of the travis allow_failures key in most of my gems.

Why? I am testing against various ruby and Rails versions; I want to test against *future* (pre-release, edge) ruby and rails versions, because it’s useful to know if I’m already passing on them with no effort, and I’d like to keep passing on them — but I don’t want to mandate it, or prevent PR merges if the build fails on a pre-release dependency. (After all, it could very well be a bug in the dependency too!)

There is no great equivalent to allow_failures in Github Actions. (Note again, continue-on-error just makes failed jobs look identical to successful jobs, and isn’t very helpful here).

I investigated some alternatives, which I may go into more detail on in a future post, but on one project I am trying a separate workflow just for “future ruby/rails allowed failures”, which only checks master commits (not PRs), and has a separate badge on the README (which is actually pretty nice for advertising to potential users: “Yeah, we ALREADY work on rails edge/6.1.rc1!”). The main downside is having to copy/paste-synchronize what’s really the same workflow across two files.
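For what it’s worth, the interesting difference in that second workflow is the trigger; a sketch of what it might look like (the file name, gemfile, and ruby values here are illustrative):

# .github/workflows/future.yml -- the "allowed failures" workflow
name: future
on:
  push:
    branches: [master]   # only master commits, not PRs
jobs:
  tests:
    runs-on: ubuntu-latest
    name: ${{ matrix.gemfile }}, ruby ${{ matrix.ruby }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - gemfile: rails_edge
            ruby: head
    steps:
      # same checkout / setup-ruby / bundle install / test steps as the main ci.yml
      - uses: actions/checkout@v2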

A Rails app

More of the projects I’m a committer on are gems, but I spend more of my time on apps, one app in particular.

So here’s an example Github Actions CI workflow for a Rails app.

It mostly remixes the features we’ve already seen. It doesn’t need any matrix. It does need a postgres.

It does need some “OS-level” dependencies — the app does some shell-out to media utilities like vips and ffmpeg, and there are integration tests that utilize this. Easy enough to just install those with apt-get, works swimmingly.

        - name: Install apt dependencies
          run: |
            sudo apt-get -y install libvips-tools ffmpeg mediainfo

In addition to the bundle install, a modern Rails app using webpacker needs yarn install. This just worked for me — no need to include lines for installing npm itself or yarn or any yarn dependencies, although some examples I find online have them. (My yarn installs seem to happen in ~20 seconds, so I’m not motivated to try to figure out caching for yarn).

And we need to create the test database in postgres, which I do with RAILS_ENV=test bundle exec rails db:create — typical Rails test setup will then automatically run migrations if needed. There might be other (better?) ways to prep the database, but I was having trouble getting rake db:prepare to work, and didn’t spend the time to debug it; I just went with something that worked.

    - name: Set up app
       run: |
         RAILS_ENV=test bundle exec rails db:create
         yarn install

Rails test setup usually ends up running migrations automatically, which is why I think this worked on its own, but you could also throw in a RAILS_ENV=test bundle exec rake db:schema:load if you wanted.

Under travis I had to install chrome with addons: chrome: stable to have it available to use with capybara via the webdrivers gem. No need for installing chrome in Github Actions, some (recent-ish?) version of it is already there as part of the standard Github Actions build image.

In this workflow, you can also see a custom use of the github “cache” action to cache a Solr install that the test setup automatically downloads and sets up. In this case the cache doesn’t actually save us any build time, but it is kinder on the Apache Foundation servers we would otherwise download from with every build (and have gotten throttled by in the past).
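The cache action usage is roughly like this sketch (the path and cache key are illustrative; they would need to match wherever your test setup downloads Solr to, and which Solr version):

- name: Cache solr download
  uses: actions/cache@v2
  with:
    # keep the downloaded solr distribution between builds so we don't
    # re-download it from the apache mirrors on every run
    path: tmp/solr_dist
    key: solr-8.6.2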

Conclusion

Github Actions is a really impressively powerful product. And it’s totally going to work to replace travis for me.

It’s also probably going to take more of my time to maintain. The trade-off of more power/flexibility and focusing on almost limitless use cases is that there are more things the individual project has to get right for its use case. For instance, figuring out the right configuration to get caching for bundler or yarn right, instead of just writing cache: { yarn: true, bundler: true }. And when you have to figure it out yourself, you can get it wrong, which, when you are working on many projects at once, means you have a bunch of places to fix.

The amazing third-party action “marketplace” also means you have to figure out the right action to use (the third-party ruby/setup-ruby instead of the vendor’s actions/setup-ruby), and again, if you change your mind about that, you have a bunch of projects to update.

Anyway, it is what it is — and I’m grateful to have such a powerful and in fact relatively easy to use service available for free! I could not really live without CI anymore, and won’t have to!

Oh, and Github Actions is giving me way more (free) simultaneous parallel workers than travis ever did, for my many-job builds!

NDSA Announces Winners of 2020 Innovation Awards / Digital Library Federation

The NDSA established its Innovation Awards in 2012 to recognize and encourage innovation in the field of digital stewardship.  Since then, it has honored 39 exemplary educators, future stewards, individuals, institutions, and projects for their efforts in ensuring the ongoing viability and accessibility of our valuable digital heritage. The 2020 NDSA Innovation Awards are generously sponsored by Digital Bedrock.

Today, NDSA adds 8 new awardees to that honor roll during the opening plenary ceremony of the 2020 NDSA Digital Preservation Conference.   These winners were selected from the largest pool of nominees so far in the Awards’ history: 32 nominations of 30 nominees.  While the pool size made the judging more difficult, the greater breadth, depth, and quality of the nominations is a positive sign for the preservation community, as it is indicative of the growing maturity and robustness of the field.  This year’s awardees continue to reflect a recent trend towards an increasingly international perspective and recognition of the innovative contributions by and for historically underrepresented and marginalized communities. 

Please help us congratulate these awardees!  We encourage you to follow up and learn more about their activities and the ways in which they have had a profound beneficial impact on our collective ability to protect and make accessible our valuable digital heritage.

Educators are recognized for innovative approaches and access to digital preservation through academic programs, partnerships, professional development opportunities, and curriculum development. 

This year’s awardees in the Educators category are:

Library Juice Academy Certificate in Digital Curation.  This program, launched in 2019, encompasses a six-course sequence for library, archives and museum practitioners wanting to learn more about and expand their skill sets for curating and maintaining unique digital assets. The curriculum offers comprehensive coverage of collection development and appraisal, description, rights and access, digital preservation, and professional ethics and responsible stewardship.  The program’s affordability, flexible scheduling, and online pedagogy encouraging engaged collaborative learning provides a unique opportunity for professional development and continuing education.  In particular, the emphasis placed on ethics and sustainability provides an appropriate counterpoint to other more technically-focused topics, drawing needed attention to critical issues of policy, finance, equity, and diversity.

Library Juice Academy Logo

International Council on Archives (ICA) Africa Programme Digital Records Curation Programme.  The Programme supports the professional development of new generations of digital archivists and records managers in Africa, a geographic and cultural region historically marginalized and underrepresented in international digital stewardship discourse, practice, and education. The Programme’s volunteer-taught study school uses open access readings and open source tools to minimize technical resource and financial impediments to participation, and to encourage creative repurposing of pedagogic materials in the participants’ local contexts.  The Programme also provides financial support for early-career practitioners and educators across the African continent to attend and learn, share their own teaching techniques and insights, and to build a professional research and teaching network.  Parallel instructional opportunities are offered for Anglophone and Francophone participants.  With a focus on “training the trainers”, the Digital Records Curation Programme promotes the development of maturing cohorts of stewardship practitioners and the growing professionalism of digital preservation activities focused on long-term stewardship of Africa’s vital digital heritage.

DRCP participants at the Botswana Study School. From left to right: Forget Chaterera-Zambuko (Zimbabwe), Vusi Tsabedze (Eswatini), Alina Karlos (Namibia), Abel M’kulama (Zambia), Tshepho Mosweu (Botswana), Umaru Bangura (Sierra Leone), Said Hassan (Tanzania), Ayodele John Alonge (Nigeria), Juliet Erima (Kenya). Seated: Thatayaone Segaetsho (Botswana), Makulta Mojapelo (South Africa)

 

Future Stewards are recognized as students and early-career professionals or academics taking a creative approach to advancing knowledge of digital preservation issues and practices. 

This year’s awardees in the Future Stewards category are:

Photo of Sawood Alam

Sawood Alam.  A PhD candidate at Old Dominion University, Sawood has been an active participant in the digital preservation community via the International Internet Preservation Consortium, the ACM/IEEE Joint Conference on Digital Libraries, and other communities for years, presenting and reporting on complex topics like the holdings of web archives, decentralized systems, archival fixity, web packaging, and more. As a developer and systems architect, Sawood is a strong advocate for open-source and open-access tools, and has offered courses and lectures on technologies like Linux, Python, Ruby on Rails, and more. A mentor to new graduate students and researchers, Sawood will join the Internet Archive after graduation, leveraging his engineering and academic experience to perform outreach to research groups interested in making use of the Wayback Machine’s holdings.

 

 

Carolina Quezada Meneses

Carolina Quezada Meneses.  As an intern, Carolina worked on a variety of projects, ranging from exploring new tools and software that help preserve, manage, and provide access to born-digital material, to helping develop a remote processing workflow that enabled University of California, Irvine (UCI) staff to work on the organization’s digital backlog while working from home during the Coronavirus pandemic. 

However, it is Meneses’s work with the Christine Tamblyn papers — which included numerous Macintosh-formatted floppy disks and CD-ROMs — that deserves additional praise: faced with ample technical challenges to providing access, Quezada created disk images of the floppy disks and CD-ROMs with specialized hardware, found a compatible emulator, and created screencast videos of the artwork, making the content accessible to a broader audience than traditional on-site access would typically allow.  Thanks to Meneses’s innovative thinking, a collection that had no prior level of access for 22 years is now accessible to researchers, and remains an example of her lasting dedication to providing access to born-digital formats.

 

Organizations are recognized for innovative approaches to providing support and guidance to the digital preservation community.  This year’s awardee in the Organizations category is:

National Archives and Records Administration (NARA).  NARA has a notable history of providing records management guidance focusing on digital preservation and addressing key factors to the successful permanent preservation of digital content. This year, the panel is pleased to distinguish NARA’s Digital Preservation Framework. Created after an extensive environmental scan of community digital preservation risk assessment and planning resources, this project recognizes that successful digital preservation requires both understanding the risks posed by file formats and identifying or developing processes for mitigating these risks. In response to this, the Framework provides extensive risk and planning analysis for over 500 formats in 16 type categories. The Framework can be applied across the lifecycle of digital content and is designed to enable a low-barrier to use, regardless of an organization’s current digital preservation practices or infrastructure. This information – officially released on GitHub in June of 2020 – is a vital tool of great, if not critical, utility to international stewardship programs and practitioners.

NARA Preservation Framework project team: (top, left to right) Leslie Johnston, Elizabeth England, Brett Abrams; (middle) Jana Leighton, Criss Austin, Dara Baker; (bottom) Meg Guthorn, Andrea Riley, Michael Horsley

 

Projects are recognized for activities whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship. 

This year’s awardees in the Projects category are:

  • DLF Levels of Born-Digital Access (LDBA).  Preservation and access are often viewed as two disparate concerns and activities, when in fact they are necessary complements.  Despite the central role that access plays in digital preservation, little agreement exists about what access to digital material should look like or how it might be implemented from institution to institution.  The Levels of Born-Digital Access, created by the DLF Born-Digital Archives Working Group (BDAWG), sought to address and fill this gap.  This instrument was developed through an iterative and inter-institutional collaborative effort. It delineates a tiered set of format-agnostic recommendations applicable for internal or external assessment and planning of enhancements to capabilities and capacities. This document is responsive to both practitioners’ and researchers’ needs, while also serving as a potential model for future standards development.  The work of the LDBA is important in highlighting the critical role access plays in any effective long-term stewardship program.
Levels of Born Digital Access Grid Screenshot
  • Project Electron.  A multi-year initiative at the Rockefeller Archive Center to implement sustainable, user-centered, and standards-compliant infrastructure to support the ongoing acquisition, management, and preservation of archival digital records.  The project includes a digital records transfer pipeline called Aurora, as well as a transfer specification and integrations with existing archival systems for accessioning, digital preservation, and description.  The awards panel was particularly impressed by the Project’s comprehensive adaptation and extension of traditional archival principles and workflows to digital materials.  The panel also recognizes the positioning of this initiative as an open-source and standards-based effort, maximizing opportunities for its transferability to other programmatic contexts.  Many archival institutions face significant challenges in supporting digitized and born-digital records and special collections.  The work of Project Electron provides an important exemplar for effective and sustainable digital archival handling.
Project Electron Logo

 

  • Tribesourcing Southwest Film Project.  The Tribesourcing project aims to preserve — in a culturally appropriate way — a digitized collection of non-fiction films that document Native cultures across North and South America.  Many of these films contain beautiful and valuable images; however, the original narrations are often insensitive and racist.  The project invites Native community members to record new, culturally-competent narrations in indigenous or European languages as alternate audio tracks for the films.  This process, which project lead Jennifer Jenkins has termed “tribesourcing,” has the double benefit of repatriating historic images and decolonizing these archival films.  By including Native language narrations, the project also creates a digital repository for language preservation tied to films about culture and lifeways.  These narrations are recorded and presented online using accessible and open source tools.  The Tribesourcing project models an innovative solution to the question of integrating ethics and cultural competencies in digital preservation work. 
Tribesourcing Website Screenshot

 

~ The NDSA Innovation Awards Working Group

  • Samantha Abrams (Ivy Plus Libraries Confederation)
  • Stephen Abrams (Harvard University; co-chair)
  • Lauren Goodley (Texas State University)
  • Grete Graf (Yale University)
  • Kari May (University of Pittsburgh)
  • Krista Oldham (Clemson University; co-chair)

The post NDSA Announces Winners of 2020 Innovation Awards appeared first on DLF.

Even More On The Ad Bubble / David Rosenthal

I've been writing for some time about the hype around online advertising. There's a lot of evidence that it is ineffective. Recently, the UK's Information Commissioner's Office concluded an investigation into Cambridge Analytica's involvement in the 2016 US election and the Brexit referendum. At The Register, Shaun Nichols summarizes their conclusions in UK privacy watchdog wraps up probe into Cambridge Analytica and... it was all a little bit overblown, no?:
El Reg has heard on good authority from sources in British political circles that Cambridge Analytica's advertised powers of online suggestion were rather overblown and in fact mostly useless. In the end, it was skewered by its own hype, accused of tangibly influencing the Brexit and presidential votes on behalf of political parties and campaigners using its Facebook data. Yet, no evidence, according to the ICO, could be found supporting those specific claims.
Below the fold I look at this, a recent book on the topic, and other evidence that has emerged since I wrote Contextual vs. Behavioral Advertising.

The ICO's conclusions are summarized in a letter to the chair of the relevant Committee of Parliament:
SCL’s own marketing material claimed they had "Over 5,000 data points per individual on 230 million adult Americans." However, based on what we found it appears that this may have been an exaggeration.
...
while the models showed some success in correctly predicting attributes on individuals whose data was used in the training of the model, the real-world accuracy of these predictions – when used on new individuals whose data had not been used in the generating of the models – was likely much lower. Through the ICO’s analysis of internal company communications, the investigation identified there was a degree of scepticism within SCL as to the accuracy or reliability of the processing being undertaken. There appeared to be concern internally about the external messaging when set against the reality of their processing.
See also Thom Dunn's UK ICO report on Cambridge Analytica finds no illegal activity or Russian involvement and Izabella Kaminska's ICO's final report into Cambridge Analytica invites regulatory questions.

But these conclusions haven't stopped Cambridge Analytica's successors hyping their wares. Alex Pasternak's This data expert helped Trump win. Now he’s built a machine to take him down reports on a 2020 version:
The goal is to use microtargeted ads, follow-up surveys, and an unparalleled data set to win over key electorates in a few critical states: the low-education voters who unexpectedly came out in droves or stayed home last time, the voters who could decide another monumental election.
...
“We’ve been able to really understand how to communicate with folks who have lower levels of political knowledge, who tend to be ignored by the political process,” says James Barnes, a data and ads expert at the all-digital progressive nonprofit Acronym, who helped build Barometer. This is familiar territory: Barnes spent years on Facebook’s ads team, and in 2016 was the “embed” who helped the Trump campaign take Facebook by storm. Last year, he left Facebook and resolved to use his battle-tested tactics to take down his former client.
...
Acronym was first out of the gate, and is thought to be the Democrats’ most advanced digital advertising project. By the election it promises to have spent $75 million on Facebook, Google, Instagram, Snapchat, Hulu, Roku, Viacom, Pandora, and anywhere else valuable voters might be found.
These whizzo hi-tech schemes are attractive to people with money. All they need to do is to sign a few large checks. A few highly-paid consultants, their kind of people, wave their magic wands and the right kind of voters stream to the polls.

But the results of the recent election suggest that this doesn't really work. What really does work is the kind of grass-roots person-to-person organizing that Stacey Abrams used to flip Georgia. But that means signing lots of small checks and working with lots of awkward people, a much less attractive proposition. No one in Stacey Abrams' organization is raking in the big bucks, so there are no marketeers hyping their product.

Someone else who has noticed the gap between hype and reality in online advertising is Tim Hwang. In Ad Tech Could Be the Next Internet Bubble, Gilad Edelman reviews his new book Subprime Attention Crisis. Hwang:
lays out the case that the new ad business is built on a fiction. Microtargeting is far less accurate, and far less persuasive, than it’s made out to be, he says, and yet it remains the foundation of the modern internet: the source of wealth for some of the world’s biggest, most important companies, and the mechanism by which almost every “free” website or app makes money. If that shaky foundation ever were to crumble, there’s no telling how much of the wider economy would go down with it.
The argument of Hwang's book is that, because the online advertising marketplace was designed by economists such as Google's Hal Varian by analogy with the financial markets, they inherited pathologies similar to those that caused the 2008 financial crisis. And thus the market is likely to suffer a similar kind of meltdown, which would have a huge impact on the online environment:
Intense dysfunction in the online advertising markets would threaten to create a structural breakdown in the classic bargain at the core of the information economy; services can be provided for free online to consumers, insofar as they are subsidized by the revenue generated from advertising.
...
a sustained depression in the global programmatic advertising marketplace would pose some thorny questions not entirely unlike those faced by the government during the darkest days of the 2008 financial crisis. Are advertising-reliant services like social media platforms, search engines, and video streaming so important to the regular functioning of society and the economy that they need to be supported lest they take down other parts of the economy with them? Are they, in some sense, "too big to fail"?
Hwang's prologue features a talk to the Programmatic I/O conference by Nico Neumann, who:
begins by showing an analysis done by him and his collaborators auditing a sample of the third-party consumer data—also known as the record of everything you supposedly do online—that form the basis of online ad targeting. When compared with verified data about those same consumers, the accuracy was often extremely poor. The most accurate data sets still featured inaccuracies about 10 percent of consumers, with the worst having nearly 85 percent of the data about consumers wrong.
Hwang surveys the vast array of evidence of online advertising's excessive hype, including:
When they first launched in 1994, the first banner ads generated a remarkable click-through rate of 44 percent. ... One data set drawn from Google's ad network suggests that the average click-through rate for a comparable display ad in 2018 was 0.46 percent. ... Recent attempts to measure click-through rates on Facebook ads reveal similar rates of less than 1 percent. ... Even these sub-1-percent click-through rates may overstate the effectiveness of ads on some platforms. On mobile devices close to 50 percent of all click-throughs are not users signaling interest in an advertisement, but instead accidental "fat finger" clicks—users unintentionally clicking on content while using a touch-screen device. Ads may also drive a response among only a small segment of the population. In 2009, one study estimated that 8 percent of internet users were responsible for 85 percent of all advertisement click-throughs online.
And:
In 2013, a controlled experiment on more than a million customers to evaluate the causal effect of online ads concluded that a customer "between ages 20 and 40 experienced little or no effect from the advertising". This was in spite of this demographic's proportionally heavier usage of the internet. In contrast, the study found that customers older than sixty-five, despite constituting only 5 percent of the experimental group, were responsible for 40 percent of the total effect observed as a result of the advertising.
And:
In 2014, Google released a report suggesting that 56.1 percent of all ads displayed on the internet are never seen by a human. One 2017 report by Comscore found that this problem is particularly pronounced for ads purchased through the programmatic ecosystem.
...
One study by Deloitte from 2017 suggests that fully three-quarters of North Americans engage in "at least one form of regular ad blocking". In 2016, 615 million around the world were actively blocking ads.
But the main part of the book is a detailed comparison of the online advertising market now to the market in subprime mortgage bonds leading up to the 2008 financial crisis. Hwang bases his argument on the idea that, just as mortgages were treated as commodities:
What is different about the present-day online advertising system is that it has enabled the bundling of a multitude of tiny moments of attention into discrete, liquid assets that can then be bought and sold frictionlessly in a global marketplace. Attention is commodified to an extent that it has not been in the past.
By being commodified, the differences between individual mortgages were obscured, and so are the differences between different episodes of attention. The market is opaque:
Opacity isn't dangerous only because it can cause errors in valuation. It also allows for the active inflation of a market despite the fundamental shakiness of the thing being bought and sold. This sometimes results from irrational levels of market confidence, a regular feature of financial crises going back hundreds of years.
...
This divergence between rosy outlooks and structural vulnerabilities is kindling for crises of confidence. When a hot, overpriced commodity is discovered to be effectively worthless, panic can set in, causing the market to implode.
Hwang lists a number of causes for market opacity:
The measurability of the online ad economy is an inch deep and a mile wide. As such, the tidal wave of data that has accompanied the development of online advertising provides only an illusion of greater transparency.
...
Modern online advertising remains deeply opaque on three fronts. First is the ever-increasing automation of the marketplace. Second is the creation of dark pools of liquidity where advertising inventory is bought and sold outside of the public eye. Third is the dominance of platforms, like Facebook and Google, that have frequently introduced new layers of opacity into the advertising marketplace.
Here is my take on these three:
  1. Among the problems the scale and the automation cause is the difficulty of ensuring "brand safety". Advertisers don't want their ads appearing on controversial content, so they have an analogous problem to the platform's "content moderation" problem. What content is "controversial" enough to warrant exclusion? The real-time ad auction system makes it almost impossible for an advertiser to know what content their ad appears on, and even if they knew Masnick's Impossibility Theorem means there are bound to be a lot of cases where, in hindsight, the placement was a mistake.
  2. I discussed dark pools in my last post, The Order Flow, and Hwang's analogy with the financial markets holds here. Because both buyers and sellers see advantages from avoiding the public markets, and because the owner of the dark pool can profit by abusing their trust in the dark, dark pools were bound to arise in the online ad markets:
    Platforms increasingly give select buyers access to private marketplaces (PMPs)—exclusive exchanges for ad inventory. PMPs allow selected advertisers who have negotiated a special deal with a publisher to bid for advertising space, usually of a higher quality and in a less crowded, and therefore less competitive, market. These arrangements are attractive because they offer better transparency to the participants and also allow buyers to keep targeting data and other valuable information away from the public markets.

    PMPs are a growing segment of the transactions taking place in online advertising. In 2018, 45 percent of all the money spent in real-time bidding auctions took place within the confines of a PMP.
  3. Google and Facebook have much better information about the ad market than the publishers and the advertisers, and they are devoted to keeping both sides in the dark as much as possible. Facebook, in particular, has a long history of lying to both sides. Tim Peterson made a list of 12 such lies in 2017 and Amy Gesenhues added another 10 in 2018.
As a description of the market, Hwang's analogy holds up very well. Clearly it is possible that the market will suddenly collapse, as the subprime mortgage market did, when participants realize that what they are buying is worthless. If it did, the impact would be huge. But his suggestions about how to let the air out of the bubble without disaster strike me as futile:
Two pillars of faith give programmatic advertising an aura of invulnerability: measurability and effectiveness. The core proposition of programmatic advertising is that it gives advertisers an unprecedented depth of accurate data about consumers, which is able to produce uniquely effective outcomes for advertisers.
...
reducing confidence in the measurability and effectiveness of programmatic advertising will chip away at the willingness of ad buyers to pour money into the ecosystem.
...
Independent research may be a particularly powerful tool for shaping industry views of online advertising.
I think a collapse of confidence in the online ad market is unlikely in the medium term for two main reasons.

First, there is already a vast amount of research showing that the value of on-line ads is near zero, and advertisers know about it. Some of it comes from experiments run by major advertisers such as Proctor & Gamble, and major publishers such as the New York Times. Both sides just have very strong incentives to ignore it, as described from the advertisers side in entertaining detail by Jesse Frederik and Maurits Martijn in The new dot com bubble is here: it’s called online advertising. Here's the merest snippet:
It might sound crazy, but companies are not equipped to assess whether their ad spending actually makes money. It is in the best interest of a firm like eBay to know whether its campaigns are profitable, but not so for eBay’s marketing department.

Its own interest is in securing the largest possible budget, which is much easier if you can demonstrate that what you do actually works. Within the marketing department, TV, print and digital compete with each other to show who’s more important, a dynamic that hardly promotes honest reporting.
...
Marketers are often most successful at marketing their own marketing.
The marketeer's effective marketing works for everyone:
"Bad methodology makes everyone happy,” said David Reiley, who used to head Yahoo’s economics team and is now working for streaming service Pandora. "It will make the publisher happy. It will make the person who bought the media happy. It will make the boss of the person who bought the media happy. It will make the ad agency happy. Everybody can brag that they had a very successful campaign."
Nico Neumann agrees:
One experiment he presents shows that, under proper experimental conditions, the impact of an ad for auto insurance had a negative effect on sales, rather than the massively positive one suggested by popular statistical models used in the industry.

So why does Nico think these technologies are so popular in the online advertising space? Marketers, he says, "love machine-learning/AI campaigns because they look so great in ... analytics dashboards and attribution models." This cutting edge technology is favored—in other words—because it makes for great theater.
Publishers' incentive to devalue the product they are selling is obviously zero, especially since so many are struggling to survive decreasing income as more and more is swallowed by the platforms:
One study by The Guardian suggests that some 70 percent of the money spent by buyers is consumed by the ad tech platform, with the publisher retaining the remainder.
Second, as Michael Lewis explains in his must-read The Big Short: Inside the Doomsday Machine, key to the collapse of the subprime mortgage market was the ability of investors who spotted the bubble to buy credit default swaps on the packages of subprime mortgages. These swaps were the equivalent for the bond market of short-selling in the equity markets; a way to make money by betting on a fall in value. As far as I can see there's no way to short the attention market, so there's no one incentivized to make advertisers and publishers skeptical of the value of online ads.

Anyone interested in this topic must read Maciej Cegłowski's wonderful post from 5 years ago What Happens Next Will Amaze You.

OCLC-LIBER Open Science Discussion on Citizen Science / HangingTogether

Thanks to Sarah Bartlett, technology writer, for contributing this guest blog post.

Sarah Bartlett

How is Citizen Science—the active contribution of the general public in scientific research activities—developing, and what should research library involvement look like? This final session of the OCLC/LIBER Open Science Discussion series brought together research librarians with a range of viewpoints and practical experiences of this exciting area. Together the group formed a vision of Citizen Science in an ideal future state, and identified challenges that stand in the way of achieving that.

Much progress has been made since 2018, when libraries first identified a potential role in Citizen Science. Since then, several research libraries in Europe have incorporated Citizen Science into their activities—despite the adverse impact of COVID-19—and are working with researchers. We can also see knowledge brokering taking place in this area, one valuable example being LIBER’s Citizen Science Working Group, two members of whom were present at this session. So we’re seeing some momentum for libraries within Citizen Science, though not evenly spread across Europe.

“It’s important to emphasize that Citizen Science is not about making citizens better scientists; it’s about making scientists better citizens.”

Citizen Science – What would be the ideal future state?

Citizen scientists at work. Image CC BY, NPS / Karlie Roland

As an emerging field within Open Science, Citizen Science proved to be fertile ground for a vigorous discussion of an ideal future state, and the group explored a number of fruitful areas.

Bringing citizens and scientists closer together. It’s important to emphasize that Citizen Science is not about making citizens better scientists; it’s about making scientists better citizens, one participant argued. Through Citizen Science, scientists can gain a broader understanding of citizens and their perceptions, expectations, and worries in relation to science. That said, another important by-product will be that people would better understand the scientific endeavor, in what some regard as a post-truth society.

Fuller engagement of citizens. The group agreed that all too often citizens are limited to data collection activities. An ideal future state would see citizens more fully integrated in Citizen Science work. They would be more aware of the broader context of their work, and the science itself would be re-articulated in layman’s terms. This would result in fewer cases of volunteer burnout (a known problem in Citizen Science) and an increased likelihood of volunteers moving beyond “pleasant Sunday activities”, such as photographing wildlife close to home, into more challenging work.

Improved awareness of Citizen Science. At present, more Citizen Science research takes place than is formally recognized or coordinated. A number of participants reported uncovering hidden examples of Citizen Science within their institution. In an ideal future state, all Citizen Science would be identifiable as such, making it easier to form partnerships and share knowledge.

Cross-disciplinary engagement. The need for more involvement with the arts, humanities, social sciences, etc. resonated strongly with the group. Although Citizen Science originated in the natural sciences, in an ideal future state it would enjoy cross-disciplinary engagement.

A clear but flexible role for libraries. In an ideal future state, libraries would have a clear role within Citizen Science. But as they work towards this, a one-size-fits-all approach is not feasible. One speaker proposed an empathic approach—to try to walk in the shoes of both researchers and citizens, as a way of establishing what might work in libraries’ own setting. Ideas included:

  • a single point of contact for citizens, an idea proposed by the LIBER Citizen Science Working Group,
  • relationship brokering and knowledge sharing, which sit comfortably within the librarian skillset,
  • the provision of spaces, collections, and open content to support Citizen Science, and
  • the lending of powerpacks to citizens for mobile devices, to support lengthy periods of work in remote locations.
Photo by Alex Rainer on Unsplash

“Another practical idea is to try to walk in the shoes of researchers, and embed Citizen Science activities along these lines. And then walk in the shoes of citizens, who might worry that their mobile phone will run out of charge if they spend twelve hours away in a remote location. So they’d need a power pack. Could the library lend one?”

Obstacles and challenges

In the second half of the session, participants identified challenges which stood in the way of this ideal future state. They then voted in a poll to establish the top three challenges, which were:

  1. Libraries are not aware of what they can do for Citizen Science.
  2. More collaboration between libraries and researchers.
  3. Engaging libraries in new partnerships.

Challenge 1 – Libraries are not aware of what they can do for Citizen Science

The LIBER Open Science Roadmap suggests promoting the library as an active partner and using the library as an organizing and managing body within Citizen Science. These roles draw on established professional strengths, as does the single point of contact for coordination. The roadmap also recommends producing guidelines, methodologies and policies to help libraries get involved. In terms of advocacy, LIBER’s Citizen Science Working Group and projects are a good start, but more is needed.

Ultimately, each library needs to engage with Citizen Science in its own way. For a profession traditionally reliant on process and guidance, this may be challenging. However, librarians are skilled relationship brokers. One speaker pointed out that although his library does not interact with the public directly, it does deal with institutions that do—the national library, public libraries, museums, and organizations in the heritage sector.

Challenge 2 – More collaboration between libraries and researchers

Librarians already deliver research support in a number of ways, and offer services and competencies that can be readily framed into Citizen Science. Some librarians feel under-confident about supporting researchers. But as one speaker noted, researchers need help finding the right information—a core librarian skill. Institutions aren’t always aware of what the library can do for Citizen Science.

One participant suggested that if libraries were more familiar with Citizen Science, they might be better positioned to identify what they can do. Libraries would benefit from clear guidance from experienced libraries who are already involved in Citizen Science activities.

Challenge 3 – Engaging libraries in new partnerships

Citizen Science cannot happen in a silo; it is crucial for research libraries to form partnerships with stakeholders in other types of organization, even in the early stages of Citizen Science involvement.

LIBER’s Citizen Science Working Group monitors opportunities for partnerships. Partnerships with public libraries are particularly important because university libraries do not interact directly with citizens. The group discussed awareness raising among public librarians, many of whom are curious about Citizen Science, to prepare the ground for citizen engagement. 

Challenge 4 – Just do it

Photo by Elisabeth Wales on Unsplash

Outside the top three, another challenge had surprising resonance for libraries and Citizen Science—the Nike tagline, Just do it.

Just do it might involve identifying individuals who are already working on Citizen Science research at laboratory level, then putting forward existing library services to help. At TU Delft, the library has enjoyed considerable success, by starting small and watching activities snowball. They began by adopting a knowledge broker role across the research and library communities, by inviting researchers who were already working on Citizen Science to a series of working sessions. People joined in because they liked it and felt connected to it.

In the context of the global pandemic, in which changing priorities are placing considerable strain on budgets, an agile, lightweight approach may be the only option open to libraries that are keen to deliver Citizen Science with existing resources. Simple guidance around simple, low-cost activities would be valuable for both university and public libraries. Once Citizen Science initiatives gain traction, they can be rolled in with existing work on Open Science and Open Access. But it starts with libraries taking a proactive approach to spotting opportunities in their own setting.

About the OCLC-LIBER Open Science Discussion Series

The discussion series is a joint initiative of OCLC Research and LIBER (the Association of European Research Libraries). It focusses on the seven topics identified in the LIBER Open Science Roadmap, and aims to guide research libraries in envisioning the support infrastructure for Open Science (OS) and their roles at local, national, and global levels. The series runs from 24 September through 5 November.

The kick-off webinar opened the forum for discussion and exploration and introduced the theme and its topics. Summaries of all seven topical small group discussions have now been published on the OCLC Research blog, Hanging Together. The previous posts are: (1) Scholarly Publishing, (2) FAIR research data, (3) Research Infrastructures and the European Open Science Cloud, (4) Metrics and rewards, (5) Skills and (6) Research Integrity.

The post OCLC-LIBER Open Science Discussion on Citizen Science appeared first on Hanging Together.

Award Winners: NDSA Levels of Digital Preservation Group / Digital Library Federation

This year’s World Digital Preservation Day (#WDPD) was the biggest yet! With outpourings of research, achievements, practical advice, and fun it was hard to believe that there were also awards as part of that process.

On 05 November, the NDSA’s Levels of Digital Preservation Reboot was the recipient of one of the Digital Preservation Coalition’s Digital Preservation Awards! We won in the ICA-sponsored category for Collaboration and Cooperation – the first time it has been awarded!  This honor is collectively bestowed on the many of you who helped craft and refine the Levels, and we hope your continued ideas and enthusiasm will keep the momentum going. Thank you for all your hard work! For an overview, background, and charge for the Levels, see my blog post that speaks to leveraging such a high level of collaborative energy.

~ Bradley Daigle, Levels of Digital Preservation Steering Group Lead

The post Award Winners: NDSA Levels of Digital Preservation Group appeared first on DLF.

Thought leaders in Diversity, Equity and Inclusion you should know / Tara Robertson

many lightbulbs hanging down, the middle one is biggest, clearest and brightest

I can’t think of any company, country, or industry that has diversity, equity and inclusion all figured out–it’s an emergent space where we’re all learning how to do better. We can always learn from the people leading the work and from the research. I am sharing this list of nine thought leaders who I admire. I admire that they center their values in their work, drive results and are generous in sharing their thoughts and ideas. It is weighted towards women of colour and queers in the tech sector. I think these people’s work experience, formal credentials and lived experience, makes what they have to say extremely valuable. 

Dr. Erin Thomas (Twitter, LinkedIn)

This year when we started having conversations about anti-racism, and specifically anti-Black racism, at work, I would gut check my strategies and tactics against the detailed Sunday Twitter threads posted by Dr. Thomas. Dr. Erin Thomas is the Head of Diversity, Inclusion & Belonging and Talent Acquisition at Upwork and was named in Fortune Magazine’s 40 under 40. With a doctorate in social psychology, Dr. Thomas pulls in relevant research and skillfully bridges it to action in a corporate setting. Her most recent Sunday thread is about how Kamala Harris’ win connects to the future of women leaders, with a bonus of awesome illustrated and animated gifs.

Candice Morgan (Twitter, LinkedIn)

Candice is one of the few DEI leaders working in the venture capital space. She describes her job as creating inclusive strategies for GV (Google Ventures) and its portfolio companies, and helping the firm expand diversity across the entrepreneurs it funds. She’s my mentor and I’ve been able to level up my strategy and execution from our conversations. 

When she was at Pinterest she led impressive increases in diversity internally on their teams and shared some of what they learned in HBR. Externally she made an impact in the Pinterest product too. The skin tone filter allows users to find makeup that is relevant for them and was featured in Wired. Personally I’m delighted by this–a blue-red lipstick looks really good with my skin tone but an orange-red makes me look ill. I love citing this example of product inclusion. 

She’s not a prolific social media poster but when she posts it’s useful and thoughtful. Candice recently shared a conversation with UK-based Dr. Jonathan Ashong-Lamptey on anti-racism in the workplace across cultures. In September she wrote a piece for Fast Company titled How to build a race-conscious equity, diversity, and inclusion strategy outlining four not-so-easy steps companies must make to move from recognition to integration.

Michelle Kim (Twitter, LinkedIn)

I admire the boldness, honesty and integrity that Michelle Kim brings to the DEI space. She is the co-founder and CEO of Awaken and says: “I dream big and get sh*t done.”  

I’ve shared these posts many times this year:

As someone who is mixed race and Asian, I appreciate her writing on how Asian people can show up as allies for Black communities; her piece on how Asians perpetuate anti-Black racism has helped me deepen my anti-racism work. Five months ago, when there was a huge surge of interest in unconscious bias training, she put together a spreadsheet of Black Owned DEI Companies + Consultants Currently Accepting New Corporate Clients.

I can’t wait to read the book that she is currently writing.

Aubrey Blanche (Twitter, LinkedIn)

I’ve been a fan of Aubrey’s for awhile. She shares a lot of her corporate DEI work. While at Atlassian she shared the Balanced Teams Diversity Assessment tool and Atlassian’s Team Playbook, which has several DEI plays, including How To Run Inclusive Meetings. I’m a huge fan of Culture Amp, and as their Director of Equitable Design & Impact she shares a lot of useful information, like how they’re preparing managers to support employees during the US election week.

I love that she’s unapologetic about her politics and very human on Twitter. 

Steven Huang (LinkedIn)

Steven is a generous and thoughtful leader and colleague. Currently he’s the Managing Director of the Collective – DEI Lab and an Inclusion and Diversity advisor for Jumpstart.

His posts on LinkedIn are thoughtful and invite interesting conversations. A couple of months ago he posted a real life scenario on cancel culture and invited people to “collectively broaden our nuanced understanding of this topic seeking to understand other POVs”.

Lily Zheng (Twitter, LinkedIn)

Lily regularly writes several times a week on LinkedIn. They clearly articulate things that are still half baked in my mind, or say them in a way that shifts my thinking to include different perspectives. Here’s one example:

I no longer ask clients to pick company values or describe their own company #culture. Every time, leaders come up with generic, cookie-cutter terms drawn from the same pool of 50 words. “Excellence.” “Integrity.” “Quality.”

Overwhelmingly generic. Nobody knows what they mean, not your employees, not your lawyers, not your leaders. They’re not really culture, or values–just words.

Now, I flip the question on its head: “what AREN’T your values?” “What ISN’T your culture?” “What are the ANTITHESES to the identity of your company?”

They are the author of The Ethical Sellout. Lily’s article in HBR Do Your Employees Feel Safe Reporting Abuse and Discrimination? clearly outlines four practices that you can adopt to rebuild employee trust in reporting. They were also on the HBR Women at Work podcast talking about how the gender binary restricts people at work and how to be respectful and supportive of gender-diverse colleagues. 

Dr. Janice Gassam Asare (Twitter, LinkedIn)

Dr. Gassam Asare has a PhD in Applied Organizational Psychology and is a prolific writer. She is a Senior Contributing Writer for Forbes and has published hundreds of articles on topics from anti-racism, hiring practices, and inclusive leadership to examining various case studies of companies through the lens of DEI. I’m grateful that one of those case studies profiled the work I led at Mozilla on trans and non-binary inclusion.

Here’s a few of her articles about anti-Black racism and system change that are all Forbes editors’ picks:

She is also the author of Dirty Diversity: A Practical Guide to Foster an Equitable and Inclusive Workplace for All. You can preorder her second book The Pink Elephant that comes out on November 27th. 

Dr. Sarah Saska (Twitter, LinkedIn)

Dr. Saska is the Co-Founder & CEO of Feminuity, a DEI consultancy based in Toronto that works with a lot of US technology companies. Her PhD is in Equity Studies and Technology and Innovation Studies. She is a frequent speaker and shares useful content, including some of the practical guides her team has written:

I learned about Namedrop, a service where you can record how you pronounce your name, from Sarah’s email signature. (BTW–My name is Tah-rah, not Terra) 

Joelle Emerson (Twitter, LinkedIn)

Joelle is the Founder and CEO of Paradigm, a strategy firm that partners with companies to build more inclusive organizations. She leads an amazing team of DEI practitioners. She’s written for HBR and Fortune, and signal boosts articles people on her team write. Recently she coauthored a post with Dr. Evelyn Carter and Y-Vonne Hutchinson in Fortune magazine asking Why is President Trump trying to kill off diversity training programs? I really enjoyed her tweets during last week’s US election.

OCLC Research and the National Finding Aid Network project / HangingTogether

The current state of US finding aid aggregation

We are very pleased to share details about our involvement in the Building a National Finding Aid Network project, which has received funding from the Institute of Museum and Library Services. OCLC will be working with the University of Virginia and project lead California Digital Library, in close partnership with LYRASIS and statewide/regional aggregators, to conduct a two-year research and demonstration project to build the foundation for a national archival finding aid network. Work will be conducted in parallel across multiple focus areas, including:  

  • Research investigating end-user and contributor needs in relation to finding aid aggregations 
  • Evaluating the quality of existing finding aid data  
  • Technical assessments of potential systems to support network functions, and formulating system requirements for a minimum viable product instantiation of the network  
  • Community building, sustainability planning, and governance modeling to support subsequent phases moving from a project to a program, post-2022  

OCLC Research will be involved in the first two focus areas, and we could not be more excited! This is the first in what we hope will be a series of posts to keep you filled in on what we are doing on the project.

Where did this project come from?

“Finding Aid Aggregation at a Crossroads” report

This project is an outcome of a 2018-2019 planning initiative. OCLC was also a participant in that earlier project, which produced both findings and a subsequent action plan.

Why is OCLC involved?

OCLC has been in on the finding aid aggregation game for a long time! The planning phase research identified three “meta-aggregators” (defined as programs or organizations that harvest and index finding aid data — or descriptions of archival context — from aggregators and individual institutions). In the US, OCLC’s ArchiveGrid is one of those (along with the History of Medicine Finding Aid Consortium and the Social Networks and Archival Context program — or “SNAC”).

OCLC Research is also an acknowledged leader in conducting research projects on behalf of the archives and special collections community.

What the work will look like

OCLC will lead work in three major areas of inquiry:

Research with end users

We will explore two main research questions:

  • Who are the current users of finding aid aggregations? Do they align with the persona types and needs identified in recent archival persona work?
  • What are the benefits and challenges users face when searching for descriptions of archival materials within finding aid aggregations?

We will conduct a pop-up survey on the sites of regional aggregators who are partners on the grant. The survey will gather information about the demographics and information-seeking needs of the users of finding aid aggregations, and help us identify people who are willing to take part in semi-structured interviews to further understand their research needs. The semi-structured interviews will be recorded and transcribed, and transcriptions will be coded and analyzed using NVivo.

In this portion of our work, we will seek to better understand who archival researchers are and their goals and motivations in their research, in order to inform what functionality might be prioritized and included in a national aggregation structure.

Research with cultural heritage institutions

Here, we will explore three main research questions:

  • What are the enabling and constraining factors that influence whether and how institutions describe the archival collections in their care?
  • What are the enabling or constraining factors that influence whether institutions contribute to an aggregation?
  • What value does participation in an aggregation service bring to institutions?

To answer these questions, we will conduct focus group interviews with colleagues at cultural heritage institutions of various types and sizes that steward archival collections. The focus group interviews will be recorded and transcribed, and transcriptions will be coded and analyzed using NVivo.

We will seek to identify users’ and prospective users’ expectations and needs related to aggregation and discovery of archival description, to inform what functionality, policy, and governance structures might best support a national aggregation.

Evaluation of finding aid data quality

In approaching finding aid data quality, we will explore these main research questions:

  • What is the structure and extent of consistency across finding aid data in current aggregations?
  • Can that data support the needs to be identified in the user research phase of the study? If so, how? If not, what are the gaps?

To assess finding aid quality, we will use representative samples from multiple aggregators. We will conduct a first wave of analysis, examining consistency and variance in descriptive and encoding practice. We will then conduct a second wave of analysis informed by the end user and cultural heritage institution portions of our research, to assess whether extant data can support the kinds of functionality identified as needed via our survey, interviews, and focus groups.

Here we will seek to identify the quality of existing finding aid data at scale, in order to inform and scope the network’s initial functionality and lay the groundwork for iterative data remediation and expanded network features in subsequent phases of development.

Project team

The OCLC project team is led by Chela Scott Weber, with Lynn Silipigni Connaway leading research design, and includes Chris Cyr, Brittany Bannon, Janet Mason, Merrilee Proffitt, and Bruce Washburn. We are very pleased to welcome Lesley Langa as an Associate Researcher for this project. Lesley earned her Master of Arts at Florida State University and her doctoral degree at the University of Maryland. She has extensive research experience and has worked for the National Endowment for the Humanities, the Smithsonian Institution Accessibility Program, and IMLS.

We are excited to get to work on this project! Keep an eye out here for updates on our explorations as the project unfolds.

The post OCLC Research and the National Finding Aid Network project appeared first on Hanging Together.

MyData Online 2020 / Open Knowledge Foundation

MyData Online 2020 (Dec 10-12) will gather 1000 personal data professionals and people interested in the data economy. They bring together business, legal, tech and societal perspectives to create a sustainable, fair and prosperous digital society. The online conference will provide a quality programme, networking opportunities and social connections. The conference is organised by MyData Global, an award-winning international nonprofit based in Finland. MyData Global’s mission is to empower individuals to self-determination regarding their personal data.

The origins of MyData can be traced back to Open Knowledge Festival held in Finland in 2012. There, a small group of people gathered in a breakout session to discuss what ought to be done with the kind of data that cannot be made publicly available and entirely open, namely personal data.

Over the years, more and more people who had similar ideas about personal data converged and found each other around the globe. Finally, in 2016, a conference entitled MyData brought together thinkers and doers who shared a vision of a human-centric paradigm for personal data and the community became aware of itself.

The MyData movement, which has since gathered momentum and grown into an international community of hundreds of people and organisations, shares many of its most fundamental values with the Open movement from which it has spun off. Openness and transparency in collection, processing, and use of personal data; ethical and socially beneficial use of data; cross-sectoral collaboration; and democratic values are all legacies of the open roots of MyData and hard-wired into the movement itself.

The MyData movement was sustained originally through annual conferences held in Helsinki and attended by data professionals in their hundreds. These were made possible by the support of the Finnish chapter of Open Knowledge, who acted as their main organiser. As the years passed and the movement matured, in the autumn of 2018, the movement formalised into its own organisation, MyData Global. Headquartered in Finland, the organisation’s international staff of six, led by General Manager Teemu Ropponen, now facilitate the growing community with local hubs in over 20 locations on six continents, and the continued efforts of the movement to bring about positive change in the way personal data is used globally.

Call for volunteers

Are you the person we’re looking for? Or do you know someone who could be?  The community behind the conference is a diverse and dynamic international group of people who work for human-centric personal data and fair data economy – the future of the internet. We’ll offer you: A chance to take a sneak peek inside and learn about a variety of skills and substances you benefit in the future: organising a global conference, setting up social events in the online reality, learn about topics like the future of the internet from legal, societal, business and tech perspectives. connect with like-minded people and expand you network. And what’s best, this year you don’t have to travel to Finland! We are open for people of all ages and backgrounds – and timezones! You’ll offer us: either your social, philanthropic, kind character (needed in chats, giving advice and instructions for the participants) or your reactive attention (needed in recording the sessions, taking notes, and doing other play-pause-mute-edit-things). No superpowers needed. You can freely choose how much you want to participate! https://lnkd.in/dg7X9tR

Join MyData 2020 conference with a special discount code!

Now we are one month and one day away from the MyData Online 2020 Conference (it will start on 10 December) and I hope that members of the Open Knowledge Foundation community will join us! If you would like to promote the conference, we can offer you the discount code WelcomeMyDataFriends, which gives 15% off all ticket categories at online2020.mydata.org/tickets.

Library IT Services Portfolio / Library Tech Talk (U of Michigan)

TRACC: A tool developed by Michigan to help with portfolio management

Academic library service portfolios are mostly a mix of big to small strategic initiatives and tactical projects. Systems developed in the past can become a durable bedrock of workflows and services around the library, remaining relevant and needed for five, ten, and sometimes as long as twenty years. There is, of course, never enough time and resources to do everything. The challenge faced by Library IT divisions is to balance the tension of sustaining these legacy systems while continuing to innovate and develop new services. The University of Michigan’s Library IT portfolio has legacy systems in need of ongoing maintenance and support, in addition to new projects and services that add to and expand the portfolio. We at Michigan worked on a process to balance the portfolio of services and projects for our Library IT division. We started working on the idea of developing a custom tool for our needs, since the other available tools are oriented towards corporate organizations and we needed a lightweight tool to support our process. We went through a complete planning process, first on whiteboards and paper, and then developed an open source tool called TRACC to help us with portfolio management.

Dark Reading / Ed Summers

I just made a donation to the Dark Reader project over on Open Collective. If you haven’t seen it before Dark Reader is a browser plugin (Firefox, Chrome, Edge, Safari) that lets you enable dark mode on websites that don’t support it. It has lots of useful configuration settings, and allows you to easily turn it on and off for particular web sites.

For most of my life I’ve actually preferred light backgrounds for my text editor. But for the past year or so as my eyes have gotten worse I’ve enabled dark mode in Vim and gradually in any application or website that will let me.

I’m not an eye doctor, but it seems that the additional light from the screen reflects off of whatever material constitutes the cataracts in the lenses of my eyes, which causes everything to fuzz out, and for text to become basically illegible. Being able to turn on dark mode has meant I’ve been able continue to read online, although it still can be difficult. Dark Reader lets me turn on dark mode for other websites that don’t support it, which has been a real life saver. So it was nice to be able to say thank you.

Just as an aside, I’ve been using Open Collective for a few years now, to donate regularly to the social.coop project which is how I’m doing social media these days. Realizing Dark Reader was on Open Collective too made me think how I should really look at more open source projects to support that are on there. It also made me think that perhaps Open Collective could be a useful platform for the [Documenting the Now] project to look at to support some of the tools it has developed, as the project draws down on its grant funding and moves into sustaining some of the things it has started. Perhaps it would be useful for other projects like Webrecorder potentially too?

PS. It’s enabled here on my blog for people who have their browser/os set to dark mode.

Meet the 2020 DLF Forum Community Journalists / Digital Library Federation

The 2020 Virtual DLF Forum looks different from our typical event in almost every way imaginable. Due to the fact that we aren’t convening in person and registration is free, we decided to offer a different kind of fellowship opportunity. Because the guiding purpose of this year’s Virtual DLF Forum is building community while apart, through our re-envisioned fellowship program, we are highlighting new voices from “community journalists” in the field. We are providing $250 stipends to a cohort of 10 Virtual DLF Forum attendees from a variety of backgrounds and will feature their voices and experiences on the DLF blog after our events this fall.

We are excited to announce this year’s DLF Forum Community Journalists:

Arabeth Balasko

Arabeth Balasko (she/her) is an archivist and historian dedicated to public service and proactive stewardship. As a professional archivist, her overarching goals are to curate collections that follow a shared standardization practice, are user-centric, and are searchable and accessible to all via physical and digital platforms.

She believes that an archive should be a welcoming place for all people and should be an inclusive environment which advocates to collect, preserve, and make accessible the stories and histories of diverse voices. By getting individuals involved in telling THEIR story and making THEIR history part of the ever-growing story of humanity, we all win!

 

Rebecca Bayeck
@rybayeck

Rebecca Y. Bayeck is a dual-PhD holder in Learning Design & Technology and Comparative & International Education from the Pennsylvania State University. She is currently a CLIR postdoctoral fellow at the Schomburg Center for Research in Black Culture, where she engages in digital research, data curation, and inclusive design. Her interdisciplinary research is at the interface of several fields including the learning sciences, literacy studies, and game studies. At this intersection, she explores literacies and learning in games, particularly board games, and the influence of culture, space, and context on design, learning, research, and literacies.

 

Shelly Black
@ShellyYBlack

Shelly Black is the Cyma Rubin Library Fellow at North Carolina State University Libraries where she supports digital preservation in the Special Collections Research Center. She also works on a strategic project involving immersive technology spaces and digital scholarship workflows. Previously she was a marketing specialist at the University of Arizona Libraries and promoted library services and programs through social media, news stories, and newsletters.

Shelly was recently selected as a 2020 Emerging Leader by the American Library Association and is a provisional member of the Academy of Certified Archivists. She received an MLIS and a Certificate in Archival Studies from the University of Arizona, where she was a Knowledge River scholar. She also holds a BFA in photography and a minor in Japanese from the UA.

 

Lisa Covington
@prof_cov

Lisa Covington, MA, is a PhD Candidate at The University of Iowa studying Sociology of Education, Digital Humanities and African American Studies. Her dissertation work is “Mediating Black Girlhood: A Multi-level Comparative Analysis of Narrative Feature Films.” This research identifies mechanisms by which media operates as an institution, (mis)informing individual and social ontological knowledge.

In 2020, Lisa received the Rev. Dr. Martin Luther King, Jr. Award from the Iowa Department of Human Rights. She is the Director of the Ethnic Studies Leadership Academy in Iowa City, an educational leadership program for Black youth in middle school and high school to learn African American advocacy by incorporating the digital humanities and social sciences.

Lisa received her MA from San Diego State University in Women & Gender Studies. As a youth development professional, Lisa develops curriculum for weekly programming with girls of color, trains teachers on best practices for working with underrepresented youth, and directs programs in preschool through college settings in California, Pennsylvania, Iowa, New Jersey, New York and Washington, D.C. 

 

Ana Hilda Figueroa de Jesús

Ana Hilda Figueroa de Jesús will be graduating next spring from the Universidad de Puerto Rico in Río Piedras with a BA in History of Art. Her research interest focuses on the education, accessibility, and publicity of minority, revolutionary Puerto Rican art, including topics such as race, gender and transnationalism. She has interned at Visión Doble: Journal of Criticism and History of Art, and volunteered at MECA International Art Fair 2019 and Instituto Nueva Escuela. Ana works as an assistant to the curator and director of the Museum of History, Anthropology and Art at UPR. She is currently a Katzenberger Art History Intern at Smithsonian Libraries.

 

Amanda Guzman

Amanda Guzman is an anthropological archaeologist with a PhD in Anthropology (Archaeology) from the University of California, Berkeley. She specializes in the field of museum anthropology with a research focus on the history of collecting and exhibiting Puerto Rico at the intersection of issues of intercultural representation and national identity formation. She applies her collections experience as well as her commitment to working with and for multiple publics to her object-based inquiry teaching practice that privileges a more equitable, co-production of knowledge in the classroom through accessible engagement in cultural work. Amanda is currently the Ann Plato Post-Doctoral Fellow in Anthropology and American Studies at Trinity College in Hartford, CT. 

 

Carolina Hernandez
@carolina_hrndz

Carolina Hernandez is currently an Instruction Librarian at the University of Houston where she collaborates on creating inclusive learning environments for students. Previously, she was the Journalism Librarian at the University of Oregon, where she co-managed the Oregon Digital Newspaper Program. Her MLIS is from the University of Wisconsin-Madison. Her current research interests are in critical information literacy, inclusive pedagogy, and most recently, the intersection of digital collections and pedagogy. 

 

Jocelyn Hurtado

Jocelyn Hurtado is a native Miamian who worked as an Archivist at a community repository for four years. She is experienced in working with manuscript, art, and artifact collections pertaining to a community of color whose history has often been overlooked. Ms. Hurtado understands the responsibility and the significance of the work done by community archivists and has seen firsthand that this work not only affects the present-day community but that it will continue to have a deep-rooted impact on generations to come.

Ms. Hurtado also has experience promoting collections through exhibits, presentations, instructional sessions, and other outreach activities, which include the development and execution of an informative historical web-series video podcast.

Ms. Hurtado earned her Associate Degree in Anthropology from Miami-Dade College and a Bachelor of Arts in Anthropology from the University of Florida. She also completed the Georgia Archives Institute Program. 

 

Melde Rutledge
@MeldeRutledge

Melde Rutledge is the Digital Collections Librarian at Wake Forest University’s Z. Smith Reynolds Library. He is responsible for leading the library’s digitization services—primarily in support of ZSR’s Special Collections and Archives, as well as providing support for university departments. 

He earned his MLIS from the University of North Carolina at Greensboro, and has served in librarianship for approximately 12 years. His background also includes 8 years of newspaper journalism, where he wrote news, sports, and feature articles for several locally published newspapers in North Carolina. 

He currently lives in Winston-Salem, NC, with his wife and three sons.

 

Hsiu-Ann Tom

Hsiu-Ann Tom is the Digital Archivist at The Amistad Research Center in New Orleans, LA, where her work focuses on born-digital collection development. She received her Master’s in Library and Information Science with a concentration in Archives Management from Simmons University in Boston in 2019. She is a graduate of Columbia University (BA, Sociology) and Harvard University (MA, Religion and Politics), and is a member of the Academy of Certified Archivists. Prior to working in the archival field, Hsiu-Ann served in the United States Army intelligence field as a cryptolinguistic analyst, attending the Defense Language Institute in Monterey, California. Before coming to Amistad, Hsiu-Ann worked on the archives staff of Boston University’s Howard Gotlieb Archival Research Center, working with the Military Historical Society of Massachusetts Collection. She recently obtained the Society of American Archivists’ Digital Archives Specialist certification and enjoys supporting students and new professionals in their educational development through her work as a member of SAA’s Graduate Archival Education Committee.

 

Kevin Winstead
@Kaerf1
Kevin is a 2019-2021 CLIR Postdoctoral Fellow at Penn State, having earned his PhD in American Studies at the University of Maryland. His scholarship includes published articles on social movements and religion. In “Black Catholicism and Black Lives Matter: the process towards joining a movement” (Ethnic and Racial Studies, 2017), Kevin uses an adaptation of social movement frame analysis to examine how Black Catholics define and construct the ongoing political issues within the Black Lives Matter movement. His current research interests center on social movements, digital studies, religion, the social construction of knowledge, and digital misinformation.

In his previous work, Kevin served as project manager for the Andrew W. Mellon funded African American History, Culture, and Digital Humanities, where he produced project events and other scholarly activities, making the digital humanities more inclusive of African American scholarship while enriching African American studies research with new methods, archives, and tools. Kevin also served as project manager for Baltimore Stories: Narratives and Life of an American City, funded by the NEH Humanities in the Public Square grant in partnership with Maryland Humanities, the University of Maryland Baltimore County, Enoch Pratt Free Library, and the Greater Baltimore Cultural Alliance.

 

Betsy Yoon
@betsyoon

Betsy Yoon (she/they) is an Adjunct Assistant Professor and OER/Reference Librarian at the College of Staten Island, CUNY and earned her MLIS in 2019. She also has a Master of International Affairs. She lives in occupied Lenapehoking and is a longtime member of Nodutdol, a grassroots organization of diasporic Koreans and comrades working to advance peace, decolonization, and self-determination on the Korean Peninsula and Turtle Island (North America). Interests include critical approaches to OER and openness, the free/libre software movement, understanding and addressing root causes over symptom management, and the role that libraries and archives can play in our collective liberation.

The post Meet the 2020 DLF Forum Community Journalists appeared first on DLF.

2020 DLF Forum: Building Community With DLF’s Digital Library Pedagogy Working Group / Digital Library Federation

Join us online! November 9th-13th, 2020.

Though DLF is best known for our signature event, the annual DLF Forum, our working groups collaborate year round. Long before COVID-19 introduced the concept of “Zoom fatigue” into our lives, DLF’s working groups organized across institutional and geographical boundaries, building community while apart, to get work done. Made possible through the support of our institutional members, working groups are the efforts of a committed community of practitioners, using DLF as a framework for action, engaged in problem-solving in a variety of digital library subfields from project management and assessment to labor and accessibility.

Once we decided that the 2020 DLF Forum and affiliated events would be held in a virtual format, it meant that our working groups wouldn’t have the opportunity to meet in person for their typical working meals that take place throughout the Forum; however, this year’s virtual format means that we’ll have more new DLF Forum attendees than ever before. Because DLF’s working groups are open to ALL, regardless of whether you’re affiliated with a DLF member institution or not, we asked leaders of the DLF working groups to introduce their groups and the work they do to the new and returning members of the #DLFvillage in a series of blogs and videos.

We’ll share these working group updates in the days leading to this year’s DLF Forum.


Who are we?

DLF Digital Library Pedagogy Logo

The DLF Digital Library Pedagogy Working Group, commonly referred to as #DLFteach (also our Twitter hashtag), was founded in 2015 and is focused on building a community of practice for those interested in using digital library collections and technology in the classroom. The group is open for anyone to join regardless of your position, academic discipline, or DLF institutional affiliation. Here is what #DLFteach does and the ways you can join us:

Twitter Chats

One of the best ways to get involved with #DLFteach is to participate in a Twitter chat. Our Twitter chats offer a chance to talk with colleagues from all over about a different subject each time. Every chat has a host or two who plan the topic and write questions that will be tweeted at intervals over the course of one hour. Participants can follow the questions tweeted from the @CLIRDLF handle and respond from their own Twitter account. Hosts will monitor the chat and also tweet frequently. To see all the tweets as they happen, the hashtag #DLFteach is included with every tweet, and participants should likewise add it to their tweets. People can participate as much or as little as they like, ranging from lurking to tweeting answers and replying to others’ tweets.

Twitter chats usually take place at 2-3 PM EST / 11 AM – noon PST on the third Tuesday of every other month. Once or twice a year, the chat will take place at another time for those who cannot make the regular time. You can see previous chats on the group’s wiki. Interested in hosting a chat? Want to suggest a topic? Get in touch with the outreach coordinators of the DLF Digital Library Pedagogy Group!

Past Projects

#DLFteach is a uniquely project-based working group, and we are usually working on a couple of projects at any given time of year. Typically, members propose or are made aware of projects that would benefit from the expertise and dedication of group members working to implement them. If you are interested in our group’s focus and are looking to get involved, you are welcome to propose a project. If you do not have a specific project in mind but still want to get involved, that’s great, too, since these projects offer many opportunities to contribute to the community and the profession.

You may be wondering: What projects does #DLFteach work on? In September 2019, we released #DLFteach Toolkit 1.0, an openly available, peer reviewed collection of lesson plans and concrete instructional strategies edited by Erin Pappas and Liz Rodrigues and featuring the work of many #DLFteach members and affiliates. Check it out to get ideas of how to incorporate digital library collections and technologies into the classroom in structured, reproducible ways. Another 2019 resource developed by #DLFteach is the Teaching with Digital Primary Sources white paper, by Brianna Gormly, Maura Seale, Hannah Alpert-Abrams, Andi Gustavson, Angie Kemp, Thea Lindquist, and Alexis Logsdon, which outlines literacies and considers issues associated with finding, evaluating, and citing digital primary resources. If you are considering using digital primary sources in the classroom, this is an excellent resource to accompany your work with these materials. Additionally, #DLFteach has developed and facilitated workshops at the DLF Forum and Learn@DLF pre-conferences in 2016, 2018, and 2019.

Current Projects

Following the success of the first version released last year, we have issued a call for participation for the #DLFteach Toolkit 2.0, which will focus on instructional strategies using immersive technology. We are looking for both contributors and volunteers to assist with reviewing submissions and producing the Toolkit. Additionally, we are currently working on two blog series! One is focused on ethical issues for multimodal scholarship and pedagogy, and the other, Practitioner Perspectives: Developing, Adapting, and Contextualizing the #DLFteach Toolkit, is collecting interviews from practitioners (via Google Form) who have used or adapted #DLFteach Toolkit lesson plans. Look for these to be published in the coming months as well as calls to participate.

How can you get involved?

Anyone is welcome to join and participate in the Digital Library Pedagogy group and help grow the community of practitioners around teaching with digital library collections and tools. Our next Twitter chat will be on December 15 at 2:00 pm EST and will be focused on ways #DLFteach can help build community and support each other with the projects and ongoing initiatives we work on. Have you used or adapted lesson plans from the #DLFteach Toolkit 1.0? Add your voice to Practitioner Perspectives: Developing, Adapting, and Contextualizing the #DLFteach Toolkit, a forthcoming blog series! Just answer our questions on this Google Form. Additionally, please consider joining our Google group and checking out our wiki for more information about who we are and what we do.

The post 2020 DLF Forum: Building Community With DLF’s Digital Library Pedagogy Working Group appeared first on DLF.

Weeknote 45 (2020) / Mita Williams

Some things I was up to this past week:

  1. I registered for the Indigenous Mapping Workshop which will run Nov. 16-18;
  2. had meetings pertaining to servers and surveys;
  3. attended regular meetings including that of the University Library Advisory Committee, Leddy Library Department Heads, my bi-weekly meeting with Library Admin, and the WUFA Grievance Committee
  4. uploaded another batch of ETDs to the repository
  5. uploaded another batch of final edits to the OSSA Conference repository
  6. ordered books that have gone missing from the library (including Steal Like an Artist, natch) as well as titles to support the School of the Environment
  7. discussed APCs, video streaming, and the potential structure of the new Leddy Library website with various colleagues;
  8. and did an evening shift of our LiveChat Research Help service.

I don’t think I’ve said this publicly but the weekly updates from the CARL e-alert newsletter are excellent and are put together so well. From last week’s alert, I learned of this amazing project:

Community members living in Vancouver’s Downtown Eastside (DTES) have been the focal point of countless scholarly research studies and surveys over the years. Up until recently, this research has remained largely out of reach to participants and community organizations, locked away in journals and other databases that require paid subscriptions to access. Community members have said they would benefit from access to that data for evaluating program and service effectiveness, for example, or for grant writing.

The recently launched Downtown Eastside Research Access Portal (DTES RAP), a project led by the UBC Learning Exchange in partnership with UBC Library’s Irving K. Barber Learning Centre, is designed to change that.

The DTES RAP provides access to research and research-related materials relevant to Vancouver’s Downtown Eastside through an easy-to-use public interface. The portal was developed in consultation with DTES residents and community organizations through focus groups and user experience testing, and in collaboration with a number of university units.

New Downtown Eastside Research Access Portal takes collaborative approach to Open Access (UBC)

I love that this collection is centred around the needs of those who have been studied and not the needs of the researcher.

And not to center my own work, but (BUT) I was hoping to explore similar work during my last sabbatical; for a variety of reasons, it did not come to pass.


No Weeknote update next week because I’m taking a staycation!

OCLC-LIBER Open Science Discussion on Research Integrity / HangingTogether

What does research integrity mean in an ideal open science ecosystem and how can libraries contribute to heighten professional ethics and standards required by open science? The sixth session of the OCLC/LIBER Open Science Discussion series brought together a small group of engaged participants focusing on these questions.

Ideal future state

Photo by JodyHongFilms on Unsplash

One of the participants observed that there are two open science contexts where good research practices are particularly important:

(1) publication and dissemination

(2) data practices and management.

When publishing according to Open Access scenarios, researchers retain more control and copyrights over their outputs. The data underlying their research study remains available and is well documented, so that peers can verify and, when possible, reproduce the study.

In the open science environment, using the right data and using the data responsibly is key because the ecosystem is built on data and driven by artificial intelligence and digital tools.

This was the foundation of the ideal future state, to which participants added new building blocks:

  1. Respecting copyright and intellectual property rights when reusing data, graphs, images, software, or other sources is hardwired in the system and in the brains of researchers.
  2. Peer review is also a built-in process, verifying how data is used and how estimates are made.
  3. Research integrity is more than just a code of conduct document and a set of control mechanisms. Open practices – such as sharing data and open review – are instilled in students from the moment they enter university. They are based in universal values and norms that all researchers share, regardless of language and discipline-specific specialization. The ethical perspective is deeply ingrained in the academic community, where inclusion and the CARE Principles for indigenous data governance are highly valued.

These essential features assure the integrity of research and the credibility and reputation of the research enterprise – which is under constant pressure to perform by those commissioning or sponsoring research. As one participant put it:

“The pace of the research is not the pace of politicians and governments and workers, and I know the world is moving fast but we need to find a balance because otherwise we jeopardize (…) the credibility of key institutions.”

Top three challenges

Conferral of the doctoral biretta, Universidad Complutense de Madrid – Public domain

The envisioning process was somewhat clouded by concerns about the current situation and the obstacles to be overcome to achieve the ideal future state. One discussant exclaimed: “in practice, winning the hearts and minds of researchers is not to be underestimated” and another added “younger researchers are more involved in open science but they are more at risk of being caught in [the trap of] predatory practices, for example predatory journals. They are less aware of the huge problems about integrity.”

It wasn’t surprising to see 17 challenges listed during the second part of the discussion, when we asked participants to suggest obstacles to achieving the future for research integrity they had just described. Polling within the group determined that the top three challenges were:

  1. Lack of knowledge about research integrity and ethics: researchers aren’t familiar with norms and precepts
  2. Prioritizing prestige over ethics
  3. Lack of research data management planning

Many of the other challenges listed could be seen as complementary to the first two obstacles. “Lack of awareness” and “lack of common understanding” about research integrity clearly are related to “lack of knowledge.” Similarly, “incentives for researchers,” “publish or perish mentality,” and “research evaluation methods” could be associated with “prioritizing prestige over ethics.” Finally, a fourth challenge surfacing near the top, and worth mentioning, is the siloed structure across the academic campus (such as the institutional research board, departments, faculty, library, office of research), which inhibits an effective transition to the ideal future state.

Collective action

How can the library (and other) communities take collective action to address these challenges? Collaboration with other stakeholders on campus to bridge the silos was a suggestion that resonated with the group. One of the discussants gave an example from her own institution. Her library was getting questions at the end of the research cycle, about publishing images and copyrights. She went to the Research Board and encouraged them to send researchers to the library for training in copyright basics early in the process. She concluded by saying:

We need to be more deliberate about those conversations and [cross-campus] collaborations. Have them be a little more organic to the organization.

The group then discussed the crucial role of library liaisons who – as one participant expressed it – “should become an integral part of the research team. If they are more embedded in the research process, I think we can do a better job of facilitating conversations around research integrity.”

Integrity is a sensitive issue. Reaching out to other liaison librarians across disciplines (e.g. STEM vs Humanities) and learning from each other could be enlightening. Reaching out to the Research Ethics Board is another possibility. In short, librarians need to be proactive and seek out these other stakeholders. Admittedly, it is a time-consuming process, and, yes, policies at national and institutional levels would be helpful – but, ultimately, researchers need to become knowledgeable about research integrity issues and “spread the word” themselves.      

Photo by Pang Yuhao on Unsplash

In discussing the challenge of prestige being prioritized over ethics within the current system of incentives and rewards, one participant objected to the use of the term prestige. She argued prestige is what moves researchers and science forward and so, she proposed to use the term competition instead. Nowadays, she added, competition for excellence is more about rewards and career, rather than prestige. Another participant agreed: “Science gets rushed because there is so much competition.” There was a suggestion to take some of the pressure away and change it into celebration and reward. Good practices need to be rewarded.

The moderator challenged the group, asking: “To what extent can the library really have an impact on this obstacle? (…) It seems to me that ultimately that sense of reward has to come from the discipline itself (…) So I wonder (…) do librarians have much scope to influence things in regard to this particular obstacle? Or do you think that most of it will have to come from the disciplines themselves?” One of the group members offered that the best tactic for libraries to achieve impact is finding the right level in the organization that has influence on researchers and influencing that intermediary level.

The policy issue arose again while discussing the third challenge: the lack of research data management planning. Policies can help make good practices a priority. Some of the examples given were: Horizon, the EU Research and Innovation funding programme, which mandates the deposit of datasets, and a university policy institutionalizing the training of PhD students in RDM during their first year. One discussant mentioned COVID-19 as a trigger to encourage data sharing, and another the RDA COVID-19 Recommendations and Guidelines for Data Sharing, published by the Research Data Alliance.

The moderator concluded the session with some new questions to ponder: “The policy issue is quite interesting (…) Some institutions have data policies, some institutions do not (…) If [your] institution does have a data policy, how useful has that been in terms of encouraging researchers to manage their data appropriately and what kind of compliance response did you get?”

About the OCLC-LIBER Open Science Discussion Series

The discussion series is a joint initiative of OCLC Research and LIBER (the Association of European Research Libraries). It focusses on the seven topics identified in the LIBER Open Science Roadmap, and aims to guide research libraries in envisioning the support infrastructure for Open Science (OS) and their roles at local, national, and global levels. The series runs from 24 September through 5 November.

The kick-off webinar opened the forum for discussion and exploration and introduced the theme and its topics. Summaries of all seven topical small group discussions are published on the OCLC Research blog, Hanging Together. Up to now these are: (1) Scholarly Publishing, (2) FAIR research data, (3) Research Infrastructures and the European Open Science Cloud, (4) Metrics and Rewards and (5) Skills.

The post OCLC-LIBER Open Science Discussion on Research Integrity appeared first on Hanging Together.

Making Customizable Interactive Tutorials with Google Forms / Meredith Farkas

Farkas_GoogleFormsPresentation

In September, I gave a talk at Oregon State University’s Instruction Librarian Get-Together about the interactive tutorials I built at PCC last year that have been integral to our remote instructional strategy. I thought I’d share my slides and notes here in case others are inspired by what I did, and to share the amazing assessment data I recently received about the impact of these tutorials, which I’ve included in this blog post. You can click on any of the slides to see them larger and you can also view the original slides here (or below). At the end of the post are a few tutorials that you can access or make copies of.

Farkas_GoogleFormsPresentation (1)

I’ve been working at PCC for over six years now, but I’ve been doing online instructional design work for 15 years and I will freely admit that it’s my favorite thing to do. I started working at a very small rural academic library where I had to find creative and usually free solutions to instructional problems. And I love that sort of creative work. It’s what keeps me going.

Farkas_GoogleFormsPresentation (2)

I’ve actually been using survey software as a teaching tool since I worked at Portland State University. There, my colleague Amy Hofer and I used Qualtrics to create really polished and beautiful interactive tutorials for students in our University Studies program.

Farkas_GoogleFormsPresentation (3)

Farkas_GoogleFormsPresentation (4)

I also used Qualtrics at PSU and PCC to create pre-assignments for students to complete prior to an instruction session that both taught students skills and gave me formative assessment data that informed my teaching. So for example, students would watch a video on how to search for sources via EBSCO and then would try searching for articles on their own topic.

Farkas_GoogleFormsPresentation (5)

A year and a half ago, the amazing Anne-Marie Dietering led my colleagues in a day-long goal-setting retreat for our instruction program. In the end, we selected this goal, “identify new ways information literacy instruction can reach courses other than direct instruction,” which was broad enough to encompass a lot of activities people valued. For me, it allowed me to get back to my true love, online instructional design, which was awesome, because I was kind of in a place of burnout going into last Fall.

Farkas_GoogleFormsPresentation (6)

At PCC, we already had a lot of online instructional content to support our students. We even built a toolkit for faculty with information literacy learning materials they could incorporate into their classes without working with a librarian.

Farkas_GoogleFormsPresentation (7)

 

The toolkit contains lots of handouts, videos, in-class or online activities and more. But it was a lot of pieces and they really required faculty to do the work to incorporate them into their classes.

Farkas_GoogleFormsPresentation (8)

What I wanted to build was something that took advantage of our existing content, but tied it up with a bow for faculty. So they really could just take whatever it is, assign students to complete it, and know students are learning AND practicing what they learned. I really wanted it to mimic the sort of experience they might get from a library instruction session. And that’s when I came back to the sort of interactive tutorials I built at PSU.

Farkas_GoogleFormsPresentation (9)

So I started to sketch out what the requirements of the project were. Even though we have Qualtrics at PCC, I wasn’t 100% sure Qualtrics would be a good fit for this. It definitely did meet those first four criteria: we already have it, it provides the ability to embed video and for students to get a copy of the work they did, and most features of the software are ADA accessible. But I wanted both my colleagues in the library and disciplinary faculty members to be able to easily see their students’ responses and to make copies of the tutorial to personalize for a particular course. And while PCC does have Qualtrics, the majority of faculty have never used it on the back-end and many do not have accounts. So that’s when Google Forms seemed like the obvious choice and I had to give up on my fantasy of having pretty tutorials.

Farkas_GoogleFormsPresentation (10)

I started by creating a proof of concept based on an evaluating sources activity I often use in face-to-face reading and writing classes. You can view a copy of it here and can copy it if you want to use it in your own teaching.

screenshot1

In this case, students would watch a video we have on techniques for evaluating sources. Then I demonstrate the use of those techniques, which predate Caulfield’s four moves, but are not too dissimilar. So they can see how I would go about evaluating this article from the Atlantic on the subject of DACA.

screenshot2

The students then will evaluate two sources on their own and there are specific questions to guide them.

Farkas_GoogleFormsPresentation (11)

During Fall term, I showed my proof of concept to my colleagues in the library as well as at faculty department meetings in some of my liaison areas. And there was a good amount of enthusiasm from disciplinary faculty – enough that I felt encouraged to continue.

One anthropology instructor who I’ve worked closely with over the years asked if I could create a tutorial on finding sources to support research in her online Biological Anthropology classes – classes I was going to be embedded in over winter term. And I thought this was a perfect opportunity to really pilot the use of the Google Form tutorial concept and see how students do.

Farkas_GoogleFormsPresentation (12)

So I made an interactive tutorial where students go through and learn a thing, then practice a thing, learn another thing, then practice that thing. And fortunately, they seemed to complete the tutorial without difficulty and from what I heard from the instructor, they did a really good job of citing quality sources in their research paper in the course. Later in the presentation, you’ll see that I received clear data demonstrating the impact of this tutorial from the Anthropology department’s annual assessment project.

Farkas_GoogleFormsPresentation (13)
So my vision for having faculty make copies of tutorials to use themselves had one major drawback. Let’s imagine they were really successful and we let a thousand flowers bloom. Well, the problem with that is that you now have a thousand versions of your tutorials lying around and what do you do when a video is updated or a link changes or some other update is needed? I needed a way to track who is using the tutorials so that I could contact them when updates were made.

Farkas_GoogleFormsPresentation (14)

So here’s how I structured it. I created a Qualtrics form that is a gateway to accessing the tutorials. Faculty need to put in their name, email, and subject area. They then can view tutorials and check boxes for the ones they are interested in using.

Farkas_GoogleFormsPresentation (15)

 

Farkas_GoogleFormsPresentation (16)

Once they submit, they are taken to a page where they can actually copy the tutorials they want. So now, I have the contact information for the folks who are using the tutorials.

This is not just useful for updates, but possibly for future information literacy assessment we might want to do.
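
A side note for the more technically inclined: the copy-and-track step could, in principle, also be scripted with Google Apps Script rather than handled through a gateway form. The sketch below is purely hypothetical and is not part of the workflow I’ve described; the template Form ID, tracking spreadsheet ID, and sheet name are placeholder values. It copies a tutorial template in Drive, gives the requester edit access, and logs the request so you would know whom to contact when the template changes.

// Hypothetical Google Apps Script sketch (written in TypeScript, e.g. via clasp).
// Copies a tutorial Form template and records who requested it.
const TEMPLATE_FORM_ID = 'PLACEHOLDER_TEMPLATE_FORM_ID';
const TRACKING_SPREADSHEET_ID = 'PLACEHOLDER_TRACKING_SPREADSHEET_ID';

function copyTutorialFor(facultyEmail: string, courseName: string): string {
  // Duplicate the template Form in Drive and name the copy after the course.
  const template = DriveApp.getFileById(TEMPLATE_FORM_ID);
  const copy = template.makeCopy('Library Tutorial - ' + courseName);
  copy.addEditor(facultyEmail);

  // Log the request so copies can be tracked and contacted when updates are made.
  const sheet = SpreadsheetApp.openById(TRACKING_SPREADSHEET_ID).getSheetByName('Copies');
  if (!sheet) {
    throw new Error('Tracking sheet "Copies" not found');
  }
  sheet.appendRow([new Date(), facultyEmail, courseName, copy.getUrl()]);

  return copy.getUrl();
}

Something like this could be wired to a form-submit trigger, but the Qualtrics gateway plus manual copying gets the same job done without any code.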

Farkas_GoogleFormsPresentation (17)

The individual tutorials are also findable via our Information Literacy Teaching materials toolkit.

So when the pandemic came just when I was ready to expand this, I felt a little like Nostradamus or something. The timing was very, very good during a very, very bad situation. So we work with Biology 101 every single term in Week 2 to teach students about the library and about what peer review means, why it matters, and how to find peer-reviewed articles.

Farkas_GoogleFormsPresentation (18)
As soon as it became clear that Spring term was going to start online, I scrambled to create this tutorial that replicates, as well as I could, what we do in the classroom. So they do the same activity we did in-class where they look at a scholarly article and a news article and list the differences they notice. And in place of discussions, I had them watch videos and share insights. I then shared this with the Biology 101 faculty on my campus and they assigned it to their students in Week 2. It was great! [You can view the Biology 101 tutorial here and make a copy of it here]. And during Spring term I made A LOT more tutorials.

Farkas_GoogleFormsPresentation (19)

The biggest upside of using Google Forms is its simplicity and familiarity. Nearly everyone has created a Google form and they are dead simple to build. I knew that my colleagues in the library could easily copy something I made and tailor it to the courses they’re working with or make something from scratch. And I knew faculty could easily copy an existing tutorial and be able to see student responses. For students, it’s a low-bandwidth and easy-to-complete online worksheet. The barriers are minimal. And on the back-end, just like with LibGuides, there’s a feature where you can easily copy content from another Google Form.

Farkas_GoogleFormsPresentation (20)

The downsides of using Google Forms are not terribly significant. I mean, I’m sad that I can’t create beautiful, modern, sharp-looking forms, but it’s not the end of the world. The formatting features in Google Forms are really minimal. To create a hyperlink, you actually need to display the whole url. Blech. Then in terms of accessibility, there’s also no alt tag feature for images, so I just make sure to describe the picture in the text preceding or following it. I haven’t heard any complaints from faculty about having to fill out the Qualtrics form in order to get access to the tutorials, but it’s still another hurdle, however small.

Farkas_GoogleFormsPresentation (21)
This Spring, we used Google Form tutorials to replace the teaching we normally do in classes like Biology 101, Writing 121, Reading 115, and many others. We’ve also used them in addition to synchronous instruction, sort of like I did with my pre-assignments. But word about the Google Form tutorials spread and we ended up working with classes we never had a connection to before. For example, the Biology 101 faculty told the anatomy and physiology instructors about the tutorial and they wanted me to make a similar one for A&P. And that’s a key class for nursing and biology majors that we never worked with before on my campus. Lots of my colleagues have made copies of my tutorials and tailored them to the classes they’re working with or created their own from scratch. And we’ve gotten a lot of positive feedback from faculty, which REALLY felt good during Spring term when I know I was working myself to the bone.

Farkas_GoogleFormsPresentation (22)

Since giving this presentation, I learned from my colleagues in Anthropology that they actually used my work as the basis of their annual assessment project (which every academic unit has to do). They used a normed rubric to assess student papers in anthropology 101 and compared the papers of students who were in sections in which I was embedded (where they had access to the tutorial) to students in sections where they did not have an embedded librarian or a tutorial. They found that students in the class sections in which I was involved had a mean score of 43/50 and students in other classes had a mean score of 29/50. That is SIGNIFICANT!!! I am so grateful that my liaison area did this project that so validates my own work.

Farkas_GoogleFormsPresentation (23)

Here’s an excerpt from one email I received from an anatomy and physiology instructor: “I just wanted to follow up and say that the Library Assignment was a huge success! I’ve never had so many students actually complete this correctly with peer-reviewed sources in correct citation format. This is a great tool.” At the end of a term where I felt beyond worked to the bone, that was just the sort of encouragement I needed.

I made copies of a few other tutorials I’ve created so others can access them, though I’ve made many more. If there are others you’d be interested in having access to, just let me know!

Louisa Kwasigroch Appointed Interim DLF Senior Program Officer / Digital Library Federation

Louisa Kwasigroch

The Council on Library and Information Resources (CLIR) is pleased to announce the appointment of Louisa Kwasigroch as interim Digital Library Federation (DLF) senior program officer. Kwasigroch, who currently serves as CLIR’s director of outreach and engagement and has worked extensively with the DLF community, will serve as the primary point of contact for DLF member institutions and individuals until a permanent senior program officer has been appointed. During the interim period, she will also continue to serve as director of outreach and engagement.  

“I’m delighted Louisa has accepted this interim appointment,” said CLIR president Charles Henry. “With her knowledge of DLF’s engaged and active community, she will bring an empathetic and insightful continuity that will position us strategically for the next phase of DLF’s evolution.” 

Kwasigroch has been in the library field for more than 15 years, working with public, museum, and academic libraries. She has her BA in photography from Columbia College Chicago, and both an MSLIS and MBA from the University of Illinois, Urbana-Champaign. She began her career with CLIR in 2013 as a DLF program associate and was promoted to director of development and outreach in 2015 and director of outreach and engagement in 2020.

“It has been a great joy to serve the DLF community these past seven years in my roles at CLIR,” said Kwasigroch. “I look forward to continuing to support our members, working groups, and constituents while collaborating even more closely with CLIR and DLF staff, who have been doing an amazing job moving things forward.”

CLIR will resume its search for a permanent senior program officer in January 2021.

The post Louisa Kwasigroch Appointed Interim DLF Senior Program Officer appeared first on DLF.

As a Cog in the Election System: Reflections on My Role as a Precinct Election Official / Peter Murray

I may nod off several times in composing this post the day after election day. Hopefully, in reading it, you won’t. It is a story about one corner of democracy. It is a journal entry about how it felt to be a citizen doing what I could do to make other citizens’ voices be heard. It needed to be written down before the memories and emotions are erased by time and naps.

Yesterday I was a precinct election officer (PEO—a poll worker) for Franklin County—home of Columbus, Ohio. It was my third election as a PEO. The first was last November, and the second was the election aborted by the onset of the coronavirus in March. (Not sure that second one counts.) It was my first as a Voting Location Manager (VLM), so I felt the stakes were high to get it right.

  • Would there be protests at the polling location?
  • Would I have to deal with people wearing candidate T-shirts and hats or not wearing masks?
  • Would there be a crush of election observers, whether official (scrutinizing our every move) or unofficial (that I would have to remove)?

It turns out the answer to all three questions was “no”—and it was a fantastic day of civic engagement by PEOs and voters. There were well-engineered processes and policies, happy and patient enthusiasm, and good fortune along the way.

This story is going to turn out okay, but it could have been much worse. Because of the complexity of the election day voting process, last year Franklin County started allowing PEOs to do some early setup on Monday evenings. The early setup started at 6 o’clock. I was so anxious to get it right that the day before I took the printout of the polling room dimensions from my VLM packet, scanned it into OmniGraffle on my computer, and designed a to-scale diagram of what I thought the best layout would be. The real thing only vaguely looked like this, but it got us started.

A schematic showing the voting position and the flow of voters through the polling place. What I imagined our polling place would look like

We could set up tables, unpack equipment, hang signs, and other tasks that don’t involve turning on machines or breaking open packets of ballots. One of the early setup tasks was updating the voters’ roster on the electronic poll pads. As happened around the country, there was a lot of early voting activity in Franklin County, so the update file must have been massive. The electronic poll pads couldn’t handle the update; they hung at step 8-of-9 for over an hour. I called the Board of Elections and got ahold of someone in the equipment warehouse. We tried some of the simple troubleshooting steps, and he gave me his cell phone number to call back if it wasn’t resolved.

By 7:30, everything was done except for the poll pad updates, and the other PEOs were wandering around. I think it was 8 o’clock when I said everyone could go home while the two Voting Location Deputies and I tried to get the poll pads working. I called the equipment warehouse and we hung out on the phone for hours…retrying the updates based on the advice of the technicians called in to troubleshoot. I even “went rogue” towards the end. I searched the web for the messages on the screen to see if anyone else had seen the same problem with the poll pads. The electronic poll pad is an iPad with a single, dedicated application, so I even tried some iPad reset options to clear the device cache and perform a hard reboot. Nothing worked—still stuck at step 8-of-9. The election office people sent us home at 10 o’clock. Even on the way out the door, I tried a rogue option: I hooked a portable battery to one of the electronic polling pads to see if the update would complete overnight and be ready for us the next day. It didn’t, and it wasn’t.

Picture of a text with the contents: '(Franklin County Board Of Elections) Franklin County is going to ALL Paper Signature Poll Books.  Open your BUMPER PACKET and have voters sign in on the Paper Signature Poll Books.  Use the Paper Authority To Vote Slips.  Go thru your Paper Supplemental Absentee List and record AB/PROV on the Signature Line of all voters on that list.  Mark Names off of the White and Green Register of Voters Lists.' Text from Board of Elections

Polling locations in Ohio open at 6:30 in the morning, and PEOs must report to their sites by 5:30. So I was up at 4:30 for a quick shower and packing up stuff for the day. Early in the setup process, the Board of Elections sent a text that the electronic poll pads were not going to be used and to break out the “BUMPer Packets” to determine a voter’s eligibility to vote. At some point, someone told me what “BUMPer” stood for. I can’t remember, but I can imagine it is Back-Up-something-something. “Never had to use that,” the trainers told me, but it is there in case something goes wrong. Well, it is the year 2020, so was something going to go wrong?

Fortunately, the roster judges and one of the voting location deputies tore into the BUMPer Packet and got up to speed on how to use it. It is an old fashioned process: the voter states their name and address, the PEO compares that with the details on the paper ledger, and then asks the voter to sign beside their name. With an actual pen…old fashioned, right? The roster judges had the process down to a science. They kept the queue of verified voters full waiting to use the ballot marker machines. The roster judges were one of my highlights of the day.

And boy did the voters come. By the time our polling location opened at 6:30 in the morning, they were wrapped around two sides of the building. We were moving them quickly through the process: three roster tables for checking in, eight ballot-marking machines, and one ballot counter. At our peak capacity, I think we were doing 80 to 90 voters an hour. As good as we were doing, the line never seemed to end. The Franklin County Board of Elections received a grant to cover the costs of two greeters outside that helped keep the line orderly. They did their job with a welcoming smile, as did our inside greeter that offered masks and a squirt of hand sanitizer. Still, the voters kept back-filling that line, and we didn’t see a break until 12:30.

The PEOs serving as machine judges were excellent. This was the first time that many voters had seen the new ballot equipment that Franklin County put in place last year. I like this new equipment: the ballot marker prints your choices on a card that it spits out. You can see and verify your choices on the card before you slide it into a separate ballot counter. That is reassuring for me, and I think for most voters, too. But it is new, and it takes a few extra moments to explain. The machine judges got the voters comfortable with the new process. And some of the best parts of the day were when they announced to the room that a first-time voter had just put their card into the ballot counter. We would all pause and cheer.

The third group of PEOs at our location were the paper table judges. They handle all of the exceptions.

  • Someone wants to vote with a pre-printed paper ballot rather than using a machine? To the paper table!
  • The roster shows that someone requested an absentee ballot? That voter needs to vote a “provisional” ballot that will be counted at the Board of Elections office if the absentee ballot isn’t received in the mail. The paper table judges explain that with kindness and grace.
  • In the wrong location? The paper table judges would find the correct place.

The two paper table PEOs clearly had experience helping voters with the nuances of election processes.

Rounding out the team were two voting location deputies (VLD). By law, a polling location can’t have a VLD and a voting location manager (VLM) of the same political party. That is part of the checks and balances built into the system. One VLD had been a VLM at this location, and she had a wealth of history and wisdom about running a smooth polling location. For the other VLD, this was his first experience as a precinct election officer, and he jumped in with both feet to do the visible and not-so-visible things that made for a smooth operation. He reminded me a bit of myself a year ago. My first PEO position was as a voting location deputy last November. The pair handled a challenging curbside voter situation where it wasn’t entirely clear if one of the voters in the car was sick. I’d be so lucky to work with them again.

The last two hours of the open polls yesterday were dreadfully dull. After the excitement of the morning, we may have averaged a voter every 10 minutes for those last two hours. Everyone was ready to pack it in early and go home. (Polls in Ohio close at 7:30, so counting the hour early for setup and the half hour for tear-down, this was going to be a 14- to 15-hour day.) Over the last hour, I gave the PEOs little tasks to do. At one point, I said they could collect the barcode scanners attached to the ballot markers. We weren’t using them anyway because the electronic poll pads were not functional. Then, in stages (as it became evident that there was no final rush of voters), they could pack up one or two machines and put away tables. Our second-to-last voter was someone in medical scrubs who had just gotten off their shift. I scared our last voter because she walked up to the roster table at 7:29:30. Thirty seconds later, I called out that the polls were closed (as I think a VLM is required to do), and she looked at me, startled. (She got to vote, of course; that’s the rule.) She was our last voter; 799 voters in our precinct that day.

Then our team packed everything up as efficiently as they had worked all day. We had put away the equipment and signs, done our final counts, closed out the ballot counter, and sealed the ballot bin. At 8:00, we were done and waving goodbye to our host facility’s office manager. One of the VLDs rode along with me to the board of elections to drop off the ballots, and she told me of a shortcut to get there. We were among the first reporting results for Franklin County. I was home again by a quarter of 10—exhausted but proud.

I’m so happy that I had something to do yesterday. After weeks of concern and anxiety for how the election was going to turn out, it was a welcome bit of activity to ensure the election was held safely and that voters got to have their say. It was certainly more productive than continually reloading news and election results pages. The anxiety of being put in charge of a polling location was set at ease, too. I’m proud of our polling place team and that the voters in our charge seemed pleased and confident about the process.

Maybe you will find inspiration here.

  • If you voted, hopefully it felt good (whether or not the result turned out as you wanted).
  • If you voted for the first time, congratulations and welcome to the club (be on the look-out for the next voting opportunity…likely in the spring).
  • If being a poll worker sounded like fun, get in touch with your local board of elections (here is information about being a poll worker in Franklin County).

Democracy is participatory. You’ve got to tune in and show up to make it happen.

Picture of certificate from Franklin County Board of Elections in appreciation for serving as a voting location manager for the November 3, 2020, general election. Certificate of Appreciation

2020 DLF Forum: Building Community With DLF’s Data and Digital Scholarship Working Group / Digital Library Federation

Join us online! November 9th-13th, 2020. Though DLF is best known for our signature event, the annual DLF Forum, our working groups collaborate year round. Long before COVID-19 introduced the concept of “Zoom fatigue” into our lives, DLF’s working groups organized across institutional and geographical boundaries, building community while apart, to get work done. Made possible through the support of our institutional members, working groups are the efforts of a committed community of practitioners, using DLF as a framework for action, engaged in problem-solving in a variety of digital library subfields from project management and assessment to labor and accessibility.

Once we decided that the 2020 DLF Forum and affiliated events would be held in a virtual format, it meant that our working groups wouldn’t have the opportunity to meet in person for their typical working meals that take place throughout the Forum; however, this year’s virtual format means that we’ll have more new DLF Forum attendees than ever before. Because DLF’s working groups are open to ALL, regardless of whether you’re affiliated with a DLF member institution or not, we asked leaders of the DLF working groups to introduce their groups and the work they do to the new and returning members of the #DLFvillage in a series of blogs and videos.

We’ll share these working group updates in the days leading to this year’s DLF Forum. 


What is the DLF Digital Scholarship and Data Services Working Group?

The DLF Data and Digital Scholarship Working Group (DLFdds) is a continuation of two DLF groups: the eResearch Network and the Digital Scholarship Working Group. The current version of the group uses a mutual aid model, giving peer leaders and the group the ability to create topics of interest for the community. It is an evolution of the eResearch Network that DLF ran for many years.

Sara Mannheimer, Data Librarian, Montana State University, and Jason Clark, Lead for Research Informatics, Montana State University, will be facilitating the working group this year.

Our charge notes that we are “a community of practice focused on implementing research data and digital scholarship services. The group focuses on shared skill development, peer mentorship, networking, and collaboration. DLFdds aims to create a self-reliant, mutually supportive community: a network of institutions and individuals engaged in continuous learning about research data management, digital scholarship, and research support.”

Learn more about us:

Our Digital Scholarship and Data Services Wiki

Our Digital Scholarship and Data Services Google Group (listserv)

What are we working on?

We meet quarterly for discussion and activities based on DLF DDS community interest and ideas. Past topics have included: Advocacy and Promotion of Data Services and Digital Scholarship, Data Discovery/Metadata and Reusability, Collections as Data, Assessment (Metrics for success with Data Services and Digital Scholarship), etc.

Last month, we met to talk about roadmapping in a session led by Shilpa Rele, Scholarly Communication & Data Curation Librarian, Rowan University.

Screenshot of Shilpa Rele's presentation, “Building a Roadmap for Research Data Management and Digital Scholarship Services.”

View Slides and Video

These sessions have a flexible focus between RDM and DS. These 90 minutes each quarter are structured around a particular topic and usually involve: 

  • A short visit from an invited speaker on the topic
  • An in-session discussion and activity

We are basing this structure on the former eResearch Network (eRN) cohort model, which had more of a course-based mode. An example eResearch Network syllabus is linked here to give you some more perspective on the history of that group. Our new goal is to bring the best parts of eRN into this revitalized working group.

We also connect folks in the working group around consultation ideas. Consultations are working sessions that give consultees a chance to work through an in-depth, peer conversation to solve a local data services or digital scholarship question. Consultants are peers and associated experts (e.g. fellow DLFdds members, former eResearch Network participants, practitioners from other DLF member institutions). Consultations are flexible and customized according to consultee needs. 

How to contribute or get involved?

As we are working to instill a mutual aid model for our community, we are interested in your ideas. We have opened a survey to pull together these interests and welcome your thoughts.

Take our DLF DDS Interest and Ideas Survey:
https://bit.ly/dlf-dds-survey 

Beyond the survey, please feel free to join our Google Group as announcements and opportunities related to the WG and Digital Scholarship/Data Services in general will be available there. 

Our next scheduled meeting will be in December 2020. We hope to see you there!

The post 2020 DLF Forum: Building Community With DLF’s Data and Digital Scholarship Working Group appeared first on DLF.

#WebArchiveWednesday: A Community Conversation / Archives Unleashed Project

#WebArchiveWednesday Network via Netlytic

Thanks to the folks at the International Internet Preservation Consortium (IIPC), the community has an opportunity to contribute to regular Wednesday discussions using the #WebArchiveWednesday Twitter hashtag.

Engaging with this hashtag has given individuals, groups, and organizations a chance to share information, news, and projects with other professionals as well as the public. The focused hashtag also provides an opportunity to support colleagues in the field — their stories, successes, and, more broadly, web archiving discussions.

Looking at how the #WebArchiveWednesday community and conversation has evolved over the past year, the Archives Unleashed project would like to offer congratulations to IIPC for reaching this anniversary and milestone! As part of our birthday present to the hashtag, I carried out an analysis of the tweets that people have been sharing as part of #WebArchiveWednesday.

How #WebArchiveWednesday Started

The first #WebArchiveWednesday tweet was published by @TroveAustralia (the National Library of Australia’s discovery platform) to promote @NLAPAndora, Australia’s national web archiving program. Shortly after, IIPC reinvigorated the hashtag on 6 November 2019 to encourage information sharing focused on web archiving activities worldwide.

Beginnings of #WebArchiveWednesday hashtag, initiated by TroveAustralia/NLAPAndora and then adapted by IIPC

Exploring the #WebArchiveWednesday Dataset

My #WebArchiveWednesday dataset was collected using Netlytic (Gruzd, A. 2016), a text and social network analyzer, to understand and visualize the conversations and community that have grown around the hashtag.

A little more on the technical aspects of Netlytic. The tool uses the Twitter REST API v1.1 and retrieves up to 1000 of the most recent tweets every 15 minutes. Retrieval is based on a user’s search query, which can range from broad and straightforward to more complex queries with several operators. As this dataset was focused, the search query was set to tweets that included #WebArchiveWednesday.
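
Netlytic’s collection code isn’t public, but conceptually the harvest amounts to polling the v1.1 search endpoint on a schedule. Here is a minimal Python sketch of that kind of loop, assuming a Bearer token in a TWITTER_BEARER_TOKEN environment variable; the endpoint, 100-tweet count limit, and max_id pagination are standard v1.1 search behaviour, not Netlytic’s actual implementation:

import os
import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
HEADERS = {"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"}

def collect_hashtag(query="#WebArchiveWednesday", max_tweets=1000):
    """Page backwards through recent tweets matching the query (v1.1 search)."""
    tweets, max_id = [], None
    while len(tweets) < max_tweets:
        params = {"q": query, "count": 100, "tweet_mode": "extended"}
        if max_id is not None:
            params["max_id"] = max_id
        resp = requests.get(SEARCH_URL, headers=HEADERS, params=params)
        resp.raise_for_status()
        batch = resp.json().get("statuses", [])
        if not batch:
            break
        tweets.extend(batch)
        # next page: everything older than the oldest tweet seen so far
        max_id = min(t["id"] for t in batch) - 1
    return tweets[:max_tweets]

# A Netlytic-style collector would rerun this every 15 minutes and
# de-duplicate on tweet id across runs.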

The data collected was downloaded in CSV format and included 16 data points. This report focuses on 7 of those data points as a source of analysis; a short loading sketch follows the list:

  • id — assigned by Netlytic so each tweet collected has a unique id within the dataset
  • link — provides the URL of the collected tweet
  • pubdate — the time and date stamp of when the tweet was published
  • author — identifies the Twitter user of the post
  • title/description — this is the Tweet’s content; the variance between these two fields relates to whether the Tweet is truncated (title) or non-truncated (description).
  • user_location — if provided by the user, identifies location given to their Twitter account
  • lang — identifies the designated language code of the tweet
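
Working from that export is straightforward with pandas. A small loading sketch, assuming a hypothetical filename and that the export’s column names match the list above:

import pandas as pd

# Hypothetical export filename; column names follow the data points listed above
# (title and description are separate columns in the CSV).
COLUMNS = ["id", "link", "pubdate", "author",
           "title", "description", "user_location", "lang"]

df = pd.read_csv("webarchivewednesday.csv", parse_dates=["pubdate"])
df = df[[c for c in COLUMNS if c in df.columns]]

# Keep only the analysis window used in this post.
df = df[(df["pubdate"] >= "2020-04-28") & (df["pubdate"] <= "2020-10-27")]
print(len(df), "tweets from", df["author"].nunique(), "unique posters")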

While collection for the #WebArchiveWednesday hashtag continues today, the analysis below focuses on tweets captured between 28 April and 27 October 2020. There are 2543 Twitter messages included in the dataset, with 485 unique posters.

Demographics of the #WebArchiveWednesday Community

We can explore aspects such as location, language, and contributors to understand the #WebArchiveWednesday community’s composition.

Location + Language

I focused on the locations identified and defined by hashtag participants to analyze the geographic distribution of conversations. These descriptions varied from specific place names or geopolitical references to diverse descriptions that lacked geographical information.

To analyze user locations, two main steps were taken (a rough coding sketch follows the list): first, descriptions that were similar or noted as variations were grouped (e.g. Brighton / Brighton, U.K. / Brighton, UK); second, locations were coded into four categories:

  • Geo-Location — any location that could be identified on a map. In most cases, a city, province/state, and country were defined. Within this category, specific institutions (e.g. Trinity College Dublin, the British Library, or the Hague) and entire countries (e.g. Scotland) were included.
  • GeoPolitical — any location that included a broad description of a geographic area that is influenced by economic or political influence (e.g. “Southland Region, New Zealand,” the “Commonwealth,” “Darug and Gundungurra lands,” “Haudenosauneega Confederacy / Chickasaw / Miami / Shawnee / Osage Territory”)
  • Non Geo-Location — any description that was not geographically based. Some examples include: “constantly in motion,” “The 1980s”, and “Libraries around the world.”
  • Unknown — this category was used in instances where character transformations were not possible (e.g. üòë) or if acronyms had multiple possibilities (e.g. NoVa, MD, NC).
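
The grouping and coding itself was done by hand; a programmatic first pass might look something like the sketch below. The lookup tables are illustrative stand-ins, not the actual coding rules:

# Illustrative stand-ins for the manual grouping and coding rules described above.
GEO_ALIASES = {
    "brighton": "Brighton, UK",
    "brighton, u.k.": "Brighton, UK",
    "brighton, uk": "Brighton, UK",
}
GEOPOLITICAL_HINTS = ("commonwealth", "lands", "territory", "confederacy")
NON_GEO_HINTS = ("around the world", "in motion", "1980s")

def code_location(raw):
    """Return (normalized_location, category) for one user-defined location string."""
    if raw is None or not raw.strip():
        return None, "Unknown"
    loc = raw.strip()
    key = loc.lower()
    if key in GEO_ALIASES:                          # step 1: group variant spellings
        loc = GEO_ALIASES[key]
        key = loc.lower()
    if any(h in key for h in GEOPOLITICAL_HINTS):   # step 2: code into categories
        return loc, "GeoPolitical"
    if any(h in key for h in NON_GEO_HINTS):
        return loc, "Non Geo-Location"
    if not loc.isascii():                           # mojibake and similar go to Unknown
        return loc, "Unknown"
    return loc, "Geo-Location"                      # default: treat as a mappable place

Applied over the user_location column, tallying the "Geo-Location" rows gives the kind of figures reported below.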

Among #WebArchiveWednesday contributors, 78% provided a user-defined location, which after analysis revealed 168 unique locations that could be identified on a map.

From the map we can see an international representation among the #WebArchiveWednesday community, with main clusters occurring within North America, Europe, Australia and New Zealand.

While 93% of locations have fewer than five users, four main hubs can be identified: London UK, Norfolk USA, Wellington NZ, and Australia. Although a few messages were identified with Romanian (ro) and French (fr) language codes, closer inspection revealed all messages collected for this dataset were written in English.

Information Flow and Contributors

Another aspect explored: how individuals, groups, and institutions contribute to and use information within the network.

The #WebArchiveWednesday community loves to share information! Out of 2543 messages analyzed, 86% are identified as retweets, and the majority of these messages are shared between one and fifteen times. On average, a message is shared seven times by community members.
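
Netlytic reports these figures directly, but they can be re-derived from the export. A rough sketch, assuming df as loaded earlier and treating messages whose text starts with "RT @" as retweets (a common heuristic, not necessarily Netlytic’s exact definition):

# df as loaded above; "description" holds the non-truncated tweet text.
text = df["description"].fillna(df["title"]).fillna("")
is_rt = text.str.startswith("RT @")

print(f"{is_rt.mean():.0%} of {len(df)} messages are retweets")

# Approximate "times shared" by counting how often each retweeted text recurs.
share_counts = text[is_rt].value_counts()
print("average shares per retweeted message:", round(share_counts.mean(), 1))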

There are 485 unique users that participate; 60% are single-time posters to the network, either with an original tweet or a retweet. These participants reflect three types of accounts: individual, institutional/organizational, and project-based. We see a handful of influential users who have posted well over 50 messages using the hashtag; these users contribute just over 20% of the content and engagement for the #WebArchiveWednesday hashtag.

The top ten list identifies the most frequent posters using the #WebArchiveWednesday hashtag. These participants provide geographic representation from England, Ireland, the United States, and Australia. We can also identify that most of these users are individuals, as only two (NetPreserve and UKWebArchive) are organizations.

(A reminder that location was publicly indicated and is user-defined.)

There are three messages that the community engaged with significantly, all of which were retweeted more than 30 times.

#WebArchiveWednesday Text Analysis

Netlytic’s keyword analysis feature, exported in the form of a word cloud, summarizes emerging and important topics found in the #WebArchiveWednesday conversation. Word clouds visually display the frequency of words mentioned, with size correlating to frequency.

We expect certain words to appear, such as the hashtag (#WebArchiveWednesday), and closely related terms like “web archives,” “web archive,” “web archiving,” “web,” “archive,” “archives,” “archived,” “archiving,” “collection,” and “collections.” To dig deeper, we removed these words to display and explore the top 30 most frequently used words.
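
Netlytic builds the word cloud itself, but the underlying tally is simple token counting with the expected terms filtered out. A minimal sketch; the stop list mirrors the terms named above and would normally be extended with ordinary English stopwords:

import re
from collections import Counter

DOMAIN_STOPWORDS = {
    "#webarchivewednesday", "web", "archive", "archives", "archived",
    "archiving", "collection", "collections", "rt",
}

def top_keywords(texts, n=30, extra_stopwords=frozenset()):
    """Count token frequencies across tweets, dropping domain terms and short tokens."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[#@\w']+", text.lower()):
            if len(token) > 2 and token not in DOMAIN_STOPWORDS and token not in extra_stopwords:
                counts[token] += 1
    return counts.most_common(n)

# e.g. top_keywords(df["description"].fillna(""), n=30)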

Immediate themes presented through the hashtag include:

Sharing of information is the most prominent theme present and relates to shared content among participants, specifically blog posts, news, and events. While the content is diverse (blog posts, webinars, tools, projects, events, collections, and the like), the factor that ties it all together is a conscious and mediated choice to distribute to the #WebArchiveWednesday community.

We see keywords relating to the types of content shared, such as blog posts, projects, links, research, and tools. Some keywords are action-focused, like “read” and “thread,” and some are temporal, like “time” and “today.” Collectively, these keywords amount to a call to action — for participants to engage with and explore the resources shared with and by the community.

Conference hashtags: messages shared presentations from the JCDL2020 and WADL2020 conferences, in which tools demonstrated capabilities and functionality that support web archiving. For instance, one conference presentation introduced TimeMap Visualizations (TMVis), a tool that lists mementos of a single webpage and displays them as thumbnails via a timemap visualization.

Organization-specific handles such as #uklegaldeposit, #uknpld, @netpreserve, @stormyarchives, @ukwebarchive, and @webscidl relate to information distributed within the #WebArchiveWednesday network to feature organizational news, announcements, and the sharing of resources. For instance: showcasing @UKWebArchive curated collections such as their recently released science web archive collection or celebrating their 15th anniversary of web archiving; highlighting collaborative efforts with @netpreserve to establish the Novel Coronavirus web archive; and exploring @WebSciDL’s release of an updated version of #IPWB (an archival replay system), or @StormyArchives’ Hypercane tool.

Analyzing the #WebArchiveWednesday Twitter Network via Netlytic

Network analysis offers a way to visualize the social and communication structures of a community. The graph reveals 485 unique posters, represented as nodes (which could be individuals, organizations, institutions, or other entities) with 4320 ties linking participants in a communication interaction (including @mentions, @replies, or retweets).

This analysis also helps to uncover who talks to whom and the topics they are discussing. Single lines represent one-way communication, wherein a tweet from one account references (mentions) another account and forms the connection. Nodes connected in a loop (i.e. the first node connects to a second, and then in return, the second node connects back to the first) generally demonstrate dialogue.

The #WebArchiveWednesday community is representative of an information-sharing network, meaning that participants’ primary focus is to distribute information (in the form of sharing messages and retweeting) rather than engaging in conversation-based interactions. We can look to a few defining characteristics.

First, we generally see single lines connecting nodes, which represent one-way communication behaviours. This makes sense, given that 86% of messages are retweets, which, in the network, are visualized by a single line connecting two nodes. As both the density and reciprocity measures (Netlytic, 2016, a) are closer to zero (than 1), it indicates little connection between groups of participants — think of them as siloed connections — in which conversations tend to be one-sided, with little back and forth (Netlytic, 2016, b).

Second, if we take a bird’s eye view of the network diagram, in combination with a modularity measurement of 0.3576, we see a mish-mash of colours. Clusters are not distinctly spaced; we see them layered on top of one another with core nodes that act as hubs, and clusters tend to remain insular, meaning we don’t see clusters connecting to one another. Since the #WebArchiveWednesday network lacks clear divisions between clusters (which would indicate distinct communities), we can suggest that the information interactions revolve around similar topics and participants.
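
Netlytic computes these network measures internally; with the same export they can be approximated in networkx by building ties from @mentions in the tweet text. A sketch, assuming df as loaded earlier; the exact edge definition will differ somewhat from Netlytic’s:

import re
import networkx as nx
from networkx.algorithms import community

G = nx.DiGraph()
for author, text in zip(df["author"], df["description"].fillna("")):
    for mention in re.findall(r"@(\w+)", str(text)):
        if mention.lower() != str(author).lower():
            G.add_edge(author, mention)   # one directed tie per mention/reply/retweet

print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("density:", round(nx.density(G), 5))
print("reciprocity:", round(nx.reciprocity(G), 4))

# Community structure (and its modularity score) on the undirected projection.
clusters = community.greedy_modularity_communities(G.to_undirected())
print("modularity:", round(community.modularity(G.to_undirected(), clusters), 4))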

Similarly, the majority of the top ten posters create hubs for sharing within dedicated clusters. Of the top posters, 90% are found within the two main clusters — the red and pink clusters visible in the figure below. One poster is a central point of focus within the orange cluster.

Let’s take a look at the various clusters that have developed within the network. As previously mentioned, many of the clusters overlap, and while they are similar in nature, the clusters act as an organizing structure for topics that occur around specific users. Five groups of interactions are identified.

The Main Conversation(s)

At the network’s forefront, we have two clusters that interweave and overlap: a red cluster (center left) and a pink cluster (center right). Note colours used in the network diagram do not have a specific meaning; they are used to distinguish between each designated cluster.

#WebArchiveWednesday Main Clusters (red and pink)

The Red Cluster focuses on a handful of prominent users, and its conversations centre on promoting ways to use different tools and projects within the web archiving ecosystem.

The Pink Cluster has two distinct nodes, @NetPreserve and @UKwebarchive, that create hubs for connection. This cluster’s content is specific to IIPC activities and initiatives, like COVID collections, the General Assembly, and IIPC funding opportunities. There are also overlapping conversation topics we saw in the red cluster that relate to the use of tools and projects like pywb, the StormyArchives story graph, and Memento.

We can describe three additional smaller clusters.

#WebArchiveWednesday additional clusters (orange, yellow, blue)

The orange cluster has a few interconnected interactions but primarily revolves around messages sharing training materials designed and produced by the IIPC training working group. We also see individuals within the cluster sharing published blog posts on collaborative efforts across institutions to document the COVID response. A smaller hub within the cluster focuses on distributing information related to project enhancements, such as the @LosAlamosNatLab robust link web service, the extractor functionality of the Archives Unleashed Toolkit, and the capture of the 2020 Australian .au domain. One of the top ten posters appears as a hub for this cluster, acting as an information distributor. They share news and progress from several projects, including Webrecorder, Archives Unleashed Cloud, @websciDL #IPWB, and @LosAlamosNatLab. Finally, within the orange cluster, we see a core group of nodes circulate an interview between the NDSA (National Digital Stewardship Alliance) and Samantha Abrams (2016 Innovation Awards winner) on web archiving community work.

Our fourth cluster, yellow in colour, provides insights into discussions centered on three interwoven main topics:

Finally, the blue cluster highlights cohesion around New Zealand web archiving examples, including a look back at musician Lorde’s Tumblr site from 2013 and an early fan website for Taika Waititi.

Summary

  • Over the past six months, over 2500 messages and just under 500 unique users have contributed to the #WebArchiveWednesday hashtag. While most participants are individuals, we do see national and international based organizations and institutions frequently engage within the community.
  • In mapping the user-defined locations, we see an international representation of #WebArchiveWednesday participants. While there are many users from different places, we do see hubs of participants around London UK, Norfolk USA, Wellington NZ, and Australia. This makes sense, as these areas host major bodies that run national-level archiving programs.
  • Web archiving is done at an international level and includes many stakeholders and voices. While we see a broad representation geographically of individuals who either participate in web archiving activities or are promoters of such content, we don’t see language diversification within the #WebArchiveWednesday conversations.
  • Hashtag interactions and conversations primarily occur around Wednesdays, and there are no other patterns to suggest participants regularly engage in discussions or interactions outside of the dedicated time. However, we need to account for time zone differences, early/late messages, and tying in specific events to the community that fall outside of a Wednesday.
  • Our most frequent posters, who contribute approximately 20% of #WebArchiveWednesday content, are primarily individual users, with two organizations present. These participants are located in countries with highly active web archiving programs and institutions.
  • This community is focused on sharing and distributing information rather than engaging in two-way (or multi-way) conversations; information is primarily shared by single time posters.
  • Both the text and network analyses confirm the sharing nature of this hashtag-based community.
  • In exploring the top 30 keywords from the dataset, one overarching theme is present: despite the diverse content types (blogs, news, tools, events, collections, etc.), there is an underlying request for the community to explore, engage, and, in return, share resources and information.
  • Network dynamics indicate the #WebArchiveWednesday community is an information-sharing network. The community can be described as a highly dense group of participants who consistently engage in one-way communication behaviours (retweets and shares) and gather around power users, who structurally create a hub for interactions.

The #WebArchiveWednesday hashtag has offered an incredible opportunity to engage with colleagues who are partaking in web archiving activities across the globe and provides a regular opportunity to participate in information sharing. Many thanks, and congratulations to IIPC for leading this initiative. We can’t wait to see where the #WebArchiveWednesday discussions will lead!

References

Gruzd, A. (2016). Netlytic: Software for Automated Text and Social Network Analysis. Available at http://Netlytic.org

Netlytic. (2016, a). SNA Measures Plot Definitions. https://netlytic.org/network/sna/snachart.php?datatype=twitter2&datatype=twitter2&net_centralization=0.2375&net_density=0.01293&net_reciprocity=0.1241&net_modularity=0.3586&net_islands=3&net_diameter=7&net_nodes=474&net_edges=2900

Netlytic. (2016, b). Network Analysis/Visualization: Auto Clusters. https://netlytic.org/home/?page_id=2#cmtoc_anchor_id_11


#WebArchiveWednesday: A Community Conversation was originally published in Archives Unleashed on Medium, where people are continuing the conversation by highlighting and responding to this story.