Planet Code4Lib

Research Data Management looking outward from IT / Peter Sefton

This is a presentation that I gave on Wednesday the 2nd of December 2020 at the AeRO (Australian eResearch Organizations) council meeting at the request of the chair, Dr Carina Kemp.

Carina asked:

It would be really interesting to find out what is happening in the research data management space. And I’m not sure if it is too early, but maybe touch on what is happening in the EOSC Science Mesh Project.

The audience of the AeRO Council is AeRO member reps from AAF, AARNet, QCIF, CAUDIT, CSIRO, GA, TPAC, the University of Auckland, REANNZ, ADSEI, Curtin, UNSW and APO.

At this stage I was still the eResearch Support Manager at UTS - but I only had a couple of weeks left in that role.

Research Data Management looking outward from IT

In this presentation I’m going to start from a naive IT perspective about research data.


I would like to acknowledge and pay my respects to the Gundungurra and Darug people who are the traditional custodians of the land on which I live and work.

❄️

Research data is special - like snowflakes - and I don’t mean that in a mean way. Research data could be anything: any shape, any size. Researchers are also special; they are not always 100% aligned with institutional priorities - they align with their disciplines, departments and research teams.

$data_management != $storage

It’s obvious that buying storage doesn’t mean you’re doing data management well, but that doesn’t mean it’s not worth restating.

So "data storage is not data management". In fact, the opposite might be true - think about buying a laptop - do you just get one that fits all your stuff and rely on getting a bigger one every few years? Or do you get a smaller main drive and learn how to make sure that your data's actually archived somewhere? That would be managing data.

And remember that not all research data is the same “shape” as corporate data - it does not all come in database or tabular form - it can be images, video, text, with all kinds of structures.

A wall of floppy disk emoji 💾 with a researcher 👩🏽‍🔬 in the centre

There are several reasons we don’t want to just dole out storage as needed.

💰

  1. It’s going to cost a lot of money and keep costing a lot of money.

$storing != $sharing

  2. Not everything is stored in “central storage” anyway. There are share-sync services like AARNet’s CloudStor.

Emoji of lots of floppy disks with a researcher in the centre

  3. Just keeping things doesn’t mean we can find them again.

Emoji of a trophy and a bomb

So far we’ve just looked at things from an infrastructure perspective, but that’s not actually why we’re here, those of us with jobs in eResearch. I think we’re here to help researchers do excellent research with integrity, AND we need to help our institutions and researchers manage risk.

  • The Australian Code for the Responsible Conduct of Research, which all research organizations need to adhere to if we get ARC or NHMRC grants, sets out some institutional responsibilities to provide infrastructure and training

  • There are risks associated with research data: reputational, financial, and risks to individuals and communities about whom we hold data

At UTS, we’ve embraced the Research Data Management Plan (RDMP) as a way to assist in dealing with this risk. RDMPs have a mixed reputation here in Australia - some organizations have decided to keep them minimal and as streamlined as possible, but at UTS the thinking is that they can be useful in addressing a lot of the issues raised so far.

  • Where’s the data for project X when there’s an integrity investigation? Were procedures followed?

  • How much storage are we going to need?

Image showing a research data management system that can provision research services (workspaces).

Inspired by the (defunct?) Research Data Lifecycle project that was conceived by the former organizations that became the Australian Research Data Commons (ANDS, NeCTAR and RDSI), we came up with this architecture for a central research data management system (in our case the open source ReDBox system), loosely linked to a variety of research Workspaces, as we call them.

The plan is that over time, researchers can plan and budget for data management in the short, medium and long term, provision services and use the system to archive data as they go.

(Diagram by Gerard Barthelot at UTS)

Screenshot of the OCFL home page.

UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification - a way of storing files sustainably on a file system (coming soon: S3 cloud storage) so that data does not need to be migrated. I presented on this at the Open Repositories conference.
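
The gist of OCFL is that each object is a plain directory of versioned content plus an inventory recording what belongs to each version. Below is a minimal sketch of such an inventory, written as a JavaScript object literal; it assumes OCFL 1.0 conventions, and the identifier, (truncated) digests and file names are illustrative placeholders, not anything from UTS.

// A minimal sketch of an OCFL object's inventory.json, assuming OCFL 1.0
// conventions; id, digests and file names are illustrative placeholders.
const ocflInventory = {
  "id": "https://example.org/object/1234",        // placeholder object identifier
  "type": "https://ocfl.io/1.0/spec/#inventory",
  "digestAlgorithm": "sha512",
  "head": "v2",
  "manifest": {
    // content-addressed: digest -> where the bytes actually live on disk
    "9f86d08...": [ "v1/content/data/observations.csv" ],
    "3a7bd3e...": [ "v2/content/README.md" ]
  },
  "versions": {
    "v1": {
      "created": "2020-11-01T09:00:00Z",
      "message": "initial deposit",
      "state": { "9f86d08...": [ "data/observations.csv" ] }
    },
    "v2": {
      "created": "2020-12-01T09:00:00Z",
      "message": "add README",
      "state": {
        // the unchanged file is referenced, not copied, in the new version
        "9f86d08...": [ "data/observations.csv" ],
        "3a7bd3e...": [ "README.md" ]
      }
    }
  }
};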

Screenshot of the RO-Crate home page.

And at the same conference, I introduced the RO-Crate standards effort, which is a marriage between the DataCrate data packaging work we’ve been doing at UTS for a few years, and the Research Object project.
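
To make the packaging idea concrete, here is a minimal sketch of the single ro-crate-metadata.json file that sits at the root of a crate, written as a JavaScript object literal. It assumes RO-Crate 1.1 conventions, and the dataset name, date and file entries are illustrative placeholders.

// A minimal sketch of the contents of ro-crate-metadata.json, assuming
// RO-Crate 1.1 conventions; the dataset and file entries are illustrative only.
const roCrateMetadata = {
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      // the metadata descriptor: says what this file is and what it describes
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
      "about": { "@id": "./" }
    },
    {
      // the root dataset: the "crate" itself
      "@id": "./",
      "@type": "Dataset",
      "name": "Example research dataset",
      "datePublished": "2020-12-02",
      "hasPart": [ { "@id": "data/observations.csv" } ]
    },
    {
      // one data file within the crate
      "@id": "data/observations.csv",
      "@type": "File",
      "name": "Observations",
      "encodingFormat": "text/csv"
    }
  ]
};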

Screenshot of the Arkisto page.

We created the Arkisto platform to bring together all the work we’ve been doing to standardise research data metadata, and to build a toolkit for sustainable data repositories at all scales from single-collection up to institutional, and potentially discipline and national collections.

Screenshot from the Arkisto use-cases page showing an idealized data flow for sensor data using the Arkisto platform.

This is an example of one of many Arkisto deployment patterns; you can read more on the Arkisto Use Cases page.

Screenshot of a search interface to a historical dataset.

This is an example of an Arkisto-platform output: data exported from one content management system into an archive-ready RO-Crate package, which can then be made into a live site. This was created for Associate Professor Tamson Pietsch at UTS. The website is ephemeral - the data will be Interoperable and Reusable (the I and R from FAIR) via the use of RO-Crate.

Image of chickens roosting on the roof of their house rather than inside it.

Now to higher-level concerns: I built this infrastructure for my chooks (chickens) - they have a nice dry box with a roosting loft. But most of the time they roost on the roof.

We know all too well that researchers don’t always use the infrastructure we build for them - you have to get a few other things right as well.

… and have the right policy environment …

… they will come, given the right training, incentives, support etc.

One of the big frustrations I have had as an eResearch manager is that the expectations and aspirations of funders and integrity managers and so on are well ahead of our capacity to deliver the services they want - and then, when we DO get infrastructure sorted, there are organizational challenges to getting people to use it. To go back to my metaphor, we can’t just pick up the researchers from the roof and put them in their loft, or spray water on them to get them to move.

CS3MESH4EOSC T4.2

Via Gavin Kennedy and Guido Aben from AARNet, Marco La Rosa and I are helping out with this charmingly named project, which is adding data management services to storage, synchronization and sharing services. Contracts are not yet in place, so I won’t say much about this yet.

Screenshot of the CS3MESH4EOSC homepage.

https://www.cs3mesh4eosc.eu/about - EOSC is the European Open Science Cloud.

CS3MESH4EOSC - Interactive and agile sharing mesh of storage, data and applications for EOSC - aims to create an interoperable federation of data and higher-level services to enable friction-free collaboration between European researchers. CS3MESH4EOSC will connect locally and individually provided services, and scale them up at the European level and beyond, with the promise of reaching critical mass and brand recognition among European scientists that are not usually engaged with specialist eInfrastructures.

 Telescope emoji

I told Carina I would look outwards as well. What are we keeping an eye on?

     Book and factory emojis

Watch out for the book factory. Sorry, the publishing industry.

     Fox and hen emojis

The publishing industry is going to “help” the sector look after its research data.

 ©

Like, you know, they did with copyright in publications. Not only did that industry work out how to take over copyright in research works, they successfully moved from selling us hard-copy resources that we could keep in our own libraries to charging an annual rent on the literature - getting to the point where they can argue that they are essential to maintaining the scholarly record and MUST be involved in the publishing process, even when the (sometimes dubious, patchy) quality checks are performed by us, who created the literature.

It’s up to research institutions whether this story repeats with research data - remember who you’re dealing with when you sign those contracts!

Some questions I will explore post-UTS:

  • How many research institutions in Australia have secure, scalable, sustainable Research Data repositories that support FAIR practices?

  • When are we going to have national repositories?

  • How many of the eResearch Platforms funded under various programs have sustainable data management stories?

  • How much data is locked up behind APIs and will be inaccessible if the software stops running?

In the 2010s the Australian National Data Service (ANDS) funded investment in metadata stores; one of these was the ReDBox research data management platform, which is alive and well and being sustained by QCIF with a subscription maintenance service. But ANDS didn’t fund development of research data repositories.

Credits: the UTS eResearch team - Michael Lynch, Sharyn Wise, Fiona Tweedie, Moises Sacal, Marco La Rosa, Pascal Tampubolon, Anelm Motha, Michael Lake, Matthew Gaston, Simon Kruik, Weisi Chen from Intersect Australia - and our RDM project sponsor Louise Wheeler.

The work I’ve talked about here was all done with the UTS team.


Library Map Part 2 - How / Hugh Rundle

This is the second in a series of posts about my new Library Map. You probably should read the first post if you're interested in why I made the map and why it maps the particular things that it does. I expected this to be a two part series but it looks like I might make a third post about automation. The first post was about why I made the map. This one is about how.

The tech stack

The map is built with a stack of (roughly in order):

  • original Shapefile (SHP) and GeoJSON files
  • QGIS
  • GeoJSON
  • a bunch of CSV files
  • a tiny Python script
  • TopoJSON
  • some HTML, CSS and JavaScript
  • leafletjs and Leaflet plugins
  • Mapbox tile service

Boundary files

Since I primarily wanted to map things about library services rather than individual library buildings, the first thing I looked for was geodata boundary files. In Australia public libraries are usually run by local government, so the best place to start was with local government boundaries.

This is reasonably straightforward to get - either directly from data.gov.au or one of the state equivalents, or more typically by starting there and eventually getting to the website of the state department that deals with geodata. Usually the relevant file is provided as a Shapefile, which is not exactly what we need, but it is a vector format, which is a good start. I gradually added each state and data about it before moving on to the next one, but the process would basically have been the same even if I'd had all of the relevant files at the same time. There are two slight oddities at this point that may (or may not 😂) be of interest.

Australian geography interlude

The first is that, more or less alone among the jurisdictions, Queensland provides local government area (LGA) boundaries for coastal municipalities as large blocks covering the coastal waters and any islands. Other states draw boundaries around outlying islands and include each island — as an island — with the LGA that it is part of (if it's not "unincorporated", which is often the case in Victoria, for example). As a result, the national map looks a bit odd when you get to Queensland, because the overlay bulges out slightly away from the coast. I'm not sure whether this is something to do with the LGA jurisdictions in Queensland, perhaps due to the Great Barrier Reef, or whether their cartography team just couldn't be bothered drawing lines around every little island.

Secondly, when I got to Western Australia I discovered two things:

  1. The Cocos (Keeling) Islands are an Overseas Territory of Australia; and
  2. Cocos and Christmas Islands have some kind of jurisdictional relationship with Western Australia, and are included in the Western Australia LGA files.

I hadn't really considered including overseas territories, but since they were right there in the file, I figured I may as well. Later this led to a question about why Norfolk Island was missing, so I hunted around and found a Shapefile for overseas territories, which also included Cocos and Christmas Islands.

Shapefiles are a pretty standard format, but I wanted to use leafletjs, and for that we need the data to be in JSON format. I also needed to both stitch together all the different state LGA files, and merge boundaries where local councils have formed regional library services. This seems to be more common in Victoria (which has Regional Library Corporations) than other states, but it was required in Victoria, New South Wales, and Western Australia. Lastly, it turns out there are significant parts of Australia that are not actually covered by any local government at all. Some of these areas are the confusingly named national parks that are actually governed directly by States. Others are simply 'unincorporated' — the two largest areas being the Unincorporated Far West Region of New South Wales (slightly larger than Hungary), and the Pastoral Unincorporated Area that consists of almost 60% of the landmass of South Australia (slightly smaller than France).

I had no idea these two enormous areas of Australia had this special status. There's also a pretty large section of the south of the Northern Territory that contains no libraries at all, and hence has no library service. If you're wondering why there is a large section of inland Australia with no overlays on the Library Map, now you know.

QGIS and GeoJSON

So, anyway, I had to munge all these files — mostly Shape but also GeoJSON — and turn them into a single GeoJSON file. I've subsequently discovered mapshaper which I might have used for this, but I didn't know about it at the time, so I used QGIS. I find the number of possibilities presented by QGIS quite overwhelming, but there's no doubt it's a powerful tool for manipulating GIS data. I added each Shapefile as a layer, merged local government areas that needed to be merged, either deleted or dissolved (into the surrounding area) the unincorporated areas, and then merged the layers. Finally, I exported the new merged layer as GeoJSON, which is exactly what it sounds like: ordinary JSON, for geodata.

CSV data

At this point I had boundaries, but no other data. I mean, this is not actually true, because I needed information about library services in order to know which LGAs collectively operate a single library service, but in terms of the files, all I had was a polygon and a name for each area. I also had a bunch of location data for the actual library branches, in a variety of formats originally, but ultimately in comma-separated values (CSV) format. I also had a CSV file with information about each library service. The question at this point was how to associate the information I was mapping with each area. There was no way I was going to manually update 400+ rows in QGIS. Luckily, CSV and JSON are two of the most common open file formats, and they're basically just text.

Python script

I'd had a similar problem in a previous, abandoned mapping project, and had a pretty scrappy Python script lying around. With a bit more Python experience behind me, I was able to make it more flexible and simpler. If we match on the name of the library service, it's fairly straightforward to add properties to each GeoJSON feature (each feature being a library service's boundary area, and the properties being metadata about that feature). This works because the value of "properties" within each feature is itself simply a JSON object:

{"type": "FeatureCollection",
"name": "library_services",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::3857" } },
"features":
[{ "type": "Feature", "properties" : {"name": "Bulloo Shire"}
"geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [143.78691062,-28.99912088],[143.78483624,-28.99912073] ... ]]}

The Python script uses Python's built-in json and csv modules to read both the GeoJSON and the CSV file, then basically merges the data. I won't re-publish the whole thing, but the guts of it is:

# for each geojson feature, if a field in the json matches a field in the csv, add new properties to the json
for feature in json_data['features']:
    with open(csv_file, newline='') as f:
        # use DictReader so we can use the header names
        reader = csv.DictReader(f)
        for row in reader:
            # look for match
            if row[csv_match] == feature['properties'][geojson_match]:
                # create new properties in geojson
                for k in row:
                    feature['properties'][k] = row[k]

The whole thing is fewer than 40 lines long. This saved me heaps of time, but as you'll discover in my future post on automation, I later worked out how to automate the whole process every time the CSV file is updated!

TopoJSON

GeoJSON is pretty cool — it's specifically designed for web applications to read and write GIS files in a native web format. Unfortunately, GeoJSON can also get very big, especially with a project like mine where there are lots of boundaries over a large area. The final file was about 130MB — far too big for anyone to reasonably wait for it to load in their browser (and Chrome just refused to load it altogether). Because of the way I originally wrote the Python script, it actually became nearly three times the size, because I put in a two-space indent out of habit. This created literally hundreds of megabytes of empty spaces. "Pretty printing" JSON is helpful if a human needs to read it, but rather unhelpful if you want to keep the file size down.

Enter TopoJSON. To be honest I don't really understand the mathematics behind it, but TopoJSON allows you to represent the same information as GeoJSON in a much, much smaller file. I reduced a 362MB GeoJSON file (admittedly, about 200MB of it being blank spaces) to 2.6MB simply by converting it to TopoJSON! By "quantising" it (essentially, making it less accurate), the file size can be reduced even further, bringing the current file to about 2.2MB - definitely small enough to load in a browser without too much of a wait, albeit not lightning fast.
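
For a rough idea of how that conversion looks programmatically, here is a minimal Node.js sketch assuming the topojson-server npm package; the file names and the "boundaries" object key are illustrative, not the ones used for the Library Map, and the same sort of conversion can be done with command-line tooling instead.

// A minimal sketch of converting GeoJSON to quantised TopoJSON with the
// topojson-server npm package; file names and the "boundaries" key are
// illustrative placeholders.
const fs = require('fs');
const { topology } = require('topojson-server');

// read the (large) GeoJSON produced by QGIS
const geojson = JSON.parse(fs.readFileSync('boundaries.geo.json', 'utf8'));

// build a topology; the second argument quantises coordinates, trading a
// little positional accuracy for a much smaller file
const topo = topology({ boundaries: geojson }, 1e5);

// write it out without pretty-printing, so no space is wasted on indentation
fs.writeFileSync('boundaries.topo.json', JSON.stringify(topo));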

Good old HTML/CSS/JavaScript

At this point we're ready to start putting together the website to display the map. For this I used plain, vanilla HTML, CSS, and JavaScript. The web is awash with projects, frameworks and blog posts explaining how to use them to create your SPA (Single Page App)™️, but we really don't need any of that. The leaflet docs have a pretty good example of a minimal project, and my map is really not much more complex than that.

Something that did stump me for a while was how to bring the TopoJSON and CSV files into the JavaScript file as variables. I'm a self-taught JavaScript coder, and I learned it back to front: initially as a backend scripting language (i.e. nodejs) and then as the front-end browser scripting language it was originally made to be. So sometimes something a front-end developer would consider pretty basic: "How do I import a text file into my JavaScript and assign it to a variable?" takes me a while to work out. Initially I just opened the files in a text editor and copy-pasted the contents between two quote marks, made it the value of a javascript variable, and saved the whole thing as a .js file. But it was obvious even to me that couldn't possibly be the correct way to do it, even though it worked. In nodejs I would use fs.readFile() but the only thing that looked vaguely similar for front end JavaScript was FileReader — which is for reading files on a client, not a server. Finally I did a bit of research and found that the answer is to forget that the file is sitting right there in the same directory as all your JavaScript and HTML files, and just use AJAX like it's a remote file. The modern way to do this is with fetch, so instead of doing this:

// index.html
<script src="./boundaries.js" type="text/javascript"></script>
<script src="./branchesCsv.js" type="text/javascript"></script>
<script src="./ikcCsv.js" type="text/javascript"></script>
<script src="./mechanics.js" type="text/javascript"></script>
<script src="./nslaBranches.js" type="text/javascript"></script>
<script src="./load-map.js" type="text/javascript"></script>

// boundaries.js
const boundaries = `{"contents": "gigantic JSON string"}`
// branchesCsv.js
const branchesCsv = `lat,lng,town,address,phone
-35.5574374,138.6107874,Victor Harbor Public Library Service, 1 Bay Road, 08 8551 0730
... etc`

// ikcCsv.js
const ikcCsv = `lat,lng,town,address,phone
-10.159918,142.166344,Badu Island Indigenous Knowledge Centre,Nona Street ,07 4083 2100
...etc`

// mechanics.js
const mechanics = `lat,lng,town,address,phone
-37.562362,143.858541,Ballaarat Mechanics Institute,117 Sturt Street,03 5331 3042
..etc`

// nslaBranches.js
const nslaBranches = `lat,lng,town,address,phone
-37.809815,144.96513,State Library of Victoria,"328 Swanston Street, Melbourne",03 8664 7000
... etc`


// load-map.js
// boundaries and the other constants are now globals
const loanPeriod = new L.TopoJSON(boundaries, options)

...we do this:

// index.html
<script src="./load-map.js" type="text/javascript"></script>

// load-map.js
const boundaries = fetch('data/boundaries.topo.json')
.then( response => response.json())

const branchesCsv = fetch('data/public_library_locations.csv')
.then( response => response.text());

const ikcCsv = fetch('data/indigenous_knowledge_centre_locations.csv')
.then( response => response.text());

const mechanics = fetch('data/mechanics_institute_locations.csv')
.then( response => response.text());

const nslaBranches = fetch('data/nsla_library_locations.csv')
.then( response => response.text());

// fetch returns a promise so we have to let them all 'settle' before we can use the returned value
Promise.all([boundaries, branchesCsv, ikcCsv, mechanics, nslaBranches])
.then( data => {
// data is an array with the settled values of the fetch() promises
const loanPeriod = new L.TopoJSON(data[0], options)
})

In the code this doesn't necessarily look much simpler, but in terms of workflow it's a huge improvement that cuts out manually copy-pasting every time a CSV or TopoJSON file is updated, and reduces duplication and the total number of files.

So now the site consists of:

  • the original data as CSV and TopoJSON files
  • an index.html file to display the map
  • a single CSS file for basic styling
  • a single JavaScript file to load the map

Leaflet and friends

Finally it's time to actually put all of this stuff into a map using Leaflet. This is a really great JavaScript library, with pretty good documentation. Leaflet allows us to plot shapes onto a map and use JavaScript to make them interactive - including adding popups, zooming to features when they're clicked, and adding interactive overlays.

I won't try to replicate the Leaflet docs here and explain the exact steps to making my map, but I do want to highlight how two Leaflet plugins really helped with making the map work nicely. Leaflet has a fairly strong plugin collection, and plugins allow the base library to be fairly lightweight while the entire system remains quite flexible and fully featured.

I knew from the beginning it would require the whole library community to keep the map up to date over time. There are hundreds of library services across Australia, and they don't set their rules or their procurement decisions in stone forever. So it needed to be relatively simple to update the data as it changes. As we've discussed, GeoJSON also takes up a lot of space. Ideally, I could store as much data in CSV files as possible, and use them directly as the data feeding the map. Turns out there's a plugin for that - Leaflet.geoCSV. This allows us to load CSV files directly (for library building locations), and they're converted to GeoJSON on the fly. Since CSV files are much smaller than the equivalent data in JSON, this is not only easier to maintain, but also loads faster.
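
As a rough sketch (not the map's actual code), loading one of the location CSV files with Leaflet.geoCSV might look something like this; the option names follow my reading of the plugin's documentation, so treat them as assumptions.

// Rough sketch of loading branch locations with Leaflet.geoCSV; 'map' is the
// Leaflet map created elsewhere, and the option names are assumptions from
// the plugin's README rather than code from the Library Map itself.
fetch('data/public_library_locations.csv')
  .then(response => response.text())
  .then(csv => {
    const branches = L.geoCsv(csv, {
      firstLineTitles: true,   // first row of the CSV holds the column names
      fieldSeparator: ',',
      latitudeTitle: 'lat',    // which columns hold the coordinates
      longitudeTitle: 'lng',
      onEachFeature: (feature, layer) => {
        // popup with the town name taken from the CSV row
        layer.bindPopup(feature.properties.town);
      }
    });
    branches.addTo(map);
  });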

The second plugin that really helped was Leaflet.pattern. The problem this helped me to solve was how to show both the fines layer and the loan period layer at the same time. Typically for a choropleth map, different colours or shades indicate certain values. But if you add a second overlay on top of the first one, the colours no longer necessarily make much sense and combinations can be difficult or impossible to discern. Thinking about this, I figured that if I could render one layer in semi-transparent colours, and the second layer as patterns like differently angled stripes or dots, that might do the trick. Leaflet.pattern to the rescue! After some alpha-testing by my go-to volunteer Quality Assurance tester, I worked out how to make the layers always appear in the same order, regardless of which order they were added or removed, making the combination always look consistent:

Animated GIF showing overlays
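
Here is a hedged sketch of the stripes-over-colour idea using Leaflet.pattern; the class and option names (L.StripePattern, fillPattern) are from my reading of the plugin's documentation, and the variables and styling are illustrative rather than the map's real configuration.

// Rough sketch of combining a coloured overlay with a striped overlay using
// Leaflet.pattern; 'map' and 'boundariesGeojson' are placeholders created
// elsewhere, and the styles are illustrative only.
const stripes = new L.StripePattern({ angle: 45, weight: 2, color: '#000000' });
stripes.addTo(map); // patterns need to be added to the map before they can be used

// first overlay: semi-transparent colour fill (e.g. the loan period layer)
const loanPeriodLayer = L.geoJSON(boundariesGeojson, {
  style: { fillColor: '#1f78b4', fillOpacity: 0.4, weight: 1 }
});

// second overlay: striped fill (e.g. the fines layer), so both stay readable together
const finesLayer = L.geoJSON(boundariesGeojson, {
  style: { fillPattern: stripes, weight: 1 }
});

loanPeriodLayer.addTo(map);
finesLayer.addTo(map);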

Tile service

Once all of that's complete, we can load the map. But there's a problem: all we have is a bunch of vector points and lines; there's no underlying geography. For this we need a map tile service. We can use one of several options provided by OpenStreetMap, but I ended up using the commercial Mapbox service on a free plan (or at least, it will be free as long as thousands of people don't suddenly start using the map all at the same time). Their dark and light map styles really suited what I was trying to do, with minimal detail in terms of the underlying geography, but with roads and towns marked at the appropriate zoom level.

So that's it! It took a while to work it all out, but most of the complexity is in getting the data together rather than displaying the map. Once I had that done (though there is still a fair bit of information missing), I was able to pay more attention to maintaining the map into the future. That led me to look into some options for automating the merging of data from the library services CSV file (when it's updated) into the TopoJSON file, and also automatically refreshing the data on the actual map when the GitHub repository is updated. In my next post I'll explain how that works. While you're waiting for that, you can help me find missing data and make the map more accurate 😀.


Managed Solr SaaS Options / Jonathan Rochkind

I was recently looking for managed Solr “software-as-a-service” (SaaS) options, and had trouble figuring out what was out there, so I figured I’d share what I learned, even though my knowledge here is far from exhaustive and I have only looked seriously at one of the ones I found.

The only managed Solr options I found were: WebSolr; SearchStax; and OpenSolr.

Of these, I think WebSolr and SearchStax are the better known; I couldn’t find anyone with experience with OpenSolr, which perhaps is newer.

Of them all, SearchStax is the only one I actually took for a test drive, so I will have the most to say about it.

Why we were looking

We run a fairly small-scale app, whose infrastructure is currently 4 self-managed AWS EC2 instances, running respectively: 1) a Rails web app, 2) background workers for the Rails web app, 3) Postgres, and 4) Solr.

Oh yeah, there’s also a Redis running on one of those servers, either #3 with Postgres or #4 with Solr, I forget.

Currently we manage this all ourselves, right on the EC2. But we’re looking to move as much as we can into “managed” servers. Perhaps we’ll move to Heroku. Perhaps we’ll use hatchbox. Or if we do stay on AWS resources we manage directly, we’d look at things like using an AWS RDS Postgres instead of installing it on an EC2 ourselves, an AWS ElastiCache for Redis, maybe look into Elastic Beanstalk, etc.

But no matter what we do, we need a Solr, and we’d like to get it managed. Hatchbox has no special Solr support, AWS doesn’t have a Solr service, and while Heroku does have a Solr add-on, you can also use any Solr with Heroku; we’ll get to that later.

Our current Solr use is pretty small scale. We don’t run “SolrCloud mode“, just legacy ordinary Solr. We only have around 10,000 documents in there (tiny for Solr), our index size is only 70MB. Our traffic is pretty low — when I tried to figure out how low, it doesn’t seem we have sufficient logging turned on to answer that specifically but using proxy metrics to guess I’d say 20K-40K requests a day, query as well as add.

This is a pretty small Solr installation, although it is used centrally for the primary functions of the (fairly low-traffic) app. It currently runs on an EC2 t3a.small, which is a “burstable” EC2 type with only 2G of RAM. It does have two vCPUs (that is, one core with ‘hyperthreading’). The t3a.small EC2 instance only costs $14/month at the on-demand price! We know we’ll be paying more for managed Solr, but we want to get out of the business of managing servers — we no longer really have the staff for it.

WebSolr (didn’t actually try out)

WebSolr is the only managed Solr currently listed as a Heroku add-on. It is also available as a managed Solr independent of heroku.

The pricing in the heroku plans vs the independent plans seems about the same. As a heroku add-on there is a $20 “staging” plan that doesn’t exist in the independent plans. (Unlike some other heroku add-ons, no time-limited free plan is available for WebSolr). But once we go up from there, the plans seem to line up.

Starting at: $59/month for:

  • 1 million document limit
  • 40K requests/day
  • 1 index
  • 954MB storage
  • 5 concurrent requests limit (this limit is not mentioned on the independent pricing page?)

Next level up is $189/month for:

  • 5 million document limit
  • 150K requests/day
  • 4.6GB storage
  • 10 concurrent request limit (again concurrent request limits aren’t mentioned on independent pricing page)

As you can see, WebSolr has their plans metered by usage.

$59/month is around the price range we were hoping for (we’ll need two: one for staging, one for production). Our small solr is well under 1 million documents and ~1GB storage, and we do only use one index at present. However, the 40K requests/day limit I’m not sure about; even if we fit under it, we might be pushing up against it.

And the “concurrent request” limit simply isn’t one I’m even used to thinking about. On a self-managed Solr it hasn’t really come up. What does “concurrent” mean exactly in this case, how is it measured? With 10 puma web workers and sometimes a possibly multi-threaded batch index going on, could we exceed a limit of 4? Seems plausible. What happens when they are exceeded? Your Solr request results in an HTTP 429 error!

Do I need to now write the app to rescue those gracefully, or use connection pooling to try to avoid them, or something? Having to rewrite the way our app functions for a particular managed solr is the last thing we want to do. (Although it’s not entirely clear if those connection limits exist on the non-heroku-plugin plans, I suspect they do?).

And in general, I’m not thrilled with the way the pricing works here, or the price points. I am positive that for a lot of (eg) heroku customers an additional $189*2=$378/month is peanuts, not even worth accounting for, but for us, a small non-profit whose app’s traffic does not scale with revenue, that starts to be real money.

It is not clear to me if WebSolr installations (at “standard” plans) are set up in “SolrCloud mode” or not; I’m not sure what APIs exist for uploading your custom schema.xml (which we’d need to do), or if they expect you to do this only manually through a web UI (that would not be good); I’m not sure if you can upload custom solrconfig.xml settings (this may be running on a shared Solr instance with a standard solrconfig.xml?).

Basically, all of this made WebSolr not the first one we looked at.

Does it matter if we’re on heroku using a managed Solr that’s not a Heroku plugin?

I don’t think so.

In some cases, you can get a better price from a Heroku plug-in than you could get from that same vendor not on heroku, or from other competitors. But that doesn’t seem to be the case here, and other than that, does it matter?

Well, all heroku plug-ins are required to bill you by-the-minute, which is nice but not really crucial; other forms of billing could also be okay at the right price.

With a heroku add-on, your billing is combined into one heroku invoice, no need to give a credit card to anyone else, and it can be tracked using heroku tools. Which is certainly convenient and a plus, but not essential if the best tool for the job is not a heroku add-on.

And as a heroku add-on, WebSolr provides a WEBSOLR_URL heroku config/env variable automatically to code running on heroku. OK, that’s kind of nice, but it’s not a big deal to set a SOLR_URL heroku config manually referencing the appropriate address. I suppose as a heroku add-on, WebSolr also takes care of securing and authenticating connections between the heroku dynos and the solr, so we need to make sure we have a reasonable way to do this from any alternative.

SearchStax (did take it for a spin)

SearchStax’s pricing tiers are not based on metering usage. There are no limits based on requests/day or concurrent connections. SearchStax runs on dedicated-to-you individual Solr instances (I would guess running on dedicated-to-you individual (eg) EC2, but I’m not sure). Instead the pricing is based on size of host running Solr.

You can choose to run on instances deployed to AWS, Google Cloud, or Azure. We’ll be sticking to AWS (the others, I think, have a slight price premium).

While SearchStax gives you a pricing page that looks like the “new-way-of-doing-things” transparent pricing, in fact there isn’t really enough info on the public pages to see all the price points and understand what you’re getting; there is still a kind of “talk to a salesperson who has a price sheet” thing going on.

What I think I have figured out from talking to a salesperson and support is that the “Silver” plans (“Starting at $19 a month”, although we’ll say more about that in a bit) are basically: we give you a Solr, we don’t provide any technical support for Solr.

While the “Gold” plans “from $549/month” are actually about paying for Solr consultants to set up and tune your schema/index etc. That is not something we need, and $549+/month is way more than the price range we are looking for.

While the SearchStax pricing/plan pages kind of imply the “Silver” plan is not suitable for production, in fact there is no real reason not to use it for production, I think, and the salesperson I talked to confirmed that — just reaffirming that you are on your own managing the Solr configuration/setup. That’s fine; that’s what we want, we just don’t want to manage the OS or set up the Solr or upgrade it, etc. The Silver plans have no SLA, but as far as I can tell their uptime is just fine. The Silver plans only guarantee 72-hour support response time — but for the couple of support tickets I filed asking questions while under a free 14-day trial (oh yeah, that’s available), I got prompt same-day responses, and knowledgeable responses that answered my questions.

So a “silver” plan is what we are interested in, but the pricing is not actually transparent.

$19/month is for the smallest instance available, and IF you prepay/contract for a year. They call that small instance an NDN1 and it has 1GB of RAM and 8GB of storage. If you pay-as-you-go instead of contracting for a year, that already jumps to $40/month. (That price is available on the trial page).

When you are paying-as-you-go, you are actually billed per-day, which might not be as nice as heroku’s per-minute, but it’s pretty okay, and useful if you need to bring up a temporary solr instance as part of a migration/upgrade or something like that.

The next step up is an “NDN2”, which has 2G of RAM and 16GB of storage, at ~$80/month pay-as-you-go — you can find that price if you sign up for a free trial. The discounted price for an annual contract is, similar to the NDN1’s 50% discount, $40/month — that price I got only from a salesperson, so I don’t know if it’s always stable.

It only occurs to me now that they don’t tell you how many CPUs are available.

I’m not sure if I can fit our Solr in the 1G NDN1, but I am sure I can fit it in the 2G NDN2 with some headroom, so I didn’t look at plans above that — but they are available, still under “silver”, with prices going up accordingly.

All SearchStax solr instances run in “SolrCloud” mode — these NDN1 and NDN2 ones we’re looking at just run one node with one zookeeper, but still in cloud mode. There are also “silver” plans available with more than one node in a “high availability” configuration, but the prices start going up steeply, and we weren’t really interested in that.

Because it’s SolrCloud mode though, you can use the standard Solr API for uploading your configuration. It’s just Solr! So no arbitrary usage limits, no features disabled.
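
As an illustration of the “it’s just Solr” point, uploading a zipped configset through Solr’s standard Configsets API might look roughly like the sketch below (Node.js 18+ for built-in fetch); the host name, credentials and file names are placeholders, not actual SearchStax details.

// Rough sketch: upload a zipped configset via Solr's standard Configsets API
// (SolrCloud mode), using HTTP Basic Auth over HTTPS. Host, credentials and
// file names are placeholders; requires Node 18+ for built-in fetch.
const fs = require('fs');

const solrBase = 'https://example-deployment.example.com/solr';            // placeholder host
const auth = Buffer.from('solr-user:solr-password').toString('base64');    // placeholder credentials

async function uploadConfigset() {
  // my_configset.zip is a zip of the conf/ directory (solrconfig.xml, schema, etc.)
  const zip = fs.readFileSync('my_configset.zip');
  const res = await fetch(`${solrBase}/admin/configs?action=UPLOAD&name=my_configset`, {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${auth}`,
      'Content-Type': 'application/octet-stream'
    },
    body: zip
  });
  console.log(res.status, await res.text());
}

uploadConfigset();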

The SearchStax web console seems competently implemented; it lets you create and delete individual Solr “deployments”, manage accounts that can log in to the console (on the “silver” plan you only get two, or can pay $10/month/account for more, nah), and set up auth for a Solr deployment. They support IP-based authentication or HTTP Basic Auth to the Solr (no limit to how many Solr Basic Auth accounts you can create). HTTP Basic Auth is great for us, because trying to do IP-based auth from somewhere like heroku isn’t going to work. All Solrs are available over HTTPS/SSL — great!

SearchStax also has their own proprietary HTTP API that lets you do most anything, including creating/destroying deployments, managing Solr Basic Auth users, basically everything. There is some API that duplicates the SolrCloud API for adding configsets; I don’t think there’s a good reason to use it instead of the standard SolrCloud API, although their docs try to point you to it. There’s even some kind of webhook facility for alerts! (which I haven’t really explored).

Basically, SearchStax just seems to be a sane and rational managed Solr option; it has all the features you’d expect/need/want in such a service. The prices seem reasonable-ish, and generally more affordable than WebSolr, especially if you stay with “silver” and “one node”.

At present, we plan to move forward with it.

OpenSolr (didn’t look at it much)

I have the least to say about this, have spent the least time with it, after spending time with SearchStax and seeing it met our needs. But I wanted to make sure to mention it, because it’s the only other managed Solr I am even aware of. Definitely curious to hear from any users.

Here is the pricing page.

The prices seem pretty decent, perhaps even cheaper than SearchStax, although it’s unclear to me what you get. Does “0 Solr Clusters” mean that it’s not SolrCloud mode? After seeing how useful SolrCloud APIs are for management (and having this confirmed by many of my peers in other libraries/museums/archives who choose to run SolrCloud), I wouldn’t want to do without it. So I guess that pushes us to “executive” tier? Which at $50/month (billed yearly!) is still just fine, around the same as SearchStax.

But they do limit you to one Solr index; I prefer SearchStax’s model of just giving you certain host resources to do what you want with. It does say “shared infrastructure”.

Might be worth investigating, curious to hear more from anyone who did.

Now, what about ElasticSearch?

We’re using Solr mostly because that’s what various collaborative and open source projects in the library/museum/archive world have been doing for years, since before ElasticSearch even existed. So there are various open source libraries and toolsets available that we’re using.

But for whatever reason, there seem to be SO MANY MORE managed ElasticSearch SaaS available. At possibly much cheaper pricepoints. Is this because the ElasticSearch market is just bigger? Or is ElasticSearch easier/cheaper to run in a SaaS environment? Or what? I don’t know.

But there’s the controversial AWS ElasticSearch Service; there’s Elastic Cloud, “from the creators of ElasticSearch”. On Heroku, which lists one Solr add-on, there are THREE ElasticSearch add-ons listed: ElasticCloud, Bonsai ElasticSearch, and SearchBox ElasticSearch.

If you just google “managed ElasticSearch” you immediately see 3 or 4 other names.

I don’t know enough about ElasticSearch to evaluate them. There seem on first glance at pricing pages to be more affordable, but I may not know what I’m comparing and be looking at tiers that aren’t actually usable for anything or will have hidden fees.

But I know there are definitely many more managed ElasticSearch SaaS than Solr.

I think ElasticSearch probably does everything our app needs. If I were to start from scratch, I would definitely consider ElasticSearch over Solr just based on how many more SaaS options there are. While it would require some knowledge-building (I have developed a lot of knowledge of Solr and zero of ElasticSearch) and rewriting some parts of our stack, I might still consider switching to ES in the future; we don’t do anything too too complicated with Solr that would be too too hard to switch to ES, probably.

Rails auto-scaling on Heroku / Jonathan Rochkind

We are investigating moving our medium-small-ish Rails app to heroku.

We looked at both the Rails Autoscale add-on available on heroku marketplace, and the hirefire.io service which is not listed on heroku marketplace and I almost didn’t realize it existed.

I guess hirefire.io doesn’t have any kind of partnership with heroku, but it still uses the heroku API to provide an autoscale service. hirefire.io ended up looking more fully-featured and lower priced than Rails Autoscale; so the main purpose of this post is just trying to increase visibility of hirefire.io, and therefore competition in the field, which benefits us consumers.

Background: Interest in auto-scaling Rails background jobs

At first I didn’t realize there was such a thing as “auto-scaling” on heroku, but once I did, I realized it could indeed save us lots of money.

I am more interested in scaling Rails background workers than I am web workers though — our background workers are busiest when we are doing “ingests” into our digital collections/digital asset management system, so the work is highly variable. Auto-scaling up to more workers when there is ingest work piling up can give us really nice ingest throughput while keeping costs low.

On the other hand, our web traffic is fairly low and probably isn’t going to go up by an order of magnitude (non-profit cultural institution here). And after discovering that a “standard” dyno is just too slow, we will likely be running a performance-m or performance-l anyway — which likely can handle all anticipated traffic on its own. If we have an auto-scaling solution, we might configure it for web dynos, but we are especially interested in good features for background scaling.

There is a heroku built-in autoscale feature, but it only works for performance dynos, and won’t do anything for Rails background job dynos, so that was right out.

One option that could work for Rails bg jobs is the Rails Autoscale add-on on the heroku marketplace; and then we found hirefire.io.

Pricing: Pretty different

hirefire

As of now January 2021, hirefire.io has pretty simple and affordable pricing. $15/month/heroku application. Auto-scaling as many dynos and process types as you like.

hirefire.io by default can only check your app’s metrics once per minute to decide if a scaling event should occur. If you want checks more frequent than that (up to once every 15 seconds), you have to pay an additional $10/month, for $25/month/heroku application.

Even though it is not a heroku add-on, hirefire does advertise that they bill pro-rated to the second, just like heroku and heroku add-ons.

Rails autoscale

Rails autoscale has a more tiered approach to pricing that is based on number and type of dynos you are scaling. Starting at $9/month for 1-3 standard dynos, the next tier up is $39 for up to 9 standard dynos, all the way up to $279 (!) for 1 to 99 dynos. If you have performance dynos involved, from $39/month for 1-3 performance dynos, up to $599/month for up to 99 performance dynos.

For our anticipated uses… if we only scale bg dynos, I might want to scale from (low) 1 or 2 to (high) 5 or 6 standard dynos, so we’d be at $39/month. Our web dynos are likely to be performance and I wouldn’t want/need to scale more than probably 2, but that puts us into performance dyno tier, so we’re looking at $99/month.

This is of course significantly more expensive than hirefire.io’s flat rate.

Metric Resolution

Since Hirefire has an additional charge for finer than 1-minute resolution on checks for autoscaling, we’ll discuss resolution here in this section too. Rails Autoscale has the same resolution for all tiers, and I think it’s generally 10 seconds, so approximately the same as hirefire if you pay the extra $10 for increased resolution.

Configuration

Let’s look at configuration screens to get a sense of feature-sets.

Rails Autoscale

web dynos

To configure web dynos, here’s what you get, with default values:

The metric Rails Autoscale uses for scaling web dynos is time in heroku routing queue, which seems right to me — when things are spending longer in heroku routing queue before getting to a dyno, it means scale up.

worker dynos

For scaling worker dynos, Rails Autoscale can scale dyno type named “worker” — it can understand ruby queuing libraries Sidekiq, Resque, Delayed Job, or Que. I’m not certain if there are options for writing custom adapter code for other backends.

Here’s what the configuration options are — sorry these aren’t the defaults, I’ve already customized them and lost track of what the defaults are.

You can see that worker dynos are scaled based on the metric “number of jobs queued”, and you can tell it to only pay attention to certain queues if you want.

Hirefire

Hirefire has far more options for customization than Rails Autoscale, which can make it a bit overwhelming, but also potentially more powerful.

web dynos

You can actually configure as many Heroku process types as you have for autoscale, not just ones named “web” and “worker”. And for each, you have your choice of several metrics to be used as scaling triggers.

For web, I think the Queue Time metric (available as percentile or average), configured to the 95th percentile, matches what Rails Autoscale does, and is probably the best one to use unless you have a reason to use another. (“Rails Autoscale tracks the 95th percentile queue time, which for most applications will hover well below the default threshold of 100ms.“)

Here’s what configuration Hirefire makes available if you are scaling on “queue time” like Rails Autoscale; the configuration may vary for other metrics.

I think if you fill in the right numbers, you can configure to work equivalently to Rails Autoscale.
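
As a concrete (and heavily simplified) illustration of that trigger, here is a minimal sketch of a 95th-percentile queue-time check against a 100ms threshold. This is only my paraphrase of the behaviour described above, not the actual algorithm of either Hirefire or Rails Autoscale, and the function name and sample numbers are invented.

```python
# Hypothetical sketch of a p95 queue-time scaling trigger -- not either
# product's real implementation, just the idea described above.

def should_scale_up(queue_times_ms, threshold_ms=100, percentile=95):
    """Return True if roughly the 95th-percentile routing-queue time exceeds the threshold."""
    ordered = sorted(queue_times_ms)
    index = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return ordered[index] > threshold_ms

# Mostly-fast requests with a couple of slow outliers stuck in the routing queue:
print(should_scale_up([5, 8, 12, 250, 9, 7, 300, 11, 6, 10]))  # -> True
```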

worker dynos

If you have more than one heroku process type for workers — say, working on different queues — Hirefire can scale them independently, with entirely separate configuration. This is pretty handy, and I don’t think Rails Autoscale offers this.

For worker dynos, you could choose to scale based on actual “dyno load”, but I think that is mostly for types of processes where there isn’t the ability to look at “number of jobs”. A “number of jobs in queue” metric, like the one Rails Autoscale uses, makes a lot more sense to me as an effective metric for scaling queue-based bg workers.

Hirefire’s metric is slightly different from Rails Autoscale’s “jobs in queue”. For recognized ruby queue systems (a larger list than Rails Autoscale’s), it actually measures jobs in queue plus workers currently busy. So queued plus in-progress, rather than Rails Autoscale’s just queued. I have a bit of trouble wrapping my head around the implications of this, but basically it means that Hirefire’s “jobs in queue” metric strategy is intended to scale all the way to emptying your queue, or to reaching your max scale limit, whichever comes first. I think this may make sense, and work out at least as well as or perhaps better than Rails Autoscale’s approach.

Here’s what configuration Hirefire makes available for worker dynos scaling on “job queue” metric.

Since the metric isn’t the same as Rails Autoscale’s, we can’t configure this to work identically. But there are a whole bunch of configuration options, some similar to Rails Autoscale’s.

The most important thing here is that “Ratio” configuration. It may not be obvious, but with the way the hirefire metric works, you are basically meant to set this to the number of workers/threads you have on each dyno. I have it configured to 3 because my heroku worker processes use resque, with resque_pool configured to run 3 resque workers on each dyno. If you use sidekiq, set ratio to your configured concurrency — or, if you are running more than one sidekiq process, to processes*concurrency. Basically, the number of jobs your dyno can work concurrently is what you should normally set as ‘ratio’.
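
To make the ratio and the queued-plus-in-progress metric concrete, here is a minimal sketch of how I understand the arithmetic. It is only my interpretation of the behaviour described above (I'm assuming a simple ceiling of metric over ratio, clamped to your configured dyno range); the names and numbers are invented and none of this is Hirefire's actual code.

```python
import math

# Ratio = how many jobs one dyno can work concurrently.
resque_workers_per_dyno = 3          # e.g. resque_pool running 3 workers per dyno
ratio = resque_workers_per_dyno      # for sidekiq: processes * concurrency instead

def desired_worker_dynos(queued, in_progress, ratio, minimum=1, maximum=6):
    """Assumed scaling rule: enough dynos to drain queued + in-progress jobs,
    clamped to the configured min/max dyno range."""
    metric = queued + in_progress    # Hirefire counts both; Rails Autoscale counts only `queued`
    return max(minimum, min(maximum, math.ceil(metric / ratio)))

print(desired_worker_dynos(queued=10, in_progress=6, ratio=ratio))  # -> 6 (capped at the max)
print(desired_worker_dynos(queued=0, in_progress=2, ratio=ratio))   # -> 1 (scale back down)
```

Seen that way, setting ratio to your per-dyno concurrency is what makes the “scale until the queue is empty” behaviour come out right.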

Hirefire not a heroku plugin

Hirefire isn’t actually a heroku plugin. In addition to that meaning separate invoicing, there can be some other inconveniences.

Since hirefire can only interact with the heroku API, for some metrics (including the “queue time” metric that is probably optimal for web dyno scaling) you have to configure your app to log regular statistics to heroku’s “Logplex” system. This can add a lot of noise to your log, and for heroku logging add-ons that are tiered based on number of log lines or bytes, it can push you up to higher pricing tiers.

If you use Papertrail, I think you should be able to use its log filtering feature to solve this, keeping that noise out of your logs and avoiding any impact on log data transfer limits. However, if you ever have cause to look at heroku’s raw logs, that noise will still be there.

Support and Docs

I asked a couple questions of both Hirefire and Rails Autoscale as part of my evaluation, and got back well-informed and easy-to-understand answers quickly from both. Support for both seems to be great.

I would say the documentation is decent-but-not-exhaustive for both products. Hirefire may have slightly more complete documentation.

Other Features?

There are other things you might want to compare: various kinds of observability (bar charts or graphs of dynos and observed metrics) and notifications. I don’t have time to get into the details (and didn’t actually spend much time exploring them during my evaluation), but they seem to offer roughly similar features.

Conclusion

Rails Autoscale is quite a bit more expensive than hirefire.io’s flat rate, once you get past Rails Autoscale’s most basic tier (scaling no more than 3 standard dynos).

It’s true that autoscaling saves you money over not autoscaling, so even an expensive price could be considered a ‘cut’ of that, and for many ecommerce sites even $99 a month might be a drop in the bucket (!)… but the price difference with hirefire (which has a flat rate regardless of dynos) is so significant that it seems to me it would take a lot of additional features/value to justify.

And it’s not clear that Rails Autoscale has any feature advantage. In general, hirefire.io seems to have more features and flexibility.

Until 2021, hirefire.io could only analyze metrics with 1-minute resolution, so perhaps Rails Autoscale’s finer resolution was a “killer feature” then?

Honestly, I wonder if this price difference is sustained by Rails Autoscale only because most customers aren’t aware of hirefire.io, since it isn’t listed on the heroku marketplace. Single-invoice billing is handy, but probably not worth $80+ a month. I guess hirefire’s logplex noise is a bit inconvenient?

Or is there something else I’m missing? Pricing competition is good for the consumer.

And are there any other heroku autoscale solutions, that can handle Rails bg job dynos, that I still don’t know about?

Three New NDSA Members / Digital Library Federation

Since January 2021, the NDSA Coordinating Committee has unanimously voted to welcome three new members. Each of these members brings a host of skills and experience to our group. Please help us to welcome:

  • Arkivum: Arkivum is recognized internationally for its expertise in the archiving and digital preservation of valuable data and digitized assets in large volumes and multiple formats.
  • Colorado State University Libraries: Colorado State University Libraries’ digital preservation activities have focused on web archiving and targeted born-digital collecting, along with collection development and preservation guidelines for its digital repository.
  • Vassar College Libraries: Vassar College Libraries are committed to supporting a framework of sustainable access to our digital collections and to participating locally, nationally, and globally with other cultural and professional organizations and institutions in efforts to preserve, augment, and disseminate our collective documentary heritage.

Each organization has participants in one or more of the various NDSA interest and working groups, so keep an eye out for them on your calls and be sure to give them a shout out. Please join me in welcoming our new members. A complete list of NDSA members is on our website.

In future, NDSA is moving to a quarterly process for reviewing membership applications. Announcements for new members will be scheduled accordingly.

~ Nathan Tallman, Vice Chair of the NDSA Coordinating Committee

The post Three New NDSA Members appeared first on DLF.

Fedora Migration Paths and Tools Project Update: January 2021 / DuraSpace News

This is the fourth in a series of monthly updates on the Fedora Migration Paths and Tools project – please see last month’s post for a summary of the work completed up to that point. This project has been generously funded by the IMLS.

The grant team has been focused on completing an initial build of a validation utility, which will allow implementers to compare their migrated content with the original Fedora 3.x source material to verify that everything has been migrated successfully. A testable version of this tool is expected to be completed in the coming weeks, at which point the University of Virginia pilot team will test and provide feedback on the utility.

The University of Virginia team has completed a full migration of their legacy Fedora 3.2.1 repository. They also recently contributed improvements to the Fedora AWS Deployer which have been merged into the codebase. The team is now awaiting a testable version of the validation utility so they can validate their migrated content before moving on to testing this content in a newly installed Fedora 6.0 instance.

The Whitman College pilot team has completed their metadata remediation and mapping work. Their process and lessons learned will be shared in a presentation at the upcoming Code4Lib conference. Meanwhile, Islandora 8 is currently being tested with an Alpha build of Fedora 6.0, which will be used as the basis for migration testing for the Whitman College pilot. Work is currently being done in parallel to install Islandora 8 using ISLE and complete work on a new theme. Due to the impending end-of-life date of Drupal 8 the team decided to proceed directly to Drupal 9, and the theme needed to be updated accordingly. Fortunately, the transition from Drupal 8 to 9 is relatively minor.

Next month we plan to use the validation utility to validate the University of Virginia migration before moving on to testing the migrated data in Fedora 6.0 and updating the application as needed. For the Whitman College pilot, once the Islandora 8 with Fedora 6.0 installation is complete we will be able to run a series of test migrations and update the utilities and application as necessary in order to satisfy functional requirements.

Stay tuned for future updates!

 

The post Fedora Migration Paths and Tools Project Update: January 2021 appeared first on Duraspace.org.

Open Knowledge Justice Programme takes new step on its mission to ensure algorithms cause no harm / Open Knowledge Foundation

Today we are proud to announce a new project for the Open Knowledge Justice Programme – strategic litigation. This might mean we will go to court to make sure public impact algorithms are used fairly, and cause no harm. But it will also include advocacy in the form of letters and negotiation. 

The story so far

Last year, Open Knowledge Foundation made a commitment to apply our skills and network to the increasingly important topics of artificial intelligence (AI) and algorithms.

As a result, we launched the Open Knowledge Justice Programme in April 2020. Our  mission is to ensure that public impact algorithms cause no harm.

Public impact algorithms have four key features:

  • they involve automated decision-making
  • using AI and algorithms
  • by governments and corporate entities and
  • have the potential to cause serious negative impacts on individuals and communities.

We aim to make public impact algorithms more accountable by equipping legal professionals, including campaigners and activists, with the know-how and skills they need to challenge the effects of these technologies in their practice. We also work with those deploying public impact algorithms to raise awareness of the potential risks and build strategies for mitigating them. We’ve had some great feedback from our first trainees! 

Why are we doing this? 

Strategic litigation is more than just winning an individual case. Strategic litigation is ‘strategic’ because it plays a part in a larger movement for change. It does this by raising awareness of the issue, changing public debate, collaborating with others fighting for the same cause and, when we win (hopefully!) making the law fairer for everyone. 

Our strategic litigation activities will be grounded in the principle of openness because public impact algorithms are overwhelmingly deployed opaquely. This means that the experts who could unpick why and how AI and algorithms are causing harm are unable to do so, and the technology escapes scrutiny. 

Vendors of the software say they can’t release the software code they use because it’s a trade secret. This proprietary knowledge, although used to justify decisions potentially significantly impacting people’s lives, remains out of our reach. 

We’re not expecting all algorithms to be open. Nor do we think that would necessarily be useful. 

But we do think it’s wrong that governments can purchase software and not be transparent about key points of accountability such as its objectives, an assessment of the risk that it will cause harm, and its accuracy.

Openness is one of our guiding principles in how we’ll work too. As far as we are able, we’ll share our cases for others to use, re-use and modify for their own legal actions, wherever they are in the world. We’ll share what works, and what doesn’t, and make learning resources to make achieving algorithmic justice through legal action more readily achievable. 

We’re excited to announce our first case soon, so stay tuned! Sign up to our mailing list or follow the Open Knowledge Justice Programme on Twitter to receive updates.

ISP Monopolies / David Rosenthal

For at least the last three years (It Isn't About The Technology) I've been blogging about the malign effects of the way the FAANGs dominate the Web and the need for anti-trust action to mitigate them. Finally, with the recent lawsuits against Facebook and Google, some action may be in prospect. I'm planning a post on this topic. But when it comes to malign effects of monopoly I've been ignoring the other monopolists of the Internet, the telcos.

An insightful recent post by John Gilmore to Dave Farber's IP list sparked a response from Thomas Leavitt and some interesting follow-up e-mail. Gilmore was involved in pioneering consumer ISPs, and Leavitt in pioneering Web hosting. Both attribute the current sorry state of Internet connectivity in the US to the lack of effective competition. They and I differ somewhat on how the problem could be fixed. Below the fold I go into the details.

I've known Gilmore since the early days of Sun Microsystems, and I'm grateful for good advice he gave me at a critical stage of my career. He has a remarkable clarity of vision and a list of achievements that includes pioneering paid support for free software (the "Red Hat" model) at Cygnus Support, and pioneering consumer Internet Service Providers (ISPs) with The Little Garden. Because in those dial-up days Internet connectivity was a commercial product layered on regulated infrastructure (the analog telephone system) there was no lock-in. Consumers could change providers simply by changing the number their modem dialed.

This experience is key to Gilmore's argument. He writes:
The USA never had "network neutrality" before it was "suspended". What the USA had was 3,000 ISPs. So if an ISP did something unfriendly to its customers, they could just stop paying the bad one, and sign up with a different ISP that wouldn't screw them. That effectively prevented bad behavior among ISPs. And if the customer couldn't find an ISP that wouldn't screw them, they could START ONE THEMSELVES. I know, because we did exactly that in the 1990s.

Anyone could start an ISP because by law, everyone had tariffed access to the same telco infrastructure (dialup phone lines, and leased lines at 56 kbit/sec or 1.544 Mbit/sec or 45 Mbit/sec). You just called up the telco and ordered it, and they sent out techs and installed it. We did exactly that, plugged it into our modems and routers and bam, we were an ISP: "The Little Garden".
I was an early customer of The Little Garden. A SPARCstation, a SCSI disk and a modem sat on my window-ledge. The system dialed a local, and thus free, number and kept the call up 24/7, enabling me to register a domain and start running my own mail server. Years later I upgraded to DSL with Stanford as my ISP. As Gilmore points out, Stanford could do this under the same law:
Later, DSL lines required installing equipment in telco central offices, at the far end of the wire that leads to your house. But the telcos were required by the FCC to allow competing companies to do that. Their central office buildings were 9/10th empty anyway, after they had replaced racks of mechanical relays with digital computers.
Gilmore explains how this competitive market was killed:
The telcos figured this out, and decided they'd rather be gatekeepers, instead of being the regulated monopoly that gets a fixed profit margin. Looking ahead, they formally asked the FCC to change its rule that telcos had to share their infrastructure with everybody -- but only for futuristic optical fibers. They whined that "FCC wants us to deploy fiber everywhere, but we won't, unless we get to own it and not share it with our competitors." As usual, the regulated monopoly was great at manipulating the public interest regulators. The FCC said, "Sure, keep your fibers unshared." This ruling never even mentioned the Internet, it is all about the physical infrastructure. If the physical stuff is wires, regulated telcos have to share it; if it's glass, they don't.

The speed of dialup maxed out at 56 kbit/sec. DSL maxed out at a couple of megabits. Leased lines worked to 45 Mbit/sec but cost thousands of dollars per month. Anything over that speed required fiber, not wire, at typical distances. As demand for higher Internet speeds arose, any ISP who wanted to offer a faster connection couldn't just order one from the telco, because the telco fibers were now private and unshared. If you want a fiber-based Internet connection now, you can't buy it from anybody except the guys who own the fibers -- mostly the telcos. Most of the 3,000 ISPs could only offer slow Internet access, so everybody stopped paying them. The industry consolidated down to just one or a few businesses per region -- mostly the telcos themselves, plus the cable companies that had built their own local monopoly via city government contracts. Especially lucky regions had maybe one other competitor, like a Wireless ISP, or an electrical co-op that ran fibers on its infrastructure.
Leavitt makes a bigger point than Gilmore's:
The ONLY reason the Internet exists as we know it (mass consumer access) was the regulatory loophole which permitted the ISP industry to flourish in the 1990s. The telcos realized their mistake, as John said, and made sure that there wasn't going to be a repeat of that, so with each generation (DSL, fiber), they made it more and more difficult to access their networks, with the result that John mentions--almost no consumer choice, for consumers or business. Last office I rented, there was one choice of Internet provider: the local cable monopoly, which arbitrarily wanted to charge me much more ($85/mo) to connect my office than it did the apartments upstairs in the same building ($49). As is the case in most of that county; the only alternatives were a few buildings and complexes wired up by the two surviving local ISPs, and a relatively expensive WISP.
Gilmore concludes:
The telcos' elimination of fiber based competition, and nothing else, was the end of so-called "network neutrality". The rest was just activists, regulators and legislators blathering. There never was an enforceable federal regulatory policy of network neutrality, so the FCC could hardly suspend it. If the FCC actually wanted US customers to have a choice of ISPs, they would rescind the FIBER RULE. And if advocates actually understood how only competition, not regulation, restrains predatory behavior, they would ask FCC for the fiber rule to be rescinded, so a small ISP company could rent the actual glass fiber that runs from the telco to (near or inside) your house, for the actual cost plus a regulated profit. Then customers could get high speed Internet from a variety of vendors at a variety of prices and terms. So far neither has happened.
Leavitt shows the insane lengths we are resorting to in order to deliver a modicum of competition in the ISP market:
It's ridiculous that it is going to take sending 10s of thousands of satellites into orbit to restore any semblance of competitiveness to the ISP market, when we've had a simple regulatory fix all along. It's not like the telco/cable monopolies suffered as a result of competition... in fact, it created the market they now monopolize. Imagine all the other opportunities for new markets that have been stifled by the lack of competition in the ISP market over the last two decades!
I have been, and still am, an exception to Gilmore's and Leavitt's experiences. Palo Alto owns its own utilities, a great reason to live there. In September 2001 Palo Alto's Fiber To The Home trial went live, and I was one of 67 citizens who got a 10Mbit/s bidirectional connection, with the city Utilities as our ISP. We all loved the price, the speed and the excellent customer service. The telcos got worried and threatened to sue the Utilities if it expanded the service. The City was on safe legal ground, but that is what they had thought previously when they lost a $21.5M lawsuit as part of the fallout from the Enron scandal. Enron's creditors claimed that the Utilities had violated their contract because they stopped paying Enron. The Utilities did so because Enron became unable to deliver them electricity.

So, when the trial ended after I think five years, we loved it so much that we negotiated with Motorola to take over the equipment and found an upstream ISP. But the Utilities were gun-shy and spent IIRC $50K physically ripping out the fiber and trashing the equipment. Since then, Palo Alto's approach to municipal fiber has been a sorry story of ineffective dithering.

Shortly after we lost our fiber, Stanford decided to stop providing staff with DSL, but we again avoided doing business with the telcos. We got DSL and phone service from Sonic, a local ISP that was legally enabled to rent access to AT&T's copper. It was much slower than Comcast or AT&T, but the upside was Sonic's stellar customer service and four static IP addresses. That kept us going quite happily until COVID-19 struck and we had to host our grandchildren for their virtual schools. DSL was not up to the job.

Fortunately, it turned out that Sonic had recently been able to offer gigabit fiber in Palo Alto. Sonic in its North Bay homeland has been deploying its own fiber, as has Cruzio in its Santa Cruz homeland. Here they rent access to AT&T's fiber in the same way that they rented access to the copper. So, after a long series of delays caused by AT&T's inability to get fiber through the conduit under the street that held their copper, we have gigabit speed, home phone and Sonic's unmatched customer service all for $80/month.

As a long-time Sonic customer, I agree with what the Internet Advisor website writes:
Sonic has maintained a reputation as not only a company that delivers a reliable high-speed connection to its customers but also a company that stands by its ethics. Both Dane Jasper and Scott Doty have spoken up on numerous occasions to combat the ever-growing lack of privacy on the web. They have implemented policies that reflect this. In 2011, they reduced the amount of time that they store user data to just two weeks in the face of an ever-growing tide of legal requests for its users’ data. That same year, Sonic alongside Google fought a court order to hand over email addresses who had contacted and had a correspondence with Tor developer and Wikileaks contributor Jacob Applebaum. When asked why, CEO Dane Jasper responded that it was “rather expensive, but the right thing to do.”

Sonic has made a habit of doing the right thing, both for its customers and the larger world. It’s a conscientious company that delivers on what is promised and goes the extra mile for its subscribers.
Leavitt explained in e-mail how Sonic's exception to Gilmore's argument came about:
Sonic is one of the few independent ISPs that's managed to survive the regulatory clampdown via stellar customer service and customers willing to go out of their way to support alternative providers, much like Cruzio in my home town of Santa Cruz. They cut some kind of reseller deal with AT&T back in 2015 that enabled them to offer fiber to a limited number of residents, and again, like Cruzio, are building out their own fiber network, but according to [this site], fiber through them is potentially available to only about 400,000 customers (in a state with about 13 million households and 1 million businesses); it also reports that they are the 8th largest ISP in the nation, despite being a highly regional provider with access available to only about 3 million households. This says everything about how monopolistic and consolidated the ISP market is, given the number of independent cable and telco companies that existed in previous decades, the remaining survivors of which are all undoubtedly offering ISP services.

I doubt Sonic's deal with AT&T was much more lucrative than the DSL deals Santa Cruz area ISPs were able to cut.
Gilmore attempted to build a fiber ISP in his hometown, San Francisco:
Our model was to run a fiber to about one person per block (what Cruzio calls a "champion") and teach them how to run and debug 1G Ethernet cables down the back fences to their neighbors, splitting down the monthly costs. This would avoid most of the cost of city right-of-way crud at every house, which would let us and our champions fiber the city much more broadly and quickly. And would train a small army of citizens to own and manage their own infrastructure.
For unrelated reasons it didn't work out, but it left Gilmore with the conviction that, absent repeal of the FIBER rule, ISP-owned fiber is the way to go. Especially in rural areas this approach has been successful; a recent example was described by Jon Brodkin in Jared Mauch didn't have good broadband—so he built his own fiber ISP. Leavitt argues:
I'd like to see multiple infrastructure providers, both private for profit, and municipally sponsored non-profit public service agencies, all with open access networks; ideally, connecting would be as simple as it was back in the dial up days. I think we need multiple players to keep each other "honest". I do agree that a lot of the barriers to building out local fiber networks are regulatory and process, as John mentions. The big incumbent players have a tremendous advantage navigating this process, and the scale to absorb the overhead of dealing with them in conjunction with the capital outlays (which municipalities also have).
I think we all agree that "ideally, connecting would be as simple as it was back in the dial up days". How to make this happen? As Gilmore says, there are regulatory and process costs as well as the cost of pulling the fiber. So if switching away from a misbehaving ISP involves these costs, there is going to be a significant barrier. It isn't going to be "as simple as it was back in the dial up days" when the customer could simply re-program their modem.

My experience of municipal fiber leads me to disagree with both Gilmore and Leavitt. For me, the key is to separate the provision of fiber from the provision of Internet services. Why would you want to switch providers?
  • Pretty much the only reason why you'd want to switch fiber providers is unreliability. But, absent back-hoes, fiber is extremely reliable.
  • There are many reasons why you'd want to switch ISPs, among them privacy, bandwidth caps, price increases.
Municipal fiber provision is typically cheap, because the municipality is the regulator and controls the permitting process itself, and because it is accountable to its voters. And if it simply provides the equivalent of an Ethernet cable to a marketplace of ISPs, each of them will be paying the same for their connectivity. So differences in the price of ISP service will reflect the features and quality of their service offerings.

The cost of switching ISPs would be low, simply reprogramming the routers at each end of the fiber. The reason the telcos want to own the fiber isn't because owning fiber as such is a good business, it is to impose switching costs and thus lock in their customers. We don't want that. But equally we don't want switching ISPs to involve redundantly pulling fiber, because that imposes switching costs too. The only way to make connecting "as simple as it was back in the dial up days" is to separate fiber provision from Internet service provision, so that fiber gets pulled once and rented to competing ISPs. If we are going to have a monopoly at the fiber level, I'd rather have a large number of small monopolies than the duopoly of AT&T and Comcast. And I'd rather have the monopoly accountable to voters.

Virtual 2020 NDSA Digital Preservation recordings available online! / Digital Library Federation

Session recordings from the virtual 2020 NDSA Digital Preservation conference are now available on NDSA’s YouTube channel, as well as on Aviary. The full program from Digital Preservation 2020: Get Active with Digital Preservation, which took place online November 12, 2020, is free and open to the public.

NDSA is an affiliate of the Digital Library Federation (DLF) and the Council on Library and Information Resources (CLIR). Each year, NDSA’s annual Digital Preservation conference is held alongside the DLF Forum and acts as a crucial venue for intellectual exchange, community-building, development of good practices, and national agenda-setting for digital stewardship.

Enjoy,

Tricia Patterson; DigiPres 2020 Vice-Chair, 2021 Chair

The post Virtual 2020 NDSA Digital Preservation recordings available online! appeared first on DLF.

MarcEdit 7.5.x/MacOS 3.5.x Timelines / Terry Reese

I sent this to the MarcEdit Listserv to provide info about my thoughts around timelines related to the beta and release.  Here’s the info.

Dear All,

As we are getting close to Feb. 1 (when I’ll make the 7.5 beta build available for testing) – I wanted to provide information about the update process going forward.

Feb. 1:

  1. MarcEdit 7.5 Download will be released.  This will be a single build that includes both the 32 and 64 bit builds and dependencies, and can be installed with either admin or non-admin rights.
    1. I expect to be releasing new builds weekly – with the goal of taking the beta tag off the build no later than April 1.
  2. MarcEdit 7.3.x
    1. I’ll be providing updates for 7.3.x till 7.5 comes out of beta.  This will fold in some changes (mostly bug fixes) when possible. 
  3. MarcEdit MacOS 3.2.x
    1. I’ll be providing Updates for MacOS 3.2.x till 3.5 is out and out of beta
  4. MarcEdit MacOS 3.5.x Beta
    1. Once MarcEdit 7.5.x beta is out, I’ll be looking to push a 3.5.x beta by mid-late Feb.  Again, with the idea of taking the beta tag off by April (assuming I make the beta timeline)

March 2021

  1. MarcEdit MacOS 3.5.x beta will be out and active (with likely weekly builds)
  2. MarcEdit 7.5.x beta – testing assessed and then determine how long the beta process continues (with April 1 being the end bookend date)
  3. MarcEdit 7.3.x – Updates continue
  4. MarcEdit MacOS 3.2.x – updates continue

April 2021

  1. MarcEdit 7.5.x comes out of Beta
  2. MarcEdit 7.3.x is deprecated
  3. MarcEdit MacOS 3.5.x beta assessed – end bookend date is April 15th if above timelines are met

May 2021

  1. MarcEdit MacOS 3.5.x is out of beta
  2. MarcEdit MacOS 3.2.x is deprecated

Let me know if you have questions.

A new font for the blog / Jez Cope

I’ve updated my blog theme to use the quasi-proportional fonts Iosevka Aile and Iosevka Etoile. I really like the aesthetic, as they look like fixed-width console fonts (I use the true fixed-width version of Iosevka in my terminal and text editor) but they’re actually proportional which makes them easier to read.
https://typeof.net/Iosevka/

Training a model to recognise my own handwriting / Jez Cope

If I’m going to train an algorithm to read my weird & awful writing, I’m going to need a decent-sized training set to work with. And since one of the main things I want to do with it is to blog “by hand” it makes sense to focus on that type of material for training. In other words, I need to write out a bunch of blog posts on paper, scan them and transcribe them as ground truth. The added bonus of this plan is that after transcribing, I also end up with some digital text I can use as an actual post — multitasking!

So, by the time you read this, I will have already run it through a manual transcription process using Transkribus to add it to my training set, and copy-pasted it into emacs for posting. This is a fun little project because it means I can:

  • Write more by hand with one of my several nice fountain pens, which I enjoy
  • Learn more about the operational process some of my colleagues go through when digitising manuscripts
  • Learn more about the underlying technology & maths, and how to tune the process
  • Produce more lovely content! For you to read! Yay!
  • Write in a way that forces me to put off editing until after a first draft is done and focus more on getting the whole of what I want to say down.

That’s it for now — I’ll keep you posted as the project unfolds.

Addendum

Tee hee! I’m actually just enjoying the process of writing stuff by hand in long-form prose. It’ll be interesting to see how the accuracy turns out and whether I need to be more careful about neatness. Will it be better or worse than the big but generic models used by Samsung Notes or OneNote? Maybe I should include some stylus-written text for comparison.

Making Customizable Interactive Tutorials with Google Forms / Meredith Farkas

Farkas_GoogleFormsPresentation

In September, I gave a talk at Oregon State University’s Instruction Librarian Get-Together about the interactive tutorials I built at PCC last year that have been integral to our remote instructional strategy. I thought I’d share my slides and notes here in case others are inspired by what I did and to share the amazing assessment data I recently received about the impact of these tutorials that I included in this blog post. You can click on any of the slides to see them larger and you can also view the original slides here (or below). At the end of the post are a few tutorials that you can access or make copies of.

Farkas_GoogleFormsPresentation (1)

I’ve been working at PCC for over six years now, but I’ve been doing online instructional design work for 15 years and I will freely admit that it’s my favorite thing to do. I started working at a very small rural academic library where I had to find creative and usually free solutions to instructional problems. And I love that sort of creative work. It’s what keeps me going.

Farkas_GoogleFormsPresentation (2)

I’ve actually been using survey software as a teaching tool since I worked at Portland State University. There, my colleague Amy Hofer and I used Qualtrics to create really polished and beautiful interactive tutorials for students in our University Studies program.

Farkas_GoogleFormsPresentation (3)

Farkas_GoogleFormsPresentation (4)

I also used Qualtrics at PSU and PCC to create pre-assignments for students to complete prior to an instruction session that both taught students skills and gave me formative assessment data that informed my teaching. So for example, students would watch a video on how to search for sources via EBSCO and then would try searching for articles on their own topic.

Farkas_GoogleFormsPresentation (5)

A year and a half ago, the amazing Anne-Marie Dietering led my colleagues in a day-long goal-setting retreat for our instruction program. In the end, we ended up selecting this goal, “identify new ways information literacy instruction can reach courses other than direct instruction,” which was broad enough to encompass a lot of activities people valued. For me, it allowed me to get back to my true love, online instructional design, which was awesome, because I was kind of in a place of burnout going into last Fall.

Farkas_GoogleFormsPresentation (6)

At PCC, we already had a lot of online instructional content to support our students. We even built a toolkit for faculty with information literacy learning materials they could incorporate into their classes without working with a librarian.

Farkas_GoogleFormsPresentation (7)

 

The toolkit contains lots of handouts, videos, in-class or online activities and more. But it was a lot of pieces and they really required faculty to do the work to incorporate them into their classes.

Farkas_GoogleFormsPresentation (8)

What I wanted to build was something that took advantage of our existing content, but tied it up with a bow for faculty. So they really could just take whatever it is, assign students to complete it, and know students are learning AND practicing what they learned. I really wanted it to mimic the sort of experience they might get from a library instruction session. And that’s when I came back to the sort of interactive tutorials I built at PSU.

Farkas_GoogleFormsPresentation (9)

So I started to sketch out what the requirements of the project were. Even though we have Qualtrics at PCC, I wasn’t 100% sure Qualtrics would be a good fit for this. It definitely did meet those first four criteria given that we already have it, it provides the ability to embed video, for students to get a copy of the work they did, and most features of the software are ADA accessible. But I wanted both my colleagues in the library and disciplinary faculty members to be able to easily see the responses of their students and to make copies of the tutorial to personalize for the particular course. And while PCC does have Qualtrics, the majority of faculty have never used it on the back-end and many do not have accounts. So that’s when Google Forms seemed like the obvious choice and I had to give up on my fantasy of having pretty tutorials.

Farkas_GoogleFormsPresentation (10)

I started by creating a proof of concept based on an evaluating sources activity I often use in face-to-face reading and writing classes. You can view a copy of it here and can copy it if you want to use it in your own teaching.

screenshot1

In this case, students would watch a video we have on techniques for evaluating sources. Then I demonstrate the use of those techniques, which predate Caulfield’s four moves, but are not too dissimilar. So they can see how I would go about evaluating this article from the Atlantic on the subject of DACA.

screenshot2

The students then will evaluate two sources on their own and there are specific questions to guide them.

Farkas_GoogleFormsPresentation (11)

During Fall term, I showed my proof of concept to my colleagues in the library as well as at faculty department meetings in some of my liaison areas. And there was a good amount of enthusiasm from disciplinary faculty – enough that I felt encouraged to continue.

One anthropology instructor who I’ve worked closely with over the years asked if I could create a tutorial on finding sources to support research in her online Biological Anthropology classes – classes I was going to be embedded in over winter term. And I thought this was a perfect opportunity to really pilot the use of the Google Form tutorial concept and see how students do.

Farkas_GoogleFormsPresentation (12)

So I made an interactive tutorial where students go through and learn a thing, then practice a thing, learn another thing, then practice that thing. And fortunately, they seemed to complete the tutorial without difficulty and from what I heard from the instructor, they did a really good job of citing quality sources in their research paper in the course. Later in the presentation, you’ll see that I received clear data demonstrating the impact of this tutorial from the Anthropology department’s annual assessment project.

Farkas_GoogleFormsPresentation (13)
So my vision for having faculty make copies of tutorials to use themselves had one major drawback. Let’s imagine they were really successful and we let a thousand flowers bloom. Well, the problem with that is that you now have a thousand versions of your tutorials lying around and what do you do when a video is updated or a link changes or some other update is needed? I needed a way to track who is using the tutorials so that I could contact them when updates were made.

Farkas_GoogleFormsPresentation (14)

So here’s how I structured it. I created a Qualtrics form that is a gateway to accessing the tutorials. Faculty need to put in their name, email, and subject area. They then can view tutorials and check boxes for the ones they are interested in using.

Farkas_GoogleFormsPresentation (15)

 

Farkas_GoogleFormsPresentation (16)

Once they submit, they are taken to a page where they can actually copy the tutorials they want. So now I have the contact information for the folks who are using the tutorials.

This is not just useful for updates, but possibly for future information literacy assessment we might want to do.

Farkas_GoogleFormsPresentation (17)

The individual tutorials are also findable via our Information Literacy Teaching materials toolkit.

So when the pandemic came just when I was ready to expand this, I felt a little like Nostradamus or something. The timing was very, very good during a very, very bad situation. So we work with Biology 101 every single term in Week 2 to teach students about the library and about what peer review means, why it matters, and how to find peer-reviewed articles.

Farkas_GoogleFormsPresentation (18)
As soon as it became clear that Spring term was going to start online, I scrambled to create this tutorial that replicates, as well as I could, what we do in the classroom. So they do the same activity we did in-class where they look at a scholarly article and a news article and list the differences they notice. And in place of discussions, I had them watch videos and share insights. I then shared this with the Biology 101 faculty on my campus and they assigned it to their students in Week 2. It was great! [You can view the Biology 101 tutorial here and make a copy of it here]. And during Spring term I made A LOT more tutorials.

Farkas_GoogleFormsPresentation (19)

The biggest upside of using Google Forms is its simplicity and familiarity. Nearly everyone has created a Google form and they are dead simple to build. I knew that my colleagues in the library could easily copy something I made and tailor it to the courses they’re working with or make something from scratch. And I knew faculty could easily copy an existing tutorial and be able to see student responses. For students, it’s a low-bandwidth and easy-to-complete online worksheet. The barriers are minimal. And on the back-end, just like with LibGuides, there’s a feature where you can easily copy content from another Google Form.

Farkas_GoogleFormsPresentation (20)

The downsides of using Google Forms are not terribly significant. I mean, I’m sad that I can’t create beautiful, modern, sharp-looking forms, but it’s not the end of the world. The formatting features in Google Forms are really minimal. To create a hyperlink, you actually need to display the whole url. Blech. Then in terms of accessibility, there’s also no alt tag feature for images, so I just make sure to describe the picture in the text preceding or following it. I haven’t heard any complaints from faculty about having to fill out the Qualtrics form in order to get access to the tutorials, but it’s still another hurdle, however small.

Farkas_GoogleFormsPresentation (21)
This Spring, we used Google Form tutorials to replace the teaching we normally do in classes like Biology 101, Writing 121, Reading 115, and many others. We’ve also used them in addition to synchronous instruction, sort of like I did with my pre-assignments. But word about the Google Form tutorials spread and we ended up working with classes we never had a connection to before. For example, the Biology 101 faculty told the anatomy and physiology instructors about the tutorial and they wanted me to make a similar one for A&P. And that’s a key class for nursing and biology majors that we never worked with before on my campus. Lots of my colleagues have made copies of my tutorials and tailored them to the classes they’re working with or created their own from scratch. And we’ve gotten a lot of positive feedback from faculty, which REALLY felt good during Spring term when I know I was working myself to the bone.

Farkas_GoogleFormsPresentation (22)

Since giving this presentation, I learned from my colleagues in Anthropology that they actually used my work as the basis of their annual assessment project (which every academic unit has to do). They used a normed rubric to assess student papers in anthropology 101 and compared the papers of students who were in sections in which I was embedded (where they had access to the tutorial) to students in sections where they did not have an embedded librarian or a tutorial. They found that students in the class sections in which I was involved had a mean score of 43/50 and students in other classes had a mean score of 29/50. That is SIGNIFICANT!!! I am so grateful that my liaison area did this project that so validates my own work.

Farkas_GoogleFormsPresentation (23)

Here’s an excerpt from one email I received from an anatomy and physiology instructor: “I just wanted to follow up and say that the Library Assignment was a huge success! I’ve never had so many students actually complete this correctly with peer-reviewed sources in correct citation format. This is a great tool.” At the end of a term where I felt beyond worked to the bone, that was just the sort of encouragement I needed.

I made copies of a few other tutorials I’ve created so others can access them:

Library Map Part 1 - Why / Hugh Rundle

This weekend past I ran the Generous & Open Galleries, Libraries, Archives & Museums (GO GLAM) Miniconf at LinuxConf.au, with Bonnie Wildie. Being a completely online conference this year, we had an increased pool of people who could attend and also who could speak, and managed to put together what I think was a really great program. I certainly learned a lot from all our speakers, and I'll probably share some thoughts on the talks and projects later this year.

I also gave a short talk, about my new Library Map project and some thoughts on generosity in providing open data. Unfortunately, Alissa is completely right about my talk. The tone was wrong. I spoke about the wrong things and in the wrong way. It was an ungenerous talk on the virtues of generosity. I allowed my frustration at underfunded government bureaucracies and my anxiety about the prospect of giving a "technical" talk that "wasn't technical enough" for LCA to overwhelm the better angels of my nature. I won't be sharing the video of my own talk when it becomes available, but here is a short clip of me not long after I delivered it:

via GIPHY

So I'm trying again. In this post I'll outline the basic concepts, and the why of Library Map - why I wanted to make it, and why I made the architecture and design choices I've made. In the next post, I'll outline how I built it - some nuts and bolts of which code is used where (and also, to some extent, why). You may be interested in one, or the other, or neither post 🙂.

Library Map

The Library Map is a map of libraries in Australia and its external territories. There are three 'layers' to the map:

Libraries

The libraries layer shows every public library in Australia, plus an indicative 800m radius around it. Also mapped on additional overlays are State and National libraries, Indigenous Knowledge Centres, and most still-operating Mechanics Institutes.

Rules

The Rules layer has two overlays.

The Fines overlay colour-codes each library service area according to whether they charge overdue fines for everyone, only for adults, or not at all.

The Loan Periods overlay uses patterns (mostly stripes) to indicate the standard loan period in weeks (2, 3, 4, or 6 as it turns out).

Library Management Software

The Library Management Software layer works basically the same way as the Rules layer, except it colour-codes library services according to which library management system (a.k.a. Integrated Library System) they use.

What is the library map for?

I've wanted something like this map at various times in the past. There is a fair amount of information around at the regional and state level about loan periods, or fine regimes, and even library management systems. But a lot of this is in people's heads, or in lists within PDF documents. I'm not sure I'd call myself a 'visual learner' but sometimes it is much clearer to see something mapped out visually than to read it in a table.

The intended audience for the map is actually a little bit "inside baseball". I'm not trying to build a real-time guide for library users to find things like current opening hours. Google Maps does a fine job of that, and I'm not sure a dedicated site that covers every public library, but only libraries, is a particularly useful tool. It would also be a nightmare to maintain. The site ultimately exists because I wanted to see if I could do it, but I had — broadly — two specific use cases in mind:

Mapping library management systems to visualise network effects

My talk at LCA2018 was called Who else is using it? — in reference to a question library managers often ask when confronting a suggestion to use a particular technology, especially something major like a library management system. This is understandable — it's reassuring to know that one's peers have made similar decisions ("Nobody gets fired for buying IBM"), but there are also genuine advantages to having a network of fellow users you can talk to about shared problems or desired features. I was interested in whether these sorts of networks and aggregate purchasing decisions might be visible if they were mapped out, in a different way to what might be clear from a list or table. Especially at a national level — I suspected there were strong trends within states and contrasts between them, but didn't have a really clear picture.

The State Library of Queensland was invaluable in this regard, because they have a list of every library service in the state and which library management system they use. When visiting library service websites it turned out that identifying the LMS was often the easiest piece of data to find — much easier than finding out whether they charge overdue fines! It turns out there are very strong trends within each state — stronger than I expected — but Western Australia is a much more fractured and diverse market than I had thought. I also discovered a bunch of library management systems I had never heard of, so that was fun. This layer is the most recent — I only added it today — so there may still be some improvements to be made in terms of how the data is displayed.

Mapping overdue fines

The second thing I wanted to map was whether and how libraries charge overdue fines, but my reason was different. I actually started the map with this layer, as part of a briefing I gave to some incoming Victorian local government Councillors about what they should know about public libraries.

Here, the goal is mapping as an advocacy tool, using the peer pressure of "who else is charging it?" to slowly flip libraries to go fine-free. Fines for overdue library books are regressive and counter-productive. I have found no compelling or systematic evidence that they have any effect whatsoever on the aggregate behaviour of library users in terms of returning books on time. They disproportionately hurt low-income families. They need to go.

In Victoria there has been a growing movement in the last few years for public libraries to stop charging overdue fines. I wasn't really aware of the situation in other states, but it turns out the whole Northern Territory has been fine-free for over a decade, and most libraries in Queensland seem to also be fine-free. I'm still missing a fair bit of data for other states, especially South and Western Australia. What I'm hoping the map can be used for (once the data is more complete) is to identify specific libraries that charge fines but are near groups of libraries that don't, and work with the local library networks to encourage the relevant council to see that they are the odd ones out. I've worked in public libraries and know how difficult this argument can be to make from the inside, so this is a tool for activists but also to support library managers to make the case.

As is often a problem in libraries, I had to define a few terms and therefore "normalise" some data in order to have it make any sense systematically. So "no fines for children" is defined as any system that has a "younger than" exclusion for library fines or an exclusion for items designated as "children's books". Some libraries are fine free for users under 14, others for those under 17, some only for children's book loans and so on. On my map they're all the same. The other thing to normalise was the definition of "overdue fine", which you might think is simple but turns out to be complex. In the end I somewhat arbitrarily decided that if there is no fee earlier than 28 days overdue, that is classified as "no overdue fines". Some libraries charge a "notice fee" after two weeks (which does count), whilst others send an invoice for the cost of the book after 28 days (which doesn't).
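
To show how those definitions collapse the messy reality into three map categories, here is a small illustrative sketch. It is only a paraphrase of the rules described above; the function and field names are invented for illustration and are not the Library Map's actual code or data model.

```python
# Hypothetical sketch of the normalisation rules described above.
# "first_fee_day" and "child_exclusion" are invented field names.

def fine_category(first_fee_day, child_exclusion):
    """Classify a library service's overdue fine regime for the Fines overlay.

    first_fee_day: days overdue before any fee is charged (None if never).
    child_exclusion: True if there is any "younger than" or children's-item
                     exclusion, whatever the age cut-off or item type.
    """
    # The arbitrary cut-off: no fee earlier than 28 days overdue counts as
    # "no overdue fines" (a lost-book invoice at 28 days doesn't count,
    # a 14-day "notice fee" does).
    if first_fee_day is None or first_fee_day >= 28:
        return "no overdue fines"
    if child_exclusion:
        return "no fines for children"
    return "fines for everyone"

print(fine_category(first_fee_day=14, child_exclusion=True))     # -> "no fines for children"
print(fine_category(first_fee_day=None, child_exclusion=False))  # -> "no overdue fines"
```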

Colonial mode

As the project has progressed, some things have changed, especially how I name things. When I first added the Libraries layer, I was only looking at Victoria, using the Directory of Public Library Services in Victoria. This includes Mechanics Institutes as a separate category, and that seemed like a good idea, so I had two overlays, in different colours. Then I figured I should add the National Library, and the State Libraries, as a separate layer, since they operate quite differently to local public libraries.

Once I got to Queensland, I discovered that the State Library of Queensland not only provides really good data on public libraries, but has also broadly classified them into three categories: "RLQ" for Rural Libraries Queensland, a reciprocal-borrowing arrangement; "IND" for Independent library services; and "IKC" for "Indigenous Knowledge Centre". The immediate question for me was whether I would also classify any of these libraries as something different to a "standard" public library.

The main thing that distinguishes the RLQ network from the "independents" is that it is a reciprocal lending network. In this regard, it's much the same as Libraries Victoria (formerly the Swift Consortium), or ShoreLink. There are other ways that rural libraries in Queensland operate differently to urban libraries in Queensland, but I don't think these differences make them qualitatively different in terms of their fundamental nature.

But what about Indigenous Knowledge Centres? I admit I knew very little about them, and I still only know what I've gleaned from looking at IKC websites. The Torres Strait Island Regional Council website seems to be fairly representative:

Our Indigenous Knowledge Centres endeavour to deliver new technology, literacy and learning programs to empower our communities through shared learning experiences. We work with communities to preserve local knowledge and culture and heritage, to keep our culture strong for generations.

The big difference between an IKC and a typical public library is that the focus is on preserving local Indigenous knowledge and culture, which does happen through books and other library material, but is just as likely to occur through classes and activities such as traditional art and dance.

But the more I looked at this difference, the less different it seemed to be. Public libraries across the world have begun focussing more on activities and programs in the last two decades, especially in WEIRD countries. Public libraries have always delivered new technology, literacy and learning programs. And the Directory of Public Library Services in Victoria amusingly reports that essentially every library service in Victoria claims to specialise in local history. What are public libraries for, if not to "keep our culture strong for generations"?

Yet it still felt to me that Indigenous Knowledge Centres are operating from a fundamentally different mental model. Finally it dawned on me that the word "our" is doing a lot of work in that description. Our Indigenous Knowledge Centres, keep our culture strong for generations. I was taken back to a conversation I've had a few times with my friend Baruk Jacob, who lives in Aotearoa but grew up in a minority-ethnicity community in India. Baruk maintains that public libraries should stop trying to be universally "inclusive" — that they are fundamentally Eurocentric institutions and need to reconcile themselves to staying within that sphere. In this line of thinking, public libraries simply can't serve Indigenous and other non-"Western" people appropriately as centres of knowledge and culture. I could see where Baruk was coming from, but I was troubled by his argument, and the implication that different cultural traditions could never be reconciled. As I struggled to decide whether Indigenous Knowledge Centres were public libraries, or something else, I think I started to understand what Baruk meant.

I'd been thinking about this back to front. Indigenous Knowledge Centre is a usefully descriptive term. These places are centres for Indigenous knowledge. The problem wasn't how to classify IKCs, but rather how to classify the other thing. The activities might be the same, but the our is different. I thought about what a non-Indigenous Knowledge Centre might be. What kind of knowledge does it want to "keep strong for generations"? I thought about all those local history collections full of books about "pioneers" and family histories of "first settlers". If it's not Indigenous knowledge, it must be Settler knowledge. When I first saw this term being used by Aboriginal activists in reference to non-Indigenous residents generally, and white Australians specifically, I bristled. I mean, sure, the modern culture is hopelessly dismissive of 60,000 years of human occupation, culture and knowledge, but how could I be a "settler" when I have five or six generations of Australian-born ancestors? But a bit of discomfort is ok, and I have rather hypocritical ideas about other settler-colonial communities. It's exactly the right term to describe the culture most Australians live in.

So I renamed "public libraries" as "Settler Knowledge Centres". I initially renamed the National & State Libraries to "Imperial Knowledge Centres", but later decided it was more accurate to call them "Colonial Knowledge Centres". I also briefly renamed Mechanics Institutes to Worker Indoctrination Centres, but that's not entirely accurate and I realised I was getting carried away. I wasn't completely oblivious to the fact that this nomenclature could be a bit confusing, so I cheekily created two views: the "General" view which would be the default, and a second view which would appear on clicking "View in White Fragility mode". This second mode would show the more familiar names "Public Libraries" and "National & State Libraries".

While I was doing some soul searching this morning about my GO GLAM talk, I continued to work on the map. My cheeky joke about "White fragility mode" had made me slightly uncomfortable from the moment I'd created it, but I initially brushed it off as me worrying too much about being controversial. But I realised today that the real problem was that calling it "White fragility mode" sabotages the entire point of the feature. The default language of "Settler Knowledge Centre" and "Colonial Knowledge Centre" sitting next to "Indigenous Knowledge Centre" is intended to invite map users to think about the work these institutions do to normalise certain types of knowledge, and to "other" alternative knowledge systems and lifeworlds. The point is to bring people in to sit with the discomfort that comes from seeing familiar things described in an unfamiliar way. Calling it "White fragility mode" isn't inviting, it's smug. It either pushes people away, or invites them to think no more about it because they're already woke enough to get it.

So today I changed it to something hopefully more useful. General mode is now called Standard Mode, and White fragility mode is now called Colonial mode. It's the mode of thinking that is colonial, not the reader. Flicking to Colonial Mode is ok if you need the more familiar terms to get your bearings: but hopefully by making it the non-standard view, users of the map are encouraged to think about libraries and about Australia in a slightly different way. They don't have to agree that the "standard mode" terminology is better.

So that's some background behind why I started building the map and why I made some of the decisions I have about how it works. You can check it out at librarymap.hugh.run and see (most of) the code and data I used to build it on GitHub. Next time join me for a walk through how I made it.


Four Fucking Years of Donald Trump / Nick Ruest

Nearly four years ago I decided to start collecting tweets to Donald Trump out of morbid curiosity. If I were a real archivist, I would have planned this out a little better and started collecting on election night in 2016 or inauguration day in 2017. I didn’t. Using twarc, I started collecting with the Filter (Streaming) API on May 7, 2017. That process failed, and I pivoted to using the Search API. I dropped that process into a simple bash script, and pointed cron at it to run every 5 days. Here’s what the bash script looked like:

#!/bin/bash

# Collect replies to @realDonaldTrump with twarc's Search API client,
# writing a dated log and a dated file of line-delimited JSON on each run.

DATE=`date +"%Y_%m_%d"`

cd /mnt/vol1/data_sets/to_trump/raw

/usr/local/bin/twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.json

It’s not beautiful. It’s not perfect. But, it did the job for the most part for almost four years, save for a couple of Twitter suspensions on accounts that I used for collection, and an absolutely embarrassing situation where I forgot to set up cron correctly on a machine I moved the collecting to for a couple of weeks while I was on family leave this past summer.

In the end, the collection ran from May 7, 2017 - January 20, 2021, and collected 362,464,578 unique tweets; 1.5TB of line-delimited JSON! The final created_at timestamp was Wed Jan 20 16:49:03 +0000 2021, and the text of that tweet very fittingly reads, “@realDonaldTrump YOU’RE FIRED!”

The “dehydrated” tweets can be found here. In that dataset I decided to include a number of derivatives created with twut which, I hope, round out the dataset. This is the final update to the dataset.
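If you want the full tweet JSON rather than just the IDs, a minimal rehydration sketch using twarc 1.x's Python API might look like this (the file names are placeholders, and deleted, protected, or suspended-account tweets won't come back):

import json
from twarc import Twarc

# Twarc() picks up API credentials from the standard twarc configuration;
# hydrate() takes an iterable of tweet IDs and yields full tweet JSON for
# every ID that is still available.
t = Twarc()

with open("to_trump_ids.txt") as ids, open("to_trump_hydrated.jsonl", "w") as out:
    for tweet in t.hydrate(ids):
        out.write(json.dumps(tweet) + "\n")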

I also started working on some notebooks here where I’ve been trying to explore the dataset a bit more in my limited spare time. I’m hoping to have the time and energy to really dig into this dataset sometime in the future. I’m especially curious about what the lead-up to the 2021 storming of the United States Capitol looks like in the dataset, as well as the sockpuppet frequency. I’m also hopeful that others will explore the dataset and that it’ll be useful in their research. I have a suspicion folks can do much smarter, more innovative, and more creative things with the dataset than I did here, here, here, here, or here.

For those who are curious what the tweet volume for the last few months looked like (please note that the dates are UTC), check out these bar charts. January 2021 is especially fun.

Tweets to @realDonaldTrump by day in October 2020
Tweets to @realDonaldTrump by day in November 2020
Tweets to @realDonaldTrump by day in December 2020
Tweets to @realDonaldTrump by day in January 2021

-30-

Weeknote 3 (2021) / Mita Williams

Hey. I missed last week’s weeknote. But we are here now.

§1

This week I gave a class on searching scientific literature to a group of biology masters students. While I was making my slides comparing the Advanced Search capabilities of Web of Science and Scopus, I discovered this weird behaviour of Google Scholar: a phrase search generated more hits than the same search without quotation marks.

I understand that Google Scholar performs ‘stemming’ instead of truncation in generating search results but this still makes no sense to me.

§2

New to me: if you belong to an organization that is already a member of CrossRef, you are eligible to use a Similarity Check of documents for an additional fee. Perhaps this is a service we could provide to our OJS editors.

§3

I’m still working through the Canadian Journal of Academic Librarianship special issue on Academic Libraries and the Irrational.

Long time readers know that I have a fondness for the study of organizational culture and so it should not be too surprising that the first piece I wanted to read was The Digital Disease in Academic Libraries. It begins….

THOUGH several recent books and articles have been written about change and adaptation in contemporary academic libraries (Mossop 2013; Eden 2015; Lewis 2016), there are few critical examinations of change practices at the organizational level. One example, from which this paper draws its title, is Braden Cannon’s (2013) The Canadian Disease, where the term disease is used to explore the trend of amalgamating libraries, archives, and museums into monolithic organizations. Though it is centered on the impact of institutional convergence, Cannon’s analysis uses an ethical lens to critique the bureaucratic absurdity of combined library-archive-museum structures. This article follows in Cannon’s steps, using observations from organizational design and management literature to critique a current trend in the strategic planning processes and structures of contemporary academic libraries. My target is our field’s ongoing obsession with digital transformation beyond the shift from paper-based to electronic resources, examined in a North American context and framed here as The Digital Disease.

I don’t want to spoil the article but I do want to include this zinger of a symptom which is the first of several:

If your library’s organizational chart highlights digital forms of existing functions, you might have The Digital Disease.

Kris Joseph, The Digital Disease in Academic Libraries, Canadian Journal of Academic Librarianship, Vol 6 (2020)

Ouch. That truth hurts almost as much as this tweet did:

Blogging by hand / Jez Cope

I wrote the following text on my tablet with a stylus, which was an interesting experience:

So, thinking about ways to make writing fun again, what if I were to write some of them by hand? I mean I have a tablet with a pretty nice stylus, so maybe handwriting recognition could work. One major problem, of course, is that my handwriting is AWFUL! I guess I’ll just have to see whether the OCR is good enough to cope…

It’s something I’ve been thinking about recently anyway: I enjoy writing with a proper fountain pen, so is there a way that I can have a smooth workflow to digitise handwritten text without just typing it back in by hand? That would probably be preferable to this, which actually seems to work quite well but does lead to my hand tensing up to properly control the stylus on the almost-frictionless glass screen.

I’m surprised how well it worked! Here’s a sample of the original text:

And here’s the result of converting that to text with the built-in handwriting recognition in Samsung Notes:

Writing blog posts by hand
So, thinking about ways to make writing fun again, what if I were to write some of chum by hand? I mean, I have a toldest winds a pretty nice stylus, so maybe handwriting recognition could work.
One major problems, ofcourse, is that my , is AWFUL! Iguess
I’ll just have to see whattime the Ocu is good enough to cope…
It’s something I’ve hun tthinking about recently anyway: I enjoy wilting with a proper fountain pion, soischeme a way that I can have a smooch workflow to digitise handwritten text without just typing it back in by hand?
That wouldprobally be preferableto this, which actually scams to work quito wall but doers load to my hand tensing up to properly couldthe stylus once almost-frictionlessg lass scream.

It’s pretty good! It did require a fair bit of editing though, and I reckon we can do better with a model that’s properly trained on a large enough sample of my own handwriting.

Diversity, equity and inclusion core competencies: Get cross functional projects done (Part 5 of 5) / Tara Robertson

Latina woman writing on a poster board paper while 2 Black women offer ideas. Photo from WOCinTech chat, CC-BY

This is the last post in a weeklong series exploring DEI professional competencies. Again, I believe the five key competencies for DEI professionals are:

  1. be strategic
  2. translate academic research into action and measure the impact of initiatives
  3. meet people where they are at and help them move to be more inclusive 
  4. influence others
  5. get cross functional projects done. 

Yesterday’s post was about influencing others. This post will explore getting cross functional projects done. I’ll also share some other DEI career resources. 

Great ideas without action are totally meaningless. As a DEI leader you’ll be working across departments and functions to get stuff done. Strong project management skills and collaboration are key in making change to existing processes and developing new ways of doing things. Here are two examples to illustrate this competency.

One of my first projects at Mozilla was working with People Ops and a Tableau expert in IT to build a dashboard to track our diversity metrics, which was more difficult and time consuming than I first thought. When I started, the project was off the rails, so I suggested we restart by introducing ourselves and what we thought we brought to the table, and then we developed a RASCI for the project. With these foundations in place we became a very effective team. We completed the project and became friends. Having a dashboard for diversity metrics was important as leaders owned accountability goals and needed to know how they were doing.

Engineers started Mozilla’s first mentorship program. I joined the team as the only non-technical person and marvelled at some of the skills and ways of thinking that the others brought. It was one of those wonderful experiences where we were more than the sum of our parts. We were a small group of people with different backgrounds, doing different jobs, at various job levels, and we were able to stand up and support a mentorship program for about 100 people. I credit the leadership of Melissa O’Connor, Senior Manager of Data Operations. She often said “tell me what I’m missing here” to invite different options and ran the most efficient meetings I’ve ever attended in my life.

Great ideas without action are totally meaningless. Turning thoughts into actions as a leader in DEI is a necessary art: to get things done you’ll need to effectively collaborate with people at different levels and in different functions.

Additional resources on DEI careers

I’m excited to be one of the panelists for Andrea Tatum’s DEI careers panel tomorrow, January 23. The event is sold out but she’ll be simulcasting live on YouTube on January 23 at 10am Pacific. Andrea also introduced me to Russell Reynolds’ list of competencies of a Chief Diversity Officer.

Aubrey Blanche’s post How can I get a job in D&I? starts by trying to talk the reader out of going into this line of work, then gets into five key areas of expertise.

Dr. Janice Gassam’s Dirty Diversity Podcast has an episode where she interviews Lambert Odeh, Diversity and Inclusion Manager at Olo Inc. on How to Land a Career in DEI.

The post Diversity, equity and inclusion core competencies: Get cross functional projects done (Part 5 of 5) appeared first on Tara Robertson Consulting.

How to run your Open Data Day event online in 2021 / Open Knowledge Foundation

Open Data Day 2020: images from round the world

For Open Data Day 2021 on Saturday 6th March, the Open Knowledge Foundation is offering support and funding for in-person and online events anywhere in the world via our mini-grant scheme.

Open Data Day normally sees thousands of people getting together at hundreds of events all over the world to celebrate and use open data in their communities but this year has not been a normal year.

With many countries still under lockdown or restricted conditions due to the Covid-19 pandemic, we recognise that many people will need to celebrate Open Data Day by hosting online events rather than getting together for in-person gatherings.

To support the running of events, anyone can apply to our mini-grant scheme to receive $300 USD towards the running of your Open Data Day event whether it takes place in-person or online. Applications must be submitted before 12pm GMT on Friday 5th February 2021 by filling out this form.

If you’re applying for a mini-grant for an online event, we will accept applications where the funds are allocated to cover any of the following costs:

  • Fees for online tools needed to help with the running of your event
  • Essential equipment needed to help with the running of your event
  • Reimbursing speakers or participants for mobile data costs incurred during the event
  • Paying for the printing and posting of physical materials to event participants
  • Other costs associated with running the event

It might feel challenging to plan a great online event if you are used to running events in the real world. But many people and organisations have overcome these challenges this year, and there are many tools that can help you plan your event. Here are some tips and tools that we use for remote events that we think will help with your preparations.

Open Knowledge Foundation is a remote working organisation with our team spread around the world. We use Zoom, Google Meet or Slack to host our internal and external video meetings and rely on Google Docs, Github, Gitter and Discourse to allow us to share documents and talk in real-time. Many of these tools are free and easy to set up. 

Two members of our team are also on the organisation team of csv,conf, an annual community conference for data makers which usually hosts several hundred people for a two-day event. For csv,conf,v5 in May 2020, the team decided to make their event online-only and it proved to be a great success thanks to lots of planning and the use of good online tools. Read this post – https://csvconf.com/2020/going-online – to learn more about how the team organised their first virtual conference including guidance about the pros and cons of using tools like Crowdcast, Zenodo, Zoom and Spatial Chat for public events. 

Other organisations – including the Center for Scientific Collaboration and Community Engagement and the Mozilla Festival team – have also shared their guidebooks and processes for planning virtual events. 

We hope some of these resources will help you in your planning. If you have any further questions relating to an Open Data Day 2021 mini-grant application, please email opendataday@okfn.org.

Trump's Tweets / Ed Summers

TLDR: Trump’s tweets are gone from twitter.com but still exist spectrally in various states all over the web. After profiting off of their distribution Twitter now have a responsibility to provide meaningful access to the Trump tweets as a read only archive.

This post is also published on the Documenting the Now Medium where you can comment, if the mood takes you.


So Trump’s Twitter account is gone. Finally. It’s strange to have had to wait until the waning days of his presidency to achieve this very small and simple act of holding him accountable to Twitter’s community guidelines…just like any other user of the platform.

Better late than never, especially since his misinformation and lies can continue to spread after he has left office.

But isn’t it painful to imagine what the last four years (or more) could have looked like if Twitter and the media at large had recognized their responsibility and acted sooner?

When Twitter suspended Trump’s account they didn’t simply freeze it and prevent him from sending more hateful messages. They flipped a switch that made all the tweets he has ever sent disappear from the web.

These are tweets that had real material consequences in the world. As despicable as Trump’s utterances have been, a complete and authentic record of them having existed is important for the history books, and for holding him to account.

Where indeed? One hopes that they will end up in the National Archives (more on that in a moment). But depending on how you look at it, they are everywhere.

Twitter removed Trump’s tweets from public view at twitter.com. But fortunately, as Shawn Jones notes, embedded tweets like the one above persist the tweet text into the HTML document itself. When a tweet is deleted from twitter.com the text stays behind elsewhere on the web like a residue, as evidence (that can be faked) of what was said and when.
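As a rough illustration of how that residue can be recovered, here is a sketch that pulls the surviving text out of Twitter's standard embed markup with BeautifulSoup. The sample blockquote is invented for illustration and is not a real tweet:

from bs4 import BeautifulSoup

# Twitter embeds wrap the tweet text in <blockquote class="twitter-tweet">;
# even after the tweet is deleted upstream, this text remains in the page.
embed_html = """
<blockquote class="twitter-tweet">
  <p lang="en" dir="ltr">Example tweet text that survives in the page.</p>
  &mdash; Example Account (@example)
  <a href="https://twitter.com/example/status/1234567890">January 1, 2021</a>
</blockquote>
"""

soup = BeautifulSoup(embed_html, "html.parser")
for quote in soup.select("blockquote.twitter-tweet"):
    print(quote.p.get_text(strip=True))   # the residual tweet text
    print(quote.find("a")["href"])        # the (now dead) status URL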

It’s difficult to say whether this graceful degradation was an intentional design decision to make their content more resilient, or it was simply a function of Twitter wanting their content to begin rendering before their JavaScript had loaded and had a chance to emboss the page. But design intent isn’t really what matters here.

What does matter is the way this form of social media content degrades in the web commons. Kari Kraus calls this process “spectral wear”, where digital media “help mitigate privacy and surveillance concerns through figurative rather than quantitative displays, reflect and document patterns of use, and promote an ethics of care.” (Kraus, 2019). This spectral wear is a direct result of tweet embed practices that Twitter itself promulgates while simultaneously forbidding it in its Developer Terms of Service:

If Twitter Content is deleted, gains protected status, or is otherwise suspended, withheld, modified, or removed from the Twitter Applications (including removal of location information), you will make all reasonable efforts to delete or modify such Twitter Content (as applicable) as soon as possible.

Fortunately for history, there has probably never been more heavily copied social media content than Donald Trump’s tweets. We aren’t immediately dependent on twitter.com to make this content available because of all the other places on the web where it exists. What does this copying activity look like?

I intentionally used copied instead of archived above because the various representations of Trump’s tweets vary in terms of their coverage, and how they are being cared for.

Given their complicity in bringing Trump’s messages of division and hatred to a worldwide audience, while profiting off of them, Twitter now have a responsibility to provide as best a representation of this record for the public, and for history.

We know that the Trump administration have been collecting the (???) Twitter account, and plan to make it available on the web as part of their responsibilities under the Presidential Records Act:

The National Archives will receive, preserve, and provide public access to all official Trump Administration social media content, including deleted posts from (???) and (???). The White House has been using an archiving tool that captures and preserves all content, in accordance with the Presidential Records Act and in consultation with National Archives officials. These records will be turned over to the National Archives beginning on January 20, 2021, and the President’s accounts will then be made available online at NARA’s newly established trumplibrary.gov website.

NARA is the logical place for these records to go. But it is unclear what shape these archival records will take. Sure, the Library of Congress has (or had) its Twitter archive. It’s not at all clear if they are still adding to it. But even if they are, LC probably hasn’t felt obligated to collect the records of an official from the Executive Branch, since the Library is firmly lodged in the Legislative. Then again, they collect GIFs, so maybe?

Reading between the lines it appears that a third party service is being used to collect the social media content: possibly one of the several e-discovery tools like ArchiveSocial or Hanzo. It also looks like the Trump Administration themselves have entered into this contract, and at the end of its term (i.e. now) will extract their data and deliver it to NARA. Given their past behavior it’s not difficult to imagine the Trump administration not living up to this agreement in substantial ways.

This current process is a slight departure from the approach taken by the Obama administration. Obama initiated a process where platforms migrated official accounts to new accounts that were then managed going forward by NARA (Acker & Kriesberg, 2017). We can see that this practice is being used again on January 20, 2021 when Biden became President. But what is different is that Barack Obama retained ownership of his personal account (???), which he continues to use. NARA has announced that they will be archiving Trump’s now deleted (or hidden) personal account.

A number of Trump administration officials, including President Trump, used personal accounts when conducting government business. The National Archives will make the social media content from those designated accounts publicly available as soon as possible.

The question remains, what representation should be used, and what is Twitter’s role in providing it?

Meanwhile there are online collections like The Trump Archive, the New York Times’ Complete List of Trump’s Twitter Insults, Propublica’s Politwoops and countless GitHub repositories of data which have collected Trump’s tweets. These tweets are used in a multitude of ways including things as absurd as a source for conducting trades on the stock market.

But seeing these tweets as they appeared in the browser, with associated metrics and comments is important. Of course you can go view the account in the Wayback Machine and browse around. But what if we wanted a list of all the Trump tweets? How many times were these tweets actually archived? How complete is the list?

After some experiments with the Internet Archive’s API it’s possible to get a picture of how the tweets from the (???) account have been archived there. There are a few wrinkles because a given tweet can have many different URL forms (e.g. tracking parameters in the URL query string). In addition, just because there was a request to archive a URL for something that looks like a realDonaldTrump tweet URL doesn’t mean it resulted in a successful response. Success here means a 200 OK from twitter.com when resolving the URL. Factoring these issues into the analysis, it appears the Wayback Machine contains (at least) 16,043,553 snapshots of Trump’s tweets.

https://twitter.com/realDonaldTrump/status/{tweet-id}
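The details are in the notebook linked below, but as a rough sketch of the kind of query involved, the Wayback Machine's CDX API can be asked for successful captures of that URL prefix. The parameter choices here are illustrative only; pulling all ~16 million snapshots needs paging and a lot more patience:

import re
import requests

# Ask the Wayback Machine CDX API for HTTP 200 captures of realDonaldTrump
# status URLs; "limit" keeps this sketch small.
CDX = "http://web.archive.org/cdx/search/cdx"
params = {
    "url": "twitter.com/realDonaldTrump/status/",
    "matchType": "prefix",
    "filter": "statuscode:200",
    "fl": "original,timestamp",
    "output": "json",
    "limit": 500,
}

rows = requests.get(CDX, params=params).json()
snapshots = rows[1:]  # the first row is the list of field names

tweet_ids = set()
for original, timestamp in snapshots:
    m = re.search(r"/status/(\d+)", original)
    if m:
        tweet_ids.add(m.group(1))

print(len(snapshots), "snapshots,", len(tweet_ids), "unique tweet IDs in this sample")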

Of these millions of snapshots there appear to be 57,292 unique tweets. This roughly correlates with the 59K total tweets suggested by the last profile snapshots of the account. The maximum number of times in one day that his tweets were archived was 71,837 times on February 10, 2020. Here’s what the archive snapshots of Trump’s tweets look like over time (snapshots per week).

It is relatively easy to use the CSV export from the Trump Archive project to see what tweets they know about that the Internet Archive does not, and vice versa (for the details see the Jupyter notebook and SQLite database here).
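A minimal sketch of that two-way comparison, assuming a Trump Archive CSV export with an "id" column and a plain-text file of tweet IDs recovered from the Wayback Machine (the file and column names are my assumptions, not taken from either project):

import csv

# Load both ID sets and report the differences in each direction.
with open("trump_archive_export.csv", newline="") as f:
    archive_ids = {row["id"] for row in csv.DictReader(f)}

with open("wayback_tweet_ids.txt") as f:
    wayback_ids = {line.strip() for line in f if line.strip()}

print("In the Trump Archive but not the Wayback Machine:", len(archive_ids - wayback_ids))
print("In the Wayback Machine but not the Trump Archive:", len(wayback_ids - archive_ids))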

It looks like there are 526 tweet IDs in the Trump Archive that are missing from the Internet Archive. But further examination shows that many of these are retweets, which, in Twitter’s web interface, have sometimes redirected back to the original tweet. Removing these retweets to specifically look at Trump’s own tweets, there are only 7 tweets in the Trump Archive that are missing from the Internet Archive. Of these, 4 are in fact retweets that have been miscategorized by the Trump Archive.

One of the remaining three is this one, which is identified in the Trump Archive as deleted, and wasn’t collected quickly enough by the Internet Archive before it was deleted:

Roger Stone was targeted by an illegal Witch Hunt tha never should have taken place. It is the other side that are criminals, including the fact that Biden and Obama illegally spied on my campaign - AND GOT CAUGHT!"

Sure enough, over at the Politwoops project you can see that this tweet was deleted 47 seconds after it was sent:

Flipping the table, it’s also possible to look at what tweets are in the Internet Archive but not in the Trump Archive. It turns out that there are 3,592 tweet identifiers in the Wayback Machine for Trump’s tweets which do not appear in the Trump Archive. Looking a bit closer we can see that some are clearly wrong, because the id itself is too small a number, or too large. And then looking at some of the snapshots it appears that they often don’t resolve, and simply display a “Something went wrong” message:

Yes, something definitely went wrong (in more ways than one). Just spot checking a few, there also appear to be some legit tweets in the Wayback Machine that are not in the Trump Archive, like this one:

Notice how the media will not play there? It would take some heavy manual curation work to sort through these tweet IDs to see which ones are legit, and which ones aren’t. But if you are interested here’s an editable Google Sheet.

Finally, here is a list of the top ten archived (at the Internet Archive) tweets. The counts here reflect all the variations for a given tweet URL, so they will very likely not match the count you see in the Wayback Machine, which is for the specific URL (no query parameters).

The point of this rambling data spelunking, if you’ve made it this far, is to highlight the degree to which Trump’s tweets have been archived (or collected), and how the completeness and quality of those representations is very fluid and difficult to ascertain. Hopefully Twitter is working with NARA to provide as complete a picture as possible of what Trump said on Twitter. As much as we would like to forget, we must not.

References

Acker, A., & Kriesberg, A. (2017). Tweets may be archived: Civic engagement, digital preservation and Obama White House social media data. Proceedings of the Association for Information Science and Technology, 54(1), 1–9.

Kraus, K. (2019). The care of enchanted things. In M. K. Gold & L. F. Klein (Eds.), Debates in the digital humanities 2019. Retrieved from https://www.jstor.org/stable/10.5749/j.ctvg251hk.17

Announcing Finnish Translations of the 2019 Levels of Preservation Matrix and Assessment Tool / Digital Library Federation

The NDSA is pleased to announce that the 2019 Levels of Preservation documents have been translated into Finnish by our colleagues from CSC – IT Center for Science and the Finnish digital preservation collaboration group. 

Translations for the Assessment Tool Template and both versions of the Levels of Digital Preservation Matrix were completed.  

Links to these documents are found on the 2019 Levels of Digital Preservation OSF site (https://osf.io/qgz98/) as well as below.

If you would be interested in translating the Levels of Digital Preservation V2.0 into another language please contact us at ndsa.digipres@gmail.com.  

 

Suomenkieliset käännökset vuoden 2019 pitkäaikaissäilyttämisen tasot dokumentteihin

NDSA:lla on ilo ilmoittaa, että CSC – IT Center for Science ja PAS-yhteistyöryhmä ovat yhteistyössä kääntäneet pitkäaikaissäilyttämisen tasot dokumentit suomeksi.

Käännökset arviointityökaluun ja matriisin ovat valmiita.

Linkit dokumentteihin löytyvät OSF verkkosivustolta (https://osf.io/qgz98/) ja alta.

 

The post Announcing Finnish Translations of the 2019 Levels of Preservation Matrix and Assessment Tool appeared first on DLF.

Diversity, equity and inclusion core competencies: Influence others (Part 4 of 5) / Tara Robertson

red paper plane leading a group of white paper planes, all are facing to the right

This is the fourth post in a week-long series exploring DEI professional competencies. I believe the five key competencies for DEI professionals are:

  1. be strategic
  2. translate academic research into action and measure the impact of initiatives
  3. meet people where they are at and help them move to be more inclusive 
  4. influence others
  5. get cross functional projects done. 

Yesterday I wrote about being a professional change agent and the need to meet people where they’re at and help them change their perspective and behaviour to be more inclusive. Today I’m going to explore the ability to influence others.

DEI leaders need to be able to influence beyond their small team, at all levels of an organization. The way I did this was by building authentic relationships, learning about what other people’s priorities are, and negotiating how to be mutually successful. 

For example, I reached out to the AV Operations team to advocate for live captioning for our big internal meetings to increase access for people who were hard of hearing, people who process content better with text, and people for whom English was an additional language. They worked to make this part of the workflow and handled the administration with the captioning vendor.

Over a year later the AV Operations team reached out to me to partner on the sound quality in the office meeting rooms. As a distributed workforce we spent a lot of time in Zoom meetings and some rooms had better sound quality than others. Also, for some neurodiverse people the rooms were too noisy and echoey, which made meetings exhausting, so it was an accessibility issue too.

I recently did CliftonStrengths and one of my top strengths is Woo, or winning others over. CliftonStrengths describes this as: “People exceptionally talented in the Woo theme love the challenge of meeting new people and winning them over. They derive satisfaction from breaking the ice and making a connection with someone.” DEI leaders have little if any formal power, so being skilled at this is necessary.

This is the fourth in a series of five posts. Tomorrow’s post, the last one in this series, is about getting cross functional projects done. 

The post Diversity, equity and inclusion core competencies: Influence others (Part 4 of 5) appeared first on Tara Robertson Consulting.

MarcEdit 7.5 Change/Bug Fix list / Terry Reese

* Updated; 1/20

Change: Allow OS to manage supported Security Protocol types.

Change: Remove com.sun dependency related to dns and httpserver

Change: Changed AppData Path

Change: First install automatically imports settings from MarcEdit 7.0-2.x

Change: Field Count – simplify UI (consolidate elements)

Change: 008 Windows — update help urls to oclc

Change: Generate FAST Headings — update help urls

Change: .NET changes thread stats queuing. Updating thread processing on forms:

* Generate FAST Headings

* Batch Process Records

* Build Links

* Main Window

* RDA Helper

* Delete Selected Records

* MARC Tools

* Check URL Tools

* MARCValidator

Change: XML Function List — update process for opening URLs

Change: Z39.50 Preferences Window – update process for opening URLs

Change: About Windows — new information, updated how version information is calculated.

Change: Catalog Calculator Window — update process for opening URLs

Change: Generate Call Numbers — update process for opening URLs

Change: Generate Material Formats — update process for opening URLs

Change: Tab Delimiter — remove context windows

Change: Tab Delimiter — new options UI

Change: Tab Delimiter — normalization changes

Change: Remove Old Help HTML Page

Change: Remove old Hex Editor Page

Change: Updated Hex Editor to integrate into main program

Change: Main Window — remove custom scheduler dependency

Change: UI Update to allow more items

Change: Main Window — new icon

Change: Main Window — update process for opening URLs

Change: Main Window — removed context menus

Change: Main Window — Upgrade changes to new executable name

Change: Main Window — Updated the following menu Items:

* Edit Linked Data Tools

* Removed old help menu item

* Added new application shortcut

Change: OCLC Bulk Downloader — new UI elements to correspond to new OCLC API

Change: OCLC Search Page — new UI elements to correspond to new OCLC API

Change: Preferences — Updates related to various preference changes:

* Hex Editor

* Integrations

* Editor

* Other

Change: RDA Helper — update process for opening URLs

Change: RDA Helper — Opening files for editing

Change: Removed the Script Maker

Change: Templates for Perl and vbscripts includes

Change: Removed Find/Search XML in the XML Editor and consolidated in existing windows

Change: Delete Selected Records: Exposed the form and controls to the MarcEditor

Change: Sparql Browser — update process for opening URLs

Change: Sparql Browser — removed context menus

Change: TroubleShooting Wizard — Added more error codes and kb information to the Wizard

Change: UNIMARC Utility — controls change, configurable transform selections

Change: MARC Utilities — removed the context menu

Change: First Run Wizard — new options, new agent images

Change: XML Editor — Delete Block Addition

Change: XML Editor — XQuery transform support

Change: XML Profile Wizard — option to process attributes

Change: MarcEditor — Status Bar control doesn’t exist in .NET 5.0. Control has changed.

Change: MarcEditor — Improved Page Loading

Change: MarcEditor — File Tracking updated to handle times when the file opened is a temp record

Change: MarcEditor — removed ~7k of old code

Change: MarcEditor — Added Delete Selected Records Option

Change: Removed helper code used by Installer

Change: Removed Office2007 menu formatting code

Change: Consolidated Extensions into new class (removed 3 files)

Change: Removed calls Marshalled to the Windows API — replaced with Managed Code

Change: OpenRefine Format handler updated to capture changes between OpenRefine versions

Change: MarcEngine — namespace update to 75

Bug Fix: Main Window — corrects process for determining version for update

Bug Fix: Main Window — Updated image

Bug Fix: When doing first run, wizard not showing in some cases.

Bug Fix: Main Window — Last Tool used sometimes shows duplicates

Bug Fix: RDA Helper — $e processing

Bug Fix: RDA Helper — punctuation in the $e

Bug Fix: XML Profile Wizard — When the top element is selected, it’s not viewed for processing (which means not seeing element data or attribute data)

Bug Fix: MarcEditor — Page Processing correct to handle invalid formatted data better

Core competencies in DEI: meet people where they are at and help them move to be more inclusive (Part 3 of 5) / Tara Robertson

2 blank speech bubbles on a pink background. Photo by Miguel Á. Padriñán from Pexels

This is the third post in a week-long series exploring DEI professional competencies. I believe the five key competencies for DEI professionals are:

  1. be strategic
  2. translate academic research into action and measure the impact of initiatives
  3. meet people where they are at and help them move to be more inclusive 
  4. influence others
  5. get cross functional projects done. 

Yesterday I described how a DEI professional needs to be able to translate academic research into action as well as be able to measure the impact of programs. Today I’ll talk about meeting people where they’re at and helping them move to be more inclusive.

For me, doing this work in a professional context means meeting people where they’re at and helping them move to where they need to be. Building trust with leaders is key to being successful in this line of work. When leaders are vulnerable and honest with you about their process of unpacking their own biases or learning about inclusion you can be a trusted partner to help them move forward on their journey. 

I loved working with senior leaders who invited feedback on how they were showing up. I supported a manager in levelling up his knowledge of gender, so that he could continue to foster an inclusive environment. I encouraged several executives to share a bit more about who they are, including their fears and vulnerabilities, so they would come across as more human and so that staff would share their fears with them. Trust allows for an authentic relationship where we can do hard and necessary things together.

I had a lot of 1:1 conversations with people at all levels of the organization about the intent of their actions and the actual impact on people. The key to having these conversations well is to be able to offer clear and direct feedback with empathy. Kim Scott’s book Radical Candor offers a great framework for this. 

tattoo of a lit match on the inside of my forearm

Looking back to when I was an activist in my 20s, I was extremely self-righteous about my politics and very judgemental about where other people were at. There’s no way I would’ve been able to be effective in this work with that mindset. This tattoo on my left forearm has a few meanings for me. One is to remind me of when I was younger and wanted to burn down oppressive systems, and to remind me to meet people who are in that place with patience and empathy so we can build together.

This is the third in a series of five posts. Tomorrow’s post will address influencing others as a core competency for DEI leaders. 

The post Core competencies in DEI: meet people where they are at and help them move to be more inclusive (Part 3 of 5) appeared first on Tara Robertson Consulting.

Core competencies in DEI: Translate academic research to action and measure impact of initiatives (Part 2 of 5) / Tara Robertson

Closeup of a Pyrex measuring cup. “made to measure” by Chuan Chew, CC-BY-NC licensed

This is the second post in a week-long series exploring DEI professional competencies. I believe the five key competencies for DEI professionals are:

  1. be strategic
  2. translate academic research into action and measure the impact of initiatives
  3. meet people where they are at and help them move to be more inclusive 
  4. influence others
  5. get cross functional projects done. 

Yesterday I wrote about strategy as a key competency for leading DEI work. Today I’m going to explore translating academic research into action and measuring the impact of initiatives.  

I love doing DEI work in the corporate space because it allows me to bring my values and all the different work experiences I’ve had (feminist organizer, academic librarian, accessibility leader, workshop facilitator) to this work. These are emergent problems, meaning that no company, industry or country has solved them, so we need to try new things and figure out if they work. This means keeping up with the current research, being able to translate it into programs and initiatives, and measuring the impact of those programs.

When I started at Mozilla I started a Zotero library to keep track of all the research and reports I was reading, so I could easily find that specific study on psychological safety in the workplace that had the survey tool questions. I adapted these for our employee engagement survey so I could measure whether a pilot program improved psychological safety on teams. I also read various posts on the impact of hiring referral programs at other companies and then worked with our Talent Acquisition team to look at our actual data to understand if the referral program was helping or hindering our diversity efforts.

This is the second in a series of five posts. Tomorrow’s post will explore meeting people where they are at and helping them move to be more inclusive.

The post Core competencies in DEI: Translate academic research to action and measure impact of initiatives (Part 2 of 5) appeared first on Tara Robertson Consulting.

COVID-19 Research and REALM / HangingTogether

The REopening Archives, Libraries, and Museums (REALM) project has been in full motion since spring 2020. As the project charts the path forward into the new year and next phase of research, we are taking a moment to look back on the road we’ve traveled thus far. As REALM project director for OCLC, I’ve benefited from the guidance of the thoughtful, responsive, compassionate people who comprise the multi-organization project team, working groups, and executive project steering committee. The activities and results I describe in this article would not have been possible without their commitment to resilient archives, libraries, and museums; and to the safety of their staff and community members.

How did REALM come about?

At the end of March 2020, as the novel coronavirus was moving through the US in what was to be the first COVID-19 surge, the Institute of Museum and Library Services (IMLS) had just hosted a webinar on managing collections during an active pandemic. During the session, an epidemiologist and a health scientist from the US Centers for Disease Control provided guidance and covered topics such as creating or updating your institution’s emergency operations plan, staying in close communication with your local health department, and helping prevent spread of the virus by encouraging staff and visitors to practice frequent handwashing and social distancing. They also recommended frequent cleaning and disinfecting of high-touch surfaces with EPA-approved products, and leaving any room or item alone for 24 hours when there was possible contamination by someone infected. The interpretation of “high-touch” and “possible contamination” led many webinar attendees to wonder if well-trafficked public spaces and frequently handled collection and exhibit materials in libraries and museums fell into this category. Ensuing discussions led the IMLS to see the potential for a COVID-19 research project focusing on the operations, spaces, collections, and services specific to archives, libraries, and museums.

Who is involved in REALM?

On April 22, 2020, IMLS announced a partnership between the IMLS, the scientific research institute Battelle, and OCLC. Battelle has an extensive history doing research on emerging and infectious diseases such as Ebola, West Nile, influenza, and tuberculosis. OCLC’s research division has managed dozens of grant-funded projects and collaborates across a network of thousands of libraries throughout the world. To support the partnership and its decision-making, the IMLS also coordinated the formation of an executive project steering committee and working groups composed of representatives from member organizations, consortia, and individual institutions, as well as subject matter experts. By the end of May, the partnership had launched the REopening Archives, Libraries, and Museums, or REALM, project.

What is the scope of REALM?

The REALM project was designed to produce and distribute science-based information about how materials can be handled to mitigate exposure to coronavirus in staff and visitors of archives, libraries, and museums. A goal of the project is to better understand the virus in ways that will help inform local decision-making around operational practices and policies. The main questions that have shaped the project’s research activities are the following:

  • How is SARS-CoV-2, the virus that causes COVID-19, transmitted?
  • Are contaminated surfaces and materials contributing to COVID-19 infections?
  • What are effective prevention and decontamination tactics to mitigate transmission?

These questions were explored in two ways during the first two phases of the project: Systematic literature reviews and laboratory testing. In addition to the scientific research, the project is also collecting and reviewing relevant informational resources created by other organizations, and sharing illustrative examples of policies, practices, and procedures that archives, libraries, and museums have developed in response to the COVID-19 pandemic. The other main activity of the project is to create toolkit resources to synthesize the research into simplified language and imagery that can be used to support conversations with stakeholders and community members.

What is not in scope for the project is to develop one-size-fits-all recommendations or guidelines. Institutions vary significantly in their resources, settings, services, and priorities; and there is also a wide range of advisories and orders in place at local, state, and national levels. Therefore, each institution needs to develop policies and procedures in response to its local community needs and conditions and take into account pragmatic considerations of risk and available resources.

Staff and leadership of organizations are under a great deal of stress while trying to find and interpret credible information and make decisions in the middle of a crisis. Naturally, individuals also want to know how best to protect their own health and the health of others. All of us working in the field are trying to do the “right” thing to reduce any risk to the staff and users who depend on services, facilities, and collections. In an atmosphere of urgency, uncertainty, and ambiguity, figuring out what is the best course of action can be very complicated. We have had to learn that during a public health crisis it is normal to have to make decisions based on incomplete or conflicting information. As the authors of the BMJ article Managing uncertainty in the COVID-19 era suggest, we are learning to “make sense of complex situations by acknowledging the complexity, admitting ignorance, exploring paradoxes, and reflecting collectively.”

What can we learn from the scientific literature on SARS-CoV-2?

The REALM project has published two systematic literature reviews thus far, one in June and another in October of 2020. These reports synthesize research on the virus that was published through mid-August. A third review is currently underway and will be published this winter.

We still do not know several things about SARS-CoV-2; this is important to recognize when discussing, considering, and making decisions about your institution’s and your community’s policies and procedures. For example, we don’t know how many virus particles an infected person leaves behind on an object through such actions as sneezing or coughing. Although research and some educated guesses exist, there is no definitive answer. Another unknown is how many virus particles you can pick up from an object, and whether that transference is contributing to infections. Also, scientists have not yet determined the human infectious dose for this virus: we don’t know how much of the virus you need to ingest to contract COVID-19.

The SARS-CoV-2 virus has proven to be quite infectious, as demonstrated by the speed with which it has surged across countries around the world. The primary form of transmission is now generally understood to be through contaminated water droplets expelled when people infected with COVID-19 sneeze, cough, sing, talk, and/or breathe. The practice of limiting close (less than six feet) or extended (more than 15 minutes) contact between people is intended to reduce the risk of this type of transmission.

Evidence has also suggested that another likely pathway for spreading the virus is by breathing air in which the virus is suspended after some sort of aerosolization event, such as a sneeze. Aerosols have received increased attention since the summer, and some researchers believe they may be a significant source of COVID-19 transmission. Others suggest that more study is needed before any conclusions are drawn.

While touching fomites, or objects contaminated by virus-containing droplets, was thought to be a significant pathway in the early months of the pandemic, this has been a difficult pathway to trace when there is so much direct people-to-people transmission also occurring. As concerns about airborne transmission have grown, less attention has been given to studying the role of fomites in the COVID-19 pandemic. The REALM lab testing on fomites has contributed to this area of scientific inquiry.

Environmental factors have been identified as influential in the spread of the SARS-CoV-2, though additional research is needed to understand the complexities of these variables’ impact. Higher humidity and temperatures show evidence of hastening the deactivation of the virus; lower humidity and temperatures slow the speed of deactivation, so the virus remains infectious for a longer period under these conditions. Fresh air free of pollutants reduces transmission of the virus more than “dirty” air does. Some evidence has suggested that inadequate HVAC systems and other air circulation mechanisms can contribute to viral spread if not configured to maximize air exchange to refresh indoor spaces frequently with clean air. However, until further research defines the risk of people contracting COVID-19 through airborne virus, the extent to which these systems contribute to infection is unclear.

The research that has emerged reinforces the effectiveness of certain low-cost, relatively easy prevention tactics, especially handwashing or hand sanitizing, physical distancing, and wearing a mask. When a room, surface, or object is suspected of being contaminated by a person infected with COVID-19, increased heat, and use of disinfectants identified by the EPA have been shown to be effective decontamination practices. Important considerations in evaluating any type of agent or treatment for use are described in various resources; the National Park Service (NPS), the Northeast Document Conservation Center (NEDCC), and the Canadian Conservation Institute (CCI) offer guidance on caring for collection items and exhibit spaces so as not to damage materials or the staff handling them. The REALM website points to these and other resources.

What are the findings from REALM lab testing?

The project is currently scoped to conduct ten tests. Six have been completed and their findings were published between June and November 2020. The seventh and eighth tests took place in December, and the remaining two tests will be defined and scheduled based on those results. 

Two types of tests are used to measure the presence of the virus. One detects genetic matter associated with the virus but does not distinguish between active (i.e., infectious) and inactive particles. The other measures the amount of active virus by mixing it into a test cell culture and noting whether the virus infects the cells. The Battelle research method for REALM measures infectious virus, but the tests are not able to determine whether the number of active particles present would be enough to infect a human being with COVID-19.

The first six tests examined five materials selected from recommendations provided by the REALM Steering Committee and working groups. Three tests focused on commonly circulated public library items, such as different types of books and DVDs. The other tests studied plastic materials, textiles, and hard surfaces respectively. Many of the materials were donated by the Columbus Metropolitan Library, the National Archives and Records Administration, and the Library of Congress. Other materials, such as some of the textiles, were purchased new from vendors.

The Battelle lab technicians cut each material into coupons and applied the infectious virus to the surface at a known concentration in a synthetic saliva solution. They put the coupons into a controlled environmental chamber in stacked and unstacked configurations. The first six tests were conducted in conditions that simulated a standard office environment of 68°F to 75°F (22 ± 2°C) and 30 to 50% relative humidity. At preselected timepoints, the scientists measured the quantity of active virus on the coupons to document its attenuation (or drop) over time. The timepoints for each test were selected with two concerns in mind: the desired outcome of seeing the virus fall below the detection limit within the test timespan; and the timeframe representing a practical quarantine period for materials suspected of contamination.

Quantitation Limit
The culture method cannot be used to measure a virus count below 26 particles. This is called the “limit of quantitation.” Below this point, researchers must look at the test coupons under a microscope and note the presence or absence of virus on each coupon visually. If virus is not observed on any of the coupons, the result is recorded as being below the limit of detection.

Results of the first six REALM tests are shown here. Quick summaries, full lab reports, and the raw data sets are available on the REALM website.

What is next for REALM?

Research is expected to continue for the first half of 2021; new factors to consider include the emergence of vaccines, the impact of the second deadly winter surge, impacts induced by sliding along a confusing continuum of “open” and “closed” for institutions and their communities, and expanding experience that decision makers are gaining in making risk assessments and contingency plans. The project will continue to host and attend virtual speaking engagements to share information and answer questions to allow and foster listening, learning, and adapting among all in the cultural heritage fields. WebJunction is hosting a REALM webinar on January 29, 2021; register for the live event or view the recording when it is added to the course catalog.

[This post is adapted from an article published in the January 2021 edition of AIC News (Vol. 46, No. 1), the member newsletter of the American Institute for Conservation of Historic & Artistic Works. Quick summaries, full lab reports, and the raw data sets are available on the REALM website at oc.lc/realm.]

The post COVID-19 Research and REALM appeared first on Hanging Together.

What I want from a GLAM/Cultural Heritage Data Science Network / Jez Cope

Introduction

As I mentioned last year, I was awarded a Software Sustainability Institute Fellowship to pursue the project of setting up a Cultural Heritage/GLAM data science network. Obviously, the global pandemic has forced a re-think of many plans and this is no exception, so I’m coming back to reflect on it and make sure I’m clear about the core goals so that everything else still moves in the right direction.

One of the main reasons I have for setting up a GLAM data science network is because it’s something I want. The advice to “scratch your own itch” is often given to people looking for an open project to start or contribute to, and the lack of a community of people with whom to learn & share ideas and practice is something that itches for me very much.

The “motivation” section in my original draft project brief for this work said:

Cultural heritage work, like all knowledge work, is increasingly data-based, or at least gives opportunities to make use of data day-to-day. The proper skills to use this data enable more effective working. Knowledge and experience thus gained improves understanding of and empathy with users also using such skills.

But of course, I have my own reasons for wanting to do this too. In particular, I want to:

  • Advocate for the value of ethical, sustainable data science across a wide range of roles within the British Library and the wider sector
  • Advance the sector to make the best use of data and digital sources in the most ethical and sustainable way possible
  • Understand how and why people use data from the British Library, and plan/deliver better services to support that
  • Keep up to date with relevant developments in data science
  • Learn from others’ skills and experiences, and share my own in turn

Those initial goals imply some further supporting goals:

  • Build up the confidence of colleagues who might benefit from data science skills but don’t feel they are “technical” or “computer literate” enough
  • Further to that, build up a base of colleagues with the confidence to share their skills & knowledge with others, whether through teaching, giving talks, writing or other channels
  • Identify common awareness gaps (skills/knowledge that people don’t know they’re missing) and address them
  • Develop a communal space (primarily online) in which people feel safe to ask questions
  • Develop a body of professional practice and help colleagues to learn and contribute to the evolution of this, including practices of data ethics, software engineering, statistics, high performance computing, …
  • Break down language barriers between data scientists and others

I’ll expand on this separately as my planning develops, but here are a few specific activities that I’d like to be able to do to support this:

  • Organise less-formal learning and sharing events to complement the more formal training already available within organisations and the wider sector, including “show and tell” sessions, panel discussions, code cafés, masterclasses, guest speakers, reading/study groups, co-working sessions, …
  • Organise training to cover intermediate skills and knowledge currently missing from the available options, including the awareness gaps and professional practice mentioned above
  • Collect together links to other relevant resources to support self-led learning

Decisions to be made

There are all sorts of open questions in my head about this right now, but here are some of the key ones.

Is it GLAM or Cultural Heritage?

When I first started planning this whole thing, I went with “Cultural Heritage”, since I was pretty transparently targeting my own organisation. The British Library is fairly unequivocally a CH organisation. But as I’ve gone along I’ve found myself gravitating more towards the term “GLAM” (which stands for Galleries, Libraries, Archives, Museums) as it covers a similar range of work but is clearer (when you spell out the acronym) about what kinds of work are included.

What skills are relevant?

This turns out to be surprisingly important, at least in terms of how the community is described, because the skills listed define the boundaries of the community and can be the difference between someone feeling welcome or excluded. For example, I think that some introductory statistics training would be immensely valuable for anyone working with data, to understand what options are open to them and what limitations those options have, but is the word “statistics” off-putting per se to those who’ve chosen a career in arts & humanities? I don’t know, because I don’t have that background and perspective.

Keep it internal to the BL, or open up early on?

I originally planned to focus primarily on my own organisation to start with, feeling that it would be easier to organise events and build a network within a single organisation. However, the pandemic has changed my thinking significantly. Firstly, it’s now impossible to organise in-person events and that will continue for quite some time to come, so there is less need to focus on the logistics of getting people into the same room. Secondly, people within the sector are much more used to attending remote events, which can easily be opened up to multiple organisations in many countries, timezones allowing. It now makes more sense to focus primarily on online activities, which opens up the possibility of building a critical mass of active participants much more quickly by opening up to the wider sector.

Conclusion

This is the type of post that I could let run and run without ever actually publishing, but since it’s something I need feedback and opinions on from other people, I’d better ship it! I really want to know what you think about this, whether you feel it’s relevant to you and what would make it useful. Comments are open below, or you can contact me via Mastodon or Twitter.

Core competencies in DEI – Be strategic (Part 1 of 5) / Tara Robertson

A white hand holds up a small round mirror that shows a clear reflection of the mountains; the background is blurred. Photo by Ethan Sees

Three and a half years ago I changed careers from being an academic librarian who did diversity, equity and inclusion (DEI) work in the library technology community to a full time job as a DEI professional in the tech sector. Many people have reached out to see if I’d be willing to have a coffee and chat about DEI work and what the work actually looks like. I hope this series of five posts answers many of those questions.

Diversity, equity and inclusion careers panel event information

I’m excited to be one of the panelists for Andrea Tatum’s DEI careers panel on January 23. Registration is free and I hope you’ll join us. 

For me, the combination of head and heart makes this work deeply satisfying and challenging. I think the five key competencies for DEI professionals are:

  1. be strategic
  2. translate academic research into action and measure the impact of initiatives
  3. meet people where they are at and help them move to be more inclusive 
  4. influence others
  5. get cross-functional projects done

Over the coming week I’ll get into more detail on each of these competencies.

Be strategic

DEI means different things to different people and most people have an opinion on what the priorities should be. A clear strategy is important to focus on what you’re going to do, and more importantly, what you’re not going to do. 

While I’d facilitated and written strategic and operational plans for academic and library organizations, I really levelled up through conversations and a planning session with technology strategist John Jensen. My mentor Candice Morgan also shared her strategy and the thinking behind it. Seeing how others approach this was really useful. 

At the start of 2020 I adapted our existing strategy at Mozilla to connect the work we’d been doing on diversity and inclusion to the goals of the business. We had just gone through a layoff and some of the key business goals were focused on product innovation. Experimenting on how to build psychological safety was a key part of the strategy. Psychological safety is the belief that it’s safe to speak up with great new ideas and to raise the alarm when things are going off the rails. Looking at this through a DEI lens meant asking questions like:

  • Who speaks up?
  • Which voices are valued?
  • What kind of training and coaching can help teams and managers do this better?

DEI is a lens for looking at HR policies and processes across the entire employee lifecycle, and it goes beyond HR to look at the entire business.

Over the last few months I’ve talked to over 30 companies about where they’re at in their DEI journey. In 2020 we saw many companies take a reactive approach to DEI, quickly rolling out one-off workshops on unconscious bias or anti-racism. Without a broader strategy, these types of trainings won’t make a lasting impact.

When many people think about DEI they quickly jump to the hiring process. Increasing representation is shaped both by who joins the company and by who chooses to leave. Hiring is important, but maybe your bigger problem is attrition. Your attrition rates and exit survey data are a good place to start understanding who is leaving and why. To make the biggest impact, these programs need to connect to an overall strategy.

This is the first in a series of five posts. Tomorrow I’ll share some examples of translating academic research to action and talk about measuring impact of initiatives.

The post Core competencies in DEI – Be strategic (Part 1 of 5) appeared first on Tara Robertson Consulting.

Datasette hosting costs / Ted Lawless

I've been hosting a Datasette (https://baseballdb.lawlesst.net, aka baseballdb) of historical baseball data for a few years, and for the last year or so it has been hosted on Google Cloud Run. I thought I would share my hosting costs for 2020 as a point of reference for others who might be interested in running a Datasette but aren't sure how much it may cost. The total hosting cost on Google Cloud Run for 2020 for the baseballdb was $51.31, or a monthly average of about $4.28 USD. The monthly bill did vary a fair amount, from as high as $13 in May to as low as $2 in March. Since I did no deployments or updates to the site during this time, I assume the variation in costs is related to the number of queries the Datasette was serving. I don't have a good sense of how many total queries per month this instance is serving since I'm not using Google Analytics or similar. Google does report that it is subtracting $49.28 in credits for the year, but I don't expect those credits/promotions to expire anytime soon since my projected cost for 2021 is $59. This cost information is somewhat incomplete without knowing the number of queries served per month, but it is a benchmark.
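For readers who haven't used Datasette before, each instance exposes a JSON API alongside its HTML pages, so the hosted data can be queried programmatically. The sketch below is illustrative only: the base URL comes from the post, but the database name and SQL are assumptions that may not match the real baseballdb schema, and arbitrary SQL may be disabled on any given instance.

```python
# Minimal sketch of hitting a Datasette JSON endpoint. The database name
# ("baseballdb") and the SQL below are placeholders for illustration.
import requests

BASE = "https://baseballdb.lawlesst.net"

resp = requests.get(
    f"{BASE}/baseballdb.json",   # hypothetical database name
    params={
        # Count the tables in the SQLite file; works on any schema.
        "sql": "select count(*) as n from sqlite_master where type = 'table'",
        "_shape": "array",       # return rows as a plain JSON array of objects
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())               # e.g. [{"n": 27}]
```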

sequence models of language: slightly irksome / Andromeda Yelton

Not much AI blogging this week because I have been buried in adulting all week, which hasn’t left much time for machine learning. Sadface.

However, I’m in the last week of the last deeplearning.ai course! (Well. Of the deeplearning.ai sequence that existed when I started, anyway. They’ve since added an NLP course and a GANs course, so I’ll have to think about whether I want to take those too, but at the moment I’m leaning toward a break from the formal structure in order to give myself more time for project-based learning.) This one is on sequence models (i.e. “the data comes in as a stream, like music or language”) and machine translation (“what if we also want our output to be a stream, because we are going from a sentence to a sentence, and not from a sentence to a single output as in, say, sentiment analysis”).

And I have to say, as a former language teacher, I’m slightly irked.

Because the way the models work is — OK, consume your input sentence one token at a time, with some sort of memory that allows you to keep track of prior tokens in processing current ones (so far, so okay). And then for your output — spit out a few most-likely candidate tokens for the first output term, and then consider your options for the second term and pick your most-likely two-token pairs, and then consider all the ways your third term could combine with those pairs and pick your most likely three-token sequences, et cetera, continue until done.
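That decoding procedure is the “beam” approach the rest of this post pushes back on. A toy sketch of it, with a made-up stand-in for the trained model (this is not the course's code), might look like this:

```python
# Toy beam-search decoder over a made-up vocabulary. next_token_probs is a
# fake stand-in for a trained sequence model, not anything from the course.
import math

VOCAB = ["<eos>", "the", "cat", "sat", "mat"]

def next_token_probs(prefix):
    # Pretend model: roughly uniform, but increasingly favours <eos> as the
    # prefix grows, just so the search terminates.
    eos_p = min(0.5, 0.05 + 0.1 * len(prefix))
    rest = (1.0 - eos_p) / (len(VOCAB) - 1)
    return {"<eos>": eos_p, **{w: rest for w in VOCAB[1:]}}

def beam_search(beam_width=2, max_len=6):
    beams = [([], 0.0)]  # (tokens so far, total log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished beam carries over
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # keep only the beam_width most likely partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for tokens, score in beam_search():
    print(" ".join(tokens), round(score, 2))
```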

And that is…not how language works?

Look at Cicero, presuming upon your patience as he cascades through clause after clause which hang together in parallel but are not resolved until finally, at the end, a verb. The sentence’s full range of meanings doesn’t collapse until that verb at the end, which means you cannot be certain if you move one token at a time; you need to reconsider the end in light of the beginning. But, at the same time, that ending token is not equally presaged by all former tokens. It is a verb, it has a subject, and when we reached that subject, likely near the beginning of the sentence, helpfully (in Latin) identified by the nominative case, we already knew something about the verb — a fact we retained all the way until the end. And on our way there, perhaps we tied off clause after clause, chunking them into neat little packages, but none of them nearly so relevant to the verb — perhaps in fact none of them really tied to the verb at all, because they’re illuminating some noun we met along the way. Pronouns, pointing at nouns. Adjectives, pointing at nouns. Nouns, suspended with verbs like a mobile, hanging above and below, subject and object. Adverbs, keeping company only with verbs and each other.

There’s so much data in the sentence about which word informs which that the beam model casually discards. Wasteful. And forcing the model to reinvent all these things we already knew — to allocate some of its neural space to re-engineering things we could have told it from the beginning.

Clearly I need to get my hands on more modern language models (a bizarre sentence since this class is all of 3 years old, but the field moves that fast).

Launching the Open Data Day 2021 mini-grant scheme / Open Knowledge Foundation

Open Data Day 2020: images from round the world

We are thrilled to announce that once again the Open Knowledge Foundation is giving out mini-grants to support people hosting Open Data Day events across the world.

Open Data Day is an annual celebration of open data taking place for the eleventh time on Saturday 6th March 2021. Everyone can take part as groups from around the globe create local events to show how they use open data in their communities.

We are extremely grateful to our partners who have provided funding for this year’s mini-grant scheme. These include Microsoft, UK Foreign, Commonwealth and Development Office, Mapbox, Latin American Open Data Initiative (ILDA), Open Contracting Partnership and Datopian.

Open Data Day 2021 funder logos

How to apply? 

The deadline to submit your mini-grant application is midday GMT on Friday 5th February 2021. Use this form to make your application.

Who can apply? 

Anyone can apply for a $300 USD mini-grant. 

This year we are providing mini-grants to both:

  • Real world events in your location, and 
  • Online events to connect with community members and people around the world virtually

We understand that many people sadly will not be able to meet in person for this year’s Open Data Day due to local/national restrictions relating to the Covid-19 pandemic. But we want to help you and open data communities around the world by supporting online events and celebrations. As well as providing mini-grant funds to those running online events, we have shared our tips and advice for running great virtual sessions.

What are the criteria? 

Your event or online session must fit into one of the four tracks laid out below to be in with a chance of receiving a mini-grant:

  • Environmental data: Use open data to illustrate the urgency of the climate emergency and spur people into action to take a stand or make changes in their lives to help the world become more environmentally sustainable.
  • Tracking public money flows: Expand budget transparency, dive into public procurement, examine tax data or raise issues around public finance management by submitting Freedom of Information requests.
  • Open mapping: Learn about the power of maps to develop better communities.
  • Data for equal development: How can open data be used by communities to highlight pressing issues on a local, national or global level? Can open data be used to track progress towards the Sustainable Development Goals or SDGs?

What is a mini-grant?

A mini-grant is a small fund of $300 USD to help support groups organising Open Data Day events and online sessions.

You can only make one application for one event/online session in just one track.

The mini-grants cannot be used to fund government events, whether national or local. 

We can only support civil society activities. 

We encourage governments to find local groups and engage with them if they want to organise events and apply for a mini-grant.

The funds will only be delivered to the successful grantees after:

  • The event or online session has taken place, and 
  • We receive a written report on your event/online session which must be delivered within 30 days of your event  

In case the funds are needed before 6th March 2021, you can email opendataday@okfn.org and we will assess whether or not we can help on a case-by-case basis.

Photography and video competition 

This year, we will be giving away prizes for the best Open Data Day photographs and videos. These will be used to help promote Open Data Day in the future. Check back soon for more information about how to enter the competition.

About Open Data Day

Open Data Day is the annual event where we gather to reach out to new people and build new solutions to issues in our communities using open data. 

The eleventh Open Data Day will take place on Saturday 6th March 2021.

If you have started planning your Open Data Day event already, please add it to the global map on the Open Data Day website using this form. If you are running a free online session open to anyone in the world, we will publish a timetable to promote your online session. 

Connect with others and spread the word about Open Data Day using the #OpenDataDay or #ODD2021 hashtags. 

Alternatively you can join the Google Group to ask for advice or share tips.

To get inspired with ideas for events or online sessions, read about some of the great events which took place on Open Data Day 2020 in our wrap-up blog post.

Need more information?

If you have any questions, you can reach out to the Open Knowledge Foundation team by emailing opendataday@okfn.org or on Twitter via @OKFN. There’s also the Open Data Day Google Group where you can connect with others interested in taking part, share ideas for your event or ask for help.

Upcoming WMS Acquisitions API Changes / OCLC Dev Network

OCLC will be making changes to the WMS Acquisitions API to better support Ship-To and Bill-To addresses on Purchase Orders and Invoices.

Engaging with Early Web Collections / Archives Unleashed Project

Web Archive Datasets via the Internet Archive

Introduction

We have lots of web archival data — but what to do with it? The web archiving ecosystem has many organizations, institutions, projects, and individuals who have captured terabytes of data. These important preservation efforts allow us to explore the recent past. Web archive collections are a critical source for documenting, and in turn studying, our ever-growing online world. We don’t have to look too far back to see how born-digital collections illustrate important social and cultural movements, global health events, and various national elections, such as #MeToo, COVID, or #BlackLivesMatter.

We can see that web archives inform fields from social history and politics to journalism and beyond. It’s important to recognize, though, that there is a high barrier to using and exploring web archives. The challenges are many: the need for experience with computational approaches, a lack of institutional resources, and the sheer difficulty of accessing terabytes of raw web archival data in the WARC or ARC format. Indeed, many scholars wonder how they can even get started in the field of web archive analysis.

Archive-It and the Archives Unleashed project have collaborated to highlight three collections unified around the broader theme of the early web experience: Geocities, Friendster, and an early web language annotations collection. The aim of these collections is to broaden access to web archives, spark investigative curiosity, and encourage research and engagement with web archive data.

Collection #1: GeoCities Collection (1994–2009)

No treatment of the early World Wide Web is complete without a detour to the fun and fascinating world of GeoCities. It played a critical role in mediating many users’ first forays onto the Web as well as the creation of their early online social networks. Starting in 1994, anybody could visit GeoCities, provide an email address, and then create their own website on any topic, ranging from their favourite cars to their school, their stuffed animals, or almost anything else that struck their fancy!

Created by David Bohnett and John Rezner in 1994, GeoCities was one of the first platforms where users could create their own web pages without having to be an expert in HTML, FTP, or any of the other acronyms that defined the early web experience. GeoCities made it easy for people to step onto the “information superhighway.” It was also the first platform where individuals could build and participate in online communities. The site was structured thematically around the concept of “neighbourhoods,” where users would select an area to build their website based on their interests. For instance, a user interested in science and technology would build their site in Area 51; a child would build their kids-focused site in EnchantedForest (with its attendant enhanced moderation); a car aficionado in the MotorCity; and the amateur genealogist or family cataloguer in the Heartland.

The ease of GeoCities meant that people took to it with enthusiasm. The first ten thousand users built their sites by October 1995, the first 100,000 by August 1996, and the first 1,000,000 by October 1997. In other words, GeoCities experienced exponential growth back when that was a good thing! GeoCities was bought by Yahoo! in 1999. While it continued to grow, its heyday was over, and in 2009 Yahoo! announced that it was going to be deleted. All those terabytes of user-generated content, lost forever. That is, of course, if not for the Internet Archive and the Archive Team, among others!

Access to the Geocities Datasets

Collection URL: https://archive.org/details/geocitiesdatasets

In collaboration with the Archives Unleashed Project, the Geocities collection was processed with the Archives Unleashed Toolkit to produce derivatives scholars can freely use and explore.

By visiting the Geocities dataset, users can expect to have access to csv and parquet files for the following datasets: domain count, image graph, web graph, and binary file information for audio, images, PDFs, presentation program files, spreadsheets, videos, and word processor files. A graphml file is also available for the domain graph.
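As a rough illustration of what working with those derivatives can look like, the sketch below reads one of the Parquet files with pandas. The filename is a placeholder rather than an actual name from the collection; check the item page for the real file listing.

```python
# Illustrative only: load an Archives Unleashed derivative with pandas
# (requires pyarrow or fastparquet). The filename is a placeholder; the
# real names are listed at https://archive.org/details/geocitiesdatasets.
import pandas as pd

domains = pd.read_parquet("geocities-domain-count.parquet")  # hypothetical name
print(domains.head())

# Assuming the count column is the last one, show the most common domains.
count_col = domains.columns[-1]
print(domains.sort_values(count_col, ascending=False).head(10))
```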

Collection #2: Friendster (2003–2015)

Before Facebook, there was Friendster, the first platform to be patented as a social networking site. Friendster’s development also influenced a wave of platforms that would debut in the early 2000s, including MySpace (2003), Hi5 (2004), and Facebook (2004).

Friendster was founded in 2002 by Jonathan Abrams and was designed for discovering and managing online social connections. With Friendster, Abrams altered the way individuals presented themselves online: Friendster required users to use real names and pictures, which was a departure from earlier platforms such as GeoCities. The platform translated everyday behaviours, such as meeting and interacting with people, to the online context. One of its distinguishing features was a social graph that displayed social connections and highlighted degrees of separation, so users could see layers of connections: the familiar adage of being a friend of a friend of a friend.

Launched in 2003 as a local platform for residents of San Francisco, Friendster quickly grew into a national phenomenon. The biggest challenge Friendster faced was performance: the exponential growth of the user base outpaced the ability of the technology to deal with surging traffic, and the user experience degraded significantly as a result. Competing platforms like MySpace offered additional functionality, such as page customization, and built infrastructure to stay ahead of an expanding user base, both of which appealed to frustrated Friendster users. Friendster faced a number of other challenges over its history, including management transitions (six CEOs in four years), missed opportunities, and eroding traffic within the US market. By 2011 the platform was re-envisioned as a social gaming site, and services were officially shut down in 2015.

Access to the Friendster Datasets

Collection URL: https://archive.org/details/friendsterdatasets

The datasets provided are generated from the larger Friendster web archive collection in the Internet Archive. This gives researchers a way to start exploring Friendster data without being overwhelmed by the full 10TB collection. In this smaller collection, they can use the LGA (Longitudinal Graph Analysis) file to explore the way websites link to each other, and the WAT file, which features key metadata elements.
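WAT files are WARC-format metadata records whose payloads are JSON, so they can be read with standard web archive tooling. The sketch below uses the warcio library; the filename is a placeholder, and the exact payload structure should be checked against the files in the collection.

```python
# Rough sketch of peeking inside a WAT file with warcio. The filename is a
# placeholder; WAT payloads are JSON envelopes describing each capture.
import json
from warcio.archiveiterator import ArchiveIterator

with open("friendster-sample.wat.gz", "rb") as stream:  # hypothetical filename
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        target = record.rec_headers.get_header("WARC-Target-URI")
        envelope = json.loads(record.content_stream().read())
        print(target, list(envelope.keys()))
        break  # just inspect the first metadata record
```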

Collection #3: Early Web Language Datasets (1996–1999)

Collection URL: https://archive.org/details/earlywebdatasets

These two related datasets are generated from the Internet Archive’s global web archive collection and will be of particular interest to researchers who want to explore language. They may be of most interest to technical users who want to work on translation or language identification.

The first dataset, “Parallel Language Records of the Early Web (1996–1999)”, provides multilingual records — that is, URLs of websites that appear in multiple languages over that period (e.g. instructions provided in English, German, Spanish, French, Italian, and Dutch). These records could in turn be used to work with the Wayback Machine.

The second dataset, “Language Annotations of the Early Web (1996–1999)”, is another metadata set that annotates the language of over four million websites using Google’s Compact Language Detector (CLD3). It is provided as a CDX file, which serves as an index of all the sites available, and a CDXA file, which carries the language annotation for each one.

Special thanks to Helge Holzmann and Nick Ruest for their efforts in preparing these datasets!

Explore Web Archives

We encourage you to engage with these collections to discover the benefits of using web archive data in research. Take advantage of this unparalleled window into early web history! All of the files and derivatives are freely accessible and available for research.

We’d love to know how you interpret and analyze these collections!

References

Archive-It. (2011, September 20). The Archive Team Friendster Snapshot Collection [Web Archive Collection]. Retrieved from https://archive.org/details/archive-team-friendster?tab=about

Blumberg, A. (Host). (2017, April 21). Friendster 1: The Rise [Audio podcast]. Retrieved from https://gimletmedia.com/shows/startup/n8hogn

Blumberg, A. (Host). (2017, April 21). Friendster 2: The Fall [Audio podcast]. Retrieved from https://gimletmedia.com/shows/startup/8whow5

Fiegerman, S. (2014, February 3). Friendster Founder Tells His Side of the Story, 10 Years After Facebook. Mashable. Retrieved from https://mashable.com/2014/02/03/jonathan-abrams-friendster-facebook/

Garcia, D., Mavrodiev, P., & Schweitzer, F. (2013). Social resilience in online communities: The autopsy of Friendster. In Proceedings of the First ACM Conference on Online Social Networks (COSN ’13) (pp. 39–50). Association for Computing Machinery. https://doi.org/10.1145/2512938.2512946

GeoCities. (n.d.). In Archive Team Wiki. Retrieved from https://archiveteam.org/index.php?title=GeoCities

Geocities. (2010, November 17). In Dead Media Archive: NYU Dept. of Media, Culture, and Communication. Retrieved from http://cultureandcommunication.org/deadmedia/index.php/Geocities

Kaplan, J. (2009, August 25). GeoCities, Preserved! Internet Archive Blog. Retrieved from https://blog.archive.org/2009/08/25/geocities-preserved/

Lerner, L. (2008, April 2). How Friendster Works. HowStuffWorks.com. Retrieved from https://computer.howstuffworks.com/internet/social-networking/networks/friendster.htm

Scheinman, J. (2017, May 11). Friendster: Lessons from a Missed $400B Opportunity. Medium. Retrieved from https://medium.com/startup-grind/friendster-lessons-from-a-missed-400b-opportunity-dda144e1847

Seki & Nakamura. (2017). The mechanism of collapse of the Friendster network: What can we learn from the core structure of Friendster? Social Network Analysis and Mining, 7(1), 1–21.


Engaging with Early Web Collections was originally published in Archives Unleashed on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Bitcoin "Price" / David Rosenthal

Jemima Kelly writes No, bitcoin is not “the ninth-most-valuable asset in the world” and it’s a must-read. Below the fold, some commentary.

The "price" of BTC in USD has quadrupled in the last three months, and thus its "market cap" has sparked claims that it is the 9th most valuable asset in the world.

Kelly explains the math:
Just like you would calculate a company’s market capitalisation by multiplying its stock price by the number of shares outstanding, with bitcoin you just multiply its price by its total “supply” of coins (ie, the number of coins that have been mined since the first one was in January 2009). Simples!

If you do that sum, you’ll see that you get to a very large number — if you take the all-time-high of $37,751 and multiply that by the bitcoin supply (roughly 18.6m) you get to just over $665bn. And, if that were accurate and representative and if you could calculate bitcoin’s value in this way, that would place it just below Tesla and Alibaba in terms of its “market value”. (On Wednesday!)
Then Kelly starts her critique, which is quite different from mine in Stablecoins:
In the context of companies, the “market cap” can be thought of as loosely representing what someone would have to pay to buy out all the shareholders in order to own the company outright (though in practice the shares have often been over- or undervalued by the market, so shareholders are often offered a premium or a discount).

Companies, of course, have real-world assets with economic value. And there are ways to analyse them to work out whether they are over- or undervalued, such as price-to-earnings ratios, net profit margins, etc.

With bitcoin, the whole value proposition rests on the idea of the network. If you took away the coinholders there would be literally nothing there, and so bitcoin’s value would fall to nil. Trying to value it by talking about a “market cap” therefore makes no sense at all.
Secondly, she takes aim at the circulating BTC supply:
Another problem is that although 18.6m bitcoins have indeed been mined, far fewer can actually be said to be “in circulation” in any meaningful way.

For a start, it is estimated that about 20 per cent of bitcoins have been lost in various ways, never to be recovered. Then there are the so-called “whales” that hold most of the bitcoin, whose dominance of the market has risen in recent months. The top 2.8 per cent of bitcoin addresses now control 95 per cent of the supply (including many that haven’t moved any bitcoin for the past half-decade), and more than 63 per cent of the bitcoin supply hasn’t been moved for the past year, according to recent estimates.
The small circulating supply means that BTC liquidity is an illusion:
the idea that you can get out of your bitcoin position at any time and the market will stay intact is frankly a nonsense. And that’s why the bitcoin religion’s “HODL” mantra is so important to be upheld, of course.

Because if people start to sell, bad things might happen! And they sometimes do. The excellent crypto critic Trolly McTrollface (not his real name, if you’re curious) pointed out on Twitter that on Saturday a sale of just 150 bitcoin resulted in a 10 per cent drop in the price.
And there are a lot of "whales" HODL-ing. If one decides to cash out, everyone will get trampled in the rush for the exits:
More than 2,000 wallets contain over 1,000 bitcoin in them. What would happen to the price if just one of those tried to unload their coins on to the market at once? It wouldn’t be pretty, we would wager.

What we call the “bitcoin price” is in fact only the price of the very small number of bitcoins that wash around the retail market, and doesn’t represent the price that 18.6m bitcoins would actually be worth, even if they were all actually available.
Note that Kelly's critique implicitly assumes that BTC is priced in USD, not in the mysteriously inflatable USDT. The graph shows that the vast majority of the "very small number of bitcoins that wash around the retail market" are traded for, and thus priced in, USDT. So the actual number of bitcoins being traded for real money is a small fraction of a very small number.

Bitfinex & Tether have agreed to comply with the New York Supreme Court and turn over their financial records to the New York Attorney General by 15th January. If they actually do, and the details of what is actually backing the current stock of nearly 24 billion USDT become known, things could get rather dynamic. As Tim Swanson explains in Parasitic Stablecoins, the 24B USD are notionally in a bank account, and the solvency of that account is not guaranteed by any government deposit insurance. So even if there were a bank account containing 24B USD, if there is a rush for the exits the bank holding that account could well go bankrupt.

To give a sense of scale, the 150 BTC sale that crashed the "price" by 10% represents (150 / 6.25) / 6 = 4 hours of mining reward. If miners were cashing out their rewards, they would be selling 900 BTC, or $36M, per day. In the long term, the lack of barriers to entry means that the margins on mining are small. But in the short term, mining capacity can't respond quickly to large changes in the "price". It certainly can't increase four times in three months.

Let's assume that three months ago, when 1 BTC ≈ 10,000 USDT, the BTC ecosystem was in equilibrium, with the mining rewards plus fees slightly more than the cost of mining. While the BTC "price" has quadrupled, the hash rate, and thus the cost of mining, has oscillated between 110M and 150M TeraHash/s. It hasn't increased significantly, so miners now need to sell only about 225 BTC, or $9M, per day to cover their costs. With the price soaring, they have an incentive to HODL their rewards.
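The arithmetic behind those figures is easy to check; the sketch below simply restates the round numbers used in this post ($10,000/BTC then, roughly $40,000/BTC now), not market data.

```python
# Back-of-envelope arithmetic from the paragraphs above, using the post's
# round figures ($10,000/BTC three months ago, ~$40,000/BTC now).
BLOCK_REWARD_BTC = 6.25
BLOCKS_PER_HOUR = 6
DAILY_ISSUANCE_BTC = BLOCK_REWARD_BTC * BLOCKS_PER_HOUR * 24   # 900 BTC/day

price_then, price_now = 10_000, 40_000                         # USD(T) per BTC

# At equilibrium three months ago, mining costs roughly equalled reward value.
daily_cost_usd = DAILY_ISSUANCE_BTC * price_then               # ~$9M/day

# To cover the same costs at today's price, miners need to sell far less.
btc_to_sell_now = daily_cost_usd / price_now                   # ~225 BTC/day

# The 150 BTC sale that moved the price 10% is only a few hours of rewards.
hours_of_rewards = 150 / BLOCK_REWARD_BTC / BLOCKS_PER_HOUR    # 4 hours

print(DAILY_ISSUANCE_BTC, daily_cost_usd, btc_to_sell_now, hours_of_rewards)
```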

Islandora Open Meeting: January 28, 2021 / Islandora

We will be holding another open drop-in session on January 28th from 10:00 AM to 2:00 PM Eastern. Full details, and the Zoom link to join, are in this Google doc. The meeting is free form, with experienced Islandora 8 users on hand to answer questions or give demos on request. Please drop in at any time during the four-hour window.

Registration is not required. If you would like a calendar invite as a reminder, please let us know at community@islandora.ca.

csv,conf,v6 is going ahead on May 4-5 2021 / Open Knowledge Foundation

Attendees of csv,conf,v4

Save the date for csv,conf,v6! The 6th version of csv,conf will be held online on May 4-5 2021.

If you are passionate about data and its application to society, this is the conference for you. Submissions are open until 28 February 2021 for 25-minute talk slots and for a new ‘Birds of a Feather’ track, in which we invite people to gather around a topic or area of interest. We encourage talks about how you are using data in an interesting way.

The conference will take place on Crowdcast, Slack, Spatial Chat and other platforms used for breakout and social spaces. We will be opening ticket sales soon.

Pictured are attendees at the csv,conf,v5 opening session.

csv,conf,v5 was planned to go ahead in May 2020 in person in Washington DC. Due to the Covid-19 situation this was not possible and the organising team made the decision to do the event online. The event was a huge success and most talks had well over 300 attendees. We have written about our experience of organising an online conference with the hope that it will help others (https://csvconf.com/going-online) and are excited to be building on this experience for this year.

csv,conf is a much-loved community conference bringing together diverse groups to discuss data topics, featuring stories about data sharing and data analysis from science, journalism, government, and open source. Over two days, attendees will have the opportunity to hear about ongoing work, share skills, exchange ideas and kickstart collaborations. As in previous years, the Open Knowledge Foundation are members of the organising team.

Expect the return of the comma llama!

First launched in July 2014, csv,conf has expanded to bring together over 2,000 participants from 30 countries with backgrounds from varied disciplines. If you’ve missed the earlier years’ conferences, you can watch previous talks on topics like data ethics, open source technology, data journalism, open internet, and open science on our YouTube channel. We hope you will join us  in May to share your own data stories and join the csv,conf community!

We are happy to answer any questions you may have or offer clarifications if needed. Feel free to reach out to us at csv-conf-coord@googlegroups.com, on Twitter via @CSVConference, or on our dedicated community Slack channel.

We are committed to diversity and inclusion, and strive to be a supportive and welcoming environment to all attendees. To this end, we encourage you to read the csv,conf Code of Conduct.

csv,conf,v6 is a community conference that is in part supported by the Sloan Foundation through our Frictionless Data for Reproducible Research work. The Frictionless Data team is part of the organising group.

Counting down to 1925 in the public domain / John Mark Ockerbloom

We’re rapidly approaching another Public Domain Day, the day at the start of the year when a year’s worth of creative work joins the public domain. This will be the third year in a row that the US will have a full crop of new public domain works (after a prior 20-year drought), and once again, I’m noting and celebrating works that will be entering the public domain shortly. Approaching 2019, I wrote a one-post-a-day Advent Calendar for 1923 works throughout the month of December, and approaching 2020, I highlighted a few 1924 works, and related copyright issues, in a series of December posts called 2020 Vision.

This year I took to Twitter, making one tweet per day featuring a different 1925 work and creator using the #PublicDomainDayCountdown hashtag. Tweets are shorter than blog posts, but I started 99 days out, so by the time I finish the series at the end of December, I’ll have written short notices on more works than ever. Since not everyone reads Twitter, and there’s no guarantee that my tweets will always be accessible on that site, I’ll reproduce them here. (This post will be updated to include all the tweets up to 2021.) The tweet links have been reformatted for the blog, a couple of 2-tweet threads have been recombined, and some typos may be corrected.

If you’d like to comment yourself on any of the works mentioned here, or suggest others I can feature, feel free to reply here or on Twitter. (My account there is @JMarkOckerbloom. You’ll also find some other people tweeting on the #PublicDomainDayCountdown hashtag, and you’re welcome to join in as well.)

September 24: It’s F. Scott Fitzgerald’s birthday. His best-known book, The Great Gatsby, joins the US public domain 99 days from now, along with other works with active 1925 copyrights. #PublicDomainDayCountdown (Links to free online books by Fitzgerald here.)

September 25: C. K. Scott-Moncrieff’s birthday’s today. He translated Proust’s Remembrance of Things Past (a controversial title, as the Public Domain Review notes). The Guermantes Way, his translation of Proust’s 3rd volume, joins the US public domain in 98 days. #PublicDomainDayCountdown

September 26: Today is T.S. Eliot’s birthday. His poem “The Hollow Men” (which ends “…not with a bang but a whimper”) was first published in full in 1925, & joins the US public domain in 97 days. #PublicDomainDayCountdown More by & about him here.

September 27: Lady Cynthia Asquith, born today in 1887, edited a number of anthologies that have long been read by children and fans of fantasy and supernatural fiction. Her first major collection, The Flying Carpet, joins the US public domain in 96 days. #PublicDomainDayCountdown

September 28: As @Marketplace reported tonight, Agatha Christie’s mysteries remain popular after 100 years. In 95 days, her novel The Secret of Chimneys will join the US public domain, as will the expanded US Poirot Investigates collection. #PublicDomainDayCountdown

September 29: Homer Hockett’s and Arthur Schlesinger, Sr.’s Political and Social History of the United States first came out in 1925, and was an influential college textbook for years thereafter. The first edition joins the public domain in 94 days. #PublicDomainDayCountdown

September 30: Inez Haynes Gillmore Irwin died 50 years ago this month, after a varied, prolific writing career. This 2012 blog post looks at 4 of her books, including Gertrude Haviland’s Divorce, which joins the public domain in 93 days. #PublicDomainDayCountdown

October 1: For some, spooky stories and themes aren’t just for October, but for the whole year. We’ll be welcoming a new year’s worth of Weird Tales to the public domain in 3 months. See what’s coming, and what’s already free online, here. #PublicDomainDayCountdown

October 2: Misinformation and quackery has been a threat to public health for a long time. In 13 weeks, the 1925 book The Patent Medicine and the Public Health, by American quack-fighter Arthur J. Cramp joins the public domain. #PublicDomainDayCountdown

October 3: Sophie Treadwell, born this day in 1885, was a feminist, modernist playwright with several plays produced on Broadway, but many of her works are now hard to find. Her 1925 play “Many Mansions” joins the public domain in 90 days. #PublicDomainDayCountdown

October 4: It’s Edward Stratemeyer’s birthday. Books of his syndicate joining the public domain in 89 days include the debuts of Don Sturdy & the Blythe Girls, & further adventures of Tom Swift, Ruth Fielding, Baseball Joe, Betty Gordon, the Bobbsey Twins, & more. #PublicDomainDayCountdown

October 5: Russell Wilder was a pioneering diabetes doctor, testing newly invented insulin treatments that saved many patients’ lives. His 1925 book Diabetes: Its Cause and its Treatment with Insulin joins the public domain in 88 days. #PublicDomainDayCountdown

October 6: Queer British Catholic author Radclyffe Hall is best known for The Well of Loneliness. Hall’s earlier novel A Saturday Life is lighter, though it has some similar themes in subtext. It joins the US public domain in 87 days. #PublicDomainDayCountdown

October 7: Edgar Allan Poe’s stories have long been public domain, but some work unpublished when he died (on this day in 1849) stayed in © much longer. In 86 days, the Valentine Museum’s 1925 book of his previously unpublished letters finally goes public domain. #PublicDomainDayCountdown

October 8: In 1925, the Nobel Prize in Literature went to George Bernard Shaw. In 85 days, his Table-Talk, published that year, will join the public domain in the US, and all his solo works published in his lifetime will be public domain nearly everywhere else. #PublicDomainDayCountdown

October 9: Author and editor Edward Bok was born this day in 1863. In Twice Thirty (1925), he follows up his Pulitzer-winning memoir The Americanization of Edward Bok with a set of essays from the perspective of his 60s. It joins the public domain in 84 days. #PublicDomainDayCountdown

October 10: In the 1925 silent comedy “The Freshman”, Harold Lloyd goes to Tate University, “a large football stadium with a college attached”, and goes from tackling dummy to unlikely football hero. It joins the public domain in 83 days. #PublicDomainDayCountdown

October 11: It’s François Mauriac’s birthday. His Le Desert de l’Amour, a novel that won the 1926 Grand Prix of the Académie Française, joins the US public domain in 82 days. Published translations may stay copyrighted, but Americans will be free to make new ones. #PublicDomainDayCountdown

October 12: Pulitzer-winning legal scholar Charles Warren’s Congress, the Constitution, and the Supreme Court (1925) analyzes controversies, some still argued, over relations between the US legislature and the US judiciary. It joins the public domain in 81 days. #PublicDomainDayCountdown

October 13: Science publishing in 1925 was largely a boys’ club, but some areas were more open to women authors, such as nursing & science education. I look forward to Maude Muse’s Textbook of Psychology for Nurses going public domain in 80 days. #PublicDomainDayCountdown #AdaLovelaceDay

October 14: Happy birthday to poet E. E. Cummings, born this day in 1894. (while some of his poetry is lowercase he usually still capitalized his name when writing it out) His collection XLI Poems joins the public domain in 79 days. #PublicDomainDayCountdown

October 15: It’s PG Wodehouse’s birthday. In 78 days more of his humorous stories join the US public domain, including Sam in the Suburbs. It originally ran as a serial in the Saturday Evening Post in 1925. All that year’s issues also join the public domain then. #PublicDomainDayCountdown

October 16: Playwright and Nobel laureate Eugene O’Neill was born today in 1888. His “Desire Under the Elms” entered the US public domain this year; in 77 days, his plays “Marco’s Millions” and “The Great God Brown” will join it. #PublicDomainDayCountdown

October 17: Not everything makes it to the end of the long road to the US public domain. In 76 days, the copyright for the film Man and Maid (based on a book by Elinor Glyn) expires, but no known copies survive. Maybe someone will find one? #PublicDomainDayCountdown

October 18: Corra Harris became famous for her novel A Circuit Rider’s Wife and her World War I reporting. The work she considered her best, though, was As a Woman Thinks. It joins the public domain in 75 days. #PublicDomainDayCountdown

October 19: Edna St. Vincent Millay died 70 years ago today. All her published work joins the public domain in 74 days in many places outside the US. Here, magazine work like “Sonnet to Gath” (in Sep 1925 Vanity Fair) will join, but renewed post-’25 work stays in ©. #PublicDomainDayCountdown

October 20: All songs eventually reach the public domain. Authors can put them there themselves, like Tom Lehrer just did for his lyrics. But other humorous songs arrive by the slow route, like Tilzer, Terker, & Heagney’s “Pardon Me (While I Laugh)” will in 73 days. #PublicDomainDayCountdown

October 21: Sherwood Anderson’s Winesburg, Ohio wasn’t a best-seller when it came out, but his Dark Laughter was. Since Joycean works fell out of fashion, that book’s been largely forgotten, but may get new attention when it joins the public domain in 72 days. #PublicDomainDayCountdown

October 22: Artist NC Wyeth was born this day in 1882. The Brandywine Museum near Philadelphia shows many of his works. His illustrated edition of Francis Parkman’s book The Oregon Trail joins the public domain in 71 days. #PublicDomainDayCountdown

October 23: Today (especially at 6:02, on 10/23) many chemists celebrate #MoleDay. In 70 days, they’ll also get to celebrate historically important chemistry publications joining the US public domain, including all 1925 issues of Justus Liebigs Annalen der Chemie. #PublicDomainDayCountdown

October 24: While some early Alfred Hitchcock films were in the US public domain for a while due to formality issues, the GATT accords restored their copyrights. His directorial debut, The Pleasure Garden, rejoins the public domain (this time for good) in 69 days. #PublicDomainDayCountdown (Addendum: There may still be one more year of copyright to this film as of 2021; see the comments to this post for details.)

October 25: Albert Barnes took a different approach to art than most of his contemporaries. The first edition of The Art in Painting, where he explains his theories and shows examples from his collection, joins the public domain in 68 days. #PublicDomainDayCountdown

October 26: Prolific writer Carolyn Wells had a long-running series of mystery novels featuring Fleming Stone. Here’s a blog post by The Passing Tramp on one of them, The Daughter of the House, which will join the public domain in 67 days. #PublicDomainDayCountdown

October 27: Theodore Roosevelt was born today in 1858, and died over 100 years ago, but some of his works are still copyrighted. In 66 days, 2 volumes of his correspondence with Henry Cabot Lodge, written from 1884-1918 and published in 1925, join the public domain. #PublicDomainDayCountdown

October 28: American composer and conductor Howard Hanson was born on this day in 1896. His choral piece “Lament for Beowulf” joins the public domain in 65 days. #PublicDomainDayCountdown

October 29: “Skitter Cat” was a white Persian cat who had adventures in several children’s books by Eleanor Youmans, illustrated by Ruth Bennett. The first of the books joins the public domain in 64 days. #PublicDomainDayCountdown #NationalCatDay

October 30: “Secret Service Smith” was a detective created by Canadian author R. T. M. Maitland. His first magazine appearance was in 1920; his first original full-length novel, The Black Magician, joins the public domain in 9 weeks. #PublicDomainDayCountdown

October 31: Poet John Keats was born this day in 1795. Amy Lowell’s 2-volume biography links his Romantic poetry with her Imagist poetry. (1 review.) She finished and published it just before she died. It joins the public domain in 62 days. #PublicDomainDayCountdown

November 1: “Not just for an hour, not for just a day, not for just a year, but always.” Irving Berlin gave the rights to this song to his bride in 1926. Both are gone now, and in 2 months it will join the public domain for all of us, always. #PublicDomainDayCountdown

November 2: Mikhail Fokine’s The Dying Swan dance, set to music by Camille Saint-Saëns, premiered in 1905, but its choreography wasn’t published until 1925, the same year a film of it was released. It joins the public domain in 60 days. #PublicDomainDayCountdown (Choreography copyright is weird. Not only does the term not start until publication, which can be long after 1st performance, but what’s copyrightable has also changed. Before 1978 it had to qualify as dramatic; now it doesn’t, but it has to be more than a short step sequence.)

November 3: Herbert Hoover was the only sitting president to be voted out of office between 1912 & 1976. Before taking office, he wrote the foreword to Carolyn Crane’s Everyman’s House, part of a homeowners’ campaign he co-led. It goes out of copyright in 59 days. #PublicDomainDayCountdown

November 4: “The Golden Cocoon” is a 1925 silent melodrama featuring an election, jilted lovers, and extortion. The Ruth Cross novel it’s based on went public domain this year. The film will join it there in 58 days. #PublicDomainDayCountdown

November 5: Investigative journalist Ida Tarbell was born today in 1857. Her History of Standard Oil helped break up that trust in 1911, but her Life of Elbert H. Gary wrote more admiringly of his chairmanship of US Steel. It joins the public domain in 57 days. #PublicDomainDayCountdown

November 6: Harold Ross was born on this day in 1892. He was the first editor of The New Yorker, which he established in coöperation with his wife, Jane Grant. After ninety-five years, the magazine’s first issues are set to join the public domain in fifty-six days. #PublicDomainDayCountdown

November 7: “Sweet Georgia Brown” by Ben Bernie & Maceo Pinkard (lyrics by Kenneth Casey) is a jazz standard, the theme tune of the Harlem Globetrotters, and a song often played in celebration. One thing we can celebrate in 55 days is it joining the public domain. #PublicDomainDayCountdown

November 8: Today I hiked on the Appalachian Trail. It was completed in 1937, but parts are much older. Walter Collins O’Kane’s Trails and Summits of the White Mountains, published in 1925 when the AT was more idea than reality, goes public domain in 54 days. #PublicDomainDayCountdown

November 9: In Sinclair Lewis’ Arrowsmith, a brilliant medical researcher deals with personal and ethical issues as he tries to find a cure for a deadly epidemic. The novel has stayed relevant well past its 1925 publication, and joins the public domain in 53 days. #PublicDomainDayCountdown

November 10: John Marquand was born today in 1893. He’s known for his spy stories and satires, but an early novel, The Black Cargo, features a sailor curious about a mysterious payload on a ship he’s been hired onto. It joins the US public domain in 52 days. #PublicDomainDayCountdown

November 11: The first world war, whose armistice was 102 years ago today, cast a long shadow. Among the many literary works looking back to it is Ford Madox Ford’s novel No More Parades, part of his “Parade’s End” tetralogy. It joins the public domain in 51 days. #PublicDomainDayCountdown

November 12: Anne Parrish was born on this day in 1888. In 1925, The Dream Coach, co-written with her brother, got a Newbery honor, and her novel The Perennial Bachelor was a best-seller. The latter book joins the public domain in 50 days. #PublicDomainDayCountdown

November 13: In “The Curse of the Golden Cross”, G. K. Chesterton’s Father Brown once again finds a natural explanation to what seem to be preternatural symbols & events. As of today, Friday the 13th, the 1925 story is exactly 7 weeks away from the US public domain. #PublicDomainDayCountdown

November 14: The pop standard “Yes Sir, That’s My Baby” was the baby of Walter Donaldson (music) and Gus Kahn (lyrics). It’s been performed by many artists since its composition, and in 48 days, this baby steps out into the public domain. #PublicDomainDayCountdown

November 15: Marianne Moore, born on this day in 1887, had a long literary career, including editing the influential modernist magazine The Dial from 1925 on. In 47 days, all 1925 issues of that magazine will be fully in the public domain. #PublicDomainDayCountdown

November 16: George S. Kaufman, born today in 1889, wrote or directed a play in every Broadway season from 1921 till 1958. In 46 days, several of his plays join the public domain, including his still-performed comedy “The Butter and Egg Man”. #PublicDomainDayCountdown

November 17: Shen of the Sea was a Newbery-winning collection of stories presented as “Chinese” folktales, but written by American author Arthur Bowie Chrisman. Praised when first published, seen more as appropriation later, it’ll be appropriable itself in 45 days. #PublicDomainDayCountdown

November 18: I share a birthday today with Jacques Maritain, a French Catholic philosopher who influenced the Universal Declaration of Human Rights. His book on 3 reformers (Luther, Descartes, and Rousseau) joins the public domain in 44 days. #PublicDomainDayCountdown

November 19: Prevailing views of history change a lot over 95 years. The 1926 Pulitzer history prize went to a book titled “The War for Southern Independence”. The last volume of Edward Channing’s History of the United States, it joins the public domain in 43 days. #PublicDomainDayCountdown

November 20: Alfred North Whitehead’s Science and the Modern World includes a nuanced discussion of science and religion differing notably from many of his contemporaries’. (A recent review of it.) It joins the US public domain in 6 weeks.

November 21: Algonquin Round Table member Robert Benchley tried reporting, practical writing, & reviews, but soon found that humorous essays & stories were his forte. One early collection, Pluck and Luck, joins the public domain in 41 days. #PublicDomainDayCountdown

November 22: I’ve often heard people coming across a piano sit down & pick out Hoagy Carmichael’s “Heart and Soul”. He also had other hits, one being “Washboard Blues”. His original piano instrumental version becomes public domain in 40 days. #PublicDomainDayCountdown

November 23: Harpo Marx, the Marx Brothers mime, was born today in 1888. In his oldest surviving film, “Too Many Kisses”, he does “speak”, but silently (like everyone else in it), without his brothers. It joins the public domain in 39 days. #PublicDomainDayCountdown

November 24: In The Man Nobody Knows, Bruce Barton likened the world of Jesus to the world of business. Did he bring scriptural insight to management, or subordinate Christianity to capitalism? It’ll be easier to say, & show, after it goes public domain in 38 days. #PublicDomainDayCountdown

November 25: Before Virgil Thomson (born today in 1896) was well-known as a composer, he wrote a music column for Vanity Fair. His first columns, and the rest of Vanity Fair for 1925, join the public domain in 37 days. #PublicDomainDayCountdown

November 26: “Each moment that we’re apart / You’re never out of my heart / I’d rather be lonely and wait for you only / Oh how I miss you tonight” Those staying safe by staying apart this holiday might appreciate this song, which joins the public domain in 36 days. #PublicDomainDayCountdown (The song, “Oh, How I Miss You Tonight” is by Benny Davis, Joe Burke, and Mark Fisher, was published in 1925, and performed and recorded by many musicians since then, some of whom are mentioned in this Wikipedia article.)

November 27: Feminist author Katharine Anthony, born today in 1877, was best known for her biographies. Her 1925 biography of Catherine the Great, which drew extensively on the empress’s private memoirs, joins the public domain in 35 days. #PublicDomainDayCountdown

November 28: Tonight in 1925 “Barn Dance” (soon renamed “Grand Ole Opry”) debuted in Nashville. Most of the country music on it & similar shows then was old favorites, but there were new hits too, like “The Death of Floyd Collins”, which joins the public domain in 34 days. #PublicDomainDayCountdown (The song, with words by Andrew Jenkins and music by John Carson, was in the line of other disaster ballads that were popular in the 1920s. This particular disaster had occurred earlier in the year, and became the subject of song, story, drama, and film.)

November 29: As many folks get ready for Christmas, many Christmas-themed works are also almost ready to join the public domain in 33 days. One is The Holly Hedge, and Other Christmas Stories by Temple Bailey. More on the book & author. #PublicDomainDayCountdown

November 30: In 1925 John Maynard Keynes published The Economic Consequences of Sterling Parity objecting to Winston Churchill returning the UK to the gold standard. That policy ended in 1931; the book’s US copyright lasted longer, but will finally end in 32 days. #PublicDomainDayCountdown

December 1: Du Bose Heyward’s novel Porgy has a distinguished legacy of adaptations, including a 1927 Broadway play, and Gershwin’s opera “Porgy and Bess”. When the book joins the public domain a month from now, further adaptation possibilities are limitless. #PublicDomainDayCountdown

December 2: In Dorothy Black’s Romance — The Loveliest Thing a young Englishwoman “inherits a small sum of money, buys a motor car and goes off in search of adventure and romance”. First serialized in Ladies’ Home Journal, it joins the public domain in 30 days. #PublicDomainDayCountdown

December 3: Joseph Conrad was born on this day in 1857, and died in 1924, leaving unfinished his Napoleonic novel Suspense. But it was still far enough along to get serialized in magazines and published as a book in 1925, and it joins the public domain in 29 days. #PublicDomainDayCountdown

December 4: Ernest Hemingway’s first US-published story collection In Our Time introduced his distinctive style to an American audience that came to view his books as classics of 20th century fiction. It joins the public domain in 28 days. #PublicDomainDayCountdown

December 5: Libertarian author Rose Wilder Lane helped bring her mother’s “Little House” fictionalized memoirs into print. Before that, she published biographical fiction based on the life of Jack London, called He Was a Man. It joins the public domain in 27 days. #PublicDomainDayCountdown

December 6: Indiana naturalist and author Gene Stratton-Porter died on this day in 1924. Her final novel, The Keeper of the Bees, was published the following year, and joins the public domain in 26 days. One review. #PublicDomainDayCountdown

December 7: Willa Cather was born today in 1873. Her novel The Professor’s House depicts 1920s cultural dislocation from a different angle than F. Scott Fitzgerald’s better-known Great Gatsby. It too joins the public domain in 25 days. #PublicDomainDayCountdown

December 8: The last symphony published by Finnish composer Jean Sibelius (born on this day in 1865) is described in the Grove Dictionary as his “most remarkable compositional achievement”. It joins the public domain in the US in 24 days. #PublicDomainDayCountdown

December 9: When the Habsburg Empire falls, what comes next for the people & powers of Vienna? The novel Old Wine, by Phyllis Bottome (wife of the local British intelligence head) depicts a society undergoing rapid change. It joins the US public domain in 23 days. #PublicDomainDayCountdown

December 10: Lewis Browne was “a world traveler, author, rabbi, former rabbi, lecturer, socialist and friend of the literary elite”. His first book, Stranger than Fiction: A Short History of the Jews, joins the public domain in 22 days. #PublicDomainDayCountdown

December 11: In 1925, John Scopes was convicted for teaching evolution in Tennessee. Books explaining the science to lay audiences were popular that year, including Henshaw Ward’s Evolution for John Doe. It becomes public domain in 3 weeks. #PublicDomainDayCountdown

December 12: Philadelphia artist Jean Leon Gerome Ferris was best known for his “Pageant of a Nation” paintings. Three of them, “The Birth of Pennsylvania”, “Gettysburg, 1863”, and “The Mayflower Compact”, join the public domain in 20 days. #PublicDomainDayCountdown

December 13: The Queen of Cooks, and Some Kings was a memoir of London hotelier Rosa Lewis, as told to Mary Lawton. Her life story was the basis for the BBC and PBS series “The Duchess of Duke Street”. It joins the public domain in 19 days. #PublicDomainDayCountdown

December 14: Today we’re celebrating new films being added to the National Film Registry. In 18 days, we can also celebrate more Registry films joining the public domain. One is The Clash of the Wolves, starring Rin Tin Tin. #PublicDomainDayCountdown

December 15: Etsu Inagaki Sugimoto, daughter of a high-ranking Japanese official, moved to the US in an arranged marriage after her family fell on hard times. Her 1925 memoir, A Daughter of the Samurai, joins the public domain in 17 days. #PublicDomainDayCountdown

December 16: On the Trail of Negro Folk-Songs, compiled by Dorothy Scarborough with assistance from Ola Lee Gulledge, has over 100 songs. Scarborough’s next of kin (not Gulledge, or any of their sources) renewed its copyright in 1953. But in 16 days, it’ll be free for all. #PublicDomainDayCountdown

December 17: Virginia Woolf’s writings have been slowly entering the public domain in the US. We’ve had the first part of her Mrs. Dalloway for a while. The complete novel, and her first Common Reader essay collection, join it in 15 days. #PublicDomainDayCountdown

December 18: Lovers in Quarantine with Harrison Ford sounds like a movie made for 2020, but it’s actually a 1925 silent comedy (with a different Harrison Ford). It’ll be ready to go out into the public domain after a 14-day quarantine. #PublicDomainDayCountdown

December 19: Ma Rainey wrote, sang, and recorded many blues songs in a multi-decade career. Two of her songs becoming public domain in 13 days are “Shave ’em Dry” (written with William Jackson) & “Army Camp Harmony Blues” (with Hooks Tilford). #PublicDomainDayCountdown

December 20: For years we’ve celebrated the works of prize-winning novelist Edith Wharton as her stories join the public domain. In 12 days, The Writing of Fiction, her book on how she writes her memorable tales, will join that company. #PublicDomainDayCountdown

December 21: Albert Payson Terhune, born today in 1872, raised and wrote about dogs he kept at what’s now a public park in New Jersey. His book about Wolf, who died heroically and is buried there, will also be in the public domain in 11 days. #PublicDomainDayCountdown

December 22: In the 1920s it seemed Buster Keaton could do anything involving movies. Go West, a 1925 feature film that he co-wrote, directed, co-produced, and starred in, is still enjoyed today, and it joins the public domain in 10 days. #PublicDomainDayCountdown

December 23: In 9 days, not only will Theodore Dreiser’s massive novel An American Tragedy be in the public domain, but so will a lot of the raw material that went into it. Much of it is in @upennlib‘s special collections. #PublicDomainDayCountdown

December 24: Johnny Gruelle, born today in 1880, created the Raggedy Ann doll, and a series of books sold with it that went under many Christmas trees. Two of them, Raggedy Ann’s Alphabet Book and Raggedy Ann’s Wishing Pebble, join the public domain in 8 days. #PublicDomainDayCountdown

December 25: Written in Hebrew by Joseph Klausner, translated into English by Anglican priest Herbert Danby, Jesus of Nazareth reviewed Jesus’s life and teachings from a Jewish perspective. It made a stir when published in 1925, & joins the public domain in 7 days. #PublicDomainDayCountdown

December 26: “It’s a travesty that this wonderful, hilarious, insightful book lives under the inconceivably large shadow cast by The Great Gatsby.” A review of Anita Loos’s Gentlemen Prefer Blondes, also joining the public domain in 6 days. #PublicDomainDayCountdown

December 27: “On revisiting Manhattan Transfer, I came away with an appreciation not just for the breadth of its ambition, but also for the genius of its representation.” A review of the John Dos Passos novel becoming public domain in 5 days. #PublicDomainDayCountdown

December 28: All too often legal systems and bureaucracies can be described as “Kafkaesque”. The Kafka work most known for that sense of arbitrariness and doom is Der Prozess (The Trial), reviewed here. It joins the public domain in 4 days. #PublicDomainDayCountdown

December 29: Chocolate Kiddies, an African American music and dance revue that toured Europe in 1925, featured songs by Duke Ellington and Jo Trent including “Jig Walk”, “Jim Dandy”, and “With You”. They join the public domain in 3 days. #PublicDomainDayCountdown

December 30: Lon Chaney starred in 2 of the top-grossing movies of 1925. The Phantom of the Opera has long been in the public domain due to copyright nonrenewal. The Unholy Three, which was renewed, joins it in the public domain in 2 days. #PublicDomainDayCountdown (If you’re wondering why some of the other big film hits of 1925 haven’t been in this countdown, in many cases it’s also because their copyrights weren’t renewed. Or they weren’t actually copyrighted in 1925.)

December 31: “…You might as well live.” Dorothy Parker published “Resumé” in 1925, and ultimately outlived most of her Algonquin Round Table-mates. This poem, and her other 1925 writing for periodicals, will be in the public domain tomorrow. #PublicDomainDayCountdown

Drupal 9: We Did It. / Islandora


Just before the holiday break, a brave handful of devs adventured through terrifying dependency chains and deprecations to bring Islandora on Drupal 9 to us all. There were many twists, turns, and bumps in the road, but in the end, they pulled it off. Islandora is now compatible with both Drupal 8 and 9! In fact, as an outcome of this sprint, keep.lib.asu.edu is now running on Drupal 9 in the wild.  Oh, and did we mention that it's kind of a multi-site? Maybe it's not exactly what a multi-site was in 7.x, but prism.lib.asu.edu is running happily alongside.

Moving Islandora to Drupal 9 is a significant accomplishment for a small crew of volunteer developers just before the holidays. In particular, it was a lot of work to thread the needle and keep support for 8 while unlocking 9. Not to mention all the testing that had to be done.  And as an added bonus, everyone's favourite CI platform, Travis, made a guest appearance just to keep things interesting. Many thanks to everyone who powered through and brought this over the finish line. 

In order to support both 8 and 9 concurrently, there are a few things you'll need to be ready for when this code is eventually released.

  • To use Composer 2 with Drupal 8, you need PHP 7.3+.
  • You can no longer install or maintain Drupal Console using Composer 2, but you can still install it via other means.
  • For Drupal 9, you need PHP 7.4+, because one of Drupal 9's dependencies requires it.
  • We had to leave behind some modules for Drupal 9 compatibility. You'll need to uninstall and remove these modules from your site if you want Drupal 9 (see the sketch after this list):
    • hook_post_action
    • libraries
    • name
    • permissions_by_term
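
As a rough sketch of that cleanup, assuming a Composer-managed site with Drush available (the module and package names below are the Drupal project names, and may differ from how they appear in your composer.json):

    # Check your PHP version first: Composer 2 on Drupal 8 needs PHP 7.3+,
    # and Drupal 9 needs PHP 7.4+.
    php -v

    # Uninstall the modules in Drupal before deleting their code,
    # so the site isn't left pointing at files that no longer exist.
    drush pm:uninstall hook_post_action libraries name permissions_by_term

    # Then drop the packages from the Composer project.
    composer remove drupal/hook_post_action drupal/libraries drupal/name drupal/permissions_by_term

If any of these modules are doing real work on your site (permissions_by_term in particular), plan a replacement before uninstalling.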

As for how you can get this code, it's available right now on all 8.x-1.x branches of our modules on GitHub. For the less adventurous out there, this code will be included in the next release, which means we have to tackle documenting and demystifying the update process. And given the size of the changes required for Drupal 9, it feels like it's about time we start formally providing changelogs. The changes aren't unreasonable, but they are significant enough that we wouldn't want anyone to be caught by surprise.
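
If you'd rather not wait, one possible way to pull in the development code with Composer (assuming your site gets Islandora modules through the Drupal packagist, where an 8.x-1.x branch is exposed as 1.x-dev; the package name here is illustrative and should match whatever your composer.json already requires):

    # Ask Composer for the dev branch and let it update that package's dependencies too.
    composer require drupal/islandora:1.x-dev --update-with-dependencies

    # Apply any pending database updates and rebuild caches afterwards.
    drush updatedb
    drush cache:rebuild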

Also, now that we support two versions of Drupal at the same time, what do we call this?  Islandora 8 using Drupal 9?  Islandora 9?  Dare I say just Islandora? That might seem like a silly question, but it has implications that we as a community need to consider. Besides, we all know that naming is the hardest part of programming ;)

As always, there are so many people to thank for continuously improving Islandora. This software keeps getting better because awesome people give their time and talent to help make it the best it can be. Thank you to everyone who took part in this sprint:

  • Daniel Aitken - discoverygarden
  • Jordan Dukart - discoverygarden
  • Nick Ruest - York University
  • Seth Shaw - University of Nevada, Las Vegas
  • Alan Stanley - Agile Humanities
  • Jared Whiklo - University of Manitoba
  • Eli Zoller - Arizona State University

Be sure to thank one of these fine people next time you bump into them in Zoom or Slack. And we'll be doing more sprints soon. Details to come.