Planet Code4Lib

Open Knowledge International receives $1.5 million from Omidyar Network / Open Knowledge Foundation

We’ve recently received funding from Omidyar Network, which will allow us to further our commitment to civil society organisations!

Open Knowledge International has received a two-year grant amounting to $1.5 million from Omidyar Network to support the development and implementation of our new civil society-focused strategy. Running until the end of December 2018, this grant reflects Omidyar Network’s confidence in our shared vision to progress openness in society and we are looking forward to using the funds to strengthen the next phase of our work.

With over a decade’s experience opening up information, we will be turning our attention and efforts to focus on realising the potential of data for society. The unrestricted nature of the funding will help us to build on the successes of our past, work with new partners and implement effective systems to constructively address the challenges before us.

2017 certainly presents new challenges to the open data community. Increased access to information simply is not enough to confront a shrinking civic space, the stretched capacities of NGOs, and countless social and environmental issues. Open Knowledge International is looking to work with new partners on these areas to use open data as an effective tool to address society’s most pressing issues. Omidyar Network’s support will allow us to work in more strategic ways, to develop relationships with new partners and to embed our commitment to civil society across the organisation.

Pavel Richter, Open Knowledge International’s CEO, underlines the impact that this funding will have on the organisation’s continued success: “Given the expertise Open Knowledge International has amassed over the years, we are eager to employ our efforts to ensure open data makes a real and positive impact in the world. Omidyar Network’s support for the next two years will allow us to be much more strategic and effective with how we work.”

Of course implementing our strategic vision will take time. Long-term funding relationships like the one we have with Omidyar Network play an instrumental role in boosting Open Knowledge International’s capacity as they provide the space to stabilise and grow. For the past six years, Omidyar Network has been an active supporter of Open Knowledge International, and this has allowed us to cultivate and refine the strong vision we have today. More recently Omidyar Network has provided valuable expertise for our operational groundwork, helping to instil a suitable structure for us to thrive. Furthermore, our shared vision of the transformative impact of openness has allowed us to scale our community and grow our network of committed change-makers and activists around the world.

“We are proud to continue our support for Open Knowledge International, which plays a critical role in the open data ecosystem,” stated Martin Tisné, Investment Partner at Omidyar Network. “Open Knowledge International has nurtured several key developments in the field, including the Open Definition, CKAN and the School of Data, and we look forward to working with Open Knowledge International as it rolls out its new civil society-focused strategy.”

As we continue to chart our direction, Open Knowledge International’s work will focus on three areas to unlock the potential value of open data for civil society organisations: we will demonstrate the value of open data for the work of these organisations, we will provide organisations with the tools and skills to effectively use open data, and we will work to make government information systems more responsive to the needs of civil society. Omidyar Network’s funding ensures Open Knowledge International has the capacity to address these three areas. We are grateful for the support and we welcome our new strategic focus to empower civil society organisations to use open data to improve people’s lives.

Further information:

Open Knowledge International

Open Knowledge International is a global non-profit organisation focused on realising open data’s value to society by helping civil society groups access and use data to take action on social problems. Open Knowledge International does this in three ways: 1) we show the value of open data for the work of civil society organizations; 2) we provide organisations with the tools and skills to effectively use open data; and 3) we make government information systems responsive to civil society.

Omidyar Network 

Omidyar Network is a philanthropic investment firm dedicated to harnessing the power of markets to create opportunity for people to improve their lives. Established in 2004 by eBay founder Pierre Omidyar and his wife Pam, the organization invests in and helps scale innovative organizations to catalyze economic and social change. Omidyar Network has committed more than $1 billion to for-profit companies and nonprofit organizations that foster economic advancement and encourage individual participation across multiple initiatives, including Education, Emerging Tech, Financial Inclusion, Governance & Citizen Engagement, and Property Rights.

To learn more, visit, and follow on Twitter @omidyarnetwork


Excel is threatening the quality of research data — Data Packages are here to help / Open Knowledge Foundation

This week the Frictionless Data team at Open Knowledge International will be speaking at the International Digital Curation Conference #idcc17 on making research data quality visible. Dan Fowler looks at why the popular file format Excel is problematic for research and what steps can be taken to ensure data quality is maintained throughout the research process.

Our Frictionless Data project aims to make sharing and using data as easy and frictionless as possible by improving how data is packaged. The project is designed to support the tools and file formats researchers use in their everyday work, including basic CSV files and popular data analysis programming languages and frameworks like R and Python Pandas. However, Microsoft Excel, both the application and the file format, remains very popular for data analysis in scientific research.

It is easy to see why Excel retains its stranglehold: over the years, an array of convenience features for visualizing, validating, and modeling data has been developed and adopted across a variety of uses. Simple features, like the ability to group related tables together, are a major advantage of the Excel format over, for example, single-table formats like CSV. However, Excel has a well-documented history of silently corrupting data in unexpected ways, which has led some, like data scientist Jenny Bryan, to compile lists of “Scary Excel Stories” advising researchers to choose alternative formats, or at least to treat data stored in Excel warily.


With data validation and long-term preservation in mind, we’ve created Data Packages which provide researchers an alternative format to Excel by building on simpler, well understood text-based file formats like CSV and JSON and adding advanced features.  Added features include providing a framework for linking multiple tables together; setting column types, constraints, and relations between columns; and adding high-level metadata like licensing information.  Transporting research data with open, granular metadata in this format, paired with tools like Good Tables for validation, can be a safer and more transparent option than Excel.

Why does open, granular metadata matter?

With our “Tabular” Data Packages, we focus on packaging data that naturally exists in “tables”—for example, CSV files—a clear area of importance to researchers illustrated by guidelines issued by the Wellcome Trust’s publishing platform Wellcome Open Research. The guidelines mandate:

Spreadsheets should be submitted in CSV or TAB format; EXCEPT if the spreadsheet contains variable labels, code labels, or defined missing values, as these should be submitted in SAV, SAS or POR format, with the variable defined in English.

Guidelines like these typically mandate that researchers submit data in non-proprietary formats; SPSS, SAS, and other proprietary data formats are accepted because they provide important contextual metadata that haven’t been supported by a standard, non-proprietary format. The Data Package specifications, and in particular our Table Schema specification, provide a method of assigning functional “schemas” for tabular data. This information includes the expected type of each value in a column (“string”, “number”, “date”, etc.), constraints on the value (“this string can only be at most 10 characters long”), and the expected format of the data (“this field should only contain strings that look like email addresses”). The Table Schema can also specify relations between tables, strings that indicate “missing” values, and formatting information.
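To make the idea of a schema concrete, here is a minimal sketch in JavaScript. The descriptor keys (fields, type, format, constraints, missingValues) follow the Table Schema specification, but the field names, the sample rows, and the checkValue and validateRow helpers are hypothetical and written only for illustration. This is not how Good Tables or any official library is implemented.

```javascript
// Hypothetical Table Schema descriptor for a small gene-expression table.
// Field names are made up; the keys follow the Table Schema specification.
const schema = {
  fields: [
    { name: 'gene', type: 'string', constraints: { required: true, maxLength: 10 } },
    { name: 'contact', type: 'string', format: 'email' },
    { name: 'expression', type: 'number' }
  ],
  missingValues: ['', 'NA']
};

// Deliberately simplified checks; a real validator covers far more cases.
function checkValue(field, raw, missingValues) {
  const constraints = field.constraints || {};
  if (missingValues.includes(raw)) {
    return constraints.required ? `${field.name}: value is required` : null;
  }
  if (constraints.maxLength !== undefined && raw.length > constraints.maxLength) {
    return `${field.name}: longer than ${constraints.maxLength} characters`;
  }
  if (field.type === 'number' && !Number.isFinite(Number(raw))) {
    return `${field.name}: not a number`;
  }
  if (field.format === 'email' && !/^[^@\s]+@[^@\s]+$/.test(raw)) {
    return `${field.name}: does not look like an email address`;
  }
  return null;
}

// Returns a list of human-readable problems for one row of raw strings.
function validateRow(tableSchema, row) {
  const missing = tableSchema.missingValues || [''];
  return tableSchema.fields
    .map((field) => checkValue(field, row[field.name], missing))
    .filter((problem) => problem !== null);
}

console.log(validateRow(schema, { gene: 'DEC1', contact: 'lab@example.org', expression: '4.2' }));
console.log(validateRow(schema, { gene: '', contact: 'not-an-email', expression: 'NA' }));
```

The first row passes every check. The second is flagged for a missing required gene name and a malformed email address, while its "NA" expression value is accepted as a declared missing value.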

This information can prevent incorrect processing of data at the loading step.  In the absence of these table declarations, even simple datasets can be imported incorrectly in data analysis programs given the heuristic (and sometimes, in Excel’s case, byzantine) nature of automatic type inference.  In one example of such an issue, Zeeberg et al. and later Ziemann, Eren and El-Osta describe a phenomenon where gene expression data was silently corrupted by Microsoft Excel:

A default date conversion feature in Excel (Microsoft Corp., Redmond, WA) was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] [3] was being converted to ’1-DEC.’ [16]

These errors didn’t stop at the initial publication. As these Excel files are uploaded to other databases, the errors could propagate through data repositories, as happened in the now-replaced “LocusLink” database. At a time when data sharing and reproducible research are gaining traction, the last thing researchers need is file formats that introduce errors.

Much like Boxed Water, Packaged Data is better because it is easier to move.

Zeeberg’s team described various technical workarounds to avoid Excel problems, including using Excel’s text import wizard to manually set column types every time the file is opened.  However, the researchers acknowledge that this requires constant vigilance to prevent further errors, attention that could be spent elsewhere.   Rather, a simple, open, and ubiquitous method to unambiguously declare types in column data—columns containing gene names (e.g. “DEC1”) are strings not dates and “RIKEN identifiers” (e.g. “2310009E13”) are strings not floating point numbers—paired with an Excel plugin that reads this information may be able to eliminate the manual steps outlined above.
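The difference between guessing types and declaring them can be sketched in a few lines of JavaScript. The guessAndConvert function below is only a stand-in that imitates the two failure modes described above; it is not Excel's actual algorithm, and both function names are hypothetical.

```javascript
// Gene-like names such as DEC1 that a date heuristic may mistake for dates.
const MONTH_LIKE = /^(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)(\d{1,2})$/i;

// Naive type guessing, loosely imitating spreadsheet-style inference.
// This is NOT Excel's real logic, only an illustration of the failure modes.
function guessAndConvert(raw) {
  const dateLike = MONTH_LIKE.exec(raw);
  if (dateLike) {
    return `${Number(dateLike[2])}-${dateLike[1].toUpperCase()}`; // name becomes a "date"
  }
  if (raw.trim() !== '' && Number.isFinite(Number(raw))) {
    return Number(raw); // "2310009E13" is read as scientific notation
  }
  return raw;
}

// Casting driven by a declared type: strings are never reinterpreted.
function castDeclared(raw, type) {
  if (type === 'string') return raw;
  if (type === 'number') {
    const value = Number(raw);
    if (!Number.isFinite(value)) throw new TypeError(`"${raw}" is not a number`);
    return value;
  }
  throw new Error(`unsupported type: ${type}`);
}

console.log(guessAndConvert('DEC1'));                // mangled into a date-like string
console.log(typeof guessAndConvert('2310009E13'));   // identifier turned into a number
console.log(castDeclared('DEC1', 'string'));         // left untouched
console.log(castDeclared('2310009E13', 'string'));   // left untouched
```

With a declared "string" type there is nothing to guess, so both identifiers survive loading unchanged.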

Granular Metadata Standards Allow for New Tools & Integrations

By publishing this granular metadata with the data, both users and software programs can use it to automatically import into Excel, and this benefit also accrues when similar integrations are created for other data analysis software packages, like R and Python.  Further, these specifications (and specifications like them) allow for the development of whole new classes of tools to manipulate data without the overhead of Excel, while still including data validation and metadata creation.

For instance, the Open Data Institute has created Comma Chameleon, a desktop CSV editor.  You can see a talk about Comma Chameleon on our Labs blog.  Similarly, Andreas Billman created SmartCSV.fx to solve the issue of broken CSV files provided by clients.  While initially this project depended on an ad hoc schema for data, the developer has since adopted our Table Schema specification.

Other approaches that bring spreadsheets together with Data Packages include Metatab, which aims to provide a useful standard, modeled on the Data Package, for storing metadata within spreadsheets. To solve the general case of reading Data Packages into Excel, Nimble Learn has developed an interface for loading Data Packages through Excel’s Power Query add-in.

For examples of other ways in which Excel mangles good data, it is worth reading through Quartz’s Bad Data guide and checking over your data.  Also, see our Frictionless Data Tools and Integrations page for a list of integrations created so far.   Finally, we’re always looking to hear more user stories for making it easier to work with data in whatever application you are using.

This post was adapted from a paper that our Jo Barratt will be presenting at the International Digital Curation Conference (IDCC), covering our work to date on Making Research Data Quality Visible.

Truly progressive WebVR apps are available offline! / Dan Scott

I've been dabbling with the A-Frame framework for creating WebVR experiences for the past couple of months, ever since Patrick Trottier gave a lightning talk at the GDG Sudbury DevFest in November and a hands-on session with A-Frame in January. The @AFrameVR Twitter feed regularly highlights cool new WebVR apps, and one that caught my attention was ForestVR - a peaceful forest scene with birds tweeting in the distance. "How nice would it be", I thought, "if I could just escape into that little scene wherever I am, without worrying about connectivity or how long it would take to download?"

Then I realized that WebVR apps are a great use case for Progressive Web App (PWA) techniques that allow web apps to be as fast, reliable, and engaging as native Android apps. With the source code for ForestVR at my disposal, I set out to add offline support. And it turned out to be surprisingly easy to make this work on Android in both the Firefox and Chrome browsers.

If you just want to see the required changes for this specific example, you can find the relevant two commits at the tip of my branch. The live demo is at

ForestVR with "Add to Home Screen" menu on Firefox for Android 51.0.3

ForestVR with "Add" prompt on Chrome for Android 57

In the following sections I've written an overview of the steps you have to take to turn your web app into a PWA:

Describe your app with a Web App Manifest

ForestVR already had a working Web App Manifest (Mozilla docs / Google docs), a simple JSON file that defines metadata about your web app such as the app name and icon to use when it is added to your home screen, the URL to launch, the splash screen to show when it is loading, and other elements that enable it to integrate with the Android environment.

The web app manifest for ForestVR is named manifest.json and contains the following code:

{
  "name": "Forest VR",
  "icons": [
    {
      "src": "./assets/images/icons/android-chrome-144x144.png",
      "sizes": "144x144",
      "type": "image/png"
    }
  ],
  "theme_color": "#ffffff",
  "background_color": "#ffffff",
  "start_url": "./index.html",
  "display": "standalone",
  "orientation": "landscape"
}

You associate the manifest with your web app through a simple <link> element in the <head> of your HTML:

<link rel="manifest" href="manifest.json">

Create a service worker to handle offline requests

A service worker is a special chunk of JavaScript that runs independently from a given web page, and can perform special tasks such as intercepting and changing browser fetch requests, sending notifications, and synchronizing data in the background (Google docs / Mozilla docs). While implementing the required networking code for offline support would be painstaking, bug-prone work, Google has fortunately made the sw-precache node module available to support generating a service worker from a simple configuration file and any static files in your deployment directory.
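To give a feel for what intercepting a request involves, here is a rough sketch of a cache-first lookup. It is not the code that sw-precache generates. The cache and the network are passed in as parameters (a Map and a fake fetch function, both stand-ins) so that the logic can be run outside a browser; in a real service worker those roles are played by the Cache API and fetch(). The sound file path is hypothetical.

```javascript
// Cache-first lookup: answer from the cache when possible, otherwise go to
// the network once and remember the result for next time.
async function cacheFirst(url, cache, fetchFromNetwork) {
  if (cache.has(url)) {
    return cache.get(url); // served without touching the network
  }
  const response = await fetchFromNetwork(url);
  cache.set(url, response); // keep it for the next request
  return response;
}

// Stand-ins for illustration: a Map as the cache and a fake network call.
const demoCache = new Map();
let networkCalls = 0;
async function fakeNetwork(url) {
  networkCalls += 1;
  return `contents of ${url}`;
}

async function demo() {
  // Hypothetical asset path; requested twice on purpose.
  await cacheFirst('/assets/sounds/forest.mp3', demoCache, fakeNetwork);
  await cacheFirst('/assets/sounds/forest.mp3', demoCache, fakeNetwork);
  return networkCalls; // only the first request reaches the network
}

demo().then((calls) => console.log(`network requests made: ${calls}`));
```

Once an asset has been fetched a single time, later requests no longer need a connection, which is the property that makes offline launches possible.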

The configuration I added to the gulpfile of the existing gulp build system uses runtime caching for assets that are hosted at a different hostname or that, like the background soundtrack, are not essential for the experience at launch and can thus be loaded and cached after the main experience has been prepared. The staticFileGlobs list, on the other hand, defines all of the assets that must be cached before the app can launch.

swConfig = {
  runtimeCaching: [{
    urlPattern: /^https:\/\/cdn\.rawgit\.com\//,
    handler: 'cacheFirst'
  }, {
    urlPattern: /^https:\/\/aframe\.io\//,
    handler: 'cacheFirst'
  }, {
    urlPattern: /\/assets\/sounds\//,
    handler: 'cacheFirst'
  }],
  staticFileGlobs: [
    ...
  ]
};
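
As a quick way to reason about which requests fall under the runtime caching rules, the three urlPattern values from the configuration above can be tested against some sample URLs. The usesRuntimeCaching helper and the sample URLs are hypothetical; only the regular expressions come from the configuration.

```javascript
// The three urlPattern values from the configuration above.
const runtimePatterns = [
  /^https:\/\/cdn\.rawgit\.com\//,
  /^https:\/\/aframe\.io\//,
  /\/assets\/sounds\//
];

// True when a request URL would be handled by one of the runtime caching rules.
function usesRuntimeCaching(url) {
  return runtimePatterns.some((pattern) => pattern.test(url));
}

// Hypothetical request URLs, for illustration only.
const sampleUrls = [
  'https://aframe.io/releases/example/aframe.min.js',
  'https://example.org/forestvr/assets/sounds/birds.mp3',
  'https://example.org/forestvr/bundle.js'
];

for (const url of sampleUrls) {
  console.log(usesRuntimeCaching(url) ? 'runtime cache' : 'not matched', url);
}
```

Anything that is not matched here has to be covered by staticFileGlobs instead if it should be available offline at first launch.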

I defined the configuration inside a new writeServiceWorkerFile() function so that I could add it as a build task to the gulpfile:

function writeServiceWorkerFile(callback) {
  swConfig = {...}
  swPrecache.write('service-worker.js', swConfig, callback);
}

In that gulp task, I declared the 'scripts' and 'styles' tasks as prerequisites for generating the service worker, as those tasks generate the bundle.js and bundle.css files. If the files are not present in the build directory when sw-precache runs, then it will simply ignore their corresponding entry in the configuration, and they will not be available for offline use.

gulp.task('generate-service-worker', ['scripts', 'styles'], function(callback) {
  writeServiceWorkerFile(callback);
});

I added the generate-service-worker task to the deploy task so that the service worker will be generated every time we build the app.


Register the service worker

Just like the Web App Manifest, you need to register your service worker--but it's a little more complex. I chose Google's boilerplate service worker registration script because it contains self-documenting comments and hooks for adding more interactivity, and added it in a <script> element in the <head> of the HTML page.
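For readers who have not seen one, a bare-bones registration looks roughly like the sketch below. This is not Google's boilerplate script, which does considerably more. The registerServiceWorker wrapper is a hypothetical helper that takes the navigator object as a parameter so the snippet can also be loaded outside a browser, while navigator.serviceWorker.register() is the standard browser API.

```javascript
// Registers a service worker if the given navigator-like object supports it.
// Returns the registration promise, or null when service workers are unavailable.
function registerServiceWorker(nav, scriptUrl) {
  if (!nav || !('serviceWorker' in nav)) {
    return null; // unsupported browsers keep working, just without offline support
  }
  return nav.serviceWorker.register(scriptUrl)
    .then((registration) => {
      console.log('service worker registered with scope', registration.scope);
      return registration;
    })
    .catch((error) => {
      console.error('service worker registration failed', error);
      return null;
    });
}

// In a page this runs against the real navigator object.
// The typeof guard only exists so the snippet loads in other environments too.
if (typeof navigator !== 'undefined') {
  registerServiceWorker(navigator, 'service-worker.js');
}
```

Feature detection matters here because the page should still load normally in browsers that do not support service workers.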

Host your app with HTTPS

PWAs--specifically service workers--require the web app to be hosted on an HTTPS-enabled site due to the potential for mischief that service workers could cause if replaced by a man-in-the-middle attack that would be trivial with a non-secure site. Fortunately, my personal VPS already runs HTTPS thanks to free TLS certificates generated by Let's Encrypt.

Check for success with Lighthouse

Google has made Lighthouse, their PWA auditing tool, available as both a command-line oriented node module and a Chrome extension for grading the quality of your efforts. It runs a separate instance of Chrome to check for offline support, responsiveness, and many other required and optional attributes and generates succinct reports with helpful links for more information on any less-than-stellar results you might receive.

Check for success with your mobile web browser

Once you have satisfied Lighthouse's minimum requirements, load the URL in Firefox or Chrome on Android and try adding it to your home screen.

  • In Firefox, you will find the Add to Home Screen option in the browser menu under the Page entry.
  • In Chrome, the Add button (Chrome 57) or Add to Home Screen button (Chrome 56) will appear at the bottom of the page when you have visited it a few times over a span of five minutes or more; a corresponding entry may also appear in your browser menu.

Put your phone in airplane mode and launch the app from your shiny new home screen button. If everything has gone well, it should launch and run successfully even though you have no network connection at all!


As a relative newbie to node projects, I spent most of my time in figuring out how to integrate the sw-precache build steps nicely into the existing gulp build, and in making the app relocatable on different hosts and paths for testing purposes. The actual service worker itself was straightforward. While I used ForestVR as my proof of concept, the process should be similar for turning any other WebVR app into a Progressive WebVR App. I look forward to seeing broader adoption of this approach for a better WebVR experience on mobile!

As an aside for my friends in the library world, I plan to apply the same principles to making the My Account portion of the Evergreen library catalogue a PWA in time for the 2017 Evergreen International Conference. Here's hoping more library software creators are thinking about improving their mobile experience as well...

Today, I learned about the Accessibility Tree / LibUX

Today, I learned about the “accessibility tree.”

I am not sure whom to attribute this diagram to, but I borrowed it from Marcy Sutton.

The accessibility tree and the DOM tree are parallel structures. Roughly speaking the accessibility tree is a subset of the DOM tree. It includes the user interface objects of the user agent and the objects of the document. Accessible objects are created in the accessibility tree for every DOM element that should be exposed to an assistive technology, either because it may fire an accessibility event or because it has a property, relationship or feature which needs to be exposed. Generally if something can be trimmed out it will be, for reasons of performance and simplicity. For example, a <span> with just a style change and no semantics may not get its own accessible object, but the style change will be exposed by other means. (W3C Core Accessibility Mappings 1.1)

Basically, when a page renders in the browser, there is the Document Object Model (DOM) that is the underlying structure of the page that the browser interfaces with. It informs the browser that such-and-such is the title, what markup to render, and so on. It’s hierarchically structured kind of like a tree. There’s a root and a bunch of branches.

At the same time, there is an accessibility tree that is created. Browsers make them to give assistive technology something to latch on to.

When we use ARIA attributes, we are in part giving instructions to the browser about how to render that accessibility tree.

There’s a catch: not all browsers create accessibility trees in the same way; not all screen readers interpret accessibility trees in the same way; and not all screen readers even refer to the accessibility tree. Some scrape the DOM directly instead, and some do both.

The Space Age: Library as Location / LITA

On the surface, a conversation about the physical spaces within libraries might not seem relevant to technology in libraries, but there’s a trend I’ve noticed, not only in my own library but also in other libraries I’ve visited in recent months: user-supplied tech in library landscapes.

Over the course of the last decade, we’ve seen a steady rise in the use of portable personal computing devices. In their Evolution of Technology survey results, Pew Research Center reports that 51% of Americans own a tablet, and 77% own smartphones. Library patrons seem to be doing less browsing and more computing, and user-supplied technology has become ubiquitous: smartphones, and tablets, and notebooks, oh my! Part of the reason for this BYO tech surge may be explained by a triangulation of high demand for the library’s public computer stations, decreased cost of personal devices, and the rise of telecommuting and freelance gig-work in the tech sector. Whatever the reasons, it seems that a significant share of patrons are coming to the library to use the wi-fi and the workspace.

I recently collected data for a space-use analysis at my library, and found that patrons who used our library for computing with personal devices outnumbered browsers, readers, and public computer users 3:1. During the space use survey, I noted that whenever our library classrooms are not used for a class, they’re peopled with multiple users who “camp” there, working for 2 – 4 hours at a time. Considering elements of these more recently constructed rooms that differ from the space design in the rest of the 107-year-old building offers a way into thinking about future improvements. Below are a few considerations that may support independent computers and e-commuters in the library space.

Ergonomic Conditions

Furnish work spaces with chairs designed to provide lumbar support and encourage good posture, as well as tables that match the chairs in terms of height ratio to prevent wrist- and shoulder-strain.

Adequate Power

A place to plug in at each surface allows users to continue working for long periods. It’s important to consider not only the number of outlets, but their position: cords stretched across spaces between tables and walls could result in browsers tripping, or knocking laptops off a table.

Reliable Wireless Signal

It goes without saying that telecommuters need the tele– to do their commuting. Fast, reliable wi-fi is a must-have.

Concentration-Inducing Environment

If possible, a library’s spaces should be well-defined, with areas for users to meet and talk, and areas of quiet where users can focus on their work without interruption. Sound isn’t the only environmental consideration. A building that’s too hot or too cold can be distracting. High-traffic areas — such as spaces near doors, teens’ and children’s areas, or service desks — aren’t the best locations for study tables.

Relaxed Rules

This is a complex issue; it’s not easy to strike a balance. For instance, libraries need to protect community resources, especially the expensive electronic ones like wiring, from spills; but we don’t want our patrons to dehydrate themselves while working in the library! At our library, we compromise and allow beverages as long as they have a closed lid: travel mugs, yes; to-go cups (which have holes that can’t be sealed), no.

As library buildings evolve to accommodate digital natives and those whose workplaces have no walls, it’s important to keep in mind the needs of these library users and remix existing spaces to be useful for all of our patrons, whether they’re visiting for business or for pleasure.


Do you have more ideas to create useful space for patrons who bring their own tech to the library? Any issues you’ve encountered? How have you met those challenges?


2018 Evergreen International Conference – Host Site Selected / Evergreen ILS

The 2018 Evergreen Conference Site Selection Committee has chosen the next host and venue for the 2018 conference.  The MOBIUS consortium will be our 2018 conference host and St. Charles, Missouri will be the 2018 location.  Conference dates to be determined.

Congratulations, MOBIUS!  

LITA Personas Task Force / LITA

Coming soon to the LITA blog: the results of the LITA Personas Task Force. The initial report contains a number of useful persona types and was submitted to the LITA Board at the ALA Midwinter 2017 conference. Look for reports on the process and each of the persona types here on the LITA blog starting in March 2017.

As a preview, go behind the scenes with this short podcast presented as part of the LibUX Podcast series, on the free tools the Task Force used to do their work.

Metric: A UX Podcast
Metric is a #libux podcast about #design and #userExperience. Designers, developers, librarians, and other folks join @schoeyfield and @godaisies to talk shop.

The work of the LITA Personas Task Force

In this podcast Amanda L. Goodman (@godaisies) gives you a peek into the work of the LITA Personas Task Force, which is charged with defining and developing personas to be used in growing membership in the Library and Information Technology Association.

The ten members of the task force came from academic, public, corporate, and special libraries located in different time zones. Given these challenges, the Task Force had to use collaborative tools that were easy for everyone to use. Task Force member Amanda L. Goodman originally presented this podcast on LibUX’s Metric podcast.

How could a global public database help to tackle corporate tax avoidance? / Open Knowledge Foundation

A new research report published today looks at the current state and future prospects of a global public database of corporate accounts.

Shipyard of the Dutch East India Company in Amsterdam, 1750. Wikipedia.

The multinational corporation has become one of the most powerful and influential forms of economic organisation in the modern world. Emerging at the bleeding edge of colonial expansion in the seventeenth century, entities such as the Dutch and British East India Companies required novel kinds of legal, political, economic and administrative work to hold their sprawling networks of people, objects, resources, activities and information together across borders. Today it is estimated that over two thirds of the world’s hundred biggest economic entities are corporations rather than countries.

Our lives are permeated by and entangled with the activities and fruits of these multinationals. We are surrounded by their products, technologies, platforms, apps, logos, retailers, advertisements, publications, packaging, supply chains, infrastructures, furnishings and fashions. In many countries they have assumed the task of supplying societies with water, food, heat, clothing, transport, electricity, connectivity, information, entertainment and sociality.

We carry their trackers and technologies in our pockets and on our screens. They provide us not only with luxuries and frivolities, but the means to get by and to flourish as human beings in the contemporary world. They guide us through our lives, both figuratively and literally. The rise of new technologies means that corporations may often have more data about us than states do – and more data than we have about ourselves. But what do we know about them? What are these multinational entities – and where are they? What do they bring together? What role do they play in our economies and societies? Are their tax contributions commensurate with their profits and activities? Where should we look to inform legal, economic and policy measures to shape their activities for the benefit of society, not just shareholders?

At the moment these questions are surprisingly difficult to answer – at least in part due to a lack of publicly available information. We are currently on the brink of a number of important policy decisions (e.g. at the EU and in the UK) which will have a lasting effect on what we are able to know and how we are able to respond to these mysterious multinational giants.

Image from report on IKEA’s tax planning strategies. Greens/EFA Group in European Parliament.

A wave of high-profile public controversies, mobilisations and interventions around the tax affairs of multinationals followed in the wake of the 2007-2008 financial crisis. Tax justice and anti-austerity activists have occupied high street stores in order to protest multinational tax avoidance. A group of local traders in Wales sought to move their town offshore, in order to publicise and critique legal and accountancy practices used by multinationals. One artist issued fake certificates of incorporation for Cayman Islands companies to highlight the social costs of tax avoidance. Corporate tax avoidance came to epitomise economic globalisation with an absence of corresponding democratic societal controls.

This public concern after the crisis prompted a succession of projects from various transnational groups and institutions. The then-G8 and G20 committed to reducing the “misalignment” between the activities and profits of multinationals. The G20 tasked the OECD with launching an initiative dedicated to tackling tax “Base Erosion and Profit Shifting” (BEPS). The OECD BEPS project surfaced different ways of understanding and accounting for multinational companies – including questions such as what they are, where they are, how to calculate where they should pay money, and by whom they should be governed.

For example, many industry associations, companies, institutions and audit firms advocated sticking to the “arm’s length principle”, which would treat multinationals as a group of effectively independent legal entities. On the other hand, civil society groups and researchers called for “unitary taxation”, which would treat multinationals as a single entity with operations in multiple countries. The consultation also raised questions about the governance of transnational tax policy, with some groups arguing that responsibility should shift from the OECD to the United Nations to ensure that all countries have a say – especially those in the Global South.

Exhibition of Paolo Cirio’s “Loophole for All” in Basel, 2015. Paolo Cirio.

While many civil society actors highlighted the shortcomings and limitations of the OECD BEPS process, they acknowledged that it did succeed in obtaining global institutional recognition for a proposal which had been central to the “tax justice” agenda for the previous decade: “Country by Country Reporting” (CBCR), which would require multinationals to produce comprehensive, global reports on their economic activities and tax contributions, broken down by country. But there was one major drawback: it was suggested that this information should be shared between tax authorities, rather than being made public. Since the release of the OECD BEPS final reports in 2015, a loose-knit network of campaigners has been busy working to make this data public.

Today we are publishing a new research report looking at the current state and future prospects of a global database on the economic activities and tax contributions of multinationals – including who might use it and how, what it could and should contain, the extent to which one could already start building such a database using publicly available sources, and next steps for policy, advocacy and technical work. It also highlights what is involved in the making of data about multinationals, including the social and political processes of classification and standardisation that this data depends on.

New report on why we need a public database on the tax contributions and economic activities of multinational companies

The report reviews several public sources of CBCR data – including from legislation introduced in the wake of the financial crisis. Under the Trump administration, the US is currently in the process of repealing and dismantling key parts of the Dodd-Frank Wall Street Reform and Consumer Protection Act, including Section 1504 on transparency in the extractive industry, which Oxfam recently described as the “brutal loss of 10 years of work”. Some of the best available public CBCR data is generated as a result of the European Capital Requirements Directive IV (CRD IV), which gives us an unprecedented (albeit often imperfect) series of snapshots of multinational financial institutions with operations in Europe. Rapporteurs at the European Parliament just published an encouraging draft in support of making country-by-country reporting data public.

While the longer-term dream for many is a global public database housed at the United Nations, until this is realised civil society groups may build their own. As well as being used as an informational resource in itself, such a database could be seen as a form of “data activism” to change what public institutions count – taking a cue from citizen and civil society data projects to take measure of issues they care about – from migrant deaths to police killings, literacy rates, water access or fracking pollution.

A civil society database could play another important role: it could be a means to facilitate the assembly and coordination of different actors who share an interest in the economic activities of multinationals. It would thus be not only a source of information, but also a mechanism for organisation – allowing journalists, researchers, civil society organisations and others to collaborate around the collection, verification, analysis and interpretation of this data. In parallel to ongoing campaigns for public data, a civil society database could thus be viewed as a kind of democratic experiment opening up space for public engagement, deliberation and imagination around how the global economy is organised, and how it might be organised differently.

In the face of an onslaught of nationalist challenges to political and economic world-making projects of the previous century – not least through the “neoliberal protectionism” of the Trump administration – supporting the development of transnational democratic publics with an interest in understanding and responding to some of the world’s biggest economic actors is surely an urgent task.

Launched in 2016, supported by a grant from Omidyar Network, the FTC and coordinated by TJN and OKI, Open Data for Tax Justice is a project to create a global network of people and organisations using open data to improve advocacy, journalism and public policy around tax justice. More details about the project and its members can be found at

This piece is cross-posted at OpenDemocracy.

Security releases: OpenSRF 2.4.2 and 2.5.0-alpha2, Evergreen 2.10.10, and Evergreen 2.11.3 / Evergreen ILS

OpenSRF 2.4.2 and 2.5.0-alpha2, Evergreen 2.10.10, and Evergreen 2.11.3 are now available. These are security releases; the Evergreen and OpenSRF developers strongly urge users to upgrade as soon as possible.

The security issue fixed in OpenSRF has to do with how OpenSRF constructs keys for use by memcached; under certain circumstances, attackers would be able to exploit the issue to perform denial of service and authentication bypass attacks against Evergreen systems. Users of OpenSRF 2.4.1 and earlier should upgrade to OpenSRF 2.4.2 right away, while testers of OpenSRF 2.5.0-alpha should upgrade to 2.5.0-alpha2.

If you are currently using OpenSRF 2.4.0 or later, you can update an Evergreen system as follows:

  • Download OpenSRF 2.4.2 and follow its installation instructions up to and including the make install and chown -R opensrf:opensrf /<PREFIX> steps.
  • Restart Evergreen services using osrf_control.
  • Restart Apache.

If you are running a version of OpenSRF older than 2.4.0, you will also need to perform the make and make install steps in Evergreen prior to restarting services.

Please visit the OpenSRF download page to retrieve the latest releases and consult the release notes.

The security issue fixed in Evergreen 2.10.10 and 2.11.3 affects users of the Stripe credit card payment processor and entails the possibility of attackers gaining access to your Stripe credentials. Users of Evergreen 2.10.x and 2.11.x can simply upgrade as normal, but if you are running Evergreen 2.9.x or earlier, or if you cannot perform a full upgrade right away, you can apply the fix by running the following two SQL statements in your Evergreen database:

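-- The subquery filters were missing here; the permission codes below are
-- assumed from Evergreen's stock permission list. Verify before running.
UPDATE config.org_unit_setting_type
    SET view_perm = (SELECT id FROM permission.perm_list
        WHERE code = 'VIEW_CREDIT_CARD_PROCESSING' LIMIT 1)
    WHERE name LIKE 'credit.processor.stripe%' AND view_perm IS NULL;

UPDATE config.org_unit_setting_type
    SET update_perm = (SELECT id FROM permission.perm_list
        WHERE code = 'ADMIN_CREDIT_CARD_PROCESSING' LIMIT 1)
    WHERE name LIKE 'credit.processor.stripe%' AND update_perm IS NULL;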

In addition, Evergreen 2.10.10 has the following fixes since 2.10.9:

  • A fix to correctly apply floating group settings when performing no-op checkins.
  • A fix to the HTML coding of the temporary lists page.
  • A fix of a problem where certain kinds of requests for information about the organizational unit hierarchy could consume all available open-ils.cstore backends.
  • A fix to allow staff to use the place another hold link without running into a user interface loop.
  • A fix to the Edit Due Date form in the web staff client.
  • A fix to sort billing types and non-barcoded item types in alphabetical order in the web staff client.
  • A fix to the return to grouped search results link in the public catalog.
  • A fix to allow pre-cat checkouts in the web staff client without requiring a circulation modifier.
  • Other typo and documentation fixes.

Evergreen 2.11.3 has the following additional fixes since 2.11.2:

  • A fix to correctly apply floating group settings when performing no-op checkins.
  • An improvement to the speed of looking up patrons by their username; this is particularly important for large databases.
  • A fix to properly display the contents of temporary lists (My List) in the public catalog, as well as a fix of the HTML coding of that page.
  • A fix to the Spanish translation of the public catalog that could cause catalog searches to fail.
  • A fix of a problem where certain kinds of requests for information about the organizational unit hierarchy could consume all available open-ils.cstore backends.
  • A fix to allow staff to use the place another hold link without running into a user interface loop.
  • A fix to the Edit Due Date form in the web staff client.
  • A fix to the definition of the stock Full Overlay merge profile.
  • A fix to sort billing types in alphabetical order in the web staff client.
  • A fix to the display of the popularity score in the public catalog.
  • A fix to the return to grouped search results link in the public catalog.
  • A fix to allow pre-cat checkouts in the web staff client without requiring a circulation modifier.
  • A fix to how Action/Trigger event definitions with nullable grouping fields handle null values.
  • Other typo and documentation fixes.

Please visit the Evergreen download page to retrieve the latest releases and consult the release notes.

New amicus briefs on old copyright cases / District Dispatch

The American Library Association (ALA), as a member of the Library Copyright Alliance (LCA), joined amicus briefs on Monday in support of two landmark copyright cases on appeal.

iPad e-book demo on computer desk

Photo credit: Anita Hart, flickr

The first (pdf) is the Georgia State University (GSU) case—yes, that one— arguing that GSU’s e-reserves service is a fair use. The initial complaint was brought back in 2008 by three academic publishers and has been bankrolled by the Copyright Clearance Center and the American Association of Publishers ever since.
Appeals and multiple requests for injunction from the publishers have kept this case alive for eight years. (The long history of the ins and outs of these proceedings can be found here, and the briefs filed by the Library Copyright Alliance (LCA) can be found here.) Most recently, in March 2016, a federal appeals court ruled in GSU’s favor and many thought that would be the end of the story. The publishers appealed again, however, demanding in part that the court conduct a complicated market effect analysis and reverse its earlier ruling.

While not parties to the case, LCA and co-author the Electronic Frontier Foundation (EFF) make three principal points in their “friend of the court” (or “amicus”) brief:

  • First, they note that GSU’s e-reserve service is a fair use of copyrighted material purchased by its library, underscoring that the service was modeled on a broad consensus of best practices among academic libraries.
  • Second, and more technically, the brief explains why the district court should have considered the goals of the faculty and researchers who wrote most of the works involved, namely to disseminate those works broadly, as a characteristic of the “nature of the use” factor of fair use.
  • Third, and finally, the brief addresses the fourth factor of the statutory fair use test: the effect of the material’s use on the market for the copyrighted work.

Libraries and EFF note that the content loaned by GSU through its e-reserve service is produced by faculty compensated with state funds. Accordingly, they contend, “A ruling against fair use in this case will create a net loss to the public by suppressing educational uses, diverting scarce resources away from valuable educational investments, or both. This loss will not be balanced by any new incentive for creative activity.”

digital audio icon

Photo credit: Pixabay

The second amicus brief just filed by ALA and its LCA allies, another defense of fair use, was prepared and filed in conjunction with the Internet Archive on behalf of ReDigi in its ongoing litigation with Capitol Records. ReDigi is an online business that provides a cloud storage service capable of identifying lawfully acquired music files. Through ReDigi, the owner of the music file can electronically distribute it to another person. When they do, however, the ReDigi service is built to automatically and reliably delete the sender’s original copy. ReDigi originally maintained that this “one copy, one user” model and its service should have been considered legal under the “first sale doctrine” in U.S. copyright law. That’s the statutory provision which allows libraries to lend copies that they’ve lawfully acquired or any individual to, for example, buy a book or DVD and then resell or give it away. Written long before materials became digital, however, that part of the Copyright Act refers only to tangible (rather than electronic) materials. The Court thus originally rejected ReDigi’s first sale doctrine defense.

In its new amicus brief on ReDigi’s appeal, LCA revives and refines an argument that it first made way back in 2000, when ReDigi’s automatic delete-on-transfer technology did not exist: namely, that digital first sale would foster more innovative library services and, for that and other reasons, should be viewed as a fair use that is appropriate in some circumstances.

With the boundaries of fair use or first sale unlikely to be productively changed in Congress, ALA and its library and other partners will continue to participate in potentially watershed judicial proceedings like these.

The post New amicus briefs on old copyright cases appeared first on District Dispatch.

Look Back, Move Forward: librarians combating misinformation / District Dispatch

Librarians across the field have always been dedicated to combating misinformation. TBT to 1987, when the ALA Council passed the “Resolution on Misinformation to Citizens” on July 1 in San Francisco, California. (The resolution is also accessible via the American Library Association Institutional Repository here.)

Resolution on Misinformation to Citizens, passed on July 1, 1987, in San Francisco, California.

In response to the recent dialogue on fake news and news literacy, the ALA Intellectual Freedom Committee crafted the “Resolution on Access to Accurate Information,” adopted by Council on January 24.

Librarians have always helped people sort reliable sources from unreliable ones. Here are a few resources to explore:

  • IFLA’s post on “Alternative Facts and Fake News – Verifiability in the Information Society”
  • Indiana University East Campus Library’s LibGuide, “Fake News: Resources”
  • Drexel University Libraries’ LibGuide, “Fake News: Source Evaluation”
  • Harvard Library’s LibGuide, “Fake News, Misinformation, and Propaganda”
  • ALA Office for Intellectual Freedom’s “Intellectual Freedom News,” a free biweekly compilation of news related to (among other things!) privacy, internet filtering and censorship.
  • This Texas Standard article on the “CRAAP” (Currency, Relevance, Authority, Accuracy & Purpose) test.

If you are working on or have encountered notable “fake news” LibGuides, please post links in the comments below!

The post Look Back, Move Forward: librarians combating misinformation appeared first on District Dispatch.

Upcoming Evergreen and OpenSRF security releases / Evergreen ILS

Later today we will be releasing security updates for Evergreen and OpenSRF. We recommend that Evergreen users be prepared to install them as soon as possible.

The Evergreen security issue only affects users of a certain credit card payment processor, and the fix can be implemented by running two SQL statements; a full upgrade is not required.

The OpenSRF security issue is more serious and can be used by attackers to perform a denial of service attack and potentially bypass standard authentication. Consequently, we recommend that users upgrade to OpenSRF 2.4.2 as soon as it is released.

If you are currently using OpenSRF 2.4.0 or OpenSRF 2.4.1, the upgrade will consist of the following steps:

  • downloading and compiling OpenSRF 2.4.2
  • running the ‘make install’ step
  • restarting Evergreen services

If you are currently running a version of OpenSRF that is older than 2.4.0, we strongly recommend upgrading to 2.4.2; note that it will also be necessary to recompile Evergreen.

There will also be a second beta release of OpenSRF 2.5 that will include the security fix.

Postel's Law again / David Rosenthal

Eight years ago I wrote:
In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:
"Be conservative in what you do; be liberal in what you accept from others."
It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law.
Recently, discussion on a mailing list I'm on focused on the downsides of Postel's Law. Below the fold, I try to explain why most of these downsides don't apply to the "accept" side, which is the side that matters for digital preservation.

Two years after my post, Eric Allman wrote The Robustness Principle Reconsidered, setting out the reasons why Postel's Law isn't an unqualified boon. He writes that Postel's goal was interoperability:
The intent of the Robustness Principle was to maximize interoperability between network service implementations, particularly in the face of ambiguous or incomplete specifications. If every implementation of some service that generates some piece of protocol did so using the most conservative interpretation of the specification and every implementation that accepted that piece of protocol interpreted it using the most generous interpretation, then the chance that the two services would be able to talk with each other would be maximized.
In recent years, however, that principle has been challenged. This isn't because implementers have gotten more stupid, but rather because the world has become more hostile. Two general problem areas are impacted by the Robustness Principle: orderly interoperability and security.
Allman argues, based on his experience with SMTP and Kirk McKusick's with NFS, that interoperability arises in one of two ways, the "rough consensus and running code" that characterized NFS (and TCP), or from detailed specifications:
the specification may be ambiguous: two engineers build implementations that meet the spec, but those implementations still won't talk to each other. The spec may in fact be unambiguous but worded in a way that some people misinterpret. ... The specification may not have taken certain situations (e.g., hardware failures) into account, which can result in cases where making an implementation work in the real world actually requires violating the spec. ... the specification may make implicit assumptions about the environment (e.g., maximum size of network packets supported by the hardware or how a related protocol works), and those assumptions may be incorrect or the environment may change. Finally, and very commonly, some implementers may find a need to enhance the protocol to add new functionality that isn't defined by the spec.
His arguments here are very similar to those I made in Are format specifications important for preservation?:
I'm someone with actual experience of implementing a renderer for a format from its specification. Based on this, I'm sure that no matter how careful or voluminous the specification is, there will always be things that are missing or obscure. There is no possibility of specifying formats as complex as Microsoft Office's so comprehensively that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same product.
The "rough consensus and running code" approach isn't perfect either. As Allman relates, it takes a lot of work to achieve useful interoperability:
The original InterOp conference was intended to allow vendors with NFS (Network File System) implementations to test interoperability and ultimately demonstrate publicly that they could interoperate. The first 11 days were limited to a small number of engineers so they could get together in one room and actually make their stuff work together. When they walked into the room, the vendors worked mostly against only their own systems and possibly Sun's (since as the original developer of NFS, Sun had the reference implementation at the time). Long nights were devoted to battles over ambiguities in the specification. At the end of those 11 days the doors were thrown open to customers, at which point most (but not all) of the systems worked against every other system.
The primary reason is that even finding all the corner cases is difficult, and so is deciding for each whether the sender needs to be more conservative or the receiver needs to be more liberal.

The security downside of Postel's Law is even more fundamental. The law requires the receiver to accept, and do something sensible with, malformed input. Doing something sensible will almost certainly provide an attacker with the opportunity to make the receiver do something bad.

An example is in encrypted protocols such as SSL. They typically provide for the initiator to negotiate with the receiver the specifics of the encryption to be used. Liberal receivers can be negotiated down to the use of an obsolete algorithm, vitiating the security of the conversation. Allman writes:
Everything, even services that you may think you control, is suspect. It's not just user input that needs to be checked—attackers can potentially include arbitrary data in DNS (Domain Name System) results, database query results, HTTP reply codes, you name it. Everyone knows to check for buffer overflows, but checking incoming data goes far beyond that.
Security appears to demand receivers be extremely conservative, but that would kill off interoperability; Allman argues that a balance between these conflicting goals is needed.
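
The downgrade problem described above can be illustrated with a minimal Python sketch. It is not how SSL or any real library negotiates; the algorithm names, the preference list and the `negotiate` function are all hypothetical, chosen only to show how a liberal receiver and a conservative one respond to the same tampered offer.

```python
# Hypothetical algorithm names and preference order, for illustration only.
RECEIVER_PREFERENCE = ["modern-aead", "legacy-cbc", "obsolete-export"]
OBSOLETE = {"obsolete-export"}


def negotiate(initiator_offer, allow_obsolete):
    """Return the first receiver-preferred algorithm the initiator also offers.

    With allow_obsolete=True the receiver is "liberal" and falls back to
    anything it still implements; with False it refuses obsolete choices.
    """
    for name in RECEIVER_PREFERENCE:
        if name in initiator_offer and (allow_obsolete or name not in OBSOLETE):
            return name
    return None  # no acceptable algorithm, so the negotiation fails


# An attacker able to tamper with the offer strips out the strong options.
tampered_offer = ["obsolete-export"]

liberal_result = negotiate(tampered_offer, allow_obsolete=True)
strict_result = negotiate(tampered_offer, allow_obsolete=False)
```

In this sketch the liberal receiver agrees to the obsolete algorithm, while the strict receiver refuses to talk at all. That is the trade-off Allman describes: the conservative choice protects the conversation at the cost of interoperability with initiators that genuinely support nothing better.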

Ingest and dissemination in digital preservation are more restricted cases of both interoperability and security. As regards interoperability:
  • Ingest is concerned with interoperability between the archive and the real world. As digital archivists we may be unhappy that, for example, one of the consequences of Postel's Law is that in the real world almost none of the HTML conforms to the standard. But our mission requires that we observe Postel's Law and not act on this unhappiness. It would be counter-productive to go to websites and say "if you want to be archived you need to clean up your HTML".
  • Dissemination is concerned with interoperability between the archive and an eventual reader's tools. Traditionally, format migration has been the answer to this problem, whether preemptive or on-access. More recently, emulation-based strategies such as Ilya Kreymer's avoid the problem of maintaining interoperability through time by reconstructing a contemporaneous environment.
As regards security:
  • Ingest. In the good old days when Web archives simply parsed the content they ingested to find the links, the risk to their ingest infrastructure was minimal. But now that the Web has evolved from inter-linked static documents to a programming environment, the risk to the ingest infrastructure from executing the content is significant. Precautions are needed, such as sandboxing the ingest systems.
  • Dissemination. Many archives attempt to protect future readers by virus-scanning on ingest. But, as I argued in Scary Monsters Under The Bed, this is likely to be both ineffective and counter-productive. As digital archivists we may not like the fact that the real world contains malware, but our mission requires that we not deprive future scholars of the ability to study it. Optional malware removal on access is a suitable way to mitigate the risk to scholars not interested in malware (cf. the Internet Archive's Malware Museum).
Thus, security considerations for digital preservation systems should not focus on being conservative by rejecting content for suspected malware, but instead focus on taking reasonable precautions so that content can be accepted despite the possibility that some might be malicious.
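
As a minimal sketch of the "accept" side at ingest, the link-finding step mentioned above can be written with Python's standard `html.parser` module, which tolerates markup that does not conform to the standard. The `LinkExtractor` class, the `extract_links` helper and the sample markup are hypothetical illustrations, not code from any real crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Parse html_text without executing anything and return its links."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Unclosed tags, an unquoted attribute value and mixed-case names.
messy = '<HTML><body><p>See <A HREF=http://example.com/a>one<p><a href="/b">two</body>'
links = extract_links(messy)
```

The parser never rejects the input for being non-conformant; it simply reports the start tags it can recognise. Because this approach only parses and does not execute the content, it corresponds to the low-risk "good old days" style of ingest; anything that renders or runs the content would need the sandboxing precautions discussed above.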

Mapping open data governance models: Who makes decisions about government data and how? / Open Knowledge Foundation

Different countries have different models to govern and administer their open data activities. Ana Brandusescu, Danny Lämmerhirt and Stefaan Verhulst call for a systematic and comparative investigation of the different governance models for open data policy and publication.

The Challenge

An important value proposition behind open data involves increased transparency and accountability of governance. Yet little is known about how open data itself is governed. Who decides and how? How accountable are data holders to both the demand side and policy makers? How do data producers and actors assure the quality of government data? Who, if any, are data stewards within government tasked to make its data open?

Getting a better understanding of open data governance is not only important from an accountability point of view. With better insight into the diversity of decision-making models and structures across countries, the implementation of common open data principles, such as those advocated by the International Open Data Charter, can be accelerated.

In what follows, we seek to develop the initial contours of a research agenda on open data governance models. We start from the premise that different countries have different models to govern and administer their activities – in short, different ‘governance models’. Some countries are more devolved in their decision making, while others seek to organize “public administration” activities more centrally. These governance models clearly impact how open data is governed – providing a broad patchwork of different open data governance across the world and making it difficult to identify who the open data decision makers and data gatekeepers or stewards are within a given country.  

For example, if one wants to accelerate the opening up of education data across borders, in some countries this may fall under the authority of sub-national government (such as states, provinces, territories or even cities), while in other countries education is governed by central government or implemented through public-private partnership arrangements. Similarly, transportation or water data may be privatised in some cases, while in others it may be the responsibility of municipal or regional government. Responsibilities are therefore often distributed across administrative levels and agencies, affecting how (open) government data is produced and published.

Why does this research matter? Why now?

A systematic and comparative investigation of the different governance models for open data policy and publication has been missing to date. To steer the open data movement toward its next phase of maturity, there is an urgent need to understand these governance models and their role in open data policy and implementation.

For instance, the International Open Data Charter states that government data should be “open by default” across entire nations. But the variety of governance systems makes it hard to understand the different levers that could be used to enable nationwide publication of open government data by default. Who effectively holds the power to decide what gets published and what does not? By identifying the strengths and weaknesses of governance models, the global open data community (along with the Open Data Charter) and governments can work together to identify the most effective ways to implement open data strategies and to understand what works and what doesn’t.

In the next few months we will seek to increase our comparative understanding of the mechanisms of decision making as it relates to open data within and across government, and to map the relationships between data holders, decision makers, data producers, data quality assurance actors, data users and gatekeepers or intermediaries. This may provide insights on how to improve the open data ecosystem by learning from others.

Additionally, our findings may identify the “levers” within governance models used to provide government data more openly. And finally, having more transparency about who is accountable for open data decisions could allow for a more informed dialogue with other stakeholders on performance of the publication of open government data.

We are interested in how different governance models affect open data policies and practices – including the implementations of global principles and commitments. We want to map the open data governance process and ecosystem by identifying the following key stakeholders, their roles and responsibilities in the administration of open data, and seeking how they are connected:

  • Decision makers – Who leads/asserts decision authority on open data in meetings, procedures, conduct, debate, voting and other issues?
  • Data holders – Which organizations / government bodies manage and administer data?
  • Data producers – Which organizations / government bodies produce what kind of public sector information?
  • Data quality assurance actors – Who are the actors ensuring that produced data adhere to certain quality standards and does this conflict with their publication as open data?
  • Data gatekeepers/stewards – Who controls open data publication?

We plan to research the governance approaches to the following types of data:

  • Health: mortality and survival rates, levels of vaccination, levels of access to health care, waiting times for medical treatment, spend per admission
  • Education: test scores for pupils in national examinations, school attendance rates, teacher attendance rates
  • National Statistics: population, GDP, unemployment
  • Transportation: times and stops of public transport services – buses, trains
  • Trade: import and export of specific commodities, balance of trade data against other countries
  • Company registers: list of registered companies in the country, shareholder and beneficial ownership information, lobbying register(s) with information on companies, associations representatives at parliamentary bodies
  • Legislation: national legal code, bills, transcripts of debates, finances of parties

Output of research

We will use different methods to get rapid insights. This includes interviews with stakeholders such as government officials, as well as open government initiatives from various sectors (e.g. public health services, public education, trade). Interviewees may be open data experts, as well as policymakers or open data champions within government.

The types of questions we will seek to answer, beyond the broad topic of “who is doing what”, include:

  • Who holds power to assert authority over open data publication? What roles do different actors within government play to design policies and to implement them?
  • What forms of governance models can be derived from these roles and responsibilities? Can we see a common pattern of how decision-making power is distributed? How do these governance models differ?
  • What are the criteria to evaluate the “performance” of the observed governance models? How do they, for instance, influence open data policy and implementation?

Call for contributions

We invite all interested in this topic to contribute their ideas and to participate in the design and execution of one or more case studies. Have you done research on this? If so, we would also like to hear from you!

Contact one or all of the authors at:

Ana Brandusescu:

Danny Lämmerhirt:

Stefaan Verhulst:

Benchmarks and Heuristics Reports / LibUX

The user experience audit is the core deliverable from the UX bandwagon if you don’t code or draw. It has real measurable value, but it also represents the lowest barrier to entry for aspirants. Code or visual design work have these baked-in quality indicators. Good code works, and you just know good design when you see it in the same way Justice Stewart was able to gauge obscenity.

 I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description [“hard-core pornography”], and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that.

Audits, though, aren’t so privileged. Look for audit templates or how-to’s — we even have one here on LibUX — and you’ll find the practice is inconsistent across the discipline.

In part, they suffer from the same flaw inherent to user experience design in general in that nobody can quite agree on just what user experience audits do.

It’s an ambiguity that extends across the table.

As a term, the “user experience audit” fails to describe its value to client stakeholders. There is no clear return in paying for an “audit,” rather than the promise of red flags under scrutiny. And precisely because the value of performing an audit requires explanation, scoring the opportunity relies now on the art of the pitch rather than the expertise of the service you provide.

It boils down to a semantic problem.

That’s all preamble for this: this weekend, my partnership with a library association came to an end – capped by the delivery of a benchmarks and heuristics report, which was a service I was able to up-sell in addition to my original scope of involvement. I don’t think I could have sold a “user experience audit.”

Instead, I offered to report on the accessibility, performance, and usability of their service in order to establish benchmarks on which to build moving forward. This creates an objective-ish reference that they or future consultants can use in future decision-making. Incremental improvements in any of these areas have an all-ships-rising-with-the-tide effect, but with this report, I say, we will be able to identify which opportunities have the most bang for their buck.

So, okay. It’s semantics. But this little wordsmithy makes an important improvement: “benchmarks and heuristics” actually describe the content of the audit. This makes it easier to convince stakeholders it’s no report card – but a decision-making tool that empowers the organization.

My template

I use a simple template. I tweak, add, and remove sections depending on the scope of the project, but I think the arrangement holds up. There is a short cover letter followed by an overview summarizing the whole shebang. I make it conversational, and try to answer the question stakeholders paid me to answer: how do we stand, and what should we do next? The rest of the report is evidence to support my advice.

Benchmarks are quantitative scores informed by data or programmatic audits that show the organization where they stand in relation to law or best practice or competition. You can run objective-ish numbers from user research as long as they adhere to some system — like net promoter scores or system usability scales — but in my experience the report is best organized from inarguable to old-fashioned gut feelings.
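To make the two scoring systems named above concrete, here is a minimal Python sketch of their standard formulas. A net promoter score is the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6), and a system usability scale score turns ten answers on a 1-to-5 scale into a number from 0 to 100. The function names and sample data are made up for illustration and are not part of the report template.

```python
def net_promoter_score(ratings):
    """Return an NPS between -100 and 100 from 0-10 'would you recommend' ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 through 6
    return 100.0 * (promoters - detractors) / len(ratings)


def sus_score(answers):
    """Return a 0-100 SUS score from one respondent's ten answers (each 1-5)."""
    if len(answers) != 10:
        raise ValueError("SUS uses exactly ten items")
    total = 0
    for index, answer in enumerate(answers):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute answer - 1.
            total += answer - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute 5 - answer.
            total += 5 - answer
    return total * 2.5


# Hypothetical survey results.
print(net_promoter_score([10, 10, 9, 3]))
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))
```

Scores like these are only comparable over time if the same questions are asked the same way each round, which is the point of adhering to a system.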

Programmatic audits border on the “inarguable” here. You’re either Section 508 compliant or you’re not. These are validation scans for accessibility, performance, or security, which can, when there’s something wrong, identify the greatest opportunities for improvement. I attach the full results of each audit in the appendix and explain my method. Then, I devote the white-space to describing the findings like you would over coffee.

Anticipate and work the answers to these questions into your writeup:

  • Is this going to cost me [the stakeholder] money, business, credibility, or otherwise hurt me sometime down the road if I don’t fix?
  • What kind of involvement, cost, consideration, and time does it take to address?
  • What would you [the expert] recommend if you had your druthers?
  • What is the least I could do or spend to assuage the worst of it?

I follow benchmarks with liberally linked-up heuristics and other research findings as they veer further into opinion, and the more opinionated each section becomes, the more I put into their presentation: embed gifs or link out to unlisted YouTube videos of the site in action, use screenshots, pop-in analytics charts or snippets from external spreadsheets like their content audit — or even audio clips from a user chatting about a feature.

Wait, audio? I’m not really carrying podcast equipment everywhere. Sometimes, I’ll put the website or prototype up on and ask five to ten people to perform a short task, then I’ll use the video or audio to — let’s say — prove a point about the navigation.

The more qualitative data I can use to support a best practice or opinion, the better I feel. I don’t actually believe that folks who reach out to me for this kind of stuff are looking for excuses to pshaw my work, but I’m a little insecure about it.

Anyway, your mileage may vary, but I thought I’d show you the basic benchmarks and heuristics report template I fork and start with each time. It might help if you don’t know where to start.


TIMTOWTDI / Galen Charlton

The Internet Archive had this to say earlier today:

This was in response to the MacArthur Foundation announcing that the IA is a semifinalist for a USD $100 million grant; they propose to digitize 4 million books and make them freely available.

Well and good, if they can pull it off — though I would love to see the detailed proposal — and the assurance that this whole endeavor is not tied to the fortunes of a single entity, no matter how large.

But for now, I want to focus on the rather big bus that the IA is throwing “physical libraries” under. On the one hand, their statement is true: access to libraries is neither completely universal nor completely equitable. Academic libraries are, for obvious reasons, focused on the needs of their host schools; the independent researcher or simply the citizen who wishes to be better informed will always be a second-class user. Public libraries are not evenly distributed nor evenly funded. Both public and academic libraries struggle with increasing demands on their budgets, particularly with respect to digital collections. Despite the best efforts of librarians, underserved populations abound.

Increasing access to digital books will help — no question about it.

But it won’t fundamentally solve the problem of universal and equitable service. What use is the Open Library to somebody who has no computer — or no decent smart phone – or an inadequate data plan—or uncertain knowledge of how to use the technology? (Of course, a lot of physical libraries offer technology training.)

I will answer the IA’s overreach into technical messianism with another bit of technical lore: TIMTOWTDI.

There Is More Than One Way To Do It.

I program in Perl, and I happen to like TIMTOWTDI—but as a principle guiding the design of programming languages, it’s a matter of taste and debate: sometimes there can be too many options.

However, I think TIMTOWTDI can be applied as a rule of thumb in increasing social justice:

There Is More Than One Way To Do It… and we need to try all of them.

Local communities have local needs. Place matters. Physical libraries matter—both in themselves and as a way of reinforcing technological efforts.

Technology is not universally available. It is not available equitably. The Internet can route around certain kinds of damage… but big, centralized projects are still vulnerable. Libraries can help mitigate some of those risks.

I hope the Internet Archive realizes that they are better off working with libraries — and not just acting as a bestower of technological solutions that may help, but will not by themselves solve the problem of universal, equitable access to information and entertainment.

Data Designed for Discovery / HangingTogether

I’ve been talking about linked data for so long that I can’t remember when I first began. I was actually a skeptic at first, as I was struggling to see the benefit from all the work required to move our data from where it is now into that brave new world.

But then I started to understand what a transformational change we were contemplating, and the many benefits that could accrue. Let me spell it out for you.

MARC, our foundational metadata standard, is fundamentally built for description. As a library cataloger, you have an object in hand, and your task is to describe that item to the best of your abilities so that a library user can distinguish it from other, similar items. Sure, your task is also to assign some subject headings so it can be discovered by a subject search, but the essential bit is to describe the thing with enough specificity so that someone else (perhaps another cataloger) can determine whether the item they hold in their hand is the same thing.

I humbly submit that this has been the mission of cataloging for the last X number of decades. And now, I also submit, we are about to turn the tables. Rather than focusing our efforts on description, we will be focusing more of our efforts on discovery. What does this mean?

It means lashing up our assertions about an item (e.g., this person wrote this work) with canonical identifiers that can be resolved and can lead to additional information about that assertion. This of course assumes the web as the foundational infrastructure that makes linked data possible.

But it’s also more than this. It is also about using linked data techniques to associate related works, drawing on the concepts laid out by the Functional Requirements for Bibliographic Records (FRBR) to bring together all of the various manifestations of a work. This can support interfaces that make it easier to navigate search results and find the version of a work that you need. Linked data techniques are also making it easier for us to link translations to the original works and vice versa.
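As a rough illustration of the idea, the sketch below stores assertions as subject, predicate, object triples and then gathers the manifestations and translations that point at the same work. Everything in it is an assumption made for the example: the example.org identifiers stand in for resolvable URIs, and the predicate names are invented rather than taken from any real vocabulary.

```python
from collections import defaultdict

# Hypothetical identifiers; a real dataset would use resolvable URIs.
WORK = "http://example.org/work/1"
AUTHOR = "http://example.org/person/42"

# Each assertion is a (subject, predicate, object) triple.
triples = [
    (WORK, "creator", AUTHOR),
    ("http://example.org/manifestation/a", "manifestationOf", WORK),
    ("http://example.org/manifestation/b", "manifestationOf", WORK),
    ("http://example.org/work/2", "translationOf", WORK),
]


def manifestations_by_work(triples):
    """Group manifestation identifiers under the work they embody."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "manifestationOf":
            grouped[obj].append(subject)
    return dict(grouped)


def translations_of(triples, work):
    """Return the identifiers of works asserted to be translations of `work`."""
    return [s for s, p, o in triples if p == "translationOf" and o == work]


print(manifestations_by_work(triples))
print(translations_of(triples, WORK))
```

Because every statement hangs off an identifier rather than a descriptive string, a discovery interface can collapse the two manifestations into one search result and offer the translation alongside it.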

All of these advancements are making discovery easier and more effective, which is really what we should be all about, don’t you think?

About Roy Tennant

Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.

Jobs in Information Technology: February 15, 2017 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Johns Hopkins University, Data Services Manager, Baltimore, MD

Drexel University Libraries, Liaison Librarian Life Sciences, Philadelphia, PA

Drexel University Libraries, Application Developer, Philadelphia, PA

Ed Map, Inc., Curator, Nelsonville, OH

E-Nor, Silicon Valley Digital Analytics Consultancy Content/Knowledge Manager (metadata), location not important, CA

Virginia Tech, University Libraries, Linux Systems Administrator, Blacksburg, VA

Los Angeles County Public Library, Chief Information Officer, Los Angeles, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Read Collections as Data Report Summary / Library of Congress: The Signal

Our Collections as Data event in September 2016 on exploring the computational use of library collections was a success on several levels, including helping steer our team at National Digital Initiatives in our path of action.

We are pleased to release the following summary report which includes an executive summary of the event, the outline of our work in this area over the past five months, and the work of our colleagues Oliver Baez Bendorf, Dan Chudnov, Michelle Gallinger, and Thomas Padilla. If you are interested in what we mean when we talk about collections as data or the infrastructure necessary to support this work, this is for you.

The format of this summary report is itself an experiment. We contracted authors and artists to comment on this important topic from their diverse perspectives in order to create a holistic resource reflective of what made the symposium so great. You will read a reflection on the event from keynote speaker Thomas Padilla, a recommendation on how to implement a computational environment for scholars by Dan Chudnov and Michelle Gallinger, as well as a series of collages by the artist Oliver Baez Bendorf representing key themes from the day.

Mark your calendars for the next #AsData event on July 24th-25th at the Library of Congress. By featuring stories from humanities researchers, journalists, students, social scientists, artists, and storytellers who have used library collections computationally, we hope to communicate the possibilities of this approach to a broad general audience.

Islandoracon Logo! / Islandora

The Islandoracon Planning Committee is very pleased to unveil the logo that will grace our conference t-shirts in Hamilton, Ontario this May:

With all due credit to both the remarkable musical that inspired the image, and to the entirely different man named Hamilton who actually founded the city, the concept for this image comes from Bryan Brown at FSU. It also brings back the now-ubiquitous Islandora lobster (also known as the CLAWbster), who was created for the first Islandoracon and has gone on to dominate Islandora CLAW repositories in many different guises.

Attempting to Prevent the Feeling of Unproductiveness Even When We Are Productive / Cynthia Ng

This piece was originally published on February 15, 2017 as part of The Human in the Machine publication project. It seems it is not uncommon to finish a full day of work and feel completely unproductive. Sometimes I wonder if that’s simply a symptom of how we define what’s “productive”. The Problem I admit, …

Travel Funding Available to attend DPLAfest 2017 / DPLA

At DPLA, it is very important to us that DPLAfest bring together a broad array of professionals and advocates who care about access to culture to discuss everything from technology and open access to copyright, public engagement, and education. We celebrate the diversity of our DPLA community of partners and users and want to ensure that these perspectives are represented at DPLAfest, which is why we are thrilled to announce three fully funded travel awards to attend DPLAfest 2017.  

Our goal is to use this funding opportunity to promote the widest possible range of views represented at DPLAfest. We require that applicants represent one or more of the following:

  • Professionals from diverse ethnic and/or cultural backgrounds representing one or more of the following groups: American Indian/Alaska Native, Asian, Black/African American, Hispanic/Latino or Native Hawaiian/Other Pacific Islander
  • Professionals whose work or institutions primarily serve and/or represent historically underserved populations including, but not limited to, LGBTQ communities, incarcerated people and ex-offenders, people of color, people with disabilities, or Native American and tribal communities
  • Individuals who would not otherwise have the financial capacity to attend DPLAfest 2017
  • Graduate students and/or early career professionals
  • Students or professionals who live and/or work in the Greater Chicago metro area


  • Visit the DPLAfest Scholarships page and complete the application form by March 1, 2017.
  • Award recipients must attend the entire DPLAfest, taking place on Thursday, April 20 between 9:00am and 5:45pm and Friday, April 21 between 9:00am and 3:30pm.
  • Award recipients agree to write a blog post about their experience at the event to be published on DPLA’s blog within two weeks following the end of the event.

Please note that this funding opportunity will provide for airfare and/or other required travel to Chicago if needed, lodging for two nights in one of the event hotels, and complimentary registration for DPLAfest. The award will not provide for meals or incidental expenses such as cab fares or local public transportation.  All applicants will be notified regarding their application status during the week of March 13.

We appreciate your help sharing this opportunity with interested individuals in your personal and professional networks.


The FIRE UP Approach: How to Optimize Websites for People / LibUX

A matchstick lightened up in front of dark background. Looks impressive.

How do you optimize websites for readers who suffer from a chronic attention shortage?

You need to make your sites findable, intriguing, readable, engaging, usable, popular – or, in short, “fire up”.

“fire someone up – Fig. to motivate someone; to make someone enthusiastic.”

A shocking reading statistic from US libraries shows almost 20% of Americans don’t read books at all.

Let me confess: I didn’t read books until I was 16. I didn’t even read newspapers or magazines until I was 14. We didn’t have the Internet or mobile phones back then, so I wasn’t reading websites or SMS either! To make things worse my mother was a linguist focused on teaching language and editing literature.

These days I read roughly one 500-page book a week. Sometimes I read two or three books at once.

How did that miraculous transformation happen?

My case seemed as hopeless as that of the rest of this one fifth of the population. It wasn’t that I couldn’t read. I could read in three languages by the age of 14. I didn’t want to read. I didn’t think books were attractive enough for me. I didn’t appreciate books and I didn’t think reading could enrich my daily life.

This isn’t a story about me, though. My point is that you can make your content

  1. findable
  2. intriguing
  3. readable
  4. engaging
  5. usable
  6. popular

so that everybody will want to read and enjoy it.

People who don’t read lack a basic human experience. They don’t have access to a whole universe of knowledge.

It’s our task to enable these people to read again or at all.

With websites and mobile phones being used by almost everybody it’s far easier to encourage people to read in these times. How? You need to make your site:


Findable

You have to ensure the findability of your content – be it text or mixed content (text and images) or solely images. Even images without text can make people read. Image captions are a good start.

How did I finally start reading? When I was 14 my mother pointed out an article to me about my favorite sweets in the weekly magazine my father had subscribed to for years. I had looked at the magazine covers for ages but didn’t even read them when they featured barely clad women. The mags always lay around the living room.

Thus my mother only needed to point it out casually. She didn’t have to get up, search for it and peruse dozens of other magazines to find it.

Nowadays, it’s not as easy to get your content to where the hard-to-reach audience hangs out. One day it’s Facebook, the next day it’s Instagram, the third day it’s Snapchat. Don’t just stay in your ivory tower and wait for everybody to come visit you. Spend some time on outreach efforts.

You can publish quotes from your favorite authors on social media. Indeed on sites like

  • Instagram
  • Pinterest
  • Tumblr
  • Twitter

aphorisms work best!

Don’t just optimize for Google. Work on your internal search. Tag your content. Make sure your images and quotes can be found on Pinterest or Twitter.


Intriguing

Now you agree that just staying where you are and waiting passively may not be enough. You will probably wonder how to make people who never read become interested. People who can’t read or comprehend fast need other forms of content that don’t require huge chunks of text.

You also need to point them to content that would interest them. You need to intrigue them. Don’t just attempt to make people read what you think they should.

Even clickbait can help. Don’t give away everything in the headline.


Readable

Readability is a huge problem with books and websites alike.

Screen shot of the website which is very clean, black and white using large readable fonts and lots of white space.

In these times our attention spans are extremely short. Small screens on the go make reading even more difficult. We can’t focus on longer paragraphs or even sentences anymore.

This is also an accessibility issue. Readers may have cognitive deficits. I have cognitive deficits myself when I’m tired or when I experience pain from migraine attacks. I can’t read properly then. Other people may be perfectly average but have difficulties when reading while on the go, holding a baby in their arms, or trying to multitask in general.

Keep sentences short. Don’t complicate things to sound smart. Your website should be written for everybody.

It’s not about writing a thesis in college. It’s about reaching a wider audience. Twitter is a good exercise. Make your messages work in 140 characters and write like that for the rest of the Web. Authors like Ernest Hemingway or Paul Auster have used simplicity to reach millions.
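If you want a quick, mechanical check on this advice, a few lines of Python can flag sentences that run long and messages that overshoot the 140-character exercise. This is only a sketch: the 20-word threshold is an arbitrary assumption, the sentence splitting is naive, and the function names are made up.

```python
import re


def long_sentences(text, max_words=20):
    """Return the sentences in `text` that have more than `max_words` words."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


def fits_in_tweet(message, limit=140):
    """Return True if `message` is at most `limit` characters long."""
    return len(message) <= limit


draft = "Keep sentences short. Don't complicate things to sound smart."
print(long_sentences(draft))
print(fits_in_tweet(draft))
```

A check like this will not tell you whether the writing is good, only where to look first when you edit.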

Simple word choices, short sentences and paragraphs won’t suffice though. You need to format text:

  • blockquotes
  • lists
  • bold text
  • text-marker effects

have proven to work best to enhance website readability.


Engaging

There is this old social media cliche that “you have to engage in the conversation”. It’s true. That’s how you have to write on the Web. Write in a conversational tone. Talk to your reader.

Ask questions in your articles, not just rhetorical ones. Add calls to action asking for feedback!

Appeal to emotions by telling stories and speaking about real people like yourself.

Cover worthwhile causes. Don’t just promote yourself and push what you like. Publish what others love.

You are not meant to judge a book by its cover, but we all do. What’s on top and gets seen first counts.

Make a good first impression by placing attractive visual content at the top of your page, every page.


Usable

OK, you have done everything right. Now the fire is already burning. Your site and content are findable, intriguing, readable, engaging…

What’s the problem then? Your site may still fail as a whole. It can be all of the above, but when it’s not usable it’s wasted. Usability, or in modern terms user experience, matters on many levels. Can everybody access and use the site?

Or is it rather built for healthy white males, speaking English – as seems to be the case with the new

Some of these non-readers may actually be disabled – blind, for example!

When you fail to provide alternative text for your images, or fail to make the page navigable by keyboard, you lose that person.
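The alternative-text half of that advice is easy to check by machine. The sketch below uses Python’s standard `html.parser` module to list images that have no `alt` attribute at all. The class and function names are hypothetical, the sample markup is invented, and the check says nothing about keyboard navigation or about whether the alt text that does exist is any good.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag that lacks an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # An empty alt="" is legitimate for purely decorative images,
        # so only a completely absent attribute is flagged here.
        if "alt" not in attributes:
            self.missing.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images in `html` that have no alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


page = '<p><img src="a.png" alt="A cat"><img src="b.png"><img src="c.png" alt=""/></p>'
print(images_missing_alt(page))
```

Run against your own templates, a list like this gives you a concrete to-do list rather than a vague intention to be more accessible.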

Others may seem healthy and able at first sight but, like me, are often sick and tired (no pun intended). After staring at the screen all day I can’t

  • read long paragraphs anymore
  • sift through huge mega menus
  • or even deal with bright colors with a lot of contrast.

Don’t be overly creative when building sites. The website is not the artwork, it’s the frame the artwork is in.

Keep it simple with low cognitive load by not reinventing the wheel.

Just place the logo on the top left, the navigation on top and the content below. Don’t add more than 6 menu items.

Cut out or limit the blinking ads and misleading “you may also like” partner stories.


Popular

Even when the fire is burning and almost everybody can use your site, it’s still not enough to succeed with your website. You of course need to become popular. You can become popular in general but also in your niche or area.

Popularity can be relative. A popular restaurant in your neighborhood does not have to be McDonald’s.

Consider popularization as explaining science to larger audiences. It’s about understanding. When scientists write for each other to get reviewed by their peers, nobody else will really understand them. Many bloggers and website owners write like scientists, only for their colleagues. Remember to publish for average people as well.

Cover topics that matter for many people. Use language even your mother can understand without cryptic acronyms and insider lingo.

Don’t assume everybody knows what you are talking about. Provide context up front.

Last but not least simplify the linking and sharing process without only focusing on third party sites. Let people link to your site easily and share your content by mail or using messenger apps.

Allow copy and pasting! Sometimes it’s even impossible to select a headline or text quote due to bad design.


Are you enthusiastic now? I certainly hope so. Of course I tried to use the FIRE UP approach myself here.

Did it work? I have no idea! You have to tell me below in the comments or on social media!

Civic Tech or Civic Business? Digital technology will not help democracy without adopting its foundations / Open Knowledge Foundation

This blog originally appeared on and has been translated by Pierre Chrzanowski and Samuel Goëta (Open Knowledge France).

Civil society did not wait for the buzzword “Civic Tech” to implement digital technology to serve democratic innovation. But since this trendy term took off, many initiatives have claimed to belong to the concept it names without respecting the most basic principles of democracy.

Adapted from a photo of Alexis (CC0)

Digital technology is not democratic in itself. Its simple use would not be enough to magically manage the essential democratic stakes, quite the contrary. Having a blind faith in technology opens the door to a loss of sovereignty and democratic control. There is a reason why the global “Open Government” movement has found its foundations in the Open Data dynamic and the collaborative governance of the Internet, themselves deeply tied to the principles of democratic transparency, public deliberation and open source communities. It would not be acceptable if the digital transition of democratic life were to come along with the creation of lucrative monopolies whose mechanisms would be hidden from society. This digital transition must therefore at the very least scrupulously respect the level of transparency and sovereignty of our democratic heritage.

The Open Government Partnership global summit in Paris was a new opportunity to watch public authorities boast about, support, and proudly announce their use of, or even promote, the tools of several “Civic Business” startups that specifically refuse to apply these democratic principles. Some of these companies, like Cap Collective (a startup that also leads Parlement & Citoyens and sells its proprietary consultation tool to the French government for most of its participatory initiatives), even claimed for years, purely for advertising purposes, to adhere to the principles of transparency and openness, but in reality never applied them. This is why it seems urgent today to reaffirm that any democratic digital project needs to be based on open source code, ensuring diversity, transparency, participation and collaboration, which are the very principles governing “Open Government”.

While Regards Citoyens is not primarily dedicated to the promotion of free software (many organizations such as April or Framasoft are already doing so well at the national level), it is at the heart of our statutes as well as in all our projects. It is not a question of taking a dogmatic, purist or even a technical posture. It is more of an ethical position. Since its beginnings and until today, democracy has had to equip itself with tools such as the collaborative counting of votes, official newspapers and public deliberations to ensure a minimum of transparency and equal access to public life and provide a sufficient level of trust to citizens. It is essential that these same principles now apply to the digital tools that aim to support public institutions in our modern democracy.

If “code is law”, only free software can ensure transparency and collective governance of this code, both essential in regards to trust in these new “laws”. That’s the whole issue of the transparency of algorithms that participate in public decision-making. Publishing the data generated as open data is also necessary, but as Valentin Chaput very well says “retrospectively publishing a dataset from a non-auditable platform is not a sufficient guarantee that the data has not been manipulated”. On (Regards Citoyens’ Parliamentary monitoring website)  and its visualizations of parliamentary activities, for example, it is crucial that anyone can check that our algorithms do not implement any discriminatory treatment for a specific MP or political group.

To gloriously invite world-renowned and fervent advocates of commons and free software such as Lawrence Lessig, Audrey Tang, Pablo Soto, Rufus Pollock or mySociety, while promoting initiatives that refuse to apply these principles, is purely and simply called “open-washing”. The fact that the French secretary in charge of digital affairs can organize and host an international overview of Civic Tech during the OGP summit, while France is only represented by an initiative which is also the only one on the panel that sells proprietary software, is a caricature, and projects a disastrous image of France among these international ambassadors of digital democracy.

Nonetheless, this does not mean that the use of digital tools to support democracy is incompatible with any form of remuneration for Civic Tech actors! In the same way that we advocate fair compensation for elected officials and maintain that many policy makers have relatively low incomes in comparison with their responsibilities, we consider the remuneration of the developers and facilitators of Civic Tech tools a major issue. This debate must not forget, however, that democracy lives mostly on volunteering: for instance, political activists getting involved in an electoral campaign knowing they will never be elected, or citizens mobilized for months to influence a political decision.

Unfortunately, too often, Civic Businesses dogmatically reproduce the usual economic models, forgetting about the space and the issues in which they operate, exempting themselves from essential democratic values. By reproducing the authoritarian models of startups and other sorts of incubators, Civic Tech may unleash cronyism, conflicts of interest, and actors who only want to enrich and empower themselves. However, many companies have well understood that there are business models out there compatible with open governance, with the production and use of free software, and with the transparency of algorithms.

Everyone can be open! Simply applying the same requirements of exemplarity and transparency to our own structures is enough. Our workshop dedicated to Transparency applied to NGOs during the OGP summit was most illustrative in this matter and included many key actions, simple but essential, with which civil society can engage: free software and open data obviously, but also applying a more horizontal, open governance, for instance by publishing detailed accounts, declarations of interests of representatives, and minutes of meetings, or by opening meetings and ongoing work to oversight or to participation by all… So many opening actions that can benefit the organizations that implement them, and without which (at least) the growing French Civic Tech may sadly lose its soul.

Digital technology will not renew democracy by feeding distrust with more transparency.

My top 5 reasons to go to the BCLA conference / Tara Robertson

Graffiti that looks like an elephant with three trunks and the text "Connect". Connect by AV Dezign

I just registered for the BCLA conference and I hope you’ll consider attending too:

Here are my top 5 reasons to go to BCLA:

  1. Sessions relevant to today’s political climate:
    • Understanding librarianship in the time of Trump (Kevin Stranack, PKP/SFU, Phil Hall, Tami Setala)
    • OpenMedia
    • Small steps to becoming a government information activist (Susan Paterson, UBC, Carla Graebner, SFU)
  2. Strong program for academic libraries:
    • Calling Bullsh*t in the age of big data (the folks from UW who made
    • Are we engaged? Academic libraries and off-campus communities as partners in life (Dr. Norah McRae, UVic, Deb Zehr and Gordon Yusko UBC)
    • Collaborative effort: institutional OER initiatives shared and discussed (with Ken Jeffery, BCIT and Arthur Gill Green, Okanagan College)
    • From citizen science to personal benefit: data management for everyone (Alex Garnett, Carla Graebner, Jessica Gallinger, SFU, Allison Trumble, VIRL)
    • Technology trends: tomorrow’s library (Ben Hyman, VIU, Daniel Phillips, GVPL, Paul Joseph, UBC)
    • Provincial Digital Library (Caroline Daniels, KPU, Anita Cocchia, BC ELN)
    • Making it work: ideology and the infrastructure of the library (Emily Drabinski, Long Island University)
    • Does the medium matter? Using evidence from science and engineering student surveys to guide choices between electronic and print books in collection development (Christina Nilsen, Seattle University)
    • 3×3 in Search of An Assessment Plan (Collen Bell, UFV, Amy Paterson, TRU, Laura Thorne, UBC-O)
    • Keeping Assessment in Sight (Tania Alekson, Capilano U)
  3. Never Neutral: Ethics and Digital Collections – I’m organizing and speaking on this hot topic plenary panel about some of my (completely unrelated to CAPER) research on the ethics of digitizing lesbian porn. I’m super excited that Jarret M. Drake from Princeton University Archives, who does amazing work with community archives and is also on the Advisory Committee of DocNow, and Michael Wynne from the Mukurtu agreed to come and participate on this panel. I think we might challenge the idea that open access is always a good thing and also talk about how we need to shift how we work with communities.
  4. Sessions by and about First Nations people:
    • Understanding the library and archival needs of Indigenous People (Camille Callison, University of Manitoba)
    • Rhymes, Rhythm, and Relationships: A Model of Community Collaboration between a Public Library and an Organization Serving Aboriginal Families (Els Kushner, VPL, Robyn Lean, YWCA Crabtree Corner)
  5. Awesome keynotes:
    • Khelsilem – Sḵwx̱wú7mesh language activist and teacher
    • Anita Sarkeesian – Equality or GTFO: Navigating the Gendered Minefield of Online Harassment. She’s well known for her tropes vs women in gaming video series and for continuing to speak out about sexism in gaming despite being the ongoing target of massive, vicious online harassment.

For me it’s a rare chance to connect with colleagues from across the province and with folks who work in public libraries.

I’ve been on the program planning committee for a few years now and I’m really proud of the diversity in speakers and quality of sessions. The program has a good balance between sessions for public and academic libraries and seeks to provoke broader conversations around the themes of access, community, evidence, place and work.

Early bird pricing is on until March 10th, register now!

Reference Rot Is Worse Than You Think / David Rosenthal

At the Fall CNI Martin Klein presented a new paper from LANL and the University of Edinburgh, Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. Shawn Jones, Klein and the co-authors followed on from the earlier work on web-at-large citations from academic papers in Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, which found:
one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
Reference rot comes in two forms:
  • Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
  • Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.
The British Library's Andy Jackson analyzed the UK Web Archive and found:
I expected the rot rate to be high, but I was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%. This indicates there are some islands of stability, and that any broad ‘average lifetime’ for web resources is likely to be a little misleading.
Clearly, the problem is very serious. Below the fold, details on just how serious and discussion of a proposed mitigation.
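Jackson's remark about the loss rate can be checked with a little arithmetic. The sketch below uses only the 50% and 60% figures from his quote; the function and variable names are mine, and the "second-year rate" is simply derived from those two numbers rather than reported by him.

```python
def cumulative_loss(annual_loss: float, years: int) -> float:
    """Fraction lost after `years` if the same fraction is lost every year."""
    return 1 - (1 - annual_loss) ** years

# A constant 50%/year loss would leave 25% after two years, i.e. 75% lost.
constant_two_year_loss = cumulative_loss(0.5, 2)

# Observed figures from the quote: 50% lost after one year, 60% after two.
observed_year1, observed_year2 = 0.50, 0.60

# Share of the content that survived year one which was then lost in year two.
second_year_rate = (observed_year2 - observed_year1) / (1 - observed_year1)

print(f"constant-rate loss after two years: {constant_two_year_loss:.0%}")
print(f"implied loss rate in year two:      {second_year_rate:.0%}")
```

On these figures, content that survives its first year is lost far more slowly in its second, which is the "islands of stability" effect he describes.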

This work is enabled by the support by Web archives for RFC7089, which allows access to preserved versions (Mementos) of web pages by [url,datetime]. The basic question to ask is "does the web-at-large URI still resolve to the content it did when it was published?".
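As a rough illustration of access by [url,datetime], RFC 7089 has a client send the desired datetime in an Accept-Datetime request header to a TimeGate for the original URI. The sketch below only builds such a request and does not send it. The TimeGate prefix is an assumption on my part (the Internet Archive's Wayback Machine), and the helper name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Assumption: a TimeGate that accepts the original URI appended to this prefix.
TIMEGATE_PREFIX = "http://web.archive.org/web/"

def timegate_request(uri: str, when: datetime) -> Tuple[str, Dict[str, str]]:
    """Return the (URL, headers) of a TimeGate request for `uri` near `when`.

    `when` must be timezone-aware; it is converted to UTC and rendered in the
    HTTP date format that the Accept-Datetime header uses.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE_PREFIX + uri, headers

url, headers = timegate_request(
    "http://example.com/", datetime(2015, 8, 1, tzinfo=timezone.utc)
)
print(url)
print(headers["Accept-Datetime"])
```

A TimeGate that supports the protocol would answer with a redirect to the Memento closest to the requested datetime.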

The earlier paper:
estimated the existence of representative Mementos for those URI references using an intuitive technique: if a Memento for a referenced URI existed with an archival datetime in a temporal window of 14 days prior and after the publication date of the referencing paper, the Memento was regarded representative.
The new paper takes a more careful approach:
For each URI reference, we poll multiple web archives in search of two Mementos: a Memento Pre that has a snapshot date closest and prior to the publication date of the referencing article, and a Memento Post that has a snapshot date closest and past the publication date. We then assess the similarity between these Pre and Post Mementos using a variety of similarity measures.
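The difference between the two approaches can be sketched in a few lines. This is my own illustration, not the authors' code: the function names are hypothetical, the snapshot dates are invented for the example, and I assume a snapshot taken exactly on the publication date counts as "prior".

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

def has_memento_in_window(snapshots: Sequence[datetime],
                          published: datetime, days: int = 14) -> bool:
    """Earlier paper: is there any Memento within +/- `days` of publication?"""
    window = timedelta(days=days)
    return any(abs(s - published) <= window for s in snapshots)

def pre_post_mementos(snapshots: Sequence[datetime], published: datetime
                      ) -> Tuple[Optional[datetime], Optional[datetime]]:
    """New paper: closest Memento before and closest Memento after publication."""
    pre = max((s for s in snapshots if s <= published), default=None)
    post = min((s for s in snapshots if s > published), default=None)
    return pre, post

# Invented snapshot dates, pooled from several archives.
snapshots = [datetime(2012, 1, 1), datetime(2012, 5, 20),
             datetime(2012, 9, 15), datetime(2013, 2, 1)]
published = datetime(2012, 6, 1)

print(has_memento_in_window(snapshots, published))
print(pre_post_mementos(snapshots, published))
```

Only when both a Pre and a Post Memento exist can their content be compared, which is why the new paper works with pairs rather than a single nearby snapshot.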
Incidence of web-at-large URIs
They worked with three corpora (arXiv, Elsevier and PubMed Central) with a total of about 1.8M articles referencing web-at-large URIs. This graph, whose data I took from Tables 4,5,6 of the earlier paper, shows that the proportion of articles with at least one web-at-large URI was increasing rapidly through 2012. It would be interesting to bring this analysis up-to-date, and to show not merely the proportion through time of articles with at least one web-at-large URI as in this graph, but also histograms through time of the proportion of citations that were to web-at-large URIs.

From those articles 3,983,985 URIs were extracted. 1,059,742 were identified as web-at-large URIs, for 680,136 of which it was possible to identify [Memento Pre, Memento Post] pairs. Eliminating non-text URIs left 648,253. They use four different techniques to estimate similarity. By comparing the results they set an aggregate similarity threshold, then:
We apply our stringent similarity threshold to the collection of 648,253 URI references for which Pre/Post Memento pairs can be compared ... and find 313,591 (48.37%) for which the Pre/Post Memento pairs have the maximum similarity score for all measures; these Mementos are considered representative.
Then they:
use the resulting subset of all URI references for which representative Mementos exist and look up each URI on the live web. Predictably, and as shown by extensive prior link rot research, many URIs no longer exist. But, for those that still do, we use the same measures to assess the similarity between the representative Memento for the URI reference and its counterpart on the live web.
This revealed that over 20% of the URIs had suffered link rot, leaving 246,520. More now had different content type, or no longer contained text to be compared. 241,091 URIs remained for which:
we select the Memento with an archival date closest to the publication date of the paper in which the URI reference occurs and compare it to that URI’s live web counterpart using each of the normalized similarity measures.
The aggregated result is:
a total of 57,026 (23.65%) URI references that have not been subject to content drift. In other words, the content on the live web has drifted away from the content that was originally referenced for three out of four references (184,065 out of 241,091, which equals 76.35%).
Another way of looking at this result is that only 57,026 of the 313,591 URIs for which matching Pre/Post Memento pairs existed could be shown not to have rotted, or 18.18%. For 334,662 out of 648,253 references with Pre/Post Memento pairs, or 51.63%, the referenced URI changed significantly between the Pre and Post Mementos, showing that it was probably unstable even as the authors were citing it. The problem gets worse through time:
even for articles published in 2012 only about 25% of referenced resources remain unchanged by August of 2015. This percentage steadily decreases with earlier publication years, although the decline is markedly slower for arXiv for recent publication years. It reaches about 10% for 2003 through 2005, for arXiv, and even below that for both Elsevier and PMC.
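The percentages quoted above all follow from the counts given in the paper, and can be rechecked with a short script. The variable names are mine; every count is taken from the text above.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to two decimal places."""
    return round(100 * part / whole, 2)

pairs = 648_253           # text URI references with Pre/Post Memento pairs
representative = 313_591  # pairs with maximum similarity on all measures
still_live = 246_520      # representative URIs that still resolve
comparable = 241_091      # of those, still text that can be compared
unchanged = 57_026        # live content matching the representative Memento

print("representative pairs:      ", pct(representative, pairs))
print("unstable Pre/Post pairs:   ", pct(pairs - representative, pairs))
print("link rot:                  ", pct(representative - still_live, representative))
print("no content drift:          ", pct(unchanged, comparable))
print("content drift:             ", pct(comparable - unchanged, comparable))
print("shown not to have rotted:  ", pct(unchanged, representative))
```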
Similarity over time at arXiv
Thus, as this arXiv graph shows, they find that, after a few years, it is very unlikely that a reader clicking on a web-at-large link in an article will see what the author intended. They suggest that this problem can be addressed by:
  • Archiving Mementos of cited web-at-large URIs during publication, for example using Web archive nomination services such as
  • The use of "robust links":
    a link can be made more robust by including:
    • The URI of the original resource for which the snapshot was taken;
    • The URI of the snapshot;
    • The datetime of linking, of taking the snapshot.
The robust link proposal describes the model of link decoration that Klein discussed in his talk:
information is conveyed as follows:
  • href for the URI of the original resource for which the snapshot was taken;
  • data-versionurl for the URI of the snapshot;
  • data-versiondate for the datetime of linking, of taking the snapshot.
But this has a significant problem. The eventual reader will click on the link and be taken to the original URI, which, as the paper shows, even if it resolves, is very unlikely to be what the author intended. The robust links site also includes JavaScript to implement pop-up menus giving users a choice of Mementos, which they assume a publisher implementing robust links would add to their pages. An example of this is Reminiscing About 15 Years of Interoperability Efforts. Note the paper-clip and down-arrow appended to the normal underlined blue link rendering. Clicking on this provides a choice of Mementos.

The eventual reader, who has not internalized the message of this research, will click on the link. If it returns 404, they might click on the down-arrow and choose an alternate Memento. But far more often they will click on the link and get to a page that they have no way of knowing has drifted. They will assume it hasn't, so will not click on the down-arrow and not get to the page the author intended. The JavaScript has no way to know that the page has drifted, so cannot warn the user that it has.

The robust link proposal also describes a different model of link decoration:
information is conveyed as follows:
  • href for the URI that provides the specific state, i.e. the snapshot or resource version;
  • data-originalurl for the URI of the original resource;
  • data-versiondate for the datetime of the snapshot, of the resource version.
If this model were to be used, the eventual reader would end up at the preserved Memento, which for almost all archives would be framed with information from the archive. This would happen whether or not the original URI had rotted or the content had drifted. The reader would both access, and would know they were accessing, what the author intended. JavaScript would be needed only for the case where the linked-to Memento was unavailable, and other Web archives would need to be queried for the best available Memento.
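To make the two decoration models concrete, here is a minimal sketch that emits an anchor element in either form. Only the attribute names come from the proposal as quoted above; the helper, the URLs and the date format are placeholders I have assumed for illustration.

```python
from html import escape

def robust_link(text: str, original: str, snapshot: str,
                version_date: str, link_to_snapshot: bool = False) -> str:
    """Return an <a> element decorated in one of the two robust link models.

    link_to_snapshot=False: href is the original URI (first model).
    link_to_snapshot=True:  href is the snapshot URI (second model).
    """
    if link_to_snapshot:
        href, other_name, other_value = snapshot, "data-originalurl", original
    else:
        href, other_name, other_value = original, "data-versionurl", snapshot
    return '<a href="{}" {}="{}" data-versiondate="{}">{}</a>'.format(
        escape(href, quote=True), other_name,
        escape(other_value, quote=True),
        escape(version_date, quote=True), escape(text))

# Hypothetical URIs and date, for illustration only.
original = "http://example.org/page"
snapshot = "http://archive.example/20170201/http://example.org/page"

first_model = robust_link("a page", original, snapshot, "2017-02-01")
second_model = robust_link("a page", original, snapshot, "2017-02-01",
                           link_to_snapshot=True)
print(first_model)
print(second_model)
```

The two outputs carry the same three pieces of information. They differ only in which URI a plain click follows, which is exactly the distinction that matters to a reader who never opens the menu.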

The robust links specification treats these two models as alternatives, but in practice only the second provides an effective user experience without significant JavaScript support beyond what has been demonstrated. Conservatively, these two papers suggest that between a quarter and a third of all articles will contain at least one web-at-large citation and that after a few years it is very unlikely to resolve to the content the article was citing. Given the very high probability that the URI has suffered content drift, it is better to steer the user to the contemporaneous version if one exists.

With some HTML editing, I made the links to the papers above point to their DOIs at, so they should persist, although DOIs have their own problems (and also here). I could not archive the URLs to which the DOIs currently resolve, apparently because PLOS blocks the Internet Archive's crawler. With more editing I decorated the link to Andy Jackson's talk in the way Martin suggested - the BL's blog should be fairly stable, but who knows? I saved the two external graphs to Blogger and linked to them there, as is my habit. Andy's graph was captured by the Internet Archive, so I decorated the link to it with that copy. I nominated the arXiv graph and my graph to the Internet Archive and decorated the links with their copy.

The difficulty of actually implementing these links, and the increased understanding of how unlikely it is that the linked-to content will be unchanged, reinforce the arguments in my post from last year entitled The Evanescent Web:
All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself ... that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.
It is worth noting that discussions with publishers about the related set of changes discussed in Improving e-Journal Ingest (among other things) are on-going. Nevertheless, this proposal is more problematic for them. Journal publishers are firmly opposed to pointing to alternate sources for their content, such as archives, so they would never agree to supply that information in their links to journal articles. Note that very few DOIs resolve to multiple targets. They would therefore probably be reluctant to link to alternate sources for web-at-large content from other for-profit or advertising-supported publishers, even if it were open access. The idea that journal publishers would spend the effort needed to identify whether a web-at-large link in an article pointed to for-profit content seems implausible.

Update: Michael Nelson alerted me to two broken links, which I fixed. The first was my mistake during hand-editing the HTML to insert the Memento links. The second is both interesting and ironic. The link with text "point to their DOIs at" referred to
Persistent URIs Must Be Used To Be Persistent, a paper by Herbert van de Sompel, Martin Klein and Shawn Jones which shows:
a significant number of references to papers linked by their locating URI instead of their identifying URI. For these links, the persistence intended by the DOI persistent identifier infrastructure was not achieved.
I.e. how important it is for links to papers to be links. I looked up the paper using Google Scholar and copied the link from the landing page at the ACM's Digital Library, which was The link I copied was, which is broken!

Update 2: Rob Baxter commented on another broken link, which I have fixed.

Concerns about FCC E-rate letter on fiber broadband deployment / District Dispatch

While we anticipated the Federal Communications Commission (FCC) would take a look at its Universal Service Fund (USF) programs once Chairman Pai was in place, we did not anticipate the speed at which moves to review and evaluate previous actions would occur. After the Commission retracted the “E-rate Modernization Report,” our E-rate ears have been itching with concern that our bread and butter USF program would attract undue attention. We did not have long to wait.

Photo credit Wikimedia Commons

Last week, FCC Commissioner Michael O’Rielly sent a letter (pdf) to the Universal Service Administrative Company (USAC) seeking detailed information on libraries and schools that applied in 2016 for E-rate funding for dark fiber and self-provisioned fiber. Our main concern is that the tenor of the Commissioner’s inquiries calls into question the need for these fiber applications. The FCC’s December 2014 E-rate Modernization Order allowed libraries and schools to apply for E-rate on self-construction costs for dark fiber and applicant-owned fiber. Allowing E-rate eligibility of self-construction costs “levels the playing field” with the more typical leased fiber service offered by a third party, like a local telecommunications carrier. Because we know from our members that availability of high-capacity broadband at reasonable costs continues to be a significant barrier for libraries that want to increase their broadband capacity, ALA advocated for this change in several filings with the FCC.

We find Commissioner O’Rielly’s concern about overbuilding to be misplaced. The real issue is getting the best broadband service at the lowest cost, thus ensuring the most prudent use of limited E-rate and local funds. As we explained in our September 2013 comments (pdf) filed in response to then Acting Chair Mignon Clyburn’s opening of the E-rate modernization proceeding, “It is not a good stewardship of E-rate funds (or local library funds) to pay more for leasing a circuit when ownership is less expensive.”

To help ensure that applicants get the lowest cost for their fiber service, the FCC already has in place detailed E-rate bidding regulations that require cost be the most important factor when evaluating bids from providers. As the Commission stated in its December 2014 E-rate Modernization Order (pdf), incumbent providers “Are free to offer dark-fiber service themselves, or to price their lit-fiber service at competitive rates to keep or win business – but if they choose not to do so, it is market forces and their own decisions, not the E-rate rules” that preclude their ability to compete with a self-construction option. The Commission’s reforms to allow self-construction costs for dark fiber and applicant-owned fiber were correct in 2014 and remain so. In addition, applicants will evaluate and select the best, most cost-effective fiber option for their library or school.

If the last few weeks are any indication of activity at the FCC, we’re in for a busy spring.


RFC 4810 / David Rosenthal

A decade ago next month Wallace et al published RFC 4810 Long-Term Archive Service Requirements. Its abstract is:
There are many scenarios in which users must be able to prove the existence of data at a specific point in time and be able to demonstrate the integrity of data since that time, even when the duration from time of existence to time of demonstration spans a large period of time. Additionally, users must be able to verify signatures on digitally signed data many years after the generation of the signature. This document describes a class of long-term archive services to support such scenarios and the technical requirements for interacting with such services.
Below the fold, a look at how it has stood the test of time.

The RFC's overview of the problem a long-term archive (LTA) must solve is still exemplary, especially in its stress on the limited lifetime of cryptographic techniques (Section 1):
Digital data durability is undermined by continual progress and change on a number of fronts. The useful lifetime of data may exceed the life span of formats and mechanisms used to store the data. The lifetime of digitally signed data may exceed the validity periods of public-key certificates used to verify signatures or the cryptanalysis period of the cryptographic algorithms used to generate the signatures, i.e., the time after which an algorithm no longer provides the intended security properties. Technical and operational means are required to mitigate these issues.
But note the vagueness of the very next sentence:
A solution must address issues such as storage media lifetime, disaster planning, advances in cryptanalysis or computational capabilities, changes in software technology, and legal issues.
There is no one-size-fits-all affordable digital preservation technology, something the RFC implicitly acknowledges. But it does not even mention the importance of basing decisions on an explicit threat model when selecting or designing an appropriate technology. More than 18 months before the RFC was published, the LOCKSS team made this point in Requirements for Digital Preservation Systems: A Bottom-Up Approach. Our explicit threat model was very useful in the documentation needed for the CLOCKSS Archive's TRAC certification.

How to mitigate the threats? Again, Section 1 is on point:
A long-term archive service aids in the preservation of data over long periods of time through a regimen of technical and procedural mechanisms designed to support claims regarding a data object. For example, it might periodically perform activities to preserve data integrity and the non-repudiability of data existence by a particular point in time or take actions to ensure the availability of data. Examples of periodic activities include refreshing time stamps or transferring data to a new storage medium.
Section 4.1.1 specifies a requirement that is still not implemented in any ingest pipeline I've encountered:
The LTA must provide an acknowledgement of the deposit that permits the submitter to confirm the correct data was accepted by the LTA.
It is normal for Submission Information Packages (SIPs) to include checksums of their components; BagIt is typical in this respect. The checksums allow the archive increased confidence that the submission was not corrupted in transit. But they don't do anything to satisfy RFC 4810's requirement that the submitter be reassured that the archive got the right data. Even if the archive reported the checksums to the submitter, this wouldn't tell the submitter anything useful. The archive could simply have copied the checksums from the submission without validating them.

SIPs should include a nonce. The archive should prepend the nonce to each checksummed item, and report the resulting checksums back to the submitter, who can validate them, thus mitigating among others the threat that the SIP might have been tampered with in transit. This is equivalent to the first iteration of Shah et al's audit technology.
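A minimal sketch of the nonce idea follows. The choice of SHA-256, the function names and the in-memory "SIP" are all assumptions made for illustration; they are not part of RFC 4810, BagIt or any existing ingest pipeline.

```python
import hashlib
import secrets
from typing import Dict

def nonce_checksum(nonce: bytes, content: bytes) -> str:
    """Checksum of the nonce prepended to the item's content."""
    return hashlib.sha256(nonce + content).hexdigest()

def archive_receipt(nonce: bytes, stored: Dict[str, bytes]) -> Dict[str, str]:
    """Archive side: computed over what was actually stored, so the
    checksums cannot simply be copied from the submission's manifest."""
    return {name: nonce_checksum(nonce, data) for name, data in stored.items()}

def submitter_accepts(nonce: bytes, local: Dict[str, bytes],
                      receipt: Dict[str, str]) -> bool:
    """Submitter side: recompute from the local copies and compare."""
    return receipt == {n: nonce_checksum(nonce, d) for n, d in local.items()}

# The submitter picks a fresh random nonce and includes it in the SIP.
nonce = secrets.token_bytes(16)
sip = {"paper.pdf": b"%PDF-1.4 example bytes", "data.csv": b"a,b\n1,2\n"}

good_receipt = archive_receipt(nonce, sip)
bad_receipt = archive_receipt(nonce, {**sip, "data.csv": b"a,b\n1,3\n"})

print(submitter_accepts(nonce, sip, good_receipt))  # archive holds the right data
print(submitter_accepts(nonce, sip, bad_receipt))   # one item was altered
```

Because the nonce is unpredictable, the archive can only produce the right answers by hashing the content it actually holds, which is the reassurance the RFC's requirement asks for.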

Note also how the RFC follows (without citing) the OAIS Reference Model in assuming a "push" model of ingest.

The RFC correctly points out that an LTA will rely on, and trust, services that must not be provided by the LTA itself, for example (Section 4.2.1):
Supporting non-repudiation of data existence, integrity, and origin is a primary purpose of a long-term archive service. Evidence may be generated, or otherwise obtained, by the service providing the evidence to a retriever. A long-term archive service need not be capable of providing all evidence necessary to produce a non-repudiation proof, and in some cases, should not be trusted to provide all necessary information. For example, trust anchors [RFC3280] and algorithm security policies should be provided by other services. An LTA that is trusted to provide trust anchors could forge an evidence record verified by using those trust anchors.
and (Section 2):
Time Stamp: An attestation generated by a Time Stamping Authority (TSA) that a data item existed at a certain time. For example, [RFC3161] specifies a structure for signed time stamp tokens as part of a protocol for communicating with a TSA.
But the RFC doesn't explore the problems that this reliance causes. Among these are:
  • Recursion. These services, which depend on encryption technologies that decay over time, must themselves rely on long-term archiving services to maintain, for example, a time-stamped history of public key validity. The RFC does not cite Petros Maniatis' 2003 Ph.D. thesis Historic integrity in distributed systems on precisely this problem.
  • Secrets. The encryption technologies depend on the ability to keep secrets for extended periods even if not, as the RFC explains, for the entire archival period. Keeping secrets is difficult and it is more difficult to know whether, or when, they leaked. The damage to archival integrity which the leak of secrets enables may only be detected after the fact, when recovery may not be possible. Or it may not be detected, because the point at which the secret leaked may be assumed to be later than it actually was.
These problems were at the heart of the design of the LOCKSS technology. Avoiding reliance on external services or extended secrecy led to its peer-to-peer, consensus-based audit and repair technology.

Encryption poses many problems for digital preservation. Section 4.5.1 identifies another:
A long-term archive service must provide means to ensure confidentiality of archived data objects, including confidentiality between the submitter and the long-term archive service. An LTA must provide a means for accepting encrypted data such that future preservation activities apply to the original, unencrypted data. Encryption, or other methods of providing confidentiality, must not pose a risk to the associated evidence record.
Easier said than done. If the LTA accepts encrypted data without the decryption key, the best it can do is bit-level preservation. Future recovery of the data depends on the availability of the key which, being digital information, will need itself to have been stored in an LTA. Another instance of the recursive nature of long-term archiving.

"Mere bit preservation" is often unjustly denigrated as a "solved problem". The important archival functions are said to be "active", modifying the preserved data and therefore requiring access to the plaintext. The alternative, then, is for the archive to hold the key and, in effect, store the plaintext. The confidentiality of the archived data then depends on the archive's security remaining impenetrable over the entire archival period, something about which one might reasonably be skeptical.

Section 4.2.2 admits that "mere bit preservation" is the sine qua non of long-term archiving:
Demonstration that data has not been altered while in the care of a long-term archive service is a first step towards supporting non-repudiation of data.
and goes on to note that "active preservation" requires another service:
Certification services support cases in which data must be modified, e.g., translation or format migration. An LTA may provide certification services.
It isn't clear why the RFC thinks it is appropriate for an LTA to certify the success of its own operations. A third-party certification service would also need access to pre- and post-modification plaintext, increasing the plaintext's attack surface and adding another instance of the problems caused by LTAs relying on external services discussed above.

Overall, the RFC's authors did a pretty good job. Time has not revealed significant inadequacies beyond those knowable at the time of publication.

Rebranding the Hydra Project / Hydra Project

Some of you may be aware that the Hydra Project has been attempting to trademark its “product” in the US and in Europe. During this process we became aware of MPDV, a German company that has a wide-ranging trademark on the use of ‘Hydra’ for computer software and whose claim to the word considerably predates ours. Following discussions with their lawyers, our attorney advised that we should agree to MPDV’s demand that we cease use of the name “Hydra” and, having sought a second opinion, we have agreed that we will do so. Accordingly, we need to embark on a program to rebrand ourselves. MPDV have given us six months to do this, which our lawyer deems “generous”.

The Steering Group, in consultation with the Hydra Partners, has already started mapping out a process to follow over the coming months but will welcome input from the Hydra Community – particularly help in identifying a new name, a matter of some urgency.  We will be especially interested in hearing from anyone with prior success in any naming and (re-)branding initiatives!  Rather than seeing this as a setback we are looking at the process as a way to refocus and re-invigorate the project ahead of new, exciting developments such as cloud-hosted delivery.

Please share your ideas via any of Hydra’s mailing lists.  If you use Slack you may like to look at a new Hydra channel called #branding where some interesting ideas are being discussed.

MarcEdit Update: All Versions / Terry Reese

All versions have been updated.  For specific information about workstream work, please see: MarcEdit Workstreams: MacOS and Windows/Linux

MarcEdit Mac Changelog:

** 2.2.35
* Bug Fix: Delimited Text Translator: The 3rd delimiter wasn’t being set reliably. This should be corrected.
* Enhancement: Accessibility: Users can now change the font and font sizes in the application.
* Enhancement: Delimited Text Translator: Users can enter position and length on all fields.

MarcEdit Windows/Linux Changelog:

* Enhancement: Plugin management: automated updates, support for 3rd party plugins, and better plugin management has been added.
* Bug Fix: Delimited Text Translator: The 3rd delimiter wasn’t being set reliably. This should be corrected.
* Update: Field Count: Field count has been updated to improve counting when dealing with formatting issues.
* Enhancement: Delimited Text Translator: Users can enter position and length on all fields.

Downloads are available via the automated updating tool or via the Downloads page.



New Faces at Hydra-In-A-Box / DuraSpace News

Austin, TX  As the Hydra-in-a-Box project prepares for major developments in 2017 – release of the Hyku repository minimum viable product, a HykuDirect hosted service pilot program, and a higher-performing aggregation system at DPLA – we welcome three stars who recently joined the project team. Please join us in welcoming Michael Della Bitta, Heather Greer Klein, and Kelcy Shepherd.

We’re Hiring: DuraSpace Seeks Business Development Manager / DuraSpace News

Austin, TX  A cornerstone of the DuraSpace mission is expanding collaborations with academic, scientific, cultural, technology, and research communities in support of projects and services to help ensure that current and future generations will have access to our collective digital heritage. The DuraSpace organization seeks a Business Development Manager to cultivate and deepen those relationships and partnerships, particularly with international organizations and consortia, to elevate the organization’s profile and to expand the services it offers.

Listen: LITA Persona Task Force (7:48) / LibUX

This week’s episode of Metric: A User Experience Podcast with Amanda L. Goodman (@godaisies) gives you a peek into the work of the LITA Persona Task Force, who are charged with defining and developing personas that are to be used in growing membership in the Library and Information Technology Association.

You can also download the MP3 or subscribe to Metric: A UX Podcast on OverCast, Stitcher, iTunes, YouTube, Soundcloud, Google Music, or just plug our feed straight into your podcatcher of choice.

bento_search 1.7.0 released / Jonathan Rochkind

bento_search is the gem for making embedding of external searches in Rails a breeze, focusing on search targets and use cases involving ‘scholarly’ or bibliographic citation results.

Bento_search isn’t dead, it just didn’t need much updating. But thanks to some work for a client using it, I had the opportunity to do some updates.

Bento_search 1.7.0 includes testing under Rails 5 (the earlier versions probably would have worked fine in Rails 5 already), some additional configuration options, a lot more fleshing out of the EDS adapter, and a new ConcurrentSearcher demonstrating proper use of the new Rails 5 concurrency API (the older BentoSearch::MultiSearcher is now deprecated).

See the CHANGES file for the full list.

As with all releases of bento_search to date, it should be strictly backwards compatible and an easy upgrade. (Although if you are using Rails earlier than 4.2, I’m not completely confident, as we aren’t currently doing automated testing of those).

Miseducation / Karen Coyle

There's a fascinating video created by the Southern Poverty Law Center (in January 2017) that focuses on Google but is equally relevant to libraries. It is called The Miseducation of Dylann Roof.


In this video, the speaker shows that by searching on "black on white violence" in Google the top items are all from racist sites. Each of these links only to other racist sites. The speaker claims that Google's algorithms will favor similar sites to ones that a user has visited from a Google search, and that eventually, in this case, the user's online searching will be skewed toward sites that are racist in nature. The claim is that this is what happened to Dylann Roof, the man who killed 9 people at an historic African-American church - he entered a closed information system that consisted only of racist sites. It ends by saying: "It's a fundamental problem that Google must address if it is truly going to be the world's library."

I'm not going to defend or deny the claims of the video, and you should watch it yourself because I'm not giving a full exposition of its premise here (and it is short and very interesting). But I do want to question whether Google is or could be "the world's library", and also whether libraries do a sufficient job of presenting users with a well-rounded information space.

It's fairly easy to dismiss the first premise - that Google is or should be seen as a library. Google is operating in a significantly different information ecosystem from libraries. While there is some overlap between Google and library collections, primarily because Google now partners with publishers to index some books, there is much that is on the Internet that is not in libraries, and a significant amount that is in libraries but not available online. Libraries pride themselves on providing quality information, but we can't really take the lion's share of the credit for that; the primary gatekeepers are the publishers from whom we purchase the items in our collections. In terms of content, most libraries are pretty staid, collecting only from mainstream publishers.

I decided to test this out and went looking for works promoting Holocaust denial or Creationism in a non-random group of libraries. I was able to find numerous books about deniers and denial, but only research libraries seem to carry the books by the deniers themselves. None of these come from mainstream publishing houses. I note that the subject heading, Holocaust denial literature, is applied to both those items written from the denial point of view, as well as ones analyzing or debating that view.

Creationism gets a bit more visibility; I was able to find some creationist works in public libraries in the Bible Belt. Again, there is a single subject heading, Creationism, that covers both the pro- and the con-. Finding pro- works in WorldCat is a kind of "needle in a haystack" exercise.

Don't dwell too much on my findings - this is purely anecdotal, although a true study would be fascinating. We know that libraries to some extent reflect their local cultures, such as the presence of the Gay and Lesbian Archives at the San Francisco Public Library.  But you often hear that libraries "cover all points of view," which is not really true.

The common statement about libraries is that we gather materials on all sides of an issue. Another statement is that users will discover them because they will reside near each other on the library shelves. Is this true? Is this adequate? Does this guarantee that library users will encounter a full range of thoughts and facts on an issue?

First, just because the library has more than one book on a topic does not guarantee that a user will choose to engage with multiple sources. There are people who seek out everything they can find on a topic, but as we know from the general statistics on reading habits, many people will not read voraciously on a topic. So the fact that the library has multiple items with different points of view doesn't mean that the user reads all of those points of view.

Second, there can be a big difference between what the library holds and what a user finds on the shelf. Many public libraries have a high rate of circulation of a large part of their collection, and some books have such long holds lists that they may not hit the shelf for months or longer. I have no way to predict what a user would find on the shelf in a library that had an equal number of books expounding the science of evolution vs those promoting the biblical concept of creation, but it is frightening to think that what a person learns will be the result of some random library bookshelf.

But the third point is really the key one: libraries do not cover all points of view, if by points of view you include the kind of mis-information that is described in the SPLC video. There are many points of view that are not available from mainstream publishers, and there are many points of view that are not considered appropriate for anything but serious study. A researcher looking into race relations in the United States today would find the sites that attracted Roof to provide important insights, as SPLC did, but you will not find that same information in a "reading" library.

Libraries have an idea of "appropriate" that they share with the publishing community. We are both scientific and moral gatekeepers, whether we want to admit it or not. Google is an algorithm functioning over an uncontrolled and uncontrollable number of conversations. Although Google pretends that its algorithm is neutral, we know that it is not. On Amazon, which does accept self-published and alternative press books, certain content like pornography is consciously kept away from promotions and best seller lists. Google has "tweaked" its algorithms to remove Holocaust denial literature from view in some European countries that forbid the topic. The video essentially says that Google should make wide-ranging cultural, scientific and moral judgments about the content it indexes.

I am of two minds about the idea of letting Google or Amazon be a gatekeeper. On the one hand, immersing a Dylann Roof in an online racist community is a terrible thing, and we see the result (although the cause and effect may be hard to prove as strongly as the video shows). On the other hand, letting Google and Amazon decide what is and what is not appropriate does not sit well at all. As I've said before, having gatekeepers whose motivations are trade secrets that cannot be discussed is quite dangerous.

There has been a lot of discussion lately about libraries and their supposed neutrality. I am very glad that we can have that discussion. With all of the current hoopla about fake news, Russian hackers, and the use of social media to target and change opinion, we should embrace the fact of our collection policies, and admit widely that we and others have thought carefully about the content of the library. It won't be the most radical in many cases, but we care about veracity, and that's something that Google cannot say.

IEEE Big Data Conference 2016: Computational Archival Science / Library of Congress: The Signal

This is a guest post by Meredith Claire Broadway, a consultant for the World Bank.

Photo of a PowerPoint slide projected onto a wall.

Jason Baron, Drinker Biddle & Reath LLP, “Opening Up Dark Digital Archives Through The Use of Analytics To Identify Sensitive Content,” 2016. Photo by Meredith Claire Broadway.

Computational Archival Science can be regarded as the intersection between the archival profession and “hard” technical fields, such as computer science and engineering. CAS applies computational methods and resources to large-scale records and archives processing, analysis, storage, long-term preservation and access. In short: big data is a big deal for archivists, particularly because old-school pen-and-paper methodologies don’t apply to digital records. To keep up with big data, the archival profession is called upon to open itself up to new ideas and collaborate with technological professionals.

Naturally, collaboration was the theme of the IEEE Big Data Conference ’16: Computational Archival Science workshop. There were many speakers with projects that drew on the spirit of collaboration by applying computational methods (such as machine learning, visualization and natural language processing) to archival problems. Subjects ranged from improving optical-character-recognition efforts with topic modeling to utilizing vector-space models so that archives can better anonymize PII and other sensitive content.

For example, “Content-based Comparison for Collections Identification” was presented by a team led by Maria Esteva of the Texas Advanced Computing Center. Maria and her team created an automated method of discovering the identity of datasets that appear to be similar or identical but may be housed in two different repositories or parts of different collections. This service is important to archivists because datasets often exist in multiple formats and versions and in different stages of completion. Traditionally, archives determine issues such as these through manual effort and metadata entry. A shift to automation of content-based comparison allows archivists to identify changes, connections and differences between digital records with greater accuracy and efficiency.

The team’s algorithm operates in a straightforward manner. First, two collections are analyzed to determine the types of records they contain, and a list is generated for each collection. Next, the records analysis creates a list of pairs from each collection for comparison. Finally, a summary report is created to show differences between the collections.

Weijia Xu, Ruizhu Huang, Maria Esteva, Jawon Song, Ramona Walls, “Content-based Comparison for Collections Identification,” 2015.

To briefly summarize Maria’s findings, metadata alone isn’t enough for the content-based comparison algorithm to determine whether a dataset is unique. The algorithm needs more information from datasets to make improved comparisons.
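The three steps described above (inventory by record type, pairing, summary report) can be sketched roughly as follows. This is only an illustration and not the TACC team's implementation: it assumes each collection is a plain dictionary of file name to bytes, uses the file extension as a stand-in for "record type", and swaps in a SHA-256 digest as the content comparison. The function names `inventory` and `compare` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


def inventory(collection):
    """Step 1: group a collection's records by type (here, file extension)
    and fingerprint each record's content with SHA-256."""
    by_type = defaultdict(dict)
    for name, content in collection.items():
        rtype = PurePosixPath(name).suffix.lower() or "(none)"
        by_type[rtype][name] = hashlib.sha256(content).hexdigest()
    return by_type


def compare(coll_a, coll_b):
    """Steps 2 and 3: pair records of the same type across the two
    collections and summarise which pairs have identical content."""
    inv_a, inv_b = inventory(coll_a), inventory(coll_b)
    report = {"identical": [], "only_in_a": [], "only_in_b": []}
    for rtype in sorted(set(inv_a) | set(inv_b)):
        recs_a = inv_a.get(rtype, {})
        recs_b = inv_b.get(rtype, {})
        # Index collection B by digest so matching ignores file names.
        names_by_digest = defaultdict(list)
        for name_b, digest in recs_b.items():
            names_by_digest[digest].append(name_b)
        matched_b = set()
        for name_a, digest in sorted(recs_a.items()):
            if digest in names_by_digest:
                for name_b in sorted(names_by_digest[digest]):
                    report["identical"].append((name_a, name_b))
                    matched_b.add(name_b)
            else:
                report["only_in_a"].append(name_a)
        report["only_in_b"].extend(sorted(set(recs_b) - matched_b))
    return report
```

Because matching is done on content rather than on names, a file that was renamed between repositories still pairs up, while two files that share a name but differ in content do not. An exact digest only catches byte-identical copies, though; recognising different formats or versions of the same dataset would need a richer comparison than this.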

Automated content-based comparison is the future of digital archives. Naturally, it raises questions, among them, “What is the best way for archivists to meet automated methods?” and “How can current archival workflows be aligned with computational efforts?”

The IEEE Computational Archival Science session ended on a contemplative note. Keynote speaker Mark Conrad, of the National Archives and Records Administration, asked the gathering what skills they thought the new generation of computational archival scientists should be taught. Topping the list were answers such as “coding,” “text mining” and “history of archival practice.”

What interested me most was the ensuing conversation about how CAS deserves its own academic track. The assembly agreed that CAS differs enough from the traditional Library and Information Science and Archival tracks, in both the United States and Canada, that it qualifies as a new area of study.

CAS differs from the LIS and Archival fields in large part due to its technology-centric nature. “Hard” technical skills take more than two years (the usual time it takes to complete an LIS master’s program) to develop, a fact I can personally attest to as a former LIS student and R beginner. It makes sense, then, that for CAS students to receive a robust education they should have a unique curriculum.

If CAS, LIS and the Archival Science fields merge, there’s an assumption that they will run the risk of taking an “inch-deep, mile-wide” approach to studies. Our assembly agreed that, in this case, “less is more” if it allows students to cultivate fully developed skills.

Of course, these were just the opinions of those present at the IEEE workshop. As the session emphasized, CAS encourages collaboration, discussion and differing opinions. If you have something to add to any of my points, please leave a comment.

schema.org, Wikidata, Knowledge Graph: strands of the modern semantic web / Dan Scott

My slides from Ohio DevFest 2016: schema.org, Wikidata, Knowledge Graph: strands of the modern semantic web

And the video, recorded and edited by the incredible amazing Patrick Hammond:

In November, I had the opportunity to speak at Ohio DevFest 2016. One of the organizers, Casey Borders, had invited me to talk about schema.org, structured data, or something in that subject area, based on a talk about schema.org and RDFa he had seen me give at the DevFest Can-Am in Waterloo a few years prior. Given the Google-oriented nature of the event and the 50-minute time slot, I opted to add in coverage of the Google Knowledge Graph and its API, which I had been exploring from time to time since its launch in late 2014.

Alas, the Google Knowledge Graph Search API is still quite limited; it returns quite minimal data in comparison to the rich cards that you see in regular Google search results. The JSON results only include links for an image, a corresponding Wikipedia page, and for the ID of the entity. I also uncovered errors that had lurked in the documentation for quite some time; happily, the team quickly responded to correct those problems.

So I dug back in time and also covered Freebase, the database of linked and structured data that had both allowed individual contributions and made its database freely available--until it was purchased by Google, fed into the Knowledge Graph, and shut down. Not many people knew what we had once had until it was gone (Ed Summers did, for one), but such is the way of commercial entities.

In that context, Wikidata looks something like the Second Coming of an open (for contribution and access) linked and structured database, with sustainability derived financially from the Wikimedia Foundation and structurally by its role in underpinning Wikipedia and Wikimedia Commons. Google also did a nice thing by putting resources into adding the appropriately licensed data they could liberate from Freebase: approximately 19 million statements and IDs.

The inclusion of Google Knowledge Graph IDs in Wikidata means that we can use the Google Search API to find an entity ID, then pull the corresponding richer data from Wikidata for that ID to populate relationships and statements. You can get there from here! Ultimately, my thesis is that Wikidata can and will play a very important role in the modern (much more pragmatic) semantic web.
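As a rough sketch of that two-step lookup, the helpers below build the Knowledge Graph Search API request, pull the entity ID out of a parsed response, and produce a Wikidata SPARQL query for that ID. Several things here are assumptions worth verifying before use: the response shape (`itemListElement` / `result` / `@id` with a `kg:` prefix), the Wikidata property numbers (P646 for Freebase-style `/m/` IDs and P2671 for `/g/` Knowledge Graph IDs), and all of the function names, which are hypothetical. No network calls are made; sending the requests is left to whatever HTTP client you prefer.

```python
from urllib.parse import urlencode

KG_SEARCH_URL = "https://kgsearch.googleapis.com/v1/entities:search"
WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"


def kg_search_url(query, api_key, limit=1):
    """Build the Knowledge Graph Search API request URL for a text query."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_SEARCH_URL + "?" + urlencode(params)


def entity_id(kg_response):
    """Return the first entity ID from a parsed JSON response,
    without the 'kg:' prefix, or None when there are no results."""
    elements = kg_response.get("itemListElement", [])
    if not elements:
        return None
    raw = elements[0]["result"]["@id"]
    return raw[3:] if raw.startswith("kg:") else raw


def wikidata_query(mid):
    """Build a SPARQL query that finds the Wikidata item carrying this ID.
    Assumption: /m/ IDs are stored under P646, /g/ IDs under P2671."""
    prop = "P646" if mid.startswith("/m/") else "P2671"
    return (
        "SELECT ?item ?itemLabel WHERE { "
        f'?item wdt:{prop} "{mid}" . '
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } '
        "}"
    )
```

The resulting query string would be sent to `WIKIDATA_SPARQL_URL` as the `query` parameter; from the item it returns, the richer statements and relationships can then be fetched from Wikidata.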

Fonts, Font-sizes and the MacOS / Terry Reese

So, one of the requests I’ve occasionally been getting from Mac users is for the ability to change the font and font size of the program’s interface.  If you’ve used the Windows version of MarcEdit, this has been available for some time, but I’ve not put it into the Mac version, in part because I didn’t know how.  The Mac UI is definitely different from what I’m used to, and the way that AppKit exposes controls and the way controls are structured as a collection of Views and Subviews complicates some of the sizing and layout options.  But I’ve been wanting to provide something because on really high resolution screens, the application was definitely getting hard to read.

Anyway, I’m not sure if this is the best way to do it, but this is what I’ve come up with.  Essentially, it’s a function that can determine if an element has text or an image, and then perform the font scaling, control resizing and ultimately, window sizing to take advantage of Apple’s Auto Layout features.  Code is below.


public void SizeLabels(NSWindow objW, NSControl custom_control = null)
{
	// Sample text used to measure how much the new font changes the rendered size.
	string test_string = "THIS IS MY TEST STRING";

	string font_name = "";
	string font_size = "";
	NSStringAttributes myattribute = new NSStringAttributes();

	cxmlini.GetSettings(XMLPath(), "settings", "mac_font_name", "", ref font_name);
	cxmlini.GetSettings(XMLPath(), "settings", "mac_font_size", "", ref font_size);

	// No font preference saved: leave the interface as it is.
	if (string.IsNullOrEmpty(font_name) && string.IsNullOrEmpty(font_size))
	{
		return;
	}

	NSFont myfont = null;
	if (String.IsNullOrEmpty(font_name))
	{
		// Size only: use the default user font at the requested size.
		myfont = NSFont.UserFontOfSize((nfloat)System.Convert.ToInt32(font_size));
	}
	else if (String.IsNullOrEmpty(font_size))
	{
		// Name only: fall back to 13 points.
		font_size = "13";
		myfont = NSFont.FromFontName(font_name, (nfloat)System.Convert.ToInt32(font_size));
	}
	else
	{
		myfont = NSFont.FromFontName(font_name, (nfloat)System.Convert.ToInt32(font_size));
	}

	if (custom_control == null)
	{
		// Top-level call: measure the sample text before and after applying
		// the new font, then resize the window by the same proportion.
		CoreGraphics.CGSize original_size = NSStringDrawing.StringSize(test_string, myattribute);

		myattribute.Font = myfont;
		CoreGraphics.CGSize new_size = NSStringDrawing.StringSize(test_string, myattribute);

		// ResizeWindow is a helper defined elsewhere in the application.
		CoreGraphics.CGRect frame = objW.Frame;
		frame.Size = ResizeWindow(original_size, new_size, frame.Size);
		objW.MinSize = frame.Size;
		objW.SetFrame(frame, true);

		//MessageBox(objW, objW.Frame.Size.Width.ToString() + ":" + objW.Frame.Size.Height.ToString());

		foreach (NSView v in objW.ContentView.Subviews)
		{
			if (v.IsKindOfClass(new ObjCRuntime.Class("NSControl")))
			{
				NSControl mycontrol = ((NSControl)v);
				switch (mycontrol.GetType().ToString())
				{
					case "AppKit.NSTextField":
					case "AppKit.NSButtonCell":
					case "AppKit.NSBox":
					case "AppKit.NSButton":
						// Image buttons keep their original font.
						if (mycontrol.GetType().ToString() == "AppKit.NSButton")
						{
							if (((NSButton)mycontrol).Image != null)
							{
								break;
							}
						}

						mycontrol.Font = myfont;
						//if (!string.IsNullOrEmpty(mycontrol.StringValue))
						//	mycontrol.SizeToFit();
						break;
					default:
						break;
				}

				// Recurse into controls that contain other views.
				if (mycontrol.Subviews.Length > 0)
				{
					SizeLabels(objW, mycontrol);
				}
			}
			else if (v.IsKindOfClass(new ObjCRuntime.Class("NSTabView")))
			{
				// Tab views hold their controls inside each tab item's view.
				NSTabView mytabview = ((NSTabView)v);
				foreach (NSTabViewItem ti in mytabview.Items)
				{
					foreach (NSView tv in ti.View.Subviews)
					{
						if (tv.IsKindOfClass(new ObjCRuntime.Class("NSControl")))
						{
							SizeLabels(objW, (NSControl)tv);
						}
					}
				}
			}
		}
	}
	else
	{
		if (custom_control.Subviews.Length == 0)
		{
			// Leaf control: apply the font unless it is an image button.
			if (custom_control.GetType().ToString() != "AppKit.NSButton" ||
				(custom_control.GetType().ToString() == "AppKit.NSButton" &&
				 ((NSButton)custom_control).Image == null))
			{
				custom_control.Font = myfont;
			}
		}
		else
		{
			foreach (NSView v in custom_control.Subviews)
			{
				// Skip subviews that are not controls rather than risk an invalid cast.
				NSControl mycontrol = v as NSControl;
				if (mycontrol == null)
				{
					continue;
				}

				switch (mycontrol.GetType().ToString())
				{
					case "AppKit.NSTextField":
					case "AppKit.NSButtonCell":
					case "AppKit.NSBox":
					case "AppKit.NSButton":
						if (mycontrol.GetType().ToString() == "AppKit.NSButton")
						{
							if (((NSButton)mycontrol).Image != null)
							{
								break;
							}
						}
						mycontrol.Font = myfont;
						//if (!string.IsNullOrEmpty(mycontrol.StringValue))
						//	mycontrol.SizeToFit();
						if (mycontrol.Subviews.Length > 0)
						{
							SizeLabels(objW, mycontrol);
						}
						break;
					default:
						break;
				}
			}
		}
	}
}


And that was it. I’m sure there might be better ways, but this is (crossing my fingers) working for me right now.

MarcEdit KBart Plugin / Terry Reese

Last year, I had the opportunity to present at NASIG, and one of the questions that came up was related to the KBart format and if MarcEdit could generate it.  I’ll be honest, I’d never heard of KBart and this was the first time it had come up.  Well, fast forward a few months and I’ve heard the name a few more times, and since I’ll be making my way to NASIG again later this year to speak, I figured this time I’d come bearing new gifts.  So, I spent about 20 minutes this evening wrapping up a kbart plugin.  The interface is very basic:

It has essentially been designed to allow a user to take a MARC or MarcEdit mnemonic file and output a kbart file in either tab or comma delimited format.
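To give a feel for the kind of output involved, here is a minimal sketch of writing KBART-style rows in either delimiter. It is not the plugin's code: the record structure (a dictionary keyed by MARC tag and subfield), the MARC-to-KBART mapping, and the function names are all hypothetical, and only a handful of the KBART columns are included.

```python
import csv
import io

# A small subset of KBART column names; the recommended practice defines more.
KBART_FIELDS = [
    "publication_title",
    "print_identifier",
    "online_identifier",
    "title_url",
    "publisher_name",
]

# Hypothetical, simplified mapping from (MARC tag, subfield) to KBART column.
MARC_TO_KBART = {
    ("245", "a"): "publication_title",
    ("022", "a"): "print_identifier",
    ("856", "u"): "title_url",
    ("260", "b"): "publisher_name",
}


def to_kbart_row(record):
    """Map one record (dict of (tag, subfield) -> list of values) to a row."""
    row = dict.fromkeys(KBART_FIELDS, "")
    for key, column in MARC_TO_KBART.items():
        values = record.get(key, [])
        if values:
            value = values[0].strip()
            if column == "publication_title":
                # Drop trailing ISBD punctuation such as " /" or " :".
                value = value.rstrip(" /:")
            row[column] = value
    return row


def write_kbart(records, delimiter="\t"):
    """Return the records as delimited text with a KBART header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=KBART_FIELDS, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(to_kbart_row(record))
    return out.getvalue()
```

Calling `write_kbart(records)` gives tab-delimited output, and `write_kbart(records, ",")` gives the comma-delimited variant. Treating every 022$a as a print identifier is a simplification made for the sketch.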

Now, a couple of caveats — I still don’t really have a great idea of why folks want to create kbart files — this isn’t my area.  But the documentation on the working group’s website was straightforward, so I believe that the files generated will be up to par.  Though, I’m hoping that prior to NASIG, a few of the folks that may actually find something like this interesting may be willing to give it a spin and provide a little feedback.

Again, this will be available after the next update is posted so that I can allow it to take advantage of some of the new plugin management features being added to the tool.


EC / Ed Summers

The readings this week focused on the Ethnography of Communication or EC. EC seeks to examine the particular cultural conditions of communication and also its general principles. EC takes as its material not just spoken or written language but also images, gestures, smell, digital content and basically any cultural material that anthropologists use. In his overview of the field, Carbaugh (2014) credits Gumperz & Hymes (1972) with establishing the foundation for EC. Gumperz and Hymes were both linguists and anthropologists and helped found the field of linguistic anthropology.

The focus of EC is on patterns and practices in speech communities, and it considers several key concepts:

  • communication act: a specific communicative action (uttering something)
  • communication event: a sequence of acts (a series of utterances)
  • communication situation: a place which may involve multiple acts/events
  • speech community: a social grouping of communicators
  • a way of speaking: a patterned behavior

Just as an aside, it is interesting to note that EC is heavily invested in the notions of community and communication, and that both words share the same root. Hymes came up with the useful mnemonic SPEAKING, which is a device or framework that characterizes the types of phenomena EC researchers examine: Settings, Participants, Ends, Acts, Key/pitch, Instruments for communication, Norms, and Genre.

As with other ethnographic work, EC is iterative and cyclical. Study of the literature and theory precedes actual field work, which is followed by analysis and reporting. But each of these stages can feed back into the others. For example, analyzing field notes may lead the researcher to return to the field to interview a new contact or observe another situation. Also, it is important in ethnographic work that meanings are discovered throughout the process, rather than decided beforehand.

Philipsen (1987) argued that analysis shouldn’t be centered so much on the individual, but should instead focus on the collective or communal aspects of meaning, alignment and form from three perspectives:

  • Culture as Code: order and organization
  • Culture as Conversation: dynamism/creativity
  • Culture as Community: the setting

Culture as Code was formalized by Philipsen, Coutu, Covarrubias, & Gudykunst (2005) as Speech Code Theory, which starts out by cataloging a set of speech resources (spoken words and written texts) which are examined in order to investigate “a system of socially constructed symbols and meanings, premises, and rules, pertaining to communicative conduct”. I actually think this approach could be useful for me to examine the interviews I have with web archivists in coordination with their collection development policies. Perhaps I could zoom in on a particular one to see how the language operates?

Carbaugh, Nuciforo, Molina-Markham, & Over (2011) highlight the importance of reflexivity as a method in EC. Specifically, the mechanism is discursive reflexivity: the ability to render and reason through discourse while reflecting on how we use discourse to study discourse. Reflexivity is a core concept in ethnographic work, where researchers themselves are a research tool. It takes on several aspects in EC work:

  • Theoretical Reflexivity: using a reflexive moment to understand communication
  • Descriptive Reflexivity: how is a reflexive moment being represented? What is selected and what is not? How are decisions accounted for? We need to be aware of our own goals and attitudes, and how they are interwoven into descriptive work.
  • Interpretive Reflexivity: what is the meaning/significance of the communication practices?
  • Comparative Reflexivity: provides different views on a communicative moment and multiple meanings. This idea of the multiple reminds me of Mol (2002) and Law (2004) a bit. Verbal vs Print knowledge can be useful to compare, especially when focusing on areas of misunderstanding or disagreement.
  • Critical Reflexivity: examining a communication practice from the perspective of a standardized ethic.

I actually think reflexivity is really important in my own work studying the construction of web archives, because I myself am a practitioner. My familiarity with the web archiving community as a community of practice definitely influenced the way I spoke to archivists–I was a participant in the process. Being able to think critically about my role in the interviews, and my selection of policies to examine, is an important part of the research.

Duff & Harris (2002) provided a great example of an empirical study that employed EC. I was particularly struck by her detailed description of the classroom, that included registries of the students and their backgrounds, a map of the classroom and the positions of the student groups, daily agendas, and fairly lengthy transcripts of conversation.

These materials were drawn together with a lucid description of the community setting, and the goals of the school to encourage cross-cultural understanding. I really felt like I came to understand the specific settings she chose. This allowed her to describe the successes and failures of validating and rebroadcasting comments from students during lessons about culture. Attention to word counting, turn taking, pauses and sequencing provided a view into the dynamics of the classroom. Ethnographic interviews with the students also confirmed some of her findings.

In general I finished the readings this week thinking that EC approaches could hold a lot of promise for my own research. I think this is partly because my interviews with web archivists were conducted using an ethnographic method. I wasn’t out to prove or disprove a particular hypothesis. I just wanted to try to understand how appraisal of web content took place in the field. I think it might be useful to select an interview or two from the 30 interviews I conducted to take a closer look from a discourse perspective. Perhaps there are things I missed during my coding of all the interviews.

Another angle that could be interesting would be to look at an interview and a collection development policy, to see how they compare. It might be possible to look at the collections of web content themselves, which was suggested by Prof. Fagan last week. My only worry with that is that there might be too much material at different levels. This was mentioned as one of the challenges of EC, because of the volume of material that is difficult to integrate because it is from different perspectives: etic (from outside) and emic (from within).

Certainly using a diversity of material, and embracing the messiness of method, is more appealing (and realistic) to me than focusing exclusively on the text of a particular transcript. I suspect that EC is going to be a useful method for me in discourse analysis. I guess I worry that the hyperfocus on a particular text transcript by itself will run the risk of me projecting my own ideas, rather than seeing what is actually going on. But perhaps I will be surprised when we come to the other discourse methods later in the semester.


Carbaugh, D. (2014). Cultures in conversation. Routledge.

Carbaugh, D., Nuciforo, E. V., Molina-Markham, E., & Over, B. van. (2011). Discursive reflexivity in the ethnography of communication: Cultural discourse analysis. Cultural Studies - Critical Methodologies, 11(2), 153–164.

Duff, W. M., & Harris, V. (2002). Stories and names: Archival description as narrating records and constructing meanings. Archival Science, 2(3-4), 263–285.

Gumperz, J. J., & Hymes, D. H. (1972). Directions in sociolinguistics: The ethnography of communication. New York: Holt, Rinehart and Winston.

Law, J. (2004). After method: Mess in social science research. Routledge.

Mol, A. (2002). The body multiple: Ontology in medical practice. Duke University Press.

Philipsen, G. (1987). Communication theory: Eastern and western perspectives. In D. L. Kincaid (Ed.), (pp. 245–254). Academic Press.

Philipsen, G., Coutu, L. M., Covarrubias, P., & Gudykunst, W. (2005). Theorizing about intercultural communication. In Theorizing about intercultural communication (pp. 55–68). Sage Thousand Oaks, CA.

App usage outpaces the mobile web 7:1 for two years straight, but you’re still not Facebook. / LibUX

I hope to eventually distill comScore’s 2016 U.S. Mobile App Report in a future writeup once I’ve had time to mull it over – there is a lot there. In the meantime, I want to drum out a few initial impressions that mostly color how I approach this data. Always keep in mind the 800-pound gorilla in the room.

It’s blue.

“Mobile app continues to outpace mobile web by a 7:1 margin in time spent, a ratio that has held constant for the past two years.”


Should we start differentiating reports about “mobile app usage” from Facebook?

I guarantee that with regard to time spent in apps we are largely talking about the dominant Facebook ecosystem — Facebook, Facebook Messenger, Instagram, WhatsApp — and even more broadly FANG (Facebook, Amazon, Netflix, Google). There really is only one takeaway: your app is not one of these apps.

When ScientiaMobile or comScore report app engagement, these often confirm hunches that apps are hot shit. Each year, my leads spike with interest in their own native app, not considering that for every minute users spend in Facebook — a number that’s increasing — they’re not spending it with you.

Extrapolate that across Netflix, YouTube, Amazon Prime, Amazon shopping, any of the Google apps, Instagram, not to mention Snapchat. Even if your app drags itself across that store-to-home-screen barrier, it will get relegated to the second or third swipe.

A bar chart

“And mobile audience growth is being driven more by mobile web properties, which are actually bigger and growing faster than apps.”

Where audience growth is concerned there isn’t even a competition. I have a couple of thoughts:

  • We can account for a lot of web traffic being driven from social apps like Twitter and Facebook. You never really leave the app. It’s part of the flow. My gut feeling here is that, presuming you’re not one of the top 1000 apps (it’s a good bet), you’re better off growing your web audience by funneling users through these social apps in the way pilot fish attach to sharks. Reach-people-through-Facebook is pretty standard advice, but the important element is that you’re probably better off designing a fully functional web app rather than using Facebook to market a purely native option. Native apps that are outside the FANG user flow (e.g. users have to leave Facebook to use one) are, at best, a gamble.
  • I wonder whether in terms of interaction cost, users find it less “expensive” to occasionally pop open a browser and tap-out a URL than actually allocating drive space and real-estate to an app. It seems counter-intuitive, but it’d be interesting to explore. I resist home-screen clutter and prefer the hassle but I always figured I was an edge case.

All that said, this is an important caveat:

“Mobile web audiences continue to climb, but the new audiences being reached are lightly engaged and bring down the average time spent figures.” (2016 U.S. Mobile App Report, 16)

How do we measure the frequent light engagement over time?

I’m not sure how dire this really is. We stress the importance of time spent as a success metric because higher engagement increases the odds that users will convert. But does it matter whether time spent is measured in a single session or accumulated over many short ones? How lightly engaged are we with that BuzzFeed listicle? How many times have we been so lightly engaged?

Even so, BuzzFeed’s effective brand-making around light engagement has been good both for advertisers and for – ah – engagement. Gut-checking my own behavior, I am probably spending more time sharing the content or commenting than actually reading it. This is engagement, right?

It just so happens that engaging with BuzzFeed involves engaging with Facebook.


Personal Stories / Ed Summers

There’s an interesting segment in ProPublica’s recent interview with Elizaveta Osetinskaya about how RBC broke a story about the mysterious deaths of Russian soldiers in Ukraine.

I think it’s interesting because of the critical role that social media played in establishing personal connections with people who had the stories to tell. It would be interesting to know more about how journalists are using social media in this way, so if you have any pointers to resources I’d really like to see them.

The segment begins at 14:00 in the interview.

There was a big flow of content at that time reported by community newspapers, local newspapers, that there were cases of deaths of troops that were sent somewhere by the government, but no-one knew where they were sent. There were funerals, there were tragic situations with their families. We decided to talk to people and follow the path of those troops. Where they were told to go. Why they went there. To focus on personal stories.

Our reporter talked to many people in person and that’s how he collected personal stories of those troops. When we collected this information we asked officials in the ministry of defense and didn’t get a lot of information from them. So there was that collection from people to people. We also used Russian social network ВКонта́кте that is the biggest competitor to Facebook in Russia, where people published information that was in the open public profiles. So we took some pictures, and we talked to the people afterwards.

MarcEdit Workstreams: MacOS and Windows/Linux / Terry Reese

Over the past couple of months, I’ve had some interesting questions that have led me to go back and take another look at how a handful of things work within MarcEdit. To that end, I’m hoping to complete the following two workstreams this weekend.


Two of the features most often asked for at this point deal with accessibility options and plugin support. The creation of the AddPinyin plugin for Windows has got people asking if this will show up for Mac users as well. My guess is that it could, but in order for that to happen, I need to implement plugin support on the Mac. The challenge is figuring out how, since the process I used with Windows and Linux simply won’t work with the Mac UI thread model. So, I’ve been thinking on this, and this weekend, I’ll be including the first parts of code that should allow me to start making this happen. Ideally, I’ll start by migrating some of the current MarcEdit plugins, probably the Internet Archive 2 HathiTrust Packager first, and then go from there.

The other change that I’m working on that will show up in this update is the ability to control the application font and font sizes. As with the Windows version, I’ll eventually add language support, which will enable the use of language files to set the text in the application. But for now, I’ll be enabling the ability to modify the application font and change the size of the fonts within the application and editor.


The interest in the plugins has made me take another look at how they are managed. Currently it is clunky: users get no notification when plugins change, and updating them takes multiple steps. That will change. I’ve been restructuring how plugins are managed, so that they will now automatically notify users when they have changed, as well as offer the ability to download the update. Additionally, I’ve extended the plugin manager so that it can manage access to plugins outside of the MarcEdit website, so I’ll be including links to the AddPinyin plugin, and will be able to include this plugin in the automated management (i.e., update notification). Overall, I believe that this will make plugins easier to use, and much, much easier to manage.
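
To make the update-notification idea concrete, here is a minimal sketch of how a plugin manager might decide which plugins have a newer version available. This is not MarcEdit's actual code: the `Plugin` record, the `find_updates` helper, the dotted version numbers, and the example.org URLs are all assumptions made up for illustration, and it is written in Python only for brevity.

```python
from dataclasses import dataclass


@dataclass
class Plugin:
    # Hypothetical record; the fields are assumptions, not MarcEdit's schema.
    name: str
    version: str      # dotted numeric version string, e.g. "1.2.0"
    source_url: str   # where an update would be downloaded from


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.10.2" into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_updates(installed: list[Plugin], available: list[Plugin]) -> list[Plugin]:
    """Return the available plugins that are newer than the installed copy."""
    installed_by_name = {p.name: p for p in installed}
    updates = []
    for candidate in available:
        current = installed_by_name.get(candidate.name)
        # Plugins that are not installed are skipped; only newer versions count.
        if current and parse_version(candidate.version) > parse_version(current.version):
            updates.append(candidate)
    return updates


if __name__ == "__main__":
    # Made-up version numbers and URLs, purely for illustration.
    installed = [Plugin("AddPinyin", "1.0.0", "https://example.org/addpinyin")]
    available = [Plugin("AddPinyin", "1.1.0", "https://example.org/addpinyin")]
    for plugin in find_updates(installed, available):
        print(f"Update available for {plugin.name}: {plugin.version}")
```

Because each record carries its own `source_url`, the same check would work for plugins hosted outside the main website, which is the case described above for AddPinyin.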


LIL talks: Becky / Harvard Library Innovation Lab

Today Becky taught us about the lifetime of a star, and all of our minds were blown.