Planet Code4Lib

Personal Pods and Fatcat / David Rosenthal

Sir Tim Berners-Lee's Solid project envisages a decentralized Web in which people control their own data stored in personal "pods":
The basic idea of Solid is that each person would own a Web domain, the "host" part of a set of URLs that they control. These URLs would be served by a "pod", a Web server controlled by the user that implemented a whole set of Web API standards, including authentication and authorization. Browser-side apps would interact with these pods, allowing the user to:
  • Export a machine-readable profile describing the pod and its capabilities.
  • Write content for the pod.
  • Control others' access to the content of the pod.
Pods would have inboxes to receive notifications from other pods. So that, for example, if Alice writes a document and Bob writes a comment in his pod that links to it in Alice's pod, a notification appears in the inbox of Alice's pod announcing that event. Alice can then link from the document in her pod to Bob's comment in his pod. In this way, users are in control of their content which, if access is allowed, can be used by Web apps elsewhere.
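The Alice and Bob exchange can be sketched in a few lines of Python. This is a toy in-memory model, not Solid's actual API: the `Pod` class, its method names, and the `example` URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """A toy stand-in for a personal pod: documents plus an inbox."""
    owner: str
    documents: dict = field(default_factory=dict)  # path -> {"text", "links"}
    inbox: list = field(default_factory=list)      # notifications from other pods

    def url(self, path):
        # Hypothetical URL scheme: one domain per pod owner.
        return f"https://{self.owner}.example/{path}"

    def write(self, path, text, links=()):
        """Store a document in this pod and return its URL."""
        self.documents[path] = {"text": text, "links": list(links)}
        return self.url(path)

    def notify(self, target_url, source_url):
        """Record a notification that source_url links to target_url."""
        self.inbox.append({"target": target_url, "source": source_url})

    def link_back(self, path):
        """Link a local document to every source that announced a link to it."""
        target = self.url(path)
        links = self.documents[path]["links"]
        for note in self.inbox:
            if note["target"] == target and note["source"] not in links:
                links.append(note["source"])


alice, bob = Pod("alice"), Pod("bob")
doc_url = alice.write("paper", "Alice's document")
comment_url = bob.write("comment", "Bob's comment", links=[doc_url])
alice.notify(doc_url, comment_url)  # Bob's pod announces the link to Alice's inbox
alice.link_back("paper")            # Alice chooses to link back to the comment
```

The point of the sketch is that the link back happens only when Alice's pod acts on the notification, so she stays in control of her own content.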
In his Paul Evan Peters Award Lecture, my friend Herbert Van de Sompel applied this concept to scholarly communication, envisaging a world in which access, for both humans and programs, to all the artifacts of research would be greatly enhanced.
In Herbert's vision, institutions would host their researchers' "research pods", which would be part of their personal domain but would have extensions specific to scholarly communication, such as automatic archiving upon publication.
Follow me below the fold for an update to my take on the practical possibilities of Herbert's vision.

This improved access would be enabled by metadata, generated both by the decentralized Web infrastructure and by the researchers, connecting the multifarious types of digital objects representing the progress of their research.

The key access improvements in Herbert's vision are twofold:
  • Individuals, not platforms such as Google or Elsevier, control access to their digital objects.
  • Digital objects in pods are described by, and linked by, standardized machine-actionable metadata.
Their importance is in allowing much improved access to the digital objects by machines, not so much by humans. Text mining from published papers has already had significant results, so much so that publishers are selling the service on their platforms. This balkanization isn't helpful. Herbert's vision is of a world in which all digital research objects are uniformly accessible via a consistent, Web-based API.

Herbert was skeptical that transitioning scholarly communication in this way was achievable. I agreed with him at length in both Herbert Van de Sompel's Paul Evan Peters Award Lecture and It Isn't About The Technology, but didn't address the obvious question:
How much of the improved access in Herbert's vision could be implemented in the Web we have right now, rather than waiting for the pie-in-the-sky-by-and-by decentralized Web?
Clearly, the academic publishing oligopoly and the copyright maximalists aren't going to allow us to implement the first part. Even were Open Access to become the norm, their track record shows it will be Open Access to digital objects they host (and in many cases under a restrictive license).
Elsevier's Research Infrastructure
The Web we have lacks the mechanisms for automatically generating the necessary metadata. Experience shows that the researchers we have are unable to generate the necessary metadata. How could implementing the second part of Herbert's vision be possible?

Thanks to generous funding from the Andrew W. Mellon Foundation (I helped write the grant proposal) a team at the Internet Archive is working on a two-pronged approach. Prong 1 starts from Web objects known to be scholarly outputs because, for example, they have been assigned a DOI and:
  • Ensures that, modulo paywall barriers, they and the objects to which they link are properly archived by the Wayback Machine.
  • Extracts and, as far as possible, verifies the bibliographic metadata for the archived objects.
  • Implements access to the archived objects in the Wayback Machine via bibliographic rather than URL search.
Prong 2 takes the opposite approach, using machine learning techniques to identify objects in the Wayback Machine that appear to be scholarly outputs and:
  • Extracts and, as far as possible, verifies the bibliographic metadata for the archived objects.
  • Implements access to the archived objects in the Wayback Machine via bibliographic rather than URL search.
The goal of this work is to improve archiving of the "long tail" of scholarly communication by applying "big data" automation to ensure that objects are discovered, archived, and accessible via bibliographic metadata. Current approaches (LOCKSS, Portico, national library copyright deposit programs) involve working with publishers, which works well for large commercial publishers but is too resource intensive to cover more than a small fraction of the long tail. Thus current efforts to archive scholarly outputs are too focused on journal articles, and too focused on expensive journals, and thus too focused on content that is at low risk of loss.

Fatcat entry for Joi Ito blog post
The team at the Internet Archive have an initial version of the first prong, Fatcat, up and running. The home page includes links to examples of preliminary Fatcat content for various types of research objects, such as a well-known blog post by Joi Ito. The Fatcat "About" page starts:
Fatcat is a versioned, publicly-editable catalog of research publications: journal articles, conference proceedings, pre-prints, blog posts, and so forth. The goal is to improve the state of preservation and access to these works by providing a manifest of full-text content versions and locations.

This service does not directly contain full-text content itself, but provides basic access for human and machine readers through links to copies in web archives, repositories, and the public web.

Significantly more context and background information can be found in The Guide.
Now, suppose Fatcat succeeds in its goals. It would provide a metadata infrastructure that could be enhanced to provide many of the capabilities Herbert envisaged, albeit in a centralized rather than a decentralized manner. The pod example above could be rewritten for the enhanced Fatcat environment thus:
If Alice posts a document to the Web that Fatcat recognizes in the Wayback Machine's crawls as a research output, Fatcat will index it, ensure it and the things it links to are archived, and create a page for it. Suppose Bob, a researcher with a blog which Fatcat indexes via Bob's ORCID entry, writes a comment in one of his blog's posts that links to Alice's document. Fatcat's crawls will notice the comment and:
  • Update the page for Bob's blog post to include a link to Alice's document.
  • Update the page for Alice's document to include a link to Bob's comment.
Because Fatcat exports its data via an API as JSON, the information about each document, including its links to other documents, is available in machine-actionable form to third-party services. They can create their own UIs, and aggregate the data in useful ways.
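A minimal sketch of this centralized variant follows. It is not Fatcat's actual schema or API: the `Catalog` class, the field names `links_to` and `linked_from`, and the `example` URLs are hypothetical, chosen only to show links being maintained in both directions and exported as JSON.

```python
import json


class Catalog:
    """Toy centralized catalog: one entry per document, links kept in both directions."""

    def __init__(self):
        self.entries = {}

    def _entry(self, url):
        # Create a placeholder entry the first time a URL is seen.
        return self.entries.setdefault(
            url, {"title": None, "links_to": [], "linked_from": []}
        )

    def index(self, url, title, links_to=()):
        """Record a crawled document and update the entries at both ends of each link."""
        entry = self._entry(url)
        entry["title"] = title
        for target in links_to:
            if target not in entry["links_to"]:
                entry["links_to"].append(target)
            backlinks = self._entry(target)["linked_from"]
            if url not in backlinks:
                backlinks.append(url)

    def export(self, url):
        """Return one entry as a JSON string for third-party services."""
        return json.dumps({"url": url, **self.entries[url]}, sort_keys=True)


catalog = Catalog()
catalog.index("https://alice.example/paper", "Alice's document")
catalog.index(
    "https://bob.example/blog/comment",
    "Bob's comment",
    links_to=["https://alice.example/paper"],
)
record = json.loads(catalog.export("https://alice.example/paper"))
```

Unlike the pod version, neither Alice nor Bob does anything here: the catalog's crawl discovers the link and updates both pages.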
As a manually-created demonstration of what this enhanced Fatcat would look like, take this important paper in Science's 27th January 2017 issue, Gender stereotypes about intellectual ability emerge early and influence children’s interests by Lin Bian, Sarah-Jane Leslie and Andrei Cimpian. The authors' affiliations are the University of Illinois, Champaign, New York University, and Princeton University. Here are the things I could find in about 90 minutes that the enhanced Fatcat would link to and from:
[I'm sorry I don't have time to encode all this as JSON as specified in The Guide.]

Linking together the various digital objects representing the outputs of a single research effort is at the heart of Herbert's vision. It is true that the enhanced Fatcat would be centralized, and thus potentially a single point of failure. And that it would be less timely, less efficient, and would lack granular access control (it can only deal with open access objects). But it's also true that the enhanced Fatcat avoids many of the difficulties of the decentralized version that I raised. They are caused by the presence of multiple copies of objects, for example in the personal pods of each member of a multitudinous research team, or at their various institutions.

Given that both Herbert and I express considerable skepticism as to the feasibility of implementing his vision even were a significant part of the Web to become decentralized, exploring ways to deliver at least some of its capabilities on a centralized infrastructure seems like a worthwhile endeavor.

Update: Herbert points out that related work is also being funded by the Mellon Foundation in a collaborative project between Los Alamos and Old Dominion called
The modules in the pipeline are as follows:
  • Discovery of new artifacts deposited by a researcher in a portal is achieved by a Tracker that recurrently polls the portal's API using the identity of the researcher in each portal as an access key. If a new artifact is discovered, its URI is passed on to the capture process.
  • Capturing an artifact is achieved by using web archiving techniques that pay special attention to generating representative high fidelity captures. A major project finding in this realm is the use of Traces that abstractly describe how a web crawler should capture a certain class of web resources. A Trace is recorded by a curator through interaction with a web resource that is an instance of that class. The result of capturing a new artifact is a WARC file in an institutional archive. The file encompasses all web resources that are an essential part of the artifact, according to the curator who recorded the Trace that was used to guide the capture process.
  • Archiving is achieved by ingesting WARC files from various institutions into a cross-institutional web archive that supports the Memento "Time Travel for the Web" protocol. As such, the Mementos in this web archive integrate seamlessly with those in other web archives.
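The three modules above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the project's code: `list_artifacts`, `capture`, and `archive` are hypothetical callables standing in for a portal's API, the Trace-guided capture process, and ingestion into the cross-institutional archive.

```python
class Tracker:
    """Polls a portal for one researcher's artifacts and reports unseen URIs."""

    def __init__(self, list_artifacts, researcher_id):
        self.list_artifacts = list_artifacts  # hypothetical portal API call
        self.researcher_id = researcher_id    # the researcher's identity in that portal
        self.seen = set()

    def poll(self):
        """Return URIs that have appeared since the last poll, in portal order."""
        new = []
        for uri in self.list_artifacts(self.researcher_id):
            if uri not in self.seen:
                self.seen.add(uri)
                new.append(uri)
        return new


def run_once(tracker, capture, archive):
    """One pass of the pipeline: discover new artifacts, capture each, archive the result."""
    for uri in tracker.poll():
        archive(capture(uri))


# A fake portal and fake capture/archive steps, for demonstration only.
portal = {"alice-id": ["https://portal.example/a/1"]}
archived = []


def fake_capture(uri):
    # Stands in for producing a WARC file for the artifact.
    return {"uri": uri, "format": "WARC"}


tracker = Tracker(lambda rid: list(portal[rid]), "alice-id")
run_once(tracker, fake_capture, archived.append)
portal["alice-id"].append("https://portal.example/a/2")  # a new deposit appears
run_once(tracker, fake_capture, archived.append)
```

Because the tracker remembers which URIs it has already passed on, repeated polling captures each artifact once.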
Major differences between the two include:
  • Targeted at specific platforms vs. generic Web.
  • Researcher-centric vs. object-centric.
  • Content-focused vs. metadata-focused.
  • Curator-driven vs. automated collection.

Ubiquity Press is a new Samvera Partner / Samvera

We are delighted to announce that Ubiquity Press has become a formal Samvera Partner.  Ubiquity Press is an open access publisher and they are working with the British Library, the national library of the United Kingdom, to develop shared open-source repository services using Hyku, our multi-tenant software solution.  The pilot repository will initially include research outputs from the British Library, the British Museum, the Tate galleries, National Museums Scotland, and the Museum of London Archaeology.  In the US, Ubiquity Press is also working with Gonzaga University, Penn State University, the University of Pennsylvania and Western University.

We greatly look forward to working more closely with them!

The post Ubiquity Press is a new Samvera Partner appeared first on Samvera.

Solr Indexing in Kithe / Jonathan Rochkind

So you may recall the kithe toolkit we are building in concert with our new digital collections app, which I introduced here.

I have completed some Solr Indexing support in kithe. It’s just about indexing, getting your data into Solr. It doesn’t assume Blacklight, but should work fine with Blacklight; there isn’t currently any support in kithe for what you do to provide UX for your Solr index.  You can look at the kithe guide documentation for the indexing features for a walk-through.

The kithe indexing support is based on ActiveRecord callbacks, in particular the after_commit callback. While callbacks get a bad rap, I think they are appropriate here, and note that both the popular sunspot gem (Solr/Rails integration, currently looking for new maintainers) and the popular searchkick gem (ElasticSearch/Rails integration) base their indexing synchronization on AR callbacks too. (There are various ways in kithe’s feature to turn off automatic callbacks temporarily or permanently in your code, just as there are in those other two gems.) I spent some time looking at the APIs, features, and implementation of the indexing-related functionality in sunspot and searchkick, as well as other “prior art”, before/while developing kithe’s support.

The kithe indexing support is also based on traject for defining your mappings.

I am very happy with how it turned out, I think the implementation and public API both ended up pretty decent. (I am often reminded of the quote of uncertain attribution “I didn’t have time to write a short letter, so I wrote a long one instead” — it can take a lot of work to make nice concise code).

The kithe indexing support is independent of any other kithe features and doesn’t depend on them. I think it might be worth looking at for anyone writing an app whose persistence is based on ActiveRecord. (If something is ActiveModel-like but not ActiveRecord, it probably doesn’t have after_commit callbacks, but if it has after_save callbacks, we could make the kithe feature optionally use those instead; sunspot and searchkick can both do that).

Again, here’s the kithe documentation giving a tour of the indexing features. 

Note on traject

The part of the architecture I’m least happy with is traject, actually.

Traject was written for a different use case — command-line executed high-volume bulk/batch indexing from file serializations. And it was built for that basic domain and context at the time, with some YAGNI thoughts.

So why try to use it for a different case of event-based few-or-one object sync’ing, integrated into an app?  Well, hopefully it was not just because I already had traject and was the maintainer (‘when all you have is a hammer’), although that’s a risk. Partially because traject’s mapping DSL/API has proven to work well for many existing users. And it did at least lead me to a nice architecture where the indexing code is separate and fairly decoupled from the ActiveRecord model.

And the Traject SolrJsonWriter already had nice batching functionality (and thread-safety, although I didn’t end up using it in the current kithe architecture), which made it convenient to implement batching features in a de-coupled way (just send to a writer that’s batching, the other code doesn’t need to know about it, except for maybe flushing at the end).

And, well, maybe I just wanted to try it. And I think it worked out pretty well, although there are some oddities in there due to traject’s current basic architectural decisions. (Like, instantiating a Traject “Indexer” can be slow, so we use a global singleton in the kithe architecture, which is weird.)  I have some ideas for possible refactors of traject (some backwards compat some not) that would make it seem more polished for this kind of use case, but in the meantime, it really does work out fine.

Note on times to index, legacy sufia app vs our kithe-based app

Our collection, currently in a sufia app, is relatively small. We have about 7,000 Works (some of which are “child works”), 23,000 “FileSets” (which in kithe we call “Assets”), and 50 Collections.

In our existing Sufia-based app, it takes about 6 hours to reindex to Solr on an empty index.

  • Except actually, on an empty index it might take two re-index operations, because of the way sufia indexing is reliant on getting things out of the index to figure out the proper way to index a thing at hand. (We spent a lot of work trying to reorganize the indexing to not require an index to index, but I’m not sure if we succeeded, and may ironically have made performance issues with fedora worse with the new patterns?) So maybe 12 hours.
  • Except that 6 hours is just a guess from memory. I tried to do a bulk reindex-everything in our sufia app to reconfirm it — but we can’t actually currently do a bulk reindex at all, because it triggers an HTTP timeout from Fedora taking too long to respond to some API request.
    • If we upgraded to ActiveFedora 12, we could increase the time that ActiveFedora is willing to wait for a fedora response. If we upgraded to ActiveFedora 12.1, it would include this PR, which I believe is intended to eliminate those super long fedora responses. I don’t think it would significantly change our end-to-end indexing time, since the bulk of it is not in those initial very long fedora API calls. But I could be wrong. And I’m not sure how realistic it is to upgrade our sufia app to AF 12 anyway.
    • To be fair, if we already had an existing index, but needed to reindex our actual works/collections/filesets because of a Solr config change, we had another routine which could do so in only ~25 minutes.

In our new app, we can run our complete reindexing routine in currently… 30 seconds. (That’s about 300 records/second throughput — only indexing Works and Collections. In past versions as I was building out the indexing I was getting up to 1000 records/second, but I haven’t taken time to investigate what changed, cause 30s is still just fine).

In our sufia app we are backing up our on-disk Solr indexes, because we didn’t want to risk the downtime it would take to rebuild (possibly including fighting with the code to get it to reindex).  In addition to just being more bytes to sling, this leads to ongoing developer time on such things as “did we back up the solr data files in a consistent state? Sync’d with our postgres backup?”, and “turns out we just noticed an error in the backup routine means the backup actually wasn’t happening.” (As anyone who deals with backups of any sort knows can be A Thing).

In the new system, we can just… not do that.  We know we can easily and quickly regenerate the Solr index whenever, from the data in postgres. (And if we upgrade to a new Solr version that requires an index rebuild, no need to figure out how to do so without downtime in a complicated way).

Why is the new system so much faster? I’ve identified three areas I believe are likely, but haven’t actually tried to do much profiling to determine which of these (if any?) are the predominant factors, so couldn’t say.

  1. Getting things out of fedora (at least under sufia’s usage patterns) is slow. Getting things out of postgres is fast.
  2. We are now only indexing what we need to support search.
    • The only things that show up in our search results are Works and Collections, so that’s all we’re indexing. (Sufia indexes not only FileSets too, but some ancillary objects such as one or two kinds of permission objects, and possibly a variety of other things I’m not familiar with. Sufia is trying to put pretty much everything that’s in fedora in Solr. For Reasons, mainly that it’s hard to query your things in Fedora with Fedora).
    • And we are only indexing the fields we actually need in Solr for those objects. Sufia tries to index a more or less round-trippable representation to Solr, with every property in its own stored solr field, etc. We aren’t doing that anymore. We could put all text in one “text” field, if we didn’t want to boost some higher than others. So we only index to as many fields as need different boosts, plus fields for facets, etc. Only what we need to support the Solr functionality we want.
      • If you want to render your results from only Solr stored fields (as sufia/hyrax do, and blacklight kind of wants you to) you’d also need those stored fields, sufficiently independently addressable to render what you want (or perhaps just in one big serialized JSON?). We are hoping to not use solr stored fields for rendering at all, but even if we end up with Solr stored fields for rendering, it will be just what we need for rendering. (For instance, some people using Blacklight are using solr stored fields for the “index”/search results/hits page, but not for the individual record ‘show’ page).
  3. The indexing routines in the new app send updates to Solr in an efficient way, both batching record updates into fewer Solr HTTP update requests, and not sending synchronous Solr “hard commits” at all. (The bulk reindex, like the after_commit indexing, currently sends a softCommit per update request, although this could be configured differently).
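The batching idea in item 3 can be illustrated with a short sketch. It is written in Python rather than Ruby and is not kithe's or traject's implementation: `BatchingWriter` and `post_to_solr` are hypothetical names, and the only Solr behavior assumed is that the JSON update handler accepts an array of documents, with `softCommit=true` as a URL parameter.

```python
import json
import urllib.request


def post_to_solr(update_url, docs):
    """POST one batch of documents to Solr's JSON update handler.

    update_url is assumed to look like
    http://localhost:8983/solr/<core>/update?softCommit=true
    """
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


class BatchingWriter:
    """Buffers documents and hands them to `send` in batches of `batch_size`."""

    def __init__(self, send, batch_size=100):
        self.send = send
        self.batch_size = batch_size
        self.buffer = []

    def put(self, doc):
        """Queue one document, flushing when the buffer is full."""
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch; no-op if empty."""
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.send(batch)


# Demonstration with a fake sender, so no Solr server is needed.
sent_batches = []
writer = BatchingWriter(sent_batches.append, batch_size=3)
for i in range(7):
    writer.put({"id": str(i)})
writer.flush()  # the caller flushes the final partial batch
```

Here seven documents go out as three requests instead of seven. Against a real server, `send` could be something like `lambda docs: post_to_solr(update_url, docs)`; the calling code does not need to know batching is happening, apart from the final flush.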


Check out the kithe guide on indexing support! Maybe you want to use kithe, maybe you’re writing an ActiveRecord-based app and want to consider kithe’s solr indexing support in isolation, or maybe you just want to look at it for API and implementation ideas in your own thing(s).

SAVE THE DATE: 2019 DSpace North American Users Group Meeting / DuraSpace News

Please join us September 23 and 24, 2019 at the University of Minnesota in Minneapolis for the 2019 DSpace North American Users Group Meeting.

This meeting will provide opportunities to discuss ideas, best practices, use cases, and the future development of DSpace 7 with members of the DSpace community including repository developers, technology directors, and institutional repository managers. We anticipate a variety of discussions, presentations, lightning talks, and workshops as part of the program. We encourage members of the wider open repository community and those interested in learning more about the open source DSpace repository platform to participate.

The program committee will release a Call for Proposals in the next few weeks. More information about accommodations, registration, and schedule will be made available on the conference website.

The 2019 DSpace North American User Group Meeting is jointly sponsored by the University of Minnesota Libraries and the Texas Digital Library.

The post SAVE THE DATE: 2019 DSpace North American Users Group Meeting appeared first on

Knowledge Organization Systems / HangingTogether

Connection Graph from Social Networks and Archival Context

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Daniel Lovins of Yale and Stephen Hearn of the University of Minnesota. As controlled vocabularies and thesauri are converted into linked open data and shared publicly, they often separate from their traditional role of facilitating collection browsing and find a renewed purpose as Web-based knowledge organization systems (KOS). As Marcia Zeng points out in Knowledge Organization Systems (KOS) in the Semantic Web: a multi-dimensional review, “a KOS vocabulary is more than just the source of values to be used in metadata descriptions: by modeling the underlying semantic structures of domains, KOS act as semantic road maps and make possible a common orientation by indexers and future users, whether human or machine.”

Good examples of such repurposing are the Getty Vocabularies, which not only allow browsing of Getty’s representation of knowledge, but also help users generate their own SPARQL queries that can be embedded in external applications. Another example is Social Networks and Archival Context (SNAC), which enables browsing of entities and relationships independently of their collections of origin. In such cases, the discovery tool pivots to being person-centric (or family-centric, or topic-centric), rather than (only) collection-centric. BIBFRAME, RDA, and IFLA-LRM vocabularies may prove similarly valuable as knowledge organization systems on the Web.

We identified three kinds of Knowledge Organization System functionality:

  • Clustering, based on hierarchical relationships, which would support clusters that include both broad and narrower terms, rather than the typical current clusters based on a single term
  • Presenting, showing information about an entity such as in knowledge cards or panels
  • Navigating, allowing users to follow related terms to explore topical and other relationships
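As an illustration of the first kind of functionality, the sketch below clusters records on a term together with everything narrower than it. The hierarchy, subject terms, and record titles are invented for the example and are not drawn from any real vocabulary.

```python
def narrower_closure(term, narrower):
    """Return the term plus every term reachable through narrower-term links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(narrower.get(current, ()))
    return seen


def cluster(term, narrower, records):
    """Select titles whose subjects include the term or any narrower term."""
    terms = narrower_closure(term, narrower)
    return [title for title, subjects in records if terms & set(subjects)]


# Made-up hierarchy: broader term -> list of narrower terms.
narrower = {"Animals": ["Mammals"], "Mammals": ["Cats", "Dogs"]}
records = [
    ("A book on cats", ["Cats"]),
    ("A survey of mammals", ["Mammals"]),
    ("A gardening manual", ["Plants"]),
]
mammal_cluster = cluster("Mammals", narrower, records)
single_term_cluster = [t for t, subjects in records if "Mammals" in subjects]
```

A cluster based on the single term "Mammals" finds only the survey, while the hierarchical cluster also picks up the book indexed under the narrower term "Cats".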

Much of the promise of Knowledge Organization Systems depends on system functionality, which is usually outside the control of the library. We posited that its clustering functionality might be more useful than presentation and navigation.

We noted that Knowledge Organization Systems providing “semantic road maps” would require a major shift from local “collection-centric” systems to “knowledge organizations.” Not only is it unlikely we will ever have one “universal” knowledge organization system, we postulated that it may not even be desirable. Instead, we might spend our resources more effectively on incorporating or reconciling existing thesauri or ontologies. For example, Google provides access to far more data than individual library catalogs, but when you click on a link in a Google result you then can continue searching in the new environment using its conventions. Rather than one “global domain,” perhaps the library community could provide added value by adding bridges from the metadata in library domain databases to other domains. We cited Wikidata as an example of aggregating entities from different sources and linking to more details in various language Wikipedias.

Most institutions do not have discovery systems that present controlled vocabularies to users as knowledge organization systems. One of the barriers to doing so is the different ontologies and vocabularies represented in our metadata. Some have overlaps, and others have different hierarchies. Some use very broad terms, others are more granular. It is difficult for systems to establish relationships between vocabularies at the item level (by using semantics like owl:sameAs) much less at the vocabulary level. UCLA’s catalog provides a list of sources for different controlled vocabularies retrieved from a search that users can click on to continue to search within a specific vocabulary. Showing the provenance of a specific subject heading could inform users which vocabulary might be more relevant to them. The Program for Cooperative Cataloging’s Task Group on URIs in MARC submitted a proposal to the MARC Advisory Committee to encode the source vocabulary in main entry, uniform title, and added entry fields (MARC Proposal 2019-02), which was approved in January 2019. This change will better support the multiplicity of vocabularies in library metadata.

Use of controlled vocabularies from other countries is particularly challenging. Subject headings from the National Diet Library, China, and Korea all have non-Latin scripts. They may be useful for those who can read the scripts, but not for those who cannot. Wikidata addresses this by allowing users to set their default language so that they see information in their preferred language. However, there are cases where there are no satisfactory equivalences across languages; different concepts in other national library vocabularies cannot always be mapped unequivocally to English concepts. The multi-year MACS (Multilingual Access to Subjects) project has built relationships across three subject vocabularies: Library of Congress Subject Headings, the German GND integrated authority file, and the French RAMEAU (Répertoire d’autorité-matière encyclopédique et alphabétique unifié). It has been a labor-intensive process and is not known to be widely implemented. There are other factors to consider, such as the geographic region where a term is used, which may differ from other regions using the same language (e.g., American vs. British English; French-Canadian vs. French-French vs. Swiss-French). Although imperfect, with embedded biases, numeric classification systems might be an approach to overcome differences in language labels.

The discussion highlighted some of our common aspirations for future systems both for discovery and for metadata management. We realize that Web-savvy users are accustomed to using different search techniques in different environments, so bridging across domains may be more feasible than trying to attain one universal Knowledge Organization System.

The post Knowledge Organization Systems appeared first on Hanging Together.

Preparing Early Career Librarians for Leadership and Management: A Feminist Critique / In the Library, With the Lead Pipe

In Brief

This article explores the opportunities and challenges that early career librarians face when advancing their careers, desired qualities for leaders or managers of all career stages, and how early career librarians can develop those qualities. Our survey asked librarians at all career stages to share their sentiments, experiences, and perceptions of leadership and management. Through our feminist critique, we explore the relationships to power that support imbalances in the profession and discuss best practices such as mentoring, individualized support, and self-advocacy. These practices will be of use to early career librarians, as well as supervisors and mentors looking to support other librarians.

by Camille Thomas, Elia Trucks, and H.B. Kouns


As early career academic librarians, we have had many conversations about what leadership and management look like in our lives and found that our experiences were not well-represented in LIS literature. Much of the research on leadership and management focuses on current experiences of those who already moved into leadership roles after decades of experience, not the process for moving into such positions.

We are a research team comprised of early career librarians who are cisgender women, including a woman of color and a queer white woman. Holly (H.B. Kouns) and Camille (Thomas) have experience with leadership, management and mentoring. Elia (Trucks) wants to lead from within her position and ensure equity in development opportunities. Our research questions were: What are opportunities and challenges for early career librarians interested in management as libraries evolve? Have we seen any progress on the calls to action regarding diversity, training, mentoring, and opportunities?

Literature Review

We started this study by looking for existing research specifically focused on how early career librarians navigated their experiences with leadership and management. We interpreted early career to include librarians with fewer than 10 years of experience, including pre-MLIS and paraprofessional experience. Leadership “is concerned with direction setting, with novelty and is essentially linked to change, movement and persuasion” (Grint, Jones, & Holt, 2017).

Training and Demographics

The initial source of formal training for most library managers comes from management classes in MLIS programs (Rooney, 2010). Outside of the MLIS, leadership and management institutes, trainings, and workshops are meant to help librarians develop leadership skills. The American Library Association (ALA) alone highlights over 20 different programs which offer self-assessment of participants’ skills and expose participants to leadership theories (Herold, 2014; “Library Leadership Training Resources,” 2008). Hines (2019) examined 17 library leadership institutes that reinforce the existing power structures of traditional leadership training and did not incorporate the values championed by ALA such as access, democracy, and social responsibility. Moreover, the requirements to attend retain exclusive barriers for marginalized professionals who may not be gainfully employed, able to take time off, or considered worthy of support, leaving middle managers of color and interim directors of color with even fewer opportunities to gain relevant experience before moving into senior leadership roles (Irwin & deVries, 2019; Bugg, 2016).

As the profession ages, librarians move up through the ranks of leadership and management by filling positions vacated by retirees. However, the delayed retirement of late career librarians especially affects women and people of color (POC) in the profession. Representation in academic librarianship has become more equitable for white women, from the male-dominated leadership landscape of the 1970s to greater advances between the 1980s and the 2000s (DeLong, 2013). Despite these advances in the number of women in leadership positions, there are still many areas where women have more experience yet are paid less (Morris, 2019). However, administrative job prospects look even bleaker for librarians of color than for white women since it is typical for librarians to learn leadership and management practices on the job only after moving into the role (Ly, 2015; Rooney, 2010). While there are efforts in the profession (particularly within the Association of Research Libraries (ARL)) to recruit diverse students and employees into librarianship, there is not as much emphasis on retention and advancement. Some librarians of color also see the amount and type of experience required for entry level positions as a barrier, as it reinforces homogeneity in libraries (Chou & Pho, 2017). This is evident in the demographics of university libraries, which are 85% white and 15% minority. According to the annual salary statistics published by ARL (Morris, 2019), the overall makeup of people working in ARL libraries is 63% female and 36% male, and the statistics for leadership in those libraries (directors, associate directors, heads of branches, etc.) are relatively proportionate. Out of those, 109 are directors of ARL libraries, of whom 10 identify as people of color (Morris, 2019).

Qualities & Skills

Historically, masculine-coded, agency-based leadership qualities such as “assertion, self-confidence and ambition” have been associated with successful leaders in North America (Richmond, 2017). However, as more women and millennials or Gen Xers are in positions of power (Phillips, 2014), valued leadership qualities have begun to include communal, feminine-coded traits like “empathy, interpersonal relationships, openness, and cooperation” (Martin, 2018). Martin continues, “[Baby] Boomers, Gen Xers, and Millennials all [want] a leader who [is] competent, forward-looking, inspiring, caring, loyal, determined, and honest” (Martin, 2018). Feminist scholars address the implicit associations of certain skills such as shared power, recognizing privilege, building partnerships, and self-advocacy, combining attributes of communal and agency-based skills (Higgins, 2017; Askey & Askey, 2017; Fleming & McBride, 2017) with specific identity performance, rather than effectiveness or value (Richmond, 2017).

Hierarchical power structures have traditionally consolidated both leadership and management roles and thus, librarians needed to gain recognition through years of experience with both to advance. Leaders focus on high-level initiatives and managers focus on granular initiatives, therefore the skills that are needed to be effective in each role are different. The most valued leadership qualities include creativity, vision, and commitment, while the most valued management qualities include dedication, communication, and caring for colleagues and subordinates (Phillips, 2014; Young, Powell, & Hernon, 2003; Aslam, 2018; “Leadership and Management Competencies,” 2016; Martin, 2018; Stewart, 2017). As a result, many mid-career and late-career librarians “drift” into leadership positions because they are believed to have gained the necessary skills through years of experience with a variety of projects, personnel, and institutional developments (Ly, 2015; Bugg, 2016). Those who experience “leadership by drift” are appointed, usually without much self-reflection or choice (Ly, 2015). For example, library leaders are most often appointed as an interim leader, and 80% of those interim leaders are then hired without any outside recruitment (Irwin & deVries, 2019). This led us to believe that “drifting” was the primary path librarians could take into leadership positions.

In contrast, there are few accounts of librarians who actively sought leadership and management positions. Pearl Ly’s (2015) trajectory from interim to permanent dean included earning a PhD, participating in leadership training programs, and peer-mentoring from other administrators. Ly always intended to move into a leadership or management position, an intention of ambition which stands out from many other librarians, and serves as evidence of how ambition can accelerate the process of acquiring skills and training. Ambitious librarians forgo the long timeline of traditional “drift” (usually predicated on being appointed into positions after decades of demonstrating skills).

The topic of ambition has many nuances and challenges, especially in light of the hegemonic representation of who traditionally becomes a leader in librarianship. A respondent in Chou and Pho’s 2017 study shared an experience in which a Latina woman’s ambition to be a branch manager within five years was scoffed at by a hiring committee and she was not hired in the end. The respondent believed the candidate was seen as aggressive, but did not believe the committee would have had the same impression of a white male. Unlike Ly’s case, responses in Bugg’s 2016 survey of people of color in middle management positions show alternative paths to leadership. Respondents reported having the skills and desire for the work described in a position that had leadership and management responsibilities, but not necessarily the ambition to move up. Many respondents participated in preparatory activities (such as leadership training, doctoral degrees, career coaching, etc.) but only one expressed desire to move from middle management to senior leadership. They cited reasons for not wanting to advance such as the elimination of tenure, dissonance with personal values, and lack of motivation. This led us to wonder if there is a discrepancy between the skills needed and the skills valued.


Once in positions with leadership and management responsibilities, librarians with ambition face different challenges. 32% of interim library leaders had fewer than five years of leadership and residency at their institutions when they were appointed to interim positions (Irwin & deVries, 2019). Several noted colleagues having difficulty accepting them as leaders, especially when length of service was not a criterion for appointment. Chou and Pho (2017) note common experiences in which female librarians of color were more likely to have their intelligence, qualifications, and authority questioned. Early career librarians in the study attributed perceived incompetence to looking young in addition to being a person of color and a woman. Women of color managers often experienced patrons who did not believe they were the person in charge. Similarly, Bugg (2016) found that many librarians of color felt apprehension about moving into senior leadership positions due to lack of exposure to senior leadership networks and incompatible organizational values. Before advancement, librarians were trained in both leadership and management externally. Afterwards, they discovered less access to and support for opportunities, such as lobbying for department needs, not feeling supported by senior leadership during difficult decisions, and lack of exposure to or not receiving opportunities.

Additionally, alternative types of leadership are not valued by traditional career advancement. Phillips (2014) highlights the “transformational leadership” type, which focuses on progressing organizational change, as the type of leadership commonly discussed in librarianship. In this style, collaboration among librarians is championed by the profession, as it is seen as necessary for progress. However, there is dissonance between valuing collaboration and recognizing demonstrations of leadership skills in collaborative work. This can be seen in the servant leadership style, currently popular in libraries (Richmond, 2017). Douglas and Gadsby’s 2017 study of instruction coordinators shows this imbalance. Instruction coordinators do a great deal of feminized “relational” work—supporting, helping, collaborating—and yet they are not given authority or power to make substantial change. Additionally, librarians of color often find themselves taking on undervalued “diversity work” in collaborations, piling explaining concepts and lived experiences to white colleagues on top of the work itself (Chou & Pho, 2017). There is very little recognition for those who are not in positions of authority or do not formally supervise others, but make substantial contributions to the collaborative work. Likewise, someone may lead within their position, but never manage others. The work is valued as work, but not in terms of leadership.


We chose a mixed-methods approach to cross-examine the multiple complex factors that are involved in varied experiences with leadership and management over time. For our primary method, we used Constructivist Grounded Theory (CGT), which supports the open-ended collection and analysis of data. Unlike in Grounded Theory, we co-constructed theory by taking multiple perspectives and the positioning of researchers and participants into account. We applied an existing theory based on recurring themes from the data. CGT does not assume theories are discovered and uses existing theories where they apply (Strauss & Corbin, 1997; Charmaz, 2006). Within CGT, we used constant comparison to direct our analysis. Constant comparison is a method of Grounded Theory in which data is compared against existing findings throughout the data analysis period.

As we constructed theory from the results, we applied Feminist Theory, which is a method of analysis that examines the relationships between gender and power, and how structures reinforce the oppression of women (Tyson, 2006). We also looked closely at the historical factors that inform current practices. Our feminist critique is informed by intersectionality, as defined by Kimberlé Crenshaw (1990), which explores how people with multiple intersecting identities beyond gender, such as race and ethnicity, queerness, and disability, experience overlapping oppressive power structures.

We created a survey to explore the perceptions and lived experiences of library professionals related to leadership and management. This includes librarians with MLIS degrees, paraprofessionals, and students in order to capture the perspectives of newly minted librarians, managers of new librarians, and those interested in mentoring or supporting new librarians. We sent the survey to multiple ALA email listservs (groups for new members of ALA, new members of leadership groups, general leadership, reference, assessment, college and university libraries, diversity and inclusion, technology and scholarly communication) to gather responses. The survey included questions on skills, attributes, and participants’ experiences. We provided a list of skills based on the literature and asked participants to rate their importance. We deliberately designed the survey so that participants would share their own thoughts and values first, without being primed by our list of skills.

We analyzed the data based on the career experience of the respondents. We categorized librarians with 0-6 years of experience as “early career.” Those with 7-15 years we categorized as “mid-career,” and those with 16+ years as “late career.” These determinations are based on the Association of College and Research Libraries (ACRL) criteria for travel scholarships to the biennial conference (one of the few career level distinctions we found from a professional organization). We included pre- and post-MLIS work in determining experience. This differs from the ACRL definition, where they measure based on post-MLIS experience. We wanted to capture all experiences that contribute to how professionals acquire skills during the early stages of their careers. We also asked participants to report degrees earned. Our survey did not involve questions about tenure, but we did note any mention of the influence of tenure.
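
As an illustration only, the career-stage cut-offs described above can be written as a short Python sketch. This is not code from the study; the function name `career_stage` and the sample values are hypothetical, and the input is assumed to be total years of library experience, counting both pre- and post-MLIS work.

```python
from collections import Counter


def career_stage(years_of_experience: int) -> str:
    """Map total years of library experience to a career stage.

    Cut-offs follow the text: 0-6 early, 7-15 mid, 16+ late.
    """
    if years_of_experience < 0:
        raise ValueError("years of experience cannot be negative")
    if years_of_experience <= 6:
        return "early career"
    if years_of_experience <= 15:
        return "mid-career"
    return "late career"


# Hypothetical responses, used only to show how answers would be grouped.
sample_years = [2, 6, 7, 12, 15, 16, 30]
stage_counts = Counter(career_stage(y) for y in sample_years)
```

Keeping the cut-offs in one function makes the boundary cases explicit: a respondent with exactly 6 years falls in early career, and one with exactly 16 years falls in late career.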

When we analyzed the data, instead of creating categories before coding responses, we created categories based on prominent themes in the data. Additionally, we used Voyant, an open-source text analysis tool, to track frequently used words, examine phrases, and measure sentiment. With Voyant, we determined the most popular qualities in leaders and managers. We also used it to determine the most common themes from qualitative responses.
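
The word-frequency step can be approximated with a few lines of Python. The sketch below is not Voyant and does not reproduce its output; it only shows the general idea of counting terms across open-ended responses. The stopword list, the function name `term_frequencies`, and the sample responses are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "is", "with", "in"}


def term_frequencies(responses, stopwords=STOPWORDS):
    """Count how often each non-stopword term appears across all responses."""
    counts = Counter()
    for text in responses:
        # Lowercase the response and keep runs of letters as words.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts


# Hypothetical open-ended answers, not data from the survey.
sample_responses = [
    "Vision and communication",
    "The ability to communicate a vision",
    "Good communication skills",
]
frequencies = term_frequencies(sample_responses)
top_terms = frequencies.most_common(3)
```

Note that a plain count like this treats "communicate" and "communication" as different terms, which is one reason related words can appear as separate entries in a frequency list.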


We sent our survey to ALA-affiliated listservs, which excluded librarians who do not subscribe to those services. We did not ask for much demographic information, such as age, race or ethnicity, or type of library in which participants work. According to the literature available at the time we designed the survey, professionals had varying positive, neutral and negative perspectives on how identity affected their paths to leadership. We wanted to give participants an opportunity to address these issues and included a question specifically about whether they encountered challenges related to their identities. When we designed our instrument, we did not design it with feminist critique or intersectionality specifically in mind.

We realized after completion of the survey that “Cis or Trans Woman” and “Cis or Trans Man” may be more accurate labels than the options we provided, such as “Woman or Trans Woman.” We also did not ask for participants’ age or list age as a challenge related to identity. When we presented preliminary data at the ALA Annual Conference in June 2018, we consolidated mid- and late career responses. We realized it was important to separate these responses in the results of this paper, as they are distinctly different career levels.


We recorded 373 responses to the survey. After eliminating responses with less than a 22% completion rate, we had 270 complete responses.

Background and Demographics

The results of the survey included a high percentage of respondents with greater than six years of experience. We wanted a wide range of perspectives, including those who were able to reflect on how their early experiences shaped the rest of their career. This skew prompted us to filter responses (particularly qualitative ones) based on early career (0-6 years), mid-career (7-15 years), and late career (16+ years) to analyze specific perspectives.

Participants gave information about their years of experience (n = 270). The largest group of respondents had 16 or more years of experience (36%). The second largest experience range was 7-10 years (20%). 26% of respondents were early career professionals, with an experience range of 0-6 years. Educational backgrounds among early and mid- to late-career professionals had no difference in proportion, although two early career respondents noted they were currently completing a bachelor's or master's degree in library science.

Figure 1. Respondents’ years of experience in libraries, including pre-MLIS experience.

Representation and Challenges Related to Identities

Respondents (n = 270) were 80% cisgender women or trans women and 15% cisgender men or trans men; 1% identified as non-binary, 1% preferred not to answer, and 0.37% identified as other. Some qualitative responses included mentions of harassment, microaggressions, or bias related to gender:

“One time I was treated particularly unfairly during an internal interviewing situation in which [I] accepted the position but was offered considerably less money that a male counterpart. I had to prepare for negotiation and [speak] out about this inequity. I expressed my concern to my male supervisor / department head and, great as he was, he was not helpful for me because he was particularly conflict averse (although awesome in a lot of other ways).”

“The concerns I’ve faced with regard to these issues haven’t come from fellow employees, but from library patrons, who have occasionally been sexually explicit or harassing towards me and other female employees (non-white employees have faced similar problems, but being white I have not directly faced that problem myself)…”

Identities by Career Stage

While we did not ask for demographic information regarding race, sexual identity, or ability, we did want to gather information about whether respondents faced challenges related to these identities.

| Identity | Early Career [0-6 yrs] (n=42) | Mid Career [7-15 yrs] (n=52) | Late Career [16+ yrs] (n=39) | Response Rate (n=182) |
| --- | --- | --- | --- | --- |
| Gender | 45.23% (n=19) | 76.92% (n=40) | 22.44% (n=11) | 48.35% (n=88) |
| Race or Ethnicity | 14.28% (n=6) | 25.00% (n=13) | 8.16% (n=4) | 18.68% (n=34) |
| Gender Expression | 4.76% (n=2) | 5.76% (n=3) | 0.00% (n=0) | 2.75% (n=5) |
| Sexual Identity | 7.14% (n=3) | 5.76% (n=3) | 0.00% (n=0) | 3.30% (n=6) |
| Accessibility/Disability | 11.90% (n=5) | 3.84% (n=2) | 4.08% (n=2) | 9.89% (n=18) |
| Other experiences intersecting with any of the above or additional issues | 16.66% (n=7) | 19.23% (n=10) | 44.89% (n=22) | 17.03% (n=31) |

Figure 2. Percentage of respondents at different career stages who reported experiencing challenges related to identity.

The most common challenges participants faced in relation to their identities included gender, race or ethnicity, and accessibility or disability concerns. Respondents often faced challenges related to their ability:

“I have a hearing disability. I often need technical support for meeting in ensuring that I can hear everyone. It doesn’t always work out.”

“I had health issues come up, that included significant exhaustion, brain fog, and executive function issues (along with other symptoms). My boss at the time handled it very badly – he kept pushing me to take on more tasks (in my first year in a new position), did not communicate options to me for leave/additional support (or refer me to the person in the campus structure who managed that for staff), and shamed me for making a necessary specialists appointment. I ended up having my contract not reviewed and was out of work for a year.”

“Mental illness, stigma”

Others faced challenges due to their age, race, or social class:

“When I was younger, people under my leadership would sometimes become angry about a perceived lack of experience in comparison to them…. I believe that I have often had to overcome quite a bit of disrespect as a woman of color in our field. Underneath others’ leadership, I have found less support from managers as I grow older and more experienced. Aging leadership clearly see me as a threat to their positions, and have cut me off from professional opportunities. Administrators sometimes shut down committees when the team selects me as a leader. This has happened to me 4 times in my current organization.”

“[H]onest conversations about race and social class. I was told to tone down my pride about coming from a working class background and being from the south. Learning to hide this identity has helped me connect with academic librarians, who are mostly from upper social classes.”

Some face challenges that are intersectional, including homophobic remarks, crossing personal boundaries (or as the respondent says, “lines I can draw”), and sexist behaviors:

“[My coworker] tells on people and I’m uncomfortable working with her. She is very conservative and the other day she asked me what I thought of gay couples raising children… I would like to work out with someone the lines I can draw. Having an older tentured male professors ask me to make coffee for them for an IRB meeting. They didn’t realize I was a professor (junior), and was there for the meeting. I was shocked and while thinking of a response, they realized their error.”

There were few responses to gender expression and sexual identity challenges that participants faced, but it is important to note identities which may be marginalized within gender issues.

Participants were able to select multiple identities to indicate intersectionality in challenges. It is beyond the scope of this paper to list all of the intersections of identities that exist, but in the data, the most common combinations of intersectional identities included gender and gender expression; gender, race or ethnicity, and sexual expression; gender and accessibility; and gender, race and accessibility.

Comparison of Direct Reports in Highest Supervisory Role

We asked librarians how many direct reports they supervised in their highest supervisory role. More than half of surveyed early career librarians had direct reports. This is higher than expected and highlights the fact that early career librarians are in fact moving into leadership and management roles.

Mid- and late career professionals had greater numbers of direct reports, with 27% having more than 10 subordinates and only 13% having none. Late career librarians supervise more than either other group, which is in line with our expectations. Additionally, 90% of men who completed the survey were supervisors, but only 76% of women were supervisors.

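| Career Stage | Gender | 0 Direct Reports | 1-5 Direct Reports | 5-10 Direct Reports | 10+ Direct Reports | Total Responses |
| --- | --- | --- | --- | --- | --- | --- |
| Early Career | Men | 1 | 4 | 1 | 1 | 7 |
| Early Career | Women | 33 | 20 | 7 | 5 | 65 |
| Early Career | Non-Binary | 0 | 0 | 0 | 0 | 0 |
| Mid Career | Men | 2 | 8 | 3 | 2 | 15 |
| Mid Career | Women | 16 | 38 | 16 | 17 | 87 |
| Mid Career | Non-Binary | 0 | 2 | 0 | 0 | 2 |
| Late Career | Men | 1 | 4 | 4 | 12 | 21 |
| Late Career | Women | 1 | 16 | 28 | 20 | 65 |
| Late Career | Non-Binary | 1 | 0 | 0 | 0 | 1 |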

Figure 3. Number of direct reports by respondent career stage and gender identity.

We did not include “Prefer not to answer” and “Other” in these tables due to lack of responses.

Leadership and Management Attributes

In an open-ended question, we asked participants what skills and qualities ideal leaders and managers possessed and to rate their value. These were the most commonly written values, qualities, and attributes:

Leader: vision (170); ability/able (125); communication (83); good (used an adjective describing excellence or competence) (46); communication (33); skills (33); visionary (33)

Manager: ability (67); good (57); skills (56); communication (52); staff (44)

Participants included words such as ability, good, skills, or staff, which were not in our prompted list of valued skills. Skills from the literature that were not valued by our participants (either in the open-ended questions or the value question) included commitment, influence, negotiation, problem solving, dedication, caring, and assertiveness.

Traits valued for leaders across all career stages were vision, ability/able (in this context also meaning competency or execution), good, and communication. Early career librarians valued additional traits such as listening, generating ideas, and thinking about the big picture. Mid- and late career librarians valued other traits such as a focus on people, work, and organization (both organizational knowledge and being organized).

Librarians valued skills, communication, work, and organization in managers across all career stages. Early career librarians valued teams (presumably both teamwork and the existence of teams) in addition to shared ideal traits.

Few participants marked any of the qualities “Not at all important”; for every category except one, development, that rating made up 0% of the total answers.

Possessed Skills

We asked participants to rate the extent to which they feel they already possessed leadership and management qualities. Most librarians feel a strong sense of integrity and commitment is needed for leadership, but feel less strongly about their influence or negotiation skills. Overall, librarians feel positively about their leadership skills, with only a handful saying they “totally disagree” about any particular skill. This dedication to integrity is reflected in the responses as well. Librarians who felt their supervisor trusted them or acted ethically were more positive in their responses:

“Experience under a public library director who served with grace and integrity. Taught me how to deal with a board of supervisor who were all men.”

“I reported to an AUL who repeatedly lied to me about important issues. One example: she told me she took a diversity proposal I spent weeks developing to the Dean, who (AUL said) did not support it. I found out from the Dean that she had not brought him the proposal. (This is just one example.) I realized I could not trust her integrity. My reaction was to request to go back to a non-supervisory position. The library lost one of its few minority senior managers (me). I did not have the tools to effectively deal with this situation.”

Figure 4. Respondents’ perceptions of whether they possess specific leadership qualities.


However, librarians are in less agreement over their management skills. Librarians did not rate any of these skills as highly as they did for leadership skills. Librarians feel they have dedication, care for colleagues and subordinates, and problem solving skills. On the other hand, few believed that they possessed assertiveness, development, organization and delegation needed for management. The skills and attributes that librarians rated highly are correlated with caring and empathy, which are represented in the responses as well:

“I had a boss who was very open to hearing from his staff, and I told him that I didn’t like how he treated one of my colleagues; I felt as though he was disrespectful to her, and he listened to me and actually improved his actions towards her from then on. I appreciated that he cared enough to listen and change his actions.”

Figure 5. Respondents’ perceptions of whether they possess specific management qualities.


Generally, responses were positive across career stages. Early career respondents have fewer “totally agree” responses.

Additionally, communication was a choice in both questions, and librarians rated themselves differently. 36% said they totally agree that they have leadership-communication, but only 12% said the same for their management-communication. Communication was a challenge expressed in almost every section of the survey. Participants rated themselves poorly for their communication proficiency and wrote in open-ended questions about experiences they had with others. These two examples show a positive and a negative experience with supervisors:

“I had a manager who made a point to always stand up for her subordinates. If they were wrong, she would take them aside and personally talk to them about how the situation could have been handled better, rather than berating them in front of the public or co-workers. This lead to improved confidence, particularly in tough situations with public service.”

“On the negative side I’ve had supervisors who have failed to communicate information that later became public and caused problems in the library…. Previously, I worked with a leader who actively refused to advocate for the library and it cost us resources (laptops and other tech that I didn’t even know we had access to. I learned about it via gossip with other librarians and staff.)”

We asked participants to share when they have received feedback about these leadership and management skills to see if they actually possess them. A few paraprofessional respondents noted they did not receive feedback about leadership or management attributes because they did not have the same formal review processes as professionals. This is notable because it shows a gap in support between librarians with their MLIS and those in libraries who do not have the degree. Generally, feedback mechanisms were informal and came from supervisors and colleagues.

Another common theme was that respondents felt frustrated about the feedback they receive from supervisors. Unspecific feedback, no feedback at all, or informal feedback that does not reflect the formal review were major points of frustration. Below is an example from a respondent who prefers specific, constructive feedback that reflects the supervisor’s understanding of the value of their work:

“I receive generic “thank you for the good work” emails from my supervisor regularly, though I don’t think she has a good idea of the work I do, or of its value to my team”

One major issue that surfaces in this statement is that the supervisor may not understand their work, and does not give appropriate or helpful feedback because they cannot. This shows how miscommunication may be indicative of deeper issues, but we can only speculate because the respondent did not elaborate.

Support, Challenges and Role Models

Several questions on the survey were dedicated to feedback, support structures, and leaders who made an impression on respondents. Many reported additional types of resources for support in the “other” field. These included webinars, funding for professional development, informal peer mentoring, defunct mentoring programs, and leadership training external to libraries. More people reported access to support for leadership training (28%) than formal mentoring (20%) or peer mentoring (18%). This is reflected in many of the qualitative responses, in which respondents note support for acquiring skills, but lack of support for navigating specific situations.

Institutional Support

| Type | Early Career (n=60) | Mid Career (n=105) | Late Career (n=105) | Total (n=270) |
| --- | --- | --- | --- | --- |
| Formal Mentorship | 25.00% (n=15) | 19.04% (n=20) | 18.09% (n=19) | 20.00% (n=54) |
| Leadership Training | 18.33% (n=11) | 32.38% (n=34) | 29.52% (n=31) | 28.14% (n=76) |
| Peer Mentorship | 18.33% (n=11) | 17.14% (n=18) | 20.00% (n=21) | 18.51% (n=50) |
| None | 33.33% (n=20) | 21.90% (n=23) | 15.23% (n=16) | 21.85% (n=59) |
| Other | 5.00% (n=3) | 9.52% (n=10) | 17.14% (n=18) | 11.48% (n=31) |

Figure 6. Proportion of respondents who received different types of leadership and management support, by career stage.

Participants were able to select multiple options for types of support as well as topics that support covered. If someone indicated two types of support, it was counted in both categories. The following were dual forms of support indicated by early career librarians: formal mentorship and peer mentoring, leadership training and peer mentoring, as well as formal mentoring and leadership training. Other forms of support included funding or opportunities for external professional development (continued education, human resources, conferences, webinars, etc.), informal mentoring, training or mentoring for new or select “rockstar” librarians, and lack of support at the senior leadership level.

Topics of Support

Participants shared multiple areas of institutional support they received (n =147). Across all career stages, respondents received the most support for librarianship (or job function) (65.98%). Late career librarians received the most support for training (early 25.00%; mid 23.07%; late 38.33%). Fundraising was the area with the lowest amount of support across all career stages (3.4%), which aligns with the literature. Write-in topics of support from early career librarians included promotion and tenure, general orientation to the library and institution, teaching, institutional assessment, mentoring and leadership and professional development. Some listed “none,” or no support.

Overall, participants were moderately satisfied with the support they received. There were few extreme responses (either extremely satisfied (11.56%) or dissatisfied (7.54%)), with the majority marked moderately satisfied (33.17%). Yet, we found dissatisfaction in some of the open-ended answers. This could be because priorities vary by organization. For example, academic librarians on the tenure track would need more support for publishing than public librarians.

Our questions about support focused on organizational support, but many respondents wrote about interpersonal support they received (or did not receive) from supervisors, mentors, or senior leaders in the organization. Many of these respondents were looking for emotional support to trust their own decisions rather than wanting their supervisor/mentor to make a decision for them. They did not have legitimacy or did not trust their own power to make certain decisions, and needed backup for making hard choices. Support is critical, as shown by one of the responses:

“I have needed support primarily in two areas: negotiating for budget and dealing with difficult personnel issues. In the first case, I needed political support from my administrator: how to build coalitions and be persuasive in order to accomplish my goal. In the second, I needed organizational support from my administrator: how to operate within the personnel system to solve the problem.”

Discussion and Analysis

In terms of our research question, we found that many opportunities and challenges have arisen as libraries evolve. Support for rising leaders requires librarians to recognize their own power to advocate for themselves and to use that power to create a supportive environment for others. Since librarianship is a feminized profession, we used the lens of feminist critique to analyze the results of our study when speaking of power. We focused on historicism in feminist theory, to explore how the history of leadership in librarianship impacts current practices. Furthermore, it is important to analyze the study through a feminist lens informed by intersectionality as we seek to interrogate the profession’s claims to value diversity, inclusion, and equity, despite the lack of changes in the power structure to promote equity. This way, we can take a more accurate look at the progress in addressing calls to action.

White women are now more proportionately represented in leadership, a trend reflected in our data and the literature. Yet there is still contention about who holds power based on how we associate leadership with certain behaviors and identities. Representation is the first step to equity, but more work needs to be done to shift power from those who have historically held it. Our research undercuts existing assumptions that librarianship is more egalitarian because we are less male dominated, that women leaders are inherently feminist leaders, and that more diverse representation will mean inclusive practices.

Additionally, the number of early career librarians in supervisory roles was much higher than we hypothesized. When we designed our study, we were looking at career support as a barrier to preparation for leadership and management, but as we read through the responses, the frequency of experiences pointed to systemic issues. Issues related to race, sexual expression, and other marginalized identities are impacting a lot of early career librarians, but we did not realize to what extent. A great deal of the literature that influenced our original study focused on the individual methods of professional development, disassociated from systems of power that maintain the status quo and keep power in a small number of hands. POC in management positions from Bugg’s 2016 study reported a range of perceptions about identity as helpful, hindering, or neutral to gaining leadership positions and varying levels of ambition. Originally, we thought lack of ambition was an intrinsic motivation, but our analysis revealed that there are external factors as well. The feminist lens helped us look at the larger picture regarding how our organizational structures and individual institutions uphold systematic oppression (e.g., classism, racism, sexism).

We need to examine our relationships to and support of oppressive structures not just in the field at large but within individual libraries. Members of marginalized groups, especially people of color and those who identify as LGBTQIA, experience hidden workloads, microaggressions, early burnout and lower retention. They have less access to and support for opportunities within their work and leadership roles than their counterparts. The profession can change this by implementing institutional policies for conduct and intervention, prioritizing retention, and incorporating anti-oppression practices into support systems and decision-making. Librarianship has documented issues with retention for librarians of color (Bugg, 2016; Chou & Pho, 2017), which directly relates to the lack of people of color represented in leadership positions. Many of the managers from Bugg’s 2016 study received opportunities to gain leadership skills through professional development but felt a lack of support within their libraries. Librarians need to reevaluate and account for the impact of barriers for marginalized groups in our assessments of how leadership potential is demonstrated, how leaders are retained, and the value of diverse perspectives.

For librarians who aspire to leadership, there can be a disconnect between learning which skills are important, recognizing those skills in yourself, and discovering the methods to obtain those skills. The majority of these “skills” are hard to capture with concrete measurements because they are personal qualities. Skills like organization may be acquired, but qualities such as integrity are characteristics. We did not distinguish these in our survey, and neither did our respondents. We wanted to observe respondents’ perceptions of the concepts, and we anticipated they would use skills and qualities interchangeably. Professionals may enter librarianship with varying individual skills or qualities, and access to learning opportunities or training may vary. We found this can create difficulties for librarians to self-assess, demonstrate abilities, and request feedback, which are major ways for librarians to recognize the skills they need to grow. This hinders progress, as self-advocacy is an important part of demonstrating leadership ability.

Librarians across career stages generally agreed on what they value in leaders and managers. Communication, integrity, and commitment were important qualities to participants. However, it was evident from open-ended responses that these values were not implemented well by some leaders. This aligns with a feminist critique of librarianship that points out how we fall short of what we say and do within organizations, despite aspirational values of transparency, community building, empowering others, and information sharing (Yousefi, 2017). This dissonance can be seen in the responses regarding communication: it is one of the most important identified traits of both leaders and managers, yet many participants rated their own leadership-communication more positively and management-communication more negatively. Many of the traits we desire in leaders have both masculine and feminine coding and may be perceived differently based on the leader’s background (Richmond, 2017). Attaching identities like gender, race, age, and ability to who can and cannot embody or perform leadership traits is a reflection of our relationships to power, however conscious or unconscious.

Figure 7. Beyonce says, “I’m not bossy. I’m the boss.”


Historically, librarianship has perpetuated hierarchical power structures in which leaders were white men. Women—more specifically, white women—were targeted as the ideal professionals to carry out orderly tasks and support researchers through care. The overemphasis of care, moral attachment and service that we currently glorify in librarianship (Ettarh, 2017) continues to perpetuate historical structures in which power is consolidated in few hands (Higgins, 2017; Richmond, 2017). The primary way librarians demonstrate power or influence is through long-term experience or accelerated accomplishments. This reinforces the need for a feminist framework, as mentioned earlier, for shared power in which librarians recognize privilege, cultivate interdependent partnerships rather than serve, and advocate for themselves to address this imbalance of power.

Creating organizations that are supportive, evolving, and inclusive requires that librarians take action to correct these imbalances. In the survey, we noticed an interesting pattern that librarians valued care-related qualities and believed they also possess these qualities. This aligns with the concept of the ethic of care, which prioritizes interpersonal relationships as moral virtue (Higgins, 2017). However, librarians placed less value on qualities related to influence such as assertiveness, negotiation, and delegation, and did not believe they possessed them. This discrepancy and aversion to risk may be influenced by servant leadership. This leadership style implies power is derived from moral standing but requires the leader to relinquish some amount of power in order to “deserve” the position, a starting place not historically afforded to marginalized groups. Higgins (2017) asserts that skills and qualities shown to be effective and valued in leaders should be championed over likability or collegiality, as it unnecessarily disadvantages women in leadership positions. We need to reevaluate whether supporting librarians who exhibit perceived moral care leads to effective leadership.

We categorized librarians based on how they did or did not demonstrate influence and leadership skills or qualities. “Experienced” librarians, generally mid- to late career, gained power through relational and organizational influence. “Rockstar” librarians, often early or mid-career, gained a sense of power through ambition, influence through common vision, accomplishments and accelerated responsibility. If positions of servant leadership are deserved based on these paths, those with hidden potential may be at a disadvantage. We labeled librarians with potential but little experience as in “The Middle.” They are still developing experience or ambition, may feel disempowered by leaders, and struggle with imposter syndrome when demonstrating achievements. Librarians in The Middle differ from underperformers or novices in the profession. New competencies (e.g. emerging service areas) and pre-MLIS experiences create opportunities for librarians to be new but not novices. If we are to support librarians of color who may fall in “The Middle,” we must consider cultural competencies and precarious situations that librarians of color navigate as demonstrations of leadership, rather than continue to undervalue the complexities of their experiences.

Respondents in this study revealed examples of lack of support as well as self-disempowerment. Members of marginalized groups can fall in The Middle because they may not have been conditioned to recognize opportunities or develop leadership skills due to associations of leaders who typify traditional ideals. They may also internalize disempowerment from leaders, colleagues, or external systems of oppression to avoid making themselves highly visible and therefore subject to discrimination through self-advocacy. Some of these challenges surface because of risk aversion, which supports and continues oppressive structures, a false sense of neutrality, and paths of least resistance.


Though new values and additional representation in leadership indicate progress within missions and goals, libraries continue to “replicate libraries of the past instead of looking to the needs of library users and workers of the future” (Askey & Askey, 2017). We still build and recognize leaders based on traditional methods and values.

There are few mechanisms for people rising up in the profession to demonstrate their abilities outside of experience or taking initiative. Most of the focus in library literature has been on who current leaders are and what experience they have shown, not how they get to be leaders. Yet, it is necessary to tailor support to each individual librarian and their challenges. Some practices that support scaffolding (which ultimately lowers barriers to mobility) include clear, constructive, and specific feedback; clearly communicating vision; recognizing individuals’ strengths and weaknesses; helping others recognize their strengths and opportunities; and allowing ongoing, iterative development rather than perpetuating a culture of reactionism and perfectionism. It is especially important to create spaces for open dialogue that includes honest and supportive conversations about identity, given that people with marginalized identities experience the harmful effects disproportionately.

As we move away from traditional work and traditional ideas of leadership, those who currently hold positions must examine their relationship to power by using it to effectively create a legacy for the future. Early career librarians who may take on positions of power now or later must also examine their relationship to power through self-advocacy. It is a cultural shift that requires work from individuals, organizations, and the profession at large. If we are to prepare the next generations of librarians to lead among rapid changes to librarianship, we must intentionally revise relationships to power, scaffold new paths for those with potential to advance, and create inclusive organizational structures going forward.


We would like to thank our editor Amy Koester, and our reviewers Sofia Leung and Ali Versluis for supporting the journey of this complex article. Your insights, observations, and additional readings helped make this article much richer. We also want to thank Ryan Litsey and Denisse Solis for advising us on the framework of our findings.


Askey, D. & Askey, J. (2017). One Library, Two Cultures. Feminists among us : Resistance and advocacy in library leadership. S. Lew and B. Yousefi (Eds.). Sacramento: Library Juice Press.

Aslam, M. (2018). Perceptions of Leadership and Skills Development in Academic Libraries. Library Philosophy and Practice. Retrieved November 5, 2018, from

Birsel, A. (2015). Design the life you love: a step-by-step guide to building a meaningful future. Berkeley: Ten Speed Press.

Bugg, K. (2016). The Perceptions of People of Color in Academic Libraries Concerning the Relationship Between Retention and Advancement as Middle Managers. Journal of Library Administration. Retrieved from

Charmaz, K. (2006). Constructing grounded theory : A practical guide through qualitative analysis. London: Sage Publications.

Chou, R. L., & Pho, A. (2017). Intersectionality at the Reference Desk: Lived Experiences of Women of Color Librarians. The feminist reference desk: concepts, critiques, and conversations. M. T. Accardi (Ed.). Sacramento: Library Juice Press.

Crenshaw, K. (1990). Mapping the margins: Intersectionality, identity politics, and violence against women of color. Stan. L. Rev., 43, 1241.

DeLong, K. (2013). Career Advancement and Writing about Women Librarians: A Literature Review. Evidence Based Library and Information Practice, 8(1), 59.

Douglas, V. A., & Gadsby, J. (2017). Gendered Labor and Library Instruction Coordinators. Presented at the ACRL 2017 Conference, Baltimore, MD (pp. 266-274).

Ettarh, F. (2017, May 30). Vocational Awe? Retrieved from

Farrell, B., Alabi, J., Whaley, P., & Jenda, C. (2017). Addressing Psychosocial Factors with Library Mentoring. Portal: Libraries and the Academy, 17(1), 51–69.

Faulkner, A. (2016). Teaching a New Dog Old Tricks: Supervising Veteran Staff as an Early Career Librarian. Library Leadership and Management. Retrieved from

Fleming, R. & McBride, K. (2017). How We Speak, How We Think, What We Do: Leading Intersectional Feminist Conversations in Libraries. Feminists among us : Resistance and advocacy in library leadership. S. Lew and B. Yousefi (Eds.). Sacramento: Library Juice Press.

Grint, K., Jones, O. S., & Holt, C. (2017). “What is Leadership: Person, Result, Position, Purpose or Process, or All or None of These?” The Routledge Companion to Leadership. Storey, J., Hartley J., Denis, J. L., Hart, P., & Ulrich, D. (Eds.) New York: Routledge.

Harris-Keith, C. S. (2016). What Academic Library Leadership Lacks: Leadership Skills Directors Are Least Likely to Develop, and Which Positions Offer Development Opportunity. Journal of Academic Librarianship. Retrieved from

Herold, I. (2014). How to Develop Leadership Skills: Selecting the Right Program for You. Library Issues, 35(2), 1–4.

Higgins, S. (2017). Embracing the Feminization of Librarianship. Feminists among us : Resistance and advocacy in library leadership. S. Lew and B. Yousefi (Eds.). Sacramento: Library Juice Press.

Hines, S. (2019). Leadership Development for Academic Librarians: Maintaining the Status Quo? Canadian Journal of Academic Librarianship, 4(February), 1-19.

Irwin, K. M. & deVries, S. (2019). Experiences of Academic Librarians Serving as Interim Library Leaders. College & Research Libraries. 80(2), 238-258.

“Library Leadership Training Resources”, American Library Association, December 9, 2008. (Accessed November 5, 2018).

“Leadership and Management Competencies”, American Library Association, October 3, 2016. (Accessed November 5, 2018).

Lew, S. (2017). Creating a Path to Feminist Leadership. Feminists among us : Resistance and advocacy in library leadership. S. Lew and B. Yousefi (Eds.). Sacramento: Library Juice Press.

Ly, P. (2015). Young and in Charge: Early-Career Community College Library Leadership. Journal of Library Administration, 55(1), 60–68.

Martin, J. (2018). What Do Academic Librarians Value in a Leader? Reflections on Past Positive Library Leaders and a Consideration of Future Library Leaders. College & Research Libraries.

Morris, S. (2019). ARL Annual Salary Survey 2017–2018. Washington, DC: Association of Research Libraries.

Olin, J., & Millet, M. (2015). Gendered Expectations for Leadership in Libraries. In the Library with the Lead Pipe. Retrieved from

Phillips, A. L. (2014). What Do We Mean By Library Leadership? Leadership in LIS Education. Journal of Education for Library and Information Science, 55(4), 336-344.

Richmond, L. (2017). A Feminist Critique of Servant Leadership. Feminists among us : Resistance and advocacy in library leadership. S. Lew and B. Yousefi (Eds.). Sacramento: Library Juice Press.

Rooney, M. P. (2010). The Current State of Middle Management Preparation, Training, and Development in Academic Libraries. The Journal of Academic Librarianship, 36(5), 383–393.

Silva, E., & Galbraith, Q. (2018). Salary Negotiation Patterns between Women and Men in Academic Libraries. College & Research Libraries.

Stewart, C. (2017). What We Talk About When We Talk About Leadership: A Review of Research on Library Leadership in the 21st Century. Library Leadership & Management, 32(1). Retrieved from

Strauss, A. L., & Corbin, J. M. (1997). Grounded theory in practice. Thousand Oaks: Sage Publications.

Tyson, L. (2014). Critical theory today: A user-friendly guide. New York: Routledge.

Wilder, S. (2017). Delayed Retirements and the Youth Movement among ARL Library Professionals. Association of Research Libraries. 9.

Young, A. P., Powell, R. R., & Hernon, P. (2003). Attributes for the Next Generation of Library Directors, 8.

Yousefi, B. (2017). On the Disparity Between What We Say and What We Do in Libraries. Feminists among us : Resistance and advocacy in library leadership. S. Lew and B. Yousefi (Eds.). Sacramento: Library Juice Press.

Appendix A: Survey Questions


1. Are you a

  • Man or Trans Man
  • Woman or Trans Woman
  • Nonbinary
  • Prefer Not to Answer
  • Other [text entry]

2. How long have you worked in libraries?

  • Less than 1 year
  • 1-3 years
  • 4-6 years
  • 7-10 years
  • 11-15 years
  • 16+ years

3. Select degree(s) earned. Please list any subjects besides library science in the “Other” field. [tick box]

  • High School
  • Associate
  • Bachelor
  • Master
  • MLIS (or equivalent)
  • Doctoral
  • Vocational
  • Other________ (text entry)

4. On average, how many people have reported directly to you in your highest supervisory position?

  • 0
  • 1-5
  • 5-10
  • 10+

Leadership and Management Attributes

5. What are the qualities of an ideal leader?
Text Entry

6. What are the qualities of an ideal manager?
Text Entry

7. How important are these qualities of leadership? (Likert Scale – rate very important to not important at all)

  • Vision
  • Creativity
  • Commitment
  • Motivation
  • Communication
  • Integrity
  • Negotiation
  • Influence

8. Rate the extent to which you feel you already have these leadership qualities. (Likert Scale – rate totally agree to don’t agree at all)

  • Vision
  • Creativity
  • Commitment
  • Motivation
  • Communication
  • Integrity
  • Negotiation
  • Influence

9. How important are these qualities of management? (Likert Scale – rate very important to not important at all)

  • Dedication
  • Communication
  • Caring for colleagues and subordinates
  • Problem Solving
  • Assertiveness
  • Development
  • Organization
  • Delegation

10. Rate the extent to which you feel you already have these management qualities. (Likert Scale – rate totally agree to don’t agree at all)

  • Dedication
  • Communication
  • Caring for colleagues and subordinates
  • Problem Solving
  • Assertiveness
  • Development
  • Organization
  • Delegation

Demonstration of Attributes

11. Please provide any feedback you have received from a supervisor, mentor, or peer that you demonstrate these qualities. Include how this feedback was expressed (in a formal review, in a meeting, informally).
Text entry

12. Have you faced challenges regarding the following: (Yes or no checkboxes)

  • Gender
  • Race or ethnicity
  • Gender expression
  • Sexual identity
  • Accessibility/ Disability concerns
  • Other experiences intersecting with any of the above or additional issues

13. Describe a situation you have encountered in which you needed support or preparation. What kind of support or preparation was needed and did you receive it? Answer as a leader/manager or as an employee.
Text entry

14. Describe an experience you’ve had being led that made an impression on you and your work.
Text entry


15. What support does your institution provide? Choose from below or add your own:

  • Formal mentorship
  • Leadership training
  • Peer mentorship
  • None
  • Other [text entry]

16. Rate how satisfied you feel with this support.
Likert scale – rate Very satisfied to very dissatisfied

17. Does this support focus on any specific area, check all that apply.

  • Librarianship
  • Publishing
  • Research
  • Training
  • Service
  • Fundraising
  • Other [text entry]

Appendix B: Additional Data Visualization

Figure 8. Respondents’ perceptions of the importance of specific leadership qualities.

Figure 9. Respondents’ perceptions of the importance of specific management qualities.

Figure 10. Chart listing experienced and ambitious librarians at the top, “The Middle” representing librarians who show potential in the center, and brand new and underperforming librarians at the bottom.

Can Machine Learning Help Give Boost to Flagging Healthcare? / Lucidworks

According to the World Health Organization, more people than ever are living into their 60s and older; this population will comprise 20 percent of the world’s population in 2050.

Our longevity can be attributed to several factors, but the outsized role of advances in health care, although unquantifiable as a single factor, is undeniable.

If only healthcare could be provided efficiently to our over-surviving populace. It is no secret, however, that the industry is facing enormous challenges.

Shortage of Professionals

According to the Association of American Medical Colleges, by 2025, there will be a shortfall of between 14,900 and 35,600 primary care physicians; the shortage of non-primary care specialists is projected to reach anywhere from 37,400 to 60,300 by that same year. Updated 2018 data from the AAMC indicates that by 2030, the shortage will reach a whopping 120,000 overall (a shortfall of 14,800 to 49,300 primary care providers, and 33,800 to 72,700 non-primary care providers).

Along with this shortage, myriad issues continue to derail the provision of quality healthcare including:

  • Provider burnout, which is strongly associated with medical errors
  • Disparate electronic health records that impede sharing of vital information such as patient history and medication profile
  • Drug and medical supply shortages
  • Resurgent diseases
  • Antibiotic resistance
  • Administrative inefficiency and issues with insurance reimbursement and claim processing
  • Rising costs of care and medications

Despite the incredible connectivity afforded by technology in the 20th and 21st centuries and near-daily medical advances, few will argue: The U.S. healthcare system is broken, consistently ranking as the most expensive and the least efficient in countless reports.

Providers and their staffs are inundated with paperwork, growing demands to see more patients in less time, rising costs of malpractice insurance – all while striving to stay abreast of breaking studies and new treatment protocols.

Could providers use … an extra set of processors? Could the entire industry?

Artificial Intelligence: Medicine, Reimagined

The 21st century has brought about the growing use of technologies that use artificial intelligence (AI) to complete tasks and reasoning once performed by humans, including providers. Britannica defines AI as “the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.”

Granted, there is a strangeness, even creepiness, to devices that are programmed to reason, discover meaning, generalize, and learn from experience, qualities that may seem amazingly – or alarmingly – human. But in one sense, AI is nothing new. The founding father of AI and modern cognitive science was British mathematician and logician Alan Turing. A codebreaker during World War II, Turing focused his post-war research on the parallels between computers and the human brain. The scientist considered the cortex to be an “unorganized machine” that could be trained to be organized into “a universal machine or something like it.”

The term artificial intelligence, however, was not coined until a few years later, in 1955 by Turing contemporary John McCarthy, a mathematics professor at Dartmouth College. McCarthy defined AI as the “science and engineering of making intelligent machines.” Shortly thereafter, McCarthy convened the first AI conference and subsequently created one of the first computer languages, LISP.

It took a while (35 to 40 years) for computers to be a staple of providers’ practices, and a ubiquitous item in patients’ purses and back pockets. But once computers became commonplace in our society, AI-powered computers and robots were the next logical step, right?

But not so fast.

These new-fangled machines bring unwelcome anxieties and fears. Will human beings be displaced and replaced by these unnerving inventions? Will AI be running the world? What will people do all day when machines are doing everybody’s job? Are providers on the brink of extinction? These questions are justified, and the feelings are natural. But we’ve been here before.

Truth Stranger Than Fiction?

Culturally, AI has long been the subject of dystopian sci-fi movies and a recurring source of dread, foreboding, and even hysteria. Nearly 100 years ago, in the 1927 classic “Metropolis,” the robot Maria mixed it up in a futuristic, mechanized society dominated by power, manipulation, and social engineering. That film was followed by a 1930s America mired in the throes of the Great Depression, in which massive automation and culture-altering mechanization spurred “robot hysteria.”

And the trend continued. In 1968, we had the epic showdown between rogue computer HAL 9000 and astronauts in “2001: A Space Odyssey,” followed by The Terminator fighting killer machines in the 1980s and the manipulative and murderous robot Ava in 2014’s “Ex Machina.” For nearly a hundred years, our society has had a love-hate relationship with machines that think and act as humans do.

These fears bear examination and warrant validation; for providers, the influx of new AI-driven machines may seem to supplant physicians’ expertise, experience and mere existence. Will AI take over the roles of diagnosing patients and creating treatment plans? Who will run our practices, clinics, hospitals? Robots? Computers?

What exactly are these machines doing?

AI: Processes That Are Human-Like, But Not Human

Although the idea of smart computers that “learn” without our permission may seem alien, most of us already use AI in our everyday lives. Assistants such as Apple’s Siri, Amazon’s Alexa and other “personal assistants” may seem innocuous enough, but the truth is these and similar devices use machine-learning to better predict and understand our natural-language questions, interpret our speech patterns, and anticipate our future needs and habits.

Netflix, for example, analyzes, in a flash, billions of records to suggest films a consumer might like (and order or purchase) based on that person’s previous choices and ratings of films. Google very quickly anticipates websites of interest targeted to each user. Amazon has long been using – and constantly refining – “transactional” AI algorithms to keep track of individuals’ buying habits to assess, predict and suggest purchasing behavior.

How do they do it? Welcome to machine learning (ML), one of the most common types of AI used in the medical field. ML is the application of statistical models to data using computers. Machines that use ML do not “think” independently; rather, they are programmed with algorithms – sets of criteria that establish a process for recognizing patterns and solving problems.
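
To make "applying a statistical model to data" concrete, here is a minimal sketch in Python. It fits a straight line to a handful of made-up numbers with ordinary least squares and then uses the fitted line to predict a new value. The data, the variable names, and the choice of a linear model are all assumptions for illustration; they do not come from any real medical dataset or from any of the products mentioned here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical, made-up observations: (age in years, some measured reading).
ages = [30, 40, 50, 60, 70]
readings = [110, 115, 120, 125, 130]

model = fit_line(ages, readings)
estimate = predict(model, 65)  # prediction for an age not in the data
```

The "learning" here is nothing more than computing two numbers from the data; the pattern the machine "recognizes" is whatever the chosen model is able to express.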

The jury is out on whether consumer-level leveraging of AI to suggest where we should eat or who we should link with on LinkedIn is useful. But there are other uses of machine learning that have positive global effects.

Energy giant BP uses the technology at sites worldwide to optimize gas and oil production so that it is safe and reliable, something we all want when we start our cars every day. Its data is also analyzed by engineers and scientists to plan for future energy needs and develop new models and energy sources. In a world with limited resources and a burgeoning global population, AI undoubtedly helps to keep prices low and oil and gas available.

GE Power similarly uses machine learning combined with big data and the internet of things to optimize operations, maintain equipment and attempt to develop digital power plants. We all like it when we turn a switch and the lights go on and take for granted that the H tap will keep hot water flowing into our homes.

Many financial giants use AI to predict market trends, allowing investors and retirement funds to allocate money wisely, as well as cut down on fraud and other financial crimes, which can drop the price of these types of services for individuals and employers who rely on their data for building retirement funds.

What Can You Do for Me, Mr. Roboto?

Innovations in medicine and related industries in the 21st century have brought about an astounding array of new medications, advances in imaging, and reimagined treatment protocols, all at a break-neck speed.

How is a provider to keep up with all these changes, the endless journal studies and clinical trial data, new treatment options counter-balanced by black-box warnings, abrupt changes in long-held treatment protocols, reporting requirements, and harbingers of disease-outbreaks?

Not to mention the ever-increasing pressures that healthcare entails.

  • Data, data, data. AI-driven machines and their algorithms quickly incorporate myriad variables across a broad array of complex screening and diagnostic information and data. The information is processed to predict future events (e.g., what treatment protocol for disease X in population Y has led to the best outcomes based on previous cases and outcomes). AI systems constantly update as new data are added, learning new associations and refining their predictive power in real time. Machine learning can help to reinforce currently successful interventions and – perhaps more importantly – reveal new trends and raise new questions based on trends in data. Ostensibly, once compatible global AI systems are in place, providers will have access to data gleaned from every patient treated for any disorder in every country, plus breaking medical discoveries and clinical trials outcomes, all housed within AI-driven machines and their algorithms. The incidence of medical errors and ineffective care caused by incomplete patient electronic health data, missed treatment opportunities, and misdiagnoses will be greatly reduced, perhaps even eliminated.
  • Eliminating bias. Another quality of AI is the blind nature of its data processing. Clinicians gain valuable knowledge through experience over time based on past patient cases. This knowledge, however, can introduce prejudice to the decision-making process, and lead to misdiagnosis, treatment based on old data, and missed opportunities for more current approaches. If done correctly, use of AI can eliminate these potential pitfalls, as clinical recommendations are not based on a priori assumptions, but rather on constantly updated medical data drawn from myriad sources.
    (CAUTION: Biased data can introduce the very biases you are trying to avoid.)
  • Building Patient Profiles/Predicting Outcomes. Recommending an effective treatment plan is predicated on certain provider tasks such as taking the patient’s history, performing a physical exam, and incorporating information from established research (all of which are vital in screening and diagnosis). The goal to deliver a specific outcome requires planning and implementation of said plan (actions needed for treatment and monitoring). AI computers can quickly “crunch” complex data from the physical exam, lab results, and extant study data to match thousands and even millions of global data sets to predict future events (the best treatment plan for the desired outcome), in a matter of minutes or even seconds.
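
The idea in the list above, that a system keeps updating as new cases arrive and recommends whatever has led to the best outcomes so far, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `OutcomeTracker` class, the treatment labels "A" and "B", and the recorded cases are all hypothetical, and a running success rate is a far simpler model than anything a real clinical system would use.

```python
from collections import defaultdict


class OutcomeTracker:
    """Hypothetical running tally of treatment outcomes (illustrative only)."""

    def __init__(self):
        # treatment label -> [number of successes, number of recorded cases]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, treatment, success):
        """Fold one newly recorded case into the tally."""
        record = self.counts[treatment]
        if success:
            record[0] += 1
        record[1] += 1

    def success_rate(self, treatment):
        successes, cases = self.counts[treatment]
        return successes / cases if cases else 0.0

    def best_treatment(self):
        """Treatment with the highest observed success rate so far."""
        return max(self.counts, key=self.success_rate)


tracker = OutcomeTracker()

# Made-up early cases: B looks better than A.
for treatment, success in [("A", True), ("A", False), ("B", True), ("B", True)]:
    tracker.update(treatment, success)
first_choice = tracker.best_treatment()   # "B" (1.0 vs 0.5)

# Made-up later cases change the picture.
for treatment, success in [("A", True), ("A", True), ("B", False), ("B", False)]:
    tracker.update(treatment, success)
second_choice = tracker.best_treatment()  # "A" (0.75 vs 0.5)
```

The sketch also shows why the caution about biased data matters: the recommendation reflects only the cases that were recorded, so if some populations or outcomes are missing from the data, the tally will quietly inherit that gap.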

Growing Support for AI in Medicine

AI is already a part of standard practice in many medical networks; data show that intelligent machines are saving lives. China is using an AI program called Infervision to keep pace with reviewing the annual 1.4 billion lung CT scans ordered to detect early signs of cancer. The application has proven successful in addressing the country’s radiologist shortage and provider fatigue (and medical errors) associated with reading endless scans. Infervision is programmed to teach its algorithms to diagnose cancer more efficiently.


The University of North Carolina is using AI from IBM to comb through data such as discharge summaries, doctors’ notes and registration forms to identify high-risk patients and formulate prevention care plans for them. Google’s DeepMind initiative aims to mimic the brain’s thought processes to reduce time needed for diagnosis and formulating treatment plans. One implementation of DeepMind, at University College London, aims to quickly assess brain images and recommend appropriate, life-saving radiotherapy treatments.

But not every AI-driven medical application has been a success.

IBM’s AI-driven machine Watson (whose original claim to fame was beating human contestants at “Jeopardy” on prime-time TV), was hyped as a state-of-the-art personalized cancer treatment genius to millions of patients and providers around the world. In 2012, MD Anderson Cancer Center at the University of Texas paid Big Blue millions to partner in bringing Watson into the fold. In 2017, however, the university withdrew from the partnership. The main problems, although not clearly articulated by MD Anderson, may have been issues with procurement, cost overruns and delays. In addition, expectations rose to unrealistic levels due to publicity and hype.

Authors of a blog published in “Health Affairs” who commented on Watson’s failure noted that AI was no replacement for human involvement in clinical care; machines and robots are not poised to take on all the reasoning and complex tasks inherent in health care. They did concede, however, that AI is suited for reducing time spent on resource-intensive tasks.  

Although Watson was not yet ready for prime time, the authors noted, “… carefully examining and evaluating opportunities to automate tasks and augment decisions with machine learning can quickly yield benefits in everyday care. Furthermore, by taking a practical approach to evaluating and adopting machine learning, health systems can improve patient care today, while preparing for future innovations.”

Don’t Let the AI Train Pass You By

The AI marketplace is exploding; according to a recent analysis by Accenture, AI will grow to a $6.6 billion industry by 2021; new health care AI initiatives are cropping up everywhere and investors are being encouraged to invest. This bodes well for providers; dollars invested in AI health care will ensure that current applications are expanded, that extant programs (such as Watson) receive the tweaking and refinement they need, and that new AI opportunities are funded.

AI could hold the keys to improving our beleaguered healthcare system in ways that were unimaginable a couple of decades ago. Projections indicate that insurance companies could save up to $7 billion over 18 months by using AI-driven technologies to streamline administrative processes. Several insurers are already using AI chatbots to answer simple patient questions and predict emergency-room visits.

Providers could similarly use AI for administrative tasks, freeing up staff for more patient-oriented encounters, expediting insurance reimbursement, and effecting referrals. That is in addition to the myriad care-related support functions that AI can currently provide; many more are in the works and will be rolled out in the upcoming months, years, and decades.

Skepticism is natural, and thorough vetting is necessary to ensure that Silicon Valley and other Big Tech promises don’t override peer-reviewed research on the efficacy of AI applications; consistent testing and feedback from a wide array of providers in a variety of disciplines; and appropriate rollout to ensure the safety, accuracy and applicability of AI in medicine.

The robots and machines are becoming more powerful. No worries — there will always be a doctor in the house, and, like the new technology, she’ll be better than ever.


Lise Millay Stevens, M.A., is a contract medical communications specialist who has served at the New York City Department of Health & Mental Hygiene (Deputy Director of Publications); at the American Medical Association (managing editor of five JAMA journals) and then senior press associate and editorial manager of Neurology. Lise is a member of the American Medical Writers’ Association (past-president, Chicago), the New York Press Club, and the Society of Professional Journalists.

The post Can Machine Learning Help Give Boost to Flagging Healthcare? appeared first on Lucidworks.

The Demise Of The Digital Preservation Network / David Rosenthal

Now that I've had a chance to read the Digital Preservation Network (DPN): Final Report, I feel the need to add to my initial reactions in Digital Preservation Network Is No More, which were based on Roger Schonfeld's Why Is the Digital Preservation Network Disbanding?. Below the fold, my second thoughts.

The DPN started in 2012 and:
it was anticipated that there would be different nodes specializing in different types of content (e.g., text, data and moving images) and providing replication, audit, succession etc. at the bit level across the nodes; and 2) relatedly, the goal was to start at the most basic level (i.e., bit-level preservation with audit and succession) and then start working up the stack of services that are involved in full-blown digital preservation
What was the landscape of digital preservation back in 2012 that motivated the DPN? A year earlier, I had written A Brief History of E-Journal Preservation. Referring to it, we see that by 2012:
Thus e-journal preservation systems had a lot of experience showing that the real problem was economic, not technical, and that ingest was the largest cost. The LOCKSS team’s rule of thumb was that it was half the lifetime cost, with preservation being a third and access a sixth. And ingesting e-journals was cheap and easy compared to the less well-organized content the DPN hoped to target.
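
As a quick check on that rule of thumb, the three shares (a half, a third and a sixth) add up to the whole lifetime cost. The sketch below splits a budget accordingly; the $600,000 figure and the function name are hypothetical, chosen only to make the numbers round.

```python
from fractions import Fraction

# Rule-of-thumb shares of lifetime cost quoted in the text.
SHARES = {
    "ingest": Fraction(1, 2),
    "preservation": Fraction(1, 3),
    "access": Fraction(1, 6),
}

def split_lifetime_cost(total):
    """Split a lifetime budget across the three phases."""
    return {phase: float(total * share) for phase, share in SHARES.items()}

# The three shares account for the entire lifetime cost.
assert sum(SHARES.values()) == 1

# Hypothetical budget, for illustration only.
print(split_lifetime_cost(600_000))
```

On this rule, every dollar saved on ingest matters more than a dollar saved on either of the other two phases, which is why the ease of ingest dominates the economics.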

E-journal preservation economics were based on protecting institutions’ investment in expensive subscription content. Elsewhere, things were less sustainable. Institutional repositories contained little, and what they did was not very important. The reason was that getting stuff into them was too hard and costly.

As I wrote in my initial DPN post:
Each of the libraries represented had made significant investments in establishing an institutional repository, which was under-utilized due to the difficulty of persuading researchers to deposit materials. With the video collection out of the picture as too expensive, the librarians seized on diversity as the defense against the monoculture threat to preservation. In my view there were two main reasons:
  • Replicating pre-ingested content from other institutions was a quicker and easier way to increase the utilization of their repository than educating faculty.
  • Jointly marketing a preservation service that, through diversity, would be more credible than those they could offer individually was a way of transferring money from other libraries' budgets to their repositories' budgets.
Alas, this meant that the founders' incentives were not aligned with their customers'.
Of course, the diversity goal also meant that the DPN was an add-on to their existing institutional repositories. A hypothetical converged system would have been a threat to them.

The DPN’s pitch to customers was, in effect, that it would be a better institutional repository than one they ran themselves. Making the economics of "institutional repository as a service" sustainable required greatly improving the ingest process at each node for the content type in which it specialized. That was what would determine the operational expenses, and thus the prices the DPN needed to charge. Doing so posed major:
  • design problems, because metadata for the content was not standardized between the submitting institutions (unlike the fairly standard e-journal metadata),
  • implementation problems, because there were no off-the-shelf solutions, and
  • cost problems, because this required site- and content-type-specific development, not development shared between the nodes.
The technical goal DPN’s management set themselves wasn’t to solve this critical customer-facing business model problem. It was to solve the internal problem of replicating and auditing the content that wasn't going to be in the nodes in the first place, because it was too hard to get it in. Despite the fact that replicating and auditing was a problem that could have been solved by assembling off-the-shelf, production-proven components, it took them 2 years to hire a technical lead capable of reaching consensus on how to solve it:
In December of 2014, Dave Pcolar was hired as the Chief Technical Officer and with his leadership and direction, a consensus was reached on the best approach to develop the network.
The consensus was that the nodes would export a custom REST API. Because diversity was the whole point of the DPN, each node had to implement both the server and client sides of the API to integrate with their existing repository infrastructure. Pretty much the only shared implementation effort was the API specification. Which, of course, is what the diversity goal was intended to achieve.
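
To make concrete what bit-level audit of replicated content involves, here is a minimal sketch that compares each node's copy against a digest recorded at deposit. It assumes SHA-256 digests and in-memory byte strings standing in for node copies; the node names and function names are hypothetical, and this is not the DPN REST API.

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(content).hexdigest()

def audit(expected_digest, replicas):
    """Return the sorted names of nodes whose copy no longer matches
    the digest recorded at deposit."""
    return sorted(
        node for node, data in replicas.items()
        if sha256_hex(data) != expected_digest
    )

original = b"archival object"
recorded = sha256_hex(original)  # digest stored when the object was deposited

# Hypothetical replicas; node_c simulates a flipped character (bit rot).
replicas = {
    "node_a": original,
    "node_b": original,
    "node_c": b"archival objecT",
}
damaged = audit(recorded, replicas)
print(damaged)  # nodes whose copies need repair from a good replica
```

The logic is simple and well understood. The cost lay in every node wiring something like it into a different existing repository.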

The problem was that the participating institutional repositories were uneconomic and mostly empty. It could not be solved without making the ingest process much cheaper and easier. After all, someone was going to have to do the work and pay the cost of ingest. Not realizing this was a major management failure. As the final report shows, the customers told them that this was the requirement:
institutions repeatedly stated that they did not have a good workflow for digital preservation. Many institutions said that they did not have sufficient in-depth knowledge of their digital collections to manage them for long-term preservation. Local systems for managing content did not have a built-in “export to DPN” function and this presented a problem of how to prepare and move the content for deposit into DPN.
But that wasn't the real management failure. It was true that diversity improved the network's robustness against hypothetical future attacks and failures. The fundamental management failure was not to appreciate that, in return for this marginal future benefit, diversity immediately guaranteed that the product they had to offer would be more expensive and take longer to build, be more expensive to operate and maintain, and be more complex and thus less reliable than a centralized commercial competitor. Several of which duly arrived in the market before DPN did.

Indeed, a year before DPN started, the commercial pioneer of outsourced institutional repositories, bepress, was already focused on this area. There was clearly a market for outsourcing institutional repositories. By 2017, bepress had:
more than 500 participating institutions, predominantly US colleges and universities. bepress claims a US market share of approximately 50% overall, recognizing that not all institutions have an institutional repository. Among those universities that conduct the greatest amount of research, for example the 115 US universities with highest research activity, bepress lists 34 as Digital Commons participants, for a market share of about 30%.
DPN management should have been aware of the potential competition. They could have reviewed the problem at the start, saying to the sponsoring institutions:
The diversity thing isn't going to be viable. What the world needs is a major improvement in the cost and ease-of-use of institutional repository ingest. Why don't we spend the money on that instead?
Unfortunately, this wouldn't have worked for two main reasons:
  • The management had no concrete plan for solving the cost and ease-of-use problem, which was widely known to be very difficult, so success was unlikely.
  • If success were achieved, it would benefit all institutional repositories, including the potential commercial competitors. Benefiting the repositories of the institutions behind the DPN was its real goal.
Going into the early discussions I didn't understand what the real goal was. At this point I need to confess that the focus on mitigating monoculture risk may have been my fault. If I recall correctly, I was the one who raised the issue. I hoped that the need for inter-operation among the institutions would motivate a second, independent implementation of the LOCKSS protocol. That would have both provided a well-proven basis for interoperability among the DPN nodes, and allowed LOCKSS to mitigate monoculture risk by using it for some of the LOCKSS boxes.

I didn't understand that the various institutional repositories saw LOCKSS not as a useful technology but as a competing system. To them, whatever solution emerged from the meetings it was important that LOCKSS not be part of it. Once I figured that out, there was no point in participating in further meetings.

Table 1
Heading        Cost          Percent
Development    $2,782,693    39.7%
Operations     $134,454      1.9%
Marketing      $795,136      11.4%
Overhead       $3,289,038    47.0%
Total          $7,001,321

All told, DPN spent just over $7M. Table 1 shows where the money went. Note that Overhead and Marketing consumed almost 60% of the total spend.
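
The percentages and the "almost 60%" figure can be recomputed from the dollar amounts in Table 1. A small sketch, using only the numbers from the table (the variable names are mine):

```python
# Dollar figures copied from Table 1.
costs = {
    "Development": 2_782_693,
    "Operations": 134_454,
    "Marketing": 795_136,
    "Overhead": 3_289_038,
}
total = sum(costs.values())

# Each heading as a percentage of total spend, to one decimal place.
percent = {heading: round(100 * cost / total, 1) for heading, cost in costs.items()}

# Combined share of Overhead and Marketing.
overhead_and_marketing = round(
    100 * (costs["Overhead"] + costs["Marketing"]) / total, 1
)

print(total)
print(percent)
print(overhead_and_marketing)
```

The four line items sum exactly to the reported total of $7,001,321, and Overhead plus Marketing together come to a little over 58% of it.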

Table 2
Institution    R&D
AP Trust       $375,097
Chronopolis    $414,882
DuraSpace      $581,610
Hathi Trust    $314,018
Stanford       $657,534
Texas          $439,552

Table 2 shows where the R&D spending went, illustrating the distributed and site-specific nature of the development mandated by the diversity goal.
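
As a consistency check, the six R&D figures in Table 2 sum to the Development line of Table 1. The sketch below uses only numbers from the two tables; the variable names are mine.

```python
# R&D spending by institution, copied from Table 2.
rd_by_institution = {
    "AP Trust": 375_097,
    "Chronopolis": 414_882,
    "DuraSpace": 581_610,
    "Hathi Trust": 314_018,
    "Stanford": 657_534,
    "Texas": 439_552,
}
rd_total = sum(rd_by_institution.values())

# Development line from Table 1.
development_table_1 = 2_782_693

# Share of the R&D spend that went to each institution, in percent.
shares = {name: round(100 * amount / rd_total, 1)
          for name, amount in rd_by_institution.items()}

print(rd_total == development_table_1)
print(shares)
```

No single institution received as much as a quarter of the development money, which fits the picture of six separate, site-specific efforts.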

In my view, the key lesson to be learnt from the DPN Final Report is in this graph, from page 15. It shows that the vast majority of the per-TB cost of the system in operation was in overhead, not in actually preserving content. To be viable, the system would have had to preserve enormous amounts of data, while holding overhead costs constant. Of course, preserving vast amounts of data without increasing overhead would have needed much more efficient ingest mechanisms.
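
The shape of that argument can be sketched with a simple model: per-TB cost is a fixed overhead spread over the data preserved, plus a marginal cost per TB. Every number below is hypothetical and chosen only to show the shape of the curve; none comes from the Final Report.

```python
def cost_per_tb(fixed_overhead, marginal_per_tb, terabytes):
    """Per-TB cost when a fixed overhead is spread over the preserved data."""
    return fixed_overhead / terabytes + marginal_per_tb

# Hypothetical figures, NOT taken from the DPN Final Report.
FIXED_OVERHEAD = 1_000_000   # annual overhead in dollars
MARGINAL_PER_TB = 100        # storage and operations per TB in dollars

for tb in (100, 1_000, 10_000):
    print(tb, cost_per_tb(FIXED_OVERHEAD, MARGINAL_PER_TB, tb))
```

In this model the per-TB cost approaches the marginal cost only when the volume preserved grows by orders of magnitude while overhead stays flat.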

The key design principle of the LOCKSS Program from its birth in 1998 was to spend on hardware to minimize operational and overhead costs. As we wrote in 2003:
Minimizing the cost of participating in the LOCKSS system is essential to its success, so individual peers are built from low-cost, unreliable technology. A generic PC with three 180GB disks currently costs under $1000 and would preserve about 210 years of the largest journal we have found (the Journal of Biological Chemistry) for a worst-case hardware cost of less than $5 per journal/year.
Peers require little administration, relying on cooperation with other caches to detect and repair failures. There is no need for off-line backups on removable media. Creating these backups, and using them when readers request access to data, would involve excessive staff costs and latencies beyond a reader’s attention span.
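
The arithmetic behind the quoted 2003 figure is easy to reproduce. The sketch below takes $1000 as the upper bound on the PC's cost and, as a further assumption of mine, treats all three disks as available for content when deriving the implied size of a journal-year.

```python
pc_cost_usd = 1000        # "under $1000", so this is an upper bound
disks, disk_gb = 3, 180   # three 180GB disks
journal_years = 210       # years of the largest journal that fit

capacity_gb = disks * disk_gb

# Worst-case hardware cost per year of journal content.
cost_per_journal_year = pc_cost_usd / journal_years

# Implied size of one journal-year, ASSUMING all disk space holds content.
implied_gb_per_year = capacity_gb / journal_years

print(capacity_gb)
print(round(cost_per_journal_year, 2))
print(round(implied_gb_per_year, 2))
```

The result comes in under the quoted bound of $5 per journal/year.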
The LOCKSS technology was often criticized for wasting disk since a system with many fewer copies could achieve the same theoretical reliability. The critics didn't understand the point made by the graph. The important thing to minimize is the thing that costs the most, which at scales less than Petabytes is never the hardware:
The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at [2014] prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk.
An interesting and constructive suggestion for future efforts is at the end of this Quartz piece.

Open Knowledge Foundation community meet up at csv,conf,v4 / Open Knowledge Foundation

  • When: May 7th, 5-7pm
  • Location: Eliot Center, Portland, OR
  • Cost: Free; pizza & beverages available

Join Open Knowledge Foundation (OKF) for a community event the night before csv,conf,v4! This meet and greet happy hour will feature lightning talks on open projects, designated networking time, and pizza. We invite OKF community members to submit ideas for short lightning talks (5 minutes maximum).

Do you want to give a talk, but aren’t already a member of the OKF community? No problem! We are an inclusive community of Open enthusiasts (open data, open science, open source, open government, etc), and the evening is open to anyone who wants to share their ideas.

Come learn more about what we do, the open projects our members are working on, ways to get involved with an open project, and meet others! This event is open to all (including csv,conf,v4 attendees as well as other open enthusiasts).



More about csv,conf,v4

csv,conf is a community conference for data makers everywhere, bringing diverse groups together to discuss data topics, and featuring stories about data sharing and data analysis from science, journalism, government, and open source. It takes place from May 8-9 2019 at the Eliot Center in Portland, Oregon. More information on the program is available from the website, and you can still get your conference tickets on Eventbrite.


More about Open Knowledge Foundation (OKF):

OKF is a global non-profit organisation and worldwide network of people passionate about openness, and using advocacy, technology and training to unlock information and enable people to work with it to create and share knowledge. Chat with us on Gitter, join a discussion on our Forum, or check out our projects for ways to get involved!

Alternatives to Statistics for Measuring Success and Value of Cataloging / HangingTogether

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Jennifer Baxmeyer of Princeton, Erin Grant of the University of Washington, and John Riemer of University of California, Los Angeles. Traditionally, the most common measure of cataloger productivity has been statistics on the number of records produced and time spent cataloging. As cataloging and metadata librarians become more involved in other activities that are not easily quantifiable (e.g., participating in linked data or similar projects), the problem of measuring productivity and success becomes more difficult.

Cataloging statistics have never captured the impact of cataloging on the end-user; they give no indication of how the bibliographic descriptions in the catalog actually contribute to a user’s success in finding the appropriate resource. Moreover, they add a perverse incentive for catalogers to work only on the easiest titles or problems, rather than on those with the biggest impact or best use of catalogers’ expertise, as those are the ones that will maximize statistics. Finally, since more cataloging data is available from outside sources, and records are often loaded in large batches into the library’s catalog, the number of titles added may not accurately reflect a single person’s contribution to the library’s catalog.

Our discussions focused on communicating the successes and challenges of metadata specialists to the rest of one’s institution, especially through “stories,” and on how metadata contributes to the division’s and organization’s strategic goals. Strategic goals may include “foster discovery and use” of the library’s resources and “enriching the user experience,” for which good metadata is critical. Discovery layers often highlight flaws in the metadata.

Key to demonstrating the value of metadata is integrating metadata specialists into other activities undertaken by other units. Examples of such integration include participating in collection usage and weeding projects, moving collections, convening conversations between different groups such as digital collections, and publicizing who the metadata experts are who can provide metadata consultancy—offering “metadata as a service.” It’s important that other units in the library realize that they need a metadata expert before they’ve proceeded too far with a project that should have had one. Metadata managers have discovered many needs for consulting services, such as content analysis and taxonomy for public-facing web pages. Actively contributing to other units’ activities reinforces the importance of metadata.

Institutions use a variety of methods to communicate their work and completion of project milestones, including on-line newsletters, blog posts, social media, tie-ins to other library activities such as exhibits, consultations, articles in professional literature, brown-bag lunches, workshops and other events, staff presentations, and reports of highlights from professional conferences as they reflect on one’s own work.

Although most metadata managers keep statistics (and may be required to do so) the impact of metadata goes beyond record numbers. We agree that focusing on outcomes and the impact of metadata on user discovery is important, but these outcomes are hard to measure or document. Metadata departments that have strong senior management support feel their work is valued. Managers may find it challenging to evaluate the work of individuals who do different types of work. Statistics do not adequately represent the breadth of ways materials are described for all the various types of materials collected, but they may be useful as a baseline to identify anomalies, trends, the effect of changed workflows, and the extent of backlogs rather than as a ruler to measure productivity. Knowing what remains to be cataloged is more important than what has been cataloged, and that number can be used to justify adding or retaining positions.

Metadata specialists try to manage the quality of batch-loaded records from vendors but can feel like it’s a “losing battle.” We noted that the distinction between “copy” and “original” cataloging is losing its meaning; the distinction between batch-processing and records you have to hand-touch is becoming more meaningful. This distinction is blurred now that so much machine-generated metadata needs to be remediated later. Meanwhile, more attention is spent on creating metadata for hidden or distinct collections and unpublished materials.

Part of the challenge is engulfed by ongoing concern about the position of libraries: is the library in the line of sight for the people we want to use it? Discovery happens elsewhere, and we’ve failed as a profession to get our content into the place people are looking for it, although our metadata can help to do this in a linked data environment.  In Australia, discovery often happens first through Trove, a portal that aggregates content from libraries, museums, archives, repositories, and other collecting organizations around Australia. The absence of library linked data services today may be considered as a “temporary inconvenience” while the infrastructure evolves. RDA (Resource Description and Access) attempts to create a standard that is encoding-schema neutral, but vendors aren’t supporting it yet, and libraries aren’t demanding it.

Metadata managers need to balance allocation of staff to “traditional cataloging activities” with more exploratory R&D projects that do not directly relate to getting more metadata into catalogs, such as linked data projects and exploring Wikidata and ISNI. What’s needed is a culture shift, from pride around production alone to valuing opportunities to learn and explore new approaches.  Metadata specialists need to understand that improving all metadata is more important than any individual’s productivity numbers. This culture shift requires buy-in from administrators to support training programs for staff to learn new ways of doing things and to view metadata specialists as more than just “production machines.” Metadata managers faced with staff reductions while still being expected to maintain production levels can justify allocating staff time for R&D or “play time” to explore such questions as: What can we stop doing? What is the one thing you learned that we all need to do more of? What do you need to move forward? What open source software could help us do the work more efficiently? Managers need to set goals for success not based only on numbers.

Part of this culture shift is to out-source or train support staff to create metadata for the “easier stuff” and mandate that catalogers only do what well-trained humans can do. Scope the materials requiring metadata that support staff or students can handle, providing templates where possible. If you take away all the easier materials, what’s often left is metadata that requires expertise in languages or formats and in describing (and disambiguating) persons, organizations, and other entities.

To encourage the culture shift among metadata specialists to change their mindsets about how they work and stimulate interest in learning opportunities, metadata managers have used several approaches:

  • Identify who on your team has the aptitude to pick up new skills. At one institution, the staff member shared what she learned and the whole unit became “lively” because she brought her colleagues along. It created appreciation for “continuous learning” and staff presented at national conferences what they were doing.
  • Convene group discussions to look at problem metadata and come up with solutions, encouraging staff to move forward together. Staff less interested in new skills can pick up some of the production from those learning new skills and producing less.
  • Launch “reading clubs” where staff all read an article and respond to three discussion questions to inspire metadata specialists to think about broader metadata issues outside of their daily work. (One of the readings was my 2017 blog post on “New skills for metadata management.”)
  • Hold weekly group “video-viewing brown-bag lunches” for staff on new developments such as linked data so staff can together “watch and learn.”
  • Participate in multi-institutional projects.
  • Encourage participation in professional conferences and standards development.

Discussants valued sharing ideas about facilitating the cultural aspects of shifting priorities and disseminating the value of metadata throughout the library and the institution.

The post Alternatives to Statistics for Measuring Success and Value of Cataloging appeared first on Hanging Together.

EU Council backs controversial copyright crackdown / Open Knowledge Foundation

The Council of the European Union today backed a controversial copyright crackdown in a ‘deeply disappointing’ vote that could impact on all internet users.

Six countries (Italy, Luxembourg, the Netherlands, Poland, Finland and Sweden) voted against the proposal, which has been opposed by 5 million people through a Europe-wide petition.
Three more nations abstained, but the UK voted for the crackdown and there were not enough votes for a blocking minority.

The proposal is expected to lead to the introduction of ‘filters’ on sites such as YouTube, which will automatically remove content that could be copyrighted. While entertainment footage is most likely to be affected, academics fear it could also restrict the sharing of knowledge, and critics argue it will have a negative impact on freedom of speech and expression online.

EU member states will have two years to implement the law, and the regulations are still expected to affect the UK despite Brexit.

The Open Knowledge Foundation said the battle is not over, with the European elections providing an opportunity to elect ‘open champions’.

Catherine Stihler, chief executive of the Open Knowledge Foundation, said:

“This is a deeply disappointing result which will have a far-reaching and negative impact on freedom of speech and expression online.

The controversial crackdown was not universally supported, and I applaud those national governments which took a stand and voted against it.

We now risk the creation of a more closed society at the very time we should be using digital advances to build a more open world where knowledge creates power for the many, not the few.

But the battle is not over. Next month’s European elections are an opportunity to elect a strong cohort of open champions at the European Parliament who will work to build a more open world.”

Blockchain: What's Not To Like? / David Rosenthal

I gave a talk at the Fall CNI meeting entitled Blockchain: What's Not To Like? The abstract was:
We're in a period when blockchain or "Distributed Ledger Technology" is the Solution to Everything™, so it is inevitable that it will be proposed as the solution to the problems of academic communication and digital preservation. These proposals typically assume, despite the evidence, that real-world blockchain implementations actually deliver the theoretical attributes of decentralization, immutability, anonymity, security, scalability, sustainability, lack of trust, etc. The proposers appear to believe that Satoshi Nakamoto revealed the infallible Bitcoin protocol to the world on golden tablets; they typically don't appreciate or cite the nearly three decades of research and implementation that led up to it. This talk will discuss the mis-match between theory and practice in blockchain technology, and how it applies to various proposed applications of interest to the CNI audience.
Below the fold, an edited text of the talk with links to the sources, and much additional material. The colored boxes contain quotations that were on the slides but weren't spoken.

Update: the video of my talk has now been posted on YouTube and Vimeo.

It’s one of these things that if people say it often enough it starts to sound like something that could work,
Sadhbh McCarthy

I'd like to start by thanking Cliff Lynch for inviting me back even though I'm retired, and for letting me debug the talk at Berkeley's Information Access Seminar. I plan to talk for 20 minutes, leaving plenty of time for questions. A lot of information will be coming at you fast. Afterwards, I encourage you to consult the whole text of the talk and much additional material on my blog. Follow the links to the sources to get the details you may have missed.

We're in a period when blockchain or "Distributed Ledger Technology" is the Solution to Everything™ so it is inevitable that it will be proposed as the solution to the problems of academic communication and digital preservation. In the second of a three-part series Ian Mulvaney has a comprehensive review of the suggested applications of blockchain in academia in three broad classes:

    • Priority Claims
      • Claims about authorship of a paper
      • Reviews of articles
      • Tracking article versions from preprint to publication
      • Claims about generation of data
      • Linking research artefacts together
      • Claims about facts and micro statements
    • Resources
      • Access to compute time
      • Access to lab time
      • Tracking of physical reagents
    • Rights
      • Rights transfers around copyright, articles or journals
    blockchain in STEM - part 2
    Ian Mulvaney

    • Priority Claims
    • Access to Resources
    • Rights
    Mulvaney discusses each in some detail and doesn't find a strong case for any of them. In a third part he looks at some of the implementation efforts currently underway and divides their motivations into two groups. I quote:
    The first comes from commercial interests where management of rights, IP and ownership is complex, hard to do, and has led to unusable systems that are driving researchers to sites like SciHub, scaring the bejesus out of publishers in the process.

    The other trend is for a desire to move to a decentralised web and a decentralised system of validation and reward, in a way trying to move even further away from the control of publishers.

    It is absolutely fascinating to me that two diametrically opposite philosophical sides are converging on the same technology as the answer to their problems. Could this technology perhaps be just holding up an unproven and untrustworthy mirror to our desires, rather than providing any real viable solutions?
    This talk answers Mulvaney's question in the affirmative. I've been writing skeptically about cryptocurrencies and blockchain technology for more than five years. What are my qualifications for such a long history of pontification?

    This is not to diminish Nakamoto's achievement but to point out that he stood on the shoulders of giants. Indeed, by tracing the origins of the ideas in bitcoin, we can zero in on Nakamoto's true leap of insight—the specific, complex way in which the underlying components are put together.
    Bitcoin's Academic Pedigree,
    Arvind Narayanan and Jeremy Clark

    More than fifteen years ago, nearly five years before Satoshi Nakamoto published the Bitcoin protocol, a cryptocurrency based on a decentralized consensus mechanism using proof-of-work, my co-authors and I won a "best paper" award at the prestigious SOSP workshop for a decentralized consensus mechanism using proof-of-work. It is the protocol underlying the LOCKSS system. The originality of our work didn't lie in decentralization, distributed consensus, or proof-of-work. All of these were part of the nearly three decades of research and implementation leading up to the Bitcoin protocol, as described by Arvind Narayanan and Jeremy Clark in Bitcoin's Academic Pedigree. Our work was original only in its application of these techniques to statistical fault tolerance; Nakamoto's only in its application of them to preventing double-spending in cryptocurrencies.

    We're going to walk through the design of a system to perform some function, say monetary transactions, storing files, recording reviewers' contributions to academic communication, verifying archival content, whatever. Being of a naturally suspicious turn of mind, you don't want to trust any single central entity, but instead want a decentralized system. You place your trust in the consensus of a large number of entities, which will in effect vote on the state transitions of your system (the transactions, reviews, archival content, ...). You hope the good entities will out-vote the bad entities. In the jargon, the system is trustless (a misnomer).

    Techniques using multiple voters to maintain the state of a system in the presence of unreliable and malign voters were first published in The Byzantine Generals Problem by Lamport et al in 1982. Alas, Byzantine Fault Tolerance (BFT) requires a central authority to authorize entities to take part. In the blockchain jargon, it is permissioned. You would rather let anyone interested take part, a permissionless system with no central control.

    In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently.
    The Meaning of Decentralization,
    Vitalik Buterin, co-founder of Ethereum

    The security of your permissionless system depends upon the assumption of uncoordinated choice, the idea that each voter acts independently upon its own view of the system's state.

    If anyone can take part, your system is vulnerable to Sybil attacks, in which an attacker creates many apparently independent voters who are actually under his sole control. If creating and maintaining a voter is free, anyone can win any vote they choose simply by creating enough Sybil voters.
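A minimal sketch of why free identities are fatal to one-voter-one-vote consensus. The function names and the "valid"/"forged" labels are illustrative assumptions, not part of any real protocol; the point is only that the outcome follows the number of identities, not the number of independent parties.

```python
from collections import Counter

def majority_vote(votes):
    """Return the value backed by the most voters (one identity, one vote)."""
    return Counter(votes).most_common(1)[0][0]

def run_vote(honest_voters, sybil_voters):
    # Honest voters report the true state. Every Sybil identity is
    # controlled by a single attacker and reports the attacker's state.
    votes = ["valid"] * honest_voters + ["forged"] * sybil_voters
    return majority_vote(votes)

# A handful of Sybils loses the vote...
print(run_vote(honest_voters=100, sybil_voters=10))
# ...but if identities cost nothing, the attacker just creates more of them.
print(run_vote(honest_voters=100, sybil_voters=101))
```

The second call is won by a single attacker, even though 100 independent parties disagree with him.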

    From a computer security perspective, the key thing to note ... is that the security of the blockchain is linear in the amount of expenditure on mining power, ... In contrast, in many other contexts investments in computer security yield convex returns (e.g., traditional uses of cryptography) ... analogously to how a lock on a door increases the security of a house by more than the cost of the lock.
    The Economic Limits of Bitcoin and the Blockchain,
    Eric Budish, Booth School, University of Chicago

    So creating and maintaining a voter has to be expensive. Permissionless systems can defend against Sybil attacks by requiring a vote to be accompanied by a proof of the expenditure of some resource. This is where proof-of-work comes in; a concept originated by Cynthia Dwork and Moni Naor in 1992. To vote in a proof-of-work blockchain such as Bitcoin's or Ethereum's requires computing very many otherwise useless hashes. The idea is that the good voters will spend more, compute more useless hashes, than the bad voters.
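The "otherwise useless hashes" can be sketched as a search for a nonce whose hash falls below a target. This is a simplified illustration, assuming a single SHA-256 over a made-up header and a difficulty expressed as leading zero bits; it is not Bitcoin's or Ethereum's actual block format.

```python
import hashlib

def meets_target(header: bytes, nonce: int, difficulty_bits: int) -> bool:
    """True if SHA-256(header + nonce) starts with `difficulty_bits` zero bits."""
    digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

def mine(header: bytes, difficulty_bits: int) -> int:
    """Try nonces in order until one meets the target (the expensive part)."""
    nonce = 0
    while not meets_target(header, nonce, difficulty_bits):
        nonce += 1
    return nonce

header = b"vote for block 42"          # hypothetical payload
nonce = mine(header, difficulty_bits=16)
# Finding the nonce takes about 2**16 hashes on average; checking it takes one.
print(nonce, meets_target(header, nonce, 16))
```

The asymmetry is the point: producing the proof is costly, verifying it is cheap, and each extra bit of difficulty doubles the expected cost of a vote.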

    The blockchain trilemma
    much of the innovation in blockchain technology has been aimed at wresting power from centralised authorities or monopolies. Unfortunately, the blockchain community’s utopian vision of a decentralised world is not without substantial costs. In recent research, we point out a ‘blockchain trilemma’ – it is impossible for any ledger to fully satisfy the three properties shown in [the diagram] simultaneously ... In particular, decentralisation has three main costs: waste of resources, scalability problems, and network externality inefficiencies.
    The economics of blockchains,
    Markus K Brunnermeier & Joseph Abadi, Princeton

    Brunnermeier and Abadi's Blockchain Trilemma shows that a blockchain has to choose at most two of the following three attributes:
    • correctness
    • decentralization
    • cost-efficiency
    Obviously, your system needs the first two, so the third has to go. Running a voter (mining in the jargon) in your system has to be expensive if the system is to be secure. No-one will do it unless they are rewarded. They can't be rewarded in "fiat currency", because that would need some central mechanism for paying them. So the reward has to come in the form of coins generated by the system itself, a cryptocurrency. To scale, permissionless systems need to be based on a cryptocurrency; the system's state transitions will need to include cryptocurrency transactions in addition to records of files, reviews, archival content, whatever.

    Your system needs names for the parties to these transactions. There is no central authority handing out names, so the parties need to name themselves. As proposed by David Chaum in 1981 they can do so by generating a public-private key pair, and using the public key as the name for the source or sink of each transaction.

    we created a small Bitcoin wallet, placed it on images in our honeyfarm, and set up monitoring routines to check for theft. Two months later our monitor program triggered when someone stole our coins.

    This was not because our Bitcoin was stolen from a honeypot, rather the graduate student who created the wallet maintained a copy and his account was compromised. If security experts can't safely keep cryptocurrencies on an Internet-connected computer, nobody can. If Bitcoin is the "Internet of money," what does it say that it cannot be safely stored on an Internet connected computer?
    Risks of Cryptocurrencies,
    Nicholas Weaver, U.C. Berkeley

    In practice this is implemented in wallet software, which stores one or more key pairs for use in transactions. The public half of the pair is a pseudonym. Unmasking the person behind the pseudonym turns out to be fairly easy in practice.

    The security of the system depends upon the user and the software keeping the private key secret. This can be difficult, as Nicholas Weaver's computer security group at Berkeley discovered when their wallet was compromised and their Bitcoins were stolen.

    2-year Bitcoin "price" history
    The capital and operational costs of running a miner include buying hardware, power, network bandwidth, staff time, etc. Bitcoin's volatile "price", high transaction fees, low transaction throughput, and large proportion of failed transactions mean that almost no legal merchants accept payment in Bitcoin or other cryptocurrency. Thus one essential part of your system is one or more exchanges, at which the miners can sell their cryptocurrency rewards for the "fiat currency" they need to pay their bills.

    Who is on the other side of those trades? The answer has to be speculators, betting that the "price" of the cryptocurrency will increase. Thus a second essential part of your system is a general belief in the inevitable rise in "price" of the coins by which the miners are rewarded. If miners believe that the "price" will go down, they will sell their rewards immediately, a self-fulfilling prophecy. Permissionless blockchains require an inflow of speculative funds at an average rate greater than the current rate of mining rewards if the "price" is not to collapse. To maintain Bitcoin's "price" at $4K requires an inflow of $300K/hour.
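The $300K/hour figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the block reward of 12.5 BTC in force in late 2018 and a ten-minute block time (six blocks an hour); it ignores transaction fees, and the constant names are mine.

```python
BLOCK_REWARD_BTC = 12.5   # assumption: reward per block in late 2018
BLOCKS_PER_HOUR = 6       # assumption: one block roughly every 10 minutes

def hourly_inflow_usd(price_usd: float) -> float:
    """Dollars per hour speculators must supply to absorb newly mined coins."""
    return BLOCK_REWARD_BTC * BLOCKS_PER_HOUR * price_usd

print(hourly_inflow_usd(4_000))  # 300000.0
```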

    Ether Miners 10/10/18
    can we really say that the uncoordinated choice model is realistic when 90% of the Bitcoin network’s mining power is well-coordinated enough to show up together at the same conference?
    The Meaning of Decentralization,
    Vitalik Buterin

    In order to spend enough to be secure, say $300K/hour, you need a lot of miners. It turns out that a third essential part of your system is a small number of “mining pools”. Bitcoin has the equivalent of around 3M Antminer S9s, and a block time of 10 minutes. Each S9, costing maybe $1K, can expect a reward about once every 60 years. It will be obsolete in about a year, so only 1 in 60 will ever earn anything.

    To smooth out their income, miners join pools, contributing their mining power and receiving the corresponding fraction of the rewards earned by the pool. These pools have strong economies of scale, so successful cryptocurrencies end up with a majority of their mining power in 3-4 pools. Each of the big pools can expect a reward every hour or so. These blockchains aren’t decentralized, but centralized around a few large pools.
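The pressure to pool follows from simple arithmetic on the figures above. This sketch assumes every S9-equivalent has an equal chance of winning each block; the 15% pool share is a hypothetical value chosen for illustration, not a measured one.

```python
MINERS = 3_000_000            # S9-equivalents on the network (from the text)
BLOCK_MINUTES = 10            # Bitcoin block time (from the text)
MINUTES_PER_YEAR = 365 * 24 * 60

def solo_years_between_rewards(miners=MINERS, block_minutes=BLOCK_MINUTES):
    """Expected wait, in years, for one machine mining on its own."""
    return miners * block_minutes / MINUTES_PER_YEAR

def pool_minutes_between_rewards(share, block_minutes=BLOCK_MINUTES):
    """Expected wait, in minutes, for a pool holding `share` of the power."""
    return block_minutes / share

print(round(solo_years_between_rewards()))        # 57
print(round(pool_minutes_between_rewards(0.15)))  # 67
```

A lone S9 waits decades for a payout it will not live to see, while a pool with a sixth or so of the network is paid about once an hour.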

    At multiple times in 2014 one mining pool controlled more than 51% of the Bitcoin mining power. At almost all times since, 3-4 pools have controlled the majority of the Bitcoin mining power. Currently two of them are controlled by Bitmain, the dominant supplier of mining ASICs. With the advent of mining-as-a-service, 51% attacks have become endemic among the smaller alt-coins.

    The security of a blockchain depends upon the assumption that these few pools are not conspiring together outside the blockchain; an assumption that is impossible to verify in the real world (and by Murphy's Law is therefore false). Similar off-chain collusion among cryptocurrency traders allows for extremely profitable pump-and-dump schemes.

    Since then there have been other catastrophic bugs in these smart contracts, the biggest one in the Parity Ethereum wallet software ... The first bug enabled the mass theft from "multisignature" wallets, which supposedly required multiple independent cryptographic signatures on transfers as a way to prevent theft. Fortunately, that bug caused limited damage because a good thief stole most of the money and then returned it to the victims. Yet, the good news was limited as a subsequent bug rendered all of the new multisignature wallets permanently inaccessible, effectively destroying some $150M in notional value. This buggy code was largely written by Gavin Wood, the creator of the Solidity programming language and one of the founders of Ethereum. Again, we have a situation where even an expert's efforts fell short.
    Risks of Cryptocurrencies,
    Nicholas Weaver, U.C. Berkeley

    In practice the security of a blockchain depends not merely on the security of the protocol itself, but on the security of the core software and the wallets and exchanges used to store and trade its cryptocurrency. This ancillary software has bugs, such as the recently revealed major vulnerability in Bitcoin Core, the Parity Wallet fiasco, and the routine heists using vulnerabilities in exchange software.

    Recent game-theoretic analysis suggests that there are strong economic limits to the security of cryptocurrency-based blockchains. For safety, the total value of transactions in a block needs to be less than the value of the block reward.

    Your system needs an append-only data structure to which records of the transactions, files, reviews, archival content, whatever are appended. It would be bad if the miners could vote to re-write history, undoing these records. In the jargon, the system needs to be immutable (another misnomer).

    Merkle Tree (source)
    The necessary data structure for this purpose was published by Stuart Haber and W. Scott Stornetta in 1991. A company using their technique has been providing a centralized service of securely time-stamping documents for nearly a quarter of a century. It is a form of Merkle or hash tree, published by Ralph Merkle in 1980. For blockchains it is a linear chain to which fixed-size blocks are added at regular intervals. Each block contains the hash of its predecessor; a chain of blocks.
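A minimal sketch of the "chain of blocks" idea, assuming JSON-serialized blocks and SHA-256. It shows only the linear hash chain, not a full Merkle tree, and the field names are illustrative. Note that detection relies on someone holding a trusted copy of the head hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first block

def block_hash(block: dict) -> str:
    """Hash a block's canonical JSON form."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append(chain: list, records: list) -> None:
    """Add a block that commits to the hash of its predecessor."""
    prev = block_hash(chain[-1]) if chain else GENESIS
    chain.append({"prev": prev, "records": records})

def verify(chain: list, trusted_head: str) -> bool:
    """Check every link, then check the head against a trusted copy."""
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev:
            return False
        prev = block_hash(block)
    return prev == trusted_head

chain = []
for records in (["review A"], ["file B"], ["payment C"]):
    append(chain, records)
head = block_hash(chain[-1])

print(verify(chain, head))             # True
chain[0]["records"][0] = "review A*"   # try to rewrite history
print(verify(chain, head))             # False
```

Changing any old record changes that block's hash, which breaks the link stored in its successor, so rewriting history means recomputing every later block and still failing the comparison with the trusted head.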

    The blockchain is mutable; it is just rather hard to mutate it without being detected, because of the Merkle tree’s hashes, and easy to recover, because there are Lots Of Copies Keeping Stuff Safe. But this is a double-edged sword. Immutability makes systems incompatible with the GDPR, and immutable systems to which anyone can post information will be suppressed by governments.

    BTC transaction fees
    Cryptokitties’ popularity exploded in early December and had the Ethereum network gasping for air. ... Ethereum has historically made bold claims that it is able to handle unlimited decentralized applications  ... The Crypto-Kittie app has shown itself to have the power to place all network processing into congestion. ... at its peak [CryptoKitties] likely only had about 14,000 daily users. Neopets, a game to which CryptoKitties is often compared, once had as many as 35 million users.
    How Crypto-Kitties Disrupted the Ethereum Network,
    Open Trading Network

    A user of your system wanting to perform a transaction, store a file, record a review, whatever, needs to persuade miners to include their transaction in a block. Miners are coin-operated; you need to pay them to do so. How much do you need to pay them? That question reveals another economic problem, fixed supply and variable demand, which equals variable "price". Each block is in effect a blind auction among the pending transactions.
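The effect of fixed supply and variable demand can be sketched as a toy auction. It assumes every transaction is the same size, a block holds a fixed number of them, and miners simply take the highest bids; real fee markets are more complicated, and the bid values are made up.

```python
def fill_block(bids, capacity):
    """Return the fee bids a fee-maximizing miner would include."""
    return sorted(bids, reverse=True)[:capacity]

def cheapest_included(bids, capacity):
    """Lowest bid that still made it into the block, or None if no bids."""
    included = fill_block(bids, capacity)
    return included[-1] if included else None

CAPACITY = 3                       # fixed supply of block space
quiet = [1, 2]                     # fewer bidders than slots
busy = [1, 2, 5, 8, 9, 3, 7]       # more bidders than slots

print(cheapest_included(quiet, CAPACITY))  # 1
print(cheapest_included(busy, CAPACITY))   # 7
```

Supply is the same in both cases; only demand changed, and the fee needed to get into the block rose sevenfold.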

    So let's talk about CryptoKitties, a game that brought the Ethereum blockchain to its knees despite the bold claims that it could handle unlimited decentralized applications. How many users did it take to cripple the network? It was far fewer than non-blockchain apps can handle with ease; CryptoKitties peaked at about 14K users. NeoPets, a similar centralized game, peaked at about 2,500 times as many.

    CryptoKitties' average "price" per transaction spiked 465% between November 28 and December 12 as the game got popular, a major reason why it stopped being popular. The same phenomenon happened during Bitcoin's "price" spike around the same time. Cryptocurrency transactions are affordable only if no-one wants to transact; when everyone does, they immediately become unaffordable.

    Nakamoto's Bitcoin blockchain was designed only to support recording transactions. It can be abused for other purposes, such as storing illegal content. But it is likely that you need additional functionality, which is where Ethereum's "smart contracts" come in. These are fully functional programs, written in a JavaScript-like language, embedded in Ethereum's blockchain. They are mainly used to implement Ponzi schemes, but they can also be used to implement Initial Coin Offerings, games such as Cryptokitties, and gambling parlors. Further, in On-Chain Vote Buying and the Rise of Dark DAOs Philip Daian and co-authors show that "smart contracts" also provide for untraceable on-chain collusion in which the parties are mutually pseudonymous.

    ICO Returns
    The first big smart contract, the DAO or Decentralized Autonomous Organization, sought to create a democratic mutual fund where investors could invest their Ethereum and then vote on possible investments. Approximately 10% of all Ethereum ended up in the DAO before someone discovered a reentrancy bug that enabled the attacker to effectively steal all the Ethereum. The only reason this bug and theft did not result in global losses is that Ethereum developers released a new version of the system that effectively undid the theft by altering the supposedly immutable blockchain.
    Risks of Cryptocurrencies,
    Nicholas Weaver, U.C. Berkeley

    "Smart contracts" are programs, and programs have bugs. Some of the bugs are exploitable vulnerabilities. Research has shown that the rate at which vulnerabilities in programs are discovered increases with the age of the program. The problems caused by making vulnerable software immutable were revealed by the first major "smart contract". The Decentralized Autonomous Organization (The DAO) was released on 30th April 2016, but on 27th May 2016 Dino Mark, Vlad Zamfir, and Emin Gün Sirer posted A Call for a Temporary Moratorium on The DAO, pointing out some of its vulnerabilities; it was ignored. Three weeks later, when The DAO contained about 10% of all the Ether in circulation, a combination of these vulnerabilities was used to steal its contents.

    The loot was restored by a "hard fork", the blockchain's version of mutability. Since then it has become the norm for "smart contract" authors to make them "upgradeable", so that bugs can be fixed. "Upgradeable" is another way of saying "immutable in name only".
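Weaver's account above attributes the theft from The DAO to a reentrancy bug. The following is a toy Python model of that class of bug, not The DAO's actual Solidity code; the class, the callback, and the amounts are all hypothetical. The flaw is that the vault hands control to the recipient before it updates its own ledger.

```python
class VulnerableVault:
    def __init__(self):
        self.balances = {}
        self.total = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount and self.total >= amount:
            self.total -= amount       # pay out...
            on_receive(amount)         # ...hand control to the recipient...
            self.balances[who] = 0     # ...and only then update the ledger

def attack(vault, attacker):
    """Re-enter withdraw() from the payment callback until the vault is empty."""
    stolen = []

    def on_receive(amount):
        stolen.append(amount)
        vault.withdraw(attacker, on_receive)  # balance not yet zeroed

    vault.withdraw(attacker, on_receive)
    return sum(stolen)

vault = VulnerableVault()
vault.deposit("victim", 90)
vault.deposit("attacker", 10)
print(attack(vault, "attacker"))  # 100
print(vault.total)                # 0
```

The attacker deposited 10 and withdrew 100. Zeroing the balance before calling `on_receive` would close this particular hole, but in an immutable contract there is no way to ship that fix.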

    Permissionless systems trust:
    • The core developers of the blockchain software not to write bugs.
    • The developers of your wallet software not to write bugs.
    • The developers of the exchanges not to write bugs.
    • The operators of the exchanges not to manipulate the markets or to commit fraud.
    • The developers of your upgradeable "smart contracts" not to write bugs.
    • The owners of the smart contracts to keep their secret key secret.
    • The owners of the upgradeable smart contracts to avoid losing their secret key.
    • The owners and operators of the dominant mining pools not to collude.
    • The speculators to provide the funds needed to keep the “price” going up.
    • Users' ability to keep their secret key secret.
    • Users’ ability to avoid losing their secret key.
    • Other users not to transact when you want to.

    So, this is the list of people your permissionless system has to trust if it is going to work as advertised over the long term.

    You started out to build a trustless, decentralized system but you have ended up with:
    • A trustless system that trusts a lot of people you have every reason not to trust.
    • A decentralized system that is centralized around a few large mining pools that you have no way of knowing aren’t conspiring together.
    • An immutable system that either has bugs you cannot fix, or is not immutable.
    • A system whose security depends on it being expensive to run, and which is thus dependent upon a continuing inflow of funds from speculators.
    • A system whose coins are convertible into large amounts of "fiat currency" via irreversible pseudonymous transactions, which is thus an irresistible target for crime.
    If the “price” keeps going up, the temptation for your trust to be violated is considerable. If the "price" starts going down, the temptation to cheat to recover losses is even greater.

    Maybe it is time for a re-think.

    Suppose you give up on the idea that anyone can take part and accept that you have to trust a central authority to decide who can and who can’t vote. You will have a permissioned system.

    The first thing that happens is that it is no longer possible to mount a Sybil attack, so there is no reason running a node need be expensive. You can use BFT to establish consensus, as IBM’s Hyperledger, the canonical permissioned blockchain system, does. You need many fewer nodes in the network, and running a node just got way cheaper. Overall, the aggregated cost of the system got orders of magnitude cheaper.

    Now that there is a central authority, it can collect “fiat currency” for network services and use it to pay the nodes. No need for cryptocurrency, exchanges, pools, speculators, or wallets, so much less temptation for bad behavior.

    Permissioned systems trust:
    • The central authority.
    • The software developers.
    • The owners and operators of the nodes.
    • The secrecy of a few private keys.

    This is now the list of entities you trust. Trusting a central authority to determine the voter roll has eliminated the need to trust a whole lot of other entities. The permissioned system is more trustless and, since there is no need for pools, the network is more decentralized despite having fewer nodes.

    Faults  Replicas
    1       4
    2       7
    3       10
    4       13
    5       16
    6       19
    a Byzantine quorum system of size 20 could achieve better decentralization than proof-of-work mining at a much lower resource cost.
    Decentralization in Bitcoin and Ethereum Networks,
    Adem Efe Gencer, Soumya Basu, Ittay Eyal, Robbert van Renesse and Emin Gün Sirer

    How many nodes does your permissioned blockchain need? The rule for BFT is that 3f + 1 nodes can survive f simultaneous failures. That's an awful lot fewer than you need for a permissionless proof-of-work blockchain. What you get from BFT is a system that, unless it encounters more than f simultaneous failures, remains available and operating normally.
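The 3f + 1 rule is easy to tabulate. A small sketch, with function names of my own choosing, that reproduces the Faults/Replicas figures and inverts the rule for a given network size:

```python
def replicas_needed(faults: int) -> int:
    """Minimum BFT replicas that survive `faults` simultaneous failures."""
    return 3 * faults + 1

def faults_tolerated(replicas: int) -> int:
    """Largest f such that 3f + 1 <= replicas."""
    return (replicas - 1) // 3

for f in range(1, 7):
    print(f, replicas_needed(f))

print(faults_tolerated(20))  # 6
```

A quorum system of size 20, as in the Gencer et al. quote, therefore tolerates six simultaneous failures.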

    The problem with BFT is that if it encounters more than f simultaneous failures, the state of the system is irrecoverable. If you want a system that can be relied upon for the long term you need a way to recover from disaster. Successful permissionless blockchains have Lots Of Copies Keeping Stuff Safe, so recovering from a disaster that doesn't affect all of them is manageable.

    So in addition to implementing BFT you need to back up the state of the system each block time, ideally to write-once media so that the attacker can't change it. But if you're going to have an immutable backup of the system's state, and you don't need continuous uptime, you can rely on the backup to recover from failures. In that case you can get away with, say, 2 replicas of the blockchain in conventional databases, saving even more money.

    I've shown that, whatever consensus mechanism they use, permissionless blockchains are not sustainable for very fundamental economic reasons. These include the need for speculative inflows and mining pools, security linear in cost, economies of scale, and fixed supply vs. variable demand. Proof-of-work blockchains are also environmentally unsustainable. The top 5 cryptocurrencies are estimated to use as much energy as The Netherlands. This isn't to take away from Nakamoto's ingenuity; proof-of-work is the only consensus system shown to work well for permissionless blockchains. The consensus mechanism works, but energy consumption and emergent behaviors at higher levels of the system make it unsustainable.

    Additional Material

    It can be very hard to find reliable sources about cryptocurrencies because almost all cryptocurrency journalism is bought and paid for.

    When cryptocurrency issuers want positive coverage for their virtual coins, they buy it. Self-proclaimed social media personalities charge thousands of dollars for video reviews. Research houses accept payments in the cryptocurrencies they are analyzing. Rating “experts” will grade anything positively, for a price.

    All this is common, according to more than two dozen people in the cryptocurrency market and documents reviewed by Reuters.
    “The main reason why so many inexperienced individuals invest in bad crypto projects is because they listen to advice from a so-called expert,” said Larry Cermak, head of analysis at cryptocurrency research and news website The Block. Cermak said he does not own any cryptocurrencies and has never promoted any. “They believe they can take this advice at face value even though it is often fraudulent, intentionally misleading or conflicted.”
    Special Report: Little known to many investors, cryptocurrency reviews are for sale
    Anna Irrera & Elizabeth Dilts, Reuters
    See also: Crypto-shills, Jemima Kelly

    A recent example:

    The boxer Floyd Mayweather and the music producer DJ Khaled have been fined for unlawfully touting cryptocurrencies.

    The two have agreed to pay a combined $767,500 in fines and penalties, the Securities and Exchange Commission (SEC) said in a statement on Thursday. They neither admitted nor denied the regulator’s charges.

    According to the SEC, Mayweather and Khaled failed to disclose payments from three initial coin offerings (ICOs), in which new currencies are sold to investors.
    Floyd Mayweather and DJ Khaled fined over cryptocurrency promotion
    Dominic Rushe, The Guardian

    Some idea of the cryptocurrency milieu can be gained from Laurie Penny's Four Days Trapped at Sea With Crypto’s Nouveau Riche. Here's a taste:

    The women on this boat are polished and perfect; the men, by contrast, seem strangely cured—not like medicine, but like meat. They are almost all white, between the ages of 30 and 50, and are trying very hard to have the good time they paid thousands for, while remaining professional in a scene where many thought leaders have murky pasts, a tendency to talk like YouTube conspiracy preachers, and/or the habit of appearing in magazines naked and covered in strawberries. That last is 73-year-old John McAfee, who got rich with the anti-virus software McAfee Security before jumping into cryptocurrencies. He is the man most of the acolytes here are keenest to get their picture taken with and is constantly surrounded by private security who do their best to aesthetically out-thug every Armani-suited Russian skinhead on deck. Occasionally he commandeers the grand piano in the guest lounge, and the young live-streamers clamor for the best shot. John McAfee has never been convicted of rape and murder, but—crucially—not in the same way that you or I have never been convicted of rape or murder.

    On 7th December 2018 Bitcoin's "price" was around $3,700.

    Bitcoin now at $16,600.00. Those of you in the old school who believe this is a bubble simply have not understood the new mathematics of the Blockchain, or you did not cared enough to try. Bubbles are mathematically impossible in this new paradigm. So are corrections and all else
    Tweet from John McAfee, 7th December 2017

    Similarly, most of what you read about blockchain technology is people hyping their vaporware. A "trio of monitoring, evaluation, research, and learning, (MERL) practitioners in international development" started out enthusiastic about the potential of blockchain technology, so they did some research:

    We documented 43 blockchain use-cases through internet searches, most of which were described with glowing claims like “operational costs… reduced up to 90%,” or with the assurance of “accurate and secure data capture and storage.” We found a proliferation of press releases, white papers, and persuasively written articles. However, we found no documentation or evidence of the results blockchain was purported to have achieved in these claims. We also did not find lessons learned or practical insights, as are available for other technologies in development.

    We fared no better when we reached out directly to several blockchain firms, via email, phone, and in person. Not one was willing to share data on program results, MERL processes, or adaptive management for potential scale-up. Despite all the hype about how blockchain will bring unheralded transparency to processes and operations in low-trust environments, the industry is itself opaque. From this, we determined the lack of evidence supporting value claims of blockchain in the international development space is a critical gap for potential adopters.
    Blockchain for International Development: Using a Learning Agenda to Address Knowledge Gaps
    John Burg, Christine Murphy, & Jean Paul Pétraud

    I highly recommend David Gerard's book Attack of the 50-foot blockchain, and his blog. Others to follow include Arvind Narayanan and his group at Princeton, Nicholas Weaver at Berkeley, Emin Gün Sirer and the team at Cornell who blog at Hacking, Distributed, and Jemima Kelly and the FT Alphaville team.

    Every time the word "price" appears here, it has quotes around it. The reason is that there is a great deal of evidence that the exchanges, operating an unregulated market, are massively manipulating the exchange rate between cryptocurrencies and the US dollar. The primary mechanism is the issuance of billions of dollars of Tether, a cryptocurrency that is claimed to be backed one-for-one by actual US dollars in a bank account, and thus whose value should be stable. There has never been an audit to confirm this claim, and the trading patterns in Tether are highly suspicious. Tether, and its parent exchange Bitfinex, are the subject of investigations by the CFTC and federal prosecutors:

    As Bitcoin plunges, the U.S. Justice Department is investigating whether last year’s epic rally was fueled in part by manipulation, with traders driving it up with Tether -- a popular but controversial digital token.

    While federal prosecutors opened a broad criminal probe into cryptocurrencies months ago, they’ve recently homed in on suspicions that a tangled web involving Bitcoin, Tether and crypto exchange Bitfinex might have been used to illegally move prices, said three people familiar with the matter.
    Bitcoin-Rigging Criminal Probe Focused on Tie to Tether
    Matt Robinson and Tom Schoenberg, Bloomberg

    Social Capital has a series explaining Tether and the "stablecoin" scam:
    Tether's problems are in addition to the problems caused by exchanges' habit of losing their customers' coins (already in 2014 it was estimated that 6.6% of all Bitcoin in circulation had been stolen), front-running their trades, money laundering, "painting the tape", preventing customers withdrawing their funds, faking trading volume, and so on.

    John Lewis is an economist at the Bank of England. His The seven deadly paradoxes of cryptocurrency provides a skeptical view of the economics of cryptocurrencies that nicely complements my more technology-centric view. My comments on his post are here. Remember that a permissionless blockchain requires a cryptocurrency; if the economics don't work neither does the blockchain.

    You can find my writings about blockchain over the past five years here. In particular:
    More detail on the bugs in The DAO:

    The DAO was designed as a series of contracts that would raise funds for ethereum-based projects and disperse them based on the votes of members. An initial token offering was conducted, exchanging ethers for "DAO tokens" that would allow stakeholders to vote on proposals, including ones to grant funding to a particular project.

    That token offering raised more than $150m worth of ether at then-current prices, distributing over 1bn DAO tokens.

    [In May 2016], however, news broke that a flaw in The DAO's smart contract had been exploited, allowing the removal of more than 3m ethers.

    Subsequent exploitations allowed for more funds to be removed, which ultimately triggered a 'white hat' effort by token-holders to secure the remaining funds. That, in turn, triggered reprisals from others seeking to exploit the same flaw.

    An effort to blacklist certain addresses tied to The DAO attackers was also stymied mid-rollout after researchers identified a security vulnerability, thus forcing the hard fork option.
    The Hard Fork: What's About to Happen to Ethereum and The DAO,
    Michael del Castillo, Coindesk

    The DAO heist isn't an anomaly; here's a recent example (click through to the Medium post):

    ICO token Oyster PRL was exit-scammed by its founder, “Bruno Blocks” — who nobody has ever met — who took 3 million tokens via a deliberately-maintained back door in the smart contract code. How does this keep happening? Fortunately, the developers are on the case … by printing 27 million new tokens for themselves.
    David Gerard

    Exit scams are rife in the ICO world. Here is a recent example:

    Blockchain company Pure Bit has seemingly walked off with $2.7 million worth of investors’ money after raising 13,000 Ethereum in an ICO. Transaction history shows that hours after moving all raised funds out of its wallet, the company proceeded to take down its website. It now returns a blank page.
    This is the latest in a string of exit scams that took place in the blockchain space in 2018. Indeed, reports suggested exit scammers have thieved more than $100 million worth of cryptocurrency over the last two years alone. Subsequent investigations hint the actual sum of stolen cryptocurrency could be even higher.
    South Korean cryptocurrency startup reportedly pulls a $2.7M exit scam
    The Next Web

    More detail on the lack of decentralization in practice:

    in Bitcoin, the weekly mining power of a single entity has never exceeded 21% of the overall power. In contrast, the top Ethereum miner has never had less than 21% of the mining power. Moreover, the top four Bitcoin miners have more than 53% of the average mining power. On average, 61% of the weekly power was shared by only three Ethereum miners. These observations suggest a slightly more centralized mining process in Ethereum.

    Although miners do change ranks over the observation period, each spot is only contested by a few miners. In particular, only two Bitcoin and three Ethereum miners ever held the top rank. The same mining pool has been at the top rank for 29% of the time in Bitcoin and 14% of the time in Ethereum. Over 50% of the mining power has exclusively been shared by eight miners in Bitcoin and five miners in Ethereum throughout the observed period. Even 90% of the mining power seems to be controlled by only 16 miners in Bitcoin and only 11 miners in Ethereum.
    Decentralization in Bitcoin and Ethereum Networks,
    Adem Efe Gencer, Soumya Basu, Ittay Eyal, Robbert van Renesse and Emin Gün Sirer

    More on the lack of decentralization highlighted by Balaji S. Srinivasan and Leland Lee in Quantifying Decentralization, with their use of the "Nakamoto coefficient":

    "Ethereum’s smart contract ecosystem has a considerable lack of diversity. Most contracts reuse code extensively, and there are few creators compared to the number of overall contracts. ... the high levels of code reuse represent a potential threat to the security and reliability. Ethereum has been subject to high-profile bugs that have led to hard forks in the blockchain (also here) or resulted in over $170 million worth of Ether being frozen; like with DNS’s use of multiple implementations, having multiple implementations of core contract functionality would introduce greater defense-in-depth to Ethereum."
    Analyzing Ethereum's Contract Topology,
    Lucianna Kiffer, Dave Levin and Alan Mislove
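
    The "Nakamoto coefficient" mentioned above is, roughly, the smallest number of entities whose combined share of some resource (mining power, for example) exceeds the threshold needed to control it. The following is a minimal sketch of that idea, not the authors' code; the function name and the share figures are made up for illustration.

```python
def nakamoto_coefficient(shares, threshold=0.5):
    """Smallest number of entities whose combined share exceeds `threshold`.

    `shares` can be in any consistent unit (percent, hash rate, tokens);
    they are normalized against their own total.
    """
    total = sum(shares)
    running = 0.0
    # Take the largest holders first, so the count is as small as possible.
    for count, share in enumerate(sorted(shares, reverse=True), start=1):
        running += share
        if running > threshold * total:
            return count
    return len(shares)


# Hypothetical weekly mining-power shares in percent; not real measurements.
example_shares = [21, 17, 15, 10, 9, 8, 7, 6, 4, 3]

print(nakamoto_coefficient(example_shares))        # entities needed for >50%
print(nakamoto_coefficient(example_shares, 0.9))   # entities needed for >90%
```

    A low value for the 50% threshold means very few parties would need to collude, which is the concern the studies quoted above raise.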

    More detail on pump-and-dump (P&D) schemes:

    P&Ds have dramatic short-term impacts on the prices and volumes of most of the pumped tokens. In the first 70 seconds after the start of a P&D, the price increases by 25% on average, trading volume increases 148 times, and the average 10-second absolute return reaches 15%. A quick reversal begins 70 seconds after the start of the P&D. After an hour, most of the initial effects disappear. ... prices of pumped tokens begin rising five minutes before a P&D starts. The price run-up is around 5%, together with an abnormally high volume. These results are not surprising, as pump group organizers can buy the pumped tokens in advance. When we read related messages posted on social media, we find that some pump group organizers offer premium memberships to allow some investors to receive pump signals before others do. The investors who buy in advance realize great returns. Calculations suggest that an average return can be as high as 18%, even after considering the time it may take to unwind positions. For an average P&D, investors make one Bitcoin (about $8,000) in profit, approximately one-third of a token’s daily trading volume. The trading volume during the 10 minutes before the pump is 13% of the total volume during the 10 minutes after the pump. This implies that an average trade in the first 10 minutes after a pump has a 13% chance of trading against these insiders and on average they lose more than 2% (18%*13%).
    Cryptocurrency Pump-and-Dump Schemes
    Tao Li, Donghwa Shin and Baolian Wang

    A summary of the bad news about vote-buying in blockchains:

    The existence of trust-minimizing vote buying and Dark DAO primitives imply that users of all on-chain votes are vulnerable to shackling, manipulation, and control by plutocrats and coercive forces. This directly implies that all on-chain voting schemes where users can generate their own keys outside of a trusted environment inherently degrade to plutocracy, ... Our schemes can also be repurposed to attack proof of stake or proof of work blockchains profitably, posing severe security implications for all blockchains.
    On-Chain Vote Buying and the Rise of Dark DAOs
    Philip Daian, Tyler Kell, Ian Miers, and Ari Juels

    Here is a typical day on the Bitcoin blockchain. It is averaging 3 transactions/sec, and has a queue of an hour's worth of them waiting to be confirmed. Back in 2010 testing showed:
    VisaNet handles an average of 150 million transactions every day and is capable of handling more than 24,000 transactions per second.
    Eight years ago that was about 580 times as many transactions/sec on average, using much less electricity than Austria to do it.
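
    The comparison can be checked using only the two figures quoted above: 150 million VisaNet transactions per day and roughly 3 Bitcoin transactions per second. This is a back-of-the-envelope sketch; the helper name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_tps(transactions_per_day):
    """Convert a daily transaction count into an average per-second rate."""
    return transactions_per_day / SECONDS_PER_DAY


visa_tps = average_tps(150_000_000)   # daily figure quoted above
bitcoin_tps = 3                       # average rate quoted above

print(round(visa_tps))                # 1736 transactions/sec
print(round(visa_tps / bitcoin_tps))  # 579 times Bitcoin's average rate
```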

    S&P500 companies are slowly figuring out that there is no there there in blockchains and cryptocurrencies, and they're not the only ones:

    Still new to NYC, but I met this really cool girl. Energy sector analyst or some such. Four dates in, she uncovers my love for BitCoin.

    Completely ghosted.
    Zack Voell

    RDA Plenary 13 reflection / Jez Cope

    Philadelphia International Airport apron with planes and luggage

    Photo by me

    I sit here writing this in the departure lounge at Philadelphia International Airport, waiting for my Aer Lingus flight back after a week at the 13th Research Data Alliance (RDA) Plenary (although I'm actually publishing this a week or so later at home). I'm pretty exhausted, partly because of the jet lag, and partly because it's been a very full week with so much to take in.

    It's my first time at an RDA Plenary, and it was quite a new experience for me! First off, it's my first time outside Europe, and thus my first time crossing quite so many timezones. I've been waking at 5am and ready to drop by 8pm, but I've struggled on through!

    Secondly, it's the biggest conference I've been to for a long time, both in number of attendees and number of parallel sessions. There's been a lot of sustained input so I've been very glad to have a room in the conference hotel and be able to escape for a few minutes when I needed to recharge.

    Thirdly, it's not really like any other conference I've been to: rather than having large numbers of presentations submitted by attendees, each session comprises lots of parallel meetings of RDA interest groups and working groups. It's more community-oriented: an opportunity for groups to get together face to face and make plans or show off results.

    I found it pretty intense and struggled to take it all in, but incredibly valuable nonetheless. Lots of information to process (I took a lot of notes) and a few contacts to follow up on too, so overall I loved it!

    Twitter / pinboard

    It occurs to me the #code4lib statement of support for Chris Bourg offers a better model…

    Open Data Day: Experience in Costa Rica & Elections, Public Contracts and Open Science: the mix at #ODD19 Guatemala / Open Knowledge Foundation

    This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. ACCESA from Costa Rica and Sofia Montenegro, one of our School of Data Fellows from Guatemala, received funding through the mini-grant scheme by the Latin American Initiative for Open Data (ILDA) and by Hivos / Open Contracting Partnership, to organise events under the Open Contracting theme. This report was written by María Fernanda Avendaño Mora and Daniel Villatoro.

    Experience in Costa Rica and activities of ACCESA

    What we did

    In Costa Rica, on March 2nd we celebrated Open Data Day with a full agenda of talks, workshops, and conversations. The activity was held at the Center for Research and Training in Public Administration of the University of Costa Rica from 9:00 a.m. to 4:00 p.m. More than 100 people signed up and participated throughout the day, and approximately 46% of the attendees were women.

    One of the activities was the presentation of the Open Data Guide on Public Procurement and the case of Costa Rica. From the data presented, we learned that not all public institutions use the integrated public procurement system and that the data in this system is not published as open data. We also learned about the Open Contracting Data Standard (OCDS) and about the possibilities for using data to make better decisions if the data in the purchasing system were in an open data format.

    Another activity was the presentation of the results for Costa Rica of the Transparent Public Procurement Rating (TPPR), followed by a workshop to define the roadmap for Costa Rica to achieve open contracting standards in the short and medium term.

    Lessons learned from experience

    • In open data and transparency in public procurement, Costa Rica has a big gap between what the law says and how the law is implemented in practice.
    • One aspect that makes it difficult to take action on open data and on efficient, transparent public procurement management is the absence of a single governing body, clearly established in law and with enough muscle to lead governance of public purchasing.
    • We have to stop seeing public purchasing as a purely administrative procedure and begin to see public procurement as a policy for the public good, in which the State uses its considerable financial muscle to achieve social and development objectives.
    • A viable way to open the data is to develop an API for public procurement data so that anyone can get the information they need and build the solutions that the private sector and civil society consider appropriate. All that is needed is political will and technical diligence.
    • There is an opportunity to use the public procurement system to collect useful data according to public policy priorities. For example, if you want to strengthen enterprises led by women, then the public procurement system should ask questions on that topic to collect the data.

    Each attendee was able to take a copy of the Open Data Guide on Public Procurement

    We supported the organization of the event by providing snacks for the attendees


    Elections, Public Contracts and Open Science: the mix at #ODD19 Guatemala

    In Guatemala, the Open Data Day event was built around three main themes, each with different dynamics and spaces for learning.

    Data and elections

    Taking advantage of the fact that elections are taking place this year, innovative electoral projects that use technology and data were presented. Each project collects and shares data in an open format that allows citizens to cast an informed vote.

    Here are the details of the projects presented:

    • For Whom I Vote? is a virtual platform where users fill in a questionnaire that measures their affinity with the parties participating in the electoral process. This allows each user to identify not only their own ideological position but also how close they are to each political party. Moreover, the platform collects data such as the demographic variables and location of users taking the test. These data will be accessible for analysts to identify potential research proposals.

    • 3de3 (or 3for3) is a replica of a Mexican project that demands transparency from political candidates, inviting them to share three important documents: their tax return, a statement of interests (to avoid possible conflicts of interest) and their declaration of assets.

    • La Papeleta (The ballot) by Guatecambia is a directory that converts the candidate registration documents held by the electoral office (scanned PDFs) into transcribed, machine-readable data.

    Tracking public money flows

    At the School of Data fellowship we have worked on a research project that maps out the process of public contracting as part of our work related to OpenContracting values. A visualization and web platform were presented as a preview and as part of a validation process to understand the needs of data users, their interest in open data about contracts, and the best ways to explain transparency efforts and engage people in them. Sofia Montenegro, a current School of Data fellow, presented her research and some of the key findings of the process.

    Open Science

    This session was led by Kevin Martinez-Folgar, a researcher in epidemiology, who gave us a quick introduction to the framework for making scientific findings open, a tipsheet on how to conduct research that way, and a list of online resources for learning and applying it.

    We browsed around to understand how to be open across the whole research cycle, got to know a distribution server and electronic archive for articles, and looked at Zenodo as a place to publish and share results. We also reviewed some projects on GitHub and learned about identification in the digital world through the Digital Object Identifier System.

    Last but not least, we reviewed the contents available from the OpenScienceMOOC and reflected on the lack of knowledge available to Spanish-speaking audiences.

    The activity was organized by School of Data and its local fellowship, with help from trainers and the projects that presented their work. We celebrated as a community, in a space to share experiences and best practices and to exchange ideas for future collaborations.

    You can learn more about the work done by our Spanish-speaking School of Data community at our blog.

    Meet DLF’s 2019 LCI Tuition Grant Recipients / Digital Library Federation


    Today, CLIR and EDUCAUSE announced the 38 individuals who have been selected to participate in the 2019 Leading Change Institute (LCI). Through the DLF fellowships program, two participants—Monika Rhue and Tina Rollins—have been awarded full-tuition scholarships for the program. Offered for the first time this year, these tuition grants will enable the recipients to fully participate in LCI, and will foster cross-pollination among a variety of institution types.

    About the Leading Change Institute

    Jointly sponsored by CLIR and EDUCAUSE, LCI is designed for leaders in higher education, including CIOs, librarians, information technology professionals, and administrators, who want to work collaboratively to promote and initiate change on critical issues affecting the academy. These issues include new sources of competition, use of technology to support effective teaching and learning, distance learning, changing modes of scholarly communications, and the qualities necessary for leadership.

    Monika and Tina will join other participants in sessions led by deans Joanne Kossuth and Elliott Shore as well as other thought leaders from the community in discussing approaches to these challenges, including ideas for collaboration, collective creativity, and innovation within and across departments, institutions, and local or regional boundaries; the conceptualization of blended positions and organizations; and the importance of community mentorship and advocacy.

    The institute will be held June 2–7 in Washington, DC.

    About the recipients

    Monika Rhue
    Director of Library and Curation
    Johnson C. Smith University

    Monika Rhue is currently serving as the director of library services and curation at the James B. Duke Memorial Library, Johnson C. Smith University. Some of her work experiences include library management, grant writing, archival consulting, and museum curation. She has served on the HistoryMakers advisory board and the planning advisory team for the 2018 Harvard Radcliffe Workshop on Technology and Archival Processing and was the plenary speaker for the 2018 Rare Books and Manuscripts Section conference in New Orleans. She also serves as an archival consultant for the State Archives of North Carolina Traveling Archivist Program, and as 2017-2019 Board Chair for HBCU Library Alliance.

    Monika managed Save the Music: The History of Biddleville Quintet, the JCSU archives’ first digital project to transfer instantaneous discs into a digital format, and launched Digital Smith, the university’s searchable archives. She was instrumental in accessioning the James Gibson Peeler collection with more than 100,000 photographs and negatives that document the history and culture of Charlotte’s African American population. She has bridged several partnerships across campus, in the Charlotte community and throughout the Southeast with programs such as:

    • Giving Back: the Soul of Philanthropy Reframed and Exhibited, a traveling exhibit throughout the Southeast paying tribute to generations of African American Philanthropy.
    • Know Your Plate, an interactive game project to promote awareness of obesity among African Americans in the Northwest Corridor.
    • JCSU’s Information Literacy Buddy initiative, which assisted HBCUs in transforming bibliographic instruction into an information literacy program. Monika was invited to share this initiative in South Africa as a People-to-People library delegate from October 19-29, 2009.

    She is the author of Organizing and Preserving Family and Religious Records: A Step-by-Step Guide and Dress the African Way: An Activity Book for the Family, and is a contributing writer to the ACRL publication Creating Leaders: An Examination of Academic and Research Library Leadership Institutes. She earned her Bachelor of Arts degree in communication from Johnson C. Smith University and an MLIS degree from UNC-Greensboro. Her current projects include developing an animated plagiarism game to help students avoid plagiarism and partnering with Arts and Science Council Culture Blocks to capture and preserve the rich heritage of the Northwest Corridor neighborhoods.

    Tina D. Rollins
    Director of the William R. and Norma B. Harvey Library
    Hampton University

    Tina D. Rollins is the director of the William R. and Norma B. Harvey Library at Hampton University. She completed her bachelor’s degree in criminal justice at Old Dominion University and her MLS degree at North Carolina Central University (NCCU). While at NCCU she was a member of the Diversity Scholars Program which was an Institute of Museum and Library Services (IMLS)-funded program to recruit students of diverse backgrounds into the library and information sciences field. This experience led to an interest in promoting and researching diversity within librarianship. Tina also studied international librarianship in Copenhagen, Denmark, during her studies at NCCU.

    At Hampton, she has created initiatives to improve information literacy, outreach services, and professional development. The initiatives have led to increases in library programming, grantsmanship, fundraising, and faculty and staff communication. The library is successfully rebuilding its brand and building cross-campus collaboration and partnerships. These opportunities create a wealth of potential resources to improve library services and research efforts throughout the university.

    Rollins has committed herself to bringing awareness to the lack of diversity within all facets of the LIS field. She currently serves as principal investigator on an IMLS grant awarded to Hampton University. This award, titled The Hampton University Forum on Minority Recruitment and Retention in the LIS Field, convened a national forum in August 2018 to discuss effective strategies and action planning to address the lack of diversity within the LIS field. The grant continues to address these concerns through virtual meetings and training sessions for LIS professionals.

    Tina Rollins holds various memberships in both regional and national organizations related to the field. She is a board member of the Historically Black Colleges and Universities (HBCU) Library Alliance. Additionally, she volunteers in literacy outreach organizations and initiatives in the region. She currently resides in Newport News, VA with her husband, where she enjoys watching movies and bad reality television.

    In 2019, Monika and Tina are also serving as mentors with the HBCU Library Alliance and DLF’s  Authenticity Project — a program which provides support and professional development to early- to mid-career library staff from American HBCUs.

    To learn more about the Leading Change Institute and to view this year’s curriculum, visit the program’s website.


    The post Meet DLF’s 2019 LCI Tuition Grant Recipients appeared first on DLF.

    I have measured out my life in Doodle polls / Karen G. Schneider

    You know that song? The one you really liked the first time you heard it? And even the fifth or fifteenth? But now your skin crawls when you hear it? That’s me and Doodle.

    In the last three months I have filled out at least a dozen Doodle polls for various meetings outside my organization. I complete these polls at work, where my two-monitor setup means I can review my Outlook calendar while scrolling through a Doodle poll with dozens of date and time options. I don’t like to inflict Doodle polls on our library admin because she has her hands full enough, including managing my real calendar.

    I have largely given up on earmarking dates on my calendar for these polls, and I just wait for the inevitable scheduling conflicts that come up. Some of these polls have so many options I would have absolutely no time left on my calendar for work meetings, many of which need to be made on fairly short notice. Not only that, I gird my loins for the inevitable “we can’t find a date, we’re Doodling again” messages that mean once again, I’m going to spend 15 minutes checking my calendar against a Doodle poll.

    I understand the allure of Doodle; when I first “met” Doodle, I was in love. At last, a way to pick meeting dates without long, painful email threads! But we’re now deep into the Tragedy of the Doodle Commons, with no relief in sight.

    Here are some Doodle ideas–you may have your own to toss in.

    First, when possible, before Doodling, I ask for blackout dates. That narrows the available date/time combos and helps reduce the “we gotta Doodle again” scenarios.

    Second, if your poll requires more than a little right-scrolling, reconsider how many options you’re providing. A poll with 40 options might as well be asking me to block out April. And I can’t do that.

    Third, I have taken exactly one poll where the pollster chose to suppress other people’s responses, and I hope to never see that again. There is a whole gaming side to Doodling in which early respondents get to drive the dates that are selected, and suppressing others’ responses eliminates that capability. Plus I want to know who has and hasn’t responded, and yes, I may further game things when I have that information.

    Also, if you don’t have to Doodle, just say no.

    Webinar Recap: Create the Ideal Customer Experience, Without Giving Up Control / Lucidworks

    I had the privilege of hosting a webinar with Peter Curran, President and Co-Founder of Cirrus10, and expert consultant on designing web ecommerce architectures.

    You can listen to the entire webinar below, and I’ll mention some of the highlights of what Peter and I discussed.

    Companies that have digital commerce businesses need to meet the demands of customers who are asking for more personalized experiences. Forrester reported that 36% of US online adults believe that retailers should offer them more personalized experiences, and Gartner has confirmed that sellers who personalize the customer experience see greater levels of customer engagement and higher retention. A Fast Company piece from June 2018 explains the challenge facing retailers as they think about Amazon:

    “The problem is, most can’t possibly win against Amazon by playing the e-commerce giant’s game. To survive (and thrive) in a marketplace where price and convenience reign supreme, retailers of all stripes need to provide something that Amazon can’t: high-quality, human-touch customer service.”

    Watch the webinar recording to hear Peter’s real-life examples of shopping for tents, utility pants, and lingerie. He highlights how poorly supported ecommerce browse and search experiences harm click-through, conversion, and revenue. And he compares those examples with sites that improve shopper loyalty by giving customers more relevant, human-touch shopping experiences.

    Click here to watch the webinar.

    Ecommerce companies should be able to use both business rules and machine learning. Our digital commerce solution hyper-personalizes search and browsing, while letting the merchandiser curate the end-to-end experience, for the “ideal customer experience, without giving up control.”

    In the webinar, Peter discusses the challenges that merchandisers face maintaining huge sets of business rules in an ever-changing marketplace. Customer preferences change organically, new products come to market, and the rise of third-party marketplaces can double or triple the size of an e-retailer’s product catalog overnight.

    None of these challenges mean that the business rules are going away. Rather, business rules and Artificial Intelligence must interact with each other if teams are going to manage the growing time commitment to maintain merchandising rules. Because Fusion AI can minimize the time they spend maintaining those rules, it frees merchandisers to dedicate more time to the activities where deep human insight and expertise are required.

    A handful of companies have spent billions developing AI on their own platforms. (Peter introduced us to an acronym that names those companies: “G-MAFIA,” which stands for Google, Microsoft, Amazon, Facebook, IBM, and Apple.) We mean no disrespect to the G-MAFIA, but thousands of ecommerce companies cannot invest their way into that exclusive AI club.

    Lucidworks gives merchandisers a platform to:

    • Curate the shopper experience using their ecommerce expertise,
    • Personalize results for the shopper, and
    • Optimize the entire end-to-end process with applied machine learning.

    Watch the webinar replay for hypothetical scenarios showing how Lucidworks Fusion combines human and machine intelligence, delivers relevant search results before the shopper knows to ask, and scales the number of ML models for product recommendations.

    If you want to learn more about how to increase conversions on your site, boost order values, and improve customer loyalty, contact us and schedule a meeting with one of our expert consultants.

    The post Webinar Recap: Create the Ideal Customer Experience, Without Giving Up Control appeared first on Lucidworks.

    Systematic Reviews of Our Metadata / HangingTogether

    That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Melanie Wacker of Columbia University, Roxanne Missingham of Australian National University, and Sharon Farnel of University of Alberta. Librarians and administrators are well aware of the tension that exists between delivering access to our library collections in a timely manner and providing good quality description. The metadata descriptions must be full enough to allow us to manage our collections and to support accessibility and discoverability for the end-user. Many libraries need to compromise by using vendor records, by creating minimal or less-than-full level descriptions (according to existing guidelines such as BIBCO or creating their own) for certain types of resources, and by limiting authority work. We need to better understand the impact that these compromises are having on our end users.

    Generally, Partners apply full-level cataloging to all metadata created in-house but accept lower-level cataloging from vendors. Minimal-level cataloging is commonly used as an alternative to leaving materials uncataloged, often as a result of a large volume of materials and insufficient staff resources. The types of materials receiving minimal-level cataloging include theses, e-resources (generally accepted from vendors as is), ephemera, old backlogs, gift materials, and materials deemed to have low research value. Minimal-level cataloging may be enhanced by adding FAST (Faceted Application of Subject Terminology) headings rather than LC Subject Headings. Minimal-level records are also used as “placeholders” for inventory control, with the hope that they can be upgraded later or when an item is requested. Decisions are often driven by expediency. The metadata created also depends on the format, for example whether it is for materials in an institutional repository or digital collection. Regardless of the format, everyone strives for consistency and authorized names.

    Inconsistencies in vendor records, especially variations in a specific author’s name and different romanizations for materials written in non-Latin scripts, hamper discoverability. Some try to remediate these problems using batch processes such as OpenRefine.  The National Library of Australia and Australian National University (ANU) have been working with publishers and publishers’ organizations to improve the consistency in their metadata and to start using ISNIs, as better, more consistent metadata would also benefit their other clients such as booksellers. ANU reported that this advocacy improved the quality of the metadata provided by Knowledge Unlatched.
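
    As an illustration of the kind of batch clean-up described above, the sketch below groups name variants by a normalized "fingerprint" key. It is loosely modeled on the key-collision clustering idea that tools such as OpenRefine offer, but it is not OpenRefine's own code; the function names and sample headings are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Normalize a name so that trivially different spellings share one key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, then sort the unique tokens.
    tokens = re.sub(r"[^\w\s]", " ", stripped.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster_names(names):
    """Group names by fingerprint; keep only groups with several spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}


# Hypothetical vendor-record headings, not taken from any real catalog.
records = ["Núñez, María", "María Núñez", "nunez, maria.", "Okafor, Chidi"]

for key, variants in cluster_names(records).items():
    print(key, "->", variants)
```

    A key like this only catches differences in case, punctuation, diacritics, and word order. Variant romanizations of non-Latin scripts would still need identifiers or authority data to bring them together.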

    Metadata specialists consult “collection stewards” (specialists who are familiar with both the collections and users’ needs, such as special collection librarians, curators, and archivists) on prioritizing the metadata applied, allocating staff resources, and which data elements are most needed. These priorities may compete with each other. Typically, metadata specialists have no direct contact with end-users. When they receive requests for specific data elements, such as the supervisors for theses, there may not be a way in current systems to handle such role distinctions appropriately.

    Many user studies have been conducted, but they generally focus on the discovery layer interfaces rather than the underlying metadata. One exception: The Library of Congress’s Digital Collections Management and Services Division plans to start a project in 2019 that will look at potential effects of different levels of metadata on discovery and use of web archives.

    Partners refer to usage data to prioritize clean-up work, weeding, and purchase decisions, but not for reviewing the impact of different levels of metadata.  The one exception noted was Cornell’s examination of interlibrary loan records to see if items stored remotely and cataloged according to their minimal level guidelines could be found. They determined that resources that otherwise would have remained uncataloged were being requested.

    The current prevalence of keyword searching in discovery layers overshadows the value of controlled access points that collate materials in different languages. Increasing use of identifiers may help bring materials together again in a future linked data environment.  It then becomes even more important that the identifiers associated with personal names are the correct ones. Integrating authority record data into discovery environments is another area that needs attention, including reconsidering authority standards’ orientation toward browse functionality.

    Migration to a new integrated library system prompts systematic reviews of metadata. It also offers an opportunity to structure more efficient workflows and use the system to review metadata quality so that everyone can better use the catalog.

    The post Systematic Reviews of Our Metadata appeared first on Hanging Together.

    ODD19 Mexico City: communities sharing DataLove & Data to fight violence against women / Open Knowledge Foundation

    This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. This is a joint report produced by Técnicas Rudas and SocialTIC from Mexico, who received funding through the mini-grant scheme by the Latin American Initiative for Open Data (ILDA) to organise events under the Open Mapping and Tracking Public Money themes.

    Open Data Day is the annual festival where communities and people interested in the use of data come together to share “data love”, learn, release data, share projects, and create solutions through open data.

    In Mexico City, 2019 is the sixth consecutive year that we have celebrated Open Data Day. This time we had a whole data festival with activities to choose from: workshops, data expeditions, projects, a public buildings rally, and data city challenges. Close to 120 people from civil society, local government and communities interested in open data participated.

    How did we celebrate the data party?

    The workshops of this edition covered intro and advanced levels on the use and handling of data. The topics were: data analysis with Kibana, data extraction of FOIA requests, analogous visualization of data on public building, fundamentals of dataviz, the use of data for geolocation, public policy and essential statistics. We also had a discussion on the dark gaps of artificial intelligence.
    You can find the content of some workshops here:

    Data expeditions
    With objectives ranging from exploring data on mobility, security, budget, town planning and gender, to public contracts, the data expeditions are designed for diverse groups to share skills, hypotheses and conclusions based on data.

    Federal Public Building Rally
    This is the fourth consecutive year in which Transparencia Presupuestaria organizes the rally to verify how the government spent the money on public infrastructure. With participants from 30 states, Estado de México, Puebla and Oaxaca were the states with the largest number of participants.

    Public Building Rally in Mexico City
    This year the public building rally was also done at city level. It was an exercise to learn about and verify the use of the city budget (drainage, public lighting, soccer fields). In this edition, almost 600 million Mexican pesos involved in public building were verified.

    In the space to learn about projects, we got acquainted with initiatives related to transparency, accountability, public contracts, data about violence against women, and justice.
    Some of the projects based on data:

    1. A walkability audit with a feminist perspective to evaluate and propose improvements in infrastructure and urban design of the city (@Futura_org)
    2. Justicia Transparente, an audit exercise that analyzes data on insecurity and distrust in the authorities linked to criminal procedure (@IMCO)

    A summary, some pics and tweets, and related projects are available here: (Spanish)

    Data against violence

    by Técnicas Rudas and GeoChicas

    In Mexico, one can’t help but be inspired by the powerful women’s movement here. However, violence against women is still rampant in our society. While there is a general perception that violence is greater, there is also a widespread concern among feminist activists that, as with many human rights issues in Mexico, available data is insufficient to reflect the true scale and characteristics of violence against women. This is certainly the case with feminicide.

    Mexico is one of a handful of countries in Latin America that have incorporated feminicide into their legislation as a hate crime, first in local legislation in 1993, and later (not until 2007) into federal law. In Mexico the government has opened data about feminicide at the municipal level from 2015 to the present, and the data is updated every two months. Nevertheless, the information is used only by data specialists. In order to help society to take advantage of the government’s database on feminicide, Técnicas Rudas and Geochicas organized a workshop during Open Data Day, in which independent feminists and collectives came together to take a critical look at existing data visualization initiatives on feminicide, both from government and civil society, with a focus on cartography.

    We made a script using R to read the feminicide data from official crime statistics, generate a database of feminicide in csv format, and produce a geographical file saved as geojson. Workshop participants included independent activists and academics, and members of  five different collectives, as well as  one international

    The results of the workshop can be viewed at, the script is available in and a graphic view at
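
    The workshop script itself was written in R and is not reproduced here. As a rough illustration of the same three steps (read the crime statistics, keep the feminicide records as CSV, write a GeoJSON file), here is a minimal Python sketch. The column names, the sample rows and the function names are all hypothetical, invented for this example rather than taken from the official data or from the workshop's code.

```python
import csv
import io
import json

# Hypothetical stand-in for the official crime statistics; the column
# names and figures are invented for illustration only.
SAMPLE = """municipality,crime_type,year,count,lon,lat
Municipio A,Feminicidio,2018,3,-99.13,19.43
Municipio A,Robo,2018,120,-99.13,19.43
Municipio B,Feminicidio,2018,1,-98.20,19.04
"""

FIELDS = ["municipality", "crime_type", "year", "count", "lon", "lat"]


def read_feminicide_rows(source):
    """Keep only the rows whose crime type is feminicide."""
    return [row for row in csv.DictReader(source)
            if row["crime_type"].strip().lower() == "feminicidio"]


def to_csv(rows):
    """Serialize the filtered rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def to_geojson(rows):
    """Build a GeoJSON FeatureCollection with one point per row."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered longitude, then latitude.
            "geometry": {"type": "Point",
                         "coordinates": [float(row["lon"]), float(row["lat"])]},
            "properties": {"municipality": row["municipality"],
                           "year": int(row["year"]),
                           "count": int(row["count"])},
        })
    return {"type": "FeatureCollection", "features": features}


rows = read_feminicide_rows(io.StringIO(SAMPLE))
feminicide_csv = to_csv(rows)
feminicide_geojson = json.dumps(to_geojson(rows), ensure_ascii=False, indent=2)
```

    With real data, the in-memory sample would be replaced by an open file, and the two strings would be written to disk for use in a mapping tool.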

    When a Presidential Library Is Digital / Dan Cohen

    I’ve got a new piece over at The Atlantic on Barack Obama’s prospective presidential library, which will be digital rather than physical. This has caused some consternation. We need to realize, however, that the Obama library is already largely digital:

    The vast majority of the record his presidency left behind consists not of evocative handwritten notes, printed cable transmissions, and black-and-white photographs, but email, Word docs, and JPEGs. The question now is how to leverage its digital nature to make it maximally useful and used.

    This almost-entirely digital collection, and its unwieldy scale and multiple formats, should sound familiar to all of us. Over the past two decades, we have each become unwitting archivists for our own supersized collections, as we have adopted forms of communication that are prolific and easy to create, and that accumulate over time into numbers that dwarf our printed record and can easily mount into a pile of digital files that borders on shameful hoarding. I have over 300,000 email messages going back to my first email address in the 1990s (including an eye-watering 75,000 that I have sent), and 30,000 digital photos. This is what happens when work life meets Microsoft Office and our smartphone cameras meet kids and pets.

    Will we have lost something in this transition? Of course. Keeping a dedicated archival staff in close proximity to a bounded paper-based collection yields real benefits. Having a researcher who is on site discover a key note on the back of a typescript page is also special.

    However, although the analog world can foster great serendipity, it does not have a monopoly on such fortunate discoveries. Digital collections have a serendipity all their own.

    Please do read the whole article for my thoughts about how we should approach the design of this digital library, and the possibilities it will enable, including broad access and new forms of research.

    What is Amazon? / David Rosenthal

    In Why It's Hard To Escape Amazon's Long Reach, Paris Martineau and Louise Matsakis have compiled an amazingly long list of businesses that exist inside Amazon's big tent. After it went up, they had to keep updating it as people pointed out businesses they'd missed. In most of those businesses, Amazon's competitors are at a huge disadvantage:
    While its retail business is the most visible to consumers, the cloud computing arm, Amazon Web Services, is the cash cow. AWS has significantly higher profit margins than other parts of the company. In the third quarter, Amazon generated $3.7 billion in operating income (before taxes). More than half of the total, $2.1 billion, came from AWS, on just 12 percent of Amazon’s total revenue. Amazon can use its cloud cash to subsidize the goods it ships to customers, helping to undercut retail competitors who don’t have similar adjunct revenue streams.
    In the mid-50s my father wrote a textbook, Organisation of retail distribution, with a second edition in the mid-60s. He would have been fascinated by Amazon. I've written about Amazon from many different viewpoints, including storage as a service, and anti-trust, so I'm fascinated with Amazon, too. Now, when you put recent posts by two different writers together, an extraordinarily interesting picture emerges, not just of Amazon but of the risks inherent to the "friction-free" nature of the Web:
    • Zack Kanter's What is Amazon? is easily the most insightful thing I've ever read about Amazon. It starts by examining how Walmart's "slow AI" transformed retail, continues by describing how Amazon transformed Walmart's "slow AI" into one better suited to the Internet, and ends up with a discussion of how Amazon's "slow AI" seems recently to have made a fundamental mistake.
    • Izabella Kaminska's series Amazon (sub)Prime? and Amazon (sub)Prime - Part II provides the deep dive to go with Kanter's big picture, looking in detail into one of the many symptoms of the "slow AI's" apparent mistake.
    Below the fold, a long meditation on these posts.

    So how come Zack Kanter knows so much about Amazon?
    I have sold to and bought from Amazon in about as many ways as one person can; I built an auto parts brand that sold thousands of SKUs [stock keeping units] to Amazon as a vendor (both stocking and drop ship) and as a marketplace seller (both "seller-fulfilled" and Fulfillment By Amazon, or FBA), before selling the company to a private equity fund in 2018. And I am now the founder and CEO of a startup called Stedi (a modern EDI platform, if you're familiar with EDI) that runs on Amazon Web Services; we automate transactions like purchase orders and invoices between brands and retailers.
    You really need to take the time right now to go read Kanter's "short book" then come back here at this link. Trust me, it will be time well spent. But, if you can't be bothered, here is enough of a summary so you can understand the rest.

    Kanter starts by describing the goals of Walmart's "slow AI":
    1. "a wide assortment of good quality merchandise"
    2. offered "at the lowest possible prices"
    3. backed by "guaranteed satisfaction" and "friendly, knowledgeable service"
    4. available during "convenient hours" with "free parking" and "a pleasant shopping experience"
    5. all within the largest, most convenient possible store size and location permitted by local economics
    As the "slow AI" was seeking these goals it was constrained by the limited shelf space in each store. Optimizing the profit to be made from the limited shelf space meant that Walmart's merchandisers had to find the best products, negotiate the lowest prices, ensure that the vendor could keep up with Walmart's demands, and so on. Walmart built a huge organization that was extraordinarily good at doing this. But then:
    Jeff Bezos had a big realization in 1994: the world of retail had, up until then, been a world where the most important thing was optimizing limited shelf space in service of satisfying the customer - but that world was about to change drastically. The advent of the internet - of online shopping - meant that an online retailer had infinite shelf space.
    The goals of Amazon's "slow AI" were simpler:
    1. a vast selection
    2. delivered fast
    3. at the lowest possible prices
    4. backed by guaranteed satisfaction
    With infinite shelf space and search functionality:
    the new formula became simpler: the more SKUs it added, the more items would be discovered by customers; the more items that customers discovered, the more items they would buy. In this world of infinite shelf space, it wasn't the quality of the selection that mattered - it was pure quantity. And with this insight, Amazon did not need to be nearly as good - let alone better - than Walmart at Walmart's masterful game of vendor and SKU selection. Amazon just needed to be faster at aggregating SKUs - and therefore faster at onboarding vendors.
    The juggernaut started rolling:
    Amazon systematically removed friction from the seller onboarding workflow, doing seemingly small things like eliminating the UPC code requirement that would serve as a barrier for newer, less established sellers. All of these small changes started to add up, and Amazon became the fastest way for a company to start selling online. Customers began to associate Amazon with selection, and Amazon became the de facto storefront for the fledgling world of online commerce. ... Amazon's SKU aggregation juggernaut was running an unbound search for customer value nationwide, while Walmart's army of finely-tuned retailer gatekeepers was still running a bounded search in local geographies.
    Kanter continues with a fascinating description of how Amazon, faced with a series of constraints on its growth, each time solved them with a platform that others, outside the company, could use. The obvious example is AWS, but there are many.

    Thanks to these platforms, the juggernaut kept running and Amazon became this Internet-scale collection of SKUs. Kanter writes:
    you have to understand that things get really weird when you run an unbounded search at internet-scale. When you remove "normal" constraints imposed by the physical world, the scale can get so massive that all of the normal approaches start to break down.
    Amazon's version of this was the sheer size and diversity of its collection of SKUs:
    Amazon would never be able to effectively curate such a sprawling array of product categories. It isn't particularly good at merchandising to start with, and, even if it were, it could never build a large enough army of merchandisers to curate such a massive selection. Instead, Amazon relies on a ranking algorithm that heavily weights product reviews and sales velocity. The more reviews a product has and the more units it sells, the higher it climbs in rankings. Of course, this creates a positive feedback loop: the more a product is exposed to customers, the more it sells; the more it sells, the more reviews it gets, and the higher it climbs in rankings, starting the loop all over again. (Yes, this is a gross oversimplification of Amazon's extraordinarily complex ranking algorithm)

    This creates a big problem for Amazon's customers, who want the latest and greatest products, and for its sellers, who want to develop and sell exciting new items. Failure to satisfy these demands would put Amazon's ecommerce dominance at risk.

    Amazon answered this problem in typical fashion: with a platform. Amazon Advertising allowed sellers to feature 'Sponsored Products' - paid ads that appear at the top of search results. Sponsored Products solved three problems at once: new product discovery for the customers, new product introductions for the sellers, and, as an added bonus, pure gross margin revenue for Amazon - to the tune of $8 billion annually.
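
    The positive feedback loop Kanter describes can be seen in a toy simulation. This is only a sketch under invented assumptions: the score (reviews plus sales), the exposure rule (attention falling off as 1/position), the review rate and the starting numbers are all hypothetical, and Kanter himself notes that the real ranking algorithm is far more complex.

```python
def simulate_ranking(products, rounds=10, shoppers=1000, review_rate=0.1):
    """Run a toy version of the ranking feedback loop.

    products maps a name to {"sales": ..., "reviews": ...}.
    Returns (final ranking, final per-product state).
    """
    state = {name: dict(stats) for name, stats in products.items()}

    def score(name):
        # Hypothetical score: reviews and unit sales, equally weighted.
        return state[name]["reviews"] + state[name]["sales"]

    for _ in range(rounds):
        ranked = sorted(state, key=score, reverse=True)
        # Assumed exposure rule: attention falls off as 1/position.
        weights = [1 / (pos + 1) for pos in range(len(ranked))]
        total = sum(weights)
        for name, weight in zip(ranked, weights):
            sold = shoppers * weight / total          # exposure drives sales
            state[name]["sales"] += sold
            state[name]["reviews"] += sold * review_rate  # sales drive reviews

    return sorted(state, key=score, reverse=True), state


start = {
    "incumbent": {"sales": 500, "reviews": 50},
    "challenger": {"sales": 100, "reviews": 10},
    "newcomer": {"sales": 0, "reviews": 0},
}
final_ranking, final_state = simulate_ranking(start)
```

    Note that product quality appears nowhere in the model. The newcomer stays in last place however good it is, and the gap between it and the incumbent widens every round, which is the discovery problem that Sponsored Products was meant to address.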
    In other words, since the curation problem was too big for Amazon to solve, they made a for-pay curation platform to solve it. Introducing "Sponsored Products" in this way is very similar to what Google did to curate search results:
    Google allows companies to bid on search terms, and displays paid content at the top of its search results in the same blue font used for unpaid content. (For example, a candy maker might bid on the term "Christmas candy" so that its ads pop up when someone searches for those words.) Google identifies ads in its search results with an icon below the link.
    When this topic came up on Dave Farber's IP list, my friend Chuck McManis, who once ran a Google competitor, weighed in with a typically informative response about the result:
    On the search page, Google's bread and butter so to speak, for a 'highly contested' search (that is what search engine marketeers call a search query that can generate lucrative ad clicks) such as 'best credit card' or 'lowest home mortgage', there are many web browser window configurations that show few, if any organic search engine results at all!
    As I wrote at the time:
    In other words, for searches that are profitable, Google has moved all the results it thinks are relevant off the first page and replaced them with results that people have paid to put there. Which is pretty much the definition of "evil" in the famous "don't be evil" slogan notoriously dropped in 2015. I'm pretty sure that no-one at executive level in Google thought that building a paid-search engine was a good idea, but the internal logic of the "slow AI" they built forced them into doing just that.
    Naturally, "Sponsored Products" caused the same thing to happen to Amazon:
    The problem with Sponsored Products is that sponsored listings are not actually good for customers - they are good for sellers; more specifically, they are good for sellers who are good at advertising, and bad for everyone else. Paid digital advertising is a very specific skill set; the odds that the brand with the best product also happens to employ the best digital marketing staff or agency is extraordinarily low. Further, the ability to buy the top slot in search results favors products with the highest gross margin - hence the highest bidder - not the products that would best satisfy customers.

    The issue is compounded by the fact that the average customer is unable to tell the difference between an "organic" search result and a sponsored product. The top four results in an Amazon search are now occupied by sponsored listings, which means that the average Amazon customer is disproportionately likely to be purchasing a sponsored product. And since the sponsored listings favor high-margin products pushed by savvy digital marketers, it is highly unlikely that Amazon's customer is buying the optimal product that the market could provide.

    To be sure, very poor products get rated poorly and are weeded out quickly, but, by and large, sponsored listings drag the average quality of products sold closer to mediocrity, and further from greatness. That's bad.
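
    Kanter's point that the top slot goes to the highest gross margin rather than the best product can be made concrete with a small sketch. Everything here is an assumption for illustration: the Listing fields, the quality score, the flat conversion rate, and the simplification that each seller bids up to its break-even cost per click. It is not a description of how Amazon's ad auction actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    name: str
    price: float
    unit_cost: float
    quality: float  # hypothetical 0-1 customer-satisfaction score

    @property
    def gross_margin(self):
        return self.price - self.unit_cost


def max_bid(listing, conversion_rate=0.1):
    """Most a seller can pay per click and still break even on a sale."""
    return listing.gross_margin * conversion_rate


def sponsored_winner(listings):
    """The top slot goes to whoever can afford the highest bid."""
    return max(listings, key=max_bid)


def best_for_customer(listings):
    """What a curator focused on customer satisfaction would pick."""
    return max(listings, key=lambda listing: listing.quality)


listings = [
    Listing("well-made", price=30.0, unit_cost=24.0, quality=0.9),
    Listing("mediocre", price=30.0, unit_cost=9.0, quality=0.5),
]
winner = sponsored_winner(listings)
preferred = best_for_customer(listings)
```

    At the same retail price, the cheaply made product has more margin to spend on ads, so it can outbid the better product for the top slot.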
    What happens when you order one of these mediocre, high-margin products? Amazon ships it to you in two days, or maybe even next day, or in some cases the same day! How is this even possible?

    This is the question behind Izabella Kaminska's deep dive. Some of the products with rapid delivery are Amazon's own. Preventing Amazon from both owning the platform and selling through it is a key aspect of Senator Elizabeth Warren's anti-trust policy; it isn't a good idea for a company to compete with its own customers. But that's a topic for a different post. The other products with rapid delivery come from vendors using "Fulfilled By Amazon" (FBA); they are stocked in and shipped from Amazon's warehouses. There are two ways to identify products in this system. The first is the FNSKU:
    Unless you make your money from selling stuff on Amazon, chances are you won't have heard of an FNSKU. The acronym stands for Fulfilment Network Stock Keeping Unit and represents a location identifier for products sitting in Amazon warehouses. This, to all intents and purposes, equates to an Amazon barcode.
    The other is the manufacturer's barcode, the one you'd see in a brick-and-mortar store. Amazon's systems discourage vendors from using the FNSKU:
    Not using an FNSKU is appealing for sellers. It means products sourced from manufacturers do not have to be relabelled, ensuring they can be sent into Amazon's network directly, saving time and money. Sellers who have chosen to be fulfilled by Amazon otherwise add an additional logistical layer into their operations if they have to relabel the goods independently.

    Using manufacture bar codes also means products are more likely to qualify for Amazon Prime classification, pushing them higher up the search rankings.
    Amazon's explanation for why it prefers manufacturer's barcodes is:
    If multiple sellers have inventory with the same manufacturer barcode, Amazon may fulfil orders using products with that barcode when those products are closest to the customer. This happens regardless of which seller actually receives a customer's order. We use this process to facilitate faster delivery.
    In other words, products using an FNSKU will ship from the warehouse where the vendor stocked them, whereas products using a manufacturer's barcode will ship from a warehouse where a vendor selling the same product has stocked them. In theory this "commingling" is good for the vendors, getting them faster delivery at lower inventory cost, and good for the customers, who get faster delivery and lower prices.
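
    The difference between the two labelling schemes can be sketched in a few lines. This is an illustration of the behaviour described above, not Amazon's actual system: the Unit fields, the seller and warehouse names, and the distances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    barcode: str      # manufacturer's barcode printed on the item
    owner: str        # seller who stocked this unit
    warehouse: str    # where the unit is sitting
    commingled: bool  # True: manufacturer barcode only; False: relabelled with an FNSKU


def pick_unit(inventory, seller, barcode, distance_km, commingled):
    """Choose which physical unit ships for an order placed with `seller`."""
    if commingled:
        # Any commingled unit with this barcode will do, whoever stocked it.
        candidates = [u for u in inventory
                      if u.barcode == barcode and u.commingled]
    else:
        # FNSKU stock: only the seller's own relabelled units qualify.
        candidates = [u for u in inventory
                      if u.barcode == barcode and not u.commingled
                      and u.owner == seller]
    # Ship from the warehouse closest to the customer.
    return min(candidates, key=lambda u: distance_km[u.warehouse])


inventory = [
    Unit("X123", owner="honest-seller", warehouse="west", commingled=True),
    Unit("X123", owner="other-seller", warehouse="east", commingled=True),
    Unit("X123", owner="label-seller", warehouse="west", commingled=False),
]
# Hypothetical customer on the east coast.
distance_km = {"west": 4000, "east": 20}

shipped_commingled = pick_unit(inventory, "honest-seller", "X123", distance_km, commingled=True)
shipped_fnsku = pick_unit(inventory, "label-seller", "X123", distance_km, commingled=False)
```

    In the commingled case the customer who bought from honest-seller receives a unit stocked by other-seller, because it is closer. Delivery is faster, but nothing in the selection checks that the two units are really the same product.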

    At Internet scale there's no way Amazon can verify that products carrying the same manufacturer's barcode are actually the same product. So, inevitably, in some cases they aren't. Kaminska reports:
    For the whole thing to work seamlessly, the underlying inventory across the entire system must be genuinely equivalent, and bear all the same properties. For that to be case, somebody has to be willing to police the quality of goods entering the system.

    Sellers we spoke to said Amazon seems strangely reluctant to step up on this front, preferring to trust that what's on the label is what's inside - possibly because of the costs involved.

    Some sellers believe the lackadaisical approach has now exposed the entire network to contamination by inferior or fake goods: an unscrupulous vendor can pass off a copycat good as the genuine article by applying a manufacturer's barcode, which is easy to do.

    This in turn has created a quality lottery for customers who purchase from commingled inventory (often without realising it).
    It gets worse:
    because Amazon is still obliged to return unsold inventory to suppliers on request, sellers say this creates an incentive for opportunists to send in fake, or low quality, goods into commingled inventory just to receive higher quality goods in return. What proportion of quality goods they receive back depends on how contaminated the particular product pool is, but for many the arbitrage opportunity is worth a punt.

    A further arbitrage relates to products bearing manufacturer warranties. As a rule, warranties are either dated from the date of manufacture or the date of sale. In the event of the former, commingled pools can comprise of a huge range of warranty durations, from entirely expired to brand-new. Since customers often don't check warranties until it's too late, these discrepancies often go unnoticed.
    Just as Facebook claims it can police its network, Amazon claims it can handle the problem:
    Amazon says it has the means to track the provenance of disputed goods even if sourced from commingled stock. It adds that in such cases it fronts the refund to the customer directly before attempting to recoup costs from the bad actor responsible. If that fails, it pursues further action against the bad actor - whether that's blocking the account, litigation or law enforcement.

    But sellers insist that in many instances the company's response time is far too slow when it comes to disabling offending accounts or protecting their IP. It's also highly discretionary, with further legal action only being taken when it is expedient for Amazon to do so.
    The vendors are much smaller than Amazon, and dependent upon Amazon's sales channel. So it is easy, if short-sighted, for Amazon to insist that they must deal with the problem themselves:
    One US-based seller of security products told us:
    Amazon is demanding sellers who suffer from this problem that the sellers - who have no way of actually rectifying the issue unless they de-intermingle [de-commingle] and start issuing their own SKUs which is against Amazon's policy - offer a plan of action to explain how they are going to fix all the customer complaints they are getting about the products Amazon is sending,
    He added that if they don't offer a plan, Amazon simply kicks them off the platform anyway, as the presence of bad ratings is bad for everyone. The recent Amazon vendor purge is testament to that (more on that in our follow-up post).

    Another seller said that the process of identifying the bad seller who supplied his customer is not only long-winded but typically involves placing a "test buy" themselves, putting the onus on them to prove the problem exists, which may or may not be replicable.
    After this examination of the role of fake products in the Fulfilled By Amazon channel, in Part II, Kaminska discusses their role in Amazon Marketplace:
    Fakes can also sneak in through the company's Marketplace platform. On Marketplace, Amazon only acts as a commission-charging matching agent, with sellers fulfilling their own orders and using their own warehouses, meaning there's little to no quality control or supervision on their part.
    As usual, Amazon's way of dealing with this problem was to create a platform, the review and rating system, by which outsiders could solve it for them:
    But reviews on the site cannot always be trusted, thus Amazon's largely immutable review and ratings system hasn't entirely eliminated the information asymmetry that creates the famous Akerlof-ian lemon problem — in which good products are underpriced to compensate for the risk of customers unwittingly buying lemons.

    Customers have learnt to offset the risk of delays and returns by demanding significant discounts on products purchased via third parties. That discount, alongside the cost and uncertainty of running arduous returns policies, has transferred a huge amount of risk to the remaining honest sellers on Marketplace.
    It may not be just FBA and Marketplace vendors that are at risk from fake products:
    To what degree Amazon's own products (ie those it sources from wholesale suppliers directly) are vulnerable to being mixed in with third-party products, however, is unclear. All we know from the spokesman is that Amazon has been commingling its own products since 2013 in the UK. Whether or not there has been an increase in the number of subquality goods being sold under Amazon's name can only be determined by a holistic review of comments for Amazon's own service which, as far as we know, is impossible for a user to do.
    Last month Amazon abruptly purged many long-standing vendors of Amazon's own products. Ostensibly, this was to push them into the Marketplace program, which generates better margins for Amazon. But:
    Other evidence, however, suggests the vendor purge was as much to do with rooting out bad actors as it was about managing costs. For example, a week or so after the vendor purge was announced, DigiDay UK was among those reporting that Amazon had decided to walk back the decision to terminate purchase orders on the proviso vendors signed up to Amazon's Brand Registry enrolment system. This is a service that allows brand owners to protect their IP by ensuring those who sell their products on Amazon must have permission to do so. Except, suspensions didn't just impact brand owners or licensees.
    As DigiDay put it:
    The emphasis on Brand Registry suggests that Amazon’s Vendor Central purge was a move to eliminate vendors that not only aren’t profitable for Amazon to manage, but in some cases are also low quality, selling counterfeit goods or branded products without authorisation.
    This, however, is understandably problematic for legitimate resellers of branded products.
    Kaminska concludes:
    it's fair to assume that the more good actors opt out of commingling the more they increase costs for Amazon and leave the online store itself on the hook for the subquality products entering its system from the remaining bad actors. ... the more trustworthy a retailer's supplier network is, the leaner and more efficient it can operate and, ultimately, the more cost competitive it can be.

    What Amazon investors need to establish is to what degree Amazon's model of instant and affordable fulfilment under the Prime umbrella has been indirectly dependent on drawing on similar efficiencies without the same regard for quality control. And, in that context, to what degree its competitive pricing (and consequent ability to outcompete conventional retail) is more the product of allowing any old seller with any old ware to compete alongside those with superior standards — generating the lemon discount phenomenon — than a core efficiency in its model.

    Because if the answer is a lot, the more we see Amazon taking measures to nip the counterfeiting problem in the bud, the more likely the retailer is to bifurcate into a closed online retail system (similar to conventional retail with trusted supply chains, white labelled goods and other customer-facing ops) and an open-ended eBay-equivalent (albeit without an auction mechanism to regulate prices).
    Just as "Sponsored Products" give the advantage to sellers who exploit the buyer, commingling gives the advantage to unscrupulous sellers who exploit both more scrupulous sellers and the buyers, who get fake products or expired warranties.

    The fundamental problem is one that affects all Internet-scale platforms. It is the same problem that means Facebook can't prevent the bad guys using their platform to spread hate and interfere in elections, can't prevent third-parties leaking their users' data, and won't remove content that obviously violates their stated policies. It is the same problem that means the bad guys can censor YouTube with bogus DMCA takedowns, spread anti-vaxxer propaganda, and target children with disturbing fake Peppa Pig videos. It is the same problem that means the Google Play store contains a whole lot of malware, including from governments. It is the same problem that means scholarly communication is collapsing under attack from predatory publishers.

    The problem is that Internet scale means "things get really weird" and human intervention to de-weird them is ineffective. Why is human de-weirding ineffective? For three related reasons:
    • Even though the FAANGs are huge companies, there are a lot more bad guys out there trying to subvert their platforms than the companies have employees. Just as an example, as of last December there were 35,587 full-time Facebook employees. But last week came the report Researchers unearth 74 Facebook cybercrime groups with 385,000 members. That is more than ten bad guys per employee in just this one example, which is not unique:
      Friday’s report comes a year after journalist Brian Krebs reported Facebook had deleted almost 120 groups with more than 300,000 members total after Krebs provided documentation they were flagrantly promoting a host of illicit activities on the social media network’s platform.
    • The financial and other incentives for the bad guys are large and immediate. In the short term, and as we see with Facebook even in the medium term, the financial penalty to the platforms for not de-weirding is insignificant.
    • Because the short-term penalties for tolerating weirdness are insignificant, and the cost of even marginally effective efforts to de-weird the platform is large - they require hiring large numbers of humans - the platforms only mount token de-weirding efforts. The problem being fixed isn't the weirdness, it is the publicity about the lack of de-weirding efforts. Note that Facebook depends upon outsiders like Krebs and Talos to detect bad guys, despite the searches involved being trivial:
      Indeed, less than two minutes of searching on Facebook turned up groups that appeared to offer the same services. One group called Carding Secured offered an array of services related to stolen payment-card data.
    In each instance of the fundamental problem, the result is negative externalities. Fossil fuel companies' massive profits are based on avoiding the costs carbon emissions impose on current, and even more so future, society. The Sacklers' massive profits are based on avoiding the costs opioid addiction imposes on current, and even more so future, society. Similarly, the platforms' massive profits are based on avoiding the costs that are imposed on current, and even more so future, society by their inability to cope with the way things get really weird when you run an unbounded search at internet-scale. Costs such as science's loss of credibility, a generation of traumatized, video-addicted children, and the collapse of democracy.

    I, too, want answers / CrossRef

    Around 1966-67 I worked on the reference desk at my local public library. For those too young to remember, this was a time when all information was in paper form, and much of that paper was available only at the library. The Internet was just a twinkle in the eye of some scientists at DARPA, and none of us had any idea what kind of information environment was in our future.* The library had a card catalog and the latest thing was that check-outs were somehow recorded on microfilm, as I recall.

    As you entered the library the reference desk was directly in front of you in the prime location in the middle of the main room. A large number of library users went directly to the desk upon entering. Some of these users had a particular research need in mind: a topic, an author, or a title. They came to the reference desk to find the quickest route to what they sought. The librarian would take them to the card catalog, would look up the entry, and perhaps even go to the shelf with the user to look for the item.**

    There was another type of reference request: a request for facts, not resources. If one wanted to know what was the population of Milwaukee, or how many slot machines there were in Saudi Arabia***, one turned to the library for answers. At the reference desk we had a variety of reference materials: encyclopedias, almanacs, dictionaries, atlases. The questions that we could answer quickly were called "ready reference." These responses were generally factual.

    Because the ready reference service didn't require anything of the user except to ask the question, we also provided this service over the phone to anyone who called in. We considered ourselves at the forefront of modern information services when someone would call and ask us: "Who won best actor in 1937?" OK, it probably was a bar bet or a crossword puzzle clue but we answered, proud of ourselves.

    I was reminded of all this by a recent article in Wired magazine, "Alexa, I Want Answers."[1] The argument as presented in the article is that what people REALLY want is an answer; they don't want to dig through books and journals at the library; they don't even want an online search that returns a page of results; what they want is to ask a question and get an answer, a single answer. What they want is "ready reference" by voice, in their own home, without having to engage with a human being. The article is about the development of the virtual, voice-first, answer machine: Alexa.

    There are some obvious observations to be made about this. The glaringly obvious one is that not all questions lend themselves to a single, one sentence answer. Even a question that can be asked concisely may not have a concise answer. One that I recall from those long-ago days on the reference desk was the question: "When did the Vietnam War begin?" To answer this you would need to clarify a number of things: on whose part? US? France? Exactly what do you mean by begin? First personnel? First troops? Even with these details in hand experts would differ in their answers.

    Another observation is that in the question/answer method over a voice device like Alexa, replying with a lengthy answer is not foreseen. Voice-first systems are backed by databases of facts, not explanatory texts. Like a GPS system they take facts and render them in a way that seems conversational. Your GPS doesn't reply with the numbers of longitude and latitude, and your weather app wraps the weather data in phrases like: "It's 63 degrees outside and might rain later today." It doesn't, however, offer a lengthy discourse on the topic. Just the facts, ma'am.[3]

    It is very troubling that we have no measure of the accuracy of these answers. There are quite a few anecdotes about wrong answers (especially amusing ones) from voice assistants, but I haven't seen any concerted studies of the overall accuracy rate. Studies of this nature were done in the 1970's and 1980's on library reference services, and the results were shocking. Even though library reference was done by human beings who presumably would be capable of detecting wrong answers, the accuracy of answers hovered around 50-60%.[2] Repeated studies came up with similar results, and library journals were filled with articles about this problem. The solution offered was to increase training of reference staff. Before the problem could be resolved, however, users who previously had made use of "ready reference" had moved on to in-sourcing their own reference questions by using the new information system: the Internet. If there still is ready reference occurring in libraries, it is undoubtedly greatly reduced in the number of questions asked, and it doesn't appear that studying the accuracy is on our minds today.

    I have one final observation, and that is that we do not know the source(s) of the information behind the answers given by voice assistants. The companies behind these products have developed databases that are not visible to us, and no source information is given for individual answers. The voice-activated machines themselves are not the main product: they are mere user interfaces, dressed up with design elements that make them appealing as home decor. The data behind the machines is what is being sold, and is what makes the machines useful. With all of the recent discussion of algorithmic bias in artificial intelligence we should be very concerned about where these answers come from, and we should seriously consider if "answers" to some questions are even appropriate or desirable.

    Now, I have a question: how is it possible that so much of our new technology is based on so little intellectual depth? Is reductionism an essential element of technology, or could we do better? I'm not going to ask Alexa**** for an answer to that.

    [1] Vlahos, James. “Alexa, I Want Answers.” Wired, vol. 27, no. 3, Mar. 2019, p. 58. (Try EBSCO)
    [2] Weech, Terry L. “Review of The Accuracy of Telephone Reference/Information Services in Academic Libraries: Two Studies.” The Library Quarterly: Information, Community, Policy, vol. 54, no. 1, 1984, pp. 130–31.

    * The only computers we saw were the ones on Star Trek (1966), and those were clearly a fiction.
    ** This was also the era in which the gas station attendant pumped your gas, washed your windows, and checked your oil while you waited in your car.
    *** The question about Saudi Arabia is one that I actually got. I also got the one about whether there were many "colored people" in Haiti. I don't remember how I answered the former, but I do remember that the user who asked the latter was quite disappointed with the answer. I think he decided not to go.
    **** Which I do not have; I find it creepy even though I can imagine some things for which it could be useful.

    Change to OCLC OAuth Server / OCLC Dev Network

    OCLC has deprecated returning user information from our API OAuth Server, to be more compliant with the FedRAMP standard.

    It's The Enforcement, Stupid! / David Rosenthal

    Kim Stanley Robinson is a remarkable author. In 1990 he concluded his Wild Shore triptych of novels describing alternate futures for California with Pacific Edge:
    Pacific Edge (1990) can be compared to Ernest Callenbach's Ecotopia, and also to Ursula K. Le Guin's The Dispossessed. This book's Californian future is set in the El Modena neighborhood of Orange in 2065. It depicts a realistic utopia as it describes a possible transformation process from our present status, to a more ecologically-focused future.
    Why am I writing about this now, nearly three decades later? Follow me below the fold for an explanation.

    After last summer's Decentralized Web Summit I wrote:
    Cory Doctorow's barn-burner of a closing talk, Big Tech's problem is Big, not Tech, was on anti-trust. I wrote about anti-trust in It Isn't About The Technology, citing Lina M. Khan's Amazon's Antitrust Paradox. It is a must-read, as will Cory's talk be if he posts it (Update: the video is here). I agree with him that this has become the key issue for the future of the Web; it is a topic that's had a collection of notes in my blog's draft posts queue for some time.
    With the start of the 2020 Presidential race, anti-trust has become a highly visible issue, just less visible than economic inequality. In The Myth of Capitalism: Monopolies and the Death of Competition Jonathan Tepper and Denise Hearn argue that these issues are intimately related; the rise of oligopoly capitalism has been a major cause of the rise of economic inequality. From John Hempton's review of the book:
    [the book] starts in an entirely appropriate place.

    Dr Dao - a doctor with patients to serve the next day - was "selected" by United Airlines to be removed from an overbooked plane.

    As he had patients to tend the next day he did not think he should leave the plane. So the airline sent thugs to bash him up and forcibly removed him.

    The video (truly sickening) went viral. But the airline did not apologise. The problem it seems was caused by customer intransigence.

    They apologised after what Tepper and Hearn think was true public revolt, but what I think was more likely the realistic threat to ban United Airlines from China because of the racial undertones underlying that incident.

    If a "normal" company sent thugs to brutalise its customers it would go out of business. But United went from strength to strength.

    The reason the authors assert was that United has so much market power you have no choice to fly them anyway - and by demonstrating they had the power to kick your teeth in they also demonstrated that they had the power to raise prices. The stock went up pretty sharply in the end.

    Oligopoly - extreme market power - not only makes airlines super-profitable. It gives them the licence to behave like complete jerks.

    But what is true of airlines is true of industry after industry in the United States. Hospital mergers have left many towns with one or two hospitals. Health insurance is consolidated to the point where in most states there is only one or two realistic choices. Even the chicken-farming industry is consolidated to the point where the relatively unskilled and non-technical industry makes super-normal profits.
    How did we get to the point where a company can hike its stock price by assaulting its customers? It wasn't that anti-trust law changed, it is that the Chicago school changed the way the law was interpreted to focus on "consumer welfare" defined as low prices, thereby ham-stringing its enforcement. As we see in the contrast between the Savings and Loan crisis and the Global Financial crisis, a law isn't effective simply because it is on the books, but only if it is effectively enforced.

    The need for enforcement is something that features in Senator Elizabeth Warren's antitrust proposal. Her idea is that companies that exceed $25B/yr in revenue encounter special rules requiring them to divest certain parts of their operations, and that weaker rules apply between $90M/yr and $25B/yr. Nilay Patel interviewed Senator Warren at SXSW. Patel's questions are in bold:
    At $25 billion [in annual revenue to trigger a breakup], you’re not anticipating that the local supermarket is going to stop having to do house brands.

    Exactly. And no one’s looking for that. You’re getting into the nuance, that actually this is a two level regulation. The one that’s caught all the headlines is that for everybody above $25 billion, you got to break off the platform for many of the ancillary or affiliated businesses.

    But between 90 million and 25 billion [in annual global revenue], the answer is to say if you run a platform, you have an obligation of neutrality, so you can’t engage in discriminatory pricing. Obviously, it’s like the net neutrality rule: you can’t speed up some folks and slow down other folks, which is another way of pricing. So there’s an obligation of neutrality.

    The advantage to breaking them up at the top [tier] rather than just simply saying, “gosh, girl, why didn’t you just go for obligation of neutrality all the way through?” is that it actually makes regulation far easier. When you’ve just got a bright-line rule, you don’t need the regulators. At that point, the market will discipline itself. If Amazon the platform has no economic interest in any of the formerly-known-as-Amazon businesses, you’re done. It takes care of itself.
    So you’re articulating a bright-line rule. A lot of conversations I’ve had with antitrust people like the Tim Wus and Lina Khans of the world, they’re saying we need to change the standard. We need to go from the consumer welfare antitrust standard to a European-style competition standard. Are you advocating that we change the antitrust standard?

    I just think it’s a lot harder to enforce that against a giant that has huge political power.

    So you’re in favor of leaving the consumer welfare standard alone?

    Look, would I love to have [that changed] as well? Sure. I have no problem with that.

    My problem is in the other direction: there are times when hard, bright-line rules are the easiest to enforce, and therefore you’re sure you’ll get the result you want.

    Let me give you an example of that: I’ve been arguing for a long time now for reinstatement of [the] Glass-Steagall [Act]. And my argument is basically, don’t tell me that the Fed and the Office of the Comptroller of the Currency can crawl through Citibank and JPMorgan Chase and figure out whether or not they’re taking on too much risk and whether they’ve integrated and cross-subsidized businesses. Just break off the boring banking part — the checking accounts, the savings accounts, what you and I would call commercial banking — from investment banking, where you go take a high flyer on this stock or that new business

    When you break those two apart, you actually need fewer regulators and less intrusion on the business.

    You also get more assurance it really happened. We live in an America where it’s not only economic power that we need to worry about from the Amazons and Facebooks and Googles and Apples of the world — we have to worry about their political power as well. There’s a reason that the Department of Justice and the Federal Trade Commission are not more aggressive. There was a time, long ago, when they were more aggressive, a golden age of antitrust enforcement.

    These big companies exert enormous influence in the economy and in Washington, DC. We break them apart, that backs up the influence a little bit, and it makes absolutely sure that they’re not engaged in these unfair practices that stomp out every little business that’s trying to get a start, every startup that’s trying to get in there.
    Senator Warren is clearly right about the importance of bright lines for enforceable anti-trust laws when she says:
    When you’ve just got a bright-line rule, you don’t need the regulators. At that point, the market will discipline itself.
    But in my view she doesn't go far enough, for two reasons:
    • In her vision, what happens when a company exceeds $25B/yr in revenue is that a conversation starts between the company and the regulators. Given the resources available on both sides, this is a conversation that (a) will go on for a long time, and (b) will be resolved in some way acceptable to the company.
    • Her vision seems narrowly tailored to the FAANGs, ignoring the real oligopoly of the online world, the telcos. But her arguments apply equally to oligopoly and monopoly in other areas. John Hempton uses the example of Lamb Weston, the dominant player in french fries:
      French fries it seems are absurdly profitable. The return on assets is in the teens (which seems kind-of-good in this low return world). Margins keep rising and yet there is no obvious emerging competition.

      It may be a good investment even though it looks pretty expensive. But if competition comes Lamb Weston could be a terrible stock.

      There has been plenty of consolidation in this industry. Sure many of the mergers shouldn't have been approved by regulators - but they were - and the industry has become oligopolistic.

      But this is not a complicated industry - it is not obvious why competition doesn't come.
    I think Robinson was on to a better alternative. Although it is never spelled out explicitly, one key aspect of the transformation in Pacific Edge is that there are hard limits on both personal incomes and the size of corporations. There is a very simple way to implement such hard limits, via the tax code:

    Corporations should be subject to a 100% tax rate on revenue above the cap.

    There should be no need for anti-trust regulators to argue with the company about what it should do. It is up to the company, as always, to decide how to minimize their tax liability. They can decide to break themselves up, to lower prices, to stop selling product for the year, whatever makes sense in their view. It isn't up to the government to tell them how to structure their business. Basing the cap on revenue, as opposed to profit, prevents most of the ways companies manipulate their finances to avoid tax. Basing enforcement on the tax code leverages existing mechanisms rather than inventing new ones. And, by the way:

    Individuals should be subject to a 100% tax rate on income above a similar cap.

    In both cases the 100% rate should be supplemented by a small wealth tax, a use-it-or-lose-it incentive for cash hoards to be put to productive use instead of imitating Smaug's hoard.
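
    As a rough illustration of how such a cap would work, here is a minimal sketch of the arithmetic. The cap value is hypothetical (the proposal above does not name a figure), and the sketch ignores ordinary taxes below the cap and the supplementary wealth tax.

```python
def capped_tax(amount: float, cap: float) -> float:
    """Tax owed under a 100% marginal rate on everything above the cap."""
    return max(0.0, amount - cap)


def retained(amount: float, cap: float) -> float:
    """What is left after the cap tax; this can never exceed the cap."""
    return amount - capped_tax(amount, cap)


# Hypothetical cap, chosen only for illustration.
CAP = 25e9

for revenue in (10e9, 25e9, 40e9):
    print(
        f"revenue {revenue:,.0f} -> tax {capped_tax(revenue, CAP):,.0f}, "
        f"retained {retained(revenue, CAP):,.0f}"
    )
```

    Because every dollar above the cap is taxed away, growing revenue past the cap gains the company nothing, which is what leaves the choice of how to stay under it (splitting up, cutting prices, and so on) to the company itself.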

    Update 6th April 2019:

    The effect of the lack of anti-trust enforcement can be seen in this graph of the proportion of startups in the US economy, roughly halved in 40 years, via Charles Hugh Smith.

    Digital Transformation Lowers Risk of Oil and Gas Exploration / Lucidworks

    For major oil and gas producers, the risk associated with exploring potentially lucrative oil prospects is often accepted as the cost of doing business. For example, in 2015, after eight years of planning and drilling, Shell announced a $4 billion write down as a result of a single failed exploration project in the Arctic.

    In recent years, the application of technologies such as machine learning (ML) and artificial intelligence (AI) has helped forward-thinking operators lower their risk by providing exploration teams with valuable insights about where to drill. In such cases, advanced analytics are used to search and interpret decades' worth of untapped information — much of which is locked in data silos — to make more informed exploration decisions and avoid multi-billion dollar mistakes.

    Despite these benefits, the majority of exploration and production (E&P) companies continue to rely on traditional methods of data retrieval and analysis. However, as operators search for ways to increase efficiency and profit margins in the “lower for longer” environment (i.e., with crude oil prices fluctuating between $50 – $75 per barrel), the value that digital transformation can provide is becoming harder and harder to ignore.

    The Role of Search in Upstream

    The primary objective of upstream geophysicists and supporting data scientists is to identify and model hydrocarbon resources that can be exploited cost-effectively. Today, the vast majority of exploration takes place offshore in deepwater. In this environment, drilling of a single exploratory well can cost upward of $150 million. With documented success rates of around 20 percent, it is critical that personnel responsible for analyzing geology and identifying locations to drill are able to make an informed decision based on all relevant data.
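
    To see why the success rate matters so much, consider a back-of-the-envelope sketch using the two figures above. It assumes, purely for illustration, that every well costs the same and succeeds independently with the same probability; the 25 percent "improved" rate is hypothetical and not taken from any reported result.

```python
def expected_wells_per_success(p_success: float) -> float:
    """Mean number of wells drilled per successful well (geometric distribution)."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success


def expected_cost_per_success(cost_per_well: float, p_success: float) -> float:
    """Expected drilling spend per successful well."""
    return cost_per_well * expected_wells_per_success(p_success)


COST_PER_WELL = 150e6  # "upward of $150 million" per exploratory well

baseline = expected_cost_per_success(COST_PER_WELL, 0.20)  # ~20% success rate
improved = expected_cost_per_success(COST_PER_WELL, 0.25)  # hypothetical rate

print(f"baseline: ${baseline:,.0f} per successful well")
print(f"improved: ${improved:,.0f} per successful well")
print(f"difference: ${baseline - improved:,.0f}")
```

    Under these simplifying assumptions, even a few percentage points of improvement in the success rate changes the expected spend per discovery by a large amount.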

    And therein lies the problem.

    In most cases, the information required to make these decisions resides in a variety of locations and file formats (e.g., text files, well logs, GIS files, images, etc.). In the case of large E&P companies whose operations span decades and millions of surface acres, internal databases often consist of highly disparate, siloed, and unorganized data. This makes it difficult for engineers and geophysicists to find and access the right information using conventional search applications. Common issues companies encounter include:

    • Available, relevant data takes too long for users to find and access (i.e., poor search functionality)
    • Inconsistencies in how data is labeled and tagged
    • Some data is so disparate that decision-makers are not aware that it even exists
    • The location of information and data related to past projects resides in the minds and/or private notes of experienced personnel and is inaccessible to new employees
    • LAN (or GIS) files constitute a significant portion of wanted yet unsearchable data
    • Efforts to find and access data are often duplicated by separate teams, wasting time and resources

    Leveraging Search Analytics

    The upstream oil and gas industry is addressing these challenges by making data applications more intelligent. In one instance, a major E&P operator had more than 10 million surface acres in its portfolio and was focused on driving digital transformation at a massive scale. An important part of that initiative involved applying advanced search analytics and machine learning to change the way exploration teams could find and access upstream data.

    The objective of the project extended beyond simply providing users with a faster way of locating relevant information. It was about creating a compelling data experience by using analytics to connect information silos, create an agile catalog of context and relevance, and enable users to extract relevant business knowledge across all historical databases.

    To achieve this, the operator deployed an AI-powered search and discovery application based on Lucidworks Fusion, which provided the following advanced capabilities:

    • Collecting documents to be processed while maintaining information security and OCR as necessary
    • Analyzing documents via natural language processing (NLP) in English, French, German, Russian
    • Analyzing each file content to determine most likely location(s) based on country>basin>block>field>well hierarchy
    • Creating relevancy by classifying documents based on their content
    • Enabling information retrieval across data sources
    • Refining relevancy with user rating of retrieved documents
    • Visualizing results via Map, word cloud type interface for dynamic filtering, and timescale filtering
    • Facilitating discovery of new information via triggers based on specific interests saved by users, as well as Google-like “you might be interested in …” suggestions based on browsing history and popular queries

    A key component of the intelligent search application is its use of machine learning, which allows it to continuously improve, based on the individual input from the user.

    With the deployment, the E&P operator can answer user queries and solve complex search and relevance problems, neither of which was possible using its existing search platform. This enables exploration teams to leverage more effectively the wealth of upstream data that has been (and is being) generated by giving them the ability to quickly and accurately retrieve information when they need it, which ultimately leads to better decision-making.

    The Road Ahead

    Digital transformation continues to serve as a powerful driver of success for industrial enterprises. While the oil and gas industry has not been immune to this trend, it historically has lagged behind others on the digitalization maturity curve. However, this is changing as organizations, particularly in the upstream sector, face increased pressure to reduce costs and minimize risk.

    In the near-term future, the primary focus for forward-thinking operators will be on leveraging big data and analytics to improve decision making and streamline existing business processes. During that time, the role of ML and AI in upstream operations will undoubtedly grow immensely.

    Alex Misiti is a former civil and environmental engineer turned freelance writer who focuses on topics relating to oil and gas, renewable energy, IIoT, and digital transformation. He is the owner of Medium Communications, LLC.

    The post Digital Transformation Lowers Risk of Oil and Gas Exploration appeared first on Lucidworks.

    AVAILABLE: The 2018 DuraSpace Annual Report / DuraSpace News

    The 2018 DuraSpace Annual Report is now available with an overview of accomplishments, finances, and global membership. Download the Annual Report here.

    With your support our organization took significant steps forward in 2018.

    DuraSpace and the communities it serves are alive with ideas and innovation. Our team strives to meet the needs of the ever-expanding scholarly ecosystem that connects us all. Our international community of practitioners, strategic partners, and service providers continued to contribute to the growth, advances and adoption rate of DSpace, Fedora and VIVO in 2018. In addition, DuraSpace hosted services growth is lowering the barrier of entry for organizations who want to deploy open source technologies with the assistance of a not-for-profit. We are grateful for our community’s financial support, and for their engagement in the enterprise we share as we work together to provide enduring access to the world’s digital heritage.

    Open Data Awareness Event at Kyambogo University, Uganda / Open Knowledge Foundation

    This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. The event reported on in this post took place on 9 March, and was organized by Samson Ngumenawe at Kyambogo University in Uganda under the Association of Student Surveyors Kyambogo (ASSK), an association that unites all lands students in Kyambogo. It introduced the potential of open data to students, especially finalists who are undertaking their research projects.

    The open data awareness event featured different topics including crowdsourcing data using OpenStreetMap, introduction to open geospatial tools like Quantum GIS and Java OpenStreetMap Editor, open data querying tools like overpass-turbo, OpenStreetMap downloader, quick OSM, and HOT export tool.

    The event was dominated by students from the department of lands and architectural studies, with the biggest number of students coming from the surveying and land economics classes. Attendees praised the event for creating an opportunity for students to access open data for research projects.

    Ms. Robinah Nakiwa, a fourth-year student of Land Economics running a research project on “The role of land use plans in the development control for buildings in upcoming towns”, was stuck on how to obtain the number of buildings in her study area until she became aware of the availability of open geospatial data on OpenStreetMap. Her study area was not fully mapped, however, so MapUganda stepped in to help map all the buildings in Bombo Town Council on OpenStreetMap. The researcher was then able to query them using overpass-turbo and perform a count that she later used to generate her sampling frame. This was done in a short while and it saved resources that would otherwise have been spent on data collection. “A lot of thanks go to everyone that has ever contributed to OpenStreetMap, the local OSM contributors the organizer of the Open Data event at Kyambogo University. Keep the community growing.”
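
    For readers curious what such a building count looks like in practice, here is a minimal sketch. It is not the query used in the project: the area name is an assumption about how the study area is tagged in OpenStreetMap, the sample response is invented, and only buildings mapped as ways are counted (buildings mapped as relations would need an extra clause). The query string can be pasted into overpass-turbo or sent to an Overpass API endpoint.

```python
def building_query(area_name: str) -> str:
    """Return an Overpass QL query for all building ways inside a named area."""
    return (
        "[out:json][timeout:60];\n"
        f'area["name"="{area_name}"]->.study;\n'
        'way["building"](area.study);\n'
        "out ids;"
    )


def count_buildings(response: dict) -> int:
    """Count the ways in a parsed Overpass JSON response."""
    return sum(
        1 for element in response.get("elements", [])
        if element.get("type") == "way"
    )


# Assumed area name; check how the study area is actually tagged.
print(building_query("Bombo Town Council"))

# Invented sample response, standing in for what the server would return.
sample_response = {
    "elements": [
        {"type": "way", "id": 1},
        {"type": "way", "id": 2},
        {"type": "way", "id": 3},
    ]
}
print(count_buildings(sample_response))
```

    The returned ids can double as the list from which a sample of buildings is drawn.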

    Ms. Edith Among, a fourth-year student of land surveying and land information systems, was also able to query highway data from OpenStreetMap and went on to do her final-year project on finding the optimum route for solid waste collection trucks in Njeru Division of Jinja Municipality.

    The challenging part of the event was lack of financial support. This created hindrances in providing necessities like internet bundles, event materials like stickers and banners, refreshments and communication.

    I believe that the next event will be bigger and it will create a great impact.

    How to build a user interface / Alf Eaton, Alf

    1. Find out what the data looks like at the beginning and what it needs to look like at the end.
    2. Build a database to store that data.
    3. Build an HTML form for manipulating the data.
    4. Record the time taken, and how many clicks and keypresses it takes, to get from the input data to the output data.
    5. Do whatever you like* with the UI to get those numbers as low as possible.

    * while maintaining the accessibility of the interface and the privacy, safety and security of everyone involved
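
    Step 4 can be as simple as a counter attached to the form's event handlers. The sketch below is one hypothetical way to record the numbers; the class and event names are invented for illustration, and a real interface would feed it from its own click and key events.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskMetrics:
    """Clicks, keypresses and elapsed time for one pass from input to output data."""
    clicks: int = 0
    keypresses: int = 0
    started: float = field(default_factory=time.monotonic)
    finished: Optional[float] = None

    def record(self, event: str) -> None:
        """Count a single UI event; only 'click' and 'keypress' are recognised."""
        if event == "click":
            self.clicks += 1
        elif event == "keypress":
            self.keypresses += 1
        else:
            raise ValueError(f"unknown event: {event}")

    def finish(self) -> None:
        """Mark the task as complete and freeze the elapsed time."""
        self.finished = time.monotonic()

    @property
    def seconds(self) -> float:
        end = self.finished if self.finished is not None else time.monotonic()
        return end - self.started


# Simulated session: in a real UI these calls would come from event handlers.
metrics = TaskMetrics()
for event in ["click", "keypress", "keypress", "click"]:
    metrics.record(event)
metrics.finish()
print(metrics.clicks, metrics.keypresses, round(metrics.seconds, 3))
```

    Comparing these numbers before and after each change to the interface gives step 5 something concrete to aim at.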

    Islandora Community Code of Conduct Survey / Islandora

    As our Islandora Community Code of Conduct nears its third birthday, the Islandora Coordinating Committee would like to take stock of how well it is serving all members of our community, and then update or expand it accordingly. To get this process started, we would like your input on a 6-question survey.

    The survey will be open for two months, after which time the Islandora Coordinating Committee will review the feedback and use it to inform a new draft. If you'd like to take a more active hand in helping us to shape the Code of Conduct, you can drop your email address at the end of the survey, or contact me directly. Any suggested changes to the Code of Conduct will also go out to the community for review, so you'll have another opportunity to let us know what you think before it's taken up for a vote.

    Thank you for your help!

    Islandora at OR2019 / Islandora

    The Open Repositories conference is heading to Hamburg, Germany from June 10-13th, and Islandora will be there. If you're planning to attend (or watch some videos afterwards), we've put together a list of sessions that might be of particular interest for our community:

    June 10th:

    June 11th:

    June 12th:

    Not enough Islandora yet? We've got an Islandora Camp a few days after Open Repositories, near Zürich, Switzerland from June 17 - 19. If you're already in the region for OR2019, why not stay over and join us there? 

    Digitized Historical Documents / David Rosenthal

    Josh Marshall of Talking Points Memo trained as a historian. From that perspective, he has a great post entitled Navigating the Deep Riches of the Web about the way digitization and the Web have transformed our access to historical documents. Below the fold, I bestow both praise and criticism.

    The really good part of Marshall's post is the explanation, with many examples, of the beneficial results of libraries' and archives' digitization programs. These all used to be locked away in archives or, at best, presented a page at a time in display cases. All but a tiny fraction of the interested public had to make do with small reproductions. Even scholars would need months to get permission for even brief access.

    Now anyone, but especially scholars, wherever they feel the need, can examine in close-up detail treasures such as ancient codexes, illuminated manuscripts, Gutenberg Bibles, ancient maps, or Isaac Newton's own notebook. Even more recent resources such as photographs of jazz greats are available. All this without asking permission, traveling to the library, or even donning white gloves. And, as a side-effect, the risk to the originals has been significantly reduced. Truly something worth celebrating, as Marshall does.

    Marshall describes the problem of preserving the digital versions of these treasures thus:
    Happily, for those of us who are merely consumers of these riches in the present, it’s someone else’s problem. But it is a big, fascinating problem for librarians and digital archivists around the world.
    Well, yes, it is a much more interesting problem than I thought when I started work on it more than two decades ago. The not so good part of Marshall's post is that, like most of the public and even some in the digital preservation community, he is simply wrong about what the problem is. He writes eloquently:
    There is a more complex process underneath all these digital riches which is just how to preserve digitized collections to stand the test of time. With books, by and large, you just take care of them. Easier said than done and world class libraries now have a complex set of practices to preserve physical artifacts from acid-free containers to climate control and the like. But there’s an entirely different set of issues with digitization. It would certainly suck if you’d digitized your whole collection in 1989 and just had a big collection of 5 1/4 inch floppy disks produced on OS/2, the failed IBM-backed PC operating system that officially died in 2006.

    That’s just an example for illustration. But you can see the challenge. Over the last 30 or 40 years we’ve had Betamax, VHS, vinyl albums, CDs, DVDs, BluRay, various downloadable video and audio formats. These are all a positive terror if you’re trying to organize and preserve artifacts of the past that people will have some hope of using in a century or five centuries. What formats do you use? How do you store them – not simply to make them available today but to ensure they aren’t lost in some digital transition or societal disruption in the future?
    I wrote about some of the many reasons format obsolescence was only a problem for the archaeology of IT before the 90s back in 2007, in the second and third posts to this blog. Here is a summary from 2011's Are We Facing a "Digital Dark Age?":
    In the pre-Web world digital information lived off-line, in media like CD-Rs. The copy-ability needed for media refresh and migration was provided by a reader extrinsic to the medium, such as a CD drive. In the Web world, information lives on-line. Copy-ability is intrinsic to on-line media; media migration is routine but insignificant. The details of the storage device currently in use may change at any time without affecting reader's access. Access is mediated not by physical devices such as CD readers but by network protocols such as TCP/IP and HTTP. These are the most stable parts of cyberspace. Changing them in incompatible ways is effectively impossible; even changing them in compatible ways is extremely hard. This is because they are embedded in huge numbers of software products which it is impossible to update synchronously and whose function is essential to the Internet's operation.
    Research by the BL's Andy Jackson and INA's Matt Holden (discussed here and here) shows that Web formats are extremely slow to change and backwards compatibility is generally well-maintained. Even if it isn't, Ilya Kreymer's work shows that accessing preserved Web content with a contemporaneous Web browser is easy.

    British Library budget, 2000 £
    As I keep saying, for example here, the fundamental problem of digital preservation is economic; we know how to keep digital content safe for the long term, we just don't want to pay enough to have it done. The budgets for society's memory institutions - libraries, museums and archives - have been under sustained pressure for many years. For each of them, caring for their irreplaceable legacy treasures must take priority in their shrinking budget over caring for digitized access surrogates of them. These are, after all, replaceable if the originals survive. It may be expensive to re-digitize them with, presumably, better technology in the future. But it can't be done if the originals succumb to a budget crisis.

    Open Data Day: Strengthening Citizen Participation & Women in Power / Open Knowledge Foundation

    This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. This is a joint report produced by NaimLab (Peru) and Centro Latinoamericano de Derechos Humanos (CLADH) from Argentina, who received funding through the mini-grant scheme by the Latin American Initiative for Open Data (ILDA) and the Foreign and Commonwealth Office of the United Kingdom, to organise events under the Open Mapping and Equal Development themes respectively. It has been written by Clara Cubas and María Fabiola Cantú: their biographies are included at the bottom.

    Open Data Day: Comunidata 2019: Open Data to Strengthen Citizen Participation

    Chiclayo, Perú

    On Friday, March 22, 2019, Open Data Day was held in the city of Chiclayo in northern Peru. The event, called Comunidata, was intended to strengthen citizen participation through open data. The main purpose of this meeting was to provide a first approach to the concepts of open data, access to information and transparency of public data, and their importance in addressing social problems in the city.

    This first edition was organized by the members of Iguana Org, a collective dedicated to creating spaces where participation is strengthened and citizen networking is built, and the members of the Social Innovation Laboratory NaimLab, who consolidated a structure composed of three parts: virtual exhibitions, discussion forums and an open dialogue space.

    The event had capacity for 25 participants of all ages, who shared 4 virtual exhibitions, 1 discussion forum and 3 topics in an open dialogue space that allowed integration with the public.


    The goal of the first part was to provide different views of Open Data, from its main concepts, such as the conceptual basis of access to information, to successful cases of Open Government. These exhibitions, although held online, strengthened a network of collaboration between participating specialists and local organizations, and initiated proposals and ideas to apply what they learned in local projects. The participants were: the leader of Open Data Peru, Antonio Cucho Gamboa, who described the first steps of the ODD organization in our country and gave a technical overview of how to use the information obtained to solve local problems; and Jimena Sánchez Velarde (Digital Government Advisor), who presented a series of examples of municipalities working with Open Data. She emphasized the need to bring together political will and the voice of citizens so that transparency and participation become a reality in Peru. Finally, Miguel Morachimo, leader of Hiperderecho, an association that promotes respect for rights and freedoms in digital environments, explained the Peruvian law on access to information and public transparency from his perspective, emphasizing that access to information is every citizen’s right.

    The second part was composed of talks by Alan Saavedra, leader of the technological laboratory ITnovate Peru; representatives of CODESCILL (the Civil Society Coordinator of La Libertad); and David Chaupis, a biologist and social entrepreneur who works on Open Data Science. This part was relevant in that it showed different angles from which Open Data can be approached: from innovation and entrepreneurship, in the case of Alan Saavedra, developer of InfoCity, an application that maps information on the web to inform the community about the status of basic services and reports on them; to the intersection of arts and science, in the case of David Chaupis, who spoke about scientific research with free licenses for the community and insured to companies, which allows generating sustainability in the model of bio-entrepreneurship. He also emphasized the relevance of models of collaboration among the four pillars of the community: science, technology, arts and entrepreneurship.

    Finally, the participation of the members of CODESCILL (Coordinadora de la Sociedad Civil de La Libertad, a region near Chiclayo) gave us ideas on how to initiate a process of citizen articulation like the one currently used to promote Open Government in La Libertad. The experience of Leopoldo León and Paula Santos, who have been involved in social activism for years, gave #Comunidata an intergenerational vision, and also a firm invitation to actively engage in upcoming activities.

    The final part of the event brought the audience together with the experts mentioned above. Guests were able to put questions to the members of the panel, whose knowledgeable answers concluded a great evening.

    In conclusion, COMUNIDATA has been an opportunity to bring together citizens interested in learning to work with Open Data, civil society organizations, and entities working on projects from the local and regional levels to the national level. This networking will materialize in our future meetings, for example, in mappings of civil society organizations and their projects, in the legal strengthening of initiatives that work with accessing information, and in the development of the first “Experimental Laboratory Festival”, Festilab, in Chiclayo, which will be related to the use of Open Data.

    This event would not have been possible without the amazing support of the co-leader of Naimlab, Keyla Sandoval, and the leader of Iguana Org, Karen Diaz. Both are special contributors to this project with whom we will continue to work to strengthen citizen participation through the use of Open Data.


    Open Data Day: Women in Power


    Open data mapping. How many women hold public positions in the province of Mendoza?

    On Friday, March 1, as part of International Open Data Day, the event Open Data Day: Women in Power was held. The meeting took place in the postgraduate room of the Agustín Maza University and brought together about 20 people.

    For several decades, women around the world have been demanding their right to hold public office and participate in politics. In that spirit, the event proposed an analysis of the level of participation of women in public positions in the Province of Mendoza, identifying the positions and places they occupy in the Legislature, the Executive Power and the Judiciary.

    The activity was carried out through a large-scale search for information across the different official digital portals. It gathered journalists, researchers, public officials, civil society organizations, specialists in the use and exploitation of open data, as well as professionals and students from other areas such as health and law.

    The conclusions of the mapping were:

    1. In most of the official digital portals the data is outdated, and those portals that do publish up-to-date public information do not offer it in formats suitable for processing and reuse.
    2. In the Executive Power, the share of women appears to be capped at close to 35% in some sectors. Women are the majority in areas related to health, education and culture, but their participation is very low in the areas of economy, security and infrastructure. Also, the highest positions are mostly occupied by men. An illustrative example is that, in the health area, only 4 women direct the 24 hospitals that exist in the Province.
    3. In the case of the Judiciary, the scarce representation of women in higher positions is reflected in the fact that the seven members of the Supreme Court, the highest court of justice, are men. In the other levels of the Judiciary there is a greater presence of women: 60.87% of employees and state officials are women.
    4. Finally, regarding the Legislative Power, the share of women is close to 35%. In the Senate, of 38 posts only 13 are occupied by women, representing 34.21% of the body. In addition, of 16 unicameral commissions, only 5 (31.25%) are chaired by women. According to the study, the Chamber of Deputies has 20 women in its 48 positions, that is, 41.67%, and women chair 4 of its 11 commissions (36.36%).
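
    The percentages in these conclusions can be rechecked from the raw counts. The sketch below is only an illustration: the counts are the ones quoted above, while the dictionary labels and the `share` helper are names made up for this example.

```python
# Counts quoted in the mapping conclusions: (women, total).
# The labels and the helper name are illustrative, not from the study.
counts = {
    "Hospitals directed by women": (4, 24),
    "Senate seats": (13, 38),
    "Senate commissions": (5, 16),
    "Chamber of Deputies seats": (20, 48),
    "Chamber of Deputies commissions": (4, 11),
}

def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

shares = {label: share(part, whole) for label, (part, whole) in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

    Publishing the counts alongside the percentages, in a machine-readable form, is what makes this kind of check possible for anyone reusing the data.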

    After the analysis of the data, a debate began around the following question: Is there gender equality in the distribution of positions in the Province of Mendoza? The discussion was enriched by the different views and contributions of all the participants.

    It was concluded that equality in access to public office should not be reduced to arithmetical equality in the number of positions held, but should mean that women have a real possibility of occupying spaces of decision-making power.

    Faced with this perspective, governments must make concerted efforts to promote the participation of women in the institutional life of the State and accommodate the voice of women themselves to generate solutions to overcome current barriers.

    The UN explains that International Women’s Day “is a good time to reflect on the progress made, ask for more changes and celebrate the courage and determination of ordinary women who have played a key role in the history of their countries and communities.”

    Convert the ideal of equality into tangible reality

    This March 8, we must celebrate, but also raise awareness. We have come a long way to reach this point, but there is still much to be done. For this reason, we at CLADH want to celebrate this International Women’s Day not only by echoing messages in favor of equality, justice and development but also by working on concrete projects so that this desire for equality is transformed into a tangible reality. Simple changes are needed, but of a great magnitude. Our rulers and all civil society must understand that equality and respect are the only way to the future.


    The organizations in charge were the Fundación Nuestra Mendoza, Centro Latinoamericano de Derechos Humanos (CLADH) and the School of Journalism of the Juan Agustín Maza University.



    Clara Cubas is the Co-Leader of Naimlab: Social Innovation Lab. She is a strategic IT professional with expertise in Processes Improvement and strong interests in Social Innovation, Open data and Creative Commons.




    María Fabiola Cantú is the Executive Director of Centro Latinoamericano de Derechos Humanos (CLADH). She is a lawyer who studied at the Universidad Nacional de Cuyo, Law School (Mendoza-Argentina), where she had an outstanding academic performance. She was recognized by the Argentine Federation of Women as the best graduate of her degree program. She holds a Diploma in International Defense of Human Rights (Escuela de Prácticas Jurídicas de la Universidad de Zaragoza – CLADH) and a Diploma in Women's Human Rights (Universidad Austral – with collaboration of OEA). She was selected in 2015 to conduct an academic exchange at the Faculty of Law of the Autonomous University of Chiapas (San Cristobal de las Casas, Chiapas, Mexico), where she studied International Systems for the Protection of Human Rights, International Law and Indigenous Law. During her stay in Mexico she collaborated with the Penitentiary Center No. 5 of San Cristóbal de las Casas in the integration of the indigenous population with the rest of the prison population.

    She served as Director of the Freedom of Expression and Transparency Area of Centro Latinoamericano de Derechos Humanos (CLADH). She is currently the Coordinator of the International Journal of Human Rights, a scientific publication of the same organization. She has experience in international litigation of human rights cases and in human rights activism on issues of access to public information and citizen participation.



    OCLC Research Mini-Symposium on Linked Data (Marseille edition) / HangingTogether

    My colleague Titia van der Werf and I organized a “mini-symposium on Linked Data” as part of the OCLC EMEA (Europe-Middle East-Africa) Regional Council conference held in Marseille on 26 February 2019. Fifty staff from OCLC member institutions throughout the EMEA region participated in this interactive session. The month before, we had conducted a short survey to determine the registrants’ key interests in this session. Most wanted to learn how other institutions were implementing linked data, so we arranged for five “lightning talks” summarizing different linked data implementations or perspectives, each followed by discussion and questions.

    A national library’s experiences: Sébastien Peyrard of the Bibliothèque nationale de France (BnF), reminded us that linked data and open data are related but not identical. In France few datasets in the portal are linked data, but all are available under the French equivalent to the Creative Commons License Attribution (CC-BY), Etalab, permitting others to freely distribute and re-use the data as long as they give appropriate credit to the source. The BnF’s linked data source aggregates the entities represented in the library’s various silos of resources: main catalogue, archival and manuscripts catalogue, virtual exhibitions, digital library, educational resources, etc. The BnF views linked data as an export format, not necessarily a cataloguing format. What is most important is that the cataloguing format is compatible with linked data principles: entity-driven (a record per entity, links between entities done with links between records through their identifiers) and rich in controlled values that can be easily consumed by machines. This shift is underway at the BnF as it is building a new cataloguing system, but the cataloguing format will still be MARC. However, this “MARC flavor” will be entity-driven, with a bibliographic record becoming a Work, Expression, Manifestation and Item representation, allowing linking between entities. It will also rely more heavily on controlled subfields.

    Peyrard stressed that the choice of a cataloguing format is specific to the institution—it could be BibFrame, other flavors of MARC, or something else. The BnF’s choice is the next generation of INTERMARC. The impact of linked data is more about having an entity-driven database, however it is done.

    Research Information Management Graph: My OCLC colleague Annette Dortmund outlined the benefits of identifiers and linked data for a “global Research Information Management (RIM) graph.” She defined RIM as the “aggregation, curation, and utilization of research information” and persistent identifiers as an important infrastructure for “unambiguous referencing and linking to resources.” RIM metadata almost always includes publications, maybe research data sets, preprints, and other outputs, and attempts to connect these outputs to grants, funders, equipment and a growing number of other categories.

    At the local level, this metadata can be captured in a typical traditional relational database, such as a CRIS. However, the information is rarely complete—and each institution does similar work. Once this metadata is aggregated to a national level, you need identifiers understood across all systems to identify and merge researchers, projects, funders etc. across all systems and to see the network level activity. Otherwise you end up with duplicates. But this information may still be incomplete. Research is international, with international collaborations and funding. Setting up a “global system” to capture, merge, and de-duplicate all the information is not feasible, but linked data can help create a global RIM graph.

    If we rely on persistent identifiers to uniquely identify entities (researchers, organizations, projects, funders, etc.) and establish links between these identifiers, we could then connect them to locally held information. With identifiers such as ORCID or ISNI, it is much easier to reliably identify the one “John Smith” in question. Organization identifiers help get the affiliation information right. Publication identifiers such as DOIs or ISBNs help with that part. Research data is often citable by a DOI. Identifiers for projects, funders, grants, and many other entities either exist or are in development. This information can be found and used by anyone interested. The global RIM graph decentralizes the task of collecting all this information, and provides a central, global source of information.

    Dortmund concluded that the one thing we can all do today to help create this future global RIM graph is to include resolvable persistent identifiers in our systems, for as many categories as possible, in addition to local or national ones.
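
    To make the de-duplication point concrete, here is a minimal sketch of merging researcher records from different systems on a shared persistent identifier instead of on names. Everything in it is hypothetical: the record layout, the source labels, the placeholder identifiers (`orcid:EXAMPLE-1` and so on are not real ORCID iDs) and the `merge_by_identifier` helper were invented for illustration and do not come from any RIM system.

```python
from collections import defaultdict

# Hypothetical records exported by two institutional systems.
# Identifiers are placeholders, not real ORCID iDs or DOIs.
records = [
    {"source": "univ-a", "name": "John Smith", "pid": "orcid:EXAMPLE-1",
     "outputs": ["doi:EXAMPLE/a1"]},
    {"source": "univ-b", "name": "J. Smith", "pid": "orcid:EXAMPLE-1",
     "outputs": ["doi:EXAMPLE/b7"]},
    {"source": "univ-b", "name": "John Smith", "pid": "orcid:EXAMPLE-2",
     "outputs": ["doi:EXAMPLE/c3"]},
]

def merge_by_identifier(records):
    """Group records by persistent identifier, pooling names, sources and outputs."""
    merged = defaultdict(lambda: {"names": set(), "sources": set(), "outputs": set()})
    for rec in records:
        entry = merged[rec["pid"]]
        entry["names"].add(rec["name"])
        entry["sources"].add(rec["source"])
        entry["outputs"].update(rec["outputs"])
    return dict(merged)

people = merge_by_identifier(records)
for pid, info in sorted(people.items()):
    print(pid, sorted(info["names"]), sorted(info["outputs"]))
```

    Matching on the name alone would either split the first researcher in two ("John Smith" and "J. Smith") or fuse two different people who share a name; keying on the identifier avoids both.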

    A cultural heritage institution’s perspective: Gildas Illien of the Muséum national d’Histoire naturelle provided the context of a natural history museum full of databases in silos. The museum had just launched a proof of concept project based on a sample of about 500 local personal names, identifying the resources attached to those names spread throughout those silos. The museum sees linked data as a way to connect people talking about the same thing in different databases and to provide context for their objects to end-users without implementing a Google-like search box. All cultural heritage institutions (museums, archives, and libraries) have similar silos of data that could benefit from connecting them for end-users through linked data.

    OCLC Research’s experiences with Wikibase: I talked about how last year OCLC Research explored creating linked data in the entity-based Wikibase platform in collaboration with metadata practitioners in sixteen U.S. libraries. Wikibase is the platform underlying Wikidata that contains structured data which you may see in the “information boxes” in various language Wikipedias. The attraction of using Wikibase is that metadata practitioners could focus on creating entities and their relationships without knowing any of the technical underpinnings of linked data. For example:

    Entity: Photograph, which depicts this person who has this role in this time period

    >> is part of this collection >> curated by this archive >> is part of this institution, which is located in this place >> which is located in this country, which had this previous name in this time period.

    We started with 1.2 million entities that we imported from Wikidata which matched entities in WorldCat and VIAF, so practitioners could link to existing ones and focus on creating new ones and establishing new relationships. We added a discovery layer so that the practitioners could see the relationships they created as part of their workflow and the added value of retrieving related data from other linked data sources.
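
    The photograph example above can be read as a chain of simple statements, each linking one entity to another. The sketch below shows that reading in plain Python; it does not use the Wikibase API, and the labels, the triple layout and the `follow` helper are all invented for illustration.

```python
# Hypothetical statements, written as (subject, property, object) triples.
# The labels stand in for real entities; none of them are actual identifiers.
statements = [
    ("photograph", "depicts", "person"),
    ("photograph", "is part of", "collection"),
    ("collection", "curated by", "archive"),
    ("archive", "is part of", "institution"),
    ("institution", "located in", "place"),
    ("place", "located in", "country"),
]

def follow(start, properties):
    """Follow the first matching statement from each entity until none is left."""
    path = [start]
    seen = {start}
    current = start
    while True:
        targets = [o for s, p, o in statements if s == current and p in properties]
        if not targets or targets[0] in seen:
            return path
        current = targets[0]
        seen.add(current)
        path.append(current)

chain = follow("photograph", {"is part of", "curated by", "located in"})
print(" >> ".join(chain))
```

    The practitioner only enters the individual statements; the chain from photograph to country falls out of following the links.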

    Another valued feature of the Wikibase platform is that it embeds multilingualism. By changing the language interface, the participants could create labels and descriptions in their preferred language and script, deferring to others to provide transliterations or labels in other languages.

    I reiterated the value of including identifiers wherever possible in the metadata people create now and noted that good metadata translates into good linked data!

    Strategic choices by a national library: Jan Willem van Wessel of the Koninklijke Bibliotheek (KB) summarized its strategic choices about linked data. Its Strategic Plan for 2015-2018 included recommendations for the KB to adopt linked data.  The KB did not choose Linked Data as a goal in itself but as a simpler way to present information that is connected and easily accessible to its users— the core function of the KB as a National Library.

    The KB is now creating a platform (not system!) to bring people and information together as part of a network, leveraging the work done by network partners from the heritage field (public libraries, university libraries, museums, archives). If everyone does what they are good at, and only does that, joint work will proceed faster and have higher quality. He noted that we are also working in an environment where users are creating their own information through Wikipedia articles, blogs, and social media. Search engines now structure and present information in Knowledge Graphs; they have developed a common language,

    Forty years of machine-readable cataloging has given us a legacy that includes software and structures that are poorly supported. The KB catalog includes 14,000 different keywords, when the average vocabulary of a native Dutch speaker is 42,000. How useful is that? The KB has much metadata remediation to do! Which of the several hundred record fields and subfields are really needed? Bibliographic metadata is dispersed and lacks structure to glue disparate parts together. The KB does not have a publicly shared set of references for a publication.

    Although linked data is not a panacea to solve all these problems, it does help to integrate and link sources of information from within the KB, from the library world in general and—why not? — from the entire world.

    The KB has not yet achieved what it wants. It has conducted pilots, demonstrations and proofs of concept, and organized HackaLODS (hackathons about cultural Linked Open Data) with great success. It has succeeded in knowledge building and experimentation, for example a Linked Data extension to its Delpher platform with marked-up entities contained in 12 million newspapers. But van Wessel observed that the KB has just scratched the surface. Its goal is to set up the KB as an authoritative linked data source for bibliographic data accessible to the outside world.

    Suggested resources:

    The post OCLC Research Mini-Symposium on Linked Data (Marseille edition) appeared first on Hanging Together.

    Evergreen 3.3 Released / Evergreen ILS

    Evergreen 3.3 has been released to its eager users. Dan Wells of Calvin College, the release manager, described 3.3 in his announcement email like this:

    “As we close out this development period, I think the overall theme of this release can be summarized as infrastructural improvements and modernization.  Chief among these would be the significant shift to new versions of Angular/Bootstrap as spearheaded by Bill Erickson.  There are a handful of production ready interfaces now using the new system (MARC Import/Export and some Administration interfaces) and an ever more feature-rich “staff” catalog which can be optionally enabled by Evergreen administrators.  The staff catalog is still considered experimental for this release, but is already highly functional, and may one day serve as a prototype for bringing similar technology to the public OPAC.
    A somewhat less glamorous but equally important effort is making sure Evergreen continues to work smoothly with major parts of our technology stack.  As led by Jason Stephenson and Ben Shum, Evergreen can now claim official support for the newest Ubuntu LTS (18.04) as well as significantly more modern versions of PostgreSQL (9.6 and 10).  Ubuntu is a popular choice in the Evergreen community for a server operating system, and PostgreSQL plays the critical role of backend database for Evergreen, so these improvements are necessary and much appreciated.”
    Read the full release notes here:
    and download here:

    Something For Archives in Schema.org / Richard Wallis

    The recent release of the Schema.org vocabulary (version 3.5) includes new types and properties, proposed by the W3C Schema Architypes Community Group, specifically targeted at facilitating the web sharing of archives data to aid discovery.

    When the Group, which I have the privilege to chair, approached the challenge of building a proposal to make Schema.org useful for archives, it was identified that the vocabulary could already be used to describe the things & collections that you find in archives. What was missing was the ability to identify the archive holding organisation, and the fact that an item is being held in an archives collection.

    The first part of this was simple, resulting in the creation of a new Type ArchiveOrganization. Joining the many subtypes of the generic LocalBusiness type, it inherits all the useful properties for describing such organisations. Using the (Multi Typed Entity – MTE) capability of the vocabulary, it can be used to describe an organisation that is exclusively an Archive or one that has archiving as part of its remit – for example one that is both a Library and an ArchiveOrganization.

    The need to identify the ‘things‘ that make up the content of an archive, be they individual items, collections, or collections of collections resulted in the creation of a second new type: ArchiveComponent.

    In theory we could have introduced archive types for each type of thing you might find in an archive, such as ArchivedBook, ArchivedPhotograph, etc. – obvious at first, but it soon gets difficult to scope and maintain. Instead we took the MTE approach of creating a type (ArchiveComponent) that could be added to the description of any thing, to provide the archive-ness needed. This applies equally to Collection and individual types such as Book, Photograph, Manuscript etc.

    In addition to these two types, a few helpful properties were part of the proposals: holdingArchive & itemLocation are available for ArchiveComponent; archiveHeld for ArchiveOrganization; collectionSize for Collection; materialExtent.

    To help get your head around the use of these types and properties, here is a very simple example in JSON-LD format:

    {
       "@context": "",
       "@type": "ArchiveOrganization",
       "name": "Example Archives",
       "@id": "",
       "archiveHeld": {
          "@type": ["ArchiveComponent","Collection"],
          "@id": "",
          "name": "Example Archive Collection",
          "collectionSize": 1,
          "holdingArchive": "",
          "hasPart": {
             "@type": ["ArchiveComponent","Manuscript"],
             "@id": "",
             "name": "Interesting Manuscript",
             "description": "Interesting manuscript - on loan to the British Library",
             "itemLocation": "",
             "isPartOf": ""
          }
       }
    }

    These proposals gained support from many significant organisations and I look forward to seeing these new types and properties in use in the wild very soon helping to make archives and their contents visible and discoverable on the web.
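
    The identifier values in the JSON-LD example above are blank as it appears here. As a rough sketch of how such a description might be generated, the Python below builds the same structure and serializes it. Note the assumptions: every URL is an example.org placeholder, and "https://schema.org" as the @context value is assumed rather than taken from the example; only the type and property names come from the text.

```python
import json

# Placeholder identifiers (example.org); the real ones are not given above.
ARCHIVE = "https://example.org/archive"
COLLECTION = "https://example.org/archive/collection1"

doc = {
    "@context": "https://schema.org",  # assumed vocabulary URL
    "@type": "ArchiveOrganization",
    "name": "Example Archives",
    "@id": ARCHIVE,
    "archiveHeld": {
        # Multi-typed entity: a Collection that is also an ArchiveComponent.
        "@type": ["ArchiveComponent", "Collection"],
        "@id": COLLECTION,
        "name": "Example Archive Collection",
        "collectionSize": 1,
        "holdingArchive": ARCHIVE,
        "hasPart": {
            "@type": ["ArchiveComponent", "Manuscript"],
            "@id": COLLECTION + "/item1",
            "name": "Interesting Manuscript",
            "description": "Interesting manuscript - on loan to the British Library",
            "itemLocation": "https://example.org/other-library",  # placeholder
            "isPartOf": COLLECTION,
        },
    },
}

serialized = json.dumps(doc, indent=2)
roundtrip = json.loads(serialized)  # confirms the output is well-formed JSON
print(serialized)
```

    Reusing the same identifier for holdingArchive and the organisation's @id, and for isPartOf and the collection's @id, is what ties the three descriptions together.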




    Fudge, and open access ebook download statistics / Eric Hellman

    If you found out that the top 50 authors born in Gloucestershire, England average over 10 million copies sold, you might think that those authors are doing pretty well. But it's silly to compute averages like that. When you compute an average over a population, you're making an assumption that the quantity you're averaging over is statistically distributed somehow over the population. Unless of course you don't care if the average means anything, and you just want numbers to help justify an agenda.

    Most folks would look at the list of Gloucestershire authors and say that one of the authors is an outlier, not representative of Gloucestershire authors in general. And so J.K. Rowling, with her 500+ million copies sold, would get removed from the data set, revealing the presumably unimpressive book selling record of the "more representative" authors. Scientists refer to this process as "fudging the data". It's done all the time, but it's not honest.

    There's a better way. If a scientific study presents averages across a population, it should also report statistical measures such as variance and standard deviation, so the audience can judge how meaningful the reported averages are (or aren't!).

    Other times, the existence of "outliers" is evidence that the numbers are better measured and compared on a different scale. Often, that's a logarithmic scale. For example, noise is measured on a logarithmic scale, in units of decibels. An ambulance siren has a million times the noise power of normal conversation, but it's easier to make sense of that number if we compare the 60 dB sound volume of conversation to the 90 dB of a hair dryer, the 120 dB of the siren and the 140 dB of a jet engine. Similarly, we can understand that while J.K. Rowling's sales run into 9 figures, most top Gloucestershire-born authors are probably 3-, 4- or maybe 5-figure sellers.
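
    To make the decibel arithmetic concrete, here is a minimal Python sketch. It relies on the standard definition that a difference of 10 dB corresponds to a factor of 10 in power; the power_ratio helper name is made up for this illustration.

```python
def power_ratio(db_a, db_b):
    """Return how many times more sound power db_a represents than db_b."""
    return 10 ** ((db_a - db_b) / 10)


# Sound levels quoted in the text above.
levels = {"conversation": 60, "hair dryer": 90, "siren": 120, "jet engine": 140}

for name, db in levels.items():
    ratio = power_ratio(db, levels["conversation"])
    print(f"{name}: {db} dB, {ratio:,.0f}x the power of conversation")
```

    The 60 dB gap between conversation and the siren is the "million times" figure: six steps of 10 dB, each a tenfold increase in power.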

    Over the weekend, I released a "preprint" on Humanities Commons, describing my analysis of open-access ebook usage data. I worked with a wonderful team including two open-access publishers, University of Michigan Press and Open Book Publishers, on this project, which was funded by the Mellon Foundation. To boil down my analysis to two pithy points, the preprint argues:

    1. Free ebook downloads are best measured on a logarithmic scale, like earthquakes and trade publishing sales.
    2. We shouldn't average download counts.

    If you take the logarithm of book downloads, the histogram looks like a bell curve!
    For example, if someone tells you that "Engineering, mathematics and computer science OA books perform much better than the average number of downloads for OA books across all subject areas" while withholding the variances of the distributions and refusing to release their data, you should pay them no mind.

    Next week, I'll have a post about why logarithmic scales make sense for measuring open-access usage, and maybe another about how log-normal statistics could save civilization.
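
    A small Python sketch may help show why the two points matter. The download counts below are invented for illustration and are not data from the preprint; the sketch simply compares the ordinary average with statistics computed on the logarithm of the counts.

```python
import math
import statistics

# Hypothetical download counts for ten books; one runaway hit dominates.
downloads = [120, 300, 450, 800, 1_000, 1_500, 2_200, 4_000, 9_000, 2_000_000]

# The ordinary average is pulled far above what a typical book achieves.
arithmetic_mean = statistics.mean(downloads)
median = statistics.median(downloads)

# On a log10 scale, the mean and standard deviation describe the bulk of the data.
logs = [math.log10(d) for d in downloads]
log_mean = statistics.mean(logs)
log_stdev = statistics.stdev(logs)

# Converting the log-scale mean back gives the geometric mean.
geometric_mean = 10 ** log_mean

print(f"arithmetic mean: {arithmetic_mean:,.0f}")
print(f"median:          {median:,.0f}")
print(f"geometric mean:  {geometric_mean:,.0f}")
print(f"log10 mean and standard deviation: {log_mean:.2f}, {log_stdev:.2f}")
```

    With data like this, the arithmetic mean mostly reflects the single best seller, while the log-scale mean and spread stay close to what most of the books actually did.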

    Open Data Day 2019: a joint report by Open Knowledge Colombia and Datasketch / Open Knowledge Foundation

    This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. This is a joint report produced by Open Knowledge Colombia and Datasketch, who received funding through the mini-grant scheme by Hivos / Open Contracting Partnership and the Latin American Initiative for Open Data (ILDA) to organise an event under the Equal development and Tracking Public Money themes.  It has been written by Verónica Toro (Datasketch) and Luis M. Vilches-Blázquez (Open Knowledge Colombia).


    In Bogotá (Colombia), we organized an event called IgualData, which brought together different actors to demonstrate and raise awareness of salary differences between genders in Colombia.

    IgualData was held in conjunction with the public (governmental) sector and civil society. The event was organized by the National Planning Department, National Secretary for Transparency, Ministry of Finance and Public Credit, Ministry of Communications and Information Technology, Colombia Buys Efficient, Datasketch, Open Contracting Partnership, Open Knowledge Colombia, Global Integrity, and the Jorge Tadeo Lozano University.

    More than 60 people took part in our event in different roles (citizens, academia, social actors, governmental bodies, etc.). They had active discussions and interactions on the importance of open data in promoting gender equality and on how open data helps uncover the gender pay gap in the Colombian context.


    The IgualData event was based on three main topics: i) gender pay gap, ii) anti-corruption and public policy with a gender approach, and iii) women's participation in public purchases. These topics were useful for opening a debate on the rights and inequalities of women in Colombian society through open data.

    Additionally, we added some open questions related to open data and gender issues, such as: How can we use open data as a tool to promote gender equality? What can be done to ensure that women, gays, lesbians, trans, bisexuals, and queers have power and benefit from the state budget? How to achieve a gender approach in the creation of public policies related to access to information?

    In this setting we sought to answer these questions. Because there are few or no reports on the budget and its follow-up with a gender approach in open data, it is not possible to track or analyze whether the public budget has promoted gender equality in Colombia.

    Therefore, the main challenge of IgualData was to gain a global view of the status of open data on gender issues and to uncover the existing gender pay gap in Colombia through open contracts data associated with governmental bodies.


    Open Knowledge Colombia and Datasketch in conjunction with National Planning Department, National Secretary for Transparency, Ministry of Finance and Public Credit, Ministry of Communications and Information Technology, Colombia Buys Efficient, Open Contracting Partnership, Global Integrity, and Jorge Tadeo Lozano University prepared a complete agenda for our event (IgualData).

    The day began with an expert panel that included women from civil society, the private sector, and academia. The objective was to discuss how governmental bodies produce data and how bias and discrimination enter that data through forms and surveys. The allegations of data manipulation faced by some of these bodies, and the importance of institutional strengthening with a gender and intersectional approach, were put on the table, since not all women are alike, nor are all homosexual, bisexual or transgender people.

    We also had different interventions and exhibitions from various actors associated with the public, private and societal sectors. They showed some data analysis based on official and open data from governmental bodies.

    Moreover, we created three working groups focused on the three main topics of IgualData, where participants discussed challenges, shortcomings, and opportunities:

    • Gender pay gap. This group discussed the areas affected by the lack of data with a gender focus. They also reviewed the difference in the number of hours worked by men and women and the availability of data.
    • Anti-corruption and public policy with a gender approach. This group discussed the current status of the General System of Anti-corruption in Colombia and also dealt with the need to include a gender approach and strengthen the available data.
    • Participation of women in public purchase. This working group discussed the state of the data, highlighting the fact that the majority of data are in PDF format, which makes large-scale analysis more difficult.

    Finally, we presented a mosaic created in tribute to Rosie the Riveter. This work was built with data on the wage gap and violence against women, using figures from reports by organizations such as the World Economic Forum (WEF) and the International Labor Organization. It also included photographs of feminists, women scientists, academics and writers.

    Conclusions and Lessons Learnt

    We obtained different conclusions and lessons learned in the context of IgualData. Next, we list some of the main ones:

    With respect to (open) data and interoperability status:

    • Currently, there is no gender distinction in the National public contracting platform, called SECOP.
    • Most information related to gender issues is available in PDF format.
    • Interoperability between platforms is needed (e.g.: SECOP and SIGEP to extract data such as gender, training, experience, geographical distribution, marital status, among others).

    Regarding monitoring of gender issues:

    • It is important to monitor and measure how the resources of national investment projects are executed in the context of gender issues.
    • It is necessary to establish a connection between gender pay gap information and training and experience factors.
    • It is necessary to include fields for selecting gender and company characteristics in the SECOP platform in order to evaluate the participation of women in governmental public contracts.

    Launching CAP Search / Harvard Library Innovation Lab

    Today we're launching CAP search, a new interface to search data made available as part of the Caselaw Access Project API. Since releasing the CAP API in Fall 2018, this is our first try at creating a more human-friendly way to start working with this data.

    CAP search supports access to 6.7 million cases from 1658 through June 2018, digitized from the collections at the Harvard Law School Library. Learn more about CAP search and limitations.

    We're also excited to share a new way to view cases, formatted in HTML. Here's a sample!

    We invite you to experiment by building new interfaces to search CAP data. See our code as an example.

    The Caselaw Access Project was created by the Harvard Library Innovation Lab at the Harvard Law School Library in collaboration with project partner Ravel Law.

    First We Change How People Behave / David Rosenthal

    Then the system will work the way we want. My skepticism about Level 5 self-driving cars keeps getting reinforced. Below the fold, two recent examples.

    The fundamental problem of autonomous vehicles sharing roads is that until you get to Level 5, you have a hand-off problem. The closer you get to Level 5, the worse the hand-off problem.

    Sean Gallagher's "Lion Air 737 MAX crew had seconds to react, Boeing simulation finds" shows the hand-off problem for aircraft:
    In testing performed in a simulator, Boeing test pilots recreated the conditions aboard Lion Air Flight 610 when it went down in the Java Sea in October, killing 189 people. The tests showed that the crew of the 737 MAX 8 would have only had 40 seconds to respond to the Maneuvering Characteristics Augmentation System’s (MCAS’s) attempts to correct a stall that wasn’t happening before the aircraft went into an unrecoverable dive, according to a report by The New York Times.

    While the test pilots were able to correct the issue with the flip of three switches, their training on the systems far exceeded that of the Lion Air crew—and that of the similarly doomed Ethiopian Airlines Flight 302, which crashed earlier this month. The Lion Air crew was heard on cockpit voice recorders checking flight manuals in an attempt to diagnose what was going on moments before they died.
    Great, must-read journalism from Dominic Gates at the Seattle Times, Boeing's home-town newspaper, in "Flawed analysis, failed oversight: How Boeing and FAA certified the suspect 737 MAX flight control system" shows that the fundamental problem with the 737 MAX was regulatory capture of the FAA by Boeing; the FAA's priority wasn't to make the 737 MAX safe, it was to get it into the market as quickly as possible because Airbus had a 9-month lead in this segment. And because Airbus' fly-by-wire planes minimize the need for expensive pilot re-training, Boeing's priority was to remove the need for it.
    The company had promised Southwest Airlines Co. , the plane’s biggest customer, to keep pilot training to a minimum so the new jet could seamlessly slot into the carrier’s fleet of older 737s, according to regulators and industry officials.

    [Former Boeing engineer] Mr. [Rick] Ludtke [who worked on 737 MAX cockpit features] recalled midlevel managers telling subordinates that Boeing had committed to pay the airline $1 million per plane if its design ended up requiring pilots to spend additional simulator time. “We had never, ever seen commitments like that before,” he said.
    The software fix Boeing just announced is just a patch on a fundamentally flawed design, as George Leopold reports in "Software Won’t Fix Boeing’s ‘Faulty’ Airframe". Boeing is gaming the regulations, and the FAA let them do it. Neither placed safety first. These revelations should completely destroy the credibility of FAA certifications.

    Although Boeing's highly-trained test pilots didn't have to RTFM, they did have only 40 seconds to diagnose and remedy the problem caused by the faulty angle-of-attack sensor and the buggy MCAS software. Inadequately trained Lion Air and Ethiopian Airlines pilots never stood a chance of a successful hand-off. Self-driving car advocates assume that hand-offs are initiated by the software recognizing a situation it can't handle. But in this case the MCAS software was convinced, on the basis of a faulty sensor, that it was handling the situation, and it refused to hand off to the pilots 24 times in succession.

    Self-driving car stopper
    Self-driving cars' drivers will lack even the level of training of the dead pilots. The cars' software is equally dependent upon sensors, which can be fooled by stickers on the road*, and cannot handle rain, sleet or snow. Or, as it turns out, pedestrians. As David Zipper tweeted:
    Atrios' apt comment was:
    It is this type of thing which makes me obsess about this issue. And I have a couple insider sources (ooooh I am a real journalist) who confirm these concerns. The self-driving car people see pedestrians as a problem. I don't really understand how you can think urban taxis are your business model and also think walking is the enemy. Cities are made of pedestrians. Well, cities other than Phoenix, anyway. I pay a dumb mortgage so I can walk to a concert, like I did last night.
    But no-one who matters cares about pedestrians because no-one who matters is ever on the sidewalk, let alone crossing the street. As the CDC reports:
    In 2016, 5,987 pedestrians were killed in traffic crashes in the United States. This averages to one crash-related pedestrian death every 1.5 hours.

    Additionally, almost 129,000 pedestrians were treated in emergency departments for non-fatal crash-related injuries in 2015. Pedestrians are 1.5 times more likely than passenger vehicle occupants to be killed in a car crash on each trip.
    The casualties who don't "know what they can't do" won't add much to the deaths and injuries, so we can just go ahead and deploy the technology ASAP.

    * Tesla says the "stickers on the road" attack:
    is not a realistic concern given that a driver can easily override Autopilot at any time by using the steering wheel or brakes and should always be prepared to do so
    Well, yes, but the technology is called "Autopilot" and Musk keeps claiming "full autonomy" is just around the corner.

    Open Mapping in Brazil for Open Data Day 2019 / Open Knowledge Foundation

    This report is part of the event report series on International Open Data Day 2019. Code for Curitiba and Open Knowledge Brasil / UG Wikimedia in Brazil received funding through the mini-grant scheme by Mapbox to organise events under the Open Mapping theme. This is a joint report by Ricardo Mendes Junior & Celio Costa Filho: their biographies are included at the bottom of this post.

    Open Data Day São Paulo

    Open Data Day is an annual celebration of open data that takes place all around the world. In its ninth edition, in 2019, people in various countries organized events using and/or producing open data. This is a great opportunity to show the benefits of open data and to encourage the adoption of open data policies in government, business, and civil society. In Brazil, these events occurred in the first half of March.

    The initiative to hold one of these events in the city of São Paulo came from two volunteers of the group Wiki Movimento Brasil. The idea for the event came after the Brumadinho dam disaster, which occurred on January 25, 2019, when a tailings dam at an iron ore mine in Brumadinho, Minas Gerais, Brazil suffered a catastrophic failure. In this context, we saw the importance of having properly structured, machine-readable data about Brazilian tailings dams on open platforms such as Wikidata. This became even more visible when, at the end of January of this year, a report from the National Water Agency classified 45 dam reservoirs as vulnerable, potentially affecting a population of 3.5 million people in at-risk cities.

    The purpose of this Open Data Day, therefore, was to scrape databases whose content is free and to create items on Wikidata rich in structured information about the existing dams in Brazil. The site of the National Information System on Dams Safety, controlled by the National Water Agency, was the main source; the site records more than 3,500 dams. Once the data was organized in a spreadsheet, the process of “wikidatification” began with the help of the participants of the event. Wikidatifying data is nothing more than modeling structurable data, that is, trying to establish correspondences between the concepts and values presented in the data table and the properties and items of Wikidata. Only after wikidatification is it possible to upload the data to Wikidata. Each participant of the event created about 500 dam items.
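
    As a sketch of what this modeling step can look like in code, the Python fragment below turns one spreadsheet row into a label plus a list of Wikidata statements. The column names and the wikidatify helper are hypothetical, and this is not the workflow the participants actually used. P31 (instance of), P17 (country) and P625 (coordinate location) are existing Wikidata properties; the item ID used for "dam" is an assumption that should be checked before any upload.

```python
# Assumed correspondence between spreadsheet columns (hypothetical names)
# and Wikidata properties.
COLUMN_TO_PROPERTY = {
    "country": "P17",       # country
    "coordinates": "P625",  # coordinate location
}
INSTANCE_OF = "P31"
DAM_ITEM = "Q12323"  # assumed Wikidata item for "dam"; verify before use


def wikidatify(row):
    """Turn one spreadsheet row into a label plus (property, value) statements."""
    statements = [(INSTANCE_OF, DAM_ITEM)]
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value not in (None, ""):  # skip empty spreadsheet cells
            statements.append((prop, value))
    return {"label": row["name"], "statements": statements}


# Hypothetical row: Q155 is the Wikidata item for Brazil.
row = {"name": "Example Dam", "country": "Q155", "coordinates": (-20.0, -44.0), "owner": ""}
print(wikidatify(row))
```

    Columns without an agreed property, such as the hypothetical "owner" column here, are simply left out until a correspondence has been established.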

    Items created in this event can serve a variety of purposes, such as illustrating dam maps by associated potential harm level and cross-checking dam safety statistics against other databases (for instance, the ones reported in the Brazilian news today).

    The event was organized by the members of Wiki Movimento Brasil and had the support of Creative Commons Brazil.

    Map example:


    Open Data Day Curitiba 2019

    Open Data Day Curitiba 2019 was held at the FIEP Paula Gomes Training Center, with 61 people taking part in 4 working rooms and attending the lectures in the auditorium. The lecture programme featured 11 special guests, who each spoke for 15 minutes on the following subjects: Access and reuse of scientific data; Open data of public spending in accessible formats; Open Science: Repository of scientific data of Research; Collaborative Mapping; Open Education and open educational technology; Impacts of the Brazilian General Data Protection Act; Information Systems for public transport; Use of the City Information Modeling (CIM) methodology for urban planning; Transparency and social control; Roadmap to civic innovation in the public sector; and Urbanism and collaborative mappings, civic engagement and urban laboratories. At the opening of the event the director of the Curitiba/Vale do Pinhão Agency, Cris Alessi, spoke about the innovation ecosystem of Curitiba and the actions we can take as participants in the civic hacker movement to encourage open public data. In the working rooms the participants discussed and developed activities related to the themes of ODD Curitiba 2019.

    Open science

    In the Open science working room, 13 people participated in the activities. The group started by discussing the concept of scientific data in context, some international approaches to the topic, and the distinction between scientific information and research product. The group then identified 3 datasets and analyzed their structures (data, documentation and support of the original publication that contextualizes the information). After this activity the group discussed the 8 Panton Principles that analyze the quality of open data, and discussed repositories. As a last activity, they discussed the context of scientific data in scientific journals, the types of copyright license for data, and the difficulty of obtaining information from the data published on the platform of Brazilian researchers’ curricula.

    Tracking flow of public money

    In the Tracking flow of public money working room, 28 people participated. The initial discussions were about money spent on public events and public policy actions that use public resources, and about how to find the destination of these resources in the city’s documents (bids, commitments, notices, etc.). After this discussion, the group decided to concentrate on tracking drug expenditures and public transport costs, and started with questions related to these expenses. Subsequently, a map of the money trail for these expenses was drawn up, including the sources of information. This trail will be improved by the group, who pledged to continue working on these ideas. The group concluded that citizen engagement is the best remedy, which it summarized in one sentence:

    “The Ministry of Health warns: Citizen participation is the best remedy for public health management.”

    Open Mapping

    The 1st Urban Accessibility Mapathon of Curitiba (Mapathon = collaborative mapping marathon) was held in the Open Mapping working room. The activity consisted of each team gathering information in the field along about 800 meters of sidewalks in the neighborhood of the event’s location. With the help of mobile applications, accessibility problems were recorded with coordinates, photos and videos. The checklist had 18 items, such as irregular pavement, irregular or non-existent accessibility ramps, and holes in the lanes. After collection, the raw data were edited using the free QGIS software, generating the final unified maps that were made available to the community via an online map. In total, 39 accessibility problems were recorded in the surrounding area.
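
    As an illustration of how field observations like these can be prepared for a desktop GIS, the Python sketch below converts a list of observations into a GeoJSON FeatureCollection, a format that QGIS can open. The field names, coordinates and helper functions are hypothetical; the teams' actual mobile applications and data format are not described here.

```python
import json


def to_feature(obs):
    """Convert one field observation into a GeoJSON Point feature."""
    return {
        "type": "Feature",
        # GeoJSON stores positions as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [obs["lon"], obs["lat"]]},
        "properties": {"problem": obs["problem"], "photo": obs.get("photo")},
    }


def to_feature_collection(observations):
    """Bundle all observations into a single GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(obs) for obs in observations],
    }


# Hypothetical observations with made-up coordinates.
observations = [
    {"lat": -25.4300, "lon": -49.2700, "problem": "irregular pavement", "photo": "img_001.jpg"},
    {"lat": -25.4310, "lon": -49.2710, "problem": "non-existent accessibility ramp"},
]

print(json.dumps(to_feature_collection(observations), indent=2))
```

    Keeping each checklist item as a property on the point makes it straightforward to filter or style the map by problem type later.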


    8 people participated in the ô project working room. The initiative, started in 2019 and maintained by Code for Curitiba, aims to be an aggregator of data related to public transportation in the city of Curitiba. At the event, the project leaders, Guilherme and Henrique, presented the project and raised questions, and the participants discussed ways to find the answers. They conducted an exploratory survey of public and private services, extracted data and studied the webservice provided by URBS (Urbanization of Curitiba S/A). They created a comparative table for identifying lines across different services and coded a view of these schedules in PHP + HTML. At the end, they took the opportunity to work on integration with the Kartão project, developed at Code for Curitiba in 2016, which shows the points of sale and recharge for the public transport card.


    Open Data Day Curitiba was also organized by Code for Curitiba in previous years. The 2019 ODD was larger in both public participation and activities performed. This year's direct results include the following. A group was formed to discuss and implement a solution to track the public money spent on medicines in Curitiba. The 1st Urban Accessibility Mapathon of Curitiba produced geolocated information that will be delivered to Ippuc (Institute of research and urban planning of Curitiba), demonstrating how technology can be used to involve the population in collaborative urban planning by mapping information about the city. The ô project received valuable contributions from the participants and gained new collaborators. All projects under development at Code for Curitiba are conducted by volunteers. The discussions on Open Research Data that began at ODD 2018 have advanced. Finally, in their evaluation the participants considered the event useful for understanding the existing challenges of working with open data, and noted that data integration still requires a great deal of work. Collaborative mapping participants liked the idea of using georeferenced data to improve the city. All participants said that they would like to continue the activities proposed by ODD 2019 and to receive more information, and that they consider these activities important and of great impact for the city and for the understanding of effective citizenship.

    More information and photos:



    Code for Curitiba is a brigade of Code for Brazil, inspired by Code for America. They use the principles and practices of the digital age to improve how government serves the public, and how the public improves government. They aim to inspire public servants, people from the tech sector, and community organizers to create change by proving government can do better and showing others how; to provide government with access to the resources and digital talent it needs so that together we can meaningfully impact some of the world’s toughest societal challenges; and to connect and convene people from inside and outside government, and from all over the world, to inspire each other, share successes, learn, build, and shape a new culture of public service for the 21st century.

    Ricardo Mendes Junior is currently the captain of Code for Curitiba. With a degree in Civil Engineering and a PhD in Production Engineering, he is currently a professor at the Federal University of Paraná, working in the Postgraduate Program in Information Management. His topics of interest are: Information Engineering, City Information Modeling (CIM), collaborative production, public participation through collaborative mapping, crowdsourcing plus artificial intelligence, crowd collaboration and civic entrepreneurship.

    Celio Costa Filho is a founding member of Open Knowledge Brasil and of the Wiki Movimento Brasil user group, and the wiki coordinator of Creative Commons Brasil.