Planet Code4Lib

Presentation: Auditing Your Website for Accessibility Compliance and Understanding How ADA Applies / Cynthia Ng

This was presented as a webinar on August 16, 2018 for the Florida Libraries Webinars program. Slides: live deck on GitHub Pages. Brief overview: Policy & Legislation; Web Accessibility; Types of Disabilities; Assistive Technology; WCAG; Auditing Your Website; Accessibility Statement; Take Away. In the US, there are two Acts to consider. The … Continue reading "Presentation: Auditing Your Website for Accessibility Compliance and Understanding How ADA Applies"

ASICs and Mining Centralization / David Rosenthal

Three and a half years ago, as part of my explanation of why peer-to-peer networks that were successful would become centralized, I wrote in Economies of Scale in Peer-to-Peer Networks:
When new, more efficient technology is introduced, thus reducing the cost per unit contribution to a P2P network, it does not become instantly available to all participants. As manufacturing ramps up, the limited supply preferentially goes to the manufacturer's best customers, who would be the largest contributors to the P2P network. By the time supply has increased so that smaller contributors can enjoy the lower cost per unit contribution, the most valuable part of the technology's useful life is over.
I'm not a blockchain insider. But in a blockbuster post, a real insider, David Vorick, the lead developer of Sia (a blockchain-based cloud storage platform), makes it clear that the effect I described has been dominating Bitcoin and other blockchains for a long time, and that it has led to centralization in the market for mining hardware:
The biggest takeaway from all of this is that mining is for big players. The more money you spend, the more of an advantage you have, and there’s not an easy way to change that equation. At least with traditional Nakamoto style consensus, a large entity that produces and controls most of the hashrate seems to be more or less the outcome, and at the very best you get into a situation where there are 2 or 3 major players that are all on similar footing. But I don’t think at any point in the next few decades will we see a situation where many manufacturing companies are all producing relatively competitive miners. Manufacturing just inherently leads to centralization, and it happens across many different vectors.
Below the fold, the details.

A year ago Vorick wrote in Choosing ASICs for Sia:
We’ve seen a lot of discouraging things play out in Bitcoin. At points, mining pools have controlled more than 51% of the hashrate, and today something like 80% of all Bitcoin mining chips are produced by a single company, a company that has not shied away from using their monopoly to make political moves. Ultimately you only need about 5 mining pools to get 51% of the hashrate in Bitcoin, and 10 to hit 75%.
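Vorick's thresholds are just cumulative sums over a sorted list of pool shares. A toy calculation sketches it; the pool percentages below are hypothetical, chosen only to be consistent with the quoted 51%/75% figures, not real pool data:

```python
# Hypothetical pool hashrate shares (percent of total), largest first.
# These are illustrative numbers, not real pool statistics.
shares = [13, 11, 10, 9, 8, 7, 6, 5, 4, 3]

def pools_needed(threshold_pct):
    """How many of the largest pools it takes to reach a hashrate threshold."""
    total = 0
    for count, share in enumerate(shares, start=1):
        total += share
        if total >= threshold_pct:
            return count
    return None  # threshold not reachable with the listed pools

# With these shares, 5 pools reach 51% and 10 pools reach 75%,
# matching the figures in the quote above.
```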
The Ethereum blockchain uses a different Proof of Work from Bitcoin, intended to discourage the use of specialized mining chips and encourage the use of GPUs instead:
The story is actually a bit worse in Ethereum — 3 pools control more than 60% of the hashrate, and 6 pools will get you over 85%. I have tried to get information about how much of this hashrate is everyday users and how much is massive datacenters. Not surprisingly, the massive datacenters are not eager to advertise themselves, and it’s difficult to get a good feel for the distribution. We know, however, that there are very large Ethereum mining farms, and that these farms are able to use economies of scale to get significantly better cost efficiencies and energy efficiencies than what you can get with your GPU at home. Make no mistake, the centralization pressures that drove Bitcoin to where it is today are active in the Ethereum ecosystem as well — GPU mining is not a safe haven.
We started Nvidia in a downturn; we were the only chip company to get funding that quarter. Six months later we counted around thirty other companies all attacking the PC graphics chip market. Most of them planned to build programmable graphics chips, a general-purpose CPU combined with graphics-specific accelerators. We knew that none of them could compete on performance for a very simple reason. The bottleneck for graphics performance was the bandwidth of the graphics memory. The programmable chips needed some of that bandwidth to fetch their instructions. Custom graphics chips like NV1 could use every single memory cycle for graphics.

A quarter of a century later memory bandwidth is enormously greater and GPUs are programmable, which is why they are useful for mining. But Vorick makes essentially the same argument about mining:
At the end of the day, you will always be able to create custom hardware that can outperform general purpose hardware. I can’t stress enough that everyone I’ve talked to in favor of ASIC resistance has consistently and substantially underestimated the flexibility that hardware engineers have to design around specific problems, even under a constrained budget. For any algorithm, there will always be a path that custom hardware engineers can take to beat out general purpose hardware. It’s a fundamental limitation of general purpose hardware.
Some years ago ASICs made mining Bitcoin using GPUs uneconomic:
In Bitcoin, if you are using hardware other than Bitcoin-specific ASICs to attack the network, your efficiency is going to drop by a factor of a thousand or more. The hundred thousand dollar attack becomes a hundred million dollar attack. For this reason, we don’t typically worry about things like supercomputers — an entire supercomputer mining Bitcoin is overpowered by a handful ASICs, and the energy costs to produce a full alternate history are strictly prohibitive. If you are going to attack Bitcoin, you need Bitcoin ASICs, end of story.
GPUs are a mass-market product manufactured in volumes much greater than were ever used for mining. Yes, mining demand at the margin caused shortages and price increases in the GPU market, but mining was never the main use, and the existence of each new GPU is public. Mining ASICs, by contrast, are a much smaller market, perhaps a hundred million dollars or so a year against Nvidia's almost ten billion, and their entire market is miners. It is true that big customers will get new GPUs slightly sooner, but things in the mining ASIC market are very different. For example, Monero's Proof of Work is intended to resist ASICs, but:
A few months ago, it was publicly exposed that ASICs had been developed in secret to mine Monero. My sources say that they had been mining on these secret ASICs since early 2017, and got almost a full year of secret mining in before discovery. The ROI on those secret ASICs was massive, and gave the group more than enough money to try again with other ASIC resistant coins.

It’s estimated that Monero’s secret ASICs made up more than 50% of the hashrate for almost a full year before discovery, and during that time, nobody noticed. During that time, a huge fraction of the Monero issuance was centralizing into the hands of a small group, and a 51% attack could have been executed at any time.
Bitmain is the leading supplier of mining ASICs. While Sia was developing its own ASIC, Bitmain took the team by surprise with a competing product, the Antminer A3. Vorick estimates this was an incredibly profitable move:
Using Sia as an example, we estimate it cost Bitmain less than $10 million to bring the A3 to market. Within 8 minutes of announcing the A3, Bitmain already had more than $20 million in sales for the hardware they spent $10 million designing and manufacturing. Before any of the miners had made any returns for customers, Bitmain had recovered their full initial investment and more.
But for most customers it was a money-loser:
If a cryptocurrency like Sia has a monthly block reward of $10 million, and a batch of miners is expected to have a shelf life of 12 months, the most you would expect a company could make off of building miners is $120 million. But, manufacturers actually have a way to make substantially more than that.

In the case of Bitmain’s A3, a small batch of miners were sold to the public with a very fast shipping time, less than 10 days. Shortly afterwards, YouTube videos started circulating of people who had bought the miners and were legitimately making $800 per day off of their miner. This created a lot of mania around the A3, setting Bitmain up for a very successful batch 2.

While we don’t know exactly how many A3 units got sold, we suspect that the profit margins they made on their batch 2 sales are greater than the potential block reward from mining using the A3 units. That is to say, Bitmain sold over a hundred million dollars in mining rigs knowing that the block reward was not large enough for their customers to make back that money, even assuming free electricity. And this isn’t the first time, they pulled something similar with the Dash miners. We call it flooding, and it’s another example of the dangerous asymmetry that exists between manufacturers and customers.
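The "flooding" asymmetry can be put in back-of-the-envelope numbers. The block reward and shelf life come from the quote above (shelf life taken as 12 months); the hardware-sales figure is an assumption for illustration, standing in for "over a hundred million dollars":

```python
# Back-of-the-envelope "flooding" arithmetic. Figures are illustrative.
monthly_block_reward = 10_000_000  # USD of new coins issued per month
shelf_life_months = 12             # before newer hardware obsoletes the batch

# Ceiling on what all miners in the batch can collectively earn,
# even assuming free electricity:
max_miner_revenue = monthly_block_reward * shelf_life_months  # $120M

hardware_sales = 150_000_000  # assumed manufacturer revenue for the batch

# If sales exceed the reward ceiling, customers as a group are
# guaranteed to lose at least the difference:
guaranteed_customer_loss = hardware_sales - max_miner_revenue  # $30M
```

The manufacturer captures its revenue up front; the aggregate loss is then baked in for customers no matter how efficiently they mine.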
GPU manufacturers can't perform these tricks, but mining ASIC manufacturers can. The mining ASIC market, like the cryptocurrency market itself, is completely corrupt and dominated by a few huge players. You need to read both of Vorick's posts to get the full picture of the corruption; this is just a taste.

Vorick concludes:
Though that’s discouraging news, it’s not the end of the world for Bitcoin or other Proof of Work based cryptocurrencies. Decentralization of hashrate is a good-to-have, but there are a large number of other incentives and mechanisms at play that keep monopoly manufacturers in line. ... There are plenty of other tools available to cryptocurrency developers and communities as well to deal with a hostile hashrate base, including hardforks and community splits. The hashrate owners know this, and as a result they are careful not to do anything that would cause a revolt or threaten their healthy profit streams. And now that we know to expect a largely centralized hashrate, we can continue as developers and inventors to work on structures and schemes which are secure even when the hashrate is all pooled into a small number of places.
So despite all the hype about decentralized cryptocurrencies, they aren't. They are dependent upon the self-restraint of a small number of dominant players, who could at any time blow the system up, but don't want to kill the goose that lays the golden egg.

Jobs in Information Technology: August 15, 2018 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Middlesex Community College, Director of Library and Learning Commons Services, Middletown, CT

American College of Physicians, Inside Sales Associate, Philadelphia, PA

University of Arizona Libraries, Department Head, Technology Strategy & Services, Tucson, AZ

Colorado State University Libraries, STEM Liaison Librarian, Fort Collins, CO

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Libraries educate today’s workforce for tomorrow’s careers / District Dispatch

This article was originally posted on The Scoop.

A five-person panel of Ohio community leaders explored employment issues August 9 during “Libraries Educate Today’s Workforce for Tomorrow’s Careers,” an event organized by four Ohio library partners and the American Library Association.

The discussion—which focused on libraries as an essential component in creating, sustaining, and retaining a viable workforce—brought together leaders from all levels of government and community nonprofits.

The program took place at Rakuten OverDrive headquarters in Cleveland and was cosponsored by Cleveland Public Library, Cuyahoga County Public Library, Ohio Library Council, and ALA. It is the first in an upcoming series of local events organized by ALA and hosted in collaboration with libraries and library businesses across the country.

US Rep. Marcia Fudge (D-Ohio) opened the event by identifying one of the biggest employment issues facing her constituents.

“Many job applications today are exclusively online. But the truth is, a lot of the people I represent don’t have daily access to computers or broadband,” Fudge said. “We need to find ways to give them a fighting chance. It takes a real village, and the library is the hub of that village.”

The panel discussion, which was moderated by WKYC-TV anchor and Managing Editor Russ Mitchell, included Ryan Burgess, director of the Governor’s Office of Workforce Transformation; Shontel Brown, Cuyahoga County council representative for district 9; Jeff Patterson, CEO of Cuyahoga Metropolitan Housing Authority; Denise Reading, CEO of GetWorkerFIT; and Mick Munoz, a former Marine and Ohio library patron.

Burgess noted that libraries contribute the necessary infrastructure to accommodate the programs Ohio is working to make available. He also talked about the importance of bringing people together so they don’t have to face the job market alone.

“It’s a retail strategy: location, location, location. The libraries in Ohio help us meet our people where they are,” Burgess said. “It’s great to go to the library and be with people who are seeking the same things you are: updating your résumé, learning a new skill, applying for a job.”

The program went beyond offering insights into the work libraries are doing with partners. ALA Manager of Public Policy Megan Ortegon emphasized that the goal of events like these is not only to talk about the work of libraries but to engage with elected officials and keep libraries in the public eye.

“ALA’s goal is to generate more opportunities by prompting community leaders and elected officials to think about libraries as key partners,” said Ortegon, a former congressional staffer.

The hour-long event encouraged panelists to look to the future.

“We must continue to leverage libraries in order to ready our community members for the growing and changing job market,” said Brown, noting that the work of libraries to help people to find their career paths is vital to Ohio.

Fellow panelist Munoz agreed. “People are working directly with library staff to improve their job skills and access educational programs,” he said. “In this ever-changing environment, we have to use libraries to create a true cycle of continuous learning.”

After the panel, video comments from US Sen. Rob Portman (R-Ohio) emphasized the instrumental role of libraries in workforce development. The event concluded with remarks from ALA President Loida Garcia-Febo.

The post Libraries educate today’s workforce for tomorrow’s careers appeared first on District Dispatch.

Decentralized Web Summit 2018: Quick Takes / David Rosenthal

Last week I attended the main two days of the 2018 Decentralized Web Summit put on by the Internet Archive at the San Francisco Mint. I had many good conversations with interesting people, but it didn't change the overall view I've written about in the past. There were a lot of parallel sessions, so I only got a partial view, and the acoustics of the Mint are TERRIBLE for someone my age, so I may have missed parts even of the sessions I was in. Below the fold, some initial reactions.

Brewster Kahle's theme for the meeting was "A Game With Many Winners". He elaborated on the theme in his talk on the second morning, which included a live demo of accessing the Internet Archive's collections via the decentralized Web. Brewster stressed the importance of finding a business model for publishing on the decentralized Web that isn't surveillance-based advertising. This is something I agree with, but I believe insufficient attention has been paid to DuckDuckGo's successful advertising model, which isn't based on surveillance.

Brewster, who admitted that he'd made a lot of money from Bitcoin, lauded the ability of cryptocurrency micro-payments to enable a pay-per-view model. Alas, Bitcoin is about making a lot of money, whereas micro-payments are about making a little money, so Brewster's experience may have misled him.

The pay-per-view idea ran into opposition from many in the audience who were concerned with opening the Web's resources to under-served populations. The idea of differential pricing, as exemplified by initiatives such as Hinari for the biomedical literature, was raised only to have Cory Doctorow point out that charging different prices depending on how poor you were implied very intrusive surveillance, which was what Brewster was trying to get away from!

I was skeptical in another dimension since, as I wrote earlier:
Clay Shirky had pointed out the reason there wasn't a functional Internet micro-payment system back in 2000:
The Short Answer for Why Micropayments Fail

Users hate them.

The Long Answer for Why Micropayments Fail

Why does it matter that users hate micropayments? Because users are the ones with the money, and micropayments do not take user preferences into account.
To illustrate the state of the art in cryptocurrency micro-payments, I e-mailed Brewster the link to Shitcoin and the Lightning Network.

Brewster ended his talk by using the example of the "Internet Archive, but decentralized" to exhort the audience to go forth and multiply "XXX but decentralized" systems, such as "Slack but decentralized". And, indeed, many of the demos of working software shown at the summit were of "XXX but decentralized".

Alas, in most cases the demos may have been decentralized but they weren't yet as good as XXX. This pointed up a theme common to many sessions, which was that developers needed to focus on the User Experience (UX in the jargon); the mass of users already use XXX and won't shift to something that does the same job no better just because it is decentralized. In this context I'd point out that there is a long-established decentralized network that gets far less use than it should, which is Tor. How Do Tor Users Interact With Onion Services? by Philipp Winter et al from Princeton looks in detail at Tor's UX barriers to adoption, some of which are shared with the decentralized web (such as names that are impossible to type correctly).

Adoption of the decentralized Web outside the geek-o-sphere requires either:
  • applications that are compelling to ordinary people but can only be implemented in a decentralized system, which I haven't seen identified,
  • or a UX enough better than centralized systems to overcome network effects and incumbency, which I haven't seen implemented.
A panel with Kendra Albert, Cindy Cohn, Chris Riley and Caroline Sinders, and a talk by Jennifer Granick were both informative but depressing on the legal aspects, stressing for example the increasing legal requirements to take down content upon request. There are many reasons why this is inherently difficult in a decentralized system, among them being the necessary lack of a central point to which takedown requests can be sent! Although censorship resistance is one of the advantages touted for the decentralized Web, nodes running a Web in which content could be posted anonymously and could not thereafter be made inaccessible would be at intolerable legal risk in virtually any jurisdiction. See, for example, the Bitcoin blockchain, storing which is arguably illegal in almost all countries due to child porn and other non-transaction content which has been injected.

Indeed, one thing I found irritating about much of the discussion at the summit was the casual assumption that the theoretical advantages claimed for the decentralized Web, including security, privacy, persistence, and censorship-resistance, would automatically be delivered by practical implementations of a decentralized Web. As we see with Nakamoto's magnum opus, this is rather unlikely.

A decentralized Web needs to ride above a decentralized storage layer. Nodes participating in the storage layer need to either:
  • Accept liability for the content which they store, which implies that some human has looked at it and decided whether, for example, it is child porn. Or at least that they operate under the DMCA safe harbor and delete content on request.
  • Or claim ignorance of the content which they store, which implies that it is encrypted and the node does not know the key, so cannot decrypt the content. In the context of a decentralized storage system this is technically manageable, if possibly legally fraught. In the context of the decentralized Web, where the whole point is to make the content accessible to anyone, it is difficult. Anyone includes the node itself.
In the second case it would in theory be possible to mix multiple streams of content together cryptographically so that any one could only be re-assembled from M out of N nodes. Then only the requester would see the individual content stream. But this would have significant performance problems, for example using M times the bandwidth, and would mean that the decrypted copy at the requester would have to be evanescent.
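The "M out of N" mixing described above is essentially threshold secret sharing. A minimal sketch using Shamir's scheme over a prime field shows the idea: any M shares reconstruct the content, fewer reveal nothing. This is a toy; a real system would share a key and erasure-code the bulk data rather than push whole payloads through field arithmetic:

```python
import random

P = 2**127 - 1  # a Mersenne prime; all arithmetic is mod P

def split(secret, m, n):
    """Split an integer secret (< P) into n shares; any m reconstruct it."""
    # Random polynomial of degree m-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange-interpolate the polynomial at x=0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat).
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret
```

With `shares = split(secret, 3, 5)`, any three of the five shares recover the secret; a requester must therefore fetch from M nodes, which is exactly the M-times bandwidth cost noted above.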

For me the most interesting thing was talking with a Swiss professor who is heavily involved in the Named Data Networking effort that I wrote about in Moving vs. Copying. Basically, the decentralized Web is replicating all the work that the Named Data Networking people have been doing, just at a much higher level in the stack. The properties of the underlying IP layers are likely to vitiate many of the properties that the decentralized Web proponents want. I made the detailed argument about this in Brewster Kahle's Distributed Web Proposal. Somehow we need to get these two groups talking.

In Decentralising the web: Why is it so hard to achieve? John Leonard interviewed a number of attendees in the run-up to the summit. Here are some extracts showing that realism is starting to sink in:
Matt Zumwalt, program manager at Protocol Labs, creator of the InterPlanetary File System (IPFS), argued that proponents of the decentralised web need to think about how it might be gamed.

"We should be thinking, really proactively, about what are the ways in which these systems can be co-opted, or distorted, or gamed or hijacked, because people are going to try all of those things," he said.

The decentralised web is still an early stage project, and many involved in its creation are motivated by idealism, he went on, drawing parallels with the early days of the World Wide Web. Lessons should be learned from that experience about how reality is likely to encroach on the early vision, he said.

"I think we need to be really careful, and really proactive about trying to understand, what are these ideals? What are the things we dream about seeing happen well here, and how can we protect those dreams?"
Another caution I agree with the first part of is this:
Mitra Ardron, technical lead for decentralisation at the Internet Archive, believes that one likely crunch point will be when large firms try to take control.

"I think that we may see tensions in the future, as companies try and own those APIs and what's behind them," he said. "Single, unified companies will try and own it."

However, he does not think this will succeed because he believes people will not accept a monolith. Code can be forked and "other people will come up with their own approaches."
The investors in decentralized technology companies are not investing with the idea of being one among many, they're hoping that the one they chose will end up dominant and thus able to extract monopoly rent.

Patrick Stanley of Blockstack raised the governance issues that most cryptocurrencies have failed miserably at:
That's lots of positives so far from a user point of view, and also for developers who have a simpler architecture and fewer security vulnerabilities to worry about, but of course, there's a catch. It's the difference between shooting from the hip and running everything by a committee.

"Decentralisation increases coordination costs. High coordination costs make it hard to get some kinds of things done, but with the upside that the things that do get done are done with the consensus of all stakeholders."
David Irvine of MaidSafe was also concerned with governance:
Within any movement dedicated to upending the status quo, there lurks the danger of a People's Front of Judea-type scenario with infighting destroying the possibilities of cooperation. Amplifying the risk, many projects in this space are funded through cryptocurrency tokens, which adds profiteering to the mix. It's easy to see how the whole thing could implode, but Irvine says he's now starting to see real collaborations happen and hopes the summit will bring more opportunities.
Patrick Stanley also raised the elephant in the room:
There are already privacy-centric social networks and messaging apps available on Blockstack, but asked about what remains on the to-do list, Stanley mentioned "the development of a killer app". Simply replicating what's gone before with a few tweaks won't be enough.

A viable business model that doesn't depend on tracking-based advertising is another crucial requirement - what would Facebook be without the data it controls? - as is interoperability with other systems, he said.
OmiseGO's team
The viable business model is urgent, and it won't be micro-payments. It is worrying that there are so many relatively large teams working in an area without, as yet, a sustainable business model. In the picture of OmiseGO's team I count 36 people. At, say, $150K/yr/person that's a $5.4M/yr burn rate without a sustainable business model, which is asking for trouble. The notional amounts of ICOs such as FileCoin's may provide attention-grabbing headlines but, as I showed for FileCoin, they aren't a substitute for a sustainable business model in the medium term.

Althea Allen of OmiseGO stressed the importance of UX and the added difficulty of implementing a good UX in a decentralized system:
However, if the alternatives are awkward and clunky, they will never take off.

"It is difficult, though not impossible, to create a decentralised system that provides the kind of user experience that the average internet user has come to expect. Mass adoption is unlikely until we can provide decentralised platforms that are powerful, intuitive and require little or no change in users' habits."
It is good that Mozilla recognizes IPFS, DAT and so on as legitimate protocols, but none of the popular browsers support these protocols directly. Extensions and downloadable JavaScript are ways around this, but they aren't the best way to address Allen's requirements.

Cory Doctorow's barn-burner of a closing talk, Big Tech's problem is Big, not Tech, was on anti-trust. I wrote about anti-trust in It Isn't About The Technology, citing Lina M. Khan's Amazon's Antitrust Paradox. It is a must-read, as will Cory's talk be if he posts it (Update: the video is here). I agree with him that this has become the key issue for the future of the Web; it is a topic that's had a collection of notes in my blog's draft posts queue for some time. Until I get to it, one hopeful sign is that even the University of Chicago's Booth School is re-thinking the anti-trust ideology that has driven the centralization of the Web, among many other things. The evidence for this is Policy Failure: The Role of “Economics” in AT&T-Time Warner and American Express by Marshall Steinbaum and James Biese, which is worth a read.

I plan to write a follow-up to this post looking at several areas, such as the demos of the Beaker Browser and MIT's Solid system, where I feel the need to follow Andreas Brekken's example, and try before I write.

Putting Partners First with the Lucidworks Partner Program / Lucidworks

Today we are happy to announce the launch of our global partner program.

Our slogan is “Partners First” with an emphasis on putting every partner’s success first. The new program includes comprehensive training and certification by Lucidworks, plus extensive practice with our Fusion platform from real-world projects to ensure the highest-quality implementations for customers across all industries.

“Commvault recognized Lucidworks’ innovative AI, machine learning, and cognitive search features very early on, which is why we are excited to be an early adopter and a Platinum Partner,” said Brian Brockway, Vice President and CTO at Commvault. “This technology partnership will help extend on our leading vision and commitment around data protection with integrated powerful tools that give our joint customers valuable AI-driven insights from their data. The Lucidworks Partner Program will enable our field teams and joint channel partners to collaborate more seamlessly in the acceleration of our joint business opportunities.”

By partnering with us, vendors help form a global network of reliable specialists that customers can rely on to bring the power of the Fusion platform to their organizations. The new program provides the structure to equip customers worldwide with the benefits of our platform, ensuring a complementary fit with our partners.

“Our focus on partners is key to our strategy to bring the power of machine learning and AI to our customers,” said Will Hayes, CEO of Lucidworks. “With this launch, we are committing to customizing success for each partner. This allows our company and our partners to innovate together, blend our unique capabilities, and reach a broader range of organizations looking for operationalized AI.”

Our program also includes OEM partners looking to create their own solutions with embedded Lucidworks technology on a broader SaaS basis or as part of a managed service cloud platform. Additionally, consulting partners who are already a company’s preferred integration vendor can generate new revenue as system integrators and trusted advisors.

More reactions from Lucidworks partners:

“At Onix, customers have always come first,” said Tim Needles, President and CEO of Onix, a current Lucidworks partner. “We look forward to the new partner program and how it further enhances our collaboration, so we can continue elevating our customers to the next level of enterprise search, relevancy, and productivity.”

“At Wabion we look back on a long history of enterprise search and knowledge management projects,” said Michael Walther, Managing Director of Wabion. “From the beginning of our partnership with Lucidworks we felt a strong commitment to partners like Wabion. In the end the perfect combination of a great product and an experienced integration partner delivers successful projects to our customers and future prospects. We also believe in the commitment of Lucidworks in the European market and their partners.”

“Raytion’s long-standing partnership with Lucidworks aligns with both companies’ commitment to providing the building blocks for the implementation of world-class enterprise search solutions,” said Valentin Richter, CEO of Raytion. “This program ensures that Raytion’s high-quality connectors and professional services complement the Fusion search and discovery platform. Together, with our combined expertise and technology we provide the robust offering our customers worldwide demand and value.”

“Thanx Media is excited to be a Lucidworks partner as they expand their commitment to the Solr community with their Fusion platform,” said Paul Matker, CEO of Thanx Media. “Fusion’s capabilities fit well with our expertise and customer base in the search space so we can continue to offer best-in-class enterprise solutions that help our customers solve their user experience challenges.”

“As a leading solution provider delivering cognitive and AI-based enterprise solutions, Essextec is pleased to be partnering with such a strong technology leader in intelligent search and discovery,” said Evan H. Herbst, SVP Business Development and Cognitive Innovations at Essex Technology Group, Inc. “Lucidworks has been a very supportive partner to Essextec and our clients. We look forward to accelerating our cognitive and AI business as Lucidworks increases their commitment and resources to working with partners through their new program.”

“During our time as a partner with Lucidworks, we have witnessed their growth from a Solr advisory firm to a leader in Gartner’s Magic Quadrant,” said Michael Cizmar, Managing Director of MC+A. “We are excited to have been an early adopter of this program. We are looking forward to mutual growth by working together delivering data transformation and insights to our customers using Lucidworks’ SDKs, machine learning, and App Studio.”

Learn more about the Lucidworks Partner Program at

Full press release about today’s announcement.

The post Putting Partners First with the Lucidworks Partner Program appeared first on Lucidworks.

The Internet of Torts / David Rosenthal

Rebecca Crootof at Balkinization has two interesting posts:
  • Introducing the Internet of Torts, in which she describes "how IoT devices empower companies at the expense of consumers and how extant law shields industry from liability."
  • Accountability for the Internet of Torts, in which she discusses "how new products liability law and fiduciary duties could be used to rectify this new power imbalance and ensure that IoT companies are held accountable for the harms they foreseeably cause."
Below the fold, some commentary on both.

Introducing starts with this example:
Once upon a time, missing a payment on your leased car would be the first of a multi-step negotiation between you and a car dealership, bounded by contract law and consumer protection rules, mediated and ultimately enforced by the government. ... Today, however, car companies are using starter interrupt devices to remotely “boot” cars just days after a payment is missed. This digital repossession creates an obvious risk of injury when an otherwise operational car doesn’t start: as noted in a New York Times article, there have been reports of parents unable to take children to the emergency room, individuals marooned in dangerous neighborhoods, and cars that were disabled while idling in intersections.
And asks this question:
This is but one of many examples of how the proliferating Internet of Things (IoT) enables companies to engage in practices that foreseeably cause consumer property damage and physical injury. But how is tort law relevant, given that these actions are authorized by terms of service and other contracts?
IANAL, but presumably "consumer property damage and physical injury" should also include economic loss, such as that foreseeably caused by routers with hard-wired administrative passwords that allow miscreants to compromise consumers' bank accounts:
Classically, an injured individual could bring a tort suit to seek compensation for harm. But in addition to social and practical deterrents, a would-be plaintiff suffering from an IoT-enabled injury faces three significant legal hurdles.
The three hurdles are:
  • The End User License Agreement (EULA) for the device is a contract that may specifically permit actions such as remotely disabling a car. But it likely does not specifically permit vulnerabilities such as hardwired administrative passwords; it probably just contains a general disclaimer of liability.
  • Especially since they had no opportunity to negotiate the terms of the contract, the consumer could argue that it is unconscionable, and thus a tort suit could proceed. But Crootof writes: "Absent a better understanding of how IoT-enabled harms scale, however, judges are unlikely to declare clauses limiting liability unconscionable when evaluating individual cases."
  • Even were the contract declared unconscionable, Crootof writes: "a plaintiff will still need to prove breach of a duty and causation. But there is little clarity about what duties an IoT company owes users". The state of the art is that IoT, and software companies in general, owe their users no duties of any kind. And it is arguable, for example, that the economic loss was caused by the miscreant accessing the bank account, not the company that implemented the means for the access and neither informed the user that it existed nor provided any means to disable it.
Thus, in the current state of the law, consumers are effectively denied any remedy for harms caused by IoT companies' negligence, and IoT companies have incentives to harm consumers through negligence.

Crootof is not completely pessimistic. Tort law is not immutable. Accountability starts by describing the history of tort law's evolution in the face of technological change:
Over and over, in response to technologically-fostered shifts in the political economy, tort law has evolved in response to situations where the logic of individual agreement or apparent non-relation should give way to a social logic of duty and recompense. Two of the more momentous examples are the creation of the modern conception of “negligence” and the development of products liability law. In each of these situations, tort law responded to new, technologically-enabled harms by creating more expansive duties of care and affirming the validity of more attenuated causation analyses.
How does products liability law measure up to the IoT?
When harm is caused as the result of an IoT device’s design defect, manufacturing defect, or inadequate warning, it can be addressed through existing products liability law.
Not so fast. First, there is the disclaimer of liability in the EULA to be overcome. Second, in order to establish the presence of a defect it would be necessary to examine both the software in the device, prohibited under the DMCA, and the software in the server, prohibited under the CFAA. Third, unless the IoT device is part of a major purchase such as a car or an appliance, the manufacturer is probably a tiny company in Shenzhen operating on razor-thin margins, assembling the device from components sourced from other tiny companies. Not a viable target for a lawsuit. In practice, manufacturers and vendors have immunity for defects.

Does products liability help with harms due to vulnerabilities?
When such harm is caused by a hacker, we can debate whether the harm should lie where it falls or be considered a kind of design defect or breach of implied warranty.
If the debate assigns harm to the hacker, given the difficulty of attribution in cyberspace, consumers are extremely unlikely to be able to identify the miscreant, who is in any case probably in a different jurisdiction. If the debate assigns harm to the manufacturer, the obstacles above apply. The EULA almost certainly disclaims any implied warranties.

Thus in practice smaller IoT manufacturers are immune from products liability law, and large companies are shielded by the EULA. The case of "digital repossession" is different, because it is intentional, not a defect:
But what about when a company intentionally discontinues service for an IoT device, either in response to a contractual breach or as outright punishment?
Crootof suggests how products liability could be enhanced for "self-help enforcement":
For products liability law to be applicable, we may need to develop a new claim grounded in defective service — a “service defect” claim. A company could be required to provide written notice of the possibility of self-help enforcement in its initial contract, and it could install all manner of warnings to notify the device’s user of missed payments or other contractual violations that trigger the possibility of digital repossession. Alternatively, companies could be required to engage the state to ensure a certain amount of due process before digitally repossessing a device, especially should a company delegate its self-help enforcement decisions to algorithms.
Both of these alternatives would be an improvement. But consumers don't read the EULA, and companies won't like the costs and delays if they "engage the state". I remain skeptical, except possibly for cars, which are already a heavily regulated market.

Crootof's alternative approach seems more promising to me:
In situations where IoT companies provide services that consumers rely upon—such as cars, alert systems, or medical devices—it might make more sense to focus on the trust element associated with that service relationship. Doctors, therapists, accountants, and lawyers are all fiduciaries, entities who have a “position of superiority or influence, acquired by virtue of [a] special trust.” Similarly, IoT companies could be recognized as having a distinct fiduciary relationship with IoT device users. ... Like other fiduciaries, IoT companies would have a duty of care; specifically, a duty not to foreseeably cause harm to their consumers when discontinuing service, remotely altering a device, or  engaging in digital repossession. IoT companies would also have a duty of loyalty, which would require them to act in the interests of the IoT device user.
It seems more promising primarily because the most important aspect of it could be the result of case law rather than heavily lobbied-against legislation:
IoT companies could have a duty not to overreach in their contracts. This duty could be extrapolated from Williams v. Walker-Thomas Furniture, which implied that companies owe a tort-like duty of good faith to their customers, especially when customers have limited choice in negotiating contractual terms. In the IoT context, this would prohibit the industry from including overly invasive contractual terms, holding IoT devices hostage by conditioning their continued utility on acceptance of new contract terms, or using notice purely as a liability shield.
As also could be the question of attributing the harm:
it will also be necessary to reconceptualize the causation evaluation. Intervening causes of harm are not necessarily unforeseeable. Additionally, different IoT devices can cause different degrees of harm. An inoperative Fitbit will not cause much harm; an inoperative Nest might; an inoperative pacemaker, alert system, or vehicle almost certainly will. Because disabling devices will usually increase the likelihood of harm, rather than directly causing harm, a balancing test that weighs both the foreseeability of harm and its likely gravity would be useful in the IoT context.

Libraryland, We Have a Problem / CrossRef

The first rule of every multi-step program is to admit that you have a problem. I think it's time for us librarians to take step one and admit that we do have a problem.

The particular problem that I have in mind is the disconnect between library data and library systems in relation to the category of metadata that libraries call "headings." Headings are the strings in the library data that represent those entities that would be entry points in a linear catalog like a card catalog.

It pains me whenever I am an observer to cataloger discussions on the proper formation of headings for items that they are cataloging. The pain point is that I know that the value of those headings is completely lost in the library systems of today, and therefore there are countless hours of skilled cataloger time that are being wasted.

The Heading

Both book and card catalogs were catalogs of headings. The catalog entry was a heading followed by one or more bibliographic entries. Unfortunately, the headings serve multiple purposes, which is generally not a good data practice but is due to the need for parsimony in library data when that data was analog, as in book and card catalogs.

  • A heading is a unique character string for the "thing" – the person, the corporate body, the family – essentially an identifier.
Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
  • It supports the selection of the entity in the catalog from among the choices that are presented (although in some cases the effectiveness of this is questionable)

  • It is an access point, intended to be the means of finding, within the catalog, those items held by the library that meet the need of the user.
  • It provides the sort order for the catalog entries (which is why you see inverted forms like "Tolkien, J. R. R.")
United States. Department of State. Bureau for Refugee Programs
United States. Department of State. Bureau of Administration
United States. Department of State. Bureau of Administration and Security
United States. Department of State. Bureau of African Affairs
  • That sort order, and those inverted headings, also serve to collocate entries by some measure of "likeness":
Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
Tolkien Society
Tolkien Trust

The last three functions – providing a sort order, access, and collocation – have been lost in the online catalog. The reasons for this are many, but the main one is that keyword searching has replaced alphabetical browse as the way to locate items in a library catalog.
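The sort-order function can be demonstrated with a plain lexicographic sort – a minimal sketch in Python using the example headings above. (Note that a naive byte-order sort only approximates library filing rules, which would file "Tolkien, J. R. R." before "Tolkien Society".)

```python
# Left-anchored, inverted headings collocate related entries
# when the catalog is sorted, as in a card or book catalog.
headings = [
    "United States. Department of State. Bureau of African Affairs",
    "Tolkien Trust",
    "Tolkien, J. R. R. (John Ronald Reuel), 1892-1973",
    "United States. Department of State. Bureau of Administration",
    "Tolkien Society",
]

# A plain string sort is enough to bring "like" headings together.
for heading in sorted(headings):
    print(heading)
```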

The upshot is that many hours are spent during the cataloging process to formulate a left-anchored, alphabetically order-able heading that has no functionality in library catalogs other than as fodder for a context-less keyword search.

Once a keyword search is done, the resulting items are retrieved without any correlation to headings. It may not even be clear which headings one would use to create a useful order, and the set of bibliographic resources retrieved by a single keyword search may not represent any coherent body of knowledge. Here's an illustration using the keyword "darwin":

Gardiner, Anne.
Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity / by Anne Gardiner ; art work by
Darwin : State Library of the Northern Territory, 1993.
Christianity--Australia--Northern Territory.
Tiwi (Australian people)--Religion.
Northern Territory--Religion.

Crabb, William Darwin.
Lyrics of the golden west. By W. D. Crabb.
San Francisco, The Whitaker & Ray company, 1898
West (U.S.)--Poetry.

Darwin, Charles, 1809-1882.
Origin of species by means of natural selection; or, The preservation of favored races in the struggle for life and The descent of man and selection in relation to sex, by Charles Darwin.
New York, The Modern library [1936]
Evolution (Biology)
Natural selection.
Human evolution.

Bear, Greg, 1951-
Darwin's radio / Greg Bear.
New York : Ballantine Books, 2003.
Women molecular biologists--Fiction.
DNA viruses--Fiction.

No matter what you would choose as a heading on which to order these, it will not produce a sensible collocation that would give users some context to understand the meaning of this particular set of items – and that is because there is no meaning to this set of items, just a coincidence of things named "Darwin."

Headings that have been chosen to be controlled strings should offer a more predictable user search experience than free-text searching, but headings do not necessarily provide collocation. As an example, Wikipedia uses the names of its pages as headings, and there are some rules (or at least preferred practices) to make the headings sensible. A search in Wikipedia is a left-to-right search on a heading string that is presented as a drop-down list of a handful of headings matching the search string.

Included in the headings in the drop-down are "see"-type terms that, when selected, take the user directly to the entry for the preferred term. If there is no one preferred term, Wikipedia directs users to disambiguation pages to help them select among similar headings.
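This behavior – a left-anchored search over headings with "see"-type redirects to preferred terms – can be sketched in a few lines of Python. The headings and the redirect below are invented for illustration, not real Wikipedia data:

```python
# Headings map to None (a preferred term) or to the preferred
# heading they redirect to (a "see"-type reference).
headings = {
    "Tolkien, J. R. R.": None,
    "Tolkien, John Ronald Reuel": "Tolkien, J. R. R.",  # see-reference
    "Tolkien Society": None,
}

def suggest(prefix, limit=5):
    """Left-to-right match: return headings starting with the prefix."""
    prefix = prefix.lower()
    return sorted(h for h in headings if h.lower().startswith(prefix))[:limit]

def resolve(heading):
    """Follow a see-reference to the preferred heading, if there is one."""
    return headings.get(heading) or heading

print(suggest("tolkien, j"))
print(resolve("Tolkien, John Ronald Reuel"))
```

A real system would back `suggest` with an index rather than a linear scan, but the user-facing contract is the same: type a prefix, pick a heading, land on the preferred entry.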

The Wikipedia pages, however, only provide accidental collocation, not the more comprehensive collocation that libraries aim to attain. That library-designed collocation, however, is also the source of the inversion of headings, making those strings unnatural and unintuitive for users. Although the library headings are admirably rules-based, they often use rules that will not be known to many users of the catalog, such as the difference in name headings with prepositions based on the language of the author. To search on these names one therefore needs to know the language of the author and the rule that is applied to that language – something that, I am quite sure, we can assume is not common knowledge among catalog users.

De la Cruz, Melissa
Cervantes Saavedra, Miguel de

I may be the only patron of my small library branch who has known to look for the mysteries by Icelandic author Arnaldur Indriðason under "A", not "I".

What Is To Be Done?

There isn't an easy (or perhaps not even a hard) answer. As long as humans use words to describe their queries, we will have the problem that words and concepts, and words and relationships between concepts, do not neatly coincide.

I see a few techniques that might be used if we wish to save collocation by heading. One would be to allow keyword searching, but have the system use it to suggest headings that can then be used to view collocated works. Some systems do allow users to retrieve headings by keyword, but headings, which are very terse, are often not self-explanatory without the items they describe. A browse of headings alone is much less helpful than the association of the heading with the bibliographic data it describes. Remember that headings were developed for the card catalog, where they were printed on the same card that carried the bibliographic description.
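That first technique – keyword search used to surface headings, each shown with a bit of bibliographic context, and a chosen heading then collocating works – might look like this minimal Python sketch. The three-record catalog is invented for illustration:

```python
# A tiny catalog of (heading, title) pairs.
catalog = [
    ("Darwin, Charles, 1809-1882", "Origin of species"),
    ("Darwin, Charles, 1809-1882", "The descent of man"),
    ("Crabb, William Darwin", "Lyrics of the golden west"),
]

def suggest_headings(keyword):
    """Map a keyword to the controlled headings it occurs in, pairing
    each terse heading with one sample title for context."""
    keyword = keyword.lower()
    suggestions = {}
    for heading, title in catalog:
        if keyword in heading.lower():
            suggestions.setdefault(heading, title)  # keep first title seen
    return suggestions

def collocate(heading):
    """Once a heading is chosen, show all works filed under it."""
    return [title for h, title in catalog if h == heading]
```

The point of the sketch is the two-step interaction: the keyword only narrows the list of headings, and it is the heading – not the keyword – that produces the collocated set.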

Another possible area of investigation would be to look to the classified catalog, a technique that has existed alongside alphabetical catalogs for centuries. The Decimal Classification of Dewey was a classified approach to knowledge with a language-based index (his "Relativ Index") to the classes. (It is odd that the current practice in US libraries is to have one classification for items on shelves and an unrelated heading system (LCSH) for subject access.) The classification provides the intellectual collocation that the headings themselves do not provide. The difficulty with this is that the classification collocates topically but, at least in its current form, does not collocate the name headings in the catalog that identify people and organizations as entities.

Conclusion (sort of)

Controlled headings as access points for library catalogs could provide better service than keyword search alone. How to make use of headings is a difficult question. The first issue is how to exploit the precision of headings while still allowing users to search on any terms that they have in mind. Keyword search is, from the user's point of view, frictionless: they don't have to think "what string would the library have used for this?"

Collocation of items by topical sameness or other relationships (e.g. "named for", "subordinate to") is possibly the best service that libraries could provide, although it is very hard to do this through the mechanism of language strings. Dewey's original idea of a classified order with a language-based index is still a good one, although classifications are hard to maintain and hard to assign.

If challenged to state what I think the library catalog should be, my answer would be that it should provide a useful order that illustrates one or more intellectual contexts that will help the user enter and navigate what the library has to offer. Unfortunately I can't say today how we could do that. Could we think about that together?


Dewey, Melvil. Decimal classification and relativ index for libraries, clippings, notes, etc. Edition 7. Lake Placid Club, NY: Forest Press, 1911.

Shera, Jesse H., Margaret E. Egan, and Jeannette M. Lynn. The Classified Catalog: Basic Principles and Practices. Chicago, Ill.: American Library Association, 1956.

Frictionless Data and FAIR Research Principles / Open Knowledge Foundation

In August 2018, Serah Rono will be running a Frictionless Data workshop in Copenhagen, convened by the Danish National Research Data Management Forum as part of the FAIR Across project. In October 2018, she will also run a Frictionless Data workshop at FORCE11 in Montreal, Canada. Ahead of the two workshops, and other events before the close of 2018, this blog post discusses how the Frictionless Data initiative aligns with FAIR research principles.

An integral part of evidence-based research is gathering and analysing data, which takes time and often requires skill and specialized tools to aid the process. Once the work is done, reproducibility requires that research reports be shared along with the data and software from which insights were derived and conclusions drawn. Widely lauded as a key measure of research credibility, reproducibility also makes a bold demand for openness by default in research, which in turn fosters collaboration.

FAIR (findability, accessibility, interoperability and reusability) research principles are central to the open access and open research movements.

"FAIR Guiding Principles precede implementation choices, and do not suggest any specific technology, standard, or implementation-solution; moreover, the Principles are not, themselves, a standard or a specification. They act as a guide to data publishers and stewards to assist them in evaluating whether their particular implementation choices are rendering their digital research artefacts Findable, Accessible, Interoperable, and Reusable."

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016)

Data Packages in Frictionless Data as an example of FAIRness

Our Frictionless Data project aims to make it effortless to transport high quality data among different tools & platforms for further analysis. The Data Package format is at the core of Frictionless Data, and it makes it possible to package data and attach contextual information to it before sharing it.

An example data package

Data packages are nothing without the descriptor file. This descriptor file is made available in a machine-readable format, JSON, and holds metadata for your collection of resources, and a schema for your tabular data.

In Data Packages, pieces of information are called resources. Each resource is referred to by name and has a globally unique identifier, with the provision to reference remote resources by URLs. Resource names and identifiers are held alongside other metadata in the descriptor file.


Since metadata is held in the descriptor file, it can be accessed separately from associated data. Where resources are available online – in an archive or data platform – sharing the descriptor file alone is sufficient, and data provenance is guaranteed for all associated resources.


The descriptor file is saved as a JSON file, a machine-readable format that can be processed with great ease by many different tools during data analysis. The descriptor file uses accessible and shared language, and has provision to add descriptions, and information on sources and contributors for each resource, which makes it possible to link to other existing metadata and guarantee data provenance. It is also very extensible, and can be expanded to accommodate additional information as needed.
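Along these lines, a minimal descriptor can be assembled with nothing but the standard library. The package name, resource path, and field definitions below are invented for illustration; real packages follow the Frictionless Data Package specification:

```python
import json

# A minimal datapackage.json descriptor: package metadata, one
# named resource, and a schema describing the tabular data.
descriptor = {
    "name": "example-package",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "country-codes",
            "path": "data/country-codes.csv",
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "code", "type": "string"},
                ]
            },
        }
    ],
}

# The descriptor is plain, machine-readable JSON, so it can be
# shared on its own, separately from the data it describes.
print(json.dumps(descriptor, indent=2))
```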


Part of the metadata held in a data package includes licensing and author information, and there is a requirement to link back to original sources, thus ensuring data provenance. This serves as a great guide for users interested in your resources. Where licensing allows for resources to be archived on different platforms, this means that regardless of where users access this data from, they will be able to trace back to the original sources of the data as needed. For example, all countries of the world have unique codes attached to them. See how the Country Codes data package is represented on two different platforms: GitHub, and on DataHub.

With thanks to the Sloan Foundation for the new Frictionless Data for Reproducible Research grant, we will be running deep-dive workshops to expound on these concepts and identify areas for improvement and collaboration in open access and open research. We have exciting opportunities in store, which we will announce in our community channels over time.

Bonus readings

Here are some of the ways researchers have adopted Frictionless Data software in different domains over the last two years:

• The Cell Migration and Standardisation Organisation (CMSO) uses Frictionless Data specs to package cell migration data and load it into Pandas for data analysis and creation of visualizations. Read more.
• We collaborated with the Data Management for TEDDINET project (DM4T) on a proof-of-concept pilot in which we used Frictionless Data software to address some of the data management challenges faced by DM4T. Read more.
• Open Power System Data uses Frictionless Data specifications to make energy data available for analysis and modeling. Read more.
• We collaborated with Pacific Northwest National Laboratory – Active Data Biology and explored the use of Frictionless Data software to generate schemas for tabular data and check the validity of metadata stored as part of a biological application on GitHub. Read more.
• We collaborated with the UK Data Service and used Frictionless Data software to assess and report on data quality, and made a case for generating visualisations with the ensuing data and metadata. Read more.

Our team is also scheduled to run Frictionless Data workshops in the coming months:

• In Copenhagen, convened by the Danish National Research Data Management Forum as part of the FAIR Across project, in August 2018.
• In Montreal, Canada, at FORCE11 between October 10 and 12, 2018. See the full program here and sign up here to attend the Frictionless Data workshop.

Managing ILS Updates / ACRL TechConnect

We’ve done a few screencasts in the past here at TechConnect and I wanted to make a new one to cover a topic that’s come up this summer: managing ILS updates. Integrated Library Systems are huge, unwieldy pieces of software and it can be difficult to track what changes with each update: new settings are introduced, behaviors change, bugs are (hopefully) fixed. The video below shows my approach to managing this process and keeping track of ongoing issues with our Koha ILS.

New research to map the diversity of citizen-generated data for sustainable development / Open Knowledge Foundation

We are excited to announce a new research project around citizen-generated data and the UN data revolution. This research will be led by Open Knowledge International in partnership with King’s College London and the Public Data Lab to develop a vocabulary for governments to navigate the landscape of citizen-generated data.

This research elaborates on past work which explored how to democratise the data revolution, how citizen and civil society data can be used to advocate for changes in official data collection, and how citizen-generated data can be organised to monitor and advance sustainability. It is funded by the United Nations Foundation and commissioned by the Task Team on Citizen Generated Data which is hosted by the Global Partnership for Sustainable Development Data (GPSDD).

Our research seeks to develop a working vocabulary of different citizen-generated data methodologies. This vocabulary shall highlight clear distinction criteria between different methods, but also point out different ways of thinking about citizen-generated data. We hope that such a vocabulary can help governments and international organisations attend to the benefits and pitfalls of citizen-generated data in a more nuanced way and will help them engage with citizen-generated data more strategically.

Why this research matters

The past decades have seen the rise of many citizen-generated data projects. A plethora of concepts and initiatives use citizen-generated data for many goals, ranging from citizen science, citizen sensing and environmental monitoring to participatory mapping, community-based monitoring and community policing. In these initiatives citizens may play very different roles, from acting as mere sensors to shaping what data gets collected. Initiatives may differ in the media and technologies used to collect data, in the ways stakeholders are engaged with partners from government or business, or in how activities are governed to align interests between these parties.

Air pollution monitoring devices used as part of Citizen Sense pilot study in New Cross, London (image from Changing What Counts report)

Likewise, different actors articulate the concerns and benefits of citizen-generated data (CGD) in different ways. Scientific and statistical communities may be concerned about data quality and the interoperability of citizen-generated data, whereas a community centered around the monitoring of the Sustainable Development Goals (SDGs) may be more concerned with issues of scalability and the potential of CGD to fill gaps in official data sets. Legal communities may consider liability issues for government administrations using unofficial data, whilst CSOs and international development organisations may want to know what resources and capacities are needed to support citizen-generated data and how to organise and plan projects.

In our work we will address a range of questions, including: What citizen-generated data methodologies work well, and for what purposes? What is the role of citizens in generating data, and what can data “generation” look like? How are participation and use of citizen data organised? What collaborative models between official data producers/users and citizen-generated data projects exist? Can citizen-generated data be used alongside or incorporated into statistical monitoring, and if so, under what circumstances? And in what ways could citizen-generated data contribute to regulatory decision-making or other administrative tasks of government?

In our research we will:

• Map existing literature, online content and examples of projects, practices and methods associated with the term “citizen-generated data”;
• Use this mapping to solicit input and ideas on other kinds of citizen-generated data initiatives, as well as other relevant literatures and practices, from researchers, practitioners and others;
• Gather suggestions from literature, researchers and practitioners about which aspects of citizen-generated data to attend to, and why;
• Undertake fresh empirical research around a selection of citizen-generated data projects in order to explore these different perspectives.

Visual representation of the Bushwick Neighbourhood, geo-locating qualitative stories on the map (left image), and patterns of land usage (right image) (Source: North West Bushwick Community project)

Next steps

In the spirit of participatory and open research, we invite governments, civil society organisations and academia to share examples of citizen-generated data methodologies, the benefits of using citizen-generated data, and issues we may want to look into as part of our research.

If you’re interested in following or contributing to the project, you can find out more on our forum.

My queer root / Tara Robertson

[Photo: me in a turquoise polka dot bathing suit, with a rainbow mermaid tail and a unicorn mask on, posing on a rock with water and blue sky in the background]

Amber Dawn invited me to share the story of my queer “root” as part of the Vancouver Queer Film Festival‘s program. It was an honour to be on stage with poet and marketing pro David Ly; filmmaker and artist in residence Thirza Cuthand; and wizard, goldsmith and creative consultant Tien Neo Eamas.

      Here’s my story.

      I was born in Vancouver, but my family moved to Prince George when I was 4. As a young kid I was so scared of water that getting my face wet in the shower would make me shriek. So, my extremely pragmatic parents put me in swim lessons. I quickly moved past my fear and became a competitive swimmer. This is something that defined who I was for most of my life. As a teenager I was swimming 9 times a week, plus dryland training and weights. I remember when I squatted 350 pounds the boys on my team called me quadzilla. Now, I think I’d be proud of this nickname but as a teenage girl I was mortified. Teenagers can be such assholes sometimes. At school in gym class my sweat would smell like chlorine. My body was like a giant scratch and sniff sticker and sometimes I would secretly lick my forearm and smell the pool. I was, and still am, a little weird.

      I’ve spent tens of thousands of hours swimming–waking up early and swimming back and forth staring at the black line on the bottom of the pool while our coaches yelled feedback and encouragement. I didn’t realize or appreciate how much work this was for my parents until much later. If I had early morning swim practice, it also meant that either my dad or another swim parent in the neighbourhood had to also get up at 5:30am and drive us kids to the pool.

So, fast forward a few years. It was 1997 and I was 20 and in Washington DC for an internship. I saw that there was a queer swim club called District of Columbia Aquatics Club, or DC/AC. I was so nervous showing up at my first workout. It was the same level of nervousness that I felt the first time going into Little Sisters. I was 15 and down from PG visiting my Gran during summer vacation. I’d taken the bus downtown and made my first pilgrimage to the gay and lesbian book store. I nervously looked around and quickly ducked in the front door, my heart pounding. This is how I felt when I showed up at the pool, looked around to see if anyone was watching me and ducked through the doors.

      The folks at DC/AC were super welcoming and a swim workout is familiar to me. Outside of a club or party I hadn’t been around so many people who were out and comfortable with themselves. DC/AC became my community in Washington and I swam with them 4 times a week. A month later was the International Gay and Lesbian Aquatics championship in San Diego. I was a student and didn’t have much money, but one of my teammates was a flight attendant and he flew me out with him on a buddy pass.

      The International Gay and Lesbian Aquatics championships, or IGLA, was similar to the swim meets of my youth, but with a couple of differences. First, it was my first Masters meet, which means swimmers are 19 years and older. Each age category spanned 5 years and there were a few other younger swimmers like me, lots of swimmers in their 30s, 40s and 50s, and even some in their 60s and 70s. There were several former Olympians.

      IGLA meets also have a special event called the Pink Flamingo, which is what you would get if a drag show and synchronized swimming had a baby. Most of the entries were high camp, mostly men in drag, lip syncing to bad pop music and gay anthems. London’s Out to Swim’s entry was completely different from the rest. They slowly unfurled a long piece of black cloth to make a black ribbon as audio of Princess Diana talking with empathy and compassion about People with AIDS played. She had died a few months before this competition.

      This was when I started to understand the AIDS crisis. Of course I’d read articles in the newspaper but I didn’t really know anyone who was positive. I heard when the queer swim club in Vancouver started in the late 80s, they had difficulty booking pool space as there was a fear of people getting HIV. At the IGLA meets they used to have a minute of silence and read the names of swimmers who had died from AIDS in the last year. San Diego was the first time that this didn’t happen because the list of names was too long.

      My eyes had been opened to this amazing queer community which became an important chosen family that played as significant a role as my biological parents in raising me.

      From Washington DC I was going to Scotland for an exchange. I couldn’t afford to travel back to the west coast. My aunt in Toronto let me stay with her over Christmas. I trained with the queer swim club in Toronto. I remember calling my parents from my aunt’s living room and telling them that I was going to swim in the Amsterdam Gay Games. Looking back it was the perfect way to come out to them–a bit indirect and related to swimming. I won 13 medals at the Amsterdam Gay Games. When I told my parents they were so proud of me.

I realize now how supportive my parents were of swimming—they paid my swim fees, got up early to carpool kids to the pool, volunteered a lot of their time to organize, chaperone and officiate at swim meets. It’s a huge amount of work that I didn’t see until I was in my 30s. After I finished age group swimming my dad continued to volunteer as an official, including at the Commonwealth Games. Swimming was a good thing. I’m not sure many parents hope that their kids will grow up to be queer, so having swimming to temper my news was useful.

      While I was a student in Scotland I swam on my university’s swim team and on London’s Out to Swim. I was broke and either took the night bus down to London or cheated the train. I joined Out to Swim for a swim meet in Paris. There were a few culture shock moments, including men and women showering and changing on the deck together. I saw some interesting genital piercings I hadn’t seen before. Queer swim clubs have a lot of interesting tattoos and piercings.

One of the things that made going to competitions financially accessible for me was hosted housing. In Paris we were put up by an ex-Londoner who had been living in Paris for a long time. I stayed in a little private room beside his flat that had the toilet in the shower. I’d never seen anything like this before. We stayed up late one night talking and realized we had similar politics. He had been a member of the Gay Liberation Front in London and pulled out purple mimeographed newsletters and we talked about socialism, feminism and queer sex. Connecting with older queer people around politics and history made me feel a sense of rootedness.

      I’ve been very lucky and had many opportunities to travel and live in other countries. Swimming has been my constant and where I’ve found community. I swam on a master’s swim team when I lived in Japan, and then I moved to Sydney to compete in Gay Games and figure out what I was going to do with my life.

Swimming has been a key ingredient to my mental health. Hard swim workouts allow me to easily access the introverted, intuitive, problem solving parts of my brain. An unexpected benefit of swim workouts was that they enabled me to figure out database queries that stumped me while I was sitting at my computer. Now it’s where I have conversations with my inner Captain and listen to her wisdom in setting boundaries at work.

Swimming has been central to having a positive relationship with my body. My body is fat, strong, and beautiful. In a culture that hates fat people, especially fat women, swimming has been my best defense against self hatred, fatphobia and misogyny. As I get older my relationship to my body is shifting. It’s a lot more work to just maintain fitness, strength and flexibility. I haven’t competed for 5 years and my goals have changed from achieving best times to just making time to practice and be really present in my body.

Swimming is my queer root. Of course there wasn’t something in the water that made me queer, but through swimming I found an easier path to come out to my parents and found my chosen family and community. My queer swimming community connected me to queer histories, taught me to speak truth to power, and taught me that smelling like chlorine is weird and beautiful.

      Google Summer of Code 2018 / Open Library

This is Internet Archive’s second year participating in Google Summer of Code, but for Open Library, it’s an exciting first. Open Library’s mission is to create “a web page for every book,” and this summer we’re fortunate to team with Salman Shah to advance this mission. Salman’s Google Summer of Code roadmap targets two core needs: modernizing and increasing the coverage of the book catalog, and improving website reliability.

      Bots & Open Library

      Every day, users contribute thousands of edits and improvements to Open Library’s book catalog. Anyone with an Open Library account can add a book record to the catalog if it doesn’t already exist. There’s also a great walkthrough on adding or editing data for existing book pages. Making edits manually can be tedious and so the majority of new book pages on Open Library are automatically created by Bots which have been programmed to perform specific tasks by our amazing community of developers and digital librarians. This month, Salman programmed two new bots. The first one is called ia-wishlist-bot. It makes sure an Open Library catalog record exists for each of the 1M books on the Internet Archive’s Wishlist, compiled by Chris Freeland and Matt Miller. The second bot, named onix-bot, takes book feeds (in a special format called ONIX) from our partners (e.g. Cory McCloud at Bibliometa), and makes sure the books exist in our catalog.

      Importing Internet Archive Wishlist

      Earlier this year, as part of the Open Libraries initiative, Chris Freeland, with the help of Matt Miller and others, compiled a Wishlist of hundreds of thousands of book recommendations for the Internet Archive to digitize:

      “Our goal is to bring 4 million more books online, so that all digital learners have access to a great digital library on par with a major metropolitan public library system. We know we won’t be able to make this vision a reality alone, which is why we’re working with libraries, authors, and publishers to build a collaborative digital collection accessible to any library in the country.”

      In support of this mission, the Open Library team decided it would be helpful if the metadata for these books were imported into the catalog. 

      Importing thousands of books in bulk into Open Library’s catalog presents several challenges. First, many precautions have to be taken to avoid adding duplicate book and author records to the database. To avoid the creation of duplicate records, Salman used the Open Library Book API to check for existing works by ISBN10, ISBN13, and OCLC identifiers. For this project, we were specifically interested in books which had no other editions on Open Library, so any time we noticed an existing edition for the same work, we skipped it. A second check used the Open Library Search API to check for any existing editions with a similar title and author. If there’s a plausible match, we don’t add it to Open Library. This process leaves us with a much shorter list of presumably unique works to add to Open Library.
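The two-stage duplicate check described above can be sketched in Python. This is a minimal illustration, not the actual bot code: the record field names and the in-memory stand-ins for the catalog are assumptions, and a real bot would query the Open Library Book and Search APIs instead.

```python
# Sketch of the two-stage duplicate check: identifiers first, then title/author.
# Field names and the in-memory "catalog" are illustrative, not the real bot code.

def is_duplicate(record, known_identifiers, known_title_authors):
    """Return True if the wishlist record likely already exists in the catalog.

    Stage 1 mirrors the Book API lookup: match on ISBN-10, ISBN-13, or OCLC id.
    Stage 2 mirrors the Search API lookup: match on normalized title + author.
    """
    # Stage 1: identifier match (ISBN-10, ISBN-13, OCLC)
    for key in ("isbn_10", "isbn_13", "oclc"):
        value = record.get(key)
        if value and value in known_identifiers:
            return True

    # Stage 2: normalized title/author match
    title = record.get("title", "").strip().lower()
    author = record.get("author", "").strip().lower()
    return (title, author) in known_title_authors


# Tiny in-memory stand-in for the existing catalog
known_identifiers = {"9780140328721", "0140328726"}
known_title_authors = {("matilda", "roald dahl")}

new_record = {"isbn_13": "9999999999999", "title": "Matilda", "author": "Roald Dahl"}
print(is_duplicate(new_record, known_identifiers, known_title_authors))  # True: title/author match
```

Records that pass both checks form the shortlist of presumably unique works mentioned above.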

      Finding book covers for this new shortlist was the next challenge to overcome. These book covers typically come from an Open Library partner like Better World Books. Because Better World Books doesn’t have book covers for every book in our list, we had to be mindful that sometimes their service returns a default fallback image (which we had to detect). We wouldn’t want to add these placeholder images into Open Library’s catalog.

The last step is to make sure we’re not accidentally creating new author records when we add our shortlist of books to Open Library. Even though we’ve taken precautions to ensure that a book with the same identifiers, title, and author doesn’t already exist, that doesn’t guarantee the author isn’t already registered in our database. If they are, duplicating the author record would result in a negative and confusing user experience for readers searching for this author. We check whether an author already exists on Open Library by using the Author search API and faceting on their name, as well as birth and death dates (where available in our shortlist).
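The author-existence check can be sketched roughly as follows. The list of author dicts below stands in for Author search API results; the field names and keys are illustrative assumptions, not the real data model.

```python
# Illustrative sketch of the author-existence check: match on name, and on
# birth/death dates where both sides have them. Not the actual bot code.

def find_existing_author(candidate, author_index):
    """Return the key of a matching author record, or None.

    `author_index` stands in for Author search API results: dicts with
    'key', 'name', 'birth_date', 'death_date'.
    """
    name = candidate["name"].strip().lower()
    for author in author_index:
        if author["name"].strip().lower() != name:
            continue
        # When both records carry a date, require it to agree too.
        for field in ("birth_date", "death_date"):
            if candidate.get(field) and author.get(field) and candidate[field] != author[field]:
                break
        else:
            return author["key"]
    return None


authors = [{"key": "/authors/OL34184A", "name": "Roald Dahl",
            "birth_date": "1916", "death_date": "1990"}]
print(find_existing_author({"name": "roald dahl", "birth_date": "1916"}, authors))
# -> /authors/OL34184A
```

If no match is found, and only then, a new author record is created alongside the book.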

      In summary:

• The project started with 1 million books from the Internet Archive Wishlist to be added to Open Library.
• Many of these works were duplicates that already existed on Open Library and were merged there. After this round of deduplication, 255,276 works remained.
• Matching was performed on ISBN, title, and author name, and we began by adding the top 1,000 works to Open Library. One example of a book that was added can be found here

      An important output from this step was the standardization and generalization of our bot creation process.

      Importing ONIX Records

      In late 2017, one of our partners, Cory McCloud from Bibliometa, gifted Open Library access to tens of thousands of book metadata records in ONIX format:

“ONIX for Books is an XML format for sharing bibliographic data pertaining to both traditional books and eBooks. It is the oldest of the three ONIX standards, and is widely implemented in the book trade in North America, Europe and increasingly in the Asia-Pacific region. It allows book and ebook publishers to create and manage a corpus of rich metadata about their products, and to exchange it with their customers (distributors and retailers) in a coherent, unambiguous, and largely automated manner.”

      Many publishers use ONIX feeds to disseminate the metadata and prices of their books to partner vendors. Cory and his team thought Bibliometa’s ONIX records could be a great opportunity for synergy; to get publishers and authors increased exposure and recognition, and to improve the completeness and quality of Open Library’s catalog.

The steps for processing Bibliometa’s ONIX records are similar to those for importing books from the Internet Archive Wishlist, especially the steps for ensuring we weren’t creating duplicate records in Open Library. At the same time, the task of determining which authors already exist and which need to be created in the catalog was complicated by the fact that fewer birth and death dates were available, greatly reducing our confidence in author searching and matching. In other ways, creating an ONIX import pipeline was simplified by our earlier efforts, which had established key conventions for how new bots may be created using the openlibrary-bots repository. Additionally, our ONIX feeds have the advantage of coming with book covers, whereas we had to manually source book covers for items in the Wishlist.

The first step towards adding these records to Open Library was to write a parser to convert these ONIX feeds into a format which Open Library can understand. Open Library did have an ONIX parser and import script, written by Open Library co-founder Aaron Swartz, which parsed ONIX records and added them to the Open Library database. Like many of Open Library’s scripts, this code was in Python 2.7, encoded a much earlier version of the ONIX specification, and made use of a very old XML parser which was difficult to extend. Unfortunately, we couldn’t find any drop-in Python replacements for the ONIX parser on GitHub. These factors motivated rolling our own new ONIX parser.
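As a rough illustration of what such a parser involves, here is a minimal sketch using Python’s standard library. The sample XML is a simplified, made-up fragment in the spirit of ONIX (ProductIDType 15 denotes an ISBN-13), not a complete ONIX 3.0 message, and this is not the project’s actual parser.

```python
# Minimal sketch of extracting Open Library fields from one ONIX <Product>.
# The sample XML is a simplified, illustrative fragment, not full ONIX 3.0.
import xml.etree.ElementTree as ET

ONIX_SAMPLE = """
<Product>
  <ProductIdentifier>
    <ProductIDType>15</ProductIDType>  <!-- 15 = ISBN-13 -->
    <IDValue>9780140328721</IDValue>
  </ProductIdentifier>
  <DescriptiveDetail>
    <TitleDetail>
      <TitleElement>
        <TitleText>Matilda</TitleText>
      </TitleElement>
    </TitleDetail>
    <Contributor>
      <PersonName>Roald Dahl</PersonName>
    </Contributor>
  </DescriptiveDetail>
</Product>
"""

def parse_product(xml_text):
    """Extract the fields an import bot needs from one ONIX Product record."""
    product = ET.fromstring(xml_text)
    record = {"title": product.findtext(".//TitleText"),
              "authors": [c.text for c in product.iter("PersonName")]}
    for ident in product.iter("ProductIdentifier"):
        if ident.findtext("ProductIDType") == "15":  # ISBN-13
            record["isbn_13"] = ident.findtext("IDValue")
    return record

print(parse_product(ONIX_SAMPLE))
```

A real feed contains many Product elements per file, so the production parser streams through them rather than loading one record at a time like this sketch.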

To start with, Salman received a dump of ~70,000 ONIX records from Bibliometa to be evaluated for import into Open Library. Two checks were implemented for this procedure:

1. Checking whether an existing ISBN-10 or ISBN-13 for that particular work was already on Open Library, using the Open Library Client.
2. Matching on title or author to see whether the record already exists on Open Library, via an API call.

      While much of the ONIX parser is complete, the ONIX Bot project is still in development.

      A Guide on Writing Bots

      Interested in writing your own Open Library Bot? For more information on how to make an Open Library Bot and their capabilities, please consult our documentation. The basic steps are:

1. Apply for a bot account on Open Library by contacting the Open Library maintainer. A good way to do this is to respond to this issue on GitHub.
2. After registering a bot account and having it approved, you can write a bot by extending the openlibrary-client to accomplish tasks like adding new works to Open Library. You can refer to the openlibrary-client examples.
3. All bots that add works to Open Library are added to the Open Library Bots repository on GitHub. Every bot has its own directory with a README containing instructions on how to reproducibly run the bot. Each bot should also link to a corresponding directory within the openlibrary-bots item where the outputs of the bot may be stored for provenance.
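Putting the steps above together, a typical bot’s control flow might look like the following sketch. The client here is a stub standing in for an authenticated openlibrary-client session; the method names are illustrative assumptions, not the real client API.

```python
# Skeleton of a typical import bot: for each incoming record, skip
# duplicates and create the rest. The client is a stand-in stub; a real
# bot would use openlibrary-client with an approved bot account.

class StubClient:
    """Stands in for an authenticated openlibrary-client session."""
    def __init__(self):
        self.created = []

    def work_exists(self, isbn_13):
        return False  # a real client would query the catalog here

    def create_book(self, record):
        self.created.append(record)
        return {"key": f"/books/OL{len(self.created)}M"}


def run_bot(client, records):
    """Add each record that doesn't already exist; return keys of new books."""
    added = []
    for record in records:
        if client.work_exists(record["isbn_13"]):
            continue  # never create duplicates
        added.append(client.create_book(record)["key"])
    return added


client = StubClient()
print(run_bot(client, [{"isbn_13": "9780140328721", "title": "Matilda"}]))
# -> ['/books/OL1M']
```

Keeping the bot’s logic in a small, testable function like `run_bot` also makes it easy to record outputs for the provenance directory described in step 3.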

      Next Steps: Provisioning

Unfortunately, there wasn’t enough time during the GSoC program to complete all three phases of our roadmap (Wishlist, ONIX, and Provisioning). The objective of the third phase of our plan was to make Open Library deployment more robust and reliable using Docker and Ansible. Docker has been a discussion point of several Open Library Community Calls and has catalyzed the creation of a docker branch on the Open Library Github Repository which addresses some of the basic use cases outlined in the GSoC proposal. One important outcome is the identification of concrete steps and recommendations which the community can implement to improve Open Library’s provisioning process:

• Switch from Docker to Docker Compose: currently the Docker branch uses individual Dockerfiles to manage dependencies. The goal is to use a single docker-compose file which will manage all the services being used.
• Switch Open Library to use Ansible (software that automates provisioning, configuration management, and application deployment), with both a production and a development playbook. Playbooks are Ansible’s configuration, deployment, and orchestration language; they can describe a policy you want your remote systems to enforce, or a set of steps in a general IT process.
• Use Ansible Vault, a feature of Ansible that allows keeping sensitive data such as passwords or keys in encrypted files. This will replace the current system of having an olsystem.
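As a rough, hypothetical sketch of the first recommendation, a single docker-compose file could tie the services together like this. The service names and images below are illustrative assumptions, not the actual Open Library configuration.

```yaml
# Hypothetical docker-compose.yml sketch: one file managing all services.
version: "3"
services:
  web:
    build: .           # the Open Library application image
    ports:
      - "8080:8080"
    depends_on:
      - db
      - memcached
  db:
    image: postgres:10
    volumes:
      - db-data:/var/lib/postgresql/data
  memcached:
    image: memcached:1.5
volumes:
  db-data:
```

With this layout, `docker-compose up` brings up the whole stack instead of developers starting each container by hand.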


In retrospect, Google Summer of Code 2018 resulted in thousands of new books being added to the Open Library catalog. Conventions were established that both make it easier for others to create new bots in the future and allow this summer’s work to be continued and extended.

Some of the key points that we overlooked while drafting the proposal were as follows:

1. Checking whether a book already exists on Open Library is hard. We started with a simple title match and ended up normalizing titles, normalizing author names to ensure no duplicate author objects are created, and changing the code so it doesn’t break when a work in our data has no authors.
      2. Improving the openlibrary-client as well as documenting it extensively to ensure that future developers don’t have to go through the code to understand what that particular function ends up doing and how it can be used.
      3. Setting up a structure for the openlibrary-bots directory to ensure future developers are easily able to find the required code they need if they are writing their own bot.
4. We assumed the data would be perfect and that importing would be a matter of copy-pasting; in reality, Salman and Mek had to go through the data to understand where the code broke for various reasons, such as a stray ‘,’ (comma) in a string.

      One learning we obtained from participating in GSoC for the first time is that we may have been better off focusing on two instead of three work deliverables. By the end of the program, we didn’t have enough time for our third phase, even though we were proud of the progress we made. On the flip side, because of discussions catalyzed during our community calls and suggestions outlined in our GSoC proposal, there is now ongoing community progress on this final phase — dockerization of Open Library — which can be found here.

A major win of this GSoC project is that the project’s complexity necessitated that Salman explore writing test cases for the first time, and provided first-hand experience of the importance of a test harness in developing an end-to-end data processing pipeline.

Three of our biggest objectives and key results during this program were:

      1. Quality assuring and updating the documentation of the openlibrary-client tool to support future developers.
2. Creating a new `openlibrary-bots` repository with documentation and processes to ensure there is a standard way to add future bots, and making sure our Wishlist and ONIX bot processes are well documented with reproducible results.
3. Adding thousands of new modern books to the Open Library catalog.

      Project Links

      1. Open Library Client –
      2. Open Library Bots (IA Wishlist Bot) –
      3. Open Library Bots (ONIX Bot) –
      4. Docker (In Progress) –

      Fellow Reflection: Jennifer Nichols / Digital Library Federation

      Jennifer Nichols (@jennytnichols) is the Digital Scholarship Librarian and interim Department Head for the Office of Digital Innovation and Stewardship at the University of Arizona Libraries, and Co-Director for the iSpace, the libraries’ innovation and maker space. She attended the Digital Humanities Summer Institute and the DLFxDHSI unconference in Victoria, BC with support from a Cross-Pollinator Tuition Award. Learn more about Jennifer, or read on for her reflection on her experience.

      Thank you to DLF for the generous support to attend DHSI and the first ever DLFxDHSI unconference.

It goes without saying that 10 days in the Pacific Northwest, in the beautiful city of Victoria, is a gift that anyone could appreciate. Verdant green landscapes accented with bold floral displays are a sensory delight, and do wonders to restore my sense of connection to this earth, and to my work here upon it. Coming into this inviting environment from the Sonoran desert in early June was like being transported to another planet. I joked that I would be the only one sporting a down coat. Though I hesitated, I’m glad I brought it for those two chilly, rainy days.

      As an academic librarian, I often struggle to maintain my identity as a scholar. I am consumed with the entirety of my role–building partnerships, running programs, consulting with students and faculty, managing staff. Though scholarship is a part of my job, it is a small part (5-10% in fact, depending on the year). I know I’m not alone in that, and one of the most meaningful experiences for me at DHSI was the opportunity to connect with others in similar roles as my own.

      Jennifer’s Web API class at DHSI.

I participated in week one of DHSI and attended the Web APIs and Python class taught by CUNY Grad Center librarian Stephen Zweibel and grad students Jojo Karlin, Patrick Smyth, and Jonathan Reeve (of Columbia University). I chose this class because I wanted to learn a new, tangible skill, and Python is one of those things that I have struggled to apply, and thus never made any progress. I wanted to come away having built something, so naturally I used my time, and the expertise of my instructors, to create a database of superhero stats to examine the sexism (duh) inherent in the design of their skills and talents. Stay tuned for part two, the Twitter bot that faces off male and female superheroes…

      So many dynamic and relevant conversations were happening at the Institute, and when I listened to the courses (and unconferences and brown bags) described in lively and inviting ways during the welcome session, I was impressed. As anyone who has attended DHSI will tell you, you cannot possibly do it all, and though you may be tempted, it is not recommended. Take comfort that these conversations are happening, and know that there are many ways to jump in. The Race and DH course was one such conversation that encouraged me. As the conversation spilled out into the whole group, and flowed into each classroom and over Twitter, I was relieved. These are essential to our DH pedagogy work and should not be confined to one course.

      Opening reception poster session with (L-R) Jojo Karlin, Param Ajmera, Dana Johnson, and Jennifer Nichols.

In my time at DHSI, I also wanted to connect with others, and my final day at the DLFxDHSI unconference on Social Justice and Digital Libraries was the perfect way to end my experience. After a week of learning Python with mostly grad students and scholars, I was so happy to be reunited with my library tribe! I reflected and applied learning from the week with like-minded professionals, met new people who shared my work, who wanted to think about alternative ways of doing and being together. The conversations we cultivated were enlightening and pertinent. We asked questions like How can we advance social justice on faculty projects that are “not yours”? Is there a difference between Environmental Justice and Climate Justice? In what ways are we still not valuing indigenous discourse? We considered ways to dismantle racist, sexist, and ableist constructions and how to interrogate the power that archivists hold.

      What else can I say? The week was dynamic and full, and I am so grateful to have spent this time as a student, a learner and a colleague. Thanks DLF, for bringing me here, and for bringing the two communities together. Without you, I may never have seen myself as belonging here.

      The post Fellow Reflection: Jennifer Nichols appeared first on DLF.

      Advice from the Trenches: You’re Not Alone / HangingTogether

“FOCUS” by Iain Farrell CC BY-ND 2.0

      Assessment projects can be intimidating and overwhelming. I have these same thoughts each time I begin a new project. I wonder if my data collection tools will be effective in answering my questions. I worry that I may not be able to adhere to the timelines or that I will not learn anything new or useful. Yes, these are common concerns, but we cannot allow them to stifle our ability to move forward and begin a new project.

      Danuta Nitecki, Dean of Libraries at Drexel University Libraries, provides some excellent advice on assessment evaluations in the research methods book that I co-authored with Marie Radford (Connaway & Radford, 2016). It helps me to reflect on this advice even with years of experience in assessment and I hope it is useful to you as well.

      1. “Techniques to conduct an effective assessment evaluation are learnable.”
      2. Always start with a problem – the question/s.
3. “…consult the literature, participate in Webinars, attend conferences, and learn what is already known about the evaluation problem.”
      4. Take the plunge and just do an assessment evaluation and learn from the experience – the next one will be easier and better.
      5. Make the assessment evaluation a part of your job, not more work.
      6. Plan the process…and share your results.”

I’m going to drill down on the hardest part of an assessment project, which is developing the questions. One of the biggest pitfalls is coming up with broad questions – questions that are difficult or impossible to answer. The questions must be specific and pertain to what you do not yet know about your programs, collections, users, and potential users.

      The first step in developing your questions is to identify what you already know.

      • Think about all the data that the library collects on a daily, weekly, and monthly basis.
      • Think about your daily work and activities in the library and your interactions and observations of the people who use the physical and virtual library, the resources, the offerings, programs, and events.
      • Document what you have learned from the data you collected and from the interactions and observations you have had with individuals using the library.

      Once you have documented what you know, you will be able to identify the gaps in your knowledge. Now is when you will be able to develop a problem statement and then specific questions to use for your assessment project. Keep it simple. Develop the questions so they can be scoped into a manageable project.

      Continue to chip away at the unanswered questions to create a more detailed picture of the library. Why? Because assessment is systematic and cyclical. It does not end but is a constant exercise that should become a part of our daily activities. Assessment is a way of putting the library into the life of the user. And, always remember, “rust never sleeps – not for rockers, not for libraries.”

      If you’re interested in learning more about library assessment, please join me Tuesday, August 14 for Digging into Assessment Data: Tips, Tricks, and Tools of the Trade, and Wednesday, October 3 for Take Action: Using and Presenting Research Findings to Make Your Case. These webinars are part of a special series, Evaluating and Sharing Your Library’s Impact, about assessment that brings together research and practice and provides useful, actionable data to promote and demonstrate the critical role your library plays in your community. Register now for both or for the one that interests you. All sessions are being recorded, if you’re not able to attend. There’s also a Learner Guide available with activities to complete to help you get the most out of the webinar experience. The Learner Guide is valuable for those working together as a team to explore a library’s assessment needs. One cohort of learners working through the series and related exercises is comprised of members of the OCLC Research Library Partnership, and they have been gathering for virtual conversations in the OCLC RLP Library Assessment Interest Group.


      ALA urges Commerce Department to reject Census citizenship question / District Dispatch

      The American Library Association (ALA) has joined 144 groups in opposing the addition of a citizenship question to the 2020 Census form. ALA is a signee of a letter submitted August 1 by the Leadership Conference on Civil and Human Rights to the Department of Commerce, which oversees the US Census Bureau.

      The comments submitted by the coalition elaborate on the harm that would result from adding such a question to the 2020 Census, including diminished data accuracy, an increased burden of information collection, and an added cost to taxpayers. The submission also points to the US Census Bureau’s own January 19 technical review, in which Associate Director for Research and Methodology John Abowd concluded that adding a citizenship question would have an “adverse impact on self-response and, as a result, on the accuracy and quality of the 2020 Census.”

      The technical review also stated that using existing administrative records instead of asking a citizenship question would provide more accurate citizenship data at lower cost to the federal government.

      “Adding a citizenship question to the 2020 Census would suppress Census response, distorting the statistics and making them less informative,” says ALA President Loida Garcia-Febo.

      ALA has participated in previous coalition efforts to prevent the Trump administration’s addition of a citizenship question to the 2020 Census, including a January 10 letter opposing the proposal. The Association is engaging with the US Census Bureau and other stakeholders to keep libraries informed of and represented in the 2020 Census policy discussions and planning process, with the goal that libraries may be better able to support their communities.

      The US Census is a decennial count of all US residents required by the Constitution to determine Congressional representation; district boundaries for federal, state, and local offices; and allocation of billions of dollars in federal funding to states and localities, such as grants under the Library Services and Technology Act. Libraries across the US provide access to the wealth of statistical data published by the US Census Bureau and help businesses, government agencies, community organizations, and researchers find and use the information.

      “As partners in providing access to and supporting meaningful use of Census data, libraries promote the highest possible quality and completeness of the data,” says Garcia-Febo.

      The post ALA urges Commerce Department to reject Census citizenship question appeared first on District Dispatch.

      Adaptation: the Continuing Evolution of the New York Public Library’s Digital Design System / Code4Lib Journal

      A design system is crucial for sustaining both the continuity and the advancement of a website's design. But it's hard to create such a system when content, technology, and staff are constantly changing. This is the situation faced by the Digital team at the New York Public Library. When those are the conditions of the problem, the design system needs to be modular, distributed, and standardized, so that it can withstand constant change and provide a reliable foundation. NYPL's design system has gone through three major iterations, each a step towards the best way to manage design principles across an abundance of heterogeneous content and many contributors who brought different skills to the team and department at different times. Starting from an abstracted framework that provided a template for future systems, then a specific component system for a new project, and finally a system of interoperable components and layouts, NYPL's Digital team continues to grow and adapt its digital design resource.

      Getting More out of MARC with Primo: Strategies for Display, Search and Faceting / Code4Lib Journal

      Going beyond author, title, subject and notes, there are many new (or newly-revitalized) fields and subfields in the MARC 21 format that support more structured data and could be beneficial to users if exposed in a discovery interface. In this article, we describe how the Orbis Cascade Alliance has implemented display, search and faceting for several of these fields and subfields in our Primo discovery interface. We discuss problems and challenges we encountered, both Primo-specific and those that would apply in any search interface.

      Extending and Adapting Metadata Audit Tools for Mountain West Digital Library Members / Code4Lib Journal

As a DPLA regional service hub, Mountain West Digital Library harvests metadata from 16 member repositories representing over 70 partners throughout the Western US and hosts over 950,000 records in its portal. The collections harvested range in size from a handful of records to many thousands, presenting both quality control and efficiency issues. To assist members in auditing records for metadata required by the MWDL Metadata Application Profile before harvesting, MWDL hosts a metadata auditing tool adapted from North Carolina Digital Heritage Center’s original DPLA OAI Aggregation Tools project, available on GitHub. The tool uses XSL tests of the OAI-PMH stream from a repository to check conformance of incoming data with the MWDL Metadata Application Profile. Use of the tool enables student workers and non-professionals to perform large-scale metadata auditing even if they have no prior knowledge of application profiles or metadata auditing workflows. In the spring of 2018, we further adapted and extended this tool to audit collections coming from a new member, Oregon Digital. The OAI-PMH provision from Oregon Digital’s Samvera repository is configured differently than that of the CONTENTdm repositories used by existing MWDL members, requiring adaptation of the tool. We also extended the tool by adding the Dublin Core Facet Viewer, which gives the ability to view and analyze values used in both required and recommended fields by frequency. Use of this tool enhances metadata completeness, correctness, and consistency. This article will discuss the technical challenges of the project, offer code samples, and suggest ideas for further updates.
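MWDL's tool implements its checks as XSL tests; as a rough illustration of the underlying idea, the sketch below audits an oai_dc record for required fields in Python. The required-field list is invented for illustration, not the actual MWDL Metadata Application Profile:

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"
REQUIRED = ["title", "date", "rights", "identifier"]  # hypothetical profile

def audit_record(record_xml):
    """Return the list of required fields missing from one oai_dc record."""
    root = ET.fromstring(record_xml)
    present = {el.tag.replace(DC_NS, "")
               for el in root.iter() if el.tag.startswith(DC_NS)}
    return [f for f in REQUIRED if f not in present]

sample = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Photograph of Main Street</dc:title>
  <dc:date>1921</dc:date>
</oai_dc:dc>"""

print(audit_record(sample))  # → ['rights', 'identifier']
```

A report of missing fields per record is exactly the kind of output that lets student workers audit collections without knowing the profile by heart.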

      Copyright and access restrictions–providing access to the digital collections of Leiden University Libraries with conditional access rights / Code4Lib Journal

      To provide access to the digitized collections without breaking any copyright laws, Leiden University Library built a copyright module for their Islandora-based repository. The project was not just about building a technical solution, but also addressed policy, metadata, and workflows. A fine-grained system of access rights was set up, distinguishing conditions based on metadata, IP address, authentication and user role.
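As a schematic of the kind of decision such a module makes (hypothetical code, not Leiden's actual implementation; the campus IP range and role names are assumptions):

```python
from ipaddress import ip_address, ip_network

CAMPUS = ip_network("132.229.0.0/16")  # assumed campus range, for illustration

def can_view(rights, client_ip, role=None):
    """Combine metadata rights, requester IP, and authenticated role.

    rights: 'open', 'campus-only', or 'restricted' (invented categories).
    """
    if rights == "open":
        return True
    if rights == "campus-only":
        return ip_address(client_ip) in CAMPUS or role == "staff"
    if rights == "restricted":
        return role in {"staff", "curator"}
    return False  # unknown rights statements default to closed

print(can_view("open", "8.8.8.8"))              # True
print(can_view("campus-only", "132.229.10.4"))  # True
print(can_view("restricted", "132.229.10.4"))   # False
```

Defaulting unknown rights statements to closed is the conservative choice when copyright status is uncertain.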

      Using XML Schema with Embedded Schematron Rules for MODS Quality Control in a Digital Repository / Code4Lib Journal

      The Michigan State University Libraries Digital Repository relies primarily on MODS descriptive metadata to convey meaning to users and to improve discoverability and access to the libraries’ unique information resources. Because the repository relies on this metadata for so much of its functionality, it’s important that records are of consistently high quality. While creating a metadata guidelines document was an important step in assuring higher-quality metadata, the volume of MODS records made it impossible to evaluate metadata quality without some form of automated quality assessment. After considering several possible tools, an XML Schema with embedded Schematron rules was ultimately chosen for its customizability and capabilities. The two tools complement each other well: XML Schemas provide a concise method of dictating the structure of XML documents and Schematron adds more robust capabilities for writing detailed rules and checking the content of XML elements and attributes. By adding the use of this Schema to our metadata creation workflow, we’re able to catch and correct errors before metadata is entered into the repository.
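MSU embeds Schematron rules inside an XML Schema; the sketch below emulates one such content rule in Python, to show what Schematron adds beyond structural validation. The rule itself (dateIssued must be a four-digit year) is an invented example, not an actual MSU guideline:

```python
import re
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"

def check_date_issued(mods_xml):
    """Return a list of Schematron-style assertion failures for one record."""
    root = ET.fromstring(mods_xml)
    failures = []
    for el in root.iter(MODS + "dateIssued"):
        # A schema can say dateIssued must exist; only a content rule can
        # say what its text must look like.
        if not re.fullmatch(r"\d{4}", el.text or ""):
            failures.append(f"dateIssued '{el.text}' is not a 4-digit year")
    return failures

record = """<mods xmlns="http://www.loc.gov/mods/v3">
  <originInfo><dateIssued>19th century</dateIssued></originInfo>
</mods>"""

print(check_date_issued(record))
```

Running such checks before ingest is what lets errors be caught and corrected before metadata enters the repository.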

      Are we still working on this? A meta-retrospective of a digital repository migration in the form of a classic Greek Tragedy (in extreme violation of Aristotelian Unity of Time) / Code4Lib Journal

      In this paper we present a retrospective of a 2.5 year project to migrate a major digital repository system from one open source software platform to another. After more than a decade on DSpace, Oregon State University’s institutional repository was in dire need of a variety of new functionalities. For reasons described in the paper, we deemed it appropriate to migrate our repository to a Samvera platform. The project faced many of the challenges one would expect (slipping deadlines, messy metadata) and many that one might hope never to experience (exceptional amounts of turnover and uncertainty in personnel, software, and community). We talk through our experiences working through the three major phases of this project, using the structure of the Greek Tragedy as a way to reflect (with Stasimon) on these three phases (Episode). We then conclude the paper with the Exodus, wherein we speak at a high level of the lessons learned in the project including Patience, Process, and Perseverance, and why these are key to technical projects broadly. We hope our migration story will be helpful to developers and repository managers as a map of development hurdles and an aspiration of success.

      Spinning Communication to Get People Excited About Technological Change / Code4Lib Journal

      Many organizations struggle with technological change. Often, the challenges faced are due to fear of change from stakeholders within the institution. Users grow accustomed to certain user interfaces, to processes associated with a specific system, and they can be frustrated when they have to revisit how they interact with a system, especially one that they use on a daily basis. This article will discuss how to acknowledge the fears associated with technological change and will suggest communication tactics and strategies to ease transitions. Specific scenarios and examples from the author’s experiences will be included.

      Machine Learning and the Library or: How I Learned to Stop Worrying and Love My Robot Overlords / Code4Lib Journal

Machine learning algorithms and technologies are becoming a regular part of daily life - including life in the libraries. Through this article, I hope:

• to introduce the reader to the basic terminology and concepts of machine learning;
• to make the reader consider the potential ethical and privacy issues that libraries will face as machine learning permeates society; and
• to demonstrate hypothetical possibilities for applying machine learning to circulation and collections data using TensorFlow/Keras and open datasets.

Through these goals, it is my hope that this article will inspire a larger, ongoing conversation about the utility and dangers of machine learning in the library (and, concurrently, society as a whole). In addition, the tripartite division of the article is meant to make the material accessible to readers with different levels of technical proficiency. In approaching the first two goals, the discussion is focused on high-level terms and concepts, and it includes specific public cases of machine learning (ab)use that are of broad interest. For the third goal, the discussion becomes more technical and is geared towards those interested in exploring practical machine learning applications in the library.
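The article's hands-on section uses TensorFlow/Keras; as a self-contained stand-in, here is the same idea at toy scale: a nearest-centroid classifier flagging "high-demand" titles from invented circulation features (monthly checkouts, holds). The data and labels are illustrative only:

```python
import math

# Invented training data: (monthly checkouts, holds) per title
LOW = [(2, 0), (1, 1), (3, 2)]        # labelled low demand
HIGH = [(30, 5), (25, 8), (40, 12)]   # labelled high demand

def centroid(points):
    """Mean point of a list of 2-D feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def classify(item):
    """Assign a title to whichever class centroid it sits closer to."""
    d_low = math.dist(item, centroid(LOW))
    d_high = math.dist(item, centroid(HIGH))
    return "high-demand" if d_high < d_low else "low-demand"

print(classify((28, 6)))  # high-demand
print(classify((3, 1)))   # low-demand
```

A Keras model applies the same train-then-predict pattern to far richer features, which is where the privacy questions the article raises become pressing.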

      Assessing the Potential Use of High Efficiency Video Coding (HEVC) and High Efficiency Image File Format (HEIF) in Archival Still Images / Code4Lib Journal

      Both HEVC (ISO/IEC 23008–2) video compression and the HEIF (ISO/IEC 23008-12) wrapper format are relatively new and evolving standards. Though attention has been given to their recent adoption as a JPEG replacement for more efficient local still image use on consumer electronic devices, the standards are written to encompass far broader potential application. This study examines current HEVC and HEIF tools, and the standards’ possible value in the context of digital still image archiving in cultural heritage repositories.

      The Tools We Don’t Have: Future and Current Inventory Management in a Room Reservation System / Code4Lib Journal

Fondren Library at Rice University has numerous study rooms which are very popular with students. Study rooms and equipment have future inventory needs that require a visual calendar for reservations. Traditionally, libraries manage reservations through a booking module in an Integrated Library System (ILS), but most, if not all, booking modules lack a visual calendar that allows patrons to pick out a place and time to create a reservation. The IT department at Fondren Library overcame this limitation by modifying the open source Booked Scheduling software so that it did all of the front-end work for the ILS, while still allowing the ILS to manage the use of the rooms.

      WMS, APIs and LibGuides: Building a Better Database A-Z List / Code4Lib Journal

      At the American University of Sharjah, our Databases by title and by subject pages are the 3rd and 4th most visited pages on our website. When we changed our ILS from Millennium to OCLC’s WorldShare Management Services (WMS), our previous automations which kept our Databases A-Z pages up-to-date were no longer usable and needed to be replaced. Using APIs, a Perl script, and LibGuides’ database management interface, we developed a workflow that pulls database metadata from WMS Collection Manager into a clean public-facing A-Z list. This article will discuss the details of how this process works, the advantages it provides, and the continuing issues we are facing.
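The AUS workflow pulls database metadata from WMS Collection Manager via API with a Perl script; the sketch below (in Python, with invented record shapes rather than the real WMS response format) shows only the middle step of turning already-fetched metadata into a grouped A-Z list:

```python
from itertools import groupby

# Invented stand-ins for records returned by a collection-metadata API
databases = [
    {"title": "JSTOR", "url": "https://www.jstor.org"},
    {"title": "arXiv", "url": "https://arxiv.org"},
    {"title": "ACM Digital Library", "url": "https://dl.acm.org"},
]

def a_to_z(records):
    """Group database titles by first letter, case-insensitively."""
    records = sorted(records, key=lambda r: r["title"].casefold())
    return {letter.upper(): [r["title"] for r in group]
            for letter, group in groupby(
                records, key=lambda r: r["title"][0].casefold())}

print(a_to_z(databases))  # {'A': ['ACM Digital Library', 'arXiv'], 'J': ['JSTOR']}
```

Regenerating the list from the system of record on a schedule is what keeps an A-Z page up to date without manual edits.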

      The Blockchain Trilemma / David Rosenthal

      In The economics of blockchains Markus K Brunnermeier and Joseph Abadi (BA) write:
      much of the innovation in blockchain technology has been aimed at wresting power from centralised authorities or monopolies. Unfortunately, the blockchain community’s utopian vision of a decentralised world is not without substantial costs. In recent research, we point out a ‘blockchain trilemma’ – it is impossible for any ledger to fully satisfy the three properties shown in Figure 1 simultaneously (Abadi and Brunnermeier 2018). In particular, decentralisation has three main costs: waste of resources, scalability problems, and network externality inefficiencies.
      Below the fold, some commentary.

      BA's conclusion (reformatted):
      The blockchain trilemma highlights the key economic trade-offs in designing a ledger. Traditional ledgers, managed by a single entity, forgo the desired feature of decentralisation. A centralised ledger writer is incentivised to report honestly because he does not wish to jeopardise his future profits and franchise value. Blockchains can eliminate the rents extracted by centralised intermediaries through two types of competition: free entry of writers and fork competition. Decentralisation comes at the cost of efficiency, however.
      • Free entry completely erodes writers' future profits and franchise values. The ledger's correctness must rely on the purely static incentives provided by proof of work.
      • Fork competition facilitates competition between ledgers but can lead to instability and produce too many ledgers.
      Furthermore, the ideal of decentralisation may be unattainable when substantial legal enforcement by the government is necessary ... for the blockchain to function properly.
      In their paper Blockchain Economics Abadi and Brunnermeier write:
      Finally, we informally make the important point that while blockchains guarantee transfers of ownership, some sort of enforcement is required to ensure transfers of possession.
      When banks were fraudulently transferring ownership by foreclosing on mortgages by robosigning, they had to get the assistance of sheriffs to obtain possession. The problem with MERS, the record-keeping system that enabled the frauds, was not that it was centralized, but that it was subject to "garbage in, garbage out".

      The first thing to note is that, as Eric Budish has shown in The Economic Limits Of Bitcoin And The Blockchain, the high cost of adding a block to a chain is not an unfortunate side-effect, it is essential to maintaining the correctness of the chain, and limits the value of the largest transaction it is safe to allow into the chain. This supports BA's analysis that if the system is decentralized and correct it must be inefficient.
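Budish's limit can be made concrete with toy arithmetic. All figures below are invented for illustration, not current Bitcoin economics:

```python
block_reward_usd = 100_000   # assumed: what honest miners earn per block
blocks_to_confirm = 6        # confirmations the victim waits for

# A double-spend attacker must out-mine the network for roughly the
# confirmation window, so the attack costs on the order of the block
# rewards over that window.
attack_cost = block_reward_usd * blocks_to_confirm

# A transaction worth more than the attack cost makes the attack
# profitable, so the safe transaction value is bounded by the recurring
# cost of mining — which is why the cost cannot be engineered away.
max_safe_tx = attack_cost
print(max_safe_tx)  # 600000
```

The point of the sketch is the direction of the inequality: cheaper mining means a lower ceiling on safe transaction value, so efficiency and security pull against each other.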

      Second, BA's analysis is of Platonic ideal blockchains, for which the assumption holds that miners are abundant and independent, and decentralization is an achievable goal:
      • Miners of successful blockchains in the real world, such as Bitcoin's and Ethereum's, are not independent; they collude in a few large mining pools to achieve reasonably smooth income. The two largest Bitcoin pools are apparently both controlled by Bitmain. They only have to collude with one other pool to mount a 51% attack.
      • Miners of less successful blockchains may be independent but they are not abundant. The availability of mining-as-a-service means 51% attacks on these chains are becoming endemic.
      In practice decentralized and secure is not an achievable goal. The security of a successful blockchain rests on the reluctance of the dominant pools to kill the goose that lays the golden eggs. Or, to put it another way, far from being trustless, users of a successful blockchain must trust the humans controlling the dominant pools. Not to mention the core developers.
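A toy calculation (pool shares invented, not a live snapshot) makes the collusion point concrete: an operator controlling a large combined share needs very few partners to pass 51%:

```python
# Hypothetical hashrate shares, echoing the Bitmain situation described above
shares = {"Bitmain pools (combined)": 0.42, "Pool C": 0.12,
          "Pool D": 0.10, "others": 0.36}

# Greedily assemble the smallest coalition of named pools exceeding 51%
total, coalition = 0.0, []
for pool, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    if pool == "others":
        continue  # the long tail of small independent miners
    coalition.append(pool)
    total += share
    if total > 0.51:
        break

print(coalition, round(total, 2))  # ['Bitmain pools (combined)', 'Pool C'] 0.54
```

With these (illustrative) shares, a single additional pool suffices, which is the "collude with one other pool" scenario in the text.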

      Mine the Gap: Bitcoin and the Maintenance of Trustlessness by Gili Vidan and Vili Lehdonvirta (VL) examines how the belief in the trustless nature of blockchains is maintained despite the fact that they aren't decentralized or, in practice, immutable:
      The division between tangible, knowable core code, considered as internal to the Bitcoin ecosystem, and the disruptive actions of actors outside of what constitutes the core allows for the network to maintain its narrative of algorithmic decentralization when facing contradictory evidence. Why do users continue to trust in code, especially in this particular code, in the face of such breakdowns?
      From VL's abstract:
      In contrast to the discourse, we find that power is concentrated to critical sites and individuals who manage the system through ad hoc negotiations, and who users must therefore implicitly trust—a contrast we call Bitcoin’s “promissory gap.” But even in the face of such contradictions between premise and reality, the discourse is maintained. We identify four authorizing strategies used in this work
      The first of the strategies is:
the collapse of users and their representations on the network into the aggregation of CPUs that power the network. This ambiguity with regards to the identity of the Bitcoin community — individual human actors or their dedicated machines — allows the network to be portrayed as a self-regulating system not susceptible to human foibles, and simultaneously as an enabler of direct action. Under the edict of ‘one-CPU-one-vote,’ any incident within the Bitcoin network is at once the expected result of running the protocol and the enforcement of an expressed consensus of its users. The Bitcoin protocol reimagines its constituency as a mass of CPUs.
      The second is market liberalism ideology:
      the assumption of rational, self-interested agents. When CPUs as stand-ins for a mass of individual users appeared to have been accumulated at the hands of a single actor such as GHash.IO, both the pool operators and the core developers issued statements to reassure users that it would not be in the pool’s rational self-interest to undermine the network or to go over the 51 percent threshold. ... Developers did not hesitate to predict what the pools would or would not do based on rational choice, even though the original developer had failed to predict the (rational) emergence of pools in the first place.
      The third is trust in experts(!):
      The belief that cryptographic know-how should grant particular actors in the network governing power enables the simultaneous elevation of Nakamoto’s paper as the ultimate authority of keeping the network within the bounds of its intended purpose, and the acceptance of the Core Development Team as the legitimate body to carry out updates of the code. A commitment to technocratic order first enrolls users in the network through the promise of a decentralized system that ensures the need to trust no-one, and then, when the system’s unsettled and unpredictable nature becomes visible, the technocratic order privileges certain actors as legitimate holders of centralized power until the infrastructure can be stabilized again. Collapsing the difference between users and CPUs further facilitates this technocratic structure, because if the Bitcoin network is composed of machines, then who better to rule it than engineers.
      The final strategy is "its just a bug":
      The fourth and final discursive strategy is casting problems as temporary bugs that will not be present in the final, ideal version of the code. Instead of critically reflecting on the shortcomings of the ‘trust in code’ narrative, participants are asked to ignore contradictions as limitations of a particular implementation of the code. Sites of centralized power are cast not as inherent consequences of the architecture, as features of it, but rather as temporary shortcomings to be overcome in later iterations of the code, as bugs to be patched. Bitcoin’s and blockchain’s initial appeal comes from the promise of a one-time buy-in into infallible code, which will be from that moment on fixed, knowable, and autonomous. Once issues of centralization emerge, however, the code then becomes a malleable experiment, subject to iterations and improvements to address the temporary aberration. If anything is maintained as fixed, it is the belief that trustlessness can be engineered, the belief that Nakamoto’s elegant vision is almost within reach through minor technical adjustments. Participants are asked to trust if not this version of the code, then the next one, in perpetuity. It is of no consequence whether solutions are in sight today, because the peer production model will keep iterating until they are.
Thus as problems appear, such as the uselessness of Bitcoin for actual transactions, fixes such as the Lightning Network are layered on top of the inadequate underlying technology. And when problems are found in the Lightning Network, such as the difficulty of routing leading to the emergence of centralized "banks", another layer will be created. Layering can be an independent activity, whereas fixing the underlying technology involves obtaining consensus from its governance structure. The history of Bitcoin's blocksize shows how difficult this can be.

      Program for Samvera Connect 2018 available / Samvera

      We are pleased to announce that the full list of workshops, presentations and panels for this year’s Samvera conference is now available at the Samvera wiki.  Samvera Connect 2018 will take place from Tuesday October 9 – Friday October 12 and is hosted by the University of Utah, J. Willard Marriott Library.

      Registration and hotel booking can be made through the Samvera Connect 2018 website which also provides helpful information about transportation, dining, family resources, and additional information about  the University of Utah, Salt Lake City, and Utah to help plan your visit.

      If you can only make it to one Samvera meeting in 2018, this is the one to attend!

      The post Program for Samvera Connect 2018 available appeared first on Samvera.

      Jobs in Information Technology: August 8, 2018 / LITA

      New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

      New This Week

      Drexel University Libraries, Manager, Library Integrated Technology Systems, Philadelphia, PA

      McConnell Library – Radford University, Systems Librarian, Radford, VA

      City of Lubbock, Library Director, Lubbock, TX

      Santa Clara County Central Fire Protection District, Information Technology Officer, Los Gatos, CA

      Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

      Editorial: Update to Lead Pipe submission guidelines / In the Library, With the Lead Pipe

      In Brief: Announcing an update to In the Library with the Lead Pipe’s submission guidelines.

      We have received feedback about our submission process and have reexamined our framework questions. As a result, the Lead Pipe Editorial Board has revised the set of framework questions to better assist author(s) in developing their proposals and provide the board with a deeper understanding of the author’s proposal. We have edited out some parts of the framework questions that were superfluous and expanded on others.

      There are two major changes to the process. The first is the inclusion of a two-part question that will result in an author’s positionality statement. Positionality is a term from the field of sociology in which the researcher reflects on their position in the world and how it shapes their research. This may be influenced by the researcher’s race, gender, geographic location, religious beliefs, etc.

      Second, in relation to question four, in order to engage with a wide audience, authors should make an effort to minimize the use of jargon. More information will be included in our style guide in the coming weeks.

      Here are the updated framework questions:

1. Briefly explain what specific event or experience led you to pursue this topic. What motivates you? How does your positionality or identity inform your relationship to this topic?

      2. What are the 3 most important things to consider about your topic and why are they the most critical?

      3. What problem is your article addressing and what actions do you want readers to take after reading it? What do you want your readers to remember after they finish reading your article?

      4. How can Lead Pipe help you connect with your intended audience for this article? How is your topic meaningful to someone not in that target audience?

5. In what ways does your article build upon and/or contribute to the existing literature? Provide 3 sources. Depending upon your topic, these citations may be for research on which your article is based; examples of conversations to which you are adding; sources that reinforce issues you’re raising in your article; articles to which yours is responding; etc.

      6. If your article involves research on human subjects, have you secured proper permissions and approval to report on this data? Please indicate if your article includes images that require permissions to publish.

      We hope these changes will make the submission process clearer and easier. If you have feedback or questions about these changes, please feel free to email the Board at any time at itlwtlp[at]

      How do you get to Carnegie Hall? Use a MAP! / Digital Library Federation

      Photo Credit: Michael Tomczak

      Lisa Barrier, Asset Cataloger (right) and Kathryn Gronsbell, Digital Collections Manager (left) work together in the Carnegie Hall Archives. Kathryn is also an active member of the DLF Metadata Assessment Working Group, and co-leads its Website subgroup.

      In celebration of Preservation Week 2018, Carnegie Hall Archives released the initial version of its Digital Collections Metadata Application Profile. The Metadata Application Profile (MAP), co-authored by Lisa Barrier (Asset Cataloger) and Kathryn Gronsbell (Digital Collections Manager), describes metadata elements for item-level asset records within the Carnegie Hall Digital Collections. Our goal for developing the initial MAP was to begin to assess our metadata maturity. We recognized the opportunity to document and share the Carnegie Hall (CH) metadata standards, cataloging procedures, and controlled vocabularies. We want to share the relatively streamlined process for generating and publishing a MAP to encourage others to consider this path for self-assessment of collections metadata.

      We expected this level of metadata wrangling and organization to be a daunting task. However, we realized the benefits of utilizing the resources in the DLF AIG Metadata Application Profile Clearinghouse Project, which is part of the larger Assessment Toolkit. We opted for a simple MAP profile format of a “Quick Look” summary and a separate, detailed elements table. We created the Quick Look, a list of metadata elements and their obligations (required, mandatory if applicable, and optional). We then compiled element descriptions and pulled sample data from the CH Digital Collections to create entries in the Elements table. The samples helped us identify controlled vocabularies, free-text fields, and different sources and structures. We described input guidelines to clarify how to populate each element field. We referenced the Sample Metrics for Common Metadata Quality Criteria and followed instructions for building “Your Application Profile” in the Framework section of the Toolkit.

      This metadata gathering and documentation process took approximately eight hours, over three days. Internal documents and implementations were referenced to create the elements table and Quick Look:

      • staff training wiki entry for uploading and tagging material;
      • cataloging requirements;
      • Carnegie Hall/Empire State Digital Network (ESDN) Mapping exercise supported by the Metropolitan Library Council (METRO) to prepare for contributing to the Digital Public Library of America (DPLA);
      • local taxonomy managed in the CH digital asset management system; and
      • integrated performance history data, which has its own authority control (version available at

      We published the MAP via our institutional GitHub account (, using the GitHub Pages option with a default Jekyll template. GitHub Pages is a feature that semi-automatically turns repository content into streamlined websites, useful for presentation and ongoing maintenance. We translated the elements table from a draft spreadsheet to a text document using the lightweight markup language Markdown. Markdown text displays in GitHub documents and GitHub Pages as formatted text. Because of the lengthy elements table, we chose to create a MAP overview homepage to link to the elements and to the Quick Look. We borrowed language and structure from Clearinghouse examples and other open documentation projects to draft a brief overview, feedback options, and acknowledgements. The publishing, formatting, summation, and copy-editing process took roughly four hours, over two days.
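The spreadsheet-to-Markdown step is mechanical enough to script; this sketch (column names are invented stand-ins for the MAP's real element fields) renders a tab-separated export as a Markdown table:

```python
import csv
import io

# A hypothetical tab-separated export from the draft elements spreadsheet
tsv = """element\tobligation\tinput guideline
Title\trequired\tFree text; transcribe from the asset
Date\tmandatory if applicable\tEDTF format
"""

rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
header, body = rows[0], rows[1:]

# Markdown tables: header row, separator row, then one row per element
lines = ["| " + " | ".join(header) + " |",
         "|" + "---|" * len(header)]
lines += ["| " + " | ".join(r) + " |" for r in body]
print("\n".join(lines))
```

Markdown source like this renders as a formatted table both in the GitHub repository view and on the GitHub Pages site.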

      Our initial MAP release demonstrates significant progress in metadata documentation through a small investment of time. Two staff members organized, drafted, and implemented the CH Digital Collections MAP in about 12 hours, over 2 weeks. Through the process, we prioritized metadata fields for evaluation, revision, or removal. We used the MAP to update our internal Digital Collections wiki, which guides CH staff members how to add and edit system metadata.

      While some elements map to Dublin Core and to DPLA properties, others are CH specific. These elements require further research for normalization and interoperability. Future additions to the MAP will include mapping to DPLA properties, Dublin Core, and other appropriate metadata schema.

      We recently contributed the CH MAP to the DLF AIG Metadata Application Profile Clearinghouse Project. We hope that others can borrow formatting or content from our profile, as well as provide constructive feedback so we can continue to correct, clarify, and improve the site.

      The post How do you get to Carnegie Hall? Use a MAP! appeared first on DLF.

      What is the Fourth Industrial Revolution? / Lucidworks

“Industry 4.0” and “the Fourth Industrial Revolution” are the new buzzwords for the use of advanced computing, sensors, simulation, and additive techniques in manufacturing. They are largely synonymous with digital manufacturing and smart manufacturing. These techniques are supposed to provide greater customization, as well as faster design modification and personalization.

      As opposed to Industry 3.0, which used computing and automation, Industry 4.0 adds intelligence and rapid prototyping along with decentralized decision making. This not only involves new techniques like additive manufacturing but technologies like 3D scanners as well as decision support and data management technologies.

      Industry 4.0 evolves to instrumented “smart factories,” which can detect faults during the manufacturing process as well as adjust workspace lighting conditions based on the activity below. These capabilities are based on sensor networks and are commonly referred to as the “Internet of Things” (IoT).


Industry 4.0, like IoT, has various challenges and sweet spots. For products that change at a relatively fast pace, being able to customize, personalize, and rapidly prototype changes is essential. However, much of what is produced in the world is already relatively modular and doesn’t change or get replaced often enough to call for these advanced capabilities.

Manufacturing as a whole has to deal with supply chain complexities. Industry 4.0 may in some cases be limited by how well component manufacturers down the chain can adapt. Moreover, the costs of measuring quality and defects while rapid prototyping and changes take place across a multinational supply chain may outweigh the benefits compared to more stable “Industry 3.0” practices.


      Industry 4.0 answers some of these challenges with technologies like “predictive quality”. By using sensor and other data, AI and analytics can detect sources of scrap as well as defects or lower quality output. In today’s manufacturing, these predictive quality tools have to be networked across the supply chain and include data from contract manufacturing organizations (such as Foxconn) in order to be effective.

      Industry 4.0, smart factories, and digital manufacturing can be seen as the logical next steps as computing and manufacturing technology have evolved. Industry 4.0 basically extends current techniques by:

      Using computer aided design (CAD) technology then simulating stress testing
      Deploying sensor networks then adding AI to analyze the data
      Using additive and automated manufacturing technologies, making incremental changes in the physical world then re-digitizing them
      Using data management and communication technologies to distribute decision making and manage quality across the supply chain

      As this fourth industrial revolution takes hold, companies of all industries must take advantage of the capabilities outlined above to satisfy customers and stay ahead of the competition.

*Header image by Christoph Roser.

      The post What is the Fourth Industrial Revolution? appeared first on Lucidworks.

      Registration Open for VIVO Camp in New York City / DuraSpace News

      From Violeta Ilik, VIVO Leadership Group, on behalf of the VIVO Camp instructors

VIVO Camp is a multi-day training event designed specifically for new and prospective users. Camp will be held October 18-20, 2018 in the Butler Library at Columbia University. Over two and a half days, VIVO Camp will start with an introduction to VIVO leading to a comprehensive overview by exploring these topics:

      • VIVO features
      • Examples and demos of VIVO including customizations
      • Representing scholarship
      • Loading, displaying and using VIVO data
      • Introduction to the ontologies
      • Managing a VIVO project
      • Engaging communities
      • Using VIVO resources

      Participants can expect to gain a broad understanding of all aspects of implementing VIVO, including scoping and planning, development and data ingest, user engagement and customizations. This event will be an opportunity to meet and consult with key VIVO community members and to benefit from the experience of many successful VIVO implementations. The curriculum will include multiple opportunities to address current challenges. Participants should bring specific questions to VIVO Camp for discussion.

VIVO Camp instructors bring years of experience holding VIVO events, managing VIVO implementations, and participating in the VIVO community.

      Violeta Ilik, Head of Digital Collections & Preservation Systems, Columbia University
      Huda Khan, VIVO Developer, Cornell University
      Benjamin Gross, Clarivate
      Mike Conlon, VIVO Project Director, University of Florida

      Join us in October to learn about VIVO and consult with this group of VIVO experts. Registration is limited. Please register to reserve your place at VIVO Camp. We will close registration when the event has filled.

      See you in New York!


The post Registration Open for VIVO Camp in New York City appeared first on DuraSpace News.

      Help us find the world’s electoral boundaries! / Open Knowledge Foundation

      mySociety and Open Knowledge International are looking for the digital files that hold electoral boundaries, for every country in the world — and you can help. 

      Yeah, we know — never let it be said we don’t know how to party.

      But seriously, there’s a very good reason for this request. When people make online tools to help citizens contact their local politicians, they need to be able to match users to the right representatives.

      So head on over to the Every Boundary survey and see how you can help — or read on for a bit more detail.

      Image credit: Sam Poullain

      Data for tools that empower citizens

If you’ve used mySociety’s site TheyWorkForYou — or any of the other parliamentary monitoring sites we’ve helped others to run around the world — you’ll have seen this matching in action. Electoral boundary data is also integral to campaigning and political accountability, from Surfers against Sewage’s ‘Plastic Free Parliament’ campaign, to Call your Rep in the US.

      These sites all work on the precept that while people may not know the names of all their representatives at every level — well, do you? — people do tend to know their own postcode or equivalent. Since postcodes fall within boundaries, once both those pieces of information are known, it’s simple to present the user with their correct constituency or representative.
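The postcode-to-constituency matching described above ultimately reduces to a point-in-polygon test against the boundary files. Here is a minimal sketch with made-up boundaries and a textbook ray-casting test (real tools use geospatial libraries and projected coordinates, so treat this purely as an illustration of the idea):

```python
def point_in_polygon(point, polygon):
    """Return True if the (x, y) point falls inside the polygon (ray casting)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the horizontal ray from the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Toy boundaries: two square "constituencies" stacked on top of each other.
boundaries = {
    "North Ward": [(0, 1), (2, 1), (2, 2), (0, 2)],
    "South Ward": [(0, 0), (2, 0), (2, 1), (0, 1)],
}

def constituency_for(point):
    """Return the name of the first boundary containing the point, if any."""
    for name, polygon in boundaries.items():
        if point_in_polygon(point, polygon):
            return name
    return None

print(constituency_for((1.0, 1.5)))  # North Ward
print(constituency_for((1.0, 0.5)))  # South Ward
```

In a real deployment the user's postcode is first geocoded to a point, and the boundary polygons come from the very files this survey is collecting.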

      So the boundaries of electoral districts are an essential piece of the data needed for such online tools.  As part of mySociety’s commitment to the Democratic Commons project, we’d like to be able to provide a single place where anyone planning to run a politician-contacting site can find these boundary files easily.

      And here’s why we need you

      Electoral boundaries are the lines that demarcate where constituencies begin and end. In the old days, they’d have been painstakingly plotted on a paper map, possibly accessible to the common citizen only by appointment.

      These days, they tend to be available as digital files, available via the web. Big step forward, right?

      But, as with every other type of political data, the story is not quite so simple.

      There’s a great variety of organisations responsible for maintaining electoral boundary files across different countries, and as a result, there’s little standardisation in where and how they are published.

      How you can help

      We need the boundary files for 231 countries (or as we more accurately — but less intuitively — refer to them, ‘places’), and for each place we need the boundaries for constituencies at national, regional and city levels. So there’s plenty to collect.

      As we so often realise when running this sort of project, it’s far easier for many people to find a few files each than it would be for our small team to try to track them all down. And that, of course, is where you come in.

      Whether you’ve got knowledge of your own country’s boundary files and where to find them online, or you’re willing to spend a bit of time searching around, we’d be so grateful for your help.

      Fortunately, there’s a tool we can use to help collect these files — and we didn’t even have to make it ourselves! The Open Data Survey, first created by Open Knowledge International to assess and display just how much governmental information around the world is freely available as open data, has gone on to aid many projects as they collect data for their own campaigns and research.

      Now we’ve used this same tool to provide a place where you can let us know where to find that electoral boundary data we need.

      Start here  — and please feel free to get in touch if anything isn’t quite clear, or you have any general questions. You might want to check the FAQs first though!

      Thanks for your help — it will go on to improve citizen empowerment and politician accountability throughout the world. And that is not something everyone can say they’ve done.

      FRBR as a Data Model / CrossRef

      (I've been railing against FRBR since it was first introduced. It still confuses me some. I put out these ideas for discussion. If you disagree, please add your thoughts to this post.)

I was recently speaking at a library conference in Oslo where I went through my criticisms of our cataloging models, and how they are not suited to the problems we need to solve today. I had my usual strong criticisms of FRBR and the IFLA LRM. However, when I finished speaking I was asked why I am so critical of those models, which means that I did not explain myself well. I am going to try again here, as clearly and succinctly as I can.

      Conflation of Conceptual Models with Data Models

      FRBR's main impact was that it provided a mental model of the bibliographic universe that reflects a conceptual view of the elements of descriptive cataloging. You will find nothing in FRBR that could not be found in standard library cataloging of the 1990's, which is when the FRBR model was developed. What FRBR adds to our understanding of bibliographic information is that it gives names and definitions to key concepts that had been implied but not fully articulated in library catalog data. If it had stopped there we would have had an interesting mental model that allows us to speak more precisely about catalogs and cataloging.

Unfortunately, the use of diagrams that appear to define actual data models and the listing of entities and their attributes have led the library world down the wrong path, that of reading FRBR as the definition of a physical data model. Compounding this, the LRM goes down that path even further by claiming to be a structural model of bibliographic data, which implies that it is the structure for library catalog data. I maintain that the FRBR conceptual model should not be assumed to also be a model for bibliographic data in a machine-readable form. The main reason for this has to do with the functionality that library catalogs currently provide (and what functions they may provide in the future). This is especially true in relation to what FRBR refers to as its Group 1 entities: work, expression, manifestation, and item.

      The model defined in the FRBR document presents an idealized view that does not reflect the functionality of bibliographic data in library catalogs nor likely system design. This is particularly obvious in the struggle to fit the reality of aggregate works into the Group 1 "structure," but it is true even for simple published resources. The remainder of this document attempts to explain the differences between the ideal and the real.

      The Catalog vs the Universe

      One of the unspoken assumptions in the FRBR document is that it poses its problems and solutions in the context of the larger bibliographic universe, not in terms of a library catalog. The idea of gathering all of the manifestations of an expression and all of the expressions of a work is not shown as contingent on the holdings of any particular library. Similarly, bibliographic relationships are presented as having an existence without addressing how those relationships would be handled when the related works are not available in a data set. This may be due to the fact that the FRBR working group was made up solely of representatives of large research libraries whose individual catalogs cover a significant swath of the bibliographic world. It may also have arisen from the fact that the FRBR working group was formed to address the exchange of data between national libraries, and thus was intended as a universal model. Note that no systems designers were involved in the development of FRBR to address issues that would come up in catalogs of various extents or types.

The questions asked and answered by the working group were therefore not of the nature of "how would this work in a catalog?" and were more of the type "what is the nature of bibliographic data?". The latter is a perfectly legitimate question for a study of the nature of bibliographic data, but that study cannot be assumed to answer the first question.


Although the F in FRBR stands for "functional," FRBR does little to address the functionality of the library catalog. The user tasks find, identify, select and obtain (and now explore, added in the LRM) are not explained in terms of how the data aids those tasks; the FRBR document only lists which data elements are essential to each task. Computer system design, including the design of data structures, needs to go at least a step further in its definition of functions, which means specifying not only which data elements are relevant, but the specific usage each data element is put to in an actual machine interaction with the user and services. A systems developer has to take into account precisely what needs to be done with the FRBR entities in all of the system functions, from input to search and display.

      (Note: I'm going to try to cover this better and to give examples in an upcoming post.)

      Analysis that is aimed at creating a bibliographic data format for a library catalog would take into account that providing user-facing information about work and expression is context-dependent based on the holdings of the individual library and on the needs of its users. It would also take into account the optional use of work and expression information in search and display, and possibly give alternate views to support different choices in catalog creation and deployment. Essentially, analysis for a catalog would take system functionality into account.

There are a number of facts about the nature of computer-based catalogs that have to be acknowledged: that users are no longer performing “find” in an alphabetical list of headings, but are performing keyword searches; that collocation based on work-ness is not a primary function of catalog displays; that a significant proportion of a bibliographic database consists of items with a single work-expression-manifestation grouping; and finally that there is an inconsistent application of work and expression information in today's data.

      In spite of nearly forty years of using library systems whose default search function is a single box in which users are asked to input query terms that will be searched as keywords taken from a combination of creator, title, and subject fields in the bibliographic record, the LRM doubles down on the status of textual headings as primary elements, aka: Nomen. Unfortunately it doesn't address the search function in any reasonable fashion, which is to say it doesn't give an indication of the role of Nomen in the find function. In fact, here is the sum total of what the LRM says about search:

      "To facilitate this task [find], the information system seeks to enable effective searching by offering appropriate search elements or functionality."

That's all. As I said in my talk at Oslo, this is up there with the jokes about bad corporate mission statements, like: "We endeavor to enhance future value through intelligent paradigm implementation." First, no information system seeks to enable ineffective searching. Yet the phrase "effective searching" is meaningless in itself; without a definition of what is effective, this is just a platitude. The same is true for "appropriate search elements": no one would suggest that a system should use inappropriate search elements, but defining appropriate search is not at all a simple task. In fact, I contend that one of the primary problems with today's library systems is that we specifically lack a definition of appropriate, effective search. This is rendered especially difficult because the data that we enter into our library systems is data that was designed for an entirely different technology: the physical card catalog, organized as headings in alphabetical order.

      One Record to Rule Them All

      Our actual experience regarding physical structures for bibliographic data should be sufficient proof that there is not one single solution. Although libraries today are consolidating around the MARC21 record format, primarily for economic reasons, there have been numerous physical formats in use that mostly adhere to the international standard of ISBD. In this same way, there can be multiple physical formats that adhere to the conceptual model expressed in the FRBR and LRM documents. We know this is the case by looking at the current bibliographic data, which includes varieties of MARC, ISBD, BIBFRAME, and others. Another option for surfacing information about works in catalogs could follow what OCLC seems to be developing, which is the creation of works through a clustering of single-entity records. In that model, a work is a cluster of expressions, and an expression is a cluster of manifestations. This model has the advantage that it does not require the cataloger to make decisions about work and expression statements before it is known if the resource will be the progenitor of a bibliographic family, or will stand alone. It also does not require the cataloger to have knowledge of the bibliographic universe beyond their own catalog.
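The clustering model attributed here to OCLC can be sketched roughly as follows. This is an illustration only, not OCLC's actual algorithm; the work and expression keys are simplified to author/title and language, and the records are invented:

```python
from collections import defaultdict

# Flat manifestation records, as a cataloger might create them without
# making any up-front work/expression decisions.
manifestations = [
    {"author": "Tolstoy", "title": "War and Peace", "lang": "rus", "year": 1869},
    {"author": "Tolstoy", "title": "War and Peace", "lang": "eng", "year": 1904},
    {"author": "Tolstoy", "title": "War and Peace", "lang": "eng", "year": 1968},
]

# work = cluster of expressions; expression = cluster of manifestations.
works = defaultdict(lambda: defaultdict(list))
for m in manifestations:
    work_key = (m["author"], m["title"])   # simplified work identity
    expr_key = m["lang"]                   # simplified expression identity
    works[work_key][expr_key].append(m)

for work, expressions in works.items():
    print(work, {lang: len(ms) for lang, ms in expressions.items()})
# ('Tolstoy', 'War and Peace') {'rus': 1, 'eng': 2}
```

The point of the sketch is that the work and expression layers emerge from the records after the fact, so a single-entity record never has to anticipate whether it will become part of a bibliographic family.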

      The key element of all of these, and countless other, solutions is that they can be faithful to the mental model of FRBR while also being functional and efficient as systems. We should also expect that the systems solutions to this problem space will not stay the same over time, since technology is in constant evolution.


I have identified here two quite fundamental areas where FRBR's analysis differs from the needs of system development: 1) the difference between conceptual and physical models and 2) the difference between the (theoretical) bibliographic universe and the functional library catalog. Neither of these is a criticism of FRBR as such, but they do serve as warnings about a widely held assumption in the library world today: that of mistaking the FRBR entity model for a data and catalog design model. This is evident in the outcry over the design of the BIBFRAME model, which uses a two-tiered bibliographic view and not the three tiers of FRBR. The irony of that complaint is that at the very same time as those outcries, catalogers are using FRBR concepts (as embodied in RDA) while cataloging into the one-tiered data model of MARC, which includes all of the entities of FRBR in a single data record. While cataloging into MARC records may not be the best version of bibliographic data storage that we could come up with, we must acknowledge that there are many possible technology solutions that could allow the exercise of bibliographic control while making use of the concepts addressed in FRBR/LRM. Those solutions must be based at least as much on user needs in actual catalogs as on bibliographic theory.

      As a theory, FRBR posits an ideal bibliographic environment which is not the same as the one that is embodied in any library catalog. The diagrams in the FRBR and LRM documents show the structure of the mental model, but not library catalog data. Because the FRBR document does not address implementation of the model in a catalog, there is no test of how such a model does or does not reflect actual system design. The extrapolation from mental model to physical model is not provided in FRBR or the LRM, as neither addresses system functions and design, not even at a macro level.

      I have to wonder if FRBR/LRM shouldn't be considered a model for bibliography rather than library catalogs. Bibliography was once a common art in the world of letters but that has faded greatly over the last half century. Bibliography is not the same as catalog creation, but one could argue that libraries and librarians are the logical successors to the bibliographers of the past, and that a “universal bibliography” created under the auspices of libraries would provide an ideal context for the entries in the library catalog. This could allow users to view the offerings of a single library as a subset of a well-described world of resources, most of which can be accessed in other libraries and archives.


      IFLA. Functional Requirements for Bibliographic Records. 1998/2008
      IFLA. Library Reference Model. 2017

      Jobs in Information Technology: August 6, 2018 / LITA

      New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

      New This Week

      California State University, San Bernardino, Access Services Librarian, San Bernardino, CA

      Alder Graduate School of Education, Instructional and Research Librarian, Redwood City, CA

      Open Society Foundations, Data and Research Support Specialist, New York, NY

      The Graduate Center, City University of New York, Science Resources Librarian, New York, NY

      GWU, Gelman Library, Middle East and Africa Librarian, Washington, DC

      Library of Congress, Librarian (Senior Library Information Systems Specialist), Washington, DC

      University of Maryland, Baltimore, Data Services Librarian, Baltimore, MD

      City of Keller, Library Services Manager, Keller, TX

      Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

      MarcEdit 7/3: Build Links Updates (non-MARC format support) / Terry Reese

      I’ve had a few folks lately asking if I could expand MarcEdit’s build links tooling beyond MARC data.  The idea here is that:

1. a lot of library data isn’t in MARC
      2. there are a lot of different workflows where linked data starts to make its way into the process

So, with this update, I’ve tried to address both requests through changes to the Build Links tooling and to the API.


      In the Build Links tool, I’ve added the ability to directly interact with Excel (binary and xml) and tab delimited files.  The tool, when it encounters those formats, will prompt the user to identify the column and vocabulary that should be searched.  Data will then be saved back to a new file of the same format as the source. 


To enable the above functionality, I had to create new API endpoints that allow for individual term lookup and for data normalization.  These are not in the COM object yet, but will be once I determine if any other values need to be exposed.  Right now, these new API endpoints are exposed via the .NET assembly.

      You can see this work in action – here:

This stream includes a link to a YouTube video that demonstrates the process on Windows; however, the process works just as well on current MacOS builds.

      Let me know if you have questions.


      Using Tableau, SQL, and Search for Fast Data Visualizations / Lucidworks

This post builds on my previous blog post, where I introduced the Fusion SQL service. Since that post, we’ve been busy adding new optimizations and ensuring better integration with BI tools like Tableau, especially for larger datasets.

      In the interest of time, I’ll assume you’re familiar with the concepts I covered in the previous blog post. In this post, I highlight several of the interesting features we’ve added in Fusion 4.1.

      Querying Larger Datasets

The Solr community continues to push the limits in size and complexity of data sets that can be handled by Solr. At this year’s upcoming Activate Conference, a number of talks cover scaling Solr into the hundreds of millions to billions of documents. What’s more, Solr can compute facets and basic aggregations (min, max, sum, count, avg, and percentiles) over these large data sets. The Fusion SQL service leverages Solr’s impressive scalability to offer SQL-based analytics over datasets containing tens to hundreds of millions of rows, often in near real-time and without prior aggregation. To reach this scale with traditional BI platforms, you’re typically forced to pre-compute aggregations that can only satisfy a small set of predetermined queries.

Self-service analytics continues to rank high on the priority list of many CIOs, especially as organizations strive to be more “data-driven.” However, I can’t imagine CIOs letting business users point a tool like Tableau at even a modest-scale dataset (in Solr terms). Fusion SQL makes true self-service analytics a reality without having to resort to traditional data warehouse techniques.

To illustrate, let’s use the movielens 20M ratings dataset. I chose this dataset since it aligns with the one I used in the first blog post about Fusion SQL. To be clear, 20M rows is pretty small for Solr but, as we’ll see shortly, already stresses traditional SQL databases like MySQL. To index this dataset, use Fusion’s Parallel Bulk Loader with the Fusion Spark bootcamp lab.
(Note: you only need to run the lab to index the 20M ratings if you want to try out the queries in this blog yourself.)

You can set up a join between the movies_ml20m table (on id) and the ratings_ml20m table (on movie_id) in Tableau, as shown in the screenshot below.

      When the user loads 1000 rows, here’s what Tableau sends to the Fusion SQL service:

      SELECT 1 AS `number_of_records`,
      `movies_ml20m`.`genre` AS `genre`,
      `ratings_ml20m`.`id` AS `id__ratings_ml20m_`,
      `movies_ml20m`.`id` AS `id`,
      `ratings_ml20m`.`movie_id` AS `movie_id`,
      `ratings_ml20m`.`rating` AS `rating`,
      `ratings_ml20m`.`timestamp_tdt` AS `timestamp_tdt`,
      `movies_ml20m`.`title` AS `title`,
      `ratings_ml20m`.`user_id` AS `user_id`
      FROM `default`.`movies_ml20m` `movies_ml20m`
      JOIN `default`.`ratings_ml20m` `ratings_ml20m` ON (`movies_ml20m`.`id` = `ratings_ml20m`.`movie_id`)
      LIMIT 1000

Behind the scenes, Fusion SQL translates that into an optimized Solr query. Of course, doing joins natively in Solr is no small feat, given that Solr is first and foremost a search engine that depends on de-normalized data to perform at its best. Under the hood, Fusion SQL performs what’s known in the database world as a hash join between the ratings_ml20m and movies_ml20m collections using Solr’s streaming expression interface. On my laptop, this query takes about 2 seconds to return to Tableau, with the bulk of that time being the read of 1000 rows from Solr to Tableau.
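A hash join of the kind described can be sketched in a few lines: build a hash table on the smaller (movies) side, then probe it while streaming the larger (ratings) side. The rows here are invented stand-ins, not the actual movielens data:

```python
movies = [
    {"id": 1, "title": "Toy Story", "genre": "Comedy"},
    {"id": 2, "title": "Heat", "genre": "Action"},
]
ratings = [
    {"movie_id": 1, "user_id": 10, "rating": 4.0},
    {"movie_id": 1, "user_id": 11, "rating": 5.0},
    {"movie_id": 2, "user_id": 10, "rating": 3.5},
]

# Build phase: hash the smaller side on the join key.
movies_by_id = {m["id"]: m for m in movies}

# Probe phase: stream the larger side and look up each row's match.
joined = [
    {**movies_by_id[r["movie_id"]], **r}
    for r in ratings
    if r["movie_id"] in movies_by_id
]

print(len(joined))         # 3
print(joined[0]["title"])  # Toy Story
```

The build side fits in memory while the probe side streams, which is why the smaller table is the one that gets hashed.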

The same query against MySQL on my laptop takes ~4 seconds, so not a big difference; so far, so good. A quick table view of data is nice, but what we really want are aggregated metrics. This is where the Fusion SQL service really shines.

      In my previous blog, I showed an example of an aggregate then join query:

SELECT m.title as title, agg.aggCount as aggCount FROM movies m INNER JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as agg ON agg.movie_id = m.id ORDER BY aggCount DESC

Sadly, when I execute this against MySQL with an index built on the movie_id field in the ratings table, the query basically hangs (I gave up after waiting a minute). For 20M rows, Fusion SQL does it in 1.2 seconds! I also tried MySQL on an EC2 instance (r3.xlarge) and the query ran in 17 secs, which is still untenable for self-service analytics.

0: jdbc:hive2://localhost:8768/default> SELECT m.title as title, agg.aggCount as aggCount FROM movies_ml20m m INNER JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings_ml20m WHERE rating >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as agg ON agg.movie_id = m.id ORDER BY aggCount DESC;

Ok, well, maybe MySQL just doesn’t handle that aggregate-then-join pattern well. Let’s try another, more realistic query written by Tableau for the following chart:


      SELECT COUNT(1) AS `cnt_number_of_records_ok`,
      `ratings_ml20m`.`rating` AS `rating`
      FROM `default`.`movies_ml20m` `movies_ml20m`
      JOIN `default`.`ratings_ml20m` `ratings_ml20m` ON (`movies_ml20m`.`id` = `ratings_ml20m`.`movie_id`)
WHERE (`movies_ml20m`.`genre` = 'Comedy')
      GROUP BY `ratings_ml20m`.`rating`

Fusion SQL executes this query in ~2 secs.

Let’s give that a try with MySQL. First, it took >2 minutes just to get the unique values of the rating field, which Fusion SQL does almost instantly using facets:


Again, this is against a modest (by Solr’s standards) 20M row table with an index built for the rating column. Then, to draw the basic visualization, the query didn’t come back within several minutes (we were still waiting after 4 minutes; it finished around 5 minutes in).


The point here is not to pick on MySQL, as I’m sure a good DBA could configure it to handle these basic aggregations sent by Tableau, or another database like Postgres or SQL Server may be faster. But as the data size scales up, you’ll eventually need to set up a data warehouse with some pre-computed aggregations to answer questions of interest about your dataset. The bigger point is that Fusion SQL allows the business analyst to point a data visualization tool like Tableau at large datasets to generate powerful dashboards and reports driven by ad hoc queries, without using a data warehouse. In the age of big data, datasets are only getting bigger and more complex.

      How Does Fusion Optimize SQL Queries?

      A common SQL optimization pattern is to aggregate and then join so that the join works with a smaller set of aggregated rows instead of join then aggregate, which results in many more rows to join. It turns out that Solr’s facet engine is very well suited for aggregate then join style queries.

      For aggregate then join, we can use Solr facets to compute metrics for the join key bucket and then perform the join. We also leverage Solr’s rollup streaming expression support to rollup over different dimensions. Of course, this only works for equi-joins where you use the join to attach metadata to the metrics from other tables. Over time, Fusion SQL will add more optimizations around other types of joins.
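The aggregate-then-join rewrite can be illustrated with a small SQLite stand-in for the two tables. This is not Fusion's actual execution path, just the relational idea: both query shapes return the same answer, but in the first one the join only ever sees one pre-aggregated row per movie:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER, title TEXT);
    CREATE TABLE ratings (movie_id INTEGER, rating REAL);
    INSERT INTO movies VALUES (1, 'Toy Story'), (2, 'Heat');
    INSERT INTO ratings VALUES (1, 5.0), (1, 4.0), (2, 4.5), (2, 2.0);
""")

# Aggregate then join: the join operates on one row per movie.
agg_then_join = conn.execute("""
    SELECT m.title, agg.aggCount
    FROM movies m
    JOIN (SELECT movie_id, COUNT(*) AS aggCount
          FROM ratings WHERE rating >= 4 GROUP BY movie_id) agg
      ON agg.movie_id = m.id
    ORDER BY agg.aggCount DESC
""").fetchall()

# Join then aggregate: the join first materializes every matching rating row.
join_then_agg = conn.execute("""
    SELECT m.title, COUNT(*) AS aggCount
    FROM movies m JOIN ratings r ON r.movie_id = m.id
    WHERE r.rating >= 4
    GROUP BY m.title
    ORDER BY aggCount DESC
""").fetchall()

print(agg_then_join)  # [('Toy Story', 2), ('Heat', 1)]
print(join_then_agg)  # [('Toy Story', 2), ('Heat', 1)]
```

In Fusion's case the inner aggregation is what gets handed to Solr's facet engine, so the expensive per-row work never leaves Solr.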

      What About High-cardinality Fields?

If you’re familiar with Solr, then you probably already know that the distributed faceting engine can blaze through counting and basic metrics on buckets that have a low cardinality. But sometimes a SQL query really needs dimensions that result in a large number of buckets to facet over (high cardinality). For example, imagine a group-by over a field with a modest number of unique values (~100,000) that is then also grouped by a time dimension (week or day), and you can quickly get into a high-cardinality situation (100K * 365 days * N years = lots of dimensions).

      To deal with this situation, Fusion SQL tries to estimate the cardinality of fields in the group by clause and uses that to decide on the correct query strategy into Solr, either facet for low-cardinality or a map/reduce style streaming expression (rollup) for high cardinality. The key takeaway here is that you as the query writer don’t have to be as concerned about how to do this correctly with Solr streaming expressions. Fusion SQL handles the hard work of translating a SQL query into an optimized Solr query using the characteristics of the underlying data.
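The strategy selection described above might look roughly like this. This is a toy sketch: the threshold name mirrors the fusion.sql.bucket_size_limit.threshold setting the post mentions, the default value is invented for illustration, and the cardinality estimate is simplified far beyond what Fusion actually does:

```python
# Illustrative default; not Fusion's real default value.
BUCKET_SIZE_LIMIT_THRESHOLD = 1_000_000

def choose_strategy(field_cardinalities):
    """Estimate group-by bucket count as the product of per-field
    cardinality estimates, then pick a query strategy."""
    buckets = 1
    for c in field_cardinalities:
        buckets *= c
    return "facet" if buckets <= BUCKET_SIZE_LIMIT_THRESHOLD else "rollup"

# group by user_id (~138K unique values) -> low cardinality, use faceting
print(choose_strategy([138_000]))          # facet
# a 100K-value field crossed with day-of-year over 3 years -> map/reduce rollup
print(choose_strategy([100_000, 365, 3]))  # rollup
```

The point is that the query writer never sees this decision; the planner inspects the data and routes the query to facets or to a rollup streaming expression accordingly.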

      This raises the question of what constitutes high-cardinality. Let’s do a quick experiment on the ratings_ml20m table:

      select count(1) as cnt, user_id from ratings_ml20m
      group by user_id having count(1) > 1000 order by cnt desc limit 10

The query performs a count aggregation for each of the ~138K users in the ratings table. With faceting, this query executes in 1.8 secs on my laptop. When using the rollup streaming expression, the query takes over 40 seconds! So we’re still better off using faceting at this scale. Next, let’s add some more cardinality using the following aggregation over user_id and rating, which has >800K unique combinations:

      select count(1) as cnt, user_id, rating from ratings_ml20m
      group by user_id, rating order by cnt desc limit 10

      With faceting, this query takes 8 seconds and roughly a minute with rollup. The key takeaway here is the facet approach is much faster than rollup, even for nearly 1 million unique groups. However, depending on your data size and group by complexity you may reach a point where facet breaks down and you need to use rollup. You can configure the threshold where Fusion SQL uses rollup instead of facet using the fusion.sql.bucket_size_limit.threshold setting.

      Full Text Queries

      One of the nice features about having Solr as the backend for Fusion SQL is that we can perform full-text searches and sort by importance. In older versions of Fusion, we relied on pushing down a full subquery to Solr’s parallel SQL handler to perform a full-text query using the _query_ syntax. However, in Fusion 4.1, you can simply do:

      select title from movies where plot_txt_en = 'dogs'
      select title from movies where plot_txt_en IN ('dogs', 'cats')

      Fusion SQL consults the Solr schema API to know that plot_txt_en is an indexed text field and performs a full-text query instead of trying to do an exact match against the plot_txt_en field. Fusion SQL also exposes a UDF named _query_ where you can pass any valid Solr query through SQL to Solr, such as:

      select place_name,zip_code from zipcodes where _query_('{!geofilt sfield=geo_location pt=44.9609,-93.2642 d=50}')
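      The schema-driven dispatch described above can be sketched as follows. This is a toy illustration: the SCHEMA dictionary and to_solr_filter function are hypothetical stand-ins, not Fusion’s actual code, though the field metadata mirrors what Solr’s Schema API reports.

```python
# Toy sketch of choosing between a full-text query and an exact-match
# filter based on schema metadata; not Fusion SQL's real implementation.

# Field metadata as one might read it from Solr's Schema API
SCHEMA = {
    "plot_txt_en": {"indexed": True, "class": "solr.TextField"},
    "title": {"indexed": True, "class": "solr.StrField"},
}

def to_solr_filter(field, value):
    """Analyzed text fields get a full-text query; everything else
    gets an exact term filter."""
    if SCHEMA[field]["class"] == "solr.TextField":
        return f"{field}:({value})"
    return f'{field}:"{value}"'

print(to_solr_filter("plot_txt_en", "dogs"))  # plot_txt_en:(dogs)
print(to_solr_filter("title", "dogs"))        # title:"dogs"
```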

      Avoiding Table Scans

      If we can’t pushdown an optimized query into Solr, what happens? Spark automatically pushes down WHERE filters and field projections to the spark-solr library. However, if a query matches 10M docs in Solr, then Spark will stream them from Solr in order to execute the query. As you can imagine, this may be slow depending on how many Solr nodes you have. We’ve seen table scan rates of 1-2M docs per second per Solr node, so reading 10M docs in a 3-node cluster could take 3-5 secs at best (plus a hefty I/O spike between Solr and Spark). Of course, we’ve optimized this as best we can in spark-solr, but the key takeaway here is to avoid queries that need large table scans from Solr.

      One of the risks of pointing a self-service analytics tool at very large datasets is that users will craft a query that needs a large table scan, which can hog resources on your cluster. Fusion SQL has a configurable safeguard for this situation. By default, any query that requires scanning more than 2M rows will fail. That may be too small a threshold for larger clusters, so you can increase it using the fusion.sql.max_scan_rows configuration property.
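      A rough sketch of both the scan-time arithmetic and the safeguard, using the figures above (the function names and the default streaming rate are illustrative, not spark-solr internals):

```python
MAX_SCAN_ROWS = 2_000_000  # default guard, cf. fusion.sql.max_scan_rows

def check_scan(num_found, limit=MAX_SCAN_ROWS):
    """Refuse to stream a result set larger than the configured limit."""
    if num_found > limit:
        raise RuntimeError(
            f"query would scan {num_found} rows (limit {limit}); "
            "increase fusion.sql.max_scan_rows if this is intended")
    return num_found

def scan_seconds(num_docs, nodes, docs_per_sec_per_node=1_500_000):
    """Best-case streaming time at the observed 1-2M docs/sec/node rate."""
    return num_docs / (nodes * docs_per_sec_per_node)

print(round(scan_seconds(10_000_000, 3), 2))  # ~2.22 s for 10M docs on 3 nodes
```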


      In this post, I covered how Fusion SQL enables building rich visualizations using tools like Tableau on large datasets. By leveraging Solr’s facet engine and streaming expressions, you can perform SQL aggregations, ad hoc queries, and joins across millions of rows in Solr in near real-time. What’s more, scaling out Fusion horizontally to handle bigger data sets has never been easier or more cost-effective, especially when compared to traditional BI approaches. If you’re looking to offer self-service analytics as a capability for your organization, then I encourage you to download Fusion 4.1 today and give the SQL service a try.


      The post Using Tableau, SQL, and Search for Fast Data Visualizations appeared first on Lucidworks.

      The rise of Wikidata as a linked data source / HangingTogether

      [Wikidata: this image is in the public domain]
      As I analyze the responses to the OCLC Research 2018 International Linked Data Survey for Implementers, I’m looking out for significant differences with the responses to the previous, 2015 survey. One change that jumped out at me was the surge of using Wikidata as a linked data source consumed by linked data projects or services.

      Wikidata became the #5 ranked data source consumed by linked data projects/services described in the 2018 survey, compared to a #15 ranking in the 2015 survey. Here’s the comparison of the ranking of the Top 5 linked data sources between 2018 and 2015:

      Linked data source                            2018 Rank   2015 Rank
      (source name missing)                             1           3
      VIAF (Virtual International Authority File)       2           1
      DBpedia                                           3           2
      GeoNames                                          4           3
      Wikidata                                          5          15

      41% of the linked data projects/services described in 2018 reported using Wikidata as a source they consumed, versus just 9% of the projects/services described in 2015.

      The North Rhine-Westphalian Library Service Center in Germany advised others considering a project to consume linked data to “Check out if Wikidata covers your needs and contribute to it.” The Ignacio Larramendi Foundation in Spain advised those considering a project to publish linked data to “increase the presence of bibliographic data in Wikipedia and Wikidata.” Other respondents’ comments about Wikidata accompanied why they are greatly interested in tracking its developments:

      • “Wikidata is becoming more and more significant for cultural heritage institutions including our library.” (National Library of Finland)
      • “Wikidata [is] a potential authority hub.” (British Library)
      • “It seems to be a great place where we can share our data and use their data to enhance ours in ways we hadn’t envisioned before.” (Smithsonian)
      • “…all facts taken from Wikipedia stored in Wikidata turned to linked data is a tremendous achievement and we’re actively working together to link our data offering with theirs even closer.” (Springer Nature)

      Want to find out more about Wikidata from a library perspective? Check out the recording of the OCLC Research 12 June 2018 Works in Progress Webinar: Introduction to Wikidata for Librarians presented by Andrew Lih (author of The Wikipedia Revolution) and Robert Fernandez (Resources Development/eLearning Librarian, Prince George’s Community College and Wikimedia DC Chapter).

      August is peak season for advocacy / District Dispatch

      August is not just for vacations and summer reading programs—it’s high season for library advocacy. US representatives are on recess and back home in their districts to reconnect with their constituents, so now through Labor Day is the perfect time for library advocates to share the many ways we are transforming our communities.

      Invite your representative to your library to see in person how your library is meeting the needs of your community. The value of your library’s services may be crystal clear to you and the families, students, researchers, and other patrons you serve, but your elected leaders may not understand the value of your services unless you show them. Here are a few tips from librarians across the country for arranging visits with members of Congress.

      Know your legislator’s background and values

      During National Library Legislative Day in May, librarians in Rep. Tom Emmer’s (R-Minn.) district invited him to visit their facilities. As Jami Trenam, associate director of collection development of Great River Regional Library (GRRL) in St. Cloud, wrote, “Knowing Mr. Emmer is quite fiscally conservative and serves on the Financial Services Committee, we made sure to highlight programs and services that demonstrate the library’s stewardship of tax dollars.” For GRRL, that meant focusing on their state’s Institute of Museum and Library Services–funded interlibrary loan system and community partnerships on workforce development.

      Stay in touch with congressional staffers

      Rep. Charlie Crist’s (D-Fla.) staff members became especially interested in technology services after the American Library Association (ALA) Washington Office held a National Library Week event showcasing library makerspaces, and libraries in Crist’s district quickly followed up. Crist has visited three libraries in his district over the past three months, focusing on different services each time. As Rino Landa, maker studio coordinator at Clearwater Public Library System, wrote, “The best way for legislators to understand the value of our libraries and library staff is to see us in action.”

      Plan B: Visit your decision maker or their staff at their home office

      To cover a detailed list of policy priorities, including school and rural library issues, Ann Ewbank, director of Montana State University’s school library media preparation program, requested a one-on-one meeting with Sen. Jon Tester’s (D-Mont.) field staff in Bozeman. In addition to attending a listening session with Tester, “I chose to take the time to meet with my senator’s field office staff because I believe in the power of civic engagement,” Ewbank wrote, “and because I know that libraries change lives.”

      Build relationships with other library advocates

      When the New Jersey Library Association (NJLA) learned in early 2017 that Rep. Rodney Frelinghuysen (R-N.J.) was going to chair the powerful House Committee on Appropriations, staffers developed an advocacy plan to promote national library interests. Even though the NJLA was unsuccessful in getting a face-to-face meeting, they contributed to ALA’s successful campaign to save federal funding for libraries in the FY2018 budget cycle. “Don’t be discouraged if you are turned down,” writes NJLA Public Policy Committee Chair Eileen Palmer. “Use the opportunity [of sending an invitation] to convey your concern about library funding.”

      Most importantly, remember that advocacy is about building relationships, which takes a long-term commitment. Whether the short-term goal is to protect federal funding for library programs in FY2019 or to pass the Marrakesh Treaty, making the health of all America’s libraries a national priority requires your year-round advocacy.

      For guidance in setting up a visit with your member of Congress, or for the talking points on current legislative priorities, contact Shawnda Hines, assistant director of communications in ALA’s Washington Office.

      The post August is peak season for advocacy appeared first on District Dispatch.

      Fellow Reflection: Steven Booth / Digital Library Federation

      Steven Booth is currently the Audiovisual Archivist at the Barack Obama Presidential Library and attended the Digital Humanities Summer Institute and the DLFxDHSI unconference in Victoria, BC with support from a Cross-Pollinator Tuition Award. Learn more about Steven, or read on for his reflection on his experience.

      This past June I had the honor to attend the Digital Humanities Summer Institute (DHSI) as a recipient of the 2018 DLFxDHSI Cross-Pollinator Tuition Award. Held at the University of Victoria in British Columbia, DHSI provides opportunities for participants to study the scope of digital humanities through “intensive coursework, seminars, and lectures” that explore historical, theoretical and praxis-based perspectives. The myriad courses and workshops offered throughout the two-week sessions attract an international influx of faculty, graduate students, technologists, librarians, and, yes, even archivists.

      Ray Pun, Y Vy Truong, Steven Booth and Karen Ng at the DLFxDHSI unconference in Victoria, BC.

      Having no prior DH experience, I enrolled in “Making Choices About Your Data” (also known as #wrangledata) taught by Paige Morgan and Yvonne Lam. This foundations course was designed to guide my classmates and me through the process of understanding data – what it is, what it represents, what to do with it, and how to use it. Throughout the week, we were challenged yet encouraged to overcome our anxieties and insecurities about DH. For many of us, the class discussions, one-on-one consultations, and small-group and personal exercises using sample datasets and our own project-based datasets helped debunk the preconceived notion that in order to do DH one must know how to code.

      While much of the course focused on the practical knowledge of DH, the instructors also incorporated the following FemTechNet T.V. MEALS Framework into the curriculum:

      • Tech assumes mastery of TACIT KNOWLEDGE practices, although often presented as transparent
      • Tech promotes particular VALUES, though often presented as value-neutral
      • Technology is MATERIAL, though it is often presented as transcendent
      • Technology involves EMBODIMENT, though it is often presented as disembodied
      • Tech solicits AFFECT, though is often presented as highly rational
      • Tech requires LABOR, though it is often presented as labor-saving
      • Tech is SITUATED in particular contexts, though often presented as universal.

      Based on Black feminist theory, this framework created by the 2016 #femDH class aided our process of thinking critically about not only our technology assumptions but also our research, datasets and the tools we were introduced to. This was an important aspect of the class as most of our projects examined research topics surrounding race, class, gender, and sexuality. My project focused on mapping Black queer spaces in Washington, D.C. using data I gathered from a walking tour brochure that lists the name and location of bars/clubs, and other information that I hadn’t considered pertinent to my research prior to DHSI.

      With the assistance of Paige and Yvonne, my #wrangledata classmates, and others from the #raceDH class, I learned a great deal about similar mapping projects, cryptography, GIS, and various mapping tools. Being afforded the opportunity to attend DHSI provided me with the space and time to broaden my understanding of digital scholarship, which in turn helped me conceptualize and implement my research project.

      Based on my DHSI experience and the conversations I had with participants throughout the week, there is an apparent need for more archivists to be actively involved with digital humanities scholarship. A number of projects and sessions (primarily those held during the DLFxDHSI unconference) were in fact related to both community archives and digital archives. Moving forward, I would like to see more of us in these settings to collaborate and help facilitate conversations, provide assistance, and share best practices about our work and how archivists can play a role in shaping future DH efforts. As of right now I’m planning to return and I hope to bring a few colleagues along with me!

      The post Fellow Reflection: Steven Booth appeared first on DLF.

      New Member: Amherst College / Islandora

      The Islandora Foundation is very pleased to welcome our newest member: Amherst College. Currently the owners of a custom Fedora 3 digital collection (ACDC), the team at Amherst College are explorers on the bleeding edge of Islandora, working with Islandora CLAW and Fedora 4. They have played an active role in development and discussions for the future of Islandora and we look forward to working more closely with them as members of the Islandora Foundation.

      My Face is Personally Identifiable Information / Eric Hellman

      Facial recognition technology used to be so adorable. When I wrote about it 7 years ago, the facial recognition technology in iPhoto was finding faces in shrubbery, but was also good enough to accurately see family resemblances in faces carved into a wall. Now, Apple thinks it's good enough to use for biometric logins, bragging that "your face is your password".

      I think this will be my new password:

      The ACLU is worried about the civil liberty implications of facial recognition and the machine learning technology that underlies it. I'm worried too, but for completely different reasons. The ACLU has been generating a lot of press as they articulate their worries - that facial recognition is unreliable, that it's tainted by the bias inherent in its training data, and that it will be used by governments as a tool of oppression. But I think those worries are short-sighted. I'm worried that facial recognition will be extremely accurate, that its training data will be complete and thus unbiased, and that everyone will be using it everywhere on everyone else and even an oppressive government will be powerless to preserve our meager shreds of privacy.

      We certainly need to be aware of the ways in which our biases can infect the tools we build, but the ACLU's argument against facial recognition invites the conclusion that things will be just peachy if only facial recognition were accurate and unbiased. Unfortunately, it will be. You don't have to read Cory Doctorow's novels to imagine a dystopia built on facial recognition. The progression of technology is such that multiple face recognizer networks could soon be observing us wherever we go in the physical world - the same way that we're recognized at every site on the internet via web beacons, web profilers and other spyware.

      The problem with having your face as your password is that you can't keep your face secret. Faces aren't meant to be secret. Our faces co-evolved with our brains to be individually recognizable; evidently, having an identity confers a survival advantage. Our societies are deeply structured around our ability to recognize other people by their faces. We even put faces on our money!

      Facial recognition is not new at all, but we need to understand the ways in which machines doing the recognizing will change the fabric of our societies. Let's assume that the machines will be really good at it. What's different?

      For many applications, the machine will be doing things that people already do. Putting a face-recognizing camera on your front door is just doing what you'd do yourself in deciding whether to open it. Maybe using facial recognition in place of a paper driver's license or passport would improve upon the performance of a TSA agent squinting at that awful 5-year-old photo of you. What's really transformative is the connectivity. That front-door camera will talk to Fedex's registry of delivery people. When you use your face at your polling place, the bureau of elections will make sure you don't vote anywhere else that day. And the ID-check that proves you're old enough to buy cigarettes will update your medical records. What used to identify you locally can now identify you globally.

      The reason that face-identity is so scary is that it's a type of identifier that has never existed before. It's globally unique, but it doesn't require a central registry to be used. It's public, easily collected and you can't remove it. It's as if we all had to tattoo our prisoner social security numbers on our foreheads! Facial profiles can be transmitted around the world, and used to index ALL THE DATABASEZ!
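      To make the "no central registry" point concrete, here is a toy sketch of why a facial profile works as a global key: any two parties who independently compute similar embeddings from photos of the same face can join their records, with no coordination. (The four-dimensional vectors and the threshold below are made up for illustration; real systems use learned embeddings of 128 or more dimensions.)

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up 4-d "face embeddings"; real systems use 128+ dimensions.
alice_at_door  = [0.90, 0.10, 0.30, 0.40]   # camera A's measurement
alice_at_store = [0.88, 0.12, 0.28, 0.41]   # camera B's, taken independently
bob            = [0.10, 0.90, 0.50, 0.20]

THRESHOLD = 0.99  # similarity above this counts as "same person"
print(cosine(alice_at_door, alice_at_store) > THRESHOLD)  # True: records joinable
print(cosine(alice_at_door, bob) > THRESHOLD)             # False
```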

      We can't stop facial recognition technology any more than we can reverse global warming, but we can start preparing today. We need to start by treating facial profiles and photographs as personally identifiable information. We have some privacy laws that cover so-called "PII", and we need to start applying them to photographs and facial recognition profiles. We can also impose strict liability for the misuse of biased or inaccurate facial recognition; slowing down the adoption of facial recognition technology will give our society a chance to adjust to its consequences.

      Oh, and maybe Denmark's new law against niqabs violates GDPR?

      Shitcoin And The Lightning Network / David Rosenthal

      The Lightning Network is an overlay on the Bitcoin network, intended to remedy the fact that Bitcoin is unusable for actual transactions. Andreas Brekken tried installing, running and using a node. He describes his experience in four blog posts:
      1. Can I compile and run a node?
      2. We must first become the Lightning Network
      3. Paying for goods and services
      4. What happens when you close half of the Lightning Network?
      Brekken's final TL;DR was “Operating the largest node on the Bitcoin Lightning Network has been educational, frustrating, fun, and at times terrifying. I look forward to trying it again once the technology matures.” Below the fold I look into some of the details.

      Brekken brought up a node running on AWS
      Compiling, installing, and running Lightning Network Daemon, lnd, was straight forward. I look forward to using payment channels for sending and receiving bitcoin.
      That's "straight forward" for a tech-savvy person. The Lightning Network white paper was published only 30 months ago, so it is good to see that the code is already in reasonable shape. He used a c5.large AWS instance with 500GB of disk, which would cost $61.20/month plus $22.50/month for the storage plus data transfer charges. So more than $83.70/month.
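      For the record, the arithmetic behind that estimate works out as follows (the hourly and per-GB rates are my reconstruction of on-demand pricing at the time, and vary by region):

```python
# Back-of-the-envelope AWS cost for the node; rates are assumed
# on-demand prices (c5.large + 500 GB EBS) and vary by region.
hours_per_month = 720
instance = 0.085 * hours_per_month   # c5.large: $61.20/month
storage = 0.045 * 500                # 500 GB EBS: $22.50/month
base = instance + storage
print(f"${base:.2f}/month before data transfer")  # $83.70/month
```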

      Now that Brekken's node was up, it needed to contain Bitcoin and create links to other Lightning Network nodes. How many Bitcoin and how many links?
      When I started writing the review the total capacity of the Lightning Network was slightly over 20 BTC (around $130,000). I decide to shake things up.
      By shaking things up Brekken means becoming the biggest node in the network by assigning to it enough Bitcoin to form over 15% of the total capacity.

      Brekken's node had almost 4 times the capacity of the next biggest node although, as a newcomer, it had fewer direct connections to other nodes. He had an interesting time tweaking the configuration parameters of his node, and trying to figure out what it was doing, but it did start routing payments.
      The node has routed 260 payments for other users, averaging a profit of $0.0012 USD per transaction. I doubt that this will cover the costs of running the node, but leave the node running for now.
      So the node would need to route more than 70K transactions/month to cover its costs, or about 1.6 transactions/minute. Brekken concludes that:
      Maintaining a Lightning Network payment hub is stressful and makes very little profit. Hopefully the risk will decrease and profit increase as the Lightning Network gains traffic.
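      The break-even arithmetic behind that estimate, using the post's own figures:

```python
monthly_cost = 83.70   # AWS hosting estimate, USD/month
fee_per_tx = 0.0012    # average routing profit per transaction, USD
breakeven_tx = monthly_cost / fee_per_tx
per_minute = breakeven_tx / (30 * 24 * 60)
print(round(breakeven_tx))   # ~70K transactions/month
print(round(per_minute, 2))  # ~1.61 transactions/minute
```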
      Now that he controlled by far the largest node on the Lightning Network, Brekken should have been in a good position to transact on his own behalf. He recounts five transaction attempts, three of which failed, as shown in the table. Both the wallets he tried while attempting to pay the Blockstream Store failed.

      Merchant            Amount      Paid?
      Lightning Spin      100,000     Yes
      Satoshi's Place     800,000     No
      Blockstream Store   1,021,201   No

      Brekken concludes that:
      Sending payments using the Lightning Network is cheaper than the regular Bitcoin network, but suffers from routing errors and wallet bugs that make it impractical even for highly technical users.
      Brekken leaves the node running for a week, then asks:
      Are the funds still there? ... The funds are still there. My Lightning Node has routed 389 payments, making a profit of $0.34. I suspect the increase is mostly from the recent increase in bitcoin’s price. ... Running a large Lightning Network node has been quite stressful. An exploit such as we saw with heartbleed could allow an attacker to drain all funds from the node while I’m sleeping. It’s time to end the experiment.
      I think Brekken means "income" not "profit". If so, the income of the largest node in the Lightning Network is $0.34/week against costs of at least $21. He tries to gracefully close all the node's channels to recover the funds they are using, but some don't close, trapping the funds:
      The remaining channels cannot close gracefully because the Lightning Node on the other side of the channel is offline. I force close these channels using lncli closechannel --force.

      Closing a channel using --force results in a unilateral close which makes the funds unavailable to me. The amount of time the funds are locked up depends on the channel policy. This policy is negotiated when the channel opens. Most channels will release the funds to me in between 1440 and 20180 minutes.
      That is between 1 and 14 days. It is hard to disagree with Brekken's conclusion that the technology isn't ready for prime time.