Planet Code4Lib

Introducing the 2019 Authenticity Project Fellows / Digital Library Federation

Thanks to a generous grant from the IMLS, the HBCU Library Alliance and Digital Library Federation are pleased to announce the first of three annual cohorts of Authenticity Project Fellows!

These fifteen Fellows work in libraries and archives at historically black colleges and universities. They will receive full travel, lodging, and registration expenses to the 2019 DLF Forum in Tampa, FL; professional development through online discussions, activities, and in-person networking; and opportunities to apply for microgrant funding to undertake inter-institutional projects of strategic importance across DLF and HBCU Library Alliance institutions and communities.

They will also participate in quarterly facilitated, online networking and discussion sessions, and will be matched for mentorship and mutual learning with two experienced library professionals: an established mentor from an HBCU Library Alliance library or with a strong background in HBCUs, and a “conversation partner” working in an area of the Fellow’s interest, ideally within a DLF member institution. (Names of our 2019 mentors and conversation partners will be announced soon.)

Applications for this fellowship opportunity were extraordinarily strong, and we are pleased that we will be able to welcome 30 more Authenticity Project Fellows in our 2020 and 2021 cohorts. If you are interested in applying or re-applying for the fellowship, please stay tuned for a call in the autumn of 2019, and see our press release about the program for more information.

About the Fellows:

Meaghan Alston
Prints and Photographs Librarian, Moorland-Spingarn Research Center, Howard University

Meaghan Alston received her MLIS with a focus on Archives and Information Science from the University of Pittsburgh in 2015. She has been the Prints and Photographs Librarian at Howard University’s Moorland-Spingarn Research Center since 2017. In this position she is responsible for a collection of over 150,000 graphic images depicting African American and African Diaspora history. Prior to joining the staff at Moorland-Spingarn, she worked as a Visiting Librarian with the University of Pittsburgh’s Archives and Special Collections. Her interests include digital preservation, community archives, digital humanities, and archival education.    


Danisha Baker-Whitaker
Archivist/Museum Curator, Bennett College (North Carolina)

Danisha Baker-Whitaker is the Archivist/Museum Curator at Bennett College. She’s also a Ph.D. student in the Communication, Rhetoric, and Digital Media Program at North Carolina State University. She focused on archives and special collections in obtaining the MLIS degree from the University of North Carolina – Greensboro. Her main research interests include exploring how the digital humanities field can intersect with and influence the duties of archivists. The overarching interests behind her work are identity and librarianship in the 21st century. Other interests include digital archives, access, and information architecture.


Cassandra Burford (@DigitalPresPro)
Special Collections Librarian, Talladega College (Alabama)

Cassandra Hill Burford is the special collections librarian at Talladega College in Talladega, Alabama, where she manages eight collections. An Alabama native, she received a BA and MA in history from Jacksonville State University, and her MLIS, with a focus in digital preservation and digital libraries, from the University of Alabama in 2018. Cassandra firmly believes that objectivity is an archivist’s most important quality. An archivist is not concerned with who won or lost the battle, but rather with the materials and artifacts and the stories they tell. With that objectivity in mind, she is an avid proponent of digital preservation initiatives for archives, especially niche collections like those maintained at Talladega College.


Justin de la Cruz (@justindlc)
Unit Head, E-Learning Technology, Atlanta University Center Robert W. Woodruff Library (Georgia)

Justin has worked on technology training for library staff and patrons in both public and academic libraries. He currently serves as the Unit Head of E-Learning Technology at the Atlanta University Center Robert W. Woodruff Library, where he collaborates with faculty and students on technology projects involving multimedia production, 3D design, and social media, among other topics. His recent publications, including a chapter in ACRL’s Applying Library Values to Emerging Technology, have focused mainly on library staff professional development. Justin recently joined the editorial advisory board for Library Hi Tech and is working on an ALA Diversity Grant-funded research project investigating the information seeking and sharing behaviors of LGBTQIA+ students.


Cheryl Ferguson (@cdferg4)
Archival Assistant, Tuskegee University (Alabama)

Cheryl D. Ferguson is the Archival Assistant at Tuskegee University in Tuskegee, Alabama.  As part of the university archives she helps to build and promote the rich history of the university. Her areas of interest include archival research, digital preservation, digitization, outreach/development, and program management.

Cheryl is a member of the Society of American Archivists, Society of Alabama Archivists and the Association of African American Museums.

Outside of the archives, Cheryl is actively involved in serving her community as a member of the Tuskegee Alumnae Chapter of Delta Sigma Theta Sorority, Inc. and the Tuskegee United Women’s League, Inc.


Ida Jones (@Ida39J)
University Archivist, Morgan State University (Maryland)

Ida E. Jones is the University Archivist at Morgan State University. She is the first professional archivist hired by Morgan in celebration of the 150th anniversary. Since her arrival three years ago, there are 10 processed manuscript collections with online finding aids, and she has made a number of new donor and departmental contacts. The contacts are working with her in preparation for their future deposits, research, and reference queries. She has taught at Lancaster Bible College; the University of Maryland, College Park; and Howard University. She specializes in African American church history, organizational history, and local history. She has authored four books. Her most recent publication, Baltimore Civil Rights Leader: Victorine Q. Adams: The Power of the Ballot, debuted in January 2019.


Alvin Lee
Library Technical Assistant Supervisor at Florida Agricultural & Mechanical University (Florida)

Alvin Lee is a recent MA-LIS graduate. He is currently employed as Senior Library Technical Assistant Supervisor and Resource Sharing Coordinator for University Libraries at Florida Agricultural and Mechanical University, a Historically Black College/University located in Tallahassee, Florida. Alvin also serves as the chair of University Libraries’ Digitization Committee. Mr. Lee has a passion for learning about digital libraries and digital scholarship.

Alvin is committed to lifelong learning. In December 2018 he received his MA-LIS degree from the University of South Florida. He took graduate-level coursework at Georgia Southern University in Statesboro, Georgia, to gain state certification in Georgia as a Middle Grades Teacher. Alvin has a BA in Political Science/History from Armstrong Atlantic University in Savannah, Georgia. Alvin began his undergraduate studies at Oxford College of Emory University in Oxford, Georgia. Alvin is a proud honors graduate of Boggs Academy in Keysville, Georgia.

Alvin currently serves on a statewide committee charged with identifying training opportunities for staff at member institutions of the Sunshine State Digital Network, the state hub for the Digital Public Library of America. Alvin is a lifetime member of Phi Kappa Phi Honor Society. He is also a member of the Golden Key Honor Society. Alvin is currently serving a two-year term as a member of the state’s Resource Sharing Steering Committee.
Alvin is currently writing a grant proposal for submission to the LYRASIS Catalyst Fund. The proposal will be based on the inclusion and use of digital scholarship for advancing conceptual research.

Brandon Lunsford
University Archivist and Digital Manager, Johnson C. Smith University (North Carolina)

Brandon Lunsford has been the Archivist at Johnson C. Smith University in Charlotte since 2009. He  received his MA in Public History with a concentration in Historic Preservation from the University of North Carolina at Charlotte in 2009, and his BA in History from UNCC  in 2001. He is currently taking courses at the University of North Carolina at Greensboro for his MA in Library and Information Science, with a focus on Archives. He wrote Charlotte Then and Now for Anova Books and Thunder Bay Press as part of his thesis project in 2008, and completed a revised edition in 2012.  Before coming to JCSU he completed internships at the Charlotte Museum of History and the Charlotte Mecklenburg Historic Landmarks Commission. While the archivist at Smith, Brandon has written five successfully funded grant projects from the National Endowment for the Humanities, Lyrasis, the Andrew W. Mellon Foundation through the Lyrasis HBCU Photographic Preservation Project, and the North Carolina Library Services and Technology Act (LSTA). His projects have included an NEH grant to create a digital interactive map of the historic African American neighborhood surrounding Johnson C. Smith University and a digitization grant and exhibit showcasing the James G. Peeler Photograph Collection.

Raeshawn McGuffie
Assistant Director of Technical Services — Hampton University (Virginia)

Ms. Raeshawn McGuffie is from Los Angeles, California. She earned her BA in Psychology from Fisk University located in Nashville, Tennessee. Ms. McGuffie earned her Master of Library Science degree at North Carolina Central University which is in Durham, North Carolina.  She is currently the Assistant Director of Technical Services at Hampton University’s William R. and Norma B Harvey Library. It is in this position that she discovered an interest in digitization in order to preserve the many historical resources housed in the library’s Special Collections. Ms. McGuffie is, of course, an avid reader, but she also loves knitting, photography, and music.

DeLisa Minor Harris
Special Collections Librarian, Fisk University (Tennessee)

DeLisa Minor Harris is a Fisk alumna who returned in 2016 to serve her alma mater after completing her Master’s Degree at the University of North Texas and spending four years with the Nashville Public Library. Connecting students, faculty and staff, researchers, scholars, and the Nashville community to the many historical collection holdings of Fisk University is Ms. Minor Harris’ top priority. Since her start in Special Collections at Fisk University, Ms. Minor Harris has curated five exhibits, including “Lord, I’m Out Here on Your Word”-Fisk Jubilee Singers: Singing from spirit to spirit”, and has written two articles published in the enlarged two-volume set of the Encyclopedia of African American Business, ABC/CLIO. She has twice presented at the Annual Conference of the Association for the Study of African American Life and History (ASALH): in 2017 on a photograph preservation project at Fisk University, and in 2018 on the Student Army Training Corps during World War I. In 2017, Ms. Minor Harris was awarded the Rare Book School’s National Endowment for the Humanities-Global Book Histories Initiative Scholarship to attend Rare Book School at the University of Virginia. She currently serves as Co-Chair of the Nashville Area Library Alliance (NALA), working in partnership with librarians across Davidson County.

Aletha Moore
Digitization Project Manager, Atlanta University Center Robert W. Woodruff Library (Georgia)

Aletha R. Moore, a graduate of Spelman College, earned a Bachelor’s degree in History with a specialization in American and African American history and her Master’s degree in Archival Studies with a concentration in digital archives at Clayton State University. While in graduate school, Aletha worked in the Spelman Archives and the Jimmy Carter Presidential Research Library. Currently, she is the Digitization Project Manager of a CLIR grant at the Atlanta University Center Robert W. Woodruff Library Archives Research Center. Professionally, Moore is a member of the Society of American Archivists and the Society of Georgia Archivists.


Monica Riley
Serials Librarian, Morehouse School of Medicine (Georgia)

I am the current Serials Librarian at the M. Delmar Edwards, M.D. Library at Morehouse School of Medicine. My responsibilities include selecting, acquiring, and managing access to digital and print serials. I’m a native of San Francisco, CA and a proud alumna of Clark Atlanta University, where I received a B.A. in Mass Media Arts and an M.S. in Library and Information Studies. I’ve enjoyed a rich career as a professional librarian, having held positions in reference, youth services, and interlibrary loan at academic, public, and special libraries. My interests include project management, assessment, and curation of digital collections.

Carla Sarratt (@offtheshelfMLS)
Director, Lincoln University (Pennsylvania)

Carla R. Sarratt earned her Bachelor of Arts degree in English and Psychology with a minor in Education from Wittenberg University in Springfield, Ohio.  After college, Carla taught high school English in Columbus, Ohio and Charlotte, North Carolina. She later earned a Master of Library Science degree from North Carolina Central University in Durham, North Carolina.

During graduate school, she participated in a study abroad trip to Copenhagen, Denmark that allowed her to experience the international scope of librarianship in Denmark as well as Sweden.  She interned with the State Library of North Carolina where she assisted with programming and outreach endeavors with an emphasis on genealogy research. Through her passion for social media, she started the library’s Pinterest account that continues to thrive today.  

Before graduation, she accepted a position as the solo librarian for the African American Cultural Center Library at North Carolina State University. Through her ties to the African American children’s literature community, she brought children’s author and illustrator Don Tate to campus as part of the cultural center’s programming.

In her role as librarian and Virtual Services Librarian with New Hanover County Public Library, she continued to build on her experiences as a teacher and librarian with programming and outreach within the local community.  She worked with high school students and teachers as well as the writing community to promote library resources and services. In 2017, she was awarded a stellar award for Innovation and Professionalism from New Hanover County.


Kayla Siddell
Scholarly Communications and Instruction Librarian, Xavier University of Louisiana (Louisiana)

Kayla Siddell is the Scholarly Communications and Instruction Librarian in University Library at Xavier University of Louisiana, where she manages the institutional repository and the Data Visualization Lab, and consults with faculty, staff, and students on their research and use of library resources and services. Previously she was the Data Curation Librarian at Indiana State University, where she served as webmaster, managed the institutional repository and the CONTENTdm and Omeka websites, and ran the digitization laboratory. Her research interests include alternative data, best practices for data curation and institutional repositories, scholarly communication, and information literacy. Kayla is an alumna of East Tennessee State University, where she studied psychology, and of the University of Tennessee in Knoxville, where she earned her Master’s degree in Information Science.


Raquel K. Williams Donahue (@RaquelKWilliams)
Reference & Instruction Librarian I at Prairie View A&M University (Texas)

Raquel K. Williams has been a Reference & Instruction Librarian at PVAMU’s John B. Coleman Library since March 2016.  She graduated with honors from Texas Woman’s University in December of 2015 with her M.L.S. and earned a B.A. in English from Oakland University in Rochester, MI in 2006.  Raquel has been active in the Texas Library Association as a member and as a round table officer since 2014 in several groups, including the Latino Caucus RT, Library Instruction RT, Library Support Staff RT, and Reference & Information Services RT.  She is regularly on Twitter @RaquelKWilliams.

About DLF, the HBCU Library Alliance, and the IMLS

The HBCU Library Alliance is a consortium that supports the collaboration of information professionals dedicated to providing an array of resources to strengthen Historically Black Colleges and Universities (HBCUs) and their constituents. As the voice of advocacy for member institutions, the HBCU Library Alliance is uniquely designed to transform and strengthen its membership by developing library leaders, helping to curate, preserve and disseminate relevant digital collections, and engaging in strategic planning for the future.

DLF is an international network of member institutions and a robust community of practice, advancing research, learning, social justice, and the public good through the creative design and wise application of digital library technologies. It is a program of CLIR, the Council on Library and Information Resources — an independent, nonprofit organization that forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning.

This project is made possible in part by the Institute of Museum and Library Services, through grant # RE‐70‐18‐0121. The IMLS is the primary source of federal support for the nation’s libraries and museums. They advance, support, and empower America’s museums, libraries, and related organizations through grantmaking, research, and policy development. Their vision is a nation where museums and libraries work together to transform the lives of individuals and communities. To learn more, visit

The post Introducing the 2019 Authenticity Project Fellows appeared first on DLF.

MarcEdit 7/3: Updating MarcEditor file processing / Terry Reese

This year, the biggest project that I’ve been undertaking is a rewrite of how MarcEdit’s MarcEditor handles and tracks changes in files. Since the very first version, MarcEdit has separated edits into two workflows:

1. Edits made manually

2. Edits made using the global tools

Some features, like report generation and validation, fell into a middle ground. These workflows existed because of MarcEdit’s special undo function and the ability to track large global changes for undo. Internally, the tool is required to keep copies of files at various states, and these copies are set when global updates occur. This means the best practice within the application has always been that if you are going to perform manual edits, you should save these edits prior to editing globally. The reason is that once a global edit occurs, the manual edits may not be preserved, because the internal tracking file snapshots the changes at the time of the global edit to support the undo functionality.

To fix this, I need to rethink how the program reads data. In all versions since MarcEdit 5, the program has used pages to read MARC data. These pages create a binary map of the file, allowing the application to locate and extract data from the file at very specific locations (creating each page). If a page was manually edited, a physical page was created and reinserted when the entire file was saved. The process works well unless users start to mix the two editing workflows together. Do this too often, and the internal data map of the file can sometimes fall out of sync.
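To make the idea of a page map concrete, here is a minimal sketch in Python. It is not MarcEdit's implementation, which the post does not describe in detail. It assumes binary MARC input, where each record ends with the record terminator byte 0x1D, and the names `build_page_map`, `read_page`, and `PAGE_SIZE` are hypothetical.

```python
import io

PAGE_SIZE = 2          # records per page (hypothetical; tiny for illustration)
RECORD_END = b"\x1d"   # MARC (ISO 2709) record terminator byte


def build_page_map(stream, page_size=PAGE_SIZE):
    """Return a list of (offset, length) byte spans, one span per page."""
    data = stream.read()  # a real tool would scan in chunks, not read it all
    pages = []
    page_start = 0
    records_in_page = 0
    pos = 0
    while True:
        end = data.find(RECORD_END, pos)
        if end == -1:
            break
        pos = end + 1
        records_in_page += 1
        if records_in_page == page_size:
            pages.append((page_start, pos - page_start))
            page_start = pos
            records_in_page = 0
    if records_in_page:
        # Final, partially filled page.
        pages.append((page_start, pos - page_start))
    return pages


def read_page(stream, page_map, page_number):
    """Seek straight to a page using the map and return its raw bytes."""
    offset, length = page_map[page_number]
    stream.seek(offset)
    return stream.read(length)


sample = io.BytesIO(b"rec1\x1drec2\x1drec3\x1d")
page_map = build_page_map(sample)
second_page = read_page(sample, page_map, 1)
```

Under these assumptions, a manual edit that changes the length of a page makes every later offset in the map stale, which is one way a map like this can drift out of sync until the file is repaged.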

To fix the sync problem, the file simply needs to be repaged after manual edits have occurred but before a global action takes place. The problem is that paging the document is expensive. While MarcEdit 7 brought a number of speed enhancements to page loading and saving, it still took significant time.

For example, when MarcEdit 7 first came out, the tool could process ~15,000 records per second for paging. That meant that a file of 150,000 records would take ~10 seconds to process for paging. This time was spent primarily on heavy IO tasks, because many files I encounter have mixed newline characters. This processing time was a disincentive to repage documents until necessary, so repaging occurred only when a Save action occurred.

As of Christmas 2018, I introduced a new file processing method.  This repaging method works much faster – processing closer to 120,000 records per second.  In real life tests, this means that I can process a GB of data in under 6 seconds and 12 GBs of data in around a minute.  That is a significant improvement, and now, repaging can happen more liberally.
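The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the record rates stated in the post; the function name `seconds_to_page` is hypothetical.

```python
def seconds_to_page(record_count, records_per_second):
    """Estimated time to page a file at a given processing rate."""
    return record_count / records_per_second


OLD_RATE = 15_000    # records per second, MarcEdit 7 at release (from the post)
NEW_RATE = 120_000   # records per second, new method (from the post)

old_time = seconds_to_page(150_000, OLD_RATE)  # 10.0 seconds
new_time = seconds_to_page(150_000, NEW_RATE)  # 1.25 seconds
speedup = NEW_RATE / OLD_RATE                  # 8.0
```

At the new rate, the same 150,000-record file pages in a little over a second, roughly an eightfold improvement.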

This has led to a number of changes in how pages are loaded and processed. Currently, there is a beta version that is being tested to make sure that all functions are being connected to the new document queue. At this point, I think this work is done, but given the wide range of workflows, I’m looking to make sure I haven’t missed something.

Once complete, the current best practice of separating manual and global editing workflows will end, as liberal repaging will enable manual edits to live harmoniously with the special undo function.

If you have questions, please let me know.


MarcEdit 7: Folder Watcher Service Enhancements / Terry Reese

I’ve been continuing to think about new workflows that could be added to the Watcher service.  One of the most requested is the ability for MarcEdit’s tool to not only watch local files, but to be able to monitor remote files hosted on FTP/SFTP servers as well.  So, over the past couple of weeks, I’ve been fleshing out what this might look like, and writing the components necessary to include the FTP/SFTP functionality into the application.

At this point, I have a working model that I’m connecting into the program – and will be making available with the next update.  The process, as I envision it now, will look something like this:

1. Users now select from a Local or Remote Folder to Watch


2. If FTP or SFTP is selected, you will be prompted for login info:


3. After config info is provided, the tool will log into the server at the host and open a folder browser. You can view folders and browse subfolders.


4. Click Select folder (when you are in or have selected the folder to process) and it will be pulled into the watcher profile


Once the folder has been specified, the tool will watch this FTP/SFTP resource. This also means thinking about adding more granularity to the watch scheduling. Currently, schedules are set per 24-hour period. I’m thinking that for most vendor/FTP/SFTP resources, this may be too often. To work around that right now, the tool creates a hash file of all data downloaded, and will check with the server to determine if the hash has changed prior to downloading the data. Ideally, that means that the tool will only download records for processing when there is a change at the server. That will probably work for now, but I am aware that it would likely be better to have schedules that can be planned to run at a specific point during a month or week.
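One way to picture the change check is sketched below in Python. The post does not say what the hash is computed over or how it is compared with the server, so this sketch assumes a digest of the remote folder listing (file name, size, and modification time); the function names and the sample listing are hypothetical.

```python
import hashlib


def listing_digest(entries):
    """Hash a remote folder listing of (name, size, modified) tuples.

    Entries are sorted first so the digest does not depend on listing order.
    """
    h = hashlib.sha256()
    for name, size, modified in sorted(entries):
        h.update(f"{name}\t{size}\t{modified}\n".encode("utf-8"))
    return h.hexdigest()


def should_download(entries, previous_digest):
    """Return (changed, current_digest) for the current remote listing."""
    current = listing_digest(entries)
    return current != previous_digest, current


# Hypothetical listing as it might be returned by an FTP/SFTP client.
listing = [("vendor_a.mrc", 1024, "2019-01-10"), ("vendor_b.mrc", 2048, "2019-01-11")]

first_run, saved = should_download(listing, None)    # nothing saved yet
second_run, _ = should_download(listing, saved)      # listing unchanged
updated = [("vendor_a.mrc", 4096, "2019-01-15"), listing[1]]
third_run, _ = should_download(updated, saved)       # one file changed
```

With a check like this, a daily schedule costs only a directory listing on days when nothing has changed at the server.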

What is exciting for me as well, is that this work allows me to think about other potential workflows and how people bring data into the MarcEditor.   So, if you currently have to go somewhere to download MARC data from an FTP/SFTP server – I’d be interested in hearing how I could make this easier for you.

In addition to these changes, I’m looking to add some “global” watcher elements: things like MARCXML to MARC conversion, joining resulting files into a single data set, and character conversion options. I’m still thinking through how these work and the order of operations, as some of these (character conversion) would be run on an individual file (like a task), while others (Join) would need to run after all files in a watched folder were processed. So how order is determined and processed would be important.
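The ordering concern can be illustrated with a small two-phase sketch in Python. This is an assumption about how such a pipeline might be arranged, not a description of MarcEdit's design; `run_watch_cycle` and the stand-in steps are hypothetical.

```python
def run_watch_cycle(files, per_file_steps, folder_steps):
    """Apply per-file steps to every file, then folder-level steps once.

    files: list of (name, data) pairs.
    per_file_steps: functions taking and returning one file's data.
    folder_steps: functions taking and returning the list of all results.
    """
    processed = []
    for name, data in files:
        for step in per_file_steps:
            data = step(data)
        processed.append(data)

    # Folder-level steps only start after every file has been processed.
    result = processed
    for step in folder_steps:
        result = step(result)
    return result


def join_all(parts):
    """Stand-in for a Join: merge all processed files into one."""
    return ["".join(parts)]


# str.upper is a stand-in for a per-file task such as character conversion.
joined = run_watch_cycle([("a.mrk", "x"), ("b.mrk", "y")], [str.upper], [join_all])
```

Separating the two phases makes the order explicit: a Join never sees a file that has not yet been through its per-file tasks.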

If you have questions, please feel free to let me know.


What data counts in Europe? Towards a public debate on Europe’s high value data and the PSI Directive / Open Knowledge Foundation

This blogpost was co-authored by Danny Lämmerhirt, Pierre Chrzanowski and Sander van der Waal (*author note at the bottom)

January 22 will mark a crucial moment for the future of open data in Europe. That day, the final trilogue between the European Commission, Parliament, and Council is planned to decide on the ratification of the updated PSI Directive. Among other things, the European institutions will decide what counts as ‘high value’ data. What essential information should be made available to the public and how those data infrastructures should be funded and managed are critical questions for the future of the EU.

As we will discuss below, there are many ways one might envision the collective ‘value’ of those data. This is a democratic question and we should not be satisfied by an ill-defined and overly broad proposal. We therefore propose to organise a public debate to collectively define what counts as high value data in Europe.

What does the PSI Directive say about high value datasets?

The European Commission provides several hints in the current revision of the PSI Directive on how it envisions high value datasets. They are determined by one of the following ‘value indicators’:

  • The potential to generate significant social, economic, or environmental benefits,
  • The potential to generate innovative services,
  • The number of users, in particular SMEs,  
  • The revenues they may help generate,  
  • The data’s potential for being combined with other datasets
  • The expected impact on the competitive situation of public undertakings.

Given the strategic role of open data for Europe’s Digital Single Market, these indicators are not surprising. But as we will discuss below, there are several challenges in defining them. Also, there are different ways of understanding the importance of data.

The annex of the PSI Directive also includes a list of preliminary high value data, drawing primarily from the key datasets defined by Open Knowledge International’s (OKI’s) Global Open Data Index, as well as the G8 Open Data Charter Technical Annex. See the proposed list in the table below.

List of categories and high-value datasets:

Category | Description
--- | ---
1. Geospatial data | Postcodes, national and local maps (cadastral, topographic, marine, administrative boundaries).
2. Earth observation and environment | Space and in situ data (monitoring of the weather and of the quality of land and water, seismicity, energy consumption, the energy performance of buildings and emission levels).
3. Meteorological data | Weather forecasts, rain, wind and atmospheric pressure.
4. Statistics | National, regional and local statistical data with main demographic and economic indicators (gross domestic product, age, unemployment, income, education).
5. Companies | Company and business registers (list of registered companies, ownership and management data, registration identifiers).
6. Transport data | Public transport timetables of all modes of transport, information on public works and the state of the transport network including traffic information.


According to the proposal, regardless of who provides them, these datasets shall be available for free, machine-readable and accessible for download, and where appropriate, via APIs. The conditions for re-use shall be compatible with open standard licences.

Towards a public debate on high value datasets at EU level

There have been attempts by EU Member States to define what constitutes high-value data at national level, with different results. In Denmark, basic data has been defined as the five core types of information that public authorities use in their day-to-day case processing and should release. In France, the law for a Digital Republic aims to make available reference datasets that have the greatest economic and social impact. In Estonia, the country relies on the X-Road infrastructure to connect core public information systems, but most of the data remains restricted.

Now is the time for a shared and common definition of what constitutes high-value datasets at EU level. And this implies an agreement on how we should define them. However, as it stands, there are several issues with the value indicators that the European Commission proposes.

For example, how does one define the data’s potential for innovative services? How can one confidently attribute revenue gains to the use of open data? How does one assess and compare the social, economic, and environmental benefits of opening up data? Anyone designing these indicators must be very cautious, as metrics to compare social, economic, and environmental benefits may come with methodological biases. Research has found, for example, that comparing economic and environmental benefits can unfairly favour data of economic value at the expense of fuzzier social benefits, as economic benefits are often more easily quantifiable and definable by default.

One form of debating high value datasets could be to discuss what data gets currently published by governments and why. For instance, with their Global Open Data Index, Open Knowledge International has long advocated for the publication of disaggregated, transactional spending figures. Another example is OKI’s Open Data For Tax Justice initiative which wanted to influence the requirements for multinational companies to report their activities in each country (so-called ‘Country-By-Country-Reporting’), and influence a standard for publicly accessible key data.  

A public debate on high value data should critically examine the European Commission’s considerations regarding the distortion of competition. What market dynamics are engendered by opening up data? To what extent do existing markets rely on scarce and closed information? Does closed data bring about market failure, as some argue (Zinnbauer 2018)? Could it otherwise hamper fair price mechanisms (for a discussion of these dynamics in open access publishing, see Lawson, Gray and Mauri 2015)? How would open data change existing market dynamics? Which actors claim that opening data could cause market distortion, and whose interests do they represent?

Lastly, the European Commission does not yet consider cases of government agencies  generating revenue from selling particularly valuable data. The Dutch national company register has for a long time been such a case, as has the German Weather Service. Beyond considering competition, a public debate around high value data should take into account how marginal cost recovery regimes currently work.

What we want to achieve

For these reasons, we want to organise a public discussion to collectively define

  1. What should count as a high value dataset, and based on what criteria,
  2. What information high value datasets should include,
  3. What the conditions for access and re-use should be.

The PSI Directive will set the baseline for open data policies across the EU. We are therefore at a critical moment to define what European societies value as key public information. What is at stake is not only a question of economic impact, but the question of how to democratise European institutions, and the role the public can play in determining what data should be opened.

How you can participate

  1. We will use the Open Knowledge forum as main channel for coordination, exchange of information and debate. To join the debate, please add your thoughts to this thread or feel free to start a new discussion for specific topics.
  2. We gather proposals for high value datasets in this spreadsheet. Please feel free to use it as a discussion document, where we can crowdsource alternative ways of valuing data.
  3. We use the PSI Directive Data Census to assess the openness of high value datasets.

We also welcome references to scientific papers, blog posts, and other material discussing the issue of high value datasets. Once we have gathered suggestions for high value datasets, we would like to assess how open the proposed datasets are. This will help to provide European countries with a diagnosis of the openness of key data.



Author note:

Danny Lämmerhirt is senior researcher on open data, data governance, data commons as well as metrics to improve open governance. He has formerly worked with Open Knowledge International, where he led its research activities, including the methodology development of the Global Open Data Index 2016/17. His work focuses, among others, on the role of metrics for open government, and the effects metrics have on the way institutions work and make decisions. He has supervised and edited several pieces on this topic, including the Open Data Charter’s Measurement Guide.

Pierre Chrzanowski is Data Specialist with the World Bank Group and a co-founder of Open Knowledge France local group. As part of his work, he developed the Open Data for Resilience Initiative (OpenDRI) Index, a tool to assess the openness of key datasets for disaster risk management projects. He has also participated in the impact assessment prior to the new PSI Directive proposal and has contributed to the Global Open Data Index as well as the Web Foundation’s Open Data Barometer.

Sander van der Waal is Programme Lead for Fiscal Transparency at Open Knowledge International. Furthermore, he’s responsible for the team at Open Knowledge International that works in the areas of Research, Communications, and Community. Sander combines a background in Computer Science with Philosophy and has a passion for ‘open’ in all its forms, ranging from open data to open access and open source software.

Physicians Improve Patient Treatment With AllMedx and Lucidworks Fusion / Lucidworks

For 10 years, Doug Grose, Chief Executive Officer of AllMedx, envisioned a “Google for doctors.” With extensive experience in medical communications, Grose knew doctors were frustrated with conventional search engines like Google and Bing because results were often diluted with unreliable content that was intended for consumers and patients. He felt that physicians and other healthcare professionals could benefit from a search tool that sourced its content solely from MD-vetted articles, high-impact medical journals, and other select, reputable clinical sources. The goal was to eliminate irrelevant, consumer-type pieces that would be of little-to-no value to physicians looking for answers to clinical, point-of-care questions.

Once the AllMedx corpus was built, Fusion was used to index data sources including PubMed, CDC, FDA, the leading physician news sites, rarediseases.org, NIH DailyMed, Merck Manuals Professional, and many more, including clinical guidelines from 230 medical societies and thousands of branded drug sites.

Learning From the User

AllMedx is customized to each user based on their interests, previous queries and behavior, and medical specialty. For example, a cardiologist searching for “valve defects” will first see articles in their query results that other cardiologists searching for “valve defects” previously found helpful. Cutting-edge algorithms in Fusion, using AI and ML, automate the content indexing and tailor search results so that AllMedx is able to provide doctors with the answers they seek faster than other sites.

Unique Taxonomy

Fusion has been the muscle behind AllMedx’s search since the launch of the website in April 2018. The user-friendly Fusion platform allows the AllMedx internal development team to be hands-on. Chief Operating Officer and Editor-in-Chief Carol Nathan performs one-on-one user testing with physicians in all medical specialties on a regular basis and can easily tune query pipelines accordingly with this user feedback. With Fusion, the staff has time for real-time improvements, can index new content quickly, and can do a lot of the configuration from the admin panel without having to engage with engineers working on code-level development.

In a particularly unique and market-first application, AllMedx boiled down the medical field into a taxonomy of 12,000 disease states and applied this taxonomy to more than seven million documents across dozens of data sources on a platform called AllMedicine™. AllMedicine is updated daily with links from 2,000 sources and has 10 to 20 times more content than other physician resources, all neatly organized in the way doctors think about patient care. With the large taxonomy and number of records, the processing power required was a big question, and the team was concerned that the process to properly index the data could take months. However, using Fusion, the team efficiently and elegantly indexes sources and applies their taxonomy on a regular basis, with a full index taking just a few hours each day.

For AllMedx, Fusion has been intuitive, efficient, and dependable. “We’ve never been down. It’s been very stable, 100% stable,” says Grose. “Our physician users are getting precisely the experience that we envisioned when building AllMedx.”

The site’s popularity and reputation are steadily increasing; AllMedx already has 125,000 physician users and hopes to reach its goal of 250,000 physician users by the end of the year. The AllMedx team plans to index up to 2,000 additional medical sites, which will allow them to serve each medical specialty with a robust variety of quick and easy-to-navigate resources. Based on physician user feedback, the AllMedx team is confident the site will increase access to critical, clinical point-of-care information that will help improve patient care.

The post Physicians Improve Patient Treatment With AllMedx and Lucidworks Fusion appeared first on Lucidworks.

Establishing a Strategic Vision for Our Library’s Web Presence / Library Tech Talk (U of Michigan)

Picture of a spider web

Over the past 20 years, the University of Michigan Library has led the way on creating digital collections and establishing best practices around digital preservation that have become benchmark standards for other libraries. However, as our web presence expanded, it became increasingly difficult to adapt it at scale, keep pace with the changing needs of research, and create cohesion between a growing number of applications, sites, and services. It eventually became clear that a new model for web governance was needed. In this post, learn about the library’s history around its web governance and what led us to establish a new committee to create a vision and strategy for our web presence. You’ll also read about some of the committee’s accomplishments so far and learn how the committee’s members are supporting the recently launched Library Search application and the ongoing website redesign.

Save the dates! Samvera Connect 2019 / Samvera

We are pleased to announce that Samvera’s 2019 Connect conference will be held at Washington University in St. Louis, St. Louis, Missouri, during the week beginning Monday, October 21st.

We realize this is the week following DLF and that having two conferences many of our community typically attend so close together can be problematic—but better than having them conflict outright, and these dates were the best remaining option.  We will work in future years to have our organizations coordinate on scheduling.

Monday the 21st will be a Samvera Partner meeting.  Connect workshops will take place on Tuesday the 22nd and the Connect conference itself will run from the morning of Wednesday the 23rd through Friday the 25th.

We anticipate that registration will open early in June—in the meantime please hold the dates and feel free to share this information with others!

Advance notice that the subsequent Connect will be in the fall of 2020 at the University of California Santa Barbara.

On behalf of the Steering and 2019 Host Committees,

Andrew Rouner (Chair, host committee)
Richard Green (Chair, Samvera Steering Group)

The post Save the dates! Samvera Connect 2019 appeared first on Samvera.

Learning how much I have to learn - What I learned in 2018 and what I'm hoping to start learning in 2019 / Hugh Rundle

At the beginning of last year I made a list of things I wanted to learn:

  • Koha templating and themes
  • MySQL
  • Perl
  • Unit testing
  • How to cook bagels
  • To read long texts immersively again
  • Remembering names better

It's a little awkward to look at now, because I didn't make much progress on anything really except learning to read long texts immersively again. This is why you shouldn't make New Year Resolutions.

What I did learn in 2018

It's not like I learned nothing last year though - just not the things I thought maybe I wanted to learn. Some of the things I learned in 2018 are:

  • how to publish a package on npm
  • how Git tags and GitHub releases work
  • how to host a Mastodon instance and, kinda, how Docker Compose works
  • Visual Studio Code exists and is amazing
  • the four day, 32 hour working week is significantly superior to the five day working week. (see also: learn to read long texts immersively again)
  • there are people working in universities whose entire job consists of trying to find out whether researchers also working at the same university have published any research papers recently. 🙃
  • some people get so much from newCardigan cardiParties that they list how many they attended in their end-of-year review blog posts!

Naturally, now that I'm working with academic librarians I've also learned just how much I don't know about how academic and research librarians work - I've been surprised by how many things are really just the same as public libraries, but the differences are really different and things change quickly.

What I'm learning in 2019

These aren't really goals for learning in the next 12 months, because to varying degrees they are all life-long projects, but a short list of things I want to learn, or learn much more about, this year is:

  • Python
  • Bash scripting
  • Australian First Nations cultural awareness - not exactly a small project, as I get older I realise how much I don't know and how much I was lied to as a child
  • how to shut up and listen

I guess time will tell if I'm more successful with this list than I was in 2018.

how the light gets in / Bethany Nowviskie

I took a chance on a hackberry bowl at a farmer’s market—blue-stained and turned like a drop of water. It’s a good name for it. He had hacked it down at the bottom of his garden. (They’re filling in the timber where the oaks aren’t coming back.)

But the craftsman had never worked that kind of wood before, kiln-dried at steamy summer’s height. “Will it split?”

It did. Now it’s winter, and I make kintsukuroi, a golden repair. I found the wax conservators use on gilded picture-frames, and had some mailed from London. It softens in the heat of hands.

Go on. Let the dry air crack you open. You can break and be mended again.

hackberry bowl, repaired

Public Domain Day advent calendar #24: The Night Before Christmas (recitation with music and drawings) by Hanna van Vollenhoven and Grace Drayton / John Mark Ockerbloom

Christmas Eve is a good time to revisit Clement Clarke Moore’s poem “A Visit from St. Nicholas”.  In the public domain since the mid-19th century, it’s been adapted many times, and I’ve already featured a chorale adaptation by Frances McCollin in a previous calendar entry.  But there’s no limit to the number or variety of adaptations people can make to works in the public domain.  And there’s another adaptation from 1923 that’s somewhat better-known, and worth a look.

In 1923, the Boston Music Company published “The Night Before Christmas: A Spoken Song or Recitation, by Hanna Van Vollenhoven”.  Van Vollenhoven, a Dutch woman who had immigrated to New York around the time of the first World War, was a fairly well-known pianist and composer at the time, with performances on the radio and articles and tour promotions in magazines like The Musical Monitor.  It’s not surprising, then, that she was billed prominently on the cover of the piano music she wrote to accompany a recitation of the classic Christmas poem, and that the copyright to her music was renewed in 1950.

But the edition featuring Van Vollenhoven’s music may now be better known for the contributions of its other creator, artist Grace Drayton.   Trained in Philadelphia, Drayton was an early cartoonist for the Hearst syndicate, sometimes in collaboration with her sister, Margaret Gebbie Hays.  She also illustrated books and stories in magazines like St. Nicholas, and created a popular line of “Dolly Dingle” paper dolls.

Her most lasting creations are the Campbell’s Soup Kids, who she designed in 1904 and who continue to appear in Campbell’s advertising and publicity items to this day.  The Campbell Kids cast has grown and changed somewhat in appearance since Drayton’s initial unsigned drawings were published, but they still have a notable resemblance to the originals. A 2017 article by Kate Kelly at America Comes Alive shows examples of Drayton’s drawings of the Campbell Kids and other characters.

The Boston Music Company’s 1923 “Night Before Christmas” featured a color illustration by Drayton on the cover, and additional drawings by Drayton made appearances through the book’s 16 pages.  The cover illustration shows a jolly man in a red suit with white fur lining, carrying a sack full of toys leaning over two children in bed.  All the figures have the sort of round faces and rosy cheeks that also appear on Dolly Dingle and the Campbell Kids– St. Nicholas himself looks like he could be an overgrown child.  Drayton’s illustrations were copyrighted and renewed as well, and both Drayton’s pictures and Van Vollenhoven’s music will join the public domain in the US eight days from now.

I haven’t to date featured other artwork in this advent calendar, largely because it’s often difficult to say with certainty whether a given piece of art created in 1923 is joining the US public domain in 2019, is already in the public domain, or will remain under copyright.  The copyright system in the US in the early 20th century was largely designed for widely copied work, with the time of publication being the start of that work’s copyright term, or its entry into the public domain if copyright was not then claimed as required.  But it’s usually not obvious when a one-of-a-kind work of art, like a Picasso painting or a Duchamp sculpture, was “published” for the purposes of US copyright law, or if its publication met requirements for claiming copyright (requirements that themselves have changed over time, particularly for non-US works).  It’s somewhat easier to tell for artworks designed for publication, such as Drayton’s book and magazine illustrations, which tended to be published, registered, and renewed (or not) much like books and magazines were.  But most such works of art, if renewed at all, were renewed as part of the publications in which they appeared.  Very few works of art have their own renewed copyrights.  The renewal for Drayton’s illustrations for “The Night Before Christmas”, for instance, is one of only 114 renewals for “prints and pictorial illustrations” filed in 1950.

Because of that renewal, and the 1950 renewal for Van Vollenhoven’s music, I can be sure that their “Night Before Christmas” book will be among the new arrivals in the American public domain in eight days.  I hope that it will be a welcome gift in the coming new year.  And I hope that all those waiting for presents from St. Nicholas tonight get welcome gifts as well.

2019 update: Link to illustrated score of The Night Before Christmas: A Spoken Song or Recitation, now in the US public domain, courtesy of the Frances G. Spencer Collection of American Popular Sheet Music at  Baylor University.

Public Domain Day advent calendar #28: Parisian Pierrot by Noël Coward / John Mark Ockerbloom

One of the hits of the 1923 London theater season was the musical revue London Calling!  It was the first publicly performed musical by Noël Coward, who starred in the 1923 production alongside Gertrude Lawrence, and who continued to write and perform theatrically for nearly 50 years afterwards.  With additional scripting by Ronald Jeans, additional music by Philip Braham and others, and some tap-dance choreography by a young Fred Astaire, the revue strung together a couple of dozen songs, dance routines, and sketches.  One innovative segment involved a stereoscopic shadowgraph, a then-new form of three-dimensional display that audiences viewed through special glasses.

I’d like to be able to say that the show will be in the US public domain a few days from now, like the other 1923 works I’ve been featuring in the calendar.  But I’m afraid it won’t be– at least, not in its entirety.  The problem is that performing a work in public doesn’t actually start its copyright term under US copyright law.  Up until 1977, registering a work or publishing it did start the term, but public performance of a dramatic or musical work doesn’t itself count as publication for the purposes of US copyright.  That’s why, for instance, the play Peter Pan is still under copyright in the US; even though it opened in 1904, its script wasn’t published until 1928, and that’s when its 95-year copyright term started.
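The arithmetic behind these dates can be sketched in a few lines of Python. This is only an illustration under two assumptions: the 95-year term mentioned above applies, and the term runs through the end of the calendar year in which it expires, so the work becomes free to use on the following January 1. The function name is hypothetical.

```python
US_TERM_YEARS = 95  # term length cited in the post for works of this era


def us_public_domain_year(term_start_year: int) -> int:
    """Return the year on whose January 1 a work enters the US public domain.

    Assumes the term lasts through December 31 of the 95th year after
    the year in which the term started (by publication or registration).
    """
    return term_start_year + US_TERM_YEARS + 1


# Term start years taken from the examples in the post.
for label, start in [
    ("work published or registered in 1923", 1923),
    ("Peter Pan script, published 1928", 1928),
]:
    print(f"{label}: public domain on January 1, {us_public_domain_year(start)}")
```

The function takes the year the term started, not the year of first performance, which is the distinction the paragraph above turns on.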

It’s not clear exactly when the US copyright clock started running for London Calling! as a complete show.  It does not appear to have been registered as a play in 1923, and I haven’t found a book in the WorldCat catalog that consists of the entire show, other than some very recent publications.  Various parts of the show have been published separately over time, though.  Collections of Coward’s sketches have been published at various times, some of which include the sketch scripts from the show.  The individual songs have also been published as sheet music at different times since 1923.

One of the better-known songs of the production is “Parisian Pierrot”, written by Coward for Lawrence to sing in a “Pierrot” clown costume. The singer laments that while you may be “society’s hero” on the outside, on the inside you can have “spirits at zero”, knowing that even though “the rue de la Paix is under your sway” at present, “your star will be falling as soon as the clock goes ’round”.  Written in second-person, the song was one of Coward’s first hits, performable with a variety of singers and contexts.  Coward himself sang the song on a 1936 recording, and Julie Andrews also performed it in Star!, a 1968 film on the life of Lawrence.

With a 1923 copyright registration that was renewed in 1950, “Parisian Pierrot” joins the public domain in the US four days from now, along with some, but not all, of the other songs from London Calling!  For instance, the copyright term for “You Were Meant For Me”, another hit song from the show, appears to begin in 1924, so that song joins the public domain here one year after “Parisian Pierrot” does.

Because musical revues tend to be loose assemblages, we’re not missing out as much getting the show into the public domain piecewise as we would if it were the sort of tightly integrated dramatic production that more recent musicals tend to be. Still, I could see value in a knowledge base containing publishing history and copyright data of theatrical productions generally, so that we could determine more readily when such works join the public domain in whole or in part.

I’m a bit busy myself with a serials copyright knowledge base to take on drama as well. But if anyone else is doing it or plans to take it on, I’d love to hear about it.

2019 update: Link to piano-vocal sheet music for “Parisian Pierrot”, now in the US public domain, courtesy of the Lester H. Levy Sheet Music Collection at Johns Hopkins University.

Public Domain Day advent calendar #30: In the Orchard by Virginia Woolf / John Mark Ockerbloom

“The world broke in two in 1922 or thereabouts,” Willa Cather wrote in a preface to a 1936 collection of essays.  Many read her remark as referring to major changes in literary forms and styles underway in the early 1920s, particularly the rise of modernist and experimental literature.  We’ve already looked at Jean Toomer’s Cane, a 1923 novel using some of those new styles, in a previous calendar entry.  In that same year,  Virginia Woolf published two short stories still read today as key examples of modernist literature, employing stream-of-consciousness techniques to illuminate a slice of their main characters’ lives.  One of those stories will join the US public domain the day after tomorrow.  The other one is already there.

The story already in the public domain is “Mrs. Dalloway in Bond Street”, which relates a shopping trip of the title character and the thoughts and memories she experiences while going out to buy gloves.  Much of what action exists in the story is internal or told in flashback, though the surface narration includes a “violent explosion” at the end whose nature is not made clear in the short story.  Woolf later reworked and expanded this short story into the book Mrs. Dalloway, which was published in 1925, and will remain under copyright in the US for a couple of years longer.

Mrs. Dalloway’s original 1923 short story, however, is already in the public domain.  It was first published in the July 1923 issue of the American literary magazine The Dial, and there was no copyright renewal filed either for the magazine issue or for the story.  (There was a renewal filed in 1953 for the 1925 book, but that renewal is too late to cover the 1923 story.)

“Mrs. Dalloway in Bond Street” takes place over a small portion of a day. “In the Orchard” features an even shorter slice of life: only about a second, when the main character lies in a chair outdoors on the verge of sleep, and then leaps up exclaiming “Oh, I shall be late for tea!” But as in Mrs. Dalloway’s story, the main point is not the outward action, but a depiction of all that goes on around her, and in her own thoughts.  Woolf’s shifts of perspective from Miranda lying in her chair to things happening above her and around her are reminiscent of a wide-scene painting or a cinematic pan and zoom, though the narrative relates a combination of sounds, sights and thoughts that a painting or a film can only partially convey.

“In the Orchard” was first published in the April 1923 issue of The Criterion, a British literary magazine.  Again, there’s no renewal for the magazine issue or for the short story.  But this story, unlike “Mrs. Dalloway in Bond Street”, is still under copyright in the US, at least for two more days.

Why the difference?  The reason is a trade agreement that the US made, and enacted into law, to exempt many works first published abroad from copyright formality requirements such as registration, renewal, and notice.  (I alluded to this exemption in a previous calendar entry.)  Eligible works were not only exempted from such requirements, but retroactively brought back into copyright, if they would still be under copyright as registered and renewed works.

The Copyright Office’s Circular 38A describes in detail the rules of eligibility for copyright restoration.  Here’s how they apply to these stories: Virginia Woolf was British, and her works were under copyright protection in Britain in 1996, when the copyright restorations went into effect.  So works of hers that were first published outside the US at least 30 days before their first US publication are eligible for restoration.  As I mentioned, The Criterion, where “In the Orchard” first appeared, was published in the UK (and not, to my knowledge, in the US).  The first US publication of the story that I’m aware of, in the September 1923 issue of the American magazine Broom, was more than 30 days after the April 1923 Criterion, so copyright restoration applies.  On the other hand, because “Mrs. Dalloway in Bond Street” was first published in the US magazine The Dial, that story is not eligible for copyright restoration.
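The test applied to the two stories can be sketched as a small Python check. It encodes only the conditions mentioned in this post, not the full rules in Circular 38A, and the record type, field names, and function name are all hypothetical. The post gives only months of publication, so the day-of-month values below are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PublicationHistory:
    """Hypothetical record; field names are illustrative only."""
    title: str
    first_foreign_publication: Optional[date]  # None if no prior publication abroad
    first_us_publication: Optional[date]       # None if never published in the US
    protected_in_source_country_in_1996: bool


def eligible_for_restoration(work: PublicationHistory) -> bool:
    """Apply only the conditions described in the post (a simplification)."""
    if not work.protected_in_source_country_in_1996:
        return False
    if work.first_foreign_publication is None:
        # First published in the US: the exemption does not apply.
        return False
    if work.first_us_publication is None:
        return True
    gap = work.first_us_publication - work.first_foreign_publication
    return gap.days >= 30  # "at least 30 days before", following the post's wording


# Day-of-month values are placeholders; the post gives months only.
in_the_orchard = PublicationHistory(
    "In the Orchard", date(1923, 4, 1), date(1923, 9, 1), True)
mrs_dalloway = PublicationHistory(
    "Mrs. Dalloway in Bond Street", None, date(1923, 7, 1), True)

for work in (in_the_orchard, mrs_dalloway):
    print(work.title, "->", eligible_for_restoration(work))
```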

In this case, the difference will be moot in two days, when “In the Orchard” joins “Mrs. Dalloway in Bond Street” in the public domain at the completion of its full 95-year term.  If you haven’t read this story before, or aren’t sure you’d like its style of literature, you’ll have a good opportunity to check out this brief story then, for a little expenditure of time, and no expenditure of money.

We’ve almost reached the end of this Public Domain Day advent calendar.  If you’d like to continue reading about public domain works in the new year, though, I recommend checking out The Public Domain Review.  It regularly publishes essays about all manner of public domain works, both those newly arrived there and those that have been there for a while.  The authors of the essays are generally experts on the works they write about, and can also spend more time discussing the works and their contexts than I do here.  If you’re already a fan of the Public Domain Review, you may want to consider supporting it by buying some of its merchandise or making a donation.

One of that site’s regular features is an annual review of some of what’s coming into the public domain, both in the US and in other countries, in the coming year.  The latest in this series, Class of 2019, has just been posted, and includes a mention of the Woolf story featured in this post.

2019 update: Link to full text of “In the Orchard” as published in the September 1923 issue of Broom, now in the US public domain, courtesy of the Blue Mountain Project

Public Domain Day advent calendar #25: Christmas Day at Sea by Joseph Conrad / John Mark Ockerbloom

“In all my twenty years of wandering over the restless waters of the globe I can only remember one Christmas Day celebrated by a present given and received.”

By the time Joseph Conrad wrote that line in 1923, he had not only his twenty-year career at sea behind him, but also most of a writing career pursued on land for thirty years afterwards.  Readers familiar with his work, which includes novels like Heart of Darkness (1899), Nostromo (1904), and Lord Jim (1900), as well as short stories like “The Secret Sharer” (1910), know not to expect a light and cheery Christmas tale from Conrad.  His only earlier story set at Christmas was Typhoon (1902), where sailors fight for their lives in a storm that strikes their ship on a day that happens to be December 25.

Though Conrad himself had little use for Christmas (or Christianity generally), he had noted in a letter to his agent that stories with a link to Christmas were given ample space in December periodical issues.  That may have had something to do with his decision to write the short memoir “Christmas Day at Sea”, which ran in the December 1923 issue of the American magazine The Delineator and the December 24, 1923 issue of London’s Daily Mail.  It joins the public domain in the US seven days from today (and is already in the public domain in most other countries).

Conrad writes that on a working ship, Christmas was a day to note, but not to make a fuss over.  He underlines this last point by noting how distraction from one’s job at sea on any day of the year could be disastrous, as on the Christmas Day that his ship narrowly misses colliding with a steamer that suddenly appeared out of a thick fog.  The mood that dominates Conrad’s Christmas memoir is the sense of isolation of those at sea.  The present-giving occasion Conrad describes takes place in 1879, “long before there was any thought of wireless message”, when his ship is eighteen days out of Sydney, and encounters another ship with its sails oddly furled.

The other ship turns out to be an American whaler, two years out of New York, that has not touched land for over two hundred days.  The captain of Conrad’s ship has “an enormous bundle” of newspapers they had picked up in Sydney placed in a keg along with two boxes of figs, and tossed into the rough seas towards the whaler.  Despite “rolling desperately all the time”, the whaler manages to lower a boat, pick up the keg, and signal seven words of greeting and news to send back to America before the ships part company.

As it happened, Christmas 1923 was Joseph Conrad’s last Christmas.  He died the following August, and his younger son John renewed the copyright to his Christmas memoir in 1950.  To me, the starkness of Conrad’s 1923 essay on Christmases at sea (which he characterizes as “fair to middling… down to plainly atrocious”) helps reveal by contrast what many of us seek out in Christmas.  Specifically, there’s a yearning for connection, whether it’s in the religious sense of “God-with-us”, or in the person-to-person connections we make and renew in giving gifts, exchanging cards and letters, or sharing a festive meal.

I’ve valued being with people I love on Christmas, though there have always been some people that I can’t be with this day, for one reason or another.  So I plan to make some phone calls later today, remember some of the people I’ve spent past Christmases with, and do a little sharing online, including sending out this post.  (I’m also gratified by initiatives like the #joinin hashtag being used on Twitter, to promote connecting with strangers over the holiday.)  I hope all of you reading this make or renew some connections today that you or others crave.  To all who celebrate it, however you do so, merry Christmas!

2019 update: Link to full text of “Christmas Day at Sea” as published in the December 1923 issue of The Delineator, now in the US public domain, courtesy of Conrad First.  (This version is somewhat different from longer versions published elsewhere.)

It's "academic" / CrossRef

We all know that writing and publishing is of great concern to those whose work is in academia; the "publish or perish" burden haunts pre-tenure educators and grant-seeking researchers. Revelations that data had been falsified in published experimental results bring great condemnation from publishers and colleagues, and yet I have a feeling that underneath it all is more than an ounce of empathy from those who are fully aware of the forces that would lead one to put one's thumb on the scales for the purposes of winning the academic jousting match. It is only a slight exaggeration to compare these souls to the storied gladiators whose defeat meant summary execution. From all evidence, that is how many of them experience the contest to win the ivory tower - you climb until you fall.

Research libraries and others deal in great part with the output of the academe. In many ways their practices reinforce the value judgments made on academic writing, such as having blanket orders for all works published by a list of academic presses. In spite of this, libraries have avoided making an overt statement of what is and what is not "academic." The "deciders" of academic writing are the publishers - primarily the publishers of peer-reviewed journals that decide what information does and does not become part of the record of academic achievement, but also those presses that issue scholarly monographs. Libraries are the consumers of these decisions but stop short of tagging works as "academic" or "scholarly."

The pressure on academics has only increased in recent years, primarily because of the development of "impact factors." In 1955, Eugene Garfield introduced the idea that one could create a map of scientific publishing using an index of the writings cited by other works. (Science, 1955; 122:108–11) Garfield was interested in improving science by linking works so that one could easily find supporting documents. However, over the years the purpose of citation has evolved from a convenient link to precedents into a measure of the worth of scholars themselves in the form of the "h-index" - the measure of how often a person (not a work) has been cited. The h-index is the "lifetime home runs" statistic of the academic world. One is valued for how many times one is cited, making citations the coin of the realm, not sales of works or even readership. No one in academia could or should be measured on the same scale as a non-academic writer when it comes to print runs, reviews, or movie deals. Imagine comparing the sales figures of "Poetic Autonomy in Ancient Rome" with "The Da Vinci Code". So it matters in academia to carve out a world that is academic, and that isolates academic works such that one can do things like calculate an h-index value.
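For readers who have not seen one calculated: as the h-index is usually defined, it is the largest number h such that the author has h works that have each been cited at least h times, so it depends on how citations are spread across works and not only on the total. A minimal Python sketch, with made-up citation counts used purely for illustration:

```python
def h_index(citations):
    """Largest h such that h works have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Invented citation counts per work, not real data.
print(h_index([10, 8, 5, 4, 3]))
print(h_index([25, 8, 5, 3, 3]))
```

The calculation only works once someone has decided which works and which citing documents count, which is exactly the boundary-drawing described above.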

This interest in all things academic has led to a number of metadata oddities that make me uncomfortable, however. There are metadata schemas that have an academic bent that translates to a need to assert the "scholarliness" of works being given a bibliographic description. There is also an emphasis on science in these bibliographic metadata, with less acknowledgement of the publishing patterns of the humanities. My problem isn't solely with the fact that they are doing this, but in particular with how they go about it.

As an example, the metadata schema BIBO clearly has an emphasis on articles as scholarly writing; notably, it has a publication type "academic article" but does not have a publication type for "academic book." This reflects the bias that new scientific discoveries are published as journal articles, and many scientists do not write book-length works at all. This slights the work of historians like Ann M. Blair whose book, Too Much to Know, has what I estimate to be about 1,450 "primary sources," ranging from manuscripts in Latin and German from the 1500s to modern works in a number of languages. It doesn't get much more academic than that.

BIBO also has different metadata terms for "journal" and "magazine":
  • bibo:journal "A periodical of scholarly journal Articles."
  • bibo:magazine "A periodical of magazine Articles. A magazine is a publication that is issued periodically, usually bound in a paper cover, and typically contains essays, stories, poems, etc., by many writers, and often photographs and drawings, frequently specializing in a particular subject or area, as hobbies, news, or sports."
Something in that last bit on magazines smacks of "leisure time" while the journal clearly represents "serious work."  It's also interesting that the description of magazine is quite long, describes the physical aspects ("usually bound in a paper cover"), and gives a good idea of the potential content. "Journal" is simply "scholarly journal articles." Aside from the circularity of the definitions (journal has journal articles, magazines have magazine articles), what this says is simply that a journal is a "not magazine."

Beyond the snobbishness of the difference between these terms, there is the fact that one seeks in vain for a bright line between the two. There is, of course, the "I know it when I see it" test, and there is definitely some academic writing that you can pick out without hesitation. But is an opinion piece in the journal of a scientific society academic? How about a book review? How about a book review in the New York Review of Books (NYRB), where articles run to 2,000 to 5,000 words, are written by an academic in the field, and make use of the encyclopedic knowledge of the topic on the part of the reviewer? When Marcia Angell, professor at the Harvard Medical School and former Editor in Chief of The New England Journal of Medicine, writes for the NYRB, has she slipped off her academic robes for something else? She seems to think so. On her professional web site she lists among her publications a (significantly long) letter to the editor (called a "comment" in academic journal-ese) on a science journal article about women in medicine, but she does not include in her publication list the articles she has written for NYRB, even though these probably make more use of her academic knowledge than the comment did. She is clearly making a decision about what is "academic" (i.e. career-related) and what is not. It seems that the dividing line is not the content of the writing but how her professional world esteems the publishing vehicle.

Not to single out BIBO, I should mention other "culprits" in the tagging of scholarly works, such as Wikidata. Wikidata has:
  • academic journal article (Q18918145) article published in an academic journal
  • academic writing (Q4119870) academic writing and publishing is conducted in several sets of forms and genres
  • scholarly article (Q13442814) article in an academic publication, usually peer reviewed
  • scholarly publication (Q591041) scientific publications that report original empirical and theoretical work in the natural sciences
There is so much wrong with each of these, from circular definitions to bias toward science as the only scholarly pursuit (scholarly publication is a "scientific publication" in the "natural sciences"). (I've already commented on this in Wikidata, sarcastically calling it a fine definition if you ignore the various directions that science and scholarship have taken since the mid-19th century.) What this reveals, however, is that the publication and publisher define whether the work is "scholarly." If any article in an academic publication is a scholarly article, then the comment by Dr. Angell is, by definition, scholarly, and the NYRB articles are not. Academia is, in fact, a circularly-defined world.
Giving one more example, schema.org has this:
  • schema:ScholarlyArticle (sub-class of Article) A scholarly article.
Dig that definition! There are a few other types of article in schema.org, such as "NewsArticle" and "TechArticle" but it appears that all of those magazine articles would be simple "Article."

Note that in real life publications call themselves whatever they wish, with a hint at how terms may have changed over time: Ladies' Home Journal calls itself a journal, and the periodical published by the American Association for the Advancement of Science, Science, gives itself a "Science Magazine" domain name. "Science Magazine" just sounds right, doesn't it?

It's not wrong for folks to characterize some publications and some writing as "academic" but any metadata term needs a clear definition, which these do not have. What this means is that people using these schemas are being asked to make a determination with very little guidance that would help them separate the scholarly or academic from... well, from the rest of publishing output. With the inevitable variation in categorization, you can be sure that in metadata coded with these schemas the separation between scholarly/academic and not scholarly/academic writing is probably not going to be useful because there will be little regularity of assignment between communities that are using this metadata.

I admit that I picked on this particular metadata topic because I find the designation of "scholarly" or "academic" to be judgemental. If nothing else, when people judge they need some criteria for that judgement. What I would like to see is a clear definition that would help people decide what is and what is not "academic," and what the use cases are for why this typing of materials should be done. As with most categorizations, we can expect some differences in the decisions that will be made by catalogers and indexers working with these metadata schemas. A definition at least gives you something to discuss and to argue for.  Right now we don't have that for scholarly/academic publications.

And I am glad that libraries don't try to make this distinction.

Searching for hierarchical data in Solr / Brown University Library Digital Technologies Projects

Recently I had to index a dataset into Solr in which the original items had a hierarchical relationship among them. In processing this data I took some time to look into the ancestor_path and descendent_path features that Solr provides out of the box and see if and how they could help to issue searches based on the hierarchy of the data. This post elaborates on what I learned in the process.

Let’s start with some sample hierarchical data to illustrate the kind of relationship that I am describing in this post. Below is a short list of databases and programming languages organized by type.

Databases
  ├─ Relational
  │   ├─ MySQL
  │   └─ PostgreSQL
  └─ Document
      ├─ Solr
      └─ MongoDB
Programming Languages
  └─ Object Oriented
      ├─ Ruby
      └─ Python

For the purposes of this post I am going to index each individual item shown in the hierarchy, not just the children items. In other words I am going to create 11 Solr documents: one for “Databases”, another for “Relational”, another for “MySQL”, and so on.

Each document is saved with an id, a title, and a path. For example, the document for “Databases” is saved as:

{
  "id": "001",
  "title_s": "Databases",
  "x_ancestor_path": "db",
  "x_descendent_path": "db"
}

and the one for “MySQL” is saved as:

{
  "id": "003",
  "title_s": "MySQL",
  "x_ancestor_path": "db/rel/mysql",
  "x_descendent_path": "db/rel/mysql"
}

The x_ancestor_path and x_descendent_path fields in the JSON data represent the path for each of these documents in the hierarchy. For example, the top level “Databases” document uses the path “db” whereas the lowest level document “MySQL” uses “db/rel/mysql”. I am storing the exact same value in both fields so that later on we can see how each of them provides different features and addresses different use cases.

ancestor_path and descendent_path

The ancestor_path and descendent_path field types come predefined in Solr. Below is the definition of the descendent_path in a standard Solr 7 core:

$ curl http://localhost:8983/solr/your-core/schema/fieldtypes/descendent_path
      "class":"solr.PathHierarchyTokenizerFactory", "delimiter":"/"}},

Notice how it uses the PathHierarchyTokenizerFactory tokenizer when indexing values of this type and that it sets the delimiter property to /. This means that when values are indexed they will be split into individual tokens by this delimiter. For example the value “db/rel/mysql” will be split into “db”, “db/rel”, and “db/rel/mysql”. You can validate this in the Analysis Screen in the Solr Admin tool.
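If you don't have the Analysis Screen handy, the splitting described above is easy to imitate. The snippet below is only a rough Python approximation of what the tokenizer produces for a simple path; it is not Solr's code, and the function name is my own invention.

```python
def path_hierarchy_tokens(value, delimiter="/"):
    """Return the cumulative prefixes of a path, shortest first.

    A rough imitation of the tokens produced for a path value;
    this is an illustration, not Solr's implementation.
    """
    parts = value.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


print(path_hierarchy_tokens("db/rel/mysql"))
# ['db', 'db/rel', 'db/rel/mysql']
```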

The ancestor_path field type is the exact opposite: it uses the PathHierarchyTokenizerFactory at query time and the KeywordTokenizerFactory at index time.

There are also two dynamic field definitions *_descendent_path and *_ancestor_path that automatically create fields with these types. Hence the wonky x_descendent_path and x_ancestor_path field names that I am using in this demo.

Finding descendants

The descendent_path field definition in Solr can be used to find all the descendant documents in the hierarchy for a given path. For example, if I query for all documents where the descendant path is “db” (q=x_descendent_path:db) I should get all documents in the “Databases” hierarchy, but not the ones under “Programming Languages”. For example:

$ curl "http://localhost:8983/solr/your-core/select?q=x_descendent_path:db&fl=id,title_s,x_descendent_path"
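To make the matching logic concrete without a running Solr, here is a small in-memory imitation in Python. It assumes the behavior described earlier: the stored path is expanded into its prefixes at index time, and the query value is treated as a single keyword. The paths “db”, “db/rel/mysql”, and “db/doc/solr” come from this post; the remaining paths (in particular everything under “lang”) are made up for the illustration.

```python
def path_prefixes(path, delimiter="/"):
    """Cumulative prefixes of a path, e.g. db/rel -> {'db', 'db/rel'}."""
    parts = path.split(delimiter)
    return {delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)}


# title -> path; a subset of the sample hierarchy (some paths are hypothetical)
DOCS = {
    "Databases": "db",
    "Relational": "db/rel",
    "MySQL": "db/rel/mysql",
    "Document": "db/doc",
    "Solr": "db/doc/solr",
    "Programming Languages": "lang",
    "Ruby": "lang/oo/ruby",
}


def find_descendants(docs, query_path):
    # Index side: each stored path is expanded into its prefixes.
    # Query side: the query value is kept whole.
    return [title for title, path in docs.items()
            if query_path in path_prefixes(path)]


print(find_descendants(DOCS, "db"))
# ['Databases', 'Relational', 'MySQL', 'Document', 'Solr']
```

Notice that the “Databases” document itself is part of the result, since its own path produces the token “db”.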

Finding ancestors

The ancestor_path field, not surprisingly, can be used to achieve the reverse. Given the path of a document, we can query Solr to find all its ancestors in the hierarchy. For example, if I query Solr for the documents where x_ancestor_path is “db/doc/solr” (q=x_ancestor_path:db/doc/solr) I should get “Databases”, “Document”, and “Solr” as shown below:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&fl=id,title_s,x_ancestor_path"

If you are curious how this works internally, you could issue a query with debugQuery=true and look at how the query value “db/doc/solr” was parsed. Notice how Solr splits the query value by the / delimiter and uses something called SynonymQuery() to handle the individual values as synonyms:

$ curl "http://localhost:8983/solr/your-core/select?q=x_ancestor_path:db/doc/solr&debugQuery=true"
    "parsedquery":"SynonymQuery(Synonym(x_ancestor_path:db x_ancestor_path:db/doc x_ancestor_path:db/doc/solr))",
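The same idea can be imitated in a few lines of Python. This sketch assumes the behavior shown in the parsed query above: the query value is expanded into its prefixes (the “synonyms”) while each stored path is kept as a single keyword. Only “db”, “db/doc”, and “db/doc/solr” come from this post; the other paths are hypothetical.

```python
def path_prefixes(path, delimiter="/"):
    """Cumulative prefixes of a path, e.g. db/doc -> {'db', 'db/doc'}."""
    parts = path.split(delimiter)
    return {delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)}


# title -> path; a subset of the sample hierarchy (some paths are hypothetical)
DOCS = {
    "Databases": "db",
    "Relational": "db/rel",
    "Document": "db/doc",
    "Solr": "db/doc/solr",
    "MongoDB": "db/doc/mongodb",
}


def find_ancestors(docs, query_path):
    # Query side: the query path is expanded into its prefixes.
    # Index side: each stored path is kept whole.
    wanted = path_prefixes(query_path)
    return [title for title, path in docs.items() if path in wanted]


print(find_ancestors(DOCS, "db/doc/solr"))
# ['Databases', 'Document', 'Solr']
```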

One little gotcha

Given that Solr is splitting the path values by the / delimiter and that we can see those values in the Analysis Screen (or when passing debugQuery=true) we might expect to be able to fetch those values from the document somehow. But that is not the case. The individual tokens are not stored in a way that you can fetch them, i.e. there is no way for us to fetch the individual “db”, “db/doc”, and “db/doc/solr” values when fetching document id “007”. In hindsight this is standard Solr behavior but something that threw me off initially.

Digital Preservation Network Is No More / David Rosenthal

In Why Is the Digital Preservation Network Disbanding? Roger Schonfeld examines the demise of the Digital Preservation Network which was announced last month:
An initial announcement said directly that "After careful analysis of the Digital Preservation Network's membership, operating model, and finances, the Board of Trustees of DPN passed a resolution to affect an orderly wind-down of DPN," including committing to consultations with each member to ensure that content would not be lost in the wind-down. Shortly thereafter, messages came out from DPN's hubs, both individually including HathiTrust, and collectively, characterizing their operating and financial strength and ability to provide for an orderly transition. Because DPN was not itself directly preserving anything but rather a broker for preservation services by underlying repositories, it does not appear that any content will be put at risk.
Below the fold, I look at various views of the lessons to be learned.

I'm often critical of Ithaka's and Schonfeld's work, so it is important to start by saying that I agree with much of what he wrote. He starts thus:
The vision for the Digital Preservation Network (DPN) was outlined in an early overview that illustrates much of the founders' thinking. It was established to solve two problems at once. First, its founders observed that not all preservation services at the time were as resilient as one might hope, and so, "the heart of DPN is a commitment to replicate the data and metadata of research and scholarship across diverse software architectures, organizational structures, geographic regions, and political environments." Second, as far too little scholarly content was being preserved, DPN would also enable existing preservation capacity to be utilized for a wider array of purposes, recognizing that, "once that infrastructure is in place, it can be extended at much lower marginal costs." In one sense, DPN thereby offered an elegant technical solution. But as elegant as it may have been technically, its product offering was never as clear as it could have been. And as much as it accomplished, it ultimately could not be sustained.
Schonfeld is correct that diversity was a theme of DPN from the start, but he doesn't have the full story. I took part in the very first meeting that led to the DPN. It was hosted on his mega-yacht by the leader of an organization that held a vast video collection of significant academic interest, which he was anxious to see preserved for posterity. The leader was very wealthy, with the kind of donor potential that ensured the attendance of a number of major University Librarians including James Hilton (Michigan) and Michael Keller (Stanford). I was one of the few technical people in the meeting. We pointed out that the scale of the video collection meant that the combined resources of the assembled libraries could not possibly store even a single copy, much less the multiple copies needed for robust preservation.

Each of the libraries represented had made significant investments in establishing an institutional repository, which was under-utilized due to the difficulty of persuading researchers to deposit materials. With the video collection out of the picture as too expensive, the librarians seized on diversity as the defense against the monoculture threat to preservation. In my view there were two main reasons:
  • Replicating pre-ingested content from other institutions was a quicker and easier way to increase the utilization of their repository than educating faculty.
  • Jointly marketing a preservation service that, through diversity, would be more credible than those they could offer individually was a way of transferring money from other libraries' budgets to their repositories' budgets.
Alas, this meant that the founders' incentives were not aligned with their customers'. Despite this, the marketing part worked well. As Schonfeld writes:
It was comparatively easy to get several dozen libraries to sign up to pay a $20,000 annual membership fee, especially after Hilton met with AAU presidents and pitched an early vision of DPN to them. One reader suggested to me that initial sign-ups may have been more out of courtesy or community citizenship than commitment.
The reader was right. Another example of this phenomenon is the contrast between LOCKSS, marketed to librarians, and the greater participation in Ithaka's Portico, marketed initially to University presidents by Bill Bowen, ex-President of Princeton and ex-President of the Andrew W. Mellon Foundation.

Schonfeld lists reasons for DPN's failure, with most of which I concur:
  • "[DPN] had a strong technical vision, but a clear product offering took time to emerge and the value proposition was not uniformly understood." I disagree that DPN's technical vision was strong. Given the commitment to infrastructure diversity, it was limited to implementing transfer of content and metadata, and fixity auditing between diverse mutually trusting repositories. Even this limited capability took far too long to achieve production, especially given that the basic functions were widely available off-the-shelf. As always, reconciling diverse metadata standards was a stumbling block, but this is more of an organizational than a technical issue. Members cannot be faulted for observing that a "clear product offering" was taking far too long to reach production.
  • "CIOs and others were comfortable that cloud solutions were secure enough for almost all purposes. ... DPN and its members were, ... unsuccessful in distinguishing the added value of a preservation solution from cloud storage." This is a continuing problem, compounded by accounting systems that subject (large but intermittent) capital expenses to more severe scrutiny than (smaller but recurring) operational expenses that, capitalized over an appropriate planning horizon, may be significantly larger. I will return to it in the report on cloud storage on which I'm about to start work.
  • "While memberships have proved to be a durable way to fund certain kinds of "clubs" such as professional organizations, they seem to be misaligned when there is a need to match value with actual products or services. Membership organizations necessarily seek a degree of consensus in their governing and cross-subsidization in their fee structures, while product organizations need to be able to deliver a clear solution to a well-defined problem for a reasonable price and to adapt their approach aggressively as marketplace conditions dictate." Cameron Neylon wrote the book, or rather the blog post, on this problem in 2016 with Squaring Circles: The economics and governance of scholarly infrastructures, and I commented at length here. Neylon wrote:
    Membership models can work in those cases where there are club goods being created which attract members. Training experiences or access to valued meetings are possible examples. In the wider world this parallels the "Patreon" model where members get exclusive access to some materials, access to a person (or more generally expertise), or a say in setting priorities. Much of this mirrors the roles that Scholarly Societies play or at least could play.
    In summary, DPN's "club goods" were not seen as justifying the membership dues. Neylon subsequently expanded his post into Sustaining Scholarly Infrastructures through Collective Action: The Lessons that Olson can Teach us. In a fascinating read he applies the analysis of Mancur Olson's The Logic of Collective Action: Public Goods and the Theory of Groups.
  • [Figure: British Library Real Income]
    "the academic community, and its libraries in particular, have moved into a period of more rigorous value assessment and more strategic resource allocation, ... the library and scholarly communications communities have created more small independent collaborative organizations than we can possibly sustain. We need, and are experiencing, consolidation." This is the crux of the matter. Preservation Is Not A Technical Problem, it is an economic problem. We know how to do it, we just don't want to pay enough from continually decreasing library discretionary funds to have it done that way. DPN lacked Portico's focused marketing pitch, lacked LOCKSS' focus on cost minimization, and lacked both of their business model's justification of protecting investment in expensive paywalled content. Even once DPN achieved production it was little used and thus vulnerable:
    only 27 members ever deposited content into DPN.
    DPN’s financial model needed members to submit additional content beyond 5 TB annual deposit to succeed. Only 2 of 60 members did.
Schonfeld is certainly right that a necessary consolidation is under way:
Consolidation in this sector can arise in one of two ways - through pivots and shut-downs as services decline and organizations fail, or through value-creating reorganizations and mergers among organizations that are doing fine but should be able to deliver more.
But I think he underestimates the risks. The differences in quality between different preservation services are evident only in the long term; the difference in price is evident immediately. And, importantly:
While research universities and cultural heritage institutions are innately long-running, they operate on that implicitly rather than by making explicit long-term plans.
This gives rise to a Gresham's Law of preservation, in which low-quality services, economizing on replication, metadata, fixity checks and so on, out-compete higher-quality services. Services such as DPN and Duracloud, which act as brokers layered on top of multiple services and whose margins are thus stacked on those of the underlying services, find it hard to deliver enough value to justify their margins, and are strongly incentivized to use the cheapest among the available services.

The point of DPN was to use organizational and technological diversity to mitigate monoculture risk. Unfortunately, as Brian Arthur's 1994 Increasing Returns and Path Dependence in the Economy described, technology markets have increasing returns to scale. Thus as services scale up diversity becomes an increasingly expensive luxury, and monoculture risk increases. Fortunately, the winner then acquires too-big-to-fail robustness (see Amazon).

Six Reasons to Switch From Endeca Today / Lucidworks

When you deployed Endeca, it was the best ecommerce search on the market. With state-of-the-art relevance, faceted search, and customer experience tools, it drove search for most of the large ecommerce sites. But that was then. With years since a major refresh, Endeca fails to meet customer expectations today.

Here are six reasons to ditch Endeca and switch to Lucidworks Fusion today:

1. AI-powered UX Converts More Browsers…Into Buyers

In the age of Amazon and Google, customers don’t expect to learn your site; they expect it to learn them. Leverage customer Signals and AI-powered search to determine intent and recommend products that meet your customers’ wants.

AI-powered recommendations result in higher average order size — leading to greater transactions and revenue.

“We’ve seen dramatic bumps in conversion rates and, overall, some of those key success metrics for transactional revenue within an order of magnitude of a 50% increase since the migration.”

– Marc Desormeau, Senior Manager, Digital Customer Experience, Lenovo

2. Head-n-Tail Analysis

Customers don’t always describe things in the same terms that your site does. An Artificial Intelligence technique called “Head-n-Tail Analysis” automatically fixes search keywords based on misspellings, word order, synonyms and other types of common mismatches.

3. Data & Expertise Protection

Your customer insights are what drive algorithms. Don’t be fooled into using blackbox products that take your expertise to refine their algorithms — taking away your data, your control — and your merchandising knowledge!

Your algorithms should be just that. Yours.

4. Stats-Based Predictions

Rules are great, but they shouldn’t be a workhorse. They can turn into a maintenance nightmare. Instead, use AI-powered search to rely on stats and reduce rules.

Save rules to boost seasonal items, new trends — or whenever your merchandise expertise says you should.

5. Powerful Analytics

App Insights allows your analysts to look at customers at both a personal and statistical level. You will also be able to see what works and what doesn’t using A/B testing and experiments.

6. Faster Indexing

You can’t sell what you can’t serve, so faster indexing lets you be more agile in your merchandising.

Further, make ancillary business data from ERP and supply chain systems readily available to drive search results and recommendations.

The post Six Reasons to Switch From Endeca Today appeared first on Lucidworks.

OpenSRF 3.1.0-beta released / Evergreen ILS

We are pleased to announce the release of OpenSRF 3.1.0-beta, a message routing network that offers scalability and failover support for individual services and entire servers with minimal development and deployment overhead.

OpenSRF 3.1.0-beta is a beta release of what will become OpenSRF 3.1.0. As this is a beta release, we encourage testing of it but do not recommend it for production use. Assuming that no showstoppers appear, we expect release of OpenSRF 3.1.0 to occur on 16 January 2019.

The beta release introduces several enhancements and major bugfixes, including:

  • Support for a more reliable backend for the WebSocket gateway.
  • The ability to queue incoming requests as a way of handling transitory spikes without dropping requests.
  • The ability to force the recycling of a drone that has just handled a long-running and memory-consuming request.
  • A significantly improved and more secure example NGINX configuration.
  • Better handling of transport errors by the WebSocket JavaScript client.
  • Support for Ubuntu 18.04 Bionic Beaver.
  • Dropping support for Debian Wheezy.

To download OpenSRF and view the full release notes, please visit the downloads page.

We would also like to thank the following people who contributed to the release:

  • Galen Charlton
  • Jeff Davis
  • Bill Erickson
  • Mike Rylander
  • Chris Sharp
  • Ben Shum
  • Remington Steed
  • Jason Stephenson

OCLC RLP Assessment Interest Group — lessons learned and ingredients for success / HangingTogether

In OCLC Research, we very much see ourselves as being a learning organization. In that spirit, we are always trying new things, and in the process learning what works and what doesn’t. In the OCLC Research Library Partnership, we are learning together with those of you that work at institutions in the OCLC RLP.

In 2018 we launched an interest group on Library Assessment. The Assessment Interest Group built on a three-part WebJunction Webinar Series: Evaluating and Sharing Your Library’s Impact. It was delightful to have a partnership within OCLC Research that combined the OCLC RLP with WebJunction, and leveraged the expertise of our colleague Lynn Silipigni Connaway, OCLC’s Director of Library Trends and User Research. On the OCLC RLP side, I worked with the group, alongside my colleagues Rebecca Bryant and Titia van der Werf, and we were joined by WebJunction colleague Jennifer Peterson. Rebecca, Titia, Jennifer and I are by no means assessment experts so we really were co-learning with the OCLC RLP cohort.

We’ve written about the Assessment Interest Group along the way. As a learning organization we also took some time to evaluate what worked and what did not go so well for us and our co-learners. We also interviewed two people who seemed to get a lot out of the course, Anne Simmons and Tessa Brawley-Barker, both from the (US) National Gallery of Art, to help us see what we could glean from successful learners.

First, a few negatives:

Be(a)ware of spam filters. To communicate with our learners, we set up a Google Group. We did this to give us the flexibility to not only send and share emails but also to potentially share documents. You can either directly add people to a Google Group or invite them to join, and we quickly learned that the “direct add” method works best. But alas! Invitations got routed to spam or were never received. Even with the direct add method, emails from the group were caught by hungry spam filters, which needed to be trained to accept emails from the group. We never wound up using the shared document functionality in Google Groups. The lesson here is that if you are setting up a new listserv of any type, make sure people are checking and correcting their spam filters! Otherwise, they will miss valuable communications and discussions.

Too much elapsed time? The WebJunction course spanned April to October, with the second of the three webinars in August. This left us with a lot of time between sessions one and two, and not a lot to discuss in the interim. Ideally, the sessions would have been more evenly spaced.

Not maximizing our resources. As I said previously, none of us helping to lead the interest group were assessment experts, but of course we had experts who were leading the WebJunction courses. We could have better leveraged the expertise of our presenters by including them in our discussion groups. This occurred to us rather late in the game but is something we would do if we had a second chance.

However, we had many positive lessons learned:

Alongside those takeaways, we also learned from our star students what worked. As more and more of us look to expand and extend our skills by learning in online courses and cohorts some of these reflections seem particularly apt.

Use prepared course materials. Anne and Tessa valued the Webinar Series Learner Guide and treated the guide as an assignment which they filled out and shared before each call. Treating the guide as a homework assignment helped to keep them accountable.

Use group meetings as deadlines. Tessa and Anne valued having the interest group calls on their calendars and treated them as deadlines for working through their assignments (the learner guide).

Leverage adjacent opportunities. Tessa took a Library Juice Academy “User Experience Research and Design” course concurrently with part of the webinar / interest group. This was beneficial because ideas presented in that class overlapped with and complemented what she learned in our group.

Learning is better together. Tessa and Anne work in the same small library but don’t normally work together. Participating in this learning opportunity together helped with an NGA goal of fostering more cross-functional collaboration. They reported that having a “learning partner” was a factor in their success (and as an added bonus they now know each other better as colleagues).

Mixing it up. A strength of the OCLC RLP is the diversity of institution types and geographic regions that are represented. Anne and Tessa both valued the mix of people who attended the calls, and the range of expertise, from novices to more seasoned assessment professionals.

As always, it is invigorating to work with those at our OCLC RLP institutions and with colleagues on our extended OCLC Research team. We have been applying our lessons learned from the Assessment Interest Group to a Research Data Management interest group that took place from October through December, and sharing those conversations here as well.

I hope if you have either pitfalls or success tips gleaned from leading or participating in an online learning or discussion group you will share that with us!

Proxying FeedBurner MyBrand for HTTPS with CloudFront and Lambda at Edge / Peter Murray

So I’m paying more attention to the DLTJ blog now, and one of the things I quickly noticed is that the Atom syndication feed was broken. Or, at least modern web clients would refuse to retrieve the feed. The problem turned out to be not with the feed file, but the web client refusing to connect to the server to retrieve it. The root cause was turning on HTTP Strict-Transport-Security (HSTS) for the “” domain sometime in the middle of the year last year.

I had done the work of moving all of the resources used by the blog to HTTPS over the summer, so I turned on the HSTS header:

strict-transport-security: max-age=31536000; includeSubDomains; preload

The practical impact of this header is to tell a web client (such as a browser) to use HTTPS only when connecting to this particular domain name. The includeSubDomains attribute tells the web client to apply this rule to all domains under it, which would include the subdomain where the syndication feed lives. The problem was that this subdomain is an alias for the FeedBurner service, and the FeedBurner service doesn’t have the security key to serve content from that domain name using HTTPS. We need a way to legitimately serve the feed file over HTTPS at that address.
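To make the includeSubDomains behavior concrete, here is a minimal sketch of how a web client might decide whether a host is covered by a stored HSTS policy. The function names and the example.org host names are hypothetical, and real browsers do more than this (policy expiry and preload lists, for example), so treat it as an illustration rather than a description of any particular client.

```javascript
'use strict';

// Parse a Strict-Transport-Security header value into its directives.
function parseHsts(headerValue) {
    const policy = { maxAge: null, includeSubDomains: false, preload: false };
    for (const part of headerValue.split(';')) {
        const [name, value] = part.trim().split('=');
        switch (name.toLowerCase()) {
            case 'max-age':
                policy.maxAge = parseInt(value, 10);
                break;
            case 'includesubdomains':
                policy.includeSubDomains = true;
                break;
            case 'preload':
                policy.preload = true;
                break;
        }
    }
    return policy;
}

// Decide whether a client that stored `policy` for `policyHost`
// must use HTTPS when connecting to `requestHost`.
function mustUseHttps(policy, policyHost, requestHost) {
    if (policy.maxAge === null || policy.maxAge <= 0) return false;
    if (requestHost === policyHost) return true;
    return policy.includeSubDomains && requestHost.endsWith('.' + policyHost);
}

// Hypothetical host names, for illustration only.
const policy = parseHsts('max-age=31536000; includeSubDomains; preload');
console.log(mustUseHttps(policy, 'example.org', 'feeds.example.org')); // true
console.log(mustUseHttps(policy, 'example.org', 'example.net'));       // false
```

With includeSubDomains set, the feed's subdomain is covered even though it points at a different service, which is exactly the situation that broke the feed.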

About FeedBurner

Back in the day when blog syndication feeds were far more popular than they are now, FeedBurner was a tool that blog authors could use to get statistics about how others used their content, as well as a way for blog posts to be turned into email-delivered newsletters. Wikipedia has a more complete description of FeedBurner, including how the company was bought by Google in 2007, three years after its founding. But it didn’t last. After four years at Google, the company announced the shutdown of the FeedBurner API, and a year later the discontinuation of AdSense for Feeds. Still, as a service retrieving syndication feeds from blogs and making them pretty, FeedBurner lives on. (As an aside, last year the Motherboard section of the Vice website had an interesting article about why Google can’t shut down FeedBurner.)

_MyBrand_ is a feature of FeedBurner that allowed one to use their own domain name for the syndication feed rather than the FeedBurner-supplied one. With a little bit of extra effort, this gives the blog author the ability to leave FeedBurner at some point without losing all of the users that subscribed to the FeedBurner syndication; you could just use another syndication feed service, or serve the feed yourself. In my case, I used the domain name

Fixing the HTTPS and HSTS problem

But now I’m left with the problem that FeedBurner can’t securely send my syndication feed using the MyBrand domain name. And I don’t think Google will ever put in the effort to allow that to happen. Heck, I’m somewhat surprised that FeedBurner is still around as a service, much less being enhanced. (Read the Motherboard article above for why this is surprising.) But I don’t want to turn off HSTS for the domain, so what can be done?

This blog is hosted using Amazon Web Services (AWS), and AWS has all of the tools I need to fix this problem: CloudFront, Certificate Manager, and Lambda@Edge.

Here is how the pieces fit together:

  1. CloudFront is going to retrieve, cache, and serve the FeedBurner-enhanced syndication file.
  2. Certificate Manager automatically creates an HTTPS security certificate that CloudFront uses to verify the domain name.
  3. Lambda@Edge is needed because I don’t want to proxy all of the FeedBurner syndication feeds – just the one that matters to me.

CloudFront distribution

Here are the screen captures of the setup of the CloudFront distribution in the AWS web console.

Screen capture of the AWS CloudFront distribution screen.

Nothing very unusual here, but note that Alternate Domain Names (CNAME) is set to the domain name that I’m publishing to the world, and SSL Certificate is set to the cert supplied by AWS.

Screen capture of the AWS CloudFront distribution screen.

Nothing unusual here either; note that the origin domain name is

Screen capture of the AWS CloudFront distribution screen.

This one is a little unusual because there is a “Lambda Function Associations” entry set for the Viewer Request event. I did not set this Lambda@Edge function on the AWS CloudFront screen; rather, I set it using the Lambda screen.

AWS Certificate Manager

There isn’t anything special about creating the cert using AWS Certificate Manager for this CloudFront distribution. You’ll need to verify ownership of the domain name, of course. Just follow the instructions in the AWS Certificate Manager documentation.


Lambda@Edge

If you have followed along this far, you’ll note that we set up a CloudFront distribution for all content located at the origin server. I don’t want to proxy all of FeedBurner’s feeds, just the one for my blog. So I have a function running on Lambda@Edge to filter out all CloudFront requests that don’t match my blog. Lambda@Edge uses JavaScript running in Node.js, so it looks like this:

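'use strict';

// Lambda@Edge viewer-request handler: proxy only this blog's feed.
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    const mainFeed = "DisruptiveLibraryTechnologyJester";
    // Path portions of the URLs that are passed through to the origin server.
    // The redirect target below must be in this list; add any other paths the feed needs.
    const passthroughUris = [
        `/${mainFeed}`,
    ];

    // Returned for every request that is not on the pass-through list.
    const notFoundResponse = {
        status: '404',
        statusDescription: 'Not Found',
        headers: {
            'content-type': [{
                key: 'Content-Type',
                value: 'text/plain',
            }],
            'content-encoding': [{
                key: 'Content-Encoding',
                value: 'UTF-8',
            }],
        },
        body: 'The feed you are looking for is not here.',
    };

    // Sends a request for the bare domain to the main feed,
    // using the Host header of the incoming request.
    const redirectResponse = {
        status: '302',
        statusDescription: 'Found',
        headers: {
            location: [{
                key: 'Location',
                value: `https://${headers.host[0].value}/${mainFeed}`,
            }],
        },
    };

    // Return after each callback so that only one response is ever sent.
    if (request.uri && request.uri == '/') {
        callback(null, redirectResponse);
        return;
    }
    if (request.uri && passthroughUris.includes(request.uri)) {
        callback(null, request);
        return;
    }

    callback(null, notFoundResponse);
};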

You can read the Lambda@Edge documentation for the specifics of how this code works, but basically:

  1. There is a list of partial URLs (just the path portion of the URL) that we are interested in passing through to the origin server (passthroughUris).
  2. We set up a response variable that tells CloudFront to return a 404-not-found HTTP status code.
  3. We set up another response variable that redirects the web client to our syndication feed.
  4. We test to see if the requested URL is just the root (/), and if so we return the redirect response variable.
  5. We test to see if the requested URL is one of the ones we want to pass through, and then pass it through.
  6. For everything else we return the not-found response variable.

This Lambda@Edge function is then linked to the “Viewer Request” event of the CloudFront distribution. The Viewer Request event is the first thing that happens at CloudFront – before CloudFront looks in its cache of files or tries to get the file from the origin server. At this event point we determine whether we want to handle the request or return the 404-not-found status code.
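One way to check this kind of routing before deploying it is to run the same three-way decision locally against hand-built viewer-request events. The sketch below is not the deployed function: `route`, `makeEvent`, the feed name `ExampleFeed`, and the host `feeds.example.org` are all hypothetical names chosen for illustration, and the event contains only the fields that the decision reads.

```javascript
'use strict';

// Hypothetical values, for illustration only.
const MAIN_FEED = 'ExampleFeed';
const PASSTHROUGH_URIS = [`/${MAIN_FEED}`];

// The same three-way decision as the handler, written as a function
// that returns its result instead of invoking a callback.
function route(request) {
    if (request.uri === '/') {
        const host = request.headers.host[0].value;
        return {
            status: '302',
            statusDescription: 'Found',
            headers: {
                location: [{ key: 'Location', value: `https://${host}/${MAIN_FEED}` }],
            },
        };
    }
    if (PASSTHROUGH_URIS.includes(request.uri)) {
        // Returning the request unchanged lets CloudFront continue
        // to its cache and, if needed, the origin server.
        return request;
    }
    return {
        status: '404',
        statusDescription: 'Not Found',
        body: 'The feed you are looking for is not here.',
    };
}

// Build a minimal viewer-request event with only the fields used above.
function makeEvent(uri, host) {
    return {
        Records: [{
            cf: {
                request: {
                    uri,
                    method: 'GET',
                    headers: { host: [{ key: 'Host', value: host }] },
                },
            },
        }],
    };
}

for (const uri of ['/', '/ExampleFeed', '/SomeoneElsesFeed']) {
    const request = makeEvent(uri, 'feeds.example.org').Records[0].cf.request;
    const result = route(request);
    // A passed-through request has no status property.
    console.log(uri, '->', result.status || 'pass through');
}
```

Keeping the decision in a function that returns a value makes it easy to test; a real handler would wrap it as `callback(null, route(event.Records[0].cf.request))`.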

For the future

Truth be told, this is a stop-gap measure. I’m not sure I can count on Google continuing to run FeedBurner, and I’m not sure I want to continue to force my readers to use a Google service. The author of the Motherboard article says:

I was stuck on FeedBurner for a while myself, and it annoyed me, to be honest. Fortunately, we’re not completely stuck with terrible options anymore. These days, I use FeedPress, which costs money but I recommend, though there are plenty of others, like Feedity and Feedio. (You can always go back to your old RSS or Atom feed, of course.)

I’ll look at one of these options to enhance the syndication feed. One of the things I really like about FeedBurner is giving blog readers the option of receiving posts by email. It seems like most people use services like TinyLetter or similar for this kind of functionality, so I know alternatives exist for blog-post-by-email readers. It feels at the moment, though, like the DLTJ blog is getting rebooted, and I want to focus attention on other fixes and enhancements. This CloudFront-proxied FeedBurner syndication feed will work for now.

Shaping the future of Open Knowledge in Nepal / Open Knowledge Foundation

This blog has been reposted from the Open Knowledge Nepal blog as part of our blog series of Open Knowledge Network updates.

Wrapping up 2018, we’d like to take this opportunity to thank everyone who supported us over the past year. In this cold winter season, we reflected on our key work of 2018 over a cup of coffee.

2018 has been a year of collaboration and growth. As we committed in last year’s blog post, we remain focused and will dedicate all our energy and resources to improving the state of Open Data in South Asia.


Key highlights of 2018 are:

  1. A celebration of International Open Data Day 2018 in Nepal: We collaborated with the Open Nepal community of practice to organize Open Data Day 2018, which is one of the biggest celebrations of open data. Unlike previous years, we celebrated this year’s Open Data Day slightly differently: we put the core concept of OPEN into practice by hosting it in a publicly accessible place. We also organized the side event “Data-a-thon for Journalist” in collaboration with the Center for Data Journalism Nepal to train journalists.
  2. Launching Open Data Nepal: We invested much of our technical and human resources to build and launch a crowdsourced open data portal to make Nepal’s data accessible online. So far, more than 600 datasets have been harvested from various government sources, and a large volunteer team of data wranglers is working actively to increase the number of datasets.
  3. Joining Open Nepal community of practice and knowledge hub: With the aim of demonstrating our dedication and enthusiasm towards open data, we joined the Open Nepal community of practice and knowledge hub. Currently, 10 organizations are part of the Open Nepal community.
  4. Hack for Nepal Initiative: In collaboration with Code for Nepal, we launched the Hack for Nepal initiative and hosted the AngelHack Hackathon for the first time in Nepal. This was our first experience of hosting an overnight hackathon, where more than 70 participants competed to build ‘Seamless Technology for Humanitarian Response’.
  5. Improving AskNepal platform: To make data and information requests easier, we joined hands with Code for Nepal to improve the AskNepal platform, which anyone can use to request information from different bodies of the Nepal Government. AskNepal is run and maintained by Open Knowledge Nepal and Code for Nepal, in partnership with mySociety.
  6. Travelling to 3 provinces for open data training: In collaboration with YUWA, we travelled to 6 districts of Nepal to train 126 young people from various backgrounds. The aim of the project was to create a network of young data leaders who will lead and support the development of their communities through the use of open data as evidence for youth-led and data-driven development.
  7. Participation at UN World Data Forum 2018: With the help of a scholarship provided by the Data for Development (D4D) Program in Nepal, we were able to mark our presence at the UN World Data Forum 2018, which gave us an opportunity for learning and networking.
  8. Participation in Open Data Training of Trainers Course by The Open Data Institute, UK: The rigorous five-day training took place in Nepal from 26th to 30th November 2018, and we thank the D4D team for bringing the Open Data Institute to Nepal. Through this training, we developed an understanding of open data principles and learned to create, deliver and evaluate high-quality interactive training.

Thank you again for your continuous support for our work. Except for a few events and workshops, our focus was entirely on building a civic-tech platform and ecosystem to encourage the use of open data. In 2019, we look forward to harnessing our capacity to support this moment fully.

Our plans for 2019:

  1. Promoting the value of diverse data: Building on our effort to open up government data through our Open Data Nepal portal, we look forward to promoting diverse fields of data in Nepal. Our focus will be on WikiData, Citizen Generated Data, Inclusive and Disability Data, Data for SDG and Personal Data (MyData).
  2. Increasing/improving women’s participation: We plan to work continuously to improve women’s participation in open data and civic-tech in Nepal. For this, we look forward to joining hands with other civil society organizations and institutions.
  3. Opening up more datasets: We will be harvesting more datasets to solve the problem of data scarcity and promote a culture of data-driven decision making in Nepal.
  4. Collaboration with Government and International Organizations: Our focus will be on working with the government directly through policy and technology lobbying. We will be pushing the government to conduct open data activities and to join hands with international organizations for support.


We would especially like to thank the Data for Development (D4D) Program in Nepal for financially supporting most of the key activities of 2018. We are also grateful to Code for Nepal, the Center for Data Journalism Nepal, YUWA and the Open Nepal Community, with whom we partnered for the successful implementation of activities and projects.

To stay updated on our activities, please follow us on our media channels.

Bargaining Parity for Librarians and Archivists / William Denton

Last month the Canadian Association of University Teachers (CAUT), an association of academic faculty associations and unions, released Bargaining Parity for Librarians and Archivists, an eight-page bargaining advisory that covers all the major issues for academic librarians and archivists in collective agreements, with examples of good language from different contracts across the country.

Here’s the announcement:

One of the greatest barriers to librarians and archivists accessing their academic rights is their definitional separation from the rest of the academic staff. This bargaining advisory, also available on-line in the members’ only section of the CAUT website, reviews current collective agreement language related to librarian and archivist terms and conditions of employment, including language that promotes parity with faculty and reflects the needs of librarians and archivists.

It will be of interest to any librarian or archivist involved in bargaining or revising collective agreements, and will be helpful to others who want to know more about these issues and how they affect us (especially professors on bargaining teams).

Snippet from the start of the advisory.

I am on the Librarians’ and Archivists’ Committee right now and helped write the advisory. It was a complex project and took a while, and I’m delighted it’s now out. Thanks to all the other committee members and the CAUT people involved. It’s a good committee, and CAUT does great work.

CAUT doesn’t make its bargaining advisories publicly available because it doesn’t want employers to have access to the expertise it provides to unions. I understand that point, and I’m never going to go out of my way to help management in labour negotiations, but I don’t think this advisory should be restricted this way. It’s the only thing I’ve written in my professional career that isn’t freely available online somewhere. Perhaps we’ll be able to change CAUT policy one day!

Nevertheless the advisory is available to those who need it. Anyone in Canada who’s in an association or union that’s a CAUT member can get it on the web site (your union will have a username and password; just ask) and in the US I presume there’s a way to go through the AAUP to get it. Or just ask one of us on the committee. (Also available in French.)

Blogging again and Never again / Mita Williams

It appears that I haven’t written a single post on this blog since July of 2018. Perhaps it is all the talk of resolutions around me but I sincerely would like to write more in this space in 2019. And the best way to do that is to just start.

In December of last year I listened to Episode 7 of Anil Dash’s Function Podcast: Fn 7: Behind the Rising Labor Movement in Tech.

This week on Function, we take a look at the rising labor movement in tech by hearing from those whose advocacy was instrumental in setting the foundation for what we see today around the dissent from tech workers.

Anil talks to Leigh Honeywell, CEO and founder of Tall Poppy and creator of the Never Again pledge, about how her early work, along with others, helped galvanize tech workers to connect the dots between different issues in tech.

Fn 7: Behind the Rising Labor Movement in Tech

I thought I was familiar with most of Leigh’s work but I realized that wasn’t the case because somehow her involvement with the Never Again pledge escaped my attention.

Here’s the pledge’s Introduction:

We, the undersigned, are employees of tech organizations and companies based in the United States. We are engineers, designers, business executives, and others whose jobs include managing or processing data about people. We are choosing to stand in solidarity with Muslim Americans, immigrants, and all people whose lives and livelihoods are threatened by the incoming administration’s proposed data collection policies. We refuse to build a database of people based on their Constitutionally-protected religious beliefs. We refuse to facilitate mass deportations of people the government believes to be undesirable.

We have educated ourselves on the history of threats like these, and on the roles that technology and technologists played in carrying them out. We see how IBM collaborated to digitize and streamline the Holocaust, contributing to the deaths of six million Jews and millions of others. We recall the internment of Japanese Americans during the Second World War. We recognize that mass deportations precipitated the very atrocity the word genocide was created to describe: the murder of 1.5 million Armenians in Turkey. We acknowledge that genocides are not merely a relic of the distant past—among others, Tutsi Rwandans and Bosnian Muslims have been victims in our lifetimes.

Today we stand together to say: not on our watch, and never again.

“Our pledge”, Never Again.

The episode reminded me that while I am not an employee in the United States who is directly complicit with the facilitation of deportation, as a Canadian academic librarian, I am not entirely free from some degree of complicity as I am employed at a University that subscribes to WESTLAW.

The Intercept is reporting on Thomson Reuters’ response to Privacy International’s letter to TRI CEO Jim Smith expressing the watchdog group’s “concern” over the company’s involvement with ICE. According to The Intercept article “Thomson Reuters Special Services sells ICE ‘a continuous monitoring and alert service that provides real-time jail booking data to support the identification and location of aliens’ as part of a $6.7 million contract, and West Publishing, another subsidiary, provides ICE’s “Detention Compliance and Removals” office with access to a vast license-plate scanning database, along with agency access to the Consolidated Lead Evaluation and Reporting, or CLEAR, system.” The two contracts together are worth $26 million. The article observes that “the company is ready to defend at least one of those contracts while remaining silent on the rest.”

“Thomson Reuters defends $26 million contracts with ICE”
by Joe Hodnicki (Law Librarian Blog) on June 28, 2018

I also work at a library that subscribes to products provided by Elsevier, whose parent company is the RELX Group.

In 2015, Reed Elsevier rebranded itself as RELX and moved further away from traditional academic and professional publishing. This year [2018], the company purchased ThreatMetrix, a cybersecurity company that specializes in tracking and authenticating people’s online activities, which even tech reporters saw as a notable departure from the company’s prior academic publishing role.

“Surveillance and Legal Research Providers: What You Need to Know”, Sarah Lamdan, Medium, July 6, 2018.

Welcome to 2019. There is work to do and it’s time to start.

My year in books, 2018 / Meredith Farkas


I had such good intentions to blog more this year, but the second half of 2018 has thrown me a lot of curveballs emotionally and it’s pulled me away from a lot of the things that keep me engaged with others (funny how that seems to happen when you need people the most). Books are always a comforting constant in my life, a good way to get out of my own head. I left 2018 feeling brittle, but hopeful.

I did a pretty terrible job of keeping track of what I read this year, so it’s quite possible I read other things beyond these 52 and just don’t remember. I bolded the books that I really loved and would recommend to others and I was very surprised that Barbara Kingsolver’s new book did not even come close to making that list (I really disliked it). There are a lot of critically-acclaimed books that I read this year and felt rather “meh” about.

I’m coaching my son’s Oregon Battle of the Books team again this year (they made regionals last year as third graders!) so I committed to read all 16 books and develop practice questions for the kids. It took up way more time than I’d anticipated, so I don’t imagine I’ll do that ever again (it’s always hard to find that sweet spot as a parent where you feel like you’re doing enough that you don’t feel like a crappy mom, but aren’t doing so much that no one is really going to appreciate it). Live and learn.

Here’s my list for 2018:

  • The Heir and the Spare by Emily Albright
  • Famous Last Words by Katie Alender
  • The Brixton Brothers: The Ghostwriter’s Secret by Mac Barnett
  • The Terrible Two (part 3) by Mac Barnett and Jory John
  • The Book Scavenger by Jennifer Chambliss Bertman
  • Kitchen Confidential by Anthony Bourdain
  • A Whole New Ballgame by Phil Bildner
  • Good Morning, Midnight by Lily Brooks-Dalton
  • The Wild Robot by Peter Brown
  • Love and Trouble: A Midlife Reckoning by Claire Dederer
  • Fresh Complaint by Jeffrey Eugenides
  • Class Mom by Laurie Gelman
  • George by Alex Gino
  • Less by Andrew Sean Greer
  • Real Friends by Shannon Hale
  • Asymmetry: A Novel by Lisa Halliday
  • Uncommon Type by Tom Hanks
  • Before the Fall by Noah Hawley
  • The Hero’s Guide to Saving Your Kingdom by Christopher Healy
  • Celine: A novel by Peter Heller
  • Nightbird by Alice Hoffman
  • Ugly by Robert Hoge
  • Roller Girl by Victoria Jamieson
  • Unsheltered by Barbara Kingsolver
  • A Wizard of Earthsea by Ursula K. Le Guin
  • A Wrinkle in Time (The Graphic Novel) by Madeleine L’Engle
  • Hana’s Suitcase by Karen Levine
  • The Rules do not Apply by Ariel Levy
  • When the Sea Turned to Silver by Grace Lin
  • Her Body and Other Parties: Stories by Carmen Maria Machado
  • In the Footsteps of Crazy Horse by Joseph Marshall III
  • Dark Money: The Hidden History of the Billionaires Behind the Rise of the Radical Right by Jane Mayer
  • The Infinity Year of Avalon James by Dana Middleton
  • Nasty Women: Feminism, Resistance, and Revolution in Trump’s America edited by Samhita Mukhopadhyay and Kate Harding
  • The Sympathizer: A Novel by Viet Thanh Nguyen
  • Wish by Barbara O’Connor
  • So You Want to Talk About Race by Ijeoma Oluo
  • There There by Tommy Orange
  • Pip Bartlett’s Guide to Magical Creatures by Jackson Pearce and Maggie Stiefvater
  • Waylon! One Awesome Thing by Sara Pennypacker
  • The Book of Dust: La Belle Sauvage by Philip Pullman
  • Will Not Attend by Adam Resnick
  • The Bright Hour by Nina Riggs
  • Where’d You Go Bernadette? by Maria Semple
  • You Think It, I’ll Say It by Curtis Sittenfeld
  • Swing Time by Zadie Smith
  • Sing, Unburied, Sing by Jesmyn Ward
  • Educated: A Memoir by Tara Westover
  • The Best Kind of People: A Novel by Zoe Whittall
  • The Female Persuasion by Meg Wolitzer
  • Teaching Men of Color in the Community College by J. Luke Wood, Frank Harris III, and Khalid White
  • Eleven Kinds of Loneliness by Richard Yates

And here are some books I hope to read in 2019. Any you’d particularly recommend?

  • Friday Black by Nana Kwame Adjei-Brenyah
  • The Power by Naomi Alderman
  • The Flight Attendant by Chris Bohjalian
  • A Lucky Man by Jamel Brinkley
  • Lives Other Than My Own: A Memoir by Emmanuel Carrère
  • White Fragility by Robin DiAngelo
  • Gone So Long by Andre Dubus III
  • Manhattan Beach by Jennifer Egan
  • Your Duck Is My Duck by Deborah Eisenberg
  • My Brilliant Friend by Elena Ferrante (this has been on my list — and my Kindle — for way too long)
  • The Lost Girls of Camp Forevermore by Kim Fu
  • Florida by Lauren Groff
  • Homegoing by Yaa Gyasi
  • Exit West by Mohsin Hamid (this has been on my list, and my Kindle, for way too long)
  • Plainsong by Kent Haruf (for the third time — it’s one of my all-time faves and my book club is reading it this month!)
  • Night Hawks by Charles Johnson
  • An American Marriage by Tayari Jones
  • The Leavers by Lisa Ko
  • The Mars Room by Rachel Kushner
  • My Year of Rest and Relaxation by Ottessa Moshfegh
  • Becoming by Michelle Obama
  • I Am Not Your Perfect Mexican Daughter by Erika Sánchez
  • Good and Mad by Rebecca Traister

What did you love reading in 2018? What’s at the top of your list for 2019?

2018: a year in gratitude / Mark Matienzo

This year was largely complicated and often felt like a massive garbage fire to myself and my crew. I didn’t accomplish a number of my goals and was inconsistent about others, so recapping awesome things I did doesn’t feel appropriate and also happens to be a soft reminder of either failure or things not going as planned. I also tend to hate “best of the year” lists but I find them helpful to remember about where I found joy or the ability to connect to something outside of myself. I suppose this is an attempt to reconcile those things, or perhaps more in line with the end of year spirit, a way to articulate gratitude to the people and things around me that impacted me.

The work world

  • I’ve said it before but I am endlessly grateful for my time in ITLP as a period of reflection and growth, especially amidst my own professional uncertainty. It got me out of my cultural heritage technology bubble for a while, amidst other smart and talented colleagues from a pool of other institutions. It also helped reinforce a commitment to developing professional relationships into something much bigger. My time with everyone in the program was valuable, but I’d like to recognize my coach, Laura Patterson, and my colleagues Dani Aivazian from Stanford and Luis Corrales from Lawrence Berkeley National Laboratory, for their support and strong feedback throughout the program.
  • While I tend to present at conferences with some frequency, the presentations of which I am the proudest are those in which I got in touch with something deeper in me. I had two of those this year, which is pretty good by all considerations, and my thanks extend to those involved. The first was a panel with Aliza Elkin, Mallory Furnier, Monika Chavez, and Elisa Rodrigues at the New Librarian Summit on mixed/multiracial identity and librarianship. Their contributions were profound, thoughtful, and mutually supportive, and it was absolutely awesome to collaborate with a group of information workers of color. The other was my lightning talk, “Evidence of Them: Digitization, Preservation, and Labor”, from the “Digitization Is/Not Preservation” session at SAA this year. We had a great group of controversial ideas, some funny, some troubling, and it brought me closer to pretty much everyone on the session. Again, thanks to Julia Kim, Frances Harrell, Tre Berney, Andrew Robb, Snowden Becker, Fletcher Durant, Siobhan Hagan, and Sarah Werner, and to Fletcher, Siobhan, and Maureen Callahan for being willing to take the lead on getting a pop-up session proposal submitted.
  • Huge thanks go to the Code4Lib community, especially as we prepare for Code4Lib 2019 in February in San José, California.
  • And of course, I have to acknowledge my broader professional family. I really feel lucky to be surrounded by smart folks with great ideas, and in 2018, I continued to learn and be inspired by Jarrett Drake, Eira Tansey, Hillel Arnold, Ruth Kitchin Tillman, David Staniunas, Maureen Callahan, Shannon O’Neill, Dinah Handel, Erin O’Meara, Camille Villa, and many, many more.

Personal life

  • My partner, Chela, has been a great source of support and strength. She has really done the world for me and helped me think through where I’m headed.
  • The San Mateo Quaker Worship Group has been instrumental in resituating my spiritual life.
  • Getting the courage to reconnect with the creative process around music has been huge. I just issued a digital rerelease of a tape from 15+ years ago that I remain proud of. I’m grateful to everyone who encouraged me or gave me advice as I relearn lots of stuff about recording and music technology.


This is not a ranked list, nor a list based on number of listens. Lots of records came out this year that are incredibly important to me. The list below are those that had the most profound impact. I probably listened to a lot of these alone, while on a hike, at the gym, on the train to work, or in the car. These are the records that struck a chord, that caused me to feel twinges of something, that made me feel either more or less alone. Thanks go to everyone who was involved with them.

What’s next

I want to extend this practice of gratitude into this new year, and I’m going to keep my goals a little closer to the chest at the suggestion of a few folks. However, I think that gratitude should not be something reflected on only at the end of something, be it a year, a project, or anything else. I look forward to reestablishing a sense of connection between people and place and appreciating both of them, and hope you’ll join me in doing the same.

Trust In Digital Content / David Rosenthal

This is the fourth and I hope final part of a series about trust in digital content that might be called:
Is this the real life?
Is this just fantasy?
The series so far moved down the stack:
  • The first part was Certificate Transparency, about how we know we are getting content from the Web site we intended to.
  • The second part was Securing The Software Supply Chain, about how we know we're running the software we intended to, such as the browser that got the content whose certificate was transparent.
  • The third part was Securing The Hardware Supply Chain, about how we can know that the hardware the software we secured is running on is doing what we expect it to.
Below the fold this part asks whether, even if the certificate, software and hardware were all perfectly secure, we could trust what we were seeing.

Max Read's How Much of the Internet Is Fake? Turns Out, a Lot of It, Actually introduces the idea of "The Inversion":
For a period of time in 2013, the Times reported this year, a full half of YouTube traffic was “bots masquerading as people,” a portion so high that employees feared an inflection point after which YouTube’s systems for detecting fraudulent traffic would begin to regard bot traffic as real and human traffic as fake. They called this hypothetical event “the Inversion.”
These were "click-fraud" bots. Read explains:
In late November, the Justice Department unsealed indictments against eight people accused of fleecing advertisers of $36 million in two of the largest digital ad-fraud operations ever uncovered. Digital advertisers tend to want two things: people to look at their ads and “premium” websites — i.e., established and legitimate publications — on which to host them. The two schemes at issue in the case, dubbed Methbot and 3ve by the security researchers who found them, faked both. Hucksters infected 1.7 million computers with malware that remotely directed traffic to “spoofed” websites — “empty websites designed for bot traffic” that served up a video ad purchased from one of the internet’s vast programmatic ad-exchanges, but that were designed, according to the indictments, “to fool advertisers into thinking that an impression of their ad was served on a premium publisher site,” like that of Vogue or The Economist. Views, meanwhile, were faked by malware-infected computers with marvelously sophisticated techniques to imitate humans: bots “faked clicks, mouse movements, and social network login information to masquerade as engaged human consumers.” Some were sent to browse the internet to gather tracking cookies from other websites, just as a human visitor would have done through regular behavior. Fake people with fake cookies and fake social-media accounts, fake-moving their fake cursors, fake-clicking on fake websites — the fraudsters had essentially created a simulacrum of the internet, where the only real things were the ads.
Outside the simulacrum the ads may be real, malvertising, or cryptocurrency miners, but the content they are displayed in is less and less real. In Managing the Cultural Record in the Information Warfare Era, Cliff Lynch surveys this problem:
The first development is the ability to fabricate audio and video evidence. Software that can do this is becoming readily available and doesn't require extraordinary computational resources. If you want to produce a persuasive video of someone speaking any script you'd like and if that person has a reasonable amount of available recorded video, you can synthesize that video into the fabrication software. The obvious place for this is politics: pick your target politician, put words in his or her mouth, then package this into propaganda or attack ads as desired.

Fabrication is much more than talking heads, of course. In keeping with the long tradition of early technology exploitation in pornography markets, another popular application is "deepfakes," where someone (a public figure or otherwise) is substituted into a starring role in a porn video (the term "deepfakes" is used both for the overall substitution technology and for the specific porn application). This is already happening, though the technology is as yet far from perfect. Beyond the obvious uses (e.g., advertising and propaganda), there are plentiful disturbing applications that remain unexplored, particularly when these can be introduced into authoritative contexts. Imagine, for example, being able to source fabrications such as police body-camera footage, CATV surveillance, or drone/satellite reconnaissance feeds. The nature of evidence is changing quickly.
While there's a great deal to be learned from our experiences over the past century, what's different today is the scale, the ready availability of these tools to interested individuals (rather than nation-states), and the move into audio/video contexts.
Insecurity of the Internet of Things adds to the problem, as Charlie Osborne describes in Hackers can exploit this bug in surveillance cameras to tamper with footage:
Researchers have discovered a vulnerability in Nuuo surveillance cameras which can be exploited to hijack these devices and tamper with footage and live feeds.

On Thursday, cybersecurity firm Digital Defense said that its Vulnerability Research Team (VRT) had uncovered a zero-day vulnerability in Nuuo NVRmini 2 Network Video Recorder firmware, software used by hundreds of thousands of surveillance cameras worldwide.
Back in September, researchers from Tenable revealed a remote code execution flaw in Nuuo cameras. This vulnerability, nicknamed Peekaboo, also permitted attackers to tamper with camera footage.
Like Lynch, Max Read addresses the issue of trust in content:
The only site that gives me that dizzying sensation of unreality as often as Amazon does is YouTube, which plays host to weeks’ worth of inverted, inhuman content. TV episodes that have been mirror-flipped to avoid copyright takedowns air next to huckster vloggers flogging merch who air next to anonymously produced videos that are ostensibly for children. An animated video of Spider-Man and Elsa from Frozen riding tractors is not, you know, not real: Some poor soul animated it and gave voice to its actors, and I have no doubt that some number (dozens? Hundreds? Millions? Sure, why not?) of kids have sat and watched it and found some mystifying, occult enjoyment in it. But it’s certainly not “official,” and it’s hard, watching it onscreen as an adult, to understand where it came from and what it means that the view count beneath it is continually ticking up.

These, at least, are mostly bootleg videos of popular fictional characters, i.e., counterfeit unreality. Counterfeit reality is still more difficult to find—for now. In January 2018, an anonymous Redditor created a relatively easy-to-use desktop-app implementation of “deepfakes,” the now-infamous technology that uses artificial-intelligence image processing to replace one face in a video with another — putting, say, a politician’s over a porn star’s. A recent academic paper from researchers at the graphics-card company Nvidia demonstrates a similar technique used to create images of computer-generated “human” faces that look shockingly like photographs of real people. (Next time Russians want to puppeteer a group of invented Americans on Facebook, they won’t even need to steal photos of real people.) Contrary to what you might expect, a world suffused with deepfakes and other artificially generated photographic images won’t be one in which “fake” images are routinely believed to be real, but one in which “real” images are routinely believed to be fake — simply because, in the wake of the Inversion, who’ll be able to tell the difference?
Even without altering a frame of video, simply editing it has already had major real-world consequences. Paul Schrodt's 'The Apprentice' editor recalls Donald Trump saying he wanted to 'drill' female crew members reports:
CineMontage, a journal for the Motion Picture Editors Guild, talked to editors who worked on the NBC reality show, who say that the image of Donald Trump "was carefully crafted and manufactured in postproduction to feature a persona of success, leadership, and glamour, despite the raw footage of the reality star that was often 'a disaster.'"

"We were told to not show anything that was considered too much of a 'peek behind the curtain,'" one editor, Jonathon Braun, told CineMontage.

The editors say one of their biggest challenges was in the boardroom, making Trump's often whimsical decisions about who was fired instead look "legitimate."

"Trump would often make arbitrary decisions which had nothing to do with people's merit," an anonymous editor said. "He'd make decisions based on whom he liked or disliked personally, whether it be for looks or lifestyle, or he'd keep someone that 'would make good TV' [according to Trump]."

This required creative editing to set up the firings in a way that would make them seem logical, according to the sources, and while manipulative editing is standard in reality TV, this was apparently on another level.
Thomas Hale's What crowdfunding is really about describes another way to distort reality:
Imagine you’re a small, fast-growing business. You attract £20m from a single, institutional investor. Suddenly, there’s a public relations crisis. You hire a specialist to deal with the fallout. The institutional investor is concerned, but they don’t really get involved. Their expertise lies in risk and return, not reputation.

Now, imagine that the same investor gets thousands of its employees to continuously defend your reputation on social media. Imagine they include it as a condition of the investment, for free.

You’ve just imagined crowdfunding.
It multiplies the size of the online community willing to participate (which in the case of Monzo is already bigger than those with investments). The £20m the company aims to raise this week could easily add well over 10,000 equity investors in its business (the investment limit is £2,000 per person). Instead of trying to control reputation through relationships with a small number of custodians, the sheer volume of support has the capacity to swamp critical voices.
Lynch notices an even more subtle problem:
Anyone who has followed security breaches and penetrations over the past few years knows that the track record of protecting data aggregations from exfiltration and subsequent disclosure or exploitation is very poor. And there are many examples of attackers that have maintained a presence in organizational networks and systems over long periods of time once they have succeeded in an initial penetration. While a tremendous amount of data has been stolen, we hear very little about data that has been compromised or altered, particularly in a low-key way. I believe that in the long term, compromise is going to be much more damaging and destabilizing than disclosure or exfiltration.
He distinguishes two motivations for compromise:
I want to explicitly note here the difference between the act of quietly rewriting the record and enjoying the results of the rewrites that are accepted as truth and that of deliberately destroying the confidence of the public (including the scholarly community) by creating compromise, confusion, and ambiguity to suggest that the record cannot be trusted.
Faked data doesn't have to be the result of external compromise:
Metrics should be the most real thing on the internet: They are countable, trackable, and verifiable, and their existence undergirds the advertising business that drives our biggest social and search platforms. Yet not even Facebook, the world’s greatest data–gathering organization, seems able to produce genuine figures. In October, small advertisers filed suit against the social-media giant, accusing it of covering up, for a year, its significant overstatements of the time users spent watching videos on the platform (by 60 to 80 percent, Facebook says; by 150 to 900 percent, the plaintiffs say). According to an exhaustive list at MarketingLand, over the past two years Facebook has admitted to misreporting the reach of posts on Facebook Pages (in two different ways), the rate at which viewers complete ad videos, the average time spent reading its “Instant Articles,” the amount of referral traffic from Facebook to external websites, the number of views that videos received via Facebook’s mobile site, and the number of video views in Instant Articles.
Facebook is lying for its own profit. Nathalie Maréchal's Targeted Advertising Is Ruining the Internet and Breaking the World gets closer to the real question of cui bono? from destroying trust in media, and especially academic content:
Safiya Noble, an associate professor at the University of California, Los Angeles and author of Algorithms of Oppression, told me in an email that “we are dependent upon commercial search engines to sort truth from fiction, yet these too, are unreliable fact-checkers on many social and political issues. In essence, we are witnessing a full-blown failure of trust in online platforms at a time when they are the most influential force in undermining or protecting democratic ideals around the world.”
Advertising’s shift to digital has cannibalized the news media’s revenue, thus weakening the entire public sphere. And linking advertising to pageviews incentivizes media organizations to produce articles that perform well, sometimes at the expense of material that educates, entertains, or holds power-holders accountable. Targeted advertising provides tools for political advertisers and propagandists to micro-segment audiences in ways that inhibit a common understanding of reality. This creates a perfect storm for authoritarian populists like Rodrigo Duterte, Donald Trump, and Jair Bolsonaro to seize power, with dire consequences for human rights.
Tom Sullivan edges toward the same question:
It is remarkable what moral compromises people will make provided the right incentives. Intelligence and foreign policy analyst Malcolm Nance recently mentioned the acronym used in the intelligence community for motives behind people becoming spies. MICE: Money, Ideology, Compromise or Coercion, and Ego. It was money that made “Sashko” lie for a living.

Filmmaker Kate Stonehill's "Fake News Fairytale" tells the story of a Macedonian teenager who took up spreading fake news for a quick buck until he saw what havoc his actions contributed to thousands of miles away:
“I think there’s a common misconception that people who write fake news must have a nefarious desire to influence politics one way or another,” Stonehill told The Atlantic. “I’m sure some of them undoubtedly do, but I met many people in Macedonia, including Sashko, who were writing fake news simply because they can make some money.”


Stonehill believes the first step to combatting the seductive proliferation of falsehoods is opening up an honest, critical discussion about technology, speech, and politics in order to better understand the fake-news phenomenon. “How, when, and why did the truth lose its currency?” she asked. “Who is profiting when the truth doesn’t matter? In my opinion, we’re only just beginning to unpack the answers to these questions.”
"Who is profiting when the truth doesn’t matter?" is the important question. The answer is obvious - those with the power to shape the lie. Governments, not just authoritarian governments, and powerful corporations. Here are a few US examples:
  • We are at the end of a historic process of governments of all stripes failing to weigh the short-term benefits of lying to their citizens against the long-term costs of eroding belief in government pronouncements. The Cold War, the JFK, RFK and MLK assassinations, Vietnam, Iraq, the War on Terror, austerity, and the foreclosure crisis all featured blatant lying by the government. They have led to a situation in which no-one in the reality-based community believes anything they hear from the current administration. Greg Sargent's tweetstorm makes important points about Trump's lying:
    Why does Trump lie *all the time* about *everything,* even the most trivial, easily disprovable matters?

    The frequency and the audacity of Trump’s disinformation is the *whole point* of it -- to wear you down. More and more of the lies slip past, undetected and uncorrected.
    Once Trump’s lying is understood as concerted and deliberate disinformation, it becomes clear that the frequency and audacity of it is *the whole point.*

    Those are features of the lying. They are central to declaring the power to say what reality is

    The other crucial half of this is to destroy the credibility of the institutional press.

    Previous presidents have tangled with the media. But Trump’s ongoing casting of the press as the "enemy of the people" is in important respects something new:
    Trump is *openly and unapologetically* declaring that norms of consistency and standards of interplay with the institutional press *do not* bind him.
  • Over the same period corporations have run massive long-term disinformation campaigns about pollution, smoking, and climate change, among others. And billionaire right-wing media owners have facilitated them, while using their reach to intimidate governments (see Hack Attack, Nick Davies' astonishing account of the Murdoch press' "information operations" in the UK).
In the academic context, Lynch sees some hope:
A four-pronged approach to the new information warfare environment seems to be emerging. One prong is greatly improved forensics; this is a mostly technical challenge, and memory organizations will be mainly users, not developers, of these technologies. Documentation of provenance and chain of custody are already natural actions for memory organizations; the challenge here is to make this work more transparent and rigorous and to allow broad participation. Capture of materials, particularly in a world of highly targeted and not easily visible channels, will be a third challenge at both technical and intellectual levels (though we are seeing some help now from platform providers). Finally, contextualization of fakes or suspected fakes is perhaps the greatest challenge, and the one that is least amenable to technological solutions.
Even in the academic context, Lynch's prongs have issues:
  1. Forensics will always remain one side of a co-evolutionary process. Machine learning merely accelerates both sides of it. And even during periods when forensics have the upper hand, they must be applied to have an effect. Unless they are built in to browsers and enabled by default, even most scholars will not bother to apply them.
  2. Documentation of provenance and chain of custody may be a natural action for memory organizations, but in an era of shrinking real budgets it isn't a priority for funding as against, for example, subscriptions.
  3. Capture of open access academic content isn't a priority for funding either; capture of paywalled content is difficult and sure to become more so as Elsevier absorbs the whole of the academic workflow.
  4. Contextualization is resource intensive and, as Lynch points out, hard to automate. So, again, it will be hard to justify adequate funding.
In the research communication context, the bad incentives researchers, reviewers and publishers operate under are destroying trust in peer-reviewed research from the inside. I wrote in 2016's More Is Not Better:
Even if there is no actual misconduct, the bad incentives will still cause bad science to proliferate via natural selection, or the scientific equivalent of Gresham's Law that "bad money drives out good". The Economist's Incentive Malus, subtitled Poor scientific methods may be hereditary, is based on The natural selection of bad science by Paul E. Smaldino and Richard McElreath, which starts:

Poor research design and data analysis encourage false-positive findings. Such poor methods persist despite perennial calls for improvement, suggesting that they result from something more than just misunderstanding. The persistence of poor methods results partly from incentives that favour them, leading to the natural selection of bad science. This dynamic requires no conscious strategizing—no deliberate cheating nor loafing—by scientists, only that publication is a principal factor for career advancement.
Of Lynch's four prongs, only forensics in the form of replication and audit bots such as Statcheck can improve things at this level.

Outside the academic context, Lynch's prongs are wholly inadequate to the scale of the problem. It is one of those problems so common in this era of massive inequality whose costs are imposed on everyone while the benefits accrue to a small, affluent minority. Fixing them requires overwhelming coordinated political action, the very thing destroying trust in digital content is designed to prevent.

Towards Impact-based OA Funding / Eric Hellman

Earlier this month, I was invited to a meeting sponsored by the Mellon Foundation about aggregating usage data for open-access (OA) ebooks, with a focus on scholarly monographs. The "problem" is that open licenses permit these ebooks to be liberated from hosting platforms and obtained in a variety of ways. A scholar might find the ebook via a search engine, on social media or on the publisher's web site; or perhaps in an index like the Directory of Open Access Books (DOAB), or in an aggregator service like JSTOR. The ebook file might be hosted by the publisher, by OAPEN, or on Internet Archive, Dropbox, or GitHub. Libraries might host files on institutional repositories, and scholars might distribute them by email or via ResearchGate or discipline-oriented sites such as Humanities Commons.

I haven't come to the "problem" yet. Open access publishers need ways to measure their impact. Since the whole point of removing toll-access barriers is to increase access to information, open access publishers look to their usage logs for validation of their efforts and mission. Unit sales and profits do not align very well with the goals of open-access publishing, but in the absence of sales revenue, download statistics and other measures of impact can be used to advocate for funding from institutions, from donors, and from libraries. Without evidence of impact, financial support for open access would be based more on faith than on data. (Not that there's anything inherently wrong with that.)

What is to be done? The "monograph usage" meeting was structured around a "provocation": that somehow a non-profit "Data Trust" would be formed to collect data from all the providers of open-access monographs, then channel it back to publishers and other stakeholders in privacy-preserving, value-affirming reports. There was broad support for this concept among the participants, but significant disagreements about the details of how a "Data Trust" might work, be governed, and be sustained.

Why would anyone trust a "Data Trust"? Who, exactly, would be paying to sustain a "Data Trust"? What is the product that the "Data Trust" will be providing to the folks paying to sustain it? Would a standardized usage data protocol stifle innovation in ebook distribution? We had so many questions, and there were so few answers.

I had trouble sleeping after the first day of the meeting. At 4 AM, my long-dormant physics brain, forged in countless all-nighters of problem sets in college, took over. It proposed a gedanken experiment:
What if there was open-access monograph usage data that everyone really trusted? How might it be used?
The answer is given away in the title of this post, but let's step back for a moment to provide some context.

For a long time, scholarly publishing was mostly funded by libraries that built great literature collections on behalf of their users - mostly scholars. This system incentivized the production of expensive must-have journals that expanded and multiplied so as to eat up all available funding from libraries. Monographs were economically squeezed in this process. Monographs, and the academic presses that published them, survived by becoming expensive, drastically reducing access for scholars.

With the advent of electronic publishing, it became feasible to flip the scholarly publishing model. Instead of charging libraries for access, access could be free for everyone, while authors paid a flat publication fee per article or monograph. In the journal world, the emergence of this system has erased access barriers. The publication fee system hasn't worked so well for monographs, however. The publication charge (much larger than an article charge) is often out of reach for many scholars, shutting them out of the open-access publishing process.

What if there was a funding channel for monographs that allocated support based on a measurement of impact, such as might be generated from data aggregated by a trusted "Data Trust"? (I'll call it the "OA Impact Trust", because I'd like to imagine that "impact" rather than a usage proxy such as "downloads" is what we care about.)

Here's how it might work:

  1. Libraries and institutions register with the OA Impact Trust, providing it with a way to identify usage and impact relevant to the library or institution.
  2. Aggregators and publishers deposit monograph metadata and usage/impact streams with the Trust.
  3. The Trust provides COUNTER reports (suitably adapted) for relevant OA monograph usage/impact to libraries and institutions. This allows them to compare OA and non-OA ebook usage side-by-side.
  4. Libraries and institutions allocate some funding to OA monographs.
  5. The Trust passes funding to monograph publishers and participating distributors.
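
As a rough illustration of steps 3 through 5, here is a minimal sketch of how one library's OA allocation might be split among publishers. It assumes the simplest possible rule, payouts proportional to that library's measured usage, which is only one of many rules a Trust could adopt. The function name, titles, presses, and figures are all hypothetical.

```python
from collections import defaultdict


def allocate_funding(budget, usage_by_title, publisher_of):
    """Split one library's OA budget across publishers.

    Assumption (not from the meeting): each title earns a share of the
    budget proportional to its share of the library's measured usage.
    """
    total = sum(usage_by_title.values())
    if total == 0:
        return {}  # no recorded usage, nothing to allocate
    payouts = defaultdict(float)
    for title, uses in usage_by_title.items():
        payouts[publisher_of[title]] += budget * uses / total
    return dict(payouts)


# Hypothetical usage report for one library (step 3).
usage = {"monograph-a": 120, "monograph-b": 60, "monograph-c": 20}
publishers = {
    "monograph-a": "Press X",
    "monograph-b": "Press Y",
    "monograph-c": "Press X",
}

# The library allocates 1000 (step 4); the Trust passes it on (step 5).
payouts = allocate_funding(1000.0, usage, publishers)
print(payouts)
```

A purely proportional rule favors works that go viral; a Trust that wanted to reward intense use by small communities would need a different weighting.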

The incentives built into such a system promote distribution and access. Publishers are encouraged to publish monographs that actually get used. Authors are encouraged to write in ways that promote reading and scholarship. Publishers are also encouraged to include their backlists in the system, and not just the dead ones, but the ones that scholars continue to use. Measured impact for OA publication rises, and libraries observe that more and more, their dollars are channeled to the material that their communities need.

Of course there are all sorts of problems with this gedanken OA funding scheme. If COUNTER statistics generate revenue, they will need to be secured against the inevitable gaming of the system and fraud. The system will have to make judgements about what sort of usage is valuable, and how to weigh the value of a work that goes viral against the value of a work used intensely by a very small community. Boundaries will need to be drawn. The machinery driving such a system will not be free, but it can be governed by the community of funders.

Do you think such a system can work? Do you think such a system would be fair, or at least fairer than other systems? Would it be Good, or would it be Evil?

  1. Details have been swept under a rug the size of Afghanistan. But this rug won't fly anywhere unless there's willingness to pay for a rug.
  2. The white paper draft which was the "provocation" for the meeting is posted here.
  3. I've been thinking about this for a while.

Archival Shapes / Ed Summers

If you’ve wandered across this blog before you’ll have gathered that over the past few years I’ve become interested in studying how web archives are assembled. By web archives I generally mean collections of web content, that are often (but not necessarily) also on the web. Web archives are often thought to have a particular architectural shape, e.g. as a Wayback Machine, running atop a corpus of WARC data, that has been assembled with Heritrix – or some variation on that theme (e.g. Webrecorder). But I’d like to suggest that this is not the only shape that web archives come in, and that artificially limiting our assessment of web archives in this way keeps us from recognizing the full spectrum of archiving that is being practiced on the web today.

For example, while indexing the web Google routinely collects content from the web, which sometimes leaks out in the form of fleeting citations to the Google Cache. Not all archives keep everything forever (whatever forever means). So the fact that Google Cache isn’t permanent doesn’t necessarily mean it’s not archival. I think it’s actually kind of useful to look at Google Cache as one aspect of an archival process at work. Google has built particular distributed workflows for collecting content from the web, to use in their indexing, search, advertising and other business operations. They aren’t using Heritrix and the Wayback Machine. I guess there’s the possibility of WARC being used somewhere in their infrastructure, but I doubt it (if you know otherwise it would be great to hear from you).

For the last 6 months I’ve been collecting tweets that reference particular web archives including Google’s Cache to see how web archives are shared in social media. The thought being that this data could provide some insight into how archival representations are being used on the web (props to Jess Ogden for the idea). So far it has found 631,811 tweets, only 10,053 of which point at Google’s Cache. Here’s an example of one such tweet:

As another archival shape consider the social media site Pinterest, which allows users to build their own collections of images from the web. Users add images as pins to boards where they can be described, put into different sections and then shared and reshared with others both inside and outside of the Pinterest platform. Pinterest provides a browser extension that lets you hover over any image you see on the web and easily add it to one of your boards. This pinning operation doesn’t capture the entire page, but just the one image that you have selected. Pinterest does attempt to extract the title and description of the image, and they also remember the image’s original URL, in case you want to go back to see where it came from. In this example below I’ve added a portrait of Michelle Obama from the National Portrait Gallery to a Pinterest board. You can see near the bottom of the third screen that the link to the National Portrait Gallery is preserved. I think this simple measure of provenance is what gives Pinterest an archival shape.

The architecture of the Internet Archive has done a lot to promulgate the idea that web archives have a particular architectural shape. I think part of the reason for this is the astounding job IA does in providing access to both the portions of the web that they have archived, as well as the descriptive metadata for their collections. They make the data they have collected from the web available for replay through their Wayback Machine, and also through a variety of REST APIs as well as BitTorrent.

This data is a gold mine for people wanting to view the web over time, and also for researchers studying the processes by which web archives are built. But it’s important to remember that IA is just one particular type of web archive, and that other archival representations of the web exist. As Ogden, Halford, & Carr (2017) have pointed out, it’s important to understand not only the architectural formations of web archives, but also the forms of labor that go into producing them.

One of the interesting aspects of the way that the Internet Archive decides what to collect from the web is their collaboration with the ArchiveTeam. ArchiveTeam is a loose collective of volunteers that work to collect content from sites that are going offline. For example they are working right now to collect Tumblr, as it is removing large portions of its content so that it can get its iOS app back in the iOS App Store. ArchiveTeam builds crawling/archiving bots using their own infrastructure and transfers the collected WARC data to the Internet Archive.

Since this data is deposited at the Internet Archive you can use the Internet Archive’s API to examine how the data has grown over time. The data itself is not available to the general public outside of the Wayback Machine. But any member of the public is able to view the metadata for each item, where an item is a set of WARC data delivered from the ArchiveTeam to the Internet Archive. I’ve documented this process in a Jupyter notebook if you are interested in the details, but here is a resulting graph of the ingest rate per month:
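
The notebook itself isn't reproduced here, but the monthly tally behind a graph like this can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes each item's metadata has already been fetched and carries an `addeddate` timestamp and an `item_size` in bytes, and the sample records are invented.

```python
from collections import Counter


def ingest_per_month(items):
    """Return {YYYY-MM: (item_count, total_bytes)} for a list of items.

    Assumes each item is a dict with an ISO-style `addeddate` string
    (e.g. "2018-12-05T10:00:00Z") and an optional `item_size` in bytes.
    """
    counts = Counter()
    sizes = Counter()
    for item in items:
        month = item["addeddate"][:7]  # keep only "YYYY-MM"
        counts[month] += 1
        sizes[month] += int(item.get("item_size", 0))
    return {month: (counts[month], sizes[month]) for month in sorted(counts)}


# Invented sample records standing in for real item metadata.
sample = [
    {"identifier": "example-item-1", "addeddate": "2018-12-05T10:00:00Z", "item_size": 1000},
    {"identifier": "example-item-2", "addeddate": "2018-12-20T08:30:00Z", "item_size": 2000},
    {"identifier": "example-item-3", "addeddate": "2019-01-02T14:15:00Z", "item_size": 500},
]

monthly = ingest_per_month(sample)
for month, (count, size) in monthly.items():
    print(month, count, size)
```

Keeping the aggregation separate from the fetching means the same tally works whichever Internet Archive API supplies the metadata.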

For now let’s set aside the ethical questions raised by Tumblr’s decision to purge their site, and ArchiveTeam’s response to unilaterally archive the at-risk Tumblr content for perpetuity without the content creators’ consent. These are important questions, and are a big part of what I’m concerned with in my own research, but they aren’t really the focus of this post. Instead what I wanted to emphasize is that the different shapes of web archives extend outside of their technical architectures and into the forms of labor, and the assemblages of technology, that they make possible or are made possible by. We can think about this as usability, collaboration, or crowdsourcing, but ultimately these amount to different types of labor.

Allow me to close this meandering blog post by connecting the discussion of archival shapes back to archives. In his genealogical exploration of the development of the Records Continuum model Frank Upward suggests that the idea of the records continuum sits atop the foundation of Australian recordkeeping established by Ian Maclean (Maclean, 1959; Upward, 1994). According to Upward, Maclean broke with the (at the time) established Jenkinsonian idea that archival records accrue as part of natural processes. For Jenkinson the archivist is an impartial actor who merely cares for the records that have been generated as part of the functioning of some administrative body. Upward saw that Maclean returned to an older conception of archival records found in the so-called Dutch Manual (Muller, Feith, & Fruin, 2003), where archival records are created or produced. Archival records aren’t simply collected but created, and made to be archival. In this view the archivist plays a significant role in that creative process, and is not a neutral, unbiased participant.

I mention this here just as a personal place holder for a larger discussion. Sometimes web resources are published in such a way that reflects archival thinking. For example Wikipedia’s publishing of the version history of an article. Or when a blog post is published with a permalink that makes it easier to cite, and move about on the web. But sometimes (most of the time) web resources are published without thinking about the archive at all. But these resources are made archival by placing them into an archival context, such as Google Cache or Pinterest mentioned above. The mechanics of this placement (which let’s be honest is really a type of creation) are significant to the study of archival shapes on the web.


Maclean, I. (1959). Australian experience in records and archives management. American Archivist, 22(4), 399–418.

Muller, S., Feith, J. A., & Fruin, R. (2003). Manual for the arrangement and description of archives. Society of American Archivists.

Ogden, J., Halford, S., & Carr, L. (2017). Observing web archives. In Proceedings of WebSci’17. Troy, NY: Association for Computing Machinery.

Upward, F. (1994). In search of the continuum. In The records continuum: Ian Maclean and Australian Archives: First fifty years (pp. 110–130). Ancora Press.

Public Domain Day advent calendar #29: Success by A. A. Milne / John Mark Ockerbloom

I’ve been giving plays somewhat short shrift in this calendar so far. The only ones I’ve mentioned so far were noted in yesterday’s post as not actually joining the public domain in January.  So today I’ll discuss a play from 1923 that is joining the public domain, from a famous playwright of the time.

Many of us nowadays don’t usually think of A. A. Milne as a playwright, but up until 1923, he was largely known for his plays, and for his writing for  Punch.  Milne started contributing to that British humour magazine in 1903, and joined the staff as an assistant editor in 1906.  Before 1923 he had also published a few light novels (including the detective story The Red House Mystery and the satirical fairy tale Once on a Time).  He also wrote well over a dozen plays and screenplays, the main focus of his writing for several years after World War I ended.

Milne’s play Success, first published in 1923, was a more serious piece than some of his earlier work.  There’s a good summary of the play (in its 1926 version) in a 2012 Captive Reader blog post on his collection Four Plays.  The main character is a man who has followed the urgings of people around him into a career that is increasingly successful in the eyes of the world.  But when he meets a friend and a former love from his distant past, he regrets the path he’s taken, and dreams of the life he could have had if he hadn’t pursued the “success” the world recognizes.   The play’s retitling when it opened in New York, Give Me Yesterday, is an apt summation of its emotional focus.  So are lines like this one the main character delivers: “Success! It closes in on you […] I tried to get free – I did try, Sally – but I couldn’t. It had got me. It closes in on you.”

I suspect that Milne in 1923 had no idea that the same thing was about to happen to him and his family, on a much larger scale and more suddenly.  The following year, he published in Punch some poems about his young son Christopher Robin and his stuffed animals.  A book of those poems, titled When We Were Very Young, was published the same year and became an international best-seller, with 50,000 copies sold in its first two months on the market.  Two years later, another book made those stuffed animals the main characters, and that book, Winnie-the-Pooh, quickly became an even bigger best-seller, and remains one of the best-known and best-selling children’s books more than 90 years later.

Like his protagonist in Success, Milne felt that the success of his children’s books had closed him in, and that he could not escape.  In 2017, Danuta Kean wrote an article in the Guardian on a reissue of Milne’s 1939 memoir, which had been aptly titled It’s Too Late Now.  Kean quotes Milne as saying “I wanted to escape from [children’s books]…  as I have always wanted to escape.  In vain. England expects the writer, like the cobbler, to stick to his last.”  Milne’s ironic fate also affected his son Christopher Robin.  Kean quotes him as saying that his father “had filched from me my good name and had left me with nothing but the empty fame of being his son.”

Personally, I’ll be glad to see Milne’s children’s books join the US public domain over the next few years.  But in the public domain now, readers of Milne will have the chance to focus on the kinds of works that he wanted to be known for, before his wildly successful children’s books overwhelmed him (and his son).  And once Success joins those works in the public domain three days from now, readers may get from it a better sense of the emotional price paid for those later books.

PS:  Alexandra Alter has an informative new article in The New York Times  on the impending new arrivals to the American public domain, and their significance.  It’s good to read yourself, or to share with other people who might not be aware of what all the fuss is about.  (And I’m not just saying that because I’m one of the people mentioned in the article.)  Alter’s article mentions a number of works and authors that are also featured in this calendar, as well as some I won’t have time to discuss this month that you might be interested in as well.

2019 update: Link to full text of Success, now in the US public domain, courtesy of HathiTrust.

Public Domain Day advent calendar #26: Crystallizing Public Opinion by Edward L. Bernays / John Mark Ockerbloom

“Online manipulation and disinformation tactics played an important role in elections in at least 18 countries over the past year, including in the United States.” That was the lead finding of Freedom House’s 2017 Freedom on the Net report, “Manipulating Social Media to Undermine Democracy”.  2018, in turn, has brought further revelations of how various companies, organizations and countries have manipulated social media and other Internet sites through botnets, hidden funding, astroturfing, algorithmic manipulation, and “fake news” (both in the sense of fabricated stories and in the sense of discrediting unwelcome news by calling it “fake”).

The effectiveness of the Internet in spreading propaganda and manipulating citizens is a stark contrast to early utopian visions of the Internet that assumed it would be a natural fount of knowledge and an inevitable promoter of freedom.  But it’s not a surprise to those who have studied the history of other mass media. In various times and places, newspapers, radio, films, and television have been used to spread either knowledge and encouragement, or ignorance and fear.  Neither of these outcomes is inevitable. Those who understand the effective use of mass media can promote either of them, or can learn how to defend themselves against manipulation and support trustworthy communication.

Therefore, it’s worth taking a look at what pioneers in the field known as “public relations” say about their craft, and how it can be used to shape public opinion for good or for ill.  In 1923, public relations was a new and growing field of specialization.  One of its key figures was Edward Bernays, who that year published his first book on the subject, Crystallizing Public Opinion.

Bernays’s book repeatedly cites Walter Lippmann’s 1922 book Public Opinion, a work that introduces the concept of the “manufacture of consent” through mass media.  Lippmann’s book, now in the public domain, is an influential piece of analysis; Bernays’s book refocuses many of Lippmann’s ideas (and those of others) into more of a how-to manual.

One of Bernays’s key recommendations is to look beyond the expected ways of communicating information.  Companies had been advertising for a long time, just as governments had long issued official proclamations and publications.  But getting a product or idea into the news, and talked about by the public, could easily be a more effective method of promotion.  A subtler, indirect suggestion of a viewpoint could persuade people who might resist or ignore a direct, obvious promotion.  Bernays, a nephew of Sigmund Freud, also notes the importance of the “unconscious” mind and the ego in people’s decision-making.

There are, of course, more or less ethical ways of doing the sort of indirect publicity that Bernays discusses.  In this book, however, Bernays shows more concern about the ends of publicity than the means.  “The only difference between ‘propaganda’ and ‘education,’ really,” he writes, “is in the point of view. The advocacy of what we believe in is education. The advocacy of what we don’t believe in is propaganda.”  It’s seductively easy to move from that idea to an “ends justify the means” approach to public manipulation.

And not all of Bernays’s ends were good.  Some of his campaigns promoted health, racial harmony, and peace.  But he also worked for cigarette companies, particularly on campaigns to increase smoking among women, at a time when doctors were clearly establishing smoking’s links to ill health.  In the 1950s, he also promoted the interests of United Fruit against the government of Guatemala, which would eventually be overthrown in 1954 by a CIA-led coup.

Even if there are problems with who Bernays worked for, though, it’s still worth reading what Bernays has to say about how he worked,  to understand the techniques better, and where appropriate, how to counter them.  Some of what he says in 1923 seems eerily applicable today. For instance:

“Domination to-day is not a product of armies or navies or wealth or policies. It is a domination based on the one hand upon accomplished unity, and on the other hand upon the fact that opposition is generally characterized by a high degree of disunity.”

It’s not hard to trace a line from that idea to the covert social media influence campaigns in recent elections, where government agencies  simultaneously ran accounts provoking opposing sides of hot-button political issues, to foment disunity, create distractions, and discourage voting and meaningful political participation.

The US copyright on Crystallizing Public Opinion ends six days from now.  If having it more widely available and readable helps Americans to better understand and remediate an unhealthy social media environment, it will be an especially welcome (and overdue) addition to the public domain.

2019 update: Link to full text of Crystallizing Public Opinion, now in the US public domain, courtesy of HathiTrust.

Entering 2019 / Galen Charlton

A very brief post to start the new year. I’m not inclined to make elaborate resolutions for the new year other than being very firm that I will stop writing “2018” in dates by the end of March… or maybe April.

But seriously, I do want to write and engage more this year and more actively try new things. As I’m doing, right now, by trying WordPress’s new Gutenberg editor. Beyond that? We’ll see.

A brief digression on Gutenberg: I will bet a bag of coffee that the rollout of Gutenberg will become a standard case study in software management course syllabi. It encapsulates so many points of conflict: open source governance and the role of commercial entities in open source communities; accessibility and the politics of serving (or not) all potential users; technical change management and the balance between backwards compatibility and keeping up to date with modern technology (or, more cynically, modern fashions in technology); and managing major changes to the conceptual model required to use a piece of software. (And an idea for a future post, either by me or anybody who wants to run with it: can the transition of WordPress’s editor from a document-based model to a block-based model be usefully compared with the transition from AACR2/ISBD to RDA/LRM/LOD/etc.?) Of course, the situation with Gutenberg is evolving, so while initial analyses exist, obviously no definitive post mortems have been written.

But before I let this digression run away from me… onwards to 2019. May everybody reading this have a happy new year, or at least a better one than 2018.

Hecate sleeping behind me in my chair. Her New Year’s resolutions are pretty clear: play, sleep, and eat. Also, torment her humans and her brother/uncle cats.

Public Domain Day advent calendar #31: New Hampshire by Robert Frost / John Mark Ockerbloom

…for destruction ice
Is also great
And would suffice.

So ends “Fire and Ice”, one of the more than forty poems included in Robert Frost’s 1923 Pulitzer Prize-winning collection New Hampshire.  Frost’s short poem uses ice as a metaphor for hate, though I’ve frequently used ice (or, more precisely, freezing) to describe the stasis in the public domain in the US over the last twenty years, from my first Public Domain Day post in 2008 to more recent Public Domain Day posts like 2016’s “Freezes and Thaws”.

That freeze in the public domain has come with destruction as well.  Sometimes that’s been literal, as film stock, magnetic tapes, and brittle pages deteriorate, or as old publications not kept elsewhere are discarded.  Sometimes that destruction has been of memories and personal connections, as authors and those who knew them or read their newly published works die or fall silent.  Sometimes the destruction has been of creative opportunity, as those who wanted to build on existing works have been stymied by copyright restrictions even long after the authors of those works have passed on.

Not that New Hampshire in itself has been in any danger of disappearing. The famous poems in it, like “Fire and Ice”, “Nothing Gold Can Stay”, and “Stopping by Woods on a Snowy Evening” will be remembered and recited for a long time to come.  But if those are the short hit singles, the collection as a whole is a fascinating double-album or more worth of poetry, with lots of longer, deeper cuts. From the opening title track, an eccentric ode to an eccentric state, to the final poem where birds cheerily occupy a burnt-out and abandoned farmstead, certain patterns recur among his rhythms: the stark landscape of northern New England, the cycles of nature and of human activity, and death, which stalks through many of the poems in both metaphorical and literal forms.

Even as New Hampshire, and collections that include it, stay in print, as a whole the collection hasn’t had the cultural impact that it could have.  Not only are websites prevented from posting many of its poems online (unlike those from his earlier books), but adaptations of its works have also been limited.  In an article on the upcoming arrivals to the US public domain in the January 2019 issue of Smithsonian, Glenn Fleishman notes how Frost’s estate kept Eric Whitacre from releasing a choral adaptation of “Stopping by Woods on a Snowy Evening”.  There have been some authorized musical settings of Frost’s poems, but not many. I’ve enjoyed singing Randall Thompson’s 1959 setting of  “Stopping by Woods…” as part of his Frostiana suite, but I’d love to hear what Whitacre and other 21st-century composers could do with it.  Starting tomorrow, they’ll have their chance.

(A couple of copyright-nerd asides: It’s arguable that “Stopping by Woods…” is already in the public domain.  It was published in the March 7, 1923 issue of The New Republic before it appeared in New Hampshire, there’s no renewal for that magazine issue or that specific contribution, and the renewal for New Hampshire was filed in September 1951, at a time of 28-year copyright terms that were not rounded up to the end of the year. Given that, the book’s renewal may have been too late to cover the poem’s magazine publication.  But I can understand a composer not wanting to get into a long legal battle over this issue.  Also, note that the “January 2019” issue of Smithsonian was published in 2018, requiring Fleishman to remain circumspect about quoting from New Hampshire in his article.  Similarly, many periodical issues with cover dates in early 1924 were actually published in 1923, will end their copyright terms tomorrow with other 1923 publications, and will be eligible for posting online then.)
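The timing argument in the first aside comes down to simple date arithmetic, which can be sketched in a few lines of Python. This is only an illustration, not legal analysis: it assumes the first term ran exactly 28 years from the publication date, it uses September 1, 1951 as a stand-in for the renewal filing date (only the month is given above), and the function names are made up for this sketch.

```python
from datetime import date

INITIAL_TERM_YEARS = 28  # length of the first copyright term at the time


def first_term_end(published: date) -> date:
    """Anniversary on which the first term would run out (not rounded to year end)."""
    return published.replace(year=published.year + INITIAL_TERM_YEARS)


def renewal_in_time(published: date, renewal_filed: date) -> bool:
    """True if the renewal was filed on or before the end of the first term."""
    return renewal_filed <= first_term_end(published)


magazine_publication = date(1923, 3, 7)  # The New Republic issue date
book_renewal_filed = date(1951, 9, 1)    # assumption: exact day in September unknown

print(first_term_end(magazine_publication))                       # 1951-03-07
print(renewal_in_time(magazine_publication, book_renewal_filed))  # False
```

Under those assumptions the magazine publication's first term would have ended about six months before the book's renewal was filed, which is the gap the aside describes.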

Twenty years ago, I was moderating a mailing list of people posting texts online who were eagerly awaiting each new year’s worth of books they could post.  The first year of the list included a discussion of copyrights on Frost’s books, misdirected (and later retracted) takedown notices from his rightsholders, and the prospects for posting New Hampshire in 1999.  There were people who had been able to meet Frost in person, and hear him read his poems, while he was still alive.

Many of the people in that conversation are gone now from the Net, or from the world.  Eric Eldred, who challenged 1998’s 20-year copyright extension all the way to the Supreme Court, moved overseas, and eventually his site, where he wanted to post New Hampshire and other works that had been due to enter the public domain, dropped off the Net.  (A mostly-functional snapshot of his site is preserved at Ibiblio.)  David Reed, one of the people who prepared Frost’s early books, and many others, for Project Gutenberg, died in 2009. Michael Hart, who founded Project Gutenberg, and who after the 1998 copyright extensions passed told me he wanted to post Winnie-the-Pooh and Gone With the Wind someday when they reached the public domain, died in 2011.  Before he died, though, he inspired enough interested volunteers to keep Project Gutenberg going and posting new texts. Tomorrow, those volunteers will have full access to the promised land of 1923 that Michael Hart didn’t get to reach.

And God willing, when tomorrow begins, Mary and I will greet the year 2019, and continue work we and others have been doing for more than 20 years, bringing the public domain to light through online publishing and cataloging, shedding light on the hidden public domain of unrenewed materials, and doing our best to ensure that new works keep joining the public domain here not just tomorrow, but every year after that, for many years to come.  And I’ll be thinking of the words Robert Frost published in 1923:

…I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.

See you in the new year.

2019 update: Link to full text of New Hampshire, now in the US public domain, courtesy of HathiTrust.

Public Domain Day 2019: Welcome to 1923! / John Mark Ockerbloom

Early this morning, a full year’s worth of published works were welcomed into the public domain in the United States for the first time since 1998. Hundreds of thousands of works from 1923 either joined the public domain here, or achieved a much more obvious and visible public domain status.

This is not news to anyone who’s been following this blog, which has had a post per day discussing some of the many upcoming additions to the public domain.  I’ve also been posting about Public Domain Day here since 2008, but as an American haven’t had a lot to celebrate here till now.  I now find myself feeling much like I did when the Red Sox broke their World Series drought in 2004, or the Eagles finally won a Super Bowl last year: elation, mixed with thoughts that it’s been a long time coming, and wishes that I could celebrate now with everyone I’ve known who’s waited for it these past 21 years.

One thing I’m very happy to see today is that the public domain now has lots of friends, who are now much more numerous, aware, and organized than they were in 1998, the last time copyright was extended here.  They’ve helped ensure that there wasn’t another serious attempt to extend copyright terms here when the 20-year public domain freeze here ended.  They’ve been spreading the word about the new arrivals to the public domain, and why that’s a good thing.  (In my advent calendar series, I’ve pointed to a few of the articles written about this; Lisa Gold’s blog post today is another such article, which also points to a few others.)

Various groups have also been quick to make works that have newly joined the public domain freely readable online.  HathiTrust opened access to over 40,000 works from 1923 today.  Also today, Project Gutenberg released a transcription of The Prophet they had ready to go for its first day in the US public domain; they’re also releasing other 1923 transcriptions.  At the Penn Libraries, where I work, a team led by Brigitte Burris is digitizing 1923 publications from our collections to share online.  A story by Peter Crimmins at WHYY has more information, and pictures, from our digitization work.

While 1923 may be making the biggest splash today, there’s other work also joining the public domain today in various places.  People in Europe and other countries with “life+70 years” copyright terms get works joining the public domain from authors who died in 1948.  (In the US, we’re also today getting works by authors who died that year that were not published prior to 2003.)  People in Canada and other countries maintaining “life+50 years” terms get works by authors who died in 1968.  Some of the relevant authors whose works are joining the public domain in these countries are mentioned in the Public Domain Review’s Class of 2019 feature.
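The year arithmetic behind these different regimes is the same in each case, and a rough Python sketch makes it explicit. It assumes that terms run through the end of the calendar year, so a work enters the public domain on the following January 1; the function name is invented for illustration, and the 95-year figure for US publications from 1923 is the post-1998 term rather than something stated in this post.

```python
def public_domain_year(base_year: int, term_years: int) -> int:
    """Year whose January 1 is the first public domain day.

    base_year is the author's death year (for "life+N" terms) or the
    publication year (for fixed terms counted from publication).
    Assumes the term runs through December 31 of its final year.
    """
    return base_year + term_years + 1


print(public_domain_year(1948, 70))  # life+70, author died 1948 -> 2019
print(public_domain_year(1968, 50))  # life+50, author died 1968 -> 2019
print(public_domain_year(1923, 95))  # US, published 1923, 95-year term -> 2019
```

All three cases land on the same Public Domain Day, which is why 1923 publications, 1948 deaths, and 1968 deaths are all being celebrated today in different places.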

As for me, here’s what I’m giving the world today:

  • A newly updated Creative Commons licensed guide for identifying public domain serial content.  I discussed this guide, when it was still in draft form, in a blog post last month.   Today’s update, now out of draft status, fixes some awkward sentences, says a little more about government publications, and removes references to 1923 copyrights, since they’ve now expired.  I hope folks find the guide useful, and I’d love to hear what you do with it, or if you have questions about it.
  • A grant to the public domain (via CC0 dedication) of any work I published in 2004 whose copyright is under my sole control.  (I typically do this every year on Public Domain Day for copyrights more than 14 years old, in recognition of the original term of copyright available in the United States.)
  • Links added to the advent calendar posts to online copies of the featured works.  They won’t all be linked today (it may take a while to find them all, and not all of them are online at the moment), but I’ll add the ones I can over the next few days, as well as creating or updating listings where appropriate for The Online Books Page.

And now that I’m done with the advent calendar, here’s a list of all of its posts and featured works:

  1. The Prophet by Kahlil Gibran
  2. ’Twas the Night Before Christmas Cantata by Frances McCollin
  3. Murder on the Links by Agatha Christie
  4. “The Adventure of the Creeping Man” by Arthur Conan Doyle
  5. The Federal Reporter (1923 publications) by West Publishing
  6. “Barney Google” (fox-trot) by Billy Rose and Con Conrad
  7. New York Tribune (1923 issues)
  8. “Yes! We Have no Bananas” by Frank Silver and Irving Cohn
  9. Tagebücher by Theodor Herzl
  10. “The Road Away From Revolution” by Woodrow Wilson
  11. Washington and its Romance by Thomas Nelson Page
  12. Cane by Jean Toomer
  13. Safety Last! by Hal Roach, Sam Taylor, and Tim Whelan
  14. Tarzan and the Golden Lion by Edgar Rice Burroughs
  15. “Keen” by Edna St. Vincent Millay
  16. Emily of New Moon by Lucy Maud Montgomery
  17. Jeeves by P. G. Wodehouse
  18. The Art Spirit by Robert Henri
  19. Souls for Sale (photoplay) by Rupert Hughes
  20. The Vanishing American by Zane Grey
  21. “Nobody Knows You When You’re Down and Out” by Jimmie Cox
  22. A Son at the Front by Edith Wharton
  23. “Great is Thy Faithfulness” by Thomas Chisholm and William Runyan
  24. The Night Before Christmas (recitation with music and drawings) by Hanna van Vollenhoven and Grace Drayton
  25. “Christmas Day at Sea” by Joseph Conrad
  26. Crystallizing Public Opinion by Edward L. Bernays
  27. “The Invisible Monster” by Sonia Greene
  28. “Parisian Pierrot” by Noël Coward
  29. Success by A. A. Milne
  30. “In the Orchard” by Virginia Woolf
  31. New Hampshire by Robert Frost

Happy Public Domain Day!  We have lots to celebrate this year, and I’m thankful to everyone who’s helped make this celebration possible, and merrier. May we also have lots to celebrate every year hereafter!


2018 A Year of Victories! / Open Library

Happy holidays & Happy New Year, readers! We are thrilled to announce 2018 has been an unprecedented year for Open Library and a great time to be a book-lover. Without skipping a beat, we can honestly say we owe our progress to you, our dedicated community of volunteer developers, designers, and librarians. We hope you’ll join us in celebrating as we recap our 2018 achievements:

Highlighted Victories

New Features

Teamwork Makes the Dream Work

In 2018, 45 members of our community helped fix over 300 issues, contributing over 100,000 lines of code improvements to Open Library and eliminating 95,000 lines of old code.

October was an especially monumental month for our community. Thanks to the organizational efforts of Salman Shah and Tabish Shaikh, Open Library participated in the Hacktoberfest challenge, attracting attention and interest from all around the globe. During this period, 22 members of our community submitted 125 bug fixes and improvements.

The Faces of Open Library

Of the many deserving, we’re proud to feature Charles Horn for his contributions to Open Library. Charles dedicated three years volunteering as a core developer on Open Library before enthusiastically joining Internet Archive as a full-time staff member this year. Charles has written bots responsible for correcting catalog data for millions of books and tens of thousands of authors. Not only has Charles been a foundational member of the community, running stand-ups and performing code reviews, he’s also designed technology which allows us to fight spam and has designed plumbing which allows millions of new book records to flow into our catalog.

Drini Cami sprang into action during a time when Open Library’s future was most uncertain, and he has left an enormous impact. Drini has written mission-critical code to improve our search systems, written code to merge catalog records, fixed thousands of records, worked on linking Open Library records to Wikidata, repaired our Docker build on countless occasions, and has been a critical adviser in making sure we make the right decisions for our users. We can’t speak highly enough about Drini and our gratitude for the positive energy he’s brought to Open Library.

Jon Robson has nearly single-handedly brought order to Open Library’s once sprawling front-end. In just a handful of weeks, Jon has re-organized over 20,000 lines of code and eliminated 1,000 unneeded lines in the process! He is the author and maintainer of Open Library’s Design Pattern Library, the one-stop resource for understanding Open Library’s front-end components. Jon brings with him a wealth of experience in nurturing communities and designing front-end systems that he has earned while leading mobile design efforts at Wikipedia. We all feel extremely lucky and grateful Jon is on team Open Access!

Tabish Shaikh is one of Open Library’s most dedicated contributors, attending community calls at 12am. He’s brought an infectious enthusiasm and passion to the project and has made major contributions, including leading a redesign of our website footer, designing a mobile login experience, making numerous front-end fixes with Jon, and helping with Hacktoberfest coordination.

Salman Shah was Open Library’s 2018 resident Google Summer of Coder and community evangelist. In addition to importing thousands of new book records into Open Library, he also has been a driving force in organizing Hacktoberfest and improving our documentation. He’s a key reason so many volunteers have flocked to Open Library to help make a lasting difference.


“On the internet nobody knows you’re a dog”. For several years, LeadSongDog has anonymously championed better experiences for our users, opening more than 40 issues and participating in discussion for twice that number. Few people have consistently poured their energy into improving Open Library — we’re so grateful and lucky for LeadSongDog’s librarian expertise and conviction.


Lisa Seaberg (@seabelis) is not only an amazingly prolific Open Librarian, but one of our trusted designers for the website. Lisa has fixed hundreds of Open Library book records, redesigned our logo, and actively participates in design conversations within our GitHub issues.



Tom Morris is one of Open Library’s longest-standing contributors. He serves as a champion for high-quality metadata, linked data standards, and better search for our readers. Tom has been instrumental during our Community Calls, advising us to make the right decisions for our patrons.



Christian Clauss is leading the initiative to migrate Open Library to Python 3 by the end of 2019. He’s already made incredible progress towards this goal. Because of his work, Open Library will be more secure, faster, and easier to develop.



Gerard Meijssen, one of our liaisons from the Wikidata community, has coordinated efforts which have helped Open Library merge over 90,000 duplicate authors in our catalog. He has also been a champion for internationalization (i18n).



James Ford paved the way for further design progress on Open Library by consolidating tens of colors in our palette to a manageable handful, and converting them to Less CSS.




You can thank Maura Church for adding average star ratings and reading log summary statistics to all of our books:




Galen Mancino collaborated with the Open Library team on the Book Widget feature which you can read more about here! In addition to his love for books, Galen is passionate about sustainable and local economic growth, revitalization, and how technology can bring us there.



Oh hi, I’m I feel extremely privileged to serve as a Citizen of the World for the Internet Archive’s Open Library community. In 2018, I contributed thousands of high-fives and hundreds of code reviews to support our amazing community. I’m proud to work with such a capable and passionate group of champions of open access. I’m hopeful, together, we can create a universal library, run by and for the people.

… And over 40 others including Num170r, html5cat, thefifthisa, linkel, GLBW, Alexis Rossi, Jessamyn West, et al., who have worked no less tirelessly to make Open Library an inclusive, safe, useful place where readers can thrive!

Thank you and here’s to a wonderful 2019!

On the Surveillance Techno-state / Eric Hellman

I used to run my own mail server. But then came the spammers. And dictionary attacks. All sorts of other nasty things. I finally gave up and turned to Gmail to maintain my online identities. Recently, one of my web servers has been attacked by a bot from a Russian IP address, which will eventually force me to deploy sophisticated bot detection. I'll probably have to turn to Google's reCAPTCHA service, which watches users to check that they're not robots.

Isn't this how governments and nations formed? You don't need a police force if there aren't any criminals. You don't need an army until there's a threat from somewhere else. But because of threats near and far, we turn to civil governments for protection. The same happens on the web. Web services may thrive and grow because of economies of scale, but just as often it's because only the powerful can stand up to storms.  Facebook and Google become more powerful, even as civil government power seems to wane.

When a company or institution is successful by virtue of its power, it needs governance, lest that power go astray. History is filled with examples of power gone sour, so it's fun to draw parallels. Wikipedia, for example, seems to be governed like the Roman Catholic Church, with a hierarchical priesthood, canon law, and sacred texts. Twitter seems to be a failed state with a weak government populated by rival factions demonstrating against the other factions. Apple is some sort of Buddhist monastery.

This year it became apparent to me that Facebook is becoming the internet version of a totalitarian state. It's become so ... needy. Especially the app. It's constantly inventing new ways to hoard my attention. It won't let me follow links to the internet. It wants to track me at all times. It asks me to send messages to my friends. It wants to remind me what I did 5 years ago and to celebrate how long I've been "friends" with friends. My social life is dominated by Facebook to the extent that I can't delete my account.

That's no different from the years before, I suppose, but what we saw this year is that Facebook's governance is unthinking. They've built a machine that optimizes everything for engagement and it's been so successful that they don't know how to re-optimize it for humanity. They can't figure out how to avoid being a tool of oppression and propaganda. Their response to criticism is to fill everyone's feed with messages about how they're making things better. It's terrifying, but it could be so much worse.

I get the impression that Amazon is governed by an optimization for efficiency.

How is Google governed? There has never existed a more totalitarian entity, in terms of how much it knows about every aspect of our lives. Does it have a governing philosophy? What does it optimize for?

In a lot of countries, it seems that the civil governments are becoming a threat to our online lives. Will we turn to Wikipedia, Apple, or Google for protection? Or will we turn to civil governments to protect us from Twitter, Amazon and Facebook? Will democracy ever govern the Internet?

Happy 2019!

DLTJ in another #NEWWWYEAR / Peter Murray

Well, we have reached the end of another arbitrary orbit around our small unregarded yellow sun1, and this primitive ape-descended life form2 is looking back on this blog’s past twelve months. Not much to show for it – this’ll be just the third blog post this year.

And yet it started with so much excitement over a refresh of the technology behind DLTJ. This was my plan:

  1. I’ll start January 1.
  2. I’ll convert the blog from WordPress to a static site generator (probably Jekyll) with a new theme.
  3. My deadline is January 13 (with a conference presentation in between the start and the end dates).

Along the way, I’ll probably package up a stack of Amazon Web Services components that mimic the behavior of GitHub Pages with some added bonuses (standing temporary versions of the site based on branches, supporting HTTPS for custom domains, integration with GitHub pull request statuses). I’ll probably even learn some Ruby along the way.

So the blog did get converted from WordPress to Jekyll, and all of the old posts were brought over as HTML-looking things with significant YAML headers at the top of each file. That was good. 3

There were some pieces left missing, notably the short URLs that were based on the WordPress post number were never recreated into redirects. That was bad. 4

I also did a lot of work on an Amazon Web Services CloudFormation setup that uses a GitHub commit hook to fire off a process of automatically rebuilding the website for every commit. That was ugly; 653 lines of CloudFormation YAML ugly. 5

So what do I want to do with DLTJ this year? Sort of like a list of New Year’s resolutions? Well, this is what I’m thinking:

  1. Write more blog posts.
  2. Do some enhancements to the underlying code – like address the old short URL redirects. I’d also like to build in some diff displays for posts based on the underlying Git files, and maybe even implement a Memento TimeMap based on the Git history. Anyone done something like that yet?
  3. Learn Serverless and reimplement the ugly CloudFormation template.

Check back with me in a year and see how I did!

  1. With apologies to Douglas Adams for mangling his opening paragraph

  2. Ibid 

  3. Now apologizing to Sergio Leone for twisting his spaghetti western film title

  4. Ibid 

  5. Ibid 

2018 in review / Hugh Rundle

Inspired by others I'm taking stock of my 2018. I don't tend to count things like how many books I read, but it's good to reflect on where you've been before checking where you're heading. I felt a bit like not much happened in my life this year, but on reflection that's laughable.

In January I spent a couple of weeks in Singapore, which I didn't really know a great deal about prior to our visit. Whilst Singapore certainly has elements of authoritarianism, I was intrigued by the Singaporean approach to 'multiculturalism' compared with Australia. The uncomfortable feeling I experienced seeing familiar British imperial architecture as - well, imperial architecture - stayed with me when I returned home. It seems odd to write that visiting Singapore made me much more conscious of the continuing physical (and therefore mental) presence of British Imperialism in present-day Australia, but it did. Perhaps it was also the cumulative impact of four years of First World War nostalgia 'commemoration', but on a visit to Daylesford's Wombat Hill Botanical Gardens later in the year I was overwhelmed by the sense that all of it - "Pioneers' Memorial Tower", the nineteenth-century rotunda, and the cannons placed about the hill (captured as war booty at various times) - was a bit grotesque.

I also delivered a talk and participated in some great conversations about 'generous GLAM' at LinuxConfAU.

In February I had the enormous pleasure of introducing Angela Galvan for her keynote The revolution will not be standardized at the VALA 2018 conference. Then I got to visit ACCA's Unfinished Business exhibition with Angela, her sister, and Andrew Kelly. That was a pretty good week.

In March and April I learned how React works and even wrote a little demo app, but I have to say I didn't love it and I'm not convinced it's needed in all, or perhaps even most of the places you'll find it being used. The experience did make me a bit more confident with my coding - I worked my way through the book I was learning from, created an app that worked the way I wanted it to and understood how it was working. I just ...don't like React. Especially the bit where you write JavaScript to create CSS 😒.

In May I set up my own Mastodon instance. You should join.

In June I left local government and public libraries to take up a completely new role supporting librarians in the Academic sector. I now work four days a week and cannot recommend this strongly enough. It's had a huge impact on my stress levels, given me more perspective about what's important to me, and made me somewhat less insufferable to be around. Whilst it's certainly not possible for everyone, I'm convinced most people can afford and would be happier to work four days on 80% of the income they get working five days - if only more employers offered the option.

I had three weeks between jobs in July, and took the opportunity to think about life more broadly. I wanted to use social media - particularly Twitter - less, but still share links to and thoughts about things I was reading, listening to and watching. I was also a bit sick of my typical 'man pontificating' blogging style, so was looking to do something different with my blog. Thus Marginalia was born. Despite being unemployed for most of the month, I also managed to attend two conferences in July: Levels, which ironically made me more comfortable with coding just for my own amusement rather than needing it to be a career move; and APLIC, which stretched into August and was my first conference standing on a vendor booth - causing a few double-takes.

In September I tested the static site generator Eleventy and liked what I saw, spending the next two months setting it up and migrating my blog from Ghost to Eleventy.

In November I published my first npm package - a command-line program that creates a template, including stock image for social media posts, for static-site publishing (e.g. with Eleventy). It appears to have had some downloads on npm, though the stats are a bit opaque as to whether it's automated bots or real humans doing the downloading.

In December I started learning Python and created my first couple of scripts. Not at all coincidentally, one of these auto-deletes Mastodon toots after a certain period of time, and I also used someone else's script to do the same thing with Twitter. I make an effort to keep my blog posts available and their URLs permanent, but social media is supposed to be ephemeral, and I'm increasingly uncomfortable with the idea of it all being there waiting to be read without context some time in the future. I'm continuing to learn more Python, both by making my way slowly through the 1500 page door-stopper Learning Python and also by migrating the code that runs Aus GLAM Blogs from node/Meteor to Python.

Counting this one, I've published eighteen blog posts this year, which I'm surprised by, given I didn't manage to post every month for GLAM Blog Club. According to Pocket, I also read the equivalent of 96 books' worth of articles on the web - which partially explains why I read a lot fewer actual books than that! World politics is a dumpster fire, but personally I'm feeling happier than I have been for some time, and I'm looking forward to seeing what 2019 brings. I'm expecting a lot more reading, coding, writing, and time to think, and maybe even a bit more exercise. But perhaps that's the Christmas pudding talking.

Google's "Crypto-Cookies" are tracking Chrome users / Eric Hellman

Ordinary HTTP cookies are used in many ways to make the internet work. Cookies help websites remember their users. A common use of cookies is for authentication: when you log into a website, the reason you stay logged in is a cookie that contains your authentication info. Every request you make to the website includes this cookie; the website then knows to grant you access.
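A minimal sketch of that login flow, using only Python's standard library. The in-memory `SESSIONS` store, the cookie name `session`, and the helper names are assumptions made for illustration, not any particular website's implementation.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical in-memory session store: token -> username.
SESSIONS = {}


def log_in(username):
    """Return the Set-Cookie header value a server might send after login."""
    token = secrets.token_urlsafe(16)
    SESSIONS[token] = username
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["httponly"] = True  # hide the cookie from page scripts
    cookie["session"]["secure"] = True    # only send it over HTTPS
    return cookie["session"].OutputString()


def who_is_asking(cookie_header):
    """Identify the user from the Cookie header attached to a request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session")
    return SESSIONS.get(morsel.value) if morsel else None


set_cookie = log_in("alice")
# The browser keeps the name=value pair and sends it back on every request.
cookie_header = set_cookie.split(";")[0]
print(who_is_asking(cookie_header))
```

Note that `who_is_asking` only looks at the cookie value: anyone who presents the same header is treated as the same user, which is why a stolen cookie is enough to hijack a login.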

But there's a problem: someone might steal your cookies and hijack your login. This is particularly easy for thieves if your communication with the website isn't encrypted with HTTPS. To address the risk of cookie theft, the security engineers of the internet have been working on ways to protect these cookies with strong encryption. In this article, I'll call these "crypto-cookies", a term not used by the folks developing them. The Chrome user interface calls them Channel IDs.

Development of secure "crypto-cookies" has not been a straight path. A first approach, called "Origin Bound Certificates", has been abandoned. A second approach, "TLS Channel IDs", was implemented, then superseded by a third approach, "TLS Token Binding" (nicknamed "TokBind"). If you use the Chrome web browser, your connections to Google web services take advantage of TokBind for most, if not all, Google services.
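The general idea of binding a cookie to a client-held key can be sketched as follows. This is a deliberately simplified illustration and not the Token Binding protocol itself: random bytes stand in for public keys, the function names are hypothetical, and the sketch simply assumes the TLS layer has already checked that the connecting client holds the private key matching `connection_key_id`.

```python
import hashlib
import hmac
import secrets


def key_id(public_key):
    """Short identifier for a client key (here just a hash of the key bytes)."""
    return hashlib.sha256(public_key).hexdigest()


def issue_bound_cookie(server_secret, user, client_key_id):
    """Issue a cookie that names the key it is bound to, sealed with an HMAC."""
    payload = f"{user}|{client_key_id}"
    tag = hmac.new(server_secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{tag}"


def check_bound_cookie(server_secret, cookie, connection_key_id):
    """Accept the cookie only if it is intact AND arrives on its bound key."""
    try:
        user, bound_id, tag = cookie.split("|")
    except ValueError:
        return None
    payload = f"{user}|{bound_id}"
    expected = hmac.new(server_secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        return None  # cookie was altered
    if not hmac.compare_digest(bound_id, connection_key_id):
        return None  # cookie was copied to a client with a different key
    return user


server_secret = secrets.token_bytes(32)
alice_key = key_id(secrets.token_bytes(32))  # stand-in for Alice's public key
thief_key = key_id(secrets.token_bytes(32))  # stand-in for a thief's public key

cookie = issue_bound_cookie(server_secret, "alice", alice_key)
print(check_bound_cookie(server_secret, cookie, alice_key))
print(check_bound_cookie(server_secret, cookie, thief_key))
```

In this sketch a stolen cookie is useless without the matching key, but the key identifier is also stable across requests, which is the property that matters for the privacy discussion.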

This is excellent for security, but might not be so good for privacy; 3rd party content is the culprit. It turns out that Google has not limited crypto-cookie deployment to services like Gmail and YouTube that have log-ins. Google hosts many popular utilities that don't get tracked by conventional cookies. Font libraries such as Google Fonts, JavaScript libraries such as jQuery, and app frameworks such as Angular, are all hosted on Google servers. Many websites load these resources from Google for convenience and fast load times.  In addition, Google utility scripts such as Analytics and Tag Manager are delivered from separate domains so that users are only tracked across websites if so configured.  But with Google Chrome (and Microsoft's Edge Browser), every user that visits any website using Google Analytics, Google Tag Manager, Google Fonts, jQuery, Angular, etc. is subject to tracking across websites by Google. According to Princeton's OpenWPM project, more than half of all websites embed content hosted on Google servers.
Top 3rd-party content hosts. From Princeton's OpenWPM.
Note that most of the hosts labeled "Non-Tracking Content"
are at this time subject to "crypto-cookie" tracking.

While using 3rd party content hosted by Google was always problematic for privacy-sensitive sites, the impact on privacy was blunted by two factors – caching and statelessness. If a website loads fonts or style files from Google's font servers, the files are cached by the browser and only loaded once per day. Before the rollout of crypto-cookies, Google had no way to connect one request for a font file with the next – the request was stateless; the domains never set cookies. In fact, Google says:
Use of Google Fonts is unauthenticated. No cookies are sent by website visitors to the Google Fonts API. Requests to the Google Fonts API are made to resource-specific domains, such as or, so that your requests for fonts are separate from and do not contain any credentials you send to while using other Google services that are authenticated, such as Gmail. 
But if you use Chrome, your requests for these font files are no longer stateless. Google can follow you from one website to the next, without using conventional tracking cookies.
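To make the difference between stateless and linkable requests concrete, here is a small sketch. The log entries, the field names (`referer`, `channel_id`) and the site names are invented for illustration; it only shows that any stable per-browser identifier is enough for a content host to join up visits to different sites.

```python
from collections import defaultdict


def sites_seen_per_client(requests):
    """Group the embedding sites by the stable identifier on each request."""
    seen = defaultdict(set)
    for req in requests:
        client = req.get("channel_id")
        if client is None:
            continue  # stateless request: nothing to link it to
        seen[client].add(req["referer"])
    return dict(seen)


# Hypothetical font requests as a content host might log them.
stateless_log = [
    {"referer": "news.example", "channel_id": None},
    {"referer": "clinic.example", "channel_id": None},
]
bound_log = [
    {"referer": "news.example", "channel_id": "key-A"},
    {"referer": "clinic.example", "channel_id": "key-A"},
    {"referer": "news.example", "channel_id": "key-B"},
]

print(sites_seen_per_client(stateless_log))
print(sites_seen_per_client(bound_log))
```

With the stateless log nothing can be grouped; with the identifier present, the host can see that one browser visited both sites.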

There's worse. Crypto-cookies aren't yet recognized by privacy plugins like Privacy Badger, so you can be tracked even though you're trying not to be. The TokBind RFC also includes a feature called "Referred Token Binding" which is meant to allow federated authentication (so you can sign into one site and be recognized by another). In the hands of the advertising industry, this will get used for sharing of the crypto-cookie across domains.

To be fair, there's nothing in the crypto-cookie technology itself that makes the privacy situation any different from the status quo. But as the tracking mechanism moves into the web security layer, control of tracking is moved away from application layers. It's entirely possible that the parts of Google running services like and have not realized that their infrastructure has started tracking users. If so, we'll eventually see the tracking turned off.  It's also possible that this is all part of Google's evil master plan for better advertising, but I'm guessing it's just a deployment mistake.

So far, not many companies have deployed crypto-cookie technology on the server side. In addition to Google and Microsoft, I find a few advertising companies that are using it.  Chrome and Edge are the only client-side implementations I know of.

For now, web developers who are concerned about user privacy can no longer ignore the risks of embedding third party content. Web users concerned about being tracked might want to use Firefox for a while.


  1. This blog is hosted on a Google service, so assume you're being watched. Hi Google!
  2. OS X Chrome saves the crypto-cookies in an SQLite file at "~/Library/Application Support/Google/Chrome/Default/Origin Bound Certs". 
  3. I've filed bug reports/issues for Google Fonts, Google Chrome, and Privacy Badger. 
  4. Dirk Balfanz, one of the engineers behind TokBind, has a really good website that explains the ins and outs of what I call crypto-cookies.
  5. (added 12/29/2018) It seems that Chrome will be removing support for Token Binding.

Public Domain Day advent calendar #27: The Invisible Monster by Sonia Greene / John Mark Ockerbloom

Visitors to newsstands in early 1923 encountered a number of significant American magazines for the first time. They could pick up the first issues of Time, with brief, breezy dispatches relating the week’s news from around the world.  (Issues of that magazine from 1923 and some years afterward are already in the public domain, due to a lack of copyright renewals.)  Or they might find dispatches from stranger, creepier worlds in another new magazine: Weird Tales, a pulp fiction magazine featuring stories of horror, fantasy, and what in a few years would be called “science fiction”.  A number of iconic genre characters such as Robert E. Howard’s Conan, C. L. Moore’s Jirel of Joiry, and H. P. Lovecraft’s Cthulhu made their mass-media debut in the magazine.

It took a while for Weird Tales to hit its stride, but there are some notable stories in its first issues, many of which will be joining the public domain five days from now.  (Much of the content published in Weird Tales was not copyright-renewed, but most of the 1923 issues were.)  One tale that caught my interest, as much for its circumstances as for its content, is a story by Sonia Greene titled “The Invisible Monster” in the November 1923 issue (and called “The Horror of Martin’s Beach” in some later reprints).  The story features a strange sea-beast that sailors find, kill, and bring to shore, unknowingly incurring the wrath of the beast’s bigger and fiercer mother.  Able to hypnotize humans so that they both fail to see her and lose control over their body movements, the mother-beast exacts her revenge on the sailors and nearby beachgoers, dragging them to watery deaths.

The story may remind a reader today of stories with similar elements like Beowulf and Jaws.  But it also reads like a Lovecraft story.  That’s not a coincidence: Greene and Lovecraft, who were both active in the world of amateur journalism, had met not long before.  In 1922, Greene visited Lovecraft in New England and suggested the idea for the story while they walked along the beach.  According to L. Sprague de Camp’s biography of Lovecraft, Greene wrote up an outline of the story that night, and Lovecraft was so enthusiastic about the story that Greene spontaneously kissed him, the first kiss he had had since infancy.

Thus began a romance that would eventually result in the marriage of Greene and Lovecraft in 1924, as well as the publication of the story in Weird Tales in 1923.  It’s pretty clear that Lovecraft had some hand in the story that ran there. At the time, a fair bit of his income came from unsigned editing and revising of others’ stories, he appears to have shepherded it into print at Weird Tales, and some of the vocabulary in this story is distinctly Lovecraftian.  Some commenters have therefore not only added Lovecraft as an author of the story, but also credited him as the primary author, or even speculated, as de Camp does, that he wrote the whole thing from Greene’s “mere outline”.  However, both Greene and Lovecraft were experienced writers, and knowing both the tendency of attributions to gravitate to more famous writers, and of women’s writing contributions to be marginalized, I’m inclined to keep crediting Greene as the author of this story, as she is credited in the Weird Tales issue.

Sadly, Greene and Lovecraft’s relationship would soon grow strained.  Beset with financial woes and health problems after their marriage, they spent much of their time apart, were living in different cities by 1925, and by the end of the 1920s had divorced.  Lovecraft’s relationship with his genre has also been increasingly strained.  He was deeply racist, and while his stories have had a significant influence on fantasy and horror literature, many of them are also inherently infused with fear and contempt for non-white races and foreigners.  That eventually led the administrators of the World Fantasy Awards, which had used his likeness on their trophies since their establishment in 1975, to redesign the award without him in 2017.

Some writers, though, have found ways to recognize the contributions of Lovecraft and other early horror writers to the genre while still engaging unflinchingly with their racism.  One work doing this that I particularly like is Matt Ruff’s 2016 novel Lovecraft Country, where the main characters have to deal with both the forces of supernatural horror and the forces of Jim Crow – and the latter are often scarier than the former.

Ruff manages to rework flaws in Lovecraft’s 20th-century work into strengths for the story he wants to tell, and combines it with other Lovecraftian elements to make a sort of narrative alloy well suited for the 21st century.  Once “The Invisible Monster” and other stories from the first year of Weird Tales join the public domain next week, other writers will also have the chance to take their flaws and strengths and make other wonderful things with them.  I don’t know what will result, but I’d love to see what people try.


Securing The Hardware Supply Chain / David Rosenthal

This is the third part of a series about trust in digital content that might be called:
Is this the real life?
Is this just fantasy?
We are moving down the stack:
  • The first part was Certificate Transparency, about how we know we are getting content from the Web site we intended to.
  • The second part was Securing The Software Supply Chain, about how we know we're running the software we intended to, such as the browser that got the content whose certificate was transparent.
  • This part is about how we can know that the hardware the software we secured is running on is doing what we expect it to.
Below the fold, some rather long and scary commentary.

Attacks on the hardware supply chain have been in the news recently, with the firestorm of publicity sparked by Bloomberg's probably erroneous reports of a Chinese attack on Supermicro motherboards that added "rice-grain" sized malign chips. Efforts to secure the software supply chain will be for naught if the hardware it runs on is compromised. What can be done to reduce risks from the hardware?

Hardware Implants

As regards Bloomberg's story, the experts agree on three main points.

First, the details cannot be correct. Patrick Kennedy's Investigating Implausible Bloomberg Supermicro Stories is the most detailed critique, but aspects of his analysis are supported by, among others, Riverloop Security's A Tale of Two Supply Chains, and Joe Fitzpatrick's Hardware Implants.

Second, attacks using hardware implants are feasible. A year before the Bloomberg story, Joe Fitzpatrick listed four scenarios:
  1. Modify the ASPEED flash chip [3] to give a backdoor that can drop a payload into the host CPU’s memory sometime after boot.
  2. Modify the PC Bios flash chip [2] to drop a bootkit backdoor into the OS sometime after boot.
  3. Solder a device onto the board to intercept/monitor/modify the values read from the flash chip as they are accessed to inject malicious code somewhere
  4. Find debug connections on the testpoints [5] to allow debugger control of the ASPEED BMC [1], allowing you to direct it to drop a payload into memory
I think that #4 would probably be the coolest illustration of the point. you could glue the microcontroller you’ve got upside-down to the top of the ASPEED chip, and then solder its legs to some nearby testpoints (AKA dead-bug-style soldering).
Note that both 1 and 2 are in effect firmware attacks and, unlike 3 and 4, are not visible.

Third, hardware implants are not the best way to attack via the hardware supply chain. Joe Fitzpatrick writes:
There are plenty of software vectors for exploiting a system. None of them require silicon design, hardware prototyping, or manufacturing processes, and none of them leave behind a physical item once they’re implanted.
In contrast to Bloomberg's report, reports of the NSA's hardware supply chain attacks are well documented. Among the Snowden revelations documented by Glenn Greenwald was this from an NSA manager in 2010:
Here’s how it works: shipments of computer network devices (servers, routers, etc,) being delivered to our targets throughout the world are intercepted. Next, they are redirected to a secret location where Tailored Access Operations/Access Operations (AO-S326) employees, with the support of the Remote Operations Center (S321), enable the installation of beacon implants directly into our targets’ electronic devices. These devices are then re-packaged and placed back into transit to the original destination. All of this happens with the support of Intelligence Community partners and the technical wizards in TAO.
The picture supports the belief that in this case the NSA was injecting malicious firmware rather than hardware implants. The "beacon implant" can be very small, its only purpose being to locate the compromised device and provide a toehold for further compromise. Earlier, Der Spiegel had reported that it wasn't just routers and firmware:
If a target person, agency or company orders a new computer or related accessories, for example, TAO can divert the shipping delivery to its own secret workshops. The NSA calls this method interdiction. At these so-called "load stations," agents carefully open the package in order to load malware onto the electronics, or even install hardware components that can provide backdoor access for the intelligence agencies. All subsequent steps can then be conducted from the comfort of a remote computer.

These minor disruptions in the parcel shipping business rank among the "most productive operations" conducted by the NSA hackers, one top secret document relates in enthusiastic terms. This method, the presentation continues, allows TAO to obtain access to networks "around the world."
Obviously, many of TAO's operations only involve malicious firmware, but note the wording "install hardware components". Even Cisco found it hard to detect whether this was happening:
Cisco has poked around its routers for possible spy chips, but to date has not found anything because it necessarily does not know what NSA taps may look like, according to Stewart.
At Hacker News, lmilcin's post and the subsequent discussion show that sophisticated supply chain attacks on supposedly secure hardware really do happen:
I have worked in card payment industry. We would be getting products from China with added boards to beam credit card information. This wasn't state-sponsored attack. Devices were modified while on production line (most likely by bribed employees) as once they were closed they would have anti-tampering mechanism activated so that later it would not be possible to open the device without setting the tamper flag.

Once this was noticed we started weighing the terminals because we could not open the devices (once opened they become useless).

They have learned of this so they started scraping non-essential plastic from inside the device to offset the weight of the added board.

We have ended up measuring angular momentum on a special fixture. There are very expensive laboratory tables to measure angular momentum. I have created a fixture where the device could be placed in two separate positions. The theory is that if the weight and all possible angular momentums match then the devices have to be identical. We could not measure all possible angular momentums but it was possible to measure one or two that would not be known to the attacker.
In contrast to TAO's, this is a broadcast attack. The bribed employees probably don't know where the devices are to be shipped, and knowing the location isn't necessary for the attack to be profitable. Bribing employees has the added advantage of increasing the difficulty of correctly attributing the attack.
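The weighing-and-spinning check described in that post can be illustrated with a toy model. This sketch assumes the "angular momentum" measured on the fixture is in effect the moment of inertia about a chosen axis, and it models a device as point masses along a single line; the masses, positions, axes and tolerance are all invented numbers, not measurements from real terminals.

```python
def total_mass(parts):
    """Sum of the masses; parts is a list of (mass, position) pairs."""
    return sum(mass for mass, _ in parts)


def moment_of_inertia(parts, axis_position):
    """Moment of inertia of point masses about an axis at axis_position."""
    return sum(mass * (pos - axis_position) ** 2 for mass, pos in parts)


def matches_reference(device, reference, axes, tolerance=1e-6):
    """True if mass and every measured moment agree with the reference."""
    if abs(total_mass(device) - total_mass(reference)) > tolerance:
        return False
    return all(
        abs(moment_of_inertia(device, axis) - moment_of_inertia(reference, axis))
        <= tolerance
        for axis in axes
    )


# Hypothetical genuine terminal: (grams, centimetres from one edge).
reference = [(100.0, 0.0), (50.0, 4.0), (20.0, 9.0)]
# Tampered unit: a 5 g board added at 2 cm, 5 g of plastic scraped off at 9 cm.
tampered = [(100.0, 0.0), (50.0, 4.0), (15.0, 9.0), (5.0, 2.0)]

secret_axes = [0.0, 5.0]  # fixture positions the attacker does not know
print(total_mass(tampered) == total_mass(reference))
print(matches_reference(tampered, reference, secret_axes))
```

The tampered unit passes the scale because the total mass is unchanged, but moving mass to a different position changes the moment of inertia, so the second check fails.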


Much of what we think of as "hardware" contains software to which what we think of as "software" has no access or visibility. Examples include Intel's Management Engine, the baseband processor in mobile devices, and complex I/O devices such as NICs or GPUs. Even if this "firmware" is visible to the system CPU, it is likely supplied as a "binary blob" whose source code is inaccessible. For example, a friend reports that updating his BIOS also updated his USB Type C interface, his Intel Management Engine, and his Embedded Controller. None of this software, nor the firmware in his WiFi chip and other I/O devices, is open source, and thus none of it can be secured via reproducible builds and a transparency overlay.

Bloomberg's reporting implies that the putative Supermicro attack was targeted, though fairly broadly. Dan Goodin points out the lack of security in Supermicro's Board Management Controllers (BMCs):
several researchers ... unearthed a variety of serious vulnerabilities and weaknesses in Supermicro motherboard firmware (PDF) in 2013 and 2014. This time frame closely aligns with the 2014 to 2015 hardware attacks Bloomberg reported. Chief among the Supermicro weaknesses, the firmware update process didn’t use digital signing to ensure only authorized versions were installed. ... Also in 2013, a team of academic researchers published a scathing critique of Supermicro security (PDF). ... The critical flaws included a buffer overflow in the boards’ Web interface that gave attackers unfettered root access to the server and a binary file that stored administrator passwords in plaintext. ... for the past five years, it was trivial for people with physical access to the boards to flash them with custom firmware that has the same capabilities as the hardware implants reported by Bloomberg.
He then asks If Supermicro boards were so bug-ridden, why would hackers ever need implants?:
Besides requiring considerably less engineering muscle than hardware implants, backdoored firmware would arguably be easier to seed into the supply chain. The manipulations could happen in the factory, either by compromising the plants’ computers or gaining the cooperation of one or more employees or by intercepting boards during shipping the way the NSA did with the Cisco gear they backdoored.

Either way, attackers wouldn’t need the help of factory managers, and if the firmware was changed during shipping, that would make it easier to ensure the modified hardware reached only intended targets, rather than risking collateral damage on other companies.
It isn't just server motherboards, designed for remote access via the BMC, that have remotely exploitable vulnerabilities; so do regular PCs' BIOSes:
Though there's been long suspicion that spy agencies have exotic means of remotely compromising computer BIOS, these remote exploits were considered rare and difficult to attain.

Legbacore founders Corey Kallenberg and Xeno Kovah's Cansecwest presentation ... automates the process of discovering these vulnerabilities. Kallenberg and Kovah are confident that they can find many more BIOS vulnerabilities; they will also demonstrate many new BIOS attacks that require physical access.
But in the context of a targeted attack on sophisticated customers such as Apple, Amazon and telcos, it is important to note that the customer's defenses would likely make exploiting the weak security of the motherboards impossible once they were installed.

Patrick Kennedy's analysis starts from this point:
Even smaller organizations with a handful of servers generally have segregated BMC networks. That basic starting point, from where large companies take further steps, looks something like this.
The key here is that the companies named are all sophisticated, and will have better protections than your average small to medium enterprise. Bloomberg’s report describes an attack that is not possible at the companies listed in the article.
Compromising the systems in transit to a known destination, or selectively at the factory, would be necessary, and well within the capability of a nation state. Selectivity is important; as Joe Fitzpatrick points out, a broadcast attack is noisy:
Every board has it, but we probably only care about one targeted customer of the board. This is where it gets complicated. If 10 million backdoored motherboards all ping the same home server, everyone will notice.
An attacker need only get a few compromised systems into the target; he does not want to compromise them all. For major targets with many systems this poses two problems. For the attacker, intercepting a truck-load of systems all destined for the same customer without being detected, in order to compromise a few of them, is harder than intercepting a single box in transit. For the defense, sampled audits of the supply chain are unlikely to detect the needle in the haystack. Supermicro publicized the result of such an audit:
Reuters' Joseph Menn reported that the audit was apparently undertaken by Nardello & Co, a global investigative firm founded by former US federal prosecutor Daniel Nardello. According to Reuters' source, the firm examined sample motherboards that Supermicro had sold to Apple and Amazon, as well as software and design files for products. No malicious hardware was found in the audit, and no beacons or other network transmissions that would be indicative of a backdoor were detected in testing.
But with customers the size of Apple and Amazon the audit would be very unlikely to find a targeted attack.
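A back-of-the-envelope calculation shows why. Assuming the auditor draws boards uniformly at random and always recognizes a compromised board when it sees one (both generous assumptions), the chance of catching at least one follows from the hypergeometric distribution. The fleet size, implant count and sample size below are invented for illustration.

```python
from math import comb


def chance_audit_finds_implant(total_boards, compromised, sampled):
    """Probability that a random sample holds at least one compromised board."""
    clean = total_boards - compromised
    if sampled > clean:
        return 1.0  # the sample is too big to consist of clean boards only
    # P(miss) = ways to pick only clean boards / ways to pick any boards
    return 1 - comb(clean, sampled) / comb(total_boards, sampled)


# Hypothetical numbers: 10,000 boards delivered, 5 compromised, 50 audited.
p = chance_audit_finds_implant(10_000, 5, 50)
print(f"{p:.1%}")
```

With these made-up numbers the audit has only about a one-in-forty chance of touching a compromised board at all, before even considering whether the implant would be recognized.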

Given the target's likely defenses, it is the server software, not just the BMC, that needs to be compromised. The BMC could be the start of such an attack, but the result would be detectable at the server level. I/O interfaces are other potential routes for a compromise:
Bloomberg had reported that in addition to targeting Apple and Amazon Web Services, Chinese intelligence had managed to get implanted hardware inside an unnamed major telecommunications provider. The alleged victim was never named, with Bloomberg's report citing a non-disclosure agreement signed by the company Bloomberg used as its source for the story, Sepio Systems. Sepio's co-CEO, Yossi Appleboum, claimed that a scan had revealed the implant and that it had been added to an Ethernet adapter when the computer was manufactured.
Other routes would have included disk drive firmware as used by the "Equation Group":
One of the Equation Group's malware platforms, for instance, rewrote the hard-drive firmware of infected computers—a never-before-seen engineering marvel that worked on 12 drive categories from manufacturers including Western Digital, Maxtor, Samsung, IBM, Micron, Toshiba, and Seagate.

The malicious firmware created a secret storage vault that survived military-grade disk wiping and reformatting, making sensitive data stolen from victims available even after reformatting the drive and reinstalling the operating system. The firmware also provided programming interfaces that other code in Equation Group's sprawling malware library could access. Once a hard drive was compromised, the infection was impossible to detect or remove.
The revelation of this compromise three and a half years ago led drive manufacturers to secure their firmware update mechanism. Two years earlier the amazing Bunnie Huang and his colleague xobs had demonstrated essentially the same vulnerability for smaller devices in their Chaos Communication Congress talk called "On Hacking MicroSD Cards".

Cooper Quintin at the EFF's DeepLinks blog weighed in at the time with a typically clear overview of the issue entitled Are Your Devices Hardwired For Betrayal?. The three principles:
  • Firmware must be properly audited.
  • Firmware updates must be signed.
  • We need a mechanism for verifying installed firmware.
Adhering to these principles would help, but each of them is problematic in its own way:
  • Auditing requires third-party access to proprietary source code, and is expensive. While Supermicro's business model would likely be able to afford it, this isn't the case for cheaper devices that also attach to the network, including the Internet of Things.
  • Signing depends upon the vendor's ability to keep their private key secret, and to revoke their keys promptly in the event of a compromise. The vendor's private key is a very high-value target for attackers, as illustrated in 2015 when it was revealed that NSA and GCHQ had compromised Gemalto's network to obtain their private key and thus the ability to compromise SIM cards.
  • Verifying requires extracting the binary firmware from the device for analysis. Physically removing the flash chip containing the firmware is expensive, but it is the only way to be sure of obtaining the actual contents. Otherwise, the contents must be read out through the very firmware being examined, which can be programmed to lie.
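
As a minimal sketch of the verification principle, assuming the image really was obtained by reading the flash chip directly and that the vendor publishes SHA-256 digests of its releases (both assumptions, not something the EFF post specifies), the check itself is simple. The function names and the sample image bytes are hypothetical.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    """SHA-256 digest of an extracted firmware image, as a hex string."""
    return hashlib.sha256(image).hexdigest()


def is_known_good(image: bytes, published_digests: set[str]) -> bool:
    """True if the image matches any digest on the (hypothetical) vendor list."""
    actual = firmware_digest(image)
    # compare_digest avoids leaking, via timing, how much of a digest matched.
    return any(hmac.compare_digest(actual, good) for good in published_digests)


# Illustrative data only: stand-ins for a vendor release and a tampered copy.
release = b"vendor firmware v1.2"
published = {firmware_digest(release)}
print(is_known_good(release, published))
print(is_known_good(release + b" plus implant", published))
```

The hard parts are everything around this check: obtaining an honest copy of the image, and trusting the published list, which is where signing and transparency logs come in.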
The more firmware is open source, the more the techniques of Securing The Software Supply Chain can be used to defend it. Microsoft's recent announcement of Project Mu, open sourcing the UEFI implementation used in Surface devices and Hyper-V, is encouraging. The goal is to provide:
a code structure and development process for efficiently building scalable and serviceable firmware. These enhancements allow Project Mu devices to support Firmware as a Service (FaaS). Similar to Windows as a Service, Firmware as a Service optimizes UEFI and other system firmware for timely quality patches that keep firmware up to date and enables efficient development of post-launch features.
we learned that the open source UEFI implementation TianoCore was not optimized for rapid servicing across multiple product lines. We spent several product cycles iterating on FaaS, and have now published the result as free, open source Project Mu! We are hopeful that the ecosystem will incorporate these ideas and code, as well as provide us with ongoing feedback to continue improvements.
Project Mu features:
  • A code structure & development process optimized for Firmware as a Service
  • An on-screen keyboard
  • Secure management of UEFI settings
  • Improved security by removing unnecessary legacy code, a practice known as attack surface reduction
  • High-performance boot
  • Modern BIOS menu examples
  • Numerous tests & tools to analyze and optimize UEFI quality.
Designing a firmware development, maintenance and distribution channel holistically, rather than bolting maintenance and update on as afterthoughts, is a critical advance.

Chip-Level Attacks

In A2: Analog Malicious Hardware (also here) Kaiyuan Yang et al describe the potential for chip-level attacks:
While the move to smaller transistors has been a boon for performance it has dramatically increased the cost to fabricate chips using those smaller transistors. This forces the vast majority of chip design companies to trust a third party — often overseas — to fabricate their design. To guard against shipping chips with errors (intentional or otherwise) chip design companies rely on post-fabrication testing. Unfortunately, this type of testing leaves the door open to malicious modifications since attackers can craft attack triggers requiring a sequence of unlikely events, which will never be encountered by even the most diligent tester.
The paper describes previous chip-level attacks and the techniques for detecting them. Then they:
show how a fabrication-time attacker can leverage analog circuits to create a hardware attack that is small (i.e., requires as little as one gate) and stealthy (i.e., requires an unlikely trigger sequence before effecting a chip’s functionality). In the open spaces of an already placed and routed design, we construct a circuit that uses capacitors to siphon charge from nearby wires as they transition between digital values. When the capacitors fully charge, they deploy an attack that forces a victim flip-flop to a desired value. We weaponize this attack into a remotely-controllable privilege escalation by attaching the capacitor to a wire controllable and by selecting a victim flip-flop that holds the privilege bit for our processor. We implement this attack in an OR1200 processor and fabricate a chip. Experimental results show that our attacks work, show that our attacks elude activation by a diverse set of benchmarks, and suggest that our attacks evade known defenses.
This is an extremely dangerous attack, since it involves only intercepting a finalized chip design on its way from the back-end house to the fab and injecting an almost undetectable change to the design. The result is a chip that passes all the necessary tests but can be compromised by an attacker who can run user-level code on it.
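
The trigger mechanism can be illustrated with a toy numerical model. This is not the circuit from the paper: the charge step, leakage factor, and threshold below are invented values, chosen only to show why a leaky capacitor fires on a rapid burst of toggles but never on the sparse toggling that ordinary workloads and test benches produce.

```python
def first_trigger_cycle(toggles, step=0.2, retain=0.9, threshold=1.0):
    """Return the index of the first cycle at which the toy capacitor fires, or None.

    toggles   -- iterable of booleans, True when the victim wire toggles that cycle
    step      -- charge siphoned per toggle (invented value)
    retain    -- fraction of charge surviving each cycle's leakage (invented value)
    threshold -- charge at which the attack would force the victim flip-flop
    """
    charge = 0.0
    for cycle, toggled in enumerate(toggles):
        charge *= retain          # leakage every cycle
        if toggled:
            charge += step        # siphon a little charge on each transition
        if charge >= threshold:
            return cycle
    return None


benign = [cycle % 4 == 0 for cycle in range(10_000)]  # wire toggles now and then
attack = [True] * 20                                  # attacker toggles it every cycle

print(first_trigger_cycle(benign))
print(first_trigger_cycle(attack))
```

In this model the benign pattern never reaches the threshold no matter how long it runs, because leakage between toggles caps the charge, while the deliberate burst crosses it within a few cycles.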

Open-Source Hardware

The chip design and fabrication process can be analogized to the software development and deployment process. It consists of developing source code (in a Register Transfer Language or a Hardware Description Language), compiling it into binary (typically polygons in GDS II), and writing the result to a write-once medium (silicon). To what extent could the techniques of Securing The Software Supply Chain be used to secure it?

My list of what it would take to secure CPUs in this way is:
  • Open source CPU designs: Several such designs exist, perhaps the most prominent being RISC-V which is now used, for example, by Western Digital for the CPUs in their disk drives. It has gained enough momentum to force MIPS to open source its instruction set and R6 core. But these designs are typically for small system-on-chip CPUs suitable for the Internet of Things. Western Digital's design is somewhat slower than a low-end Intel Xeon. It has taken ARM three decades to evolve up from IoT-level CPUs to server CPUs; it isn't clear when, if ever, there would be a competitive open source server CPU.
  • Open source tooling: Again, at least one complete open source toolchain exists, but as Chinmay Tongale reports it isn't competitive with commercial tools:
    You can design fairly complex chips in these tools (if not industry standard). I have designed (RTL to GDS2) 16 bit RISC Processor Chip using these.
  • Reproducible tooling: all tools in the chain would have to generate reproducible outputs.
  • Bootstrapped tooling: all tools in the chain would have to be built with bootstrapped compilers.
  • A transparency overlay: the hashes of all tools in the chain, and all their inputs and outputs would have to be secured by a transparency overlay analogous to Certificate Transparency.
Of course, system-level security would require this process not just for the system CPU, for which at least some open source designs are available, but also for the BMC and the I/O controllers, for which open source designs are hard to find, and whose business models would likely not support the additional cost.
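
To make the transparency overlay item concrete, here is a minimal sketch of the Merkle tree hash that Certificate Transparency logs use (the scheme defined in RFC 6962), applied to records of tool outputs. The record format, tool names, and digests are hypothetical; a real overlay would also need signed tree heads, inclusion proofs, and independent monitors.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256 over a 0x00 prefix plus the entry."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior node hash: SHA-256 over a 0x01 prefix plus both children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    """Merkle tree hash of an ordered list of log entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly smaller than n
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Hypothetical log records: one line per tool run, naming the tool and its output digest.
log = [
    b"synthesis-tool output sha256:1111",
    b"place-and-route output sha256:2222",
    b"gds-export output sha256:3333",
]
print(merkle_root(log).hex())
```

Anyone who rebuilds the chain reproducibly can recompute the root and compare it with the published one; a single altered output anywhere in the log changes the root.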

While a design and fabrication process secured in this way is conceivable, it is hard to see such a fundamental transformation of the chip business, which jealously guards its intellectual property, being feasible.


With the many variants of Spectre and Meltdown, 2018 was the year of the side-channel attack. Fundamentally, these attacks are enabled by two things:
  • Multiple processes (an attacker and a target) sharing the same underlying hardware resources.
  • Performance optimizations using hardware resources that may or may not be available.
The attacks manipulate the availability of the hardware resources and use timing to detect whether or not the optimization occurred. From these timings the attacker can infer information about the target.
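
A toy simulation shows the principle. This is a sketch of a flush-and-reload style measurement against a simulated shared cache, not a working exploit: the latencies, addresses, and class names are all invented, and real attacks must cope with noise that this model ignores.

```python
HIT_COST, MISS_COST = 1, 100            # invented latencies, arbitrary time units
PROBE_ADDR, OTHER_ADDR = 0x100, 0x200   # invented addresses


class ToyCache:
    """A shared cache modeled as the set of addresses currently cached."""

    def __init__(self):
        self.lines = set()

    def flush(self, addr):
        self.lines.discard(addr)

    def access(self, addr):
        """Return the simulated access time, then leave the line cached."""
        cost = HIT_COST if addr in self.lines else MISS_COST
        self.lines.add(addr)
        return cost


def victim_step(cache, secret_bit):
    """The victim touches one of two addresses depending on its secret."""
    cache.access(PROBE_ADDR if secret_bit else OTHER_ADDR)


def infer_bit(cache, secret_bit):
    """The attacker sees only timing; secret_bit is passed through to the victim."""
    cache.flush(PROBE_ADDR)
    cache.flush(OTHER_ADDR)
    victim_step(cache, secret_bit)
    elapsed = cache.access(PROBE_ADDR)   # fast only if the victim cached it
    return 1 if elapsed < MISS_COST else 0


secret = [1, 0, 1, 1, 0, 0, 1, 0]
cache = ToyCache()
recovered = [infer_bit(cache, bit) for bit in secret]
print(recovered == secret)
```

The attacker never reads the secret directly; it is inferred entirely from whether the shared resource was left in a fast or a slow state.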

Presumably, this could allow a chip-level attack to be disguised as a fully functional performance optimization, just one that enabled information leakage via a side-channel. Such an attack would be hard to detect in an audit, since it would have a genuine justification.

The Hardware Supply Chain

Riverloop Security describes the general hardware supply chain with these stages:
  1. Design Time: First, a specification is developed – this can be checked for backdoors or weaknesses prior to manufacture. Subverting this stage to introduce a backdoor provides the greatest access.
  2. Hardware Manufacturing: Manufacturing is often subcontracted to a third party and is not easy to check. Manufacturers frequently substitute parts due to availability and cost constraints. Small malicious changes are possible at this stage. The ease of doing so depends on the device and format of the plans the attacker can access and modify.
  3. Third Party Hardware & Firmware Integration: Manufacturers frequently act as integrators and subcontract manufacture of subcomponents. This third-party integration leaves room for a malicious actor to introduce backdoors or exploitable flaws into the system.
  4. Supply Distribution Time: By the time the manufactured device reaches distribution, the company and consumers have little ability to verify the device matches the specification as originally designed. Devices can be replaced wholesale with counterfeit devices or modified to include additional malicious components.
  5. Post Deployment: In the final stage, defense depends largely on the end customer’s physical security and processes.
And Supermicro's supply chain in more detail thus:
Super Micro contracts manufacturing and supply chain logistics. They use Ablecom Technology: a company which manufactures and provides warehousing before international shipments (US, EU, Asia). Ablecom is a private Taiwanese company run by the brother of Super Micro’s CEO, largely owned by the CEO’s family and is critical to Super Micro’s business processes. They rely on them to accurately forecast and warehouse parts from various contract manufacturers to be able to create their products.

If we attempt to simplify the above supply chain, Super Micro’s public disclosures would suggest that the steps applied to their case is, at a high level, as follows:
  1. Super Micro designs product
  2. Ablecom coordinates manufacturing with contract manufacturers; contract manufacturer produces and ships to Ablecom for warehousing
  3. Ablecom ships to Super Micro facility (San Jose, Netherlands, or Taiwan) for final assembly
  4. Super Micro ships to distributor, OEM, or customer
  5. Customer utilizes product
They sum up the problem thus:
Although it is extremely difficult to be assured that something you purchase from a source you do not fully and totally control is trustworthy, there are a number of steps companies can take to make them more difficult to target. These include implementing a supply chain security program, which can involve obfuscating end users of purchases from manufacturers, buying directly from authorized vendors, verifying parts, doing randomized in-depth inspections, and more.
In 2015, Darren Pauli reported for The Register that:
Cisco will ship boxes to vacant addresses in a bid to foil the NSA, security chief John Stewart says.

The dead drop shipments help to foil a Snowden-revealed operation whereby the NSA would intercept networking kit and install backdoors before boxen reached customers. ... Stewart says the Borg will ship to fake identities for its most sensitive customers, in the hope that the NSA's interceptions are targeted.

"We ship [boxes] to an address that's has nothing to do with the customer, and then you have no idea who ultimately it is going to," Stewart says.

"When customers are truly worried ... it causes other issues to make [interception] more difficult in that [agencies] don't quite know where that router is going so its very hard to target - you'd have to target all of them. There is always going to be inherent risk."
Clearly, TAO's operations are targeted. The NSA doesn't want to compromise all Cisco routers and then have to figure out which are the small proportion of interest. They want to compromise just the ones being shipped to targets, which will then "phone home" and identify themselves.

Redesigning Hardware

Riverloop Security suggests that hardware supply chain security:
also can be aided by designing systems which can tolerate, contain, or detect compromise.
But how to do that?

In List of criteria for a secure computer architecture and A computer architecture with hardware-based malware detection, Igor Podebrad et al's goals are:
to derive a hardware architecture which provides much more security features in comparison to current architectures. to:
  • support antivirus agents
  • disable typical malware properties (infection, stealth mechanism etc.)
  • support sensing of attacks
  • support forensic analysis to analyse successful attacks
They proposed, and prototyped in an FPGA, an interesting if impractical system architecture. It was a Harvard architecture (separate code and data address spaces) machine with a "Security Core" that loaded code and detected bad behavior. It isn't clear how the "Security Core" was to be maintained securely, given that its security was based on being also a Harvard architecture machine with code in ROM.

In effect, Podebrad et al take a similar, but in my view much less practical, approach to hardware as the Bootstrappable Builds project described in Securing The Software Supply Chain takes to software: starting from a kernel that is "secure by inspection" and building up to a useful system. The reason I think that Bootstrappable Builds is practical is that it is conservative, not trying to change the way an entire industry works. Podebrad et al's approach is radical, throwing away a half-century of experience and investment in optimizing CPU design and imposing a severe performance penalty in the name of security. History, even as recent as Spectre and Meltdown, shows that this is a very difficult sell.

Dover Microsystems takes a less impractical approach, applying the concept of a security core to a conventional CPU architecture. Their CoreGuard Policy Enforcer monitors writes from the CPU cache to memory against a set of policies based on metadata extracted from the compilation process, and traps on violations. Their technique costs area on the die, which could have been used to enhance performance, and it isn't clear how the metadata on which the mechanism depends can be protected from an attacker.

Co-evolution Of Attack & Defense

It is important to view the security of IT systems not as a single, solvable problem, but as a system in which attacks and defenses co-evolve, just as diseases evolve drug resistance, and research creates new drugs. Let's take Return-oriented programming (ROP) as an example cycle in this co-evolution. The attack was published in 2007 and is described in detail by Erik Buchanan et al's Black Hat talk from 2008, Return-oriented Programming: Exploitation without Code Injection (53-slide PDF).

As Chris Williams explained in 2016's RIP ROP: Intel's cunning plot to kill stack-hopping exploits at CPU level, ROP was the attack's response to improved defenses against earlier attacks:
Once upon a time, you could – for example – find a memory buffer in some software and inject more data into it than the array could hold, thus spilling your extra bytes over other variables and pointers. Eventually you could smash the return address on the stack and make it point to a payload of malicious code you smuggled into the gatecrashing data. When the running function returns, the processor wouldn't jump back to somewhere legitimate in the software, instead it will jump to wherever you've defined in the overwritten stack – ie: your malicious payload.

Voila, deliver this over a network, and you've gained arbitrary code execution in someone else's system. Their box is now your box.

Then operating systems and processors began implementing mechanisms to prevent this. The stack is stored in memory marked in the page tables as data, not executable code. It is therefore easy to trap these sorts of attack before any damage can be done: if the processor starts trying to execute code stored in the non-executable, data-only stack, an exception will be raised. That's the NX – no-execute – bit in the page tables; Intel, AMD, ARM etc have slightly different official names for the bit.
Williams explains ROP in simple terms:
Now, here comes the fun part: return-orientated programming (ROP). Essentially, you still overwrite the stack and populate it with values of your choosing, but you do so to build up a sequence of addresses all pointing to blocks of useful instructions within the running program, effectively stitching together scraps of the software to form your own malicious program. As far as the processor is concerned, it's still executing code as per normal and no exception is raised. It's just dancing to your tune rather than the software's developer

Think of it as this: rather than read a book the way the author intended – sentence by sentence, page by page – you decide to skip to the third sentence on page 43, then the eight sentence on page 3, then the twelfth sentence on page 122, and so on, effectively writing your own novel from someone else's work

That's how ROP works: you fill the stack with locations of gadgets – useful code in the program; each gadget must each end with a RET instruction or similar. When the processor jumps to a gadget, executes its instructions, and then hits RET, it pulls the next return address off the stack and jumps to it – jumps to another gadget, that is, because you control the chain now.
Note that since ROP doesn't involve executing data, the Harvard-architecture CPUs envisaged by Podebrad et al would be vulnerable, illustrating the difficulty of radical changes to address these security issues.
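
Williams' description can be mimicked with a toy model. This sketch is an illustration only, not real exploit code: the "gadgets" are Python functions standing in for instruction snippets that already exist in the program, their addresses are invented, and the stack is a plain list whose contents the attacker is assumed to control.

```python
def inc(state):       # stands in for a snippet like "add 1 to the accumulator; RET"
    state["acc"] += 1


def double(state):    # stands in for "double the accumulator; RET"
    state["acc"] *= 2


def store(state):     # stands in for "write the accumulator to memory; RET"
    state["out"] = state["acc"]


# Existing code in the program, keyed by invented addresses.
GADGETS = {0x401000: inc, 0x401010: double, 0x401020: store}


def run_chain(chain):
    """Execute gadgets in the order their addresses come off the stack."""
    state = {"acc": 0, "out": None}
    stack = list(reversed(chain))   # top of stack is the end of the list
    while stack:
        addr = stack.pop()          # each RET pops the next "return address"
        if addr not in GADGETS:
            raise ValueError(f"no code at {addr:#x}")
        GADGETS[addr](state)        # only pre-existing code ever executes
    return state


# The attacker injects no code, only addresses: inc, inc, double, double, store.
chain = [0x401000, 0x401000, 0x401010, 0x401010, 0x401020]
print(run_chain(chain)["out"])
```

Nothing on the stack is ever executed as code, which is why neither the NX bit nor a strict separation of code and data address spaces stops it.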

In 2016 Baiju Patel's Intel Releases New Technology Specifications to Protect Against ROP attacks described how Intel, working with Microsoft, introduced new hardware defenses against ROP called Control-flow Enforcement Technology (CET):
CET defines a second stack (shadow stack) exclusively used for control transfer operations, in addition to the traditional stack used for control transfer and data. When CET is enabled, CALL instruction pushes the return address into a shadow stack in addition to its normal behavior of pushing return address into the normal stack (no changes to traditional stack operation). The return instructions (e.g. RET) pops return address from both shadow and traditional stacks, and only transfers control to popped address if return addresses from both stacks match. There are restrictions to write operations to shadow stack to make it harder for adversary to modify return address on both copies of stack implemented by changes to page tables. Thus limiting shadow stack usage to call and return operations for purpose of storing return address only. The page table protections for shadow stack are also designed to protect integrity of shadow stack by preventing unintended or malicious switching of shadow stack and/or overflow and underflow of shadow stack.
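
The shadow stack check can be sketched in a few lines. This is a toy model of the behaviour described in the quote, not Intel's implementation: the ToyCpu class and the addresses are hypothetical, and the page-table protections that keep ordinary code from writing to the shadow stack are simply assumed.

```python
class ControlFlowViolation(Exception):
    """Stand-in for the exception the CPU raises on a return address mismatch."""


class ToyCpu:
    def __init__(self, shadow_enabled):
        self.stack = []            # ordinary stack: writable by program bugs
        self.shadow = []           # shadow stack: assumed unwritable by normal code
        self.shadow_enabled = shadow_enabled

    def call(self, return_addr):
        self.stack.append(return_addr)
        if self.shadow_enabled:
            self.shadow.append(return_addr)

    def ret(self):
        addr = self.stack.pop()
        if self.shadow_enabled and addr != self.shadow.pop():
            raise ControlFlowViolation(f"return to {addr:#x} does not match shadow stack")
        return addr


def smash_and_return(cpu, gadget_addr=0x401010):
    cpu.call(0x400123)             # legitimate call
    cpu.stack[-1] = gadget_addr    # buffer overflow overwrites the return address
    return cpu.ret()


print(hex(smash_and_return(ToyCpu(shadow_enabled=False))))
try:
    smash_and_return(ToyCpu(shadow_enabled=True))
except ControlFlowViolation as err:
    print("blocked:", err)
```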
CET also protects against a variant of ROP, Jump Oriented Programming, by ensuring via Indirect Branch Tracking (IBT) that all valid targets of jumps or indirect branch instructions are labeled as such:
The ENDBRANCH instruction is a new instruction added to ISA to mark legal target for an indirect branch or jump. Thus if ENDBRANCH is not target of indirect branch or jump, the CPU generates an exception indicating unintended or malicious operation. This specific instruction has been implemented as NOP on current Intel processors for backwards compatibility (similar to several MPX instructions) and pre-enabling of software.
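
Indirect branch tracking can be sketched the same way. Again this is a toy model under stated assumptions: the program memory, mnemonics, and addresses below are invented, and the check is reduced to its essence, namely that an indirect branch may land only on an instruction marked as a legal target.

```python
class ControlProtectionFault(Exception):
    """Stand-in for the exception raised when a branch lands on an unmarked target."""


# Toy program memory: invented addresses mapped to instruction mnemonics.
PROGRAM = {
    0x1000: "endbranch",   # legitimate function entry point
    0x1001: "push",
    0x1002: "mov",
    0x1003: "pop",         # a tempting place to jump into the middle of the function
    0x1004: "ret",
}


def indirect_branch(target, ibt_enabled):
    """Return the new instruction pointer, or fault if the target is not marked."""
    if target not in PROGRAM:
        raise ValueError(f"no code at {target:#x}")
    if ibt_enabled and PROGRAM[target] != "endbranch":
        raise ControlProtectionFault(f"{target:#x} is not a marked branch target")
    return target


print(hex(indirect_branch(0x1000, ibt_enabled=True)))    # function entry: allowed
print(hex(indirect_branch(0x1003, ibt_enabled=False)))   # mid-function, no IBT: allowed
try:
    indirect_branch(0x1003, ibt_enabled=True)
except ControlProtectionFault as err:
    print("blocked:", err)
```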
In Williams' words:
What CET does here is ensure that, when returning from a subroutine, the stack hasn't been tampered with to hijack the flow of the software. No ROP, no working exploit, no malware infection.
He suggests a possible direction for the next stage of co-evolution:
The shadow stack can't be modified by normal program code. Of course, if you can somehow trick the kernel into unlocking the shadow stack, meddle with it so that it matches your ROP chain, and then reenable protection, you can sidestep CET. And if you can do that, I hope you're working for the Good Guys.
Despite the care with which Intel maintained compatibility, it took two more years before, as Jonathan Corbet explained in Kernel support for control-flow enforcement, the Linux kernel developers had figured out how to use it effectively:
Yu-cheng Yu recently posted a set of patches showing how this technology is to be used to defend Linux systems.

The patches adding CET support were broken up into four separate groups: CPUID support and documentation, some memory-management work, shadow stacks, and indirect-branch tracking (IBT). The current patches support 64-bit systems only, and they only support CET for user-space code. Future versions are supposed to lift both of those restrictions.
But support for CET is in its early stages:
there appear to be no concerns ... about the CET features overall. They should make the system far more resistant to some common attack techniques with, seemingly, little in the way of performance or convenience costs. Chances are, though, that this technology won't be accepted until it is able to cover kernel code as well, since that is where a lot of attacks are focused. So CET support in Linux won't happen in the immediate future — but neither will the availability of CET-enabled processors.
Note the long timescales involved. It took nine years from publication of the attack to publication of the defense, and another two years before operating systems were able to start supporting it. It will take many more years before even a majority of the installed base of CPUs implements CET.


In the foreseeable future it doesn't seem likely that it will be possible to build generally useful systems with chips produced via a secure design and fabrication process, and firmware secured via reproducible builds. The foundations of our IT systems will continue to be shaky.

Retail Site Search: Shoppers Still Can’t Find What They’re Looking For / Lucidworks

In one of U2’s best-known songs, Bono sings dolefully, “I still haven’t found what I’m looking for.” He’s not alone. On many retail websites, shoppers longing to find an item with ease and precision also end up feeling unfulfilled.

In one survey of U.S. shoppers, a solid majority (60%) reported being frustrated with irrelevant search results. In another, almost half (47%) of online shoppers complained that “it takes too long” to find what they want, while 41% have difficulty finding “the exact product” they are looking for.

Consumers’ frustration with poor shopping site search is no small matter. The fact that shoppers can search independently for products online, as opposed to waiting for a retailer to present them, is at the heart of the ongoing transformation of retailing.

Recommendations Create the Shopper Journey

Analysts at Deloitte Consulting identified the “trend for consumers to take their own lead in the shopping journey” in a report called The New Digital Divide. “A significant number of consumers want to manage the journey themselves, directing the ways and times in which they engage retailers rather than following a path prescribed by retailers or marketers.”

Precise, personalized, and speedy search—and the companion function of product recommendations—is becoming a key differentiator in ecommerce. Even Amazon, where 47% of shoppers already start their product searches, is working on the next-generation search experience.

Amazon’s new service called Scout is based on products’ visual attributes, reports CNBC. “It is perfect for shoppers who face two common dilemmas: ‘I don’t know what I want, but I’ll know it when I see it,’ and ‘I know what I want, but I don’t know what it’s called,’” said a statement Amazon sent to CNBC.

As leading-edge ecommerce sites like Amazon train consumers to expect better search, shoppers will have even less tolerance for a mediocre search experience.

‘What’s New?’ Is Still the Leading Question

Once customers have made a purchase, retailers have the knowledge they need to entice them back. A simple and effective way to do that is with automation; Amazon’s Subscribe & Save program, for example, provides a discount to shoppers who sign up for regularly scheduled automatic delivery of certain items.

However, because consumers overwhelmingly want to see what’s new when they interact with a retailer, site search and personalized recommendations provide ecommerce with the greatest opportunity to capture new shoppers or to introduce existing customers to additional products or categories.

In fact, 69% of consumers responding to a Salesforce and Publicis.Sapient survey reported that it is “important” or “very important” to see new merchandise every time they visit a physical store or shopping site, and three-quarters of shoppers are using new site search queries online each month.

This explains why more than half (59%) of the top 5% of best-selling products on e-commerce sites change every month, according to the report. “That means retailers and brands can’t sleep on analyzing shopper searches and delivering the ever-changing items they seek in real time.”

Machine Learning: Know Your Customer

E-retailers are increasingly using Artificial Intelligence (AI), specifically Machine Learning (ML) and Natural Language Processing (NLP), to help shoppers discover what they want, perhaps before they know themselves.

AI-driven personalized recommendations can also provide a big payoff for retailers. A survey by Salesforce and Publicis.Sapient found that “6% of e-commerce visits that include engagement with AI-powered recommendations [drive] an outsized 37% of revenue.”

“The best way to understand your customers’ needs is to actually track and listen to your consumer,” said Lasya Marla, Director of Product Management at Lucidworks. “You do this by tracking customer signals, what they click on, what they ignore, what they call things. Recording and analyzing signals is crucial to learning your customers’ likes and dislikes and their intent.”

Merchandising Expertise Still Key

While machine learning can automatically suggest products and help customers discover items they wouldn’t have otherwise, many brands, particularly lifestyle brands, are loath to risk merchandising with machines while they have experts on hand.

According to Peter Curran, president of Cirrus10, an onsite search system integrator in Seattle, “we work with brands that want ML to eliminate the drudgery of search curation—synonyms, boosts, redirects, and keywording—but who still want to finesse the customer experience.

“The dance between brand and brand aficionado is filled with nuance that merchandisers tend to notice—and IT departments tend to miss. The role of the merchandiser is ready for transformation,” Curran continued. “Feature selection, entity extraction, embeddings, and similar concepts are currently the job of the data scientist, but that work can’t be done well without the cooperation of the business user. We need tools that allow business users and data scientists to cooperate on improved models and always allow business users to override automation.”

Next Steps for Brands and Retailers

For ecommerce retailers and brands thinking about upgrading and modernizing their search functionality, it’s critical to develop a strategy that is integrated with the organization’s long-term goals. There are many aspects to consider during the selection of new technology for site search, but these questions can help in the process:

  • How can we develop better algorithms and techniques to match keywords to products?
  • What do we need to automatically fix search keywords based on misspellings, word order, synonyms, and other types of common mismatches?
  • How can we enable our marketing and merchandising people to take control of search so that it supports the business, including promotions, inventory, and seasons?
  • What tools do we need to analyze search trends at both an individual and macro level so that we can adjust in real time?
  • What are the signals customers are sending—and how can we best capture them?
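
As a rough illustration of the second question, here is a minimal sketch of query normalization that handles misspellings, synonyms, and word order using only the Python standard library. The vocabulary, synonym table, and similarity cutoff are invented for the example; a production search platform would learn these from catalog data and customer signals rather than hard-code them.

```python
import difflib

# Hypothetical catalog vocabulary and merchandiser-maintained synonyms.
VOCABULARY = ["red", "blue", "sneakers", "boots", "sandals"]
SYNONYMS = {"trainers": "sneakers", "kicks": "sneakers"}


def normalize_query(query: str) -> tuple[str, ...]:
    """Map a raw query to a canonical, order-independent tuple of catalog terms."""
    terms = []
    for word in query.lower().split():
        word = SYNONYMS.get(word, word)            # synonyms first
        if word not in VOCABULARY:                 # then try to repair misspellings
            close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
            if close:
                word = close[0]
        terms.append(word)
    return tuple(sorted(terms))                    # sorting ignores word order


print(normalize_query("Trainers red"))
print(normalize_query("red sneekers"))
```

Both queries normalize to the same pair of catalog terms, so they can be served by the same results and the same merchandising rules.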

While it is true that looking for a new pair of boots or a specialized metalworking tool does not rank up there with the search for a soulmate that U2’s Bono sings about, the desire to find a certain item in an online store has an emotional component that is intimately connected to the shopper’s perception of that brand.

Ironically, technology can produce search results and recommendations that are so personalized that they enhance this emotional connection, giving the consumer the sense that the brand “knows me.” The choice of a platform for site search, then, will make visitors fall more deeply in love with a retail brand—or send them elsewhere to find what they are looking for.

Marie Griffin is a New Jersey-based writer who covers retail for numerous B2B magazines.

The post Retail Site Search: Shoppers Still Can’t Find What They’re Looking For appeared first on Lucidworks.

They CAN and they SHOULD and it’s BOTH AND: The role of undergraduate peer mentors in the reference conversation / In the Library, With the Lead Pipe

In Brief:

Academic libraries hire and train student employees to answer reference questions which can result in high-impact employment experiences for these students. By employing students in this role, opportunities are created for peer-to-peer learning and for a learning community to develop among the student employees. However, not everyone supports this practice. Some believe undergraduates lack the expertise to handle reference questions; others express a fear of “missing out” on consultations, assuming student employees will not make a referral. This article expands on Brett B. Bodemer’s 2014 article, “They CAN and They SHOULD: Undergraduates Providing Peer Reference and Instruction.” The author discusses why undergraduates can and should provide reference assistance and how in these situations, it’s both and. Undergraduates can receive help from both their peers and librarians; it’s not a dichotomy or either-or situation. This article reflects on the practice of peer-to-peer reference services, offers counters to critiques against this type of student employment, and provides insight on the opportunities available when librarians believe in both and.

By Hailley M. Fargo


We call them consultants. Navigators. Mentors. Assistants. Coaches. Educators. Leaders. These are some of the titles we give the students we employ in academic libraries; these titles convey a sense of expertise and leadership that our students bring to their positions. The way we name our student employees signals the ways we believe undergraduate students contribute to the library and the university’s mission through their work.

It is common for academic libraries to employ students and for some, the library may employ the highest number of students across campus. Our student employees help run, maintain, and support the library. Regardless of their role, student employees are a crucial cog in academic libraries. As student engagement experiences and high-impact practices continue to gain popularity as the “gem” of an undergraduate’s college journey, student employment in the library has great potential to provide students with the skills they need to succeed, academically and professionally. Librarianship has discussed how to set up meaningful student employee programs (see Guerrero & Corey, 2004; O’Kelly, Garrison, Merry, & Torreano, 2015; Becker-Redd, Lee, & Skelton, 2018). More recently, a group of librarians (Mitola, Rinto, & Pattni, 2018) looked at this literature to see how student employment in the library matched up with the characteristics of high-impact practices, as defined by the Association of American Colleges & Universities (AAC&U) and George Kuh (2008). If libraries can build student employment programs that use characteristics of high-impact practices, we have a chance to become leaders in this area of student engagement. As we think about how to build student employment opportunities, creating a peer-to-peer service becomes one option.

In reconsidering what student employment can look like in the libraries, we are also reconsidering what reference can look like in an academic library. We know that reference continues to change as we strive to find new and innovative ways to reach our users. Librarianship has devoted a fair amount of time and space in our scholarship to discussing the evolution of reference services. In the early 2000s, articles were written on the decline of reference questions asked in academic libraries, and there was a move towards shifting librarians off the reference desk to allow them to focus on instruction, outreach, and liaison duties (Faix, 2014). The discussion around the future of reference went as far as author Scott Carlson (2007) saying reference desks would no longer exist by 2012. In 2018, reference desks still exist, and academic libraries have tried all sorts of ways to handle reference: one desk, no desk, many desks, desks with paraprofessional staff, desks with students, consultation models, and roving reference, just to name a few. Depending on the institution, size of staff, and funding, reference services can look different and will often evolve over time as the institution and staff change. Today, we still conduct reference conversations¹ with our users, but we are also thinking about how to staff these desks from different angles, including using undergraduates to provide foundational help during the research process.

In this article, I expand on Brett B. Bodemer’s 2014 article where he stated that, done correctly, undergraduates can and should provide reference services to their peers. Not only can and should undergraduates provide reference service in academic libraries, but in thinking about this opportunity, we have to understand how employing undergraduate students creates a both and situation in assisting users with research. In the reference landscape, undergraduates seeking reference help can receive this help from both their peers and librarians. Reference does not have to be a dichotomy or either-or situation with our students. Instead, we can view this as an opportunity to leverage peer-to-peer services, contribute to meaningful student employment experiences, extend our reach, and strengthen our reference services.

Overview and history of peer-to-peer services in academic libraries

Over time, we have asked our student employees to be the “face” of the library. This might mean they are the first employee users see when entering the building and help create first impressions for what the library can do (Brenza, Kowalsky, & Brush, 2015). Or, student employees might be involved with outreach initiatives and assist with marketing the library (Barnes, 2017). We ask students for their insight on advisory boards and help us steer the library in new directions. In asking our student employees to take on more responsibilities and help advocate and promote the library, we must ensure their experiences working in the library extend beyond directional and clerical duties. Peer-to-peer services have emerged as one way to empower our student employees in becoming stronger researchers, library users, and peer teachers.

A common way to view peer-to-peer services is through the peer-assisted learning (PAL) framework. PAL is a space in “which students can manage their own learning experiences by exploring, practicing, and questioning their understanding of issues and topics with a well-trained peer, untethered from the hierarchy inherent in formal instruction environments or in working with professional librarians and staff” (O’Kelly, Garrison, Merry, & Torreano, 2015). In thinking about this learning space, it is important to know that PAL is grounded in Lev Vygotsky’s concept of “zone of proximal development.” When a novice student researcher is working with a more experienced peer-mentor, both students “stretch” to meet each other in the middle. Both the peer-mentor and the learner benefit from the interaction because both are asked to learn something new in order to create new knowledge, together.

More broadly, scholars Topping and Ehly (2001) believe that PAL is a group of strategies that all carry these key values:

  • Students who help their peers also learn something in the interaction
  • This interaction always complements, never supplements, professional teaching
  • Both the mentor and learner gain new knowledge through the interaction
  • All learners should have access to PAL
  • Peer mentors should be trained and assessed by professional teachers, who work with mentors throughout their time assisting learners

Academia has embraced PAL in a variety of ways, as seen through the various types of peer-mentor groups that exist on campuses. Some of these groups include writing tutors/consultants and tutors/mentors for a discipline or a class. Over time, libraries began building programs around the PAL framework. As Erin Rinto, John Watts, and Rosan Mitola say in their introduction to Peer-Assisted Learning in Academic Libraries, PAL gives “librarians…the chance to intentionally design and implement experiences that meet the criteria of these highly effective educational practices and create meaningful opportunities for students to learn from and with one another” (2017, p. 14). Many libraries have already embraced PAL, from the 14 case studies featured in Peer-Assisted Learning in Academic Libraries to other recently published case studies (Faix et al., 2010; Wallis, 2017; Meyer & Torreano, 2017; Bianco & O’Hatnick, 2017). These case studies provide valuable insight into how to set up these programs, as well as the potential obstacles and benefits to consider.


Some colleagues worry about employing students in this peer-to-peer reference role. In this section, I expand on some of the common critiques given by librarians and library employees when considering or deploying a peer-to-peer model of reference support. In drawing out these critiques, I offer a counter to each, which could be used when advocating for our student employees in this model. The critiques I will explore are: the fear of “missing” out, quality assurance issues, and loss of professionalism when transferring some reference responsibilities to undergraduate students. These three critiques are interconnected and I will do my best to tease them out, while also including commentary on how peer-to-peer reference models can be set up to support the both and.

Fear of “missing” out

A top critique from librarians is concern over “missing out” on referrals and an opportunity to connect with an undergraduate student. This critique seems to come from a lack of trust or confidence in our students’ ability to answer research-related questions and efficiently use library resources, but also from some anxiety about being replaced by undergraduate students. Peer-to-peer services should never be created in lieu of, or to replace, subject librarians. The role of peer-to-peer services is to complement subject librarians and also fill out the reference services landscape. By supporting and growing peer-to-peer services, we give our patrons another option for research support: a student employee who might have taken the class the student is seeking help in, a student employee who better understands the experience of being a student at the institution, or a student employee who can vouch for and recommend library services and support, like subject librarians. Sometimes that peer-to-peer recommendation goes farther than a faculty member or librarian suggesting to their students to set up an appointment or “use the library.”

We also know that librarians are “missing out” on reference conversations on a regular basis. In work done by Project Information Literacy, students who encounter obstacles throughout the research process are more likely to seek help from their peers, instead of a librarian (Head & Eisenberg, 2010). In this situation, having well-trained peer leaders can be instrumental in bridging this gap and helping students find and understand the information they need. In addition, undergraduate students do not always run on a “traditional” 9-5, Monday-Friday schedule. Student learning happens when they are situated to put what they have learned into practice — the “right” time to be situated happens at all hours of the day (Bell, 2000). Libraries, in employing students to provide reference at various times throughout the day, help these situated learners by meeting students when they are ready to learn.

Previous experience as an evening reference & instruction librarian confirmed that students need help finding information at a variety of times, and usually not during “daylight” hours (Fargo, 2017). Librarians are not always available when students need help finding information. Instead of these students fending for themselves, having peer-to-peer reference support, outside of the 9-5 schedule, allows for our students to still receive high-quality help. Again, the both and situation arises — perhaps a student will receive after-hours help from their peer and due to a positive interaction from this situation, they might be more likely to seek out a subject librarian for help on their next research project. Or, even if this satisfied student seeks out their peer again, the well-trained peer mentor will know when it is appropriate to pass this student along to the subject librarian for more extensive research help. In order to make these situations happen, we have to spend time training our student employees and helping them locate that sweet spot for a referral. Without this intentionality, we fall into the second critique — a decline in quality.

Decline in quality

Related to fear of missing out, some suggest that there is a decline in the quality of service provided by peers during a reference conversation. One way to ease this fear is to provide extensive, in-depth, and continual training for our student employees in peer-to-peer roles. In almost every article written about building a PAL program or training students to provide reference help, the authors discuss the importance of good training. Rinto, Watts, and Mitola (2017) mention this in their introduction saying “…it is essential that students are well-prepared for the demands of their position and are able to deliver high-quality learning experiences to their peers” (p. 10). Without well-planned and continual training for student employees, quality of interactions will undoubtedly suffer. From personal experience in building a peer-to-peer reference program, along with best practices mentioned in the literature, it takes about 15-20 hours of on-boarding, paired with regular staff meetings to share ideas, talk through previous reference conversations, and bring in library colleagues from various departments to provide additional training. As we train our students, we should do our best to use real-life reference examples, attempting to get as close as we can to an actual peer-to-peer interaction. When we fabricate examples, including fictional database names or articles, we signal to students that we do not take their role in the reference landscape seriously and this can lead to a decrease in quality of service. If we believe that our student employees are collaborators and part of our community, then we should use examples that we have seen previously.

Another way to ensure quality of service is to create a set of learning outcomes for the peer-to-peer program, which helps inform training and assessment. The assessment piece can be a way to quell concerns about quality of service and also garner buy-in from colleagues. Having a clear, internal communication plan about what success looks like for this program can also help get colleagues excited. At Penn State, our Peer Research Consultants (PRCs) took a “final” test after their initial onboarding. This “test” asked them to answer a reference question with a supervisor sitting in on the conversation. The PRCs were graded on a rubric that used the program’s learning outcomes to guide evaluation. Beyond onboarding, staff meetings, and a “final” test, many programs also ask student employees to reflect on a regular basis, such as after a reference conversation in order to help the student better understand their role helping their peers (Courtney & Otto, 2017). These reflections also provide insight to the supervisors about the types of questions being asked and any challenges their student employees might be facing.

In creating the training program for peer-to-peer services, coordinators should think strategically about how to communicate with other librarians and staff. This communication could include learning outcomes, training outlines, and assessment. Providing clear documentation can help ease fear around quality of work while also inviting colleagues to participate in training the student employees. These colleagues could be guest speakers at staff meetings or could drop in to introduce themselves to the student employees. Having an open dialogue with librarians, staff, and student employees creates an environment where everyone’s voice is heard and ensures quality service can be provided by all parties.

Another important element of training is deciding what a referral process will look like. Referrals can be sticky to handle and each library makes decisions about how a peer mentor/leader will make a referral. At Michigan State University, the Peer Research Assistants do not have a formal referral mechanism but are encouraged to share subject librarian information with the students they are helping (Marcyk & Oberdick, 2017); Hope College spends a good chunk of its training defining what a referred question will look like, and has positioned the reference desk near the offices of the librarians who handle the referrals (Hronchek & Bishop, 2017). Regardless of the process, there should be clear communication over what a referral will look like so that all parties know the expectations. The procedure for referrals will inevitably change over time and future iterations will be able to accommodate what is learned through trial and error.

Finally, in thinking about quality of service, there is something to be said around expertise. Just like we value a librarian’s subject or functional expertise, we should also value our students’ expertise and the experiential knowledge they bring into their role as peer mentors/leaders. They know how to be a student at your institution and this expertise should be celebrated the same way we value subject and functional expertise. Just like we speak the language of library and information science, our students speak the language of their peers and this can be incredibly powerful. Lee Burdette Williams (2011) said it best,

There is no aspect of the collegiate experience…that cannot benefit from the involvement of a peer who explains, in language often more accessible, a difficult concept. A peer can talk with students…in ways even the knowledgeable professionals cannot. A peer will use communication tools, media, and language that may seem foreign to those of us even a decade older (p. 99).

As we create the training for our students, we need to make sure we are preparing them to succeed in their position. In this preparation, we also trust our students to rise to the occasion in providing the best service they can to their peers. Through professionalizing their role, we can show them their expertise is valued and this trust can help ensure the quality of their service to all users. Often, our peer mentors will teach us as librarians about new ways we can discuss the research process with our students. As we think about ways to professionalize the students’ role in the library, the final critique we arrive at is the loss of professionalism for our jobs as librarians.

Loss of professionalism

When deciding to implement a peer-to-peer service in the libraries, some might discuss their anxieties or fears around a loss of professionalism if this service model shift occurs. If undergraduate students are able to handle reference questions, outsiders might assume that librarians are no longer needed, or that a library does not need as many librarians on staff. This is an incorrect assumption, as having a peer-to-peer service requires the dedication, time, and resources of one or more librarians to assist with hiring, training, supervision, and evaluation of the peer mentors. As Lee Burdette Williams states, “Peer educators should never be seen as a stopgap measure to save money. They cannot replace competent and committed professionals who have spent years learning and re-learning their craft, any more than teaching assistants can replace competent and committed faculty” (2011, p. 98). Competent and committed professionals are needed to help build and maintain this program and train students to do their best work. Not only does it take a large amount of time to build a program, but sustaining a peer-reference program requires a considerable amount of time from the coordinator(s) of the program. Our student employees should not be duplicating the efforts made by librarians; their role is to extend our reference landscape and set up our both and situation. By extending hours of available help and raising awareness about library resources through the student perspective, our users are in a better position to receive help from both their peers and librarians.

In some ways, being asked to create a peer-to-peer model of reference support is a way to utilize and leverage our expertise around providing reference services. Alison Faix (2014) says that “peer reference itself can be seen as another form of teaching, one where librarians first teach the reference student assistants, who then go on to help other students in their roles as peer information literacy tutors” (pp. 307-8). Just like we know the benefits of undergraduates being asked to articulate a process and lead their peer to new knowledge (zone of proximal development), we ourselves are challenged to do something similar when building a new peer-to-peer service. In creating a peer-to-peer program, we are helping to build a community of practice for a wide range of library employees. This new community gives everyone involved the chance to talk about reference, the research process, and their practice of providing support in the library. In some cases, our student employees might challenge or push us to rethink how we answer reference questions and together, we will reach new knowledge and insight. For example, many of the ways I think about providing reference and supporting students in providing reference came from a community of practice between the student employees and myself working in my library. We need this community of practice and by collaborating and having dialogue with our student employees, we are strengthening our reference landscape. All of this work helps support a student-centered library and ensures our students are getting the help they need.

In thinking about professionalism and peer-to-peer reference programs, this reference landscape might be impacted by issues around labor and neoliberalism in higher education. For libraries that are a part of a union, there might be stipulations or requirements for what is allowed to be done by librarians versus student assistants. By asking students to perform reference responsibilities, some might see this as a way of deprofessionalizing the field and breaking the expected rules around who gets to do and be paid for what type of labor. In these situations, it can be helpful to return to the idea of both and, and be intentional about how labor is divided and valued. It is important to be critical and conscientious of these ideas when considering peer-to-peer reference programs; there is no one-size-fits-all model for this. Luckily, much of the current literature on these programs is published as case studies, which can examine an institutional and library context, and show how a program was built within that context.

At the same time, academic libraries are faced with shrinking budgets and unfilled staff lines, fewer librarians for increasing enrollments, and new strategic positions that ask librarians to step outside of the traditional library setting to do their jobs. All of this contributes to less time to spend on a reference desk answering questions that may or may not fully utilize the skills librarians bring to that desk. It is in these situations that leveraging student labor in a meaningful way — through intentional and continual training that provides transferable skills, frequent interactions with library staff, and regular feedback to ensure success — can be a way to deal with these pressures within higher education. However, when implementing this, we should make sure the core motivation to creating these student services extends beyond labor and is centered on the benefits the library can provide through employment to our peer mentors. The experience, skills, and learning that can happen with our peer mentors should drive us forward because those are the experiences the library should be advocating and supporting.

Conclusion and next steps

To take action in supporting a peer-assisted learning approach, we have to be intentional about how we create, train, and assess student employees. This intentionality can help us move away from student employment in the library being transactional and instead, become a transferable experience (Mitola, Rinto, & Pattni, 2018). To make this transformation, we must take our student employees seriously and see them as collaborators and one piece of the larger reference conversation landscape. In viewing this larger landscape, we have to remember that students can help us reach more of our patrons and these situations will always be both and.

To create these meaningful employment opportunities, we must commit time to set up the training, provide the necessary scaffolding to give our students the skills they need to participate in a reference conversation, and communicate our progress with all our colleagues, who might be removed from the day-to-day work of our student employees. What can drive us forward is knowing that we can be leaders at our institutions if we embed the characteristics of high-impact practices in our employment opportunities: time that the student has to devote to their job, regular interactions with faculty and peers, space to receive formal and informal feedback, opportunities to establish connections to the campus and broader communities, diversity integrated into all training, and projects that provide transferable skills to those who participate (Kuh, 2008). Well-constructed library employment can be a way for our students to not only learn more about the library, but also increase their own research skills (Allen, 2014; McCoy, 2011) and gain transferable skills and experience in leadership, time management, and working with others towards a common goal (Charles, Lotts, & Todorinova, 2017; Melilli, Mitola, & Hunsaker, 2016; Beltman & Schaeben, 2012). Coordinators should think strategically about what long-term assessment can look like for their program in order to be able to document and share how student employees use these skills in positions after the library.

As with most new programs, building a peer-reference service means that you will inevitably get some pushback from someone. This pushback comes from a variety of places, including some of the fears and anxieties mentioned earlier. It is important to address these concerns in order to cultivate buy-in and create an environment where student employees can thrive in the library. At the same time, these concerns, sometimes from a vocal minority, can cause coordinators to lose momentum on these programs. Throughout the process, coordinators should be strategic about what assessment to put into place and should collect reflections from their student employees as the program evolves. Some of the most powerful and meaningful justification for these programs can come from the students themselves (Fargo, Salvati, & Sanchez Tejada, 2018).

Patricia Iannuzzi said in her foreword to Peer-Assisted Learning in Academic Libraries, “Peer learning experiences provide a pathway for libraries to expand their teaching and learning mission…” (2017, p. xii). We have the opportunity to open up these pathways, reach our users, and prepare our student employees for their future careers. We need to take our student employees seriously and understand not only their role as peer mentors but also their role in helping us to make the library a better place.

Thank you to Annie Pho, Denisse Solis, and Rosan Mitola for their feedback, insight, and help in writing this article. Additional thanks to Chelsea Heinbach, who is always a great sounding board in talking through new ideas (and the proposal for this article). Finally, I’d like to thank all the student employees I’ve had the chance to work with at the University of Illinois and at Penn State, especially the original Peer Research Consultants (Jackie, Sarah, Vik, Luz, and Kelly): your enthusiasm, big ideas, and spunk are what make me jazzed about advocating for meaningful student employment experiences and help me to know that I’m on the right track.


Allen, S. (2014). Towards a conceptual map of academic libraries’ role in student retention. The Christian Librarian, 57(1), 7-19. Retrieved from

Barnes, J.M. (2017). Student to student marketing & engagement: A case study of the University of Nebraska-Lincoln Libraries Peer Guides. In S. Arnold-Garza & C. Tomlinson (Eds.), Students Lead the Library: The Importance of Student Contributions to the Academic Library, (pp. 129-146). Chicago, IL: Association of College & Research Libraries.

Becker-Redd, K., Lee, K., & Skelton, C. (2018). Training student workers for cross-departmental success in an academic library: A new model. Journal of Library Administration, 58(2), 153-165, doi: 10.1080/01930826.2017.1412711

Bell, S.J. (2000). Creating learning libraries in support of seamless learning cultures. College & Undergraduate Libraries, 6(2), 45-58. doi:10.1300/J106v06n02_05

Beltman, S. & Schaeben, M. (2012). Institution-wide peer mentoring: Benefits for mentors. The International Journal of the First-Year in Higher Education, 3(2), 33-44, doi: 10.5204/intjfyhe.v3i2.124

Bianco, K. & O’Hatnick, J. (2017). Aligning values, demonstrating value: Peer educator programs in the library. In S. Arnold-Garza & C. Tomlinson (Eds.), Students Lead the Library: The Importance of Student Contributions to the Academic Library, (pp. 57-73). Chicago, IL: Association of College & Research Libraries.

Brenza, A., Kowalsky, M., & Brush, D. (2015). Perceptions of students working as library reference assistants at a university library. Reference Services Review, 43(4), 722-736. doi:10.1108/RSR-05-2015-0026

Carlson, S. (2007). Are reference desks dying out? Chronicle of Higher Education, 53(33), A37-A39.

Charles, L.H., Lotts, M., & Todorinova, L. (2017). A survey of the value of library employment to the undergraduate experience. Journal of Library Administration, 57(1), 1-6, doi: 10.1080/01930826.2016.1251248

Courtney, M., & Otto, K. (2017). The Learning Commons Research Assistance Program at Indiana University Libraries. In E. Rinto, J. Watts, & R. Mitola (Eds.), Peer-Assisted Learning in Academic Libraries, (pp. 147-164). Santa Barbara, CA: Library Juice Press.

Fargo, H. (2017). Reference after 5 PM: A reference librarian’s experience working atypical hours at a large research library. Pennsylvania Libraries: Research & Practice, 5(2), 87-95. doi: 10.5195/palrap.2017.144

Fargo, H., Salvati, C., & Sanchez Tejada, L. (2018). Setting up a successful peer research service: Penn State librarians and students on what works best [Webinar]. In Credo InfoLit Learning Community. Retrieved from

Faix, A.I., Bates, M.H., Hartman, L.A., Hughes, J.H., Schacher, C.N., Elliot, B.J., & Woods, A.D. (2010). Peer reference redefined: New uses for undergraduate students. Reference Services Review, 38(1), 90-107. doi: 10.1108/00907321011020752

Faix, A. (2014). Peer reference revisited: Evolution of a peer-reference model. Reference Services Review, 42(2), 305-319. doi: 10.1108/RSR-07-2013-0039

Guerrero, T.S. & Corey, K.M. (2004). Training and retraining student employees: A case study at Purdue University Calumet. Journal of Access Services, 1(4), 97-102, doi: 10.1300/J204v01n04_08

Head, A.J., & Eisenberg, M.B. (2010). Truth be told: How college students evaluate and use information in the digital age. Retrieved from

Hronchek, J. & Bishop, R. (2017). Undergraduate research assistants at Hope College. In E. Rinto, J. Watts, & R. Mitola (Eds.), Peer-Assisted Learning in Academic Libraries, (pp. 191-205). Santa Barbara, CA: Library Juice Press.

Kuh, G. (2008). High-impact education practices: What they are, who has access to them, and why they matter. Washington, DC: Association of American Colleges & Universities.

Marcyk, E.R., & Oberdick, B. (2017). The Peer Research Assistants at the Michigan State University Libraries. In E. Rinto, J. Watts, & R. Mitola (Eds.), Peer-Assisted Learning in Academic Libraries, (pp. 165-178). Santa Barbara, CA: Library Juice Press.

McCoy, E.H. (2011). Academic performance among student library employees: How library employment impacts grade point average and perception of success. The Christian Librarian, 54(1), 3-12. Retrieved from

Melilli, A., Mitola, R., & Hunsaker, A. (2016). Contributing to the library student employee experience: Perceptions of a student development program. The Journal of Academic Librarianship, 42(4), 430-437, doi: 10.1016/j.acalib.2016.04.005

Meyer, K., & Torreano, J. (2017). The front face of library services: How student employees lead the library at Grand Valley State University. In S. Arnold-Garza & C. Tomlinson (Eds.), Students Lead the Library: The Importance of Student Contributions to the Academic Library, (pp. 39-56). Chicago, IL: Association of College & Research Libraries.

Mitola, R., Rinto, E., & Pattni, E. (2018). Student employment as a high-impact practice in academic libraries: A systematic review. The Journal of Academic Librarianship, 44(3), 352-373, doi: 10.1016/j.acalib.2018.03.005

O’Kelly, M., Garrison, J., Merry, B., & Torreano, J. (2015). Building a peer-learning service for students in an academic library. Portal: Libraries and the Academy, 15(1), 163-182, doi:10.1353/pla.2015.0000

Rinto, E., Watts, J., & Mitola, R., eds. (2017). Peer-Assisted Learning in Academic Libraries. Santa Barbara, CA: Libraries Unlimited.

Topping, K.J., & Ehly, S.W. (2001). Peer assisted learning: A framework for consultation. Journal of Educational and Psychological Consultation, 12(2), 113-132. doi:10.1207/S1532768XJEPC1202_03

Wallis, L. (2017). Information on my own: Peer reference and feminist pedagogy. In M.T. Accardi (Ed.), The Feminist Reference Desk: Concepts, Critiques, and Conversations, (pp. 189-204). Sacramento, CA: Library Juice Press.

Williams, L.B. (2011). The future of peer education: Broadening the landscape and assessing the benefits. New Directions for Student Services, 133, 97-99. doi:10.1002/ss.388

  1. In this paper, I’ll use “reference conversation” instead of the traditional “reference interview.” I believe that “reference interview” is limiting, especially in peer-to-peer spaces. “Reference conversation” is meant to capture more fully what happens when answering a reference question: a conversation ensues in which both parties benefit and learn something new.

Happy Holidays (and Brief Update) / Cynthia Ng

Update: I realize it’s been quite some time since I’ve properly posted. For those who haven’t heard, I’ve left the library world (for now) by getting a support position at GitLab. It’s been an eye-opening experience, especially compared to public service type organizations that I’ve worked at until now, so I’m hoping to write a …

Mutation / Ed Summers

This is just a quick note to jot down this quote from Ketelaar (2005) because of the way it dispels the idea that the archival record is some fixed thing. Archival records are valorized for their stability and fixity because we think the archive’s integrity depends on them.

The record is a “mediated and ever-changing construction” (Cook, 2001); records are “constantly evolving, ever mutating” (McKemmish, 2005), over time and space infusing and exhaling what I have called “tacit narratives”. These are embedded in the activations of the record. Every interaction, intervention, interrogation, and interpretation by creator, user, and archivist activates the record. These activations may happen consecutively or simultaneously, at different times, in different places and contexts. Moreover, as I argued before, any activation is distributed between texts and other agents in a network. The record, “always in a state of becoming”, has therefore many creators and consequently, many who may claim the record’s authorship and ownership.

But this valorization comes at a cost because not all archival records exhibit this characteristic of fixity. This is the case with electronic records, which are often in flux and motion as they are constantly assembled and reassembled from heterogeneous data sources onto our screens.

If we define an archive as a place where this fixity must reside, then we fashion a particular type of memory. This view of archival records fails to recognize the agency of our archival tools, and their relation to us. Ketelaar actually frames his discussion in terms of Actor-Network Theory, which is a method for explicitly examining these relations.

I also like that Ketelaar is referencing McKemmish here because I think the Records Continuum is really built around the idea that the flow of records takes place in the wider field of record creation and memory practices. The fixity or evidentiary nature of records is really just one aspect, and not a totalizing one that should be allowed to overly determine the archive. Our technologies for fixing the record always shape our archives in particular ways that obscure and eclipse. Telling these stories, and describing these relations between archival technologies and memory, is what I’m trying to do in my own work.


Cook, T. (2001). Archival science and postmodernism: New formulations for old concepts. Archival Science, 1(1), 3–24.

Ketelaar, E. (2005). Sharing: Collected memories in communities of records. Archives and Manuscripts, 33(1), 44.

McKemmish, S. (2005). Traces: Document, record, archive, archives. In S. McKemmish, M. Piggott, B. Reed, & F. Upward (Eds.), Archives: Recordkeeping in society (pp. 1–20). Charles Sturt University.