photo from Flickr user gigi_nyc used under CC-BY-ND-2.0
The context for this article is Australian libraries and my experience there with cross-cultural provision. However, this article is not about providing library services for any specific group; it’s about cultural competence and whiteness. I begin with my background, so as to make clear how I participate, as a white librarian, in discussions about libraries and how they might be places where people from any cultural group find themselves reflected and where they find information the more easily for that reflection. I also start at that point because cultural competence requires an awareness of your own culture; for me, as a white person, that means thinking about whiteness. I then link experience with reading about cultural competence, and conversations with librarians who are also interested in cross-cultural provision. Whiteness in libraries is introduced via these conversations. A brief comparison is drawn between the usefulness of intersectionality and cultural competence in addressing whiteness. The conclusion is that cultural competence embedded in professional approaches, library operations and the library environment can be the means for addressing whiteness, if the understandings of power and privilege outlined in intersectionality are incorporated.
I am a 56-year-old, tertiary-educated, female Anglo-Australian. I am also a librarian. I fit the demographic profile of the Australian library workforce, which is described as highly feminised, professionally educated, ageing, and predominantly Anglo-Saxon (Hallam 2007); or, ‘a largely English-speaking, culturally homogenous group’ (Partridge et al. 2012, p. 26). I have worked in the library industry for eight years, coming to the job after an employment history spanning at least four other industries. In these eight years I have developed a professional interest in cultural competence and whiteness in libraries.
Three factors motivated me to write this article. The first is the challenges I experienced in my first library job. One set of challenges helped me find my feet as a librarian; another, outlined below, set a strong direction for future work and further study. The second factor is cultural competence, about which I learned in response to those challenges. Thirdly, the Australian library and information management industry is beginning to address diversity, often through cultural competence. The Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS), the State Library of New South Wales and the State Library of Victoria all include cultural competence or proficiency in key policies and strategies. Charles Sturt University is committed to cultural competence in the context of Indigenous content in curriculum; RMIT University includes it as a topic in the professional experience course in its Master of Information Management. Most other library schools include the development of skills for working in a diverse environment among their program aims.
I began work in the library and information industry as Special Collections Manager at Alice Springs Public Library in Central Australia. The special collections were the Alice Springs Collection, documenting the history, geography, economic development, and cultures of Central Australia; and the Akaltye Antheme Collection, a local Indigenous knowledge collection, developed in partnership with the Traditional Owners. ‘Akaltye Antheme’ translates into English as ‘giving knowledge’, the knowledge being a showcase of local culture for Aboriginal and non-Indigenous users of the library.
In addition to a Graduate Diploma in Information Management, I also brought the accidents of life to that job. By ‘accidents’ I mean those developments which aren’t the product of any particular decision, which just seem to occur as life itself occurs and which coalesce into fundamental themes or directions. One of those accidents is being born white.
Other accidents include two books, read when I was twenty: Bury My Heart at Wounded Knee; An Indian History of the American West, and Living black; Blacks talk to Kevin Gilbert. They were my introduction to Indigenous people’s experience of history and colonisation. I had no idea of that reality until then. The ensuing couple of decades included NAIDOC marches, Sorry Day ceremonies, reconciliation activities, and visits to the Tent Embassy. Friends and I mused about whiteness – what it means to be white when being white is the norm. This participation involved some decision-making but I had drifted into that left wing milieu – yes, an accident.
Another key accident was work as a personal care attendant in a supported accommodation centre for Koories. The health effects of a colonised life of disadvantage and discrimination were glaringly evident: very high incidence of diabetes and corollary conditions, alcohol-related brain damage, staggering male morbidity. Also clearly evident were the strength and resilience of culture, how hard people worked to maintain it and how they worked within it to maintain themselves and their community. The power of being white struck me for the first time: the residents were far more likely to do something when I asked them than when my largely Sri Lankan co-workers did. I attribute this to two things: the residents’ experience, often from very young, of near-complete control of their lives by white people in positions of power; and how, as a white person, I was unwittingly used to the power and privilege that redound to being white, and was able unknowingly but effectively to convey expectations.
Without the activist activities described earlier, I wouldn’t have perceived the effect of whiteness in an ordinary working environment for what it was; without that work experience, my understanding of the effects of a colonised life would be weaker. I outline this to indicate that I came to the Special Collections job beginning to understand my privilege as a white person. This privilege is reflected in the quality of my life; it is a product in part of the dispossession of Aboriginal and Torres Strait Islander people in Australia, and continuing systemic advantage for white people. Why is this important to this article? As I said in a conference presentation with Sylvia Perrurle Neale, the Indigenous Services Officer at Alice Springs Public Library, being a member of the dominant group is the biggest challenge I face in working in partnership with other, minority groups.
Working with Aboriginal and Torres Strait Islander people has been the main path for my learning about cultural competence and whiteness. However, cultural competence applies far more widely than only working with Indigenous peoples. As Ruby Hamad illustrates, whiteness resounds systemically. I would like to extend Hamad’s sentence, “[if you’re white,] you’re not going to be discriminated against on the basis of the colour of your skin” and suggest that you are also not likely to begin a sentence with the explanatory phrase, “In my/our culture …”, as I have so often heard members of minority groups do. You may never have to think about what your culture is, because as Henry and Tator (2006, cited in Calgary Anti-Racism Education) point out, whiteness in social, political and economic arenas is so much the norm that it represents “neutrality”. In a system that privileges some and marginalises others, often on the basis of skin colour but also on the basis of group membership, there are many marginalised groups. Jaeger et al. (2011) argue that working with any marginalised group requires cultural competence.
The challenges in that first job
In 2006, Alice Springs staff suggested that the Akaltye Antheme Collection be nominated for the Library Stars Award at the Australian Library and Information Association conference. (This happened before I worked there.) It won: delegates judged it the best initiative for its method of establishment, content, popularity with Aboriginal patrons, and the way the library adapted to the changed demand and use of the library that it generated.
Despite this organisational pride, Akaltye Antheme occupied a kind of limbo. Everything was the Special Collections Manager’s responsibility – to keep it tidy, repair items, end-process acquisitions, liaise with Aboriginal organisations on a range of library matters, and manage incidents arising among Aboriginal patrons. Similarly focussed collections and target groups weren’t similarly quarantined; for example, the junior fiction and non-fiction collections, and children’s behaviour, weren’t considered the exclusive responsibility of the Children’s Librarian. Akaltye Antheme was considered something for Aboriginal people, not everyone who walked through the door, contrary to the intention of those who established it. Aboriginal people, who could be up to 30% of the library’s patrons, used Akaltye Antheme regardless of this differential staff approach. They would often spend hours every day browsing and reading it. I wondered why Akaltye Antheme retained its special project status long after it was established, particularly when it was such an integral resource to a significant proportion of the library’s clientele and when it was intended for all patrons. I found this frustrating and isolating. I fit the librarian stereotype, I belong to the dominant group; yet the attitude of my (largely white, older, educated, female) colleagues to a collection they didn’t seem to consider core business affected me. Sylvia Perrurle Neale, an Eastern Arrernte woman, voiced similar frustrations.
I felt capable of learning to manage the historical collection, partly because my undergraduate degree included an honours in history. I had no idea about how to manage the Indigenous knowledge collection. This lack of educational preparation for working cross-culturally, then the isolation and frustration, echo Mestre’s research into the experience of librarians responsible for services to diverse populations (2010). She reports stress, potential burnout, and isolation of individual professionals. She also identifies opportunity costs to library organisations which rely on individuals for the provision of ‘diversity services’. The costs include loss of experienced staff and of the opportunity for all staff to learn, and benefit from learning, how to work cross-culturally. She argues that embedding culturally competent service within the organisation benefits it and all staff. Other commentators discuss the benefits of cultural competence in all aspects of library operations to organisational performance overall (Kim & Sin 2008, Andrade and Rivera 2011).
Learning about cultural competence
My next job was as Community Engagement Librarian with Libraries ACT, focussing on building engagement with Aboriginal and Torres Strait Islander communities. While in Alice Springs, I had thought that managing Akaltye Antheme could be something on which to build in my career – there probably were not too many librarians in Australia with experience providing services for and with Aboriginal people. I had also thought uneasily about the differential in the benefit that accrues to a white librarian coming to town for a short time and leaving with a marketable skill; and that which accrues to the local community, who would stay in Alice Springs after I had left. I can’t at this point cite any research that verifies this differential. However, if my experience resonates with that of others who have worked with minority groups, research in this area may suggest that greater benefit accrues to those already in a privileged position, in this instance, white librarians.
I began at Libraries ACT determined that there had to be an organisational approach to community engagement, partly to avoid aspects of my experience in Alice Springs but also to achieve organisational aims. For Aboriginal and Torres Strait Islander people to want to come to their library, they have to find a place where they are comfortable, where they can see themselves or their culture reflected. Partridge et al. (2012) point out that this applies for any cultural group, that is, any group whose identity incorporates religion, disability, sexual orientation, age, recreation, employment, political beliefs, socio-economic status, educational attainment, and class (Helton 2010, Jaeger et al. 2011). Creating such an environment in a system of nine branches, a heritage library, and a central administration clearly could not be done by one person. Advocating an organisational approach and the support of management led to a decision to implement the ATSILIRN Protocols. The Protocols are a set of guidelines for appropriate library, information, and records services for Aboriginal and Torres Strait Islander peoples, developed by Aboriginal and non-Indigenous librarians.
I document this engagement with the ACT Aboriginal and Torres Strait Islander communities in a case study (Blackburn, 2014). Findings include that:
a small team can achieve a lot with support from colleagues and where the community wants to be engaged;
synergy between library objectives and a group’s aims will enhance outcomes;
the Protocols are useful in designing and choosing engagement activities; and
the community will meet you more than half way in your engagement activities.
There are still challenges. Where staff responses to Akaltye Antheme included a kind of resistance, a significant proportion of Libraries ACT staff, throughout the staff structure, want to engage with the Aboriginal and Torres Strait Islander community. The first challenge was to demonstrate that it wasn’t hard; once connections are made and sustained, engagement kind of runs itself. Another challenge relates to staff being able to find the time in a busy service to make connections, including going outside the library, and then maintaining involvement. The next relates to how libraries usually conduct business. Libraries are great on systems and processes; they are essential features of information management. However, if you want to build an engaged community, an insistence on a way of operating that suits internally devised systems is going to bump up rather hard against a community with its own way of organising, which is also given to taking ideas and running with them.
These are essentially facets of the one challenge. The ‘special project’ status of a resource that should have been embedded in core business; the limitation on time for building and maintaining relationships; and a preference for uniform service delivery rather than flexibility, are each part of the challenge of sustainable cross-cultural provision. This challenge, in the manifestations just outlined, resides in library professionals and in organisations.
For the first five years of working in libraries, I searched with little success for information about cross-cultural provision, cross-cultural communication, etc. in a library context. Then a speaker who worked in education mentioned ‘cultural competence’ at a Protocols implementation workshop. This was a key moment, albeit another accident. There was nothing in the Australian Library and Information Science (LIS) literature then about cultural competence but there was discussion of it in US library literature.
Overall (2009) defines cultural competence for library and information professionals as:
the ability to recognise the significance of culture in one’s own life and in the lives of others; to come to know and respect diverse cultural backgrounds and characteristics through interaction with individuals from diverse linguistic, cultural and socioeconomic groups; and to fully integrate the culture of diverse groups into service work, and institutions in order to enhance the lives of both those being serviced by the library profession and those engaged in service (p. 176).
Other service industries, like health and education, recognise that care or instruction that does not address the cultural context could have serious negative consequences. Failing to acknowledge the inappropriateness of male clinicians providing some procedures for women from particular groups, for example, could result in those women choosing not to access health services. Overall’s definition, which draws on theory from these industries, locates the site of cultural competence development within the professional workforce and library organisations, which are also the locations where the challenges of cross-cultural provision arise. Cultural competence has been incorporated into US library and information science education accreditation standards. Research has supported its role in recruitment and retention, staff development, organisational performance, collection management, and service and program design (Andrade & Rivera 2011, Kim & Sin 2008, Mestre 2010).
Whether cultural competence has been truly embedded into US library and information science is debated. Case studies document incorporation into library business (e.g., Rivera 2013, Montague 2013); but Berry (1999) and Mehra (2011) assert that only token efforts have been made. Others (Galvan 2015, Honma 2006, Jaeger et al. 2011, Pawley 2005, Swanson et al. 2015) suggest that the issue is broader than development of cultural competence and includes diversity, race, racism, and whiteness. Broadening the debate in this way names the issues – diversity, race, racism and whiteness – which cross-cultural provision should address.
Cultural competence clearly begins with the professional – and just as clearly should go beyond the individual to be developed within the whole organisation. The following examples demonstrate why culturally competent organisations are required as well as professionals. In 2013, during a Libraries ACT planning day exercise, I noticed that a significant proportion of staff were either born overseas or were children of migrants; and the majority of that group were not Anglo-Saxon. (This reflects the demographic profile of the Australian population: 26% are migrants, nearly three quarters of whom are not Anglo-Saxon (Australian Bureau of Statistics 2012).) Nevertheless, the library service remains an organisation based in Western systems. The non-fiction shelves are organised according to Dewey Decimal Classification, which privileges Western or white concepts of knowledge. The bulk of the collection is in English. The songs sung during programs for babies and young children are most often English nursery rhymes. The library service remains a white one; staff from minority groups have adapted to the prevailing structure.
Wong et al. (2003) suggest that minority groups not only adapt to prevailing structures, they also adopt the underlying values. Wong et al., Canadian health practitioners of Asian descent, found their heritage did not guarantee that they would deliver mental health care appropriately to members of their own groups. They instead adopted the racialised approaches to power embedded in the Western health system in which they worked. Why would libraries be any different, particularly as they run on complex, long established systems, systems which can be adapted without changing embedded values? Dewey Decimal Classification, for example, is an ethnocentric arrangement of knowledge which has been modified to accommodate new and emerging areas of knowledge without changing the fundamental privileging of original concepts.
The diversity envisaged in US discussions about cultural competence “encompasses race, gender, ethnicity, language, literacy, disability, age, socio-economic status, educational attainment, technology access and skill” (2012 Symposium on Diversity and Library and Information Science Education, cited in Jaeger, Bertot & Subramaniam 2013). If culture is defined as “the shared daily activities of groups and individuals” (Rosaldo 1989, cited in Montiel-Overall, 2009, p 3) then religion, political beliefs and affiliation, and recreational activities are also part of diversity and should also influence cross-cultural provision. Helton (2010) and Jaeger et al. (2011) acknowledge the usefulness of cultural competence for providing library services for all groups in diverse populations, not only those whose identity is defined by race or ethnicity.
A fundamental aspect of cultural competence is that the process of achieving it never stops. Press and Diggs-Hobson (2005) point out that the professional is of necessity constantly learning about cultures in a service population: knowing everything about all the cultures in a population, before encountering them, is not possible. Ongoing interaction and actively seeking out knowledge (Garrison 2013) are integral components of developing cultural competence. The knowledge I brought to that first job in Alice Springs has continued to expand, through work and study. Most recently, during a short-term transfer to AIATSIS, I had cause to think a lot about colonisation and its ongoing effects in a post-colonial world.
Whiteness in libraries
In a recent conversation, a woman described how, when her family migrated from Egypt to the United States and then to Australia, her parents took her and her siblings to the library precisely so that they would learn how to fit in. Relating this as an adult, she said her parents chose the library “because it was a white place.” When I mentioned this to other librarians interested in cross-cultural provision and social inclusion, responses included:
You know I am really going to have to think this through. The whiteness of a library as a place to learn how to fit in. I never considered it that. I loved to read and that is the place to find books. At the same time, one learns English – to read and write – which is part of education and educating in the ‘white’ way which is at the foundation of libraries. (personal communication with an Indigenous librarian, 16th September 2015);
I find it curious how they intentionally used it in their acculturation to the dominant Australian culture … the literature that I have seen generally shows that immigrants trust the library and librarians. In that sense, libraries are welcoming and friendly spaces. However, that does not mean that libraries are culturally neutral zones and/or are as inclusive as one would like to think. I don’t think that this is all bad. It sounds like newcomers can benefit from it as they transition into the new society, however, long term this may cause them to feel excluded and/or that their cultures are less valued. Likewise, this would clearly be exclusionary to minority groups, such as indigenous [sic] peoples, who are not trying adapt to the dominant culture, but are nations within their own right. (personal communication with PhD candidate researching inclusion in libraries, 20th September 2015)
These communications, and the following discussion, indicate that the need for cultural competence is not reduced by the uses people from minority groups can make of white spaces. If anything they underline the need for it, and for dexterity in its deployment. A 2003 evaluation of the project to establish Akaltye Antheme included comments that Aboriginal people came to the library because it was a “neutral space”. They meant that it was a whitefella space free from the tensions of blackfella life; it was also a space where whitefella and blackfella clashes, common elsewhere in town, weren’t going to occur, where they could relax for a while and also make use of library services. In 2008, Aboriginal people were observed using the library to do online banking, socialise, organise or inform others of community events like funerals, read hard copy and digital Akaltye Antheme resources, watch videos, draw, or browse the other collections (Kral, unpublished report for council, 2008). Aboriginal people used the Alice Springs library before the establishment of the Akaltye Antheme Collection; however its popularity and changes in library use following its establishment suggest that the changed environment, while not making the library any less a white place, was valuable to Aboriginal patrons.
The Indigenous librarian quoted before, further commented about the affirmation members of her tribe find in their own libraries. Her comment reveals the value to individuals of places that reflect their identity:
on the other side, you have tribal libraries where Indigenous people go to learn not just reading and writing, but cultural aspects and language in the comfort of their created environment. My co-worker, she finds a reconnection to herself at the place we work. (personal communication 16th September 2015)
The potential alienation of libraries built on whiteness, mentioned by the PhD candidate, can be inferred from this comment.
Ettarh (2014) suggests “intersectional librarianship” as a means for working effectively with diverse populations. Intersectionality recognises the interactions between any person or group’s multiple layers of identity and the marginalisation or privilege attendant on each. No single identity is in play at any one time; and outcomes and experiences vary correspondingly. Multiple layers of identity result in multiple interactions between privilege and discrimination or marginalisation. The differing outcomes and responses arising from that interplay are evident in the Egyptian migrants’ use of the library for their children’s acculturation; and in the use of the public library and the Akaltye Antheme Collection by Aboriginal people in Alice Springs.
An intersectional perspective can be developed by “learning to become allies … not just learning about the issues that affect the underrepresented but also learning how our own biases and privileges make it difficult for us to build alliances” (Ettarh 2014). Cultural competence requires virtually the same strategy for modifying personal and organisational practice.
Intersectional librarianship, however, discusses power and privilege, an omission in the cultural competence theory that I have read. Intersectional librarianship “involves challenging and deconstructing privilege and considering how race, gender, class, disability, etc., affect patrons’ information needs” (Ettarh 2014). Wong et al. (2003) argue that understanding power must be central to understanding culture and to negotiating its multiple layers and interactions. Ettarh identifies as a queer person of color and talks of the challenges “we” librarians as a diverse group face in a diverse environment. Her use of the first person plural pronoun, to include all librarians, accords with the effect of structurally embedded racialised power on all health staff that Wong et al. describe.
Cultural competence, as defined by Overall (2009), does address the framework in which library operations occur: the professional development of the individual practitioner, the interactions between colleagues and between practitioners and patrons, and the effect of the environment, inside and outside the library. Privilege, while not explicitly referenced in cultural competence theory, is implicit in how culture works; whiteness, again not explicitly referenced in cultural competence theory, is central in Western library structures and operations, in the environment in which libraries are located. If the starting point of cultural competence is an understanding of the role of culture in your life (including your workplace), and in the lives of others, then you will also become aware of the interactions and interplay of privilege and marginalisation described by Ettarh (2014). It should be possible to incorporate awareness of privilege and whiteness as another starting point for culturally competent practice.
Achieving inclusive services in the diverse Australian population when the Australian library workforce is culturally homogenous therefore poses a test. Individual Australian libraries are providing services to particular groups but how these initiatives are sustained is unclear, meaning that the risk remains for individuals responsible for ‘diversity services’ to struggle with the lack of support and isolation identified by Mestre (2010). Yarra Valley Regional Library obtained grant funding to develop programs with the hearing impaired community, children and adults with low literacy, and children with autism (Mackenzie 2014) – which makes me wonder whether the organisational challenge, of incorporating initiatives for minority groups into ongoing core business, might also remain. Without education in cultural competence, practitioners do not have the opportunity to discuss and evaluate their cross-cultural initiatives within a theoretical framework.
In a workforce that is predominantly Anglo-Saxon, in an industry that is firmly based on Western concepts of knowledge and systems giving prominence to those concepts, but which provides services to a diverse population, a cultural competence that includes awareness of whiteness, of privilege and the mechanisms that make it available to some and not others, is essential. Cultural competence can make the information at the heart of a library’s existence genuinely accessible. It can help create “low intensity meeting places” where different groups can interact – or not (Audunson 2004); where people can seek answers to culturally shaped questions in culturally mediated ways (Abdullahi 2008).
I have appreciated the open-review process, particularly being able to choose one of the reviewers. It has felt more collaborative than the peer-review processes of other publications. Thanks to Sue Reynolds and Ellie Collier for picking their way through the two drafts of this article, correcting grammar and asking questions that spurred me to clarify and extend what I was writing about. Thanks also to Hugh Rundle, publishing editor. It’s been a cross-cultural exercise of itself and I particularly appreciate Ellie’s contribution in that respect.
Abdullahi, I. (2008). Cultural mediation in library and information science (LIS) teaching and learning. New Library World, Vol. 109, No. 7, pp. 383-389.
Andrade, R. & Rivera, A. (2011). Developing a diversity-competent workforce: the UA Libraries’ experience. Journal of Library Administration, Vol. 51, Nos. 7-8, pp. 692-727.
Audunson, R. (2004). The public library as a meeting-place in a multicultural and digital context; the necessity of low-intensive meeting-places. Journal of Documentation, Vol. 61, No. 3, pp. 429-440.
Australian Bureau of Statistics. (2012). ‘Cultural diversity in Australia’, reflecting a nation: stories from the 2011 census, 2012–2013, cat. no. 2071.0. Canberra: Australian Bureau of Statistics. http://www.abs.gov.au
Berry, J. (1999). Culturally competent service. Library Journal, Vol. 124, No. 14, pp. 112-113.
Blackburn, F. (2014). An example of community engagement: Libraries ACT and the ACT Aboriginal and Torres Strait Islander communities. Australian Academic & Research Libraries, Vol. 45, No. 2, pp. 121-138.
Brown, D. (1972). Bury My Heart at Wounded Knee: An Indian history of the American West. New York: Bantam Books.
Helton, R. (2010). Diversity dispatch: Increasing diversity awareness with cultural competency, Kentucky Libraries, Vol. 74, No. 4, pp. 22-24.
Honma, T. (2006). Trippin’ over the color line: the invisibility of race in library and information studies. InterActions: UCLA Journal of Education and Information Studies, Vol. 1, No. 2. http://escholarship.org/uc/item/4nj0w1mp.
Jaeger, P. T., Bertot, J. C. & Subramaniam, M. M. (2013). Introduction to the special Issue on diversity and library and information science education. The Library Quarterly, Vol. 83, No. 3, pp. 201-203.
Jaeger, P. T., Subramaniam, M. M., Jones, C. B. & Bertot, J.C. (2011). Diversity and LIS education: inclusion and the age of information. Journal of Education for Library and Information Science, Vol. 52, No. 3, pp. 166-183.
Kim, K. S., & Sin, S. C. J. (2008). Increasing ethnic diversity in LIS: strategies suggested by librarians of color. Library Quarterly, Vol. 78, No. 2, pp. 153-177. JSTOR, http://www.jstor.org/stable/10.1086/528887
Montague, R. A. (2013). Advancing cultural competency in library and information science,paper presented to IFLA World Library and Information Congress 2013, Singapore, 17-23 August 2013, International Federation of Library Associations, The Hague. http://library.ifla.org/274/1/125-montague-en.pdf
Overall, P. M. (2009). Cultural competence: a conceptual framework for library and information science professionals. The Library, Vol. 79., No. 2, pp. 175-204.
Partridge, H. L., Hanisch, J., Hughes, H. E., Henninger, M., Carroll, M., Combes, B., … & Yates, C. (2011). Re-conceptualising and re-positioning Australian library and information science education for the 21st century [Final Report 2011]. http://eprints.qut.edu.au/46915/
Press, N. O. & Diggs-Hobson, M. (2005). Providing health information to community members where they are: characteristics of the culturally competent librarian. Library Trends, vol. 53, no. 3, pp. 397-410. https://www.ideals.illinois.edu
Rivera, A. (2013). Indigenous knowledge and cultural competencies in the library profession: from theory to practice, paper presented to IFLA World Library and Information Congress 2013, Singapore, 17-23 August 2013, International Federation of Library Associations, The Hague, http://library.ifla.org/275/1/125-rivera-en.pdf
Wong, Y. R., Cheng, S., Choi, S., Ky, K., LeBa, S., Tsang, K. & Yoo, L. (2003). De-constructing culture in cultural competence: dissenting voices from Asian-Canadian practitioners. Canadian Social Work Review/Revue canadienne de service social, Vol. 20, No. 2, pp. 149-167.
The Eleventh International Conference on Open Repositories, OR2016, will be held on June 13th-16th, 2016 in Dublin, Ireland. The organizers are pleased to issue this call for contributions to the program.
As previous Open Repositories have demonstrated, the use of digital repositories to manage research, scholarly and cultural information is well established and increasingly mature. Entering our second decade, we have an opportunity to reflect on where we’ve been and, more importantly, where we’re heading. New development continues apace, and we’ve reached the time when many organizations are exploring expansive connections with larger processes both inside and outside traditional boundaries. Open Repositories 2016 will explore how our rich collections and infrastructure are now an inherent part of contemporary scholarship and research and how they have expanded to touch many aspects of our academic and cultural enterprises.
The theme of OR2016 is “Illuminating the World.” OR2016 will provide an opportunity to explore the ways in which repositories and related infrastructure and processes:
bring different disciplines, collections, and people to light;
expose research, scholarship, and collections from developing countries;
increase openness of collections, software, data and workflows;
highlight data patterns and user pathways through collections; and
how we can organize to better support these – and other – infrastructures.
We welcome proposals on these ideas, but also on the theoretical, practical, technical, organizational or administrative topics related to digital repositories. Submissions that demonstrate original and repository-related work outside of these themes will be considered, but preference will be given to submissions which address them. We are particularly interested in the following themes.
Supporting Open Scholarship, Open Data, and Open Science
Papers are invited to consider how repositories can best support the needs of open science and open scholarship to make research as accessible and useful as possible, including:
Open access, open data and open educational resources
Scholarly workflows, publishing and communicating scientific knowledge
Exposure of research and scholarship from developing countries and under-resourced communities and disciplines
Compliance with funder mandates
Repositories and Cultural Heritage
Papers are invited to consider how repositories and their associated infrastructures best support the needs of cultural heritage collections, organizations, and researchers. Areas of interest include:
Impact of aggregation on repository infrastructure and management
Exposure of collections and cultural heritage from developing countries and under-resourced communities and disciplines
Special considerations in access and use of cultural heritage collections
Reuse and analysis of content.
Repositories of high volume and/or complex data and collections
Papers are invited to consider how we can use tools and processes to highlight data patterns and user pathways through large corpora, including:
Data and text mining
Interaction with large-scale computation and simulation processes
Issues of scale and size beyond traditional repository contexts
Managing Research Data, Software, and Workflows
Papers are invited to consider how repositories can support the needs of research data and related software and workflows. Areas of interest are:
Curation lifecycle management, including storage, software and workflows
Digital preservation tools and services
Reuse and analysis of scientific content
Scholarly workflows, publishing and communicating scientific knowledge
Integrating with the Wider Web and External Systems
Papers are invited to explore, evaluate, or demonstrate integration with external systems, including:
CRIS and research management systems
Notification and compliance tracking systems
Preservation services and repositories
Collection management systems and workflows
Exploring Metrics, Assessment, and Impact
Papers are invited to present experiences on metrics and assessment services for a range of content, including:
Downloads (e.g. COUNTER compliance)
Altmetrics and other alternative methods of tracking and presenting impact
Papers are invited to examine the role of rights management in the context of open repositories, including:
Research and scholarly communication outputs
Licenses (e.g. Creative Commons, Open Data Commons)
Requirements of funder mandates
Developing and Training Staff
Papers are invited to consider the evolving role of staff who support and manage repositories across libraries, cultural heritage organizations, research offices and computer centres, especially:
New roles and responsibilities
Training needs and opportunities
Career path and recruitment
01 February 2016: Deadline for submissions and Scholarship Programme applications
01 February 2016: Registration opens
28 March 2016: Submitters notified of acceptance to general conference
11 April 2016: Submitters notified of acceptance to Interest Groups
13-16 June 2016: OR2016 conference
Conference Papers and Panels
We expect that proposals for papers or panels will be two to four pages (see below for optional Proposal Templates). Abstracts of accepted papers and panels will be made available through the conference’s web site, and later they and associated materials will be made available in an open repository. In general, sessions will have three papers; panels may take an entire session or may be combined with a paper. Relevant papers unsuccessful in the main track will be considered for inclusion, as appropriate, as an Interest Group presentation, poster or 24×7 presentation.
Interest Group Presentations
The opportunity to engage with and learn more about the work of relevant communities of interest is a key element of Open Repositories. One to two page proposals are invited for presentations or panels that focus on the work of such communities, traditionally DSpace, EPrints, Fedora, and Invenio, describing novel experiences or developments in the construction and use of repositories involving issues specific to these technical platforms. Further information about applications for additional Interest Groups and guidance on submissions will be forthcoming.
24×7 presentations are 7-minute presentations comprising no more than 24 slides. Proposals for 24×7 presentations should be one to two pages. Similar to Pecha Kuchas or Lightning Talks, these 24×7 presentations will be grouped into blocks based on conference themes, with each block followed by a moderated discussion / question and answer session involving the audience and whole block of presenters. This format will provide conference goers with a fast-paced survey of like work across many institutions, and presenters the chance to disseminate their work in more depth and context than a traditional poster.
“Repository RANTS” 24×7 Block
One block of 24×7’s will revolve around “repository rants”: brief exposés that challenge the conventional wisdom or practice, and highlight what the repository community is doing that is misguided, or perhaps just missing altogether. The top proposals will be incorporated into a track meant to provoke unconventional approaches to repository services.
“Repository RAVES” 24×7 Block
One block of 24×7’s at OR2016 will revolve around “repository raves”: brief exposés that celebrate particular practice and processes, and highlight what the repository community is doing that is right. The top proposals will be incorporated into a track meant to celebrate successful approaches to repository services.
We invite one-page proposals for posters that showcase current work. Attendees will view and discuss your work during the poster reception.
2016 Developer Track: Top Tips, Cunning Code and Illuminating Insights
Each year a significant proportion of the delegates at Open Repositories are software developers who work on repository software or related services. OR2016 will feature a Developer Track and Ideas Challenge that will provide a focus for showcasing work and exchanging ideas.
Building on the success of last year’s Developer Track, where we encouraged live hacking and audience participation, we invite members of the technical community to share the features, systems, tools and best practices that are important to you. Presentations can be as informal as you like, but once again we encourage live demonstrations, tours of code repositories, examples of cool features and the unique viewpoints that so many members of our community possess. Submissions should take the form of a title and a brief outline of what will be shared with the community.
Further details and guidance on the Ideas Challenge will be forthcoming.
Developers are also encouraged to contribute to the other tracks as papers, posters, 24×7 presentations, repository raves and rants 24×7 blocks.
Workshops and Tutorials
One to two-page proposals for workshops and tutorials addressing theoretical or practical issues around digital repositories are welcomed. Please address the following in your proposal:
The subject of the event and what knowledge you intend to convey
Length of session (e.g., 1-hour, 2-hour, half a day or a whole day)
A brief statement on the learning outcomes from the session
How many attendees you plan to accommodate
Technology and facility requirements
Any other supplies or support required
Anything else you believe is pertinent to carrying out the session
The OR2016 proposal templates are a guideline to help you prepare an effective submission. They will be provided in both Word document and plain-text Markdown formats and give details of the requirements for conference papers and panels, 24×7 presentations, and posters. These will be available from the conference website shortly.
The conference system will be open for submissions by 15 December 2015. PDF format is preferred.
OR2016 will again run a Scholarship Programme which will enable us to provide support for a small number of full registered places (including the poster reception and banquet) for the conference in Dublin. The programme is open to librarians, repository managers, developers and researchers in digital libraries and related fields. Applicants submitting a paper for the conference will be given priority consideration for funding. Please note that the programme does not cover costs such as accommodation, travel and subsistence. It is anticipated that the applicant’s home institution will provide financial support to supplement the OR Scholarship Award. Full details and an application form will shortly be available on the conference website.
David Minor, University of California, San Diego
Matthias Razum, FIZ Karlsruhe
Sarah Shreeves, University of Miami
Palfrey was a law professor at Harvard and lately the vice-dean of library and information services at its law library, and he helped create the Digital Public Library of America and was its first board chairman. Recently he left all that to be head of posh New England private school Andover, where he gathered his thoughts and plans about libraries and set them out in this short book.
It reflects, as you’d expect, the approach of the DPLA, which is based on networked collaboration, digitization and preservation of unique resources, specialization in local material, community involvement, and open content and APIs for hackers to hack and reuse. All of this is well-argued in the book, set out clearly and logically (as law professors do), always grounded in libraries as educational institutions and librarians as educators. He knows the problems libraries (both public and academic) are having now, and sets out a way to get past them while keeping libraries both fundamental to local communities but also important at the national and international level.
I was agreeing with the book and then hit this in chapter five, “Hacking Libraries: How to Build the Future:”
The next phase of collaboration among libraries may prove to be harder. The development of digital libraries should be grounded in open platforms, with open APIs (application programming interfaces), open data, and open code at all levels. No one library will own the code, the platform, or the data that can be downloaded, “free to all.” The spirit that is needed is the hacker spirit embodied by Code4Lib, a group of volunteers associated with libraries, archives and museums who have dedicated themselves to sharing approaches, techniques and code related to the remaking of libraries for the digital era.
Damn right! The Code4Lib approach is the right approach for the future of libraries. (Well, one right approach: I agree with Palfrey’s plan, but add that on the digital side libraries need to take a leading role regarding privacy, and also need to take on climate change, with progressive approaches to labour and social issues underlying everything.)
In the next chapter he enthuses about Jessamyn West and a few others doing fresh, important, different kinds of library work. This is good to see!
Any GLAM hacker will want to read this book. I’m a bit puzzled, though: who is it aimed at? All my Code4Lib colleagues will agree with it. Non-technical librarians I know would agree with the plan, though with reservations based on worries about the future of their own areas of specialization and lack of technical skills. It will be useful when talking to library administrators. How many people outside libraries and archives will read the book? Are there more people out there interested in the future of libraries than I’d have guessed? If so, wonderful! I hope they read this and support the online, open collaboration it describes.
For the past few months I’ve been working on a project to migrate a museum’s collection registry to CollectionSpace. CollectionSpace is a “free, open-source, web-based software application for the description, management, and dissemination of museum collections information.”1 CollectionSpace is multitenant software: one installation of the software can serve many tenants. The software package’s structure, though, means that the configuration for one tenant is mixed in with the code for all tenants on the server (e.g., the application layer, services layer, and user interface layer configuration are stored deep in the source code tree). This bothers me from a maintainability standpoint. Sure, Git’s richly featured merge functionality helps, but it seems unnecessarily complex to intertwine the two in this way. So I developed a structure that puts a tenant’s configuration in a separate source code repository and a series of procedures to bring the two together at application-build time.
CollectionSpace Tenant Configuration Structure
There are three main parts to a CollectionSpace installation: the application layer, the services layer, and the user interface layer. Each of these has configuration information that is specific to a tenant. The idea is to move the configuration from these three layers into one place, then use Ansible to enforce the placement of references from the tenant’s configuration directory to the three locations in the code. That way the configuration can be changed independent of the code.
The configuration consists of a file and three directories. Putting the reference to the file (application-tenant.xml) into the proper place in the source code directory structure is straightforward: we use a file system hard link. A hard link cannot reference a directory, though, so that approach does not work for the three directories. We can use a soft link, but those were problematic in my specific case because I was using Unison to synchronize the contents of the tenant configuration between my local filesystem and a Vagrant virtual system. (Unison had this annoying tendency to delete the directory and recreate it in some synchronization circumstances.) So I resorted to a bind mount to make the configuration directories appear inside the code directories.
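The difference between linking a file and linking a directory can be seen in a small Python sketch. It is only an illustration of the file system behaviour described above, not part of the CollectionSpace build: the directory name `application` and the function names are hypothetical, and the bind mount itself is left out because it needs root privileges.

```python
import os
import tempfile


def link_tenant_config(config_dir, code_dir):
    """Try to hard-link a config file and a config directory into code_dir.

    Returns (file_linked, dir_linked). The names used here are
    illustrative assumptions, apart from application-tenant.xml.
    """
    src_file = os.path.join(config_dir, "application-tenant.xml")
    dst_file = os.path.join(code_dir, "application-tenant.xml")
    os.link(src_file, dst_file)  # hard link: two names for one file
    file_linked = os.path.samefile(src_file, dst_file)

    src_dir = os.path.join(config_dir, "application")  # hypothetical name
    dst_dir = os.path.join(code_dir, "application")
    try:
        os.link(src_dir, dst_dir)  # expected to fail for a directory
        dir_linked = True
    except OSError:
        dir_linked = False  # this is where a symlink or bind mount is needed
    return file_linked, dir_linked


def demo():
    """Build a throwaway tenant-config/code layout and try both links."""
    with tempfile.TemporaryDirectory() as root:
        config_dir = os.path.join(root, "tenant-config")
        code_dir = os.path.join(root, "code")
        os.makedirs(os.path.join(config_dir, "application"))
        os.makedirs(code_dir)
        with open(os.path.join(config_dir, "application-tenant.xml"), "w") as f:
            f.write("<tenant/>\n")
        return link_tenant_config(config_dir, code_dir)
```

Calling `demo()` on a typical Linux file system should show the file link succeeding and the directory link being refused, which is why the directories need a different mechanism.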
To make sure this setup is consistent, I use Ansible to describe the exact state of references. Each time the Ansible playbook runs, it ensures that everything is set the way it needs to be before the application is rebuilt. That Ansible script looks like this:
Lines 12-18 create the hard link for the tenant application XML file.
Handling the tenant configuration directories takes three steps. Using the application configuration as an example, lines 20-24 first make sure that a directory exists where we want to put the configuration into the code directory.
Next, lines 26-34 use mount --bind to make the application configuration appear to be inside the code directory.
Lastly, lines 35-41 ensure the bind mount lasts through system rebuilds (although line 33 makes sure the bind mount is working each time the playbook is run).
Then the typical CollectionSpace application build process runs.
Lines 89-120 stop the Tomcat container and rebuild the application, services, and user interface parts of the system.
Lines 135-163 log into CollectionSpace, get the session cookie, then initialize the user interface and the vocabularies/authorities.
I run this playbook almost every time I make a change to the CollectionSpace configuration. (The exception is for simple changes to the user interface; sometimes I’ll just log into the server and run those manually.) If you want to see what the directory structure looks like in practice, the configuration directory is on GitHub.
As we discussed in a post on November 19th, the final language of the Every Student Succeeds Act negotiated by a House and Senate Conference Committee has just been released. This bill reauthorizes the Elementary and Secondary Education Act (ESEA) and I am happy to report that the members of the Committee recognized the importance of an effective school library program to a child’s education and reflected that in their Conference Report! As expected, now that we have the completed Report things will move quickly. The House is currently scheduled to vote on the bill later this week and the Senate will follow soon after.
A child reading, photo credit Tim Pierce
While we made it into the Conference Report, our job is not yet finished. Speaker Ryan has said that unless he feels sure of passage, he will not bring the Conference Report to a vote. It is essential that you take a moment to let your Representative know that you support holding a vote on the Conference Report and want that vote to be Yes!
Head over to ALA’s Action Center for all of the necessary information to make your voice heard now!
An overview of the library provisions found in the Every Student Succeeds Act can be found here.
If you weren’t there in 1989, this is going to be very hard to imagine. But go ahead and try to picture this: the world without the web, without mobile (let alone smart) phones, without so many of the things that we take for granted today. The Internet was here, certainly, but only for some people (mostly those at large research universities) and it was extremely primitive.
So when I attended the ALA Annual Conference in 1989 and discovered that Charles W. Bailey, Jr. at the University of Houston had established a BITNET mailing list for librarians at the end of June, I was all excited. The list was dubbed “PACS-L,” for “Public Access Computer Systems,” and I was eager to sign up for it, but I didn’t have a clue how to do it. It turned out the instructions passed out at the conference only worked if you were solely on BITNET. If you were on the true Internet, like I was, and only had a gateway to BITNET, how you subscribed was different. That sent me into a deep dive into the arcane world of BITNET, Unix, and the Internet, which had me printing out reams of documentation. I eventually succeeded in signing on, however, and was actually one of the first few dozen subscribers.
Early discussion topics included Hypercard, CD-ROM systems, local area networks, licensing agreements (some things never change), and much more. Check out the archive if you ever want to be entertained by what was top of mind then.
By the 18th of August there were just over 300 subscribers to this new discussion. Here are just a few of them, at least some of whose names you surely recognize:
Ivy Anderson (then at Brandeis, now at the California Digital Library)
Christine Borgman (UCLA)
Sharon Bostick (then at Wichita State, now at the Illinois Institute of Technology)
Marshall Breeding (then at , now a speaker and consultant)
Priscilla Caplan (then at Harvard, now retired)
Steve Cisler (then at the Apple Library, now unfortunately no longer with us)
Walt Crawford (then at RLG, now writing and consulting)
Selden Deemer (then at Emory, now at Biblio Solutions, LLC)
Thom Hickey (OCLC)
Judith Hopkins (University of Buffalo)
Katharina Klemperer (then at Dartmouth, now a consultant)
John MacColl (then at Glasgow, now at St. Andrews)
Mike Ridley (then at McMaster, now at Guelph)
John Ulmschneider (then at NCSU, now at Virginia Commonwealth University)
Don Waters (then at Yale, now at the Andrew Mellon Foundation)
Perry Willett (then at SUNY Binghamton, now at the California Digital Library)
For quite a while this list was where everything new in librarianship was happening. Despite its name, topics well beyond public access computer systems were discussed and debated. It was, in a nutshell, an essential place to hear and be heard. Its like was never to be again, as since then online communication channels have burgeoned and diversified. But for a little while, at least, there was a single place to be. And it was PACS-L.
Thank you, Charles, for that seminal move that brought so many of us together at such a key time in librarianship. I know that I’m not the only one who views those days with fondness — despite the turmoil, the uncertainty, and the joy of learning constantly and making things up as you went along.
The Directory of Open Access Journals (DOAJ) is an international directory of journals and index of articles that are available open access. Dating back to 2003, the DOAJ was at the center of a controversy surrounding the “sting” conducted by John Bohannon in Science, which I covered in 2013. Essentially Bohannon used journals listed in DOAJ to try to find journals that would publish an article of poor quality as long as authors paid a fee. At the time many suggested that a crowdsourced journal reviewing platform might be the way to resolve the problem if DOAJ wasn’t a good source. While such a platform might still be a good idea, the simpler and more obvious solution is the one that seems to have happened: for DOAJ to be more strict with publishers about requirements for inclusion in the directory. 1.
The process of cleaning up the DOAJ has been going on for some time and is getting close to an important milestone. All the 10,000+ journals listed in DOAJ were required to reapply for inclusion, and the deadline for that is December 30, 2015. After that time, any journals that haven’t reapplied will be removed from the DOAJ.
“Proactive Not Reactive”
Contrary to popular belief, the process for this started well before the Bohannon piece was published 2. In December 2012 an organization called Infrastructure Services for Open Access (IS4OA) (founded by Alma Swan and Caroline Sutton) took over DOAJ from Lund University, and announced several initiatives, including a new platform, distributed editorial help, and improved criteria for inclusion. 3 Because DOAJ grew to be an important piece of the scholarly communications infrastructure it was inevitable that they would have to take such a step sooner or later. With nearly 10,000 journals and only a small team of editors it wouldn’t have been sustainable over time, and to lose the DOAJ would have been a blow to the open access community.
One of the remarkable things about the revitalization of the DOAJ is the transparency of the process. The DOAJ News Service blog has been describing the behind-the-scenes processes in detail since May 2014. One of the most useful things is a list of journals that have claimed to be listed in DOAJ but are not. Another important piece of information is the 2015-2016 development roadmap. There is a lot going on with the DOAJ update, however, so below I will pick out what I think is most important to know.
The New DOAJ
In March 2014, the DOAJ created a new application form with much higher standards for inclusion. Previously the form for inclusion was only 6 questions, but after working with the community they changed the application to require 58 questions. The requirements are detailed on a page for publishers, and the new application form is available as a spreadsheet.
While 58 questions seems like a lot, it is important to note that journals need not fulfill every single requirement, other than the basic requirements for inclusion. The idea is that journal publishers must be transparent about the structure and funding of the journal, and that journals explicitly labeled as open access meet some basic theoretical components of open access. For instance, one of the basic requirements is that “the full text of ALL content must be available for free and be Open Access without delay”. Certain other pieces are strong suggestions, but failing to meet them will not cause a journal to be rejected. For instance, the DOAJ takes a strong stand against impact factors and suggests that they not be presented on journal websites at all 4.
To highlight journals that have extremely high standards for “accessibility, openness, discoverability reuse and author rights”, the DOAJ has developed a “Seal” that is awarded to journals who answer “yes” to the following questions (taken from the DOAJ application form):
allow reuse and remixing of content in accordance with a CC BY, CC BY-SA or CC BY-NC license (Question 47). If CC BY-ND, CC BY-NC-ND, ‘No’ or ‘Other’ is selected the journal will not qualify for the Seal.
Part of the appeal of the Seal is that it focuses on the good things about open access journals rather than the questionable practices. Having a whitelist is much more appealing for people doing open access outreach than a blacklist. Journals with the Seal are available in a facet on the new DOAJ interface.
Getting In and Out of the DOAJ
Part of the reworking of the DOAJ was the requirement that all currently listed journals reapply. As of November 19, just over 1,700 journals had been accepted under the new criteria, and just over 800 had been removed (you can follow the list yourself here). For now you can identify journals that have reapplied by a green check mark (what DOAJ calls The Tick!). That means that about 85% of journals that were previously listed either have not reapplied, or are still in the verification pipeline 5. While DOAJ does not discuss specific reasons a journal or publisher is removed, they do give a general category for removal. I did some analysis of the data provided in the added/removed/rejected spreadsheet.
At the time of analysis, there were 1776 journals on the accepted list. 20% of these were added since September, and with the deadline looming this number is sure to grow. Around 8% of the accepted journals have the DOAJ Seal.
There were 809 journals removed from the DOAJ, and the reasons fell into the following general categories. I manually checked some of the journals with only 1 or 2 titles, and suspect that some of these may be reinstated if the publisher chooses to reapply. Note that well over half the removed journals weren’t related to misconduct but were ceased or otherwise unavailable.
Inactive (has not published in the last calendar year)
Suspected editorial misconduct by publisher
Website URL no longer works
Journal not adhering to Best Practice
Journal is no longer Open Access
Has not published enough articles this calendar year
Other; delayed open access
Other; no content
Other; taken offline
Removed at publisher’s request
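For anyone who wants to repeat this kind of tally, here is a minimal Python sketch. It assumes the removed-journals sheet has been exported as CSV and that the removal category sits in a column called `Reason`; that column name and the function name are assumptions for illustration, so adjust them to match the actual spreadsheet.

```python
import csv
import io
from collections import Counter


def tally_reasons(csv_text, column="Reason"):
    """Count how often each removal reason appears in CSV text.

    `column` is an assumed header name; blank cells are grouped
    under "(blank)". Returns (reason, count) pairs, largest first.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        (row.get(column) or "").strip() or "(blank)" for row in reader
    )
    return counts.most_common()
```

Reading the exported file with `open(path, newline="", encoding="utf-8").read()` and passing the text to `tally_reasons` gives the categories ordered by how many journals fall into each.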
The spreadsheet lists 26 journals that were rejected. Rejected journals will know the specific reasons why their applications were rejected, but those specific reasons are not made public. Journals may reapply after 6 months once they have had an opportunity to amend the issues. 6 The general stated reasons were as follows:
Has not published enough articles
Journal website lacks necessary information
Not an academic/scholarly journal
Web site URL doesn’t work
The work that DOAJ is doing to improve transparency and the screening process is very important for open access advocates, who will soon have a tool that they can trust to provide much more complete information for scholars and librarians. For too long we have been forced to use the concept of a list of “questionable” or even “predatory” journals. A directory of journals with robust standards and an easy-to-understand interface will be a fresh start for the rhetoric of open access journals.
Are you the editor of an open access journal? What do you think of the new application process? Leave your thoughts in the comments (anonymously if you like).
At some point, I will need to start numbering these. The Islandora community has been busy building tools and modules to improve our stack, so even though we did this just back in September, it's time to have a look at modules in development outside the core Islandora release once again.
From the folks at Common Media, this module allows the use of Drupal webforms to contribute metadata for an Islandora object, with a workflow at the webform or object level for site managers to review and ingest submissions. Useful for allowing public contributions into a moderated workflow.
Developed by Marcus Barnes and the team at Simon Fraser University, the Move to Islandora Kit converts source content files and accompanying metadata into ingest packages used by existing Islandora batch ingest modules. Basically, it is a tool for preparing your objects to import into Islandora. It got its start as a tool to support their move from ContentDM into Islandora, but is growing into a more general tool to get your content ready for the big move into Islandora.
From Mark Jordan's bag of tricks, this is an Islandora module that lets you get the UUID for an object if you know its PID, and get an object's PID if you know its UUID. Also lets you assign UUIDs to newly ingested objects.
Built off of work done at Simon Fraser University, this tool from UPEI's Robertson Library systems team allows Pydio users to export files and folders directly to Islandora from within the Pydio application.
Winchester, MA Fedora enthusiasts from around the US and Canada got together to teach, learn and get to know one another at the recent Fedora Camp held at “The Edge”, Duke University Libraries’ learning commons for research, technology, and collaboration. The open makerspace atmosphere provided an ideal environment for Fedora Camp’s 40 participants to share ideas and work together to better understand how to take advantage of the Fedora open source repository platform.
Today is my last day here at Johns Hopkins University Libraries.
After Thanksgiving, I’ll be working, still here in Baltimore, at Friends of the Web, a small software design, development, and consulting company.
I’m excited to be working collaboratively with a small group of other accomplished designers and developers, with a focus on quality. I’m excited by Friends of the Web’s collaborative and egalitarian values, which show in how they do work and treat each other, how decisions are made, and even in the compensation structure.
Their clientele is intentionally diverse; a lot of e-commerce, but also educational and cultural institutions, among others.
They haven’t done work for libraries before, but are always interested in approaching new business and technological domains, and are open to accepting work from libraries. I’m hoping that it will work out to keep a hand in the library domain at my new position, although any individual project may or may not work out for contracting with us, depending on if it’s a good fit for everyone’s needs. But if you’re interested in contracting an experienced team of designers and developers (including an engineer with an MLIS and 9 years of experience in the library industry: me!) to work on your library web (or iOS) development needs, please feel free to get in touch to talk about it. You could hypothetically hire just me to work on a project, or have access to a wider team of diverse experience, including design expertise.
Libraries, I love you, but I had to leave you, maybe, at least for now
I actually really love libraries, and have enjoyed working in the industry.
It may or may not be surprising to you that I really love books — the kind printed on dead trees. I haven’t gotten into ebooks, and it’s a bit embarrassing how many boxes of books I moved when I moved houses last month.
I love giant rooms full of books. I feel good being in them.
Even if libraries are moving away from being giant rooms full of books, they’ve still got a lot to like. In a society in which information technology and data are increasingly central, public and academic libraries are “civil society” organizations which can serve users’ information needs and advocate for users, with libraries’ interests aligned with their users’, because libraries are not (mainly) trying to make money off their patrons or their data. This is pretty neat, and important.
In 2004, already a computer programmer, I enrolled in an MLIS program because I wanted to be a “librarian”, not thinking I would still be a software engineer. But I realized that with software so central to libraries, if I were working in a non-IT role I could be working with software I knew could be better but couldn’t do much about — or I could be working making that software better for patrons and staff and the mission of the library.
And I’ve found the problems I work on as a software engineer in an academic library rewarding. Information organization and information retrieval are very interesting areas to be working on. In an academic library specifically, I’ve found the mission of creating services that help our patrons with their research, teaching, and learning to be personally rewarding as well. And I’ve enjoyed being able to do this work in the open, with most of my software open source, working and collaborating with a community of other library technologists across institutions. I like working as a part of a community with shared goals, not just at my desk crunching out code.
So why am I leaving?
I guess I could say that at my previous position I no longer saw a path to make the kind of contributions to developing and improving libraries’ technological infrastructures and capacities that I wanted to make. We could leave it at that. Or you could say I was burned out. I wasn’t blogging as much. I wasn’t collaborating as much or producing as much code. I had stopped religiously going to Code4Lib conferences. I dropped out of the Code4Lib Journal without a proper resignation or goodbye (sorry editors, and you’re doing a great job).
9 years ago when, with a fresh MLIS, I entered the library industry, it seemed like a really exciting time in libraries, full of potential. I quickly found the Code4Lib community, which gave me a cohort of peers and an orientation to the problems we faced. We knew that libraries were behind in catching up to the internet age, we knew (or thought we knew) that we had limited time to do something about it before it was “too late”, and we (the code4libbers in this case) thought we could do something about it, making critical interventions from below. I’m not sure how well we (the library industry in general or we upstart code4libbers) have fared in the past decade, or how far we’ve gotten. Many of the Code4Lib cohort I started up with have dropped out of the community too, one way or another, and the IRC channel seems a dispiriting place to me lately (but maybe that’s just me). Libraries aren’t necessarily focusing on the areas I think most productive, and now I know how hard it is to have an impact on that. (But no, I’m not leaving because of linked data, but you can take that essay as my parting gift, or parting shot). I know I’ve made some mistakes in personal interactions, and hadn’t succeeded at building collaboration instead of conflict in some projects I had been involved in, with lasting consequences. I wasn’t engaging in the kinds of discussions and collaborations I wanted to be at my present job, and had run out of ideas of how to change that.
So I needed a change of perspective and circumstance. And wanted to stay in Baltimore (where I just bought a house!). And now here I am at Friends of the Web! I’m excited to be taking a fresh start in a different sort of organization working with a great collaborative team.
I am also excited by the potential to keep working in the library industry from a completely different frame of reference, as a consultant/contractor. Maybe that’ll end up happening, maybe it won’t, but if you have library web development or consulting work you’d like to discuss, please do ring me up.
What will become of Umlaut?
There is no cause for alarm! Kevin Reiss and his team at Princeton have been working on an Umlaut rollout there (I’m not sure if they are yet in production). They plan to move forward with their implementation, and Kevin has agreed to be a (co-?)maintainer/owner of the Umlaut project.
Also, Umlaut has been pretty stable code lately, it hasn’t gotten a whole lot of commits but just keeps on trucking and working well. While there were a variety of architectural improvements I would have liked to make, I fully expect Umlaut to remain solid software for a while with or without major changes.
This actually reminds me of how I came to be the Umlaut lead developer in the first place. Umlaut was originally developed by Ross Singer who was working at Georgia Tech at the time. Seeing a priority for improving our “link resolver” experience, and the already existing and supported Umlaut software, after talking to Ross about it, I decided to work on adopting Umlaut here. But before we actually went live in production — Ross had left Georgia Tech, they had decided to stop using Umlaut, and I found myself lead developer! (The more things change… but as far as I know, Hopkins plans to continue using Umlaut). It threw me for a bit of a loop to suddenly be deploying open source software as a community of one institution, but I haven’t regretted it, I think Umlaut has been very successful for our ability to serve patrons with what they need here, and at other libraries.
I am quite proud of Umlaut, and feel kind of parental towards it. I think intervening in the “last mile” of access, delivery, and other specific-item services is exactly the right place to be, to have the biggest impact on our users. For both long-term strategic concerns — we don’t know where our users will be doing ‘discovery’, but there’s a greater chance we’ll still be in the “last mile” business no matter what. And for immediate patron benefits — our user interviews consistently show that our “Find It” link resolver service is both one of the most used services by our patrons, and one of the services with the highest satisfaction. And Umlaut’s design as “just in time” aggregator of foreign services is just right for addressing needs as they come up — the architecture worked very well for integrating BorrowDirect consortial disintermediated borrowing into our link resolver and discovery, despite the very slow response times of the remote API.
I think this intervention in “last mile” delivery and access, with a welcome mat to any discovery wherever it happens, is exactly where we need to be to maximize our value to our patrons and “save the time of the reader”/patron, in the context of the affordances we have in our actually existing infrastructures — and I think it has been quite successful.
So why hasn’t Umlaut seen more adoption? I have been gratified by, and grateful for, the adoption it has gotten at a handful of other libraries (including NYU, Princeton, and the Royal Library of Denmark), but I think its potential goes further. Is it a failure of marketing? Is it different priorities: are academic libraries simply not interested in intervening to improve research and learning for our patrons, preferring to invest in less concrete directions? Are in-house technological capacity requirements simply too intimidating? (I’ve never tried to sugar coat or under-estimate the need for some local IT capacity to run Umlaut, although I’ve tried to make the TCO as low as I can, I think fairly successfully.) Is Umlaut simply too technically challenging for the capacity of actual libraries, even if they think the investment is worth it?
I don’t know, but if it’s from the latter points, I wonder if any access to contractor/vendor support would help, and if any libraries would be interested in paying a vendor/contractor for Umlaut implementation, maintenance, or even cloud hosting as a service. Well, as you know, I’m available now. I would be delighted to keep working on Umlaut for interested libraries. The business details would have to be worked out, but I could see contracting to set up Umlaut for a library, or providing a fully managed cloud service offering of Umlaut. Both are hypothetically things I could do at my new position, if the business details can be worked out satisfactorily for all involved. If you’re interested, definitely get in touch.
Other open source contributions?
I have a few other library-focused open source projects I’ve authored that I’m quite proud of. I will probably not be spending much time on them in the near future. These include traject, bento_search, and borrow_direct.
I wrote Traject with Bill Dueber, and it will remain in his very capable hands.
The others I’m pretty much sole developer on. But I’m still around on the internet to answer questions, provide advice, or most importantly, accept pull requests for changes needed. bento_search and borrow_direct are both, in my not so humble opinion, really well-architected and well-written code, which I think should have legs, and which others should find fairly easy to pick up. If you are using one of these projects, send a good pull request or two, and are interested, odds are I’d give you commit/release rights.
What will happen to this blog?
I’m not sure! The focus of this blog has been library technology and technology as implemented in libraries. I hadn’t been blogging as much as I used to lately anyway. But I don’t anticipate spending as much (any?) time on libraries in the immediate future, although I suspect I’ll keep following what’s going on for at least a bit.
Will I have much to say on libraries and technology anyway? Will the focus change? We will see!
So long and thanks for all the… fiche?
Hopefully not actually a “so long”, I hope to still be around one way or another. I am thinking of going to the Code4Lib conference in (conveniently for me) Philadelphia in the spring.
Much respect to everyone who’s still in the trenches, often in difficult organizational/political environments, trying to make libraries the best they can be.
Notes from the breakout session discussing the Code4libBC event itself how to have the intermediary discussions how to get people to know enough about each side in order to have good conversation the gathering is very diverse; the more diversity, the more we all learn communications is not easy, so it’s a good opportunity this … Continue reading Code4LibBC Day 2: Moving forward with Code4libBC Breakout Notes
Notes from the breakout sessions. Archival Advantage (OCLC report): advantages to working with archivists with information archivists consider the provenance rather than by subject e.g. retaining order the stuff that you find is what was made along the way rather than the final product material tend to be unique in context, rather than repeatable … Continue reading Code4libBC Day 2: Archives 101 for digital librarians
The last part of lightning talks for Code4libBC. Speeding up Digital Preservation with a Graphics Card, Alex Garnett, SFU GPU-accelerated computing. graphics cards are very powerful nowadays, and many organizations have figured out how to use the graphics cards to do things. graphics are much more powerful to CPU, but very specialized for video or … Continue reading Code4libBC Day 2: Lightning Talks Part 2
XML Databases and Document Stores, Michael Joyce, SFU EXIST for storing XML indexing and searching provided by Lucene doesn’t include analysis, but can use Stanford named entity recognizer MASHUP of over 200 books mapped using statistical tagger and dbpedia, great for getting people started similarity matching to find similar content Demo of MIK (the Move … Continue reading Code4LibBC Day 2: Lightning Talks Part 1
These notes are a bit disparate, but here’s what I got in terms of notes for the breakout session. global updates vendor records marcedit edits before loading load tables/profiles: pre-define templates to be applied to batches of records e.g. publisher number in 001, might match and overlay or not; can use list to isolate sets … Continue reading Batch Record Editing Processes Breakout Group Notes
Part 2 of code4libBC unconference of morning lightning talks on day 1. Open Collections Overview, Paul Joseph, UBC open collections at UBC library sits on top of 4 repositories: IR (DSpace), digital images (contentdm), archival material (AtoM), open access data sets (dataverse) can normalize into a single model using ElasticSearch browse by collection, featured collections … Continue reading Code4LibBC Day 1: Lightning Talks Part 2
Part 1 of the morning lightning talks at Code4LibBC 2015. Fun & Games with APIs, Calvin Mah, SFU SFU just launched to the official iOS app, driven by the new CIO included view computer availability under the library tab amazing, because library was not consulted library has robust API to disseminate information: http://api.lib.sfu.ca was thinking … Continue reading Code4LibBC Day 1: Lightning Talks Part 1
Tomorrow is Thanksgiving in the US. As often happens this time of year, I’m in a reflective mood. And sentimental. (Warning: I mix metaphors and use parentheticals a lot when I’m like this.)
I’ve made no secret of the fact that the last two years have been hard for Dale and me. A bad career move leading to a big drop in my confidence level (which was arguably shaky to begin with), Dale and me living apart for 9 months, both of us living out of boxes for 13+ months, financial trouble… it’s been rough. The whole thing just culminated in a very emotionally draining two weeks, which I can only liken to a fever breaking. That is, it was super bad, but I feel like things are turning around, now. (I am knocking on wood as I type that.)
Things turning around
Our house sold, finally. For the first time in a year and a half, I am not panicked about money: our equity paid off the worst of the last year of moving costs and credit card debt. We’re back within sight of zero, which sounds negative or backhanded to say, but zero (or near it) is remarkably good in America today. Zero, with a good income? We’re OK.
That includes being able to pay the ransom to get the belongings we shipped from Alaska out of storage. I am so excited to see my pots and pans! (I’ve been cooking with a saucepan and two iron skillets for over a year. My bakeware is in better shape, because we sold/donated/replaced some of it over time; the pots & pans were a wedding present, though, and didn’t fit in the Subaru, so I’ve had to wait.) There’s probably lots of other stuff I’ll be super happy to see, but it’s been so long; it’ll be fun to rediscover it all.
Selling the house also means that we no longer feel stretched across the continent. We don’t have taxes to pay, there, and we don’t have the lingering worry that something will happen and we will be forced to get on a last-minute redeye flight to Alaska. There is no longer question about what constitutes our “permanent address,” because we only have one address. (And we got PA licenses recently, so we’re legit.)
I don’t mean to sound relieved to be away from Alaska. That isn’t it, as you’ll see in a moment. But I am relieved to no longer have that responsibility, or debt, hanging over me.
A photo project (subtitle: the stages of grief)
In a move of stunning-but-accidental brilliance—truly, I wish I could claim it was insight and emotional intelligence that led me to do this, but it was not—I chose the last few days to sort all of my photos from 2009 onward. I’ve been meaning to do it for ages; it’s embarrassing for a librarian’s digital assets to be so messy, and it bothered me.
I was horribly depressed on Monday (“little-d depressed,” I call it, because it wasn’t chronic and happened for known reasons), and at the outset of the photo sorting project, I sort of thought I’d make it worse. I risked it because it was a thing I could do that was useful and that I had the energy for.
Honestly, yes, whole swaths of the photos were sad, upsetting, angry-making. Going through the ones from summer 2014 made my stomach hurt, by turns, with guilt, anxiety, and rage. I couldn’t bring myself to keep the photo of my 2014 Christmas tree, because that was such a dark time for me. I noticed I had taken very few photos in Charlottesville, and I kept far fewer. And, throughout the whole process, I watched myself put photo after photo into a folder titled “alaska,” and I grieved for a chapter of Dale’s and my life, closed.
But then the most remarkable thing happened: in the process of curating my photos, I started coming to terms with the last six years. In sorting those photos that went in the “alaska” folder, I watched us discover beautiful places, make friends*, make a home (with lots of help from said friends), and grow up a little more from who we had been. I relived some successful projects, and some not so successful ones. I noticed that the wonder over Alaska’s unique environment never really left me; I had snow and ice photos from every winter, even the last; inlet (and puffin) photos from every trip to Seward; flower photos every summer; and (crappy) photos of the mountains around town in every season, every year. I tried to photograph every moose and most flocks of ravens. (I wish I were so assiduous in capturing photos of my friends, but I treasure the ones I did get.)
Our Alaska photos tell a great story, one I’m so lucky to have had the chance to live. Going through them, making sure they were labeled and sorted, was soothing to my soul.
And because my most recent photos were all on my phone, the very last thing I did was go through my photos since July. The drive from Alaska looks fun in retrospect, and I can nearly forget how stressed we were for most of it. We definitely saw and photographed some cool stuff, anyway.
And then I realized that I had to make a new folder, “pittsburgh,” to put some recent, happy photographs into. And as simplistic as it may sound, that was a pretty transformational moment. Yes, a big, fantastic chapter had closed, and it’s fair to say I’ve lived between chapters for many months; but we’re starting a new chapter, in a place where we already have some roots laid down (though they need watering; if you’re a Pittsburgh friend who is reading this, let’s hang out soon?).
And it could be a great chapter, too. Dale likes his job, and it pays well, relative to Pittsburgh’s cost of living. I’ve applied for a job I’d be awesome at, and I think I have a good shot at it. If that job doesn’t work out, I have a pretty great backup plan: splitting my time between contract work and a kind of solo Hacker School Recurse Center, so I can fill in the skills I want but don’t yet have. I hear there’s a whole slew of retirements coming up at one of the local libraries, so it’s likely I can get back into librarianship soon, assuming I don’t fall in love with contracting. Maybe I can even manage to talk one of the local libraries into making a 25-35 hour/week position, so that I can keep a little extra “life” in my “work-life balance.”
A list of grateful thoughts
This post hasn’t been what I set out to write, although I think I like what it turned into. I still, as is my tradition most years, want to make a list of the things I’m thankful for. As usual, it’s an unordered and incomplete list.
Tomorrow we’re spending time with friends who are great hosts, great cooks, and great cultivators of friendships. … I don’t know how else to put that. I don’t just like them (though I do, very much); I also like their other friends. It will be a wonderful holiday!
Dale and I have been together for a decade, married for four years. He is patient and kind, and although I don’t know how he does it, he seems to love me very much, even knowing all of my flaws (probably better than I do). He is a steadying influence, tempering my tendency to be frenetic and impatient and overly goal-oriented. I still have a lot to learn from him.
I have four birds who are very funny and who seem to be healthy and happy. The oldest has beaten the average for what her lifespan could be, and I am grateful to have the time with her.
We’re back in Pittsburgh. This is a good town. Until Anchorage, it was the only place I’d ever been homesick for. (Like, crying during that Batman movie that was filmed here, homesick.) Dale and I met and fell in love here, and this is the second time we’ve come back. I think this might be home.
I covered this above, but we’re OK, financially. I am so thankful that that’s the case.
I’m also grateful for friends, new and old, near and far. There are so many people I love and so many who love me. The sadness of missing far off friends is always there, but it’s cushioned by warm feelings. I wouldn’t trade those friendships for anything. Plus, lots of excuses to travel and hopefully get visitors, right?
This is a little meta, but I am glad that I can find comfort in the holidays. I know they’re hard for a lot of people—they’ve been hard for me, in the past. I’m glad that I like Christmas carols and colored lights and the smell of cinnamon. The boost that I get from the trappings of the holiday season (after Halloween, anyway) helps make winter a lot more pleasant for me.
Things are better for us than they have been in ages. I have finally started to let go of some of the negative feelings I’ve built up in the last two years and to let myself hope and plan. I have a lot to be thankful for.
* Something maybe nowhere else does as well as Alaska: giving time and space and assistance to new folks, so that they have a chance to establish a community/support network for themselves. The South could learn a lot about hospitality from America’s northernmost state. (Yeah, I went there. I’m originally from VA, though, so I can say that.)
This presentation was a lightning talk done at Code4LibBC Unconference 2015 on batch editing MARC records. I’ve been hearing for quite some time that people struggle with vendor records, not least of all because making changes can be very time consuming. I’d like to present one possible method to help not only to fix vendor … Continue reading Semi-Automating Batch Editing MARC Records : Using MarcEdit
I have been seeing an enormous amount of momentum in the library industry toward “linked data”, often in the form of a fairly ambitious collective project to rebuild much of our infrastructure around data formats built on linked data.
I think linked data technology is interesting and can be useful. But I have some concerns about how it appears to me it’s being approached. I worry that “linked data” is being approached as a goal in and of itself, and what it is meant to accomplish (and how it will or could accomplish those things) is being approached somewhat vaguely. I worry that this linked data campaign is being approached in a risky way from a “project management” point of view, where there’s no way to know if it’s “working” to accomplish its goals until the end of a long resource-intensive process. I worry that there’s an “opportunity cost” to focusing on linked data in itself as a goal, instead of focusing on understanding our patrons’ needs, and how we can add maximal value for our patrons.
I am particularly wary of approaches to linked data that seem to assume from the start that we need to rebuild much or all of our local and collective infrastructure to be “based” on linked data, as an end in itself. And I’m wary of “does it support linked data” as the main question you ask when evaluating software to purchase or adopt. “Does it support linked data” or “is it based on linked data” can be too vague to even be useful as questions.
I also think some of those advocating for linked data in libraries are promoting an inflated sense of how widespread or successful linked data has been in the wider IT world. And that this is playing into the existing tendency for “magic bullet” thinking when it comes to information technology decision-making in libraries.
This long essay is an attempt to explain my concerns, based on my own experiences developing software and using metadata in the library industry. As is my nature, it turned into a far too long thought dump, hopefully not too grumpy. Feel free to skip around, I hope at least some parts end up valuable.
What is linked data?
The term “linked data” as used in these discussions basically refers to what I’ll call an “abstract data model” for data: a model of how you model data.
The model says that all metadata will be listed as a “triple” of (1) “subject”, (2) “predicate” (or relationship), and (3) “object”.
1. Object A [subject]
2. Is a [predicate]
3. book [object]
1. Object A [subject]
2. Has the ISBN [predicate]
3. "0853453535" [object]
1. Object A
2. has the title
3. "Revolution and evolution in the twentieth century"
1. Object A
2. has the author
3. Author N
1. Author N
2. has the first name
3. James
1. Author N
2. has the last name
3. Boggs
Our data is encoded as triples, statements of three parts: subject, predicate, object.
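As a minimal sketch of the idea, the statements above can be written as plain Python tuples. The names `Triple`, `object_a`, `author_n`, and `describe` are placeholders made up for this illustration; they are not part of any linked data standard or library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# The example statements from above. "object_a" and "author_n" are
# local placeholder identifiers, meaningful only inside this example.
triples = [
    Triple("object_a", "is a", "book"),
    Triple("object_a", "has the ISBN", "0853453535"),
    Triple("object_a", "has the title",
           "Revolution and evolution in the twentieth century"),
    Triple("object_a", "has the author", "author_n"),
    Triple("author_n", "has the first name", "James"),
    Triple("author_n", "has the last name", "Boggs"),
]


def describe(subject, triples):
    """Collect every (predicate, object) pair stated about a subject."""
    return [(t.predicate, t.object) for t in triples if t.subject == subject]
```

Calling `describe("author_n", triples)` gathers the two name statements, which hints at how a "record" has to be reassembled from loose triples.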
Linked data prefers to use identifiers for as many of these data elements as possible, and in particular identifiers in the form of URI’s.
“Object A” in my example above is basically an identifier, but similar to the “x” or “y” in an algebra problem, it has meaning only in the context of my example; someone else’s “Object A” or “x” or “y” in another example might mean something different, and if you try to throw them all together you’re going to get conflicts. URI’s are nice as identifiers in that, being based on domain names, they have a nice way of “namespacing” and avoiding conflicts: they are global identifiers.
# The identifiers I'm using are made up by me, and I use
# example.org to get across I'm not using standard/conventional
# identifiers used by others.
1. http://example.org/book/oclcnum/828033 [subject]
2. http://example.org/relationship/is_member_of_class [predicate]
3. http://example.org/type/Book [object]

# We can see sometimes we still need string literals, not URIs
1. http://example.org/book/oclcnum/828033
3. "Revolution and evolution in the twentieth century"

3. "Boggs, James"
I call the linked data model an “abstract data model”, because it is a model for how you model data: as triples.
You still, as with any kind of data modeling, need what I’ll call a “domain model” — a formal listing of the entities you care about (books, people), and what attributes, properties, and relationships with each other those entities have.
In the library world, we’ve always created these formal domain models, even before there were computers. We’ve called it “vocabulary control” and “authority control”. In linked data, that domain model takes the form of standard shared URI identifiers for entities, properties, and relationships. Establishing standard shared URI’s with certain meanings for properties or relationships (eg `http://example.org/relationship/has_title` will be used to refer to the title, possibly with special technical specification of what we mean exactly by ‘title’) is basically “vocabulary control”, while establishing standard shared URI’s for entities (eg `http://example.org/lccn/79128112`) is basically “authority control”.
You still need common vocabularies for your linked data to be inter-operable; there’s no magic in linked data otherwise. Linked data just says the data will be encoded in the form of triples, with the vocabularies being encoded in the form of URIs. (Or, you need what we’ve historically called a “cross-walk” to make data from different vocabularies inter-operable; linked data has certain standard ways to encode cross-walks for software to use them, but no special magic ways to automatically create them.)
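To make the cross-walk point concrete, here is a rough sketch of applying one in software. It assumes triples are plain `(subject, predicate, object)` tuples; the second vocabulary (`http://other.example/vocab/title`) and the function name `apply_crosswalk` are invented for the example. Note that the mapping table itself still has to be written by a person who knows both vocabularies.

```python
# Hypothetical cross-walk: a predicate in one made-up vocabulary mapped to
# its equivalent in another made-up vocabulary. Someone has to author this.
CROSSWALK = {
    "http://example.org/relationship/has_title":
        "http://other.example/vocab/title",
}


def apply_crosswalk(triples, crosswalk):
    """Rewrite predicates found in the cross-walk; leave all others untouched."""
    return [(s, crosswalk.get(p, p), o) for (s, p, o) in triples]


source = [
    ("http://example.org/book/oclcnum/828033",
     "http://example.org/relationship/has_title",
     "Revolution and evolution in the twentieth century"),
    ("http://example.org/book/oclcnum/828033",
     "http://example.org/relationship/is_member_of_class",
     "http://example.org/type/Book"),
]

converted = apply_crosswalk(source, CROSSWALK)
```

Only the title predicate is rewritten here; the class-membership triple passes through unchanged because nobody has mapped it yet.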
For an example of vocabulary (or “schema”) built on linked data technology, see schema.org.
You can see that through aggregating and combining multiple simple “triple” statements, we can build up a complex knowledge graph. Through basically one simple rule of “all data statements are triples”, we can build up remarkably complex data, and model just about any domain model we’d want. The library world is full of analytical and theoretically minded people who will find this theoretical elegance very satisfying, the ability to model any data at all as a bunch of triples. I think it’s kind of neat myself.
You really can model just about any data — any domain model — as linked data triples. We could take AACR2-MARC21 as a domain model, and express it as linked data by establishing a URI to be used as a predicate for every tag-subtag. There would be some tricky parts and edge cases, but once figured out, translation would be a purely mechanical task — and our data would contain no more information or utility output as linked data than it did originally, nor be any more inter-operable than it was originally, as is true of the output of any automated transformation process.
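A sketch of how mechanical that translation could be, assuming a record is already available as simple `(tag, subfield code, value)` rows. The predicate namespace `http://example.org/marc21/` and the function name are made up for this illustration, and real MARC has indicators, repeated fields, and control fields that this ignores.

```python
def marc_to_triples(record_uri, fields):
    """Turn (tag, subfield code, value) rows into triples,
    using one predicate URI per tag-subfield combination."""
    base = "http://example.org/marc21/"  # hypothetical predicate namespace
    return [(record_uri, f"{base}{tag}/{code}", value)
            for (tag, code, value) in fields]


# A toy record: 100 $a (main entry, personal name) and 245 $a (title).
fields = [
    ("100", "a", "Boggs, James"),
    ("245", "a", "Revolution and evolution in the twentieth century"),
]

marc_triples = marc_to_triples("http://example.org/book/oclcnum/828033", fields)
```

The output carries exactly the same values as the input rows, just reshaped, which is the point: nothing new has been learned by converting.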
You can model anything as linked data, but some things are more convenient and some things less convenient. Because linked data builds complex information graphs out of simple triples, it can actually be more difficult to deal with in practice, as you can see by looking at our made-up examples above and trying to understand what they mean. By being so abstract and formally simple, it can get confusing.
Some things that might surprise you are kind of inconvenient to model as linked data. It can take some contortions to model an ordered sequence using linked data triples, or to figure out how to model alternate language representations (say of a title) in triples. There are potentially multiple ways to solve these problems, with certain patterns established as standards for inter-operability, but they can be somewhat confusing to work with. Domain modeling is difficult already; having to fit your domain model into the linked data abstract model can be a fun intellectual exercise, but the need to undertake that exercise can make the task more difficult.
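One of the established patterns for an ordered sequence is a linked list built from "first" and "rest" statements (the RDF vocabulary defines `rdf:first`, `rdf:rest`, and `rdf:nil` for this). The sketch below shows the shape of it with plain tuples; the node labels like `_:node0` and the function names are made up for the example, and this is only one of several possible encodings.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def list_to_triples(items, prefix="_:node"):
    """Encode an ordered list as a chain of first/rest triples.

    Returns (head, triples). Node names such as "_:node0" are
    made-up labels for the intermediate list cells.
    """
    if not items:
        return NIL, []
    triples = []
    for i, item in enumerate(items):
        node = f"{prefix}{i}"
        nxt = f"{prefix}{i + 1}" if i + 1 < len(items) else NIL
        triples.append((node, FIRST, item))  # the value held at this position
        triples.append((node, REST, nxt))    # pointer to the next position
    return f"{prefix}0", triples


def triples_to_list(head, triples):
    """Walk the chain from the head node to recover the original order."""
    first = {s: o for (s, p, o) in triples if p == FIRST}
    rest = {s: o for (s, p, o) in triples if p == REST}
    items, node = [], head
    while node != NIL:
        items.append(first[node])
        node = rest[node]
    return items


authors = ["First Author", "Second Author", "Third Author"]
head, author_triples = list_to_triples(authors)
```

Three ordered values take six triples, and reading them back means walking pointers rather than reading a list, which is the kind of contortion meant above.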
Other things are more convenient with linked data. You might have been wondering when the “linked” would come in.
Modeling all our data as individual “triples” makes it easier to merge data from multiple sources. You just throw all the triples together (you are still going to need to deal with any conflicts or inconsistencies that come about). Using URIs as vocabulary identifiers means that you can throw all this data together from multiple sources and you won’t have any vocabulary conflicts: you won’t find one source using MARC tag 100 to mean “main entry” and another source using the 100 tag to mean all sorts of other things (see UNIMARC!).
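A rough sketch of what "throw all the triples together" can look like, treating each source as a set of (subject, predicate, object) tuples. The two sources, the title strings, and the idea of declaring some predicates "single valued" are assumptions for illustration, not part of any linked data standard.

```python
from collections import defaultdict

BOOK = "http://example.org/book/oclcnum/828033"
TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HAS_TITLE = "http://example.org/relationship/has_title"


def merge(*sources):
    """Merging is just set union; identical triples collapse into one."""
    merged = set()
    for source in sources:
        merged |= set(source)
    return merged


def conflicts(triples, single_valued):
    """Find subjects with more than one value for a predicate we expect to be unique."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in single_valued:
            values[(s, p)].add(o)
    return {key: objs for key, objs in values.items() if len(objs) > 1}


source_a = [(BOOK, TYPE, "http://example.org/type/Book"),
            (BOOK, HAS_TITLE, "Pride and Prejudice")]
source_b = [(BOOK, TYPE, "http://example.org/type/Book"),
            (BOOK, HAS_TITLE, "Pride & Prejudice")]

merged = merge(source_a, source_b)
print(len(merged))
print(conflicts(merged, {HAS_TITLE}))
```

The union is trivial; deciding which of the two titles to trust is not, and nothing in the triple model makes that decision for you.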
Linked data vocabularies are always “open for extension”. Let’s say we established that there’s a sort of thing called a `http://example.org/type/Book`, and it has a number of properties and relationships including `http://example.org/relationship/has_title`. But someone realizes, gee, we really want to record the color of the book too. No problem, they just start using `http://mydomain.tld/relationship/color`, or whatever they want. It won’t conflict with any existing data (no need to find an unused MARC tag!), but of course it won’t be useful outside the originator’s own system unless other people adopt this convention, and software is written to recognize and do something with it (open for extension, but we still need to adopt common vocabularies).
And using URIs is meant to make it more straightforward to combine data from multiple sources in another way: an http URI actually points to a network location, which could be used to deliver more information about something, say, `http://example.org/book/oclcnum/828033`, in the form of more triples. These are mechanics to make it easier to assemble (meta)data from multiple sources together.
There are mechanics meant to support aggregating, combining, and sharing data built into the linked data design, but the fundamental problems all still exist: vocabulary and authority control, using the same or overlapping vocabularies (or creating cross-walks), creating software that recognizes and does something useful with vocabulary elements actually in use, etc. So do business model challenges with entities that don’t want to share their data, or human labor power challenges with getting data recorded. I think it’s worth asking if the mechanical difficulties with, say, merging MARC records from different sources, are actually the major barriers to more information sharing/coordination in the present environment, vs these other factors.
“Semantic web” vs “linked data”? vs “RDF”?
The “semantic web” is an older term than “linked data”, but you can consider it to refer to basically the same thing. Some people cynically suggest “linked data” was meant to rebrand the “semantic web” technology after it failed to get much adoption or live up to its hype. The relationship between the two terms according to Tim Berners-Lee (who invented the web, and is either the inventor or at least a strong proponent of semantic web/linked data) seems to be that “linked data” is the specific technology or implementations of individual buckets of data, while the “semantic web” is the ecosystem that results from lots of people using it.
“RDF” stands for “Resource Description Framework”, and is actually the official name of the abstract data model of “triples”. “Linked data” could then be understood as data using RDF and URIs, and the “semantic web” as the ecosystem that results from plenty of people doing it.
Technicalities aside, “semantic web”, “linked data”, and “RDF” can generally be understood as rough synonyms when you see people discussing them — whatever term they use, they are talking about (meta)data modeled as “triples”, and the systems that are created by lots of such data integrated together over the web.
So. What do you actually want to do? Where are the users?
At a recent NISO forum on The Future of Library Resource Discovery, there was a session where representatives from 4(?) major library software vendors took Q&A from a moderator and the audience. There was a question about the vendors’ commitment to linked data. The first respondent (who I think was from EBSCO?) said something like:
[paraphrased] Linked data is a tool. First you need to decide what you want to do, then linked data may or may not be useful to doing that.
I think that’s exactly right.
Some of the other respondents, perhaps prompted by the first answer, gave similar answers, while others (especially OCLC) remarked on their commitment to linked data and the various places they are using it. Of these though, I’m not sure any have actually resulted in any currently useful outcomes due to linked data usage.
Four or five years ago, talk of “user-centered design” was big in libraries — and in the software development world in general. For libraries (and other service organizations), user-centered design isn’t just about software — but software plays a key role in almost any service a contemporary library offers, quite often mediating the service through software, such that user-centered design in libraries probably always involves software.
For academic libraries, with a mission to help our patrons in research, teaching, and learning, user-centered design begins with understanding our patrons’ research and learning processes, and figuring out the most significant interventions we can make to improve things for our patrons. What are their biggest pain points? Where can we make the biggest difference? To maximize our effectiveness when there’s an unlimited number of approaches we could take, you want to start with areas where you can make a big improvement for the least resource investment.
Even if your institution lacks the resources to do much local research into user behavior, over the past few years a lot of interesting and useful multi-institutional research has been done by various national and international library organizations, such as reports from OCLC [a] [b], JISC [a], and Ithaka [a], [b], as well as various studies done by practitioners and published in journals.
To what extent is the linked data campaign informed by, motivated by, or based on what we know about our users’ behavior and needs? To what extent are the goals of the linked data campaign explicit and specific, and are those goals connected back to what our users need from us? Do we even know what we’re supposed to get out of it at all, beyond “data that’s linked better”, or “data that works well with the systems of entities outside the library industry”? (And for the latter, do we actually understand in what ways we want it to “work well”, for what reasons, and what it takes to accomplish that?) Are we asking for specific success stories from the pilot projects that have already been done? And connecting them to what we need to do to provide for our users?
To be clear, I do think goals to increase our own internal staff efficiency, or to improve the quality of our metadata that powers most of our services are legitimate as well. But they still need to be tied back to user needs (for instance, to know the metadata you are improving is actually the metadata you need and the improvements really will help us serve our users better), and be made explicit (so you can evaluate how well efforts at improvement are working).
I think the motivations for the linked data campaign can be somewhat unclear and implicit; when they are made explicit, they are sometimes very ambitious goals which require a lot of pieces falling into place (including third-party cooperation and investment that is hardly assured) for realization only in the long term, and with unclear or not-made-explicit benefits for our patrons even if realized. For a major multi-institution, multi-year, resource-intensive campaign, this seems to me not sufficiently grounded in our users’ needs.
Is everyone else really doing it? Maybe not.
At another linked data presentation I attended recently, a linked data promoter said something along the lines of:
[paraphrased] Don’t do linked data because I say so, or because LC says so. Do it because it’s what’s necessary to keep us relevant in the larger information world, because it’s what everyone else is doing. Linked data is what lets Google give you good search results so quickly. Linked data is used by all the major e-commerce sites, this is how they can accomplish what they do.
The thing is, from my observation and understanding of the industry and environment, I just don’t think it’s true that “everyone is doing it”.
Google also uses linked data to a somewhat more central extent in its Knowledge Graph feature, which provides “facts” in sidebars on search results. But most of the sources of data Google harvests from for its Knowledge Graph aren’t actually linked data; rather, Google harvests them and turns them into linked data internally, and then doesn’t actually expose the linked-data-ified data to the wider world. In fact, Google has several times announced initiatives to expose the collected and triple-ified data to the wider world, but they have not actually turned into supported products. This doesn’t necessarily say what advocates might want about the purported central role of linked data to Google, or what it means for linked data’s wider adoption. As far as I know or can find out, linked data does not play a role in the actual primary Google search results, just in the Knowledge Graph “fact boxes”, and the “rich snippets” associated with results.
My sense is that the general industry understanding is that linked data has not caught on like people thought it would in the 2007-2012 heyday, and adoption has in fact slowed and reversed. (Google trend of linked data/semantic web)
An October 2014 post on Hacker News asks: “A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?”
In the ensuing discussion on that thread (which I encourage you to read), you can find many opinions, including:
“The way I see it that technology has been on the cusp of being successful for a long time” [but has stayed on the cusp]
“A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them. I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.”
“For what it’s worth, I spent last month trying to use RDF tooling (Python bindings, triple stores) for a project recently, and the experience has left me convinced that none of it is workable for an average-size, client-server web application. There may well be a number of good points to the model of graph data, but in practice, 16 years of development have not lead to production-ready tools; so my guess is that another year will not fix it.”
But also, to be fair: “There’s really no debate any more. We use the the technology borne by the ‘Semantic Web’ every day.” [Personally I think this claim was short on specifics, and gets disputed a bit in the comments]
At the very least, the discussion reveals that linked data/semantic web is still controversial in the industry at large, it is not an accepted consensus that it is “the future”, it has not “taken over.” And linked data is probably less “trendy” now in the industry at large than it was 4-6 years ago.
Talis was a major UK vendor of ILS/LMS library software; the company’s history begins in 1969 as a library cooperative, similar to OCLC’s beginnings. In the mid-2000s, they started shifting to a strategic focus on semantic web/linked data. In 2011, they actually sold off their library management division to focus primarily on semantic web technology. But quickly thereafter, in 2012, they announced “that investment in the semantic web and data marketplace areas would cease. All efforts are now concentrated on the education business.” They are now in the business of producing an “enterprise teaching and learning platform” (compare to Blackboard, if I understand correctly), and are apparently fairly successful at it, but the semantic web focus didn’t pan out. (Wikipedia, Talis Group)
In 2009, The New York Times, to much excitement, announced a project to expose their internal subject vocabulary as linked data. While the data is still up, it looks to me like it was abandoned in 2010; there has been no further discussion or expansion of the service, and the data looks not to have been updated. Subject terms have a “latest use” field which seems to be stuck in May or June 2010 for every term I looked at (see Obama, Barack for instance), and no terms seem to be available for subjects that have become newsworthy since 2010 (no Carson, Ben, for instance).
In the semantic web/linked data heyday, a couple of attempts to create large linked data databases were announced and generated a lot of interest. Freebase was started in 2007, acquired by Google in 2010… and shut down in 2014. DBpedia was begun much earlier and still exists… but it doesn’t generate the excitement or buzz that it used to. The newer Wikidata (2012) still exists, and is considered a successor to Freebase by some. It is generally acknowledged that none of these projects have lived up to initial hopes with regard to resulting in actual useful user-facing products or services; they remain experiments. A 2013 article, “There’s No Money in Linked Data”, suggests:
….[W]e started exploring the use of notable LD datasets such as DBpedia, Freebase, Geonames and others for a commercial application. However, it turns out that using these datasets in realistic settings is not always easy. Surprisingly, in many cases the underlying issues are not technical but legal barriers erected by the LD data publishers.
In Jan 2014, Paul Houle in “The trouble with DBpedia” argues that the problems are actually about data quality in DBpedia, specifically about vocabulary control, and how automatic creation of terms from use in Wikipedia leads to inconsistent vocabularies. Houle thinks there are in fact technical solutions, but he, too, begins from the acknowledgement that DBpedia has not lived up to its expected promise. In a very lengthy slide deck from February 2015, “DBpedia Ontology and Mapping Problems”, vladimiralexiev has a perhaps different diagnosis of the problem, about ontology and vocabulary design, and he thinks he has solutions. Note that he too is coming from an experience of finding DBpedia not working out for his uses.
There’s disagreement about why these experiments haven’t panned out to be more than experiments or what can be done or what promise they (and linked data in general) still have — but pretty widespread agreement in the industry at large that they have not lived up to their initial expected promise or hype, and have as of yet delivered few if any significant user-facing products based upon them.
It is interesting that many diagnoses of the problems there are about the challenges of vocabulary control and developing shared vocabularies, the challenges of producing/extracting sufficient data that is fit to these vocabularies, as well as business model issues — sorts of barriers we are well familiar with in the library industry. Linked data is not a magic bullet that solves these problems, they will remain for us as barriers and challenges to our metadata dreams.
Semantic web and linked data are still being talked about, and worked on in some commercial quarters, to be sure. I have no doubt that there are people and units at Google who are interested in linked data, who are doing research and experimentation in that area, who are hoping to find wider uses for linked data at Google, although I do not think it is true that linked data is currently fundamentally core to Google’s services or products or how they work. What they have not done is taken over the web, or become a widely accepted fact in the industry. It is simply not true that “every major ecommerce site” has an architecture built on linked data. It is certainly true that some commercial sector actors continue to experiment with and explore uses of linked data.
But in fact, I would say that libraries and the allied cultural heritage sector, along with limited involvement from governmental agencies (especially in the UK, although not to the extent some would like, with 2010 cancellation of a program) and scholarly publishing (mainly I think of Nature Publishing), are primary drivers of linked data research and implementation currently. We are some of the leaders in linked data research, we are not following “where everyone else is going” in the private sector.
There’s nothing necessarily wrong with libraries being the drivers in researching and implementing interesting and useful technology in the “information retrieval” domain — our industry was a leader in information retrieval technology 40-80 years ago, it would be nice to be so again, sure!
But what we don’t have is “everyone else is doing it” as a motivation or justification for our campaign. It is not the case that it must be a good idea because the major players on the web are investing heavily in it (they aren’t), nor that we will be able to inter-operate with everyone else the way we want if we just transition all of our infrastructure to linked data because that’s where everyone else will be too (they won’t necessarily, and everyone using linked data isn’t alone sufficient for inter-operability anyway; there needs to be coordination on vocabularies as well, just to start).
My Experiences in Data and Service Interoperability Challenges
For the past 7+ years, my primary work has involved integrating services and data from disparate systems, vendors, and sources, in the library environment. I have run into many challenges and barriers to my aspired integrations. They often have to do with difficulties in data interoperability/integration; or in the utility of our data, difficulties in getting what I actually need out of data. These are the sorts of issues linked data is meant to be at home in.
However, seldom in my experience do I run into a problem where simply transitioning infrastructure to linked data would provide a solution or fundamental advancement. The barriers often have at their roots business models (entities that have data you want to interoperate with, but don’t want their data to be shared because keeping it close is of business value to them; or that simply have no business interest in investing in the technology needed to share data better); or lack of common shared domain models (vocabulary control); or lack of person power to create/record the ‘facts’ needed in machine-readable format.
Linked data would be neither necessary nor sufficient to solving most of the actual barriers I run into. Simply transitioning to a linked data-based infrastructure without dealing with the business or domain model issues would not help at all; and linked data is not needed to solve the business or domain model issues, and of unclear aid in addressing them: A major linked data campaign may not be the most efficient, cost effective, or quickest way to solve those problems.
Here are some examples.
What Serial Holdings Do We Have?
In our link resolver, powered by Umlaut, a request might come in for a particular journal article, say the made up article “Doing Things in Libraries”, by Melville Dewey, on page 22 of Volume 50 Issue 2 (1912) of the Journal of Doing Things.
I would really like my software to tell the user if we have this specific article in a bound print volume of the Journal of Doing Things, exactly which of our location(s) that bound volume is located at, and if it’s currently checked out (from the limited collections, such as off-site storage, we allow bound journal checkout).
My software can’t answer this question, because our records are insufficient. Why? Not all of our bound volumes are recorded at all, because when we transitioned to a new ILS over a decade ago, bound volume item records somehow didn’t make it. Even for bound volumes we do have records for (or for summary of holdings information on bib/copy records), the holdings information (what volumes/issues are contained) is entered in one big string by human catalogers. This results in output that is understandable to a human reading it (at least one who can figure out what “v.251(1984:Jan./June)-v.255:no.8(1986)” means). But while the information is theoretically input according to cataloging standards, changes in practice over the years, varying practice between libraries, human variation and error, lack of validation from the ILS to enforce the standards, and lack of clear guidance from standards in some areas mean that the information is not recorded in a way that software can clearly and unambiguously understand.
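To give a feel for the problem, here is a small sketch of a parser written against that one example string. The pattern and the function name `parse_holdings` are my own illustration, not any standard or any ILS's actual behavior, and the two variant strings are made-up examples of the kind of variation described above.

```python
import re

# Matches only one specific shape: "v.<n>(<year>...)-v.<n>[:no.<n>](<year>)".
RANGE = re.compile(
    r"^v\.(?P<start_vol>\d+)\((?P<start_year>\d{4})[^)]*\)"
    r"-v\.(?P<end_vol>\d+)(?::no\.(?P<end_issue>\d+))?\((?P<end_year>\d{4})\)$"
)


def parse_holdings(statement):
    """Return a dict of integers for the one pattern we know, else None."""
    match = RANGE.match(statement.strip())
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}


print(parse_holdings("v.251(1984:Jan./June)-v.255:no.8(1986)"))

# Made-up variants a human reads just as easily, but this parser cannot.
print(parse_holdings("v.251 (1984)-v.255 (1986)"))
print(parse_holdings("Vol. 251-255, 1984-1986"))
```

Every new variant needs another pattern, and some strings are ambiguous no matter how many patterns you write. That is a data recording problem, and it would look the same if the string were the object of a triple.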
This is a problem of varying degrees at other libraries too, including for digitized copies, for presumably similar reasons. In addition to at my own library, I’d like my software to be able to figure out if, say, HathiTrust has a digitized copy of this exact article (digitized copy of that volume and issue of that journal). Or if nearby libraries in WorldCat have a physical bound journal copy, if we don’t here. I can’t really reliably do that either.
We theoretically have a shared data format and domain model for serial holdings, the MARC Format for Holdings Data (MFHD). One problem is that not all ILSs actually implement MFHD; but more than that, MFHD was designed in a world of printing catalog cards, and doesn’t actually specify the data in the right way to be machine actionable, to answer the questions we want answered. MFHD also allows for a lot of variability in how holdings are recorded, with some patterns simply not recording sufficient information.
In 2007 (!) I advocated more attention to ONIX for Serials Coverage as a domain model, because it does specify the recording of holdings data in a way that could actually serve the purposes I need. That certainly hasn’t happened; I’m not sure there’s been much adoption of the standard at all. It probably wouldn’t be that hard to convert ONIX for Serials Coverage to a linked data vocabulary; that would be fine, if not necessarily advancing its power any. It’s powerful, if it were used, because it captures the data actually needed for the services we need in a way software can use, whether or not it’s represented as linked data. Actually implementing ONIX for Serials Coverage (with or without linked data) in more systems would have been a huge aid to me. Hasn’t happened.
Likewise, we could probably, without too much trouble, create a “linked data” translated version of MFHD. This would solve nothing, neither the problems with MFHD’s expressiveness nor adoption. Neither would having an ILS whose vendor advertises it as “linked data compatible” or whatever, make MFHD work any better. The problems that keep me from being able to do what I want have to do with domain modeling, with adoption of common models throughout the ecosystem, and with human labor to record data. They are not problems the right abstract data model can fix, they are not fundamentally problems of the mechanics of sharing data, but of the common recording of data in common formats with sufficient utility.
Lack of OCLC number or other identifiers in records
Even in a pre-linked data world, we have a bunch of already existing useful identifiers, which serve to, well, link our data. OCLC numbers as identifiers in the library world are prominent for their widespread adoption and (consequent) usefulness.
If several different library catalogs all use OCLC numbers on all their records, we can do a bunch of useful things, because we can easily know when a record in one catalog represents the same thing as a record in another. We can do collection overlap analysis. We can link from one catalog to another: oh, it’s checked out here, but this other library we have a reciprocal borrowing relationship with has a copy. We can easily create union catalogs that merge holdings from multiple libraries onto de-duplicated bibs. We can even “merge records” from different libraries: maybe a bib from one library has 505 contents but the bib from another doesn’t; the one that doesn’t can borrow the data and know which bib it applies to. (Unless it’s licensed data they don’t have the right to share, a very real problem, which is not a technical one, and which linked data can’t solve either.)
We can do all of these things today, even without linked data. Except I can’t, because in my local catalog a great many (I think a majority) of records lack OCLC numbers.
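A minimal sketch of how little machinery overlap analysis needs once the identifiers are present. The record dictionaries, their keys, and the local ids are hypothetical stand-ins for whatever a catalog export actually looks like.

```python
def index_by_oclc(records):
    """Map OCLC number -> record, skipping records that have no number."""
    return {r["oclc"]: r for r in records if r.get("oclc")}


def overlap(catalog_a, catalog_b):
    """OCLC numbers held by both catalogs."""
    a, b = index_by_oclc(catalog_a), index_by_oclc(catalog_b)
    return sorted(a.keys() & b.keys())


def unmatched(records):
    """Records we cannot link to anything, because the identifier is absent."""
    return [r for r in records if not r.get("oclc")]


# Toy data with made-up local ids.
ours = [
    {"id": "a1", "oclc": "828033"},
    {"id": "a2", "oclc": None},
]
theirs = [
    {"id": "b7", "oclc": "828033"},
    {"id": "b8", "oclc": "111"},
]

print(overlap(ours, theirs))
print([r["id"] for r in unmatched(ours)])
```

The hard part is the second list. No change of data format moves a record out of it; someone or something has to supply the number.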
Why? Many of them are legacy records from decades ago, before OCLC was the last library cooperative standing, from before we cared. Not all the records missing OCLC numbers are legacy, though. Many of them are contemporary records supplied by vendors (book jobbers for print, or e-book vendors), which come to us without OCLC numbers. (Why do we get records from them instead of from OCLC? Convenience? Price? No easy way to figure out how to bulk download all records for a given purchased ebook package from OCLC? Why don’t the vendors cooperate with OCLC enough to have OCLC numbers on their records? I’m not sure. Linked data solves none of these issues.)
Even better, I’d love to be able to figure out if the book represented by a record in my catalog exists in Google Books, with limited excerpts and searchability or even downloadable fulltext. Google Books actually has a pretty good API, and if Google Books data had OCLC numbers in it, I could easily do this. But even though Google Books got a lot of its data from OCLC WorldCat, Google Books data only rarely includes OCLC numbers, and does so in entirely undocumented ways.
Lack of OCLC numbers in data is a problem very much about linking data, but it’s not a problem linked data can solve. We have the technology now; the barriers are about human labor power, business models, priorities, costs. Whether the OCLC numbers that are there are in a MARC record in field 035, or expressed as a URI (say, `http://www.worldcat.org/oclc/828033`) and included in linked data, is entirely irrelevant to me. My barriers are about lack of OCLC numbers in the data. I could deal with them in just about any format at all, and linked data formats won’t help appreciably, but I can’t deal with the data being absent.
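As a sketch of why the format is the easy part: pulling the same number out of an 035-style value or a WorldCat-style URI is a few lines. The `(OCoLC)` prefix and the optional `ocm`/`ocn`/`on` prefixes handled here reflect my understanding of common practice and should be treated as assumptions, as should the function name.

```python
import re

_PATTERNS = [
    # 035-style value, e.g. "(OCoLC)828033" or "(OCoLC)ocm00828033".
    re.compile(r"^\(OCoLC\)\s*(?:ocm|ocn|on)?0*(\d+)$"),
    # URI-style value, e.g. "http://www.worldcat.org/oclc/828033".
    re.compile(r"^https?://(?:www\.)?worldcat\.org/oclc/0*(\d+)$"),
]


def oclc_number(value):
    """Return the bare OCLC number as a string, or None if absent or unrecognized."""
    if not value:
        return None
    value = value.strip()
    for pattern in _PATTERNS:
        match = pattern.match(value)
        if match:
            return match.group(1)
    return None


print(oclc_number("(OCoLC)828033"))
print(oclc_number("http://www.worldcat.org/oclc/828033"))
print(oclc_number(None))
```

Both present forms normalize to the same value. The only case the code cannot do anything about is the last one, where there is nothing to normalize.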
And in fact, if you convert your catalog to “linked data” but still lack OCLC numbers — you’re still going to have to solve that problem to do anything useful as far as “linking data”. The problem isn’t about whether the data is “linked data”, it’s about whether the data has useful identifiers that can be used to actually link to other data sets.
As you might guess from the fact that so many records in our local catalog don’t have OCLC numbers, most of the records in our local catalog also haven’t been updated since they were added years or decades ago. They might have typos that have since been corrected in WorldCat. They might represent cataloging practices from ages ago (now inconsistent with present data) that have since been updated in WorldCat. The WorldCat records might have been expanded to have more useful data (better subjects, updated controlled author names, useful 5xx notes).
Our catalog doesn’t get these changes, because we don’t generally update our records from WorldCat, even for the records that do have OCLC numbers. (Also, naturally, not all of our holdings are actually listed with WorldCat, although this isn’t exactly the same set as those that lack OCLCnums in our local catalog). We could be doing that. Some libraries do, some libraries don’t. Why don’t the libraries that don’t? Some combination of cost (to vendors), local human labor, legacy workflows difficult to change, priorities, lack of support from our ILS software for automating this in an easy way, not wanting to overwrite legacy locally created data specific to the local community, maybe some other things.
Getting our local data to update when someone else has improved it, is again the kind of problem linked data is targeted at, but linked data won’t necessarily solve it, the biggest barriers are not about data format. After all, some libraries sync their records to updated WorldCat copy now, it’s possible with the technology we have now, for some. It’s not fundamentally a problem of mechanics with our data formats.
I wish our ILS software was better architected to support “sync with WorldCat” workflow with as little human intervention as possible. It doesn’t take linked data to do this — some are doing it already, but our vendor hasn’t chosen to prioritize it. And just because software “supports linked data” doesn’t guarantee it will do this. I’d want our vendors focusing on this actual problem (whether solved with or without linked data), not the abstract theoretical goal of “linked data”.
Difficulty of getting format/form info from our data, representing what users care about
One of the things my patrons care most about, when running across a record in the catalog for say, “Pride and Prejudice”, is format/genre issues.
Is a given record the book, or a film? A film on VHS, or DVD (you better believe that matters a lot to a patron!)? Or streaming online video? Or an ebook? Or some weird copy we have on microfiche? Or a script for a theatrical version? Or the recording of a theatrical performance? On CD, or LP, or an old cassette?
And I similarly want to answer this question when interrogating data at remote sources, say, WorldCat, or a neighboring library’s catalog.
It is actually astonishingly difficult to get this information out of MARC: the form/format/genre of a given record, in terms that match our users’ tasks or desires. Why? Well, because the actual world we are modeling is complicated and constantly changing over the decades, it’s unclear how to formally specify this stuff, especially when it’s changing all the time (Oh, it’s a blu-ray, which is kind of a DVD, but actually different). (I can easily tell you the record you’re looking at represents something that is 4.75″ wide though, in case you cared about that…)
Well, it’s a hard problem of domain modeling, harder than it might seem at first glance. A problem that negatively impacts a wide swath of library users across library types. Representing data as linked data won’t solve it, it’s an issue of vocabulary control. Is anyone trying to solve it?
Workset Grouping and Relationships
Related to form/format/genre issues but a distinct issue, is all the different versions of a given work in my catalog.
There might be dozens of Pride and Prejudices. For the ones that are books, do they actually all have the same text in them? I don’t think Austen ever revised it in a new edition, so probably they all do even if published a hundred years apart; but that’s very much not true of textbooks, or even general contemporary non-fiction, which often exists in several editions with different text. Still, different editions of Pride and Prejudice might have different forewords or prefaces or notes, which might matter in some contexts. Or maybe different pagination, which matters for citation lookup.
And then there’s the movies, the audiobooks, the musical (?). Is the audiobook the exact text of the standard Pride and Prejudice just read aloud? Or an abridged version? Or an entirely new script with the same story? Are two videos the exact same movie one on VHS and one on DVD, or two entirely different dramatizations with different scripts and actors? Or a director’s cut?
These are the kinds of things our patrons care about, to find and identify an item that will meet their needs. But in search results, all I can do is give them a list of dozens of Pride and Prejudices, and let them try to figure it out, or maybe at least segment by video vs print vs audio. Maybe we’re not talking search results, maybe my software knows someone wants a particular edition (say, based on an input citation) and wants to tell the user if we have it, but good luck to my software in trying to figure out if we have that exact edition (or if someone else does, in WorldCat or a neighboring library, or Amazon or Google Books).
This is a really hard problem too. And again it’s a problem of domain modeling, and equally of human labor in recording information (we don’t really know if two editions have the exact same text and pagination, someone has to figure it out and record it). Switching to the abstract data model of linked data doesn’t really address the barriers.
The library world made a really valiant effort at creating a domain model to capture these aspects of edition relationships that our users care about: FRBR. It’s seen limited adoption or influence in the 15+ years since it was released, which means it’s also seen limited (if any) additional development or fine-tuning, which anything trying to solve this difficult domain modeling problem will probably need (see RDA’s efforts at form/format/genre!). Linked data won’t solve this problem without good domain modeling, but ironically it’s some of the strongest advocates for “linked data” that I’ve seen arguing most strongly against doing anything more with adoption or development of FRBR; as far as I am aware, the needed work to develop common domain modeling is not being done in the library linked data efforts. Instead, the belief seems to be that if you just have linked data and let everyone describe things however they want, somehow it will all come together into something useful that answers the questions our patrons need answered, with no need for any common domain model vocabulary. I don’t believe existing industry experience with linked data, or software engineers’ experience with data modeling in general, supports this fantasy.
Multiple sources of holdings/licensing information
For the packages of electronic content we license/purchase (ebooks, serials), we have so many “systems of record”. The catalog’s got bib records for items from these packages, the ERM has licensing information, the link resolver has coverage and linking information, oh yeah and then they all need to be in EZProxy too, maybe a few more.
There’s no good way for software to figure out when a record from one system represents the same platform/package/license as in another system. Which means lots of manual work synchronizing things (EZProxy configuration, SFX kb). And things my software can do only with difficulty or simply can’t do at all — like, when presenting URLs to users, figure out if a URL in a catalog is really pointing to the same destination as a URL offered by SFX, even though they’re different URLs (epnet.com vs ebscohost.com?).
So one solution would be “why don’t you buy all these systems from the same vendor, and then they’ll just work together”, which I don’t really like as a suggested solution, and at any rate as a suggestion is kind of antithetical to the aims of the “linked data” movement, amirite?
So the solution would obviously be common identifiers used in all these systems, for platforms, packages and licenses, so software can know that a bib record in the catalog that’s identified as coming from package X for ISSN Y is representing the same access route as an entry in the SFX KB also identified as package X, and hey maybe we can automatically fetch the vendor suggested EZProxy config listed under identifier X too to make sure it’s activated, etc.
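To make the idea of common identifiers concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the record shapes, the `package_id` field, and the system names are hypothetical, and no real catalog, SFX, or EZProxy API is being called. It only shows that once every system carries the same package identifier, matching records across systems becomes a trivial grouping step.

```python
from collections import defaultdict

# Hypothetical exports from three systems. "package_id" is an assumed
# shared identifier; no real system is known to expose this field.
catalog_bibs = [
    {"system": "catalog", "issn": "1234-5678", "package_id": "pkg-X"},
    {"system": "catalog", "issn": "2222-3333", "package_id": "pkg-Y"},
]
link_resolver_entries = [
    {"system": "link_resolver", "issn": "1234-5678", "package_id": "pkg-X"},
]
proxy_configs = [
    {"system": "proxy", "package_id": "pkg-X"},
]


def group_by_package(*record_lists):
    """Map each package identifier to the list of systems that know about it."""
    grouped = defaultdict(list)
    for records in record_lists:
        for record in records:
            grouped[record["package_id"]].append(record["system"])
    return dict(grouped)


def packages_missing_from(grouped, system):
    """Return package identifiers that have no record in the given system."""
    return sorted(pkg for pkg, systems in grouped.items() if system not in systems)


grouped = group_by_package(catalog_bibs, link_resolver_entries, proxy_configs)
missing_proxy = packages_missing_from(grouped, "proxy")
print(missing_proxy)
```

The code is the easy part. The hard part, as argued here, is getting vendors and systems to record the same identifier in the first place.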
Why isn’t this happening already? Lack of cooperation between vendors, lack of labor power to create and maintain common identifiers, lack of resources or competence from our vendors (who can’t always even give us a reliable list in any format at all of what titles with what coverage dates are included in our license) or from our community at large (how well has DLF-ERMI worked out as far as actually being useful?).
In fact, if I imagined an ideal technical infrastructure for addressing this, linked data actually would be a really good fit here! But it could be solved without linked data too, and coming up with a really good linked data implementation won’t solve it, the problems are not mainly technical. We primarily need common identifiers in use between systems, and the barriers to that happening are not that the systems are not using “linked data”.
Google Books won’t link out to me
Google Scholar links back to my systems using OpenURL links. This is great for getting a user who chooses to use Google Scholar for discovery back to me to provide access through a licensed or owned copy of what they want. (There are problems with Google Scholar knowing what institution they belong to so they can link back to the right place, but let’s leave that aside for now; it’s still way better than not being there.)
I wish Google Books did the same thing. For that matter, I wish Amazon did the same thing. And lots of other people.
They don’t because they have no interest in doing so. Linked data won’t help, even though this is definitely an issue of, well, linking data.
OpenURL, a standard frozen in time
Oh yeah, so let’s talk about OpenURL. It’s been phenomenally successful in terms of adoption in the library industry. And it works. It’s better that it exists than if it didn’t. It does help link disparate systems from different vendors.
The main problem is that it’s basically abandoned. I don’t know if there’s technically a maintenance group, but if there is, they aren’t doing much to improve OpenURL for scholarly citation linking, the use case it’s been successful in.
For instance, I wish there was a way to identify a citation as referring to a video or audio piece in OpenURL, but there isn’t.
Now, theoretically the “open for extension” aspect of linked data seems relevant here. If things were linked data and you needed a new data element or value, you could just add one. But really, there’s nothing stopping people from doing that with OpenURL now. Even if technically not allowed, you can just decide to say `&genre=video` in your OpenURL, and it probably won’t disturb anything (or you can figure out a way to do that not using the existing `genre` key that really won’t disturb anything).
The problem is that nothing will recognize it and do anything useful with it, and nobody is generating OpenURLs like that either. It’s not really an ‘open for extension’ problem, it’s a problem of getting the ecosystem to do it, of vocabulary consensus and implementation. That’s not a problem that linked data solves.
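As a rough illustration of the point, here is a Python sketch of both sides of that exchange. The resolver base URL is hypothetical, the key names follow the simple `genre=` form used above, and the set of recognized genres is an illustrative subset rather than a complete list from the standard. Generating the extended link is trivial; the receiving side simply has no idea what to do with it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical link resolver address, for illustration only.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

# Illustrative subset of genre values a resolver might recognize (assumed).
KNOWN_GENRES = {"journal", "article", "book", "bookitem", "conference", "proceeding", "preprint"}


def build_openurl(title, genre, date=None):
    """Build a simple key/value OpenURL-style link; nothing stops us from
    putting an unofficial value such as "video" in the genre key."""
    params = {"genre": genre, "title": title}
    if date:
        params["date"] = date
    return RESOLVER_BASE + "?" + urlencode(params)


def resolver_understands(openurl):
    """Simulate a receiving system that only acts on genres it already knows."""
    query = parse_qs(urlsplit(openurl).query)
    genre = query.get("genre", [""])[0]
    return genre in KNOWN_GENRES


video_link = build_openurl("Pride and Prejudice", "video", "1995")
book_link = build_openurl("Pride and Prejudice", "book")
print(video_link, resolver_understands(video_link))
print(book_link, resolver_understands(book_link))
```

Adding `"video"` to `KNOWN_GENRES` is a one-line change in this toy, but in practice it means every generator and every resolver in the ecosystem agreeing on the value and implementing it.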
Linking from the open web to library copies
One of the biggest challenges, always in the background of my work, is how we get people from the “open web” to our library-owned and licensed resources and library-provided services. (Umlaut is engaged in this “space”).
How could linked data play a role in solving this problem? To be sure, if every web page everywhere included schema.org-type information fully specifying the nature of the scholarly works it was displaying, citing, or talking about — that would make it a lot easier to find a way to take this information and transfer the user to our systems to look up availability for the item cited. If every web page exposed well-specified machine-accessible data in a way that wasn’t linked-data-based, that would be fine too. But something like schema.org does look like the best bet here — but it’s not a bet I’d wager anything of significance on.
It would not be necessary to rebuild our infrastructure to be “based on linked data” in order to take advantage of structured information on external web pages, whether or not that structured information is “linked data”. (There are a whole bunch of other non-trivial challenges and barriers, but replacing our ILS/OPAC isn’t really a necessary one, neither is replacing our internal data format.). And we ourselves have limited influence over what “every web page everywhere” does.
Okay, so why are people excited about Linked Data?
If it’s not clear it will solve our problems, why is there so much effort being put into it? I’m not sure, but here’s some things I’ve observed or heard.
Most people, and especially library decision-makers, agree at this point that libraries have to change and adapt, in some major ways. But they don’t really know what this means, how to do it, what directions to go in. Once there’s a critical mass of buzz about “linked data”, it becomes the easy answer: do what everyone else is doing, including prestigious institutions, and if it ends up wrong, at least nobody can blame you for doing what everyone else agreed should be done. “No one ever got fired for buying IBM.”
So linked data has got good marketing and a critical mass, in an environment where decision-makers want to do something but don’t know what to do. And I think that’s huge, but certainly that’s not everything, there are true believers who created that message in the first place, and unlike IBM they aren’t necessarily trying to get your dollars, they really do believe. (Although there are linked data consultants in the library world who make money by convincing you to go all-in on linked data…)
I think we all do know (and I agree) that we need our data and services to inter-operate better — within the library world, and crossing boundaries to the larger IT and internet industry and world. And linked data seems to hold the promise of making that happen, after all those are the goals of linked data. But as I’ve described above, I’m worried it’s a promise long on fantasy and short on specifics. In my experience, the true barriers to this are about good domain modeling, about the human labor to encode data, and about getting people we want to cooperate with us to use the same domain models.
I think those experienced with library metadata realize that good domain modelling (e.g. vocabulary control), and getting different actors to use the same standard formats, is a challenge. I think they believe that linked data will somehow solve this challenge by being “open to extension”; I think this is a false promise, as I’ve tried to argue above. Software and sources need to agree on vocabulary in linked data too, to be able to use each other’s data. Or use the analog of a ‘crosswalk’, which we can already do, and which does not become appreciably easier with linked data. It becomes somewhat easier mechanically to apply a “cross-walk”, but the hard part in my experience is not mechanical application, but the intellectual labor to develop the “cross-walk” rules in the first place and maintain them as vocabularies change.
I think library decision-makers know that we “need our stuff to be in Google”, and have been told “linked data” is the way to do that, without having a clear picture of what “in Google” means. As I’ve said, I think Google’s investment in or commitment to linked data has been exaggerated, but yes, schema.org markup can be used by Google for rich snippets or Knowledge Graph fact boxes. And yes, I actually agree, our library web pages should use schema.org markup to expose their information in machine-readable markup. This will right now have more powerful results for library information web pages (rich snippets) than it will for catalog pages. But the good thing is it’s not that hard to do for catalog bib pages either, and does not require rebuilding our entire infrastructure; our MARC data as it is can fairly easily be “cross-walked” to schema.org, as Dan Scott has usefully shown with VuFind, Evergreen, and Koha. Yes, all our “discovery” web pages should do this. Dan Scott reports that it hasn’t had a huge effect, but says it would if only everybody did it:
We don’t see it happening with libraries running Evergreen, Koha, and VuFind yet, realistically because the open source library systems don’t have enough penetration to make it worth a search engine’s effort to add that to their set of possible sources. However, if we as an industry make a concerted effort to implement this as a standard part of crawlable catalogue or discovery record detail pages, then it wouldn’t surprise me in the least to see such suggestions start to appear.
Maybe. I would not invest in an enormous resource-intensive campaign to rework our entire infrastructure based on what we hope Google (or similar actors) will do if we pull it off right; I wouldn’t count on it. But fortunately, including schema.org markup on our pages doesn’t require that. It can fairly easily be done now with our data in MARC, and should indeed be done now; whatever barriers are keeping us from doing it more with our existing infrastructure, solving them is actually a way easier problem than rebuilding our entire infrastructure.
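For a sense of how small the mechanical part of such a crosswalk can be, here is a minimal Python sketch. It is not Dan Scott’s implementation or any ILS’s actual code: the input is a hypothetical, already-parsed record keyed by MARC tag plus subfield, the sample values (including the ISBN) are illustrative, and the mapping covers only a handful of fields. The schema.org names used (`Book`, `name`, `author`, `isbn`, `datePublished`) are real schema.org terms.

```python
import json

# Hypothetical, already-parsed MARC data keyed by tag + subfield code.
# Sample values are for illustration only.
marc_record = {
    "245a": "Pride and prejudice",
    "100a": "Austen, Jane",
    "020a": "9780141439518",
    "260c": "2003",
}

# Assumed crosswalk rules: MARC tag/subfield -> schema.org property.
CROSSWALK = {
    "245a": "name",
    "020a": "isbn",
    "260c": "datePublished",
}


def marc_to_schema_org(record):
    """Turn the simplified MARC dict into a schema.org Book as a JSON-LD dict."""
    doc = {"@context": "https://schema.org", "@type": "Book"}
    for marc_key, schema_property in CROSSWALK.items():
        if marc_key in record:
            doc[schema_property] = record[marc_key]
    # The main-entry personal name becomes a nested Person.
    if "100a" in record:
        doc["author"] = {"@type": "Person", "name": record["100a"]}
    return doc


book = marc_to_schema_org(marc_record)
jsonld = json.dumps(book, indent=2)
print(jsonld)  # would be embedded in a <script type="application/ld+json"> tag
```

The `CROSSWALK` table is where the real effort goes: deciding which MARC fields map to which properties, and keeping that mapping current, is intellectual labor that no data format removes.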
I think library metadataticians also realize that limited human labor resources to record data are a problem. I think the idea is that with linked data, we can get other people to create our metadata for us, and use it. It’s a nice vision. The barriers are that in fact not “everybody” is using linked data, let alone willing to share it; the existing business model issues that make them reluctant to share their data don’t go away with linked data; they may have no business interest in creating the data we want anyway (or may be hoping “someone else” does it too); and that common or compatible vocabularies are still needed to integrate data in this way. The hard parts are human labor and promulgating shared vocabulary, not the mechanics of combining data.
I think experienced librarians also realize that business model issues are a barrier to integration and sharing of data presently. Perhaps they think that the Linked Open Data campaign will be enough to pressure our vendors, suppliers, partners, and cooperatives to share their data, because they have to be “Linked Open Data” and we’re going to put the pressure on. Maybe they’re right! I hope so.
One linked data advocate told me, okay, maybe linked data is neither necessary nor sufficient to solve our real world problems. But we do have to come up with better and more inter-operable domain models for our data. And as long as we’re doing that, and we have to recreate all this stuff, we might as well do it based on linked data — it’s a good abstract data model, and it’s the one “everyone else is using” (which I don’t agree is happening, but it might be the one others outside the industry end up using — if they end up caring about data interoperability at all — and there are no better candidates, I agree, so okay).
Maybe. But I worry that rather than “might as well use linked data as long as we’re doing that”, linked data becomes a distraction and a resource theft (opportunity cost?) from what we really need to do. We need to figure out what our patrons are up to and how we can serve them; and when it comes to data, we need to figure out what kinds of data we need to do that, and to come up with the domain models that capture what we need, and to get enough people (inside or outside the library world) to use compatible data models, and to get all that data recorded (by whom, and paid for by whom).
Sure that all can be done with linked data, and maybe there are even benefits to doing so. But in the focus on linked data, I worry we end up focusing on how most elegantly to fit our data into “linked data” (which can certainly be an interesting intellectual challenge, a fun game), rather than on how to model it to be useful for the uses we need (and figuring out what those are). I think it’s unjustified to think the rest will take care of itself if it’s just good linked data. The rest is actually the hard part. And I think it’s dangerous to undertake this endeavor as “throw everything else out and start over”, instead of looking for incremental improvements.
The linked data advocate I was talking to also suggested (or maybe it was my own suggestion in conversation, as I tried to look on the bright side): Okay, we know we need to “fix” all sorts of things about our data and inter-operability. We could be doing a lot of that stuff now, without linked data, but we’re not, our vendors aren’t, our consortiums and collaboratives aren’t. Your catalog does not have OCLC numbers in enough of its records, or sync its data to OCLC, even though it theoretically could, and without linked data. It hasn’t been a priority. But the very successful marketing campaign of “linked data” will finally get people to pay attention to this stuff and do what they should have been doing.
Maybe. I hope so. It could definitely happen. But it won’t happen because linked data is a magic bullet, and it won’t happen without lots of hard work that isn’t about the fun intellectual game of creating domain models in linked data.
What should you do?
Okay, so maybe “linked data” is an unstoppable juggernaut in the library world, or at your library. (It certainly is not in the wider IT/web world, despite what some would have you believe). I certainly don’t think this tl;dr essay will change that.
And maybe that will work out for the best after all. I am not fundamentally opposed to semantic web/linked data/RDF. It’s an interesting technology, and although I’m not as in love with it as some, I recognize that it surely should play some part in our research and investigation into metadata evolution, even if we’re not sure how successful it will be in the long term.
Maybe it’ll all work out. But for you reading this who’s somehow made it this far, here’s what I think you can do to maximize those chances:
Be skeptical. Sure, of me too. If this essay gets any attention, I’m sure there will be plenty of arguments provided for how I’m missing the point or confused. Don’t simply accept claims from promoters or haters, even if everyone else seems to be accepting that — claims that “everyone is doing it”, or that linked data will solve all our problems. Work to understand what’s really going on so you can evaluate benefits and potentials yourself, and understand what it would take to get there. To that end…
Educate yourself about the technology of metadata. About linked data, sure. And about entity-relational modeling and other forms of data modeling, about relational databases, about XML, about what “everyone else” is really doing. Learn a little programming too, not to become a programmer, but to understand better how software and computation work, because all of our work in libraries is so intimately connected to that. Educating yourself on these things is the only way to evaluate claims made by various boosters or haters.
Treat the library as an IT organization. I think libraries already are IT organizations (at least academic libraries) — every single service we provide to our users now has a fundamental core IT component, and most of our services are actually mediated by software between us and our users. But libraries aren’t run recognizing them as IT organizations. This would involve staffing and other resource allocation. It would involve having sufficient leadership and decision-makers that are competent to make IT decisions, or know how to get advice from those who are. It’s about how the library thinks of itself, at all levels, and how decisions are made, and who is consulted when making them. That’s what will give our organizations the competence to make decisions like this, not just follow what everyone else seems to be doing.
Stay user centered. “Linked data” can’t be your goal. You are using linked data to accomplish something to add value to your patrons. We must understand what our patrons are doing, and how to intervene to improve their lives. We must figure out what services and systems we need to do that. Some work to that end, even incomplete and undeveloped if still serious and engaged, comes before figuring out what data we need to create those services. To the extent it’s about data, make sure your data modeling work and choices are about creating the data we need to serve our users, not just fitting into the linked data model. Be careful of “dumbing down” your data to fit more easily into a linked data model, but maybe losing what we actually need in the data to provide the services we need to provide.
Yes, include schema.org markup on your web pages and catalog/discovery pages. To expose it to Google, or to anyone. We don’t need to rework our entire infrastructure to do that, it can be done now, as Dan Scott has awesomely shown. As Google or anyone else significant recognizes more or different vocabularies, make use of them too by including them in your web pages, for sure. And, sure, make all your data (in any format, linked data or not) available on the open web, under an open license. If your vendor agreements prevent you from doing that, complain. Ask everyone else with useful data to do so too. Absolutely.
Avoid “Does it support linked data” as an evaluative question. I think that’s just not the right question to be asking when evaluating adoption or purchase of software. To the extent the question has meaning at all (and it’s not always clear what it means), it is dangerous for the library organization if it takes primacy over the specifics of how it will allow us to provide better services or provide services better.
Of course, put identifiers in your data. I don’t care if it’s as a URI or not, but yeah, make sure every record has an OCLC number. Yeah, every bib should record the LCCN or other identifier of its related creators’ authority records, not just a heading. This is “linked data” advice that I support without reservation; it is what our data needs with or without linked data. Put identifiers everywhere. I don’t care if they are in the form of URLs. Get your vendors to do this too. That your vendors want to give you bibs without OCLC numbers in them isn’t acceptable. Make them work with OCLC, make them see it’s in their business interests to do so, because the customers demand it. If you can get the records from OCLC, even if it costs more, it might be worth it. I don’t mean to be an OCLC booster exactly, but shared authority control is what we need (for linked data to live up to its promise or for us to accomplish what we need without linked data), and OCLC is currently where it lives. Make OCLC share its data too, which it’s been doing already (in contrast to ~5 years ago). Keep them going: they should make it as easy and cheap as possible for even “competitors” to put OCLC numbers, VIAF numbers, any identifiers in their data, regardless of whether OCLC thinks it threatens their own business model, because it’s what we need as a community and OCLC is a non-profit cooperative that represents us.
Who should you trust? Trust nobody, heh. But if you want my personal advice, pay attention to Diane Hillmann. Hillmann is one of the people working in and advocating for linked data that I respect the most, who I think has a clear vision of what it will or won’t or only might do, and how to tie work to actual service goals not just theoretical models. Read what Hillmann writes, invite her to speak at your conferences, and if you need a consultant on your own linked data plans I think you could do a lot worse. If Hillmann had increased influence over our communal linked data efforts, I’d be a lot less worried about them.
Require linked data plans to produce iterative incremental value. I think the biggest threat of “linked data” is that it’s implemented as a campaign that won’t bear fruit until some fairly far distant point, and even then only if everything works out, and in ways many decision-makers don’t fully understand but just have a kind of faith in. That’s a very risky way to undertake major resource-intensive changes. Don’t accept an enormous investment whose value will only be shown in the distant future. As we’re “doing linked data”, figure out ways to get improvements that affect our users positively and incrementally, at each stage, iteratively. Plan your steps so each one bears fruit one at a time, not just at the end. (Which, incidentally, is good advice for any technology project, or maybe any project at all.) Because we need to start improving things for our users now to stay alive. And because that’s the only way to evaluate how well it’s going, and even more importantly to adjust course based on what we learn, as we go. And it’s how we get out of assuming linked data will be a magic bullet if only we can do enough of it, and develop the capacity to understand exactly how it can help us, can’t help us, and will help us only if we do certain other things too. When people who have been working on linked data for literally years advocate for it, ask them to show you their successes, and ask for success in terms of actually improving our library services. Whether they don’t have much to show, or they have exciting successes to demonstrate, that’s information to guide you in decision-making, resource allocation, and further question-asking.
Hi there, future text miners. Before we head down the coal chute together, I’ll begin by saying this, and I hope it will reassure you: no matter your level of expertise, your experience in writing code or conducting data analysis, you can find an online tool to help you text mine.
The internet is a wild and beautiful place sometimes.
But before we go there, you may be wondering: what’s this Brave New Workplace business all about? Brave New Workplace is my monthly discussion of tech tools and skill sets which can help you adapt to and get to know a new workplace. In our previous two installments I’ve discussed my own techniques and approaches to learning about your coworkers’ needs and common goals. Today I’m going to talk about text mining the results of your survey, but also text mining generally.
Now three months into my new position, I have found that text mining my survey results was only the first step to developing additional awareness of where I could best apply my expertise to library needs and goals. I went so far as to text mine three years of eresource Help Desk tickets and five years of meeting notes. All of it was fun, helpful, and revealing.
Text mining can assist you in information gathering in a variety of ways, but I tend to think it’s helpful to keep in mind the big three.
1. Seeing the big picture (clustering)
2. Finding answers to very specific questions (question answering)
3. Hypothesis generation (concept linkages)
For the purpose of this post, I will focus on tools for clustering your data set. As with any data project, I encourage you to categorize your inputs and vigorously review and pre-process your data. Exclude documents or texts that do not pertain to the subject of your inquiry. You want your data set to be big and deep, not big and shallow.
I will divide my tool suggestions into two categories: beginner and intermediate. For my beginners just getting started, you will not need to use any programming language, but for intermediate, you will.
Start yourself off easy and use WordClouds.com. This simple site will make you a pretty word cloud, and also provide you with a comprehensive word frequencies list. Those frequencies are concept clusters, and you can begin to see trends and needs in your new coworkers and your workplace goals. This is a pretty cool, and VERY user friendly way to get started text mining.
WordClouds eliminates frequently used words, like articles, and gets you to the meat of your texts. You can copy paste text or upload text files. You can also scan a site URL for text, which is what I’ve elected to do as an example here, examining my library’s home page. The best output of WordClouds is not the word cloud. It’s the easily exportable list of frequently occurring words.
To be honest, I often use this WordClouds function in advance of getting into other data tools. It can be a way to better figure out categories of needs, a great first data mining step which requires almost zero effort. With your frequencies list in hand you can do some immediate (and perhaps more useful) data visualization in a simple tool of your choice, for instance Excel.
Depending on your preferred programming language, many options are available to you. While I have traditionally worked in SPSS for data analysis, I have recently been working in R. The good news about R versus SPSS: R is free and there’s a ton of community collaboration. If you have a question (I often do) it’s easy to find an answer.
Getting started in R with text mining is simple. You’ll need to install the packages necessary if you are text mining for the first time.
Then save your text files in a folder titled: “texts,” and load those in R. Once in, you’ll need to pre-process your text to remove common words and punctuation. This guide is excellent in taking you through the steps to process your data and analyze it.
Just like our WordClouds, you can use R to discover term frequencies and visualize them. Beyond this, working in R or SPSS or Python can allow you to cluster terms further. You can find relationships between words and examine those relationships within a dendrogram or by k-means. These will allow you to see the relationships between clusters of terms.
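If you would like to see the term-frequency step spelled out, here is a minimal sketch. It uses Python’s standard library rather than the R packages discussed above, purely as an illustration; the tiny stopword list and the sample help desk tickets are made up, and a real project would use a fuller stopword list and your own texts.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "my"}


def term_frequencies(texts):
    """Lowercase each text, split it into words, drop stopwords, count the rest."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts


# Hypothetical help desk tickets standing in for your own data set.
sample_tickets = [
    "Proxy login fails for the ebook platform",
    "Ebook download link is broken",
    "Login to the proxy times out",
]

freqs = term_frequencies(sample_tickets)
top_terms = freqs.most_common(3)
print(top_terms)
```

Even on three short tickets, the most frequent terms start to hint at where the trouble is, which is the same “big picture” view the word frequencies list gives you.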
Ultimately, the more you text mine, the more familiar you will become with the tools and analysis valuable in approaching a specific text dataset. Get out there and text mine, kids. It’s a great way to acculturate to a new workplace or just learn more about what’s happening in your library.
Now that we’ve text mined the results of our survey, it’s time to move onto building a Customer Relationship Management system (CRM) for keeping our collaborators and projects straight. Come back for Brave New Workplace: Your Homegrown CRM on December 21st.
For the last few months, we’ve been asking on the Mozilla add-ons mailing list that Zotero be whitelisted for extension signing. If you haven’t been following that discussion, 1) lucky you, and 2) you can read my initial post about it, which gives some context. The upshot is that, if changes aren’t made to the signing process, we’ll have no choice but to discontinue Zotero for Firefox when Firefox 43 comes out, because, due to Zotero’s size and complexity, we’ll be stuck in manual review forever and unable to release timely updates to our users, who rely on Zotero for time-sensitive work and trust us to fix issues quickly. (Zotero users could continue to use our standalone app and one of our lightweight browser extensions, but many people prefer the tighter browser integration that the original Firefox version provides.)
Mozilla should give Zotero the special treatment it deserves. It’s a very important tool, and a crucial part of ongoing research all over the world. Mozilla needs to support it.
This morning I received an email asking me to peer review a book proposal for Chandos Publishing, the Library and Information Studies imprint of Elsevier. Initially I thought it was spam because of some sloppy punctuation and the “Dr. Robertson” salutation.
When other people pointed out that this likely wasn’t spam my ego was flattered for a few minutes and I considered it. I was momentarily confused–would participating in Elsevier’s book publishing process be evil? Isn’t it different from their predatory pricing models with libraries and roadblocks to sharing research more broadly? I have a lot to learn about scholarly publishing, but decided that I’m not going to contribute my labour to a company that are jerks to librarians, researchers and libraries.
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
Winchester, MA For a quick round-up of current news and information about events and achievements happening around the digital preservation and access ecosystem visit DuraSpace Today: http://duraspace.org/duraspace-today. Follow DuraSpace on Twitter by clicking the link at the top of the page.
Bologna, Italy In the last two months Cineca attended two very important IT events focused on support for Higher Education. At both events the Italian Consortium presented its research ecosystem related activities, focusing on DSpace and DSpace-CRIS.
Petrópolis, Rio de Janeiro, Brazil Provider IT Neki Technologies, the Brazilian Duraspace Registered Service Provider, has undergone a major change during the past few months and is now Neki IT. Located in Petrópolis, Rio de Janeiro, Neki IT has left the Provider Group and is again running its own structure.
A couple of weeks ago, NYPL Labs was very excited to release Emigrant City, the Library's latest effort to unlock the data found within our collections.
But Emigrant City is a bit different from the other projects we’ve released in one very important way: this one is built on top of a totally new framework called Scribe, built in collaboration with Zooniverse and funded by a grant from the NEH Office of Digital Humanities along with funds from the University of Minnesota. Scribe is the codebase working behind the scenes to support this project.
What is Scribe?
Scribe is a highly configurable, open-source framework for setting up community transcription projects around handwritten or OCR-resistant texts. Scribe provides the foundation of code for a developer to configure and launch a project far more easily than if starting from scratch.
NYPL Labs R&D has built a few community transcription apps over the years. In general, these applications are custom built to suit the material. But Scribe prototypes a way to describe the essential work happening at the center of those projects. With Scribe, we propose a rough grammar for describing materials, workflows, tasks, and consensus. It’s not our last word on the topic, but we think it’s a fine first pass proposal for supporting the fundamental work shared by many community transcription projects.
So, what’s happening in all of these projects?
Our previous community transcription projects run the gamut from requesting very simple, nearly binary input like “Is this a valid polygon?” (as in the case of Building Inspector) to more complex prompts like “Identify every production staff member in this multi-page playbill” (as in the case of Ensemble). Common tasks include:
Identify a point/region in a digitized document or image
Answer a question about all or part of an image
Flag an image as invalid (meaning it’s blank or does not include any pertinent information)
Flag others’ contributions as valid/invalid
Flag a page or group of pages as “done”
There are many more project-specific concerns, but we think the features above form the core work. How does Scribe approach the problem?
Scribe reduces the problem space to “subjects” and “classifications.” In Scribe, everything is either a subject or a classification: Subjects are the things to be acted upon, classifications are created when you act. Creating a classification has the potential to generate a new subject, which in turn can be classified, which in turn may generate a subject, and so on.
This simplification allows us to reduce complex document transcription to a series of smaller decisions that can be tackled individually. We think breaking the work into smaller, more atomic tasks makes projects less daunting for volunteers to begin and easier to continue. This simplification doesn’t come at the expense of quality, however, as projects can be configured to require multiple rounds of review.
The final subjects produced by this chain of workflows represent the work of several people carrying an initial identification all the way through to final vetted data. The journey comprises a chain of subjects linked by classifications connected by project-specific rules governing exposure and consensus. Every region annotated is eventually either deleted by consensus or further annotated with data entered by several hands and, potentially, approved by several reviewers. The final subjects that emerge represent singular assertions about the data contained in a document validated by between three and 25 people.
In the case of Emigrant City specifically, individual bond records are represented as subjects. When participants mark those records up, they produce “mark” subjects, which appear in Transcribe. In the Transcribe workflow, other contributors transcribe the text they see, and their transcriptions are combined with others’ as “transcribe” subjects. If there’s any disagreement among the transcriptions, those transcribe subjects appear in Verify, where additional classifications are added by other contributors as votes for one or another transcription. But this is just the configuration that made sense for Emigrant City. Scribe lays the groundwork to support other configurations.
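The subject/classification chain described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Scribe's actual code or data model: the class names, field names, the `classify` and `consensus` helpers, the sample surnames, and the two-vote threshold are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Classification:
    """The record of one contributor acting on one subject."""
    workflow: str      # e.g. "mark", "transcribe", "verify"
    contributor: str
    value: dict


@dataclass
class Subject:
    """Something to be acted upon: a record, a marked region, a transcription."""
    kind: str
    data: dict
    parent: Optional["Subject"] = None
    classifications: list = field(default_factory=list)


def classify(subject, workflow, contributor, value, child_kind=None):
    """Record a classification; optionally generate a new subject from it."""
    subject.classifications.append(Classification(workflow, contributor, value))
    if child_kind is None:
        return None
    return Subject(kind=child_kind, data=value, parent=subject)


def consensus(subject, workflow, key, threshold=2):
    """Return the most common value if enough contributors agree, else None."""
    values = [c.value[key] for c in subject.classifications if c.workflow == workflow]
    if not values:
        return None
    winner, count = Counter(values).most_common(1)[0]
    return winner if count >= threshold else None


# A bond record is the initial subject.
record = Subject(kind="bond_record", data={"image": "record-0001.jpg"})

# Marking a region classifies the record and generates a "mark" subject.
mark = classify(record, "mark", "ann", {"field": "surname"}, child_kind="mark")

# Several contributors transcribe the marked region.
for who, text in [("bob", "Schmidt"), ("cho", "Schmidt"), ("dee", "Schmitt")]:
    classify(mark, "transcribe", who, {"text": text})

# The transcriptions are combined into a "transcribe" subject.
candidates = sorted({c.value["text"] for c in mark.classifications
                     if c.workflow == "transcribe"})
transcribe = Subject(kind="transcribe", data={"candidates": candidates}, parent=mark)

# Disagreement sends the subject to Verify, where classifications act as votes.
final = candidates[0]
if len(candidates) > 1:
    for who, choice in [("eve", "Schmidt"), ("fay", "Schmidt"), ("gus", "Schmitt")]:
        classify(transcribe, "verify", who, {"text": choice})
    final = consensus(transcribe, "verify", "text", threshold=2)

print(final)
```

Each step only ever appends a classification to a subject and, where the hypothetical rules call for it, produces a child subject, which is what lets a long transcription job be split into small independent decisions.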
Is it working?
I sure hope so! In any case, the classifications are mounting for Emigrant City. At writing we’ve gathered 227,638 classifications comprising marks, transcriptions, and verifications from nearly 3,000 contributors. That’s about 76 classifications each, on average, which is certainly encouraging as we assess the stickiness of the interface.
We’ve had to adjust a few things here and there. Bugs have surfaced that weren’t apparent before testing at scale. Most issues have been patched and data seems to be flowing in the right directions from one workflow to the next. We’ve already collected complete, verified data for several documents.
Reviewing each of these documents, I’ve been heartened by the willingness of a dozen strangers spread between the US, Europe, and Australia to meditate on some scribbles in a 120 year old mortgage record. I see them plugging away when I’m up at 2 a.m. looking for a safe time to deploy fixes.
As touched on above, Scribe is primarily a prototype of a grammar for describing community transcription projects in general. The concepts underlying Scribe formed over a several-month collaboration between remote teams. We built the things we needed as we needed them. The codebase is thus a little confusing in areas, reflecting several mental right turns when we found the way forward required an additional configuration item or chain of communication. So one thing I’d like to tackle is reining in some of the areas that have drifted from the initial elegance of the model. The notion that subjects and workflows could be rearranged and chained in any configuration has been a driving idea, but in practice the system obliges only a few arrangements.
An increasingly pressing desire, however, is developing an interface to explore and vet the data assembled by the system. We spent a lot of time developing the parts that gather data, but perhaps not enough on interfaces to analyze it. Because we’ve reduced document transcription into several disconnected tasks, the process to reassemble the resultant data into a single cohesive whole is complicated. That complexity requires a sophisticated interface to understand how we arrived at a document’s final set of assertions from the chain of contributions that produced it. Luckily we now have a lot of contributions around which to build that interface.
Most importantly, the code is now out in the wild, along with live projects that rely on it. We’re already grateful for the tens of thousands of contributions people have made on the transcription and verification front, and we’d likewise be immensely grateful for any thoughts or comments on the framework itself. Let us know in the comments, or directly via GitHub, and thanks for helping us get this far.
Also, check out the other community transcription efforts built on Scribe. Measuring the Anzacs collects first-hand accounts from New Zealanders in WWI. Coming soon, “Old Weather: Whaling” gathers Arctic ships’ logs from the late 19th and early 20th centuries.
In his opening remarks at the November 17 Re:Create conference, Public Knowledge President & CEO Gene Kimmelman shared his thoughts about fair use as a platform for today’s creative revolution, and about it being a key to the importance of how knowledge is shared in today’s society. That set the tone for a dynamic discussion of copyright policy and law that followed, the cohesive focus behind the Re:Create coalition.
Panel Moderator Mike Masnick, founder, Techdirt and CEO, Copia Institute, poses a question to panelists (L. to R.): Casey Rae, CEO, Future of Music; Julie Samuels, executive director, Engine; Howard University Law Professor Lateef Mtima, founder and director, Institute for Intellectual Property and Social Justice; and Greta Peisch, international trade counsel, Senate Finance Committee.
“Yes, it’s important for creators to have a level of protection for their work,” Eli Lehrer, president of the R Street Institute, said, “but that doesn’t mean government should have free rein. The founding fathers wanted copyright to be limited but they also wanted it to support the growth of science and the arts.” He went on to decry how copyright has been “taken over by special interests and crony capitalism. We need a vibrant public domain to support true creation,” he said, “and our outdated copyright law is stifling the advancement of knowledge and new creators in the digital economy.”
Three panels of experts brought together by the Re:Create Coalition then proceeded to critique pretty much every angle of copyright law and the role of the copyright office. They also discussed the potential for modernization of the U.S. Copyright Office, whether the office should stay within the Library of Congress or move, and the prospects for reform of the copyright law. The November 17 morning program was graciously hosted by Washington, D.C.’s Martin Luther King, Jr. Memorial Library.
Cory Doctorow, author and advisor to the Electronic Frontier Foundation, believes audiences should have the opportunity to interact with artists/creators. He pointed to Star Wars as an example. Because fans and audiences have interacted and carried the theme and impact forward, Star Wars continues to be a big cultural phenomenon, despite long pauses between new parts in the series. As Michael Weinberg, general counsel and intellectual property (IP) expert at Shapeways, noted, there are certain financial benefits in “losing control,” i.e. the value of the brand is being augmented by audience interaction, thus adding value to the product. Doctorow added that we’ve allowed copyright law to become entertainment copyright law, thus “fans get marginalized by the heavyweight producers.”
On the future of the copyright office panel, moderator Michael Petricone, senior vice president, government affairs, Consumer Technology Association (CTA), said we need a quick and efficient copyright system, and “instead of fighting over how to slice up more pieces of the pie, let’s focus on how to make the pie bigger.”
Jonathan Band, counsel to the Library Copyright Alliance (LCA), said the Copyright Office used to just manage the registration process, but then 1) volume multiplied, 2) some people registered while others, who are not necessarily using the system, didn’t, and 3) the office didn’t have the resources to keep up with the huge volume of things being created. This “perfect storm,” he said, is not going to improve without important changes, such as modernizing its outdated and cumbersome record-keeping, but the Office also needs additional resources to address the “enormous black hole of rights.” Laura Moy, senior policy counsel at the Open Technology Institute (OTI), agreed that this is a big problem, because many new creators don’t have the resources or the legal counsel to help them pursue copyright searches and registration.
All the panelists were in agreement that it makes no sense to move the Copyright Office out of the Library of Congress, as has been proposed by a few. Matt Schruers, vice president for law & policy, Computer & Communications Industry Association (CCIA), agreed, urging more robust record-keeping, incentives to get people to register, and steps to mitigate costs. He said “we need to look at what the problems are, and fix them where they are. A lot of modernization can be done in the Office where it is, instead of all the cost of moving it and setting it up elsewhere.” Band strongly agreed. “Moving it elsewhere wouldn’t solve the issue/cost of taking everything digital. Moving the Office just doesn’t make sense.” Moy suggested there are also some new skills and expertise that are needed, such as someone with knowledge in IP and its impacts on social justice.
Later in the program, panelists further batted around the topic of fair use. For Casey Rae, CEO of the Future of Music Coalition, fair use is often a grey area of copyright law because it depends on how fair use is interpreted by the courts. In the case of remixes, for example, the court, after a lengthy battle, ruled in favor of 2 Live Crew’s remix of Pretty Woman, establishing that commercial parodies can qualify for fair use status. Lateef Mtima, professor of law, Howard University, and founder and director of the Institute for Intellectual Property and Social Justice, cited the Lenz v. Universal case that not only ruled in favor of the mom who had posted a video on YouTube of her baby dancing to Prince’s Let’s Go Crazy, but established that fair use is a right, warning those who consider issuing a takedown notice to “ignore it at your own peril.”
When determining fair use, Greta Peisch, international trade counsel, Senate Finance Committee, said “Who do you trust more to best interpret what is in the best interests of society, the courts or Congress?” The audience response clearly placed greater confidence in the courts on that question. And Engine Executive Director Julie Samuels concluded that “fair use is the most important piece of copyright law—absolutely crucial.”
In discussing the future for copyright reform, Rae said there’s actually very little data on how revising the laws will impact the creative community and their revenue streams. He said legislation can easily be created based on assumptions without the data to back it up, so he urged more research. But he also implied that the music industry (sound recording and music studios) needs to do a better job of explaining its narrative, i.e. go to policymakers with data in hand and real-life stories to share.
Mtima is optimistic that society is making progress in better understanding how the digital age has opened up the world for sharing knowledge and expanding literacy (what he called society’s Gutenberg moment). At first, he says, there was resistance to change. But as content users have made more and more demands for access to content, big content providers are recognizing the need to move away from the old model of controlling and “monetizing copies.” New models are developing and there’s recognition that opening access is often expanding the value of the brand.
Re:Create’s ability to focus on such an important area of public policy as copyright is the reason the coalition has attracted a broad and varied membership. It remains an important forum for discourse among creators, advocates, thinkers and consumers.
This was originally posted on The Pastry Box on October 1. Unfortunately, for some reason it was not tweeted about, so I didn’t see that it had been published. Anyway, here it is re-published. Enjoy. People frequently ask me whether I like or enjoy the work that I do. In theory, I’m helping 10% of … Continue reading Better than Christmas Morning: Finding Your Motivation
What better time than Fall for a new craft beer recipe? This one, in particular, has a unique origin story—and it starts with Founding Father and first US President George Washington.
The recipe was found written in a notebook that Washington kept during the French and Indian War, digitized and available through The New York Public Library. The notebook entries, which begin in June 1757, put a 25-year-old Washington at Fort Loudoun in Winchester, Virginia, where he served as a colonel in the Virginia Regiment militia. Washington’s experience in the militia, where he served as an ambassador, led expeditions, and defended Virginia against French and Indian attacks, gave him a military and political savvy that helped shape his leadership of the Continental Army during the Revolutionary War.
The notebook gives a unique view into Washington’s time in the military on a day-to-day basis. Its entries include his notes for “Sundry things to be done in Williamsburg,” and lists of supplies (the pages marked with cross-hatched x’s, once the items were done). Washington outlines memos and letters, including to the Governor of Virginia and the Speaker of the House of Burgesses. He describes his horses, too (Nelly, Jolly, Ball, Jack, Rock, Woodfin, Prince, Buck, Diamond, and Crab), with illustrations of their brand marks.
Among these notes, on the final page of the book, is Washington’s recipe for “small beer.” This type of beer is thought to have low alcohol content and low quality, and is believed to have been regularly given to soldiers in the British Army. While other, higher-quality alcohol was for the rich, who could afford the luxury, small beer was typically for paid servants. Other alcohol rations, like rum and later whiskey, were given to both slaves and employees at Mount Vernon on a weekly basis.
The small beer recipe, transcribed below, makes provisions for the types of conditions Washington or others may have needed for wartime preparation, outside of a more stable brewery. The directions require little time or ingredients, and include additional steps to take depending on the weather.
Take a large Sifter full of Bran Hops to your Taste — Boil these 3 hours. Then strain out 30 Gall. into a Cooler put in 3 Gallons Molasses while the Beer is scalding hot or rather drain the molasses into the Cooler. Strain the Beer on it while boiling hot let this stand til it is little more than Blood warm. Then put in a quart of Yeast if the weather is very cold cover it over with a Blanket. Let it work in the Cooler 24 hours then put it into the Cask. leave the Bung open til it is almost done working — Bottle it that day Week it was Brewed.
For Washington, beer was considered a favorite drink (though he enjoyed a higher quality than that described in his notebook). It was typically on the menu for dinners at Mount Vernon, and a bottle of beer was given to servants daily. Washington even brewed his own beer on the estate, at what Mount Vernon historians believe to be sizeable rates.
In 1797, he started a whiskey distillery, too, making use of the plantation’s grain, which produced up to 12,000 gallons a year. While his distillery was a successful business venture for Washington, he himself wasn’t a fan of whiskey, and preferred his customary mug of beer each night at dinner.
Washington’s notebook was digitized as part of The New York Public Library’s Early American Manuscripts Project, which is looking to digitize 50,000 pages of material. These unique documents give a new perspective on life in the colonies and during the Revolutionary War, on a large and small scale. Besides the digitized papers of Founding Fathers (like Washington, Thomas Jefferson, Alexander Hamilton and James Madison), there are collections of diaries, business papers, and other fascinating colonial material.
As I typed the title for this post, I couldn’t help but think “Well, yeah. What else would the library be?” Instead of changing the title, however, I want to actually unpack what we mean when we say “research partner,” especially in the context of research data management support. In the most traditional sense, libraries provide materials and space that support the research endeavor, whether it be in the physical form (books, special collections materials, study carrels) or the virtual (digital collections, online exhibits, electronic resources). Moreover, librarians are frequently involved in aiding researchers as they navigate those spaces and materials. This aid is often at the information seeking stage, when researchers have difficulty tracking down references, or need expert help formulating search strategies. Libraries and librarians have less often been involved at the most upstream point in the research process: the start of the experimental design or research question. As one considers the role of the Library in the scholarly life-cycle, one should consider the ways in which the Library can be a partner with other stakeholders in that life-cycle. With respect to research data management, what is the appropriate role for the Library?
In order to achieve effective research data management (RDM), planning for the life-cycle of the data should occur before any data are actually collected. In circumstances where there is a grant application requirement that triggers a call to the Library for data management plan (DMP) assistance, this may be possible. But why are researchers calling the Library? Ostensibly, it is because the Library has marketed itself (read: its people) as an expert in the domain of data management. It has most likely done this in coordination with the Research Office on campus. Even more likely, it did this because no one else was. It may have done this as a response to the National Science Foundation (NSF) DMP requirement in 2011, or it may have just started doing this because of perceived need on campus, or because it seems like the thing to do (which can lead to poorly executed hiring practices). But unlike monographic collecting or electronic resource acquisition, comprehensive RDM requires much more coordination with partners outside the Library.
Steven Van Tuyl has written about the common coordination model of the Library, the Research Office, and Central Computing with respect to RDM services. The Research Office has expertise in compliance and Central Computing can provide technical infrastructure, but he posits that there could be more effective partners in the RDM game than the Library. That perhaps the Library is only there because no one else was stepping up when DMP mandates came down. Perhaps enough time has passed, and RDM and data services have evolved enough that the Library doesn’t have to fill that void any longer. Perhaps the Library is actually the *wrong* partner in the model. If we acknowledge that communities of practice drive change, and intentional RDM is a change for many of the researchers, then wouldn’t ceding this work to the communities of practice be the most effective way to stimulate long lasting change? The Library has planted some starter seeds within departments and now the departments could go forth and carry the practice forward, right?
Well, yes. That would be ideal for many aspects of RDM. I personally would very much like to see the intentional planning for, and management of, research data more seamlessly integrated into standard experimental methodology. But I don’t think that by accomplishing that, the Library should be removed as a research partner in the data services model. I say this for two reasons:
1. The data/information landscape is still changing. In addition to the fact that more funders are requiring DMPs, more research can benefit from using openly available (and well described – please make it understandable) data. While researchers are experts in their domain, the Library is still the expert in the information game. At its simplest, data sources are another information source. The Library has always been there to help researchers find sources; this is another facet of that aid. More holistically, the Library is increasingly positioning itself to be an advocate for effective scholarly communication at all points of the scholarship life-cycle. This is a logical move as the products of scholarship take on more diverse and “nontraditional” forms.
Some may propose that librarians who have cultivated RDM expertise can still provide data seeking services, but perhaps they should not reside in the Library. Would it not be better to have them collocated with the researchers in the college or department? Truly embedded in the local environment? I think this is a very interesting model that I have heard some large institutions may want to explore more fully. But I think my second point is a reason to explore this option with some caution:
2. Preservation and access. Libraries are the experts in the preservation and access of materials. Central Computing is a critical institutional partner in terms of infrastructure and determining institutional needs for storage, porting, computing power, and bandwidth but – in my experience – is happy to let the long-term preservation and access service fall to another entity. Libraries (and archives) have been leading the development of digital preservation best practices for some time now, with keen attention to complex objects. While not all institutions can provide repository services for research data, the Library perspective and expertise are important to have at the table. Moreover, because the Library is a discipline-agnostic entity, librarians may be able to more easily imagine diverse interest in research data than the data producer. This can increase the potential vehicles for data sharing, depending on the discipline.
Yes, RDM and data services are reaching a place of maturity in academic institutions where many Libraries are evaluating, or re-evaluating, their role as a research partner. While many researchers and departments may be taking a more proactive or interested position with RDM, it is not appropriate for Libraries to be removed from the coordinated work that is required. Libraries should assert their expertise, while recognizing the expertise of other partners, in order to determine effective outreach strategies and resource needs. Above all, Libraries must set scope for this work. Do not be deterred by the increased interest from other campus entities to join in this work. Rather, embrace that interest and determine how we all can support and strengthen the partnerships that facilitate the innovative and exciting research and scholarship at an institution.
Back in September, the Islandora community completed its first volunteer sprint, a maintenance sprint on Islandora 7.x-1.x that cleaned up 38 tickets in advance of the 7.x-1.6 release (November 4th). For our second sprint (and likely for all sprints in the future), we moved over to Islandora's future and did our work on Islandora 7.x-2.x (also known as CLAW). With CLAW being very new to the vast majority of the community, we put the focus on knowledge sharing and exploring the new stack, with a lot of user documentation and discussion tickets for new folks to dig their teeth into. A whopping 17 sprinters signed up:
We got started on Monday, November 2nd with a live demo of CLAW provided by Nick Ruest and Danny Lamb, which has been recorded for anyone who'd like to take it in on their own time:
And then a virtual meeting to discuss how we'd approach the sprint and hand out tickets. As with our previous sprint, we mixed collaboration with solo work, coordinating via the #islandora IRC channel on freenode and with discussion on GitHub issues and pull requests. We stayed out of JIRA this time, doing all of our tracking with issues right in the CLAW GitHub Repo. In the end, we closed up nine tickets, and there should be extra kudos for Donald Moses from UPEI, who was the sprint MVP with his wonderful user documentation:
Nick Ruest put together a pretty awesome visualization of work on CLAW so far, where you can see the big burst of activity in November from our sprint:
You should also note that even in the early days of the project, activity on the code is usually followed up by activity on the documentation - that's a deliberate approach to make documenting CLAW an integral part of developing CLAW, so that when you are ready to adopt it, you'll find a rich library of technical, installation, and user documentation to support you.
With the success of the first two sprints, we are going to start going monthly. The next sprint will be December 7th - 18th and we are looking for volunteers to sign up. This sprint will have a more technical focus, concentrating on improvements in a single area of the stack: PHP Services. We're especially looking for some developers who'd like to contribute to help us reach our goals. That said, there is always a need for testers and reviewers, so don't be afraid to sign up even if you are not a developer.
PHP Services description: Have the majority of RESTful services pulled out of the CMS context, and exposed so that Drupal hooks or Event system can interact with them. We've already implemented two (images and collections) in Java, and we'd like to start by porting those over. These services will handle operations on PCDM objects and object types. There are lots of different ways to do this (Silex, Slim, Phalcon, Symfony, etc...), but the core idea is maintaining these as a separate layer.
Chullo will be the heart of the micro services. If everything is written properly the code reuse will allow for individual services to be a thin layer to expose the Chullo code in a particular context.
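The "thin layer over shared code" idea can be illustrated with a small sketch. It is written in Python purely for illustration, whereas the sprint targets PHP. Nothing here is Chullo's or CLAW's real API: `CoreRepository`, `make_service`, the identifier format, and the status codes returned are all hypothetical stand-ins chosen for the example.

```python
class CoreRepository:
    """Hypothetical stand-in for the shared library that does the real work."""

    def __init__(self):
        self._objects = {}

    def create(self, object_type, properties):
        # Store the object and hand back a made-up identifier.
        identifier = f"{object_type}/{len(self._objects) + 1}"
        self._objects[identifier] = {"type": object_type, **properties}
        return identifier

    def read(self, identifier):
        return self._objects.get(identifier)


def make_service(core, object_type):
    """Build a thin handler that exposes the core code for one object type."""

    def handle(method, identifier=None, body=None):
        if method == "POST":
            return 201, {"id": core.create(object_type, body or {})}
        if method == "GET":
            found = core.read(identifier)
            if found is None:
                return 404, {"error": "not found"}
            return 200, found
        return 405, {"error": "method not allowed"}

    return handle


# Two services share one core; each adds only request/response handling.
core = CoreRepository()
image_service = make_service(core, "image")
collection_service = make_service(core, "collection")

status, created = collection_service("POST", body={"title": "Maps"})
print(status, created)
print(collection_service("GET", identifier=created["id"]))
```

Because each service contains no storage logic of its own, adding a service for another object type is one more call to `make_service`, which is the kind of code reuse the description above is aiming for.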
Winchester, MA The VIVO Project has launched a new website (http://www.vivoweb.org) focused on telling the VIVO story, and simplifying access to all forms of information regarding VIVO.
Short videos tell the VIVO story—how VIVO is connecting data to provide an integrated view of the scholarly work of an organization, how VIVO uses open standards to share data, and how VIVO is used to discover patterns of collaboration and work within and between organizations.
This week we focused on issues of privacy with Jessica Vitak after reading Palen & Dourish (2003), Vitak & Kim (2014) and Smith, Dinev, & Xu (2011). Of course privacy is a huge topic to tackle in a couple hours. But even this brief introduction was useful, and really made me start to think about how important theoretical frameworks are for the work I would like to do around appraisal in web archives.
Notions of privacy predate our networked world, but they are clearly bound up in, and being transformed by, the information technologies being deployed today. We spent a bit of time talking about Danah Boyd’s idea of context collapse, which social media technologies often engender or (perhaps) afford. Jessica used the wedding as a prototypical example of context collapse happening in a non-networked environment: extended family and close friends from both sides (bride and groom) are thrown into the same space to interact.
I’m not entirely clear on whether it’s possible to think of a technology affording privacy. Is privacy an affordance? I got a bit wrapped around the axle about answering this question because the notion of affordances has been so central to our seminar discussions this semester. I think that’s partly because of the strength of the Human Computer Interaction Lab here at UMD. Back in week 4 we defined affordances as a particular relationship between an object and a human (really any organism) that allows the human to perform some action. Privacy feels like it is more of a relation between humans and other humans, but perhaps that’s largely the result of it being an old concept that is being projected into a networked world of the web, social media and big data. Computational devices certainly have roles to play in our privacy, and if we look closer perhaps they always have.
Consider this door with a lock. Let’s say it was the door to your bedroom in a house you are renting with some friends. Imagine you want to get some peace and quiet to read a book. You can go into your room and close this door. The door affords some measure of privacy. But if you are getting changed and want to prevent someone from accidentally coming into your room you can choose to lock the door. The lock affords another measure of privacy. This doesn’t seem too much of a stretch to me. When I asked in class about whether privacy was an affordance I got the feeling that I was barking up the wrong tree. So I guess there’s more for me to unpack here.
One point I liked in the extensive literature review that Smith et al. (2011) provide was the distinction between normative and descriptive privacy research. Normative privacy research focuses on ethical commitments or ought statements about the way things should be, whereas descriptive privacy research focuses on what is, and can be further broken down into purely descriptive or empirically descriptive research. I think the purely descriptive line of research interests me the most, because privacy itself seems like an extremely complex topic that isn’t amenable to the way things should be, or positivist thinking. The authors basically admit this themselves early in the paper:
General privacy as a philosophical, psychological, sociological, and legal concept has been researched for more than 100 years in almost all spheres of the social sciences. And yet, it is widely recognized that, as a concept, privacy “is in disarray [and n]obody can articulate what it means” (Solove (2006), p. 477).
Privacy has so many facets, and is so contingent on social and cultural dynamics, that I can’t help but wonder how useful it is to think about it in abstract terms. But privacy is such an important aspect of the work I’m sketching out around social media and Web archives that it is essential that I spend significant time following some of these lines of research backwards and forwards. In particular, I want to follow up on the work of Irwin Altman and Sandra Petronio, who helped shape communication privacy management theory, as well as Helen Nissenbaum, who has done work bridging this theory into online spaces (Nissenbaum, 2009; Brunton & Nissenbaum, 2015). I’ve also had MacNeil (1992) on my to-read list for a while since it specifically addresses privacy in the archive.
Maybe there’s an independent study in my future centered on privacy?
Brunton, F., & Nissenbaum, H. (2015). Obfuscation: A user’s guide for privacy and protest. MIT Press.
MacNeil, H. (1992). Without consent: The ethics of disclosing personal information in public archives. Scarecrow Press.
Nissenbaum, H. (2009). Privacy in context: Technology, policy, and the integrity of social life. Stanford University Press.
Palen, L., & Dourish, P. (2003). Unpacking privacy for a networked world. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 129–136). Association for Computing Machinery.
Smith, H. J., Dinev, T., & Xu, H. (2011). Information privacy research: An interdisciplinary review. MIS Quarterly, 35(4), 989–1016.
Solove, D. J. (2006). A taxonomy of privacy. University of Pennsylvania Law Review, 477–564.
Vitak, J., & Kim, J. (2014). You can’t block people offline: Examining how Facebook’s affordances shape the disclosure process. In Proceedings of the 17th ACM conference on computer supported cooperative work & social computing (pp. 461–474). Association for Computing Machinery.