Introduction
The role of libraries and librarians continues to evolve. The days of polished catalogue drawers are behind us. The information sector is transitioning through a period of change as the role of the library adapts to meet new service requirements. Linked data has fundamental implications for the future role of libraries in connecting users to the resources that match their study and research needs.
Richard Wallis from OCLC discussed topics surrounding the power of linked data and what the web wants at the NASIG 2014 conference:
Image licensed for reuse on google images
http://upload.wikimedia.org/wikipedia/commons/8/89/Linking-Open-Data-diagram_2007-09.png
The Power of shared data:
Changes and challenges
Richard began by outlining some of the current challenges facing librarians today.
The changing format of our resources and the evolving needs of our users are gradually shifting priorities towards ubiquitous online access to material, rather than just managing access to a physical collection. Our users now exist in both the physical and online space of the Library - socially and virtually. Users interact through technology away from the physical location of the library on devices personalised to their requirements. Library budgets often target electronic resources for customers who demand instant access to material. The idea of collection management is changing as access moves to online portals.
The rapid pace of technological change continues to be a big challenge facing our sector. Our users, perceptions of collections, research outputs and other factors continue to adapt to the current infrastructure and the new, emerging scholarly landscape. Universities are becoming more involved in producing and disseminating materials. This is prompting changes in the behaviour of our users.
It’s widely accepted that the user is now everywhere, using their devices as a window to the world. It’s clear that libraries must inhabit this space in such a way that our collections and services are visible to the virtual community, thereby facilitating the needs of the user.
In this new landscape our users search for knowledge in places that have served them well in the past: places that are readily accessible and available. Whilst this may drive a librarian or professor to the point of despair, the fact is services like Facebook, Google and Wikipedia remain central starting points for many of our users.
People don’t start in the library catalogue at the beginning of their research process. The risk in this new environment is that the connection between the library and the user doesn’t happen. The library may never connect its users with the resources vital to study and research. The channels of communication between the library and the user may be completely circumvented because the library is not signposted on the virtual roads of discovery our users are walking.
Another aspect of the mission of libraries is to select, describe and preserve our material for public access. In support of this, many collections and archives are now accessible through the network. To some extent, our ability to present this information to users and the wider world has been taken out of the hands of librarians because our traditional methods of exposure are incompatible with the requirements of a web of linked data.
This global issue of discovery crosses many industries and organisations. Information professionals need to ensure their collections are properly exposed on the web, in the way the web wants. There needs to be a move away from record management to the management of entities that can be recognised and consumed by services on the web.
Libraries and Discovery
From the ancient cataloguing systems (providing access to scrolls) to ‘cutting edge’ systems of card cataloguing using indexes (to discover authors, titles, subjects etc.), the library has been a central means of providing access to information. The pre-printed catalogue card really was groundbreaking technology in its day. It made records sortable, expandable and manageable. The functions and processes of library catalogues and metadata administration were developed in the context of traditional systems of managing physically crafted documents.
MARC was born as the ‘machine readable catalogue card’ as a means of sharing records. In 1994 the first web based OPAC was born, combining the formats and features of the printed card, bringing these benefits to users as well as staff. Librarians were among the first to make their records available behind servers to a wider community.
The format of the OPAC quickly began to show its age. This is unsurprising when you consider the pace of change of the web. Whilst the library website is usable, patrons often require practice, guidance and the help of librarians to perform the most basic of searches. Machine readable card catalogues were built for a specific time and technology. Going forward, this is not a good way to present data to the web. As we approach the web of data we must transform legacy formats to be compatible with the common identifiers and web schemas required for correct exposure. As the world begins to embrace linked data, libraries have an opportunity for our resources to exist in a common space inhabited by our users. The relationship is symbiotic: libraries can deliver information to the web in the way that it wants, to be consumed by high level services for the mutual benefit of our information seekers.
What does the web want? What is required in order to join the web of data?
Richard talked about what’s needed to improve the structure of our data and described the WorldCat linked data project, which has transformed static bibliographic records into a format compatible with the web of data. He discussed how this new phase of the web adds real-world value to our virtual environments and value to the web community. He described how libraries can use this space to assert their own value in an environment shared with our users.
The web respects and likes size
Big sites and resources tend to attract more web traffic. If it’s a popularly used resource, the data it holds could be an authoritative hub of information. If this hub is constantly updated and maintained it becomes a valuable resource for the web community as a whole. There are numerous examples of this, e.g. Wikipedia. In terms of preservation, large websites have the potential for greater longevity, encouraging other services to seek them out.
WorldCat has been involved in this for some time. They have been an aggregator of library records for many years. Currently they hold approximately 311M records and are the biggest collection of linked bibliographic data on the web.
Benefits from using big, authoritative hubs, structured as linked data, are things like cascading updates. For WorldCat, any updates they make to their records at the work level will be cascaded down to the manifestation level. Similarly any external resource that is consuming information from WorldCat will always be brought current, up-to-date information that is constantly being maintained.
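The cascading behaviour described above can be sketched in a few lines of Python. This is an illustrative model only, not OCLC's implementation: the `Work` and `Manifestation` classes, their fields and the identifiers are all hypothetical. The point is that a manifestation holds a reference to its work rather than a copy of the work's data, so a correction made at the work level is visible everywhere at once.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Work-level description shared by every edition (hypothetical model)."""
    work_id: str
    title: str
    author: str


@dataclass
class Manifestation:
    """A specific edition; it points at its Work instead of copying its fields."""
    manifestation_id: str
    work: Work
    format: str

    def describe(self) -> str:
        # Work-level fields are read through the reference at display time.
        return f"{self.work.title} / {self.work.author} ({self.format})"


work = Work("work/1", "Moby Dick", "Melville, Herman")
editions = [
    Manifestation("man/1", work, "print"),
    Manifestation("man/2", work, "ebook"),
]

# A single correction at the work level...
work.title = "Moby-Dick"

# ...is reflected in every manifestation that references the work.
descriptions = [m.describe() for m in editions]
```

An external consumer that stores only the work's identifier, and looks the description up when needed, benefits in the same way.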
Standing together, libraries can have a bigger impact on the web. This is part of what WorldCat is aiming to achieve. Libraries can contribute to this today by ensuring their holdings are registered at WorldCat, and current.
There are advantages for companies like Google in allowing their services to consume large, popular resources. As well as making browsing a more useful experience for the user, it means people will continue to journey through the services of their provider of choice.
The web wants structure and standards
The construction of the current web is based on fundamental standards, e.g. HTML. Most people are now familiar with the structure of web pages, accessible through a network of links. However, in order to join the web of data, the web wants us to change our techniques in order to join entities together.
Schema.org is a standard adopted by OCLC (and many other organisations), which allows us to define entities and attributes in a way that is consumable by the web. By adopting a shared vocabulary we are able to connect with other services that speak the same language. A central aim when adding linked data to WorldCat was to ensure it behaves like the rest of the web. Schema.org is widely understood and shared across the web - approximately 15% already use it. The big competing search companies Google, Bing, Yahoo and Yandex all collaborated on its creation. This is because there is a growing global clamour for the benefits provided by structured data across the web.
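As a rough illustration of what a shared vocabulary looks like in practice, the sketch below describes a book using Schema.org terms and serialises it as JSON-LD. The `Book` and `Person` types and the `name`, `author` and `inLanguage` properties are real Schema.org terms; the `example.org` identifiers are placeholders, and this is not the markup WorldCat itself publishes.

```python
import json

# Hypothetical identifiers on example.org; a real record would use the
# publisher's own persistent URIs.
book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "@id": "http://example.org/work/moby-dick",
    "name": "Moby-Dick",
    "inLanguage": "en",
    "author": {
        "@type": "Person",
        "@id": "http://example.org/person/herman-melville",
        "name": "Herman Melville",
    },
}

# JSON-LD is plain JSON, so the standard library is enough to serialise it.
json_ld = json.dumps(book, indent=2)
print(json_ld)
```

Any service that understands Schema.org can read this description without knowing anything about library-specific formats.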
There are other standards that OCLC could have chosen for WorldCat. For example, Bibframe (developed by the Library of Congress) also recognises entity based data as the best structure to serve the needs of our users. Utilising multiple standards is complementary to the aims of linked data. Library rich vocabularies are too complex for the rest of the web. We should expose our data through standards in a rich form like Bibframe, but also in a high level form that the rest of the web can devour.
RDF is the format in which linked data travels across the web. RDFa is a way of embedding it in HTML so that services like Google can harvest information from the pages.
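To make the distinction between RDF and RDFa concrete, here is a minimal sketch that expresses the same two statements about a book first as RDF triples (serialised as N-Triples) and then as RDFa Lite attributes inside HTML. The book URI is a placeholder and the helper functions are written for this example only; a production system would normally use an RDF library rather than string formatting.

```python
from html import escape

SCHEMA = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
book_uri = "http://example.org/work/moby-dick"  # hypothetical identifier

# Each statement is (subject, predicate, object, object_is_uri).
triples = [
    (book_uri, RDF_TYPE, SCHEMA + "Book", True),
    (book_uri, SCHEMA + "name", "Moby-Dick", False),
]


def literal(text):
    """Quote a plain string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(rows):
    """Serialise the triples in N-Triples syntax, one statement per line."""
    lines = []
    for s, p, o, is_uri in rows:
        obj = f"<{o}>" if is_uri else literal(o)
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def to_rdfa(uri, type_name, properties):
    """Embed the same description in HTML using RDFa Lite attributes."""
    spans = "".join(
        f'<span property="{escape(prop)}">{escape(value)}</span>'
        for prop, value in properties.items()
    )
    return (
        f'<div vocab="{SCHEMA}" typeof="{escape(type_name)}" '
        f'resource="{escape(uri)}">{spans}</div>'
    )


ntriples = to_ntriples(triples)
rdfa_html = to_rdfa(book_uri, "Book", {"name": "Moby-Dick"})
print(ntriples)
print(rdfa_html)
```

The HTML version still renders as an ordinary page for a human reader, while a harvester can recover the same triples from the `vocab`, `typeof`, `resource` and `property` attributes.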
WorldCat data links to Dewey, DOID, VIAF and many others. WorldCat explicitly licenses its data as open (under ODC-BY). This means any person or web service can use it, not just Google, highlighting its value as a community resource to different scales of the web.
Network of links
The bibliographic web of data is starting to form. It’s not just libraries; there are plenty of organisations across the world forming their own webs. Google is the obvious big player here, as a service which harvests data from all these sources, but all these services also link with each other. WorldCat is integrated with Wikipedia, VIAF, LCSH, Dewey. These are all well respected linked data hubs that contain authoritative sources of data.
For example, a WorldCat record contains a linked data identifier representing the book the page is about. Clicking on the attributes will eventually take you through to their descriptions, e.g. item type, each of which is ultimately held by the authoritative hub that looks after it.
In a web environment the user is clicking on links, which feels natural. But WorldCat is actually going out to the web on behalf of the user, to bring back their information from an authority for display.
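The idea of fetching descriptions from an authority on the user's behalf can be sketched as follows. To keep the example self-contained, the remote hubs are simulated with an in-memory dictionary; a real service would make an HTTP request to each URI instead. All URIs, labels and function names here are hypothetical.

```python
# In-memory stand-ins for remote authority hubs (hypothetical data).
AUTHORITIES = {
    "http://example.org/authority/person/42": {"label": "Melville, Herman, 1819-1891"},
    "http://example.org/authority/subject/7": {"label": "Whaling"},
}

# The local record stores links to authorities, not copies of their data.
record = {
    "title": "Moby-Dick",
    "author": "http://example.org/authority/person/42",
    "subject": "http://example.org/authority/subject/7",
}


def dereference(uri):
    """Return the authority's description of a URI, or None if unknown."""
    return AUTHORITIES.get(uri)


def render(rec):
    """Build a display version of the record, replacing URIs with labels."""
    display = {}
    for key, value in rec.items():
        description = dereference(value)
        # Keep plain strings as they are; swap linked URIs for the hub's label.
        display[key] = description["label"] if description else value
    return display


page = render(record)
print(page)
```

Because the label is looked up at display time, a correction made by the authority appears on the page without the local record changing.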
Entity identifiers - “Things not strings”
The web wants entity identifiers – a unique identifier for a thing. These are known as URIs (Uniform Resource Identifiers). Why does the web want this? The web is a representation of our world and increasingly we spend more and more time in this space. By gathering, identifying and describing entities in this way we add value to the virtual world. As information professionals, these are skills we are very familiar with. Additionally, when things are identifiable they are consumable by numerous different services.
WorldCat makes use of persistent identifiers for its entities. It is a new concept that allows everyone to know that the same ‘thing’ is being referred to, which allows it to be linked.
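A small sketch shows why identifiers matter more than strings. Three catalogue records name the same author in three different ways; grouping by the text gives three "authors", while grouping by identifier gives one. The lookup table and the `example.org` URI are made up for this illustration; in practice the mapping would come from an authority file such as VIAF.

```python
# Hypothetical lookup from name strings found in records to one identifier.
NAME_TO_URI = {
    "Twain, Mark": "http://example.org/person/1",
    "Mark Twain": "http://example.org/person/1",
    "Clemens, Samuel Langhorne": "http://example.org/person/1",
}

records = [
    {"title": "Adventures of Huckleberry Finn", "author": "Twain, Mark"},
    {"title": "The Innocents Abroad", "author": "Mark Twain"},
    {"title": "Life on the Mississippi", "author": "Clemens, Samuel Langhorne"},
]

# Grouping by string treats each spelling as a different author.
by_string = {}
for rec in records:
    by_string.setdefault(rec["author"], []).append(rec["title"])

# Grouping by identifier shows that all three are the same person.
by_uri = {}
for rec in records:
    by_uri.setdefault(NAME_TO_URI[rec["author"]], []).append(rec["title"])

print(len(by_string), "authors by string;", len(by_uri), "author by identifier")
```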
There are well recognised relationships between the attributes of our catalogue records and the wider world in concepts like people and places. The world thinks in entities, not just subject records. As part of their linked data project, OCLC have been harvesting this information out of catalogue records, including persons, author, producer, creator, etc. With library data stored as entities, connections between things, persons and works, item availability, subjects and concepts are possible. These connections can be used in a new way to promote discovery of these resources.
The FRBR data model (which RDA is based on) is already being used in the commercial world, e.g. by Amazon. This is because it is a comfortable way to organise data. OCLC are operating on similar models when extracting works data from their records.
Richard introduced a library knowledge graph of relationships. This demonstrates how different resources relate and link with each other. Knowledge cards are an example of this in practice. Google currently uses knowledge cards in search results, which provides access to related, linked data surrounding search topics. Libraries could use similar tools to connect people with related research outputs. This is known as serendipitous discovery, with the user following paths to subjects they didn’t know were available.
What other relationships are valuable to our users? Relationships of availability? Authors? Publishers? These are entities that can be uniquely identified by URIs on the web. Because relationships work both ways, this has the potential to be a very powerful tool in bringing explorers back to our collections and services. Once they have found the library, they have the opportunity to explore the detailed descriptions of our collections to further their knowledge or research.
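The point that relationships work both ways can be illustrated with a tiny graph of triples indexed in both directions. This is a toy model, not Richard's library knowledge graph: the `ex:` identifiers and the `author` and `heldBy` relationship names are invented for the example.

```python
from collections import defaultdict

# Hypothetical identifiers and relationship names.
triples = [
    ("ex:moby-dick", "author", "ex:melville"),
    ("ex:typee", "author", "ex:melville"),
    ("ex:moby-dick", "heldBy", "ex:city-library"),
]

outgoing = defaultdict(list)  # subject -> [(predicate, object)]
incoming = defaultdict(list)  # object  -> [(predicate, subject)]
for s, p, o in triples:
    outgoing[s].append((p, o))
    incoming[o].append((p, s))

# Forward: from a book to its author.
authors = [o for p, o in outgoing["ex:moby-dick"] if p == "author"]
# Backward: from the author to every work that names them.
works = [s for p, s in incoming["ex:melville"] if p == "author"]
# Onward: from one of those works to a library that holds it.
holders = [o for p, o in outgoing["ex:moby-dick"] if p == "heldBy"]

print(authors, works, holders)
```

A user who arrives at the author from anywhere on the web can follow the backward links to the works, and from a work to the library that holds it.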
The Power of Sharing Linked Data for libraries
We are all familiar with the ‘internet of documents’ – the web of links. The transition to what’s known as the ‘internet of things’, or a ‘web of data’, is under way. This is an internet of entities, each of which may have relationships with other entities. It has huge implications for collaboration, shared data and impact.
Commercial enterprises are using linked data internally to further their business aims and improve their services. Facebook describes the data it collects across huge numbers of entities. By ensuring this information is properly related, they are able to target advertising by focusing on the patterns that emerge. Libraries have the opportunity to use data about entities and their relationships in the same way, albeit with more altruistic behaviours. This could offer huge insights into trends of usage and research, helping to inform the services we provide and demonstrating the impact of our collections across the research community. This also helps minimise the risks in connecting users to content.
Our catalogue records are often buried in the OPAC and not properly exposed to the wider world. MARC records often contain a wealth of wonderful, descriptive information about a piece of work - vital for those exploring our collections. However, when that information is not exposed to, or not understandable by, the wider web, its value is lost.
We already store this kind of information across our resources in the library and are very familiar with the concepts surrounding why and how we use it. If we expose it to the web in the way that it wants it will cement the role of libraries today and transform our mission in providing public access to our collections. Our data must be consumable to the same services used by our students and researchers, and usable to the web at large.