Open futures - A blog for information technology and library related things: November 2014

Tuesday, 25 November 2014

"Using text mining to enrich social science content" & "Semantic content enrichment: magic vs myth"

These are my notes on two UKSG Forum 2014 presentations. One by Alan Maloney, Senior Product Analyst at SAGE publications ("Using text mining...") and one by Sam Herbert, Director and Co-Founder of 67 Bricks ("Semantic content...")

Both talks covered similar areas and are excellent examples of the practical application of semantic data being put to work for researchers.

Using text mining to enrich social science content - SAGE

In essence SAGE has developed a system that allows them to mine the contents of their publications. The primary goal is to enhance the user experience in the same linked data format demonstrable by Google’s knowledge cards or OCLC's WorldCat platform. Another aim is to link content across their sites more meaningfully. It is based around the idea that structure and search terminology are often already implicit in a document.
http://connection.sagepub.com/blog/2012/12/20/text-mining-and-the-social-sciences/.

An algorithm they have developed analyses the structure of the document and extracts, tests and indexes relevant nouns that breaks up topics into smaller subtopics.

Alan demonstrated this approach with an encyclopaedia entry. There were numerous terms within the entry that were extracted and indexed by the algorithm in order to be identifiable in search results to related terminologies as well as exact phrase matching.

It's still early days of the project and there are some aspects which work very well. As you can imagine, however, the project has also encountered many challenges. Progress has been good but areas for improvement include:

Structural inconsistencies – The differences in the types of document. What the content contains and how it is laid out all present potential difficulties, for example with untagged mixed media items.
Fuzzy matching vs extract phrase matching
Ambiguous and generic terms
Trigger words - rules that only apply to certain terms

The main area of work is creating subject specific taxonomies which can be used by the algorithm for more accurate matching. Whilst the readability of terms had improved, ambiguity across terms is also a factor that needs to be worked on. An audience member commented that if functionality exists that would allow the user to validate the accuracy of a search then this might help speed up the development process.

In future they expect to see other kinds of content like datasets appearing alongside content.

Semantic content enrichment: magic vs myth - 67 Bricks

67 Bricks work with publishers to enrich their content to make it more structured, granular, flexible and reusable by applying semantic content enrichment.

They help publishers develop content processes, systems and delivery channels that support more agile and flexible production workflows, increase the value of legacy and new content, increase revenues from existing channels as well as enabling better reuse of content to help maximise the usage (and revenues) from new and existing digital products.

To harness the power of semantic information is to take advantage of the new opportunities emerging in the publishing world. This statement encompasses some of the reasons why semantic information is important.

Users expectations of web applications are being standardised and defined by the services they encounter on a daily basis. Often commercial, examples include Google and Amazon. This is important because their expectations will skew how your own site is received. If their perceptions are not indicative of better experiences elsewhere on the web, your own site will quickly lose their attention. Users want a personalised experience and they want to notify about things they care about. Semantic content can realise these expectations.

So how do 67 Bricks apply semantic information to content and how does this benefit the user? Their data extraction models begin by identifying keywords and search terms. Items which the user will find pertinent and interesting. Their algorithms are able to perform a grammatical analysis of a huge volume of digital content quickly and at the same time apply a statistical approach in this context. Giants of the web like Wikipedia also use a statistical approach in this way.

Like Wikipedia (and others) the approach is fully automated. Whilst this approach means enrichment can take place very quickly the quality of the results cannot always be guaranteed.

Once applied, the data collected through the semantic approach can be metaphorically compared to a fingerprint. By comparing this against the ‘fingerprints’ of other articles in the collection this allows items of interest to be brought to the user’s attention.

67 Bricks also apply an entity approach to an analysis. An entity can be described as a known thing based against a known taxonomy. Linked data on the web allows entities to be referenced and identifiable across resources and authoritative data hubs using a compliant schema through the use of URIs. This means the ‘fingerprints’ identified by the analysis of a collection can be used to go out to the wider web and call back other useful information and resources that the user may want to see. Fingerprints can be applied to user profiles and can therefore be setup to degrade over time. This means a system can adapt to present the user articles with articles they are interested in real time as their interests and investigations change.

Some examples Sam provided were Medical entities. He demonstrated a mesh taxonomy the subject of which was viruses. By holding the data in this format he was able to perform faceted searches as well as show a graphical form of the mesh, allowing users to see easily where the information terms and subtypes are coming from. This can be down to the level of geographic locations. The applications here could involve identifying and containing outbreaks or new strains of virus before an epidemic can take place. Or vaccines for a particular virus could be related to other topics like reaction types.

What has changed is the moving away from the analysis of individual terms and to move toward analysis of the content as a whole. Machine learning is a growing area and 67 Bricks are currently working on defining taxonomies and training their system for numerous clients, with a particular focus on the interest for research articles. Taxonomies take much longer to define and the nodes of taxonomy form a fundamental part of machine learning. In laments terms, many articles are applied to each node and so ‘learned’ by the system.

In closing, whilst machine learning is going to enhance the user experience even further, To the keywords system is instantaneous.

It’s neither Myth nor Magic. Just ask Harry...

Monday, 24 November 2014

The use of data in humanitites research to deliver new insights with examples from the music and film industries

These are my notes on the UKSG Forum 2014 2014 presentation by Roger Press, Managing director, Academic Rights Press Ltd

Academic Rights Press (ARP) is a publisher of academic databases featuring exclusive content sets for scholarly research, teaching and learning. The platforms deliver innovative and easy to use functionality to enable researchers, faculty and students to benefit from the best of contemporary research tools. This leads to innovative scholarship and the opening up of new areas of study. The subject areas include Law, Entertainment, Music, Philosophy, English Letters, Women Writers, Religious Studies, Asian Studies, Social Sciences/Humanities and Health and Medical.

The work of Academic Rights Press is intended to rigorously bring about a cultural change in the humanities by bringing data into service of these disciplines.

Roger commented on the differences between STEM subjects and the Humanities concerning research data. In simplistic terms, STEM subjects often rely on measurements, testing and results whilst Humanities subjects often rely on researcher observation and interpretation. However, research in the humanities is gearing more and more toward the use of datasets in informing their research.
In particular, historical datasets from big business have demonstrably been put to scholarship and the insights from these companies are driving many areas of research. ARP services help to put some of these datasets to work for the researcher.

Roger gave the example of a researcher who was able to measure the contemporary impact of books published in the 1800’s, in part due to sales and manufacturing data from the era. Other disciplines are deriving social tends through monitoring data on television and film. By bringing data into right format and platform (i.e. via a service like ARP), this allows researchers to find new insights from existing data. In the case of social trends, very often this can now be monitored in real time –a live social commentary.

As an example for data from the music industry, Roger showed some relative pitch graphs from a research project showing how sales for the ‘King of Pop’ Michael Jackson were massively overtaken by a previously unknown grunge metal band Nirvana. This pre-existing data was available to the researcher from Neilson via ARP.

The research also revealed that longer term over the decade; the heavy metal band Metallic trumped them both. Here’s to the die-hard heavy metal fan!

Making historical datasets available offers the opportunity to investigate fascinating research questions that could not be answered before.

There are echoes in my brain of the BBC’s Doomsday project as a good example of this kind of research (aside from the digital obsolescence problems they had in the 80’s).

Image licensed for reuse on commons.wikimedia.org

Social Media in the library - discovering best practice: findings from the Taylor & Francis White Paper

These are my notes on the UKSG Forum 2014 presentation by Laura Horton and Jodie Bell (Taylor & Francis).

The ‘use of social media by the library’ is a white paper researched and compiled by Taylor & Francis to provide an overview of current practices relating to the use by libraries of social media. This is a world-wide perspective, against which individual institutions can benchmark their own activities and be inspired to try new approaches and is available online here:

http://www.tandf.co.uk/libsite/whitePapers/socialMedia/

Many libraries are experiencing a drop in traffic to their static web pages but are seeing much more active engagement with users through the use of social media.

Research methodology for the paper involved several interviews and case studies across libraries of different sizes and types. Some of the top findings were:

The fast-paced nature of social media means that short response times are very important to users. If libraries are not able to respond quickly, they should clearly outline a suitable timeframe (immediately visible to the user), which is in keeping with service level agreements.
Presently most libraries only employ a small number of specialist staff to handle social media communications. It is worth considering how to share this knowledge and skills across a larger team to ensure business continuity of the service. It was noted that this method of communication is expected to grow and that team structures may need training to accommodate this change in user behaviour.
The best use of social media was considered to be where engagement through conversations could be demonstrated and it was felt that twitter was an effective tool for this. "Broadcasting" messages alone does not foster engagement with a service or positive relationships with users.
It was noted that it takes time to develop strong communities of users through social media. As the founders of your community you can decide if it will engender a passive or active form of engagement. A more active way of engaging users might be to poll them. Drake University uses a "soapbox" platform that allows library staff and users to pitch ideas, services and promotions.

Being more active means addressing issues that are important to users – quickly and professionally. This issue is compounded by the fact that problems outlined by users and the actions taken by staff are visible to the entire community in realtime.

It's important to post a variety of messages in order to keep your users engaged. This includes using a variety of different media. The use imagery is an effective way to communicate ideas quickly. This format is somewhat changing the way users search for and absorb information online

YouTube is being seen by some as a valuable collection management tool. Local digital collections are not as exposed to the wider web as applications like YouTube. This service can also make it easier to connect different accounts and so can make outputs discoverable to a number of different communities simultaneously.

The role of social media is expected to grow and become a primary means of communication for the future. Many of those who took part expect to see dedicated roles and teams emerging in the future.

Image licensed for reuse on commons.wikimedia.org

UKSG Forum 2014 - ‘Information without frontiers – barriers and solutions’

What's it all about?

I attended the UKSG Forum on 21st November 2014, now in its second year. It's an informal networking and exhibition event, with a program of short sessions this year on the theme of ‘Information without frontiers – barriers and solutions’. It aims to bring together key stakeholders across the scholarly information community.

What was discussed?

The event is representative of publishers, subscription agents, technology vendors and other industry players. There were a lot of different talks being discussed and it was a great opportunity to share experiences and swap strategies with colleagues across the sector who all face similar challenges.

See below links to pages on the talks I attended over the day. Please link through if any peak your interest:

Sunday, 23 November 2014

AltMetrics and Open Access at King's College London

Data was based on KOA database, King's research Portal, Cottage Labs, DOAJ, Sherpa Romeo and AltMetrics for open access articles (that we know to be green or gold) published over a section of 2014.

Whilst we had article level metrics we decided at this time it would be better to do a short and simple infographic.

Open Access Week 2014 at King's College London

A global event now entering its eighth year, is an opportunity for the academic and research community to continue to learn about the potential benefits of Open Access, to share what they've learned with colleagues, and to help inspire wider participation in helping to make Open Access a new norm in scholarship and research.

We created this GIF as a good (shareable) way to highlight the dedication of the university (via the library) for supporting authors in the open access publication of their research and also spread some positive OA messages at the same time. The very central london backdrop of somerset house gave it a nice touch. Paul Weller was in the area watching with bemusement as we created this masterpiece.

Drop in sessions

We hosted several drop in sessions over the week across the university. We were available answer any questions authors' had and were able to provide information about the many services available to them through the library. Points of interest included:

Depositing your outputs in Pure
Funder policies (RCUK, Wellcome Trust, and many more)
Freebies, competitions and more!

We came up with this neat poster based on OA week branding and shared this widely across all our communication outlets; the library services homepage, emails to heads of schools and A1 posters around the college.

Competitions

We cheesily offered a cuddly BioMed Turtles to reward deposits into Pure over the week as a fun icebreaker and chance to repeat this as a fundamental message promoted by our service. We also got involved with the knowledge unlatched competition as well as sending our GIF and pictures across the twittersphere!

Open futures - A blog for information technology and library related things

Pages