Both talks covered similar ground and are excellent examples of semantic data being put to work for researchers.
Using text mining to enrich social science content - SAGE
In essence, SAGE has developed a system that allows them to mine the contents of their publications. The primary goal is to enhance the user experience in the way that linked data powers Google’s knowledge cards or OCLC's WorldCat platform. Another aim is to link content across their sites more meaningfully. The approach is based on the idea that structure and search terminology are often already implicit in a document.
http://connection.sagepub.com/blog/2012/12/20/text-mining-and-the-social-sciences/
An algorithm they have developed analyses the structure of the document and extracts, tests and indexes relevant nouns, breaking topics up into smaller subtopics.
Alan demonstrated this approach with an encyclopaedia entry. Numerous terms within the entry were extracted and indexed by the algorithm so that they could surface in search results for related terminology as well as for exact phrase matches.
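As a rough illustration of this kind of term extraction and indexing (not SAGE's actual algorithm), the sketch below pulls noun phrases out of a short passage and records where they occur. It assumes spaCy and its small English model are installed; the sample text is invented.

```python
# Illustrative only: extract candidate noun phrases from a document and build
# a simple inverted index of phrase -> (document id, character offset).
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")

def index_noun_phrases(doc_id: str, text: str, index: dict) -> None:
    """Extract noun phrases and record where they occur in the document."""
    doc = nlp(text)
    for chunk in doc.noun_chunks:
        term = chunk.lemma_.lower()
        index[term].append((doc_id, chunk.start_char))

index = defaultdict(list)
index_noun_phrases(
    "encyclopaedia-entry-1",
    "Social capital describes the networks of relationships among people "
    "who live and work in a particular society.",
    index,
)
print(dict(index))  # e.g. {'social capital': [...], 'relationship': [...], ...}
```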
It is still early days for the project and some aspects already work very well. As you can imagine, however, the project has also encountered many challenges. Progress has been good, but areas for improvement include:
- Structural inconsistencies – differences in document types, in what the content contains and in how it is laid out all present potential difficulties, for example with untagged mixed-media items.
- Fuzzy matching vs exact phrase matching
- Ambiguous and generic terms
- Trigger words - rules that only apply to certain terms
The main area of work is creating subject-specific taxonomies which the algorithm can use for more accurate matching. Whilst the readability of terms has improved, ambiguity across terms still needs to be worked on. An audience member commented that functionality allowing the user to validate the accuracy of a search might help speed up the development process.
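To illustrate the fuzzy versus exact matching trade-off from the list above, here is a toy sketch using only the Python standard library and a small invented taxonomy; it is not SAGE's implementation.

```python
# Toy illustration: match an extracted term against a (hypothetical)
# subject-specific taxonomy, preferring exact matches and falling back to
# fuzzy string matching for near-misses such as plurals or typos.
import difflib

taxonomy = ["social capital", "social mobility", "capital punishment"]

def match_term(term: str, threshold: float = 0.8) -> list[str]:
    term = term.lower()
    if term in taxonomy:                      # exact phrase match
        return [term]
    # fuzzy match: close but not identical strings
    return difflib.get_close_matches(term, taxonomy, n=3, cutoff=threshold)

print(match_term("social capital"))   # exact: ['social capital']
print(match_term("social capitol"))   # fuzzy: ['social capital']
print(match_term("capital"))          # generic/ambiguous term: no safe match
```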
In future they expect to see other kinds of content, such as datasets, appearing alongside articles.
Semantic content enrichment: magic vs myth - 67 Bricks
67 Bricks work with publishers to make their content more structured, granular, flexible and reusable by applying semantic content enrichment. They help publishers develop content processes, systems and delivery channels that support more agile and flexible production workflows, increase the value of legacy and new content, and increase revenues from existing channels, as well as enabling better reuse of content to maximise the usage (and revenues) of new and existing digital products.
Harnessing the power of semantic information means taking advantage of the new opportunities emerging in the publishing world, and that statement captures some of the reasons why semantic information matters.
Users' expectations of web applications are being standardised and defined by the services they encounter on a daily basis, often commercial ones such as Google and Amazon. This matters because those expectations shape how your own site is received: if it does not live up to the better experiences users have had elsewhere on the web, it will quickly lose their attention. Users want a personalised experience and they want to be notified about things they care about. Semantic content can help realise these expectations.
So how do 67 Bricks apply semantic information to content, and how does this benefit the user? Their data extraction models begin by identifying keywords and search terms: items which the user will find pertinent and interesting. Their algorithms can perform a grammatical analysis of a huge volume of digital content quickly while also applying a statistical approach. Giants of the web like Wikipedia use a statistical approach in a similar way.
Like Wikipedia (and others), the approach is fully automated. Whilst this means enrichment can take place very quickly, the quality of the results cannot always be guaranteed.
Once applied, the data collected through the semantic approach can be thought of as a fingerprint for an article. Comparing it against the ‘fingerprints’ of other articles in the collection allows items of interest to be brought to the user’s attention.
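A minimal sketch of the fingerprint idea, assuming scikit-learn and a handful of invented article texts: each article becomes a term-weight vector, and cosine similarity surfaces the most closely related items. This illustrates the general technique, not 67 Bricks' actual pipeline.

```python
# Represent each article as a TF-IDF 'fingerprint' and compare fingerprints
# to find related articles in the collection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "a1": "Influenza vaccines and immune response in older adults.",
    "a2": "Seasonal flu vaccination uptake across Europe.",
    "a3": "Social capital and community networks in urban planning.",
}

vectorizer = TfidfVectorizer(stop_words="english")
fingerprints = vectorizer.fit_transform(articles.values())   # one row per article

# Compare the first article's fingerprint against every article in the set.
scores = cosine_similarity(fingerprints[0], fingerprints).ravel()
for doc_id, score in zip(articles, scores):
    print(f"{doc_id}: {score:.2f}")   # a2 scores higher than a3 for a1
```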
67 Bricks also apply an entity-based approach to the analysis. An entity can be described as a known thing drawn from a known taxonomy. Linked data on the web allows entities to be referenced and identified across resources and authoritative data hubs, using a compliant schema and URIs. This means the ‘fingerprints’ identified by the analysis of a collection can be used to go out to the wider web and bring back other useful information and resources that the user may want to see. Fingerprints can also be applied to user profiles and set up to degrade over time, so a system can adapt in real time to present the user with articles they are interested in as their interests and investigations change.
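The sketch below illustrates two of the ideas in that paragraph under stated assumptions: entities identified by stable URIs, and interest fingerprints on a user profile that fade over time. The URI, half-life and weighting formula are invented for illustration and are not 67 Bricks' system.

```python
# Illustrative only: a URI-identified entity plus an exponentially decaying
# interest weight, so stale topics fade from a user's profile over time.
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import exp, log

@dataclass
class Entity:
    label: str
    uri: str   # stable identifier; real systems would use e.g. a MeSH or Wikidata URI

def interest_weight(last_seen: datetime, now: datetime, half_life_days: float = 30) -> float:
    """Decay an interest score by half every `half_life_days`."""
    age_days = (now - last_seen).total_seconds() / 86400
    return exp(-age_days * log(2) / half_life_days)

influenza = Entity("Influenza", "https://example.org/taxonomy/influenza")  # hypothetical URI
now = datetime(2015, 6, 1)
print(interest_weight(now - timedelta(days=10), now))  # recent interest, ~0.79
print(interest_weight(now - timedelta(days=90), now))  # stale interest, ~0.13
```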
Some of the examples Sam provided were medical entities. He demonstrated a MeSH taxonomy whose subject was viruses. Holding the data in this format let him perform faceted searches and show a graphical form of the taxonomy, allowing users to see easily where terms and subtypes come from, right down to the level of geographic locations. Applications here could include identifying and containing outbreaks or new strains of a virus before an epidemic takes place, or relating vaccines for a particular virus to other topics such as reaction types.
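To make the faceted search idea concrete, here is a toy sketch over a small invented set of virus records; the facets and values are made up and are not the MeSH data shown in the demo.

```python
# Toy faceted search: count facet values and narrow a result set by facet.
records = [
    {"virus": "Influenza A", "subtype": "H1N1",  "region": "Europe"},
    {"virus": "Influenza A", "subtype": "H5N1",  "region": "Asia"},
    {"virus": "Ebolavirus",  "subtype": "Zaire", "region": "West Africa"},
]

def facet_counts(records, facet):
    """Count how many records fall under each value of a facet."""
    counts = {}
    for r in records:
        counts[r[facet]] = counts.get(r[facet], 0) + 1
    return counts

def filter_by(records, **facets):
    """Narrow the result set by one or more facet values."""
    return [r for r in records if all(r[k] == v for k, v in facets.items())]

print(facet_counts(records, "region"))          # one record per region here
print(filter_by(records, virus="Influenza A"))  # the two influenza records
```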
What has changed is a move away from analysing individual terms towards analysing the content as a whole. Machine learning is a growing area, and 67 Bricks are currently working on defining taxonomies and training their system for numerous clients, with a particular focus on research articles. Taxonomies take much longer to define, and the nodes of a taxonomy form a fundamental part of the machine learning. In layman's terms, many articles are assigned to each node and so ‘learned’ by the system.
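As a hedged sketch of the ‘many articles per node’ idea, the example below trains a small scikit-learn classifier on a handful of invented texts labelled with two hypothetical taxonomy nodes, then assigns a new article to a node. It is a simplification for illustration, not the models or taxonomies 67 Bricks actually use.

```python
# Train a simple text classifier where each taxonomy node is a label learned
# from example articles, then route a new article to the best-matching node.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Randomised trial of a new influenza vaccine in adults.",
    "Antiviral treatment outcomes for seasonal flu patients.",
    "Survey methods for measuring social capital in communities.",
    "Qualitative interviews on neighbourhood trust and networks.",
]
train_nodes = ["virology", "virology", "social-science", "social-science"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_texts, train_nodes)

print(model.predict(["Vaccination coverage and flu outbreaks in care homes."]))
# -> ['virology'] given this toy training set
```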
In closing, whilst machine learning is going to enhance the user experience even further, it depends on those taxonomies and training time; the keyword-based system, by contrast, is instantaneous.
It’s neither Myth nor Magic. Just ask Harry...
-------------------------------------------------------------