It was a sunny day in early August 2011. A meeting in the spacious University of Waikato facilities was in full swing when my cell phone rang. I picked up assuming the call was important, and it was! Peter Wren-Hilton, Pingar’s CEO, was calling from Sunnyvale in California. He had just received official confirmation from the New Zealand Ministry of Science & Innovation (MSI) that it would co-fund Pingar’s proposed joint taxonomy research project with Professor Ian Witten and his postdoctoral researchers, Lan Anna Huang and Dave Milne. I was glad to be able to share this news with Anna and Dave immediately. We had all gathered that day to discuss the agreed milestones in detail, and now we knew that the project was all go.
The project is about helping organizations automatically generate taxonomies from their own documents. Documents come in different flavours (emails, websites, wiki pages, proposals, applications, CVs), and an average employee in a typical organization produces dozens of such documents every week. Taxonomies are extremely useful for organizing and searching all this unstructured data. They consistently group documents on the same topic, or term, irrespective of how that topic is named in text. One can browse from a generic taxonomy term to a more specific one and find all documents in one place.
But taxonomies have two major drawbacks. The first is that someone has to create the links between taxonomy terms and documents, a task also known as metadata assignment. Requiring employees to specify metadata manually rarely works: it’s just too much effort. Pingar has already addressed this problem with an API method that automatically figures out relevant taxonomy terms for input documents. The Pingar Metadata Add-On for SharePoint 2010 packages this API method into a one-click application.
The second drawback of taxonomies is that creating them from scratch is daunting. Companies employ information architects who first research the topics discussed in internal documents and then group them into a useful hierarchical structure. This process is costly and, more importantly, difficult to sustain. Most organizations change over time, and the ideal taxonomy should reflect these changes. After hearing regular requests for a tool that would simplify taxonomy creation, Pingar decided to build one, with the help of experienced University of Waikato PhD researchers.

From L-R: Dr Lan Anna Huang, Dr Jeen Broekstra, Steve Manion, Professor Ian Witten, Dr Dave Milne, Dr Alyona Medelyan & Dr Anna Divoli
Last week, our joint research team met again to review progress on the milestones. The image to the left shows topics on Cycling from a taxonomy of news articles, constructed using a prototype we developed. The colors denote terms identified using different methods. The prototype repeats the steps an information architect follows in their work. First, we detect relevant terms for the new taxonomy. Our sources are entity and terminology extraction methods, existing taxonomies, Linked Data resources, and Wikipedia. Second, we group these terms hierarchically by drawing on known relations from existing resources, as well as detecting new relations using Machine Learning.
In this project, we work closely with commercial partners: companies that can’t wait to start using such tools and hence are keen to help us develop them. We expect to complete the prototype of a taxonomy extraction tool in March 2012 and look forward to reporting on the outcomes here.
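For readers who think in code, the two steps can be illustrated with a toy sketch. This is not the prototype itself: plain substring matching stands in for entity and terminology extraction, a hand-written `broader` dictionary stands in for relations taken from existing resources, and the Machine Learning step is left out entirely. All names and sample data below are hypothetical.

```python
from collections import defaultdict


def detect_terms(documents, vocabulary):
    """Step 1 (simplified): keep vocabulary terms that occur in any document."""
    found = set()
    for text in documents:
        lowered = text.lower()
        for term in vocabulary:
            # Case-insensitive substring match; a stand-in for real extraction.
            if term.lower() in lowered:
                found.add(term)
    return found


def build_hierarchy(terms, broader):
    """Step 2 (simplified): attach each term to its known broader term.

    A term whose broader term is unknown, or was not detected, becomes a root.
    """
    children = defaultdict(list)
    roots = []
    for term in sorted(terms):
        parent = broader.get(term)
        if parent in terms:
            children[parent].append(term)
        else:
            roots.append(term)
    return roots, dict(children)


# Hypothetical sample data, for illustration only.
documents = [
    "The Tour de France is the best-known road cycling race.",
    "Mountain biking and road cycling are both forms of cycling.",
]
vocabulary = ["Cycling", "Road cycling", "Mountain biking", "Tour de France", "Rowing"]
broader = {
    "Road cycling": "Cycling",
    "Mountain biking": "Cycling",
    "Tour de France": "Road cycling",
}

terms = detect_terms(documents, vocabulary)
roots, children = build_hierarchy(terms, broader)
print(roots)
print(children)
```

In this toy data, "Rowing" never appears in the documents and so is dropped, while the remaining terms are arranged under "Cycling" using only the relations supplied in `broader`.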