
A recent project at Pingar investigated how we can use the Pingar API to make sense of what people talk about on Twitter. This blog post talks about the challenges we faced and the lessons we learned in the process.
Human language is fascinating. There are so many ways to express the same thing that even people sometimes struggle to understand each other. At Pingar, we develop algorithms that understand written text, but we are realistic about what is possible to achieve.
However, with the popularity of social networks like Facebook and Twitter, and of those that organizations use behind the firewall, more and more data is generated that differs from conventional document text. Let’s call it unstructured social data. It includes instant messages sent via Skype, Windows Messenger, Lync, or even phone messaging, as well as status updates like those on Twitter, Facebook and internal social networks. This data is messy and lacks context, but it is valuable for organizations because, in addition to meaning, it carries people’s thoughts and emotions.
Analysis of social data on public social networks can reveal what current and potential customers say about a service, product quality, or competitors. By tapping into internal social data, one can monitor what topics employees discuss, or how their mood and attitude change over time. We wanted to investigate how the Pingar API can help in both cases, so we set up a short study to develop a prototype that makes sense of unstructured social data.
Most organizations use some sort of instant messaging (even Outlook can be used to send IMs!) and some adopt social networks behind the firewall (Yammer or Newsgator for SharePoint), but getting hold of this data is tricky. Twitter is the most common data source for social media research, but even tweets are hard to come by due to privacy policies. First, we tried using the Twitter API to collect data from Twitter directly. We monitored specific New Zealand topics in the hope of obtaining a well-defined set of tweets, but found that Kiwis don’t tend to talk about companies on Twitter all that much. AirNZ, for example, received on average 10 tweets per day. In the end, we were able to get a part of the SNAP: Network dataset through our collaboration with Swansea University.
Sentiment analysis can determine whether text expresses a positive or a negative emotion. The Pingar API does not offer sentiment analysis, so we decided to try out an existing toolkit, NLTK. It uses a simple Naïve Bayes classifier, which needs to be trained on manually annotated data that is similar to the actual input. We found the Sanders Analytics Twitter Sentiment Corpus, which contains sentiment-annotated tweet IDs, and had to pull the associated tweets via the Twitter API (150 per hour!). The drawback of this corpus is that it focuses only on the technical domain, so NLTK won’t be as accurate on non-technical tweets.
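To make the idea of a Naïve Bayes sentiment classifier more concrete, here is a minimal sketch written with the Python standard library only. It is not NLTK's implementation and not the code used in the study: the function names, the whitespace tokenizer, the add-one smoothing and the four training tweets are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_tweets):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training tweets
    for text, label in labeled_tweets:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_tweets = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: the share of training tweets carrying this label.
        score = math.log(label_counts[label] / total_tweets)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training examples, for illustration only.
training = [
    ("love my new phone", "positive"),
    ("great battery and great screen", "positive"),
    ("terrible update, phone keeps crashing", "negative"),
    ("hate the new keyboard", "negative"),
]
word_counts, label_counts = train(training)
print(classify("great phone", word_counts, label_counts))
print(classify("keyboard keeps crashing", word_counts, label_counts))
```

The sketch also shows why the training domain matters: a word that never appears in the annotated tweets contributes nothing but the smoothing term, so tweets from an unfamiliar domain are classified almost entirely by the prior.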
First, we cleaned the tweets by removing all the duplicates, as thousands of re-tweets and spam tweets can negatively affect the results. From each tweet we removed URLs, hashtags, user names and stopwords such as RT, via, lol, lmao, while keeping the original copy for display later. Once all the tweets were cleaned and grouped by date and sentiment, we applied the Pingar API Entity Extraction method to determine the keywords for the two sets of positive and negative tweets. The API returned two lists of keywords along with the keyword scores. Sometimes the same keyword appeared in both the positive and the negative list. In this case, we kept the keyword only in the list where it scored higher and removed it from the other.
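The cleaning and keyword reconciliation steps can be sketched roughly as follows. This is an illustration under several assumptions, not the code from the study: the regular expression, the function names, the choice to detect duplicates on the cleaned text (so that re-tweets collapse), the tie-breaking rule, and the representation of the keyword lists as (keyword, score) pairs are all hypothetical. The sketch does not call the Pingar API, and the sample tweets and scores are made up.

```python
import re

# Assumed stopword list; the post names RT, via, lol and lmao as examples.
STOPWORDS = {"rt", "via", "lol", "lmao"}
# Matches URLs, #hashtags and @usernames (with an optional trailing colon).
NOISE = re.compile(r"https?://\S+|[#@]\w+:?")


def clean_tweet(text):
    """Strip URLs, hashtags, user names and stopwords from one tweet."""
    text = NOISE.sub(" ", text)
    words = [w for w in text.split()
             if w.lower().strip(".,:;!?") not in STOPWORDS]
    return " ".join(words)


def clean_tweets(tweets):
    """Drop duplicates and return (original, cleaned) pairs for later display."""
    seen = set()
    result = []
    for original in tweets:
        cleaned = clean_tweet(original)
        key = cleaned.lower()
        # Assumption: two tweets are duplicates if their cleaned text matches.
        if key and key not in seen:
            seen.add(key)
            result.append((original, cleaned))
    return result


def resolve_conflicts(positive, negative):
    """Keep a keyword found in both lists only where its score is higher."""
    pos, neg = dict(positive), dict(negative)
    for keyword in set(pos) & set(neg):
        # Assumption: on a tie the keyword stays in the positive list.
        if pos[keyword] >= neg[keyword]:
            del neg[keyword]
        else:
            del pos[keyword]
    return pos, neg


# Made-up tweets and keyword scores, for illustration only.
tweets = [
    "Loving the new iPhone! http://t.co/abc #apple",
    "RT @bob: Loving the new iPhone! http://t.co/xyz",
    "lol",
]
cleaned = clean_tweets(tweets)
pos, neg = resolve_conflicts(
    [("battery", 0.9), ("screen", 0.4)],
    [("battery", 0.3), ("screen", 0.7), ("price", 0.5)],
)
print(cleaned)
print(pos, neg)
```

Keeping the original text next to the cleaned text mirrors the post's point that the untouched tweet is still needed for display, while only the cleaned version goes into keyword extraction.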
You can check out the result of this experiment in our prototype Twitter demo. We took four topics that had a sufficient number of tweets in the SNAP dataset (google, iphone, iran and obama), processed them as described in this post, and visualized how trending keywords in positive and negative tweets change over time.
We would like to thank Jon Hurlock and Max Wilson for their help with this project.
At the end of last year, Pingar received funding from MSI for a postgraduate intern, which allowed us to hire Andy Chao, a recent University of Auckland Computer Science graduate, who completed this work.
Last week the Pingar team attended the Strata O’Reilly Making Data Work Conference, a fantastic conference organized by O’Reilly Media. This conference focuses on technologies that unlock the immense potential hidden in large volumes of data and apply it to solving real-world problems.
What I liked about this conference was the perfect mix of high-quality keynotes, educational tutorials and interesting company exhibits. Overall, Strata was a success: I got a chance to attend several memorable talks, Dr. Anna Divoli and I gave a presentation entitled “Mining Unstructured Data: Practical Applications”, and Pingar was one of the exhibitors in the Innovators’ Pavilion.
Lots of interest in the Pingar demos!
One of my personal highlights was a keynote by Hal Varian (Google’s chief economist) talking about economic predictions using query logs. Just as the number of searches for vodka can predict the number of searches for hangover, one can also accurately predict the number of German visitors to Hong Kong in a given month, or when a recession is likely to hit the economy. Mike Olson (Cloudera) also gave a great keynote. He talked about how big data analysis can solve important world problems, from developing effective drugs and monitoring crime to detecting oil underwater.
In our presentation on unstructured data, Anna and I also focused on real-world problems. We talked about how, in the legal domain, automated metadata can save hundreds of hours and therefore dollars; how, in health care, automated sanitization means that medical records can be exchanged more safely and re-used for research; and how, in finance, entity extraction can solve tedious compliance problems like FATCA.
After our talk we spoke to a number of people interested in finding out how exactly text analytics works and what it can achieve. We were told that there are not enough presentations addressing the issues of unstructured data, even here at Strata. Such comments, as well as a constant stream of visitors at Pingar’s booth who wanted to see our latest demos, have confirmed that the demand for unstructured data solutions is high.
Alyona Medelyan, Pingar Chief Research Officer
P.S. There were 2,500 attendees at Strata, and a visualization by the Guardian shows their distribution over countries, companies and gender. Interestingly, if you select New Zealand, it shows 100% male attendance. This is because they forgot to include the speakers: both Anna Divoli and I travelled from NZ, which would bring the female share to at least 17%, the figure for overall female attendance. Such small mistakes in handling data can lead to misrepresentation, which is why handling data is not trivial.