
NEWS 2 June 2010

Automated sentiment analysis gives poor showing in accuracy test


UK - Automated sentiment analysis is less accurate than flipping a coin when it comes to determining whether brand mentions in social media are positive or negative, according to a white paper from FreshMinds.

Tests of a range of different social media monitoring tools conducted by the research consultancy found that comments were, on average, correctly categorised only 30% of the time.

FreshMinds’ experiment involved tools from Alterian, Biz360, Brandwatch, Nielsen, Radian6, Scoutlabs and Sysomos. The products were tested on how well they assessed comments made about the coffee chain Starbucks, with the comments also having been manually coded.

On aggregate the results look good, said FreshMinds. Accuracy levels were between 60% and 80% when the automated tools were reporting whether a brand mention was positive, negative or neutral.

“However, this masks what is really going on here,” writes Matt Rhodes, a director of sister company FreshNetworks, in a blog post. “In our test case on the Starbucks brand, approximately 80% of all comments we found were neutral in nature.

“For brands, the positive and negative conversations are of most importance and it is here that automated sentiment analysis really fails,” Rhodes said.

Excluding the neutral comments, FreshMinds manually coded conversations that the tools judged to be either positive or negative in tone. “We were shocked that, without ‘training the tools’, they could be so wrong,” said the firm. “While positive sentiment was more consistently categorised than negative, not one tool achieved the 60-80% accuracy we saw at the aggregate level.
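The masking effect described above can be illustrated with a short calculation. The sketch below uses invented counts, not FreshMinds' data: it assumes 1,000 comments, 80% of them neutral according to human coders, and a hypothetical tool that handles neutral comments well but struggles with positive and negative ones. The function name and every number are assumptions made for illustration only.

```python
# Hypothetical counts of (human label, tool label) pairs.
# All figures are invented for illustration; they are NOT from the white paper.
counts = {
    ("neutral", "neutral"): 700,
    ("neutral", "positive"): 60,
    ("neutral", "negative"): 40,
    ("positive", "positive"): 40,
    ("positive", "neutral"): 50,
    ("positive", "negative"): 10,
    ("negative", "negative"): 20,
    ("negative", "neutral"): 60,
    ("negative", "positive"): 20,
}


def accuracy(pair_counts):
    """Share of comments where the tool's label matches the human label."""
    total = sum(pair_counts.values())
    correct = sum(n for (human, tool), n in pair_counts.items() if human == tool)
    return correct / total


# Accuracy over every comment, neutral ones included.
overall = accuracy(counts)

# Accuracy over only the comments the tool judged positive or negative.
polar = {pair: n for pair, n in counts.items() if pair[1] != "neutral"}
polar_accuracy = accuracy(polar)

print(f"overall: {overall:.0%}")
print(f"tool-judged positive/negative only: {polar_accuracy:.0%}")
```

With these made-up counts the overall figure sits inside the 60% to 80% band, because the large neutral group dominates the total, while accuracy on the comments the tool flagged as positive or negative is far lower.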

“To get real value from any social media monitoring tool, ongoing human refinement and interpretation is essential,” said the company.

The full white paper can be downloaded online here. Get the lowdown on social media monitoring here.


22 Comments

7 years ago

It all comes down to doing the work to get the automated systems working as well as possible. The amount of validation work that must go into creating an accurate automated sentiment analysis system is simply enormous and continually ongoing. Systems that do not incorporate ongoing validity mechanisms cannot improve and will only worsen over time as speech and language change with the times. What this says to me is buyer beware and buyer do your homework. Ask your vendor if they validate their engines, how they do it, and how often they do it. Annie Pettit, Chief Research Officer www.conversition.com


7 years ago

These findings will come as no surprise to companies that use automated analysis properly. Using automated analysis for individual pieces of coverage and without 'training' the software is never going to produce good results; and, in this respect, the FreshMinds study is itself flawed because, frankly, they should understand that. Equally, the companies studied should not be offering generic automated analysis services for exactly this reason, so in that respect the study is valid. In fact, automated analysis used properly can achieve remarkably accurate results. Something the study does not do, of course, is compare properly trained and correctly used automated analysis against humans in analysing, say, 1000 pieces of online coverage in real time, which is increasingly required in today's highly connected world. Had they done so, the automated analysis would win hands-down. In other words it's 'horses for courses' and this study really should have pointed that out.


7 years ago

Thanks for your comments Matt. Many of our clients come to us after trying social media monitoring for themselves and discovering the issues we highlighted in our report. Our intention with the research was to see how the tools varied without such 'training', as training is not always consistent and is certainly not always used by clients. We've had some great feedback, particularly from the tool providers, and we plan to update this research shortly.


7 years ago

Couldn't agree more. Social media measurement has a LONG way to go, and no number of funky dashboards and black box algorithms is going to make a difference until some pretty fundamental weaknesses have been addressed. http://tiny.cc/d9ld5


7 years ago

As regular visitors to this site may recall, Mark Westaby and I debated this very question - whether automated tools could provide sufficiently accurate sentiment analysis to support critical business decisions - earlier this year. This study supports what is now generally considered a settled view – that automated analysis tools cannot, and generally do not pretend to deliver the same levels of sentiment accuracy as well trained, fully briefed human analysts. However, as others have pointed out, there are inadequacies in this study. But I would contend that these research issues do not detract from the central finding that when it comes to sentiment, automated tools are simply not as accurate or consistent as humans. Faced with this finding, proponents of automated tools, as Mark is, often retreat into justifying their use by virtue of their benefits in "real time" analysis. However, in practice, real time analysis is really only necessary in crisis or rapid response situations. And the paradox is that in these situations, whilst actionable results can be achieved by automated tools, there is actually no need for sentiment analysis. Crises are marked by specific topics under discussion – it is these that need to be tracked. Their very presence will indicate where remedial or defensive action may be required – no sentiment required... In more strategic contexts, where business insight and support for business outcomes are critical, delivering accurate, reliable, consistent and robust sentiment analysis from trained human analysts massively outweighs the constant nagging doubt about the consistency and accuracy of data from an automated platform. In our experience, owners of valuable brands simply cannot, and indeed do not take the risk of using such potentially inaccurate data in determining the performance of their assets. 
The noticeable swing back to human-derived analytics from companies previously using automated-only tools is tangible proof of that particular pudding. As a sidebar, I would strongly dispute the study's view that neutral coverage is somehow less important than positive/negative sentiment, especially in relation to building and sustaining brands and reputation – and even more so in a competitive context. There are plenty of research studies showing that neutral brand visibility helps build awareness and, more importantly, also serves to build up reputational "trust bank" reserves...


7 years ago

Thanks for all the comments. Just to pick up on Mike Daniels' reference to his head-to-head debate with Mark Westaby on people vs. machine analysis – that piece can be found here: http://bit.ly/gLHJ8


7 years ago

First, we see automated sentiment scoring as part of our business process to assist analysts rather than as a stand-alone tool. Second, the white paper is not especially transparent about the statistical tools and methods used to arrive at its conclusion. It is meaningless to lump all the systems together and lament that only 30% of the posts were scored accurately. Likewise, it is not meaningful to say that the best system achieved around 50% accuracy. How was this calculated? Third, I continue to be amazed by the success achieved by computer scientists in their models using automated sentiment scoring. Fourth, I am surprised by the claim that Twitter is easier to rate due to short text length. All academic and research papers I have read state the opposite. Finally, are we falling into a trap of feeling that we have to provide a sentiment score for everything to achieve the results we need? I am regularly impressed with results reported by academics where they use manual scoring on a small sample of stories (say 1,000). Why do business clients always want scoring of everything? Is this overkill? Should we instead revert to an appropriate sample-based research design to address specific client questions?


7 years ago

Brian, while automated sentiment technology isn't perfect, it is improving on a steady basis as the technology evolves. At the same time, it is important to recognize that technology does a lot of grunt work in processing millions of conversations - something that couldn't be done manually. As well, there is a role for people to play alongside automated sentiment technology to make sure that the results can be edited or tweaked to reflect context, sarcasm, etc. In many respects, social media sentiment works effectively if there is a solid marriage between technology and people. Cheers, Mark. Mark Evans, Director of Communications, Sysomos Inc.


7 years ago

Unfortunately the study tested the most popular but least reliable of the systems available. I'm convinced that PR people would rather measure what is easy to measure than measure accurately. Just as an FYI, we routinely test humans against humans to ensure a minimum 90% intercoder reliability score and THEN test automated sentiment analysis against that. The only system that comes close is SAS's Social Media Analysis, but that's in part because they are a client of ours and used our coding instructions to design their system.


7 years ago

Trying them out "without training" makes no sense, and if I were a company using this sort of software to analyze my brand I'd make sure to train it first. To anyone who's familiar with the literature on this topic, it's not at all surprising that untrained accuracies would be abysmal. I agree with Mark Westaby: it's been demonstrated over and over again that an automatic sentiment analyzer needs to be trained to avoid being hopelessly bad, so this study is flawed.

