The language of the internet

Johannes Eichstaedt mines social data to determine the psychological state of populations, with some compelling findings. He spoke to Jane Bainbridge.

Throughout his academic career, Johannes Eichstaedt has always worked with data – it’s just that initially, as a physicist, it was particle physics data rather than psychological data he was processing. But when he realised that he “didn’t much care for working in a particle accelerator”, Eichstaedt switched to psychology.

Now, as a data scientist in psychology at the University of Pennsylvania, and co-founder of the World Well-Being Project, his time is spent using natural language processing to measure well-being among populations.

His research has included using tweets to predict heart disease and Facebook statuses to identify depression.

Language patterns

In the case of heart disease and Twitter, he was part of the team of scientists that analysed more than 50,000 tweeted words to characterise community-level psychological correlates of dying from atherosclerotic heart disease (AHD) in the US.

The language patterns identified as risk factors reflected negative social relationships, disengagement and negative emotions such as anger, while positive emotions and psychological engagement emerged as protective factors. In findings published in Psychological Science, the researchers reported that “a cross-sectional model based only on Twitter language predicted AHD mortality significantly better than a model combining 10 common demographic and socioeconomic risk factors, including smoking, diabetes and obesity”.

Twitter topics that correlated positively with county-level AHD mortality included hostility and aggression; hate and interpersonal tension; and boredom and fatigue. Topics that correlated negatively included skilled occupations; positive experiences; and optimism.

“It’s not that Twitter has some magical prediction power that other variables don’t have. It’s an extremely good predictor of income and education and of communities where people smoke – so it picks up predictors of health behaviour, and then it adds a sliver of psychological causation that the other variables don’t seem to be getting at,” says Eichstaedt.

A latent Dirichlet allocation (LDA) algorithm crunches the data by working with 2,000 language clusters, or topics, that distil what people talk about in their Facebook statuses or tweets.
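To make the idea of language clusters concrete, here is a minimal Python sketch of how one community's posts might be scored against a set of clusters once those clusters exist. The article does not describe the actual pipeline, so everything here is an assumption for illustration: the two toy clusters, their word weights, the whitespace tokenisation and the function name `topic_usage` are all hypothetical.

```python
from collections import Counter

# Hypothetical cluster -> word weights. A real model would learn
# around 2,000 clusters from data; these two are made up.
TOPIC_WORD_WEIGHTS = {
    "hostility": {"hate": 0.6, "angry": 0.4},
    "optimism": {"hope": 0.5, "great": 0.5},
}


def topic_usage(posts, topic_word_weights):
    """Score each cluster as the sum of (word weight * relative word frequency)."""
    # Naive tokenisation: lowercase and split on whitespace.
    words = [word for post in posts for word in post.lower().split()]
    total = len(words)
    if total == 0:
        return {topic: 0.0 for topic in topic_word_weights}
    counts = Counter(words)
    return {
        topic: sum(weight * counts[word] / total for word, weight in weights.items())
        for topic, weights in topic_word_weights.items()
    }


# Toy posts standing in for one county's tweets.
county_posts = ["hope today is great", "so angry right now"]
scores = topic_usage(county_posts, TOPIC_WORD_WEIGHTS)
print(scores)
```

Scores like these, computed per county, are the kind of feature a prediction model could then relate to health outcomes.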

But how accurate is this social media data? Eichstaedt says there are two biases in it: sample bias and desirability bias. However, he says the sample bias is overestimated: “The median age on Twitter is 32 and for the US population it’s about 36/37”, adding that once the sample is big enough, the model re-stratifies the sample to be more representative.
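One common way to make a skewed sample more representative is post-stratification weighting, sketched below in Python. This is an assumption about what re-stratifying could look like, not a description of Eichstaedt's model; the age groups, counts, population shares and group scores are invented for the example.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each group = population share / sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }


def weighted_mean(group_means, sample_counts, weights):
    """Mean of group scores after re-weighting every respondent in each group."""
    numerator = sum(weights[g] * sample_counts[g] * group_means[g] for g in group_means)
    denominator = sum(weights[g] * sample_counts[g] for g in group_means)
    return numerator / denominator


# Hypothetical sample that over-represents younger users.
sample_counts = {"under_35": 600, "35_plus": 400}
population_shares = {"under_35": 0.5, "35_plus": 0.5}  # made-up population split
group_means = {"under_35": 2.0, "35_plus": 4.0}        # made-up average scores

weights = poststratification_weights(sample_counts, population_shares)
raw = sum(sample_counts[g] * group_means[g] for g in group_means) / sum(sample_counts.values())
adjusted = weighted_mean(group_means, sample_counts, weights)
print(raw, adjusted)
```

In this toy case the over-represented younger group is weighted down, so the adjusted average moves toward the older group's score.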

He says there’s some evidence that people misrepresent themselves (desirability bias), in particular by suppressing negative emotion, but that “the variance between people is still highly interpretable”.

Facebook data

The two platforms also differ in the data they yield. With Facebook data, users have to give permission, which keeps samples small: “if you get 50,000 in a sample that’s amazing – generally data from Facebook users is 3 to 10 times as good as Twitter users”.

For his depression study he used Facebook data.

“For psychological insight, Facebook is preferable; it’s just you can’t get its data for that many people.” But in this case he was using data collected by someone else, which he reinterpreted to understand depression.

Looking forward, he thinks diabetes, which seems to have a lot of behavioural predictors, might be an area worth researching. But not all areas of well-being research are ripe for social media data analysis.

“As long as your data is big enough it will always work; the question is, will it improve on other methods? And there the answer is sometimes no; when trying to predict something like cancer, it didn’t work because income and education appear to be a much better predictor than what’s happening on Twitter.”

Research Live is published by MRS.
