Generative AI models mostly inaccurate when sourcing statistical data, finds trial

UK – Most current large language models failed to accurately answer a question focused on UK statistics, in an experiment published by the MRS Census and GeoDemographics Group (CGG).

The experiment, outlined in a report published on 21st October, Inaccuracies of Generative AI Based Tools for Extracting Data, asked several commercial chatbots and search interfaces, including ChatGPT, Google Gemini and Microsoft’s Copilot, a question about UK GDP (gross domestic product) in 1973.

Because the prompt related to pre-internet data, the experiment assumed that only a few statistically focused websites would contain the information needed to answer the question.

According to the report, most of the AI models got the answer wrong or refused to answer the question. Only one system returned the correct answers (according to ONS) the first time, while another got it right on the second attempt at prompting (with no changes to the prompt).

The report found that while the outputs looked coherent in terms of their vocabulary and grammar, the quality of the numbers provided was poor. Additionally, running the same question again was likely to result in a different answer.
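The kind of repeat-run check described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the report's method: the function name `summarise_runs`, the 1% tolerance and the sample values are all assumptions, and the numbers are placeholders rather than figures from the report or from ONS.

```python
from collections import Counter


def summarise_runs(answers, reference, rel_tol=0.01):
    """Summarise repeated answers to one prompt against a reference figure.

    `answers` holds one numeric answer per run, or None for a refusal.
    The 1% relative tolerance is an arbitrary assumption for illustration.
    """
    correct = [
        a is not None and abs(a - reference) <= rel_tol * abs(reference)
        for a in answers
    ]
    return {
        "runs": len(answers),
        "correct": sum(correct),
        "refusals": sum(a is None for a in answers),
        "distinct_answers": len(Counter(answers)),
        # 1-based index of the first run that matched the reference, if any
        "first_correct_run": correct.index(True) + 1 if any(correct) else None,
    }


# Placeholder values only: these are NOT figures from the report or from ONS.
runs = [95.0, None, 100.0, 100.4, 95.0]
summary = summarise_runs(runs, reference=100.0)
print(summary)
```

Counting distinct answers separately from correct ones keeps the two findings apart: a tool can be consistent and wrong, or occasionally right but unstable.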

Report authors Jaan Nellis and Peter Furness conducted the trials to test how AI tools handle a specific query about specific data, following a discussion in a CGG meeting earlier this year.

Speaking to Research Live, Furness explained: “It’s no longer the preserve of experts to get access to public datasets – anybody can type something into Google and get an AI summary, and it goes away and finds data and comes back with answers. That’s wonderful, but on the other hand, in the wrong hands – i.e. in the hands of perhaps less skilled people, people with axes to grind – it could be quite dangerous if the numbers being put out are not accurate. If there was a warning flag to be raised, we wanted to raise it.”

The report authors’ expectation, borne out in the test results, was that if you repeat a question, you get different results. 

Nellis said: “We tested Google’s chatbot and Google’s search agent in June/July. They should be one and the same, but they weren’t, back then – we got different results. Then we came back in September because we had some queries we wanted to resolve, and lo and behold Gemini was consistent across the Google platform – they were at least reporting the same things. [AI models] are refining all the time and you would hope that that refinement would lead to a better situation.

“We thought that asking for GDP was a simple thing to ask for, but it’s quite complicated because there are lots of different GDPs, so you need to be relatively cognisant of which GDP you want. The one we asked for was what we considered to be the most common, standard metric.”

The errors in the experiment results, according to Nellis, were primarily due to the search algorithm selecting old webpages with out-of-date figures.

An AI model can operate in two core ways. In one, you ask an LLM a question and receive a response that draws only on what it has learned, without using any external data. Most systems, however, use a technique called RAG (retrieval-augmented generation).

Nellis explained: “RAG runs a search algorithm to get what it considers to be a tight sample of pages that will have the answer in there somewhere. That’s the way we would do it if we were doing it by hand, but you get the issues with the search algorithm – is the search algorithm accurate? It’s slightly arbitrary which pages get pushed to the front and which don’t.” 
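The retrieval step Nellis describes can be illustrated with a toy example. This is a deliberately crude sketch under stated assumptions, not how any of the tested products work: the word-overlap scoring, the function names and the three invented pages are all hypothetical. It shows how a ranking rule can push an older page to the front simply because it matches the query wording more closely.

```python
def score(query, page_text):
    """Count how many distinct query words appear in the page (crude relevance)."""
    query_words = set(query.lower().split())
    page_words = set(page_text.lower().split())
    return len(query_words & page_words)


def retrieve(query, pages, k=2):
    """Return the k highest-scoring pages for the query."""
    ranked = sorted(pages, key=lambda p: score(query, p["text"]), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Assemble the text a language model would be asked to answer from."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in retrieved)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"


# Invented pages for illustration; they contain no real statistics.
pages = [
    {"id": "archived-page", "text": "uk gdp 1973 figure before later revisions"},
    {"id": "current-page", "text": "revised gdp series for the uk"},
    {"id": "unrelated-page", "text": "football results"},
]

query = "UK GDP 1973"
top = retrieve(query, pages, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Here the archived page wins because it happens to contain all three query words, so the model never sees the revised series. Nothing in the scoring rule knows which page is more up to date.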

The report also looked at whether there are tools in development which may be able to select statistical data more precisely. One tool, StatGPT, released by the IMF in partnership with EPAM Systems, reports only on high-quality sources by using SDMX-compliant data queries to source its RAG data. SDMX (statistical data and metadata exchange) is an ISO standard to describe statistical data and metadata and standardise queries across data providers.

Furness likens this to data producers having “nice little hooks sticking up in their data which the tools can then find and plug into”, but said this is not in place in the UK at the moment. While ONS, through its partner NOMIS, and the UK Data Service both provide an SDMX API, StatGPT is not currently connected to either. The CGG report recommends that the tool be connected to these UK sources, initially for testing and, if shown to be useful, for customer-facing deployment.

Nellis said: “We are not necessarily saying that StatGPT is the solution, but it’s a solution. LLMs are really good at translation – natural language into computer code. What StatGPT is doing is translating a natural language query into SDMX. Currently, it’s looking at country-level, supranational data, but the ONS does provide SDMX interfaces, so it could be done.”
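To give a sense of what a query looks like once it has been translated, the sketch below builds a data URL in the general shape used by SDMX REST services (a dataflow reference, a dot-separated series key, and start and end periods). It is an assumption-laden illustration, not StatGPT's implementation: the base URL, dataflow reference and dimension codes are hypothetical placeholders, and the structured query is written by hand here where a tool would have a language model produce it.

```python
from urllib.parse import urlencode


def build_sdmx_data_url(base_url, flow_ref, key_parts, start_period, end_period):
    """Build an SDMX-style REST data URL from structured query parts.

    flow_ref is "agency,dataflow,version"; key_parts are dimension codes
    joined with dots. All values passed below are made-up placeholders.
    """
    key = ".".join(key_parts)
    params = urlencode({"startPeriod": start_period, "endPeriod": end_period})
    return f"{base_url.rstrip('/')}/data/{flow_ref}/{key}?{params}"


# Hand-written stand-in for the output of the natural-language translation step.
structured = {
    "flow_ref": "EXAMPLE_AGENCY,EXAMPLE_GDP_FLOW,1.0",  # hypothetical dataflow
    "key_parts": ["A", "GB", "GDP_EXAMPLE"],  # hypothetical dimension codes
    "start_period": "1973",
    "end_period": "1973",
}

url = build_sdmx_data_url("https://sdmx.example.org/rest", **structured)
print(url)
```

Because the request names an exact dataflow, series and period, the data comes straight from the provider rather than from whichever web page a search ranks highest.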

For researchers and other individuals using AI models to source statistical data, the key implication of the report is the need to design and tailor prompts carefully. Furness said: “That might mean multiple iterations of the question.”

Nellis said: “Which then opens up the question – why not do it yourself? You’ve got to check what you see and not assume it’s right. Sometimes you might look at an AI result and see how it’s made that decision – it’s not necessarily the right decision, I see why it’s made the wrong decision. We need to be careful – when we had just landlines, you always remembered everybody’s phone number. The brain is lazy – you’ve got to keep sharp.”

Research Live is published by MRS.

