I wanted to extract some crime statistics broken down by type of crime and by population group, all of course normalized by population size. I got a nice set of tables summarizing the data for each year I requested.
When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question is: how common a problem is this?
I compared results from ChatGPT-4, Copilot, and Grok, and the results are the same (Gemini says the data is unavailable, btw :)
So, are LLMs reliable for research like that?
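For context, the only sanity check I know how to do myself is recomputing the per-capita rates from raw counts. Something like this rough sketch, where every number is made up rather than taken from any model or source:

```python
# Hypothetical spot-check: recompute per-capita rates from official raw
# counts and compare them to LLM-reported figures. All numbers are made up.

llm_rates = {"burglary": 4.2, "assault": 7.9}          # per 1,000 residents
official_counts = {"burglary": 4150, "assault": 8010}  # raw incident counts
population = 1_000_000

for crime, reported in llm_rates.items():
    actual = official_counts[crime] / population * 1_000
    status = "OK" if abs(actual - reported) / actual <= 0.05 else "FLAG"
    print(f"{crime}: LLM {reported:.1f} vs source {actual:.2f} -> {status}")
```

But of course that only works if the raw counts are available somewhere, which is the whole problem.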
I work on a project where we are trying to analyse financial data. We use Claude and Llama, and they are really good; it took us a few months to reach 87% reliability.
For our application, that's probably almost enough. For an application that needs 100% accuracy all the time, every time, that's still quite a long way off.
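By "reliability" I mean the fraction of extracted fields that match a hand-labelled validation set. A stripped-down sketch of how a run gets scored; the field names and figures are placeholders, not our actual pipeline:

```python
# Stripped-down reliability score: share of model extractions that match
# hand-labelled ground truth within a tolerance. Placeholder data only.

def reliability(predictions: dict[str, float], gold: dict[str, float],
                tol: float = 0.01) -> float:
    """Fraction of gold fields the model reproduced within relative tolerance."""
    hits = sum(
        1 for field, truth in gold.items()
        if field in predictions and abs(predictions[field] - truth) <= tol * abs(truth)
    )
    return hits / len(gold)

gold = {"revenue": 120.4, "net_income": 8.7, "eps": 1.31}          # hand-labelled
model_output = {"revenue": 120.4, "net_income": 9.1, "eps": 1.31}  # one LLM run

print(f"reliability: {reliability(model_output, gold):.0%}")  # -> 67% on this toy set
```

Most of our few months went into building and cleaning that gold set, not into prompting.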