Practical Issues with LLM and AI Agent Projects




This is a loose catalog of common issues that Large Language Model (LLM) and AI agent projects often encounter in production. I will also include some thoughts on what we could experiment with to try to alleviate these issues. However, do note that many of these are major challenges that even big AI companies like OpenAI and Google are unable to fully resolve.

I wish I could say this is a short, concise article, but the list of issues is quite long! Unfortunately, LLMs are currently an immature technology. However, a lot of work is being done by massive companies to evolve the LLM ecosystem into something more production-ready.




Hallucinations and Probabilistic Outcomes

The Wikipedia page on AI hallucinations has a good, concise description of what they are:

a response generated by AI that contains false or misleading information presented as fact

It is amusing that the Wikipedia page unironically notes this is also called bullshitting. Hallucinations have been in the headlines since the release of LLMs such as ChatGPT: lawyers have been caught citing hallucinated cases, and Deloitte had to issue a partial refund to the Australian government over hallucinations found in a report.

Note that hallucinations are a fundamental feature of how current artificial intelligence (AI) and machine learning (ML) technologies work. LLMs are statistical models that attempt to produce the most probable output; they are not built to produce precise, deterministic outcomes.

A closely related issue is that LLM outputs are probabilistic: they often produce different outcomes even when the prompt is exactly the same. However, this might be less of a problem if the output is correct and usable. Many of the suggestions below are aimed at reducing the variability of an LLM's output, so they can be useful if probabilistic outcomes are a problem.


What We Could Try


Set the Temperature to 0

Opinions are mixed about this. I have seen big companies recommend it, but at the same time, people report that LLMs can behave oddly when the temperature is set to 0.
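To make the effect concrete, here is a toy sketch of temperature-scaled sampling. This is not any vendor's actual decoding code, just an illustration of why temperature 0 (greedy decoding) is deterministic while higher temperatures are not:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from logits scaled by temperature.

    As temperature approaches 0, sampling collapses to argmax
    (greedy decoding), which is what "temperature = 0" means in
    practice for most LLM APIs.
    """
    if temperature == 0:
        # Greedy: always pick the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(42)
logits = [2.0, 1.5, 0.1]  # toy scores for three candidate tokens

# Temperature 0 is deterministic: the same index on every call.
greedy = {sample_with_temperature(logits, 0, rng) for _ in range(100)}
# Temperature 1 is probabilistic: multiple indices appear.
sampled = {sample_with_temperature(logits, 1.0, rng) for _ in range(100)}
print(greedy)            # {0}
print(len(sampled) > 1)  # True
```

Note that even at temperature 0, real LLM serving stacks can still return slightly different outputs due to floating-point non-determinism across hardware and batching, which is one reason this setting is not a complete fix.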


Fine-Tune the Model

Fine-tuning was a common thing people tried in the early days, hoping that training on their own data would make the LLM better at producing accurate answers for their use case. I would not recommend this! See the next section for a list of reasons why.


Retrieval-Augmented Generation (RAG)

This potential solution is very popular right now. The idea is this: when the user enters a prompt, search for documents relevant to the prompt, then ask the LLM to generate its response using those documents.

This can take a bit of time and effort to build out correctly. Documents have to be chunked and then embedded, and a vector search strategy has to be set up to find and retrieve the chunks most relevant to the prompt.
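The retrieval half of the pipeline can be sketched as follows. This toy version uses bag-of-words vectors and cosine similarity so it runs standalone; a real pipeline would swap in an embedding model and a vector database, and the example documents are made up:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts. A real RAG pipeline would
    call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Refunds are processed within 14 days of a return request.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Shipping to international destinations takes 7 to 10 days.",
]

# Index step: embed every chunk ahead of time
# (here, one chunk per document for simplicity).
index = [(doc, embed(doc)) for doc in documents]

def retrieve(prompt, k=1):
    """Rank chunks by similarity to the prompt and return the top k."""
    qv = embed(prompt)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

question = "How long do refunds take?"
context = retrieve(question)

# The retrieved chunk(s) are then prepended to the LLM prompt.
augmented_prompt = (
    "Answer using only this context:\n" + "\n".join(context) +
    "\n\nQuestion: " + question
)
print(context[0])
```

Grounding the LLM in retrieved text reduces hallucinations, but does not eliminate them: the model can still misread the context, and retrieval can return the wrong chunks.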


Prompt Engineering

Ask for an output that is as short and precise as possible. Ask the LLM to check its own output; LLMs often admit that they were wrong or made things up when asked to double-check! Providing a few examples in the prompt can also improve performance. This is known as in-context learning.
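A few-shot prompt for in-context learning is simple to assemble. The sentiment-labeling task and example reviews below are hypothetical, chosen only to show the shape of the prompt:

```python
# Hypothetical labeled examples for a sentiment-labeling task.
examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took five minutes and it just worked.", "positive"),
]

def build_few_shot_prompt(examples, query):
    """Assemble a few-shot (in-context learning) prompt: labeled
    examples first, then the new input, with a short, constrained
    output format requested."""
    lines = ["Label the sentiment as 'positive' or 'negative'.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # cue a short, precise completion
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "Customer support never replied.")
print(prompt)
```

Ending the prompt with the bare "Sentiment:" label nudges the model toward a one-word answer, which is easier to validate programmatically than free-form prose.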


Prefer Deterministic Software

Use a deterministic software workflow unless the problem absolutely needs an LLM, and try weaving regular software into your LLM workflow. Even Anthropic recommends this approach!
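One way this looks in practice: let plain code handle the structured parts and reserve the LLM for the one step that genuinely needs it. The ORD-xxxxxx order-ID format and the ticket-handling scenario below are invented for illustration, and the LLM call is stubbed out:

```python
import re

def extract_order_id(text):
    """Deterministic step: order IDs follow a known pattern, so a
    regex handles this reliably with no LLM involved.
    (The ORD-xxxxxx format is a made-up example.)"""
    match = re.search(r"\bORD-\d{6}\b", text)
    return match.group(0) if match else None

def summarize_with_llm(text):
    """Placeholder for the one step that genuinely needs an LLM.
    A real implementation would call your model API here."""
    return f"[LLM summary of {len(text)} chars]"

def handle_ticket(text):
    # Regular software does the structured extraction; the LLM is
    # only asked to do what deterministic code cannot.
    order_id = extract_order_id(text)
    summary = summarize_with_llm(text)
    return {"order_id": order_id, "summary": summary}

result = handle_ticket("Order ORD-482910 arrived damaged, please advise.")
print(result["order_id"])  # ORD-482910
```

The benefit is that the regex either matches or it does not: that part of the pipeline can never hallucinate an order ID.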


Human in the Loop

This is probably the most common advice: have a human check everything. Of course, if the goal of the project is to automate certain tasks, having a human repeat those tasks might invalidate your project! However, a lot of LLM projects take the approach of offering suggestions to human operators, with a disclaimer that AI suggestions are not always correct.
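The suggestion-plus-review pattern can be sketched as an approval gate, where nothing the model produces is applied without a human decision. The ticket records and the `llm_suggest` stub below are hypothetical:

```python
def llm_suggest(record):
    """Stub for an LLM-generated suggestion; a real system would
    call a model API here."""
    return f"Suggested reply for ticket {record['id']}"

def process_with_review(records, approve):
    """Route every AI suggestion through a human decision before it
    is applied. `approve` stands in for the reviewer's judgment,
    e.g. a UI that shows the suggestion with an accept button."""
    applied, rejected = [], []
    for record in records:
        suggestion = llm_suggest(record)
        if approve(record, suggestion):
            applied.append((record["id"], suggestion))
        else:
            rejected.append(record["id"])
    return applied, rejected

tickets = [{"id": 1}, {"id": 2}]
# Simulated reviewer: only approves ticket 1.
applied, rejected = process_with_review(tickets, lambda r, s: r["id"] == 1)
print(len(applied), rejected)  # 1 [2]
```

Keeping the approval step as an explicit function makes it easy to audit later which outputs were accepted and which were rejected by the operator.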




This is a work in progress. More sections will be added in the future.