Practical Issues with LLM and AI Agent Projects

Sections

Introduction
Hallucinations and Probabilistic Outcomes
Fine-tuning and Catastrophic Forgetting
Prompt Injection
Expensive Token Cost
AI Agents
Long Context and Prompt

Introduction

This is a loose catalog of common issues that Large Language Model (LLM) and AI agent projects often encounter in production. I will also include some thoughts on what we could experiment with to try and alleviate the issues. However, do note that many of these are major challenges that even big AI companies like OpenAI and Google are unable to fully resolve.

I wish I could say that this is a short, concise article, but the list of issues is quite long! Unfortunately, at the time of writing, LLMs are currently an immature technology. A lot of work is being done by massive companies to evolve the LLM ecosystem into something more production-ready, but it is not clear if we will eventually get there.

There is one suggestion that will show up over and over again. Instead of repeating it in almost every single section, I am just going to state it up here:

Avoid using the LLM as much as possible.

Use a deterministic software workflow unless the problem absolutely needs an LLM. Try weaving regular software into your LLM workflow. Not using an LLM whenever possible is the best way to reduce issues caused by LLMs. Even Anthropic recommended this approach!

Hallucinations and Probabilistic Outcomes

The Wikipedia page has a good concise description of what hallucinations are:

a response generated by AI that contains false or misleading information presented as fact

It is also amusing that the Wikipedia page also unironically says it is also called bullshitting. Hallucinations have been in the headlines since the release of LLMs such as ChatGPT. For example, lawyers have been caught citing hallucinated cases. Deloitte had to issue a partial refund to the Australian government over hallucinations found in a report.

Note that hallucinations are a fundamental feature of how current artificial intelligence (AI) and machine learning (ML) technologies work. They are statistical models attempting to give the most probable output. They are not built to produce precise deterministic outcomes.

A closely related issue is the fact that LLM outputs are probabilistic. They often produce different outcomes even if the prompt is exactly the same. However, this might be less of a problem IF the outputs are still correct and/or usable. A lot of the suggestions below are aimed at reducing the variability of an LLM’s output. So, they can be useful if probabilistic outcomes are a problem.

List of Suggestions

Enable web search.

One thing that many have noticed is that LLM answers can get much better when web search is enabled. This allows them to go beyond their training dataset and take in new information from the internet. However, you will need to test if this is true for your use case. Also, web search is pretty much mandatory if your use case involves information that can change over time. For example, asking about the most current regulations or recent news.

Use a thinking/reasoning model.

People have also noticed that thinking/reasoning LLM models can have better accuracy. However, the tradeoff is that these models are often much more expensive. See the “expensive token cost” section further down below.

Setting the temperature to 0.

Opinions are mixed about this. I have seen big companies recommend this. But at the same time, people report that LLMs can behave weirdly when the temperature is set to 0. Alternatively, you can experiment with setting the temperature to a low value like 0.1 instead of going all the way to 0.

Fine-tuning.

This is a common thing that people tried to do in the early days. They were hoping that they could make the LLM better at producing accurate answers for their use case. I would not recommend this unless you absolutely know what you are getting yourself into! See the next section for a list of reasons why.

Retrieval Augmented Generation (RAG).

This potential solution is very popular right now. The idea is this: when the user enters a prompt, search for documents relevant to the prompt, then ask the LLM to generate their response using those documents.

This can take a bit of time and effort to build out correctly. Documents have to be chunked and then embedded. Then, a vector search strategy has to be set up to find and retrieve the ones most relevant to the prompt.

Prompt Engineering.

Ask for an output that is as short and precise as possible. Ask the LLM to check its output. LLMs often admit that they are wrong or made things up when asked to double-check! Providing a few examples can improve performance. This is known as in-context learning.

Human in the loop.

This is probably the most common advice: have a human check everything. Of course, if the goal of the project is to automate certain tasks, having a human repeat those tasks might invalidate your project! However, a lot of LLM projects take the approach of branding themselves as offering suggestions to human operators, with the disclaimer that AI suggestions are not always correct.

Avoid using the LLM as much as possible.

See the Introduction section.

Fine-tuning and Catastrophic Forgetting

As mentioned in the previous section, fine-tuning was popular in the early days, probably because there was nothing else people could do to improve LLM performance for their specific use case. The idea was to basically continue training the LLM with data consisting of input-output pairs to fine-tune the LLM into providing better output responses.

The current paradigm appears to be to try doing Retrieval Augmented Generation (RAG) instead. There are two key reasons:

Expensive and high effort.

Fine-tuning requires a lot of data to make an impact. The cost of training an LLM is high. It also requires dedicated effort. This is a big long-term project and not a short-term quick fix.

Catastrophic Forgetting.

Fine-tuning modifies the parameters of the LLM neural network based on the new data that we are supplying. We are hoping that this improves the LLM’s performance on our task. One phenomenon that has been reported is “catastrophic forgetting”, where the LLM performs worse at other tasks after fine-tuning. This is not unexpected. We are messing around with the neural network’s parameters; there are bound to be side effects!

Prompt Injection

Prompt injection is a massive security issue that every LLM and AI agent project will face. LLMs cannot tell the difference between a legitimate prompt and a malicious one.

This is reminiscent of SQL injection, where attackers try to trick your software into unintentionally running SQL code. However, in this case, we cannot “escape” the characters in the injection. The possibilities for prompt injection are essentially unlimited.

At the time of writing, while big AI companies have been trying to research solutions, there is currently no clear way to defend against this! This is why AI browsers and AI agents that can access the internet are both considered major security risks.

While the situation sounds rather dire, I do have a list of suggestions that might still help.

List of Suggestions

Scan the input text.

An easy first step would be to just scan the input text for prompt injection attempts. There are also packages being developed that try to scan for malicious prompts. I cannot recommend specific packages at the moment.

Restrict your project to internal use.

If you restrict your project to users in your company and track their inputs, it lowers the chance of a prompt injection attempt. Even if it happens, you will be able to track which employee carried out the attack. Don’t forget to also monitor for unauthorized use by external users.

Restrict the scope.

If your project allows you to, severely restricting the kind of allowed inputs will help mitigate prompt injections. For example, if you are only using the LLM to do named-entity recognition of input documents, you might be able to preprocess the input document and pick out only the relevant parts to feed into the LLM.

Avoid using the LLM as much as possible.

See the Introduction section.

Expensive Token Cost

LLMs can get very expensive. At the time of writing, a big worry of the entire AI industry is whether they can generate enough value from these LLMs to justify their cost. Here are some ways to try and avoid getting a big bill.

List of Suggestions

Try using smaller, older, or open-source models.

Smaller and/or older models are usually much cheaper than the current state-of-the-art models. Also, the word on the street is that open-source models are feasible for a lot of LLM or AI agent workflows. For example, OpenAI’s open-source models appear to have good reviews when it comes to building AI agents. Of course, you will still have to pay for the compute to run these open-source models.

Avoid reasoning models if possible.

Reasoning models are really compute-intensive, and so companies charge much higher prices for token usage. For example, see OpenAI’s pricing webpage for a comparison. At the time of writing, the input token price for the o3 reasoning model is 60% higher than that of their GPT 5.1 frontier model. There is another problem with reasoning models that makes them much more expensive when it comes to output tokens: they will produce a lot of text while performing their chain of thought reasoning process!

Ask for shorter prompts and answers if possible.

This suggestion is more of a hack that really depends on your use case. You might be able to constrain the input prompt and output answer to reduce your token cost. For example, if you are using the LLM to do named-entity recognition and are just looking for a specific phrase in a document, you just need it to output the phrase and not an entire essay!

Avoid using the LLM as much as possible.

See the Introduction section.

AI Agents

AI agents are, loosely speaking, “LLMs that can use tools”. Tools are regular functions written in a programming language, say Python. Remember, LLMs only have the ability to take in text prompts and produce text outputs. They cannot run software. So, the software has to be run externally. How this usually works is that the LLM is combined with a wrapper like LangChain. This wrapper adds instructions for tool calling to every prompt and then monitors the LLM text output for text triggers. Once the text triggers are seen in the LLM’s output, the wrapper runs the function.

These AI agents come with their own set of problems. At the time of writing, they are still a very immature technology and highly experimental. I would not recommend designing projects around agents, but here are some suggestions for working with them.

List of Suggestions

Avoid chaining agents.

Orchestrating an army of automated agents to do your work for you is the dream. But you should be aware that because LLM models produce probabilistic output and are prone to hallucinations, each individual agent will have a probability \( p < 1 \) of getting it correct. The chance of making an error compounds quite badly as the number of agents participating in the process grows. For example, if you have two independent agents, the probability that none of them makes an error is \( p^2 < p < 1 \).

So, the reliability gets worse by a factor of \( p \). In fact, the probability that at least one agent makes an error goes to \( 1 \) as the number of agents grows.

Avoid overloading the context window; avoid MCP if possible.

Anthropic created the Model Context Protocol (MCP) to allow AI agents access to a wide variety of tools. But one side effect is that MCP loads these tools into the context window of the LLM, but LLMs are terrible at handling long contexts. This is one well-known paper highlighting this. At the time of writing, Anthropic actually highlighted this problem and recommends avoiding MCP as much as possible! Even if you are not using MCP, you should be mindful that AI agent performance will most probably degrade as the context increases.

Avoid using the LLM as much as possible.

See the Introduction section.

Long Context and Prompt

This is going to be a very short section. I already talked about how LLMs do not handle long prompts (“long context”) well in the AI agents section. This is one reason why Anthropic actually suggests not using the MCP that they created. I also brought up this well-known paper. All I have to say about this is avoid having long contexts or prompts! If your input is a long document, definitely try chunking it into smaller pieces.