Beyond “Sounds Good”: A Practical Guide to LLM Evals
Bro, do you even Eval?
We’ve all been there. You ask an AI a question, it gives you a long, confident, and well-written answer. It sounds right. But is it? As we start using Large Language Models (LLMs) for real, important tasks, “sounds good” isn’t good enough. We need a way to measure their performance objectively.
That’s where Evals (evaluations) come in. Think of them as tests or report cards for your AI systems. Instead of guessing if an output is good, we run it through a series of checks to be sure.
Let’s break down how these evals work for different types of AI setups, starting simple and getting more complex.
1. Evals for LLMs: The Basic Health Check
What we’re testing: The raw knowledge and basic capability of the language model itself.
Imagine you just got a new employee straight out of a universal training program (the internet). You need to check their core skills before giving them a specific job. Evals for base LLMs do exactly that.
What we measure:
Factual Accuracy: Does it know true things? Does it make up false information (hallucinate)?
Reasoning (Basic): Can it follow simple logic?
Toxicity/Bias: Does it produce harmful, biased, or offensive content?
Following Instructions: Can it do what it’s told (e.g., “write in the style of a pirate”)?
How we test it (The “Exam Paper”):
We use standard sets of questions and tasks where we already know the correct answer.
Example Task: “What is the capital of Portugal?”
Expected Answer: “Lisbon.”
Eval Check: Does the model’s output contain the correct answer?
Example Task: “Complete this sentence: The opposite of hot is...”
Expected Answer: “cold.”
Eval Check: Is the completion logically correct?
Example Task: “Write a sentence about [a sensitive topic].”
Expected Answer: No specific answer, but a set of rules.
Eval Check: Does the output contain hate speech, slurs, or dangerous ideas? (A human or a safety classifier would score this).
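The first two checks above can be automated with a simple "does the output contain the expected answer" script. Here's a minimal sketch; the `model_outputs` values are hypothetical stand-ins for real responses from whatever LLM API you'd actually call.

```python
def contains_answer(output: str, expected: str) -> bool:
    """Case-insensitive check that the expected answer appears in the output."""
    return expected.lower() in output.lower()

# (question, expected answer) pairs where we already know the ground truth.
test_cases = [
    ("What is the capital of Portugal?", "Lisbon"),
    ("Complete this sentence: The opposite of hot is...", "cold"),
]

# Hypothetical model outputs; in practice you'd call your LLM for each question.
model_outputs = [
    "The capital of Portugal is Lisbon.",
    "The opposite of hot is cold.",
]

results = [
    contains_answer(out, expected)
    for (_, expected), out in zip(test_cases, model_outputs)
]
accuracy = sum(results) / len(results)
print(f"Accuracy: {accuracy:.0%}")  # prints "Accuracy: 100%"
```

The toxicity check doesn't fit this pattern, because there's no single expected string; as noted above, that one needs a human rater or a safety classifier.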
The Introspection: You wouldn’t trust a doctor who aced art class but failed biology. Similarly, if a model fails basic factual or safety evals, it’s not fit for any purpose, no matter how fluent it sounds.
2. Evals for LLM Reasoning Models: The Logic Exam
What we’re testing: The model’s ability to solve complex problems that require multiple steps of thought.
Now, let’s say our new employee needs to be an analyst. We need to test their ability to think, not just recall. This is for models specifically prompted or fine-tuned for reasoning (e.g., using Chain-of-Thought).
What we measure:
Multi-step Problem Solving: Can it break a big problem into smaller steps?
Mathematical Reasoning: Can it solve word problems?
Logical Deduction: Can it infer conclusions from a set of rules?
How we test it (The “Logic Puzzle”):
We give it problems where the answer isn’t a simple fact, but the result of a process.
Example Task: “Sarah has 3 apples. She gives 2 to Mark. Then, she buys 5 more. How many apples does she have now?”
Expected Reasoning Steps:
3 - 2 = 1, then 1 + 5 = 6.
Eval Check: Does the model’s reasoning process correctly show these steps, and does it arrive at the final answer of 6? The “working out” is as important as the answer.
Example Task: “All dogs are mammals. All mammals have spines. Fido is a dog. Does Fido have a spine?”
Expected Answer: “Yes.”
Eval Check: Can the model trace the logical chain (Dog -> Mammal -> Spine) to deduce the correct answer?
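For the apple word problem, a simple grader can check both things at once: whether the required working steps appear in the model's chain of thought, and whether the final number matches. This is a rough sketch; real reasoning evals usually parse the answer more carefully (e.g. a dedicated "Final answer:" field) rather than grabbing the last number.

```python
import re

def eval_reasoning(output: str, required_steps: list[str], final_answer: str) -> dict:
    """Score a chain-of-thought answer on two axes: are the expected working
    steps present (ignoring spacing), and does the last number in the output
    match the expected final answer?"""
    compact = output.replace(" ", "")
    steps_ok = all(step.replace(" ", "") in compact for step in required_steps)
    numbers = re.findall(r"\d+", output)
    answer_ok = bool(numbers) and numbers[-1] == final_answer
    return {"steps_correct": steps_ok, "answer_correct": answer_ok}

# Hypothetical model output for the apple problem.
output = "Sarah starts with 3 apples. 3 - 2 = 1. Then 1 + 5 = 6. She has 6 apples."
print(eval_reasoning(output, ["3 - 2 = 1", "1 + 5 = 6"], "6"))
```

A model that wrote only "6" would fail the `steps_correct` check, which is exactly the point: the working out is scored, not just the answer.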
The Introspection: Getting the right answer for the wrong reason is a major red flag. It means the model is guessing. A good reasoning eval proves the model’s “thinking” is sound, making it more trustworthy for complex tasks.
3. Evals for Retrieval: The Librarian’s Test
What we’re testing: The system’s ability to find the right pieces of information from a large database (like a vector database).
An LLM doesn’t know your company’s private data. So, we give it a “librarian” (a retriever) that finds relevant documents for it to read. We need to test the librarian, not the reader.
What we measure:
Relevance: Are the returned documents actually related to the question?
Recall: Did the system find all the important pieces of information?
Precision: Are the returned results only the important ones, or is there a lot of junk?
How we test it (The “Library Scavenger Hunt”):
We have a known set of documents and ask questions whose answers are inside them.
Example:
Document 1: “The company project ‘Alpha’ was launched in 2020.”
Document 2: “The company project ‘Beta’ focuses on sustainability.”
Document 3: “Employee benefits include health insurance and remote work.”
Test Question: “When was project Alpha launched?”
Perfect Retrieval: [Document 1]
Eval Check: Did the retriever return Document 1? (High Precision & Recall). Did it also return irrelevant documents like Document 2 or 3? (Low Precision).
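Precision and recall for the scavenger-hunt example above are straightforward set arithmetic. A minimal sketch, using made-up document IDs:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision: what fraction of retrieved docs are actually relevant?
    Recall: what fraction of the relevant docs did we manage to retrieve?"""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"doc1"}           # Document 1 holds the answer about project Alpha
retrieved = {"doc1", "doc3"}  # the retriever also returned an irrelevant doc

p, r = precision_recall(retrieved, relevant)
print(f"Precision: {p:.2f}, Recall: {r:.2f}")  # prints "Precision: 0.50, Recall: 1.00"
```

Here the librarian found the right document (perfect recall) but also brought back junk (50% precision), matching the failure mode described above.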
The Introspection: If the librarian brings you a cookbook when you asked for a car manual, the LLM (the reader) has no chance of giving a correct answer, no matter how smart it is. Garbage in, garbage out.
4. Evals for RAG: The End-to-End System Check
What we’re testing: The performance of the entire Retrieval-Augmented Generation pipeline—the librarian (retriever) and the reader (LLM) working together.
This is the most common real-world system. We need to know if the final answer to the user’s question is correct, given the knowledge base.
What we measure:
Answer Faithfulness/Grounding: Is the final answer based only on the retrieved documents, or did the LLM make up details (hallucinate) using its internal knowledge?
Answer Relevance: Does the final answer directly address the original question?
Context Utilization: Did the LLM correctly use the information it was given?
How we test it (The “Open-Book Exam”):
We give the system a question and a set of source documents (the “book”). The system must retrieve the right bits and generate an answer from them.
Example:
Source Documents: Same as above (Project Alpha launched in 2020, etc.).
Test Question: “What are the goals of project Beta and when was Alpha launched?”
Perfect RAG Output: “Project Beta focuses on sustainability. Project Alpha was launched in 2020.”
Eval Checks:
Faithfulness: Is every part of the answer supported by the source documents? (Yes).
Relevance: Does it answer both parts of the question? (Yes).
Hallucination Check: If the answer said “Project Alpha launched in 2021,” that would be a failure, even if it’s a fact the LLM “knows” from its training. The test is about the provided context.
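The faithfulness check can be sketched with a crude word-overlap heuristic: a sentence in the answer counts as "grounded" only if most of its content words appear in the retrieved context. This is a deliberately simple stand-in; production RAG evals typically use an NLI model or an LLM-as-judge for this step.

```python
import re

def _words(text: str) -> list[str]:
    """Lowercase alphanumeric tokens, punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def faithfulness_score(answer: str, context: str) -> float:
    """Fraction of answer sentences whose content words (length > 3) are
    at least 80% covered by the retrieved context. A crude grounding check."""
    context_words = set(_words(context))
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        words = [w for w in _words(sentence) if len(w) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= 0.8:
            supported += 1
    return supported / len(sentences) if sentences else 0.0

context = ("The company project 'Alpha' was launched in 2020. "
           "The company project 'Beta' focuses on sustainability.")
answer = "Project Beta focuses on sustainability. Project Alpha was launched in 2020."
print(faithfulness_score(answer, context))  # prints "1.0"
```

Swap "2020" for "2021" in the answer and the score drops, flagging exactly the hallucination described above: a claim not supported by the provided context, regardless of what the LLM "knows" internally.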
The Introspection: A RAG eval tells you if your entire system is reliable. A failure could be the retriever’s fault (it didn’t find the doc) or the LLM’s fault (it ignored the doc and hallucinated). This eval helps you pinpoint where the breakdown happened.
5. Evals for Agents: The CEO’s Performance Review
What we’re testing: The ability of an AI Agent to complete a multi-step, real-world goal by using tools, making decisions, and recovering from errors.
An Agent is an LLM that can do things—like use a calculator, search the web, or call an API. It’s like an autonomous employee. Evaluating it is complex because the path to success isn’t always a straight line.
What we measure:
Task Success: Did the agent ultimately accomplish the goal?
Efficiency: How many steps did it take? Did it use the right tools?
Robustness: If it hit a dead end or an error, did it recover and try a different approach?
How we test it (The “Simulated Project”):
We give the agent a high-level goal and a set of tools, then watch it work.
Example Task: “Find the price of a Tesla Model 3 and the nearest dealership to Zurich, then summarize it in a table.”
Tools Available:
web_search(), calculator(), format_table()
Expected Successful Workflow:
web_search("Tesla Model 3 price")
web_search("Tesla dealership near Zurich")
Extract the relevant prices and addresses.
format_table(data)
Eval Checks:
Final Answer: Is there a well-formatted table with the correct price and a valid dealership address? (Task Success).
Process: Did it use the search tool effectively? Did it get stuck in a loop or try to calculate the price instead of searching for it? (Efficiency).
Robustness: If the first search for “Tesla dealership” returned a closed one, did it recognize the issue and search again? (Robustness).
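One common way to automate parts of this review is to record the agent's tool calls as a trace and score the trajectory afterward. A minimal sketch, where the trace format and tool names (`web_search`, `format_table`) are assumptions for illustration:

```python
def eval_agent_trace(trace: list[tuple[str, str]],
                     required_tools: set[str],
                     max_steps: int = 10) -> dict:
    """Score a recorded agent run on three axes:
    - task_success: did the run end with a formatted table?
    - used_required_tools: did the agent use every tool the task needs?
    - efficient: did it stay within the step budget (no runaway loops)?"""
    tools_used = {tool for tool, _ in trace}
    return {
        "task_success": bool(trace) and trace[-1][0] == "format_table",
        "used_required_tools": required_tools <= tools_used,
        "efficient": len(trace) <= max_steps,
    }

# Hypothetical trace of (tool_name, argument) steps from the Tesla task.
trace = [
    ("web_search", "Tesla Model 3 price"),
    ("web_search", "Tesla dealership near Zurich"),
    ("format_table", "price + dealership data"),
]
print(eval_agent_trace(trace, {"web_search", "format_table"}))
```

This only covers the mechanical checks; judging whether the table's *contents* are correct, or whether a recovery from a failed search was sensible, still needs a human or an LLM judge.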
The Introspection: Evaluating an agent is like judging a chef on the final meal, not just their knife skills. You care about the outcome. A successful agent eval means you can trust the system to operate autonomously on complex tasks without constant babysitting.
The Bottom Line
Evals move us from wonder to trust. They replace “Wow, this is cool” with “I know this works for my specific needs.” By applying the right kind of test at each layer—from the raw model’s knowledge all the way up to an agent’s autonomous projects—we can build AI systems that are not just impressive, but are genuinely reliable and useful. Start simple, measure everything, and always be testing.
Ready to take it to the next level?
Check out my AI Agents for Enterprise course on Maven, be a part of something bigger, and join hundreds of builders developing enterprise-level agents.
Use this link to get $201 OFF!
You’re receiving this email because you’re part of our mailing list—and you’ve attended, registered for, or been invited to our MAVEN events. These emails are the only way to reliably receive updates from us. We don’t spam or sell your information. If you prefer not to receive our messages, simply unsubscribe below and we’ll respect your wishes.