Why Your LLM Product Probably Sucks (And How Evals Will Save You)
Most PM teams ship AI features without systematic testing. Learn how to build lightweight eval systems that keep quality high without needing an ML background.
I watched a product team spend six months building an AI-powered customer support feature. They demoed it to executives with cherry-picked examples. It looked magical. They shipped it to 10% of users. Within 48 hours, the feature was generating responses that told users to "contact the ancient spirits" for billing issues.
No one had systematically tested it. They had vibes, demos, and hope. But no evals.
If you're building LLM-powered products without evaluation systems, you're essentially shipping blindfolded. The scary part? Most product managers don't realize they need evals until something breaks publicly.
Key Takeaways
- Traditional product metrics fail for LLM products—you need systematic evaluation before and after launch
- Evals aren't just for ML engineers; PMs can build basic evaluation systems in hours, not weeks
- The most critical evals measure real user outcomes, not model performance
- Start with qualitative human evals before scaling to automated systems
- Your prompt changes need version control and A/B testing, just like feature releases
The Uncomfortable Truth About LLM Products
Here's what nobody tells you when you start building AI products: LLMs are non-deterministic, constantly evolving systems that will break in ways you cannot predict.
Unlike traditional software, where `if (x == 5)` behaves the same way every time, LLMs are probabilistic. The same prompt with the same input can produce different outputs. Model providers update their models (sometimes without warning). Your production data distribution shifts. Edge cases emerge that your test set never covered.
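You can see this for yourself in a few lines. A minimal sketch, assuming you're calling a hosted model through the OpenAI Python client; the model name and prompt are placeholders:

```python
# Minimal sketch: the same prompt, sent twice, can come back different.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Summarize our refund policy in one sentence."

for attempt in range(2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you're building on
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    print(f"Attempt {attempt + 1}: {response.choices[0].message.content}")
```

Even at the same temperature, the two completions will rarely match word for word.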
You cannot ship LLM products the way you ship traditional features. Yet most PMs try exactly that.
⚠️ Reality Check
I've reviewed 30+ AI product launches in the past year. Only 3 had systematic evaluation frameworks before launch. Two of them worked well. The others? Firefighting mode for months.
What Are Evals, Really?
Evaluations (evals) are systematic tests that measure whether your LLM-powered product actually works. Not "works in the demo" or "works on Tuesday." Works consistently for real users with real edge cases.
Think of evals as your quality assurance system for AI products. But unlike QA for traditional software, evals need to assess:
- Correctness: Does the output solve the user's problem?
- Safety: Does it avoid harmful, biased, or inappropriate responses?
- Consistency: Does it maintain quality across different inputs and contexts?
- Latency: Is it fast enough for the use case?
- Cost: Are you burning money on unnecessarily long completions?
The Three Types of Evals You Actually Need
1. Pre-launch Evals (Development Testing)
- Run before you deploy changes
- Test on curated examples covering edge cases
- Compare prompt versions against each other
- Catch regressions before users see them
2. Continuous Evals (Production Monitoring)
- Run automatically on production traffic samples
- Monitor for quality degradation over time
- Alert when metrics drop below thresholds
- Track performance across user segments
3. Human Evaluation (Ground Truth)
- Domain experts review actual outputs
- Establish baseline quality standards
- Validate automated eval metrics
- Uncover issues automation misses
📚 Contrarian Corner
Most teams start with automated evals because they scale. This is backwards. Start with human evals on 50-100 examples. Learn what "good" actually looks like. Then automate. Teams that reverse this spend months building evals that measure the wrong things.
Why Traditional PM Metrics Fail for LLM Products
You cannot measure LLM product success the way you measure a checkout flow. Here's why:
Scenario: You launch an AI writing assistant. Your traditional metrics look great:
- 75% adoption rate ✓
- Users generate 50K words per day ✓
- 4.2/5 star rating ✓
But when you actually read the outputs, you discover:
- 40% of generated content is generic filler
- Users spend 30 minutes editing AI outputs that should take 10 minutes to write from scratch
- The AI occasionally suggests plagiarized content
- Power users have stopped using it entirely
Traditional engagement metrics told you everything was fine. They lied.
LLM products require output quality metrics, not just usage metrics. You need to know:
- What percentage of outputs are actually used vs. discarded?
- How much time do users spend editing AI outputs?
- What's the quality distribution across different use cases?
- When do users abandon the AI and do it manually?
Getting Started: Your First Eval System (No PhD Required)
Most PMs avoid building evals because they assume it requires deep ML expertise. It doesn't. You can build a basic eval system in a few hours using tools you already know.
Phase 1: Build Your Evaluation Dataset (Week 1)
Step 1: Collect Real Examples
Don't use synthetic data. Don't ask ChatGPT to generate test cases. Use real user inputs from:
- User research sessions
- Customer support tickets
- Beta user logs (with permission)
- Your own team testing the product
Aim for 50-100 examples that cover:
- Happy path scenarios (30%)
- Edge cases (40%)
- Adversarial inputs (20%)
- Known failure modes (10%)
Step 2: Define "Good" Explicitly
For each example, write down what a good response looks like. Not the exact words—the characteristics:
- Answers the actual question asked
- Uses appropriate tone
- Includes necessary context
- Avoids prohibited content
- Stays within length constraints
Create a simple rubric. Here's one I use:
Score 5: Perfect response, ready to ship
Score 4: Minor tweaks needed, mostly there
Score 3: Right direction, significant gaps
Score 2: Misses the mark, some useful elements
Score 1: Fundamentally wrong or harmful
Step 3: Store It Properly
Create a simple spreadsheet:
- Column A: Input (user query/context)
- Column B: Expected behavior (rubric criteria)
- Column C: Current output
- Column D: Score (1-5)
- Column E: Notes (what failed/succeeded)
- Column F: Category (feature/edge case/adversarial)
⚡ Implementation Guide
Use Google Sheets or Airtable for this. Resist the urge to build custom tooling yet. You'll waste weeks building infrastructure when you should be learning what to measure. Tool sophistication comes later.
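That said, when you eventually want to run scripts over the same data, each row maps cleanly to a small record. A sketch of what one case might look like in code; the field names are my own convention, not a standard:

```python
# One eval case, mirroring the spreadsheet columns. Values are illustrative.
eval_case = {
    "input": "Why was I charged twice this month?",           # Column A
    "expected_behavior": "Acknowledges the double charge, explains how to "
                         "check billing history, offers an escalation path",  # Column B
    "current_output": "",     # Column C, filled in after each eval run
    "score": None,            # Column D, 1-5 against the rubric
    "notes": "",              # Column E
    "category": "edge case",  # Column F
}
```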
Phase 2: Run Your First Evals (Week 2)
Manual Evaluation Process
1. Generate outputs for all test cases
- Run your current prompt/system against all examples
- Save the outputs (timestamp them!)
- Note any errors or timeouts
2. Score them systematically
- Blind evaluation if possible (don't look at which prompt version generated what)
- Use your rubric consistently
- Take notes on why you scored each way
- Flag surprising failures
3. Analyze patterns (a short script can do the math for you; see the sketch below)
- What's your average score? (Baseline metric)
- Which categories perform worst?
- Are there systematic failures?
- What percentage would you ship? (Your quality bar)
This gives you:
- A baseline quality score
- Visibility into failure modes
- Confidence (or fear) about shipping
- Data for stakeholder conversations
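If you export the sheet to CSV, the pattern analysis is a few lines of Python. A sketch, assuming an export named eval_results.csv with the Score and Category columns from Phase 1:

```python
# Sketch: baseline and per-category averages from the exported eval sheet.
import csv
from collections import defaultdict

scores = []
by_category = defaultdict(list)

with open("eval_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        score = int(row["Score"])
        scores.append(score)
        by_category[row["Category"]].append(score)

print(f"Baseline: {sum(scores) / len(scores):.2f} average across {len(scores)} cases")
print(f"Shippable (score 4 or 5): {sum(s >= 4 for s in scores) / len(scores):.0%}")
for category, cat_scores in sorted(by_category.items()):
    print(f"  {category}: {sum(cat_scores) / len(cat_scores):.2f} ({len(cat_scores)} cases)")
```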
⚠️ Reality Check
Your first eval results will be depressing. You'll discover your AI feature works well on maybe 60-70% of cases. This is normal. Better to learn this in evals than in production. One team I worked with discovered their chatbot failed on 45% of realistic user queries. They thought they were weeks from launch. They were actually months away. Evals saved them from a disaster launch.
Phase 3: Scale to Semi-Automated Evals (Week 3-4)
Once you have 50+ manually scored examples, you can start automating parts of the evaluation.
The LLM-as-Judge Pattern
Use a strong LLM (like Claude or GPT-4) to evaluate your product's outputs. Yes, this sounds circular. It works surprisingly well for initial quality signals.
Here's a simple evaluation prompt structure:
You are evaluating a customer support response for quality.
USER QUERY: {user_input}
RESPONSE TO EVALUATE: {model_output}
EVALUATION CRITERIA:
- Directly addresses the user's question
- Provides accurate information
- Uses helpful, professional tone
- Includes relevant next steps
- Avoids jargon or overly technical language
Score this response from 1-5 and explain your reasoning.
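Wiring that prompt into an actual judge takes very little code. A sketch using the Anthropic Python client; the model name is a placeholder and the response parsing is deliberately naive:

```python
# Sketch: LLM-as-judge via the Anthropic Python client.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_TEMPLATE = """You are evaluating a customer support response for quality.

USER QUERY: {user_input}

RESPONSE TO EVALUATE: {model_output}

EVALUATION CRITERIA:
- Directly addresses the user's question
- Provides accurate information
- Uses helpful, professional tone
- Includes relevant next steps
- Avoids jargon or overly technical language

Score this response from 1-5 on the first line, then explain your reasoning."""

def judge(user_input: str, model_output: str) -> str:
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use whichever strong model you have
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(user_input=user_input, model_output=model_output),
        }],
    )
    return message.content[0].text  # first line carries the 1-5 score

print(judge("Why was I charged twice?", "Please consult the ancient spirits about your bill."))
```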
When LLM-as-Judge Works:
- Assessing tone, helpfulness, clarity
- Checking if requirements are met
- Comparing relative quality of different versions
- Catching obvious failures
When It Doesn't:
- Domain-specific correctness (needs human experts)
- Subtle bias detection
- Cultural appropriateness
- Complex reasoning errors
Pro tip: Always validate your LLM judge's scores against human scores on a subset. If the correlation is weak (aim for above 0.7), your judge prompt needs work.
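That validation is only a few lines once you have both sets of scores for the same cases. A sketch, assuming the human and judge scores are paired lists in the same case order:

```python
# Sketch: check how closely judge scores track human scores on the same cases.
import numpy as np

human_scores = [5, 4, 2, 3, 5, 1, 4, 3, 2, 4]  # illustrative values
judge_scores = [5, 4, 3, 3, 4, 2, 4, 3, 1, 5]

correlation = np.corrcoef(human_scores, judge_scores)[0, 1]
print(f"Human vs. judge correlation: {correlation:.2f}")
if correlation < 0.7:
    print("Don't trust the judge for automated scoring yet; rework its prompt.")
```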
Phase 4: Prompt Optimization Loop (Ongoing)
Now you have an eval system. Use it to actually improve your product:
The Systematic Prompt Iteration Process:
1. Make ONE change
- Modify the system prompt
- Add/remove examples
- Change temperature or parameters
- Update retrieval logic
2. Run evals on your test set
- Compare new version vs. baseline
- Look at overall score change
- Dig into specific examples that regressed
3. Analyze the trade-offs
- Did you improve case X while breaking case Y?
- Is the improvement significant enough?
- What's the cost impact?
4. Version control everything (a minimal changelog sketch follows this list)
- Git commit your prompt changes
- Tag with eval scores
- Document why you made the change
- Track what worked and what didn't
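The changelog piece can be as simple as a JSON file that lives next to your prompts in the repo. A sketch of that record-keeping; the file name and fields are my own convention, not a standard:

```python
# Sketch: append each prompt change and its eval score to a JSON changelog.
import datetime
import json
import pathlib

CHANGELOG = pathlib.Path("prompt_changelog.json")

def log_prompt_change(version: str, change: str, avg_score: float, notes: str = "") -> None:
    entries = json.loads(CHANGELOG.read_text()) if CHANGELOG.exists() else []
    entries.append({
        "version": version,                         # e.g. the git commit hash or tag
        "date": datetime.date.today().isoformat(),
        "change": change,                           # what you modified and why
        "avg_eval_score": avg_score,                # from the eval run for this version
        "notes": notes,                             # regressions, trade-offs, surprises
    })
    CHANGELOG.write_text(json.dumps(entries, indent=2))

log_prompt_change("a1b2c3d", "Added refund-policy example to system prompt", 3.9,
                  notes="Fixed billing cases; slight regression on tone")
```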
📚 Key Insight
Prompt optimization is empirical, not intuitive. What you think will improve quality often doesn't. What seems like a minor tweak can have major impacts. The only way to know is to eval systematically.
The Framework: RITE (Rapid Iteration Through Evals)
After working with dozens of PM teams on LLM products, I developed this framework for integrating evals into product development:
R - Real examples define your test set. Start with actual user inputs, not synthetic scenarios. Your test set should make you uncomfortable: include the edge cases that will break your product.
I - Iterate on small batches. Don't try to fix everything at once. Pick one failure category, improve it, and validate that the fix didn't regress other cases. Ship incrementally.
T - Track everything systematically. Version control prompts like code. Every change gets an eval run. Maintain a changelog of what you tried and the results. Build institutional memory.
E - Escalate to humans strategically. Automate what you can, but keep humans in the loop for:
- Establishing ground truth
- Validating edge cases
- Catching issues automation misses
- Understanding why failures happen
Common Traps (And How to Avoid Them)
Trap 1: Building Perfect Evals Before Shipping Anything
I've seen teams spend 3 months building comprehensive evaluation systems before testing with real users. Don't do this. Start simple, ship to small beta groups, learn fast.
Better approach: 50 manual evals → small beta → learn → expand evals → broader launch
Trap 2: Only Testing on Clean Data
Your eval set should include:
- Typos and grammatical errors
- Vague or ambiguous queries
- Inappropriate requests
- Out-of-scope questions
- Multiple languages (if relevant)
- Edge cases that will embarrass you
If your eval set makes you confident, it's too easy.
Trap 3: Treating Evals as a One-Time Thing
Your LLM product needs continuous evaluation because:
- Model providers update their models
- Your user base evolves
- New edge cases emerge
- Your prompts drift over time
Set up weekly eval runs on a sample of production data. Alert when quality drops below thresholds.
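That weekly run doesn't need heavy infrastructure. A sketch of its shape; fetch_production_sample and judge_score are placeholders for whatever sampling and scoring you already have:

```python
# Sketch: weekly quality check on a sample of production traffic.
QUALITY_THRESHOLD = 3.5  # your quality bar, in rubric points
SAMPLE_SIZE = 100

def fetch_production_sample(n: int):
    """Placeholder: pull n recent (user_input, model_output) pairs from your logs."""
    raise NotImplementedError

def judge_score(user_input: str, model_output: str) -> int:
    """Placeholder: return a 1-5 score, e.g. via your LLM-as-judge prompt."""
    raise NotImplementedError

def weekly_eval() -> None:
    pairs = fetch_production_sample(SAMPLE_SIZE)
    scores = [judge_score(user_input, output) for user_input, output in pairs]
    average = sum(scores) / len(scores)
    print(f"Weekly average: {average:.2f} across {len(scores)} sampled interactions")
    if average < QUALITY_THRESHOLD:
        # Wire this into whatever alerting you already use (Slack, email, PagerDuty).
        print(f"ALERT: quality dropped below {QUALITY_THRESHOLD}")
```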
Trap 4: Optimizing for the Wrong Metric
"We improved our eval score by 15%!"
Great. Did user satisfaction improve? Did retention go up? Did support tickets go down?
Your eval metrics should predict real user outcomes. Validate this constantly. If your eval score improves but users hate it, your evals are measuring the wrong thing.
Your Action Plan: First 30 Days
Week 1: Build Foundation
- Collect 50 real user examples
- Create evaluation rubric
- Manually score current system
- Identify top 3 failure modes
Week 2: First Iteration
- Modify prompt to address #1 failure mode
- Re-run evals
- Compare results
- Document what worked
Week 3: Scale Up
- Expand test set to 100 examples
- Set up LLM-as-judge for automated scoring
- Validate automated scores vs. human scores
- Create eval dashboard
Week 4: Establish Process
- Version control your prompts
- Set up pre-deployment eval gate
- Schedule weekly eval reviews
- Define quality thresholds for launch
Tools You Can Start With Today
You don't need expensive enterprise platforms to start. Here's what actually works:
For storing and running evals:
- Google Sheets (seriously, start here)
- Airtable (when you need more structure)
- GitHub (for version controlling prompts and test cases)
For automated evaluation:
- Direct API calls to Claude/GPT-4 for LLM-as-judge
- Simple Python scripts (50 lines can get you far)
- Notion databases (for team collaboration)
When to graduate to real tools:
- You have >500 eval cases
- Multiple team members need access
- You need CI/CD integration
- Budget exists for proper tooling
The Question You Should Be Asking
Not "How do I build evals?" but "What happens to my users when my LLM product fails?"
That question should terrify you into building evaluation systems. Because the cost of poor quality LLM outputs isn't just bad metrics—it's users who stop trusting you, regulatory risk, brand damage, and support nightmares.
The good news? Unlike model training or infrastructure scaling, evals are entirely in your control as a PM. You don't need to wait for ML engineers or research scientists. You can start today.
Start This Week
- Open a spreadsheet right now
- Write down 10 real examples of inputs your LLM product needs to handle
- Score your current system on those 10 examples
- Pick the worst one and improve it
- Score it again
That's an eval system. Everything else is just scaling what you learned.
The teams that win with LLM products aren't the ones with the most sophisticated infrastructure or the biggest models. They're the ones who systematically measure quality and improve it relentlessly.
Your move.
Download the Simple Eval Starter Kit: Get the spreadsheet template, rubric framework, and LLM-judge prompts I use to help teams launch their first evals in under a week. [Link to template]
What's your biggest challenge with LLM product quality? Share in the comments or reach out—I read every message and often feature community challenges in future articles.