Your code has tests. Your code has a CI pipeline. A bad change can't merge
without going green.
Your prompts? Vibes. A teammate edits the system prompt to fix one customer
complaint, output quality drops 8% on the other 99% of cases, nobody
notices for a month, and the regression eventually surfaces as a
mysterious churn bump in the metrics deck.
This post is the 5-minute setup that closes that gap.
What "tests for prompts" actually means
There are two viable approaches and you need to know which to use when.
Assertion-based. You write code that checks the LLM output against
fixed rules: regex matches, JSON shape validation, field-presence checks,
length bounds. Fast, cheap, deterministic.
Use it when: the output is structured and the contract is rigid. JSON
extraction, classification, function-call payloads, schema-conformant
generation.
LLM-judge. Another LLM compares the candidate output to a baseline and
returns "regressed: yes/no" with a severity score. Slower, costs a few
cents per comparison, handles fuzzy outputs.
Use it when: the output is freeform β summaries, rewrites, creative
generation, anything where two correct answers can look very different.
A mature setup uses both. PromptFork ships the LLM-judge built in (we
chose Claude Haiku at temp 0 with a strict "only flag strictly worse"
rubric); assertions are easy to add yourself in custom test cases.
The 5-minute setup
1. Pin your prompts in version control
prompts/
summarize_ticket.txt
extract_email.txt
classify_intent.txt
Plain text files. Not constants in prompts.py. Not Notion docs. Files
with a git history.
2. Push them to PromptFork
pip install promptfork
export PROMPTFORK_API_KEY=pf_xxxx
for f in prompts/*.txt; do
name=$(basename "$f" .txt)
promptfork push "$name" --file "$f" --message "initial commit"
done
This creates v1 of each prompt server-side and gives you a stable identifier.
3. Add test cases
For each prompt, pin 5-30 representative inputs. Real production inputs are
worth 10x synthetic ones.
promptfork add-test summarize_ticket happy_path \
--input ticket="Order arrived. Loved it." \
--rubric "summary should be positive and under 20 words"
promptfork add-test summarize_ticket angry_refund \
--input ticket="3 weeks late, want money back NOW" \
--rubric "must mention refund and high urgency"
promptfork add-test summarize_ticket edge_garbled \
--input ticket="hi pls help thx" \
--rubric "summary should request more info, not invent details"
Three test cases is a starting point. Six is a good baseline. Thirty is
production-grade.
4. Wire the GitHub Action
# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests
on:
pull_request:
paths:
- 'prompts/**'
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Push current prompts
env:
PROMPTFORK_API_KEY: ${{ secrets.PROMPTFORK_API_KEY }}
run: |
pip install promptfork
for f in prompts/*.txt; do
name=$(basename "$f" .txt)
promptfork push "$name" --file "$f" \
--message "PR #${{ github.event.pull_request.number }}"
done
- uses: shaunvand/promptfork-cli@v0
with:
prompt: summarize_ticket
baseline: 1
api-key: ${{ secrets.PROMPTFORK_API_KEY }}
Add the secret at Settings β Secrets β PROMPTFORK_API_KEY. Done.
5. Open a PR that changes a prompt
The action runs, executes your prompt across Claude/GPT/Gemini, has the
LLM-judge compare each output against your baseline version, and posts a
PR comment with the regression report. If anything regresses, the action
exits non-zero, branch protection blocks the merge, the change goes back
for review.
You now have a CI gate for prompts. The same gate you have for code.
What goes in the test suite
After running this on a few projects, the pattern that works:
- One happy-path case. "Normal" input, expected output.
- One edge case. Empty input, very long input, input in another language, malformed structure.
- One adversarial case. Prompt-injection attempt, contradictory instructions, a customer trying to break the bot.
That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.
What goes wrong if you don't do this
We've seen this play out enough times to predict it:
- New model drops. Team migrates. "Looks fine in playground." Ships.
- Quality degrades 5-15% on a subset of inputs. No alert fires.
- Customer support volume creeps up. Nobody connects the dots.
- Three weeks later, churn ticks up. "Why?"
- Eventually somebody runs an A/B back-test and finds the regression.
- Rollback. Apology emails. Deck slide titled "Lessons Learned."
The whole loop is six commands and an afternoon.
PromptFork has a free tier (3 prompts, 50 runs/mo) that's enough for the
setup above on a small project. https://promptfork.online/diff
United States
NORTH AMERICA
Related News
What Does "Building in Public" Actually Mean in 2026?
20h ago
The Agentic Headless Backend: What Vibe Coders Still Need After the UI Is Done
20h ago
Why Iβm Still Learning to Code Even With AI
22h ago
I gave Claude a persistent memory for $0/month using Cloudflare
1d ago
NYT: 'Meta's Embrace of AI Is Making Its Employees Miserable'
1d ago