TechTrends Now - Tech News for Builders and Operators

Your code has tests. Your code has a CI pipeline. A bad change can't merge
without going green.

Your prompts? Vibes. A teammate edits the system prompt to fix one customer
complaint, output quality drops 8% on the other 99% of cases, nobody
notices for a month, and the regression eventually surfaces as a
mysterious churn bump in the metrics deck.

This post is the 5-minute setup that closes that gap.

What "tests for prompts" actually means

There are two viable approaches and you need to know which to use when.

Assertion-based. You write code that checks the LLM output against
fixed rules: regex matches, JSON shape validation, field-presence checks,
length bounds. Fast, cheap, deterministic.

Use it when: the output is structured and the contract is rigid. JSON
extraction, classification, function-call payloads, schema-conformant
generation.

LLM-judge. Another LLM compares the candidate output to a baseline and
returns "regressed: yes/no" with a severity score. Slower, costs a few
cents per comparison, handles fuzzy outputs.

Use it when: the output is freeform — summaries, rewrites, creative
generation, anything where two correct answers can look very different.

A mature setup uses both. PromptFork ships the LLM-judge built in (we
chose Claude Haiku at temp 0 with a strict "only flag strictly worse"
rubric); assertions are easy to add yourself in custom test cases.

The 5-minute setup

1. Pin your prompts in version control

prompts/
  summarize_ticket.txt
  extract_email.txt
  classify_intent.txt

Plain text files. Not constants in prompts.py. Not Notion docs. Files
with a git history.

2. Push them to PromptFork

pip install promptfork
export PROMPTFORK_API_KEY=pf_xxxx

for f in prompts/*.txt; do
  name=$(basename "$f" .txt)
  promptfork push "$name" --file "$f" --message "initial commit"
done

This creates v1 of each prompt server-side and gives you a stable identifier.

3. Add test cases

For each prompt, pin 5-30 representative inputs. Real production inputs are
worth 10x synthetic ones.

promptfork add-test summarize_ticket happy_path \
  --input ticket="Order arrived. Loved it." \
  --rubric "summary should be positive and under 20 words"

promptfork add-test summarize_ticket angry_refund \
  --input ticket="3 weeks late, want money back NOW" \
  --rubric "must mention refund and high urgency"

promptfork add-test summarize_ticket edge_garbled \
  --input ticket="hi pls help thx" \
  --rubric "summary should request more info, not invent details"

Three test cases is a starting point. Six is a good baseline. Thirty is
production-grade.

4. Wire the GitHub Action

# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Push current prompts
        env:
          PROMPTFORK_API_KEY: ${{ secrets.PROMPTFORK_API_KEY }}
        run: |
          pip install promptfork
          for f in prompts/*.txt; do
            name=$(basename "$f" .txt)
            promptfork push "$name" --file "$f" \
              --message "PR #${{ github.event.pull_request.number }}"
          done
      - uses: shaunvand/promptfork-cli@v0
        with:
          prompt: summarize_ticket
          baseline: 1
          api-key: ${{ secrets.PROMPTFORK_API_KEY }}

Add the secret at Settings → Secrets → PROMPTFORK_API_KEY. Done.

5. Open a PR that changes a prompt

The action runs, executes your prompt across Claude/GPT/Gemini, has the
LLM-judge compare each output against your baseline version, and posts a
PR comment with the regression report. If anything regresses, the action
exits non-zero, branch protection blocks the merge, the change goes back
for review.

You now have a CI gate for prompts. The same gate you have for code.

What goes in the test suite

After running this on a few projects, the pattern that works:

One happy-path case. "Normal" input, expected output.
One edge case. Empty input, very long input, input in another language, malformed structure.
One adversarial case. Prompt-injection attempt, contradictory instructions, a customer trying to break the bot.

That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.

What goes wrong if you don't do this

We've seen this play out enough times to predict it:

New model drops. Team migrates. "Looks fine in playground." Ships.
Quality degrades 5-15% on a subset of inputs. No alert fires.
Customer support volume creeps up. Nobody connects the dots.
Three weeks later, churn ticks up. "Why?"
Eventually somebody runs an A/B back-test and finds the regression.
Rollback. Apology emails. Deck slide titled "Lessons Learned."

The whole loop is six commands and an afternoon.

PromptFork has a free tier (3 prompts, 50 runs/mo) that's enough for the
setup above on a small project. https://promptfork.online/diff

Prompt regression testing in CI: a 5-minute setup

What "tests for prompts" actually means

The 5-minute setup

1. Pin your prompts in version control

2. Push them to PromptFork

3. Add test cases

4. Wire the GitHub Action

5. Open a PR that changes a prompt

What goes in the test suite

What goes wrong if you don't do this

Comments (0)

United States

Related News

What Does "Building in Public" Actually Mean in 2026?

The Agentic Headless Backend: What Vibe Coders Still Need After the UI Is Done

Why I’m Still Learning to Code Even With AI

I gave Claude a persistent memory for $0/month using Cloudflare

NYT: 'Meta's Embrace of AI Is Making Its Employees Miserable'