By Nils Ekström, CTO at Stomme AI
Your agent has 28 skills. Email triage. Calendar management. Web research. Code deployment. Payment processing. CRM operations. Document generation. Twenty-one more.
Each one went through a process we call the forge.
The forge isn't a code review. It's an automated adversarial pipeline that generates a skill, creates comprehensive tests, then attacks its own output to find the security holes, edge cases, and failure modes that normal testing misses.
Here's how it works — and why we built it.
The problem with normal testing
Standard software testing works like this: a developer writes code, then writes tests to verify the code works. The problem is obvious — the person writing the tests has the same mental model as the person who wrote the code. They share the same assumptions, the same blind spots.
If the developer didn't think about what happens when a timezone is UTC-12, neither did the test.
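To make that blind spot concrete, here is a minimal sketch (the function, values, and test are invented for illustration): a date helper whose author assumed everyone lives at UTC, plus the happy-path test that shares the assumption — and the UTC-12 case that a spec-driven test would catch.

```python
from datetime import datetime, timezone, timedelta

def local_calendar_date(ts_utc: datetime, offset_hours: int) -> str:
    # BUG: formats the UTC date and ignores the user's offset entirely.
    # The author's own test never uses a non-zero offset, so it passes.
    return ts_utc.strftime("%Y-%m-%d")

# The developer's test, written with the same UTC-centric mental model:
assert local_calendar_date(
    datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc), 0) == "2024-06-01"

# A test at UTC-12 exposes the bug: 02:00 UTC on June 1 is still May 31
# for a user twelve hours behind.
local = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc).astimezone(
    timezone(timedelta(hours=-12)))
assert str(local.date()) == "2024-05-31"  # what the user actually sees
assert local_calendar_date(
    datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc), -12) == "2024-06-01"  # wrong
```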
AI-generated code compounds this problem. When you ask an AI to write a skill and then ask the same AI to write tests for that skill, you get tests that verify the AI's assumptions — not tests that challenge them.
You need a hostile reviewer.
The forge pipeline
Step 1: Specification
We write a brief for each skill. Not pseudocode — a functional specification:
- What the skill does
- What inputs it accepts
- What outputs it produces
- What edge cases matter
- What security constraints apply
- What integrations it touches
The specification is the contract. Everything downstream is tested against it.
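As a rough illustration of the shape such a brief might take — the field names and values here are invented, not our actual schema:

```python
# Hypothetical skill brief for an email triage skill. Every field name
# below is illustrative; the real spec format is not shown in this post.
EMAIL_TRIAGE_SPEC = {
    "name": "email_triage",
    "purpose": "Classify inbound email into act-now / later / archive.",
    "inputs": {"message": "RFC 5322 message; the Subject header MAY be absent"},
    "outputs": {"category": 'one of "act_now", "later", "archive"'},
    "edge_cases": ["missing subject", "non-ASCII headers", "empty body"],
    "security": ["never log message bodies", "no outbound network calls"],
    "integrations": ["imap_fetch", "label_store"],
}
```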
Step 2: Generation
An AI agent writes the first implementation. Complete with error handling, input validation, type checking, and integration points. This is a competent first draft — not final, never shipped as-is.
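A first draft might look something like this — a hedged sketch, not the shipped skill; the categories and heuristics are placeholders:

```python
def triage(message: dict) -> str:
    """First-draft triage: validate input, then classify.
    Categories and rules are illustrative, not the production skill."""
    if not isinstance(message, dict):
        raise TypeError("message must be a dict")      # input validation
    subject = message.get("subject")                   # subject MAY be absent
    if subject is not None and not isinstance(subject, str):
        raise TypeError("subject must be a string when present")
    subject = subject or ""
    sender = message.get("from", "")
    if "urgent" in subject.lower() or sender.endswith("@boss.example"):
        return "act_now"
    if subject.strip():
        return "later"
    return "archive"
```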
Step 3: Test generation
A separate AI agent — one that hasn't seen the implementation — generates tests from the specification alone. This is critical: the test writer doesn't know how the code works. It only knows what it should do.
This produces tests that challenge the implementation's assumptions rather than confirming them. If the implementation handles a null input by returning an empty array, but the specification says it should throw an error, the test catches it.
Minimum: 50 tests per skill. Complex skills get 150-200.
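The null-input example above can be sketched as a spec-blind check (all names here are illustrative): the test encodes only what the brief demands, so an implementation that quietly returns an empty result fails it.

```python
def triage_draft(message):
    # A first draft that silently tolerates None, matching its author's
    # assumption rather than the contract.
    if message is None:
        return []
    return ["later"]

def null_input_raises(skill) -> bool:
    """Spec-only test: the brief says a null input is an error,
    so the skill must raise rather than return an empty result."""
    try:
        skill(None)
    except (TypeError, ValueError):
        return True
    return False

assert null_input_raises(triage_draft) is False  # divergence caught
```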
Step 4: Adversarial review
A third AI agent reviews the implementation and tests with a single directive: break it.
This agent looks for:
- Security vulnerabilities. Injection attacks, credential exposure, unvalidated inputs, privilege escalation paths.
- Edge cases the tests miss. Unicode characters in email subjects. Calendar events spanning midnight. Files with no extension. Rate limit responses from APIs.
- Performance issues. N+1 queries. Unbounded loops. Memory leaks in long-running operations.
- Integration failures. What happens when the email API returns a 429. When the calendar API is down. When the file system is read-only.
The adversarial reviewer is hostile by design. It's not trying to verify the code works. It's trying to prove it doesn't.
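One way to picture a single adversarial probe — a sketch under stated assumptions, with invented attack cases rather than our real corpus — is a harness that feeds hostile inputs to a skill and records every crash:

```python
ATTACK_CASES = [
    {"subject": None},                      # missing subject line
    {"subject": "✉️ Ünïcode subject"},      # non-ASCII characters
    {"subject": "a" * 1_000_000},           # oversized field
    {"subject": "'; DROP TABLE mail; --"},  # injection-shaped input
]

def probe(skill, cases):
    """Run hostile inputs through a skill and collect failures."""
    failures = []
    for case in cases:
        try:
            skill(case)
        except Exception as exc:
            failures.append((case, type(exc).__name__))
    return failures

# A naive skill that assumes a subject always exists fails the first case:
naive = lambda m: m["subject"].lower()
assert len(probe(naive, ATTACK_CASES)) == 1
```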
Step 5: Iteration
Failures from the adversarial review feed back into generation. The skill is rebuilt — not patched. A new implementation addresses the identified issues, new tests verify the fixes, and the adversarial reviewer attacks again.
This loop runs until the skill passes everything: typically two to four iterations per skill, though some complex skills took six.
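The loop above can be sketched as a small driver. This is a simplified model, not our pipeline's actual API: the three callables stand in for the generating, test-writing, and adversarial agents.

```python
def forge(spec, generate, make_tests, attack, max_rounds=6):
    """Sketch of the rebuild loop. The callables are injected stand-ins
    for the three AI agents; names and signatures are illustrative."""
    for round_no in range(1, max_rounds + 1):
        impl = generate(spec)               # step 2: fresh build, not a patch
        tests = make_tests(spec)            # step 3: derived from spec alone
        findings = attack(impl, tests)      # step 4: try to break it
        if not findings and all(t(impl) for t in tests):
            return impl, round_no           # passes everything; on to human audit
        spec = spec + findings              # failures feed the next rebuild
    raise RuntimeError("skill did not converge within max_rounds")

# Toy run: the attacker finds an issue in round 1, and the enriched spec
# produces a passing build in round 2.
spec = ["brief"]
impl, rounds = forge(
    spec,
    generate=len,
    make_tests=lambda s: [lambda impl: impl >= 1],
    attack=lambda impl, tests: [] if impl >= 2 else ["finding: null input"],
)
assert (impl, rounds) == (2, 2)
```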
Step 6: Human audit
A human reviews the final output. Not line by line — the architecture, the security model, the integration quality. Does this skill do what the specification says? Are the security constraints enforced? Would we trust this with a customer's email?
What the adversarial pass actually catches
In our 28-skill sprint, every skill that passed its own generated tests still had issues caught by the adversarial reviewer:
- An email triage skill that didn't handle emails with no subject line
- A calendar skill that miscalculated duration for events crossing daylight saving transitions
- A file management skill that followed symlinks outside the sandbox
- A payment processing skill that logged card metadata in plaintext
- A web research skill that didn't validate URL schemes, allowing file:// access
None of these would have been caught by conventional testing. Each would have caused a real failure for a real customer.
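For the URL-scheme finding, the fix is small. This is a minimal sketch of the kind of guard such a finding produces, not our actual code; the allowlist is an assumption.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}  # illustrative allowlist

def validate_url(url: str) -> str:
    """Allowlist web schemes. Without this, file:// lets a research
    skill read the local filesystem instead of fetching a web page."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"disallowed URL scheme: {scheme or '(none)'}")
    return url
```

Allowlisting beats blocklisting here: a blocklist of known-bad schemes misses the next one, while an allowlist fails closed.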
Why this matters to you
You don't need to understand the forge to use your agent. But you should know it exists.
Every capability your agent has — every email it triages, every meeting it schedules, every file it manages — was built by this pipeline. Generated, tested, attacked, rebuilt, retested, and reviewed.
2,939 tests. 100% passing. Zero security vulnerabilities in the final set.
That's not a marketing number. It's an engineering standard. And it's why your agent handles your email at 3 AM without you worrying.