By Kai Stomme, CEO at Stomme AI
The problem was simple: AI agents are hard to set up.
Not the intelligence part — the models are brilliant. Claude, GPT-4, Gemini — all extraordinary at reasoning, writing, and analysis. The intelligence exists. What doesn't exist is the infrastructure to make it useful.
Setting up an autonomous AI agent that handles your email, manages your calendar, researches topics, deploys code, and works while you sleep — that takes weeks. Maybe months. You need to configure APIs, build tool integrations, design prompts, test edge cases, set up security policies, and wire everything together. Most people give up before they start.
We didn't want to sell that experience. We wanted to sell the result.
The sprint
On a Friday evening in March 2026, we had a working agent platform but only a handful of skills — the core capabilities that let an agent do real work. Email triage. Calendar management. Basic research. Not enough to deliver the promise of "your agent handles your day."
We needed 28 skills. We needed them tested. We needed them production-ready.
We started at 10 PM on Friday.
How the forge works
We didn't write 28 skills by hand. We built a forge — an adversarial review pipeline that generates, tests, reviews, and hardens each skill automatically.
Here's the process:
- Specification. We write a brief: what the skill does, what inputs it takes, what outputs it produces, what edge cases matter.
- Generation. An AI agent writes the first draft — complete with error handling, input validation, and integration points.
- Testing. Automated test generation: unit tests, integration tests, edge case tests. Every skill gets a minimum of 50 tests. Some get over 200.
- Adversarial review. A second AI agent tears the code apart. It looks for security holes, performance issues, missing error handling, edge cases the tests don't cover. It's hostile. That's the point.
- Iteration. Failures feed back into generation. The skill is rebuilt, retested, and re-reviewed until it passes everything.
- Final audit. A human reviews the output. Not every line — the overall architecture, the security model, the integration quality.
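The loop above — generate, test, adversarially review, feed failures back — can be sketched as a simple convergence loop. This is a hypothetical illustration, not the actual forge code; the names `generate`, `run_tests`, and `adversarial_review` stand in for whatever agents and harnesses drive each step.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewResult:
    passed: bool
    findings: list[str] = field(default_factory=list)

def forge(spec: str, generate, run_tests, adversarial_review, max_rounds: int = 5):
    """Iterate generate -> test -> review until the skill passes everything."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        draft = generate(spec, feedback)      # first draft, or rebuild from feedback
        test_failures = run_tests(draft)      # unit + integration + edge-case tests
        review = adversarial_review(draft)    # hostile second-agent pass
        if not test_failures and review.passed:
            return draft                      # hand off to the human final audit
        feedback = test_failures + review.findings
    raise RuntimeError("skill did not converge; escalate to a human")
```

The key design choice is that test failures and review findings flow into the same feedback channel, so the next generation round sees everything that went wrong, not just one class of failure.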
The numbers
By Sunday evening:
- 28 skills — email triage, calendar management, web research, code deployment, file management, CRM operations, payment processing, document generation, and 20 more.
- 2,939 tests — every skill with comprehensive test coverage.
- 100% passing — not 99%. Not "mostly passing." All of them.
- Zero security vulnerabilities flagged by the adversarial reviewer in the final set.
We went from "we have a platform" to "we have a product" in 48 hours.
The first customer
By Monday morning, our first customer was live: Mareike, a professional in Berlin who'd been on our waitlist. Personal tier. Agent configured, connected, and running.

Her first morning briefing landed in Telegram at 7 AM. Forty-seven emails triaged overnight. Calendar conflicts resolved. Research brief for her first meeting — done.
She sent us a message that afternoon: "I didn't know this was possible."
What we learned
Speed comes from infrastructure, not heroics. The sprint wasn't 48 hours of typing. It was 48 hours of directing an automated pipeline we'd spent weeks building. The forge did the heavy lifting. We did the directing.
Adversarial review is non-negotiable. Every skill, even ones that passed their tests on the first try, had issues caught by the adversarial pass. Security oversights. Missing edge cases. Performance problems under load. If we'd shipped the first drafts, we'd have shipped bugs.
Testing is the product. 2,939 tests isn't a metric to brag about. It's the reason your agent doesn't crash when your calendar has a timezone conflict or your email contains malformed HTML. Reliability is invisible — until it isn't.
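The timezone case is a good example of the kind of bug those tests exist to catch: two calendar events in different timezones can collide even though their local clock times don't. A minimal sketch of the check (the `overlaps` helper is illustrative, not the platform's actual API):

```python
from datetime import datetime, timezone, timedelta

def overlaps(start_a, end_a, start_b, end_b):
    """Detect a conflict between two events, even across timezones."""
    # Normalize to UTC before comparing; comparing local clock times is the bug.
    to_utc = lambda dt: dt.astimezone(timezone.utc)
    return to_utc(start_a) < to_utc(end_b) and to_utc(start_b) < to_utc(end_a)

berlin = timezone(timedelta(hours=1))
new_york = timezone(timedelta(hours=-5))

# 15:00-16:00 in Berlin is 09:00-10:00 in New York: these meetings collide
# even though "15:00" and "09:30" look nowhere near each other on paper.
assert overlaps(
    datetime(2026, 3, 9, 15, 0, tzinfo=berlin),
    datetime(2026, 3, 9, 16, 0, tzinfo=berlin),
    datetime(2026, 3, 9, 9, 30, tzinfo=new_york),
    datetime(2026, 3, 9, 10, 30, tzinfo=new_york),
)
```

An agent that skips the normalization step passes every same-timezone test and fails the first time a Berlin calendar meets a New York invite.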
Why this matters
Most AI companies talk about their models. We talk about our infrastructure.
The model is the brain. The skills are the hands. The forge is the factory. And the product — the thing you actually use — is all three working together, tested to a standard most SaaS companies don't reach in their first year.
We built it in a weekend because we'd spent months building the tools to build it. That's the real story.
Your agent is built on that foundation. Every skill it uses went through the forge. Every capability was adversarially reviewed. Every integration was tested against real-world edge cases.
That's what we mean by "stomme" — the structural frame that everything else depends on.