We Built a Voice Agent to Train System Design Thinking — Here's the Bet We're Making
Most AI products default to chat. You type, it responds. The interface is familiar, low-friction, and easy to ship.
We built something different — a voice agent — and we did it knowing that voice is harder to build, harder to use, and harder to get right.
The Actual Problem
Here's what we kept hearing when we talked to engineers preparing for system design interviews: "I know the material. I just can't explain it under pressure."
These are not junior engineers. They've read the classic texts. They've written design docs. They can answer most system design questions in writing, given enough time. But put them in a real interview — a Zoom call with a staff engineer from a FAANG company, 45 minutes, whiteboard open — and something breaks down.
The knowledge is there. The articulation isn't.
This is the recognition vs. production gap. You can recognize a correct answer when you read it. You can even write one when you have time to edit.
But producing a structured verbal explanation, in real time, under time pressure, to someone who's probing your reasoning as you go — that's a completely different skill.
And it's exactly what interviews test.
Every existing preparation tool builds recognition:
- Books and videos: you read or watch someone else's reasoning. You're not producing — you're absorbing.
- Peer mock interviews: feedback is inconsistent, scheduling is a friction point, and your peer is as likely to miss your gaps as catch them.
- ChatGPT: responds to anything, agrees with everything, asks no follow-up questions. It is the opposite of a skeptical interviewer.
None of them force you to produce under constraints. That's the gap. And the gap is, fundamentally, about interface. If your practice tool lets you think, edit, and hedge before answering — you are not practicing the thing that matters.
The constraint is the training. That realization is what made us build what we built.
Why Voice — The Bet
Here's the position we're staking: speaking your reasoning aloud, under time pressure, is the fastest way to close the gap between knowing something and being able to explain it.
Let's break down why we believe this.
Speaking aloud forces structure. When you type, your brain is slower than your fingers, and your fingers can backspace. When you speak, you have to commit. If you start a sentence you can't finish, everyone hears it — including you. That pressure is diagnostic. It surfaces exactly where your reasoning has gaps.
Voice approximates real interview conditions. You're not going to type your system design answer on a whiteboard. You're going to talk through it. The closest approximation to that, outside of a real interview, is talking through it with something that actually responds to what you said — not a general-purpose answer, but a specific probe of the thing you just said.
Text mode gives you too much time to edit your thinking. A typed chat interface lets you draft, revise, and present the version of your answer you're happy with. That's useful for documentation. It's counterproductive for interview prep. The version you're happy with is not the version you'd produce in real time.
The discomfort is a signal. When you're mid-explanation and you realize you don't know what comes next — that moment of "uh..." is not a failure. It's the most valuable data point in the session. It tells you exactly which concept you don't have at production-ready recall. No passive study tool gives you that signal. Voice does.
The cognitive science literature has a name for this: the testing effect, or retrieval practice. Being forced to produce an answer — not recognize one — strengthens the neural pathways you'll need in the actual high-stakes moment. Voice, under time pressure, is as close to production practice as you can get outside a real interview.
The Technical Choices — and What They Cost
This is the part where we talk about what it actually took to build this.
The 800ms Problem
Voice AI, in theory, is simple: capture audio, transcribe, respond, synthesize speech. In practice, the hard constraint is latency: if the agent takes much longer than about 800ms to start responding, the exchange stops feeling like a conversation. That's where the 800ms budget comes from.
That budget covers every stage of the pipeline. Here's how we allocated it:
| Stage | Component | Target budget |
|---|---|---|
| Speech recognition (STT) | Real-time voice API | ~150ms |
| Session orchestration | Orchestration layer | ~50ms |
| LLM evaluation | Language model | ~400ms |
| Speech synthesis (TTS) | TTS engine | ~150ms |
| **Total** | | **≤750ms** |
We built to 750ms, not 800ms. The 50ms margin is not generous, but it's enough to absorb normal infrastructure variance.
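The budget above can be expressed as a tiny per-turn check. The stage names and numbers mirror the table; the helper itself is a hypothetical sketch, not our production code:

```python
# Hypothetical per-turn latency check. Stage budgets mirror the table above.

STAGE_BUDGET_MS = {
    "stt": 150,            # real-time speech recognition
    "orchestration": 50,   # turn-taking and session management
    "llm": 400,            # language-model evaluation
    "tts": 150,            # speech synthesis
}

TOTAL_BUDGET_MS = 800  # perceived-latency ceiling; we build to 750

def check_turn(latencies_ms: dict) -> list:
    """Return the stages that blew their individual budget on one voice turn."""
    return [
        stage for stage, budget in STAGE_BUDGET_MS.items()
        if latencies_ms.get(stage, 0) > budget
    ]

# A turn where the LLM ran long: total is 735ms (under 800), but the
# per-stage check still flags the overrun so it can't hide in the margin.
overruns = check_turn({"stt": 140, "orchestration": 45, "llm": 430, "tts": 120})
```

The point of per-stage budgets rather than a single total: a stage that quietly eats the margin today is the stage that blows the whole budget under load tomorrow.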
The Architecture Decisions
Real-time voice streaming (STT layer). The biggest latency lever in the pipeline is how you handle the audio loop. A traditional approach — record, send, transcribe, respond — introduces round-trip overhead that compounds at every stage. Streaming transcription keeps partial transcripts flowing while the candidate is still speaking, which takes most of the STT cost off the critical path.
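To make the round-trip point concrete, here is a back-of-the-envelope comparison. The numbers are illustrative assumptions, not measurements:

```python
# Illustrative arithmetic only — not a real STT client.

UTTERANCE_S = 6.0          # how long the candidate speaks
BATCH_STT_S = 1.2          # assumed time to transcribe the full clip afterwards
STREAM_FINAL_LAG_S = 0.15  # assumed final-transcript lag after end of speech

# Batch: nothing downstream can start until the whole clip is transcribed.
batch_wait = BATCH_STT_S

# Streaming: partial transcripts arrive while the user is still talking,
# so downstream work starts almost immediately after they stop.
stream_wait = STREAM_FINAL_LAG_S

saved_ms = round((batch_wait - stream_wait) * 1000)  # ≈ 1050ms off the critical path
```

Under these assumptions, batch transcription alone would consume the entire 800ms turn budget before the LLM even sees a word.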
Speech synthesis quality (TTS layer). This was a quality decision, not a latency decision. We evaluated several TTS engines. The selection criteria were narrow: natural-sounding output over an 8-minute session, correct handling of technical terminology (database names, system design jargon), and cadence that doesn't fatigue a listener over repeated drills.
Session orchestration. Managing turn-taking, interruption detection, and session lifecycle across a voice interaction is a non-trivial engineering problem.
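A minimal sketch of what turn-taking orchestration involves, assuming a simple three-state model (listening, thinking, speaking) with barge-in handling. The real layer also deals with timeouts, reconnects, and session lifecycle:

```python
from enum import Enum, auto

class Turn(Enum):
    LISTENING = auto()   # candidate is speaking; STT is streaming
    THINKING = auto()    # LLM is evaluating the finished utterance
    SPEAKING = auto()    # TTS is playing the agent's follow-up probe

class Session:
    """Toy turn-taking state machine; transitions are no-ops from wrong states."""

    def __init__(self):
        self.state = Turn.LISTENING

    def on_end_of_speech(self):
        if self.state is Turn.LISTENING:
            self.state = Turn.THINKING

    def on_response_ready(self):
        if self.state is Turn.THINKING:
            self.state = Turn.SPEAKING

    def on_user_interrupt(self):
        # Barge-in: if the candidate talks over the agent, stop playback
        # and hand the floor back immediately.
        if self.state is Turn.SPEAKING:
            self.state = Turn.LISTENING

s = Session()
s.on_end_of_speech()   # LISTENING -> THINKING
s.on_response_ready()  # THINKING  -> SPEAKING
s.on_user_interrupt()  # SPEAKING  -> LISTENING (barge-in)
```

Even this toy version shows why the problem is non-trivial: every event can arrive in a state where it's invalid, and interruption has to win over playback every time.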
The Honest Acknowledgment
Voice isn't the right interface for everyone. We're not going to pretend otherwise.
Open offices are real. Commutes on public transit are real. Apartments with thin walls and roommates are real. If your realistic practice window is 7am on a crowded train or during a lunch break in an open floor plan, voice is a significant friction point.
We built text mode as a first-class interface — not a fallback, not a consolation prize. Every drill works in text. The evaluation logic is the same.
But here's the bet we're making: for the subset of engineers who can use voice — who have the private space, the willingness to speak aloud to a machine, and the patience to sit with the discomfort of hearing their own reasoning gaps in real time — we believe this is the fastest path to improvement we can offer.
What We're Testing
We're opening early access to the first 200 engineers. This is a calibration phase, not a soft launch.
The core product is close to complete — you can run drills, get scored, and watch your Thinking Gym Score (TGS) move across sessions. TGS is an Elo-style rating from 700 (beginner) to 1,500 (staff-level), updated after every drill based on the concepts you surfaced, the depth of your reasoning, and the difficulty of the prompt.
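For readers curious what "Elo-style" implies, here is an illustrative update rule using the standard Elo expected-score formula, clamped to the published 700–1,500 range. The K-factor and the mapping from a drill to a 0–1 performance score are assumptions; the actual TGS formula is not spelled out here:

```python
K = 32  # update step size (assumption, standard Elo default)

def expected(rating: float, difficulty: float) -> float:
    """Standard Elo expected score against a prompt of given difficulty."""
    return 1 / (1 + 10 ** ((difficulty - rating) / 400))

def update(rating: float, difficulty: float, performance: float) -> float:
    """performance in [0, 1]: how well the drill was scored."""
    new = rating + K * (performance - expected(rating, difficulty))
    return max(700.0, min(1500.0, new))  # clamp to the published range

# A 1000-rated engineer nails (performance=1.0) a 1100-difficulty drill:
r = update(1000, 1100, 1.0)  # gains roughly 20 points
```

The property worth noticing: beating a hard prompt moves the score far more than beating an easy one, which is exactly why a flat percentage score can't do the same job.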
What we're calibrating is whether the score feels right. Does a session that felt strong to you register as strong in the score? Does a session where you got stuck on capacity estimation show up as a weakness in the right skill node?
If you join and your score feels wrong — if you nail a drill and it penalizes you, or you stumble through one and it gives you full credit — that feedback is exactly what we need. Early users who send us that signal are not complaining. They're co-building.
System Design Drills Using Voice
Voice, with time pressure, trained against your actual reasoning — not a general system design curriculum — is the fastest way to close the gap between knowing something and being able to explain it under pressure.
By joining early, you'll help us refine the scoring model, the architecture, and the overall performance.
Let's start building the Thinking Gym together!
First 200 engineers · $29/month · Rate-locked for life
Thinking Gym is a voice AI training simulator for system design interviews. Eight-minute drills. A calibrated score after every session. Built for engineers who take system design seriously.