AI coding agent at a terminal workstation building multi-system integrations - case study featured image

    AI Agent Case Study: How an AI Coding Agent Built a Voice Intelligence Platform Without Writing a Single Line of Code


    An AI coding agent built a complete multi-system voice intelligence platform — Twilio, Microsoft Teams, Supabase, n8n — without a single line of human-written code. Every workflow, every database schema, every configuration file, every shell script. All authored and deployed by the agent through API calls and file writes. The human directed the architecture, set constraints, and reviewed output. The agent did the building. This is what that looked like.

    The Brief

    A joint venture between AI Governance Group and Black Gazelle needed something specific: an AI system that dials into Microsoft Teams meetings via the public telephone network, listens to the full conversation in real time, and at precisely the right moment, asks one highly targeted question.

    The domain is Action Learning, a methodology where peer enterprise leaders work through complex issues, difficult decisions, and high-stakes workplace challenges together. The AI doesn’t moderate. It doesn’t summarize. It listens deeply to everything being said and contributes a single, well-crafted question that moves the group’s thinking forward.

    The stakes are unusual. If the system asks one bad question out of a thousand, that’s a failure. Every question has to be excellent. Every time. There is no room for the kind of verbose, hedging output most AI systems produce. The question has to be precise, grounded in what was actually said, and delivered without preamble.

    Security and data locality were non-negotiable. Everything runs and is stored in Europe. All data, models, and systems are either European or locally hosted, giving the client a high degree of control despite Fountain City being a US-based company.

    The entire technical build for the voice infrastructure, the system that handles dialing in, transcribing, and speaking back into live calls, was executed by an AI coding agent. Zero lines of human-written code. Zero manual UI configuration. Every n8n workflow node, every SQL migration, every TwiML template, and every shell script was authored and deployed by the agent through API calls and file writes.

    System architecture diagram showing how Twilio, n8n, Supabase, Microsoft Teams, and AWS work together in the AI-built voice intelligence platform

    The Architecture

    Six systems had to work together. Each one handles a different layer of the problem:

    | Layer | Technology | Role |
    |---|---|---|
    | Telephony | Twilio Programmable Voice | PSTN dialing, DTMF tones, audio codec negotiation, call lifecycle |
    | Transcription | Twilio Real-Time Transcription (Deepgram Nova-3) | Speech-to-text, streamed as HTTP webhooks |
    | Orchestration | n8n (self-hosted, Docker) | Workflow logic, webhook endpoints, API coordination |
    | Database | Supabase (managed PostgreSQL) | Transcript storage, session state |
    | Conferencing | Microsoft Teams (Audio Conferencing) | The meeting room the system dials into |
    | Infrastructure | AWS EC2 + Cloudflare + Nginx | Hosting, SSL termination, reverse proxy |
    | AI Agent | Claude Code CLI (Anthropic) | The builder — wrote and deployed everything |

    The core architectural constraint, set by the human director before a single line was written: Twilio owns the ears and mouth, n8n owns the brain. No custom server. No WebSocket handling. No raw audio streaming. No media processing. All audio stays inside Twilio’s infrastructure. The orchestration layer only ever touches text and REST API calls. The AI agent respected this constraint throughout every milestone.

    How the AI Agent Built It

    The build happened across three milestones. Each one was independently testable and had a concrete “it works” checkpoint before moving forward. This is Part 1 of the case study, covering the voice infrastructure. Part 2 will cover the operator dashboard and AI question generation pipeline.

    Milestone 1: Infrastructure Validation

    The directive was simple: prove that Twilio, n8n, and Supabase can all talk to each other. No call logic yet.

    The agent created a Supabase schema by generating SQL and executing it against the Postgres database over IPv6. A transcripts table with columns for session tracking, segment ordering, and confidence scores, plus a composite index for ordered retrieval.
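
    A hypothetical reconstruction of that schema, with illustrative table and column names (the article confirms session tracking, segment ordering, confidence scores, and a composite index, but not the exact DDL the agent wrote):

```python
# Illustrative sketch only: the names below are assumptions,
# not the agent's actual migration.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS transcripts (
    id           BIGSERIAL PRIMARY KEY,
    session_id   TEXT        NOT NULL,   -- session tracking
    segment      INTEGER     NOT NULL,   -- increments after each AI interjection
    sequence_id  INTEGER     NOT NULL,   -- Twilio's per-utterance ordering
    transcript   TEXT        NOT NULL,
    confidence   REAL,                   -- Deepgram confidence score
    created_at   TIMESTAMPTZ DEFAULT now()
);

-- Composite index for ordered retrieval
CREATE INDEX IF NOT EXISTS idx_transcripts_order
    ON transcripts (session_id, segment, sequence_id);
"""
```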

    Then it built three n8n webhook endpoints by composing workflow JSON and deploying it through the n8n REST API. No human touched the n8n UI. The agent created a transcript receiver webhook that parses Twilio’s form-encoded POST data, filters for final transcriptions only, and inserts rows into Supabase. It also built a TwiML server webhook for XML responses and a status callback webhook for call lifecycle events.

    The agent configured n8n credentials for both Twilio and Supabase through API calls, setting up API key pairs, database connection strings, and authentication headers programmatically. Then it validated the entire pipeline end-to-end with curl commands simulating Twilio webhook payloads, confirming data flowed from webhook to n8n to Supabase.

    The first real surprise came here. n8n’s Supabase node uses internal field names that don’t match the documentation: dataToSend, fieldsUi, and fieldId instead of the documented fieldsToSend, fieldValues, and fieldName. The agent discovered this through trial-and-error API calls and inspection of n8n’s node source patterns, then corrected the workflow JSON accordingly. A human clicking through the UI would never have encountered this, but a human reading documentation would have been equally misled.

    Milestone 2: Dial-In and Transcription

    The directive: dial into a real Teams meeting, transcribe the audio, store it in Supabase.

    The agent built a call initiator workflow using the Twilio REST API to place an outbound PSTN call to a Teams dial-in number. It engineered the TwiML response to handle the Teams auto-attendant sequence: an initial pause for Teams to answer, a DTMF digit sequence with calibrated timing to enter the conference ID, real-time transcription activation with Deepgram Nova-3, and a long pause to hold the call open.
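
    That TwiML sequence can be sketched as a small builder. The verb names (Pause, Play, Start, Transcription) are real TwiML; the pause lengths, wait characters, and attribute values here are illustrative, not the calibrated production values:

```python
def build_dial_in_twiml(conference_id: str, transcript_webhook: str) -> str:
    """Sketch of the Teams dial-in TwiML: wait for the auto-attendant,
    send the conference ID as DTMF, start transcription, hold the call."""
    # Note: a webhook URL containing '&' must be XML-entity-encoded first.
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Pause length="10"/>
  <Play digits="ww{conference_id}#"/>
  <Start>
    <Transcription statusCallbackUrl="{transcript_webhook}"/>
  </Start>
  <Pause length="3600"/>
</Response>"""
```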

    Live testing uncovered several things documentation doesn’t tell you. Transcript text arrives inside a JSON string in TranscriptionData (not TranscriptionText), and the sequence field is SequenceId, not SequenceNumber. The agent updated the workflow parsing logic in-place via API calls.

    It also caught that ampersand characters in webhook URLs within TwiML must be encoded as &amp;, and that Twilio silently rejects malformed XML with a generic error code (12100). Diagnosed from Twilio’s error logs, fixed without human intervention.

    DTMF timing was the last piece. Teams auto-attendants vary by tenant, so the agent ran 3 to 5 test calls, adjusting pause durations between each attempt, until the digits were accepted reliably. Total cost for this calibration: about $0.50 in Twilio call charges.
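
    In a digits string, Twilio treats 'w' as a half-second wait, so the calibration amounts to varying the leading waits between attempts. A trivial sketch of one attempt's digit string:

```python
def dtmf_sequence(conference_id: str, wait_halves: int) -> str:
    """Build one calibration attempt's DTMF string.
    Each 'w' is a 0.5-second pause before the digits are sent."""
    return "w" * wait_halves + conference_id + "#"
```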

    The result: a real Teams meeting was dialed into, and live utterances from meeting participants appeared in the Supabase transcripts table in real time, correctly ordered by session, segment, and sequence number.

    Milestone 3: Voice-Out (AI Speech Injection)

    The directive: make the system speak into the live call, then restart transcription so nothing is lost after the AI talks.

    The agent built a voice-out trigger workflow exposing a webhook that accepts a JSON payload with the text to speak, the session ID, and the Twilio CallSid.

    This milestone produced the most significant discovery. Twilio’s <Start><Transcription> runs as a sidecar process. When TwiML is replaced via the Call Update API, the old transcription session does not automatically stop. Without explicit cleanup, transcription sessions stack up silently, producing duplicate or ghost utterances. The agent solved this by prepending <Stop><Transcription> to every voice-out TwiML payload, using unique session names tied to the segment counter.
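
    A hypothetical reconstruction of that voice-out payload (the Stop/Say/Start structure matches the article; the session-name scheme and pause length are assumptions):

```python
from xml.sax.saxutils import escape

def build_voice_out_twiml(text: str, segment: int, transcript_webhook: str) -> str:
    """Stop the previous transcription sidecar, speak the AI's text,
    then restart transcription under a new segment-tied session name."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Stop><Transcription name="session-seg-{segment}"/></Stop>
  <Say>{escape(text)}</Say>
  <Start>
    <Transcription name="session-seg-{segment + 1}"
                   statusCallbackUrl="{transcript_webhook}"/>
  </Start>
  <Pause length="3600"/>
</Response>"""
```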

    It implemented segment tracking using n8n’s workflow static data, which persists across webhook executions. Each voice-out increments the segment number, so transcripts before and after each AI interjection are correctly sequenced.

    The agent also added XML escaping for dynamic text (AI-generated responses can contain ampersands, angle brackets, quotes, and apostrophes that break TwiML) and enforced the TwiML 4,000-character size limit on the Call Update API, capping AI response text at 3,500 characters to leave room for XML overhead.
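
    Those two safeguards combine into a few lines. Truncating before escaping matters, because cutting after escaping could split an XML entity in half:

```python
from xml.sax.saxutils import escape

TEXT_BUDGET = 3500  # cap from the article, under TwiML's 4,000-char limit

def sanitize_for_twiml(text: str) -> str:
    """Truncate first, then escape &, <, >, quotes, and apostrophes."""
    clipped = text[:TEXT_BUDGET]
    return escape(clipped, {'"': "&quot;", "'": "&apos;"})
```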

    The result: the system successfully spoke AI-generated text into a live Teams call, then seamlessly resumed transcription. Multiple speak-listen cycles were tested, with all segments correctly tracked in the database.

    Testing interface for the AI voice intelligence platform showing how operators trigger AI-generated phrases during live calls

    This screen shows a test view of the tool, displaying multiple candidate questions and which one the AI is about to speak, so we can verify the system is asking the most effective question into the conversation.

    The AI Agent’s Toolkit

    The agent operated entirely through a terminal session on the EC2 instance. Here is what it actually used:

    For n8n workflow authoring, it used the n8n REST API to create and update complete workflow definitions as JSON, including node positions, connections, parameters, and credential references. Every workflow was built API-first. The n8n web UI was only used by the human for visual verification after the fact. It also used the credential API to set up Twilio and Supabase authentication, and the workflow activation API to toggle workflows when forcing webhook re-registration after updates.
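
    The activation toggling can be sketched as a small helper. The endpoint paths follow n8n's public REST API; the HTTP transport (and the X-N8N-API-KEY header) is left to a caller-supplied send function, so this is a sketch of the sequence rather than a full client:

```python
def redeploy_workflow(wf_id: str, workflow_json: dict, send) -> list:
    """Force n8n to re-register webhooks after an API update:
    deactivate, push the updated workflow JSON, reactivate."""
    calls = [
        ("POST", f"/workflows/{wf_id}/deactivate", None),  # detach live webhooks
        ("PUT",  f"/workflows/{wf_id}", workflow_json),    # push updated JSON
        ("POST", f"/workflows/{wf_id}/activate", None),    # re-register webhooks
    ]
    for method, path, body in calls:
        send(method, path, body)
    return calls
```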

    For database work, it ran psql over IPv6 directly against Supabase’s Postgres instance for schema creation, data inspection, and debugging. It wrote migration files to the repo for version control.

    For Twilio integration, it used the Twilio REST API via curl to initiate test calls, update live calls with new TwiML, and query call status. It inspected Twilio error logs to diagnose webhook failures and TwiML parse errors.

    For infrastructure, it managed the n8n Docker container, wrote and validated Nginx reverse proxy configs and SSL certificate setup, and created Cloudflare WAF rules to bypass bot challenges on webhook paths.

    Everything else: file writes for documentation, shell scripts, environment files, and SQL migrations. Git commits at each logical milestone.

    One of several AI-generated workflows in n8n. Keeping the logic in n8n means a human can review the system easily and understand at a glance what the AI is building, even if that person is not technical. Code nodes (shown with {}) handle any logic parsing.

    AI agent orchestrating multiple interconnected systems through holographic interfaces

    What the Human Actually Did

    Before the build started, the human provisioned accounts on Twilio, Supabase, Microsoft Azure, and Cloudflare. Purchased a Twilio phone number. Set up an Azure app registration and Teams bot for audio conferencing. Provisioned an AWS EC2 instance, installed Docker, deployed n8n via Docker Compose with a public URL, and populated a .env file with API keys, tokens, and connection strings.

    During the build, the human directed each milestone (“Start M2. Here’s the Teams dial-in number and conference ID.”), set architectural constraints (“No WebSockets. No audio processing. Twilio handles all media.”), reviewed the agent’s work by checking n8n workflows in the UI and querying Supabase to verify data, course-corrected when needed (“The DTMF timing is too fast, add more pause.”), and approved destructive actions when the agent asked for confirmation before overwriting workflows or restarting services.

    What the human never did: write or edit any code. Configure any n8n workflow node through the UI. Manually set up any database table or index. Debug any API integration by hand. Write any Nginx config or Cloudflare rules.

    Human director reviewing AI agent work in a modern office - the director not builder paradigm

    This is the pattern we call “human as director, AI as builder.” The human’s expertise was essential for architectural decisions, constraint-setting, and knowing when the agent’s output was correct. The mechanical work of translating those decisions into running code was entirely delegated. It’s a different skill set: systems thinking and quality judgment, not implementation detail.

    The Gotchas: What the Agent Had to Figure Out

    Building integrations between four live systems produces edge cases that no documentation covers completely. This table is the unedited record of what the AI agent encountered and resolved:

    | Problem | How the Agent Solved It |
    |---|---|
    | n8n Supabase node uses undocumented internal field names | Tried the documented names, got errors, inspected n8n source patterns, iterated until the correct names were found |
    | Twilio sends Final as string "true", not boolean | Discovered through live webhook inspection, updated the n8n IF node comparison |
    | Transcript text nested inside a JSON string field | Parsed TranscriptionData JSON to extract .transcript and .confidence |
    | TwiML & in URLs causes silent Twilio rejection | Diagnosed from Twilio error code 12100, applied XML entity encoding |
    | Two webhooks in one n8n workflow require a specific response mode | Both must use responseMode: "responseNode", not the default lastNode |
    | n8n API workflow updates don’t take effect on active webhooks | Learned to deactivate, update, then reactivate the workflow to force re-registration |
    | Transcription sessions persist as sidecar processes after TwiML replacement | Added explicit <Stop><Transcription> to every voice-out payload with unique session names |
    | Teams auto-attendant timing varies by tenant | Ran iterative test calls, adjusting DTMF pause characters until digits were accepted reliably |

    Each of these would have cost a human developer minutes to hours of debugging. The AI agent encountered and resolved them within its normal workflow, typically within 2 to 3 retry cycles.

    Anyone can claim an agent built something. The gotchas prove it encountered real-world messiness and worked through it.

    AI agent debugging and iterating through integration challenges - discovering solutions through trial and error

    Cost of the Build

    Under $5 in total telephony costs for the entire build-and-test cycle.

    | Item | Cost |
    |---|---|
    | Twilio test calls (M1 through M3, approximately 15 calls) | ~$3.00 |
    | Twilio phone number (monthly) | $1.15 |
    | Supabase (free tier) | $0.00 |
    | n8n (self-hosted) | $0.00 |
    | AWS EC2 (existing instance) | Marginal |
    | Claude Code CLI usage | Per-token API costs |

    The primary cost was AI agent API usage. The infrastructure and telephony costs were negligible.

    What This Proves

    AI agents can handle multi-system integration plumbing. This was not a single-API wrapper or a CRUD app. The agent coordinated four live external services, each with its own authentication model, data format, and behavioral quirks. It managed form-encoded webhooks, XML generation, JSON APIs, SQL DDL, Docker containers, and Nginx configs in a single continuous workflow.

    No-code platforms become more powerful with AI agents. n8n is a visual workflow builder, but the AI agent never used the visual interface. It authored workflows as JSON and deployed them via REST API. This is faster than clicking through a UI, produces version-controllable artifacts, and allows the agent to iterate programmatically when something doesn’t work. The “no-code” platform became a headless orchestration engine.

    The human role shifts from builder to director. The human’s expertise was essential for architectural decisions, constraint-setting, and recognizing when the agent’s output was correct. But the mechanical work of turning those decisions into running code was entirely delegated. This is a different skill set: systems thinking and quality judgment, not implementation detail.

    Debugging is where AI agents earn their keep. Half the work in any integration project is diagnosing why System A’s output doesn’t match System B’s expected input. The agent’s ability to make an API call, inspect the error, form a hypothesis, modify the code, and retry in a tight loop — without fatigue or frustration — is where the productivity gain is largest. The gotcha table above is the evidence: eight real problems, each solved through methodical iteration that would have cost a human developer significant debugging time.

    Human silhouette before a vast interconnected AI network - the transformation from traditional to agentic development

    Documentation becomes a natural byproduct. Because the agent operates through explicit tool calls and file writes, every action is logged. The project ended up with comprehensive documentation (milestone specs, a rebuild-from-zero guide, exported workflow JSON, SQL migrations) not because someone sat down to write docs, but because the agent’s working method inherently produces artifacts.

    The Hardest Parts (And They Weren’t Code)

    The most time-consuming part of this project was, ironically, not the coding. It was getting permissions, credentials, DNS changes, and other access systems granted and permitted. At one point we waited about a week for Microsoft to provision the agent with the right credentials to act on calls as expected. Once all the tools and system access points were in place, things moved at breakneck speed again.

    The other part that required significant iteration was the quality of the questions themselves. Getting an AI to ask precisely the right question, without going into extended preamble before and after it, is hard. Conventional wisdom says that to increase an AI’s output quality, you should make it think out loud. That’s what all agentic systems do when they have “thinking mode” enabled: in the background, they reason through the problem step by step, and this greatly increases reliability and correctness. The challenge here was getting to a high-quality result with no space at all for that kind of self-dialogue.

    This was solved during an initial proof-of-concept phase where extensive testing was done across a hundred-plus case study conversations using synthetic data. All of it was human-validated for quality to train and refine the system until it consistently produced excellent questions.

    This pattern was used as a loop to bulk-test large amounts of synthetic data until we reached the quality needed to ensure consistent, excellent AI questions.

    What the Client Says

    Ghislaine Caulat from the client team: “I really want to thank you, Sebastian, for the way you engage with us, for your transparency and for your precision. You have demonstrated a deep and effective understanding of our needs and Action Learning principles. I appreciate this a lot!”

    How to Reproduce This Pattern

    The prerequisites: an AI coding agent with shell access running on the target server, all external accounts provisioned with API keys stored in an environment file, a clear architecture document that becomes the agent’s persistent reference, and a milestone-based build plan where each phase is independently testable.

    The workflow loop: the human directs (“Here’s the architecture. Here are the credentials. Build Milestone 1.”), the agent builds (creates schemas, configures credentials, builds workflows, tests end-to-end), the human reviews and provides corrections, the agent iterates and moves to the next milestone. Repeat until the system is complete.

    Two things make this pattern work reliably. First, constraint-driven architecture. The more clearly you define what the system should not do, the fewer wrong turns the agent takes. Second, API-first tooling. Every system in this stack (Twilio, n8n, Supabase, Cloudflare) exposes a REST API. If a component can only be configured through a GUI, the agent cannot touch it.

    If you want to learn how to work this way, we run agentic coding training for development teams and agencies.

    What We Learned Building This

    This system is the first step in building an agentic system that can sit with leaders and provide deep insight into important meetings and discussions, performing at the level of thought that difficult decisions and high-stakes situations demand.

    The project was fun to build. With the agentic coding approach, it came together at a fraction of the time, cost, and complexity of a traditional build. The end result is a better-designed and better-built system than we could have produced by hand-coding it the traditional way.

    There is still serious craft required to work this way, but the effort goes into solution design: collaborating with AI to refine the solution until it’s better than either could produce alone.

    As Martin Fowler’s team at ThoughtWorks notes, “CLI coding agents represent a fundamental shift from AI as a writing assistant to AI as a development partner.” This case study is what that shift looks like in production, with real systems, real clients, and real money flowing through the wires.

    We build managed autonomous AI agents for companies that need this kind of capability. We also run agents in our own operations — an autonomous SEO research agent and an autonomous content pipeline that produced this article. The gap between conversational AI and agentic AI is where the real value lives, and it’s where we spend our time.

    What’s Next

    With the MVP complete, the application will now be tested in real-life calls. Based on feedback, we will spec and document the subsequent development phase. That phase will focus primarily on quality improvements driven by customer feedback, but also on technical aspects such as scaling the system to handle multiple parallel sessions.

    Frequently Asked Questions

    Can AI coding agents really build production integrations without human code?

    Yes. This case study documents a complete multi-system integration (Twilio, Microsoft Teams, Supabase, n8n) where zero lines of code were written by a human. The AI agent authored every workflow, every database schema, every configuration file, and every shell script. The human role was provisioning accounts, directing milestones, and reviewing output.

    What kinds of projects are best suited for AI agent development?

    Projects with API-first tooling, clear architectural constraints, and milestone-based build plans. The systems being integrated need to expose REST APIs. If the only way to configure a tool is through a GUI, the agent cannot work with it. Multi-system integrations, workflow automation, and data pipeline projects are strong candidates.

    How much does it cost to have an AI agent build a multi-system integration?

    In this case, the infrastructure and telephony costs for the entire build-and-test cycle were under $5. The primary cost is the AI agent’s API usage (per-token pricing) and the human director’s time for supervision and review. Traditional development of equivalent scope would typically involve multiple developer-weeks of effort.

    What role does the human play when an AI agent is doing the building?

    The human is the director, not the builder. That means three things: provisioning accounts and API keys before the build, directing the approach at each milestone, and reviewing what the agent produces. The human’s expertise in architecture, constraint-setting, and quality judgment remains essential. The mechanical implementation is delegated.

    What happens when the AI agent encounters a bug it can’t solve?

    In our experience, the agent resolves most integration bugs within 2 to 3 retry cycles by inspecting error logs, forming a hypothesis, modifying the code, and retesting. For problems requiring information the agent doesn’t have (like tenant-specific Teams auto-attendant timing), it runs iterative tests to converge on the solution empirically. The human steps in with domain knowledge or redirects the approach when the agent is stuck.

    How does this compare to traditional development approaches?

    Traditional development would have a developer manually reading API documentation, writing code, configuring UIs, debugging integration issues, and documenting the work separately. With the agent-based approach, all of those steps happen in a continuous automated loop. The agent also produces documentation as a natural byproduct of its working method, and it can iterate on bugs without fatigue. The trade-off is that the human needs to know enough about the systems to direct well and recognize correct output.