When people see a demo of AI-powered survey programming, the reaction is usually some version of: "okay, but how accurate is it really?" That's a reasonable question. It's also, we've come to believe, the wrong one.
Accuracy on first pass matters. But a survey that's 95% correct on initial generation still has errors in it, and those errors still need to be found and fixed before the study goes to field. In practice, the initial build is only the beginning of the process. What happens next (the QA cycles, the change requests, the back-and-forth between programmer and client) is where most of the time goes. If you're only measuring how fast or how accurately an AI can generate a first draft, you're measuring the smaller part of the problem.
The metric we think about at Questra is total time to field: the elapsed time from when a questionnaire lands in a programmer's inbox to when the first respondent completes the survey. That number reflects everything, not just the build.
Why the initial build isn't the bottleneck
Survey programmers know this intuitively. An experienced programmer can turn around a solid first pass on a moderately complex study in a day. What stretches timelines is everything that follows. The client reviews the test link and sends a list of changes. The QA team catches a routing error in one of the longer survey paths. A project manager notices that a brand description carried over from last wave verbatim when the questionnaire had called for updated language. None of these are unusual, and none of them are large requests in isolation. Across a typical study, they add up to days.
The teams that field studies fastest are not necessarily the fastest programmers. They're the teams with the shortest distance between "someone found a problem" and "the problem is fixed and back in review." That's the gap worth closing.
Two different jobs
This observation shaped how we designed Questra's AI, and specifically why we didn't build a single general-purpose agent and ask it to do everything.
Initial construction and iterative editing are different enough problems that they reward genuinely different approaches. Taking a full questionnaire document and producing a working survey from it is a context-heavy task. A real questionnaire might span dozens of pages, with routing logic that references questions across sections, loops that depend on earlier responses, and instructions that assume the reader already understands what the study is trying to do. That's a lot of context to hold correctly, and models that aren't specifically built for it tend to hallucinate structure, drop conditions, or silently resolve ambiguities in ways that look plausible but aren't what the spec says. Getting this right requires more than a capable general-purpose model pointed at the right prompt.
What we've built for the Programmer is a fine-tuned model paired with a proprietary survey definition language that sits between the questionnaire and the final platform output. Rather than trying to read a questionnaire and emit Decipher XML or ConfirmIt configuration directly, the model first translates the spec into our intermediate format, which is purpose-built to represent survey logic cleanly and completely. That representation then compiles to whatever platform the customer is targeting. The model's job is narrower and better defined, which makes it more reliable, and the compilation step handles the platform-specific syntax that would otherwise add noise to the generation task.
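To make the pattern concrete, here is a minimal sketch of "one intermediate representation, several compile targets." It is not Questra's definition language, which isn't described here. The `Question` fields, the `show_if` condition string, and both output formats are hypothetical, and the emitted XML is a toy dialect rather than real Decipher or ConfirmIt syntax.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Question:
    """One single-choice question in a hypothetical intermediate format."""
    qid: str
    text: str
    options: list
    show_if: Optional[str] = None  # e.g. "Q1 == 1"; None means always shown


def to_toy_xml(questions):
    """Compile to a made-up XML dialect (NOT real Decipher syntax)."""
    lines = ["<survey>"]
    for q in questions:
        cond = f" cond={quoteattr(q.show_if)}" if q.show_if else ""
        lines.append(f"  <question id={quoteattr(q.qid)}{cond}>")
        lines.append(f"    <title>{escape(q.text)}</title>")
        # Response codes are assigned here, once, by the compiler.
        for value, label in enumerate(q.options, start=1):
            lines.append(f'    <option value="{value}">{escape(label)}</option>')
        lines.append("  </question>")
    lines.append("</survey>")
    return "\n".join(lines)


def to_toy_json(questions):
    """Compile the same representation to a made-up JSON configuration."""
    return json.dumps([asdict(q) for q in questions], indent=2)


questions = [
    Question("Q1", "Have you heard of Brand X?", ["Yes", "No"]),
    Question("Q2", "How satisfied are you with Brand X?",
             ["Satisfied", "Dissatisfied"], show_if="Q1 == 1"),
]
xml_output = to_toy_xml(questions)
json_output = to_toy_json(questions)
```

The point of the sketch is the division of labor: whatever produces the `Question` list only has to get the survey logic right, and each emitter owns the syntax of one target.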
An agent optimized for editing has a different set of constraints. The survey already exists. The agent needs to understand its current structure, interpret a natural-language change request, figure out what that request actually implies for the underlying code, make a targeted edit, and leave everything else intact. A request like "add a neutral midpoint to the satisfaction scale in Q9" sounds simple, but on a platform like Decipher or ConfirmIt, it might mean updating response values, adjusting the display logic for a conditional follow-up that fires below the scale midpoint, and re-checking the data export mapping. The editor needs to understand those implications without being told.
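A small sketch shows why that request has knock-on effects. It assumes a hypothetical data model (`Scale`, `FollowUp`, `trigger_values`) and a four-point Q9 listed from most to least satisfied, so the dissatisfied answers carry the higher response codes. Real platforms store this differently. The sketch only illustrates how one inserted answer shifts codes that other parts of the survey depend on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    qid: str
    labels: tuple  # labels[0] is coded 1, labels[1] is coded 2, and so on


@dataclass(frozen=True)
class FollowUp:
    qid: str
    source_qid: str
    trigger_values: frozenset  # shown when the source answer is in this set


def add_midpoint(scale, follow_ups, label):
    """Insert a midpoint into an even-length scale and remap dependents.

    Returns (new_scale, new_follow_ups, value_map), where value_map sends
    each old response code to its new code (useful for export recoding).
    """
    n = len(scale.labels)
    if n % 2:
        raise ValueError(f"{scale.qid} already has an odd number of points")
    half = n // 2
    new_labels = scale.labels[:half] + (label,) + scale.labels[half:]
    # Codes up to `half` keep their value; codes above it shift up by one.
    value_map = {old: old if old <= half else old + 1 for old in range(1, n + 1)}
    new_follow_ups = [
        FollowUp(f.qid, f.source_qid,
                 frozenset(value_map[v] for v in f.trigger_values))
        if f.source_qid == scale.qid else f
        for f in follow_ups
    ]
    return Scale(scale.qid, new_labels), new_follow_ups, value_map


q9 = Scale("Q9", ("Very satisfied", "Satisfied",
                  "Dissatisfied", "Very dissatisfied"))
q9a = FollowUp("Q9a", "Q9", frozenset({3, 4}))  # asked of dissatisfied answers
new_q9, new_follow_ups, value_map = add_midpoint(
    q9, [q9a], "Neither satisfied nor dissatisfied")
```

In this example the follow-up's trigger has to move from codes 3 and 4 to codes 4 and 5, and `value_map` records the recoding that the export mapping would need. Without those two adjustments, the follow-up would start firing for neutral respondents.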
These aren't just different tasks. They require different strengths, and trying to make one model excellent at both is a harder problem than building two agents that each excel at one.
Errors are part of the job
One thing we've tried to be honest about internally is that the goal isn't to eliminate errors. Nobody, human or AI, produces error-free surveys on the first pass every time. Questionnaire documents have ambiguities. Study designs get revised. Platforms have quirks that cause unexpected behavior. Errors are part of the survey programming lifecycle, and a platform that pretends otherwise is building for the demo, not for real projects.
The more useful question is what happens when an error is found. That's where QA tooling matters. If the surface between "error discovered" and "error resolved" is wide, if it requires navigating complex platform UI, understanding obscure syntax, or waiting on a programmer's availability, then each error represents a meaningful delay. If that surface is narrow, if you can describe the correction in a message and get it back in minutes, the cost of an error drops significantly. The study doesn't stall; it keeps moving.
This is a different design philosophy than building for the ideal case. It starts from the assumption that problems will arise and asks what the fastest path through them looks like.
What we're building toward
The two-agent approach is where Questra is today, but it's a foundation rather than a finished product. Now that we've accepted errors as a reality of the programming process, the next phase of work is building tools specifically designed to find them. The approach we're taking breaks into two parts.
The first, which we'll be releasing soon, is pressure testing: a tool that clicks through a survey automatically, simulating nearly every permutation of answers a respondent could give and looking for hard errors. These are broken skip conditions, unresolvable pipes, and logic paths that lead nowhere: the class of errors that cause surveys to malfunction in ways that are immediately obvious once triggered but easy to miss when you're spot-checking manually. Covering every path is the part of QA that currently takes the most time when done by hand and is the most likely to let something slip through.
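As a rough illustration of the idea, and not a description of how the pressure-testing tool is implemented, the sketch below walks every answer path through a tiny survey and records three kinds of hard error. The routing model (`default_next`, `skips`, `pipes`) and the error names are assumptions made for the example.

```python
from dataclasses import dataclass, field

END = "END"  # sentinel routing target meaning "survey complete"


@dataclass
class Question:
    qid: str
    options: list
    default_next: str                           # target when no skip rule matches
    skips: dict = field(default_factory=dict)   # answer -> target qid
    pipes: list = field(default_factory=list)   # qids whose answers are piped in


def pressure_test(survey, start):
    """Walk every answer path from `start` and return sorted hard errors."""
    errors = set()
    stack = [(start, ())]  # (next qid to show, ((qid, answer), ...) so far)
    while stack:
        qid, path = stack.pop()
        if qid == END:
            continue
        trail = " > ".join(f"{q}={a}" for q, a in path) or "(start)"
        seen = {q for q, _ in path}
        if qid not in survey:
            errors.add(("missing_target", qid, trail))
            continue
        if qid in seen:
            errors.add(("loop", qid, trail))
            continue
        question = survey[qid]
        for source in question.pipes:
            if source not in seen:  # pipe refers to a question never asked
                errors.add(("unresolved_pipe", f"{qid}<-{source}", trail))
        for option in question.options:
            target = question.skips.get(option, question.default_next)
            stack.append((target, path + ((qid, option),)))
    return sorted(errors)


survey = {
    "Q1": Question("Q1", ["Yes", "No"], "Q2", skips={"No": "Q3"}),
    "Q2": Question("Q2", ["Brand A", "Brand B"], "Q3"),
    "Q3": Question("Q3", ["1", "2"], END, skips={"2": "Q5"}, pipes=["Q2"]),
}
errors = pressure_test(survey, "Q1")
```

Here the "No" branch reaches Q3 without ever answering Q2, so the pipe cannot resolve, and every path that answers "2" at Q3 routes to a Q5 that does not exist. The number of paths grows multiplicatively with each branching question, which is one reason a real tool might settle for "nearly every" permutation.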
The second part addresses a harder problem: soft errors. These are survey states that are technically valid code but don't match what the researcher actually designed. A skip condition that routes correctly but skips a question it shouldn't. A quota that fills before the right respondents have been collected. Language that reads as intended on its own but creates an unintended impression in context. Soft errors can only be caught by a person who understands the intent of the survey, not just its mechanics. But that doesn't mean the platform can't help. Our approach here is to surface the logic and metadata that make human review faster, fill quotas to get the reviewer to the right states quickly, and build regression tests for functionality that's already been validated so reviewers never have to check the same thing twice. The goal isn't to eliminate the human from this step; it's to make the time they spend on it as efficient as possible.
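One way to picture the regression-test idea is below. It is a sketch under assumed names, not a description of Questra's tooling. Once a reviewer has confirmed that a given set of answers shows the right questions, that case is stored. After a later edit, only the cases whose path changed need a second look. The routing format and the `validated` structure are hypothetical.

```python
def questions_shown(routing, answers, start="Q1", end="END"):
    """Follow routing for one fixed set of answers; return the qids shown.

    `routing` maps qid -> (default_next, {answer: skip_target}).
    """
    shown, qid = [], start
    while qid != end:
        if qid in shown:
            raise ValueError(f"routing loops back to {qid}")
        shown.append(qid)
        default_next, skips = routing[qid]
        qid = skips.get(answers.get(qid), default_next)
    return shown


def changed_cases(validated, routing):
    """Names of previously validated cases whose path differs under `routing`."""
    return [
        name
        for name, (answers, expected) in validated.items()
        if questions_shown(routing, answers) != expected
    ]


routing_v1 = {
    "Q1": ("Q2", {"No": "Q3"}),  # respondents unaware of the brand skip Q2
    "Q2": ("Q3", {}),
    "Q3": ("END", {}),
}
# Cases a human reviewer has already signed off on against routing_v1.
validated = {
    "aware": ({"Q1": "Yes"}, ["Q1", "Q2", "Q3"]),
    "unaware": ({"Q1": "No"}, ["Q1", "Q3"]),
}
# A later edit that accidentally drops the skip rule on Q1.
routing_v2 = {**routing_v1, "Q1": ("Q2", {})}

needs_review = changed_cases(validated, routing_v2)
```

A check like this cannot say whether the new behavior is right; that remains a human judgment. It only narrows the review to the cases that actually moved.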
The researchers and programmers we work with aren't looking for magic. They know their work is complex and that automation doesn't make it simple. What they want is a process that's faster, more transparent, and less dependent on any single person being available at any given moment. That's what we're trying to build, one piece of the workflow at a time.
Questra automates survey programming from uploaded questionnaire to deployable survey, with an AI editor that handles change requests and corrections after the initial build.