A candidate interviews flawlessly. They answer every question with structured confidence, respond to follow-ups without hesitation, and demonstrate apparent fluency in topics that their resume only briefly mentions. They get the offer. Two weeks in, their manager is fielding questions the candidate should have been able to answer on day one.
This is the off-camera coaching scenario, and in many ways it is harder to catch than full proxy fraud. A proxy requires a different face and voice. Off-camera coaching uses the real candidate; only the coaching is hidden.
What off-camera coaching actually looks like
The term covers a range of arrangements, roughly ordered by sophistication:
- Earpiece coaching. A coach listens to the interview in real time and feeds answers verbally through a concealed earpiece. The candidate repeats them, usually with slight rephrasing. The delay between hearing and speaking is short and consistent: it can pass for thoughtful pausing, but unlike a genuine pause it barely varies from one question to the next.
- Second-screen text feed. The candidate reads answers from a phone, tablet, or monitor positioned just outside the camera frame — often below or to the right of the primary screen. The coaching may be a human typing live, a prepared script, or an AI tool generating answers in real time as the questions are heard.
- Real-time AI assistance. The candidate runs a tool like ChatGPT on a second device or browser tab, entering the interview question as it's asked and reading the generated response. No human coach is needed. The latency between question and answer is almost entirely determined by how fast the AI responds — typically two to four seconds.
- Screen-control assistance. For technical or coding interviews, a third party has remote access to the candidate's coding environment and contributes to the solution while the candidate talks through it verbally.
Why it's easy to confuse with preparation
A well-prepared candidate and a coached candidate can look nearly identical to a human reviewer watching in real time. Both give structured answers. Both respond without significant hesitation. Both demonstrate apparent command of the material.
The difference shows up in the details of how the answer is produced, not in the content of the answer itself. A genuinely prepared candidate has internalized knowledge — their answers vary in structure depending on the question, they can go deeper unprompted, and their eye movement patterns reflect cognitive retrieval. A coached candidate is performing a real-time relay — and that relay has distinct signatures.
The behavioral signals
Gaze directed off-center at a fixed focal depth
Genuine thinking produces varied eye movement: upward, away, sometimes at the camera, shifting as the person assembles their answer. Reading from a positioned screen produces something different — a consistent lateral or downward gaze, held at a fixed focal distance, with a subtle return-to-start pattern as the candidate moves to the next line or the next generated response. The gaze isn't wandering. It's tracking.
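The contrast between wandering and tracking can be sketched numerically. The snippet below is a minimal illustration, not a description of any product's method: it assumes a hypothetical upstream gaze estimator that yields (x, y) offsets from the camera in normalized units, and the `radius` value is an arbitrary placeholder.

```python
import math
from statistics import median

def gaze_summary(samples, radius=0.05):
    """Summarize how tightly gaze clusters around one point.

    samples: list of (x, y) gaze offsets from the camera, in hypothetical
    normalized units where (0, 0) means looking at the camera.
    """
    # The median point is a robust estimate of where the gaze usually rests.
    cx = median(x for x, _ in samples)
    cy = median(y for _, y in samples)
    # Count samples that fall within `radius` of that resting point.
    near = sum(1 for x, y in samples if math.hypot(x - cx, y - cy) <= radius)
    return {
        "anchor_offset": math.hypot(cx, cy),     # distance of resting point from camera
        "fixation_ratio": near / len(samples),   # share of samples near the resting point
    }

# Illustrative data only: mostly parked on one off-center spot vs. moving around.
reading = [(0.30, -0.20)] * 8 + [(0.0, 0.0)] * 2
thinking = [(-0.4, 0.3), (0.5, -0.1), (0.0, 0.0), (0.2, 0.4), (-0.3, -0.35)]

reading_summary = gaze_summary(reading)
thinking_summary = gaze_summary(thinking)
```

A high `fixation_ratio` combined with a large `anchor_offset` is the pattern described above: gaze parked on one off-center point. Any threshold for "high" or "large" would need calibrating against real recordings.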
Verbatim repetition with a fixed delay
Earpiece coaching produces a characteristic latency pattern. The candidate hears a phrase, then speaks it — with a delay of roughly one to two seconds that is consistent across the session. Authentic speech has variable latency depending on sentence complexity and how much retrieval the thought requires. Coached speech has flatter latency and a tendency toward direct repetition of phrasing — the words arrive pre-formed, so they're reproduced with minimal modification.
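One way to put a number on "flatter latency" is the coefficient of variation (standard deviation divided by mean) of the response delays. The sketch below assumes the delays have already been measured in seconds by some upstream step; the sample values are invented for illustration.

```python
from statistics import mean, pstdev

def latency_cv(latencies):
    """Coefficient of variation of response latencies (seconds).

    Values near 0 mean the delay barely changes from answer to answer.
    """
    return pstdev(latencies) / mean(latencies)

# Illustrative values only, not measurements from real interviews.
relayed = [1.4, 1.5, 1.6, 1.5, 1.5]   # consistent one-to-two-second delay
natural = [0.4, 2.8, 1.1, 4.0, 0.7]   # delay varies with the question

relayed_cv = latency_cv(relayed)
natural_cv = latency_cv(natural)
```

A low value on its own proves nothing, since some people simply speak at a steady pace. It is one input to weigh alongside the other signals.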
Micro-pauses that don't match question complexity
A genuine expert pauses longer before harder questions and shorter before familiar ones. Under coaching, that correlation breaks. Pause duration becomes roughly uniform, because it reflects the time to receive the coached answer, not the time to think one up. In some cases it even inverts: a candidate who answers a complex architecture question in two seconds but hesitates for three before "tell me about yourself" has reversed the normal difficulty-latency relationship.
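The difficulty-latency relationship can be checked with an ordinary Pearson correlation between a per-question difficulty rating and the pause before each answer. In this sketch the 1-to-5 difficulty scale and all sample values are assumptions made for illustration; how difficulty is actually rated is left open.

```python
from statistics import mean

def difficulty_latency_correlation(difficulty, pauses):
    """Pearson correlation between question difficulty and pause length.

    Returns 0.0 when either series has no variance (for example,
    perfectly uniform pauses), where the correlation is undefined.
    """
    if len(difficulty) != len(pauses) or len(difficulty) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    md, mp = mean(difficulty), mean(pauses)
    cov = sum((d - md) * (p - mp) for d, p in zip(difficulty, pauses))
    var_d = sum((d - md) ** 2 for d in difficulty)
    var_p = sum((p - mp) ** 2 for p in pauses)
    if var_d == 0 or var_p == 0:
        return 0.0
    return cov / (var_d * var_p) ** 0.5

difficulty = [1, 2, 3, 4, 5]            # hypothetical rating, 1 = easiest
genuine = [0.5, 1.0, 1.8, 2.5, 3.5]     # pauses grow with difficulty
uniform = [2.0, 2.0, 2.0, 2.0, 2.0]     # same pause regardless of question
inverted = [3.0, 2.8, 2.5, 2.2, 2.0]    # longest pause on the easiest question

r_genuine = difficulty_latency_correlation(difficulty, genuine)
r_uniform = difficulty_latency_correlation(difficulty, uniform)
r_inverted = difficulty_latency_correlation(difficulty, inverted)
```

A strongly positive value matches the expected pattern. A value near zero or below zero is the broken correlation described above, though a handful of questions is a small sample and should be read cautiously.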
Audio anomalies in the candidate's track
Earpiece usage and nearby devices leave traces in the audio. A faint second voice, occasional RF interference from a wireless earpiece, keyboard sounds during the candidate's own speech, or the notification sound from a second phone can all be detected in the audio track of a recorded interview. None of these individually proves anything, but together they create a signal worth examining in context.
Answer depth that doesn't extend on follow-up
Ask a coached candidate to expand on one point from their last answer. If the coaching source is a pre-generated response or a human who has moved on, the candidate will often restate the original answer or hedge. Genuine expertise generates new information when pushed — additional examples, caveats, related concepts. Coached answers replay the source. There's nothing behind the answer because the answer was never theirs.
Why real-time AI changed the scale of this problem
Until recently, off-camera coaching required a real person: someone knowledgeable enough to coach and available to do it in real time. That was a meaningful constraint. A candidate needed a willing accomplice with relevant expertise.
AI removed that constraint. Any candidate can now run ChatGPT, Claude, or a similar tool on a second device and receive technically fluent, well-structured answers to almost any interview question within seconds. No human coach needed. The barrier dropped from "find an expert willing to commit fraud" to "have a smartphone nearby."
This is why the volume of off-camera-assisted fraud has grown even as candidates have become more aware that integrity tools exist. The activation cost is low, the answers are persuasive, and the candidate remains on camera — so it doesn't register as the kind of fraud that gets flagged by appearance-based checks.
What systematic analysis finds that manual review misses
A human reviewer watching a recording in real time will catch the most egregious cases — a visible earpiece, a candidate clearly reading from something nearby. The subtler signals require analysis across the full session: comparing latency across all answers, mapping gaze patterns frame by frame, isolating the audio track for background signals.
No reviewer does this manually at scale. The signals that confirm off-camera coaching are too granular and too distributed across a 40-minute session. They need to be measured across every question, compared to baseline, and surfaced with timestamps so the finding can actually be reviewed.
That's the difference between "something felt assisted" and "here are seven timestamps where the gaze pattern, response latency, and audio signature are consistent with off-camera reading — and here's what each one looks like."
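As a rough sketch of what surfacing findings with timestamps could look like, the snippet below combines three per-answer flags and reports only the answers where several agree. The `AnswerSignals` structure, its field names, and the `min_signals` threshold are all hypothetical; this is not a description of how any particular tool works.

```python
from dataclasses import dataclass

@dataclass
class AnswerSignals:
    """Per-answer flags from upstream analysis (hypothetical structure)."""
    start_seconds: int      # when the answer starts in the recording
    gaze_fixed: bool        # gaze held on one off-center point
    latency_flat: bool      # pause does not track question difficulty
    audio_anomaly: bool     # second voice, keystrokes, notification sounds

def flagged_timestamps(answers, min_signals=2):
    """Return (mm:ss, signal_count) for answers where enough signals agree."""
    flagged = []
    for a in answers:
        count = sum([a.gaze_fixed, a.latency_flat, a.audio_anomaly])
        if count >= min_signals:
            minutes, seconds = divmod(a.start_seconds, 60)
            flagged.append((f"{minutes:02d}:{seconds:02d}", count))
    return flagged

# Illustrative session: three answers with different signal combinations.
session = [
    AnswerSignals(95, True, True, False),
    AnswerSignals(410, False, True, False),
    AnswerSignals(1330, True, True, True),
]
report = flagged_timestamps(session)
```

Requiring agreement between signals reflects the point made earlier: no single signal proves anything, so a reviewer is pointed only at moments where several coincide.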
See these signals detected automatically
HireBetter analyzes every interview recording and surfaces each flag with a timestamp and reviewable clip — so you can verify it, not just trust it.