Blog

Long-form notes on the tradeoffs that actually change how a live interview feels.

Featured Post

Which model you should use

Last updated April 11, 2026

Live interviews punish hesitation. You can have the right idea, the right wording, and the right instinct, and still sound weaker than you are if the answer arrives a beat too late. We optimize for time to first answer token before almost everything else because that first beat decides whether the assistant feels like calm support or dead weight. In practice, the experience of using an interview assistant is shaped less by benchmark theater and more by whether the first useful sentence lands quickly enough to keep your own thinking rhythm intact when the room gets tense.

Mercury 2 stays at the top of our stack for that reason. In live sessions it most often feels ready to move at conversation speed, not demo speed, and that difference matters when every pause sounds louder than it should. Gemini Flash 3.1 Lite stays close behind because it is still quick, scales well, and gives us a very strong fallback when we want speed with a more familiar ecosystem behind it. The difference is not that Gemini is weak. The difference is that Mercury more often feels like the sharper tool when the cost of even a small delay is your own confidence.

The table below is the cleanest version of that opinion. It is not meant to pretend these models live in a sterile benchmark lab. It is meant to reflect what matters in a real interview loop: how quickly the first answer appears, whether the stream keeps up, and whether the price feels justified once real pressure enters the conversation.

My short list for live interview use, weighted toward responsiveness over lab-style score chasing.
Model | Time to first token | Tokens / sec | Price (per 1M tokens)
Mercury 2 | p95 sub-second latency under high concurrency; our measurement on a technical question: avg 0.5-1.5 s | 1,009 tok/s | $0.25 in / $0.75 out
Gemini Flash 3.1 Lite | 2.5x faster than Gemini 2.5 Flash; our measurement on a technical question: avg 1-2.5 s | +45% vs Gemini 2.5 Flash | $0.25 in / $1.50 out

Because vendors publish different official speed metrics, this table uses the concrete numbers each company actually discloses instead of inventing a fake apples-to-apples benchmark. What matters to us is still the same question: which stack keeps the answer moving when the interview is live.

Once the model decision is clear, the audio stack becomes the next bottleneck. After going through the vendor docs, shipping the integrations, and then living with them in real sessions, we have found the pattern to be pretty stable. Deepgram Flux is the best default for English interview flow because it is built around conversational turn-taking, fast end-of-turn decisions, and low-latency voice-agent behavior. The tradeoff is that it is English-only today, and that specialization is part of why it feels so dialed in when the interview is happening in English and the transcript needs to keep pace with the room.

ElevenLabs Scribe is the better choice once multilingual quality becomes the real requirement instead of an edge case. Its realtime stack is designed for live use, it publicly targets under 150 ms latency, and it covers 90+ languages, which makes it much easier to trust when the candidate or interviewer moves outside English. Apple Speech is still worth keeping around because the local path is attractive and the privacy story is clean, but we treat it as the fallback rather than the first pick. It is useful when you want to stay closer to the device, yet the dedicated cloud stacks still feel stronger in the moments where interview pressure exposes every weak transcription choice.

For people who do not want to manage keys, we run managed models on our side, but we try very hard to earn that convenience. The routing has been heavily tested, the prompt and timeout behavior have gone through repeated iteration, and we spend deep engineering work on the unglamorous details that make the product feel reliable instead of lucky. We care about answer quality, stability, and recovery behavior just as much as raw model speed, which is why we are comfortable offering higher rate limits while still keeping a generous per-session token budget in place. If that tradeoff matters to you, the privacy note and legal page are the right places to read the boundary clearly: requests go directly to your provider when you bring your own key, and managed setups carry their own privacy tradeoffs.

Get your free API key

If you want to compare the stacks yourself, these are the four starting points we keep coming back to.

Session Cost

Token budgets and what they actually cost

Last updated April 12, 2026

There are two ways to look at token usage in a session. The budget token is the hard ceiling you configure - the maximum number of tokens a single session is allowed to consume before the assistant stops. The costing token is the same count seen from the other side - the usage the provider actually bills you for. Same number, two different lenses. Setting a budget token limit means you already know your worst-case spend before the session starts.

We enforce a session cap on managed plans for two reasons. The first is predictability. A deep technical session with long transcript context can accumulate tokens faster than it feels like in the room. Without a ceiling, costs drift in ways that are hard to reason about after the fact. The second is fairness. Managed infrastructure is shared, and an uncapped session that runs unusually long puts pressure on rate limits that other sessions depend on. The cap keeps the system stable for everyone.

The ceiling we set is generous. After running several hundred real sessions across both technical and behavioral tracks - short loops, long loops, candidates who talk a lot, candidates who keep things tight - we found that 3 million tokens covers every realistic scenario comfortably. You are not going to hit the wall in a normal interview. The cap is there to protect against runaway usage, not to interrupt you mid-session.

To make that concrete, here is what 3 million tokens translates to at current provider rates for the two models we recommend most. The estimate uses a realistic session split of roughly half input, half output - system prompts, transcript context, and prior answers on the input side; generated coaching and answer drafts on the output side.

Estimated session cost at 3M token budget cap. Assumes ~1.5M input tokens and ~1.5M output tokens - a conservative ceiling well above what most sessions actually consume.
Model | Input rate | Output rate | Est. cost at 3M cap
Mercury 2 | $0.25 / 1M tokens | $0.75 / 1M tokens | ~$1.50 worst case; most sessions land well under half that
Gemini Flash 3.1 Lite | $0.25 / 1M tokens | $1.50 / 1M tokens | ~$2.63 worst case; higher output rate, still very affordable per session
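If you want to sanity-check the arithmetic yourself, here is a minimal Python sketch of the worst-case math behind the table: half the cap as input, half as output, priced at the per-1M-token rates. The function name and structure are illustrative, not code from the app.

```python
def worst_case_cost(cap_tokens, input_rate, output_rate, input_share=0.5):
    """Dollar cost if a session burns the full token cap.

    Rates are dollars per 1M tokens; input_share is the assumed
    fraction of the cap spent on input tokens (0.5 in the table).
    """
    input_tokens = cap_tokens * input_share
    output_tokens = cap_tokens - input_tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 3M-token cap with the rates from the table
print(worst_case_cost(3_000_000, 0.25, 0.75))  # Mercury 2 → 1.5
print(worst_case_cost(3_000_000, 0.25, 1.50))  # Gemini Flash 3.1 Lite → 2.625
```

Shifting the input/output split moves the Gemini number more than the Mercury number, because only Gemini's output rate differs from its input rate.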

Inside the app, the status bar at the bottom of the overlay panel shows two small pills during every session so you can see where you stand at a glance.

[Screenshot: overlay status bar showing C 0.5M/3M for costing tokens and Y 0.2M/1M for your own BYOK tokens, with a red exhausted state example on the right]

C is the costing token counter - it tracks usage on app-managed AI (TerviewSky). Y is your own token counter - it tracks usage when you bring your own provider key. Both show spent / limit in compact form (e.g. 0.5M/3M). When either pill turns red the session has hit its cap and responses pause.
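As an illustration of the compact spent/limit format the pills use, here is a small sketch, assuming the pill simply trims trailing zeros from a millions figure and flags the exhausted state at the cap. The function names are ours, not the app's.

```python
def compact_m(tokens):
    """Render a token count in compact millions, e.g. 500_000 -> '0.5M'."""
    text = f"{tokens / 1_000_000:.1f}".rstrip("0").rstrip(".")
    return f"{text}M"

def pill(spent, limit):
    """Return the pill label and its state: 'ok' below the cap, 'red' at it."""
    label = f"{compact_m(spent)}/{compact_m(limit)}"
    state = "red" if spent >= limit else "ok"
    return label, state

print(pill(500_000, 3_000_000))    # ('0.5M/3M', 'ok')
print(pill(3_000_000, 3_000_000))  # ('3M/3M', 'red')
```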

If you bring your own API key, you set the token budget yourself in Settings → AI & Providers → Token Budget. The Token Limit field is the number that maps to the Y pill in the overlay - that number feeds directly into your provider spend, so the cost table above is your ceiling. The Warning Threshold slider turns the Y pill amber before you hit the hard cap, giving you a heads-up mid-session.

[Screenshot: settings panel showing AI & Providers selected, with a Token Budget section containing a Token Limit field set to 1,000,000 and a Warning Threshold slider at 80%]
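The two settings combine into a simple three-state check. This is a hedged sketch of that logic, not the app's implementation; the parameter names mirror the Settings fields but are our own.

```python
def budget_state(spent, token_limit, warning_threshold=0.80):
    """State of the Y pill: red at the hard cap, amber past the
    warning threshold, normal otherwise."""
    if spent >= token_limit:
        return "red"    # cap hit: responses pause
    if spent >= token_limit * warning_threshold:
        return "amber"  # mid-session heads-up
    return "normal"

print(budget_state(200_000, 1_000_000))    # normal
print(budget_state(850_000, 1_000_000))    # amber
print(budget_state(1_000_000, 1_000_000))  # red
```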

Free-tier keys from both providers have enough quota to run a lite session without touching paid credits. If you want higher rate limits and more headroom for back-to-back sessions, a pay-as-you-go key unlocks that, and the in-app budget limit keeps your monthly bill from drifting higher than you expect.

One important clarification: the token cap only applies to the live interview overlay. Everything else - CV review, application drafting, pre-interview prep, and post-interview feedback - runs without a session cap. Those features are powered by the same fine-tuned models we have spent a lot of time optimizing for quality, so use them as much as you need.

On managed plans we go further. The routing, prompt structure, and context-trimming logic have been through repeated iteration specifically to spend fewer tokens per useful answer. That means the effective cost per session is lower than the raw per-token rate suggests, rate limits are higher than what a standard free-tier key gets, and the whole thing stays behind the same 3M cap as a hard backstop. If you want to understand the privacy tradeoff that comes with managed routing versus your own key, the detail is in the legal page - the short version is that managed requests go through our infrastructure and your own-key requests go directly to your provider.

Bring your own key

Both Mercury 2 and Gemini Flash 3.1 Lite have free tiers that cover a lite session. Sign up, paste the key into TerviewSky settings, and the token budget you set maps directly to the cost table above.

More notes coming soon
