Introducing Speak Mode: AI Voice Translation That Speaks For You in Meetings
There is a specific, painful moment that every non-native speaker in a multilingual meeting knows. The discussion is moving quickly. Captions are scrolling on your screen. You understand everything perfectly — Gaavala has been translating the other side for twenty minutes and you are following along word by word. Someone turns to you and asks a direct question.
And then the language barrier, which had been one-way, becomes a wall.
You know the answer. You know exactly what you want to say. You simply cannot say it at the same speed and precision in the meeting's working language. You hesitate, soften, simplify, and end up sounding less informed than you are. By the time you assemble a careful sentence, the conversation has moved on.
We built Speak Mode to close that loop. With Speak Mode enabled, you speak in your native language and Gaavala speaks your translated voice back into the meeting. The other participants hear a fluent, real-time voice delivering your response in their language. You stop being the quiet person in the room.
This post walks through exactly what Speak Mode does, the two capture modes that ship with it, the three TTS engines you can choose between, and the tradeoffs you should understand before enabling it for a high-stakes call.
What Speak Mode Actually Does
Speak Mode is not a dubbing track layered on top of a transcript. It is a live, bidirectional voice pipeline that runs alongside the existing Gaavala transcription flow.
When you activate it during a meeting, the following happens inside your browser:
- Your microphone audio is captured by the extension — separately from the tab audio Gaavala already captures for the incoming side.
- That microphone audio streams directly to Soniox over a WebSocket for speech-to-text. This is the same direct-to-Soniox pattern the rest of Gaavala uses — no audio touches our backend.
- As Soniox emits committed tokens in your source language, the extension batches them into short phrases and sends the text to Gaavala's translation step for the target language you configured.
- The translated text is handed to your selected TTS engine, which synthesizes a voice clip.
- That voice clip is written into the meeting audio output — the same virtual audio path your Teams, Zoom, Meet, or Webex client uses for the microphone — so everyone in the call hears the translated voice as if it were your own.
The whole loop runs while you are still speaking. By the time you finish a sentence in your language, the synthesized translation is already playing out in theirs. You do not wait for a "translate" button. You do not pause between thoughts longer than you normally would.
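To make the batching step concrete, here is a minimal sketch of how committed tokens could be grouped into short phrases before translation. It is an illustration only, not Gaavala's actual code: the `CommittedToken` shape, the `createPhraseBatcher` name, and the 80-character limit are all assumptions.

```typescript
// Hypothetical token shape; the real Soniox message format may differ.
interface CommittedToken {
  text: string;
}

// Buffers committed tokens and emits a phrase when the buffer ends in
// sentence punctuation or grows past maxChars (an assumed limit).
function createPhraseBatcher(maxChars: number = 80) {
  let buffer = "";

  // Emits whatever is buffered, or null if there is nothing to send.
  const flush = (): string | null => {
    const phrase = buffer.trim();
    buffer = "";
    return phrase.length > 0 ? phrase : null;
  };

  // Adds one token; returns a finished phrase or null if more is needed.
  const push = (token: CommittedToken): string | null => {
    buffer += token.text;
    const endsSentence = /[.!?。！？]\s*$/.test(buffer);
    return endsSentence || buffer.length >= maxChars ? flush() : null;
  };

  return { push, flush };
}
```

Each non-null result from `push` would go to the translation step, and `flush` covers the case where capture stops mid-phrase.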
Two Modes: Push-to-Talk and Timed Speak
Speak Mode ships with two capture styles. They exist because real meetings are not all alike, and the right capture behavior depends on how the conversation is flowing.
Push-to-Talk (PTT)
PTT is the walkie-talkie model. You hold a hotkey while you speak and release it when you are done. While the key is held, Gaavala captures your microphone, transcribes, translates, and plays the synthesized voice into the meeting. When you release, the pipeline closes cleanly.
PTT is the safer mode. It gives you explicit control over exactly when the synthesized voice is heard, so you never accidentally broadcast a side comment, a cough, or a moment of thinking-out-loud. It is the right choice for the first few calls you run with Speak Mode, for high-stakes meetings where precision matters more than flow, and for any environment where your microphone might pick up background voices you do not want translated.
The tradeoff is that you have to remember to hold the key, and your hand is occupied for the duration of your speech. For a short answer, that is nothing. For a long explanation, it gets tiring.
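The hold-and-release behavior can be pictured as a two-state machine. The sketch below is a hedged illustration; the state, event, and action names are invented for this example and do not come from the extension.

```typescript
type PttState = "idle" | "capturing";
type PttEvent = "keydown" | "keyup";
type PttAction = "open-pipeline" | "close-pipeline" | "none";

// One transition of a push-to-talk state machine (hypothetical names).
function pttStep(
  state: PttState,
  event: PttEvent
): { state: PttState; action: PttAction } {
  if (state === "idle" && event === "keydown") {
    return { state: "capturing", action: "open-pipeline" };
  }
  if (state === "capturing" && event === "keyup") {
    return { state: "idle", action: "close-pipeline" };
  }
  // Auto-repeated keydown while held, or a stray keyup, changes nothing.
  return { state, action: "none" };
}
```

Ignoring repeated `keydown` events matters because browsers fire them continuously while a key is held, and the pipeline should open only once per press.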
Timed Speak
Timed Speak is hands-free. You enable it once at the start of your turn, and Gaavala automatically decides when each thought ends — using pause detection and transcript punctuation signals — and flushes the translated voice into the meeting without waiting for you to press anything.
Timed Speak is the mode for flow. When you are presenting, telling a story, or walking through a slide deck, it feels much more natural. You speak the way you would in your own language, and the translation rolls out on the other side with a small trailing delay.
The tradeoff is precision. Because Gaavala is deciding when to cut, you lose the ability to hold back a half-formed thought. If you mutter to yourself mid-sentence, that mutter can end up in the synthesized stream. Most people start with PTT, get comfortable with the latency and voice of their chosen engine, and graduate to Timed Speak after a few meetings.
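As a rough sketch of how pause detection and punctuation could combine, the function below flushes sooner when the pending text already ends a sentence and waits longer when it does not. The function name, field names, and thresholds are assumptions for illustration; the real heuristics may be quite different.

```typescript
interface FlushInput {
  pendingText: string;       // transcript not yet sent to translation
  msSinceLastToken: number;  // how long the speaker has been silent
}

interface FlushConfig {
  pauseMs: number;           // silence needed when the text is unpunctuated
  punctuatedPauseMs: number; // shorter silence needed after . ! or ?
}

// Decides whether the pending text should be translated and spoken now.
function shouldFlush(input: FlushInput, config: FlushConfig): boolean {
  const text = input.pendingText.trim();
  if (text.length === 0) return false;
  const endsSentence = /[.!?。！？]$/.test(text);
  const threshold = endsSentence ? config.punctuatedPauseMs : config.pauseMs;
  return input.msSinceLastToken >= threshold;
}
```

With example thresholds of 900 ms and 300 ms, a finished sentence goes out after a short breath, while a trailing half-sentence is held until the silence is long enough to suggest you have stopped.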
Three Engines: Soniox Studio Voices, Kokoro On-Device, and ElevenLabs Cloud
Speak Mode supports three text-to-speech engines. They are very different tools and we picked each for a specific kind of user.
Soniox studio voices (Pro default)
Soniox studio voices are the default engine on Pro, and most people never need to change it. You get 28 studio voices, each of which can speak any of the 60 languages Gaavala translates, on a managed key with nothing to set up. Synthesis goes browser-direct to Soniox using the same short-lived-key pattern as captions, so the translated text never passes through Gaavala's backend.
Kokoro-82M (on-device, free)
Kokoro-82M is a compact open-weights TTS model that runs entirely inside your browser, accelerated by WebGPU where available. When you select Kokoro in Speak Mode settings, the model weights download once and stay cached — subsequent sessions start instantly.
What you get:
- 25 English voices covering a range of tone and gender
- Zero marginal cost — there is no per-character billing
- Complete privacy — the synthesis never leaves your machine
- Low latency once the model is warm, because there is no network round-trip
What you do not get:
- Languages other than English
- Voice cloning
- The polish of a commercial cloud voice
Kokoro is the right choice if your target audience speaks English and you care about cost, privacy, or offline-friendliness. It is also a good option for practicing with Speak Mode before you start paying a cloud provider for characters.
ElevenLabs (cloud, BYOK)
ElevenLabs is a commercial cloud TTS service with multilingual support and high-quality voice cloning. To use it with Speak Mode, you bring your own API key — you create a free or paid account directly with ElevenLabs, paste the key into Gaavala's side panel, and the extension makes calls straight from your browser to the ElevenLabs API. Your key is stored in chrome.storage.local and never touches Gaavala's backend.
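For readers curious what a browser-direct call looks like, here is a hedged sketch that builds a request for the public ElevenLabs text-to-speech endpoint. The helper name, the choice of the `eleven_multilingual_v2` model, and the overall shape are assumptions for illustration; this post does not document how the extension actually forms its requests, so check the ElevenLabs API reference before relying on it.

```typescript
interface TtsRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a text-to-speech request.
// Assumes the apiKey was already read from chrome.storage.local.
function buildElevenLabsRequest(
  apiKey: string,
  voiceId: string,
  text: string
): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    method: "POST",
    headers: {
      "xi-api-key": apiKey, // ElevenLabs authenticates with this header
      "Content-Type": "application/json",
    },
    // model_id is an assumed choice of a multilingual model.
    body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }),
  };
}
```

In a browser context the result could be passed to `fetch(req.url, req)`. Because the request is assembled and sent from your machine, the key never needs to reach any server other than ElevenLabs.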
What you get:
- Coverage of 30+ languages, matching most of the language pairs Gaavala already translates
- Voice cloning — you can train a voice on a short sample of your own speech and use it as your Speak Mode voice, so the meeting hears a voice that sounds like you
- High-quality, natural-sounding output
The honest tradeoffs:
- You pay ElevenLabs directly for characters synthesized. Gaavala does not take a cut, but we also do not absorb the cost.
- Every sentence adds a cloud round-trip, so latency is higher than Kokoro — typically a second or two depending on geography and plan tier.
- You are trusting a second vendor (ElevenLabs) with the translated text for the duration of the synthesis call. If your meeting is sensitive enough that this matters, Kokoro is the better choice.
Pick ElevenLabs when you want your own cloned voice on the other side, or when a specific premium ElevenLabs voice matters more than cost.
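The guidance across the three engines can be condensed into a small decision helper. This is a sketch of the reasoning in this section, not a feature of the extension; the `suggestEngine` name and the input fields are hypothetical.

```typescript
type Engine = "soniox" | "kokoro" | "elevenlabs";

interface EngineNeeds {
  targetLanguage: string;   // language tag such as "en", "en-US", "ja"
  wantsClonedVoice: boolean;
  preferOnDevice: boolean;  // cost, privacy, or offline-friendliness first
}

// Mirrors the section's advice: cloning means ElevenLabs, on-device
// English means Kokoro, and everything else stays on the Pro default.
function suggestEngine(needs: EngineNeeds): Engine {
  if (needs.wantsClonedVoice) return "elevenlabs";
  const isEnglish = needs.targetLanguage.toLowerCase().split("-")[0] === "en";
  if (needs.preferOnDevice && isEnglish) return "kokoro";
  return "soniox";
}
```

Note that an on-device preference with a non-English target falls back to the Soniox default, because Kokoro only speaks English.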
How It Compares to What Else Is Out There
Real-time voice translation in meetings has been a "coming soon" feature for years. A few things exist, but none of them cover the same ground Speak Mode covers.
- Microsoft Teams Interpreter Agent. Microsoft announced an Interpreter Agent for Teams that can translate and speak with a user's voice. It is Teams-only. It requires a Microsoft 365 Copilot license, which is currently around $30 per user per month on top of a qualifying base plan, and it is available only where a tenant admin has enabled Copilot for the organization. If you live inside a Copilot-licensed Teams estate, it is a legitimate option. If you work across Teams, Zoom, Meet, and Webex with customers outside your tenant, it will not help.
- ChatGPT Voice Mode. ChatGPT can hold a voice conversation with you, and it can translate. What it cannot do is sit inside your Zoom call and speak on your behalf to other participants. It is a personal voice chatbot, not a meeting voice agent.
- Google Meet translated captions. Google Meet has real-time translated captions. They are text-only. There is no spoken output on the other side.
- Interprefy, KUDO, and enterprise interpretation platforms. These are real interpretation services — human interpreters, mostly, sometimes with AI assistance — priced and structured for large conferences and regulated industries. They are not something an individual consultant adds to a one-on-one sales call.
As far as we can tell, there is no consumer-priced tool in the market that does genuinely two-way voice translation inside the meeting platforms people actually use. Speak Mode is our answer to that gap.
Who Speak Mode Is For
Speak Mode is a Pro feature and it is priced for individual professionals, not enterprise deployments. The people who get the most out of it are:
- International consultants running project calls with clients whose English is limited and who would rather hear you in their own language than struggle through yours.
- Multilingual team leads joining standups with offshore engineering teams, where half the team thinks in one language and half thinks in another.
- Sales reps selling into foreign markets where a fluent-sounding voice in the customer's language changes how the call feels and how the quote lands.
- Field operations people bridging between an offshore site and HQ, where the stakes of a misunderstanding are operational, not just social.
If everyone in your meetings is already comfortable in both languages and can keep up, you probably do not need Speak Mode. If there is a recurring moment where you or someone on the other side falls silent because the language gap is one-way, this is the feature that fixes that moment.
Privacy: What Changes and What Does Not
Gaavala's core privacy architecture does not change when you enable Speak Mode. Incoming meeting audio still streams directly from your browser to Soniox over a WebSocket that never touches our backend. Your microphone audio for Speak Mode follows the same direct-to-Soniox pattern — we never see it.
The TTS step is where the engines differ:
- Soniox studio voices send the translated text browser-direct to Soniox, authenticated with the same short-lived temporary keys captions use. Gaavala's backend never sees the text or the audio.
- Kokoro runs the synthesis entirely inside your browser. The translated text never leaves your machine during synthesis. There is no third party in the voice generation loop at all.
- ElevenLabs requires a network call from your browser directly to the ElevenLabs API, using your own API key. Gaavala never sees the key and never sees the generated audio. You are trusting ElevenLabs with the translated text during the synthesis, exactly as you would if you called their API from any other tool.
In all three cases, Gaavala's backend remains outside the audio path. If you want to verify this, the same network inspection trick from our privacy post still works: open the extension's background page DevTools, filter the Network tab, and you will see direct connections to Soniox and (if enabled) ElevenLabs, and no audio traffic to gaavala.com.
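If you copy the hostnames out of the Network tab, a few lines of code can sort them for review. The sketch below is only an aid to that manual check. The `soniox.com` and `elevenlabs.io` suffixes are assumptions about which domains those vendors serve from, and the `auditHosts` helper is hypothetical.

```typescript
interface HostAudit {
  vendor: string[];  // Soniox or ElevenLabs: expected to carry audio or TTS
  gaavala: string[]; // inspect these payloads; they should carry no audio
  other: string[];   // anything else worth a second look
}

// True when host is the domain itself or one of its subdomains.
function isUnder(host: string, domain: string): boolean {
  return host === domain || host.endsWith("." + domain);
}

// Groups hostnames copied from the DevTools Network tab.
function auditHosts(hosts: string[]): HostAudit {
  const result: HostAudit = { vendor: [], gaavala: [], other: [] };
  for (const raw of hosts) {
    const host = raw.trim().toLowerCase();
    if (isUnder(host, "soniox.com") || isUnder(host, "elevenlabs.io")) {
      result.vendor.push(host);
    } else if (isUnder(host, "gaavala.com")) {
      result.gaavala.push(host);
    } else {
      result.other.push(host);
    }
  }
  return result;
}
```

Requests to gaavala.com are not a red flag on their own, since the translation step sends text there. The thing to confirm in DevTools is that none of those requests carries audio.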
How to Enable It
Speak Mode is a Pro-only feature. Gaavala's free trial (one-time 5 minutes, no credit card, never resets) does not include Speak Mode, so the setup flow goes roughly like this:
- Install Gaavala for free from the Chrome Web Store and sign in.
- Upgrade to Pro from the side panel ($24.99/month). Pro has no trial period, so upgrade when you need Speak Mode.
- Open the side panel settings and find the Speak Mode section.
- Choose an engine. Soniox studio voices are the default on Pro and need no setup. Kokoro downloads on first use if you select it. If you want ElevenLabs, paste your API key and pick a voice from the list.
- Configure the target language for the meeting.
- Pick a capture mode. For your first call, use Push-to-Talk and set a comfortable hotkey. You can switch to Timed Speak and tune its interval once you trust the flow.
Then join a meeting. Hold the hotkey when you want to speak, or flip Timed Speak on for a hands-free stretch.
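The choices in this checklist depend on each other, which a small validation sketch makes explicit. The settings shape and field names below are hypothetical and are not the extension's real storage format; the rules simply restate constraints described in this post.

```typescript
interface SpeakModeSettings {
  engine: "soniox" | "kokoro" | "elevenlabs";
  targetLanguage: string;        // language tag such as "en" or "pt-BR"
  captureMode: "ptt" | "timed";
  hotkey?: string;               // needed for Push-to-Talk
  elevenLabsApiKey?: string;     // needed for ElevenLabs (BYOK)
  elevenLabsVoiceId?: string;
}

// Returns a list of human-readable problems; empty means ready to go.
function validateSettings(s: SpeakModeSettings): string[] {
  const problems: string[] = [];
  const baseLanguage = s.targetLanguage.trim().toLowerCase().split("-")[0];
  if (!baseLanguage) {
    problems.push("Choose a target language.");
  } else if (s.engine === "kokoro" && baseLanguage !== "en") {
    problems.push("Kokoro is English-only; pick Soniox or ElevenLabs.");
  }
  if (s.engine === "elevenlabs" && (!s.elevenLabsApiKey || !s.elevenLabsVoiceId)) {
    problems.push("ElevenLabs needs your API key and a selected voice.");
  }
  if (s.captureMode === "ptt" && !s.hotkey) {
    problems.push("Set a Push-to-Talk hotkey.");
  }
  return problems;
}
```

Returning every problem at once, rather than stopping at the first, lets a settings screen show all the missing pieces in one pass.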
Real-World Usage Tips
A few practical notes from the team testing this with real customers:
- Warm up with PTT. The first time you hear your synthesized voice play into a meeting, it will feel strange. Use Push-to-Talk for a few calls before going hands-free. Your sense of when to pause, how loud to speak, and how the translated voice lands on the other side all calibrate quickly — but only if you are in control of each burst.
- Watch latency on long-distance ElevenLabs. If you and your ElevenLabs region are on opposite sides of the planet, the round-trip can push past two seconds. For a back-and-forth negotiation, that is uncomfortable. Test on an internal call before trusting it on an external one.
- Cloning etiquette matters. ElevenLabs will let you clone any voice you have a sample of. Just because you can does not mean you should. Clone your own voice, or clone with explicit consent. Do not clone a colleague, customer, or public figure to speak on their behalf.
- Disclose if it matters. If your meeting has compliance implications — legal, medical, regulated sales — tell the other side that a real-time voice translation layer is in the call. It is the same etiquette as disclosing a transcription tool.
- Keep the native audio on. Gaavala still pipes the original meeting audio to your speakers. You should still be listening to and reading the incoming side. Speak Mode is the outgoing half, not a replacement for understanding what is being said to you.
FAQ
Does the meeting know I am using it? Not directly. The synthesized voice arrives on the other side through the same microphone channel you would use yourself, so there is no banner or metadata. The other side hears a very fluent voice and may notice small pauses before each response. We recommend disclosing it when it matters — see the etiquette note above.
Can I use my own cloned voice? Yes, through ElevenLabs. Create a voice clone in your ElevenLabs account, select it in Gaavala's Speak Mode settings, and your synthesized speech will use that voice. Kokoro does not support cloning.
What is the latency? With Kokoro on a machine with WebGPU, expect end-of-sentence to end-of-playback in well under a second once the model is warm. With ElevenLabs, latency depends on geography and plan tier, typically one to two seconds per phrase. Both of these engines start playing as soon as the first chunk is ready, so you are not waiting for the whole sentence to render before the meeting hears anything.
What if my target language is not English? Kokoro-82M is English-only, but the Soniox studio voices — the Pro default — already speak all 60 languages Gaavala translates, so non-English targets work out of the box. Switch to ElevenLabs (30+ languages) when you want a cloned voice on a non-English target.
Does it work in Teams Premium meetings? Yes. Speak Mode works in any meeting that runs in a Chrome tab, including Microsoft Teams, Zoom, Google Meet, and Webex. It does not matter whether the meeting organizer has Teams Premium or Copilot. Speak Mode runs on your side of the call, not on the meeting host's infrastructure.
Try It
Speak Mode is available right now to Pro subscribers. If you are new to Gaavala, start on the one-time free trial — 5 minutes of transcription, no credit card required — so you can evaluate the incoming translation experience first. Speak Mode is not included in the free trial; upgrade to Pro ($24.99/month, no trial period) when you are ready to unlock it on the outgoing side.
Install Gaavala for free from the Chrome Web Store, then upgrade to Pro to unlock Speak Mode.