Introducing Speak Mode: AI Voice Translation That Speaks For You in Meetings

There is a specific, painful moment that every non-native speaker in a multilingual meeting knows. The discussion is moving quickly. Captions are scrolling on your screen. You understand everything perfectly — Gaavala has been translating the other side for twenty minutes and you are following along word by word. Someone turns to you and asks a direct question.

And then the language barrier, which had been one-way, becomes a wall.

You know the answer. You know exactly what you want to say. You simply cannot say it at the same speed and precision in the meeting's working language. You hesitate, soften, simplify, and end up sounding less informed than you are. By the time you assemble a careful sentence, the conversation has moved on.

We built Speak Mode to close that loop. With Speak Mode enabled, you speak in your native language and Gaavala speaks your translated voice back into the meeting. The other participants hear a fluent, real-time voice delivering your response in their language. You stop being the quiet person in the room.

This post walks through exactly what Speak Mode does, the two capture modes that ship with it, the three TTS engines you can choose between, and the tradeoffs you should understand before enabling it for a high-stakes call.

What Speak Mode Actually Does

Speak Mode is not a dubbing track layered on top of a transcript. It is a live, bidirectional voice pipeline that runs alongside the existing Gaavala transcription flow.

When you activate it during a meeting, the following happens inside your browser:

  1. Your microphone audio is captured by the extension — separately from the tab audio Gaavala already captures for the incoming side.
  2. That microphone audio streams directly to Soniox over a WebSocket for speech-to-text. This is the same direct-to-Soniox pattern the rest of Gaavala uses — no audio touches our backend.
  3. As Soniox emits committed tokens in your source language, the extension batches them into short phrases and sends the text to Gaavala's translation step for the target language you configured.
  4. The translated text is handed to your selected TTS engine, which synthesizes a voice clip.
  5. That voice clip is written into the meeting audio output — the same virtual audio path your Teams, Zoom, Meet, or Webex client uses for the microphone — so everyone in the call hears the translated voice as if it were your own.
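
The steps above can be sketched in a few lines. This is a minimal illustration, not Gaavala's actual code: the Token shape, the PhraseBatcher class, the 80-character cap, and the translate and synthesize callbacks are all assumptions made for the example, and a plain string stands in for an audio clip.

```typescript
// Illustrative only: these types and names are assumptions, not Gaavala or Soniox APIs.
interface Token {
  text: string;     // token text, including any leading space
  isFinal: boolean; // true once the recognizer has committed the token
}

type Translate = (phrase: string) => string;
type Synthesize = (translated: string) => string; // a string stands in for an audio clip

// Collects committed tokens and emits a phrase at sentence punctuation,
// or once the buffer reaches maxChars (a hypothetical cap).
class PhraseBatcher {
  private buffer = "";
  private readonly maxChars: number;

  constructor(maxChars: number = 80) {
    this.maxChars = maxChars;
  }

  push(token: Token): string | null {
    if (!token.isFinal) return null; // skip provisional tokens
    this.buffer += token.text;
    const trimmed = this.buffer.trim();
    const endsSentence = /[.!?]$/.test(trimmed);
    if (trimmed.length > 0 && (endsSentence || trimmed.length >= this.maxChars)) {
      this.buffer = "";
      return trimmed;
    }
    return null;
  }
}

// Steps 3 to 5: batch tokens, translate each phrase, synthesize a clip per phrase.
function runPipeline(tokens: Token[], translate: Translate, synthesize: Synthesize): string[] {
  const batcher = new PhraseBatcher();
  const clips: string[] = [];
  for (const token of tokens) {
    const phrase = batcher.push(token);
    if (phrase !== null) clips.push(synthesize(translate(phrase)));
  }
  return clips;
}
```

Because each phrase is translated and synthesized as soon as it is complete, the first clip can be ready while later tokens are still arriving, which is what lets the loop run while you are still speaking.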

The whole loop runs while you are still speaking. By the time you finish a sentence in your language, the synthesized translation is already playing out in theirs. You do not wait for a "translate" button. You do not pause between thoughts longer than you normally would.

Two Modes: Push-to-Talk and Timed Speak

Speak Mode ships with two capture styles. They exist because real meetings are not all alike, and the right capture behavior depends on how the conversation is flowing.

Push-to-Talk (PTT)

PTT is the walkie-talkie model. You hold a hotkey while you speak and release it when you are done. While the key is held, Gaavala captures your microphone, transcribes, translates, and plays the synthesized voice into the meeting. When you release, the pipeline closes cleanly.
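
The hold-and-release behavior can be modeled as a two-state controller. The sketch below is an assumption about how such a controller might look, not the extension's implementation: the PushToTalk class, its callbacks, and the choice of F9 in the test are hypothetical. The code argument mirrors KeyboardEvent.code and repeat mirrors KeyboardEvent.repeat.

```typescript
// Hypothetical controller: names and callbacks are illustrative only.
type PttState = "idle" | "capturing";

class PushToTalk {
  private state: PttState = "idle";
  private readonly hotkey: string;
  private readonly onOpen: () => void;
  private readonly onClose: () => void;

  constructor(hotkey: string, onOpen: () => void, onClose: () => void) {
    this.hotkey = hotkey;
    this.onOpen = onOpen;
    this.onClose = onClose;
  }

  // Call from a keydown listener. Auto-repeat events are ignored so that
  // holding the key opens the pipeline exactly once.
  keyDown(code: string, repeat: boolean): void {
    if (code !== this.hotkey || repeat || this.state === "capturing") return;
    this.state = "capturing";
    this.onOpen();
  }

  // Call from a keyup listener. Releasing the hotkey closes the pipeline.
  keyUp(code: string): void {
    if (code !== this.hotkey || this.state !== "capturing") return;
    this.state = "idle";
    this.onClose();
  }

  get capturing(): boolean {
    return this.state === "capturing";
  }
}
```

In a browser this would be wired to keydown and keyup listeners, with onOpen starting microphone capture and onClose flushing whatever is still buffered.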

PTT is the safer mode. It gives you explicit control over exactly when the synthesized voice is heard, so you never accidentally broadcast a side comment, a cough, or a moment of thinking-out-loud. It is the right choice for the first few calls you run with Speak Mode, for high-stakes meetings where precision matters more than flow, and for any environment where your microphone might pick up background voices you do not want translated.

The tradeoff is that you have to remember to hold the key, and your hand is occupied for the duration of your speech. For a short answer, that is nothing. For a long explanation, it gets tiring.

Timed Speak

Timed Speak is hands-free. You enable it once at the start of your turn, and Gaavala automatically decides when each thought ends — using pause detection and transcript punctuation signals — and flushes the translated voice into the meeting without waiting for you to press anything.
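
One way to combine pause detection with punctuation is a two-threshold rule: a long silence always ends the thought, and a shorter silence is enough when the pending text already ends in sentence punctuation. The function below is a sketch under that assumption. The name shouldFlush, the FlushConfig fields, and the threshold values used in the test are hypothetical and are not Gaavala's actual tuning.

```typescript
// Hypothetical thresholds: the real product may decide differently.
interface FlushConfig {
  pauseMs: number;     // silence that ends a thought on its own
  softPauseMs: number; // shorter silence accepted after sentence punctuation
}

// Returns true when the pending transcript should be translated and spoken.
function shouldFlush(pending: string, silenceMs: number, cfg: FlushConfig): boolean {
  const text = pending.trim();
  if (text.length === 0) return false;        // nothing to say yet
  if (silenceMs >= cfg.pauseMs) return true;  // long pause: flush regardless
  const endsSentence = /[.!?]$/.test(text);
  return endsSentence && silenceMs >= cfg.softPauseMs;
}
```

A rule like this also shows where the precision tradeoff comes from: any speech followed by a long enough pause gets flushed, whether or not you meant it for the meeting.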

Timed Speak is the mode for flow. When you are presenting, telling a story, or walking through a slide deck, it feels much more natural. You speak the way you would in your own language, and the translation rolls out on the other side with a small trailing delay.

The tradeoff is precision. Because Gaavala is deciding when to cut, you lose the ability to hold back a half-formed thought. If you mutter to yourself mid-sentence, that mutter can end up in the synthesized stream. Most people start with PTT, get comfortable with the latency and voice of their chosen engine, and graduate to Timed Speak after a few meetings.

Three Engines: Soniox Studio Voices, Kokoro On-Device, and ElevenLabs Cloud

Speak Mode supports three text-to-speech engines. They are very different tools and we picked each for a specific kind of user.

Soniox studio voices (Pro default)

Soniox studio voices are the default engine on Pro, and most people never need to change it. The engine offers 28 studio voices that speak all 60 languages Gaavala translates, cross-lingually, on a managed key, so there is nothing to set up. Synthesis goes browser-direct to Soniox using the same short-lived-key pattern as captions, so the translated text never passes through Gaavala's backend.

Kokoro-82M (on-device, free)

Kokoro-82M is a compact open-weights TTS model that runs entirely inside your browser, accelerated by WebGPU where available. When you select Kokoro in Speak Mode settings, the model weights download once and stay cached — subsequent sessions start instantly.
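
"Where available" implies a capability check before the model loads. A minimal sketch of that check follows. The pickBackend function and the "wasm" fallback name are assumptions for illustration, since this post does not say what Kokoro falls back to. The function takes a navigator-like object so it can be exercised outside a browser.

```typescript
// Hypothetical backend names: the fallback path is an assumption.
type TtsBackend = "webgpu" | "wasm";

// Browsers that support WebGPU expose it as navigator.gpu.
function pickBackend(nav: { gpu?: unknown } | undefined): TtsBackend {
  const hasGpu = nav !== undefined && nav.gpu !== undefined && nav.gpu !== null;
  return hasGpu ? "webgpu" : "wasm";
}
```

In a browser you would call pickBackend(navigator). The presence of navigator.gpu is only a first signal: navigator.gpu.requestAdapter() can still resolve to null on unsupported hardware, so a production check would await that call as well.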

What you get:

What you do not get:

Kokoro is the right choice if your target audience speaks English and you care about cost, privacy, or offline-friendliness. It is also a good option for practicing with Speak Mode before you start paying a cloud provider for characters.

ElevenLabs (cloud, BYOK)

ElevenLabs is a commercial cloud TTS service with multilingual support and high-quality voice cloning. To use it with Speak Mode, you bring your own API key — you create a free or paid account directly with ElevenLabs, paste the key into Gaavala's side panel, and the extension makes calls straight from your browser to the ElevenLabs API. Your key is stored in chrome.storage.local and never touches Gaavala's backend.
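
A browser-direct call with a user-supplied key might be assembled as follows. This is a sketch, not the extension's code: buildElevenLabsRequest and the TtsRequest shape are hypothetical, and the endpoint path, the xi-api-key header, and the eleven_multilingual_v2 model ID reflect our understanding of ElevenLabs' public API, which you should confirm against their current documentation.

```typescript
// Hypothetical helper: builds the request but does not send it.
interface TtsRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildElevenLabsRequest(apiKey: string, voiceId: string, text: string): TtsRequest {
  return {
    // Streaming text-to-speech endpoint (assumed from ElevenLabs' public docs).
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}/stream`,
    init: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey, // the user's own key, read from local storage
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }),
    },
  };
}
```

The result can be passed to fetch(req.url, req.init) from the extension. Since the only host in the request is api.elevenlabs.io, neither the key nor the text has any reason to reach a third server.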

What you get:

The honest tradeoffs:

Pick ElevenLabs when you want your own cloned voice on the other side, or when a specific premium ElevenLabs voice matters more than cost.
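
The guidance for the three engines can be condensed into a small decision function. It only restates the rules of thumb in this section: cloned voices need ElevenLabs and a key, Kokoro suits English targets when you want on-device synthesis, and Soniox studio voices cover everything else. The chooseEngine function and its input fields are hypothetical.

```typescript
type Engine = "soniox" | "kokoro" | "elevenlabs";

// Hypothetical inputs describing what the user wants from a call.
interface EngineRequest {
  targetLang: string;        // language tag such as "en", "en-US", or "ja"
  wantsClonedVoice: boolean;
  hasElevenLabsKey: boolean;
  preferOnDevice: boolean;   // cost, privacy, or offline-friendliness matters most
}

function chooseEngine(req: EngineRequest): Engine {
  const isEnglish = req.targetLang.toLowerCase().split("-")[0] === "en";
  // Voice cloning is only available through ElevenLabs, which needs your own key.
  if (req.wantsClonedVoice && req.hasElevenLabsKey) return "elevenlabs";
  // Kokoro runs on-device but only speaks English.
  if (req.preferOnDevice && isEnglish) return "kokoro";
  // Soniox studio voices are the Pro default and cover every supported language.
  return "soniox";
}
```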

How It Compares to What Else Is Out There

Real-time voice translation in meetings has been a "coming soon" feature for years. A few things exist, but none of them cover the same ground Speak Mode covers.

As far as we can tell, there is no consumer-priced tool in the market that does genuinely two-way voice translation inside the meeting platforms people actually use. Speak Mode is our answer to that gap.

Who Speak Mode Is For

Speak Mode is a Pro feature and it is priced for individual professionals, not enterprise deployments. The people who get the most out of it are:

If your meetings are fully bilingual in a single pair and everyone can keep up, you probably do not need Speak Mode. If there is a recurring moment where you or someone on the other side falls silent because the language gap is one-way, this is the feature that fixes that moment.

Privacy: What Changes and What Does Not

Gaavala's core privacy architecture does not change when you enable Speak Mode. Incoming meeting audio still streams directly from your browser to Soniox over a WebSocket that never touches our backend. Your microphone audio for Speak Mode follows the same direct-to-Soniox pattern — we never see it.

The TTS step is where the engines differ:

In every case, Gaavala's backend remains outside the audio path. If you want to verify this, the same network inspection trick from our privacy post still works: open the extension's background page DevTools, filter the Network tab, and you will see direct connections to Soniox and (if enabled) ElevenLabs, and no audio traffic to gaavala.com.
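
If you export the Network tab and want to check it programmatically rather than by eye, a filter like the one below is one way to do it. It is a sketch under stated assumptions: the RequestRecord shape is a simplified stand-in for an exported log entry, the hostnames in the usage comment are placeholders, and "audio" is approximated as any WebSocket URL or any response with an audio MIME type.

```typescript
// Simplified, hypothetical record: real HAR entries carry many more fields.
interface RequestRecord {
  url: string;
  mimeType: string; // response MIME type, or "" when unknown
}

// Returns the records that look like audio traffic to the given domain
// or any of its subdomains.
function audioRequestsTo(records: RequestRecord[], domain: string): RequestRecord[] {
  return records.filter((r) => {
    const host = new URL(r.url).hostname;
    const onDomain = host === domain || host.endsWith(`.${domain}`);
    const looksLikeAudio = r.mimeType.startsWith("audio/") || r.url.startsWith("wss:");
    return onDomain && looksLikeAudio;
  });
}

// Usage: an empty result is what the privacy claim above predicts.
// audioRequestsTo(exportedRecords, "gaavala.com")
```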

How to Enable It

Speak Mode is a Pro-only feature. Gaavala's free trial (one-time 5 minutes, no credit card, never resets) does not include Speak Mode, so the setup flow goes roughly like this:

  1. Install Gaavala free from the Chrome Web Store and sign in.
  2. Upgrade to Pro from the side panel ($24.99/month). Pro has no trial period, so upgrade directly when you need Speak Mode.
  3. Open the side panel settings and find the Speak Mode section.
  4. Choose an engine. Soniox studio voices are the default on Pro and need no setup. Kokoro downloads on first use if you select it. If you want ElevenLabs, paste your API key and pick a voice from the list.
  5. Configure the target language for the meeting.
  6. Pick a capture mode. For your first call, use Push-to-Talk and set a comfortable hotkey. You can switch to Timed Speak and tune its interval once you trust the flow.

Then join a meeting. Hold the hotkey when you want to speak, or flip Timed Speak on for a hands-free stretch.

Real-World Usage Tips

A few practical notes from the team testing this with real customers:

FAQ

Does the meeting know I am using it? Not directly. The synthesized voice arrives on the other side through the same microphone channel you would use yourself, so there is no banner or metadata. The other side hears a very fluent voice and may notice small pauses before each response. We recommend disclosing it when it matters — see the etiquette note above.

Can I use my own cloned voice? Yes, through ElevenLabs. Create a voice clone in your ElevenLabs account, select it in Gaavala's Speak Mode settings, and your synthesized speech will use that voice. Kokoro does not support cloning.

What is the latency? With Kokoro on a machine with WebGPU, expect end-of-sentence to end-of-playback in well under a second once the model is warm. With ElevenLabs, latency depends on geography and plan tier — typically one to two seconds per phrase. Both engines start playing as soon as the first chunk is ready, so you are not waiting for the whole sentence to render before the meeting hears anything.

What if my target language is not English? Kokoro-82M is English-only, but the Soniox studio voices — the Pro default — already speak all 60 languages Gaavala translates, so non-English targets work out of the box. Switch to ElevenLabs (30+ languages) when you want a cloned voice on a non-English target.

Does it work in Teams Premium meetings? Yes. Speak Mode works in any meeting that runs in a Chrome tab, including Microsoft Teams, Zoom, Google Meet, and Webex. It does not matter whether the meeting organizer has Teams Premium or Copilot. Speak Mode runs on your side of the call, not on the meeting host's infrastructure.

Try It

Speak Mode is available right now to Pro subscribers. If you are new to Gaavala, start on the one-time free trial — 5 minutes of transcription, no credit card required — so you can evaluate the incoming translation experience first. Speak Mode is not included in the free trial; upgrade to Pro ($24.99/month, no trial period) when you are ready to unlock it on the outgoing side.


Install Gaavala free from the Chrome Web Store, then upgrade to Pro to unlock Speak Mode.

