Push-to-Talk Translation in Meetings: A Practical Guide
Always-on translation sounds like the dream until you sit through the third meeting where the tool solemnly translates your throat-clearing, your partner asking about dinner, and the coffee machine gurgling in the next room. Somebody on the call politely ignores the German synthesis of "do we have oat milk." You promise yourself you will never do that again.
Push-to-Talk is what you switch to after that meeting. You hold a key, you speak, you release. Nothing goes into the room until you mean it. The rest of the time, you are listening — captions flowing on the side panel, your mic contributing silence. It is the same model radio operators have used since the 1920s for the same reason: intentional transmission beats hot-mic chaos every single time.
This is a practitioner guide to getting the PTT workflow right inside Gaavala. Not the marketing pitch — the actual mechanics, the keyboard habits, the voice engine trade-offs, and the little things that make the difference between "this is uncanny" and "this is just how I take meetings now."
When Push-to-Talk Beats Timed Speak
Gaavala has two ways to drive Speak Mode. Timed Speak commits a fixed window of speech on a schedule — great for monologues, demos where you know you will be talking continuously, or scripted pitches where you want to batch-translate in rhythm. Push-to-Talk commits on release — great for everything that looks like a real conversation.
Reach for PTT when:
- You are on a multi-speaker call. If four people are talking and you are contributor number two, Timed Speak will grab whatever audio happens during its window, including the other people. PTT only sends what you held the key for.
- The room is loud. Open office, airport lounge, co-working space, cafe. PTT lets you pick the quiet moments and commit exactly those. Timed Speak will happily encode the espresso machine.
- You want to stay silent between contributions. In most cross-language meetings you listen 80% of the time and contribute 20%. PTT matches that ratio. Timed Speak assumes you are the talker.
- You are still thinking. With PTT you can pause, reconsider, or simply not press the key, and nothing gets broadcast. You do not owe the tool a sentence every N seconds.
Timed Speak still wins when you are clearly driving the meeting end-to-end — a sales pitch, a product demo, a one-way announcement. For everything else, PTT is the default I reach for.
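The choice above can be written down as a small decision rule. This is only a sketch of how I read the criteria: the type names, the field names, and the exact conditions are my own assumptions, not anything exposed by Gaavala.

```typescript
// Hypothetical names throughout; this is not Gaavala's API.
type SpeakMode = "push-to-talk" | "timed-speak";

interface CallTraits {
  otherSpeakers: number;     // people besides you who will actually talk
  noisyRoom: boolean;        // open office, cafe, airport lounge
  youDriveEndToEnd: boolean; // pitch, demo, one-way announcement
}

function recommendSpeakMode(call: CallTraits): SpeakMode {
  // Timed Speak only wins for a continuous monologue with nobody else
  // talking and no background noise to sweep into the window.
  if (call.youDriveEndToEnd && call.otherSpeakers === 0 && !call.noisyRoom) {
    return "timed-speak";
  }
  // Everything that looks like a conversation defaults to PTT.
  return "push-to-talk";
}
```

A scripted pitch in a quiet room comes back as "timed-speak"; add a second talker or an espresso machine and the same call flips to "push-to-talk".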
Setup
First-time PTT setup takes about four minutes. Do it before your meeting, not during.
- Install the extension. Add Gaavala to Chrome. The install drops the icon in your toolbar.
- Sign in. Click the icon, open the side panel, sign in with Google or Microsoft. The OAuth round-trip lands you back in the side panel authenticated.
- Upgrade to Pro. PTT is a Pro-only feature and is NOT included in the free trial. The one-time free trial (5 minutes of transcription, no credit card, never resets) covers Caption Mode (live captions) only. Speak Mode — which is where PTT lives — requires a direct upgrade to Pro at $24.99/month. If you are still on the free trial, you will see the upgrade CTA in the side panel.
- Open Speak Mode settings. In the side panel, find the Speak Mode section. You will see a toggle, a mode selector (PTT / Timed Speak / Toggle), a voice engine picker, and a key binding field.
- Assign your PTT key. Click the key binding field, press the key you want. I use the right Option key on macOS because nothing else in my normal workflow touches it. Avoid Space, avoid Enter, avoid Cmd/Ctrl combos. More on this later.
- Pick an output voice. Soniox studio voices are the Pro default — 28 voices across all 60 languages, nothing to configure. Kokoro gives you a set of clean on-device English voices. ElevenLabs, if you have a key, gives you cloned-voice options. The Soniox default is fine for the first test.
- Do a dry run. Before you join a meeting, test it standalone. Open any page, hold your key, say "testing one two three" in your source language, release. The side panel should show the transcript and you should hear the translated synthesis through your speakers. If that works end-to-end, you are ready.
The Workflow in a Live Meeting
Here is what a real PTT session looks like. I will assume you are joining a Zoom call with a client who speaks German, and you are answering in English that Gaavala will render into German for them.
- Join the meeting. Zoom tab opens, you land in the call. Unmute yourself in Zoom itself — Gaavala's output goes into the tab, and the tab has to be unmuted for the other side to hear it.
- Open the Gaavala side panel. Click the extension icon or use your shortcut. The side panel opens next to the Zoom tab.
- Start the capture. With the Zoom tab focused, press Start in the side panel. Gaavala grabs the meeting audio instantly through Chrome's tab capture — no share prompt, no picker.
- Pick languages. In the side panel, set the source language to English (your speaking language) and the target to German (what the room should hear). The listening side flips these automatically so incoming German comes back as English captions.
- Confirm Speak Mode is in PTT. The PTT button should show the idle state — a small ring, no glow.
- Listen. The client speaks. English captions roll in the side panel. You read, you think, you formulate your reply.
- Hold the PTT key. As soon as you press, the button blooms into a coral pulse ring. That is the visual confirmation that your mic is hot and Gaavala is listening to you specifically.
- Speak. One sentence, maybe two. Natural pace. Watch the ring pulse with your voice.
- Release the key. The pulse snaps back to idle. A beat later your translated voice plays into the Zoom tab and the client hears it in German. Expect two to four seconds with ElevenLabs; Kokoro's one to two seconds applies only to English output, so it is not an option for this German call.
- Listen again. Repeat. You can release and immediately start listening to their reply. There is no mode switch, no menu.
The rhythm you will develop is: listen, think, hold, speak, release, listen. It becomes muscle memory inside the first meeting.
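The hold, speak, release cycle is a two-state machine, and it can help to see it that way. The sketch below is an assumption about how such a loop could be modeled, not Gaavala's implementation: the class name, the callback, and the explicit timestamps are all illustrative. The `repeat` flag mirrors the standard `KeyboardEvent.repeat` property, which browsers set on the auto-repeated keydown events fired while a key is held.

```typescript
// Hypothetical sketch; names are illustrative, not Gaavala's API.
type PttState = "idle" | "recording";

class PushToTalkSession {
  private state: PttState = "idle";
  private pressedAtMs = 0;
  private readonly pttCode: string;
  private readonly onCommit: (heldMs: number) => void;

  constructor(pttCode: string, onCommit: (heldMs: number) => void) {
    this.pttCode = pttCode;   // a KeyboardEvent.code value, e.g. "AltRight"
    this.onCommit = onCommit; // called once per release
  }

  currentState(): PttState {
    return this.state;
  }

  // Call from a keydown handler. Auto-repeat events are ignored so a
  // long hold counts as one press.
  keyDown(code: string, repeat: boolean, nowMs: number): void {
    if (code !== this.pttCode || repeat || this.state === "recording") return;
    this.state = "recording";
    this.pressedAtMs = nowMs;
  }

  // Call from a keyup handler. Releasing commits whatever was captured.
  keyUp(code: string, nowMs: number): void {
    if (code !== this.pttCode || this.state !== "recording") return;
    this.state = "idle";
    this.onCommit(nowMs - this.pressedAtMs);
  }
}
```

In a page you would feed it from `keydown` and `keyup` listeners, passing `event.code`, `event.repeat`, and `performance.now()`. Other keys never change the state, which is the whole point: nothing is sent until the bound key goes down and comes back up.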
Voice Engine Choice
Speak Mode has three voice engines and they make meaningfully different trade-offs.
Soniox studio voices (Pro default). The same engine that powers your captions also speaks — 28 studio voices covering all 60 languages Gaavala translates, on a managed key with nothing to configure. Synthesis goes browser-direct to Soniox using the same short-lived keys as captions, so it just works out of the box.
Kokoro (on-device). An 82M-parameter neural TTS model that runs locally in the extension's offscreen document. Latency is the star of the show here: you get one to two seconds from release to audible output because nothing leaves your machine for synthesis. The synthesis step itself needs no network connection (your speech still streams to Soniox for recognition) and costs nothing on top of your subscription. The catch: Kokoro's speak-side output is English-only right now. That is perfect if you are speaking into English-speaking rooms, or if your English is intermediate (CEFR B1/B2) and you want a cleaner delivery, but it means you cannot use Kokoro to output German or Japanese.
ElevenLabs (cloud, BYOK). You bring your own ElevenLabs API key, paste it into Gaavala settings, and you get multilingual synthesis across 29 languages plus voice cloning if you have an ElevenLabs Pro seat. Quality is outstanding. Latency is two to four seconds because the synthesis round-trips through ElevenLabs' servers. Good network makes this feel fine; bad wifi turns it into an awkward pause game.
Rule of thumb: the Soniox default already covers every language Gaavala translates, so start there. If the room needs English and you want fully on-device synthesis, use Kokoro. Switch to ElevenLabs for the calls where a cloned voice on the other side earns the latency tax.
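That rule of thumb is mechanical enough to sketch as a function. Everything here is hypothetical: the type names, the request fields, and the language-tag handling are assumptions for illustration, and the function does not check ElevenLabs' own language list.

```typescript
// Hypothetical names; this mirrors the rule of thumb, not a real API.
type VoiceEngine = "soniox" | "kokoro" | "elevenlabs";

interface EngineRequest {
  targetLanguage: string;    // language tag such as "en", "en-US", or "de"
  wantOnDevice: boolean;     // prefer synthesis that stays on the machine
  wantClonedVoice: boolean;  // worth paying the latency tax for
  hasElevenLabsKey: boolean; // BYOK: no key, no ElevenLabs
}

function pickVoiceEngine(req: EngineRequest): VoiceEngine {
  // A cloned voice is the one reason to accept the cloud round-trip.
  if (req.wantClonedVoice && req.hasElevenLabsKey) return "elevenlabs";

  // Kokoro is on-device but English-only on the speak side.
  const primary = req.targetLanguage.toLowerCase().split("-")[0];
  if (req.wantOnDevice && primary === "en") return "kokoro";

  // The Soniox default covers every language Gaavala translates.
  return "soniox";
}
```

Note the fallthrough: asking for on-device German quietly lands on Soniox, because Kokoro cannot speak German.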
Real-World Scenarios
A few patterns I have watched people use PTT for. These are not hypotheticals — these are the meetings PTT was built for.
- Cold client call in German, your German is B1. You understand most of what they say, but formulating a confident sentence on the fly is stressful. With PTT, you can structure the reply in English in your head, hold the key, say it cleanly, and let ElevenLabs deliver it in German. The client hears a confident native-sounding answer; you avoid the trap of groping for vocabulary live.
- Technical Q&A with an offshore team. Your Indian or Filipino engineers speak excellent English but the discussion is dense. PTT lets you slow down without hot-miking your "umm, let me think" phase. Hold, commit a precise technical answer, release, think about the follow-up.
- Sales demo with a Japanese prospect. You are running the demo in English, they ask a question. You hold PTT, answer, release — they hear the answer in Japanese while you keep driving the screen share. No interpreter, no pre-recorded voice-over, no lag in the interaction flow.
- Bilingual 1:1 with a new hire. A new team member joins from Brazil. Their English is a work in progress; your Portuguese is non-existent. PTT in both directions turns the 1:1 from a vocabulary exam into an actual conversation. Over a few weeks they lean on it less and you watch the onboarding curve compress.
Tips From Actual Usage
Small habits that make PTT feel good instead of awkward.
- Pause half a second before pressing. The first syllable after your key press can clip if you are already mid-word. Breathe in, press, speak. This is the single biggest quality improvement you can make.
- Keep each utterance short. Under ten seconds, ideally under seven. Long holds tax the STT, make you ramble, and produce clunky synthesis. Short, complete thoughts translate better and the latency feels shorter because the unit of feedback is smaller.
- Signal end-of-turn verbally if you keep cutting yourself off. Radio operators say "over." You can too — "…and that is why we priced it at ninety-nine, over" — then release. It sounds silly for about one meeting and then you stop noticing.
- Mic environment beats model quality. STT accuracy is usually the bottleneck. A good directional mic in a quiet room will outperform a laptop mic in a cafe regardless of which voice engine you use. If you take meetings for a living, a $60 USB condenser is the best Gaavala upgrade you can buy.
- Do not translate word-for-word in your head. Speak in your source language naturally. The translation layer will do its job. If you over-rehearse, you sound robotic going in and the output inherits the stiffness.
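If you want to train the short-utterance habit, the seven and ten second marks from the tips above are easy to turn into a self-check. This is a sketch with made-up names; Gaavala does not, as far as this guide describes, expose hold durations, so you would have to time the holds yourself.

```typescript
// Hypothetical helper; thresholds come from the "keep it short" tip.
type HoldVerdict = "ideal" | "acceptable" | "too-long";

const IDEAL_MAX_MS = 7000;  // "ideally under seven"
const HARD_MAX_MS = 10000;  // "under ten seconds"

function classifyHold(heldMs: number): HoldVerdict {
  if (heldMs < IDEAL_MAX_MS) return "ideal";
  if (heldMs < HARD_MAX_MS) return "acceptable";
  return "too-long";
}
```

A hold of exactly ten seconds counts as "too-long" because the tip says under ten, not up to ten.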
When PTT Feels Laggy
If the gap between release and output feels longer than it should, walk through this list in order before blaming the tool.
- Network. If you are on ElevenLabs, every utterance round-trips through the cloud. Airport wifi or a hotel guest network will add seconds. Tether to your phone, or, if the room needs English output, switch to Kokoro for the rest of the call.
- Wrong engine for the call. If you picked ElevenLabs but the room is English-speaking, you are paying the latency tax for no reason. Switch to Kokoro.
- Captured the wrong tab. If you pressed Start while a different tab was focused, Gaavala is listening to the wrong audio source, which confuses STT and delays the next cycle. End the session, focus the meeting tab, and press Start again.
- Chrome GPU pressure. Other tabs running WebGL, video playback, or heavy canvas work will starve Kokoro of GPU cycles. Close the Figma file, the second YouTube tab, the Google Earth window you forgot about. Gaavala benchmarks much better with a clean Chrome.
- Laptop thermal throttling. After 90 minutes of meetings, a hot laptop will slow Kokoro inference enough to notice. Keep it plugged in, on a cooling pad, with good airflow, and you get consistent latency for hours.
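Because the list is meant to be walked in order, it reads naturally as a chain of early returns. The sketch below is my own encoding of that order; the report fields are hypothetical observations you would fill in by hand, and I have assumed the GPU and thermal checks only matter when Kokoro is the active engine, since those two bullets are about local inference.

```typescript
// Hypothetical types: none of these names come from Gaavala itself.
type SynthEngine = "soniox" | "kokoro" | "elevenlabs";

type LagSuspect =
  | "network"
  | "wrong-engine"
  | "wrong-tab"
  | "gpu-pressure"
  | "thermal-throttling"
  | "none-found";

interface LagReport {
  engine: SynthEngine;
  poorNetwork: boolean;
  englishOutput: boolean;      // the room needs English
  meetingTabCaptured: boolean; // Start was pressed with the meeting tab focused
  heavyGpuTabsOpen: boolean;
  laptopRunningHot: boolean;
}

// Returns the first suspect in the same order as the checklist.
function firstLagSuspect(r: LagReport): LagSuspect {
  if (r.engine === "elevenlabs" && r.poorNetwork) return "network";
  if (r.engine === "elevenlabs" && r.englishOutput) return "wrong-engine";
  if (!r.meetingTabCaptured) return "wrong-tab";
  if (r.engine === "kokoro" && r.heavyGpuTabsOpen) return "gpu-pressure";
  if (r.engine === "kokoro" && r.laptopRunningHot) return "thermal-throttling";
  return "none-found";
}
```

Fix the first suspect, test again, and only then move down the list; two problems at once are common on a long travel day.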
Privacy Note
Your microphone audio goes browser-direct to Soniox over a WebSocket. It does not pass through Gaavala servers. The Speak Mode side follows the same architecture as Caption Mode: we mint a short-lived Soniox key, the extension opens its own connection, and the audio never touches us. Kokoro synthesis happens locally in the offscreen document. Soniox studio-voice synthesis — the Pro default — uses the same short-lived-key, browser-direct pattern as captions; the text goes straight from your browser to Soniox. ElevenLabs synthesis, if you use it, goes browser-direct to ElevenLabs with your own key — again, not through our backend. If you want the full architectural breakdown, the meeting audio privacy post walks through every hop.
This matters for PTT specifically because people assume mic capture is where the privacy risk lives. In Gaavala's model, the mic stream is the least-processed, most-direct path in the whole system.
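For readers who prefer a table to a paragraph, the data paths described above can be summarized as plain data. This is only a restatement of the text in code form; the type and field names are hypothetical, and nothing here is read from Gaavala itself.

```typescript
// Summary of the privacy note as data; field names are hypothetical.
type Destination = "soniox" | "elevenlabs" | "local-offscreen-document";

interface SpeakDataPath {
  payload: string;
  destination: Destination;
  credential: string;
  viaGaavalaServers: boolean;
}

const SPEAK_DATA_PATHS: SpeakDataPath[] = [
  { payload: "microphone audio", destination: "soniox",
    credential: "short-lived Soniox key", viaGaavalaServers: false },
  { payload: "Kokoro synthesis", destination: "local-offscreen-document",
    credential: "none", viaGaavalaServers: false },
  { payload: "Soniox studio-voice synthesis", destination: "soniox",
    credential: "short-lived Soniox key", viaGaavalaServers: false },
  { payload: "ElevenLabs synthesis", destination: "elevenlabs",
    credential: "your own ElevenLabs key", viaGaavalaServers: false },
];

// Lists any path that would route content through Gaavala's backend.
function pathsThroughGaavala(paths: SpeakDataPath[]): SpeakDataPath[] {
  return paths.filter((p) => p.viaGaavalaServers);
}
```

The filter comes back empty for every row. The one step that does involve Gaavala in this description is minting the short-lived Soniox key, which carries no audio or text.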
Keyboard Shortcut Etiquette
The single most common setup mistake is picking a PTT key that fights with the meeting app. A short list of keys to avoid and why:
- Space. Zoom uses hold-Space as push-to-unmute by default. If you pick Space for Gaavala PTT, you will mute and unmute Zoom every time you try to speak, and the meeting will hear half of every utterance.
- M. Several meeting apps bind M, usually with a modifier, to the mute toggle (Teams uses Ctrl+Shift+M, for example). Same problem.
- V. Zoom's video toggle is a V shortcut (Alt+V on Windows, Cmd+Shift+V on macOS). If your PTT binding overlaps it, you will turn your camera off mid-sentence.
- Enter / Return. If chat is focused, you will send a blank message every time you try to speak.
- Cmd/Ctrl combinations. These tend to be grabbed by the OS or by Chrome itself (new tab, close tab, find). They also feel bad to hold.
Good choices on macOS: right Option, right Cmd, or a function key you have remapped. On Windows: right Alt, right Ctrl, or a dedicated side key on a gaming mouse. The best PTT keys are the ones you can hold without moving your other hand off the keyboard, and that nothing else in your stack fights for.
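The avoid-list above is short enough to encode as a quick sanity check on a candidate binding. This is an illustrative sketch, not part of Gaavala: the interface and function names are invented, the warning texts paraphrase this section, and the `code` strings follow the standard `KeyboardEvent.code` naming ("Space", "KeyM", "AltRight", and so on).

```typescript
// Illustrative only; names and warning texts are assumptions.
interface PttBinding {
  code: string;            // KeyboardEvent.code, e.g. "AltRight" or "Space"
  needsCtrlOrCmd: boolean; // true if Ctrl or Cmd must be held as well
}

// Maps a key code to the conflict described in the list, if any.
function conflictFor(code: string): string | null {
  switch (code) {
    case "Space":
      return "Zoom treats hold-Space as push-to-unmute";
    case "KeyM":
      return "M is tied to mute toggles in meeting apps";
    case "KeyV":
      return "V is tied to video toggles in meeting apps";
    case "Enter":
      return "Enter sends a blank chat message when chat is focused";
    default:
      return null;
  }
}

function pttBindingWarnings(binding: PttBinding): string[] {
  const warnings: string[] = [];
  const conflict = conflictFor(binding.code);
  if (conflict !== null) warnings.push(conflict);
  if (binding.needsCtrlOrCmd) {
    warnings.push("Cmd/Ctrl combinations are often claimed by the OS or Chrome");
  }
  return warnings;
}
```

A bare right Option (`{ code: "AltRight", needsCtrlOrCmd: false }`) returns no warnings, while Ctrl+Space collects two. An empty list does not prove a key is safe in your particular stack; it only means it dodges the conflicts named here.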
FAQ
Can other meeting participants tell it is AI? Kokoro sounds clearly synthetic but clean. ElevenLabs with a cloned voice can be very hard to tell apart from a human in short exchanges, so people on the call will rarely notice unless you tell them. The bigger tell is rhythm: if you hold and release in a steady pattern, listeners notice the silences between your turns more than the voice itself.
Does it work if I am muted in the meeting? No. Gaavala's Speak Mode output plays into the tab's audio pipeline, and the meeting app sends the tab's audio through your own mic channel. If you are muted in Zoom or Teams, the output gets muted with everything else. Unmute yourself in the meeting app, let Gaavala handle the rest.
Can I use PTT and Timed Speak in the same session? You pick one mode at a time. Switching between them takes one click in the side panel and is instant — no restart, no re-share. A common pattern is to start a demo in Timed Speak while you are presenting slides, then switch to PTT when the Q&A starts.
Battery impact on a laptop? Kokoro runs on CPU and sips GPU. Expected battery draw is noticeable but not dramatic — think "watching a 1080p YouTube video" level. ElevenLabs is actually lighter on battery because the synthesis happens in the cloud. The mic capture itself is trivial. A full day of back-to-back meetings on a MacBook Air runs the battery down faster than it would with no translation, but not alarmingly so.
What happens if I release the key mid-sentence? The utterance commits as-is. Whatever STT had captured up to the moment of release gets translated and synthesized. If you cut yourself off mid-word, you will hear a truncated output. The fix is to hold until you complete the thought. Early releases are the most common new-user mistake and the one that fixes itself after half a dozen turns.
Try It
If you have been looking for a way to contribute confidently in cross-language meetings without surrendering control of when your voice enters the room, this is the workflow. Install Gaavala free (one-time 5-minute trial, no credit card, never resets), then upgrade to Pro to enable PTT. Pro is $24.99/month with no lock-in. There is no separate Pro trial; you upgrade directly when you are ready.
Install, sign in, upgrade to Pro, assign a key, do one dry run. Your next client call is where the difference shows up.