Knowing when to talk: turn-taking and the Indian-language data gap
AikaLabs · 9 min read
The most common way a voice assistant fails is not mishearing a word. It is talking at the wrong moment. It cuts you off halfway through a sentence, or it sits there in silence for two seconds after you have plainly finished, waiting for a timer to run out.
In English, on a good day, this is a solved-enough engineering problem. Switch the caller to Hindi, or to the Hindi-and-English mix most people actually speak on a phone, and it comes apart. The models are not really the problem here. The data is.
Humans are bad at this too
It helps to start by admitting that people are not good at this either. We interrupt each other constantly. We talk over the end of a sentence, mistake a breath for a full stop, collide, and back off. Listen to any real conversation and the smooth, gap-free handoffs sit right next to a running stream of overlaps and false starts that the speakers patch up as they go.
What people are good at is the recovery. Grab the floor by mistake and you let go almost at once, inside a few hundred milliseconds, with a quick "sorry, go on" and no hard feelings. Being wrong costs almost nothing, and the cost is paid fast. A machine gets neither break. It is slower to notice it has stepped on you, slower to hand the turn back, and it is judged against a standard no person is held to. A friend who talks over you is just talking. A system that does the same thing feels rude.
So the target for a voice agent is not the impossible one of never misjudging a turn. It is the human one: be wrong a little less often, recover quickly when you are, and earn some of the forgiveness people extend to each other without thinking. Most of the field is chasing only the first of those.
The handoff
When the handoff does go cleanly, and most of the time it does, it goes fast. Across ten languages on five continents, the most common gap between one person finishing and the next starting is close to zero, and the average sits around 200 milliseconds (Stivers et al., 2009).
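To make the measurement concrete, here is a minimal sketch of how a gap of this kind can be computed from a transcript with timestamps. The function name `floor_transfer_offsets` and the toy timestamps are invented for illustration and are not taken from Stivers et al.; a positive value is a gap, a negative value is an overlap.

```python
from statistics import mean

def floor_transfer_offsets(turns):
    """Return the offset in ms at each speaker change.

    `turns` is a list of (speaker, start_ms, end_ms) tuples sorted by start time.
    Positive offsets are gaps, negative offsets are overlaps.
    """
    offsets = []
    for (prev_spk, _, prev_end), (next_spk, next_start, _) in zip(turns, turns[1:]):
        if next_spk != prev_spk:  # only a change of speaker counts as a handoff
            offsets.append(next_start - prev_end)
    return offsets

# Invented toy conversation, not real data.
turns = [
    ("A", 0, 1800),
    ("B", 1950, 3200),   # B starts 150 ms after A stops
    ("A", 3150, 5000),   # A starts 50 ms before B stops
    ("A", 5400, 6000),   # same speaker continues, so not a handoff
    ("B", 6320, 7000),   # B starts 320 ms after A stops
]
offsets = floor_transfer_offsets(turns)
print(offsets)        # [150, -50, 320]
print(mean(offsets))  # 140
```

Overlaps show up as negative numbers, which is why a distribution of these offsets can peak near zero even though nobody is reacting to silence.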
That number should bother you, because 200 milliseconds is not enough time to hear silence and react to it. It is barely enough time to start moving your mouth. People hit it by predicting the end of the other person's turn before it arrives, reading the shape of what is being said. Pitch, speaking rate, intonation, and grammatical completeness all carry the signal, and the more of those cues line up at once, the more likely a listener is to take the floor (Gravano and Hirschberg, 2009). Conversation is a forecasting problem, and humans are good at it.
Why "wait for silence" is the wrong model
Most voice systems do not forecast. They wait for silence and start a timer. The trouble is that a pause in the middle of a thought sounds exactly like the silence at the end of one. As a 2017 paper on streaming recognition put it:
"Silence detection and end-of-query detection are fundamentally different tasks."
That single sentence (Shannon et al., 2017) is the whole problem. If all you measure is silence, you cannot tell "I'd like to check my balance" from "I'd like to check my balance, and also..." until the speaker either keeps going or doesn't. So you set a timer, and the timer is a trap. Make it long and every reply lags: an 800-millisecond pause threshold "adds nearly a full second to every single response," in LiveKit's accounting. Make it short and you talk over people. There is no setting that wins, because the thing you are measuring is not the thing you care about.
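The trap is easy to reproduce in a few lines. The sketch below is a generic silence-timer endpointer, not any particular product's implementation; `silence_endpoint`, the 100 ms frame size, and the toy utterance are assumptions made for illustration.

```python
def silence_endpoint(frames, frame_ms, threshold_ms):
    """Return the time in ms at which a silence timer declares end of turn, or None.

    `frames` is a sequence of booleans, True where the frame contains speech.
    The timer only starts once some speech has been heard.
    """
    silent_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_ms = 0            # any speech resets the timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= threshold_ms:
                return (i + 1) * frame_ms
    return None

# Toy utterance in 100 ms frames: 1 s of speech, a 500 ms mid-thought pause,
# 1 s more speech, then silence. The speaker actually finishes at 2500 ms.
frames = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 10

print(silence_endpoint(frames, 100, 300))  # 1300: fires inside the pause
print(silence_endpoint(frames, 100, 800))  # 3300: correct turn, 800 ms late
```

With the short threshold the system cuts in at 1300 ms, in the middle of the thought. With the long one it waits out the pause but answers 800 ms after the speaker finished. No value of `threshold_ms` fixes both, because the function never looks at anything except silence.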
The research answer, and what it quietly requires
The better approach is to do what people do and predict the handoff instead of waiting for it. The most influential line of work here is Voice Activity Projection, or VAP (Ekstedt and Skantze, 2022). Rather than labeling the current instant as speech or silence, a VAP model predicts the joint future activity of both speakers over the next second or two, and reads turn-taking events out of that forecast.
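A rough sketch of what "joint future activity" can look like as a training label follows. The bin widths (200, 400, 600 and 800 ms over a 2-second window), the majority rule for marking a bin active, and the 50 ms frame size reflect our reading of the published VAP description and should be treated as assumptions; `projection_label` is a hypothetical helper, not code from the paper.

```python
BIN_MS = (200, 400, 600, 800)  # assumed bin widths, together covering 2 s

def projection_label(activity_a, activity_b, now, frame_ms=50):
    """Encode both speakers' upcoming activity as one integer in [0, 255].

    activity_a / activity_b: lists of 0/1 voice-activity flags, one per frame.
    now: index of the current frame; the window starts at the next frame.
    """
    bits = []
    for activity in (activity_a, activity_b):
        start = now + 1
        for width in BIN_MS:
            n = width // frame_ms
            window = activity[start:start + n]
            if len(window) < n:
                raise ValueError("not enough future frames for a full window")
            # A bin counts as active if speech fills more than half of it.
            bits.append(1 if sum(window) * 2 > n else 0)
            start += n
    label = 0
    for bit in bits:  # pack the 8 bits: speaker A's bins first, then B's
        label = (label << 1) | bit
    return label

# Toy example: A talks for 250 ms more and stops; B comes in 400 ms from now.
a = [1] + [1] * 5 + [0] * 35
b = [0] + [0] * 8 + [1] * 32
print(projection_label(a, b, now=0))  # 131, i.e. 0b10000011
```

The label 0b10000011 reads as "A active in the first bin only, B active in the last two bins", which is a turn shift seen from before it happens. Both activity tracks are needed to build it, one per speaker.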
It works. It also has a requirement that matters more than it first appears: it learns from stereo audio, one channel per speaker (Inoue et al., 2024). To learn when a turn ends, the model has to watch both sides of thousands of real conversations, kept on separate channels so it can tell who is doing what to whom. Hold onto that requirement, because it is the entire story for Indian languages.
The data wall
A turn-taking model is only as good as the conversations you can show it. And the large public speech corpora for Indian languages are almost all the wrong shape for this. That is not a criticism of them. They were built, carefully, for a different job: training speech recognizers and synthesizers, which need clean, labeled, mostly single-speaker audio.
Look at what exists. IndicVoices, one of the most serious recent efforts, is mostly read speech and one-person monologue, with only about a sixth of it tagged conversational. Even that conversational slice falls short: it was collected over a single 8 kHz telephone channel, so it never gives you the two speakers as the separate, time-aligned tracks a turn-taking model has to learn from. AI4Bharat's own text-to-speech follow-up (IndicVoices-R) dropped the conversational subset altogether because of its 8 kHz audio quality. Vaani, a very large collection from IISc and partners, is image-prompted monologue on a single channel. Kathbath is read speech. Shrutilipi is broadcast news. Each of these is a real achievement. None of them is two-party, channel-separated conversation. The one thing a turn-taking model needs is the one thing nobody set out to record.
"But there are Hindi turn detectors"
It would be wrong to say nobody has shipped anything. Two open models are worth a look, because they take opposite routes. Pipecat's smart-turn is acoustic: a small Whisper-based classifier that listens to the audio and predicts whether the turn is complete. LiveKit's detector goes the other way and reads the words, running the in-progress transcript through a multilingual language model that already knows, from text, what a finished Hindi sentence looks like. Both report strong Indian-language numbers: smart-turn in the low nineties on Hindi, and LiveKit higher still.
The catch is what those numbers are measured against. They come from each model's own held-out data, which leans heavily on synthetic examples and clean, mostly non-telephony speech. Whether they hold up on a real Indian phone call is a separate question, and there are public reasons for doubt. Smart-turn's audio front end is hardwired to expect 16 kHz, and feeding it the 8 kHz a phone line actually delivers silently corrupts its input, a failure mode logged in the open and easy to reproduce. The transcript-reading approach sidesteps the missing-audio problem in a clever way, by borrowing grammatical knowledge a language model picked up from written Hindi. But notice what that workaround concedes: the reliable move for Indian languages today is to avoid learning turn-taking from Indian conversational audio, because there is not enough of it to learn from. And as far as we can find, there is still no peer-reviewed turn-taking model with a held-out evaluation on any Indian language at all.
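The sample-rate failure is the kind that a small guard in front of the model can turn from silent into loud. The sketch below is illustrative only: `to_expected_rate` is a hypothetical helper, not part of smart-turn's API, and the linear interpolation stands in for the proper resampler a production pipeline would use. The 16 kHz figure is the one quoted above.

```python
EXPECTED_RATE = 16_000  # rate the acoustic front end is described as expecting

def to_expected_rate(samples, rate, expected=EXPECTED_RATE):
    """Return samples at `expected` Hz, or raise instead of passing bad input on.

    Only integer upsampling factors are handled (for example 8 kHz to 16 kHz).
    """
    if rate == expected:
        return list(samples)
    if expected % rate != 0:
        raise ValueError(f"unsupported sample rate {rate} Hz; expected {expected} Hz")
    factor = expected // rate
    out = []
    for i, current in enumerate(samples):
        # Hold the last sample at the end of the signal.
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        for k in range(factor):
            out.append(current + (nxt - current) * k / factor)
    return out

print(to_expected_rate([0.0, 1.0, 0.0], rate=8_000))
# [0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
```

Upsampling makes the input the right shape, but it cannot restore what the phone line never carried: an 8 kHz signal holds nothing above 4 kHz, so a model trained on wideband speech is still hearing something it was not trained on.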
The twist
There is a genuinely humbling result buried in this research. When researchers standardized how speech was segmented and labeled across languages, a turn-taking model's accuracy turned out to depend more on the dataset and its labeling conventions than on the language itself (Sato, Chiba and Higashinaka, 2024). Set that beside a separate finding that one multilingual model can match language-specific ones and even identify which language it is hearing (Inoue et al., 2024), and the lesson is not that Hindi turn-taking is some exotic problem needing a special architecture. It is that the field keeps mistaking holes in its data and its labels for difficulty in the language.
That cuts both ways. You probably do not need to invent a new model to handle Indian languages. You do need the conversations and the labels, and collecting them is the slow, expensive part that nobody has done at scale.
A voice agent that interrupts you, or leaves you hanging while a timer counts down, does not come across as unintelligent. It comes across as rude. And unlike the friend who talks over you, it gets none of the benefit of the doubt. For the languages most of the world actually speaks, fixing that will not come from a larger model or a cleverer silence threshold. It will come from recording the conversations the last decade of speech datasets skipped: both sides of the call, kept on separate channels, in the languages and the half-and-half mixtures people really use. That corpus does not exist yet. Building it is the unglamorous part, and it is the part that matters.
References and further reading
- T. Stivers et al., "Universals and cultural variation in turn-taking in conversation," PNAS, 2009.
- A. Gravano and J. Hirschberg, "Turn-yielding cues in task-oriented dialogue," SIGDIAL, 2009.
- M. Shannon et al., "Improved end-of-query detection for streaming speech recognition," Interspeech, 2017.
- E. Ekstedt and G. Skantze, "Voice Activity Projection: Self-supervised Learning of Turn-taking Events," Interspeech, 2022.
- K. Inoue et al., "Multilingual Turn-taking Prediction Using Voice Activity Projection," LREC-COLING, 2024.
- T. Sato, Y. Chiba and R. Higashinaka, "Investigating the Language Independence of Voice Activity Projection Models through Standardization of Speech Segmentation Labels," APSIPA ASC, 2024.
- G. Skantze, "Turn-taking in conversational systems and human-robot interaction: a review," Computer Speech and Language, 2021.
- T. Javed et al., "IndicVoices: Towards building an inclusive multilingual speech dataset for Indian languages," ACL Findings, 2024.
- Pipecat / Daily, "Announcing smart-turn v3," 2025.
- LiveKit, "Turn detection for voice agents," 2024.