How AI Voice Agents Handle Hindi-English Code-Switching

Divyang Mandani
June 9, 2026

When a customer says "Sir, mera order cancel karna hai because size fit nahi hua," they switch between English and Hindi several times in one breath. Around 250 million Indians communicate this way every day. Yet recent research published in Data in Brief shows speech recognition models suffer a relative Word Error Rate increase of 30 to 50 percent on code-switched speech compared to clean monolingual audio.

If you have ever watched a voice bot freeze the moment a caller drops an English word into a Hindi sentence, you already know the problem. Hindi-English code-switching is not an edge case in India. It is the default way people talk. A voice agent that cannot follow the switch sounds broken within the first ten seconds of a call.

So can AI actually keep up? Yes, but only when support for mixed speech is built in from the ground up, not patched on afterward. Here I will walk through exactly how that works: why standard models fail, how code-switching-native pipelines fix it, and why speaking Hinglish back is harder than understanding it.

What Hindi-English Code-Switching Actually Is

Code-switching is the practice of alternating between two or more languages inside a single conversation, sentence, or phrase. When that mix is Hindi and English, people call it Hinglish. It is not slang or broken speech. It is a structured, rule-governed way that bilingual Indians express themselves.

The phenomenon is huge in scale. A study behind the VITB-HEBiC corpus, published in Applied Acoustics, found that roughly 52 percent of urban India is bilingual and about 18 percent can speak three languages. Voice AI that ignores this reality is not serving most of its callers.

The three patterns Indian callers use

Linguists sort code-switching into three patterns, and each one stresses a voice system differently. Understanding them tells you where a voice agent is most likely to stumble.

  • Inter-sentential switching: the language changes at a sentence boundary. Example: "Let me check that. Aap thoda wait kijiye." This is the easiest pattern, because the model gets a clean acoustic pause to reset its language assumption.
  • Intra-sentential switching: the languages mix inside one sentence. Example: "Mujhe flight book karni hai." Here Hindi grammar wraps around the English words "flight" and "book" with no clean boundary, which is why it is the toughest case.
  • Tag-switching: a short phrase or discourse marker from one language drops into another. Example: "You know, main soch raha tha." The tag is brief, but it can still trip a model expecting one language per utterance.
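
To make the patterns above concrete, here is a minimal Python sketch that tags each token with a language and counts the switch points. The word list and the function names `tag_tokens` and `count_switches` are hypothetical and only for illustration; a production system would rely on a trained token-level language identifier, not a hand-written lexicon.

```python
# Toy lexicon (assumption): a real system would use a trained token-level
# language identifier instead of a hand-written word list.
HINDI_WORDS = {"mujhe", "karni", "hai", "main", "soch", "raha", "tha"}


def tag_tokens(sentence):
    """Label each token 'hi' or 'en' using the toy lexicon above."""
    tokens = [t.strip(".,!?\"'").lower() for t in sentence.split()]
    return [(t, "hi" if t in HINDI_WORDS else "en") for t in tokens if t]


def count_switches(tagged):
    """Count positions where the language label differs from the previous token."""
    labels = [lang for _, lang in tagged]
    return sum(1 for prev, curr in zip(labels, labels[1:]) if prev != curr)


intra = tag_tokens("Mujhe flight book karni hai.")
tag = tag_tokens("You know, main soch raha tha.")
print(count_switches(intra))  # two switch points: before "flight" and after "book"
print(count_switches(tag))    # one switch point: after the tag "You know"
```

Even this toy version shows why the intra-sentential example is harder: the switch points fall between adjacent words, with no sentence boundary to mark them.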

Why intra-sentential switching is the hard part

Intra-sentential switching is where most voice agents break, and the reason is structural. There is no silent gap that signals "a new language starts here." The English noun arrives mid-stream, governed by Hindi sentence structure, and the recognizer has milliseconds to adapt.

Linguists describe this using the Matrix Language Frame model, where one language sets the grammatical frame and the other supplies embedded words. In Hinglish, Hindi usually provides the frame while English supplies nouns and verbs. A voice system that cannot model this relationship treats the English words as noise or mishears them entirely.
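
As a rough illustration of the Matrix Language Frame idea, the sketch below guesses the frame language by counting grammatical words (auxiliaries, pronouns, light verbs) from each language. Both word lists and the function name `guess_matrix_language` are assumptions made for this example; this is a crude heuristic, not how a trained model represents the frame.

```python
# Small hand-picked sets of grammatical words (assumption, for illustration only).
HINDI_FUNCTION = {"hai", "hain", "tha", "ka", "ki", "ko", "mera", "mujhe",
                  "karna", "karni", "nahi", "hua"}
ENGLISH_FUNCTION = {"the", "is", "are", "was", "to", "of", "my", "i", "have", "not"}


def guess_matrix_language(sentence):
    """Return 'hi', 'en', or 'undetermined' based on which language
    supplies more of the sentence's grammatical words."""
    tokens = [t.strip(".,!?\"'").lower() for t in sentence.split()]
    hindi_count = sum(t in HINDI_FUNCTION for t in tokens)
    english_count = sum(t in ENGLISH_FUNCTION for t in tokens)
    if hindi_count == english_count:
        return "undetermined"
    return "hi" if hindi_count > english_count else "en"


print(guess_matrix_language("Mujhe flight book karni hai."))  # Hindi frame
print(guess_matrix_language("I have to book the flight."))    # English frame
```

In the first sentence the English words "flight" and "book" carry the content, but every grammatical word is Hindi, which is the relationship the model describes.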

Why Standard ASR Fails on Code-Switched Speech

Most failures of off-the-shelf ASR on code-switched speech trace back to one decision made long before the call: the model was trained to expect a single language. Indian speech does not cooperate with that assumption.

The monolingual training problem

Standard speech recognition models learn from monolingual corpora. They build a statistical expectation that an utterance stays in one language from start to finish. The moment a caller violates that expectation, accuracy collapses.

The numbers are stark. The HiACC corpus research reports that ASR systems see a 30 to 50 percent relative increase in Word Error Rate on code-switched audio versus monolingual audio. That is the difference between a bot that helps and a bot that frustrates. Some practitioner reports put absolute WER on code-switched production telephony audio near 42 percent.
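
Because relative and absolute figures are easy to confuse, here is a small Python sketch of both calculations: word-level WER via edit distance, and the relative increase between two WER values. The sample transcripts and the 10 percent baseline are made-up numbers for illustration, not figures from the cited research.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(baseline, observed):
    """Relative change of `observed` over `baseline`, e.g. 0.4 means +40 percent."""
    return (observed - baseline) / baseline


# One misheard word out of five reference words.
wer = word_error_rate("mujhe flight book karni hai", "mujhe light book karni hai")
print(wer)

# Hypothetical: a 10 percent monolingual WER rising to 14 percent on mixed speech.
print(relative_increase(0.10, 0.14))
```

Under these assumed numbers, a 40 percent relative increase moves WER from 10 to 14 percent, which is a very different statement from an absolute WER of 40 percent.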

How code-switching-native ASR fixes it

A code-switching-native recognizer is trained directly on mixed-language audio, so it expects the switch instead of fighting it. Modern systems pair a streaming Conformer or RNN-T acoustic model with code-switched training data drawn from sources like the AI4Bharat IndicVoices project, which spans 23,700 hours of speech across 22 languages.

Production engines such as Whisper variants and Deepgram now document explicit code-switching support, generating mixed-token transcripts rather than forcing one language. Benchmarks like GLUECoS and language models like L3Cube HingBERT let teams measure and improve performance on real Hinglish rather than guessing. In my experience building voice AI for Indian businesses, this single architectural choice separates a demo from a deployable product.

Speaking Hinglish Back: The Code-Switched TTS Problem

Here is the part almost every guide skips. Understanding Hinglish is only half the job. The agent also has to speak it back naturally, and code-switched TTS is arguably harder than recognition.

Why stitched-together voices sound wrong

When an agent says "Sir, aapka loan amount approve ho gaya hai," that sentence must sound like one person, not two voices spliced together. Many systems generate the Hindi span with a Hindi voice and the English span with an English voice. The result is jarring and instantly signals a machine.

Natural code-switched speech needs a single TTS voice, for example a VITS-based model, trained to pronounce both languages with consistent timbre and prosody. There is also a latency trap. If regional-language generation runs 300 milliseconds slower than English, the conversation feels lopsided and the caller senses something is off.

Matching the caller's register

Good voice AI does not just translate. It mirrors how the caller speaks. If a customer opens in casual Hinglish, replying in formal pure Hindi feels cold and out of place.

A Wizard-of-Oz study cited in a major code-switching survey found bilingual users actually prefer agents that match their own switching patterns. So register-matching is not a nicety. It is a measurable driver of trust. The agent should sense whether the caller leans English, leans Hindi, or mixes freely, then respond in kind throughout the call.
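
One simple way to picture register-matching is to measure how much of the caller's utterance is English and pick a reply style accordingly. The sketch below does that in Python. The lexicon, the 0.3 and 0.7 thresholds, and the names `english_share` and `pick_register` are all assumptions for illustration; a real agent would track this over several turns with a proper language identifier.

```python
# Toy Hindi lexicon (assumption): stands in for a real token-level language ID.
HINDI_WORDS = {"mera", "karna", "hai", "nahi", "hua", "kahan", "aap", "kijiye"}


def english_share(utterance):
    """Fraction of tokens not found in the toy Hindi lexicon."""
    tokens = [t.strip(".,!?\"'").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    english = sum(1 for t in tokens if t not in HINDI_WORDS)
    return english / len(tokens)


def pick_register(utterance, low=0.3, high=0.7):
    """Map the caller's English share to a reply register (thresholds are hypothetical)."""
    share = english_share(utterance)
    if share >= high:
        return "english-leaning"
    if share <= low:
        return "hindi-leaning"
    return "mixed"


print(pick_register("Sir, mera order cancel karna hai because size fit nahi hua"))
print(pick_register("Mera order kahan hai"))
print(pick_register("Please cancel my order"))
```

With this toy lexicon the three callers come out as mixed, Hindi-leaning, and English-leaning respectively, and the agent would shape its reply to match.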

How Modern Voice AI Detects and Adapts Mid-Call

Detecting the switch is one thing. Doing it inside a live call without adding lag or losing the thread is the real engineering challenge.

Real-time detection without a latency penalty

Language switching mid-sentence has to be detected continuously, not once at the start of a call. The best systems run code-switching detection inside a single model path rather than chaining a separate language-identification step that adds delay.

This matters because every model hop eats your latency budget. If recognition adds 400 milliseconds and the language model needs 200, the agent is already at 600 milliseconds before any network delay. Indian voice AI targets a sub-200 millisecond response feel, so collapsing detection into one pass is essential. Constraining detection to a known pair, like Hindi and English, also sharpens accuracy versus scanning every supported language.
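
The latency arithmetic above can be written down as a small budget check. In this Python sketch the 400 and 200 millisecond figures come from the paragraph above, while the 80 millisecond cost of a separate language-identification hop and the stage names are hypothetical values chosen only to show the comparison.

```python
def total_latency_ms(stages):
    """Sum per-stage latencies (in milliseconds) for a sequential pipeline."""
    return sum(stages.values())


# Chained design: a separate language-ID model runs before recognition.
# The 80 ms figure is a made-up illustration, not a measured value.
chained = {"language_id": 80, "asr": 400, "llm": 200}

# Single-pass design: detection happens inside the recognizer itself.
single_pass = {"asr_with_detection": 400, "llm": 200}

for name, stages in (("chained", chained), ("single_pass", single_pass)):
    print(name, total_latency_ms(stages), "ms before network delay")

extra_ms = total_latency_ms(chained) - total_latency_ms(single_pass)
print("extra cost of the separate hop:", extra_ms, "ms")
```

The exact numbers will differ per deployment. The point is that each additional sequential model adds its full latency to every turn.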

Holding context across the switch

Detecting the language is useless if the meaning gets lost. Research on multilingual dialogue agents shows the most common code-mixed failure is a slot error: the system drops or scrambles a key detail like an account number or a date.

Modern pipelines use RAG-grounded dialogue management so the agent keeps the customer's intent and retrieved data stable even as the surface language flips. The recognizer can hand a clean mixed-token transcript to the language model, which reasons over meaning rather than raw language tags. Done well, the caller never notices the machinery. They just feel understood.
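
A minimal sketch of the "hold context across the switch" idea: keep slots in a language-independent dialogue state and never let an empty value from a later turn erase something already captured. The slot names, sample turns, and the function `merge_slots` are hypothetical. This shows the state-keeping principle only, not a RAG pipeline.

```python
def merge_slots(state, new_slots):
    """Return a new state where earlier values survive unless the
    new turn supplies a non-empty replacement."""
    merged = dict(state)
    for name, value in new_slots.items():
        if value:  # ignore None or empty strings from the new turn
            merged[name] = value
    return merged


# Turn 1 (English): the caller gives an account number only.
state = merge_slots({}, {"account_number": "4821", "date": None})

# Turn 2 (Hinglish): the caller adds a date; the parser finds no account number.
state = merge_slots(state, {"account_number": None, "date": "12 June"})

print(state)
```

After both turns the state still holds the account number from the English turn alongside the date from the Hinglish turn, which is the failure mode a slot error would otherwise produce.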

What This Means for Indian Businesses

If you are evaluating multilingual voice AI in India, the takeaway is practical. The question is not whether a vendor "supports Hindi." It is whether the system was built for code-switching at every layer.

Accuracy, cost, and the Tier-2 reality

Accuracy varies sharply by region and language mix, and honest vendors admit this. Here is what real deployments tend to show.

  • Metro Hindi and Hinglish: leading agents report 90 to 95 percent accuracy on clean metro audio, which is strong enough for most support and sales flows.
  • Tier-2 and Tier-3 audio: accuracy drops, often into the 70 to 88 percent range, due to dialect variation, ambient noise, and weaker mobile connectivity.
  • Cost advantage: AI calling in regional languages can run far cheaper than scarce human agents fluent in those languages, which is often the only way to reach Tier-2 and Tier-3 prospects at scale.

The point is to test on your own call recordings, not a vendor demo. A demo on quiet metro audio tells you little about a noisy 2G call from a Tier-3 town.
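
If you do run that test on your own recordings, it helps to break results out by audio condition, not report one blended number. Here is a small Python sketch of that aggregation. The segment labels and word counts are invented for illustration and do not come from any real deployment.

```python
from collections import defaultdict


def accuracy_by_segment(results):
    """results: iterable of (segment, correct_words, total_words), one per call.
    Returns word-level accuracy per segment."""
    totals = defaultdict(lambda: [0, 0])
    for segment, correct, total in results:
        totals[segment][0] += correct
        totals[segment][1] += total
    return {seg: c / t for seg, (c, t) in totals.items() if t > 0}


# Hypothetical scored calls from your own recordings.
calls = [
    ("metro", 93, 100),
    ("metro", 90, 100),
    ("tier3_noisy", 74, 100),
    ("tier3_noisy", 80, 100),
]

report = accuracy_by_segment(calls)
for segment, accuracy in report.items():
    print(f"{segment}: {accuracy:.1%}")
```

A per-segment view like this makes it obvious when a strong metro score is hiding weak performance on noisy Tier-3 calls.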

Compliance and data residency

Indian voice deployments sit inside a real regulatory stack, and code-switching does not exempt you from it. Two frameworks matter most for outbound and customer voice data.

  • DPDP Act 2023: India's data protection law sets the framework for handling personal data, with core obligations expected to phase in over the coming period. Teams should plan for consent, purpose limitation, and clear data-handling records now.
  • TRAI DLT: the Distributed Ledger Technology registration regime governs commercial communications, so outbound calling programs need to align with DLT and DND rules from day one.

Choosing a provider with India data-residency options keeps you on the safe side as these rules tighten. Compliance is not a feature you bolt on later. It is part of being production-ready.

Conclusion

Handling Hindi-English code-switching well comes down to three things: a recognizer trained on mixed speech, a single natural voice that can speak both languages back, and a pipeline that holds context as the caller switches. Get those right and the conversation simply feels human.

You now know the questions that separate a real Hinglish-native system from a relabeled monolingual one. You can ask any vendor about intra-sentential accuracy, code-switched TTS, and latency, and judge the answers with confidence.

At OnDial, we build voice AI tuned for exactly how Indian customers speak, mixing Hindi and English the way they naturally do, not the way a script assumes. If your callers code-switch and your current system cannot keep up, that is the conversation worth having next.

Frequently Asked Questions

Yes, AI voice agents can handle Hindi-English code-switching. Systems trained natively on code-switched audio handle real Hindi-English mixing accurately, while monolingual-trained models struggle and misfire.

Most standard speech recognition models are trained to expect one language per sentence, so mid-sentence switches break their assumptions and Word Error Rate climbs sharply.

On accuracy, leading agents reach 90 to 95 percent on clean metro audio, dropping to roughly 70 to 88 percent on noisy Tier-2 and Tier-3 calls.

A voice agent can also speak Hinglish naturally when it uses a single code-switching-native TTS voice that keeps consistent tone across both languages instead of stitching two voices.

For Hindi-heavy or regional call volumes, code-switching voice AI is worth adopting. It removes caller friction and reaches markets that monolingual or human-only models cannot serve affordably.

Divyang Mandani

Founder & CEO

Divyang Mandani is the CEO of OnDial, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.
