Op-Ed: How AI Is Changing the Game for Dubbing, Translation, and Localization

A look at new possibilities, challenges, and solutions coming out of NAB 2025

AI dubbing was one of the most vibrant areas of innovation on display at NAB 2025, but artificial intelligence is also opening exciting new frontiers, especially in audio translation. To find specific use cases where AI offers real ROI, SVG sent Contributor Brian Ring to the show floor in Vegas. He explored how the landscape is changing and where AI can play a role in dubbing, translation, and localization. His report focuses on some of the vendors and technologies he discovered.

In the Beginning, Single-Language Media

I earned high honors in cognitive psych more than three decades ago at UC Berkeley and took a linguistics class from renowned cognitive linguist George Lakoff, author of the great book Don’t Think of an Elephant. I provide these bona fides only as context for what I’m about to say: until ChatGPT’s release more than two years ago, if you had asked me what separates humans from other animals, and indeed from machines, I’d have answered quickly: “Language.”

It’s hard to continue to see it that way.

At first, from about 2018 to ’22, smaller language models and natural-language-processing techniques made great progress in machine translation. AI localization began to be interwoven with human-dubbing workflows, which became more efficient and accurate.

Language, a central component of society and culture globally, could more accurately be seen as something that separates humans from other humans. Therefore, advanced translation workflows — human, machine, and networked — became an enabler for the massive globalization of media that sits with us today.

Take Netflix, the ultimate pioneer in global-scale media. As of Q4 2024, the company had roughly 90 million subscribers in the U.S. and Canada, a mere 30% of its overall subscriber base. EMEA, a region home to the European Union’s 24 official languages and many more, accounts for another 33%. Then there are the 345 million speakers of Arabic in the Middle East and, in the LATAM market, Spanish speakers (60% of the population), Portuguese speakers (30%), and speakers of various indigenous languages (10%).

APAC has at least another 10 major languages spoken, from Mandarin Chinese (spoken by more than 1.1 billion) to Hindi (500 million-plus) and down to Marathi (80 million-plus), the 10th-most-spoken language in the region. In the middle are Japanese, Punjabi, Arabic, Malay, and other languages.

Growing subscribers in those international markets has been central to the growth of most major media companies in the U.S. for at least a decade. And the use of AI and generative AI is going to help improve the offerings (and subscriber experiences) even more.

Deep-Learning Tech Takes AI Deeper Into Dubbing

In 2022, transformers and large language models (LLMs) exploded onto the scene, along with a notable startup called Eleven Labs. Founded on the latest in deep-learning tech, Eleven Labs focused on natural-sounding voice and speech synthesis.

Suddenly, the translation of text into other languages could go one huge step further, into the realm of AI dubbing. One voice turns into 20. In tone, emotion, and timbre, they sound similar, but listen closely: the same person is fluently speaking Japanese? Mandarin? 20 other languages?

Viral clips of Barack Obama speaking AI-generated Japanese made the rounds on YouTube. Media executives familiar with a globalized economic landscape were giddy with excitement. Unlike so many amazing technologies, this one had a concrete use case that was easy to see: flip a switch and broadcast your channel, sports event, or newscast, live or recorded, into 20 regional languages with no additional humans and minimal workflow complexity.

Thus began one of the most aggressive media-tech landgrabs of the generative-AI era. In my 25 years in this business, I can’t recall any other segment exploding quite as quickly as AI dubbing.

AI Localization Leaders and Dubbing Upstarts

NAB 2025 offered plenty of new technologies and workflows designed to improve delivery of “localized” content in languages that differ from the original. Here are some of the companies and technologies that I encountered on the show floor.

Deepdub (www.deepdub.ai) is an AI-powered localization company founded in 2019 and based in Tel Aviv. Its technology synchronizes emotional tone, audio, and video to deliver compelling human-like translation experiences in 130 languages.

At NAB 2025, Deepdub launched Live, an AI-powered dubbing solution designed for real-time multi-language broadcasting. According to the company, its proprietary Emotive Text to Speech (eTTS) technology dynamically adapts vocal tone, intensity, and energy levels to provide high-energy sports commentary, urgent breaking news, and immersive live event narration. Broadcasters can choose to deliver live dubbing using the original speaker’s cloned voice or select a voice from Deepdub’s curated voice bank, pre-cleared for broadcast and streaming rights.

Running on AWS Elemental MediaPackage with enterprise-grade infrastructure, Deepdub Live provides low-latency, frame-accurate synchronization, allowing publishers of all types to reach global markets at lower cost without sacrificing the authenticity or emotional impact that a human typically brings.

Dubformer (www.dubformer.ai) is a secure AI-dubbing and localization platform founded in 2023 and headquartered in Amsterdam. The company recently raised $3.6 million in seed funding to advance its emotional AI dubbing technology, which supports 70+ languages and, the company says, serves 200+ customers.

The company maintains that much AI-generated dubbing suffers from the neutral, bland tones associated with generative AI; many voicing technologies sound hollow or fake. Dubformer’s Emotion Transfer technology works differently from cloning, seeking to transfer fine-grained, local aspects of speech (intonation, emotion, pace) from the source audio reference to better mimic the nuance of human speech.

The company serves a range of content types, including documentaries, news, shows, and travel channels. It also offers a live-dubbing solution and a collection of 400+ voices.

Lingopal (www.Lingopal.ai) calls itself the gold standard for speech-to-speech translation, and it has certainly put points on that board in the past 12 months with wins like Tennis Channel, NBA, Disney, and BlackRock.

The company had a robust presence at NAB 2025: embedded in an appliance offering at the Ateme booth; taking center stage at rival MediaKind just across the aisle; and boasting about its partnership with FAST-channel leader Amagi (Tennis Channel’s FAST provider) just a few booths over.

Founded in 2023 in Rego Park, NY, and led by ambitious founders Deven Orie and Casey Schneider, the company recently raised an eye-popping $14 million in Series A capital from notable investors like DCM and Marquee Ventures.

Lingopal’s speech-to-speech engine leverages six foundational AI models developed in-house. Think of it as collapsing three workflow steps into one: instead of moving from speech to text, text to translated text, and then translated text to voice, its single trainable deep-learning platform moves from speech in one language to speech in another. Transcription, speaker detection, and emotion analysis are also part of the formula. The company says it can handle a range of genre-specific vocabularies, including slang, sports idioms, and multiple speakers.
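To make the distinction concrete, here is a minimal Python sketch of the two architectures. The function names (transcribe, translate_text, synthesize, speech_to_speech) are hypothetical stand-ins rather than Lingopal’s actual API; the point is simply that the direct route removes two intermediate text hops, along with the latency and error accumulation they introduce.

```python
from dataclasses import dataclass

@dataclass
class Audio:
    samples: bytes   # raw PCM audio
    language: str    # e.g., "en", "es"

# Hypothetical stubs; a real system would call ASR/MT/TTS services
# or a trained end-to-end network here.
def transcribe(audio: Audio) -> str: ...
def translate_text(text: str, target_lang: str) -> str: ...
def synthesize(text: str, target_lang: str) -> Audio: ...
def speech_to_speech(audio: Audio, target_lang: str) -> Audio: ...

def cascaded_dub(src: Audio, target_lang: str) -> Audio:
    """Classic three-step pipeline: ASR -> machine translation -> TTS."""
    text = transcribe(src)                          # speech -> source-language text
    translated = translate_text(text, target_lang)  # source text -> target text
    return synthesize(translated, target_lang)      # target text -> synthetic speech

def direct_dub(src: Audio, target_lang: str) -> Audio:
    """Single trainable model: speech in one language -> speech in another.

    Speaker identity, emotion, and timing can be carried through the model
    rather than being lost at each text hop.
    """
    return speech_to_speech(src, target_lang)
```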

As for the gold standard? While all companies claim to be the best, Lingopal provided some evidence of high scores in a global standard test metric known as “BLEU,” or Bilingual Evaluation Understudy. The test measures the similarity between a machine-generated translation and human-created reference translations.
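For readers who want to see how such a number is produced, here is a small example using the open-source sacrebleu package. The sentences are invented purely for illustration and say nothing about any vendor’s actual scores.

```python
# pip install sacrebleu
import sacrebleu

# Machine-generated translations, one string per segment.
hypotheses = [
    "The striker scores in the final minute.",
    "The match ends in a draw.",
]

# Human reference translations; sacrebleu takes a list of reference
# streams, each parallel to the hypotheses, hence the list of lists.
references = [[
    "The striker scores in the last minute.",
    "The match ends in a tie.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100; higher means closer to the references
```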

Sure, there’s no silver bullet for evaluating AI accuracy, but any vendor that at least tries to find an objective North Star for this complicated mission is worth a mention.

Camb.ai (www.camb.ai) is another AI-dubbing startup with an impressive pile of cash. The company raised $4 million in seed funding led by Courtside Ventures and another $11 million in pre–Series A funding more recently. Founded by a father/son team in 2022 and based in the UAE, Camb.ai specializes in live dubbing in more than 100 languages, dialects, and accents using original-sounding voices with plenty of nuance.

Unlike other vendors at NAB 2025, Camb.ai was clear and transparent about the two proprietary foundational AI models used. Co-founder/CTO Akshat Prakash described them as MARS and BOLI.

MARS is a proprietary model that Camb.ai has decided to open-source to the Hugging Face community, which I find notable in and of itself. MARS achieves accurate text-to-speech, including voice cloning and speech synthesis, with minimal inputs. The latest iteration, MARS6 Turbo, is available for testing on Hugging Face.

BOLI is all about translation. Besides focusing on the speech content, it captures and conveys the speaker’s original tone and emotion. This is accomplished with less than three seconds of recorded input. As a result, the company reports high performance on tasks like colloquialism translations, zero-shot cross-lingual voice cloning, and cross-lingual time sync.

Camb.ai was tapped for Major League Soccer’s inaugural Innovation Lab cohort, was part of the 2024 AO Startups Program with Tennis Australia, and was recently admitted into the fifth-annual cohort of the Comcast SportsTech Accelerator.

DeepTune (www.DeepTune.com), a New York-based venture-backed startup founded in 2022, specializes in human-like text-to-speech and AI dubbing. It raised $3 million in seed funding and is backed by notable investors Alexis Ohanian, Gary Vaynerchuk, and Seven Seven Six.

Although the company started out in SaaS-based dubbing tools for file-based entertainment workflows, it has launched a live-broadcast product, which is currently on trial with Sinclair in an impressive client win announced in February. DeepTune is helping Sinclair launch real-time, AI-powered Spanish translations of local newscasts, marking a first in U.S. broadcast history. The pilot program is active at WBFF Baltimore, KABB San Antonio, WPEC West Palm Beach, and KSNV Las Vegas, delivering live translated broadcasts via each station’s YouTube channel.

Notably, some of the AI-dubbed content is viewable today on YouTube. It’s impressive, but it’s also an example of a core question: if we can indeed achieve fantastic real-time AI dubbing, what do we do with the graphics? More on that below.

SyncWords (www.syncwords.com) is a cloud-based solution offering broadcast-grade live and on-demand AI captioning, subtitling, and dubbing in more than 100 languages. Founded in 2013, the company solves problems with timing and synchronization in caption workflows.

At NAB 2025, SyncWords showed off ultra-low-latency captions with a tiny 1.4-second delay. It also featured AI live dubbing with dynamic speaker changes and dynamic sync. In addition, CEO Ash Shah highlighted the revolutionary multi-lingual workflows that AI dubbing is bringing to market.

The company uses third-party AI for voice, ASR, and translations and focuses on creating simple workflow solutions using the best technologies available.

CaptionHub (www.captionhub.com) positions itself as the leader in enterprise-focused, AI-powered multimedia localization. Its robust integrations and enterprise workflows serve global brands like TED, Ford, Unilever, Fidelity, Allianz, GSK, and Stripe.

The company’s flagship CaptionHub Live enables real-time captions in 55 source and 250 target languages, custom dictionaries for brand accuracy, patent-pending embeddable widgets, and ultra-low-latency delivery, demonstrated at events like AWS re:Invent.

At NAB 2025, CaptionHub showed off one of its newest innovations, Voiceover Editor, which further expands its capabilities with on-the-fly timing adjustments, tempo control, natural-sounding pauses, and complete audio-track management designed to build localization at scale.

Machina Sports (www.machina.gg), a San Francisco startup founded in 2023, delivers developer-friendly gen AI and LLM orchestration tools without requiring complex infrastructure. Its flagship Machina Sports Studio helps sports organizations leverage generative AI across models, across platforms, and across use cases. The company recently won the Sports Pro NY pitch competition and is one of the first companies focused tightly on the use of AI Agents for a wide range of sports-related use cases. The young company is leveraging the power of AI orchestration to power real-time game insights, betting content, and localization, according to CEO André Antonelli.

Verbit (www.verbit.ai) is the new name for VITAC, one of the oldest and largest vendors serving global broadcasters with FCC- and EAA- (European Accessibility Act) compliant captions. Its Captivate product is designed for speech-intensive industries like media, sports, news, and entertainment. Impressive sports-focused dictionaries and a comprehensive suite of captioning, transcription, audio description, translation, note-taking, and dubbing solutions are offered in 50+ languages. Verbit processed more than 4 million hours of transcription last year and serves 3,000+ customers across media, education, broadcast, government, and other industries.

At NAB 2025, Verbit highlighted its Captivate Clips solution, which promises a fast, affordable, and highly accurate way to localize and caption video content in multiple languages.

AI-Media (www.ai-media.tv), founded in Australia in 2003, is a global leader in AI-driven captioning and language technology and counts broadcasters, governments, corporations, and educational institutions across 25+ countries as customers.

At NAB 2025, the company launched its much-anticipated LEXI Voice, an AI-powered real-time voice-translation tool, with a very impressive demo. LEXI Voice converts live broadcasts into multiple languages, using synthetic voices with low latency (about 8-12 seconds) and requires no additional hardware. It supports more than 100 languages and customizable voices and integrates with AI-Media’s caption-encoder network. Priced at $30/hour (plus standard LEXI captioning fees), it reduces live-translation costs by up to 90%.

Todd Vaccaro, VP, Product Development and Innovation, said, “There are so many exciting use cases popping up. The idea of translating not only captions, not only voice, but also graphics. That’s exactly the kind of challenge I’m currently looking to dig into.”

And What About the Graphics?

There is some movement in the industry toward solving that challenge. One promising experiment: asking an LLM along the lines of Anthropic’s Claude 3.7 Sonnet to produce a well-documented plan for using PaddleOCR to build a system that ingests a feed, pulls the text from lower thirds, sends it to an LLM for translation, and reinserts the graphics as regionalized HTML5. That such a plan can be generated on demand shows promise and illustrates the breathtaking pace at which the technology is accelerating.
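Here is a minimal sketch of what such a pipeline might look like, assuming PaddleOCR’s 2.x-style Python API for text detection and the Anthropic SDK for translation. The model name, confidence threshold, and prompt are illustrative placeholders, and a production system would also need frame capture, caching, and graphics re-insertion.

```python
# pip install paddleocr anthropic
from paddleocr import PaddleOCR
import anthropic

ocr = PaddleOCR(lang="en")      # detector + recognizer for on-screen text
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_lower_third(frame_path: str, target_lang: str = "Spanish") -> list[dict]:
    """OCR a video frame, translate each detected string, and return
    box/translation pairs that a graphics engine could re-render as
    regionalized HTML5 overlays."""
    results = ocr.ocr(frame_path, cls=True)
    translated = []
    for box, (text, confidence) in (results[0] or []):
        if confidence < 0.8:    # skip low-confidence detections
            continue
        msg = client.messages.create(
            model="claude-3-7-sonnet-20250219",   # assumed model id
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Translate this broadcast graphic into {target_lang}. "
                           f"Return only the translation: {text}",
            }],
        )
        translated.append({"box": box, "text": msg.content[0].text})
    return translated
```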

The trick now is to evaluate the various workflows. Unfortunately, the most common answer from vendors is that the customer needs to run their own tests (one vendor also mentioned a network of human linguists who could be contracted for that purpose, providing at least a couple of new jobs for humans as a result of AI).

Camb.ai and Lingopal.ai boasted about machine-learning test scores that feel objective and impressive. Although they are meaningful, these granular objective metrics are insufficient tests for the quality and accuracy needed in the human, regulated world of media, where mistakes can offend, cause advertisers to flee, or generate sizable fines for compliance violations. State of the art in the deep-learning world sometimes looks nothing like state of the art in the human world, which is why human evaluation and “arena-style” leaderboards are increasingly important tools for navigating these waters.

But when it comes to natural, authentic-sounding AI dubbing, there’s another factor, an esoteric scientific reason that voicing text is so complicated. It’s called voice prosody, and it refers to the rhythmic and intonational aspects of speech that go beyond the basic sounds, or phonemes.

Prosody includes patterns of stress, intonation, rhythm, tempo, pauses, and pitch variations that convey meaning, emotion, and intent in all spoken language. It helps distinguish questions, statements, and commands. It conveys emotional states and attitudes (excitement, doubt, sarcasm). It signals turn-taking in conversations through pitch changes.
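Most commercial TTS engines expose a subset of these dimensions through SSML, the W3C Speech Synthesis Markup Language. The fragment below, assembled in Python, is a generic illustration: the rate, pitch, and pause values are arbitrary, and support for individual prosody attributes varies from engine to engine.

```python
# A generic SSML fragment showing explicit prosody control; which
# attributes are honored depends on the TTS engine receiving it.
ssml = """
<speak>
  <prosody rate="fast" pitch="high" volume="loud">
    Goal! An unbelievable strike from thirty yards!
  </prosody>
  <break time="400ms"/>
  <prosody rate="medium" pitch="low" volume="medium">
    And that, surely, settles the match.
  </prosody>
</speak>
""".strip()

print(ssml)  # hand off to any SSML-aware TTS engine
```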

And that feels like the right place to leave this discussion. With a new term, voice prosody, that feels as human as can be.
