Like Henry Higgins, the phonetician from George Bernard Shaw’s play “Pygmalion,” Marius Cotescu and Georgi Tinchev recently demonstrated how their student was trying to overcome pronunciation difficulties.
The two data scientists, who work for Amazon in Europe, were teaching Alexa, the company’s digital assistant. Their task: to help Alexa master an Irish-accented English with the aid of artificial intelligence and recordings from native speakers.
During the demonstration, Alexa spoke about a memorable night out. “The party last night was great craic,” Alexa said with a lilt, using the Irish word for fun. “We got ice cream on the way home, and we were happy out.”
Mr. Tinchev shook his head. Alexa had dropped the “r” in “party,” making the word sound flat, like pah-tee. Too British, he concluded.
The technologists are part of a team at Amazon working on a challenging area of data science known as voice disentanglement. It’s a tricky problem that has gained new relevance amid a wave of A.I. developments. Researchers believe that solving it can help make A.I.-powered devices, bots and speech synthesizers more conversational; that is, capable of pulling off a multitude of regional accents.
Tackling voice disentanglement involves far more than grasping vocabulary and syntax. A speaker’s pitch, timbre and accent often give words nuanced meaning and emotional weight. Linguists call this language feature “prosody,” something machines have had a hard time mastering.
Only in recent years, thanks to advances in A.I., computer chips and other hardware, have researchers made strides in cracking the voice disentanglement issue, transforming computer-generated speech into something more pleasing to the ear.
Such work may eventually converge with an explosion of “generative A.I.,” a technology that enables chatbots to generate their own responses, researchers said. Chatbots like ChatGPT and Bard may someday fully act on users’ voice commands and respond verbally. At the same time, voice assistants like Alexa and Apple’s Siri will become more conversational, potentially rekindling consumer interest in a tech segment that had seemingly stalled, analysts said.
Getting voice assistants such as Alexa, Siri and Google Assistant to speak multiple languages has been an expensive and protracted process. Tech companies have hired voice actors to record hundreds of hours of speech, which helped create synthetic voices for digital assistants. Advanced A.I. systems known as “text-to-speech models” — because they convert text to natural-sounding synthetic speech — are just beginning to streamline this process.
The technology “is now able to create a human’s voice and synthetic audio based on a text input, in different languages, accents and dialects,” said Marion Laboure, a senior strategist at Deutsche Bank Research.
Amazon has been under pressure to catch up to rivals like Microsoft and Google in the A.I. race. In April, Andy Jassy, Amazon’s chief executive, told Wall Street analysts that the company planned to make Alexa “even more proactive and conversational” with the help of sophisticated generative A.I. And Rohit Prasad, Amazon’s head scientist for Alexa, told CNBC in May that he saw the voice assistant as a voice-enabled “instantly available, personal A.I.”
Irish Alexa made its commercial debut in November, after nine months of training to comprehend an Irish accent and then to speak it.
“Accent is different from language,” Mr. Prasad said in an interview. A.I. technologies must learn to extricate the accent from other aspects of speech, such as tone and frequency, before they can replicate the peculiarities of local dialects. For instance, the “a” may be flatter and the “t’s” pronounced more forcibly.
These systems must figure out these patterns “so you can synthesize a whole new accent,” he said. “That’s hard.”
Harder still was trying to get the technology to learn a new accent largely on its own, from a different-sounding speech model. That’s what Mr. Cotescu’s team tried in building Irish Alexa. They relied heavily on an existing speech model of primarily British-English accents, with a far smaller range of American, Canadian and Australian accents, and trained that model to speak Irish English.
The team contended with various linguistic challenges of Irish English. The Irish tend to drop the “h” in “th,” for example, pronouncing the letters as a hard “t” or a “d,” making “bath” sound like “bat,” or even “bad.” Irish English is also rhotic, meaning the “r” is pronounced wherever it is written, including after a vowel. That means the “r” in “party” will be more distinct than what you might hear out of a Londoner’s mouth. Alexa had to learn these speech features and master them.
Irish English, said Mr. Cotescu, who is Romanian and was the lead researcher on the Irish Alexa team, “is a hard one.”
The speech models that power Alexa’s verbal skills have been growing more advanced in recent years. In 2020, Amazon researchers taught Alexa to speak fluent Spanish from an English-speaking model.
Mr. Cotescu and the team saw accents as the next frontier of Alexa’s speech capabilities. They designed Irish Alexa to rely more on A.I. than on actors to build up its speech model. As a result, Irish Alexa was trained on a relatively small corpus — about 24 hours of recordings by voice actors who recited 2,000 utterances in Irish-accented English.
At the outset, when Amazon’s researchers fed the Irish recordings to the still-learning Irish Alexa, some weird things happened.
Letters and syllables occasionally dropped out of the response. “S’s” sometimes stuck together. A word or two, sometimes crucial ones, were inexplicably mumbled and incomprehensible. In at least one case, Alexa’s female voice dropped a few octaves, sounding more masculine. Worse, the masculine voice sounded distinctly British, the kind of goof that might raise eyebrows in some Irish homes.
“They are big black boxes,” Mr. Tinchev, a Bulgarian national who is Amazon’s lead scientist on the project, said of the speech models. “You have to have a lot of experimentation to tune them.”
That’s what the technologists did to correct Alexa’s “party” gaffe. They disentangled the speech word by word and phoneme by phoneme (a phoneme is the smallest unit of sound that distinguishes one word from another) to pinpoint where Alexa was slipping and fine-tune the model. Then they fed Irish Alexa’s speech model more recorded voice data to correct the mispronunciation.
The result: the “r” in “party” returned. But then the “p” disappeared.
So the data scientists went through the same process again. They eventually zeroed in on the phoneme that contained the missing “p.” Then they fine-tuned the model further so the “p” sound returned and the “r” didn’t disappear. Alexa was finally learning to speak like a Dubliner.
Two Irish linguists have since given Irish Alexa’s accent high marks: Elaine Vaughan, who teaches at the University of Limerick, and Kate Tallon, a Ph.D. student who works in the Phonetics and Speech Laboratory at Trinity College Dublin. The way Irish Alexa emphasized “r’s” and softened “t’s” stuck out, they said, and Amazon got the accent as a whole right.
“It sounds authentic to me,” Ms. Tallon said.
Amazon’s researchers said they were gratified by the largely positive feedback. That their speech models disentangled the Irish accent so quickly gave them hope they could replicate accents elsewhere.
“We also plan to extend our methodology to accents of language other than English,” they wrote in a January research paper about the Irish Alexa project.