Azure Cognitive Services #3: Translator Speech API

Cloud-based machine translation service.

Oleksandr Krakovetskyi 02.15.2018

Microsoft Translator Speech API, part of the Microsoft Cognitive Services API collection, is a cloud-based machine translation service. The API enables businesses to add end-to-end, real-time speech translation to their applications or services. The technology was launched in late 2014 with Skype Translator and has been available as an open API for customers since early 2016. It is integrated into the Microsoft Translator live feature, Skype, Skype Meeting Broadcast, and the Microsoft Translator apps for Android, iOS, and Windows. Because it is based on industry-standard REST technology, it can be used to build applications, tools, or any solution requiring multi-language speech translation, regardless of the target OS or development language.

How does speech translation work?

Although, at first glance, it may seem straightforward to build speech translation technology from existing technology "bricks", it requires much more work than simply plugging a "traditional" human-to-machine speech recognition engine into an existing text translation engine. Let's look at the process in detail.

To properly translate the "source" speech from one language into a different "target" language, the system goes through a four-step process, implemented with four separate technologies:

  1. Speech recognition, by automatic speech recognition (ASR) technology. In this step, the system converts audio into text.
  2. TrueText: a Microsoft technology that normalizes the text to make it more appropriate for translation.
  3. Translation, through the text translation engine described below, which is based on translation models developed specifically for real-life spoken conversations.
  4. Text-to-speech, when necessary, to produce the translated audio.
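
The four steps above can be sketched as a simple pipeline. The stubs below are purely illustrative stand-ins for the cloud-hosted engines; the real service runs all four stages behind a single streaming API:

```python
# Illustrative sketch of the four-step speech translation pipeline.
# Every function here is a hypothetical stub, not the service's code.

def recognize_speech(audio: bytes) -> str:
    """Step 1 (ASR): convert an audio stream into raw text."""
    # A real engine would run deep-neural-network recognition here.
    return "uh hello how are you"

def true_text(raw_text: str) -> str:
    """Step 2 (TrueText): drop disfluencies, fix punctuation and casing."""
    words = [w for w in raw_text.split() if w not in {"uh", "um", "hmm"}]
    return " ".join(words).capitalize() + "?"

def translate(text: str, to_lang: str) -> str:
    """Step 3: translate the normalized text."""
    demo = {("Hello how are you?", "it"): "Ciao, come stai?"}  # stub lookup
    return demo.get((text, to_lang), text)

def synthesize(text: str) -> bytes:
    """Step 4 (optional): produce translated audio (text-to-speech)."""
    return text.encode("utf-8")  # stand-in for WAV/MP3 bytes

# End-to-end run of the pipeline on fake audio:
normalized = true_text(recognize_speech(b"\x00\x01"))
translated = translate(normalized, "it")
audio_out = synthesize(translated)
print(normalized)   # Hello how are you?
print(translated)   # Ciao, come stai?
```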

Text results are produced by applying automatic speech recognition (ASR) powered by deep neural networks to the incoming audio stream. TrueText removes disfluencies (the hmms and coughs) and restores proper punctuation and capitalization. The ability to mask or exclude profanities is also included. The recognition and translation engines are specifically trained to handle conversational speech. The Speech Translation service uses silence detection to determine the end of an utterance: after a pause in voice activity, the service streams back a final result for the completed utterance. The service can also send back partial results, which give intermediate recognitions and translations for an utterance in progress. For final results, the service can synthesize speech (text-to-speech) from the translated text in the target language.
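
On the client side, the partial/final distinction can be handled roughly as follows. The message shape assumed here (a `type` field of `"partial"` or `"final"` alongside `recognition` and `translation` fields) reflects the v1 response format as I recall it; verify the exact field names against the current API reference:

```python
import json

def handle_message(raw: str) -> str:
    """Dispatch one text result streamed back by the service.

    Assumed JSON shape (check against the API docs):
    {"type": "partial"|"final", "recognition": "...", "translation": "..."}
    """
    msg = json.loads(raw)
    if msg["type"] == "partial":
        # Intermediate hypothesis; may still change as more audio arrives.
        return f'[partial] {msg["translation"]}'
    # Silence detection has closed the utterance: this result is stable,
    # and (if requested) a text-to-speech audio message will follow.
    return f'[final] {msg["translation"]}'

print(handle_message('{"type": "partial", "recognition": "ciao", "translation": "hello"}'))
print(handle_message('{"type": "final", "recognition": "ciao come stai", "translation": "hello how are you"}'))
```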

Text-to-speech audio is created in the format specified by the client. WAV and MP3 formats are available.

This four-step design, however, complicates language support: for text to pass correctly through the whole pipeline, every one of the four technologies must support the languages involved.

For example, Text Translation supports Ukrainian, but none of the other three technologies in the pipeline does. The Microsoft team is, however, actively working to increase the number of supported languages.
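
Because every step must support a language for the end-to-end path to work, a client can derive the usable languages as an intersection of the per-step capability lists. The sets below are illustrative only (note how "uk" appears in translation but drops out of the speech path), not the service's real coverage:

```python
# Hypothetical per-step capability sets; the real lists come from the
# service and grow as Microsoft adds languages.
ASR_LANGS = {"en", "it", "de", "fr"}                 # step 1: speech recognition
TRUETEXT_LANGS = {"en", "it", "de", "fr"}            # step 2: text normalization
TRANSLATION_LANGS = {"en", "it", "de", "fr", "uk"}   # step 3: text translation
TTS_LANGS = {"en", "it", "de"}                       # step 4: text-to-speech

def speech_to_text_targets():
    """Languages usable for spoken input (steps 1-3 must all agree)."""
    return ASR_LANGS & TRUETEXT_LANGS & TRANSLATION_LANGS

def speech_to_speech_targets():
    """Target languages for which translated *audio* can be produced."""
    return TRANSLATION_LANGS & TTS_LANGS

print(sorted(speech_to_text_targets()))    # ['de', 'en', 'fr', 'it']
print(sorted(speech_to_speech_targets()))  # ['de', 'en', 'it']
```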

Technological features

The Speech Translation API uses the WebSocket protocol to provide a full-duplex communication channel between the client and the server; consequently, Microsoft Translator Speech cannot be run locally.
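
A client opens the WebSocket connection against the service endpoint with the language pair and options encoded as query parameters. The host name and parameter names below follow the v1 API reference as I recall it; treat them as assumptions to check against the current documentation:

```python
from urllib.parse import urlencode

# Assumed v1 endpoint and parameter names; verify against the docs.
HOST = "wss://dev.microsofttranslator.com/speech/translate"

def build_speech_url(from_lang, to_lang, text_to_speech=False, voice=None):
    """Build the WebSocket URL for a streaming translation session."""
    params = {"api-version": "1.0", "from": from_lang, "to": to_lang}
    if text_to_speech:
        params["features"] = "texttospeech"  # ask for translated audio back
        if voice:
            params["voice"] = voice          # e.g. a named TTS voice
    return f"{HOST}?{urlencode(params)}"

url = build_speech_url("en-US", "it-IT", text_to_speech=True, voice="it-IT-Elsa")
print(url)
```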

As with the other Microsoft Cognitive Services APIs, sample code for Translator Speech is available on GitHub: MicrosoftTranslator/SpeechTranslator. This demo code makes calls to the Microsoft Translator Speech Translation API, which are subject to the Microsoft Privacy Statement. An example request is:

Ocp-Apim-Subscription-Key: {subscription key} 
X-ClientTraceId: {GUID}

The request specifies that spoken English will be streamed to the service and translated into Italian. Each final recognition result will generate a text-to-speech audio response with the female voice named Elsa. Notice that the request includes credentials in the Ocp-Apim-Subscription-Key header. The request also follows a best practice by setting a globally unique identifier in the X-ClientTraceId header. A client application should log the trace ID so that it can be used to troubleshoot issues when they occur. For additional information, see the links below:
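
Assembling the credential and trace headers from the snippet above can look like this; the subscription key shown is a placeholder:

```python
import uuid

def build_headers(subscription_key: str) -> dict:
    """Request headers: credentials plus a unique, loggable trace ID."""
    trace_id = str(uuid.uuid4())  # log this value to troubleshoot later
    return {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "X-ClientTraceId": trace_id,
    }

headers = build_headers("YOUR_SUBSCRIPTION_KEY")  # placeholder key
print(headers["X-ClientTraceId"])
```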

  1. Documentation
  2. Demo in action on C#
  3. Microsoft Translator Speech Application
