Azure Cognitive Services #4: Speech API

A tool that lets developers build speech-enabled features into their applications.

Oleksandr Krakovetskyi 02.15.2018

Speech Service is a tool that lets developers build speech-enabled features into their applications. The Microsoft Speech API supports both speech-to-text and text-to-speech conversion.

  1. Speech to text. This API converts human speech to text that can be used as input or as commands to control your application. It gives developers two ways to add speech to their apps:

REST APIs: Developers can make HTTP calls from their apps to the service for speech recognition. The REST API converts short spoken audio (no longer than 15 seconds), such as commands, and does not return interim results. A REST call sends a request to the Speech HTTP endpoint with the proper request headers and body. Here is a simple C# example that sets the request headers and streams the audio:

// Requires: using System.IO; using System.Net;
// requestUri and YOUR_AUDIO_FILE are placeholders defined elsewhere.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
request.SendChunked = true;
request.Accept = @"application/json;text/xml";
request.Method = "POST";
request.ProtocolVersion = HttpVersion.Version11;
request.ContentType = @"audio/wav; codec=audio/pcm; samplerate=16000";
request.Headers["Ocp-Apim-Subscription-Key"] = "YOUR_SUBSCRIPTION_KEY";

// Send the audio file in 1024-byte chunks.
using (FileStream fs = new FileStream(YOUR_AUDIO_FILE, FileMode.Open, FileAccess.Read))
using (Stream requestStream = request.GetRequestStream())
{
    byte[] buffer = new byte[1024];
    int bytesRead;

    // Read up to 1024 raw bytes from the audio file and write them to the request stream.
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
    {
        requestStream.Write(buffer, 0, bytesRead);
    }

    // Flush
    requestStream.Flush();
}
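The request above uses a requestUri value that the snippet does not define. As a rough sketch only, the helper below shows how such a URI might be assembled from a recognition mode (interactive, conversation, or dictation) and a language tag. The host name, path layout, and the format query parameter are assumptions based on the Bing Speech REST endpoint as it was documented at the time, and the SpeechRequestUri class is a hypothetical name, so check the current service documentation before relying on it.

```csharp
using System;

// Hypothetical helper: builds the URI passed to WebRequest.Create in the REST example.
public static class SpeechRequestUri
{
    // Assumed host; the real endpoint may differ by service version or region.
    private const string Host = "https://speech.platform.bing.com";

    public static string Build(string mode, string language, string format = "simple")
    {
        // Only the three scenarios named in the article are accepted.
        if (mode != "interactive" && mode != "conversation" && mode != "dictation")
        {
            throw new ArgumentException("Unknown recognition mode: " + mode, nameof(mode));
        }

        if (string.IsNullOrWhiteSpace(language))
        {
            throw new ArgumentException("A language tag such as en-US is required.", nameof(language));
        }

        // Assumed path layout: /speech/recognition/<mode>/cognitiveservices/v1
        return Host + "/speech/recognition/" + mode + "/cognitiveservices/v1"
            + "?language=" + Uri.EscapeDataString(language)
            + "&format=" + Uri.EscapeDataString(format);
    }
}
```

For example, SpeechRequestUri.Build("interactive", "en-US") would produce a URI for short command-style requests in US English under these assumptions.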

Client libraries: These are intended for more advanced features. Developers can download the Microsoft Speech client libraries and link them into their apps. The libraries are available for several platforms (Windows, Android, iOS) and languages. Unlike the REST APIs, the client libraries use a WebSocket-based protocol. They can convert long audio (up to 10 minutes) to text, return interim results while the audio is still being processed, and interpret the recognized text with LUIS (Language Understanding Intelligent Service).

Both options support real-time continuous recognition and return results optimized for three scenarios: interactive (the user makes short requests and expects the application to perform an action in response), conversation (users talk to each other), and dictation (the user narrates long sentences). Since the client libraries give developers more options, let's look at them in more detail. Currently, the following Speech client libraries are available:

  1. C# desktop library
  2. C# service library
  3. JavaScript library
  4. Java library for Android
  5. Objective-C library for iOS

In my opinion, the C# libraries are the most interesting for developers and offer the most capabilities. The desktop library can be run locally, so you don't even need internet access. Unfortunately, it is the only library that can be used locally. In addition, if you need a client library for a platform that is not yet supported, you can create your own SDK: implement the Speech WebSocket protocol on that platform in the language of your choice.
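The two duration limits mentioned earlier (15 seconds for the REST API, up to 10 minutes for the client libraries) can be turned into a simple pre-check before you pick an approach. The sketch below estimates the duration from the number of raw PCM bytes. It assumes 16 kHz, 16-bit, mono audio; only the 16 kHz sample rate comes from the REST example, while the sample size and channel count are assumptions. The type and member names are hypothetical.

```csharp
using System;

public enum SpeechTransport { RestApi, ClientLibrary, TooLong }

// Hypothetical helper: picks an approach from the estimated audio duration.
public static class SpeechTransportChooser
{
    // Assumed audio format: 16 kHz, 16-bit (2 bytes per sample), mono.
    private const int SampleRate = 16000;
    private const int BytesPerSample = 2;
    private const int Channels = 1;

    // Limits quoted in the article.
    public static readonly TimeSpan RestLimit = TimeSpan.FromSeconds(15);
    public static readonly TimeSpan ClientLibraryLimit = TimeSpan.FromMinutes(10);

    // Estimates playback time from the raw PCM payload size (WAV header excluded).
    public static TimeSpan EstimateDuration(long pcmByteCount)
    {
        if (pcmByteCount < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pcmByteCount));
        }

        double seconds = (double)pcmByteCount / (SampleRate * BytesPerSample * Channels);
        return TimeSpan.FromSeconds(seconds);
    }

    public static SpeechTransport Choose(long pcmByteCount)
    {
        TimeSpan duration = EstimateDuration(pcmByteCount);
        if (duration <= RestLimit) return SpeechTransport.RestApi;
        if (duration <= ClientLibraryLimit) return SpeechTransport.ClientLibrary;
        return SpeechTransport.TooLong;
    }
}
```

Under these assumptions one second of audio is 32,000 bytes, so a 30-second recording (960,000 bytes) would be routed to a client library rather than the REST API.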


  2. Text to speech. This API uses REST to convert structured text to an audio stream. It provides fast text-to-speech conversion in various voices and languages. Users can also change audio characteristics such as pronunciation, volume, and pitch by using SSML tags. An application that uses the Text to speech API sends HTTP requests to a cloud server, where the text is synthesized into human-sounding speech and returned as an audio file. Like Speech to text, the Text to speech API supports English, Arabic, Russian, and other languages; each API has its own list of supported languages. Unfortunately, neither of them supports Ukrainian.
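To make the SSML part more concrete, here is a minimal sketch that builds an SSML document with a prosody element controlling pitch and volume. It uses only System.Xml.Linq and standard SSML elements (speak, voice, prosody). The SsmlBuilder class is a hypothetical helper, and the voice name in the usage note is a placeholder, since valid voice names depend on the service's voice list.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper: produces the SSML body for a text-to-speech request.
public static class SsmlBuilder
{
    public static string Build(string text, string language, string voiceName,
                               string pitch, string volume)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var speak = new XElement("speak",
            new XAttribute("version", "1.0"),
            // XNamespace.Xml + "lang" is serialized as the xml:lang attribute.
            new XAttribute(XNamespace.Xml + "lang", language),
            new XElement("voice",
                new XAttribute("name", voiceName),
                new XElement("prosody",
                    new XAttribute("pitch", pitch),
                    new XAttribute("volume", volume),
                    text))); // XElement escapes special characters in the text

        // Single-line output without indentation whitespace.
        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

A call such as SsmlBuilder.Build("Hello", "en-US", "YOUR_VOICE_NAME", "high", "loud") yields a string that could be sent as the body of the HTTP request, assuming the service accepts standard SSML in that position.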
