Azure Cognitive Services #2: Custom Speech Service

A cloud-based speech recognition service that lets you train models for your own purposes.

Oleksandr Krakovetskyi 02.13.2018

Custom Speech Service (CRIS) is a cloud-based speech recognition service that belongs to the Microsoft Cognitive Services family. Unlike the Bing Speech API, which is designed for general purposes and can fail in certain situations, for instance with specific terminology or with speech recorded in a particular environment, CRIS allows you to train a model for your own purposes. To do that, you simply provide appropriate training data for the acoustic and language models. What are those models?

The acoustic model is a classifier that transforms short audio fragments into phonemes (sound units) in a given language. It should be trained on a set of short audio fragments together with a corresponding transcription for each fragment.

The language model is a probability distribution over sequences of words; that is, it decides which sequence of words is more likely in a sentence when several possibilities sound similar. To train the language model, you provide plain text containing a set of phrases.

The magic of CRIS lies in training those models. You should prepare training data that matches your use case as closely as possible. For example, if you expect your app to be used mostly on the road, provide the acoustic model with training data that was also recorded in a moving car. If you want your app to recognize speech in a specific vocabulary domain, e.g. anatomy or physics, provide training data that contains phrases common in that topic. If you expect users with a particular dialect, prepare training data that reflects the features of that dialect.

Creating your Custom Speech Service

First of all, to get started you need to get a subscription from Azure and sign in to CRIS. The next step is to upload your training data for the acoustic and language models (we used a biology dataset available here). To do this, select Adaptation Data from the drop-down menu and click Import New.

For language data, upload a plain text file in which each line contains one phrase. The file should meet the service's language data requirements.
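As an illustration, a language data file for a biology domain might look like the following (these phrases are hypothetical examples, not lines from the actual dataset); it is plain text, one phrase per line:

```text
the mitochondrion is the powerhouse of the cell
photosynthesis converts light energy into chemical energy
ribosomes translate messenger RNA into proteins
```

The more representative these phrases are of what your users will actually say, the more the customized language model helps.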

Importing acoustic data is virtually the same. Note that you should pack your audio files into a zip archive. There are a few important things to know about training audio. It is recommended to start and end each clip with roughly 100 ms of silence; if you want your model to learn background noise, a few seconds of noise is better. For detailed requirements on acoustic data, see Acoustic Data Requirements. After you have successfully imported your training data, you can proceed to create the acoustic and language models.
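For reference, the zipped audio files are accompanied by a transcription file in which each line pairs an audio file name with its transcript, separated by a tab (the file names and transcripts below are hypothetical):

```text
speech01.wav	the cell membrane regulates transport
speech02.wav	enzymes lower the activation energy of reactions
```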

When creating acoustic and language models, you specify your previously uploaded training datasets and select a Base Language Model / Base Acoustic Model. The base model is the starting point for your customization. There are two types of base models: the Microsoft Search and Dictation Model and the Microsoft Conversational Model. The former is appropriate when the app handles commands, search queries, and the like, while the latter is more suitable for conversational speech. You can also enable offline testing by selecting the Accuracy Testing property; if you do, you must also provide a testing dataset, and an evaluation will run against your custom model and report its results. After you create the models, their status changes to Succeeded once training completes.

Now you can proceed to deploy your models to a custom endpoint. Select Deployments from the drop-down menu and click Create New.

Choose your previously created acoustic and language models. Note that you cannot mix conversational acoustic models with dictation language models, or vice versa.

Calling your custom recognition service

Custom Speech Service exposes the same API as the Bing Speech Service, so calling CRIS works the same way, just with a different base URL. You can do this via HTTP requests or via the client Speech SDK (based on the WebSocket protocol). Client libraries are available for Windows, Android, and iOS in several languages: C#, Java, JavaScript, and Objective-C. To use Custom Speech Service instead of Bing Speech Service in an app already built with the SDK, roughly speaking you add one more string parameter, url (the endpoint for your custom models), when creating a DataRecognitionClient. You must also replace the Bing Speech Service key with your Custom Speech Service key in the configuration.

public static DataRecognitionClient CreateDataClient(
    SpeechRecognitionMode speechRecognitionMode, string language,
    string primaryOrSecondaryKey, string url);

To test our biology Custom Speech Service, we wrote a simple C# console app using the SDK. In the Main method, we initialize a DataRecognitionClient and call the Recognize method.

private static DataRecognitionClient speechClient;

static void Main(string[] args)
{
    var dictationMode = SpeechRecognitionMode.LongDictation;
    var language = "en-US";
    var authenticationUri = "";
    var crisSubscriptionKey = ConfigurationManager.AppSettings["CrisKey"];
    var crisUrl = ConfigurationManager.AppSettings["CrisHostName"];

    speechClient = SpeechRecognitionServiceFactory.CreateDataClient(
        dictationMode, language, crisSubscriptionKey, crisUrl);
    speechClient.AuthenticationUri = authenticationUri;

    speechClient.OnResponseReceived += OnDataDictationResponseReceivedHandler;
    speechClient.OnConversationError += OnConversationErrorHandler;
    speechClient.OnIntent += OnIntentHandler;

    Recognize(args[0]); // path to the WAV file to recognize
    Console.ReadLine();
}
In the Recognize method, we send the audio file in small chunks:

private static void Recognize(string fileName)
{
    using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        var buffer = new byte[1024];
        int bytes;
        while ((bytes = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            speechClient.SendAudio(buffer, bytes);
        }
        speechClient.EndAudio(); // signal that the audio stream is complete
    }
}

In the response event handlers, we write the obtained results to the console.

private static void OnIntentHandler(object sender, SpeechIntentEventArgs e)
{
    Console.WriteLine($"OnIntentHandler: {e.Payload}");
}

private static void OnConversationErrorHandler(object sender, SpeechErrorEventArgs e)
{
    Console.WriteLine($"Exception: {e.SpeechErrorText}");
}

private static void OnDataDictationResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
    if (!e.PhraseResponse.Results.Any()) return;
    foreach (var phrase in e.PhraseResponse.Results)
    {
        Console.WriteLine($"Confidence: {phrase.Confidence}");
        Console.WriteLine($"Display Text: {phrase.DisplayText}");
    }
}
After running, the app prints the recognized phrases, each with its confidence score and display text, to the console.
Custom Speech Service provides a convenient way to create a speech recognition model tailored to your needs. Because it exposes the same API as the Bing Speech Service, you can easily swap the Bing endpoint for your custom one in an existing app, and also integrate it with the LUIS service (by creating a DataRecognitionClient with intent). However, only a few locales are supported at the moment, and the service requires an online connection.
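As a rough sketch of the LUIS integration mentioned above, the factory exposes a CreateDataClientWithIntent variant that ties the recognizer to a LUIS application; the app id and key below are placeholders for your own LUIS values, and the exact overloads may differ by SDK version:

```csharp
// Placeholders: substitute the id and key of your own LUIS application.
string luisAppId = "<your-luis-app-id>";
string luisSubscriptionId = "<your-luis-subscription-key>";

// Create a data client that also performs intent detection via LUIS.
var intentClient = SpeechRecognitionServiceFactory.CreateDataClientWithIntent(
    "en-US",
    ConfigurationManager.AppSettings["CrisKey"],
    luisAppId,
    luisSubscriptionId);

// The OnIntent event then delivers the LUIS result as a JSON payload.
intentClient.OnIntent += (sender, e) => Console.WriteLine(e.Payload);
```

With this wiring, the same audio-chunk loop shown earlier feeds both speech recognition and intent detection.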
