Speaker Recognition API (SR) is part of the Microsoft Cognitive Services family. It is a cloud-based service that provides a REST API for recognizing the person speaking in a given audio recording. SR supports two scenarios: verification and identification. In verification, SR checks whether the provided audio contains a valid pass-phrase spoken by the expected person. In identification, given audio of a conversation, SR identifies the person who is speaking. Implementing either scenario takes three steps: creating speaker profiles, enrollment, and verification or identification. Let’s briefly describe these steps for both scenarios.
1) First, create a verification profile for the speaker, so that a unique GUID (verificationProfileId) is assigned to him/her. When creating a user’s profile, you also choose his/her locale. Currently, only the en-US locale is supported for verification.
2) At the enrollment stage, the speaker’s pass-phrase and voice signature are processed. Verification is text-dependent, which means the user must use the same pass-phrase during both enrollment and verification. It must be one of the phrases the service makes available for verification. The service requires 3 enrollments before the verification profile can be used. The audio file with the pass-phrase should be at least 1 sec and no more than 15 sec long and should satisfy the following format requirements:
|Sample Format|16 bit|
3) To verify a user, send his/her audio file with the pass-phrase together with the verificationProfileId. The service responds with the verification result (accepted/rejected), the confidence level of the verification, and a string containing the recognized pass-phrase.
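The length and sample-format limits from the enrollment step can be checked locally before an upload is attempted. The following is a minimal sketch that uses only Python's standard `wave` module; the names `check_enrollment_audio` and `make_silence` are illustrative and are not part of the SR API, and only the limits stated in the text (1 to 15 sec, 16 bit) are checked.

```python
import io
import wave

# Duration limits for a verification pass-phrase, as stated in the text (seconds).
VERIFICATION_LIMITS = (1.0, 15.0)


def check_enrollment_audio(wav_file, limits=VERIFICATION_LIMITS):
    """Return a list of problems; an empty list means the file passes these checks."""
    problems = []
    with wave.open(wav_file, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append(
                f"sample format is {8 * wav.getsampwidth()} bit, expected 16 bit"
            )
    low, high = limits
    if not low <= duration <= high:
        problems.append(f"duration {duration:.1f} s is outside {low:g}-{high:g} s")
    return problems


def make_silence(seconds, rate=16000, sample_width=2):
    """Build an in-memory mono WAV of silence, used only to exercise the check.

    The sampling rate is an arbitrary value chosen for this demo.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * int(seconds * rate) * sample_width)
    buffer.seek(0)
    return buffer


print(check_enrollment_audio(make_silence(2)))    # a 2-second, 16-bit clip
print(check_enrollment_audio(make_silence(0.5)))  # too short for a pass-phrase
```

The `limits` argument lets the same check be reused with other duration bounds, for example those of a different scenario.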
1) Creating an identification profile works exactly as in the verification case. However, besides en-US, the Chinese locale (zh-CN) is also available.
2) Unlike verification, identification is a text-independent process. To enroll a user, it is recommended to provide 30 sec of the user’s speech (with silence removed) to the service. However, you can force the service to ignore the 30 sec threshold by including the shortAudio=True parameter in your request. The audio file should be at least 5 sec and no more than 5 min long (in practice, 10 sec is usually enough). It should also meet the same format requirements as in the verification case.
3) Identification works as follows: you send a list of identificationProfileId values and an audio file with a voice recording. The service then compares the voices of the listed profiles with the audio one by one and returns the identificationProfileId of the first user whose voice is recognized in the audio. It is worth mentioning that identification is a two-stage process: first, you send a request with the audio and the list of GUIDs to the service; the service then points you to an endpoint where you can retrieve the results of your request (the service retains this operation for 24 hours only, after which it is deleted).
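The two-stage flow described in step 3 can be sketched as a submit-then-poll loop. This is a sketch under stated assumptions rather than the actual client code: `submit` and `fetch` are hypothetical callables standing in for the real HTTP requests, and the field names and status values (`status`, `succeeded`, `failed`, `processingResult`, `identifiedProfileId`) are assumptions that should be checked against the API reference.

```python
import time


def identify_speaker(submit, fetch, audio, profile_ids, poll_seconds=1.0, max_polls=30):
    """Run both stages: submit the audio, then poll the returned operation URL.

    submit(audio, profile_ids) -> operation URL   (stage one)
    fetch(operation_url)       -> dict with state (stage two)
    Both callables are hypothetical stand-ins for the real HTTP calls.
    """
    operation_url = submit(audio, profile_ids)
    for _ in range(max_polls):
        operation = fetch(operation_url)
        # Field names and status values below are assumed, not confirmed.
        if operation["status"] == "succeeded":
            return operation["processingResult"]
        if operation["status"] == "failed":
            raise RuntimeError(operation.get("message", "identification failed"))
        time.sleep(poll_seconds)
    raise TimeoutError("operation did not finish within the polling budget")


# Stubs that imitate a service which needs one extra poll before finishing.
def fake_submit(audio, profile_ids):
    return "https://example.invalid/operations/123"


responses = iter([
    {"status": "running"},
    {"status": "succeeded", "processingResult": {"identifiedProfileId": "profile-b"}},
])

result = identify_speaker(
    fake_submit,
    lambda url: next(responses),
    b"",
    ["profile-a", "profile-b"],
    poll_seconds=0,
)
print(result)
```

Passing the HTTP calls in as functions keeps the polling logic testable without a subscription key. In a real client, `submit` would post the audio and the GUID list, and `fetch` would issue a GET against the returned endpoint.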
To explore the full API, see the Speaker Recognition API reference.
While verification is quite straightforward, the identification scenario opens up a few interesting use cases. The simplest case is sending a speech recording of one person, so that the service identifies who exactly that person is from your list of candidates. But what happens if you send a recording of a conversation between several people? In this case, the service returns just a single identificationProfileId: that of the user who was first to be recognized as a speaker. This is rarely enough, since such a recording holds more information. With a recording of a conversation, you probably want to know all the speakers who took part in it (provided they have already been enrolled), or even the sequence of speakers with the corresponding time periods. This problem can also be solved with SR. For instance, you can split the initial audio file into small parts and identify the speaker for each part, which gives you the chronology of speakers in the recording. This approach has its own drawback, though: you pay per request, and the number of requests grows quickly when a long recording is cut into many small parts.
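To make the splitting idea and its cost concrete, here is a small sketch that merges per-part identification results into a speaker timeline and counts the requests involved. The function names and the speaker labels are hypothetical. The 5-second part length is chosen because that is the minimum audio length quoted above for identification.

```python
import math


def speaker_chronology(chunk_speakers, chunk_seconds):
    """Merge per-chunk identification results into (speaker, start, end) segments."""
    segments = []
    for index, speaker in enumerate(chunk_speakers):
        start = index * chunk_seconds
        end = start + chunk_seconds
        if segments and segments[-1][0] == speaker:
            # Same speaker as the previous chunk: extend the open segment.
            segments[-1] = (speaker, segments[-1][1], end)
        else:
            segments.append((speaker, start, end))
    return segments


def requests_needed(total_seconds, chunk_seconds):
    """One identification request per chunk, so the count grows as chunks shrink."""
    return math.ceil(total_seconds / chunk_seconds)


# Hypothetical per-chunk results for a 30-second recording cut into 5-second chunks.
labels = ["alice", "alice", "bob", "bob", "bob", "alice"]
timeline = speaker_chronology(labels, 5)
print(timeline)                  # [('alice', 0, 10), ('bob', 10, 25), ('alice', 25, 30)]
print(requests_needed(3600, 5))  # 720 requests for a one-hour recording
```

Shorter parts give finer time resolution but more requests, so the part length is the main knob for trading accuracy of the chronology against cost.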
Microsoft Client Library
Microsoft has already implemented a client library that simplifies working with the Speaker Recognition API. There are Windows (C#) and Android (Java) versions. The library source code, along with some samples, is published on GitHub. The library contains code that abstracts the service calls and processes the responses. The screenshots below give some idea of how SR can work in your app (we ran the Microsoft sample).
Here, a new user is enrolled for verification. The list of available verification phrases is retrieved from the server. A new verification profile is created automatically when the user opens “Scenario 1: Make a new enrollment”. The user clicks record, says the pass-phrase, and clicks stop, after which the client sends an enrollment request to the server. If everything is OK, this last step is repeated three times.
After the user has been successfully enrolled, he/she can test the verification scenario by choosing “Scenario 2: Verify a Speaker”. Once the pass-phrase is recorded, the client sends a verification request to the service and displays the result.
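The order of calls the sample walks through (create a profile, enroll three times with the same phrase, then verify) can be mimicked locally. The class below is a toy in-memory stand-in written for illustration; it is not the Microsoft client library and its method names are hypothetical. It compares phrase strings only, whereas the real service also compares voices.

```python
REQUIRED_ENROLLMENTS = 3  # stated in the text; everything else here is a stand-in


class InMemoryVerificationService:
    """Toy stand-in for the remote service, so the call order can be run locally."""

    def __init__(self, phrases):
        self.phrases = phrases  # phrases the "service" accepts
        self.profiles = {}

    def create_profile(self, locale="en-US"):
        profile_id = f"profile-{len(self.profiles) + 1}"
        self.profiles[profile_id] = {"locale": locale, "phrase": None, "enrollments": 0}
        return profile_id

    def enroll(self, profile_id, spoken_phrase):
        """Record one enrollment and return how many are still required."""
        profile = self.profiles[profile_id]
        if spoken_phrase not in self.phrases:
            raise ValueError("phrase is not in the list of supported phrases")
        if profile["phrase"] not in (None, spoken_phrase):
            raise ValueError("all enrollments must use the same phrase")
        profile["phrase"] = spoken_phrase
        profile["enrollments"] += 1
        return max(REQUIRED_ENROLLMENTS - profile["enrollments"], 0)

    def verify(self, profile_id, spoken_phrase):
        profile = self.profiles[profile_id]
        if profile["enrollments"] < REQUIRED_ENROLLMENTS:
            raise RuntimeError("profile is not fully enrolled yet")
        # Only the text is compared here; the real service also checks the voice.
        return "accepted" if spoken_phrase == profile["phrase"] else "rejected"


service = InMemoryVerificationService(["example phrase one", "example phrase two"])
profile_id = service.create_profile()
remaining = [service.enroll(profile_id, "example phrase one") for _ in range(3)]
outcome = service.verify(profile_id, "example phrase one")
print(remaining, outcome)
```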
To conclude, SR provides an easy way to perform voice recognition: all the algorithmic work is done on the SR server side, and you only have to implement the high-level client code that calls the SR API. However, the service can be used only online (Microsoft offers no analogous offline library). Also, the service has no built-in way to handle conversations with multiple speakers, so clients have to do this work themselves using workarounds such as the splitting described above. Finally, at the moment SR supports only a few locales.