Creating an application that uses Speech Recognition.

Joel Garcia Jr
4 min read · Aug 26, 2018

Speech recognition, from a computational point of view, is software or a hardware device that can decode the human voice. It is usually used to run commands, operate devices, or write text without using a mouse, keyboard, or physical buttons.

Today almost every task can be automated with voice commands and speech recognition, from turning on the lights to scheduling an appointment at the beauty salon. These tasks can be performed by personal assistants already on the market, such as Apple's Siri, Microsoft's Cortana, Amazon's Alexa, and Google Assistant.

I recently had the pleasure of developing a mobile application for a language school that implemented these features in three ways. The application reads the exercise statement aloud, records and recognizes the student's spoken answer to the exercise, and can optionally read aloud both the correct answer and the student's answer. The famous Duolingo app was used as a reference.

In this post I'll cover how to integrate native speech recognition and speech synthesis into the browser using the Web Speech JavaScript API.

According to the Mozilla documentation:

The Web Speech API enables you to incorporate voice data into web apps. The Web Speech API has two parts: SpeechSynthesis (Text-to-Speech), and SpeechRecognition (Asynchronous Speech Recognition).

The app was built using the Ionic Framework (AngularJS) and Cordova plugins, with the image below serving as the screen layout for one of the exercises:

Basically, it works as follows: when the student touches the exercise statement, which is usually a question, or touches the "Headphone" icon, the app reads the statement aloud.

Imagine how complex it would be to record all the questions and make them available: the amount of storage, the hours of recording, pronunciation consistency, and so on. The solution was therefore to let the device do the talking, as follows: when the student touches the question, its text is passed as an argument to the speak method of the TTS (Text-to-Speech) plugin. This method requires a 'locale' parameter, a string that identifies the language of the text to be read, and a 'rate' parameter that indicates how fast the device reads the text.

$scope.listenText = function (textToBeRead) {
  TTS.speak({
    text: textToBeRead,
    locale: 'en-US',
    rate: 1.00
  }, function () {
    console.log("Finished Ok");
  }, function (reason) {
    console.log("SPEECH ERROR", JSON.stringify(reason));
  });
};
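In a plain browser, outside Cordova, the same read-aloud step can be sketched with the native speechSynthesis part of the Web Speech API. This is a minimal sketch and not code from the original app: the synthesizer and utterance constructor are injected as arguments so the logic can be exercised outside a browser; in a real page you would call speakText(text, window.speechSynthesis, SpeechSynthesisUtterance).

```javascript
// Browser-native sketch of the read-aloud step (no Cordova plugin).
// synth: a SpeechSynthesis-like object; Utterance: a constructor
// taking the text to be read.
function speakText(textToBeRead, synth, Utterance) {
  var utterance = new Utterance(textToBeRead);
  utterance.lang = 'en-US'; // same locale as the TTS plugin call above
  utterance.rate = 1.0;     // normal reading speed
  synth.speak(utterance);   // queue the utterance for speaking
  return utterance;
}
```

Injecting the dependencies keeps the function testable; the behavior in a browser is identical to calling window.speechSynthesis.speak directly.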

When the student touches 'Record My Voice', the webkitSpeechRecognition object is created and the microphone opens so the student can speak. As soon as the student finishes speaking, the recognized text is shown on the screen, so the student can compare their pronunciation with the exercise answer.

If there is an error at this step, we launch the TTS again, making the device speak the phrase "Sorry, could you please repeat?". This creates interactivity between the application and the student, literally a conversation. When done with headphones, this process becomes even more magical! ❤

The student can also tap the recognized sentence, just as with the correct answer, and compare the two: the device reads each one aloud. This feature lets the student train their speech until the pronunciation is correct.
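The original app compares pronunciation by ear, but the recognized transcript can also be checked against the expected answer in code. The helpers below are a hypothetical sketch, not part of the original app: both strings are lowercased and stripped of punctuation so that superficial differences do not count as mistakes.

```javascript
// Hypothetical helper: normalize a sentence for comparison by
// lowercasing, stripping punctuation, and collapsing whitespace.
function normalizeAnswer(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '') // drop punctuation (including apostrophes)
    .replace(/\s+/g, ' ')        // collapse runs of whitespace
    .trim();
}

// True when the recognized transcript matches the expected answer
// after normalization.
function isAnswerCorrect(recognizedText, expectedAnswer) {
  return normalizeAnswer(recognizedText) === normalizeAnswer(expectedAnswer);
}
```

For example, isAnswerCorrect("I'm fine.", "im fine") returns true, since both sides normalize to "im fine".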

$scope.recordMyVoice = function () {
  // Fall back to the prefixed constructor on WebKit-based engines
  var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  var recognition = new SpeechRecognition();
  recognition.lang = 'en-US';
  recognition.onresult = function (event) {
    if (event.results.length > 0) {
      $scope.showAnswer = true;
      $scope.recognizedText = event.results[0][0].transcript;
      $scope.$apply();
    }
  };
  recognition.onerror = function (event) {
    TTS.speak({
      text: "Sorry, could you please repeat?",
      locale: 'en-US',
      rate: 1.00
    });
  };
  recognition.start();
};

Regarding privacy, use of the microphone must be allowed by the user. Since the app was built for both the Android and iOS platforms, the permissions had to be handled differently on each.
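In a plain web page, the standard way to trigger the browser's microphone permission prompt is navigator.mediaDevices.getUserMedia. This is a browser-side sketch, not code from the app (where the Cordova plugins show their own native prompts); the mediaDevices object is injected so the flow can be exercised outside a browser.

```javascript
// Sketch of a microphone permission request. In a real page, pass
// navigator.mediaDevices; resolves true when access was granted,
// false when it was denied or no device is available.
function requestMicrophone(mediaDevices) {
  return mediaDevices.getUserMedia({ audio: true })
    .then(function () { return true; })    // permission granted
    .catch(function () { return false; }); // permission denied / error
}
```

Resolving to a boolean instead of rethrowing keeps the caller's flow simple: the app can decide whether to show the 'Record My Voice' button based on the result.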

Unlike Android, where no additional configuration is required, on iOS you must pass a string in the MICROPHONE_USAGE_DESCRIPTION parameter telling the user why the app wants this permission.
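As an illustration only (the plugin name and the description text below are placeholders; the exact spelling depends on the plugin and version used in the project), such iOS usage descriptions are typically supplied as Cordova plugin variables at install time:

```shell
# Hypothetical install command; adjust the plugin name to the one your project uses.
cordova plugin add cordova-plugin-speechrecognition \
  --variable MICROPHONE_USAGE_DESCRIPTION="This app uses the microphone to recognize your spoken answers"
```

Cordova then writes the string into the iOS app's Info.plist, where the system reads it when showing the permission dialog.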

The app is published on Google Play and the App Store and is restricted to students.

I really recommend the Web Speech API: it is simple to use and can serve as both a competitive differentiator and an accessibility feature.

References

Cordova Plugin TTS

Web Speech API

iOS Permission for Microphone

Ionic Framework
