Building a user interface

Below is a screenshot of the UI I used. For the purposes of this tutorial, this is a really simple UI. Feel free to fork the repo and create a more attractive layout. Share it on Twitter (+1 if you use SwiftUI)!

Steps to recreate it:

1. Drag a button onto the view controller.
2. Type “Start recording.”
3. Change the text style to “Headline.”
4. Add center X and center Y constraints.

The user interface

The user can start the speech recognition functionality by tapping the button, and when they tap it again, speech recognition will stop.

Next, create an outlet titled recordButton and an action titled recordButtonTapped. Make sure sender is of type UIButton, not Any.
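The resulting connections in ViewController.swift should look roughly like this (the body of the action comes later in the tutorial):

```swift
@IBOutlet weak var recordButton: UIButton!

@IBAction func recordButtonTapped(_ sender: UIButton) {
    // We'll fill this in once the recording logic is in place.
}
```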

Now that we have access to the on-screen button, we’re ready to implement the privacy features. If the user doesn’t grant permission, we’ll show an alert suggesting they open Settings, and we’ll disable the record button. Here’s a little helper function for that:
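A sketch of such a helper — the name handlePermissionFailed and the alert wording are my choices, so adapt them as you like:

```swift
/// Tells the user to enable speech recognition in Settings
/// and disables the record button.
private func handlePermissionFailed() {
    let alert = UIAlertController(
        title: "This app needs speech recognition permission to work.",
        message: "Please consider updating your settings.",
        preferredStyle: .alert)
    alert.addAction(UIAlertAction(title: "Open settings", style: .default) { _ in
        // Deep-link into this app's page in the Settings app.
        if let url = URL(string: UIApplication.openSettingsURLString) {
            UIApplication.shared.open(url)
        }
    })
    alert.addAction(UIAlertAction(title: "Close", style: .cancel))
    present(alert, animated: true)

    recordButton.isEnabled = false
    recordButton.setTitle("Speech recognition not available", for: .normal)
}
```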

We can replace the switch statement with the following:
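A minimal version — assuming the permission helper described above is called handlePermissionFailed():

```swift
SFSpeechRecognizer.requestAuthorization { authStatus in
    // The callback may not run on the main thread, so hop back before touching UI.
    DispatchQueue.main.async {
        guard authStatus == .authorized else {
            self.handlePermissionFailed()
            return
        }
        self.recordButton.isEnabled = true
    }
}
```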

If we get authorization from the user, we can continue; otherwise, we ask them to open settings.

The last thing left to do is check for permission every time the user enters this screen:
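One way to do this, assuming the authorization code above is wrapped in a checkPermissions() method (a name I made up):

```swift
override func viewDidAppear(_ animated: Bool) {
    super.viewDidAppear(animated)
    checkPermissions()
}
```

viewDidAppear runs every time the screen becomes visible, so the user is re-checked even after returning from Settings.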

While we’re at it, let’s also create a small utility function to handle errors. This function will present an alert telling the user an error has occurred and will also disable the record button.
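A sketch of that utility — the name handleError(withMessage:) is my own:

```swift
/// Shows a generic error alert and disables recording.
private func handleError(withMessage message: String) {
    let alert = UIAlertController(title: "An error occurred",
                                  message: message,
                                  preferredStyle: .alert)
    alert.addAction(UIAlertAction(title: "OK", style: .default))
    present(alert, animated: true)

    recordButton.setTitle("Not available", for: .normal)
    recordButton.isEnabled = false
}
```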

Adding a little more structure

Before we can recognize speech, we need to fix a few things. I promise, it won’t take long.

First off, add a property isRecording to the top of your view controller. I set its access control to public private(set) because other components in the app are allowed to know whether we’re recording or not, but they must not change it for us.
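In code:

```swift
public private(set) var isRecording = false
```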

Second, we want to change the button’s title to say “Stop recording” while we are recording, of course. In recordButtonTapped(_ sender: UIButton):
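A sketch of the action body:

```swift
@IBAction func recordButtonTapped(_ sender: UIButton) {
    // If we're about to start recording, show "Stop recording", and vice versa.
    sender.setTitle(isRecording ? "Start recording" : "Stop recording", for: .normal)

    if isRecording {
        stopRecording()
    } else {
        startRecording()
    }
    isRecording.toggle()
}
```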

This snippet utilizes the ternary operator, which is basically an inline if statement. It works like this: <condition> ? <value if true> : <value if false>. The reason we set sender to UIButton is so we can change values on the sender directly, which is Swiftier, in my opinion, than accessing the outlet.

Xcode will probably complain that stopRecording and startRecording don’t exist yet. You can fix that by adding placeholders for these functions (both marked private).
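The placeholders:

```swift
private func startRecording() {
    // Steps 1 to 4 go here.
}

private func stopRecording() {
    // Step 5 goes here.
}
```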

Recognizing speech

First things first, import Speech at the top of ViewController.swift.

import Speech

The task of speech recognition is more complex than synthesis (is that true for humans too?) and requires some setup.

Let’s break it down into five steps:

1. Create a recognizer
2. Create a speech recognition request
3. Create a recording pipeline
4. Start recognizing speech
5. Stop recognizing speech

To recap: the user can start speech recognition by tapping the button. If we aren’t already recording, we start by calling startRecording() on self. If we are recording, we stop by calling stopRecording().

Steps 1 to 4 go in startRecording; step 5 goes in stopRecording.

1. Creating a recognizer

We start by creating an SFSpeechRecognizer. Its initializer might return nil, and the recognizer might be unavailable for other reasons, so we need to carefully validate that we can use it before moving on.
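A sketch, inside startRecording, assuming the error helper from earlier is called handleError(withMessage:):

```swift
// The failable initializer returns nil if the device/locale isn't supported,
// and isAvailable can be false even when initialization succeeds.
guard let recognizer = SFSpeechRecognizer(), recognizer.isAvailable else {
    handleError(withMessage: "Speech recognizer not available.")
    return
}
```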

2. Creating a speech recognition request

Next, we can create a request that goes along with the recognizer.

Our particular request is of type SFSpeechAudioBufferRecognitionRequest , a subclass of SFSpeechRecognitionRequest . An SFSpeechAudioBufferRecognitionRequest is made for recognizing speech in AVFoundation audio pipelines (see step 3 for more details).

The other subclass is SFSpeechURLRecognitionRequest, for recognizing speech in prerecorded audio files, if you’re interested in that.

I set shouldReportPartialResults to true to demonstrate how we get intermediate results while iOS is decoding speech from the audio. If you’re only planning on using the final result, set it to false to save compute resources.
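The request can be created like this (in a full implementation you’d likely also store it in a property, so stopRecording can call endAudio() on it later):

```swift
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true
```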

Just like Vision, we also have to tell Speech up front what we’ll be doing with the results — this cannot be changed after we add the task to the recognizer. We’ll print out the results, including partial results, and if the result is final, we’ll update the UI. It’s also possible to update the UI directly after new results come in, like Siri does.
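A sketch of adding the task, assuming the recognizer and request from the previous steps, plus the handleError and updateUI helpers described in this tutorial:

```swift
recognizer.recognitionTask(with: request) { (result, error) in
    guard error == nil else {
        self.handleError(withMessage: error!.localizedDescription)
        return
    }
    guard let result = result else { return }

    // Print partial and final transcriptions to the console.
    print(result.bestTranscription.formattedString)

    if result.isFinal {
        DispatchQueue.main.async {
            self.updateUI(withResult: result)
        }
    }
}
```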

The updateUI function is a small function that simply shows an alert telling the user what they just said:
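A sketch of it:

```swift
private func updateUI(withResult result: SFSpeechRecognitionResult) {
    let alert = UIAlertController(title: "You said:",
                                  message: result.bestTranscription.formattedString,
                                  preferredStyle: .alert)
    alert.addAction(UIAlertAction(title: "OK", style: .default))
    present(alert, animated: true)
}
```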

3. Creating a recording pipeline

While our code is ready to classify, there is nothing to classify yet. Let’s fix that.

AVFoundation allows you to build complicated graphs of audio pipelines. Each item in this graph is called a node. There are three different types of nodes: input nodes, output nodes, and mixer nodes. We only use one input node in this app, but it’s still good to understand what happens under the hood.

The very first thing we need to do is get an audio engine. This is the object that controls the entire pipeline. Because we need it later on to stop recording, we add it as a property on self.

private var audioEngine: AVAudioEngine!

Creating an audio engine is only a matter of calling its initializer:

audioEngine = AVAudioEngine()

After that, we’ll get the input node of the audio engine. We need this object later on as well, so create another property:

private var inputNode: AVAudioInputNode!

The input node is available as a property of the audio engine:

inputNode = audioEngine.inputNode

There is one other thing to review. Audio recordings can be of any duration, meaning we can’t simply allocate one block of memory to put the recording into. Luckily, engineers have found a solution for this issue: we cut the recording up into many pieces of length bufferSize that we can store in a fixed-size block.

You don’t need to worry about audio getting cut off as a result of this — if we stop recording in the middle of a block (which almost always happens), the rest of the block is filled with silence.

To get these chunks of audio, we need to install a tap on the input node, which lets us observe the audio flowing through one of its buses (a bus is like a channel we’re using). We also tell AVFoundation what our next step is: appending the extracted buffers to the recognition request, ready to be transcribed (transcription is done with recurrent neural networks, which also work on buffers!).
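A sketch of installing the tap — 1024 is just a common buffer size, and request is the recognition request from step 2:

```swift
// Use the node's native format so no conversion is needed.
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
    // Hand each chunk of audio to the recognition request.
    request.append(buffer)
}
```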

With that, the entire graph is finished. AVAudioEngine can now build the graph for us if we call .prepare() .
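In code:

```swift
audioEngine.prepare()
```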

4. Start recognizing speech

There are a lot of things that could potentially go wrong when we put all our above code into action. The user might be on a phone call, their microphone may be missing or broken, the graph might not be complete, etc.

Luckily, iOS helps us by checking these things for us and simply giving us an error if anything is not OK. All we have to do is wrap the following code in a do-try-catch block:

The way iOS handles audio and video input and output is through AVAudioSession s, which can be requested either by third party apps or iOS itself.

Through sessions, an app can request to use the microphone, camera, or both simultaneously. By specifying a category, a mode, and options, iOS will automatically prepare the low-level system functions. So for example, if a user is listening to music, the music will be paused and the microphone will be enabled.

It won’t come as a surprise to learn that AVAudioSession is a singleton — there is only one iPhone to manage, after all.

We set the category of our session to .record because we’re recording audio. Other categories are also available; see Apple’s documentation for the full list.
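A sketch of configuring and activating the session — the .measurement mode and .duckOthers option are one reasonable configuration, not the only one:

```swift
do {
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
} catch {
    handleError(withMessage: error.localizedDescription)
    return
}
```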

Finally, we can fire up the entire pipeline, from getting raw input to transcribing the audio, by calling .start() .
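Starting the engine can also throw, so it gets the same treatment:

```swift
do {
    try audioEngine.start()
} catch {
    handleError(withMessage: error.localizedDescription)
}
```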

5. Stop recognizing speech

The very last thing to do is implement stopRecording . This function should be self explanatory.

One thing I’d like to point out is that the audio pipeline cannot be changed while it’s being used, so we have to stop the audio engine before we can modify the graph by removing our tap.
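A sketch of the whole function — assuming the recognition request from step 2 was stored in a recognitionRequest property:

```swift
private func stopRecording() {
    // The graph can't be modified while the engine is running,
    // so stop the engine before removing the tap.
    audioEngine.stop()
    inputNode.removeTap(onBus: 0)

    // Signal that no more audio is coming, so we get a final result.
    recognitionRequest?.endAudio()
}
```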

Running the app

You can now hit build and run to load the final result onto your device. Tap “Start recording,” and once you start talking, results will be printed in Xcode’s console. When you tap “Stop recording,” an alert will be presented.