TLDR; In this step by step guide we’ll show you how to transcribe an audio file using IBM Watson speech-to-text API and a little bit of Python.

From day one we’ve tried to improve accessibility and provide convenience for all our students. All our classes from our online courses are recorded and stored privately on YouTube, in order to give our students the possibility to watch them in the future. We wanted to go one step further and provide captions and transcripts for each video, and we started investigating options. We do A LOT of classes (4 per week), so a premium (human-driven) service was off the table. On paper, the speech-to-text service from IBM Watson made a lot sense: just $0.023 per minute.

The end result greatly satisfied us. You can see an example of the final transcription from a Youtube video right here: https://github.com/rmotr/speech-to-text (see the examples/ directory). The money side of the equation also was convenient. For example, we transcribed all our Free Flask Tutorial videos (+ a lot of testing), and spent only $2.58:

How we did it? Here are the required steps.

Step 1 — Get the IBM Bluemix credentials

In order to use the Watson API you need to have an active Bluemix account, create an app, and generate a set of “credentials” to communicate with the API. This is what you have to do:

Register in Bluemix

Go to https://console.ng.bluemix.net/registration/ and create a new user. Make sure you input your password, sometimes they just skip this step (you’ll have to recover it in other case). You’ll need to validate your email. Standard procedure.

Go to the Bluemix Dashboard

If you registered and logged in successfully you’ll have access to the Bluemix dashboard: https://console.ng.bluemix.net/dashboard

In your dashboard click on “Create App”. It looks something like this:

Create Organization and Space

Once you click on Create App you’ll be prompted to a series of steps. First, accepting terms and services:

Then create the organization (choose whatever name you want):

Then create a Space (again, name doesn’t make a difference, choose whatever you want):

And confirm your changes:

Create the App

You’ll then be dropped to the “catalog menu”, from which we’ll create the app. Start by selecting “Watson” under services and the Speech to Text service, as shown here:

You’ll see a screen to give your service a name and to create a set of credentials. You can put whatever name you want, I just left the defaults:

Once you hit the Create button, you’ll be dropped to the main screen of your service. Then click on “Service Credentials” as shown below:

We’re close! If you expand with the “View Credentials” arrow, you should see your Username and Password:

Step 2 — Getting the transcript

Now that you have your Username and Password, you can use either our open source command or some small piece of Python code to transcribe your audio file.

Using Python

The Watson Python SDK is a handy library that will make this process extremely simple. First install the SDK:

$ pip install watson-developer-cloud

Then you just need to invoke the service with the audio file that you want to transcribe ( audio_file.flac in our example):

As you can see from our example, the process is extremely simple. Just create an instance of the SpeechToTextV1 class and invoke the recognize() method passing the audio file and a few options. You can see a reference of the options in the Watson API Reference.

Using our Command Line tool

Now that you’ve seen and understood how it works, you can just use (and collaborate with) our command line tool. It won’t just create the transcript, but also parse the result and build a final document with different formats (HTML, markdown, etc):