Call scoring is a crucial part of call center quality assurance. It enables organizations to fine-tune their workflow so that call center agents can do their job faster and more efficiently and avoid meaningless routine work.

With call center productivity in mind, our R&D team has been working on automated call scoring for the last couple of months. They’ve come up with an algorithm that processes all incoming calls and classifies them as suspicious or neutral. All calls flagged as suspicious go directly to the quality assurance team.

How We Trained a Deep Neural Network

We used a sample of 1,700 audio files to train our neural network for automated call scoring. The benchmark data was unlabeled, so the network had no way of knowing which files were neutral and which were suspicious. We therefore labeled the samples manually, dividing them into neutral and suspicious.

In neutral files, call center reps:

Don’t raise their voice.

Provide clients with all the information they require.

Do not respond to the client’s provocations.

In suspicious files, reps most likely:

Use explicit language.

Raise their voice or shout at clients.

Get personal.

Refuse to provide support or advice.

When the algorithm finished processing the files, it marked 200 of them as invalid. These audio files showed neither neutral nor suspicious characteristics. We found that these 200 were calls where:

The client got off the phone right after a call center agent answered the call.

The client didn’t say anything after dialing the number.

There was too much noise either on the call center side or on the client side.

After we removed invalid files, we divided the remaining 1,500 recordings into a training sample and a test sample. We proceeded by using these two datasets to train and then test our deep neural network.
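The split itself can be sketched in a few lines. This is only an illustration: the post does not state the split ratio or the class balance, so the `test_fraction` value, the per-label (stratified) shuffling, the file names, and the one-in-five share of suspicious calls below are all assumptions.

```python
import random

def split_recordings(labeled_files, test_fraction=0.2, seed=42):
    """Split (path, label) pairs into train and test lists, label by label."""
    rng = random.Random(seed)
    by_label = {}
    for path, label in labeled_files:
        by_label.setdefault(label, []).append(path)

    train, test = [], []
    for label, paths in sorted(by_label.items()):
        rng.shuffle(paths)
        n_test = round(len(paths) * test_fraction)
        test += [(p, label) for p in paths[:n_test]]
        train += [(p, label) for p in paths[n_test:]]
    return train, test

# Hypothetical file names and class balance, for illustration only.
files = [(f"call_{i:04d}.wav", "suspicious" if i % 5 == 0 else "neutral")
         for i in range(1500)]
train, test = split_recordings(files)
```

Splitting within each label keeps the share of suspicious calls the same in both samples, which matters when one class is much rarer than the other.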

Step 1: Feature Extraction

High-level feature extraction matters a lot in machine learning, as it directly affects the algorithm's efficiency. After we checked all the sources we could find, we decided to choose the following features:

Time Statistics

Zero-crossing rate: The rate at which the signal changes from positive to negative or vice versa.

Median frame energy: The sum of the signal values squared, normalized by the respective frame length.

Entropy of energy of subframes: The entropy of sub-frames' normalized energies. It can be interpreted as a measure of abrupt changes.

Mean/median/standard deviation of frame
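The time-domain features described above can be sketched with NumPy as follows. This is a minimal illustration rather than our production code: the function names, the number of sub-frames (10), and the small `eps` guard against division by zero are assumptions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    positive = frame >= 0
    return np.mean(positive[1:] != positive[:-1])

def frame_energy(frame):
    """Sum of squared samples, normalized by the frame length."""
    return np.sum(frame ** 2) / len(frame)

def energy_entropy(frame, n_subframes=10, eps=1e-12):
    """Entropy of the sub-frames' normalized energies (low for abrupt bursts)."""
    usable = len(frame) - len(frame) % n_subframes  # drop the ragged tail
    sub = frame[:usable].reshape(n_subframes, -1)
    energies = np.sum(sub ** 2, axis=1)
    p = energies / (np.sum(energies) + eps)
    return -np.sum(p * np.log2(p + eps))

# Per-file statistic: the median of the per-frame energies.
frames = np.random.default_rng(0).standard_normal((9, 1600))  # stand-in frames
median_energy = np.median([frame_energy(f) for f in frames])
```

A frame whose energy is spread evenly across its sub-frames gives an entropy close to log2(10), while a frame with a single short burst gives a value close to zero.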

Spectrum Statistics (With Frequency Bins)

Spectral Centroid

Spectral Spread

Spectral Entropy

Spectral Flux

Spectral Rolloff
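For readers who want to see what these five statistics measure, here is a hedged NumPy sketch based on their common textbook definitions. The exact formulas we used may differ: the 90% rolloff fraction, computing entropy over individual power-spectrum bins, the 8 kHz sample rate, and the function name are all assumptions.

```python
import numpy as np

def spectral_features(frame, prev_frame, sample_rate,
                      rolloff_fraction=0.9, eps=1e-12):
    """Spectral statistics of one frame; flux compares it with the previous frame."""
    mag = np.abs(np.fft.rfft(frame))
    prev_mag = np.abs(np.fft.rfft(prev_frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Centroid and spread: mean and standard deviation of frequency,
    # weighted by the normalized magnitude spectrum.
    weights = mag / (np.sum(mag) + eps)
    centroid = np.sum(freqs * weights)
    spread = np.sqrt(np.sum((freqs - centroid) ** 2 * weights))

    # Entropy of the normalized power spectrum.
    power = mag ** 2
    p = power / (np.sum(power) + eps)
    entropy = -np.sum(p * np.log2(p + eps))

    # Flux: squared change between consecutive normalized spectra.
    prev_weights = prev_mag / (np.sum(prev_mag) + eps)
    flux = np.sum((weights - prev_weights) ** 2)

    # Rolloff: lowest frequency below which the chosen share of energy lies.
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, rolloff_fraction * cumulative[-1])
    rolloff = freqs[min(idx, len(freqs) - 1)]

    return {"centroid": centroid, "spread": spread, "entropy": entropy,
            "flux": flux, "rolloff": rolloff}

sample_rate = 8000                      # assumed; not stated in the post
t = np.arange(1600) / sample_rate       # one 0.2-second window
tone = np.sin(2 * np.pi * 1000 * t)     # synthetic 1 kHz test tone
feats = spectral_features(tone, tone, sample_rate)
```

For the synthetic tone, the centroid and rolloff both sit at about 1 kHz and the flux is zero because the frame is compared with itself. Real speech frames give broader, more varied values.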

Mel frequency cepstral coefficients and the chroma vector are sensitive to the input signal length. We could have extracted them from the whole audio file at once, but then we would have lost the way the features evolve over time. That was not acceptable for us, so we decided to divide the signal into windows.

To improve feature quality, we broke up the signal into chunks that overlap. Then, we extracted a feature sequence for each chunk; therefore, a feature matrix was computed for each audio file.

Window length: 0.2 seconds. Window step: 0.1 seconds.
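With these settings, consecutive windows overlap by half. A minimal sketch of the chunking and of building a per-file feature matrix might look like this. The helper names, the 8 kHz sample rate, and the two placeholder features (`np.mean`, `np.std`) are assumptions for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate, window_s=0.2, step_s=0.1):
    """Cut a 1-D signal into overlapping windows; returns (n_frames, window)."""
    window = int(round(window_s * sample_rate))
    step = int(round(step_s * sample_rate))
    if len(signal) < window:
        return np.empty((0, window))
    n_frames = 1 + (len(signal) - window) // step
    return np.stack([signal[i * step:i * step + window]
                     for i in range(n_frames)])

def feature_matrix(signal, sample_rate, feature_fns):
    """One row per window, one column per feature function."""
    frames = frame_signal(signal, sample_rate)
    return np.array([[fn(frame) for fn in feature_fns] for frame in frames])

sample_rate = 8000  # assumed; not stated in the post
signal = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s stand-in
frames = frame_signal(signal, sample_rate)
features = feature_matrix(signal, sample_rate, [np.mean, np.std])
```

Trailing samples that do not fill a whole window are dropped here. Padding the last window with zeros would be an equally reasonable choice.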

Step 2: Detecting the Tone of Voice in Separate Phrases

Our first approach to solving the task was to detect and process each phrase in the stream separately.

First, we applied speaker diarization and partitioned all phrases in the audio using the LIUM library. Input files were low-quality, so we applied output smoothing and adaptive thresholding for each one of them.

Processing Interruptions and Long Silence

After we determined the time boundaries of every phrase (spoken by a client or a call center rep), we overlaid them and detected cases where the speakers talked over each other and cases where nobody said anything. The only thing left was to select appropriate threshold values. We defined three or more seconds of talking over each other as an interruption. For silence, we also selected a threshold of three seconds.

Every phrase has a different length, so the number of features extracted from each phrase also differs.
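Given diarization output as lists of (start, end) times in seconds for each speaker, the overlap and silence checks can be sketched as below. The function names and the sample segments are hypothetical. Only the two three-second thresholds come from our setup.

```python
def find_interruptions(agent, client, min_overlap=3.0):
    """Return intervals where agent and client speak at once for >= min_overlap s."""
    overlaps = []
    for a_start, a_end in agent:
        for c_start, c_end in client:
            start, end = max(a_start, c_start), min(a_end, c_end)
            if end - start >= min_overlap:
                overlaps.append((start, end))
    return sorted(overlaps)

def find_silences(agent, client, min_gap=3.0):
    """Return gaps of >= min_gap s between phrases where nobody speaks."""
    silences = []
    last_end = None
    for start, end in sorted(agent + client):
        if last_end is not None and start - last_end >= min_gap:
            silences.append((last_end, start))
        last_end = end if last_end is None else max(last_end, end)
    return silences

# Hypothetical phrase boundaries in seconds.
agent = [(0.0, 4.0), (10.0, 18.0)]
client = [(5.0, 6.5), (14.0, 20.0)]

interruptions = find_interruptions(agent, client)  # [(14.0, 18.0)]
silences = find_silences(agent, client)            # [(6.5, 10.0)]
```

This sketch only looks at gaps between phrases, so silence before the first phrase or after the last one is not reported.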

An LSTM neural network could solve this problem. Not only can networks of this type process input sequences of different lengths, they also contain feedback connections that let them retain information. These properties matter a lot to us because phrases spoken earlier contain information that affects phrases spoken later.

We then trained our LSTM neural network to detect the sentiment of each phrase.

As a training set, we used 70 audio files with an average of 30 phrases in each one (15 client phrases and 15 call center rep phrases).

The main goal was to score phrases spoken by the call center agent, so we didn’t use client speech to train the classifier. We used about 750 phrases in the training dataset and 250 phrases in the test set. As a result, the neural network classified speech with an accuracy of 72%.
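To make the variable-length point concrete, here is a bare-bones NumPy forward pass of a single LSTM layer. It is a sketch only: the weights are random and untrained, the layer sizes are hypothetical, and a real model would be built and trained with a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(sequence, W, U, b):
    """Run one LSTM layer over a (time, features) matrix; return the last hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state, fed back at every step
    c = np.zeros(hidden)  # cell state, the layer's memory
    for x in sequence:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)       # input, forget, output gates; candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                 # keep part of the old memory, add new
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
n_features, hidden = 8, 4                 # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * hidden, n_features))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

short_phrase = rng.normal(size=(12, n_features))  # 12 windows
long_phrase = rng.normal(size=(57, n_features))   # 57 windows
h_short = lstm_encode(short_phrase, W, U, b)
h_long = lstm_encode(long_phrase, W, U, b)
```

Both phrases end up as a vector of the same size, regardless of how many windows they contain, so a classifier can be placed on top of the final hidden state.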

All in all, we were not satisfied with the LSTM neural network performance. It took too much time and the results were far from perfect. We decided to try another approach.

In the second part of our post, you will find out:

How we detected the tone of voice in the whole audio file using XGBoost.

Why we combined LSTM and XGBoost.

How we detected a particular phrase in speech.

What the overall prediction accuracy was.

What tasks can be solved via voice recognition software.

Stay tuned!