The system proposed in this study comprises several steps: the speech recordings of the patients are arranged into speech databases (Pathological and Healthy Adults Speech Database). The recordings are normalized and segmented at the phoneme level. After selecting the phonemes to be analyzed, acoustic parameters are extracted and arranged into a feature vector. The feature vector is passed either to a classifier that performs binary classification (healthy or pathological) or to a regression module that estimates the severity of dysphonia; both rely on prior knowledge. This prior knowledge is gained by processing a carefully built speech database and by training optimal classification and regression models. For a new speech sample, the class (healthy/pathological) or the severity of dysphonia is unknown: the preprocessing of the recording is the same, and after the acoustic parameters are measured at the phoneme level, a test feature vector is constructed and entered into a comparison unit, where the classification or regression is performed. This process is summarized in Fig. 1. This study focuses on the automatic assessment of voice severity, while also analyzing the subjective nature of the specialists’ ratings.

Fig. 1 The framework of this study

Pathological and healthy adults speech database

Sound samples from patients were collected during patient consultations in a consulting room at the Department of Head and Neck Surgery of the National Institute of Oncology. Several types of diseases occurred during the survey: functional dysphonia, recurrent paresis, tumors at various points of the vocal tract, gastroesophageal reflux disease, chronic inflammation of the larynx, bulbar paresis, amyotrophic lateral sclerosis, leucoplakia, spasmodic dysphonia, etc. Recordings from healthy people were collected as well, for comparison; these were collected from people who had attended the clinic for unrelated check-ups.

Recording environment and text material

The recordings were made using a near-field microphone (Monacor ECM-100) and a Creative Soundblaster Audigy 2 NX external USB sound card with a good-quality A/D converter and a low noise level (audio coding: PCM, sampling rate: 16 kHz, quantization: 16 bit). The recordings were made in a quiet office environment (medical office). Each patient had to read aloud one of Aesop’s fables, “The North Wind and the Sun”. This tale is frequently used in phoniatrics as an illustration of spoken language. It has been translated into several languages, Hungarian included. The text is eight sentences long. The database was annotated and segmented at the phoneme level with the help of an automatic phoneme segmenter developed in the Laboratory of Speech Acoustics (Kiss et al. 2013).

In the present study two datasets were used: the Initial database and the Selected database.

Initial database

The database contains a total of 263 speech recordings: 127 recordings from healthy subjects (62 male and 65 female) and 136 recordings from patients suffering from functional or organic dysphonia (66 male and 70 female); each recording is from a separate subject. The diagnosis was determined by the specialist who treated the patient and who directly listened to and evaluated the quality of the patient’s speech during the consultations. This database was used for the two-class classification experiment.

Selected database

The Selected database contains a total of 148 recordings, and it was used for the unsupervised cluster and regression analysis. It contains all 136 pathological recordings from the Initial database. Furthermore, 12 healthy recordings were selected from the Initial database, because the number of samples in each hoarseness severity category (from H0 to H3) must be balanced for the unsupervised cluster and regression analysis. Four specialists examined these recordings. One of the four specialists set up the diagnosis and evaluated the quality of the patient’s speech during the consultations; the other three specialists did not know the patient and only listened to the previously recorded sound files to determine the severity of dysphonia. Every rater is experienced in working with patients with voice disorders. Table 1 summarizes the diagnoses and their occurrences in the patient group.

Table 1 Diagnoses for the patient group

Table 2 Two-class classification results

RBH scale

The RBH scale gives the severity of dysphonia, where R stands for roughness, B for breathiness and H for overall hoarseness. The degree of category H cannot be less than the highest rating of the other two categories: for example, if B = 3 and R = 2, then H is 3 and cannot be 2 or 1. A healthy voice is coded R0B0H0; the maximum value of each category is 3, so a voice with severe dysphonia is coded R3B3H3. Ptok and his colleagues demonstrated that the RBH scale is suitable for clinical purposes (Ptok et al. 2006). In this study the overall hoarseness H was examined.

Acoustic parameters

In Tulics and Vicsi (2017) we carried out a detailed correlation analysis to shape an extended parameter set. The following parameters were selected: the means and standard deviations of jitter(ddp), shimmer(ddp), the Harmonics-to-Noise Ratio (HNR) and mfcc01, measured on the vowel [E] (SAMPA), the most frequent vowel read out in the tale. Moreover, the Soft Phonation Index (SPI) and Empirical Mode Decomposition (EMD) based frequency band ratios were measured on the voiced parts of speech, and the measured parameters were grouped into different phonetic classes, since the quality of continuous speech is determined not only by the quality of the vowels but also by the distortion of speech sounds of other voiced phonetic classes, such as nasals, voiced fricatives, etc. These acoustic parameters, selected by the detailed correlation analysis, were used in this study.

SPI is the average ratio of the energy of the speech signal in the low frequency band (70–1600 Hz) to that in the high frequency band (1600–4500 Hz). A large ratio means that the energy is concentrated in the low frequencies, indicating a softer voice (Roussel and Lobdell 2006). The parameter was calculated based on mel-frequency bands. The first band starts at 100 mel (64.95 Hz) and each band is 100 mel wide. Thus, SPI can be represented as the ratio of the energy in bands 1 to 13 to the energy in bands 14 to 22.
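As a rough illustration of this band-energy ratio, the following numpy sketch sums the spectral energy of one voiced frame in rectangular 100-mel-wide bands and forms the ratio of bands 1–13 to bands 14–22. The rectangular band shapes, the Hann window and the names (soft_phonation_index, frame, sr) are assumptions of this sketch, not a reproduction of the original implementation.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy / HTK mel formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def soft_phonation_index(frame, sr):
    """SPI of one voiced frame: energy in mel bands 1-13 over bands 14-22."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # 100-mel-wide rectangular bands, the first one starting at 100 mel
    edges_hz = mel_to_hz(100.0 * np.arange(1, 24))   # 23 edges -> 22 bands
    band_energy = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in zip(edges_hz[:-1], edges_hz[1:])
    ])
    low = band_energy[:13].sum()     # bands 1-13 (roughly 65-1700 Hz)
    high = band_energy[13:22].sum()  # bands 14-22 (roughly 1700-4700 Hz)
    return low / high if high > 0 else np.inf
```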

EMD decomposes a multicomponent signal into elementary signal components called intrinsic mode functions (IMFs) (Huang et al. 1998). Each of these IMFs contributes, both in amplitude and in frequency, to the speech signal. The IMFs are arranged in a matrix sorted according to frequency: the first few IMFs are the high-frequency components of the signal, while the later IMFs represent the lower-frequency components. We calculate the entropy (E) for each IMF. The frequency band ratio of the entropies was calculated in the following way:

$$IMF_{entropy}=\frac{\sum_{d=1}^{2} E_d}{\sum_{d=2}^{D} E_d}$$ (1)

where E_d is the value of the Shannon entropy of the d-th log-transformed IMF, d = 1, 2, …, D, and D is the total number of extracted IMFs. The Shannon entropy of a discrete signal is defined as

$$E(p_i)= -K\sum_{i=1}^{n} p_i \log p_i$$ (2)

where K is a positive constant. To extract the parameter, the toolkit presented in Tsanas (2013) was used.
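The Python sketch below only illustrates Eqs. (1) and (2); the study itself used the toolkit of Tsanas (2013). The PyEMD package, the histogram-based probabilities p_i, and the log1p transform of the IMFs are assumptions made for this illustration.

```python
import numpy as np
from PyEMD import EMD  # assumed Python analogue; the paper used the Tsanas (2013) toolkit

def shannon_entropy(x, n_bins=100):
    """Shannon entropy (Eq. 2 with K = 1) via a normalized histogram.
    How the probabilities p_i are formed is an assumption of this sketch."""
    hist, _ = np.histogram(x, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def imf_entropy_ratio(signal):
    """IMF_entropy of Eq. (1): entropies of the first IMFs over the remaining ones."""
    imfs = EMD().emd(signal)  # rows ordered from high-frequency to low-frequency IMFs
    E = np.array([shannon_entropy(np.log1p(np.abs(imf))) for imf in imfs])
    D = len(E)
    return E[0:2].sum() / E[1:D].sum()  # numerator d = 1..2, denominator d = 2..D
```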

The means and standard deviations of the Soft Phonation Index (SPI) and IMFentropy were also calculated on the vowel [E]; moreover, SPI and IMFentropy were measured on the voiced parts of the whole speech samples and were grouped according to the following phonetic classes:

on nasal sounds marked with [m], [n] and [J]

on high vowels marked with [E], [e:], [i], [2] and [y]

on low vowels marked with [O], [A:], [o] and [u]

on voiced spirants marked with [v], [z] and [Z]

on voiced plosives and affricates marked with [b], [d], [g], [dz], [dZ] and [d’]

Moreover, SPI was calculated on the whole sample as well; no standard deviation was calculated in this case. Thus, a total of 33 acoustic parameters were measured per voice sample, forming the starting parameter set of this research.
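A minimal sketch of how segment-level measurements can be pooled into the phonetic classes listed above is given below. The SAMPA labels follow the list, while the function name, the (label, value) input format and the statistics returned are purely illustrative assumptions.

```python
import numpy as np

# SAMPA labels per phonetic class, as listed above
PHONETIC_CLASSES = {
    "nasals":          ["m", "n", "J"],
    "high_vowels":     ["E", "e:", "i", "2", "y"],
    "low_vowels":      ["O", "A:", "o", "u"],
    "voiced_spirants": ["v", "z", "Z"],
    "voiced_plosives": ["b", "d", "g", "dz", "dZ", "d'"],
}

def class_statistics(per_phoneme_values):
    """per_phoneme_values: list of (sampa_label, value) pairs for one recording,
    where value is e.g. SPI or IMF_entropy measured on that segment."""
    features = {}
    for name, labels in PHONETIC_CLASSES.items():
        vals = np.array([v for lab, v in per_phoneme_values if lab in labels])
        features[f"{name}_mean"] = vals.mean() if vals.size else np.nan
        features[f"{name}_std"] = vals.std(ddof=1) if vals.size > 1 else np.nan
    return features
```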

Decision methods

A two-class classification was performed on the Initial database using leave-one-out cross-validation with an SVM (support vector machine) classifier. SVM is a supervised machine learning algorithm used mainly for binary classification tasks: it uses the kernel trick to transform the data and, based on these transformations, finds an optimal boundary between the possible outputs. The classifier was used successfully in our previous work, achieving high accuracy in separating healthy and pathological voices (Kazinczi et al. 2015). The goal of the two-class classification was to find out whether the chosen acoustic parameters are rich enough in information to differentiate between healthy and pathological voices, while reducing the dimensionality of the input vector.
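A minimal scikit-learn sketch of this evaluation scheme (an SVM scored with leave-one-out cross-validation) is shown below. The RBF kernel, the standardization step and the variable names are assumptions, since the exact SVM settings are not restated here.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (263, 33) acoustic parameters of the Initial database; y: 0 = healthy, 1 = pathological
def loo_svm_accuracy(X, y):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")
    return scores.mean()
```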

In order to reduce the dimensionality of the input vector, the forward feature selection (FFS) algorithm was used. Forward feature selection is an iterative algorithm that, in each step, chooses the feature that most improves the performance with regard to a cost or objective function and adds it to the already selected features. Here, the features were selected using maximum accuracy as the objective function.
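The greedy loop below sketches this selection procedure, using the leave-one-out SVM accuracy from the previous sketch as the objective function; the stopping rule (stop when no remaining feature improves accuracy) is an assumption.

```python
def forward_feature_selection(X, y, evaluate=loo_svm_accuracy):
    """Greedy forward selection: at each step add the feature that maximizes accuracy."""
    remaining = list(range(X.shape[1]))
    selected, best_overall = [], 0.0
    while remaining:
        scores = [(evaluate(X[:, selected + [j]], y), j) for j in remaining]
        best_score, best_j = max(scores)
        if best_score <= best_overall:  # stop when no feature improves accuracy
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_overall = best_score
    return selected, best_overall
```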

It is also an important question whether the acoustic parameters selected by the correlation analysis are suitable for modelling the four-grade assessments of the specialists (the subjective RBH scale). For this reason, an unsupervised learning method, the k-means algorithm, was used on the Selected database. k-means is one of the simplest unsupervised learning algorithms for solving known clustering problems. It is a fast and simple approach: it is easy to implement, and the clustering results are easy to interpret.
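A minimal sketch of this clustering step, assuming standardized features and k = 4 clusters corresponding to the H0–H3 categories (the variable names are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X_sel: acoustic parameters of the Selected database; k = 4 for the H0-H3 categories
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
clusters = kmeans.fit_predict(StandardScaler().fit_transform(X_sel))
# the cluster labels can then be compared against the specialists' H ratings
```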

The consistency of the four specialists’ ratings was also examined with Cronbach’s Alpha and the Intraclass Correlation Coefficient (ICC). Both methods are widely used to estimate the reliability of a composite score.
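Cronbach’s Alpha can be computed directly from the rating matrix, as in the sketch below; the variable names and the samples-by-raters layout are assumptions, and a corresponding ICC computation is not reproduced here.

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_samples, n_raters) matrix of H scores from the four specialists."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()   # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of the summed scores
    return k / (k - 1) * (1.0 - item_var / total_var)
```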

Our main aim is the automatic estimation of the severity of dysphonia. Linear regression and support vector regression (SVR) with a radial basis function (RBF) kernel were used for model building. By its nature, linear regression captures only linear relationships between the dependent and independent variables, i.e., it assumes a straight-line relationship between the input variables and the target. SVR with an RBF kernel has good generalization ability and strong tolerance to input noise.
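A minimal sketch of the two regression models is given below, assuming leave-one-out predictions and illustrative SVR hyperparameters (C and epsilon are not specified here in the text, and the variable names are assumptions).

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# X_sel: acoustic parameters of the Selected database; y_h: H severity ratings (0-3)
linreg = LinearRegression()
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))

# severity estimates obtained with leave-one-out cross-validation
y_pred_lin = cross_val_predict(linreg, X_sel, y_h, cv=LeaveOneOut())
y_pred_svr = cross_val_predict(svr, X_sel, y_h, cv=LeaveOneOut())
```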