Various psychological experiments have hypothesised, and provided evidence for, differences in smiles between the two genders. To verify this computationally, and at the same time to develop a tool for gender classification based solely on the smile, we propose a framework which can track the dynamic variations in the face from neutral to the peak of a smile. Our framework is based upon four key components: (1) spatial features based on dynamic geometric distances across the face, (2) the changes that occur in the area of the mouth, (3) the geometric flow around prominent parts of the face and (4) a set of intrinsic features based on the dynamic geometry of the face. Note that all of the dynamic features described here are intuitive extensions of the relevant physical experimentation and are based on the reported literature on facial emotions, especially on the dynamics of the smile, e.g. [8, 10].

Figure 1 presents a block diagram showing the key components of our framework for the analysis of the dynamics of a smile. The first step in our framework is to detect and track the face within a given video sequence. To do this, we have used the well-known Viola-Jones algorithm, which computes Haar-like features efficiently via an integral image and combines features selected by AdaBoost training into a cascade of classifiers [32]. The ability of this algorithm to robustly detect faces under different lighting conditions is well established, and we have also demonstrated this in previous work [2].

The next step in our proposed framework involves automatic detection and tracking of a stable set of landmarks on the dynamic face. Automatic landmark detection is done using the CHEHRA model [6]. The algorithm has been trained to detect facial landmarks using in-the-wild datasets covering a wide range of illumination conditions, facial expressions and head poses. It is based on discriminative face alignment via cascaded linear regression, implemented using the Incremental Parallel Cascade of Linear Regression (iPar-CLR) method. The tests we have carried out using the CHEHRA model give acceptable results, though we noticed that its speed is likely to be a limitation for real-time applications. The algorithm has been utilised to detect 49 landmarks on the face, marked as \(P_{1}\ldots P_{49}\) as shown in Fig. 2b for the face shown in Fig. 2a. Note, in addition to the 49 landmarks which CHEHRA detects, we also include the centres of the eyes as two additional landmark points, as shown in Fig. 2b, marked as \(P_{50}\) and \(P_{51}\).
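The two eye-centre points can be derived directly from the detected eye-contour landmarks. A minimal numpy sketch follows; the eye index sets are assumed inputs here, since the exact indices depend on the CHEHRA markup.

```python
import numpy as np

def add_eye_centres(landmarks, right_eye_idx, left_eye_idx):
    """Append eye-centre points P50 and P51 to a (49, 2) landmark array.

    right_eye_idx / left_eye_idx are the index sets of the eye-contour
    landmarks in the CHEHRA markup (assumed inputs for this sketch).
    Each centre is taken as the centroid of its contour points.
    """
    p50 = landmarks[right_eye_idx].mean(axis=0)
    p51 = landmarks[left_eye_idx].mean(axis=0)
    return np.vstack([landmarks, p50, p51])
```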

Table 1 Description of the geometric distances from which dynamic spatial parameters are derived

Fig. 3 Variation in the dynamic spatial parameters \(\delta d_{i}\) across the 10 partitions of time, for a typical smile, from neutral to the peak

Dynamics of the spatial parameters

Based on the positions of a subset of the 49 landmarks obtained through the CHEHRA model, we identify six dynamic Euclidean distances across the face which are utilised to compute our dynamic spatial parameters. Further details of these spatial parameters are given in Table 1. The general form of a given spatial parameter is,

$$\begin{aligned} \delta d_{i} = \frac{d_{i}}{N_{i}} + \sum _{n=1}^{t} \left( \frac{d_{i}}{N_{i}} - \frac{d_{in}}{N_{in}} \right) , \end{aligned}$$ (1)

where t is the total number of video frames corresponding to each \(\frac{1}{10}\hbox {th}\) increment of the total time T for the smile, from neutral to the peak. Here \(N_{i}\) is the length of the nose, for a given video frame, computed as the distance between \(P_{23}\) and \(P_{26}\). Thus, by dividing the spatial parameters by the length of the nose \(N_{i}\), we normalise these parameters to the given dynamic facial image. It is worth pointing out that for a given smile, from neutral to the peak, we divide the time it takes into ten partitions, and therefore for each distance \(d_{i}\) we obtain 10 values of \(\delta d_{i}\) which are fed into the machine learning stage. Hence, in our dynamic smile framework, we have a total of 60 dynamic spatial parameters.
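One reading of Eq. (1), taking the first frame of the sequence as the neutral reference, can be sketched in a few lines of numpy; this is an illustrative interpretation rather than the authors' exact implementation.

```python
import numpy as np

def delta_d(d, N):
    """Spatial parameter of Eq. (1) for one distance over one time partition.

    d : (t+1,) array of the raw distance d_i per frame
    N : (t+1,) array of the nose length N_i per frame
    Frame 0 is taken as the neutral reference; the remaining frames
    accumulate the deviation of the normalised distance from it.
    """
    ref = d[0] / N[0]                       # normalised neutral distance
    return ref + np.sum(ref - d[1:] / N[1:])  # sum of per-frame deviations
```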

Fig. 4 Description of triangular mouth areas used to form the dynamic area parameters on the mouth

Figure 3 shows the variation of \(\delta d_{i}\) across the 10 time partitions for a typical smile. As can be observed, there is a variation in each parameter as the smile proceeds from neutral to its peak.

Dynamic area parameters on the mouth

The second set of dynamic parameters concern the mouth. Here we compute the changes in the area of 22 triangular regions that occupy the total area of the mouth. This is shown in Fig. 4. Again these areas are computed using the corresponding landmarks obtained from the CHEHRA model. The general form of how the changes in the mouth area are computed is described as,

$$\begin{aligned} \bigtriangleup _\mathrm{area}^{i} = \sum _{n=1}^{22} \frac{\bigtriangleup _{n}}{\bigtriangleup N_{i}}, \end{aligned}$$ (2)

and,

$$\begin{aligned} \delta \bigtriangleup _{i} = \sum _{n=1}^{t} \bigtriangleup _\mathrm{area}^{n}, \end{aligned}$$ (3)

where t is the total number of video frames corresponding to each \(\frac{1}{10}\hbox {th}\) increment of the total time T for the smile, from neutral to the peak. Here \(\bigtriangleup N_{i}\) is the invariant triangle area, for a given video frame, determined by the landmarks defining the outer corners of the eyes and the tip of the nose, namely \(P_{11}\), \(P_{20}\) and \(P_{26}\). Again we divide the total time of the smile, from neutral to peak, into ten partitions, and therefore we obtain 10 values of \(\delta \bigtriangleup _{i}\) through time, which are fed into the machine learning stage. Thus, in our dynamic smile framework, we have a total of 10 parameters which capture the dynamics of the mouth.
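The triangle areas themselves follow from the landmark coordinates via the shoelace formula. A minimal sketch of the normalised mouth-area sum of Eq. (2), assuming the vertex coordinates have already been gathered from the CHEHRA landmarks:

```python
import numpy as np

def tri_area(p, q, r):
    """Unsigned area of a triangle from its 2-D vertices (shoelace formula)."""
    return 0.5 * abs((q[0] - p[0]) * (r[1] - p[1])
                     - (r[0] - p[0]) * (q[1] - p[1]))

def mouth_area_param(mouth_tris, eye_nose_tri):
    """Sum of the mouth-triangle areas, each normalised by the invariant
    eye-nose triangle of Eq. (2).

    mouth_tris   : list of 22 triangles, each a sequence of 3 vertices
    eye_nose_tri : the (P11, P20, P26) vertices for the same frame
    """
    norm = tri_area(*eye_nose_tri)
    return sum(tri_area(*t) for t in mouth_tris) / norm
```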

For the purpose of illustration, in Fig. 5 we show the distribution of areas of the triangular regions, \(\bigtriangleup _{i}\), across a typical smile.

Fig. 5 Variation in the dynamic area parameters \(\bigtriangleup _{i}\) on the mouth, across the 10 partitions of time, for a typical smile, from neutral to the peak

Dynamic geometric flow parameters

The third component of our framework for smile dynamics is the computation of flow around the face during a smile. More specifically, we compute the flow around the mouth, the cheeks and the eyes. To do this, we have utilised the dense optical flow developed by Farnebäck [16]. It is a two-frame motion estimation algorithm in which the neighbourhood of each pixel in both frames is approximated by a quadratic polynomial, and the displacement field between the two frames is then estimated from these polynomial expansions. Using this algorithm, we are able to estimate the successive displacement of each of the landmarks during the smile.

Table 2 Description of how the optical flow parameters around the face are derived

Table 2 shows how the various landmarks and regions of the face are utilised to compute the optical flows around the face. The relevant landmarks and facial regions are given in Figs. 2b and 6 respectively. We also show the variations in the dynamic optical flows, \(\delta f_{i}\), around the face for a typical smile in Fig. 7.

Note, the geometric flow, \(\delta f_{i}\), for each of the regions is normalised upon computation by means of the corresponding flow around the invariant triangle area of the face determined by the landmarks defining the outer corners of the eyes and the tip of the nose, namely \(P_{11}\), \(P_{20}\) and \(P_{26}\). Again, each of the geometric flow parameters \(\delta f_{i}\) is computed across the 10 time intervals through which the smile is measured, resulting in a total of 50 dynamic geometric flow parameters which are then fed into the machine learning stage.

Fig. 6 Regions of the face identified for dynamic optical flow computation

Fig. 7 Variations in the dynamic optical flows \(\delta f_{1}\) around the face, for a typical smile, from neutral to the peak

Intrinsic dynamic parameters

In addition to the spatial parameters, the area parameters and geometric flow parameters, we compute a family of intrinsic dynamic parameters on the face to further enhance the analysis of the dynamics of the smile. These intrinsic parameters are mainly based on the computation of the variations in the slopes and the growth rates of various features across the face. We identify these features as \(s_{1}\), \(s_{2}\), \(s_{3}\) and \(s_{4}\), details of which we describe as follows.

The first parameter family in this category relates to the computation of the overall slope variation around the mouth during a smile. To compute this, we use,

$$\begin{aligned} s_{1i} = \frac{ N\sum _{n=1}^{N} P_{ix}P_{iy} - \sum _{n=1}^{N} P_{ix} \sum _{n=1}^{N} P_{iy} }{N\sum _{n=1}^{N} P_{ix}^2- \left( \sum _{n=1}^{N} P_{ix}\right) ^2}, \end{aligned}$$ (4)

where N is the number of video frames comprising the whole smile, from neutral to the peak, and \(P_{ix}\) and \(P_{iy}\) are the Cartesian coordinates in the image space corresponding to the landmark point \(P_{i}\) in frame n. Hence, a total of 12 parameters are identified for the variations in slopes around the mouth, corresponding to the mouth landmarks \(P_{32}\) to \(P_{43}\).
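The slope in Eq. (4) is the standard least-squares estimate fitted to a landmark's trajectory over the whole smile; for a single landmark it can be computed as:

```python
import numpy as np

def slope_of_track(xs, ys):
    """Least-squares slope of a landmark trajectory over N frames.

    xs, ys : (N,) arrays of the landmark's x and y image coordinates,
             one entry per video frame.
    """
    N = len(xs)
    num = N * np.sum(xs * ys) - np.sum(xs) * np.sum(ys)
    den = N * np.sum(xs ** 2) - np.sum(xs) ** 2
    return num / den
```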

Table 3 Parameter description for the computational framework for smile dynamics

The second family of parameters, \(s_{2}\), in this category corresponds to the growth rates across the smile of both the spatial parameters and the area parameters on the mouth. The growth rates arising from the spatial parameters are defined as,

$$\begin{aligned} s_{2i(\mathrm {spatial})} =\sum _{t=1}^{N-1} \frac{\delta d_{i}^{t} - \delta d_{i}^{t+1}}{\delta d_{i}^{t}}, \end{aligned}$$ (5)

and for the area parameters on the mouth are,

$$\begin{aligned} s_{2i(\mathrm {area})} =\sum _{t=1}^{N-1} \frac{\bigtriangleup _{i}^{t} - \bigtriangleup _{i}^{t+1}}{\bigtriangleup _{i}^{t}}, \end{aligned}$$ (6)

where N is the total number of frames in the video sequence of the smile, and t to \(t+1\) denotes two successive video frames. In addition to the growth rates \(s_{2i(\mathrm {area})}\) for each of the 22 triangular regions of the mouth, we also compute the total growth rate for the mouth, by applying Eq. (6) to the combined area of the 22 mouth triangles. This means we have a total of \(6+22+1 = 29\) parameters of dynamic intrinsic type \(s_{2}\).
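Since Eqs. (5) and (6) have the same form, the growth-rate computation can be sketched once for any parameter series:

```python
import numpy as np

def growth_rate(series):
    """Summed frame-to-frame relative decrease of a parameter series,
    as in Eqs. (5) and (6).

    series : (N,) values of a spatial distance or triangle area per frame.
    """
    s = np.asarray(series, dtype=float)
    # Each term is (value_t - value_{t+1}) / value_t for successive frames.
    return np.sum((s[:-1] - s[1:]) / s[:-1])
```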

The third family of parameters, \(s_{3}\), in this category is again identified for both the spatial parameters across the face and the area parameters on the mouth. These are defined as compound growth rates, given as,

$$\begin{aligned} s_{3i(\mathrm {spatial})} = \left( \frac{\delta d_{i}^\mathrm{neutral}}{\delta d_{i}^\mathrm{peak}} \right) ^{1/N} -1, \end{aligned}$$ (7)

and,

$$\begin{aligned} s_{3i(\mathrm {area})} = \left( \frac{\bigtriangleup _{i}^\mathrm{neutral}}{\bigtriangleup _{i}^\mathrm{peak}} \right) ^{1/N} -1, \end{aligned}$$ (8)

where N, like previously, is the total number of frames in the video sequence of the smile. The compound growth rate is measured simply using the neutral and the peak of the smile. Again, like previously, in addition to the compound growth rates \(s_{3i(\mathrm {area})}\) we also compute the compound growth rate for the entire mouth by utilising the total area of the mouth. This implies that we also obtain a total of 29 parameters of dynamic intrinsic type \(s_{3}\).
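The compound growth rate of Eqs. (7) and (8) reduces to a one-line computation from the neutral and peak values alone:

```python
def compound_growth(neutral, peak, N):
    """Per-frame compound growth rate between the neutral-frame value and
    the peak-frame value over N frames, as in Eqs. (7) and (8)."""
    return (neutral / peak) ** (1.0 / N) - 1.0
```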

For the final family of parameters, \(s_{4}\), in this category, we compute the gradient orientation of the mouth based on the two mouth corner landmarks \(P_{32}\) and \(P_{38}\), which provides us with a line m passing through \(\delta d_{1}\) at the neutral and at the peak of the smile. We then use,

$$\begin{aligned} s_{4i} = \sum _{t=1}^{T} \delta d_{1}^t - m^t, \end{aligned}$$ (9)

to compute the rate of deviation of the mouth corners from the line m over the 10 time partitions, where T is the total time from neutral to the peak of the smile. Similarly, we compute the gradient orientation of the mouth area based on the combined 22 triangular areas of the mouth between the neutral frame and the peak of the smile.

These parameters provide us with a sense of the smoothness of the smile and form an additional \(10+10=20\) parameters for machine learning.
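The deviation sum of Eq. (9) can be sketched as follows, assuming the line m has been sampled at the same 10 time partitions as \(\delta d_{1}\):

```python
import numpy as np

def smoothness_deviation(delta_d1, m_line):
    """Summed deviation of the mouth-corner parameter from the line m
    across the 10 time partitions, as in Eq. (9).

    delta_d1 : (10,) values of delta d_1, one per partition
    m_line   : (10,) values of the line m at the same partitions
    """
    return np.sum(np.asarray(delta_d1) - np.asarray(m_line))
```

A perfectly smooth smile, whose mouth-corner distance follows the line m exactly, yields a deviation of zero.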

Table 3 provides a summary and brief description of various parameters associated with our computational framework for smile dynamics.