Powered by Google Cloud Speech-to-Text

Welcome to the Voice Tech Podcast. Join me, Carl Robinson, in conversation with the world's leading voice technology experts. Discover the latest products, tools and techniques, and learn to build the voice apps of the future. "When the interaction is frictionless and seamless, you're actually more happy with it, you're less stressed, because it just feels natural."

Hello everybody, welcome back. Today's episode is called Voice Emotion Analytics, in which you'll hear my fascinating conversation with Florian Eyben, the CTO of audEERING, an audio analysis company that specializes in emotional artificial intelligence.

Florian is a leading expert in voice emotion analytics, machine learning and signal processing, and his contributions to the field are relied upon by researchers and industry practitioners worldwide. In our conversation you'll hear about the incredible applications of voice emotion AI. We get into some of the projects carried out by audEERING with major brands, covering market research, medical applications, social robots and many more. We also discuss openSMILE, the open-source research toolkit for audio feature extraction developed by Florian and the audEERING team; it's the state of the art in affective computing for audio and is the most widely used tool for next-generation audio analysis applications.

We go on to discuss the GeMAPS standard acoustic parameter set recommendation, which is used in the Interspeech emotion challenges. Florian tells us about the different emotional dimensions, such as valence and arousal, and about the datasets used for training and evaluating emotion detection algorithms, and we discuss multimodal emotion-enabled interfaces and much more besides. It's a fantastic episode and I know you're going to enjoy it. It's a little bit more technical than the average episode, but it's exactly what some of you have been asking me for, so I'm sure you'll enjoy it. Before we begin, I want to send a message of thanks to my very first patron, Yung, a computer science PhD student from Erlangen in Germany. I am humbled and grateful, and I find it incredibly encouraging that someone has decided to part with their hard-earned cash for what I'm doing here. I sent a message to Yung to say thanks, and I've got to know him talking online, and since then he's given me some great feedback on the show, one piece of which was to do more research-focused episodes like the one today.

This episode is dedicated to you, Yung; thank you very much for your support. Hot on his heels came my second patron, Bret Kinsella of the Voicebot podcast, whom I'm sure most if not all of you will know very well already. Bret is a leader in the voice space and a fantastic podcast host who has the best guests in the industry appear on his show, as well as running the Voicebot news site, all of which I encourage you to check out. So thank you very much, Bret.

Remember, I'm a one-man band at this: I'm doing everything from the podcast to the marketing to the post-production, still all for free. I'm not pushing any products or anything, so if you'd like to buy me a coffee to keep me going, you can do so at patreon.com/voicetechpodcast.

As for me, I've now finished my master's degree in machine learning, and I've submitted a paper, very exciting, my first ever, called "Sequence-to-sequence modelling of F0 for speech emotion conversion". It touches on the topic we're talking about today, because I have actually been working on emotion conversion in the voice. I submitted the paper to ICASSP 2019, the International Conference on Acoustics, Speech and Signal Processing, which is being held in Brighton in the UK in May next year, so check it out via the link in the show notes. I'm not sure how much of a chance I'll have of having my paper accepted, but it's exciting nonetheless. And now I'm looking to do a PhD, or otherwise find an interesting job, in order to improve my machine learning skills, build conversational interfaces, and contribute to the nascent field of human-computer interaction.

I've also got some interviews lined up, including one about behaviour change and nudges, Richard Thaler style, which sounds amazing. I've applied for a really competitive position as well; I'm not sure whether I'll make it or not, but you know, you've got to be in it to win it, so watch this space. The final thing I wanted to bring to you before we start: at the end of the last episode I mentioned that I was building a free guide on how to build a free Alexa skill for your podcast, music or audio content. Well, the guide is available now, it's up on the site. It contains easy-to-follow, step-by-step instructions on how to build and publish a free Alexa skill for your podcast, audio or music content, and it's one hundred percent free: the guide is free, the tools are free, and even the hosting is free. Listeners can ask Alexa to play the latest episode, or hear a list of episodes

and then play a specific one. So if you've got a podcast, you've got a skill; if you do music, or perhaps you were thinking of doing an ambient sounds skill, or you want to serve up your own content that listeners can select a particular track from, this is a perfect skill for you to build. It's a really beginner-friendly tutorial: it's all done in the web browser, there's no command line, and anyone with zero programming experience can handle it. If you want to check it out, go to voicetechpodcast.com/skillguide. All Alexa devices are supported, and once you publish to Alexa, everything from smart speakers to wearables to automotive is covered, so wherever you are, you can listen to your audio content. It's totally customizable as well, unlike some of the off-the-shelf solutions available: you can customize everything, the flow, the audio files, everything that Alexa says, and even add in SSML tags to make it sound a certain way. It's a great first project and it's really fun.

Once you've done it you will impress your friends, and I certainly impressed myself once I managed to do it. It's just so good to hear your content on Alexa, and it really doesn't take very long either; you can do it in an afternoon. So do check it out. Like I said, it's beginner-friendly, all in the web browser, no command line: just go to voicetechpodcast.com/skillguide and away you go. OK, a lot of people have been contacting me about appearing on the show as guests, so we've got some really great episodes coming up. Speaking of which, without further ado, it's now my great pleasure to bring you Florian Eyben, CTO of audEERING.

OK, so I'm here with Florian Eyben, the CTO of audEERING, an audio analysis company based just outside of Munich in Germany that specializes in emotional artificial intelligence. audEERING's products are able to automatically analyze acoustic scenes and also the emotional states of a human speaker: voice emotion analytics, which is what we'll be hearing about today. So Florian, welcome, and thank you for coming on to the podcast.

Yes, thank you for having me. Hello from my side, it's my pleasure to be here.

Excellent. Yeah, I'm really excited to talk to you, because this topic in particular is of great interest to me. I got really excited about the whole idea of being able to discover what a person is feeling, before they even know it themselves, just from the sound of their voice. And I also love the fact that the technology is able to analyze something that happens automatically in the background, and can be used for such a wide range of applications in order to ultimately improve our lives. So I'm really excited to talk about this topic with you. Would you like to introduce the company? What is audEERING?

Absolutely. So audEERING, well, we could say we're still a startup, more or less, at least in spirit and also in size: we're currently 47 people, and we were founded in 2012. It's quite an interesting story, I would say. We started the whole company as a spin-off from our university research at the Technical University of Munich, where we developed world-leading algorithms for speech emotion recognition. At the time, together with Björn Schuller, we built up the field which is now called computational paralinguistics. That means everything beyond the linguistic content, beyond the text: how you say something, your tone of voice, everything about your person, because every voice is different.

An old voice, a young voice, a male voice, a female voice: those things, or whether you even have a cold or a sore throat, anything that's in your voice, reveal information in addition to the text.

Very exciting. And at the time, how established was the field?

It was quite novel in 2010 to 2012. By then it was established, but when Björn Schuller started to work on speech emotion recognition around 2000, everybody was laughing at him. At that time people were focusing on hidden Markov models and speech recognition, the, in quotes, "serious business", and they said, "What are you doing? It's like mind reading, and we don't believe in all this." But in 2012, when we founded the company, there was a lot of traction in research in that area, because people were not moving anywhere anymore

in speech recognition; it had been more or less solved. With the introduction of deep learning a lot of progress had been made and you could buy commercial products, so everybody in research was jumping onto emotions and paralinguistics, right when we were also getting a lot of requests at the university that were of a more commercial nature. As a university you can serve small projects but nothing bigger, so we said: these are exciting times, let's start a company, try to commercialize the technologies that we discovered in research, and make really good products out of them.

Wow, that was a prescient decision, because now it's booming. Computational paralinguistics is a subfield of affective computing, am I right?

More or less; you can see them as two fields that have an overlap. Affective computing is the whole field of emotional interaction with machines: machines being able to display emotions, and being able to analyze emotions from various input modalities. That can be the face, it can be a physiological sensor like a smartwatch, or it can even be the voice, as we're doing it. Computational paralinguistics is very specific to the sound of your voice, as I mentioned before. So emotion is one part of computational paralinguistics, and emotion is one part of affective computing.

Affective computing is more general in some sense, but of course voice carries a huge amount of emotion, so it's a big part of affective computing.

Absolutely.

OK, so from your research you've developed a number of products within the company; could you just give us a brief overview of those?

Sure. So we're really well known for an open-source toolkit that we provide to the research community, called openSMILE. This is something that we started creating at the Technical University of Munich, and we then purchased the IP rights to the software when we founded the company, so it was kind of the first building block of audEERING, and we still continue to provide it as an open-source toolkit to the community. It does feature extraction from audio, that's in essence what it does, but compared to other feature extraction toolkits that are also out there, for example for music information retrieval or for other voice analysis tasks, our toolkit has the ability to extract many, many features in a very small amount of time.

It's very efficient, it's written in C++, and it also provides standard out-of-the-box feature sets which we have been using at various challenges where researchers compare their algorithms against each other, so we're really setting baselines with that toolkit.

Wonderful, and we'll talk about that in depth for sure; that's something that will interest those on the research side of things. But it evolved into a commercial side as well, you say?
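The approach Florian describes, summarizing fast-changing frame-level measurements into a fixed-size feature vector, can be pictured with a minimal sketch. This is not openSMILE's actual code or feature set, just the general "functionals over low-level descriptors" recipe it popularized, in plain Python:

```python
import math

def frame_energies(samples, frame_len=400, hop=160):
    """Low-level descriptor (LLD): short-time log-energy per frame
    (25 ms frames with a 10 ms hop, assuming 16 kHz audio)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energies.append(math.log(energy + 1e-10))
    return energies

def functionals(lld):
    """Summarize a variable-length LLD contour with fixed-size statistics
    ('functionals'), so clips of any duration yield the same-size vector."""
    n = len(lld)
    mean = sum(lld) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lld) / n)
    return {"mean": mean, "stddev": std, "min": min(lld),
            "max": max(lld), "range": max(lld) - min(lld)}
```

Real feature sets such as GeMAPS apply dozens of functionals to dozens of descriptors (pitch, loudness, spectral shape), which is why openSMILE can emit thousands of features per clip; the fixed-size output is what downstream machine-learning models consume.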

openSMILE was kind of our basis, and then we've been building commercial products around it. You can think of it as: we're doing the feature extraction, the signal processing, and then we're adding in modules for emotion recognition, for age recognition, for gender recognition, for detecting the presence of voice or not, and that becomes the product in the end. So we have a product that we call the sensAI Web API, where all these modules are available, and it's easily usable as a RESTful service, like any other speech processing or speech recognition service: you send an audio file to our service and you get the results in JSON format, with emotions, age estimates, gender estimates and speech segments. We also have an embedded package, which we call sensAI Embedded.
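As a rough illustration of consuming such a RESTful analysis service: the JSON payload and field names below are hypothetical, invented for this example (the actual sensAI response schema is not described in the conversation), but they show the send-audio, get-JSON-back workflow Florian outlines:

```python
import json

# Hypothetical response from an emotion-analysis REST endpoint.
RESPONSE = """
{
  "segments": [
    {"start": 0.0, "end": 2.4, "voice": true,
     "emotion": {"arousal": 0.62, "valence": -0.15},
     "age_estimate": 34, "gender_estimate": "female"},
    {"start": 2.4, "end": 3.1, "voice": false}
  ]
}
"""

def summarize(payload):
    """Extract (start, end, arousal, valence) for each speech segment,
    skipping segments where no voice was detected."""
    data = json.loads(payload)
    results = []
    for seg in data["segments"]:
        if not seg.get("voice"):
            continue  # non-speech segment: no emotion scores to read
        results.append((seg["start"], seg["end"],
                        seg["emotion"]["arousal"], seg["emotion"]["valence"]))
    return results
```

In a real client, the payload would come back from an authenticated HTTP POST of the audio file; here the response is inlined so the parsing step stands alone.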

And that's something that makes us unique, because most of our competitors focus on APIs and cloud technology. However, as I mentioned before, our openSMILE core technology is really efficient: it can run on mobile phones, it can run on ARM platforms such as a Raspberry Pi, or even on hardware with somewhat lower resources. So we can actually run all our emotion recognition models on very low-resource hardware, and run them in real time there. That's sensAI Embedded, and for mobile phones there's the sensAI Mobile package, so you have the SDKs as well.

That's really interesting, the embedded side of things, because I know that's a big push now with the growth of IoT and public awareness of the need for privacy. An embedded solution obviously avoids all of the

transmission over the network, so that's a huge advantage over the competitors.

Well, you mentioned privacy, and this is really important to us, it really matters, because of the things we're analyzing. So far we've been talking about emotions, but it could go further, into the health space; this is really sensitive data, and when we want to get into those use cases and actually make people benefit from this technology, then I think the only sensible way forward is to do it embedded, and to actually preserve people's privacy instead of streaming every word that we say to a cloud.

What's the difference between cloud-based and embedded in terms of processing time?

For us there is no performance difference, because all the stuff that we do we can also run embedded; it's not that we would need big clouds to run

deep learning models on multiple GPUs or something.

Is that because, say, a facial recognition system for instance would need more horsepower?

Well, in vision you have two dimensions and in audio you have one dimension, but this doesn't by default mean that audio is less complex. For example, if we look at speech recognition, we can also have very complex models, because they support, I don't know, a million or ten million words that they can recognize, so there's a big language model and a huge search path that we have to go through. In our case it's different: we still have complex models, but they are easier to compute and kind of smaller.

OK, coming on to the uses of it: what kinds of problems does it solve, and what kinds of clients do you work with?

Well, there are a lot of application areas for our technology. I would start with the first one that we've been working on, and still are, which is market research. In market research you have situations where companies want to test new products, concepts and ideas, so they reach out to certain panels of people, present the products, do some interviews face to face or in groups, and then get feedback from those groups about how they perceive the new product. Now, the traditional method is basically that you fill in a questionnaire where you rate on a scale from one to five how much you like the product and which aspects are good and bad, then you fill in a few comments, somebody does some statistics on that, and in the end the company initiating the study gets a report saying, well, this product is

better than the other one. But what they completely miss is the emotional attachment to products. If we are doing a voice interview face to face, the way we talk is something that is influenced by our subconscious. So maybe I think a product is really boring and I don't really like it, but I still want to be polite, so I say, yeah, that's a good idea and I think it will be successful. But if I'm really enthusiastic and attached to something, I say out loud, wow, that's so great, and then I start talking about it and I put my full energy into it.

And that's a completely different message. That matters, because you just can't ask your audience what they think of your products: they know you want them to say it's great, so you can't rely on what they say, you have to look at what they do. And that's what our technology can do: it simply measures the tonality and emotions in your voice, and of course we add that together with other metrics to create products for market research.

OK, and there's a huge number of other uses beyond market research; we'll go through them: healthcare, medicine, robotics, music. Maybe start with voice assistants and social robotics?

I think this is a very, very interesting field, because currently, when we interact with machines like

a Google Assistant or an Alexa or Cortana or something, these systems just take into account what you're saying. Sometimes they understand you, sometimes they don't, so it still feels non-human-like, a kind of stupid interaction. But if you look at what, for example, Google presented with Google Duplex (sorry, I haven't seen it available anywhere we can use it yet), that's more towards what we think natural conversations can be. And that's exactly where we're adding in this emotional component, so that machines can actually show more empathy and use that information to steer the dialogue.

I know that's the vision of myself and many people listening; we look forward to the day when we can actually talk to a computer and have it behave like another human. But there are so many aspects to that, and one of the things that comes to mind is the fact that all these big players keep a tight hold on the

audio data. I can build an Alexa skill today, but I can't access the audio data of the users; all I get is the text, so I can't use your service to unlock the emotions in the responses. So how do you see the field developing? Do you think the big players will eventually open up this information, or do you see the future of voice emotion analytics coming through open-source platforms like Mycroft? What do you think is going to happen?

Yeah, that's a very good point you have there. I think there is going to be some change in both directions. I think the big players will feel the need to also integrate these technologies, or to open up a little bit in terms of plugins or something. Of course they always hide behind saying it's a matter of privacy and we don't give out the voice recordings from our devices, which is totally understandable,

but then again they give out a lot of other data, so the question is, why not voice too, sooner or later? I think there's going to be some change in the direction that they either adopt these technologies more, license in technology, or develop technology in that respect. And of course it also creates a niche for platforms like Mycroft, and even other mid-size platforms, to allow for more customization, because creating an Alexa skill is nice, but it's really limited in what you can do. You might want to have more privacy, you might want to run it embedded, so there are other companies that offer chatbot solutions that you can customize more, and I think there's great potential to integrate such signals there.

The signal is too powerful for them to ignore for too long, and I'm sure there are many, many developers who have had ideas

about the potential uses for these things, but just can't pursue them on the top platforms as they stand at the moment, so there are many people waiting for this; as soon as it's made available, it will end up in products in some way or another. Could you give us a description of some of the projects or case studies, and walk us through some of the benefits and applications of this?

Absolutely. Our main focus is on B2B, business-to-business; we see ourselves as a technology provider, just because of the fact that there are so many use cases. We focus on what we're good at, delivering the technology, and then license it out to companies that create products with our technology. For example, in market research we are partnering with GfK, the big German and international market research company, and we've been doing projects with BMW, with big telecommunications companies, and with Huawei.

So there are major brands that we've been working with. With GfK, for example, we have developed measures to quantify the emotions in market research interviews. They have launched a product called Market Builder Voice, which also won the BVM Innovation Prize last year, a very prestigious German market research award, and the distinguishing factor was our component that analyzes the emotions from the acoustics of the voice. They call it a passion metric: how passionate you are about a product. That's what we measure, and then they combine it with their other factors in the product.

I see. Once one of these firms showcases its ability to do this kind of thing, all the others will need to jump on board, and then it becomes the new baseline.

Exactly, exactly. And the nice thing

is that it really scales, so it is culture- and language-independent. Specific cultural differences exist, but we have methods to do some baseline normalization, especially in market research.

That's a very interesting question, because isn't it very culturally specific, the intonation and the way we speak, the speech patterns? I'd have assumed those things aren't inherent to all humans, so how can you detect them universally?

Exactly, so there are things that are definitely culture- and language-specific, and intonation is one of them, but it's not actually intonation that we're looking at for the emotions. Maybe at this point we have to go back one step, and I have to explain a very basic concept of emotion representation in psychology. There are two dimensions, very simple, two dimensions called

activation and valence. Activation tells you how aroused or activated an emotion is: think of boredom as being very low activation, and anger, happiness or excitement as having very high activation. Valence describes how positive or negative an emotion is: happiness versus anger, or boredom versus a calm and relaxed state. These two dimensions describe a continuous space in which you can place quite a few emotion categories.

So valence is how positive or negative the person feels about something, and arousal is the degree to which they feel that positive or negative feeling?

Exactly, how expressive they are when they actually show their emotions, and maybe how intense the emotion is. Now, the activation dimension, coming back to the culture and language independence:

there is evidence that even animals show emotions, and show something like activation or arousal. If you look at the psychological literature, it's actually a proven fact that the expression of arousal in the voice was there before humans specialized and developed different cultures and languages, and especially before our current languages developed. So there are common traits and common properties in the voice that are shared universally, at least for this one dimension. For the valence dimension it's a bit different, because whether something sounds positive or negative does indeed depend on how we've been raised, on what our impression of positive and negative is. So there are some components that generalize across cultures, but there are also finer details

that are then culture- and language-dependent. But we can actually partial those out: we can have a predictor for arousal or activation with culture-independent parts, and then we can have specific predictors for the valence parts, some culture-independent, and then separate ones for the more fine-grained parts that are culture- and potentially language-dependent.

I see, so you predict each of them separately and then fuse the results together?

Correct.

There was another part of a study that I was particularly interested in, on the med-tech side: the detection of Parkinson's disease. Could you tell us a bit about what happened there?

Absolutely. So Parkinson's and other neurodegenerative diseases have properties such that, for example, your motor coordination, the control of your muscles and motor activities, is degrading or changing, and of course,

since our voice is also controlled by muscles, it affects the way we speak and the voice itself, and there's substantial evidence that certain voice features are indicators of Parkinson's disease. So in this project we gathered data from people with Parkinson's disease diagnosed by doctors, and from healthy subjects of a similar age group. They all did a more or less standardized voice test, that means one or two minutes of recording with several items in it, and from these recordings we were able, with 92% accuracy in a speaker-independent cross-validation, to detect whether a person is suffering from Parkinson's disease or not.
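The phrase "speaker-independent cross-validation" is the key methodological detail here: recordings from one speaker must never be split between the training and test sides of a fold, or the model can cheat by recognizing the voice rather than the disease. A minimal sketch of such a grouped split (illustrative only, not audEERING's evaluation code):

```python
def speaker_independent_folds(recordings, n_folds=5):
    """Assign whole speakers to folds, so no voice ever appears in both
    the training and the test side of any split.
    `recordings` is a list of (speaker_id, clip_id) pairs."""
    speakers = sorted({spk for spk, _ in recordings})
    fold_of = {spk: i % n_folds for i, spk in enumerate(speakers)}
    folds = [[] for _ in range(n_folds)]
    for spk, clip in recordings:
        folds[fold_of[spk]].append((spk, clip))
    return folds
```

Each cross-validation round then trains on four folds and tests on the held-out one; because the partition is by speaker, the reported 92% reflects generalization to unseen voices. (Libraries like scikit-learn provide the same idea as `GroupKFold`.)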

Detection of Parkinson's from a one-to-two-minute recording, and I know that the recording time is especially important here: you don't need to collect an hour's worth of audio from the Parkinson's patients to run the analysis. It's really important that the test is short and easy to administer, so I congratulate you on that result, it's fantastic.

Absolutely, and actually the segments we then used in the end were even shorter than the one to two minutes; I think it was even in the range of a few seconds.

Amazing. I mean, that's really approaching the point where we can perform this kind of detection passively, just in the conversation with the doctor: we could record the audio and wouldn't even have to assign a specific task. Is that a realistic possibility in the near future?

I think it is a realistic possibility.

However, in the medical space there is of course always the requirement that we validate the technology and have controlled conditions, and in a free conversation we might not have that. So it might even be better to assign a task and do a specific test, just like we measure blood pressure with a certain device under certain conditions: you have to hold your arm at a certain height in order to have the guarantee that it works as expected, because you need to be absolutely sure.

So that was Parkinson's detection and monitoring through audio analysis of the voice, but wasn't this part of a project involving smart glasses that would also track the facial expressions of the user as well?

Yes, but that part was not ours; it was part of a bigger project in which our partner was developing those glasses for monitoring the facial expressions and movements of the user, among other things, but we were not involved too much in that.

I see, so the data from the facial expressions and the audio analysis were not combined? There was no joint analysis or fusion of that data?

Exactly, so the 92% we mentioned was from audio only, which is quite remarkable.

It is, that's right.

I also don't see it primarily as a means of diagnosis, so it's not that you talk to your doctor for ten or twenty seconds and then he tells you whether you have Parkinson's, yes or no. That might at some point be an application in addition to other diagnostic measures; however, I see more of the value in monitoring the progress of the disease.

Whenever you're diagnosed with it you will get medication, and we don't know whether you're getting the right doses of medication, which is really difficult, because often you would go to the doctor and visit them once every two months or something. But what have you been doing in between, how has your progress been? That's what we can actually track with voice recordings, because they can be done so easily, and we can process the data on the device, so it doesn't even have to leave the device. Then we go to the doctor and we have a kind of report from the device telling the doctor how we have performed and whether there have been changes, and I think this is really where our technology adds value.

Yeah, medical adherence is one of the biggest problems, I think, in chronic conditions: patients are given the medicine that will help them, and then they don't take it. People who suffer from cardiac arrest go home and don't keep up with the medication, they forget it; it's just an ongoing problem, an extremely difficult one to solve. I can see uses for this: just as diabetics log their blood sugars, patients could stay at home and log how they feel with their voice as they do that.

I see on the website that you have what looks like a mood diary, which helps patients to actually do that, to log how they feel day to day for chronic conditions?

Yes, it's a research product that we have, which we use in various of our research projects on the clinical side. It's kind of a technology framework which helps patients, as you said, to log their diseases and do some voice recordings, and then we use our sensAI Web API technology to analyze those recordings in the background, or we use sensAI Embedded in the app, to provide them with graphs of how they have been feeling, for example.

OK, so it's specifically around mood detection, using the emotion recognition technology that you have?

Yes, but it also provides the potential to associate this with certain other diagnostic measures.

OK. The last product

is the sensAI, or, I'm not sure how to pronounce it, sensAI Music?

sensAI Music, yes. So sensAI Music is a collection of all the algorithms that we offer in the music space, because we found that many algorithms for speech can be reused for music, and vice versa, applied from music to speech. So we have basic technology to analyze the tempo and the meter and to track beats in music; for example, you can use that to animate avatars or robots to sync with the music, so you can have robots dance to the rhythm of some music coming in. The next thing is,

we're focusing on emotions, and music conveys a lot of emotion, so that's the analysis of mood in music: you upload a library of your songs and then say you want a happy song now, and the system can select a happy song for you. That's a big topic, and we've seen products; I think on some Android smartphones, Samsung smartphones I believe, there was kind of a mood square where you could select songs and it would play from your library. So that's indeed something that people are looking for.

This is amazing, you know. I remember talking about this a while back; I predicted that one day nightclubs will be run by robots, analyzing the mood of the crowd and then selecting and mixing the tracks according to the reaction of the crowd, so it would be completely automatic, and we'll reach some

new level of nightclub experience because of that, because the music will be so personalized to the audience. That would be amazing. Absolutely, I mean, in essence that is what DJs do, right? They analyze the mood of the crowd, they add in their own expertise, and then they choose the songs. That's another job under threat then, hey? I don't think they will be fully replaced, because there's still an art to it, so you want to have the person there. But a smaller club, or somewhere they would now just run a playlist, in a restaurant or a bar or a smaller club, they might well be replaced.

It's interesting, I wonder if it could be used in various settings. Like, a McDonald's could modify the music that's playing in the restaurant so we all eat quicker and get out. Potentially, potentially, there are a lot of things you could think of; let's see which ones become reality. Yeah, because I don't think voice emotion analytics is particularly well known in the public consciousness, but it is a very fast-growing field. I looked online and there are quite a few market reports out there. One of them actually predicted that from 2016 to 2022 there's going to be a compound annual growth rate of 82%, so it really sounds like it's shooting up.

Okay, listen up: if you haven't built an Alexa skill yet, what are you waiting for? Check out voicetechpodcast.com to build a free Alexa skill for your podcast, music or audio content. You're going to find easy-to-follow, step-by-step instructions on how to build and publish a free Alexa skill for your podcast, audio or music content. Listeners will be able to ask Alexa to play your latest episode, hear a list of episodes, and play a specific one. It's a really beginner-friendly tutorial, all done in the web browser, so come online and check it out at voicetechpodcast.com/skillguide. If you've already got a podcast or your own content, then you can grow your audience on Alexa by publishing it on the latest and greatest voice technology platform. Or if you're thinking of building something like an ambient sound skill, like Nick Schwab of Invoked Apps has done with great success, now's your chance. The process is totally customizable, uploading your own audio files is a great first project, and you're sure to impress yourself and your friends. And the best part

is that it's 100% free: the guide's free, the tools are free, and the hosting's free. So go now to voicetechpodcast.com/skillguide.

Okay, now let's get back to this amazing episode.

Let's talk about the technology then, openSMILE in particular. It's a research toolkit for large-scale audio feature extraction. Could you tell us, what does openSMILE do, and how can we use it? So it is intended, actually, maybe for the average PhD student working in the field of audio analysis or computational paralinguistics. Basically it's a command-line toolkit, so you have to be familiar with using a Linux or Windows command line. You would then write the feature set that you want to extract as a so-called openSMILE configuration file. You can also think of it as a script, or a definition file, that describes which features you want to analyze.
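For concreteness, running the toolkit from the command line looks roughly like this. `SMILExtract` is the real openSMILE binary name, but the config and file names below are illustrative placeholders and the exact paths vary between openSMILE releases:

```shell
# Extract the features defined by a config file from one WAV recording.
# -C selects the configuration (the feature-definition "script"),
# -I the input audio, -O the output file for the extracted features.
SMILExtract -C my_feature_set.conf -I speech_sample.wav -O features.csv
```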

Could you just describe, what is a feature? How do you describe a feature? So a feature is a parameter, one parameter that we measure from the audio signal. For example the signal energy, as a very simple measure, or the fundamental frequency, so the pitch of your voice. Then you can also take, say, the mean value of the signal energy over a segment of a certain duration, like one or two seconds, or a word or a sentence. So there are inherent properties of the signal, and derived features based on those inherent properties.
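The distinction between an inherent signal property and a feature derived from it can be sketched in plain Python. The frame sizes and function names here are illustrative stand-ins, not openSMILE's implementation:

```python
import math

def frame_energy(samples, frame_len=400, hop=160):
    """One low-level descriptor (LLD): short-time signal energy,
    one value per 10 ms hop at a 16 kHz sample rate (illustrative sizes)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies

def mean_functional(lld_values):
    """A 'functional' maps a variable-length LLD time series to one number."""
    return sum(lld_values) / len(lld_values)

# Toy input: one second of a 100 Hz sine wave sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 100 * t / 16000) for t in range(16000)]
energies = frame_energy(signal)              # inherent property, per frame
segment_feature = mean_functional(energies)  # derived feature, per segment
print(round(segment_feature, 3))             # 0.5 for a unit-amplitude sine
```

The per-frame energies are the inherent measurement; the single segment-level mean is the derived feature used for classification.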

Yes, exactly. I hadn't heard of audEERING before we arranged this, but I have heard about openSMILE; it's quite famous in the research community. Looking on your site, it's actually the most widely used toolkit for audio analysis tasks such as emotion recognition, and it's been downloaded a hundred and fifty thousand times, which is huge, with eighteen hundred citations. So yeah, we should underline how important this is; it really is seen as the state of the art in affective computing for audio. And INTERSPEECH chose to use this as the baseline speech feature extractor for the speech emotion challenge. What does that mean? So the INTERSPEECH emotion challenge started in 2009 with actually the first emotion challenge. That was the first attempt to provide a dataset to the research community on which

all the research teams participating in this challenge could compare their algorithms for the task of emotion recognition. We had five emotion classes there to compare under the exact same conditions. After that it broadened into a more universal computational paralinguistics challenge, and it has continued since then every year at INTERSPEECH, and there were some other challenges at other venues too. Okay, so in the challenge the participants come up with new algorithms to detect the emotions based on a standardized set of features, and they use openSMILE for that. And the data is standardized too, so the recordings that they analyze and the recordings that they can use to train their systems are fixed, so we know everybody is working under the same conditions? Yes, exactly, and that's where openSMILE comes into

the game. With openSMILE we have provided standard feature sets so that you don't have to start from scratch when you work in this field, writing your own code to measure signal energy and fundamental frequency and so on. If one person uses this toolkit and another person uses another toolkit, the results may end up not being comparable. That's an important point: at the end of the test you want to be sure that it's the algorithm that really made the difference, and not that they used different feature extractors, because the difference might be due to that as well. Exactly. I mean, of course a lot of people do still introduce new features, or they add a new feature as part of their algorithm. That's absolutely okay, and it makes the field very interesting. But by providing a standard baseline set, everybody could simply participate using that feature set and then focus only on the machine learning part. So we got many

more people participating in the challenge, because the machine learning people mainly said, well, we don't know much about signal processing or speech features, so we're a bit scared to participate in a speech challenge. But once it becomes more of a machine learning and big data problem, then you attract a lot of other researchers, and that brings the whole field forward more than previously. What about the GeMAPS paper, which is the standard acoustic parameter set recommendation for voice research and affective computing? Yes, exactly. Let me explain maybe the history of GeMAPS a little bit. It stands, first of all, for the Geneva Minimalistic Acoustic Parameter Set, because the effort was led by Professor Klaus Scherer from the University of Geneva, and it was based on his work and that of his

colleagues working in this space, and also other literature. We analyzed all of that to find out which features were important in the literature, features that were commonly used and provided good correlations to emotions and similar aspects in the voice. And then we tried to keep it small, because the other openSMILE feature sets are quite large, as I'll talk about in a minute. For GeMAPS we decided to keep it quite small and to describe all the features in the set quite well, so that not only people doing machine learning with big numbers of features on big datasets, but also voice scientists working on individual aspects of features, could use it. And then we also put out an openSMILE configuration file for GeMAPS, so that everybody could easily reproduce it, because they could use the open-source research version of open

SMILE and get exactly the same values. Otherwise some people can't extract the features, because in a paper somebody only describes "we used these three features" but you don't know how to reproduce the exact values. I see, so the GeMAPS paper recommends the features, but then also links to openSMILE so that people can access those features in exactly the right format? Exactly. The paper is about the recommendation, then the evaluation on certain emotion recognition tasks, comparing it to other feature sets, and also the implementation in openSMILE, so that you can actually get the GeMAPS features out of the box. And you mentioned the larger feature sets? Correct. So actually one central element of openSMILE is the so-called feature brute-forcing. That is, we combine

low-level descriptors, which are features that we extract periodically, typically every 10 milliseconds, from short frames, like signal energy or fundamental frequency as examples, with some higher-level time unit, a word, a sentence, or a fixed time frame. We then apply so-called functionals to it, and these functionals are things like the mean, the standard deviation, or any function that would map a time series of features of variable length to a single value. And that way, if we have, say, 100 low-level descriptors and 200

functionals, that makes a large space of 20,000 features. So as one example, the ComParE 2016 feature set, which is the latest feature set from the ComParE 2016 challenge and has been applied to several tasks, not only emotion recognition but age recognition and various health aspects, that feature set has over 6,000 parameters, or features. Interesting. That's a problem, because I understand from machine learning that the more features you have, the more chance there is of overfitting the model, and the longer it takes to train. So do GeMAPS and the small sets help researchers reduce those to just the most effective, the most essential ones? In some sense, indeed: the more features you have, the more complex your model

will be in the end. However, in practice we see that with feature sets like ComParE 2016 and appropriate classification algorithms that can handle large-dimensional feature spaces, for example support vector machines, but also neural networks and the like, there are no problems even if we have very, very small datasets. But indeed, GeMAPS was an attempt to reduce it specifically for voice emotion recognition. The nice thing about ComParE 2016 is that it has a very broad applicability, so you can take the set and classify, for example, whether somebody has Parkinson's or not, you can take it to classify the emotion of the person, and you can take it to classify the age of a person, out of the box.
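The LLD-times-functionals "brute-forcing" described above can be sketched in a few lines. The LLD values and the choice of functionals below are toy stand-ins, not openSMILE output:

```python
import statistics

# Two toy LLD time series (per-frame values), stand-ins for the ~100 real ones.
llds = {
    "energy": [0.2, 0.4, 0.3, 0.5],
    "f0_hz":  [110.0, 115.0, 112.0, 118.0],
}

# Four functionals, stand-ins for the ~200 real ones.
functionals = {
    "mean":  statistics.mean,
    "stdev": statistics.stdev,
    "min":   min,
    "max":   max,
}

# Brute-forcing: apply every functional to every LLD series.
features = {
    f"{lld}_{name}": fn(series)
    for lld, series in llds.items()
    for name, fn in functionals.items()
}

# 2 LLDs x 4 functionals -> 8 features; 100 x 200 would give 20,000.
print(len(features))            # 8
print(features["energy_mean"])  # 0.35
```

The feature count is the simple product of the two dictionary sizes, which is why the full sets grow into the thousands so quickly.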

Of course you can always optimize your feature set for a specific use case and get a better accuracy, but this is something that is very universal, and it works well for many applications, as we have shown in the baseline tasks of all those computational paralinguistics challenges. Right, so when you measure the performance, you're measuring the emotion recognition accuracy against some kind of reference database. You mentioned before that you standardize the data in the INTERSPEECH competition, so everyone is testing on the same database? Yes, correct. There are a few public reference databases. But I saw an article mention that the databases used in academia for these competitions often don't represent real-world audio conditions, so you couldn't even use the same model,

trained on academic reference databases, for a real-world voice assistant. So if you are developing applications, where do you get the data from? And is it better to train models on data from the real world? We get data from the real world, preferably. That is, we work together with customers that have data and build models on their data, specifically for their use cases. We also collect our own data, and we have a large team of annotators in the company that help us build up datasets.
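The generalization gap between lab corpora and real-world audio that comes up here can be illustrated with a toy evaluation protocol: train on one corpus, test within it, then test on a second corpus recorded under different conditions. Everything below is synthetic and illustrative:

```python
# Toy within-corpus vs cross-corpus evaluation with a 1-D nearest-centroid
# classifier. Data, labels, and numbers are all made-up illustrations.

def train_centroids(data):
    """Mean feature value per emotion label."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_label.items()}

def accuracy(centroids, data):
    """Fraction of samples whose nearest centroid matches the label."""
    hits = sum(
        1 for x, y in data
        if min(centroids, key=lambda lab: abs(x - centroids[lab])) == y
    )
    return hits / len(data)

# Corpus A: "angry" speech has high energy. Corpus B: different recording
# conditions shift the energy levels (a domain mismatch).
corpus_a = [(0.9, "angry"), (0.8, "angry"), (0.2, "neutral"), (0.1, "neutral")]
corpus_b = [(0.45, "angry"), (0.4, "angry"), (0.05, "neutral"), (0.1, "neutral")]

model = train_centroids(corpus_a)
print(accuracy(model, corpus_a))  # 1.0  (within-corpus looks perfect)
print(accuracy(model, corpus_b))  # 0.5  (cross-corpus collapses)
```

The same trained model scores perfectly on held-out data from its own corpus and near chance on the shifted one, which is the overestimation effect Florian describes later when discussing cross-corpus testing.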

I see. Now, with the latest voice assistants coming out with screens, and the recent announcement by Amazon of APL, the Alexa Presentation Language, multimodal is becoming more and more of a thing: paying attention to the user's voice commands in conjunction with what they're looking at on screen, and possibly other modalities as well. Have you done any projects using multimodal inputs, and what were the specific challenges?

Yeah, so far multimodal for us means more the input side, so not necessarily screen plus voice command, but actually looking at the face, looking at the voice, and looking at signals we would get from physiology, heart rate and skin conductivity for example, in order to get a complete picture of the emotional state. Because sometimes you might have voice, sometimes you might only see the face, so there are different modalities that you need in order to get a really comprehensive picture of the emotional state of a person. And in order to operate in this world, we work with partners to offer a fused technology. So we do the audio side, both in terms of analyzing the emotions from only the audio, but also providing features

and intermediate results, which a partner in a multimodal product can then use and fuse with other modalities such as the face, or facial action points for facial emotion recognition. Interesting, interesting. So you actually create new features based on your analysis that are used in a subsequent machine learning model, one that incorporates the different facial expressions or other body signals, and which ultimately makes a decision in the product? Absolutely, yes. I can see that smartwatches really aren't going away, a lot more people are wearing the Apple Watch now, and the numbers on that are pretty impressive, and I could definitely see applications for, for instance, call center agents, and other controlled situations where we have reliable access to

all modalities for the extended duration of the interaction. Whereas out in the consumer world, I think it might be a lot more piecemeal: you might not always have all those modalities available. But someone like an agent in a call center could quite easily wear an array of sensors, so that we could measure their emotional state correctly. Exactly. Or if you think more about security-relevant applications, such as flight control or air traffic control, it's more about stress. And also if you look at robotics: I mean, in human-to-human interaction we typically have face and voice, and it's actually a proven fact that if you can see the lips and face of the other person when you talk, then in less-than-great acoustic conditions, or with high levels of background noise, you can understand the person much better if you see the face and hear the

audio signal, than if you only hear the audio signal and do not see anything. And also for emotions: you could be saying something in a very positive way but look kind of angry, or you could say a positive word in an angry tone. That can hint at the detection of irony and sarcasm, and the interplay of the modalities is then the key to getting these very fine-grained nuances of social and emotional communication. A number of interesting things there. I mean, lip reading: an open-source library that defines a standard set of features for lip reading, is that state of the art, does that exist already? No, it's more niche, I would say. Nobody has really taken lip reading, to my knowledge, to something like we have for speech recognition. There are libraries like OpenCV for

visual analysis, but I wouldn't be aware that there's an out-of-the-box solution for reliable lip reading. Of course, if there is one out there, I'm happy to hear about it; send me an email. All right. The next big subject then is natural language processing, and in particular sentiment analysis, so this is the linguistic analysis of the actual spoken text. Could you just briefly explain how that differs from what you guys do, and whether you've ever worked with systems that incorporate both NLP and voice emotion analysis? On the text side, we actually just touched on this before, because as I said, when you say some text like "this is good" in a bad intonation, or in a bored intonation, saying "this is good",

or "this is good", then you're creating a mismatch between the text content and the way you're pronouncing things. And there are actually three concepts in what you asked. NLP is natural language processing, so it's the topic of dealing with natural language and inferring meaning from it. So when you say a command to a voice assistant, you're using natural language and not a well-defined command structure, so it has to be a bit tolerant to what you say, and still get the meaning of it and infer the meaning. Or you might want to infer from a text or a conversation the rough topic of what it's about. So these are topics of NLP. Then there is sentiment analysis, which is of course also a part of NLP, and sentiment analysis refers to emotion recognition from text, basically whether a text is positive or negative. I think it

started more or less from reviews. So when you have a website and you have reviews, movie critiques or something on there, that's the well-known case. There you can actually have a good sentiment analysis applied, and then get scores for whether the reviews are positive or negative, and that works really well. And often people sell it as emotion recognition, which is true, I mean, it is emotion recognition from the text, but it's only like half of it, because if you ignore the way things are said, you forget all the things like irony and sarcasm, and you are also missing information on how things are said, so all the things like the arousal and the passion I've been talking about before. I see. A true emotion analysis would then incorporate all the modalities:

the NLP side, the paralinguistic side, the lip reading, the gestures. I mean, surely we're moving towards a world where we'd need all of these sources, fused in a multimodal way, to accurately predict someone's feelings? Yes, I fully agree, indeed. And although our background is in the acoustic signal processing and the machine learning there, we are also fusing these results with text analysis. So in our sensAI web API product we actually have a module for text-based sentiment classification, a valence value from text, and then we rely on third-party vendors for speech recognition, or on customers delivering us the text transcriptions in addition to the audio recordings, and then we estimate the valence values and give them a combined score, which really adds value as well.
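A minimal sketch of the kind of late fusion described here, combining an acoustic valence estimate with a text-based one. The values and the fixed weight are made up for illustration; a real system would learn the weighting or condition it on per-modality confidence:

```python
def fuse_valence(acoustic_valence, text_valence, w_acoustic=0.6):
    """Late fusion of two valence estimates in [-1, 1].
    The fixed weight is purely illustrative, not a production choice."""
    return w_acoustic * acoustic_valence + (1 - w_acoustic) * text_valence

# "This is good" said in an annoyed tone: positive text, negative voice.
combined = fuse_valence(acoustic_valence=-0.4, text_valence=0.8)
print(round(combined, 2))  # 0.08, the tone drags the text score down
```

The point of the example is the mismatch case: text-only sentiment would report strongly positive, while the fused score is close to neutral because the acoustic channel disagrees.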

What advice would you have for potential PhD candidates who are excited by this area and thinking about doing a thesis in voice emotion analytics? What would you say are the things to consider, and what are the most promising, attractive, active research areas in emotion recognition right now that they should look at? So, what they should consider before embarking on this adventure: I think it is really understanding what you're up to. So maybe start with a Master's thesis or something in that area, to know what it is like to do research and to read papers in the area specifically, so you know what's going on. Obviously you cannot read all the papers, but really start with some approach: understand how papers are written, understand how things work,

then maybe try to download some open-source toolkits, and really try to get a feeling for whether this is the right thing for you or not. And then, if you decide yes, that's it for me, then of course be aware that the field is changing fast, and new things are coming up, new trends. So try to find what's most interesting to you, and also where you can get funding, or where you can get a PhD position. And then, the hot topics that in my opinion are still outstanding: one

is that a lot of research has been done on individual databases in the research world. So there's the Berlin emotional speech database, which is really very old and well known, but a small, acted dataset. And there are datasets like the SEMAINE dataset that are bigger, from the EU project, with more naturalistic emotions. But anyway, most researchers take a dataset, or take two or three datasets, evaluate their algorithm on them, but they train on a portion of the dataset and test on a portion of the same dataset. I mean, the portions are disjoint, but it's still the same conditions, the same emotions and so on, and then they report the accuracies. This really overestimates the true performance of the algorithm. So what we essentially also need is research that tests across domains. I mean, there has been research in that direction, but then you see that you get 80% on the Berlin emotional speech database in cross-validation,

and then you apply it to another database and your accuracy drops to 40% or something. So integrating this into your testing procedure, and focusing more on cross-corpus performance, is one thing. And the description of emotions is another thing. I mean, this one is probably more for psychology students, not really for the machine learning and signal processing people, but having this connection between, say, technology people and psychology people that know how to describe emotions: what do we expect? So this is a bottom-up approach to defining what we should expect from an emotion, and looking for that, as opposed to just classifying data based on some kind of label that was applied before? Exactly. So if you just take the basic six emotions or something

and classify those, can we really use that in a product? No, it's quite limited. And also, anger is one of the basic emotions, but there are so many different expression variabilities of anger. So I think learning from psychologists how the problem is described there, and then thinking about machine learning approaches for that, is also an area where some research is still required. And last but not least, I think, is unsupervised learning and reinforcement learning. Google has been publishing some papers on that, and it has really helped them, because obviously they have lots of data, big databases, and you need more and more data for it. Because if you only have labels where you're not really certain whether they're correct or not, or if you have labels just saying that was right or wrong, but you have a more complex problem that you want to model, you need massive data. But that's the future, because we as humans,

we can learn from very few data points. Interesting, yes. So somebody tells you, do it this way; next time maybe they tell you again, do it this way, and then you know how to do it. Or you learn a new word, or you learn to recognize a new voice: you hear it once and you immediately know it again. Our machine learning systems need hours and hours of data to train, typically, and we haven't achieved this yet in the community. So I think there's potential for a lot of cool new findings there. And finally, what do you think is the ultimate potential of voice emotion analytics specifically? What's the vision that keeps you excited about the field and drives you forward?

The vision is that at some point we can make robots, and I mean they will be entering our daily lives, they will be there to support us when we are old, hopefully, and we can make the interaction with them more social, more empathic and more natural. I mean, the interfaces of the voice assistants right now are cumbersome: "I didn't understand you, I'm so sorry", and then you get mad at it and shout at it and throw it out of the window or whatever, and it's not impressed by that. But if you have some social companion robot, and you come home, and it tells you, you sound a bit stressed out, so maybe some relaxing music: something that's more seamless. I think that's a vision that we can work on, and that's the vision audEERING is working towards.

Obviously that would remove the friction between humans and technology, and enable us to accomplish a greater number of things, and more complex tasks too. When the interaction is frictionless and seamless, you're actually more happy with it, you're less stressed, because it just feels natural. So it in turn makes you more happy, and you feel better, and maybe we're less sick. It would just improve the conditions under which we use technology, which is most of the time these days. So yeah, this is it: technology is there to assist us and do what we want, and not that we have to spend hours and hours teaching technology to do the things the way we want, and then it doesn't work, and then we get error messages, and then we do it the good old-fashioned way instead of using technology. That would be the vision: that it really is there to support us, but that we're still in control of everything, of course.

Yeah, and I think once these devices become really lifelike, to the point where they're almost indistinguishable from a human, okay, maybe right now you obviously know, but they won't always be, and there will probably be an obligation to announce whenever you are talking to a robot, like we saw recently with Google Duplex. It will in some ways, I think, help make us appreciate real human contact that little bit more. Because we'll be talking to computers as much as we're using our phones and desktops and laptops right now, which is all the time, and so real human interaction will become even more precious, and I think it will help us appreciate the difference, and appreciate each other, a little bit more. That's my view anyway. Yes, I think, I mean, I actually agree that human-to-human interaction is very precious, and we should still preserve that. And what I also see

is that a lot of human-to-human interactions are being replaced by automated robots and agents, but I think for some areas that's fine. So when I want to get some information from the customer support of a company, and the virtual agent can actually help me, give me that information and solve my problem, that's absolutely fine, and then I have more time to actually focus on my real human interactions, with my family, with my friends, and my social environment. And I think, as you say, that's where we should go. So what's on the horizon then for audEERING? What will you be focusing on over the next six to twelve months? We're at the moment releasing a few updates to our products, and we're also integrating text analytics in all the products, down to the embedded product. So there is a roadmap to come, and

I think there'll be quite a few more releases in the next six months, which will also improve accuracy and which will generalize better to different conditions. And most importantly, we're targeting specific industries. So we're currently working on a product specifically for the call center space, so that we're not only providing the technology, but we provide a complete interface and everything, for a customer to use it directly. The same more or less for market research: there's a product that a person can directly use to conduct an interview and then get an analysis report at the end. And also in the smart assistant and embedded space, there's a module that somebody, a software developer, can directly integrate into their products. So it's ready-made interface components, so that the common needs of customers are addressed and

we've got an off-the-shelf solution for them. Exciting; I can't wait to see what comes out. Where can people find out more about audEERING, your work, and the research you're still putting out, like openSMILE? The first thing is to visit our website at www.audeering.com, so a-u-d-double-e-r-i-n-g dot com. From there you'll also find links to the research area and to openSMILE. We have a newsletter, which you can subscribe to, and we also have a Facebook page and a Twitter account, so please follow us there. And I can also announce that we're setting up a new, more community- and technology-focused website, in addition to our corporate website audeering.com. That'll be out there soon, so watch out for it; potentially for Christmas or early January, that'll be out.

Great, so subscribe to the newsletter, go to the Facebook page, follow them on Twitter, and stay tuned. Excellent, we'll pass that on to the paralinguistics community and beyond. All right, thanks very much Florian, it's been a pleasure and a fantastic episode. I really enjoyed that, I really appreciate your insights, and I wish you the best of luck in your future projects. The pleasure is also on my side; it was very, very enjoyable to talk to you, and I really enjoyed doing this recording with you.

Okay, so you just heard from Florian Eyben, the CTO of audEERING. That was exciting and humbling, to talk to such an experienced and knowledgeable engineer as Florian. It was a great opportunity to learn how the state-of-the-art systems being built today actually work, and I think it gave some insight into the minds of their creators, and into how companies such as audEERING manage to create the impact that they have had on the field, and the leadership that they continue to show with their new research and products. So that's all for today. I hope you enjoyed listening. As always, you can find the show notes, with links to the resources mentioned in the episode, at voicetechpodcast.com. If you'd like to support the show, just tell one friend or colleague about this episode. Have a think about any computer science students, new developers, product managers or startup founders you know, in or around your social circle, who would benefit from learning about the technical side of voice, so that they can understand and appreciate the technology.

Future episodes will continue to focus on the companies moving in this space, on researchers and the research they're producing, and also on specific technologies, to help reveal what's going on behind the scenes of these AI products. So please help me spread the word. As I've already mentioned, there's a guide on how to build a free Alexa skill for your podcast or audio content; if you know anyone who's looking to do that, then please, by all means, recommend the guide. Like I say, it's easy-to-follow, step-by-step instructions, a beginner-friendly tutorial all done in the web browser, so come online and you can get it at voicetechpodcast.com/skillguide. If you want to demo it, there's actually a video on that page. Or even better, you can just say "Alexa, Voice Tech Podcast" and you'll hear my podcast on the very skill that you can build yourself.

I'll be back soon with another episode, but until then, I'm your host, Carl Robinson. Thank you for listening to the Voice Tech Podcast.