Xavier Anguera – ELSA CTO

From December 14th to December 18th we attended the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) in Singapore. The ASRU workshop is held every two years in different locations around the world and is a prime venue where advances in speech and language technologies are presented.

This year the workshop reached a record number of participants (approx. 450 people) and had to close registration early because the venue reached its capacity. This is a clear indication that speech technology is currently going through its “golden age”, with lots of new research and practical applications coming up every year. Another record this year was the number of corporate sponsors, which rose to 18 (split into tiers) versus 11 in the 2017 edition and 9 in the 2015 edition.

ASRU technical programme

The technical programme at ASRU is always organized as a “single-track” workshop. This means that there is only one main event happening at any particular time. This is an advantage for people like me who have a broad interest in many areas of speech and machine learning, as I am able to attend all talks and see all the papers I find interesting.

The ASRU workshops are split between poster sessions and invited/keynote talks. In the poster sessions, participants whose papers were accepted to the workshop present their work in front of a printed poster board. Unlike other conferences, where accepted papers are given either an oral or a poster presentation, at ASRU accepted papers are always presented as posters. In my opinion, presenting your work with a poster is a great way to create lots of interaction between the author(s) and the audience, and a great way to understand a work fully, as the audience can ask plenty of questions in one-on-one discussions.

The invited talks and keynotes were given by senior people in areas related to speech and AI, who brought “depth” to a topic they are experts in. The organizers took good care to make sure that many currently relevant areas were covered (hence catering to all audiences and ensuring good topic “breadth”).

The organizers divided the talks into keynotes and invited speeches. I did not see any difference in research/presentation quality between the two, nor in the seniority of the speakers.

I attended all the talks and found all of them very interesting. Here are the six that I found most interesting:

“Multi-Modal Processing of Speech and Language: How-to Videos and Beyond” by Florian Metze – Facebook

Florian’s talk was about the recent advances and challenges in multimodal signal processing, bringing together speech, text and image processing to perform tasks like speech recognition, translation and summarization. Florian showed some very interesting experiments done on YouTube how-to videos, where users provide a sort of ground truth in the summary section of the video, and Florian’s group tried to automatically recreate this information from the long description and from the audiovisual content itself.

“What Makes a Speaker Charismatic? Producing and Perceiving Charismatic Speech” by Julia Hirschberg – Columbia Univ.

Julia talked about how to measure charisma in people’s speech. Her group has done a series of experiments to identify what traits, measurable in someone’s speech, have a high correlation with their perceived charisma. They applied this work to analyze the charisma of US presidential campaign nominees over past elections, as well as to compare other well known personalities. For example, which one do you think was judged as more charismatic, Steve Jobs or Mark Zuckerberg?

“Conversational Machines: Towards bridging the chasm between task-oriented and social conversations” by Dilek Hakkani-Tür – Amazon

Dilek talked about dialog systems, and how the current trend is to switch from task/objective-based systems (where the objective is to retrieve some specific pieces of information, or intents) to more conversational interfaces. Dilek went on to describe the Amazon Alexa challenge around conversational skills, which has been running for a few years now and every year pushes the boundaries of how well the competing universities can build skills that maintain open-ended conversations with real users. If you own an Amazon Alexa and would like to try it out, say “Alexa, let’s chat” and you should be welcomed by one of the participating bots.

“End-to-End Speech Synthesis” by Yuxuan Wang – ByteDance

Yuxuan is the person behind the first version of Tacotron, a DNN-based speech synthesis (TTS) system that revolutionized the speech synthesis area. In his talk, Yuxuan explained what the current challenges in TTS are (he believes they revolve around TTS personalization and the intonation/expressivity of the voices) and played some audio samples of staggering quality, where one can hardly tell real speech from synthetic.

“Biosignal-based Spoken Communication” by Tanja Schultz – Univ. Bremen

Tanja’s work over the last few years has focused on the measurement of biosignals related to the production and understanding of speech. Tanja’s group analyzes the signals obtained by placing sensors right on top of Wernicke’s area in the brain, which gives privileged, direct access to how the brain understands and processes speech. She explained how they are able to analyze the activations received from these sensors and even to produce understandable speech from them. She also discussed silent speech interfaces, where users’ speech is captured from muscle movements and not from the sound pressure waves created when the vocal folds vibrate and air comes out of the mouth.

“Towards Better Understanding Generalization in Deep Learning” by Samy Bengio – Google Brain

Samy’s talk was very different from all the other talks in that he did not have a single slide about speech, but focused on showing his latest work on explaining what happens inside deep neural network (DNN) models. Instead of treating deep neural networks as black boxes, his group analyzes how the network responds to different types and amounts of training data (here he discussed how overtraining/overfitting influences DNN models) and how the different layers are trained and behave when modified. He also talked about the differences between models that learn to generalize concepts given by the training data and those that focus on encoding internal representations of the input training samples.

Closing remarks

Overall, the workshop was a great opportunity to learn lots of new things, get together with people who live very far apart from each other during the year, and enjoy the scenery and food Singapore has to offer (weather permitting, as it rained the whole week). One aspect of the workshop that I found very interesting is that, as many senior researchers have now moved from academia to industry, it is very normal to be having lunch or dinner at a table with a few professors and people from Google, Apple, Facebook, Amazon, ELSA, … all brought together by their interest in learning more about the field, in what new applications/opportunities are opening up with the recent research advances and, of course, in meeting old and new friends.

I am already looking forward to ASRU in two years, and closer by, ICASSP in May in Barcelona, Spain.