by Espen Sommer Eide

There is a tingling sense of presence in the room when I finally press play on the generated audio file and hear my trained deep-learning neural net try to formulate new, never-before-spoken sentences in a language whose last fluent speaker passed away in 2003. Soon after Edison invented the phonograph, it was conceived of as a means not primarily to play music, but to hear the voices of the dead. The voices recorded on the phonograph were experienced as sounds without bodies, as spirits in space. Listening intensely to the sound, at first I can hear only static noise, but deep inside it various spectral shapes and pulses are starting to make themselves present. I think this is what it must have felt like for Edison when he played his first ghost-like recording of a human voice.

Two early versions of the experiment:

Recently there have been big breakthroughs in the field of artificial intelligence and machine learning. Over a period of just a couple of years, the field has found novel uses in everything from self-driving cars and medical image processing to automatic translation, speech recognition and natural language processing. Companies such as Google, Facebook, Apple, Amazon, Microsoft and the Chinese firm Baidu are currently competing to hunt down talent, clearing out whole computer science departments at universities around the globe in order to employ the best minds in the field.

One of the technologies driving this revolution goes by names such as deep learning and deep neural networks. In short, it is a form of computing inspired by the brain and its billions of neurons working in parallel to interpret and act in accordance with its surroundings. What has allowed this old idea of neural networks to make such a comeback is the recent availability of big data – large data sets used to train the networks – together with the speed of parallel processing in modern GPU chipsets.

As an artist and electronic musician with a keen interest in language and computing, I came across an article published in the fall of 2016, in which a group of Google scientists had turned towards the field of audio to try to improve artificial speech[1]. What triggered my imagination was not the fact that they had succeeded in making computer speech sound much more natural, but the weird by-products of trying the technology out on musical material and other sounds. I had to try this out myself, and I fearlessly installed the necessary software on one of Google's cloud-based computing engines to run the tests. My first experiments were with a collection of water-insect field recordings, and also with my own music, to see if it could learn to "sound" like tracks from my musical projects Phonophani or Alog (possibly putting me out of work in the process!).

Water insects:

Phonophani music:

The really big step forward compared to previous techniques is that the model is trained on a sample-by-sample level, so the algorithm really doesn't care if the sounds used for training are a factory siren, a water insect or a piano. The neural network becomes a black box where it is hard to visualise what is actually going on inside. It learns by itself, with no instructions on how to replicate the sounds it is fed. And if it is not paired with some strictly labelled material, it just babbles away meaninglessly as if it were speaking in tongues. Or, in the case of musical material, it sounds like a stuttering of half thought-out musical phrases. One big challenge for working with music and deep learning will be access to large datasets. In computer vision research, large databases of tagged visual material are readily accessible, and have been for a decade or so. This is what has made the striking visual art of the various neural deep dreaming projects possible (inspired by the 2015 Google inceptionism project[2]). But in the field of music and sound, large datasets for this purpose are just now being assembled for the first time[3].
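To make the sample-by-sample idea a little more concrete, here is a minimal sketch in Python. It assumes the 8-bit mu-law quantisation described in the WaveNet paper, where every audio sample becomes one of 256 classes. The function `toy_predictor` is a hypothetical stand-in for the trained network and is not the model used in these experiments: it simply guesses that the next sample lies close to the previous one. The point is only the shape of the loop, where each new sample is drawn from a distribution conditioned on everything generated so far.

```python
import math
import random

MU = 255  # 256 quantisation levels (mu-law companding, as in the WaveNet paper)

def mu_law_encode(x, mu=MU):
    """Map one float sample in [-1, 1] to an integer class in 0..mu."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)

def mu_law_decode(c, mu=MU):
    """Map an integer class in 0..mu back to a float sample in [-1, 1]."""
    compressed = 2.0 * c / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

def toy_predictor(history, mu=MU):
    """Hypothetical stand-in for a trained network (an assumption, not WaveNet).
    Returns a probability for each possible next class, peaked near the last one."""
    last = history[-1]
    weights = [math.exp(-0.5 * ((c - last) / 4.0) ** 2) for c in range(mu + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(seed, n_samples, rng):
    """Autoregressive loop: draw each new sample conditioned on the history."""
    history = list(seed)
    for _ in range(n_samples):
        probs = toy_predictor(history)
        next_class = rng.choices(range(len(probs)), weights=probs)[0]
        history.append(next_class)
    return history[len(seed):]

rng = random.Random(0)
seed = [mu_law_encode(0.0)]           # start from silence
classes = generate(seed, 1000, rng)   # 1000 new quantised samples
audio = [mu_law_decode(c) for c in classes]
```

Nothing in the loop knows whether the training material was a voice, an insect or a piano. It only ever sees a sequence of numbers between 0 and 255, which is why the same method can be pointed at any kind of sound.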

I turned my experiments back towards language again. Would it be possible to train a deep learning network for a dead language? In my previous art projects I have worked extensively with languages that are endangered or already extinct – so-called dying languages[4]. Every ten days a language disappears, and at that rate, within a few generations, half of the approximately 6000 languages in the world today will be extinct. The dying of a language is a highly complex process. In order for a language to survive, it is of central importance that the language is in use, especially in ordinary households and between the generations of a family. Can a language be kept and conserved for future generations, or is a language alive only when actively used and spoken between people in a society? Can a language be detached from a people's culture, knowledge and identity? Among the family of Sámi languages (of the indigenous groups of northern Norway, Sweden, Finland and Russia), several are already extinct or have only a few elderly speakers left, but efforts are being made to help revive some of them.

From a contact in the Freiburg Research Group in Saami Studies, I got hold of the last remaining recorded material from an already dead language, Akkala Sámi, from north-west Russia. One of the last speakers, "Piotr", tells a story and sings a song. What if I trained a deep learning model to speak this lost language, using this material for the training? Could it be a way to hear the language spoken again, as if it were living? Could it give any insights into how it sounds that are not already present in the final recordings? Could it somehow give a language its illusive sense of presence back?

Original story used for training (excerpt):

Three stages in the learning process[5]:

From the vantage point of art, it is not so important whether what comes out of this experiment keeps intact the meaning of some speaker, some knowledge, correct grammar etc. I only care for the sound itself, the material content, or the medium itself. Some of the uncanniest generated files are the ones that are almost silent, where only the breathing and some small sounds of the mouth between words are generated. This babbling, or "dreaming" as it is often also called when the neural network is turned inward on itself, is an excellent method to highlight the pure audible element of speech. It also makes clear what is unique to a certain language, and therefore offers a possible answer to the question of what is lost if the language is lost.

In the end, I would not label my experiments a success. They are a sketch of a rudimentary idea, a proof of concept at most. In my experiments with musical material, it is not the quality of the musical results that interests me, but the sense of an outside presence or otherness in the sounds generated without a plan or program formulated by a human consciousness. I think this is a central part of my experience of any "deep" art, that there is some singular and unknown method or secret autonomous algorithm working within the piece that makes it endlessly fascinating. In short, the work becomes a character – a face in front of you – but not necessarily a human face.

The oft-cited Turing test is used to measure the success of an artificial intelligence like the one I have created. In this case it would be a slightly amended version, a Turing test for art. In the original test, a human subject has to determine whether he or she is interacting with a machine or a human, and the machine passes if the subject is fooled into taking it for a human consciousness. Much can be argued against such a simplistic test, but I think its biggest flaw is that it is fundamentally anthropocentric in its approach. Why should a human determine what intelligence is? If we at any point should meet a form of artificial intelligence in this "strong" sense, it will be characterised by its total otherness, and not in any way comparable to our way of thinking. It will be like the black box of the deep learning layers, where we will never be able to visualise its multi-dimensional structure. As with the question of whether animals can think or feel like us, the whole question of intelligence becomes too narrow in scope. What matters is our natural reactions and emotions when put in front of the other.

Final generations:

This is how far the experiment got before publishing this text. It is the result of one and a half months of 24-hour deep learning of Akkala Sámi on a cloud-based CPU server. I felt something happening in the final few days. It was as if the voice was starting to coalesce into less stuttering – less like Schwitters' "Ursonate" and more flow, and maybe less anger and shouting? Or is it just my mind playing tricks on me? One of the biggest challenges is knowing when to quit. Just one more hour of learning… Just one little change of code and try again… The main weakness in my experiment was the limited amount of source material. I would need access to a larger data corpus of a language to move further. This highlights the increased importance of archives in the future. The world must become even more "data-centric". How will artificial intelligence change a world characterised by homogeneity and the destruction of diversity? Will artificial intelligence make possible a new way of preserving the unique and singular? A preserving of the past by making it present all around us?

In the case of my experiment, the next logical step will be to team up with linguists and computer scientists to move the idea further. Still, it is a case in point that I, with my limited specialist knowledge of the technology, was capable of running experiments of this kind[6]. The technology will become even more democratised when the prices of fast GPU processors come down to a consumer level. How could artificial intelligence assist in the creation of art? Will it be a new form of post-human art, as some speculate? The bigger question is what deep learning will mean for art and culture, for creativity, for social studies and the humanities. This is a future that I, and many others, will discover and take part in shaping over the coming years.



[1] https://deepmind.com/blog/wavenet-generative-model-raw-audio/

[2] https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

[3] http://motherboard.vice.com/read/big-datas-unexplored-frontier-recorded-music

[4] See projects: http://sommer.alog.net/pages/48 and http://sommer.alog.net/pages/29

[5] For more examples visit the full set: https://soundcloud.com/user-614303604/sets/deep-learning-dead-languages

[6] For my experiments I used the open source wavenet implementation found at https://github.com/ibab/tensorflow-wavenet

* * *

Espen Sommer Eide is a musician and artist based in Bergen, Norway. With his music projects Phonophani and Alog, he has composed and performed a series of experimental electronic works. As an artist, his work investigates subjects ranging from the linguistic, the historical and the archival to the invention of new scientific and musical instruments for performative fieldwork. His work has been exhibited and performed at Bergen Kunsthall, Nikolaj Kunsthal, Manifesta Biennial, Henie Onstad Kunstsenter, Stedelijk Museum, GRM, De Halle Haarlem, Bergen Assembly, Sonic Acts, Mutek, Performa and more. He is also a member of the theatre collective Verdensteatret, with works performed at the Shanghai Biennial, Exit festival Paris, BAM New York and more. Email: espensommer A gmail D com