During my qualifying exams at the MIT Media Lab, Dr. Tristan Jehan asked me:

A number of companies are working on this right now, and offering some interesting options. Many feature machine learning, cloud collaboration, or the blockchain. These technologies are already changing the way music is created and consumed, and I expect that their impact will only grow stronger in the future. However, my proposal is not characterized by any particular technology.

Before answering the question, I added a constraint: How could we set up this technology in a way that helps music and the music industry to thrive?

Can new recording technologies benefit not just their creators, but also musicians, fans, producers, researchers, record labels, and the music industry as a whole? Possibly. I am publishing my answer here in the hope that feedback will generate discussion and strengthen the ideas. Perhaps by working together, we can nudge the future of music and music production in a positive direction.

Turning the DAW Inside Out

Recorded music had a huge impact on the 20th century, with analog and digital recording technologies co-evolving along with changes in musical tastes and possibilities. The evolution of sound recording technologies will continue in the coming century. How will the processes of making and recording music change? We are beginning to see how cloud computing, streaming media, machine learning and social networks are changing the music creation process. This paper describes a speculative next generation sound recording and music production technology, and proposes a technical blueprint for how this technology could be created today.

Conventional DAWs like Reaper (shown here) are modeled after multi track tape machines.

The internet already changed music authorship, ownership, distribution, and consumption. We will not include a discussion of the many ethical, financial or legal implications. Instead, the goal is to explore an optimistic way that music and technology could co-evolve in the twenty-first century, and lay a foundation to start working toward that future.

We begin with a comparison to a tool for software development: git. Git is both an open source software project, and a protocol. It lets many users work together on a software project, and makes branching and forking trivial. Commercial products built on the git foundation are offered through organizations like GitHub, GitLab, and Bitbucket. This is the kind of model I envision for next generation audio tools. However, audio is more complicated than code, and music is a higher level of abstraction than audio. We can learn from the architecture of git, but the modules that we need to build twenty-first century music creation are different, if not much more complicated.

What are the modules and protocols that will be valuable to musicians in the 21st century? Before answering this question, I will walk through a hypothetical (science fiction) recording session, describing how it is different from a conventional recording session today.

The Science Fiction Studio

A band’s producer and bass player, Jordan, opens up a session in her DAW of choice to review pull requests. A few days ago, the band was jamming in the studio, and found a groove that Jordan wants to make the basis of a track. After the studio session, she sent a link to the session to a few friends, asking for feedback. The track received a few comments, and a pull request for scratch vocals from a songwriter. She reviews the comments, taking note of which ones to share with the band later, and merges the pull request for the rest of the band to review.

Jordan also subscribes to a song arranging service. After the band finished recording, the automated service analyzed the full three hour jam session, and identified three distinct songs that the band played. The service created three new sessions in Jordan’s account, which were organized into song structure, with intro, verse, chorus, outro. The tempo map generated by Jordan’s native DAW was reasonably accurate, but the commercial service handles time signature changes better, and accurately re-organized an hour’s worth of material into a three and a half minute arrangement. The service also added symbolic transcriptions of the performances: A transcription of the chords, and MIDI-like note transcriptions of the individual tracks.

Jordan plugs in her bass guitar, hits the play button in her DAW, and records a few takes and overdubs to comp a bass track to replace the scratch track recorded in last week’s jam session with a polished version over the three minute arrangement. When she presses ‘play’ in the DAW, the audio from her bass is recorded and cached on her local machine, but also sent to the cloud where it can be processed, archived and analyzed. When she is satisfied, she can forward the result to the rest of her band for more overdubs. An automated service can send quotes for overdubbing a professional string ensemble in a high end studio. When the band is ready to release the song, it can be fingerprinted, along with the rights and authorship. From there, artists and bots can sample and remix the final track, all while retaining references to rights and attribution information.

Jordan can give multiple services read access to her stems, and write access to a directory for mixdowns. These might be bots that do automated mixes, individual mix engineers, or combinations of the two. Jordan can then give read access to the mixdown directory to a streaming service, which will compensate her, her band, and the mixing and mastering engineers if the song becomes popular. The streaming service annotates the master mixes with playback statistics. Other services can read the statistics, and these can drive further mixes, or advise Jordan on how to structure her band’s next single.

At any point during this process, Jordan can publish a rough mix of a new track to streaming services. The mix might not be very polished, but it will only be available to the band’s immediate supporters. If Jordan allows it, supporters who hear the initial mixes can make pull requests, fork the project, or request permission to sample it.

We can imagine many services that could fit into a system like this: chord progression and arrangement recommendation services, educational music theory or arrangement aids, and bots that scrape the network for illegally copied material. Many components that could fit into a system like this are on the market today. Splice lets users back up, share and fork DAW sessions. iZotope Neutron generates channel strip presets that attempt to automatically mitigate mixing artifacts like masking. Hooktheory visualizes and compares chord progressions from a database of thousands of songs. Propellerhead and others offer a centralized “app store” that sells or rents audio effects and synthesizers. Ohm Studio and Soundtrap are real time collaborative cloud based DAWs. These services leverage the current internet platformization trend, where success is measured by market share. Following these services to their logical conclusion results in a convergence in the tools used to create music, and a convergence of the music itself.

An alternative future is one that takes inspiration from the success of git. In this future, many developers can build modules on top of an open source protocol without getting locked in a walled garden. Following this pattern to its logical conclusion reveals a rich and diverse ecosystem of tools for different kinds of collaboration and music creation. What follows is a proposal for the technical foundation that such a system could be built on.

Terminology

In the technical description below, the following terms are defined explicitly:

Annotations — Time stamped information or instructions about a session, track, or asset.

Artist — The user, who is creating music.

Audio Asset — The raw digital encoding of a wave form.

DAW — Digital Audio Workstation

Metadata — Data about tracks and assets. For example, “composer”, “performer”, “conductor”.

Plugin — An extension hosted by a DAW.

Resource Access Interface (RAI) — The software interface for services to access an artist’s sessions and assets.

Service — Services are the modular tools for processing and annotating audio sessions. Services can accept and return audio, audio sessions, symbolic audio, or metadata. As an example, a service might accept an audio session as an input, and return a timestamped list of chord changes (annotations). Another service might accept a song session as input, and provide streaming access to a mastered version of that song for fans to listen to.

Session — The file saved by a DAW that includes arrangement, routing, plugins, assets, and asset references.

Symbolic Audio Asset — The raw music asset, encoded symbolically. For example, a MIDI file.

Track — A rendered session, often a stereo audio file.

Technical Requirements

How could a system like the one described above work? The goal in this paper is not to design each of the modular building blocks like automated musical analysis, transcription, and mixing. Instead the aim is to describe the system that these modules could be built on top of. Ideally such a system would provide music creators the freedom to utilize a variety of diverse tools in their creation process, while incentivizing researchers and developers to create modular tools and plugins that can work together in creative and interesting ways.

Traditionally, the tools used in the music production workflow follow a sequence (for example: Composition, Recording, Production, and Distribution). When all of these operations are performed in a cloud enabled toolchain, we may stop thinking of them as a sequence: A track’s composition and production can continue even after the track has been released to streaming services. That track can be sampled, covered, or remixed by other artists and collaborators. Every step can be automated, and every creative and technical decision can be data driven. All these capabilities can be integrated into the toolchain.

How can next generation composition and recording tools take advantage of the collaborative and data driven capabilities while simultaneously fostering creativity and diversity in the results? This approach to creating and recording music represents a significant paradigm shift from the conventional approach. The tools will need to evolve considerably.

Our inner designer might be tempted to imagine the user experience of working with the DAW of the future. With the experience in mind, we can then work backwards to the technical implementation. This approach is risky, because it forces us to answer many different design questions simultaneously. It is still too early to envision and design the full experience. We cannot create GitHub without first creating git.

Instead of trying to re-imagine the entire music pipeline, let us start with the foundational blocks that we can build new musical tools on top of. Essentially, this involves turning the DAW inside out and exposing its contents to the cloud, where artists and services can sample, remix, and interact. If we do this right, new creative workflows will emerge naturally. We can evaluate how artists are using the tools, and use what we learn to inform further development.

Git Objects

Returning to our earlier comparison, consider the foundational objects that git is built on:

Blob — Binary data storage, typically the contents of a file.

Tree — Tree data structure with named pointers to other tree and blob objects.

Tag — Points to an object. Includes a comment, and the object, type and tagger headers.

Commit — Points to a tree. Includes a comment, and headers:

tree — The tree object
parent — The sha1 of all parent commits
author — The name of the person who made the change
committer — The name of the person who made the commit

The binary format of a git commit. Source: Git Community Book

A git repository uses the host’s filesystem to store these objects, addressed by their sha1 hash. The basic git commands are responsible for creating new objects, and updating local and remote file systems based on the existing objects. These simple primitives make a complex variety of commands and collaborative workflows possible. Notice that commit objects encapsulate attribution, and that attribution travels with the object when it is pushed to or pulled from a remote repository. Additionally, git does not attempt to solve dependency and package management, leaving that to other compatible or complementary services like pip and npm.

What would a system look like for music? Git objects were designed to describe state in the evolution of a codebase. The way that code is composed is very different from the way music is composed. Consequently, the role of the musical primitives proposed here is not the same as the role of git objects. Instead of mimicking the capabilities of git (version control), and applying those to music, this is a design proposal for foundational infrastructure that could enable the kinds of collaborative and computational music creation processes described in the introduction above. With that in mind, consider the following description of musical asset ‘objects’ that could integrate into the music creation process.

Music Objects

The role of a music object is to create an addressable connection from assets to annotations and metadata. For example, consider a stereo mixdown of a drum kit. There are many different ways it could be used. It can be a scratch track for overdubs. It can be converted to a sample library. It can be deconstructed as a REX sample. It can be transcribed, and converted to a symbolic asset. It can be used as the basis for a groove template to be applied to another track. In any of these forms it can serve as a data point for generative music or music information retrieval. If libraries of music assets are saved and searchable through their annotations and metadata, any number of compositional aids, sampling, and remix tools can be built.

All information pertaining to an asset will be accessible through a content addressable system. Like git, resources will be identified by their 40 character hexadecimal sha1 hash. This allows them to be saved in a filesystem like the git objects in a local repository, or unambiguously referenced by a URI. If the contents of a local file change, then the metadata may no longer be accurate, and must be re-calculated. If a local asset filename changes, the hash will remain unchanged, and the file can still be identified by its hash.
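The content addressing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `asset_id` and `metadata_uri` and the `https://example.com/m` base URL are hypothetical, and the hash is taken over the raw file bytes (git itself prepends a short header before hashing a blob, so the two digests would differ).

```python
import hashlib

# Assumed base URL for metadata lookups; a real deployment would choose its own.
BASE_URL = "https://example.com/m"


def asset_id(path, chunk_size=1 << 20):
    """Return the 40 character hexadecimal SHA-1 digest of a file's raw bytes."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large audio files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def metadata_uri(asset_hash, document="timeline.n3"):
    """Build a URI for a metadata document attached to an asset hash."""
    return f"{BASE_URL}/{asset_hash}/{document}"
```

Because the identifier depends only on the file's contents, renaming a file leaves `asset_id` unchanged, while editing even one sample produces a new identifier and signals that existing metadata may be stale.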

Musical Metadata and Annotation Serialization

The document that represents a user within a service like Facebook will include a collection of properties and relationships typically wrapped in HTML or JSON. For conventional web services like Facebook, the schematic that defines properties and relationships that make up the user profile data is not intended to be used externally. Every conventional web service that implements user accounts is responsible for defining and maintaining an internal ‘user’ schematic.

It would be possible to design cloud based music production tools using this conventional approach. Each service could design its own data models, and be responsible for parsing the annotations of upstream services. A chord annotation service would implement its own data structure modeling a song timeline. That service would access a client’s audio assets, save the chord annotations in a custom format, and return the result to the client. This is essentially the model that exists today, and is employed by services like Splice.com. This approach is not without advantages. Primarily, it is simple. Every service developer can define the models that are most suitable to their needs. Because service developers write their own schematics, they will have no trouble understanding them. This approach benefits from the “separation of concerns” design principle. Separate service developers do not need to understand any one particular resource definition language, and may use whatever method they are most familiar with.

However, the goal is to facilitate many services that interact with each other. The existing platform based model facilitates the ecosystem that we already have: Each platform measures success by its market share. Each benefits from the “network effect” only when its users are exclusive.

The Web Ontology Language (abbreviated as OWL) and Resource Description Framework (RDF) created by the W3C provide a standardized language for explicitly defining ontological schematics and relationships, and describing data within those schematics. The W3C languages are designed with exactly this use case in mind: The ontology and knowledge graph are not owned or managed by any one particular service. Instead, ontological primitives like resource classes and relationships can be distributed across many different web servers, and referenced with a URI. As an example, consider the statement that J.S. Bach is the composer of the Goldberg Variations, expressed as a triple.

The RDF would divide this knowledge into three parts: A subject (J.S. Bach), a predicate (Is the composer of), and an object (the Goldberg Variations). Using RDF, each of the three parts could be referenced by a URI. That way, many web resources can access the same predicate, assigning an axiomatic and machine readable relationship between people and compositions. The semantic web is designed such that the ontological structure can be defined in a distributed fashion using the existing DNS and HTTP protocols.
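To make the subject, predicate, object structure concrete, here is a minimal sketch that writes the Bach example as a line of N-Triples, one of the plain text RDF serializations. The three `example.org` URIs are placeholders invented for illustration, not terms from a published ontology, and the helper only handles triples whose three parts are all URIs.

```python
# Hypothetical URIs: example.org stands in for real, published ontology terms.
BACH = "http://example.org/people/js-bach"
COMPOSER_OF = "http://example.org/ontology/isComposerOf"
GOLDBERG = "http://example.org/works/goldberg-variations"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = [(BACH, COMPOSER_OF, GOLDBERG)]
print(to_ntriples(triples))
```

Any service that knows the predicate URI can interpret the statement, regardless of which server published the subject or the object.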

MIR researchers proposed a standardized way to describe audio features, including annotations and metadata. A paper by Gyorgy Fazekas describes how a software application such as a DAW could read and write metadata in a standardized but extensible way (Fazekas, 2009). The format takes advantage of the RDF, and the standards proposed by the W3C for semantic “linked data” on the web (Cannam, 2010). An ontological standard provides a major benefit to DAW, service, and plugin developers. The proposed ontology standardizes metadata descriptors like “artist name”, “composer” as well as musical timeline features like beat location and chord change. This allows many different services to recognize each other’s annotations and contributions. RDF provides a convenient mechanism for developers to create additional annotation types, and publish these specifications to the web.

There are significant advantages to the semantic approach. In the examples above, many different services perform operations on an artist’s assets, annotations, and metadata. If schematics are written in an explicit, well defined format, developers can organize their services around shared schematics, eliminating redundancy, and improving interoperability. If the musical annotations supported by the existing music ontology are not sufficient, services can also design, document, and publish new annotations that replace or extend the existing ontologies. The cost is that schematics must be general, and developers must learn to work with the RDF.

Consider the example from the beginning of this document. The artist’s session and assets were exposed to several different services. Each service added and updated assets of its own. All services need to be able to parse each other’s metadata and annotations. At the same time, each service may define its own extensions to the existing schematic.

Resource Access Interface

In the example at the beginning of this paper, Jordan allowed a third party service to access and update her DAW session. Another service was given read access to her session and tracks, and provided her with additional annotations. How can she control access to her session and track assets? All her assets need to be accessible on the internet. The server that makes assets accessible needs to authenticate and authorize read and write access requests from services and individuals.
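One way to picture the authorization half of this requirement is a list of grants that the resource server consults on every request. The sketch below is an assumption made for illustration, not a specified part of the interface: the `Grant` type, the agent names, and the path prefixes are all hypothetical, and authentication (proving who the caller is) is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    agent: str        # who the grant applies to (a service or a person)
    prefix: str       # path prefix within the artist's storage
    modes: frozenset  # subset of {"read", "write"}


def is_allowed(grants, agent, path, mode):
    """Return True if any grant gives `agent` the requested `mode` on `path`."""
    return any(
        g.agent == agent and path.startswith(g.prefix) and mode in g.modes
        for g in grants
    )


# Hypothetical grants mirroring the mixing example: read stems, write mixdowns.
grants = [
    Grant("mix-bot", "/stems/", frozenset({"read"})),
    Grant("mix-bot", "/mixdowns/", frozenset({"read", "write"})),
    Grant("streaming-service", "/mixdowns/", frozenset({"read"})),
]
```

With these grants, `is_allowed(grants, "mix-bot", "/stems/bass.wav", "read")` is true, while the same agent asking to write to `/stems/bass.wav` is refused.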

Jordan can interact directly with many services. Many of these services can also interact with each other. For example, services can request and exchange assets, metadata, and sessions. Semantic audio and the RDF offer a standardized machine readable language for services to exchange metadata.

Uploading Assets

A practical implementation of this service could work in the following way. First, assets created during a recording must be uploaded to the resource server. A software daemon on the recording engineer’s computer monitors the DAW’s asset directory, similar to services like Dropbox and Google Drive. When a change is detected, any newly created or updated assets are uploaded to an asset server. This can be accomplished with a one way file synchronization algorithm as is done by the rsync utility (Tridgell, 1999).
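The change detection step of such a daemon could look roughly like the sketch below. It is a simplified stand-in, not the rsync algorithm: it re-hashes whole files on each scan rather than computing rolling checksums and deltas, and it stops short of the upload itself. The function names, the `known_hashes` manifest, and the `*.wav` pattern are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_sha1(path):
    """SHA-1 of a file's raw bytes (reads the whole file; fine for a sketch)."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def find_changed_assets(asset_dir, known_hashes, pattern="*.wav"):
    """Return {relative_path: sha1} for files that are new or modified.

    `known_hashes` maps relative paths to the hash recorded at the last upload.
    """
    changed = {}
    for path in sorted(Path(asset_dir).rglob(pattern)):
        if not path.is_file():
            continue
        rel = path.relative_to(asset_dir).as_posix()
        digest = file_sha1(path)
        # A missing entry means a new file; a different hash means an edit.
        if known_hashes.get(rel) != digest:
            changed[rel] = digest
    return changed
```

A daemon would call `find_changed_assets` periodically (or in response to filesystem events), upload each returned file, and then merge the returned mapping into its manifest so that unchanged files are skipped on the next scan.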

The process of recording music generates a significant amount of metadata, including but not limited to the time and place of the performance, the musicians involved, the recording techniques used, and the individual takes and performances, all of which can be semantically linked to the musical composition and arrangement. The semantic properties and workflow for authoring metadata resulting from a recording session are described in detail by the multitrack ontology (Fazekas, 2008). Recording techniques such as microphone type and placement may also be described by the studio ontology (also written by Gyorgy Fazekas). Ideally, all available metadata on a recording session would be added at the time of the recording session, and could be edited and updated by the recording engineer as well as by the artists. This brings us to an additional technical requirement of the resource server: It must expose an interface to create, read, update, and delete (CRUD) assets, annotations, and metadata.

CRUD Operations

There are two options for performing CRUD operations on linked data. The first is via SPARQL queries. SPARQL is a query language made especially for linked data, and standardized by the W3C. It has a very flexible query syntax that allows complex queries on linked data. The specialized query language does come with a cost: Because queries written in SPARQL are designed with the flexibility to traverse a linked data graph, a poorly written query can overwhelm server resources. This can be mitigated by using an intermediary like D2RQ, which translates SPARQL queries to SQL for operating on a traditional relational database into which linked data has been shoehorned.

The second option is to use standard HTTP methods and headers, similarly to a traditional REST interface. This method is proposed in the Social Linked Data (Solid) specification. The Solid REST specification includes some additional requirements, for example, support for wildcard “globs” when retrieving data with the HTTP GET method. The Solid project also includes a declarative specification for group based authentication and authorization called Web Access Control. Developing a REST style interface is made easier by the many existing implementations, and by the well established middleware based approach for authenticating and authorizing requests. However, it cannot support the more complex queries that SPARQL, which was made explicitly for searching and manipulating linked data, provides.
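The REST style option can be illustrated with an in-memory stand-in for the resource server. This sketch is not an implementation of the specification mentioned above: the `ResourceStore` class, its status code choices, and its glob behaviour (Python's `fnmatchcase`, where `*` also matches `/`) are assumptions chosen to keep the example short, and no authentication or authorization is performed.

```python
from fnmatch import fnmatchcase


class ResourceStore:
    """In-memory stand-in for a resource server, keyed by URL path."""

    def __init__(self):
        self._docs = {}

    def put(self, path, body):
        """Create or replace a document; mirrors HTTP PUT."""
        created = path not in self._docs
        self._docs[path] = body
        return 201 if created else 204

    def get(self, path):
        """Return (status, body); a '*' in the path matches several documents."""
        if "*" in path:
            matches = {p: b for p, b in self._docs.items() if fnmatchcase(p, path)}
            return 200, matches
        if path in self._docs:
            return 200, self._docs[path]
        return 404, None

    def delete(self, path):
        """Remove a document; mirrors HTTP DELETE."""
        if path not in self._docs:
            return 404
        del self._docs[path]
        return 204
```

For instance, after storing `/m/abc/timeline.n3` and `/m/abc/chords.n3` with `put`, a single `get("/m/abc/*")` returns both documents, which is the kind of convenience the wildcard requirement is meant to provide.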

Metadata Storage

Compared to uncompressed audio assets, annotations and metadata consume a small amount of storage. The simplest option for saving linked data on the server is to use the server’s filesystem. In this model, metadata would be saved as linked data files in the .rdf or .n3 formats supported by the W3C. For example, a .n3 file that encodes musical annotations using the Timeline and Music ontologies (Raimond, 2007) may be referenced by:

https://example.com/m/46eee28edc90bef1cd58f2db9d626ec8cb350546/timeline.n3

Where the 40 character hexadecimal string is the sha1 hash of the audio asset that the timeline refers to. Using this pattern allows a simple service to run with minimal configuration on an Apache or Nginx server.

However, modern web services require more advanced authentication, authorization, data validation, backup, queries, and scale than a filesystem based approach provides. For this reason, the conventional approach is to put an HTTP server in front of a database, which allows more flexibility and customization. A full discussion of issues surrounding linked data and databases is out of scope. We will highlight a few key factors.

If the entire linked dataset is small enough to fit on a single server, a centralized datastore may be used. For example, Jena2 is a Java based toolkit for storing and querying subject-predicate-object triples in a centralized SQL database (Wilkinson, 2003). Jena2 supports input and output in RDF serialization formats including N3 and RDF/XML. It also supports SPARQL queries. If the linked dataset cannot fit in a single machine, a distributed RDF store must be used. D-SPARQ is an RDF query engine built on MongoDB, a NoSQL datastore designed to scale vertically and horizontally across many servers (Mutharaju, 2013). D-SPARQ uses the Map/Reduce functionality built into MongoDB, but does not accept SPARQL queries. Another scalable option is provided by Amazon Web Services. AWS Neptune is a managed distributed graph database with support for RDF objects and SPARQL. An up to date discussion of many additional RDF ready database options can be found in Linked Data: Storing, Querying, and Reasoning (Sakr, 2018).

Storing assets

Audio assets are larger than metadata, and should be stored and exposed through a different mechanism. In the proposed model, services can query annotations and metadata. If a service needs to query the audio itself, it can request the audio directly. Artists should have control over which services may access which assets, metadata, and annotations.

A simple way to store large binary files like audio assets on a web server is using that server’s file system. However, this is often avoided because it offers limited support for best practices like versioning and automated redundant distributed backups. A more flexible open source option would be GridFS, an official specification and API for storing large files in MongoDB collections. This takes advantage of the built in replication and backup procedures made for MongoDB. Alternatively, commercial services can provide turnkey services for saving large binary blobs, and exposing those blobs to the world wide web. AWS S3 allows web clients to upload and download large assets on behalf of a service. Access to assets stored in AWS S3 buckets can be subject to customizable authentication and authorization parameters.

DAW Sessions

We have described how a cloud enabled service could act as a repository for assets, annotations, and metadata. What about DAW session files? The RDF coupled with the music ontology provide a powerful collection of standards for metadata and annotations. However, the services described in the introduction don’t just edit metadata, they integrate into the digital audio production workflow. One of the proposed services autonomously parsed Jordan’s DAW session, and re-structured the session by automatically editing the band’s jam into an intro, verse, chorus, etc. To do this, services must be able to access and de-serialize the session information. Access is not complicated. Session files can be uploaded and distributed through the cloud in a similar fashion to audio assets. However, a service that modifies a DAW session must be able to parse the session file. There are many very different session file formats, and interoperability between the formats supported by different DAWs is limited.

One option is creating a new DAW from scratch with the cloud service based workflow in mind, and publishing the session spec. It allows us to build support directly into the DAW, making more complicated features like real time streaming collaboration easier. This ensures compatibility, but means we need to create a DAW, and motivate artists to use it. This might seem inadvisable, because DAWs are a very competitive space, but many new platform-based collaborative DAWs are currently in development. Examples include Soundtrap, Ohm Studio, BandLab, and Amped Studio. However, our goal is not to create a new platform in the conventional sense, so we will be better served supporting existing audio production tools.

Another option is authoring a session format specification. An “edit decision list ontology” in an RDF compatible format is one potential format. This would allow us to create session converters that (for example) convert the .RPP session format saved by Reaper to our portable format. We could also ask DAW developers to include support for exporting to our custom format.

Alternatively we could use an existing session file specification. There is historical precedent for open DAW standards. In 1993, Avid Technology published the Open Media Framework, which was designed to allow interoperability between different DAWs (Lamaa, 1993). Two subsequent standards, AES31 (Yonge, 2000) and AAF (Tudor, 2004), published in 2000 and 2004 respectively, were created for the same purpose. Unfortunately, true DAW interoperability is non-trivial. The AAF format includes support for a sample accurate edit decision list, fades, and to some extent, automation. However, it does not support routing, bussing, MIDI, and audio plugin configuration. Creating a standard that exactly reproduces every possible parameter in a DAW is impractical, because of the many evolving capabilities of different audio tools. Existing commercial DAW session translators like AATranslator and Vordio suggest that explicit limited support for certain features is a more realistic approach.

AAF encoding is supported by Studio One, Pro Tools, Logic, Nuendo, Digital Performer, Final Cut Pro, and Premiere, but not by Reaper, Ableton Live, FL Studio, Bitwig Studio, Ardour or Reason. Could we expect cloud services to adopt the AAF format? There are some obstacles. Unlike the RDF standards published by the W3C, there is no process in place for extending or maintaining the specification. Also, unlike RDF, legacy formats like AAF do not include any way to publish extensions to a given specification in a distributed fashion. The defining feature of the RDF is support for continuously evolving, extensible definitions. Only a C++ library exists for parsing AAF files. Meanwhile, modern and generic binary formats like Protocol Buffers have libraries in many different programming languages, as well as built-in support for versioning and backwards compatibility.

Standardizing DAW session file format involves obstacles whether we write our own specification or use a legacy format. These challenges can be circumvented by leaving the problem of parsing session files entirely to service developers. To create the session re-structuring service that organized Jordan’s session file into song sections, each service could advertise exactly which session file formats it supports. This would fragment service interoperability, but also encourage DAW developers to expose their session file specifications to service developers.

A final option involves trading service capabilities for operational simplicity. We can limit service interaction to raw audio assets. In this scenario, an artist interested in tapping a service would render audio files from the session, before uploading them for processing by a cloud service. This significantly limits the types of available services. For example, services could not programmatically alter a mix, and automatically publish those mixes to a customer facing streaming service.

Summary

We described how the music composition, recording, and production process could evolve given a cloud enabled redesign with a distributed architecture and a focus on open and extensible standards. One way to visualize this is “turning the DAW inside out”, and exposing its contents to a spectrum of human and automated collaborators, with the goal of encouraging new and previously unimaginable music production processes. This is in contrast to conventional DAWs that work by embedding assets and plugins, centralizing the workflow, and (often) by locking users in to a particular platform.

We described the technical foundation for how such a service could be built, using existing standards when possible. Some implementation details are left to future work. A robust authorization strategy is required for artists to specify which services may access which assets. We discussed bundling metadata with assets with the implication that when an artist samples or remixes existing tracks, attribution can be built in to the production process. This is a good step, but we still need a payment or transaction system that enables upstream artists to be compensated.

The comprehensive vision considers how creating music can continue to develop in an age when we no longer need clear boundaries between music composition, recording, production, and consumption.