Over the years I've heard plenty of opinions expressing doubt about the Linux sound stack. There are lots of ill-informed comments out there on all things sound related, both positive and negative, but more often than not commentators miss very important aspects of a modern, multi-user desktop sound stack. So in this article I'll attempt to address some of the misconceptions, provide a balanced view of the current state of affairs, discuss some of the perceived mistakes in the rollout of new sound stacks, and look at where things are going in the future.

There have been a few articles about the Linux sound stack, some picking up mainstream coverage. Some comments suggest that it's not that bad but totally miss the point of what a desktop audio stack is all about. Most people, though, say it's in a bad way and overly complicated. While such comments do have some merit, things really are not that bad, and I believe there is a really bright future.

ALSA vs OSS

A lot of the comments of late have been discussing things such as how amazingly brilliant OSS is. Personally I don't buy it. I've never really played overly much with OSS and as such this is probably a slightly ill-informed view - although that's not to say it's not accurate of course :D. Most of these kinds of comments are made by people who don't really understand ALSA and have been won over by the "ALSA API is overly complex" type comments. Yes, the ALSA client library is rather complex and has numerous pitfalls - so much so that there now exists an unofficial "safe" ALSA subset API. But what people invariably fail to comment on (and thus fully understand) is that ALSA comes in two parts: the kernel driver and the userspace library. ALSA differs from OSS in that all access to the kernel layer is performed via a userspace library. I don't know of any ALSA clients that communicate directly with the kernel layer without going through libasound. What this means is that the kernel interface has the freedom to be refactored and improved at any time, provided the userspace library is developed in parallel. For this reason, the kernel layer is actually quite clean and well defined. The rather rigorous quality control that goes on in the kernel is testament to the fact that on the kernel side of things, ALSA is doing pretty well. Of course there can (and will) be improvements in this area in the future, but this side of things is certainly not in the poor state people seem to assume.

The "too complex" argument relates to the ALSA userspace API. In order to remain backwards compatible, the userspace API has undergone several refinements. As with anything not designed from the top down, some parts of it are rather confusing and have sometimes been misinterpreted. The classic example is the confusion over snd_pcm_delay(): its documentation hinted at a hardware-based implementation, which led some projects (e.g. WINE) to assume that this function will eventually return 0, which is not true. Fortunately this problem is behind us now, with a new API call added that does return the info the WINE guys (and others) needed.

So yes, the ALSA userspace API could use a complete top-down redesign, but doing that would immediately break compatibility with 90% of the apps out there: not a great idea all in all. Retaining backwards compatibility is a pain, but it's also quite important!

But Sound Servers Suck!

What is in a name? That which we call a rose by any other name would smell as sweet. Some people seem to have some sort of built-in hatred of "sound servers" as a concept without really thinking through what this means. Yes, there have been some pretty awful experiences with some sound servers in the past (EsounD and aRTs being the immediate examples that spring to mind), but that doesn't mean the concept itself is flawed. You may drive a couple of shit cars but that doesn't mean we should all abandon the roads. In addition, the sound servers of old were really just mixers. In the old days most hardware was not capable of doing hardware mixing and thus couldn't produce sound from multiple apps at the same time, so a mixer was an essential component. Nowadays, software mixing is the norm rather than the exception, even on high-end hardware, and ALSA itself has pretty solid software mixing in the form of DMIX. This obsoletes large parts of the previous sound server functionality, and certainly makes the additional features those servers did offer seem disproportionate to the hassle they introduced. In the early days DMIX was just another sound server. Apparently this has changed and it no longer needs an additional process. While it does the job of software mixing very well, it's not as fast or as flexible as other solutions can be.
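To make "software mixing" a bit more concrete, here is a minimal sketch in Python. It is not how DMIX or any real sound server is implemented; it only shows the core idea of summing the samples from several applications and clamping the result so it still fits in a signed 16-bit sample. The function name mix_streams, the optional per-stream volumes and the sample values are all made up for illustration.

```python
# Conceptual sketch only: real mixers work on hardware buffers and handle
# resampling, channel maps and timing. None of that is modelled here.
INT16_MIN, INT16_MAX = -32768, 32767


def mix_streams(streams, volumes=None):
    """Mix lists of signed 16-bit samples into a single list.

    streams: one list of samples per application (hypothetical input).
    volumes: optional per-stream gain factors, 1.0 meaning unchanged.
    """
    if not streams:
        return []
    if volumes is None:
        volumes = [1.0] * len(streams)
    # Only mix as far as the shortest stream has data.
    length = min(len(s) for s in streams)
    mixed = []
    for i in range(length):
        # Scale each stream by its own volume, then add everything up.
        total = sum(int(s[i] * v) for s, v in zip(streams, volumes))
        # Clamp (saturate) instead of letting the sum wrap around.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

For example, mix_streams([[1000, 30000], [2000, 30000]]) adds the first pair of samples normally and clamps the second pair, which would otherwise overflow the 16-bit range.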

Modern Multi-user Desktop

So, these days a modern, multi-user desktop is quite a different beast to what it once was. Components such as ConsoleKit track which users are currently active (e.g. when more than one user is logged in simultaneously) and tell udev to write appropriate ACLs to enforce this policy. Users also want to use network-attached sound systems, such as Apple Airtunes (RAOP) devices and UPnP media renderers, not to mention Bluetooth devices. All of this sits much further up the sound stack than the low-level driver layer and has to deal with various permission and authentication schemes. This obviously needs a userspace component to govern the interaction. Something has to be responsible for this, and a "sound server" of some sort obviously fits the bill perfectly.

PulseAudio

So enter PulseAudio. It's had its fair share of bad publicity, but ultimately this part of the Linux sound stack is taking on several roles that are important in a modern desktop. It's dealing with several different things:

Software mixing

Independent (per-application) volume control

Dealing with permissions (is the user allowed to access the sound device?)

Dealing with Bluetooth devices

Dealing with Network based devices (UPnP, Apple Airtunes, Native PulseAudio etc).

Handling the moving of streams between outputs.

Handling sound from remote applications run via X11 over a network.

Dealing with routing policy (Music goes to USB speakers, Desktop sound events to built in speakers, VoIP to Bluetooth headset)

Effects to promote HCI (e.g. positional event sounds - button clicks etc, coming out louder on the left hand speaker when triggered from the left hand side of the desktop)

Power consumption and efficiency savings.

Reduces risk of buffer under-runs.
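The routing policy item in the list above can be pictured as a simple lookup with a fallback. The sketch below is purely illustrative and is not PulseAudio's actual policy code: the role names, the device names and the route_stream function are all hypothetical, chosen to mirror the example of music going to USB speakers, event sounds to the built-in speakers and VoIP to a Bluetooth headset.

```python
# Hypothetical policy table: stream role -> preferred output device.
DEFAULT_POLICY = {
    "music": "usb-speakers",
    "event": "builtin-speakers",
    "phone": "bluetooth-headset",
}


def route_stream(role, available, policy=DEFAULT_POLICY,
                 fallback="builtin-speakers"):
    """Pick an output device for a stream with the given role.

    role: what the stream is for, e.g. "music" (assumed labels).
    available: set of device names that are currently plugged in.
    Returns the chosen device name, or None if nothing suitable exists.
    """
    preferred = policy.get(role, fallback)
    if preferred in available:
        return preferred
    # The preferred device is gone (e.g. headset switched off),
    # so move the stream to the fallback device if it is present.
    if fallback in available:
        return fallback
    return None
```

The point of keeping this in one central place is that the application only has to say what kind of sound it is producing; where that sound ends up is decided outside the application.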

So the people who talk about OSSv4 and how it can do mixing and per-app volume control and how this means that ALSA and PulseAudio are not needed are totally underestimating what's needed in a modern audio stack. There still needs to be some kind of userspace daemon to govern these other sound systems and deal with multiple users. This is a non-trivial job and no other system out there is currently aiming to implement these capabilities.

One of the often overlooked advantages of PulseAudio is the "glitch free" system. This is an approach that disables interrupt-driven audio and instead relies on system timers. Modern kernels can provide these timers easily, and reducing the number of interrupts while using larger buffers allows you to greatly reduce the number of CPU wake-ups, thus saving power. This is actually a very important technique to implement when dealing with modern mobile platforms.
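A bit of back-of-the-envelope arithmetic shows why larger buffers and timers save wake-ups. The sketch below is only a model of the idea, not PulseAudio's scheduling code: the function names, the sample rate and the safety margin are assumptions picked for illustration.

```python
def wakeups_per_second(rate_hz, frames_per_wakeup):
    """How often the CPU must wake up if it refills the buffer
    every frames_per_wakeup frames at the given sample rate."""
    return rate_hz / frames_per_wakeup


def next_wakeup_delay(buffered_frames, rate_hz, safety_margin_s):
    """Seconds a timer-driven server could sleep before refilling.

    buffered_frames: frames still queued for playback.
    safety_margin_s: assumed head start, so the refill happens
    before the buffer actually runs dry (avoiding an under-run).
    """
    time_left_s = buffered_frames / rate_hz
    return max(0.0, time_left_s - safety_margin_s)
```

With these assumed numbers, refilling every 480 frames at 48 kHz means 100 wake-ups per second, while refilling every 96000 frames (two seconds of audio) means one wake-up every two seconds.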

Reuse

It's obviously important to ensure efficient code reuse. It doesn't make sense for all sound producing applications to implement direct support for "exotic" sound systems such as Bluetooth, UPnP and Apple Airtunes etc. To do so is very inefficient (there are some exceptions to this - e.g. a media player that targets Win/Lin/Mac will maybe need to implement direct support if it is to be available across the board). Keeping the implementation centralised and having a single app->sound system API is essential here.

Consistency of UI

One of my big problems with many applications is inconsistent UI. This is a problem on Windows as much as on Linux, but it's something OSX has done mostly right. Users go to a central GUI to configure their sound and choose which device is currently active/in use. In Linux land all sound-producing apps have their own config GUI for selecting sound devices. This is insane. Non-technical users don't know that you have to go to Tools->Preferences->Advanced->Sound in App A and Edit->Settings->Audio in App B. Sure, those of us who are reasonably technical will generally find the options (that's how we use applications - we click and look at all the settings pretty early on!), but it's going to be less than obvious for a massive number of users. Keeping the preferences centralised so the user always knows where to look is important, and for a general purpose application that outputs sound there should be no reason to provide any config option relating to this to the user - it should "just work"(tm).

Incompatibility

Some users have complained that some proprietary applications have stopped working with PulseAudio, Skype being an oft-mentioned example. Well, I'm sorry but that's just tough. If a closed source application does not implement an API cleanly and does bizarre things, there is nothing we can do to fix it. Skype would hit the same problems with any other ALSA plugin. I'm sorry to say it, but in order to move forward, some applications have to suffer and/or be forced into action. By denying the people who care about this stuff the right to improve things themselves, a vendor takes on the responsibility to do it itself, and it needs to live up to that responsibility. Considering the last version of Skype for Linux was released more than one and a half years ago, it's hard to consider it as anything more than abandonware at present. Will there be more pain like this? Yes, probably, but that's just the way things are - Free Software only truly works if the whole bundle is Free; if you mix and match you, as a user, have to accept this state of affairs. I do.

Desktop environments need to ensure they integrate nicely with PulseAudio. GNOME is obviously doing this, but KDE is lagging behind. I do hope to rectify the latter situation personally, and have a pretty clear roadmap to making this happen - it's just a matter of finding the time to do it!

Conclusion

So, with all this in mind, the sound stack has to be more than just a driver layer. It needs a persistent userspace layer that can run and keep track of various permission problems, deal with network connections and generally govern things. At present PulseAudio is fitting the bill pretty nicely and is continuing to add support for additional constructs in the Linux stack. As things stand all the major Linux distributions are now using PulseAudio with commercial interest from Nokia, Intel and Palm among others.

Future

So the future? Well, the drivers in ALSA need to be further debugged and developed to ensure the accuracy of their timing information; inaccurate timing has so far plagued the "glitch free" system in PulseAudio. Nothing has pushed the ALSA drivers to such limits before, but the benefits of the glitch free mode are clearly worth the pain. Applications using the ALSA API need to ensure that they are using it correctly and sticking to the safe subset whenever possible (thus ensuring compatibility with PulseAudio's ALSA plugin). In addition, applications such as media players need to deal properly with latencies. It's a bit of a myth that low latencies are needed by such applications - higher latencies will ensure better battery life on mobile players, and depending on how the user wants to route their sound (e.g. to the Bluetooth-enabled hi-fi system), latencies will be something beyond the control of the application in any event. It's therefore important to deal with this correctly and appropriately to ensure A/V sync. It's only been about half a year since the ALSA level limitations on buffer sizes were lifted after lobbying from the PulseAudio maintainer. Intel are even experimenting with 10 second buffers (that's not the same as latency!) in order to save power!
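To show what "dealing properly with latencies" means for A/V sync, here is a rough sketch. The idea is that a player should not assume a fixed, small latency; it should ask the sound system how long it takes for a freshly written sample to become audible (which is what a call such as snd_pcm_delay() is meant to report) and derive its audio clock from that. The function names and numbers below are hypothetical, and a real player would also have to handle drift, seeking and under-runs.

```python
def audio_clock_seconds(frames_written, delay_frames, rate_hz):
    """Timestamp (in seconds) of the sample being heard right now.

    frames_written: total frames the player has handed to the sound system.
    delay_frames: frames written but not yet audible, as reported by
    the sound system (assumed to be accurate).
    """
    return (frames_written - delay_frames) / rate_hz


def video_frame_due(frame_pts_s, frames_written, delay_frames, rate_hz):
    """True once the audio clock has reached the video frame's
    presentation timestamp, i.e. the frame should be on screen."""
    clock_s = audio_clock_seconds(frames_written, delay_frames, rate_hz)
    return frame_pts_s <= clock_s
```

Because the clock is computed from the reported delay rather than from a hard-coded guess, the same player logic keeps working whether the delay is a few milliseconds on a local card or a couple of seconds on a Bluetooth or network device.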

Every day more and more applications are tightening up their ALSA implementations. Every day the constructs of the Linux desktop are becoming more stable and solidified, offering a truly joined-up, multi-user and network-aware experience. I think this is particularly impressive considering the fact that (as far as I know) only three people are employed to look after the Linux sound stack: Takashi Iwai and Jaroslav Kysela on the ALSA side and Lennart Poettering on the PulseAudio side. While there are numerous other contributors, this is still pretty impressive progress with the resources at hand. It's also worth noting that two of the three are employed by Red Hat, the other by Novell.

While Mandriva will still provide an easy way to disable PulseAudio if you feel it's not right for you (just untick the box - it's not hard!!) or need to use these closed applications such as Skype, I believe that this will not be necessary in the not too distant future.

So where is Sound on Linux? In my opinion it's in a pretty good state - there are still lots of things to do, and that will never change, but there is a firm and solid framework out there now and it's getting better every day.