
Once upon a time I was reading a Popular Mechanics article, the title of which eludes me. Something about playing different music for different parts of a dance floor. They were describing a way to focus sound towards different people.

What struck me about the idea was that there was a way to focus sound. It was a piece of mesh of some sort, which acted as a lens for ultrasonics. This sparked an idea for what ended up being the most complex and expensive of my hobby projects to date.

Imagine using such lenses to focus sound onto a plane of microphones. Just like light in a camera. One microphone is one pixel. An ability to see sound.

I didn’t actually read Daredevil comics until much later, but those who have can see where this is going.

For a long time it was just something at the edge of my mind. I was envisioning things like watching ambulances drive past, their color shifting from blue to red, or a firework’s bang lighting up the buildings one by one.

The initial idea was something huge: a truck-sized box with mesh optics and a board of microphones. Completely and utterly impossible to make.

Eventually I had the idea to make the scanning optical camera I described in the previous article, and wondered whether I’d be able to do sound with it as well.

I planned the camera with that in mind – the size of the hole, and the size of the box itself. It should have been able to resolve a basic image in ultrasonic ranges, where waves are short enough, using a simple microphone scanning head.

I made some wave propagation simulations to test the idea.

It should have gotten a few pixels at 16 kHz, with the microphone scanning head filtering and detecting that exact frequency (to avoid motor noise and so on).

Later, I ran the experiments for real, and got nothing but noise.

In hindsight, the inner surfaces should have been made anechoic, and even with that, the walls are a bit too transparent for the sound.

That idea failed.

I kept contemplating it every now and then, whenever I saw microphones sold in bulk for cheap. Instead of a scanning rig, I wanted to do a full matrix. I wanted to see the world in sound, and not at a frame per 30 seconds, but at 30 frames per second.

The scale was always too big to pull off or afford.

Then, I read an article about the FFT telescope — how you can resolve an image from a grid of wave sensors using zero optics and a lot of mathematics.

That was the first breakthrough: I didn’t need to build the box or the mesh optics! The project suddenly collapsed into something portable and plausible.

Along the way, I understood what Duga-3 really was, the “Russian Woodpecker” radio array you see at the top of the article. Ironically this Soviet-era monster, the largest source of radio noise during the Cold War, is the closest thing to what I wanted to make (only of a portable size).

It is a big grid of radio transmitters that can shape a wave.

The same principle works in reverse.

Imagine an 8×8 grid of microphones pointing at a tone generator, which is moved left and right. What would the grid hear?



Each microphone gets its sound, which is a waveform. FFT is something that can split this waveform into its constituent frequencies, a set of amplitude and phase for a set of “buckets” representing a frequency. This lets us get the intensities of the sound waves of different frequencies, rather than a trace of the microphone’s membrane going up and down.
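To make the "buckets" concrete, here is a minimal sketch in plain Python. It is not the code that runs on the camera. The 48 kHz sample rate, the 64-sample frame and the test tone are assumptions chosen so the tone lands exactly in one bucket. A naive transform is used instead of a real FFT to keep it short; the result is the same, only slower.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: one complex value per frequency bucket."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

SAMPLE_RATE = 48_000   # Hz, assumed for illustration
N = 64                 # samples per frame, assumed
TONE_BUCKET = 8        # put the tone exactly in bucket 8
tone_hz = TONE_BUCKET * SAMPLE_RATE / N   # 6000 Hz with these numbers

# A pure cosine with a known amplitude and phase: the "membrane trace".
AMPLITUDE, PHASE = 0.5, math.pi / 3
wave = [AMPLITUDE * math.cos(2 * math.pi * tone_hz * t / SAMPLE_RATE + PHASE)
        for t in range(N)]

spectrum = dft(wave)

# For real input the upper half of the spectrum mirrors the lower half,
# so only the first N/2 buckets are searched.
peak = max(range(N // 2), key=lambda k: abs(spectrum[k]))
peak_amplitude = 2 * abs(spectrum[peak]) / N   # undo the DFT scaling
peak_phase = cmath.phase(spectrum[peak])       # radians

print(f"bucket {peak} = {peak * SAMPLE_RATE / N:.0f} Hz, "
      f"amplitude {peak_amplitude:.2f}, phase {peak_phase:.2f} rad")
```

The amplitude says how loud that frequency is. The phase is the part that matters for the camera, because it encodes when the wave arrived at that particular microphone.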

Now contemplate a distant sound source. What would the sound hitting the camera be like, when it is pointed straight at it?

Pretty much all mics getting the same values at once: the wave reaches them all at the same time.

Now, what if the source were somewhere to one side of the grid?

The sound would hit the mics on one side a bit earlier than the mics in the centre and on the other side, so a wave appears to roll across the grid. The larger the deflection, the higher the spatial frequency of this rolling wave. For the source moving left and right, we would get waves that are slower, then faster, then slower, and so on.

So, what would we get if we apply 2D FFT to THESE waves, and plot them based on the deflection angle?

An IMAGE.

And that’s all the magic there is.
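The whole trick fits in a few lines of plain Python. This is a sketch under assumptions, not the camera's firmware: an ideal, distant, single-tone source, an 8×8 grid with the 2 cm spacing used later in the article, sound at 343 m/s, and an 8 kHz tone. The function names are made up for the example. Each microphone contributes one complex number (its FFT bucket for that tone), and a naive 2D transform over the grid turns the pattern of phases into pixels.

```python
import cmath
import math

N = 8                    # microphones per side
SPACING = 0.02           # metres between microphones
SPEED_OF_SOUND = 343.0   # m/s, assumed room-temperature air
FREQ = 8000.0            # Hz, assumed test tone
wavelength = SPEED_OF_SOUND / FREQ

def grid_phasors(ux, uy):
    """Complex bucket value seen by each mic for a distant tone.

    (ux, uy) are direction cosines: the components of the unit vector
    towards the source along the grid's column and row axes.
    """
    return [[cmath.exp(2j * math.pi * SPACING * (ux * col + uy * row) / wavelength)
             for col in range(N)] for row in range(N)]

def dft2(grid):
    """Naive 2D discrete Fourier transform over the microphone grid."""
    n = len(grid)
    return [[sum(grid[r][c] * cmath.exp(-2j * math.pi * (ky * r + kx * c) / n)
                 for r in range(n) for c in range(n))
             for kx in range(n)] for ky in range(n)]

def brightest(image):
    """(row, col) of the strongest pixel."""
    n = len(image)
    return max(((r, c) for r in range(n) for c in range(n)),
               key=lambda rc: abs(image[rc[0]][rc[1]]))

# One pixel of the output corresponds to this step in direction cosine.
step = wavelength / (N * SPACING)

straight = dft2(grid_phasors(0.0, 0.0))            # source dead ahead
off_axis = dft2(grid_phasors(2 * step, 1 * step))  # source up and to one side

print("straight ahead ->", brightest(straight))
print("off axis       ->", brightest(off_axis))
```

With the source dead ahead, all the energy lands in pixel (0, 0). Move the source and the bright pixel moves with it. A real viewer would also shift the image so that "straight ahead" sits in the middle of the frame rather than in the corner.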

Sadly, the idea was still too complex and too expensive to pull off.

Every microphone needed a pre-amp.

The outputs of the pre-amps needed to be fed to a fast analog-to-digital converter (ADC).

The ADC outputs then had to be fed into a field-programmable gate array (FPGA).

A routing nightmare, a soldering nightmare…

Then one day I was fixing my father’s iPhone, and noticed that it had an odd microphone in it. It was a chip with a hole, and looked quite unfamiliar.

Some googling later, I discovered the existence of MEMS microphones.

That was the second breakthrough – a MEMS microphone is etched directly in the silicon, and comes with the pre-amp and ADC already on chip. It’s a DIGITAL OUTPUT microphone!

Suddenly, the project collapsed in complexity, and for the first time it was on the edge of feasibility. With the digital microphones, all I really needed were the microphones themselves and an FPGA to process their output.

That made for a simple board, and for an affordable Bill of Materials (BOM).

I got a few of the mics and made a prototype with a spare FPGA board from my home automation system.

This only covered one line out of the grid, but it should prove the concept.

I tested the math, and it seemed to work. I could track a sound source and get an 8×1 image, of a sort.

Time to do it for real.

Even then, it was just barely cheap enough. I had to buy enough components to get the bulk prices — over 25 chips, over 1000 microphones — but not so many that the surplus would make the effective cost too high.

But in between, that U-shaped cost curve dipped just below the line of affordability.

I settled on a 32×32 array, made out of 8×8 cells. Each cell is a self-contained camera, optimized for syncing. The image was to be stored on a microSD card, with a few live output options.

This way I could use a cheap-ish FPGA and cheap-ish board manufacturing. Also, that makes the system flexible. Literally and figuratively.

The boards are to be mounted on a frame that maintains the spacing, with zero force being applied to the boards — the wind should go past them through the gaps.

2 cm between the microphones, 1 cm of gap between the cells.

With 64 channels of data to be sampled at 3 MHz, an FPGA was the only real option. A microcontroller can do one instruction at a time, executing one operation in a sequence: read data, process it, store it. Even the bigger ones would choke on only a dozen channels.

An FPGA, on the other hand, is a software-defined array of logic gates. You can define 64 pipelines of signal processing, all of which work simultaneously. FPGAs are great for signal processing tasks, at the cost of being much more complex to work with at both the circuit level and the software level.
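To put the comparison in numbers, here is a back-of-the-envelope sketch. The 64 channels and the 3 MHz clock are from above. The one-bit-per-clock output is an assumption (it is how PDM-style digital MEMS microphones typically behave), so treat the raw bit rate as an estimate.

```python
CHANNELS = 64               # microphones per cell
BIT_CLOCK_HZ = 3_000_000    # per-microphone sample clock
BITS_PER_SAMPLE = 1         # ASSUMPTION: 1-bit PDM-style output

# Raw data arriving at the FPGA pins of one cell.
raw_bits_per_second = CHANNELS * BIT_CLOCK_HZ * BITS_PER_SAMPLE

# A strictly sequential processor must touch every channel in turn,
# so this is all the time it gets per channel sample.
sequential_budget_ns = 1e9 / (CHANNELS * BIT_CLOCK_HZ)

# With one pipeline per channel, each pipeline gets a full clock period.
parallel_budget_ns = 1e9 / BIT_CLOCK_HZ

print(f"raw input: {raw_bits_per_second / 1e6:.0f} Mbit/s")
print(f"sequential budget: {sequential_budget_ns:.1f} ns per sample")
print(f"parallel budget:   {parallel_budget_ns:.1f} ns per sample")
```

A handful of nanoseconds to read, process and store a sample is a few clock cycles at best on a fast microcontroller. Sixty-four parallel pipelines each get 64 times longer.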

A typical FPGA would need several different voltages, proper decoupling and ground planes, input protection, external ROM, and so on. They come with the most complex and convoluted datasheets I’ve ever used. A far cry from a microcontroller you can just drop in and run with.

I took my time while designing the board. There would be no second chances – even in China a run of 22 quad-layer PCBs cost about $600. So I checked and rechecked the design, contemplating everything that could possibly go wrong.

I put in a bunch of options for the unexpected.

MicroSD card for storage.

Spare pattern for a Flash chip, in SOIC and in DIP.

Spare input and output interfaces.

Spare patterns to allow for in-line, pull-up and pull-down resistors.

I went over the design again and again.

It paid off later, as I ended up using most of the “just in case” options. Eventually, I pulled the trigger, and two weeks later China delivered.

Despite my precautions, I screwed up. It turns out the exposed pad of the FPGA MUST be soldered to the ground, and I had no vias to reach it from below.

I wonder what the datasheet writer was thinking when he decided to mention this critical, need-to-know information only in an inconspicuous footnote hidden several hundred pages deep.

Luckily it wasn’t a show stopper. All I had to do was reflow the chip with the heat gun, with a touch of solder left in between.

But it was a hassle.

Once that was solved, it turned out that the sucker worked!

Well, the LED blinks.

And a little later, I got to the microphones’ data over the debug channel.

Hmmm…

A few bad solder joint fixes later…

Much better.

Now, the time had come for the first real test.

There was something magical about it, the trepidation of finally approaching a point when an idea that had been bouncing around your head for a decade was about to become a reality.

After a few months of work and waiting and more work, this was it. I put the cell standing on a table and linked it up.

Then, I went ahead and started waving my phone in front of it, set to generate several tones. Crude software, no frame rate control, bad framing, a hack upon a hack.

But I got a video.

You can see two blobs – one is the phone, and the other is its reflection from the table.

A few blobs, but for me that was magical. Seeing sound, for real, for the first time.

I did some work on the software, got the thing untethered, recording to the microSD card as was originally planned, and started playing.

Now the sound source was the PC speakers, and I was standing some distance away, turning the camera left and right.



You might notice the gaps in the video. Turns out microSD cards are not as well behaved as I’d hoped. They have their own internal logic that can cause arbitrary delays, picking their own time to flush the buffers or erase more flash blocks. While the average write rate is more than fast enough, the latency is unpredictable.

And my hardware can only store one frame at a time, so there is no way to wait. I hoped to fix this later, one way or another, so I moved on to syncing several cells together.

It takes an evening of boring work to populate one cell’s PCB, so many podcasts later I got myself a 2×1 array.

The small board is the controller. It sends the trigger signals to the cells, letting them start recording a frame at exactly the same time.

The frame is stored on the microSD card.

I found that at around 10 FPS I can avoid most of the latency issues, so that became the go-to hack for the moment.

Let’s look around at the glorious 16×8 resolution.

The redder the blob, the lower the frequency; the bluer the blob, the higher.

The lower the frequency, the lower the precision with which it can be located. So random noises show up as big red blobs.

It has to do with the wavelength — if the wave is much larger than the array, then you can’t really detect its direction any more.
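A rough way to see this in numbers: the sketch below uses the common rule of thumb that an array cannot resolve angles much finer than wavelength divided by aperture (in radians). The 343 m/s speed of sound and the choice of test frequencies are assumptions, and the rule is only an approximation. It is not how the camera itself computes anything.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed room-temperature air
MIC_SPACING = 0.02       # metres
CELL_APERTURE = 7 * MIC_SPACING   # first to last mic of one 8-mic row

def wavelength(freq_hz):
    """Wavelength in metres of a tone in air."""
    return SPEED_OF_SOUND / freq_hz

def blur_degrees(freq_hz, aperture_m):
    """Rule-of-thumb angular blur: wavelength / aperture, in degrees.

    Capped at 180 degrees, meaning "no usable direction at all".
    """
    return min(180.0, math.degrees(wavelength(freq_hz) / aperture_m))

for freq in (500, 2000, 8000):
    print(f"{freq:>5} Hz: wavelength {wavelength(freq) * 100:5.1f} cm, "
          f"blur ~{blur_degrees(freq, CELL_APERTURE):5.1f} deg")
```

For a single cell, a 500 Hz wave is several times longer than the whole board, so it smears across the entire frame. Tones of a few kilohertz are the first to shrink into a recognisable blob. A wider array improves this for every frequency.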

But, something else was wrong. It took me a while and some tweaking to find what it was exactly.

Here is a video of the array sitting still, looking straight at a tone generator.

See the double-single blinking?

For some reason, the cells were not triggered at exactly the same time. A timing error caused one of the cells to skip an entire sampling step at the 48 kHz sampling rate. This might not sound like a lot, but it is huge in a system designed to measure sound wave directions by determining their phase shifts over a grid. During that roughly 21 microsecond delay the sound travels about 7 mm, which is about a third of the distance between microphones. That breaks the pattern.
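The arithmetic behind that is short enough to sketch. The 48 kHz rate and the 2 cm spacing are from the article, while 343 m/s for the speed of sound is an assumption, so the exact millimetre figure depends on that choice and on rounding. The helper name is made up for the example.

```python
SPEED_OF_SOUND = 343.0   # m/s, assumed room-temperature air
SAMPLE_RATE = 48_000     # Hz
MIC_SPACING = 0.02       # metres between microphones

delay_s = 1 / SAMPLE_RATE            # one skipped sampling step
slip_m = SPEED_OF_SOUND * delay_s    # how far the wavefront moves meanwhile

def phase_error_deg(freq_hz, delay):
    """Phase offset that a time delay adds to a tone of the given frequency."""
    return 360.0 * freq_hz * delay

print(f"delay: {delay_s * 1e6:.1f} us, slip: {slip_m * 1000:.1f} mm "
      f"({slip_m / MIC_SPACING:.0%} of the mic spacing)")
for freq in (1000, 4000, 8000):
    print(f"{freq} Hz: {phase_error_deg(freq, delay_s):.1f} deg of phase error")
```

The error grows with frequency, so the sharp, high-pitched blobs are hit hardest. To the 2D FFT, one cell lagging by a fixed phase looks like a step in the wavefront right at the seam between the cells.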

I tried to fix the things that looked like they might be the source of the problem, then tried to scale up and clean up the array.

Perhaps the issue was in some flaky wiring…

Two more cells in, and here is the same “looking around” performance in even more glorious 16×16 resolution:

There are a lot of reflections visible now – from the walls, ceiling, furniture and so on. Also, the blinking, while more dissolved, is still there. Apparently the timing errors are not gone yet.

At the same time, the microSD card writing just did not work as well as I would have liked. Different cards have different latencies, and even at 10 FPS, I was losing frames after tens of seconds of runtime. Not to mention that removing all the cards to plug them into a bunch of card readers is not quite a frictionless way to get images.

I wanted to explore, to see things in real time, not minutes or hours later, back home, plugging the cards in only to find lost frames.

I needed a non-storage processing pipeline…

I had a bulk interface on the PCBs, in anticipation of a centralized sampling approach. One cell produces a 4 Mbps data stream, and I figured it would take another FPGA board to poll them all, process the data and drive an LCD.

However, the processing in question was hugely complex for an FPGA implementation. I would have to make another special FPGA board, figure out a whole new system, and find a way to add a visible light camera to it, so I could tell what exactly I was looking at when tracking a sound…

For months, the project was dormant. Other projects came and went. One of them helped me figure out how cheap a powerful x86 computer is these days.

And then I realized that I don’t need to make the whole data processing pipeline in hardware. All I needed to do was to get the data out of the cells and into a little PC at the full speed.

These days a computer powerful enough to render for this thing is about the size of one of the cells, and comes with the bonus of being able to run a plain vanilla webcam to keep track of what the mysterious sound blobs correspond to.

Unfortunately, we are still talking about a total of 64 Mbps of data. That needs a USB 2.0 sampling board that would pull the data out of the cells over the bulk interface — another FPGA, albeit a much simpler one this time.
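The total is simple multiplication, sketched below. The per-cell rate and the array layout are from the article. The USB figures are the nominal signalling rates of the standard, and real bulk throughput is noticeably lower than those, so treat the headroom as optimistic.

```python
ARRAY_SIDE = 32        # microphones per side of the full array
CELL_SIDE = 8          # microphones per side of one cell
CELL_RATE_MBPS = 4     # data stream produced by one cell

cells = (ARRAY_SIDE // CELL_SIDE) ** 2
total_mbps = cells * CELL_RATE_MBPS

# Nominal signalling rates; usable throughput is lower in practice.
USB_FULL_SPEED_MBPS = 12     # USB 1.1 "full speed"
USB_HIGH_SPEED_MBPS = 480    # USB 2.0 "high speed"

print(f"{cells} cells x {CELL_RATE_MBPS} Mbps = {total_mbps} Mbps")
print("fits full speed:", total_mbps < USB_FULL_SPEED_MBPS)
print("fits high speed:", total_mbps < USB_HIGH_SPEED_MBPS)
```

So a full-speed link is out of the question, while a USB 2.0 high-speed link leaves a comfortable margin on paper.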

This would take time to make, to try and debug, and hopefully that second part of the project would deliver the true promise of the sonic vision…