I was confused about the Kinect sensor.

I knew that it was somehow capable of recognising human gestures but I didn’t know exactly how that data was presented to the host computer. Peering in on the Kinect community from the outside, it was difficult to work out exactly what the Kinect sensor delivered.

Luckily I was able to find a sensor for sale at my local Meijer supermarket and snapped it up to find out what all the fuss was about.

Basically the Kinect appears to be a 640×480, 30fps video camera that knows the *depth* of every single pixel in the frame. It does this by projecting a pattern of dots over the scene with a near-infrared laser, and using a detector that measures the parallax shift of the dot pattern at each pixel.

Alongside this there is a regular RGB video camera that captures a standard video frame. This RGBZ (or RGB-D) data is then packaged up and sent to the host over USB.
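
As an aside, the values in that depth stream aren’t metres – they’re raw 11-bit disparity readings. A rough conversion, using one of the empirically fitted approximations floating around the Kinect-hacking community (treat the constants as illustrative rather than exact), looks something like this:

```python
# Rough sketch: convert one raw 11-bit Kinect disparity reading into an
# approximate distance in metres. The constants are an empirical fit from
# the Kinect-hacking community, not an official calibration.

def raw_depth_to_metres(raw_disparity: int) -> float:
    if raw_disparity >= 2047:      # 2047 means "no reading" for that pixel
        return float("nan")
    return 1.0 / (raw_disparity * -0.0030711016 + 3.3309495161)

print(raw_depth_to_metres(840))    # a raw value of ~840 comes out around 1.3 m
```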

First off, this is very cool – RGB-D cameras are traditionally very expensive, so $150 is a bargain. Top marks to Microsoft for this.

What it does not do is identify shapes within the field of view and map skeletal outlines onto the shapes it recognises. This was the most confusing thing for me, as every article I’d read had shown a skeletal representation as the explanation of what the Kinect does.

You would need to take each of the 640×480 frames and copy it into a framebuffer so it can be processed by a vision library like OpenCV. Typical operations would be to threshold the depth image to keep only the “closest” pixels, then perform a blob analysis to group those pixels into identifiable regions of interest, and then track those blobs over their lifetime.
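
To give a flavour of what that pipeline looks like, here is a rough sketch using OpenCV’s Python bindings. It assumes the driver hands you a 640×480 array of depth values in millimetres – adjust the thresholds if you are working with raw disparity values instead:

```python
import cv2
import numpy as np

def nearest_blobs(depth_mm: np.ndarray, near_mm: int = 500, far_mm: int = 900):
    """Threshold a depth frame to its nearest band, then group the pixels into blobs."""
    # Keep only pixels in the "closest" depth band of interest
    mask = ((depth_mm >= near_mm) & (depth_mm <= far_mm)).astype(np.uint8) * 255

    # Knock out speckle noise before grouping
    mask = cv2.medianBlur(mask, 5)

    # Group the surviving pixels into connected regions (blobs)
    num_labels, _, stats, centroids = cv2.connectedComponentsWithStats(mask)

    blobs = []
    for i in range(1, num_labels):               # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] > 200:     # ignore tiny specks
            blobs.append(tuple(centroids[i]))    # (x, y) centre of each blob
    return blobs
```

Tracking then means matching each frame’s centroids against the previous frame’s (nearest-neighbour matching is usually enough) so that a blob keeps its identity over its lifetime.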

This is actually quite a lot of work – and one thing I’ve noticed about some of the Kinect demo videos is a slight lag:

Both these demos are impressive, but I’m not totally convinced that they rely on the Kinect’s 3D abilities. I don’t think it would be too hard to implement either of these using OpenCV and a good 2D camera.

When I was working on the laser harp I spent some time trying various video cameras as potential detectors. What I found was that 30fps was too slow to get a response suitable for music – something in the 60–100fps range was better.

Also, 640×480 was just too much data to crunch at that rate; 320×240 was about the maximum that could be processed.
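
Some back-of-envelope numbers make the point – the bytes-per-pixel figures here are my assumptions (2 bytes for a depth reading, 1 byte for a monochrome pixel):

```python
def data_rate(width, height, fps, bytes_per_pixel):
    """Pixels per frame and MB/s for a given frame size, rate and pixel depth."""
    pixels_per_frame = width * height
    mb_per_second = pixels_per_frame * fps * bytes_per_pixel / 1e6
    return pixels_per_frame, mb_per_second

# 640x480 depth at 30fps, assuming 2 bytes per depth reading
print(data_rate(640, 480, 30, 2))    # (307200 pixels/frame, ~18.4 MB/s)

# 320x240 monochrome at 100fps, assuming 1 byte per pixel
print(data_rate(320, 240, 100, 1))   # (76800 pixels/frame, ~7.7 MB/s)
```

Each 640×480 frame carries four times the pixels of a 320×240 one, so even at only 30fps the per-frame processing work adds up quickly.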

The PS3Eye camera is excellent in this respect – it can deliver 120 fps at 320×240 monochrome – perfect for a laser harp type instrument or a super responsive “surface” computer.

One of the best things about the Wiimote / PixArt sensor is that it does the blob tracking in hardware, so you end up with a datastream containing the X/Y position (and rough size) of up to 4 bright points – perfect for an Arduino* or other “slow” microcontroller.

Where I think the Kinect will be outstanding is in robotic vision applications where the processor has time to analyse the image, update the internal world model and navigate accordingly.

But for true realtime operation, there is still a bit too much work to be done.

*But* maybe I’m wrong on this – the Xbox seems to manage all that processing, along with rendering a game too.

Comments very welcome on this one…

* iBlogCred *= 100; // Mention of Arduino

UPDATE:

(via the comment from Mike below)

This is a fantastic example of using the Kinect’s unique 3D capabilities. I urge you to check out all of the videos in the series.

What is really interesting is that in one of the later videos the framerate counter in the bottom corner is reporting 100fps. I don’t know how he’s getting a frame rate greater than the 30fps limit coming from the Kinect – maybe some of those frames are duplicates, in that the 3D rendering is happening quicker than the data is coming in.
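
If that is the case, the renderer has simply been decoupled from the capture loop and redraws the last depth frame while it waits for a new one. A minimal sketch of that idea – grab_depth_frame() and draw() are just stand-ins here, not a real Kinect or rendering API:

```python
import threading
import time

# Minimal sketch of a capture loop decoupled from a render loop. The two
# helper functions below are stand-ins, not a real Kinect or rendering API.

latest_frame = None
lock = threading.Lock()

def grab_depth_frame():
    time.sleep(1 / 30)        # stand-in for waiting on the next ~30fps sensor frame
    return time.time()        # pretend the timestamp is the frame data

def draw(frame):
    pass                      # stand-in for the 3D point-cloud rendering

def capture_loop():
    global latest_frame
    while True:
        frame = grab_depth_frame()
        with lock:
            latest_frame = frame

def render_loop():
    while True:
        with lock:
            frame = latest_frame
        if frame is not None:
            draw(frame)       # keeps redrawing the same frame until a new one arrives

threading.Thread(target=capture_loop, daemon=True).start()
render_loop()
```

An fps counter inside render_loop() would then report how fast draw() is being called, not how fast the Kinect is actually delivering frames.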

In closing, I want to stress that I am in no way trying to “disrespect” the Kinect sensor, or the brilliant work done by the hacking community to put the drivers out. It’s a very interesting device, and quite unique. I do think that the depth mapping is being underutilised in most of the early demos, and we all need to get our respective thinking caps on to come up with implementations that take advantage of it.

Follow-up to this article.