I am now back from Prague where I gave a talk on image stabilisation (and my holiday pictures). Hopefully a video of the talk will soon be online. In the meantime, I would like to explain my efforts a bit in written form, with some details slightly updated from the talk (the code has progressed a bit since then).

UPDATE: The talk is now online.

I got interested in the issues of image stabilisation through a helium balloon photography project in which I participated. I want to make a nice time lapse video from the pictures I have taken, but they were taken from a moving camera, which would make the result very shaky without some kind of postprocessing.

Thankfully, I work at Igalia, which means that on top of my personal time, I could spend some company time on this project (what we internally call hackfest time: up to 5 hours per week).

Original problem statement

I have around 4h30 of pictures taken from a balloon 100 metres high. The pictures were taken at a rate of one per minute, which makes around 270 pictures. I want to make a nice time lapse out of them.

Simply using the frames as they are to build a video does not work well. Partly because I would probably be legally required to include a warning for epileptic people at the beginning of the video, but mostly because people actually watching it would wish they were epileptic, to have a good excuse not to watch it. This is due to the huge differences between two consecutive frames. Here is an example of two consecutive frames in that series:

As you can see, a lot of pixels change from one frame to the next, and that does not look pretty. It is also pretty obvious that both are pictures of the same thing, and that they could be made quite similar, mainly by rotating one of them, and maybe reprojecting it a bit so that things align properly even though the point of view changed a bit between frames.

Standing on the shoulders of giants

There was no question in my mind that I wanted to use GStreamer for the task, by writing an element or set of elements to do the stabilisation. The two big advantages of this approach are:

- I can benefit from all the other elements of GStreamer, and I can easily do things like decode my pictures, turn them into a video, stabilise it and encode it in a format of my choice, all in one command (see the sketch after this list).
- Others could easily reuse my work, potentially in ways I could not think of. One idea would be to integrate it into PiTiVi in the future.

Then, after some research, I realised that OpenCV provides a lot of the tools needed for the task as well. Since I am still in a prototyping/research stage, and I hate writing loads of boilerplate, I am using Python for this project, though a later rewrite in C or C++ is not impossible.
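To give an idea of what "all in one command" means in practice, here is a minimal sketch using the GStreamer 0.10 Python bindings. The stabilisation part is illustrative: opticalflowfinder is the element mentioned later in this article, while "opticalflowrevert" is just a hypothetical name for an element applying the reverse transformation.

```python
# A sketch of the "one command" idea with the GStreamer 0.10 Python
# bindings. opticalflowfinder is the element discussed in this
# article; "opticalflowrevert" is a hypothetical placeholder for the
# element that would apply the correcting transformation.
import pygst
pygst.require("0.10")
import gst

pipeline = gst.parse_launch(
    'multifilesrc location=frame-%05d.jpg '
    'caps="image/jpeg,framerate=25/1" '
    '! jpegdec ! ffmpegcolorspace '
    '! opticalflowfinder ! opticalflowrevert '
    '! ffmpegcolorspace ! theoraenc ! oggmux '
    '! filesink location=timelapse.ogv')

pipeline.set_state(gst.STATE_PLAYING)
# Block until the pipeline finishes (or fails).
pipeline.get_bus().poll(gst.MESSAGE_EOS | gst.MESSAGE_ERROR,
                        gst.CLOCK_TIME_NONE)
pipeline.set_state(gst.STATE_NULL)
```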

First things first

I will not present things exactly in the order I researched them, but rather in the order I should have researched them: starting with a simpler problem, then getting into the complications of my balloon problem. The simpler problem at hand is presented to you by Joe the Hippo:

Joe the shaky hippo (video)

As you can see, Joe almost looks like he's on a boat. He isn't, but the cameraman is, and the video was taken with a lot of zoom. The movement in that video stream has a particularity that can make things simpler: the position of a feature on the screen does not change much from one frame to the next, because very little time elapses between consecutive frames. We will see that some potentially very useful algorithms take advantage of that particularity.

The steps of image stabilisation

As I see it for the moment, there are two basic steps in image stabilisation:

1. Find the optical flow (i.e. the movement) between two frames.
2. Apply a transformation that reverts that movement, on a global (frame) scale.

Step 2 is made rather easy by OpenCV with the findHomography() and warpPerspective() functions, so we won't talk much about it here.
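Still, to make step 2 concrete: assuming we already have the matching point lists origins and destinations (defined in the next section), reverting the motion could look roughly like this. This is a sketch, not my actual element code, and the RANSAC threshold is illustrative.

```python
import cv2
import numpy as np

def stabilise_frame(frame, origins, destinations):
    """Warp `frame` so that the features found at `destinations`
    move back to where they were in the previous frame.

    origins, destinations: lists of (x, y) pairs, same length.
    """
    src = np.float32(destinations).reshape(-1, 1, 2)
    dst = np.float32(origins).reshape(-1, 1, 2)
    # RANSAC lets findHomography() cope with a few outlier matches.
    homography, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    height, width = frame.shape[:2]
    return cv2.warpPerspective(frame, homography, (width, height))
```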

Optical flow

For all that matters in this study, we can say that for each frame the optical flow is represented by two lists of point coordinates, origins and destinations, such that the feature at the coordinates origins[i] in the previous frame is at the coordinates destinations[i] in the current frame.

Optical flow algorithms can be separated into two classes, depending on whether they provide the flow for all pixels (dense optical flow algorithms) or only for selected pixels (sparse optical flow algorithms). Both classes can in theory provide us with the right data (the origins and destinations point lists) to compute the reverse transformation we want to apply using findHomography(). I tried one algorithm of each class, choosing the ones that seemed popular to me after reading a bit of [Bradski2008]. Here is what I managed to do with them.

Dense optical flow

I tried OpenCV's implementation of the Horn-Schunck algorithm [Horn81]. I don't know if I used it incorrectly, or if the algorithm simply cannot be applied to this situation, but this is all I could do to Joe with it:

Now Joe is shaky and flickery

As you can see, this basically added flickering. I have not found time to improve on this since, and in the meantime I realised that this algorithm is considered obsolete in OpenCV: the new Python bindings do not even include it. Note that this does not mean that dense optical flow sucks: David Jordan, a Google Summer of Code student, does awesome things with a dense algorithm by Proesmans et al. [Proesmans94].

Sparse optical flow

I played with the Lucas-Kanade algorithm [Lucas81], in the implementation provided by OpenCV. Once I managed to find a good set of parameters (which are now the defaults in the opticalflowfinder element), I got pretty good results:

Joe enjoys the stability of the river bank, undisturbed by the movements of the water (video)

And it is quite fast too: on my laptop (with an i5 processor), I can stabilise Joe the hippo in real time (it is only a 640x480 video, though).
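For reference, here is roughly what driving that implementation from Python looks like. This is a minimal sketch: the parameter values shown are illustrative, not the defaults I settled on for opticalflowfinder.

```python
import cv2
import numpy as np

def lucas_kanade_flow(previous_gray, current_gray):
    """Return (origins, destinations) for features tracked between
    two consecutive greyscale frames with pyramidal Lucas-Kanade."""
    # Pick up to 200 well-textured corners in the previous frame
    # (quality level 0.01, at least 10 pixels apart).
    origins = cv2.goodFeaturesToTrack(previous_gray, 200, 0.01, 10)
    # Track those corners into the current frame.
    destinations, status, err = cv2.calcOpticalFlowPyrLK(
        previous_gray, current_gray, origins, None,
        winSize=(15, 15), maxLevel=3)
    # Keep only the points that were successfully tracked.
    found = status.ravel() == 1
    return origins[found], destinations[found]
```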

The balloon problem

As we have seen in the previous section, for a shaky hippo video, [Horn81] isn't any help, but [Lucas81] is pretty efficient. But can they be of any use for my balloon problem?

Unsuccessful results

I won't show any video here, because there is not much to see. Instead, here is an explanation in pictures of how the algorithms rate for the balloon time lapse. This is what Horn-Schunck can do:

The picture shows two consecutive frames of the time lapse (the older one is on the left). Each of the coloured lines goes from a point on the first image to the corresponding point on the second one, according to the algorithm (click on the image to see a larger version where the lines are more visible). Since Horn-Schunck is a dense algorithm, the coloured lines are only displayed for a random subset of points, to avoid clutter. Obviously, these lines show that the algorithm is completely wrong, and could not follow the big rotation happening between the two frames.

Does Lucas-Kanade rate better? Let's see:

This is the same kind of visualisation, except that there is no need to choose a subset, since the algorithm already does that. As for the result, it might be slightly less wrong than Horn-Schunck, but Lucas-Kanade does not seem to be of any help to us either.

The issue here, as said earlier, is that these two algorithms, like most optical flow algorithms, make the assumption that a given feature will not move more than a few pixels from one frame to the next (for some value of "a few pixels"). This assumption is very clever for typical video streams taken at 25 or 30 frames per second. Unfortunately, it is obviously wrong in the case of our stream, where the camera has time to move a lot between two frames (which are captured one minute apart). Is all hope lost? Of course not!

Feature recognition

I found salvation in feature recognition. OpenCV provides a lot of feature recognition algorithms. I have tried only one of them so far, but I hope to find the time to compare it with others in the future.

The one I tried is SURF ("Speeded Up Robust Features", [Bay06]). It finds "interesting" features in an image, along with descriptors associated with them. The descriptors it provides are invariant to rotation and scaling, which means that it is in theory possible to find the same descriptors from frame to frame. To efficiently compare the sets of descriptors I get for two consecutive frames, I use FLANN, which is well integrated in OpenCV. Here is a visualisation of how this method performs:

As you can see, this is obviously much better! There might be a few outliers, but OpenCV's findHomography() can handle them perfectly well, and here's a proof video (I am not including it in the article since it is quite high resolution). Obviously, the result is not perfect yet (especially towards the end), but it is quite promising, and I hope to fix the remaining glitches sooner rather than later.
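To show how the pieces fit together, here is a minimal sketch of the SURF + FLANN combination. This is not my element's actual code: the hessian threshold and ratio-test value are illustrative, and the API shown is the one from OpenCV 2.x, where SURF still lives in the main cv2 module.

```python
import cv2
import numpy as np

# Illustrative hessian threshold; higher values keep fewer,
# stronger features.
surf = cv2.SURF(1000)

FLANN_INDEX_KDTREE = 1

def surf_flann_matches(previous_gray, current_gray):
    """Return (origins, destinations): SURF features of the previous
    frame matched into the current frame with FLANN."""
    kp1, desc1 = surf.detectAndCompute(previous_gray, None)
    kp2, desc2 = surf.detectAndCompute(current_gray, None)
    # Index the current frame's descriptors with a kd-tree forest,
    # then look up the two nearest neighbours of each previous
    # descriptor.
    flann = cv2.flann_Index(desc2,
                            dict(algorithm=FLANN_INDEX_KDTREE, trees=4))
    indices, dists = flann.knnSearch(desc1, 2, params={})
    origins, destinations = [], []
    for i, (best, second) in enumerate(indices):
        # Ratio test: keep a match only if it is clearly better than
        # the runner-up, which weeds out most outliers before
        # findHomography() handles the rest.
        if dists[i][0] < 0.6 * dists[i][1]:
            origins.append(kp1[i].pt)
            destinations.append(kp2[best].pt)
    return origins, destinations
```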