In the last case study on property-based testing (PBT) in Komposition we looked at timeline flattening. This post covers the video classifier, how it was tested before, and the bugs I found when I wrote property tests for it.

If you haven’t read the introduction or the first case study yet, I recommend checking them out!

Classifying Scenes in Imported Video

Komposition can automatically classify scenes when importing video files. This is a central productivity feature in the application, effectively cutting recorded screencast material automatically, letting the user focus on arranging the scenes of their screencast. Scenes are segments that are considered moving, as opposed to still segments:

A still segment is a sequence of at least \(S\) seconds of near-equal frames

A moving segment is a sequence of non-equal frames, or a sequence of near-equal frames with a duration less than \(S\)

\(S\) is a preconfigured minimum still segment duration in Komposition. In the future it might be configurable from the user interface, but for now it’s hard-coded.

Equality of two frames \(f_1\) and \(f_2\) is defined as a function \(E(f_1, f_2)\), described informally as:

comparing corresponding pixel color values of \(f_1\) and \(f_2\), with a small epsilon for tolerance of color variation, and

deciding two frames equal when at least 99% of corresponding pixel pairs are considered equal.
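To make the informal definition of \(E(f_1, f_2)\) concrete, here is a minimal sketch. It is not Komposition's implementation: the Pixel and Frame types, the function names, and the epsilon value of 8 are all assumptions chosen for illustration. The post only says that the epsilon is small and that the threshold is 99%.

```haskell
import Data.Word (Word8)

-- Hypothetical representation: a pixel is an RGB triple and a frame is a
-- flat list of pixels. The real classifier uses optimized pixel buffers.
type Pixel = (Word8, Word8, Word8)
type Frame = [Pixel]

-- Assumed per-channel tolerance; the actual value is not given in the post.
epsilon :: Int
epsilon = 8

-- Two pixels are near-equal when every color channel differs by at most epsilon.
pixelsNearEqual :: Pixel -> Pixel -> Bool
pixelsNearEqual (r1, g1, b1) (r2, g2, b2) =
  all close [(r1, r2), (g1, g2), (b1, b2)]
  where
    -- Convert to Int before subtracting to avoid Word8 wrap-around.
    close (a, b) = abs (fromIntegral a - fromIntegral b) <= epsilon

-- E(f1, f2): frames are equal when at least 99% of the corresponding
-- pixel pairs are near-equal. Frames of different sizes are never equal.
framesEqual :: Frame -> Frame -> Bool
framesEqual f1 f2
  | length f1 /= length f2 = False
  | null f1 = True
  | otherwise = equalPairs * 100 >= total * 99
  where
    total = length f1
    equalPairs = length (filter id (zipWith pixelsNearEqual f1 f2))
```

Using integer arithmetic for the 99% check avoids floating-point rounding at the threshold.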

In addition to the rules stated above, there are two edge cases:

The first segment is always considered a moving segment (even if it’s just a single frame)

The last segment may be a still segment with a duration less than \(S\)

The second edge case is not what I would call a desirable feature, but rather a shortcoming due to the classifier not doing any type of backtracking. This could be changed in the future.

Manually Testing the Classifier

The first version of the video classifier had no property tests. Instead, I wrote what I thought was a decent classifier algorithm, mostly messing around with various pixel buffer representations and parallel processing to achieve acceptable performance.

The only type of testing I had available, except for general use of the application, was a color-tinting utility. This was a separate program using the same classifier algorithm. It took as input a video file, and produced as output a video file where each frame was tinted green or red, for moving and still frames, respectively.

Video classification shown with color tinting

In the recording above you see the color-tinted output video based on a recent version of the classifier. It classifies moving and still segments rather accurately. Before I wrote property tests and fixed the bugs that I found, it did not look so pretty, flipping back and forth at seemingly random places.

At first, debugging the classifier with the color-tinting tool seemed like a creative and powerful technique. But the feedback loop was horrible: I had to record video, process it using the slow color-tinting program, and inspect the result by eye. In hindsight, I can conclude that PBT is far more effective for testing the classifier.

Video Classification Properties

Figuring out how to write property tests for video classification wasn’t obvious to me. It’s not uncommon in example-based testing that tests end up mirroring the structure, and even the full implementation complexity, of the system under test. The same can happen in property-based testing.

With some complex systems it’s very hard to describe the correctness as a relation between any valid input and the system’s observed output. The video classifier is one such case. How do I decide if an output classification is correct for a specific input, without reimplementing the classification itself in my tests?

The other way around is easy, though! If I have a classification, I can convert that into video frames. Thus, the solution to the testing problem is to not generate the input, but instead generate the expected output. Hillel Wayne calls this technique “oracle generators” in his recent article.

The classifier property tests generate high-level representations of the expected classification output, which are lists of values describing the type and duration of segments.

A generated sequence of expected classified segments

Next, the list of output segments is converted into a sequence of actual frames. Frames are two-dimensional arrays of RGB pixel values. The conversion is simple:

Moving segments are converted to a sequence of alternating frames, flipping between all gray and all white pixels

Still segments are converted to a sequence of frames containing all black pixels
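As a rough illustration of this conversion, here is a minimal sketch. The Segment, Color, and Frame types and the function names are hypothetical stand-ins. The helper used in the actual tests is testSegmentsToPixelFrames, which works on RGB pixel arrays and is not shown in this post.

```haskell
-- Hypothetical high-level segment: its type and its length in frames.
data Segment = Moving Int | Still Int
  deriving (Eq, Show)

-- Only three colors are needed for the generated frames.
data Color = Black | Gray | White
  deriving (Eq, Show)

-- A frame as rows of pixels (a stand-in for a two-dimensional RGB array).
type Frame = [[Color]]

-- A frame of the given width and height filled with a single color.
solidFrame :: Int -> Int -> Color -> Frame
solidFrame width height color = replicate height (replicate width color)

-- Moving segments alternate between gray and white frames, so consecutive
-- frames always differ. Still segments repeat the same black frame.
segmentToFrames :: Int -> Int -> Segment -> [Frame]
segmentToFrames w h (Moving n) =
  take n (cycle [solidFrame w h Gray, solidFrame w h White])
segmentToFrames w h (Still n) =
  replicate n (solidFrame w h Black)

-- Concatenate the frames of all segments, in order.
segmentsToFrames :: Int -> Int -> [Segment] -> [Frame]
segmentsToFrames w h = concatMap (segmentToFrames w h)
```

Because moving frames never use black, the first and last frame of a moving segment always differ from a neighboring still segment.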

The example sequence in the diagram above, when converted to pixel frames with a frame rate of 10 FPS, can be visualized like in the following diagram, where each thin rectangle represents a frame:

Pixel frames derived from a sequence of expected classified output segments

By generating high-level output and converting it to pixel frames, I have input to feed the classifier with, and I know what output it should produce. Writing effective property tests then comes down to writing generators that produce valid output, according to the specification of the classifier. In this post I’ll show two such property tests.

Testing Still Segment Minimum Length

As stated in the beginning of this post, classified still segments must have a duration greater than or equal to \(S\), where \(S\) is the minimum still segment duration used as a parameter for the classifier. The first property test we’ll look at asserts that this invariant holds for all classification output.

    hprop_classifies_still_segments_of_min_length = property $ do

      -- 1. Generate a minimum still segment length/duration
      minStillSegmentFrames <- forAll $ Gen.int (Range.linear 2 (2 * frameRate))
      let minStillSegmentTime = frameCountDuration minStillSegmentFrames

      -- 2. Generate output segments
      segments <- forAll $
        genSegments (Range.linear 1 10)
                    (Range.linear 1
                                  (minStillSegmentFrames * 2))
                    (Range.linear minStillSegmentFrames
                                  (minStillSegmentFrames * 2))
                    resolution

      -- 3. Convert test segments to actual pixel frames
      let pixelFrames = testSegmentsToPixelFrames segments

      -- 4. Run the classifier on the pixel frames
      let counted = classifyMovement minStillSegmentTime (Pipes.each pixelFrames)
                    & Pipes.toList
                    & countSegments

      -- 5. Sanity check
      countTestSegmentFrames segments === totalClassifiedFrames counted

      -- 6. Ignore last segment and verify all other segments
      case initMay counted of
        Just rest ->
          traverse_ (assertStillLengthAtLeast minStillSegmentTime) rest
        Nothing -> success
      where
        resolution = 10 :. 10

This chunk of test code is pretty busy, and it’s using a few helper functions that I’m not going to bore you with. At a high level, this test:

1. Generates a minimum still segment duration, based on a minimum frame count (let’s call it \(n\)) in the range \([2, 20]\). The classifier currently requires that \(n \geq 2\), hence the lower bound. The upper bound of 20 frames is an arbitrary number that I’ve chosen.

2. Generates valid output segments using the custom generator genSegments, where moving segments have a frame count in \([1, 2n]\), and still segments have a frame count in \([n, 2n]\).

3. Converts the generated output segments to actual pixel frames. This is done using a helper function that returns a list of alternating gray and white frames, or all black frames, as described earlier.

4. Runs the classifier on the pixel frames and counts the number of consecutive frames within each classified segment, producing a list like [Moving 18, Still 5, Moving 12, Still 30].

5. Performs a sanity check that the number of frames in the generated expected output is equal to the number of frames in the classified output. The classifier must not lose or duplicate frames.

6. Drops the last classified segment, which according to the specification can have a frame count less than \(n\), and asserts that all other still segments have a frame count greater than or equal to \(n\).
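The counting and checking in the last three steps can be sketched in plain Haskell. This is only an illustration of the idea: Movement, Counted, countRuns, totalFrames, and stillSegmentsLongEnough are hypothetical names, not the helpers used in Komposition's test suite, and the check works on frame counts rather than durations.

```haskell
import Data.List (group)

-- Hypothetical per-frame classification result.
data Movement = MovingFrame | StillFrame
  deriving (Eq, Show)

-- A run of consecutive frames with the same classification.
data Counted = Moving Int | Still Int
  deriving (Eq, Show)

-- Collapse per-frame classifications into counted runs,
-- e.g. three moving frames followed by two still frames.
countRuns :: [Movement] -> [Counted]
countRuns frames =
  [ toCounted m (length run) | run@(m : _) <- group frames ]
  where
    toCounted MovingFrame = Moving
    toCounted StillFrame = Still

-- Total number of frames across all runs, used for the sanity check.
totalFrames :: [Counted] -> Int
totalFrames = sum . map size
  where
    size (Moving k) = k
    size (Still k) = k

-- Every still run except the last one must have at least n frames.
stillSegmentsLongEnough :: Int -> [Counted] -> Bool
stillSegmentsLongEnough n counted = all ok (dropLast counted)
  where
    ok (Still k) = k >= n
    ok (Moving _) = True
    dropLast xs = take (length xs - 1) xs
```

In a property test, the sanity check would compare totalFrames of the classified runs with the frame count of the generated segments before checking the still segment lengths.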

Let’s run some tests.

    > :{
    | hprop_classifies_still_segments_of_min_length
    |   & Hedgehog.withTests 10000
    |   & Hedgehog.check
    | :}
      ✓ <interactive> passed 10000 tests.

Cool, it looks like it’s working.

Sidetrack: Why generate the output?

Now, you might wonder why I generate output segments first, and then convert to pixel frames. Why not generate random pixel frames to begin with? The property test above only checks that the still segments are long enough!

The benefit of generating valid output becomes clearer in the next property test, where I use it as the expected output of the classifier. Converting the output to a sequence of pixel frames is easy, and I don’t have to state any complex relation between the input and output in my property. When using oracle generators, the assertions can often be plain equality checks on generated and actual output.

But there’s benefit in using the same oracle generator for the “minimum still segment length” property, even if it’s more subtle. By generating valid output and converting to pixel frames, I can generate inputs that cover the edge cases of the system under test. Using property test statistics and coverage checks, I could inspect coverage, and even fail test runs where the generators don’t hit enough of the cases I’m interested in.

Had I generated random sequences of pixel frames, then perhaps the majority of the generated examples would only produce moving segments. I could tweak the generator to get closer to either moving or still frames, within some distribution, but wouldn’t that just be a variation of generating valid scenes? It would be worse, in fact. I wouldn’t then be reusing existing generators, and I wouldn’t have a high-level representation that I could easily convert from and compare with in assertions.

Testing Moving Segment Time Spans

The second property states that the classified moving segments must start and end at the same timestamps as the moving segments in the generated output. Compared to the previous property, the relation between generated output and actual classified output is stronger.

    hprop_classifies_same_scenes_as_input = property $ do

      -- 1. Generate a minimum still segment duration
      minStillSegmentFrames <- forAll $ Gen.int (Range.linear 2 (2 * frameRate))
      let minStillSegmentTime = frameCountDuration minStillSegmentFrames

      -- 2. Generate test segments
      segments <- forAll $
        genSegments (Range.linear 1 10)
                    (Range.linear 1
                                  (minStillSegmentFrames * 2))
                    (Range.linear minStillSegmentFrames
                                  (minStillSegmentFrames * 2))
                    resolution

      -- 3. Convert test segments to actual pixel frames
      let pixelFrames = testSegmentsToPixelFrames segments

      -- 4. Convert expected output segments to a list of expected time spans
      --    and the full duration
      let durations = map segmentWithDuration segments
          expectedSegments = movingSceneTimeSpans durations
          fullDuration = foldMap unwrapSegment durations

      -- 5. Classify movement of frames
      let classifiedFrames =
            Pipes.each pixelFrames
            & classifyMovement minStillSegmentTime
            & Pipes.toList

      -- 6. Classify moving scene time spans
      let classified =
            (Pipes.each classifiedFrames
             & classifyMovingScenes fullDuration)
            >-> Pipes.drain
            & Pipes.runEffect
            & runIdentity

      -- 7. Check classified time span equivalence
      expectedSegments === classified

      where
        resolution = 10 :. 10

Steps 1–3 are the same as in the previous property test. From there, this test:

4. Converts the generated output segments into a list of time spans. Each time span marks the start and end of an expected moving segment. The full duration of the input is needed in step 6, so it is computed here as well.

5. Classifies the movement of each frame, i.e. whether it’s part of a moving or still segment.

6. Runs the second classifier function, classifyMovingScenes, on the full duration and the frames with classified movement data, resulting in a list of time spans.

7. Compares the expected and actual classified lists of time spans.
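To illustrate how expected time spans can be derived from generated segments, here is a minimal sketch. It is not the movingSceneTimeSpans helper from the test suite: the Segment and TimeSpan types, the use of Rational for seconds, and the function names are assumptions made for this example. It also assumes that moving and still segments alternate, so adjacent moving segments are not merged.

```haskell
import Data.Ratio ((%))

-- Hypothetical high-level segment: its type and its length in frames.
data Segment = Moving Int | Still Int
  deriving (Eq, Show)

-- Exact rational seconds avoid floating-point noise in equality checks.
type Seconds = Rational

-- Start and end timestamps of a moving segment.
data TimeSpan = TimeSpan Seconds Seconds
  deriving (Eq, Show)

frameCount :: Segment -> Int
frameCount (Moving n) = n
frameCount (Still n) = n

-- Convert a frame offset to seconds at the given frame rate.
toSeconds :: Int -> Int -> Seconds
toSeconds fps frames = toInteger frames % toInteger fps

-- Time spans of all moving segments, from running sums of frame counts.
movingTimeSpans :: Int -> [Segment] -> [TimeSpan]
movingTimeSpans fps segments =
  [ TimeSpan (toSeconds fps start) (toSeconds fps end)
  | (Moving _, start, end) <- zip3 segments starts ends
  ]
  where
    ends = scanl1 (+) (map frameCount segments)
    starts = 0 : ends

-- Full duration of the input, as needed by the second classifier step.
fullDuration :: Int -> [Segment] -> Seconds
fullDuration fps = toSeconds fps . sum . map frameCount
```

For example, at 10 FPS a single moving segment of 10 frames gives one time span from 0 to 1 second, which is the expected value discussed in the next section.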

While this test looks somewhat complicated with its setup and various conversions, the core idea is simple. But is it effective?

Bugs! Bugs everywhere!

Preparing for a talk on property-based testing, I added the “moving segment time spans” property a week or so before the event. At this time, I had used Komposition to edit multiple screencasts. Surely, all significant bugs were caught already. Adding property tests should only confirm the level of quality the application already had. Right?

Nope. First, I discovered that my existing tests were fundamentally incorrect to begin with. They were not reflecting the specification I had in mind, the one I described in the beginning of this post.

Furthermore, I found that the generators had errors. At first, I used Hedgehog to generate the pixels used for the classifier input. Moving frames were based on a majority of randomly colored pixels and a small percentage of equally colored pixels. Still frames were based on a random single color.

The problem I had not anticipated was that the colors used in moving frames were not guaranteed to be distinct from the color used in still frames. In small-sized examples I got black frames at the beginning and end of moving segments, and black frames for still segments, resulting in different classified output than expected. Hedgehog shrinking the failing examples’ colors towards 0, which is black, highlighted this problem even more.

I made my generators much simpler, using the alternating white/gray frames approach described earlier, and went on to run my shiny new tests. Here’s what I got:

What? Where does 0s–0.6s come from? The classified time span should’ve been 0s–1s, as the generated output has a single moving scene of 10 frames (1 second at 10 FPS). I started digging, using the annotate function in Hedgehog to inspect the generated and intermediate values in failing examples.

I couldn’t find anything incorrect in the generated data, so I shifted focus to the implementation code. The end timestamp 0.6s was consistently showing up in failing examples. Looking at the code, I found a curious hard-coded value 0.5 being bound and used locally in classifyMovement.

The function is essentially a fold over a stream of frames, where the accumulator holds vectors of previously seen and not-yet-classified frames. Stripping down and simplifying the old code to highlight one of the bugs, it looked something like this:

    classifyMovement minStillSegmentTime =
      case ... of
        InStillState{..} ->
          if someDiff > minEqualTimeForStill
            then ...
            else ...
        InMovingState{..} ->
          if someOtherDiff >= minStillSegmentTime
            then ...
            else ...
      where
        minEqualTimeForStill = 0.5

Let’s look at what’s going on here. In the InStillState branch it uses the value minEqualTimeForStill, instead of always using the minStillSegmentTime argument. This is likely a residue from some refactoring where I meant to make the value a parameter instead of having it hard-coded in the definition.

Sparing you the gory implementation details, I’ll outline two more problems that I found. In addition to using the hard-coded value, it incorrectly classified frames based on that value. Frames that should’ve been classified as “moving” ended up “still”. That’s why I didn’t get 0s–1s in the output.

Why didn’t I see 0s–0.5s, given the hard-coded value 0.5? Well, there was also an off-by-one bug, in which one frame was classified incorrectly together with the accumulated moving frames.
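Spelled out with the numbers from the failing example (10 FPS, so each frame lasts 0.1 seconds), the hard-coded threshold corresponds to five frames, and the observed end timestamp is exactly one frame later:

\[ 0.5\,\mathrm{s} \times 10\,\mathrm{FPS} = 5\ \text{frames}, \qquad 0.6\,\mathrm{s} \times 10\,\mathrm{FPS} = 6\ \text{frames} = 5 + 1 \]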

The classifyMovement function is 30 lines of Haskell code juggling some state, and I managed to mess it up in three separate ways at the same time. With these tests in place I quickly found the bugs and fixed them. I ran thousands of tests, all passing.

Finally, I ran the application, imported a previously recorded video, and edited a short screencast. The classified moving segments were notably better than before.

Summary

A simple streaming fold can hide bugs that are hard to detect with manual testing. The consistent result of 0.6, together with the hard-coded value 0.5 and a frame rate of 10 FPS, pointed clearly towards an off-by-one bug. I consider this a great showcase of how powerful shrinking in PBT is, consistently presenting minimal examples that point towards specific problems. It’s not just a party trick on ideal mathematical functions.

Could these errors have been caught without PBT? I think so, but what effort would it require? Manual testing and introspection did not work for me. Code review might have revealed the incorrect definition of minEqualTimeForStill, but perhaps not the off-by-one and incorrect state handling bugs. There are of course many other QA techniques, and I won’t evaluate them all. But given the low effort that PBT requires in this setting, the number of problems it finds, and the accuracy it provides when troubleshooting, I think it’s a clear win.

I also want to highlight the iterative process that I find naturally emerges when applying PBT:

1. Think about how your system is supposed to work. Write down your specification.

2. Think about how to generate input data and how to test your system, based on your specification. Tune your generators to provide better test data. Try out alternative styles of properties. Perhaps model-based or metamorphic testing fits your system better.

3. Run tests and analyze the minimal failing examples. Fix your implementation until all tests pass.

This can be done when modifying existing code, or when writing new code. You can apply this without having any implementation code yet, perhaps just a minimal stub, and the workflow is essentially the same as TDD.

Coming Up

The final post in this series will cover testing at a higher level of the system, with effects and multiple subsystems being integrated to form a full application. We will look at property tests that found many bugs and that made a substantial refactoring possible.

Until then, thanks for reading!

Credits

Thank you Ulrik Sandberg, Pontus Nagy, and Fredrik Björeman for reviewing drafts of this post.

Buy the Book

This series is now available as an ebook on Leanpub. While the content is mostly the same, there are a few changes bringing it up to date. Also, if you’ve already enjoyed the articles, you might want to support my work by purchasing this book. Finally, you might enjoy a nicely typeset PDF, or an EPUB book, over a web page.