Xinlu Huang is a Fellow from the most recent session of Insight Data Science in Silicon Valley. In this post she describes how she built TakeNote, a tool that helps students better navigate online course lectures by combining some innovative approaches to tracking changes in video content.

As someone who is always interested in learning new things, I have a love-hate relationship with online video lectures: they are at once highly effective and highly inefficient learning tools. For video lectures on technical subjects (e.g. math, computer science, physics, or engineering), I often find myself scrubbing through the progress bar until I see something unfamiliar on the blackboard, then rewinding to find a place to start watching, which usually takes a few tries to get right. Sometimes I go too far back, other times not far enough. The process is really inefficient. So when it came time to choose a project at Insight, I thought: why not build a tool to make learning from online video lectures easier?

The result is TakeNote. The usage is simple: give it a URL to a video lecture hosted on YouTube, and TakeNote will process the video, extracting the sections where the professor is writing on the blackboard. Using the video’s contents, the app builds a series of blackboard writing snippets and presents them to the user, who can then jump ahead to specific points in the video and hear what the lecturer has to say about a particular equation.

As a result, users can significantly decrease the amount of time spent watching a lecture while still capturing the most important information.

Scoping the project and finding a line of attack

When I first pitched this idea to my fellow Insight Fellows and the Insight team, I got two classes of feedback: 1) cool idea, and 2) this seems impossible to accomplish within 2–3 weeks. So setting the right scope for my project was the most important step! I decided to focus on a very limited set of cases: blackboard lecture videos with a fixed camera and relatively stable background lighting conditions. I also started with one source, the video lectures from the Galileo Galilei Institute for Theoretical Physics. The end product is applicable to general blackboard lecture videos given these limitations, but the performance does vary. I encourage you to test it out.

The goal of the project can be reduced to finding when blackboard text appears (and disappears, i.e. is erased) in a video, and then extracting images of just the blackboard writing. There are many ways to approach this problem, and I explored many options before settling on one that seemed the most feasible. My approach divided the problem into two smaller pieces: 1) identify image fragments of changes between video frames, and 2) classify those image fragments as blackboard writing vs. anything else. Being able to divide the problem into smaller, more tractable parts gave me some confidence that I could finish the project in 2–3 weeks.


Part I: Extracting image fragments of changes between frames

This was actually the harder part, and the part that could be most improved by further work. To begin, I needed to understand how digital images work and how to manipulate them in Python, something I hadn’t done before. I decided to only deal with grayscale images, since I wanted TakeNote to work independently of the lecturer’s shirt color or whether the blackboard is green or black. Grayscale images are actually very simple to handle in Python: you can read them as 2D numpy arrays containing integer values ranging from 0 to 255, where 0 is black, 255 is white, and anything in between is some shade of gray.
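As a minimal illustration, a grayscale frame really is just a 2D array (the tiny frame below is synthetic; with OpenCV you would typically get one via `cv2.cvtColor`):

```python
import numpy as np

# A grayscale frame is a 2-D array of uint8 values: 0 = black, 255 = white.
# (With OpenCV: gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY);
# here we just fabricate a tiny frame by hand.)
frame = np.array([[0,  64, 128],
                  [32, 192, 255]], dtype=np.uint8)

print(frame.shape)  # → (2, 3)  — each entry is one pixel's brightness
```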

The basic steps to identify, for lack of a better word, the “blobs” of changes between two video frames are:

1. Subtract one frame’s numpy array from the other.
2. Filter out the irrelevant pixels.
3. Downsample.
4. Identify clusters of pixels.

The image shows the result after each step. Image a) is the result of subtracting two frames, so pixel values range from -255 to 255; a value of 0 represents an unchanged pixel and corresponds to gray after rescaling. Images b) and c) are inverted for better contrast. In image c) you can see the degree of downsampling. The colors in d) correspond to different clusters.
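The four steps can be sketched in plain numpy. This is only a rough illustration of the idea, not the actual pipeline: the block size, the fixed threshold (which stands in for Otsu’s method, described below), and the simple flood-fill clustering are all illustrative choices.

```python
import numpy as np

def change_blobs(frame_a, frame_b, block=4, thresh=30):
    """Illustrative sketch of the four steps for finding change blobs."""
    # 1) subtract: signed difference, so values range from -255 to 255
    diff = frame_b.astype(np.int16) - frame_a.astype(np.int16)
    # 2) filter: keep pixels that got brighter, like new chalk on a dark board
    mask = diff > thresh
    # 3) downsample: a block is "on" if any pixel inside it changed
    h, w = mask.shape
    mask = mask[:h - h % block, :w - w % block]
    small = mask.reshape(h // block, block, w // block, block).any(axis=(1, 3))
    # 4) cluster: label 4-connected components with a simple flood fill
    labels = np.zeros(small.shape, dtype=int)
    nclusters = 0
    for seed in zip(*np.nonzero(small)):
        if labels[seed]:
            continue
        nclusters += 1
        stack = [seed]
        while stack:
            y, x = stack.pop()
            if labels[y, x] or not small[y, x]:
                continue
            labels[y, x] = nclusters
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if 0 <= ny < small.shape[0] and 0 <= nx < small.shape[1]:
                    stack.append((ny, nx))
    return labels, nclusters

# two new chalk strokes appear between two synthetic frames
frame_a = np.zeros((16, 16), dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[2:6, 2:6] = 200
frame_b[10:14, 10:14] = 200
labels, n = change_blobs(frame_a, frame_b, block=2)
print(n)  # → 2
```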

Step 2 is the crucial and most difficult step. In an ideal world, the only non-zero pixels left after step 1 would belong to moving objects between the frames, and thus represent new or erased writing, or the lecturer moving. But in the real world, lighting conditions change over time, and this results in spurious changed pixels. I filtered out the positive-valued pixels (these should correspond to a new dark pixel over a light background, which is clearly not like blackboard writing), then used Otsu’s method to remove the remaining background pixels. In a nutshell, Otsu’s method produces a threshold value to divide background and foreground pixels such that the variance within each group is minimized. Despite its simplicity, Otsu’s method works really well for images with a relatively clean background, such as frames from videos with stable lighting. For videos with more drastic lighting changes, I experimented with a variety of other methods. Thresholding on the mean of a Gaussian-blurred image worked best for these, but I ended up just focusing on videos with stable lighting conditions.
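Otsu’s method is simple enough to sketch directly (libraries like skimage provide a ready-made `threshold_otsu`; this minimal version just makes the idea concrete):

```python
import numpy as np

def otsu_threshold(img):
    """Minimal Otsu's method: pick the threshold that maximizes between-class
    variance, which is equivalent to minimizing within-class variance."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_between = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]          # pixels with value <= t form the background
        if w0 == 0:
            continue
        w1 = total - w0        # the rest form the foreground
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t

# a synthetic "difference image": dark background with one brighter patch
img = np.full((20, 20), 10, dtype=np.uint8)
img[5:10, 5:10] = 200
print(otsu_threshold(img))  # → 10
```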

For step 4, identifying the clusters, I chose a particularly simple scheme: connected pixels belong to the same cluster. This method relies on having a clean image from the filtering step. I then performed the morphological closing operation on the clusters to make them more uniformly shaped.

Alternatively, filtering is less important if the clustering method is smarter and less sensitive to noise. In particular, DBSCAN works relatively well and can identify noise that does not belong to any clusters. However, for my particular problem, the computation cost of running DBSCAN for each frame of the video was too high.
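For comparison, here is what DBSCAN looks like on a handful of changed-pixel coordinates, using scikit-learn (the `eps` and `min_samples` values are illustrative, not tuned for real frames):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Coordinates of changed pixels; DBSCAN groups the dense regions into
# clusters and labels the isolated straggler as noise (-1).
points = np.array([[0, 0], [0, 1], [1, 0],        # one tight cluster
                   [10, 10], [10, 11], [11, 10],  # another
                   [50, 50]])                     # an isolated noise pixel
labels = DBSCAN(eps=2, min_samples=2).fit_predict(points)
# the two groups become clusters 0 and 1; the lone point gets label -1
```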

Part II: Classifying image fragments

My video processing pipeline yields a small number of blobs that change between two given frames (5 on average across all pairs of frames in my test video). My next goal was to classify these blobs as either blackboard writing or something else.

I generated a small dataset of about 1500 image fragments from a few videos and hand labeled my data with the help of interactive widgets in a Jupyter Notebook (which is a cool feature and definitely worth checking out). With such a small training dataset, deep learning or even using individual pixel values as features was out of the question. Fortunately, blackboard writing is very different from other moving objects in these videos (like the movement of the lecturer’s head, for example), which meant I was able to generate new features specific to this classification problem.

Finding the right features took a lot of staring at the images, followed by trial and error. For inspiration I looked to skimage’s useful gallery of examples and read up on current techniques for identifying text-like fragments in images, such as this approach from MathWorks. Some of the most important features I used were extent (the percentage of white pixels over all pixels in the binarized image after applying Otsu’s method), corner_frac (the density of Harris corners within the white pixels), and ncluster_frac (a measure of how connected the white pixels are). If you think about a typical image of writing on a blackboard and compare it with the other objects that typically appear in these videos, such as the lecturer’s head or random noise, these features make a lot of sense, and the plots below demonstrate the segregating power of these features:
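As a toy numerical illustration of why extent separates the two classes (the fragments below are synthetic, and this re-creation of the feature is my own):

```python
import numpy as np

# Hypothetical re-creation of the "extent" feature: the fraction of white
# pixels in a binarized fragment. Chalk writing is mostly thin strokes, so
# its extent is low; a solid moving object (a head, a sleeve) fills most
# of its bounding box.
def extent(binary_fragment):
    return float(binary_fragment.mean())

stroke = np.zeros((20, 20), dtype=np.uint8)
stroke[5, 2:18] = 1                       # a thin line, like a chalk stroke
stroke[2:18, 10] = 1                      # a crossing stroke
blob = np.ones((20, 20), dtype=np.uint8)  # a solid filled region

print(extent(stroke), extent(blob))  # the stroke's extent is far lower
```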

I trained a Random Forest classifier with the features described above, along with a host of other, more basic, variables (e.g. size, aspect ratio, etc.):
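A sketch of this training step with scikit-learn, on synthetic stand-in data (the columns mimic the features described above, but the values and ranges are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hand-labeled fragments: columns mimic the
# extent, corner_frac, ncluster_frac, and aspect-ratio features, with
# invented (but plausible) ranges for each class.
rng = np.random.default_rng(0)
n = 300
writing = np.column_stack([
    rng.uniform(0.02, 0.15, n),   # low extent: thin chalk strokes
    rng.uniform(0.2, 0.6, n),     # many Harris corners per white pixel
    rng.uniform(0.3, 0.9, n),     # fragmented, loosely connected clusters
    rng.uniform(0.5, 4.0, n),     # aspect ratio
])
other = np.column_stack([
    rng.uniform(0.3, 0.9, n),     # solid objects fill their bounding box
    rng.uniform(0.0, 0.1, n),
    rng.uniform(0.0, 0.2, n),
    rng.uniform(0.5, 4.0, n),
])
X = np.vstack([writing, other])
y = np.array([1] * n + [0] * n)   # 1 = blackboard writing, 0 = anything else

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

On this cleanly separated synthetic data the AUC is near 1; real fragments are messier, hence the 0.96 reported below.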

The performance of a classifier can be quantified with the ROC curve. The area under the curve (AUC) for my model is about 0.96, very close to a perfect classifier (AUC of 1) and significantly better than a random classifier (AUC of 0.5).

Putting it together

At this point I could successfully identify blackboard-writing-like changes between video frames. To process a whole video, I set the first frame as the base frame and then checked for changes with respect to the base frame at 3-second intervals. When new writing appears, that frame becomes the new base frame and the process continues. To reduce false positives, I required that new blackboard writing stay on the board for at least 15 seconds. I chose to define “staying on the board” as F(α=2)>0.7 for new blackboard writing on consecutive frames with respect to the same base frame.

Here F(α) is defined between two binary images; it can be intuitively understood as a weighted product of the fraction of shared white pixels and the fraction of shared black pixels between the images, where 1 is the indicator function and p1i is the pixel value at position i of image 1, and so on. I chose α>1 to overweight the shared white pixel fraction.
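Sketched in code, one plausible form of F(α) consistent with this description (the exact weighting scheme here, an exponent α on the white fraction, is my assumption):

```python
import numpy as np

def F(img1, img2, alpha):
    """A plausible F(alpha): (shared-white fraction)**alpha times the
    shared-black fraction, each fraction taken over the pixels that are
    white (resp. black) in either image. alpha > 1 makes the score more
    sensitive to disagreement on white (i.e. chalk) pixels."""
    any_white = np.logical_or(img1 == 1, img2 == 1).sum()
    any_black = np.logical_or(img1 == 0, img2 == 0).sum()
    shared_white = np.logical_and(img1 == 1, img2 == 1).sum() / max(any_white, 1)
    shared_black = np.logical_and(img1 == 0, img2 == 0).sum() / max(any_black, 1)
    return shared_white ** alpha * shared_black

a = np.array([[1, 1, 0, 0]])
print(F(a, a, alpha=2))  # → 1.0  (identical images score 1)
```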

There is one additional problem: erasure. It can be handled similarly: if writing disappears in consecutive frames (within 15 seconds), it is considered erased. I chose the criterion for disappearance to be F(α=0.5)<0.4, overweighting the shared fraction of black pixels.

Building an interactive user interface

Using the algorithms described above as my backend, I built a web-based user interface where individuals can submit YouTube videos and obtain a “virtual blackboard” of the lecture. The raw video stream address is retrieved by the pafy package, and the video stream is decoded into still frames using OpenCV. The most interesting aspect of building my web app was dealing with the relatively long time needed for video processing. Presenting the results as they come in is a better user experience than serving a blank screen while processing.

To accomplish this, I decided to use HTML5 Server-Sent Events to stream the results as they come in. This opens a one-way socket between the server and your browser. You can read all about it elsewhere, but it is shockingly simple to implement in Python via Flask: just return a generator that yields data in a simple structure, with a mimetype of text/event-stream:
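A minimal sketch of the Flask side (`process_video` here is a hypothetical stand-in for the real pipeline, yielding one blackboard snippet at a time):

```python
import json
from flask import Flask, Response

app = Flask(__name__)

def process_video(url):
    # Hypothetical stand-in for the real pipeline: in TakeNote this would
    # decode frames and yield each detected blackboard snippet as it appears.
    for i in range(3):
        yield {"timestamp": i * 3, "image": f"snippet_{i}.png"}

@app.route("/stream")
def stream():
    def events():
        for snippet in process_video("some-video-url"):
            # Each SSE message is "data: <payload>" followed by a blank line.
            yield f"data: {json.dumps(snippet)}\n\n"
    return Response(events(), mimetype="text/event-stream")
```

The browser receives each snippet as soon as the generator yields it, rather than waiting for the whole video to finish processing.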

On the front end, a few lines of JavaScript subscribe to this stream with an EventSource object and handle each incoming message.

To finish the frontend, I added some JavaScript to position the streaming images correctly and to show a highlight on mouse-over.

Parting words

Building TakeNote without any prior background in image or video processing was intense but also intensely fun. What I learned along the way has also given me many ideas for how to improve the product, such as adding support for videos with moving cameras by tracking how the video frames move, or integrating subtitles or audio snippets into the app’s output. The possibilities are limitless and I’m excited to keep learning as I continue building TakeNote.