Mask R-CNN algorithm in low light - thinks it sees a cat ¯\_(ツ)_/¯

There are plenty of approaches to do Object Detection. YOLO (You Only Look Once) is the algorithm of choice for many, because it passes the image through the Fully Convolutional Neural Network (FCNN) only once. This makes the inference fast. About 30 frames per second on a GPU.

Object bounding boxes predicted by YOLO (Joseph Redmon)

Another popular approach is the use of Region Proposal Network (RPN). RPN based algorithms have two components. First component gives proposals for Regions of Interests (RoI)… i.e. where in the image might be objects. The second component does the image classification task on these proposed regions. This approach is slower. Mask R-CNN is a framework by Facebook AI that makes use of RPN for object detection. Mask R-CNN can operate at about 5 frames per second on a GPU. We will use Mask R-CNN.

Why use a slow algorithm when there are faster alternatives? Glad you asked!

Mask R-CNN also outputs object-masks in addition to object detection and bounding box prediction.

Object masks and bounding boxes predicted by Mask R-CNN (Matterport)

The following sections contain explanation of the code and concepts that will help in understanding object detection, and working with camera inputs with Mask R-CNN, on Colab. It’s not a step by step tutorial but hopefully it would be as effective. At the end of this article, you will find the link to Colab notebook to try it yourself.

Matterport has a great implementation of Mask R-CNN using Keras and Tensorflow. They have provided Notebooks to play with Mask R-CNN, to train Mask R-CNN with your own dataset and to inspect the model and weights.

Why Google Colab

If you don’t have a GPU machine, or don’t want to go through the tedious task of setting up the development environment, Colab is the best temporary option.

In my case, I had lost my favorite laptop recently. So, I am on my backup machine - a windows tablet with a keyboard. Colab enables you to work in a Jupyter Notebook in your browser, connected to a powerfull GPU or a TPU (Tensor Processing Unit) virtual machine in Google Cloud. The VM comes pre-installed with Python, Tensorflow, Keras, PyTorch, Fastai and a lot of other important Machine Learning tools. All for free. Beware that your session progress gets lost due to a few minutes of inactivity.

Getting started with Google Colab

The Welcome to Colaboratory guide gets you started easily. And the Advanced Colab guide comes in handy when taking input from camera, communicating between different cells of the notebook, and communication between Python and JavaScript code. If you don’t have time to look at them, just remember the following.

A cell in Colab notebook usually contains Python code. By default the code runs inside /content directory of the connected Virtual Machine. Ubuntu is the operating system of Colab VMs and you can execute system commands by starting the line of the command with ! .

The following command will clone the repository.

! git clone https://github.com/matterport/Mask_RCNN

If you have multiple system commands in the same cell, then you must have %%shell as the first line of the cell followed by system commands. Thus, the following set of commands will clone the repository, change the directory to Mask_RCNN and setup the project.

%%shell # clone Mask_RCNN repo and install packages git clone https://github.com/matterport/Mask_RCNN cd Mask_RCNN python setup.py install

Import Mask R-CNN

The following code comes from Demo Notebook provided by Matterport. We only need to change the ROOT_DIR to ./Mask_RCNN , the project we just cloned.

The python statement sys.path.append(ROOT_DIR) makes sure that the subsequent code executes within the context of Mask_RCNN directory where we have Mask R-CNN implementation available. The code imports the necessary libraries, classes and downloads the pre-trained Mask R-CNN model. Go through it. The comments make it easier to understand the code.

import os import sys import random import math import numpy as np import skimage.io import matplotlib import matplotlib.pyplot as plt # Root directory of the project ROOT_DIR = os . path . abspath ( "./Mask_RCNN/" ) # Import Mask RCNN sys . path . append ( ROOT_DIR ) # To find local version of the library from mrcnn import utils import mrcnn.model as modellib from mrcnn import visualize # Import COCO config sys . path . append ( os . path . join ( ROOT_DIR , "samples/coco/" )) # find local version import coco % matplotlib inline # Directory to save logs and trained model MODEL_DIR = os . path . join ( ROOT_DIR , "logs" ) # Local path to trained weights file COCO_MODEL_PATH = os . path . join ( ROOT_DIR , "mask_rcnn_coco.h5" ) # Download COCO trained weights from Releases if needed if not os . path . exists ( COCO_MODEL_PATH ): utils . download_trained_weights ( COCO_MODEL_PATH ) # Directory of images to run detection on IMAGE_DIR = os . path . join ( ROOT_DIR , "images" )

Create Model from Trained Weights

Following code creates model object in inference mode, so we could run predictions. Then it loads the weights from the pre-trained model that we downloaded earlier, into the model object.

# Create model object in inference mode. model = modellib . MaskRCNN ( mode = "inference" , model_dir = MODEL_DIR , config = config ) # Load weights trained on MS-COCO model . load_weights ( COCO_MODEL_PATH , by_name = True )

Run Object Detection

Now we test the model on some images. Mask_RCNN repository has a directory named images that contains… you guessed it… some images. The following code takes an image from that directory, passes it through the model and displays the result on the notebook along with bounding box information.

# Load a random image from the images folder file_names = next ( os . walk ( IMAGE_DIR ))[ 2 ] image = skimage . io . imread ( os . path . join ( IMAGE_DIR , random . choice ( file_names ))) # Run detection results = model . detect ([ image ], verbose = 1 ) # Visualize results r = results [ 0 ] visualize . display_instances ( image , r [ 'rois' ], r [ 'masks' ], r [ 'class_ids' ], class_names , r [ 'scores' ])

The result of prediction

Processing 1 images image shape: (425, 640, 3) min: 0.00000 max: 255.00000 uint8 molded_images shape: (1, 1024, 1024, 3) min: -123.70000 max: 151.10000 float64 image_metas shape: (1, 93) min: 0.00000 max: 1024.00000 float64 anchors shape: (1, 261888, 4) min: -0.35390 max: 1.29134 float32

Working with Camera Images

In the advanced usage guide of Colab, they have provided code that can capture image from webcam in the notebook and then forward it to the Python code.

Colab notebook has pre-installed python package called google.colab which contains handy helper methods. There’s a method called output.eval_js which helps us evaluate the JavaScript code and returns the output to Python. And in JavaScript, we know that there is a method called getUserMedia() which enables us to capture the audio and/or video stream from user’s webcam and microphone.

Have a look at the following JavaScript code. Using getUserMedia() method of WebRTC API of JavaScript, it captures the video stream of the webcam and draws the individual frames on HTML canvas. Like google.colab Python package, we have google.colab library available to us in JavaScript. This library will help us invoke a Python method using kernel.invokeFunction function from our JavaScript code.

The image captured from webcam is converted to Base64 format. This Base64 image is passed to a Python callback method, which we will define later.

async function takePhoto ( quality ) { // create html elements const div = document . createElement ( ' div ' ); const video = document . createElement ( ' video ' ); video . style . display = ' block ' ; // request the stream. This will ask for Permission to access // a connected Camera/Webcam const stream = await navigator . mediaDevices . getUserMedia ({ video : true }); // show the HTML elements document . body . appendChild ( div ); div . appendChild ( video ); // display the stream video . srcObject = stream ; await video . play (); // Resize the output (of Colab Notebook Cell) to fit the video element. google . colab . output . setIfrmeHeight ( document . documentElement . scrollHeight , true ); // capture 5 frames (for test) for ( let i = 0 ; i < 5 ; i ++ ) { const canvas = document . createElement ( ' canvas ' ); canvas . width = video . videoWidth ; canvas . height = video . videoHeight ; canvas . getContext ( ' 2d ' ). drawImage ( video , 0 , 0 ); img = canvas . toDataURL ( ' image/jpeg ' , quality ); // Call a python function and send this image google . colab . kernel . invokeFunction ( ' notebook.run_algo ' , [ img ], {}); // wait for X miliseconds second, before next capture await new Promise ( resolve => setTimeout ( resolve , 250 )); } stream . getVideoTracks ()[ 0 ]. stop (); // stop video stream }

We already discussed that having %%shell as the first line of the Colab notebook cell makes it run as terminal commands. Similarly, you can write JavaScript in the whole cell by starting the cell with %%javascript . But we will simply put the JavaScript code we wrote above, inside the Python code. Like this:

from IPython.display import display , Javascript from google.colab.output import eval_js from base64 import b64decode def take_photo ( filename = 'photo.jpg' , quality = 0.8 ): js = Javascript ( ''' // ... // JavaScript code here <<== // ... ''' ) # make the provided HTML, part of the cell display ( js ) # call the takePhoto() JavaScript function data = eval_js ( 'takePhoto({})' . format ( quality ))

Python - JavaScript Communication

The JavaScript code we wrote above invokes notebook.run_algo method of our Python code. The following code defines a Python method run_algo which accepts a Base64 image, converts it to a numpy array and then passes it through the Mask R-CNN model we created above. Then it shows the output image and processing stats.

Important! Don’t forget to surround the Python code of your callback method in try / except block and log it. Because it will be invoked by JavaScript and there will be no sign of what error occurred while calling the Python callback.

import IPython import time import sys import numpy as np import cv2 import base64 import logging from google.colab import output from PIL import Image from io import BytesIO def data_uri_to_img ( uri ): """convert base64 image to numpy array""" try : image = base64 . b64decode ( uri . split ( ',' )[ 1 ], validate = True ) # make the binary image, a PIL image image = Image . open ( BytesIO ( image )) # convert to numpy array image = np . array ( image , dtype = np . uint8 ); return image except Exception as e : logging . exception ( e ); print ( '

' ) return None def run_algo ( imgB64 ): """ in Colab, run_algo function gets invoked by the JavaScript, that sends N images every second, one at a time. params: image: image """ image = data_uri_to_img ( imgB64 ) if image is None : print ( "At run_algo(): image is None." ) return try : # Run detection results = model . detect ([ image ], verbose = 1 ) # Visualize results r = results [ 0 ] visualize . display_instances ( image , r [ 'rois' ], r [ 'masks' ], r [ 'class_ids' ], class_names , r [ 'scores' ] ) except Exception as e : logging . exception ( e ) print ( '

' )

Let’s register run_algo as notebook.run_algo . Now it will be invokable by the JavaScript code. We also call the take_photo() Python method we defined above, to start the video stream and object detection.

# register this function, so JS code could call this output . register_callback ( 'notebook.run_algo' , run_algo ) # put the JS code in cell and run it take_photo ()

Try it yourself

You are now ready to try Mask R-CNN on camera in Google Colab. The notebook will walk you step by step through the process.

(Optional) For Curious Ones

The process we used above converts the camera stream to images in browser (JavaScript) and sends individual images to our Python code for object detection. This is obviously not real-time. So, I spent hours trying to upload the WebRTC stream from the JavaScript (peer A) to the Python Server (peer B) without success. Perhaps my unfamiliarity with the combination of async / await with Python Threads was the main hindrance. I was trying to use aiohttp as Python server that will handle WebRTC connection using aiortc . The Python library aiortc makes it easy to create Python as a peer of WebRTC. Here is the link to the Colab notebook with incomplete effort of creating WebRTC server.

What’s next

Facebook AI Research is working on almost similar problem, but through simulations. The algorithm that can solve the navigation problem will have applications in Augmented / Virtual Reality, Security, Robotics, and tools for Visually Impaired people.

Facebook AI has effectively solved the task of point-goal navigation by AI agents in simulated environments, using only a camera, GPS, and compass data. Agents achieve 99.9% success in a variety of virtual settings, such as houses and offices. - @facebookai

In Part 2, we will use PlaneRCNN to extract the depthmap and surface planes from a single image. The next step will be to use this depthmap information to understand the environment around us, using the smartphone camera.

A Tiny Request

I am pretty new to Deep Learning and my approach to the problem could be very naive. If you think you or someone you know can advise me on how to approach the “Navigation for Visually Impaired” problem, please connect at emadulehsan@gmail.com.