Earlier this year I designed a system that would be able to find and count animals in infrared drone footage. I set out to test a few different technologies to prove that it could be done reliably.

As always, my solution was loosely coupled and modular, allowing components to be easily swapped out if they failed to meet expectations or if better or more cost-effective options became available. During the process, I came up with, and tested, three different architectures that I thought could do the job.

Detecting the Animals

I didn’t have all the drone footage I needed yet, so I got some sample IR footage of wild hogs from YouTube to test with. Some of the footage was aerial from helicopters and some was shot side-on from the ground. I wanted to give my models the best chance of detecting a hog, so I collected as many examples as possible, grabbing screenshots from the videos I had collected. A lot of the footage also contained coyotes, so I took the opportunity to collect pictures of those as well.

In general, I would say that the quality of the data I was using was quite low. A lot of the images were blurred, and the images ended up being quite small, some only around 100×100 pixels. I used around 200 samples to test with, which is also quite a small number, but I decided to start there and return to the tedious task of collecting more screenshots only if I needed to.

Below are some examples of the types of images I had collected…

Wild Hogs in Infrared

Wild Coyotes in Infrared

Examples of some of the testing images I used

TensorFlow

I had a good read of Léo Beaucourt’s article “Real-time and video processing object detection using Tensorflow, OpenCV and Docker” and used it as a base to set up my TensorFlow model. I went with TensorFlow because object detection has been explored extensively on the platform and I was already familiar with it.

TensorFlow is an open-source machine learning library built for high-performance computation. Its flexible architecture allows easy deployment of computation across a variety of platforms, from desktops to clusters of servers to mobile and edge devices.

A custom TensorFlow implementation on Ubuntu costs nothing in terms of licensing; however, it does cost in terms of infrastructure and development time. The benefit is that you get a purpose-built AI that has the potential to yield better results, is cheaper to scale, and can be extended and customised.

Pros:

Inexpensive to use.

Can be modified in any way needed.

The model can be finely tuned to get the required results.

Can benefit from data beyond images; the model could later be trained with additional features such as location or weather data to predict where certain animals are likely to be at a certain time of year.

Cons:

Could be slower at classifying images than ready-made commercial systems, depending on the infrastructure.

Will need to maintain a hosting server.

The model will need to be developed, adding to development cost and time.

More footage would be required to match the accuracy of commercial models.

More time will be needed to train the model.

Implementation Steps:

1 — Set up the Ubuntu server.

2 — Install Python and TensorFlow on the server.

3 — Train the model with relevant images.

4 — Test predictions with the model.

A big advantage of this approach is that I had already set it up to parse video, and it could tag multiple results in one frame. Other API endpoint systems would require me to send the footage frame by frame, and only Microsoft Cognitive Services could give me detailed bounding box information on multiple results in one image.
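To give a feel for what this looks like in practice, below is a minimal sketch of the inference loop, assuming a TF1-era frozen object-detection graph exported with the TensorFlow Object Detection API (the tensor names and file name are that API’s conventions, not something specific to my setup):

```python
# Minimal sketch: run a frozen TF1 object-detection graph over video frames.
# Assumes a model exported with the TensorFlow Object Detection API
# (tensor names like 'image_tensor:0' are that API's convention).
import cv2
import numpy as np
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

cap = cv2.VideoCapture('hog_footage.mp4')
with tf.Session(graph=graph) as sess:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        boxes, scores, classes = sess.run(
            [graph.get_tensor_by_name('detection_boxes:0'),
             graph.get_tensor_by_name('detection_scores:0'),
             graph.get_tensor_by_name('detection_classes:0')],
            feed_dict={'image_tensor:0': np.expand_dims(frame, axis=0)})
        # One frame can hold several animals, so every box above a
        # confidence threshold is kept, not just the top result.
        hits = [(box, score) for box, score in zip(boxes[0], scores[0])
                if score > 0.5]
cap.release()
```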

Microsoft Cognitive Services

Cognitive Services is a collection of dashboards and APIs within Azure that allows developers to access intelligent processes such as image recognition and speech and language understanding for their applications.

Azure Custom Vision is part of the Computer Vision cognitive service that lets you build, deploy, and improve your own image classification models within the Azure image recognition GUI. The provided AI service can apply labels to images according to their visual characteristics after undergoing supervised training. Images can be sent to the trained model to be classified via a REST API call, and a result is returned in JSON format.

Pros:

Simple to use and train.

No need to host high-end servers, or to have a high-end computer or phone, to use the service.

The system is improved by Microsoft.

Cons:

Cannot modify the model, can only train it.

Must be connected to the internet to use the model.

Can only learn from image data; the model cannot be trained with additional features such as location or weather data.

Using the model was very simple, and results can be returned with bounding box information, which was exactly what I needed. As I mentioned before, the major advantage Cognitive Services had was that it could tag multiple objects in a single image. This was crucial for my use case, as it is highly likely there would be more than one animal in an image at a time.
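As a hedged sketch, this is roughly what a detection call against the Custom Vision prediction REST API looks like; the region, project ID, iteration name, and key are placeholders you would take from the Custom Vision dashboard:

```python
# Sketch of a Custom Vision object-detection prediction call.
# <region>, <project-id>, <iteration>, and <prediction-key> are placeholders
# taken from the Custom Vision dashboard.
import requests

url = ('https://<region>.api.cognitive.microsoft.com/customvision/v3.0/'
       'Prediction/<project-id>/detect/iterations/<iteration>/image')
headers = {'Prediction-Key': '<prediction-key>',
           'Content-Type': 'application/octet-stream'}

with open('frame_0001.jpg', 'rb') as f:
    result = requests.post(url, headers=headers, data=f.read()).json()

# Each prediction carries a tag and a normalised bounding box, so several
# animals in one frame come back as separate entries.
for p in result.get('predictions', []):
    if p['probability'] > 0.5:
        print(p['tagName'], p['probability'], p['boundingBox'])
```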

Ximilar

The Czech company Ximilar provides a computer vision service that is built to challenge the likes of Google and Microsoft in the image recognition space. I found that it serves as a very capable alternative and was very straightforward to get going. I had a lot of communication from the Ximilar team, offering to give me a hand training models and advice on how to get the most out of their platform. I highly recommend working with the team, as I found their openness and level of service refreshing.

To use the platform, I uploaded my training images to Ximilar and tagged them using the GUI. The GUI allowed me to configure the training service to automatically flip and rotate my images and to mutate them randomly, adding static and changing their colour. This was perfect, as it allowed me to simulate different types of infrared and images that could be taken at all sorts of angles from a drone in the air.

Some of my Training Images Tagged in Ximilar
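Once trained, the model is reachable over REST. The sketch below is my best recollection of the shape of Ximilar’s recognition API; the endpoint path, payload fields, and task ID are assumptions and should be checked against their current documentation:

```python
# Hedged sketch of classifying an image with Ximilar's recognition API.
# The endpoint path, 'task_id', and '_base64' field are assumptions based on
# my reading of their docs; verify against the current documentation.
import base64
import requests

with open('frame_0001.jpg', 'rb') as f:
    payload = {
        'task_id': '<task-id>',  # placeholder from the Ximilar dashboard
        'records': [{'_base64': base64.b64encode(f.read()).decode('ascii')}],
    }

resp = requests.post('https://api.ximilar.com/recognition/v2/classify',
                     headers={'Authorization': 'Token <api-token>'},
                     json=payload)
print(resp.json())
```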

I tried a few different scenarios as shown below and got very promising results with only a small number of images. My future tests would have the model find animals in larger pictures.

Detection Rates for a Hog or Coyote Comparer

Detection Rates for a Comparer that Differentiates Between Hog Tops or Sides and Coyotes

The test below shows the need for more, higher-quality data. I tested using an unrelated image, but the model felt it still needed to categorise something… either that or the Naruto Runner looks like a prancing hog?

Naruto Runner or Hog?

An attempt to debug the issue revealed one limitation: no coordinates are returned for a result on the image; the platform only reports that there was a result in the image and what the result was:

Ximilar JSON Sample

This is a limitation the Ximilar team may have been able to help me rectify had I asked them later during implementation.

Designing the System

To supplement the crucial AI component and provide a usable system for the field, I needed a system around it that would allow the drones to send information to the model and the user to get information about the footage. This would mean the drone sending video in the case of TensorFlow, or images as frames to my web-based models. Other data, such as flight telemetry, would need to be extracted from the drone’s .DAT files if I wanted to use it in the future. One such use is to generate .KML map files so that flights can be viewed visually.
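To illustrate that last point, here is a minimal sketch of turning telemetry points into a .KML path, assuming the (timestamp, latitude, longitude) tuples have already been parsed out of the proprietary .DAT file:

```python
# Hedged sketch: write a list of (timestamp, lat, lon) telemetry points as a
# simple .KML flight path. Parsing the proprietary .DAT format is out of
# scope here; the input tuples are assumed to be pre-extracted.
def write_kml(points, path='flight.kml'):
    # KML expects "lon,lat,alt" triples, one per coordinate.
    coords = '\n'.join(f'{lon},{lat},0' for _, lat, lon in points)
    kml = f"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Drone flight</name>
      <LineString><coordinates>
{coords}
      </coordinates></LineString>
    </Placemark>
  </Document>
</kml>"""
    with open(path, 'w') as f:
        f.write(kml)

# Example with two made-up points over Western Australia.
write_kml([(0.0, -31.950, 115.860), (1.0, -31.951, 115.861)])
```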

Peripheral Components

Azure Blob Storage is Microsoft’s object storage solution for the cloud. Blob storage is optimised for storing massive amounts of unstructured data, which is data that does not adhere to a particular data model or definition, such as text or binary data. I wanted to collect as much of this as possible from the drone and also from weather websites, as I imagined there’s a chance that temperature, weather, and moon phases play a role in animal locations that we do not yet understand.
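Archiving footage this way is only a few lines with the azure-storage-blob SDK; the connection string and container name below are placeholders:

```python
# Sketch: archive a flight video in Azure Blob Storage with the
# azure-storage-blob SDK (v12). The connection string and container name
# are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string('<connection-string>')
container = service.get_container_client('flight-archive')

with open('trimmed_flight.mp4', 'rb') as data:
    # Prefixing blobs with the flight date keeps the archive browsable.
    container.upload_blob(name='2019-06-01/flight01.mp4', data=data)
```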

Xamarin is a cross-platform framework and toolkit that allows developers to efficiently create native user interfaces and apps that can be shared across iOS, Android, and the Universal Windows Platform. I planned to use this as my app framework as it’s C#-based, can be easily deployed to Azure, and would mean I only have to make one app for both iOS and Android.

Azure Functions is a solution for easily running small pieces of code in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. I thought this would be a decent solution for handling WebAPI requests and sending the different data and results to where they need to go.
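A hedged sketch of the relay role I had in mind, written as an HTTP-triggered Azure Function in Python; PREDICT_URL is a placeholder app setting pointing at the TensorFlow server or Custom Vision endpoint:

```python
# Sketch of an HTTP-triggered Azure Function that relays a frame to the
# prediction service and returns its JSON verdict. PREDICT_URL is a
# placeholder app setting pointing at the model's endpoint.
import os

import azure.functions as func
import requests

def main(req: func.HttpRequest) -> func.HttpResponse:
    frame = req.get_body()  # raw image bytes posted by the app
    resp = requests.post(os.environ['PREDICT_URL'], data=frame,
                         headers={'Content-Type': 'application/octet-stream'})
    return func.HttpResponse(resp.text, mimetype='application/json',
                             status_code=resp.status_code)
```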

Ubuntu is a free and open-source Linux distribution. I planned to house my TensorFlow implementation on an Ubuntu server, as opposed to within a Docker container running Ubuntu.

The System Using TensorFlow

Animals in IR Drone Footage Detector Using TensorFlow

Training Step 1

Infrared video footage and flight data are downloaded from the drone by the developer.

Training Step 2

The developer takes the image data and the flight data and sends them to the Ubuntu server, where they can be used to train the custom TensorFlow model.

Training Step 3

The developer uses the image data and the flight data to train the custom TensorFlow model on the combined features of the flight data and the images from each frame of the footage.

Training Step 4

The model is produced, converted into a format that can be used in the custom TensorFlow prediction service, and frozen for export.

Training Step 5

The frozen model is loaded into the custom TensorFlow prediction service to be used for prediction.

Prediction Step 1

The user conducts a surveillance/recon flight using the drone, collecting infrared images and drone flight data.

Prediction Step 2

The user downloads the data from the drone and sends it to the cross-platform Xamarin app. The app can be used on iOS or Android devices.

Prediction Step 3

The user uses the app to select which video they want to send to the AI for prediction. The user can trim the beginning or end of the footage so that sending irrelevant takeoff and landing footage can be avoided. The app automatically couples the data file with the video file and trims/sanitises the data appropriately.

Usage Step 1

The app takes the trimmed footage, samples it at 5 frames per second, and sends each frame to the Azure Function App. At the same time, the app uploads the data and the footage into Azure Blob Storage for archiving and potential future use.
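The sampling step itself is simple; this is a Python/OpenCV sketch of the logic (the real app would be Xamarin/C#, so this only mirrors the approach):

```python
# Sketch of the 5-frames-per-second sampling step, in Python/OpenCV for
# clarity; the production app would implement the same logic in C#.
import cv2

cap = cv2.VideoCapture('trimmed_flight.mp4')
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
step = max(int(round(fps / 5)), 1)       # keep every Nth frame => ~5/second

index, kept = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        kept.append((index / fps, frame))  # (timestamp in seconds, image)
    index += 1
cap.release()
```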

Usage Step 2

The Function App calls the AI on the Ubuntu server to predict the content of each image. The prediction results return to the Function App, which passes them back to the Xamarin app for each frame; the Xamarin app then assembles a shortened video containing only the frames that include animals and saves it to the user’s video library within the app. The times of the frames that contain animals are identified and used to extract the location data from the .DAT file and generate a .KML file for map visualisation.
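The reassembly step could look something like the sketch below, where `flagged` is a hypothetical list pairing each sampled frame with the AI’s verdict:

```python
# Sketch of the reassembly step: write only the frames the AI flagged as
# containing animals back out as a shortened clip. 'flagged' is a
# hypothetical list of (frame, has_animal) pairs from the prediction step.
import cv2

def write_highlights(flagged, out_path='highlights.mp4', fps=5):
    frames = [frame for frame, has_animal in flagged if has_animal]
    if not frames:
        return  # nothing detected, nothing to write
    height, width = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*'mp4v'),
                             fps, (width, height))
    for frame in frames:
        writer.write(frame)
    writer.release()
```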

The System Using Cognitive Services

Animals in IR Drone Footage Detector Using Microsoft Cognitive Services

Training Step 1

Infrared video footage and flight data are downloaded from the drone by the developer.

Training Step 2

The developer takes only the images from the footage and uploads them to the Custom Vision Training dashboard.

Training Step 3

The developer uses the uploaded images to train the model using Microsoft’s Custom Vision Training dashboard.

Training Step 4

The trained model is then attached to the Custom Vision Prediction Service for use via API calls by the Xamarin App.

Prediction Step 1

The user conducts a surveillance/recon flight using the drone, collecting infrared images and drone flight data.

Prediction Step 2

The user downloads the data from the drone and sends it to the cross-platform Xamarin app. The app can be used on iOS or Android devices.

Prediction Step 3

The user uses the app to select which video they want to send to the AI for prediction. The user can trim the beginning or end of the footage so that sending irrelevant takeoff and landing footage can be avoided.

Usage Step 1

The app takes the trimmed footage, samples it at 5 frames per second, and sends each frame to the AI. At the same time, the app uploads the data and the footage into Azure Blob Storage for archiving and potential future use.

The prediction results return to the app for each frame; the app assembles a shortened video containing only the frames that include animals and saves it to the user’s video library within the app.

The times of the frames that contain animals are identified and used to extract the location data from the .DAT file and generate a .KML file for map visualisation.

The System Using Ximilar

Animals in IR Drone Footage Detector Using Ximilar

Training Step 1

Infrared video footage and flight data are downloaded from the drone by the developer.

Training Step 2

The developer takes only the images from the footage and uploads them to the Ximilar dashboard.

Training Step 3

The developer uses the uploaded images to train the model using Ximilar’s training dashboard.

Training Step 4

The trained model can then be reached via API calls by the Xamarin App.

Prediction Step 1

The user conducts a surveillance/recon flight using the drone, collecting infrared images and drone flight data.

Prediction Step 2

The user downloads the data from the drone and sends it to the cross-platform Xamarin app. The app can be used on iOS or Android devices.

Prediction Step 3

The user uses the app to select which video they want to send to the AI for prediction. The user can trim the beginning or end of the footage so that sending irrelevant takeoff and landing footage can be avoided.

Usage Step 1

The app takes the trimmed footage, samples it at 5 frames per second, and sends each frame to the AI. At the same time, the app uploads the data and the footage into Azure Blob Storage for archiving and potential future use.

The prediction results return to the app for each frame; the app assembles a shortened video containing only the frames that contain animals and saves it to the user’s video library within the app.

The times of the frames that contain animals are identified and used to extract the location data from the .DAT file and generate a .KML file for map visualisation.

Conclusion

After designing the system, Ximilar let me know that their platform had undergone a major upgrade and could now tag images in a way that allows more than one tag to be assigned to an image. In light of this information, I would have chosen Ximilar for this system due to its simplicity and also its price, which was lower than Cognitive Services. Later on down the track, if I wanted to include supplemental data such as drone or weather data, I would then swap Ximilar out for TensorFlow.

For the Artificial Intelligence / Machine Learning (AI/ML) components of the solution to be accurate, I think at least 4000 diverse training examples of each animal would be needed before I would be comfortable using it in production. Based on my brief testing, I think the system would be able to reach around 70% accuracy.

About Feral Pigs in Western Australia

In my examples, I was primarily trying to detect feral pigs. In Western Australia, feral pigs were introduced by settlers and are responsible for the destruction of the natural habitat of many native species. This is because they move in large herds and churn the ground as they move, digging up soil and roots and trampling vegetation.

The WA government has released a strategy document on how they plan on tackling the pig issue which can be found here. The government has recently given out grants of $300,000 towards the issue, but so far the use of technology is still basic, relying on static cameras and GPS monitoring of only some pigs.