A Practical Guide for Creating A Quality Satellite Imagery Dataset for Agricultural Applications

Learnings after our two-month AI challenge with 36 collaborators to estimate crop yield in partnership with the UN World Food Program in Nepal.

Photo by Alex Wigan on Unsplash

Article authored by Jayasudan Munsamy, Alexander Epifanov, and Łukasz Murawski.

Additional contributors to this work: James Tan, Thomas Chambon, Erick Galinkin, Shefalika Gautam, Radhika Menon, Javier Perez Tobia, Saqib Shamsi, Sai Praveen.

Hello there! So, you have decided to use satellite imagery for agricultural purposes? GREAT! You are in the right place.

This article is the result of working in Omdena’s AI challenge to estimate crop yield with the UN World Food Program in Nepal. The problem was tough, the challenges were huge, and resources were scarce. Still, a community of 36 collaborators managed to build a solution with 89% accuracy. This article focuses on the dataset creation.


Although the following recommendations come mainly from our experience with a crop type identification project, we believe they are generic and apply to any agricultural Machine Learning project, such as identifying crop growth rate/stage, checking for pest infestation, or estimating crop density.

Happy reading ☺

The characteristics of a ‘good’ dataset

‘Rubbish in, rubbish out’ is a popular phrase referring to poor ML model results caused by poor-quality datasets.

No matter how good a model architecture is, it’s useless if it is not trained on appropriate, good-quality data. The opposite also holds true: sometimes even a basic model architecture is enough to get the job done if it is fed a proper training dataset.

And while so much has been written about ML model building, we think dataset preparation has been somewhat neglected. That is understandable: it ain’t too sexy of a subject and it takes forever to get right. Unfortunately, our path to realizing the importance of ‘creating a good dataset’ was painful and agonizing.

So what really is a ‘good’ dataset? Every project is different and has unique requirements for its dataset.

Let’s start with the characteristics of the expected outcome: our ML model. A good model is one that achieves the best results on a variety of input data. We say that such a model generalizes well, and we call it ‘robust’.

In short, to achieve this, we need to train the model with high quality, precisely labelled datasets representing the full spectrum of input possibilities. The importance of data in ML can be understood from the fact that ‘typically 80% of an ML project is spent on data — analysis, gathering & engineering’.

The key characteristics of a good dataset are listed below:

- Data distribution: data should cover all or most of the possible spectrum of the input
- Data coverage: every class should have enough representation in the dataset
- Data accuracy: data should be highly relevant to the task at hand and be as close as possible to the data used for inference, in terms of quality, format, etc.
- Feature engineered: data should enable the ML model to learn what we intend it to learn (appropriate features)
- Data transformation: acquired data can almost never be used as-is, and an appropriate data transformation pipeline can simplify the model architecture
- Data volume: whether the ML model is built from scratch or uses learning transferred from another model, availability of data is critical
- Data split: typically data is split into 3 chunks, training (75%), validation (15%) & test (10%), and it’s important to ensure there is no ‘duplicate/same’ data across these chunks and that the samples are distributed properly
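The ‘no duplicate/same data across chunks’ point can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the `field_id` key, the `tile` key and the `split_by_group` helper are hypothetical names, not something from the project. The idea is to split by crop field rather than by image, so that tiles of the same field never end up in two different chunks.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_key, fractions=(0.75, 0.15, 0.10), seed=42):
    """Split samples into train/val/test so that no group spans two chunks."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample)

    # Sort before shuffling so the result is reproducible for a given seed.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_train = round(len(group_ids) * fractions[0])
    n_val = round(len(group_ids) * fractions[1])
    chunks = {
        "train": group_ids[:n_train],
        "val": group_ids[n_train:n_train + n_val],
        "test": group_ids[n_train + n_val:],  # whatever remains
    }
    return {name: [s for g in ids for s in groups[g]] for name, ids in chunks.items()}


# Hypothetical records: 20 fields with 3 image tiles each.
samples = [{"field_id": f, "tile": t} for f in range(20) for t in range(3)]
splits = split_by_group(samples, "field_id")
field_ids = {name: {s["field_id"] for s in chunk} for name, chunk in splits.items()}
```

With 20 fields this gives 15, 3 and 2 fields per chunk. The fractions apply to fields, not to images, so the image-level ratios will drift if fields have very different numbers of tiles.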

One of the key challenges with ML projects is that the exact data requirements are NOT known at the time of data analysis & gathering; sometimes they become known only after the model is built and its shortcomings are understood. So, an iterative approach to creating and refining the dataset in order to improve model metrics is a safe bet.

Seven recommendations for creating satellite imagery datasets for agricultural applications

Photo by NASA on Unsplash

Satellite images can be captured in visible colors (RGB) and in other parts of the electromagnetic spectrum, i.e. within specific wavelength ranges such as Near-Infrared. There are also elevation maps, usually derived from radar images, which can be used to estimate vegetation growth rate, etc. Normally, the interpretation and analysis of satellite imagery are conducted using specialized remote sensing software, but advancements in AI have made autonomous, large-scale analysis of imagery possible.

We have listed some of the key points to consider.

1. Ground truth data: know the ground truth
2. Source of satellite images: decide the source
3. Spatial distribution: know the terrain
4. Temporal distribution: know the crop growth cycle
5. Image quality: know what’s in the images
6. Vegetation indices: know the right indices
7. Labeling and Masking: know what is what & where in the images

#1 Ground truth data — know the truth

Almost every stage of the project depends on the Ground Truth (GT) data provided. When satellite imagery is used for agricultural purposes, the GT should contain all the details about the crop fields needed to identify them individually, so that this information can be fed to ML models via appropriate datasets for correct feature extraction. In many cases, GT data comes as a file containing the information gathered during field surveys. Usually, it’s a simple spreadsheet filled with a variety of field information that can be used directly for labeling. But in practice, we may not have all the required information, and we should pay close attention to the content of the file for two main reasons:

- Acquiring comprehensive field survey results is expensive, if possible at all
- Preparing a proper Ground Truth data file requires an understanding of all the details required for creating robust ML models

In short, the goal of the Ground Truth data should be to provide complete, well-balanced and properly distributed data. It should also serve as a reference on how to recognize different objects of interest to facilitate complete and reliable labeling.

It should have the following characteristics:

- Contain all the required details about the data (crop parcels, in the agricultural case). For crop type identification, for example, the GT should specify the crop field ID, dimensions, GPS location, shape, size, crop cultivated, seasonal crops for the field, crop cycle details, land use patterns, etc.
- Have a well-balanced number of classes (e.g. an equal or similar number of samples for each class).
- Be well distributed spatially (various terrains in the area of interest) and temporally (covering various time periods such as seasons and crop cycles), representing the full range of possible scenarios.
- Contain more data points than the expected number of images, since for some or many of them there won’t be good satellite images (due to weather conditions, pollution, etc.).
- Contain samples from a few different years, so that if one year’s satellite images cannot be used because of bad weather or for some other reason, images from the other years covered by the ground truth can compensate for the data loss.
- Highlight periods when the objects of interest (say, different crops) are easiest to recognize, for example the months when the crops are fully grown. Labeling is easiest then, and from there we can propagate masks to other dates knowing the vegetation specifics.
- Include examples of similar kinds of crops/vegetation (similar not just visually but also in terms of VIs, if possible) in the regions around the area of interest. For example, if rice fields are the class of interest, examples of grass, cornfields, etc. in the surrounding area that look similar to rice fields should also be included.
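A quick sanity check of the balance and distribution points above might look like the sketch below. It assumes the ground truth has already been loaded into a list of dictionaries; the `crop` and `terrain` keys, the sample counts and both helper functions are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import product


def class_ratio(records, label_key):
    """Return per-class counts and the ratio between largest and smallest class."""
    counts = Counter(r[label_key] for r in records)
    return counts, max(counts.values()) / min(counts.values())


def coverage_gaps(records, keys):
    """Return the value combinations of `keys` that have no record at all."""
    seen = Counter(tuple(r[k] for k in keys) for r in records)
    values = [sorted({r[k] for r in records}) for k in keys]
    return [combo for combo in product(*values) if seen[combo] == 0]


# Hypothetical ground truth records.
records = (
    [{"crop": "rice", "terrain": "plain"}] * 40
    + [{"crop": "rice", "terrain": "hill"}] * 20
    + [{"crop": "grass", "terrain": "plain"}] * 30
    + [{"crop": "corn", "terrain": "hill"}] * 10
)

counts, ratio = class_ratio(records, "crop")
gaps = coverage_gaps(records, ("crop", "terrain"))
```

Here the largest class has six times as many samples as the smallest, and two crop/terrain combinations have no samples at all. Both are signals that more field survey data would be needed before labeling starts.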

The key point to remember is that Ground Truth data quality will have a big impact on the dataset created, labeling/masking done on a dataset and ultimately results of the solution.

#2 Source of satellite images — decide the source

With satellites launched by many countries and private organizations, satellite imagery has become more accessible to the general public for a variety of applications. Some of the more popular programs are Landsat (by USGS & NASA, 30 m resolution since the early 1980s), MODIS (by NASA, near-daily satellite imagery of the Earth in 36 spectral bands since 2000), Sentinel-2 (by ESA, a 5-day revisit frequency in 13 spectral bands, with imagery from 2015 onwards) and ASTER (by NASA, detailed maps of land surface temperature, reflectance, and elevation).

Organizations selling satellite imagery

Several private organizations sell raw & processed satellite imagery, with data customized as required by customers. A few popular ones are GeoEye (since Sep 2008, images with a ground resolution of 0.41 m, or about 16 inches, in panchromatic or black-and-white mode, plus multispectral or color imagery at 1.65 m, or about 65 inches), DigitalGlobe (panchromatic-only imagery at 0.46 m & 0.6 m spatial resolution, as well as images at 0.31 m spatial resolution), the OneAtlas platform (by Airbus, Optical & Radar Earth Observation), Spot Image (based in Toulouse, France; images at 1.5 m for the panchromatic channel and 6 m for multispectral, as well as 0.5 m, or about 20 inches) and ImageSat International (operator of the “EROS” satellites; its images can be used for mapping, border control, infrastructure planning, agricultural monitoring, environmental monitoring, disaster response, training, simulations, etc.).

Key decision points for choosing the satellite imagery source:

- Raw or processed datasets: raw satellite images cannot be used as-is, and processing them is an involved activity that requires various tools, so processed images that are readily available as part of datasets are a good choice to start with.
- Image quality: sharp images with clear differentiation of the objects we are interested in are critical. The higher the resolution, the better the results with the ML model.
- Spatial resolution of images: this is the size of the ground area covered by a single pixel in the satellite image. The smaller that size, the finer the detail (values go down to 15 cm). A few organizations improve the spatial resolution of final images by applying scaling techniques, which can sometimes degrade image quality, so watch out for this (e.g. a few bands of the Sentinel-2 dataset are resampled with a constant Ground Sampling Distance depending on the native resolution of each band, so they can have a spatial resolution of 10 m, 20 m or 60 m, but the resampled images can be of lower quality).
- Free or paid: this may seem like an easy choice, but aspects like quality, the processing applied, completeness of data, etc. in the provided datasets depend on the effort spent by the provider. The source of images to be used at inference time and the accuracy/other metrics required of the ML model mainly drive the decision.
- Temporal coverage of images: depending on where, when and for what purpose the satellites were launched, imagery may be available only for certain geographies and periods (e.g. the Sentinel-2 Level-1C dataset has images only from June 2015 onwards). The temporal data required for the task will help to decide.
- Spectral bands to use: depending on the sensors on the satellite, different spectral data will be available in the images (e.g. Sentinel-2 imagery contains 13 spectral bands). The vegetation indices to be used for the remote sensing task will help to decide.
- Number of images per day/week/month: depending on how frequently the satellite passes over the area of interest, the number of images varies (e.g. Sentinel-2 passes over a location every 5 days, so in a month there will be roughly 6 images of a particular location). The volume of images required will help to decide.
- Image processing applied: image quality decreases with factors like cloud cover, haze, pollution, etc., and many organizations apply various processing techniques to remove such distractions from images. The image quality requirements of the task at hand will help to decide.
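The ‘number of images per month’ arithmetic above is easy to check with a small sketch. Everything specific in it is an assumption made for illustration: the first pass date, the per-date cloud cover values and the 20% cloud threshold are invented, and real acquisition dates should come from the imagery provider’s catalogue.

```python
from datetime import date, timedelta


def acquisition_dates(first_pass, revisit_days, start, end):
    """List the dates in [start, end] on which the satellite revisits a location."""
    step = timedelta(days=revisit_days)
    current = first_pass
    while current < start:
        current += step
    dates = []
    while current <= end:
        dates.append(current)
        current += step
    return dates


# Hypothetical first pass on 2 June 2019 with a 5-day revisit period.
passes = acquisition_dates(date(2019, 6, 2), 5, date(2019, 6, 1), date(2019, 6, 30))

# Hypothetical cloud cover (in percent) for some of those dates.
cloud_cover = {date(2019, 6, 12): 85, date(2019, 6, 17): 60, date(2019, 6, 22): 95}
usable = [d for d in passes if cloud_cover.get(d, 0) <= 20]
```

In this made-up month there are 6 passes, of which only 3 survive the cloud filter. This is why the number of usable images is often well below what the revisit frequency alone suggests.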

# 3 Spatial distribution — know the terrain

Nepal’s agro-ecological zones in different terrains

While RGB bands in satellite images can show the crop fields, the terrain of these fields also plays an important role. For example, crop fields in plains tend to be large and regularly shaped, with similar crops usually growing nearby. In hilly areas, crop fields tend to be small, vary in shape and altitude, and may be surrounded by a mix of other vegetation. In forest areas, crop fields tend to be surrounded by thick trees, without a clear visual representation of the fields and their borders. Hence, understanding the various terrains the crop fields are in becomes important.

The recommendation is to look at satellite images from different time periods/seasons to understand the terrain of an area of interest, include images from various terrains in the dataset, consider the challenges with images from different areas while labeling/marking and address those challenges as much as possible with appropriate labeling.

# 4 Temporal distribution — know the growth cycle

Photo by Kai Pilger on Unsplash

Satellite images from all the months covering various growth stages of crops should be added to the dataset. These images will help the ML model to generalize well and be able to accurately identify crops irrespective of the growth stage of crops. A better understanding of the crops’ growth cycle and seasonal crops cycle can help to find satellite images of crops at different stages of growth.

While looking for temporal data, there is a possibility that a few months in a specific year do not have any images due to bad weather or climatic conditions. In such cases, consider choosing images from other years for these months. An important assumption that needs to be validated here is that the crop cycle & seasonal crops for those years are the same as the year with ground truth. In some cases, the same crop can have a different growth cycle in different regions.
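The ‘borrow missing months from other years’ idea can be sketched as below. The `available` mapping, the years and the `fill_missing_months` helper are hypothetical. The sketch also does nothing to validate the assumption stated above, namely that crop cycles and seasonal crops in the fallback years match the ground truth year; that check still has to be done separately.

```python
def fill_missing_months(available, reference_year, fallback_years):
    """For each calendar month, pick the first year that has usable imagery."""
    plan = {}
    for month in range(1, 13):
        for year in [reference_year, *fallback_years]:
            if month in available.get(year, set()):
                plan[month] = year
                break
        else:
            plan[month] = None  # no usable imagery in any candidate year
    return plan


# Hypothetical availability: months with usable images, per year.
available = {
    2019: {1, 2, 3, 4, 5, 9, 10, 11, 12},      # ground truth year, months 6-8 missing
    2018: {1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12},
    2017: {6, 7, 8},
}
plan = fill_missing_months(available, 2019, [2018, 2017])
```

In this example June and August are taken from 2018 and July from 2017, while every other month stays with the ground truth year. The order of `fallback_years` encodes which year we trust most.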

In our case with Nepal, there were 10 to 12 varieties of rice widely adopted by farmers, having two main growing seasons depending on rice variety: 1) Spring rice (February/March to June/July): Chaite 2, Chaite 4, Ch 45, Bindeswar, etc. and; 2) Main season rice (June/July to October/November): Mahsuri, Savitri, etc. (Source).

Another important point to consider regarding temporal data is the land use pattern of cultivated fields. Though there might be a defined crop cycle for each crop, the same fields can be used for cultivating different crops in different seasons (e.g. seasonal short-term crops may be cultivated in the same fields after the main crop has been harvested and before the next sowing). Missing out on satellite images from the months that represent these seasonal crops leaves the dataset incomplete, which in turn leads to inaccurate ML models.

Talking to the subject matter experts/farmers in the area of interest to understand the temporal data to be captured is critical at this stage of dataset creation.

#5 Image quality — know what’s in the images

The resolution of satellite images is relatively high, so processing them is time-consuming. Depending on the sensor that produced the imagery, appropriate processing is also required before the images can be consumed. In addition, weather (rain, clouds, etc.) & environmental (pollution, haze, etc.) conditions can affect image quality. For these reasons, publicly available satellite image datasets are typically pre-processed by third parties for visual, scientific, or commercial use.

Just like with any other digital image, the resolution of satellite images is critical for the purpose at hand, and it varies depending on the instrument used and the altitude of the satellite’s orbit. There are four types of resolution when discussing satellite imagery in remote sensing: spatial (the pixel size of an image, representing the size of the surface area being measured on the ground), spectral (wavelength interval size and number of intervals), temporal (the amount of time, in days, that passes between imagery collection periods for a given surface location) and radiometric (levels of brightness/contrast).
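To get a feel for what spatial resolution means in practice, the small sketch below estimates how many pixels a crop field occupies at a given pixel size. The 50 m x 30 m field is a hypothetical example, and the calculation ignores field orientation and mixed pixels along the borders, so treat it as a rough estimate only.

```python
def field_pixels(width_m, height_m, pixel_size_m):
    """Approximate number of pixels covering a rectangular field."""
    return (width_m / pixel_size_m) * (height_m / pixel_size_m)


def tile_side_km(tile_pixels, pixel_size_m):
    """Ground length of one side of a square image tile, in kilometres."""
    return tile_pixels * pixel_size_m / 1000


# Hypothetical 50 m x 30 m field at two different spatial resolutions.
coarse = field_pixels(50, 30, 10)    # 10 m per pixel
fine = field_pixels(50, 30, 0.5)     # 0.5 m per pixel

# A 500 x 500 pixel tile at 10 m per pixel.
side = tile_side_km(500, 10)
```

At 10 m per pixel the field is only about 15 pixels in total, while at 0.5 m per pixel it is about 6,000. A 500 x 500 pixel tile at 10 m per pixel spans 5 km on each side, so an individual field occupies a tiny part of it.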

Satellite image spatial resolution vs quality

Though there are open datasets of satellite imagery available to the public free of cost, quality images to be used for specific purposes like crop growth detection, crop type identification, etc. are expensive. The higher the quality, the higher the cost. As simple as that.

But high-quality images are not always required. For example, for a project that aims to identify building structures in satellite images, we may not need a spatial resolution as fine as 30 cm.

So, there’s no standard guideline or single rule suggesting the minimal or maximum image quality required for a project. It all depends on the objective of the project. However, there are two main factors which need to be considered before deciding on the quality of images to use:

1. ML model’s accuracy/performance metrics: will the image quality chosen help to meet the high-performance requirements of the project?
2. Data labeling potential: will the image quality chosen be good enough for labeling, given the kind of objects to be identified from images?

The decision on point 1 is fairly straightforward: we can test the ML models using images of different quality and choose the lowest-quality images that still meet the project’s requirements/metrics.
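As a minimal sketch of that selection rule, assume we have already trained and evaluated the model once per candidate image source. The scores below are invented for illustration and are not results from our project, and `coarsest_acceptable` is a hypothetical helper.

```python
def coarsest_acceptable(results, target):
    """Return the largest pixel size (in metres) whose score meets the target.

    `results` maps pixel size in metres to an evaluation metric in [0, 1].
    Returns None when no candidate meets the target.
    """
    acceptable = [size for size, score in results.items() if score >= target]
    return max(acceptable) if acceptable else None


# Hypothetical evaluation scores per spatial resolution.
results = {0.5: 0.93, 3.0: 0.90, 10.0: 0.86, 30.0: 0.70}

choice = coarsest_acceptable(results, target=0.85)
too_strict = coarsest_acceptable(results, target=0.99)
```

With a target of 0.85 the rule picks the 10 m imagery, the coarsest (and usually cheapest) option that still meets the requirement. With a target of 0.99 it returns `None`, which signals that either better imagery or a different approach is needed.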

The decision on point 2 is more subjective in terms of what objects are to be identified on images and should be agreed upon at the beginning of the project by carefully assessing the labeling capabilities. But, the image quality must be good enough for labeling, so people can easily see and draw masks around all objects/classes of interest.

Solutions that require small objects to be identified, where distinguishing edges matters more than measuring overall coverage, require very high-quality images. That was the case with Omdena’s Trees Recognition Challenge, where the goal was to identify trees close to electricity lines in order to prevent power outages and fires. Here, extremely accurate masking, close to the trees’ edges, was necessary. For that, high-resolution images with 0.5 m spatial resolution had to be used. Not only trees but also little bushes and shadows had to be precisely annotated. And that paid off.

With only 150 original images and very basic transformations, the team achieved around 95% accuracy using a deep U-Net model.

From trees identification project at Omdena

For our crop identification project in Nepal, such a high resolution was not necessary, as crop field shapes were pretty much regular (except for the ones in hilly areas), with borders being mostly straight lines. So, in this project, the ability to distinguish similar objects (like rice vs. grass) at the labeling stage was the main factor in deciding on image quality. We ended up using Sentinel-2 Level-1C satellite images with 13 spectral bands from the Copernicus program (European Space Agency), with a maximum spatial resolution of 10 m per pixel for certain bands. Unfortunately, it turned out that the maximum zoom level at which we could get clear RGB images was only 100 m. That zoom level was not good enough for labeling, since the crop field areas appear too small in the agricultural setting, as seen in the images below (the actual images used were 500x500 pixels, and yet many fields appeared too small or unclear to recognize).