Geospatial data plays a crucial role in various domains, from remote sensing and urban planning to environmental monitoring and disaster management. When working with geospatial data for machine learning tasks, preparing a custom dataloader is essential to efficiently load, preprocess, and augment the data without losing its properties, especially when the input image has more than 3 bands.

Rasterio is a library designed specifically for reading and writing geospatial raster data. Libraries like OpenCV and Pillow are versatile and widely used for image processing, but they may not provide the geospatial capabilities that Rasterio offers, such as access to a raster's coordinate reference system and georeferencing metadata.

Rasterio is tailored to work seamlessly with geospatial file formats such as GeoTIFF, and it provides tools for georeferencing, coordinate transformations, and other geospatial-specific tasks. This makes it an essential choice when working with geospatial data, as it ensures accurate handling of spatial information, coordinate systems, and map projections.

So, when dealing with geospatial raster datasets, Rasterio is often the preferred choice. However, OpenCV and Pillow can still be useful in certain scenarios, especially when you need to perform general image processing tasks on geospatial data, when you’re working with non-geospatial image data, or when you just want to read three bands of an image without retaining its geospatial properties.

In this tutorial, we will walk you through the process of creating a custom geospatial dataloader using PyTorch and Rasterio, two powerful libraries for deep learning and geospatial analysis.

Prerequisites: Before we begin, make sure you have the following:

An IDE or notebook environment such as Jupyter Notebook.
Python installed.
PyTorch, torchvision, and Rasterio installed. You can install them using pip:

pip install torch torchvision rasterio
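To confirm that the packages are visible to your Python environment, a quick check like the one below can help. This is a small sketch using only the standard library. The helper name missing_packages is my own and is not part of any of these libraries.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["torch", "torchvision", "rasterio", "numpy"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

An empty result only shows that the packages can be located. It does not guarantee that their compiled dependencies load correctly, so importing them once in your notebook is still worthwhile.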

If you have trouble installing Rasterio on Windows, the approach that has worked best for me is to install its dependencies with pipwin (a Windows-only tool) in the following sequence:

pip install wheel
pip install pipwin

pipwin install numpy
pipwin install pandas
pipwin install shapely
pipwin install gdal
pipwin install fiona
pipwin install pyproj
pipwin install six
pipwin install rtree
pipwin install geopandas
pip install rasterio

Source: https://stackoverflow.com/a/58943939/14111919

Once you have all the libraries installed, let's create a custom geospatial data loader in four steps:

Import libraries
Define Custom Dataset
Create a data loader
Iterate through batches

Let’s begin with these steps.

Import Libraries

import torch
from torch.utils.data import Dataset, DataLoader
import rasterio
import numpy as np
from torchvision import transforms

Define Custom Dataset

Create a custom dataset class that inherits from PyTorch’s Dataset class. In this class, you will define how to load and preprocess geospatial data.

class CustomGeoDataset(Dataset):
    def __init__(self, file_paths, transform=None):
        self.file_paths = file_paths  # List of file paths for geospatial data
        self.transform = transform  # Data augmentation/transformations

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Open the geospatial file using Rasterio
        with rasterio.open(self.file_paths[idx], 'r') as src:
            # Read all bands as a NumPy array of shape (bands, height, width)
            data = src.read()
        # Convert to a float32 tensor so torchvision transforms and the model can use it
        data = torch.from_numpy(data.astype(np.float32))
        # Apply any preprocessing or transformations here
        if self.transform:
            data = self.transform(data)
        return data

# Define the list of file paths to your geospatial data files (placeholder names)
file_paths = ['path_to_file1.tif', 'path_to_file2.tif']  # ... add the rest of your files

# Instantiate the custom dataset
custom_dataset = CustomGeoDataset(file_paths)

You can also build the list of file paths programmatically with Python's os module.
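As a minimal sketch of that idea, the helper below walks a folder and collects every GeoTIFF it finds. The function name list_rasters and the folder name "data" are placeholders I introduced for illustration. Adjust the extensions to match your files.

```python
import os

def list_rasters(root_dir, extensions=(".tif", ".tiff")):
    """Collect raster file paths under root_dir, including subfolders."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Compare in lowercase so .TIF and .tif are both matched
            if name.lower().endswith(extensions):
                paths.append(os.path.join(dirpath, name))
    # Sort so the order is the same on every run
    return sorted(paths)

file_paths = list_rasters("data")  # "data" is a hypothetical folder name
print(len(file_paths), "raster files found")
```

Sorting the paths keeps the dataset order stable between runs, which makes train/validation splits easier to reproduce.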

Data Augmentation (Optional)

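If you wish to apply data augmentation or transformations to your geospatial data, you can compose them with torchvision's transforms module and pass them to the dataset during initialization. These transforms expect a tensor (or a PIL image) rather than a NumPy array, so the data should be converted to a tensor before they are applied.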

# Define custom data augmentation/transformations
custom_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(15),
    # Add more transformations as needed
])

# Instantiate the custom dataset with transformations
custom_dataset = CustomGeoDataset(file_paths, transform=custom_transforms)
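A transform can also be any callable that takes a tensor and returns a tensor, which is handy for multi-band imagery. The sketch below normalizes each band with its own mean and standard deviation. The class name PerBandNormalize and the statistics are hypothetical. In practice you would compute the statistics from your own training data.

```python
import torch

class PerBandNormalize:
    """Normalize a (bands, height, width) tensor with one mean/std per band."""

    def __init__(self, mean, std):
        # Reshape to (bands, 1, 1) so the values broadcast over height and width
        self.mean = torch.as_tensor(mean, dtype=torch.float32).view(-1, 1, 1)
        self.std = torch.as_tensor(std, dtype=torch.float32).view(-1, 1, 1)

    def __call__(self, tensor):
        return (tensor.float() - self.mean) / self.std

# Hypothetical statistics for a 4-band image
normalize = PerBandNormalize(mean=[10.0, 20.0, 30.0, 40.0], std=[2.0, 2.0, 2.0, 2.0])

tile = torch.full((4, 8, 8), 20.0)  # synthetic tile with the value 20 everywhere
out = normalize(tile)
print(out[:, 0, 0])  # one normalized value per band
```

An instance of this class can be placed inside transforms.Compose alongside the flips and rotation above.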

Create Dataloader

# Create a DataLoader for batching and shuffling (set num_workers > 0 to load in parallel)
# Note: default batching requires every image to have the same shape
dataloader = DataLoader(custom_dataset, batch_size=32, shuffle=True)

Iterate through Batches

You can now iterate through the batches of geospatial data using the DataLoader and use them for training or inference in your deep learning model.

for batch in dataloader:
    # Each batch has shape (batch_size, bands, height, width)
    # Perform model training or inference with the batch data
    # Make sure your model is compatible with the input data format
    print(batch.shape)
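For imagery with more than three bands, being compatible mostly means that the first layer of the model accepts as many input channels as the image has bands. Here is a minimal sketch under assumed values: 4 bands, 64x64 tiles, and 2 output classes. The tiny network is only there to show the shapes and is not a recommended architecture.

```python
import torch
from torch import nn

num_bands = 4  # assumption: set this to the band count of your rasters

model = nn.Sequential(
    nn.Conv2d(num_bands, 16, kernel_size=3, padding=1),  # in_channels = bands
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapse height and width to 1x1
    nn.Flatten(),
    nn.Linear(16, 2),         # 2 hypothetical output classes
)

# Synthetic batch standing in for one batch from the dataloader
batch = torch.rand(8, num_bands, 64, 64)
logits = model(batch)
print(logits.shape)
```

If you start from a pretrained model that expects 3 channels, its first convolution is the layer that would need adapting.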

After this, try visualizing a few samples and checking the dataset's length and the shape of its items to verify that the dataloader works as expected.
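The shape and length checks can be sketched as follows. So that the snippet runs without any files, it uses a stand-in dataset of random 4-band tiles. RandomTileDataset is a hypothetical class of my own, and you would replace it with your CustomGeoDataset instance.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomTileDataset(Dataset):
    """Stand-in dataset: returns random (bands, height, width) tiles."""

    def __init__(self, num_tiles=10, bands=4, size=64):
        self.tiles = torch.rand(num_tiles, bands, size, size)

    def __len__(self):
        return len(self.tiles)

    def __getitem__(self, idx):
        return self.tiles[idx]

dataset = RandomTileDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=False)

print("Number of samples:", len(dataset))
print("Sample shape:", tuple(dataset[0].shape))

first_batch = next(iter(loader))  # fetch a single batch
print("Batch shape:", tuple(first_batch.shape))
print("Number of batches:", len(loader))
```

With 10 samples and a batch size of 4, the loader yields 3 batches, and the last one holds the remaining 2 samples.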

Creating a custom geospatial dataloader with PyTorch and Rasterio enables you to efficiently handle geospatial data for various machine learning or deep learning tasks. This tutorial provides the foundation for loading and preprocessing geospatial data, and you can further customize it to suit your specific project requirements. Whether you’re working on land cover classification, object detection, or any other geospatial task, a custom dataloader will streamline your workflow and help you achieve accurate results with geospatial datasets.

This installment of the series has provided you with a guide on creating a geospatial dataloader using the rasterio library. In our upcoming segment, we will take this knowledge a step further by creating a custom geospatial dataloader and presenting a practical dataset example.

Stay tuned for a hands-on demonstration of how to harness the power of geospatial data in your machine learning projects.


Thank you for reading this article. :)