
Building the AntScout Dataset: 12 Million Images Cleaned

Introduction

I will be creating a dataset for my ant classification model; it needs to contain as many cropped images of ants as possible. See Choosing and training the best object detection model for ants for more information on the model that crops the ants.

Resources used to create this dataset

  • iNaturalist: 1.23M uncropped research-grade observation images, cropped into 3.04M individual ant images
    Only research-grade observations are used. Downloaded via the Open Data repository, since it has the most recent and complete data; on GBIF only 500k images are available.
  • AntWeb: 193k specimen images
    AntWeb images are not cropped, since the specimen already fills the frame, which avoids introducing cropping error. GBIF did not contain all of the images, so I downloaded them by ID.
  • iBOL: 78k specimen images
    Contains many specimen images that are duplicates of AntWeb images; these have been deduplicated (see the sketch after this list).
  • Museum of Comparative Zoology: 60k specimen images
    Contains several hundred images that also appear on AntWeb; these have been deduplicated as well. The images were cropped from the original, often wider specimen photos, which also contain labels.
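The post doesn't say how the duplicate specimen images were detected. As a rough illustration only, a perceptual difference hash (dHash) is one common way to find near-duplicates; the sketch below is not the pipeline used here, and the function names and paths are made up.

```python
from PIL import Image

def dhash(path, hash_size=8):
    """Difference hash: shrink to (hash_size+1) x hash_size grayscale and
    compare horizontally adjacent pixels."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (left > right)
    return bits

def deduplicate(paths):
    """Keep only the first image seen for each hash value (exact hash match)."""
    seen, unique = set(), []
    for path in paths:
        h = dhash(path)
        if h not in seen:
            seen.add(h)
            unique.append(path)
    return unique
```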

Choosing the best resolution

A specimen image downscaled to different resolutions (including 368px) for comparison.

Image attribution: “casent0922802 profile view 1” - Myrmica schencki Viereck, 1903. Collected in Czechia by California Academy of Sciences, 2000-2012 (licensed under Attribution-ShareAlike (BY-SA) Creative Commons License and GNU Free Documentation License (GFDL)).

As you can see, a 368px resolution appears to capture enough detail for identification at a glance. However, the model achieves significantly higher accuracy at 512px. This suggests that while larger features like spines and hairs remain visible, the fine-grained textures and subtle anatomical nuances required for precise classification are lost or distorted at lower resolutions. These tiny details are often the deciding factor in distinguishing between closely related species, and their loss effectively introduces noise that reduces model accuracy.

Cleaning & Strategy

Before training, several data quality issues needed addressing:

  1. Low Res Images: Filtered out crops below 50x50px.
  2. Hierarchical Noise: Since some images are labeled “Lasius” (genus) and others “Lasius niger” (species), I used a custom hierarchical loss. This rewards the model for predicting a related child species or the parent genus instead of applying a binary “wrong” penalty (sketched after this list).
  3. Spatial Generalization: To prevent the model from memorizing exact coordinates, I added 0.5° of location noise and randomly dropped location data (50% dropout) during training (also sketched after this list).
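Neither the hierarchical loss nor the location augmentation is shown in the post; the PyTorch sketch below is just one plausible way to implement them, assuming a hypothetical `taxonomy` object that maps genera to their species indices, and crops that carry latitude/longitude.

```python
import random
import torch
import torch.nn.functional as F

def soft_target(label, taxonomy, n_species, genus_weight=0.3):
    """Hierarchy-aware soft target over species classes.
    - Genus-only label (e.g. "Lasius"): spread mass uniformly over the genus.
    - Species label (e.g. "Lasius niger"): most mass on that species, the
      rest on its genus siblings."""
    target = torch.zeros(n_species)
    if label in taxonomy.genus_to_species:              # genus-level label
        members = taxonomy.genus_to_species[label]
        target[members] = 1.0 / len(members)
    else:                                               # species-level label
        idx = taxonomy.species_index[label]
        siblings = taxonomy.genus_to_species[taxonomy.genus_of[label]]
        target[siblings] = genus_weight / len(siblings)
        target[idx] += 1.0 - genus_weight
    return target

def hierarchical_loss(logits, soft_targets):
    """Cross-entropy against soft targets instead of one-hot labels."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def augment_location(lat, lon, noise_deg=0.5, dropout_p=0.5):
    """Jitter coordinates by up to ±0.5° and drop them entirely half the time."""
    if random.random() < dropout_p:
        return None, None          # the model sees a "location missing" input
    return (lat + random.uniform(-noise_deg, noise_deg),
            lon + random.uniform(-noise_deg, noise_deg))
```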

Dataset Cleaning & Validation

To build a reliable benchmark, I manually cleaned the validation set, ensuring it only contains high-quality field and specimen photos. The validation set is equally balanced (5 images per species) and strictly excludes any observations featured in the training set. This makes it an extremely difficult but fair test of generalization.

Dataset before and after cleaning

                  Before Cleaning    After Cleaning
Total Images      ~12,000,000        6,652,773
Total Classes     11,087             6,384

Cleaning the validation set

To create a reliable validation set, I removed all genus-level labels from it and manually cleaned the remaining images as thoroughly as possible.

This is more difficult than you might think: where do you draw the line between a good image and a bad one? If only an antenna is visible, does it still count? What if the ant is blurry, or only partially visible? I tried to keep the set a genuine challenge while still ensuring it is of high quality.

The validation set is equally balanced, with 5 images per species, which makes it an extremely difficult but very reliable benchmark (a sampling sketch follows the next paragraph). One drawback is that a large portion of the species with 10 or fewer images come from GBIF (mostly AntWeb), so a very large share of the validation set consists of AntWeb specimen photos. These are generally easier to classify than field photos, but also harder in another sense, because specimens of closely related species often look very similar to each other.

When evaluating the model on this dataset, a high score therefore mostly means it is good at classifying ants from AntWeb, which is not the main goal. It is still a useful benchmark for tracking how well the model performs and converges, because the specimens within a species still vary a lot.
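The sampling code isn't shown in the post; here is a minimal sketch of how the balanced set could be drawn, assuming each crop is a dict with hypothetical `rank` and `species` fields (not the author's actual code).

```python
import random
from collections import defaultdict

def sample_validation(crops, per_species=5, seed=0):
    """Pick exactly `per_species` crops per species, skipping genus-level
    labels and species without enough images."""
    by_species = defaultdict(list)
    for crop in crops:
        if crop["rank"] != "species":          # drop genus-only labels
            continue
        by_species[crop["species"]].append(crop)

    rng = random.Random(seed)
    validation = []
    for species in sorted(by_species):
        candidates = by_species[species]
        if len(candidates) >= per_species:
            validation.extend(rng.sample(candidates, per_species))
    return validation
```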

I made a simple script to go through all of the tens of thousands of images in the validation set by hand and mark the incorrect ones for removal.
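The actual review script isn't included in the post; the sketch below shows one way such a tool could work, paging through the crops with matplotlib and recording the indices typed into the console (file names and layout are made up).

```python
import math
from pathlib import Path

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

def review(image_paths, per_page=25, out_file="to_remove.txt"):
    """Show `per_page` crops per page; type the indices of bad images and
    their paths are appended to `out_file`."""
    cols = 5
    pages = math.ceil(len(image_paths) / per_page)
    with open(out_file, "a") as out:
        for page in range(pages):
            batch = image_paths[page * per_page:(page + 1) * per_page]
            rows = math.ceil(len(batch) / cols)
            fig, axes = plt.subplots(rows, cols, figsize=(12, 2.5 * rows),
                                     squeeze=False)
            for i, ax in enumerate(axes.flat):
                ax.axis("off")
                if i < len(batch):
                    ax.imshow(mpimg.imread(batch[i]))
                    ax.set_title(str(i), fontsize=8)
            plt.show()  # blocks until the figure window is closed
            marked = input(f"Page {page + 1}/{pages} - indices to remove: ")
            for idx in marked.split():
                out.write(str(batch[int(idx)]) + "\n")

# Example: review(sorted(str(p) for p in Path("validation_set").rglob("*.jpg")))
```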

Here's an example of one of the few hundred pages I went through:

A grid of ant images from the validation set, showing the manual cleaning process where incorrect or low-quality images are marked for removal.

If the model scored 50% on the validation set, that would mean it correctly classified half of these images (and the tens of thousands of others like them), none of which it had ever seen before.

Preventing data leakage

The validation set also does not include any images that come from the same observation or source image as any training image, so the model is evaluated on completely unseen data. This prevents data leakage, where the model has effectively seen an image before and can simply recognize it.
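The post doesn't show how this constraint was enforced. One way to guarantee it is to split at the observation level rather than the image level, so every crop from the same observation lands on the same side; below is a minimal sketch, assuming each crop carries a hypothetical `observation_id` field.

```python
import random
from collections import defaultdict

def split_by_observation(crops, val_fraction=0.02, seed=42):
    """Assign each observation, and every crop derived from it, entirely to
    either the training or the validation pool."""
    by_obs = defaultdict(list)
    for crop in crops:
        by_obs[crop["observation_id"]].append(crop)

    obs_ids = sorted(by_obs)
    random.Random(seed).shuffle(obs_ids)

    n_val = int(len(obs_ids) * val_fraction)
    val_ids = obs_ids[:n_val]

    train = [c for oid in obs_ids[n_val:] for c in by_obs[oid]]
    val = [c for oid in val_ids for c in by_obs[oid]]
    return train, val
```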