
Identifying ants with object detection and classification models

Introduction

My goal with AntScout is to build the world’s most accurate ant identification system. While most standard models try to classify an entire image at once, I’ve found that a specialized two-stage pipeline, which uses object detection to isolate individual ants before classification, delivers significantly better results. By focusing the classification model entirely on the fine-grained anatomical details of cropped ants, we can better distinguish between the 13,000+ species that often look identical to the naked eye.

Through this project, I’m leveraging over 1.5 million images from iNaturalist and GBIF in combination with DINOv3 and ecological data to push the boundaries of what’s possible in automated entomology.

Step 1: Object Detection

Finding the ants within the image is the foundation. I use a specialized object detection model to locate each ant, which is then cropped and sent to the classification model. For a deep dive into the benchmarks and architecture decisions for this phase (including YOLOv11 and RF-DETR), check out the dedicated post Choosing and training the best object detection model for ants.

Technical Architecture

I chose DINOv3 ViT-L/16 as my backbone. Vision transformers like DINOv3 output several components: a CLS token (a global representation of the entire image) and patch tokens (local representations of specific image regions).

To reach the highest possible accuracy, I designed a custom model head that leverages both:

1. Location & Habitat Encoding

The location input goes far beyond simple coordinates. I use a combination of spatial and ecological data to give the model a complete “sense of place”:

  • Fourier Features: Latitude and longitude are encoded using Fourier features, which helps the model learn periodic spatial patterns and prevents it from over-focusing on exact GPS points.
  • Ecological Context (40+ Layers): Using high-resolution global maps, I sample over 40 distinct habitat variables for every observation:
    • Land Cover: Consensus classes from EarthEnv (e.g., evergreen forest, cultivated land, open water).
    • Climate (BioClim): 19 variables from WorldClim, including temperature seasonality, annual precipitation, and coldest quarter mean temperature.
    • Soil Chemistry (SoilGrids): 10 layers describing the ground itself, such as pH level, nitrogen content, and sand/clay/silt ratios, crucial for ground-nesting ants.
    • Topography: Accurate elevation data to differentiate between lowland and alpine specialists.

This modular location embedding is then used to generate FiLM (Feature-wise Linear Modulation) parameters. These parameters modulate the visual features from DINOv3, essentially telling the model: “This image is from a high-latitude, acidic-soil forest in Scandinavia, look for Formica or Myrmica traits.” A film_scale of 0.7 balances visual evidence with geographical probability.
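To make the location conditioning concrete, here is a minimal sketch of Fourier-encoded coordinates plus habitat variables feeding a FiLM module. The layer sizes, number of frequencies, and the exact way the modulation is applied are illustrative assumptions, not the production code:

```python
import math
import torch
import torch.nn as nn

class LocationFiLM(nn.Module):
    """Sketch: Fourier-encode lat/lon, concatenate habitat variables,
    and emit FiLM (scale, shift) parameters for the visual features."""

    def __init__(self, num_habitat_vars=40, num_freqs=8, feat_dim=1024, film_scale=0.7):
        super().__init__()
        self.num_freqs = num_freqs
        self.film_scale = film_scale
        loc_dim = 4 * num_freqs + num_habitat_vars  # sin/cos of lat and lon per frequency
        self.mlp = nn.Sequential(
            nn.Linear(loc_dim, 512), nn.GELU(),
            nn.Linear(512, 2 * feat_dim),  # produces gamma and beta
        )

    def fourier(self, lat, lon):
        # Map degrees to radians and encode at several frequencies so the model
        # sees smooth, periodic spatial patterns instead of raw GPS points.
        lat = lat / 90.0 * math.pi
        lon = lon / 180.0 * math.pi
        freqs = 2.0 ** torch.arange(self.num_freqs, device=lat.device)
        ang = torch.cat([lat[:, None] * freqs, lon[:, None] * freqs], dim=1)
        return torch.cat([ang.sin(), ang.cos()], dim=1)

    def forward(self, visual_feats, lat, lon, habitat):
        loc = torch.cat([self.fourier(lat, lon), habitat], dim=1)
        gamma, beta = self.mlp(loc).chunk(2, dim=1)
        # film_scale controls how strongly location modulates the visual evidence.
        return visual_feats * (1 + self.film_scale * gamma) + self.film_scale * beta
```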

2. Dual Pooling Strategy

The model combines two ways of summarizing the image:

  • CLS Token: DINOv3’s global representation.
  • GeM Pooling: Generalized Mean pooling over all patch tokens, which learns to balance between average and max pooling (learned power of ~1.34).

These vectors are concatenated with the location embedding into a combined 3072-dim feature vector, which passes through a wide bottleneck (8192 → 4096) with Squeeze-and-Excitation (SE) attention to learn channel importance.
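Below is a minimal GeM pooling sketch; the initial power is a common default (the trained model ends up near 1.34), and the comment at the end shows how the three vectors could be fused. The wide bottleneck and SE block are omitted, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized Mean pooling over patch tokens with a learnable power p:
    p = 1 behaves like average pooling, large p approaches max pooling."""

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, patch_tokens):  # (batch, num_patches, dim)
        # Clamp keeps the power mean well defined, as in standard GeM implementations.
        x = patch_tokens.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=1).pow(1.0 / self.p)

# Fusing the three summaries (shapes are illustrative for ViT-L):
# cls_token: (B, 1024), patch_tokens: (B, N, 1024), loc_emb: (B, 1024)
# fused = torch.cat([cls_token, GeM()(patch_tokens), loc_emb], dim=1)  # (B, 3072)
```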

3. Taxonomic Classification Heads

The final 4096-dim embedding feeds into multiple classification heads sharing the same representation:

  • Species: ArcFace layer (~6500 classes)
  • Genus: Linear layer (~450 classes)
  • Tribe: Linear layer (~120 classes)
  • Subfamily: Linear layer (~12 classes)

Diagram showing the AntScout identification model architecture, including location encoding, dual pooling strategy, and taxonomic classification heads.

4. Aggregated Scoring

At inference time, I use an Additive Log-Prob fusion method that combines probabilities from all heads. This “anchors” the identification; if the species head is uncertain, the 99% confident genus head ensures the final prediction stays within the correct taxonomic group.
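Conceptually, the fusion looks something like the sketch below. The weight and the species-to-genus index mapping are placeholders, and the real implementation also folds in the tribe and subfamily heads the same way:

```python
import torch

def fuse_log_probs(species_logits, genus_logits, species_to_genus, genus_weight=1.0):
    """Additive log-prob fusion sketch: each species score is its own log-probability
    plus the log-probability of its parent genus, so a confident genus head can
    rescue an uncertain species head."""
    species_logp = torch.log_softmax(species_logits, dim=-1)   # (B, num_species)
    genus_logp = torch.log_softmax(genus_logits, dim=-1)       # (B, num_genera)
    parent_logp = genus_logp[:, species_to_genus]              # map each species to its genus
    fused = species_logp + genus_weight * parent_logp
    return torch.softmax(fused, dim=-1)
```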

Resources used to create this dataset:

Flowchart illustrating the step-by-step process of the AntScout identification model, from image input to final prediction.

Cleaning & Strategy

Before training, several data quality issues needed addressing:

  1. Low Res Images: Filtered out crops below 50x50px.
  2. Mislabeled Data: iNaturalist and GBIF labels aren’t 100% accurate. I used 10% label smoothing to prevent the model from overfitting to noise.
  3. Class Imbalance: Used “Effective Number” balancing (Beta=0.99, Max Weight=20x). This recognizes diminishing returns: the 500th photo adds less value than the first 50 (see the weighting sketch after this list).
  4. Hierarchical Noise: Since some images are labeled “Lasius” (Genus) and others “Lasius niger” (Species), I used a custom hierarchical loss. This rewards the model for predicting a related child species or the parent genus instead of a binary “wrong” penalty.
  5. Spatial Generalization: To prevent the model from memorizing exact coordinates, I added 0.5° of location noise and randomly dropped location data (50% dropout) during training.
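For reference, here is a small sketch of the “Effective Number” weighting from item 3 (the class-balanced weighting of Cui et al., 2019); the normalization and clipping details are assumptions and may differ from the training code:

```python
import numpy as np

def effective_number_weights(class_counts, beta=0.99, max_weight=20.0):
    """Class-balanced weights: weight ~ (1 - beta) / (1 - beta**n), so extra images
    yield diminishing returns. The cap keeps ultra-rare classes from dominating."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    weights = weights / weights.mean()           # normalize around 1
    return np.minimum(weights, max_weight)       # clip to the configured maximum

# Example: a class with 5 images gets a far larger weight than one with 5,000.
print(effective_number_weights([5, 50, 500, 5000]))
```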

Training Results: Class Balancing

Optimal results were achieved using a Beta of 0.99 and a max weight of 20x for initial training, followed by fine-tuning with a more aggressive 1000x max ratio. This two-stage approach ensures the model learns both rare and common species effectively.

Normalized Classification (ArcFace vs Softmax)

To handle 6500+ species, I used ArcFace instead of standard Softmax. Softmax can allow common classes to develop “louder” weight vectors simply by having more data, making rare species harder to classify. ArcFace solves this by L2-normalizing both features and class weights, ensuring classification is based purely on the angle between a sample and its class center.
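A sketch of such a normalized head is below; the scale value and initialization are assumptions. With margin=0 it reduces to the plain cosine classifier used in the final model, and margin>0 gives the ArcFace-style additive angular margin discussed next:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineHead(nn.Module):
    """L2-normalize both features and class weights so the logit is the cosine of
    the angle to each class center, independent of how much data a class has."""

    def __init__(self, in_dim, num_classes, scale=30.0, margin=0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, in_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, features, labels=None):
        cos = F.linear(F.normalize(features), F.normalize(self.weight))  # (B, C)
        if self.training and labels is not None and self.margin > 0:
            # ArcFace: add the angular margin only to the target class.
            theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
            one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
            cos = one_hot * torch.cos(theta + self.margin) + (1 - one_hot) * cos
        return self.scale * cos  # fed into cross-entropy
```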

I also experimented with the Margin (m) parameter, which pushes classes further apart to improve separation:

Setting | Val (Loc) | Val (NoLoc) | AntScout (Loc) | AntScout (NoLoc)
m=0 (No Margin) | 69.37% | 57.10% | 63.35% | 41.95%
m=0.1 Fixed | 69.91% | 57.53% | 61.97% | 40.08%
m=0.02-0.15 Adaptive | 69.91% | 57.48% | 61.93% | 40.10%

Conclusion: While a margin helps on the perfectly balanced validation set, it causes “over-confidence” that hurts performance on the realistically balanced AntScout dataset. The final model uses m=0 (also known as CosFace with no margin), which keeps all species on an equal footing while maintaining better calibration for field photos.

Dataset Cleaning & Validation

To build a reliable benchmark, I manually cleaned the validation set, ensuring it only contains high-quality field and specimen photos. The validation set is equally balanced (5 images per species) and strictly excludes any observations featured in the training set. This makes it an extremely difficult but fair test of generalization.

Dataset before and after cleaning

Metric | Before Cleaning | After Cleaning
Total Images | ~12,000,000 | 6,652,773
Total Classes | 11,087 | 6,384

Cleaning validation set

To create a reliable validation set, I removed all genus-level (non-species) labels from it and then manually cleaned it as much as possible, to ensure the validation set is clean and reliable.

This is more difficult than you might think: where do you draw the line between a good image and a bad one? If only an antenna is visible, is that still a good image? What if the ant is blurry, or only partially visible? I tried to keep the set challenging while still ensuring it is of high quality.

The validation set is equally balanced, with 5 images per species. This makes it an extremely difficult validation set, but also a very reliable one. One problem is that a large portion of the species with 10 or fewer images come from GBIF (mostly AntWeb), so a very large share of the validation set consists of specimen photos from AntWeb. These are generally easier to classify than field photos, but also harder in the sense that specimens of closely related species often look very similar to each other.

When evaluating the model on this dataset, a high score mostly means the model is good at classifying ants from AntWeb, which is not the main goal. It is still a good benchmark of how well the model performs and converges, though, as the different specimens of a given species still vary a lot.

I made a simple script to go through all the tens of thousands of images in the validation set by hand and mark the incorrect ones for removal.

Here’s an example of one of the few hundred pages I went through:

A grid of ant images from the validation set, showing the manual cleaning process where incorrect or low-quality images are marked for removal.

If the model scored 50% on this validation set, that would mean it correctly identified half of these images (and tens of thousands of others like them), none of which it had seen before.

The validation set also does not include any images from the same observation or source image as any training image, ensuring the model is evaluated on completely unseen data.

To push accuracy to the limit, I moved beyond standard Cross-Entropy loss:

  • PRA Loss (Patch-Relevant Attention): This custom loss implements Hard-Negative Mining. During training, the model is forced to differentiate between “hard negatives”, different species from the same genus. This forces the Siamese-style architecture to learn the tiny anatomical differences (like the shape of a propodeal spine or the length of a scape) that separate closely related species.
  • Koleo Loss: Adopted from the DINOv2 research, this loss encourages a uniform spread of embeddings across the hypersphere, preventing “mode collapse” where different species might cluster too closely together (a sketch follows after this list).
  • Gram Anchoring: This loss maintains structural consistency by comparing the Gram matrices (pairwise similarities) of patch tokens between the student model and the teacher (EMA) model, ensuring that the spatial understanding stays stable during fine-tuning.
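Below is a simplified sketch of the KoLeo regularizer from the list above; the batch-level nearest-neighbor formulation follows the published description, but the weighting used in this project is not shown:

```python
import torch
import torch.nn.functional as F

def koleo_loss(embeddings, eps=1e-8):
    """Kozachenko-Leonenko entropy estimator used as a regularizer: penalize
    embeddings that sit too close to their nearest neighbor in the batch,
    which spreads them over the hypersphere and discourages collapse."""
    x = F.normalize(embeddings, dim=-1)
    sim = x @ x.t()                          # pairwise cosine similarities
    sim.fill_diagonal_(-2.0)                 # exclude self-matches
    nn_idx = sim.argmax(dim=1)               # nearest neighbor of each sample
    dists = (x - x[nn_idx]).norm(dim=1)      # distance to that neighbor
    return -torch.log(dists + eps).mean()    # small distances -> large loss
```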

Training Configuration

  • Hardware: 2x RTX PRO 6000 WS (96GB VRAM each), later with FP8: 4x RTX 5090 (32GB VRAM each)
  • Strategy: Maximum 1000 images per class per epoch to prevent overfitting and speed up training (2 hours/epoch at 448px).
  • FP8 Training: Leveraging torchao, the backbone is trained in FP8 precision. This significantly boosts throughput and allows for larger batch sizes on GPUs with native FP8 support while maintaining identification accuracy.
  • LLRD (Layer-wise Learning Rate Decay): The 24 transformer blocks of the ViT-L backbone use decaying learning rates. This ensures that the early layers (which learn general features) stay relatively stable while the later layers adapt more aggressively to the specifics of ant anatomy. An LLRD factor of 0.98 seemed optimal (see the sketch after this list).
  • Backbone: DINOv3 ViT-L/16, unfrozen after 2 epochs. Accuracy was still very low at the point of unfreezing, so the backbone could likely have been unfrozen right away.
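A sketch of how LLRD can be wired into optimizer parameter groups is shown below; the attribute names (backbone.blocks) and defaults are assumptions based on common ViT implementations:

```python
def llrd_param_groups(backbone, head, base_lr=1e-3, decay=0.98, head_wd=0.2, bb_wd=0.04):
    """Layer-wise LR decay sketch: the top transformer block trains at base_lr and each
    earlier block at base_lr * decay**depth_from_top, keeping generic early layers stable."""
    groups = [{"params": head.parameters(), "lr": base_lr, "weight_decay": head_wd}]
    blocks = list(backbone.blocks)                     # 24 blocks for ViT-L
    for depth_from_top, block in enumerate(reversed(blocks)):
        groups.append({
            "params": block.parameters(),
            "lr": base_lr * (decay ** depth_from_top),
            "weight_decay": bb_wd,
        })
    return groups

# optimizer = torch.optim.AdamW(llrd_param_groups(backbone, head))
```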
Chart: DINOv3 V2 vs DINOv3 V3 validation accuracy (0-70%) over epochs 1-78, plotted for both the balanced validation set and the AntScout set.

Training Evolution:

  • DINOv3 V1: Linear head using only the CLS token, no EMA.
  • DINOv3 V2: Added GeM pooling and a Conv2D layer; unfrozen after epoch 11 with progressive balancing.
  • DINOv3 V3: Refined dual-pooling (CLS + GeM) and habitat-aware FiLM conditioning. Optimized with a two-stage LR strategy (0.001 → 0.0002).

Note on training: I’ve found that a correctly implemented EMA is extremely important, especially at higher learning rates. With EMA, a higher learning rate reaches higher validation accuracy than a lower one, but it hurts accuracy on out-of-distribution data and degrades both once you then switch to a lower learning rate. The best approach is to train at a learning rate of 0.001 and eventually fine-tune at 0.0002, even if accuracy on the validation set decreases; going lower than 0.0002 degrades accuracy. Training too long at 0.001 seems to only degrade results on the out-of-distribution AntScout dataset, so having that out-of-distribution set available during training really helps in understanding what is going on. I used AdamW + Lookahead (which helped convergence speed) with a weight decay of 0.2 for the head and 0.04 for the backbone.
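For completeness, here is a minimal EMA sketch; the decay value is illustrative, and a full implementation would also track buffers such as normalization statistics:

```python
import copy
import torch

class EMAModel:
    """Keep an exponential moving average of the student weights and evaluate with it."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # Called after every optimizer step.
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```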

Final evaluation results

Validation set without location data

Model | Top-1 Species Accuracy | Top-5 Species Accuracy | Top-1 Genus Accuracy | Top-5 Genus Accuracy | Resolution
DINOv3 ViT-L/16 V3 | 57.03% | 81.59% | 93.53% | 98.73% | 448px
DINOv3 ViT-L/16 V3 HIER | 57.08% | 82.08% | - | - | 448px
DINOv3 ViT-L/16 V3 TTA | 57.43% | 81.81% | 93.60% | 98.78% | 448px
DINOv3 ViT-L/16 V3 TTA HIER | 57.39% | 82.18% | - | - | 448px
DINOv3 ViT-L/16 V2 | 51.24% | 75.38% | 90.43% | 97.05% | 448px

Validation set with location data

Model | Top-1 Species Accuracy | Top-5 Species Accuracy | Top-1 Genus Accuracy | Top-5 Genus Accuracy | Resolution
DINOv3 ViT-L/16 V3 | 69.37% | 89.82% | 95.59% | 99.28% | 448px
DINOv3 ViT-L/16 V3 HIER | 69.43% | 90.00% | - | - | 448px
DINOv3 ViT-L/16 V3 TTA | 69.55% | 89.87% | 95.60% | 99.27% | 448px
DINOv3 ViT-L/16 V3 TTA HIER | 69.57% | 89.99% | - | - | 448px
DINOv3 ViT-L/16 V2 | 61.15% | 84.57% | 92.76% | 98.15% | 448px

(Uncleaned) AntScout set with location data

Model | Top-1 Species Accuracy | Top-5 Species Accuracy | Top-1 Genus Accuracy | Top-5 Genus Accuracy | Resolution
DINOv3 ViT-L/16 V3 | 64.37% | 93.30% | 93.89% | 98.97% | 448px
DINOv3 ViT-L/16 V3 HIER | 65.00% | 93.06% | - | - | 448px
DINOv3 ViT-L/16 V3 TTA | 64.40% | 93.42% | 94.19% | 99.06% | 448px
DINOv3 ViT-L/16 V3 TTA HIER | 64.89% | 93.40% | - | - | 448px
DINOv3 ViT-L/16 V2 | 63.83% | - | - | - | 448px

(Uncleaned) AntScout set without location data

Model | Top-1 Species Accuracy | Top-5 Species Accuracy | Top-1 Genus Accuracy | Top-5 Genus Accuracy | Resolution
DINOv3 ViT-L/16 V3 | 42.18% | 76.63% | 83.61% | 95.71% | 448px
DINOv3 ViT-L/16 V3 HIER | 42.70% | 77.59% | - | - | 448px
DINOv3 ViT-L/16 V3 TTA | 42.78% | 76.95% | 84.25% | 95.78% | 448px
DINOv3 ViT-L/16 V3 TTA HIER | 43.59% | 78.13% | - | - | 448px
DINOv3 ViT-L/16 V2 | 36.99% | - | - | - | 448px
Chart: Logit Adjustment Analysis (Tau), accuracy (0-80%) as a function of τ from -1.0 to 1.0, for the AntScout set (EMA, with location) and the validation set (EMA, with location, hierarchical).

Note on Logit Adjustment: The chart above shows the impact of the Tau (τ) parameter on balanced vs. realistic datasets. Increasing τ pushes the model to be more conservative on rare species, effectively re-balancing the predictions to favor either the realistic AntScout distribution or the perfectly balanced validation set.
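As a rough sketch, logit adjustment can be applied at inference by shifting logits with the log class priors; the sign convention here follows the description above, and both it and the source of the priors are my assumptions:

```python
import torch

def adjust_logits(logits, class_counts, tau):
    """Shift each class logit by tau * log(prior): tau > 0 favors common species
    (closer to the realistic AntScout distribution), tau < 0 favors rare species
    (closer to the perfectly balanced validation set), tau = 0 changes nothing."""
    prior = class_counts.float() / class_counts.sum()
    return logits + tau * torch.log(prior + 1e-12)
```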

Benchmarking against iNaturalist

Comparing localized models is difficult, as pipeline architectures vary. To make this comparison fair, I upload full images to both models. My pipeline detects and classifies up to 10 ants individually, while other models classify the entire image as one.

Because the evaluation set is challenging for general-purpose models, I used common species they are trained on. All test images were unseen by both models during training.

The evaluation dataset for this comparison is quite small, as I have to run inference manually through iNaturalist for each image. The evaluation set used can be seen in the header image of this blog post.

Scoring system

It’s a simple scoring system: each prediction earns exactly one of these four options, or nothing at all. Once scoring is done, I sum up the points for each model. I also remove all cases where both models were fully correct, to focus only on where they differ (a tiny sketch of this scoring follows the table below).

Prediction | Rank | Points
Exact Species | 1st Option | 14
Correct Genus | 1st Option | 8
Correct Tribe | 1st Option | 4
Correct Subfamily | 1st Option | 2
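A tiny sketch of this scoring (field names are hypothetical, and only the top-1 prediction is scored):

```python
POINTS = {"species": 14, "genus": 8, "tribe": 4, "subfamily": 2}

def score(prediction, truth):
    """Award points for the deepest taxonomic rank the top-1 prediction gets right."""
    for rank in ("species", "genus", "tribe", "subfamily"):
        if prediction.get(rank) and prediction.get(rank) == truth.get(rank):
            return POINTS[rank]
    return 0
```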
No Location Data
Model | Points
Human | 294
AntScout | 272
iNaturalist | 128
BioCLIP 2 | 118

With Location Data
Model | Points
Human | 294
AntScout | 278
iNaturalist | 144
Observation.org | 28


Examples

Example of why hierarchical weighting is beneficial

This example shows how a standard model can easily get ‘lost’ in the visual details of similar species. However, by using hierarchical weighting, the model leverages its high confidence in the Subfamily (74.69%) and Genus (50.87%) to ‘anchor’ the prediction.

Macro photograph of an Azteca ant species, used as an example to demonstrate the benefits of hierarchical weighting.
=== TOP 5 PREDICTIONS (STANDARD) ===
1. Crematogaster_cylindriceps (0.50%)
2. Philidris (0.47%)
3. Dolichoderus_diversus (0.43%)
4. Azteca_mayrii (0.36%)
5. Cryptopone (0.29%)

=== TOP 5 PREDICTIONS (HIERARCHICAL WEIGHTED) ===
1. Azteca_mayrii (4.02%)
2. Azteca_xanthochroa (3.44%)
3. Azteca_instabilis (3.21%)
4. Azteca (3.18%)
5. Azteca_isthmica (3.11%)

=== HIERARCHY (Top 1) ===
Dolichoderinae > Leptomyrmecini > Azteca > Azteca_mayrii

=== INDIVIDUAL HEADS (Independent) ===
Subfamily: Dolichoderinae (74.69%)
Tribe: Leptomyrmecini (74.37%)
Genus: Azteca (50.87%)
Insight: It's uncertain which species it is, but it's most likely closely related to Azteca mayrii, which is the best answer it could've given. This identification was done without location.

Examples on rare species

Identification of rare species is one of the biggest challenges in entomology. This Anochetus horridus example is particularly impressive because the model was trained exclusively on pinned specimen images. Despite the shift from high-resolution, professional lab equipment to real-world field photography with natural lighting and complex backgrounds, the model successfully generalized its learned features to correctly identify the genus and species.

A high-detail field photograph of Anochetus horridus, showing its unique trap-jaw mandibles.
© Tamas Jegh
=== TOP 5 PREDICTIONS (STANDARD) ===
1. Anochetus_micans (1.03%)
2. Anochetus_horridus (0.79%)
3. Anochetus (0.68%)
4. Anochetus_emarginatus (0.66%)
5. Anochetus_mayri (0.61%)

=== TOP 5 PREDICTIONS (HIERARCHICAL WEIGHTED) ===
1. Anochetus_micans (9.91%)
2. Anochetus_horridus (8.68%)
3. Anochetus (8.09%)
4. Anochetus_emarginatus (7.96%)
5. Anochetus_mayri (7.66%)

=== HIERARCHY (Top 1) ===
Ponerinae > Ponerini > Anochetus > Anochetus_horridus

=== INDIVIDUAL HEADS (Independent) ===
Subfamily: Ponerinae (95.82%)
Tribe: Ponerini (95.23%)
Genus: Anochetus (97.28%)

The Anochetus horridus result is even more impressive than it might seem: this species was trained only on specimen images! Here are all the images it was trained on:

A collage of laboratory specimen images used for training the model, demonstrating the high-quality, standardized data needed for accurate species identification.
Another field image of an Anochetus ant, demonstrating the model's ability to identify species in real-world environments.
© Tamas Jegh
=== TOP 5 PREDICTIONS (STANDARD) ===
1. Anochetus (1.67%)
2. Anochetus_micans (0.65%)
3. Anochetus_inermis (0.58%)
4. Anochetus_horridus (0.51%)
5. Anochetus_diegensis (0.35%)

=== TOP 5 PREDICTIONS (HIERARCHICAL WEIGHTED) ===
1. Anochetus (12.65%)
2. Anochetus_micans (7.92%)
3. Anochetus_inermis (7.44%)
4. Anochetus_horridus (6.99%)
5. Anochetus_diegensis (5.79%)

=== HIERARCHY (Top 1) ===
Ponerinae > Ponerini > Anochetus > Anochetus

=== INDIVIDUAL HEADS (Independent) ===
Subfamily: Ponerinae (94.42%)
Tribe: Ponerini (93.46%)
Genus: Anochetus (98.26%)
Macro photograph of Paltothyreus tarsatus, illustrating the detailed anatomical features the model uses for identification.
=== TOP 5 PREDICTIONS (STANDARD) ===
1. Paltothyreus_tarsatus (3.47%)
2. Pachycondyla_crassinoda (1.54%)
3. Dinoponera_grandis (0.58%)
4. Pachycondyla (0.47%)
5. Ectomomyrmex_astutus (0.38%)

=== TOP 5 PREDICTIONS (HIERARCHICAL WEIGHTED) ===
1. Paltothyreus_tarsatus (13.25%)
2. Pachycondyla_crassinoda (6.16%)
3. Pachycondyla (3.40%)
4. Pachycondyla_striata (2.53%)
5. Pachycondyla_fuscoatra (2.49%)

=== HIERARCHY (Top 1) ===
Ponerinae > Ponerini > Paltothyreus > Paltothyreus_tarsatus

=== INDIVIDUAL HEADS (Independent) ===
Subfamily: Ponerinae (96.43%)
Tribe: Ponerini (97.82%)
Genus: Paltothyreus (51.31%)
Detailed macro image of Myrmica schencki, used to showcase the precision achieved in class-specific feature detection.
=== TOP 5 PREDICTIONS (STANDARD) ===
1. Myrmica_schencki (1.71%)
2. Myrmica_sulcinodis (1.29%)
3. Myrmica_punctiventris (1.10%)
4. Myrmica (0.77%)
5. Myrmicini (0.60%)

=== TOP 5 PREDICTIONS (HIERARCHICAL WEIGHTED) ===
1. Myrmica_schencki (12.72%)
2. Myrmica_sulcinodis (11.05%)
3. Myrmica_punctiventris (10.22%)
4. Myrmica (8.54%)
5. Myrmica_scabrinodis (7.03%)

=== HIERARCHY (Top 1) ===
Myrmicinae > Myrmicini > Myrmica > Myrmica_schencki

=== INDIVIDUAL HEADS (Independent) ===
Subfamily: Myrmicinae (94.31%)
Tribe: Myrmicini (96.94%)
Genus: Myrmica (96.98%)

Conclusion

This project has demonstrated that a specialized pipeline separating object detection from classification can significantly outperform general-purpose models for ant identification. By leveraging the self-supervised learning capabilities of DINOv3, implementing a custom hierarchical loss function, and integrating geographical data, I was able to create a model that is not only accurate but also robust across different resolutions and scenarios.

The comparison with iNaturalist highlights the strength of this approach, particularly in cases where location data is missing or when dealing with challenging images. While the current results are promising, the field of computer vision is moving fast. Future improvements could involve expanding the dataset with even more diverse sources, refining the class balancing techniques, or experimenting with larger model architectures. For now, however, this model stands as a powerful tool for the ant keeping community, pushing the boundaries of what’s possible in automated species identification.

Possible improvements and future work

  • Use different backbones like DINOv3 7B, which require more memory but are expected to perform better. In testing, the Huge version already performed significantly better on the AntScout set and slightly better on the validation set compared to the Large version, which suggests it generalizes better and works better in real-world, out-of-distribution scenarios.
  • Further improve the hierarchical loss function and logic.
  • Use a better object detection model to get a higher-quality dataset, for example a DINOv3 backbone instead of YOLO, combined with DETR or Co-DETR, which achieve SOTA results and would in turn also boost this model. More reliable crops of the ants, fully covering the antennae and legs, would also help.
  • Use wider crops so the model can rely on the antennae and legs more often.
  • Use higher resolution images to get even more details.
  • Use more data and higher-quality data, for example by filtering out outliers based on location or the current model’s predictions.
  • Only use research-grade images for high-quantity species, to reduce the amount of noise in the dataset.