Jan 22, 2026

Choosing and training the best object detection model for ants

Introduction

In my quest to build the ultimate ant identification pipeline, the first and most crucial step is object detection. To classify an ant correctly, we first need to find it, crop it, and ensure we have a high-quality input for the classification model. In this post, I’ll dive into the benchmarks and decisions behind the object detection model used in AntScout.

Why make an object detection model?

There are several use-cases for an object detection model. Many of these tasks are possible without one, but detection adds the ability to run the classifier on each ant individually instead of on the full image. Normally, image classification gives you a single result for the whole image, for example the caste or species. It will not classify each ant separately.

That’s where an object detection model comes in: it first detects all ants and their locations in the image, and the classification model then runs on each ant individually. Because each crop contains only the ant itself, the classifier sees every ant at full resolution. The crop also has far less noise from the rest of the image, making the classification way more accurate.
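As a rough sketch of the "detect, then crop" step, the snippet below turns detection boxes into crop rectangles with a bit of padding, clamped to the image borders. The detection format (a dict with "box" and "score"), the function names, the padding ratio and the score threshold are all assumptions for illustration, not the actual AntScout code.

```python
import math


def padded_crop_box(box, image_size, pad_ratio=0.1):
    """Expand an (x1, y1, x2, y2) box by pad_ratio per side and clamp it to the image."""
    x1, y1, x2, y2 = box
    width, height = image_size
    pad_x = (x2 - x1) * pad_ratio
    pad_y = (y2 - y1) * pad_ratio
    left = max(0, math.floor(x1 - pad_x))
    top = max(0, math.floor(y1 - pad_y))
    right = min(width, math.ceil(x2 + pad_x))
    bottom = min(height, math.ceil(y2 + pad_y))
    return left, top, right, bottom


def crop_boxes(detections, image_size, min_score=0.5, pad_ratio=0.1):
    """Return one crop rectangle per detection that passes the score threshold.

    `detections` is a hypothetical list of {"box": (x1, y1, x2, y2), "score": float}.
    """
    return [
        padded_crop_box(det["box"], image_size, pad_ratio)
        for det in detections
        if det["score"] >= min_score
    ]
```

Each returned tuple is in (left, upper, right, lower) order, which is the form Pillow's `Image.crop` accepts, so every rectangle can be cut out of the original full-resolution image and passed to the classifier.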

Use-cases

Individual ant identification with high accuracy
Individual ant caste detection
Ant or brood counting
Ant or brood location or walking pattern tracking
Automatic activity monitoring

It’s also possible to combine all of these use-cases into a single pipeline, extracting as much information as possible from the image or video. You would get the number of ants and brood, plus the caste and species of each ant. For example, from a single image it could report: “There are 10 ants: 5 are Lasius niger workers, 4 are Lasius niger males and 1 is a Lasius flavus queen. There are also 2 larvae and 20 eggs”, all in well under a second of processing time.
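To make that concrete, here is a minimal sketch of the last step of such a pipeline: turning per-object results into a summary. The record layout ("kind", "species", "caste") is a made-up example format, assumed only for this illustration.

```python
from collections import Counter


def summarize(detections):
    """Count ants per (species, caste) and brood per kind.

    `detections` is a hypothetical list of dicts. Ants look like
    {"kind": "ant", "species": ..., "caste": ...}; brood looks like {"kind": "egg"}.
    """
    ants = Counter()
    brood = Counter()
    for det in detections:
        if det["kind"] == "ant":
            ants[(det["species"], det["caste"])] += 1
        else:
            brood[det["kind"]] += 1
    return ants, brood


def describe(ants, brood):
    """Format the counts as one line, most common entries first."""
    ant_parts = [f"{n} {species} {caste}" for (species, caste), n in ants.most_common()]
    brood_parts = [f"{n} {kind}" for kind, n in brood.most_common()]
    return f"{sum(ants.values())} ants ({', '.join(ant_parts)}); brood: {', '.join(brood_parts)}"
```

Fed the example from the paragraph above, `describe` would list the 10 ants by species and caste, followed by the egg and larva counts.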

Besides being fun to play with and helping antkeepers, this could also be used for ant research or for improving databases like Observation.org or iNaturalist. Even if it isn’t accurate enough to always get the correct species, you can still use it to find observations of rare species, or tune it so that it only outputs labels it is actually confident about and falls back to group or genus when needed.
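One possible way to do that fallback, sketched under assumptions: if the top species score is below a threshold, sum the species scores per genus and report the genus instead. The threshold of 0.8, the function name, and the idea of taking the genus from the first word of the species name are illustrative choices, not how AntScout necessarily does it.

```python
def predict_with_fallback(species_probs, threshold=0.8):
    """Return (level, label, confidence) for the most specific label above threshold.

    `species_probs` maps binomial names such as "Lasius niger" to probabilities.
    """
    best_species = max(species_probs, key=species_probs.get)
    if species_probs[best_species] >= threshold:
        return "species", best_species, species_probs[best_species]

    # Not confident about the species: pool the probabilities per genus.
    genus_probs = {}
    for name, prob in species_probs.items():
        genus = name.split()[0]  # assumption: the genus is the first word of the name
        genus_probs[genus] = genus_probs.get(genus, 0.0) + prob

    best_genus = max(genus_probs, key=genus_probs.get)
    if genus_probs[best_genus] >= threshold:
        return "genus", best_genus, genus_probs[best_genus]
    return "unknown", None, genus_probs[best_genus]
```

With this approach, a model that hesitates between two Lasius species can still give a confident "Lasius" answer instead of a shaky species guess.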

The datasets

I already have a dataset from when I made the YOLOv8 and YOLOv11 models a few years ago. It contains ~12k images of only ants in 640px resolution. This is an absurd amount and cost me weeks of hard labor to make.

A few months ago, I began working on a new dataset. It contains ~1100 images from actual antkeepers in 1024px resolution, covering different castes, brood and a wide variety of species.

The plan now

My plan is to first evaluate every model on my old ~12k image dataset and compare the results. After that, I will make a list of improvements for the dataset and apply them to the new ~1100 image dataset. In this case, dataset quality is more important than quantity. I will compare 4 different models and use the best one for the final dataset.
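One common way to compare detectors on the same images is to match predictions to ground-truth boxes by IoU and count true positives, false positives and false negatives. The sketch below uses greedy matching at a fixed IoU threshold of 0.5; the data format and the threshold are assumptions for illustration, and this is not necessarily the evaluation used for the results in this post.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_errors(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predictions (highest score first) to ground-truth boxes.

    `predictions` is a hypothetical list of {"box": (x1, y1, x2, y2), "score": float}.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(ground_truth)
    tp = fp = 0
    for pred in sorted(predictions, key=lambda p: p["score"], reverse=True):
        best = max(unmatched, key=lambda gt: iou(pred["box"], gt), default=None)
        if best is not None and iou(pred["box"], best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box can be matched only once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)
```

From these counts, precision is tp / (tp + fp) and recall is tp / (tp + fn), which gives one pair of numbers per model to put side by side.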

Model Comparisons

CO-DETR Swin-L

A transformer-based model using the Swin-Large backbone. Overall one of the best, but way heavier than RF-DETR.

RF-DETR-2XL

A transformer-based model using the DINOv2 backbone. Slightly worse than YOLO on smaller objects.

YOLOv11m

The model used in AntScout in 2024/2025 as the counter. It has more false positives, but is way faster. Performance degrades with variants bigger than medium.

YOLOv26x

A tiny bit better and faster on CPU than the YOLOv11m variant. It also does not degrade when using bigger variants.

How can we further improve?

As you can see, the transformer-based models outperform the YOLO models by a very large margin. They have fewer false positives and false negatives. The RF-DETR-2XL is by far the best here: it’s very easy to train and to convert to the correct format for fast and easy deployment on the AntScout API or for public access.

It has one giant flaw though: it’s pretrained with num_queries set to 300, which is way too low for our use case of detecting ants. Sometimes there are up to 1500 ants in a single image! So I have to find a way to keep using the pretrained checkpoint while increasing num_queries to 1500 or more. This will also dramatically slow down inference, but it’s already more than fast enough, so that doesn’t bother me much.
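One idea I could try, sketched here under heavy assumptions: DETR-style models keep a learned embedding table with one row per query, so the table could be grown from 300 to 1500 rows by keeping the pretrained rows and initializing the extra rows as noisy copies of them. The snippet uses plain Python lists as a stand-in for the real tensor, and `extend_query_embeddings` is a hypothetical helper, not part of the RF-DETR API. Which checkpoint entries actually depend on num_queries would have to be checked in the real model, and the new queries would still need fine-tuning.

```python
import random


def extend_query_embeddings(pretrained, new_num_queries, noise_std=0.01, seed=0):
    """Grow a query embedding table while keeping the pretrained rows unchanged.

    `pretrained` is a list of rows (each a list of floats), e.g. 300 rows.
    Extra rows are copies of pretrained rows plus small Gaussian noise, so the
    new queries start close to learned ones but can diverge during fine-tuning.
    """
    if new_num_queries < len(pretrained):
        raise ValueError("new_num_queries must be >= the pretrained query count")
    rng = random.Random(seed)
    extended = [list(row) for row in pretrained]
    for i in range(len(pretrained), new_num_queries):
        source = pretrained[i % len(pretrained)]  # cycle through the pretrained rows
        extended.append([value + rng.gauss(0.0, noise_std) for value in source])
    return extended
```

The noise is there so the copied queries are not exact duplicates of each other, which would otherwise make them predict the same boxes at the start of fine-tuning.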

Future Improvements and SOTA Architectures

The main way to improve the model is to increase the number of images in the dataset, especially making them more diverse in species, brood, castes and even scenes. This will help the model be more accurate and less prone to false positives and false negatives. Without enough such examples in the dataset, it might think a mealworm is a cocoon, for example, or think springtails are eggs or larvae.

I’m also looking forward to when the RF-DETR model uses the new DINOv3 backbone. Since it’s way better at dense features, it could give me some extra accuracy and precision.