Choosing and training the best object detection model for ants
Introduction
In my quest to build the ultimate ant identification pipeline, the first and most crucial step is object detection. To classify an ant correctly, we first need to find it, crop it, and ensure we have a high-quality input for the classification model. In this post, I’ll dive into the benchmarks and decisions behind the object detection model used in AntScout.
Why make an object detection model?
There are several use cases for an object detection model. Many tasks are possible without one, but a detector makes it possible to run later models on each ant individually instead of on the full image. A plain image classifier can, for example, predict a caste or species for the whole image, but it cannot classify each ant in that image separately.
That’s where an object detection model comes in: it first detects every ant and its location in the image, and the classification model then runs on each ant individually. Because each crop contains only the ant itself, the classifier sees the ant at full resolution instead of a downscaled full image. The crop also contains far less noise from the rest of the image, which makes the classification much more accurate.
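The detect-then-crop step can be sketched roughly like this. It is a minimal illustration, assuming the detector returns pixel boxes as (x1, y1, x2, y2) with a score; the `Detection` class, the function names, the 25% padding and the 0.5 score threshold are all made up for this example and are not the actual AntScout code.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical container for one detector output, in pixel coordinates.
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def padded_crop_box(det, img_w, img_h, pad=0.25):
    """Return (left, top, right, bottom) around a detection.

    The box is grown by `pad` times its width/height on each side, so thin
    parts near the box edge are less likely to be cut off, and is then
    clamped to the image bounds.
    """
    w = det.x2 - det.x1
    h = det.y2 - det.y1
    left = max(0, math.floor(det.x1 - pad * w))
    top = max(0, math.floor(det.y1 - pad * h))
    right = min(img_w, math.ceil(det.x2 + pad * w))
    bottom = min(img_h, math.ceil(det.y2 + pad * h))
    return left, top, right, bottom


def crop_boxes(detections, img_w, img_h, min_score=0.5, pad=0.25):
    """Crop boxes for every detection at or above the score threshold."""
    return [
        padded_crop_box(d, img_w, img_h, pad)
        for d in detections
        if d.score >= min_score
    ]
```

Each returned box can then be used to cut the ant out of the original full-resolution image, for example with Pillow's `Image.crop`, which takes the same (left, upper, right, lower) order.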
Use-cases
- Individual ant identification with high accuracy
- Individual ant caste detection
- Ant or brood counting
- Ant or brood location or walking pattern tracking
- Automatic activity monitoring
It’s also possible to combine all of these use cases into a single pipeline that extracts as much information as possible from an image or video: the number of ants and brood, plus the caste and species of each ant. From a single image, it could for example report: “There are 10 ants: 5 are Lasius niger workers, 4 are Lasius niger males, and 1 is a Lasius flavus queen. There are also 2 larvae and 20 eggs”, all in well under a second of processing time.
Besides being fun to build and helpful for antkeepers, this could also be used for ant research or for improving databases like Observation.org or iNaturalist. Even if it isn’t accurate enough to always get the correct species, it can still serve as a finder for rare species observations, or be tuned to only report what it is actually confident about, so it falls back to the group or genus when needed.
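As a rough sketch of that last aggregation step, assuming the classifier has already produced one (species, caste) pair per detected ant and one stage label per brood detection. The `summarize` and `describe` helpers and the label strings are hypothetical names for illustration only.

```python
from collections import Counter


def summarize(ants, brood):
    """Aggregate per-crop predictions into one summary for the image.

    ants:  list of (species, caste) tuples, one per detected ant
    brood: list of stage labels such as "egg" or "larva", one per detection
    """
    return {
        "total_ants": len(ants),
        "ants": Counter(ants),
        "brood": Counter(brood),
    }


def describe(summary):
    """Render the summary as one short line, most common groups first."""
    parts = [
        f"{n}x {species} {caste}"
        for (species, caste), n in summary["ants"].most_common()
    ]
    parts += [f"{n}x {stage}" for stage, n in summary["brood"].most_common()]
    return f"{summary['total_ants']} ants: " + ", ".join(parts)


# The example from the text above, written out as per-detection predictions.
ants = (
    [("Lasius niger", "worker")] * 5
    + [("Lasius niger", "male")] * 4
    + [("Lasius flavus", "queen")]
)
brood = ["larva"] * 2 + ["egg"] * 20
report = describe(summarize(ants, brood))
```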
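One possible way to do that fallback is sketched below. It assumes the classifier outputs a probability per species and that species names are written as "Genus species", so the genus probability can be taken as the sum over its species. The function name and the 0.8 threshold are assumptions for the example, not tuned values.

```python
def predict_with_fallback(species_probs, threshold=0.8):
    """Return (level, name, probability) at the most specific confident level.

    species_probs: dict mapping "Genus species" -> probability.
    """
    best_species, p_species = max(species_probs.items(), key=lambda kv: kv[1])
    if p_species >= threshold:
        return ("species", best_species, p_species)

    # Not confident at species level: pool the probabilities per genus.
    genus_probs = {}
    for name, p in species_probs.items():
        genus = name.split()[0]
        genus_probs[genus] = genus_probs.get(genus, 0.0) + p

    best_genus, p_genus = max(genus_probs.items(), key=lambda kv: kv[1])
    if p_genus >= threshold:
        return ("genus", best_genus, p_genus)

    # Still not confident: report nothing more specific than "an ant".
    return ("unknown", None, p_genus)
```

With this setup, an ant that is split 50/40 between two Lasius species would be reported as "Lasius" rather than as a possibly wrong species.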
The datasets
I already have a dataset, since I made the YOLOv8 and YOLOv11 models a few years ago. It contains ~12k images of only ants at 640px resolution. This is an absurd amount and cost me weeks of hard labor to make.
A few months ago, I began working on a new dataset. It contains ~1100 images from actual antkeepers at 1024px resolution, covering different castes, brood, and a wide variety of species.
The plan now
My plan is to first evaluate every model on my old ~12k image dataset and compare the results. After that, I will make a list of improvements to the dataset and apply them to the new ~1100 image dataset. In this case, dataset quality is more important than quantity. I will compare 4 different models and use the best one for the final dataset.
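A simple way to compare detectors on the same images is to count true positives, false positives and false negatives at a fixed IoU threshold. The sketch below uses greedy matching in order of confidence with a threshold of 0.5. It is only an illustration of the idea, with made-up function names, and not the exact evaluation used for the benchmarks in this post.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, truths, iou_thr=0.5):
    """Count (tp, fp, fn) for one image.

    preds:  list of (box, score) from the model
    truths: list of ground-truth boxes
    Each ground-truth box can be matched by at most one prediction.
    """
    matched = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, None
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            value = iou(box, truth)
            if value > best_iou:
                best_iou, best_j = value, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(truths) - len(matched)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Summing tp, fp and fn over all images before calling `precision_recall` gives one pair of numbers per model that can be compared directly.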
Model Comparisons
How can we further improve?
As you can see, the transformer-based models outperform the YOLO models by a very large margin. They have fewer false positives and false negatives. The RF-DETR-2XL is by far the best here, and it’s very easy to train and convert to the correct format for fast and easy deployment on the AntScout API or for public access.
It has one giant flaw though: it’s pretrained with num_queries set to 300, which is way too low for our use case of detecting ants, since sometimes there are up to 1500 ants in a single image! So I have to find a way to still use the pretrained checkpoint while increasing num_queries to 1500 or more. This will also dramatically slow down inference, but it’s already more than fast enough, so that doesn’t really bother me much.
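To get a feel for how much this cap matters, the annotation counts can be checked against it. A DETR-style model can output at most num_queries boxes per image, so any ants beyond that number cannot be detected no matter how well the model is trained. The sketch below assumes a plain list with the number of annotated ants per image; the function name and the example counts are made up.

```python
def query_cap_report(ants_per_image, num_queries=300):
    """Estimate what a fixed query budget costs on a dataset.

    ants_per_image: list with the number of annotated ants in each image.
    """
    total = sum(ants_per_image)
    over = [n for n in ants_per_image if n > num_queries]
    unreachable = sum(n - num_queries for n in over)
    return {
        # Images with more ants than the model can output.
        "images_over_cap": len(over),
        # Ants that can never be detected with this budget.
        "ants_unreachable": unreachable,
        # Upper bound on recall, even for a perfect model.
        "max_recall": (total - unreachable) / total if total else 1.0,
        # Smallest budget that covers the busiest image.
        "min_num_queries_needed": max(ants_per_image, default=0),
    }


# Made-up counts: two sparse images and two crowded ones.
report = query_cap_report([10, 250, 1500, 400], num_queries=300)
```

Running this on the real annotations would show how many images are actually affected and how large num_queries has to be to cover the busiest one.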
Future Improvements and SOTA Architectures
The main way to improve the model is to increase the number of images in the dataset, especially by making them more diverse in species, brood, castes, and even scenes. This will help the model be more accurate and less prone to false positives and false negatives. Without enough such examples in the dataset, it might think a mealworm is a cocoon, for example, or that springtails are eggs or larvae.
I’m also looking forward to when the RF-DETR model uses the new DINOv3 backbone. Since it’s way better at dense features, it could give me some extra accuracy and precision.