Identifying ants with object detection and classification models
Introduction
In this blog post, I will explain my plan to create the best possible ant identification model using object detection and classification models. The goal is a model that can not only detect ants in images but also classify them to species with high accuracy. This is a challenging task: there are more than 13,000 known ant species, and many of them look very similar to each other.
The plan
I already own a large dataset of 15,307 images annotated with ants. A few years back I used it to train YOLOv8s, which achieved decent results. When YOLOv11 was released, I retrained on it and got even better results. Now I want to push the limits further. To do this, I want to use the huge datasets from iNaturalist and/or GBIF to push the boundaries of what's possible. My plan is a pipeline that first uses object detection to find all ants in an image and then separately classifies each detected ant to species with a dedicated classification model. This way, I can create a model that not only detects ants, but also classifies each one to species individually. Normally both tasks are handled inside a single object detection model, but I want to see whether separating them achieves better results.
Step 1: Creating the best possible object detection model
Here are the current benchmarks of models trained on just my 15k images at 640px resolution; the only class in this dataset is “ant”.
| Model | mAP 50 | mAP 50-95 | Recall | Speed (ms) |
|---|---|---|---|---|
| YOLOv8s | 86.7 | 44.8 | 80.3 | 2 |
| YOLOv11s | 87.3 | 45.8 | 86.8 | 3 |
| YOLOv11m | 88.1 | 46.3 | 81.4 | 5 |
| RF-DETR-LARGE | 84.3 | 43.8 | 82.0 | 25 |
With these benchmarks, keep in mind that the main purpose of the model is to count ants, so recall is the most important metric here. Also keep in mind that all models used the same dataset and the same validation set. To my surprise, RF-DETR Large achieved almost the same results as YOLOv11m. On my other dataset with brood, RF-DETR achieved more than a 20% higher mAP50 than the YOLO models, but that dataset is not public yet. I will most likely run inference at 1024px instead of 640px, as it is a dynamic-resolution model, so the results will be even better.
Creating the dataset for classification
Now that I have the best object detection model I can currently train, I need to build a dataset for classification. To do this, I will run inference on the huge iNaturalist and GBIF datasets to detect ants in those images. I will then crop out the detected ants and save the crops into a new dataset, which I will use to train a classification model that classifies ants to species from the cropped images.
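As a rough illustration of this detect-and-crop step, here is a minimal sketch using the Ultralytics API (the checkpoint path, source folder, and confidence threshold are placeholders, not my exact settings):

```python
# Minimal detect-and-crop sketch. Paths and thresholds are placeholders.
from pathlib import Path
from PIL import Image
from ultralytics import YOLO

detector = YOLO("best.pt")          # placeholder path to the trained ant detector
src_dir = Path("inat_raw")          # placeholder folder of downloaded images
dst_dir = Path("ant_crops")
dst_dir.mkdir(exist_ok=True)

for img_path in src_dir.glob("*.jpg"):
    result = detector.predict(source=str(img_path), imgsz=640, conf=0.4, verbose=False)[0]
    image = Image.open(img_path).convert("RGB")
    for i, box in enumerate(result.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = map(int, box)
        crop = image.crop((x1, y1, x2, y2))
        crop.save(dst_dir / f"{img_path.stem}_ant{i}.jpg")
```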
Current state of the art classification models
I need to choose the right classification model for this task. Currently, the best models for image classification are vision transformers (ViTs) and ConvNeXt models.
Here are some of the most important requirements:
- Highest possible accuracy
- Reasonable inference and training speed (fewer than 500M parameters)
- Good at either 224px or 448px resolution
Some of the best models currently available to choose from:
| Model | Parameters | Res | Top-1 Accuracy | Acc per M Params |
|---|---|---|---|---|
| DINOv3 ViT-S/16 | 21M | 224 | ~82.0% | 3.905 |
| DINOv3 ViT-S+/16 | ~40M | 224 | ~83.5% | 2.088 |
| DINOv3 ViT-B/16 | 86M | 224 | ~84.5% | 0.983 |
| DINOv3 ViT-L/16 | 304M | 224 | ~86.0% | 0.283 |
| DINOv3 ViT-L/16 | 304M | 448 | ~87.2% | 0.287 |
| EVA-02 ViT-Ti/14 | 6M | 224 | 80.7% | 13.450 |
| EVA-02 ViT-S/14 | 22M | 224 | 85.8% | 3.809 |
| EVA-02 ViT-B/14 | 87M | 224 | 88.3% | 1.015 |
| EVA-02 ViT-L/14 | 304M | 224 | 89.6% | 0.295 |
| EVA-02 ViT-L/14 | 304M | 448 | 90.0% | 0.296 |
| EfficientNetV2-L | 120M | 480 | 85.7% | 0.714 |
| ConvNeXt V2-B | 89M | 224 | 84.9% | 0.954 |
| ConvNeXt V2-L | 198M | 384 | 86.3% | 0.436 |
| Swin-L | 197M | 384 | 86.4% | 0.438 |
Keep in mind that these models were evaluated on ImageNet-1k, which has 1,000 classes. My ant species dataset will likely have far more classes and be much harder at low resolution, so higher resolution might yield a bigger accuracy gain than it does on ImageNet.
EVA-02 ViT-L/14 at 448px seems like the best choice, but since training at 448px is much harder and more expensive, I will most likely start with EVA-02 ViT-B/14 at 224px, as it offers a good balance between accuracy and training speed.
The DINOv3 models are also very interesting, as they are trained with self-supervised learning, which might transfer better to my dataset. The ViT-L/16 at 448px looks like a good choice, but it is quite large and might be slow to train, so I will try the ViT-B/16 at 224px first.
So I will first build a 448px dataset, downscale it to 224px, and train on that; I can always still train a 448px model later if I have the resources. I will first try DINOv3 ViT-B/16 at 224px and EVA-02 ViT-B/14 at 224px, and see which one performs better.
Let's first create the dataset by running inference on the giant datasets, then crop out the ants and save them into a new classification dataset. After that, I can start training the classification models and see how well they perform.
Resources used to create this dataset:

Cleaning the dataset
There are some problems with the dataset which need to be fixed before training:
Many images are very small (below 50x50px), which makes them useless for classification.
Solution: I will filter out all images with a width or height below 50px (see the sketch after this list).
Some images are wrongly labeled, as iNaturalist and GBIF contain some misidentified species.
Solution: The easiest way to mitigate this is label smoothing of 10% during training, so the model does not overfit to the wrong labels.
It's extremely imbalanced: some species have thousands of images, while others have only a few.
Solution: This requires a few steps. First, I will require a minimum of 10 images per class; then, during training, I will add an automatic class-balancing technique. On top of this, data augmentation is applied to all classes. See below for more details.
Since the YOLOv11s model is not perfect, some crops do not contain an ant at all, or contain multiple ants.
Solution: This should not be a big problem, as the model should learn to ignore these images. Again, label smoothing helps here.
There are classes at family, genus and species level, for example “Lasius” but also “Lasius niger”, which makes training harder because the classes overlap at different taxonomic levels.
Solution: This is complex to solve, more about this below.
The model has no idea where in the world the ant was photographed, so species that look very similar might be confused.
Solution: This is also complex to solve, more about this below.
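To make the size filter and the label smoothing concrete, here is a minimal sketch (the folder layout is a placeholder; the 50px threshold and 10% smoothing are the values mentioned above):

```python
# Minimal sketch: drop crops smaller than 50px on either side, and use
# 10% label smoothing in the training loss (PyTorch). Paths are placeholders.
from pathlib import Path
from PIL import Image
import torch.nn as nn

def filter_small_images(root: str, min_side: int = 50) -> None:
    """Delete crops whose width or height is below min_side pixels."""
    for img_path in Path(root).rglob("*.jpg"):
        with Image.open(img_path) as im:
            w, h = im.size
        if w < min_side or h < min_side:
            img_path.unlink()

# Label smoothing to soften the impact of mislabeled iNaturalist/GBIF images.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```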
Handling hierarchical labels
There are multiple ways to handle this. The easiest is to ignore it and train the model on all classes as if they were separate. That might work decently well, but it would be better to use the hierarchy to improve the model. Keeping genus-level labels as separate classes adds a lot of noise to the dataset, because many classes would overlap.
Problem
There are multiple overlapping classes, for example “Lasius” and “Lasius niger”, which makes it hard for the model to learn because they refer to the same ants at different taxonomic levels.
Solution
Create a custom loss function that takes into account the hierarchy of the labels.
For example:
| True label | Predicted label | Reward given to model |
|---|---|---|
| Lasius niger | Lasius niger | +1 |
| Lasius | Lasius niger | +0.8 / number of species in the genus |
| Lasius niger | Lasius | +0.2 |
| Lasius niger | Formica | 0 |
If the model predicts “Lasius” when the true label is “Lasius niger”, it should be penalized less than if it predicted “Formica”. This way, the model can learn to generalize better across the hierarchy.
This approach also lets me use far more images, because I can now include all images that are only labeled at genus level, which increases the dataset size significantly.
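As a rough sketch of one way to encode these rewards as a differentiable loss, the snippet below precomputes a reward matrix following the table above and maximizes the expected reward under the softmax. The class list, genus lookup, and expected-reward formulation are illustrative, not the exact implementation:

```python
# Sketch: hierarchy-aware loss via a precomputed reward matrix R[true, pred],
# trained by maximizing the expected reward under the softmax distribution.
import torch
import torch.nn.functional as F

classes = ["Lasius", "Lasius niger", "Lasius flavus", "Formica", "Formica rufa"]
genus_of = {c: c.split()[0] for c in classes}          # label -> genus name
is_genus = {c: (" " not in c) for c in classes}        # genus-level label?
species_per_genus = {"Lasius": 2, "Formica": 1}        # species counts per genus

num_classes = len(classes)
R = torch.zeros(num_classes, num_classes)
for t, true in enumerate(classes):
    for p, pred in enumerate(classes):
        if true == pred:
            R[t, p] = 1.0                                      # exact match
        elif is_genus[true] and not is_genus[pred] and genus_of[pred] == true:
            R[t, p] = 0.8 / species_per_genus[true]            # genus label, species prediction
        elif not is_genus[true] and is_genus[pred] and genus_of[true] == pred:
            R[t, p] = 0.2                                      # species label, genus prediction
        # everything else (wrong genus) stays at 0

def hierarchical_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative expected reward, averaged over the batch."""
    probs = F.softmax(logits, dim=1)
    expected_reward = (probs * R[targets].to(logits.device)).sum(dim=1)
    return -expected_reward.mean()
```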
Another thing to consider is the validation set. To keep it simple and more reliable, I removed all genus-level labels from the validation set, so I can evaluate the model at species level and also at genus level by mapping each species to its genus.
Possible alternative solutions
Another solution to this whole problem would be a multi-label classification model that predicts both the genus and the species at the same time, so the model learns both levels of the hierarchy simultaneously. However, this would require a more complex model architecture and more resources to train, and would most likely be far less reliable, as many genera and species can look very similar.
Handling geographical information
Some species look very similar and can only be reliably distinguished by knowing where the ant was found. For example, Lasius umbratus and Lasius subumbratus look very similar but are found in different parts of the world. We also do not want this to be a hard rule: some species occur in multiple regions, invasive species exist, and users sometimes provide incorrect locations.
Problem
You don't want the model to think “this specific ant ONLY lives at this exact GPS coordinate in a small city somewhere in Europe” just because the training data only contained that ant at that location; that would make it inflexible and generalize poorly. Another requirement is that the model should not rely 100% on location data: it must still be usable without any location input.
Solution 1
Add location noise during training, so the model learns to generalize better.
For example, if the true location is (lat: 52.5200, lon: 13.4050), then during training we add random noise of +/- 0.5 degrees to both latitude and longitude. This way, the model learns not to rely too heavily on exact location data.
Solution 2
Adding location dropout during training, so the model also learns to classify without location data.
For example, during training we randomly drop the location input with a probability of 50%. This way, the model learns to classify ants from visual features alone and can still be used without location data.
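Both tricks fit in a few lines. A minimal sketch (the noise range and dropout probability are the values above; how the coordinates are fed into the model is left out):

```python
# Sketch: jitter the coordinates by +/- 0.5 degrees and drop them entirely
# 50% of the time, so the model cannot over-rely on precise locations.
import random

def augment_location(lat: float, lon: float,
                     noise_deg: float = 0.5,
                     drop_prob: float = 0.5):
    """Return a (lat, lon, has_location) triple for one training sample."""
    if random.random() < drop_prob:
        return 0.0, 0.0, 0.0          # location dropped; flag tells the model
    lat = lat + random.uniform(-noise_deg, noise_deg)
    lon = lon + random.uniform(-noise_deg, noise_deg)
    return lat, lon, 1.0
```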
Class balancing
There are a lot of classes, some with 200k images, while others have only 10. This makes it hard for the model to learn, as it will naturally be biased towards the classes with more images.
Standard balancing uses simple math: “If Solenopsis invicta is 50000x more common, make Ankylomyrma coronacantha 50000x more important.”
Problem
It is too aggressive: the model will obsess over the rare species and memorize their images perfectly (overfitting), then fail to recognize them in new images.
Solution
Using “effective” balancing, which recognizes the law of diminishing returns. The first 50 photos of an ant teach the model a lot; the 500th photo of that same ant looks almost exactly like the others and adds little new information. The logic: 1,000 photos are not actually 1,000 times better than 1 photo; effectively they are worth maybe 200 unique viewpoints. Weights are calculated from this “effective number”: rare classes get boosted, but not so violently that it breaks the training. In practice, it counts how many images each species has, decides that beyond a certain number new photos add less value, and then makes the training loop show the rare species more often, so the model treats them more like equals of the common ones.
There are two main hyperparameters to tune here:
- Beta (β): Controls how quickly the value of new images diminishes. A β close to 1 means diminishing returns happen slowly, while a β closer to 0 means they happen quickly.
- Max Weight: The maximum weight that any class can receive
Some classes in this dataset would otherwise end up weighted over 50,000x more than others, so I need to set a max weight to prevent extreme values. I will start with a beta of 0.9999 and a max weight of 1000x and see how that performs (see the sketch below). This is still quite aggressive, but not extreme.
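For reference, here is a minimal sketch of effective-number weighting as described in “Class-Balanced Loss Based on Effective Number of Samples” (Cui et al.), with the beta and weight cap from above; the example counts are illustrative:

```python
# Sketch: class weights from the effective number of samples,
# E_n = (1 - beta^n) / (1 - beta), with a hard cap on the maximum weight.
import numpy as np

def effective_number_weights(class_counts, beta=0.9999, max_weight=1000.0):
    counts = np.asarray(class_counts, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    weights = weights / weights.min()          # most common class gets weight 1.0
    return np.minimum(weights, max_weight)     # cap extreme weights for rare classes

# Example: a species with 200,000 images vs. one with 10 images.
print(effective_number_weights([200_000, 10]))   # rare class capped at 1000x
```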
Dataset before and after cleaning
| | Before Cleaning | After Cleaning |
|---|---|---|
| Total Images | ~12,000,000 | 6,652,773 |
| Total Classes | 11,087 | 6,384 |
Cleaning validation set
To create a reliable validation set, I removed all genus-level labels from it and manually cleaned it as much as possible.
This is more difficult than you would think: where do you draw the line between a good image and a bad one? If only an antenna is visible, is that still a good image? What if the ant is blurry, or only partially visible? I tried to keep it challenging while ensuring that the validation set is of high quality.
The validation set is equally balanced, with 5 images per species. This makes it an extremely difficult validation set, but also a very reliable one. One big problem: a large portion of the species with only 10 images or fewer come from GBIF (mostly AntWeb), so a very large share of the validation set consists of specimen photos from AntWeb. These are generally easier to classify than field photos, but also difficult in their own way, because the specimens are often very similar to each other.
A high score on this set therefore mostly means the model is good at classifying ants from AntWeb, which is not the main goal, but it is still a useful benchmark for how well the model performs and converges, as the different specimens of a given species still vary a lot.
I made a simple script to go through all the tens of thousands of images in the validation set by hand and mark the incorrect ones for removal.
Here's an example of one of the few hundred pages I went through:

If the model scored 50% on this validation set, that would mean it got half of these images (and tens of thousands of others like them) correct, none of which it has ever seen before.
The validation set also does not include any images from the same observation or source image as any training image, to ensure the model is evaluated on completely unseen data.
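A simple way to enforce this is to split by observation rather than by image. A minimal sketch, assuming each record carries an observation_id field (the field name and split fraction are illustrative):

```python
# Sketch: keep all crops from the same observation on the same side of the
# train/validation split, so near-duplicate photos never leak across it.
import random
from collections import defaultdict

def split_by_observation(records, val_fraction=0.02, seed=42):
    """records: list of dicts with at least an 'observation_id' key (assumed field)."""
    by_obs = defaultdict(list)
    for rec in records:
        by_obs[rec["observation_id"]].append(rec)

    obs_ids = sorted(by_obs)
    random.Random(seed).shuffle(obs_ids)
    n_val = int(len(obs_ids) * val_fraction)
    val_obs = set(obs_ids[:n_val])

    train = [r for oid, recs in by_obs.items() if oid not in val_obs for r in recs]
    val = [r for oid, recs in by_obs.items() if oid in val_obs for r in recs]
    return train, val
```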
Training the model
Now that the dataset is cleaned and prepared, I can start training the classification models. I will first try DINOv3 ViT-L/16 at 224px, and EVA-02 ViT-B/14 at 224px, and see which one performs better.
EVA-02 already plateaus after about 3 epochs, while DINOv3 beats it within the first epoch, while its backbone is still frozen.
I used 2x RTX PRO 6000 WS GPUs, with 96GB of VRAM each, to train these models.
For reference, one epoch of DINOv3 ViT-L/16 at 448px takes about 7 hours; at 224px it takes about 2 hours unfrozen and about 40 minutes with a frozen backbone.
The DINOv3 ViT-L/16 was unfrozen after the second epoch; EVA-02 was trained end-to-end from the start.
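The freeze-then-unfreeze schedule boils down to toggling requires_grad on the backbone. A minimal PyTorch sketch, assuming the model wrapper exposes a separate backbone attribute (an assumption, not a given for every library):

```python
# Sketch: train only the classification head for the first epochs,
# then unfreeze the backbone for full fine-tuning.
import torch

def set_backbone_trainable(model: torch.nn.Module, trainable: bool) -> None:
    for param in model.backbone.parameters():   # assumes a .backbone attribute
        param.requires_grad = trainable

# Epochs 1-2: frozen backbone, only the linear head learns.
#   set_backbone_trainable(model, False)
# From epoch 3: unfreeze, rebuild the optimizer so the backbone parameters
# are included, and continue with a lower learning rate.
#   set_backbone_trainable(model, True)
```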
Final evaluation results
| Model | Top-1 Species Accuracy | Top-5 Species Accuracy | Top-1 Genus Accuracy | Top-5 Genus Accuracy | Resolution | Location |
|---|---|---|---|---|---|---|
| DINOv3 ViT-L/16 + hierarchical loss | 54.79% | 79.37% | 90.20% | 97.40% | 448px | Yes |
| DINOv3 ViT-L/16 + hierarchical loss | 49.77% | 73.79% | 87.60% | 96.22% | 448px | No |
| DINOv3 ViT-L/16 + hierarchical loss | 53.05% | 77.36% | 88.81% | 97.02% | 224px | Yes |
| DINOv3 ViT-L/16 + hierarchical loss | 47.68% | 71.78% | 85.94% | 95.85% | 224px | No |
| DINOv3 ViT-B/16 | 41.82% | 65.32% | - | - | 224px | Yes |
| DINOv3 ViT-B/16 | 30.51% | 51.39% | - | - | 224px | No |
| EVA02 ViT-B/14 + hierarchical loss | 28.59% | - | - | - | 224px | Yes |
As expected, the DINOv3 ViT-L/16 gained a small ~1.5% boost in accuracy when using 448px instead of 224px. Adding location data also provided a significant boost of about 6-7% in top-1 species accuracy. The hierarchical loss function proved very effective as well, as it allowed the model to generalize better across the different levels of the taxonomy, although the loss functions would need to be tested on the same model to quantify the exact difference.
Evaluation against iNaturalist and Observation.org
It's quite hard to evaluate fairly, as there is no existing model that uses the exact same pipeline as mine.
How my model works: image -> split into individual ants -> classify each ant to species
How other models work: image -> classify the entire image to species (the image might contain multiple ants, or none at all)
To make this fair, I will give iNaturalist the full image, which is how it is supposed to be used, and I will upload the full image to my model as well, which will then detect at most 10 ants and use the sum of their predictions as the outcome.
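Summing the per-ant predictions looks roughly like this; the classifier call and preprocessing are placeholders:

```python
# Sketch: combine per-ant species predictions for a full image by summing
# the softmax probabilities over all detected crops (up to 10).
import torch

def predict_image(crops, classifier, max_ants=10):
    """crops: list of preprocessed ant crops as tensors; classifier returns logits."""
    total = None
    for crop in crops[:max_ants]:
        probs = torch.softmax(classifier(crop.unsqueeze(0)), dim=1).squeeze(0)
        total = probs if total is None else total + probs
    return total   # highest values = most likely species for the whole image
```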
Because the evaluation set is far too difficult for the iNaturalist model, I will only use more common species that it has actually been trained on and is able to identify. None of these images have been seen by either model before.
The evaluation dataset for this is quite small, as I need to manually run inference on each image via iNaturalist. The evaluation set used for this can be seen in the header image of this blog post.
Scoring system
It's a really simple scoring system: each prediction gets exactly one of these four options, or none at all. Afterwards, I sum up the points for each model. I will also remove all images where both models got the species fully correct, to only look at the cases where they differ.
| Prediction | Rank | Points |
|---|---|---|
| Exact Species | 1st Option | 10 |
| Exact Species | Top 5 | 5 |
| Correct Genus | 1st Option | 3 |
| Correct Genus | Top 5 | 1 |
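For completeness, the scoring rules from the table expressed in code (a small sketch; the genus is assumed to be the first word of the species name):

```python
# Sketch: score one image's predictions against the true species label.
def score_prediction(true_species: str, top5: list[str]) -> int:
    """top5: ranked species predictions, best first."""
    true_genus = true_species.split()[0]
    if top5 and top5[0] == true_species:
        return 10                                   # exact species, 1st option
    if true_species in top5:
        return 5                                    # exact species, top 5
    if top5 and top5[0].split()[0] == true_genus:
        return 3                                    # correct genus, 1st option
    if any(p.split()[0] == true_genus for p in top5):
        return 1                                    # correct genus, top 5
    return 0
```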
Results