Building an Ant Species Knowledge Dataset from Scientific Literature

Introduction

One of the biggest problems with large language models today is hallucination: they’ll confidently generate plausible-sounding but completely wrong information about ant species. AntScout already uses AntWiki’s data for care sheets, but those care sheets cover only a fraction of what’s known about each species. There’s an enormous amount of factual, peer-reviewed knowledge locked away in thousands of research papers that no LLM can reliably access.

So I set out to build a structured, machine-readable dataset of facts about all ant species, extracted directly from scientific literature. This dataset is specifically designed to be used by LLMs to ground their responses in accurate, verifiable information about ants, containing only facts extracted from actual research papers, not aggregated care sheet data from other sources.

Eventually, this dataset will be integrated directly into AntScout’s care sheets. Once it’s live, any LLM with web-grounding capabilities will be able to pull accurate, peer-reviewed facts about any ant species straight from AntScout, instead of hallucinating care advice scraped from forums and blogs. Every fact will be structured, citable, and traceable back to its original paper, hopefully turning AntScout into a reliable grounding source for ant information on the open web.

The Sources: FORMIS2024 & AntCat

The starting point was the FORMIS2024 bibliography database. FORMIS (Formicidae Literature Database) is a comprehensive index of myrmecological literature maintained over decades, containing references to tens of thousands of papers about ants. It covers everything from original species descriptions and taxonomic revisions to behavioral studies and ecological surveys. This got me around 11,000 PDFs of at least decent quality. There are surely more to be found, but it’s more than enough for a first version.

I also used AntCat as a second data source. AntCat (Ant Catalog) is an online catalog of ant taxonomy that maintains an extensive bibliography of ant-related publications. This got me around 7,000 more PDFs.

Between the two databases, I ended up with a large number of duplicate PDFs: the same paper appearing under slightly different titles, DOIs, or metadata entries. I deduplicated them by comparing PDF content, so only unique papers made it into the final parsing pipeline, around 13,000 in total.
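To make the deduplication step concrete, here is a minimal sketch of content-based deduplication. It assumes duplicates can be caught by hashing normalized text from the first few pages (the directory name and page count are illustrative, not the actual pipeline):

import hashlib
from pathlib import Path

from pypdf import PdfReader  # pip install pypdf

def content_fingerprint(pdf_path: Path, max_pages: int = 3) -> str:
    """Hash normalized text from the first few pages, so the same paper
    re-encoded by different databases still collides."""
    reader = PdfReader(pdf_path)
    n = min(max_pages, len(reader.pages))
    text = " ".join(reader.pages[i].extract_text() or "" for i in range(n))
    normalized = " ".join(text.lower().split())  # collapse whitespace
    if not normalized:
        # Scanned PDFs may have no text layer; fall back to raw bytes
        # so they don't all collide on an empty string.
        return hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(pdf_dir: Path) -> list[Path]:
    """Keep the first copy of each fingerprint, drop the rest."""
    seen: dict[str, Path] = {}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        seen.setdefault(content_fingerprint(pdf_path), pdf_path)
    return list(seen.values())

unique_papers = deduplicate(Path("pdfs/"))  # illustrative directory name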

Many of these papers date back to the 1800s, scanned from physical copies, sometimes barely legible, with inconsistent typography, mixed languages, and hand-drawn illustrations embedded in the text. Over a century of taxonomic revisions means species names have changed, been synonymized, or split. A paper from 1863 might refer to a species by a name no longer in use today. All of this makes parsing exceptionally challenging.

Challenges in Parsing Scientific Literature

This is not your typical NLP task. Scientific literature about ants presents a unique set of problems:

Legacy formats and OCR quality: Older papers are scans of physical copies. OCR errors are common: “rn” confused with “m”, ligatures merged or split incorrectly, species names mangled. The model has to be robust enough to handle garbled text and still extract meaningful facts from it. For OCR I use the Document AI processor from Google Cloud, the same one I used with Gemini 2.5 Pro; it is relatively cheap and very reliable.
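As a rough sketch of that OCR step, a single PDF can be run through Document AI’s synchronous API like this (the project, location, and processor IDs are placeholders, and whether this matches the exact setup used here is an assumption):

from google.cloud import documentai  # pip install google-cloud-documentai

def ocr_pdf(pdf_path: str, project_id: str, location: str, processor_id: str) -> str:
    """Run one PDF through a Document AI OCR processor and return plain text.
    Note: the synchronous endpoint has a page limit; long scans need the
    batch processing API instead."""
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project_id, location, processor_id)
    with open(pdf_path, "rb") as f:
        raw_document = documentai.RawDocument(
            content=f.read(), mime_type="application/pdf"
        )
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    return result.document.text

# Illustrative IDs and file name, not real values.
text = ocr_pdf("pdfs/forel_1912.pdf", "my-project", "eu", "my-ocr-processor")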

Language diversity: While the majority of papers are in English, significant portions of the myrmecological literature are in German, French, Spanish, Portuguese, and other languages. The model needs to handle multilingual input and correctly extract species-level facts regardless of the language.

Narrative vs. structured content: Modern papers might have clearly labeled “Results” and “Discussion” sections. Papers from the 1800s? Not so much. They’re often written as flowing narrative, with observations embedded in paragraphs of travelogue or correspondence. Extracting individual facts requires understanding what constitutes a factual claim versus context or opinion.

Scale: Even after deduplication, the corpus contains thousands upon thousands of papers. Each needs to be processed, parsed, and validated, which costs thousands of dollars in API calls for both the OCR and the AI models.

Old taxa: A lot of papers use outdated taxonomy, so the species names in the papers might not match current usage. This is something to be aware of when using the dataset. I mitigate it by using the AntCat database to resolve old species names to their current taxonomy.
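A minimal sketch of that resolution step, assuming a synonym table has already been built from AntCat’s taxonomy data (the entries shown are illustrative, not real AntCat output):

# Toy synonym table: historical name -> currently valid name.
# In practice this would be generated from AntCat's taxonomy data.
SYNONYMS = {
    "Formica rubra": "Myrmica rubra",  # Linnaeus's 1758 name, illustrative
    "Cremastogaster schimmeri": "Crematogaster schimmeri",  # common misspelling
}

def resolve_species_name(name: str) -> str:
    """Map a possibly outdated species name to its current valid name,
    falling back to the input when no synonym is known."""
    return SYNONYMS.get(name.strip(), name.strip())

print(resolve_species_name("Formica rubra"))  # -> "Myrmica rubra"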

The Approach: Parsing PDFs with GLM 5.1

Manually reading and extracting facts from thousands of papers would take a lifetime. I needed a model that could reliably parse scientific PDFs into structured data without making things up. This is where the choice of model matters enormously.

I chose GLM 5.1 for this task because it currently has one of the lowest hallucination rates among frontier models. When you’re building a dataset that LLMs will use as a source of truth, you absolutely cannot afford a parser that invents facts. The model needed to do four things (see the prompt sketch after this list):

  1. Extract only what’s explicitly stated in the paper, no inference, no filling in gaps
  2. Correctly attribute facts to species despite inconsistent naming conventions across centuries of literature
  3. Preserve the original phrasing and context rather than paraphrasing, to minimize distortion
  4. Handle diverse formats, from modern structured papers to 19th-century naturalist accounts written in narrative prose
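These requirements map directly onto the extraction prompt. The actual system prompt is not reproduced here, but an illustrative excerpt of the kind of constraints it encodes might look like this:

Extract a fact ONLY if it is explicitly stated in the paper; never infer or fill gaps.
Preserve the original phrasing of each fact as closely as possible.
Attribute each fact to the exact species name used in the paper.
If a passage is illegible or ambiguous, skip it rather than guess.
Return a single JSON object matching the schema below; leave fields empty rather than inventing values.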

The key insight is that for a dataset meant to ground LLMs, precision matters more than recall. It’s better to extract fewer facts and have them all be accurate than to extract more facts with some being fabricated. GLM 5.1’s low hallucination rate makes it uniquely suited for this: when it doesn’t know something or can’t find it in the text, it says so rather than making something up. That’s exactly the behavior you want when building a truth dataset.

At the time of writing, GLM 5.1 ranks in the top 3 of 353 models on the AA-Omniscience Hallucination Rate leaderboard. In my experience, it is also extremely good at instruction following and long-context work.

The Output Format

Each paper is parsed into a structured JSON object using a carefully designed system prompt (some of the JSON objects were produced with a slightly older version of that prompt). Here’s an example from a 2010 paper by Shingo Hosoishi and Kazuo Ogata:

{
  "paper_title": "On the identity of Crematogaster schimmeri Forel, 1912 and the distribution of subgenus Decacrema in Asia",
  "authors": [
    "Shingo Hosoishi",
    "Kazuo Ogata"
  ],
  "publication_year": "2010",
  "language": "en",
  "research_focus": "Taxonomic revision of Crematogaster schimmeri, determining its correct subgenus placement (Orthocrema vs Decacrema) through examination of type specimens, and clarifying the distribution of Decacrema in Southeast Asia.",
  "methods": "Examination of syntype workers of Crematogaster schimmeri from BMNH (The Natural History Museum, London), NHMB (Naturhistorisches Museum, Basel), and MHNG (Musée d'Histoire Naturelle, Geneva).",
  "species": {
    "Crematogaster schimmeri": [
      "Originally described by Forel in 1912 from Pilam, Taiwan",
      "Type specimens examined: syntype workers in BMNH, NHMB, and MHNG",
      "Character states: 11-segmented antenna",
      "Anterolateral margins of clypeus not protruded anteriorly",
      "Petiole with node-like process posteriorly",
      "Can be distinguished from other Orthocrema species by sculptured head",
      "Shining surface of lateral pronotum and mesopleuron",
      "Originally placed in subgenus Decacrema by Bolton (1995) based on misquotation of Emery (1922)",
      "Emery (1922) originally correctly placed the species in Orthocrema based on: 11-segmented antenna, 2-jointed antennal club, petiole with subparallel sides, and postpetiole without median sulcus",
      "Bolton et al. (2006) transferred the species to subgenus Crematogaster without comment",
      "Authors confirm placement in Orthocrema following Emery (1922)",
      "Represented by a star on the distribution map in Fig. 3"
    ],
    "Crematogaster subgenus Decacrema": [
      "Workers can be easily distinguished from other subgenera by their 10-segmented antenna",
      "Obligate plant-ants associated with Macaranga (Euphorbiaceae)",
      "Known distribution in Asia: Southeast Asia (Malaya, Sumatra, Borneo, Sulawesi, southern Philippines), Taiwan, New Guinea, Africa and Madagascar",
      "In Asia, Decacrema is confined to between approximately 10°N and 10°S",
      "Center of distribution is in the Malesia region",
      "No species of Decacrema is known from the mainland between Taiwan and the Malay Peninsula, indicating discontinuous distribution",
      "Macaranga species are widely distributed in this region, although not all are ant-plants"
    ]
  }
}

Every fact is attributed to a specific source, preserving the exact text from the original paper. This means every claim in the dataset can be traced back to a real publication.

The structure serves a clear purpose: by keeping the species field as a map from species name to a list of facts, it becomes trivial to look up all known facts about a given species across the entire corpus. The research_focus and methods fields provide context about what kind of study produced these facts, which is critical for evaluating their reliability: a controlled experiment carries different weight than an anecdotal observation from 1863.

Use with LLMs

This dataset is built specifically to be fed into LLMs. The structured JSON format makes it straightforward to integrate as a retrieval source in RAG (Retrieval-Augmented Generation) pipelines. When an LLM is asked about a species, it can pull verified facts directly from this dataset rather than generating plausible but potentially incorrect information.

The dataset contains only information extracted directly from research papers. If you also need practical husbandry data, you can combine this dataset with external sources like AntWiki.

How to Use This Dataset

The primary way to use this dataset is as a retrieval source for LLM tool calls or scripts. Rather than dumping the entire dataset into context, you query it for a specific species and feed only the relevant results to the model.

Basic lookup: When you need information about a species, say, Myrmica rubra, you retrieve all key findings for that species across every paper in the dataset. This can amount to hundreds of thousands of tokens of factual, verified content. High-end LLMs with large context windows can handle this volume directly, making it possible to generate comprehensive outputs like detailed care sheets, behavioral summaries, or ecological profiles grounded entirely in primary literature.

Filtering and ranking: When the full result set is too large or you only need the most relevant findings, you can reduce the output before passing it to the model:

  • Simple string filters: scripts that check whether a finding contains a keyword (e.g., “temperature”, “diet”, “nesting”) to narrow results to a specific topic
  • Embedding or ranking models: encode the query and each finding as vectors, then rank by semantic similarity to surface the most relevant facts first
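A minimal sketch of the keyword-filter approach (the embedding variant would swap the substring test for cosine similarity over vectors from any embedding model; the example findings are made up for illustration):

def filter_findings(findings: list[str], keywords: list[str]) -> list[str]:
    """Keep only findings that mention at least one keyword (case-insensitive)."""
    lowered = [kw.lower() for kw in keywords]
    return [f for f in findings if any(kw in f.lower() for kw in lowered)]

# `findings` would come from the dataset lookup described below.
findings = [
    "Workers forage on honeydew from aphids",  # illustrative facts
    "Nests in rotten wood and under stones",
]
print(filter_findings(findings, ["diet", "honeydew", "prey"]))
# -> ['Workers forage on honeydew from aphids']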

Output format: When providing findings to an LLM, we strongly recommend encoding the data in TOON (Token-Oriented Object Notation) rather than JSON. TOON is a compact, human-readable encoding of the JSON data model that uses roughly 40% fewer tokens than standard JSON while maintaining comparable or better retrieval accuracy. For a dataset this large, that token reduction is significant: it means you can fit more findings per species within the same context window, or use a smaller context window and lower your costs.
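For illustration, based on the public TOON spec, the header fields of the JSON example above might encode roughly like this (exact output depends on the encoder, so treat it as a sketch, not canonical TOON):

paper_title: "On the identity of Crematogaster schimmeri Forel, 1912 and the distribution of subgenus Decacrema in Asia"
authors[2]: Shingo Hosoishi,Kazuo Ogata
publication_year: "2010"
language: en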

Source attribution: For every key finding you provide to the LLM, include the paper it came from. At minimum, include the paper_title. Add authors and language when the source’s credibility or linguistic context matters. Include research_focus and methods when the reader needs to understand how a finding was produced; for example, a behavioral observation from a controlled lab experiment carries different weight than an anecdotal field note from 1863. This context helps the LLM reason about the reliability and applicability of each fact.

A practical tool call or script pipeline looks like this (a code sketch follows the list):

  1. Receive a species name (e.g., “Myrmica rubra”)
  2. Look up all key findings for that species in the dataset, using contains/substring matching rather than exact equality. Species keys in the dataset often include author and year (e.g., "Myrmica sabuleti Meinert 1861" rather than "Myrmica sabuleti"), so a lookup that only matches on exact name will miss entries. Search for the species name as a substring of the key to capture all variants.
  3. Optionally filter or rank findings by topic relevance
  4. For each finding, attach its source metadata (title, authors, language, research focus, methods)
  5. Encode the result as TOON
  6. Pass it to the LLM as grounded context for the task at hand
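A minimal end-to-end sketch of that pipeline, assuming the dataset has been loaded as a list of paper objects shaped like the JSON example above (the file name is illustrative, and the TOON step is stubbed out since any TOON encoder can be dropped in):

import json

def lookup_species(dataset: list[dict], species_query: str) -> list[dict]:
    """Steps 1-4: substring-match species keys and attach source metadata."""
    results = []
    query = species_query.lower()
    for paper in dataset:
        for species_key, findings in paper.get("species", {}).items():
            # Substring match: keys like "Myrmica sabuleti Meinert 1861"
            # must still match the query "Myrmica sabuleti".
            if query in species_key.lower():
                results.append({
                    "species": species_key,
                    "findings": findings,
                    "paper_title": paper.get("paper_title"),
                    "authors": paper.get("authors"),
                    "language": paper.get("language"),
                    "research_focus": paper.get("research_focus"),
                    "methods": paper.get("methods"),
                })
    return results

with open("ant_dataset.json", encoding="utf-8") as f:  # illustrative file name
    dataset = json.load(f)

hits = lookup_species(dataset, "Myrmica rubra")
# Step 5 would encode `hits` with a TOON encoder before step 6
# (passing the result to the LLM); JSON works as a fallback.
context = json.dumps(hits, ensure_ascii=False)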

What This Enables

This dataset opens up a number of possibilities beyond just improving chatbot accuracy:

  • Fact verification: When an LLM makes a claim about an ant species, it can be checked against extracted facts from primary literature
  • Knowledge gap identification: Species with few or no extracted facts immediately stand out as understudied, highlighting where further research is needed (see the sketch after this list)
  • Cross-referencing: Conflicting facts from different papers about the same species become visible, enabling identification of taxonomic debates or behavioral controversies
  • Historical analysis: Track how understanding of a species has evolved across decades of literature
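A sketch of the knowledge-gap analysis mentioned above, reusing the same assumed dataset shape: count extracted facts per species across the corpus and sort ascending, so sparsely covered species surface first.

import json
from collections import Counter

with open("ant_dataset.json", encoding="utf-8") as f:  # illustrative file name
    dataset = json.load(f)

def facts_per_species(dataset: list[dict]) -> Counter:
    """Count extracted facts per species key across the whole corpus."""
    counts: Counter = Counter()
    for paper in dataset:
        for species_key, findings in paper.get("species", {}).items():
            counts[species_key] += len(findings)
    return counts

counts = facts_per_species(dataset)
# Species with the fewest facts are candidates for "understudied".
for species, n in sorted(counts.items(), key=lambda kv: kv[1])[:20]:
    print(f"{n:4d}  {species}")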

Download the Full Dataset

The download link will be available soon.

This dataset is compiled from publicly available scientific literature for research and educational purposes. The following applies:

  • Source material: All facts are extracted from papers indexed in the FORMIS2024 bibliography database and AntCat. The original papers retain their respective copyrights held by the original authors and publishers.
  • Extracted data: The structured extraction of factual claims (species descriptions, behaviors, measurements, etc.) from these papers constitutes a transformative compilation. Facts themselves are not copyrightable in most jurisdictions, including under US and EU law.
  • Attribution: Every fact in this dataset includes its source reference (paper title, authors, year). This ensures full traceability and proper academic attribution to the original researchers.
  • License: The dataset compilation is released under the CC BY 4.0 license. You are free to use, modify, and distribute it, provided you give appropriate credit to AntScout and link back to this page.
  • Disclaimer: While every effort has been made to accurately extract information from the source papers, this dataset may contain errors introduced during the automated parsing process. Users should verify critical information against the original sources. The dataset is provided “as is” without warranty of any kind.
  • Intended use: This dataset is intended for research, educational, and LLM grounding purposes. It is designed to improve the factual accuracy of AI systems when discussing ant species.
