Research Paper

AgriARC: An Adversarial Agricultural Reasoning Corpus

Sumit Pathak

Draft — April 2026

Abstract

We present AgriARC, the first adversarial multimodal benchmark designed to evaluate whether agricultural AI systems can reason about novel farming situations beyond their training distribution. Grounded in real Indian farm data from 200+ farmers across 12 states and 8 agro-climatic zones, AgriARC comprises 500 test cases spanning 6 reasoning categories.

Our evaluation shows that leading agricultural AI systems average 1.47 out of 3.00 on the static track, just under half of the maximum score. Cross-region transfer and scheme stacking are the hardest categories: even the best systems score below 1.2 on them. We also introduce a Digital Twin simulation track (AgriARC Live) that tests sequential decision-making over a full crop season.

Methodology

Adversarial Design Principles

Each test case is designed to be adversarial against current agricultural AI systems. We use a three-stage pipeline: (1) seed case generation from real farmer interaction logs, (2) expert review and difficulty calibration by agronomists with 10+ years of field experience, and (3) adversarial filtering, in which cases solvable by simple pattern matching or keyword lookup are removed or rewritten to be harder. This ensures that the benchmark tests genuine reasoning rather than memorized associations.
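The filtering stage (3) can be illustrated with a minimal sketch. It assumes a case is a dictionary with "question" and "reference" fields, that shallow baselines are plain functions, and that correctness is checked by exact match. All of these names and shapes are hypothetical and are not taken from the AgriARC tooling.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]  # hypothetical shape: {"question": ..., "reference": ...}

# Hypothetical lookup table standing in for a keyword-matching baseline.
KEYWORD_ANSWERS = {"yellow leaves": "apply nitrogen"}


def keyword_baseline(case: Case) -> str:
    """Return a canned answer if a known keyword appears in the question."""
    text = case["question"].lower()
    for keyword, answer in KEYWORD_ANSWERS.items():
        if keyword in text:
            return answer
    return ""


def exact_match(prediction: str, case: Case) -> bool:
    """Toy correctness check; a real pipeline would apply the scoring rubric."""
    return prediction.strip().lower() == case["reference"].strip().lower()


def adversarial_filter(
    cases: List[Case],
    baselines: List[Callable[[Case], str]],
    is_correct: Callable[[str, Case], bool],
) -> Tuple[List[Case], List[Case]]:
    """Split cases into (kept, flagged).

    A case is flagged when any shallow baseline already answers it correctly,
    so it can be removed or rewritten to be harder.
    """
    kept, flagged = [], []
    for case in cases:
        if any(is_correct(baseline(case), case) for baseline in baselines):
            flagged.append(case)
        else:
            kept.append(case)
    return kept, flagged


cases = [
    {"question": "My wheat has yellow leaves, what should I do?",
     "reference": "apply nitrogen"},
    {"question": "Yellow leaves appeared after waterlogging in a field "
                 "with high soil nitrogen.",
     "reference": "improve drainage"},
]
kept, flagged = adversarial_filter(cases, [keyword_baseline], exact_match)
```

In this toy run the first case is flagged because the keyword lookup already solves it, while the second survives because the same keyword leads the baseline to the wrong answer.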

Case Construction Pipeline

Raw data from 200+ farmer interactions (anonymized, with locations generalized to district level) are processed in several stages. Domain experts construct multimodal inputs that combine field photographs, soil test reports, weather records, and market price histories. Each case includes a reference answer and a 4-tier scoring rubric (scores 0 to 3), developed through inter-annotator agreement studies that reached a Cohen's kappa of 0.82.
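For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes the unweighted form on made-up tier labels. Whether the study used the unweighted or a weighted variant is not stated here, so the unweighted form is an assumption, and the labels are illustrative only and do not reproduce the reported 0.82.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given the same tier by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal tier frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy tier labels (0-3) from two hypothetical annotators on ten cases.
annotator_1 = [0, 1, 2, 3, 3, 2, 1, 0, 2, 3]
annotator_2 = [0, 1, 2, 3, 2, 2, 1, 0, 2, 3]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 9 of 10 items (p_o = 0.90) and chance agreement is p_e = 0.26, giving a kappa of about 0.86.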

Hybrid Scoring Framework

Track 1 (the static track) uses hybrid scoring. First, deterministic rubric matching (keyword extraction plus semantic similarity with a threshold of 0.85) assigns a tier. When the confidence of that match falls below 80%, a calibrated LLM judge provides a secondary assessment. All LLM judgments are logged and auditable. In our ablation study, the hybrid approach reaches 94% agreement with a panel of three human agronomist judges, compared with 87% for LLM-only scoring and 78% for keyword-only scoring.
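The control flow of this hybrid scheme can be sketched as follows. The rubric matcher and the LLM judge are passed in as stub callables, since their internals are not specified here. The names hybrid_score, Judgment, and the audit-log fields are hypothetical. The sketch also assumes that the judge's tier replaces the rubric tier when escalation happens, which is one plausible reading of "secondary assessment".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

CONFIDENCE_FLOOR = 0.80  # from the text: below 80% confidence, escalate


@dataclass
class Judgment:
    tier: int           # rubric tier, 0 to 3
    source: str         # "rubric" or "llm_judge"
    confidence: float   # confidence reported by the deterministic matcher


def hybrid_score(
    response: str,
    rubric_matcher: Callable[[str], Tuple[int, float]],
    llm_judge: Callable[[str], int],
    audit_log: List[Dict[str, object]],
) -> Judgment:
    """Score a response deterministically, escalating to a judge if unsure."""
    tier, confidence = rubric_matcher(response)
    if confidence >= CONFIDENCE_FLOOR:
        return Judgment(tier, "rubric", confidence)
    # Low confidence: ask the judge and record the decision for auditing.
    judged_tier = llm_judge(response)
    audit_log.append({
        "response": response,
        "rubric_tier": tier,
        "rubric_confidence": confidence,
        "judge_tier": judged_tier,
    })
    return Judgment(judged_tier, "llm_judge", confidence)


# Usage with stand-in callables (no real matcher or model is invoked).
log: List[Dict[str, object]] = []
confident = hybrid_score("answer A", lambda r: (2, 0.93), lambda r: 1, log)
escalated = hybrid_score("answer B", lambda r: (2, 0.55), lambda r: 1, log)
```

Keeping the matcher and judge as injected callables means the escalation rule can be tested without a model, and every escalated case leaves a log entry containing both the rubric tier and the judge tier.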

Digital Twin Simulation (Track 2)

AgriARC Live places AI agents in a stateful farm simulation spanning approximately 180 days. Agents interact via a REST API with observe/act semantics (max 5 actions per simulated day). The composite score weights yield (30%), profitability (25%), risk management (20%), sustainability (15%), and farmer wellbeing (10%). Four scenario types (baseline, crisis, FPO, adversarial) test different aspects of agricultural decision-making under uncertainty.
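As a worked example of the composite score, the sketch below applies the stated weights to five component scores. It assumes each component has already been normalized to the range 0 to 1. The normalization itself, the dictionary keys, and the function name are hypothetical and not taken from the AgriARC Live API.

```python
from typing import Dict

# Weights as stated in the text; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "yield": 0.30,
    "profitability": 0.25,
    "risk_management": 0.20,
    "sustainability": 0.15,
    "farmer_wellbeing": 0.10,
}


def composite_score(components: Dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up end-of-season component scores for one hypothetical agent.
season = {
    "yield": 0.8,
    "profitability": 0.6,
    "risk_management": 0.5,
    "sustainability": 0.7,
    "farmer_wellbeing": 0.9,
}
score = composite_score(season)
```

For these illustrative inputs the terms are 0.24 + 0.15 + 0.10 + 0.105 + 0.09, for a composite of 0.685.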

Citation

If you use AgriARC in your research, please cite our paper:

@misc{pathak2026agriarc,
  title     = {AgriARC: An Adversarial Agricultural Reasoning Corpus},
  author    = {Pathak, Sumit},
  year      = {2026},
  publisher = {KrishiAI Pvt. Ltd.},
  note      = {Available at https://krishiarc.org}
}