Research · March 5, 2025 · 11 min read

drik-bench: If You Build the Benchmark, You Define What 'Good' Means

Why COCO and KITTI are useless for Indian traffic — and how we built a benchmark suite that measures what actually matters.

Hansraj Patel

The Benchmark Problem

Here is a dirty secret of computer vision: your model is only as good as the benchmark you evaluate it on. And every major benchmark was built for a world that does not look like Indian roads.

COCO has 80 classes. None of them are “auto-rickshaw.” None are “tractor-trolley.” The “car” class does not distinguish between a Maruti Alto and a Toyota Fortuner — a distinction that matters for toll computation, axle-load estimation, and insurance classification.

KITTI measures 3D detection on German highways. MOT17 tracks pedestrians in European shopping districts. BDD100K evaluates on American roads with lane markings and functioning traffic signals.

If you evaluate an Indian traffic system on these benchmarks, you get a number that means nothing. A model scoring 92 mAP on COCO can score 54 mAP on actual Indian CCTV footage. The benchmark told you the model was excellent. The road told you it was not.

We built drik-bench because the benchmarks we needed did not exist. And because whoever builds the benchmark controls what “good” means.

Design Principles

drik-bench was designed around four principles:

1. Measure deployment reality, not lab performance. Test images come from real Indian CCTV cameras — 720p, compressed, noisy, badly lit. Not from dashcams on clear days. Not from curated datasets with ideal viewing angles.

2. Use a taxonomy that matches the real world. 50+ vehicle classes. Fine-grained distinctions where they matter operationally. An auto-rickshaw is not a car. A tractor-trolley is not a truck.

3. Evaluate the full stack, not just detection. Detection is necessary but not sufficient. We benchmark tracking, ANPR, and scene understanding as separate tasks with separate metrics.

4. Include the hard cases. Dense intersections with 200+ objects. Night scenes with headlight glare. Monsoon conditions with 50m visibility. Occluded plates. Wrong-way vehicles. Animals on highways.

The Four Benchmark Categories

Category 1: Detection (drik-bench-det)

Task: Detect and classify all traffic participants in a single frame.

Dataset: 15,000 annotated frames from 120 camera locations across 8 Indian cities. Stratified by:

  • Road type (highway, arterial, residential, village)
  • Time of day (dawn, morning, afternoon, evening, night)
  • Weather (clear, haze, rain, fog)
  • Density (sparse <20 objects, medium 20-80, dense 80+)

Taxonomy: 54 classes organized in a three-level hierarchy:

vehicle/
├── two_wheeler/
│   ├── motorcycle
│   ├── scooter
│   ├── moped
│   ├── electric_scooter
│   └── bicycle
├── three_wheeler/
│   ├── auto_rickshaw
│   ├── e_rickshaw
│   ├── cargo_three_wheeler
│   └── tempo
├── car/
│   ├── hatchback
│   ├── sedan
│   ├── suv
│   ├── muv
│   └── van
├── commercial/
│   ├── mini_truck
│   ├── lcv
│   ├── hcv
│   ├── multi_axle
│   ├── tanker
│   └── container
├── bus/
│   ├── city_bus
│   ├── state_bus
│   ├── school_bus
│   └── minibus
├── agricultural/
│   ├── tractor
│   ├── tractor_trolley
│   └── bullock_cart
└── other/
    ├── handcart
    ├── construction_vehicle
    └── military_vehicle

non_vehicle/
├── pedestrian
├── cyclist
├── animal/
│   ├── cow
│   ├── buffalo
│   ├── dog
│   └── other_animal
└── special/
    ├── band_baarat
    ├── funeral_procession
    └── religious_float
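
For models that predict only leaf classes, hierarchical scoring needs a mapping from each leaf to its parent levels. Below is a minimal sketch of how such a mapping could be encoded and queried; the class names follow the taxonomy above, but the PARENTS dictionary and coarsen helper are illustrative assumptions, not the benchmark's internal representation.

# Illustrative only: a flat leaf -> (level-1, level-2) lookup built from the
# taxonomy above. The real drik-bench class definitions may be packaged differently.
PARENTS = {
    "auto_rickshaw":   ("vehicle", "three_wheeler"),
    "tractor_trolley": ("vehicle", "agricultural"),
    "hatchback":       ("vehicle", "car"),
    "cow":             ("non_vehicle", "animal"),
    # ... remaining leaf classes omitted for brevity
}

def coarsen(label: str, level: int) -> str:
    """Map a leaf label to level 0 (root), level 1 (group), or level 2 (leaf)."""
    root, group = PARENTS[label]
    return (root, group, label)[level]

print(coarsen("auto_rickshaw", 1))  # -> "three_wheeler"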

Metrics:

We report multiple metrics because no single number captures detection quality:

Metric     | What It Measures                           | Why It Matters
-----------|--------------------------------------------|----------------------------------
mAP@50     | Detection at loose IoU                     | Basic "did you find it?"
mAP@75     | Detection at strict IoU                    | Localization precision
mAP@50:95  | Average across IoU thresholds              | Overall detection quality
AP-small   | Detection of objects <32x32 px             | Distant vehicles and pedestrians
AP-rare    | Detection of rare classes (<100 instances) | Long-tail performance
Latency    | Inference time in ms                       | Real-time viability

The AP-rare metric is particularly important. Most models optimize for common classes (cars, motorcycles) and ignore rare ones (bullock carts, military vehicles). But rare classes are often the most operationally important — a tractor on a highway at night is a safety hazard precisely because it is rare.
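
To make AP-rare concrete: given per-class AP values and ground-truth instance counts, one way to compute it is as the mean AP over classes with fewer than 100 instances. A minimal sketch under that assumption, with hypothetical numbers:

def ap_rare(per_class_ap: dict, instance_counts: dict, threshold: int = 100) -> float:
    """Mean AP over classes with fewer than `threshold` ground-truth instances."""
    rare = [c for c, n in instance_counts.items() if n < threshold]
    if not rare:
        return float("nan")
    return sum(per_class_ap[c] for c in rare) / len(rare)

# Example: the rare bucket is dominated by classes like bullock carts and
# military vehicles, so a model that ignores them is penalized directly.
print(ap_rare({"car": 0.82, "bullock_cart": 0.31, "military_vehicle": 0.24},
              {"car": 4200, "bullock_cart": 61, "military_vehicle": 18}))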

Category 2: Tracking (drik-bench-track)

Task: Track all objects across video sequences, maintaining consistent identity through occlusion, camera shake, and dense traffic.

Dataset: 200 video sequences, each 30-120 seconds. 50 sequences are “challenge sequences” — extreme density, long occlusions, abrupt camera motion.

Metrics:

Metric      | What It Measures
------------|----------------------------------------------------------------
HOTA        | Holistic accuracy (balances detection and association)
IDF1        | Identity preservation accuracy
MOTA        | Multi-object tracking accuracy
AssA        | Association accuracy — are the same objects linked correctly?
ID Switches | How often a track's identity changes incorrectly
Frag        | How often a track is interrupted and resumed

We emphasize HOTA and IDF1 over MOTA. MOTA is dominated by detection quality — a perfect detector with a bad tracker scores well on MOTA. HOTA and IDF1 explicitly measure whether the tracker maintains correct identity over time.
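
For reference, IDF1, MOTA, ID switches, and fragmentations can be computed per sequence with the open-source py-motmetrics package (HOTA is typically computed with TrackEval instead). A minimal sketch, assuming ground truth and predictions are already grouped per frame as IDs plus xywh boxes; the toy frame data here is illustrative only:

import motmetrics as mm

# Assumed per-frame inputs: object IDs and boxes in (x, y, w, h) pixel format.
gt_frames   = [{"ids": [1, 2],     "boxes": [[10, 10, 40, 80], [120, 30, 42, 85]]}]
pred_frames = [{"ids": ["a", "b"], "boxes": [[12, 11, 40, 78], [118, 32, 44, 86]]}]

acc = mm.MOTAccumulator(auto_id=True)
for gt, pred in zip(gt_frames, pred_frames):
    # IoU-based cost matrix between ground-truth and predicted boxes
    dists = mm.distances.iou_matrix(gt["boxes"], pred["boxes"], max_iou=0.5)
    acc.update(gt["ids"], pred["ids"], dists)

mh = mm.metrics.create()
summary = mh.compute(
    acc,
    metrics=["idf1", "mota", "num_switches", "num_fragmentations"],
    name="example_sequence",
)
print(summary)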

Our challenge sequences test specific failure modes:

challenge_sequences:
  - name: "dense_intersection_delhi"
    objects_per_frame: 180+
    challenge: "extreme density, overlapping trajectories"

  - name: "bus_occlusion_mumbai"
    max_occlusion_duration: 4.2s
    challenge: "long-duration full occlusion"

  - name: "night_highway_rajasthan"
    visibility: "headlights only"
    challenge: "low visibility, high speed differential"

  - name: "monsoon_kolkata"
    weather: "heavy rain"
    challenge: "rain streaks, spray, reduced visibility"

  - name: "festival_crowd_varanasi"
    pedestrian_density: 300+
    challenge: "mixed pedestrian-vehicle, ceremonial objects"

Category 3: ANPR (drik-bench-anpr)

Task: Read license plates from CCTV footage.

This is not the standard ANPR benchmark of high-resolution, well-lit, head-on plate images. This is Indian CCTV ANPR: compressed, angled, partially occluded, dirty plates, non-standard fonts.

Dataset: 25,000 plate instances from real CCTV footage:

  • Resolution range: 20x10 px to 200x60 px
  • Angle: 0-45 degrees off-axis
  • Conditions: day, night, rain, fog, motion blur
  • Plate types: private (white), commercial (yellow), electric (green), diplomatic, temporary, dealer

Indian plate format:

Standard:    XX 00 XX 0000
             State | RTO | Series | Number
Examples:    GJ 01 AB 1234
             MH 12 DE 5678

Variations:
- Single-line vs two-line format
- Hindi/regional script alongside English
- Old format (pre-2019) vs new format (IND hologram)
- Hand-painted plates (common on commercial vehicles)
- Damaged, dirty, or partially obscured plates
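
As a rough illustration of the standard format only (not the old, hand-painted, or other variants listed above), a plate string can be validated and split with a simple pattern. The regex and helper below are assumptions for illustration, not the benchmark's canonical parser:

import re

# XX 00 XX 0000: state code, RTO code, series letters, 4-digit number.
# Spacing and hyphens vary on real plates, so they are normalized first.
PLATE_RE = re.compile(r"^([A-Z]{2})\s?(\d{1,2})\s?([A-Z]{1,3})\s?(\d{4})$")

def parse_plate(text: str):
    m = PLATE_RE.match(re.sub(r"[-\s]+", " ", text.strip().upper()))
    if m is None:
        return None
    state, rto, series, number = m.groups()
    return {"state": state, "rto": rto, "series": series, "number": number}

print(parse_plate("GJ 01 AB 1234"))   # {'state': 'GJ', 'rto': '01', ...}
print(parse_plate("not a plate"))     # None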

Metrics:

Metric                 | Description
-----------------------|---------------------------------------------
Plate Detection Rate   | % of plates successfully localized
Character Accuracy     | % of characters correctly read
Full Plate Accuracy    | % of plates with all characters correct
Accuracy by Resolution | Performance stratified by plate pixel width
Accuracy by Condition  | Performance stratified by weather/lighting

We provide separate scores for resolution brackets: <40px width (very hard), 40-80px (hard), 80-120px (moderate), >120px (easy). Most commercial ANPR systems only report accuracy on the “easy” bracket. We report all four.
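
A minimal sketch of bracketing plate crops by pixel width and aggregating full-plate accuracy per bracket; the boundaries follow the list above, while the function names and input shape are illustrative assumptions:

from collections import defaultdict

def resolution_bracket(plate_width_px: int) -> str:
    """Assign a plate crop to a reporting bracket by pixel width."""
    if plate_width_px < 40:
        return "very_hard"   # <40 px
    if plate_width_px < 80:
        return "hard"        # 40-80 px
    if plate_width_px < 120:
        return "moderate"    # 80-120 px
    return "easy"            # >120 px

def accuracy_by_bracket(results):
    """results: iterable of (plate_width_px, is_fully_correct) pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for width, ok in results:
        b = resolution_bracket(width)
        totals[b] += 1
        correct[b] += int(ok)
    return {b: correct[b] / totals[b] for b in totals}

print(accuracy_by_bracket([(35, False), (95, True), (150, True)]))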

Category 4: Scene Understanding (drik-bench-scene)

Task: Given a video clip, generate a structured description of what is happening.

This is the most experimental category. We are not aware of any existing benchmark that evaluates traffic scene understanding on Indian roads.

Dataset: 500 video clips (10-30 seconds each) with human-written structured annotations:

{
    "clip_id": "scene_0142",
    "duration_s": 15,
    "road_type": "urban_arterial",
    "traffic_state": "congested",
    "events": [
        {
            "type": "red_light_violation",
            "time_range": [3.2, 5.1],
            "vehicle": "auto_rickshaw",
            "description": "Auto-rickshaw runs red light while weaving between stopped vehicles"
        },
        {
            "type": "wrong_way_driving",
            "time_range": [8.0, 12.5],
            "vehicle": "motorcycle",
            "description": "Motorcycle travels against traffic on service road to avoid congestion"
        }
    ],
    "scene_description": "Congested four-lane arterial road during evening rush hour. Signal is red for the main carriageway. Vehicles are stopped in queue except for an auto-rickshaw that navigates through gaps to cross the intersection. A motorcycle uses the wrong side of the service road to bypass the queue.",
    "traffic_density": "high",
    "compliance_rate": 0.78
}

Metrics:

  • Event Detection F1: Did the system identify the correct events?
  • Event Timing IoU: Did it identify when the events occurred?
  • Scene Description BLEU/ROUGE: Does the generated description match the reference?
  • Factual Accuracy: Are the stated facts (vehicle types, actions, states) correct?

Scene understanding is where reasoning, not detection, becomes the differentiator. A detector sees objects. A scene understanding system sees situations. The difference is the difference between “there is a motorcycle” and “there is a motorcycle traveling against traffic to bypass congestion at a red light.”
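
To make Event Detection F1 and Event Timing IoU concrete, here is a minimal sketch. It greedily matches predicted events to reference events of the same type when their time ranges overlap enough; the 0.5 threshold and greedy matching are illustrative assumptions, not the benchmark's exact protocol.

def temporal_iou(a, b):
    """IoU of two [start, end] time ranges in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def event_f1(predicted, reference, iou_threshold=0.5):
    """Events are dicts with 'type' and 'time_range'; greedy one-to-one matching."""
    unmatched = list(reference)
    tp = 0
    for p in predicted:
        for r in unmatched:
            if p["type"] == r["type"] and temporal_iou(p["time_range"], r["time_range"]) >= iou_threshold:
                tp += 1
                unmatched.remove(r)
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

ref  = [{"type": "red_light_violation", "time_range": [3.2, 5.1]}]
pred = [{"type": "red_light_violation", "time_range": [3.0, 5.4]}]
print(event_f1(pred, ref))  # 1.0: the single event is matched within the threshold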

Initial Results

We evaluated several models on drik-bench-det to establish baselines:

Model                      | mAP@50 | mAP@75 | AP-small | AP-rare | Latency (A2)
---------------------------|--------|--------|----------|---------|-------------
YOLOv8-L (COCO pretrained) | 41.2   | 28.7   | 18.3     | 12.4    | 22 ms
YOLOv8-L (fine-tuned)      | 62.8   | 49.1   | 34.2     | 29.7    | 22 ms
YOLOv8-X (fine-tuned)      | 66.4   | 52.3   | 37.8     | 33.1    | 38 ms
drik-detect v3             | 71.2   | 58.6   | 42.1     | 48.3    | 25 ms
Commercial System A        | 58.3   | 41.2   | 22.7     | 15.8    | 45 ms
Commercial System B        | 55.1   | 38.9   | 19.4     | 11.2    | 67 ms

Key observations:

COCO pretraining is a poor starting point. YOLOv8-L drops from 53.9 mAP on COCO to 41.2 mAP on drik-bench. The taxonomy mismatch and domain gap are severe.

Fine-tuning helps but has limits. Fine-tuning on Indian data recovers 20+ mAP points, but AP-rare remains low. Rare classes do not have enough training examples in real data alone.

Synthetic data is the rare-class equalizer. drik-detect v3, trained with DrikSynth synthetic data, achieves 48.3 AP-rare — 15 points above the next best. Synthetic data provides unlimited examples of rare classes.

Commercial systems struggle. Both commercial systems were marketed for Indian traffic. Neither achieves 60 mAP on our benchmark. Their strength is in common classes (cars, motorcycles); they fail on the long tail.

Speed matters. Commercial System B achieves lower accuracy at nearly 3x the latency. Architecture efficiency is not optional for real-time deployment.

The Scoring Philosophy

We deliberately chose metrics and evaluation protocols that reflect deployment reality:

No “easy” test sets. Every benchmark includes hard examples. We do not provide a “clean” subset for reporting flattering numbers.

Latency is a metric. A model that achieves 75 mAP at 200ms is worse than a model that achieves 70 mAP at 25ms for real-time applications. We report accuracy-latency tradeoffs explicitly.

Rare classes are weighted. AP-rare receives equal emphasis to AP-common in our overall scoring. A system that ignores 20% of the taxonomy is not production-ready, regardless of its average mAP.

Full-stack evaluation. drik-bench is not four independent benchmarks. The leaderboard includes a composite score that weights detection, tracking, ANPR, and scene understanding. A system must be competent across all four to rank well.
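
For illustration of how a composite leaderboard score can combine the four categories, here is a sketch with placeholder equal weights; the actual drik-bench weighting is defined by the leaderboard and is not reproduced here.

# Placeholder weights for illustration only; the real leaderboard weights differ.
WEIGHTS = {"det": 0.25, "track": 0.25, "anpr": 0.25, "scene": 0.25}

def composite_score(category_scores: dict) -> float:
    """Weighted mean over normalized (0-1) per-category scores; missing = 0."""
    return sum(WEIGHTS[c] * category_scores.get(c, 0.0) for c in WEIGHTS)

print(composite_score({"det": 0.71, "track": 0.62, "anpr": 0.58, "scene": 0.41}))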

How to Use drik-bench

The benchmark suite is packaged as a Python library with a CLI:

# Install
pip install drik-bench

# Download benchmark data
drik-bench download --category det --split test

# Evaluate a model
drik-bench evaluate \
    --category det \
    --predictions predictions.json \
    --format coco

# Generate detailed report
drik-bench report \
    --results results.json \
    --output report.html \
    --compare baseline_yolov8l.json

Predictions can be submitted in COCO, YOLO, or MOT format. The evaluation script handles format conversion internally.
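
If your detections are already in memory, writing them out in the standard COCO detection results format is the simplest path. A minimal sketch, assuming your own category IDs map onto the drik-bench taxonomy (the IDs and boxes below are hypothetical):

import json

# Standard COCO detection results format: one record per detected box,
# with bbox given as [x, y, width, height] in pixels.
predictions = [
    {"image_id": 142, "category_id": 7, "bbox": [312.0, 180.5, 64.0, 48.0], "score": 0.91},
    {"image_id": 142, "category_id": 3, "bbox": [88.0, 210.0, 140.0, 96.0], "score": 0.77},
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Then: drik-bench evaluate --category det --predictions predictions.json --format coco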

Contributing

drik-bench is open source and we actively seek contributions:

  • Data donations. If you have annotated Indian traffic data that you can share, we will integrate it into the benchmark (with attribution).
  • New challenge sequences. Traffic conditions vary enormously across India. We need sequences from the Northeast, from hill stations, from coastal cities.
  • Evaluation metrics. If you have domain-specific metrics that we should include, open an issue.
  • Baseline models. Run your model on drik-bench and submit results. We will add it to the leaderboard.

The benchmark improves as the community contributes. And a better benchmark means better models for everyone.

Why This Matters

Benchmarks shape research. What gets measured gets optimized. For a decade, the computer vision community optimized for COCO — and produced models that excel on 80 classes of Western objects in well-lit photographs.

Indian traffic needs different optimization targets. Fine-grained classification. Dense tracking. Degraded-input robustness. Rare event handling. If we do not build benchmarks that measure these, nobody will build models that handle them.

drik-bench is our contribution to shifting the optimization target. We are saying: this is what “good” looks like for Indian traffic AI. Hit these numbers and you have a system that works on real roads with real cameras in real conditions.

If it works here, it works everywhere.

Check out drik-bench on GitHub and run your models against it. The leaderboard is open.