Architecture Documentation

The world has a billion cameras. None of them can reason.

Drik processes 200+ camera feeds simultaneously on on-premises GPU infrastructure, generating legally admissible violation evidence and city-scale traffic intelligence in real time.

200+ cameras
< 17ms latency
FP4/INT8 inference
On-Premises deployment

Reasoning Progress

Five Levels of Visual Reasoning

We measure our AI across five levels of visual understanding — from basic detection to predictive reasoning. Here's where we stand today.

L1 Detect
L2 Recognize
L3 Describe
L4 Reason
L5 Predict
Level 3 achieved — Level 4 in progress
1 LIVE

Detect — It sees every object

Every vehicle, every person, every object — detected, classified, and counted in real time across every frame.

2 LIVE

Recognize — It reads every plate, every face

License plates extracted. Vehicle makes and models identified. Colors, modifications, damage — all catalogued instantly.

3 LIVE

Describe — It describes what it sees in words

Not just labels. Full natural-language descriptions of every scene, every event, every anomaly. Searchable. Queryable.

4 IN PROGRESS

Reason — It understands cause and effect

Why did traffic stop? What caused the accident? Which vehicle triggered the chain reaction? The AI builds causal chains.

5

Predict — It anticipates what happens next

Pattern recognition across time and space. Congestion forecasts. Accident risk zones. Before it happens, the system sees it forming.

Product Universe

The Drik Ecosystem

All products are named after Sanskrit stars and cosmic concepts. Dhruva, the Pole Star, powers everything.

ध्रुव

Dhruva

Core Engine LIVE

Pole Star — always present, always on

The foundational inference engine powering all products. SSM-based continuous video reasoning across 200+ camera feeds simultaneously.

अश्विनी

Ashvinī

Traffic Intelligence LIVE

Twin Horsemen of dawn — swift justice

Traffic violation detection, ANPR, challan generation, and enforcement intelligence built on top of Dhruva.

चित्रा

Chitrā

Annotation Tool LIVE

Brightest star Spica — precision in labeling

AI-assisted video annotation with active learning propagation. Open-sourced as DrikLabel.

माया

Māyā

Synthetic Data IN DEV

Cosmic illusion — creates training worlds

Unreal Engine pipeline generating photorealistic synthetic training data with perfect ground truth labels.

अग्नि

Agni

Edge Deploy LIVE

Sacred fire — forges edge engines

Edge deployment kit with TensorRT optimization, FP4/INT8 quantization, and OTA model updates.

Future Horizons

Rohiṇī (Smart City), Vishākhā (Manufacturing), Jyeshthā (Security), Brahmāṇḍa (World Model), Ākāsha (Cloud API)

System Architecture

L1 Container Architecture

Six deployable containers running on-premises — from RTSP ingest to challan delivery, all on a single GPU server rack.

200+ Cameras: RTSP / ONVIF
Frame Sampler: 10 FPS, adaptive
CUDA Preprocess: GPU kernels
Dhruva Inference Cluster: Detection (YOLOv8+), Tracking (ByteTrack), Vehicle ReID; batch B=32, FP4; RTX PRO 5000 × 4
Event Engine: Violation Rules, Evidence Packager, Challan Generator
ANPR (DrikNetra): Multi-Frame Super-Res, OCR + Vahan Lookup
Dashboard: Next.js + React
API Gateway: REST / gRPC / WS
Alerts: Push / MQTT / SMS
Data stores: PostgreSQL + Redis, NAS 100TB Archive, Prometheus + Grafana

The L1 architecture consists of six deployable containers: Video Ingest Gateway (200+ RTSP cameras, frame sampler at 10 FPS, CUDA preprocessing), Dhruva Inference Cluster (YOLOv8+ detection, ByteTrack tracking, Vehicle ReID at batch size 32 with FP4 on 4x RTX PRO 5000), Event Engine (violation rules, evidence packager, challan generator), DrikNetra ANPR (multi-frame super-resolution, OCR, Vahan lookup), and output services (Dashboard, API Gateway, Alerts). Data stores include PostgreSQL with Redis, 100TB NAS Archive, and Prometheus with Grafana.

Ingest → Inference → Events + ANPR → Delivery

Under the Hood

Low-Level Architecture

Seven core modules power the Drik pipeline — from raw video ingestion to predictive intelligence.

01

Ingestion Layer

LIVE
200 RTSP streams: H.264 / H.265
RTSP Pool: FFmpeg + NVDEC, 200 threads, hardware decode
Frame Sampler: 10 FPS, adaptive 2-15 FPS, motion-aware
CUDA Preprocess: fused GPU kernel (BGR→RGB, resize, normalize)
Data flow: raw H.265 → 1080p uint8 → sampled → [B, 3, 640, 640] fp16
RTSP Pool, CUDA Decode, Adaptive Sampler
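
The preprocess step above (BGR→RGB, resize, normalize, output [B, 3, 640, 640] in fp16) can be illustrated with a CPU reference in NumPy. This is only a sketch of the data transformation, not the fused GPU kernel itself. The function name `preprocess_batch`, the nearest-neighbour resize without letterboxing, and the scaling to [0, 1] are assumptions made for illustration.

```python
import numpy as np

def preprocess_batch(frames_bgr, size=640):
    """CPU reference: list of HxWx3 uint8 BGR frames -> [B, 3, size, size] float16."""
    out = np.empty((len(frames_bgr), 3, size, size), dtype=np.float16)
    for i, frame in enumerate(frames_bgr):
        h, w = frame.shape[:2]
        # Nearest-neighbour resize via index lookup (assumption: no letterboxing).
        rows = np.arange(size) * h // size
        cols = np.arange(size) * w // size
        resized = frame[rows[:, None], cols[None, :]]  # (size, size, 3)
        rgb = resized[..., ::-1]                       # BGR -> RGB
        chw = rgb.transpose(2, 0, 1)                   # HWC -> CHW
        out[i] = chw.astype(np.float32) / 255.0        # scale to [0, 1] (assumption)
    return out

# Two synthetic 1080p frames standing in for decoded camera output.
frames = [np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8) for _ in range(2)]
batch = preprocess_batch(frames)
print(batch.shape, batch.dtype)
```

A fused kernel would do the same three operations in a single pass over GPU memory, which avoids the intermediate arrays this reference creates.
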
02

Detection Engine

LIVE
Input batch: [B, 3, 640, 640]
CSPDarknet: Feature Pyramid, TensorRT FP4/INT8; P3 80×80, P4 40×40, P5 20×20
FPN/PAN neck: multi-scale fusion
Detection head: 8400 anchors, ~3ms @ FP4
Output: [B, 8400, 85]
YOLOv8+, TensorRT FP4, Feature Pyramid
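
The 8400 anchor figure follows from the three feature-map sizes listed for a 640×640 input. A short check, assuming the usual strides of 8, 16, and 32 for P3, P4, and P5 (the strides are not stated on this page, but they reproduce the 80×80, 40×40, and 20×20 grids):

```python
def anchor_count(input_size=640, strides=(8, 16, 32)):
    """Return the per-level grid sizes and the total number of grid cells."""
    grids = [input_size // s for s in strides]
    return grids, sum(g * g for g in grids)

grids, total = anchor_count()
print(grids, total)  # [80, 40, 20] 8400
```
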
03

Multi-Object Tracking

LIVE
Raw predictions: [B, 8400, 85]
NMS: conf > 0.25, IoU = 0.45, ~0.5ms, GPU-accelerated
Detections: [N_det, 6] per image
ByteTrack: 1. high-confidence match (IoU), 2. low-confidence match; 30-frame track lifetime
Tracks: [N, 7] per camera (bbox + conf + cls + track_id)
ByteTrack, Kalman Filter, Re-ID
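
The NMS step with the thresholds quoted above (conf > 0.25, IoU = 0.45) can be sketched in NumPy. This is a plain CPU version for illustration, not the GPU-accelerated implementation. The class-aware suppression and the [x1, y1, x2, y2, conf, cls] row layout are assumptions.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(dets, conf_thres=0.25, iou_thres=0.45):
    """dets: (N, 6) rows of [x1, y1, x2, y2, conf, cls]. Returns the kept rows."""
    dets = dets[dets[:, 4] > conf_thres]      # drop low-confidence boxes
    dets = dets[np.argsort(-dets[:, 4])]      # highest confidence first
    keep = []
    while len(dets):
        best, rest = dets[0], dets[1:]
        keep.append(best)
        overlap = iou_one_to_many(best[:4], rest[:, :4])
        same_cls = rest[:, 5] == best[5]
        # Suppress same-class boxes that overlap the kept box too much.
        dets = rest[~(same_cls & (overlap > iou_thres))]
    return np.array(keep).reshape(-1, 6)

dets = np.array([
    [100, 100, 200, 200, 0.90, 2],
    [105, 105, 205, 205, 0.80, 2],   # heavy overlap with the first box
    [400, 400, 500, 500, 0.60, 2],
    [100, 100, 200, 200, 0.10, 2],   # below the confidence threshold
])
kept = nms(dets)
print(kept.shape)  # (2, 6)
```
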
04

License Plate Recognition

LIVE
Vehicle crop
Plate detect: YOLOv8-nano, [1, 3, 416, 416]
Multi-frame super-resolution: 5 crops → homography align → sharpness fuse → ESRGAN 4×; [1, 3, 64, 200] → [1, 3, 256, 800]
Input: 5 frames of same plate (blur, ok, sharp, tilt, blur)
OCR: CRNN + CTC, e.g. "MH 02 AB 1234"
Vahan lookup: owner, make/model, insurance, registration
DrikNetra, Super-Res, PaddleOCR, Vahan API
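
The CTC half of the "CRNN + CTC" OCR step can be illustrated with greedy decoding: take the best class per timestep, collapse repeats, then drop blanks. This is a minimal sketch, not DrikNetra's decoder. The character set, the blank index, and the synthetic one-hot scores are all assumptions made for the example.

```python
import numpy as np

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # hypothetical charset
BLANK = len(ALPHABET)  # assumption: the blank symbol is the last class index

def ctc_greedy_decode(logits):
    """logits: (T, C) per-timestep scores. Collapse repeats, then drop blanks."""
    best = logits.argmax(axis=1)
    chars, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

def one_hot_path(indices):
    """Build synthetic (T, C) scores whose argmax follows the given class path."""
    logits = np.zeros((len(indices), len(ALPHABET) + 1))
    logits[np.arange(len(indices)), indices] = 1.0
    return logits

# Each character held for two timesteps, followed by one blank.
path = []
for ch in "MH02AB1234":
    path += [ALPHABET.index(ch)] * 2 + [BLANK]
decoded = ctc_greedy_decode(one_hot_path(path))
print(decoded)  # MH02AB1234
```

The blank is what lets CTC emit a doubled character: "A, blank, A" decodes to "AA", while "A, A" collapses to a single "A".
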
05

Violation Rule Engine

LIVE
Inputs: tracks + scene (zones, signals, lanes; per-camera calibration)
Rule Engine: pluggable ViolationRule.evaluate(); track × zone × signal → violation?
Rules: Red Light, Wrong Way, Speed, Lane, No Helmet, Triple Riding, Parking, Pedestrian
Rule Engine, Zone Geometry, Signal State
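
One way the pluggable `ViolationRule.evaluate()` interface (track × zone × signal → violation?) could look is sketched below. Only the `ViolationRule.evaluate()` name comes from this page. The `Track` and `Zone` types, the `RedLightRule` class, the string signal states, and the ray-casting zone test are hypothetical and exist only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class Track:          # hypothetical track record
    track_id: int
    cls: str
    path: list        # [(x, y), ...] centre points, oldest first

@dataclass
class Zone:           # hypothetical calibrated zone
    name: str
    polygon: list     # [(x, y), ...] vertices in image coordinates

def point_in_polygon(pt, polygon):
    """Ray-casting test: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

class ViolationRule(Protocol):
    def evaluate(self, track: Track, zone: Zone, signal: str) -> Optional[str]: ...

class RedLightRule:
    """Fires when a track enters the zone while the signal is red."""
    def evaluate(self, track, zone, signal):
        if signal != "red" or len(track.path) < 2:
            return None
        was_in = point_in_polygon(track.path[-2], zone.polygon)
        is_in = point_in_polygon(track.path[-1], zone.polygon)
        return "red_light" if (is_in and not was_in) else None

zone = Zone("junction_box", [(0, 0), (10, 0), (10, 10), (0, 10)])
track = Track(7, "car", [(5, -2), (5, 3)])   # moves from outside to inside
print(RedLightRule().evaluate(track, zone, "red"))    # red_light
print(RedLightRule().evaluate(track, zone, "green"))  # None
```

Because each rule only needs an `evaluate` method, new violation types can be added as separate classes without touching the engine loop.
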
06

Evidence Pipeline

IN DEV
Violation: detected event
ANPR lookup: plate → Vahan DB owner details
Vehicle ReID: [512] embedding, L2-normalized; color, make/model
Evidence package: frames + video clip + metadata + plate
PDF challan: legally admissible evidence chain
Evidence Packager, Challan Gen, Vehicle ReID
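
The 512-dimensional, L2-normalized ReID embedding lends itself to cosine-similarity matching, since the dot product of two unit vectors is their cosine similarity. The sketch below shows that matching step on random data. The function names and the gallery setup are hypothetical, and the page does not say how embeddings are compared in the actual pipeline.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def best_match(query, gallery):
    """Return (index, cosine similarity) of the gallery row closest to query."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 512))                # 100 stored vehicle embeddings
query = gallery[42] + 0.05 * rng.normal(size=512)    # noisy re-observation of vehicle 42
idx, sim = best_match(query, gallery)
print(idx)
```
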
07

CVWM (Vision World Model)

PLANNED
Temporal buffer: [B, T=8, 3, 384, 384], ~1 sec @ 7.5 FPS
FastViT-SA12 video encoder: ~12M params
Compress: 576 → 64 tokens (9× spatial reduction)
Token sequence: [B, 512, 384] spatiotemporal tokens
Mamba-2 SSM: 12 layers, d=384, ~45M params, O(n), constant memory
FiLM (γ, β): instruction conditioning
Heads: A: Event [B, 9]; B: JSON out; C: Predict [B, 4, 384]
~63M trainable params total
Mamba-2 SSM, FastViT Encoder, ~63M params
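
The token shapes above are consistent with each other: 8 frames × 64 compressed tokens per frame gives the 512-token sequence, and 576 / 64 is the quoted 9× reduction. FiLM conditioning itself is a per-channel scale and shift, y = γ·x + β. The sketch below shows both in NumPy. Where FiLM is applied inside the model, and how γ and β are produced from the instruction, are not described here, so the constant values are placeholders.

```python
import numpy as np

def film(tokens, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift.
    tokens: [B, N, D]; gamma, beta: [B, D]."""
    return gamma[:, None, :] * tokens + beta[:, None, :]

B, T, D = 2, 8, 384
tokens_per_frame = 64                 # after the 576 -> 64 compression
tokens = np.ones((B, T * tokens_per_frame, D), dtype=np.float32)
gamma = np.full((B, D), 2.0, dtype=np.float32)   # placeholder conditioning values
beta = np.full((B, D), 0.5, dtype=np.float32)
out = film(tokens, gamma, beta)
print(out.shape)   # (2, 512, 384)
print(576 // 64)   # 9
```
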

Performance

330x Faster Than Naive

Five layers of optimization turn a 50ms-per-frame baseline into 0.15ms amortized — enough to handle 200+ cameras on a single GPU server.

2.5x frame reduction

Adaptive Sampling

200 cameras at 25 FPS = 5,000 frames/sec reduced to 2,000 via 10 FPS sampling. Motion-adaptive: static scenes drop to 2 FPS.
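
The frame-rate arithmetic and the motion-adaptive idea can be sketched as follows. The 2 to 15 FPS range matches the adaptive sampler described in the Ingestion Layer. The mean-absolute-difference motion score, the linear mapping, and the `full_motion` threshold are assumptions chosen for illustration, not the sampler's actual policy.

```python
import numpy as np

def target_fps(prev_gray, curr_gray, min_fps=2, max_fps=15, full_motion=0.10):
    """Map mean absolute frame difference (0..1) linearly onto [min_fps, max_fps].
    full_motion is a hypothetical score at which the sampler reaches max_fps."""
    diff = np.abs(curr_gray.astype(np.float32) - prev_gray.astype(np.float32)) / 255.0
    motion = min(float(diff.mean()) / full_motion, 1.0)
    return min_fps + motion * (max_fps - min_fps)

cameras, native_fps, sampled_fps = 200, 25, 10
print(cameras * native_fps, cameras * sampled_fps, native_fps / sampled_fps)
# 5000 2000 2.5

static = np.zeros((360, 640), dtype=np.uint8)
busy = np.full((360, 640), 255, dtype=np.uint8)
print(target_fps(static, static))  # 2.0
print(target_fps(static, busy))    # 15.0
```
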

16x throughput gain

Dynamic Batching

Collate 32 frames from different cameras into a single GPU batch. GPU utilization jumps from ~20% to saturated.
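
Cross-camera batching can be sketched as draining a shared queue of frames into groups of 32. The queue layout and the `collate_batches` name are hypothetical. A production batcher would likely also flush a partial batch after a short timeout so that latency stays bounded when few frames arrive.

```python
from collections import deque

def collate_batches(frame_queue, batch_size=32):
    """Drain a queue of (camera_id, frame) pairs into cross-camera batches."""
    while frame_queue:
        n = min(batch_size, len(frame_queue))
        yield [frame_queue.popleft() for _ in range(n)]

# 70 pending frames from different cameras (strings stand in for image tensors).
pending = deque((f"CAM-{i % 200 + 1:03d}", f"frame-{i}") for i in range(70))
sizes = [len(batch) for batch in collate_batches(pending)]
print(sizes)  # [32, 32, 6]
```
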

8x speed improvement

FP4 Quantization

Native FP4 Tensor Core support on RTX PRO 5000. Mixed-precision: FP4 backbone + FP16 heads. Less than 2% accuracy loss.
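
To give a feel for how coarse 4-bit floating point is, the sketch below simulates quantize-dequantize with the E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. It is a CPU simulation only. The single per-tensor scale is a simplifying assumption, and this page does not say which FP4 variant or scaling scheme the deployed engines use.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w):
    """Quantize-dequantize with one per-tensor scale (assumption: real kernels
    typically use finer-grained block scales)."""
    scale = np.abs(w).max() / E2M1_GRID[-1]
    if scale == 0:
        return w.copy()
    mag = np.abs(w) / scale
    # Snap each magnitude to the nearest representable grid value.
    nearest = E2M1_GRID[np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)]
    return np.sign(w) * nearest * scale

w = np.array([-6.0, -1.2, 0.0, 0.3, 2.4, 5.1])
q = fake_quant_fp4(w)
print(q)  # values: -6, -1, 0, 0.5, 2, 6
```

Keeping the detection heads in FP16, as the mixed-precision setup does, limits how much of this rounding error reaches the final box coordinates and scores.
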

2-5x additional speedup

TensorRT Compilation

Layer fusion (Conv+BN+ReLU), kernel auto-tuning, memory planning, and multi-stream execution.

33% latency reduction

Pipeline Parallelism

Three-stage pipeline on separate CUDA streams: preprocess, infer, and postprocess overlap in time.
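
The overlap idea can be sketched with three worker threads connected by queues, so that batch N+1 is being preprocessed while batch N is in inference. Python threads stand in for CUDA streams here, and the lambda stage functions are placeholders. This illustrates the structure only and says nothing about the measured latency gain.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run fn on each item from inbox until the None sentinel arrives."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)  # propagate shutdown downstream

def run_pipeline(batches, preprocess, infer, postprocess):
    q_in, q_out = queue.Queue(), queue.Queue()
    # Small bounded queues between stages provide back-pressure.
    q_a, q_b = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    links = [(preprocess, q_in, q_a), (infer, q_a, q_b), (postprocess, q_b, q_out)]
    threads = [threading.Thread(target=stage, args=link) for link in links]
    for t in threads:
        t.start()
    for b in batches:
        q_in.put(b)
    q_in.put(None)
    for t in threads:
        t.join()
    results = []
    while (r := q_out.get()) is not None:
        results.append(r)
    return results

results = run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2, lambda x: f"out-{x}")
print(results)  # ['out-2', 'out-4', 'out-6', 'out-8', 'out-10']
```

With one worker per stage and FIFO queues, results come out in input order, which matters when per-camera frame ordering feeds the tracker.
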

End-to-End Latency Budget

Full path with ANPR + ReID — per violation event

Decode: 1ms
Preprocess: 0.5ms
Backbone: 3ms
NMS: 0.5ms
Tracking: 0.3ms
Rules: 0.2ms
ANPR: 8ms
ReID: 3ms
Output: 0.5ms

Total (full path): ~17ms
Fast path (no violation): ~5.5ms
Per-frame amortized (B=32): ~0.17ms

Naive: ~50ms/frame, 1 camera
Optimized: ~0.15ms/frame, 200+ cameras
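
The budget figures can be checked with a few lines of arithmetic. The nine stages sum to the ~17ms full path. Reading the fast path as the first six stages (Decode through Rules) is an assumption, but it reproduces the ~5.5ms figure, and dividing that by a batch of 32 gives the ~0.17ms amortized value. The 330x headline corresponds to 50ms divided by 0.15ms.

```python
budget_ms = {
    "Decode": 1.0, "Preprocess": 0.5, "Backbone": 3.0, "NMS": 0.5,
    "Tracking": 0.3, "Rules": 0.2, "ANPR": 8.0, "ReID": 3.0, "Output": 0.5,
}
# Assumption: the fast path skips ANPR, ReID, and Output.
fast_stages = ["Decode", "Preprocess", "Backbone", "NMS", "Tracking", "Rules"]

full_path = sum(budget_ms.values())
fast_path = sum(budget_ms[s] for s in fast_stages)
amortized = fast_path / 32          # per frame at batch size 32
speedup = 50 / 0.15                 # naive vs optimized per-frame time

print(round(full_path, 2), round(fast_path, 2), round(amortized, 2), round(speedup))
# 17.0 5.5 0.17 333
```
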

How It Works

Your cameras. Our intelligence. Any scale.

Deploy Drik on-premises, in the cloud, or both. Our nodes scale with your camera network, from a single building to an entire city.

[Diagram: camera network infrastructure, CAM-001 through CAM-012 connected to a Drik node]
< 17ms Full Latency
4x RTX PRO 5000 GPU Cluster
200+ Cameras
On-Premises Deployment

Stack

Built With

From GPU kernels to dashboards — the full production and training stack.

ML & Training

PyTorch 2.x, PyTorch FSDP, Weights & Biases, DVC, Unreal Engine 5

Inference

TensorRT, CUDA C++, ONNX Runtime, FP4 / INT8, NVDEC

Video & Vision

FFmpeg, GStreamer, OpenCV, PaddleOCR, ByteTrack

Backend

Python 3.11+, FastAPI, Redis Streams, gRPC, MQTT

Frontend

Next.js, React, React Native, Tailwind CSS

Infrastructure

Docker, PostgreSQL, Redis, Prometheus, Grafana, Nginx

Hardware

RTX PRO 5000, AMD EPYC, NVLink, 10GbE