Research-first,
built for the cameras
that never sleep.

Video-native AI for the operational reality of CCTV — continuous, multi-camera, multi-hour streams. We close the gap that image-centric models skipped.

The premise

The world doesn't arrive
in still frames.

Foundation models skipped a step: building a real representation of the world through video. Mikshi is organized around closing that gap, with CCTV as the anchor.

We focus on temporal reasoning over long-form footage — what happened, where in the hour, what comes next — expressed in language an operator can act on.

Our Models

Two specialized models, not one general one.

Different encoders, different objectives, different embedding spaces. Deployed together, designed separately.

The Art of Detail

Mikshi Search: a video-native encoder for long-duration CCTV.

Mikshi Search reads frames and the time between them, then emits multi-vector embeddings (one set of tokens per segment, not a single pooled vector) so temporal structure survives indexing.

Late interaction between query and segment tokens returns the right seconds, not the right hour. A day of footage becomes searchable in milliseconds.
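As a rough illustration of late interaction, here is a minimal sketch of MaxSim-style scoring over per-segment token sets. It is not Mikshi's implementation: the function names, the toy 3-dimensional vectors, and the `(start_s, end_s, tokens)` index layout are all assumptions made for the example, and every segment is assumed to have at least one token.

```python
from math import sqrt

Vector = list[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[Vector], segment_tokens: list[Vector]) -> float:
    """Late interaction: each query token keeps its best cosine match
    among the segment's tokens, and the best matches are summed."""
    q = [normalize(v) for v in query_tokens]
    s = [normalize(v) for v in segment_tokens]
    return sum(max(dot(qv, sv) for sv in s) for qv in q)


def search(query_tokens, index, top_k=1):
    """Rank segments by MaxSim. `index` is a hypothetical list of
    (start_s, end_s, tokens) tuples, one entry per indexed segment."""
    scored = [(maxsim(query_tokens, toks), start, end) for start, end, toks in index]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy index: three 2-second segments, two tokens each.
index = [
    (0.0, 2.0, [[1, 0, 0], [1, 0, 0]]),
    (2.0, 4.0, [[1, 0, 0], [0, 1, 0]]),
    (4.0, 6.0, [[0, 0, 1], [0, 0, 1]]),
]
query = [[1, 0, 0], [0, 1, 0]]
best_score, best_start, best_end = search(query, index)[0]
```

Because the result carries the segment's start and end times, a hit points at specific seconds rather than a whole recording. A production index would use approximate nearest-neighbour search instead of scoring every segment.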

Mikshi Search: multi-vector · per segment

Mikshi Analyze: clip in · grounded language out

The Power of Alignment

Mikshi Analyze: turning what happened into language an operator can act on.

Mikshi Analyze is a video-language model with its own visual encoder. It takes a clip and returns grounded, temporally precise language.

Search returns the evidence. Analyze returns the explanation. They talk through clips and timestamps, not a shared embedding space.
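One way to picture that hand-off is a small sketch in which the only thing passed from retrieval to reasoning is a clip reference. Everything here is hypothetical: `ClipRef`, `Finding`, `investigate`, and the two stub functions stand in for the real models and are not Mikshi's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ClipRef:
    """A pointer to footage: which camera, and which seconds."""
    camera_id: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class Finding:
    clip: ClipRef
    explanation: str


SearchFn = Callable[[str], list[ClipRef]]
AnalyzeFn = Callable[[ClipRef, str], str]


def investigate(question: str, search: SearchFn, analyze: AnalyzeFn,
                max_clips: int = 3) -> list[Finding]:
    """Retrieve candidate clips, then explain each one.
    No embeddings cross the boundary, only clip references."""
    clips = search(question)[:max_clips]
    return [Finding(clip, analyze(clip, question)) for clip in clips]


# Stubs standing in for the two models (assumed behaviour, for illustration).
def fake_search(question: str) -> list[ClipRef]:
    return [ClipRef("cam-07", 3612.4, 3618.9)]


def fake_analyze(clip: ClipRef, question: str) -> str:
    return f"{clip.camera_id} {clip.start_s:.1f}-{clip.end_s:.1f}s: relevant to '{question}'"


findings = investigate("vehicle stops at gate", fake_search, fake_analyze)
```

Keeping the interface this narrow is what lets each model use its own encoder and embedding space.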

Scene understanding
Critical activity vs. background motion in dense feeds.
Temporal grounding
Sub-second timestamps emitted as tokens, not aligned after the fact.
Visual QA
Natural-language interface over archived and live video.
Anomaly detection
A judgment and an explanation, not an opaque score.
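To make "timestamps emitted as tokens" concrete, here is a small sketch that parses timestamps embedded directly in generated text. The `<t=SECONDS>` token format and the example output string are invented for illustration. The source does not specify how Mikshi Analyze encodes timestamps.

```python
import re

# Assumed format: each statement is preceded by a start and an end token,
# e.g. "<t=12.40><t=15.85> person climbs fence."
SPAN = re.compile(r"<t=(\d+(?:\.\d+)?)><t=(\d+(?:\.\d+)?)>\s*([^<]*)")


def parse_grounded(text: str) -> list[tuple[float, float, str]]:
    """Return (start_s, end_s, description) for each grounded statement."""
    return [
        (float(m.group(1)), float(m.group(2)), m.group(3).strip())
        for m in SPAN.finditer(text)
    ]


output = "<t=12.40><t=15.85> person climbs fence. <t=16.10><t=19.00> person enters yard."
events = parse_grounded(output)
```

Since the timestamps arrive inside the model's output, the consumer only has to parse them. No separate alignment pass is needed to attach times to statements.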

Search recovers what was seen. Analyze explains what it meant. Together they make long-form CCTV legible.

Why CCTV is the hard case

CCTV isn't just more video. It's a qualitatively different regime —
and it shapes every design decision.

We don't treat CCTV as an application of a general video model. We treat general video understanding as a byproduct of solving CCTV.

Hours, not seconds.

One camera produces 24 hours of footage a day. A deployment produces thousands of hours. Web-clip models don't survive this scale.

Mostly nothing, occasionally everything.

Returning the right hour is useless. Operators need the right seconds.

Fixed viewpoint, drifting conditions.

Same scene for months, but lighting, weather, and occlusion never stop changing.

Vision-only signal.

No speech, no narration. The visual and temporal channel has to carry it alone.

The cost is missed events, not slow ones.

Value is lost when a moment is never flagged — not when a model is a second slow.

Research Focus

Three themes run through our work.

Two specialized models, not one general one.

Retrieval and reasoning have different objectives, data, and latency profiles. We don't force them through a shared representation — they compose at the clip level.

Recovering missed events.

The value isn't running faster than humans. It's seeing what humans stop seeing several hours into a shift.

Multi-vector embeddings.

A segment is a sequence of embeddings, not a point. Pooling destroys exactly the temporal detail retrieval needs.
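A toy example shows the effect. One brief event sits inside a segment that is otherwise background. The 2-dimensional vectors and all names below are assumptions chosen to keep the arithmetic visible. Real embeddings are learned and far higher-dimensional.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean_pool(tokens: list[list[float]]) -> list[float]:
    """Collapse a token sequence into a single averaged vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


background = [1.0, 0.0]
event = [0.0, 1.0]

# Ten tokens in time order; the event occupies only position 6.
segment = [background] * 6 + [event] + [background] * 3
query = event

# Single pooled vector: the event is averaged away by the background.
pooled_score = cosine(query, mean_pool(segment))

# Multi-vector: the event token still matches, and its position is known.
token_scores = [cosine(query, token) for token in segment]
best_index = max(range(len(segment)), key=token_scores.__getitem__)
```

Here the pooled score is roughly 0.11, while the best per-token score is 1.0 at position 6. The token sequence keeps both the match and where in the segment it occurred.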

Target Applications

Built for settings where video
is produced faster than anyone can watch it.

Retrieval and reasoning are both needed; neither alone is enough, and the latency budget is set by the operator. Mikshi is shaped by that constraint.

Traffic Analytics

One feed, live and post-hoc.

Live incidents, investigations, and reporting — no re-indexing in between.

Crisis Response

From many feeds to the right moment.

Retrieval across feeds decides how fast the moment reaches the person.

Security Monitoring

Flag it. Then justify it.

Real-time anomalies returned as language an operator can audit.

Why this matters

Most footage is watched
by no one. Mikshi changes that.

VLMs were built around static images. Industrial deployments need something else — an understanding of how a scene evolves over time.

Search surfaces the moment. Analyze describes what happened. One video intelligence surface for the cameras that are already running.

