Two specialized models, not one general one.
Retrieval and reasoning have different objectives, data, and latency profiles. We don't force them through a shared representation — they compose at the clip level.
Video-native AI for the operational reality of CCTV — continuous, multi-camera, multi-hour streams. We close the gap that image-centric models skipped.
Foundation models skipped a step: building a real representation of the world through video. Mikshi is organized around closing that gap, with CCTV as the anchor.
We focus on temporal reasoning over long-form footage — what happened, where in the hour, what comes next — expressed in language an operator can act on.
Different encoders, different objectives, different embedding spaces. Deployed together, designed separately.
A video-native encoder. It reads frames and the time between them, then emits multi-vector embeddings — one set of tokens per segment, not a single pooled vector — so temporal structure survives indexing.
Late interaction between query and segment tokens returns the right seconds, not the right hour. A day of footage becomes searchable in milliseconds.
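To make the late-interaction idea concrete, here is a minimal sketch in plain Python. It assumes a MaxSim-style score (for each query token, take its best cosine match among the segment's tokens, then sum), which is one common form of late interaction and not necessarily the scoring Mikshi uses. The names `maxsim` and `search`, and the `(start_s, end_s, tokens)` index layout, are hypothetical.

```python
from math import sqrt

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def maxsim(query_tokens, segment_tokens):
    """Assumed late-interaction score: for each query token, take the best
    cosine similarity against any segment token, then sum over query tokens."""
    q = [normalize(t) for t in query_tokens]
    s = [normalize(t) for t in segment_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qt, st)) for st in s)
        for qt in q
    )

def search(query_tokens, index, top_k=1):
    """index is a hypothetical list of (start_s, end_s, segment_tokens).
    Returns the top_k (score, start_s, end_s) tuples, best first."""
    scored = [(maxsim(query_tokens, toks), start, end)
              for start, end, toks in index]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy 2-D embeddings: only the middle segment contains both query concepts.
index = [
    (0, 10, [[1.0, 0.0], [1.0, 0.0]]),
    (10, 20, [[1.0, 0.0], [0.0, 1.0]]),
    (20, 30, [[0.0, 1.0], [0.0, 1.0]]),
]
query = [[1.0, 0.0], [0.0, 1.0]]
best = search(query, index)[0]
```

Because each hit carries its own start and end offsets, the result points at a span of seconds instead of a whole recording. A production index would use approximate nearest-neighbor search instead of the linear scan shown here.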
multi-vector · per segment
clip in · grounded language out
A video-language model with its own visual encoder. It turns a clip into grounded, temporally precise language an operator can act on.
Search returns the evidence. Analyze returns the explanation. They talk through clips and timestamps, not a shared embedding space.
Search recovers what was seen. Analyze explains what it meant. Together they make long-form CCTV legible.
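The hand-off between the two models can be sketched as a small interface. This is an illustration under stated assumptions, not Mikshi's API: `ClipRef`, `Finding`, and `investigate` are hypothetical names, and the two models are passed in as plain callables. The point it shows is that only clip references and timestamps cross the boundary, never embeddings.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class ClipRef:
    """Hypothetical clip reference: which camera, and which seconds."""
    camera_id: str
    start_s: float
    end_s: float

@dataclass(frozen=True)
class Finding:
    """A retrieved clip paired with the language that explains it."""
    clip: ClipRef
    explanation: str

# Assumed shapes for the two models: Search maps a text query to clips,
# Analyze maps one clip to grounded language.
SearchFn = Callable[[str], List[ClipRef]]
AnalyzeFn = Callable[[ClipRef], str]

def investigate(query: str, search: SearchFn, analyze: AnalyzeFn) -> List[Finding]:
    """Compose the two models at the clip level: retrieve, then explain."""
    return [Finding(clip, analyze(clip)) for clip in search(query)]
```

Since neither side sees the other's embedding space, either model can be retrained or replaced without re-indexing or changing the other.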
We don't treat CCTV as an application of a general video model. We treat general video understanding as a byproduct of solving CCTV.
One camera produces 24 hours of footage a day. A deployment produces thousands. Models trained on short web clips don't survive this scale.
Returning the right hour is useless. Operators need the right seconds.
Same scene for months, but lighting, weather, and occlusion never stop changing.
No speech, no narration. The visual and temporal channel has to carry it alone.
Value is lost when a moment is never flagged — not when a model is a second slow.
The value isn't running faster than humans. It's seeing what humans stop seeing several hours into a shift.
A segment is a sequence of embeddings, not a point. Pooling destroys exactly the temporal detail retrieval needs.
Retrieval and reasoning are both needed; neither alone is enough, and the operator sets the latency budget. Mikshi is shaped by that constraint.
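The claim that a segment is a sequence of embeddings, not a point, can be checked on a toy case. The sketch below assumes mean pooling and a plain dot-product score; the two-dimensional vectors are made up for illustration. Segment A holds a brief event among background frames, segment B holds no event at all, yet both pool to the same vector.

```python
def mean_pool(tokens):
    """Collapse a sequence of token vectors into a single averaged vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pooled_score(query, tokens):
    """Score the query against the single pooled vector."""
    return dot(query, mean_pool(tokens))

def best_token_score(query, tokens):
    """Score the query against every token and keep the best match."""
    return max(dot(query, t) for t in tokens)

# Axis 0 = "background", axis 1 = "event" (illustrative values only).
seg_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one event frame
seg_b = [[0.75, 0.25]] * 4                                # no event frame
query = [0.0, 1.0]                                        # looking for the event

same_when_pooled = mean_pool(seg_a) == mean_pool(seg_b)   # True
pooled = (pooled_score(query, seg_a), pooled_score(query, seg_b))        # (0.25, 0.25)
per_token = (best_token_score(query, seg_a), best_token_score(query, seg_b))  # (1.0, 0.25)
```

After pooling, the two segments are indistinguishable to the query. Kept as token sequences, segment A scores four times higher, because the one frame that matters is still there to match.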
Live incidents, investigations, and reporting — no re-indexing in between.
Retrieval across feeds decides how fast the moment reaches the person.
Real-time anomalies returned as language an operator can audit.
VLMs were built around static images. Industrial deployments need something else — an understanding of how a scene evolves over time.
Search surfaces the moment. Analyze describes what happened. One video intelligence surface for the cameras that are already running.