Mikshi Search 1.0: a Video Embedding Model for Archive‑Scale Retrieval.
Mikshi Search is the embedding model behind the platform. Hand it any video — a short clip, an uploaded recording, or an hour‑long CCTV file — and it produces embeddings that power retrieval across collections of any size.
See our capability overview for how teams ship with it in production.
Introducing Mikshi Search 1.0
One model, one embedding space. The same network encodes your videos at upload time and your queries at search time, whether text, a reference clip, or (soon) a reference image, so retrieval lands on the right seconds, not the right hour.
One embedding space. Three ways to ask.
All query types are projected into the same space as the indexed video and matched the same way. You do not interact with the embeddings directly — you upload videos to a collection, you issue queries, you get back ranked moments.
Natural language
Describe the moment in words. The query is encoded into the same space as the video and matched directly.
Reference video clip
Hand it a clip and ask for other moments that look like it. Useful when language can't describe the moment precisely.
Reference image
Find moments that look like a single image. On the roadmap; same embedding space, same retrieval path.

Not the right video. The right seconds.
Each result is a ranked moment, not a whole file. Operators do not need the right hour inside a day of footage; they need the right seconds. Each result includes:
- Source video and the collection it belongs to.
- Start and end timestamp locating the matched moment inside the source video.
- A score indicating how well the moment matched the query.
- A thumbnail for the matched moment.
- Raw 1024-dimensional embeddings, returned only when the request opts in.
Example results:
- 01 | 00:14:08.2 → 00:14:11.6 | score 0.94
  Red sedan crosses south stop line ~0.4s after signal turns red.
- 02 | 00:47:22.0 → 00:47:25.8 | score 0.88
  Red sedan, different vehicle. Runs the east‑bound red without slowing.
- 03 | 01:09:33.4 → 01:09:36.1 | score 0.81
  Red SUV (high match for sedan) crosses on yellow→red transition.
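As a rough illustration, the result fields listed above could be modeled on the client side as follows. The class name `Moment`, every field name, and the `to_seconds` helper are assumptions made for this sketch; the source does not specify the response schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Moment:
    """Hypothetical client-side shape for one ranked result."""
    collection: str      # collection the source video belongs to
    video_id: str        # source video identifier (assumed field name)
    start_s: float       # start of the matched moment, in seconds into the video
    end_s: float         # end of the matched moment, in seconds into the video
    score: float         # how well the moment matched the query
    thumbnail_url: str   # thumbnail for the matched moment
    embedding: Optional[List[float]] = None  # opt-in; 1024 floats when requested

    @property
    def duration_s(self) -> float:
        """Length of the matched moment in seconds."""
        return self.end_s - self.start_s


def to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.s' timestamp into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# The first example result, with placeholder identifiers.
first = Moment(
    collection="intersection-north",
    video_id="example-video",
    start_s=to_seconds("00:14:08.2"),
    end_s=to_seconds("00:14:11.6"),
    score=0.94,
    thumbnail_url="https://example.invalid/thumb.jpg",
)
print(f"{first.duration_s:.1f} s matched, score {first.score}")
```

Keeping `embedding` optional mirrors the opt-in behavior: most callers only need the timestamps and the score.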
Built for hour‑long footage, end to end.
Mikshi Search generates embeddings for full videos, including long‑form CCTV recordings. You do not pre‑chunk, you do not pre‑segment, you do not configure window sizes. Hand it the video; it produces the embeddings.
Internally, Mikshi Search uses a multi‑vector representation rather than a single pooled vector per video. If a long recording were collapsed to one vector, the best a query could return would be "this video contains your event." Multi‑vector embeddings preserve sub‑minute temporal resolution all the way through retrieval, so query results land on the right seconds.
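To make the difference concrete, here is a toy sketch in which 2-dimensional vectors stand in for per-segment embeddings. The 10-second segment length, the vectors, and the helper names are invented for illustration, and mean pooling is only one way a video could be collapsed to a single vector. None of this describes the model's actual internals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


SEGMENT_SECONDS = 10  # assumed granularity, for illustration only

# One toy vector per segment of a short recording.
segments = [
    [1.0, 0.0],  # 0-10 s: routine traffic
    [1.0, 0.1],  # 10-20 s: routine traffic
    [0.0, 1.0],  # 20-30 s: the event of interest
    [1.0, 0.0],  # 30-40 s: routine traffic
]
query = [0.0, 1.0]

# Single pooled vector: average every segment, then score once.
pooled = [sum(column) / len(segments) for column in zip(*segments)]
pooled_score = cosine(query, pooled)

# Multi-vector: score each segment and keep the best-matching one.
scores = [cosine(query, segment) for segment in segments]
best = max(range(len(segments)), key=scores.__getitem__)
start_s = best * SEGMENT_SECONDS
end_s = start_s + SEGMENT_SECONDS

print(f"pooled: one score for the whole video ({pooled_score:.2f}), no timestamp")
print(f"multi-vector: best match at {start_s}-{end_s} s ({scores[best]:.2f})")
```

The pooled score is diluted by the routine segments and carries no position, while the per-segment scores both rank higher and point at a specific interval.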
A visual representation of actions and entities.
Mikshi Search is vision‑only today. It looks at frames and at how they evolve over time. It does not transcribe speech or use ambient audio — CCTV deployments cannot rely on audio anyway. The visual + temporal signal carries the load.
Actions
What is happening in the scene — motion patterns, interactions, gestures.
Entities
Who and what is present — people, vehicles, objects.
Spatial relationships
How the entities are arranged and how that arrangement changes over time.
One camera, one collection.
The recommended deployment shape is one camera per collection. Each physical camera maintains its own collection, and footage from that camera flows continuously into it. This mapping has three operational consequences.
Queries are camera-scoped by default
Asking “red sedan running a light” against the intersection-north collection only searches that camera's footage. No cross-camera bleed unless you explicitly query multiple collections.
Cameras are independent units of operation
Adding a camera means creating a new collection. Retiring a camera means archiving its collection. No global re-index, no coordination across cameras.
Per-camera scaling
Indexing throughput, storage growth, and query load are isolated per collection. A busy camera does not slow queries against a quiet one.
If you need to search across cameras at a site, issue parallel queries to each camera's collection and merge results client‑side, or maintain a separate site‑level collection that aggregates the cameras you want grouped.
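The first option, parallel queries merged client-side, might look like the sketch below. `search_collection` is a hypothetical stand-in backed by canned data, not a real client call, and the collection names and hits are invented. The merge also assumes scores are comparable across collections, which the source does not state explicitly.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Canned data standing in for two camera collections (illustrative only).
_FAKE_INDEX = {
    "intersection-north": [(0.94, "00:14:08.2"), (0.81, "01:09:33.4")],
    "intersection-south": [(0.88, "00:47:22.0")],
}


def search_collection(collection, query, limit):
    """Hypothetical per-collection search; returns hits best-first."""
    hits = sorted(_FAKE_INDEX.get(collection, []), reverse=True)[:limit]
    return [
        {"collection": collection, "score": score, "start": start}
        for score, start in hits
    ]


def search_site(collections, query, limit=10):
    """Query every collection in parallel and merge the hits by score."""
    workers = max(1, len(collections))  # the executor needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_camera = list(
            pool.map(lambda c: search_collection(c, query, limit), collections)
        )
    merged = [hit for hits in per_camera for hit in hits]
    return heapq.nlargest(limit, merged, key=lambda hit: hit["score"])


results = search_site(
    ["intersection-north", "intersection-south"],
    "red sedan running a light",
    limit=2,
)
for hit in results:
    print(hit["collection"], hit["start"], hit["score"])
```

Each hit keeps its `collection` so the merged list still says which camera a moment came from.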
Semantic query + time window + camera.
Every result Mikshi Search returns is timestamped, so you can scope queries to time windows. Time filters apply as part of the query, alongside the natural‑language or reference‑clip query itself — no external scheduling or batch step.
Time‑based search works against archive‑scale footage. A "last 24 hours" query against a camera that has been recording continuously for months does not scan the entire history; only the requested window is searched.
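One common way to get this behavior is to keep indexed moments sorted by timestamp and binary-search the window bounds, so the cost depends on the window rather than on the archive. The sketch below assumes that layout. The source does not describe how the index is actually organized, and the helper names `last` and `window_slice` are invented.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta, timezone


def last(duration, now):
    """Resolve a relative window such as 'last 1 hour' to absolute bounds."""
    return now - duration, now


def window_slice(sorted_times, start, end):
    """Return the index range [lo, hi) of entries with start <= t <= end."""
    lo = bisect_left(sorted_times, start)
    hi = bisect_right(sorted_times, end)
    return lo, hi


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)

# Toy index: one moment every 10 minutes over the preceding 24 hours,
# stored in ascending time order.
times = [now - timedelta(minutes=10 * i) for i in range(144, -1, -1)]

start, end = last(timedelta(hours=1), now)
lo, hi = window_slice(times, start, end)

# Only times[lo:hi] would be handed to the similarity search.
print(hi - lo, "of", len(times), "indexed moments fall inside the window")
```

The same two helpers cover absolute windows such as "between 08:00 and 12:00 yesterday": pass the two datetimes to `window_slice` directly instead of calling `last`.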
"Anything matching this query in the last 1 hour?"
"Find all instances between 08:00 and 12:00 yesterday."
"Search only the last 30 minutes of footage on this camera."
Thousands of camera‑hours. The seconds that matter.
Mikshi Search is designed for the CCTV regime: continuous recordings spanning days or weeks, mostly routine, with the seconds that matter buried deep inside. Hour‑long recordings are first‑class inputs, not edge cases.
of footage from a single camera becomes a searchable space of moments.
of archive remain queryable without per‑query degradation.
query latency against the indexed window, including time‑scoped queries like “last 1 hour.”
Indexing runs in the background as new footage arrives. The unit of consumption is the moment, not the file — that is what makes search useful at archive scale.
Hand it a video. Get back the seconds.
Mikshi Search turns hours of footage into a searchable space of moments — text, clip, or (soon) image queries, ranked and timestamped.