Product

Mikshi Analyze 1.0: a Video‑Language Model for Real‑World Operator Intelligence.

Mikshi Analyze is the video‑language model behind the platform. Given a video — a short clip, an uploaded recording, an hour‑long CCTV file, or a direct camera feed — it produces grounded natural‑language output describing what happened, answering questions about it, and flagging the moments that matter.

See our capability overview for how teams ship with it in production.

Mikshi Research · 2026

Introducing

Mikshi Analyze

1.0

Five capabilities, one model. The same network describes a video, answers a question about it, holds a multi‑turn chat, locates an event in time, and flags what an operator needs to see next. They are not five models. They are five ways of prompting the same model.

Prompt Modes

One model. Five ways to ask.

You give Mikshi Analyze a video — anywhere from a few seconds to hours long — and a prompt (or a sequence of prompts in a chat). It returns natural‑language text grounded in what it saw, with timestamps embedded in the output.

Describe

Describe the video

Full-video narrative — end-to-end understanding, not per-frame captions.

"Describe what happened in this 47-minute CCTV recording."
Ask

Answer a question about it

Operators ask in natural language; the model answers from what it actually saw.

"Did the white SUV stop at the sign?"
Chat

Chat across multiple turns

The model retains context across the full conversation, on the same video.

"What happened around 00:14?" → "Who was the person on the left?"
Locate

Locate an event in time

Returns timestamps, not paragraphs. Operators jump straight to the moment.

"When does the person enter the loading bay?"
Flag

Flag what matters

Surfaces events grouped by severity — Critical, Warning, Info.

"Anything I should look at in this hour of footage?"
Mikshi Analyze pipeline — a prompt and a video go into the model, and grounded, time‑stamped events come out with confidence scores.
Prompt + video in. Grounded, time‑stamped events out — with the moments an operator should look at next.
Hour‑long footage

Built for full videos, including hour‑long footage.

Mikshi Analyze processes complete videos — not pre‑segmented clips. Hand it an hour of CCTV (or point it at a direct camera feed) and ask questions about anything that happened anywhere in it. No manual clipping. No pre‑segmentation. No requirement that you already know which moment to look at.

This is the property that makes Mikshi Analyze useful for operators: they don't have time to find the right minute before asking a question. They ask the question against the whole video.

≤ 60s
Short clips & moments
30 min
Meetings & recordings
1 hour+
CCTV & long surveillance
Live
Direct camera feeds
12 Languages

Ask in your language. Get the answer in your language.

Mikshi Analyze supports 12 languages for both prompts and responses. A traffic operator in Delhi can query in Hindi; a security operator in São Paulo can query in Portuguese; a transit operator in Jakarta can query in Bahasa Indonesia. Same model, no per‑language fine‑tune.

Timestamps, severity labels, and grounded event descriptions all work across the full language set.

Englishहिन्दीDeutschEspañolالعربيةFrançaisPortuguêsItalianoBahasa IndonesiaTagalogไทยTiếng Việt

Severity is part of the model's output.

Every event Mikshi Analyze surfaces is labeled with one of three categories — emitted by the model, not layered on by a downstream rule. Operators see at a glance which moments demand action and which are background context.

Category

Critical

Immediate operator action required — collisions, intrusions, fires, weapons, safety violations.

Category

Warning

Notable deviations that should be reviewed but are not emergencies — unusual loitering, parking violations, restricted-area near-misses.

Category

Info

Routine but query-relevant observations — vehicle counts, object presence, scene descriptions.

Temporal Grounding

Timestamps are first‑class.

Mikshi Analyze treats time as part of its output, not as something recovered afterwards. When the model says "the vehicle ran the light at 00:14.3", the timestamp is a token the model generated, anchored to the frames it actually saw.

There is no separate alignment pass that could be wrong. That is what makes temporal grounding reliable enough for incident response, compliance, and investigation.

cam_07_loading_bay.mp4
01:47:22 total
  • 00:14.3
    Critical

    White SUV (license partial: TN09‑••) ran the south stop sign without slowing.

  • 12:08.7
    Warning

    Person in dark jacket loitering near gate 3 for 4m 12s; no entry attempt.

  • 47:51.0
    Info

    Cargo van offloaded 6 pallets; bay clear at 49:02.

  • 01:09:33.2
    Critical

    Forklift operator entered exclusion zone without high‑vis vest.

Session · cam_12_lobby.mp4Day 1 → Day 4
O1

What happened around 00:14?

M

A man in a navy jacket pushed open the side door at 00:14.6 and held it for two others who entered behind him.

O1

Who was the person on the left?

M

Same person who walked through the lobby earlier at 00:08.1 — taller, carrying a black backpack.

Operator 2 · next shift · 3 days later
O2

Go back to that moment. Was anyone else with them?

Persistent chat

Chat sessions are retained per video.

  • Ask a question, get an answer. Ask a follow‑up — the model remembers prior turns.
  • Leave the session, come back days later, pick up exactly where you left off.
  • Hand the session to a different operator on the next shift without starting from zero.
Operator Capabilities

Five capabilities, all from one model.

Pick one with a prompt; pick another by changing the prompt. There is no per‑task fine‑tune to manage, no second model to deploy.

Complex scene understanding

Distinguishes critical activity from background motion in dense feeds.

Temporal grounding

Locates events in time with sub-second precision; timestamps are emitted, not inferred.

Visual question answering

Natural-language interface over a video, supporting multi-turn follow-up questions.

Severity classification

Labels every event as Critical, Warning, or Info as part of the model's response.

Multi-turn chat

Retains session state and conversation history so operators can return to a video and continue the conversation.

Point it a video. Ask anything.

Mikshi Analyze turns hours of footage into grounded, time‑stamped answers an operator can act on.