Claude-real-video － any LLM can watch a video

GitHub - HUANGCHIHHUNGLeo/claude-real-video: Let Claude (or any LLM) actually watch a video — scene-aware, deduplicated frames + transcript, from a URL or local file. Runs locally, MIT. · GitHub

claude-real-video

Let Claude — or any LLM — actually watch a video.

Most AI tools don't really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won't take a video file at all. Even Gemini, which can read video natively, needs the video uploaded to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.

claude-real-video does it differently, and locally: point it at a URL or a file, and it pulls the frames that actually matter (every scene change, not a fixed quota), throws away the near-duplicates, transcribes the audio, and hands you a clean folder any LLM can read. Everything runs on your own machine; nothing is uploaded.

    crv "https://www.youtube.com/watch?v=..."
    # → crv-out/frames/*.jpg + crv-out/transcript.txt + crv-out/MANIFEST.txt

Then drop the frames + MANIFEST.txt into Claude / ChatGPT / Gemini and ask away.
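If you would rather send the folder through an API than drag files into a chat window, the hand-off can be sketched roughly as below. This is not part of claude-real-video: `build_content_blocks` and `max_images` are hypothetical names, MANIFEST.txt and transcript.txt are assumed to be plain UTF-8 text, frame filenames are assumed to sort chronologically, and the image block shape follows the Anthropic Messages API as commonly documented (check your provider's docs before relying on it).

```python
import base64
from pathlib import Path


def build_content_blocks(out_dir, question, max_images=20):
    """Turn a crv output folder into a list of message content blocks."""
    out = Path(out_dir)
    blocks = []

    # Text files first; transcript.txt may be absent (e.g. after --no-transcribe).
    for name in ("MANIFEST.txt", "transcript.txt"):
        path = out / name
        if path.exists():
            text = path.read_text(encoding="utf-8", errors="replace")
            blocks.append({"type": "text", "text": f"--- {name} ---\n{text}"})

    # Frames in filename order (assumed to be chronological), capped at max_images.
    frames = sorted((out / "frames").glob("*.jpg"))[:max_images]
    for frame in frames:
        data = base64.standard_b64encode(frame.read_bytes()).decode("ascii")
        blocks.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": data},
        })

    # The question goes last so the model reads it after the evidence.
    blocks.append({"type": "text", "text": question})
    return blocks
```

The returned list would be used as the `content` of a single user message; `max_images` is there only because most APIs limit how many images one request may carry.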

Why not just sample frames?

Most "let an LLM watch a video" scripts (and Gemini's own pipeline) grab frames at a fixed interval — e.g. one per second. That over-samples a static screencast and under-samples a fast-cut reel. claude-real-video is smarter:

| | fixed-interval sampling | claude-real-video |
|---|---|---|
| Frame selection | every N seconds | scene-change detection + density floor |
| Repeated shots (A-B-A cuts) | sent again every time | sliding-window dedup sends each shot once |
| Static slide (10 min) | ~600 near-identical frames | collapses to 1 (dedup) |
| Fast-cut reel | misses frames between samples | catches each visual change |
| Audio | often ignored | Whisper transcript w/ language detect |
| Where the video goes | often uploaded to a cloud | stays on your machine |
| Input | usually local file only | URL (yt-dlp) or local file |

You feed the model fewer, more meaningful frames — cheaper context, better understanding.
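To make the sliding-window dedup idea concrete, here is a minimal sketch of how it could work. It is an illustration, not the project's actual code: frames are modelled as flat lists of grayscale values, and the function names, the per-pixel tolerance `tol`, and the values `threshold=0.10` and `window=3` are all assumptions chosen for the example.

```python
def changed_fraction(a, b, tol=16):
    """Fraction of pixels whose grayscale value differs by more than tol."""
    differing = sum(1 for x, y in zip(a, b) if abs(x - y) > tol)
    return differing / len(a)


def dedup(frames, threshold=0.10, window=3):
    """Return indices of frames that differ from each of the last `window` kept frames."""
    kept = []
    for index, frame in enumerate(frames):
        recent = kept[-window:]
        # A frame is "new" only if enough pixels changed against every recent keeper.
        if all(changed_fraction(frame, frames[k]) >= threshold for k in recent):
            kept.append(index)
    return kept


# Toy 4-pixel "frames": shot A twice, shot B, then a cut back to A.
A, B = [10, 10, 10, 10], [200, 200, 200, 200]
print(dedup([A, A, B, A], window=3))  # [0, 2]
print(dedup([A, A, B, A], window=1))  # [0, 2, 3]
```

In this sketch, a window of 3 remembers shot A across the cutaway to B, so the return to A is dropped; with a window of 1 only the previous kept frame is compared, so A is kept a second time. A static slide repeated hundreds of times collapses to its first frame either way.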

Install

    pip install claude-real-video              # core (frames + dedup)
    pip install "claude-real-video[whisper]"   # + audio transcription

System requirement: ffmpeg

ffmpeg / ffprobe are used for frame extraction and audio, and aren't pip-installable. Install them once:

| OS | command |
|---|---|
| macOS | `brew install ffmpeg` |
| Linux | `sudo apt install ffmpeg` (or your distro's package manager) |
| Windows | `winget install Gyan.FFmpeg`, or `choco install ffmpeg`, or download a build and add its `bin\` folder to your PATH |

Verify it's on your PATH:

ffmpeg -version
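The same check can be scripted, for example in a setup step that should fail early. The sketch below uses only the standard library's `shutil.which`; `missing_tools` is a hypothetical helper name, not something claude-real-video provides.

```python
import shutil


def missing_tools(names=("ffmpeg", "ffprobe")):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("ffmpeg and ffprobe found")
```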

Transcription uses the whisper CLI (installed by the [whisper] extra, or pip install openai-whisper). Whisper also relies on ffmpeg.

Works on macOS, Windows, and Linux — Python 3.10+.

Usage

    # A YouTube / Instagram / TikTok / ... link
    crv "https://www.instagram.com/reel/XXXX/"

    # A local file, English transcript, output to ./out
    crv lecture.mp4 -o out --lang en

    # Frames only, no transcription
    crv clip.mp4 --no-transcribe

    # A login-gated video (your own / authorised use): pass a Netscape cookie file
    crv "https://..." --cookies cookies.txt

python -m claude_real_video ... works as an alias for crv too.
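For processing several videos in one go, the CLI can be driven from a small script. The following is a sketch that only uses the flags shown in the examples above (`-o`, `--lang`, `--no-transcribe`, `--cookies`); `crv_command` and `run_batch` are hypothetical helper names, and the numbered output folders are just one possible layout.

```python
import subprocess


def crv_command(source, out="crv-out", lang=None, transcribe=True, cookies=None):
    """Build the argument list for one crv run from the flags shown above."""
    cmd = ["crv", source, "-o", out]
    if not transcribe:
        cmd.append("--no-transcribe")
    elif lang:
        cmd += ["--lang", lang]
    if cookies:
        cmd += ["--cookies", cookies]
    return cmd


def run_batch(sources, out_root="out"):
    """Run crv once per source, each into its own numbered folder."""
    for i, source in enumerate(sources, start=1):
        # check=True stops the batch at the first failing video.
        subprocess.run(crv_command(source, out=f"{out_root}/{i:03d}"), check=True)
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems with URLs that contain `&` or `?`.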

Options

| flag | default | meaning |
|---|---|---|
| `-o`, `--out` | `crv-out` | output directory |
| `--scene` | `0.30` | scene-change sensitivity (lower = more frames) |
| `--fps-floor` | `1.0` | at least one frame every N seconds |
| `--max-frames` | `150` | hard cap on total frames |
| `--lang` | `auto` | Whisper language (en, zh, auto, ...) |
| `--dedup-threshold` | | % of pixels that must change for a frame to count as new; higher = fewer frames |
| `--dedup-window` | | compare against the last N kept frames, so a shot the model already saw doesn't come back after a cutaway (1 =... |
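How `--scene`, `--fps-floor`, and `--max-frames` might interact can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `select_frames` is a hypothetical function, scene scores are assumed to lie between 0 and 1, the floor is read as a maximum gap in seconds (following the description in the table), and thinning evenly to honour the cap is only one possible policy.

```python
def select_frames(scores, scene=0.30, floor_s=1.0, max_frames=150):
    """Pick timestamps from (timestamp_seconds, scene_score) pairs."""
    picked = []
    last_t = None
    for t, score in scores:
        is_cut = score > scene                                    # scene change detected
        gap_too_long = last_t is None or (t - last_t) >= floor_s  # density floor
        if is_cut or gap_too_long:
            picked.append(t)
            last_t = t
    if len(picked) > max_frames:
        # Assumption: thin evenly across the video; the real tool may cap differently.
        step = len(picked) / max_frames
        picked = [picked[int(i * step)] for i in range(max_frames)]
    return picked


scores = [(0.0, 0.0), (0.25, 0.05), (0.5, 0.6), (0.75, 0.02),
          (1.0, 0.01), (1.25, 0.0), (1.5, 0.03), (1.75, 0.0)]
print(select_frames(scores))  # [0.0, 0.5, 1.5]
```

In the example, the frame at 0.5 s is kept because its score exceeds the scene threshold, and the frame at 1.5 s is kept only because a full second has passed since the last pick. Lowering `scene` to 0.04 would also admit the frame at 0.25 s, which matches the "lower = more frames" note in the table.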
