SIMS v2t: Portable Video & Audio to Text on Your Desktop

Transcribing video content is a constant bottleneck. Researchers pause and replay interviews. Journalists manually type quotes from recordings. Content creators copy subtitles line by line. Teams wait for cloud services to process meeting recordings — and then discover the service requires an account, a subscription, or a monthly limit.

🎯 The Challenge

Getting text from video shouldn't require a cloud account, a browser extension, uploading sensitive recordings to a third party, or learning a command line. SIMS v2t puts transcription on your desktop: drag a file or paste a YouTube link, press Start, get a text file. Works offline. No account required.

The Solution: SIMS v2t

SIMS v2t is a portable desktop application built on Tauri 2 (Rust backend + React frontend). It runs on Windows, macOS, and Linux. The portable build needs no installer, no system-wide dependencies, and no admin rights: drop it in a folder and run it. Point it at a video file, an audio recording, a folder of media, or a YouTube URL (including playlists), and it produces a plain .txt transcript in the output folder you choose.

Transcription happens either through a configurable HTTP API (OpenAI-compatible — works with OpenAI Whisper, Groq, local LM Studio, and any compatible endpoint) or entirely offline via whisper.cpp — a local CLI that runs the Whisper model on your machine with no internet connection required.
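As a rough illustration of the HTTP mode, the sketch below builds (but does not send) a multipart request in the shape that OpenAI-compatible transcription endpoints accept. It is not taken from the v2t source: the function name, the `base_url` handling, and the choice of `response_format` are assumptions made for the example.

```python
import uuid
import urllib.request
from pathlib import Path


def build_transcription_request(base_url, api_key, audio_path, model, language=None):
    """Build a POST to <base_url>/audio/transcriptions without sending it."""
    boundary = uuid.uuid4().hex
    fields = {"model": model, "response_format": "text"}
    if language:  # leave unset so the server can auto-detect the language
        fields["language"] = language

    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    audio = Path(audio_path)
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{audio.name}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(header.encode() + audio.read_bytes() + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. Because only the base URL changes between providers, the same request shape can target a hosted service or a local server.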

How It Works

1. Add sources: drag and drop files, pick files or a folder, or paste YouTube or playlist URLs
2. Configure once: set the output folder, transcription mode (cloud API or local), and model
3. Start: the queue processes jobs sequentially with real-time progress and logs
4. Get text: .txt files appear in your output folder with predictable filenames

Video file / audio file, YouTube URL / playlist, or folder of media files  →  [ffmpeg normalize]  →  [Whisper API / local CLI]  →  .txt
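The normalize step in this pipeline can be pictured as a single ffmpeg call that converts any input to 16 kHz mono WAV. The sketch below only builds the argument list; the function name is hypothetical and the exact flags v2t passes are an assumption, although each flag shown is a standard ffmpeg option.

```python
def ffmpeg_normalize_args(src, dst, ffmpeg="ffmpeg"):
    """Return an argv list that converts src to 16 kHz mono 16-bit PCM WAV."""
    return [
        ffmpeg,
        "-y",                 # overwrite the output file if it already exists
        "-i", str(src),       # input video or audio file
        "-vn",                # drop any video stream
        "-ac", "1",           # one audio channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]
```

The list can be handed to `subprocess.run(args, check=True)` as is.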

Key Capabilities

📁 Drag & Drop Queue

Drop files directly onto the app. Paste YouTube URLs (single video or full playlist). Add entire folders — recursive scan finds all supported media formats automatically. Sequential queue with per-job status and log output.

🌐 YouTube & Playlist Support

Paste any YouTube link — single video or playlist. yt-dlp extracts audio automatically. A playlist produces one transcript per video. Optionally save the original video file to your output folder alongside the transcript.
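For readers curious what the yt-dlp step might look like, here is a minimal sketch that assembles an audio-extraction command. The function name, the WAV target, and the output template are assumptions for illustration; the flags themselves (`-x`, `--audio-format`, `-k`, `--yes-playlist`, `-o`) are documented yt-dlp options.

```python
from pathlib import Path


def ytdlp_audio_args(url, out_dir, keep_video=False, ytdlp="yt-dlp"):
    """Return an argv list that extracts audio from a video or playlist URL."""
    template = str(Path(out_dir) / "%(title)s.%(ext)s")  # one file per video title
    args = [ytdlp, "--yes-playlist", "-o", template]
    if keep_video:
        args.append("-k")  # keep the downloaded video next to the extracted audio
    args += ["-x", "--audio-format", "wav", url]
    return args
```

With `--yes-playlist`, a playlist URL expands to one download per video, which matches the one-transcript-per-video behavior described above.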

🔌 Cloud API Mode

Connect to any OpenAI-compatible transcription endpoint — OpenAI Whisper API, Groq, local LM Studio, or self-hosted services. API key stored in OS credential store (Windows Credential Manager, macOS Keychain) — never in plain text files.

🖥️ Local Offline Mode

Switch to the whisper.cpp CLI for fully offline transcription. Choose from the tiny, base, small, medium, or large-v3-turbo models. Models download on first use and are verified with SHA-1. After the download, transcription works with no internet connection at all.

⬇️ One-Click Tool Setup

ffmpeg and yt-dlp download automatically on Windows and macOS with one button click, with no manual installation. whisper-cli is detected via Homebrew on macOS or downloaded on Windows from the official whisper.cpp release. Download progress is shown in real time.

🔄 Retry & Resume

HTTP API requests automatically retry on 429 and 5xx responses with exponential backoff. Large files are split into chunks, and each completed chunk is checkpointed. If a job is interrupted, it resumes from the last completed chunk instead of starting from scratch.
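The retry behavior can be sketched as a small loop around any request function. The attempt count, the base delay, and the names below are assumptions for the example, not values taken from v2t.

```python
import time


def is_retryable(status):
    """429 (rate limited) and all 5xx responses are worth retrying."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send() must return a (status, body) tuple. Delays double on each retry:
    base_delay, 2 * base_delay, 4 * base_delay, and so on.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt + 1 < max_attempts:
            sleep(base_delay * 2 ** attempt)
    return status, body  # still failing after the final attempt
```

Injecting `sleep` keeps the function easy to exercise without waiting for real delays.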

Supported Formats

Category	Formats
Video	mp4, mkv, mov, webm, avi, wmv, m4v
Audio	mp3, wav, m4a, flac, ogg, opus, aac, wma
URLs	YouTube videos, YouTube playlists, any yt-dlp-supported URL
Output	Plain text .txt with configurable filename template
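A recursive folder scan over the extensions in the table above could look like the following sketch. The function and constant names are hypothetical; only the extension list comes from the table.

```python
from pathlib import Path

# Extensions copied from the Supported Formats table.
MEDIA_EXTENSIONS = {
    "mp4", "mkv", "mov", "webm", "avi", "wmv", "m4v",          # video
    "mp3", "wav", "m4a", "flac", "ogg", "opus", "aac", "wma",  # audio
}


def find_media(folder):
    """Return every supported media file under folder, in sorted order."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower().lstrip(".") in MEDIA_EXTENSIONS
    )
```

Lowercasing the suffix means files such as `TALK.MP4` are picked up as well.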

Transcription Modes

☁️ Cloud API Mode

✅ OpenAI Whisper API: industry-standard quality
✅ Groq: ultra-fast inference
✅ Local LM Studio: an OpenAI-compatible server running on your own machine
✅ Any compatible endpoint: configurable base URL
✅ Files larger than 22 MB are split automatically into 8-minute chunks and the results reassembled
✅ API key in the OS keychain, not in config files
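To make the chunking rule concrete, the sketch below plans 8-minute segments for any file above the 22 MB threshold and builds an ffmpeg segment command. Treating 22 MB as 22 × 1024 × 1024 bytes is an assumption, as are the function names; the `segment` muxer and `-segment_time` are standard ffmpeg options.

```python
import math

CHUNK_SECONDS = 8 * 60                # 8-minute chunks
SIZE_LIMIT_BYTES = 22 * 1024 * 1024   # assumption: "22 MB" read as MiB


def plan_chunks(size_bytes, duration_s):
    """Return (start, end) second ranges; small files stay in one piece."""
    if size_bytes <= SIZE_LIMIT_BYTES:
        return [(0, duration_s)]
    count = math.ceil(duration_s / CHUNK_SECONDS)
    return [
        (i * CHUNK_SECONDS, min((i + 1) * CHUNK_SECONDS, duration_s))
        for i in range(count)
    ]


def ffmpeg_segment_args(src, out_pattern, ffmpeg="ffmpeg"):
    """Return an argv list that splits src into CHUNK_SECONDS-long files."""
    return [
        ffmpeg, "-y", "-i", str(src),
        "-f", "segment", "-segment_time", str(CHUNK_SECONDS),
        "-c", "copy",        # split without re-encoding
        str(out_pattern),    # for example "chunk_%03d.wav"
    ]
```

Each chunk's transcript is then joined in order to produce the final text file.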

🖥️ Local Offline Mode

✅ No internet required after model download
✅ No API key: free to run indefinitely
✅ Models from tiny (75 MB) to large-v3-turbo (1.5 GB)
✅ SHA-1 verified model files from the official Hugging Face repository
✅ Real-time progress from whisper.cpp stderr output
✅ Same chunking logic as cloud mode for large files
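For orientation, a local run of the whisper.cpp CLI might be assembled as in the sketch below. Which flags v2t actually passes is an assumption, and the function name is hypothetical; the flags shown (`-m`, `-f`, `-l`, `-otxt`, `-of`, `-pp`) are options of the whisper.cpp command-line tool.

```python
def whisper_cli_args(model_path, wav_path, out_base, language=None, cli="whisper-cli"):
    """Return an argv list for an offline transcription to <out_base>.txt."""
    return [
        cli,
        "-m", str(model_path),     # ggml model file, e.g. a downloaded "small" model
        "-f", str(wav_path),       # 16 kHz mono WAV input
        "-l", language or "auto",  # language code, or auto-detect when unset
        "-otxt",                   # write a plain-text transcript
        "-of", str(out_base),      # output path without the file extension
        "-pp",                     # print progress while running
    ]
```

Running the list with `subprocess.Popen(args, stderr=subprocess.PIPE, text=True)` lets a caller read progress lines as they arrive.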

💼 Use Cases

🎓 Researchers & Academics

Transcribe hours of interview recordings in a single batch run. Add a folder of audio files — get a folder of transcripts. Fully offline — no need to upload sensitive interview data to cloud services.

📰 Journalists & Content Creators

Paste a YouTube link and get a transcript in minutes. Transcribe podcast episodes, interview recordings, or conference talks. Use the transcript as a draft for articles, show notes, or subtitles.

💼 Business & Teams

Transcribe meeting recordings, webinars, and training videos. Process a folder of recordings from a conference day in one go. Keep sensitive call recordings local — no cloud upload required in offline mode.

🛠️ Developers & IT Teams

Self-host a Whisper-compatible API and point v2t at it. Process bulk media libraries in batch. Portable — runs from a USB drive or shared network folder with no installation on target machines.

📚 Education

Transcribe lecture recordings for students. Download YouTube educational playlists and produce a text corpus for study. Works on classroom machines without admin rights — fully portable.

🔒 Privacy-Sensitive Workflows

Legal, medical, or HR recordings that cannot leave the organization. Switch to local whisper.cpp mode — all processing happens on the machine, nothing is transmitted externally. Full control over model and data.

🔐 Security & Privacy

🛡️ Designed for Privacy

✅ OS Credential Store: API key stored in Windows Credential Manager or macOS Keychain — never written to disk in plaintext
✅ Offline mode available: Local whisper.cpp processes audio entirely on your machine — no data leaves your computer
✅ No telemetry: The application sends no usage data anywhere
✅ Process isolation: External tools (ffmpeg, yt-dlp, whisper-cli) are launched with explicit argument lists rather than through a shell, so user input such as file names or URLs is never interpreted as shell commands
✅ SHA-1 verified models: Whisper model files verified against official catalog checksums before use
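The point about explicit arguments can be demonstrated with a small experiment. In the sketch below a child Python process stands in for ffmpeg or yt-dlp (an assumption made so the example runs anywhere), and a hostile-looking file name is passed as a single argument.

```python
import subprocess
import sys


def echo_argument(value):
    """Pass value as one argv entry to a child process and return what it received."""
    result = subprocess.run(
        # A list of arguments, with no shell involved: value is never parsed.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")
```

Calling `echo_argument("talk.mp4; echo pwned")` hands the whole string to the child as one argument, so the `; echo pwned` part is treated as text rather than executed.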

⚡ Quick Start

🚀 Get Running in 3 Minutes

1. Download the installer or portable ZIP from GitHub Releases
2. Launch — the Setup Guide walks you through the first-time configuration
3. Click Download Tools to get ffmpeg + yt-dlp automatically (Windows/macOS)
4. Choose transcription mode: paste an API key for cloud, or download a whisper model for offline
5. Drag a video file or paste a YouTube URL → click Start → get your .txt

Whisper Models

Model	Size	Speed	Quality	Best for
tiny	75 MB	Fastest	Basic	Quick drafts, clear speech
base	142 MB	Fast	Good	Everyday use
small	466 MB	Medium	Very good	Recommended for most users
medium	1.5 GB	Slower	Excellent	Accented speech, technical content
large-v3-turbo	1.5 GB	Medium	Best	Maximum accuracy

All models are downloaded automatically from the official Hugging Face repository (ggerganov/whisper.cpp) with SHA-1 integrity verification.
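Integrity verification of this kind usually means hashing the downloaded file and comparing it with a published checksum. The following is a minimal sketch with hypothetical function names; the expected checksum would come from the model catalog.

```python
import hashlib


def sha1_of_file(path, block_size=1024 * 1024):
    """Hash a file in 1 MiB blocks so large models never load fully into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_model(path, expected_sha1):
    """Return True only when the file's SHA-1 matches the expected hex digest."""
    return sha1_of_file(path) == expected_sha1.strip().lower()
```

A model that fails the check would be discarded and downloaded again rather than used.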

🛠️ Technical Stack

Core Technologies

🦀 Rust + Tauri 2 — native desktop backend, cross-platform
⚛️ React 19 + TypeScript — responsive UI
⚡ Vite 7 — fast frontend tooling
🔗 Tokio — async Rust runtime with cancellation tokens

External Tools

🎬 ffmpeg — audio normalization (16 kHz mono WAV)
📥 yt-dlp — YouTube and URL audio extraction
🎙️ whisper.cpp CLI — local offline transcription (optional)
🔑 OS keyring — secure API key storage

❓ Frequently Asked Questions

Do I need to install ffmpeg and yt-dlp manually?

No. On Windows and macOS, the Settings panel has a one-click Download button that fetches and installs ffmpeg and yt-dlp automatically into the app data directory. On Linux, install via your package manager (apt install ffmpeg yt-dlp) or point to existing binaries in Settings. You can also place the binaries next to the application executable — v2t finds them automatically.

Can I use it without an internet connection?

Yes — switch to Local Whisper mode in Settings and download a model (tiny through large-v3-turbo). After the one-time model download, v2t transcribes entirely offline. No internet connection, no API key, no account. Works indefinitely.

What happens with large files over the API size limit?

Files larger than 22 MB are automatically split into 8-minute chunks by ffmpeg, each chunk is transcribed separately, and the results are joined into a single text file. This is transparent — you just get the complete transcript. Chunks are checkpointed: if the job is interrupted, it resumes from the last successful chunk rather than starting over.
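One simple way to picture the checkpointing is a small file that records every finished chunk's text, so a restarted job skips what is already done. This is a sketch under assumptions: the JSON checkpoint format and the names below are illustrative and not taken from v2t.

```python
import json
from pathlib import Path


def transcribe_with_checkpoints(chunks, transcribe, checkpoint_path):
    """Transcribe chunks in order, saving progress after each one.

    chunks is an ordered list of chunk identifiers, transcribe(chunk) returns
    that chunk's text, and checkpoint_path is where completed texts are stored.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    for chunk in chunks[len(done):]:   # skip chunks finished in an earlier run
        done.append(transcribe(chunk))
        path.write_text(json.dumps(done), encoding="utf-8")
    return " ".join(done)
```

If `transcribe` raises partway through, calling the function again with the same arguments picks up at the first unfinished chunk.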

What languages are supported?

Any language supported by your chosen Whisper model or API endpoint. Whisper was trained on 99 languages. Set the language code (e.g., uk, de, fr) in Settings — or leave it empty for automatic language detection.

Is it truly portable — can I run it from a USB drive?

Yes. Place ffmpeg and yt-dlp in the same folder as the v2t executable (or in a bin/ subfolder). v2t detects them automatically. Settings are stored in the OS standard app config directory so they persist between runs. No registry entries, no system-wide installation required.
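The lookup order described here can be sketched as follows. The function name is hypothetical, and falling back to the system PATH is an assumption added for the example; only the "next to the executable" and `bin/` locations come from the answer above.

```python
import shutil
import sys
from pathlib import Path


def find_tool(name, app_dir):
    """Look for a tool next to the app, then in bin/, then on the system PATH."""
    exe = name + (".exe" if sys.platform == "win32" else "")
    for candidate in (Path(app_dir) / exe, Path(app_dir) / "bin" / exe):
        if candidate.is_file():
            return candidate
    found = shutil.which(name)  # assumption: PATH is the last resort
    return Path(found) if found else None
```

Checking the application folder first is what lets a USB-drive copy carry its own ffmpeg and yt-dlp.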

Can I transcribe an entire YouTube playlist?

Yes. Paste a YouTube playlist URL into the URL field. v2t uses yt-dlp to extract audio from each video in the playlist and produces one .txt file per video. All files are placed in your configured output folder with names derived from the video titles.

Where is my API key stored?

API keys are stored in the OS credential store: Windows Credential Manager on Windows, macOS Keychain on macOS. They are never written to the settings JSON file or any other plain text file on disk. Even if you share your settings folder, your API key stays protected.

📞 Contact & Support

GitHub: github.com/vglu/v2t — issues, feature requests, source code
Email: vhlu@sims-service.com

🚀 Open Source (MIT License)

SIMS v2t is free, open-source software by SIMS Tech. Source code, releases, and contributions: github.com/vglu/v2t.
