Introducing SIMS v2t — Portable Video & Audio to Text for Your Desktop

March 27, 2026

by SIMS Tech · 6 min read
product-launch, open-source, transcription, whisper, tauri, desktop-app, video-to-text, ai-tools

Today we're releasing SIMS v2t v1.0 — a free, open-source desktop application that converts video files, audio recordings, and YouTube URLs into plain text transcripts.

🎙️ A New Addition to the SIMS Tech Portfolio

SIMS v2t is a portable Tauri desktop app built with Rust and React. Drop a video file, paste a YouTube URL, or add a folder, and get a .txt transcript. It supports both cloud transcription through any OpenAI-compatible API and fully offline transcription via whisper.cpp. No account is required for offline mode. It runs on Windows, macOS, and Linux and is released under the MIT License.

Why We Built This

Every time we needed to transcribe something — a recorded meeting, a YouTube tutorial, an interview — the options were the same: upload to a cloud service and hope your data is handled responsibly, run a complex command-line tool, or pay a per-minute subscription fee.

None of those options felt right for a tool you'd actually want to use daily.

We wanted something you could put in a folder and run. Something that works offline when needed. Something that handles a YouTube playlist as easily as a single local file. Something with a proper GUI that doesn't require you to read documentation to start it.

That's SIMS v2t.

What It Does

📁 Drag & Drop Queue

Drop files directly. Paste YouTube URLs. Pick a folder for batch processing, and v2t scans it recursively for video and audio files. Jobs run sequentially with live status and log output.
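A recursive folder scan of this kind can be sketched with the Rust standard library alone. This is an illustration, not v2t's actual code: the extension list and the names `is_media_file` and `collect_media` are assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical extension list; the formats v2t actually accepts may differ.
const MEDIA_EXTENSIONS: &[&str] = &["mp4", "mkv", "mov", "webm", "mp3", "wav", "m4a", "flac"];

/// Returns true when the path ends in one of the assumed media extensions
/// (compared case-insensitively).
fn is_media_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| MEDIA_EXTENSIONS.iter().any(|known| ext.eq_ignore_ascii_case(known)))
        .unwrap_or(false)
}

/// Walks `dir` recursively and appends every media file it finds to `out`.
/// Symlink loops are not handled in this sketch.
fn collect_media(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_media(&path, out)?;
        } else if is_media_file(&path) {
            out.push(path);
        }
    }
    Ok(())
}
```

Each path collected this way would become one job in the queue.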

🌐 YouTube & Playlist Support

Paste a YouTube link — single video or full playlist. yt-dlp handles extraction. A playlist produces one transcript per video. Optionally save the original video file alongside the transcript.

☁️ Cloud API Mode

Connect to any OpenAI-compatible endpoint — OpenAI Whisper, Groq, local LM Studio, or a self-hosted service. Configure base URL and model. API key stored in OS keychain, never in a plain text file.

🖥️ Fully Offline Mode

Switch to whisper.cpp local mode. Choose a model (tiny to large-v3-turbo), download it once with one click. From that point on — transcription with no internet connection, no API key, no cost per minute.

⬇️ One-Click Tool Setup

A single button downloads and installs ffmpeg and yt-dlp on Windows and macOS. whisper-cli auto-detected via Homebrew on macOS or downloaded on Windows from the official release. No manual setup for most users.

🔄 Retry & Resume

API errors retry automatically with backoff. Large files split into chunks — each checkpointed. If a job is interrupted halfway through a 3-hour recording, it resumes from the last completed chunk.
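As a rough sketch of the two ideas above, the following Rust computes an exponential backoff delay and works out which chunks still need processing after an interruption. The function names, the doubling schedule, and the base and cap values used in the usage note are illustrative assumptions; the post does not state v2t's actual retry parameters or checkpoint format.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

/// Exponential backoff: the base delay doubles with each attempt and is
/// capped at `max`. Attempt numbering starts at 0. (Assumed schedule.)
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Given the total number of chunks and the set of chunk indices already
/// checkpointed as complete, returns the indices that still need work.
fn pending_chunks(total: usize, done: &BTreeSet<usize>) -> Vec<usize> {
    (0..total).filter(|i| !done.contains(i)).collect()
}
```

With a 1-second base and a 30-second cap, attempts 0, 1, 2, 3 would wait 1, 2, 4, and 8 seconds. A job with five chunks and chunks 0 to 2 checkpointed would resume at chunk 3.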

Two Transcription Modes

The core design decision in v2t is offering two independent paths for getting text from audio:

Cloud API Mode

You configure a Whisper-compatible API endpoint (OpenAI, Groq, a self-hosted instance, local LM Studio). v2t sends audio as a multipart HTTP request and gets text back. The API key lives in the OS keychain — Windows Credential Manager or macOS Keychain — and is never written to any configuration file.

Files larger than 22 MB are automatically split by ffmpeg into 8-minute chunks, transcribed in sequence, and reassembled into a single output file. The split is invisible to the user: the result is one transcript per input file.
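The chunk boundaries implied by those numbers can be sketched as follows. The 22 MB threshold and 8-minute length come from the description above; the function name `plan_chunks`, the use of binary megabytes, and the whole-second granularity are assumptions for illustration.

```rust
// Threshold and chunk length from the post; binary megabytes are an assumption.
const SPLIT_THRESHOLD_BYTES: u64 = 22 * 1024 * 1024;
const CHUNK_SECONDS: u64 = 8 * 60;

/// Returns (start, length) pairs in seconds. Files at or below the
/// threshold are returned as a single chunk covering the whole duration.
fn plan_chunks(file_size_bytes: u64, duration_seconds: u64) -> Vec<(u64, u64)> {
    if file_size_bytes <= SPLIT_THRESHOLD_BYTES {
        return vec![(0, duration_seconds)];
    }
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < duration_seconds {
        // The final chunk is shorter when the duration is not a multiple of 8 minutes.
        let len = CHUNK_SECONDS.min(duration_seconds - start);
        chunks.push((start, len));
        start += len;
    }
    chunks
}
```

For a 3-hour recording (10,800 seconds) above the threshold, this yields 23 chunks, the last one 4 minutes long. Each pair could be passed to ffmpeg's `-ss` (start) and `-t` (duration) options, though how v2t actually invokes ffmpeg for splitting is not described here.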

Local Offline Mode

Select a whisper.cpp model in Settings (tiny, base, small, medium, or large-v3-turbo). v2t downloads the model file from the official Hugging Face repository and verifies the SHA-1 checksum before marking it ready. After that, all transcription happens locally — ffmpeg normalizes audio, whisper-cli processes it, v2t reads the output. No network traffic, no API calls, no ongoing cost.

Real-time progress comes from parsing whisper-cli's stderr output: the percentage advances as the model processes the audio. The same chunking and resume logic applies as in cloud mode.
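A minimal sketch of that kind of stderr parsing is shown below. It assumes progress lines of the form `whisper_print_progress_callback: progress =  40%`; the exact format depends on the whisper.cpp version and flags in use, and `parse_progress` is a hypothetical helper, not v2t's own function.

```rust
/// Extracts a percentage from a stderr line containing "progress = NN%".
/// The line format is an assumption about whisper-cli's output.
/// Returns None for lines that do not match or for values above 100.
fn parse_progress(line: &str) -> Option<u8> {
    let (_, rest) = line.split_once("progress =")?;
    let digits = rest.trim().strip_suffix('%')?;
    let value: u8 = digits.trim().parse().ok()?;
    (value <= 100).then_some(value)
}
```

Lines that are not progress reports return `None` and can be forwarded to the log view unchanged.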

Technical Choices

SIMS v2t is built with Tauri 2: a Rust backend running as a native process and a React + TypeScript frontend rendered in the system webview. Because it uses the operating system's webview rather than bundling a browser engine as Electron does, the application has a small binary footprint, and the Rust backend has direct access to system APIs.

The Rust backend handles:

  • Process spawning for ffmpeg, yt-dlp, and whisper-cli with proper cancellation (CancellationToken + process tree kill)
  • HTTP requests to the transcription API via reqwest with streaming progress
  • File I/O, chunk management, and resume logic
  • OS keyring integration for API key storage

The React frontend handles:

  • Queue management and status display
  • Settings UI with separate tabs for queue and configuration
  • Real-time log streaming from Tauri events
  • Download progress for tools and models

All external processes are called with explicit argument arrays, with no shell interpolation of user input, which eliminates the class of shell-injection vulnerabilities common in tools that build command strings.
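The argument-array approach looks roughly like this with `std::process::Command`. The specific ffmpeg options (16 kHz, mono, 16-bit PCM WAV, the input format whisper.cpp expects) and the function names are assumptions for the example; the post does not list the exact flags v2t passes.

```rust
use std::ffi::OsString;
use std::path::Path;
use std::process::Command;

/// Builds the argument list for converting `input` to 16 kHz mono WAV.
/// Each argument is a separate element, so a file name containing spaces,
/// quotes, or `$(...)` reaches ffmpeg as one literal argument.
fn ffmpeg_normalize_args(input: &Path, output: &Path) -> Vec<OsString> {
    vec![
        OsString::from("-y"), // overwrite the output file if it exists
        OsString::from("-i"),
        input.as_os_str().to_os_string(),
        OsString::from("-ar"),
        OsString::from("16000"),
        OsString::from("-ac"),
        OsString::from("1"),
        OsString::from("-c:a"),
        OsString::from("pcm_s16le"),
        output.as_os_str().to_os_string(),
    ]
}

/// Prepares (but does not run) the ffmpeg process. No shell is involved:
/// the arguments are handed to the OS process-spawning API directly.
fn ffmpeg_command(input: &Path, output: &Path) -> Command {
    let mut cmd = Command::new("ffmpeg");
    cmd.args(ffmpeg_normalize_args(input, output));
    cmd
}
```

A file named `talk; rm -rf ~ $(whoami).mp4` is passed through as a single, inert argument. The same string concatenated into a shell command line would be parsed as several commands.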

Who Is This For

Honestly, anyone who regularly deals with audio or video content and needs the text:

  • Researchers transcribing interview recordings without uploading them to external services
  • Journalists turning recorded sources into searchable text
  • Content creators generating transcripts from YouTube videos for repurposing
  • Teams processing meeting recordings in batch
  • Developers who want a reliable transcription pipeline without building one

The offline mode makes it particularly useful for anyone handling sensitive recordings that shouldn't leave the local machine.

Getting Started

Download from GitHub Releases. The first-run Setup Guide walks through:

  1. Setting an output folder
  2. Downloading ffmpeg + yt-dlp (one button on Windows/macOS)
  3. Choosing transcription mode — cloud API or local whisper
  4. Entering an API key or downloading a whisper model

After that: drag a file or paste a URL, click Start, get your transcript.

Open Source

SIMS v2t is MIT-licensed. The full source is on GitHub. Contributions welcome.

SIMS v2t is free, open-source software by SIMS Tech (MIT License). Source code: github.com/vglu/v2t