Free AI Audio Transcription — 100% Private, Runs in Your Browser
What Is AI Audio Transcription?
AI audio transcription is the process of converting spoken audio into written text using a machine learning model. The model listens to audio frames, identifies phonemes, words, and sentence boundaries, and produces a timestamped text output — automatically, without human transcriptionists.
Modern AI transcription tools are powered by large neural network models trained on hundreds of thousands of hours of labeled audio in dozens of languages. The resulting accuracy on clean audio rivals professional human transcription at a fraction of the time and cost.
The key differentiator between tools is where that model runs: on your device (local) or on a company's server (cloud). This tool runs Whisper entirely in your browser.
How OpenAI Whisper Works
This tool uses OpenAI Whisper — an open-source automatic speech recognition (ASR) model published by OpenAI and available in multiple sizes optimized for different accuracy/speed trade-offs.
Whisper is a transformer-based encoder-decoder architecture trained in a weakly supervised manner on ~680,000 hours of multilingual audio scraped from the web. The training diversity makes it exceptionally robust across accents, recording conditions, and speaking styles compared to models trained on narrow studio datasets.
Technical pipeline
- Audio input — the uploaded file or microphone recording is loaded into browser memory as a binary blob.
- Decoding and resampling — the Web Audio API decodes the audio (regardless of original codec) and resamples it to 16 kHz mono, the input format Whisper expects. This resampling happens entirely in-browser using `AudioContext({ sampleRate: 16000 })`.
- Float32 conversion — the decoded PCM data is converted to a `Float32Array` and transferred to a Web Worker via a zero-copy `ArrayBuffer` transfer.
- Log-mel spectrogram — inside the worker, a preprocessing step converts the audio frames into a log-mel spectrogram: 80 frequency bins × time frames. This is the representation of frequency energy over time that Whisper's encoder was trained on.
- Encoder pass — the spectrogram passes through the encoder transformer to produce audio embeddings.
- Decoder pass — the decoder autoregressively generates token sequences from the embeddings. Each token is a word piece (BPE subword), decoded into text using the Whisper vocabulary.
- Timestamp prediction — with `return_timestamps: true`, the decoder also predicts segment start and end times in the audio, enabling SRT subtitle export.
- Chunking for long audio — audio longer than 30 seconds is split into overlapping 30-second windows with a 5-second stride. The stride overlap prevents words from being cut off at chunk boundaries. Results are concatenated and timestamps are offset per chunk.
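The chunking step can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes adjacent windows simply overlap by the stride length (a real pipeline may apply the stride on both sides of each window), and the names `chunkWindows` and `offsetSegments` are hypothetical.

```typescript
interface AudioWindow { start: number; end: number }          // seconds in the full recording
interface Segment { start: number; end: number; text: string } // seconds relative to a window

// Split a recording into fixed-length windows that overlap by `overlapSeconds`.
// Assumption: the step between window starts is chunk length minus overlap.
function chunkWindows(totalSeconds: number, chunkSeconds = 30, overlapSeconds = 5): AudioWindow[] {
  if (chunkSeconds <= overlapSeconds) {
    throw new Error("chunkSeconds must be larger than overlapSeconds");
  }
  const step = chunkSeconds - overlapSeconds;
  const windows: AudioWindow[] = [];
  for (let start = 0; start < totalSeconds; start += step) {
    const end = Math.min(start + chunkSeconds, totalSeconds);
    windows.push({ start, end });
    if (end >= totalSeconds) break; // the last window already reaches the end
  }
  return windows;
}

// Shift window-relative timestamps so they refer to the full recording.
function offsetSegments(segments: Segment[], windowStart: number): Segment[] {
  return segments.map((s) => ({
    text: s.text,
    start: s.start + windowStart,
    end: s.end + windowStart,
  }));
}
```

Under these assumptions a 70-second recording yields three windows: 0–30 s, 25–55 s, and 50–70 s. Text that appears twice in an overlap region still has to be de-duplicated when results are merged, which this sketch leaves out.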
Why ONNX in the browser?
The Whisper model is distributed as an ONNX (Open Neural Network Exchange) graph, a framework-agnostic representation that runs on WebGPU (GPU acceleration), WebAssembly (CPU fallback), or native runtimes without requiring separate model variants. One model file runs across all devices and browsers.
Two Model Sizes: Fast vs Accurate
| | Whisper tiny.en | Whisper base |
|---|---|---|
| Download size | ~75 MB | ~145 MB |
| Languages | English only | Multilingual (99 languages) |
| Typical speed (WASM) | 5–15s per minute of audio | 15–30s per minute of audio |
| Typical speed (WebGPU) | 2–5s per minute of audio | 5–10s per minute of audio |
| Best for | Clear English, short clips | Accents, non-English, noisy audio |
| Accuracy (clean English) | Very good | Excellent |
| Accuracy (noisy/accented) | Moderate | Good |
Recommendation: Start with Fast (tiny.en). Switch to Accurate (base) if you notice missed words, garbled names, or poor punctuation on challenging audio.
Each model is cached after its first download. Switching to a model you have not used before triggers one additional download for that variant.
How to Use the Audio Transcription Tool
Option 1 — File upload
- Click the upload area or drag and drop an audio file (MP3, WAV, M4A, WebM, OGG, or FLAC).
- The tool decodes and resamples the audio, then begins transcription automatically.
- A progress bar shows estimated completion percentage.
Option 2 — Microphone recording
- Click the Microphone tab and tap the red record button.
- Allow browser microphone access when prompted. Recording begins immediately.
- A timer shows elapsed recording time.
- Click Stop — the recording is processed automatically.
- Click Re-record to discard and start over.
Reviewing and exporting
After transcription completes:
- Edit the transcript in the text box — fix names, technical terms, punctuation.
- Copy text copies the full transcript to clipboard.
- Download .txt saves a plain text file.
- Download .srt saves a SubRip subtitle file with timestamps for each segment.
What Is an SRT File?
SRT (SubRip Text) is the most widely supported subtitle format. It looks like this:
1
00:00:00,000 --> 00:00:04,200
Welcome to today's meeting. Let's get started.

2
00:00:04,200 --> 00:00:09,800
The first item on the agenda is the Q3 budget review.

3
00:00:09,800 --> 00:00:15,400
We're tracking about 12% under forecast, which gives us room to discuss...
Each subtitle block has a sequence number, a timecode range in `HH:MM:SS,mmm --> HH:MM:SS,mmm` format (a comma precedes the three-digit milliseconds), and the text. Blocks are separated by a blank line. SRT files are importable into:
- Video editors: Premiere Pro, Final Cut Pro, DaVinci Resolve, CapCut
- YouTube: Upload as closed captions
- Subtitle tools: Aegisub, Subtitle Edit
- Video players: VLC, QuickTime (with extensions)
This tool exports segment-level timestamps (not word-level). For precise subtitle sync, use a subtitle editor to fine-tune individual cue timings after export.
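To make the format concrete, here is a small sketch of how timestamped segments could be serialized to SRT. The `Segment` shape and the function names `formatSrtTime` and `toSrt` are assumptions for illustration, not the tool's actual export code.

```typescript
interface Segment { start: number; end: number; text: string } // times in seconds

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Convert seconds to the SRT timecode form HH:MM:SS,mmm.
function formatSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000)); // round to whole milliseconds
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${formatSrtTime(seg.start)} --> ${formatSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```

Calling `toSrt([{ start: 0, end: 4.2, text: "Welcome to today's meeting." }])` produces a single cue whose timecode line reads `00:00:00,000 --> 00:00:04,200`.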
Common Use Cases
Meeting and call transcription
Meeting recordings are the highest-volume private transcription use case. Zoom, Google Meet, and Teams all allow local recording. Uploading those recordings to cloud transcription services means your internal business conversations — strategy discussions, performance reviews, client calls — sit on a third-party server.
Transcribing locally means:
- No terms-of-service clause about training on your data
- No legal risk of sensitive business information leaving your organization
- No subscription cost per hour of audio
The output is a clean transcript you can paste into meeting notes, feed to a summarization tool, or search with Ctrl+F.
Podcast and video content
Podcast show notes and video descriptions drive a significant share of search traffic for content creators. A 45-minute episode without a transcript is invisible to search engines. Transcribing and publishing the full transcript (or key sections) makes the content discoverable by readers who prefer text.
For video, the SRT export goes directly into your video editor or YouTube's caption upload to add closed captions — improving accessibility and SEO simultaneously.
Voice memos and notes
Voice memos are one of the fastest ways to capture ideas during a commute, walk, or meeting-free block. But audio notes are nearly unsearchable without transcription. At the speeds listed for Whisper tiny.en, converting a 3-minute voice memo takes roughly 15–45 seconds on WASM (less with WebGPU), producing searchable text you can paste into Notion, Obsidian, or any notes app.
Accessibility captions
Pre-recorded video content published without captions excludes the estimated 15% of users who have hearing loss. WCAG 2.1 requires captions for pre-recorded audio in video at Level A (Success Criterion 1.2.2), so Level AA conformance requires them as well. Generating and editing SRT captions locally can be faster than waiting for auto-captions on YouTube (which can take hours to publish) and more private than uploading sensitive internal videos to caption services.
Legal and medical notes
Interviews, depositions, and medical consultations often involve sensitive personal information subject to privacy regulations (GDPR, HIPAA, attorney-client privilege). Uploading to a cloud transcription service, even a reputable one, creates a data handling obligation and risk exposure. Local transcription removes the third-party upload: the audio never leaves the device. Your own device security and retention obligations still apply.
Privacy Comparison: Local vs Cloud Transcription
| | This Tool (Local) | Otter.ai | Fireflies | Rev.com | Zoom AI Notes |
|---|---|---|---|---|---|
| Audio uploaded? | No | Yes | Yes | Yes | Yes (to Zoom) |
| Data retained? | Never | Per TOS | Per TOS | Per TOS | Per TOS |
| HIPAA compliant? | No audio leaves the device; compliance still depends on your own safeguards | Paid plan only | Paid plan only | Paid plan | Zoom Business+ |
| Free tier | Unlimited | 300 min/month | Limited | Pay per minute | Included w/ Zoom |
| Works offline? | Yes (after first load) | No | No | No | No |
| Diarization? | No | Yes (paid) | Yes (paid) | Yes | Yes |
| Speaker labels? | No | Yes | Yes | Yes | Yes |
For single-speaker audio or cases where data privacy matters, local Whisper is the default choice. For multi-speaker recordings requiring labeled diarization, a cloud service with appropriate data handling agreements is currently necessary.
Accuracy: What to Expect
Factors that improve accuracy
- Clean audio — minimal background noise, no music, no echo
- Consistent microphone distance — 20–30 cm from the speaker
- Normal speaking pace — Whisper handles natural speech rhythm better than very fast delivery
- Standard English pronunciation — tiny.en was trained predominantly on standard American and British English
Factors that reduce accuracy
- Background noise — cafes, open offices, outdoor environments add noise the model has to work around
- Multiple speakers overlapping — Whisper handles overlapping voices poorly; the loudest voice tends to take over the transcript
- Heavy accents — tiny.en's training data skews toward standard English accents; switch to base for multilingual or accent-heavy audio
- Technical vocabulary — domain-specific terms (medical, legal, engineering) often get approximated to common words. Post-editing is expected for specialized content.
- Very low audio levels — quiet recordings may cause the model to miss words entirely or produce random output (a known Whisper behavior called "hallucination on silence")
Whisper hallucination on silence
One known Whisper behavior: when audio is very quiet (below a noise threshold), the model may generate confident-sounding but entirely fabricated text — often phrases like "Thank you for watching", "Subtitles by...", or random punctuation. This is called hallucination.
If your transcript contains phrases that don't correspond to anything you said, check whether:
- The recording volume is very low (check your OS input levels)
- The microphone captured mostly silence or ambient noise
- The audio file is damaged or has significant silent sections
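One way to run these checks programmatically is to measure how much of the decoded audio is near-silent before transcribing. The sketch below is illustrative only: the function names, the 0.5-second frame length, and the 0.01 RMS threshold are assumptions, not values used by this tool or by Whisper.

```typescript
// Root-mean-square level of a block of samples in the range [-1, 1].
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  return Math.sqrt(sumSquares / samples.length);
}

// Fraction of fixed-length frames whose RMS falls below `threshold`.
// Returns a value between 0 (no quiet frames) and 1 (everything is quiet).
function silentFraction(
  samples: Float32Array,
  sampleRate: number,
  frameSeconds = 0.5,   // hypothetical frame length
  threshold = 0.01,     // hypothetical "near silence" level
): number {
  if (samples.length === 0) return 1; // no audio at all counts as silent
  const frameLength = Math.max(1, Math.floor(sampleRate * frameSeconds));
  let frames = 0;
  let silentFrames = 0;
  for (let start = 0; start < samples.length; start += frameLength) {
    const end = Math.min(start + frameLength, samples.length);
    frames++;
    if (rms(samples.subarray(start, end)) < threshold) silentFrames++;
  }
  return silentFrames / frames;
}
```

A high silent fraction suggests that unexpected phrases in the transcript are more likely hallucinations than real speech, so those passages deserve a manual listen.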
WebGPU vs WASM: Backend Performance
| | WebGPU | WASM (CPU) |
|---|---|---|
| Browser support | Chrome 113+, Edge 113+, Chrome Android 121+ | All modern browsers |
| Whisper tiny.en speed | 2–5s per audio minute | 5–15s per audio minute |
| Whisper base speed | 5–10s per audio minute | 15–30s per audio minute |
| Memory | GPU VRAM | System RAM |
| Typical 30-min recording (tiny.en, WASM) | — | 2–7 min processing |
| Typical 30-min recording (tiny.en, WebGPU) | 1–3 min processing | — |
The tool shows which backend it used in the result badge. On Chrome with a discrete GPU, WebGPU acceleration typically delivers roughly 2.5–3× faster transcription than WASM, consistent with the ranges in the table above.
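The table's per-minute figures can be turned into a rough processing-time estimate. This is a sketch under stated assumptions: the numbers are copied from the table above, and the names `estimateProcessingSeconds` and `pickBackend` are hypothetical. In a browser, the `hasWebGpu` flag would typically come from checking whether `navigator.gpu` is defined.

```typescript
type Model = "tiny.en" | "base";
type Backend = "webgpu" | "wasm";

// [fastest, slowest] seconds of processing per minute of audio, from the table above.
const SECONDS_PER_AUDIO_MINUTE: Record<Model, Record<Backend, [number, number]>> = {
  "tiny.en": { webgpu: [2, 5], wasm: [5, 15] },
  "base": { webgpu: [5, 10], wasm: [15, 30] },
};

// Prefer WebGPU when it is available, otherwise fall back to WASM.
function pickBackend(hasWebGpu: boolean): Backend {
  return hasWebGpu ? "webgpu" : "wasm";
}

// Estimated processing time range, in seconds, for a recording of the given length.
function estimateProcessingSeconds(
  model: Model,
  backend: Backend,
  audioMinutes: number,
): { min: number; max: number } {
  const [fastest, slowest] = SECONDS_PER_AUDIO_MINUTE[model][backend];
  return { min: fastest * audioMinutes, max: slowest * audioMinutes };
}
```

For a 30-minute recording with tiny.en on WASM this gives 150–450 seconds, which matches the table's "2–7 min processing" row after rounding.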
Privacy and Data Handling
The complete data flow for this tool:
- You select a file or record from microphone → audio is loaded into a JavaScript `Blob` in browser memory
- The `Blob` is decoded by `AudioContext` and resampled to 16 kHz mono, with no network request
- The resulting `Float32Array` is zero-copy transferred to a Web Worker as an `ArrayBuffer` transferable
- The worker runs the Whisper ONNX model and produces a text result
- The text is transferred back to the main thread and displayed
- When you close the tab, all in-memory data is discarded
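To illustrate what "resampled to 16 kHz mono" means in the flow above, here is a simplified sketch of the two operations on raw sample arrays. The browser's `AudioContext` performs this natively and with better filtering; the functions `downmixToMono` and `resampleLinear` are hypothetical names used only for illustration.

```typescript
// Average all channels into one mono channel.
// Assumes every channel has the same number of samples.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// Resample by linear interpolation between neighbouring input samples.
// Note: real resamplers low-pass filter first to avoid aliasing when downsampling.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (input.length === 0) return new Float32Array(0);
  const outLength = Math.round((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const last = input.length - 1;
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / toRate;        // position in input samples
    const i0 = Math.min(Math.floor(pos), last); // sample at or before pos
    const i1 = Math.min(i0 + 1, last);          // next sample, clamped at the end
    const frac = pos - i0;
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}
```

For example, a stereo 48 kHz recording would first be passed through `downmixToMono`, then through `resampleLinear(mono, 48000, 16000)`, giving one third as many samples.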
No audio data is transmitted over any network connection at any point. The browser's developer tools Network panel shows zero audio-related requests when you transcribe a file.
The model itself is downloaded once from huggingface.co via jsDelivr CDN and cached in browser cache storage. After that initial download, no external requests are made during transcription.
Tips for Better Recordings
If you are recording with the microphone feature:
Position the microphone correctly. Built-in laptop microphones pick up keyboard noise and fan noise. A USB microphone or wired headset positioned close to your mouth produces significantly cleaner audio.
Minimize background noise. Close windows, pause fans, and move to a quieter space if possible. Even a slight reduction in ambient noise has a measurable impact on Whisper's word error rate.
Speak clearly at a consistent pace. Rapid speech and very soft speech are the two most common sources of transcription error. A slightly deliberate pace (not unnaturally slow, just clear) helps the model distinguish word boundaries.
Record at adequate volume. Check that your OS input level meter shows consistent activity when you speak. If the level is barely moving, increase microphone gain before recording.
Use WAV for highest quality. If you have a choice of export format from another tool, WAV (uncompressed PCM) preserves full audio quality. MP3 at 128 kbps+ is also fine. Heavily compressed M4A at very low bitrates may degrade accuracy.
How This Compares to Whisper Desktop Apps
OpenAI's Whisper model is available as:
- This tool — browser-based, zero-install, private
- Whisper.app / MacWhisper — macOS native apps using Core ML or Metal acceleration, faster on Apple Silicon
- whisper.cpp — CPU-optimized C++ port, runs on any OS via terminal
- Whisper via Python (openai-whisper) — original implementation, requires Python environment
The browser version trades some raw performance for zero-install convenience and cross-platform compatibility. On a modern MacBook with WebGPU, in-browser Whisper is within 2–3× of MacWhisper's speed — reasonable for occasional use without installing software. For daily high-volume transcription, a native app is faster.
Pair With Other Privacy-First Tools
After transcribing a meeting or call, you often need to sanitize the text before sharing — redacting personal information, names, phone numbers, or email addresses. Use our PII Redactor to automatically detect and mask sensitive data in the transcript before copying it to a shared document.
For long transcripts, use our Word Counter to estimate reading time and check length before publishing as a blog post or show notes. Our Markdown Editor lets you format and preview the transcript as structured content with headers and sections before export.