The number one complaint about AI transcription tools like Whisper is background music generating phantom subtitles — gibberish text, misheard song lyrics, or repeated phrases appearing where nobody is actually speaking. If you've ever run a video through Whisper and gotten lines like "Thank you for watching" or random English words during an instrumental intro, you know the problem. GeekLink solves this with built-in VAD (Voice Activity Detection) that automatically detects and mutes non-speech segments before transcription, so your subtitles only contain what people actually say.
Background music is the silent killer of AI transcription quality. When you feed a video with background music into Whisper or any speech-to-text engine, the model doesn't know the difference between a human voice and a guitar riff. It tries to transcribe everything it hears, and when there's no speech to transcribe, it hallucinates — generating phantom subtitles that range from nonsensical fragments to confident-sounding sentences that nobody ever said. You'll see things like song lyrics (sometimes in the wrong language), repeated phrases like "Thank you" or "Subscribe," or complete gibberish that looks like a fever dream.
This hits almost every type of video content: YouTube videos with intro/outro music, podcasts with jingle transitions, variety shows with constant BGM, corporate training videos with background tracks, vlogs with licensed music, wedding videos with DJ sets, and gaming streams with in-game soundtracks. The more prominent the music, the worse the hallucinations get. Even quiet background music can trigger phantom subtitles during pauses in speech.
Why does this happen? Whisper and similar models are trained to turn audio into text, and they expect speech to be present. When the audio contains music but no speech, the model rarely outputs nothing. Instead, it looks for patterns that resemble speech and generates its best guess, which is almost always wrong. Without any pre-filtering to tell the model "there's no speech here, skip this part," it can hallucinate text throughout a music-only stretch. Reddit threads are full of people asking variations of "Why does Whisper keep transcribing my background music as random English words?" and "How do I stop phantom subtitles in music segments?" The answer is pre-filtering with VAD.
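As a rough illustration of what "pre-filtering" means, the sketch below silences every sample outside the detected speech ranges before the audio would reach a transcription model. It is a minimal sketch, not GeekLink's implementation: the function name `mute_non_speech` is hypothetical, and the range format (dicts with `start` and `end` sample indices) is an assumption chosen for illustration.

```python
from typing import Dict, List


def mute_non_speech(samples: List[float],
                    speech_ranges: List[Dict[str, int]]) -> List[float]:
    """Return a copy of `samples` with everything outside the speech ranges zeroed.

    `speech_ranges` is assumed to hold dicts with 'start' (inclusive) and
    'end' (exclusive) sample indices, as a VAD step might produce.
    """
    muted = [0.0] * len(samples)  # start from pure silence
    for r in speech_ranges:
        # Clamp each range to the audio length so bad input cannot overflow.
        start = max(0, r["start"])
        end = min(len(samples), r["end"])
        if start < end:
            muted[start:end] = samples[start:end]  # copy speech back in
    return muted


# Toy audio: 10 samples; pretend samples 4..7 contain a voice.
audio = [0.5, -0.4, 0.3, -0.2, 0.9, 0.8, -0.7, 0.6, 0.1, -0.1]
speech = [{"start": 4, "end": 8}]
cleaned = mute_non_speech(audio, speech)
print(cleaned)
```

The audio keeps its original length, so subtitle timestamps produced afterward still line up with the video.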
The manual workaround is painful: open the video in Audacity, identify and strip music-only segments, apply noise reduction filters, export the cleaned audio, then re-import into your transcription tool. This is tedious enough for a single video — it takes 15-30 minutes of careful audio editing before you even start transcribing. For anyone processing multiple videos, it's completely impractical. If you have 50 YouTube videos or a season of a show to subtitle, spending 15 minutes per video on audio prep alone adds up to over 12 hours of manual work.
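The arithmetic behind that estimate is easy to check. The helper name `prep_hours` is hypothetical; the inputs are the 50 videos and the 15-30 minute range quoted above.

```python
def prep_hours(videos: int, minutes_per_video: float) -> float:
    """Total manual audio-prep time in hours."""
    return videos * minutes_per_video / 60


low = prep_hours(50, 15)   # best case: 15 minutes per video
high = prep_hours(50, 30)  # worst case: 30 minutes per video
print(f"{low} to {high} hours of manual prep")
```

At the low end this gives 12.5 hours, and at the high end 25 hours, before any transcription has started.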
Cloud transcription services charge per minute of audio and most of them have the exact same background music problem. You're paying to transcribe music that shouldn't be transcribed. Some services offer VAD as a premium add-on, but you're still uploading your videos to someone else's servers and paying ongoing fees. Most desktop Whisper GUIs don't include VAD at all — they just pass the raw audio straight to Whisper and hope for the best.
Sung vocals are kept. VAD detects vocal activity, including singing, so if a person is singing in your video, those segments are kept and transcribed. VAD filters out instrumental music, sound effects, and other non-vocal audio. If your video has a singer performing over a backing track, the vocal segments are preserved while purely instrumental breaks are filtered out.
If a podcast opens with a music-only intro, VAD mutes that segment automatically, and transcription starts when the host begins speaking. If the podcast plays music underneath speech (a common technique for transitions), VAD keeps those segments active because it detects the human voice over the music. The speech recognition model handles speech over music reasonably well. It is the music-only segments that cause hallucinations, and those are what VAD eliminates.
Voice Activity Detection analyzes the audio waveform to classify each segment as speech or non-speech. GeekLink uses Silero VAD, a neural network model specifically trained for this task. It runs locally on your Mac and processes audio in real time, producing a map of which time ranges contain human voice. Only those ranges are sent to the speech recognition engine. The model is highly accurate at distinguishing human voice from music, noise, applause, and silence.
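To make the "map of time ranges" concrete, here is a minimal sketch of how per-frame speech probabilities could be turned into speech ranges. Everything in it is an assumption for illustration: the function name `frames_to_ranges`, the 0.5 threshold, the half-second frame length, and the probabilities themselves are hypothetical and are not taken from Silero VAD or GeekLink.

```python
from typing import List, Optional, Tuple


def frames_to_ranges(probs: List[float],
                     frame_seconds: float,
                     threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Group consecutive frames at or above `threshold` into (start, end) times."""
    ranges: List[Tuple[float, float]] = []
    start: Optional[int] = None  # index of the first frame in the open range
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech begins
        elif p < threshold and start is not None:
            ranges.append((start * frame_seconds, i * frame_seconds))
            start = None  # speech ends
    if start is not None:
        # Audio ended while speech was still active; close the last range.
        ranges.append((start * frame_seconds, len(probs) * frame_seconds))
    return ranges


# Hypothetical per-frame speech probabilities: music, voice, music, voice.
probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.9]
speech_map = frames_to_ranges(probs, frame_seconds=0.5)
print(speech_map)
```

A production VAD would usually also merge ranges separated by very short gaps and pad each range slightly so word onsets are not clipped; this sketch leaves both out for brevity.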
VAD has a negligible effect on processing time. The analysis typically adds only a few seconds per video. In fact, it often makes the overall process faster, because the speech recognition engine skips the non-speech segments entirely and has less audio to process. The time saved by not having to manually review and delete phantom subtitles afterward far outweighs the minimal VAD overhead.
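A quick way to see why the recognizer has less work to do is to total the speech ranges against the full duration. The sketch below assumes non-overlapping `(start, end)` ranges in seconds; the function name `audio_to_transcribe` and the example numbers (a 10-minute video with a 20-second intro, a 20-second mid-roll jingle, and a 40-second outro) are made up for illustration.

```python
from typing import List, Tuple


def audio_to_transcribe(ranges: List[Tuple[float, float]],
                        total_seconds: float) -> Tuple[float, float]:
    """Return (seconds of speech, seconds skipped) for non-overlapping ranges."""
    speech = sum(end - start for start, end in ranges)
    return speech, total_seconds - speech


# Hypothetical 600-second video: speech from 0:20 to 4:40 and from 5:00 to 9:20.
speech_ranges = [(20.0, 280.0), (300.0, 560.0)]
speech, skipped = audio_to_transcribe(speech_ranges, total_seconds=600.0)
print(f"{speech:.0f}s sent to the recognizer, {skipped:.0f}s skipped")
```

In this made-up case, 80 of 600 seconds never reach the speech recognition engine, and those are exactly the stretches most likely to produce phantom subtitles.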
VAD pre-filtering is optional. You can toggle it off in GeekLink's settings if you prefer raw transcription output without any pre-filtering. This might be useful in rare cases where you intentionally want to transcribe non-speech audio, or for testing and comparison purposes. By default, VAD is enabled because it produces significantly cleaner results for the vast majority of videos.