How to Convert Speech to Text with OpenAI Whisper

OpenAI Whisper is an open-source speech recognition model that converts audio recordings into text. It handles many languages and common audio formats, which makes it well suited to transcribing meetings, interviews, or voice memos.

Install Python and pip. Download Python 3.8 or newer from python.org and install it on your system. The python.org installers include pip by default; on Windows, also check 'Add Python to PATH' during setup so that both commands work from Command Prompt. Verify the installation by opening Terminal or Command Prompt and typing 'python --version' and 'pip --version'. On macOS and Linux the commands may be named 'python3' and 'pip3' instead.
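Install Whisper via pip. Open Terminal or Command Prompt and run 'pip install openai-whisper'. The installation downloads the necessary dependencies, including PyTorch, and sets up the whisper command-line tool. Wait for the process to complete, which may take several minutes depending on your internet connection. Whisper also needs the ffmpeg command-line tool to read audio files, so install it with your system's package manager if it is not already present (for example 'brew install ffmpeg' on macOS or 'sudo apt install ffmpeg' on Ubuntu and Debian).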
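Before moving on, it can help to confirm that the pieces are in place. The following is a minimal sketch rather than an official check: the function name check_environment is made up for this guide, and it assumes that Whisper relies on an ffmpeg executable being available on your PATH, as the project's README states.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 8)):
    """Report whether the prerequisites for Whisper appear to be in place.

    Hypothetical helper for this guide; it only looks for the pieces,
    it does not install anything.
    """
    return {
        # The interpreter running this script meets the minimum version.
        "python_ok": sys.version_info[:2] >= min_python,
        # The openai-whisper package installs a module named "whisper".
        "whisper_installed": importlib.util.find_spec("whisper") is not None,
        # Whisper calls the ffmpeg executable to decode audio files.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

If any line prints NO, revisit the corresponding installation step before trying to transcribe.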
Prepare your audio file. Place your audio file in an accessible directory on your computer. Whisper decodes audio through ffmpeg, so it accepts common formats including MP3, WAV, M4A, and MP4. For the best results, use a recording with clear speech, little background noise, and adequate volume.
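If you have a folder of mixed files, a quick extension check can separate likely audio from everything else. This is a rough sketch: the helper name is_common_audio_file is invented for this guide, and the extension list covers only the formats named above, not everything ffmpeg can decode.

```python
from pathlib import Path

# Formats named in this guide; ffmpeg can usually decode many others as well.
COMMON_AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4"}


def is_common_audio_file(name):
    """Return True if the file name ends in one of the extensions above.

    The check is case-insensitive and looks only at the name,
    not at the file's actual contents.
    """
    return Path(name).suffix.lower() in COMMON_AUDIO_EXTENSIONS
```

For example, is_common_audio_file("Interview.MP3") is True, while is_common_audio_file("notes.txt") is False.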
Run basic transcription. Navigate to your audio file's directory in Terminal and execute 'whisper filename.mp3', replacing 'filename.mp3' with your actual file name. On first use, Whisper downloads its default model (which model that is depends on the installed version) and then begins transcription. It prints the transcript as it goes and, by default, writes the results to the current directory in several formats, including plain text and subtitle files.
Choose the appropriate model size. Select from five main model sizes: tiny, base, small, medium, or large. Recent releases also include a 'turbo' model, a faster variant of large. Larger models provide better accuracy but require more processing time and memory, from roughly 1 GB for tiny up to roughly 10 GB for large. Run 'whisper filename.mp3 --model large' for maximum accuracy or 'whisper filename.mp3 --model tiny' for the fastest processing.
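One way to reason about the trade-off is to start from the memory you have and work backwards. The sketch below is illustrative only: the function name largest_model_that_fits is hypothetical, and the memory figures are the approximate values listed in the Whisper project README, so treat them as rough guides rather than guarantees.

```python
# Approximate memory needs in GB, as listed in the Whisper project README.
# These are rough figures; actual usage varies with hardware and settings.
APPROX_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_that_fits(available_gb):
    """Pick the largest model whose approximate memory need fits the budget.

    Returns None if even the smallest model is estimated not to fit.
    """
    fitting = [name for name, need in APPROX_MEMORY_GB.items() if need <= available_gb]
    # Dicts preserve insertion order, so the last fitting entry is the largest model.
    return fitting[-1] if fitting else None
```

With 4 GB available this returns "small", which you would then pass to the command line as '--model small'.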
Specify output format and language. Control the output format using '--output_format' followed by txt, srt, vtt, tsv, or json, or use 'all' to write every format. Specify the audio language with '--language' and a two-letter code such as 'en' for English or 'es' for Spanish; if you omit it, Whisper detects the language automatically. Example: 'whisper audio.mp3 --language en --output_format srt' writes an SRT subtitle file.
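If you transcribe files regularly, you may prefer to assemble these commands from a script instead of typing them. The following is a minimal sketch that uses only the flags described in the steps above; the function names build_whisper_command and run_whisper are invented for this guide, and running it assumes the whisper command is installed and on your PATH.

```python
import subprocess


def build_whisper_command(audio_path, model=None, language=None, output_format=None):
    """Build the argument list for the whisper command-line tool.

    Options left as None are omitted, so Whisper falls back to its defaults.
    """
    cmd = ["whisper", str(audio_path)]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    if output_format:
        cmd += ["--output_format", output_format]
    return cmd


def run_whisper(audio_path, **options):
    """Run whisper on one file and raise an error if the command fails."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(build_whisper_command(audio_path, **options), check=True)
```

Calling run_whisper("audio.mp3", language="en", output_format="srt") is equivalent to typing 'whisper audio.mp3 --language en --output_format srt'. Passing the arguments as a list avoids quoting problems with file names that contain spaces.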
Use Whisper in Python scripts. Import whisper in Python with 'import whisper', load a model using 'model = whisper.load_model("base")', then transcribe with 'result = model.transcribe("audio.mp3")'. Access the full transcribed text through 'result["text"]' and the individual segments via 'result["segments"]'; each segment includes its own 'text' along with 'start' and 'end' times in seconds.
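Put together, the calls above might look like the following sketch. The whisper calls are the ones named in this step; the helper names format_timestamp, format_segments, and transcribe_with_timestamps are made up for this guide, and the timestamp layout is just one reasonable choice.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an HH:MM:SS.mmm string."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments):
    """Turn Whisper segments into lines like '[start -> end] text'."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]


def transcribe_with_timestamps(audio_path, model_name="base"):
    """Transcribe one file and return (full_text, timestamped_lines)."""
    # Imported here so the formatting helpers work even without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"], format_segments(result["segments"])


if __name__ == "__main__":
    # "audio.mp3" is a placeholder; replace it with your own file.
    text, lines = transcribe_with_timestamps("audio.mp3")
    print(text)
    print("\n".join(lines))
```

Loading the model is the slow part, so when transcribing several files it is usually worth calling whisper.load_model once and reusing the model for each file.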