Google Colab: Aligned Transcript
Running whisper-smith on Google Colab gives you a free GPU to speed up
speaker diarization and lets you process audio without installing anything locally.
Note
Before running, go to Runtime → Change runtime type and select T4 GPU.
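To confirm that a GPU is actually attached after changing the runtime type, a quick check along the following lines can help. This is a minimal sketch and not part of the notebook: `detect_gpu_name` is a hypothetical helper, and it assumes the `nvidia-smi` tool is on the PATH of GPU runtimes.

```python
import shutil
import subprocess
from typing import Optional


def detect_gpu_name() -> Optional[str]:
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    # Hypothetical helper: assumes nvidia-smi is present on GPU runtimes.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        completed = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if completed.returncode != 0:
        return None
    names = [line.strip() for line in completed.stdout.splitlines() if line.strip()]
    return names[0] if names else None


gpu_name = detect_gpu_name()
print(gpu_name or "No GPU detected; check Runtime → Change runtime type.")
```

If the message about a missing GPU appears, the runtime is probably still CPU-only and diarization will run more slowly.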
Prerequisites
In the Colab sidebar open Secrets (key icon) and add:
OPENAI_API_KEY: your OpenAI API key
HUGGINGFACE_TOKEN: your Hugging Face token
You must also accept the pyannote/speaker-diarization-community-1 model terms on Hugging Face before the pipeline can be downloaded.
Open in Colab
Click the badge to open the notebook directly in Google Colab — no download needed:
The notebook is also available in the repository at:
notebooks/colab_aligned_transcript.ipynb
Notebook walkthrough
Step 1 — Install whisper-smith
The notebook installs whisper-smith[colab] into an isolated target
directory under /content and creates a small CLI launcher. This avoids
rewriting Colab’s preinstalled packages while still making whisper-smith
available to shell commands in later cells.
import os
import shutil
import subprocess
import sys
from pathlib import Path
repo_url = "git+https://github.com/yeiichi/whisper-smith.git"
target_dir = Path("/content/whisper-smith-packages")
bin_dir = Path("/content/whisper-smith-bin")
shutil.rmtree(target_dir, ignore_errors=True)
shutil.rmtree(bin_dir, ignore_errors=True)
target_dir.mkdir(parents=True, exist_ok=True)
bin_dir.mkdir(parents=True, exist_ok=True)
install_command = [
    sys.executable,
    "-m",
    "pip",
    "install",
    "--target",
    str(target_dir),
    "--upgrade",
    f"whisper-smith[colab] @ {repo_url}",
    "torchvision==0.26.*",
]
completed = subprocess.run(
    install_command,
    capture_output=True,
    text=True,
)
if completed.stdout:
    print(completed.stdout)
if completed.stderr:
    print(completed.stderr)
completed.check_returncode()
target_path = str(target_dir)
if target_path not in sys.path:
    sys.path.insert(0, target_path)
launcher = bin_dir / "whisper-smith"
launcher.write_text(
    "#!/usr/bin/env python3\n"
    "import os\n"
    "import sys\n"
    "os.environ['MPLBACKEND'] = 'Agg'\n"
    f"target_path = {target_path!r}\n"
    "sys.path = [target_path] + [\n"
    "    path for path in sys.path\n"
    "    if 'site-packages' not in path and 'dist-packages' not in path\n"
    "]\n"
    "from whisper_smith.cli import main\n"
    "raise SystemExit(main())\n",
    encoding="utf-8",
)
launcher.chmod(0o755)
os.environ["PATH"] = f"{bin_dir}:{os.environ['PATH']}"
subprocess.run(["whisper-smith", "--help"], check=True)
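The launcher isolates whisper-smith from Colab's preinstalled packages by putting the target directory first on sys.path and dropping every entry that contains site-packages or dist-packages. The same filtering can be tried in isolation. In this sketch, `isolated_sys_path` is a hypothetical name and the example paths are made up for illustration.

```python
from typing import Iterable, List


def isolated_sys_path(target_path: str, paths: Iterable[str]) -> List[str]:
    """Put target_path first and drop site-packages/dist-packages entries."""
    kept = [
        path for path in paths
        if "site-packages" not in path and "dist-packages" not in path
    ]
    return [target_path] + kept


# Made-up search path resembling a typical Linux Python installation.
example_paths = [
    "/content",
    "/usr/lib/python3/dist-packages",
    "/usr/local/lib/python3.12/site-packages",
    "/usr/lib/python3.12",
]
result = isolated_sys_path("/content/whisper-smith-packages", example_paths)
print(result)
```

Standard-library directories stay importable, while the globally installed packages are hidden from the launcher process.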
Step 2 — Load credentials from Colab Secrets
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["HUGGINGFACE_TOKEN"] = userdata.get("HUGGINGFACE_TOKEN")
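Before starting a long run, it can be useful to confirm that both variables are set and non-empty. The following is a sketch only: `missing_keys` is a hypothetical helper, and it inspects a plain mapping so that it can be tried outside Colab.

```python
from typing import List, Mapping, Sequence

REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_TOKEN")


def missing_keys(
    env: Mapping[str, str],
    required: Sequence[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return the required variable names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Demonstration with a made-up environment; in the notebook, pass os.environ.
demo_env = {"OPENAI_API_KEY": "sk-example", "HUGGINGFACE_TOKEN": "  "}
print(missing_keys(demo_env))
```

An empty list means both credentials are present; any name in the list points to a secret that still needs to be added or granted notebook access.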
Step 3 — Upload audio
from google.colab import files
from pathlib import Path
uploaded = files.upload()
audio_path = Path(next(iter(uploaded)))
output_path = audio_path.with_suffix(".aligned.json")
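`Path.with_suffix` replaces only the last suffix, so the output name keeps the rest of the uploaded filename. The filenames below are illustrative and not taken from the notebook.

```python
from pathlib import Path

names = ["interview.m4a", "team.meeting.2024.wav"]
# Only the final extension is swapped for ".aligned.json".
outputs = [str(Path(name).with_suffix(".aligned.json")) for name in names]
for name, output in zip(names, outputs):
    print(f"{name} -> {output}")
```

Note that `next(iter(uploaded))` picks only the first uploaded file, so uploading several files at once processes just one of them.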
Step 4 — Run the pipeline
This is the direct Colab equivalent of the local CLI command:
whisper-smith audio.m4a --align --diarization-model pyannote/speaker-diarization-community-1 --output audio.aligned.json
In the notebook, the uploaded filename is substituted at runtime:
!whisper-smith "{audio_path}" --align --diarization-model pyannote/speaker-diarization-community-1 --output "{output_path}"
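The same invocation can also be built as an argument list and run with `subprocess`, which sidesteps shell quoting for filenames that contain spaces or quotes. This is a sketch: `build_command` is a hypothetical helper, and it uses only the flags shown above.

```python
import shlex
from pathlib import Path
from typing import List

DIARIZATION_MODEL = "pyannote/speaker-diarization-community-1"


def build_command(audio_path: Path, output_path: Path) -> List[str]:
    """Assemble the whisper-smith argument list used in Step 4."""
    return [
        "whisper-smith",
        str(audio_path),
        "--align",
        "--diarization-model",
        DIARIZATION_MODEL,
        "--output",
        str(output_path),
    ]


command = build_command(Path("my talk.m4a"), Path("my talk.aligned.json"))
print(shlex.join(command))  # shell-quoted form, for display only
# In the notebook: subprocess.run(command, check=True)
```

Because each argument is a separate list element, no quoting is needed when the list is passed to `subprocess.run`.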
Step 5 — Preview results
import json
data = json.loads(output_path.read_text(encoding="utf-8"))
for seg in data["segments"]:
    speaker = seg.get("speaker") or "UNKNOWN"
    print(f"[{seg['start']:6.2f}s – {seg['end']:6.2f}s] {speaker:12s} {seg['text'].strip()}")
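Beyond printing segments, the same fields can be aggregated, for example into total speaking time per speaker. The sketch below assumes only the segment keys used in the preview (`start`, `end`, `speaker`, `text`); `speaking_time` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def speaking_time(segments: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Sum segment durations in seconds for each speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        # Segments without a speaker are grouped under "UNKNOWN", as in the preview.
        speaker = seg.get("speaker") or "UNKNOWN"
        totals[speaker] += seg["end"] - seg["start"]
    return dict(totals)


# Made-up data in the shape read by the preview cell.
sample = {
    "segments": [
        {"start": 0.0, "end": 3.5, "speaker": "SPEAKER_00", "text": "Hello."},
        {"start": 3.5, "end": 5.0, "speaker": "SPEAKER_01", "text": "Hi."},
        {"start": 5.0, "end": 9.0, "speaker": None, "text": "(crosstalk)"},
        {"start": 9.0, "end": 10.5, "speaker": "SPEAKER_00", "text": "Right."},
    ]
}
totals = speaking_time(sample["segments"])
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker:12s} {seconds:6.2f}s")
```

In the notebook, `speaking_time(data["segments"])` would apply the same aggregation to the real output.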
Step 6 — Download the JSON
files.download(str(output_path))
Advanced: explicit GPU pipeline
For fine-grained control — such as specifying the number of speakers — load
the pyannote pipeline manually, move it to the GPU with .to(device), and
pass it via the pipeline= argument. This also avoids re-downloading the
model if you run diarization multiple times in the same session.
import os
import torch
from pyannote.audio import Pipeline
from whisper_smith.align import assign_speakers
from whisper_smith.diarize import diarize_audio
from whisper_smith.exporters import export_json
from whisper_smith.transcribe import transcribe_audio
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
diarize_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=os.environ["HUGGINGFACE_TOKEN"],
)
diarize_pipeline.to(device)
transcript = transcribe_audio(audio_path)
diarization = diarize_audio(audio_path, pipeline=diarize_pipeline, num_speakers=2)
aligned = assign_speakers(transcript, diarization)
output_path.write_text(export_json(aligned), encoding="utf-8")
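The benefit of loading the pipeline once per session can be illustrated with a small cache. This is a sketch under stated assumptions, not whisper-smith's own mechanism: `get_pipeline` is a hypothetical helper, and `fake_loader` stands in for a function that would call `Pipeline.from_pretrained(...)` and move the result to the device.

```python
from typing import Any, Callable, Dict, List

_pipeline_cache: Dict[str, Any] = {}


def get_pipeline(model_id: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached pipeline for model_id, calling loader only on first use."""
    if model_id not in _pipeline_cache:
        _pipeline_cache[model_id] = loader(model_id)
    return _pipeline_cache[model_id]


# Stand-in loader that records how often it is called.
load_calls: List[str] = []


def fake_loader(model_id: str) -> dict:
    load_calls.append(model_id)
    return {"model": model_id}


model_id = "pyannote/speaker-diarization-community-1"
first = get_pipeline(model_id, fake_loader)
second = get_pipeline(model_id, fake_loader)
print(first is second, len(load_calls))
```

Both calls return the same object and the loader runs once, which is the behavior that keeps repeated diarization runs from reloading the model.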