Google Colab: Aligned Transcript

Running whisper-smith on Google Colab gives you a free GPU to speed up speaker diarization and lets you process audio without installing anything locally.

Note

Before running, go to Runtime → Change runtime type and select T4 GPU.
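
To confirm that the runtime really has a GPU attached before starting, one option is to query nvidia-smi, which is available on Colab GPU runtimes. This is a minimal sketch and not part of the notebook; the helper name gpu_name is hypothetical.

```python
import shutil
import subprocess


def gpu_name():
    """Return the name of the first NVIDIA GPU, or None if none is visible."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


name = gpu_name()
print(name or "No GPU detected: check Runtime → Change runtime type")
```

If the message about a missing GPU appears, change the runtime type and rerun the cell before installing anything, because switching runtimes resets the session.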

Prerequisites

In the Colab sidebar open Secrets (key icon) and add:

  • OPENAI_API_KEY — your OpenAI API key

  • HUGGINGFACE_TOKEN — your Hugging Face token

You must also accept the pyannote/speaker-diarization-community-1 model terms on Hugging Face before the pipeline can be downloaded.

Open in Colab

Click the badge to open the notebook directly in Google Colab — no download needed:

Open in Colab

The notebook is also available in the repository at:

notebooks/colab_aligned_transcript.ipynb

Notebook walkthrough

Step 1 — Install whisper-smith

The notebook installs whisper-smith[colab] into an isolated target directory under /content and creates a small CLI launcher script. This avoids overwriting Colab’s preinstalled packages while still making the whisper-smith command available to shell commands in later cells.

import os
import shutil
import subprocess
import sys
from pathlib import Path

repo_url = "git+https://github.com/yeiichi/whisper-smith.git"
target_dir = Path("/content/whisper-smith-packages")
bin_dir = Path("/content/whisper-smith-bin")
shutil.rmtree(target_dir, ignore_errors=True)
shutil.rmtree(bin_dir, ignore_errors=True)
target_dir.mkdir(parents=True, exist_ok=True)
bin_dir.mkdir(parents=True, exist_ok=True)

install_command = [
    sys.executable,
    "-m",
    "pip",
    "install",
    "--target",
    str(target_dir),
    "--upgrade",
    f"whisper-smith[colab] @ {repo_url}",
    "torchvision==0.26.*",
]
completed = subprocess.run(
    install_command,
    capture_output=True,
    text=True,
)
if completed.stdout:
    print(completed.stdout)
if completed.stderr:
    print(completed.stderr)
completed.check_returncode()

target_path = str(target_dir)
if target_path not in sys.path:
    sys.path.insert(0, target_path)

launcher = bin_dir / "whisper-smith"
launcher.write_text(
    "#!/usr/bin/env python3\n"
    "import os\n"
    "import sys\n"
    "os.environ['MPLBACKEND'] = 'Agg'\n"
    f"target_path = {target_path!r}\n"
    "sys.path = [target_path] + [\n"
    "    path for path in sys.path\n"
    "    if 'site-packages' not in path and 'dist-packages' not in path\n"
    "]\n"
    "from whisper_smith.cli import main\n"
    "raise SystemExit(main())\n",
    encoding="utf-8",
)
launcher.chmod(0o755)
os.environ["PATH"] = f"{bin_dir}:{os.environ['PATH']}"
subprocess.run(["whisper-smith", "--help"], check=True)
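
The path filtering that the launcher applies can be looked at in isolation. The sketch below reproduces the same list comprehension as a standalone function; the function name isolated_sys_path and the example paths are hypothetical and only serve as illustration.

```python
def isolated_sys_path(target_path, current_paths):
    """Put target_path first and drop site-packages/dist-packages entries."""
    kept = [
        path for path in current_paths
        if "site-packages" not in path and "dist-packages" not in path
    ]
    return [target_path] + kept


# Hypothetical example paths, for illustration only.
example = [
    "/usr/lib/python3/dist-packages",
    "/usr/local/lib/python3.12/site-packages",
    "/usr/lib/python3.12",
]
result = isolated_sys_path("/content/whisper-smith-packages", example)
print(result)
# ['/content/whisper-smith-packages', '/usr/lib/python3.12']
```

Standard-library directories survive the filter, so imports such as os and sys keep working, while third-party packages resolve only from the isolated target directory.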

Step 2 — Load credentials from Colab Secrets

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["HUGGINGFACE_TOKEN"] = userdata.get("HUGGINGFACE_TOKEN")
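
Before running the pipeline it can help to check that both variables are actually set. The following is a minimal sketch, not part of the notebook; the helper name missing_credentials is hypothetical.

```python
import os


def missing_credentials(names, env=None):
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["OPENAI_API_KEY", "HUGGINGFACE_TOKEN"]
missing = missing_credentials(required)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```

Passing a plain dict as env makes the check easy to try without touching the real environment.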

Step 3 — Upload audio

from google.colab import files
from pathlib import Path

uploaded = files.upload()
audio_path = Path(next(iter(uploaded)))
output_path = audio_path.with_suffix(".aligned.json")
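
files.upload() returns a dictionary keyed by filename, and the cell above takes the first key. The sketch below shows the same derivation with a guard for an empty upload; the helper name pick_audio and the filenames are hypothetical, and a plain dict stands in for the upload result.

```python
from pathlib import Path


def pick_audio(uploaded):
    """Return (audio_path, output_path) for the first uploaded file."""
    if not uploaded:
        raise ValueError("No file was uploaded.")
    audio_path = Path(next(iter(uploaded)))
    # with_suffix replaces only the final suffix, so "talk.v2.m4a"
    # becomes "talk.v2.aligned.json".
    return audio_path, audio_path.with_suffix(".aligned.json")


# Stand-in for the dict returned by files.upload(); the filename is hypothetical.
fake_upload = {"interview.m4a": b""}
audio, output = pick_audio(fake_upload)
print(audio, output)
# interview.m4a interview.aligned.json
```

If several files are uploaded at once, only the first one is used, so upload one audio file per run.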

Step 4 — Run the pipeline

This step is the direct Colab equivalent of the following local CLI command:

whisper-smith audio.m4a --align --diarization-model pyannote/speaker-diarization-community-1 --output audio.aligned.json

In the notebook, the uploaded filename is substituted at runtime:

!whisper-smith "{audio_path}" --align --diarization-model pyannote/speaker-diarization-community-1 --output "{output_path}"
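
The same call can also be made from Python instead of the ! shell magic, which avoids quoting problems with unusual filenames. This is a sketch under the assumption that the whisper-smith launcher from Step 1 is on PATH; it uses only the flags shown above, and the helper names build_command and run_pipeline are hypothetical.

```python
import subprocess
from pathlib import Path

MODEL = "pyannote/speaker-diarization-community-1"


def build_command(audio_path, output_path, model=MODEL):
    """Build the argument list for the whisper-smith call shown above."""
    return [
        "whisper-smith",
        str(audio_path),
        "--align",
        "--diarization-model",
        model,
        "--output",
        str(output_path),
    ]


def run_pipeline(audio_path, output_path):
    """Run the CLI and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(audio_path, output_path), check=True)


command = build_command(Path("audio.m4a"), Path("audio.aligned.json"))
print(" ".join(command))
```

In the notebook, run_pipeline(audio_path, output_path) would then replace the ! line. Because the arguments are passed as a list, no shell quoting is involved.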

Step 5 — Preview results

import json

data = json.loads(output_path.read_text(encoding="utf-8"))
for seg in data["segments"]:
    speaker = seg.get("speaker") or "UNKNOWN"
    print(f"[{seg['start']:6.2f}s – {seg['end']:6.2f}s]  {speaker:12s}  {seg['text'].strip()}")
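
Beyond printing each segment, the same fields can be aggregated, for example into total speaking time per speaker. The sketch below assumes only the segment keys the preview loop already reads (start, end, speaker); the helper name speaking_time and the sample segments are hypothetical.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum segment durations in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        speaker = seg.get("speaker") or "UNKNOWN"
        totals[speaker] += seg["end"] - seg["start"]
    return dict(totals)


# Hypothetical segments in the shape the preview loop reads.
sample = [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00", "text": "Hello."},
    {"start": 2.5, "end": 4.0, "speaker": "SPEAKER_01", "text": "Hi."},
    {"start": 4.0, "end": 5.0, "speaker": None, "text": "Hmm."},
]
for speaker, seconds in sorted(speaking_time(sample).items()):
    print(f"{speaker:12s} {seconds:6.2f}s")
```

In the notebook, speaking_time(data["segments"]) would apply the same summary to the real output.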

Step 6 — Download the JSON

files.download(str(output_path))

Advanced: explicit GPU pipeline

For fine-grained control, such as specifying the number of speakers, load the pyannote pipeline manually, move it to the GPU with .to(device), and pass it to diarize_audio via the pipeline= argument. Reusing the loaded pipeline also avoids re-downloading the model if you run diarization multiple times in the same session. The example reuses audio_path and output_path from Step 3 and the HUGGINGFACE_TOKEN set in Step 2.

import os

import torch
from pyannote.audio import Pipeline
from whisper_smith.align import assign_speakers
from whisper_smith.diarize import diarize_audio
from whisper_smith.exporters import export_json
from whisper_smith.transcribe import transcribe_audio

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

diarize_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=os.environ["HUGGINGFACE_TOKEN"],
)
diarize_pipeline.to(device)

transcript  = transcribe_audio(audio_path)
diarization = diarize_audio(audio_path, pipeline=diarize_pipeline, num_speakers=2)
aligned     = assign_speakers(transcript, diarization)

output_path.write_text(export_json(aligned), encoding="utf-8")
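
To give an intuition for the alignment step, the sketch below labels each transcript segment with the speaker whose diarization turns overlap it the most. This is one common approach and not necessarily how assign_speakers works internally; the function names, the dictionary layout of the turns, and the sample data are all assumptions made for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_by_overlap(segments, turns):
    """Give each transcript segment the speaker whose turns overlap it most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        # No overlapping turn at all leaves the segment unlabeled.
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": speaker})
    return labeled


# Hypothetical data for illustration.
segments = [
    {"start": 0.0, "end": 3.0, "text": "Hello there."},
    {"start": 3.0, "end": 5.0, "text": "Hi."},
    {"start": 9.0, "end": 10.0, "text": "Bye."},
]
turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 6.0, "speaker": "SPEAKER_01"},
]
for seg in label_by_overlap(segments, turns):
    print(seg["speaker"], seg["text"])
```

A segment with no overlapping turn gets None here, which matches the way the preview loop in Step 5 falls back to UNKNOWN for a missing speaker.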