How I use AI to transcribe and diarize TTRPG sessions

This post will help you turn an audio recording of your TTRPG sessions into a text file that you can mold into whatever you want.
It is completely free, and everything runs in the cloud (Google Colab).
I won’t go into detail on what Colab is, so feel free to read about it here

So, what do you need to get going?
These things!

A Google account
A Huggingface account
An LLM of your choice (I use Gemini 2.5 Pro)
An audio file, as clean as possible, with only voices

Step 1 – Huggingface

Visit these Huggingface links:
segmentation-3.0
speaker-diarization-3.1
tokens

For the first two links you will have to log in and accept the conditions to use their models:

For the last link, you will have to create a Huggingface token with read permissions. This is so that when we load the models later, Huggingface can verify that you have agreed to the terms for using them.
To do so, click the “tokens” link above and create a read token as shown in the gif below (right click -> “Open image in new tab” if it’s too hard to see):

Make sure you copy that token; we will need it soon. If you lose it, just create a new one or “invalidate and refresh” the one you just created to get a new token.

Well done! Onto step 2!

Step 2 – Google Colab

Now we want to create a new notebook in Colab in your Google Drive:
https://colab.research.google.com/
Log in if you aren’t already, then create a new notebook. You can name it whatever you want!

It should look something like this:

Then we need to select the correct runtime (we need that beefy GPU). Make sure you click save.
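
If you want to double-check that the GPU runtime actually took effect, you can paste something like this into a code cell later on. It is only a rough sketch: it assumes that Colab's GPU runtimes have the `nvidia-smi` tool available, and `has_gpu_runtime` is just a name I made up for this example.

```python
import shutil
import subprocess


def has_gpu_runtime() -> bool:
    """Return True if nvidia-smi is on the PATH and runs without error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is missing, so this is most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if has_gpu_runtime():
    print("GPU runtime detected")
else:
    print("No GPU found, check the runtime type")
```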

With all that done we can start adding some code!
You will only need to add the code once, but you will have to run it again for each new session. I will explain that again, don’t worry!

Click on the “+ Code” tab in the top left corner 5 times. Together with the empty one a new notebook starts with, you should now have 6 empty code prompts.
This is the code we want to add to these rows:

Row 1:

!pip install -q git+https://github.com/m-bain/whisperx.git -q

Row 2:

!pip3 install -U huggingface_hub

Row 3:

!apt-get install -y libcudnn8

Row 4:

!pip install ctranslate2==4.4.0

Row 5:

!pip install transformers -U

Row 6:

Row 6 is a bit special: this is where you add your token from step 1 (replace the token after “--hf_token” with your own), and where you change the name of the audio file to match yours. I always name mine “1.mp3” for convenience; that way I never have to change anything in this snippet. It’s worth mentioning that I’m using a Swedish alignment model for this. If you need another language, you have to find an appropriate model on Huggingface, replace “KBLab/wav2vec2-large-voxrex-swedish” with your new model, and change “--language sv” to fit your language.

!whisperx --model large --align_model KBLab/wav2vec2-large-voxrex-swedish --language sv --chunk_size 6 --diarize --hf_token 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 1.mp3 --output_format txt
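
If you switch between languages or audio files a lot, it can help to see which parts of that command actually change. The sketch below only builds the command as a string from the same flags used above; it does not run anything. The function name `build_whisperx_command` and the placeholder token are made up for this example, and the defaults are simply my Swedish setup.

```python
import shlex


def build_whisperx_command(
    audio_file,
    hf_token,
    language="sv",
    align_model="KBLab/wav2vec2-large-voxrex-swedish",
):
    """Assemble the whisperx command line from row 6 as a single string."""
    args = [
        "whisperx",
        "--model", "large",
        "--align_model", align_model,  # swap this for a model in your language
        "--language", language,        # and change the language code to match
        "--chunk_size", "6",
        "--diarize",
        "--hf_token", hf_token,        # your read token from step 1
        audio_file,
        "--output_format", "txt",
    ]
    # shlex.join quotes any argument that needs it (e.g. file names with spaces)
    return shlex.join(args)


command = build_whisperx_command("1.mp3", "YOUR_TOKEN_HERE")
print(command)
```

In a notebook you could then run the result with `!{command}`, or simply paste the printed line into row 6 after the `!`.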

This is what it should look like:

Now, hover your mouse over the first row and you will see a play button appear to the left. Click it.
This will start installing WhisperX from the GitHub link provided (feel free to head over there and read up on it, fascinating stuff!).
NOTE: You will see error messages in red here when it’s done; this is normal.
You will know it’s done when a green checkmark appears to the left of the play button.

Do this for each row in turn: wait until one is complete before you move on to the next, and stop before the last one.
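
Since the red messages make it hard to tell whether the installs really worked, here is an optional check you could run in an extra code cell. It is a small sketch using only Python's standard library; the helper name `installed_version` is made up, and I am assuming the packages are registered under the same names used in the install rows.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names assumed from the install rows above.
for name in ["whisperx", "huggingface_hub", "ctranslate2", "transformers"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If everything went well, ctranslate2 should show the version pinned in row 4.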

Before we actually start the transcription, we have to upload our audio. Click the folder icon in the left toolbar, wait for the folders to load, then drag and drop your audio file into the root folder, like so:

You will get a warning message telling you that the file will be deleted after the session. That is fine, just click okay.
There is an indicator at the bottom, right above your available space, that shows the upload progress. Once you don’t see a loading circle there, you are good to go.
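
If you are unsure whether the upload landed in the right place, you can list the audio files that the notebook can see. This is just a sketch: the helper name `find_audio_files` and the list of extensions are my own assumptions, and it looks in the current working folder, which should be where the last row looks for “1.mp3”.

```python
from pathlib import Path

# Assumed set of audio extensions, adjust it to whatever you record in.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def find_audio_files(folder="."):
    """Return the sorted names of audio files directly inside the folder."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


print(find_audio_files())
```

If your file name shows up in the printed list, the name in the last row should match it exactly.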

Now, we press play next to the last row and we wait.

When everything is done, you will get a text file with the same name as the audio file in the left column. Tadaa, that’s it!
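
The speakers in the text file are labeled generically, so a first bit of cleanup could be swapping those labels for the names at your table. The sketch below assumes each line looks like `[SPEAKER_00]: some text`; check your own file first, since the exact format may differ between WhisperX versions. The function name `rename_speakers`, the sample lines, and the names are all made up for this example.

```python
import re

# Assumed line format: "[SPEAKER_00]: text"
LINE_PATTERN = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def rename_speakers(transcript, names):
    """Replace speaker labels with names and merge consecutive lines per speaker."""
    merged = []  # list of [speaker, text] pairs
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_PATTERN.match(line)
        if match:
            # Fall back to the original label if no name was given for it.
            speaker = names.get(match["speaker"], match["speaker"])
            text = match["text"]
        else:
            # Lines without a label are kept, but marked as unknown.
            speaker, text = "UNKNOWN", line
        if merged and merged[-1][0] == speaker:
            merged[-1][1] += " " + text
        else:
            merged.append([speaker, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in merged)


sample = (
    "[SPEAKER_00]: You enter the tavern.\n"
    "[SPEAKER_00]: It smells of smoke.\n"
    "[SPEAKER_01]: I order an ale."
)
print(rename_speakers(sample, {"SPEAKER_00": "GM", "SPEAKER_01": "Alice"}))
```

To use it on your real file, read it with `open("1.txt", encoding="utf-8").read()` and pass in your own label-to-name mapping.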

I’ll make another post demonstrating what I have done to create the “journal entries” after each session and the prompt I made for it.