Creating a Gradio App for Audio-to-Meeting-Minutes with Whisper and Llama

In this post, I'll walk through how I am able to build a meeting minute generator in a simple Gradio app.

The goal of this project is to upload a file in Gradio interface, and instantly get back structured meeting minutes and key points!

Below is a simple workflow diagram on how it works.

Step 1: Generate Audio from Text with GTTS

Everything starts with the Google Text-to-Speech (GTTS) package. I use GTTS to convert any meeting script or text into an MP3 audio file. This is especially handy for testing or when you want to simulate meeting recordings without needing a live speaker.

Step 2: Transcribe Audio with OpenAI Whisper

Once I have the MP3 file, it’s fed directly into OpenAI Whisper. Whisper is a state-of-the-art speech-to-text model that quickly turns the audio into a detailed transcript. This step is fully automated within the Gradio interface-just specify the file path, and Whisper does the rest.

Step 3: Extract Meeting Minutes with Llama 3

The transcript generated by Whisper is then passed to Llama 3 from Meta. Here, I use a carefully crafted prompt to instruct Llama 3 to extract structured meeting minutes, including:

A summary with attendees, location, and date
Key discussion points
Takeaways
Action items with owners

Llama 3 outputs the results in markdown format, making them easy to read, share, or integrate into documentation.

Step 4: All in a Gradio Interface

The entire workflow runs seamlessly through a Gradio interface. You simply provide the audio file path, and the app returns summarized meeting notes with key points-no manual steps required.

💡

Workflow Recap:

GTTS: Generate audio file from text.
Whisper: Transcribe audio to text.
Llama 3: Summarize transcript into meeting minutes.
Gradio: One interface to run the whole pipeline.

Below are a few screenshots of how it actually looks like!

Gradio Interface where a user can upload an audio file

This setup is ideal for automating meeting documentation, whether you’re working with real recordings or synthetic audio. It’s fast, reliable, and easy to use-just point to your audio file and let the models handle the rest.