Incredibly Fast AI Real-Time Speech-to-Text Transcription - Build From Scratch

openai Mar 10, 2024
 


Today, we'll explore the use of open-source language models to achieve real-time transcription. In this guide, we focus on employing Incredible Fast Whisper for speech recognition and detail the steps to construct a transcription AI leveraging Replicate and AWS.

This tutorial is designed for developers at any skill level, offering a straightforward approach to integrating advanced AI into your communication processes.

Incredibly Fast Whisper & Replicate

Whisper is a speech recognition model from OpenAI that turns spoken audio into written text. It copes well with different accents and noisy environments, which makes it great for transcribing meetings or talks. The catch is that the standard model can be too slow to keep up with speech as it happens.

That's where Incredibly Fast Whisper comes in. It's an open-source model that runs much faster, making it ideal when you need words written down immediately, like during live events.

Replicate is a platform that makes it easy to run Incredibly Fast Whisper and other models like it through a simple API, so developers can add these capabilities to their own apps without much hassle.
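If you want to follow along, you'll need a Replicate account and an API token. A minimal setup sketch looks like this; the token value is a placeholder, not a real credential:

# pip install replicate
import os

# The replicate client reads your API token from this environment variable.
# Replace the placeholder with the token from your Replicate account settings.
os.environ["REPLICATE_API_TOKEN"] = "r8_your_token_here"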


Real-Time Transcription System Workflow

Our approach to real-time transcription hinges on a seamless interaction between your web browser and a robust backend infrastructure.

Every two seconds, your browser captures a snippet of audio and sends it to a Flask server, a lightweight gatekeeper that processes and temporarily stores each snippet.

From there, each snippet is swiftly moved to an S3 bucket, a secure and scalable storage option provided by AWS. Once safely stored, the Incredibly Fast Whisper model kicks in, transcribing the audio into text with remarkable speed.

This efficient system ensures that the spoken word is almost instantly reflected in written form, ready for any application where immediate text output is needed.

 

First Steps with Replicate

In our code, we start by pulling in the replicate library, which connects us to a vast array of AI models. With just a few lines of code, we set the replicate.run function in motion, calling upon the 'incredibly-fast-whisper' model to work its magic on our audio file.

The code specifies the model, the task of transcription, and the audio file's URL, along with several parameters to fine-tune the process to our needs. Once the model processes the audio, the resulting transcription is printed out.

 

import replicate

output = replicate.run(
    "vaibhavs10/incredibly-fast-whisper:3ab86df6c8f54c11309d4d1f930ac292bad43ace52d10c80d87eb258b3c9f79c",
    input={
        "task": "transcribe",
        "audio": "https://replicate.delivery/pbxt/Js2Fgx9MSOCzdTnzHQLJXj7abLp3JLIG3iqdsYXV24tHIdk8/OSR_uk_000_0050_8k.wav",
        "language": "None",
        "timestamp": "chunk",
        "batch_size": 64,
        "diarise_audio": False
    }
)
print(output)
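The output is a dictionary rather than plain text. Later in this tutorial we read the full transcription from its "text" field, so a quick check on the result above might look like this:

# The transcription itself lives under the "text" key of the output dict.
print(output["text"])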

 
AWS Setup

Once we have our transcription service up and running, the next crucial step is to manage where we store the audio files. This is where Amazon Web Services (AWS) comes into play. AWS's Simple Storage Service (S3) provides us with a reliable and secure place to keep our audio files. Creating a bucket on S3 is straightforward; think of a bucket as a dedicated space for your project's files on Amazon's vast cloud storage network.

With our bucket ready, the next task is to ensure secure access. This is done through AWS Identity and Access Management (IAM), which allows us to create users with specific permissions. In this case, we create an IAM user with rights tailored just for handling our S3 operations. This user will have an access key ID and a secret access key, which are credentials we'll use in our application to interact with our S3 bucket safely.
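Before moving on, it's worth confirming that the IAM credentials and bucket are wired up correctly. A small check like the one below can help; the key, secret, and bucket name are placeholders for the values you just created:

import boto3

# Placeholders: use the IAM user's credentials and your bucket name.
aws_access_key = "YOUR KEY"
aws_secret_key = "YOUR SECRET"
bucket_name = "YOUR BUCKET NAME"

s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
)

# Listing a few objects (even from an empty bucket) verifies that the
# credentials are valid and that this user can read the bucket.
response = s3.list_objects_v2(Bucket=bucket_name, MaxKeys=5)
print(response.get("KeyCount", 0), "objects visible")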

 

Integrating AWS S3 with Incredibly Fast Whisper for Transcription

In the first step, we'll transcribe a local sample audio file through Replicate. We begin by importing boto3 for AWS operations. With boto3, we establish a connection to our AWS account using our aws_access_key and aws_secret_key.

We specify our bucket_name, which is the designated S3 storage location for our files. The upload_file function of the s3 client takes our local file, 'pizza.wav' in this case, and uploads it to the bucket, making it available online. We then construct the file's URL, which the Incredibly Fast Whisper model will use to access and transcribe the audio.

Using the replicate.run function, we call upon the transcription AI, passing it the URL of our audio file in S3.

Once the transcription is complete, we print the output. 

import replicate
import boto3

aws_access_key = 'YOUR KEY'
aws_secret_key = 'YOUR SECRET'
bucket_name = 'YOUR BUCKET NAME'

s3 = boto3.client("s3", aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)

filename = 'pizza.wav'
s3.upload_file(filename, bucket_name, filename)
temp_audio_uri = f"https://{bucket_name}.s3.amazonaws.com/{filename}"

output = replicate.run(
    "vaibhavs10/incredibly-fast-whisper:3ab86df6c8f54c11309d4d1f930ac292bad43ace52d10c80d87eb258b3c9f79c",
    input={
        "task": "transcribe",
        "audio": temp_audio_uri,
        "language": "None",
        "timestamp": "chunk",
        "batch_size": 64,
        "diarise_audio": False
    }
)

print(output)
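One thing to keep in mind: the model can only fetch the audio if the URL we hand it is reachable from the public internet, so the uploaded object needs to allow public reads. If you'd rather keep the bucket private, a presigned URL is one alternative. A minimal sketch, reusing the s3 client and variables from the snippet above:

# A presigned URL grants temporary read access to a private object,
# so the transcription service can fetch it without the bucket being public.
temp_audio_uri = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket_name, "Key": filename},
    ExpiresIn=600,  # the link stays valid for 10 minutes
)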

 

Web Integration for Real-Time Transcription

Shifting gears from merely transcribing sample files, we're now setting our sights on a more interactive application.

This involves spinning up a Flask backend server, crafting an HTML front end for user interaction, and implementing JavaScript for live audio recording and backend communication.

This holistic approach allows users to record directly on a webpage and receive real-time transcriptions, marrying the simplicity of Flask with the dynamism of web technologies for an enhanced user experience. Let's start with the backend part:

import os
import tempfile

import replicate
import boto3
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)

bucket_name = 'BUCKET NAME'
aws_access_key = 'KEY'
aws_secret_key = 'SECRET'

s3 = boto3.client("s3", aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)


@app.route("/")
def index():
    return render_template("index.html")


@app.route('/process-audio', methods=["POST"])
def process_audio_data():
    # Read the audio snippet sent by the browser.
    audio_data = request.files["audio"].read()

    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio:
        temp_audio.write(audio_data)
        temp_audio.flush()

        # Upload the snippet to S3 under its file name (not the full temp path).
        key = os.path.basename(temp_audio.name)
        s3.upload_file(temp_audio.name, bucket_name, key)
        temp_audio_uri = f"https://{bucket_name}.s3.amazonaws.com/{key}"

        # Transcribe the uploaded snippet with Incredibly Fast Whisper.
        output = replicate.run(
            "vaibhavs10/incredibly-fast-whisper:3ab86df6c8f54c11309d4d1f930ac292bad43ace52d10c80d87eb258b3c9f79c",
            input={
                "task": "transcribe",
                "audio": temp_audio_uri,
                "language": "None",
                "timestamp": "chunk",
                "batch_size": 64,
                "diarise_audio": False
            }
        )

    print(output)
    return jsonify({"transcript": output['text']})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

This server code handles the web interface and audio processing. When users access the site and record audio, it's sent here, processed, and transcribed.

The following HTML serves as the frontend of the application. It displays a button to control recording and a text area to show the live transcription output to the user:

<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Meeting Copilot</title>
<link
href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.19/dist/tailwind.min.css"
rel="stylesheet"
/>
</head>
<body class="bg-gray-100 w-full">
<button
id="recordButton"
class="bg-blue-500 hover:bg-blue-700 text-white font-bold py-2 px-4 rounded ml-10 mt-2"
>
Start Recording
</button>
<div class="space-x-10 w-full">
<p class="text-lg font-semibold ml-10 mt-4 mb-2">Transcript</p>
<textarea
id="transcript"
rows="10"
class="bg-white shadow-md rounded-lg w-1/2 p-4"
></textarea>
</div>
<script src="static/js/script.js"></script>
</body>
</html>
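For Flask to find these files, index.html should live in a templates folder next to the server script, and the JavaScript below should be saved as static/js/script.js, matching the render_template call and the script tag above.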


Let's continue with the client-side JavaScript:

const recordButton = document.getElementById('recordButton')
const transcriptDiv = document.getElementById('transcript')

let isRecording = false
let mediaRecorder
let intervalId
let full_transcript = ''

recordButton.addEventListener('click', () => {
  if (!isRecording) {
    startRecording()
    recordButton.textContent = 'Stop Recording'
  } else {
    stopRecording()
    recordButton.textContent = 'Start Recording'
  }
  isRecording = !isRecording
})

async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
  })

  function createRecorder() {
    mediaRecorder = new MediaRecorder(stream)

    mediaRecorder.addEventListener('dataavailable', async (event) => {
      // Each stopped recorder emits one complete, self-contained audio blob.
      const audioBlob = event.data
      const formData = new FormData()
      formData.append('audio', audioBlob)

      // Send the snippet to the Flask backend for transcription.
      const transcript_response = await fetch('/process-audio', {
        method: 'POST',
        body: formData,
      })

      const transcript_data = await transcript_response.json()
      if (transcript_data.transcript != null) {
        full_transcript += transcript_data.transcript
        transcriptDiv.textContent = full_transcript
      }
    })

    mediaRecorder.start()
  }

  createRecorder()

  // Every two seconds, stop the current recorder (which fires 'dataavailable'
  // with the captured snippet) and immediately start a fresh one.
  intervalId = setInterval(() => {
    mediaRecorder.stop()
    createRecorder()
  }, 2000)
}

function stopRecording() {
  clearInterval(intervalId)
  if (mediaRecorder && mediaRecorder.state !== 'inactive') {
    mediaRecorder.stop()
  }
}


Conclusion

As we wrap up this tutorial, we've taken significant strides in merging the capabilities of AI transcription with the utility of web applications. We've built a system that not only captures and transcribes audio with impressive speed but also presents it in a user-friendly web interface.

Our journey through setting up the Flask server, securing AWS S3 storage, and integrating real-time transcription with Incredibly Fast Whisper has equipped you with the knowledge to create powerful tools that can significantly enhance productivity and accessibility in various settings.
