INCREDIBLY Fast AI Real-Time Speech-to-Text Transcription - Build From Scratch
Mar 10, 2024
Today, we'll explore the use of open-source language models to achieve real-time transcription. In this guide, we focus on employing Incredibly Fast Whisper for speech recognition and detail the steps to construct a transcription AI leveraging Replicate and AWS.
This tutorial is designed for developers at any skill level, offering a straightforward approach to integrating advanced AI into your communication processes.
Incredibly Fast Whisper & Replicate
Whisper is a tool made by OpenAI that turns what you say into written text. It's smart enough to understand different accents and can work even in noisy places. This is great for writing down meetings or talks. But sometimes, Whisper is a bit slow for catching everything as it happens.
That's where Incredibly Fast Whisper comes in. It's an open-source model that works much faster, making it perfect for when you need words written down immediately, like during live events.
Replicate is a website that makes it easy to use Incredibly Fast Whisper and other tools like it. It's designed to help developers add these cool features to their own apps without a lot of hassle.
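One practical note before the code: the replicate Python client authenticates via the REPLICATE_API_TOKEN environment variable, so make sure it is set before running any of the snippets below. The token value here is a placeholder:

```python
import os

# The replicate client picks this variable up automatically.
# "r8_your_token_here" is a placeholder -- use your own token
# from your Replicate account settings.
os.environ["REPLICATE_API_TOKEN"] = "r8_your_token_here"
```

In practice you would export the variable in your shell or load it from a .env file rather than hard-coding it in source.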
Real-Time Transcription System Workflow
First Steps with Replicate
In our code, we start by pulling in the replicate library, which connects us to a vast array of AI models. With just a few lines of code, we set the replicate.run function in motion, calling upon the incredibly-fast-whisper model to work its magic on our audio file.
The code specifies the model, the task of transcription, and the audio file's URL, along with several parameters to fine-tune the process to our needs. Once the model processes the audio, the resulting transcription is printed out.
import replicate

output = replicate.run(
    "vaibhavs10/incredibly-fast-whisper:3ab86df6c8f54c11309d4d1f930ac292bad43ace52d10c80d87eb258b3c9f79c",
    input={
        "task": "transcribe",
        "audio": "https://replicate.delivery/pbxt/Js2Fgx9MSOCzdTnzHQLJXj7abLp3JLIG3iqdsYXV24tHIdk8/OSR_uk_000_0050_8k.wav",
        "language": "None",
        "timestamp": "chunk",
        "batch_size": 64,
        "diarise_audio": False
    }
)
print(output)
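The exact shape of the response isn't shown above. Based on the parameters we pass (chunk-level timestamps), a hypothetical result might look like the following — the field names here are assumptions, so check the model's output schema on Replicate before relying on them:

```python
# Hypothetical output structure (assumption: real field names may differ).
# With "timestamp": "chunk", the model returns the full transcript plus
# a list of timestamped chunks.
output = {
    "text": " Thank you for calling. How can I help?",
    "chunks": [
        {"timestamp": [0.0, 2.1], "text": " Thank you for calling."},
        {"timestamp": [2.1, 3.8], "text": " How can I help?"},
    ],
}

# The full transcript is a single string; each chunk carries start/end times.
transcript = output["text"].strip()
for chunk in output["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start:>5.1f}-{end:>5.1f}] {chunk['text'].strip()}")
```

For our purposes the "text" field is the part we care about; the chunks become useful if you later want to display subtitles or align text with audio.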
AWS Setup
Once we have our transcription service up and running, the next crucial step is to manage where we store the audio files. This is where Amazon Web Services (AWS) comes into play. AWS's Simple Storage Service (S3) provides us with a reliable and secure place to keep our audio files. Creating a bucket on S3 is straightforward; think of a bucket as a dedicated space for your project's files on Amazon's vast cloud storage network.
import replicate
import boto3

aws_access_key = 'YOUR KEY'
aws_secret_key = 'YOUR SECRET'
bucket_name = 'YOUR BUCKET NAME'

s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key
)

filename = 'pizza.wav'
s3.upload_file(filename, bucket_name, filename)
temp_audio_uri = f"https://{bucket_name}.s3.amazonaws.com/{filename}"

output = replicate.run(
    "vaibhavs10/incredibly-fast-whisper:3ab86df6c8f54c11309d4d1f930ac292bad43ace52d10c80d87eb258b3c9f79c",
    input={
        "task": "transcribe",
        "audio": temp_audio_uri,
        "language": "None",
        "timestamp": "chunk",
        "batch_size": 64,
        "diarise_audio": False
    }
)
print(output)
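Note that the URL we hand to Replicate assumes the uploaded object is publicly readable — Replicate must be able to fetch the file anonymously. A small hypothetical helper makes the URL pattern explicit, including the region-specific form some buckets require:

```python
def public_s3_url(bucket, key, region=None):
    """Build the public HTTPS URL for an S3 object.

    Assumes the object is publicly readable; for private buckets you
    would generate a presigned URL with boto3 instead.
    """
    if region:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    else:
        host = f"{bucket}.s3.amazonaws.com"
    return f"https://{host}/{key}"

print(public_s3_url("my-audio-bucket", "pizza.wav"))
# https://my-audio-bucket.s3.amazonaws.com/pizza.wav
```

If your bucket blocks public access, swap this for boto3's presigned-URL mechanism so the file stays private while still being fetchable for a limited time.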
Web Integration for Real-Time Transcription
Shifting gears from merely transcribing sample files, we're now setting our sights on a more interactive application.
This involves spinning up a Flask backend server, crafting an HTML front end for user interaction, and implementing JavaScript for live audio recording and backend communication.
This holistic approach allows users to record directly on a webpage and receive real-time transcriptions, marrying the simplicity of Flask with the dynamism of web technologies for an enhanced user experience. Let's start with the backend part:
import os
import tempfile

import replicate
import boto3
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")

bucket_name = 'BUCKET NAME'
aws_access_key = 'KEY'
aws_secret_key = 'SECRET'

s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key
)

@app.route('/process-audio', methods=["POST"])
def process_audio_data():
    audio_data = request.files["audio"].read()
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio:
        temp_audio.write(audio_data)
        temp_audio.flush()
        # temp_audio.name is a full path (e.g. /tmp/tmpab12cd.wav);
        # use only the basename as the S3 key so the public URL is valid
        key = os.path.basename(temp_audio.name)
        s3.upload_file(temp_audio.name, bucket_name, key)
    temp_audio_uri = f"https://{bucket_name}.s3.amazonaws.com/{key}"
    output = replicate.run(
        "vaibhavs10/incredibly-fast-whisper:3ab86df6c8f54c11309d4d1f930ac292bad43ace52d10c80d87eb258b3c9f79c",
        input={
            "task": "transcribe",
            "audio": temp_audio_uri,
            "language": "None",
            "timestamp": "chunk",
            "batch_size": 64,
            "diarise_audio": False
        }
    )
    print(output)
    return jsonify({"transcript": output['text']})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
This server code handles the web interface and audio processing. When users access the site and record audio, it's sent here, processed, and transcribed.
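A detail worth calling out: tempfile.NamedTemporaryFile returns a full filesystem path (e.g. /tmp/tmpab12cd.wav), which shouldn't be reused verbatim as the S3 object key. A small hypothetical helper derives a clean, collision-resistant key instead:

```python
import os
import uuid

def s3_key_for_upload(local_path):
    """Derive a clean, collision-resistant S3 key from a local file path.

    Strips any directory components (a temp file's name is a full path)
    and prefixes a random UUID so concurrent uploads never overwrite
    each other.
    """
    name = os.path.basename(local_path)
    return f"{uuid.uuid4().hex}-{name}"

key = s3_key_for_upload("/tmp/tmpab12cd.wav")
print(key)  # e.g. 3f9c1e...-tmpab12cd.wav
```

For a production service you would also want to delete the temporary file after upload and clean up old objects in the bucket, for example with an S3 lifecycle rule.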
The following HTML serves as the frontend of the application. It displays a button to control recording and a text area to show the live transcription output to the user:
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Meeting Copilot</title>
<link
href="https://cdn.jsdelivr.net/npm/[email protected]/dist/tailwind.min.css"
rel="stylesheet"
/>
</head>
<body class="bg-gray-100 w-full">
<button
id="recordButton"
class="bg-blue-500 hover:bg-blue-700 text-white font-bold py-2 px-4 rounded ml-10 mt-2"
>
Start Recording
</button>
<div class="space-x-10 w-full">
<p class="text-lg font-semibold ml-10 mt-4 mb-2">Transcript</p>
<textarea
id="transcript"
rows="10"
class="bg-white shadow-md rounded-lg w-1/2 p-4"
></textarea>
</div>
<script src="static/js/script.js"></script>
</body>
</html>
Let's continue with the client-side JavaScript:
const recordButton = document.getElementById('recordButton')
const transcriptDiv = document.getElementById('transcript')

let isRecording = false
let mediaRecorder
let intervalId
let full_transcript = ''

recordButton.addEventListener('click', () => {
  if (!isRecording) {
    startRecording()
    recordButton.textContent = 'Stop Recording'
  } else {
    stopRecording()
    recordButton.textContent = 'Start Recording'
  }
  isRecording = !isRecording
})

async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
  })

  function createRecorder() {
    mediaRecorder = new MediaRecorder(stream)
    mediaRecorder.addEventListener('dataavailable', async (event) => {
      const audioBlob = event.data
      const formData = new FormData()
      formData.append('audio', audioBlob)
      const transcript_response = await fetch('/process-audio', {
        method: 'POST',
        body: formData,
      })
      const transcript_data = await transcript_response.json()
      if (transcript_data.transcript != null) {
        full_transcript += transcript_data.transcript
        transcriptDiv.textContent = full_transcript
      }
    })
    mediaRecorder.start()
  }

  createRecorder()

  // Restart the recorder every few seconds so each chunk is a
  // self-contained audio file the backend can transcribe
  intervalId = setInterval(() => {
    mediaRecorder.stop()
    createRecorder()
  }, 5000)
}

function stopRecording() {
  clearInterval(intervalId)
  mediaRecorder.stop()
  mediaRecorder.stream.getTracks().forEach((track) => track.stop())
}
Conclusion
As we wrap up this tutorial, we've taken significant strides in merging the capabilities of AI transcription with the utility of web applications. We've built a system that not only captures and transcribes audio with impressive speed but also presents it in a user-friendly web interface.
Our journey through setting up the Flask server, securing AWS S3 storage, and integrating real-time transcription with Incredibly Fast Whisper has equipped you with the knowledge to create powerful tools that can significantly enhance productivity and accessibility in various settings.