To continue the topic of Next.js from my previous post, today I want to show how to create prompts using the Whisper API in combination with ChatGPT:

Demo project

A repository with a demo project can be found on GitHub.

You have to set up / create a .env.local file in its root directory and add an OPENAI_API_KEY entry with the value of a valid OpenAI API key.
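
The file could look like this, with a valid key instead of the placeholder:

OPENAI_API_KEY=sk-...

After that you can run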

docker compose up

When you see the message

ℹī¸ Next.js instance now running on port 3000 ...

in your terminal, you should be able to open the page at http://localhost:3000/ in your browser:

(Screenshot: the demo application running in the browser)

The application demonstrates how to record OGG audio with the browser's MediaRecorder API and then send it to the /api/transcribe endpoint.

On success, this endpoint returns a text body with the transcribed audio.
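
A sketch of what this client-side call could look like, assuming audioBlob holds the recorded audio (more on recording below):

const response = await fetch("/api/transcribe", {
    method: "POST",
    body: audioBlob, // the recorded OGG data
});

const transcribedText = await response.text();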

After this text has been written to the upper text field, the user can hit the second button, SEND PROMPT.

This will send the prompt text to the /api/chat endpoint, which in turn communicates with OpenAI's Chat Completions API.

The result will be displayed below the two buttons as rendered Markdown.
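
The /api/chat handler itself is not shown in this post. A minimal sketch of what it could look like, assuming the gpt-3.5-turbo model and a simple JSON request body (the demo project may differ):

// npm i axios
import axios from "axios";
import type {
    NextApiRequest,
    NextApiResponse
} from "next";

const OPENAI_API_KEY = process.env.OPENAI_API_KEY!.trim();

export default async function chat(
    request: NextApiRequest,
    response: NextApiResponse,
) {
    // the prompt text submitted via SEND PROMPT
    const { prompt } = request.body as { prompt: string };

    const openAiResponse = await axios.post(
        "https://api.openai.com/v1/chat/completions",
        {
            model: "gpt-3.5-turbo",
            messages: [{ role: "user", content: prompt }],
        },
        {
            headers: {
                "Content-Type": "application/json",
                "Authorization": `Bearer ${OPENAI_API_KEY}`
            },
        }
    );

    // send the Markdown answer back as plain text
    response.status(200).send(
        openAiResponse.data.choices[0].message.content as string
    );
}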

Recording audio

In the browser, you first have to create a MediaStream with the getUserMedia() method:

const stream = await navigator.mediaDevices.getUserMedia({
    audio: true
});

If the user agrees, you can start recording audio with a MediaRecorder instance:

const mediaRecorder = new MediaRecorder(stream);

const allAudioChunks: Blob[] = [];
mediaRecorder.ondataavailable = (e) => {
    // collect each chunk in `allAudioChunks`
    allAudioChunks.push(e.data);
};

mediaRecorder.start();

Executing

mediaRecorder.stop();

will fire the dataavailable event one final time and stop the recording.

To make it possible for either the user or a time limit to stop the recording, the demo uses the AbortSignal of an AbortController instance (its signal property).
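
A sketch of how such a controller could be wired up; the 30-second limit and the stopButton element are assumptions for illustration:

const controller = new AbortController();

// stop the recorder as soon as the signal aborts
controller.signal.addEventListener("abort", () => {
    mediaRecorder.stop();
});

// time limit: abort automatically after 30 seconds
setTimeout(() => controller.abort(), 30000);

// user action: the same controller stops the recording manually
stopButton.addEventListener("click", () => controller.abort());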

At the end, you should have the recorded audio as binary data in OGG format.
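
For example, the collected chunks could be combined into a single Blob in the stop handler (a sketch; the actual container format can vary between browsers):

mediaRecorder.onstop = () => {
    // combine all recorded chunks into one binary blob
    const audioBlob = new Blob(allAudioChunks, { type: "audio/ogg" });

    // ... send `audioBlob` to the /api/transcribe endpoint
};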

Transcribing audio

The Whisper API is able to convert audio data, such as OGG, MP3, or WAV, into text.

Instead of sending JSON data, you have to format the request body as multipart/form-data, which is a common transmission format for HTML forms:

// ...

// npm i axios
import axios from "axios";

// npm i formdata-node
import {
    File,
    FormData
} from "formdata-node";

import type {
    NextApiRequest,
    NextApiResponse
} from "next";

// ...

const OPENAI_API_KEY = process.env.OPENAI_API_KEY!.trim();

export default async function transcribeAudio(
    request: NextApiRequest,
    response: NextApiResponse,
) {
    // ...

    // `readStream()` is a helper function reading request body
    // as one `Buffer`
    const audioData = await readStream(request);

    // collect form data for submission
    const form = new FormData();
    form.set("model", "whisper-1");
    form.set("file", new File([audioData], 'audio.ogg'));
    form.set("language", "en");

    // ...

    // note: don't shadow the `response` parameter from above
    const openAiResponse = await axios.post(
        "https://api.openai.com/v1/audio/transcriptions",
        form,
        {
            headers: {
                "Content-Type": "multipart/form-data",
                "Authorization": `Bearer ${OPENAI_API_KEY}`
            },
        }
    );

    // on success a response could look like:
    //
    // {
    //    "text": "lorem ipsum"
    // }
    const transcribedText = openAiResponse.data.text as string;

    // ...
}
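
The readStream() helper is not part of the snippet above. A possible implementation could look like this (a sketch):

import type { IncomingMessage } from "http";

async function readStream(stream: IncomingMessage): Promise<Buffer> {
    const chunks: Buffer[] = [];

    // consume the incoming request stream chunk by chunk
    for await (const chunk of stream) {
        chunks.push(Buffer.from(chunk));
    }

    return Buffer.concat(chunks);
}

Keep in mind that Next.js parses API request bodies by default. To read the raw stream like this, the route has to disable the built-in parser by exporting a config object with api.bodyParser set to false.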

Conclusion

As you can see, Whisper is no rocket science, and in combination with ChatGPT's function calling the API can become a powerful tool to automate things with nothing but your own voice.

Have fun while trying it out! 🎉