Build and refine your audio generation end-to-end with Gemini 1.5 Pro


Generative AI is giving people new ways to experience audio content, from podcasts to audio summaries. For example, users are embracing NotebookLM’s recent Audio Overview feature, which turns documents into audio conversations. With one click, two AI hosts start up a lively “deep dive” discussion based on the sources you provide. They summarize your material, make connections between topics, and discuss back and forth. 

While NotebookLM offers incredible benefits for making sense of complex information, some users want more control over generating unique audio experiences, such as creating their own podcasts. Podcasts are an increasingly popular medium: creators and business leaders use them to reach an audience, and listeners use them to follow what interests them. Today, we’ll share how Gemini 1.5 Pro and the Text-to-Speech API on Google Cloud can help you generate podcast scripts with custom prompts and turn them into conversations with diverse voices.

The approach: Expand your reach with diverse audio formats

A great podcast starts with accessible audio content. Gemini's multimodal capabilities pair with our high-fidelity Text-to-Speech API, which offers 380+ voices across 50+ languages as well as custom voice creation. Together, they unlock new ways for users to experience content and expand their reach through diverse audio formats.

This approach also helps content creators reach a wider audience and streamline the content creation process, including:

  • Expanded reach: Connect with an audience segment that prefers audio content.

  • Increased engagement: Foster deeper connections with listeners through personalized audio.

  • Content repurposing: Maximize the value of existing written content by transforming it into a new format, reaching a wider audience without starting from scratch.

Let’s take a look at how. 

The architecture: Gemini 1.5 Pro and Text-to-Speech 

Our audio overview creation architecture uses two powerful services from Google Cloud:

  • Gemini 1.5 Pro: This advanced generative AI model excels at understanding and generating human-like text. We'll use Gemini 1.5 Pro to:

    • Generate engaging scripts: Feed your podcast content overview to Gemini 1.5 Pro, and it can generate compelling conversational scripts, complete with introductions, transitions, and calls to action.

    • Adapt content for audio: Gemini 1.5 Pro can optimize written content for the audio format, ensuring a natural flow and an engaging listening experience. It can also adjust the tone and style to suit a given format, such as a podcast.

  • Text-to-Speech API: This API converts text into natural-sounding speech, giving a voice to your scripts. You can choose from various voices and languages to match your brand and target audience.
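
To get a feel for the available voices, a minimal sketch like the following can list them for one language. The helper name `list_voice_names` is hypothetical. It assumes you pass in a `TextToSpeechClient` from the `google-cloud-texttospeech` package, whose `list_voices` method accepts a `language_code` filter.

```python
def list_voice_names(client, language_code: str = "en-US") -> list[str]:
    """Return the sorted voice names the API reports for one language.

    `client` is assumed to be a google.cloud.texttospeech.TextToSpeechClient
    (or any object with a compatible list_voices method).
    """
    response = client.list_voices(language_code=language_code)
    return sorted(voice.name for voice in response.voices)


# Example usage (requires credentials and the google-cloud-texttospeech package):
#   from google.cloud import texttospeech
#   print(list_voice_names(texttospeech.TextToSpeechClient(), "en-US"))
```

Taking the client as a parameter keeps the helper easy to try out with a stand-in object before any credentials are configured.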

How to create an engaging podcast yourself, step-by-step 

  • Content preparation: Prepare the source content for your podcast, such as a blog post. Ensure it's well-structured and edited for clarity. Consider dividing longer posts into multiple episodes for optimal listening duration.

  • Gemini 1.5 Pro integration: Use Gemini 1.5 Pro to generate a conversational script from your content. Experiment with prompts to fine-tune the output until it has the desired style and tone. Example prompt: "Generate an engaging audio overview script from this podcast, including an introduction, transitions, and a call to action. Target audience is technical developers, engineers, and cloud architects."

  • Section extraction: For complex or lengthy content, you might use Gemini 1.5 Pro to extract key sections and subsections as JSON, enabling a more structured approach to script generation.

A Python function that powers this section-extraction step can be quite simple.
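
A minimal sketch of such a function follows. The prompt wording, the JSON shape, and the names `SECTION_PROMPT`, `gemini_generate`, and `extract_sections` are assumptions made for illustration. `gemini_generate` assumes the Vertex AI Python SDK is installed and that `vertexai.init` has already been called for your project.

```python
import json
from typing import Callable

# Hypothetical prompt; the JSON shape is an assumption for this sketch.
SECTION_PROMPT = (
    "Extract the key sections of the content below. Respond with only a JSON "
    'array of objects shaped like {"title": "...", "subsections": ["..."]}.'
)


def gemini_generate(prompt: str, model_name: str = "gemini-1.5-pro") -> str:
    """Send a prompt to Gemini on Vertex AI and return the text reply."""
    # Imported lazily so the parsing logic below works without the SDK.
    # Assumes vertexai.init(project=..., location=...) was called beforehand.
    from vertexai.generative_models import GenerativeModel

    return GenerativeModel(model_name).generate_content(prompt).text


def extract_sections(
    content: str, generate: Callable[[str], str] = gemini_generate
) -> list[dict]:
    """Ask the model for a section outline and parse it into Python objects."""
    raw = generate(f"{SECTION_PROMPT}\n\nContent:\n{content}")
    # Models sometimes wrap JSON in prose or a code fence, so keep only the
    # outermost [...] span before parsing.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("model reply did not contain a JSON array")
    return json.loads(raw[start : end + 1])
```

Passing the model call in as `generate` lets you swap in a stub while you iterate on the prompt, then switch to the real model without changing the parsing code.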

Then, use Gemini 1.5 Pro to generate the podcast script for each section. Again, provide clear instructions in your prompts, specifying target audience, desired tone, and approximate episode length.

For each section and subsection, a small function can generate the script.
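
One way this could look is sketched below. The speaker labels, the prompt text, and the function names are hypothetical. `generate` stands for any callable that sends a prompt to Gemini 1.5 Pro and returns the reply as text.

```python
from typing import Callable

SPEAKERS = ("Host A", "Host B")  # hypothetical speaker labels


def build_script_prompt(
    title: str,
    content: str,
    audience: str = "technical developers, engineers, and cloud architects",
    tone: str = "lively and conversational",
    minutes: int = 3,
) -> str:
    """Build a prompt that states audience, tone, and approximate length."""
    return (
        f"Write a podcast script for the section '{title}' as a dialogue "
        f"between {SPEAKERS[0]} and {SPEAKERS[1]}. Target audience: {audience}. "
        f"Tone: {tone}. Approximate length: {minutes} minutes. "
        "Put each turn on its own line as 'Speaker: text'.\n\n"
        f"Section content:\n{content}"
    )


def parse_turns(script: str) -> list[tuple[str, str]]:
    """Split a 'Speaker: text' script into (speaker, text) pairs."""
    turns = []
    for line in script.splitlines():
        speaker, sep, text = line.partition(":")
        # Skip blank lines and anything that is not a known speaker's turn.
        if sep and speaker.strip() in SPEAKERS and text.strip():
            turns.append((speaker.strip(), text.strip()))
    return turns


def generate_script(
    title: str, content: str, generate: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Generate and parse the script for one section."""
    return parse_turns(generate(build_script_prompt(title, content)))
```

Returning the script as speaker turns, rather than one block of text, makes it straightforward to give each host a different voice in the next step.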

Next, feed the script generated by Gemini to the Text-to-Speech API. Choose a voice and language appropriate for your target audience and content.

A short function can generate human-quality audio from text using the Text-to-Speech API in Google Cloud.
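
A minimal sketch of that step, assuming the `google-cloud-texttospeech` package and default credentials. The voice names are examples only and should be checked against the current list of supported voices. The speaker-to-voice mapping and the function names are hypothetical.

```python
import base64

# Example voice names; verify them against the supported-voices list before use.
VOICES = {"Host A": "en-US-Neural2-D", "Host B": "en-US-Neural2-F"}
DEFAULT_VOICE = "en-US-Neural2-D"


def voice_for(speaker: str) -> str:
    """Pick a voice for a speaker label, falling back to a default."""
    return VOICES.get(speaker, DEFAULT_VOICE)


def synthesize_mp3(text: str, voice_name: str, language_code: str = "en-US") -> bytes:
    """Convert text to MP3 bytes with the Text-to-Speech API."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code, name=voice_name
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return response.audio_content


def synthesize_turn_b64(speaker: str, text: str, synthesize=synthesize_mp3) -> str:
    """Synthesize one speaker turn and return the MP3 as a base64 string."""
    audio = synthesize(text, voice_for(speaker))
    return base64.b64encode(audio).decode("ascii")
```

Calling `synthesize_turn_b64` once per script turn yields one base64-encoded MP3 clip per line of dialogue, each in the voice assigned to its speaker.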

Finally, to store audio content already encoded as base64 MP3 data in Google Cloud Storage, you can use the google-cloud-storage Python library. This allows you to decode the base64 string and upload the resulting bytes directly to a designated bucket, specifying the content type as 'audio/mp3'.
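
The upload could be sketched as follows, assuming the `google-cloud-storage` package and default credentials. The function name and the bucket and object names in the usage line are placeholders.

```python
import base64


def upload_mp3(b64_audio: str, bucket_name: str, blob_name: str, client=None) -> str:
    """Decode base64 MP3 data and upload it to a Cloud Storage bucket.

    Returns the gs:// URI of the uploaded object.
    """
    if client is None:
        # Imported lazily so a stand-in client can be used without the library.
        from google.cloud import storage

        client = storage.Client()

    audio_bytes = base64.b64decode(b64_audio)
    blob = client.bucket(bucket_name).blob(blob_name)
    # 'audio/mp3' follows the text above; 'audio/mpeg' is the registered
    # MIME type for MP3 and is a reasonable alternative.
    blob.upload_from_string(audio_bytes, content_type="audio/mp3")
    return f"gs://{bucket_name}/{blob_name}"


# Example usage (placeholder names):
#   uri = upload_mp3(b64_audio, "my-podcast-bucket", "episodes/intro.mp3")
```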

Hear it for yourself

While the Text-to-Speech API produces high-quality audio, you can further enhance your audio conversation with background music, sound effects, and professional editing tools. Hear it for yourself – download the audio conversation I created from this blog using Gemini 1.5 Pro and the Text-to-Speech API.

To start creating for yourself, explore our full suite of audio generation features on Google Cloud, such as the Text-to-Speech API and Gemini models, using the free tier. We recommend experimenting with different modalities, like text and image prompts, to experience Gemini's potential for content creation.
