Generative AI makes developers' lives much easier - but by how much?

I have been learning German for the past year, and I thought it would be personally useful to generate many conversations in German as audio, to practice listening. The audio I created is linked at the end of this post; here's a rundown of what I learned while doing this.

Where I went wrong

Generate only when needed; generated output may not always be parseable.
LLMs can't count in certain text formats. This is expected, because LLMs generate text based on probability, but it can be jarring to see one say 2 + 2 = 5 (your calculator will return this 0% of the time, of course, but with LLMs there's always that chance).
Parsing is annoying. You will often have to edit the generated text manually, or generate a lot of text in the hope that some of it parses.
You can never be explicit enough; there will probably always be something you miss.
Check the generated text for any edge cases that may occur.
Write fault-tolerant code. Don't expect an LLM to have always worked correctly, especially for massive workloads.
Don't make assumptions about what can and cannot be generated without testing them.
All generated output needs to be tested.

Generating the conversational audio I wanted had three major steps

Generating conversations between a few people with an LLM, on many different themes.
Converting the generated conversation text into audio with Bark.
Repeating this for 100 different conversations.

Generating conversations with an LLM

This step itself had two major parts:

Creating a list of characters in the conversation.
Creating a transcript of a conversation between the characters.

Creating a list of characters

I used the following prompt to have my LLM generate the "speakers" who would be talking to each other:

For a conversation (that you will write later), only give me some characters for the conversation, there should be a maximum of 3 female speakers and 4 male speakers in the conversation The conversation happens in Germany, so try to give German names. Write down all the speakers in the conversation in the format: ``` --- number of female speakers : <num_female_speakers> number of male speakers : <num_male_speakers> <name> : <Male/Female> <name> : <Male/Female> <name> : <Male/Female> .... ---
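The format requested in the prompt above can be checked mechanically before anything else is built on it. The following is a minimal sketch, not the code used for this post: the function name `parse_speakers` and its limits are hypothetical, and it assumes the model returns one entry per line between two `---` markers.

```python
import re

# One "<name> : <Male/Female>" entry per line.
SPEAKER_LINE = re.compile(
    r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<gender>Male|Female)\s*$", re.IGNORECASE
)
# The two "number of ... speakers : <n>" lines the prompt asks for.
COUNT_LINE = re.compile(
    r"^\s*number of (?P<gender>female|male) speakers\s*:\s*(?P<count>\d+)\s*$",
    re.IGNORECASE,
)


def parse_speakers(raw, max_female=3, max_male=4):
    """Return {name: gender} or raise ValueError if the output is unusable."""
    start, end = raw.find("---"), raw.rfind("---")
    if start == -1 or start == end:
        raise ValueError("expected the speaker list between two '---' markers")

    declared, speakers = {}, {}
    for line in raw[start + 3:end].splitlines():
        if not line.strip():
            continue
        count = COUNT_LINE.match(line)
        if count:
            declared[count["gender"].lower()] = int(count["count"])
            continue
        speaker = SPEAKER_LINE.match(line)
        if not speaker:
            raise ValueError(f"unparseable line: {line!r}")
        speakers[speaker["name"]] = speaker["gender"].capitalize()

    if not speakers:
        raise ValueError("no speakers found")

    # Never trust the model's own count: recount and compare.
    actual = {"female": 0, "male": 0}
    for gender in speakers.values():
        actual[gender.lower()] += 1
    for gender, limit in (("female", max_female), ("male", max_male)):
        if actual[gender] > limit:
            raise ValueError(f"{actual[gender]} {gender} speakers, limit is {limit}")
        if gender in declared and declared[gender] != actual[gender]:
            raise ValueError(f"declared {gender} count does not match the list")
    return speakers
```

A raised `ValueError` is then a signal to regenerate rather than to patch the text by hand.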

Lessons from this

Getting structured output from an LLM is hard. It took me a few tries with multiple prompt styles before the LLM gave me good, mostly parseable output. Even then, for this use case, it would have been easier to ask it for a list of names and then randomly select speakers from that list.
LLMs sometimes can't count. An earlier iteration of this prompt was:

For a conversation (that you will write later), only give me some characters for the conversation, there should be a maximum of 3 female speakers and 4 male speakers in the conversation write down all the speakers in the conversation in the format ``` --- <name> : <Male/Female> <name> : <Male/Female> <name> : <Male/Female> .... --- ```

Unless I explicitly asked it to write down how many speakers of each gender it would generate before listing the names and genders, it often produced 4 female speakers even though I had requested at most 3.
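The simpler alternative mentioned above, picking speakers at random from a list of names, could look roughly like this. It is a sketch under assumptions: `pick_speakers` and `format_speakers` are hypothetical helpers, the name pools would come from a single earlier LLM call or any name list, and each pool must contain at least one name.

```python
import random


def pick_speakers(female_names, male_names, max_female=3, max_male=4, rng=None):
    """Randomly choose speakers, never exceeding the per-gender limits."""
    rng = rng or random.Random()
    # The counting is done in code, so the limits cannot be exceeded.
    n_female = rng.randint(1, min(max_female, len(female_names)))
    n_male = rng.randint(1, min(max_male, len(male_names)))
    speakers = {name: "Female" for name in rng.sample(female_names, n_female)}
    speakers.update({name: "Male" for name in rng.sample(male_names, n_male)})
    return speakers


def format_speakers(speakers):
    """Render the speakers as '<name> : <Male/Female>' lines for a later prompt."""
    return "\n".join(f"{name} : {gender}" for name, gender in speakers.items())
```

Because the selection happens outside the model, there is nothing to parse and no count to verify at this step.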

Creating a chat transcript

I used the following prompt to create a chat transcript from the list of speakers:

with the following speakers {speakers_raw} write a conversation in the format ``` --- [DE] <speaker name> : <dialogue> [EN] <speaker name> : <dialogue> [DE] <speaker name> : <dialogue> [EN] <speaker name> : <dialogue> [DE] <speaker name> : <dialogue> [EN] <speaker name> : <dialogue> ... --- ``` Ensure the English translation is always in the directly next line, and dialogues between two participants have a empty line between them (as shown in the example) where the conversation is first given in german and then English. Ensure you start and end the main part of the output with 3 minuses (---), as displayed above, which in this case will be the entire conversation. The conversation should be about '{conversation_theme}' Ensure the conversation gets into complex themes and narratives, and include a discussions of the problems people face, and what they like about the industry.

{speakers_raw} was replaced with the characters generated in the previous step, and {conversation_theme} with a theme, which I got by asking for a list of conversation themes to be generated.
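The transcript format requested by this prompt pairs each `[DE]` line with an `[EN]` line, which makes it possible to reject malformed output early. Here is a minimal sketch, assuming one dialogue line per text line; `Turn` and `parse_transcript` are hypothetical names, not part of any library.

```python
import re
from typing import NamedTuple

# "[DE] <speaker name> : <dialogue>" or "[EN] <speaker name> : <dialogue>"
LINE = re.compile(r"^\[(?P<lang>DE|EN)\]\s*(?P<speaker>[^:]+?)\s*:\s*(?P<text>.+)$")


class Turn(NamedTuple):
    speaker: str
    german: str
    english: str


def parse_transcript(raw):
    """Return a list of Turn objects or raise ValueError on malformed output."""
    start, end = raw.find("---"), raw.rfind("---")
    if start == -1 or start == end:
        raise ValueError("transcript is not wrapped in '---' markers")

    # Blank lines between dialogue pairs carry no information, so drop them.
    lines = [ln.strip() for ln in raw[start + 3:end].splitlines() if ln.strip()]
    if len(lines) % 2:
        raise ValueError("odd number of lines; a [DE] or [EN] line is missing")

    turns = []
    for de_line, en_line in zip(lines[0::2], lines[1::2]):
        de, en = LINE.match(de_line), LINE.match(en_line)
        if not de or not en or de["lang"] != "DE" or en["lang"] != "EN":
            raise ValueError(f"expected a [DE]/[EN] pair: {de_line!r}, {en_line!r}")
        if de["speaker"] != en["speaker"]:
            raise ValueError(f"speaker changed inside a pair: {de_line!r}")
        turns.append(Turn(de["speaker"], de["text"], en["text"]))
    return turns
```

Only the `german` field of each turn would need to go on to audio generation; the English line is there for the learner.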

Lessons from this

Parsing is hard. If you read the prompt above, you can see how much I had to spell out for the tool (keep 2 spaces, start and end with 3 minuses, and so on). Even so, it would sometimes ignore these explicit instructions and often not keep the requested spacing or start and end in the expected format. I just generated more until something worked.

It would often misspell names it had spelled correctly earlier; for example, Johannas would become Hannes for no reason.
It would often spell "Emma" as "mma", which was absurd.

You can probably never be explicit enough. Sometimes it would insert things like "alle" ("everyone" in German) in the audio, which makes sense when you think in terms of training data, but I didn't want that, and I had to rewrite this to make it work.
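Misspelled names can be mapped back to the known speaker list instead of being fixed by hand. The sketch below uses fuzzy matching from Python's standard `difflib` module; `resolve_speaker` is a hypothetical helper, the 0.6 cutoff is simply the `difflib` default, and it assumes stray labels such as "alle" appear where a speaker name is expected.

```python
import difflib


def resolve_speaker(name, known_speakers, cutoff=0.6):
    """Map a possibly misspelled name to a known speaker, or return None."""
    # Compare case-insensitively, but return the original spelling.
    by_lower = {known.lower(): known for known in known_speakers}
    key = name.strip().lower()
    if key in by_lower:
        return by_lower[key]
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


known = ["Emma", "Johannas", "Lukas"]
print(resolve_speaker("mma", known))     # Emma
print(resolve_speaker("Hannes", known))  # Johannas
print(resolve_speaker("alle", known))    # None
```

A `None` result is a reason to drop the line or regenerate the transcript. Fuzzy matching can pick the wrong person when two speakers have similar names, so the cutoff is worth testing on real output.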

Converting to audio

I used Bark's conversational code to generate the audio. You can find the code in the bottom part of this notebook: https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb

Lessons from this

Check the generated content. Many of the issues I ran into were only recognized after the audio had been generated, because I had not spotted the bugs in the generated text beforehand. On rented GPUs this wastes compute time, so being more mindful of this would certainly have made my life easier.
Write fault-tolerant code. I later modified my code to follow this, but when I was looping over transcripts and converting them into audio, the loop often broke because of parsing issues. I could have saved that time by having fault-tolerant code in the first place, code that regenerated a new transcript whenever an error occurred, or skipped a conversation when it had trouble generating.
Bark's list of speakers only has two female German speakers, so I took an English speaker and assumed that the model would be able to make that speaker speak German. It couldn't, which makes sense when you think about the training data: very few speakers from primarily English-speaking countries who also speak German fluently are likely to be present in it. I should have tested this assumption properly.
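A fault-tolerant version of the generation loop described above might be sketched as follows. The callables `make_transcript` and `make_audio` are placeholders for the LLM step and the Bark step; their names and signatures are assumptions made for illustration, not the code used for this post.

```python
import logging

logger = logging.getLogger(__name__)


def generate_all(themes, make_transcript, make_audio, max_attempts=3):
    """Generate audio per theme, retrying with a fresh transcript on failure.

    Returns (succeeded, failed): a dict of theme -> audio and a list of
    themes that were skipped after max_attempts failures.
    """
    succeeded, failed = {}, []
    for theme in themes:
        for attempt in range(1, max_attempts + 1):
            try:
                # A new transcript is generated on every attempt, so one
                # unparseable output does not stop the whole run.
                transcript = make_transcript(theme)
                succeeded[theme] = make_audio(transcript)
                break
            except Exception as error:
                logger.warning("theme %r, attempt %d failed: %s", theme, attempt, error)
        else:
            # Every attempt failed: skip this theme and keep going.
            failed.append(theme)
    return succeeded, failed
```

Returning the failed themes separately makes it possible to inspect or rerun them later without repeating the conversations that already succeeded.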

Final lesson

After generating all the audio, I still found some clips with major issues, ranging from random screams or "tape scratches" to the speaker saying completely unexpected phrases.

Neither the generated text nor the generated audio was ever 100% reliable, and I needed a means to separate good audio from bad audio. Keeping this in mind before making any assumptions, and checking the audio constantly, would have saved me a lot of time.
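One cheap way to start separating good audio from bad is a sanity check on each clip before listening to it. The sketch below is only a heuristic under stated assumptions: `looks_suspicious` is a hypothetical helper, samples are assumed to be floats in the range -1 to 1, and all thresholds are guesses that would need tuning against real clips. It cannot detect a speaker saying the wrong words.

```python
def looks_suspicious(
    samples,
    sample_rate,
    text,
    min_seconds_per_word=0.15,
    max_seconds_per_word=1.5,
    max_clipped_fraction=0.01,
):
    """Flag clips whose length or loudness does not fit the spoken text."""
    if len(samples) == 0:
        return True

    # A clip far shorter or longer than the text suggests is likely broken.
    words = max(1, len(text.split()))
    seconds_per_word = len(samples) / sample_rate / words
    if not min_seconds_per_word <= seconds_per_word <= max_seconds_per_word:
        return True

    # Many samples at full scale often mean distortion such as screams.
    clipped = sum(1 for sample in samples if abs(sample) >= 0.99)
    return clipped / len(samples) > max_clipped_fraction
```

Flagged clips still need a human listener; the check only narrows down where to look first.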

I wasn't able to clean up the audio, but I found it good enough for my learning purposes. You can find all the generated audio here: https://german-audio-stuff.dreamymagic.art

Author

River Snow

Date

September 6, 2023

Lessons While Using Generative Language and Audio For Practical Use Cases