A guide to getting the most out of your cloned voices.
multilingual v2. You can read more about each individual model and its strengths here.
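If you generate speech with your cloned voice programmatically, the model is chosen per request. Below is a minimal sketch using Python's requests library against the ElevenLabs text-to-speech endpoint; the API key, voice ID, and text are placeholders, and the request fields should be checked against the current API reference.

```python
# Minimal sketch: generating speech with a cloned voice while explicitly
# selecting the multilingual v2 model. API key and voice ID are placeholders.
import requests

API_KEY = "YOUR_XI_API_KEY"        # placeholder
VOICE_ID = "YOUR_CLONED_VOICE_ID"  # placeholder: ID of your cloned voice

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Hello! This is my cloned voice.",
        "model_id": "eleven_multilingual_v2",  # model selection happens here
    },
)
response.raise_for_status()

# The endpoint returns encoded audio bytes (MP3 by default).
with open("output.mp3", "wb") as f:
    f.write(response.content)
```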
As mentioned earlier, if the voice you try to clone falls outside of these parameters or outside of what the AI has heard during training, it might have a hard time replicating the voice perfectly using instant voice cloning.
How the audio was recorded is more important than the total runtime of the samples. The number of samples you use doesn’t matter; what matters is their total combined length (total runtime).
Approximately 1-2 minutes of clear audio without any reverb, artifacts, or background noise of any kind appears to be the sweet spot. By “audio or recording quality,” we do not mean the file format or codec, such as MP3 or WAV; we mean how the audio was captured. That said, MP3 at 128 kbps and above works just fine, and higher bitrates don’t seem to markedly improve the quality of the clone.
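If you want a quick sanity check before uploading, the sketch below totals the combined runtime of your samples and flags low-bitrate files. It assumes local MP3 files and the mutagen library; the file names and thresholds are illustrative, based on the guidance above.

```python
# Minimal sketch: sum the total runtime of cloning samples and flag MP3s
# below 128 kbps. Assumes local MP3 files and the mutagen library.
from mutagen.mp3 import MP3

sample_paths = ["sample_01.mp3", "sample_02.mp3"]  # placeholder file names

total_seconds = 0.0
for path in sample_paths:
    info = MP3(path).info
    total_seconds += info.length  # duration in seconds
    if info.bitrate < 128_000:    # mutagen reports bitrate in bits per second
        print(f"{path}: {info.bitrate // 1000} kbps - consider 128 kbps or higher")

print(f"Total combined runtime: {total_seconds / 60:.1f} minutes")
if not 1.0 <= total_seconds / 60 <= 2.0:
    print("Aim for roughly 1-2 minutes of clean audio in total.")
```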
The AI will attempt to mimic everything it hears in the audio: the speaker’s pace, inflections, accent, tonality, and breathing pattern and strength, as well as background noise, mouth clicks, and any other artifacts, which can confuse it.
Another important thing to keep in mind is that the AI will try to replicate the performance of the voice you provide. If you speak in a slow, monotone voice without much emotion, that is what the AI will mimic; if you speak quickly and with a lot of emotion, that is what it will try to replicate.
It is crucial that the voice remains consistent throughout all the samples, not only in tone but also in performance. If there is too much variance, it might confuse the AI, leading to more varied output between generations.
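One rough way to spot obvious inconsistencies is to compare the average loudness of your samples. The sketch below assumes the pydub library (which requires ffmpeg) and placeholder file names; loudness is only a crude proxy, and the 3 dB threshold is arbitrary. It cannot judge tone or performance, only level.

```python
# Minimal sketch: compare average loudness (dBFS) across samples as a rough
# proxy for consistency. Assumes pydub + ffmpeg and placeholder file names.
from pydub import AudioSegment

sample_paths = ["sample_01.mp3", "sample_02.mp3"]  # placeholder file names

levels = {path: AudioSegment.from_file(path).dBFS for path in sample_paths}
for path, level in levels.items():
    print(f"{path}: {level:.1f} dBFS")

spread = max(levels.values()) - min(levels.values())
if spread > 3.0:  # arbitrary threshold for illustration
    print("Samples differ noticeably in level - consider re-recording or normalizing.")
```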