Intro

Suppose you are building a new AI model that can speak like Elon Musk or Taylor Swift, and you want to know how similar the generated output is to the real person's voice. Here, similarity covers tone, prosody, and word articulation.

Example

In this example, we assume you have your own AI model that generates speech similar to a given human voice. The code below creates an SMOS evaluator and adds each original recording paired with its generated counterpart.

import podonos
from podonos import File

# Initialize the client and create an SMOS (similarity mean opinion score) evaluator.
client = podonos.init()
etor = client.create_evaluator(
    name='Taylor Swift voice similarity',
    desc='How similar to Taylor Swift is the voice my AI model generates?',
    type='SMOS',
    num_eval=10)

# Original recordings and their generated counterparts, listed in the same order.
original_speech_path = ['ts0.wav', 'ts1.wav', 'ts2.wav']
generated_speech_path = ['ts0_gen.wav', 'ts1_gen.wav', 'ts2_gen.wav']

# Add each original/generated pair to the evaluator.
for org, syn in zip(original_speech_path, generated_speech_path):
    org_file = File(path=org, model_tag='real', tags=['female'])
    syn_file = File(path=syn, model_tag='model1', tags=['female', 'Taylor Swift'])
    etor.add_files(file0=org_file, file1=syn_file)

etor.close()
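
Because zip stops silently at the shorter list, a missing generated file can go unnoticed when the two lists are written by hand. One way to avoid this is to derive the pairs from the original file names. The following is a minimal sketch, assuming generated files follow the `<name>_gen.wav` naming used in the example; `pair_with_generated` is a hypothetical helper and not part of the podonos package.

```python
from pathlib import Path


def pair_with_generated(original_paths, suffix="_gen"):
    """Return (original, generated) path pairs, e.g. ts0.wav -> ts0_gen.wav.

    Hypothetical helper: assumes each generated file sits next to its
    original and differs only by `suffix` before the file extension.
    """
    pairs = []
    for original in original_paths:
        p = Path(original)
        # Insert the suffix between the stem and the extension.
        generated = p.with_name(p.stem + suffix + p.suffix)
        pairs.append((str(p), str(generated)))
    return pairs


pairs = pair_with_generated(['ts0.wav', 'ts1.wav', 'ts2.wav'])
```

Each pair can then be wrapped in File objects and passed to etor.add_files in the same way as in the loop above.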