# Voice Conversion with Non-Parallel Data

Subtitle: Speaking like Kate Winslet

Authors: Dabi Ahn, Kyubyong Park

Samples

## Intro

What if you could imitate a famous celebrity's voice or sing like a famous singer? This project started with the goal of converting someone's voice to a specific target voice: that of the English actress Kate Winslet. We implemented deep neural networks to achieve this, using more than 2 hours of audio book sentences read by Kate Winslet as a dataset.

## Model Architecture

This is a many-to-one voice conversion system. The main significance of this work is that we can generate a target speaker's utterances without parallel data (aligned source/target pairs), using only waveforms of the target speaker. (Building such parallel datasets takes a lot of effort.) All we need in this project is a number of waveforms of the target speaker's utterances, plus only a small set of utterance–phoneme pairs from a number of anonymous speakers.

The model architecture consists of two modules:

1. Net1 (phoneme classification): classifies someone's utterances into phoneme classes at every timestep.
   - Phonemes are speaker-independent, while waveforms are speaker-dependent.
2. Net2 (speech synthesis): synthesizes speeches of the target speaker from the phonemes.

We applied the CBHG (1-D convolution bank + highway network + bidirectional GRU) modules introduced in Tacotron. CBHG is known to be good at capturing features from sequential data.

### Net1 is a classifier.

- Process: wav -> spectrogram -> MFCCs -> phoneme dist.
- Net1 classifies a spectrogram into one of 60 English phonemes at every timestep.
  - For each timestep, the input is a log magnitude spectrogram and the target is a phoneme distribution.
- The objective function is cross-entropy loss.
- The training set contains 630 speakers' utterances and the corresponding phones; the speakers read similar sentences.

### Net2 is a synthesizer.

- Process: net1(wav -> spectrogram -> MFCCs -> phoneme dist.) -> spectrogram -> wav.
- Net2 synthesizes the target speaker's speech.
  - The input/target is a set of the target speaker's utterances.
- Since Net1 was already trained in the previous step, only the remaining part needs to be trained in this step.
- The loss is the reconstruction error between input and target.
- Datasets:
  - Target1 (anonymous female): Arctic dataset (public).
  - Target2 (Kate Winslet): over 2 hours of audio book sentences read by her (private).
- Griffin-Lim reconstruction is used when reverting a wav from a spectrogram.

## Implementations

### Train phase: Net1 and Net2 should be trained sequentially.

- Train1 (Net1): run `train1.py` to train and `eval1.py` to test.
- Train2 (Net2): run `train2.py` to train and `eval2.py` to test. Train2 should be trained only after Train1 is done!
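The wav -> spectrogram -> MFCC front end that feeds Net1 can be sketched in plain NumPy. This is an illustrative approximation, not the project's actual feature-extraction code; the framing parameters (`n_fft`, `hop`, `n_mels`, `n_mfcc`) are assumed values, not the repository's settings.

```python
import numpy as np

def log_magnitude_spectrogram(wav, n_fft=512, hop=128):
    """Frame the waveform, window each frame, and take the log magnitude FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (timesteps, n_fft//2 + 1)
    return np.log(mag + 1e-8)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters mapping linear FFT bins to mel bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

def mfcc(wav, sr=16000, n_fft=512, hop=128, n_mels=40, n_mfcc=13):
    """wav -> log magnitude spectrogram -> mel energies -> DCT-II -> MFCCs."""
    logspec = log_magnitude_spectrogram(wav, n_fft, hop)
    mel = np.dot(np.exp(logspec), mel_filterbank(n_mels, n_fft, sr).T)
    logmel = np.log(mel + 1e-8)
    # DCT-II over the mel axis decorrelates the bands into cepstral coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    return np.dot(logmel, basis.T)  # (timesteps, n_mfcc)
```

Note that the log magnitude spectrogram is also what Net2 predicts as its output, so the same framing convention applies on both sides of the pipeline.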
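Net1's objective, cross entropy between the predicted phoneme distribution and the target label at every timestep, can be written as a minimal sketch (NumPy, with an assumed `(timesteps, n_phonemes)` logits layout; the project's training code is not shown here):

```python
import numpy as np

def phoneme_cross_entropy(logits, targets):
    """Mean per-timestep cross-entropy loss.

    logits:  (timesteps, n_phonemes) unnormalized scores
    targets: (timesteps,) integer phoneme labels
    """
    # Numerically stable log-softmax: subtract the row max before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the correct phoneme at each timestep
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform logits over the 60 phoneme classes, the loss equals log 60, a useful sanity check that a freshly initialized classifier starts near chance.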
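The Griffin-Lim step that reverts Net2's output spectrogram to a waveform can also be sketched in NumPy: the algorithm alternates between enforcing the known magnitude and the phase implied by resynthesis. This is an illustrative implementation under assumed STFT parameters, not the project's code.

```python
import numpy as np

def stft(wav, n_fft=512, hop=128):
    """Windowed short-time Fourier transform, frames along axis 0."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=512, hop=128):
    """Overlap-add inverse STFT with squared-window normalization."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    length = n_fft + hop * (spec.shape[0] - 1)
    wav = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        wav[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return wav / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iters=50, n_fft=512, hop=128):
    """Estimate a phase consistent with the given magnitude spectrogram."""
    phase = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iters):
        wav = istft(magnitude * phase, n_fft, hop)       # impose current phase
        phase = np.exp(1j * np.angle(stft(wav, n_fft, hop)))  # keep only the new phase
    return istft(magnitude * phase, n_fft, hop)
```

Because Net2 predicts only magnitudes, some phase estimation step like this is unavoidable; more iterations generally trade runtime for fewer metallic artifacts.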