I cloned my own voice to create an AI-Austin Yip using my Mac Studio
If you are not interested in any of the technical sections, jump to the BOTTOM and see the results.
Ever since voice-cloning became popular on social media, I have developed a keen interest in learning how it works and how to apply it in my own work. The most popular method of voice-cloning is a project called “SoftVC VITS Singing Voice Conversion” (so-vits-svc), an open-source application that extracts speech features from a source recording and re-synthesizes them in a target voice, with an optional shallow-diffusion step to refine the output. Although I have a basic understanding of diffusion from playing with Stable Diffusion, the underlying logic is still beyond my comprehension. Using the application itself, however, seemed relatively easy, so I gave it a few tries, and quickly realized I was mistaken.
The Problems
The biggest problem with running so-vits-svc is actually not understanding the documentation on GitHub, but getting it to run on an Apple M1 without an NVIDIA GPU.
The reason is that the whole project was built around CUDA, an NVIDIA-specific “parallel computing platform and programming model.” The M1 clearly has no NVIDIA GPU, so my first problem was finding an alternative.
At that point I had already spent hours trying to install it with Python 3.9, 3.10 and 3.11 locally in my terminal, and after more hours of searching online I was lucky to find the following website, which walks through how to create a conda environment on the M1 so that PyTorch and so-vits-svc can be installed on the machine.
With the step-by-step instructions, installing so-vits-svc became quite smooth, and I still remember how delighted I was when I was first able to launch the GUI.
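Before going any further, it is worth checking that PyTorch can actually see the Apple GPU through its MPS backend. This is just a small sanity check I would run after installation, not part of so-vits-svc itself:

```python
# Quick sanity check: can PyTorch use the Apple-silicon GPU (MPS backend)?
import torch

print("PyTorch version:", torch.__version__)

if torch.backends.mps.is_available():
    device = torch.device("mps")   # Metal backend on Apple silicon
elif torch.cuda.is_available():
    device = torch.device("cuda")  # only available on NVIDIA machines
else:
    device = torch.device("cpu")   # slow fallback

print("Using device:", device)

# A tiny matrix multiplication to confirm tensors really run on that device
x = torch.randn(512, 512, device=device)
print("Result lives on:", (x @ x).device)
```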
The First Attempt
The first thing I did was download a model online and try to convert something I had in my drive to a new voice. I fed the application a recording from my older project, Koto (2019), which was written for narration, alto sax, two guitars, and left-hand piano. The text was adapted from Yasunari Kawabata’s novel of the same title, and the recording was recited by Kanako Shioji. The result was quite acceptable, considering that I had no expectations of what would come out.
A few hours of trial and error followed, during which I managed to feed the same model different input audio, including my own voice, and to try out other models as well.
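I was mostly clicking around in the GUI at this stage, but the same conversion can also be scripted. Below is a rough sketch of how a few clips could be batched from the terminal; the flag names follow the so-vits-svc 4.x README (model checkpoint, config, input clip, pitch shift in semitones, target speaker), they may differ in other versions, and every path and speaker name here is a placeholder:

```python
# Rough sketch of batch-converting a few clips with a downloaded model.
# Flags follow the so-vits-svc 4.x README: -m model, -c config, -n input clip,
# -t pitch shift in semitones, -s target speaker. Paths are placeholders.
import subprocess

MODEL = "logs/44k/G_30400.pth"       # downloaded or trained checkpoint
CONFIG = "configs/config.json"       # config that shipped with the model
SPEAKER = "some_speaker"             # speaker name defined in the config
CLIPS = ["koto_excerpt.wav", "my_own_voice.wav"]  # clips placed in the raw/ folder

for clip in CLIPS:
    subprocess.run(
        ["python", "inference_main.py",
         "-m", MODEL, "-c", CONFIG,
         "-n", clip, "-t", "0", "-s", SPEAKER],
        check=True,
    )
```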
With the first problem solved, and since I wasn’t particularly interested in creating an AI clone of Obama (which can easily be found online) or 尹光 (Wan Kwong, a famous Hong Kong singer), I continued to search for ways to train my own model. That way, I could create an AI-cloned version of Austin Yip.
Training My Own Model
It took me a few more hours to grasp the basic logic, and reading the Preprocessing section of the so-vits-svc GitHub documentation definitely helped. The first step was to dig out a recording of my own voice from an old podcast project. I was lucky to have a clean recording of my voice, which I had also pre-processed with RX9 before.
The next step was to slice the audio into smaller sections, and the audio-slicer GUI was quite self-explanatory. Then I had to filter out the shorter slices and run a few terminal commands to resample the files.
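The repo has its own resampling script, so the snippet below is only an illustration of what this step amounts to: drop the slices that are too short and resample the rest to 44.1 kHz mono. The folder names and the two-second threshold are my own assumptions, not project defaults:

```python
# Illustrative stand-in for the "filter short slices + resample" step.
# Paths and the 2-second threshold are assumptions, not so-vits-svc defaults.
import os
import librosa
import soundfile as sf

SRC_DIR = "sliced_clips"          # output folder of the audio slicer
DST_DIR = "dataset_raw/austin"    # speaker folder expected under dataset_raw
TARGET_SR = 44100                 # 44.1 kHz, matching the 44k configuration
MIN_SECONDS = 2.0                 # drop slices shorter than this

os.makedirs(DST_DIR, exist_ok=True)

for name in sorted(os.listdir(SRC_DIR)):
    if not name.lower().endswith(".wav"):
        continue
    path = os.path.join(SRC_DIR, name)
    audio, sr = librosa.load(path, sr=TARGET_SR, mono=True)  # resample + downmix
    if len(audio) / TARGET_SR < MIN_SECONDS:
        continue                                  # too short to be useful for training
    sf.write(os.path.join(DST_DIR, name), audio, TARGET_SR)
```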
Up to this point it had been a smooth journey. I then spent a few hours looking for a way to generate the config file, as well as the hubert and f0 features. I was glad that, before I gave up, I found this YouTube video:
Even though it wasn’t 100% identical to my situation, the best thing I learnt from the video was this: if any errors pop up, copy the error message and ask ChatGPT how to solve it.
That can cover very obvious issues like a file failing to load, as well as a stray extra file that blocks so-vits-svc from running. In the problem below, the .DS_Store file was very likely the cause, because within the 44k folder there should only be a model folder with the processed audio files. Once the file is removed, the application no longer reads it as an embedded folder.
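Since macOS quietly drops .DS_Store files into almost every folder, I now clear them out before running the preprocessing step. Below is a small sketch of that clean-up, followed by the two preprocessing scripts named in the so-vits-svc 4.x README; the dataset folder names are assumptions for illustration:

```python
# Remove macOS .DS_Store files so so-vits-svc does not treat them as data,
# then generate the filelists/config and the hubert + f0 features.
# Script names are taken from the so-vits-svc 4.x README.
import os
import subprocess

for folder in ("dataset_raw", "dataset"):        # raw slices and processed 44k files
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name == ".DS_Store":
                os.remove(os.path.join(root, name))
                print("removed", os.path.join(root, name))

subprocess.run(["python", "preprocess_flist_config.py"], check=True)
subprocess.run(["python", "preprocess_hubert_f0.py"], check=True)
```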
Understanding the Config File and Training
Understanding the config file was another major problem, especially considering that the preset number of epochs was 10,000. When I initially attempted to train my own voice as a test with only 35 samples, I was shocked by how long it took. The video above partially explains how to redo the calculation so that I could start my first training run with a smaller number of epochs, because ultimately I had no idea whether it would even work, and a test run first seemed the safer option.
The training process took approximately 4 days to complete just 1,667 epochs on my M1 Max. I can only imagine that with a decent NVIDIA GPU the task would have finished much more quickly.
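For anyone budgeting their own run, the arithmetic is simple enough to script. The sketch below just generalizes my numbers (35 samples, 1,667 epochs in roughly 4 days on the M1 Max); the batch size is an assumed value, since the steps per epoch depend on whatever is set in config.json:

```python
import math

# Observed figures from my run on the M1 Max
num_samples = 35
epochs_done = 1667
wall_clock_seconds = 4 * 24 * 3600          # roughly 4 days

# Assumed batch size -- adjust to whatever your config.json actually uses
batch_size = 6

steps_per_epoch = math.ceil(num_samples / batch_size)
seconds_per_epoch = wall_clock_seconds / epochs_done

print(f"~{steps_per_epoch} steps per epoch")
print(f"~{seconds_per_epoch:.0f} seconds per epoch")

# Projected wall-clock time for the preset 10,000 epochs in the default config
target_epochs = 10_000
projected_days = target_epochs * seconds_per_epoch / 86_400
print(f"~{projected_days:.0f} days for {target_epochs} epochs")
```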
I was so happy when I saw that the training had completed.
The Results — Speaking
I went right away to feed it the same audio input, this time using AI-Austin Yip as the model. Because the whole dataset came from a single speech of mine, only 35 samples in total, the result was, I would say, acceptable. It still sounds a little weird, but I think more samples would fix that.
Given that success with my own trained voice, I continued to look for other recordings I could play around with. Eventually, I decided to use a recording from my opera, Por Por, in which Caleb Woo, a Hong Kong baritone, played and sang the main role. The reason is that his performance was so exceptional that I was curious to hear how my voice would sound if I transformed it.
The Results — Singing
Above is a short excerpt that combines Caleb’s singing and the AI-cloned version of Austin Yip. I intended to showcase my AI-cloned version first, followed by Caleb’s original singing, and finally a combination of the two.
Clearly, the results are not as good as they were for speaking. While analyzing my own AI voice, I realized that this is mainly due to the small sample size of my initial dataset, and probably also to what I fed into the training, which was almost entirely speaking data.
Takeaway
My first attempt at cloning myself has given me valuable insights into what to do next. I plan to re-sample my own voice with more variety, such as singing, speaking in different languages, and experimenting with different vocal effects.
Additionally, I am considering training on a rented server instead of locally. This would save a significant amount of time before seeing the next model, even though it would mean spending more money on renting online services.
I’m looking forward to the next steps.
So this is pretty much it for the first AI-cloned Austin Yip, and how I taught myself to clone myself. Stay tuned for updates, and I hope the next model really can sing.
Check out my other works at www.austinyip.com, or follow me at www.instagram.com/austinyip_thecomposer