Yesterday I needed a bit of a distraction from Stage Traxx so I looked at the current state of AI stem extraction.
In 2019 Deezer released an AI model for stem extraction called Spleeter as open source software: https://github.com/deezer/spleeter
This triggered the creation of services like Moises and others that use Spleeter under the hood. Other services such as lalal.ai followed later with custom-trained models. Most notably, Facebook Research released their own AI model called Demucs (https://github.com/facebookresearch/demucs) as open source in 2021, and it immediately won the Sony MDX Challenge 2021 for audio separation.
Since then the development of Spleeter has more or less stalled, while Facebook Research has improved their model further. They are currently at version 4 of Demucs.
One of the lead developers of Demucs recently left Facebook and joined another company that released another high-quality separation model called MDX.
And that is the current state of development as far as I am aware.
It used to be quite challenging to get those models to run on your own computer, but luckily some open source tools have been released that make running them a breeze. In particular, I can recommend taking a look at Ultimate Vocal Remover: https://ultimatevocalremover.com
This tool makes these AI models usable for everyone. It is a single-click install, and then you can run most of the available open source AI models on your computer.
So how do they fare?
I have to say that things have developed quite a bit since the last time I looked at this topic. If all you want to do is remove vocals from a track, the UVR-MDX-NET Inst HQ 3 model does an absolutely stunning job. The instrumental playbacks created by this model are nearly artifact free. In my opinion this would already work nicely on stage if all you need is a full instrumental playback. The model runs faster than real time on my M2 MacBook Air: a 6-minute song needs about 1-2 minutes to process. Unfortunately most models need at least 8GB RAM, so running them on iOS is not an option, not even on an iPad Pro.
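To put the timing above into a single number, here is a small sketch in Python that turns "a 6 minute song needs about 1-2 minutes" into a real-time factor. The function name realtime_factor is just something I made up for illustration; it is not part of any of the tools mentioned here.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time.

    A value above 1.0 means the model runs faster than real time.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


song = 6 * 60  # the 6-minute song from the text, in seconds
slowest = realtime_factor(song, 2 * 60)  # 2 minutes of processing
fastest = realtime_factor(song, 1 * 60)  # 1 minute of processing
print(f"{slowest:.1f}x to {fastest:.1f}x real time")  # prints "3.0x to 6.0x real time"
```

Anything above 1.0x is enough in principle for realtime use, so 3x to 6x leaves quite a bit of headroom on this machine.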
Stem extraction, on the other hand, is a bit of a mixed bag. The latest Demucs model, htdemucs_ft, is way ahead of Spleeter and the best I have found so far. It can extract stems for Drums, Bass, Vocals and Other. There is also a 6-stem model, htdemucs_6s, that tries to separate guitar from keys, but that does not work well. Drums, Bass and Vocals extraction works OK. The Other stem contains noticeable artifacts. As a practice tool it works great, especially for drummers and bass players. It also helps keyboard and guitar players a lot to be able to hear those tracks more clearly. But using these stems for a modular playback on stage is not yet at a point where I would call it usable. Processing time on my M2 MacBook Air is near real time, which is way slower than the MDX model.
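If you would rather try the Demucs models from a script instead of through Ultimate Vocal Remover, the following is a rough sketch of how that could look in Python. It assumes the demucs package is installed (pip install demucs) and only uses its command line options -n, -o and --two-stems. The file name song.mp3 and the two helper functions are made up for illustration, and the output folder layout is what I understand the default to be, so check it against your installed version.

```python
import importlib.util
import subprocess
import sys
from pathlib import Path


def build_demucs_command(track, model="htdemucs_ft", out_dir="separated", two_stems=None):
    """Build the command line for running Demucs on one track (hypothetical helper)."""
    cmd = [sys.executable, "-m", "demucs", "-n", model, "-o", str(out_dir)]
    if two_stems is not None:
        # e.g. two_stems="vocals" writes only vocals and no_vocals
        cmd += ["--two-stems", two_stems]
    cmd.append(str(track))
    return cmd


def expected_stem_paths(track, model="htdemucs_ft", out_dir="separated"):
    """Where the stems should land, assuming the default layout
    <out_dir>/<model>/<track name>/<stem>.wav (hypothetical helper)."""
    stems = ["drums", "bass", "other", "vocals"]
    if model == "htdemucs_6s":
        stems += ["guitar", "piano"]
    folder = Path(out_dir) / model / Path(track).stem
    return [folder / f"{stem}.wav" for stem in stems]


if __name__ == "__main__":
    track = Path("song.mp3")  # placeholder file name, replace with your own
    if importlib.util.find_spec("demucs") is None or not track.exists():
        print("Install demucs and point 'track' at a real file to run this.")
    else:
        subprocess.run(build_demucs_command(track), check=True)
        for path in expected_stem_paths(track):
            print(path)
```

Passing model="htdemucs_6s" selects the 6-stem model, and two_stems="vocals" gives you just a vocal and an instrumental file, which is the quickest way to compare the Demucs result against the MDX instrumental models.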
So the bottom line: removing vocals from a track works surprisingly well. Creating multiple stems is still a work in progress, but given a bit more time, maybe 2 or 3 years, this might turn out to work well too.
We are truly living in astonishing times, seeing all these developments. Download Ultimate Vocal Remover and play around a bit with the different AI models. It is definitely worth it.