mirror of
https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI.git
synced 2026-06-05 09:10:25 +08:00
feat(audio): use PyAV instead of ffmpeg (#31)
* feat(audio): use PyAV instead of ffmpeg

  replaced usage of ffmpeg in favor of PyAV (`av`)

* refactor(audio): store all of the audio related functions in the `infer.lib.audio`

  refactors previous commit to have singular functions for each task, all located in `infer.lib.audio`

* fix(audio): remove downsample_audio from mdxnet.py

  it is no longer needed, since it's imported from infer.lib.audio

* docs: remove every ffmpeg mention in the documentation to avoid confusion

* chore(requirements): remove ffmpeg-python and ffmpy from all requirements

* fix(audio): fix loading for UVR

  wrapped gathering of META info from the stream into a function
  fixes loading for UVR

* fix(audio): use np.frombuffer() instead of direct conversion of the resampled frames

  this fixes traceback on preprocessing

* feat(audio): pre-allocate decoded_audio array in the load_audio function

  this should improve performance, even if just a little

* Revert "docs: remove every ffmpeg mention in the documentation to avoid confusion"

  This reverts commit 1e05bbce03.

* chore(format): run black on dev

* fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile

* Revert "fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile"

  This reverts commit e28a0eebb2.

* feat(audio): pre-allocate numpy array to store the AudioFrame data in ndarray of dtype float32

* chore(format): run black on dev

* fix(audio): fix the decoded_audio size estimation in estimated_total_samples

  we multiply by `sr` instead of `container.streams.audio[0].rate` since we want to estimate size of the OUTPUT file, not the input one.
  - Added dynamic resizing, in case something goes wrong and the size of decoded_audio is estimated incorrectly

  Fixed function `load_audio` when the input audio's samplerate does not match the desired samplerate (`sr`)

* chore(format): run black on dev

* refactor(audio): remove `clean_path()` function as it serves no purpose anymore

* docs: remove everything related to ffmpeg

  this includes everything except for formats support specification in the training_tips docs, since it has nothing to do with what ffmpeg does/did but rather what audio formats are supported (all the ones that ffmpeg supports!)

* docs: fix order of the steps in preparation in the READMEs

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
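The buffer handling described in the commit message (estimate the output size from the target rate `sr`, pre-allocate, grow if the estimate was too small, trim if it was too large) could be sketched roughly as follows. This is not the code from `infer.lib.audio`: it assumes a standard-library `array("f")` as a stand-in for the float32 NumPy array, the function names `estimate_total_samples` and `collect_frames` are hypothetical, and decoding and resampling with PyAV are left out entirely.

```python
import math
from array import array


def estimate_total_samples(duration_seconds: float, sr: int) -> int:
    """Estimate how many samples the OUTPUT will hold after resampling to `sr`.

    The target rate `sr` is used instead of the input stream's rate because
    the buffer stores the resampled audio, not the original samples.
    """
    return max(1, math.ceil(duration_seconds * sr))


def collect_frames(frames, duration_seconds: float, sr: int) -> array:
    """Copy decoded chunks into a pre-allocated float32 buffer.

    `frames` is any iterable of float sequences. In real code these would be
    resampled audio frames; plain lists are a hypothetical stand-in here.
    """
    # Pre-allocate a zero-filled float32 buffer of the estimated size.
    buffer = array("f", [0.0]) * estimate_total_samples(duration_seconds, sr)
    offset = 0
    for chunk in frames:
        end = offset + len(chunk)
        if end > len(buffer):
            # The estimate was too small: grow the buffer, at least doubling it.
            grow_by = max(end - len(buffer), len(buffer))
            buffer.extend(array("f", [0.0]) * grow_by)
        # Same-length slice assignment, so the buffer size does not change.
        buffer[offset:end] = array("f", chunk)
        offset = end
    # Drop the unused tail in case the estimate was too large.
    return buffer[:offset]
```

For example, `collect_frames([[0.1] * 3, [0.2] * 4], 0.0001, 16000)` starts from a two-sample estimate, grows twice, and returns exactly seven samples. Doubling on growth keeps the number of reallocations small when the duration reported by the container is unreliable.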
@@ -126,27 +126,7 @@ sh ./run.sh
 rvcmd assets/v2 # RVC-Models-Downloader command
 ```

-### 2. Install the ffmpeg tools
-If `ffmpeg` and `ffprobe` are already installed, you can skip this step.
-
-#### Ubuntu/Debian users
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS users
-```bash
-brew install ffmpeg
-```
-#### Windows users
-After downloading, place the files in the root directory.
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
-- Download [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
-- Download [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. Download the files required by the RMVPE vocal pitch extraction algorithm
+### 2. Download the files required by the RMVPE vocal pitch extraction algorithm

 If you want to use the latest RMVPE vocal pitch extraction algorithm, download the pitch extraction model parameters and place them in `assets/rmvpe`.

@@ -162,7 +142,7 @@ rvcmd tools/ffmpeg # RVC-Models-Downloader command
 rvcmd assets/rmvpe # RVC-Models-Downloader command
 ```

-### 4. AMD GPU ROCm (optional, Linux only)
+### 3. AMD GPU ROCm (optional, Linux only)

 If you want to run RVC on Linux using AMD's ROCm technology, first install the required drivers [here](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html).

@@ -207,7 +187,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
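@@ -1,11 +1,4 @@
-## Q1: ffmpeg error/utf8 error.
-
-This is most likely not an ffmpeg problem but an audio path problem;
-
-When ffmpeg reads a path containing spaces, parentheses, or other special characters, an ffmpeg error may occur; when training-set audio has Chinese characters in its path, a utf8 error may occur while writing filelist.txt;
-
-
-## Q2: No index after one-click training finishes
+## Q1: No index after one-click training finishes

 If "Training is done. The program is closed." is displayed, the model was trained successfully; the error messages immediately after it are spurious;
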
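@@ -13,11 +6,11 @@ When ffmpeg reads a path containing spaces, parentheses, or other special characters, an ffmpeg error may occur;
 If no index file beginning with "added" exists after one-click training finishes, the index-adding step may have stalled because the training set is too large; the excessive memory demand of adding the index in memory has been addressed by adding the index in batches. As a temporary workaround, try clicking the "Train Index" button again.


-## Q3: The trained voice does not show up for inference after training
+## Q2: The trained voice does not show up for inference after training
 Refresh the voice list and check again. If it still does not appear, check whether training reported any errors; screenshots of the console and the WebUI, and the log under logs/<experiment name>, can all be sent to the developers for a look.


-## Q4: How to share a model
+## Q3: How to share a model
 The pth files stored under rvc_root/logs/<experiment name> are not meant for sharing or inference; they store the experiment state for reproducibility and for resuming training. The model to share is the pth file of about 60+ MB in the weights folder;

 In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be bundled into weights/exp_name.zip so that the index no longer has to be filled in by hand; at that point share the zip file, not the pth file, unless you want to continue training on another machine;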
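@@ -25,18 +18,18 @@ When ffmpeg reads a path containing spaces, parentheses, or other special characters, an ffmpeg error may occur;
 If you copy/share the several-hundred-MB pth files from the logs folder into the weights folder and force them into inference, you may get errors about missing keys such as f0 and tgt_sr. Use the option at the bottom of the ckpt tab to choose, manually or automatically (automatic if the relevant information can be found under the local logs), whether to include pitch and the target audio sampling rate, then extract the small ckpt model (for the input path, enter the file whose name starts with G). After extraction, a pth file of 60+ MB appears in the weights folder; refresh the voice list and you can select it.


-## Q5: Connection Error.
+## Q4: Connection Error.
 You may have closed the console (the black window).


-## Q6: WebUI pops up "Expecting value: line 1 column 1 (char 0)".
+## Q5: WebUI pops up "Expecting value: line 1 column 1 (char 0)".
 Please disable the system LAN proxy/global proxy.


 This applies not only to client-side proxies but also to server-side ones (for example, if you set http_proxy and https_proxy on autodl for academic network acceleration, you need to unset them as well before use)


-## Q7: How to train and infer from the command line without the WebUI
+## Q6: How to train and infer from the command line without the WebUI
 Training script:

 You can first get the WebUI running; the message window will show the command lines used for dataset processing and training;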
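@@ -72,21 +65,21 @@ device=sys.argv[8]
 is_half=bool(sys.argv[9])


-## Q8: Cuda error/Cuda out of memory.
+## Q7: Cuda error/Cuda out of memory.
 Occasionally this is a CUDA configuration problem or an unsupported device; most likely there is not enough GPU memory (out of memory);


 For training, reduce the batch size (if reducing it to 1 is still not enough, you have to train on a different GPU); for inference, reduce x_pad, x_query, x_center, and x_max at the end of config.py as appropriate. Cards with less than 4 GB of memory (for example the 1060 (3G) and the various 2 GB cards) are a lost cause; cards with 4 GB can still be made to work.


-## Q9: What is a good value for total_epoch
+## Q8: What is a good value for total_epoch

 If the training set has poor audio quality and a high noise floor, 20~30 is enough; setting it higher does not help, because the base model's audio quality cannot lift a low-quality training set

 If the training set has high audio quality, a low noise floor, and ample duration, you can raise it; 200 is ok (training is fast, and since you are able to prepare a high-quality training set, your GPU is presumably decent too, so a bit more training time should not matter)


-## Q10: How much training data is needed
+## Q9: How much training data is needed
 10min to 50min is recommended

 Provided the audio quality is high and the noise floor is low, and the voice has a consistent, distinctive timbre, the more the better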
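@@ -98,7 +91,7 @@ is_half=bool(sys.argv[9])
 So far no one is known to have tried (successfully) with less than 1min of data. Such stunts are not recommended.


-## Q11: What is index rate for, and how to adjust it (explainer)
+## Q10: What is index rate for, and how to adjust it (explainer)
 If the audio quality of the base model and the inference source is higher than that of the training set, they can raise the audio quality of the inference result, but possibly at the cost of the timbre drifting toward that of the base model/inference source; this phenomenon is called "timbre leakage";

 index rate is used to reduce/resolve timbre leakage. Set to 1, there is in theory no timbre leakage from the inference source, but the audio quality leans toward that of the training set. If the training set's audio quality is lower than the inference source's, raising index rate may lower the audio quality. Set to 0, there is no retrieval blending to protect the training set's timbre;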
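
The effect of index rate at its two extremes can be illustrated with a small sketch. It assumes that the "retrieval blending" mentioned in the FAQ is a plain linear interpolation between features taken from the inference source and features retrieved from the training-set index; the function name `blend_features` and the use of flat Python lists are hypothetical simplifications, not the project's API.

```python
from typing import List


def blend_features(
    source: List[float], retrieved: List[float], index_rate: float
) -> List[float]:
    """Linearly mix source features with retrieved training-set features.

    Assumption: index_rate = 1 uses only the retrieved features (no timbre
    leakage from the source), index_rate = 0 uses only the source features.
    """
    if not 0.0 <= index_rate <= 1.0:
        raise ValueError("index_rate must be between 0 and 1")
    if len(source) != len(retrieved):
        raise ValueError("feature vectors must have the same length")
    return [
        index_rate * r + (1.0 - index_rate) * s
        for s, r in zip(source, retrieved)
    ]
```

With `source = [1.0, 2.0]` and `retrieved = [3.0, 4.0]`, an index rate of 0 returns the source unchanged, 1 returns the retrieved features, and 0.5 gives the midpoint `[2.0, 3.0]`.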