feat(audio): use PyAV instead of ffmpeg (#31)

* feat(audio): use PyAV instead of ffmpeg
  replaced usage of ffmpeg in favor of PyAV (`av`)
* refactor(audio): store all of the audio related functions in the `infer.lib.audio`
  refactors previous commit to have singular functions for each task, all located in `infer.lib.audio`
* fix(audio): remove downsample_audio from mdxnet.py
  it is no longer needed, since it's imported from infer.lib.audio
* docs: remove every ffmpeg mention in the documentation to avoid confusion
* chore(requirements): remove ffmpeg-python and ffmpy from all requirements
* fix(audio): fix loading for UVR
  wrapped gathering of META info from the stream into a function
  fixes loading for UVR
* fix(audio): use np.frombuffer() instead of direct conversion of the resampled frames
  this fixes traceback on preprocessing
* feat(audio): pre-allocate decoded_audio array in the load_audio function
  this should improve performance, even if just a little
* Revert "docs: remove every ffmpeg mention in the documentation to avoid confusion"
  This reverts commit 1e05bbce03.
* chore(format): run black on dev
* fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile
* Revert "fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile"
  This reverts commit e28a0eebb2.
* feat(audio): pre-allocate numpy array to store the AudioFrame data in ndarray of dtype float32
* chore(format): run black on dev
* fix(audio): fix the decoded_audio size estimation
  in estimated_total_samples we multiply by `sr` instead of `container.streams.audio[0].rate` since we want to estimate size of the OUTPUT file, not the input one.
  - Added dynamic resizing, in case something goes wrong and the size of decoded_audio is estimated incorrectly
  Fixed function `load_audio` when the input audio's samplerate does not match the desired samplerate (`sr`)
* chore(format): run black on dev
* refactor(audio): remove `clean_path()` function as it serves no purpose anymore
* docs: remove everything related to ffmpeg
  this includes everything except for formats support specification in the training_tips docs, since it has nothing to do with what ffmpeg does/did but rather what audio formats are supported (all the ones that ffmpeg supports!)
* docs: fix order of the steps in preparation in the READMEs

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
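
The buffer handling described in the message (estimate the output size from `sr`, pre-allocate a float32 array, read chunks with `np.frombuffer()`, and grow the array if the estimate was too small) might look roughly like the sketch below. This is not the code from the commit: the helper names `estimate_output_samples` and `collect_frames` are hypothetical, the duration is assumed to be given in microseconds, and plain `bytes` objects stand in for resampled audio frames so that the example runs without PyAV.

```python
import numpy as np


def estimate_output_samples(duration_us: int, sr: int) -> int:
    """Estimate how many samples the OUTPUT will contain.

    duration_us: stream duration in microseconds (an assumption of this sketch).
    sr: the desired output sample rate, not the input stream's rate.
    """
    return int(duration_us * sr // 1_000_000)


def collect_frames(frames, estimated_total_samples: int) -> np.ndarray:
    """Copy float32 chunks into one pre-allocated array, growing it if needed."""
    decoded = np.zeros(max(estimated_total_samples, 1), dtype=np.float32)
    offset = 0
    for raw in frames:
        # Reinterpret the raw bytes as float32 samples without an extra copy.
        chunk = np.frombuffer(raw, dtype=np.float32)
        end = offset + len(chunk)
        if end > len(decoded):
            # The estimate was too small: extend the buffer, at least doubling it.
            extra = max(end - len(decoded), len(decoded))
            decoded = np.concatenate([decoded, np.zeros(extra, dtype=np.float32)])
        decoded[offset:end] = chunk
        offset = end
    # Trim the unused tail left over from over-estimation or doubling.
    return decoded[:offset]
```

Pre-allocating avoids repeatedly concatenating small arrays when the estimate is accurate, while the doubling step keeps the function correct when it is not.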
2026-07-17 02:40:35 +08:00 · 2024-06-12 18:13:26 +07:00
parent aec56ec0b4
commit 1e22d468ea
28 changed files with 233 additions and 366 deletions
--- a/.github/workflows/unitest.yml
+++ b/.github/workflows/unitest.yml
@@ -18,7 +18,6 @@ jobs:
    - name: Install dependencies
      run: |
        sudo apt update
-        sudo apt -y install ffmpeg
        wget https://github.com/fumiama/RVC-Models-Downloader/releases/download/v0.2.3/rvcmd_linux_amd64.deb
        sudo apt -y install ./rvcmd_linux_amd64.deb
        python -m pip install --upgrade pip
--- a/.gitignore
+++ b/.gitignore
@@ -12,5 +12,3 @@ xcuserdata
 /logs

 /assets/weights/*
-ffmpeg.*
-ffprobe.*
--- a/2
+++ b/2
@@ -8,7 +8,7 @@ WORKDIR /app

 # Install dependenceis to add PPAs
 RUN apt-get update && \
-    apt-get install -y -qq ffmpeg aria2 && apt clean && \
+    apt-get install -y -qq aria2 && apt clean && \
    apt-get install -y software-properties-common && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
--- a/README.md
+++ b/README.md
@@ -128,26 +128,7 @@ If you want to use the v2 version of the model, you need to download additional
 	rvcmd assets/v2 # RVC-Models-Downloader command
 	```

-### 2. Install ffmpeg tool
-If `ffmpeg` and `ffprobe` have already been installed, you can skip this step.
-#### Ubuntu/Debian
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS
-```bash
-brew install ffmpeg
-```
-#### Windows
-After downloading, place it in the root directory.
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
- [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. Download the required files for the rmvpe vocal pitch extraction algorithm
+### 2. Download the required files for the rmvpe vocal pitch extraction algorithm

 If you want to use the latest RMVPE vocal pitch extraction algorithm, you need to download the pitch extraction model parameters and place them in `assets/rmvpe`.

@@ -163,7 +144,7 @@ If you want to use the latest RMVPE vocal pitch extraction algorithm, you need t
 	rvcmd assets/rmvpe # RVC-Models-Downloader command
 	```

-### 4. AMD ROCM (optional, Linux only)
+### 3. AMD ROCM (optional, Linux only)

 If you want to run RVC on a Linux system based on AMD's ROCM technology, please first install the required drivers [here](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html).

@@ -207,7 +188,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/cn/README.cn.md
+++ b/docs/cn/README.cn.md
@@ -126,27 +126,7 @@ sh ./run.sh
 	rvcmd assets/v2 # RVC-Models-Downloader command
 	```

-### 2. 安装 ffmpeg 工具
-若已安装`ffmpeg`和`ffprobe`则可跳过此步骤。
-
-#### Ubuntu/Debian 用户
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS 用户
-```bash
-brew install ffmpeg
-```
-#### Windows 用户
-下载后放置在根目录。
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
- 下载[ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- 下载[ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. 下载 rmvpe 人声音高提取算法所需文件
+### 2. 下载 rmvpe 人声音高提取算法所需文件

 如果你想使用最新的RMVPE人声音高提取算法，则你需要下载音高提取模型参数并放置于`assets/rmvpe`。

@@ -162,7 +142,7 @@ rvcmd tools/ffmpeg # RVC-Models-Downloader command
 	rvcmd assets/rmvpe # RVC-Models-Downloader command
 	```

-### 4. AMD显卡Rocm(可选, 仅Linux)
+### 3. AMD显卡Rocm(可选, 仅Linux)

 如果你想基于AMD的Rocm技术在Linux系统上运行RVC，请先在[这里](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html)安装所需的驱动。

@@ -207,7 +187,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/cn/faq.md
+++ b/docs/cn/faq.md
@@ -1,11 +1,4 @@
-## Q1:ffmpeg error/utf8 error.
-
-大概率不是ffmpeg问题，而是音频路径问题；
-
-ffmpeg读取路径带空格、()等特殊符号，可能出现ffmpeg error；训练集音频带中文路径，在写入filelist.txt的时候可能出现utf8 error；
-
-
-## Q2:一键训练结束没有索引
+## Q1:一键训练结束没有索引

 显示"Training is done. The program is closed."则模型训练成功，后续紧邻的报错是假的；

@@ -13,11 +6,11 @@ ffmpeg读取路径带空格、()等特殊符号，可能出现ffmpeg error；训
 一键训练结束完成没有added开头的索引文件，可能是因为训练集太大卡住了添加索引的步骤；已通过批处理add索引解决内存add索引对内存需求过大的问题。临时可尝试再次点击"训练索引"按钮。


-## Q3:训练结束推理没看到训练集的音色
+## Q2:训练结束推理没看到训练集的音色
 点刷新音色再看看，如果还没有看看训练有没有报错，控制台和webui的截图，logs/实验名下的log，都可以发给开发者看看。


-## Q4:如何分享模型
+## Q3:如何分享模型
   rvc_root/logs/实验名 下面存储的pth不是用来分享模型用来推理的，而是为了存储实验状态供复现，以及继续训练用的。用来分享的模型应该是weights文件夹下大小为60+MB的pth文件；

   后续将把weights/exp_name.pth和logs/exp_name/added_xxx.index合并打包成weights/exp_name.zip省去填写index的步骤，那么zip文件用来分享，不要分享pth文件，除非是想换机器继续训练；
@@ -25,18 +18,18 @@ ffmpeg读取路径带空格、()等特殊符号，可能出现ffmpeg error；训
   如果你把logs文件夹下的几百MB的pth文件复制/分享到weights文件夹下强行用于推理，可能会出现f0，tgt_sr等各种key不存在的报错。你需要用ckpt选项卡最下面，手工或自动（本地logs下如果能找到相关信息则会自动）选择是否携带音高、目标音频采样率的选项后进行ckpt小模型提取（输入路径填G开头的那个），提取完在weights文件夹下会出现60+MB的pth文件，刷新音色后可以选择使用。


-## Q5:Connection Error.
+## Q4:Connection Error.
 也许你关闭了控制台（黑色窗口）。


-## Q6:WebUI弹出Expecting value: line 1 column 1 (char 0).
+## Q5:WebUI弹出Expecting value: line 1 column 1 (char 0).
 请关闭系统局域网代理/全局代理。


 这个不仅是客户端的代理，也包括服务端的代理（例如你使用autodl设置了http_proxy和https_proxy学术加速，使用时也需要unset关掉）


-## Q7:不用WebUI如何通过命令训练推理
+## Q6:不用WebUI如何通过命令训练推理
 训练脚本：

 可先跑通WebUI，消息窗内会显示数据集处理和训练用命令行；
@@ -72,21 +65,21 @@ device=sys.argv[8]
 is_half=bool(sys.argv[9])


-## Q8:Cuda error/Cuda out of memory.
+## Q7:Cuda error/Cuda out of memory.
 小概率是cuda配置问题、设备不支持；大概率是显存不够（out of memory）；


 训练的话缩小batch size（如果缩小到1还不够只能更换显卡训练），推理的话酌情缩小config.py结尾的x_pad，x_query，x_center，x_max。4G以下显存（例如1060（3G）和各种2G显卡）可以直接放弃，4G显存显卡还有救。


-## Q9:total_epoch调多少比较好
+## Q8:total_epoch调多少比较好

 如果训练集音质差底噪大，20~30足够了，调太高，底模音质无法带高你的低音质训练集

 如果训练集音质高底噪低时长多，可以调高，200是ok的（训练速度很快，既然你有条件准备高音质训练集，显卡想必条件也不错，肯定不在乎多一些训练时间）


-## Q10:需要多少训练集时长
+## Q9:需要多少训练集时长
   推荐10min至50min

   保证音质高底噪低的情况下，如果有个人特色的音色统一，则多多益善
@@ -98,7 +91,7 @@ is_half=bool(sys.argv[9])
   1min以下时长数据目前没见有人尝试（成功）过。不建议进行这种鬼畜行为。


-## Q11:index rate干嘛用的，怎么调（科普）
+## Q10:index rate干嘛用的，怎么调（科普）
   如果底模和推理源的音质高于训练集的音质，他们可以带高推理结果的音质，但代价可能是音色往底模/推理源的音色靠，这种现象叫做"音色泄露"；

   index rate用来削减/解决音色泄露问题。调到1，则理论上不存在推理源的音色泄露问题，但音质更倾向于训练集。如果训练集音质比推理源低，则index rate调高可能降低音质。调到0，则不具备利用检索混合来保护训练集音色的效果；
--- a/docs/en/faq_en.md
+++ b/docs/en/faq_en.md
@@ -1,30 +1,25 @@
-## Q1:ffmpeg error/utf8 error.
-It is most likely not a FFmpeg issue, but rather an audio path issue;
-
-FFmpeg may encounter an error when reading paths containing special characters like spaces and (), which may cause an FFmpeg error; and when the training set's audio contains Chinese paths, writing it into filelist.txt may cause a utf8 error.<br>
-
-## Q2:Cannot find index file after "One-click Training".
+## Q1:Cannot find index file after "One-click Training".
 If it displays "Training is done. The program is closed," then the model has been trained successfully, and the subsequent errors are fake;

 The lack of an 'added' index file after One-click training may be due to the training set being too large, causing the addition of the index to get stuck; this has been resolved by using batch processing to add the index, which solves the problem of memory overload when adding the index. As a temporary solution, try clicking the "Train Index" button again.<br>

-## Q3:Cannot find the model in “Inferencing timbre” after training
+## Q2:Cannot find the model in “Inferencing timbre” after training
 Click “Refresh timbre list” and check again; if still not visible, check if there are any errors during training and send screenshots of the console, web UI, and logs/experiment_name/*.log to the developers for further analysis.<br>

-## Q4:How to share a model/How to use others' models?
+## Q3:How to share a model/How to use others' models?
 The pth files stored in rvc_root/logs/experiment_name are not meant for sharing or inference, but for storing the experiment checkpoits for reproducibility and further training. The model to be shared should be the 60+MB pth file in the weights folder;

 In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be merged into a single weights/exp_name.zip file to eliminate the need for manual index input; so share the zip file, not the pth file, unless you want to continue training on a different machine;

 Copying/sharing the several hundred MB pth files from the logs folder to the weights folder for forced inference may result in errors such as missing f0, tgt_sr, or other keys. You need to use the ckpt tab at the bottom to manually or automatically (if the information is found in the logs/exp_name), select whether to include pitch infomation and target audio sampling rate options and then extract the smaller model. After extraction, there will be a 60+ MB pth file in the weights folder, and you can refresh the voices to use it.<br>

-## Q5:Connection Error.
+## Q4:Connection Error.
 You may have closed the console (black command line window).<br>

-## Q6:WebUI popup 'Expecting value: line 1 column 1 (char 0)'.
+## Q5:WebUI popup 'Expecting value: line 1 column 1 (char 0)'.
 Please disable system LAN proxy/global proxy and then refresh.<br>

-## Q7:How to train and infer without the WebUI?
+## Q6:How to train and infer without the WebUI?
 Training script:<br>
 You can run training in WebUI first, and the command-line versions of dataset preprocessing and training will be displayed in the message window.<br>

@@ -47,17 +42,17 @@ index_rate=float(sys.argv[7])<br>
 device=sys.argv[8]<br>
 is_half=bool(sys.argv[9])<br>

-## Q8:Cuda error/Cuda out of memory.
+## Q7:Cuda error/Cuda out of memory.
 There is a small chance that there is a problem with the CUDA configuration or the device is not supported; more likely, there is not enough memory (out of memory).<br>

 For training, reduce the batch size (if reducing to 1 is still not enough, you may need to change the graphics card); for inference, adjust the x_pad, x_query, x_center, and x_max settings in the config.py file as needed. 4G or lower memory cards (e.g. 1060(3G) and various 2G cards) can be abandoned, while 4G memory cards still have a chance.<br>

-## Q9:How many total_epoch are optimal?
+## Q8:How many total_epoch are optimal?
 If the training dataset's audio quality is poor and the noise floor is high, 20-30 epochs are sufficient. Setting it too high won't improve the audio quality of your low-quality training set.<br>

 If the training set audio quality is high, the noise floor is low, and there is sufficient duration, you can increase it. 200 is acceptable (since training is fast, and if you're able to prepare a high-quality training set, your GPU likely can handle a longer training duration without issue).<br>

-## Q10:How much training set duration is needed?
+## Q9:How much training set duration is needed?

 A dataset of around 10min to 50min is recommended.<br>

@@ -69,29 +64,29 @@ There are some people who have trained successfully with 1min to 2min data, but
 Data of less than 1min duration has not been successfully attempted so far. This is not recommended.<br>


-## Q11:What is the index rate for and how to adjust it?
+## Q10:What is the index rate for and how to adjust it?
 If the tone quality of the pre-trained model and inference source is higher than that of the training set, they can bring up the tone quality of the inference result, but at the cost of a possible tone bias towards the tone of the underlying model/inference source rather than the tone of the training set, which is generally referred to as "tone leakage".<br>

 The index rate is used to reduce/resolve the timbre leakage problem. If the index rate is set to 1, theoretically there is no timbre leakage from the inference source and the timbre quality is more biased towards the training set. If the training set has a lower sound quality than the inference source, then a higher index rate may reduce the sound quality. Turning it down to 0 does not have the effect of using retrieval blending to protect the training set tones.<br>

 If the training set has good audio quality and long duration, turn up the total_epoch, when the model itself is less likely to refer to the inferred source and the pretrained underlying model, and there is little "tone leakage", the index_rate is not important and you can even not create/share the index file.<br>

-## Q12:How to choose the gpu when inferring?
+## Q11:How to choose the gpu when inferring?
 In the config.py file, select the card number after "device cuda:".<br>

 The mapping between card number and graphics card can be seen in the graphics card information section of the training tab.<br>

-## Q13:How to use the model saved in the middle of training?
+## Q12:How to use the model saved in the middle of training?
 Save via model extraction at the bottom of the ckpt processing tab.

-## Q14:File/memory error(when training)?
+## Q13:File/memory error(when training)?
 Too many processes and your memory is not enough. You may fix it by:

 1、decrease the input in field "Threads of CPU".

 2、pre-cut trainset to shorter audio files.

-## Q15: How to continue training using more data
+## Q14: How to continue training using more data

 step1: put all wav data to path2.

@@ -101,7 +96,7 @@ step3: copy the latest G and D file of exp_name1 (your previous experiment) into

 step4: click "train the model", and it will continue training from the beginning of your previous exp model epoch.

-## Q16: error about llvmlite.dll
+## Q15: error about llvmlite.dll

 OSError: Could not load shared object file: llvmlite.dll

@@ -109,11 +104,11 @@ FileNotFoundError: Could not find module lib\site-packages\llvmlite\binding\llvm

 The issue will happen in windows, install https://aka.ms/vs/17/release/vc_redist.x64.exe and it will be fixed.

-## Q17: RuntimeError: The expanded size of the tensor (17280) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [1, 17280].  Tensor sizes: [0]
+## Q16: RuntimeError: The expanded size of the tensor (17280) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [1, 17280].  Tensor sizes: [0]

 Delete the wav files whose size is significantly smaller than others, and that won't happen again. Than click "train the model"and "train the index".

-## Q18: RuntimeError: The size of tensor a (24) must match the size of tensor b (16) at non-singleton dimension 2
+## Q17: RuntimeError: The size of tensor a (24) must match the size of tensor b (16) at non-singleton dimension 2

 Do not change the sampling rate and then continue training. If it is necessary to change, the exp name should be changed and the model will be trained from scratch. You can also copy the pitch and features (0/1/2/2b folders) extracted last time to accelerate the training process.

--- a/docs/fr/README.fr.md
+++ b/docs/fr/README.fr.md
@@ -112,16 +112,6 @@ Voici une liste des modèles et autres fichiers requis par RVC :

 ./assets/pretrained_v2

-# Si vous utilisez Windows, vous pourriez avoir besoin de ces fichiers pour ffmpeg et ffprobe, sautez cette étape si vous avez déjà installé ffmpeg et ffprobe. Les utilisateurs d'ubuntu/debian peuvent installer ces deux bibliothèques avec apt install ffmpeg. Les utilisateurs de Mac peuvent les installer avec brew install ffmpeg (prérequis : avoir installé brew).
-
-# ./ffmpeg
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe
-
-# ./ffprobe
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe
-
 # Si vous souhaitez utiliser le dernier algorithme RMVPE de pitch vocal, téléchargez les paramètres du modèle de pitch et placez-les dans le répertoire racine de RVC.

 https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt
@@ -167,7 +157,6 @@ python web.py
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Extraction de la hauteur vocale : RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/fr/faq_fr.md
+++ b/docs/fr/faq_fr.md
@@ -1,30 +1,25 @@
-## Q1: Erreur ffmpeg/erreur utf8.
-Il s'agit très probablement non pas d'un problème lié à FFmpeg, mais d'un problème lié au chemin de l'audio ;
-
-FFmpeg peut rencontrer une erreur lors de la lecture de chemins contenant des caractères spéciaux tels que des espaces et (), ce qui peut provoquer une erreur FFmpeg ; et lorsque l'audio du jeu d'entraînement contient des chemins en chinois, l'écrire dans filelist.txt peut provoquer une erreur utf8.<br>
-
-## Q2: Impossible de trouver le fichier index après "Entraînement en un clic".
+## Q1: Impossible de trouver le fichier index après "Entraînement en un clic".
 Si l'affichage indique "L'entraînement est terminé. Le programme est fermé", alors le modèle a été formé avec succès, et les erreurs subséquentes sont fausses ;

 L'absence d'un fichier index 'ajouté' après un entraînement en un clic peut être due au fait que le jeu d'entraînement est trop grand, ce qui bloque l'ajout de l'index ; cela a été résolu en utilisant un traitement par lots pour ajouter l'index, ce qui résout le problème de surcharge de mémoire lors de l'ajout de l'index. Comme solution temporaire, essayez de cliquer à nouveau sur le bouton "Entraîner l'index".<br>

-## Q3: Impossible de trouver le modèle dans “Inférence du timbre” après l'entraînement
+## Q2: Impossible de trouver le modèle dans “Inférence du timbre” après l'entraînement
 Cliquez sur “Actualiser la liste des timbres” et vérifiez à nouveau ; si vous ne le voyez toujours pas, vérifiez s'il y a des erreurs pendant l'entraînement et envoyez des captures d'écran de la console, de l'interface utilisateur web, et des logs/nom_de_l'expérience/*.log aux développeurs pour une analyse plus approfondie.<br>

-## Q4: Comment partager un modèle/Comment utiliser les modèles d'autres personnes ?
+## Q3: Comment partager un modèle/Comment utiliser les modèles d'autres personnes ?
 Les fichiers pth stockés dans rvc_root/logs/nom_de_l'expérience ne sont pas destinés à être partagés ou inférés, mais à stocker les points de contrôle de l'expérience pour la reproductibilité et l'entraînement ultérieur. Le modèle à partager doit être le fichier pth de 60+MB dans le dossier des poids ;

 À l'avenir, les poids/nom_de_l'expérience.pth et les logs/nom_de_l'expérience/ajouté_xxx.index seront fusionnés en un seul fichier poids/nom_de_l'expérience.zip pour éliminer le besoin d'une entrée d'index manuelle ; partagez donc le fichier zip, et non le fichier pth, sauf si vous souhaitez continuer l'entraînement sur une machine différente ;

 Copier/partager les fichiers pth de plusieurs centaines de Mo du dossier des logs au dossier des poids pour une inférence forcée peut entraîner des erreurs telles que des f0, tgt_sr, ou d'autres clés manquantes. Vous devez utiliser l'onglet ckpt en bas pour sélectionner manuellement ou automatiquement (si l'information se trouve dans les logs/nom_de_l'expérience), si vous souhaitez inclure les informations sur la hauteur et les options de taux d'échantillonnage audio cible, puis extraire le modèle plus petit. Après extraction, il y aura un fichier pth de 60+ MB dans le dossier des poids, et vous pouvez actualiser les voix pour l'utiliser.<br>

-## Q5: Erreur de connexion.
+## Q4: Erreur de connexion.
 Il se peut que vous ayez fermé la console (fenêtre de ligne de commande noire).<br>

-## Q6: WebUI affiche 'Expecting value: line 1 column 1 (char 0)'.
+## Q5: WebUI affiche 'Expecting value: line 1 column 1 (char 0)'.
 Veuillez désactiver le proxy système LAN/proxy global puis rafraîchir.<br>

-## Q7: Comment s'entraîner et déduire sans le WebUI ?
+## Q6: Comment s'entraîner et déduire sans le WebUI ?
 Script d'entraînement :<br>
 Vous pouvez d'abord lancer l'entraînement dans WebUI, et les versions en ligne de commande de la préparation du jeu de données et de l'entraînement seront affichées dans la fenêtre de message.<br>

@@ -99,17 +94,17 @@ protect = sys.argv[15].lower() == 'false' # change for true if needed
 Assurez-vous de remplacer les chemins par ceux correspondant à votre configuration et d'ajuster les autres paramètres selon vos besoins.
 -->

-## Q8: Erreur Cuda/Mémoire Cuda épuisée.
+## Q7: Erreur Cuda/Mémoire Cuda épuisée.
 Il y a une faible chance qu'il y ait un problème avec la configuration CUDA ou que le dispositif ne soit pas pris en charge ; plus probablement, il n'y a pas assez de mémoire (manque de mémoire).<br>

 Pour l'entraînement, réduisez la taille du lot (si la réduction à 1 n'est toujours pas suffisante, vous devrez peut-être changer la carte graphique) ; pour l'inférence, ajustez les paramètres x_pad, x_query, x_center, et x_max dans le fichier config.py selon les besoins. Les cartes mémoire de 4 Go ou moins (par exemple 1060(3G) et diverses cartes de 2 Go) peuvent être abandonnées, tandis que les cartes mémoire de 4 Go ont encore une chance.<br>

-## Q9: Combien de total_epoch sont optimaux ?
+## Q8: Combien de total_epoch sont optimaux ?
 Si la qualité audio du jeu d'entraînement est médiocre et que le niveau de bruit est élevé, 20-30 époques sont suffisantes. Le fixer trop haut n'améliorera pas la qualité audio de votre jeu d'entraînement de faible qualité.<br>

 Si la qualité audio du jeu d'entraînement est élevée, le niveau de bruit est faible, et la durée est suffisante, vous pouvez l'augmenter. 200 est acceptable (puisque l'entraînement est rapide, et si vous êtes capable de préparer un jeu d'entraînement de haute qualité, votre GPU peut probablement gérer une durée d'entraînement plus longue sans problème).<br>

-## Q10: Quelle durée de jeu d'entraînement est nécessaire ?
+## Q9: Quelle durée de jeu d'entraînement est nécessaire ?
 Un jeu d'environ 10 min à 50 min est recommandé.<br>

 Avec une garantie de haute qualité sonore et de faible bruit de fond, plus peut être ajouté si le timbre du jeu est uniforme.<br>
@@ -119,29 +114,29 @@ Pour un jeu d'entraînement de haut niveau (ton maigre + ton distinctif), 5 min
 Il y a des personnes qui ont réussi à s'entraîner avec des données de 1 min à 2 min, mais le succès n'est pas reproductible par d'autres et n'est pas très informatif. <br>Cela nécessite que le jeu d'entraînement ait un timbre très distinctif (par exemple, un son de fille d'anime aérien à haute fréquence) et que la qualité de l'audio soit élevée ;
 Aucune tentative réussie n'a été faite jusqu'à présent avec des données de moins de 1 min. Cela n'est pas recommandé.<br>

-## Q11: À quoi sert le taux d'index et comment l'ajuster ?
+## Q10: À quoi sert le taux d'index et comment l'ajuster ?
 Si la qualité tonale du modèle pré-entraîné et de la source d'inférence est supérieure à celle du jeu d'entraînement, ils peuvent améliorer la qualité tonale du résultat d'inférence, mais au prix d'un possible biais tonal vers le ton du modèle sous-jacent/source d'inférence plutôt que le ton du jeu d'entraînement, ce qui est généralement appelé "fuite de ton".<br>

 Le taux d'index est utilisé pour réduire/résoudre le problème de la fuite de timbre. Si le taux d'index est fixé à 1, théoriquement il n'y a pas de fuite de timbre de la source d'inférence et la qualité du timbre est plus biaisée vers le jeu d'entraînement. Si le jeu d'entraînement a une qualité sonore inférieure à celle de la source d'inférence, alors un taux d'index plus élevé peut réduire la qualité sonore. Le réduire à 0 n'a pas l'effet d'utiliser le mélange de récupération pour protéger les tons du jeu d'entraînement.<br>

 Si le jeu d'entraînement a une bonne qualité audio et une longue durée, augmentez le total_epoch, lorsque le modèle lui-même est moins susceptible de se référer à la source déduite et au modèle sous-jacent pré-entraîné, et qu'il y a peu de "fuite de ton", le taux d'index n'est pas important et vous pouvez même ne pas créer/partager le fichier index.<br>

-## Q12: Comment choisir le gpu lors de l'inférence ?
+## Q11: Comment choisir le gpu lors de l'inférence ?
 Dans le fichier config.py, sélectionnez le numéro de carte après "device cuda:".<br>

 La correspondance entre le numéro de carte et la carte graphique peut être vue dans la section d'information de la carte graphique de l'onglet d'entraînement.<br>

-## Q13: Comment utiliser le modèle sauvegardé au milieu de l'entraînement ?
+## Q12: Comment utiliser le modèle sauvegardé au milieu de l'entraînement ?
 Sauvegardez via l'extraction de modèle en bas de l'onglet de traitement ckpt.

-## Q14: Erreur de fichier/erreur de mémoire (lors de l'entraînement) ?
+## Q13: Erreur de fichier/erreur de mémoire (lors de l'entraînement) ?
 Il y a trop de processus et votre mémoire n'est pas suffisante. Vous pouvez le corriger en :

 1. Diminuer l'entrée dans le champ "Threads of CPU".

 2. Pré-découper le jeu d'entraînement en fichiers audio plus courts.

-## Q15: Comment poursuivre l'entraînement avec plus de données
+## Q14: Comment poursuivre l'entraînement avec plus de données

 étape 1 : mettre toutes les données wav dans path2.

@@ -151,7 +146,7 @@ Il y a trop de processus et votre mémoire n'est pas suffisante. Vous pouvez le

 étape 4 : cliquez sur "entraîner le modèle", et il continuera l'entraînement depuis le début de votre époque de modèle exp précédente.

-## Q16: erreur à propos de llvmlite.dll
+## Q15: erreur à propos de llvmlite.dll

 OSError: Impossible de charger le fichier objet partagé : llvmlite.dll

@@ -159,11 +154,11 @@ FileNotFoundError: Impossible de trouver le module lib\site-packages\llvmlite\bi

 Le problème se produira sous Windows, installez https://aka.ms/vs/17/release/vc_redist.x64.exe et il sera corrigé.

-## Q17: RuntimeError: La taille étendue du tensor (17280) doit correspondre à la taille existante (0) à la dimension non-singleton 1. Tailles cibles : [1, 17280]. Tailles des tensors : [0]
+## Q16: RuntimeError: La taille étendue du tensor (17280) doit correspondre à la taille existante (0) à la dimension non-singleton 1. Tailles cibles : [1, 17280]. Tailles des tensors : [0]

 Supprimez les fichiers wav dont la taille est nettement inférieure à celle des autres, et cela ne se reproduira plus. Ensuite, cliquez sur "entraîner le modèle" et "entraîner l'index".

-## Q18: RuntimeError: La taille du tensor a (24) doit correspondre à la taille du tensor b (16) à la dimension non-singleton 2
+## Q17: RuntimeError: La taille du tensor a (24) doit correspondre à la taille du tensor b (16) à la dimension non-singleton 2

 Ne changez pas le taux d'échantillonnage puis continuez l'entraînement. S'il est nécessaire de changer, le nom de l'expérience doit être modifié et le modèle sera formé à partir de zéro. Vous pouvez également copier les hauteurs et caractéristiques (dossiers 0/1/2/2b) extraites la dernière fois pour accélérer le processus d'entraînement.

--- a/docs/jp/README.ja.md
+++ b/docs/jp/README.ja.md
@@ -130,27 +130,7 @@ v2バージョンのモデルを使用したい場合は、追加ダウンロー
 	rvcmd assets/v2 # RVC-Models-Downloader command
 	```

-### 2. ffmpegツールのインストール
-`ffmpeg`と`ffprobe`がすでにインストールされている場合は、このステップをスキップできます。
-
-#### Ubuntu/Debian
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS
-```bash
-brew install ffmpeg
-```
-#### Windows
-ダウンロード後、ルートディレクトリに配置しましょう。
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
- [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. RMVPE人声音高抽出アルゴリズムに必要なファイルのダウンロード
+### 2. RMVPE人声音高抽出アルゴリズムに必要なファイルのダウンロード

 最新のRMVPE人声音高抽出アルゴリズムを使用したい場合は、音高抽出モデルをダウンロードし、`assets/rmvpe`に配置する必要があります。

@@ -166,7 +146,7 @@ rvcmd tools/ffmpeg # RVC-Models-Downloader command
 	rvcmd assets/rmvpe # RVC-Models-Downloader command
 	```

-### 4. AMD ROCM（オプション、Linuxのみ）
+### 3. AMD ROCM（オプション、Linuxのみ）

 AMDのRocm技術を基にLinuxシステムでRVCを実行したい場合は、まず[ここ](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html)で必要なドライバをインストールしてください。

@@ -211,7 +191,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 - [VITS](https://github.com/jaywalnut310/vits)
 - [HIFIGAN](https://github.com/jik876/hifi-gan)
 - [Gradio](https://github.com/gradio-app/gradio)
-- [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 - [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 - [audio-slicer](https://github.com/openvpi/audio-slicer)
 - [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/jp/faq_ja.md
+++ b/docs/jp/faq_ja.md
@@ -1,35 +1,30 @@
-## Q1: ffmpeg error/utf8 error
-
-大体の場合、ffmpeg の問題ではなく、音声パスの問題です。<br>
-ffmpeg は空白や()などの特殊文字を含むパスを読み込む際に ffmpeg error が発生する可能性があります。トレーニングセットの音声が中国語のパスを含む場合、filelist.txt に書き込む際に utf8 error が発生する可能性があります。<br>
-
-## Q2: ワンクリックトレーニングが終わってもインデックスがない
+## Q1: ワンクリックトレーニングが終わってもインデックスがない

 "Training is done. The program is closed."と表示された場合、モデルトレーニングは成功しています。その直後のエラーは誤りです。<br>

 ワンクリックトレーニングが終了しても added で始まるインデックスファイルがない場合、トレーニングセットが大きすぎてインデックス追加のステップが停止している可能性があります。バッチ処理 add インデックスでメモリの要求が高すぎる問題を解決しました。一時的に「トレーニングインデックス」ボタンをもう一度クリックしてみてください。<br>

-## Q3: トレーニングが終了してもトレーニングセットの音色が見えない
+## Q2: トレーニングが終了してもトレーニングセットの音色が見えない

 音色をリフレッシュしてもう一度確認してください。それでも見えない場合は、トレーニングにエラーがなかったか、コンソールと WebUI のスクリーンショット、logs/実験名の下のログを開発者に送って確認してみてください。<br>

-## Q4: モデルをどのように共有するか
+## Q3: モデルをどのように共有するか

 rvc_root/logs/実験名の下に保存されている pth は、推論に使用するために共有するためのものではなく、実験の状態を保存して再現およびトレーニングを続けるためのものです。共有するためのモデルは、weights フォルダの下にある 60MB 以上の pth ファイルです。<br>
    今後、weights/exp_name.pth と logs/exp_name/added_xxx.index を組み合わせて weights/exp_name.zip にパッケージ化し、インデックスの記入ステップを省略します。その場合、zip ファイルを共有し、pth ファイルは共有しないでください。別のマシンでトレーニングを続ける場合を除きます。<br>
   logs フォルダの数百 MB の pth ファイルを weights フォルダにコピー/共有して推論に強制的に使用すると、f0、tgt_sr などのさまざまなキーが存在しないというエラーが発生する可能性があります。ckpt タブの一番下で、音高、目標オーディオサンプリングレートを手動または自動（ローカルの logs に関連情報が見つかる場合は自動的に）で選択してから、ckpt の小型モデルを抽出する必要があります（入力パスに G で始まるものを記入）。抽出が完了すると、weights フォルダに 60MB 以上の pth ファイルが表示され、音色をリフレッシュした後に使用できます。<br>

-## Q5: Connection Error
+## Q4: Connection Error

 コンソール（黒いウィンドウ）を閉じた可能性があります。<br>

-## Q6: WebUI が Expecting value: line 1 column 1 (char 0)と表示する
+## Q5: WebUI が Expecting value: line 1 column 1 (char 0)と表示する

 システムのローカルネットワークプロキシ/グローバルプロキシを閉じてください。<br>

 これはクライアントのプロキシだけでなく、サーバー側のプロキシも含まれます（例えば autodl で http_proxy と https_proxy を設定して学術的な加速を行っている場合、使用する際には unset でオフにする必要があります）。<br>

-## Q7: WebUI を使わずにコマンドでトレーニングや推論を行うには
+## Q6: WebUI を使わずにコマンドでトレーニングや推論を行うには

 トレーニングスクリプト：<br>
 まず WebUI を実行し、メッセージウィンドウにデータセット処理とトレーニング用のコマンドラインが表示されます。<br>
@@ -51,18 +46,18 @@ index_rate=float(sys.argv[7])<br>
 device=sys.argv[8]<br>
 is_half=bool(sys.argv[9])<br>

-## Q8: Cuda error/Cuda out of memory
+## Q7: Cuda error/Cuda out of memory

 まれに cuda の設定問題やデバイスがサポートされていない可能性がありますが、大半はメモリ不足（out of memory）が原因です。<br>

 トレーニングの場合は batch size を小さくします（1 にしても足りない場合はグラフィックカードを変更するしかありません）。推論の場合は、config.py の末尾にある x_pad、x_query、x_center、x_max を適宜小さくします。4GB 以下のメモリ（例えば 1060（3G）や各種 2GB のグラフィックカード）は諦めることをお勧めしますが、4GB のメモリのグラフィックカードはまだ救いがあります。<br>

-## Q9: total_epoch はどのくらいに設定するのが良いですか
+## Q8: total_epoch はどのくらいに設定するのが良いですか

 トレーニングセットの音質が悪く、ノイズが多い場合は、20〜30 で十分です。高すぎると、ベースモデルの音質が低音質のトレーニングセットを高めることができません。<br>
 トレーニングセットの音質が高く、ノイズが少なく、長い場合は、高く設定できます。200 は問題ありません（トレーニング速度が速いので、高音質のトレーニングセットを準備できる条件がある場合、グラフィックカードも条件が良いはずなので、少しトレーニング時間が長くなることを気にすることはありません）。<br>

-## Q10: トレーニングセットはどれくらいの長さが必要ですか
+## Q9: トレーニングセットはどれくらいの長さが必要ですか

 10 分から 50 分を推奨します。
    音質が良く、バックグラウンドノイズが低い場合、個人的な特徴のある音色であれば、多ければ多いほど良いです。
@@ -70,7 +65,7 @@ is_half=bool(sys.argv[9])<br>
   1 分から 2 分のデータでトレーニングに成功した人もいますが、その成功体験は他人には再現できないため、あまり参考になりません。トレーニングセットの音色が非常に特徴的である必要があります（例：高い周波数の透明な声や少女の声など）、そして音質が良い必要があります。
   1 分未満のデータでトレーニングを試みた（成功した）ケースはまだ見たことがありません。このような試みはお勧めしません。

-## Q11: index rate は何に使うもので、どのように調整するのか（啓蒙）
+## Q10: index rate は何に使うもので、どのように調整するのか（啓蒙）

 もしベースモデルや推論ソースの音質がトレーニングセットよりも高い場合、推論結果の音質を向上させることができますが、音色がベースモデル/推論ソースの音色に近づくことがあります。これを「音色漏れ」と言います。
   index rate は音色漏れの問題を減少させたり解決するために使用されます。1 に設定すると、理論的には推論ソースの音色漏れの問題は存在しませんが、音質はトレーニングセットに近づきます。トレーニングセットの音質が推論ソースよりも低い場合、index rate を高くすると音質が低下する可能性があります。0 に設定すると、検索ミックスを利用してトレーニングセットの音色を保護する効果はありません。
--- a/docs/kr/README.ko.han.md
+++ b/docs/kr/README.ko.han.md
@@ -81,8 +81,6 @@ V2 버전 모델을 테스트하려면 추가 다운로드가 필요합니다.

 ./assets/pretrained_v2

-# Windows를 使用하는境遇 이 사전도 必要할 수 있습니다. FFmpeg가 設置되어 있으면 건너뛰어도 됩니다.
-ffmpeg.exe
 ```
 그後 以下의 命令을 使用하여 WebUI를 始作할 수 있습니다:
 ```bash
@@ -95,7 +93,6 @@ Windows를 使用하는境遇 `RVC-beta.7z`를 다운로드 및 壓縮解除하
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 ## 모든寄與者분들의勞力에感謝드립니다
--- a/docs/kr/README.ko.md
+++ b/docs/kr/README.ko.md
@@ -156,26 +156,6 @@ sh ./run.sh
 	rvcmd assets/v2 # RVC-Models-Downloader command
 	```

-### 2. 安装 ffmpeg 工具
-若已安装`ffmpeg`和`ffprobe`则可跳过此步骤。
-
-#### Ubuntu/Debian 用户
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS 用户
-```bash
-brew install ffmpeg
-```
-#### Windows 用户
-下载后放置在根目录。
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
- 下载[ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- 下载[ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
 ### 3. 下载 rmvpe 人声音高提取算法所需文件

 如果你想使用最新的RMVPE人声音高提取算法，则你需要下载音高提取模型参数并放置于`assets/rmvpe`。
@@ -237,7 +217,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
@@ -298,31 +277,7 @@ v2 버전 모델을 사용하려면 추가로 다음을 다운로드해야 합
  rvcmd assets/v2 # RVC-Models-Downloader command
  ```

-### 2. ffmpeg 설치
-
-`ffmpeg`와 `ffprobe`가 이미 설치되어 있다면 건너뜁니다.
-
-#### Ubuntu/Debian 사용자
-
-```bash
-sudo apt install ffmpeg
-```
-
-#### MacOS 사용자
-
-```bash
-brew install ffmpeg
-```
-
-#### Windows 사용자
-
-다운로드 후 루트 디렉토리에 배치.
-
- [ffmpeg.exe 다운로드](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- [ffprobe.exe 다운로드](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. RMVPE 인간 음성 피치 추출 알고리즘에 필요한 파일 다운로드
+### 2. RMVPE 인간 음성 피치 추출 알고리즘에 필요한 파일 다운로드

 최신 RMVPE 인간 음성 피치 추출 알고리즘을 사용하려면 음피치 추출 모델 매개변수를 다운로드하고 RVC 루트 디렉토리에 배치해야 합니다.

@@ -332,7 +287,7 @@ brew install ffmpeg

 - [rmvpe.onnx 다운로드](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.onnx)

-### 4. AMD 그래픽 카드 Rocm(선택사항, Linux만 해당)
+### 3. AMD 그래픽 카드 Rocm(선택사항, Linux만 해당)

 Linux 시스템에서 AMD의 Rocm 기술을 기반으로 RVC를 실행하려면 [여기](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html)에서 필요한 드라이버를 먼저 설치하세요.

@@ -392,7 +347,6 @@ source /opt/intel/oneapi/setvars.sh
 - [VITS](https://github.com/jaywalnut310/vits)
 - [HIFIGAN](https://github.com/jik876/hifi-gan)
 - [Gradio](https://github.com/gradio-app/gradio)
- [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 - [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 - [audio-slicer](https://github.com/openvpi/audio-slicer)
 - [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/kr/faq_ko.md
+++ b/docs/kr/faq_ko.md
@@ -1,19 +1,14 @@
-## Q1:ffmpeg 오류/utf8 오류
-
-대부분의 경우 ffmpeg 문제가 아니라 오디오 경로 문제입니다. <br>
-ffmpeg가 공백, () 등의 특수 문자가 포함된 경로를 읽을 때 ffmpeg 오류가 발생할 수 있습니다. 트레이닝 세트 오디오가 중문 경로일 때 filelist.txt에 쓸 때 utf8 오류가 발생할 수 있습니다. <br>
-
-## Q2:일괄 트레이닝이 끝나고 인덱스가 없음
+## Q1:일괄 트레이닝이 끝나고 인덱스가 없음

 "Training is done. The program is closed."라고 표시되면 모델 트레이닝이 성공한 것이며, 이어지는 오류는 가짜입니다. <br>

 일괄 트레이닝이 끝나고 'added'로 시작하는 인덱스 파일이 없으면 트레이닝 세트가 너무 커서 인덱스 추가 단계에서 멈췄을 수 있습니다. 메모리에 대한 인덱스 추가 요구 사항이 너무 큰 문제를 배치 처리 add 인덱스로 해결했습니다. 임시로 "트레이닝 인덱스" 버튼을 다시 클릭해 보세요. <br>

-## Q3:트레이닝이 끝나고 트레이닝 세트의 음색을 추론에서 보지 못함
+## Q2:트레이닝이 끝나고 트레이닝 세트의 음색을 추론에서 보지 못함

 '음색 새로고침'을 클릭해 보세요. 여전히 없다면 트레이닝에 오류가 있는지, 콘솔 및 webui의 스크린샷, logs/실험명 아래의 로그를 개발자에게 보내 확인해 보세요. <br>

-## Q4:모델 공유 방법
+## Q3:모델 공유 방법

 rvc_root/logs/실험명 아래에 저장된 pth는 추론에 사용하기 위한 것이 아니라 실험 상태를 저장하고 복원하며, 트레이닝을 계속하기 위한 것입니다. 공유에 사용되는 모델은 weights 폴더 아래 60MB 이상인 pth 파일입니다. <br>
 <br/>
@@ -21,17 +16,17 @@ rvc_root/logs/실험명 아래에 저장된 pth는 추론에 사용하기 위한
 <br/>
 logs 폴더 아래 수백 MB의 pth 파일을 weights 폴더에 복사/공유하여 강제로 추론에 사용하면 f0, tgt_sr 등의 키가 없다는 오류가 발생할 수 있습니다. ckpt 탭 아래에서 수동 또는 자동(로컬 logs에서 관련 정보를 찾을 수 있는 경우 자동)으로 음성, 대상 오디오 샘플링률 옵션을 선택한 후 ckpt 소형 모델을 추출해야 합니다(입력 경로에 G로 시작하는 경로를 입력). 추출 후 weights 폴더에 60MB 이상의 pth 파일이 생성되며, 음색 새로고침 후 사용할 수 있습니다. <br>

-## Q5:연결 오류
+## Q4:연결 오류

 아마도 컨트롤 콘솔(검은 창)을 닫았을 것입니다. <br>

-## Q6:WebUI에서 "Expecting value: line 1 column 1 (char 0)" 오류가 발생함
+## Q5:WebUI에서 "Expecting value: line 1 column 1 (char 0)" 오류가 발생함

 시스템 로컬 네트워크 프록시/글로벌 프록시를 닫으세요. <br>

 이는 클라이언트의 프록시뿐만 아니라 서버 측의 프록시도 포함합니다(예: autodl로 http_proxy 및 https_proxy를 설정한 경우 사용 시 unset으로 끄세요). <br>

-## Q7:WebUI 없이 명령으로 트레이닝 및 추론하는 방법
+## Q6:WebUI 없이 명령으로 트레이닝 및 추론하는 방법

 트레이닝 스크립트: <br>
 먼저 WebUI를 실행하여 데이터 세트 처리 및 트레이닝에 사용되는 명령줄을 메시지 창에서 확인할 수 있습니다. <br>
@@ -53,18 +48,18 @@ index_rate=float(sys.argv[7]) <br>
 device=sys.argv[8] <br>
 is_half=bool(sys.argv[9]) <br>

-## Q8:Cuda 오류/Cuda 메모리 부족
+## Q7:Cuda 오류/Cuda 메모리 부족

 아마도 cuda 설정 문제이거나 장치가 지원되지 않을 수 있습니다. 대부분의 경우 메모리가 부족합니다(out of memory). <br>

 트레이닝의 경우 batch size를 줄이세요(1로 줄여도 부족하다면 다른 그래픽 카드로 트레이닝을 해야 합니다). 추론의 경우 config.py 파일 끝에 있는 x_pad, x_query, x_center, x_max를 적절히 줄이세요. 4GB 미만의 메모리(예: 1060(3GB) 및 여러 2GB 그래픽 카드)를 가진 경우는 포기하세요. 4GB 메모리 그래픽 카드는 아직 구할 수 있습니다. <br>

-## Q9:total_epoch를 몇으로 설정하는 것이 좋을까요
+## Q8:total_epoch를 몇으로 설정하는 것이 좋을까요

 트레이닝 세트의 오디오 품질이 낮고 배경 소음이 많으면 20~30이면 충분합니다. 너무 높게 설정하면 바닥 모델의 오디오 품질이 낮은 트레이닝 세트를 높일 수 없습니다. <br>
 트레이닝 세트의 오디오 품질이 높고 배경 소음이 적고 길이가 길 경우 높게 설정할 수 있습니다. 200도 괜찮습니다(트레이닝 속도가 빠르므로, 고품질 트레이닝 세트를 준비할 수 있는 조건이 있다면, 그래픽 카드도 좋을 것이므로, 조금 더 긴 트레이닝 시간에 대해 걱정하지 않을 것입니다). <br>

-## Q10: 트레이닝 세트는 얼마나 길어야 하나요
+## Q9: 트레이닝 세트는 얼마나 길어야 하나요

 10분에서 50분을 추천합니다.
 <br/>
@@ -76,7 +71,7 @@ is_half=bool(sys.argv[9]) <br>
 <br/>
 1분 미만의 데이터로 트레이닝을 시도(성공)한 사례는 아직 보지 못했습니다. 이런 시도는 권장하지 않습니다.

-## Q11: index rate는 무엇이며, 어떻게 조정하나요? (과학적 설명)
+## Q10: index rate는 무엇이며, 어떻게 조정하나요? (과학적 설명)

 만약 베이스 모델과 추론 소스의 음질이 트레이닝 세트보다 높다면, 그들은 추론 결과의 음질을 높일 수 있지만, 음색이 베이스 모델/추론 소스의 음색으로 기울어질 수 있습니다. 이 현상을 "음색 유출"이라고 합니다.
 <br/>
--- a/docs/pt/README.pt.md
+++ b/docs/pt/README.pt.md
@@ -123,15 +123,6 @@ Se você deseja testar o modelo da versão v2 (o modelo da versão v2 alterou a

 ./assets/pretrained_v2

-#Se você estiver usando Windows, também pode precisar desses dois arquivos, pule se FFmpeg e FFprobe estiverem instalados
-ffmpeg.exe
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe
-
-ffprobe.exe
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe
-
 Se quiser usar o algoritmo de extração de tom vocal SOTA RMVPE mais recente, você precisa baixar os pesos RMVPE e colocá-los no diretório raiz RVC

 https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt
@@ -179,7 +170,6 @@ python web.py
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/pt/faq_pt.md
+++ b/docs/pt/faq_pt.md
@@ -100,37 +100,33 @@ Primeira coisa que gostaria de lembrar, não necessariamente quanto mais epochs


 # <b>FAQ Original traduzido</b>
-## <b><span style="color: #337dff;">Q1: erro ffmpeg/erro utf8.</span></b>
-Provavelmente não é um problema do FFmpeg, mas sim um problema de caminho de áudio;

-O FFmpeg pode encontrar um erro ao ler caminhos contendo caracteres especiais como spaces e (), o que pode causar um erro FFmpeg; e quando o áudio do conjunto de treinamento contém caminhos chineses, gravá-lo em filelist.txt pode causar um erro utf8.<hr>
-
-## <b><span style="color: #337dff;">Q2:Não é possível encontrar o arquivo de Index após "Treinamento com um clique".</span></b>
+## <b><span style="color: #337dff;">Q1:Não é possível encontrar o arquivo de Index após "Treinamento com um clique".</span></b>
 Se exibir "O treinamento está concluído. O programa é fechado ", então o modelo foi treinado com sucesso e os erros subsequentes são falsos;

 A falta de um arquivo de index 'adicionado' após o treinamento com um clique pode ser devido ao conjunto de treinamento ser muito grande, fazendo com que a adição do index fique presa; isso foi resolvido usando o processamento em lote para adicionar o index, o que resolve o problema de sobrecarga de memória ao adicionar o index. Como solução temporária, tente clicar no botão "Treinar Index" novamente.<hr>

-## <b><span style="color: #337dff;">Q3:Não é possível encontrar o modelo em “Modelo de voz” após o treinamento</span></b>
+## <b><span style="color: #337dff;">Q2:Não é possível encontrar o modelo em “Modelo de voz” após o treinamento</span></b>
 Clique em "Atualizar lista de voz" ou "Atualizar na EasyGUI e verifique novamente; se ainda não estiver visível, verifique se há erros durante o treinamento e envie capturas de tela do console, da interface do usuário da Web e dos ``logs/experiment_name/*.log`` para os desenvolvedores para análise posterior.<hr>

-## <b><span style="color: #337dff;">Q4:Como compartilhar um modelo/Como usar os modelos dos outros?</span></b>
+## <b><span style="color: #337dff;">Q3:Como compartilhar um modelo/Como usar os modelos dos outros?</span></b>
 Os arquivos ``.pth`` armazenados em ``*/logs/minha-voz`` não são destinados para compartilhamento ou inference, mas para armazenar os checkpoits do experimento para reprodutibilidade e treinamento adicional. O modelo a ser compartilhado deve ser o arquivo ``.pth`` de 60+MB na pasta **weights**;

 No futuro, ``weights/minha-voz.pth`` e ``logs/minha-voz/added_xxx.index`` serão mesclados em um único arquivo de ``weights/minha-voz.zip`` para eliminar a necessidade de entrada manual de index; portanto, compartilhe o arquivo zip, não somente o arquivo .pth, a menos que você queira continuar treinando em uma máquina diferente;

 Copiar/compartilhar os vários arquivos .pth de centenas de MB da pasta de logs para a pasta de weights para inference forçada pode resultar em erros como falta de f0, tgt_sr ou outras chaves. Você precisa usar a guia ckpt na parte inferior para manualmente ou automaticamente (se as informações forem encontradas nos ``logs/minha-voz``), selecione se deseja incluir informações de tom e opções de taxa de amostragem de áudio de destino e, em seguida, extrair o modelo menor. Após a extração, haverá um arquivo pth de 60+ MB na pasta de weights, e você pode atualizar as vozes para usá-lo.<hr>

-## <b><span style="color: #337dff;">Q5 Erro de conexão:</span></b>
+## <b><span style="color: #337dff;">Q4 Erro de conexão:</span></b>
 Para sermos otimistas, aperte F5/recarregue a página, pode ter sido apenas um bug da GUI

 Se não...
 Você pode ter fechado o console (janela de linha de comando preta).
 Ou o Google Colab, no caso do Colab, as vezes pode simplesmente fechar<hr>

-## <b><span style="color: #337dff;">Q6: Pop-up WebUI 'Valor esperado: linha 1 coluna 1 (caractere 0)'.</span></b>
+## <b><span style="color: #337dff;">Q5: Pop-up WebUI 'Valor esperado: linha 1 coluna 1 (caractere 0)'.</span></b>
 Desative o proxy LAN do sistema/proxy global e atualize.<hr>

-## <b><span style="color: #337dff;">Q7:Como treinar e inferir sem a WebUI?</span></b>
+## <b><span style="color: #337dff;">Q6:Como treinar e inferir sem a WebUI?</span></b>
 Script de treinamento:
 <br>Você pode executar o treinamento em WebUI primeiro, e as versões de linha de comando do pré-processamento e treinamento do conjunto de dados serão exibidas na janela de mensagens.<br>

@@ -153,17 +149,17 @@ index_rate=float(sys.argv[7])<br>
 device=sys.argv[8]<br>
 is_half=bool(sys.argv[9])<hr>

-## <b><span style="color: #337dff;">Q8: Erro Cuda/Cuda sem memória.</span></b>
+## <b><span style="color: #337dff;">Q7: Erro Cuda/Cuda sem memória.</span></b>
 Há uma pequena chance de que haja um problema com a configuração do CUDA ou o dispositivo não seja suportado; mais provavelmente, não há memória suficiente (falta de memória).<br>

 Para treinamento, reduza o (batch size) tamanho do lote (se reduzir para 1 ainda não for suficiente, talvez seja necessário alterar a placa gráfica); para inference, ajuste as configurações x_pad, x_query, x_center e x_max no arquivo config.py conforme necessário. Cartões de memória 4G ou inferiores (por exemplo, 1060(3G) e várias placas 2G) podem ser abandonados, enquanto os placas de vídeo com memória 4G ainda têm uma chance.<hr>

-## <b><span style="color: #337dff;">Q9:Quantos total_epoch são ótimos?</span></b>
+## <b><span style="color: #337dff;">Q8:Quantos total_epoch são ótimos?</span></b>
 Se a qualidade de áudio do conjunto de dados de treinamento for ruim e o nível de ruído for alto, **20-30 epochs** são suficientes. Defini-lo muito alto não melhorará a qualidade de áudio do seu conjunto de treinamento de baixa qualidade.<br>

 Se a qualidade de áudio do conjunto de treinamento for alta, o nível de ruído for baixo e houver duração suficiente, você poderá aumentá-lo. **200 é aceitável** (uma vez que o treinamento é rápido e, se você puder preparar um conjunto de treinamento de alta qualidade, sua GPU provavelmente poderá lidar com uma duração de treinamento mais longa sem problemas).<hr>

-## <b><span style="color: #337dff;">Q10:Quanto tempo de treinamento é necessário?</span></b>
+## <b><span style="color: #337dff;">Q9:Quanto tempo de treinamento é necessário?</span></b>

 **Recomenda-se um conjunto de dados de cerca de 10 min a 50 min.**<br>

@@ -175,28 +171,28 @@ Há algumas pessoas que treinaram com sucesso com dados de 1 a 2 minutos, mas o
 Dados com menos de 1 minuto, já obtivemo sucesso. Mas não é recomendado.<hr>


-## <b><span style="color: #337dff;">Q11:Qual é a taxa do index e como ajustá-la?</span></b>
+## <b><span style="color: #337dff;">Q10:Qual é a taxa do index e como ajustá-la?</span></b>
 Se a qualidade do tom do modelo pré-treinado e da fonte de inference for maior do que a do conjunto de treinamento, eles podem trazer a qualidade do tom do resultado do inference, mas ao custo de um possível viés de tom em direção ao tom do modelo subjacente/fonte de inference, em vez do tom do conjunto de treinamento, que é geralmente referido como "vazamento de tom".<br>

 A taxa de index é usada para reduzir/resolver o problema de vazamento de timbre. Se a taxa do index for definida como 1, teoricamente não há vazamento de timbre da fonte de inference e a qualidade do timbre é mais tendenciosa em relação ao conjunto de treinamento. Se o conjunto de treinamento tiver uma qualidade de som mais baixa do que a fonte de inference, uma taxa de index mais alta poderá reduzir a qualidade do som. Reduzi-lo a 0 não tem o efeito de usar a mistura de recuperação para proteger os tons definidos de treinamento.<br>

 Se o conjunto de treinamento tiver boa qualidade de áudio e longa duração, aumente o total_epoch, quando o modelo em si é menos propenso a se referir à fonte inferida e ao modelo subjacente pré-treinado, e há pouco "vazamento de tom", o index_rate não é importante e você pode até não criar/compartilhar o arquivo de index.<hr>

-## <b><span style="color: #337dff;">Q12:Como escolher o GPU ao inferir?</span></b>
+## <b><span style="color: #337dff;">Q11:Como escolher o GPU ao inferir?</span></b>
 No arquivo ``config.py``, selecione o número da placa em "device cuda:".<br>

 O mapeamento entre o número da placa e a placa gráfica pode ser visto na seção de informações da placa gráfica da guia de treinamento.<hr>

-## <b><span style="color: #337dff;">Q13:Como usar o modelo salvo no meio do treinamento?</span></b>
+## <b><span style="color: #337dff;">Q12:Como usar o modelo salvo no meio do treinamento?</span></b>
 Salvar via extração de modelo na parte inferior da guia de processamento do ckpt.<hr>

-## <b><span style="color: #337dff;">Q14: Erro de arquivo/memória (durante o treinamento)?</span></b>
+## <b><span style="color: #337dff;">Q13: Erro de arquivo/memória (durante o treinamento)?</span></b>
 Muitos processos e sua memória não é suficiente. Você pode corrigi-lo por:

 1. Diminuir a entrada no campo "Threads da CPU".
 2. Diminuir o tamanho do conjunto de dados.

-## Q15: Como continuar treinando usando mais dados
+## Q14: Como continuar treinando usando mais dados

 passo 1: coloque todos os dados wav no path2.

@@ -206,7 +202,7 @@ passo 3: copie o arquivo G e D mais recente de exp_name1 (seu experimento anteri

 passo 4: clique em "treinar o modelo" e ele continuará treinando desde o início da época anterior do modelo exp.

-## Q16: erro sobre llvmlite.dll
+## Q15: erro sobre llvmlite.dll

 OSError: Não foi possível carregar o arquivo de objeto compartilhado: llvmlite.dll

@@ -214,11 +210,11 @@ FileNotFoundError: Não foi possível encontrar o módulo lib\site-packages\llvm

 O problema acontecerá no Windows, instale https://aka.ms/vs/17/release/vc_redist.x64.exe e será corrigido.

-## Q17: RuntimeError: O tamanho expandido do tensor (17280) deve corresponder ao tamanho existente (0) na dimensão 1 não singleton. Tamanhos de destino: [1, 17280]. Tamanhos de tensor: [0]
+## Q16: RuntimeError: O tamanho expandido do tensor (17280) deve corresponder ao tamanho existente (0) na dimensão 1 não singleton. Tamanhos de destino: [1, 17280]. Tamanhos de tensor: [0]

 Exclua os arquivos wav cujo tamanho seja significativamente menor que outros e isso não acontecerá novamente. Em seguida, clique em "treinar o modelo" e "treinar o índice".

-## Q18: RuntimeError: O tamanho do tensor a (24) deve corresponder ao tamanho do tensor b (16) na dimensão não singleton 2
+## Q17: RuntimeError: O tamanho do tensor a (24) deve corresponder ao tamanho do tensor b (16) na dimensão não singleton 2

 Não altere a taxa de amostragem e continue o treinamento. Caso seja necessário alterar, o nome do exp deverá ser alterado e o modelo será treinado do zero. Você também pode copiar o pitch e os recursos (pastas 0/1/2/2b) extraídos da última vez para acelerar o processo de treinamento.

--- a/docs/tr/README.tr.md
+++ b/docs/tr/README.tr.md
@@ -108,15 +108,6 @@ V2 sürüm modelini test etmek isterseniz (v2 sürüm modeli, 9 katmanlı Hubert

 ./assets/pretrained_v2

-Eğer Windows kullanıyorsanız, FFmpeg ve FFprobe kurulu değilse bu iki dosyayı da indirmeniz gerekebilir.
-ffmpeg.exe
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe
-
-ffprobe.exe
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe
-
 En son SOTA RMVPE vokal ton çıkarma algoritmasını kullanmak istiyorsanız, RMVPE ağırlıklarını indirip RVC kök dizinine koymalısınız.

 https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt
@@ -140,7 +131,6 @@ Windows veya macOS kullanıyorsanız, `RVC-beta.7z` dosyasını indirip çıkara
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vokal ton çıkarma:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/tr/faq_tr.md
+++ b/docs/tr/faq_tr.md
@@ -1,30 +1,25 @@
-## Q1: FFmpeg Hatası/UTF8 Hatası
-Büyük olasılıkla bu bir FFmpeg sorunu değil, daha çok ses dosyası yolunda bir sorun;
-
-FFmpeg, boşluklar ve () gibi özel karakterler içeren yolları okurken bir hata ile karşılaşabilir; ve eğitim setinin ses dosyaları Çin karakterleri içeriyorsa, bunlar filelist.txt'ye yazıldığında utf8 hatasına neden olabilir.<br>
-
-## Q2: "Tek Tıklamayla Eğitim" Sonrası İndeks Dosyası Bulunamıyor
+## Q1: "Tek Tıklamayla Eğitim" Sonrası İndeks Dosyası Bulunamıyor
 Eğer "Eğitim tamamlandı. Program kapatıldı." mesajını görüyorsa, model başarıyla eğitilmiş demektir ve sonraki hatalar sahte;

 "Added" dizini oluşturulduğu halde "Tek Tıklamayla Eğitim" sonrası indeks dosyası bulunamıyorsa, bu genellikle eğitim setinin çok büyük olmasından kaynaklanabilir ve indeksin eklenmesi sıkışabilir. Bu sorun indeks eklerken bellek yükünü azaltmak için toplu işlem yaparak çözülmüştür. Geçici bir çözüm olarak, "Eğitim İndeksini Eğit" düğmesine tekrar tıklamayı deneyin.<br>

-## Q3: Eğitim Sonrası "Tonlama İnceleniyor" Bölümünde Model Bulunamıyor
+## Q2: Eğitim Sonrası "Tonlama İnceleniyor" Bölümünde Model Bulunamıyor
 "Lanetleme İstemi Listesini Yenile" düğmesine tıklayarak tekrar kontrol edin; hala görünmüyorsa, eğitim sırasında herhangi bir hata olup olmadığını kontrol edin ve geliştiricilere daha fazla analiz için konsol, web arayüzü ve logs/experiment_name/*.log ekran görüntülerini gönderin.<br>

-## Q4: Bir Model Nasıl Paylaşılır/Başkalarının Modelleri Nasıl Kullanılır?
+## Q3: Bir Model Nasıl Paylaşılır/Başkalarının Modelleri Nasıl Kullanılır?
 rvc_root/logs/experiment_name dizininde saklanan pth dosyaları paylaşım veya çıkarım için değildir, bunlar deney checkpoint'larıdır ve çoğaltılabilirlik ve daha fazla eğitim için saklanır. Paylaşılacak olan model, weights klasöründeki 60+MB'lık pth dosyası olmalıdır;

 Gelecekte, weights/exp_name.pth ve logs/exp_name/added_xxx.index birleştirilerek tek bir weights/exp_name.zip dosyasına dönüştürülecek ve manuel indeks girişi gereksinimini ortadan kaldıracaktır; bu nedenle pth dosyasını değil, farklı bir makinede eğitime devam etmek istemezseniz zip dosyasını paylaşın;

 Çıkarılmış modelleri zorlama çıkarım için logs klasöründen weights klasörüne birkaç yüz MB'lık pth dosyalarını kopyalamak/paylaşmak, eksik f0, tgt_sr veya diğer anahtarlar gibi hatalara neden olabilir. Smaller modeli manuel veya otomatik olarak çıkarmak için alttaki ckpt sekmesini kullanmanız gerekmektedir (eğer bilgi logs/exp_name içinde bulunuyorsa), pitch bilgisini ve hedef ses örnekleme oranı seçeneklerini seçmeli ve ardından daha küçük modele çıkarmalısınız. Çıkardıktan sonra weights klasöründe 60+ MB'lık bir pth dosyası olacaktır ve sesleri yeniden güncelleyebilirsiniz.<br>

-## Q5: Bağlantı Hatası
+## Q4: Bağlantı Hatası
 Büyük ihtimalle konsolu (siyah komut satırı penceresi) kapatmış olabilirsiniz.<br>

-## Q6: Web Arayüzünde 'Beklenen Değer: Satır 1 Sütun 1 (Karakter 0)' Hatası
+## Q5: Web Arayüzünde 'Beklenen Değer: Satır 1 Sütun 1 (Karakter 0)' Hatası
 Lütfen sistem LAN proxy/global proxy'sini devre dışı bırakın ve ardından sayfayı yenileyin.<br>

-## Q7: WebUI Olmadan Nasıl Eğitim Yapılır ve Tahmin Yapılır?
+## Q6: WebUI Olmadan Nasıl Eğitim Yapılır ve Tahmin Yapılır?
 Eğitim komut dosyası:<br>
 Önce WebUI'de eğitimi çalıştırabilirsiniz, ardından veri seti önişleme ve eğitiminin komut satırı sürümleri mesaj penceresinde görüntülenecektir.<br>

@@ -47,19 +42,19 @@ index_rate=float(sys.argv[7])<br>
 device=sys.argv[8]<br>
 is_half=bool(sys.argv[9])<br>

-## Q8: Cuda Hatası/Cuda Bellek Yetersizliği
+## Q7: Cuda Hatası/Cuda Bellek Yetersizliği
 Küçük bir ihtimalle CUDA konfigürasyonunda bir problem olabilir veya cihaz desteklenmiyor olabilir; daha muhtemel olarak yetersiz bellek olabilir (bellek yetersizliği).<br>

 Eğitim için toplu işlem boyutunu azaltın (1'e indirgemek yeterli değilse, grafik kartını değiştirmeniz gerekebilir); çıkarım için ise config.py dosyasındaki x_pad, x_query, x_center ve x_max ayarlarını ihtiyaca göre düzenleyin. 4GB veya daha düşük bellekli kartlar (örneğin 1060(3G) ve çeşit

 li 2GB kartlar) terk edilebilir, 4GB bellekli kartlar hala bir şansı vardır.<br>

-## Q9: Optimal Olarak Kaç total_epoch Gerekli?
+## Q8: Optimal Olarak Kaç total_epoch Gerekli?
 Eğitim veri setinin ses kalitesi düşük ve gürültü seviyesi yüksekse, 20-30 dönem yeterlidir. Fazla yüksek bir değer belirlemek, düşük kaliteli eğitim setinizin ses kalitesini artırmaz.<br>

 Eğitim setinin ses kalitesi yüksek, gürültü seviyesi düşük ve yeterli süre varsa, bu değeri artırabilirsiniz. 200 kabul edilebilir bir değerdir (çünkü eğitim hızlıdır ve yüksek kaliteli bir eğitim seti hazırlayabiliyorsanız, GPU'nuz muhtemelen uzun bir eğitim süresini sorunsuz bir şekilde yönetebilir).<br>

-## Q10: Kaç Dakika Eğitim Verisi Süresi Gerekli?
+## Q9: Kaç Dakika Eğitim Verisi Süresi Gerekli?

 10 ila 50 dakika arası bir veri seti önerilir.<br>

@@ -70,29 +65,29 @@ Yüksek seviyede bir eğitim seti (zarif ve belirgin tonlama), 5 ila 10 dakika a
 1 ila 2 dakika veri ile başarılı bir şekilde eğitim yapan bazı insanlar olsa da, başarı diğerleri tarafından tekrarlanabilir değil ve çok bilgilendirici değil. Bu, eğitim setinin çok belirgin bir tonlamaya sahip olmasını (örneğin yüksek frekansta havadar bir anime kız sesi gibi) ve ses kalitesinin yüksek olmasını gerektirir; 1 dakikadan daha kısa süreli veri denenmemiştir ve önerilmez.<br>


-## Q11: İndeks Oranı Nedir ve Nasıl Ayarlanır?
+## Q10: İndeks Oranı Nedir ve Nasıl Ayarlanır?
 Eğer önceden eğitilmiş model ve tahmin kaynağının ton kalitesi, eğitim setinden daha yüksekse, tahmin sonucunun ton kalitesini yükseltebilirler, ancak altta yatan modelin/tahmin kaynağının tonu yerine eğitim setinin tonuna yönelik olası bir ton önyargısıyla sonuçlanır, bu genellikle "ton sızıntısı" olarak adlandırılır.<br>

 İndeks oranı, ton sızıntı sorununu azaltmak/çözmek için kullanılır. İndeks oranı 1 olarak ayarlandığında, teorik olarak tahmin kaynağından ton sızıntısı olmaz ve ton kalitesi daha çok eğitim setine yönelik olur. Eğer eğitim seti, tahmin kaynağından daha düşük ses kalitesine sahipse, daha yüksek bir indeks oranı ses kalitesini azaltabilir. Oranı 0'a düşürmek, eğitim seti tonlarını korumak için getirme karıştırmasını kullanmanın etkisine sahip değildir.<br>

 Eğer eğitim seti iyi ses kalitesine ve uzun süreye sahipse, total_epoch'u artırın. Model, tahmin kaynağına ve önceden eğitilmiş alt modeline daha az başvurduğunda ve "ton sızıntısı" daha az olduğunda, indeks oranı önemli değil ve hatta indeks dosyası oluşturmak/paylaşmak gerekli değildir.<br>

-## Q12: Tahmin Yaparken Hangi GPU'yu Seçmeli?
+## Q11: Tahmin Yaparken Hangi GPU'yu Seçmeli?
 config.py dosyasında "device cuda:" ardından kart numarasını seçin.<br>

 Kart numarası ile grafik kartı arasındaki eşleme, eğitim sekmesinin grafik kartı bilgileri bölümünde görülebilir.<br>

-## Q13: Eğitimin Ortasında Kaydedilen Model Nasıl Kullanılır?
+## Q12: Eğitimin Ortasında Kaydedilen Model Nasıl Kullanılır?
 Kaydetme işlemini ckpt işleme sekmesinin altında yer alan model çıkarımı ile yapabilirsiniz.

-## Q14: Dosya/Bellek Hatası (Eğitim Sırasında)?
+## Q13: Dosya/Bellek Hatası (Eğitim Sırasında)?
 Çok fazla işlem ve yetersiz bellek olabilir. Bu sorunu düzeltebilirsiniz:

 1. "CPU İş Parçacıkları" alanındaki girişi azaltarak.

 2. Eğitim verisini daha kısa ses dosyalarına önceden keserek.

-## Q15: Daha Fazla Veri Kullanarak Eğitime Nasıl Devam Edilir?
+## Q14: Daha Fazla Veri Kullanarak Eğitime Nasıl Devam Edilir?

 Adım 1: Tüm wav verilerini path2 dizinine yerleştirin.

--- a/docs/tr/training_tips_tr.md
+++ b/docs/tr/training_tips_tr.md
@@ -20,9 +20,6 @@ Ses yüklenir ve ön işleme yapılır.
 Ses içeren bir klasör belirtirseniz, bu klasördeki ses dosyaları otomatik olarak okunur.
 Örneğin, `C:Users\hoge\voices` belirtirseniz, `C:Users\hoge\voices\voice.mp3` yüklenecek, ancak `C:Users\hoge\voices\dir\voice.mp3` yüklenmeyecektir.

-Ses okumak için dahili olarak ffmpeg kullanıldığından, uzantı ffmpeg tarafından destekleniyorsa otomatik olarak okunacaktır.
-ffmpeg ile int16'ya dönüştürüldükten sonra float32'ye dönüştürülüp -1 ile 1 arasında normalize edilir.
-
 ### Gürültü Temizleme
 Ses scipy'nin filtfilt işlevi ile yumuşatılır.

--- a/infer/lib/audio.py
+++ b/infer/lib/audio.py
@@ -1,9 +1,10 @@
 from io import BufferedWriter, BytesIO
 from pathlib import Path
-from typing import Dict
-import ffmpeg
+from typing import Dict, Tuple
 import numpy as np
 import av
+import os
+from av.audio.resampler import AudioResampler

 video_format_dict: Dict[str, str] = {
    "m4a": "mp4",
@@ -39,20 +40,112 @@ def load_audio(file: str, sr: int) -> np.ndarray:
        raise FileNotFoundError(f"File not found: {file}")

    try:
-        # https://github.com/openai/whisper/blob/main/whisper/audio.py#L26
-        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
-        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
-        file = str(clean_path(file))  # 防止小白拷路径头尾带了空格和"和回车
-        out, _ = (
-            ffmpeg.input(file, threads=0)
-            .output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
-            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
-        )
+        container = av.open(file)
+        resampler = AudioResampler(format="fltp", layout="mono", rate=sr)
+
+        # Estimated maximum total number of samples to pre-allocate the array
+        audio_duration_sec: float = (
+            container.duration / 1_000_000
+        )  # container.duration is expressed in av.time_base units (microseconds)
+        estimated_total_samples = int(audio_duration_sec * sr + 0.5)
+        decoded_audio = np.zeros(estimated_total_samples + 1, dtype=np.float32)
+
+        offset = 0
+        for frame in container.decode(audio=0):
+            frame.pts = None  # Clear presentation timestamp to avoid resampling issues
+            resampled_frames = resampler.resample(frame)
+            for resampled_frame in resampled_frames:
+                frame_data = np.array(resampled_frame.to_ndarray()).flatten()
+                end_index = offset + len(frame_data)
+
+                # Check if decoded_audio has enough space, and resize if necessary
+                if end_index > decoded_audio.shape[0]:
+                    decoded_audio = np.resize(decoded_audio, end_index + 1)
+
+                decoded_audio[offset:end_index] = frame_data
+                offset += len(frame_data)
+
+        # Truncate the array to the actual size
+        decoded_audio = decoded_audio[:offset]
    except Exception as e:
        raise RuntimeError(f"Failed to load audio: {e}")

-    return np.frombuffer(out, np.float32).flatten()
+    return decoded_audio


-def clean_path(path: str) -> Path:
-    return Path(path.strip(' "\n')).resolve()
+def downsample_audio(input_path: str, output_path: str, format: str) -> None:
+    if not os.path.exists(input_path):
+        return
+
+    input_container = av.open(input_path)
+    output_container = av.open(output_path, "w")
+
+    # Create a stream in the output container
+    input_stream = input_container.streams.audio[0]
+    output_stream = output_container.add_stream(format)
+
+    output_stream.bit_rate = 128_000  # 128 kb/s target bitrate (replaces the former ffmpeg `-q:a 2` quality setting)
+
+    # Copy packets from the input file to the output file
+    for packet in input_container.demux(input_stream):
+        for frame in packet.decode():
+            for out_packet in output_stream.encode(frame):
+                output_container.mux(out_packet)
+
+    for packet in output_stream.encode():
+        output_container.mux(packet)
+
+    # Close the containers
+    input_container.close()
+    output_container.close()
+
+    try:  # Remove the original file
+        os.remove(input_path)
+    except Exception as e:
+        print(f"Failed to remove the original file: {e}")
+
+
+def resample_audio(
+    input_path: str, output_path: str, codec: str, format: str, sr: int, layout: str
+) -> None:
+    if not os.path.exists(input_path):
+        return
+
+    input_container = av.open(input_path)
+    output_container = av.open(output_path, "w")
+
+    # Create a stream in the output container
+    input_stream = input_container.streams.audio[0]
+    output_stream = output_container.add_stream(codec, rate=sr, layout=layout)
+
+    resampler = AudioResampler(format, layout, sr)
+
+    # Copy packets from the input file to the output file
+    for packet in input_container.demux(input_stream):
+        for frame in packet.decode():
+            frame.pts = None  # Clear presentation timestamp to avoid resampling issues
+            out_frames = resampler.resample(frame)
+            for out_frame in out_frames:
+                for out_packet in output_stream.encode(out_frame):
+                    output_container.mux(out_packet)
+
+    for packet in output_stream.encode():
+        output_container.mux(packet)
+
+    # Close the containers
+    input_container.close()
+    output_container.close()
+
+    try:  # Remove the original file
+        os.remove(input_path)
+    except Exception as e:
+        print(f"Failed to remove the original file: {e}")
+
+
+def get_audio_properties(input_path: str) -> Tuple:
+    container = av.open(input_path)
+    audio_stream = next(s for s in container.streams if s.type == "audio")
+    channels = 1 if audio_stream.layout.name == "mono" else 2
+    rate = audio_stream.codec_context.sample_rate
+    container.close()
+    return channels, rate
--- a/infer/modules/uvr5/mdxnet.py
+++ b/infer/modules/uvr5/mdxnet.py
@@ -8,6 +8,9 @@ import numpy as np
 import soundfile as sf
 import torch
 from tqdm import tqdm
+import av
+
+from infer.lib.audio import downsample_audio

 cpu = torch.device("cpu")

@@ -218,20 +221,8 @@ class Predictor:
            sf.write(path_other, opt, rate)
            opt_path_vocal = path_vocal[:-4] + ".%s" % format
            opt_path_other = path_other[:-4] + ".%s" % format
-            if os.path.exists(path_vocal):
-                os.system(f'ffmpeg -i "{path_vocal}" -vn "{opt_path_vocal}" -q:a 2 -y')
-                if os.path.exists(opt_path_vocal):
-                    try:
-                        os.remove(path_vocal)
-                    except:
-                        pass
-            if os.path.exists(path_other):
-                os.system(f'ffmpeg -i "{path_other}" -vn "{opt_path_other}" -q:a 2 -y')
-                if os.path.exists(opt_path_other):
-                    try:
-                        os.remove(path_other)
-                    except:
-                        pass
+            downsample_audio(path_vocal, opt_path_vocal, format)
+            downsample_audio(path_other, opt_path_other, format)


 class MDXNetDereverb:
--- a/infer/modules/uvr5/modules.py
+++ b/infer/modules/uvr5/modules.py
@@ -4,7 +4,7 @@ import logging

 logger = logging.getLogger(__name__)

-import ffmpeg
+from infer.lib.audio import resample_audio, get_audio_properties
 import torch

 from configs import Config
@@ -46,27 +46,24 @@ def uvr(model_name, inp_root, save_root_vocal, paths, save_root_ins, agg, format
            need_reformat = 1
            done = 0
            try:
-                info = ffmpeg.probe(inp_path, cmd="ffprobe")
-                if (
-                    info["streams"][0]["channels"] == 2
-                    and info["streams"][0]["sample_rate"] == "44100"
-                ):
-                    need_reformat = 0
+                channels, rate = get_audio_properties(inp_path)
+
+                # Check the audio stream's properties
+                if channels == 2 and rate == 44100:
                    pre_fun._path_audio_(
                        inp_path, save_root_ins, save_root_vocal, format0, is_hp3=is_hp3
                    )
+                    need_reformat = 0
                    done = 1
-            except:
+            except Exception as e:
                need_reformat = 1
-                traceback.print_exc()
+                print(f"Exception {e} occurred. Will reformat")
            if need_reformat == 1:
                tmp_path = "%s/%s.reformatted.wav" % (
                    os.path.join(os.environ["TEMP"]),
                    os.path.basename(inp_path),
                )
-                os.system(
-                    f'ffmpeg -i "{inp_path}" -vn -acodec pcm_s16le -ac 2 -ar 44100 "{tmp_path}" -y'
-                )
+                resample_audio(inp_path, tmp_path, "pcm_s16le", "s16", 44100, "stereo")
                inp_path = tmp_path
            try:
                if done == 0:
--- a/infer/modules/uvr5/vr.py
+++ b/infer/modules/uvr5/vr.py
@@ -6,6 +6,7 @@ logger = logging.getLogger(__name__)
 import librosa
 import numpy as np
 import soundfile as sf
+from infer.lib.audio import downsample_audio
 import torch

 from infer.lib.uvr5_pack.lib_v5 import nets_123821KB as Nets
@@ -60,7 +61,7 @@ class AudioPre:
                (
                    X_wave[d],
                    _,
-                ) = librosa.core.load(  # 理论上librosa读取可能对某些音频有bug，应该上ffmpeg读取，但是太麻烦了弃坑
+                ) = librosa.core.load(  # 理论上librosa读取可能对某些音频有bug，应该上av读取，但是太麻烦了弃坑
                    music_file,
                    bp["sr"],
                    False,
@@ -146,12 +147,7 @@ class AudioPre:
                )
                if os.path.exists(path):
                    opt_format_path = path[:-4] + ".%s" % format
-                    os.system(f'ffmpeg -i "{path}" -vn "{opt_format_path}" -q:a 2 -y')
-                    if os.path.exists(opt_format_path):
-                        try:
-                            os.remove(path)
-                        except:
-                            pass
+                    downsample_audio(path, opt_format_path, format)
        if vocal_root is not None:
            if is_hp3 == True:
                head = "instrument_"
@@ -185,14 +181,8 @@ class AudioPre:
                    (np.array(wav_vocals) * 32768).astype("int16"),
                    self.mp.param["sr"],
                )
-                if os.path.exists(path):
-                    opt_format_path = path[:-4] + ".%s" % format
-                    os.system(f'ffmpeg -i "{path}" -vn "{opt_format_path}" -q:a 2 -y')
-                    if os.path.exists(opt_format_path):
-                        try:
-                            os.remove(path)
-                        except:
-                            pass
+                opt_format_path = path[:-4] + ".%s" % format
+                downsample_audio(path, opt_format_path, format)


 class AudioPreDeEcho:
@@ -241,7 +231,7 @@ class AudioPreDeEcho:
                (
                    X_wave[d],
                    _,
-                ) = librosa.core.load(  # 理论上librosa读取可能对某些音频有bug，应该上ffmpeg读取，但是太麻烦了弃坑
+                ) = librosa.core.load(  # 理论上librosa读取可能对某些音频有bug，应该上av读取，但是太麻烦了弃坑
                    music_file,
                    bp["sr"],
                    False,
@@ -323,12 +313,7 @@ class AudioPreDeEcho:
                )
                if os.path.exists(path):
                    opt_format_path = path[:-4] + ".%s" % format
-                    os.system(f'ffmpeg -i "{path}" -vn "{opt_format_path}" -q:a 2 -y')
-                    if os.path.exists(opt_format_path):
-                        try:
-                            os.remove(path)
-                        except:
-                            pass
+                    downsample_audio(path, opt_format_path, format)
        if vocal_root is not None:
            if self.data["high_end_process"].startswith("mirroring"):
                input_high_end_ = spec_utils.mirroring(
@@ -360,9 +345,4 @@ class AudioPreDeEcho:
                )
                if os.path.exists(path):
                    opt_format_path = path[:-4] + ".%s" % format
-                    os.system(f'ffmpeg -i "{path}" -vn "{opt_format_path}" -q:a 2 -y')
-                    if os.path.exists(opt_format_path):
-                        try:
-                            os.remove(path)
-                        except:
-                            pass
+                    downsample_audio(path, opt_format_path, format)
--- a/requirements/amd.txt
+++ b/requirements/amd.txt
@@ -11,7 +11,6 @@ gradio
 Cython
 pydub>=0.25.1
 soundfile>=0.12.1
-ffmpeg-python>=0.2.0
 tensorboardX
 Jinja2>=3.1.2
 json5
@@ -43,7 +42,6 @@ onnxruntime
 onnxruntime-gpu
 torchcrepe==0.0.20
 fastapi
-ffmpy==0.3.1
 python-dotenv>=1.0.0
 av
 torchfcpe
--- a/requirements/dml.txt
+++ b/requirements/dml.txt
@@ -10,7 +10,6 @@ gradio
 Cython
 pydub>=0.25.1
 soundfile>=0.12.1
-ffmpeg-python>=0.2.0
 tensorboardX
 Jinja2>=3.1.2
 json5
@@ -41,7 +40,6 @@ httpx
 onnxruntime-directml
 torchcrepe==0.0.20
 fastapi
-ffmpy==0.3.1
 python-dotenv>=1.0.0
 av
 torchfcpe
--- a/requirements/ipex.txt
+++ b/requirements/ipex.txt
@@ -15,7 +15,6 @@ gradio
 Cython
 pydub>=0.25.1
 soundfile>=0.12.1
-ffmpeg-python>=0.2.0
 tensorboardX
 Jinja2>=3.1.2
 json5
@@ -47,7 +46,6 @@ onnxruntime; sys_platform == 'darwin'
 onnxruntime-gpu; sys_platform != 'darwin'
 torchcrepe==0.0.20
 fastapi
-ffmpy==0.3.1
 python-dotenv>=1.0.0
 av
 FreeSimpleGUI
--- a/requirements/main.txt
+++ b/requirements/main.txt
@@ -10,7 +10,6 @@ gradio
 Cython
 pydub>=0.25.1
 soundfile>=0.12.1
-ffmpeg-python>=0.2.0
 tensorboardX
 Jinja2>=3.1.2
 json5
@@ -43,7 +42,6 @@ onnxruntime-gpu; sys_platform != 'darwin'
 torchcrepe==0.0.20
 fastapi
 torchfcpe
-ffmpy==0.3.1
 python-dotenv>=1.0.0
 av
 pybase16384
--- a/requirements/py311.txt
+++ b/requirements/py311.txt
@@ -10,7 +10,6 @@ gradio
 Cython
 pydub>=0.25.1
 soundfile>=0.12.1
-ffmpeg-python>=0.2.0
 tensorboardX
 Jinja2>=3.1.2
 json5
@@ -43,7 +42,6 @@ onnxruntime-gpu; sys_platform != 'darwin'
 torchcrepe==0.0.20
 fastapi
 torchfcpe
-ffmpy==0.3.1
 python-dotenv>=1.0.0
 av
 pybase16384