feat(audio): use PyAV instead of ffmpeg (#31)

* feat(audio): use PyAV instead of ffmpeg

  replaced usage of ffmpeg in favor of PyAV (`av`)

* refactor(audio): store all of the audio related functions in the `infer.lib.audio`

  refactors previous commit to have singular functions for each task, all located in `infer.lib.audio`

* fix(audio): remove downsample_audio from mdxnet.py

  it is no longer needed, since it's imported from infer.lib.audio

* docs: remove every ffmpeg mention in the documentation to avoid confusion

* chore(requirements): remove ffmpeg-python and ffmpy from all requirements

* fix(audio): fix loading for UVR

  wrapped gathering of META info from the stream into a function
  fixes loading for UVR

* fix(audio): use np.frombuffer() instead of direct conversion of the resampled frames

  this fixes traceback on preprocessing

* feat(audio): pre-allocate decoded_audio array in the load_audio function

  this should improve performance, even if just a little

* Revert "docs: remove every ffmpeg mention in the documentation to avoid confusion"

  This reverts commit 1e05bbce03.

* chore(format): run black on dev

* fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile

* Revert "fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile"

  This reverts commit e28a0eebb2.

* feat(audio): pre-allocate numpy array to store the AudioFrame data in ndarray of dtype float32

* chore(format): run black on dev

* fix(audio): fix the decoded_audio size estimation

  in estimated_total_samples we multiply by `sr` instead of `container.streams.audio[0].rate` since we want to estimate size of the OUTPUT file, not the input one.
  - Added dynamic resizing, in case something goes wrong and the size of decoded_audio is estimated incorrectly

  Fixed function `load_audio` when the input audio's samplerate does not match the desired samplerate (`sr`)

* chore(format): run black on dev

* refactor(audio): remove `clean_path()` function as it serves no purpose anymore

* docs: remove everything related to ffmpeg

  this includes everything except for formats support specification in the training_tips docs, since it has nothing to do with what ffmpeg does/did but rather what audio formats are supported (all the ones that ffmpeg supports!)

* docs: fix order of the steps in preparation in the READMEs

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
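
Note on the size estimation and dynamic resizing described above: the idea can be sketched roughly as follows. This is not the code from this commit. The helper names `estimate_total_samples` and `collect_frames` are hypothetical, and the byte chunks stand in for resampled frame data that the real `load_audio` would obtain from PyAV. It only illustrates estimating the buffer from the output rate `sr`, reading raw bytes with `np.frombuffer()`, and growing the buffer when the estimate turns out too small.

```python
import numpy as np


def estimate_total_samples(duration_seconds: float, sr: int) -> int:
    """Estimate the OUTPUT length: duration times the target rate `sr`,
    not the input stream's own rate (hypothetical helper)."""
    return int(duration_seconds * sr)


def collect_frames(chunks, estimated_total_samples: int) -> np.ndarray:
    """Copy float32 byte chunks into a pre-allocated array, growing it if needed."""
    # Pre-allocate once instead of concatenating per frame.
    decoded_audio = np.zeros(max(estimated_total_samples, 1), dtype=np.float32)
    offset = 0
    for chunk in chunks:
        # Interpret the raw bytes as float32 samples without an extra conversion step.
        samples = np.frombuffer(chunk, dtype=np.float32)
        end = offset + samples.size
        if end > decoded_audio.size:
            # The estimate was too small: grow to at least `end`, doubling when possible.
            grown = np.zeros(max(end, decoded_audio.size * 2), dtype=np.float32)
            grown[:offset] = decoded_audio[:offset]
            decoded_audio = grown
        decoded_audio[offset:end] = samples
        offset = end
    # Trim the unused tail left over from an over-estimate or from doubling.
    return decoded_audio[:offset]
```

With an under-sized estimate the array grows as chunks arrive; with an over-sized estimate the final slice drops the unused tail, so the result has the same length either way.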
2026-06-05 09:10:25 +08:00 · 2024-06-12 18:13:26 +07:00
parent aec56ec0b4
commit 1e22d468ea
28 changed files with 233 additions and 366 deletions
--- a/docs/kr/README.ko.han.md
+++ b/docs/kr/README.ko.han.md
@@ -81,8 +81,6 @@ V2 버전 모델을 테스트하려면 추가 다운로드가 필요합니다.

 ./assets/pretrained_v2

-# Windows를 使用하는境遇 이 사전도 必要할 수 있습니다. FFmpeg가 設置되어 있으면 건너뛰어도 됩니다.
-ffmpeg.exe
 ```
 그後 以下의 命令을 使用하여 WebUI를 始作할 수 있습니다:
 ```bash
@@ -95,7 +93,6 @@ Windows를 使用하는境遇 `RVC-beta.7z`를 다운로드 및 壓縮解除하
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 ## 모든寄與者분들의勞力에感謝드립니다
--- a/docs/kr/README.ko.md
+++ b/docs/kr/README.ko.md
@@ -156,26 +156,6 @@ sh ./run.sh
 	rvcmd assets/v2 # RVC-Models-Downloader command
 	```

-### 2. 安装 ffmpeg 工具
-若已安装`ffmpeg`和`ffprobe`则可跳过此步骤。
-
-#### Ubuntu/Debian 用户
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS 用户
-```bash
-brew install ffmpeg
-```
-#### Windows 用户
-下载后放置在根目录。
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
- 下载[ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- 下载[ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
 ### 3. 下载 rmvpe 人声音高提取算法所需文件

 如果你想使用最新的RMVPE人声音高提取算法，则你需要下载音高提取模型参数并放置于`assets/rmvpe`。
@@ -237,7 +217,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
@@ -298,31 +277,7 @@ v2 버전 모델을 사용하려면 추가로 다음을 다운로드해야 합
  rvcmd assets/v2 # RVC-Models-Downloader command
  ```

-### 2. ffmpeg 설치
-
-`ffmpeg`와 `ffprobe`가 이미 설치되어 있다면 건너뜁니다.
-
-#### Ubuntu/Debian 사용자
-
-```bash
-sudo apt install ffmpeg
-```
-
-#### MacOS 사용자
-
-```bash
-brew install ffmpeg
-```
-
-#### Windows 사용자
-
-다운로드 후 루트 디렉토리에 배치.
-
- [ffmpeg.exe 다운로드](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
- [ffprobe.exe 다운로드](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. RMVPE 인간 음성 피치 추출 알고리즘에 필요한 파일 다운로드
+### 2. RMVPE 인간 음성 피치 추출 알고리즘에 필요한 파일 다운로드

 최신 RMVPE 인간 음성 피치 추출 알고리즘을 사용하려면 음피치 추출 모델 매개변수를 다운로드하고 RVC 루트 디렉토리에 배치해야 합니다.

@@ -332,7 +287,7 @@ brew install ffmpeg

 - [rmvpe.onnx 다운로드](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.onnx)

-### 4. AMD 그래픽 카드 Rocm(선택사항, Linux만 해당)
+### 3. AMD 그래픽 카드 Rocm(선택사항, Linux만 해당)

 Linux 시스템에서 AMD의 Rocm 기술을 기반으로 RVC를 실행하려면 [여기](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html)에서 필요한 드라이버를 먼저 설치하세요.

@@ -392,7 +347,6 @@ source /opt/intel/oneapi/setvars.sh
 - [VITS](https://github.com/jaywalnut310/vits)
 - [HIFIGAN](https://github.com/jik876/hifi-gan)
 - [Gradio](https://github.com/gradio-app/gradio)
-- [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 - [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 - [audio-slicer](https://github.com/openvpi/audio-slicer)
 - [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)
--- a/docs/kr/faq_ko.md
+++ b/docs/kr/faq_ko.md
@@ -1,19 +1,14 @@
-## Q1:ffmpeg 오류/utf8 오류
-
-대부분의 경우 ffmpeg 문제가 아니라 오디오 경로 문제입니다. <br>
-ffmpeg가 공백, () 등의 특수 문자가 포함된 경로를 읽을 때 ffmpeg 오류가 발생할 수 있습니다. 트레이닝 세트 오디오가 중문 경로일 때 filelist.txt에 쓸 때 utf8 오류가 발생할 수 있습니다. <br>
-
-## Q2:일괄 트레이닝이 끝나고 인덱스가 없음
+## Q1:일괄 트레이닝이 끝나고 인덱스가 없음

 "Training is done. The program is closed."라고 표시되면 모델 트레이닝이 성공한 것이며, 이어지는 오류는 가짜입니다. <br>

 일괄 트레이닝이 끝나고 'added'로 시작하는 인덱스 파일이 없으면 트레이닝 세트가 너무 커서 인덱스 추가 단계에서 멈췄을 수 있습니다. 메모리에 대한 인덱스 추가 요구 사항이 너무 큰 문제를 배치 처리 add 인덱스로 해결했습니다. 임시로 "트레이닝 인덱스" 버튼을 다시 클릭해 보세요. <br>

-## Q3:트레이닝이 끝나고 트레이닝 세트의 음색을 추론에서 보지 못함
+## Q2:트레이닝이 끝나고 트레이닝 세트의 음색을 추론에서 보지 못함

 '음색 새로고침'을 클릭해 보세요. 여전히 없다면 트레이닝에 오류가 있는지, 콘솔 및 webui의 스크린샷, logs/실험명 아래의 로그를 개발자에게 보내 확인해 보세요. <br>

-## Q4:모델 공유 방법
+## Q3:모델 공유 방법

 rvc_root/logs/실험명 아래에 저장된 pth는 추론에 사용하기 위한 것이 아니라 실험 상태를 저장하고 복원하며, 트레이닝을 계속하기 위한 것입니다. 공유에 사용되는 모델은 weights 폴더 아래 60MB 이상인 pth 파일입니다. <br>
 <br/>
@@ -21,17 +16,17 @@ rvc_root/logs/실험명 아래에 저장된 pth는 추론에 사용하기 위한
 <br/>
 logs 폴더 아래 수백 MB의 pth 파일을 weights 폴더에 복사/공유하여 강제로 추론에 사용하면 f0, tgt_sr 등의 키가 없다는 오류가 발생할 수 있습니다. ckpt 탭 아래에서 수동 또는 자동(로컬 logs에서 관련 정보를 찾을 수 있는 경우 자동)으로 음성, 대상 오디오 샘플링률 옵션을 선택한 후 ckpt 소형 모델을 추출해야 합니다(입력 경로에 G로 시작하는 경로를 입력). 추출 후 weights 폴더에 60MB 이상의 pth 파일이 생성되며, 음색 새로고침 후 사용할 수 있습니다. <br>

-## Q5:연결 오류
+## Q4:연결 오류

 아마도 컨트롤 콘솔(검은 창)을 닫았을 것입니다. <br>

-## Q6:WebUI에서 "Expecting value: line 1 column 1 (char 0)" 오류가 발생함
+## Q5:WebUI에서 "Expecting value: line 1 column 1 (char 0)" 오류가 발생함

 시스템 로컬 네트워크 프록시/글로벌 프록시를 닫으세요. <br>

 이는 클라이언트의 프록시뿐만 아니라 서버 측의 프록시도 포함합니다(예: autodl로 http_proxy 및 https_proxy를 설정한 경우 사용 시 unset으로 끄세요). <br>

-## Q7:WebUI 없이 명령으로 트레이닝 및 추론하는 방법
+## Q6:WebUI 없이 명령으로 트레이닝 및 추론하는 방법

 트레이닝 스크립트: <br>
 먼저 WebUI를 실행하여 데이터 세트 처리 및 트레이닝에 사용되는 명령줄을 메시지 창에서 확인할 수 있습니다. <br>
@@ -53,18 +48,18 @@ index_rate=float(sys.argv[7]) <br>
 device=sys.argv[8] <br>
 is_half=bool(sys.argv[9]) <br>

-## Q8:Cuda 오류/Cuda 메모리 부족
+## Q7:Cuda 오류/Cuda 메모리 부족

 아마도 cuda 설정 문제이거나 장치가 지원되지 않을 수 있습니다. 대부분의 경우 메모리가 부족합니다(out of memory). <br>

 트레이닝의 경우 batch size를 줄이세요(1로 줄여도 부족하다면 다른 그래픽 카드로 트레이닝을 해야 합니다). 추론의 경우 config.py 파일 끝에 있는 x_pad, x_query, x_center, x_max를 적절히 줄이세요. 4GB 미만의 메모리(예: 1060(3GB) 및 여러 2GB 그래픽 카드)를 가진 경우는 포기하세요. 4GB 메모리 그래픽 카드는 아직 구할 수 있습니다. <br>

-## Q9:total_epoch를 몇으로 설정하는 것이 좋을까요
+## Q8:total_epoch를 몇으로 설정하는 것이 좋을까요

 트레이닝 세트의 오디오 품질이 낮고 배경 소음이 많으면 20~30이면 충분합니다. 너무 높게 설정하면 바닥 모델의 오디오 품질이 낮은 트레이닝 세트를 높일 수 없습니다. <br>
 트레이닝 세트의 오디오 품질이 높고 배경 소음이 적고 길이가 길 경우 높게 설정할 수 있습니다. 200도 괜찮습니다(트레이닝 속도가 빠르므로, 고품질 트레이닝 세트를 준비할 수 있는 조건이 있다면, 그래픽 카드도 좋을 것이므로, 조금 더 긴 트레이닝 시간에 대해 걱정하지 않을 것입니다). <br>

-## Q10: 트레이닝 세트는 얼마나 길어야 하나요
+## Q9: 트레이닝 세트는 얼마나 길어야 하나요

 10분에서 50분을 추천합니다.
 <br/>
@@ -76,7 +71,7 @@ is_half=bool(sys.argv[9]) <br>
 <br/>
 1분 미만의 데이터로 트레이닝을 시도(성공)한 사례는 아직 보지 못했습니다. 이런 시도는 권장하지 않습니다.

-## Q11: index rate는 무엇이며, 어떻게 조정하나요? (과학적 설명)
+## Q10: index rate는 무엇이며, 어떻게 조정하나요? (과학적 설명)

 만약 베이스 모델과 추론 소스의 음질이 트레이닝 세트보다 높다면, 그들은 추론 결과의 음질을 높일 수 있지만, 음색이 베이스 모델/추론 소스의 음색으로 기울어질 수 있습니다. 이 현상을 "음색 유출"이라고 합니다.
 <br/>