Integrating Whisper into Label Studio for Speech Transcription and Speaker Diarization
Published: 2024-12-17 | Words: 1.8k | Reading time: 8 min

Introduction to Label Studio

Label Studio is an open-source data-annotation tool widely used in machine learning and AI projects. It can annotate many kinds of data, such as text, images, audio, and video, and supports a wide range of labeling tasks, helping users turn raw data into structured labels for training machine-learning models.

Installation and Usage

  • Install
pip install label-studio
# or via Docker
docker run -p 8080:8080 -v /path/to/your/data:/mnt heartexlabs/label-studio

  • Run
label-studio start

API Usage

from label_studio_sdk import Client

client = Client(url='http://localhost:8080', api_key='your_api_key')
project = client.start_project(title='Your Project Name')

Template Customization

<View>
  <Text name="text" value="$text"/>
  <Choices name="label" toName="text">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
  </Choices>
</View>
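Since this post targets audio, a matching audio-labeling config can be sketched the same way. This is a hypothetical example: the tag names `audio`, `speaker`, and `transcription`, and the `$audio` data key, are assumptions that must match your own project setup.

```xml
<View>
  <!-- The audio player; $audio must match the task's data key -->
  <Audio name="audio" value="$audio"/>
  <!-- Speaker labels for diarization regions -->
  <Labels name="speaker" toName="audio">
    <Label value="Speaker 1"/>
    <Label value="Speaker 2"/>
  </Labels>
  <!-- Free-text transcription -->
  <TextArea name="transcription" toName="audio"/>
</View>
```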

Introduction to Whisper

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it does produce highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI’s Whisper does not natively support batching.

Phoneme-Based ASR is a suite of models fine-tuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in “tap”. A popular example model is wav2vec 2.0.

Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.

Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.

Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
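Combining ASR output with diarization output usually comes down to aligning two sets of time segments. A minimal pure-Python sketch, assuming hypothetical `(start, end, ...)` tuples for both inputs:

```python
def assign_speakers(asr_segments, diar_segments):
    """Attach a speaker label to each ASR segment by maximal time overlap.

    asr_segments: list of (start_s, end_s, text)
    diar_segments: list of (start_s, end_s, speaker)
    """
    labeled = []
    for a_start, a_end, text in asr_segments:
        best_speaker, best_overlap = None, 0.0
        for d_start, d_end, speaker in diar_segments:
            # Overlap of the two intervals (negative means disjoint)
            overlap = min(a_end, d_end) - max(a_start, d_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((text, best_speaker))
    return labeled


print(assign_speakers(
    [(0.0, 2.0, "hello"), (2.0, 5.0, "hi there")],
    [(0.0, 1.5, "SPEAKER_0"), (1.5, 5.0, "SPEAKER_1")],
))
# → [('hello', 'SPEAKER_0'), ('hi there', 'SPEAKER_1')]
```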

Installation and Usage

  • whisper
pip install openai-whisper
# or
git clone https://github.com/openai/whisper.git
cd whisper
pip install .

  • pytorch
pip install torch
# or (CPU-only wheel)
pip install torch --no-index -f https://download.pytorch.org/whl/cpu/torch_stable.html

  • ffmpeg
# see https://ffmpeg.org/download.html
sudo apt-get install ffmpeg

Basic Usage

  • Speech-to-text
import whisper

# Load the Whisper model
model = whisper.load_model("base")

# Transcribe the audio file
result = model.transcribe("demo.wav")

# Print the transcription
print(result["text"])
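Besides the full text, `transcribe` also returns per-segment timestamps under `result["segments"]`. A small helper for printing them (the segment dicts below are a hypothetical example of that shape, not real model output):

```python
def format_segments(segments):
    """Render Whisper-style segments as '[start-end] text' lines."""
    return [f"[{s['start']:.2f}-{s['end']:.2f}]{s['text']}" for s in segments]


# Hypothetical segments, shaped like Whisper's output
demo = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 4.0, "text": " How are you?"},
]
for line in format_segments(demo):
    print(line)
# → [0.00-2.50] Hello there.
#   [2.50-4.00] How are you?
```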

  • Audio segmentation
# _*_ coding: utf-8 _*_
"""
Time: 2024/12/17 16:09
Author: ZhaoQi Cao(czq)
Version: V 0.1
File: audio_seg.py
Describe: Write during the python at zgxmt, Github link: https://github.com/caozhaoqi
"""
import whisper
from pydub import AudioSegment
import webrtcvad

# Initialise the Whisper model
whisper_model = whisper.load_model("base")

# VAD: aggressiveness 3 is the strictest speech/non-speech filter
vad = webrtcvad.Vad(3)


def vad_collected_segments(audio_path, sample_rate=16000, frame_ms=30):
    """Detect speech activity with VAD; returns (start_ms, end_ms) frames."""
    audio = AudioSegment.from_file(audio_path)
    # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz,
    # in frames of 10, 20, or 30 ms
    audio = audio.set_frame_rate(sample_rate).set_channels(1).set_sample_width(2)
    raw_audio = audio.raw_data

    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    segments = []
    for offset in range(0, len(raw_audio) - frame_bytes + 1, frame_bytes):
        frame = raw_audio[offset:offset + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            start_ms = offset // 2 * 1000 // sample_rate
            segments.append((start_ms, start_ms + frame_ms))

    return segments, audio


def transcribe_audio(audio_file):
    """Transcribe speech with the Whisper model."""
    result = whisper_model.transcribe(audio_file)
    return result['text']


def cut_audio_by_vad(audio_file, segments, output_folder):
    """Cut the audio into the segments detected by VAD (pydub slices in ms)."""
    audio = AudioSegment.from_file(audio_file)

    for segment_index, (start, end) in enumerate(segments, start=1):
        segment_audio = audio[start:end]
        segment_audio.export(f"{output_folder}/segment_{segment_index}.wav", format="wav")


def process_audio(audio_file, output_folder):
    """Process an audio file: transcribe it, then cut it into segments."""
    # 1. Transcribe with Whisper
    transcription = transcribe_audio(audio_file)
    print(f"Transcription: {transcription}")

    # 2. Detect speech-activity regions with VAD
    segments, audio = vad_collected_segments(audio_file)

    # 3. Save the audio segments
    cut_audio_by_vad(audio_file, segments, output_folder)
    print(f"Done; segments saved to {output_folder}")


# Call with the audio file path and an output folder for the segments
process_audio(r"C:\Users\DELL\Desktop\audio_mark\1\1-英文对话\1-英文对话\12月17日.WAV", "./output_segments")
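The per-frame VAD hits are very fine-grained; in practice, adjacent speech frames are merged into utterance-level segments, bridging short silences. A minimal sketch of that merging step (pure Python; `gap_ms` is a tunable assumption, not a value from the original code):

```python
def merge_frames(frames, gap_ms=300):
    """Merge adjacent (start_ms, end_ms) speech frames into segments,
    bridging any silence shorter than gap_ms."""
    segments = []
    for start, end in sorted(frames):
        if segments and start - segments[-1][1] <= gap_ms:
            # Close enough to the previous segment: extend it
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


print(merge_frames([(0, 30), (30, 60), (60, 90), (900, 930)]))
# → [(0, 90), (900, 930)]
```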

  • Command line
whisper demo.wav --model base
# or
whisper demo.wav --model base --output_dir ./transcripts

Using Them Together

After Whisper transcribes the speech, the results are brought into Label Studio for annotation.

# _*_ coding: utf-8 _*_
"""
Time: 2024/12/17 10:49
Author: ZhaoQi Cao(czq)
Version: V 0.1
File: label_studio_whisper.py
Describe: Write during the python at zgxmt, Github link: https://github.com/caozhaoqi
"""
import whisper
from speechbrain.pretrained import SpeakerRecognition
from label_studio_sdk import Client

# Initialise the Whisper model
whisper_model = whisper.load_model("base")  # other sizes: 'small', 'medium', 'large'

# Initialise SpeechBrain's speaker-recognition model
speaker_model = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="tmpdir"
)


def transcribe_audio(audio_file):
    """Transcribe an audio file with Whisper."""
    result = whisper_model.transcribe(audio_file)
    return result['text']


def speaker_diarization(audio_file):
    """Extract speaker information with SpeechBrain.

    Note: SpeakerRecognition provides speaker embeddings and verification;
    full diarization additionally requires segmenting the audio and
    clustering the embeddings (e.g. with a dedicated diarization pipeline).
    """
    signal = speaker_model.load_audio(audio_file)
    embeddings = speaker_model.encode_batch(signal.unsqueeze(0))
    return embeddings


def process_audio_task(task_data):
    """Process one Label Studio task."""
    audio_file = task_data['audio_file']

    # 1. Transcribe with Whisper
    transcription = transcribe_audio(audio_file)

    # 2. Extract speaker information with SpeechBrain
    diarization = speaker_diarization(audio_file)

    # Return the transcript and the speaker information
    return {
        'transcription': transcription,
        'diarization': diarization
    }


if __name__ == "__main__":
    zq_api = "93e8ebc81cc1337c41567aa20113fd74934a4f17"
    client = Client(url='http://localhost:8080', api_key=zq_api)  # your Label Studio API key
    project = client.get_project(1)  # assumes a project with this ID already exists

    # Fetch the tasks awaiting annotation
    tasks = project.get_tasks()

    # Process each task's audio file
    for task in tasks:
        result = process_audio_task(task['data'])

        # Submit the transcript and speaker info back to Label Studio
        project.create_prediction(task['id'], result)
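Note that `create_prediction` expects results in Label Studio's region-JSON format rather than an arbitrary dict. A hedged sketch of wrapping a transcript for a `TextArea` control; the `from_name`/`to_name` values here are assumptions that must match the names in your labeling config:

```python
def build_prediction(transcription):
    """Wrap a transcript string in Label Studio's prediction-result shape."""
    return {
        "model_version": "whisper-base",
        "result": [
            {
                "from_name": "transcription",  # TextArea name in the config
                "to_name": "audio",            # Audio tag name in the config
                "type": "textarea",
                "value": {"text": [transcription]},
            }
        ],
    }


pred = build_prediction("hello world")
print(pred["result"][0]["value"]["text"])
# → ['hello world']
```

The `result` list could then be submitted with something like `project.create_prediction(task['id'], result=pred["result"])`.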

Phoneme-Sequence Output with pypinyin

import os
import re
from g2p_en import G2p
from pypinyin import pinyin, Style
import nltk

nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt')  # download any other missing resources as needed

# C:\Users\DELL\AppData\Roaming\nltk_data
nltk_data_path = r'C:\Users\DELL\AppData\Roaming\nltk_data'

# Create the directory if it does not exist
if not os.path.exists(nltk_data_path):
    os.makedirs(nltk_data_path)

# Add it to NLTK's data search path
nltk.data.path.append(nltk_data_path)

# G2P object (for English)
g2p = G2p()


def get_phonemes(text):
    """Convert mixed Chinese/English text to a phoneme sequence.

    Chinese is handled per character and English per word (running g2p on
    single letters would just spell them out).
    """
    phonemes = []

    # Tokenise into Chinese characters, English words, and everything else
    for token in re.findall(r"[\u4e00-\u9fa5]|[A-Za-z']+|.", text):
        if re.match(r'[\u4e00-\u9fa5]', token):
            # Chinese character: pinyin with tone numbers
            pinyin_output = pinyin(token, style=Style.TONE3)
            phonemes.append(pinyin_output[0][0].replace(' ', ''))  # strip spaces
        elif re.match(r'[A-Za-z]', token):
            # English word: phonemes via G2P
            phonemes.extend(g2p(token))
        else:
            # Keep everything else (punctuation, digits, spaces) as-is
            phonemes.append(token)

    return phonemes


# Mixed Chinese/English text
text_mixed = "你说什么? Wash water off. Why is she inside my own boots? 他没有雨鞋得给他买一双。"
print(f"Mixed language phonemes: {get_phonemes(text_mixed)}")

# English text
text_en = "If we use it like. umm"
print(f"English phonemes: {get_phonemes(text_en)}")

# Chinese text (Simplified)
text_zh = "你说什么?"
print(f"Chinese phonemes: {get_phonemes(text_zh)}")

# Dutch text
text_nl = "Hallo wereld"
print(f"Dutch phonemes: {get_phonemes(text_nl)}")

# Simplified Chinese (zh-cn)
text_zh_cn = "他管你叫老姑父。哈哈哈哈哈 嘿 哈哈哈哈哈哈哈哈哈"
print(f"Chinese (Simplified) phonemes: {get_phonemes(text_zh_cn)}")

# Traditional Chinese (zh-tw)
text_zh_tw = "你好,世界"
print(f"Chinese (Traditional) phonemes: {get_phonemes(text_zh_tw)}")
