Integrating Whisper into Label Studio for Speech Transcription and Speaker Diarization

Introduction to Label Studio

Label Studio is an open-source data annotation tool widely used in machine learning and AI projects. It supports labeling many kinds of data, such as text, images, audio, and video, across a variety of annotation task types, and helps users turn raw data into structured labels for training machine learning models.
Installation and Usage
```bash
pip install label-studio
# or run via Docker
docker run -it -p 8080:8080 -v /path/to/your/data:/label-studio/data heartexlabs/label-studio:latest
```
API Calls

```python
from label_studio_sdk import Client

# Connect to a running Label Studio instance with your personal API key.
client = Client(url='http://localhost:8080', api_key='your_api_key')
project = client.start_project(title='Your Project Name')
```
Template Customization

```xml
<View>
  <Choices name="Label" toName="text">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
  </Choices>
  <Text name="text" value="$text"/>
</View>
```
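Since this article deals with audio, the same templating mechanism also covers audio tasks. A minimal sketch of a config for transcription plus speaker labeling follows; the tag and variable names (`speaker`, `audio`, `transcription`, `$audio`) are illustrative, not taken from the original project:

```xml
<View>
  <Labels name="speaker" toName="audio">
    <Label value="Speaker 1"/>
    <Label value="Speaker 2"/>
  </Labels>
  <Audio name="audio" value="$audio"/>
  <TextArea name="transcription" toName="audio" editable="true"/>
</View>
```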
Introduction to Whisper
Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's Whisper does not natively support batching.
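For example, the utterance-level timestamps can be read from the `segments` field of the result returned by `transcribe()`:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("demo.wav")

# Each segment carries utterance-level (not word-level) start/end times.
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")
```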
Phoneme-Based ASR: a suite of models fine-tuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example is wav2vec 2.0 (see the sketch after these definitions).
Forced Alignment: the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation.
Voice Activity Detection (VAD): the detection of the presence or absence of human speech.
Speaker Diarization: the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
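As a concrete reference for phoneme-based ASR, here is a minimal sketch using Hugging Face transformers with the character-level CTC checkpoint facebook/wav2vec2-base-960h; phoneme-level checkpoints such as facebook/wav2vec2-lv-60-espeak-cv-ft expose the same API. The file name demo.wav is a placeholder:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("demo.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # model expects 16 kHz

# Run CTC inference and greedy-decode the most likely token per frame.
inputs = processor(waveform[0], sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))
```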
Installation and Usage
```bash
pip install openai-whisper
# or install from source
git clone https://github.com/openai/whisper.git
cd whisper
pip install .
```
```bash
pip install torch
# or install a CPU-only build
pip install torch --no-index -f https://download.pytorch.org/whl/cpu/torch_stable.html
```
```bash
# see https://ffmpeg.org/download.html for other platforms
sudo apt-get install ffmpeg
```
Basic Usage
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("demo.wav")
print(result["text"])
```
```python
"""
Time: 2024/12/17 16:09
Author: ZhaoQi Cao(czq)
Version: V 0.1
File: audio_seg.py
Describe: Written during Python work at zgxmt
Github link: https://github.com/caozhaoqi
"""
import os

import webrtcvad
import whisper
from pydub import AudioSegment

whisper_model = whisper.load_model("base")
vad = webrtcvad.Vad(3)  # aggressiveness 3 = most aggressive speech filtering


def vad_collected_segments(audio_path, sample_rate=16000, frame_ms=30):
    """Detect voice-activity segments with WebRTC VAD.

    Returns a list of (start_ms, end_ms) tuples plus the resampled audio.
    """
    audio = AudioSegment.from_file(audio_path)
    # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz.
    audio = audio.set_frame_rate(sample_rate).set_channels(1).set_sample_width(2)
    raw_audio = audio.raw_data
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    segments = []
    for offset in range(0, len(raw_audio) - bytes_per_frame + 1, bytes_per_frame):
        frame = raw_audio[offset:offset + bytes_per_frame]
        if vad.is_speech(frame, sample_rate):
            start_ms = (offset // 2) * 1000 // sample_rate
            segments.append([start_ms, start_ms + frame_ms])
    # Merge consecutive voiced frames into longer, sentence-like segments.
    merged = []
    for start_ms, end_ms in segments:
        if merged and start_ms <= merged[-1][1]:
            merged[-1][1] = end_ms
        else:
            merged.append([start_ms, end_ms])
    return [tuple(seg) for seg in merged], audio


def transcribe_audio(audio_file):
    """Transcribe the audio with the Whisper model."""
    result = whisper_model.transcribe(audio_file)
    return result['text']


def cut_audio_by_vad(audio_file, segments, output_folder):
    """Cut the audio into the VAD-detected segments (times are in ms)."""
    audio = AudioSegment.from_file(audio_file)
    os.makedirs(output_folder, exist_ok=True)
    for index, (start_ms, end_ms) in enumerate(segments, start=1):
        segment_audio = audio[start_ms:end_ms]  # pydub slices in milliseconds
        segment_audio.export(f"{output_folder}/segment_{index}.wav", format="wav")


def process_audio(audio_file, output_folder):
    """Transcribe the audio file and cut it into voiced segments."""
    transcription = transcribe_audio(audio_file)
    print(f"Transcription: {transcription}")
    segments, audio = vad_collected_segments(audio_file)
    cut_audio_by_vad(audio_file, segments, output_folder)
    print(f"Cutting done; segments saved in {output_folder}")


process_audio(r"C:\Users\DELL\Desktop\audio_mark\1\1-英文对话\1-英文对话\12月17日.WAV",
              "./output_segments")
```
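The script above transcribes the whole file and cuts it separately. A possible follow-up, sketched here under the assumption that the output folder from the script exists, is to transcribe each exported segment on its own:

```python
import glob

import whisper

model = whisper.load_model("base")
# Transcribe every segment produced by cut_audio_by_vad above.
for path in sorted(glob.glob("./output_segments/segment_*.wav")):
    print(path, model.transcribe(path)["text"])
```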
```bash
whisper demo.wav --model base
# or write the transcripts to a directory
whisper demo.wav --model base --output_dir ./transcripts
```
Combined Usage
After Whisper transcribes the audio, the results are brought into Label Studio for annotation:
```python
"""
Time: 2024/12/17 10:49
Author: ZhaoQi Cao(czq)
Version: V 0.1
File: label_studio_whisper.py
Describe: Written during Python work at zgxmt
Github link: https://github.com/caozhaoqi
"""
import whisper
from label_studio_sdk import Client
from speechbrain.pretrained import SpeakerRecognition

whisper_model = whisper.load_model("base")
# ECAPA-TDNN speaker-embedding model. Note: SpeakerRecognition provides
# embeddings and verification, not a complete diarization pipeline.
speaker_model = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="tmpdir")


def transcribe_audio(audio_file):
    """Transcribe the audio file with Whisper."""
    result = whisper_model.transcribe(audio_file)
    return result['text']


def speaker_diarization(audio_file):
    """Compute a speaker embedding with SpeechBrain.

    Full diarization additionally requires segmentation plus clustering of
    per-segment embeddings; this returns one embedding for the whole file.
    """
    signal = speaker_model.load_audio(audio_file)
    embedding = speaker_model.encode_batch(signal.unsqueeze(0))
    return embedding.squeeze().tolist()


def process_audio_task(task_data):
    """Process the data of one Label Studio task."""
    audio_file = task_data['audio_file']
    transcription = transcribe_audio(audio_file)
    diarization = speaker_diarization(audio_file)
    return {
        'transcription': transcription,
        'diarization': diarization
    }


if __name__ == "__main__":
    zq_api = "93e8ebc81cc1337c41567aa20113fd74934a4f17"
    client = Client(url='http://localhost:8080', api_key=zq_api)
    project = client.get_project(1)

    # get_tasks() returns task dicts in the Label Studio SDK.
    tasks = project.get_tasks()
    for task in tasks:
        result = process_audio_task(task['data'])
        # NOTE: Label Studio expects `result` in its prediction format;
        # see the sketch below for a well-formed example.
        project.create_prediction(task['id'], result=result)
```
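Note that `create_prediction` expects the result in Label Studio's prediction format, matched to the project's labeling config. A minimal sketch, assuming a config containing `<TextArea name="transcription" toName="audio"/>` (the tag names and placeholder values here are assumptions, not from the original project):

```python
from label_studio_sdk import Client

client = Client(url='http://localhost:8080', api_key='your_api_key')
project = client.get_project(1)

transcription = "hello world"  # stand-in for transcribe_audio(...) output
prediction_result = [
    {
        "from_name": "transcription",  # must match the TextArea name in the config
        "to_name": "audio",            # must match the Audio name in the config
        "type": "textarea",
        "value": {"text": [transcription]},
    }
]
project.create_prediction(1, result=prediction_result, model_version="whisper-base")
```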
Phoneme Sequence Output with pypinyin

```python
import os
import re

import nltk
from g2p_en import G2p
from langdetect import detect
from pypinyin import pinyin, Style

# Register the local nltk data path before downloading.
nltk_data_path = r'C:\Users\DELL\AppData\Roaming\nltk_data'
if not os.path.exists(nltk_data_path):
    os.makedirs(nltk_data_path)
nltk.data.path.append(nltk_data_path)

nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt')

g2p = G2p()


def get_phonemes(text):
    """Convert mixed Chinese/English text to a phoneme sequence.

    Chinese characters become pinyin with tone numbers (pypinyin); runs of
    Latin letters become ARPABET phonemes (g2p_en); everything else is
    passed through unchanged.
    """
    language = detect(text)  # informational only; conversion is per-token
    phonemes = []
    # Tokenize into single CJK characters, Latin-letter runs, and the rest.
    for token in re.findall(r'[\u4e00-\u9fa5]|[a-zA-Z]+|[^a-zA-Z\u4e00-\u9fa5]', text):
        if re.match(r'[\u4e00-\u9fa5]', token):
            pinyin_output = pinyin(token, style=Style.TONE3)
            phonemes.append(pinyin_output[0][0].replace(' ', ''))
        elif re.match(r'[a-zA-Z]+', token):
            phonemes.extend(g2p(token))  # whole words, not single letters
        else:
            phonemes.append(token)
    return phonemes


text_mixed = "你说什么? Wash water off. Why is she inside my own boots? 他没有雨鞋得给他买一双。"
phonemes_mixed = get_phonemes(text_mixed)
print(f"Mixed language phonemes: {phonemes_mixed}")

text_en = "If we use it like. umm"
phonemes_en = get_phonemes(text_en)
print(f"English phonemes: {phonemes_en}")

text_zh = "你说什么?"
phonemes_zh = get_phonemes(text_zh)
print(f"Chinese phonemes: {phonemes_zh}")

text_nl = "Hallo wereld"
phonemes_nl = get_phonemes(text_nl)
print(f"Dutch phonemes: {phonemes_nl}")

text_zh_cn = "他管你叫老姑父。哈哈哈哈哈 嘿 哈哈哈哈哈哈哈哈哈"
phonemes_zh_cn = get_phonemes(text_zh_cn)
print(f"Chinese (Simplified) phonemes: {phonemes_zh_cn}")

text_zh_tw = "你好,世界"
phonemes_zh_tw = get_phonemes(text_zh_tw)
print(f"Chinese (Traditional) phonemes: {phonemes_zh_tw}")
```