Speech noise reduction model for railway passenger station scene
Abstract: To further improve speech recognition in the noisy environment of railway passenger stations, this paper proposes ConformerGAN, a Conformer-based speech noise reduction model. Its training procedure resembles that of a generative adversarial network (GAN): the generator uses a Conformer to extract and model speech features, while the discriminator uses a proxy evaluation function to assess the perceptual quality of speech. To strengthen generalization and improve noise reduction on unseen noise, noise is overlaid by mixing in randomly cropped segments, and a noise dataset for railway passenger station scenes is constructed. Comparison with related speech noise reduction models shows that ConformerGAN raises the Perceptual Evaluation of Speech Quality (PESQ) score by 0.19, effectively improving speech recognition accuracy in noisy railway passenger stations and the voice interaction experience of railway passengers.

Keywords: railway passenger station / speech noise reduction / Conformer / generative adversarial network (GAN) / speech recognition
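The random-segment noise overlay mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `mix_random_noise_segment`, the `snr_db` parameter, and the scaling of the noise to a target signal-to-noise ratio are all assumptions introduced here; the source only states that randomly cropped noise segments are mixed into the speech.

```python
import math
import random


def mix_random_noise_segment(speech, noise, snr_db, rng=None):
    """Overlay a randomly cropped noise segment onto speech.

    Hypothetical helper: the SNR-based scaling is an assumption,
    not something the paper's abstract specifies.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    rng = rng or random.Random()
    n = len(speech)

    # Tile the noise if it is shorter than the speech so a crop is possible.
    if len(noise) < n:
        noise = list(noise) * (n // len(noise) + 1)

    # Pick a random crop of the same length as the speech.
    start = rng.randint(0, len(noise) - n)
    segment = noise[start:start + n]

    speech_power = sum(s * s for s in speech) / n
    noise_power = sum(x * x for x in segment) / n
    if noise_power == 0.0:
        return list(speech)  # silent crop: nothing to add

    # Scale the crop so that speech_power / scaled_noise_power matches snr_db.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * x for s, x in zip(speech, segment)]
```

Because a different crop is drawn on every call, the same clean utterance is paired with different noise content across training epochs, which is one plausible reading of how this strategy helps with unseen noise.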
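Table 1 Experimental environment configuration

| Item | Configuration |
| --- | --- |
| Operating system | Linux |
| CPU model | Intel® Xeon® CPU E5-2698 v4 @ 2.20 GHz |
| GPU model | Tesla V100 |
| Memory | 251 GB |
| Programming language | Python |
| Algorithm framework | PyTorch |

Table 2 Model evaluation results

| Model | PESQ | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- |
| MetricGAN+ | 3.10 | 4.02 | 3.11 | 3.43 |
| ConformerGAN (N=2) | 3.21 | 4.30 | 3.36 | 3.71 |
| ConformerGAN (N=4) | 3.29 | 4.55 | 3.57 | 3.81 |
| ConformerGAN (N=12) | 3.28 | 4.40 | 3.55 | 3.78 |

Table 3 Speech noise reduction results for the station intelligent service robot

| Background noise type | PESQ | CER (before noise reduction) | CER (after noise reduction) |
| --- | --- | --- | --- |
| Station broadcast | 3.10 | 15.67 | 10.35 |
| Manual service desk | 3.20 | 14.32 | 10.27 |
| Ticket gate | 3.22 | 11.68 | 9.44 |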
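Table 3 reports the character error rate (CER) before and after noise reduction. As a reminder of how such a figure is usually obtained, the sketch below computes CER as the character-level edit distance divided by the reference length, and the relative reduction between two CER values. The function names are hypothetical, and treating the Table 3 values as percentages is an assumption; the source does not give the unit or the scoring tool used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative reduction in percent, e.g. of CER after noise reduction."""
    return 100.0 * (before - after) / before


# Assumed usage with the station-broadcast row of Table 3 (values taken as %).
broadcast_gain = relative_reduction(15.67, 10.35)
```

Under that percentage assumption, the station-broadcast row corresponds to a relative CER reduction of roughly one third, the largest of the three noise types.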