


SLT 2021: Shenzhen, China

IEEE Spoken Language Technology Workshop, SLT 2021, Shenzhen, China, January 19-22, 2021. IEEE 2021, ISBN 978-1-7281-7066-4
- Mohan Li, Catalin Zorila, Rama Doddipatla: Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps. 1-7
- Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer: Streaming Attention-Based Models with Augmented Memory for End-To-End Speech Recognition. 8-14
- Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie: Cascade RNN-Transducer: Syllable Based Streaming On-Device Mandarin Speech Recognition with a Syllable-To-Character Converter. 15-21
- Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe: Streaming Transformer Asr With Blockwise Synchronous Beam Search. 22-29
- Jinhwan Park, Chanwoo Kim, Wonyong Sung: Convolution-Based Attention Model With Positional Encoding For Streaming Speech Recognition On Embedded Devices. 30-37
- George Sterpu, Christian Saam, Naomi Harte: Learning to Count Words in Fluent Speech Enables Online Speech Recognition. 38-45
- Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig: Benchmarking LF-MMI, CTC And RNN-T Criteria For Streaming ASR. 46-51
- Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer: Alignment Restricted Streaming Recurrent Neural Network Transducer. 52-59
- Huahuan Zheng, Keyu An, Zhijian Ou: Efficient Neural Architecture Search for End-to-End Speech Recognition Via Straight-Through Gradients. 60-67
- Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman: Transformer Based Deliberation for Two-Pass Speech Recognition. 68-74
- Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie: Simplified Self-Attention for Transformer-Based end-to-end Speech Recognition. 75-81
- Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao: Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion. 82-88
- Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals: On The Usefulness of Self-Attention for Automatic Speech Recognition with Transformers. 89-96
- Thomas Pellegrini, Romain Zimmer, Timothée Masquelier: Low-Activity Supervised Convolutional Spiking Neural Networks Applied to Speech Commands Recognition. 97-103
- Yuxiang Kong, Jian Wu, Quandong Wang, Peng Gao, Weiji Zhuang, Yujun Wang, Lei Xie: Multi-Channel Automatic Speech Recognition Using Deep Complex Unet. 104-110
- Kiran Praveen, Abhishek Pandey, Deepak Kumar, Shakti Prasad Rath, Sandip Shriram Bapat: Dynamically Weighted Ensemble Models for Automatic Speech Recognition. 111-116
- Kazuhiro Nakadai, Yosuke Fukumoto, Ryu Takeda: Investigation of Node Pruning Criteria for Neural Networks Model Compression with Non-Linear Function and Non-Uniform Network Topology. 117-124
- Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Y. Hannun: Semi-Supervised end-to-end Speech Recognition via Local Prior Matching. 125-132
- Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, Joon Son Chung: Metric Learning for Keyword Spotting. 133-140
- Alexandru-Lucian Georgescu, Cristian Manolache, Dan Oneata, Horia Cucu, Corneliu Burileanu: Data-Filtering Methods for Self-Training of Automatic Speech Recognition Systems. 141-147
- Prakhar Swarup, Debmalya Chakrabarty, Ashtosh Sapru, Hitesh Tulsiani, Harish Arsikere, Sri Garimella: Efficient Large Scale Semi-Supervised Learning for CTC Based Acoustic Models. 148-155
- Morgane Rivière, Emmanuel Dupoux: Towards Unsupervised Learning of Speech Features in the Wild. 156-163
- Bowen Shi, Shane Settle, Karen Livescu: Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings. 164-171
- Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig: Improving RNN Transducer Based ASR with Auxiliary Tasks. 172-179
- Songjun Cao, Yike Zhang, Xiaobing Feng, Long Ma: Improving Speech Recognition Accuracy of Local POI Using Geographical Models. 180-185
- Heng-Jui Chang, Alexander H. Liu, Hung-yi Lee, Lin-Shan Lee: End-to-End Whispered Speech Recognition with Frequency-Weighted Approaches and Pseudo Whisper Pre-training. 186-193
- Chenpeng Du, Hao Li, Yizhou Lu, Lan Wang, Yanmin Qian: Data Augmentation for end-to-end Code-Switching Speech Recognition. 194-200
- Bin Wu, Sakriani Sakti, Satoshi Nakamura: Incorporating Discriminative DPGMM Posteriorgrams for Low-Resource ASR. 201-208
- Xinwei Li, Yuanyuan Zhang, Xiaodan Zhuang, Daben Liu: Frame-Level Specaugment for Deep Convolutional Neural Networks in Hybrid ASR Systems. 209-214
- Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux: Data Augmenting Contrastive Learning of Speech Representations in the Time Domain. 215-222
- Ashutosh Pandey, Chunxi Liu, Yun Wang, Yatharth Saraf: Dual Application of Speech Enhancement for Automatic Speech Recognition. 223-228
- Ruizhi Li, Gregory Sell, Hynek Hermansky: Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream end-to-end ASR. 229-235
- Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu: Block-Online Guided Source Separation. 236-242
- Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong: Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition. 243-250
- Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer: Deep Shallow Fusion for RNN-T Personalization. 251-257
- Dan Oneata, Alexandru Caranica, Adriana Stan, Horia Cucu: An Evaluation of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition. 258-265
- Shih-Hsuan Chiu, Berlin Chen: Innovative Bert-Based Reranking Language Models for Speech Recognition. 266-271
- Bipasha Sen, Aditya Agarwal, Mirishkar Sai Ganesh, Anil Kumar Vuppala: Reed: An Approach Towards Quickly Bootstrapping Multilingual Acoustic Models. 272-279
- Minguang Song, Yunxin Zhao, Shaojun Wang, Mei Han: Word Similarity Based Label Smoothing in Rnnlm Training for ASR. 280-285
- Seong Min Kye, Joon Son Chung, Hoirin Kim: Supervised Attention for Speaker Recognition. 286-293
- Seong Min Kye, Yoohwan Kwon, Joon Son Chung: Cross Attentive Pooling for Speaker Verification. 294-300
- Tianyan Zhou, Yong Zhao, Jian Wu: ResNeXt and Res2Net Structures for Speaker Verification. 301-307
- Danwei Cai, Ming Li: Embedding Aggregation for Far-Field Speaker Verification with Distributed Microphone Arrays. 308-315
- Yiling Huang, Yutian Chen, Jason Pelecanos, Quan Wang: Synth2Aug: Cross-Domain Speaker Recognition with TTS Synthesized Speech. 316-322
- Md. Sahidullah, Achintya Kumar Sarkar, Ville Vestman, Xuechen Liu, Romain Serizel, Tomi Kinnunen, Zheng-Hua Tan, Emmanuel Vincent: UIAI System for Short-Duration Speaker Verification Challenge 2020. 323-329
- Zheng Li, Miao Zhao, Lin Li, Qingyang Hong: Multi-Feature Learning with Canonical Correlation Analysis Constraint for Text-Independent Speaker Verification. 330-337
- Hrishikesh Rao, Kedar Phatak, Elie Khoury: Improving Speaker Recognition with Quality Indicators. 338-343
- Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Yen-Hao Chen, Shang-Wen Li, Hung-yi Lee: Audio Albert: A Lite Bert for Self-Supervised Learning of Audio Representation. 344-350
- Bo-Hao Su, Chi-Chun Lee: A Conditional Cycle Emotion Gan for Cross Corpus Speech Emotion Recognition. 351-357
- Michael Neumann, Ngoc Thang Vu: Investigations on audiovisual emotion recognition in noisy conditions. 358-364
- Patrick Meyer, Ziyi Xu, Tim Fingscheidt: Improving Convolutional Recurrent Neural Networks for Speech Emotion Recognition. 365-372
- Manon Macary, Marie Tahon, Yannick Estève, Anthony Rousseau: On the Use of Self-Supervised Pre-Trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition. 373-380
- Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram: Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition. 381-388
- Shi-wook Lee: Domain Generalization with Triplet Network for Cross-Corpus Speech Emotion Recognition. 389-396
- Alice Baird, Shahin Amiriparian, Manuel Milling, Björn W. Schuller: Emotion Recognition in Public Speaking Scenarios Utilising An LSTM-RNN Approach with Attention. 397-402
- Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie: Conversational End-to-End TTS for Voice Agents. 403-409
- Liangqi Liu, Jiankun Hu, Zhiyong Wu, Song Yang, Songfan Yang, Jia Jia, Helen Meng: Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis. 410-414
- Kun Zhou, Berrak Sisman, Haizhou Li: Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech. 415-422
- Yi Lei, Shan Yang, Lei Xie: Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis. 423-430
- Slava Shechtman, Raul Fernandez, David Haws: Supervised and unsupervised approaches for controlling narrow lexical focus in sequence-to-sequence speech synthesis. 431-437
- Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Lingwei Kong, Jing Xiao: GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis. 438-445
- Chung-Ming Chien, Hung-yi Lee: Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis. 446-453
- Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan: Whispered and Lombard Neural Speech Synthesis. 454-461
- Yeunju Choi, Youngmoon Jung, Hoirin Kim: Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning with Spoofing Detection and Spoofing Type Classification. 462-469
- Eunwoo Song, Ryuichi Yamamoto, Min-Jae Hwang, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim: Improved Parallel Wavegan Vocoder with Perceptually Weighted Spectrogram Loss. 470-476
- Yang Ai, Haoyu Li, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling: Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation. 477-484
- Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao: MelGlow: Efficient Waveform Generative Network Based On Location-Variable Convolution. 485-491
- Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie: Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech. 492-498
- Song Li, Beibei Ouyang, Lin Li, Qingyang Hong: Lightspeech: Lightweight Non-Autoregressive Multi-Speaker Text-To-Speech. 499-506
- Hongqiang Du, Xiaohai Tian, Lei Xie, Haizhou Li: Optimizing Voice Conversion Network with Cycle Consistency Loss of Speaker Identity. 507-513
- Tzu-hsien Huang, Jheng-Hao Lin, Hung-yi Lee: How Far Are We from Robust Voice Conversion: A Survey. 514-521
- Heyang Xue, Shan Yang, Yi Lei, Lei Xie, Xiulin Li: Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher. 522-529
- Hayato Shibata, Mingxin Zhang, Takahiro Shinozaki: Unsupervised Acoustic-to-Articulatory Inversion Neural Network Learning Based on Deterministic Policy Gradient. 530-537
- Tianxiang Chen, Elie Khoury: Spoofprint: A New Paradigm for Spoofing Attacks Detection. 538-543
- Yang Gao, Jiachen Lian, Bhiksha Raj, Rita Singh: Detection and Evaluation of Human and Machine Generated Speech in Spoofing Attacks on Automatic Speaker Verification Systems. 544-551
- Chien-yu Huang, Yist Y. Lin, Hung-yi Lee, Lin-Shan Lee: Defending Your Voice: Adversarial Attack on Voice Conversion. 552-559
- Hiroto Kai, Shinnosuke Takamichi, Sayaka Shiota, Hitoshi Kiya: Lightweight Voice Anonymization Based on Data-Driven Optimization of Cascaded Voice Modification Modules. 560-566
- Youngki Kwon, Hee Soo Heo, Jaesung Huh, Bong-Jin Lee, Joon Son Chung: Look Who's Not Talking. 567-573
- Qiujia Li, Florian L. Kreyssig, Chao Zhang, Philip C. Woodland: Discriminative Neural Clustering for Speaker Diarisation. 574-581
- Desh Raj, Zili Huang, Sanjeev Khudanpur: Multi-Class Spectral Clustering with Overlaps for Speaker Diarization. 582-589
- Suchitra Krishnamachari, Manoj Kumar, So Hyun Kim, Catherine Lord, Shrikanth Narayanan: Developing Neural Representations for Robust Child-Adult Diarization. 590-597
- You Jin Kim, Hee Soo Heo, Soo-Whan Chung, Bong-Jin Lee: End-To-End Lip Synchronisation Based on Pattern Classification. 598-605
- Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao: End-To-End Silent Speech Recognition with Acoustic Sensing. 606-612
- Timothy Israel Santos, Andrew Abel, Nick Wilson, Yan Xu: Speaker-Independent Visual Speech Recognition with the Inception V3 Model. 613-620
- Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li: Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations. 621-628
- Mao Saeki, Yoichi Matsuyama, Satoshi Kobashikawa, Tetsuji Ogawa, Tetsunori Kobayashi: Analysis of Multimodal Features for Speaking Proficiency Scoring in an Interview Dialogue. 629-635
- Srinivas Parthasarathy, Shiva Sundaram: Detecting Expressions with Multimodal Transformers. 636-643
- Muralikrishna H, Shikha Gupta, Dileep Aroor Dinesh, Padmanabhan Rajan: Noise-Robust Spoken Language Identification Using Language Relevance Factor Based Embedding. 644-651
- Jörgen Valk, Tanel Alumäe: VOXLINGUA107: A Dataset for Spoken Language Recognition. 652-658
- Xiaosu Tong, Che-Wei Huang, Sri Harish Mallidi, Shaun Joseph, Sonal Pareek, Chander Chandak, Ariya Rastrow, Roland Maas: Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection. 659-664
- Fang Kang, Feiran Yang, Jun Yang: Real-Time Independent Vector Analysis with a Deep-Learning-Based Source Model. 665-669
- Amit Meghanani, Chandran Savithri Anoop, A. G. Ramakrishnan: An Exploration of Log-Mel Spectrogram and MFCC Features for Alzheimer's Dementia Recognition from Spontaneous Speech. 670-677
- Su Ji Park, Alan Rozet: Film Quality Prediction Using Acoustic, Prosodic and Lexical Cues. 678-684
- Yulan Feng, Alan W. Black, Maxine Eskénazi: Towards Automatic Route Description Unification in Spoken Dialog Systems. 685-692
- Subash Khanal, Michael T. Johnson, Narjes Bozorg: Articulatory Comparison of L1 and L2 Speech for Mispronunciation Diagnosis. 693-697
- Yang Shen, Ayano Yasukagawa, Daisuke Saito, Nobuaki Minematsu, Kazuya Saito: Optimized Prediction of Fluency of L2 English Based on Interpretable Network Using Quantity of Phonation and Quality of Pronunciation. 698-704
- Xinhao Wang, Keelan Evanini, Yao Qian, Matthew Mulholland: Automated Scoring of Spontaneous Speech from Young Learners of English Using Transformers. 705-712
- Binghuai Lin, Liyuan Wang, Hongwei Ding, Xiaoli Feng: Improving L2 English Rhythm Evaluation with Automatic Sentence Stress Detection. 713-719
- Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S. R. Mahadeva Prasanna: Enhancing the Intelligibility of Cleft Lip and Palate Speech Using Cycle-Consistent Adversarial Networks. 720-727
- Ram C. M. C. Shekar, Chelzy Belitz, John H. L. Hansen: Development of CNN-Based Cochlear Implant and Normal Hearing Sound Recognition Models Using Natural and Auralized Environmental Audio. 728-733
- Haoyu Li, Yang Ai, Junichi Yamagishi: Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model. 734-741
- Ying Shi, Haolin Chen, Zhiyuan Tang, Lantian Li, Dong Wang, Jiqing Han: Can We Trust Deep Speech Prior? 742-749
- Yanpei Shi, Thomas Hain: Contextual Joint Factor Acoustic Embeddings. 750-757
- Yanpei Shi, Thomas Hain: Supervised Speaker Embedding De-Mixing in Two-Speaker Environment. 758-765
- Jianming Liu, Meng Yu, Yong Xu, Chao Weng, Shi-Xiong Zhang, Lianwu Chen, Dong Yu: Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising. 766-770
- Aditya Jayasimha, Periyasamy Paramasivam: Personalizing Speech Start Point and End Point Detection in ASR Systems from Speaker Embeddings. 771-777
- Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki: Multimodal Attention Fusion for Target Speaker Extraction. 778-784
- Chenda Li, Jing Shi, Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Naoyuki Kamo, Moto Hira, Tomoki Hayashi, Christoph Böddeker, Zhuo Chen, Shinji Watanabe: ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration. 785-792
- Catalin Zorila, Mohan Li, Rama Doddipatla: An Investigation into the Multi-channel Time Domain Speaker Extraction Network. 793-800
- Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu: Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks. 801-808
- Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka: Investigation of End-to-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings. 809-816
- Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shi-Xiong Zhang, Dong Yu, Michael I. Mandel: WPD++: An Improved Neural Beamformer for Simultaneous Speech Separation and Dereverberation. 817-824
- Yi Luo, Cong Han, Nima Mesgarani: Distortion-Controlled Training for end-to-end Reverberant Speech Separation with Auxiliary Autoencoding Loss. 825-832
- Xiaofei Wang, Naoyuki Kanda, Yashesh Gaur, Zhuo Chen, Zhong Meng, Takuya Yoshioka: Exploring End-to-End Multi-Channel ASR with Bias Information for Meeting Transcription. 833-840
- Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Paola García, Kenji Nagamatsu: Online End-To-End Neural Diarization with Speaker-Tracing Buffer. 841-848
- Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu: End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection. 849-856
- Yihui Fu, Jian Wu, Yanxin Hu, Mengtao Xing, Lei Xie: DESNet: A Multi-Channel Network for Simultaneous Speech Dereverberation, Enhancement and Separation. 857-864
- Chenda Li, Yi Luo, Cong Han, Jinyu Li, Takuya Yoshioka, Tianyan Zhou, Marc Delcroix, Keisuke Kinoshita, Christoph Böddeker, Yanmin Qian, Shinji Watanabe, Zhuo Chen: Dual-Path RNN for Long Recording Speech Separation. 865-872
- Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu: RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions. 873-880