ASRU 2021: Cartagena, Colombia
- IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021. IEEE 2021, ISBN 978-1-6654-3739-4
- Christian Huber, Juan Hussain, Sebastian Stüker, Alexander Waibel: Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition. 1-7
- Maxime Burchi, Valentin Vielzeuf: Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition. 8-15
- Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe: A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies. 16-23
- Norbert Braunschweiler, Rama Doddipatla, Simon Keizer, Svetlana Stoyanchev: A Study on Cross-Corpus Speech Emotion Recognition and Data Augmentation. 24-30
- Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer, Giuseppe Riccardi: Detecting Emotion Carriers by Combining Acoustic and Lexical Representations. 31-38
- Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak: Beyond Isolated Utterances: Conversational Emotion Recognition. 39-46
- Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe: A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation. 47-54
- Fu-An Chao, Shao-Wei Fan-Jiang, Bi-Cheng Yan, Jeih-weih Hung, Berlin Chen: TENET: A Time-Reversal Enhancement Network for Noise-Robust ASR. 55-61
- Liqiang He, Shulin Feng, Dan Su, Dong Yu: Latency-Controlled Neural Architecture Search for Streaming Speech Recognition. 62-67
- Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara: Data Augmentation for ASR Using TTS Via a Discrete Representation. 68-75
- Keqi Deng, Songjun Cao, Yike Zhang, Long Ma: Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models. 76-82
- Linchen Zhu, Wenjie Liu, Linquan Liu, Edward Lin: Improving ASR Error Correction Using N-Best Hypotheses. 83-89
- Prachi Singh, Sriram Ganapathy: Self-Supervised Metric Learning With Graph Clustering For Speaker Diarization. 90-97
- Shota Horiguchi, Shinji Watanabe, Paola García, Yawen Xue, Yuki Takashima, Yohei Kawaguchi: Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors. 98-105
- Yi Ma, Kong Aik Lee, Ville Hautamäki, Haizhou Li: PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction. 106-113
- Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, Hosana Kamiyama, Yusuke Ijima: Robust Speech-Age Estimation Using Local Maximum Mean Discrepancy Under Mismatched Recording Conditions. 114-121
- Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang: DeepLip: A Benchmark for Deep Learning-Based Audio-Visual Lip Biometrics. 122-129
- Jeong-Hwan Choi, Joon-Young Yang, Joon-Hyuk Chang: Short-Utterance Embedding Enhancement Method Based on Time Series Forecasting Technique for Text-Independent Speaker Verification. 130-137
- Yan Gao, Titouan Parcollet, Nicholas D. Lane: Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. 138-145
- Biel Tura, Santiago Escuder, Ferran Diego, Carlos Segura, Jordi Luque: Efficient Keyword Spotting by Capturing Long-Range Interactions with Temporal Lambda Networks. 146-153
- Mohan Li, Rama Doddipatla: Improving HS-DACS Based Streaming Transformer ASR with Deep Reinforcement Learning. 154-161
- Xianrui Zheng, Chao Zhang, Philip C. Woodland: Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition. 162-168
- Jakob Poncelet, Hugo Van hamme: Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch. 169-176
- Timo Lohrenz, Patrick Schwarz, Zhengyang Li, Tim Fingscheidt: Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition. 177-184
- Xuechen Liu, Md. Sahidullah, Tomi Kinnunen: Optimized Power Normalized Cepstral Coefficients Towards Robust Deep Speaker Verification. 185-190
- Pierre Champion, Thomas Thebaud, Gaël Le Lan, Anthony Larcher, Denis Jouvet: On the Invertibility of a Voice Privacy System Using Embedding Alignment. 191-197
- Jingyu Li, Si Ioi Ng, Tan Lee: Improving Text-Independent Speaker Verification with Auxiliary Speakers Using Graph. 198-205
- Li Zhang, Qing Wang, Lei Xie: Duality Temporal-Channel-Frequency Attention Enhanced Speaker Representation Learning. 206-213
- Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu: MACCIF-TDNN: Multi Aspect Aggregation of Channel and Context Interdependence Features in TDNN-Based Speaker Verification. 214-219
- Zhuo Li, Ce Fang, Runqiu Xiao, Wenchao Wang, Yonghong Yan: SI-Net: Multi-Scale Context-Aware Convolutional Block for Speaker Verification. 220-227
- Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-Wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe: An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition. 228-235
- Dhanush Bekal, Ashish Shenoy, Monica Sunkara, Sravan Bodapati, Katrin Kirchhoff: Remember the Context! ASR Slot Error Correction Through Memorization. 236-243
- Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu: w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. 244-250
- Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro J. Moreno: Injecting Text in Self-Supervised Speech Pretraining. 251-258
- Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha: TS-RIR: Translated Synthetic Room Impulse Responses for Speech Augmentation. 259-266
- Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, Hermann Ney: On Architectures and Training for Raw Waveform Feature Extraction in ASR. 267-274
- Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw: Multi-User Voicefilter-Lite via Attentive Speaker Embedding. 275-282
- Midia Yousefi, John H. L. Hansen: Speaker Conditioning of Acoustic Models Using Affine Transformation for Multi-Speaker Speech Recognition. 283-288
- Szu-Jui Chen, Wei Xia, John H. L. Hansen: Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora. 289-295
- Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka: A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio. 296-303
- Tom O'Malley, Arun Narayanan, Quan Wang, Alex Park, James Walker, Nathan Howard: A Conformer-Based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation. 304-311
- Arun Narayanan, Chung-Cheng Chiu, Tom O'Malley, Quan Wang, Yanzhang He: Cross-Attention Conformer for Context Modeling in Speech Enhancement for ASR. 312-319
- Li Fu, Xiaoxiao Li, Libo Zi, Zhengchen Zhang, Youzheng Wu, Xiaodong He, Bowen Zhou: Incremental Learning for End-to-End Automatic Speech Recognition. 320-327
- Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Leijing Hou, Shilei Zhang: Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. 328-334
- Jing Zhao, Gui-Xin Shi, Guan-Bo Wang, Wei-Qiang Zhang: Automatic Speech Recognition for Low-Resource Languages: The THUEE Systems for the IARPA OpenASR20 Evaluation. 335-341
- Chandran Savithri Anoop, Prathosh A. P., A. G. Ramakrishnan: Unsupervised Domain Adaptation Schemes for Building ASR in Low-Resource Languages. 342-349
- Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda: Multimodal Emotion Recognition with High-Level Speech and Text Features. 350-357
- Zhi Zhu, Yoshinao Sato: Speech Emotion Recognition Using Semi-Supervised Learning with Efficient Labeling Strategies. 358-365
- Jin Li, Nan Yan, Lan Wang: Unsupervised Cross-Lingual Speech Emotion Recognition Using Pseudo Multilabel. 366-373
- Shi-wook Lee: Ensemble of Domain Adversarial Neural Networks for Speech Emotion Recognition. 374-379
- Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara: ASR Rescoring and Confidence Estimation with Electra. 380-387
- Sachin Singh, Ashutosh Gupta, Aman Maghan, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim: Comparative Study of Different Tokenization Strategies for Streaming End-to-End ASR. 388-394
- Dhananjaya Gowda, Abhinav Garg, Jiyeon Kim, Mehul Kumar, Sachin Singh, Ashutosh Gupta, Ankur Kumar, Nauman Dawalatabad, Aman Maghan, Shatrughan Singh, Chanwoo Kim: HiTNet: Byte-to-BPE Hierarchical Transcription Network for End-to-End Speech Recognition. 395-402
- Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim: Two-Pass End-to-End ASR Model Compression. 403-410
- Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang: Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation. 411-418
- Bidisha Sharma, Maulik C. Madhavi, Xuehao Zhou, Haizhou Li: Exploring Teacher-Student Learning Approach for Multi-Lingual Speech-to-Intent Classification. 419-426
- Tan Liu, Wu Guo: Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features. 427-432
- Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura: Hierarchical Knowledge Distillation for Dialogue Sequence Labeling. 433-440
- Jaeyun Song, Hajin Shim, Eunho Yang: Learning How Long to Wait: Adaptively-Constrained Monotonic Multihead Attention for Streaming ASR. 441-448
- Wei Liu, Tan Lee: Utterance-Level Neural Confidence Measure for End-to-End Children Speech Recognition. 449-456
- Kiran Praveen, Hardik B. Sailor, Abhishek Pandey: Warped Ensembles: A Novel Technique for Improving CTC Based End-to-End Speech Recognition. 457-464
- Shun-Po Chuang, Heng-Jui Chang, Sung-Feng Huang, Hung-yi Lee: Non-Autoregressive Mandarin-English Code-Switching Speech Recognition. 465-472
- Ashutosh Gupta, Aditya Jayasimha, Aman Maghan, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim: Voice to Action: Spoken Language Understanding for Memory-Constrained Systems. 473-479
- Jen-Tzung Chien, Chih-Jung Tsai: Variational Sequential Modeling, Learning and Understanding. 480-486
- Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe: Attention-Based Multi-Hypothesis Fusion for Speech Summarization. 487-494
- Koichiro Ito, Masaki Murata, Tomohiro Ohno, Shigeki Matsubara: Estimating the Generation Timing of Responsive Utterances by Active Listeners of Spoken Narratives. 495-502
- Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann: Context-Aware Transformer Transducer for Speech Recognition. 503-510
- Suwa Xu, Jinwon Lee, Jim Steele: PSVD: Post-Training Compression of LSTM-Based RNN-T Models. 511-517
- Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed: Kaizen: Continuously Improving Teacher Using Exponential Moving Average for Semi-Supervised Speech Recognition. 518-525
- Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong: On Addressing Practical Challenges for RNN-Transducer. 526-533
- Felix Weninger, Marco Gaudesi, Ralf Leibold, Roberto Gemello, Puming Zhan: Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition. 534-540
- Mohammad Omar Khursheed, Christin Jose, Rajath Kumar, Gengshen Fu, Brian Kulis, Santosh Kumar Cheekatmalla: Tiny-CRNN: Streaming Wakeword Detection in a Low Footprint Setting. 541-547
- Shaojin Ding, Ye Jia, Ke Hu, Quan Wang: Textual Echo Cancellation. 548-555
- Daniel Escobar-Grisales, Cristian D. Ríos-Urrego, Diego Alexander Lopez-Santander, Jeferson David Gallo-Aristizábal, Juan Camilo Vásquez-Correa, Elmar Nöth, Juan Rafael Orozco-Arroyave: Colombian Dialect Recognition Based on Information Extracted from Speech and Text Signals. 556-563
- Yangyang Xia, Buye Xu, Anurag Kumar: Incorporating Real-World Noisy Speech in Neural-Network-Based Speech Enhancement Systems. 564-570
- Takuya Higuchi, Anmol Gupta, Chandra Dhir: Multi-Task Learning with Cross Attention for Keyword Spotting. 571-578
- Xinhao Wang, Christopher Hamill: Automatic Generation of Diagnostic Content Feedback in Spoken Language Learning and Assessment. 579-586
- Thomas Schaaf, Longxiang Zhang, Alireza Bayestehtashk, Mark C. Fuhs, Shahid Durrani, Susanne Burger, Monika Woszczyna, Thomas Polzin: Are You Dictating to Me? Detecting Embedded Dictations in Doctor-Patient Conversations. 587-593
- Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li: Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer. 594-601
- Mengxin Chai, Shaotong Guo, Cheng Gong, Longbiao Wang, Jianwu Dang, Ju Zhang: Learning Language and Speaker Information for Code-Switch Speech Synthesis with Limited Data. 602-609
- Takuma Okamoto, Tomoki Toda, Hisashi Kawai: Multi-Stream HiFi-GAN with Data-Driven Waveform Decomposition. 610-617
- Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li: DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding. 618-625
- Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee: EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion. 626-633
- Raymond Chung, Brian Mak: On-The-Fly Data Augmentation for Text-to-Speech Style Transfer. 634-641
- Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda: On Prosody Modeling for ASR+TTS Based Voice Conversion. 642-649
- Ming-Chi Yen, Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Shu-Wei Tsai, Yu Tsao, Tomoki Toda, Jyh-Shing Roger Jang, Hsin-Min Wang: Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling. 650-657
- Jiangyu Han, Wei Rao, Yanhua Long, Jiaen Liang: Attention-Based Scaling Adaptation for Target Speech Extraction. 658-662
- Huiyu Shi, Xi Chen, Tianlong Kong, Shouyi Yin, Peng Ouyang: GLMSnet: Single Channel Speech Separation Framework in Noisy and Reverberant Environments. 663-670
- Lu Zhang, Chenxing Li, Feng Deng, Xiaorui Wang: Multi-Task Audio Source Separation. 671-678
- Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yukai Ju, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu, Shidong Shang: ConferencingSpeech Challenge: Towards Far-Field Multi-Channel Speech Enhancement for Video Conferencing. 679-686
- Khaled Hechmi, Trung Ngo Trong, Ville Hautamäki, Tomi Kinnunen: Voxceleb Enrichment for Age and Gender Recognition. 687-693
- Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Carlos Segura: Enabling Zero-Shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders. 694-701
- Neil Zeghidour, Olivier Teboul, David Grangier: Dive: End-to-End Speech Diarization Via Iterative Speaker Embedding. 702-709
- Damien Ronssin, Milos Cernak: AC-VC: Non-Parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion. 710-716
- Marvin Borsdorf, Haizhou Li, Tanja Schultz: Target Language Extraction at Multilingual Cocktail Parties. 717-724
- Jose Antonio Lopez Saenz, Md Asif Jalal, Rosanna Milner, Thomas Hain: Attention Based Model for Segmental Pronunciation Error Detection. 725-732
- Elizabeth Salesky, Julian Mäder, Severin Klinger: Assessing Evaluation Metrics for Speech-to-Speech Translation. 733-740
- Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng: DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion. 741-748
- Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari: Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network. 749-756
- Björn Plüster, Cornelius Weber, Leyuan Qu, Stefan Wermter: Hearing Faces: Target Speaker Text-to-Speech Synthesis from a Face. 757-764
- Bhagyashree Mukherjee, Anusha Prakash, Hema A. Murthy: Analysis of Conversational Speech with Application to Voice Adaptation. 765-772
- Ruolan Liu, Xue Wen, Chunhui Lu, Liming Song, June Sig Sung: Vibrato Learning in Multi-Singer Singing Voice Synthesis. 773-779
- Guangzhi Sun, Chao Zhang, Philip C. Woodland: Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition. 780-787
- Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney: Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures. 788-795
- Dmitriy Serdyuk, Otavio Braga, Olivier Siohan: Audio-Visual Speech Recognition is Worth 32×32×8 Voxels. 796-802
- Andrea Carmantini, Steve Renals, Peter Bell: Leveraging Linguistic Knowledge for Accent Robustness of End-to-End Models. 803-810
- Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis: An Evaluation Benchmark for Automatic Speech Recognition of German-English Code-Switching. 811-816
- Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis: Learning to Translate Low-Resourced Swiss German Dialectal Speech into Standard German Text. 817-823
- Marco Gaudesi, Felix Weninger, Dushyant Sharma, Puming Zhan: ChannelAugment: Improving Generalization of Multi-Channel ASR by Training with Input Channel Randomization. 824-829
- Chia-Yu Li, Ngoc Thang Vu: Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN. 830-836
- Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel: Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding. 837-844
- Zheng Gao, Mohamed Abdelhady, Radhika Arava, Xibin Gao, Qian Hu, Wei Xiao, Thahir Mohamed: X-SHOT: Learning to Rank Voice Applications Via Cross-Locale Shard-Based Co-Training. 845-852
- Akshat Gupta, Olivia Deng, Akruti Kushwaha, Saloni Mittal, William Zeng, Sai Krishna Rallabandi, Alan W. Black: Intent Recognition and Unsupervised Slot Identification for Low-Resourced Spoken Dialog Systems. 853-860
- Kishan Sachdeva, Joshua Maynez, Olivier Siohan: Action Item Detection in Meetings Using Pretrained Transformers. 861-868