21st Interspeech 2020: Shanghai, China
- Helen Meng, Bo Xu, Thomas Fang Zheng: 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020. ISCA 2020
Keynote 1
- Janet B. Pierrehumbert: The cognitive status of simple and complex models.
ASR Neural Network Architectures I
- Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu: On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition. 1-5
- Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin: SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition. 6-10
- Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf: Contextual RNN-T for Open Domain ASR. 11-15
- Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu Jeong Han, Tao Lei, Tao Ma: ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition. 16-20
- Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo: Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity. 21-25
- Timo Lohrenz, Tim Fingscheidt: BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example. 26-30
- Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel: Relative Positional Encoding for Speech Recognition and Direct Translation. 31-35
- Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka: Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of any Number of Speakers. 36-40
- Takashi Fukuda, Samuel Thomas: Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework. 41-45
- Jinhwan Park, Wonyong Sung: Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition. 46-50
Multi-Channel Speech Enhancement
- Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao: Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition. 51-55
- Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu: Neural Spatio-Temporal Beamformer for Target Speech Separation. 56-60
- Li Li, Kazuhito Koishida, Shoji Makino: Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis. 61-65
- Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu: End-to-End Multi-Look Keyword Spotting. 66-70
- Weilong Huang, Jinwei Feng: Differential Beamforming for Uniform Circular Array with Directional Microphones. 71-75
- Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee: Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement. 76-80
- Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie: An End-to-End Architecture of Online Multi-Channel Speech Separation. 81-85
- Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi: Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation. 86-90
- Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki: Computationally Efficient and Versatile Framework for Joint Optimization of Blind Speech Separation and Dereverberation. 91-95
- Yanhui Tu, Jun Du, Lei Sun, Feng Ma, Jia Pan, Chin-Hui Lee: A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge. 96-100
Speech Processing in the Brain
- Youssef Hmamouche, Laurent Prévot, Magalie Ochs, Thierry Chaminade: Identifying Causal Relationships Between Behavior and Local Brain Activity During Natural Conversation. 101-105
- Di Zhou, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Zhuo Zhang: Neural Entrainment to Natural Speech Envelope Based on Subject Aligned EEG Signals. 106-110
- Chongyuan Lian, Tianqi Wang, Mingxiao Gu, Manwa L. Ng, Feiqi Zhu, Lan Wang, Nan Yan: Does Lexical Retrieval Deteriorate in Patients with Mild Cognitive Impairment? Analysis of Brain Functional Network Will Tell. 111-115
- Zhen Fu, Jing Chen: Congruent Audiovisual Speech Enhances Cortical Envelope Tracking During Auditory Selective Attention. 116-120
- Lei Wang, Ed X. Wu, Fei Chen: Contribution of RMS-Level-Based Speech Segments to Target Speech Decoding Under Noisy Conditions. 121-124
- Bin Zhao, Jianwu Dang, Gaoyan Zhang, Masashi Unoki: Cortical Oscillatory Hierarchy for Natural Sentence Processing. 125-129
- Louis ten Bosch, Kimberley Mulder, Lou Boves: Comparing EEG Analyses with Different Epoch Alignments in an Auditory Lexical Decision Experiment. 130-134
- Tanya Talkar, Sophia Yuditskaya, James R. Williamson, Adam C. Lammert, Hrishikesh Rao, Daniel J. Hannon, Anne T. O'Brien, Gloria Vergara-Diaz, Richard DeLaura, Douglas E. Sturim, Gregory A. Ciccarelli, Ross Zafonte, Jeff Palmer, Paolo Bonato, Thomas F. Quatieri: Detection of Subclinical Mild Traumatic Brain Injury (mTBI) Through Speech and Gait. 135-139
Speech Signal Representation
- Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Félix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv: Towards Learning a Universal Non-Semantic Representation of Speech. 140-144
- Rajeev Rajan, Aiswarya Vinod Kumar, Ben P. Babu: Poetic Meter Classification Using i-Vector-MTF Fusion. 145-149
- Wang Dai, Jinsong Zhang, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie: Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism. 150-154
- Na Hu, Berit Janssen, Judith Hanssen, Carlos Gussenhoven, Aoju Chen: Automatic Analysis of Speech Prosody in Dutch. 155-159
- Adrien Gresse, Mathias Quillot, Richard Dufour, Jean-François Bonastre: Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting. 160-164
- B. Yegnanarayana, Joseph M. Anand, Vishala Pannala: Enhancing Formant Information in Spectrographic Display of Speech. 165-169
- Michael Gump, Wei-Ning Hsu, James R. Glass: Unsupervised Methods for Evaluating Speech Representations. 170-174
- Dung N. Tran, Uros Batricevic, Kazuhito Koishida: Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments. 175-179
- Amrith Setlur, Barnabás Póczos, Alan W. Black: Nonlinear ISA with Auxiliary Variables for Learning Speech Representations. 180-184
- Hirotoshi Takeuchi, Kunio Kashino, Yasunori Ohishi, Hiroshi Saruwatari: Harmonic Lowering for Accelerating Harmonic Convolution for Audio Signals. 185-189
Speech Synthesis: Neural Waveform Generation I
- Yang Ai, Zhen-Hua Ling: Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders. 190-194
- Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu: FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction. 195-199
- Jinhyeok Yang, Junmo Lee, Young-Ik Kim, Hoon-Young Cho, Injung Kim: VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network. 200-204
- Hiroki Kanagawa, Yusuke Ijima: Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition. 205-209
- Po-Chun Hsu, Hung-yi Lee: WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU. 210-214
- Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber: What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS. 215-219
- Vadim Popov, Stanislav Kamenev, Mikhail A. Kudinov, Sergey Repyevsky, Tasnima Sadekova, Vitalii Bushaev, Vladimir Kryzhanovskiy, Denis Parkhomenko: Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet. 220-224
- Wei Song, Guanghui Xu, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou: Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed. 225-229
- Sébastien Le Maguer, Naomi Harte: Can Auditory Nerve Models Tell us What's Different About WaveNet Vocoded Speech? 230-234
- Dipjyoti Paul, Yannis Pantazis, Yannis Stylianou: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions. 235-239
- Zhijun Liu, Kuan Chen, Kai Yu: Neural Homomorphic Vocoder. 240-244
Automatic Speech Recognition for Non-Native Children’s Speech
- Roberto Gretter, Marco Matassoni, Daniele Falavigna, Keelan Evanini, Chee Wee Leong: Overview of the Interspeech TLT2020 Shared Task on ASR for Non-Native Children's Speech. 245-249
- Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen: The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge. 250-254
- Kate M. Knill, Linlin Wang, Yu Wang, Xixin Wu, Mark J. F. Gales: Non-Native Children's Automatic Speech Recognition: The INTERSPEECH 2020 Shared Task ALTA Systems. 255-259
- Hemant Kumar Kathania, Mittul Singh, Tamás Grósz, Mikko Kurimo: Data Augmentation Using Prosody and False Starts to Recognize Non-Native Children's Speech. 260-264
- Mostafa Ali Shahin, Renée Lu, Julien Epps, Beena Ahmed: UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children's Speech. 265-268
Speaker Diarization
- Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu: End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. 269-273
- Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Y. Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko: Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario. 274-278
- Hagai Aronowitz, Weizhong Zhu, Masayuki Suzuki, Gakuto Kurata, Ron Hoory: New Advances in Speaker Diarization. 279-283
- Qingjian Lin, Yu Hou, Ming Li: Self-Attentive Similarity Measurement Strategies in Speaker Diarization. 284-288
- Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno: Speaker Attribution with Voice Profiles by Graph-Based Semi-Supervised Learning. 289-293
- Prachi Singh, Sriram Ganapathy: Deep Self-Supervised Hierarchical Clustering for Speaker Diarization. 294-298
- Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman: Spot the Conversation: Speaker Diarisation in the Wild. 299-303
Noise Robust and Distant Speech Recognition
- Wangyou Zhang, Yanmin Qian: Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition. 304-308
- Zhihao Du, Jiqing Han, Xueliang Zhang: Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition. 309-313
- Antoine Bruguier, Ananya Misra, Arun Narayanan, Rohit Prabhavalkar: Anti-Aliasing Regularization in Stacking Layers. 314-318
- Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov: Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription. 319-323
- Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Shinji Watanabe, Yanmin Qian: End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming. 324-328
- Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas D. Lane, Mohamed Morchid: Quaternion Neural Networks for Multi-Channel Distant Speech Recognition. 329-333
- Hangting Chen, Pengyuan Zhang, Qian Shi, Zuozhen Liu: Improved Guided Source Separation Integrated with a Strong Back-End for the CHiME-6 Dinner Party Scenario. 334-338
- Dongmei Wang, Zhuo Chen, Takuya Yoshioka: Neural Speech Separation Using Spatially Distributed Microphones. 339-343
- Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu: Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones. 344-348
- Jack Deadman, Jon Barker: Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset. 349-353
Speech in Multimodality
- Catarina Botelho, Lorenz Diener, Dennis Küster, Kevin Scheck, Shahin Amiriparian, Björn W. Schuller, Tanja Schultz, Alberto Abad, Isabel Trancoso: Toward Silent Paralinguistics: Speech-to-EMG - Retrieving Articulatory Muscle Activity from Speech. 354-358
- Jiaxuan Zhang, Sarah Ita Levitan, Julia Hirschberg: Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features. 359-363
- Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li: Multi-Modal Attention for Speech Emotion Recognition. 364-368
- Guang Shen, Riwei Lai, Rui Chen, Yu Zhang, Kejia Zhang, Qilong Han, Hongtao Song: WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition. 369-373
- Ming Chen, Xudong Zhao: A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition. 374-378
- Pengfei Liu, Kun Li, Helen Meng: Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition. 379-383
- Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram: Multi-Modal Embeddings Using Multi-Task Learning for Emotion Recognition. 384-388
- Jeng-Lin Li, Chi-Chun Lee: Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network. 389-393
- Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li: Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition. 394-398
Speech, Language, and Multimodal Resources
- Bo Yang, Xianlong Tan, Zhengmao Chen, Bing Wang, Min Ruan, Dan Li, Zhongping Yang, Xiping Wu, Yi Lin: ATCSpeech: A Multilingual Pilot-Controller Speech Corpus from Real Air Traffic Control Environment. 399-403
- Alexander Gutkin, Isin Demirsahin, Oddur Kjartansson, Clara Rivera, Kólá Túbosún: Developing an Open-Source Corpus of Yoruba Speech. 404-408
- Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, Sunghun Kim: ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers. 409-413
- Yanhong Wang, Huan Luan, Jiahong Yuan, Bin Wang, Hui Lin: LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR. 414-418
- Vikram Ramanarayanan: Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency. 419-423
- Si Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong: CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment. 424-428
- Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo: FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics. 429-433
- Maarten Van Segbroeck, Ahmed Zaid, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Xuewen Luo, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, Roland Maas: DiPCo - Dinner Party Corpus. 434-436
- Bo Wang, Yue Wu, Niall Taylor, Terry J. Lyons, Maria Liakata, Alejo J. Nevado-Holgado, Kate E. A. Saunders: Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews. 437-441
- Andreas Kirkedal, Marija Stepanovic, Barbara Plank: FT Speech: Danish Parliament Speech Corpus. 442-446
Language Recognition
- Raphaël Duroselle, Denis Jouvet, Irina Illina: Metric Learning Loss Functions to Reduce Domain Mismatch in the x-Vector Space for Language Recognition. 447-451
- Zheng Li, Miao Zhao, Jing Li, Yiming Zhi, Lin Li, Qingyang Hong: The XMUSPEECH System for the AP19-OLR Challenge. 452-456
- Zheng Li, Miao Zhao, Jing Li, Lin Li, Qingyang Hong: On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification. 457-461
- Shammur A. Chowdhury, Ahmed Ali, Suwon Shon, James R. Glass: What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information? 462-466
- Matias Lindgren, Tommi Jauhiainen, Mikko Kurimo: Releasing a Toolkit and Comparing the Performance of Language Embeddings Across Various Spoken Language Identification Datasets. 467-471
- Aitor Arronte Alvarez, Elsayed Sabry Abdelaal Issa: Learning Intonation Pattern Embeddings for Arabic Dialect Identification. 472-476
- Badr M. Abdullah, Tania Avgustinova, Bernd Möbius, Dietrich Klakow: Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages. 477-481
Speech Processing and Analysis
- Noé Tits, Kevin El Haddad, Thierry Dutoit: ICE-Talk: An Interface for a Controllable Expressive Talking Machine. 482-483
- Mathieu Hu, Laurent Pierron, Emmanuel Vincent, Denis Jouvet: Kaldi-Web: An Installation-Free, On-Device Speech Recognition System. 484-485
- Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Agape Deng, Arnaud Letondor, Robert O'Regan, Qiru Zhou: Soapbox Labs Verification Platform for Child Speech. 486-487
- Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Gloria Montoya Gomez, Agape Deng, Arnaud Letondor, Niall Mullally, Adrian Hempel, Robert O'Regan, Qiru Zhou: SoapBox Labs Fluency Assessment Platform for Child Speech. 488-489
- Baybars Külebi, Alp Öktem, Alex Peiró Lilja, Santiago Pascual, Mireia Farrús: CATOTRON - A Neural Text-to-Speech System in Catalan. 490-491
- Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, David Pautler, Doug Habberstad, Andrew Cornish, Hardik Kothare, Vignesh Murali, Jackson Liscombe, Dirk Schnelle-Walka, Patrick L. Lange, David Suendermann-Oeft: Toward Remote Patient Monitoring of Speech, Video, Cognitive and Respiratory Biomarkers Using Multimodal Dialog Technology. 492-493
- Baihan Lin, Xinxin Zhang: VoiceID on the Fly: A Speaker Recognition System that Learns from Scratch. 494-495
Speech Emotion Recognition I
- Zhao Ren, Jing Han, Nicholas Cummins, Björn W. Schuller: Enhancing Transferability of Black-Box Adversarial Attacks via Lifelong Learning for Speech Emotion Recognition Models. 496-500
- Han Feng, Sei Ueno, Tatsuya Kawahara: End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model. 501-505
- Bo-Hao Su, Chun-Min Chang, Yun-Shao Lin, Chi-Chun Lee: Improving Speech Emotion Recognition Using Graph Attentive Bi-Directional Gated Recurrent Unit Network. 506-510
- Adria Mallol-Ragolta, Nicholas Cummins, Björn W. Schuller: An Investigation of Cross-Cultural Semi-Supervised Learning for Continuous Affect Recognition. 511-515
- Kusha Sridhar, Carlos Busso: Ensemble of Students Taught by Probabilistic Teachers to Improve Speech Emotion Recognition. 516-520
- Siddique Latif, Muhammad Asim, Rajib Rana, Sara Khalifa, Raja Jurdak, Björn W. Schuller: Augmenting Generative Adversarial Networks for Speech Emotion Recognition. 521-525
- Vipula Dissanayake, Haimo Zhang, Mark Billinghurst, Suranga Nanayakkara: Speech Emotion Recognition 'in the Wild' Using an Autoencoder. 526-530
- Shuiyang Mao, Pak-Chung Ching, Tan Lee: Emotion Profile Refinery for Speech Emotion Classification. 531-535
- Sung-Lin Yeh, Yun-Shao Lin, Chi-Chun Lee: Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation. 536-540
ASR Neural Network Architectures and Training I
- Kshitiz Kumar, Emilian Stoimenov, Hosam Khalil, Jian Wu: Fast and Slow Acoustic Model. 541-545
- Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix: Self-Distillation for Improving CTC-Transformer-Based ASR Systems. 546-550
- Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury: Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard. 551-555
- Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno: Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection. 556-560
- Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur: PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR. 561-565
- Keyu An, Hongyu Xiang, Zhijian Ou: CAT: A CTC-CRF Based ASR Toolkit Bridging the Hybrid and the End-to-End Approaches Towards Data Efficiency and Low Latency. 566-570
- Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara: CTC-Synchronous Training for Monotonic Attention Model. 571-575
- Brady Houston, Katrin Kirchhoff: Continual Learning for Multi-Dialect Acoustic Models. 576-580
- Xingcheng Song, Zhiyong Wu, Yiheng Huang, Dan Su, Helen Meng: SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition. 581-585
Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation
- Adriana Stan: RECOApy: Data Recording, Pre-Processing and Phonetic Transcription for End-to-End Speech-Based Applications. 586-590
- Yuan Shangguan, Kate Knister, Yanzhang He, Ian McGraw, Françoise Beaufays: Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer. 591-595
- Zhe Liu, Fuchun Peng: Statistical Testing on ASR Performance via Blockwise Bootstrap. 596-600
- Anil Ramakrishna, Shrikanth Narayanan: Sentence Level Estimation of Psycholinguistic Norms Using Joint Multidimensional Annotations. 601-605
- Kai Fan, Bo Li, Jiayi Wang, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhijie Yan: Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System. 606-610
- Alejandro Woodward, Clara Bonnín, Issey Masuda, David Varas, Elisenda Bou-Balust, Juan Carlos Riveiro: Confidence Measures in Encoder-Decoder Models for Speech Recognition. 611-615
- Ahmed Ali, Steve Renals: Word Error Rate Estimation Without ASR Output: e-WER2. 616-620
- Bogdan Ludusan, Petra Wagner: An Evaluation of Manual and Semi-Automatic Laughter Annotation. 621-625
- Joshua L. Martin, Kevin Tang: Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual "be". 626-630
Phonetics and Phonology
- Georgia Zellou, Rebecca Scarborough, Renee Kemp: Secondary Phonetic Cues in the Production of the Nasal Short-a System in California English. 631-635
- Louis-Marie Lorin, Lorenzo Maselli, Léo Varnet, Maria Giavazzi: Acoustic Properties of Strident Fricatives at the Edges: Implications for Consonant Discrimination. 636-640
- Mingqiong Luo: Processes and Consequences of Co-Articulation in Mandarin V1N.(C2)V2 Context: Phonology and Phonetics. 641-645
- Yang Yue, Fang Hu: Voicing Distinction of Obstruents in the Hangzhou Wu Chinese Dialect. 646-650
- Lei Wang: The Phonology and Phonetics of Kaifeng Mandarin Vowels. 651-655
- Margaret Zellers, Barbara Schuppler: Microprosodic Variability in Plosives in German and Austrian German. 656-660
- Jing Huang, Feng-fan Hsieh, Yueh-Chin Chang: Er-Suffixation in Southwestern Mandarin: An EMA and Ultrasound Study. 661-665
- Yinghao Li, Jinghua Zhang: Electroglottographic-Phonetic Study on Korean Phonation Induced by Tripartite Plosives in Yanbian Korean. 666-670
- Nicholas Wilkins, Max Cordes Galbraith, Ifeoma Nwogu: Modeling Global Body Configurations in American Sign Language. 671-675
Topics in ASR I
- Hang Li, Siyuan Chen, Julien Epps: Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation. 676-680
- Weiyi Lu, Yi Xu, Peng Yang, Belinda Zeng: CAM: Uninteresting Speech Detector. 681-685
- Diamantino Caseiro, Pat Rondon, Quoc-Nam Le The, Petar S. Aleksic: Mixed Case Contextual ASR Using Capitalization Masks. 686-690
- Huanru Henry Mao, Shuyang Li, Julian J. McAuley, Garrison W. Cottrell: Speech Recognition and Multi-Speaker Diarization of Long Conversations. 691-695
- Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng: Investigation of Data Augmentation Techniques for Disordered Speech Recognition. 696-700
- Wenqi Wei, Jianzong Wang, Jiteng Ma, Ning Cheng, Jing Xiao: A Real-Time Robot-Based Auxiliary System for Risk Evaluation of COVID-19 Infection. 701-705
- David S. Barbera, Mark A. Huckvale, Victoria Fleming, Emily Upton, Henry Coley-Fisher, Ian Shaw, William H. Latham, Alexander P. Leff, Jenny Crinion: An Utterance Verification System for Word Naming Therapy in Aphasia. 706-710
- Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng: Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition. 711-715
- Binghuai Lin, Liyuan Wang: Joint Prediction of Punctuation and Disfluency in Speech Transcripts. 716-720
- Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Ye Bai, Cunhang Fan: Focal Loss for Punctuation Prediction. 721-725
Large-Scale Evaluation of Short-Duration Speaker Verification
- Zhuxin Chen, Yue Lin: Improving X-Vector and PLDA for Text-Dependent Speaker Verification. 726-730
- Hossein Zeinali, Kong Aik Lee, Jahangir Alam, Lukás Burget: SdSV Challenge 2020: Large-Scale Evaluation of Short-Duration Speaker Verification. 731-735
- Tao Jiang, Miao Zhao, Lin Li, Qingyang Hong: The XMUSPEECH System for Short-Duration Speaker Verification Challenge 2020. 736-740
- Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim: Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020. 741-745
- Tanel Alumäe, Jörgen Valk: The TalTech Systems for the Short-Duration Speaker Verification Challenge 2020. 746-750
- Peng Shen, Xugang Lu, Hisashi Kawai: Investigation of NICT Submission for Short-Duration Speaker Verification Challenge 2020. 751-755
- Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck: Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization. 756-760
- Alicia Lozano-Diez, Anna Silnova, Bhargav Pulugundla, Johan Rohdin, Karel Veselý, Lukás Burget, Oldrich Plchot, Ondrej Glembek, Ondrej Novotný, Pavel Matejka: BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020. 761-765
- Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu, Abeer Alwan: Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification. 766-770
Voice Conversion and Adaptation I
- Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning. 771-775
- Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna: Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition. 776-780
- Yanping Li, Dongxiang Xu, Yan Zhang, Yang Wang, Binbin Chen: Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN. 781-785
- Adam Polyak, Lior Wolf, Yaniv Taigman: TTS Skins: Speaker Conversion via ASR. 786-790
- Zining Zhang, Bingsheng He, Zhenjie Zhang: GAZEV: GAN-Based Zero-Shot Voice Conversion Over Non-Parallel Speech Corpus. 791-795
- Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Rongxiu Zhong: Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation. 796-800
- Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman: Unsupervised Cross-Domain Singing Voice Conversion. 801-805
- Tatsuma Ishihara, Daisuke Saito: Attention-Based Speaker Embeddings for One-Shot Voice Conversion. 806-810
- Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, Guanglu Wan: Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training. 811-815
Acoustic Event Detection
- Sixin Hong, Yuexian Zou, Wenwu Wang: Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging. 816-820
- Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang: Environmental Sound Classification with Parallel Temporal-Spectral Attention. 821-825
- Luyu Wang, Kazuya Kawakami, Aäron van den Oord: Contrastive Predictive Coding of Audio with an Adversary. 826-830
- Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos: Memory Controlled Sequential Self Attention for Sound Recognition. 831-835
- Donghyeon Kim, Jaihyun Park, David K. Han, Hanseok Ko: Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification. 836-840
- Xu Zheng, Yan Song, Jie Yan, Li-Rong Dai, Ian McLoughlin, Lin Liu: An Effective Perturbation Based Semi-Supervised Learning Method for Sound Event Detection. 841-845
- Chieh-Chi Kao, Bowen Shi, Ming Sun, Chao Wang: A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling. 846-850
- Chun-Chieh Chang, Chieh-Chi Kao, Ming Sun, Chao Wang: Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging. 851-855
- In Young Park, Hong Kook Kim: Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification. 856-860
- Amit Jindal, Narayanan Elavathur Ranganatha, Aniket Didolkar, Arijit Ghosh Chowdhury, Di Jin, Ramit Sawhney, Rajiv Ratn Shah: SpeechMix - Augmenting Deep Sound Recognition Using Hidden Space Interpolations. 861-865
Spoken Language Understanding I
- Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann: End-to-End Neural Transformer Based Spoken Language Understanding. 866-870
- Chen Liu, Su Zhu, Zijian Zhao, Ruisheng Cao, Lu Chen, Kai Yu: Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding. 871-875
- Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, Ariya Rastrow: Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces. 876-880
- Pavel Denisov, Ngoc Thang Vu: Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning. 881-885
- Srikanth Raj Chetupalli, Sriram Ganapathy: Context Dependent RNNLM for Automatic Transcription of Conversations. 886-890
- Yusheng Tian, Philip John Gorinski: Improving End-to-End Speech-to-Intent Classification with Reptile. 891-895
- Won-Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim: Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation. 896-900
- Weitong Ruan, Yaroslav Nechaev, Luoxin Chen, Chengwei Su, Imre Kiss: Towards an ASR Error Robust Spoken Language Understanding System. 901-905
- Hong-Kwang Jeff Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis A. Lastras: End-to-End Spoken Language Understanding Without Full Transcripts. 906-910
- Karthik Gopalakrishnan, Behnam Hedayatnia, Longshaokan Wang, Yang Liu, Dilek Hakkani-Tür: Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study. 911-915
DNN Architectures for Speaker Recognition
- Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang: AutoSpeech: Neural Architecture Search for Speaker Recognition. 916-920
- Ya-Qi Yu, Wu-Jun Li: Densely Connected Time Delay Neural Network for Speaker Verification. 921-925
- Siqi Zheng, Yun Lei, Hongbin Suo: Phonetically-Aware Coupled Network For Short Duration Text-Independent Speaker Verification. 926-930
- Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim: Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention. 931-935
- Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu: Vector-Based Attentive Pooling for Text-Independent Speaker Verification. 936-940
- Pooyan Safari, Miquel India, Javier Hernando: Self-Attention Encoding and Pooling for Speaker Recognition. 941-945
- Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu: ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification. 946-950
- Hanyi Zhang, Longbiao Wang, Yunchun Zhang, Meng Liu, Kong Aik Lee, Jianguo Wei: Adversarial Separation Network for Speaker Recognition. 951-955
- Jingyu Li, Tan Lee: Text-Independent Speaker Verification with Dual Attention Network. 956-960
- Xiaoyang Qu, Jianzong Wang, Jing Xiao: Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification. 961-965
ASR Model Training and Strategies
- Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu: Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition. 966-970
- Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou: Semantic Mask for Transformer Based End-to-End Speech Recognition. 971-975
- Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig: Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces. 976-980
- Dimitrios Dimitriadis, Ken'ichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez: A Federated Approach in Training Acoustic Models. 981-985
- Imran A. Sheikh, Emmanuel Vincent, Irina Illina: On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data. 986-990
- Yixin Gao, Noah D. Stein, Chieh-Chi Kao, Yunliang Cai, Ming Sun, Tao Zhang, Shiv Naga Prasad Vitaladevuni: On Front-End Gain Invariant Modeling for Wake Word Spotting. 991-995
- Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du: Unsupervised Regularization-Based Adaptive Training for Speech Recognition. 996-1000
- Erfan Loweimi, Peter Bell, Steve Renals: On the Robustness and Training Dynamics of Raw Waveform Models. 1001-1005
- Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Y. Hannun, Gabriel Synnaeve, Ronan Collobert: Iterative Pseudo-Labeling for Speech Recognition. 1006-1010
Speech Annotation and Speech Assessment
- Naoko Kawamura, Tatsuya Kitamura, Kenta Hamada: Smart Tube: A Biofeedback System for Vocal Training and Therapy Through Tube Phonation. 1011-1012
- Seong Choi, Seunghoon Jeong, Jeewoo Yoon, Migyeong Yang, Minsam Ko, Eunil Park, Jinyoung Han, Munyoung Lee, Seonghee Lee: VCTUBE: A Library for Automatic Speech Data Annotation. 1013-1014
- Yanlu Xie, Xiaoli Feng, Boxue Li, Jinsong Zhang, Yujia Jin: A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback. 1015-1016
- Tejas Udayakumar, Kinnera Saranu, Mayuresh Sanjay Oak, Ajit Ashok Saunshikhar, Sandip Shriram Bapat: Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains. 1017-1018
- Ke Shi, Kye Min Tan, Richeng Duan, Siti Umairah Md. Salleh, Nur Farah Ain Suhaimi, Rajan Vellu, Ngoc Thuy Huong Helen Thai, Nancy F. Chen: Computer-Assisted Language Learning System: Automatic Speech Evaluation for Children Learning Malay and Tamil. 1019-1020
- Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari: Real-Time, Full-Band, Online DNN-Based Voice Conversion System Using a Single CPU. 1021-1022
- Xiaoli Feng, Yanlu Xie, Yayue Deng, Boxue Li: A Dynamic 3D Pronunciation Teaching Model Based on Pronunciation Attributes and Anatomy. 1023-1024
- Naoki Kimura, Zixiong Su, Takaaki Saeki: End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge. 1025-1026
Cross/Multi-Lingual and Code-Switched Speech Recognition
- Jialu Li, Mark Hasegawa-Johnson: Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous? 1027-1031
- Martha Yifiru Tachbelie, Solomon Teferra Abate, Tanja Schultz: Development of Multilingual ASR Using GlobalPhone for Less-Resourced Languages: The Case of Ethiopian Languages. 1032-1036
- Wenxin Hou, Yue Dong, Bairong Zhuang, Longfei Yang, Jiatong Shi, Takahiro Shinozaki: Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning. 1037-1041
- Xinyuan Zhou, Emre Yilmaz, Yanhua Long, Yijie Li, Haizhou Li: Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition. 1042-1046
- Solomon Teferra Abate, Martha Yifiru Tachbelie, Tanja Schultz: Multilingual Acoustic and Language Modeling for Ethio-Semitic Languages. 1047-1051
- Yushi Hu, Shane Settle, Karen Livescu: Multilingual Jointly Trained Acoustic and Written Word Embeddings. 1052-1056
- Chia-Yu Li, Ngoc Thang Vu: Improving Code-Switching Language Modeling with Artificially Generated Texts Using Cycle-Consistent Adversarial Networks. 1057-1061
- Xinhui Hu, Qi Zhang, Lei Yang, Binbin Gu, Xinkang Xu: Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods. 1062-1066
- Xinxing Li, Edward Lin: A 43 Language Multilingual Punctuation Prediction Neural Network Model. 1067-1071
- Jisung Wang, Jihwan Kim, Sangki Kim, Yeha Lee: Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition. 1072-1075
Anti-Spoofing and Liveness Detection
- Patrick von Platen, Fei Tao, Gökhan Tür: Multi-Task Siamese Neural Network for Improving Replay Attack Detection. 1076-1080
- Kosuke Akimoto, Seng Pei Liew, Sakiko Mishima, Ryo Mizushima, Kong Aik Lee: POCO: A Voice Spoofing and Liveness Detection Corpus Based on Pop Noise. 1081-1085
- Hongji Wang, Heinrich Dinkel, Shuai Wang, Yanmin Qian, Kai Yu: Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection. 1086-1090
- Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu: Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection. 1091-1095
- Abhijith Girish, Adharsh Sabu, Akshay Prasannan Latha, Rajeev Rajan: Competency Evaluation in Voice Mimicking Using Acoustic Cues. 1096-1100
- Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li: Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks. 1101-1105
- Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas W. D. Evans, Massimiliano Todisco: Spoofing Attack Detection Using the Non-Linear Fusion of Sub-Band Classifiers. 1106-1110
- Prasanth Parasu, Julien Epps, Kaavya Sriskandaraja, Gajan Suthokumar: Investigating Light-ResNet Architecture for Spoofing Detection Under Mismatched Conditions. 1111-1115
- Zhenchun Lei, Yingen Yang, Changhong Liu, Jihua Ye: Siamese Convolutional Neural Network Using Gaussian Probability Feature for Spoofing Speech Detection. 1116-1120
Noise Reduction and Intelligibility
- Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., Pascal Zobel, Andreas Maier: Lightweight Online Noise Reduction on Embedded Devices Using Hierarchical Recurrent Neural Networks. 1121-1125
- Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek: SEANet: A Multi-Modal Speech Enhancement Network. 1126-1130
- Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo, Hsin-Min Wang: Lite Audio-Visual Speech Enhancement. 1131-1135
- Christian Bergler, Manuel Schmitt, Andreas Maier, Simeon Smeele, Volker Barth, Elmar Nöth: ORCA-CLEAN: A Deep Denoising Toolkit for Killer Whale Communication. 1136-1140
- Hao Zhang, DeLiang Wang: A Deep Learning Approach to Active Noise Control. 1141-1145
- Tuan Dinh, Alexander Kain, Kris Tjaden: Improving Speech Intelligibility Through Speaker Dependent and Independent Spectral Style Conversion. 1146-1150
- Mathias Bach Pedersen, Morten Kolbæk, Asger Heidemann Andersen, Søren Holdt Jensen, Jesper Jensen: End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks. 1151-1155
- Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Toshio Irino: Predicting Intelligibility of Enhanced Speech Using Posteriors Derived from DNN-Based ASR System. 1156-1160
- Ali Abavisani, Mark Hasegawa-Johnson: Automatic Estimation of Intelligibility Measure for Consonants in Speech. 1161-1165
- Viet Anh Trinh, Michael I. Mandel: Large Scale Evaluation of Importance Maps in Automatic Speech Recognition. 1166-1170
Acoustic Scene Classification
- Jixiang Li, Chuming Liang, Bo Zhang, Zhao Wang, Fei Xiang, Xiangxiang Chu: Neural Architecture Search on Acoustic Scene Classification. 1171-1175
- Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu: Acoustic Scene Classification Using Audio Tagging. 1176-1180
- Liwen Zhang, Jiqing Han, Ziqiang Shi: ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification. 1181-1185
- Jivitesh Sharma, Ole-Christoffer Granmo, Morten Goodwin: Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network. 1186-1190
- Weimin Wang, Weiran Wang, Ming Sun, Chao Wang: Acoustic Scene Analysis with Multi-Head Attention Networks. 1191-1195
- Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee: Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification. 1196-1200
- Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee: An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances. 1201-1205
- Dhanunjaya Varma Devalraju, H. Muralikrishna, Padmanabhan Rajan, Dileep Aroor Dinesh: Attention-Driven Projections for Soundscape Classification. 1206-1210
- Panagiotis Tzirakis, Alexander Shiarella, Robert M. Ewers, Björn W. Schuller: Computer Audition for Continuous Rainforest Occupancy Monitoring: The Case of Bornean Gibbons' Call Detection. 1211-1215
- Zuzanna Kwiatkowska, Beniamin Kalinowski, Michal Kosmider, Krzysztof Rykaczewski: Deep Learning Based Open Set Acoustic Scene Classification. 1216-1220
Singing Voice Computing and Processing in Music
- Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman: Singing Synthesis: With a Little Help from my Attention. 1221-1225
- Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu: Peking Opera Synthesis via Duration Informed Attention Network. 1226-1230
- Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, Dong Yu: DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System. 1231-1235
- Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li: Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music. 1236-1240
- Haohe Liu, Lei Xie, Jian Wu, Geng Yang: Channel-Wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music. 1241-1245
Acoustic Model Adaptation for ASR
- Samik Sadhu, Hynek Hermansky: Continual Learning in Automatic Speech Recognition. 1246-1250
- Genshun Wan, Jia Pan, Qingran Wang, Jianqing Gao, Zhongfu Ye: Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism. 1251-1255
- Yan Huang, Jinyu Li, Lei He, Wenning Wei, William Gale, Yifan Gong: Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator. 1256-1260
- Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq R. Joty, Eng Siong Chng, Bin Ma: Speech Transformer with Speaker Aware Persistent Memory. 1261-1265
- Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du: Adaptive Speaker Normalization for CTC-Based Speech Recognition. 1266-1270
- Akhil Mathur, Nadia Berthouze, Nicholas D. Lane: Unsupervised Domain Adaptation Under Label Space Mismatch for Speech Classification. 1271-1275
- Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, Pascale Fung: Learning Fast Adaptation on Cross-Accented Speech Recognition. 1276-1280
- Kartik Khandelwal, Preethi Jyothi, Abhijeet Awasthi, Sunita Sarawagi: Black-Box Adaptation of ASR for Accented Speech. 1281-1285
- M. A. Tugtekin Turan, Emmanuel Vincent, Denis Jouvet: Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation. 1286-1290
- Ryu Takeda, Kazunori Komatani: Frame-Wise Online Unsupervised Adaptation of DNN-HMM Acoustic Model from Perspective of Robust Adaptive Filtering. 1291-1295
Singing and Multimodal Synthesis
- Jie Wu, Jian Luan: Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. 1296-1300
- JinHong Lu, Hiroshi Shimodaira: Prediction of Head Motion from Speech Waveforms with a Canonical-Correlation-Constrained Autoencoder. 1301-1305
- Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System. 1306-1310
- Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde: Stochastic Talking Face Generation Using Latent Distribution Matching. 1311-1315
- Da-Yi Wu, Yi-Hsuan Yang: Speech-to-Singing Conversion Based on Boundary Equilibrium GAN. 1316-1320
- Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, Koichiro Mori: Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image. 1321-1325
- Wentao Wang, Yan Wang, Jianqing Sun, Qingsong Liu, Jiaen Liang, Teng Li: Speech Driven Talking Head Generation via Attentional Landmarks Based Representation. 1326-1330
Intelligibility-Enhancing Speech Modification
- Marc René Schädler: Optimization and Evaluation of an Intelligibility-Improving Signal Processing Approach (IISPA) for the Hurricane Challenge 2.0 with FADE. 1331-1335
- Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi: iMetricGAN: Intelligibility Enhancement for Speech-in-Noise Using Generative Adversarial Network-Based Metric Learning. 1336-1340
- Jan Rennies, Henning F. Schepker, Cassia Valentini-Botinhao, Martin Cooke: Intelligibility-Enhancing Speech Modifications - The Hurricane Challenge 2.0. 1341-1345
- Olympia Simantiraki, Martin Cooke: Exploring Listeners' Speech Rate Preferences. 1346-1350
- Felicitas Bederna, Henning F. Schepker, Christian Rollwage, Simon Doclo, Arne Pusch, Jörg Bitzer, Jan Rennies: Adaptive Compressive Onset-Enhancement for Improved Speech Intelligibility in Noise and Reverberation. 1351-1355
- Carol Chermaz, Simon King: A Sound Engineering Approach to Near End Listening Enhancement. 1356-1360
- Dipjyoti Paul, P. V. Muhammed Shifas, Yannis Pantazis, Yannis Stylianou: Enhancing Speech Intelligibility in Text-To-Speech Synthesis Using Speaking Style Conversion. 1361-1365
Human Speech Production I
- Takayuki Arai: Two Different Mechanisms of Movable Mandible for Vocal-Tract Model with Flexible Tongue. 1366-1370
- Qiang Fang: Improving the Performance of Acoustic-to-Articulatory Inversion by Removing the Training Loss of Noncritical Portions of Articulatory Channels Dynamically. 1371-1375
- Aravind Illa, Prasanta Kumar Ghosh: Speaker Conditioned Acoustic-to-Articulatory Inversion Using x-Vectors. 1376-1380
- Zirui Liu, Yi Xu, Feng-fan Hsieh: Coarticulation as Synchronised Sequential Target Approximation: An EMA Study. 1381-1385
- Jônatas Santos, Jugurta Montalvão, Israel Santos: Improved Model for Vocal Folds with a Polyp with Potential Application. 1386-1390
- Lin Zhang, Kiyoshi Honda, Jianguo Wei, Seiji Adachi: Regional Resonance of the Lower Vocal Tract and its Contribution to Speaker Characteristics. 1391-1395
- Renuka Mannem, Navaneetha Gaddam, Prasanta Kumar Ghosh: Air-Tissue Boundary Segmentation in Real Time Magnetic Resonance Imaging Video Using 3-D Convolutional Neural Network. 1396-1400
- Tilak Purohit, Prasanta Kumar Ghosh: An Investigation of the Virtual Lip Trajectories During the Production of Bilabial Stops and Nasal at Different Speaking Rates. 1401-1405
Targeted Source Separation
- Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li: SpEx+: A Complete Time Domain Speaker Extraction Network. 1406-1410
- Tingle Li, Qingjian Lin, Yuanyuan Bao, Ming Li: Atss-Net: Target Speaker Separation via Attention-Based Neural Network. 1411-1415
- Leyuan Qu, Cornelius Weber, Stefan Wermter: Multimodal Target Speech Separation with Voice and Face References. 1416-1420
- Zining Zhang, Bingsheng He, Zhenjie Zhang: X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network. 1421-1425
- Chenda Li, Yanmin Qian: Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation. 1426-1430
- Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu: A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments. 1431-1435
- Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki: Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding. 1436-1440
- Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki: Listen to What You Want: Neural Network-Based Universal Sound Selector. 1441-1445
- Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada: Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. 1446-1450
- Jiahao Xu, Kun Hu, Chang Xu, Tran Duc Chung, Zhiyong Wang: Speaker-Aware Monaural Speech Separation. 1451-1455
Keynote 2
- Barbara G. Shinn-Cunningham: Brain networks enabling speech perception in everyday settings.
Speech Translation and Multilingual/Multimodal Learning
- Liming Wang, Mark Hasegawa-Johnson: A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions. 1456-1460
- Maha Elbayad, Laurent Besacier, Jakob Verbeek: Efficient Wait-k Models for Simultaneous Machine Translation. 1461-1465
- Ha Nguyen, Fethi Bougares, Natalia A. Tomashenko, Yannick Estève, Laurent Besacier: Investigating Self-Supervised Pre-Training for End-to-End Speech Translation. 1466-1470
- Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi: Contextualized Translation of Automatically Segmented Speech. 1471-1475
- Juan Miguel Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang: Self-Training for End-to-End Speech Translation. 1476-1480
- Marcello Federico, Yogesh Virkar, Robert Enyedi, Roberto Barra-Chicote: Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing. 1481-1485
- Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James R. Glass: Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets. 1486-1490
- Anne Wu, Changhan Wang, Juan Miguel Pino, Jiatao Gu: Self-Supervised Representations Improve End-to-End Speech Translation. 1491-1495
Speaker Recognition I
- Jee-weon Jung, Seung-bin Kim, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu: Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms. 1496-1500
- Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim: Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances. 1501-1505
- Bin Gu, Wu Guo, Fenglin Ding, Zhen-Hua Ling, Jun Du: An Adaptive X-Vector Model for Text-Independent Speaker Verification. 1506-1510
- Santi Prieto, Alfonso Ortega Giménez, Iván López-Espejo, Eduardo Lleida: Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions. 1511-1515
- Aaron Nicolson, Kuldip K. Paliwal: Sum-Product Networks for Robust Automatic Speaker Identification. 1516-1520
- Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu: Segment Aggregation for Short Utterances Speaker Verification Using Raw Waveforms. 1521-1525
- Shai Rozenberg, Hagai Aronowitz, Ron Hoory: Siamese X-Vector Reconstruction for Domain Adapted Speaker Recognition. 1526-1529
- Yanpei Shi, Qiang Huang, Thomas Hain: Speaker Re-Identification with Speaker Dependent Speech Enhancement. 1530-1534
- Galina Lavrentyeva, Marina Volkova, Anastasia Avdeeva, Sergey Novoselov, Artem Gorlanov, Tseren Andzhukaev, Artem Ivanov, Alexander Kozlov: Blind Speech Signal Quality Estimation for Speaker Verification Systems. 1535-1539
- Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng: Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification. 1540-1544
Spoken Language Understanding II
- Vaishali Pal, Fabien Guillot, Manish Shrivastava, Jean-Michel Renders, Laurent Besacier: Modeling ASR Ambiguity for Neural Dialogue State Tracking. 1545-1549
- Haoyu Wang, Shuyan Dong, Yue Liu, James Logan, Ashish Kumar Agrawal, Yang Liu: ASR Error Correction with Augmented Transformer for Entity Retrieval. 1550-1554
- Xueli Jia, Jianzong Wang, Zhiyong Zhang, Ning Cheng, Jing Xiao: Large-Scale Transfer Learning for Low-Resource Spoken Language Understanding. 1555-1559
- Judith Gaspers, Quynh Ngoc Thi Do, Fabian Triefenbach: Data Balancing for Boosting Performance of Low-Frequency Classes in Spoken Language Understanding. 1560-1564
- Yu Wang, Yilin Shen, Hongxia Jin: An Interactive Adversarial Reward Learning-Based Spoken Language Understanding System. 1565-1569
- Jin Cao, Jun Wang, Wael Hamza, Kelly Vanee, Shang-Wen Li: Style Attuned Pre-Training and Parameter Efficient Fine-Tuning for Spoken Language Understanding. 1570-1574
- Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura: Unsupervised Domain Adaptation for Dialogue Sequence Labeling Based on Hierarchical Adversarial Training. 1575-1579
- Leda Sari, Mark Hasegawa-Johnson: Deep F-Measure Maximization for End-to-End Speech Understanding. 1580-1584
- Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, Heuiseok Lim: An Effective Domain Adaptive Post-Training Method for BERT in Response Selection. 1585-1589
- Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin: Confidence Measure for Speech-to-Concept End-to-End Spoken Language Understanding. 1590-1594
Human Speech Processing
- Grant L. McGuire, Molly Babel: Attention to Indexical Information Improves Voice Recall. 1595-1599
- Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier: Categorization of Whistled Consonants by French Speakers. 1600-1604
- Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier: Whistled Vowel Identification by French Listeners. 1605-1609
- Maria del Mar Cordero, Fanny Meunier, Nicolas Grimault, Stéphane Pota, Elsa Spinelli: F0 Slope and Mean: Cues to Speech Segmentation in French. 1610-1614
- Amandine Michelas, Sophie Dufour: Does French Listeners' Ability to Use Accentual Information at the Word Level Depend on the Ear of Presentation? 1615-1619
- Wen Liu: A Perceptual Study of the Five Level Tones in Hmu (Xinzhai Variety). 1620-1623
- Zhen Zeng, Karen Mattock, Liquan Liu, Varghese Peter, Alba Tuninetti, Feng-Ming Tsao: Mandarin and English Adults' Cue-Weighting of Lexical Stress. 1624-1628
- Yan Feng, Gang Peng, William Shi-Yuan Wang: Age-Related Differences of Tone Perception in Mandarin-Speaking Seniors. 1629-1633
- Georgia Zellou, Michelle Cohn: Social and Functional Pressures in Vocal Alignment: Differences for Human and Voice-AI Interlocutors. 1634-1638
- Hassan Salami Kavaki, Michael I. Mandel: Identifying Important Time-Frequency Locations in Continuous Speech Utterances. 1639-1643
Feature Extraction and Distant ASR
- Erfan Loweimi, Peter Bell, Steve Renals: Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling. 1644-1648
- Purvi Agrawal, Sriram Ganapathy: Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations. 1649-1653
- Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals: A Deep 2D Convolutional Network for Waveform-Based Speech Recognition. 1654-1658
- Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll: Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions. 1659-1663
- Pegah Ghahramani, Hossein Hadian, Daniel Povey, Hynek Hermansky, Sanjeev Khudanpur: An Alternative to MFCCs for ASR. 1664-1667
- Anirban Dutta, Ashishkumar Prabhakar Gudmalwar, Ch. V. Rama Rao: Phase Based Spectro-Temporal Features for Building a Robust ASR System. 1668-1672
- Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals: Deep Scattering Power Spectrum Features for Robust Speech Recognition. 1673-1677
- Titouan Parcollet, Xinchi Qiu, Nicholas D. Lane: FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition. 1678-1682
- Kshitiz Kumar, Bo Ren, Yifan Gong, Jian Wu: Bandpass Noise Generation and Augmentation for Unified ASR. 1683-1687
- Anurenjan Purushothaman, Anirudh Sreeram, Rohit Kumar, Sriram Ganapathy: Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition. 1688-1692
Voice Privacy Challenge
- Natalia A. Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas W. D. Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco: Introducing the VoicePrivacy Initiative. 1693-1697
- Andreas Nautsch, Jose Patino, Natalia A. Tomashenko, Junichi Yamagishi, Paul-Gauthier Noé, Jean-François Bonastre, Massimiliano Todisco, Nicholas W. D. Evans: The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment. 1698-1702
- Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Masashi Unoki: X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System. 1703-1707
- Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent: A Comparative Study of Speech Anonymization Metrics. 1708-1712
- Brij Mohan Lal Srivastava, Natalia A. Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, Marc Tommasi: Design Choices for X-Vector Based Speaker Anonymization. 1713-1717
- Paul-Gauthier Noé, Jean-François Bonastre, Driss Matrouf, Natalia A. Tomashenko, Andreas Nautsch, Nicholas W. D. Evans: Speech Pseudonymisation Assessment Using Voice Similarity Matrices. 1718-1722
Speech Synthesis: Text Processing, Data and Evaluation
- Kyubyong Park, Seanie Lee:
g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset. 1723-1727 - Haiteng Zhang, Huashan Pan, Xiulin Li:
A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation. 1728-1732 - Michelle Cohn
, Georgia Zellou:
Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes. 1733-1737 - Jason Taylor, Korin Richmond
:
Enhancing Sequence-to-Sequence Text-to-Speech with Morphology. 1738-1742 - Yeunju Choi, Youngmoon Jung, Hoirin Kim:
Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling. 1743-1747 - Gabriel Mittag, Sebastian Möller:
Deep Learning Based Assessment of Synthetic Speech Naturalness. 1748-1752 - Jiawen Zhang
, Yuanyuan Zhao, Jiaqi Zhu, Jinba Xiao:
Distant Supervision for Polyphone Disambiguation in Mandarin Chinese. 1753-1757 - Pilar Oplustil Gallegos, Jennifer Williams, Joanna Rownicka, Simon King:
An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets. 1758-1762 - Anurag Das, Guanlong Zhao, John Levis, Evgeny Chukharev-Hudilainen
, Ricardo Gutierrez-Osuna:
Understanding the Effect of Voice Quality and Accent on Talker Similarity. 1763-1767
Search for Speech Recognition
- Wei Zhou
, Ralf Schlüter
, Hermann Ney:
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition Without Length Bias. 1768-1772 - Xi Chen, Songyang Zhang
, Dandan Song, Peng Ouyang, Shouyi Yin:
Transformer with Bidirectional Decoder for Speech Recognition. 1773-1777 - Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, Richard Socher:
An Investigation of Phone-Based Subword Units for End-to-End Speech Recognition. 1778-1782 - Jeremy Heng Meng Wong, Yashesh Gaur, Rui Zhao, Liang Lu, Eric Sun, Jinyu Li
, Yifan Gong:
Combination of End-to-End and Hybrid Models for Speech Recognition. 1783-1787 - Jihwan Kim, Jisung Wang, Sangki Kim, Yeha Lee:
Evolved Speech-Transformer: Applying Neural Architecture Search to End-to-End Automatic Speech Recognition. 1788-1792 - Abhinav Garg, Ashutosh Gupta, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim:
Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition. 1793-1797 - Eugen Beck, Ralf Schlüter
, Hermann Ney:
LVCSR with Transformer Language Models. 1798-1802 - Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, Hung-yi Lee:
DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation. 1803-1807
Computational Paralinguistics I
- Lukas Stappen, Georgios Rizos, Madina Hasan, Thomas Hain
, Björn W. Schuller:
Uncertainty-Aware Machine Support for Paper Reviewing on the Interspeech 2019 Submission Corpus. 1808-1812 - Michelle Cohn
, Melina Sarian, Kristin Predeck, Georgia Zellou:
Individual Variation in Language Attitudes Toward Voice-AI: The Role of Listeners' Autistic-Like Traits. 1813-1817 - Michelle Cohn
, Eran Raveh, Kristin Predeck, Iona Gessinger
, Bernd Möbius, Georgia Zellou:
Differences in Gradient Emotion Perception: Human vs. Alexa Voices. 1818-1822 - Luz Martinez-Lucas, Mohammed Abdelwahab, Carlos Busso
:
The MSP-Conversation Corpus. 1823-1827 - Fuxiang Tao, Anna Esposito
, Alessandro Vinciarelli:
Spotting the Traces of Depression in Read Speech: An Approach Based on Computational Paralinguistics and Social Signal Processing. 1828-1832 - Yelin Kim, Joshua Levy, Yang Liu:
Speech Sentiment and Customer Satisfaction Estimation in Socialbot Conversations. 1833-1837 - Haley Lepp
, Gina-Anne Levow:
Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments. 1838-1842 - Jana Neitsch
, Oliver Niebuhr
:
Are Germans Better Haters Than Danes? Language-Specific Implicit Prosodies of Types of Hate Speech and How They Relate to Perceived Severity and Societal Rules. 1843-1847 - Fuling Chen, Roberto Togneri
, Murray Maybery
, Diana Tan
:
An Objective Voice Gender Scoring System and Identification of the Salient Acoustic Measures. 1848-1852 - Sadari Jayawardena, Julien Epps, Zhaocheng Huang:
How Ordinal Are Your Data? 1853-1857
Acoustic Phonetics and Prosody
- Vincent Hughes
, Frantz Clermont, Philip Harrison:
Correlating Cepstra with Formant Frequencies: Implications for Phonetically-Informed Forensic Voice Comparison. 1858-1862 - Jana Neitsch
, Plínio A. Barbosa, Oliver Niebuhr
:
Prosody and Breathing: A Comparison Between Rhetorical and Information-Seeking Questions in German and Brazilian Portuguese. 1863-1867 - Rebecca Defina, Catalina Torres
, Hywel Stoakes:
Scaling Processes of Clause Chains in Pitjantjatjara. 1868-1872 - Ai Mizoguchi
, Ayako Hashimoto, Sanae Matsui, Setsuko Imatomi, Ryunosuke Kobayashi, Mafuyu Kitahara
:
Neutralization of Voicing Distinction of Stops in Tohoku Dialects of Japanese: Field Work and Acoustic Measurements. 1873-1877 - Lou Lee, Denis Jouvet, Katarina Bartkova, Yvon Keromnes, Mathilde Dargnat:
Correlation Between Prosody and Pragmatics: Case Study of Discourse Markers in French and English. 1878-1882 - Dina El Zarka, Anneliese Kelterer, Barbara Schuppler
:
An Analysis of Prosodic Prominence Cues to Information Structure in Egyptian Arabic. 1883-1887 - Benazir Mumtaz, Tina Bögel, Miriam Butt:
Lexical Stress in Urdu. 1888-1892 - Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan-Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi
:
Vocal Markers from Sustained Phonation in Huntington's Disease. 1893-1897 - Laure Dentel, Julien Meyer
:
How Rhythm and Timbre Encode Mooré Language in Bendré Drummed Speech. 1898-1902
Keynote 3
- Lin-Shan Lee:
Doing Something we Never could with Spoken Language Technologies - from early days to the era of deep learning.
Tonal Aspects of Acoustic Phonetics and Prosody
- Wendy Lalhminghlui
, Priyankoo Sarmah:
Interaction of Tone and Voicing in Mizo. 1903-1907 - Yaru Wu, Martine Adda-Decker, Lori Lamel:
Mandarin Lexical Tones: A Corpus-Based Study of Word Length, Syllable Position and Prosodic Position on Duration. 1908-1912 - Yingming Gao, Xinyu Zhang, Yi Xu, Jinsong Zhang, Peter Birkholz
:
An Investigation of the Target Approximation Model for Tone Modeling and Recognition in Continuous Mandarin Speech. 1913-1917 - Wei Lai, Aini Li
:
Integrating the Application and Realization of Mandarin 3rd Tone Sandhi in the Resolution of Sentence Ambiguity. 1918-1922 - Zhenrui Zhang, Fang Hu:
Neutral Tone in Changde Mandarin. 1923-1927 - Ping Cui, Jianjing Kuang:
Pitch Declination and Final Lowering in Northeastern Mandarin. 1928-1932 - Phil Rose:
Variation in Spectral Slope and Interharmonic Noise in Cantonese Tones. 1933-1937 - Ping Tang, Shanpeng Li
:
The Acoustic Realization of Mandarin Tones in Fast Speech. 1938-1941
Speech Classification
- Anastassia Loukina, Keelan Evanini, Matthew Mulholland, Ian Blood, Klaus Zechner:
Do Face Masks Introduce Bias in Speech Technologies? The Case of Automated Scoring of Speaking Proficiency. 1942-1946 - Mohamed Mhiri, Samuel Myer, Vikrant Singh Tomar:
A Low Latency ASR-Free End to End Spoken Language Understanding System. 1947-1951 - Joe Wang, Rajath Kumar, Mike Rodehorst, Brian Kulis, Shiv Naga Prasad Vitaladevuni:
An Audio-Based Wakeword-Independent Verification System. 1952-1956 - Tyler Vuong, Yangyang Xia, Richard M. Stern
:
Learnable Spectro-Temporal Receptive Fields for Robust Voice Type Discrimination. 1957-1961 - Shuo-Yiin Chang, Bo Li, David Rybach, Yanzhang He, Wei Li, Tara N. Sainath, Trevor Strohman:
Low Latency Speech Recognition Using End-to-End Prefetching. 1962-1966 - Jingsong Wang, Tom Ko, Zhen Xu, Xiawei Guo, Souxiang Liu, Wei-Wei Tu, Lei Xie:
AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification. 1967-1971 - Rajath Kumar, Mike Rodehorst, Joe Wang, Jiacheng Gu, Brian Kulis:
Building a Robust Word-Level Wakeword Verification Network. 1972-1976 - Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito:
A Transformer-Based Audio Captioning Model with Keyword Estimation. 1977-1981 - Tong Mo, Yakun Yu, Mohammad Salameh, Di Niu, Shangling Jui:
Neural Architecture Search for Keyword Spotting. 1982-1986 - Ximin Li, Xiaodong Wei, Xiaowei Qin:
Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution. 1987-1991
Speech Synthesis Paradigms and Methods I
- Xin Wang
, Junichi Yamagishi:
Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model. 1992-1996 - Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh, Yi-Hsuan Yang:
Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization. 1997-2001 - Toru Nakashika:
Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra. 2002-2006 - Seungwoo Choi, Seungju Han, Dongyoung Kim, Sungjoo Ha:
Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. 2007-2011 - Hyeong Rae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon, Nam Soo Kim:
Reformer-TTS: Neural Speech Synthesis with Reformer Network. 2012-2016 - Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo:
CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion. 2017-2021 - Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis
, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis:
High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency. 2022-2026 - Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu:
DurIAN: Duration Informed Attention Network for Speech Synthesis. 2027-2031 - Kentaro Mitsui, Tomoki Koriyama
, Hiroshi Saruwatari:
Multi-Speaker Text-to-Speech Synthesis Using Deep Gaussian Processes. 2032-2036 - Mano Ranjith Kumar M., Sudhanshu Srivastava, Anusha Prakash, Hema A. Murthy:
A Hybrid HMM-Waveglow Based Text-to-Speech Synthesizer Using Histogram Equalization for Low Resource Indian Languages. 2037-2041
The INTERSPEECH 2020 Computational Paralinguistics ChallengE (ComParE)
- Björn W. Schuller
, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia F. de C. Hamilton, Shahin Amiriparian
, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, Harald Baumeister
, Alexis Deighton MacIntyre
, Simone Hantke:
The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks. 2042-2046 - Tomoya Koike, Kun Qian
, Björn W. Schuller, Yoshiharu Yamamoto:
Learning Higher Representations from Pre-Trained Deep Models with Data Augmentation for the COMPARE 2020 Challenge Mask Task. 2047-2051 - Steffen Illium
, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien:
Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms. 2052-2056 - Philipp Klumpp, Tomás Arias-Vergara
, Juan Camilo Vásquez-Correa
, Paula Andrea Pérez-Toro
, Florian Hönig, Elmar Nöth, Juan Rafael Orozco-Arroyave
:
Surgical Mask Detection with Deep Recurrent Phonetic Models. 2057-2061 - Claude Montacié
, Marie-José Caraty:
Phonetic, Frame Clustering and Intelligibility Analyses for the INTERSPEECH 2020 ComParE Challenge. 2062-2066 - Mariana Julião, Alberto Abad
, Helena Moniz
:
Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition. 2067-2071 - Maxim Markitantov
, Denis Dresvyanskiy, Danila Mamontov, Heysem Kaya
, Wolfgang Minker, Alexey Karpov
:
Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges. 2072-2076 - John Mendonça
, Francisco Teixeira, Isabel Trancoso
, Alberto Abad
:
Analyzing Breath Signals for the Interspeech 2020 ComParE Challenge. 2077-2081 - Alexis Deighton MacIntyre
, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian
, Antonia F. de C. Hamilton, Björn W. Schuller:
Deep Attentive End-to-End Continuous Breath Sensing from Speech. 2082-2086 - Jeno Szep, Salim Hariri:
Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion. 2087-2091 - Ziqing Yang, Zifan An, Zehao Fan, Chengye Jing, Houwei Cao
:
Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge. 2092-2096 - Gizem Sogancioglu, Oxana Verkholyak
, Heysem Kaya
, Dmitrii Fedotov, Tobias Cadèe, Albert Ali Salah, Alexey Karpov
:
Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition. 2097-2101 - Nicolae-Catalin Ristea, Radu Tudor Ionescu:
Are you Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. 2102-2106
Streaming ASR
- Kshitiz Kumar, Chaojun Liu, Yifan Gong, Jian Wu:
1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM. 2107-2111 - Chengyi Wang, Yu Wu, Liang Lu, Shujie Liu, Jinyu Li
, Guoli Ye, Ming Zhou:
Low Latency End-to-End Streaming Speech Recognition with a Scout Network. 2112-2116 - Gakuto Kurata, George Saon
:
Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition. 2117-2121 - Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He:
Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition. 2122-2126 - Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Joan Albert Silvestre-Cerdà
, Javier Iranzo-Sánchez, Albert Sanchís, Jorge Civera, Alfons Juan:
Improved Hybrid Streaming ASR with Transformer Language Models. 2127-2131 - Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang:
Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory. 2132-2136 - Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara
:
Enhancing Monotonic Multihead Attention for Streaming ASR. 2137-2141 - Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie:
Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition. 2142-2146 - Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stüker, Alex Waibel:
High Performance Sequence-to-Sequence Model for Streaming Speech Recognition. 2147-2151 - Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li
:
Transfer Learning Approaches for Streaming End-to-End Speech Recognition System. 2152-2156
Alzheimer's Dementia Recognition Through Spontaneous Speech
- Matej Martinc, Senja Pollak:
Tackling the ADReSS Challenge: A Multimodal Approach to the Automated Recognition of Alzheimer's Dementia. 2157-2161 - Jiahong Yuan, Yuchen Bian, Xingyu Cai, Jiaji Huang, Zheng Ye, Kenneth Church
:
Disfluencies and Fine-Tuning Pre-Trained Language Models for Detection of Alzheimer's Disease. 2162-2166 - Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova:
To BERT or not to BERT: Comparing Speech and Language-Based Approaches for Alzheimer's Disease Detection. 2167-2171 - Saturnino Luz, Fasih Haider
, Sofia de la Fuente
, Davida Fromm, Brian MacWhinney:
Alzheimer's Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge. 2172-2176 - Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, Najim Dehak
:
Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer's Disease and Assess its Severity. 2177-2181 - Nicholas Cummins
, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen
, Daniel Blackburn
, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä
:
A Comparison of Acoustic and Linguistics Methodologies for Alzheimer's Dementia Recognition. 2182-2186 - Morteza Rohanian, Julian Hough, Matthew Purver
:
Multi-Modal Fusion with Gating Using Audio, Lexical and Disfluency Features for Alzheimer's Dementia Recognition from Spontaneous Speech. 2187-2191 - Thomas Searle
, Zina M. Ibrahim
, Richard J. B. Dobson
:
Comparing Natural Language Processing Techniques for Alzheimer's Dementia Prediction in Spontaneous Speech. 2192-2196 - Erik Edwards, Charles Dognin, Bajibabu Bollepalli, Maneesh Kumar Singh:
Multiscale System for Alzheimer's Dementia Recognition Through Spontaneous Speech. 2197-2201 - Anna Pompili
, Thomas Rolland, Alberto Abad
:
The INESC-ID Multi-Modal System for the ADReSS 2020 Challenge. 2202-2206 - Shahla Farzana, Natalie Parde
:
Exploring MMSE Score Prediction Using Verbal and Non-Verbal Cues. 2207-2211 - Utkarsh Sarawgi, Wazeer Zulfikar, Nouran Soliman, Pattie Maes:
Multimodal Inductive Transfer Learning for Detection of Alzheimer's Dementia and its Severity. 2212-2216 - Junghyun Koo
, Jie Hwan Lee, Jaewoo Pyo, Yujin Jo, Kyogu Lee:
Exploiting Multi-Modal Features from Pre-Trained Networks for Alzheimer's Dementia Recognition. 2217-2221 - Muhammad Shehram Shah Syed, Zafi Sherhan Syed
, Margaret Lech, Elena Pirogova
:
Automated Screening for Alzheimer's Dementia Through Spontaneous Speech. 2222-2226
Speaker Recognition Challenges and Applications
- Kong Aik Lee
, Koji Okabe, Hitoshi Yamamoto, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Keisuke Ishikawa, Koichi Shinoda:
NEC-TT Speaker Verification System for SRE'19 CTS Challenge. 2227-2231 - Ruyun Li, Tianyu Liang
, Dandan Song, Yi Liu, Yangcheng Wu, Can Xu, Peng Ouyang, Xianwei Zhang, Xianhong Chen, Weiqiang Zhang, Shouyi Yin, Liang He
:
THUEE System for NIST SRE19 CTS Challenge. 2232-2236 - Grigory Antipov, Nicolas Gengembre, Olivier Le Blouch, Gaël Le Lan:
Automatic Quality Assessment for Audio-Visual Verification Systems. The LOVe Submission to NIST SRE Challenge 2019. 2237-2241 - Ruijie Tao, Rohan Kumar Das, Haizhou Li:
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network. 2242-2246 - Suwon Shon, James R. Glass:
Multimodal Association for Speaker Verification. 2247-2251 - Zhengyang Chen, Shuai Wang, Yanmin Qian:
Multi-Modality Matters: A Performance Leap on VoxCeleb. 2252-2256 - Zhenyu Wang, Wei Xia, John H. L. Hansen:
Cross-Domain Adaptation with Discrepancy Minimization for Text-Independent Forensic Speaker Verification. 2257-2261 - Mufan Sang
, Wei Xia, John H. L. Hansen:
Open-Set Short Utterance Forensic Speaker Verification Using Teacher-Student Network with Explicit Inductive Bias. 2262-2266 - Anurag Chowdhury
, Austin Cozzo, Arun Ross:
JukeBox: A Multilingual Singer Recognition Dataset. 2267-2271 - Ruirui Li, Jyun-Yu Jiang, Xian Wu, Chu-Cheng Hsieh, Andreas Stolcke:
Speaker Identification for Household Scenarios with Self-Attention and Adversarial Training. 2272-2276
Applications of ASR
- Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirkó Visontai, Stella Laurenzo:
Streaming Keyword Spotting on Mobile Devices. 2277-2281 - Hongyi Liu, Apurva Abhyankar, Yuriy Mishchenko, Thibaud Sénéchal, Gengshen Fu, Brian Kulis, Noah D. Stein, Anish Shah, Shiv Naga Prasad Vitaladevuni:
Metadata-Aware End-to-End Keyword Spotting. 2282-2286 - Yehao Kong, Jiliang Zhang:
Adversarial Audio: A New Information Hiding Method. 2287-2291 - Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic
, Odette Scharenborg
:
S2IGAN: Speech-to-Image Generation via Adversarial Learning. 2292-2296 - Juan Zuluaga-Gomez, Petr Motlícek, Qingran Zhan, Karel Veselý, Rudolf A. Braun:
Automatic Speech Recognition Benchmark for Air-Traffic Communications. 2297-2301 - Prithvi R. R. Gudepu, Gowtham P. Vadisetti, Abhishek Niranjan, Kinnera Saranu, Raghava Sarma, M. Ali Basha Shaik, Periyasamy Paramasivam:
Whisper Augmented End-to-End/Hybrid Speech Recognition System - CycleGAN Approach. 2302-2306 - Ramit Sawhney, Arshiya Aggarwal, Piyush Khanna, Puneet Mathur, Taru Jain, Rajiv Ratn Shah
:
Risk Forecasting from Earnings Calls Acoustics and Network Correlations. 2307-2311 - Huili Chen, Bita Darvish Rouhani, Farinaz Koushanfar
:
SpecMark: A Spectral Watermarking Framework for IP Protection of Speech Recognition Systems. 2312-2316 - Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg
:
Evaluating Automatically Generated Phoneme Captions for Images. 2317-2321
Speech Emotion Recognition II
- Wei-Cheng Lin, Carlos Busso
:
An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks. 2322-2326 - Siddique Latif
, Rajib Rana, Sara Khalifa, Raja Jurdak
, Björn W. Schuller:
Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-Corpus Setting for Speech Emotion Recognition. 2327-2331 - Takuya Fujioka, Takeshi Homma
, Kenji Nagamatsu:
Meta-Learning for Speech Emotion Recognition Considering Ambiguity of Emotion Labels. 2332-2336 - Jiaxing Liu, Zhilei Liu
, Longbiao Wang, Yuan Gao
, Lili Guo, Jianwu Dang:
Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation. 2337-2341 - Zhi Zhu, Yoshinao Sato:
Reconciliation of Multiple Corpora for Speech Emotion Recognition by Multiple Classifiers with an Adversarial Corpus Discriminator. 2342-2346 - Zheng Lian
, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li:
Conversational Emotion Recognition Using Self-Attention Mechanisms and Graph Neural Networks. 2347-2351 - Shuiyang Mao, P. C. Ching, Tan Lee
:
EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification. 2352-2356 - Shuiyang Mao, P. C. Ching, C.-C. Jay Kuo
, Tan Lee
:
Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition. 2357-2361
Bi- and Multilinguality
- Rubén Pérez Ramón, María Luisa García Lecumberri, Martin Cooke:
The Effect of Language Proficiency on the Perception of Segmental Foreign Accent. 2362-2366 - Yi Liu
, Jinghong Ning:
The Effect of Language Dominance on the Selective Attention of Segments and Tones in Urdu-Cantonese Speakers. 2367-2371 - Mengrou Li, Ying Chen, Jie Cui:
The Effect of Input on the Production of English Tense and Lax Vowels by Chinese Learners: Evidence from an Elementary School in China. 2372-2376 - Laura Spinu
, Jiwon Hwang, Nadya Pincus, Mariana Vasilita:
Exploring the Use of an Artificial Accent of English to Assess Phonetic Learning in Monolingual and Bilingual Speakers. 2377-2381 - Shammur A. Chowdhury, Younes Samih
, Mohamed Eldesouki, Ahmed Ali:
Effects of Dialectal Code-Switching on Speech Modules: A Study Using Egyptian Arabic Broadcast Speech. 2382-2386 - Khia A. Johnson
, Molly Babel, Robert A. Fuhrman:
Bilingual Acoustic Voice Variation is Similarly Structured Across Languages. 2387-2391 - Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng:
Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition. 2392-2396 - Dan Du, Xianjin Zhu
, Zhu Li, Jinsong Zhang:
Perception and Production of Mandarin Initial Stops by Native Urdu Speakers. 2397-2401 - Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman:
Now You're Speaking My Language: Visual Language Identification. 2402-2406 - Nari Rhee, Jianjing Kuang:
The Different Enhancement Roles of Covarying Cues in Thai and Mandarin Tones. 2407-2411
Single-Channel Speech Enhancement I
- Hao Shi, Longbiao Wang, Sheng Li
, Chenchen Ding, Meng Ge, Nan Li, Jianwu Dang, Hiroshi Seki:
Singing Voice Extraction with Attention-Based Spectrograms Fusion. 2412-2416 - Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung, Yu Tsao:
Incorporating Broad Phonetic Information for Speech Enhancement. 2417-2421 - Andong Li, Chengshi Zheng, Cunhang Fan, Renhua Peng, Xiaodong Li:
A Recursive Network with Dynamic Attention for Monaural Speech Enhancement. 2422-2426 - Hongjiang Yu, Wei-Ping Zhu
, Yuhong Yang
:
Constrained Ratio Mask for Speech Enhancement Using DNN. 2427-2431 - Chi-Chang Lee, Yu-Chen Lin, Hsuan-Tien Lin, Hsin-Min Wang
, Yu Tsao:
SERIL: Noise Adaptive Speech Enhancement Using Regularization-Based Incremental Learning. 2432-2436 - Yoshiaki Bando, Kouhei Sekiguchi, Kazuyoshi Yoshii
:
Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder. 2437-2441 - Ahmet Emin Bulut, Kazuhito Koishida:
Low-Latency Single Channel Speech Dereverberation Using U-Net Convolutional Neural Networks. 2442-2446 - Dung N. Tran, Kazuhito Koishida:
Single-Channel Speech Enhancement by Subspace Affinity Minimization. 2447-2451 - Haoyu Li, Junichi Yamagishi:
Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement. 2452-2456 - Feng Deng, Tao Jiang, Xiaorui Wang, Chen Zhang, Yan Li:
NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement. 2457-2461
Deep Noise Suppression Challenge
- Xiaofei Li, Radu Horaud:
Online Monaural Speech Enhancement Using Delayed Subband LSTM. 2462-2466 - Maximilian Strake, Bruno Defraene, Kristoff Fluyt, Wouter Tirry, Tim Fingscheidt
:
INTERSPEECH 2020 Deep Noise Suppression Challenge: A Fully Convolutional Recurrent Network (FCRN) for Joint Dereverberation and Denoising. 2467-2471 - Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie:
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. 2472-2476 - Nils L. Westhausen, Bernd T. Meyer:
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression. 2477-2481 - Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, Arvindh Krishnaswamy:
A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech. 2482-2486 - Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy:
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss. 2487-2491 - Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke:
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results. 2492-2496
Voice and Hearing Disorders
- Sara Akbarzadeh, Sungmin Lee, Chin-Tuan Tan
:
The Implication of Sound Level on Spatial Selective Auditory Attention for Cochlear Implant Users: Behavioral and Electrophysiological Measurement. 2497-2501 - Yangyang Wan, Huali Zhou, Qinglin Meng, Nengheng Zheng:
Enhancing the Interaural Time Difference of Bilateral Cochlear Implants with the Temporal Limits Encoder. 2502-2506 - Toshio Irino, Soichi Higashiyama, Hanako Yoshigi:
Speech Clarity Improvement by Vocal Self-Training Using a Hearing Impairment Simulator and its Correlation with an Auditory Modulation Index. 2507-2511 - Zhuohuang Zhang, Donald S. Williamson
, Yi Shen
:
Investigation of Phase Distortion on Perceived Speech Quality for Hearing-Impaired Listeners. 2512-2516 - Zhuo Zhang, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Di Zhou
, Longbiao Wang:
EEG-Based Short-Time Auditory Attention Detection Using Multi-Task Deep Learning. 2517-2521 - Sondes Abderrazek
, Corinne Fredouille, Alain Ghio
, Muriel Lalain, Christine Meunier
, Virginie Woisard:
Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders - Step 1: CNN Model-Based Phone Classification. 2522-2526 - Bahman Mirheidari, Daniel Blackburn
, Ronan O'Malley, Annalena Venneri, Traci Walker
, Markus Reuber, Heidi Christensen
:
Improving Cognitive Impairment Classification by Generative Neural Network-Based Feature Augmentation. 2527-2531 - Meredith Moore
, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan:
UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech. 2532-2536 - Purva Barche, Krishna Gurugubelli
, Anil Kumar Vuppala:
Towards Automatic Assessment of Voice Disorders: A Clinical Approach. 2537-2541 - Abhishek Shivkumar, Jack Weston, Raphael Lenain, Emil Fristed:
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages. 2542-2546
Spoken Term Detection
- Menglong Xu, Xiao-Lei Zhang:
Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-Footprint Keyword Spotting. 2547-2551 - Théodore Bluche, Thibault Gisselbrecht:
Predicting Detection Filters for Small Footprint Open-Vocabulary Keyword Spotting. 2552-2556 - Emre Yilmaz, Özgür Bora Gevrek, Jibin Wu, Yuxiang Chen, Xuanbo Meng, Haizhou Li:
Deep Convolutional Spiking Neural Networks for Keyword Spotting. 2557-2561 - Haiwei Wu, Yan Jia, Yuanfei Nie, Ming Li:
Domain Aware Training for Far-Field Small-Footprint Keyword Spotting. 2562-2566 - Kun Zhang, Zhiyong Wu, Daode Yuan, Jian Luan, Jia Jia, Helen Meng, Binheng Song:
Re-Weighted Interval Loss for Handling Data Imbalance Problem of End-to-End Keyword Spotting. 2567-2571 - Peng Zhang, Xueliang Zhang:
Deep Template Matching for Small-Footprint and Configurable Keyword Spotting. 2572-2576 - Chen Yang, Xue Wen, Liming Song:
Multi-Scale Convolution for Robust Keyword Spotting. 2577-2581 - Yangbin Chen
, Tom Ko, Lifeng Shang, Xiao Chen, Xin Jiang, Qing Li
:
An Investigation of Few-Shot Learning in Spoken Term Classification. 2582-2586 - Zeyu Zhao
, Weiqiang Zhang:
End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages. 2587-2591 - Takuya Higuchi, Mohammad Ghasemzadeh, Kisun You, Chandra Dhir:
Stacked 1D Convolutional Networks for End-to-End Small Footprint Voice Trigger Detection. 2592-2596
The Fearless Steps Challenge Phase-02
- Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
:
Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments. 2597-2601 - Xueshuai Zhang, Wenchao Wang, Pengyuan Zhang:
Speaker Diarization System Based on DPCA Algorithm for Fearless Steps Challenge Phase-2. 2602-2606 - Qingjian Lin, Tingle Li, Ming Li:
The DKU Speech Activity Detection and Speaker Identification Systems for Fearless Steps Challenge Phase-02. 2607-2611 - Arseniy Gorin, Daniil Kulko, Steven Grima, Alex Glasman:
"This is Houston. Say again, please". The Behavox System for the Apollo-11 Fearless Steps Challenge (Phase II). 2612-2616 - Aditya Joglekar, John H. L. Hansen, Meena Chandra Shekhar, Abhijeet Sangwan:
FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data. 2617-2621
Monaural Source Separation
- Yi Luo, Nima Mesgarani:
Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss. 2622-2626 - Jingjing Chen, Qirong Mao, Dong Liu:
On Synthesis for Supervised Monaural Speech Separation in Time Domain. 2627-2631 - Jun Wang:
Learning Better Speech Representations by Worsening Interference. 2632-2636 - Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas
, David Ditter, Ariel Frank
, Antoine Deleforge, Emmanuel Vincent:
Asteroid: The PyTorch-Based Audio Source Separation Toolkit for Researchers. 2637-2641 - Jingjing Chen, Qirong Mao, Dong Liu:
Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation. 2642-2646 - Chengyun Deng, Yi Zhang, Shiqian Ma, Yongtao Sha, Hui Song, Xiangang Li:
Conv-TasSAN: Separative Adversarial Network Based on Conv-TasNet. 2647-2651 - Keisuke Kinoshita
, Thilo von Neumann, Marc Delcroix
, Tomohiro Nakatani, Reinhold Haeb-Umbach
:
Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. 2652-2656 - Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Rushil Anirudh, Andreas Spanias:
Unsupervised Audio Source Separation Using Generative Priors. 2657-2661
Single-Channel Speech Enhancement II
- Yuanhang Qiu, Ruili Wang:
Adversarial Latent Representation Learning for Speech Enhancement. 2662-2666 - Yang Xiang, Liming Shi
, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen:
An NMF-HMM Speech Enhancement Method Based on Kullback-Leibler Divergence. 2667-2671 - Lu Zhang, Mingjiang Wang:
Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement. 2672-2676 - Quan Wang, Ignacio López-Moreno, Mert Saglam, Kevin W. Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, Alexander Gruenstein:
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition. 2677-2681 - Ziqiang Shi, Rujie Liu, Jiqing Han:
Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss. 2682-2686 - Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li:
Sub-Band Knowledge Distillation Framework for Speech Enhancement. 2687-2691 - Sujan Kumar Roy
, Aaron Nicolson
, Kuldip K. Paliwal
:
A Deep Learning-Based Kalman Filter for Speech Enhancement. 2692-2696 - Hongjiang Yu, Wei-Ping Zhu
, Benoît Champagne:
Subband Kalman Filtering with DNN Estimated Parameters for Speech Enhancement. 2697-2701 - Xiaoqi Li, Yaxing Li, Yuanjie Dong, Shan Xu, Zhihui Zhang, Dan Wang, Shengwu Xiong:
Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement. 2702-2706 - Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe
, Bo Xu:
Speaker-Conditional Chain Model for Speech Separation and Extraction. 2707-2711
Topics in ASR II
- Leanne Nortje, Herman Kamper
:
Unsupervised vs. Transfer Learning for Multimodal One-Shot Matching of Speech and Images. 2712-2716 - Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung:
Multimodal Speech Emotion Recognition Using Cross Attention with Aligned Audio and Text. 2717-2721 - Tamás Gábor Csapó:
Speaker Dependent Articulatory-to-Acoustic Mapping Using Real-Time MRI of the Vocal Tract. 2722-2726 - Tamás Gábor Csapó, Csaba Zainkó, László Tóth, Gábor Gosztolya, Alexandra Markó:
Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis. 2727-2731 - Siyuan Feng
, Odette Scharenborg
:
Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling. 2732-2736 - Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
:
Generative Adversarial Training Data Adaptation for Very Low-Resource Automatic Speech Recognition. 2737-2741 - Kazuki Tsunematsu, Johanes Effendi, Sakriani Sakti, Satoshi Nakamura:
Neural Speech Completion. 2742-2746 - Benjamin Milde, Chris Biemann:
Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization. 2747-2751 - Katerina Papadimitriou, Gerasimos Potamianos:
Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning. 2752-2756 - Vineel Pratap, Qiantong Xu, Anuroop Sriram
, Gabriel Synnaeve, Ronan Collobert:
MLS: A Large-Scale Multilingual Dataset for Speech Research. 2757-2761
Neural Signals for Spoken Communication
- Ivan Halim Parmonangan, Hiroki Tanaka
, Sakriani Sakti, Satoshi Nakamura:
Combining Audio and Brain Activity for Predicting Speech Quality. 2762-2766 - Rini A. Sharon, Hema A. Murthy:
The "Sound of Silence" in EEG - Cognitive Voice Activity Detection. 2767-2771 - Siqi Cai, Enze Su, Yonghao Song, Longhan Xie, Haizhou Li:
Low Latency Auditory Attention Detection with Common Spatial Pattern Analysis of EEG Signals. 2772-2776 - Miguel Angrick
, Christian Herff, Garett D. Johnson, Jerry J. Shih, Dean J. Krusienski, Tanja Schultz
:
Speech Spectrogram Estimation from Intracranial Brain Activity Using a Quantization Approach. 2777-2781 - Debadatta Dash
, Paul Ferrari
, Angel W. Hernandez-Mulero, Daragh Heitzman, Sara G. Austin, Jun Wang:
Neural Speech Decoding for Amyotrophic Lateral Sclerosis. 2782-2786
Training Strategies for ASR
- Yang Chen, Weiran Wang, Chao Wang:
Semi-Supervised ASR by End-to-End Self-Training. 2787-2791 - Hitesh Tulsiani, Ashtosh Sapru, Harish Arsikere, Surabhi Punjabi, Sri Garimella:
Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants. 2792-2796 - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka:
Serialized Output Training for End-to-End Overlapped Speech Recognition. 2797-2801 - Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, Puming Zhan:
Semi-Supervised Learning with Data Augmentation for End-to-End ASR. 2802-2806 - Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas:
Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition. 2807-2811 - Albert Zeyer, André Merboldt, Ralf Schlüter
, Hermann Ney:
A New Training Pipeline for an Improved Neural Transducer. 2812-2816 - Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le:
Improved Noisy Student Training for Automatic Speech Recognition. 2817-2821 - Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi:
Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition. 2822-2826 - Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Hejung Yang, Abhinav Garg, Sachin Singh, Jiyeon Kim, Mehul Kumar, Sichen Jin, Shatrughan Singh, Chanwoo Kim:
Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition. 2827-2831 - Gary Wang, Andrew Rosenberg, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno:
SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR. 2832-2836
Speech Transmission & Coding
- Sneha Das
, Tom Bäckström
, Guillaume Fuchs
:
Fundamental Frequency Model for Postfiltering at Low Bitrates in a Transform-Domain Speech and Audio Codec. 2837-2841 - Arthur Van Den Broucke, Deepak Baby, Sarah Verhulst:
Hearing-Impaired Bio-Inspired Cochlear Models for Real-Time Auditory Applications. 2842-2846 - Jan Skoglund
, Jean-Marc Valin:
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis. 2847-2851 - Pranay Manocha
, Adam Finkelstein, Richard Zhang, Nicholas J. Bryan, Gautham J. Mysore, Zeyu Jin:
A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences. 2852-2856 - Piotr Masztalski
, Mateusz Matuszewski, Karol Piaskowski, Michal Romaniuk:
StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation. 2857-2861 - Babak Naderi, Ross Cutler:
An Open Source Implementation of ITU-T Recommendation P.808 with Validation. 2862-2866 - Gabriel Mittag, Ross Cutler, Yasaman Hosseinkashi, Michael Revow, Sriram Srinivasan, Naglakshmi Chande, Robert Aichner:
DNN No-Reference PSTN Speech Quality Prediction. 2867-2871 - Sebastian Möller
, Tobias Hübschen, Thilo Michael, Gabriel Mittag, Gerhard Schmidt:
Non-Intrusive Diagnostic Monitoring of Fullband Speech Quality. 2872-2876
Bioacoustics and Articulation
- Abdolreza Sabzi Shahrebabaki
, Negar Olfati, Sabato Marco Siniscalchi, Giampiero Salvi
, Torbjørn Svendsen
:
Transfer Learning of Articulatory Information Through Phone Information. 2877-2881 - Abdolreza Sabzi Shahrebabaki
, Sabato Marco Siniscalchi, Giampiero Salvi
, Torbjørn Svendsen
:
Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals. 2882-2886 - Bernardo B. Gatto
, Eulanda Miranda dos Santos, Juan Gabriel Colonna, Naoya Sogi, Lincon Sales de Souza, Kazuhiro Fukui:
Discriminative Singular Spectrum Analysis for Bioacoustic Classification. 2887-2891 - Renuka Mannem, Hima Jyothi R., Aravind Illa, Prasanta Kumar Ghosh:
Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data. 2892-2896 - Abner Hernandez, Eun Jung Yeo, Sunhee Kim, Minhwa Chung:
Dysarthria Detection and Severity Assessment Using Rhythm-Based Metrics. 2897-2901 - Yi Ma, Xinzi Xu, Yongfu Li:
LungRN+NL: An Improved Adventitious Lung Sound Classification Using Non-Local Block ResNet Neural Network with Mixup Data Augmentation. 2902-2906 - Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Attention and Encoder-Decoder Based Models for Transforming Articulatory Movements at Different Speaking Rates. 2907-2911 - Zijiang Yang, Shuo Liu, Meishu Song, Emilia Parada-Cabaleiro, Björn W. Schuller:
Adventitious Respiratory Classification Using Attentive Residual Neural Networks. 2912-2916 - Raphael Lenain, Jack Weston, Abhishek Shivkumar, Emil Fristed:
Surfboard: Audio Feature Extraction for Modern Machine Learning. 2917-2921 - Abinay Reddy Naini, Malla Satyapriya, Prasanta Kumar Ghosh:
Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task. 2922-2926
Speech Synthesis: Multilingual and Cross-Lingual Approaches
- Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma:
Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion. 2927-2931 - Zhaoyu Liu, Brian Mak
:
Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment. 2932-2936 - Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang:
Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis. 2937-2941 - Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S. Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao:
Phonological Features for 0-Shot Multilingual Speech Synthesis. 2942-2946 - Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama
, Hiroshi Saruwatari:
Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space. 2947-2951 - Ruolan Liu, Xue Wen, Chunhui Lu, Xiao Chen:
Tone Learning in Low-Resource Bilingual TTS. 2952-2956 - Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupesh Kumar Mehta:
On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model. 2957-2961 - Anusha Prakash, Hema A. Murthy:
Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework. 2962-2966 - Marcel de Korte, Jaebok Kim, Esther Klabbers:
Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling. 2967-2971 - Tomás Nekvinda, Ondrej Dusek:
One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech. 2972-2976
Learning Techniques for Speaker Recognition I
- Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee-Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han:
In Defence of Metric Learning for Speaker Recognition. 2977-2981 - Seong Min Kye, Youngmoon Jung, Haebeom Lee, Sung Ju Hwang, Hoirin Kim:
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs. 2982-2986 - Kai Li, Masato Akagi, Yibo Wu, Jianwu Dang:
Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification. 2987-2991 - Yanpei Shi, Qiang Huang, Thomas Hain
:
Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification. 2992-2996 - Ana Montalvo
, José R. Calvo, Jean-François Bonastre:
Multi-Task Learning for Voice Related Recognition Tasks. 2997-3001 - Umair Khan, Javier Hernando:
Unsupervised Training of Siamese Networks for Speaker Verification. 3002-3006 - Ying Liu, Yan Song, Yiheng Jiang, Ian McLoughlin
, Lin Liu, Li-Rong Dai:
An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions. 3007-3011