


23rd Interspeech 2022: Incheon, Korea
- Hanseok Ko, John H. L. Hansen:
23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. ISCA 2022
Speech Synthesis: Toward end-to-end synthesis
- Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo:
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. 1-5 - Hanbin Bae, Young-Sun Joo:
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch. 6-10 - Martin Lenglet, Olivier Perrotin, Gérard Bailly:
Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings. 11-15 - Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe:
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner. 16-20 - Dan Lim, Sunghee Jung, Eesung Kim:
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech. 21-25
Technology for Disordered Speech
- Rosanna Turrisi, Leonardo Badino:
Interpretable dysarthric speaker adaptation based on optimal-transport. 26-30 - Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic:
Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs. 31-35 - Luke Prananta, Bence Mark Halpern, Siyuan Feng, Odette Scharenborg:
The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition. 36-40 - Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda:
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition. 41-45 - Chitralekha Bhat, Ashish Panda, Helmer Strik:
Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation. 46-50 - Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier, Seung Hee Yang:
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition. 51-55
Neural Network Training Methods for ASR I
- Mun-Hak Lee, Joon-Hyuk Chang, Sang-Eon Lee, Ju-Seok Seong, Chanhee Park, Haeyoung Kwon:
Regularizing Transformer-based Acoustic Models by Penalizing Attention Weights. 56-60 - David M. Chan, Shalini Ghosh:
Content-Context Factorized Representations for Automated Speech Recognition. 61-65 - Georgios Karakasidis, Tamás Grósz, Mikko Kurimo:
Comparison and Analysis of New Curriculum Criteria for End-to-End ASR. 66-70 - Deepak Baby, Pasquale D'Alterio, Valentin Mendelev:
Incremental learning for RNN-Transducer based speech recognition models. 71-75 - Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio López-Moreno, Rajiv Mathews, Françoise Beaufays:
Production federated keyword spotting via distillation, filtering, and joint federated-centralized training. 76-80
Acoustic Phonetics and Prosody
- Jieun Song, Hae-Sung Jeon, Jieun Kiaer:
Use of prosodic and lexical cues for disambiguating wh-words in Korean. 81-85 - Vinicius Ribeiro, Yves Laprie:
Autoencoder-Based Tongue Shape Estimation During Continuous Speech. 86-90 - Giuseppe Magistro, Claudia Crocco:
Phonetic erosion and information structure in function words: the case of mia. 91-95 - Miran Oh, Yoon-Jeong Lee:
Dynamic Vertical Larynx Actions Under Prosodic Focus. 96-100 - Leah Bradshaw, Eleanor Chodroff, Lena A. Jäger, Volker Dellwo:
Fundamental Frequency Variability over Time in Telephone Interactions. 101-105
Spoken Machine Translation
- Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà:
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. 106-110 - Jinming Zhao, Hao Yang, Gholamreza Haffari, Ehsan Shareghi:
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. 111-115 - Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim:
Cross-Modal Decision Regularization for Simultaneous Speech Translation. 116-120 - Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura:
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation. 121-125 - Kirandevraj R, Vinod Kumar Kurmi, Vinay P. Namboodiri, C. V. Jawahar:
Generalized Keyword Spotting using ASR embeddings. 126-130
(Multimodal) Speech Emotion Recognition I
- Youngdo Ahn, Sung Joo Lee, Jong Won Shin:
Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss. 131-135 - Junghun Kim, Yoojin An, Jihie Kim:
Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms. 136-140 - Joosung Lee:
The Emotion is Not One-hot Encoding: Learning with Grayscale Label for Emotion Recognition in Conversation. 141-145 - Andreas Triantafyllopoulos, Johannes Wagner, Hagen Wierstorf, Maximilian Schmitt, Uwe D. Reichel, Florian Eyben, Felix Burkhardt, Björn W. Schuller:
Probing speech emotion recognition transformers for linguistic knowledge. 146-150 - Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann:
End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks. 151-155 - Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost:
Mind the gap: On the value of silence representations to lexical-based speech emotion recognition. 156-160 - Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso:
Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier. 161-165 - Hira Dhamyal, Bhiksha Raj, Rita Singh:
Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection. 166-170
Dereverberation, Noise Reduction, and Speaker Extraction
- Tuan Vu Ho, Maori Kobayashi, Masato Akagi:
Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion. 171-175 - Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki:
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement. 176-180 - Minseung Kim, Hyungchan Song, Sein Cheong, Jong Won Shin:
iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement. 181-185 - Kuo-Hsuan Hung, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin:
Boosting Self-Supervised Embeddings for Speech Enhancement. 186-190 - Seorim Hwang, Youngcheol Park, Sungwook Park:
Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections. 191-195 - Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey:
CycleGAN-based Unpaired Speech Dereverberation. 196-200 - Ashutosh Pandey, DeLiang Wang:
Attentive Training: A New Training Framework for Talker-independent Speaker Extraction. 201-205 - Tyler Vuong, Richard M. Stern:
Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement. 206-210 - Chiang-Jen Peng, Yun-Ju Chan, Yih-Liang Shen, Cheng Yu, Yu Tsao, Tai-Shih Chi:
Perceptual Characteristics Based Multi-objective Model for Speech Enhancement. 211-215 - Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani:
Listen only to me! How well can target speech extraction handle false alarms? 216-220 - Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara:
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction. 221-225 - Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann:
Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments. 226-230
Source Separation II
- Nicolás Schmidt, Jordi Pons, Marius Miron:
PodcastMix: A dataset for separating music and speech in podcasts. 231-235 - Kohei Saijo, Robin Scheibler:
Independence-based Joint Dereverberation and Separation with Neural Source Model. 236-240 - Kohei Saijo, Robin Scheibler:
Spatial Loss for Unsupervised Multi-channel Source Separation. 241-245 - Samuel Bellows, Timothy W. Leishman:
Effect of Head Orientation on Speech Directivity. 246-250 - Kohei Saijo, Tetsuji Ogawa:
Unsupervised Training of Sequential Neural Beamformer Using Coarsely-separated and Non-separated Signals. 251-255 - Marvin Borsdorf, Kevin Scheck, Haizhou Li, Tanja Schultz:
Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language. 256-260 - Mateusz Guzik, Konrad Kowalczyk:
NTF of Spectral and Spatial Features for Tracking and Separation of Moving Sound Sources in Spherical Harmonic Domain. 261-265 - Jack Deadman, Jon Barker:
Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation. 266-270 - Christoph Böddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach:
An Initialization Scheme for Meeting Separation with Spatial Mixture Models. 271-275 - Seongkyu Mun, Dhananjaya Gowda, Jihwan Lee, Changwoo Han, Dokyun Lee, Chanwoo Kim:
Prototypical speaker-interference loss for target voice separation using non-parallel audio samples. 276-280
Embedding and Network Architecture for Speaker Recognition
- Pierre-Michel Bousquet, Mickael Rouvier, Jean-François Bonastre:
Reliability criterion based on learning-phase entropy for speaker recognition with neural network. 281-285 - Bei Liu, Zhengyang Chen, Yanmin Qian:
Attentive Feature Fusion for Robust Speaker Verification. 286-290 - Bei Liu, Zhengyang Chen, Yanmin Qian:
Dual Path Embedding Learning for Speaker Verification with Triplet Attention. 291-295 - Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian:
DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design. 296-300 - Ruida Li, Shuo Fang, Chenguang Ma, Liang Li:
Adaptive Rectangle Loss for Speaker Verification. 301-305 - Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng:
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification. 306-310 - Leying Zhang, Zhengyang Chen, Yanmin Qian:
Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification. 311-315 - Yusheng Tian, Jingyu Li, Tan Lee:
Transport-Oriented Feature Aggregation for Speaker Embedding Learning. 316-320 - Mufan Sang, John H. L. Hansen:
Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning. 321-325 - Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen:
CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution. 326-330 - Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang:
Reliable Visualization for Deep Speaker Recognition. 331-335 - Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan:
Unifying Cosine and PLDA Back-ends for Speaker Verification. 336-340 - Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang:
CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor. 341-345
Speech Representation II
- Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du:
SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech. 346-350 - David Feinberg:
VoiceLab: Software for Fully Reproducible Automated Voice Analysis. 351-355 - Joel Shor, Subhashini Venugopalan:
TRILLsson: Distilled Universal Paralinguistic Speech Representations. 356-360 - Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li, Jianwu Dang:
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network. 361-365 - Mostafa Sadeghi, Paul Magron:
A Sparsity-promoting Dictionary Model for Variational Autoencoders. 366-370 - Yan Zhao, Jincen Wang, Ru Ye, Yuan Zong, Wenming Zheng, Li Zhao:
Deep Transductive Transfer Regression Network for Cross-Corpus Speech Emotion Recognition. 371-375 - John H. L. Hansen, Zhenyu Wang:
Audio Anti-spoofing Using Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning. 376-380 - Boris Bergsma, Minhao Yang, Milos Cernak:
PEAF: Learnable Power Efficient Analog Acoustic Features for Audio Recognition. 381-385 - Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar, Mikolaj Kegler, Pierre Beckmann, Milos Cernak:
Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. 386-390 - Shijun Wang, Hamed Hemati, Jón Guðnason, Damian Borth:
Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition. 391-395 - Sarthak Yadav, Neil Zeghidour:
Learning neural audio features without supervision. 396-400 - Yixuan Zhang, Heming Wang, DeLiang Wang:
Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech. 401-405 - Abu Zaher Md Faridee, Hannes Gamper:
Predicting label distribution improves non-intrusive speech quality estimation. 406-410 - Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka:
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models. 411-415 - Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza:
Dataset Pruning for Resource-constrained Spoofed Audio Detection. 416-420
Speech Synthesis: Linguistic Processing, Paradigms and Other Topics II
- Jaesung Tae, Hyeongju Kim, Taesu Kim:
EdiTTS: Score-based Editing for Controllable Text-to-Speech. 421-425 - Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng:
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information. 426-430 - Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi:
SpeechPainter: Text-conditioned Speech Inpainting. 431-435 - Song Zhang, Ken Zheng, Xiaoxu Zhu, Baoxiang Li:
A polyphone BERT for Polyphone Disambiguation in Mandarin Chinese. 436-440 - Mutian He, Jingzhou Yang, Lei He, Frank K. Soong:
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge. 441-445 - Jian Zhu, Cong Zhang, David Jurgens:
ByT5 model for massively multilingual grapheme-to-phoneme conversion. 446-450 - Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, Jiuxiang Gu, Ani Nenkova, Vlad I. Morariu, Rajiv Jain, Dinesh Manocha:
DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis. 451-455 - Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao:
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech. 456-460 - Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson:
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition. 461-465 - Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Steven Hung Quoc Truong:
An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis. 466-470 - Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter:
Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks. 471-475 - Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin:
An Automatic Soundtracking System for Text-to-Speech Audiobooks. 476-480 - Daxin Tan, Guangyan Zhang, Tan Lee:
Environment Aware Text-to-Speech Synthesis. 481-485 - Artem Ploujnikov, Mirco Ravanelli:
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation. 486-490 - Evelina Bakhturina, Yang Zhang, Boris Ginsburg:
Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization. 491-495 - Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote:
Prosodic alignment for off-screen automatic dubbing. 496-500 - Qibing Bai, Tom Ko, Yu Zhang:
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis. 501-505 - Hirokazu Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka:
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech. 506-510 - Binu Nisal Abeysinghe, Jesin James, Catherine I. Watson, Felix Marattukalam:
Visualising Model Training via Vowel Space for Text-To-Speech Systems. 511-515
Other Topics in Speech Recognition
- Aaqib Saeed:
Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices. 516-520 - Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka:
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings. 521-525 - Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura:
Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data. 526-530 - Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan:
Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation. 531-535 - Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide:
Federated Domain Adaptation for ASR with Full Self-Supervision. 536-540 - Longfei Yang, Wenqing Wei, Sheng Li, Jiyi Li, Takahiro Shinozaki:
Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection. 541-545 - Zvi Kons, Hagai Aronowitz, Edmilson da Silva Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon:
Extending RNN-T-based speech recognition systems with emotion and language classification. 546-549 - Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg:
Thutmose Tagger: Single-pass neural model for Inverse Text Normalization. 550-554 - Yeonjin Cho, Sara Ng, Trang Tran, Mari Ostendorf:
Leveraging Prosody for Punctuation Prediction of Spontaneous Speech. 555-559 - Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie:
A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings. 560-564
Audio Deep PLC (Packet Loss Concealment) Challenge
- Yuansheng Guan, Guochen Yu, Andong Li, Chengshi Zheng, Jie Wang:
TMGAN-PLC: Audio Packet Loss Concealment using Temporal Memory Generative Adversarial Network. 565-569 - Jean-Marc Valin, Ahmed Mustafa, Christopher Montgomery, Timothy B. Terriberry, Michael Klingbeil, Paris Smaragdis, Arvindh Krishnaswamy:
Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model. 570-574 - Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang:
PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network. 575-579 - Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler:
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. 580-584 - Nan Li, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu:
End-to-End Multi-Loss Training for Low Delay Packet Loss Concealment. 585-589
Robust Speaker Recognition
- Ju-ho Kim, Jungwoo Heo, Hye-jin Shim, Ha-Jin Yu:
Extended U-Net for Speaker Verification in Noisy Environments. 590-594 - Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun:
Domain Agnostic Few-shot Learning for Speaker Verification. 595-599 - Qiongqiong Wang, Kong Aik Lee, Tianchi Liu:
Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA? 600-604 - Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Anna Silnova, Lukás Burget, Jan Cernocký:
Training speaker embedding extractors using multi-speaker audio with unknown speaker boundaries. 605-609 - Chau Luu, Steve Renals, Peter Bell:
Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations. 610-614 - Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak:
Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification. 615-619
Speech Production
- Tsukasa Yoshinaga, Kikuo Maekawa, Akiyoshi Iida:
Variability in Production of Non-Sibilant Fricative [ç] in /hi/. 620-624 - Sathvik Udupa, Aravind Illa, Prasanta Kumar Ghosh:
Streaming model for Acoustic to Articulatory Inversion with transformer networks. 625-629 - Tsiky Rakotomalala, Pierre Baraduc, Pascal Perrier:
Trajectories predicted by optimal speech motor control using LSTM networks. 630-634 - Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Yi Xu:
Exploration strategies for articulatory synthesis of complex syllable onsets. 635-639 - Yoonjeong Lee, Jody Kreiman:
Linguistic versus biological factors governing acoustic voice variation. 640-643 - Takayuki Nagamine:
Acquisition of allophonic variation in second language speech: An acoustic and articulatory study of English laterals by Japanese speakers. 644-648
Speech Quality Assessment
- Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel Dejene Gebru, Vamsi Krishna Ithapu, Paul Calamia:
SAQAM: Spatial Audio Quality Assessment Metric. 649-653 - Pranay Manocha, Anurag Kumar:
Speech Quality Assessment through MOS using Non-Matching References. 654-658 - Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise:
An objective test tool for pitch extractors' response attributes. 659-663 - Kai Li, Sheng Li, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang, Masashi Unoki:
Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection. 664-668 - Salah Zaiem, Titouan Parcollet, Slim Essid:
Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning. 669-673 - Deebha Mumtaz, Ajit Jena, Vinit Jakhetiya, Karan Nathwani, Sharath Chandra Guntuku:
Transformer-based quality assessment model for generalized user-generated multimedia audio content. 674-678
Language Modeling and Lexical Modeling for ASR
- Christophe Van Gysel, Mirko Hannemann, Ernest Pusateri, Youssef Oualil, Ilya Oparin:
Space-Efficient Representation of Entity-centric Query Language Models. 679-683 - Saket Dingliwal, Ashish Shenoy, Sravan Bodapati, Ankur Gandhe, Ravi Teja Gadde, Katrin Kirchhoff:
Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems. 684-688 - W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor D. Strohman, Shankar Kumar:
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition. 689-693 - Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey:
UserLibri: A Dataset for ASR Personalization Using Only Text. 694-698 - Chin-Yueh Chien, Kuan-Yu Chen:
A BERT-based Language Modeling Framework. 699-703
Challenges and Opportunities for Signal Processing and Machine Learning for Multiple Smart Devices
- Yoshiki Masuyama, Kouei Yamaoka, Nobutaka Ono:
Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones. 704-708 - Gregory Ciccarelli, Jarred Barber, Arun Nair, Israel Cohen, Tao Zhang:
Challenges and Opportunities in Multi-device Speech Processing. 709-713 - Ameya Agaskar:
Practical Over-the-air Perceptual Acoustic Watermarking. 714-718 - Timm Koppelmann, Luca Becker, Alexandru Nelus, Rene Glitza, Lea Schönherr, Rainer Martin:
Clustering-based Wake Word Detection in Privacy-aware Acoustic Sensor Networks. 719-723 - Francesco Nespoli, Daniel Barreda, Patrick A. Naylor:
Relative Acoustic Features for Distance Estimation in Smart-Homes. 724-728 - Ashutosh Pandey, Buye Xu, Anurag Kumar, Jacob Donley, Paul Calamia, DeLiang Wang:
Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network. 729-733
Speech Processing & Measurement
- Arne-Lukas Fietkau, Simon Stone, Peter Birkholz:
Relationship between the acoustic time intervals and tongue movements of German diphthongs. 734-738 - Sanae Matsui, Kyoji Iwamoto, Reiko Mazuka:
Development of allophonic realization until adolescence: A production study of the affricate-fricative variation of /z/ among Japanese children. 739-743 - Chung Soo Ahn, L. L. Chamara Kasun, Sunil Sivadas, Jagath C. Rajapakse:
Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition. 744-748 - Louise Coppieters de Gibson, Philip N. Garner:
Low-Level Physiological Implications of End-to-End Learning for Speech Recognition. 749-753 - Carolina Lins Machado, Volker Dellwo, Lei He:
Idiosyncratic lingual articulation of American English /æ/ and /ɑ/ using network analysis. 754-758 - Teruki Toya, Wenyu Zhu, Maori Kobayashi, Kenichi Nakamura, Masashi Unoki:
Method for improving the word intelligibility of presented speech using bone-conduction headphones. 759-763 - Debasish Ray Mohapatra, Mario Fleischer, Victor Zappi, Peter Birkholz, Sidney S. Fels:
Three-dimensional finite-difference time-domain acoustic analysis of simplified vocal tract shapes. 764-768 - Dorina De Jong, Aldo Pastore, Noël Nguyen, Alessandro D'Ausilio:
Speech imitation skills predict automatic phonetic convergence: a GMM-UBM study on L2. 769-773 - Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber:
Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE. 774-778 - Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W. Black, Gopala Krishna Anumanchipalli:
Deep Speech Synthesis from Articulatory Representations. 779-783 - Monica Ashokumar, Jean-Luc Schwartz, Takayuki Ito:
Orofacial somatosensory inputs in speech perceptual training modulate speech production. 784-787
Speech Synthesis: Acoustic Modeling and Neural Waveform Generation I
- Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim:
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus. 788-792 - Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto:
DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning. 793-797 - Kentaro Mitsui, Kei Sawada:
MSR-NV: Neural Vocoder Using Multiple Sampling Rates. 798-802 - Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani:
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping. 803-807 - Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung:
Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. 808-812 - Jae-Sung Bae, Jinhyeok Yang, Taejun Bak, Young-Sun Joo:
Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech. 813-817 - Krishna Subramani, Jean-Marc Valin, Umut Isik, Paris Smaragdis, Arvindh Krishnaswamy:
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation. 818-822 - Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman:
EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models. 823-827 - Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis:
Fine-grained Noise Control for Multispeaker Speech Synthesis. 828-832 - Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby:
WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. 833-837 - Ivan Vovk, Tasnima Sadekova, Vladimir Gogoryan, Vadim Popov, Mikhail A. Kudinov, Jiansheng Wei:
Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU. 838-842 - Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James R. Glass:
Simple and Effective Unsupervised Speech Synthesis. 843-847 - Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda:
Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. 848-852
Show and Tell I
- Taejin Park, Nithin Rao Koluguri, Fei Jia, Jagadeesh Balam, Boris Ginsburg:
NeMo Open Source Speaker Diarization System. 853-854 - Baihan Lin:
Voice2Alliance: Automatic Speaker Diarization and Quality Assurance of Conversational Alignment. 855-856 - Rishabh Kumar, Devaraja Adiga, Mayank Kothyari, Jatin Dalal, Ganesh Ramakrishnan, Preethi Jyothi:
VAgyojaka: An Annotating and Post-Editing Tool for Automatic Speech Recognition. 857-858 - Alzahra Badi, Chungho Park, Min-Seok Keum, Miguel Alba, Youngsuk Ryu, Jeongmin Bae:
SKYE: More than a conversational AI. 859-860
Spatial Audio
- Hokuto Munakata, Ryu Takeda, Kazunori Komatani:
Training Data Generation with DOA-based Selecting and Remixing for Unsupervised Training of Deep Separation Models. 861-865 - Hangting Chen, Yi Yang, Feng Dang, Pengyuan Zhang:
Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output. 866-870 - Feifei Xiong, Pengyu Wang, Zhongfu Ye, Jinwei Feng:
Joint Estimation of Direction-of-Arrival and Distance for Arrays with Directional Sensors based on Sparse Bayesian Learning. 871-875 - Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello:
How to Listen? Rethinking Visual Sound Localization. 876-880 - Zhiheng Ouyang, Miao Wang, Wei-Ping Zhu:
Small Footprint Neural Networks for Acoustic Direction of Arrival Estimation. 881-885 - Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, Yan Lu:
Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation. 886-890 - Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang:
MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources. 891-895 - Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang:
Iterative Sound Source Localization for Unknown Number of Sources. 896-900 - Katharine Patterson, Kevin W. Wilson, Scott Wisdom, John R. Hershey:
Distance-Based Sound Separation. 901-905 - Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang:
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network. 906-910 - Ali Aroudi, Stefan Uhlich, Marc Ferras Font:
TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation. 911-915
Single-channel Speech Enhancement II
- Xiaofeng Ge, Jiangyu Han, Yanhua Long, Haixin Guan:
PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement. 916-920 - Zhuangqi Chen, Pingjian Zhang:
Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement. 921-925 - Jiaming Cheng, Ruiyu Liang, Yue Xie, Li Zhao, Björn W. Schuller, Jie Jia, Yiyuan Peng:
Cross-Layer Similarity Knowledge Distillation for Speech Enhancement. 926-930 - Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li, Jinwei Feng:
Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation. 931-935 - Ruizhe Cao, Sherif Abdulatif, Bin Yang:
CMGAN: Conformer-based Metric GAN for Speech Enhancement. 936-940 - Zeyuan Wei, Li Hao, Xueliang Zhang:
Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement. 941-945 - Chenhui Zhang, Xiang Pan:
Single-channel speech enhancement using Graph Fourier Transform. 946-950 - Zilu Guo, Xu Xu, Zhongfu Ye:
Joint Optimization of the Module and Sign of the Spectral Real Part Based on CRN for Speech Denoising. 951-955 - Hao Zhang, Ashutosh Pandey, DeLiang Wang:
Attentive Recurrent Network for Low-Latency Active Noise Control. 956-960 - Jen-Hung Huang, Chung-Hsien Wu:
Memory-Efficient Multi-Step Speech Enhancement with Neural ODE. 961-965 - Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Jianjun Hao:
GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block. 966-970 - Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li:
Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention. 971-975 - Jun Chen, Wei Rao, Zilin Wang, Zhiyong Wu, Yannan Wang, Tao Yu, Shidong Shang, Helen Meng:
Speech Enhancement with Fullband-Subband Cross-Attention Network. 976-980 - Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli:
OSSEM: one-shot speaker adaptive speech enhancement using meta learning. 981-985 - Wenbin Jiang, Tao Liu, Kai Yu:
Efficient Speech Enhancement with Neural Homomorphic Synthesis. 986-990 - Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang:
Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation. 991-995 - Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura:
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations. 996-1000
Novel Models and Training Methods for ASR II
- Haaris Mehmood, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay:
FedNST: Federated Noisy Student Training for Automatic Speech Recognition. 1001-1005 - Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen, Youzheng Wu, Xiaodong He:
SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition. 1006-1010 - Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan:
NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition. 1011-1015 - Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma:
Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR. 1016-1020 - Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang:
PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition. 1021-1025 - Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno:
Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition. 1026-1030 - Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach:
Improving Rare Word Recognition with LM-aware MWER Training. 1031-1035 - Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney:
Improving the Training Recipe for a Robust Conformer-based Hybrid Model. 1036-1040 - Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg:
CTC Variations Through New WFST Topologies. 1041-1045 - Martin Sustek, Samik Sadhu, Hynek Hermansky:
Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition. 1046-1050 - Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao:
Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition. 1051-1055 - Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker:
On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training. 1056-1060 - Selen Hande Kabil, Hervé Bourlard:
From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition. 1061-1065 - Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya:
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks. 1066-1070 - Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe:
Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR. 1071-1075
Spoken Dialogue Systems and Multimodality
- Naokazu Uchida, Takeshi Homma, Makoto Iwayama, Yasuhiro Sogawa:
Reducing Offensive Replies in Open Domain Dialogue Systems. 1076-1080 - Ting-Wei Wu, Biing-Hwang Juang:
Induce Spoken Dialog Intents via Deep Unsupervised Context Contrastive Clustering. 1081-1085 - Fumio Nihei, Ryo Ishii, Yukiko I. Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura:
Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information. 1086-1090 - Dhanush Bekal, Sundararajan Srinivasan, Srikanth Ronanki, Sravan Bodapati, Katrin Kirchhoff:
Contextual Acoustic Barge-In Classification for Spoken Dialog Systems. 1091-1095 - Peilin Zhou, Dading Chong, Helin Wang, Qingcheng Zeng:
Calibrate and Refine! A Novel and Agile Framework for ASR Error Robust Intent Detection. 1096-1100 - Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng:
ASR-Robust Natural Language Understanding on ASR-GLUE dataset. 1101-1105 - Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen:
From Disfluency Detection to Intent Detection and Slot Filling. 1106-1110 - Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jianqing Gao:
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis. 1111-1115 - Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos:
Extending Compositional Attention Networks for Social Reasoning in Videos. 1116-1120 - Shiquan Wang, Yuke Si, Xiao Wei, Longbiao Wang, Zhiqiang Zhuang, Xiaowang Zhang, Jianwu Dang:
TopicKS: Topic-driven Knowledge Selection for Knowledge-grounded Dialogue Generation. 1121-1125 - Andreas Liesenfeld, Mark Dingemanse:
Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages. 1126-1130 - Yi Zhu, Zexun Wang, Hang Liu, Peiying Wang, Mingchao Feng, Meng Chen, Xiaodong He:
Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding. 1131-1135 - Keiko Ochi, Nobutaka Ono, Keiho Owada, Miho Kuroda, Shigeki Sagayama, Hidenori Yamasue:
Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism. 1136-1140
Show and Tell I (VR)
- Denis Ivanko, Dmitry Ryumin, Alexey M. Kashevnik, Alexandr Axyonov, Andrey Kitenko, Igor Lashkov, Alexey Karpov:
DAVIS: Driver's Audio-Visual Speech recognition. 1141-1142
Speech Emotion Recognition I
- Einari Vaaras, Manu Airaksinen, Okko Räsänen:
Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion Recognition. 1143-1147 - Chun-Yu Chen, Yun-Shao Lin, Chi-Chun Lee:
Emotion-Shift Aware CRF for Decoding Emotion Sequence in Conversation. 1148-1152 - Bo-Hao Su, Chi-Chun Lee:
Vaccinating SER to Neutralize Adversarial Attacks with Self-Supervised Augmentation Strategy. 1153-1157 - Jack Parry, Eric DeMattos, Anita Klementiev, Axel Ind, Daniela Morse-Kopp, Georgia Clarke, Dimitri Palaz:
Speech Emotion Recognition in the Wild using Multi-task and Adversarial Learning. 1158-1162 - Ashishkumar Prabhakar Gudmalwar, Biplove Basel, Anirban Dutta, Ch V. Rama Rao:
The Magnitude and Phase based Speech Representation Learning using Autoencoder for Classifying Speech Emotions using Deep Canonical Correlation Analysis. 1163-1167 - Lucas Goncalves, Carlos Busso:
Improving Speech Emotion Recognition Using Self-Supervised Learning with Domain-Specific Audiovisual Tasks. 1168-1172
Single-channel Speech Enhancement I
- Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani:
SNRi Target Training for Joint Speech Enhancement and Recognition. 1173-1177 - Yutaro Sanada, Takumi Nakagawa, Yuichiro Wada, Kosaku Takanashi, Yuhui Zhang, Kiichi Tokuyama, Takafumi Kanamori, Tomonori Yamada:
Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches. 1178-1182 - Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao:
NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling. 1183-1187 - Ivan Shchekotov, Pavel K. Andreev, Oleg Ivanov, Aibek Alanov, Dmitry P. Vetrov:
FFC-SE: Fast Fourier Convolution for Speech Enhancement. 1188-1192 - Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi:
A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement. 1193-1197 - Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Byung Hoon Lee, Sung Won Han:
Multi-View Attention Transfer for Efficient Speech Enhancement. 1198-1202
Speech Synthesis: New Applications
- Nabarun Goswami, Tatsuya Harada:
SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate. 1203-1207 - Talia Ben Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet:
Correcting Mispronunciations in Speech using Spectrogram Inpainting. 1208-1212 - Jason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, Simon King:
Speech Audio Corrector: using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech. 1213-1217 - Wen-Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon:
End-to-End Binaural Speech Synthesis. 1218-1222 - Julia Koch, Florian Lux, Nadja Schauffler, Toni Bernhart, Felix Dieterle, Jonas Kuhn, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu:
PoeticTTS - Controllable Poetry Reading for Literary Studies. 1223-1227 - Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu:
Articulatory Synthesis for Data Augmentation in Phoneme Recognition. 1228-1232
Spoken Language Understanding I
- Jihyun Lee, Gary Geunbae Lee:
SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary Task. 1233-1237 - Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset:
Benchmarking Transformers-based models on French Spoken Language Understanding tasks. 1238-1242 - Seong-Hwan Heo, WonKee Lee, Jong-Hyeok Lee:
mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling. 1243-1247 - Pu Wang, Hugo Van hamme:
Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding. 1248-1252 - Anirudh Raju, Milind Rao, Gautam Tiwari, Pranav Dheram, Bryan Anderson, Zhe Zhang, Chul Lee, Bach Bui, Ariya Rastrow:
On joint training with interfaces for spoken language understanding. 1253-1257 - Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed Hussen Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed H. Tewfik:
Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models. 1258-1262
Inclusive and Fair Speech Technologies I
- Perez Ogayo, Graham Neubig, Alan W. Black:
Building African Voices. 1263-1267 - Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke:
Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities. 1268-1272 - May Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, Nicole R. Holliday:
Training and typological bias in ASR performance for world Englishes. 1273-1277
Inclusive and Fair Speech Technologies II
- Marcely Zanon Boito, Laurent Besacier, Natalia A. Tomashenko, Yannick Estève:
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems. 1278-1282 - Alexander Johnson, Kevin Everson, Vijay Ravi, Anissa Gladney, Mari Ostendorf, Abeer Alwan:
Automatic Dialect Density Estimation for African American English. 1283-1287 - Kunnar Kukk, Tanel Alumäe:
Improving Language Identification of Accented Speech. 1288-1292 - Wiebke Toussaint, Lauriane Gorce, Aaron Yi Ding:
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets. 1293-1297 - Viet Anh Trinh, Pegah Ghahremani, Brian John King, Jasha Droppo, Andreas Stolcke, Roland Maas:
Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation. 1298-1302
Phonetics I
- Takuya Kunihara, Chuanbo Zhu, Nobuaki Minematsu, Noriko Nakanishi:
Gradual Improvements Observed in Learners' Perception and Production of L2 Sounds Through Continuing Shadowing Practices on a Daily Basis. 1303-1307 - Christin Kirchhübel, Georgina Brown:
Spoofed speech from the perspective of a forensic phonetician. 1308-1312 - Hae-Sung Jeon, Stephen Nichols:
Investigating Prosodic Variation in British English Varieties using ProPer. 1313-1317 - Hyun Kyung Hwang, Manami Hirayama, Takaomi Kato:
Perceived prominence and downstep in Japanese. 1318-1321 - Andrea Alicehajic, Silke Hamann:
The discrimination of [zi]-[dʑi] by Japanese listeners and the prospective phonologization of /zi/. 1322-1326 - Ingo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz:
Glottal inverse filtering based on articulatory synthesis and deep learning. 1327-1331 - Bogdan Ludusan, Marin Schröer, Petra Wagner:
Investigating phonetic convergence of laughter in conversation. 1332-1336 - Véronique Delvaux, Audrey Lavallée, Fanny Degouis, Xavier Saloppe, Jean-Louis Nandrino, Thierry Pham:
Telling self-defining memories: An acoustic study of natural emotional speech productions. 1337-1341 - Laura Spinu, Ioana Vasilescu, Lori Lamel, Jason Lilley:
Voicing neutralization in Romanian fricatives across different speech styles. 1342-1346 - Sishi Liao, Phil Hoole, Conceição Cunha, Esther Kunay, Aletheia Cui, Lia Saki Bucar Shigemori, Felicitas Kleber, Dirk Voit, Jens Frahm, Jonathan Harrington:
Nasal Coda Loss in the Chengdu Dialect of Mandarin: Evidence from RT-MRI. 1347-1351 - Philipp Buech, Simon Roessig, Lena Pagel, Doris Mücke, Anne Hermes:
ema2wav: doing articulation by Praat. 1352-1356
Multi-, Cross-lingual and Other Topics in ASR I
- Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann:
Improving Phonetic Transcriptions of Children's Speech by Pronunciation Modelling with Constrained CTC-Decoding. 1357-1361 - Soky Kak, Sheng Li, Masato Mimura, Chenhui Chu, Tatsuya Kawahara:
Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism. 1362-1366 - Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol:
KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus. 1367-1371 - Tünde Szalay, Mostafa Ali Shahin, Beena Ahmed, Kirrie J. Ballard:
Knowledge of accent differences can be used to predict speech recognition. 1372-1376 - Maximilian Karl Scharf, Sabine Hochmuth, Lena L. N. Wong, Birger Kollmeier, Anna Warzybok:
Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal features. 1377-1381 - Martin Flechl, Shou-Chun Yin, Junho Park, Peter Skala:
End-to-end speech recognition modeling from de-identified data. 1382-1386 - Aditya Yadavalli, Mirishkar Sai Ganesh, Anil Kumar Vuppala:
Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition. 1387-1391 - Jiamin Xie, John H. L. Hansen:
DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition. 1392-1396
Zero, low-resource and multi-modal speech recognition I
- Yuna Lee, Seung Jun Baek:
Keyword Spotting with Synthetic Data using Heterogeneous Knowledge Distillation. 1397-1401 - Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski:
Probing phoneme, language and speaker information in unsupervised speech representations. 1402-1406 - Andrei Bîrladeanu, Helen Minnis, Alessandro Vinciarelli:
Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver Sessions. 1407-1410 - Eesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim:
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning. 1411-1415 - Tyler Miller, David Harwath:
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech. 1416-1420 - Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman:
Pseudo Label Is Better Than Human Label. 1421-1425 - Werner van der Merwe, Herman Kamper, Johan Adam du Preez:
A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery. 1426-1430
Speaker Embedding and Diarization
- Siqi Zheng, Hongbin Suo, Qian Chen:
PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification. 1431-1435 - Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li:
Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings. 1436-1440 - Weiqing Wang, Ming Li, Qingjian Lin:
Online Target Speaker Voice Activity Detection for Speaker Diarization. 1441-1445 - Niko Brummer, Albert Swart, Ladislav Mosner, Anna Silnova, Oldrich Plchot, Themos Stafylakis, Lukás Burget:
Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings. 1446-1450 - Bin Gu:
Deep speaker embedding with frame-constrained training strategy for speaker verification. 1451-1455 - Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan:
Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization. 1456-1460 - Mao-Kui He, Jun Du, Chin-Hui Lee:
End-to-End Audio-Visual Neural Speaker Diarization. 1461-1465 - Yanyan Yue, Jun Du, Mao-Kui He, Yu Ting Yeung, Renyu Wang:
Online Speaker Diarization with Core Samples Selection. 1466-1470 - Chenyu Yang, Yu Wang:
Robust End-to-end Speaker Diarization with Generic Neural Clustering. 1471-1475 - Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Yanmin Qian, Kai Yu:
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild. 1476-1480 - Md. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones:
Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free. 1481-1485 - Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Böddeker, Reinhold Haeb-Umbach:
Utterance-by-utterance overlap-aware neural diarization with Graph-PIT. 1486-1490 - Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Qingyang Hong:
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting. 1491-1495
Acoustic Event Detection and Classification
- Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang:
Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event Detection. 1496-1500 - Peng Liu, Songbin Li, Jigang Tang:
An End-to-End Macaque Voiceprint Verification Method Based on Channel Fusion Mechanism. 1501-1505 - Liang Xu, Jing Wang, Lizhong Wang, Sijun Bi, Jianqian Zhang, Qiuyue Ma:
Human Sound Classification based on Feature Fusion Method with Air and Bone Conducted Signal. 1506-1510 - Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang:
RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection. 1511-1515 - Achyut Mani Tripathi, Konark Paul:
Temporal Self Attention-Based Residual Network for Environmental Sound Classification. 1516-1520 - Juncheng Li, Shuhui Qu, Po-Yao Huang, Florian Metze:
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification. 1521-1525 - Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou:
Improving Target Sound Extraction with Timestamp Information. 1526-1530 - Ying Hu, Xiujuan Zhu, Yunlong Li, Hao Huang, Liang He:
A Multi-grained based Attention Network for Semi-supervised Sound Event Detection. 1531-1535 - Sangwook Park, Sandeep Reddy Kothinti, Mounya Elhilali:
Temporal coding with magnitude-phase regularization for sound event detection. 1536-1540 - Nian Shao, Erfan Loweimi, Xiaofei Li:
RCT: Random consistency training for semi-supervised sound event detection. 1541-1545 - Yifei Xin, Dongchao Yang, Yuexian Zou:
Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio Classification. 1546-1550 - Yu Wang, Mark Cartwright, Juan Pablo Bello:
Active Few-Shot Learning for Sound Event Detection. 1551-1555 - Tong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Jing Xiao:
Uncertainty Calibration for Deep Audio Classifiers. 1556-1560 - Yuanbo Hou, Dick Botteldooren:
Event-related data conditioning for acoustic event classification. 1561-1565
Speech Synthesis: Acoustic Modeling and Neural Waveform Generation II
- Haohan Guo, Hui Lu, Xixin Wu, Helen Meng:
A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS. 1566-1570 - Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo:
RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion. 1571-1575 - Manh Luong, Viet-Anh Tran:
FlowVocoder: A small Footprint Neural Vocoder based Normalizing Flow for Speech Synthesis. 1576-1580 - Yanqing Liu, Ruiqing Xue, Lei He, Xu Tan, Sheng Zhao:
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders. 1581-1585 - Xin Yuan, Robin Feng, Mingming Ye, Cheng Tuo, Minghang Zhang:
AdaVocoder: Adaptive Vocoder for Custom Voice. 1586-1590 - Shengyuan Xu, Wenxiao Zhao, Jing Guo:
RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses. 1591-1595 - Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu:
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature. 1596-1600 - Mengnan He, Tingwei Guo, Zhenxing Lu, Ruixiong Zhang, Caixia Gong:
Improving GAN-based vocoder for fast and high-quality speech synthesis. 1601-1605 - Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yuchao Zhang:
SoftSpeech: Unsupervised Duration Model in FastSpeech 2. 1606-1610 - Haohan Guo, Feng-Long Xie, Frank K. Soong, Xixin Wu, Helen Meng:
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS. 1611-1615 - Yuhan Li, Ying Shen, Dongqing Wang, Lin Zhang:
SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior Knowledge. 1616-1620 - Takeru Gorai, Daisuke Saito, Nobuaki Minematsu:
Text-to-speech synthesis using spectral modeling based on non-negative autoencoder. 1621-1625 - Hiroki Kanagawa, Yusuke Ijima, Hiroyuki Toda:
Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU. 1626-1630 - Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki:
MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks. 1631-1635 - Chenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
A compact transformer-based GAN vocoder. 1636-1640 - Hideyuki Tachibana, Muneyoshi Inahara, Mocho Go, Yotaro Katayama, Yotaro Watanabe:
Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE Solver. 1641-1645
ASR: Architecture and Search
- Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran:
On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer. 1646-1650 - Ting-Wei Wu, I-Fan Chen, Ankur Gandhe:
Learning to rank with BERT-based confidence models in ASR rescoring. 1651-1655 - Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury:
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States. 1656-1660 - Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, Jianwei Niu:
WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit. 1661-1665 - Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang:
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR. 1666-1670 - Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan:
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies. 1671-1675 - Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang:
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition. 1676-1680 - Zhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie:
CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer. 1681-1685 - Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li:
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT. 1686-1690 - Jash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda:
Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition. 1691-1695 - Weiran Wang, Ke Hu, Tara N. Sainath:
Streaming Align-Refine for Non-autoregressive Deliberation. 1696-1700 - Rongmei Lin, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays:
Federated Pruning: Improving Neural Network Efficiency with Federated Learning. 1701-1705 - Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman:
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes. 1706-1710 - Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov:
4-bit Conformer with Native Quantization Aware Training for Speech Recognition. 1711-1715 - Qiang Xu, Tongtong Song, Longbiao Wang, Hao Shi, Yuqin Lin, Yongjie Lv, Meng Ge, Qiang Yu, Jianwu Dang:
Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR Model. 1716-1720
Spoken Language Processing II
- Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka:
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. 1721-1725 - Linh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc Nguyen:
A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation. 1726-1730 - Qian Wang, Chen Wang, Jiajun Zhang:
Investigating Parameter Sharing in Multilingual Speech Translation. 1731-1735 - Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie, Yonghong Yan:
Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset. 1736-1740 - Chengfei Li, Shuhao Deng, Yaoping Wang, Guangjing Wang, Yaguang Gong, Changbin Chen, Jinfeng Bai:
TALCS: An open-source Mandarin-English code-switching corpus and a speech recognition baseline. 1741-1745 - Keqi Deng, Shinji Watanabe
, Jiatong Shi, Siddhant Arora:
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation. 1746-1750 - Nguyen Luong Tran, Duong Minh Le, Dat Quoc Nguyen:
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese. 1751-1755 - Maxim Markitantov, Elena Ryumina
, Dmitry Ryumin
, Alexey Karpov:
Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task. 1756-1760 - Jen-Tzung Chien
, Yu-Han Huang:
Bayesian Transformer Using Disentangled Mask Attention. 1761-1765 - Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe
, Odette Scharenborg
, Jingdong Chen, Baocai Yin, Jia Pan:
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis. 1766-1770 - Danni Liu
, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Miguel Pino:
From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation. 1771-1775 - Derek Tam, Surafel Melaku Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico:
Isochrony-Aware Neural Machine Translation for Automatic Dubbing. 1776-1780 - Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang:
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation. 1781-1785
Source Separation I
- Zexu Pan, Meng Ge, Haizhou Li:
A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction. 1786-1790 - Axel Berg
, Mark O'Connor, Kalle Åström
, Magnus Oskarsson
:
Extending GCC-PHAT using Shift Equivariant Neural Networks. 1791-1795 - Efthymios Tzinis, Gordon Wichern, Aswin Shanmugam Subramanian, Paris Smaragdis, Jonathan Le Roux:
Heterogeneous Target Speech Separation. 1796-1800 - Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang:
Separate What You Describe: Language-Queried Audio Source Separation. 1801-1805 - Dejan Markovic, Alexandre Défossez, Alexander Richard:
Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain. 1806-1810
ASR Technologies and Systems
- Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto:
End-to-end Speech-to-Punctuated-Text Recognition. 1811-1815 - Adrien Pupier
, Maximin Coavoux, Benjamin Lecouteux, Jérôme Goulian:
End-to-End Dependency Parsing of Spoken French. 1816-1820 - Shuo-Yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He:
Turn-Taking Prediction for Natural Conversational Speech. 1821-1825 - Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara N. Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman:
Streaming Intended Query Detection using E2E Modeling for Continued Conversation. 1826-1830 - Jan Lehecka, Jan Svec
, Ales Prazák, Josef Psutka:
Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech. 1831-1835 - Rodrigo Schoburg Carrillo de Mira
, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic:
SVTS: Scalable Video-to-Speech Synthesis. 1836-1840
Speech Perception
- Takeshi Kishiyama
, Chuyu Huang, Yuki Hirose:
One-step models in pitch perception: Experimental evidence from Japanese. 1841-1845 - Rubén Pérez Ramón, Martin Cooke, María Luisa García Lecumberri:
Generating iso-accented stimuli for second language research: methodology and a dataset for Spanish-accented English. 1846-1850 - Adrian Leemann, Péter Jeszenszky, Carina Steiner, Corinne Lanthemann:
Factors affecting the percept of Yanny v. Laurel (or mixed): Insights from a large-scale study on Swiss German listeners. 1851-1855 - Zhaoyan Zhang, Jason Zhang, Jody Kreiman:
Effects of laryngeal manipulations on voice gender perception. 1856-1860 - Boram Lee, Naomi Yamaguchi
, Cécile Fougeron:
Why is Korean lenis stop difficult to perceive for L2 Korean learners? 1861-1865 - Alvaro Martin Iturralde Zurita, Meghan Clayards:
Lexical stress in Spanish word segmentation. 1866-1870
Spoken Term Detection and Voice Search
- Hyeon-Kyeong Shin, Hyewon Han
, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang:
Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting. 1871-1875 - Badr M. Abdullah, Bernd Möbius
, Dietrich Klakow:
Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings. 1876-1880 - Seunghan Yang, Byeonggeun Kim, Inseop Chung, Simyung Chang:
Personalized Keyword Spotting through Multi-task Learning. 1881-1885 - Jan Svec
, Jan Lehecka, Lubos Smídl
:
Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer. 1886-1890 - Christin Jose, Joe Wang, Grant P. Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis:
Latency Control for Keyword Spotting. 1891-1895 - Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed H. Tewfik:
Improving Voice Trigger Detection with Metric Learning. 1896-1900
Speech and Language in Health: From Remote Monitoring to Medical Conversations I
- Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey:
RNN Transducers for Named Entity Recognition with constraints on alignment for understanding medical conversations. 1901-1905 - Zixiu Wu, Rim Helaoui, Diego Reforgiato Recupero, Daniele Riboni:
Towards Automated Counselling Decision-Making: Remarks on Therapist Action Forecasting on the AnnoMI Dataset. 1906-1910 - Salvatore Fara, Stefano Goria, Emilia Molimpakis, Nicholas Cummins
:
Speech and the n-Back task as a lens into depression. How combining both may allow us to isolate different core symptoms of depression. 1911-1915 - Amrit Romana, Minxue Niu, Matthew Perez
, Angela Roberts
, Emily Mower Provost:
Enabling Off-the-Shelf Disfluency Detection and Categorization for Pathological Speech. 1916-1920 - Catarina Botelho
, Tanja Schultz
, Alberto Abad
, Isabel Trancoso
:
Challenges of using longitudinal and cross-domain corpora on studies of pathological speech. 1921-1925
Speech Synthesis: Linguistic Processing, Paradigms and Other Topics I
- Yi-Chang Chen, Yu-Chuan Steven, Yen-Cheng Chang, Yi-Ren Yeh:
g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin. 1926-1930 - Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana:
A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech. 1931-1935 - Tuomo Raitio, Petko Petkov, Jiangchuan Li, P. V. Muhammed Shifas, Andrea Davis, Yannis Stylianou:
Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise. 1936-1940 - Eunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan-Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin-Seob Kim, Jae-Min Kim:
TTS-by-TTS 2: Data-Selective Augmentation for Neural Speech Synthesis Using Ranking Support Vector Machine with Variational Autoencoder. 1941-1945 - Giulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabrys, Jaime Lorenzo-Trueba:
Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation. 1946-1950
Show and Tell II
- Digvijay Ingle, Ayush Kumar, Krishnachaitanya Gogineni, Jithendra Vepa:
Real-Time Monitoring of Silences in Contact Center Conversations. 1951-1952 - Konrad Zielinski, Marek Grzelec, Martin Hagmüller
:
Humanizing bionic voice: interactive demonstration of aesthetic design and control factors influencing the devices assembly and waveshape engineering. 1953-1954 - Damien Ronssin, Milos Cernak:
Application for Real-time Personalized Speaker Extraction. 1955-1956 - Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K. K, Sadhana Gonuguntla, Murali Alagesan:
Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms. 1957-1958 - P. Schäfer, Paula Andrea Pérez-Toro, Philipp Klumpp, Juan Rafael Orozco-Arroyave, Elmar Nöth, Andreas K. Maier, A. Abad, Maria Schuster, Tomás Arias-Vergara:
CoachLea: an Android Application to Evaluate the Speech Production and Perception of Children with Hearing Loss. 1959-1960 - Fasih Haider, Saturnino Luz:
An Automated Mood Diary for Older User's using Ambient Assisted Living Recorded Speech. 1961-1962
Multimodal Speech Emotion Recognition and Paralinguistics
- Hai-tao Xu, Jie Zhang, Li-Rong Dai:
Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition. 1963-1967 - Daniel Fernau, Stefan Hillmann
, Nils Feldhus, Tim Polzehl:
Towards Automated Dialog Personalization using MBTI Personality Indicators. 1968-1972 - Fan Qian, Hongwei Song, Jiqing Han:
Word-wise Sparse Attention for Multimodal Sentiment Analysis. 1973-1977 - Tarun Gupta, Duc-Tuan Truong
, Tran The Anh, Eng Siong Chng:
Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model. 1978-1982 - Weiqiao Zheng, Ping Yang, Rongfeng Lai, Kongyang Zhu, Tao Zhang, Junpeng Zhang, Hongcheng Fu:
Exploring Multi-task Learning Based Gender Recognition and Age Estimation for Class-imbalanced Data. 1983-1987 - Jie Wei
, Guanyu Hu
, Xinyu Yang, Anh Tuan Luu, Yizhuo Dong:
Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition. 1988-1992 - Minyue Zhang, Hongwei Ding:
Impact of Background Noise and Contribution of Visual Information in Emotion Identification by Native Mandarin Speakers. 1993-1997 - Wei Yang, Satoru Fukayama, Panikos Heracleous, Jun Ogata:
Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition. 1998-2002 - Dehua Tao, Tan Lee
, Harold Chui, Sarah Luk:
Characterizing Therapist's Speaking Style in Relation to Empathy in Psychotherapy. 2003-2007 - Dehua Tao, Tan Lee
, Harold Chui, Sarah Luk:
Hierarchical Attention Network for Evaluating Therapist Empathy in Counseling Session. 2008-2012 - Jinchao Li, Shuai Wang, Yang Chao, Xunying Liu, Helen Meng:
Context-aware Multimodal Fusion for Emotion Recognition. 2013-2017 - Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan:
Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals. 2018-2022 - Nasim Mahdinazhad Sardhaei, Marzena Zygis, Hamid Sharifzadeh:
How do our eyebrows respond to masks and whispering? The case of Persians. 2023-2027 - Alice Baird, Panagiotis Tzirakis, Jeffrey A. Brooks, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, Dacher Keltner, Alan Cowen:
State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning Approach. 2028-2032 - Amruta Saraf, Ganesh Sivaraman, Elie Khoury
:
Confidence Measure for Automatic Age Estimation From Speech. 2033-2037
Neural Transducers, Streaming ASR and Novel ASR Models
- Andrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Swagath Venkataramani, George Saon
, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan:
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization. 2038-2042 - Guangzhi Sun, Chao Zhang, Philip C. Woodland:
Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition. 2043-2047 - Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma:
Bring dialogue-context into RNN-T for streaming ASR. 2048-2052 - Felix Weninger, Marco Gaudesi, Md. Akmal Haidar, Nicola Ferri, Jesús Andrés-Ferrer, Puming Zhan:
Conformer with dual-mode chunked attention for joint online and offline ASR. 2053-2057 - Wei Zhou
, Wilfried Michel, Ralf Schlüter, Hermann Ney:
Efficient Training of Neural Transducer for Speech Recognition. 2058-2062 - Zhifu Gao, Shiliang Zhang, Ian McLoughlin
, Zhijie Yan:
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition. 2063-2067 - Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey:
Pruned RNN-T for fast, memory-efficient ASR training. 2068-2072 - Xianchao Wu:
Deep Sparse Conformer for Speech Recognition. 2073-2077 - Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang:
Chain-based Discriminative Autoencoders for Speech Recognition. 2078-2082 - Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L. Seltzer:
Streaming parallel transducer beam search with fast slow cascaded encoders. 2083-2087 - Mohan Li, Rama Sanand Doddipatla
, Catalin Zorila:
Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition. 2088-2092 - Dario Albesano, Jesús Andrés-Ferrer, Nicola Ferri, Puming Zhan:
On the Prediction Network Architecture in RNN-T for ASR. 2093-2097 - Yusuke Shinohara, Shinji Watanabe
:
Minimum latency training of sequence transducers for streaming end-to-end speech recognition. 2098-2102 - Keyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, Guanglu Wan:
CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR. 2103-2107 - Xianchao Wu:
Attention Enhanced Citrinet for Speech Recognition. 2108-2112
Zero, Low-resource and Multi-Modal Speech Recognition II
- Qiantong Xu, Alexei Baevski, Michael Auli:
Simple and Effective Zero-shot Cross-lingual Phoneme Recognition. 2113-2117 - Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed:
Robust Self-Supervised Audio-Visual Speech Recognition. 2118-2122 - Robin Algayres, Adel Nabli, Benoît Sagot, Emmanuel Dupoux:
Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. 2123-2127 - Junhao Xu, Shoukang Hu, Xunying Liu, Helen Meng:
Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus. 2128-2132 - Siqing Qin
, Longbiao Wang, Sheng Li
, Yuqin Lin, Jianwu Dang:
Finer-grained Modeling units-based Meta-Learning for Low-resource Tibetan Speech Recognition. 2133-2137
Atypical Speech Analysis and Detection
- Parvaneh Janbakhshi, Ina Kodrasi
:
Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification. 2138-2142 - Zhenglin Zhang
, Lizhuang Yang, Xun Wang, Hai Li
:
Automated Detection of Wilson's Disease Based on Improved Mel-frequency Cepstral Coefficients with Signal Decomposition. 2143-2147 - Zixia Fan
, Jing Shao
, Weigong Pan, Min Xu, Lan Wang:
The effect of backward noise on lexical tone discrimination in Mandarin-speaking amusics. 2148-2152 - Xiaoquan Ke, Man-Wai Mak, Helen M. Meng:
Automatic Selection of Discriminative Features for Dementia Detection in Cantonese-Speaking People. 2153-2157 - Zhuoya Liu, Mark A. Huckvale, Julian McGlashan:
Automated Voice Pathology Discrimination from Continuous Speech Benefits from Analysis by Phonetic Context. 2158-2162 - Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller:
Multi-Type Outer Product-Based Fusion of Respiratory Sounds for Detecting COVID-19. 2163-2167 - Xueshuai Zhang, Jiakun Shen, Jun Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shaoxing Zhang, Aijun Sun:
Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics. 2168-2172 - Farhad Javanmardi, Sudarsana Reddy Kadiri
, Manila Kodali
, Paavo Alku
:
Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers. 2173-2177 - Gerasimos Chatzoudis, Manos Plitsis, Spyridoula Stamouli, Athanasia-Lida Dimou, Nassos Katsamanis, Vassilis Katsouros:
Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition. 2178-2182 - Youxiang Zhu, Xiaohui Liang, John A. Batsis, Robert M. Roth:
Domain-aware Intermediate Pretraining for Dementia Detection with Limited Data. 2183-2187 - Cécile Fougeron, Nicolas Audibert, Ina Kodrasi
, Parvaneh Janbakhshi, Michaela Pernon, Nathalie Lévêque, Stephanie Borel, Marina Laganaro
, Hervé Bourlard, Frédéric Assal:
Comparison of 5 methods for the evaluation of intelligibility in mild to moderate French dysarthric speech. 2188-2192
Adaptation, Transfer Learning, and Distillation for ASR
- Kuan-Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee:
Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation. 2193-2197 - Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee:
Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition. 2198-2202 - Kwanghee Choi, Hyung-Min Park:
Distilling a Pretrained Language Model to a Multilingual ASR Model. 2203-2207 - Hiroaki Sato, Tomoyasu Komori, Takeshi Mishima, Yoshihiko Kawai, Takahiro Mochizuki, Shoei Sato, Tetsuji Ogawa:
Text-Only Domain Adaptation Based on Intermediate CTC. 2208-2212 - Jenthe Thienpondt
, Kris Demuynck:
Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping. 2213-2217 - Yuki Takashima, Shota Horiguchi, Shinji Watanabe
, Leibny Paola García-Perera, Yohei Kawaguchi:
Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models. 2218-2222
Speaker and Language Recognition I
- Jeong-Hwan Choi
, Joon-Young Yang, Ye-Rin Jeoung, Joon-Hyuk Chang:
Improved CNN-Transformer using Broadcasted Residual Learning for Text-Independent Speaker Verification. 2223-2227 - Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung:
Pushing the limits of raw waveform speaker recognition. 2228-2232 - Hexin Liu, Leibny Paola García-Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur:
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification. 2233-2237 - Moakala Tzudir, Priyankoo Sarmah, S. R. Mahadeva Prasanna:
Prosodic Information in Dialect Identification of a Tonal Language: The case of Ao. 2238-2242 - Wo Jae Lee, Emanuele Coviello:
A Multimodal Strategy for Singing Language Identification. 2243-2247
Pathological Speech Analysis
- Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Margherita Fabbri, Anne Pavy-Le Traon, Olivier Rascol, Virginie Woisard, Wassilios G. Meissner
:
A comparative study on vowel articulation in Parkinson's disease and multiple system atrophy. 2248-2252 - Luc Ardaillon, Nathalie Henrich Bernardoni, Olivier Perrotin:
Voicing decision based on phonemes classification and spectral moments for whisper-to-speech conversion. 2253-2257 - Tanya Talkar, Christina Manxhari, James J. Williamson, Kara M. Smith, Thomas F. Quatieri:
Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks. 2258-2262 - Kelvin Tran, Lingfeng Xu
, Gabriela Stegmann, Julie Liss, Visar Berisha, Rene Utianski:
Investigating the Impact of Speech Compression on the Acoustics of Dysarthric Speech. 2263-2267 - Avamarie Brueggeman, John H. L. Hansen:
Speaker Trait Enhancement for Cochlear Implant Users: A Case Study for Speaker Emotion Perception. 2268-2272 - Neha Reddy
, Yoonjeong Lee
, Zhaoyan Zhang, Dinesh K. Chhetri:
Optimal thyroplasty implant shape and stiffness for treatment of acute unilateral vocal fold paralysis: Evidence from a canine in vivo phonation model. 2273-2277
Cross/Multi-lingual ASR
- Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli:
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. 2278-2282 - Janine Rugayan, Torbjørn Svendsen
, Giampiero Salvi
:
Semantically Meaningful Metrics for Norwegian ASR Systems. 2283-2287 - Ondrej Klejch, Electra Wallington, Peter Bell:
Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR. 2288-2292 - Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, Amrith Krishna, Ganesh Ramakrishnan, Pawan Goyal, Preethi Jyothi:
Linguistically Informed Post-processing for ASR Error correction in Sanskrit. 2293-2297 - Mahir Morshed, Mark Hasegawa-Johnson:
Cross-lingual articulatory feature information transfer for speech recognition using recurrent progressive neural networks. 2298-2302
Speaking Styles and Interaction Styles I
- Diego Aguirre, Nigel G. Ward, Jonathan E. Avila, Heike Lehnert-LeHouillier:
Comparison of Models for Detecting Off-Putting Speaking Styles. 2303-2307 - Seiya Kawano, Muteki Arioka, Akishige Yuguchi, Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara, Satoshi Nakamura, Koichiro Yoshino:
Multimodal Persuasive Dialogue Corpus using Teleoperated Android. 2308-2312 - Yookyung Shin, Younggun Lee
, Suhee Jo, Yeongtae Hwang, Taesu Kim:
Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS. 2313-2317 - Adaeze O. Adigwe
, Esther Klabbers:
Strategies for developing a Conversational Speech Dataset for Text-To-Speech Synthesis. 2318-2322 - Xiyuan Gao, Shekhar Nayak, Matt Coler:
Deep CNN-based Inductive Transfer Learning for Sarcasm Detection in Speech. 2323-2327
Speaking Styles and Interaction Styles II
- Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda:
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue. 2328-2332 - Amber Afshan, Abeer Alwan:
Attention-based conditioning methods using variable frame rate for style-robust speaker verification. 2333-2337 - Amber Afshan, Abeer Alwan:
Learning from human perception to improve automatic speaker verification in style-mismatched conditions. 2338-2342 - Katariina Martikainen, Jussi Karlgren
, Khiet Truong:
Exploring audio-based stylistic variation in podcasts. 2343-2347
Speech Synthesis: Tools, Data, and Evaluation
- Kamil Deja
, Ariadna Sánchez, Julian Roth, Marius Cotescu
:
Automatic Evaluation of Speaker Similarity. 2348-2352 - Ziyao Zhang, Alessio Falai, Ariadna Sánchez, Orazio Angelini, Kayoko Yanagisawa:
Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS). 2353-2357 - Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari:
J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis. 2358-2362 - Jacob Webber, Samuel K. Lo, Isaac L. Bleaman:
REYD - The First Yiddish Text-to-Speech Dataset and System. 2363-2367 - Marcel de Korte, Jaebok Kim, Aki Kunikoshi, Adaeze Adigwe, Esther Klabbers:
Data-augmented cross-lingual synthesis in a teacher-student framework. 2368-2372 - Ayushi Pandey, Sébastien Le Maguer, Julie Carson-Berndsen
, Naomi Harte
:
Production characteristics of obstruents in WaveNet and older TTS systems. 2373-2377 - Sébastien Le Maguer
, Simon King, Naomi Harte
:
Back to the Future: Extending the Blizzard Challenge 2013. 2378-2382 - Josh Meyer, David Ifeoluwa Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack
, Julian Weber, Salomon Kabongo, Elizabeth Salesky
, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Samuel Olanrewaju, Jesujoba O. Alabi, Shamsuddeen Hassan Muhammad:
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus. 2383-2387 - Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis:
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis. 2388-2392
Acoustic Signal Representation and Analysis II
- Byeonggeun Kim, Seunghan Yang, Jangho Kim, Hyunsin Park, Juntae Lee, Simyung Chang:
Domain Generalization with Relaxed Instance Frequency-wise Normalization for Multi-device Acoustic Scene Classification. 2393-2397 - Rui Tao, Long Yan, Kazushige Ouchi, Xiangdong Wang:
Couple learning for semi-supervised sound event detection. 2398-2402 - Rajeev Rajan, Ananya Ayasi
:
Oktoechos Classification in Liturgical Music Using SBU-LSTM/GRU. 2403-2407 - Yuhang He, Andrew Markham:
SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms. 2408-2412 - Christian Bergler, Alexander Barnhill
, Dominik Perrin, Manuel Schmitt, Andreas K. Maier, Elmar Nöth:
ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning. 2413-2417 - Joon-Hyuk Chang, Won-Gook Choi:
Convolutional Recurrent Neural Network with Auxiliary Stream for Robust Variable-Length Acoustic Scene Classification. 2418-2422 - Shahaf Bassan, Yossi Adi, Jeffrey S. Rosenschein:
Unsupervised Symbolic Music Segmentation using Ensemble Temporal Prediction Errors. 2423-2427 - Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha:
Visually-aware Acoustic Event Detection using Heterogeneous Graphs. 2428-2432 - Arshdeep Singh
, Mark D. Plumbley:
A Passive Similarity based CNN Filter Pruning for Efficient Acoustic Scene Classification. 2433-2437 - Alan Baade, Puyuan Peng, David Harwath:
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer. 2438-2442
Speech and Language in Health: From Remote Monitoring to Medical Conversations II
- Sebastian Peter Bayerl, Gabriel Roccabruna, Shammur Absar Chowdhury, Tommaso Ciulli, Morena Danieli, Korbinian Riedhammer
, Giuseppe Riccardi:
What can Speech and Language Tell us About the Working Alliance in Psychotherapy. 2443-2447 - Geoffrey T. Frost, Grant Theron
, Thomas Niesler:
TB or not TB? Acoustic cough analysis for tuberculosis classification. 2448-2452 - Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, Julie Liss:
Are reported accuracies in the clinical speech machine learning literature overoptimistic? 2453-2457 - Bahman Mirheidari, André Bittar, Nicholas Cummins
, Johnny Downs
, Helen L. Fisher, Heidi Christensen
:
Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and Opportunities. 2458-2462 - Bahman Mirheidari, Daniel Blackburn
, Heidi Christensen
:
Automatic cognitive assessment: Combining sparse datasets with disparate cognitive scores. 2463-2467 - Ting Dang
, Thomas Quinnell, Cecilia Mascolo:
Exploring Semi-supervised Learning for Audio-based COVID-19 Detection using FixMatch. 2468-2472 - Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma
, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K. K, Sadhana Gonuguntla, Murali Alagesan:
Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals. 2473-2477 - Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer
:
Automated Evaluation of Standardized Dementia Screening Tests. 2478-2482 - Paula Andrea Pérez-Toro
, Philipp Klumpp, Abner Hernandez, Tomas Arias
, Patricia Lillo, Andrea Slachevsky, Adolfo Martín García
, Maria Schuster, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave
:
Alzheimer's Detection from English to Spanish Using Acoustic and Linguistic Embeddings. 2483-2487 - Jing Su, Longxiang Zhang, Hamid Reza Hassanzadeh, Thomas Schaaf:
Extract and Abstract with BART for Clinical Notes from Doctor-Patient Conversations. 2488-2492 - Bishal Lamichhane
, Nidal Moukaddam, Ankit B. Patel, Ashutosh Sabharwal:
Dyadic Interaction Assessment from Free-living Audio for Depression Severity Assessment. 2493-2497 - Venkata Srikanth Nallanthighal, Aki Härmä
, Helmer Strik:
COVID-19 detection based on respiratory sensing from speech. 2498-2502
Dereverberation and Echo Cancellation
- Xiaoxue Luo, Chengshi Zheng, Andong Li, Yuxuan Ke, Xiaodong Li:
Bifurcation and Reunion: A Loss-Guided Two-Stage Approach for Monaural Speech Dereverberation. 2503-2507 - Linjuan Cheng, Chengshi Zheng, Andong Li, Yuquan Wu, Renhua Peng, Xiaodong Li:
A deep complex multi-frame filtering network for stereophonic acoustic echo cancellation. 2508-2512 - Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li:
Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation. 2513-2517 - Shimin Zhang, Ziteng Wang, Yukai Ju, Yihui Fu, Yueyue Na, Qiang Fu, Lei Xie:
Personalized Acoustic Echo Cancellation for Full-duplex Communications. 2518-2522 - Chenggang Zhang, Jinjiang Liu, Xueliang Zhang:
LCSM: A Lightweight Complex Spectral Mapping Framework for Stereophonic Acoustic Echo Cancellation. 2523-2527 - Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu:
Joint Neural AEC and Beamforming with Double-Talk Detection. 2528-2532 - Karim Helwani, Erfan Soltanmohammadi, Michael Mark Goodwin, Arvindh Krishnaswamy:
Clock Skew Robust Acoustic Echo Cancellation. 2533-2537 - Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein:
A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy. 2538-2542 - Vinay Kothapally, John H. L. Hansen:
Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation. 2543-2547
Voice Conversion and Adaptation III
- Liumeng Xue, Shan Yang, Na Hu, Dan Su, Lei Xie:
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers. 2548-2552 - Sicheng Yang
, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang
, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng:
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion. 2553-2557 - Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang:
FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion. 2558-2562 - Yi Lei, Shan Yang, Jian Cong, Lei Xie, Dan Su:
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion. 2563-2567 - Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu:
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios. 2568-2572 - Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng:
Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis. 2573-2577 - Haoquan Yang, Liqun Deng, Yu Ting Yeung, Nianzu Zheng, Yong Xu:
Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion. 2578-2582 - Tuan-Nam Nguyen, Ngoc-Quan Pham, Alexander Waibel:
Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion. 2583-2587 - Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André
, Nori Jacoby:
VoiceMe: Personalized voice generation in TTS. 2588-2592 - Ruibin Yuan, Yuxuan Wu, Jacob Li, Jaxter Kim:
DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion. 2593-2597 - Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu:
Towards Improved Zero-shot Voice Conversion with Conditional DSVAE. 2598-2602 - Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li:
Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion. 2603-2607
Novel Models and Training Methods for ASR III
- Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li
, Xie Chen, Yu Wu, Yifan Gong:
Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition. 2608-2612 - Ye-Qian Du, Jie Zhang, Qiu-Shi Zhu, Lirong Dai, Ming-Hui Wu, Xin Fang, Zhou-Wang Yang:
A Complementary Joint Training Approach Using Unpaired Speech and Text. 2613-2617 - Xun Gong, Zhikai Zhou, Yanmin Qian:
Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition. 2618-2622 - Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng:
Confidence Score Based Conformer Speaker Adaptation for Speech Recognition. 2623-2627 - Han Zhu
, Jindong Wang
, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan:
Decoupled Federated Learning for ASR with Non-IID Data. 2628-2632 - Sanli Tian, Keqi Deng, Zehan Li, Lingxuan Ye, Gaofeng Cheng, Ta Li, Yonghong Yan:
Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning. 2633-2637 - Xiaodong Cui, George Saon
, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata:
Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing. 2638-2642 - Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li
, Shujie Liu, Furu Wei:
Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training. 2643-2647 - Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei:
Speech Pre-training with Acoustic Piece. 2648-2652 - Bowen Zhang, Songjun Cao, Xiaoming Zhang, Yike Zhang, Long Ma, Takahiro Shinozaki:
Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training. 2653-2657 - Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li
, Yao Qian, Furu Wei:
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data. 2658-2662 - Ramit Sawhney, Megh Thakkar, Vishwa Shah, Puneet Mathur, Vasu Sharma, Dinesh Manocha:
PISA: PoIncaré Saliency-Aware Interpolative Augmentation. 2663-2667 - Muqiao Yang, Ian R. Lane, Shinji Watanabe
:
Online Continual Learning of End-to-End Speech Recognition Models. 2668-2672 - Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki:
Streaming Target-Speaker ASR with Neural Transducer. 2673-2677 - Arjit Jain, Pranay Reddy Samala, Deepak Mittal, Preethi Jyothi, Maneesh Singh:
SPLICEOUT: A Simple and Efficient Audio Augmentation Method. 2678-2682
Spoken Language Modeling and Understanding
- Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas, Hong-Kwang Kuo, Brian Kingsbury:
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems. 2683-2687 - Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida:
Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion. 2688-2692 - Jingjing Dong, Jiayi Fu, Peng Zhou, Hao Li, Xiaorui Wang:
Improving Spoken Language Understanding with Cross-Modal Contrastive Learning. 2693-2697 - Anderson R. Avila, Khalil Bibi, Rui Heng Yang, Xinlin Li, Chao Xing, Xiao Chen:
Low-bit Shift Network for End-to-End Spoken Language Understanding. 2698-2702 - Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang:
Meta Auxiliary Learning for Low-resource Spoken Language Understanding. 2703-2707 - Ye Wang, Baishun Ling, Yanmeng Wang, Junhao Xue, Shaojun Wang, Jing Xiao:
Adversarial Knowledge Distillation For Robust Spoken Language Understanding. 2708-2712 - Yangyang Ou, Peng Zhang, Jing Zhang, Hui Gao, Xing Ma:
Incorporating Dual-Aware with Hierarchical Interactive Memory Networks for Task-Oriented Dialogue. 2713-2717 - Yuntao Li, Hanchu Zhang, Yutian Li, Sirui Wang, Wei Wu, Yan Zhang:
Pay More Attention to History: A Context Modeling Strategy for Conversational Text-to-SQL. 2718-2722 - Yuntao Li, Can Xu, Huang Hu, Lei Sha
, Yan Zhang, Daxin Jiang
:
Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning. 2723-2727 - Marco Dinarelli, Marco Naguib, François Portet
:
Toward Low-Cost End-to-End Spoken Language Understanding. 2728-2732 - Eleftherios Kapelonis, Efthymios Georgiou
, Alexandros Potamianos:
A Multi-Task BERT Model for Schema-Guided Dialogue State Tracking. 2733-2737 - Heting Gao, Junrui Ni
, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson:
WavPrompt: Towards Few-Shot Spoken Language Understanding with Frozen Language Models. 2738-2742 - Asahi Ogushi, Toshiki Onishi
, Yohei Tahara, Ryo Ishii, Atsushi Fukayama, Takao Nakamura, Akihiro Miyata:
Analysis of praising skills focusing on utterance contents. 2743-2747 - Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang:
Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech. 2748-2752
Acoustic Signal Representation and Analysis I
- Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer:
Efficient Training of Audio Transformers with Patchout. 2753-2757 - Mayank Sharma, Tarun Gupta, Kenny Qiu, Xiang Hao, Raffay Hamid:
CNN-based Audio Event Recognition for Automated Violence Classification and Rating for Prime Video Content. 2758-2762 - Hyeonuk Nam, Seong-Hu Kim, Byeong-Yun Ko, Yong-Hwa Park
:
Frequency Dynamic Convolution: Frequency-Adaptive Pattern Recognition for Sound Event Detection. 2763-2767 - Zohreh Mostaani, Mathew Magimai-Doss:
On Breathing Pattern Information in Synthetic Speech. 2768-2772 - Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng:
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning. 2773-2777 - Yuya Yamamoto, Juhan Nam
, Hiroko Terasawa:
Deformable CNN and Imbalance-Aware Feature Learning for Singing Technique Classification. 2778-2782
Privacy and Security in Speech Communication
- Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, Konstantin Böttinger:
Does Audio Deepfake Detection Generalize? 2783-2787 - Nicolas M. Müller, Franziska Dieckmann, Jennifer Williams:
Attacker Attribution of Audio Deepfakes. 2788-2792 - Pierre Champion, Anthony Larcher, Denis Jouvet:
Are disentangled representations all you need to build speaker anonymization systems? 2793-2797 - Francisco Teixeira
, Alberto Abad
, Bhiksha Raj, Isabel Trancoso
:
Towards End-to-End Private Automatic Speaker Recognition. 2798-2802 - Ehsan Amid, Om Dipakbhai Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays:
Extracting Targeted Training Data from ASR Models, and How to Mitigate It. 2803-2807 - W. Ronny Huang, Steve Chien, Om Dipakbhai Thakkar, Rajiv Mathews:
Detecting Unintended Memorization in Language-Model-Fused ASR. 2808-2812
Multimodal Systems
- Shuta Taniguchi, Tsuneo Kato, Akihiro Tamura, Keiji Yasuda:
Transformer-Based Automatic Speech Recognition with Auxiliary Input of Source Language Text Toward Transcribing Simultaneous Interpretation. 2813-2817 - Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid:
AVATAR: Unconstrained Audiovisual Speech Recognition. 2818-2822 - Puyuan Peng, David Harwath:
Word Discovery in Visually Grounded, Self-Supervised Speech Models. 2823-2827 - Richard Rose, Olivier Siohan:
End-to-End multi-talker audio-visual ASR using an active speaker attention module. 2828-2832 - Dmitriy Serdyuk, Otavio Braga, Olivier Siohan:
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video. 2833-2837 - Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro:
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition. 2838-2842
Atypical Speech Detection
- John B. Harvill, Mark Hasegawa-Johnson, Chang D. Yoo:
Frame-Level Stutter Detection. 2843-2847 - Darshana Priyasad, Andi Partovi, Sridha Sridharan, Maryam Kashefpoor, Tharindu Fernando
, Simon Denman, Clinton Fookes, Jia Tang, David Kaye:
Detecting Heart Failure Through Voice Analysis using Self-Supervised Mode-Based Memory Fusion. 2848-2852 - Si Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee
:
Automatic Detection of Speech Sound Disorder in Child Speech Using Posterior-based Speaker Representations. 2853-2857 - Dominika Woszczyk, Anna Hlédiková, Alican Akman, Soteris Demetriou, Björn W. Schuller:
Data Augmentation for Dementia Detection in Spoken Language. 2858-2862 - Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, Amir Hossein Poorjam, Deepak Mittal, Maneesh Singh:
Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection. 2863-2867 - Sebastian Peter Bayerl, Dominik Wagner, Elmar Nöth, Korbinian Riedhammer
:
Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0. 2868-2872
Spoofing-Aware Automatic Speaker Verification (SASV) I
- Jeong-Hwan Choi
, Joon-Young Yang, Ye-Rin Jeoung, Joon-Hyuk Chang:
HYU Submission for the SASV Challenge 2022: Reforming Speaker Embeddings with Spoofing-Aware Conditioning. 2873-2877 - Jungwoo Heo, Ju-Ho Kim, Hyun-seo Shin:
Two Methods for Spoofing-Aware Speaker Verification: Multi-Layer Perceptron Score Fusion Model and Integrated Embedding Projector. 2878-2882 - Chang Zeng
, Lin Zhang
, Meng Liu, Junichi Yamagishi:
Spoofing-Aware Attention based ASV Back-end with Multiple Enrollment Utterances and a Sampling Strategy for the SASV Challenge 2022. 2883-2887 - Alexander Alenin, Nikita Torgashov, Anton Okhotnikov, Rostislav Makarov, Ivan Yakovlev:
A Subnetwork Approach for Spoofing Aware Speaker Verification. 2888-2892 - Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas W. D. Evans, Tomi Kinnunen:
SASV 2022: The First Spoofing-Aware Speaker Verification Challenge. 2893-2897 - Jin Woo Lee
, Eungbeom Kim, Junghyun Koo, Kyogu Lee:
Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification. 2898-2902
Single-channel and multi-channel Speech Enhancement
- Nils L. Westhausen, Bernd T. Meyer:
tPLCnet: Real-time Deep Packet Loss Concealment in the Time Domain Using a Short Temporal Context. 2903-2907 - Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann:
On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement. 2908-2912 - Haoyu Li, Junichi Yamagishi:
DDS: A new device-degraded speech dataset for speech enhancement. 2913-2917 - Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
:
Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments. 2918-2922 - Julitta Bartolewska, Stanislaw Kacprzak
, Konrad Kowalczyk
:
Refining DNN-based Mask Estimation using CGMM-based EM Algorithm for Multi-channel Noise Reduction. 2923-2927 - Simon Welker
, Julius Richter, Timo Gerkmann:
Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain. 2928-2932 - Mohamed Nabih Ali
, Alessio Brutti, Daniele Falavigna:
Enhancing Embeddings for Speech Classification in Noisy Conditions. 2933-2937 - Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg:
Deep Audio Waveform Prior. 2938-2942 - Mieszko Fras, Marcin Witkowski
, Konrad Kowalczyk
:
Convolutive Weighted Multichannel Wiener Filter Front-end for Distant Automatic Speech Recognition in Reverberant Multispeaker Scenarios. 2943-2947 - Danilo de Oliveira
, Tal Peer, Timo Gerkmann:
Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes. 2948-2952 - Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe
, Bhiksha Raj:
Improving Speech Enhancement through Fine-Grained Speech Characteristics. 2953-2957
Voice Conversion and Adaptation II
- Piotr Bilinski
, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa:
Creating New Voices using Normalizing Flows. 2958-2962 - Ariadna Sánchez, Alessio Falai, Ziyao Zhang, Orazio Angelini, Kayoko Yanagisawa:
Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS). 2963-2967 - Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari:
Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS. 2968-2972 - Magdalena Proszewska, Grzegorz Beringer, Daniel Sáez-Trigueros, Thomas Merritt, Abdelhamid Ezzerg, Roberto Barra-Chicote:
GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion. 2973-2977 - Jaeuk Lee, Joon-Hyuk Chang:
One-Shot Speaker Adaptation Based on Initialization by Generative Adversarial Networks for TTS. 2978-2982 - Alon Levkovitch, Eliya Nachmani, Lior Wolf:
Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models. 2983-2987 - Jaeuk Lee, Joon-Hyuk Chang:
Advanced Speaker Embedding with Predictive Variance of Gaussian Distribution for Speaker Adaptation in TTS. 2988-2992 - Panagiotis Kakoulidis
, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris:
Karaoker: Alignment-free singing voice synthesis with speech training data. 2993-2997 - Ji Sub Um, Yeunju Choi, Hoi Rin Kim:
ACNN-VC: Utilizing Adaptive Convolution Neural Network for One-Shot Voice Conversion. 2998-3002 - Tasnima Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, Mikhail A. Kudinov, Jiansheng Wei:
A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling. 3003-3007 - Tae-Woo Kim, Min-Su Kang, Gyeong-Hoon Lee:
Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis. 3008-3012 - Shrutina Agarwal, Naoya Takahashi, Sriram Ganapathy:
Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer. 3013-3017