Textless Speech-to-Speech Translation with Limited Parallel Data 
  Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi 
Findings of EMNLP 2024
Measuring Sound Symbolism in Audio-Visual Models 
  Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney 
SLT 2024
Self-Supervised Speech Models for Word-Level Stuttered Speech Detection 
  Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G Dimakis, David Harwath 
SLT 2024
Interface Design for Self-Supervised Speech Models 
  Yi-Jen Shih, David Harwath 
Interspeech 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos 
  Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman 
ECCV 2024
Multimodal Contextualized Semantic Parsing from Speech 
  Jordan Voas, Raymond Mooney, David Harwath 
ACL 2024
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models 
  Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee 
ICASSP 2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild 
  Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath 
ACL 2024
SpeechCLIP+: Self-Supervised Multi-Task Representation Learning for Speech via CLIP and Speech-Image Data 
  Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath 
ICASSP 2024 Workshops
Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model 
  Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath 
ICASSP 2024
BAT: Learning to Reason About Spatial Sounds with Large Language Models 
  Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath 
ICML 2024
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals 
  Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang 
Interspeech 2024
Neural Codec Language Models for Disentangled and Textless Voice Conversion 
  Alan Baade, Puyuan Peng, David Harwath 
Interspeech 2024
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation 
  Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz 
Interspeech 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos 
  Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman 
CVPR 2024
Audio-Visual Neural Syntax Acquisition 
  Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass 
ASRU 2023
Learning to Map Efficiently by Active Echolocation 
  Xixi Hu, Senthil Purushwalkam, David Harwath, Kristen Grauman 
IROS 2023
When to Use Efficient Self Attention? Profiling Text, Speech, and Image Transformer Variants 
  Anuj Diwan, Eunsol Choi, David Harwath 
ACL 2023
A Dataset for Foreground Speech Analysis with Smartwatches in Everyday Home Environments 
  Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz 
ICASSP 2023 Workshops
Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models 
  Reem Gody, David Harwath 
ICASSP 2023
Continual Learning for On-Device Speech Recognition Using Disentangled Conformers 
  Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed 
ICASSP 2023
M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech-to-Image Retrieval 
  Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath 
ICASSP 2023
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval 
  Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass 
ICASSP 2023
Learning Audio-Visual Dereverberation 
  Changan Chen, Wei Sun, David Harwath, Kristen Grauman 
ICASSP 2023
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages 
  Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass 
Interspeech 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model 
  Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath 
Interspeech 2023
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization 
  Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath 
Interspeech 2023
Subject Generalization in Classifying Imagined and Spoken Speech with MEG 
  Debadatta Dash, Paul Ferrari, Abbas Babajani-Feremi, David Harwath, Amir Borna, Jun Wang 
2023 11th International IEEE/EMBS Conference on Neural Engineering (NER)
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model 
  Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath 
SLT 2022
Phoneme Segmentation Using Self-Supervised Speech Models 
  Luke Strgar, David Harwath 
SLT 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality 
  Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald 
EMNLP 2022
Contrastive Audio-Visual Masked Autoencoder 
  Yuan Gong, Andrew Rouditchenko, Alexander H Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass 
ICLR 2022
Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings 
  Christopher Song, David Harwath, Tuka Alhanai, James Glass 
LREC 2022
Adversarial Input Ablation for Audio-Visual Learning 
  David Xu, David Harwath 
ICASSP 2022
Fast-Slow Transformer for Visually Grounding Speech 
  Puyuan Peng, David Harwath 
ICASSP 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer 
  Alan Baade, Puyuan Peng, David Harwath 
arXiv preprint
Word Discovery in Visually Grounded, Self-Supervised Speech Models 
  Puyuan Peng, David Harwath 
arXiv preprint
Automated Detection of Foreground Speech with Wearable Sensing in Everyday Home Environments: A Transfer Learning Approach 
  Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz 
arXiv preprint
Everything at Once – Multi-Modal Fusion Transformer for Video Retrieval 
  Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne 
CVPR 2022
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling 
  Puyuan Peng, David Harwath 
AAAI 2021 SAS Workshop
Routing with Self-Attention for Multimodal Capsule Networks 
  Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah 
arXiv preprint
Cascaded Multilingual Audio-Visual Learning from Videos 
  Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass 
Interspeech 2021
Learning Audio-Visual Dereverberation 
  Changan Chen, Wei Sun, David Harwath, Kristen Grauman 
arXiv preprint
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions 
  Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva 
CVPR 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos 
  Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang 
ICCV 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units 
  Wei-Ning Hsu, David Harwath, Christopher Song, James Glass 
ACL 2021
AVLNet: Learning Audio-Visual Language Representations from Instructional Videos 
  Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass 
Interspeech 2021
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms 
  Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass 
ICASSP 2020
Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets 
  Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass 
Interspeech 2020
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech 
  David Harwath, Wei-Ning Hsu, James Glass 
ICLR 2020