Publications

2024

Textless Speech-to-Speech Translation with Limited Parallel Data
Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
Findings of EMNLP 2024

Measuring Sound Symbolism in Audio-Visual Models
Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
SLT 2024

Self-Supervised Speech Models for Word-Level Stuttered Speech Detection
Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G Dimakis, David Harwath
SLT 2024

Interface Design for Self-Supervised Speech Models
Yi-Jen Shih, David Harwath
Interspeech 2024

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
ECCV 2024

Multimodal Contextualized Semantic Parsing from Speech
Jordan Voas, Raymond Mooney, David Harwath
ACL 2024

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
ICASSP 2024

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
ACL 2024

SpeechCLIP+: Self-Supervised Multi-Task Representation Learning for Speech via CLIP and Speech-Image Data
Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
ICASSP 2024 Workshops

Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model
Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
ICASSP 2024

BAT: Learning to Reason About Spatial Sounds with Large Language Models
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
ICML 2024

Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals
Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang
Interspeech 2024

Neural Codec Language Models for Disentangled and Textless Voice Conversion
Alan Baade, Puyuan Peng, David Harwath
Interspeech 2024

Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation
Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz
Interspeech 2024

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
CVPR 2024

2023

Audio-Visual Neural Syntax Acquisition
Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
ASRU 2023

Learning to Map Efficiently by Active Echolocation
Xixi Hu, Senthil Purushwalkam, David Harwath, Kristen Grauman
IROS 2023

When to Use Efficient Self Attention? Profiling Text, Speech, and Image Transformer Variants
Anuj Diwan, Eunsol Choi, David Harwath
ACL 2023

A Dataset for Foreground Speech Analysis with Smartwatches in Everyday Home Environments
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
ICASSP 2023 Workshops

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models
Reem Gody, David Harwath
ICASSP 2023

Continual Learning for On-Device Speech Recognition Using Disentangled Conformers
Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed
ICASSP 2023

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech-to-Image Retrieval
Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
ICASSP 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
ICASSP 2023

Learning Audio-Visual Dereverberation
Changan Chen, Wei Sun, David Harwath, Kristen Grauman
ICASSP 2023

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Interspeech 2023

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
Interspeech 2023

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
Interspeech 2023

Subject Generalization in Classifying Imagined and Spoken Speech with MEG
Debadatta Dash, Paul Ferrari, Abbas Babajani-Feremi, David Harwath, Amir Borna, Jun Wang
IEEE/EMBS NER 2023

2022

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
SLT 2022

Phoneme Segmentation Using Self-Supervised Speech Models
Luke Strgar, David Harwath
SLT 2022

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald
EMNLP 2022

Contrastive Audio-Visual Masked Autoencoder
Yuan Gong, Andrew Rouditchenko, Alexander H Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
ICLR 2023

Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings
Christopher Song, David Harwath, Tuka Alhanai, James Glass
LREC 2022

Adversarial Input Ablation for Audio-Visual Learning
David Xu, David Harwath
ICASSP 2022

Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng, David Harwath
ICASSP 2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
arXiv preprint

Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
arXiv preprint

Automated Detection of Foreground Speech with Wearable Sensing in Everyday Home Environments: A Transfer Learning Approach
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
arXiv preprint

Everything at Once – Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
CVPR 2022

2021

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng, David Harwath
AAAI 2021 SAS Workshop

Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
arXiv preprint

Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Interspeech 2021

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
CVPR 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
ICCV 2021

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
ACL 2021

AVLNet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Interspeech 2021

2020

Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
ICASSP 2020

Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
Interspeech 2020

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath, Wei-Ning Hsu, James Glass
ICLR 2020

2019

Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu, David Harwath, and James Glass
Interspeech 2019

Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
Emmanuel Azuh, David Harwath, and James Glass
Interspeech 2019

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
IJCV, August 2019

Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, and Antonio Torralba
CVPR 2019

Towards Visually Grounded Sub-Word Speech Unit Discovery
David Harwath and James Glass
ICASSP 2019

Grounding Spoken Words in Unlabeled Video
Angie Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass
CVPR Sight and Sound Workshop 2019

2018

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
ECCV 2018

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath, Galen Chuang, and James Glass
ICASSP 2018

2017

Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath and James Glass
ACL 2017

2016

Unsupervised Learning of Spoken Language with Visual Context
David Harwath, Antonio Torralba, and James R. Glass
NeurIPS 2016

Look, Listen, and Decode: Multimodal Speech Recognition with Images
Felix Sun, David Harwath, and James R. Glass
SLT 2016

On the Use of Acoustic Unit Discovery for Language Recognition
Stephen Shum, David Harwath, Najim Dehak, and James Glass
IEEE TASLP, September 2016

2015

Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath and James Glass
ASRU 2015

2014

Speech Recognition Without a Lexicon - Bridging the Gap Between Graphemic and Phonetic Systems
David Harwath and James Glass
Interspeech 2014

Choosing Useful Word Alternates for Automatic Speech Recognition Correction Interfaces
David Harwath, Alexander Gruenstein, and Ian McGraw
Interspeech 2014

2013

Zero Resource Spoken Audio Corpus Analysis
David Harwath, Timothy J. Hazen, and James Glass
ICASSP 2013

A Summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachael Richardson, Thomas Schatz, Samuel Thomas
ICASSP 2013

2012

Topic Identification Based Extrinsic Evaluation of Summarization Techniques Applied to Conversational Speech
David Harwath and Timothy J. Hazen
ICASSP 2012

2010

Phonetic Landmark Detection for Automatic Language Identification
David Harwath and Mark Hasegawa-Johnson
Speech Prosody 2010