Publications

2024

Textless Speech-to-Speech Translation with Limited Parallel Data
Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
Findings of EMNLP 2024

Measuring Sound Symbolism in Audio-Visual Models
Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
SLT 2024

Self-Supervised Speech Models for Word-Level Stuttered Speech Detection
Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G Dimakis, David Harwath
SLT 2024

Interface Design for Self-Supervised Speech Models
Yi-Jen Shih, David Harwath
Interspeech 2024

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
ECCV 2024

Multimodal Contextualized Semantic Parsing from Speech
Jordan Voas, Raymond Mooney, David Harwath
ACL 2024

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
ICASSP 2024

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
ACL 2024

SpeechCLIP+: Self-Supervised Multi-Task Representation Learning for Speech via CLIP and Speech-Image Data
Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
ICASSP 2024 Workshops

Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model
Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
ICASSP 2024

BAT: Learning to Reason About Spatial Sounds with Large Language Models
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
ICML 2024

Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals
Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang
Interspeech 2024

Neural Codec Language Models for Disentangled and Textless Voice Conversion
Alan Baade, Puyuan Peng, David Harwath
Interspeech 2024

Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation
Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz
Interspeech 2024

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
CVPR 2024

2023

Audio-Visual Neural Syntax Acquisition
Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
ASRU 2023

Learning to Map Efficiently by Active Echolocation
Xixi Hu, Senthil Purushwalkam, David Harwath, Kristen Grauman
IROS 2023

When to Use Efficient Self Attention? Profiling Text, Speech, and Image Transformer Variants
Anuj Diwan, Eunsol Choi, David Harwath
ACL 2023

A Dataset for Foreground Speech Analysis with Smartwatches in Everyday Home Environments
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
ICASSP 2023 Workshops

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models
Reem Gody, David Harwath
ICASSP 2023

Continual Learning for On-Device Speech Recognition Using Disentangled Conformers
Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed
ICASSP 2023

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech-to-Image Retrieval
Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
ICASSP 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
ICASSP 2023

Learning Audio-Visual Dereverberation
Changan Chen, Wei Sun, David Harwath, Kristen Grauman
ICASSP 2023

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Interspeech 2023

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
Interspeech 2023

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
Interspeech 2023

Subject Generalization in Classifying Imagined and Spoken Speech with MEG
Debadatta Dash, Paul Ferrari, Abbas Babajani-Feremi, David Harwath, Amir Borna, Jun Wang
IEEE/EMBS NER 2023

2022

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
SLT 2022

Phoneme Segmentation Using Self-Supervised Speech Models
Luke Strgar, David Harwath
SLT 2022

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald
EMNLP 2022

Contrastive Audio-Visual Masked Autoencoder
Yuan Gong, Andrew Rouditchenko, Alexander H Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
ICLR 2023

Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings
Christopher Song, David Harwath, Tuka Alhanai, James Glass
LREC 2022

Adversarial Input Ablation for Audio-Visual Learning
David Xu, David Harwath
ICASSP 2022

Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng, David Harwath
ICASSP 2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
arXiv preprint

Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
arXiv preprint

Automated Detection of Foreground Speech with Wearable Sensing in Everyday Home Environments: A Transfer Learning Approach
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
arXiv preprint

Everything at Once – Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
CVPR 2022

2021

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng, David Harwath
AAAI 2021 SAS Workshop

Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
arXiv preprint

Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Interspeech 2021

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
CVPR 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
ICCV 2021

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
ACL 2021

AVLNet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Interspeech 2021

2020

Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
ICASSP 2020

Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
Interspeech 2020

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath, Wei-Ning Hsu, James Glass
ICLR 2020

2019

Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu, David Harwath, and James Glass
Interspeech 2019

Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
Emmanuel Azuh, David Harwath, and James Glass
Interspeech 2019

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
IJCV, August 2019

Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, and Antonio Torralba
CVPR 2019

Towards Visually Grounded Sub-Word Speech Unit Discovery
David Harwath and James Glass
ICASSP 2019

Grounding Spoken Words in Unlabeled Video
Angie Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass
CVPR Sight and Sound Workshop 2019

2018

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
ECCV 2018

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath, Galen Chuang, and James Glass
ICASSP 2018

2017

Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath and James Glass
ACL 2017

2016

Unsupervised Learning of Spoken Language with Visual Context
David Harwath, Antonio Torralba, and James R. Glass
NeurIPS 2016

Look, Listen, and Decode: Multimodal Speech Recognition with Images
Felix Sun, David Harwath, and James R. Glass
SLT 2016

On the Use of Acoustic Unit Discovery for Language Recognition
Stephen Shum, David Harwath, Najim Dehak, and James Glass
IEEE TASLP, September 2016

2015

Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath and James Glass
ASRU 2015

2014

Speech Recognition Without a Lexicon - Bridging the Gap Between Graphemic and Phonetic Systems
David Harwath and James Glass
Interspeech 2014

Choosing Useful Word Alternates for Automatic Speech Recognition Correction Interfaces
David Harwath, Alexander Gruenstein, and Ian McGraw
Interspeech 2014

2013

Zero Resource Spoken Audio Corpus Analysis
David Harwath, Timothy J. Hazen, and James Glass
ICASSP 2013

A Summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachael Richardson, Thomas Schatz, Samuel Thomas
ICASSP 2013

2012

Topic Identification Based Extrinsic Evaluation of Summarization Techniques Applied to Conversational Speech
David Harwath and Timothy J. Hazen
ICASSP 2012

2010

Phonetic Landmark Detection for Automatic Language Identification
David Harwath and Mark Hasegawa-Johnson
Speech Prosody 2010