Publications

2022

Adversarial Input Ablation for Audio-Visual Learning
David Xu, David Harwath
ICASSP 2022

Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng, David Harwath
ICASSP 2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
arXiv preprint

Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
arXiv preprint

Automated Detection of Foreground Speech with Wearable Sensing in Everyday Home Environments: A Transfer Learning Approach
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
arXiv preprint

Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
CVPR 2022

2021

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng, David Harwath
AAAI 2021 SAS Workshop

Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
arXiv preprint

Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Interspeech 2021

Learning Audio-Visual Dereverberation
Changan Chen, Wei Sun, David Harwath, Kristen Grauman
arXiv preprint

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
CVPR 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
ICCV 2021

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
ACL 2021

AVLNet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Interspeech 2021

2020

Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
ICASSP 2020

Pair Expansion for Learning Multilingual Semantic Embeddings using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
Interspeech 2020

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath, Wei-Ning Hsu, James Glass
ICLR 2020

2019

Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu, David Harwath, and James Glass
Interspeech 2019

Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
Emmanuel Azuh, David Harwath, and James Glass
Interspeech 2019

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
IJCV, August 2019

Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, and Antonio Torralba
CVPR 2019

Towards Visually Grounded Sub-Word Speech Unit Discovery
David Harwath and James Glass
ICASSP 2019

Grounding Spoken Words in Unlabeled Video
Angie Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass
CVPR Sight and Sound Workshop 2019

2018

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
ECCV 2018

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath, Galen Chuang, and James Glass
ICASSP 2018

2017

Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath and James Glass
ACL 2017

2016

Unsupervised Learning of Spoken Language with Visual Context
David Harwath, Antonio Torralba, and James R. Glass
NeurIPS 2016

Look, Listen, and Decode: Multimodal Speech Recognition with Images
Felix Sun, David Harwath, and James R. Glass
SLT 2016

On the Use of Acoustic Unit Discovery for Language Recognition
Stephen Shum, David Harwath, Najim Dehak, and James Glass
IEEE TASLP, September 2016

2015

Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath and James Glass
ASRU 2015

2014

Speech Recognition Without a Lexicon - Bridging the Gap Between Graphemic and Phonetic Systems
David Harwath and James Glass
Interspeech 2014

Choosing Useful Word Alternates for Automatic Speech Recognition Correction Interfaces
David Harwath, Alexander Gruenstein, and Ian McGraw
Interspeech 2014

2013

Zero Resource Spoken Audio Corpus Analysis
David Harwath, Timothy J. Hazen, and James Glass
ICASSP 2013

A Summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachael Richardson, Thomas Schatz, Samuel Thomas
ICASSP 2013

2012

Topic Identification Based Extrinsic Evaluation of Summarization Techniques Applied to Conversational Speech
David Harwath and Timothy J. Hazen
ICASSP 2012

2010

Phonetic Landmark Detection for Automatic Language Identification
David Harwath and Mark Hasegawa-Johnson
Speech Prosody 2010