Adversarial Input Ablation for Audio-Visual Learning
David Xu, David Harwath
ICASSP 2022
Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng, David Harwath
ICASSP 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
arXiv preprint
Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
arXiv preprint
Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
arXiv preprint
Everything at Once – Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
CVPR 2022
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng, David Harwath
AAAI 2022 SAS Workshop
Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
arXiv preprint
Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Interspeech 2021
Learning Audio-Visual Dereverberation
Changan Chen, Wei Sun, David Harwath, Kristen Grauman
arXiv preprint
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
CVPR 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
ICCV 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
ACL 2021
AVLNet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Interspeech 2021
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
ICASSP 2020
Pair Expansion for Learning Multilingual Semantic Embeddings using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
Interspeech 2020
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath, Wei-Ning Hsu, James Glass
ICLR 2020
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu, David Harwath, and James Glass
Interspeech 2019
Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
Emmanuel Azuh, David Harwath, and James Glass
Interspeech 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
IJCV, August 2019
Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, and Antonio Torralba
CVPR 2019
Towards Visually Grounded Sub-Word Speech Unit Discovery
David Harwath and James Glass
ICASSP 2019
Grounding Spoken Words in Unlabeled Video
Angie Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass
CVPR Sight and Sound Workshop 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
ECCV 2018
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath, Galen Chuang, and James Glass
ICASSP 2018
Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath and James Glass
ACL 2017
Unsupervised Learning of Spoken Language with Visual Context
David Harwath, Antonio Torralba, and James R. Glass
NeurIPS 2016
Look, Listen, and Decode: Multimodal Speech Recognition with Images
Felix Sun, David Harwath, and James R. Glass
SLT 2016
On the Use of Acoustic Unit Discovery for Language Recognition
Stephen Shum, David Harwath, Najim Dehak, and James Glass
IEEE TASLP, September 2016
Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath and James Glass
ASRU 2015
Speech Recognition Without a Lexicon - Bridging the Gap Between Graphemic and Phonetic Systems
David Harwath and James Glass
Interspeech 2014
Choosing Useful Word Alternates for Automatic Speech Recognition Correction Interfaces
David Harwath, Alexander Gruenstein, and Ian McGraw
Interspeech 2014
Zero Resource Spoken Audio Corpus Analysis
David Harwath, Timothy J. Hazen, and James Glass
ICASSP 2013
A Summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachael Richardson, Thomas Schatz, Samuel Thomas
ICASSP 2013
Topic Identification Based Extrinsic Evaluation of Summarization Techniques Applied to Conversational Speech
David Harwath and Timothy J. Hazen
ICASSP 2012
Phonetic Landmark Detection for Automatic Language Identification
David Harwath and Mark Hasegawa-Johnson
Speech Prosody 2010