Textless Speech-to-Speech Translation with Limited Parallel Data
Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
Findings of EMNLP 2024
Measuring Sound Symbolism in Audio-Visual Models
Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
SLT 2024
Self-Supervised Speech Models for Word-Level Stuttered Speech Detection
Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G Dimakis, David Harwath
SLT 2024
Interface Design for Self-Supervised Speech Models
Yi-Jen Shih, David Harwath
Interspeech 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
ECCV 2024
Multimodal Contextualized Semantic Parsing from Speech
Jordan Voas, Raymond Mooney, David Harwath
ACL 2024
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
ICASSP 2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
ACL 2024
SpeechCLIP+: Self-Supervised Multi-Task Representation Learning for Speech via CLIP and Speech-Image Data
Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
ICASSP 2024 Workshops
Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model
Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
ICASSP 2024
BAT: Learning to Reason About Spatial Sounds with Large Language Models
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
ICML 2024
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals
Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang
Interspeech 2024
Neural Codec Language Models for Disentangled and Textless Voice Conversion
Alan Baade, Puyuan Peng, David Harwath
Interspeech 2024
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation
Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz
Interspeech 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
CVPR 2024
Audio-Visual Neural Syntax Acquisition
Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
ASRU 2023
Learning to Map Efficiently by Active Echolocation
Xixi Hu, Senthil Purushwalkam, David Harwath, Kristen Grauman
IROS 2023
When to Use Efficient Self Attention? Profiling Text, Speech, and Image Transformer Variants
Anuj Diwan, Eunsol Choi, David Harwath
ACL 2023
A Dataset for Foreground Speech Analysis with Smartwatches in Everyday Home Environments
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
ICASSP 2023 Workshops
Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models
Reem Gody, David Harwath
ICASSP 2023
Continual Learning for On-Device Speech Recognition Using Disentangled Conformers
Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed
ICASSP 2023
M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech-to-Image Retrieval
Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
ICASSP 2023
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
ICASSP 2023
Learning Audio-Visual Dereverberation
Changan Chen, Wei Sun, David Harwath, Kristen Grauman
ICASSP 2023
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Interspeech 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
Interspeech 2023
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
Interspeech 2023
Subject Generalization in Classifying Imagined and Spoken Speech with MEG
Debadatta Dash, Paul Ferrari, Abbas Babajani-Feremi, David Harwath, Amir Borna, Jun Wang
IEEE/EMBS International Conference on Neural Engineering (NER) 2023
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
SLT 2022
Phoneme Segmentation Using Self-Supervised Speech Models
Luke Strgar, David Harwath
SLT 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald
EMNLP 2022
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong, Andrew Rouditchenko, Alexander H Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
ICLR 2023
Speak: A Toolkit Using Amazon Mechanical Turk to Collect and Validate Speech Audio Recordings
Christopher Song, David Harwath, Tuka Alhanai, James Glass
LREC 2022
Adversarial Input Ablation for Audio-Visual Learning
David Xu, David Harwath
ICASSP 2022
Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng, David Harwath
ICASSP 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
arXiv preprint
Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
arXiv preprint
Automated Detection of Foreground Speech with Wearable Sensing in Everyday Home Environments: A Transfer Learning Approach
Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz
arXiv preprint
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
CVPR 2022
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng, David Harwath
AAAI 2022 SAS Workshop
Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
arXiv preprint
Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Interspeech 2021
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
CVPR 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
ICCV 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
ACL 2021
AVLNet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Interspeech 2021
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
ICASSP 2020
Pair Expansion for Learning Multilingual Semantic Embeddings using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
Interspeech 2020
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath, Wei-Ning Hsu, James Glass
ICLR 2020
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu, David Harwath, and James Glass
Interspeech 2019
Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
Emmanuel Azuh, David Harwath, and James Glass
Interspeech 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
IJCV, August 2019
Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, and Antonio Torralba
CVPR 2019
Towards Visually Grounded Sub-Word Speech Unit Discovery
David Harwath and James Glass
ICASSP 2019
Grounding Spoken Words in Unlabeled Video
Angie Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass
CVPR Sight and Sound Workshop 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
ECCV 2018
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath, Galen Chuang, and James Glass
ICASSP 2018
Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath and James Glass
ACL 2017
Unsupervised Learning of Spoken Language with Visual Context
David Harwath, Antonio Torralba, and James R. Glass
NeurIPS 2016
Look, Listen, and Decode: Multimodal Speech Recognition with Images
Felix Sun, David Harwath, and James R. Glass
SLT 2016
On the Use of Acoustic Unit Discovery for Language Recognition
Stephen Shum, David Harwath, Najim Dehak, and James Glass
IEEE TASLP, September 2016
Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath and James Glass
ASRU 2015
Speech Recognition Without a Lexicon - Bridging the Gap Between Graphemic and Phonetic Systems
David Harwath and James Glass
Interspeech 2014
Choosing Useful Word Alternates for Automatic Speech Recognition Correction Interfaces
David Harwath, Alexander Gruenstein, and Ian McGraw
Interspeech 2014
Zero Resource Spoken Audio Corpus Analysis
David Harwath, Timothy J. Hazen, and James Glass
ICASSP 2013
A Summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachael Richardson, Thomas Schatz, Samuel Thomas
ICASSP 2013
Topic Identification Based Extrinsic Evaluation of Summarization Techniques Applied to Conversational Speech
David Harwath and Timothy J. Hazen
ICASSP 2012
Phonetic Landmark Detection for Automatic Language Identification
David Harwath and Mark Hasegawa-Johnson
Speech Prosody 2010