PyTorch implementation for P. Peng and D. Harwath, “Fast-Slow Transformer for Visually Grounding Speech”
PyTorch implementation for A. Baade, P. Peng, D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer”
PyTorch implementation for D. Harwath, W-N. Hsu, and J. Glass, “Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech,” Proc. ICLR, 2020
PyTorch implementation for D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass, “Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input,” Proc. of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, September 2018
Members of our lab have been involved in the collection of several datasets of spoken image captions, including:
To download these datasets, please visit https://groups.csail.mit.edu/sls/downloads/placesaudio/