Downloads

Code

PyTorch implementation for P. Peng and D. Harwath, “Fast-Slow Transformer for Visually Grounding Speech,” Proc. ICASSP, 2022

PyTorch implementation for A. Baade, P. Peng, and D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer,” Proc. Interspeech, 2022

PyTorch implementation for D. Harwath, W-N. Hsu, and J. Glass, “Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech,” Proc. ICLR, 2020

PyTorch implementation for D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass, “Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input,” Proc. of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, September 2018

Data

Members of our lab have been involved in the collection of several datasets of spoken image captions, including:

  • Places Audio Captions (English, 400k captions)
  • Places Audio Captions (Hindi, 100k captions)
  • SpokenCOCO (English)
  • Flickr8k Audio Captions (English)

To download these datasets, please visit https://groups.csail.mit.edu/sls/downloads/placesaudio/