PyTorch implementation for P. Peng and D. Harwath, “Fast-Slow Transformer for Visually Grounding Speech”
PyTorch implementation for A. Baade, P. Peng, D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer”
PyTorch implementation for D. Harwath, W-N. Hsu, and J. Glass, “Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech,” Proc. ICLR, 2020
PyTorch implementation for D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass, “Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input,” Proc. of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, September 2018
Members of our lab have been involved in the collection of several datasets of spoken image captions, including:
To download these datasets, please visit https://groups.csail.mit.edu/sls/downloads/placesaudio/