Code for A. Baade, P. Peng, and D. Harwath, “SyllableLM: Learning Coarse Semantic Units for Speech Language Models”
Code for P. Peng and D. Harwath, “Fast-Slow Transformer for Visually Grounding Speech”
Code for A. Baade, P. Peng, and D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer”
Code for D. Harwath, W-N. Hsu, and J. Glass, “Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech,” Proc. ICLR, 2020
Code for D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass, “Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input,” Proc. of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, September 2018
Members of our lab have been involved in the collection of several datasets of spoken image captions.
To download these datasets, please visit https://groups.csail.mit.edu/sls/downloads/placesaudio/