Research

Our research interests are in the area of machine learning for speech and language processing. The ultimate goal of our work is to discover the algorithmic mechanisms that would enable computers to learn and use spoken language the way that humans do. Our approach emphasizes the multimodal and grounded nature of human language, and thus has a strong connection to other machine learning disciplines such as computer vision.

While modern machine learning techniques such as deep learning have made impressive progress across a variety of domains, it is doubtful that existing methods can fully capture the phenomenon of language. State-of-the-art deep learning models for tasks such as speech recognition are extremely data hungry, requiring many thousands of hours of speech recordings that have been painstakingly transcribed by humans. Even then, they are highly brittle when used outside of their training domain, breaking down when confronted with new vocabulary, accents, or environmental noise. Because of its reliance on massive training datasets, the technology we do have is completely out of reach for all but several dozen of the 7,000 human languages spoken worldwide.

In contrast, human toddlers are able to grasp the meaning of new word forms from only a few spoken examples, and learn to carry a meaningful conversation long before they are able to read and write. There are critical aspects of language that are currently missing from our machine learning models. Human language is inherently multimodal; it is grounded in embodied experience; it holistically integrates information from all of our sensory organs into our rational capacity; and it is acquired via immersion and interaction, without the kind of heavy-handed supervision relied upon by most machine learning models. Our research agenda revolves around finding ways to bring these aspects into the fold.