Everyday interactions require a common understanding of language, i.e. for people to communicate effectively, words (for example ‘cat’) should invoke similar beliefs over physical concepts (what cats look like, the sounds they make, how they behave, what their skin feels like etc.). However, how this ‘common understanding’ emerges is still unclear. One appealing hypothesis is that language is tied to how we interact with the environment. As a result, meaning emerges by ‘grounding’ language in modalities in our environment (images, sounds, actions, etc.).

Recent concurrent works in machine learning have focused on bridging visual and natural language understanding through visually-grounded language learning tasks, e.g. through natural images (Visual Question Answering, Visual Dialog), or through interactions with virtual physical environments. In cognitive science, progress in fMRI enables creating a semantic atlas of the cerebral cortex, or to decode semantic information from visual input. And in psychology, recent studies show that a baby’s most likely first words are based on their visual experience, laying the foundation for a new theory of infant language acquisition and learning.

As the grounding problem requires an interdisciplinary attitude, this workshop aims to gather researchers with broad expertise in various fields -- machine learning, computer vision, natural language, neuroscience, and psychology -- and who are excited about this space of grounding and interactions, and who are willing to share their current work or perspectives on future directions.


Invited Speakers

sanja-fidler Sanja Fidler is an Assistant Professor at University of Toronto. Her main research interests are 2D and 3D object detection, particularly scalable multi-class detection, object segmentation and image labeling, and (3D) scene understanding. She is also interested in the interplay between language and vision. [Webpage]
jack-gallant Jack L. Gallant is a Professor in the Department of Psychology at University of California, Berkeley. The focus of research in his laboratory is on understanding the structure and function of the visual system. [Webpage]
felix-hill Felix Hill is a Research Scientist at DeepMind. He works on models and algorithms for extracting and representing semantic knowledge from text and other naturally occurring data. [Webpage]
raymond-mooney Raymond J. Mooney is a Professor of Computer Science at The University of Texas at Austin and leads the Machine Learning Research Group within UT Artificial Intelligence Laboratory. His current focus is on natural language processing / computational linguistics. [Webpage]
devi-parikh Devi Parikh is an Assistant Professor in the School of Interactive Computing at Georgia Tech, and a Research Scientist at Facebook AI Research (FAIR). Her research interests include computer vision and AI in general and visual recognition problems in particular. [Webpage]
olivier-pietquin Olivier Pietquin is with DeepMind in London. His research interests include spoken dialog systems evaluation, simulation and automatic optimization, machine learning (especially direct and inverse reinforcement learning), speech and signal processing. [Webpage]
linda-smith Chen Yu is a Professor at Computational Cognition and Learning Lab at the University of Indiana. His research interests focus on understanding human development and learning as the interdependence and integration of perceptual, attention, motor, cognitive, language and social processes. [Webpage]

Important Dates

3rd November 2017: Submission deadline

17th November 2017: Submission deadline

24th November 2017: Acceptance notification

8th December 2017: Workshop

Accepted Papers:

Submission Details

We invite you to submit papers related to the following topics:

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be in the NIPS format. We also welcome published papers that are within the scope of the workshop (without re-formatting).

Accepted papers will be presented during 2 poster sessions, and up to 5 will be invited to deliver short talks. Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals.

Please submit your paper to the following address:

Detailed Description

Statistical language models learned from text-only corpuses form the dominant paradigm in modern natural language understanding. Many popular models of this type (including GloVe and word2vec) are distributional, i.e. the "meaning" of words is based only on their co-occurence patterns with other words in similar context. While effective for many applications, these text-only distributional approaches suffer from limited semantics as they miss the interactive environment in which communication often takes place, i.e., its symbols are not grounded. This limitation was first highlighted with the symbol grounding problem: "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [16].

Humans, on the other hand, acquire and learn language by communicating about and interacting with the visual environment. This behavior provides the necessary grounding of physical concepts in words. To this end, several recent works study grounded language-learning tasks, e.g. grounding in natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4], Visual Dialog [5], Captioning [6]) or grounding in a physically-simulated environment (DeepMind Lab [7], Baidu XWorld [8], OpenAI Universe [9]). We believe this line of research is more suited for human-machine collaboration than unimodal approaches that ignore the grounding aspect.

From a modeling perspective, deep learning approaches are promising for grounding because they are capable of learning high-level semantics from low-level sensory data in both computer vision and language. Subsequently, deep learning turns out to be an efficient tool for fusing different modalities into a single representation [3,4]. In addition, as grounded language acquisition requires to interact with an external environment, reinforcement learning provides an elegant framework to cover the planning aspect of visually grounded dialogue as well as other goal-oriented tasks. There has been some recent effort on combining deep learning and reinforcement learning approaches in various grounding scenarios [10,11,12].

Research in understanding human behavior provides yet another perspective in building models capable of grounded language-learning. In cognitive science, recent progress in fMRI enables us to create a semantic atlas of the cerebral cortex [13] or to learn to decode semantic information from visual input [14]. In one study, psychologists followed blind children and show that they are not linguistically deficient. Despite the lack of visual stimuli, blind children manage to use visual concepts such as colors or visual verbs ("see" or "look") [15] and circumvent their visual impairment through unique strategies [17].

This workshop aims to gather people from backgrounds in machine learning, computer vision, natural language, neuroscience, and psychology, who are excited about this space of grounding and interaction, and are willing to share ideas from their work and perspectives on future directions.


florian harm abhishek satwik
Florian Strub
University of Lille, Inria
Harm de Vries
University of Montreal
Abhishek Das
Georgia Tech
Satwik Kottur
Carnegie Mellon
stefan mateusz olivier
Stefan Lee
Georgia Tech
Mateusz Malinowski
Olivier Pietquin
devi dhruv aaron jeremie
Devi Parikh
Georgia Tech & Facebook AI Research
Dhruv Batra
Georgia Tech & Facebook AI Research
Aaron Courville
University of Montreal
Jeremie Mary



  1. Kazemzadeh, Sahar, et al. "ReferIt Game: Referring to Objects in Photographs of Natural Scenes". EMNLP. 2014.
  2. de Vries, Harm, et al. "GuessWhat?! Visual Object Discovery through Multi-modal Dialogue". CVPR. 2017.
  3. Antol, Stanislaw, et al. "VQA: Visual Question Answering". ICCV. 2015.
  4. Malinowski, Mateusz, et al. "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images". ICCV. 2015.
  5. Das, Abhishek, et al. "Visual Dialog". CVPR. 2017.
  6. Rohrbach, Anna, et. al. "Generating Descriptions with Grounded and Co-Referenced People". CVPR. 2017.
  7. Beattie, Charles, et. al. "DeepMind Lab". 2016.
  8. Yu, Haonan, et al. "A Deep Compositional Framework for Human-like Language Acquisition in Virtual Environment". arXiv preprint arXiv:1703.09831. 2017.
  9. OpenAI. "Universe". 2016.
  10. Strub, Florian, et al. "End-to-end Optimization of Goal-driven and Visually Grounded Dialogue Systems". IJCAI. 2017.
  11. Das, Abhishek, et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning". ICCV. 2017.
  12. Hermann, Karl Moritz, et al. "Grounded Language Learning in a Simulated 3D World". arXiv preprint arXiv:1706.06551. 2017.
  13. Huth, Alexander G., et al. "Natural Speech Reveals the Semantic Maps that Tile Human Cerebral Cortex". Nature 532.7600 (2016): 453-458. 2016.
  14. Huth, Alexander G., et al. "Decoding the Semantic Content of Natural Movies from Human Brain Activity". Frontiers in systems neuroscience 10. 2016.
  15. Landau, Barbara, et al. "Language and Experience: Evidence from the Blind Child". Vol. 8. Harvard University Press. 2009.
  16. Harnad, Stevan. "The Symbol Grounding Problem". Physica D. 1990.
  17. Perez-Pereira et al. "Language Development and Social Interaction in Blind Children". Psychology Press. 2013.