Everyday interactions require a common understanding of language, i.e. for people to communicate effectively, words (for example ‘cat’) should invoke similar beliefs over physical concepts (what cats look like, the sounds they make, how they behave, what their skin feels like etc.). However, how this ‘common understanding’ emerges is still unclear. One appealing hypothesis is that language is tied to how we interact with the environment. As a result, meaning emerges by ‘grounding’ language in modalities in our environment (images, sounds, actions, etc.).
Recent concurrent works in machine learning have focused on bridging visual and natural language understanding through visually-grounded language learning tasks, e.g. through natural images (Visual Question Answering, Visual Dialog), or through interactions with virtual physical environments. In cognitive science, progress in fMRI makes it possible to create a semantic atlas of the cerebral cortex or to decode semantic information from visual input. And in psychology, recent studies show that a baby’s most likely first words are based on their visual experience, laying the foundation for a new theory of infant language acquisition and learning.
As the grounding problem requires an interdisciplinary approach, this workshop aims to gather researchers with broad expertise in various fields -- machine learning, computer vision, natural language, neuroscience, and psychology -- who are excited about this space of grounding and interactions and willing to share their current work or perspectives on future directions.
- 08:30 AM : Welcoming talk!
- 08:45 AM : Visually Grounded Language: Past, Present, and Future… Raymond J. Mooney
- 09:30 AM : Connecting high-level semantics with low-level vision. Sanja Fidler
- 10:15 AM : Coffee Break & Poster Session
- 10:40 AM : The interface between vision and language in the human brain. Jack Gallant
- 11:25 AM : Embodied Question Answering. Devi Parikh
- 12:10 PM : LUNCH
- 02:00 PM : Dialogue systems and RL: interconnecting language, vision and rewards. Olivier Pietquin
- 02:45 PM : Spotlights
- Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. Peter Anderson et al.
- Curriculum Q-Learning for Visual Vocabulary Acquisition. Ahmed H. Zaidi et al.
- Examining Cooperation in Visual Dialog Models. Mircea Mironenco et al.
- Interpretable Counting for Visual Question Answering. Alexander Trott et al.
- Interactive Reinforcement Learning for Object Grounding via Self-Talking. Yan Zhu et al.
- Informing Action Primitives Through Free-Form Text. Ben Murdoch et al.
- 03:15 PM : Coffee Break & Poster Session
- 03:40 PM : Grounded Language Learning in a Simulated 3D World. Felix Hill
- 04:25 PM : How infants learn to speak by interacting with the visual world? Chen Yu
- 05:10 PM : Panel Discussion
- 06:00 PM : END
|Sanja Fidler is an Assistant Professor at the University of Toronto. Her main research interests are 2D and 3D object detection, particularly scalable multi-class detection, object segmentation and image labeling, and (3D) scene understanding. She is also interested in the interplay between language and vision. [Webpage]|
|Jack L. Gallant is a Professor in the Department of Psychology at the University of California, Berkeley. The focus of research in his laboratory is on understanding the structure and function of the visual system. [Webpage]|
|Felix Hill is a Research Scientist at DeepMind. He works on models and algorithms for extracting and representing semantic knowledge from text and other naturally occurring data. [Webpage]|
|Raymond J. Mooney is a Professor of Computer Science at The University of Texas at Austin and leads the Machine Learning Research Group within the UT Artificial Intelligence Laboratory. His current focus is on natural language processing and computational linguistics. [Webpage] - [slides]|
|Devi Parikh is an Assistant Professor in the School of Interactive Computing at Georgia Tech, and a Research Scientist at Facebook AI Research (FAIR). Her research interests include computer vision and AI in general and visual recognition problems in particular. [Webpage]|
|Olivier Pietquin is with DeepMind in London. His research interests include spoken dialog systems evaluation, simulation and automatic optimization, machine learning (especially direct and inverse reinforcement learning), speech and signal processing. [Webpage]|
|Chen Yu is a Professor at the Computational Cognition and Learning Lab at Indiana University. His research interests focus on understanding human development and learning as the interdependence and integration of perceptual, attentional, motor, cognitive, language and social processes. [Webpage] - [slides]|
3rd November 2017: Submission deadline
17th November 2017: Submission deadline
24th November 2017: Acceptance notification
8th December 2017: Workshop
- Actor-Critic Sequence Training for Image Captioning - Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, Timothy M. Hospedales - pdf
- Answerer in Questioner’s Mind for Goal-Oriented Visual Dialogue - Sang-Woo Lee, Yujung Heo, and Byoung-Tak Zhang - pdf
- Attention Based Natural Language Grounding by Navigating Virtual Environment - Akilesh B, Abhishek Sinha, Mausoom Sarkar, Balaji Krishnamurthy - pdf
- Characterizing how Visual Question Answering models scale with the world - Eli Bingham*, Piero Molino*, Paul Szerlip*, Fritz Obermeyer, Noah D. Goodman - pdf
- Compositional Generation of Images - Amit Raj, Cusuh Ham, Huda Alamri, Vincent Cartillier, Stefan Lee, James Hays - pdf
- Curriculum Q-Learning for Visual Vocabulary Acquisition - Ahmed H. Zaidi, Russell Moore, Ted Briscoe - pdf
- dBaby: Grounded Language Teaching through Games and Efficient Reinforcement Learning - Guntis Barzdins, Renars Liepins, Paulis F. Barzdins, Didzis Gosko - pdf
- Describing Semantic Representations of Brain Activity Evoked by Visual Stimuli - Eri Matsuo, Ichiro Kobayashi, Shinji Nishimoto, Satoshi Nishida, Hideki Asoh - pdf
- Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
- End-to-End Models for Task-Oriented Gameplay with Gated-Attention Networks and Malliavin-Stein Variational Policy Gradients - Ali Zaidi - pdf
- Ensembling Visual Explanation for VQA - Nazneen Fatema Rajani, Raymond J. Mooney - pdf
- Examining Cooperation in Visual Dialog Models - Mircea Mironenco*, Dana Kianfar*, Ke Tran, Evangelos Kanoulas, Efstratios Gavves - pdf
- FigureQA: An Annotated Figure Dataset for Visual Reasoning - Samira Ebrahimi Kahou, Adam Atkinson, Vincent Michalski, Ákos Kádár, Adam Trischler, Yoshua Bengio - pdf
- FiLM: Visual Reasoning with a General Conditioning Layer - Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville - pdf
- fMRI Semantic Category Decoding using Linguistic Encoding of Word2Vec - Subba Reddy Oota, Naresh Manwani, Raju S. Bapi - pdf
- Gated-Attention Architectures for Task-Oriented Language Grounding - Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov - pdf
- Generating Descriptions with Grounded and Co-Referenced People - Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele - pdf
- Graph R-CNN: Improved Scene Graph Generation and Its Applications to Image Captioning and VQA
- Grounded Objects and Interactions for Video Captioning - Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf - pdf
- HoME: a Household Multimodal Environment - Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, Aaron Courville - pdf
- Hyper-dimensional computing for a visual question-answering system that is trainable end-to-end - Guglielmo Montone, J. Kevin O’Regan, Alexander V. Terekhov - pdf
- Improving Visually Grounded Sentence Representations with Self-Attention - Kang Min Yoo, Youhyun Shin, Sang-goo Lee - pdf
- Informing Action Primitives Through Free-Form Text - Nancy Fulda, Ben Murdoch, Daniel Ricks, David Wingate - pdf
- Interactive Image Manipulation with Natural Language Instruction Commands - Seitaro Shinagawa, Koichiro Yoshino, Sakti Sakriani, Yu Suzuki, Satoshi Nakamura - pdf
- Interactive Reinforcement Learning for Object Grounding via Self-Talking - Yan Zhu, Shaoting Zhang, Dimitris Metaxas - pdf
- Interpretable Counting for Visual Question Answering - Alexander Trott, Caiming Xiong, Richard Socher - pdf
- Labelless Scene Classification - Meng Ye, Yuhong Guo - pdf
- Learning to Color from Language - Varun Manjunatha*, Mohit Iyyer*, Jordan Boyd-Graber, Larry Davis - pdf
- Listen, Interact and Talk: Learning to Speak via Interaction - Haichao Zhang, Haonan Yu, and Wei Xu - pdf
- Modulating and attending the source image during encoding improves Multimodal Translation - pdf
- Multi-level Classification: Implications for Human-like Generalization - Joshua Peterson, Paul Soulos, Aida Nematzadeh, Tom Griffiths - pdf
- Relationships from Entity Stream - Martin Andrews, Sam Witteveen - pdf
- Retweet Wars: Tweet Popularity Prediction via Multimodal Regression - Ke Wang, Mohit Bansal, Jan-Michael Frahm - pdf
- Semantic Image Retrieval via Active Grounding of Visual Situations - Max H. Quinn, Erik Conser, Jordan M. Witte, Melanie Mitchell - pdf
- Video SemNet: Memory-Augmented Video Semantic Network - Prashanth Vijayaraghavan, Deb Roy - pdf
- Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments - Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton van den Hengel - pdf
- Visual Explanations from Hadamard Product in Multimodal Deep Networks - Jin-Hwa Kim, Byoung-Tak Zhang - pdf
We invite you to submit papers related to the following topics:
- language acquisition or learning through interactions
- visual captioning, dialog, and question-answering
- reasoning in language and vision
- visual synthesis from language
- transfer learning in language and vision tasks
- navigation in virtual worlds with natural-language instructions
- machine translation with visual cues
- novel tasks that combine language, vision and actions
- understanding and modeling the relationship between language and vision in humans
- semantic systems and modeling of natural language and visual stimuli representations in the human brain
Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be in the NIPS format. We also welcome published papers that are within the scope of the workshop (without re-formatting).
Accepted papers will be presented during 2 poster sessions, and up to 5 will be invited to deliver short talks. Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals.
Please submit your paper to the following address: email@example.com
Statistical language models learned from text-only corpora form the dominant paradigm in modern natural language understanding. Many popular models of this type (including GloVe and word2vec) are distributional, i.e. the "meaning" of a word is based only on its co-occurrence patterns with other words in similar contexts. While effective for many applications, these text-only distributional approaches suffer from limited semantics because they miss the interactive environment in which communication often takes place, i.e. their symbols are not grounded. This limitation was first highlighted by the symbol grounding problem: "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [16].
Humans, on the other hand, acquire and learn language by communicating about and interacting with the visual environment. This behavior provides the necessary grounding of physical concepts in words. To this end, several recent works study grounded language-learning tasks, e.g. grounding in natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4], Visual Dialog [5], Captioning [6]) or grounding in a physically-simulated environment (DeepMind Lab [7], Baidu XWorld [8], OpenAI Universe [9]). We believe this line of research is better suited to human-machine collaboration than unimodal approaches that ignore the grounding aspect.
From a modeling perspective, deep learning approaches are promising for grounding because they are capable of learning high-level semantics from low-level sensory data in both computer vision and language. Deep learning has also proven to be an efficient tool for fusing different modalities into a single representation [3,4]. In addition, as grounded language acquisition requires interacting with an external environment, reinforcement learning provides an elegant framework to cover the planning aspect of visually grounded dialogue as well as other goal-oriented tasks. Several recent efforts combine deep learning and reinforcement learning approaches in various grounding scenarios [10,11,12].
Research on understanding human behavior provides yet another perspective on building models capable of grounded language learning. In cognitive science, recent progress in fMRI enables us to create a semantic atlas of the cerebral cortex [13] or to learn to decode semantic information from visual input [14]. In one study, psychologists followed blind children and showed that they are not linguistically deficient. Despite the lack of visual stimuli, blind children manage to use visual concepts such as colors or visual verbs ("see" or "look") [15] and circumvent their visual impairment through unique strategies [17].
This workshop aims to gather people from backgrounds in machine learning, computer vision, natural language, neuroscience, and psychology, who are excited about this space of grounding and interaction, and are willing to share ideas from their work and perspectives on future directions.
University of Lille, Inria
Harm de Vries
University of Montreal
Satwik Kottur
Georgia Tech & Facebook AI Research
Georgia Tech & Facebook AI Research
University of Montreal
1. Kazemzadeh, Sahar, et al. "ReferIt Game: Referring to Objects in Photographs of Natural Scenes". EMNLP. 2014.
2. de Vries, Harm, et al. "GuessWhat?! Visual Object Discovery through Multi-modal Dialogue". CVPR. 2017.
3. Antol, Stanislaw, et al. "VQA: Visual Question Answering". ICCV. 2015.
4. Malinowski, Mateusz, et al. "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images". ICCV. 2015.
5. Das, Abhishek, et al. "Visual Dialog". CVPR. 2017.
6. Rohrbach, Anna, et al. "Generating Descriptions with Grounded and Co-Referenced People". CVPR. 2017.
7. Beattie, Charles, et al. "DeepMind Lab". 2016.
8. Yu, Haonan, et al. "A Deep Compositional Framework for Human-like Language Acquisition in Virtual Environment". arXiv preprint arXiv:1703.09831. 2017.
9. OpenAI. "Universe". 2016.
10. Strub, Florian, et al. "End-to-end Optimization of Goal-driven and Visually Grounded Dialogue Systems". IJCAI. 2017.
11. Das, Abhishek, et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning". ICCV. 2017.
12. Hermann, Karl Moritz, et al. "Grounded Language Learning in a Simulated 3D World". arXiv preprint arXiv:1706.06551. 2017.
13. Huth, Alexander G., et al. "Natural Speech Reveals the Semantic Maps that Tile Human Cerebral Cortex". Nature 532.7600: 453-458. 2016.
14. Huth, Alexander G., et al. "Decoding the Semantic Content of Natural Movies from Human Brain Activity". Frontiers in Systems Neuroscience 10. 2016.
15. Landau, Barbara, et al. "Language and Experience: Evidence from the Blind Child". Vol. 8. Harvard University Press. 2009.
16. Harnad, Stevan. "The Symbol Grounding Problem". Physica D. 1990.
17. Perez-Pereira et al. "Language Development and Social Interaction in Blind Children". Psychology Press. 2013.