Visually-Grounded Interaction and Language (ViGIL)

Vigil Sessions

2019: [Vigil2019]

2018: [Vigil2018]

Introduction

Everyday interactions require a common understanding of language: for people to communicate effectively, words (for example ‘cat’) should evoke similar beliefs over physical concepts (what cats look like, the sounds they make, how they behave, what their skin feels like, etc.). However, how this ‘common understanding’ emerges is still unclear. One appealing hypothesis is that language is tied to how we interact with the environment. Under this view, meaning emerges by ‘grounding’ language in modalities in our environment (images, sounds, actions, etc.).

Recent work in machine learning has focused on bridging visual and natural language understanding through visually-grounded language learning tasks, e.g. through natural images (Visual Question Answering, Visual Dialog) or through interactions with virtual physical environments. In cognitive science, progress in fMRI makes it possible to create a semantic atlas of the cerebral cortex or to decode semantic information from visual input. And in psychology, recent studies show that a baby’s most likely first words are based on their visual experience, laying the foundation for a new theory of infant language acquisition and learning.

As the grounding problem requires an interdisciplinary approach, this workshop aims to gather researchers with broad expertise in various fields -- machine learning, computer vision, natural language, neuroscience, and psychology -- who are excited about this space of grounding and interactions and who are willing to share their current work or perspectives on future directions.

Schedule

08:30 AM : Welcoming talk!
08:45 AM : Visually Grounded Language: Past, Present, and Future… Raymond J. Mooney
09:30 AM : Connecting high-level semantics with low-level vision. Sanja Fidler
10:15 AM : Coffee Break & Poster Session
10:40 AM : The interface between vision and language in the human brain. Jack Gallant
11:25 AM : Embodied Question Answering. Devi Parikh
12:10 PM : LUNCH
02:00 PM : Dialogue systems and RL: interconnecting language, vision and rewards. Olivier Pietquin
02:45 PM : Spotlights
- Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. Peter Anderson et al.
- Curriculum Q-Learning for Visual Vocabulary Acquisition. Ahmed H. Zaidi et al.
- Examining Cooperation in Visual Dialog Models. Mircea Mironenco et al.
- Interpretable Counting for Visual Question Answering. Alexander Trott et al.
- Interactive Reinforcement Learning for Object Grounding via Self-Talking. Yan Zhu et al.
- Informing Action Primitives Through Free-Form Text. Ben Murdoch et al.
03:15 PM : Coffee Break & Poster Session
03:40 PM : Grounded Language Learning in a Simulated 3D World. Felix Hill
04:25 PM : How infants learn to speak by interacting with the visual world. Chen Yu
05:10 PM : Panel Discussion
06:00 PM : END

Invited Speakers

	Sanja Fidler is an Assistant Professor at the University of Toronto. Her main research interests are 2D and 3D object detection, particularly scalable multi-class detection, object segmentation and image labeling, and (3D) scene understanding. She is also interested in the interplay between language and vision. [Webpage]
	Jack L. Gallant is a Professor in the Department of Psychology at the University of California, Berkeley. The focus of research in his laboratory is on understanding the structure and function of the visual system. [Webpage]
	Felix Hill is a Research Scientist at DeepMind. He works on models and algorithms for extracting and representing semantic knowledge from text and other naturally occurring data. [Webpage]
	Raymond J. Mooney is a Professor of Computer Science at The University of Texas at Austin and leads the Machine Learning Research Group within the UT Artificial Intelligence Laboratory. His current focus is on natural language processing / computational linguistics. [Webpage] - [slides]
	Devi Parikh is an Assistant Professor in the School of Interactive Computing at Georgia Tech, and a Research Scientist at Facebook AI Research (FAIR). Her research interests include computer vision and AI in general and visual recognition problems in particular. [Webpage]
	Olivier Pietquin is with DeepMind in London. His research interests include spoken dialog systems evaluation, simulation and automatic optimization, machine learning (especially direct and inverse reinforcement learning), speech and signal processing. [Webpage]
	Chen Yu is a Professor in the Computational Cognition and Learning Lab at Indiana University. His research interests focus on understanding human development and learning as the interdependence and integration of perceptual, attentional, motor, cognitive, language and social processes. [Webpage] - [slides]

Important Dates

~~3rd November 2017: Submission deadline~~

~~17th November 2017: Submission deadline~~

~~24th November 2017: Acceptance notification~~

8th December 2017: Workshop

Accepted Papers:

Actor-Critic Sequence Training for Image Captioning - Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, Timothy M. Hospedales - pdf
Answerer in Questioner’s Mind for Goal-Oriented Visual Dialogue - Sang-Woo Lee, Yujung Heo, and Byoung-Tak Zhang - pdf
Attention Based Natural Language Grounding by Navigating Virtual Environment - Akilesh B, Abhishek Sinha, Mausoom Sarkar, Balaji Krishnamurthy - pdf
Characterizing how Visual Question Answering models scale with the world - Eli Bingham*, Piero Molino*, Paul Szerlip*, Fritz Obermeyer, Noah D. Goodman - pdf
Compositional Generation of Images - Amit Raj, Cusuh Ham, Huda Alamri, Vincent Cartillier, Stefan Lee, James Hays - pdf
Curriculum Q-Learning for Visual Vocabulary Acquisition - Ahmed H. Zaidi, Russell Moore, Ted Briscoe - pdf
dBaby: Grounded Language Teaching through Games and Efficient Reinforcement Learning - Guntis Barzdins, Renars Liepins, Paulis F. Barzdins, Didzis Gosko - pdf
Describing Semantic Representations of Brain Activity Evoked by Visual Stimuli - Eri Matsuo, Ichiro Kobayashi, Shinji Nishimoto, Satoshi Nishida, Hideki Asoh - pdf
Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
End-to-End Models for Task-Oriented Gameplay with Gated-Attention Networks and Malliavin-Stein Variational Policy Gradients - Ali Zaidi - pdf
Ensembling Visual Explanation for VQA - Nazneen Fatema Rajani, Raymond J. Mooney - pdf
Examining Cooperation in Visual Dialog Models - Mircea Mironenco*, Dana Kianfar*, Ke Tran, Evangelos Kanoulas, Efstratios Gavves - pdf
FigureQA: An Annotated Figure Dataset for Visual Reasoning - Samira Ebrahimi Kahou, Adam Atkinson, Vincent Michalski, Ákos Kádár, Adam Trischler, Yoshua Bengio - pdf
FiLM: Visual Reasoning with a General Conditioning Layer - Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville - pdf
fMRI Semantic Category Decoding using Linguistic Encoding of Word2Vec - Subba Reddy Oota, Naresh Manwani, Raju S. Bapi - pdf
Gated-Attention Architectures for Task-Oriented Language Grounding - Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov - pdf
Generating Descriptions with Grounded and Co-Referenced People - Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele - pdf
Graph R-CNN: Improved Scene Graph Generation and Its Applications to Image Captioning and VQA
Grounded Objects and Interactions for Video Captioning - Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf - pdf
HoME: a Household Multimodal Environment - Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, Aaron Courville - pdf
Hyper-dimensional computing for a visual question-answering system that is trainable end-to-end - Guglielmo Montone, J. Kevin O’Regan, Alexander V. Terekhov - pdf
Improving Visually Grounded Sentence Representations with Self-Attention - Kang Min Yoo, Youhyun Shin, Sang-goo Lee - pdf
Informing Action Primitives Through Free-Form Text - Nancy Fulda, Ben Murdoch, Daniel Ricks, David Wingate - pdf
Interactive Image Manipulation with Natural Language Instruction Commands - Seitaro Shinagawa, Koichiro Yoshino, Sakti Sakriani, Yu Suzuki, Satoshi Nakamura - pdf
Interactive Reinforcement Learning for Object Grounding via Self-Talking - Yan Zhu, Shaoting Zhang, Dimitris Metaxas - pdf
Interpretable Counting for Visual Question Answering - Alexander Trott, Caiming Xiong, Richard Socher - pdf
Labelless Scene Classification - Meng Ye, Yuhong Guo - pdf
Learning to Color from Language - Varun Manjunatha*, Mohit Iyyer*, Jordan Boyd-Graber, Larry Davis - pdf
Listen, Interact and Talk: Learning to Speak via Interaction - Haichao Zhang, Haonan Yu, and Wei Xu - pdf
Modulating and attending the source image during encoding improves Multimodal Translation - pdf
Multi-level Classification: Implications for Human-like Generalization - Joshua Peterson, Paul Soulos, Aida Nematzadeh, Tom Griffiths - pdf
Relationships from Entity Stream - Martin Andrews, Sam Witteveen - pdf
Retweet Wars: Tweet Popularity Prediction via Multimodal Regression - Ke Wang, Mohit Bansal, Jan-Michael Frahm - pdf
Semantic Image Retrieval via Active Grounding of Visual Situations - Max H. Quinn, Erik Conser, Jordan M. Witte, Melanie Mitchell - pdf
Video SemNet: Memory-Augmented Video Semantic Network - Prashanth Vijayaraghavan, Deb Roy - pdf
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments - Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton van den Hengel - pdf
Visual Explanations from Hadamard Product in Multimodal Deep Networks - Jin-Hwa Kim, Byoung-Tak Zhang - pdf

Submission Details

We invite you to submit papers related to the following topics:

language acquisition or learning through interactions
visual captioning, dialog, and question-answering
reasoning in language and vision
visual synthesis from language
transfer learning in language and vision tasks
navigation in virtual worlds with natural-language instructions
machine translation with visual cues
novel tasks that combine language, vision and actions
understanding and modeling the relationship between language and vision in humans
semantic systems and modeling of natural language and visual stimuli representations in the human brain

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be in the NIPS format. We also welcome published papers that are within the scope of the workshop (without re-formatting).

Accepted papers will be presented during 2 poster sessions, and up to 5 will be invited to deliver short talks. Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals.

Please submit your paper to the following address: nips2017vigil@gmail.com

Detailed Description

Statistical language models learned from text-only corpora form the dominant paradigm in modern natural language understanding. Many popular models of this type (including GloVe and word2vec) are distributional, i.e. the "meaning" of words is based only on their co-occurrence patterns with other words in similar contexts. While effective for many applications, these text-only distributional approaches suffer from limited semantics as they miss the interactive environment in which communication often takes place, i.e., their symbols are not grounded. This limitation was first highlighted with the symbol grounding problem: "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [16].
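To make the distributional idea concrete, here is a minimal sketch of word similarity computed purely from co-occurrence counts. It is not GloVe or word2vec, which learn dense embeddings; the toy corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: the only "knowledge" available is which words appear together.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_fuel = cosine(vectors["cat"], vectors["fuel"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~fuel: {sim_cat_fuel:.3f}")
```

In this sketch 'cat' and 'dog' come out as similar only because they share textual neighbours; nothing in the representation says what a cat looks or sounds like, which is the gap that grounding is meant to address.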

Humans, on the other hand, acquire and learn language by communicating about and interacting with the visual environment. This behavior provides the necessary grounding of physical concepts in words. To this end, several recent works study grounded language-learning tasks, e.g. grounding in natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4], Visual Dialog [5], Captioning [6]) or grounding in a physically-simulated environment (DeepMind Lab [7], Baidu XWorld [8], OpenAI Universe [9]). We believe this line of research is more suited for human-machine collaboration than unimodal approaches that ignore the grounding aspect.

From a modeling perspective, deep learning approaches are promising for grounding because they are capable of learning high-level semantics from low-level sensory data in both computer vision and language. Deep learning has also proven to be an efficient tool for fusing different modalities into a single representation [3,4]. In addition, as grounded language acquisition requires interacting with an external environment, reinforcement learning provides an elegant framework to cover the planning aspect of visually grounded dialogue as well as other goal-oriented tasks. There has been some recent effort on combining deep learning and reinforcement learning approaches in various grounding scenarios [10,11,12].

Research in understanding human behavior provides yet another perspective in building models capable of grounded language-learning. In cognitive science, recent progress in fMRI enables us to create a semantic atlas of the cerebral cortex [13] or to learn to decode semantic information from visual input [14]. In one study, psychologists followed blind children and showed that they are not linguistically deficient. Despite the lack of visual stimuli, blind children manage to use visual concepts such as colors or visual verbs ("see" or "look") [15] and circumvent their visual impairment through unique strategies [17].

This workshop aims to gather people from backgrounds in machine learning, computer vision, natural language, neuroscience, and psychology, who are excited about this space of grounding and interaction, and are willing to share ideas from their work and perspectives on future directions.

Organizers


Florian Strub - University of Lille, Inria
Harm de Vries - University of Montreal
Abhishek Das - Georgia Tech
Satwik Kottur - Carnegie Mellon
Stefan Lee - Georgia Tech
Mateusz Malinowski - DeepMind
Olivier Pietquin - DeepMind
Devi Parikh - Georgia Tech & Facebook AI Research
Dhruv Batra - Georgia Tech & Facebook AI Research
Aaron Courville - University of Montreal
Jeremie Mary - Criteo

References

[1] Kazemzadeh, Sahar, et al. "ReferIt Game: Referring to Objects in Photographs of Natural Scenes". EMNLP. 2014.
[2] de Vries, Harm, et al. "GuessWhat?! Visual Object Discovery through Multi-modal Dialogue". CVPR. 2017.
[3] Antol, Stanislaw, et al. "VQA: Visual Question Answering". ICCV. 2015.
[4] Malinowski, Mateusz, et al. "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images". ICCV. 2015.
[5] Das, Abhishek, et al. "Visual Dialog". CVPR. 2017.
[6] Rohrbach, Anna, et al. "Generating Descriptions with Grounded and Co-Referenced People". CVPR. 2017.
[7] Beattie, Charles, et al. "DeepMind Lab". 2016.
[8] Yu, Haonan, et al. "A Deep Compositional Framework for Human-like Language Acquisition in Virtual Environment". arXiv preprint arXiv:1703.09831. 2017.
[9] OpenAI. "Universe". 2016.
[10] Strub, Florian, et al. "End-to-end Optimization of Goal-driven and Visually Grounded Dialogue Systems". IJCAI. 2017.
[11] Das, Abhishek, et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning". ICCV. 2017.
[12] Hermann, Karl Moritz, et al. "Grounded Language Learning in a Simulated 3D World". arXiv preprint arXiv:1706.06551. 2017.
[13] Huth, Alexander G., et al. "Natural Speech Reveals the Semantic Maps that Tile Human Cerebral Cortex". Nature 532.7600: 453-458. 2016.
[14] Huth, Alexander G., et al. "Decoding the Semantic Content of Natural Movies from Human Brain Activity". Frontiers in Systems Neuroscience 10. 2016.
[15] Landau, Barbara, et al. "Language and Experience: Evidence from the Blind Child". Vol. 8. Harvard University Press. 2009.
[16] Harnad, Stevan. "The Symbol Grounding Problem". Physica D. 1990.
[17] Perez-Pereira et al. "Language Development and Social Interaction in Blind Children". Psychology Press. 2013.

Visually-Grounded Interaction and Language (ViGIL)

NIPS 2017 Workshop, Long Beach, California, USA
Hall 101B, Friday, December 8th, 08:00 AM — 06:30 PM