Home

Imagine being able to simply buy a robot and customize it, through either task demonstrations or linguistic instructions, to perform household tasks (e.g., picking up toys and putting them in the right place, or emptying the dishwasher). An essential capability for such a robot is the ability to adapt to both the language and the environment in order to perform the right tasks in the right way. During training, it will encounter new phrases referring to new objects, actions and relations, and it will have to truly ground these to perform its novel tasks. Thus, it needs to identify which objects, actions and relationships in the real world the phrases refer to in order to obtain a genuine understanding of the world and of the complex interactions that govern the physical environment.

Making this vision a reality is one of the key challenges in building intelligent robots that assist us with our daily tasks. Realizing it requires a fundamental shift in how the research community approaches the problem of symbol grounding. Currently, the focus lies on merely grounding and anchoring individual symbols, which refer to single objects and their properties, in the environment. In contrast, this project hypothesizes that the grounding process should consider the full context of the environment, which consists of multiple objects as well as their relationships and properties, and how these change through actions and over time. It aims to develop a novel relational grounding approach that accounts for the relationships between multiple symbols in the language and between multiple referents in the environment.

Our vision is closely related to J.J. Gibson’s original notion of affordances, which referred to the action opportunities (or possibilities) offered to an organism by its environment and postulated that the organism and its environment complement each other. Our second hypothesis is that affordances play a central role in mapping language to the world. In one direction, linguistic clues may help identify affordances in the environment: the verbs frequently used with an object provide clues about its affordances. In the other direction, what can and cannot be done with a physical object in an environment provides information relevant to learning word meanings and resolving ambiguous utterances.

While affordances have been studied extensively in the robotics literature, and are sometimes mentioned in linguistics, the focus is on "object affordances". In contrast, this project will develop a framework for "relational affordances", which also model relationships in the environment. Furthermore, unlike most current work on affordances in robotics, this project will also model the environment itself and support reasoning about it. To achieve these aims, we will use logical and relational representations and learning techniques, which have proven useful in both language and robotics.
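To make the distinction concrete, here is a minimal Python sketch that contrasts the two notions as plain data structures. It is purely illustrative: the class and field names are our own inventions, and the project's actual representation is relational and probabilistic rather than a fixed record.

```python
from dataclasses import dataclass

# Illustrative only: an object affordance ties a single object to an action,
# whereas a relational affordance ties a configuration of objects -- including
# the relation that holds between them -- to an action and its expected effect.

@dataclass(frozen=True)
class ObjectAffordance:
    obj: str                 # e.g. "cup"
    action: str              # e.g. "grasp"

@dataclass(frozen=True)
class RelationalAffordance:
    objects: tuple           # e.g. ("cup", "table")
    relation: str            # e.g. "on(cup, table)"
    action: str              # e.g. "push"
    effect: str              # e.g. "cup slides along the table"

# "Push the cup that is on the table" can only be grounded correctly if the
# relation between cup and table is part of the model, not just the cup itself.
aff = RelationalAffordance(("cup", "table"), "on(cup, table)", "push",
                           "cup slides along the table")
```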

Concretely, the project targets the following breakthrough: a novel framework for "affordance" grounding such that an agent placed in a new environment can adapt to its new setting and interpret possibly ambiguous multimodal input in order to correctly carry out the requested tasks. To achieve this overarching objective, the sub-goals of this project are:

  • Develop the necessary techniques for representing and reasoning about natural language and sensorimotor input that facilitate both grounding the symbols and capturing their affordances.
  • Learn the affordances among objects, the state of the environment, actions, and effects by combining information from multiple modalities (language and perception).
  • Evaluate the developed methodology using a set of progressively more challenging, realistic setups that involve a robot operating in a household environment.

The expected outputs of the project are:

  • Novel techniques that perform symbol grounding by learning relational affordances.
  • New insights into symbol grounding, relational affordances, and developmental language learning.
  • Demonstration that relational grounding significantly improves the percentage of successfully completed tasks, both in a simulated environment and on a real robot operating in a household environment.
  • Publicly available data from the robotic and simulated environments, as well as the code for our algorithms.

The probabilistic programming framework of distributional clauses (DC) has been selected for reasoning and learning about affordances at KULeuven. It makes it possible to represent relational MDPs and affordance models using a probabilistic planning specification that can express relational affordances. In contrast to most other approaches to relational MDPs, distributional clauses can represent both discrete and continuous random variables. Within this framework, KULeuven developed a novel inductive algorithm for learning the structure of dynamic relational models. The learned model can be used for planning and reasoning tasks involving the affordances, and the inductive algorithm is used in ReGround to learn the transition and affordance models from multimodal data.
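As a rough illustration of what such a specification expresses, the Python sketch below samples successor states from a relational transition model that mixes a discrete random variable (an object's type) with a continuous one (its position). All names, distributions and numbers are assumptions made for the example; they stand in for the actual DC syntax and for the models learned in ReGround.

```python
import random

def type_of(obj, state):
    # Discrete random variable: in this toy model the type is simply fixed.
    return state["type"][obj]

def next_position(obj, action, state):
    # Continuous random variable: pushing a cup displaces it by a noisy amount.
    x = state["pos"][obj]
    if action == ("push", obj) and type_of(obj, state) == "cup":
        return x + random.gauss(0.10, 0.02)
    return x  # other objects and actions leave the position unchanged

def sample_transition(state, action):
    """Sample a successor state, one relational variable at a time."""
    return {
        "type": dict(state["type"]),
        "pos": {o: next_position(o, action, state) for o in state["pos"]},
    }

state = {"type": {"c1": "cup", "b1": "bowl"}, "pos": {"c1": 0.0, "b1": 0.5}}
succ = sample_transition(state, ("push", "c1"))  # c1 moves by ~0.1, b1 stays
```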

KOC developed a deep learning framework, called Knet, that allows users to define and train their models in plain Julia. ReGround uses Knet to implement the language grounding system. In particular, KOC developed a system that grounds the meaning of words onto physical objects, their properties, and the relations between them. The meanings of nouns, adjectives and prepositions are learned as neural components, and the model is also able to learn how these language components compose. Furthermore, using Knet, KOC has also investigated end-to-end neural architectures for the navigational instruction-following task, which involves learning the meaning of names, spatial prepositions, and navigational action verbs. The proposed model achieves the best results to date by using perceptual attention and improved word representations. Additionally, KOC conducted a set of experiments to investigate relational language learning in children. These experiments suggest that, before two years of age, children can use social cues and joint attention to learn a variety of words; in particular, nonverbal social cues aid the acquisition of relational language.
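The sketch below illustrates this compositional idea in plain NumPy rather than in Knet/Julia: each word is a small scorer over object features (unary for nouns and adjectives, binary for prepositions), and a phrase is grounded by composing the per-word scores. The feature size, the random placeholder weights and the additive composition are assumptions made for the example, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT = 4                                  # assumed per-object feature size
weights = {
    "cup": rng.normal(size=FEAT),         # noun: unary scorer
    "red": rng.normal(size=FEAT),         # adjective: unary scorer
    "on":  rng.normal(size=2 * FEAT),     # preposition: binary scorer over pairs
}

def unary(word, obj):
    return weights[word] @ obj            # how well one object matches the word

def binary(word, obj, landmark):
    return weights[word] @ np.concatenate([obj, landmark])

def ground(scene, landmark_idx):
    """Pick the object in the scene that best matches 'red cup on <landmark>'
    by summing the component word scores (a simple additive composition)."""
    scores = [unary("red", o) + unary("cup", o) +
              binary("on", o, scene[landmark_idx]) for o in scene]
    return int(np.argmax(scores))

scene = [rng.normal(size=FEAT) for _ in range(3)]  # feature vectors of 3 objects
print(ground(scene, landmark_idx=1))
```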

Örebro developed an anchoring framework that handles the full processing pipeline of segmenting objects, extracting features, classifying objects, grounding perceptual-symbolic correspondences, and anchoring objects. For detecting objects of interest in our kitchen scenarios, it uses an object segmentation algorithm based on the organized visual and depth information given by an RGB-D sensor. The object classification procedure, which is further used for grounding perceptual-symbolic correspondences, is facilitated by a deep convolutional neural network that has been trained and fine-tuned on categories of objects that can be expected in a kitchen. This processing pipeline allows one to store and maintain both the perceptual and the symbolic information of objects (stored in anchors), information that is utilized by both the DC framework developed by KULeuven and the Knet learning framework of KOC.
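The sketch below shows the kind of anchor structure such a pipeline maintains, binding an object's perceptual signature to its symbolic description. The acquire/reacquire operations follow standard anchoring terminology, but the code and field names are illustrative assumptions rather than the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Anchor:
    anchor_id: str
    features: list          # perceptual side: e.g. color/shape descriptors
    symbol: str             # symbolic side: e.g. "mug", from the CNN classifier
    predicates: set = field(default_factory=set)  # e.g. {"graspable"}
    last_seen: float = 0.0  # timestamp used to maintain the anchor over time

def acquire(anchors, anchor_id, features, symbol, t):
    """Create a new anchor for a newly perceived object."""
    anchors[anchor_id] = Anchor(anchor_id, features, symbol, last_seen=t)

def reacquire(anchors, anchor_id, features, t):
    """Update an existing anchor when the same object is observed again."""
    a = anchors[anchor_id]
    a.features, a.last_seen = features, t

anchors = {}
acquire(anchors, "mug-1", [0.21, 0.93], "mug", t=0.0)
reacquire(anchors, "mug-1", [0.20, 0.95], t=1.5)
```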