An INLG 2025 tutorial by Nikolai Ilinykh (University of Gothenburg, Sweden), Simon Dobnik (University of Gothenburg, Sweden), Simeon Junker (University of Bielefeld, Germany) and Sina Zarrieß (University of Bielefeld, Germany).
Slides:
• Part 1: What is it to "refer"?
• Part 2: Manifestation of reference in tasks and contexts
• Part 3: Evaluation of reference
• Hands-on session: Code
Target audience The tutorial assumes basic knowledge of NLP and linguistics. It is meant for anyone investigating text generation in multimodal settings -- including questions of unwanted (e.g., social) bias. The tutorial is aimed at researchers and practitioners who want to study reference in multimodal, task-oriented settings rather than in text-only NLP. Colleagues working on referring expression generation, as well as those focused on model evaluation and interpretability, are likely to find it particularly useful. We target researchers at different levels of expertise. Newcomers will discover the spectrum of vision-and-language tasks where reference matters; intermediate researchers will learn about metrics and probes for tracking model-produced reference; advanced NLG experts will learn how state-of-the-art multimodal LLMs can be interpreted with respect to referring.
Tutorial overview Referring to "things" is fundamental in human language. Computational models of language have long been studied for their ability to capture this crucial skill. With a growing number of benchmarks, models, and especially evaluation metrics for studying reference in contexts that are not text-only, it becomes challenging to systematise and crystallise the progress made in this area. This tutorial addresses this challenge by providing a comprehensive overview of the state of the art in multimodal reference. Our goal is to provide a single toolkit for studying and evaluating how vision-and-language models deal with reference in language.
The toolkit consists of a set of resources designed to generalise research on reference in multimodal contexts. These resources include datasets, evaluation metrics, and analysis tools that can be applied across different tasks and models. We will specifically focus on the role of the task and of visual information in modelling reference. This tutorial is thus timely: it unifies the latest methods, maps the key visual and task factors important for modelling reference, and packages them into a ready-to-use resource.
While the tutorial introduces a technical toolkit, we will guide you through the code examples, so there is no need to worry. Our focus is to ensure that all participants have an idea of how they can use the code for their own research. We will use Python and Google Colab, and if you are new to programming, we will guide you every step of the way.
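To give a flavour of the hands-on session, here is a minimal, self-contained Python sketch of the kind of exercise involved: scoring a model-generated referring expression against human references with a toy token-overlap metric. The expressions and the token_f1 helper below are illustrative assumptions for this page, not the tutorial's actual toolkit code.

    # Toy example: compare a generated referring expression against
    # human references using token-level F1 (illustrative only; not
    # the tutorial's actual toolkit code).

    def token_f1(candidate: str, reference: str) -> float:
        """Harmonic mean of token precision and recall between two expressions."""
        cand = set(candidate.lower().split())
        ref = set(reference.lower().split())
        overlap = len(cand & ref)
        if overlap == 0:
            return 0.0
        precision = overlap / len(cand)
        recall = overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    # A generated expression and two human references for the same object.
    generated = "the red mug on the left"
    references = ["the red cup on the left side", "leftmost red mug"]

    # Score the generation against each reference and keep the best match.
    best = max(token_f1(generated, ref) for ref in references)
    print(f"best token F1: {best:.2f}")

In the actual session we will work with real referring-expression data and more informative metrics, but the workflow follows the same pattern: generate, compare against human references, and inspect where the model's references diverge.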
Click here to contact us.