Grounded Language Models
A grounded language model is a language model that generates text conditioned on some meaningful input representation, such as a table, an image, or a video. We will focus on image captioning, i.e., models that produce a textual description of an image, a so-called image caption.
We will walk through a neural image captioner written from scratch, and also look at ways of fine-tuning large pre-trained image processing models and language models for image captioning.
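As a rough sketch of what "grounded" means formally, image captioning is commonly framed as conditional language modeling. The notation below is an assumption for illustration and not necessarily the one used in the session: $I$ is the input image, $w_1, \dots, w_n$ are the caption tokens, and $\theta$ are the model parameters.

$$
P_\theta(w_1, \dots, w_n \mid I) = \prod_{i=1}^{n} P_\theta(w_i \mid w_1, \dots, w_{i-1}, I)
$$

Compared to an ordinary language model, the only difference in this formulation is the extra conditioning on $I$ in every next-token distribution.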
Learning goals for this session
understand basic architectures for grounded LMs (with a focus on neural image captioning)
critically assess research papers on (grounded) LMs
interpret and apply common evaluation metrics
Slides
Here are the slides for this session.