Clayton Fields – Data Science
CCP Conf. 368 or via Zoom
Title: Vision-Language Transformers
Vision-language tasks, such as answering questions about an image or describing it, are quite difficult for computers to perform. Deep learning models designed for vision-language tasks tend to be difficult to understand and implement and are confined to a narrow range of uses. A recent body of research, however, has introduced a class of models called vision-language transformers that greatly improve performance and versatility over previous models. They do so by pretraining on large generic datasets and transferring their learning to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision-language transformers offer the promise of similar advancements in the modeling domain where computer vision and natural language processing intersect. This presentation will provide a broad synthesis of the currently available research on vision-language transformer models and offer some analysis of their strengths, limitations, and the open questions that remain.