Multiscale video transformers for video class agnostic segmentation in an autonomous driving setting

Date
2024-08-24
Abstract
Semantic segmentation is a key component of perception for autonomous driving. Traditional semantic segmentation models, however, require extensive annotated datasets and struggle with unknown classes not encountered during training. Video class-agnostic segmentation, in contrast, aims to segment objects without relying on their semantic category, and motion cues can be exploited to account for objects outside the closed set of training classes. This project proposes an approach to video class-agnostic segmentation in autonomous driving based on multiscale video transformers. We enhance the Video Class Agnostic Segmentation (VCAS) dataset by integrating richer annotations and tracking data from the TAO-VOS (BDD) dataset, providing a more comprehensive benchmark for generalization in complex driving scenarios; we also annotate objects missing from a few TAO-VOS (BDD) sequences using a standard semantic segmentation annotation tool. We design a multiscale video transformer architecture that forgoes optical flow and instead learns motion implicitly to identify objects in a class-agnostic manner. The architecture builds on the Multiscale Encoder-Decoder Video Transformer (MED-VT) framework, which applies encoder and decoder attention across multiple feature scales to capture both fine- and coarse-grained information. Given an input clip, features extracted by a convolutional backbone are treated as tokens and passed to the multiscale transformer, which produces a pixel-wise, class-agnostic segmentation of the moving objects in the clip. The outcomes of this project are a more diverse and comprehensive dataset and a video class-agnostic segmentation model with improved accuracy in mean intersection over union (mIoU): training on datasets focused on autonomous driving scenes yields a significant mIoU improvement over models trained on general-purpose video object segmentation datasets.
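
As a rough illustration of the token pipeline described in the abstract, the sketch below shows how a clip could flow through a convolutional backbone, be flattened into spatio-temporal tokens for a transformer encoder-decoder, and come out as a per-pixel, class-agnostic mask of moving objects. This is a simplified, single-scale stand-in written in PyTorch; the module names, layer sizes, and use of `nn.Transformer` are illustrative assumptions and do not reflect the actual MED-VT implementation or its multiscale attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassAgnosticVideoSegmenter(nn.Module):
    """Toy sketch of the described pipeline: a convolutional backbone extracts
    per-frame features, the features are flattened into tokens for a
    transformer encoder-decoder, and a head predicts one class-agnostic
    (moving / not moving) logit per token, upsampled to pixel resolution.
    All names and sizes are illustrative, not the MED-VT code."""

    def __init__(self, feat_dim=256, n_heads=8, n_layers=4):
        super().__init__()
        # Lightweight convolutional backbone (stand-in for e.g. a ResNet stage).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer over spatio-temporal tokens (encoder-decoder, single scale here).
        self.transformer = nn.Transformer(
            d_model=feat_dim, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        # Per-token head producing a single class-agnostic logit.
        self.mask_head = nn.Linear(feat_dim, 1)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) raw input video clip
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.flatten(0, 1))      # (B*T, D, h', w')
        d, hf, wf = feats.shape[1:]
        tokens = feats.flatten(2).transpose(1, 2)      # (B*T, h'*w', D)
        tokens = tokens.reshape(b, t * hf * wf, d)     # spatio-temporal tokens
        decoded = self.transformer(tokens, tokens)     # (B, T*h'*w', D)
        logits = self.mask_head(decoded)               # (B, T*h'*w', 1)
        logits = logits.reshape(b * t, 1, hf, wf)
        # Upsample to input resolution: per-pixel moving-object mask.
        masks = F.interpolate(logits, size=(h, w), mode="bilinear",
                              align_corners=False)
        return masks.reshape(b, t, 1, h, w)


if __name__ == "__main__":
    model = ClassAgnosticVideoSegmenter()
    clip = torch.randn(1, 4, 3, 64, 64)   # one 4-frame clip
    masks = model(clip)
    print(masks.shape)                    # torch.Size([1, 4, 1, 64, 64])
```

In the full multiscale design described above, tokens from several backbone stages would be attended over jointly rather than a single feature scale as in this sketch.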