Abstract

Teaser Image

Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.

Technical Approach

Figure: Architecture of ForecastOcc for semantic occupancy forecasting. Multi-view images from past, current, and future time steps serve as input, with future views used only during training. Each image is encoded into 2D features F2D. In the forecasting module, F2D are enriched with scale (4, 256), view (M, 256), and temporal (4, 256) embeddings. The future state query derived from F2Dt has shape (H/16)×(W/16)×M×256. The future interaction layers share weights; each contains two multi-headed self-attention blocks, a feedforward network, and a future state synthesizer composed of two (linear + ReLU) layers followed by a linear layer, with embedding dimension 256 and 16 heads. The depth and context distributions have sizes M×88×H/16×W/16 and M×64×H/16×W/16, yielding a 3D feature volume of size 64×16×200×200. A 3D ResNet occupancy encoder and a semantic predictor output voxel-wise logits of size NC×16×200×200.

We propose a feature-centric framework for vision-based 3D semantic occupancy forecasting. Instead of directly predicting future occupancy in 3D space, our approach first anticipates how the scene will evolve in the image domain. Given a sequence of multi-view images from past and current time steps, an image encoder extracts spatial features, which are then processed by a forecasting module to synthesize future feature representations at multiple prediction horizons. These future-aware features are subsequently lifted into a 3D voxel grid using a standard Lift-Splat-Shoot view transformation and refined by a spatio-temporal occupancy encoder to produce voxel-level semantic predictions. This design allows our method to remain fully compatible with existing vision-to-3D pipelines while enabling accurate future scene understanding.
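The stages above can be sketched as a minimal PyTorch module. This is an illustrative toy, not the released implementation: the 2D backbone, temporal module, and the Splat/pooling step are all placeholder stand-ins, and every layer choice and dimension default here is an assumption chosen only to make the data flow (encode → forecast → lift → 3D encode → semantic head) concrete.

```python
import torch
import torch.nn as nn


class ForecastOccSketch(nn.Module):
    """Toy sketch of the feature-centric pipeline (NOT the released code):
    2D encoder -> feature forecasting -> Lift (outer product of depth and
    context) -> 3D occupancy encoder -> semantic head. All modules are
    placeholder stand-ins chosen for brevity."""

    def __init__(self, c2d=256, d_bins=88, c_ctx=64, z=16, xy=200, n_cls=18):
        super().__init__()
        self.encoder = nn.Conv2d(3, c2d, 16, stride=16)  # stand-in stride-16 backbone
        self.forecaster = nn.TransformerEncoderLayer(c2d, 8, batch_first=True)  # stand-in temporal module
        self.depth_head = nn.Conv2d(c2d, d_bins, 1)      # per-pixel depth distribution
        self.ctx_head = nn.Conv2d(c2d, c_ctx, 1)         # per-pixel context features
        self.occ_encoder = nn.Conv3d(c_ctx, c_ctx, 3, padding=1)  # stand-in 3D encoder
        self.sem_head = nn.Conv3d(c_ctx, n_cls, 1)       # voxel-wise semantic logits
        self.z, self.xy = z, xy

    def forward(self, imgs):
        # imgs: (B, M, 3, H, W) multi-view images for one time step
        b, m, _, h, w = imgs.shape
        f2d = self.encoder(imgs.flatten(0, 1))           # (B*M, C, H/16, W/16)
        tokens = f2d.flatten(2).transpose(1, 2)          # (B*M, HW, C)
        fut = self.forecaster(tokens).transpose(1, 2).reshape_as(f2d)
        depth = self.depth_head(fut).softmax(1)          # (B*M, D, h', w')
        ctx = self.ctx_head(fut)                         # (B*M, Cc, h', w')
        # Lift step of LSS: outer product spreads context along depth bins.
        frustum = depth.unsqueeze(1) * ctx.unsqueeze(2)  # (B*M, Cc, D, h', w')
        # Crude stand-in for the Splat step: pool the frustum, fuse views,
        # and broadcast into a fixed voxel grid.
        vol = frustum.mean(dim=(2, 3, 4), keepdim=True)  # (B*M, Cc, 1, 1, 1)
        vol = vol.view(b, m, -1).mean(1)                 # (B, Cc) view fusion
        vol = vol.view(b, -1, 1, 1, 1).expand(
            b, ctx.shape[1], self.z, self.xy, self.xy).contiguous()
        return self.sem_head(self.occ_encoder(vol))      # (B, NC, Z, X, Y)
```

With the paper's settings (context dim 64, grid 16×200×200), the output matches the NC×16×200×200 logits described in the figure; the test below uses smaller dimensions only to keep it fast.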


The key component of our framework is a transformer-based future state synthesizer that models temporal evolution directly in feature space. To provide explicit context, the input features are enriched with learnable embeddings that encode camera identity, temporal position, and feature scale. Starting from the current frame, a set of future state queries iteratively interacts with past observations through cross-attention, progressively constructing a representation of the anticipated scene. During training, the synthesized features are explicitly aligned with features extracted from actual future images using a feature-level alignment loss. By learning to predict future visual representations rather than task outputs directly, the model captures scene dynamics more effectively and enables robust long-horizon semantic occupancy forecasting.
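A single future interaction step of this synthesizer might look as follows. This is a hedged sketch, not the paper's exact layer: the module names, the placement of the embeddings, and the L2 form of the feature-alignment loss are assumptions; only the overall recipe (learnable camera/temporal/scale embeddings, self- and cross-attention from future state queries to past features, alignment against encoded real future images) comes from the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FutureStateSynthesizer(nn.Module):
    """Sketch of one future interaction layer (assumed structure): future
    state queries attend over past features that are enriched with learnable
    camera-identity, temporal-position, and feature-scale embeddings."""

    def __init__(self, dim=256, heads=16, n_views=6, n_steps=4, n_scales=4):
        super().__init__()
        self.cam_emb = nn.Embedding(n_views, dim)    # camera identity
        self.time_emb = nn.Embedding(n_steps, dim)   # temporal position
        self.scale_emb = nn.Embedding(n_scales, dim)  # feature scale
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, queries, past, cam_id, t_id, s_id):
        # queries: (B, N, C) future state queries, initialized from the current frame
        # past:    (B, T, C) flattened past-frame feature tokens
        # cam_id, t_id, s_id: (B, T) integer indices for the context embeddings
        ctx = past + self.cam_emb(cam_id) + self.time_emb(t_id) + self.scale_emb(s_id)
        q, _ = self.self_attn(queries, queries, queries)
        q, _ = self.cross_attn(q, ctx, ctx)          # interact with past observations
        return q + self.ffn(q)


def alignment_loss(pred_future, real_future):
    """Feature-level alignment against features encoded from actual future
    images; an L2 penalty is assumed here for illustration."""
    return F.mse_loss(pred_future, real_future)
```

During training, `real_future` would come from running the same image encoder on the held-out future frames, so the synthesizer learns to predict future visual representations rather than task outputs.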

Code

A PyTorch implementation of this project is available for academic use in our GitHub repository, released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Riya Mohan, Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

ForecastOcc: Vision-based Semantic Occupancy Forecasting
IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 2026.
(PDF) (BibTeX)

Authors

Riya Mohan

University of Freiburg

Juana Valeria Hurtado

University of Freiburg

Rohit Mohan

University of Freiburg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by the German Research Foundation (DFG) Emmy Noether Program grant number 468878300. Additionally, this research was supported by the Bosch Research collaboration on AI-driven automated driving.