StreamOcc: Streaming Dense Voxel Representations for 3D Occupancy Prediction

Real-time dense voxel streaming for accurate 3D occupancy prediction with distortion-aware temporal aggregation and dynamic-object query injection.

1 Korea University, 2 TU Darmstadt & hessian.AI, 3 University of Michigan, 4 DGIST, 5 NAVER LABS
Work done during an internship at NAVER LABS
Work done while at NAVER LABS

Author's Email: shmoon96@korea.ac.kr

StreamOcc overview comparing naive dense voxel streaming with StreamAgg and QueryAgg
StreamOcc addresses two failure modes of naive dense voxel streaming: warping distortion from temporal alignment and degraded dynamic-object representations from image-to-voxel projection.

TL;DR: StreamOcc keeps dense voxel features in a recurrent streaming buffer, rectifies propagated features with StreamAgg, and injects dynamic-object semantics with QueryAgg to improve accuracy under real-time constraints.

Motivation

Dense voxel representations preserve fine-grained 3D spatial structure, but multi-frame dense fusion is expensive. Streaming avoids repeatedly processing all historical frames, yet naive dense voxel streaming creates interpolation artifacts during warping and weakens dynamic-object features when image evidence is projected into voxel space.

01 Warping Distortion

Past voxel features must be aligned to the current ego frame, and interpolation can blur boundaries or introduce artifacts.
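
The blurring effect is easy to reproduce in a toy setting. The sketch below is an illustration only, not the paper's setup: it assumes a one-dimensional binary occupancy row and plain linear interpolation, and the helper name `warp_1d` is hypothetical. A whole-voxel shift keeps the edge sharp, while a half-voxel shift turns the boundary cells into fractional values.

```python
import math


def warp_1d(values, shift):
    """Resample `values` at positions i + shift using linear interpolation.

    Positions outside the grid read as 0.0 (zero padding).
    """
    n = len(values)

    def sample(pos):
        lo = math.floor(pos)          # index of the left neighbour
        frac = pos - lo               # distance past the left neighbour
        left = values[lo] if 0 <= lo < n else 0.0
        right = values[lo + 1] if 0 <= lo + 1 < n else 0.0
        return (1.0 - frac) * left + frac * right

    return [sample(i + shift) for i in range(n)]


# A sharp occupancy edge: free space, an occupied block, free space.
row = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]

aligned = warp_1d(row, 1.0)   # whole-voxel shift: [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
blurred = warp_1d(row, 0.5)   # half-voxel shift:  [0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
```

In a streaming setting the warped state is warped again on the next frame, so this kind of softening can compound unless it is corrected.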

02 Dynamic Object Loss

Distant, occluded, and overlapping agents often lose fine-grained semantics during image-to-voxel projection.

03 Real-Time Constraint

Practical 3D occupancy needs strong spatial detail without the memory and latency costs of repeated dense history processing.

Method

Overall architecture of StreamOcc
StreamOcc predicts voxel occupancy in a streaming manner through two aggregation stages: StreamAgg for temporal dense voxel accumulation and QueryAgg for targeted dynamic-object refinement.
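
The two-stage flow can be pictured as a recurrent loop that stores a single voxel state instead of a window of past frames. The following is a minimal sketch under stated assumptions: `run_stream`, `average_agg`, and `identity_refine` are hypothetical names, the stage bodies are simple stand-ins rather than the actual StreamAgg and QueryAgg modules, and carrying the stage-1 output forward as the state is itself an assumption.

```python
from typing import Callable, List, Optional

Voxels = List[float]  # flattened dense voxel features, one scalar per voxel


def run_stream(
    frames: List[Voxels],
    stream_agg: Callable[[Optional[Voxels], Voxels], Voxels],
    query_agg: Callable[[Voxels], Voxels],
) -> List[Voxels]:
    """Process frames one at a time, carrying a single voxel state forward."""
    state: Optional[Voxels] = None   # streaming buffer: only one past state is kept
    outputs = []
    for current in frames:
        state = stream_agg(state, current)   # stage 1: temporal accumulation
        outputs.append(query_agg(state))     # stage 2: dynamic-object refinement
    return outputs


# Placeholder stages (assumptions): a running average and an identity refinement.
def average_agg(prev, cur):
    if prev is None:
        return list(cur)             # first frame: nothing to accumulate yet
    return [0.5 * p + 0.5 * c for p, c in zip(prev, cur)]


def identity_refine(vox):
    return list(vox)


outputs = run_stream(
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]], average_agg, identity_refine
)
```

Per-frame cost in this loop depends on the size of one voxel state, not on how many frames have been seen, which is the property that makes streaming attractive under a latency budget.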

StreamAgg

Rectified Voxel Streaming Aggregation

Propagated voxel features are motion-warped into the current ego frame, then corrected with adaptive residual refinement so temporal accumulation stays spatially consistent.
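
The motion-warping step can be illustrated on a 2D grid. This is a sketch under assumptions, not the StreamAgg implementation: it uses a bird's-eye-view slice instead of a full 3D volume, bilinear sampling with zero padding, unit-sized cells, and a rigid transform `(dx, dy, yaw)` that maps current-frame coordinates into the previous frame. The names `bilinear` and `warp_to_current` are hypothetical.

```python
import math


def bilinear(grid, x, y):
    """Bilinearly sample grid[row][col] at column x and row y (zero padded)."""
    rows, cols = len(grid), len(grid[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def at(r, c):
        return grid[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0

    top = (1.0 - fx) * at(y0, x0) + fx * at(y0, x0 + 1)
    bottom = (1.0 - fx) * at(y0 + 1, x0) + fx * at(y0 + 1, x0 + 1)
    return (1.0 - fy) * top + fy * bottom


def warp_to_current(prev, dx, dy, yaw):
    """Resample a previous-frame grid so it is aligned with the current ego frame."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    out = []
    for r in range(len(prev)):
        row = []
        for c in range(len(prev[0])):
            # Map the current-frame cell (c, r) into previous-frame coordinates.
            px = cos_y * c - sin_y * r + dx
            py = sin_y * c + cos_y * r + dy
            row.append(bilinear(prev, px, py))
        out.append(row)
    return out


prev = [[0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0]]

# The ego vehicle moved one cell along +x, so static content shifts one column back.
warped = warp_to_current(prev, dx=1.0, dy=0.0, yaw=0.0)
```

With a fractional `dx` or a non-zero `yaw`, the sampled positions fall between cells and the output mixes neighbouring features, which is the distortion the refinement step is meant to correct.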

QueryAgg

Query-Guided Aggregation

Instance-level queries capture dynamic-object semantics from image space and selectively inject them into occupied voxel regions instead of re-aggregating image features everywhere.
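
Selective injection can be sketched as attention that runs only where occupancy is predicted. The example below is a hypothetical illustration, not the QueryAgg module: the function name `inject_queries`, the dot-product similarity, the softmax weighting, and the fixed occupancy threshold are all assumptions chosen to keep the idea visible.

```python
import math


def inject_queries(voxels, occupancy, queries, threshold=0.5):
    """Add attention-weighted query features to voxels predicted as occupied.

    voxels:    list of feature vectors, one per voxel
    occupancy: list of occupancy probabilities, one per voxel
    queries:   list of instance-query feature vectors
    """
    out = []
    for feat, occ in zip(voxels, occupancy):
        if occ < threshold or not queries:
            out.append(list(feat))           # free space is left untouched
            continue
        # Dot-product similarity between this voxel and every query.
        scores = [sum(f * q for f, q in zip(feat, query)) for query in queries]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]   # numerically stable softmax
        total = sum(weights)
        mixed = [
            sum(w * query[d] for w, query in zip(weights, queries)) / total
            for d in range(len(feat))
        ]
        out.append([f + m for f, m in zip(feat, mixed)])  # residual injection
    return out


voxels = [[1.0, 0.0], [0.0, 1.0]]
occupancy = [0.9, 0.1]                 # only the first voxel counts as occupied
queries = [[2.0, 0.0], [0.0, 2.0]]

result = inject_queries(voxels, occupancy, queries)
```

Skipping free-space voxels is what keeps the cost tied to the number of occupied cells rather than to the full grid.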

Adaptive residual refinement module in StreamAgg
Adaptive residual refinement focuses correction on informative warped voxel features using geometry-aware attention.
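
One simple way to picture a correction that "focuses" on some voxels is a gated residual update. The sketch below is an assumption made for illustration, not the paper's module: the name `refine`, the per-voxel `geometry_logit` input, and the sigmoid gate are hypothetical stand-ins for the geometry-aware attention.

```python
import math


def refine(warped, residual, geometry_logit):
    """Apply a gated residual correction to warped voxel features.

    refined[i] = warped[i] + sigmoid(geometry_logit[i]) * residual[i]
    """
    refined = []
    for w, r, g in zip(warped, residual, geometry_logit):
        gate = 1.0 / (1.0 + math.exp(-g))   # attention weight in (0, 1)
        refined.append(w + gate * r)
    return refined


warped = [0.5, 0.5]          # features softened by interpolation
residual = [0.5, 0.5]        # proposed correction for each voxel
logits = [10.0, -10.0]       # high where geometry marks the voxel as informative

refined = refine(warped, residual, logits)
```

The first voxel receives almost the full correction and the second is left nearly unchanged, so the update concentrates where the geometry cue is strong.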

Results

SOTA Results: Occ3D-nuScenes / SurroundOcc-benchmark / RayIoU

41.9 Occ3D-nuScenes mIoU
23.4 SurroundOcc-benchmark mIoU
41.1 RayIoU
Quantitative results on Occ3D-nuScenes
SurroundOcc and RayIoU quantitative results
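
For readers unfamiliar with the mIoU metric, a minimal per-sample version is sketched below. It is a simplification: benchmark protocols typically accumulate counts over the whole evaluation set and may apply visibility masks, neither of which is modelled here, and RayIoU is a different, ray-based metric that this sketch does not cover. The function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:                 # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened per-voxel class labels for a tiny four-voxel scene.
pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]

score = mean_iou(pred, target, num_classes=3)   # (1.0 + 0.5 + 0.5) / 3
```

Multiplying the result by 100 gives the percentage form in which mIoU is usually reported.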

Qualitative Results

Citation

@misc{moon2025streamocc,
  title={Streaming Dense Voxel Representations for 3D Occupancy Prediction},
  author={Moon, Seokha and Baek, Janghyun and Jeong, Yujin and Chae, Daewon and Kim, Giseop and Lee, Jungbeom and Kim, Jinkyu and Choi, Sunwook},
  year={2025},
  eprint={2503.22087},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.22087}
}