ECCV 2026

StreamOcc: Streaming Dense Voxel Representations for 3D Occupancy Prediction

Real-time dense voxel streaming for accurate 3D occupancy prediction with distortion-aware temporal aggregation and dynamic-object query injection.

Seokha Moon^1,5,†, Janghyun Baek^1, Yujin Jeong^2, Daewon Chae^3, Giseop Kim^4,5,‡, Jungbeom Lee^1, Jinkyu Kim^1,*, Sunwook Choi^5,*

^1Korea University ^2TU Darmstadt & hessian.AI ^3University of Michigan ^4DGIST ^5NAVER LABS

^†Work done during an internship at NAVER LABS ^‡Work done while at NAVER LABS ^*Corresponding authors

Author's Email: shmoon96@korea.ac.kr

PDF | arXiv | Code: moonseokha/StreamOcc

StreamOcc overview comparing naive dense voxel streaming with StreamAgg and QueryAgg

StreamOcc addresses two failure modes of naive dense voxel streaming: warping distortion from temporal alignment and degraded dynamic-object representations from image-to-voxel projection.

TL;DR StreamOcc keeps dense voxel features in a recurrent streaming buffer, rectifies propagated features with StreamAgg, and injects dynamic-object semantics with QueryAgg to improve accuracy under real-time constraints.

Motivation

Dense voxel representations preserve fine-grained 3D spatial structure, but multi-frame dense fusion is expensive. Streaming avoids repeatedly processing all historical frames, yet naive dense voxel streaming creates interpolation artifacts during warping and weakens dynamic-object features when image evidence is projected into voxel space.

Warping Distortion

Past voxel features must be aligned to the current ego frame, and interpolation can blur boundaries or introduce artifacts.
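To make the blur concrete, here is a minimal one-dimensional sketch (not the authors' code) of how linear interpolation softens a sharp occupancy boundary when the ego shift is not a whole number of voxels, and how the softening compounds when a buffer is re-warped at every frame. The function name `warp_1d` and the toy feature row are illustrative assumptions; real streaming operates on 3D grids with full ego-pose transforms.

```python
import numpy as np

def warp_1d(features, shift):
    """Resample a 1D feature row at positions x + shift with linear interpolation.

    Samples that fall outside the row read as zero (zero padding).
    Hypothetical helper for illustration only.
    """
    n = features.shape[0]
    src = np.arange(n, dtype=np.float64) + shift   # source coordinate per target cell
    lo = np.floor(src).astype(int)                 # lower neighbour index
    w_hi = src - lo                                # weight of the upper neighbour
    padded = np.concatenate([[0.0], features, [0.0]])  # features[i] lives at padded[i + 1]
    lo_idx = np.clip(lo + 1, 0, n + 1)
    hi_idx = np.clip(lo + 2, 0, n + 1)
    return (1.0 - w_hi) * padded[lo_idx] + w_hi * padded[hi_idx]

# A sharp boundary between free space (0) and an occupied region (1).
edge = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

one_whole_step = warp_1d(edge, 1.0)                 # boundary stays sharp
two_half_steps = warp_1d(warp_1d(edge, 0.5), 0.5)   # same total shift, blurred boundary
```

Both calls move the row by one voxel in total, yet only the whole-voxel shift keeps the boundary binary: the two half-voxel warps produce intermediate values such as 0.25 and 0.75 around the edge. This is the kind of error a recurrent buffer can accumulate if the warped features are not corrected.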

Dynamic Object Loss

Distant, occluded, and overlapping agents often lose fine-grained semantics during image-to-voxel projection.

Real-Time Constraint

Practical 3D occupancy prediction needs fine spatial detail without the memory and latency costs of reprocessing the dense feature history at every frame.

Method

StreamOcc predicts voxel occupancy in a streaming manner through two aggregation stages: StreamAgg for temporal dense voxel accumulation and QueryAgg for targeted dynamic-object refinement.

StreamAgg

Rectified Voxel Streaming Aggregation

Propagated voxel features are motion-warped into the current ego frame, then corrected with adaptive residual refinement so temporal accumulation stays spatially consistent.
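The warp-then-correct loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: ego motion is reduced to a whole-voxel translation, the geometry-aware attention of the real refinement module is replaced by a plain sigmoid gate, the fusion step is a simple average, and the weights are random. `StreamBuffer`, `shift_with_zero_fill`, `w_res`, and `w_gate` are all hypothetical names.

```python
import numpy as np

def shift_with_zero_fill(vox, shift):
    """Translate a (C, X, Y, Z) grid by whole voxels, zero-filling vacated cells."""
    out = np.zeros_like(vox)
    src, dst = [slice(None)], [slice(None)]
    for s, n in zip(shift, vox.shape[1:]):
        if abs(s) >= n:          # everything moved out of the grid
            return out
        if s >= 0:
            src.append(slice(0, n - s)); dst.append(slice(s, n))
        else:
            src.append(slice(-s, n)); dst.append(slice(0, n + s))
    out[tuple(dst)] = vox[tuple(src)]
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StreamBuffer:
    """Keeps one dense voxel state and updates it once per frame."""

    def __init__(self, channels, rng):
        # Random toy weights standing in for learned 1x1x1 convolutions (assumption).
        self.w_res = rng.normal(scale=0.1, size=(channels, 2 * channels))
        self.w_gate = rng.normal(scale=0.1, size=(1, 2 * channels))
        self.state = None

    def step(self, current, shift):
        if self.state is None:                 # first frame: nothing to propagate
            self.state = current
            return current
        warped = shift_with_zero_fill(self.state, shift)      # align past state to current frame
        both = np.concatenate([warped, current], axis=0)      # (2C, X, Y, Z)
        residual = np.einsum("oc,cxyz->oxyz", self.w_res, both)
        gate = sigmoid(np.einsum("oc,cxyz->oxyz", self.w_gate, both))  # (1, X, Y, Z)
        refined = warped + gate * residual     # gated residual correction of the warped features
        self.state = 0.5 * (refined + current) # simple average fusion (assumption)
        return self.state
```

The point of the structure is that each frame touches only the previous state and the current features, so the cost per frame does not grow with the length of the history.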

QueryAgg

Query-Guided Aggregation

Instance-level queries capture dynamic-object semantics from image space and selectively inject them into occupied voxel regions instead of re-aggregating image features everywhere.
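The selective injection can be pictured with a small sketch. It assumes flattened voxel features, a per-voxel occupancy score, and plain scaled dot-product attention from occupied voxels to the instance queries; the function name `inject_queries`, the threshold of 0.5, and the residual update are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_queries(voxels, occupancy, queries, threshold=0.5):
    """Add query semantics to occupied voxels only.

    voxels:    (N, C) flattened voxel features
    occupancy: (N,)   occupancy score per voxel
    queries:   (Q, C) instance-level query features
    """
    out = voxels.copy()
    occupied = occupancy > threshold
    if not occupied.any():
        return out                              # nothing to refine
    v = voxels[occupied]                        # (M, C), with M <= N
    scores = v @ queries.T / np.sqrt(voxels.shape[1])   # (M, Q) voxel-to-query similarity
    out[occupied] = v + softmax(scores, axis=1) @ queries  # residual update from attended queries
    return out
```

Because attention is computed only for the M occupied voxels, the cost scales with M times Q rather than with the full grid size, which is the motivation for refining selectively instead of everywhere.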

Adaptive residual refinement module in StreamAgg

Adaptive residual refinement focuses correction on informative warped voxel features using geometry-aware attention.

Results

SOTA Results: Occ3D-nuScenes / SurroundOcc-benchmark / RayIoU

41.9 Occ3D-nuScenes mIoU

23.4 SurroundOcc-benchmark mIoU

41.1 RayIoU

Quantitative results on Occ3D-nuScenes: StreamOcc achieves 41.9 mIoU with 83.3 ms latency and 2.8 GB memory.

Quantitative results on the SurroundOcc benchmark and RayIoU: StreamOcc reaches 23.4 mIoU on the SurroundOcc benchmark and 41.1 RayIoU.
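For readers unfamiliar with the metric, mIoU averages the per-class intersection-over-union between predicted and ground-truth voxel labels. The sketch below shows the generic definition only; the benchmarks apply their own class lists and evaluation masks, which are not reproduced here, and `mean_iou` is a hypothetical helper name.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:            # class absent from both: skip it
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Six voxels, three classes.
pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
score = mean_iou(pred, target, num_classes=3)   # IoUs are 1/3, 2/3, 1/2, so the mean is 0.5
```

The figures on this page are reported as percentages, so a value of 0.5 from this function would correspond to 50.0 mIoU.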

Qualitative Results

Qualitative comparison between StreamOcc and ALOcc-mini

Comparison with ALOcc-mini, a prior real-time method, on 3D occupancy prediction.

Comparison between image-to-query only and image-to-query plus voxel-to-query

Adding voxel-to-query injection improves predictions over image-to-query alone.

QueryAgg ablation on distant, overlapped, and occluded dynamic objects

QueryAgg recovers distant, overlapped, and occluded dynamic objects.

Comparison between ALOcc-mini and StreamOcc across three urban scenes

Additional ALOcc-mini vs StreamOcc comparison across challenging urban scenes.

Dynamic urban scene with construction and pedestrians.

3D occupancy prediction at a crowded intersection on a rainy day

Crowded rainy intersection with multiple dynamic objects.

3D occupancy prediction in a narrow urban street

Narrow urban street with parked and moving vehicles, pedestrians, and bicycles.

3D occupancy prediction on a bridge with moving vehicles and pedestrians

Bridge scene with moving vehicles and pedestrians.

Citation

@misc{moon2025streamocc,
  title={Streaming Dense Voxel Representations for 3D Occupancy Prediction},
  author={Moon, Seokha and Baek, Janghyun and Jeong, Yujin and Chae, Daewon and Kim, Giseop and Lee, Jungbeom and Kim, Jinkyu and Choi, Sunwook},
  year={2025},
  eprint={2503.22087},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.22087}
}