News
-
Our paper Stateful Token Reduction for Long-Video Hybrid VLMs is now on arXiv. We propose stateful token reduction for long-video VLMs, enabling 3.8–4.2× faster prefilling (time-to-first-token, TTFT) with near-baseline accuracy while using only 25% of the visual tokens. With light finetuning under reduction, accuracy improves further and can even surpass the baseline. Check out the paper!
-
Check out how the AI community is discussing STORM! From technical podcasts to detailed audio summaries, these resources offer a variety of ways to learn about our latest research on efficient video LLMs. You can find the full list of videos here.
-
1 paper accepted to CVPR 2026.
-
3 papers accepted to NeurIPS 2025.
-
We recently introduced Nemotron Nano V2, a 9B hybrid model that delivers competitive or superior accuracy on reasoning benchmarks while achieving up to 6× higher inference throughput on reasoning workloads (e.g., 8k input and 16k output tokens). The model builds on our Mamba-based Hybrid LLM work.
-
Our token-efficient long-video model for multimodal LLMs (STORM) is on arXiv. It achieves more than 5% improvement on MLVU and LongVideoBench over the state of the art while reducing computation cost by up to 8× and decoding latency by 2.4–2.9×. Check the project page for more details!
Research Interests
-
Recurrent Neural Networks (RNNs), State-Space Models (SSMs), Linear RNNs
-
Sequence Learning, Spatio-Temporal Learning
Selected Projects
-
J Jiang*, A S Deshmukh, K Chumachenko, K Sapra, Z Yu, G Liu, A Tao, P Molchanov, J Kautz, W Byeon*, “Stateful Token Reduction for Long-Video Hybrid VLMs”, arXiv, 2026
-
Co-authored with many colleagues at NVIDIA (incl. W. Byeon), “NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model”, arXiv, 2025
-
Co-authored with many colleagues at NVIDIA (incl. W. Byeon), “Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models”, arXiv, 2025
-
J Jiang, X Li, Z Liu, M Li, G Chen, Z Li, D Huang, G Liu, Z Yu, K Keutzer, S Ahn, J Kautz, H Yin, Y Lu, S Han, W Byeon, “Token-Efficient Long Video Understanding for Multimodal LLMs”, arXiv, 2025