Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

Zichen Wen, Boxue Yang, Shuang Chen¹, Yaojie Zhang¹, Yuhang Han¹, Junlong Ke¹, Cong Wang¹, Yicheng Fu¹, Jiawang Zhao¹, Jiangchao Yao¹, Xi Fang², Zhen Wang², Henxing Cai², Lin Yao², Zhifeng Gao², Yanhui Hong², Nang Yuan², Yixuan Li², Guojiang Zhao², Haoyi Tao², Nan Wang², Han Lyu², Guolin Ke², Ning Liao³, Xiaoxing Wang³, Kai Chen³, Zhiyu Li³, Feiyu Xiong³, Sihan Hu, Kun Chen, Yanfeng Wang¹,
Weinan E¹†, Linfeng Zhang²†, Linfeng Zhang¹†
1 School of Artificial Intelligence, Shanghai Jiao Tong University  |  2 DP Technology  |  3 MemTensor  |  4 Institute of Theoretical Physics, Chinese Academy of Sciences
* Equal Contribution  |  † Corresponding Author

Abstract

We present Innovator-VL, a scientific multimodal large language model (MLLM) designed to advance multimodal understanding and reasoning across diverse scientific domains while maintaining strong performance on general vision tasks. Contrary to the prevailing trend of relying on massive domain-specific pretraining data and opaque training pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific multimodal intelligence with substantially reduced data requirements. (i) We provide a fully transparent, end-to-end reproducible training pipeline for scientific multimodal modeling, covering all stages from data collection and cleaning to preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, together with detailed optimization and hyperparameter recipes. This enables faithful reproduction of our results and facilitates systematic extension and adaptation by the community. (ii) Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on a wide range of scientific tasks with fewer than five million carefully curated scientific training samples and no large-scale scientific pretraining. These results highlight that effective scientific multimodal reasoning can be achieved through principled data selection and training strategies rather than indiscriminate data scaling. (iii) Innovator-VL demonstrates strong generalization beyond scientific domains, achieving competitive performance among MLLMs of comparable size on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified multimodal model without compromising general-purpose capabilities.
Together, these practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built without large-scale scientific data, providing a practical and transparent foundation for future research in scientific multimodal modeling.

Overall Illustration


Figure 1: Overall illustration of Innovator-VL.

Performance


Figure 2: Performance of Innovator-VL across general, reasoning, and scientific benchmarks.

General: Innovator-VL-8B-Instruct achieves top-tier performance on general multimodal benchmarks, matching or surpassing leading open-source models in visual perception, OCR, document understanding, and real-world reasoning.


Math & Reasoning: Innovator-VL-8B-Thinking further improves multimodal reasoning, attaining the highest average scores on visual math and reasoning benchmarks among comparable 8B models through reinforcement learning–enhanced long-horizon reasoning.


Science: Innovator-VL consistently outperforms general-purpose baselines, delivering strong results on chemistry, reaction understanding, molecular parsing, microscopy analysis, and scientific VQA tasks.

Training Data


Figure 3: Data distribution across different training stages of Innovator-VL.

Mid-Training: The mid-training stage focuses on large-scale multimodal alignment, combining diverse general-domain and scientific data to establish robust visual perception and cross-modal representation learning.


Instruction Tuning: The instruction-tuning stage emphasizes high-quality supervised data, integrating general instruction-following samples with carefully curated scientific tasks to enhance controllability and domain-specific understanding.


Reinforcement Learning: The reinforcement learning stage uses a compact but targeted dataset to further refine long-horizon reasoning, response consistency, and decision-making quality.

Data Construction Pipeline


Figure 4: Overview of data construction pipelines for scientific multimodal training.

Optical Chemical Structure Recognition (OCSR): A human-in-the-loop pipeline combines large-scale synthetic bootstrapping with active-learning-driven expansion on real patent and paper data, using E-SMILES as a unified annotation format to represent complex chemical structures.


Chemical Reaction Understanding: Reaction datasets are constructed from scientific PDFs via automated layout parsing and expert-verified question generation, covering both fine-grained reaction perception and document-level multimodal reasoning.


Electron Microscopy (EM) Microstructures: Large-scale EM datasets are built through iterative expert annotation and model-assisted refinement, with dense instance-level segmentation and structured attribute descriptions to support microstructural analysis.
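The active-learning-driven expansion in the OCSR pipeline hinges on deciding which unlabeled structures to route to human annotators. The sketch below illustrates one common selection criterion, mean per-token entropy of the recognizer's output distribution; the function names, toy probability values, and single-criterion ranking are illustrative assumptions, not the released pipeline.

```python
import math

def token_entropy(token_probs):
    """Mean per-token entropy of a model's output distributions.
    Higher entropy indicates lower model confidence on that sample."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies)

def select_for_annotation(samples, scores, budget):
    """Rank samples by uncertainty and keep the top `budget` for experts."""
    ranked = sorted(zip(samples, scores), key=lambda pair: pair[1], reverse=True)
    return [sample for sample, _ in ranked[:budget]]

# Toy example: per-token output distributions for three candidate images
# (hypothetical values, for illustration only).
probs = {
    "img_a": [[0.9, 0.1], [0.8, 0.2]],      # fairly confident
    "img_b": [[0.5, 0.5], [0.6, 0.4]],      # uncertain
    "img_c": [[0.99, 0.01], [0.95, 0.05]],  # very confident
}
samples = list(probs)
scores = [token_entropy(probs[s]) for s in samples]
batch = select_for_annotation(samples, scores, budget=1)  # -> ["img_b"]
```

In a human-in-the-loop setting, the selected batch would be labeled by experts, added to the training set, and the model retrained before the next selection round.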

Scientific Data Overview

Curated multimodal scientific data supporting reproducible and data-efficient training

Total Scientific Samples: ≈4.8M
Scientific Domains: 6+
Modalities: 3

Domain-wise Composition

| Scientific Domain | Primary Tasks | Modalities |
| --- | --- | --- |
| Chemistry | OCSR, reaction understanding, molecule parsing | Image + Structured Text |
| Materials Science | Microstructure analysis, EM interpretation | Microscopy Image + Text |
| Scientific Documents | Figure understanding, document-level reasoning | PDF Layout + Image + Text |
| General Science QA | Visual question answering, reasoning | Image + Natural Language |

Training Stage Statistics

| Stage | Dataset | Scale | Purpose |
| --- | --- | --- | --- |
| Mid-Training | Innovator-VL-Mid-Training | 85M samples | Multimodal alignment and representation learning |
| Instruction Tuning | Innovator-VL-Instruct | 46M samples | Instruction following and scientific controllability |
| Reinforcement Learning | Innovator-VL-RL | 172K trajectories | Long-horizon reasoning and decision refinement |

On Token Efficiency of Reasoning


Figure 5: Accuracy and token efficiency comparison on mathematical reasoning benchmarks.

Innovator-VL is highly efficient in its reasoning token consumption. As shown in Figure 5, Innovator-VL-8B-Thinking achieves competitive or superior accuracy on mathematical reasoning benchmarks while using substantially fewer tokens than comparable baselines, reducing both inference cost and latency.


This efficiency is primarily driven by the reinforcement learning stage, which encourages the model to focus on critical reasoning steps and eliminate redundant computation, resulting in more compact and effective reasoning trajectories.
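One simple way to realize such pressure toward compact reasoning during RL is a length-penalized reward. The sketch below is a hypothetical shaping function (the function name, token budget, and penalty weight are assumptions for illustration), not the actual objective used to train Innovator-VL.

```python
def length_penalized_reward(correct, n_tokens, budget=1024, penalty=0.2):
    """Task reward minus a penalty proportional to how far the response
    exceeds a token budget. Responses within budget are not penalized."""
    base = 1.0 if correct else 0.0
    overflow = max(0, n_tokens - budget) / budget
    return base - penalty * overflow

# A correct, concise answer earns full reward; a correct but verbose
# answer is discounted; an incorrect answer earns nothing.
short = length_penalized_reward(True, 800)     # 1.0
verbose = length_penalized_reward(True, 2048)  # 1.0 - 0.2 * 1.0 = 0.8
wrong = length_penalized_reward(False, 300)    # 0.0
```

Because the penalty only applies beyond the budget, the policy is free to reason at length on hard problems but is discouraged from padding easy ones, which matches the compact-trajectory behavior described above.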

BibTeX

@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}