
FireRed-OCR 2B Model: State-of-the-Art Document Parsing, Outperforming 397B Models

March 6, 2026

Xiaohongshu has open-sourced FireRed-OCR, a 2B-parameter model that achieves 92.94% on OmniDocBench v1.5. To put this in perspective: it surpasses Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Both code and weights are released under the Apache 2.0 license and are available for commercial use.

1. The "Structural Hallucination" Pain Point in Document Parsing

General-purpose large models have a common problem when reading PDFs: they recognize text accurately but get the structure completely wrong.

Structural Hallucination Example

Typical scenarios include table cells merged or rows split so the grid breaks, formulas transcribed as plausible-looking but invalid LaTeX, and heading or list hierarchy flattened or nested incorrectly.

This is not an occasional bug. General VLMs are better trained to generate semantically coherent text but lack precise constraints on the pixel-level spatial structure of documents.

FireRed-OCR's approach is straightforward: turn a general VLM into a "structural engineer" through a systematic training framework that places hard constraints on format syntax.

2. Technical Solution: Three-Stage Training + Format-Constrained GRPO

FireRed-OCR is not just simple fine-tuning; it's a complete training pipeline:

Model Architecture

Stage 1: Multi-Task Pre-Alignment

Establish "spatial foundation" at the visual perception level. The model first learns object detection, region identification, and layout-to-Markdown mapping, building the foundation for spatial localization.
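To make Stage 1 concrete, here is a hedged sketch of what a layout-to-Markdown training sample might look like. The field names and the reading-order heuristic are illustrative assumptions, not FireRed-OCR's published schema:

```python
# Hedged sketch of a Stage-1 layout-to-Markdown training sample; the
# field names below are illustrative, not FireRed-OCR's actual schema.
sample = {
    "image": "docs/page_017.png",
    "task": "layout_to_markdown",
    # Detected regions with pixel-space bounding boxes (x0, y0, x1, y1).
    "regions": [
        {"bbox": (40, 140, 980, 620), "type": "table"},
        {"bbox": (40, 60, 980, 110), "type": "title"},
    ],
    # Supervision target: regions rendered in reading order as Markdown.
    "target": "# Quarterly Revenue\n\n| Q1 | Q2 |\n| --- | --- |\n| 1.2 | 1.4 |\n",
}

def reading_order(regions):
    """Sort regions top-to-bottom, then left-to-right, by bbox origin."""
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))

ordered = [r["type"] for r in reading_order(sample["regions"])]
```

Training on detection and region-identification targets like these is what gives the model the spatial grounding that later stages build on.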

Stage 2: Task-Specific Supervised Fine-Tuning (SFT)

Precisely tune on high-quality, standardized Markdown datasets to ensure outputs have logical consistency and hierarchical expression capabilities.

Stage 3: Format-Constrained Reinforcement Learning (Format-Constrained GRPO)

The core innovation is here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces specialized format reward signals based on this, covering four dimensions:

| Dimension | Reward Signal |
| --- | --- |
| Formula syntax correctness | Whether LaTeX is valid |
| Table structure integrity | Whether tags are closed |
| Hierarchical tag closure | Whether Markdown nesting is correct |
| Text accuracy | Character-level recognition precision |

Every time the model outputs a result, the system scores it on these four dimensions and feeds back to the model for self-correction.
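The scoring loop above can be sketched in code. Everything below is an illustrative reconstruction, since the exact reward functions are not published: each of the four checks is a plausible stand-in, and the group-relative normalization follows the standard GRPO recipe.

```python
import re

def format_reward(output: str, reference: str) -> float:
    """Illustrative four-part format reward (NOT FireRed-OCR's actual code)."""
    score = 0.0
    # 1. Formula syntax: $ delimiters must pair up (a crude validity proxy).
    if output.count("$") % 2 == 0:
        score += 0.25
    # 2. Table structure: every HTML table tag that opens must close.
    if all(output.count(f"<{t}>") == output.count(f"</{t}>")
           for t in ("table", "tr", "td")):
        score += 0.25
    # 3. Hierarchy: Markdown heading levels never skip (e.g. # straight to ###).
    levels = [len(m) for m in re.findall(r"^(#+) ", output, flags=re.M)]
    if all(b - a <= 1 for a, b in zip(levels, levels[1:])):
        score += 0.25
    # 4. Text accuracy: position-wise character overlap with the reference.
    matches = sum(a == b for a, b in zip(output, reference))
    score += 0.25 * matches / max(len(output), len(reference), 1)
    return score

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's reward against its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]
```

In GRPO, several outputs are sampled per document; each gets a reward like `format_reward`, and the group-relative advantage tells the policy which samples in the group were better than average, so no separate value model is needed.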

3. Data Comparison: What Does 92.94% Mean?

FireRed-OCR-2B's performance on OmniDocBench v1.5:

| Model | Overall Score | Parameters | Approach Type |
| --- | --- | --- | --- |
| FireRed-OCR-2B | 92.94% | 2B | End-to-End |
| Qwen3.5-397B | 90.80% | 397B | End-to-End |
| Gemini-3.0 Pro | 90.33% | - | End-to-End |
| DeepSeek-OCR 2 | 91.09% | - | End-to-End |
| GLM-OCR | 94.60% | - | Pipeline |
| PaddleOCR-VL-1.5 | 94.50% | 1.5B | Pipeline |

FireRed-OCR is the best-performing end-to-end single model. PaddleOCR-VL-1.5 and GLM-OCR are pipeline solutions (multiple specialized models chained in series); they score higher but are more complex to deploy.

For text recognition alone (OCRBench TextRec), FireRed-OCR-2B ranks first among all models with 93.5 points, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).

On FireRedBench (a "stress test" benchmark built by the team, specifically collecting non-standard layout documents from real-world scenarios), FireRed-OCR-2B takes first place among end-to-end solutions with 74.62 points, surpassing the pipeline solution GLM-OCR (74.33), and only slightly below PaddleOCR-VL-1.5 (76.47).

The base model, Qwen3-VL-2B-Instruct, scores only 65.58 on the same benchmark, so the training framework accounts for a gain of roughly nine points.

4. Practical Deployment: A Few Lines of Code

With 2B parameters and bfloat16 precision, VRAM usage is about 4-5GB. A single RTX 3090 / A10 GPU is sufficient for smooth inference.
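The 4-5GB figure follows from simple arithmetic, sketched below; the activation overhead is a rough assumption, not a measured number:

```python
# Back-of-envelope VRAM estimate for a 2B-parameter model in bfloat16.
params = 2e9                 # 2 billion parameters
bytes_per_param = 2          # bfloat16 = 16 bits = 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)            # 4.0 GB for the weights alone
# KV cache and activations add very roughly 0.5-1 GB more at typical
# document resolutions (an assumed overhead), giving the 4-5 GB range.
```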

Installation:

pip install transformers qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR

Inference Example:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text)  # Standard Markdown format

Performance Optimization Tips:
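One common optimization for vision-language OCR is capping input resolution so the image encoder produces fewer visual tokens. A minimal sketch, where the 1536-pixel cap is an assumed value rather than a documented FireRed-OCR limit:

```python
def cap_long_side(width: int, height: int, max_side: int = 1536):
    """Downscale dimensions so the longer side is at most max_side,
    preserving aspect ratio; smaller images are left unchanged.
    The 1536 default is an assumption, not a documented model limit."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

# A 3072x1536 scan would be halved before being fed to the processor.
```

In practice you would resize with PIL (`img.resize(cap_long_side(*img.size))`) before passing the image to the processor, trading a little fine-print fidelity for speed and VRAM.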

5. Suitable Use Cases

FireRed-OCR excels at document parsing requiring structural integrity:

For downstream tasks with high requirements for "tables must not break" and "formulas must not be wrong," it is currently the most reliable choice among end-to-end solutions.

Unsuitable scenarios:

6. Conclusion

FireRed-OCR is a demonstration of "task-specific optimization": not piling up parameters, but using a carefully designed training framework to let a 2B model outperform 397B general-purpose models on a specialized task.

On vertical tasks, the efficiency of specialized training exceeds that of scaling up model size.
