
FireRed-OCR 2B Model: State-of-the-Art Document Parsing, Outperforming 397B Models

March 6, 2026

Xiaohongshu has open-sourced FireRed-OCR, a 2B-parameter model that achieves 92.94% on OmniDocBench v1.5. To put this in perspective: it surpasses Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Both code and weights are released under the Apache 2.0 license and are available for commercial use.

1. The "Structural Hallucination" Pain Point in Document Parsing

General-purpose large models have a common problem when reading PDFs: they recognize text accurately but get the structure completely wrong.

Structural Hallucination Example

Typical scenarios include table cells merged or rows split so the grid breaks, formulas transcribed as plausible-looking but invalid LaTeX, and heading or list hierarchy flattened or nested incorrectly.

This is not an occasional bug. General VLMs are better trained to generate semantically coherent text but lack precise constraints on the pixel-level spatial structure of documents.

FireRed-OCR's approach is straightforward: turn a general VLM into a "structural engineer" through a systematic training framework that places hard constraints on format syntax.

2. Technical Solution: Three-Stage Training + Format-Constrained GRPO

FireRed-OCR is not just simple fine-tuning; it's a complete training pipeline:

Model Architecture

Stage 1: Multi-Task Pre-Alignment

Establish "spatial foundation" at the visual perception level. The model first learns object detection, region identification, and layout-to-Markdown mapping, building the foundation for spatial localization.
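To make Stage 1 concrete, here is a hedged sketch of what a layout-to-Markdown training sample might look like. The field names and the reading-order heuristic are illustrative assumptions, not FireRed-OCR's published schema:

```python
# Hedged sketch of a Stage-1 layout-to-Markdown training sample; the
# field names below are illustrative, not FireRed-OCR's actual schema.
sample = {
    "image": "docs/page_017.png",
    "task": "layout_to_markdown",
    # Detected regions with pixel-space bounding boxes (x0, y0, x1, y1).
    "regions": [
        {"bbox": (40, 140, 980, 620), "type": "table"},
        {"bbox": (40, 60, 980, 110), "type": "title"},
    ],
    # Supervision target: regions rendered in reading order as Markdown.
    "target": "# Quarterly Revenue\n\n| Q1 | Q2 |\n| --- | --- |\n| 1.2 | 1.4 |\n",
}

def reading_order(regions):
    """Sort regions top-to-bottom, then left-to-right, by bbox origin."""
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))

ordered = [r["type"] for r in reading_order(sample["regions"])]
```

Training on detection and region-identification targets like these is what gives the model the spatial grounding that later stages build on.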

Stage 2: Task-Specific Supervised Fine-Tuning (SFT)

Precisely tune on high-quality, standardized Markdown datasets to ensure outputs have logical consistency and hierarchical expression capabilities.

Stage 3: Format-Constrained Reinforcement Learning (Format-Constrained GRPO)

The core innovation is here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces specialized format reward signals based on this, covering four dimensions:

| Dimension | Reward Signal |
| --- | --- |
| Formula syntax correctness | Whether LaTeX is valid |
| Table structure integrity | Whether tags are closed |
| Hierarchical tag closure | Whether Markdown nesting is correct |
| Text accuracy | Character-level recognition precision |

Every time the model outputs a result, the system scores it on these four dimensions and feeds back to the model for self-correction.
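The scoring loop above can be sketched in code. Everything below is an illustrative reconstruction, since the exact reward functions are not published: each of the four checks is a plausible stand-in, and the group-relative normalization follows the standard GRPO recipe.

```python
import re

def format_reward(output: str, reference: str) -> float:
    """Illustrative four-part format reward (NOT FireRed-OCR's actual code)."""
    score = 0.0
    # 1. Formula syntax: $ delimiters must pair up (a crude validity proxy).
    if output.count("$") % 2 == 0:
        score += 0.25
    # 2. Table structure: every HTML table tag that opens must close.
    if all(output.count(f"<{t}>") == output.count(f"</{t}>")
           for t in ("table", "tr", "td")):
        score += 0.25
    # 3. Hierarchy: Markdown heading levels never skip (e.g. # straight to ###).
    levels = [len(m) for m in re.findall(r"^(#+) ", output, flags=re.M)]
    if all(b - a <= 1 for a, b in zip(levels, levels[1:])):
        score += 0.25
    # 4. Text accuracy: position-wise character overlap with the reference.
    matches = sum(a == b for a, b in zip(output, reference))
    score += 0.25 * matches / max(len(output), len(reference), 1)
    return score

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's reward against its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]
```

In GRPO, several outputs are sampled per document; each gets a reward like `format_reward`, and the group-relative advantage tells the policy which samples in the group were better than average, so no separate value model is needed.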

3. Data Comparison: What Does 92.94% Mean?

FireRed-OCR-2B's performance on OmniDocBench v1.5:

| Model | Overall Score | Parameters | Approach Type |
| --- | --- | --- | --- |
| FireRed-OCR-2B | 92.94% | 2B | End-to-End |
| Qwen3.5-397B | 90.80% | 397B | End-to-End |
| Gemini-3.0 Pro | 90.33% | - | End-to-End |
| DeepSeek-OCR 2 | 91.09% | - | End-to-End |
| GLM-OCR | 94.60% | - | Pipeline |
| PaddleOCR-VL-1.5 | 94.50% | 1.5B | Pipeline |

FireRed-OCR is the best-performing end-to-end single model. PaddleOCR-VL-1.5 and GLM-OCR are pipeline solutions (multiple specialized models chained in series); they score higher but are more complex to deploy.

For text recognition alone (OCRBench TextRec), FireRed-OCR-2B ranks first among all models with 93.5 points, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).

On FireRedBench (a "stress test" benchmark built by the team, specifically collecting non-standard layout documents from real-world scenarios), FireRed-OCR-2B takes first place among end-to-end solutions with 74.62 points, surpassing the pipeline solution GLM-OCR (74.33), and only slightly below PaddleOCR-VL-1.5 (76.47).

The base model, Qwen3-VL-2B-Instruct, scores only 65.58 on the same benchmark, so the training framework accounts for a gain of roughly nine points.

4. Practical Deployment: A Few Lines of Code

With 2B parameters and bfloat16 precision, VRAM usage is about 4-5GB. A single RTX 3090 / A10 GPU is sufficient for smooth inference.
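The 4-5GB figure follows from simple arithmetic, sketched below; the activation overhead is a rough assumption, not a measured number:

```python
# Back-of-envelope VRAM estimate for a 2B-parameter model in bfloat16.
params = 2e9                 # 2 billion parameters
bytes_per_param = 2          # bfloat16 = 16 bits = 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)            # 4.0 GB for the weights alone
# KV cache and activations add very roughly 0.5-1 GB more at typical
# document resolutions (an assumed overhead), giving the 4-5 GB range.
```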

Installation:

pip install transformers qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR

Inference Example:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text)  # Standard Markdown format

Performance Optimization Tips:
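One common optimization for vision-language OCR is capping input resolution so the image encoder produces fewer visual tokens. A minimal sketch, where the 1536-pixel cap is an assumed value rather than a documented FireRed-OCR limit:

```python
def cap_long_side(width: int, height: int, max_side: int = 1536):
    """Downscale dimensions so the longer side is at most max_side,
    preserving aspect ratio; smaller images are left unchanged.
    The 1536 default is an assumption, not a documented model limit."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

# A 3072x1536 scan would be halved before being fed to the processor.
```

In practice you would resize with PIL (`img.resize(cap_long_side(*img.size))`) before passing the image to the processor, trading a little fine-print fidelity for speed and VRAM.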

5. Suitable Use Cases

FireRed-OCR excels at document parsing requiring structural integrity:

For downstream tasks with high requirements for "tables must not break" and "formulas must not be wrong," it is currently the most reliable choice among end-to-end solutions.

Unsuitable scenarios:

6. Conclusion

FireRed-OCR is a demonstration of "task-specific optimization": not piling up parameters, but using a carefully designed training framework to let a 2B model outperform 397B general-purpose models on a specialized task.

On vertical tasks, the efficiency of specialized training exceeds that of scaling up model size.
