Xiaohongshu has open-sourced FireRed-OCR, a 2B-parameter model that scores 92.94% on OmniDocBench v1.5. To put this in perspective: it surpasses Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Under the Apache 2.0 license, both code and weights are available for commercial use.
1. The "Structural Hallucination" Pain Point in Document Parsing
General-purpose large models have a common problem when reading PDFs: they recognize text accurately but get the structure completely wrong.
Typical scenarios:
- Table rows and columns are scrambled, with data mismatched
- Mathematical formulas are "created" with symbols appearing out of nowhere
- Multicolumn documents have confused reading order, with lines crossing columns
This is not an occasional bug. General VLMs are better trained to generate semantically coherent text but lack precise constraints on the pixel-level spatial structure of documents.
FireRed-OCR's approach is straightforward: transform a general VLM into a "structural engineer" through a systematic training framework that imposes hard constraints on format syntax.
2. Technical Solution: Three-Stage Training + Format-Constrained GRPO
FireRed-OCR is not just simple fine-tuning; it's a complete training pipeline:
Stage 1: Multi-Task Pre-Alignment
Establish "spatial foundation" at the visual perception level. The model first learns object detection, region identification, and layout-to-Markdown mapping, building the foundation for spatial localization.
Stage 2: Task-Specific Supervised Fine-Tuning (SFT)
Precisely tune on high-quality, standardized Markdown datasets to ensure outputs have logical consistency and hierarchical expression capabilities.
Stage 3: Format-Constrained Reinforcement Learning (Format-Constrained GRPO)
The core innovation is here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces specialized format reward signals based on this, covering four dimensions:
| Dimension | Reward Signal |
|---|---|
| Formula Syntax Correctness | Whether LaTeX is valid |
| Table Structure Integrity | Whether tags are closed |
| Hierarchical Tag Closure | Whether Markdown nesting is correct |
| Text Accuracy | Character-level recognition precision |
Each time the model produces an output, the system scores it along these four dimensions and feeds the reward back, pushing the model to self-correct.
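To make the four dimensions concrete, here is a toy scorer in the spirit of the description above. This is an illustrative sketch, not the released training code: the actual GRPO reward functions are not published in this form, and the specific checks (balanced braces, uniform table columns, matched tags, character overlap) are simplified stand-ins.

```python
import re

def format_reward(output: str, reference: str) -> float:
    """Toy format reward covering the four dimensions in the table above.

    Each structural dimension contributes up to 0.25; the last term is a
    crude character-level accuracy against a reference transcription.
    """
    score = 0.0
    # 1. Formula syntax: every $...$ span must have balanced braces.
    formulas = re.findall(r"\$([^$]*)\$", output)
    if all(f.count("{") == f.count("}") for f in formulas):
        score += 0.25
    # 2. Table structure: all pipe-table rows must have the same column count.
    rows = [r for r in output.splitlines() if r.strip().startswith("|")]
    if not rows or len({r.count("|") for r in rows}) == 1:
        score += 0.25
    # 3. Hierarchical tags: every opening HTML-style tag must be closed.
    opens = re.findall(r"<(\w+)>", output)
    closes = re.findall(r"</(\w+)>", output)
    if sorted(opens) == sorted(closes):
        score += 0.25
    # 4. Text accuracy: position-wise character overlap with the reference.
    matches = sum(a == b for a, b in zip(output, reference))
    score += 0.25 * (matches / max(len(reference), 1))
    return score
```

A perfectly formed table transcription would score 1.0, while an output with an unclosed LaTeX brace loses the formula-syntax share of the reward; GRPO then compares such scores across a group of sampled outputs to compute relative advantages.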
3. Data Comparison: What Does 92.94% Mean?
FireRed-OCR-2B's performance on OmniDocBench v1.5:
| Model | Overall Score | Parameters | Approach Type |
|---|---|---|---|
| FireRed-OCR-2B | 92.94% | 2B | End-to-End |
| Qwen3.5-397B | 90.80% | 397B | End-to-End |
| Gemini-3.0 Pro | 90.33% | - | End-to-End |
| DeepSeek-OCR 2 | 91.09% | - | End-to-End |
| GLM-OCR | 94.60% | - | Pipeline |
| PaddleOCR-VL-1.5 | 94.50% | 1.5B | Pipeline |
FireRed-OCR is the optimal solution among end-to-end single models. PaddleOCR-VL-1.5 and GLM-OCR are pipeline solutions (multiple specialized models connected in series), scoring higher but with more complex deployment.
For text recognition alone (OCRBench TextRec), FireRed-OCR-2B ranks first among all models with 93.5 points, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).
On FireRedBench (a "stress test" benchmark built by the team, specifically collecting non-standard layout documents from real-world scenarios), FireRed-OCR-2B takes first place among end-to-end solutions with 74.62 points, surpassing the pipeline solution GLM-OCR (74.33), and only slightly below PaddleOCR-VL-1.5 (76.47).
The base model, Qwen3-VL-2B-Instruct, scores only 65.58 on the same benchmark, so most of the gain comes from the training framework itself.
4. Practical Deployment: A Few Lines of Code
With 2B parameters and bfloat16 precision, VRAM usage is about 4-5GB. A single RTX 3090 / A10 GPU is sufficient for smooth inference.
Installation:
pip install transformers qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR
Inference Example:
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv  # helper script shipped in the repo

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text[0])  # standard Markdown output
Performance Optimization Tips:
- Enable flash_attention_2 to significantly reduce peak VRAM and improve throughput
- max_new_tokens defaults to 8192; for dense, multi-page academic papers, keep this value or raise it
- Image quality has a significant impact; try to provide images ≥150 DPI
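Enabling FlashAttention-2 is a one-argument change at load time. A configuration sketch (it assumes the flash-attn package is installed and an Ampere-or-newer GPU is available; otherwise drop the argument and Transformers falls back to its default attention):

```python
import torch
from transformers import Qwen3VLForConditionalGeneration

# attn_implementation="flash_attention_2" lowers peak VRAM and raises
# throughput on long OCR outputs; it requires the flash-attn package.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```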
5. Suitable Use Cases
FireRed-OCR excels at document parsing requiring structural integrity:
- Academic papers (with formulas)
- Financial tables
- Technical documentation
- Scanned books with multicolumn layouts
For downstream tasks with high requirements for "tables must not break" and "formulas must not be wrong," it is currently the most reliable choice among end-to-end solutions.
Unsuitable scenarios:
- Extremely high precision requirements with engineering resources to maintain multi-model systems → choose PaddleOCR-VL-1.5 or GLM-OCR
- Very poor quality scanned images (<100 DPI) → performance will significantly decline
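Since low-resolution scans degrade results, a cheap pre-flight check on declared image resolution can filter them before inference. The helper below is hypothetical (not part of the FireRed-OCR repo); it takes the (x, y) DPI tuple that, for example, PIL exposes via `Image.open(path).info.get("dpi")`.

```python
def resolution_ok(dpi, min_dpi: int = 150) -> bool:
    """Return True if a scan's declared resolution meets the suggested floor.

    `dpi` is an (x, y) tuple from the image metadata; None means the
    metadata is missing, which passes since the true resolution is unknown.
    """
    if dpi is None:
        return True
    return min(dpi) >= min_dpi
```

A fax-quality 72 DPI scan would be flagged before wasting GPU time, while images with no embedded DPI metadata are let through for a human to judge.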
6. Conclusion
FireRed-OCR is a demonstration of task-specific optimization: not by piling up parameters, but through a carefully designed training framework, a 2B model can outperform general-purpose models with hundreds of billions of parameters on a specialized task.
On vertical tasks, the efficiency of specialized training exceeds that of scaling up model size.
Resources and References
- GitHub: github.com/FireRedTeam/FireRed-OCR
- ModelScope: modelscope.cn/models/FireRedTeam/FireRed-OCR
- Technical Report: arxiv.org/abs/2603.01840
- Demo: huggingface.co/spaces/FireRedTeam/FireRed-OCR