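GOT-OCR (General OCR Theory)

GOT-OCR is a general-purpose, end-to-end OCR model built on a vision-language multimodal architecture. A single model handles multiple OCR tasks, including scene text recognition, document OCR, and formula recognition.

🌟 Core Features

✅ Unified architecture: one model handles multiple OCR tasks
✅ End-to-end: goes directly from image to structured text output
✅ Multilingual: recognizes 80+ languages
✅ Layout understanding: interprets document structure and typesetting
✅ Formula recognition: converts complex mathematical formulas to LaTeX
✅ Table recognition: accurately recovers table structure and content
✅ High accuracy: reaches SOTA results on multiple benchmarks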

📦 Installation and Usage

Environment Setup

# Create a virtual environment
conda create -n got-ocr python=3.9
conda activate got-ocr

# Install dependencies
pip install torch torchvision
pip install transformers
pip install pillow opencv-python

# Install GOT-OCR
pip install got-ocr

Quick Start

from got_ocr import GOTOCR
from PIL import Image

# Initialize the model
model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")

# Load the image
image = Image.open("document.jpg")

# Run OCR
result = model.chat(
    image,
    ocr_type="ocr"  # one of: ocr / format / formula
)

print(result)

Task Types

# 1. Plain-text OCR
text = model.chat(image, ocr_type="ocr")

# 2. Formatted output (preserves layout)
formatted = model.chat(image, ocr_type="format")

# 3. Formula recognition
formula = model.chat(image, ocr_type="formula")

# 4. Fine-grained OCR (with position information)
detailed = model.chat(
    image,
    ocr_type="ocr",
    render=True  # return a structured result that includes positions
)

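🏗️ Technical Architecture

Model Architecture

Vision encoder: a VitDet-style ViT (about 80M parameters) that compresses a 1024×1024 input image into 256 image tokens
Connector: a linear layer that projects the image tokens into the decoder's embedding space
Text decoder: Qwen-0.5B, which generates the OCR result autoregressively
Task adaptation: different prompts control the output format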
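
The task-adaptation idea, where the prompt rather than the network selects the output format, can be sketched as a lookup from task type to prompt text. This is only an illustration: the prompt strings and the `build_prompt` helper are hypothetical placeholders, not taken from the GOT-OCR code base.

```python
# Hypothetical prompt table: the strings are illustrative placeholders,
# not necessarily the prompts GOT-OCR uses internally.
TASK_PROMPTS = {
    "ocr": "OCR: ",
    "format": "OCR with format: ",
    "formula": "OCR formula: ",
}


def build_prompt(ocr_type: str) -> str:
    """Return the prompt that selects the output format for a task type."""
    try:
        return TASK_PROMPTS[ocr_type]
    except KeyError:
        supported = ", ".join(sorted(TASK_PROMPTS))
        raise ValueError(
            f"unknown ocr_type {ocr_type!r}; expected one of: {supported}"
        ) from None
```

Keeping the mapping in one table means a new task only needs a new prompt entry, which mirrors the "one model, many tasks" design described here.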

🎯 Use Cases

1. Document OCR with Layout Preservation

from got_ocr import GOTOCR
from PIL import Image

def document_ocr_with_layout(image_path):
    """Document OCR that preserves the layout."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    image = Image.open(image_path)
    
    # Formatted OCR, keeping the original typesetting
    result = model.chat(
        image,
        ocr_type="format",
        fine_grained=True
    )
    
    # result contains structured information
    return {
        "text": result["text"],
        "layout": result["layout"],
        "confidence": result["confidence"]
    }

2. Mathematical Formula Recognition

def recognize_formula(image_path):
    """Recognize a mathematical formula."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    image = Image.open(image_path)
    
    # Formula recognition mode
    latex = model.chat(
        image,
        ocr_type="formula"
    )
    
    print(f"LaTeX: {latex}")
    return latex

# Example output
# LaTeX: \int_{0}^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}
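
Generated LaTeX can be cut off or malformed, so a cheap sanity check before rendering may be useful. The sketch below only verifies that unescaped braces are balanced; `braces_balanced` is a hypothetical helper and is not part of GOT-OCR.

```python
def braces_balanced(latex: str) -> bool:
    """Return True if every unescaped '{' has a matching '}'."""
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opener
                return False
        i += 1
    return depth == 0
```

A formula that fails this check is a candidate for re-running recognition or for manual review; passing it does not guarantee that the LaTeX compiles.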

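3. Table Recognition and Structuring

import pandas as pd

def extract_table(image_path):
    """Extract a table and convert it to a DataFrame."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    image = Image.open(image_path)
    
    # Table recognition
    result = model.chat(
        image,
        ocr_type="format",
        table_mode=True
    )
    
    # Parse the table structure
    table_data = parse_table_structure(result)
    
    # Convert to a DataFrame
    df = pd.DataFrame(table_data)
    return df

def parse_table_structure(ocr_result):
    """Parse pipe-delimited table rows from the OCR result."""
    rows = []
    for line in ocr_result.split('\n'):
        if '|' in line:
            # Strip the outer pipes so they do not produce empty cells
            cells = [cell.strip() for cell in line.strip().strip('|').split('|')]
            # Skip separator rows such as |---|---|
            if all(cell and set(cell) <= set('-:') for cell in cells):
                continue
            rows.append(cells)
    return rows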
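
If the first row of the recognized table is a header, the rows can be keyed by column name instead of position. The following is a minimal sketch under the assumption that the OCR output is a Markdown-style pipe table; `parse_pipe_table` is a hypothetical helper and uses only the standard library.

```python
def parse_pipe_table(text: str) -> list[dict[str, str]]:
    """Parse a Markdown-style pipe table into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue
        # Drop the optional outer pipes before splitting into cells
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip separator rows such as |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Pad or truncate each row so it lines up with the header
    return [
        dict(zip(header, (row + [""] * len(header))[: len(header)]))
        for row in body
    ]
```

The returned list of dictionaries can be passed directly to `pd.DataFrame(...)`, which uses the dictionary keys as column names.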

4. Multilingual Document Processing

def multilingual_ocr(image_path, languages=None):
    """Multilingual OCR."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    image = Image.open(image_path)
    
    # Specify the languages (optional)
    result = model.chat(
        image,
        ocr_type="ocr",
        languages=languages  # e.g. ['zh', 'en', 'ja', 'ko']
    )
    
    return result

# Mixed Chinese-English document
text = multilingual_ocr("mixed_doc.jpg", languages=['zh', 'en'])

# Automatic language detection
text = multilingual_ocr("document.jpg")  # languages=None: auto-detect
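
When the languages of a document are not known in advance, counting which Unicode scripts appear in a first-pass OCR result gives a rough hint. This is a sketch using only the standard library; `script_counts` is a hypothetical helper, and a script is not the same as a language (Latin letters alone cannot distinguish English from French, for example).

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of each Unicode name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1  # e.g. "LATIN", "CJK", "HIRAGANA"
    return counts
```

A result dominated by `CJK` and `LATIN`, for instance, suggests a mixed Chinese-English document.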

5. Scene Text Recognition

def scene_text_recognition(image_path):
    """Scene text recognition (street signs, advertisements, etc.)."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    image = Image.open(image_path)
    
    # Scene text mode
    result = model.chat(
        image,
        ocr_type="ocr",
        scene_mode=True,
        render=True  # return text positions
    )
    
    # Draw the detection boxes
    from got_ocr.utils import draw_boxes
    annotated_image = draw_boxes(image, result["boxes"])
    
    return {
        "text": result["text"],
        "boxes": result["boxes"],
        "image": annotated_image
    }

6. Batch Document Processing

from pathlib import Path
import json

def batch_ocr(input_dir, output_dir):
    """Process a directory of documents."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    # Create the output directory, including missing parents
    output_path.mkdir(parents=True, exist_ok=True)
    
    results = []
    
    for img_file in input_path.glob("*.jpg"):
        try:
            print(f"Processing: {img_file.name}")
            
            image = Image.open(img_file)
            result = model.chat(image, ocr_type="format")
            
            # Save the text
            txt_file = output_path / f"{img_file.stem}.txt"
            with open(txt_file, "w", encoding="utf-8") as f:
                f.write(result)
            
            results.append({
                "file": img_file.name,
                "status": "success",
                "length": len(result)
            })
            
        except Exception as e:
            # Record the failure and continue with the next file
            results.append({
                "file": img_file.name,
                "status": "error",
                "error": str(e)
            })
    
    # Save the processing report (UTF-8, so non-ASCII file names stay readable)
    with open(output_path / "report.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    return results
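
The loop above only picks up `*.jpg` files and is tied to a loaded model. A variant that accepts several image extensions and takes the OCR step as a callable is easier to test without a GPU. This is a sketch: `batch_process`, `IMAGE_SUFFIXES`, and the `ocr_fn` parameter are hypothetical names, and `ocr_fn` stands in for whatever wraps `model.chat`.

```python
from pathlib import Path
from typing import Callable

# Assumed set of extensions to treat as images
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def batch_process(
    input_dir: str, output_dir: str, ocr_fn: Callable[[Path], str]
) -> list[dict]:
    """Run ocr_fn on every image in input_dir and write one .txt file per image."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = []
    for img in sorted(Path(input_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        try:
            text = ocr_fn(img)
            (out / f"{img.stem}.txt").write_text(text, encoding="utf-8")
            report.append({"file": img.name, "status": "success", "length": len(text)})
        except Exception as exc:  # keep going so one bad file does not stop the batch
            report.append({"file": img.name, "status": "error", "error": str(exc)})
    return report
```

Passing the OCR step in as a function also makes it straightforward to swap in a stub during tests or a different model later.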

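🔧 Advanced Configuration

Custom Inference Parameters

import torch

model = GOTOCR.from_pretrained(
    "ucaslcl/GOT-OCR2_0",
    device="cuda",           # run on the GPU
    dtype=torch.float16,     # half precision for faster inference
    low_cpu_mem_usage=True   # reduce host memory usage while loading
)

# Inference configuration
result = model.chat(
    image,
    ocr_type="format",
    # Generation parameters
    max_length=4096,
    temperature=0.0,
    top_p=0.9,
    # OCR parameters
    min_confidence=0.8,
    merge_boxes=True,
    preserve_layout=True
)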

Fine-Grained Control

def fine_grained_ocr(image_path):
    """Fine-grained OCR that returns detailed information."""
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    image = Image.open(image_path)
    
    result = model.chat(
        image,
        ocr_type="ocr",
        render=True,
        fine_grained=True
    )
    
    # Print the details of each text block
    for block in result["blocks"]:
        print(f"Text: {block['text']}")
        print(f"Position: {block['bbox']}")
        print(f"Confidence: {block['confidence']}")
        print(f"Language: {block['language']}")
        print("---")
    
    return result
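
Once per-block results are available, a common next step is to drop low-confidence blocks and put the rest into reading order. The sketch below assumes each block is a dictionary with `text`, `bbox`, and `confidence` keys as printed above, and additionally assumes that `bbox` is `(x1, y1, x2, y2)` in pixels with the origin at the top left; `filter_and_order` is a hypothetical helper.

```python
def filter_and_order(blocks: list[dict], min_confidence: float = 0.8) -> list[dict]:
    """Keep confident blocks and sort them top-to-bottom, then left-to-right."""
    kept = [b for b in blocks if b["confidence"] >= min_confidence]
    # Assumed bbox layout: (x1, y1, x2, y2); sort by top edge, then left edge
    return sorted(kept, key=lambda b: (b["bbox"][1], b["bbox"][0]))
```

Sorting by the top-left corner is a simplification that works for single-column pages; multi-column layouts need column detection first.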

Performance Optimization

import torch
from PIL import Image

# Use compilation (PyTorch 2.0+)
model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
model = torch.compile(model)

# Batch inference
images = [Image.open(f"img_{i}.jpg") for i in range(10)]
results = model.batch_chat(
    images,
    ocr_type="ocr",
    batch_size=4
)

# Use mixed precision
with torch.autocast(device_type="cuda", dtype=torch.float16):
    result = model.chat(images[0], ocr_type="ocr")
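
Batching comes down to splitting the inputs into fixed-size chunks and concatenating the per-chunk outputs in order. The sketch below shows that mechanic independently of any model; `chunked`, `run_in_batches`, and the `infer_batch` callable are hypothetical names, with `infer_batch` standing in for a call that processes one batch.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(
    items: Sequence[T],
    batch_size: int,
    infer_batch: Callable[[Sequence[T]], list[R]],
) -> list[R]:
    """Apply infer_batch to each chunk and return the results in input order."""
    results: list[R] = []
    for batch in chunked(items, batch_size):
        results.extend(infer_batch(batch))
    return results
```

With 10 images and a batch size of 4, this produces chunks of 4, 4, and 2, so the last batch is simply smaller rather than padded.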

📊 Performance Comparison

Model	SROIE	TextVQA	DocVQA	Formula recognition	Speed
GOT-OCR2.0	98.1%	84.5%	91.2%	95.3%	⭐⭐⭐⭐
PaddleOCR	95.2%	-	-	88.0%	⭐⭐⭐⭐⭐
Tesseract	85.4%	-	-	-	⭐⭐⭐⭐⭐
TrOCR	96.5%	81.2%	87.3%	-	⭐⭐⭐
Donut	93.8%	79.8%	85.6%	90.1%	⭐⭐⭐

Multilingual Performance

Language	Accuracy	Support level
Chinese	97.8%	⭐⭐⭐⭐⭐
English	98.5%	⭐⭐⭐⭐⭐
Japanese	96.2%	⭐⭐⭐⭐
Korean	95.8%	⭐⭐⭐⭐
Arabic	94.1%	⭐⭐⭐

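💡 Best Practices

Choosing a Task Type

# Choose the OCR type that fits the task
- ocr: plain-text extraction
- format: preserves the layout formatting
- formula: mathematical formula recognition

Image Preprocessing

from PIL import Image, ImageEnhance

# Improve image quality by increasing contrast
image = Image.open("document.jpg")
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(1.5)

# Double the resolution with a high-quality resampling filter
image = image.resize((image.width * 2, image.height * 2), Image.LANCZOS)

Post-Processing

import re

# Clean up the OCR result
def clean_ocr_result(text):
    # Collapse runs of spaces and tabs, keeping line breaks
    text = re.sub(r"[ \t]+", " ", text).strip()
    # Fix common confusions only next to a digit, so ordinary words are left alone
    text = re.sub(r"(?<=\d)O|O(?=\d)", "0", text)  # letter O misread for digit 0
    text = re.sub(r"(?<=\d)l|l(?=\d)", "1", text)  # letter l misread for digit 1
    return text

Memory Management

import gc
import torch

# Clean up after processing
result = model.chat(image, ocr_type="ocr")
gc.collect()              # drop unreachable Python objects first
torch.cuda.empty_cache()  # then release cached GPU memory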

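🔍 FAQ

Q1: How do I handle low-quality images?

# Enable image enhancement
result = model.chat(
    image,
    ocr_type="ocr",
    enhance_image=True,
    denoise=True
)

Q2: How do I extract a specific region?

# Crop the region of interest
from PIL import Image

image = Image.open("document.jpg")
# (x1, y1, x2, y2) are the left, upper, right, and lower pixel coordinates
region = image.crop((x1, y1, x2, y2))
result = model.chat(region, ocr_type="ocr")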
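
Crop coordinates that come from user input or an upstream detector can fall partly outside the image or arrive with swapped corners. A small guard before calling `image.crop` avoids padded or empty regions. This is a sketch; `clamp_box` is a hypothetical helper and assumes pixel coordinates in `(x1, y1, x2, y2)` order.

```python
def clamp_box(
    box: tuple[int, int, int, int], width: int, height: int
) -> tuple[int, int, int, int]:
    """Normalize corner order and clip the box to the image bounds."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    # Clip each coordinate to the valid range
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        raise ValueError("crop box does not overlap the image")
    return (x1, y1, x2, y2)
```

A typical call would be `image.crop(clamp_box(box, image.width, image.height))`.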

Q3: How do I handle multi-page documents?

from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    """OCR a multi-page PDF."""
    # Convert the PDF pages to images
    images = convert_from_path(pdf_path, dpi=300)
    
    model = GOTOCR.from_pretrained("ucaslcl/GOT-OCR2_0")
    
    results = []
    for i, image in enumerate(images):
        print(f"Processing page {i+1}")
        result = model.chat(image, ocr_type="format")
        results.append(result)
    
    # Merge the results
    full_text = "\n\n=== Page break ===\n\n".join(results)
    return full_text
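
Rendering every page of a long PDF at 300 dpi in one call keeps all page images in memory at once. One way to bound this is to convert a few pages at a time; `convert_from_path` accepts `first_page` and `last_page` arguments for that purpose. The sketch below only computes the 1-indexed, inclusive page ranges; `page_ranges` is a hypothetical helper, and the total page count is assumed to be known already.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]
```

Each `(first, last)` pair can then be passed as `convert_from_path(pdf_path, dpi=300, first_page=first, last_page=last)`, so only one chunk of page images is held at a time.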

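📚 Resources

GitHub: https://github.com/Ucas-HaoranWei/GOT-OCR2.0
Paper: https://arxiv.org/abs/2409.01704
Model: https://huggingface.co/ucaslcl/GOT-OCR2_0
Demo: https://huggingface.co/spaces/ucaslcl/GOT-OCR2.0

⚠️ Notes

The first run downloads the model (about 1.8GB)
A GPU is recommended for acceleration (RTX 3090 or better)
Requires 16GB+ of RAM and 8GB+ of VRAM
Watch memory usage when processing large images
Some special characters may need post-processing

🆚 Comparison with Other OCR Tools

GOT-OCR vs Traditional OCR

Feature	GOT-OCR	Traditional OCR
Architecture	End-to-end multimodal	Two-stage (detection + recognition)
Layout understanding	⭐⭐⭐⭐⭐	⭐⭐
Generalization	⭐⭐⭐⭐⭐	⭐⭐⭐
Accuracy	Very high	High
Speed	Medium	Fast
Deployment complexity	Medium	Simple

When to Use

Choose GOT-OCR when:

You need high-accuracy OCR
You process documents with complex layouts
You need to understand document structure
Documents mix multiple languages
You need formula recognition

Choose another tool when:

Raw speed is the priority → PaddleOCR
The scenario is simple → Tesseract
Resources are constrained → EasyOCR
You deploy to mobile devices → MMOCR

🔄 Changelog

2024.09: GOT-OCR 2.0 released, with support for more tasks
2024.06: Improved multilingual support
2024.03: Initial release of GOT-OCR 1.0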
