MinerU

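MinerU is an open-source, high-quality PDF and document parsing tool. It focuses on accurately extracting text, tables, formulas, and images from complex PDFs such as academic papers and technical documentation, and writes the result as Markdown or other formats.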

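🌟 Core Features

✅ High-quality parsing: accurately recognizes complex layouts and preserves document structure
✅ Formula extraction: recognizes formulas and converts them to LaTeX
✅ Table restoration: extracts table content precisely and preserves formatting
✅ Multi-column handling: handles multi-column layouts intelligently
✅ Image extraction: extracts images and figures from the document
✅ Markdown output: generates high-quality Markdown documents
✅ Batch processing: parses multiple documents in one run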

📦 Installation and Usage

Quick Install

# Install with pip
pip install magic-pdf

# Or install from source
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
pip install -e .

Environment Setup

# Download the model files
magic-pdf --download-models

# Configure CUDA (optional, for GPU acceleration)
export CUDA_VISIBLE_DEVICES=0

Basic Usage

# Parse a single PDF file
magic-pdf -p document.pdf -o output_dir

# Specify the parsing method (auto, txt, or ocr)
magic-pdf -p document.pdf -o output_dir -m auto

# Batch processing: pass a directory of PDFs
magic-pdf -p ./pdfs -o ./output

Python API

from magic_pdf import MagicPDF

# Initialize the parser
parser = MagicPDF()

# Parse the PDF
result = parser.parse("document.pdf")

# Get the Markdown content
markdown_content = result.get_markdown()
print(markdown_content)

# Get the images
images = result.get_images()
for img in images:
    img.save(f"image_{img.page}_{img.index}.png")

# Get the tables
tables = result.get_tables()
for table in tables:
    print(table.to_markdown())

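🏗️ Technical Architecture

graph TB
    A[PDF document] --> B[Layout analysis]
    B --> C[Element detection]
    C --> D[Text extraction]
    C --> E[Table recognition]
    C --> F[Formula recognition]
    C --> G[Image extraction]
    D --> H[Content assembly]
    E --> H
    F --> H
    G --> H
    H --> I[Markdown output]

    B -->|YOLOv8| B1[Region detection]
    E -->|TableTransformer| E1[Table structure]
    F -->|LaTeX-OCR| F1[Formula conversion]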
🎯 Use Cases

1. Academic Paper Parsing

from magic_pdf import MagicPDF

def parse_academic_paper(pdf_path):
    """Parse an academic paper."""
    parser = MagicPDF(
        extract_images=True,
        extract_tables=True,
        extract_formulas=True
    )

    result = parser.parse(pdf_path)

    # Extract the paper structure
    paper = {
        "title": result.get_title(),
        "abstract": result.get_abstract(),
        "sections": result.get_sections(),
        "references": result.get_references(),
        "figures": result.get_figures(),
        "tables": result.get_tables()
    }

    return paper

2. Technical Documentation Conversion

from pathlib import Path

from magic_pdf import MagicPDF

def convert_tech_doc(pdf_path, output_path):
    """Convert a technical document to Markdown."""
    parser = MagicPDF(
        keep_layout=True,
        extract_code_blocks=True
    )

    result = parser.parse(pdf_path)
    markdown = result.get_markdown()

    # Save as Markdown
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(markdown)

    # Save the images next to the Markdown file, e.g. doc.md -> doc_images
    images_dir = str(Path(output_path).with_suffix('')) + '_images'
    result.save_images(images_dir)

    return markdown
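After a conversion like the one above, it can be useful to check which image files the Markdown actually references. The sketch below works on the Markdown text alone with the standard library; `image_targets` is a hypothetical helper name, and the regular expression assumes the common `![alt](path "optional title")` form without spaces inside the path.

```python
import re
from typing import List

# Matches ![alt](path) and ![alt](path "title"); captures only the path
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def image_targets(markdown: str) -> List[str]:
    """Return the paths of all images referenced in a Markdown string."""
    return [match.group(1) for match in IMAGE_RE.finditer(markdown)]
```

Comparing this list with the contents of the images directory is a quick way to spot references that point at files that were never written.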

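3. Table Data Extraction

import os

from magic_pdf import MagicPDF

def extract_tables_to_csv(pdf_path, output_dir):
    """Extract all tables and save them as CSV files."""
    parser = MagicPDF()
    result = parser.parse(pdf_path)

    tables = result.get_tables()

    # Make sure the output directory exists before writing
    os.makedirs(output_dir, exist_ok=True)

    for i, table in enumerate(tables):
        csv_path = f"{output_dir}/table_{i+1}.csv"
        table.to_csv(csv_path)

        # A table can also be converted to a DataFrame
        df = table.to_dataframe()
        print(f"Table {i+1}:")
        print(df)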
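When only the Markdown output is at hand, a table can also be turned into CSV without any parser object. The following is a minimal sketch using the standard library; `markdown_table_to_rows` and `rows_to_csv` are hypothetical helper names, and the code assumes simple pipe-delimited tables with no escaped `|` characters inside cells.

```python
import csv
import io
from typing import List


def markdown_table_to_rows(table_md: str) -> List[List[str]]:
    """Split a pipe-delimited Markdown table into rows of cell strings."""
    rows = []
    for line in table_md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. | --- | :---: |
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows as CSV text, quoting cells only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Using the `csv` module rather than joining with commas keeps cells that contain commas or quotes intact.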

4. Batch Document Processing

from pathlib import Path

from magic_pdf import MagicPDF

def batch_process_pdfs(input_dir, output_dir):
    """Process every PDF in a directory."""
    parser = MagicPDF()

    # Create the output directory if it does not exist yet
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = Path(input_dir).glob("*.pdf")

    for pdf_path in pdf_files:
        try:
            print(f"Processing: {pdf_path.name}")

            result = parser.parse(str(pdf_path))

            # Write the Markdown output
            md_path = out_dir / f"{pdf_path.stem}.md"
            with open(md_path, 'w', encoding='utf-8') as f:
                f.write(result.get_markdown())

            # Save the images
            img_dir = out_dir / f"{pdf_path.stem}_images"
            result.save_images(str(img_dir))

            print(f"Done: {pdf_path.name}")

        except Exception as e:
            # Report the failure and continue with the next file
            print(f"Error {pdf_path.name}: {e}")

🔧 Advanced Configuration

Custom Parsing Parameters

parser = MagicPDF(
    # Layout analysis
    layout_model="yolov8",
    layout_threshold=0.5,

    # Table recognition
    table_model="table-transformer",
    table_threshold=0.7,

    # Formula recognition
    formula_model="latex-ocr",
    formula_threshold=0.8,

    # OCR settings
    ocr_engine="paddleocr",
    ocr_lang="ch",

    # Output settings
    keep_layout=True,
    extract_images=True,
    extract_tables=True,
    extract_formulas=True,

    # Performance settings
    use_gpu=True,
    batch_size=4
)

Formula Handling

def extract_formulas(pdf_path):
    """Extract formulas and convert them to LaTeX."""
    parser = MagicPDF(extract_formulas=True)
    result = parser.parse(pdf_path)

    formulas = result.get_formulas()

    for formula in formulas:
        print(f"Page: {formula.page}")
        print(f"LaTeX: {formula.latex}")
        print(f"Position: {formula.bbox}")
        print(f"Type: {formula.type}")  # inline/display
        print("---")
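The inline/display distinction can also be recovered from Markdown output alone, assuming formulas are written with the usual `$...$` and `$$...$$` delimiters. The sketch below is a standard-library illustration with a hypothetical function name, `find_formulas`; it is a heuristic, and dollar signs used as currency would produce false matches.

```python
import re
from typing import List, Tuple

# $$ ... $$ may span several lines
DISPLAY_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# A single $ on each side, on one line, not adjacent to another $
INLINE_RE = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def find_formulas(markdown: str) -> List[Tuple[str, str]]:
    """Return (type, latex) pairs: all display formulas first, then inline ones."""
    formulas = [("display", m.group(1).strip()) for m in DISPLAY_RE.finditer(markdown)]
    # Blank out display formulas so their contents are not matched again as inline
    remainder = DISPLAY_RE.sub(" ", markdown)
    formulas += [("inline", m.group(1).strip()) for m in INLINE_RE.finditer(remainder)]
    return formulas
```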

Layout Preservation

def preserve_layout(pdf_path):
    """Keep the original page layout."""
    parser = MagicPDF(
        keep_layout=True,
        preserve_spaces=True,
        preserve_indentation=True
    )

    result = parser.parse(pdf_path)
    markdown = result.get_markdown()

    return markdown

📊 Performance Comparison

Tool	Text accuracy	Formula recognition	Table recognition	Speed	Open source
MinerU	95%+	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	Medium	✅
PyMuPDF	90%	⭐⭐	⭐⭐⭐	Fast	✅
pdfplumber	92%	⭐	⭐⭐⭐⭐	Fast	✅
Adobe Acrobat	96%	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	Slow	❌
Marker	94%	⭐⭐⭐⭐	⭐⭐⭐⭐	Medium	✅

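💡 Best Practices

PDF quality
- Prefer native (born-digital) PDFs over scanned ones
- Scanned PDFs need OCR
- Make sure the PDF file is complete and uncorrupted
Model selection
- Chinese documents: use PaddleOCR
- English documents: Tesseract is an option
- Formula-heavy documents: enable high-precision formula recognition
Performance tuning
- Use a GPU to speed up processing
- Tune batch_size when processing in batches
- Split large files and process them by page
Output tuning
- Choose the output format that fits your needs
- Adjust thresholds to improve accuracy
- Post-process the output to fix formatting issues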
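Two of the tips above, processing large files by page and post-processing the output, are easy to sketch with the standard library. Both helpers below are hypothetical and independent of MinerU: `page_chunks` only computes page ranges (zero-based indexing is an assumption here), and `tidy_markdown` applies a deliberately light cleanup.

```python
import re
from typing import Iterator


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive zero-based page ranges of at most chunk_size pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        yield range(start, min(start + chunk_size, total_pages))


def tidy_markdown(markdown: str) -> str:
    """Trim trailing whitespace and collapse runs of blank lines into one."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Stripping trailing whitespace also removes Markdown's two-space hard line breaks and touches the contents of code blocks, so a real cleanup step may want to skip those regions.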

🔍 FAQ

Q1: How do I process a scanned PDF?

parser = MagicPDF(
    ocr_engine="paddleocr",
    ocr_enabled=True,
    image_dpi=300  # A higher DPI improves OCR quality
)

Q2: How do I extract specific pages?

result = parser.parse("document.pdf", pages=[1, 3, 5])
# Or specify a range
result = parser.parse("document.pdf", pages=range(10, 20))
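When the page selection comes from a command line or a config file, it usually arrives as a string such as "1,3,5-8". A minimal sketch of expanding that into a page list follows; `parse_page_spec` is a hypothetical helper, and its ranges are inclusive on both ends, unlike Python's `range(10, 20)`, which stops at 19.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec such as "1,3,5-8" into a sorted list of unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Both ends of a written range are included
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```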

Q3: How do I process a two-column document?

parser = MagicPDF(
    detect_columns=True,
    reading_order="column"  # Read column by column
)
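To make "column reading order" concrete, here is a simplified sketch of the idea for a two-column page: assign each block to the left or right half by its horizontal center, then read each column top to bottom. This is an illustration only, not how MinerU determines reading order; `column_reading_order` is a hypothetical name, and full-width elements such as titles are not handled.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) with the origin at the top-left corner of the page
BBox = Tuple[float, float, float, float]


def column_reading_order(blocks: List[BBox], page_width: float) -> List[BBox]:
    """Order blocks left column first, then right column, each top to bottom."""
    middle = page_width / 2

    def sort_key(box: BBox):
        x0, y0, x1, _ = box
        center_x = (x0 + x1) / 2
        column = 0 if center_x < middle else 1
        return (column, y0, x0)

    return sorted(blocks, key=sort_key)
```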

📚 Resources

GitHub: https://github.com/opendatalab/MinerU
Documentation: https://mineru.readthedocs.io/
Online demo: https://mineru.opendatalab.com/
Model downloads: https://huggingface.co/opendatalab/MinerU

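⚠️ Notes

The first run requires downloading the model files (about 2 GB)
The GPU version requires CUDA support
Complex PDFs take longer to process
Some unusual formats may need manual adjustment
Python 3.8+ is recommended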

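🆚 Comparison with Other Tools

MinerU vs PyMuPDF

MinerU: smarter layout understanding, better formula and table support
PyMuPDF: faster processing, lower-level PDF operations

MinerU vs Marker

MinerU: better Chinese support, more flexible configuration
Marker: simpler to use, faster processing

MinerU vs pdfplumber

MinerU: stronger AI capabilities, better handling of complex layouts
pdfplumber: more precise table extraction, more stable basic functionality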

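🔄 Changelog

2024.06: Released v0.6 with support for more PDF formats
2024.03: Improved formula recognition accuracy
2023.12: Released v0.5 with GPU acceleration
2023.09: Initial open-source release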
