docproc
Document → Markdown extraction engine for LLM pipelines
Turn messy PDFs into structured, model-ready text.
Problem
Modern LLM workflows depend on ingesting documents, but PDFs destroy structure. Tables collapse, equations disappear, and images vanish.
docproc rebuilds that structure.
What it does
docproc converts:
- PDF
- DOCX
- PPTX
- XLSX
into clean Markdown while preserving:
- headings
- layout hierarchy
- equations
- figures
- embedded images
Features
Clean Markdown
Structured output ready for pipelines.
Equation Preservation
LaTeX and math kept intact.
Image Extraction
Figures and embedded images exported.
Layout Reconstruction
Headings and hierarchy restored.
LLM-Ready Output
Model-friendly Markdown format.
Real output
Input (PDF snippet)
A typical methods section with an inline equation: “The diffusion equation is given by ∂u/∂t = α ∇²u.”
Output (Markdown)
## Methods
The diffusion equation is given by:
$$
∂u/∂t = α ∇²u
$$
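Because the output keeps headings and display math as ordinary Markdown, a downstream pipeline step can split it into sections and pull out the equations with a few lines of code. The following is a minimal sketch of such a consumer, not part of docproc itself: the helper names `split_sections` and `display_equations` are hypothetical, and the sample string simply mirrors the output shown above.

```python
import re

# Sample text mirroring the Markdown output shown above.
SAMPLE = """## Methods
The diffusion equation is given by:
$$
∂u/∂t = α ∇²u
$$
"""

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown):
    """Return (heading, body) pairs, one per ATX heading (hypothetical helper)."""
    sections = []
    heading, body = None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


def display_equations(markdown):
    """Return the contents of every $$ ... $$ block (hypothetical helper)."""
    blocks = re.findall(r"\$\$(.*?)\$\$", markdown, flags=re.DOTALL)
    return [block.strip() for block in blocks]
```

Calling `split_sections(SAMPLE)` yields a single section titled `Methods`, and `display_equations(SAMPLE)` returns the diffusion equation as one string, which can then be chunked or indexed separately from the prose.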
Example usage
docproc research_paper.pdf > paper.md
Why I Built This
I kept losing structure when ingesting PDFs into LLM pipelines. Most tools flatten documents into text.
docproc rebuilds the document as structured Markdown.
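For pipelines driven from Python rather than the shell, the example usage can be wrapped in a small helper. This is a minimal sketch that assumes only what the example usage shows: a `docproc` executable on the PATH that takes an input file as its single argument and writes Markdown to standard output. The function names `build_command` and `convert_to_markdown` are hypothetical, and no other docproc flags or APIs are assumed.

```python
import subprocess
from pathlib import Path


def build_command(src, executable="docproc"):
    """Build the argument list equivalent to `docproc <src>`."""
    return [executable, str(src)]


def convert_to_markdown(src, dest, executable="docproc"):
    """Run the converter on src and save its standard output to dest.

    Assumes the executable prints Markdown to stdout, as in the
    shell example `docproc research_paper.pdf > paper.md`.
    """
    result = subprocess.run(
        build_command(src, executable),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    dest_path = Path(dest)
    dest_path.write_text(result.stdout, encoding="utf-8")
    return dest_path
```

With docproc installed, `convert_to_markdown("research_paper.pdf", "paper.md")` would reproduce the shell redirect. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.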