docproc

Document → Markdown extraction engine for LLM pipelines

Turn messy PDFs into structured, model-ready text.


Demo

terminal

$ docproc research_paper.pdf

> parsing layout...

> extracting figures...

> detecting equations...

> rebuilding hierarchy...

> exporting markdown...

✓ paper.md generated

Problem

Modern LLM workflows depend on ingesting documents, but extracting text from PDFs usually destroys their structure: tables collapse, equations disappear, and images vanish.

docproc rebuilds that structure.

What it does

docproc converts:

PDF
DOCX
PPTX
XLSX

into clean Markdown while preserving:

headings
layout hierarchy
equations
figures
embedded images

Features

Clean Markdown

Structured output ready for pipelines.

Equation Preservation

LaTeX and math kept intact.

Image Extraction

Figures and embedded images exported.

Layout Reconstruction

Headings and hierarchy restored.

LLM-Ready Output

Model-friendly Markdown format.

Architecture

PDF/DOCX → Layout Parser → Equation + Figure Extractor → Structure Rebuilder → Clean Markdown
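
The stages above read as a linear pipeline in which each stage receives the previous stage's result. A minimal Python sketch of that shape follows. The `Document` type and every stage function here are hypothetical placeholders chosen for illustration; they are not docproc's actual internals, and each stage only records that it ran.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical intermediate representation passed between stages."""
    source: str
    log: List[str] = field(default_factory=list)


# A stage is any function that takes a Document and returns a Document.
Stage = Callable[[Document], Document]


def parse_layout(doc: Document) -> Document:
    # Placeholder: a real layout parser would locate text blocks here.
    doc.log.append("layout")
    return doc


def extract_equations_and_figures(doc: Document) -> Document:
    # Placeholder: a real extractor would pull out math and images here.
    doc.log.append("extraction")
    return doc


def rebuild_structure(doc: Document) -> Document:
    # Placeholder: a real rebuilder would restore the heading hierarchy here.
    doc.log.append("structure")
    return doc


def export_markdown(doc: Document) -> Document:
    # Placeholder: a real exporter would serialize the result to Markdown here.
    doc.log.append("markdown")
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding its output to the next stage."""
    return reduce(lambda current, stage: stage(current), stages, doc)


STAGES: List[Stage] = [
    parse_layout,
    extract_equations_and_figures,
    rebuild_structure,
    export_markdown,
]

result = run_pipeline(Document("research_paper.pdf"), STAGES)
print(result.log)  # ['layout', 'extraction', 'structure', 'markdown']
```

Keeping every stage behind the same signature means a stage can be swapped or tested in isolation without touching the others.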

Real output

Input (PDF snippet)

A typical methods section with inline equation: “The diffusion equation is given by ∂u/∂t = α ∇²u.”

→

Output (Markdown)

## Methods

The diffusion equation is given by:

$$ ∂u/∂t = α ∇²u $$
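
Because the sample output uses `##` headings and `$$ ... $$` display math, a downstream pipeline step can split it into sections and pull out the equations with the standard library alone. The sketch below assumes the output always follows the conventions shown in this one sample; the function names are illustrative and not part of docproc.

```python
import re
from typing import Dict, List

# The Markdown sample shown above, reproduced as a string.
SAMPLE = """## Methods

The diffusion equation is given by:

$$ ∂u/∂t = α ∇²u $$
"""

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def split_sections(markdown: str) -> Dict[str, str]:
    """Map each heading title to the body text that follows it."""
    sections: Dict[str, List[str]] = {}
    current = ""  # text before the first heading is stored under ""
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(2).strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def extract_display_math(text: str) -> List[str]:
    """Return the contents of every $$ ... $$ block, without the delimiters."""
    return [body.strip() for body in DISPLAY_MATH.findall(text)]


sections = split_sections(SAMPLE)
print(list(sections))                             # ['Methods']
print(extract_display_math(sections["Methods"]))  # ['∂u/∂t = α ∇²u']
```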

Example usage

docproc research_paper.pdf > paper.md
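
The same call can be made from a Python pipeline. This is a sketch that assumes only what the command above implies: that `docproc` is on the PATH, takes the input path as its single argument, and writes Markdown to standard output. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(input_path: str, executable: str = "docproc") -> List[str]:
    """Build the argument list: the executable followed by the input file."""
    return [executable, input_path]


def convert_to_markdown(
    input_path: str, output_path: str, executable: str = "docproc"
) -> Path:
    """Run the converter and redirect its standard output into output_path."""
    out = Path(output_path)
    with out.open("w", encoding="utf-8") as handle:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(
            build_command(input_path, executable), stdout=handle, check=True
        )
    return out


# Equivalent of `docproc research_paper.pdf > paper.md`:
# convert_to_markdown("research_paper.pdf", "paper.md")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.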

Why I Built This

I kept losing structure when ingesting PDFs into LLM pipelines. Most tools flatten documents into plain text.

docproc rebuilds the document as structured Markdown.

Repository

View on GitHub