docproc

Document → Markdown extraction engine for LLM pipelines

Turn messy PDFs into structured, model-ready text.

Demo

terminal
$ docproc research_paper.pdf
> parsing layout...
> extracting figures...
> detecting equations...
> rebuilding hierarchy...
> exporting markdown...
paper.md generated

Problem

Modern LLM workflows depend on ingesting documents, but extracting plain text from a PDF destroys its structure: tables collapse, equations disappear, and images vanish.

docproc rebuilds that structure.

What it does

docproc converts:

  • PDF
  • DOCX
  • PPTX
  • XLSX

into clean Markdown while preserving:

  • headings
  • layout hierarchy
  • equations
  • figures
  • embedded images

Features

Clean Markdown

Structured output ready for pipelines.

Equation Preservation

LaTeX and math kept intact.

Image Extraction

Figures and embedded images exported.

Layout Reconstruction

Headings and hierarchy restored.

LLM-Ready Output

Model-friendly Markdown format.

Architecture

PDF/DOCX → Layout Parser → Equation + Figure Extractor → Structure Rebuilder → Clean Markdown
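
The stages above can be pictured as a chain of small functions, each taking the previous stage's blocks and adding information. The sketch below is an illustration only, not docproc's actual internals: the `Block` type, the function names, the font-size heuristic for headings, and the symbol-based equation check are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical intermediate representation; docproc's real one is not documented here.
@dataclass
class Block:
    text: str
    font_size: float
    kind: str = "paragraph"  # "paragraph", "heading", or "equation"
    level: int = 0           # heading depth, set by rebuild_structure

MATH_MARKERS = ("∂", "∇", "∑", "∫")  # assumed, deliberately tiny symbol list


def parse_layout(raw_spans):
    """Stage 1: turn (text, font_size) pairs into blocks, dropping empty spans."""
    return [Block(text=t.strip(), font_size=s) for t, s in raw_spans if t.strip()]


def tag_equations(blocks):
    """Stage 2: mark blocks that contain a math symbol and an equals sign."""
    for block in blocks:
        if "=" in block.text and any(m in block.text for m in MATH_MARKERS):
            block.kind = "equation"
    return blocks


def rebuild_structure(blocks):
    """Stage 3: treat text larger than the body size as headings.

    The smallest non-equation font size is taken as body text; each larger
    size becomes one heading level, with the largest size at level 1.
    """
    text_blocks = [b for b in blocks if b.kind != "equation"]
    if not text_blocks:
        return blocks
    body_size = min(b.font_size for b in text_blocks)
    sizes = sorted({b.font_size for b in text_blocks if b.font_size > body_size},
                   reverse=True)
    for block in text_blocks:
        if block.font_size in sizes:
            block.kind = "heading"
            block.level = sizes.index(block.font_size) + 1
    return blocks


def render_markdown(blocks):
    """Stage 4: emit one Markdown paragraph per block."""
    parts = []
    for block in blocks:
        if block.kind == "heading":
            parts.append("#" * block.level + " " + block.text)
        elif block.kind == "equation":
            parts.append("$$ " + block.text + " $$")
        else:
            parts.append(block.text)
    return "\n\n".join(parts) + "\n"


def convert(raw_spans):
    """Run all four stages in order."""
    return render_markdown(rebuild_structure(tag_equations(parse_layout(raw_spans))))


# Invented sample spans, only to exercise the pipeline.
sample = [
    ("Methods", 14.0),
    ("The diffusion equation is given by:", 10.0),
    ("∂u/∂t = α ∇²u", 10.0),
]
print(convert(sample))
```

Keeping each stage as a function over the same block list makes it easy to test a stage alone or swap in a better heuristic without touching the rest.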

Real output

Input (PDF snippet)

A typical methods section with an inline equation: “The diffusion equation is given by ∂u/∂t = α ∇²u.”

Output (Markdown)

## Methods

The diffusion equation is given by:

$$ ∂u/∂t = α ∇²u $$
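
Because the output keeps its headings, a downstream pipeline can split it into sections before chunking or embedding. The following is a minimal sketch of that idea, not part of docproc: `split_by_heading` is a hypothetical helper, and it assumes ATX-style headings (`#` to `######`) like the one shown above.

```python
import re

# Matches ATX headings such as "## Methods".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Return (heading, body) pairs; text before the first heading gets heading ''."""
    sections = []
    title, body = "", []

    def flush():
        # Skip the empty preamble when the document starts with a heading.
        if title or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "## Methods\n\nThe diffusion equation is given by:\n\n$$ ∂u/∂t = α ∇²u $$\n"
for heading, text in split_by_heading(doc):
    print(repr(heading), repr(text))
```

This simple version would also treat a `#` line inside a fenced code block as a heading, so a production splitter would need to track fences.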

Example usage

docproc research_paper.pdf > paper.md
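
For more than one file, the same command can be driven from a script. The sketch below assumes that `docproc` is on the PATH and writes Markdown to standard output, as the redirect above suggests; the helper names and the `command` parameter are hypothetical, and no additional flags are assumed.

```python
import subprocess
from pathlib import Path


def convert_to_markdown(source, out_dir, command=("docproc",)):
    """Run the converter on one file and save its stdout as <stem>.md."""
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        [*command, str(source)],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    target = out_dir / (source.stem + ".md")
    target.write_text(result.stdout, encoding="utf-8")
    return target


def convert_folder(folder, out_dir, command=("docproc",)):
    """Convert every *.pdf directly inside a folder, in sorted order."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    return [convert_to_markdown(pdf, out_dir, command) for pdf in pdfs]
```

Calling `convert_folder("papers", "markdown")` would then leave one `.md` file per PDF in `markdown/`. Passing the command as a parameter keeps the helper testable without the real tool installed.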

Why I Built This

I kept losing structure when ingesting PDFs into LLM pipelines. Most tools flatten documents into plain text.

docproc rebuilds the document as structured Markdown.

Repository

View on GitHub
