Extract2MD: Convert PDF to Markdown using Local LLMs and OCR

Published June 5 in Markdown Tools

Extract2MD is a client-side JavaScript library that converts PDF files into Markdown. It operates entirely within the browser, meaning your documents are processed locally without server uploads or restrictive file size caps.

The library offers five distinct methods for extracting text, allowing you to choose the approach that best fits your document's complexity.

Five Modes, One Library

1. Quick Conversion: This is the fastest method, ideal for PDFs that already contain a selectable text layer. It uses PDF.js to extract the existing text directly.

const markdown = await Extract2MDConverter.quickConvertOnly(pdfFile);

2. High-Accuracy OCR: For PDFs consisting of scanned images or documents with irregular layouts, Extract2MD uses Tesseract.js to perform Optical Character Recognition (OCR).

const markdown = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);

3. Quick + LLM Polish: This mode combines fast text extraction with a local Large Language Model (via WebLLM). The LLM reviews the raw text to fix formatting errors and improve readability.

const markdown = await Extract2MDConverter.quickConvertWithLLM(pdfFile);

4. High-Accuracy + LLM Polish: After Tesseract.js handles the initial OCR, the LLM refines the output. This is particularly effective for scanned documents that require structural cleanup.

const markdown = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);

5. Combined Extraction + LLM (Recommended): This mode uses all available tools. PDF.js extracts native text while Tesseract.js runs OCR on image-based elements. A specialized prompt then directs the LLM to merge these two streams into a single, coherent Markdown file.

const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);
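Choosing between the five modes comes down to two questions: does the PDF have a text layer, scanned content, or both, and do you want the LLM pass? The sketch below encodes that decision as a small helper. The function chooseMode and its trait flags are hypothetical and not part of Extract2MD; only the returned method names come from the library. Sending mixed documents without an LLM to the OCR mode is an assumption, not documented behavior.

```javascript
// Hypothetical helper (not part of Extract2MD): maps document traits
// to one of the five method names listed above.
function chooseMode({ hasTextLayer, hasScannedContent, useLLM }) {
  if (hasTextLayer && hasScannedContent) {
    // Only the combined mode reads both sources, and it always runs the LLM.
    // Without an LLM, falling back to OCR is an assumption of this sketch.
    return useLLM ? "combinedConvertWithLLM" : "highAccuracyConvertOnly";
  }
  if (hasTextLayer) {
    // Selectable text exists, so PDF.js extraction is enough.
    return useLLM ? "quickConvertWithLLM" : "quickConvertOnly";
  }
  // No text layer: the page content has to go through OCR.
  return useLLM ? "highAccuracyConvertWithLLM" : "highAccuracyConvertOnly";
}
```

Assuming the static methods shown above, the result could be used as `await Extract2MDConverter[chooseMode(traits)](pdfFile)`.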

Detailed Configuration

Extract2MD accepts a nested configuration object, passed as the second argument to a conversion method. You can specify custom worker paths for PDF.js and Tesseract.js, select different LLM models, or define your own system prompts.

const config = {
  pdfJsWorkerSrc: "../pdf.worker.min.mjs",
  tesseract: {
    workerPath: "./tesseract-worker.min.js",
    corePath: "./tesseract-core.wasm.js",
    langPath: "./lang-data/",
    language: "eng",
  },
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    options: { temperature: 0.7, maxTokens: 4096 }
  },
  systemPrompts: {
    singleExtraction: "Keep all code blocks exactly as they appear.",
    combinedExtraction: "Preserve table structures from the OCR pass."
  },
  processing: {
    pdfRenderScale: 2.5,
    postProcessRules: [{ find: /\bAPI\b/g, replace: "API" }]
  },
  progressCallback: (progress) => {
    console.log(`${progress.stage}: ${progress.currentPage}/${progress.totalPages}`);
  }
};
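The postProcessRules entries are find/replace pairs. How the library applies them internally is not described here. A minimal sketch, assuming each rule is applied in order with String.prototype.replace, might look like this. The function applyPostProcessRules and the two example rules are illustrative, not part of Extract2MD.

```javascript
// Hypothetical sketch: applies { find, replace } rules in order.
// This is an assumed behavior, not the library's own implementation.
function applyPostProcessRules(markdown, rules = []) {
  return rules.reduce(
    (text, { find, replace }) => text.replace(find, replace),
    markdown
  );
}

// Example rules (illustrative only).
const rules = [
  { find: /[ \t]+$/gm, replace: "" },   // strip trailing whitespace on each line
  { find: /\n{3,}/g, replace: "\n\n" }, // collapse runs of blank lines to one
];

const cleaned = applyPostProcessRules("# Title  \n\n\n\nBody text \n", rules);
console.log(JSON.stringify(cleaned));
```

Rule order matters in a scheme like this: stripping trailing whitespace first lets whitespace-only lines count as blank lines when they are collapsed.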

Advanced Components

If you do not require the full conversion pipeline, you can use the library’s components individually for modular tasks.

import { WebLLMEngine, OutputParser, ConfigValidator } from 'extract2md';

const validatedConfig = ConfigValidator.validate(userConfig);
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();

const result = await engine.generate("Your prompt here");
const cleanMarkdown = new OutputParser().parse(result);
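What OutputParser does internally is not covered here. As an illustration of why a parsing step is useful after raw generation, the sketch below shows two common cleanups for LLM output: removing `<think>` reasoning blocks, which Qwen3 models can emit, and unwrapping a Markdown code fence that encloses the entire response. The function cleanLLMOutput is hypothetical and is not the library's parser.

```javascript
// Hypothetical cleanup step, NOT the library's OutputParser.
const FENCE = "`".repeat(3); // a literal triple-backtick fence marker

function cleanLLMOutput(raw) {
  // Drop <think>...</think> reasoning blocks that some models emit.
  let text = raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();

  // Unwrap a fence labeled "markdown" or "md" that encloses the whole response.
  // Unlabeled fences are left alone, since they may be real code blocks.
  const lines = text.split("\n");
  const opensWithFence = /^`{3}(markdown|md)\s*$/.test(lines[0]);
  const closesWithFence =
    lines.length > 1 && lines[lines.length - 1].trim() === FENCE;
  if (opensWithFence && closesWithFence) {
    text = lines.slice(1, -1).join("\n").trim();
  }
  return text;
}
```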

Error Handling and Progress Tracking

The progressCallback function receives real-time status updates during conversion, including page rendering, OCR stages, and LLM loading progress. Wrap conversion calls in a try/catch block so that a failure at any stage is caught and reported rather than left as an unhandled rejection.

try {
  const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
} catch (error) {
  console.error('Conversion failed:', error.message);
}
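Building on the try/catch pattern, one option is to fall back to a lighter mode when a heavier one fails, and to format progress updates for display. The sketch below is a hedged illustration: formatProgress and convertWithFallback are hypothetical helpers, not library functions. It assumes that page counts may be missing during stages such as model loading, and that every conversion method accepts the config as a second argument (the article only shows this for combinedConvertWithLLM).

```javascript
// Hypothetical helper: formats a progress object with the fields used in the
// config example (stage, currentPage, totalPages). Page counts are treated as
// optional, which is an assumption of this sketch.
function formatProgress({ stage, currentPage, totalPages }) {
  if (Number.isFinite(currentPage) && Number.isFinite(totalPages) && totalPages > 0) {
    const percent = Math.round((currentPage / totalPages) * 100);
    return `${stage}: page ${currentPage}/${totalPages} (${percent}%)`;
  }
  return `${stage}`;
}

// Hypothetical helper: tries each named method in order and returns the first
// success. `converter` is any object exposing those async methods, for
// example Extract2MDConverter.
async function convertWithFallback(converter, pdfFile, config, methods) {
  const errors = [];
  for (const name of methods) {
    try {
      const markdown = await converter[name](pdfFile, config);
      return { method: name, markdown };
    } catch (error) {
      // Remember why this mode failed, then try the next one.
      errors.push(`${name}: ${error.message}`);
    }
  }
  throw new Error(`All conversions failed (${errors.join("; ")})`);
}
```

A call such as `convertWithFallback(Extract2MDConverter, pdfFile, config, ["combinedConvertWithLLM", "quickConvertOnly"])` would then return plain extracted text if the LLM-backed mode fails.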

Installation

Install the library via npm:

npm install extract2md

Alternatively, include the UMD bundle directly in your HTML:

<script src="https://unpkg.com/extract2md/dist/assets/extract2md.umd.js"></script>

Legacy Support

To ensure backward compatibility, the LegacyExtract2MDConverter class is available. This allows existing projects to use the new library without requiring an immediate rewrite of their implementation.

import { LegacyExtract2MDConverter } from 'extract2md';
const converter = new LegacyExtract2MDConverter(options);
const quick = await converter.quickConvert(pdfFile);

Resource Footprint

The core library is approximately 11MB. PDF.js contributes about 950KB, while Tesseract.js adds roughly 4.5MB. The final resource requirements for the LLM will vary based on the model you select. By keeping all processing on the client side, Extract2MD ensures your data remains private and never touches a remote server.