Extract2MD is a client-side JavaScript library that converts PDF files into Markdown. It operates entirely within the browser, meaning your documents are processed locally without server uploads or restrictive file size caps.
The library offers five distinct methods for extracting text, allowing you to choose the approach that best fits your document's complexity.
Five Modes, One Library
1. Quick Conversion
This is the fastest method, ideal for PDFs that already contain a selectable text layer. It uses PDF.js to extract the existing text directly.
const markdown = await Extract2MDConverter.quickConvertOnly(pdfFile);
2. High-Accuracy OCR
For PDFs consisting of scanned images or documents with irregular layouts, Extract2MD uses Tesseract.js to perform Optical Character Recognition (OCR).
const markdown = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);
3. Quick + LLM Polish
This mode combines fast text extraction with a local Large Language Model (via WebLLM). The LLM reviews the raw text to fix formatting errors and improve readability.
const markdown = await Extract2MDConverter.quickConvertWithLLM(pdfFile);
4. High-Accuracy + LLM Polish
After Tesseract.js handles the initial OCR, the LLM refines the output. This is particularly effective for scanned documents that require structural cleanup.
const markdown = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);
5. Combined Extraction + LLM (Recommended)
This mode uses every available tool. PDF.js extracts native text while Tesseract.js runs OCR on image-based elements. A specialized prompt then directs the LLM to merge the two streams into a single, coherent Markdown file.
const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);
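As a rough guide to choosing among the five modes, the decision can be sketched as a small lookup. The `pickMode` helper and its three flags are hypothetical and not part of Extract2MD; only the returned method names come from the calls shown above.

```javascript
// Hypothetical helper (not part of Extract2MD): maps three yes/no answers
// about a document onto the five method names listed in this article.
function pickMode({ hasTextLayer, hasScannedContent, useLLM }) {
  if (hasTextLayer && hasScannedContent) {
    // Mixed documents: the combined mode is the only one that reads both
    // sources. Without an LLM, this sketch falls back to OCR alone.
    return useLLM ? "combinedConvertWithLLM" : "highAccuracyConvertOnly";
  }
  if (hasScannedContent) {
    // Image-only documents need OCR, optionally followed by LLM cleanup.
    return useLLM ? "highAccuracyConvertWithLLM" : "highAccuracyConvertOnly";
  }
  // Documents with a usable text layer can take the fast path.
  return useLLM ? "quickConvertWithLLM" : "quickConvertOnly";
}

// Example: a scanned contract, with LLM cleanup enabled.
const mode = pickMode({ hasTextLayer: false, hasScannedContent: true, useLLM: true });
console.log(mode);
```

The returned name could then be used to call the matching method, for example `Extract2MDConverter[mode](pdfFile)`.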
Detailed Configuration
Extract2MD provides a deep configuration object. You can specify custom worker paths for PDF.js and Tesseract.js, select different LLM models, or define your own system prompts.
const config = {
  pdfJsWorkerSrc: "../pdf.worker.min.mjs",
  tesseract: {
    workerPath: "./tesseract-worker.min.js",
    corePath: "./tesseract-core.wasm.js",
    langPath: "./lang-data/",
    language: "eng",
  },
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    options: { temperature: 0.7, maxTokens: 4096 }
  },
  systemPrompts: {
    singleExtraction: "Keep all code blocks exactly as they appear.",
    combinedExtraction: "Preserve table structures from the OCR pass."
  },
  processing: {
    pdfRenderScale: 2.5,
    postProcessRules: [{ find: /\bAPI\b/g, replace: "API" }]
  },
  progressCallback: (progress) => {
    console.log(`${progress.stage}: ${progress.currentPage}/${progress.totalPages}`);
  }
};
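To make the `postProcessRules` entry more concrete, here is a minimal sketch of how a list of `{ find, replace }` rules could be applied to extracted text. It assumes the rules run in order and that `find` is a regular expression. The `applyPostProcessRules` function is illustrative only and may differ from what the library does internally.

```javascript
// Illustrative only: applies { find, replace } rules in order.
// Assumes `find` is a RegExp; a plain string would replace only the first match.
function applyPostProcessRules(text, rules = []) {
  return rules.reduce(
    (current, rule) => current.replace(rule.find, rule.replace),
    text
  );
}

const rules = [
  // Normalize any casing of "api" to "API" (note the case-insensitive flag).
  { find: /\bapi\b/gi, replace: "API" },
  // Strip trailing spaces and tabs at the end of every line.
  { find: /[ \t]+$/gm, replace: "" },
];

const cleaned = applyPostProcessRules(
  "The api is stable.   \nCall the Api twice.",
  rules
);
console.log(cleaned);
```

A rule only changes text that differs from its replacement, so a case-sensitive pattern such as `/\bAPI\b/g` paired with `"API"` leaves the text as it was. Adding the `i` flag is what turns it into a normalization rule.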
Advanced Components
If you do not require the full conversion pipeline, you can use the library’s components individually for modular tasks.
import { WebLLMEngine, OutputParser, ConfigValidator } from 'extract2md';
const validatedConfig = ConfigValidator.validate(userConfig);
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();
const result = await engine.generate("Your prompt here");
const cleanMarkdown = new OutputParser().parse(result);
Error Handling and Progress Tracking
The progressCallback function reports conversion status in real time, including page rendering, OCR stages, and LLM loading progress. Wrap conversion calls in a try/catch block so that failures can be handled gracefully.
try {
  const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
} catch (error) {
  console.error('Conversion failed:', error.message);
}
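One way to build on the try/catch pattern is to fall back to a lighter mode when a heavier one fails, for instance if the LLM cannot load. The sketch below is an assumption, not library behavior. `convertWithFallback` is a hypothetical wrapper, the fallback order is arbitrary, and it assumes every conversion method accepts `(pdfFile, config)` the way `combinedConvertWithLLM` does above.

```javascript
// Hypothetical wrapper: tries each named method in turn and returns the
// first successful result. `converter` is any object exposing the method
// names used in this article (for example Extract2MDConverter).
async function convertWithFallback(
  converter,
  pdfFile,
  config,
  methods = ["combinedConvertWithLLM", "highAccuracyConvertOnly", "quickConvertOnly"]
) {
  const errors = [];
  for (const method of methods) {
    try {
      const markdown = await converter[method](pdfFile, config);
      return { method, markdown };
    } catch (error) {
      // Record the failure and move on to the next, lighter mode.
      errors.push(`${method}: ${error.message}`);
    }
  }
  // Every mode failed: surface all collected messages at once.
  throw new Error(`All conversion modes failed:\n${errors.join("\n")}`);
}
```

A call such as `await convertWithFallback(Extract2MDConverter, pdfFile, config)` resolves to an object with the Markdown and the name of the mode that produced it, so the interface can show which path was taken.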
Installation
Install the library via npm:
npm install extract2md
Alternatively, include the UMD bundle directly in your HTML:
<script src="https://unpkg.com/extract2md/dist/assets/extract2md.umd.js"></script>
Legacy Support
To ensure backward compatibility, the LegacyExtract2MDConverter class is available. This allows existing projects to use the new library without requiring an immediate rewrite of their implementation.
import { LegacyExtract2MDConverter } from 'extract2md';
const converter = new LegacyExtract2MDConverter(options);
const quick = await converter.quickConvert(pdfFile);
Resource Footprint
The core library is approximately 11 MB. PDF.js contributes about 950 KB, while Tesseract.js adds roughly 4.5 MB. The LLM's resource requirements vary with the model you select. By keeping all processing on the client side, Extract2MD ensures your data remains private and never touches a remote server.