A powerful Node.js library for reading and extracting text from various document formats including PDF, DOCX, DOC, PPT, PPTX, and TXT files.
Everything you need to extract text and metadata from documents with ease and reliability.
Extract text from PDF, DOCX, DOC, PPT, PPTX, and TXT files with a unified API.
Get document statistics including word count, character count, and page count.
Read documents directly from memory buffers without writing to disk.
Full TypeScript support with comprehensive type definitions and IntelliSense.
Modern async/await API design for seamless integration with your applications.
Comprehensive error handling with custom error types and detailed error messages.
Complete API reference and usage examples to get you started quickly.
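As a rough illustration of the word and character counts mentioned in the feature list, the sketch below computes them from already-extracted text. The `TextStats` interface and `computeTextStats` function are hypothetical names introduced here. They are not part of doc-extract's API, and the library may count differently.

```typescript
// Hypothetical shape for illustration; the library's own statistics type may differ.
interface TextStats {
  wordCount: number;
  characterCount: number;
}

// Counts whitespace-separated words and UTF-16 code units in a string.
function computeTextStats(text: string): TextStats {
  const trimmed = text.trim();
  // An empty or whitespace-only string has zero words.
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return { wordCount, characterCount: text.length };
}
```

Splitting on runs of whitespace keeps the count stable when extracted text contains repeated spaces or line breaks, which is common in PDF output.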
npm install doc-extract
For full functionality, install these system packages:
Debian/Ubuntu (apt): sudo apt-get install antiword unrtf poppler-utils tesseract-ocr
macOS (Homebrew): brew install antiword unrtf poppler tesseract
Windows (Chocolatey): choco install poppler tesseract
See how to integrate doc-extract into your applications with these practical examples.
Handle file uploads and extract text in an Express.js application
import express from "express";
import multer from "multer";
import { DocumentReader } from "doc-extract";

const app = express();
// With no options, multer keeps uploaded files in memory (req.file.buffer).
const upload = multer();
const reader = new DocumentReader();

app.post("/upload", upload.single("document"), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: "No file uploaded" });
    }

    // Extract text straight from the in-memory upload; nothing is written to disk.
    const content = await reader.readDocumentFromBuffer(
      req.file.buffer,
      req.file.originalname,
      req.file.mimetype
    );

    res.json({
      text: content.text,
      metadata: content.metadata,
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log("Server running on port 3000");
});
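The upload handler above could be hardened with two small helpers, sketched here under stated assumptions. `isSupportedDocument` and `toErrorMessage` are hypothetical names, not doc-extract exports. The extension list is taken from the formats this page says the library reads.

```typescript
// Extensions taken from the formats listed for doc-extract on this page.
const SUPPORTED_EXTENSIONS = new Set(["pdf", "docx", "doc", "ppt", "pptx", "txt"]);

// Hypothetical helper: true when the filename ends in a supported extension.
// Names with no extension, a trailing dot, or only a leading dot are rejected.
function isSupportedDocument(filename: string): boolean {
  const dot = filename.lastIndexOf(".");
  if (dot <= 0 || dot === filename.length - 1) return false;
  return SUPPORTED_EXTENSIONS.has(filename.slice(dot + 1).toLowerCase());
}

// Hypothetical helper: turn an unknown catch value into a message string.
function toErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}
```

In the handler, one option is to respond with status 415 when `isSupportedDocument(req.file.originalname)` is false, before any extraction is attempted. `toErrorMessage(error)` can replace `error.message` in the catch block. That matters in TypeScript projects with `strict` enabled, where catch variables are typed `unknown` and direct property access does not compile.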
Check out our comprehensive documentation and community examples on GitHub.