Doc Extract

A powerful Node.js library for extracting text and metadata from PDF, DOCX, DOC, PPT, PPTX, and TXT files.

PDF

DOCX

DOC

PPT

PPTX

TXT

Powerful Features

Everything you need to extract text and metadata from documents with ease and reliability.

6 Formats

Multiple Format Support

Extract text from PDF, DOCX, DOC, PPT, PPTX, and TXT files with a unified API.

Detailed Stats

Rich Metadata

Get document statistics including word count, character count, and page count.

Memory Efficient

Buffer Support

Read documents directly from memory buffers without writing to disk.

Type Safe

TypeScript Ready

Full TypeScript support with comprehensive type definitions and IntelliSense.

Async/Await

Promise-based API

Modern async/await API design for seamless integration with your applications.

Robust

Error Handling

Comprehensive error handling with custom error types and detailed error messages.
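
The buffer, promise, and error-handling features above can be combined in one small helper. The following is a minimal sketch, not the library's documented API: the readDocumentFromBuffer(buffer, fileName, mimeType) call shape is taken from the Express example in this document, while BufferReader, ExtractResult, and safeExtract are hypothetical names introduced here. Because the library's custom error types are not named in this document, the sketch relies only on the standard Error class.

```typescript
// Hypothetical structural type: anything exposing the buffer-reading call
// used in this document's Express example (e.g. a doc-extract DocumentReader).
interface BufferReader {
  readDocumentFromBuffer(
    buffer: Buffer,
    fileName: string,
    mimeType: string
  ): Promise<{ text: string; metadata: unknown }>;
}

// Hypothetical result type so callers can branch without try/catch.
type ExtractResult =
  | { ok: true; text: string; metadata: unknown }
  | { ok: false; message: string };

async function safeExtract(
  reader: BufferReader,
  buffer: Buffer,
  fileName: string,
  mimeType: string
): Promise<ExtractResult> {
  try {
    // The document is read from memory; nothing is written to disk here.
    const content = await reader.readDocumentFromBuffer(buffer, fileName, mimeType);
    return { ok: true, text: content.text, metadata: content.metadata };
  } catch (error) {
    // Only the standard Error class is assumed, not any library-specific type.
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, message };
  }
}
```

Assuming DocumentReader matches that shape, a call would look like: await safeExtract(new DocumentReader(), buffer, "report.pdf", "application/pdf").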

6

Supported Formats

100%

TypeScript Coverage

MIT

Open Source License

Documentation

Complete API reference and usage examples to get you started quickly.

Installation & Setup

Install the Package

npm install doc-extract

System Dependencies

For full functionality, install these system packages:

Ubuntu/Debian

sudo apt-get install antiword unrtf poppler-utils tesseract-ocr

macOS

brew install antiword unrtf poppler tesseract

Windows

choco install poppler tesseract
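
Before relying on the formats that need these packages, it can help to confirm the command-line tools are actually reachable. The sketch below scans PATH for the binaries those packages normally provide (pdftotext ships with poppler). That doc-extract invokes the tools under exactly these names is an assumption, and REQUIRED_TOOLS, isOnPath, and findMissingTools are hypothetical helpers, not part of the library.

```typescript
import { accessSync, constants } from "node:fs";
import { delimiter, join } from "node:path";

// Assumed binary names provided by the system packages listed above.
const REQUIRED_TOOLS = ["antiword", "unrtf", "pdftotext", "tesseract"];

// Returns true if an executable named `tool` exists in any PATH directory.
function isOnPath(tool: string, pathVar: string = process.env.PATH ?? ""): boolean {
  // On Windows, executables usually carry an extension.
  const extensions = process.platform === "win32" ? [".exe", ".cmd", ".bat", ""] : [""];
  for (const dir of pathVar.split(delimiter)) {
    if (!dir) continue;
    for (const ext of extensions) {
      try {
        accessSync(join(dir, tool + ext), constants.X_OK);
        return true;
      } catch {
        // Not found in this directory; keep looking.
      }
    }
  }
  return false;
}

// Returns the subset of `tools` that could not be found on PATH.
function findMissingTools(tools: string[] = REQUIRED_TOOLS, pathVar?: string): string[] {
  return tools.filter((tool) => !isOnPath(tool, pathVar));
}
```

Calling findMissingTools() at startup and logging the result gives an early warning instead of a failure on the first DOC or scanned PDF.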

Real-World Examples

See how to integrate doc-extract into your applications with these practical examples.

Express.js Integration

Handle file uploads and extract text in an Express.js application.

Express

Multer

File Upload

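import express from "express";
import multer from "multer";
import { DocumentReader } from "doc-extract";

const app = express();
// Keep uploads in memory so req.file.buffer is available to the reader.
const upload = multer({ storage: multer.memoryStorage() });
const reader = new DocumentReader();

// "document" is the multipart form field that carries the file.
app.post("/upload", upload.single("document"), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: "No file uploaded" });
    }

    // Extract text and metadata straight from the uploaded buffer.
    const content = await reader.readDocumentFromBuffer(
      req.file.buffer,
      req.file.originalname,
      req.file.mimetype
    );

    res.json({
      text: content.text,
      metadata: content.metadata,
    });
  } catch (error) {
    // A thrown value is not guaranteed to be an Error instance.
    const message = error instanceof Error ? error.message : String(error);
    res.status(500).json({ error: message });
  }
});

app.listen(3000, () => {
  console.log("Server running on port 3000");
});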
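
To exercise the /upload route above, a client has to send a multipart form whose file field is named "document", matching upload.single("document"). The following is a minimal sketch using the standard fetch, FormData, and Blob globals (available in browsers and in Node.js 18+). UploadResult, buildUploadForm, and uploadDocument are hypothetical names, and the response shape is assumed from the route's res.json call.

```typescript
// Assumed response shape, mirroring the route's res.json({ text, metadata }).
interface UploadResult {
  text: string;
  metadata: unknown;
}

// Build a multipart form whose field name matches upload.single("document").
function buildUploadForm(file: Blob, fileName: string): FormData {
  const form = new FormData();
  form.append("document", file, fileName);
  return form;
}

// POST the file to the upload endpoint and return the extracted content.
async function uploadDocument(
  url: string,
  file: Blob,
  fileName: string
): Promise<UploadResult> {
  const response = await fetch(url, {
    method: "POST",
    body: buildUploadForm(file, fileName),
  });
  if (!response.ok) {
    // The route reports failures as { error: string }.
    const body = (await response.json().catch(() => ({}))) as { error?: string };
    throw new Error(body.error ?? `Upload failed with status ${response.status}`);
  }
  return (await response.json()) as UploadResult;
}
```

In a browser, a File taken from a file input can be passed directly as the Blob argument, for example uploadDocument("http://localhost:3000/upload", file, file.name).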

Need More Examples?

Check out our comprehensive documentation and community examples on GitHub.