
PDF Manipulation: A Practical Developer's Guide

PDF is deceptively complex. The format was designed in the early 1990s to solve a real problem — documents that look the same everywhere — and it succeeded so well that PDF is now the lingua franca of documents. But that versatility comes with complexity. This guide covers the practical operations developers and power users need: merging, splitting, extracting text, adding watermarks, compressing, OCR, and working with PDFs programmatically in JavaScript.

What's Actually Inside a PDF

Understanding PDF structure helps explain why some operations are easy and others are hard. A PDF is not a document in the word-processor sense — it's closer to a rendered page layout. Each page contains positioned drawing instructions: place this text at these coordinates, draw this image here, fill this rectangle with this color.

This is why editing PDF text is hard. There's no concept of "paragraph" or "line" — just character glyphs at specific positions. Reflow doesn't exist. Adding a sentence in the middle doesn't push subsequent text down; you'd have to manually reposition everything.

A PDF file contains:

  • Page objects: Each page's content stream with drawing commands
  • Resources: Fonts, images, and color profiles referenced by pages
  • Cross-reference table: Byte offset map to find objects quickly
  • Metadata: XMP metadata and document info dictionary (author, dates, software)
  • Optional structures: Bookmarks, form fields, annotations, digital signatures, JavaScript

PDF also supports incremental updates: changes can be appended to the end of the file rather than rewriting it. Digital signatures rely on this: the signed bytes stay untouched, and any later change or additional signature is appended as a new revision. It also means "deleted" content may still exist in the file.
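
As a rough illustration of appended revisions, the sketch below counts end-of-file markers in a file's raw bytes. Each revision normally ends with its own %%EOF, so a count above one hints that earlier content may still be present. The function name and the tiny hand-made byte strings are made up for the example, and the count is only a hint: linearized files, for instance, can contain more than one marker without any edit history.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each appended revision normally ends with one."""
    return pdf_bytes.count(b"%%EOF")


# Hand-made stand-ins for a real file (not valid PDFs, just the markers).
original = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"trailer\n<< >>\nstartxref\n9\n%%EOF\n"
)
# An incremental update is appended after the original bytes.
update = (
    b"1 0 obj\n<< /Type /Catalog /Lang (en) >>\nendobj\n"
    b"trailer\n<< >>\nstartxref\n80\n%%EOF\n"
)
revised = original + update

print(count_eof_markers(original))  # 1
print(count_eof_markers(revised))   # 2
```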

Merging PDFs

Merging is one of the most straightforward PDF operations. The merger copies page objects and their resources from each source file, appending them in order into a new PDF. Most tools handle the complexity of remapping resource names to avoid conflicts between files.

Common Merge Scenarios

  • Combining a cover letter with a resume and portfolio into one submission
  • Assembling a report from sections created by different team members
  • Combining scanned pages (front and back, or multi-page documents scanned separately)
  • Creating a single delivery with multiple documents for a client

Things to Check Before Merging

  1. Page orientation: Mixing portrait and landscape pages is valid PDF but can surprise readers. Most PDF viewers handle it, but print settings may not.
  2. Page size: Merging an A4 and a US Letter document creates a PDF with mixed page sizes. Some printers will scale pages to fit; others will center them with margins.
  3. Bookmarks: Most simple merge operations discard source bookmarks. If you're merging a 50-page report with chapters, you'll lose the table of contents navigation. Use a tool that preserves or rebuilds bookmarks.
  4. Form fields: If both source PDFs have form fields with the same names, they may merge incorrectly. Field naming conflicts can cause data in one form to overwrite another.
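
The form-field check in item 4 can be sketched as a simple name comparison. This assumes you have already collected the field names of each source file (for example with pypdf's reader.get_fields().keys()); the function name and the sample data are hypothetical.

```python
def find_field_conflicts(fields_by_file: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return field names that appear in more than one source file."""
    owners: dict[str, list[str]] = {}
    for filename, names in fields_by_file.items():
        for name in set(names):  # ignore repeats within one file
            owners.setdefault(name, []).append(filename)
    return {
        name: sorted(files)
        for name, files in owners.items()
        if len(files) > 1
    }


# Hypothetical field names from two forms about to be merged.
conflicts = find_field_conflicts({
    "application.pdf": ["full_name", "email"],
    "consent.pdf": ["email", "date"],
})
print(conflicts)  # {'email': ['application.pdf', 'consent.pdf']}
```

Any name reported here is worth renaming in one of the sources before merging.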

Privacy note: Browser-based PDF tools like those at ToolsDock PDF Tools process files locally in your browser. Nothing is uploaded to a server.

Splitting and Extracting Pages

Splitting and extracting are related but distinct: splitting produces multiple output files from one input (breaking a 100-page PDF into 10 chunks of 10), while extracting produces one output file containing specific selected pages.

Page Selection Syntax

Single page:       5         → produces page 5
Multiple pages:    1, 3, 7   → produces pages 1, 3, 7 in one PDF
Range:             10-20     → produces pages 10 through 20
Combined:          1, 3, 10-20, 25  → all of these
Reverse range:     20-10     → pages 20 through 10 (reversed order)
Last N pages:      (total-N+1)-(total)  → e.g. (total-5)-(total) gives the last 6 pages
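
A minimal sketch of how a tool might parse this syntax into a list of 1-based page numbers. The function name is made up for illustration, and the symbolic "total" form is not handled; callers would substitute the real page count first.

```python
def parse_page_selection(spec: str, total_pages: int) -> list[int]:
    """Turn a selection like "1, 3, 10-20, 25" into 1-based page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            step = 1 if start <= end else -1  # "20-10" walks backwards
            selected = list(range(start, end + step, step))
        else:
            selected = [int(part)]
        for page in selected:
            if not 1 <= page <= total_pages:
                raise ValueError(f"page {page} is outside 1-{total_pages}")
        pages.extend(selected)
    return pages


print(parse_page_selection("1, 3, 10-12, 25", total_pages=30))  # [1, 3, 10, 11, 12, 25]
print(parse_page_selection("5-3", total_pages=10))              # [5, 4, 3]
```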

Splitting Strategies

  • By fixed page count: Every N pages → useful for distributing handouts or creating equal-size chunks for email attachments
  • By bookmarks: Split at bookmark boundaries → natural for splitting chapters from a book-style PDF
  • By blank pages: Split when a blank page appears → common pattern for scanned multi-document batches where blank pages were inserted as separators
  • By file size: Produce chunks under a size limit → for email attachment limits or upload size restrictions
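
Two of these strategies reduce to simple page-number bookkeeping, sketched below. Both helpers are hypothetical and only compute which pages go into which output; the blank-page version assumes some earlier step has already decided which pages count as blank.

```python
def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Fixed page count: inclusive (first, last) page ranges, 1-based."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def split_on_blank_pages(is_blank: list[bool]) -> list[list[int]]:
    """Blank-page separators: group the non-blank pages between blanks."""
    groups: list[list[int]] = []
    current: list[int] = []
    for page_number, blank in enumerate(is_blank, start=1):
        if blank:
            if current:
                groups.append(current)
                current = []
        else:
            current.append(page_number)
    if current:
        groups.append(current)
    return groups


print(chunk_ranges(25, 10))
# [(1, 10), (11, 20), (21, 25)]
print(split_on_blank_pages([False, False, True, False, True, True, False]))
# [[1, 2], [4], [7]]
```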

Reordering Pages

Some tools let you reorder pages within a PDF — drag to rearrange, then save. Under the hood, this is extracting pages in a new order and merging them. Useful for fixing a scan where pages came out of order, or for rearranging slides.
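
One common out-of-order case can be fixed with a computed page order. The sketch below assumes a double-sided stack was scanned as all front sides first, then all back sides in reverse (the stack was simply flipped over); the function name is made up, and the result is a list of 0-based indices to feed into whatever extraction tool you use.

```python
def duplex_order(total_pages: int) -> list[int]:
    """Interleave fronts with backs that were scanned in reverse order."""
    if total_pages % 2:
        raise ValueError("expected an even number of scanned pages")
    half = total_pages // 2
    order: list[int] = []
    for i in range(half):
        order.append(i)                    # i-th front side
        order.append(total_pages - 1 - i)  # its back side, counted from the end
    return order


print(duplex_order(6))  # [0, 5, 1, 4, 2, 3]
```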


Extracting Text

Text extraction from PDFs ranges from trivial to impossible depending on the PDF's structure.

Native PDFs (Text Selectable)

PDFs created from Word, LaTeX, InDesign, or any tool that generates PDF from source content contain text as actual text objects. The PDF knows each character, its position, and its Unicode value. Extraction is reliable and fast.

Scanned PDFs (Image-Based)

A scanned document is just an image inside a PDF wrapper. There's no text — only pixels. Extraction requires OCR (see the OCR section below). Without OCR, you get zero text. With OCR, you get text quality that depends heavily on scan resolution, font clarity, and language.

The In-Between Case

Some PDFs have both: a scanned image layer with an invisible text layer added by OCR software. These look like scanned PDFs but have selectable text. Extraction from these works, but the text quality depends on how well the OCR was done originally.

Text Extraction with pdfjs (JavaScript)

import * as pdfjsLib from 'pdfjs-dist';

async function extractText(pdfBytes) {
  const pdf = await pdfjsLib.getDocument({ data: pdfBytes }).promise;
  const textContent = [];

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map(item => item.str)
      .join(' ');
    textContent.push(pageText);
  }

  return textContent.join('\n\n');
}

// Usage with a File input
const fileBytes = await file.arrayBuffer();
const text = await extractText(new Uint8Array(fileBytes));

Note that getTextContent() returns text items in roughly reading order, but PDF doesn't guarantee reading order. Multi-column layouts, tables, and complex designs may produce garbled extraction output. Post-processing or heuristics are often needed for complex layouts.
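
One simple post-processing heuristic is to group text items into lines by their vertical position and then sort each line left to right. The sketch below assumes each item has been reduced to a string plus x/y coordinates (in pdf.js these come from the last two entries of item.transform); the dictionary layout, function name, and tolerance value are illustrative. It handles single-column pages only; multi-column layouts need column detection first.

```python
def order_items(items: list[dict], line_tolerance: float = 2.0) -> str:
    """Group items into lines by y, then read each line left to right."""
    # PDF y coordinates grow upward, so a larger y is higher on the page.
    remaining = sorted(items, key=lambda it: (-it["y"], it["x"]))
    lines: list[list[dict]] = []
    for item in remaining:
        # Same line if the baseline is close to the first item of the current line.
        if lines and abs(lines[-1][0]["y"] - item["y"]) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return "\n".join(
        " ".join(it["str"] for it in sorted(line, key=lambda it: it["x"]))
        for line in lines
    )


# Hypothetical items arriving out of reading order.
items = [
    {"str": "world", "x": 60, "y": 700},
    {"str": "Second", "x": 10, "y": 680.5},
    {"str": "Hello", "x": 10, "y": 701},
    {"str": "line", "x": 70, "y": 680},
]
print(order_items(items))
# Hello world
# Second line
```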

Text Extraction with Python (pdfminer)

from pdfminer.high_level import extract_text

# Simple extraction
text = extract_text('document.pdf')

# With page-by-page control
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

# pypdf for basic needs (lighter weight)
import pypdf

reader = pypdf.PdfReader('document.pdf')
for page in reader.pages:
    print(page.extract_text())

Adding Watermarks

PDF watermarks are typically implemented in one of two ways:

  1. Stamp (foreground): Added as a transparent overlay on top of page content. Visible but content underneath is still accessible.
  2. Watermark (background): Added behind page content. Visible through transparent areas of the page, but content overlaps it.

For security purposes, neither prevents determined extraction — a "CONFIDENTIAL" watermark doesn't protect document content. It's a deterrent and a legal marker, not a technical protection.

Adding Watermarks with pdf-lib (JavaScript)

import { PDFDocument, rgb, degrees, StandardFonts } from 'pdf-lib';

async function addWatermark(pdfBytes, text) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const pages = pdfDoc.getPages();

  const fontSize = 60;
  const angle = 45;
  const radians = (angle * Math.PI) / 180;
  const textWidth = font.widthOfTextAtSize(text, fontSize);

  for (const page of pages) {
    const { width, height } = page.getSize();

    // drawText rotates around the text origin (x, y), so shift the origin
    // back along the rotated baseline to keep the text centered on the page
    page.drawText(text, {
      x: width / 2 - (textWidth / 2) * Math.cos(radians),
      y: height / 2 - (textWidth / 2) * Math.sin(radians),
      size: fontSize,
      font: font,
      color: rgb(0.8, 0.1, 0.1),
      opacity: 0.3,
      rotate: degrees(angle),
    });
  }

  return await pdfDoc.save();
}

// Usage
const originalBytes = await fetch('document.pdf').then(r => r.arrayBuffer());
const watermarked = await addWatermark(new Uint8Array(originalBytes), 'CONFIDENTIAL');
// watermarked is a Uint8Array of the new PDF

Watermark Tool

For no-code watermarking, use the PDF Tools — add text or image watermarks visually with opacity and position controls.

Compressing PDFs

What Compression Actually Does

PDF compression isn't one thing — it's several optimizations applied together:

  • Image recompression: The biggest win. A 300 DPI image meant for print can be downsampled to 150 DPI for screen viewing, which cuts the pixel count by 75% before any re-encoding. Typically reduces image data by 50-90%.
  • Font subsetting: Instead of embedding a full font (a large CJK font can hold tens of thousands of glyphs), embed only the glyphs the document actually uses, often a couple of hundred. Most PDF generators do this already; compression tools can add subsetting to PDFs that missed it.
  • Metadata removal: XMP metadata, document properties, and editing history can add tens or occasionally hundreds of KB to a PDF.
  • Content stream optimization: Deduplicate resources shared across pages, compress streams with Flate, and remove dead objects left behind by incremental updates.
  • Remove embedded thumbnails: Some PDF creators embed page thumbnails for preview; these are redundant for most uses.
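
The image downsampling figures follow from simple arithmetic: resolution applies in both dimensions, so the pixel count scales with the square of the DPI ratio. A short sketch of that calculation (the function name is made up; actual byte savings also depend on the encoder and quality settings):

```python
def downsample_pixel_ratio(source_dpi: float, target_dpi: float) -> float:
    """Fraction of the original pixels that remain after downsampling."""
    return (target_dpi / source_dpi) ** 2


for target in (150, 72):
    kept = downsample_pixel_ratio(300, target)
    print(f"300 → {target} DPI keeps {kept:.1%} of the pixels")
# 300 → 150 DPI keeps 25.0% of the pixels
# 300 → 72 DPI keeps 5.8% of the pixels
```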

Compression Levels and Tradeoffs

Low Compression
  • 10-30% size reduction
  • No quality loss
  • Safe for print
  • Images stay at original DPI
Medium Compression
  • 30-60% size reduction
  • Images downsampled to 150 DPI
  • Good for email
  • Fine for screen viewing
High Compression
  • 60-90% size reduction
  • Images at 72 DPI or lower
  • Visible quality loss on photos
  • Screen only, not printable

Understanding Why Your PDF Is Large

Common causes and typical size impact:

Scanned pages at 300 DPI       → Each page 500 KB - 2 MB
High-res photos                → Can be 1-5 MB per image
Embedded full fonts            → 200-500 KB per font
Unoptimized vector graphics    → Variable, can be large
XMP metadata and thumbnails    → Usually small (10-100 KB)
Incremental update debris      → Can be significant after many edits

CLI Compression with Ghostscript

# Screen quality (72 DPI images, smallest size)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/screen \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf

# Ebook quality (150 DPI images, good balance)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/ebook \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf

# Printer quality (300 DPI, minimal compression)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/printer \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf

Ghostscript is free, handles virtually any PDF, and is the engine behind many online PDF compressors. The /ebook setting is the best general-purpose option for web delivery.

Converting To and From PDF

To PDF

Source Format | Method | Quality Notes
Images (JPEG, PNG, TIFF, WebP) | Image to PDF | Excellent; images embedded at original quality
HTML/Web page | HTML to PDF | Good for simple layouts; complex CSS may not render perfectly
DOCX (Word document) | OS print-to-PDF or LibreOffice | Excellent; preserves formatting and fonts
PowerPoint (PPTX) | PPTX to PDF | Good; animations stripped, static slides preserved
Markdown | Via HTML (render, then print-to-PDF or Pandoc) | Depends on CSS styling

From PDF

Target Format | Method | Notes
Images (PNG, JPG) | PDF to Image | Each page becomes one image; set DPI based on use (150 screen, 300 print)
Plain text | pdfminer, pdfjs, or copy-paste | Works for native PDFs; OCR needed for scanned ones
EPUB | PDF to EPUB | Works best for text-heavy PDFs; layout-heavy PDFs convert poorly
DOCX | Adobe Acrobat, LibreOffice Draw | Quality varies; complex layouts often need manual cleanup

PDF to Image Quality Settings

DPI settings for PDF to image conversion:

72 DPI:  Web thumbnails, previews (tiny files)
96 DPI:  Standard screen display
150 DPI: Good screen quality, reasonable file size
300 DPI: Print quality (use for images you'll print)
600 DPI: Archive quality (large files, maximum detail)

For a typical A4 page (8.27 × 11.69 inches):

72 DPI  → 595 × 842 px (~50 KB PNG)
150 DPI → 1240 × 1754 px (~180 KB PNG)
300 DPI → 2480 × 3508 px (~700 KB PNG)
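
The pixel dimensions for an A4 page come straight from the page size and the DPI. A small sketch of that calculation, working from A4's nominal 210 × 297 mm (the helper name is made up; file sizes are not modelled, since they depend on page content and encoder):

```python
A4_MM = (210, 297)  # A4 width and height in millimetres


def page_pixels(dpi: int, size_mm: tuple[int, int] = A4_MM) -> tuple[int, ...]:
    """Pixel dimensions of a page rendered at the given DPI (25.4 mm per inch)."""
    return tuple(round(mm / 25.4 * dpi) for mm in size_mm)


for dpi in (72, 150, 300):
    width, height = page_pixels(dpi)
    print(f"{dpi} DPI → {width} × {height} px")
# 72 DPI → 595 × 842 px
# 150 DPI → 1240 × 1754 px
# 300 DPI → 2480 × 3508 px
```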

Form Filling

PDF forms come in two flavors: AcroForm (the original form technology, widely supported) and XFA (Adobe's XML Forms Architecture, used in complex forms, poorly supported in non-Adobe readers).

AcroForm Fields

Standard PDF forms use AcroForm fields — text boxes, checkboxes, radio buttons, dropdowns, signature boxes. These are embedded in the PDF and can be filled programmatically.

Filling Forms with pdf-lib (JavaScript)

import { PDFDocument } from 'pdf-lib';

async function fillForm(pdfBytes, formData) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const form = pdfDoc.getForm();

  // List all fields
  const fields = form.getFields();
  fields.forEach(field => console.log(field.getName(), field.constructor.name));

  // Fill text fields
  form.getTextField('full_name').setText(formData.name);
  form.getTextField('date_of_birth').setText(formData.dob);
  form.getTextField('email').setText(formData.email);

  // Check a checkbox
  if (formData.agreeToTerms) {
    form.getCheckBox('agree_checkbox').check();
  }

  // Select a radio button
  form.getRadioGroup('payment_method').select(formData.paymentMethod);

  // Select a dropdown option
  form.getDropdown('country').select(formData.country);

  // Flatten the form (makes fields non-editable, permanently bakes values in)
  form.flatten();

  return await pdfDoc.save();
}

Calling form.flatten() is important when you want to lock in the values — it converts form fields to static content so the PDF can be signed, printed, or shared without risk of values changing.

Filling Forms with Python (pypdf)

import pypdf

reader = pypdf.PdfReader('form.pdf')
writer = pypdf.PdfWriter()

# Get field names
print(reader.get_fields().keys())

# Fill fields
writer.clone_reader_document_root(reader)
writer.update_page_form_field_values(
    writer.pages[0],
    {
        'full_name': 'Jane Smith',
        'email': 'jane@example.com',
        'date': '2024-03-15',
    }
)

with open('filled_form.pdf', 'wb') as f:
    writer.write(f)

OCR on Scanned PDFs

OCR (Optical Character Recognition) converts images of text into actual text data. For a scanned PDF, OCR adds an invisible text layer positioned over the visible scan — the PDF still looks like a scan, but now text is selectable and searchable.

OCR Quality Factors

  • Scan resolution: 300 DPI minimum for acceptable results. 600 DPI is better, especially for small fonts.
  • Image quality: Skew, noise, faint text, and low contrast all degrade OCR accuracy. Pre-processing (deskew, denoise, increase contrast) improves results significantly.
  • Font type: Printed fonts work much better than handwriting. Decorative fonts, old typewriter text, and faded ink reduce accuracy.
  • Language: OCR engines need the right language model. Specify the document language explicitly when possible.

Tesseract (Open Source OCR)

# Convert scanned PDF to images, then OCR each page

# First: convert PDF to images with pdftoppm
pdftoppm -r 300 scanned.pdf page

# Then: OCR each page image to a text file
for page in page-*.ppm; do
  tesseract "$page" "${page%.ppm}" -l eng
done

# Or: build one searchable PDF from all page images
# (tesseract reads images, not PDFs; a text file listing image paths works as input)
ls page-*.ppm > pages.txt
tesseract pages.txt output -l eng pdf

# Better: use ocrmypdf (handles the full workflow)
ocrmypdf input.pdf output.pdf
ocrmypdf --deskew --clean input.pdf output.pdf  # With pre-processing
ocrmypdf -l deu input.pdf output.pdf            # German language

ocrmypdf is highly recommended — it wraps Ghostscript and Tesseract, handles multi-page PDFs, applies preprocessing, and produces a proper searchable PDF with the original appearance preserved.

Browser-Based OCR

Modern browsers can run Tesseract via WebAssembly. The PDF OCR tool processes your scanned PDF entirely in the browser — no uploads, no server-side processing.

Realistic Accuracy Expectations

Tesseract accuracy on well-scanned printed documents: 95-99%. On poor quality scans, handwriting, or degraded documents, accuracy can drop to 60-80% or lower. For critical data extraction (legal documents, financial records), always verify OCR output manually or use a validation step.
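
A validation step can be as simple as comparing OCR output against a hand-checked sample and flagging anything below a threshold. The sketch below uses the standard library's difflib similarity ratio as a rough proxy for character accuracy; the function names, the sample strings, and the 0.98 threshold are assumptions for illustration, not a standard metric.

```python
from difflib import SequenceMatcher


def character_similarity(reference: str, ocr_output: str) -> float:
    """Similarity between 0 and 1; a rough stand-in for character accuracy."""
    return SequenceMatcher(None, reference, ocr_output, autojunk=False).ratio()


def needs_review(reference: str, ocr_output: str, threshold: float = 0.98) -> bool:
    """Flag a page for manual checking when similarity falls below the threshold."""
    return character_similarity(reference, ocr_output) < threshold


# Hypothetical hand-verified text and the OCR result for the same region.
reference = "Invoice total: 1,250.00"
ocr_output = "Invoice tota1: 1,250.OO"

print(round(character_similarity(reference, ocr_output), 3))
print(needs_review(reference, ocr_output))
```

For financial figures in particular, a near-miss like "OO" for "00" is exactly the kind of error a threshold check surfaces, so spot-checking a sample of pages against ground truth is worth the effort.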

JavaScript PDF Libraries

pdf-lib — Create and Modify PDFs

pdf-lib is the go-to library for creating new PDFs or modifying existing ones in JavaScript. It runs in both browser and Node.js environments, has no dependencies, and handles most common tasks well.

import { PDFDocument, StandardFonts, rgb } from 'pdf-lib';

// Create a new PDF
const pdfDoc = await PDFDocument.create();
const page = pdfDoc.addPage([600, 800]);
const font = await pdfDoc.embedFont(StandardFonts.Helvetica);

page.drawText('Hello, PDF!', {
  x: 50,
  y: 750,
  size: 30,
  font: font,
  color: rgb(0, 0, 0),
});

// Embed an image
const jpgBytes = await fetch('photo.jpg').then(r => r.arrayBuffer());
const jpgImage = await pdfDoc.embedJpg(jpgBytes);
page.drawImage(jpgImage, { x: 50, y: 400, width: 300, height: 200 });

const pdfBytes = await pdfDoc.save();

// Load existing PDF and modify it
const existing = await PDFDocument.load(existingPdfBytes);
const [copiedPage] = await pdfDoc.copyPages(existing, [0]);
pdfDoc.addPage(copiedPage);

PDF.js — Render and Read PDFs

Mozilla's pdfjs-dist is the standard for rendering PDFs in the browser. It renders pages to canvas elements and provides text content extraction.

import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = 'pdfjs-dist/build/pdf.worker.min.js';

const pdf = await pdfjsLib.getDocument(pdfBytes).promise;

// Render page to canvas
const page = await pdf.getPage(1);
const viewport = page.getViewport({ scale: 1.5 });
const canvas = document.getElementById('pdf-canvas');
const ctx = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;

await page.render({ canvasContext: ctx, viewport }).promise;

// Extract text
const textContent = await page.getTextContent();
const strings = textContent.items.map(item => item.str);
console.log(strings.join(' '));

Choosing Between Libraries

Need | Library
Create new PDFs from scratch | pdf-lib
Modify existing PDFs (watermark, merge, fill forms) | pdf-lib
Render PDF pages for display | pdfjs-dist
Extract text from PDFs | pdfjs-dist
Advanced PDF manipulation in Node.js | hummus, pdfmake, Puppeteer
Generate PDF from HTML | Puppeteer + Chromium, wkhtmltopdf

Metadata and Privacy

PDFs can contain a surprising amount of metadata that reveals information about the document's origin and history. Before sharing a PDF externally, especially for anonymous submissions or sensitive communications, check what's embedded.

Common Metadata Fields

  • Author: The name of the person who created the document (often pulled from the OS user account)
  • Creator/Producer: The software used — "Microsoft Word", "Adobe Acrobat", "LibreOffice"
  • Creation and modification dates: When the file was first created and last modified
  • Title and Subject: Document properties that may contain internal project names
  • Keywords: Searchable metadata terms
  • XMP metadata: Extended metadata that can include revision history and more
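
In the document info dictionary, the creation and modification dates are stored as strings such as D:20240315093000+01'00'. If you ever read the raw value rather than a library's parsed one, a minimal parser might look like the sketch below. The function name is made up, and it covers the common forms only (missing trailing components default to the start of the period).

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string like D:YYYYMMDDHHmmSS+HH'mm' into a datetime."""
    match = _PDF_DATE.match(value.strip())
    if not match:
        raise ValueError(f"not a PDF date string: {value!r}")
    defaults = (0, 1, 1, 0, 0, 0)  # year is always present; month/day default to 1
    year, month, day, hour, minute, second = (
        int(group) if group else default
        for group, default in zip(match.groups()[:6], defaults)
    )
    sign, offset_hours, offset_minutes = match.group(7), match.group(8), match.group(9)
    if sign is None:
        tz = None  # no offset recorded; leave the datetime naive
    elif sign in "Zz":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(offset_hours or 0), minutes=int(offset_minutes or 0))
        tz = timezone(delta if sign == "+" else -delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20240315093000+01'00'").isoformat())
# 2024-03-15T09:30:00+01:00
```

Note that the offset itself is a small privacy leak: it reveals the time zone of the machine that produced the file.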

Reading Metadata with Python

import pypdf

reader = pypdf.PdfReader('document.pdf')
info = reader.metadata

print(f"Author: {info.author}")
print(f"Creator: {info.creator}")
print(f"Producer: {info.producer}")
print(f"Created: {info.creation_date}")
print(f"Modified: {info.modification_date}")

Removing Metadata

# With ExifTool (handles PDF metadata well)
# Note: ExifTool writes PDF changes as an incremental update, so the old values
# stay recoverable until the file is fully rewritten by another tool
exiftool -all:all= document.pdf -o clean.pdf

# With Ghostscript (rewrites the whole file; verify what metadata survives)
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
  -dFastWebView=true \
  -sOutputFile=clean.pdf \
  document.pdf

# With Python (pypdf)
import pypdf

reader = pypdf.PdfReader('document.pdf')
writer = pypdf.PdfWriter()
writer.clone_reader_document_root(reader)
writer.add_metadata({
    '/Author': '',
    '/Creator': '',
    '/Producer': '',
    '/Title': '',
})
with open('clean.pdf', 'wb') as f:
    writer.write(f)

The browser-based PDF Metadata Scrubber handles this without any code: open the file, scrub, download.

Batch Workflows

Batch Compress a Directory of PDFs

#!/bin/bash

# Compress all PDFs in current directory
for f in *.pdf; do
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
    -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dQUIET -dBATCH \
    -sOutputFile="compressed_${f}" "${f}"
  echo "Compressed: ${f} ($(du -h "${f}" | cut -f1) → $(du -h "compressed_${f}" | cut -f1))"
done

Batch OCR with ocrmypdf

# OCR all scanned PDFs in a directory
mkdir -p ./searchable
find ./scanned -name "*.pdf" | while IFS= read -r f; do
  output="./searchable/$(basename "$f")"
  ocrmypdf --deskew --clean "$f" "$output" && echo "Done: $f"
done

# Parallel processing with GNU parallel
find ./scanned -name "*.pdf" |
  parallel ocrmypdf --deskew {} ./searchable/{/}

Batch Merge (Multiple PDFs per Subfolder)

import { PDFDocument } from 'pdf-lib';
import fs from 'fs';
import path from 'path';

async function mergeDirectory(dirPath, outputPath) {
  // Collect the PDFs in name order so the merge order is predictable
  const files = fs.readdirSync(dirPath)
    .filter(f => f.endsWith('.pdf'))
    .sort()
    .map(f => path.join(dirPath, f));

  const merged = await PDFDocument.create();

  for (const filePath of files) {
    const bytes = fs.readFileSync(filePath);
    const doc = await PDFDocument.load(bytes);
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach(page => merged.addPage(page));
  }

  fs.writeFileSync(outputPath, await merged.save());
  console.log(`Merged ${files.length} PDFs → ${outputPath}`);
}

await mergeDirectory('./invoices', './invoices-combined.pdf');
