PDF Manipulation: A Practical Developer's Guide
PDF is deceptively complex. The format was designed in the early 1990s to solve a real problem — documents that look the same everywhere — and it succeeded so well that PDF is now the lingua franca of documents. But that versatility comes with complexity. This guide covers the practical operations developers and power users need: merging, splitting, extracting text, adding watermarks, compressing, OCR, and working with PDFs programmatically in JavaScript.
What's Actually Inside a PDF
Understanding PDF structure helps explain why some operations are easy and others are hard. A PDF is not a document in the word-processor sense; it's closer to a rendered page layout. Each page contains positioned drawing instructions: place this text at these coordinates, draw this image here, fill this rectangle with this color.
This is why editing PDF text is hard. There's no concept of "paragraph" or "line" — just character glyphs at specific positions. Reflow doesn't exist. Adding a sentence in the middle doesn't push subsequent text down; you'd have to manually reposition everything.
A PDF file contains:
- Page objects: Each page's content stream with drawing commands
- Resources: Fonts, images, and color profiles referenced by pages
- Cross-reference table: Byte offset map to find objects quickly
- Metadata: XMP metadata and document info dictionary (author, dates, software)
- Optional structures: Bookmarks, form fields, annotations, digital signatures, JavaScript
Modern PDF also supports incremental updates — changes can be appended to the file rather than rewriting it. This is how digital signatures work: the signed portion is frozen, and the signature is appended. It also means "deleted" content may still exist in the file.
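As a rough illustration of incremental updates, the sketch below counts end-of-file markers in the raw bytes. Each appended revision normally ends with its own `%%EOF`, so more than one marker suggests earlier revisions may still be in the file. The function name and the toy byte string are made up for this example, and the count is only a heuristic: some writers (linearized files, for instance) emit more than one marker without any edit history.

```python
import re


def count_revisions(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each incremental update normally appends one more."""
    return len(re.findall(rb"%%EOF", pdf_bytes))


# A toy byte string standing in for a PDF that was saved once, then updated once.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\ntrailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF\n"
    b"2 0 obj\n<< /Title (Edited) >>\nendobj\n"
    b"xref\ntrailer\n<< /Prev 0 >>\nstartxref\n0\n%%EOF\n"
)

revisions = count_revisions(sample)
print(f"revisions found: {revisions}")
```

For a real file, read it with `open(path, "rb").read()` and treat a count above one as a prompt to rewrite the PDF before sharing it.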
Merging PDFs
Merging is one of the most straightforward PDF operations. The merger copies page objects and their resources from each source file, appending them in order into a new PDF. Most tools handle the complexity of remapping resource names to avoid conflicts between files.
Common Merge Scenarios
- Combining a cover letter with a resume and portfolio into one submission
- Assembling a report from sections created by different team members
- Combining scanned pages (front and back, or multi-page documents scanned separately)
- Creating a single delivery with multiple documents for a client
Things to Check Before Merging
- Page orientation: Mixing portrait and landscape pages is valid PDF but can surprise readers. Most PDF viewers handle it, but print settings may not.
- Page size: Merging an A4 and a US Letter document creates a PDF with mixed page sizes. Some printers will scale pages to fit; others will center them with margins.
- Bookmarks: Most simple merge operations discard source bookmarks. If you're merging a 50-page report with chapters, you'll lose the table of contents navigation. Use a tool that preserves or rebuilds bookmarks.
- Form fields: If both source PDFs have form fields with the same names, they may merge incorrectly. Field naming conflicts can cause data in one form to overwrite another.
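The orientation and page-size checks above can be sketched as a small pre-merge report. This is an illustration only: the page dimensions are assumed to be already available as (width, height) pairs in PDF points, and `describe_page` and `summarize_pages` are hypothetical helper names, not part of any PDF library.

```python
from collections import Counter

# Nominal sizes in PDF points (1/72 inch), portrait orientation.
KNOWN_SIZES = {"A4": (595.28, 841.89), "US Letter": (612.0, 792.0)}


def describe_page(width, height, tolerance=1.0):
    """Label a page by paper size and orientation, within a small tolerance."""
    orientation = "portrait" if height >= width else "landscape"
    short_side, long_side = sorted((width, height))
    for name, (w, h) in KNOWN_SIZES.items():
        if abs(short_side - w) <= tolerance and abs(long_side - h) <= tolerance:
            return f"{name} {orientation}"
    return f"custom {orientation}"


def summarize_pages(sizes):
    """Count how many pages fall into each size/orientation group."""
    return Counter(describe_page(w, h) for w, h in sizes)


pages = [(595.28, 841.89), (595.28, 841.89), (612, 792), (841.89, 595.28)]
summary = summarize_pages(pages)
if len(summary) > 1:
    print("Mixed page geometry:", dict(summary))
```

More than one group in the summary is the cue to decide how the merged file should print before sending it out.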
Splitting and Extracting Pages
Splitting and extracting are related but distinct: splitting produces multiple output files from one input (breaking a 100-page PDF into 10 chunks of 10), while extracting produces one output file containing specific selected pages.
Page Selection Syntax
Single page: 5 → produces page 5
Multiple pages: 1, 3, 7 → produces pages 1, 3, 7 in one PDF
Range: 10-20 → produces pages 10 through 20
Combined: 1, 3, 10-20, 25 → all of these
Reverse range: 20-10 → pages 20 through 10 (reversed order)
Last N pages: (total-N+1)-(total), e.g. (total-5)-(total) → last 6 pages
Splitting Strategies
- By fixed page count: Every N pages → useful for distributing handouts or creating equal-size chunks for email attachments
- By bookmarks: Split at bookmark boundaries → natural for splitting chapters from a book-style PDF
- By blank pages: Split when a blank page appears → common pattern for scanned multi-document batches where blank pages were inserted as separators
- By file size: Produce chunks under a size limit → for email attachment limits or upload size restrictions
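The page selection syntax and the fixed-page-count strategy can be sketched in a few lines. This is a minimal sketch under stated assumptions: it covers only comma-separated pages and dash ranges with 1-based numbers, and `parse_page_selection` and `fixed_size_chunks` are names invented for the example rather than functions from an existing tool.

```python
def parse_page_selection(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as '1, 3, 10-20, 25' into 1-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            step = 1 if end >= start else -1  # '20-10' counts downwards
            selected = list(range(start, end + step, step))
        else:
            selected = [int(part)]
        for page in selected:
            if not 1 <= page <= total_pages:
                raise ValueError(f"page {page} is outside 1..{total_pages}")
        pages.extend(selected)
    return pages


def fixed_size_chunks(total_pages: int, chunk_size: int) -> list[list[int]]:
    """Split pages 1..total_pages into consecutive chunks of chunk_size."""
    return [
        list(range(start, min(start + chunk_size, total_pages + 1)))
        for start in range(1, total_pages + 1, chunk_size)
    ]


total = 30
print(parse_page_selection("1, 3, 10-20, 25", total))
print(parse_page_selection(f"{total - 5}-{total}", total))  # last 6 pages
print(fixed_size_chunks(25, 10))
```

The resulting lists can then be handed to whichever library does the actual page copying, after converting to 0-based indices if that library expects them.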
Reordering Pages
Some tools let you reorder pages within a PDF — drag to rearrange, then save. Under the hood, this is extracting pages in a new order and merging them. Useful for fixing a scan where pages came out of order, or for rearranging slides.
Tools: PDF Tools
Split, merge, extract pages, reorder — all browser-based with no file uploads.
Extracting Text
Text extraction from PDFs ranges from trivial to impossible depending on the PDF's structure.
Native PDFs (Text Selectable)
PDFs created from Word, LaTeX, InDesign, or any tool that generates PDF from source content contain text as actual text objects. The PDF knows each character, its position, and its Unicode value. Extraction is reliable and fast.
Scanned PDFs (Image-Based)
A scanned document is just an image inside a PDF wrapper. There's no text — only pixels. Extraction requires OCR (see the OCR section below). Without OCR, you get zero text. With OCR, you get text quality that depends heavily on scan resolution, font clarity, and language.
The In-Between Case
Some PDFs have both: a scanned image layer with an invisible text layer added by OCR software. These look like scanned PDFs but have selectable text. Extraction from these works, but the text quality depends on how well the OCR was done originally.
Text Extraction with pdfjs (JavaScript)
import * as pdfjsLib from 'pdfjs-dist';

async function extractText(pdfBytes) {
  const pdf = await pdfjsLib.getDocument({ data: pdfBytes }).promise;
  const textContent = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map(item => item.str)
      .join(' ');
    textContent.push(pageText);
  }
  return textContent.join('\n\n');
}

// Usage with a File input
const fileBytes = await file.arrayBuffer();
const text = await extractText(new Uint8Array(fileBytes));
Note that getTextContent() returns text items in roughly reading order, but PDF doesn't guarantee reading order. Multi-column layouts, tables, and complex designs may produce garbled extraction output. Post-processing or heuristics are often needed for complex layouts.
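One common post-processing heuristic is to rebuild lines from item positions: group items whose baselines are close together, then sort each group left to right. The sketch below assumes each text item has already been reduced to a dict with `str`, `x`, and `y` keys (with pdf.js these coordinates would come from `item.transform[4]` and `item.transform[5]`); the `items_to_lines` name and the tolerance value are illustrative choices. It does not handle multi-column layouts.

```python
def items_to_lines(items, y_tolerance=2.0):
    """Group positioned text items into lines, top to bottom, left to right."""
    lines = []  # each entry is [baseline_y, [items on that line]]
    # PDF y coordinates grow upwards, so the highest y is the first line.
    for item in sorted(items, key=lambda it: -it["y"]):
        if lines and abs(lines[-1][0] - item["y"]) <= y_tolerance:
            lines[-1][1].append(item)
        else:
            lines.append([item["y"], [item]])
    return [
        " ".join(it["str"] for it in sorted(group, key=lambda it: it["x"]))
        for _, group in lines
    ]


items = [
    {"str": "world", "x": 100, "y": 700},
    {"str": "Hello", "x": 50, "y": 700.5},
    {"str": "Second line", "x": 50, "y": 680},
]
print(items_to_lines(items))
```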
Text Extraction with Python (pdfminer)
from pdfminer.high_level import extract_text

# Simple extraction
text = extract_text('document.pdf')

# With page-by-page control
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

# pypdf for basic needs (lighter weight)
import pypdf

reader = pypdf.PdfReader('document.pdf')
for page in reader.pages:
    print(page.extract_text())
Adding Watermarks
PDF watermarks are typically implemented in one of two ways:
- Stamp (foreground): Added as a transparent overlay on top of page content. Visible but content underneath is still accessible.
- Watermark (background): Added behind page content. Visible through transparent areas of the page, but content overlaps it.
For security purposes, neither prevents determined extraction — a "CONFIDENTIAL" watermark doesn't protect document content. It's a deterrent and a legal marker, not a technical protection.
Adding Watermarks with pdf-lib (JavaScript)
import { PDFDocument, rgb, degrees, StandardFonts } from 'pdf-lib';

async function addWatermark(pdfBytes, text) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const pages = pdfDoc.getPages();
  for (const page of pages) {
    const { width, height } = page.getSize();
    const fontSize = 60;
    const textWidth = font.widthOfTextAtSize(text, fontSize);
    page.drawText(text, {
      x: width / 2 - textWidth / 2,
      y: height / 2,
      size: fontSize,
      font: font,
      color: rgb(0.8, 0.1, 0.1),
      opacity: 0.3,
      rotate: degrees(45),
    });
  }
  return await pdfDoc.save();
}

// Usage
const originalBytes = await fetch('document.pdf').then(r => r.arrayBuffer());
const watermarked = await addWatermark(new Uint8Array(originalBytes), 'CONFIDENTIAL');
// watermarked is a Uint8Array of the new PDF
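One detail worth checking in the snippet above: subtracting half the text width from x centers the text only when it is not rotated. Assuming the rotation pivots around the text's starting point, a 45° watermark placed that way will drift up and to the right of center. The arithmetic for a start point that keeps the midpoint of the rotated baseline on the page center can be sketched as follows; `centered_rotated_origin` is a hypothetical helper, and the sketch ignores the text's height.

```python
import math


def centered_rotated_origin(page_width, page_height, text_width, angle_degrees):
    """Return the (x, y) start point that centers a rotated text baseline."""
    theta = math.radians(angle_degrees)
    half = text_width / 2
    # Step back half the text width along the rotated baseline direction.
    x = page_width / 2 - half * math.cos(theta)
    y = page_height / 2 - half * math.sin(theta)
    return x, y


x, y = centered_rotated_origin(612, 792, 400, 45)
print(f"start at x={x:.1f}, y={y:.1f}")
```

With an angle of 0 this reduces to the `width / 2 - textWidth / 2` expression used in the JavaScript example.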
Watermark Tool
For no-code watermarking, use the PDF Tools — add text or image watermarks visually with opacity and position controls.
Compressing PDFs
What Compression Actually Does
PDF compression isn't one thing — it's several optimizations applied together:
- Image recompression: The biggest win. A 300 DPI image meant for print is downsampled to 150 DPI for screen viewing, which cuts the pixel count by 75% before any re-encoding. Can reduce image data by 50-90%.
- Font subsetting: Instead of embedding a full font file (which can hold thousands of glyphs, up to 65,535 in a single TrueType or OpenType font), embed only the ~200 glyphs actually used in the document. Common in PDF generators; compression tools can add subsetting to PDFs that missed it.
- Metadata removal: XMP metadata, document properties, and editing history usually add tens of KB, occasionally more.
- Content stream optimization: Deduplicate resources shared across pages, compress streams with Flate compression, remove dead objects from incremental updates.
- Remove embedded thumbnails: Some PDF creators embed page thumbnails for preview; these are redundant for most uses.
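To see why image downsampling dominates, note that pixel count scales with the square of the DPI ratio. The sketch below computes the fraction of pixels removed; `pixel_reduction` is a name invented for this example, and it describes pixel counts only, since the final file size also depends on the codec and its quality setting.

```python
def pixel_reduction(source_dpi: float, target_dpi: float) -> float:
    """Fraction of pixels removed when downsampling from source to target DPI."""
    if target_dpi >= source_dpi:
        return 0.0  # never upsample
    return 1 - (target_dpi / source_dpi) ** 2


for target in (150, 72):
    saved = pixel_reduction(300, target)
    print(f"300 -> {target} DPI removes {saved:.0%} of the pixels")
```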
Compression Levels and Tradeoffs
| Level | Size reduction | Image handling | Best for |
|---|---|---|---|
| Low | 10-30% | Images stay at original DPI; no quality loss | Print |
| Medium | 30-60% | Images downsampled to 150 DPI | Email and screen viewing |
| High | 60-90% | Images at 72 DPI or lower; visible quality loss on photos | Screen only, not printable |
Understanding Why Your PDF Is Large
Common causes and typical size impact:
Scanned pages at 300 DPI → Each page 500 KB - 2 MB
High-res photos → Can be 1-5 MB per image
Embedded full fonts → 200-500 KB per font
Unoptimized vector graphics → Variable, can be large
XMP metadata and thumbnails → Usually small (10-100 KB)
Incremental update debris → Can be significant after many edits
CLI Compression with Ghostscript
# Screen quality (72 DPI images, smallest size)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/screen \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf

# Ebook quality (150 DPI images, good balance)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/ebook \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf

# Printer quality (300 DPI, minimal compression)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/printer \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf
Ghostscript is free, handles virtually any PDF, and is the engine behind many online PDF compressors. The /ebook setting is the best general-purpose option for web delivery.
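When these commands are driven from a script, building the argument list in code avoids shell-quoting problems with unusual file names. The sketch below only assembles and prints the command; it uses the same flags as the examples above, and `ghostscript_command` is a hypothetical helper name.

```python
import shlex

# Quality presets accepted by -dPDFSETTINGS.
PRESETS = {"screen", "ebook", "printer", "prepress", "default"}


def ghostscript_command(input_path, output_path, preset="ebook"):
    """Build the argument list for a Ghostscript compression run."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={output_path}",
        input_path,
    ]


command = ghostscript_command("input.pdf", "compressed.pdf")
print(shlex.join(command))
```

To actually run it, pass the list to `subprocess.run(command, check=True)` on a machine where `gs` is installed.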
Converting To and From PDF
To PDF
| Source Format | Method | Quality Notes |
|---|---|---|
| Images (JPEG, PNG, TIFF, WebP) | Image to PDF | Excellent — images embedded at original quality |
| HTML/Web page | HTML to PDF | Good for simple layouts; complex CSS may not render perfectly |
| DOCX (Word document) | OS print-to-PDF or LibreOffice | Excellent — preserves formatting and fonts |
| PowerPoint (PPTX) | PPTX to PDF | Good; animations stripped, static slides preserved |
| Markdown | Via HTML (render, then print-to-PDF or Pandoc) | Depends on CSS styling |
From PDF
| Target Format | Method | Notes |
|---|---|---|
| Images (PNG, JPG) | PDF to Image | Each page becomes one image; set DPI based on use (150 screen, 300 print) |
| Plain text | pdfminer, pdfjs, or copy-paste | Works for native PDFs; OCR needed for scanned ones |
| EPUB | PDF to EPUB | Works best for text-heavy PDFs; layout-heavy PDFs convert poorly |
| DOCX | Adobe Acrobat, LibreOffice Draw | Quality varies; complex layouts often need manual cleanup |
PDF to Image Quality Settings
DPI settings for PDF to image conversion:
72 DPI: Web thumbnails, previews — tiny files
96 DPI: Standard screen display
150 DPI: Good screen quality, reasonable file size
300 DPI: Print quality — use for images you'll print
600 DPI: Archive quality — large files, maximum detail
For a typical A4 page (8.27 × 11.69 inches):
72 DPI → 595 × 842 px (~50 KB PNG)
150 DPI → 1240 × 1754 px (~180 KB PNG)
300 DPI → 2480 × 3508 px (~700 KB PNG)
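The pixel dimensions above follow directly from page size and DPI. The sketch below reproduces them for A4 (210 × 297 mm) and adds the uncompressed RGB size, which is a hard upper bound rather than a prediction of PNG size. The function names are illustrative.

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def raw_rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image, before any PNG/JPEG encoding."""
    return width_px * height_px * 3


for dpi in (72, 150, 300):
    w, h = page_pixels(210, 297, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {raw_rgb_bytes(w, h) / 1_000_000:.1f} MB raw")
```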
Form Filling
PDF forms come in two flavors: AcroForm (the original form technology, widely supported) and XFA (Adobe's XML Forms Architecture, used in complex forms, poorly supported in non-Adobe readers).
AcroForm Fields
Standard PDF forms use AcroForm fields — text boxes, checkboxes, radio buttons, dropdowns, signature boxes. These are embedded in the PDF and can be filled programmatically.
Filling Forms with pdf-lib (JavaScript)
import { PDFDocument } from 'pdf-lib';

async function fillForm(pdfBytes, formData) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const form = pdfDoc.getForm();

  // List all fields
  const fields = form.getFields();
  fields.forEach(field => console.log(field.getName(), field.constructor.name));

  // Fill text fields
  form.getTextField('full_name').setText(formData.name);
  form.getTextField('date_of_birth').setText(formData.dob);
  form.getTextField('email').setText(formData.email);

  // Check a checkbox
  if (formData.agreeToTerms) {
    form.getCheckBox('agree_checkbox').check();
  }

  // Select a radio button
  form.getRadioGroup('payment_method').select(formData.paymentMethod);

  // Select a dropdown option
  form.getDropdown('country').select(formData.country);

  // Flatten the form (makes fields non-editable, permanently bakes values in)
  form.flatten();
  return await pdfDoc.save();
}
Calling form.flatten() is important when you want to lock in the values — it converts form fields to static content so the PDF can be signed, printed, or shared without risk of values changing.
Filling Forms with Python (pypdf)
import pypdf

reader = pypdf.PdfReader('form.pdf')
writer = pypdf.PdfWriter()

# Get field names
print(reader.get_fields().keys())

# Fill fields
writer.clone_reader_document_root(reader)
writer.update_page_form_field_values(
    writer.pages[0],
    {
        'full_name': 'Jane Smith',
        'email': 'jane@example.com',
        'date': '2024-03-15',
    },
)

with open('filled_form.pdf', 'wb') as f:
    writer.write(f)
OCR on Scanned PDFs
OCR (Optical Character Recognition) converts images of text into actual text data. For a scanned PDF, OCR adds an invisible text layer positioned over the visible scan; the PDF still looks like a scan, but now text is selectable and searchable.
OCR Quality Factors
- Scan resolution: 300 DPI minimum for acceptable results. 600 DPI is better, especially for small fonts.
- Image quality: Skew, noise, faint text, and low contrast all degrade OCR accuracy. Pre-processing (deskew, denoise, increase contrast) improves results significantly.
- Font type: Printed fonts work much better than handwriting. Decorative fonts, old typewriter text, and faded ink reduce accuracy.
- Language: OCR engines need the right language model. Specify the document language explicitly when possible.
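The resolution guideline can be checked before running OCR by comparing a page image's pixel width with the physical width of the paper that was scanned. The sketch below applies the 300 and 600 DPI thresholds from the list above; the function names and the wording of the verdicts are illustrative.

```python
def effective_dpi(pixel_width: int, physical_width_inches: float) -> float:
    """Scan resolution implied by an image width and the paper width."""
    return pixel_width / physical_width_inches


def ocr_readiness(dpi: float) -> str:
    """Classify a scan resolution against the thresholds in the text."""
    if dpi >= 600:
        return "good, including small fonts"
    if dpi >= 300:
        return "acceptable"
    return "rescan at higher resolution"


# A US Letter page (8.5 inches wide) scanned to a 2550-pixel-wide image.
dpi = effective_dpi(2550, 8.5)
print(f"{dpi:.0f} DPI: {ocr_readiness(dpi)}")
```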
Tesseract (Open Source OCR)
# Convert scanned PDF to images, then OCR each page
# First: convert PDF to images with pdftoppm
pdftoppm -r 300 scanned.pdf page

# Then: OCR each page image to plain text
for page in page-*.ppm; do
  tesseract "$page" "${page%.ppm}" -l eng
done

# Combine pages into one searchable PDF. Tesseract does not read PDF input,
# so pass it a text file listing the page images instead.
ls page-*.ppm > pages.txt
tesseract pages.txt output -l eng pdf

# Better: use ocrmypdf (handles the full workflow)
ocrmypdf input.pdf output.pdf
ocrmypdf --deskew --clean input.pdf output.pdf   # With pre-processing
ocrmypdf -l deu input.pdf output.pdf             # German language
ocrmypdf is highly recommended — it wraps Ghostscript and Tesseract, handles multi-page PDFs, applies preprocessing, and produces a proper searchable PDF with the original appearance preserved.
Browser-Based OCR
Modern browsers can run Tesseract via WebAssembly. The PDF OCR tool processes your scanned PDF entirely in the browser — no uploads, no server-side processing.
Realistic Accuracy Expectations
Tesseract accuracy on well-scanned printed documents: 95-99%. On poor quality scans, handwriting, or degraded documents, accuracy can drop to 60-80% or lower. For critical data extraction (legal documents, financial records), always verify OCR output manually or use a validation step.
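One possible validation step is to spot-check OCR output against a few lines that were typed in by hand and flag pages that fall below a similarity threshold. The sketch below uses the standard library's `difflib` for a character-level similarity score. The sample strings, the helper names, and the 0.95 threshold are assumptions made for illustration, and the score is a similarity ratio, not a formal character error rate.

```python
from difflib import SequenceMatcher


def character_similarity(expected: str, actual: str) -> float:
    """Similarity between 0 and 1 based on matching character runs."""
    return SequenceMatcher(None, expected, actual).ratio()


def needs_review(expected: str, actual: str, threshold: float = 0.95) -> bool:
    """Flag OCR output whose similarity to the reference is below threshold."""
    return character_similarity(expected, actual) < threshold


reference = "Invoice total: 1,250.00"
ocr_output = "lnvoice tota1: 1,25O.OO"  # typical l/I, 1/l and O/0 confusions

score = character_similarity(reference, ocr_output)
print(f"similarity: {score:.2f}, review needed: {needs_review(reference, ocr_output)}")
```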
JavaScript PDF Libraries
pdf-lib — Create and Modify PDFs
pdf-lib is the go-to library for creating new PDFs or modifying existing ones in JavaScript. It runs in both browser and Node.js environments, needs no native modules, and handles most common tasks well.
import { PDFDocument, StandardFonts, rgb } from 'pdf-lib';
// Create a new PDF
const pdfDoc = await PDFDocument.create();
const page = pdfDoc.addPage([600, 800]);
const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
page.drawText('Hello, PDF!', {
  x: 50,
  y: 750,
  size: 30,
  font: font,
  color: rgb(0, 0, 0),
});
// Embed an image
const jpgBytes = await fetch('photo.jpg').then(r => r.arrayBuffer());
const jpgImage = await pdfDoc.embedJpg(jpgBytes);
page.drawImage(jpgImage, { x: 50, y: 400, width: 300, height: 200 });
const pdfBytes = await pdfDoc.save();
// Load existing PDF and modify it
const existing = await PDFDocument.load(existingPdfBytes);
const [copiedPage] = await pdfDoc.copyPages(existing, [0]);
pdfDoc.addPage(copiedPage);
PDF.js — Render and Read PDFs
Mozilla's pdfjs-dist is the standard for rendering PDFs in the browser. It renders pages to canvas elements and provides text content extraction.
import * as pdfjsLib from 'pdfjs-dist';

pdfjsLib.GlobalWorkerOptions.workerSrc = 'pdfjs-dist/build/pdf.worker.min.js';

const pdf = await pdfjsLib.getDocument(pdfBytes).promise;

// Render page to canvas
const page = await pdf.getPage(1);
const viewport = page.getViewport({ scale: 1.5 });
const canvas = document.getElementById('pdf-canvas');
const ctx = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;

await page.render({ canvasContext: ctx, viewport }).promise;

// Extract text
const textContent = await page.getTextContent();
const strings = textContent.items.map(item => item.str);
console.log(strings.join(' '));
Choosing Between Libraries
| Need | Library |
|---|---|
| Create new PDFs from scratch | pdf-lib |
| Modify existing PDFs (watermark, merge, fill forms) | pdf-lib |
| Render PDF pages for display | pdfjs-dist |
| Extract text from PDFs | pdfjs-dist |
| Advanced PDF manipulation in Node.js | hummus (now continued as muhammara), pdfmake, Puppeteer |
| Generate PDF from HTML | Puppeteer + Chromium, wkhtmltopdf |
Metadata and Privacy
PDFs can contain a surprising amount of metadata that reveals information about the document's origin and history. Before sharing a PDF externally, especially for anonymous submissions or sensitive communications, check what's embedded.
Common Metadata Fields
- Author: The name of the person who created the document (often pulled from the OS user account)
- Creator/Producer: The software used — "Microsoft Word", "Adobe Acrobat", "LibreOffice"
- Creation and modification dates: When the file was first created and last modified
- Title and Subject: Document properties that may contain internal project names
- Keywords: Searchable metadata terms
- XMP metadata: Extended metadata that can include revision history and more
Reading Metadata with Python
import pypdf

reader = pypdf.PdfReader('document.pdf')
info = reader.metadata

print(f"Author: {info.author}")
print(f"Creator: {info.creator}")
print(f"Producer: {info.producer}")
print(f"Created: {info.creation_date}")
print(f"Modified: {info.modification_date}")
Removing Metadata
# With ExifTool (handles PDF metadata well)
exiftool -all:all= document.pdf -o clean.pdf
# Note: ExifTool edits PDFs by appending an incremental update, so the old
# metadata stays recoverable until the file is rewritten, e.g. with
# qpdf --linearize clean.pdf final.pdf

# With Ghostscript (strips most metadata)
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
  -dFastWebView=true \
  -sOutputFile=clean.pdf \
  document.pdf

# With Python (pypdf)
import pypdf

reader = pypdf.PdfReader('document.pdf')
writer = pypdf.PdfWriter()
writer.clone_reader_document_root(reader)
writer.add_metadata({
    '/Author': '',
    '/Creator': '',
    '/Producer': '',
    '/Title': '',
})
with open('clean.pdf', 'wb') as f:
    writer.write(f)
The browser-based PDF Metadata Scrubber handles this without any code — upload, scrub, download.
Batch Workflows
Batch Compress a Directory of PDFs
#!/bin/bash
# Compress all PDFs in current directory
for f in *.pdf; do
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
    -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dQUIET -dBATCH \
    -sOutputFile="compressed_${f}" "${f}"
  echo "Compressed: ${f} ($(du -h "${f}" | cut -f1) → $(du -h "compressed_${f}" | cut -f1))"
done
Batch OCR with ocrmypdf
# OCR all scanned PDFs in a directory
find ./scanned -name "*.pdf" | while read f; do
  output="./searchable/$(basename "$f")"
  ocrmypdf --deskew --clean "$f" "$output" && echo "Done: $f"
done

# Parallel processing with GNU parallel
find ./scanned -name "*.pdf" | \
  parallel ocrmypdf --deskew {} ./searchable/{/}
Batch Merge (Multiple PDFs per Subfolder)
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';
import path from 'path';

async function mergeDirectory(dirPath, outputPath) {
  // Collect the PDFs in name order so the merge order is predictable
  const files = fs.readdirSync(dirPath)
    .filter(f => f.endsWith('.pdf'))
    .sort()
    .map(f => path.join(dirPath, f));
  const merged = await PDFDocument.create();
  for (const filePath of files) {
    const bytes = fs.readFileSync(filePath);
    const doc = await PDFDocument.load(bytes);
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach(page => merged.addPage(page));
  }
  fs.writeFileSync(outputPath, await merged.save());
  console.log(`Merged ${files.length} PDFs → ${outputPath}`);
}

await mergeDirectory('./invoices', './invoices-combined.pdf');