How do I merge multiple PDFs into one file?

Use a PDF merger tool to combine files. Select your PDFs in the desired order, then merge them. The result is a single PDF with all pages in sequence. If the documents have different page sizes, decide whether uniform sizing matters before merging.

Can I split a PDF without Adobe Acrobat?

Yes. Browser-based PDF tools can split PDFs without installing any software. Load your PDF, specify page ranges (1-5, 10-15) or split every N pages, and download the separate files. Many free tools process the file entirely in your browser, so it is never sent to a server.

How much can I compress a PDF without losing quality?

Typical PDFs can be compressed 30-60% with minimal quality loss by optimizing images and removing metadata. High-compression mode (60-90% reduction) noticeably affects image quality. Native (non-scanned) text stays sharp at every level because it is stored as text rather than pixels; scanned text degrades along with the page image. Choose based on whether the PDF is for screen or print.

Is it safe to use online PDF tools for sensitive documents?

It depends on where the processing happens. Prefer browser-based tools that process files locally, so your PDF never leaves your computer. Look for 'client-side processing' or 'no upload' claims. For highly sensitive documents, use offline desktop software or verify the tool's privacy policy.

How do I extract specific pages from a PDF?

Use a PDF page extractor to select specific pages by number, range (10-20), or combination (1, 3, 10-15). The tool creates a new PDF containing only your selected pages. This doesn't modify the original file.

Why is my PDF file so large?

Large PDFs usually contain high-resolution images, embedded fonts, or multiple layers. Scanned documents are especially large. Compress by reducing image quality, subsetting fonts (only including used characters), and removing hidden layers or metadata.

How do I remove metadata from a PDF before sharing?

Use a PDF metadata scrubber to remove author name, creation date, editing software, and other hidden information. This protects privacy when sharing documents externally. Some tools also remove comments, form data, and revision history.

Can I convert a scanned PDF to searchable text?

Yes, using OCR (Optical Character Recognition). OCR analyzes the scanned image and adds an invisible text layer aligned with it. The PDF looks the same, but the text becomes selectable and searchable. Quality depends on scan resolution and text clarity.

What's the difference between PDF/A and regular PDF?

PDF/A is an archival format designed for long-term preservation. It embeds all fonts, disallows external links and JavaScript, and uses specific color spaces. Government and legal documents often require PDF/A to ensure they remain readable for decades.

How do I convert images to a multi-page PDF?

Use an image-to-PDF converter to combine multiple images into one PDF document. Each image typically becomes one page. Arrange images in desired order, optionally set page size and orientation, then convert. This is useful for creating portfolios or combining scanned pages.

PDF Manipulation: A Practical Developer's Guide

PDF is deceptively complex. The format was designed in the early 1990s to solve a real problem — documents that look the same everywhere — and it succeeded so well that PDF is now the lingua franca of documents. But that versatility comes with complexity. This guide covers the practical operations developers and power users need: merging, splitting, extracting text, adding watermarks, compressing, OCR, and working with PDFs programmatically in JavaScript.

What's Actually Inside a PDF

Understanding PDF structure helps explain why some operations are easy and others are hard. A PDF is not a document in the word-processor sense — it's closer to a rendered page layout. Each page contains positioned drawing instructions: place this text at these coordinates, draw this image here, fill this rectangle with this color.

This is why editing PDF text is hard. There's no concept of "paragraph" or "line" — just character glyphs at specific positions. Reflow doesn't exist. Adding a sentence in the middle doesn't push subsequent text down; you'd have to manually reposition everything.

A PDF file contains:

Page objects: Each page's content stream with drawing commands
Resources: Fonts, images, and color profiles referenced by pages
Cross-reference table: Byte offset map to find objects quickly
Metadata: XMP metadata and document info dictionary (author, dates, software)
Optional structures: Bookmarks, form fields, annotations, digital signatures, JavaScript

Modern PDF also supports incremental updates: changes can be appended to the file rather than rewriting it. Digital signatures rely on this. A signature covers a fixed range of bytes, so any later change is appended as a new revision and the signed bytes stay untouched. It also means "deleted" content may still exist in the file.
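
A rough way to check whether a file carries earlier revisions is to count its end-of-file markers, since each saved revision normally ends with `%%EOF`. The sketch below is a heuristic under stated assumptions, not a parser: linearized ("fast web view") files also contain more than one marker, and the byte sequence can appear inside uncompressed data, so a count above one is only a hint. The helper names and the sample bytes are hypothetical.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each saved revision normally ends with one."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """Heuristic only: more than one marker suggests appended revisions."""
    return count_eof_markers(pdf_bytes) > 1


# Tiny stand-ins for real files (not valid PDFs, only the markers matter here)
original = b"%PDF-1.7\n...objects...\nstartxref\n120\n%%EOF\n"
revised = original + b"...appended objects...\nstartxref\n480\n%%EOF\n"

print(looks_incrementally_updated(original))  # False
print(looks_incrementally_updated(revised))   # True
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes in.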

Merging PDFs

Merging is one of the most straightforward PDF operations. The merger copies page objects and their resources from each source file, appending them in order into a new PDF. Most tools handle the complexity of remapping resource names to avoid conflicts between files.

Common Merge Scenarios

Combining a cover letter with a resume and portfolio into one submission
Assembling a report from sections created by different team members
Combining scanned pages (front and back, or multi-page documents scanned separately)
Creating a single delivery with multiple documents for a client

Things to Check Before Merging

Page orientation: Mixing portrait and landscape pages is valid PDF but can surprise readers. Most PDF viewers handle it, but print settings may not.
Page size: Merging an A4 and a US Letter document creates a PDF with mixed page sizes. Some printers will scale pages to fit; others will center them with margins.
Bookmarks: Most simple merge operations discard source bookmarks. If you're merging a 50-page report with chapters, you'll lose the table of contents navigation. Use a tool that preserves or rebuilds bookmarks.
Form fields: If both source PDFs have form fields with the same names, they may merge incorrectly. Field naming conflicts can cause data in one form to overwrite another.
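
Field name conflicts can be spotted before merging by comparing the names used in each source file. This is a minimal sketch, assuming the field names have already been read from each PDF with whatever form library is in use; `find_field_conflicts` and `plan_renames` are hypothetical helpers, and the underscore prefix is only one possible renaming scheme.

```python
def find_field_conflicts(*field_name_lists):
    """Return {name: [document indexes]} for names used in more than one document."""
    seen = {}
    for doc_index, names in enumerate(field_name_lists):
        for name in set(names):  # ignore repeats inside a single document
            seen.setdefault(name, []).append(doc_index)
    return {name: docs for name, docs in seen.items() if len(docs) > 1}


def plan_renames(names, conflicts, prefix):
    """Map each conflicting name to a prefixed name that is unique per document."""
    return {name: f"{prefix}_{name}" for name in names if name in conflicts}


application = ["full_name", "email"]
consent = ["email", "date"]

conflicts = find_field_conflicts(application, consent)
print(conflicts)                                   # {'email': [0, 1]}
print(plan_renames(consent, conflicts, "consent"))  # {'email': 'consent_email'}
```

Applying the rename map to the actual fields is library-specific and is left out here.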

Privacy note: Browser-based PDF tools like those at ToolsDock PDF Tools process files locally in your browser. Nothing is uploaded to a server.

Splitting and Extracting Pages

Splitting and extracting are related but distinct: splitting produces multiple output files from one input (breaking a 100-page PDF into 10 chunks of 10), while extracting produces one output file containing specific selected pages.

Page Selection Syntax

Single page:       5         → produces page 5
Multiple pages:    1, 3, 7   → produces pages 1, 3, 7 in one PDF
Range:             10-20     → produces pages 10 through 20
Combined:          1, 3, 10-20, 25  → all of these
Reverse range:     20-10     → pages 20 through 10 (reversed order)
Last N pages:      (total-5)-(total)  → last 6 pages
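
The selection syntax above can be expanded into a concrete page list with a few lines of code. This is a sketch of one possible parser, not the parser of any particular tool; `parse_page_selection` is a hypothetical name, and it assumes 1-based page numbers and rejects pages outside the document.

```python
def parse_page_selection(spec: str, total_pages: int) -> list[int]:
    """Expand a selection such as "1, 3, 10-12" into 1-based page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            step = 1 if start <= end else -1  # "20-10" counts downward
            chunk = list(range(start, end + step, step))
        else:
            chunk = [int(part)]
        for page in chunk:
            if not 1 <= page <= total_pages:
                raise ValueError(f"page {page} is outside 1-{total_pages}")
        pages.extend(chunk)
    return pages


print(parse_page_selection("1, 3, 10-12", total_pages=30))  # [1, 3, 10, 11, 12]
print(parse_page_selection("5-3", total_pages=30))          # [5, 4, 3]

total = 30
print(parse_page_selection(f"{total - 5}-{total}", total))  # [25, 26, 27, 28, 29, 30]
```

The last call shows the "last N pages" form: the caller substitutes the page count before parsing.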

Splitting Strategies

By fixed page count: Every N pages → useful for distributing handouts or creating equal-size chunks for email attachments
By bookmarks: Split at bookmark boundaries → natural for splitting chapters from a book-style PDF
By blank pages: Split when a blank page appears → common pattern for scanned multi-document batches where blank pages were inserted as separators
By file size: Produce chunks under a size limit → for email attachment limits or upload size restrictions
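
The fixed-page-count strategy comes down to computing inclusive page ranges. A minimal sketch follows, with `split_ranges` as a hypothetical helper; the last chunk is simply shorter when the page count does not divide evenly.

```python
def split_ranges(total_pages: int, pages_per_file: int) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) page ranges, one per output file."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_file)
    ]


print(split_ranges(25, 10))        # [(1, 10), (11, 20), (21, 25)]
print(len(split_ranges(100, 10)))  # 10
```

Each range can then be handed to whatever page-extraction routine is in use.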

Reordering Pages

Some tools let you reorder pages within a PDF — drag to rearrange, then save. Under the hood, this is extracting pages in a new order and merging them. Useful for fixing a scan where pages came out of order, or for rearranging slides.

Tools: PDF Tools

Split, merge, extract pages, reorder — all browser-based with no file uploads.

Extracting Text

Text extraction from PDFs ranges from trivial to impossible depending on the PDF's structure.

Native PDFs (Text Selectable)

PDFs created from Word, LaTeX, InDesign, or any tool that generates PDF from source content contain text as actual text objects. The PDF knows each character, its position, and its Unicode value. Extraction is reliable and fast.

Scanned PDFs (Image-Based)

A scanned document is just an image inside a PDF wrapper. There's no text — only pixels. Extraction requires OCR (see the OCR section below). Without OCR, you get zero text. With OCR, you get text quality that depends heavily on scan resolution, font clarity, and language.

The In-Between Case

Some PDFs have both: a scanned image layer with an invisible text layer added by OCR software. These look like scanned PDFs but have selectable text. Extraction from these works, but the text quality depends on how well the OCR was done originally.

Text Extraction with pdfjs (JavaScript)

import * as pdfjsLib from 'pdfjs-dist';
async function extractText(pdfBytes) {
  const pdf = await pdfjsLib.getDocument({ data: pdfBytes }).promise;
  const textContent = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map(item => item.str)
      .join(' ');
    textContent.push(pageText);
  }
  return textContent.join('\n\n');
}
// Usage with a File input
const fileBytes = await file.arrayBuffer();
const text = await extractText(new Uint8Array(fileBytes));

Note that getTextContent() returns text items in roughly reading order, but PDF doesn't guarantee reading order. Multi-column layouts, tables, and complex designs may produce garbled extraction output. Post-processing or heuristics are often needed for complex layouts.
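
One simple post-processing heuristic is to group text items by baseline and then read each group left to right. The sketch below assumes each item has already been reduced to an x position, a y position, and a string (in pdf.js these would typically come from `item.transform[4]`, `item.transform[5]`, and `item.str`); `TextItem`, `to_lines`, and the tolerance value are hypothetical. It handles single-column pages only and will still interleave multi-column layouts.

```python
from typing import NamedTuple


class TextItem(NamedTuple):
    x: float   # left edge, in PDF units
    y: float   # baseline; PDF y grows upward from the bottom of the page
    text: str


def to_lines(items: list[TextItem], y_tolerance: float = 2.0) -> list[str]:
    """Group items with nearby baselines, then order each line left to right."""
    lines: list[list[TextItem]] = []
    for item in sorted(items, key=lambda it: -it.y):  # top of the page first
        if lines and abs(lines[-1][0].y - item.y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [
        " ".join(it.text for it in sorted(line, key=lambda it: it.x))
        for line in lines
    ]


items = [
    TextItem(200, 700.0, "world"),
    TextItem(72, 700.5, "Hello"),
    TextItem(72, 680.0, "Second line"),
]
print(to_lines(items))  # ['Hello world', 'Second line']
```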

Text Extraction with Python (pdfminer)

from pdfminer.high_level import extract_text
# Simple extraction
text = extract_text('document.pdf')
# With page-by-page control
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())
# pypdf for basic needs (lighter weight)
import pypdf
reader = pypdf.PdfReader('document.pdf')
for page in reader.pages:
    print(page.extract_text())

Adding Watermarks

PDF watermarks are typically implemented in one of two ways:

Stamp (foreground): Added as a transparent overlay on top of page content. Visible but content underneath is still accessible.
Watermark (background): Added behind page content. Visible through transparent areas of the page, but content overlaps it.

For security purposes, neither prevents determined extraction — a "CONFIDENTIAL" watermark doesn't protect document content. It's a deterrent and a legal marker, not a technical protection.

Adding Watermarks with pdf-lib (JavaScript)

import { PDFDocument, rgb, degrees, StandardFonts } from 'pdf-lib';
async function addWatermark(pdfBytes, text) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const fontSize = 60;
  const angle = 45;
  const textWidth = font.widthOfTextAtSize(text, fontSize);
  // Text rotates around its starting point, so move that point back along
  // the rotated baseline to keep the watermark centered on the page
  const radians = (angle * Math.PI) / 180;
  const offsetX = (textWidth / 2) * Math.cos(radians);
  const offsetY = (textWidth / 2) * Math.sin(radians);
  for (const page of pdfDoc.getPages()) {
    const { width, height } = page.getSize();
    page.drawText(text, {
      x: width / 2 - offsetX,
      y: height / 2 - offsetY,
      size: fontSize,
      font: font,
      color: rgb(0.8, 0.1, 0.1),
      opacity: 0.3,
      rotate: degrees(angle),
    });
  }
  return await pdfDoc.save();
}
// Usage
const originalBytes = await fetch('document.pdf').then(r => r.arrayBuffer());
const watermarked = await addWatermark(new Uint8Array(originalBytes), 'CONFIDENTIAL');
// watermarked is a Uint8Array of the new PDF

Watermark Tool

For no-code watermarking, use the PDF Tools — add text or image watermarks visually with opacity and position controls.

Compressing PDFs

What Compression Actually Does

PDF compression isn't one thing — it's several optimizations applied together:

Image recompression: The biggest win. A 300 DPI image meant for print is downsampled to 150 DPI for screen viewing and re-encoded. This can reduce image data by 50-90%.
Font subsetting: Instead of embedding a full font (which may contain thousands of glyphs, up to the format limit of 65,535), embed only the ~200 glyphs actually used in the document. Subsetting is common in PDF generators; compression tools can add it to PDFs that missed it.
Metadata removal: XMP metadata, document properties, and editing history add overhead, usually tens of KB and occasionally hundreds.
Content stream optimization: Deduplicate resources shared across pages, compress streams with Flate compression, and remove dead objects left by incremental updates.
Embedded thumbnail removal: Some PDF creators embed page thumbnails for preview; these are redundant for most uses.
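
The reason image downsampling dominates is arithmetic: pixel count falls with the square of the DPI ratio. The sketch below works that out for the resolutions mentioned in this guide. It counts pixels only; actual byte savings also depend on the image codec and quality setting, so treat the figures as a rough guide. `pixel_fraction_after_downsampling` is a hypothetical helper.

```python
def pixel_fraction_after_downsampling(source_dpi: float, target_dpi: float) -> float:
    """Fraction of the original pixel count left after downsampling."""
    if target_dpi >= source_dpi:
        return 1.0  # never upsample; the image is left as it is
    return (target_dpi / source_dpi) ** 2


for target in (150, 72):
    kept = pixel_fraction_after_downsampling(300, target)
    print(f"300 -> {target} DPI keeps {kept:.1%} of the pixels")
# 300 -> 150 DPI keeps 25.0% of the pixels
# 300 -> 72 DPI keeps 5.8% of the pixels
```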

Compression Levels and Tradeoffs

Low Compression

10-30% size reduction
No quality loss
Safe for print
Images stay at original DPI

Medium Compression

30-60% size reduction
Images downsampled to 150 DPI
Good for email
Fine for screen viewing

High Compression

60-90% size reduction
Images at 72 DPI or lower
Visible quality loss on photos
Screen only, not printable

Understanding Why Your PDF Is Large

Common causes and typical size impact:
Scanned pages at 300 DPI       → Each page 500 KB - 2 MB
High-res photos                → Can be 1-5 MB per image
Embedded full fonts            → 200-500 KB per font
Unoptimized vector graphics    → Variable, can be large
XMP metadata and thumbnails    → Usually small (10-100 KB)
Incremental update debris      → Can be significant after many edits

CLI Compression with Ghostscript

# Screen quality (72 DPI images, smallest size)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/screen \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf
# Ebook quality (150 DPI images, good balance)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/ebook \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf
# Printer quality (300 DPI, minimal compression)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/printer \
  -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=compressed.pdf input.pdf

Ghostscript is free, handles virtually any PDF, and is the engine behind many online PDF compressors. The /ebook setting is the best general-purpose option for web delivery.

Converting To and From PDF

To PDF

Source Format	Method	Quality Notes
Images (JPEG, PNG, TIFF, WebP)	Image to PDF	Excellent — images embedded at original quality
HTML/Web page	HTML to PDF	Good for simple layouts; complex CSS may not render perfectly
DOCX (Word document)	OS print-to-PDF or LibreOffice	Excellent — preserves formatting and fonts
PowerPoint (PPTX)	PPTX to PDF	Good; animations stripped, static slides preserved
Markdown	Via HTML (render, then print-to-PDF or Pandoc)	Depends on CSS styling

From PDF

Target Format	Method	Notes
Images (PNG, JPG)	PDF to Image	Each page becomes one image; set DPI based on use (150 screen, 300 print)
Plain text	pdfminer, pdfjs, or copy-paste	Works for native PDFs; OCR needed for scanned ones
EPUB	PDF to EPUB	Works best for text-heavy PDFs; layout-heavy PDFs convert poorly
DOCX	Adobe Acrobat, LibreOffice Draw	Quality varies; complex layouts often need manual cleanup

PDF to Image Quality Settings

DPI settings for PDF to image conversion:
72 DPI:   Web thumbnails, previews — tiny files
96 DPI:   Standard screen display
150 DPI:  Good screen quality, reasonable file size
300 DPI:  Print quality — use for images you'll print
600 DPI:  Archive quality — large files, maximum detail
For a typical A4 page (8.27 × 11.69 inches):
72 DPI  → 595 × 842 px   (~50 KB PNG)
150 DPI → 1240 × 1754 px (~180 KB PNG)
300 DPI → 2480 × 3508 px (~700 KB PNG)
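
The pixel dimensions above follow directly from page size times DPI. A short sketch of the calculation, starting from the A4 size in millimetres (210 × 297) to avoid rounding the inch values first; `pixel_size` is a hypothetical helper, and the file sizes in the table are not computed here because they depend on page content.

```python
MM_PER_INCH = 25.4
A4_MM = (210.0, 297.0)


def pixel_size(page_mm: tuple[float, float], dpi: int) -> tuple[int, int]:
    """Pixel width and height of a page rendered at the given DPI."""
    width_mm, height_mm = page_mm
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


for dpi in (72, 150, 300):
    print(dpi, pixel_size(A4_MM, dpi))
# 72 (595, 842)
# 150 (1240, 1754)
# 300 (2480, 3508)
```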

Form Filling

PDF forms come in two flavors: AcroForm (the original form technology, widely supported) and XFA (Adobe's XML Forms Architecture, used in complex forms, poorly supported in non-Adobe readers).

AcroForm Fields

Standard PDF forms use AcroForm fields — text boxes, checkboxes, radio buttons, dropdowns, signature boxes. These are embedded in the PDF and can be filled programmatically.

Filling Forms with pdf-lib (JavaScript)

import { PDFDocument } from 'pdf-lib';
async function fillForm(pdfBytes, formData) {
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const form = pdfDoc.getForm();
  // List all fields
  const fields = form.getFields();
  fields.forEach(field => console.log(field.getName(), field.constructor.name));
  // Fill text fields
  form.getTextField('full_name').setText(formData.name);
  form.getTextField('date_of_birth').setText(formData.dob);
  form.getTextField('email').setText(formData.email);
  // Check a checkbox
  if (formData.agreeToTerms) {
    form.getCheckBox('agree_checkbox').check();
  }
  // Select a radio button
  form.getRadioGroup('payment_method').select(formData.paymentMethod);
  // Select a dropdown option
  form.getDropdown('country').select(formData.country);
  // Flatten the form (makes fields non-editable, permanently bakes values in)
  form.flatten();
  return await pdfDoc.save();
}

Calling form.flatten() is important when you want to lock in the values — it converts form fields to static content so the PDF can be signed, printed, or shared without risk of values changing.

Filling Forms with Python (pypdf)

import pypdf
reader = pypdf.PdfReader('form.pdf')
writer = pypdf.PdfWriter()
# Get field names
print(reader.get_fields().keys())
# Fill fields
writer.clone_reader_document_root(reader)
writer.update_page_form_field_values(
    writer.pages[0],
    {
        'full_name': 'Jane Smith',
        'email': 'jane@example.com',
        'date': '2024-03-15',
    }
)
with open('filled_form.pdf', 'wb') as f:
    writer.write(f)

OCR on Scanned PDFs

OCR (Optical Character Recognition) converts images of text into actual text data. For a scanned PDF, OCR adds an invisible text layer positioned over the visible scan — the PDF still looks like a scan, but now text is selectable and searchable.

OCR Quality Factors

Scan resolution: 300 DPI minimum for acceptable results. 600 DPI is better, especially for small fonts.
Image quality: Skew, noise, faint text, and low contrast all degrade OCR accuracy. Pre-processing (deskew, denoise, increase contrast) improves results significantly.
Font type: Printed fonts work much better than handwriting. Decorative fonts, old typewriter text, and faded ink reduce accuracy.
Language: OCR engines need the right language model. Specify the document language explicitly when possible.

Tesseract (Open Source OCR)

# Tesseract reads images, not PDFs, so convert the pages to images first
pdftoppm -r 300 scanned.pdf page
# OCR each page image to plain text (writes one .txt file per page)
for page in page-*.ppm; do
  tesseract "$page" "${page%.ppm}" -l eng
done
# Or write a searchable one-page PDF per image, then combine them with pdfunite
for page in page-*.ppm; do
  tesseract "$page" "${page%.ppm}" -l eng pdf
done
pdfunite page-*.pdf output.pdf
# Better: use ocrmypdf (handles the full workflow)
ocrmypdf input.pdf output.pdf
ocrmypdf --deskew --clean input.pdf output.pdf  # With pre-processing
ocrmypdf -l deu input.pdf output.pdf  # German language

ocrmypdf is highly recommended — it wraps Ghostscript and Tesseract, handles multi-page PDFs, applies preprocessing, and produces a proper searchable PDF with the original appearance preserved.

Browser-Based OCR

Modern browsers can run Tesseract via WebAssembly. The PDF OCR tool processes your scanned PDF entirely in the browser — no uploads, no server-side processing.

Realistic Accuracy Expectations

Tesseract accuracy on well-scanned printed documents: 95-99%. On poor quality scans, handwriting, or degraded documents, accuracy can drop to 60-80% or lower. For critical data extraction (legal documents, financial records), always verify OCR output manually or use a validation step.
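
A validation step can be as simple as comparing OCR output against a manually verified sample and flagging pages that differ too much. The sketch below uses the standard library's `difflib` similarity ratio as a stand-in for character accuracy; the two are related but not identical measures. `character_similarity`, `needs_review`, the threshold, and the sample strings are all hypothetical.

```python
from difflib import SequenceMatcher


def character_similarity(ocr_text: str, reference_text: str) -> float:
    """Similarity between 0.0 and 1.0; 1.0 means the strings are identical."""
    return SequenceMatcher(None, ocr_text, reference_text, autojunk=False).ratio()


def needs_review(ocr_text: str, reference_text: str, threshold: float = 0.98) -> bool:
    """Flag output whose similarity to the verified sample is below the threshold."""
    return character_similarity(ocr_text, reference_text) < threshold


reference = "Invoice total: 1,250.00"
ocr_output = "Invoice total: l,25O.OO"  # digits misread as letters

print(needs_review(reference, reference))   # False
print(needs_review(ocr_output, reference))  # True
```

The misread digits in this sample are exactly the kind of error that matters for financial records, which is why a spot check on known values is worth the effort.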

JavaScript PDF Libraries

pdf-lib — Create and Modify PDFs

pdf-lib is the go-to library for creating new PDFs or modifying existing ones in JavaScript. It runs in both browser and Node.js environments, has no native dependencies, and handles most common tasks well.

import { PDFDocument, StandardFonts, rgb } from 'pdf-lib';
// Create a new PDF
const pdfDoc = await PDFDocument.create();
const page = pdfDoc.addPage([600, 800]);
const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
page.drawText('Hello, PDF!', {
  x: 50,
  y: 750,
  size: 30,
  font: font,
  color: rgb(0, 0, 0),
});
// Embed an image
const jpgBytes = await fetch('photo.jpg').then(r => r.arrayBuffer());
const jpgImage = await pdfDoc.embedJpg(jpgBytes);
page.drawImage(jpgImage, { x: 50, y: 400, width: 300, height: 200 });
// Load an existing PDF and copy its first page into this document
// (existingPdfBytes is assumed to hold the bytes of another PDF)
const existing = await PDFDocument.load(existingPdfBytes);
const [copiedPage] = await pdfDoc.copyPages(existing, [0]);
pdfDoc.addPage(copiedPage);
// Serialize once all changes have been made
const pdfBytes = await pdfDoc.save();

PDF.js — Render and Read PDFs

Mozilla's pdfjs-dist is the standard for rendering PDFs in the browser. It renders pages to canvas elements and provides text content extraction.

import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = 'pdfjs-dist/build/pdf.worker.min.js';
const pdf = await pdfjsLib.getDocument({ data: pdfBytes }).promise;
// Render page to canvas
const page = await pdf.getPage(1);
const viewport = page.getViewport({ scale: 1.5 });
const canvas = document.getElementById('pdf-canvas');
const ctx = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;
await page.render({ canvasContext: ctx, viewport }).promise;
// Extract text
const textContent = await page.getTextContent();
const strings = textContent.items.map(item => item.str);
console.log(strings.join(' '));

Choosing Between Libraries

Need	Library
Create new PDFs from scratch	pdf-lib
Modify existing PDFs (watermark, merge, fill forms)	pdf-lib
Render PDF pages for display	pdfjs-dist
Extract text from PDFs	pdfjs-dist
Advanced PDF manipulation in Node.js	muhammara (maintained fork of the now-unmaintained hummus), pdfmake, Puppeteer
Generate PDF from HTML	Puppeteer + Chromium, wkhtmltopdf

Metadata and Privacy

PDFs can contain a surprising amount of metadata that reveals information about the document's origin and history. Before sharing a PDF externally, especially for anonymous submissions or sensitive communications, check what's embedded.

Common Metadata Fields

Author: The name of the person who created the document (often pulled from the OS user account)
Creator/Producer: The software used — "Microsoft Word", "Adobe Acrobat", "LibreOffice"
Creation and modification dates: When the file was first created and last modified
Title and Subject: Document properties that may contain internal project names
Keywords: Searchable metadata terms
XMP metadata: Extended metadata that can include revision history and more

Reading Metadata with Python

import pypdf
reader = pypdf.PdfReader('document.pdf')
info = reader.metadata
print(f"Author: {info.author}")
print(f"Creator: {info.creator}")
print(f"Producer: {info.producer}")
print(f"Created: {info.creation_date}")
print(f"Modified: {info.modification_date}")

Removing Metadata

# With ExifTool (handles PDF metadata well)
# Note: ExifTool appends an incremental update, so the old values stay recoverable
# until the file is rewritten (for example by the Ghostscript command below)
exiftool -all:all= document.pdf -o clean.pdf
# With Ghostscript (strips most metadata)
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
  -dFastWebView=true \
  -sOutputFile=clean.pdf \
  document.pdf
# With Python (pypdf)
import pypdf
reader = pypdf.PdfReader('document.pdf')
writer = pypdf.PdfWriter()
writer.clone_reader_document_root(reader)
writer.add_metadata({
    '/Author': '',
    '/Creator': '',
    '/Producer': '',
    '/Title': '',
})
with open('clean.pdf', 'wb') as f:
    writer.write(f)

The browser-based PDF Metadata Scrubber handles this without any code — upload, scrub, download.

Batch Workflows

Batch Compress a Directory of PDFs

#!/bin/bash
# Compress all PDFs in current directory
for f in *.pdf; do
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
    -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dQUIET -dBATCH \
    -sOutputFile="compressed_${f}" "${f}"
  echo "Compressed: ${f} ($(du -h "${f}" | cut -f1) → $(du -h "compressed_${f}" | cut -f1))"
done

Batch OCR with ocrmypdf

# OCR all scanned PDFs in a directory
mkdir -p ./searchable
find ./scanned -name "*.pdf" | while IFS= read -r f; do
  output="./searchable/$(basename "$f")"
  ocrmypdf --deskew --clean "$f" "$output" && echo "Done: $f"
done
# Parallel processing with GNU parallel
find ./scanned -name "*.pdf" | \
  parallel ocrmypdf --deskew {} ./searchable/{/}

Batch Merge (Multiple PDFs per Subfolder)

import { PDFDocument } from 'pdf-lib';
import fs from 'fs';
import path from 'path';
async function mergeDirectory(dirPath, outputPath) {
  // Sort by file name so the merge order is predictable
  const files = fs.readdirSync(dirPath)
    .filter(f => f.endsWith('.pdf'))
    .sort()
    .map(f => path.join(dirPath, f));
  const merged = await PDFDocument.create();
  for (const filePath of files) {
    const bytes = fs.readFileSync(filePath);
    const doc = await PDFDocument.load(bytes);
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach(page => merged.addPage(page));
  }
  fs.writeFileSync(outputPath, await merged.save());
  console.log(`Merged ${files.length} PDFs → ${outputPath}`);
}
await mergeDirectory('./invoices', './invoices-combined.pdf');

PDF Tools

PDF Tools — merge, split, compress, convert, OCR, watermark, and more