Document Format Conversion: What Actually Happens Under the Hood

Every document format makes different tradeoffs. DOCX is a ZIP of XML. PDF is a fixed canvas described in PostScript-derived syntax. EPUB is HTML with a spine. Understanding what each format actually is—not just what it's used for—explains why some conversions are lossless and others aren't.

Format Internals

DOCX: A ZIP of XML

DOCX (and XLSX, PPTX) are ZIP archives. Rename a .docx to .zip, extract it, and you'll find directories of XML files. The main document text is in word/document.xml. Styles are in word/styles.xml. Images live in word/media/. Relationships between files are mapped in _rels/ directories.

document.docx/
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── word/
│   ├── document.xml          ← Main content
│   ├── styles.xml            ← Style definitions
│   ├── settings.xml          ← Document settings
│   ├── fontTable.xml         ← Font references
│   ├── numbering.xml         ← List numbering
│   ├── media/
│   │   └── image1.png        ← Embedded images
│   └── _rels/
│       └── document.xml.rels ← Content relationships
└── docProps/
    ├── core.xml              ← Author, dates, title
    └── app.xml               ← Word version, company

This means a DOCX is programmatically readable and writable without Word—libraries like python-docx, docx4j, or Pandoc process the XML directly. It also means your metadata (author name, revision count, edit time) is sitting in plaintext inside the ZIP, readable by anyone who opens it.
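
To make the metadata point concrete, here is a minimal sketch that reads docProps/core.xml using only the Python standard library. The function name read_core_properties and the choice of fields are illustrative assumptions; a real document may omit any of these fields, in which case the value is None.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside docProps/core.xml of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

def read_core_properties(docx):
    """Return selected metadata from a DOCX path or binary file-like object.

    Hypothetical helper for illustration; missing fields come back as None.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    fields = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "revision": "cp:revision",
        "created": "dcterms:created",
    }
    return {
        name: root.findtext(path, default=None, namespaces=NS)
        for name, path in fields.items()
    }
```

Calling read_core_properties("document.docx") on a file saved by Word typically returns the author name and revision count without Word being installed, which is exactly why this metadata is worth checking before you share a file.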

PDF: A Fixed Canvas

PDF describes a document as a sequence of drawing operations on a fixed-size canvas. Text is rendered by specifying font, position, and glyph sequence—not as a semantic "paragraph" or "heading." Images are embedded as raster or vector data. Pages are self-contained; there's no concept of text flowing between pages during rendering.

This fixed-layout model is PDF's strength (identical appearance everywhere) and its weakness (painful to edit, difficult to reflow for different screen sizes, accessibility requires extra metadata layers). When you "edit" a PDF in Acrobat, you're either overlaying new content or, in more sophisticated editing, manipulating the content streams directly—which is why it's slow and sometimes produces odd results.

PDF can contain text as actual text (selectable, searchable, copy-pasteable) or as images of text (from scanning or certain printing workflows). The latter requires OCR to become searchable. A PDF being "searchable" means the text content is present in the file's data structure, not just visible as pixels.

ODT: The Open Standard

ODT (OpenDocument Text) is also a ZIP of XML, standardized as ISO/IEC 26300. The content is in content.xml, styles in styles.xml. It's structurally similar to DOCX but uses different XML namespaces and element names. LibreOffice is its native application; it's also supported by Google Docs and most modern word processors.

RTF: Plain Text with Formatting Codes

RTF (Rich Text Format, 1987) is literally a text file containing embedded control words. Open any .rtf file in a text editor and you'll see things like {\rtf1\ansi\deff0{\fonttbl{\f0\froman Times New Roman;}}}. It's verbose and supports less formatting than DOCX, but almost any word processor released since the late 1980s can read it. RTF is the right format when you genuinely don't know what software the recipient has.
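
Because RTF is plain text, a valid file can be produced with string handling alone. The sketch below is an illustration, not a full RTF writer: the helper names rtf_escape and minimal_rtf are made up for this example, and it only handles the three special characters, line breaks, and non-ASCII text (written as \uN? escapes with signed 16-bit values).

```python
def rtf_escape(text: str) -> str:
    """Escape plain text for use inside an RTF group (illustrative helper)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)          # control characters must be escaped
        elif ch == "\n":
            out.append("\\par ")           # paragraph break
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit each UTF-16 code unit as a signed 16-bit \uN escape,
            # followed by "?" as the fallback character for old readers.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append(f"\\u{unit}?")
    return "".join(out)

def minimal_rtf(text: str) -> str:
    """Wrap escaped text in the smallest useful RTF document."""
    header = r"{\rtf1\ansi\deff0{\fonttbl{\f0\froman Times New Roman;}}"
    return header + r"\f0 " + rtf_escape(text) + "}"
```

Writing minimal_rtf("Hello") to a file with an .rtf extension should open in any word processor, which shows how little machinery the format needs.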

EPUB: HTML with a Spine

EPUB 3 is a ZIP containing XHTML files, CSS stylesheets, images, and a package document (conventionally named package.opf or content.opf) that lists the files and their reading order. Open an EPUB with a ZIP extractor and you'll find recognizable HTML. The "spine" in the package document defines which files constitute the chapters and in what order. EPUB is reflowable by design: the HTML adapts to the reading device's screen size, which is why the same file works on phones, tablets, and dedicated e-ink devices.
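
The spine lookup can be sketched in a few lines of standard-library Python. META-INF/container.xml points at the package document, the manifest maps ids to file paths, and the spine lists those ids in reading order. The function name reading_order is an assumption for this example, and error handling for malformed EPUBs is left out.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def reading_order(epub):
    """Return the content files of an EPUB in spine (reading) order."""
    with zipfile.ZipFile(epub) as zf:
        # Step 1: container.xml names the package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Step 2: the manifest maps item ids to hrefs relative to the package document.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    # Step 3: the spine lists item ids in reading order.
    base = posixpath.dirname(opf_path)
    return [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
```

Note that the order comes from the spine, not from file names or from the order of entries in the ZIP.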

LaTeX: Source Code for Documents

LaTeX is a typesetting system: you write source code that compiles to PDF (or DVI). Unlike the formats above, LaTeX is not a finished file format; it's a programming language for documents. The compiler (pdflatex, xelatex, lualatex) interprets the source and resolves cross-references over repeated runs, with a separate tool (bibtex or biber) handling the bibliography, before producing the final output. One reason LaTeX typography looks so good is that it chooses line breaks by optimizing over entire paragraphs, not line by line.

Conversion Gotchas

Fonts

DOCX references fonts by name. If the converting system doesn't have the font, it substitutes another—often with different metrics, causing text to reflow, headings to change size, and page breaks to move. PDF embeds fonts (if the creator didn't disable this), so the recipient always sees the intended font. When converting DOCX to PDF, always embed all fonts to prevent this.
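
Before converting, it can help to list which fonts a DOCX refers to and compare them with what the converting machine has. The sketch below reads word/fontTable.xml; the helper names are illustrative, the set of installed fonts is something you would supply yourself (it is an input here, not detected), and the font table may list fonts that the text never actually uses.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def referenced_fonts(docx):
    """Return the font names listed in word/fontTable.xml, sorted."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/fontTable.xml"))
    names = {font.get(f"{{{W_NS}}}name") for font in root.iter(f"{{{W_NS}}}font")}
    return sorted(name for name in names if name)

def missing_fonts(docx, installed):
    """Return referenced fonts that are absent from the caller-supplied set."""
    installed_lower = {name.lower() for name in installed}
    return [
        name for name in referenced_fonts(docx)
        if name.lower() not in installed_lower
    ]
```

Any name returned by missing_fonts is a candidate for substitution during conversion, and therefore for reflowed text and shifted page breaks.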

Tables with merged cells

Complex table structures (rowspan, colspan) convert unreliably between formats. HTML tables, DOCX tables, ODT tables, and LaTeX tables all model spanning differently. Pandoc handles many cases but flattens others. If your document has complex tables, test the conversion explicitly and expect manual fixes.

Tracked changes

DOCX tracked changes are stored as a parallel set of "deletion" and "insertion" runs in the XML. When converting to HTML or Markdown, tracked changes are typically either accepted (showing only the final state) or stripped. There's no standard way to represent tracked changes in most other formats. If preserving edit history matters, keep the DOCX.
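
A rough sketch of what "accepting" tracked changes means at the XML level: inserted runs sit inside w:ins elements, deleted runs sit inside w:del elements with their text in w:delText, so reading only w:t text yields the accepted state. The function name and return shape are assumptions for illustration, and the sketch ignores moves and formatting changes.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(tag):
    """Build a namespaced WordprocessingML tag name."""
    return f"{{{W_NS}}}{tag}"

def summarize_tracked_changes(document_xml):
    """Summarize insertions and deletions in the contents of word/document.xml."""
    root = ET.fromstring(document_xml)

    def joined(element, tag):
        return "".join(node.text or "" for node in element.iter(w(tag)))

    inserted = [joined(ins, "t") for ins in root.iter(w("ins"))]
    deleted = [joined(dele, "delText") for dele in root.iter(w("del"))]
    # Deleted text lives in w:delText, so collecting w:t alone gives the
    # text as it reads once every change has been accepted.
    accepted = [joined(p, "t") for p in root.iter(w("p"))]
    return {
        "inserted": [s for s in inserted if s],
        "deleted": [s for s in deleted if s],
        "accepted_paragraphs": accepted,
    }
```

A converter that "accepts" changes effectively keeps accepted_paragraphs and throws the rest away, which is why the edit history cannot be recovered from the converted file.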

Headers and footers

Page-layout concepts like headers, footers, page numbers, and margin content don't map to HTML or Markdown—these formats have no concept of pages. Converting DOCX to HTML or Markdown silently drops headers and footers. Converting HTML to PDF via a tool like WeasyPrint requires CSS page rules (@page) to add them back.

Embedded objects

OLE objects (embedded Excel charts, Visio diagrams) in DOCX either convert to static images or are dropped entirely. SVG in EPUB converts reasonably to HTML. Embedded videos in DOCX become placeholders in PDF.

PDF to anything

PDF-to-text conversion quality depends entirely on the source. Text-based PDFs (made from Word, LaTeX, or InDesign) extract well. PDFs from scanned paper or fax require OCR. PDFs with multi-column layouts often produce garbled text order because naive extraction reads straight across the page, line by line, instead of finishing one column before starting the next. Tables are hard to recover because text positions are stored spatially, not as table cells.
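
The multi-column problem is easiest to see with a toy model. The sketch below assumes the text fragments and their positions have already been extracted by some PDF library, that y is measured from the top of the page, and that the column boundary is known; all three are simplifying assumptions, and real extractors have to infer the columns.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    x: float   # horizontal position of the fragment
    y: float   # distance from the top of the page (assumed convention)
    text: str

def naive_order(fragments):
    """Read straight across the page: top to bottom, then left to right."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]

def column_order(fragments, column_boundary):
    """Finish the left column before starting the right one."""
    def key(f):
        column = 0 if f.x < column_boundary else 1
        return (column, f.y, f.x)
    return [f.text for f in sorted(fragments, key=key)]

page = [
    Fragment(50, 100, "Left line 1"),
    Fragment(320, 100, "Right line 1"),
    Fragment(50, 120, "Left line 2"),
    Fragment(320, 120, "Right line 2"),
]
```

With this page, naive_order interleaves the two columns while column_order(page, 300) keeps each column together, which is the difference between garbled and readable output.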

What Survives Conversion (and What Doesn't)

Element | Survival rate | Notes
Paragraphs and text | Excellent | Core content survives almost everywhere
Heading hierarchy (H1–H6) | Excellent | If original uses real heading styles, not manual formatting
Bulleted and numbered lists | Excellent | If using built-in list styles
Bold, italic, underline | Excellent | Universal across all formats
Hyperlinks | Good | URLs survive; anchor targets sometimes lost
Inline images | Good | Raster images usually survive; vector may rasterize
Simple tables | Good | Basic rows/columns convert; spanning cells often don't
Code blocks | Good | Monospace preserved; syntax highlighting often lost
Footnotes/endnotes | Moderate | Many converters support; some formats lack the concept
Custom fonts | Poor | Substituted unless embedded in PDF
Complex table structures | Poor | Spans, merged cells, nested tables break frequently
Headers and footers | Poor | Lost when converting to HTML/Markdown (no page concept)
Track changes | Poor | Typically accepted or discarded during conversion
Embedded media (video, audio) | Poor | Usually dropped or replaced with placeholder
Page layout (columns, text boxes) | Very poor | Fundamentally incompatible with reflowable formats
Form fields | Very poor | Interactive forms require format-specific support

The semantic markup principle: Everything that uses built-in semantic styles (Heading 1, Body Text, List Bullet) converts significantly better than content that was manually formatted to look the same. A line formatted as Heading 1 has semantic meaning that converters recognize. A line that's just "Arial, 18pt, bold" is invisible to the converter's structure detection.
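
The difference is visible in the XML. A real heading carries a w:pStyle reference in its paragraph properties; a lookalike only has run-level formatting such as w:b (bold) and w:sz (size in half-points). The sketch below flags both cases. The 14pt threshold, the function names, and the assumption that heading style ids start with "Heading" (true for English-language Word, not guaranteed elsewhere) are all illustrative choices.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(tag):
    """Build a namespaced WordprocessingML tag name."""
    return f"{{{W_NS}}}{tag}"

def looks_like_manual_heading(paragraph, min_half_points=28):
    """Heuristic: bold text at 14pt or larger (w:sz is in half-points)."""
    for props in paragraph.iter(w("rPr")):
        size = props.find(w("sz"))
        if props.find(w("b")) is not None and size is not None:
            if int(size.get(w("val"), "0")) >= min_half_points:
                return True
    return False

def classify_paragraphs(document_xml):
    """Label each paragraph in the contents of word/document.xml."""
    root = ET.fromstring(document_xml)
    results = []
    for p in root.iter(w("p")):
        text = "".join(t.text or "" for t in p.iter(w("t")))
        style = p.find(f"{w('pPr')}/{w('pStyle')}")
        style_id = style.get(w("val"), "") if style is not None else ""
        if style_id.lower().startswith("heading"):
            kind = "semantic heading"
        elif looks_like_manual_heading(p):
            kind = "manual formatting only"
        else:
            kind = "body"
        results.append((text, kind))
    return results
```

Paragraphs labeled "manual formatting only" are the ones a converter will most likely flatten into ordinary body text.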

Pandoc: The Swiss Army Knife

Pandoc handles conversions between 40+ formats. It's open-source, command-line, scriptable, and available on all platforms. For most conversion tasks, it's the right answer.

Installation

brew install pandoc         # macOS
sudo apt install pandoc    # Debian/Ubuntu
choco install pandoc       # Windows (Chocolatey)
# Or download from https://pandoc.org/installing.html

Common conversions

# Format is inferred from file extension when possible
pandoc input.md -o output.html
pandoc input.md -o output.docx
pandoc input.md -o output.pdf    # Requires LaTeX or WeasyPrint
pandoc input.docx -o output.md
pandoc input.html -o output.md

# Explicit format specification
pandoc input.txt -f markdown -t html5 -o output.html

# Standalone HTML (with head, html, body tags)
pandoc input.md -s -o output.html

# With table of contents
pandoc input.md --toc --toc-depth=2 -o output.html

# Apply a CSS stylesheet to HTML output
pandoc input.md -s --css=styles.css -o output.html

# Use a Word template for styles in DOCX output
pandoc input.md -o output.docx --reference-doc=template.docx

# With bibliography (--citeproc renders the citations; required since Pandoc 2.11)
pandoc paper.md --citeproc --bibliography=refs.bib --csl=apa.csl -o paper.docx

Key options

Option | Purpose
-s, --standalone | Produce a complete document (with HTML head/body, not just a fragment)
--toc | Generate a table of contents from headings
--reference-doc=file.docx | Use a DOCX template's styles for the output
--citeproc | Render citations and the reference list (Pandoc 2.11 and later)
--bibliography=file.bib | Bibliography file to draw citations from (use with --citeproc)
--csl=style.csl | Citation style (APA, MLA, IEEE, Chicago, etc.)
--highlight-style=tango | Syntax highlighting theme for code blocks (see pandoc --list-highlight-styles)
--pdf-engine=xelatex | LaTeX engine for PDF generation (handles Unicode better than pdflatex)
--wrap=none | Don't rewrap text (useful for Markdown output)
-M title="My Doc" | Set metadata (title, author, date, etc.)

Real examples

# Resume: Markdown → PDF with Eisvogel LaTeX template
pandoc resume.md -o resume.pdf \
  --template=eisvogel.latex \
  --pdf-engine=xelatex

# Academic paper: LaTeX → DOCX with bibliography
pandoc paper.tex -o paper.docx \
  --citeproc \
  --bibliography=references.bib \
  --csl=ieee.csl

# Ebook: multiple Markdown chapters → EPUB
pandoc title.md ch01.md ch02.md ch03.md -o book.epub \
  --toc \
  --epub-cover-image=cover.jpg \
  -M title="My Book" \
  -M author="Author Name"

# Migrate HTML docs to Markdown
pandoc docs.html -f html -t markdown --wrap=none -o docs.md

LibreOffice Headless

LibreOffice has a headless mode (no GUI) that converts documents the way LibreOffice actually renders them—which is closer to how a human user would save them. This makes it better than Pandoc for some conversions, particularly DOCX to PDF with complex formatting, and DOCX ↔ ODT.

# Convert DOCX to PDF (headless mode)
libreoffice --headless --convert-to pdf document.docx

# Convert to a specific directory
libreoffice --headless --convert-to pdf --outdir /output/ document.docx

# Batch convert all DOCX in current directory
libreoffice --headless --convert-to pdf *.docx

# Convert DOCX to ODT
libreoffice --headless --convert-to odt document.docx

# Convert ODT to DOCX
libreoffice --headless --convert-to docx document.odt

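# In a Docker container (useful for server-side conversion).
# The stock ubuntu image does not include LibreOffice, so this installs it first;
# in practice, build your own image with LibreOffice preinstalled.
docker run --rm -e DEBIAN_FRONTEND=noninteractive -v "$(pwd)":/data ubuntu:22.04 \
  bash -c "apt-get update && apt-get install -y libreoffice-writer && \
    libreoffice --headless --convert-to pdf --outdir /data /data/document.docx"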

LibreOffice headless is the standard approach for server-side DOCX-to-PDF conversion when you need close fidelity to the original formatting. It's used by many document management systems, PDF generation services, and document preview APIs.

When to use LibreOffice vs Pandoc

Task | Better tool | Why
DOCX → PDF (fidelity) | LibreOffice | Renders fonts and layout like the actual application
Markdown → DOCX | Pandoc | LibreOffice doesn't read Markdown
DOCX → Markdown | Pandoc | LibreOffice doesn't write Markdown
DOCX ↔ ODT | LibreOffice | Both are native formats; conversion is higher fidelity
Markdown → EPUB | Pandoc | Pandoc's EPUB output with metadata support is excellent
LaTeX → PDF | pdflatex/xelatex | LaTeX requires its own compiler; Pandoc calls it internally
Citation management | Pandoc | Built-in BibTeX and CSL support
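
In a conversion script, the table above can be encoded as a small routing function. This is one possible sketch: the ROUTES mapping and the build_command name are illustrative, it covers only the file-to-file rows of the table, and it builds the command line without running it.

```python
from pathlib import Path

# (source extension, target extension) -> preferred tool, following the table above.
ROUTES = {
    ("docx", "pdf"): "libreoffice",
    ("docx", "odt"): "libreoffice",
    ("odt", "docx"): "libreoffice",
    ("md", "docx"): "pandoc",
    ("docx", "md"): "pandoc",
    ("md", "epub"): "pandoc",
}

def build_command(src, dst_ext, outdir="."):
    """Return the argument list for converting src to dst_ext (hypothetical helper)."""
    src = Path(src)
    src_ext = src.suffix.lstrip(".").lower()
    tool = ROUTES.get((src_ext, dst_ext))
    if tool is None:
        raise ValueError(f"no route from .{src_ext} to .{dst_ext}")
    if tool == "libreoffice":
        # LibreOffice picks the output name itself; only the directory is given.
        return ["libreoffice", "--headless", "--convert-to", dst_ext,
                "--outdir", str(outdir), str(src)]
    out = Path(outdir) / src.with_suffix("." + dst_ext).name
    return ["pandoc", str(src), "-o", str(out)]
```

The returned list can be passed directly to subprocess.run, which keeps the routing decision separate from execution and easy to test.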

Browser-Based WASM Conversion

Several document processing libraries have been compiled to WebAssembly, enabling document conversion in the browser without uploading files to a server. This is how ToolsDock's converters work: Pandoc, LibreOffice, and other tools are compiled to WASM and run locally.

The privacy advantage is significant: a 50MB DOCX containing a confidential contract never leaves your machine. The browser processes it locally using the same code that would run on a server.

What's available as WASM

  • Pandoc WASM: the full Pandoc converter running in the browser; powers most of ToolsDock's markup and document converters
  • pdf-lib: a pure JavaScript library (no WASM involved) for reading and writing PDFs; good for merging, splitting, and metadata editing
  • PDFium WASM: Google's PDF renderer compiled to WASM; handles rendering and extraction
  • MuPDF WASM: lightweight PDF/XPS renderer; used for PDF preview and processing
  • mammoth.js: DOCX to HTML converter written in JavaScript for browsers and Node.js; good for simple DOCX reading

WASM converters have limitations: large files can be slow or fail outright because a 32-bit WASM module can address at most 4 GB of memory (and browsers often allow less), very complex formatting may be handled differently than by the server-side counterpart, and some library features aren't exposed in the WASM build.

Batch Workflows

Linux / macOS shell script

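#!/bin/bash
# Convert all Markdown files in ./docs to PDF
shopt -s nullglob   # an empty docs/ gives zero iterations instead of a literal "*.md"
INPUT_DIR="./docs"
OUTPUT_DIR="./output/pdf"
mkdir -p "$OUTPUT_DIR"

count=0
for file in "$INPUT_DIR"/*.md; do
  filename="${file##*/}"
  name="${filename%.md}"
  echo "Converting: $filename"
  # Count only conversions that succeed; report failures on stderr
  if pandoc "$file" -o "$OUTPUT_DIR/$name.pdf" --toc --highlight-style=tango; then
    count=$((count + 1))
  else
    echo "Failed: $filename" >&2
  fi
done
echo "Done. $count files converted."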

Makefile (parallel builds)

SOURCES := $(wildcard docs/*.md)
PDFS    := $(SOURCES:docs/%.md=output/%.pdf)
HTMLS   := $(SOURCES:docs/%.md=output/%.html)

.PHONY: all
all: $(PDFS) $(HTMLS)

# Each rule creates output/ on demand so a fresh checkout builds without setup
output/%.pdf: docs/%.md
	@mkdir -p $(@D)
	pandoc "$<" -o "$@" --toc

output/%.html: docs/%.md
	@mkdir -p $(@D)
	pandoc "$<" -s -o "$@" --toc

# Run with: make -j4  (parallel, 4 jobs)

Windows PowerShell

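# Convert all DOCX to PDF using LibreOffice
$files = Get-ChildItem -Path ".\docs" -Filter "*.docx"
foreach ($file in $files) {
  Write-Host "Converting: $($file.Name)"
  # soffice.exe is a GUI program, so PowerShell may not wait for it to exit;
  # piping to Out-Null makes each conversion finish before the next one starts
  & "C:\Program Files\LibreOffice\program\soffice.exe" `
    --headless --convert-to pdf `
    --outdir ".\output" $file.FullName | Out-Null
}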

# Convert all Markdown to DOCX with Pandoc
$mdFiles = Get-ChildItem -Filter "*.md"
foreach ($f in $mdFiles) {
  $out = $f.BaseName + ".docx"
  pandoc $f.Name -o $out --reference-doc=template.docx
  Write-Host "Created: $out"
}

Python batch conversion

# Using subprocess to call Pandoc
import subprocess
from pathlib import Path

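def batch_convert(source_dir, output_dir, from_ext, to_ext, extra_args=None):
    """Convert every matching file under source_dir (recursively) with Pandoc.

    Outputs are written flat into output_dir, so same-named files from
    different subdirectories overwrite each other.
    """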
    source_path = Path(source_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    extra_args = extra_args or []

    files = list(source_path.glob(f'**/*.{from_ext}'))
    print(f'Found {len(files)} .{from_ext} files')

    for file in files:
        out = output_path / file.with_suffix(f'.{to_ext}').name
        cmd = ['pandoc', str(file), '-o', str(out)] + extra_args
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(f'OK: {file.name} → {out.name}')
        else:
            print(f'FAIL: {file.name}')
            print(result.stderr)

# Convert all Markdown to DOCX
batch_convert('./docs', './output', 'md', 'docx',
              ['--reference-doc=template.docx', '--toc'])

Academic Writing Workflows

The LaTeX → DOCX problem

Many journals want DOCX submissions even from authors who write in LaTeX. Pandoc converts the core content well. Mathematics is the sticking point: Pandoc writes equations as native Word math (OMML), which stays editable in Word, but it understands only a subset of TeX. The --mathml and --webtex options are meant for HTML-style output and do not change DOCX output. Complex custom LaTeX macros and packages often need manual fixes.

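# LaTeX → DOCX with bibliography (best approach)
pandoc paper.tex \
  --citeproc \
  --bibliography=references.bib \
  --csl=ieee.csl \
  --reference-doc=journal-template.docx \
  -o paper.docx

# Equations are written as native Word math (OMML) automatically;
# --mathml only affects HTML-family and EPUB output, not DOCX.
# If an equation comes through as raw TeX, simplify it or expand its custom macros in the source.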

# After conversion, manually review:
# - Equation numbering
# - Figure and table labels
# - Citation formatting
# - Any custom commands that Pandoc doesn't recognize

Collaborative Markdown → LaTeX pipeline

Write in Markdown for easy Git collaboration, generate DOCX for non-LaTeX collaborators, and produce the final PDF via LaTeX for submission quality.

# Draft phase: Markdown → DOCX for co-authors
pandoc manuscript.md -o manuscript_draft.docx \
  --citeproc --bibliography=refs.bib --csl=apa.csl

# Submission phase: Markdown → LaTeX → PDF
# -s writes a complete LaTeX document; --natbib emits \cite commands for BibTeX to resolve
pandoc manuscript.md -s --natbib -o manuscript.tex \
  --bibliography=refs.bib
pdflatex manuscript.tex
bibtex manuscript
pdflatex manuscript.tex
pdflatex manuscript.tex  # Run twice to resolve references

Ebook Creation

Recommended workflow

Write in Markdown → convert to EPUB with Pandoc → validate with EPUBCheck → upload to distributor. For Kindle, upload the EPUB to KDP; Amazon handles the conversion to their format.

# Single file → EPUB
pandoc book.md -o book.epub \
  --toc --toc-depth=2 \
  --epub-cover-image=cover.jpg \
  -M title="Book Title" \
  -M author="Author Name" \
  -M lang=en-US

# Multiple chapters → EPUB
pandoc \
  front-matter.md \
  ch01-introduction.md \
  ch02-getting-started.md \
  ch03-advanced.md \
  appendix.md \
  -o book.epub \
  --toc \
  --epub-cover-image=cover.jpg \
  -M title="Book Title"

# Validate with EPUBCheck (requires Java)
java -jar epubcheck.jar book.epub

EPUB best practices

  • Use heading styles consistently—H1 for chapters, H2 for sections—since ereaders generate navigation from headings
  • Keep images under 300KB each; ereaders have limited memory
  • 72–96 DPI is sufficient for ereader screens
  • Remove manual page breaks—EPUB is reflowable and page breaks are device-specific
  • Test on at least one physical Kindle, one Kobo, and one phone (Kindle/Apple Books app)
  • Use relative font sizes (em, rem) not pixels in your CSS
  • Include the lang attribute—screen readers use it for text-to-speech language selection
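
Some of these checks are easy to automate because an EPUB is just a ZIP. As a minimal sketch, the function below lists images above the 300KB guideline; the name oversized_images and the list of extensions are illustrative assumptions, and the size reported is the uncompressed size of each entry.

```python
import zipfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")

def oversized_images(epub, limit_bytes=300 * 1024):
    """Return (filename, size) pairs for images larger than limit_bytes."""
    with zipfile.ZipFile(epub) as zf:
        return sorted(
            (info.filename, info.file_size)
            for info in zf.infolist()
            if info.filename.lower().endswith(IMAGE_EXTENSIONS)
            and info.file_size > limit_bytes
        )
```

Running this after each Pandoc build gives an early warning before the book reaches EPUBCheck or a physical device.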

Accessibility in Document Formats

Accessible documents are well-structured documents. The same practices that make a document convert cleanly also make it screen-reader friendly.

PDF accessibility

PDFs are inherently challenging for accessibility because they're fixed-layout. Tagged PDF (the basis of the PDF/UA standard) adds structure tags that screen readers use to understand reading order. When generating PDF with Pandoc, --pdf-engine=xelatex gives reliable Unicode text, but LaTeX engines do not emit structure tags by default, so check the result with a PDF accessibility checker if PDF/UA conformance matters. Add alt text to images in the source document before converting.

# Pandoc to accessible PDF via LaTeX
pandoc input.md -o output.pdf \
  --pdf-engine=xelatex \
  -V colorlinks=true  # Visible link indicators

# Images in Markdown (alt text is accessible description):
![A bar chart showing quarterly revenue growth from Q1 to Q4 2025](revenue.png)

DOCX accessibility checklist

  1. Use built-in Heading styles (Heading 1, 2, 3)—not manually formatted text
  2. Add alt text to every image (right-click → Edit Alt Text in Word)
  3. Use tables for data, not layout, and mark the first row as a header row (Table Design → Header Row)
  4. Write descriptive link text ("View quarterly report" not "click here")
  5. Ensure 4.5:1 contrast ratio minimum for all text
  6. Don't use color alone to convey meaning
  7. Run Word's Accessibility Checker before exporting (Review → Check Accessibility)
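
The 4.5:1 figure in item 5 comes from the WCAG contrast formula, which can be computed directly from two colors. The sketch below follows the WCAG 2.x definition of relative luminance; the function names are illustrative, and colors are assumed to be six-digit hex strings.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light, per WCAG 2.x."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color):
    """Relative luminance of a color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)

def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)
```

For example, contrast_ratio("#767676", "#FFFFFF") comes out just above 4.5 while "#777777" on white falls just below it, so small differences in gray text can decide whether a document passes.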

EPUB accessibility

# Pandoc EPUB with accessibility metadata
pandoc book.md -o book.epub \
  -M lang=en \
  --epub-metadata=metadata.xml

# metadata.xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:language>en</dc:language>
  <meta property="schema:accessMode">textual</meta>
  <meta property="schema:accessibilityFeature">tableOfContents</meta>
  <meta property="schema:accessibilityFeature">structuralNavigation</meta>
  <meta property="schema:accessibilityFeature">alternativeText</meta>
</metadata>

Last updated: March 2026. Document conversions on ToolsDock run in your browser using Pandoc and other tools compiled to WebAssembly. No files are uploaded to any server.