Document Format Conversion: What Actually Happens Under the Hood

Every document format makes different tradeoffs. DOCX is a ZIP of XML. PDF is a fixed canvas described in PostScript-derived syntax. EPUB is HTML with a spine. Understanding what each format actually is—not just what it's used for—explains why some conversions are lossless and others aren't.

Format Internals

DOCX: A ZIP of XML

DOCX (and XLSX, PPTX) are ZIP archives. Rename a .docx to .zip, extract it, and you'll find directories of XML files. The main document text is in word/document.xml. Styles are in word/styles.xml. Images live in word/media/. Relationships between files are mapped in _rels/ directories.

document.docx/
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── word/
│   ├── document.xml          ← Main content
│   ├── styles.xml            ← Style definitions
│   ├── settings.xml          ← Document settings
│   ├── fontTable.xml         ← Font references
│   ├── numbering.xml         ← List numbering
│   ├── media/
│   │   └── image1.png        ← Embedded images
│   └── _rels/
│       └── document.xml.rels ← Content relationships
└── docProps/
    ├── core.xml              ← Author, dates, title
    └── app.xml               ← Word version, company

This means a DOCX is programmatically readable and writable without Word—libraries like python-docx, docx4j, or Pandoc process the XML directly. It also means your metadata (author name, revision count, edit time) is sitting in plaintext inside the ZIP, readable by anyone who opens it.
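To make the "plaintext metadata" point concrete, here is a minimal sketch that reads docProps/core.xml with nothing but the Python standard library. The function name `read_core_properties` and the sample values are illustrative assumptions, and the archive is built in memory so the sketch runs without a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used inside docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def read_core_properties(docx_file):
    """Return selected metadata fields from a DOCX (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))

    def text(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "creator": text("dc:creator"),
        "last_modified_by": text("cp:lastModifiedBy"),
        "revision": text("cp:revision"),
    }

# Stand-in archive with made-up values (hypothetical author, revision 7).
CORE_XML = (
    "<cp:coreProperties "
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>Jane Example</dc:creator>"
    "<cp:lastModifiedBy>Jane Example</cp:lastModifiedBy>"
    "<cp:revision>7</cp:revision>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", CORE_XML)
buffer.seek(0)

props = read_core_properties(buffer)
print(props)
```

Pointing `read_core_properties` at a real .docx path should work the same way, since `zipfile.ZipFile` accepts either a path or a file-like object.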

PDF: A Fixed Canvas

PDF describes a document as a sequence of drawing operations on a fixed-size canvas. Text is rendered by specifying font, position, and glyph sequence—not as a semantic "paragraph" or "heading." Images are embedded as raster or vector data. Pages are self-contained; there's no concept of text flowing between pages during rendering.

This fixed-layout model is PDF's strength (identical appearance everywhere) and its weakness (painful to edit, difficult to reflow for different screen sizes, accessibility requires extra metadata layers). When you "edit" a PDF in Acrobat, you're either overlaying new content or, in more sophisticated editing, manipulating the content streams directly—which is why it's slow and sometimes produces odd results.

PDF can contain text as actual text (selectable, searchable, copy-pasteable) or as images of text (from scanning or certain printing workflows). The latter requires OCR to become searchable. A PDF being "searchable" means the text content is present in the file's data structure, not just visible as pixels.

ODT: The Open Standard

ODT (OpenDocument Text) is also a ZIP of XML, standardized as ISO/IEC 26300. The content is in content.xml and the styles are in styles.xml. It's structurally similar to DOCX but uses different XML namespaces and element names. LibreOffice is its native application; it's also supported by Google Docs and most modern word processors.

RTF: Plain Text with Formatting Codes

RTF (Rich Text Format, 1987) is literally a text file containing embedded control words. Open any .rtf file in a text editor and you'll see things like {\rtf1\ansi\deff0{\fonttbl{\f0\froman Times New Roman;}}}. It's verbose and supports less formatting than DOCX, but almost any word processor from the last 35 years can read it. RTF is the right format when you genuinely don't know what software the recipient has.
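The control-word structure is simple enough to pick apart with a regular expression. The sketch below is a toy tokenizer written for illustration, not a full RTF parser: it recognizes control words, group braces, and literal text, and it silently skips control symbols such as \'e9.

```python
import re

# One alternative per token type. A control word is a backslash, letters,
# an optional numeric parameter, and an optional single space delimiter.
TOKEN = re.compile(
    r"\\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)? ?"
    r"|(?P<group>[{}])"
    r"|(?P<text>[^\\{}]+)"
)

def tokenize_rtf(source):
    """Split RTF source into (kind, value, parameter) tuples."""
    tokens = []
    for match in TOKEN.finditer(source):
        if match.group("word"):
            param = int(match.group("param")) if match.group("param") else None
            tokens.append(("control", match.group("word"), param))
        elif match.group("group"):
            tokens.append(("group", match.group("group"), None))
        else:
            tokens.append(("text", match.group("text"), None))
    return tokens

sample = r"{\rtf1\ansi\deff0{\fonttbl{\f0\froman Times New Roman;}}}"
tokens = tokenize_rtf(sample)
for token in tokens:
    print(token)
```

Everything except the font name turns out to be markup, which is why RTF files are so much larger than the text they carry.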

EPUB: HTML with a Spine

EPUB 3 is a ZIP containing XHTML files, CSS stylesheets, images, and a package document (conventionally named package.opf or content.opf) that describes the reading order. Open an EPUB with a ZIP extractor and you'll find recognizable HTML. The "spine" in the package document defines which content files constitute the chapters and in what order. EPUB is reflowable by design: the HTML adapts to the reading device's screen size, which is why the same file reads well on phones, tablets, and dedicated e-ink devices.
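The manifest-plus-spine arrangement is easy to see in code. This is a minimal sketch using only the standard library; `reading_order` is a hypothetical helper name, and the tiny in-memory archive (with an assumed EPUB/package.opf layout) stands in for a real book.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def reading_order(epub_file):
    """Return content file paths in spine order."""
    with zipfile.ZipFile(epub_file) as archive:
        # META-INF/container.xml points at the package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", CONTAINER_NS).get("full-path")
        package = ET.fromstring(archive.read(opf_path))
    # The manifest maps ids to files; the spine lists ids in reading order.
    manifest = {item.get("id"): item.get("href")
                for item in package.findall("opf:manifest/opf:item", OPF_NS)}
    base = posixpath.dirname(opf_path)
    return [posixpath.join(base, manifest[ref.get("idref")])
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)]

CONTAINER_XML = (
    '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">'
    '<rootfiles><rootfile full-path="EPUB/package.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles></container>'
)
# The manifest deliberately lists chapter 2 first: only the spine sets the order.
PACKAGE_OPF = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
    "<manifest>"
    '<item id="c2" href="ch02.xhtml" media-type="application/xhtml+xml"/>'
    '<item id="c1" href="ch01.xhtml" media-type="application/xhtml+xml"/>'
    "</manifest>"
    '<spine><itemref idref="c1"/><itemref idref="c2"/></spine>'
    "</package>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("META-INF/container.xml", CONTAINER_XML)
    archive.writestr("EPUB/package.opf", PACKAGE_OPF)
buffer.seek(0)

chapters = reading_order(buffer)
print(chapters)
```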

LaTeX: Source Code for Documents

LaTeX is a typesetting system: you write source code that compiles to PDF (or DVI). Unlike the formats above, LaTeX is not a finished file format; it's a programming language for documents. The compiler (pdflatex, xelatex, lualatex) interprets the source and resolves cross-references over repeated runs, while a separate tool (bibtex or biber) handles the bibliography pass. LaTeX's typography is good largely because it chooses line breaks by optimizing over the entire paragraph, not line by line.
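The difference between line-by-line and whole-paragraph breaking shows up even in a toy model. The sketch below is not TeX's actual algorithm (which also weighs stretchable spaces, hyphenation, and penalties); it is a simplified dynamic program that minimizes the sum of squared trailing space on every line except the last, compared against a greedy first-fit breaker. It assumes monospaced text.

```python
def greedy_breaks(words, width):
    """Fill each line as far as possible before moving on (line by line)."""
    lines, current = [], []
    for word in words:
        candidate = current + [word]
        if current and len(" ".join(candidate)) > width:
            lines.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        lines.append(" ".join(current))
    return lines

def optimal_breaks(words, width):
    """Choose breaks that minimize total squared slack over the paragraph."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i] = minimum cost of setting words[i:]
    next_break = [n] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width and j > i:
                break  # line is full; an overlong single word still gets its own line
            cost = 0 if j == n - 1 else (width - length) ** 2  # last line is free
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                next_break[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:next_break[i]]))
        i = next_break[i]
    return lines

def raggedness(lines, width):
    """Sum of squared trailing space, ignoring the final line."""
    return sum((width - len(line)) ** 2 for line in lines[:-1])

words = "aaa bb cc ddddd".split()
greedy = greedy_breaks(words, 6)
optimal = optimal_breaks(words, 6)
print(greedy, raggedness(greedy, 6))
print(optimal, raggedness(optimal, 6))
```

The greedy breaker fills the first line perfectly and then strands "cc" on a line by itself; the whole-paragraph version accepts a slightly shorter first line to get a more even result overall.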

Conversion Gotchas

Fonts

DOCX references fonts by name. If the converting system doesn't have the font, it substitutes another, often with different metrics, causing text to reflow, headings to change size, and page breaks to move. PDF embeds fonts (if the creator didn't disable this), so the recipient always sees the intended font. When converting DOCX to PDF, run the conversion on a machine that has the document's fonts installed and make sure font embedding is enabled; embedding cannot help if the substitution already happened during conversion.
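One way to catch substitution before it happens is to compare the fonts a DOCX references against what the converting machine has. The sketch below reads word/fontTable.xml; the helper names and the `installed` set are assumptions for illustration, and how you obtain the real list of installed fonts is platform-specific and not shown.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def referenced_fonts(docx_file):
    """Return the font names listed in word/fontTable.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/fontTable.xml"))
    return sorted(font.get(f"{{{W_NS}}}name") for font in root.findall(f"{{{W_NS}}}font"))

def missing_fonts(docx_file, installed):
    """Return referenced fonts that are absent from the installed set."""
    installed_lower = {name.lower() for name in installed}
    return [name for name in referenced_fonts(docx_file)
            if name.lower() not in installed_lower]

# Stand-in archive with a two-entry font table (made-up example data).
FONT_TABLE = (
    f'<w:fonts xmlns:w="{W_NS}">'
    '<w:font w:name="Calibri"/><w:font w:name="Garamond"/>'
    "</w:fonts>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/fontTable.xml", FONT_TABLE)
buffer.seek(0)

# Hypothetical list of fonts available on the converting machine.
missing = missing_fonts(buffer, installed={"calibri", "Arial"})
print(missing)
```

The font table can list fonts that no visible text uses, so treat the result as a warning list rather than proof of a problem.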

Tables with merged cells

Complex table structures (rowspan, colspan) convert unreliably between formats. HTML tables, DOCX tables, ODT tables, and LaTeX tables all model spanning differently. Pandoc handles many cases but flattens others. If your document has complex tables, test the conversion explicitly and expect manual fixes.

Tracked changes

DOCX tracked changes are stored as a parallel set of "deletion" and "insertion" runs in the XML. When converting to HTML or Markdown, tracked changes are typically either accepted (showing only the final state) or stripped. There's no standard way to represent tracked changes in most other formats. If preserving edit history matters, keep the DOCX.
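The insertion and deletion runs are ordinary XML elements (w:ins and w:del, with deleted text stored in w:delText), so "accepting" or "rejecting" changes amounts to choosing which runs to keep. A minimal sketch over a hand-written paragraph, assuming this simplified structure:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

def paragraph_text(paragraph, accept_changes=True):
    """Return paragraph text with tracked changes accepted or rejected."""
    parts = []

    def walk(node, inside_insertion):
        for child in node:
            if child.tag == W + "ins":
                walk(child, True)
            elif child.tag == W + "del":
                if not accept_changes:
                    walk(child, inside_insertion)  # rejecting: restore deleted runs
            elif child.tag == W + "t":
                if accept_changes or not inside_insertion:
                    parts.append(child.text or "")
            elif child.tag == W + "delText":
                parts.append(child.text or "")  # only reached when rejecting
            else:
                walk(child, inside_insertion)

    walk(paragraph, False)
    return "".join(parts)

# Hand-written sample: "Monday" was deleted and "Friday" inserted.
PARAGRAPH = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t xml:space="preserve">The deadline is </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>Monday</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>Friday</w:t></w:r></w:ins>'
    "<w:r><w:t>.</w:t></w:r>"
    "</w:p>"
)
paragraph = ET.fromstring(PARAGRAPH)
accepted = paragraph_text(paragraph, accept_changes=True)
rejected = paragraph_text(paragraph, accept_changes=False)
print(accepted)
print(rejected)
```

A converter targeting Markdown or HTML has to pick one of these two strings, which is why the edit history does not survive.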

Headers and footers

Page-layout concepts like headers, footers, page numbers, and margin content don't map to HTML or Markdown—these formats have no concept of pages. Converting DOCX to HTML or Markdown silently drops headers and footers. Converting HTML to PDF via a tool like WeasyPrint requires CSS page rules (@page) to add them back.

Embedded objects

OLE objects (embedded Excel charts, Visio diagrams) in DOCX either convert to static images or are dropped entirely. SVG in EPUB converts reasonably to HTML. Embedded videos in DOCX become placeholders in PDF.

PDF to anything

PDF-to-text conversion quality depends entirely on the source. Text-based PDFs (made from Word, LaTeX, or InDesign) extract well. PDFs from scanned paper or fax require OCR. PDFs with multi-column layouts often produce garbled text order because columns are extracted left-to-right on the page, not column by column. PDF tables become difficult—the text positions are stored spatially, not as table cells.
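The multi-column problem comes down to sort order. In this sketch the fragments and their coordinates are invented, y grows downward, and the gutter position is passed in by hand; real extraction tools have to detect columns themselves, which is the hard part.

```python
# Each fragment is (x, y, text), as a PDF stores it: positioned, not structured.
fragments = [
    (300, 100, "Right column, line 1."),
    (50, 120, "Left column, line 2."),
    (50, 100, "Left column, line 1."),
    (300, 120, "Right column, line 2."),
]

def naive_order(frags):
    """Read across the full page width: top to bottom, then left to right."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_order(frags, gutter_x):
    """Read the whole left column first, then the right column."""
    return [text for _, _, text in
            sorted(frags, key=lambda f: (f[0] >= gutter_x, f[1], f[0]))]

naive = naive_order(fragments)
by_column = column_order(fragments, gutter_x=200)
print(naive)
print(by_column)
```

The naive order interleaves the two columns line by line, which is exactly the garbled output described above.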

What Survives Conversion (and What Doesn't)

Element	Survival rate	Notes
Paragraphs and text	Excellent	Core content survives almost everywhere
Heading hierarchy (H1–H6)	Excellent	If original uses real heading styles, not manual formatting
Bulleted and numbered lists	Excellent	If using built-in list styles
Bold, italic, underline	Excellent	Bold and italic are universal; underline has no native Markdown syntax
Hyperlinks	Good	URLs survive; anchor targets sometimes lost
Inline images	Good	Raster images usually survive; vector may rasterize
Simple tables	Good	Basic rows/columns convert; spanning cells often don't
Code blocks	Good	Monospace preserved; syntax highlighting often lost
Footnotes/endnotes	Moderate	Many converters support; some formats lack the concept
Custom fonts	Poor	Substituted unless embedded in PDF
Complex table structures	Poor	Spans, merged cells, nested tables break frequently
Headers and footers	Poor	Lost when converting to HTML/Markdown (no page concept)
Tracked changes	Poor	Typically accepted or discarded during conversion
Embedded media (video, audio)	Poor	Usually dropped or replaced with placeholder
Page layout (columns, text boxes)	Very poor	Fundamentally incompatible with reflowable formats
Form fields	Very poor	Interactive forms require format-specific support

The semantic markup principle: Everything that uses built-in semantic styles (Heading 1, Body Text, List Bullet) converts significantly better than content that was manually formatted to look the same. A line formatted as Heading 1 has semantic meaning that converters recognize. A line that's just "Arial, 18pt, bold" is invisible to the converter's structure detection.
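Here is roughly what a converter's structure detection sees. The sketch builds an outline from paragraph style references in document.xml; it assumes the default English style IDs (Heading1, Heading2, and so on), which can differ in localized or customized templates.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
HEADING_STYLE = re.compile(r"Heading(\d)$")

def outline(body):
    """Return (level, text) for every paragraph that uses a heading style."""
    headings = []
    for paragraph in body.iter(W + "p"):
        style = paragraph.find(f"{W}pPr/{W}pStyle")
        match = HEADING_STYLE.match(style.get(W + "val")) if style is not None else None
        if match:
            text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
            headings.append((int(match.group(1)), text))
    return headings

# Two paragraphs that look alike on screen. The first uses the Heading 1 style;
# the second is body text made bold at 18pt (w:sz counts half-points).
BODY = (
    f'<w:body xmlns:w="{W_NS}">'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Results</w:t></w:r></w:p>'
    '<w:p><w:r><w:rPr><w:b/><w:sz w:val="36"/></w:rPr><w:t>Discussion</w:t></w:r></w:p>'
    "</w:body>"
)
found = outline(ET.fromstring(BODY))
print(found)
```

Only "Results" makes it into the outline. "Discussion" carries formatting but no style reference, so nothing marks it as a heading.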

Pandoc: The Swiss Army Knife

Pandoc handles conversions between 40+ formats. It's open-source, command-line, scriptable, and available on all platforms. For most conversion tasks, it's the right answer.

Installation

brew install pandoc         # macOS
sudo apt install pandoc    # Debian/Ubuntu
choco install pandoc       # Windows (Chocolatey)
# Or download from https://pandoc.org/installing.html

Common conversions

# Format is inferred from file extension when possible
pandoc input.md -o output.html
pandoc input.md -o output.docx
pandoc input.md -o output.pdf    # Requires a PDF engine: LaTeX by default, or e.g. --pdf-engine=weasyprint
pandoc input.docx -o output.md
pandoc input.html -o output.md

# Explicit format specification
pandoc input.txt -f markdown -t html5 -o output.html

# Standalone HTML (with head, html, body tags)
pandoc input.md -s -o output.html

# With table of contents (--toc only takes effect with -s/--standalone)
pandoc input.md -s --toc --toc-depth=2 -o output.html

# Apply a CSS stylesheet to HTML output
pandoc input.md -s --css=styles.css -o output.html

# Use a Word template for styles in DOCX output
pandoc input.md -o output.docx --reference-doc=template.docx

# With bibliography (--citeproc renders the citations; required since Pandoc 2.11)
pandoc paper.md --citeproc --bibliography=refs.bib --csl=apa.csl -o paper.docx

Key options

Option	Purpose
`-s, --standalone`	Produce a complete document (with HTML head/body, not just a fragment)
`--toc`	Generate a table of contents from headings
`--reference-doc=file.docx`	Use a DOCX template's styles for the output
`--bibliography=file.bib`	Bibliography file (BibTeX and others); combine with `--citeproc` to render citations
`--csl=style.csl`	Citation style (APA, MLA, IEEE, Chicago, etc.)
`--highlight-style=tango`	Syntax highlighting theme for code blocks (see `pandoc --list-highlight-styles`)
`--pdf-engine=xelatex`	LaTeX engine for PDF generation (handles Unicode better than pdflatex)
`--wrap=none`	Don't rewrap text (useful for Markdown output)
`-M title="My Doc"`	Set metadata (title, author, date, etc.)

Real examples

# Resume: Markdown → PDF with Eisvogel LaTeX template
pandoc resume.md -o resume.pdf \
  --template=eisvogel.latex \
  --pdf-engine=xelatex

# Academic paper: LaTeX → DOCX with bibliography
pandoc paper.tex -o paper.docx \
  --citeproc \
  --bibliography=references.bib \
  --csl=ieee.csl

# Ebook: multiple Markdown chapters → EPUB
pandoc title.md ch01.md ch02.md ch03.md -o book.epub \
  --toc \
  --epub-cover-image=cover.jpg \
  -M title="My Book" \
  -M author="Author Name"

# Migrate HTML docs to Markdown
pandoc docs.html -f html -t markdown --wrap=none -o docs.md

LibreOffice Headless

LibreOffice has a headless mode (no GUI) that converts documents the way LibreOffice actually renders them—which is closer to how a human user would save them. This makes it better than Pandoc for some conversions, particularly DOCX to PDF with complex formatting, and DOCX ↔ ODT.

# Convert DOCX to PDF (headless mode)
libreoffice --headless --convert-to pdf document.docx

# Convert to a specific directory
libreoffice --headless --convert-to pdf --outdir /output/ document.docx

# Batch convert all DOCX in current directory
libreoffice --headless --convert-to pdf *.docx

# Convert DOCX to ODT
libreoffice --headless --convert-to odt document.docx

# Convert ODT to DOCX
libreoffice --headless --convert-to docx document.odt

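# In a Docker container (useful for server-side conversion).
# The stock ubuntu:22.04 image does not ship LibreOffice, so install it first;
# in practice, bake this into your own image instead of installing on every run.
docker run --rm -v "$(pwd)":/data ubuntu:22.04 bash -c \
  "apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y libreoffice-writer && \
   libreoffice --headless --convert-to pdf --outdir /data /data/document.docx"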

LibreOffice headless is the standard approach for server-side DOCX-to-PDF conversion when you need close fidelity to the original formatting. It's used by many document management systems, PDF generation services, and document preview APIs.
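For server-side use you would typically wrap that command in a small function with a timeout, since a stuck conversion should not hang a request forever. A minimal sketch, assuming LibreOffice is on the PATH as `soffice` or `libreoffice`; the function names and the 120-second default are illustrative choices, not part of any LibreOffice API.

```python
import shutil
import subprocess
from pathlib import Path

def build_convert_command(source, out_dir, target="pdf", binary="soffice"):
    """Assemble the headless conversion command as an argument list."""
    return [binary, "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(source)]

def convert(source, out_dir, target="pdf", timeout=120):
    """Run the conversion and return the expected output path."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice not found on PATH")
    command = build_convert_command(source, out_dir, target, binary)
    # check=True raises on a non-zero exit; timeout kills a hung conversion.
    subprocess.run(command, check=True, capture_output=True, timeout=timeout)
    return Path(out_dir) / f"{Path(source).stem}.{target}"

# Show the command without running it, so the sketch works without LibreOffice.
command = build_convert_command("report.docx", "out")
print(" ".join(command))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with filenames that contain spaces.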

When to use LibreOffice vs Pandoc

Task	Better tool	Why
DOCX → PDF (fidelity)	LibreOffice	Renders fonts and layout like the actual application
Markdown → DOCX	Pandoc	LibreOffice doesn't read Markdown
DOCX → Markdown	Pandoc	LibreOffice doesn't write Markdown
DOCX ↔ ODT	LibreOffice	Both are native formats; conversion is higher fidelity
Markdown → EPUB	Pandoc	Pandoc's EPUB output with metadata support is excellent
LaTeX → PDF	pdflatex/xelatex	LaTeX requires its own compiler; Pandoc calls it internally
Citation management	Pandoc	Built-in BibTeX and CSL support

Browser-Based WASM Conversion

Several document processing libraries have been compiled to WebAssembly, enabling document conversion in the browser without uploading files to a server. This is how ToolsDock's converters work—Pandoc, LibreOffice, and other tools compiled to WASM and run locally.

The privacy advantage is significant: a 50MB DOCX containing a confidential contract never leaves your machine. The browser processes it locally using the same code that would run on a server.

What's available as WASM

Pandoc WASM — the full Pandoc converter running in the browser; powers most of ToolsDock's markup and document converters
pdf-lib — read and write PDFs in JavaScript, good for merging, splitting, and metadata editing
PDFium WASM — Google's PDF renderer compiled to WASM; handles rendering and extraction
MuPDF WASM — lightweight PDF/XPS renderer; used for PDF preview and processing
mammoth.js — DOCX to HTML converter written natively for browsers; good for simple DOCX reading

WASM converters have limitations: large files are slow because WASM memory is limited, very complex formatting may be handled differently than the server-side counterpart, and some library features aren't exposed in the WASM build.

Batch Workflows

Linux / macOS shell script

#!/bin/bash
# Convert all Markdown files in INPUT_DIR to PDF
INPUT_DIR="./docs"
OUTPUT_DIR="./output/pdf"
mkdir -p "$OUTPUT_DIR"

shopt -s nullglob   # an empty docs/ directory yields zero iterations, not a literal "*.md"
converted=0

for file in "$INPUT_DIR"/*.md; do
  basename="${file##*/}"
  name="${basename%.md}"
  echo "Converting: $basename"
  if pandoc "$file" -o "$OUTPUT_DIR/$name.pdf" \
    --toc \
    --highlight-style=tango; then
    converted=$((converted + 1))
  else
    echo "Failed: $basename" >&2
  fi
done
echo "Done. $converted files converted."

Makefile (parallel builds)

SOURCES := $(wildcard docs/*.md)
PDFS    := $(SOURCES:docs/%.md=output/%.pdf)
HTMLS   := $(SOURCES:docs/%.md=output/%.html)

.PHONY: all
all: $(PDFS) $(HTMLS)

# Each rule creates output/ on demand so a fresh checkout builds
output/%.pdf: docs/%.md
	@mkdir -p $(@D)
	pandoc "$<" -o "$@" --toc

output/%.html: docs/%.md
	@mkdir -p $(@D)
	pandoc "$<" -s -o "$@" --toc

# Run with: make -j4  (parallel, 4 jobs)

Windows PowerShell

# Convert all DOCX to PDF using LibreOffice
$files = Get-ChildItem -Path ".\docs" -Filter "*.docx"
foreach ($file in $files) {
  Write-Host "Converting: $($file.Name)"
  & "C:\Program Files\LibreOffice\program\soffice.exe" `
    --headless --convert-to pdf `
    --outdir ".\output" $file.FullName
}

# Convert all Markdown to DOCX with Pandoc
$mdFiles = Get-ChildItem -Filter "*.md"
foreach ($f in $mdFiles) {
  $out = $f.BaseName + ".docx"
  pandoc $f.Name -o $out --reference-doc=template.docx
  Write-Host "Created: $out"
}

Python batch conversion

# Using subprocess to call Pandoc
import subprocess
from pathlib import Path

def batch_convert(source_dir, output_dir, from_ext, to_ext, extra_args=None):
    source_path = Path(source_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    extra_args = extra_args or []

    files = list(source_path.glob(f'**/*.{from_ext}'))
    print(f'Found {len(files)} .{from_ext} files')

    for file in files:
        out = output_path / file.with_suffix(f'.{to_ext}').name
        cmd = ['pandoc', str(file), '-o', str(out)] + extra_args
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(f'OK: {file.name} → {out.name}')
        else:
            print(f'FAIL: {file.name}')
            print(result.stderr)

# Convert all Markdown to DOCX
batch_convert('./docs', './output', 'md', 'docx',
              ['--reference-doc=template.docx', '--toc'])

Academic Writing Workflows

The LaTeX → DOCX problem

Many journals want DOCX submissions even from authors who write in LaTeX. Pandoc converts the core content well. Mathematics is the sticking point: for DOCX output, Pandoc writes equations as native Word equations (OMML), and constructs it cannot translate are left as raw TeX text. The --mathml and --webtex options apply to HTML-family output rather than DOCX (--webtex renders formulas as images through an external service). Complex custom LaTeX macros and packages often need manual fixes.

# LaTeX → DOCX with bibliography (best approach)
pandoc paper.tex \
  --citeproc \
  --bibliography=references.bib \
  --csl=ieee.csl \
  --reference-doc=journal-template.docx \
  -o paper.docx

# If equations break in Word, inspect them in an HTML rendering with MathML
pandoc paper.tex -s --mathml -o paper.html

# After conversion, manually review:
# - Equation numbering
# - Figure and table labels
# - Citation formatting
# - Any custom commands that Pandoc doesn't recognize

Collaborative Markdown → LaTeX pipeline

Write in Markdown for easy Git collaboration, generate DOCX for non-LaTeX collaborators, and produce the final PDF via LaTeX for submission quality.

# Draft phase: Markdown → DOCX for co-authors
pandoc manuscript.md -o manuscript_draft.docx \
  --citeproc --bibliography=refs.bib --csl=apa.csl

# Submission phase: Markdown → LaTeX → PDF
# -s writes a complete document; --natbib emits \cite commands for BibTeX
pandoc manuscript.md -s -o manuscript.tex \
  --natbib --bibliography=refs.bib
pdflatex manuscript.tex
bibtex manuscript
pdflatex manuscript.tex
pdflatex manuscript.tex  # Run twice to resolve references

Ebook Creation

Recommended workflow

Write in Markdown → convert to EPUB with Pandoc → validate with EPUBCheck → upload to distributor. For Kindle, upload the EPUB to KDP; Amazon handles the conversion to their format.

# Single file → EPUB
pandoc book.md -o book.epub \
  --toc --toc-depth=2 \
  --epub-cover-image=cover.jpg \
  -M title="Book Title" \
  -M author="Author Name" \
  -M lang=en-US

# Multiple chapters → EPUB
pandoc \
  front-matter.md \
  ch01-introduction.md \
  ch02-getting-started.md \
  ch03-advanced.md \
  appendix.md \
  -o book.epub \
  --toc \
  --epub-cover-image=cover.jpg \
  -M title="Book Title"

# Validate with EPUBCheck (requires Java)
java -jar epubcheck.jar book.epub

EPUB best practices

Use heading styles consistently—H1 for chapters, H2 for sections—since ereaders generate navigation from headings
Keep images under 300KB each; ereaders have limited memory
72–96 DPI is sufficient for ereader screens
Remove manual page breaks—EPUB is reflowable and page breaks are device-specific
Test on at least one physical Kindle, one Kobo, and one phone (Kindle/Apple Books app)
Use relative font sizes (em, rem) not pixels in your CSS
Include the lang attribute—screen readers use it for text-to-speech language selection
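Some of these checks are easy to automate before upload. The sketch below flags images above the 300KB guideline by reading the ZIP directory; the helper name, extension list, and sample archive are assumptions for illustration, and it complements EPUBCheck rather than replacing it.

```python
import io
import zipfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")

def oversized_images(epub_file, limit_bytes=300 * 1024):
    """Return (name, size) for images whose uncompressed size exceeds the limit."""
    with zipfile.ZipFile(epub_file) as archive:
        return [(info.filename, info.file_size)
                for info in archive.infolist()
                if info.filename.lower().endswith(IMAGE_EXTENSIONS)
                and info.file_size > limit_bytes]

# Stand-in EPUB with one small and one large placeholder image.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("EPUB/images/small.png", b"\x00" * 1024)
    archive.writestr("EPUB/images/big.jpg", b"\x00" * (400 * 1024))
    archive.writestr("EPUB/ch01.xhtml", "<html/>")
buffer.seek(0)

too_big = oversized_images(buffer)
print(too_big)
```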

Accessibility in Document Formats

Accessible documents are well-structured documents. The same practices that make a document convert cleanly also make it screen-reader friendly.

PDF accessibility

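PDFs are inherently challenging for accessibility because they're fixed-layout. Tagged PDF adds structure tags that screen readers use to determine reading order, and the PDF/UA standard (ISO 14289) requires them. When generating PDF with Pandoc, --pdf-engine=xelatex handles Unicode text and system fonts better than pdflatex, but LaTeX engines do not write structure tags by default, so check the result with an accessibility checker rather than assuming it is tagged. Add alt text to images in the source document before converting.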

# Pandoc to accessible PDF via LaTeX
pandoc input.md -o output.pdf \
  --pdf-engine=xelatex \
  -V colorlinks=true  # Visible link indicators

# Images in Markdown (alt text is accessible description):
![A bar chart showing quarterly revenue growth from Q1 to Q4 2025](revenue.png)

DOCX accessibility checklist

Use built-in Heading styles (Heading 1, 2, 3)—not manually formatted text
Add alt text to every image (right-click → Edit Alt Text in Word)
Use tables for data, not layout, and mark the first row as a header row (scope attributes are an HTML concept and don't apply in Word)
Write descriptive link text ("View quarterly report" not "click here")
Ensure 4.5:1 contrast ratio minimum for all text
Don't use color alone to convey meaning
Run Word's Accessibility Checker before exporting (Review → Check Accessibility)
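The 4.5:1 figure comes from the WCAG contrast formula, which you can compute directly when picking text colors. A short sketch of that formula (relative luminance of each color, then the ratio of the lighter to the darker); the function names are illustrative.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light, per the WCAG definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color):
    """Relative luminance of a color given as '#rrggbb'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)

for color in ("#000000", "#767676", "#777777"):
    ratio = contrast_ratio(color, "#ffffff")
    verdict = "pass" if ratio >= 4.5 else "fail"
    print(f"{color} on white: {ratio:.2f}:1 ({verdict})")
```

On a white background, #767676 is about the lightest gray that still clears 4.5:1; one step lighter and it falls short.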

EPUB accessibility

# Pandoc EPUB with accessibility metadata
pandoc book.md -o book.epub \
  -M lang=en \
  --epub-metadata=metadata.xml

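# metadata.xml (Pandoc's manual describes this file as a series of Dublin Core elements; check the generated package file to confirm the wrapper and schema: entries carry through)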
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:language>en</dc:language>
  <meta property="schema:accessMode">textual</meta>
  <meta property="schema:accessibilityFeature">tableOfContents</meta>
  <meta property="schema:accessibilityFeature">structuralNavigation</meta>
  <meta property="schema:accessibilityFeature">alternativeText</meta>
</metadata>

Frequently Asked Questions

What is document conversion?

Converting a document from one format to another while preserving as much structure, formatting, and content as possible. "As much as possible" is doing a lot of work in that sentence: format conversion always involves some loss because formats represent content differently. Understanding what survives and what doesn't is the point of this guide.

When should I send a PDF instead of a DOCX?

When you need to guarantee the recipient sees exactly what you created: same fonts, same layout, same pagination. PDF is a fixed-layout format; DOCX is not. A DOCX opened on a system with different fonts or a different Word version will reflow and look different. Send PDFs for contracts, invoices, reports, and resumes.

Which format should I use for academic writing?

LaTeX for anything with substantial mathematics, complex citations, or journal-specific formatting requirements. Markdown or DOCX with Zotero for collaborative work where co-authors won't use LaTeX. Git + LaTeX or Git + Markdown gives you version control. Word alone gives you tracked changes and comments but poor version history.

How do I get the cleanest conversion results?

Use semantic markup (actual heading styles such as Heading 1, 2, 3, real lists, emphasis for bold and italic) rather than manual formatting that looks the same but has no semantic meaning. Converters understand headings; they can't understand "large bold text that visually looks like a heading." Test with a representative sample early.

What is Pandoc, and why is it so widely used?

Pandoc is an open-source command-line tool that converts between 40+ document formats. It's the de facto standard because it covers so many formats, handles bibliography and citations via BibTeX/CSL, supports templates and custom output styling, and is scriptable for batch workflows. Nothing else comes close in breadth.

Can I convert a PDF back to an editable document?

It depends entirely on what generated the PDF. PDFs from Word documents retain structure and convert well. PDFs from print layouts or desktop publishing tools may convert poorly: columns get merged, text order gets scrambled, tables become text. Scanned PDFs require OCR first and are always messy. Set realistic expectations.

What is the best format for ebooks?

EPUB is the universal standard supported by all major ereaders except Kindle. Start with Markdown or clean HTML, convert with Pandoc, validate with EPUBCheck. For Kindle, upload the EPUB directly to Kindle Direct Publishing; Amazon converts it automatically. Don't try to create MOBI manually in 2026.

How do I batch-convert many files?

Command line with Pandoc and a shell script or Makefile. For Windows (cmd): for %f in (*.md) do pandoc "%f" -o "%~nf.pdf" (inside a .bat file, double the percent signs: %%f). For Linux/macOS: for f in *.md; do pandoc "$f" -o "${f%.md}.pdf"; done. With a Makefile, run make -j4 to build targets in parallel.

What is the difference between RTF and DOCX?

RTF (Rich Text Format, 1987) is plain text with embedded formatting codes, readable in any text editor and supported by virtually every word processor. DOCX (2007) is a ZIP of XML files with full support for tracked changes, comments, styles, and complex formatting. RTF for maximum compatibility with old systems; DOCX for everything modern.

Should I use DOCX or ODT?

DOCX if your collaborators use Microsoft Office. ODT if they use LibreOffice or you care about open standards (ODT is ISO/IEC 26300). Each converts to the other, but complex formatting sometimes degrades. Google Docs exports both. For teams using mixed software, DOCX has fewer surprises in practice.