Document Format Conversion: What Actually Happens Under the Hood
Every document format makes different tradeoffs. DOCX is a ZIP of XML. PDF is a fixed canvas described in PostScript-derived syntax. EPUB is HTML with a spine. Understanding what each format actually is—not just what it's used for—explains why some conversions are lossless and others aren't.
Format Internals
DOCX: A ZIP of XML
DOCX (and XLSX, PPTX) are ZIP archives. Rename a .docx to .zip, extract it, and you'll find directories of XML files. The main document text is in word/document.xml. Styles are in word/styles.xml. Images live in word/media/. Relationships between files are mapped in _rels/ directories.
document.docx/
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── word/
│   ├── document.xml          ← Main content
│   ├── styles.xml            ← Style definitions
│   ├── settings.xml          ← Document settings
│   ├── fontTable.xml         ← Font references
│   ├── numbering.xml         ← List numbering
│   ├── media/
│   │   └── image1.png        ← Embedded images
│   └── _rels/
│       └── document.xml.rels ← Content relationships
└── docProps/
    ├── core.xml              ← Author, dates, title
    └── app.xml               ← Word version, company
This means a DOCX is programmatically readable and writable without Word—libraries like python-docx, docx4j, or Pandoc process the XML directly. It also means your metadata (author name, revision count, edit time) is sitting in plaintext inside the ZIP, readable by anyone who opens it.
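As a rough illustration of how little is needed to read a DOCX without Word, the sketch below uses only Python's standard library to pull paragraph text from word/document.xml and the author from docProps/core.xml. The function names are made up for this example, and it ignores headers, footnotes, and text boxes, which live in other parts of the package.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML and Dublin Core namespaces used inside a DOCX package
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DC = "{http://purl.org/dc/elements/1.1/}"


def docx_paragraphs(source):
    """Return the plain text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    # A paragraph is <w:p>; its visible text is stored in <w:t> elements.
    return [
        "".join(t.text or "" for t in p.iter(f"{W}t"))
        for p in root.iter(f"{W}p")
    ]


def docx_author(source):
    """Return dc:creator from docProps/core.xml, or None if it is absent."""
    with zipfile.ZipFile(source) as zf:
        if "docProps/core.xml" not in zf.namelist():
            return None
        root = ET.fromstring(zf.read("docProps/core.xml"))
    creator = root.find(f"{DC}creator")
    return creator.text if creator is not None else None
```

Both functions accept a file path or a file-like object, so they can be pointed at an upload buffer as easily as at a file on disk.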
PDF: A Fixed Canvas
PDF describes a document as a sequence of drawing operations on a fixed-size canvas. Text is rendered by specifying font, position, and glyph sequence—not as a semantic "paragraph" or "heading." Images are embedded as raster or vector data. Pages are self-contained; there's no concept of text flowing between pages during rendering.
This fixed-layout model is PDF's strength (identical appearance everywhere) and its weakness (painful to edit, difficult to reflow for different screen sizes, accessibility requires extra metadata layers). When you "edit" a PDF in Acrobat, you're either overlaying new content or, in more sophisticated editing, manipulating the content streams directly—which is why it's slow and sometimes produces odd results.
PDF can contain text as actual text (selectable, searchable, copy-pasteable) or as images of text (from scanning or certain printing workflows). The latter requires OCR to become searchable. A PDF being "searchable" means the text content is present in the file's data structure, not just visible as pixels.
ODT: The Open Standard
ODT (OpenDocument Text) is also a ZIP of XML, following the ISO/IEC 26300 standard. The content is in content.xml, styles in styles.xml. It's structurally similar to DOCX but uses different XML namespaces and element names. LibreOffice is its native application; it's also supported by Google Docs and most modern word processors.
RTF: Plain Text with Formatting Codes
RTF (Rich Text Format, 1987) is literally a text file containing embedded control words. Open any .rtf file in a text editor and you'll see things like {\rtf1\ansi\deff0{\fonttbl{\f0\froman Times New Roman;}}}. It's verbose and limits formatting compared to DOCX, but any software from the last 35 years can read it. RTF is the right format when you genuinely don't know what software the recipient has.
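To make the "text file with control words" point concrete, here is a minimal sketch that wraps plain text in the same RTF header shown above. The function name to_rtf and the constant RTF_HEADER are hypothetical; the escaping rules it applies (backslash-escaping of \, { and }, \par for paragraph breaks, and \uN? for non-ASCII characters) are standard RTF.

```python
# Header copied from the example above, plus \f0 to select the first font
RTF_HEADER = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0\\froman Times New Roman;}}\\f0 "


def to_rtf(text: str) -> str:
    """Wrap plain text in a minimal RTF document."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)      # escape RTF's own syntax characters
        elif ch == "\n":
            out.append("\\par\n")      # paragraph break
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit decimal per UTF-16 code unit;
            # the trailing "?" is the fallback shown by pre-Unicode readers.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "little")
                signed = unit - 65536 if unit > 32767 else unit
                out.append(f"\\u{signed}?")
    return RTF_HEADER + "".join(out) + "}"
```

Saving the result of to_rtf("Hello") to a .rtf file should give a document that opens in any word processor with RTF support.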
EPUB: HTML with a Spine
EPUB 3 is a ZIP containing HTML files, CSS stylesheets, images, and a package document (commonly named package.opf or content.opf) that lists every file and describes the reading order. Open an EPUB with a ZIP extractor and you'll find recognizable HTML. The "spine" in the package document defines which HTML files constitute the chapters and in what order. EPUB is reflowable by design: the HTML adapts to the reading device's screen size, which is why the same file works on phones, tablets, and dedicated e-ink devices interchangeably.
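The spine lookup can be sketched in a few lines of standard-library Python. The function name epub_reading_order is invented for this example. It relies on one detail of the EPUB container format: META-INF/container.xml points to the package document, so the code does not assume a particular .opf file name.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "{http://www.idpf.org/2007/opf}"


def epub_reading_order(source):
    """Return the content files of an EPUB in spine (reading) order."""
    with zipfile.ZipFile(source) as zf:
        # container.xml names the package document (the .opf file)
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(f".//{CONTAINER_NS}rootfile").get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # The manifest maps item ids to file paths relative to the .opf file
    manifest = {
        item.get("id"): item.get("href")
        for item in package.iter(f"{OPF_NS}item")
    }
    base = posixpath.dirname(opf_path)
    # The spine lists item ids in the order a reader should present them
    return [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in package.iter(f"{OPF_NS}itemref")
    ]
```

The returned paths are ZIP member names, so each one can be passed straight back to ZipFile.read to load the chapter's HTML.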
LaTeX: Source Code for Documents
LaTeX is a typesetting system: you write source code that compiles to PDF (or DVI). Unlike the formats above, LaTeX is not a finished file format; it's a programming language for documents. The engine (pdflatex, xelatex, lualatex) interprets the source and resolves cross-references over repeated runs, a separate tool such as BibTeX or Biber prepares the bibliography, and the final run produces the output. This is also why LaTeX typography looks so good: it makes optimal line-breaking decisions over entire paragraphs, not line by line.
Conversion Gotchas
Fonts
DOCX references fonts by name. If the converting system doesn't have the font, it substitutes another, often with different metrics, causing text to reflow, headings to change size, and page breaks to move. PDF embeds fonts (if the creator didn't disable this), so the recipient always sees the intended font. When converting DOCX to PDF, run the conversion on a system that has the document's fonts installed and keep font embedding enabled; embedding only preserves whichever font the converter actually used.
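One way to catch substitution before it happens is to list the fonts a DOCX references and compare them with what the conversion machine has. The sketch below reads word/fontTable.xml with the standard library. The function names are hypothetical, and the set of installed fonts is left to the caller because discovering it is platform-specific.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_font_names(source):
    """Return the sorted font names listed in word/fontTable.xml."""
    with zipfile.ZipFile(source) as zf:
        if "word/fontTable.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("word/fontTable.xml"))
    # Each <w:font> element carries the font's name in its w:name attribute
    names = {font.get(f"{W}name") for font in root.iter(f"{W}font")}
    return sorted(name for name in names if name)


def missing_fonts(source, installed):
    """Return referenced fonts that are not in the caller-supplied set."""
    available = {name.lower() for name in installed}
    return [n for n in docx_font_names(source) if n.lower() not in available]
```

A non-empty result from missing_fonts is a hint that the converted output may reflow, not proof: the font table can also list fonts that no visible text uses.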
Tables with merged cells
Complex table structures (rowspan, colspan) convert unreliably between formats. HTML tables, DOCX tables, ODT tables, and LaTeX tables all model spanning differently. Pandoc handles many cases but flattens others. If your document has complex tables, test the conversion explicitly and expect manual fixes.
Tracked changes
DOCX tracked changes are stored as a parallel set of "deletion" and "insertion" runs in the XML. When converting to HTML or Markdown, tracked changes are typically either accepted (showing only the final state) or stripped. There's no standard way to represent tracked changes in most other formats. If preserving edit history matters, keep the DOCX.
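The insertion and deletion markup is easy to detect before converting. In WordprocessingML, inserted runs sit inside w:ins elements and deleted runs inside w:del, with deleted text stored in w:delText instead of w:t. The sketch below (function names are illustrative) counts both and shows why reading only w:t yields the "all changes accepted" text.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _document_root(source):
    with zipfile.ZipFile(source) as zf:
        return ET.fromstring(zf.read("word/document.xml"))


def tracked_change_counts(source):
    """Count <w:ins> and <w:del> elements in the main document part."""
    root = _document_root(source)
    return {
        "insertions": sum(1 for _ in root.iter(f"{W}ins")),
        "deletions": sum(1 for _ in root.iter(f"{W}del")),
    }


def accepted_text(source):
    """Paragraph text as it reads with every tracked change accepted."""
    root = _document_root(source)
    # Inserted runs still use <w:t>; deleted runs use <w:delText>,
    # so collecting only <w:t> skips the deleted text.
    return [
        "".join(t.text or "" for t in p.iter(f"{W}t"))
        for p in root.iter(f"{W}p")
    ]
```

The counts include revision marks on paragraph breaks as well as on text, so treat them as "are there any tracked changes" rather than an exact edit tally.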
Headers and footers
Page-layout concepts like headers, footers, page numbers, and margin content don't map to HTML or Markdown—these formats have no concept of pages. Converting DOCX to HTML or Markdown silently drops headers and footers. Converting HTML to PDF via a tool like WeasyPrint requires CSS page rules (@page) to add them back.
Embedded objects
OLE objects (embedded Excel charts, Visio diagrams) in DOCX either convert to static images or are dropped entirely. SVG in EPUB converts reasonably to HTML. Embedded videos in DOCX become placeholders in PDF.
PDF to anything
PDF-to-text conversion quality depends entirely on the source. Text-based PDFs (made from Word, LaTeX, or InDesign) extract well. PDFs from scanned paper or fax require OCR. PDFs with multi-column layouts often produce garbled text order because columns are extracted left-to-right on the page, not column by column. PDF tables become difficult—the text positions are stored spatially, not as table cells.
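The column problem is easiest to see with a toy model. The sketch below assumes text fragments have already been extracted as (x, y, text) records with y increasing down the page; the Fragment type, both function names, and the "split at the page midpoint" rule are simplifications chosen for illustration, not how any particular PDF library works.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text fragment
    y: float   # distance from the top of the page
    text: str


def naive_order(fragments):
    """Read strictly top-to-bottom, left-to-right across the whole page."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def two_column_order(fragments, page_width):
    """Read the left column top-to-bottom, then the right column."""
    middle = page_width / 2
    left = sorted((f for f in fragments if f.x < middle), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= middle), key=lambda f: f.y)
    return [f.text for f in left + right]


page = [
    Fragment(50, 100, "L1"), Fragment(320, 100, "R1"),
    Fragment(50, 120, "L2"), Fragment(320, 120, "R2"),
]
print(naive_order(page))            # interleaves the two columns
print(two_column_order(page, 600))  # keeps each column together
```

Real layouts need more than a midpoint split (headings that span both columns, three-column pages, sidebars), which is why multi-column extraction usually needs manual checking.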
What Survives Conversion (and What Doesn't)
| Element | Survival rate | Notes |
|---|---|---|
| Paragraphs and text | Excellent | Core content survives almost everywhere |
| Heading hierarchy (H1–H6) | Excellent | If original uses real heading styles, not manual formatting |
| Bulleted and numbered lists | Excellent | If using built-in list styles |
| Bold, italic, underline | Excellent | Universal across all formats |
| Hyperlinks | Good | URLs survive; anchor targets sometimes lost |
| Inline images | Good | Raster images usually survive; vector may rasterize |
| Simple tables | Good | Basic rows/columns convert; spanning cells often don't |
| Code blocks | Good | Monospace preserved; syntax highlighting often lost |
| Footnotes/endnotes | Moderate | Many converters support; some formats lack the concept |
| Custom fonts | Poor | Substituted unless embedded in PDF |
| Complex table structures | Poor | Spans, merged cells, nested tables break frequently |
| Headers and footers | Poor | Lost when converting to HTML/Markdown (no page concept) |
| Track changes | Poor | Typically accepted or discarded during conversion |
| Embedded media (video, audio) | Poor | Usually dropped or replaced with placeholder |
| Page layout (columns, text boxes) | Very poor | Fundamentally incompatible with reflowable formats |
| Form fields | Very poor | Interactive forms require format-specific support |
Pandoc: The Swiss Army Knife
Pandoc handles conversions between 40+ formats. It's open-source, command-line, scriptable, and available on all platforms. For most conversion tasks, it's the right answer.
Installation
brew install pandoc # macOS
sudo apt install pandoc # Debian/Ubuntu
choco install pandoc # Windows (Chocolatey)
# Or download from https://pandoc.org/installing.html
Common conversions
# Format is inferred from file extension when possible
pandoc input.md -o output.html
pandoc input.md -o output.docx
pandoc input.md -o output.pdf # Requires LaTeX or WeasyPrint
pandoc input.docx -o output.md
pandoc input.html -o output.md
# Explicit format specification
pandoc input.txt -f markdown -t html5 -o output.html
# Standalone HTML (with head, html, body tags)
pandoc input.md -s -o output.html
# With table of contents
pandoc input.md --toc --toc-depth=2 -o output.html
# Apply a CSS stylesheet to HTML output
pandoc input.md -s --css=styles.css -o output.html
# Use a Word template for styles in DOCX output
pandoc input.md -o output.docx --reference-doc=template.docx
# With bibliography (Pandoc 2.11+ needs --citeproc to resolve citations)
pandoc paper.md --citeproc --bibliography=refs.bib --csl=apa.csl -o paper.docx
Key options
| Option | Purpose |
|---|---|
| -s, --standalone | Produce a complete document (with HTML head/body, not just a fragment) |
| --toc | Generate a table of contents from headings |
| --reference-doc=file.docx | Use a DOCX template's styles for the output |
| --citeproc | Resolve citations and generate the reference list (required in Pandoc 2.11+) |
| --bibliography=file.bib | Bibliography file (BibTeX and other formats) used by --citeproc |
| --csl=style.csl | Citation style (APA, MLA, IEEE, Chicago, etc.) |
| --highlight-style=tango | Syntax highlighting theme for code blocks (list built-in names with pandoc --list-highlight-styles) |
| --pdf-engine=xelatex | LaTeX engine for PDF generation (handles Unicode better than pdflatex) |
| --wrap=none | Don't rewrap text (useful for Markdown output) |
| -M title="My Doc" | Set metadata (title, author, date, etc.) |
Real examples
# Resume: Markdown → PDF with Eisvogel LaTeX template
pandoc resume.md -o resume.pdf \
--template=eisvogel.latex \
--pdf-engine=xelatex
# Academic paper: LaTeX → DOCX with bibliography
pandoc paper.tex -o paper.docx \
  --citeproc \
  --bibliography=references.bib \
  --csl=ieee.csl
# Ebook: multiple Markdown chapters → EPUB
pandoc title.md ch01.md ch02.md ch03.md -o book.epub \
--toc \
--epub-cover-image=cover.jpg \
-M title="My Book" \
-M author="Author Name"
# Migrate HTML docs to Markdown
pandoc docs.html -f html -t markdown --wrap=none -o docs.md
LibreOffice Headless
LibreOffice has a headless mode (no GUI) that converts documents the way LibreOffice actually renders them—which is closer to how a human user would save them. This makes it better than Pandoc for some conversions, particularly DOCX to PDF with complex formatting, and DOCX ↔ ODT.
# Convert DOCX to PDF (headless mode)
libreoffice --headless --convert-to pdf document.docx
# Convert to a specific directory
libreoffice --headless --convert-to pdf --outdir /output/ document.docx
# Batch convert all DOCX in current directory
libreoffice --headless --convert-to pdf *.docx
# Convert DOCX to ODT
libreoffice --headless --convert-to odt document.docx
# Convert ODT to DOCX
libreoffice --headless --convert-to docx document.odt
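# In a Docker container (useful for server-side conversion)
# The stock ubuntu image does not include LibreOffice, so install it first
# (or bake this step into your own image so it runs only once)
docker run --rm -v "$(pwd)":/data ubuntu:22.04 sh -c \
  "apt-get update && apt-get install -y libreoffice-writer && \
   libreoffice --headless --convert-to pdf --outdir /data /data/document.docx"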
LibreOffice headless is the standard approach for server-side DOCX-to-PDF conversion when you need close fidelity to the original formatting. It's used by many document management systems, PDF generation services, and document preview APIs.
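For server-side use, the command is usually wrapped so that each conversion gets a timeout and its own profile directory; concurrent LibreOffice processes that share one user profile tend to interfere with each other. The sketch below is one way to do that, assuming soffice is on the PATH. The function names are hypothetical, while -env:UserInstallation is LibreOffice's own switch for relocating the profile.

```python
import subprocess
import tempfile
from pathlib import Path


def soffice_command(src, outdir, fmt="pdf", profile_dir=None, binary="soffice"):
    """Build the argument list for a headless LibreOffice conversion."""
    cmd = [binary, "--headless"]
    if profile_dir is not None:
        # Give this run its own profile so parallel conversions don't collide
        profile_uri = Path(profile_dir).resolve().as_uri()
        cmd.append(f"-env:UserInstallation={profile_uri}")
    cmd += ["--convert-to", fmt, "--outdir", str(outdir), str(src)]
    return cmd


def convert(src, outdir, fmt="pdf", timeout=120):
    """Convert one file; raises on a non-zero exit code or a timeout."""
    with tempfile.TemporaryDirectory() as profile:
        cmd = soffice_command(src, outdir, fmt, profile_dir=profile)
        subprocess.run(cmd, check=True, timeout=timeout, capture_output=True)
    # Assumes fmt is a plain extension such as "pdf" or "odt"
    return Path(outdir) / f"{Path(src).stem}.{fmt}"
```

Keeping the command builder separate from the subprocess call makes the argument list easy to log and to test without LibreOffice installed.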
When to use LibreOffice vs Pandoc
| Task | Better tool | Why |
|---|---|---|
| DOCX → PDF (fidelity) | LibreOffice | Renders fonts and layout like the actual application |
| Markdown → DOCX | Pandoc | LibreOffice doesn't read Markdown |
| DOCX → Markdown | Pandoc | LibreOffice doesn't write Markdown |
| DOCX ↔ ODT | LibreOffice | Both are native formats; conversion is higher fidelity |
| Markdown → EPUB | Pandoc | Pandoc's EPUB output with metadata support is excellent |
| LaTeX → PDF | pdflatex/xelatex | LaTeX requires its own compiler; Pandoc calls it internally |
| Citation management | Pandoc | Built-in BibTeX and CSL support |
Browser-Based WASM Conversion
Several document processing libraries have been compiled to WebAssembly, enabling document conversion in the browser without uploading files to a server. This is how ToolsDock's converters work—Pandoc, LibreOffice, and other tools compiled to WASM and run locally.
The privacy advantage is significant: a 50MB DOCX containing a confidential contract never leaves your machine. The browser processes it locally using the same code that would run on a server.
What's available as WASM
- Pandoc WASM — the full Pandoc converter running in the browser; powers most of ToolsDock's markup and document converters
- pdf-lib — read and write PDFs in JavaScript, good for merging, splitting, and metadata editing
- PDFium WASM — Google's PDF renderer compiled to WASM; handles rendering and extraction
- MuPDF WASM — lightweight PDF/XPS renderer; used for PDF preview and processing
- mammoth.js — DOCX to HTML converter written natively for browsers; good for simple DOCX reading
WASM converters have limitations: large files are slow because WASM memory is limited, very complex formatting may be handled differently than the server-side counterpart, and some library features aren't exposed in the WASM build.
Batch Workflows
Linux / macOS shell script
#!/bin/bash
# Convert all Markdown files to PDF
shopt -s nullglob   # an empty docs/ directory yields zero iterations
INPUT_DIR="./docs"
OUTPUT_DIR="./output/pdf"
mkdir -p "$OUTPUT_DIR"
count=0
for file in "$INPUT_DIR"/*.md; do
  basename="${file##*/}"
  name="${basename%.md}"
  echo "Converting: $basename"
  if pandoc "$file" -o "$OUTPUT_DIR/$name.pdf" \
    --toc \
    --highlight-style=tango; then
    count=$((count + 1))
  else
    echo "Failed: $basename" >&2
  fi
done
echo "Done. $count files converted."
Makefile (parallel builds)
SOURCES := $(wildcard docs/*.md)
PDFS := $(SOURCES:docs/%.md=output/%.pdf)
HTMLS := $(SOURCES:docs/%.md=output/%.html)
.PHONY: all
all: $(PDFS) $(HTMLS)
# Recipe lines must begin with a tab character
output/%.pdf: docs/%.md
	@mkdir -p output
	pandoc "$<" -o "$@" --toc
output/%.html: docs/%.md
	@mkdir -p output
	pandoc "$<" -s -o "$@" --toc
# Run with: make -j4 (parallel, 4 jobs)
Windows PowerShell
# Convert all DOCX to PDF using LibreOffice
$files = Get-ChildItem -Path ".\docs" -Filter "*.docx"
foreach ($file in $files) {
    Write-Host "Converting: $($file.Name)"
    & "C:\Program Files\LibreOffice\program\soffice.exe" `
        --headless --convert-to pdf `
        --outdir ".\output" $file.FullName
}
# Convert all Markdown to DOCX with Pandoc
$mdFiles = Get-ChildItem -Filter "*.md"
foreach ($f in $mdFiles) {
    $out = $f.BaseName + ".docx"
    pandoc $f.Name -o $out --reference-doc=template.docx
    Write-Host "Created: $out"
}
Python batch conversion
# Using subprocess to call Pandoc
import subprocess
from pathlib import Path

def batch_convert(source_dir, output_dir, from_ext, to_ext, extra_args=None):
    source_path = Path(source_dir)
    output_path = Path(output_dir)
    extra_args = extra_args or []
    files = sorted(source_path.glob(f'**/*.{from_ext}'))
    print(f'Found {len(files)} .{from_ext} files')
    for file in files:
        # Mirror the source subdirectory so same-named files don't overwrite each other
        out = output_path / file.relative_to(source_path).with_suffix(f'.{to_ext}')
        out.parent.mkdir(parents=True, exist_ok=True)
        cmd = ['pandoc', str(file), '-o', str(out)] + extra_args
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(f'OK: {file.name} → {out.name}')
        else:
            print(f'FAIL: {file.name}')
            print(result.stderr)

# Convert all Markdown to DOCX
batch_convert('./docs', './output', 'md', 'docx',
              ['--reference-doc=template.docx', '--toc'])
Academic Writing Workflows
The LaTeX → DOCX problem
Many journals want DOCX submissions even from authors who write in LaTeX. Pandoc converts the core content well. Mathematics is the sticking point: in DOCX output Pandoc writes equations as native Word math (OMML), and any formula it cannot convert is left as raw TeX with a warning. Options such as --mathml and --webtex apply to HTML-family output, not to DOCX. Complex custom LaTeX macros and packages often need manual fixes.
# LaTeX → DOCX with bibliography (best approach)
pandoc paper.tex \
  --citeproc \
  --bibliography=references.bib \
  --csl=ieee.csl \
  --reference-doc=journal-template.docx \
  -o paper.docx
# If math breaks, look for math conversion warnings in Pandoc's output,
# then simplify or expand the offending macros in paper.tex and convert again
# After conversion, manually review:
# - Equation numbering
# - Figure and table labels
# - Citation formatting
# - Any custom commands that Pandoc doesn't recognize
Collaborative Markdown → LaTeX pipeline
Write in Markdown for easy Git collaboration, generate DOCX for non-LaTeX collaborators, and produce the final PDF via LaTeX for submission quality.
# Draft phase: Markdown → DOCX for co-authors
pandoc manuscript.md -o manuscript_draft.docx \
  --citeproc --bibliography=refs.bib --csl=apa.csl
# Submission phase: Markdown → LaTeX → PDF
# -s writes a complete, compilable .tex file; --natbib emits \cite commands for BibTeX
pandoc manuscript.md -s --natbib -o manuscript.tex \
  --bibliography=refs.bib
pdflatex manuscript.tex
bibtex manuscript
pdflatex manuscript.tex
pdflatex manuscript.tex # Second pass after BibTeX to resolve references
Ebook Creation
Recommended workflow
Write in Markdown → convert to EPUB with Pandoc → validate with EPUBCheck → upload to distributor. For Kindle, upload the EPUB to KDP; Amazon handles the conversion to their format.
# Single file → EPUB
pandoc book.md -o book.epub \
--toc --toc-depth=2 \
--epub-cover-image=cover.jpg \
-M title="Book Title" \
-M author="Author Name" \
-M lang=en-US
# Multiple chapters → EPUB
pandoc \
front-matter.md \
ch01-introduction.md \
ch02-getting-started.md \
ch03-advanced.md \
appendix.md \
-o book.epub \
--toc \
--epub-cover-image=cover.jpg \
-M title="Book Title"
# Validate with EPUBCheck (requires Java)
java -jar epubcheck.jar book.epub
EPUB best practices
- Use heading styles consistently—H1 for chapters, H2 for sections—since ereaders generate navigation from headings
- Keep images under 300KB each; ereaders have limited memory
- 72–96 DPI is sufficient for ereader screens
- Remove manual page breaks—EPUB is reflowable and page breaks are device-specific
- Test on at least one physical Kindle, one Kobo, and one phone (Kindle/Apple Books app)
- Use relative font sizes (em, rem) not pixels in your CSS
- Include the lang attribute; screen readers use it for text-to-speech language selection
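The image-size guideline in the list above is simple to check automatically, since an EPUB is a ZIP archive. The sketch below lists image files whose uncompressed size exceeds a limit; the function name and the extension list are choices made for this example, and the 300 KB default mirrors the guideline above.

```python
import zipfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def oversized_images(source, limit_bytes=300 * 1024):
    """Return (name, size) for images in an EPUB larger than limit_bytes."""
    with zipfile.ZipFile(source) as zf:
        return sorted(
            (info.filename, info.file_size)   # file_size is the uncompressed size
            for info in zf.infolist()
            if info.filename.lower().endswith(IMAGE_EXTENSIONS)
            and info.file_size > limit_bytes
        )
```

Running it on book.epub before EPUBCheck gives a quick list of images worth recompressing.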
Accessibility in Document Formats
Accessible documents are well-structured documents. The same practices that make a document convert cleanly also make it screen-reader friendly.
PDF accessibility
PDFs are inherently challenging for accessibility because they're fixed-layout. Tagged PDF (the basis of the PDF/UA standard) adds structure tags that screen readers use to understand reading order. When generating PDF with Pandoc, --pdf-engine=xelatex handles Unicode text more reliably than pdflatex, but a LaTeX engine does not by itself guarantee a tagged PDF, so check the result with a PDF accessibility checker. Add alt text to images in the source document before converting.
# Pandoc to accessible PDF via LaTeX
pandoc input.md -o output.pdf \
--pdf-engine=xelatex \
-V colorlinks=true # Visible link indicators
# Images in Markdown (alt text is accessible description; image.png is a placeholder name):
![Description of what the image shows](image.png)
DOCX accessibility checklist
- Use built-in Heading styles (Heading 1, 2, 3)—not manually formatted text
- Add alt text to every image (right-click → Edit Alt Text in Word)
- Use tables for data, not layout, and mark the first row as a header row (Table Design → Header Row)
- Write descriptive link text ("View quarterly report" not "click here")
- Ensure 4.5:1 contrast ratio minimum for all text
- Don't use color alone to convey meaning
- Run Word's Accessibility Checker before exporting (Review → Check Accessibility)
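The 4.5:1 figure in the checklist comes from the WCAG contrast formula, which can be computed directly from two sRGB colors. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio; the function names are illustrative.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color written as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str) -> bool:
    """True if the pair meets the 4.5:1 minimum for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5
```

For example, passes_aa("#767676", "#FFFFFF") is True, while the slightly lighter gray #777777 on white falls just short of 4.5:1.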
EPUB accessibility
# Pandoc EPUB with accessibility metadata
pandoc book.md -o book.epub \
-M lang=en \
--epub-metadata=metadata.xml
# metadata.xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:language>en</dc:language>
<meta property="schema:accessMode">textual</meta>
<meta property="schema:accessibilityFeature">tableOfContents</meta>
<meta property="schema:accessibilityFeature">structuralNavigation</meta>
<meta property="schema:accessibilityFeature">alternativeText</meta>
</metadata>
Tools
Markdown to DOCX
Convert Markdown to a Word document with Pandoc; supports tables, code blocks, and headings.
DOCX to PDF
Convert Word documents to PDF while preserving formatting and embedded fonts.
Markdown to EPUB
Create EPUB ebooks from Markdown with table of contents, cover image, and metadata.
HTML to Markdown
Convert HTML to clean Markdown, useful for migrating blog or CMS content.