JSON vs XML vs YAML vs CSV vs TOML: Data Formats Compared
A practical guide to the most common data interchange formats — what each one looks like, where it fits, how to parse it, and how to convert between them without losing data.
CSV — Comma-Separated Values
CSV is the simplest format for tabular data. Each line is a row, each value separated by a delimiter (usually comma, sometimes tab or semicolon). Despite its age and limitations, CSV is still the most universally accepted format for data exchange with non-technical users and analytics tools.
name,age,city,active
John Doe,32,New York,true
Jane Smith,28,Los Angeles,false
Bob Johnson,45,"Chicago, IL",true
What CSV actually handles
- Fields with the delimiter inside must be quoted: "Chicago, IL"
- Fields with quotes inside double them: "She said, ""Hello"""
- Line breaks inside a field: allowed if the field is quoted
- All values are strings — there is no integer or boolean type
- No nesting — one level of rows and columns only
Delimiter variations
European Excel defaults to semicolons because the comma is the decimal separator in many locales. Tab-separated (TSV) is common in bioinformatics. Pipe-separated appears in legacy mainframe exports. When parsing, always confirm the delimiter — don't assume comma.
# European Excel format
name;price;quantity
Apple;1,50;10
# Tab-separated
name\tprice\tquantity
Apple\t1.50\t10
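If the delimiter is unknown, the standard library can usually guess it from a sample. A minimal sketch using csv.Sniffer — detect_delimiter is a hypothetical helper name, and restricting the candidate delimiters makes the guess more reliable:

```python
import csv

def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from the first few lines of a CSV file."""
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter

# European-style semicolon file (note the decimal comma in the data)
print(detect_delimiter("name;price;quantity\nApple;1,50;10\n"))

# Tab-separated file
print(detect_delimiter("name\tprice\tquantity\nApple\t1.50\t10\n"))
```

In practice, read the first few kilobytes of the file, sniff, then hand the detected delimiter to your parser.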
Parsing CSV
# Python — the csv module handles quoting correctly
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Node.js — papaparse handles edge cases well
import Papa from 'papaparse';

const result = Papa.parse(csvText, {
  header: true,
  dynamicTyping: true,  // auto-detect numbers and booleans
  skipEmptyLines: true
});
# R
data <- read.csv('data.csv', stringsAsFactors = FALSE)
# Pandas
import pandas as pd
df = pd.read_csv('data.csv', dtype={'zip': str}) # force zip to string
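Writing CSV is symmetric: let the library handle quoting rather than joining strings yourself. A short sketch with the standard csv module (io.StringIO stands in for a real file):

```python
import csv
import io

rows = [
    {"name": "Bob Johnson", "age": 45, "city": "Chicago, IL"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age", "city"])
writer.writeheader()
writer.writerows(rows)

# The embedded comma is quoted automatically:
# name,age,city
# Bob Johnson,45,"Chicago, IL"
print(buf.getvalue())
```

The same DictWriter call works unchanged with a real file opened with newline=''.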
Convert CSV to other formats: CSV to JSON, CSV to XML, CSV to YAML, CSV to Markdown Table.
JSON — JavaScript Object Notation
JSON is the dominant format for REST APIs and modern web applications. It maps directly to data structures in virtually every programming language, and browsers can parse it natively without a library.
{
  "users": [
    {
      "id": 1,
      "name": "John Doe",
      "active": true,
      "address": {
        "city": "New York",
        "zip": "10001"
      },
      "tags": ["admin", "developer"],
      "lastLogin": "2026-03-15T10:30:00Z",
      "phoneNumber": null
    }
  ],
  "totalCount": 1,
  "hasNextPage": false
}
JSON's six data types
- string — always double-quoted: "hello"
- number — integer or float, no quotes: 42, 3.14
- boolean — lowercase: true, false
- null — lowercase: null
- array — ordered list: [1, 2, 3]
- object — key-value map: {"key": "value"}
There is no date type. Use ISO 8601 strings: "2026-03-27T14:00:00Z". There are no comments — if you need them, use YAML or TOML for the config file instead.
Pitfall: large integers
JavaScript's Number type is a 64-bit float. Integers above 2^53 − 1 (9,007,199,254,740,991) lose precision. Twitter's API famously returns tweet IDs as both a number and a string for this reason. If you have database IDs or timestamps that exceed this limit, serialize them as strings.
// Safe
{"id": 9007199254740991}
// This ID loses precision in JavaScript
{"id": 9007199254740993} // → parsed as 9007199254740992
// Safe approach: always use string for large IDs
{"id": "9007199254740993", "id_int": 9007199254740993}
Parsing JSON
// JavaScript — native, no library needed
const data = JSON.parse(jsonString);
const str = JSON.stringify(data, null, 2); // pretty-print
// Python — standard library
import json
data = json.loads(json_string)
output = json.dumps(data, indent=2, ensure_ascii=False)
// Go — encoding/json package
import "encoding/json"
var result map[string]interface{}
json.Unmarshal([]byte(jsonStr), &result)
# Ruby — standard library
require 'json'
data = JSON.parse(json_string)
output = JSON.pretty_generate(data)
Tools: JSON Formatter, JSON Validator, JSON Schema Validator, JSON to XML, JSON to YAML.
XML — Extensible Markup Language
XML predates JSON by a decade and is still deeply embedded in enterprise systems, SOAP web services, Microsoft Office files (.docx, .xlsx are ZIP archives of XML), RSS feeds, and SVG graphics. It's verbose, but it has capabilities JSON lacks: element attributes, namespaces, XPath queries, XSLT transformations, and mature schema validation via XSD.
<?xml version="1.0" encoding="UTF-8"?>
<users xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <user id="1" active="true">
    <name>John Doe</name>
    <address>
      <city>New York</city>
      <zip>10001</zip>
    </address>
    <tags>
      <tag>admin</tag>
      <tag>developer</tag>
    </tags>
    <lastLogin>2026-03-15T10:30:00Z</lastLogin>
  </user>
</users>
Attributes vs elements
XML lets you put data either as element attributes (<user id="1">) or as child elements (<id>1</id>). The convention: use attributes for metadata that describes the element (IDs, types, flags), use child elements for the actual data content. There's no technical difference — it's a design choice. Pick one convention and document it.
Special characters in XML
| Character | Entity | Use |
|---|---|---|
| < | &lt; | Less than, tag start |
| > | &gt; | Greater than |
| & | &amp; | Ampersand |
| " | &quot; | In attribute values |
| ' | &apos; | In attribute values |
Alternatively, wrap content in <![CDATA[ ... ]]> to include arbitrary text without escaping.
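Rather than escaping by hand, use a library helper. Python's standard library ships escape() for element content and quoteattr() for attribute values:

```python
from xml.sax.saxutils import escape, quoteattr

text = 'She said, "Hello & Goodbye" <loudly>'

# escape() handles the characters that are always unsafe in element content
print(escape(text))

# quoteattr() returns a ready-to-embed, fully quoted attribute value,
# choosing quote characters and escaping as needed
print(quoteattr(text))
```

Most XML serializers (ElementTree, lxml) escape automatically when you set .text or attributes; these helpers matter when you build XML strings manually.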
Parsing XML
# Python — standard library ElementTree (DOM parsing)
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
for user in root.findall('user'):
    name = user.find('name').text
    user_id = user.get('id')  # attribute
# Python — lxml for XPath and namespaces
from lxml import etree
tree = etree.parse('data.xml')
names = tree.xpath('//user[@active="true"]/name/text()')
# Node.js — fast-xml-parser
import { XMLParser } from 'fast-xml-parser';
const parser = new XMLParser({ ignoreAttributes: false });
const result = parser.parse(xmlString);
# Java — JAXB or DOM
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File("data.xml"));
Tools: XML Formatter, XML to JSON, XML to CSV, XML to YAML.
YAML — YAML Ain't Markup Language
YAML was designed to be human-readable above all else. It's the format of choice for configuration files: Docker Compose, Kubernetes manifests, GitHub Actions workflows, Ansible playbooks, and many more. It supports comments (unlike JSON), multi-line strings, and anchors for reuse.
# Comments are supported — a major advantage over JSON
users:
  - id: 1
    name: John Doe
    active: true
    address:
      city: New York
      zip: "10001"  # Quoted to prevent YAML from treating it as integer
    tags:
      - admin
      - developer
    lastLogin: 2026-03-15T10:30:00Z
  - id: 2
    name: Jane Smith
    active: false
    address: null
    tags: []

database:
  host: localhost
  port: 5432
YAML features JSON doesn't have
# Multi-line string (literal block, preserves newlines)
description: |
  First line.
  Second line.
  Third line.

# Folded block (newlines become spaces)
summary: >
  This long text will be
  folded into one line.

# Anchors and aliases (define once, reuse)
defaults: &defaults
  timeout: 30
  retries: 3
  log_level: info

production:
  <<: *defaults            # merge defaults
  host: prod.example.com
  log_level: warn          # override one value

staging:
  <<: *defaults
  host: staging.example.com
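The merge behavior is easy to verify. A minimal sketch, assuming PyYAML is installed (its safe loader supports the << merge key):

```python
import yaml  # PyYAML

doc = """
defaults: &defaults
  timeout: 30
  log_level: info

production:
  <<: *defaults
  log_level: warn
"""

config = yaml.safe_load(doc)
# timeout is inherited from the anchor; log_level is overridden locally
print(config["production"])
```

Note that the alias is resolved at load time: the parsed document contains plain merged mappings, and the anchor structure is gone.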
YAML gotchas
YAML's flexibility creates parsing surprises. Values that look like other types get auto-converted:
# These are NOT strings in YAML 1.1
enabled: yes        # → boolean true
country: NO         # → boolean false (Norway's ISO code!)
version: 1.0        # → float
zip: 08012          # → integer 8012 in some parsers (leading zero lost); YAML 1.1 reads a 0 prefix as octal
date: 2026-03-27    # → date object in some parsers

# Fix: quote explicitly
country: "NO"
zip: "08012"
YAML 1.2 (the current spec) removed the yes/no/on/off boolean variants, but many parsers still implement YAML 1.1 behavior. When in doubt, quote your string values.
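The gotchas above are easy to reproduce. A quick check, assuming PyYAML (which follows YAML 1.1 resolution rules):

```python
import yaml  # PyYAML implements YAML 1.1 scalar resolution

print(yaml.safe_load("country: NO"))     # {'country': False} — boolean, not Norway
print(yaml.safe_load('country: "NO"'))   # {'country': 'NO'} — quoting keeps the string
print(yaml.safe_load("version: 1.0"))    # {'version': 1.0} — float, not a version string
```

A YAML 1.2 parser such as ruamel.yaml in pure-1.2 mode would keep NO as a string; this divergence between parsers is exactly why quoting is the safe default.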
Security: YAML parsing arbitrary objects
Some YAML parsers support language-specific object deserialization. This is a known attack vector — a maliciously crafted YAML file can execute arbitrary code. Always use safe loaders:
# Python — NEVER use an unsafe loader on untrusted input
import yaml

# Unsafe (the full Loader can construct arbitrary Python objects):
data = yaml.load(untrusted_input, Loader=yaml.Loader)  # can execute arbitrary code

# Safe:
data = yaml.safe_load(untrusted_input)

# Or, equivalently:
data = yaml.load(untrusted_input, Loader=yaml.SafeLoader)
Tools: YAML Formatter, YAML to JSON, JSON to YAML, YAML to XML.
TOML — Tom's Obvious, Minimal Language
TOML was created specifically for configuration files. It's more explicit than YAML (no ambiguous auto-conversion), cleaner than INI (supports typed values and nested sections), and easier to read than JSON (supports comments). Rust's package manager uses it for Cargo.toml, Python's packaging standard uses it for pyproject.toml, and Hugo uses it as the default config format.
# TOML config example
title = "My Application"
version = "2.1.0"
debug = false
[database]
host = "localhost"
port = 5432
name = "myapp_production"
timeout = 30.0 # float
[database.pool]
min_connections = 5
max_connections = 20
[[servers]] # array of tables
name = "web-01"
ip = "10.0.1.1"
roles = ["web", "proxy"]
[[servers]]
name = "web-02"
ip = "10.0.1.2"
roles = ["web"]
[logging]
level = "warn"
file = "/var/log/app.log"
rotate_daily = true
created_at = 2026-01-15T00:00:00Z # native datetime type
TOML native types
TOML has explicit types for dates and times, which YAML only handles in some implementations:
| Type | Example |
|---|---|
| String | "hello" |
| Integer | 42 |
| Float | 3.14, 6.626e-34 |
| Boolean | true, false |
| Datetime | 2026-03-27T14:30:00Z |
| Local date | 2026-03-27 |
| Array | [1, 2, 3] |
| Inline table | {host = "localhost", port = 5432} |
Convert: TOML to JSON, JSON to TOML.
Side-by-Side Comparison
The same data structure in each format, so you can compare verbosity and syntax:
JSON
{
  "server": {
    "host": "localhost",
    "port": 8080,
    "debug": true,
    "tags": ["web", "api"]
  }
}
YAML
# Server configuration
server:
  host: localhost
  port: 8080
  debug: true
  tags:
    - web
    - api
TOML
[server]
host = "localhost"
port = 8080
debug = true
tags = ["web", "api"]
XML
<server debug="true">
  <host>localhost</host>
  <port>8080</port>
  <tags>
    <tag>web</tag>
    <tag>api</tag>
  </tags>
</server>
| Feature | CSV | JSON | XML | YAML | TOML |
|---|---|---|---|---|---|
| Human readable | High | High | Medium | Very high | Very high |
| Comments | No | No | Yes | Yes | Yes |
| Nesting | No | Yes | Yes | Yes | Yes |
| Native data types | Strings only | 6 types | Strings (XSD types) | 11 types | 9 types + datetime |
| Schema validation | None | JSON Schema | XSD, DTD, RelaxNG | Limited | None |
| File size (relative) | Smallest | Small | Large | Medium | Medium |
| Parse speed | Fastest | Fast | Slowest | Medium | Medium |
| Library support | Universal | Universal | Universal | Wide | Growing |
| Binary data | No | Base64 string | Base64 / CDATA | !!binary tag (base64) | No |
When to Pick Each Format
CSV
Data that's naturally rows and columns with no nesting. Reports, contact lists, financial exports, database table dumps, anything that needs to open in Excel. It's also the right choice when your audience includes non-technical users who will open the file in a spreadsheet.
Examples: e-commerce order exports, survey results, analytics reports, address books, product catalogs for flat data.
JSON
REST APIs, anything JavaScript touches, NoSQL database payloads (MongoDB, DynamoDB), event streaming (Kafka messages), and modern app configuration when you don't need comments. JSON is the right default for new systems unless there's a specific reason to choose otherwise.
Examples: REST API responses, GraphQL variables, Slack/webhook payloads, React component data, Firebase documents.
XML
SOAP web services, document workflows (legal, publishing), feed syndication (RSS/Atom), SVG graphics, Microsoft Office formats, and enterprise systems built before JSON became mainstream. Also use XML when you need namespace support or XSLT transformations.
Examples: SOAP APIs, DocBook technical documentation, RSS feeds, JATS academic publishing, Android resources, Maven POM files.
YAML
Configuration files that humans write and maintain. Kubernetes manifests, Docker Compose files, CI/CD pipelines, Ansible playbooks, Jekyll/Hugo site configs. YAML's comments and anchors make complex configurations maintainable in a way JSON can't match.
Examples: docker-compose.yml, .github/workflows/*.yml, kubernetes-deployment.yaml, _config.yml, Ansible roles.
TOML
Project and application configuration files, especially in ecosystems that have adopted it as a standard. If you're writing Rust, Python packaging, or Hugo, TOML is already expected. It's a better choice than YAML for config files because its type system is more explicit — you won't get surprised by yes becoming true.
Examples: Cargo.toml, pyproject.toml, hugo.toml, .taplo.toml.
Parsing Libraries
| Format | Python | JavaScript/Node | Go | Java |
|---|---|---|---|---|
| CSV | csv (stdlib), pandas | papaparse, csv-parse | encoding/csv | OpenCSV, Apache Commons CSV |
| JSON | json (stdlib) | JSON (native) | encoding/json | Jackson, Gson |
| XML | ElementTree (stdlib), lxml | fast-xml-parser, xml2js | encoding/xml | JAXB, DOM4J |
| YAML | PyYAML, ruamel.yaml | js-yaml, yaml | gopkg.in/yaml.v3 | SnakeYAML |
| TOML | tomllib (stdlib 3.11+), tomli | @iarna/toml, smol-toml | github.com/BurntSushi/toml | toml4j |
Streaming vs DOM Parsing
For most files under 50 MB, load the whole document into memory (DOM parsing). It's simpler and most libraries default to this approach. For larger files — production database exports, log archives, data pipeline inputs — streaming parsers let you process records one at a time without loading everything into memory.
Streaming JSON with NDJSON
NDJSON (Newline-Delimited JSON) puts one JSON object per line, making it naturally streamable. This is what many log aggregators (Splunk, Elasticsearch) and data pipelines use:
{"id":1,"name":"John","event":"login","ts":"2026-03-27T10:00:00Z"}
{"id":2,"name":"Jane","event":"purchase","ts":"2026-03-27T10:01:23Z"}
{"id":3,"name":"Bob","event":"logout","ts":"2026-03-27T10:05:11Z"}
# Python streaming NDJSON
import json

with open('events.ndjson') as f:
    for line in f:
        event = json.loads(line.strip())
        process(event)

# Node.js streaming JSON with JSONStream
import JSONStream from 'JSONStream';
import { createReadStream } from 'fs';

createReadStream('large-array.json')
  .pipe(JSONStream.parse('*'))  // parse each array element
  .on('data', (item) => process(item));
Streaming XML with SAX
# Python SAX parser — processes XML without loading it all into memory
import xml.sax

class UserHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "user":
            self.current_id = attrs.get("id")

    def characters(self, content):
        self.current_content = content

xml.sax.parse("large-export.xml", UserHandler())
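A middle ground between SAX callbacks and full DOM parsing is ElementTree's iterparse, which yields complete elements as they close; clearing each element after use keeps memory bounded. A minimal sketch (io.BytesIO stands in for a large file on disk):

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"<users><user id='1'/><user id='2'/><user id='3'/></users>"

ids = []
# iterparse yields (event, element) pairs; by default only 'end' events
for event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
    if elem.tag == "user":
        ids.append(elem.get("id"))
        elem.clear()  # free the subtree once processed

print(ids)  # ['1', '2', '3']
```

This is often easier to maintain than a SAX handler because each callback sees a fully built element, not character fragments.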
Convert to NDJSON: JSON to NDJSON, NDJSON to JSON.
Schema Validation
Schema validation catches format errors before they cause runtime problems. Each format has its own approach:
JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "age"],
  "properties": {
    "name": {
      "type": "string",
      "minLength": 1,
      "maxLength": 100
    },
    "age": {
      "type": "integer",
      "minimum": 0,
      "maximum": 150
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" },
      "uniqueItems": true
    }
  }
}
# Python validation with jsonschema
import json
import jsonschema

schema = json.load(open('schema.json'))
data = json.load(open('data.json'))
try:
    jsonschema.validate(instance=data, schema=schema)
    print("Valid!")
except jsonschema.ValidationError as e:
    print(f"Invalid: {e.message}")

// Node.js with Ajv (fastest JSON Schema validator)
import Ajv from 'ajv';

const ajv = new Ajv();
const validate = ajv.compile(schema);
const valid = validate(data);
if (!valid) console.log(validate.errors);
XML Schema Definition (XSD)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="user">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:positiveInteger"/>
        <xs:element name="email" type="xs:string" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Validate JSON: JSON Schema Validator. Format and check XML: XML Formatter.
Converting Between Formats
Handling nested data going to CSV
CSV has no nesting. When you convert JSON to CSV, nested objects must be flattened. Three strategies:
// Input JSON
{
  "user": {
    "name": "John",
    "address": {
      "city": "NYC",
      "zip": "10001"
    },
    "tags": ["admin", "dev"]
  }
}
// Strategy 1: Dot notation (most common)
user.name, user.address.city, user.address.zip, user.tags
John, NYC, 10001, admin;dev
// Strategy 2: Serialize nested as JSON string
user.name, user.address, user.tags
John, "{""city"":""NYC"",""zip"":""10001""}", "[""admin"",""dev""]"
// Strategy 3: Normalize into related tables
// users.csv: id, name
// addresses.csv: user_id, city, zip
// user_tags.csv: user_id, tag
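Strategy 1 is simple enough to sketch in a few lines. Here flatten is a hypothetical helper (not a library function) that builds dot-notation keys and joins arrays with semicolons, matching the example above:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts with dot-notation keys; join lists with ';'."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, full_key + "."))
        elif isinstance(value, list):
            flat[full_key] = ";".join(map(str, value))
        else:
            flat[full_key] = value
    return flat

record = {"name": "John", "address": {"city": "NYC", "zip": "10001"}, "tags": ["admin", "dev"]}
print(flatten(record))
# {'name': 'John', 'address.city': 'NYC', 'address.zip': '10001', 'tags': 'admin;dev'}
```

The flattened dict feeds directly into csv.DictWriter. Pick the list delimiter carefully: it must not occur inside the values, or the round trip is lossy.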
Array handling in conversions
// JSON array → CSV options
{"user": "John", "tags": ["admin", "user"]}
// Option A: Join with delimiter
user, tags
John, "admin;user"
// Option B: Explode to multiple rows (relational)
user, tag
John, admin
John, user
// Option C: Multiple columns (only works if max length is known)
user, tag1, tag2
John, admin, user
Type coercion between formats
# CSV → JSON type inference
"123" → 123 (number) or "123" (string)? Decide up front.
"true" → true (boolean) or "true" (string)?
"2026-01-15" → keep as string or parse to date?
# Best practice: be explicit
- Use papaparse's dynamicTyping for automatic inference
- Or use a schema to specify column types
- Force zip codes and phone numbers to strings: dtype={'zip': str} in pandas
# YAML → JSON type coercion surprises (unquoted values)
yes → true (YAML 1.1) or the string "yes" (YAML 1.2)
1.0 → 1.0 (float)
2026-03-27 → date object in some parsers, string in others
Special character handling
// Same string in each format
// Input: She said, "Hello & Goodbye"
// CSV (comma delimiter, double-quote escape)
"She said, ""Hello & Goodbye"""
// JSON
"She said, \"Hello & Goodbye\""
// XML (the ampersand must be escaped)
<msg>She said, "Hello &amp; Goodbye"</msg>
// YAML
msg: 'She said, "Hello & Goodbye"'
Common Pitfalls
1. Assuming CSV is simple to parse
Don't split on commas with str.split(','). That breaks on quoted fields containing commas. Always use a proper CSV library.
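The failure mode is easy to demonstrate side by side:

```python
import csv

line = 'Bob Johnson,45,"Chicago, IL",true'

naive = line.split(",")            # splits inside the quoted field
proper = next(csv.reader([line]))  # respects the quoting

print(naive)   # ['Bob Johnson', '45', '"Chicago', ' IL"', 'true'] — 5 broken fields
print(proper)  # ['Bob Johnson', '45', 'Chicago, IL', 'true'] — 4 correct fields
```

The naive version both splits one field in two and leaves stray quote characters in the data, and the error only surfaces on rows that happen to contain a comma.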
2. Losing data types at the CSV boundary
If you export JSON to CSV and re-import later, true becomes the string "true", integers become strings, and null might become an empty string or the literal text "null". Document the type expectations or use a schema.
3. YAML boolean traps
Country codes like NO, YES, ON, OFF become booleans in YAML 1.1 parsers. Version strings like 1.0 become floats. Always quote values you intend as strings when there's any ambiguity.
4. XML attribute vs element ambiguity
Converting JSON to XML requires deciding whether to use attributes or elements. Converting back to JSON loses attribute vs element distinction unless you encode it. Establish and document a convention if you'll round-trip between the two formats.
5. Large number precision in JSON
JavaScript's JSON.parse() silently corrupts integers larger than 9,007,199,254,740,991. Database primary keys from PostgreSQL's BIGINT or Twitter-style snowflake IDs can exceed this. Use strings for large IDs in JSON APIs.
6. Non-UTF-8 encoding
Old CSV exports from Windows Excel default to Windows-1252 encoding. Old XML files from enterprise systems sometimes use ISO-8859-1. If you see garbled characters (Ã© instead of é), the encoding is wrong. Convert to UTF-8 first: Text Encoding Converter.
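That Ã© pattern is the signature of UTF-8 bytes being decoded as Windows-1252, and it can often be reversed. A quick sketch:

```python
# Simulate the classic mojibake: UTF-8 bytes decoded as Windows-1252
original = "café"
garbled = original.encode("utf-8").decode("windows-1252")
print(garbled)  # cafÃ©

# Repair by reversing the wrong decode, then standardize on UTF-8
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == original)  # True
```

This round trip works for common Western European text; some byte sequences are undefined in Windows-1252, so badly mangled files may not be fully recoverable this way.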
7. Dates without timezones
A date string like 2026-03-27 14:30:00 is ambiguous. Is that UTC? Local server time? The user's timezone? Always include timezone information. Use ISO 8601 with explicit UTC offset: 2026-03-27T14:30:00Z or 2026-03-27T14:30:00+02:00.
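Producing and consuming such timestamps is straightforward with timezone-aware datetime objects:

```python
from datetime import datetime, timezone

# Timezone-aware timestamp, serialized with an explicit UTC offset
ts = datetime(2026, 3, 27, 14, 30, tzinfo=timezone.utc)
print(ts.isoformat())  # 2026-03-27T14:30:00+00:00 ('+00:00' is equivalent to 'Z')

# Parsing keeps the offset, so arithmetic across zones stays correct
parsed = datetime.fromisoformat("2026-03-27T14:30:00+02:00")
print(parsed.utcoffset())  # 2:00:00
```

One caveat: datetime.fromisoformat only accepts the trailing Z form from Python 3.11 on; older versions need the numeric +00:00 offset or a third-party parser.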