JSON vs XML vs YAML vs CSV vs TOML: Data Formats Compared

A practical guide to the most common data interchange formats — what each one looks like, where it fits, how to parse it, and how to convert between them without losing data.

CSV — Comma-Separated Values

CSV is the simplest format for tabular data. Each line is a row, each value separated by a delimiter (usually comma, sometimes tab or semicolon). Despite its age and limitations, CSV is still the most universally accepted format for data exchange with non-technical users and analytics tools.

name,age,city,active
John Doe,32,New York,true
Jane Smith,28,Los Angeles,false
Bob Johnson,45,"Chicago, IL",true

What CSV actually handles

  • Fields with the delimiter inside must be quoted: "Chicago, IL"
  • Fields with quotes inside double them: "She said, ""Hello"""
  • Line breaks inside a field: allowed if the field is quoted
  • All values are strings — there is no integer or boolean type
  • No nesting — one level of rows and columns only
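These rules are fiddly to implement by hand; a library applies them automatically. A quick round-trip sketch with Python's csv module:

```python
import csv, io

# A row with a delimiter, quotes, and a newline inside fields
original = ['Bob Johnson', 'Chicago, IL', 'She said, "Hello"', 'line 1\nline 2']

buf = io.StringIO()
csv.writer(buf).writerow(original)

# Read it back; quoting and escaping are handled transparently
buf.seek(0)
row = next(csv.reader(buf))
print(row == original)   # True
```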

Delimiter variations

European Excel defaults to semicolons because the comma is the decimal separator in many locales. Tab-separated (TSV) is common in bioinformatics. Pipe-separated appears in legacy mainframe exports. When parsing, always confirm the delimiter — don't assume comma.

# European Excel format
name;price;quantity
Apple;1,50;10

# Tab-separated
name\tprice\tquantity
Apple\t1.50\t10
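Rather than assuming, you can detect the delimiter from a sample of the file. One option is Python's csv.Sniffer:

```python
import csv

sample = "name;price;quantity\nApple;1,50;10\nPear;2,00;5\n"

# Sniffer inspects a sample and guesses the dialect
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
print(dialect.delimiter)   # ;

rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])             # ['Apple', '1,50', '10']
```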

Parsing CSV

# Python — the csv module handles quoting correctly
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Node.js — papaparse handles edge cases well
import Papa from 'papaparse';

const result = Papa.parse(csvText, {
  header: true,
  dynamicTyping: true,  // auto-detect numbers and booleans
  skipEmptyLines: true
});

# R
data <- read.csv('data.csv', stringsAsFactors = FALSE)

# Pandas
import pandas as pd
df = pd.read_csv('data.csv', dtype={'zip': str})  # force zip to string

Convert CSV to other formats: CSV to JSON, CSV to XML, CSV to YAML, CSV to Markdown Table.

JSON — JavaScript Object Notation

JSON is the dominant format for REST APIs and modern web applications. It maps directly to data structures in virtually every programming language, and browsers can parse it natively without a library.

{
  "users": [
    {
      "id": 1,
      "name": "John Doe",
      "active": true,
      "address": {
        "city": "New York",
        "zip": "10001"
      },
      "tags": ["admin", "developer"],
      "lastLogin": "2026-03-15T10:30:00Z",
      "phoneNumber": null
    }
  ],
  "totalCount": 1,
  "hasNextPage": false
}

JSON's six data types

  • string — always double-quoted: "hello"
  • number — integer or float, no quotes: 42, 3.14
  • boolean — lowercase: true, false
  • null — lowercase: null
  • array — ordered list: [1, 2, 3]
  • object — key-value map: {"key": "value"}

There is no date type. Use ISO 8601 strings: "2026-03-27T14:00:00Z". There are no comments — if you need them, use YAML or TOML for the config file instead.
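A common pattern for serializing datetimes is a fallback encoder hook; sketched here with Python's json module (the helper name is our own):

```python
import json
from datetime import datetime, timezone

def encode_datetime(obj):
    # Fallback hook: json.dumps calls this for types it cannot serialize
    if isinstance(obj, datetime):
        return obj.isoformat().replace("+00:00", "Z")
    raise TypeError(f"not JSON serializable: {type(obj)}")

event = {"name": "deploy", "at": datetime(2026, 3, 27, 14, 0, tzinfo=timezone.utc)}
print(json.dumps(event, default=encode_datetime))
# {"name": "deploy", "at": "2026-03-27T14:00:00Z"}
```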

Pitfall: large integers

JavaScript's Number type is a 64-bit float. Integers above 2^53 − 1 (9,007,199,254,740,991) lose precision. Twitter's API famously returns tweet IDs as both a number and a string for this reason. If you have database IDs or timestamps that exceed this limit, serialize them as strings.

// Safe
{"id": 9007199254740991}

// This ID loses precision in JavaScript
{"id": 9007199254740993}  // → parsed as 9007199254740992

// Safe approach: always use string for large IDs
{"id": "9007199254740993", "id_int": 9007199254740993}

Parsing JSON

// JavaScript — native, no library needed
const data = JSON.parse(jsonString);
const str = JSON.stringify(data, null, 2);  // pretty-print

// Python — standard library
import json
data = json.loads(json_string)
output = json.dumps(data, indent=2, ensure_ascii=False)

// Go — encoding/json package
import "encoding/json"
var result map[string]interface{}
json.Unmarshal([]byte(jsonStr), &result)

# Ruby — standard library
require 'json'
data = JSON.parse(json_string)
output = JSON.pretty_generate(data)

Tools: JSON Formatter, JSON Validator, JSON Schema Validator, JSON to XML, JSON to YAML.

XML — Extensible Markup Language

XML predates JSON by a decade and is still deeply embedded in enterprise systems, SOAP web services, Microsoft Office files (.docx, .xlsx are ZIP archives of XML), RSS feeds, and SVG graphics. It's verbose, but it has capabilities JSON lacks: element attributes, namespaces, XPath queries, XSLT transformations, and mature schema validation via XSD.

<?xml version="1.0" encoding="UTF-8"?>
<users xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <user id="1" active="true">
    <name>John Doe</name>
    <address>
      <city>New York</city>
      <zip>10001</zip>
    </address>
    <tags>
      <tag>admin</tag>
      <tag>developer</tag>
    </tags>
    <lastLogin>2026-03-15T10:30:00Z</lastLogin>
  </user>
</users>

Attributes vs elements

XML lets you put data either as element attributes (<user id="1">) or as child elements (<id>1</id>). The convention: use attributes for metadata that describes the element (IDs, types, flags), use child elements for the actual data content. There's no technical difference — it's a design choice. Pick one convention and document it.

Special characters in XML

Character | Entity | Use
<         | &lt;   | Less than, tag start
>         | &gt;   | Greater than
&         | &amp;  | Ampersand
"         | &quot; | In attribute values
'         | &apos; | In attribute values

Alternatively, wrap content in <![CDATA[ ... ]]> to include arbitrary text without escaping.
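If you ever build XML by string concatenation (risky, but common), escape these characters with a library helper rather than by hand. Python's standard library provides one:

```python
from xml.sax.saxutils import escape, quoteattr

text = 'She said, "Hello & Goodbye" <loudly>'
print(escape(text))
# She said, "Hello &amp; Goodbye" &lt;loudly&gt;

# quoteattr additionally wraps and escapes a value for attribute use
print(quoteattr('5 < 6'))   # "5 &lt; 6"
```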

Parsing XML

# Python — standard library ElementTree (DOM parsing)
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
for user in root.findall('user'):
    name = user.find('name').text
    user_id = user.get('id')  # attribute

# Python — lxml for XPath and namespaces
from lxml import etree
tree = etree.parse('data.xml')
names = tree.xpath('//user[@active="true"]/name/text()')

# Node.js — fast-xml-parser
import { XMLParser } from 'fast-xml-parser';
const parser = new XMLParser({ ignoreAttributes: false });
const result = parser.parse(xmlString);

# Java — JAXB or DOM
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File("data.xml"));

Tools: XML Formatter, XML to JSON, XML to CSV, XML to YAML.

YAML — YAML Ain't Markup Language

YAML was designed to be human-readable above all else. It's the format of choice for configuration files: Docker Compose, Kubernetes manifests, GitHub Actions workflows, Ansible playbooks, and many more. It supports comments (unlike JSON), multi-line strings, and anchors for reuse.

# Comments are supported — a major advantage over JSON
users:
  - id: 1
    name: John Doe
    active: true
    address:
      city: New York
      zip: "10001"    # Quoted to prevent YAML from treating it as integer
    tags:
      - admin
      - developer
    lastLogin: 2026-03-15T10:30:00Z
  - id: 2
    name: Jane Smith
    active: false
    address: null
    tags: []

database:
  host: localhost
  port: 5432

YAML features JSON doesn't have

# Multi-line string (literal block, preserves newlines)
description: |
  First line.
  Second line.
  Third line.

# Folded block (newlines become spaces)
summary: >
  This long text will be
  folded into one line.

# Anchors and aliases (define once, reuse)
defaults: &defaults
  timeout: 30
  retries: 3
  log_level: info

production:
  <<: *defaults       # merge defaults
  host: prod.example.com
  log_level: warn     # override one value

staging:
  <<: *defaults
  host: staging.example.com

YAML gotchas

YAML's flexibility creates parsing surprises. Values that look like other types get auto-converted:

# These are NOT strings in YAML 1.1
enabled: yes          # → boolean true
country: NO           # → boolean false (Norway's ISO code!)
version: 1.0          # → float
zip: 08012            # → integer 8012 (loses leading zero)
date: 2026-03-27      # → date object in some parsers

# Fix: quote explicitly
country: "NO"
zip: "08012"

YAML 1.2 (the current spec) removed the yes/no/on/off boolean variants, but many parsers still implement YAML 1.1 behavior. When in doubt, quote your string values.
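You can observe the YAML 1.1 behavior directly with PyYAML, which still resolves scalars by the 1.1 rules:

```python
import yaml  # PyYAML resolves scalars with YAML 1.1 rules

print(yaml.safe_load("country: NO"))     # {'country': False}, Norway became a boolean
print(yaml.safe_load('country: "NO"'))   # {'country': 'NO'}, quoting keeps the string
print(yaml.safe_load("enabled: yes"))    # {'enabled': True}
```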

Security: YAML parsing arbitrary objects

Some YAML parsers support language-specific object deserialization. This is a known attack vector — a maliciously crafted YAML file can execute arbitrary code. Always use safe loaders:

# Python — NEVER use yaml.load() on untrusted input
import yaml
# Unsafe:
data = yaml.load(untrusted_input)  # can execute arbitrary code
# Safe:
data = yaml.safe_load(untrusted_input)
# Or use the newer API:
data = yaml.load(untrusted_input, Loader=yaml.SafeLoader)

Tools: YAML Formatter, YAML to JSON, JSON to YAML, YAML to XML.

TOML — Tom's Obvious, Minimal Language

TOML was created specifically for configuration files. It's more explicit than YAML (no ambiguous auto-conversion), cleaner than INI (supports typed values and nested sections), and easier to read than JSON (supports comments). Rust's package manager uses it for Cargo.toml, Python's packaging standard uses it for pyproject.toml, and Hugo uses it as the default config format.

# TOML config example
title = "My Application"
version = "2.1.0"
debug = false

[database]
host = "localhost"
port = 5432
name = "myapp_production"
timeout = 30.0          # float

[database.pool]
min_connections = 5
max_connections = 20

[[servers]]             # array of tables
name = "web-01"
ip = "10.0.1.1"
roles = ["web", "proxy"]

[[servers]]
name = "web-02"
ip = "10.0.1.2"
roles = ["web"]

[logging]
level = "warn"
file = "/var/log/app.log"
rotate_daily = true
created_at = 2026-01-15T00:00:00Z  # native datetime type

TOML native types

TOML has explicit types for dates and times, which YAML only handles in some implementations:

Type         | Example
String       | "hello"
Integer      | 42
Float        | 3.14, 6.626e-34
Boolean      | true, false
Datetime     | 2026-03-27T14:30:00Z
Local date   | 2026-03-27
Array        | [1, 2, 3]
Inline table | {host = "localhost", port = 5432}

Convert: TOML to JSON, JSON to TOML.

Side-by-Side Comparison

The same data structure in each format, so you can compare verbosity and syntax:

JSON
{
  "server": {
    "host": "localhost",
    "port": 8080,
    "debug": true,
    "tags": ["web", "api"]
  }
}

YAML
# Server configuration
server:
  host: localhost
  port: 8080
  debug: true
  tags:
    - web
    - api

TOML
[server]
host = "localhost"
port = 8080
debug = true
tags = ["web", "api"]

XML
<server debug="true">
  <host>localhost</host>
  <port>8080</port>
  <tags>
    <tag>web</tag>
    <tag>api</tag>
  </tags>
</server>

Feature              | CSV          | JSON          | XML                 | YAML          | TOML
Human readable       | High         | High          | Medium              | Very high     | Very high
Comments             | No           | No            | Yes                 | Yes           | Yes
Nesting              | No           | Yes           | Yes                 | Yes           | Yes
Native data types    | Strings only | 6 types       | Strings (XSD types) | 11 types      | 9 types + datetime
Schema validation    | None         | JSON Schema   | XSD, DTD, RelaxNG   | Limited       | None
File size (relative) | Smallest     | Small         | Large               | Medium        | Medium
Parse speed          | Fastest      | Fast          | Slowest             | Medium        | Medium
Library support      | Universal    | Universal     | Universal           | Wide          | Growing
Binary data          | No           | Base64 string | Base64 / CDATA      | Base64 string | No

When to Pick Each Format

CSV

Data that's naturally rows and columns with no nesting. Reports, contact lists, financial exports, database table dumps, anything that needs to open in Excel. It's also the right choice when your audience includes non-technical users who will open the file in a spreadsheet.

Examples: e-commerce order exports, survey results, analytics reports, address books, product catalogs for flat data.

JSON

REST APIs, anything JavaScript touches, NoSQL database payloads (MongoDB, DynamoDB), event streaming (Kafka messages), and modern app configuration when you don't need comments. JSON is the right default for new systems unless there's a specific reason to choose otherwise.

Examples: REST API responses, GraphQL variables, Slack/webhook payloads, React component data, Firebase documents.

XML

SOAP web services, document workflows (legal, publishing), feed syndication (RSS/Atom), SVG graphics, Microsoft Office formats, and enterprise systems built before JSON became mainstream. Also use XML when you need namespace support or XSLT transformations.

Examples: SOAP APIs, DocBook technical documentation, RSS feeds, JATS academic publishing, Android resources, Maven POM files.

YAML

Configuration files that humans write and maintain. Kubernetes manifests, Docker Compose files, CI/CD pipelines, Ansible playbooks, Jekyll/Hugo site configs. YAML's comments and anchors make complex configurations maintainable in a way JSON can't match.

Examples: docker-compose.yml, .github/workflows/*.yml, kubernetes-deployment.yaml, _config.yml, Ansible roles.

TOML

Project and application configuration files, especially in ecosystems that have adopted it as a standard. If you're writing Rust, Python packaging, or Hugo, TOML is already expected. It's a better choice than YAML for config files because its type system is more explicit — you won't get surprised by yes becoming true.

Examples: Cargo.toml, pyproject.toml, hugo.toml, .taplo.toml.

Parsing Libraries

Format | Python                        | JavaScript/Node         | Go                         | Java
CSV    | csv (stdlib), pandas          | papaparse, csv-parse    | encoding/csv               | OpenCSV, Apache Commons CSV
JSON   | json (stdlib)                 | JSON (native)           | encoding/json              | Jackson, Gson
XML    | ElementTree (stdlib), lxml    | fast-xml-parser, xml2js | encoding/xml               | JAXB, DOM4J
YAML   | PyYAML, ruamel.yaml           | js-yaml, yaml           | gopkg.in/yaml.v3           | SnakeYAML
TOML   | tomllib (stdlib 3.11+), tomli | @iarna/toml, smol-toml  | github.com/BurntSushi/toml | toml4j

Streaming vs DOM Parsing

For most files under 50 MB, load the whole document into memory (DOM parsing). It's simpler and most libraries default to this approach. For larger files — production database exports, log archives, data pipeline inputs — streaming parsers let you process records one at a time without loading everything into memory.

Streaming JSON with NDJSON

NDJSON (Newline-Delimited JSON) puts one JSON object per line, making it naturally streamable. This is what many log aggregators (Splunk, Elasticsearch) and data pipelines use:

{"id":1,"name":"John","event":"login","ts":"2026-03-27T10:00:00Z"}
{"id":2,"name":"Jane","event":"purchase","ts":"2026-03-27T10:01:23Z"}
{"id":3,"name":"Bob","event":"logout","ts":"2026-03-27T10:05:11Z"}
# Python streaming NDJSON
import json

with open('events.ndjson') as f:
    for line in f:
        event = json.loads(line.strip())
        process(event)

# Node.js streaming JSON with JSONStream
import JSONStream from 'JSONStream';
import { createReadStream } from 'fs';

createReadStream('large-array.json')
  .pipe(JSONStream.parse('*'))  // parse each array element
  .on('data', (item) => process(item));

Streaming XML with SAX

# Python SAX parser — processes XML without loading it all into memory
import xml.sax

class UserHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "user":
            self.current_id = attrs.get("id")

    def characters(self, content):
        self.current_content = content

xml.sax.parse("large-export.xml", UserHandler())

Convert to NDJSON: JSON to NDJSON, NDJSON to JSON.

Schema Validation

Schema validation catches format errors before they cause runtime problems. Each format has its own approach:

JSON Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "age"],
  "properties": {
    "name": {
      "type": "string",
      "minLength": 1,
      "maxLength": 100
    },
    "age": {
      "type": "integer",
      "minimum": 0,
      "maximum": 150
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" },
      "uniqueItems": true
    }
  }
}

# Python validation with jsonschema
import jsonschema, json

schema = json.load(open('schema.json'))
data = json.load(open('data.json'))

try:
    jsonschema.validate(instance=data, schema=schema)
    print("Valid!")
except jsonschema.ValidationError as e:
    print(f"Invalid: {e.message}")

// Node.js with Ajv (fastest JSON Schema validator)
import Ajv from 'ajv';
const ajv = new Ajv();
const validate = ajv.compile(schema);
const valid = validate(data);
if (!valid) console.log(validate.errors);

XML Schema Definition (XSD)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="user">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:positiveInteger"/>
        <xs:element name="email" type="xs:string" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

Validate JSON: JSON Schema Validator. Format and check XML: XML Formatter.

Converting Between Formats

Handling nested data going to CSV

CSV has no nesting. When you convert JSON to CSV, nested objects must be flattened. Three strategies:

// Input JSON
{
  "user": {
    "name": "John",
    "address": {
      "city": "NYC",
      "zip": "10001"
    },
    "tags": ["admin", "dev"]
  }
}

// Strategy 1: Dot notation (most common)
user.name, user.address.city, user.address.zip, user.tags
John, NYC, 10001, admin;dev

// Strategy 2: Serialize nested as JSON string
user.name, user.address, user.tags
John, "{""city"":""NYC"",""zip"":""10001""}", "[""admin"",""dev""]"

// Strategy 3: Normalize into related tables
// users.csv: id, name
// addresses.csv: user_id, city, zip
// user_tags.csv: user_id, tag
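A minimal sketch of strategy 1: recursive flattening into dot-notation keys, with arrays joined on a semicolon (both the key syntax and the join character are conventions, not standards):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-notation keys; join lists with ';'."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            flat[path] = ";".join(str(v) for v in value)
        else:
            flat[path] = value
    return flat

record = {"name": "John", "address": {"city": "NYC", "zip": "10001"}, "tags": ["admin", "dev"]}
print(flatten(record))
# {'name': 'John', 'address.city': 'NYC', 'address.zip': '10001', 'tags': 'admin;dev'}
```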

Array handling in conversions

// JSON array → CSV options
{"user": "John", "tags": ["admin", "user"]}

// Option A: Join with delimiter
user, tags
John, "admin;user"

// Option B: Explode to multiple rows (relational)
user, tag
John, admin
John, user

// Option C: Multiple columns (only works if max length is known)
user, tag1, tag2
John, admin, user

Type coercion between formats

# CSV → JSON type inference
"123"     → 123 (number) or "123" (string)? Decide up front.
"true"    → true (boolean) or "true" (string)?
"2026-01-15" → keep as string or parse to date?

# Best practice: be explicit
- Use papaparse's dynamicTyping for automatic inference
- Or use a schema to specify column types
- Force zip codes and phone numbers to strings: dtype={'zip': str} in pandas

# YAML → JSON type coercion surprises
"yes" → true (YAML 1.1) or "yes" (YAML 1.2)
"1.0" → 1.0 (float)
"2026-03-27" → date object in some parsers, string in others

Special character handling

// Same string in each format
// Input: She said, "Hello & Goodbye"

// CSV (comma delimiter, double-quote escape)
"She said, ""Hello & Goodbye"""

// JSON
"She said, \"Hello & Goodbye\""

// XML
<msg>She said, &quot;Hello &amp; Goodbye&quot;</msg>

// YAML
msg: 'She said, "Hello & Goodbye"'

Common Pitfalls

1. Assuming CSV is simple to parse

Don't split on commas with str.split(','). That breaks on quoted fields containing commas. Always use a proper CSV library.
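A minimal repro of why the naive approach fails, next to the correct parse:

```python
import csv, io

line = 'Bob Johnson,45,"Chicago, IL",true'

# Naive split breaks the quoted field in half
print(line.split(','))
# ['Bob Johnson', '45', '"Chicago', ' IL"', 'true']

# A real CSV parser honors the quoting
print(next(csv.reader(io.StringIO(line))))
# ['Bob Johnson', '45', 'Chicago, IL', 'true']
```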

2. Losing data types at the CSV boundary

If you export JSON to CSV and re-import later, true becomes the string "true", integers become strings, and null might become an empty string or the literal text "null". Document the type expectations or use a schema.

3. YAML boolean traps

Country codes like NO, YES, ON, OFF become booleans in YAML 1.1 parsers. Version strings like 1.0 become floats. Always quote values you intend as strings when there's any ambiguity.

4. XML attribute vs element ambiguity

Converting JSON to XML requires deciding whether to use attributes or elements. Converting back to JSON loses attribute vs element distinction unless you encode it. Establish and document a convention if you'll round-trip between the two formats.

5. Large number precision in JSON

JavaScript's JSON.parse() silently corrupts integers larger than 9,007,199,254,740,991. Database primary keys from PostgreSQL's BIGINT or Twitter-style snowflake IDs can exceed this. Use strings for large IDs in JSON APIs.

6. Non-UTF-8 encoding

Old CSV exports from Windows Excel default to Windows-1252 encoding. Old XML files from enterprise systems sometimes use ISO-8859-1. If you see garbled characters (Ã© instead of é), the encoding is wrong. Convert to UTF-8 first: Text Encoding Converter.
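The symptom is easy to reproduce: UTF-8 bytes decoded with the wrong codec. An illustration in Python:

```python
# 'é' in UTF-8 is two bytes (0xC3 0xA9); decoded as Windows-1252 each
# byte becomes its own character, producing the telltale mojibake
raw = "café".encode("utf-8")
print(raw.decode("cp1252"))   # cafÃ©

# The fix: decode with the codec the file was actually written in
print(raw.decode("utf-8"))    # café
```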

7. Dates without timezones

A date string like 2026-03-27 14:30:00 is ambiguous. Is that UTC? Local server time? The user's timezone? Always include timezone information. Use ISO 8601 with explicit UTC offset: 2026-03-27T14:30:00Z or 2026-03-27T14:30:00+02:00.
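Timezone-aware datetimes produce the explicit offset automatically; a sketch with Python's standard library:

```python
from datetime import datetime, timezone, timedelta

# Timezone-aware datetimes carry an explicit offset in ISO 8601 output
utc = datetime(2026, 3, 27, 14, 30, tzinfo=timezone.utc)
print(utc.isoformat())    # 2026-03-27T14:30:00+00:00

cest = datetime(2026, 3, 27, 14, 30, tzinfo=timezone(timedelta(hours=2)))
print(cest.isoformat())   # 2026-03-27T14:30:00+02:00
```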

All conversion tools on ToolsDock run in your browser. Your data stays on your device.
