JSON vs XML vs YAML vs CSV vs TOML: Data Formats Compared
A practical guide to the most common data interchange formats — what each one looks like, where it fits, how to parse it, and how to convert between them without losing data.
CSV — Comma-Separated Values
CSV is the simplest format for tabular data. Each line is a row, each value separated by a delimiter (usually comma, sometimes tab or semicolon). Despite its age and limitations, CSV is still the most universally accepted format for data exchange with non-technical users and analytics tools.
name,age,city,active
John Doe,32,New York,true
Jane Smith,28,Los Angeles,false
Bob Johnson,45,"Chicago, IL",true
What CSV actually handles
- Fields with the delimiter inside must be quoted: "Chicago, IL"
- Fields with quotes inside double them: "She said, ""Hello"""
- Line breaks inside a field: allowed if the field is quoted
- All values are strings — there is no integer or boolean type
- No nesting — one level of rows and columns only
Delimiter variations
European Excel defaults to semicolons because the comma is the decimal separator in many locales. Tab-separated (TSV) is common in bioinformatics. Pipe-separated appears in legacy mainframe exports. When parsing, always confirm the delimiter — don't assume comma.
# European Excel format
name;price;quantity
Apple;1,50;10
# Tab-separated
name\tprice\tquantity
Apple\t1.50\t10
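If the delimiter is unknown, the standard library can usually guess it from a sample. A minimal sketch using csv.Sniffer — detect_delimiter is a hypothetical helper name, and restricting the candidate delimiters makes the guess more reliable:

```python
import csv

def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from the first few lines of a CSV file."""
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter

# European-style semicolon file (note the decimal comma in the data)
print(detect_delimiter("name;price;quantity\nApple;1,50;10\n"))

# Tab-separated file
print(detect_delimiter("name\tprice\tquantity\nApple\t1.50\t10\n"))
```

In practice, read the first few kilobytes of the file, sniff, then hand the detected delimiter to your parser.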
Parsing CSV
# Python — the csv module handles quoting correctly
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Node.js — papaparse handles edge cases well
import Papa from 'papaparse';

const result = Papa.parse(csvText, {
  header: true,
  dynamicTyping: true,  // auto-detect numbers and booleans
  skipEmptyLines: true
});
# R
data <- read.csv('data.csv', stringsAsFactors = FALSE)
# Pandas
import pandas as pd
df = pd.read_csv('data.csv', dtype={'zip': str}) # force zip to string
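Writing CSV is symmetric: let the library handle quoting rather than joining strings yourself. A short sketch with the standard csv module (io.StringIO stands in for a real file):

```python
import csv
import io

rows = [
    {"name": "Bob Johnson", "age": 45, "city": "Chicago, IL"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age", "city"])
writer.writeheader()
writer.writerows(rows)

# The embedded comma is quoted automatically:
# name,age,city
# Bob Johnson,45,"Chicago, IL"
print(buf.getvalue())
```

The same DictWriter call works unchanged with a real file opened with newline=''.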
Convert CSV to other formats: CSV to JSON, CSV to XML, CSV to YAML, CSV to Markdown Table.
JSON — JavaScript Object Notation
JSON is the dominant format for REST APIs and modern web applications. It maps directly to data structures in virtually every programming language, and browsers can parse it natively without a library.
{
  "users": [
    {
      "id": 1,
      "name": "John Doe",
      "active": true,
      "address": {
        "city": "New York",
        "zip": "10001"
      },
      "tags": ["admin", "developer"],
      "lastLogin": "2026-03-15T10:30:00Z",
      "phoneNumber": null
    }
  ],
  "totalCount": 1,
  "hasNextPage": false
}
JSON's six data types
- string — always double-quoted: "hello"
- number — integer or float, no quotes: 42, 3.14
- boolean — lowercase: true, false
- null — lowercase: null
- array — ordered list: [1, 2, 3]
- object — key-value map: {"key": "value"}
There is no date type. Use ISO 8601 strings: "2026-03-27T14:00:00Z". There are no comments — if you need them, use YAML or TOML for the config file instead.
Pitfall: large integers
JavaScript's Number type is a 64-bit float. Integers above 2^53 − 1 (9,007,199,254,740,991) lose precision. Twitter's API famously returns tweet IDs as both a number and a string for this reason. If you have database IDs or timestamps that exceed this limit, serialize them as strings.
// Safe
{"id": 9007199254740991}
// This ID loses precision in JavaScript
{"id": 9007199254740993} // → parsed as 9007199254740992
// Safe approach: always use string for large IDs
{"id": "9007199254740993", "id_int": 9007199254740993}
Parsing JSON
// JavaScript — native, no library needed
const data = JSON.parse(jsonString);
const str = JSON.stringify(data, null, 2); // pretty-print
// Python — standard library
import json
data = json.loads(json_string)
output = json.dumps(data, indent=2, ensure_ascii=False)
// Go — encoding/json package
import "encoding/json"
var result map[string]interface{}
json.Unmarshal([]byte(jsonStr), &result)
# Ruby — standard library
require 'json'
data = JSON.parse(json_string)
output = JSON.pretty_generate(data)
Tools: JSON Formatter, JSON Validator, JSON Schema Validator, JSON to XML, JSON to YAML.
XML — Extensible Markup Language
XML predates JSON by a decade and is still deeply embedded in enterprise systems, SOAP web services, Microsoft Office files (.docx, .xlsx are ZIP archives of XML), RSS feeds, and SVG graphics. It's verbose, but it has capabilities JSON lacks: element attributes, namespaces, XPath queries, XSLT transformations, and mature schema validation via XSD.
<?xml version="1.0" encoding="UTF-8"?>
<users xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <user id="1" active="true">
    <name>John Doe</name>
    <address>
      <city>New York</city>
      <zip>10001</zip>
    </address>
    <tags>
      <tag>admin</tag>
      <tag>developer</tag>
    </tags>
    <lastLogin>2026-03-15T10:30:00Z</lastLogin>
  </user>
</users>
Attributes vs elements
XML lets you put data either as element attributes (<user id="1">) or as child elements (<id>1</id>). The convention: use attributes for metadata that describes the element (IDs, types, flags), use child elements for the actual data content. There's no technical difference — it's a design choice. Pick one convention and document it.
Special characters in XML
| Character | Entity | Use |
|---|---|---|
| < | &lt; | Less than, tag start |
| > | &gt; | Greater than |
| & | &amp; | Ampersand |
| " | &quot; | In attribute values |
| ' | &apos; | In attribute values |
Alternatively, wrap content in <![CDATA[ ... ]]> to include arbitrary text without escaping.
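Rather than escaping by hand, use a library helper. Python's standard library ships escape() for element content and quoteattr() for attribute values:

```python
from xml.sax.saxutils import escape, quoteattr

text = 'She said, "Hello & Goodbye" <loudly>'

# escape() handles the characters that are always unsafe in element content
print(escape(text))

# quoteattr() returns a ready-to-embed, fully quoted attribute value,
# choosing quote characters and escaping as needed
print(quoteattr(text))
```

Most XML serializers (ElementTree, lxml) escape automatically when you set .text or attributes; these helpers matter when you build XML strings manually.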
Parsing XML
# Python — standard library ElementTree (DOM parsing)
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
for user in root.findall('user'):
    name = user.find('name').text
    user_id = user.get('id')  # attribute
# Python — lxml for XPath and namespaces
from lxml import etree
tree = etree.parse('data.xml')
names = tree.xpath('//user[@active="true"]/name/text()')
# Node.js — fast-xml-parser
import { XMLParser } from 'fast-xml-parser';
const parser = new XMLParser({ ignoreAttributes: false });
const result = parser.parse(xmlString);
# Java — JAXB or DOM
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File("data.xml"));
Tools: XML Formatter, XML to JSON, XML to CSV, XML to YAML.
YAML — YAML Ain't Markup Language
YAML was designed to be human-readable above all else. It's the format of choice for configuration files: Docker Compose, Kubernetes manifests, GitHub Actions workflows, Ansible playbooks, and many more. It supports comments (unlike JSON), multi-line strings, and anchors for reuse.
# Comments are supported — a major advantage over JSON
users:
  - id: 1
    name: John Doe
    active: true
    address:
      city: New York
      zip: "10001"  # Quoted to prevent YAML from treating it as integer
    tags:
      - admin
      - developer
    lastLogin: 2026-03-15T10:30:00Z
  - id: 2
    name: Jane Smith
    active: false
    address: null
    tags: []

database:
  host: localhost
  port: 5432
YAML features JSON doesn't have
# Multi-line string (literal block, preserves newlines)
description: |
  First line.
  Second line.
  Third line.

# Folded block (newlines become spaces)
summary: >
  This long text will be
  folded into one line.

# Anchors and aliases (define once, reuse)
defaults: &defaults
  timeout: 30
  retries: 3
  log_level: info

production:
  <<: *defaults            # merge defaults
  host: prod.example.com
  log_level: warn          # override one value

staging:
  <<: *defaults
  host: staging.example.com
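The merge behavior is easy to verify. A minimal sketch, assuming PyYAML is installed (its safe loader supports the << merge key):

```python
import yaml  # PyYAML

doc = """
defaults: &defaults
  timeout: 30
  log_level: info

production:
  <<: *defaults
  log_level: warn
"""

config = yaml.safe_load(doc)
# timeout is inherited from the anchor; log_level is overridden locally
print(config["production"])
```

Note that the alias is resolved at load time: the parsed document contains plain merged mappings, and the anchor structure is gone.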
YAML gotchas
YAML's flexibility creates parsing surprises. Values that look like other types get auto-converted:
# These are NOT strings in YAML 1.1
enabled: yes        # → boolean true
country: NO         # → boolean false (Norway's ISO code!)
version: 1.0        # → float
zip: 08012          # → integer 8012 in some parsers (leading zero lost); YAML 1.1 reads a 0 prefix as octal
date: 2026-03-27    # → date object in some parsers

# Fix: quote explicitly
country: "NO"
zip: "08012"
YAML 1.2 (the current spec) removed the yes/no/on/off boolean variants, but many parsers still implement YAML 1.1 behavior. When in doubt, quote your string values.
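The gotchas above are easy to reproduce. A quick check, assuming PyYAML (which follows YAML 1.1 resolution rules):

```python
import yaml  # PyYAML implements YAML 1.1 scalar resolution

print(yaml.safe_load("country: NO"))     # {'country': False} — boolean, not Norway
print(yaml.safe_load('country: "NO"'))   # {'country': 'NO'} — quoting keeps the string
print(yaml.safe_load("version: 1.0"))    # {'version': 1.0} — float, not a version string
```

A YAML 1.2 parser such as ruamel.yaml in pure-1.2 mode would keep NO as a string; this divergence between parsers is exactly why quoting is the safe default.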
Security: YAML parsing arbitrary objects
Some YAML parsers support language-specific object deserialization. This is a known attack vector — a maliciously crafted YAML file can execute arbitrary code. Always use safe loaders:
# Python — NEVER use an unsafe loader on untrusted input
import yaml

# Unsafe (the full Loader can construct arbitrary Python objects):
data = yaml.load(untrusted_input, Loader=yaml.Loader)  # can execute arbitrary code

# Safe:
data = yaml.safe_load(untrusted_input)

# Or, equivalently:
data = yaml.load(untrusted_input, Loader=yaml.SafeLoader)
Tools: YAML Formatter, YAML to JSON, JSON to YAML, YAML to XML.
TOML — Tom's Obvious, Minimal Language
TOML was created specifically for configuration files. It's more explicit than YAML (no ambiguous auto-conversion), cleaner than INI (supports typed values and nested sections), and easier to read than JSON (supports comments). Rust's package manager uses it for Cargo.toml, Python's packaging standard uses it for pyproject.toml, and Hugo uses it as the default config format.
# TOML config example
title = "My Application"
version = "2.1.0"
debug = false
[database]
host = "localhost"
port = 5432
name = "myapp_production"
timeout = 30.0 # float
[database.pool]
min_connections = 5
max_connections = 20
[[servers]] # array of tables
name = "web-01"
ip = "10.0.1.1"
roles = ["web", "proxy"]
[[servers]]
name = "web-02"
ip = "10.0.1.2"
roles = ["web"]
[logging]
level = "warn"
file = "/var/log/app.log"
rotate_daily = true
created_at = 2026-01-15T00:00:00Z # native datetime type
TOML native types
TOML has explicit types for dates and times, which YAML only handles in some implementations:
| Type | Example |
|---|---|
| String | "hello" |
| Integer | 42 |
| Float | 3.14, 6.626e-34 |
| Boolean | true, false |
| Datetime | 2026-03-27T14:30:00Z |
| Local date | 2026-03-27 |
| Array | [1, 2, 3] |
| Inline table | {host = "localhost", port = 5432} |
Convert: TOML to JSON, JSON to TOML.
Side-by-Side Comparison
The same data structure in each format, so you can compare verbosity and syntax:
JSON
{
  "server": {
    "host": "localhost",
    "port": 8080,
    "debug": true,
    "tags": ["web", "api"]
  }
}
YAML
# Server configuration
server:
  host: localhost
  port: 8080
  debug: true
  tags:
    - web
    - api
TOML
[server]
host = "localhost"
port = 8080
debug = true
tags = ["web", "api"]
XML
<server debug="true">
  <host>localhost</host>
  <port>8080</port>
  <tags>
    <tag>web</tag>
    <tag>api</tag>
  </tags>
</server>
| Feature | CSV | JSON | XML | YAML | TOML |
|---|---|---|---|---|---|
| Human readable | High | High | Medium | Very high | Very high |
| Comments | No | No | Yes | Yes | Yes |
| Nesting | No | Yes | Yes | Yes | Yes |
| Native data types | Strings only | 6 types | Strings (XSD types) | 11 types | 9 types + datetime |
| Schema validation | None | JSON Schema | XSD, DTD, RelaxNG | Limited | None |
| File size (relative) | Smallest | Small | Large | Medium | Medium |
| Parse speed | Fastest | Fast | Slowest | Medium | Medium |
| Library support | Universal | Universal | Universal | Wide | Growing |
| Binary data | No | Base64 string | Base64 / CDATA | !!binary tag (base64) | No |
When to Pick Each Format
CSV
Data that's naturally rows and columns with no nesting. Reports, contact lists, financial exports, database table dumps, anything that needs to open in Excel. It's also the right choice when your audience includes non-technical users who will open the file in a spreadsheet.
Examples: e-commerce order exports, survey results, analytics reports, address books, product catalogs for flat data.
JSON
REST APIs, anything JavaScript touches, NoSQL database payloads (MongoDB, DynamoDB), event streaming (Kafka messages), and modern app configuration when you don't need comments. JSON is the right default for new systems unless there's a specific reason to choose otherwise.
Examples: REST API responses, GraphQL variables, Slack/webhook payloads, React component data, Firebase documents.
XML
SOAP web services, document workflows (legal, publishing), feed syndication (RSS/Atom), SVG graphics, Microsoft Office formats, and enterprise systems built before JSON became mainstream. Also use XML when you need namespace support or XSLT transformations.
Examples: SOAP APIs, DocBook technical documentation, RSS feeds, JATS academic publishing, Android resources, Maven POM files.
YAML
Configuration files that humans write and maintain. Kubernetes manifests, Docker Compose files, CI/CD pipelines, Ansible playbooks, Jekyll/Hugo site configs. YAML's comments and anchors make complex configurations maintainable in a way JSON can't match.
Examples: docker-compose.yml, .github/workflows/*.yml, kubernetes-deployment.yaml, _config.yml, Ansible roles.
TOML
Project and application configuration files, especially in ecosystems that have adopted it as a standard. If you're writing Rust, Python packaging, or Hugo, TOML is already expected. It's a better choice than YAML for config files because its type system is more explicit — you won't get surprised by yes becoming true.
Examples: Cargo.toml, pyproject.toml, hugo.toml, .taplo.toml.
Parsing Libraries
| Format | Python | JavaScript/Node | Go | Java |
|---|---|---|---|---|
| CSV | csv (stdlib), pandas | papaparse, csv-parse | encoding/csv | OpenCSV, Apache Commons CSV |
| JSON | json (stdlib) | JSON (native) | encoding/json | Jackson, Gson |
| XML | ElementTree (stdlib), lxml | fast-xml-parser, xml2js | encoding/xml | JAXB, DOM4J |
| YAML | PyYAML, ruamel.yaml | js-yaml, yaml | gopkg.in/yaml.v3 | SnakeYAML |
| TOML | tomllib (stdlib 3.11+), tomli | @iarna/toml, smol-toml | github.com/BurntSushi/toml | toml4j |
Streaming vs DOM Parsing
For most files under 50 MB, load the whole document into memory (DOM parsing). It's simpler and most libraries default to this approach. For larger files — production database exports, log archives, data pipeline inputs — streaming parsers let you process records one at a time without loading everything into memory.
Streaming JSON with NDJSON
NDJSON (Newline-Delimited JSON) puts one JSON object per line, making it naturally streamable. This is what many log aggregators (Splunk, Elasticsearch) and data pipelines use:
{"id":1,"name":"John","event":"login","ts":"2026-03-27T10:00:00Z"}
{"id":2,"name":"Jane","event":"purchase","ts":"2026-03-27T10:01:23Z"}
{"id":3,"name":"Bob","event":"logout","ts":"2026-03-27T10:05:11Z"}
# Python streaming NDJSON
import json

with open('events.ndjson') as f:
    for line in f:
        event = json.loads(line.strip())
        process(event)

# Node.js streaming JSON with JSONStream
import JSONStream from 'JSONStream';
import { createReadStream } from 'fs';

createReadStream('large-array.json')
  .pipe(JSONStream.parse('*'))  // parse each array element
  .on('data', (item) => process(item));
Streaming XML with SAX
# Python SAX parser — processes XML without loading it all into memory
import xml.sax

class UserHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "user":
            self.current_id = attrs.get("id")

    def characters(self, content):
        self.current_content = content

xml.sax.parse("large-export.xml", UserHandler())
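A middle ground between SAX callbacks and full DOM parsing is ElementTree's iterparse, which yields complete elements as they close; clearing each element after use keeps memory bounded. A minimal sketch (io.BytesIO stands in for a large file on disk):

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"<users><user id='1'/><user id='2'/><user id='3'/></users>"

ids = []
# iterparse yields (event, element) pairs; by default only 'end' events
for event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
    if elem.tag == "user":
        ids.append(elem.get("id"))
        elem.clear()  # free the subtree once processed

print(ids)  # ['1', '2', '3']
```

This is often easier to maintain than a SAX handler because each callback sees a fully built element, not character fragments.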
Convert to NDJSON: JSON to NDJSON, NDJSON to JSON.
Schema Validation
Schema validation catches format errors before they cause runtime problems. Each format has its own approach:
JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "age"],
  "properties": {
    "name": {
      "type": "string",
      "minLength": 1,
      "maxLength": 100
    },
    "age": {
      "type": "integer",
      "minimum": 0,
      "maximum": 150
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" },
      "uniqueItems": true
    }
  }
}
# Python validation with jsonschema
import json
import jsonschema

schema = json.load(open('schema.json'))
data = json.load(open('data.json'))
try:
    jsonschema.validate(instance=data, schema=schema)
    print("Valid!")
except jsonschema.ValidationError as e:
    print(f"Invalid: {e.message}")

// Node.js with Ajv (fastest JSON Schema validator)
import Ajv from 'ajv';

const ajv = new Ajv();
const validate = ajv.compile(schema);
const valid = validate(data);
if (!valid) console.log(validate.errors);
XML Schema Definition (XSD)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="user">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:positiveInteger"/>
        <xs:element name="email" type="xs:string" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Validate JSON: JSON Schema Validator. Format and check XML: XML Formatter.
Converting Between Formats
Handling nested data going to CSV
CSV has no nesting. When you convert JSON to CSV, nested objects must be flattened. Three strategies:
// Input JSON
{
  "user": {
    "name": "John",
    "address": {
      "city": "NYC",
      "zip": "10001"
    },
    "tags": ["admin", "dev"]
  }
}
// Strategy 1: Dot notation (most common)
user.name, user.address.city, user.address.zip, user.tags
John, NYC, 10001, admin;dev
// Strategy 2: Serialize nested as JSON string
user.name, user.address, user.tags
John, "{""city"":""NYC"",""zip"":""10001""}", "[""admin"",""dev""]"
// Strategy 3: Normalize into related tables
// users.csv: id, name
// addresses.csv: user_id, city, zip
// user_tags.csv: user_id, tag
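Strategy 1 is simple enough to sketch in a few lines. Here flatten is a hypothetical helper (not a library function) that builds dot-notation keys and joins arrays with semicolons, matching the example above:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts with dot-notation keys; join lists with ';'."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, full_key + "."))
        elif isinstance(value, list):
            flat[full_key] = ";".join(map(str, value))
        else:
            flat[full_key] = value
    return flat

record = {"name": "John", "address": {"city": "NYC", "zip": "10001"}, "tags": ["admin", "dev"]}
print(flatten(record))
# {'name': 'John', 'address.city': 'NYC', 'address.zip': '10001', 'tags': 'admin;dev'}
```

The flattened dict feeds directly into csv.DictWriter. Pick the list delimiter carefully: it must not occur inside the values, or the round trip is lossy.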
Array handling in conversions
// JSON array → CSV options
{"user": "John", "tags": ["admin", "user"]}
// Option A: Join with delimiter
user, tags
John, "admin;user"
// Option B: Explode to multiple rows (relational)
user, tag
John, admin
John, user
// Option C: Multiple columns (only works if max length is known)
user, tag1, tag2
John, admin, user
Type coercion between formats
# CSV → JSON type inference
"123" → 123 (number) or "123" (string)? Decide up front.
"true" → true (boolean) or "true" (string)?
"2026-01-15" → keep as string or parse to date?
# Best practice: be explicit
- Use papaparse's dynamicTyping for automatic inference
- Or use a schema to specify column types
- Force zip codes and phone numbers to strings: dtype={'zip': str} in pandas
# YAML → JSON type coercion surprises (unquoted values)
yes → true (YAML 1.1) or the string "yes" (YAML 1.2)
1.0 → 1.0 (float)
2026-03-27 → date object in some parsers, string in others
Special character handling
// Same string in each format
// Input: She said, "Hello & Goodbye"
// CSV (comma delimiter, double-quote escape)
"She said, ""Hello & Goodbye"""
// JSON
"She said, \"Hello & Goodbye\""
// XML (the ampersand must be escaped)
<msg>She said, "Hello &amp; Goodbye"</msg>
// YAML
msg: 'She said, "Hello & Goodbye"'
Common Pitfalls
1. Assuming CSV is simple to parse
Don't split on commas with str.split(','). That breaks on quoted fields containing commas. Always use a proper CSV library.
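The failure mode is easy to demonstrate side by side:

```python
import csv

line = 'Bob Johnson,45,"Chicago, IL",true'

naive = line.split(",")            # splits inside the quoted field
proper = next(csv.reader([line]))  # respects the quoting

print(naive)   # ['Bob Johnson', '45', '"Chicago', ' IL"', 'true'] — 5 broken fields
print(proper)  # ['Bob Johnson', '45', 'Chicago, IL', 'true'] — 4 correct fields
```

The naive version both splits one field in two and leaves stray quote characters in the data, and the error only surfaces on rows that happen to contain a comma.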
2. Losing data types at the CSV boundary
If you export JSON to CSV and re-import later, true becomes the string "true", integers become strings, and null might become an empty string or the literal text "null". Document the type expectations or use a schema.
3. YAML boolean traps
Country codes like NO, YES, ON, OFF become booleans in YAML 1.1 parsers. Version strings like 1.0 become floats. Always quote values you intend as strings when there's any ambiguity.
4. XML attribute vs element ambiguity
Converting JSON to XML requires deciding whether to use attributes or elements. Converting back to JSON loses attribute vs element distinction unless you encode it. Establish and document a convention if you'll round-trip between the two formats.
5. Large number precision in JSON
JavaScript's JSON.parse() silently corrupts integers larger than 9,007,199,254,740,991. Database primary keys from PostgreSQL's BIGINT or Twitter-style snowflake IDs can exceed this. Use strings for large IDs in JSON APIs.
6. Non-UTF-8 encoding
Old CSV exports from Windows Excel default to Windows-1252 encoding. Old XML files from enterprise systems sometimes use ISO-8859-1. If you see garbled characters (Ã© instead of é), the encoding is wrong. Convert to UTF-8 first: Text Encoding Converter.
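That Ã© pattern is the signature of UTF-8 bytes being decoded as Windows-1252, and it can often be reversed. A quick sketch:

```python
# Simulate the classic mojibake: UTF-8 bytes decoded as Windows-1252
original = "café"
garbled = original.encode("utf-8").decode("windows-1252")
print(garbled)  # cafÃ©

# Repair by reversing the wrong decode, then standardize on UTF-8
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == original)  # True
```

This round trip works for common Western European text; some byte sequences are undefined in Windows-1252, so badly mangled files may not be fully recoverable this way.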
7. Dates without timezones
A date string like 2026-03-27 14:30:00 is ambiguous. Is that UTC? Local server time? The user's timezone? Always include timezone information. Use ISO 8601 with explicit UTC offset: 2026-03-27T14:30:00Z or 2026-03-27T14:30:00+02:00.
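Producing and consuming such timestamps is straightforward with timezone-aware datetime objects:

```python
from datetime import datetime, timezone

# Timezone-aware timestamp, serialized with an explicit UTC offset
ts = datetime(2026, 3, 27, 14, 30, tzinfo=timezone.utc)
print(ts.isoformat())  # 2026-03-27T14:30:00+00:00 ('+00:00' is equivalent to 'Z')

# Parsing keeps the offset, so arithmetic across zones stays correct
parsed = datetime.fromisoformat("2026-03-27T14:30:00+02:00")
print(parsed.utcoffset())  # 2:00:00
```

One caveat: datetime.fromisoformat only accepts the trailing Z form from Python 3.11 on; older versions need the numeric +00:00 offset or a third-party parser.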