From JSON Data to JSON Schema: Implementing Automatic Type Inference#

Last week, I was building an API documentation system and needed to generate JSON Schema for each endpoint. Writing Schema manually is tedious, especially with deeply nested objects. So I dug into how to automatically infer Schema from JSON data and documented the approach.

What is JSON Schema#

JSON Schema is a specification for describing the structure of JSON documents. For example, this JSON:

{
  "id": 1,
  "name": "John Doe",
  "email": "john@example.com"
}

Maps to this JSON Schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": { "type": "number" },
    "name": { "type": "string" },
    "email": { "type": "string" }
  },
  "required": ["id", "name", "email"]
}

This Schema defines:

  • The type is an object
  • Three properties: id (number), name (string), email (string)
  • All properties are required

With Schema, you can validate data, auto-generate forms, create documentation, and more.

The Core Inference Algorithm#

The core of Schema inference is type detection and recursive processing.

Basic Type Detection#

JavaScript’s typeof handles most types, but has some gotchas:

function inferType(value: unknown): string {
  if (value === null) return 'null'  // typeof null === 'object'
  if (Array.isArray(value)) return 'array'  // typeof [] === 'object'
  if (typeof value === 'object') return 'object'
  return typeof value  // 'string' | 'number' | 'boolean'
}

Key points:

  • typeof null returns 'object', needs special check
  • Arrays also return 'object', use Array.isArray() to distinguish
  • Other primitives work with typeof directly

Recursive Schema Generation#

Objects and arrays need recursive handling:

function generateSchema(obj: unknown, root = true): object {
  const type = inferType(obj)

  // Primitives: return type definition directly
  if (type === 'null') return { type: 'null' }
  if (type === 'string' || type === 'number' || type === 'boolean') {
    return { type }
  }

  // Arrays: infer items type
  if (type === 'array') {
    const arr = obj as unknown[]
    if (arr.length === 0) {
      return { type: 'array', items: {} }
    }
    return {
      type: 'array',
      items: generateSchema(arr[0], false)  // Recurse on first element
    }
  }

  // Objects: recurse on each property
  if (type === 'object') {
    const objValue = obj as Record<string, unknown>
    const properties: Record<string, object> = {}
    const required: string[] = []

    for (const [key, value] of Object.entries(objValue)) {
      properties[key] = generateSchema(value, false)
      required.push(key)
    }

    const schema: Record<string, unknown> = {
      type: 'object',
      properties,
    }

    if (root) {
      schema['$schema'] = 'http://json-schema.org/draft-07/schema#'
    }

    if (required.length > 0) {
      schema.required = required
    }

    return schema
  }

  return {}
}

Algorithm highlights:

  1. Root marker: Only the root needs the $schema field
  2. Array handling: Infer items type from the first element (simplified)
  3. Object handling: Iterate all properties, recurse for each
  4. Required fields: Default to all properties being required

Handling Complex Scenarios#

The basic algorithm works for simple cases, but real-world data has edge cases.

1. Mixed Array Types#

[1, "hello", true]

The simplified approach only checks the first element, yielding:

{ "type": "array", "items": { "type": "number" } }

That’s wrong. The fix is to scan all elements and merge types:

function inferArrayType(arr: unknown[]): object {
  const types = new Set<string>()
  
  for (const item of arr) {
    types.add(inferType(item))
  }

  if (types.size === 1) {
    return { type: 'array', items: { type: [...types][0] } }
  }

  return {
    type: 'array',
    items: {
      anyOf: [...types].map(t => ({ type: t }))
    }
  }
}

2. Circular Reference Detection#

Objects can have circular references:

const obj = { a: 1 }
obj.self = obj

Direct recursion would loop forever. Use WeakSet to detect cycles:

function generateSchema(
  obj: unknown,
  seen = new WeakSet()
): object {
  if (obj && typeof obj === 'object') {
    if (seen.has(obj)) {
      return { type: 'object' }
    }
    seen.add(obj)
    // ... continue processing
  }
}

WeakSet won’t prevent garbage collection, making it ideal for temporary tracking.

3. Format Detection#

Many strings have specific formats that can be inferred:

function inferStringFormat(value: string): object {
  // Email
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)) {
    return { type: 'string', format: 'email' }
  }
  
  // Date-time
  if (/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/.test(value)) {
    return { type: 'string', format: 'date-time' }
  }
  
  // URI
  if (/^https?:\/\//.test(value)) {
    return { type: 'string', format: 'uri' }
  }
  
  return { type: 'string' }
}

This generates more precise Schemas with format fields for validation.

Performance Optimization#

For large JSON files, performance matters.

Avoiding Redundant Computation#

Cache type inference results:

const typeCache = new WeakMap<object, string>()

function inferTypeCached(value: unknown): string {
  if (value && typeof value === 'object') {
    if (typeCache.has(value)) {
      return typeCache.get(value)!
    }
    const type = inferType(value)
    typeCache.set(value, type)
    return type
  }
  return inferType(value)
}

WeakMap doesn’t block garbage collection, perfect for caching.

Practical Application#

Based on these principles, I built an online tool: JSON Schema Generator

Key features:

  • Automatic type and structure inference
  • Nested objects and arrays support
  • Draft-07 standard Schema output
  • One-click copy

The implementation isn’t complex, but handling edge cases properly takes careful thought. Hope this helps.


Related tools: JSON Formatter | JSON Validator