From JSON Data to JSON Schema: Implementing Automatic Type Inference
From JSON Data to JSON Schema: Implementing Automatic Type Inference#
Last week, I was building an API documentation system and needed to generate JSON Schema for each endpoint. Writing Schema manually is tedious, especially with deeply nested objects. So I dug into how to automatically infer Schema from JSON data and documented the approach.
What is JSON Schema#
JSON Schema is a specification for describing the structure of JSON documents. For example, this JSON:
{
"id": 1,
"name": "John Doe",
"email": "john@example.com"
}
Maps to this JSON Schema:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"id": { "type": "number" },
"name": { "type": "string" },
"email": { "type": "string" }
},
"required": ["id", "name", "email"]
}
This Schema defines:
- The type is an object
- Three properties: id (number), name (string), email (string)
- All properties are required
With Schema, you can validate data, auto-generate forms, create documentation, and more.
The Core Inference Algorithm#
The core of Schema inference is type detection and recursive processing.
Basic Type Detection#
JavaScript’s typeof handles most types, but has some gotchas:
function inferType(value: unknown): string {
if (value === null) return 'null' // typeof null === 'object'
if (Array.isArray(value)) return 'array' // typeof [] === 'object'
if (typeof value === 'object') return 'object'
return typeof value // 'string' | 'number' | 'boolean'
}
Key points:
typeof nullreturns'object', needs special check- Arrays also return
'object', useArray.isArray()to distinguish - Other primitives work with
typeofdirectly
Recursive Schema Generation#
Objects and arrays need recursive handling:
function generateSchema(obj: unknown, root = true): object {
const type = inferType(obj)
// Primitives: return type definition directly
if (type === 'null') return { type: 'null' }
if (type === 'string' || type === 'number' || type === 'boolean') {
return { type }
}
// Arrays: infer items type
if (type === 'array') {
const arr = obj as unknown[]
if (arr.length === 0) {
return { type: 'array', items: {} }
}
return {
type: 'array',
items: generateSchema(arr[0], false) // Recurse on first element
}
}
// Objects: recurse on each property
if (type === 'object') {
const objValue = obj as Record<string, unknown>
const properties: Record<string, object> = {}
const required: string[] = []
for (const [key, value] of Object.entries(objValue)) {
properties[key] = generateSchema(value, false)
required.push(key)
}
const schema: Record<string, unknown> = {
type: 'object',
properties,
}
if (root) {
schema['$schema'] = 'http://json-schema.org/draft-07/schema#'
}
if (required.length > 0) {
schema.required = required
}
return schema
}
return {}
}
Algorithm highlights:
- Root marker: Only the root needs the
$schemafield - Array handling: Infer
itemstype from the first element (simplified) - Object handling: Iterate all properties, recurse for each
- Required fields: Default to all properties being required
Handling Complex Scenarios#
The basic algorithm works for simple cases, but real-world data has edge cases.
1. Mixed Array Types#
[1, "hello", true]
The simplified approach only checks the first element, yielding:
{ "type": "array", "items": { "type": "number" } }
That’s wrong. The fix is to scan all elements and merge types:
function inferArrayType(arr: unknown[]): object {
const types = new Set<string>()
for (const item of arr) {
types.add(inferType(item))
}
if (types.size === 1) {
return { type: 'array', items: { type: [...types][0] } }
}
return {
type: 'array',
items: {
anyOf: [...types].map(t => ({ type: t }))
}
}
}
2. Circular Reference Detection#
Objects can have circular references:
const obj = { a: 1 }
obj.self = obj
Direct recursion would loop forever. Use WeakSet to detect cycles:
function generateSchema(
obj: unknown,
seen = new WeakSet()
): object {
if (obj && typeof obj === 'object') {
if (seen.has(obj)) {
return { type: 'object' }
}
seen.add(obj)
// ... continue processing
}
}
WeakSet won’t prevent garbage collection, making it ideal for temporary tracking.
3. Format Detection#
Many strings have specific formats that can be inferred:
function inferStringFormat(value: string): object {
// Email
if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)) {
return { type: 'string', format: 'email' }
}
// Date-time
if (/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/.test(value)) {
return { type: 'string', format: 'date-time' }
}
// URI
if (/^https?:\/\//.test(value)) {
return { type: 'string', format: 'uri' }
}
return { type: 'string' }
}
This generates more precise Schemas with format fields for validation.
Performance Optimization#
For large JSON files, performance matters.
Avoiding Redundant Computation#
Cache type inference results:
const typeCache = new WeakMap<object, string>()
function inferTypeCached(value: unknown): string {
if (value && typeof value === 'object') {
if (typeCache.has(value)) {
return typeCache.get(value)!
}
const type = inferType(value)
typeCache.set(value, type)
return type
}
return inferType(value)
}
WeakMap doesn’t block garbage collection, perfect for caching.
Practical Application#
Based on these principles, I built an online tool: JSON Schema Generator
Key features:
- Automatic type and structure inference
- Nested objects and arrays support
- Draft-07 standard Schema output
- One-click copy
The implementation isn’t complex, but handling edge cases properly takes careful thought. Hope this helps.
Related tools: JSON Formatter | JSON Validator