# PDF Merge and Split: A Complete Guide to Implementation and File Format
PDF is one of the most common document formats in our daily work. Whether it’s contract archiving, report organization, or document distribution, PDF merging and splitting are frequent requirements. Today let’s discuss the technical principles and implementation approaches behind these operations.
## PDF File Format Structure
To understand PDF merging and splitting, we first need to grasp how PDF files are organized. A standard PDF file consists of four parts:
```text
%PDF-1.7                      ← Header
1 0 obj                       ← Body (object collection)
<< /Type /Catalog ... >>
endobj
...
xref                          ← Cross-reference table
0 15
0000000000 65535 f
0000000019 00000 n
...
trailer                       ← Trailer
<< /Size 15 /Root 1 0 R >>
startxref
98765
%%EOF
```
The Header declares the PDF version, the Body stores all objects (pages, fonts, images, etc.), the xref records the byte offset of each object in the file, and the trailer points to the xref location and root object. This design allows PDF to quickly locate any object, which is the core of merge and split operations.
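To make the layout concrete, here is a minimal Python sketch of locating the xref offset by scanning the tail of the file for the `startxref` keyword. This is an illustration only, not production parsing; real files need a full tokenizer, and incrementally updated files chain multiple xref sections via `/Prev`:

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset of the last xref section.

    A PDF ends with:
        startxref\n<offset>\n%%EOF
    so we scan the tail of the file for the last 'startxref' keyword.
    """
    tail = data[-2048:]  # the trailer is required to sit near EOF
    idx = tail.rfind(b"startxref")
    if idx == -1:
        raise ValueError("not a valid PDF: no startxref found")
    # The number on the following line is the xref offset
    return int(tail[idx + len(b"startxref"):].split()[0])
```

A reader follows this offset to the xref table, then uses the table's per-object offsets to jump straight to any object without scanning the whole file.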
## How PDF Merging Works
Merging multiple PDFs isn’t simply concatenating files, because each PDF has its own object ID space. Direct concatenation would cause ID conflicts and xref invalidation. The correct approach is object renumbering + xref reconstruction:
```typescript
import { PDFDocument } from 'pdf-lib';

async function mergePdfs(pdfBuffers: Uint8Array[]): Promise<Uint8Array> {
  const mergedPdf = await PDFDocument.create();
  for (const buffer of pdfBuffers) {
    const srcPdf = await PDFDocument.load(buffer);
    // Copy all pages; pdf-lib handles object renumbering internally
    const pages = await mergedPdf.copyPages(srcPdf, srcPdf.getPageIndices());
    pages.forEach(page => mergedPdf.addPage(page));
  }
  return mergedPdf.save();
}
```
The internal workflow of pdf-lib:
- Parse each source PDF and extract the object tree
- Allocate new ID ranges for each source PDF’s objects
- Copy page objects and their dependencies (fonts, images, resources)
- Rebuild the xref table and trailer
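The renumbering step can be illustrated with a toy model. Plain dicts stand in for each source PDF's object table here; this is an assumption-laden sketch of the idea, not pdf-lib's actual data structures:

```python
def renumber_objects(doc_objects: list[dict[int, object]]) -> dict[tuple[int, int], int]:
    """Assign each (source document, old object id) a fresh id in the merged file.

    doc_objects: one {old_id: object} map per source PDF.
    Returns a mapping (doc_index, old_id) -> new_id, so that indirect
    references inside copied objects can be rewritten consistently.
    """
    mapping: dict[tuple[int, int], int] = {}
    next_id = 1  # object number 0 is reserved for the free-list head
    for doc_index, objects in enumerate(doc_objects):
        for old_id in sorted(objects):
            mapping[(doc_index, old_id)] = next_id
            next_id += 1
    return mapping
```

Once every object has a new id, each indirect reference (`3 0 R` style) inside copied objects is rewritten through this mapping, and the xref table is regenerated from the new ids and byte offsets.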
## Common Performance Pitfalls
Several issues encountered in practice:
**Duplicate shared resources**: 10 PDFs all embed the same font, so the font ends up duplicated 10 times after merging. Solution: build a resource hash table during merging and keep only one copy of identical resources.

**Cross-file object references**: some PDFs share objects across pages (such as Form XObjects), which requires maintaining an ID mapping during merging.

**Incremental update residue**: some PDFs were saved with incremental updates and carry multiple xref sections at the end of the file. Parsing must follow the chain to find the latest version of each object.
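The resource-deduplication fix from the first pitfall can be sketched as a content-addressed table. Here resources are modeled simply as byte strings (a simplification of real font/image streams):

```python
import hashlib

def dedupe_resources(resources: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Keep one copy of each identical resource stream.

    Returns (unique_streams, index_map), where index_map[i] is the index
    into unique_streams for the i-th input resource, so page dictionaries
    can be rewritten to point at the shared copy.
    """
    seen: dict[str, int] = {}
    unique: list[bytes] = []
    index_map: list[int] = []
    for stream in resources:
        digest = hashlib.sha256(stream).hexdigest()
        if digest not in seen:
            seen[digest] = len(unique)
            unique.append(stream)
        index_map.append(seen[digest])
    return unique, index_map
```

Hashing the decoded stream bytes (rather than comparing object ids) is what catches the case where ten source files each embed their own copy of the same font.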
## How PDF Splitting Works
Splitting PDFs is relatively simpler, with the core being page extraction + dependency tracking:
```typescript
async function splitPdf(pdfBuffer: Uint8Array, pagesPerFile: number): Promise<Uint8Array[]> {
  const srcPdf = await PDFDocument.load(pdfBuffer);
  const totalPages = srcPdf.getPageCount();
  const results: Uint8Array[] = [];
  for (let i = 0; i < totalPages; i += pagesPerFile) {
    const newPdf = await PDFDocument.create();
    const endPage = Math.min(i + pagesPerFile, totalPages);
    for (let j = i; j < endPage; j++) {
      const [copiedPage] = await newPdf.copyPages(srcPdf, [j]);
      newPdf.addPage(copiedPage);
    }
    results.push(await newPdf.save());
  }
  return results;
}
```
When splitting, you need to track resources the page depends on: fonts, images, color spaces, ExtGState, etc. Missing any one of these will cause rendering issues in the split PDF.
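Dependency tracking amounts to walking the transitive closure of references from each page. The sketch below uses a toy object model, where an indirect reference is a `('ref', id)` tuple and containers are plain dicts and lists, to show the traversal shape:

```python
def collect_dependencies(obj_id: int, objects: dict[int, object]) -> set[int]:
    """Collect the transitive closure of objects a page references.

    Toy model: an indirect reference is a ('ref', id) tuple; containers
    are dicts and lists. Real PDFs can contain reference cycles, which
    the 'seen' set guards against.
    """
    seen: set[int] = set()
    stack = [obj_id]

    def walk(value: object) -> None:
        if isinstance(value, tuple) and value and value[0] == "ref":
            stack.append(value[1])
        elif isinstance(value, dict):
            for v in value.values():
                walk(v)
        elif isinstance(value, list):
            for v in value:
                walk(v)

    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        walk(objects[current])
    return seen
```

Everything outside this closure (other pages, their private resources) is safe to drop from the split file; everything inside it must be copied, or the page will render with missing fonts or blank images.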
## Split by Bookmarks
A more advanced scenario is splitting by outline/bookmarks:
```typescript
import { PDFDocument, PDFName } from 'pdf-lib';

// Note: parseOutlineTree and extractPages are sketch helpers, not pdf-lib APIs.
// pdf-lib has no built-in outline reader, so the bookmark tree has to be
// walked through the low-level object model.
async function splitByOutline(pdfBuffer: Uint8Array) {
  const pdf = await PDFDocument.load(pdfBuffer);
  const outlines = pdf.catalog.lookup(PDFName.of('Outlines'));
  // Recursively parse the bookmark tree
  const bookmarks = parseOutlineTree(outlines);
  // Split based on each bookmark's target page
  return bookmarks.map(bookmark => {
    const startPage = bookmark.dest.pageIndex;
    const endPage = bookmark.next?.dest.pageIndex ?? pdf.getPageCount();
    return extractPages(pdf, startPage, endPage);
  });
}
```
## Browser-Side vs Server-Side Processing

In the browser you can use pdf-lib or pdfjs-dist directly:
| Library | Size | Features | Use Case |
|---|---|---|---|
| pdf-lib | ~300KB | Create/Edit/Merge/Split | Full-featured editing |
| pdfjs-dist | ~500KB | Render/Read/Extract text | Preview and reading |
| jspdf | ~200KB | Create PDF | Generate from scratch |
Browser-side limitations:
- Large files (>100MB) stress memory
- PDF/A compliance validation is difficult
- Limited encrypted PDF handling
Recommended server-side options:

- Node.js: `pdf-lib`, `hummus`, `pdfjs-dist`
- Python: `PyPDF2` (pure Python), `pikepdf` (QPDF binding, better performance)
- Command line: `PDFtk`, `qpdf`, `ghostscript`
```python
# PyPDF2 merge example (PyPDF2 is now maintained as pypdf; the API is similar)
from PyPDF2 import PdfMerger

merger = PdfMerger()
for pdf in ['a.pdf', 'b.pdf', 'c.pdf']:
    merger.append(pdf)
merger.write('merged.pdf')
merger.close()
```
```python
# pikepdf performs better and supports encrypted files
import pikepdf

with pikepdf.open('input.pdf') as pdf:
    # Extract pages 1-5 into a new file
    new_pdf = pikepdf.new()
    new_pdf.pages.extend(pdf.pages[0:5])
    new_pdf.save('split.pdf')
```
## Practical: Building a Drag-and-Drop PDF Merger
Core frontend logic:
```tsx
import { useState } from 'react';

function PdfMerger() {
  const [files, setFiles] = useState<File[]>([]);
  const [merging, setMerging] = useState(false);

  const handleDrop = (e: React.DragEvent) => {
    e.preventDefault();
    const pdfFiles = Array.from(e.dataTransfer.files)
      .filter(f => f.type === 'application/pdf');
    setFiles(prev => [...prev, ...pdfFiles]);
  };

  // Named handleMerge so it doesn't shadow the mergePdfs helper defined earlier
  const handleMerge = async () => {
    setMerging(true);
    try {
      const buffers = await Promise.all(files.map(f => f.arrayBuffer()));
      const merged = await mergePdfs(buffers.map(b => new Uint8Array(b)));
      // Trigger download
      const blob = new Blob([merged], { type: 'application/pdf' });
      const url = URL.createObjectURL(blob);
      const a = document.createElement('a');
      a.href = url;
      a.download = 'merged.pdf';
      a.click();
      URL.revokeObjectURL(url);
    } finally {
      setMerging(false);
    }
  };

  return (
    <div onDrop={handleDrop} onDragOver={e => e.preventDefault()}>
      {/* File list and merge button */}
    </div>
  );
}
```
## Performance Optimization Tips
- **Web Worker background processing**: PDF processing is CPU-intensive; run it in a Worker to avoid blocking the UI
- **Streaming**: use `ReadableStream` to read large files in chunks
- **Lazy-load libraries**: pdf-lib is ~300KB; pull it in with a dynamic `import()` only when the user actually needs it
- **Cache parsed results**: reuse the `PDFDocument` instance when performing multiple operations on the same PDF
## Related Tools
- PDF Tools - Online PDF merge and split
- JSON to Excel - Data format conversion
- File Hash Calculator - Verify file integrity
PDF merging and splitting may seem simple, but the file format, object management, and resource tracking involve many details. Choosing the right library and understanding the underlying principles will help you quickly locate issues when encountering edge cases. Hope this article helps you the next time you have PDF processing needs.