Semi-Structured Data
Definition
Semi-structured data occupies the space between fully structured data (database tables with defined schemas) and unstructured data (plain text, images, raw documents). It has enough organizational markers to be machine-parseable but lacks the rigid schema of relational data. Common examples include JSON documents, XML files, HTML web pages, email messages with headers and body, and knowledge base articles with metadata fields (title, author, date, category) and free-text content. In knowledge management, most documentation is semi-structured: it has consistent metadata fields but variable-length prose content.
Why It Matters
Knowledge base content is semi-structured, and AI systems that parse, index, and retrieve it need to treat it that way. Fully structured data can be queried directly with SQL, but semi-structured content needs a different strategy for each of its parts. The structured portions (metadata, headings, lists) can be extracted and indexed reliably, while the unstructured portions (prose paragraphs) require text processing and semantic understanding. An AI knowledge retrieval system must handle both to surface the most relevant information for a given user query.
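As a minimal sketch of that split, the Python below separates a header-plus-body article into a metadata dictionary and a prose string. It assumes the article is stored in an email-like layout (header lines, a blank line, then the body). The sample article and the split_article helper are hypothetical and only illustrate the idea.

```python
from __future__ import annotations

from email import message_from_string

# Hypothetical article in an email-like layout: headers, blank line, body.
RAW_ARTICLE = """\
Title: Setup Guide
Category: api

Install the client, then create an API key.
Keys can be rotated from the settings page.
"""


def split_article(raw: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body) for a header-plus-body document."""
    msg = message_from_string(raw)
    # Structured part: header fields become a lowercase-keyed dict.
    metadata = {key.lower(): value for key, value in msg.items()}
    # Unstructured part: everything after the first blank line.
    body = msg.get_payload().strip()
    return metadata, body


metadata, body = split_article(RAW_ARTICLE)
print(metadata)
print(body)
```

The metadata dictionary can go to a filterable index as-is, while the body still needs text processing before it is searchable by meaning.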
How It Works
Processing semi-structured data means parsing both the structural elements and the textual content of a document. For a knowledge base article, that means extracting the title (structured), the category (structured), and the article body (unstructured prose). AI systems use the structured metadata for filtering, for example searching only articles in a specific category, and the text content for semantic search. Chunking strategies for RAG (Retrieval-Augmented Generation) systems must account for semi-structured documents by preserving structural context: heading text stays with the paragraphs beneath it instead of being cut off at an arbitrary character count.
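One way to keep headings attached to their paragraphs is sketched below. It assumes Markdown-style "#" headings, with headings and paragraphs separated by blank lines. The chunk_by_heading function and its max_chars default are illustrative choices, not a reference implementation.

```python
from __future__ import annotations


def chunk_by_heading(markdown: str, max_chars: int = 400) -> list[dict[str, str]]:
    """Split text into chunks that each carry the heading they belong to."""
    chunks: list[dict[str, str]] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs together with the current heading.
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # A new heading closes the previous chunk.
            flush()
            heading = block.lstrip("#").strip()
        else:
            # Start a new chunk at a paragraph boundary once the size budget is hit.
            if buffer and sum(len(p) for p in buffer) + len(block) > max_chars:
                flush()
            buffer.append(block)
    flush()
    return chunks


DOC = """# Setup Guide

Install the client.

Create an API key.

# Troubleshooting

Check that the key is active."""

for chunk in chunk_by_heading(DOC):
    print(chunk["heading"], "->", repr(chunk["text"]))
```

Because a long section is split only at paragraph boundaries and every piece repeats the section heading, a retrieved chunk still says which part of the article it came from.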
Data Structure Spectrum
- Fully Structured (fixed schema): Database table. Every row follows exact columns.
- Semi-Structured (flexible schema): JSON / Markdown, for example { "title": "Setup Guide", "tags": ["api"], "draft": true }. Middle ground, best for knowledge bases.
- Unstructured (no schema): Plain text / PDF. No predictable structure.
Real-World Example
A 99helpers customer migrates their help center to a new platform that supports richer metadata on articles (product area, user role, feature name, last updated date). By upgrading their articles from plain prose to semi-structured documents with meaningful metadata tags, they enable their AI chatbot to filter knowledge retrieval by user context: the chatbot queries only articles tagged with the relevant product area when a user asks a question from a specific section of the app. Chatbot answer accuracy improves by 28% because the AI is working with a smaller, more relevant document set.
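The filter-then-retrieve pattern in this example can be sketched as follows. This is not the 99helpers implementation: the Article fields, the sample articles, and the keyword-overlap ranking (a simple stand-in for semantic search) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Article:
    title: str
    product_area: str  # structured metadata tag
    body: str          # unstructured prose


def search(articles: list[Article], query: str, product_area: str) -> list[Article]:
    """Filter by metadata first, then rank the remaining articles by text match."""
    # Step 1: the metadata filter shrinks the candidate set.
    candidates = [a for a in articles if a.product_area == product_area]
    # Step 2: naive keyword overlap, standing in for semantic search.
    terms = set(query.lower().split())
    scored = [(len(terms & set(a.body.lower().split())), a) for a in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [article for score, article in ranked if score > 0]


# Hypothetical sample data.
ARTICLES = [
    Article("Export a report", "reports", "open the report and choose export to download a csv"),
    Article("Export invoices", "billing", "open billing and choose export to download invoices"),
    Article("Schedule a report", "reports", "pick a schedule to email the report weekly"),
]

results = search(ARTICLES, "export a report", product_area="reports")
print([a.title for a in results])
```

The billing article also mentions "export", but the metadata filter removes it before ranking, which is the effect the migration relies on.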
Common Mistakes
- Treating semi-structured data as either fully structured (trying to force a rigid schema) or fully unstructured (ignoring the available metadata)
- Inconsistently applying metadata: semi-structured data is only valuable if the structural markers are applied uniformly across all documents
- Not indexing structural elements separately from text content: metadata and headings should be indexed with higher weight than body text in search systems
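The last point, weighting structural fields above body text, can be illustrated with a small scoring function. The weights and the term-count scoring are assumptions chosen for the sketch. Production search engines typically offer their own field-boosting options, which are not shown here.

```python
from __future__ import annotations

# Assumed weights: a title match counts more than a heading match,
# which counts more than a body match.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}


def weighted_score(doc: dict[str, str], query: str) -> float:
    """Sum query-term occurrences per field, scaled by that field's weight."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        score += weight * sum(words.count(term) for term in terms)
    return score


# Hypothetical documents indexed field by field.
doc_a = {"title": "API keys", "headings": "", "body": "rotate keys from settings"}
doc_b = {"title": "Settings", "body": "api keys can be rotated"}

print(weighted_score(doc_a, "api keys"))
print(weighted_score(doc_b, "api keys"))
```

Both documents mention the query terms, but doc_a scores higher because the match is in its title. That ranking is only possible when fields are indexed separately.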
Related Terms
Structured Data
Structured data is information organized in a predefined format with clear fields and types, such as tables, spreadsheets, or database records. In a knowledge base context, structured data enables precise, queryable information retrieval that complements unstructured text content.
Unstructured Data
Unstructured data is information without a predefined format or schema, such as free-form text articles, PDFs, and the body text of emails and web pages. The vast majority of organizational knowledge exists as unstructured data, making robust text processing and semantic search essential for AI knowledge retrieval systems.
Metadata Tagging
Metadata tagging is the practice of attaching structured descriptive information — such as category, product area, audience, language, and last-updated date — to knowledge base articles. Tags enable filtered search, targeted retrieval, and better AI answers by providing context beyond the article text itself.
Document Parsing
Document parsing is the extraction of structured or clean text content from various file formats — PDF, DOCX, HTML, CSV, PPTX, and more — as part of a knowledge base ingestion pipeline. A robust parser handles format-specific complexities and produces clean, well-structured text ready for chunking and indexing.
Text Chunking
Text chunking is the process of splitting long documents into smaller, focused segments before indexing them in a knowledge base. Chunk size and overlap strategy directly affect retrieval quality — chunks that are too large lose precision, while chunks that are too small lose context. Finding the right balance is a key knowledge base engineering decision.