Knowledge Base & Content Management

Sitemap Indexing

Definition

A sitemap is an XML file that lists all the URLs a website wants indexed, typically found at /sitemap.xml. For knowledge base population, sitemap indexing uses this file as the authoritative source of URLs to crawl — more reliable than following links, which may miss pages not linked from navigation. The sitemap often includes metadata like lastmod (last modified date) that can be used to identify changed pages for incremental re-ingestion. Sitemap indexing is the recommended approach for systematically importing an entire website or help center into a knowledge base.
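The structure is simple: each entry pairs a URL with optional metadata such as lastmod. A minimal sketch following the sitemaps.org schema (the URL and date are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/help/getting-started</loc>
    <lastmod>2024-06-02</lastmod>
  </url>
</urlset>
```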

Why It Matters

Sitemap indexing provides comprehensive coverage with minimal configuration. Rather than hoping a link-following crawler finds all pages, sitemap indexing starts from the definitive list of URLs the site owner wants crawled. The lastmod timestamps also enable efficient incremental updates — only re-ingesting pages that changed since the last crawl, rather than re-processing the entire site.

How It Works

The sitemap URL (typically /sitemap.xml) is fetched and parsed. The list of URLs is extracted along with any available metadata. Each URL is visited sequentially or in parallel (within rate limits), and the page content is extracted and processed through the ingestion pipeline. If lastmod timestamps are present, a delta check compares them to the previous crawl to identify only changed pages for re-ingestion.
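The parse-and-delta-check steps above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the sitemap is embedded as a string rather than fetched over HTTP, and all URLs and timestamps are hypothetical.

```python
# Parse a sitemap and select only pages changed since the last crawl.
import xml.etree.ElementTree as ET

# The sitemaps.org schema namespace, needed for element lookups.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/help/getting-started</loc>
       <lastmod>2024-06-02</lastmod></url>
  <url><loc>https://example.com/glossary/knowledge-base</loc>
       <lastmod>2024-05-10</lastmod></url>
</urlset>"""

def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) pairs; lastmod may be None."""
    root = ET.fromstring(xml_text)
    return [(u.findtext("sm:loc", namespaces=NS),
             u.findtext("sm:lastmod", namespaces=NS))
            for u in root.findall("sm:url", NS)]

def changed_since_last_crawl(entries, previous):
    """Delta check: keep URLs that are new or carry a newer lastmod.
    `previous` maps url -> lastmod string recorded at the prior crawl.
    ISO 8601 date strings compare correctly as plain strings."""
    return [url for url, lastmod in entries
            if url not in previous or (lastmod and lastmod > previous[url])]

entries = parse_sitemap(SITEMAP_XML)
previous = {"https://example.com/help/getting-started": "2024-05-01",
            "https://example.com/glossary/knowledge-base": "2024-05-10"}
to_reingest = changed_since_last_crawl(entries, previous)
print(to_reingest)  # only the getting-started page changed
```

In practice each URL in `to_reingest` would then be fetched (within rate limits) and passed through the ingestion pipeline, while unchanged pages are skipped entirely.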

[Diagram: Sitemap to Search Engine Indexing. A sitemap.xml listing URLs (/glossary/ai-chatbot, /glossary/knowledge-base, /help/getting-started, +1,240 more) is read by search engines (Google: 1,205 indexed; Bing: 1,180 indexed), driving organic traffic of +3,420 visits/month. The sitemap is updated automatically on every publish, and search engines re-crawl within 24–48 hours.]

Real-World Example

An organization has a help center with 300 articles managed in Zendesk, which generates a sitemap.xml at /sitemap.xml. The knowledge base ingestion tool uses this sitemap to discover all 300 article URLs, fetch each one, extract the article content, and index it. When the help center is updated, the tool re-reads the sitemap, identifies 12 articles with newer lastmod timestamps, and re-ingests only those 12 — keeping the knowledge base current without re-processing everything.

Common Mistakes

  • Not filtering sitemap URLs — sitemaps often include non-knowledge pages (pricing, login, homepage) that should not be ingested into the knowledge base.
  • Ignoring sitemap pagination — large sites often split their sitemap into a sitemap index file pointing to multiple sub-sitemaps.
  • Not using lastmod for incremental updates — re-ingesting the entire site on every sync wastes compute resources when only a handful of pages have changed.
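The first two mistakes can be guarded against with a couple of small checks: an allowlist filter on URL paths, and detection of sitemap index files. The tag names follow the sitemaps.org schema; the allowlist prefixes are a hypothetical example and would be tuned per site.

```python
# Guards against two common sitemap-ingestion mistakes:
# 1) ingesting non-knowledge pages, 2) missing paginated sitemaps.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical allowlist: only help-center and glossary pages are ingested.
ALLOWED_PREFIXES = ("/help/", "/glossary/")

def is_knowledge_url(url):
    """Keep only URLs under the allow-listed documentation paths,
    skipping pages like /pricing, /login, or the homepage."""
    return urlparse(url).path.startswith(ALLOWED_PREFIXES)

def is_sitemap_index(xml_text):
    """A sitemap index has a <sitemapindex> root instead of <urlset>;
    each child <sitemap><loc> is a sub-sitemap to fetch and parse in turn."""
    root = ET.fromstring(xml_text)
    return root.tag.endswith("sitemapindex")

urls = [
    "https://example.com/help/getting-started",
    "https://example.com/pricing",
    "https://example.com/glossary/ai-chatbot",
]
print([u for u in urls if is_knowledge_url(u)])
```

A crawler would call `is_sitemap_index` on the fetched root sitemap first and, if it returns true, recurse into each sub-sitemap before applying the URL filter.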

Related Terms

Content Crawler

A content crawler is an automated tool that systematically visits web pages — starting from a URL or sitemap — and extracts their content for ingestion into a knowledge base. It enables organizations to automatically populate and keep their AI knowledge base current with content published on their website or help center.

Web Scraping

Web scraping is the automated extraction of content from web pages using code — parsing HTML to pull out text, links, and structured data. In knowledge management, web scraping populates knowledge bases from existing web content and enables ongoing synchronization between a website and the AI knowledge base.

Document Ingestion

Document ingestion is the process of importing, parsing, and indexing external documents — PDFs, Word files, web pages, CSVs, and more — into a knowledge base or AI retrieval system. It transforms raw files into searchable, retrievable content that an AI can use to answer questions.

Knowledge Base

A knowledge base is a centralized repository of structured information — articles, FAQs, guides, and documentation — that an AI chatbot or support system uses to answer user questions accurately. It is the foundation of any AI-powered self-service experience, directly determining how accurate and comprehensive the bot's answers are.

Content Freshness

Content freshness refers to how current and up-to-date knowledge base articles are. Fresh content produces accurate AI answers; stale content produces confidently wrong answers. Maintaining freshness requires review workflows, expiry policies, and systematic audits that keep articles aligned with the current state of the product.
