Natural Language Processing (NLP)

Reading Comprehension

Definition

Reading comprehension (RC) in NLP evaluates a model's ability to answer questions about provided text passages. Extractive RC finds the exact answer span within the passage; abstractive RC generates a free-form answer that may synthesize multiple passage sentences. Benchmark datasets include SQuAD (Stanford Question Answering Dataset), SQuAD 2.0 (which adds unanswerable questions), and TriviaQA. BERT achieved human-level performance on SQuAD in 2018, marking a landmark in NLP. Reading comprehension is the core mechanism behind RAG question-answering systems and document-grounded chatbots.
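The extractive/abstractive distinction can be made concrete with a quick check: an extractive answer must appear verbatim as a span of the passage, while an abstractive answer need not. The passage and answers below are illustrative, not drawn from any benchmark:

```python
passage = ("The Transformer architecture was introduced in 2017 "
           "and replaced recurrence with self-attention.")

def is_extractive(answer: str, passage: str) -> bool:
    """An extractive answer is a verbatim span of the passage."""
    return answer in passage

# A span copied from the passage -> extractive
print(is_extractive("replaced recurrence with self-attention", passage))  # True
# A paraphrase synthesized by a generative model -> abstractive
print(is_extractive("It swapped RNN layers for attention.", passage))     # False
```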

Why It Matters

Reading comprehension is the fundamental capability that enables chatbots to answer questions accurately from provided documentation rather than generating plausible-sounding but potentially wrong answers. In RAG architectures, the retriever finds relevant passages and the reader (a reading comprehension model) extracts precise answers. Accurate RC determines whether a chatbot actually helps users find information or frustrates them with off-topic or incorrect responses. It is also critical for contract analysis, academic research, and compliance review systems.

How It Works

Extractive RC models use a transformer encoder (e.g., BERT) to process the question and passage as a single concatenated sequence. The model predicts the start and end positions of the answer span by outputting start/end logits for each passage token. For unanswerable questions (SQuAD 2.0 format), a separate 'no answer' score is compared against the best span score. Generative RC models (e.g., T5, GPT) take the passage as context and generate the answer token by token. Modern RAG systems combine dense retrieval with a generative reader for open-domain QA.
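The span-selection step can be sketched without a model: given per-token start and end logits, pick the (start, end) pair with the highest combined score, and abstain if a 'no answer' score beats it. The toy logits below are invented for illustration:

```python
import numpy as np

def best_span(start_logits, end_logits, no_answer_score=None, max_len=30):
    """Pick the highest-scoring (start, end) span where end >= start.
    SQuAD 2.0-style models compare the best span score against a
    'no answer' score and abstain if the latter is higher."""
    start_logits = np.asarray(start_logits)
    end_logits = np.asarray(end_logits)
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    if no_answer_score is not None and no_answer_score > best_score:
        return None  # model abstains: question treated as unanswerable
    return best

# Toy logits over a 6-token passage: tokens 2-3 form the answer span.
start = [0.1, 0.2, 3.0, 0.1, 0.0, 0.1]
end   = [0.0, 0.1, 0.5, 2.8, 0.2, 0.1]
print(best_span(start, end))                         # (2, 3)
print(best_span(start, end, no_answer_score=10.0))   # None
```

In a real system these logits come from the encoder's two output heads; the argmax-over-pairs logic is the same.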

Reading Comprehension — Multi-Question Extraction

Passage

The Amazon rainforest covers over 5.5 million km² across nine countries. It produces 20% of the world's oxygen and is home to 10% of all species on Earth. Deforestation is a major threat: approximately 17% of the forest has been lost in the past 50 years.

Questions & Extracted Answers

1. How large is the Amazon rainforest? → over 5.5 million km²
2. What percentage of oxygen does it produce? → 20%
3. How many countries does it span? → nine countries
4. How much forest has been lost? → approximately 17%

Scores: EM Score 84% · F1 Score 91% · Has Answer 96%
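EM (exact match) and F1 are the standard SQuAD evaluation metrics. A minimal sketch of how they are computed, following the SQuAD convention of normalizing case, punctuation, and articles before comparison:

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if prediction and gold answer match after normalization."""
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """Token-level F1: harmonic mean of precision and recall over tokens."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("over 5.5 million km²", "Over 5.5 million km²"))  # 1.0
print(round(f1("5.5 million km²", "over 5.5 million km²"), 2))      # 0.86
```

F1 gives partial credit when the predicted span overlaps the gold answer but is not identical, which is why F1 scores typically run higher than EM.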

Real-World Example

A technical documentation assistant uses reading comprehension to answer API integration questions. When a developer asks 'What HTTP status code does the API return for rate limit errors?', the system retrieves the error codes section from the API docs and the RC model extracts '429 Too Many Requests' as the answer span—providing a precise, verifiable answer with a link to the source passage. This approach reduces hallucination by grounding answers in retrieved documentation rather than relying on parametric model knowledge.
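The retrieve-then-read flow in this example can be sketched end to end. The section names, doc text, keyword-overlap retriever, and regex 'reader' below are toy stand-ins for a real dense retriever and extractive QA model:

```python
import re

# Hypothetical API docs, keyed by section; the text is illustrative only.
DOCS = {
    "authentication": "All requests require an API key in the Authorization header.",
    "error-codes": ("The API returns the status code 429 Too Many Requests "
                    "when a client exceeds its rate limit."),
    "pagination": "List endpoints accept page and per_page query parameters.",
}

def retrieve(question, docs):
    """Toy retriever: rank sections by word overlap with the question.
    A production system would use dense embeddings (e.g., a bi-encoder)."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda k: len(q_words & set(docs[k].lower().split())))

def read(question, passage):
    """Toy 'reader': pull out an HTTP status code plus its reason phrase.
    A real system would run an extractive QA model over the passage."""
    m = re.search(r"\d{3}( [A-Z]\w+)+", passage)
    return m.group(0) if m else None

q = "What status code does the API return for rate limit errors?"
section = retrieve(q, DOCS)
print(section, "->", read(q, DOCS[section]))  # error-codes -> 429 Too Many Requests
```

The division of labor matches the article: retrieval narrows the knowledge base to one passage, and the reader grounds the answer in that passage's text.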

Common Mistakes

  • Expecting the model to answer questions requiring information not in the provided passage—without explicit training on unanswerable examples (SQuAD 2.0-style), RC models cannot reliably detect when no answer exists
  • Using reading comprehension without retrieval—asking a model to read an entire knowledge base at once exceeds context limits
  • Ignoring the passage quality problem—garbage in, garbage out: irrelevant retrieved passages produce wrong answers
