Precision Under Pressure: A Comparative Study of Information Retrieval Methods and Evaluation Metrics in Airline Chatbot Design
Comprehensive study of information retrieval techniques and evaluation frameworks for building robust airline chatbots that balance regulatory precision, latency requirements, and multi-source coverage.


Abstract
The airline customer service domain presents unique challenges at the intersection of strict regulation and real-time operations. This study examines information retrieval methods and evaluation metrics for building robust airline chatbots that must satisfy three critical constraints: regulatory precision, tight latency requirements, and multi-source coverage. Through comprehensive analysis of retrieval techniques and evaluation frameworks, we provide actionable recommendations for implementing airline-grade conversational AI systems.
1. Problem Statement
The airline customer service domain operates at the intersection of strict regulation and real-time operations. Passengers inquire about:
- Compensation rules and passenger rights
- Baggage allowances and restrictions
- Visa and health documentation requirements
- Seat changes and upgrades
- Gate assignments and flight updates
- Rebooking options and policies
These queries often arrive in noisy, multilingual, informal language such as:
- "Can I take my surfboard on a Light fare?"
- "Do I get compensation if my Paris-Madrid flight is three hours late?"
Knowledge Base Complexity
The underlying knowledge base is heterogeneous, comprising:
- Structured FAQs - Curated question-answer pairs
- Policy manuals - Lengthy regulatory documents
- Historical chat logs - Past customer interactions
- Operational APIs - High-frequency live flight data
The Three-Way Constraint
Building an effective retrieval module requires satisfying three simultaneous demands:
- Regulatory Precision - One incorrect statement risks fines or safety violations
- Tight Latency - Passengers abandon conversations if responses exceed 1-2 seconds
- Multi-Source Coverage - Long documents, API fields, and historical queries must coexist in a unified, version-controlled index with verifiable provenance
The system must filter noise, fuse lexical and semantic signals, and deliver grounded evidence in sub-second time while supporting multiple languages and maintaining aviation compliance.
2. Research Questions
A. Information Retrieval Methods
- What information retrieval methods are reported in the literature?
- What attributes or comparison criteria are typically used to evaluate these methods?
- How do the identified methods perform across those attributes when tested?
- Which retrieval method—or combination of methods—should be selected for an airline chatbot, and why?
B. Evaluation Metrics
- What evaluation metrics are reported in the literature for measuring retrieval system quality?
- What attributes or comparison criteria are commonly applied to judge these metrics?
- How does each metric score across the chosen attributes when applied?
- Which evaluation metric—or set of metrics—should be chosen to monitor retrieval quality, and why?
3. Information Retrieval Techniques
3.1 Key Attributes for Comparing IR Techniques
We analyze core IR techniques against seven critical attributes that address the regulatory and operational challenges unique to airline customer service:
Attribute Definitions
Recall (Coverage)
- Measures the fraction of all relevant policy clauses or real-time data points surfaced by the retrieval system
- Essential for airline chatbots: omitting baggage rules, visa requirements, or gate updates can lead to non-compliant guidance
- Near-complete coverage prevents legal exposure, operational errors, and passenger frustration
Faithfulness / Hallucination Control
- Quantifies the share of generated statements grounded in retrieved evidence
- Critical in airline context: invented details like non-existent compensation clauses can incur fines or safety risks
- Techniques like Self-RAG's auto-verification minimize errors by requiring every claim to cite verifiable sources
Latency (mean & p95)
- Tracks retrieval and reranking processing time
- Mean latency shows the average response time; p95 bounds the slowest 5% of requests (tail behavior)
- Passengers expect replies within ~1 second; optimizing both metrics ensures consistent, responsive interactions
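To make the two numbers concrete, the sketch below computes mean and nearest-rank p95 from a list of per-request timings (the sample values are invented, not real measurements): a pipeline can look comfortably fast on average while its tail breaches the budget.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample covering pct% of requests."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)) - 1, 0)
    return ordered[rank]

# Illustrative per-request latencies in seconds.
latencies = [0.31, 0.42, 0.38, 0.35, 1.90, 0.40, 0.33, 0.45, 0.37, 2.10]
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s p95={percentile(latencies, 95):.2f}s")
# mean=0.70s p95=2.10s -- the average hides a tail that breaks the ~1s budget
```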
Live-Data Compatibility
- System's ability to merge static policy documents with dynamic API responses (flight status, baggage fees)
- Critical for real-time queries; failure to integrate current data results in stale or inaccurate answers
- Requires balancing freshness with latency through caching and smart API routing
Security / Compliance Filtering
- Applies metadata and content rules to block obsolete, confidential, or PII-bearing fragments
- Airlines must protect passenger data (PNRs, names, medical info) and avoid disclosing internal manuals
- Ensures compliance with GDPR, TSA/FAA requirements, and internal policies
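A minimal sketch of such a filter, applied to retrieved chunks before they reach the generator; the metadata fields (is_current, visibility, contains_pii) are hypothetical, not a real schema:

```python
def passes_compliance(chunk: dict) -> bool:
    """Drop obsolete, internal-only, or PII-bearing chunks before generation."""
    if not chunk.get("is_current", False):   # obsolete policy versions are out
        return False
    if chunk.get("visibility") != "public":  # internal manuals stay internal
        return False
    if chunk.get("contains_pii", True):      # fail closed when the flag is missing
        return False
    return True

candidates = [
    {"id": "baggage-2024-v3", "is_current": True, "visibility": "public", "contains_pii": False},
    {"id": "crew-manual-07", "is_current": True, "visibility": "internal", "contains_pii": False},
]
safe = [c for c in candidates if passes_compliance(c)]  # keeps only baggage-2024-v3
```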
Adaptability to Policy Drift
- Addresses system's ability to detect and incorporate changes in fare rules, compensation policies, or safety regulations
- Monitors for concept drift and triggers incremental re-embedding and index updates
- Keeps chatbot synchronized with latest official guidelines, avoiding outdated content citations
Explainability / Provenance
- Guarantees every chatbot assertion can be traced to its original source
- Provides document ID, chunk location, or API endpoint for each claim
- Critical for regulatory audits, legal reviews, and customer trust through verifiable information trails
3.2 Attribute Importance Table
Attribute | Why it matters for airline chatbots | Key References |
---|---|---|
Recall (Coverage) | Omitting regulation clauses or live-status updates leads to incorrect/non-compliant advice | Pinecone: Offline Evaluation; Weaviate: Retrieval Metrics; Evidently: Precision-Recall; RidgeRun: RAG Evaluation |
Faithfulness / Hallucination Control | Fabricated statements risk fines, operational delays, or reputational damage | Cleanlab: RAG Hallucination Benchmarking; AWS: Detecting Hallucinations in RAG; Vectara: Measuring RAG Hallucinations; RAGAS: Faithfulness Metrics |
Latency (mean & p95) | Passengers abandon chat if replies stall beyond a second or two; each retrieval step adds delay | Mastering Latency Metrics; Zilliz: Importance of Tail Latency |
Live-Data Compatibility | Many answers depend on real-time APIs; retrieval must integrate static and dynamic sources | Striim: Real-time RAG Streaming; Imply: Real-time Data Importance |
Security / Compliance Filtering | Must exclude obsolete, confidential, or PII-bearing content for aviation/privacy regulations | Pinecone: Metadata Filtering; KX: Vector Search Filtering; Aviation Privacy Laws |
Adaptability to Policy Drift | Fare rules and policies change frequently; embeddings must refresh to avoid staleness | Knowledge Drift in RAG Models; Arize: Embedding Drift Detection |
Explainability / Provenance | Regulators may demand clear citation trails; provenance must be traceable end-to-end | FINOS: AI Citation Traceability; Source Traceability in AI Systems |
3.3 Retrieval Techniques Analysis
Technique Definitions
FAQ Q-to-Q Retrieval
- Retrieves direct matches from curated, pre-approved FAQ knowledge base
- Designed for high-frequency, well-structured queries (baggage allowances, check-in deadlines, seating rules)
- Guarantees consistent, compliant answers by returning exact official wording
- Acts as fastest, most reliable path for standardized information
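A minimal sketch of the Q-to-Q matching step, assuming precomputed sentence embeddings for each curated question (the 0.85 threshold is illustrative); below the threshold, the query falls through to the other retrieval routes:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_from_faq(query_vec: np.ndarray, faq: list[dict], threshold: float = 0.85):
    """Return the vetted answer verbatim on a close question match, else None."""
    best = max(faq, key=lambda item: cosine(query_vec, item["question_vec"]))
    if cosine(query_vec, best["question_vec"]) >= threshold:
        return best["answer"]   # exact official wording, never paraphrased
    return None                 # fall through to structured or hybrid retrieval
```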
Structured Retrieval (APIs/Tools)
- Integrates directly with operational airline APIs and structured data systems
- Retrieves live, authoritative information (flight status, gate assignments, baggage fees)
- Prioritizes real-time accuracy and clear provenance from official operational systems
- Essential for queries where freshness and factual accuracy are critical
Hybrid Search (BM25 + Vector + Cross-Encoder)
- Combines lexical retrieval (BM25) with semantic vector search
- Merges results using Reciprocal Rank Fusion (see the fusion sketch below) and refines with cross-encoder re-ranking
- Balances exact keyword matching with contextual understanding
- Effective for nuanced policy queries with mixed formal/informal language
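A minimal sketch of the Reciprocal Rank Fusion step, using the k=60 constant from the original RRF formulation; the document IDs are illustrative, and a cross-encoder would re-rank the fused top results afterwards:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["policy-baggage-12", "faq-surfboard", "policy-fees-03"]
vector_hits = ["faq-surfboard", "policy-sports-eq", "policy-baggage-12"]
fused = rrf_fuse([bm25_hits, vector_hits])
# docs appearing high in both lists ('faq-surfboard', 'policy-baggage-12') rise to the top
```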
Agentic / Self-RAG
- Adds self-verification step to standard retrieval-augmented generation pipeline
- Model validates each claim against retrieved evidence, re-retrieving if inconsistencies detected
- Minimizes unsupported claims and ensures grounding in verifiable sources
- Useful for high-stakes, compliance-sensitive responses
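A schematic verification loop in the spirit of Self-RAG (not the paper's actual interface); `retrieve`, `generate`, `extract_claims`, and `is_supported` are hypothetical stand-ins for the retriever, generator, and claim-checking components:

```python
def answer_with_verification(query: str, max_rounds: int = 2):
    evidence = retrieve(query)                       # hypothetical retriever
    for _ in range(max_rounds):
        draft = generate(query, evidence)            # hypothetical generator
        unsupported = [c for c in extract_claims(draft)
                       if not is_supported(c, evidence)]
        if not unsupported:
            return draft, evidence                   # every claim is grounded
        evidence += retrieve(" ".join(unsupported))  # re-retrieve for weak claims
    return None, evidence                            # refuse rather than risk hallucination
```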
Knowledge-Graph-Augmented Retrieval
- Leverages domain-specific knowledge graph built from airline regulations and operational processes
- Enables multi-hop reasoning for complex questions linking multiple policy elements
- Enhances contextual understanding through explicit relationship modeling
- Well-suited for multi-facet, interdependent queries
Dense Retriever (Fine-Tuned on Domain Data)
- Uses dense vector encoders fine-tuned on airline-specific data
- Maps queries and documents into shared embedding space
- Provides fast, semantically rich retrieval for questions without exact wording
- Higher domain alignment than general dense retrievers
Late-Interaction Models (ColBERT)
- Implements token-level interaction between query and document embeddings
- Maintains fine-grained relevance signals while enabling efficient large-scale retrieval
- Effective for subtle or ambiguous queries across long policy documents
- Moderate computational overhead during scoring
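The core of late interaction is the MaxSim operator: each query token embedding is matched against its best document token, and the maxima are summed. A minimal sketch with random stand-in embeddings:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """query_emb: (q_tokens, dim); doc_emb: (d_tokens, dim); both L2-normalized."""
    sim = query_emb @ doc_emb.T          # token-level similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(120, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)  # fine-grained relevance without a full cross-encoder pass
```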
Sparse Neural Retrieval (SPLADEv2)
- Generates sparse lexical representations using transformer models
- Preserves term importance and semantic context
- Achieves search speed close to BM25 while capturing deeper semantic relationships
- Useful for high-volume document repositories balancing latency and semantic fidelity
Table-Aware Retrieval
- Tailored to handle structured tabular data (pricing charts, baggage allowances, seating configurations)
- Uses retrieval models recognizing row/column semantics and numerical values
- Supports precise extraction and reasoning over table content
- Ideal for queries requiring comparisons, thresholds, or exact field retrieval
3.4 Technique Comparison Matrix
Ratings describe each attribute's level per technique; note that in the Latency column lower means faster (better), while in all other columns higher is better.
Technique | Recall | Faithfulness | Latency | Live-Data | Security | Adaptability | Provenance | References |
---|---|---|---|---|---|---|---|---|
FAQ Q-to-Q Retrieval | N/A | 100% | Low | Low | High | Medium | High | arXiv:1905.02851; SCITEPRESS 2024; arXiv:2306.03411; EMNLP 2023 |
Structured Retrieval (APIs/Tools) | N/A | 100% | N/A | High (100%) | High | High | High | FCIS Article |
Hybrid Search (BM25 + Vector + Cross-Encoder) | High (≥90%) | High | High | Low | High | Medium | High | arXiv:2210.11934; arXiv:2505.23250; arXiv:2508.01405 |
Agentic / Self-RAG | High | High | High | Low | High | Medium | High | arXiv:2310.11511 |
Knowledge-Graph-Augmented Retrieval | High | High | High | Low | High | Medium | High | arXiv:2405.20139; arXiv:2504.08893; arXiv:2505.17058 |
Dense Retriever (Fine-Tuned) | High | Medium-High | Low-Medium | Low | High | High | Low-Medium | arXiv:2501.04652; arXiv:2112.07577; EMNLP 2024 |
Late-Interaction Models (ColBERT) | High | High | Low-Medium | Low | High | Medium | Medium-High | Stanford CS224V; SIGIR 2020; arXiv:2205.09707; arXiv:2402.15059 |
Sparse Neural Retrieval (SPLADEv2) | High | High | Low | Low | High | High | Medium-High | arXiv:2109.10086; arXiv:2207.03834; arXiv:2505.15070 |
Table-Aware Retrieval | High | High | Medium | Low | High | Medium | Medium | ACL 2024; ACM 2022; arXiv:2203.16714; OpenReview |
3.5 Technique Selection & Justification
Based on the three-way constraint analysis, we select three complementary retrieval strategies:
Technique | Rationale |
---|---|
FAQ Q-to-Q Retrieval | Zero hallucination & very low latency: Returns exact, vetted FAQ answers with perfect faithfulness and minimal processing time. Ideal for high-frequency, well-defined queries, reducing load on complex retrieval stages. |
Structured Retrieval (APIs/Tools) | Real-time accuracy for operational data: Directly queries authoritative airline APIs ensuring 100% freshness and factual accuracy. Maintains strong provenance and compliance for dynamic, regulated data. |
Hybrid Search (BM25 + Vector + Cross-Encoder) | Balanced semantic and lexical retrieval: Combines BM25, dense embeddings, and cross-encoder reranking for high recall and faithfulness over policy documents. Effective for complex, broad-scope airline queries. |
Supporting Research
Sakata et al. (2019) - FAQ Retrieval using Query-Question Similarity and BERT-Based Query-Answer Relevance
- Proposes hybrid method combining query-question matching (BM25) with BERT-based re-ranking
- Demonstrates substantial relevance gains over traditional baselines
- Validates Q-to-Q retrieval suitability for frequent, well-defined queries
- Reference: https://arxiv.org/abs/1905.02851
Li et al. (2025) - Evaluating RAG Methods for Conversational AI in the Airport Domain
- Compares three RAG approaches on real airport queries
- Results: BM25+LLM (84.84%), SQL-RAG (80.85%), Graph-RAG (91.49%)
- Recommends SQL-RAG as balanced solution for operational reliability
- Reference: https://aclanthology.org/2025.naacl-industry.61.pdf
Taranukhin et al. (2024) - Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights
- Presents retrieval-only chatbot for EC 261 and equivalent passenger rights regulations
- Achieved MAP@5 of 0.88 with hallucination-free responses in a user study
- Reinforces suitability for highly regulated domains
- Reference: https://aclanthology.org/2024.nllp-1.27.pdf
4. Information Retrieval Evaluation
4.1 Attributes for Comparing Evaluation Methods
Attribute Definitions
Reproducibility
- Ability to obtain identical evaluation results across multiple runs with same configuration
- Guarantees performance regressions/improvements are real, not artifacts of randomness
- Critical for trustworthy CI/CD regression tests and audit trail maintenance
Coverage
- Evaluates both retriever's evidence surfacing ability and generator's context usage
- Combines retrieval-level metrics (recall@K, nDCG) with generation-level metrics (faithfulness, correctness)
- Enables pinpointing whether errors arise from incomplete context or LLM misuse
Sensitivity
- Framework's power to distinguish small but meaningful performance differences
- Ensures minor changes (reranker weight adjustments, index parameters) produce detectable score shifts
- Crucial for iterative tuning without being lost in measurement noise
Automation
- Entire evaluation pipeline runs end-to-end without manual intervention
- Integrates seamlessly into nightly CI/CD workflows with automated alerts
- Catches regressions/compliance violations before production deployment
Interpretability
- Provides granular breakdowns by question category, metric family, or intent type
- Enables tracing performance changes to specific system components or content areas
- Supports rapid diagnosis and targeted remediation rather than opaque aggregate scores
Cost Efficiency
- Balances computational and human-labeling expenses against insights gained
- Chooses between automated low-cost metrics and expensive human/LLM-based approaches
- Employs hybrid estimators to minimize spend while preserving accuracy
Domain Alignment
- Ensures metrics, test splits, and benchmarks are tailored to airline customer service scenarios
- Constructs scenario-specific test collections (EC 261 compensation, baggage fees, flight status)
- Uses expert-validated metrics reflecting real-world, domain-critical situations
4.2 Evaluation Method Importance Table
Attribute | Why it matters for airline-grade evaluation | References |
---|---|---|
Reproducibility | Guarantees identical results across repeated runs for regression testing and audit trails | Comet: ML Reproducibility; arXiv:2412.03854 |
Coverage | Spans both retrieval (recall@K, nDCG) and generation (faithfulness, correctness) metrics to diagnose any failure mode | arXiv:2405.07437 |
Sensitivity | Detects small performance changes (e.g., reranker weight tweaks) to support iterative tuning | arXiv:2507.07924; ACM: Performance Changes |
Automation | Enables fully batched, CI/CD-friendly evaluation with automated alerts for metric drift | arXiv:2409.19019 |
Interpretability | Offers breakdowns by category or metric family to pinpoint root causes of regressions | arXiv:2507.03479; arXiv:2211.02405 |
Cost Efficiency | Balances compute and human-label costs (LLM judges vs. automated metrics) under operational budgets | arXiv:1807.02202; arXiv:1807.06998; arXiv:2006.13999 |
Domain Alignment | Uses airline-specific test splits and metrics (EC 261, baggage, live-data queries) to ensure evaluation relevance | RAGAS: Rubrics-Based Metrics; UnfoldAI: RAG Evaluations |
Method Definitions
Offline Batch Evaluation
- Executes fixed, version-controlled test sets through system in batch mode
- Measures retrieval (recall@K, nDCG) and generation metrics (Exact Match, F1)
- Yields fully deterministic outputs for reproducibility and audit traceability
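A dependency-free sketch of the two retrieval metrics such a batch run would emit, using binary relevance judgments and illustrative document IDs:

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents surfaced in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

ranked = ["ec261-art7", "baggage-fees", "ec261-art5", "faq-checkin"]
relevant = {"ec261-art7", "ec261-art5"}
print(recall_at_k(ranked, relevant, 3), round(ndcg_at_k(ranked, relevant, 3), 3))
# 1.0 0.92 -- full coverage, slightly penalized for the relevant doc at rank 3
```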
Automated RAG Evaluation
- End-to-end pipelines assessing both retrieval and generation stages
- Combines retrieval-level metrics (pytrec_eval) with generation metrics (RAGAS)
- Includes synthetic test-case generation (RAGProbe) and lightweight LLM judging (ARES)
LLM-as-Judge
- Employs separate large language model to evaluate system outputs
- Scores responses on correctness, faithfulness, relevance, coherence
- Scales evaluation beyond traditional human labeling or shallow auto-metrics
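One possible shape for the judging step; the rubric wording and `call_judge_model` are hypothetical, not a specific framework's API:

```python
JUDGE_PROMPT = """You are grading an airline chatbot answer.
Question: {question}
Retrieved evidence: {evidence}
Answer: {answer}

Score 1-5 for each criterion and justify briefly:
- faithfulness: is every claim supported by the evidence?
- correctness: does the answer resolve the question?
- relevance: is the retrieved evidence actually used?
Return JSON: {{"faithfulness": int, "correctness": int, "relevance": int}}"""

def judge(question: str, evidence: str, answer: str) -> dict:
    prompt = JUDGE_PROMPT.format(question=question, evidence=evidence, answer=answer)
    return call_judge_model(prompt)  # hypothetical LLM call returning parsed JSON
```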
Human Annotation
- Expert annotators manually score outputs on correctness, factual grounding, compliance
- Remains gold standard for high-stakes domains and nuanced requirements
- Detects subtle errors automated metrics often miss
Explicit Subtopic Evaluation
- Segments test sets into domain-specific subcategories
- Calculates metrics independently for each slice (EC 261, baggage rules, live status)
- Enables targeted analysis of specific failure areas
Automated Exam Generation with IRT
- LLMs generate multiple-choice questions from domain corpora
- Item Response Theory models question difficulty and discriminative power
- Enables interpretable, scalable assessments across difficulty levels
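The calibration typically rests on a logistic IRT model; the two-parameter (2PL) form below gives the probability that a system of ability theta answers an item with difficulty b and discrimination a correctly (the parameter values are illustrative):

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: P(correct) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A highly discriminative item (a=2.0) separates abilities sharply around b:
print(p_correct_2pl(theta=0.5, a=2.0, b=0.0))   # ~0.73
print(p_correct_2pl(theta=-0.5, a=2.0, b=0.0))  # ~0.27
```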
4.3 Method Comparison Matrix
Method | Reproducibility | Coverage | Sensitivity | Automation | Interpretability | Cost Efficiency | Domain Alignment | References |
---|---|---|---|---|---|---|---|---|
Offline Batch Evaluation | High | High | Medium | Low | High | Medium | High | arXiv:2404.13781; arXiv:2405.07437 |
Automated RAG Evaluation | High | High | High | High | Medium | Medium | Medium | arXiv:2309.15217; arXiv:2311.09476; arXiv:2409.19019 |
LLM-as-Judge | Medium | Generation only | High | Medium | Medium | Low | Medium | arXiv:2411.15594; arXiv:2408.08781; arXiv:2408.13006; arXiv:2406.07791 |
Human Annotation | Low-Medium | High | High | Low | High | Low | High | arXiv:2507.15821; arXiv:2310.14424; NIST TN.2287 |
Explicit Subtopic Evaluation | High | High | Medium | Medium | High | Medium | High | arXiv:2412.05206 |
Automated Exam Generation with IRT | High | High | High | Medium | High | Medium | High | arXiv:2405.13622; arXiv:1605.08889; BEA 2022 |
4.4 Method Selection & Justification
Based on airline domain requirements for regulatory compliance, multi-source coverage, reproducibility, interpretability, and sensitivity to performance shifts:
Method | Rationale |
---|---|
Automated RAG Evaluation (pytrec_eval + RAGAS + RAGProbe/ARES) | Combines retrieval-level metrics with generation-level metrics in automated pipeline. Supports synthetic robustness testing and lightweight LLM judging. Delivers high coverage, reproducibility, and sensitivity for continuous monitoring. |
Explicit Subtopic Evaluation | Segments datasets into airline-specific categories for per-category metrics. Maximizes interpretability and domain alignment by identifying weaknesses in specific regulatory/operational areas. |
Automated Exam Generation with IRT | Produces controlled airline-specific test questions calibrated with Item Response Theory. Enables targeted skill probing, interpretable difficulty-aware scoring, and early capability gap detection. |
5. Implementation Recommendations
5.1 Retrieval Pipeline Integration
Layered Routing Strategy (a dispatch sketch follows the list):
- Route 1: FAQ Q-to-Q Retrieval
- Send FAQ-style queries for ultra-fast, fully grounded responses
- Handle high-frequency, well-defined passenger queries
- Free complex retrieval layers for nuanced cases
- Route 2: Structured Retrieval (APIs/Tools)
- Forward real-time, data-bound queries for up-to-date outputs
- Ensure strict compliance with airline regulations
- Maintain verifiable provenance for mission-critical queries
- Route 3: Hybrid Search
- Delegate nuanced or policy-spanning questions for broad coverage
- Apply refined precision for complex, multi-policy airline queries
- Respect latency constraints while ensuring comprehensive coverage
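A minimal dispatch sketch for the three routes; `classify_intent` and the three route handlers are hypothetical placeholders for the components described above:

```python
def route_query(query: str):
    intent = classify_intent(query)          # hypothetical lightweight classifier
    if intent == "faq":
        return answer_from_faq_route(query)  # Route 1: vetted Q-to-Q match
    if intent == "operational":
        return answer_from_apis(query)       # Route 2: live, authoritative APIs
    return answer_from_hybrid_search(query)  # Route 3: BM25 + vector + re-rank
```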
Provenance Requirements:
- Return each retrieval result with metadata (source document ID, passage, timestamp, API endpoint)
- Enable full citation path traceability from answer to source material
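One possible shape for a provenance-carrying result record (field names are illustrative, not a mandated schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetrievalResult:
    text: str                      # the passage or API payload shown to the LLM
    source_id: str                 # document ID or API endpoint
    chunk_location: Optional[str]  # e.g. "section 4.2, para 3"; None for APIs
    retrieved_at: str              # ISO-8601 timestamp for auditability
    version: str                   # index or policy version for drift tracking
```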
5.2 Evaluation Framework Deployment
Continuous Monitoring Schedule:
- Nightly: Automated RAG Evaluation as part of CI/CD workflows
- Weekly: Explicit Subtopic Evaluation for focused diagnosis
- Quarterly: IRT-calibrated exam generation for long-term capability assessment
Compliance and Auditing:
- Persist full evaluation logs, metrics, and test sets for audit trails
- Enable on-demand traceability for any production response
- Maintain verifiable citation paths for regulatory review
5.3 Future Development Priorities
- Intent-Aware Routing Models
- Train lightweight classifiers for dynamic query routing
- Use LLM-based intent detection for optimal retrieval path selection
- Retriever & Re-ranker Fine-Tuning
- Fine-tune embedding models on airline-specific data
- Optimize recall, precision, and latency control
- Real-Time Hallucination Detection
- Integrate Self-RAG reflection loops
- Implement confidence-based refusal mechanisms for high-risk contexts
- Feedback-Driven Evaluation Loops
- Incorporate passenger feedback and escalation logs
- Refine test sets based on real operational pain points
6. Conclusion
This study establishes a comprehensive, domain-specific framework pairing layered retrieval techniques with robust, interpretable evaluation methods to meet the demanding requirements of airline customer service.
Key Contributions
Retrieval Architecture:
- Ultra-fast FAQ retrieval for zero-hallucination responses
- Real-time API integration for factual accuracy and provenance
- Hybrid semantic search for complex, multi-policy queries
Evaluation Framework:
- Automated pipelines combining retrieval and generation metrics
- Domain-specific subtopic analysis for targeted improvements
- Difficulty-aware exam generation for capability monitoring
System Benefits
The framework enables chatbot architecture that is:
- Accurate and Compliant: Grounded, auditable responses supporting regulatory trust
- Resilient to Change: Detects regressions before user impact through continuous monitoring
- Scalable Over Time: Modular evaluation, feedback loops, and domain-tuned test sets
Strategic Impact
This configuration transforms the chatbot from a reactive interface into living knowledge infrastructure that adapts to new regulations, service offerings, and customer behaviors, positioning it as a core enabler of safe, scalable, and intelligent automation in the airline industry.
The framework's emphasis on regulatory precision, operational readiness, and strategic adaptability ensures sustainable deployment in one of the most demanding customer service environments, setting a foundation for broader application across regulated industries requiring high-stakes conversational AI.