Precision Under Pressure: A Comparative Study of Information Retrieval Methods and Evaluation Metrics in Airline Chatbot Design
Comprehensive study of information retrieval techniques and evaluation frameworks for building robust airline chatbots that balance regulatory precision, latency requirements, and multi-source coverage.


Abstract
The airline customer service domain presents unique challenges at the intersection of strict regulation and real-time operations. This study examines information retrieval methods and evaluation metrics for building robust airline chatbots that must satisfy three critical constraints: regulatory precision, tight latency requirements, and multi-source coverage. Through comprehensive analysis of retrieval techniques and evaluation frameworks, we provide actionable recommendations for implementing airline-grade conversational AI systems.
1. Problem Statement
The airline customer service domain operates at the intersection of strict regulation and real-time operations. Passengers inquire about:
- Compensation rules and passenger rights
- Baggage allowances and restrictions
- Visa and health documentation requirements
- Seat changes and upgrades
- Gate assignments and flight updates
- Rebooking options and policies
These queries often arrive in noisy, multilingual, informal language such as:
- "Can I take my surfboard on a Light fare?"
- "Do I get compensation if my Paris-Madrid flight is three hours late?"
Knowledge Base Complexity
The underlying knowledge base is heterogeneous, comprising:
- Structured FAQs - Curated question-answer pairs
- Policy manuals - Lengthy regulatory documents
- Historical chat logs - Past customer interactions
- Operational APIs - High-frequency live flight data
The Three-Way Constraint
Building an effective retrieval module requires satisfying three simultaneous demands:
- Regulatory Precision - One incorrect statement risks fines or safety violations
- Tight Latency - Passengers abandon conversations if responses exceed 1-2 seconds
- Multi-Source Coverage - Long documents, API fields, and historical queries must coexist in a unified, version-controlled index with verifiable provenance
The system must filter noise, fuse lexical and semantic signals, and deliver grounded evidence in sub-second time while supporting multiple languages and maintaining aviation compliance.
2. Research Questions
A. Information Retrieval Methods
- What information retrieval methods are reported in the literature?
- What attributes or comparison criteria are typically used to evaluate these methods?
- How do the identified methods perform across those attributes when tested?
- Which retrieval method—or combination of methods—should be selected for an airline chatbot, and why?
B. Evaluation Metrics
- What evaluation metrics are reported in the literature for measuring retrieval system quality?
- What attributes or comparison criteria are commonly applied to judge these metrics?
- How does each metric score across the chosen attributes when applied?
- Which evaluation metric—or set of metrics—should be chosen to monitor retrieval quality, and why?
3. Information Retrieval Techniques
3.1 Key Attributes for Comparing IR Techniques
We analyze core IR techniques against seven critical attributes that address the regulatory and operational challenges unique to airline customer service:
Attribute Definitions
Recall (Coverage)
- Measures the fraction of all relevant policy clauses or real-time data points surfaced by the retrieval system
- Essential for airline chatbots: omitting baggage rules, visa requirements, or gate updates can lead to non-compliant guidance
- Near-complete coverage prevents legal exposure, operational errors, and passenger frustration
Faithfulness / Hallucination Control
- Quantifies the share of generated statements grounded in retrieved evidence
- Critical in airline context: invented details like non-existent compensation clauses can incur fines or safety risks
- Techniques like Self-RAG's auto-verification minimize errors by requiring every claim to cite verifiable sources
Latency (mean & p95)
- Tracks retrieval and reranking processing time
- Mean latency shows the average response time; p95 bounds the slowest 5% of requests (tail behavior)
- Passengers expect replies within ~1 second; optimizing both metrics ensures consistent, responsive interactions
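To make the two numbers concrete, the sketch below computes mean and nearest-rank p95 from a list of per-request timings (the sample values are invented, not real measurements): a pipeline can look comfortably fast on average while its tail breaches the budget.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample covering pct% of requests."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)) - 1, 0)
    return ordered[rank]

# Illustrative per-request latencies in seconds.
latencies = [0.31, 0.42, 0.38, 0.35, 1.90, 0.40, 0.33, 0.45, 0.37, 2.10]
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s p95={percentile(latencies, 95):.2f}s")
# mean=0.70s p95=2.10s -- the average hides a tail that breaks the ~1s budget
```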
Live-Data Compatibility
- System's ability to merge static policy documents with dynamic API responses (flight status, baggage fees)
- Critical for real-time queries; failure to integrate current data results in stale or inaccurate answers
- Requires balancing freshness with latency through caching and smart API routing
Security / Compliance Filtering
- Applies metadata and content rules to block obsolete, confidential, or PII-bearing fragments
- Airlines must protect passenger data (PNRs, names, medical info) and avoid disclosing internal manuals
- Ensures compliance with GDPR, TSA/FAA requirements, and internal policies
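A minimal sketch of such a filter, applied to retrieved chunks before they reach the generator; the metadata fields (is_current, visibility, contains_pii) are hypothetical, not a real schema:

```python
def passes_compliance(chunk: dict) -> bool:
    """Drop obsolete, internal-only, or PII-bearing chunks before generation."""
    if not chunk.get("is_current", False):   # obsolete policy versions are out
        return False
    if chunk.get("visibility") != "public":  # internal manuals stay internal
        return False
    if chunk.get("contains_pii", True):      # fail closed when the flag is missing
        return False
    return True

candidates = [
    {"id": "baggage-2024-v3", "is_current": True, "visibility": "public", "contains_pii": False},
    {"id": "crew-manual-07", "is_current": True, "visibility": "internal", "contains_pii": False},
]
safe = [c for c in candidates if passes_compliance(c)]  # keeps only baggage-2024-v3
```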
Adaptability to Policy Drift
- Addresses system's ability to detect and incorporate changes in fare rules, compensation policies, or safety regulations
- Monitors for concept drift and triggers incremental re-embedding and index updates
- Keeps chatbot synchronized with latest official guidelines, avoiding outdated content citations
Explainability / Provenance
- Guarantees every chatbot assertion can be traced to its original source
- Provides document ID, chunk location, or API endpoint for each claim
- Critical for regulatory audits, legal reviews, and customer trust through verifiable information trails
3.2 Attribute Importance Table
Attribute | Why it matters for airline chatbots | Key References |
---|---|---|
Recall (Coverage) | Omitting regulation clauses or live-status updates leads to incorrect/non-compliant advice | Pinecone: Offline Evaluation; Weaviate: Retrieval Metrics; Evidently: Precision-Recall; RidgeRun: RAG Evaluation |
Faithfulness / Hallucination Control | Fabricated statements risk fines, operational delays, or reputational damage | Cleanlab: RAG Hallucination Benchmarking; AWS: Detecting Hallucinations in RAG; Vectara: Measuring RAG Hallucinations; RAGAS: Faithfulness Metrics |
Latency (mean & p95) | Passengers abandon chat if replies stall beyond a second or two; each retrieval step adds delay | Mastering Latency Metrics; Zilliz: Importance of Tail Latency |
Live-Data Compatibility | Many answers depend on real-time APIs; retrieval must integrate static and dynamic sources | Striim: Real-time RAG Streaming; Imply: Real-time Data Importance |
Security / Compliance Filtering | Must exclude obsolete, confidential, or PII-bearing content for aviation/privacy regulations | Pinecone: Metadata Filtering; KX: Vector Search Filtering; Aviation Privacy Laws |
Adaptability to Policy Drift | Fare rules and policies change frequently; embeddings must refresh to avoid staleness | Knowledge Drift in RAG Models; Arize: Embedding Drift Detection |
Explainability / Provenance | Regulators may demand clear citation trails; provenance must be traceable end-to-end | FINOS: AI Citation Traceability; Source Traceability in AI Systems |
3.3 Retrieval Techniques Analysis
Technique Definitions
FAQ Q-to-Q Retrieval
- Retrieves direct matches from curated, pre-approved FAQ knowledge base
- Designed for high-frequency, well-structured queries (baggage allowances, check-in deadlines, seating rules)
- Guarantees consistent, compliant answers by returning exact official wording
- Acts as fastest, most reliable path for standardized information
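A minimal sketch of the Q-to-Q matching step, assuming precomputed sentence embeddings for each curated question (the 0.85 threshold is illustrative); below the threshold, the query falls through to the other retrieval routes:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_from_faq(query_vec: np.ndarray, faq: list[dict], threshold: float = 0.85):
    """Return the vetted answer verbatim on a close question match, else None."""
    best = max(faq, key=lambda item: cosine(query_vec, item["question_vec"]))
    if cosine(query_vec, best["question_vec"]) >= threshold:
        return best["answer"]   # exact official wording, never paraphrased
    return None                 # fall through to structured or hybrid retrieval
```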
Structured Retrieval (APIs/Tools)
- Integrates directly with operational airline APIs and structured data systems
- Retrieves live, authoritative information (flight status, gate assignments, baggage fees)
- Prioritizes real-time accuracy and clear provenance from official operational systems
- Essential for queries where freshness and factual accuracy are critical
Hybrid Search (BM25 + Vector + Cross-Encoder)
- Combines lexical retrieval (BM25) with semantic vector search
- Merges results using Reciprocal Rank Fusion (see the fusion sketch below) and refines with cross-encoder re-ranking
- Balances exact keyword matching with contextual understanding
- Effective for nuanced policy queries with mixed formal/informal language
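A minimal sketch of the Reciprocal Rank Fusion step, using the k=60 constant from the original RRF formulation; the document IDs are illustrative, and a cross-encoder would re-rank the fused top results afterwards:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["policy-baggage-12", "faq-surfboard", "policy-fees-03"]
vector_hits = ["faq-surfboard", "policy-sports-eq", "policy-baggage-12"]
fused = rrf_fuse([bm25_hits, vector_hits])
# docs appearing high in both lists ('faq-surfboard', 'policy-baggage-12') rise to the top
```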
Agentic / Self-RAG
- Adds self-verification step to standard retrieval-augmented generation pipeline
- Model validates each claim against retrieved evidence, re-retrieving if inconsistencies detected
- Minimizes unsupported claims and ensures grounding in verifiable sources
- Useful for high-stakes, compliance-sensitive responses
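A schematic verification loop in the spirit of Self-RAG (not the paper's actual interface); `retrieve`, `generate`, `extract_claims`, and `is_supported` are hypothetical stand-ins for the retriever, generator, and claim-checking components:

```python
def answer_with_verification(query: str, max_rounds: int = 2):
    evidence = retrieve(query)                       # hypothetical retriever
    for _ in range(max_rounds):
        draft = generate(query, evidence)            # hypothetical generator
        unsupported = [c for c in extract_claims(draft)
                       if not is_supported(c, evidence)]
        if not unsupported:
            return draft, evidence                   # every claim is grounded
        evidence += retrieve(" ".join(unsupported))  # re-retrieve for weak claims
    return None, evidence                            # refuse rather than risk hallucination
```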
Knowledge-Graph-Augmented Retrieval
- Leverages domain-specific knowledge graph built from airline regulations and operational processes
- Enables multi-hop reasoning for complex questions linking multiple policy elements
- Enhances contextual understanding through explicit relationship modeling
- Well-suited for multi-facet, interdependent queries
Dense Retriever (Fine-Tuned on Domain Data)
- Uses dense vector encoders fine-tuned on airline-specific data
- Maps queries and documents into shared embedding space
- Provides fast, semantically rich retrieval for questions without exact wording
- Higher domain alignment than general dense retrievers
Late-Interaction Models (ColBERT)
- Implements token-level interaction between query and document embeddings
- Maintains fine-grained relevance signals while enabling efficient large-scale retrieval
- Effective for subtle or ambiguous queries across long policy documents
- Moderate computational overhead during scoring
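The core of late interaction is the MaxSim operator: each query token embedding is matched against its best document token, and the maxima are summed. A minimal sketch with random stand-in embeddings:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """query_emb: (q_tokens, dim); doc_emb: (d_tokens, dim); both L2-normalized."""
    sim = query_emb @ doc_emb.T          # token-level similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(120, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)  # fine-grained relevance without a full cross-encoder pass
```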
Sparse Neural Retrieval (SPLADEv2)
- Generates sparse lexical representations using transformer models
- Preserves term importance and semantic context
- Achieves search speed close to BM25 while capturing deeper semantic relationships
- Useful for high-volume document repositories balancing latency and semantic fidelity
Table-Aware Retrieval
- Tailored to handle structured tabular data (pricing charts, baggage allowances, seating configurations)
- Uses retrieval models recognizing row/column semantics and numerical values
- Supports precise extraction and reasoning over table content
- Ideal for queries requiring comparisons, thresholds, or exact field retrieval
3.4 Technique Comparison Matrix
Ratings describe each attribute's level per technique; note that in the Latency column lower means faster (better), while in all other columns higher is better.
Technique | Recall | Faithfulness | Latency | Live-Data | Security | Adaptability | Provenance | References |
---|---|---|---|---|---|---|---|---|
FAQ Q-to-Q Retrieval | N/A | 100% | Low | Low | High | Medium | High | arXiv:1905.02851; SCITEPRESS 2024; arXiv:2306.03411; EMNLP 2023 |
Structured Retrieval (APIs/Tools) | N/A | 100% | N/A | High (100%) | High | High | High | FCIS Article |
Hybrid Search (BM25 + Vector + Cross-Encoder) | High (≥90%) | High | High | Low | High | Medium | High | arXiv:2210.11934; arXiv:2505.23250; arXiv:2508.01405 |
Agentic / Self-RAG | High | High | High | Low | High | Medium | High | arXiv:2310.11511 |
Knowledge-Graph-Augmented Retrieval | High | High | High | Low | High | Medium | High | arXiv:2405.20139; arXiv:2504.08893; arXiv:2505.17058 |
Dense Retriever (Fine-Tuned) | High | Medium-High | Low-Medium | Low | High | High | Low-Medium | arXiv:2501.04652; arXiv:2112.07577; EMNLP 2024 |
Late-Interaction Models (ColBERT) | High | High | Low-Medium | Low | High | Medium | Medium-High | Stanford CS224V; SIGIR 2020; arXiv:2205.09707; arXiv:2402.15059 |
Sparse Neural Retrieval (SPLADEv2) | High | High | Low | Low | High | High | Medium-High | arXiv:2109.10086; arXiv:2207.03834; arXiv:2505.15070 |
Table-Aware Retrieval | High | High | Medium | Low | High | Medium | Medium | ACL 2024; ACM 2022; arXiv:2203.16714; OpenReview |
3.5 Technique Selection & Justification
Based on the three-way constraint analysis, we select three complementary retrieval strategies:
Technique | Rationale |
---|---|
FAQ Q-to-Q Retrieval | Zero hallucination & very low latency: Returns exact, vetted FAQ answers with perfect faithfulness and minimal processing time. Ideal for high-frequency, well-defined queries, reducing load on complex retrieval stages. |
Structured Retrieval (APIs/Tools) | Real-time accuracy for operational data: Directly queries authoritative airline APIs ensuring 100% freshness and factual accuracy. Maintains strong provenance and compliance for dynamic, regulated data. |
Hybrid Search (BM25 + Vector + Cross-Encoder) | Balanced semantic and lexical retrieval: Combines BM25, dense embeddings, and cross-encoder reranking for high recall and faithfulness over policy documents. Effective for complex, broad-scope airline queries. |
Supporting Research
Sakata et al. (2019) - FAQ Retrieval using Query-Question Similarity and BERT-Based Query-Answer Relevance
- Proposes hybrid method combining query-question matching (BM25) with BERT-based re-ranking
- Demonstrates substantial relevance gains over traditional baselines
- Validates Q-to-Q retrieval suitability for frequent, well-defined queries
- Reference: https://arxiv.org/abs/1905.02851
Li et al. (2025) - Evaluating RAG Methods for Conversational AI in the Airport Domain
- Compares three RAG approaches on real airport queries
- Results: BM25+LLM (84.84%), SQL-RAG (80.85%), Graph-RAG (91.49%)
- Recommends SQL-RAG as balanced solution for operational reliability
- Reference: https://aclanthology.org/2025.naacl-industry.61.pdf
Taranukhin et al. (2024) - Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights
- Presents retrieval-only chatbot for EC 261 and equivalent passenger rights regulations
- Achieved MAP@5 of 0.88 with hallucination-free responses in a user study
- Reinforces suitability for highly regulated domains
- Reference: https://aclanthology.org/2024.nllp-1.27.pdf
4. Information Retrieval Evaluation
4.1 Attributes for Comparing Evaluation Methods
Attribute Definitions
Reproducibility
- Ability to obtain identical evaluation results across multiple runs with same configuration
- Guarantees performance regressions/improvements are real, not artifacts of randomness
- Critical for trustworthy CI/CD regression tests and audit trail maintenance
Coverage
- Evaluates both retriever's evidence surfacing ability and generator's context usage
- Combines retrieval-level metrics (recall@K, nDCG) with generation-level metrics (faithfulness, correctness)
- Enables pinpointing whether errors arise from incomplete context or LLM misuse
Sensitivity
- Framework's power to distinguish small but meaningful performance differences
- Ensures minor changes (reranker weight adjustments, index parameters) produce detectable score shifts
- Crucial for iterative tuning without being lost in measurement noise
Automation
- Entire evaluation pipeline runs end-to-end without manual intervention
- Integrates seamlessly into nightly CI/CD workflows with automated alerts
- Catches regressions/compliance violations before production deployment
Interpretability
- Provides granular breakdowns by question category, metric family, or intent type
- Enables tracing performance changes to specific system components or content areas
- Supports rapid diagnosis and targeted remediation rather than opaque aggregate scores
Cost Efficiency
- Balances computational and human-labeling expenses against insights gained
- Chooses between automated low-cost metrics and expensive human/LLM-based approaches
- Employs hybrid estimators to minimize spend while preserving accuracy
Domain Alignment
- Ensures metrics, test splits, and benchmarks are tailored to airline customer service scenarios
- Constructs scenario-specific test collections (EC 261 compensation, baggage fees, flight status)
- Uses expert-validated metrics reflecting real-world, domain-critical situations
4.2 Evaluation Method Importance Table
Attribute | Why it matters for airline-grade evaluation | References |
---|---|---|
Reproducibility | Guarantees identical results across repeated runs for regression testing and audit trails | Comet: ML Reproducibility; arXiv:2412.03854 |
Coverage | Spans both retrieval (recall@K, nDCG) and generation (faithfulness, correctness) metrics to diagnose any failure mode | arXiv:2405.07437 |
Sensitivity | Detects small performance changes (e.g., reranker weight tweaks) to support iterative tuning | arXiv:2507.07924; ACM: Performance Changes |
Automation | Enables fully batched, CI/CD-friendly evaluation with automated alerts for metric drift | arXiv:2409.19019 |
Interpretability | Offers breakdowns by category or metric family to pinpoint root causes of regressions | arXiv:2507.03479; arXiv:2211.02405 |
Cost Efficiency | Balances compute and human-label costs (LLM judges vs. automated metrics) under operational budgets | arXiv:1807.02202; arXiv:1807.06998; arXiv:2006.13999 |
Domain Alignment | Uses airline-specific test splits and metrics (EC 261, baggage, live-data queries) to ensure evaluation relevance | RAGAS: Rubrics-Based Metrics; UnfoldAI: RAG Evaluations |
Method Definitions
Offline Batch Evaluation
- Executes fixed, version-controlled test sets through system in batch mode
- Measures retrieval (recall@K, nDCG) and generation metrics (Exact Match, F1)
- Yields fully deterministic outputs for reproducibility and audit traceability
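A dependency-free sketch of the two retrieval metrics such a batch run would emit, using binary relevance judgments and illustrative document IDs:

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents surfaced in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

ranked = ["ec261-art7", "baggage-fees", "ec261-art5", "faq-checkin"]
relevant = {"ec261-art7", "ec261-art5"}
print(recall_at_k(ranked, relevant, 3), round(ndcg_at_k(ranked, relevant, 3), 3))
# 1.0 0.92 -- full coverage, slightly penalized for the relevant doc at rank 3
```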
Automated RAG Evaluation
- End-to-end pipelines assessing both retrieval and generation stages
- Combines retrieval-level metrics (pytrec_eval) with generation metrics (RAGAS)
- Includes synthetic test-case generation (RAGProbe) and lightweight LLM judging (ARES)
LLM-as-Judge
- Employs separate large language model to evaluate system outputs
- Scores responses on correctness, faithfulness, relevance, coherence
- Scales evaluation beyond traditional human labeling or shallow auto-metrics
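One possible shape for the judging step; the rubric wording and `call_judge_model` are hypothetical, not a specific framework's API:

```python
JUDGE_PROMPT = """You are grading an airline chatbot answer.
Question: {question}
Retrieved evidence: {evidence}
Answer: {answer}

Score 1-5 for each criterion and justify briefly:
- faithfulness: is every claim supported by the evidence?
- correctness: does the answer resolve the question?
- relevance: is the retrieved evidence actually used?
Return JSON: {{"faithfulness": int, "correctness": int, "relevance": int}}"""

def judge(question: str, evidence: str, answer: str) -> dict:
    prompt = JUDGE_PROMPT.format(question=question, evidence=evidence, answer=answer)
    return call_judge_model(prompt)  # hypothetical LLM call returning parsed JSON
```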
Human Annotation
- Expert annotators manually score outputs on correctness, factual grounding, compliance
- Remains gold standard for high-stakes domains and nuanced requirements
- Detects subtle errors automated metrics often miss
Explicit Subtopic Evaluation
- Segments test sets into domain-specific subcategories
- Calculates metrics independently for each slice (EC 261, baggage rules, live status)
- Enables targeted analysis of specific failure areas
Automated Exam Generation with IRT
- LLMs generate multiple-choice questions from domain corpora
- Item Response Theory models question difficulty and discriminative power
- Enables interpretable, scalable assessments across difficulty levels
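The calibration typically rests on a logistic IRT model; the two-parameter (2PL) form below gives the probability that a system of ability theta answers an item with difficulty b and discrimination a correctly (the parameter values are illustrative):

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: P(correct) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A highly discriminative item (a=2.0) separates abilities sharply around b:
print(p_correct_2pl(theta=0.5, a=2.0, b=0.0))   # ~0.73
print(p_correct_2pl(theta=-0.5, a=2.0, b=0.0))  # ~0.27
```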
4.3 Method Comparison Matrix
Method | Reproducibility | Coverage | Sensitivity | Automation | Interpretability | Cost Efficiency | Domain Alignment | References |
---|---|---|---|---|---|---|---|---|
Offline Batch Evaluation | High | High | Medium | Low | High | Medium | High | arXiv:2404.13781; arXiv:2405.07437 |
Automated RAG Evaluation | High | High | High | High | Medium | Medium | Medium | arXiv:2309.15217; arXiv:2311.09476; arXiv:2409.19019 |
LLM-as-Judge | Medium | Generation only | High | Medium | Medium | Low | Medium | arXiv:2411.15594; arXiv:2408.08781; arXiv:2408.13006; arXiv:2406.07791 |
Human Annotation | Low-Medium | High | High | Low | High | Low | High | arXiv:2507.15821; arXiv:2310.14424; NIST TN.2287 |
Explicit Subtopic Evaluation | High | High | Medium | Medium | High | Medium | High | arXiv:2412.05206 |
Automated Exam Generation with IRT | High | High | High | Medium | High | Medium | High | arXiv:2405.13622; arXiv:1605.08889; BEA 2022 |
4.4 Method Selection & Justification
Based on airline domain requirements for regulatory compliance, multi-source coverage, reproducibility, interpretability, and sensitivity to performance shifts:
Method | Rationale |
---|---|
Automated RAG Evaluation (pytrec_eval + RAGAS + RAGProbe/ARES) | Combines retrieval-level metrics with generation-level metrics in automated pipeline. Supports synthetic robustness testing and lightweight LLM judging. Delivers high coverage, reproducibility, and sensitivity for continuous monitoring. |
Explicit Subtopic Evaluation | Segments datasets into airline-specific categories for per-category metrics. Maximizes interpretability and domain alignment by identifying weaknesses in specific regulatory/operational areas. |
Automated Exam Generation with IRT | Produces controlled airline-specific test questions calibrated with Item Response Theory. Enables targeted skill probing, interpretable difficulty-aware scoring, and early capability gap detection. |
5. Implementation Recommendations
5.1 Retrieval Pipeline Integration
Layered Routing Strategy (a dispatch sketch follows the list):
- Route 1: FAQ Q-to-Q Retrieval
- Send FAQ-style queries for ultra-fast, fully grounded responses
- Handle high-frequency, well-defined passenger queries
- Free complex retrieval layers for nuanced cases
- Route 2: Structured Retrieval (APIs/Tools)
- Forward real-time, data-bound queries for up-to-date outputs
- Ensure strict compliance with airline regulations
- Maintain verifiable provenance for mission-critical queries
- Route 3: Hybrid Search
- Delegate nuanced or policy-spanning questions for broad coverage
- Apply refined precision for complex, multi-policy airline queries
- Respect latency constraints while ensuring comprehensive coverage
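A minimal dispatch sketch for the three routes; `classify_intent` and the three route handlers are hypothetical placeholders for the components described above:

```python
def route_query(query: str):
    intent = classify_intent(query)          # hypothetical lightweight classifier
    if intent == "faq":
        return answer_from_faq_route(query)  # Route 1: vetted Q-to-Q match
    if intent == "operational":
        return answer_from_apis(query)       # Route 2: live, authoritative APIs
    return answer_from_hybrid_search(query)  # Route 3: BM25 + vector + re-rank
```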
Provenance Requirements:
- Return each retrieval result with metadata (source document ID, passage, timestamp, API endpoint)
- Enable full citation path traceability from answer to source material
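One possible shape for a provenance-carrying result record (field names are illustrative, not a mandated schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetrievalResult:
    text: str                      # the passage or API payload shown to the LLM
    source_id: str                 # document ID or API endpoint
    chunk_location: Optional[str]  # e.g. "section 4.2, para 3"; None for APIs
    retrieved_at: str              # ISO-8601 timestamp for auditability
    version: str                   # index or policy version for drift tracking
```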
5.2 Evaluation Framework Deployment
Continuous Monitoring Schedule:
- Nightly: Automated RAG Evaluation as part of CI/CD workflows
- Weekly: Explicit Subtopic Evaluation for focused diagnosis
- Quarterly: IRT-calibrated exam generation for long-term capability assessment
Compliance and Auditing:
- Persist full evaluation logs, metrics, and test sets for audit trails
- Enable on-demand traceability for any production response
- Maintain verifiable citation paths for regulatory review
5.3 Future Development Priorities
- Intent-Aware Routing Models
- Train lightweight classifiers for dynamic query routing
- Use LLM-based intent detection for optimal retrieval path selection
- Retriever & Re-ranker Fine-Tuning
- Fine-tune embedding models on airline-specific data
- Optimize recall, precision, and latency control
- Real-Time Hallucination Detection
- Integrate Self-RAG reflection loops
- Implement confidence-based refusal mechanisms for high-risk contexts
- Feedback-Driven Evaluation Loops
- Incorporate passenger feedback and escalation logs
- Refine test sets based on real operational pain points
6. Conclusion
This study establishes a comprehensive, domain-specific framework pairing layered retrieval techniques with robust, interpretable evaluation methods to meet the demanding requirements of airline customer service.
Key Contributions
Retrieval Architecture:
- Ultra-fast FAQ retrieval for zero-hallucination responses
- Real-time API integration for factual accuracy and provenance
- Hybrid semantic search for complex, multi-policy queries
Evaluation Framework:
- Automated pipelines combining retrieval and generation metrics
- Domain-specific subtopic analysis for targeted improvements
- Difficulty-aware exam generation for capability monitoring
System Benefits
The framework enables chatbot architecture that is:
- Accurate and Compliant: Grounded, auditable responses supporting regulatory trust
- Resilient to Change: Detects regressions before user impact through continuous monitoring
- Scalable Over Time: Modular evaluation, feedback loops, and domain-tuned test sets
Strategic Impact
This configuration transforms the chatbot from a reactive interface into living knowledge infrastructure that adapts to new regulations, service offerings, and customer behaviors, positioning it as a core enabler of safe, scalable, and intelligent automation in the airline industry.
The framework's emphasis on regulatory precision, operational readiness, and strategic adaptability ensures sustainable deployment in one of the most demanding customer service environments, setting a foundation for broader application across regulated industries requiring high-stakes conversational AI.