
Precision Under Pressure: A Comparative Study of Information Retrieval Methods and Evaluation Metrics in Airline Chatbot Design

Comprehensive study of information retrieval techniques and evaluation frameworks for building robust airline chatbots that balance regulatory precision, latency requirements, and multi-source coverage.

Conceptual visualization of information retrieval and evaluation methods in airline chatbot systems
Gustavo Meneses
Applied AI Engineer, Kaiban
Published: August 9, 2025 · 15 min read

Abstract

The airline customer service domain presents unique challenges at the intersection of strict regulation and real-time operations. This study examines information retrieval methods and evaluation metrics for building robust airline chatbots that must satisfy three critical constraints: regulatory precision, tight latency requirements, and multi-source coverage. Through comprehensive analysis of retrieval techniques and evaluation frameworks, we provide actionable recommendations for implementing airline-grade conversational AI systems.


1. Problem Statement

The airline customer service domain operates at the intersection of strict regulation and real-time operations. Passengers inquire about:

  • Compensation rules and passenger rights
  • Baggage allowances and restrictions
  • Visa and health documentation requirements
  • Seat changes and upgrades
  • Gate assignments and flight updates
  • Rebooking options and policies

These queries often arrive in noisy, multilingual, informal language such as:

  • "Can I take my surfboard on a Light fare?"
  • "Do I get compensation if my Paris-Madrid flight is three hours late?"

Knowledge Base Complexity

The underlying knowledge base is heterogeneous, comprising:

  • Structured FAQs - Curated question-answer pairs
  • Policy manuals - Lengthy regulatory documents
  • Historical chat logs - Past customer interactions
  • Operational APIs - High-frequency live flight data

The Three-Way Constraint

Building an effective retrieval module requires satisfying three simultaneous demands:

  1. Regulatory Precision - One incorrect statement risks fines or safety violations
  2. Tight Latency - Passengers expect responses within 1-2 seconds and abandon conversations that feel slow
  3. Multi-Source Coverage - Long documents, API fields, and historical queries must coexist in a unified, version-controlled index with verifiable provenance

The system must filter noise, fuse lexical and semantic signals, and deliver grounded evidence in sub-second time while supporting multiple languages and maintaining aviation compliance.


2. Research Questions

A. Information Retrieval Methods

  1. What information retrieval methods are reported in the literature?
  2. What attributes or comparison criteria are typically used to evaluate these methods?
  3. How do the identified methods perform across those attributes when tested?
  4. Which retrieval method—or combination of methods—should be selected for an airline chatbot, and why?

B. Evaluation Metrics

  1. What evaluation metrics are reported in the literature for measuring retrieval system quality?
  2. What attributes or comparison criteria are commonly applied to judge these metrics?
  3. How does each metric score across the chosen attributes when applied?
  4. Which evaluation metric—or set of metrics—should be chosen to monitor retrieval quality, and why?

3. Information Retrieval Techniques

3.1 Key Attributes for Comparing IR Techniques

We analyze core IR techniques against seven critical attributes that address the regulatory and operational challenges unique to airline customer service:

Attribute Definitions

Recall (Coverage)

  • Measures the fraction of all relevant policy clauses or real-time data points surfaced by the retrieval system
  • Essential for airline chatbots: omitting baggage rules, visa requirements, or gate updates can lead to non-compliant guidance
  • Near-complete coverage prevents legal exposure, operational errors, and passenger frustration

Faithfulness / Hallucination Control

  • Quantifies the share of generated statements grounded in retrieved evidence
  • Critical in airline context: invented details like non-existent compensation clauses can incur fines or safety risks
  • Techniques like Self-RAG's auto-verification minimize errors by requiring every claim to cite verifiable sources

Latency (mean & p95)

  • Tracks retrieval and reranking processing time
  • Mean latency shows average response time; p95 captures worst-case tail performance
  • Passengers expect replies within ~1 second; optimizing both metrics ensures consistent, responsive interactions
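As an illustration of why both numbers are tracked, the sketch below computes mean and p95 from a window of recorded retrieval timings, using the nearest-rank percentile definition (the sample values are invented):

```python
import math
import statistics

def latency_summary(samples_ms):
    """Summarize retrieval latencies: mean shows the average case,
    p95 exposes the slow tail that individual passengers actually hit."""
    ordered = sorted(samples_ms)
    mean = statistics.fmean(ordered)
    # Nearest-rank p95: the smallest value >= 95% of the observations.
    rank = math.ceil(0.95 * len(ordered))
    return {"mean_ms": round(mean, 1), "p95_ms": ordered[rank - 1]}

timings = [120, 140, 95, 180, 110, 130, 105, 900, 125, 115]
print(latency_summary(timings))
```

Note how the single 900 ms outlier leaves the mean near 200 ms while pushing p95 to 900 ms, which is exactly the tail behavior the mean alone would hide.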

Live-Data Compatibility

  • System's ability to merge static policy documents with dynamic API responses (flight status, baggage fees)
  • Critical for real-time queries; failure to integrate current data results in stale or inaccurate answers
  • Requires balancing freshness with latency through caching and smart API routing

Security / Compliance Filtering

  • Applies metadata and content rules to block obsolete, confidential, or PII-bearing fragments
  • Airlines must protect passenger data (PNRs, names, medical info) and avoid disclosing internal manuals
  • Ensures compliance with GDPR, TSA/FAA requirements, and internal policies

Adaptability to Policy Drift

  • Addresses system's ability to detect and incorporate changes in fare rules, compensation policies, or safety regulations
  • Monitors for concept drift and triggers incremental re-embedding and index updates
  • Keeps chatbot synchronized with latest official guidelines, avoiding outdated content citations

Explainability / Provenance

  • Guarantees every chatbot assertion can be traced to its original source
  • Provides document ID, chunk location, or API endpoint for each claim
  • Critical for regulatory audits, legal reviews, and customer trust through verifiable information trails

3.2 Attribute Importance Table

| Attribute | Why it matters for airline chatbots | Key References |
| --- | --- | --- |
| Recall (Coverage) | Omitting regulation clauses or live-status updates leads to incorrect/non-compliant advice | Pinecone: Offline Evaluation; Weaviate: Retrieval Metrics; Evidently: Precision-Recall; RidgeRun: RAG Evaluation |
| Faithfulness / Hallucination Control | Fabricated statements risk fines, operational delays, or reputational damage | Cleanlab: RAG Hallucination Benchmarking; AWS: Detecting Hallucinations in RAG; Vectara: Measuring RAG Hallucinations; RAGAS: Faithfulness Metrics |
| Latency (mean & p95) | Passengers abandon chat if replies exceed ~5 seconds; each retrieval step adds delay | Mastering Latency Metrics; Zilliz: Importance of Tail Latency |
| Live-Data Compatibility | Many answers depend on real-time APIs; retrieval must integrate static and dynamic sources | Striim: Real-time RAG Streaming; Imply: Real-time Data Importance |
| Security / Compliance Filtering | Must exclude obsolete, confidential, or PII-bearing content for aviation/privacy regulations | Pinecone: Metadata Filtering; KX: Vector Search Filtering; Aviation Privacy Laws |
| Adaptability to Policy Drift | Fare rules and policies change frequently; embeddings must refresh to avoid staleness | Knowledge Drift in RAG Models; Arize: Embedding Drift Detection |
| Explainability / Provenance | Regulators may demand clear citation trails; provenance must be traceable end-to-end | FINOS: AI Citation Traceability; Source Traceability in AI Systems |

3.3 Retrieval Techniques Analysis

Technique Definitions

FAQ Q-to-Q Retrieval

  • Retrieves direct matches from curated, pre-approved FAQ knowledge base
  • Designed for high-frequency, well-structured queries (baggage allowances, check-in deadlines, seating rules)
  • Guarantees consistent, compliant answers by returning exact official wording
  • Acts as fastest, most reliable path for standardized information
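In production this matching is usually done with sentence embeddings; as a self-contained illustration, the sketch below substitutes bag-of-words cosine similarity for an embedding model, with a hypothetical `faq_match` helper and a made-up acceptance threshold of 0.5:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def faq_match(query, faq, threshold=0.5):
    """Return the vetted answer of the closest FAQ question,
    or None to fall through to a heavier retrieval route."""
    q_vec = Counter(query.lower().split())
    best_q, best_score = None, 0.0
    for question in faq:
        score = cosine(q_vec, Counter(question.lower().split()))
        if score > best_score:
            best_q, best_score = question, score
    return faq[best_q] if best_score >= threshold else None

faq = {
    "what is the checked baggage allowance": "Economy fares include one 23 kg checked bag.",
    "when does online check-in open": "Online check-in opens 24 hours before departure.",
}
print(faq_match("what is the baggage allowance", faq))
```

Because the system returns the stored answer verbatim rather than generating text, a confident match is hallucination-free by construction; a below-threshold match simply falls through to the next route.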

Structured Retrieval (APIs/Tools)

  • Integrates directly with operational airline APIs and structured data systems
  • Retrieves live, authoritative information (flight status, gate assignments, baggage fees)
  • Prioritizes real-time accuracy and clear provenance from official operational systems
  • Essential for queries where freshness and factual accuracy are critical

Hybrid Search (BM25 + Vector + Cross-Encoder)

  • Combines lexical retrieval (BM25) with semantic vector search
  • Merges results using Reciprocal Rank Fusion and refines with cross-encoder re-ranking
  • Balances exact keyword matching with contextual understanding
  • Effective for nuanced policy queries with mixed formal/informal language
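A minimal sketch of the fusion step, using Reciprocal Rank Fusion with the conventional k=60 constant (document IDs are invented; the cross-encoder re-ranking stage is omitted here):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists from different retrievers: each document earns
    1 / (k + rank) per list, so items ranked well by several retrievers rise."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["policy_ec261", "visa_rules", "baggage_fees"]        # lexical ranking
vector_hits = ["policy_ec261", "baggage_fees", "seat_upgrades"]   # semantic ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

`baggage_fees` outranks `visa_rules` in the fused list because it appears in both rankings, which is the behavior that makes RRF a robust, score-free merge.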

Agentic / Self-RAG

  • Adds self-verification step to standard retrieval-augmented generation pipeline
  • Model validates each claim against retrieved evidence, re-retrieving if inconsistencies detected
  • Minimizes unsupported claims and ensures grounding in verifiable sources
  • Useful for high-stakes, compliance-sensitive responses

Knowledge-Graph-Augmented Retrieval

  • Leverages domain-specific knowledge graph built from airline regulations and operational processes
  • Enables multi-hop reasoning for complex questions linking multiple policy elements
  • Enhances contextual understanding through explicit relationship modeling
  • Well-suited for multi-facet, interdependent queries

Dense Retriever (Fine-Tuned on Domain Data)

  • Uses dense vector encoders fine-tuned on airline-specific data
  • Maps queries and documents into shared embedding space
  • Provides fast, semantically rich retrieval for questions without exact wording
  • Higher domain alignment than general dense retrievers

Late-Interaction Models (ColBERT)

  • Implements token-level interaction between query and document embeddings
  • Maintains fine-grained relevance signals while enabling efficient large-scale retrieval
  • Effective for subtle or ambiguous queries across long policy documents
  • Moderate computational overhead during scoring
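The core late-interaction scoring (ColBERT's MaxSim) can be sketched with toy 2-dimensional token vectors; real systems use learned embeddings of much higher dimension:

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query token picks its best-matching
    document token (dot product), and the per-token maxima are summed."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [(1.0, 0.0), (0.0, 1.0)]   # two query token embeddings
doc_a = [(1.0, 0.0), (0.5, 0.5)]   # covers both query tokens, one only partly
doc_b = [(0.0, 1.0), (0.0, 1.0)]   # covers only the second query token
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because each query token is scored independently, a document only partially covering the query is still credited for the tokens it does match, which is what preserves fine-grained relevance signals over long policy documents.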

Sparse Neural Retrieval (SPLADEv2)

  • Generates sparse lexical representations using transformer models
  • Preserves term importance and semantic context
  • Achieves search speed close to BM25 while capturing deeper meanings
  • Useful for high-volume document repositories balancing latency and semantic fidelity
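Conceptually, a SPLADE-style score is a dot product over sparse term-weight vectors, where the model may activate expansion terms the text never used (all weights below are invented for illustration):

```python
def sparse_score(query_weights, doc_weights):
    """Dot product over the few terms both sparse vectors activate.
    Expansion terms (e.g. 'baggage' activated for a query saying 'luggage')
    let lexical-style matching capture synonymy."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

# Query "luggage fee": the model also activates the related term "baggage".
query = {"luggage": 1.0, "fee": 0.8, "baggage": 0.6}
doc = {"baggage": 1.5, "fee": 1.0, "allowance": 0.9}
print(sparse_score(query, doc))
```

Since the vectors are sparse, this dot product can be served from an ordinary inverted index, which is how SPLADE keeps latency close to BM25.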

Table-Aware Retrieval

  • Tailored to handle structured tabular data (pricing charts, baggage allowances, seating configurations)
  • Uses retrieval models recognizing row/column semantics and numerical values
  • Supports precise extraction and reasoning over table content
  • Ideal for queries requiring comparisons, thresholds, or exact field retrieval

3.4 Technique Comparison Matrix

| Technique | Recall | Faithfulness | Latency | Live-Data | Security | Adaptability | Provenance | References |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAQ Q-to-Q Retrieval | N/A | 100% | Low | Low | High | Medium | High | arXiv:1905.02851; SCITEPRESS 2024; arXiv:2306.03411; EMNLP 2023 |
| Structured Retrieval (APIs/Tools) | N/A | 100% | N/A | High (100%) | High | High | High | FCIS Article |
| Hybrid Search (BM25 + Vector + Cross-Encoder) | High (≥90%) | High | High | Low | High | Medium | High | arXiv:2210.11934; arXiv:2505.23250; arXiv:2508.01405 |
| Agentic / Self-RAG | High | High | High | Low | High | Medium | High | arXiv:2310.11511 |
| Knowledge-Graph-Augmented Retrieval | High | High | High | Low | High | Medium | High | arXiv:2405.20139; arXiv:2504.08893; arXiv:2505.17058 |
| Dense Retriever (Fine-Tuned) | High | Medium-High | Low-Medium | Low | High | High | Low-Medium | arXiv:2501.04652; arXiv:2112.07577; EMNLP 2024 |
| Late-Interaction Models (ColBERT) | High | High | Low-Medium | Low | High | Medium | Medium-High | Stanford CS224V; SIGIR 2020; arXiv:2205.09707; arXiv:2402.15059 |
| Sparse Neural Retrieval (SPLADEv2) | High | High | Low | Low | High | High | Medium-High | arXiv:2109.10086; arXiv:2207.03834; arXiv:2505.15070 |
| Table-Aware Retrieval | High | High | Medium | Low | High | Medium | Medium | ACL 2024; ACM 2022; arXiv:2203.16714; OpenReview |

3.5 Technique Selection & Justification

Based on the three-way constraint analysis, we select three complementary retrieval strategies:

| Technique | Rationale |
| --- | --- |
| FAQ Q-to-Q Retrieval | Zero hallucination and very low latency: returns exact, vetted FAQ answers with perfect faithfulness and minimal processing time. Ideal for high-frequency, well-defined queries, reducing load on complex retrieval stages. |
| Structured Retrieval (APIs/Tools) | Real-time accuracy for operational data: directly queries authoritative airline APIs, ensuring 100% freshness and factual accuracy. Maintains strong provenance and compliance for dynamic, regulated data. |
| Hybrid Search (BM25 + Vector + Cross-Encoder) | Balanced semantic and lexical retrieval: combines BM25, dense embeddings, and cross-encoder reranking for high recall and faithfulness over policy documents. Effective for complex, broad-scope airline queries. |

Supporting Research

Sakata et al. (2019) - FAQ Retrieval using Query-Question Similarity and BERT-Based Query-Answer Relevance

  • Proposes hybrid method combining query-question matching (BM25) with BERT-based re-ranking
  • Demonstrates substantial relevance gains over traditional baselines
  • Validates Q-to-Q retrieval suitability for frequent, well-defined queries
  • Reference: https://arxiv.org/abs/1905.02851

Li et al. (2025) - Evaluating RAG Methods for Conversational AI in the Airport Domain

  • Compares three RAG approaches on real airport queries
  • Results: BM25+LLM (84.84%), SQL-RAG (80.85%), Graph-RAG (91.49%)
  • Recommends SQL-RAG as balanced solution for operational reliability
  • Reference: https://aclanthology.org/2025.naacl-industry.61.pdf

Taranukhin et al. (2024) - Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights

  • Presents retrieval-only chatbot for EC 261 and equivalent passenger rights regulations
  • Achieved MAP@5 of 0.88 with hallucination-free responses in user study
  • Reinforces suitability for highly regulated domains
  • Reference: https://aclanthology.org/2024.nllp-1.27.pdf

4. Information Retrieval Evaluation

4.1 Attributes for Comparing Evaluation Methods

Attribute Definitions

Reproducibility

  • Ability to obtain identical evaluation results across multiple runs with same configuration
  • Guarantees performance regressions/improvements are real, not artifacts of randomness
  • Critical for trustworthy CI/CD regression tests and audit trail maintenance

Coverage

  • Evaluates both retriever's evidence surfacing ability and generator's context usage
  • Combines retrieval-level metrics (recall@K, nDCG) with generation-level metrics (faithfulness, correctness)
  • Enables pinpointing whether errors arise from incomplete context or LLM misuse
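Both retrieval-level metric families named above are straightforward to compute; a minimal sketch with binary relevance judgments (document IDs are invented):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top K results."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved, relevant, k):
    """nDCG with binary gains: discounts hits by log2 of their rank,
    normalized by the best achievable ordering."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["d1", "d3", "d2", "d5"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3), ndcg_at_k(retrieved, relevant, 3))
```

Here recall@3 is perfect (both relevant documents appear in the top 3) while nDCG@3 is slightly below 1.0 because `d2` sits at rank 3 instead of rank 2, which is the ordering sensitivity recall alone cannot see.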

Sensitivity

  • Framework's power to distinguish small but meaningful performance differences
  • Ensures minor changes (reranker weight adjustments, index parameters) produce detectable score shifts
  • Crucial for iterative tuning without being lost in measurement noise

Automation

  • Entire evaluation pipeline runs end-to-end without manual intervention
  • Integrates seamlessly into nightly CI/CD workflows with automated alerts
  • Catches regressions/compliance violations before production deployment

Interpretability

  • Provides granular breakdowns by question category, metric family, or intent type
  • Enables tracing performance changes to specific system components or content areas
  • Supports rapid diagnosis and targeted remediation rather than opaque aggregate scores

Cost Efficiency

  • Balances computational and human-labeling expenses against insights gained
  • Chooses between automated low-cost metrics and expensive human/LLM-based approaches
  • Employs hybrid estimators to minimize spend while preserving accuracy

Domain Alignment

  • Ensures metrics, test splits, and benchmarks are tailored to airline customer service scenarios
  • Constructs scenario-specific test collections (EC 261 compensation, baggage fees, flight status)
  • Uses expert-validated metrics reflecting real-world, domain-critical situations

4.2 Evaluation Method Importance Table

| Attribute | Why it matters for airline-grade evaluation | References |
| --- | --- | --- |
| Reproducibility | Guarantees identical results across repeated runs for regression testing and audit trails | Comet: ML Reproducibility; arXiv:2412.03854 |
| Coverage | Spans both retrieval (recall@K, nDCG) and generation (faithfulness, correctness) metrics to diagnose any failure mode | arXiv:2405.07437 |
| Sensitivity | Detects small performance changes (e.g., reranker weight tweaks) to support iterative tuning | arXiv:2507.07924; ACM: Performance Changes |
| Automation | Enables fully batched, CI/CD-friendly evaluation with automated alerts for metric drift | arXiv:2409.19019 |
| Interpretability | Offers breakdowns by category or metric family to pinpoint root causes of regressions | arXiv:2507.03479; arXiv:2211.02405 |
| Cost Efficiency | Balances compute and human-label costs (LLM judges vs. automated) under operational budgets | arXiv:1807.02202; arXiv:1807.06998; arXiv:2006.13999 |
| Domain Alignment | Uses airline-specific test splits and metrics (EC 261, baggage, live-data queries) to ensure evaluation relevance | RAGAS: Rubrics-Based Metrics; UnfoldAI: RAG Evaluations |

Method Definitions

Offline Batch Evaluation

  • Executes fixed, version-controlled test sets through system in batch mode
  • Measures retrieval (recall@K, nDCG) and generation metrics (Exact Match, F1)
  • Yields fully deterministic outputs for reproducibility and audit traceability

Automated RAG Evaluation

  • End-to-end pipelines assessing both retrieval and generation stages
  • Combines retrieval-level metrics (pytrec_eval) with generation metrics (RAGAS)
  • Includes synthetic test-case generation (RAGProbe) and lightweight LLM judging (ARES)

LLM-as-Judge

  • Employs separate large language model to evaluate system outputs
  • Scores responses on correctness, faithfulness, relevance, coherence
  • Scales evaluation beyond traditional human labeling or shallow auto-metrics

Human Annotation

  • Expert annotators manually score outputs on correctness, factual grounding, compliance
  • Remains gold standard for high-stakes domains and nuanced requirements
  • Detects subtle errors automated metrics often miss

Explicit Subtopic Evaluation

  • Segments test sets into domain-specific subcategories
  • Calculates metrics independently for each slice (EC 261, baggage rules, live status)
  • Enables targeted analysis of specific failure areas
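A minimal sketch of the slicing step, assuming each evaluated answer has already been tagged with a subtopic and a pass/fail judgment (the sample data is invented):

```python
from collections import defaultdict

def per_slice_scores(results):
    """results: iterable of (subtopic, correct) pairs. Returns accuracy per
    slice, so a weak area (e.g. EC 261) is visible instead of being
    averaged away in a single aggregate score."""
    totals, hits = defaultdict(int), defaultdict(int)
    for subtopic, correct in results:
        totals[subtopic] += 1
        hits[subtopic] += int(correct)
    return {s: hits[s] / totals[s] for s in totals}

results = [("ec261", True), ("ec261", False), ("baggage", True), ("baggage", True)]
print(per_slice_scores(results))
```

The aggregate accuracy here is 75%, but the per-slice view immediately shows that compensation queries, not baggage queries, are the weak spot.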

Automated Exam Generation with IRT

  • LLMs generate multiple-choice questions from domain corpora
  • Item Response Theory models question difficulty and discriminative power
  • Enables interpretable, scalable assessments across difficulty levels
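At the heart of IRT calibration is the item characteristic curve; the common two-parameter logistic (2PL) model gives the probability of a correct answer as a function of ability θ, item difficulty b, and discrimination a:

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL item response model: P(correct) = 1 / (1 + exp(-a * (theta - b))).
    Higher ability theta raises the probability; higher discrimination a
    makes the item separate strong and weak systems more sharply."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item of average difficulty (b=0) is a coin flip for an average system.
print(p_correct_2pl(theta=0.0, a=1.0, b=0.0))
```

Fitting a and b per generated question is what lets the exam report difficulty-aware scores rather than a flat percentage correct.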

4.3 Method Comparison Matrix

| Method | Reproducibility | Coverage | Sensitivity | Automation | Interpretability | Cost Efficiency | Domain Alignment | References |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Offline Batch Evaluation | High | High | Medium | Low | High | Medium | High | arXiv:2404.13781; arXiv:2405.07437 |
| Automated RAG Evaluation | High | High | High | High | Medium | Medium | Medium | arXiv:2309.15217; arXiv:2311.09476; arXiv:2409.19019 |
| LLM-as-Judge | Medium | Generation only | High | Medium | Medium | Low | Medium | arXiv:2411.15594; arXiv:2408.08781; arXiv:2408.13006; arXiv:2406.07791 |
| Human Annotation | Low-Medium | High | High | Low | High | Low | High | arXiv:2507.15821; arXiv:2310.14424; NIST TN.2287 |
| Explicit Subtopic Evaluation | High | High | Medium | Medium | High | Medium | High | arXiv:2412.05206 |
| Automated Exam Generation with IRT | High | High | High | Medium | High | Medium | High | arXiv:2405.13622; arXiv:1605.08889; BEA 2022 |

4.4 Method Selection & Justification

Based on airline domain requirements for regulatory compliance, multi-source coverage, reproducibility, interpretability, and sensitivity to performance shifts:

| Method | Rationale |
| --- | --- |
| Automated RAG Evaluation (pytrec_eval + RAGAS + RAGProbe/ARES) | Combines retrieval-level metrics with generation-level metrics in an automated pipeline. Supports synthetic robustness testing and lightweight LLM judging. Delivers high coverage, reproducibility, and sensitivity for continuous monitoring. |
| Explicit Subtopic Evaluation | Segments datasets into airline-specific categories for per-category metrics. Maximizes interpretability and domain alignment by identifying weaknesses in specific regulatory/operational areas. |
| Automated Exam Generation with IRT | Produces controlled airline-specific test questions calibrated with Item Response Theory. Enables targeted skill probing, interpretable difficulty-aware scoring, and early capability gap detection. |


5. Implementation Recommendations

5.1 Retrieval Pipeline Integration

Layered Routing Strategy:

  1. Route 1: FAQ Q-to-Q Retrieval
    • Send FAQ-style queries for ultra-fast, fully grounded responses
    • Handle high-frequency, well-defined passenger queries
    • Free complex retrieval layers for nuanced cases
  2. Route 2: Structured Retrieval (APIs/Tools)
    • Forward real-time, data-bound queries for up-to-date outputs
    • Ensure strict compliance with airline regulations
    • Maintain verifiable provenance for mission-critical queries
  3. Route 3: Hybrid Search
    • Delegate nuanced or policy-spanning questions for broad coverage
    • Apply refined precision for complex, multi-policy airline queries
    • Respect latency constraints while ensuring comprehensive coverage
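The routing decision itself can range from keyword rules to a trained intent classifier; this toy sketch (keywords and route names are invented, not a production ruleset) just illustrates the fall-through order of the three routes:

```python
def route_query(query: str) -> str:
    """Toy intent router for the three-layer pipeline. A production system
    would use a trained classifier or LLM-based intent detection;
    keyword rules here only illustrate the control flow."""
    q = query.lower()
    if any(t in q for t in ("status", "gate", "delayed", "departure time")):
        return "structured_api"   # Route 2: live operational data
    if any(t in q for t in ("baggage allowance", "check-in deadline", "seat selection")):
        return "faq"              # Route 1: vetted FAQ answers
    return "hybrid_search"        # Route 3: policy-spanning questions

print(route_query("what gate does flight IB3214 leave from"))
```

Checking the live-data route first matters: a query like "is my delayed flight eligible for compensation" touches policy too, and a real classifier (unlike these keywords) would need to weigh both intents.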

Provenance Requirements:

  • Return each retrieval result with metadata (source document ID, passage, timestamp, API endpoint)
  • Enable full citation path traceability from answer to source material
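One lightweight way to carry this metadata is a small record attached to every retrieved chunk; the field names below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class Provenance:
    """Citation metadata attached to each retrieval result so any answer
    can be traced back to its source during audits."""
    source_id: str                       # document ID or API name
    passage: str                         # chunk text or API field returned
    retrieved_at: str                    # ISO timestamp of retrieval
    api_endpoint: Optional[str] = None   # set only for live-data answers

p = Provenance("ec261_policy_v3",
               "Article 7: EUR 250 for flights of 1,500 km or less",
               "2025-08-09T10:15:00Z")
print(asdict(p)["source_id"])
```

Making the record frozen (immutable) and serializable via `asdict` keeps provenance tamper-resistant in logs and easy to persist alongside each production response.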

5.2 Evaluation Framework Deployment

Continuous Monitoring Schedule:

  • Nightly: Automated RAG Evaluation as part of CI/CD workflows
  • Weekly: Explicit Subtopic Evaluation for focused diagnosis
  • Quarterly: IRT-calibrated exam generation for long-term capability assessment

Compliance and Auditing:

  • Persist full evaluation logs, metrics, and test sets for audit trails
  • Enable on-demand traceability for any production response
  • Maintain verifiable citation paths for regulatory review

5.3 Future Development Priorities

  1. Intent-Aware Routing Models
    • Train lightweight classifiers for dynamic query routing
    • Use LLM-based intent detection for optimal retrieval path selection
  2. Retriever & Re-ranker Fine-Tuning
    • Fine-tune embedding models on airline-specific data
    • Optimize recall, precision, and latency control
  3. Real-Time Hallucination Detection
    • Integrate Self-RAG reflection loops
    • Implement confidence-based refusal mechanisms for high-risk contexts
  4. Feedback-Driven Evaluation Loops
    • Incorporate passenger feedback and escalation logs
    • Refine test sets based on real operational pain points

6. Conclusion

This study establishes a comprehensive, domain-specific framework pairing layered retrieval techniques with robust, interpretable evaluation methods to meet the demanding requirements of airline customer service.

Key Contributions

Retrieval Architecture:

  • Ultra-fast FAQ retrieval for zero-hallucination responses
  • Real-time API integration for factual accuracy and provenance
  • Hybrid semantic search for complex, multi-policy queries

Evaluation Framework:

  • Automated pipelines combining retrieval and generation metrics
  • Domain-specific subtopic analysis for targeted improvements
  • Difficulty-aware exam generation for capability monitoring

System Benefits

The framework enables chatbot architecture that is:

  • Accurate and Compliant: Grounded, auditable responses supporting regulatory trust
  • Resilient to Change: Detects regressions before user impact through continuous monitoring
  • Scalable Over Time: Modular evaluation, feedback loops, and domain-tuned test sets

Strategic Impact

This configuration transforms the chatbot from a reactive interface into a living knowledge infrastructure capable of adapting to new regulations, service offerings, and customer behaviors. The system positions itself as a core enabler of safe, scalable, and intelligent automation in the airline industry.

The framework's emphasis on regulatory precision, operational readiness, and strategic adaptability ensures sustainable deployment in one of the most demanding customer service environments, setting a foundation for broader application across regulated industries requiring high-stakes conversational AI.

Related Topics

airline chatbots, RAG systems, information retrieval, chatbot evaluation, airline AI, multi-agent systems, AI agents aviation, customer service automation, hybrid search, aviation technology, regulatory compliance

About the Author

Gustavo Meneses

Applied AI Engineer, Kaiban

I'm an Applied AI Engineer with 15 years of software engineering experience. Back in university, I wrote my thesis on making search algorithms faster for machine learning problems. I love building things with Python, Golang, and Rust, and have worked on everything from e-commerce platforms to enterprise systems. These days, I'm focused on building multi-agent systems for airlines at Kaiban.
