
A Comparative Study of Intent Classification Techniques and Evaluation Methods for Airline Applications

Detailed research on intent classification techniques and evaluation methods for airline applications, covering prompting-based, embedding-based, and hybrid approaches for understanding customer needs.

Luis Reynaldo
Applied AI Engineer, Kaiban
Published: August 9, 2025
Read time: 35 min

Abstract

This study analyzes intent classification techniques for airline customer service applications.

Airlines operate in a high-stakes environment where misunderstanding passenger intent can trigger operational chaos. Multilingual requirements complicate everything. Dynamic contexts shift by the hour. One misclassified intent about a flight change or baggage issue can cascade through systems, affecting schedules, crew assignments, and passenger safety.

We examine 18 techniques across seven categories, spanning prompting-based, embedding-based, generative, fine-tuning-based, hybrid, specialized, and LLM-based clarification approaches. Each is evaluated against accuracy, latency, few-shot learning, and cross-domain transfer.

Our analysis reveals Few-Shot In-Context Learning as the most effective approach for initial deployment, while Hybrid Systems with Uncertainty-Based Routing prove superior for production scaling. The evaluation framework covers four critical areas: conformal prediction for uncertainty quantification, out-of-scope detection for operational safety, latency profiling for real-time responsiveness, and multi-intent detection for complex queries.

This research provides practical guidance for deploying robust intent classification in safety-critical, multilingual airline environments.

1. Domain Characterization

1.1 Context

This study focuses on airlines as a domain that presents several characteristics relevant to intent classification research. Airlines operate in a complex operational environment with diverse customer interaction patterns.

Domain Criticality: Airline intent misclassification can have significant operational and financial consequences. Research shows that operating costs from delays constituted 86-134% of airline operating profits in 2007, with industry-wide delay costs ranging $8.3-13 billion annually (Gu et al., 2024). Misunderstanding customer requests during service disruptions, flight changes, or emergency situations can impact passenger experience and operational efficiency. The stakes are higher than in typical e-commerce or general customer service applications.

Multilingual and Cultural Complexity: International airlines must support intent classification across multiple languages and cultural contexts, where the same underlying need may be expressed differently across regions. Studies indicate that 74% of customers are more likely to repurchase from businesses offering customer service in their native language (HappyFox, 2025). Passenger queries reflect diverse linguistic patterns and cultural expectations for service delivery.

Operational Dynamics: Airlines operate in a constantly changing environment where flight schedules, policies, weather conditions, and regulatory requirements evolve continuously. Intent classification systems must handle queries about dynamic information while maintaining accuracy across changing operational contexts.

Contextual Dependencies: Airline customer interactions often involve multiple related intents and require understanding of travel context. Passengers frequently have compound requests that span booking modifications, service preferences, and operational concerns within a single conversation.

System Integration Requirements: Unlike many conversational AI applications, airline intent classification typically requires integration with real-time operational systems including reservations, flight operations, baggage tracking, and customer management platforms. This integration complexity adds constraints to system design and deployment.

These characteristics make airlines a challenging and representative domain for evaluating intent classification techniques, as solutions must demonstrate robustness, adaptability, and operational reliability.

1.2 Canonical User Intents for Airline Systems

To design and evaluate intent classification systems in the airline domain, it is essential to establish a comprehensive and functionally organized set of user intents. These intents represent the core types of queries and requests that passengers generate when interacting with airline customer service platforms—whether through chatbots, apps, or voice assistants—across all phases of the travel experience.

This study defines 35 canonical intents grouped into seven functional categories. These were derived through comparative analysis of public dialogue datasets including ATIS (Hou et al., 2024) with 17 intent categories, SGD (Rastogi et al., 2020) with 20 domains covering travel, events, and other services, and CLINC150 (Larson et al., 2019) with 150 intents across 10 domains, combined with analysis of real-world airline chatbot interfaces and relevant literature. The taxonomy reflects both traditional service needs and emerging user demands such as environmental responsibility, digital account management, and personalized onboard services.

1. Flight Planning & Reservation

These intents capture user interactions during the early planning and booking stages.

  • flight_search – Find available flights based on destination, date, time, or price.
  • flight_booking – Complete a flight reservation, including passenger details and add-ons.
  • seat_selection – Select or change seat assignments, request upgrades.
  • fare_rules_explained – Inquire about fare conditions, change penalties, refundability.
  • booking_modification – Change flight details (e.g., dates, names) for an existing reservation.
  • booking_cancellation – Cancel a booked ticket and receive applicable refund terms.
  • upgrade_offer_bid – Submit or review upgrade offers (e.g., to business class).
  • lounge_access – Ask about eligibility, pricing, and location of airport lounges.

2. Check-in & Boarding

These intents occur closer to departure and involve real-time or operational support.

  • check_in_boarding – Perform check-in or retrieve digital boarding passes.
  • boarding_group_info – Understand assigned boarding group or priority boarding rules.
  • flight_status – Request real-time updates on flight departure, delays, gate changes.
  • alternative_flights – Search for other flight options (e.g., in case of disruption).
  • rebooking_assistance – Automatically rebook after a delay, cancellation, or missed connection.
  • weather_updates – Check how weather may affect flights.
  • airport_information – Ask about services, navigation, or amenities at a specific airport.

3. Baggage Services

These intents focus on all aspects of baggage policies, tracking, and incidents.

  • baggage_policy – Ask about weight, size, and allowance rules.
  • baggage_fees – Calculate or understand fees for checked or excess baggage.
  • carry_on – Verify carry-on size, item restrictions, and allowances.
  • baggage_tracking – Track the status or location of checked baggage.
  • lost_baggage – Report or follow up on lost, delayed, or damaged baggage.
  • special_equipment – Ask about traveling with sports gear, musical instruments, medical devices.

4. Post-Sale Support & Claims

These intents represent post-booking concerns, support, and compensation.

  • refund_request – Request refund based on ticket conditions or travel disruption.
  • compensation_claim – File a claim for delays, cancellations, baggage incidents.
  • travel_insurance – Inquire about coverage, purchase options, or initiate a claim.
  • loyalty_program – Ask about mileage accrual, redemption, tier benefits, or upgrades.

5. Documentation & Travel Requirements

These intents deal with regulatory and compliance matters before departure.

  • visa_requirements – Check visa obligations for the destination.
  • passport_validity – Verify passport expiration and entry rules.
  • health_requirements – Ask about vaccination, testing, or medical certificate needs.
  • travel_restrictions – Request information on entry bans, COVID-19 rules, or quarantine.

6. Special Services & Onboard Preferences

These intents cover optional or personalized services.

  • meal_preferences – Select or inquire about special meals (e.g., vegetarian, halal).
  • special_assistance – Request wheelchair service, escort, or medical accommodations.
  • carbon_offset_program – Participate in sustainability programs or purchase offsets.
  • wifi_connectivity – Ask about onboard Wi-Fi availability, pricing, and usage.

7. Digital Account & Mobile App Support

These intents involve digital access and personalization.

  • digital_account_login – Troubleshoot login issues for user profiles or frequent flyer accounts.
  • app_notification_settings – Configure alerts for delays, gate changes, promotions, or check-in reminders.

Note: This taxonomy represents a theoretical framework derived from dataset analysis and industry observation. The 35 intents proposed here are based on comparative analysis of existing dialogue datasets and airline service patterns, but empirical validation with real airline operational data would be required for production deployment.

1.3 Domain-Specific Requirements for Intent Classification Systems

The airline domain presents specific challenges for intent classification that influence how standard classification requirements manifest in this operational context. The following requirements emerge from the theoretical analysis of airline intents and how passengers typically express their needs in this domain.

Classification Accuracy Requirements

  • Intent Disambiguation Capability - Airline intents frequently overlap semantically (baggage_policy vs baggage_fees, flight_search vs flight_status), requiring systems that distinguish between closely related categories based on subtle linguistic cues
  • Domain Terminology Recognition - Classification must handle airline-specific vocabulary (layover, codeshare, PNR, oversold) and abbreviations that affect intent identification
  • Context-Dependent Classification - Passengers may use the same phrase for different intents depending on their specific travel situation

Robustness Requirements

  • Variation Handling - Passengers express the same intent using diverse terminology influenced by regional differences, airline branding, and travel experience levels
  • Multilingual Classification Consistency - Intent boundaries must remain consistent across languages where travel concepts may have different cultural associations
  • Out-of-Scope Detection for Domain Boundaries - Distinguish between airline-related queries and unrelated requests (hotel bookings, ground transportation)

Adaptability Requirements

  • Seasonal Intent Recognition - Classification must adapt to temporal variations in how passengers express travel needs (holiday travel patterns, weather-related concerns, seasonal routes)
  • Intent Granularity Flexibility - Ability to classify at different levels of specificity as airline services evolve
  • New Intent Integration - Accommodate emerging intent categories as airline services change (new health requirements, technology services, environmental concerns)

These requirements specifically address the challenges of correctly identifying passenger intent categories within the airline domain's linguistic and operational constraints.

2. Guiding Research Questions

2.1. Questions for Intent Classification Techniques

  1. What are the main existing techniques for intent classification in conversational AI applications?
  2. What technical attributes and capabilities are used to compare these techniques (architecture, few-shot learning, latency, computational requirements)?
  3. How do these techniques perform across key metrics like accuracy, out-of-scope detection, and operational efficiency in airline-relevant scenarios?
  4. Which techniques are most suitable for airline chatbots considering domain-specific requirements and constraints?
  5. How can intent classification techniques be combined with clarification and out-of-scope detection to improve system robustness?

2.2. Questions for Evaluation Methods

  1. What are the main evaluation methodologies used for assessing intent classification systems?
  2. What metrics, datasets, and evaluation protocols are employed to measure performance and compare different approaches?
  3. How do these evaluation methods address airline-specific challenges like multilingual support, contextual dependencies, and operational criticality?
  4. What evaluation strategy is most appropriate for validating intent classification systems in airline applications?

3. Intent Classification Techniques

To select the most appropriate intent classification techniques for airline applications, it is necessary to systematically analyze the available options. This section compares the main techniques identified in the current literature, evaluating their specific characteristics in relation to the domain requirements outlined earlier.

3.1. Identified Techniques

Through a systematic review of current literature, 18 specific intent classification techniques were identified and categorized into seven main groups. These techniques represent the state of the art in the field, ranging from fully training-free approaches to sophisticated hybrid systems.

Prompting-Based Techniques

  • Zero-Shot In-Context Learning (ICL) – A technique in which the language model classifies user intents using only textual descriptions of each intent, without any prior training examples. The LLM relies on its pretrained knowledge to map the user query to the most appropriate intent category based solely on the definitions provided (Parikh et al., 2023).
  • Few-Shot In-Context Learning – This approach includes 1 to 10 representative examples of each intent directly within the prompt, along with the query to be classified. The LLM infers patterns and features from the examples to perform intent classification (Parikh et al., 2023; Zhang et al., 2024); a prompt-construction sketch follows this list.
  • Adaptive In-Context Learning – An extension of few-shot ICL that dynamically selects the most relevant examples for each query using semantic similarity. Embedding models are used to retrieve the K nearest examples to the input, and the prompt is constructed adaptively (Arora et al., 2024; Rodriguez et al., 2024).
  • Chain-of-Thought (CoT) Prompting – A prompting technique that instructs the LLM to explain its reasoning step by step before producing a final classification. The model verbalizes the analysis process, highlighting key features of the input before issuing its decision (Arora et al., 2024).
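To make the few-shot variant concrete, the sketch below assembles a K-shot classification prompt for a handful of intents from the taxonomy in Section 1.2. The example utterances and the `call_llm` placeholder are hypothetical; any chat-completion client could stand in for it.

```python
# Minimal few-shot ICL prompt builder (illustrative sketch).
# The labeled examples are hypothetical; in practice they would come
# from a small annotated sample per intent (typically K = 1 to 10).
FEW_SHOT_EXAMPLES = [
    ("How much does a second suitcase cost?", "baggage_fees"),
    ("Can I bring my surfboard on the flight?", "special_equipment"),
    ("Is flight AV204 delayed today?", "flight_status"),
    ("I want to move my trip to next Friday", "booking_modification"),
]

INTENT_LABELS = sorted({label for _, label in FEW_SHOT_EXAMPLES})

def build_prompt(query: str) -> str:
    """Assemble a K-shot classification prompt from labeled examples."""
    lines = [
        "Classify the airline customer query into exactly one intent.",
        f"Valid intents: {', '.join(INTENT_LABELS)}",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Query: {text}\nIntent: {label}\n")
    lines.append(f"Query: {query}\nIntent:")
    return "\n".join(lines)

def classify(query: str, call_llm) -> str:
    """`call_llm` is a placeholder for any completion client: str -> str."""
    return call_llm(build_prompt(query)).strip()
```

Adaptive ICL differs only in how FEW_SHOT_EXAMPLES is chosen: instead of a fixed list, the K nearest neighbors of the query are retrieved by embedding similarity at request time.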

Embedding-Based Techniques

  • Dual Sentence Encoders (USE + ConveRT) – An architecture that combines the Universal Sentence Encoder (USE) for general-purpose representations with ConveRT, a conversationally optimized encoder, as fixed feature extractors. The resulting embeddings are concatenated and passed to a simple MLP classifier with one hidden layer. Both encoders remain frozen during training (Casanueva et al., 2020).
  • SetFit (Sentence Transformer Fine-tuning) – A few-shot learning method that performs contrastive fine-tuning of sentence transformers using positive and negative pairs, followed by training a lightweight classifier on the learned representations. It combines contrastive learning with discriminative classification (Arora et al., 2024).
  • Label-Aware BERT Attention Network (LABAN) – A neural architecture that builds an embedding space explicitly informed by the semantics of intent labels. It uses attention mechanisms to project user queries into this semantic space and classifies them based on projection weights toward each intent category (Wu et al., 2021).

Generative-Based Techniques

  • Text-to-Text Generation (Gen-PINT) – A complete reformulation of the classification problem as a free-form text generation task. Instead of selecting from predefined intent classes, the LLM generates the intent name or description directly as natural language using instruction tuning (Zhang et al., 2024).
  • Intent Discovery with LLMs (IntentGPT) – A training-free system that employs a dual-LLM architecture for automatic intent discovery. It combines a contextual prompt generator, an intent predictor, and a semantic sampler that selects relevant examples through automatic clustering to identify emerging intent categories (Rodriguez et al., 2024).
  • Generate-then-Refine – A two-stage pipeline in which synthetic utterances are first generated using LLMs in a zero-shot setting to expand the dataset, followed by a seq2seq refinement model that enhances the quality, coherence, and utility of the generated data (Lin et al., 2024).

Fine-Tuning-Based Techniques

  • BERT Fine-tuning – Full fine-tuning of pretrained transformer models (e.g., BERT-Large) on domain-specific intent classification datasets. This approach requires updating all model parameters to adapt the model to the target task (Casanueva et al., 2020).
  • PEFT (IA3 Adapters) – A parameter-efficient fine-tuning method that uses IA3 adapters (Infused Adapter by Inhibiting and Amplifying Inner Activations) to adapt pretrained models by training only a small subset of parameters, achieving competitive performance with very limited examples (Parikh et al., 2023).

Hybrid Techniques

  • Hybrid System with Uncertainty-Based Routing – A system that combines lightweight models (e.g., SetFit) with large language models through an uncertainty-driven routing strategy. It uses confidence estimation to forward queries to more capable models only when uncertainty exceeds a predefined threshold, balancing accuracy and efficiency (Arora et al., 2024).
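A minimal sketch of this routing logic is shown below, assuming a lightweight classifier that exposes class probabilities and an LLM fallback. The helper names and the 0.8 threshold are illustrative assumptions, not values from Arora et al. (2024).

```python
def route_query(query, light_model, llm_fallback, threshold=0.8):
    """Uncertainty-based routing between a cheap and an expensive model.

    `light_model(query)`  -> dict mapping intent name to probability
    `llm_fallback(query)` -> intent name from a larger, slower model
    The 0.8 confidence threshold is illustrative; in practice it is
    tuned on held-out data (entropy or margin scores also work here).
    """
    probs = light_model(query)
    best_intent, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return best_intent      # cheap path: confident lightweight prediction
    return llm_fallback(query)  # escalate only the uncertain queries
```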

Specialized Techniques

  • Out-of-Scope Detection – A set of specialized methods designed to detect user queries that do not belong to any predefined intent. These approaches include confidence-based thresholds, representation clustering, and similarity analysis to prevent misclassification of out-of-domain inputs (Arora et al., 2024).
  • Two-Step OOS Detection – A two-stage methodology for out-of-scope detection. First, it predicts in-scope labels while ignoring potential OOS cases. Then, it compares the transformer's internal representations with training instances using cosine similarity to identify outliers (Arora et al., 2024).
  • Multi-Intent Detection – Techniques specifically designed to detect and classify multiple intents within a single user query. These systems rely on multi-label architectures, attention mechanisms, or sequential decomposition to capture all distinct purposes present in complex utterances (Wu et al., 2021).

LLM-Based Clarification Techniques

  • LLM Ambiguity Identification and Clarification – A system that leverages large language models (e.g., ChatGPT, GPT-4) to identify ambiguous user queries and generate appropriate clarification questions. It employs chain-of-thought reasoning and few-shot prompting to systematically detect and resolve ambiguities (Zhang et al., 2024).
  • LLM-Based Interactive Clarification – A method that uses a "communicator" LLM to identify high-uncertainty and low-confidence segments in user problem descriptions. It then generates specific clarification questions to elicit additional information before proceeding with the task (Wu, 2023).

These 18 techniques represent the full spectrum of approaches available for intent classification, ranging from simple and efficient methods to complex systems with advanced capabilities for intent discovery and clarification.

Note: Some techniques represent conceptual approaches or recent developments that may require additional empirical validation for comparative evaluation in operational airline systems.

3.2. Technical Attributes for Comparison

To systematically evaluate the identified intent classification techniques, key attributes were defined based on a comprehensive review of the research literature. These attributes represent the most relevant dimensions used by researchers to characterize and compare methods, and are particularly applicable to airline-related use cases.

Primary Performance Metrics

  • Accuracy - Primary metric used across all frameworks for overall classification correctness
  • F1-Score - Harmonic mean of precision and recall, especially important for out-of-scope detection and imbalanced classes

Specialized Metrics

  • Out-of-Scope Recall - Percentage of out-of-scope samples correctly identified as not belonging to any predefined intent category. Critical for production systems to prevent incorrect automated responses when passengers ask questions outside the chatbot's domain (e.g., asking about hotel bookings to an airline chatbot). High OOS recall prevents embarrassing misclassifications but must be balanced with precision to avoid rejecting valid queries.
  • NMI (Normalized Mutual Information) - Clustering evaluation metric that measures the quality of discovered intent groupings compared to ground truth labels. Values range from 0 (random clustering) to 1 (perfect clustering). Used specifically for intent discovery tasks where the system automatically identifies new intent categories from unlabeled data, particularly relevant for discovering emerging customer needs in airline services.
  • ARI (Adjusted Rand Index) - Clustering metric that compares predicted intent partitions against ground truth, adjusted for chance agreement. Unlike raw accuracy, ARI accounts for the expected similarity that would occur by random chance. Values range from -1 to 1, where 1 indicates perfect clustering. Used alongside NMI for comprehensive evaluation of intent discovery systems.
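Both clustering metrics are available off the shelf in scikit-learn; the sketch below scores a hypothetical intent-discovery run against ground-truth labels. The data is made up for illustration, and cluster IDs need not match label names.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical ground-truth intents and discovered cluster assignments.
true_labels = ["baggage_fees", "baggage_fees", "flight_status",
               "flight_status", "lounge_access", "lounge_access"]
discovered = [0, 0, 1, 2, 2, 2]  # imperfect: cluster 2 mixes two intents

print("NMI:", normalized_mutual_info_score(true_labels, discovered))
print("ARI:", adjusted_rand_score(true_labels, discovered))
```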

Operational Metrics

  • Latency/Inference Time - Time required to process a single query from input to intent prediction output. Critical for real-time customer service applications where delays impact user experience. Measured in milliseconds or seconds, with airline chatbots typically requiring sub-2-second responses to maintain conversational flow.

Learning Configuration Performance

  • Zero-Shot Performance - Model's ability to classify intents using only textual descriptions without any training examples. Evaluated by providing intent definitions (e.g., "flight_search: user wants to find available flights") and measuring classification accuracy on unseen queries. Crucial for rapidly deploying systems to new routes or services without collecting training data.
  • Few-Shot Performance - Classification accuracy when trained with very limited examples per intent (typically 1, 5, or 10 examples). Measures data efficiency and practical deployment feasibility, as collecting extensive training data for every airline intent is resource-intensive. Performance curves show how accuracy improves with additional examples.
  • Cross-Domain Transfer - Model's ability to maintain performance when applied to different airline contexts or related domains. For example, a model trained on domestic flight intents transferring to international travel queries, or adapting from one airline's terminology to another's. Measures generalization capability across operational contexts.

Additional Evaluation Metrics

  • Semantic Similarity - Cosine similarity between generated intent names/descriptions and ground truth labels, measured using sentence embeddings. This metric is used when models generate free-form intent labels rather than selecting from predefined categories. It is particularly relevant for intent discovery systems (e.g., IntentGPT) or text-to-text generative models (e.g., Gen-PINT), where semantic matching is more informative than exact string comparison.
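As an illustration, the snippet below computes this metric with the sentence-transformers library. The model name and the example strings are arbitrary choices for the sketch, not values prescribed by the cited papers.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

generated = "user wants to track a checked bag"
ground_truth = "baggage_tracking: track the status or location of checked baggage"

embeddings = model.encode([generated, ground_truth], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")
```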

3.3. Performance Comparison of Techniques

This section presents a systematic comparison of techniques using the attributes defined in the previous section. The techniques are grouped by methodological category to facilitate comparative analysis and highlight patterns across different approaches.

Prompting-based Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot ICL | 73.9% (MASSIVE, Flan-T5-XXL); 89.3% (Benchmark01, GPT-3) [Parikh et al., 2023] | Not reported | 0.97 (Benchmark01, GPT-3); 0.67 (Benchmark02, GPT-3) [Parikh et al., 2023] | Not reported | Not reported | Not reported | Primary method [Parikh et al., 2023] | Not applicable | Not reported | Not reported |
| Few-Shot ICL | 63% (MASSIVE, K=5, ELMSE); 80% (Benchmark01, K=5, ELMSE) [Parikh et al., 2023] | Not reported | Not reported | Not reported | Not reported | Not reported | Not applicable | K=3: 57% (MASSIVE); K=5: 63% (MASSIVE) [Parikh et al., 2023] | Not reported | Not reported |
| Adaptive ICL | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Adaptive example retrieval [Arora et al., 2024] | Not reported | Not reported |
| Chain-of-Thought | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Emergent capability ≥100B parameters [Wei et al., 2022] | Few-shot prompting [Wei et al., 2022] | Cross-task evaluation [Wei et al., 2022] | Not reported |

Embedding-based Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dual Encoder | 85.19% (BANKING77, 10-shot); 93.36% (BANKING77, full) [Casanueva et al., 2020] | Not reported | Not reported | Not reported | Not reported | Reported as faster than BERT [Casanueva et al., 2020] | Not reported | 10-shot performance demonstrated [Casanueva et al., 2020] | Not reported | USE + ConveRT embeddings [Casanueva et al., 2020] |
| SetFit | Not reported | Not reported | Not reported | Not reported | Not reported | Reported as highly efficient [Tunstall et al., 2022] | Not reported | 8–16 examples typical [Tunstall et al., 2022] | Not reported | Contrastive fine-tuning approach [Tunstall et al., 2022] |
| LABAN | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Reported zero-shot capability [Wu et al., 2021] | Not reported | Reported zero-shot transfer capability [Wu et al., 2021] | Label-aware semantic space [Wu et al., 2021] |

Generative-based Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gen-PINT | 77.38% (average across 8 datasets, 1-shot) [Zhang et al., 2024] | Not reported | Not reported | Not reported | Not reported | Not reported | Reported cross-domain capability [Zhang et al., 2024] | Specialized 1-shot performance [Zhang et al., 2024] | Reported domain-agnostic generalization [Zhang et al., 2024] | Not reported |
| IntentGPT | 77.21% (BANKING77, 50-shot GPT-4) [Rodriguez et al., 2024] | Not reported | Not reported | 96.06% (CLINC150, 50-shot GPT-4) [Rodriguez et al., 2024] | 84.76% (CLINC150, 50-shot GPT-4) [Rodriguez et al., 2024] | Not reported | Training-free approach [Rodriguez et al., 2024] | 50-shot performance reported [Rodriguez et al., 2024] | Training-free generalization [Rodriguez et al., 2024] | SBERT embeddings + cosine similarity [Rodriguez et al., 2024] |
| Generate-then-Refine | 76.9% (CLINC150, 1-shot) [Lin et al., 2024] | Not reported | Not reported | Not reported | Not reported | Not reported | Reported cross-domain transfer [Lin et al., 2024] | Optimized 1-shot generation [Lin et al., 2024] | Reported domain generalization [Lin et al., 2024] | Not reported |

Fine-Tuning-Based Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT Fine-tuning | 96.93% (CLINC150, full dataset) [Casanueva et al., 2020] | Not reported | Not reported | Not reported | Not reported | Not reported | Requires full fine-tuning [Casanueva et al., 2020] | Requires sufficient training data [Casanueva et al., 2020] | Not reported | BERT embeddings [Casanueva et al., 2020] |
| PEFT (IA3) | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | One example per intent sufficient [Parikh et al., 2023] | Not reported | Not reported |

Hybrid-Based Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hybrid System with Uncertainty-Based Routing | Reported within 2% of full LLM accuracy [Arora et al., 2024] | Not reported | Not reported | Not reported | Not reported | Reported 50% latency reduction [Arora et al., 2024] | Combines lightweight and LLM approaches [Arora et al., 2024] | Uncertainty-based optimization [Arora et al., 2024] | Not reported | Hybrid architecture [Arora et al., 2024] |

Specialized Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Out-of-Scope Detection | Not reported | Critical for evaluation [Arora et al., 2024] | Primary metric for OOS [Arora et al., 2024] | Not applicable | Not applicable | Not reported | Threshold-based methods [Arora et al., 2024] | Adaptable to different configurations [Arora et al., 2024] | Not reported | Threshold-based similarity [Arora et al., 2024] |
| Two-Step OOS | Not reported | Reported >5% improvement [Arora et al., 2024] | Reported >5% improvement [Arora et al., 2024] | Not applicable | Not applicable | Not reported | Compatible with base classifiers [Arora et al., 2024] | Supports few-shot adaptation [Arora et al., 2024] | Not reported | Internal representations similarity [Arora et al., 2024] |
| Multi-Intent Detection | Not reported | State-of-the-art performance [Wu et al., 2021] | Not reported | Not applicable | Not applicable | Not reported | Multi-label classification capability [Wu et al., 2021] | Architecture-dependent [Wu et al., 2021] | Not reported | Multi-label architecture [Wu et al., 2021] |

LLM-Based Clarification Techniques

| Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM Ambiguity Identification | 54.25% (ChatGPT average) [Zhang et al., 2024] | 52.77% (ChatGPT average) [Zhang et al., 2024] | Reported to reduce ambiguity [Zhang et al., 2024] | Not applicable | Not applicable | Clarification overhead reported [Zhang et al., 2024] | Chain-of-Thought-based identification [Zhang et al., 2024] | Compatible with prompting examples [Zhang et al., 2024] | Not reported | Chain-of-Thought-based analysis [Zhang et al., 2024] |
| LLM Interactive Clarification | Not reported | Not reported | Reported to handle high uncertainty [Wu, 2023] | Not applicable | Not applicable | Multi-turn interaction overhead [Wu, 2023] | Automatic identification capability [Wu, 2023] | Adaptable framework [Wu, 2023] | Not reported | Uncertainty-based interaction [Wu, 2023] |

The comparative analysis reveals that no single technique consistently outperforms others across all evaluation metrics. Prompting-based methods, such as Zero-Shot ICL and Adaptive ICL, offer clear advantages in training-free scenarios, showing competitive performance in zero-shot classification and emerging capabilities in step-by-step reasoning. Embedding-based approaches, such as Dual Encoder, stand out for their operational efficiency and low inference time, making them appealing for production-grade systems. Generative techniques like IntentGPT and Gen-PINT introduce new opportunities for intent discovery and unsupervised classification, albeit with higher computational demands. Among hybrid methods, Hybrid System with Uncertainty-Based Routing emerges as a promising solution by balancing latency and accuracy through the combination of lightweight classifiers and high-capacity LLMs. Finally, specialized techniques, such as Two-Step OOS and LLM-based clarification systems, enable targeted handling of critical challenges like out-of-scope detection and interactive clarification.

These findings provide a foundation for the informed selection of techniques in the next section.

3.4. Technique Suitability for Airline Context

For the initial deployment of an intent classification system in the airline domain, we select Few-Shot In-Context Learning (ICL) as the baseline technique. This decision is grounded in operational simplicity, implementation feasibility, and alignment with early-stage scenarios where limited annotated data is available.

Technical Rationale

Few-Shot ICL is a prompt-based method that leverages pretrained large language models (LLMs) to perform classification without any additional fine-tuning. It operates by providing the model with a few labeled examples (typically 1 to 10) embedded directly in the prompt, followed by an unlabeled query. The LLM infers the most probable intent based on its prior knowledge and the observed input patterns.

Parikh et al. (2023) show that this technique provides a strong baseline for intent classification even with minimal data. On the MASSIVE benchmark, their ELMSE baseline achieved 63% accuracy with only K=5 examples per intent, while Flan-T5-XXL reached 73.9% in zero-shot mode, indicating that both prompting configurations are effective under different operational constraints.

Furthermore, Few-Shot ICL requires no supervised training or modification of the model architecture, significantly reducing deployment complexity. Zhang et al. (2024) also highlight this method as a reliable baseline in low-resource classification experiments, especially in multilingual and cross-domain contexts (Zhang et al., 2024).

Operational Advantages for First Deployment

  • No fine-tuning required: Models can be used as-is with only prompt engineering.
  • Architecture-agnostic: Compatible with any modern LLM (e.g., GPT-4, Claude, Mistral).
  • Multilingual by design: LLMs exhibit multilingual competence in classification tasks (Chung et al., 2022; Winata et al., 2023).
  • Benchmark-compatible: Easily evaluated using standard few-shot benchmarks like CLINC150 or BANKING77 in 1-shot, 5-shot, and 10-shot configurations (Casanueva et al., 2020).

Forward-Looking Enhancements

While Few-Shot ICL offers a robust starting point, more advanced techniques can be adopted in future iterations to meet specific operational demands:

  • Hybrid System with Uncertainty-Based Routing, as proposed by Arora et al. (2024), combines lightweight models (e.g., SetFit) with LLMs using uncertainty-based routing. This method achieves accuracy within 2% of full LLM inference while reducing latency by 50%, making it ideal for production systems (Arora et al., 2024).
  • IntentGPT, introduced by Rodriguez et al. (2024), enables training-free intent discovery. It achieves 96.06% NMI and 84.76% ARI on CLINC150, making it suitable for identifying emerging user needs and updating intent taxonomies without labeled data (Rodriguez et al., 2024).
  • PEFT using IA3 adapters, explored by Parikh et al. (2023), offers parameter-efficient fine-tuning with strong performance using just one example per class, outperforming traditional full-model tuning in few-shot scenarios (Parikh et al., 2023).

These alternatives provide clear upgrade paths aligned with future system requirements such as latency optimization, out-of-scope detection, semantic scalability, and continuous integration of new airline services.

3.5 Combined Use: Intent Classification, Clarification, and Out-of-Scope Detection

In real-world airline applications, user queries often contain ambiguity, lack critical information, or fall entirely outside the scope of predefined intents. Such scenarios challenge the robustness of intent classification systems, particularly in high-stakes environments like airline operations, where misclassification can lead to operational errors or poor customer experience. To address these challenges, recent research has proposed combining intent classification with clarification mechanisms and out-of-scope (OOS) detection, resulting in more resilient and adaptive systems.

Integration Strategies

Three general integration strategies have emerged for combining intent classification, clarification, and OOS detection:

  1. Clarification-before-Classification: When a user query is ambiguous or underspecified, the system generates a clarification question to elicit additional information. Only after receiving a clarifying response does the system perform intent classification.
  2. OOS Filtering: Before or after attempting classification, the system evaluates whether the input falls outside the supported intent taxonomy. If the confidence score is low or the semantic similarity to in-scope intents is weak, the query is flagged as out-of-scope and handled accordingly.
  3. Uncertainty-Driven Routing: The system estimates its own confidence (e.g., via entropy, Monte Carlo dropout, or cosine similarity) and routes low-confidence inputs to a secondary module—such as a large language model (LLM) for clarification or a verification step using OOS thresholds.

These strategies can be implemented in sequence or conditionally, forming dynamic pipelines tailored to the nature of the input.
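The sketch below wires the three strategies into one conditional pipeline. It assumes three injectable components (a classifier returning a confidence score, an in-scope similarity check, and a clarification hook); the thresholds and signatures are illustrative, not prescribed by the cited work.

```python
def handle_query(query, classifier, in_scope_score, ask_clarification,
                 oos_threshold=0.5, confidence_threshold=0.75):
    """Dynamic pipeline: OOS filtering -> classification -> clarification.

    `classifier(query)`        -> (intent, confidence)
    `in_scope_score(query)`    -> similarity of the query to the taxonomy
    `ask_clarification(query)` -> clarified text elicited from the user
    Both thresholds are illustrative and would be tuned per deployment.
    """
    # 1. OOS filtering: reject queries far from any supported intent.
    if in_scope_score(query) < oos_threshold:
        return "out_of_scope", "I can only help with airline-related requests."

    # 2. Attempt classification with the base model.
    intent, confidence = classifier(query)

    # 3. Clarification-before-commitment for low-confidence predictions.
    if confidence < confidence_threshold:
        clarified = ask_clarification(query)
        intent, confidence = classifier(f"{query} {clarified}")

    return intent, confidence
```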

Improving System Robustness

The integration of clarification and OOS detection into intent classification systems improves robustness in the following ways:

  • Error Prevention: OOS detection reduces false positives by rejecting inputs that do not belong to the supported domain (e.g., hotel bookings in an airline chatbot).
  • Disambiguation Support: Clarification questions enable the system to resolve input ambiguity (e.g., distinguishing between baggage_policy and baggage_fees).
  • Adaptive Processing: Uncertainty-based routing allows the system to dynamically escalate inputs to more capable modules when needed, rather than making uncertain predictions.

These improvements collectively reduce misclassification, increase user trust, and provide a more transparent conversational experience.

Functional Example

Consider the query: "What do I pay for my bags?"

This input could correspond to either baggage_policy (allowance rules) or baggage_fees (cost for additional bags).

A combined system would process this input as follows:

  1. The base classifier detects high uncertainty due to semantic overlap.
  2. The system triggers a clarification step: "Do you mean baggage allowance or extra fees?"
  3. The user responds: "Extra fees."
  4. The system now classifies the query as baggage_fees with high confidence.

Alternatively, for the input "Where can I book a hotel?", the OOS detection module would flag the query as out-of-domain and trigger a fallback response without attempting classification.

Relevance to Airline Applications

Airline customer interactions are particularly prone to ambiguity, context dependency, and off-topic requests. Integrating clarification and OOS detection into the classification pipeline is essential to:

  • Handle ill-specified user inputs, especially from novice or multilingual users.
  • Avoid inappropriate automated responses to out-of-domain requests.
  • Enhance transparency and reliability, especially during operational disruptions or service recovery.

Such integrated systems are more aligned with real-world airline needs, where robust understanding and reliable decision-making under uncertainty are critical.

4. Intent Classification Evaluation

4.1 Attributes for Comparing Evaluation Methods

Attribute Definitions

Uncertainty Assessment

  • Measures the evaluation method's ability to assess how well models quantify prediction uncertainty and confidence
  • Evaluates whether the method can determine if confidence scores accurately reflect actual prediction reliability
  • Critical for airline operations where understanding prediction uncertainty prevents overconfident automated decisions and enables appropriate escalation to human agents

Reproducibility

  • Ability to obtain identical evaluation results across multiple runs with consistent configuration and random seed control
  • Ensures performance improvements/regressions are genuine rather than measurement artifacts or statistical noise
  • Essential for regulatory compliance, audit trail maintenance, and trustworthy regression testing in safety-critical airline systems

Out-of-Scope Detection Performance

  • Evaluates system's ability to identify and reject queries outside the supported intent taxonomy using specialized metrics like AU-IOC
  • Prevents misclassification of hotel bookings, rental cars, and unrelated requests into airline intent categories
  • Critical for maintaining service boundaries, operational safety, and preventing inappropriate automated responses in production chatbots

Latency Profiling

  • Measures inference time distribution under realistic passenger interaction loads, including mean and P95 latency metrics
  • Production systems typically target sub-200ms response times, with P95 latency tracked alongside the mean so tail degradation is caught before it harms user experience
  • Essential for real-time passenger interactions during check-in, boarding, operational disruptions, and mobile app responsiveness

Multilingual Consistency

  • Assesses intent classification accuracy equivalence across multiple languages for semantically identical passenger requests
  • Validates that "¿Puedo cambiar mi vuelo?" maps to the same intent classification as "Can I change my flight?"
  • Critical for international airlines serving diverse passenger populations across 50+ languages and cultural contexts

Drift Detection & Monitoring

  • Systematic evaluation of model performance degradation over time using statistical methods and machine learning approaches
  • Monitors feature distribution shifts, prediction accuracy decay, and concept drift in passenger language patterns
  • Prevents silent performance deterioration and enables proactive model retraining before service quality impact

Multi-Intent Detection Capability

  • Evaluates system's ability to identify and classify multiple intents within single passenger utterances using multi-label metrics
  • Handles complex requests spanning booking modifications, service preferences, and operational concerns simultaneously
  • Essential for natural conversation flow and reducing clarification overhead in airline customer interactions

Robustness Assessment

  • Tests classification performance under adversarial conditions including noise injection, spelling errors, and grammatical inconsistencies
  • Evaluates resilience against Fast Gradient Sign Method, character-level perturbations, and environmental signal degradation
  • Critical for handling real-world passenger input variations and maintaining service quality across diverse interaction conditions

4.2 Evaluation Method Importance Table

| Attribute | Why it matters for airline-grade evaluation | References |
| --- | --- | --- |
| Uncertainty Assessment | Calibrated confidence estimates (e.g., temperature scaling, conformal prediction) reduce over-confident misclassifications, enable safe hand-offs for high-risk intents (visa denial, dangerous goods), and provide auditable KPIs that aviation regulators expect. | Uncertainty Assessment Evolution; Temperature Scaling; EMNLP 2020; NAACL 2024 |
| Reproducibility | Ensures that evaluation results are identical across reruns with fixed seeds, documented code, data, and environment, which is vital for regulatory audits, safety-critical regression testing, and reliable A/B gating of new airline-chatbot models. | Reproducibility Study; ML Reproducibility Crisis; ACL 2023 |
| Out-of-Scope Detection Performance | Detecting and rejecting queries that fall outside the airline's supported intent taxonomy (e.g., hotel bookings, car rentals, hostile/irrelevant requests) prevents the bot from giving misleading answers, breaching safety regulations, or triggering the wrong operational workflow. Robust OOS evaluation ensures the system maintains clear service boundaries and escalates unknown intents to human agents. | OOS Detection Study; EMNLP 2024; ACL 2019 |
| Latency Profiling | Sub-200ms response times are required for real-time passenger interactions during check-in, boarding, and operational disruptions; P95 latency monitoring prevents service degradation. | Inference Optimization; AWS Edge Inference; EMNLP Industry 2024 |
| Multilingual Consistency | International airlines serve 50+ languages; evaluation must ensure equivalent performance across cultural contexts and linguistic variations without bias. | ACL 2023; Multilingual NLP; ICML 2020 |
| Drift Detection & Monitoring | Passenger language patterns and service demands evolve continuously; systematic monitoring prevents silent performance degradation over operational periods. | Distribution Shifts; AI Monitoring; Model Degradation |
| Multi-Intent Detection Capability | Airline passengers frequently express compound requests (booking + special assistance + meal preferences); accurate multi-intent evaluation prevents conversation breakdown. | EMNLP 2024 (1); EMNLP 2024 (2); EMNLP 2020 |
| Robustness Assessment | Evaluation must account for noisy input conditions (background noise, spelling errors, informal language) common in real passenger interactions. | Microsoft Research; ACL 2020 |

Method Definitions

Conformal Prediction Evaluation

  • Provides theoretical guarantees for uncertainty quantification using prediction sets with mathematically proven coverage probabilities
  • Constructs calibrated confidence intervals that contain the true intent class with predetermined confidence levels (90%, 95%, 99%)
  • Enables adaptive decision-making through prediction set size analysis and rejection option implementation for ambiguous queries
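A minimal sketch of split conformal calibration for classification is shown below, using the common 1 - p(true class) nonconformity score. The array names and alpha level are illustrative assumptions.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.10):
    """Split conformal calibration on a held-out set.

    cal_probs:  (n, k) array of softmax scores from the classifier
    cal_labels: (n,) integer indices of the true intents
    Returns a score threshold giving roughly (1 - alpha) coverage.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity
    q = np.ceil((n + 1) * (1 - alpha)) / n              # finite-sample correction
    return np.quantile(scores, min(q, 1.0))

def prediction_set(probs, threshold):
    """All intents whose nonconformity score clears the calibrated bar;
    large sets flag ambiguous queries for human handoff."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```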

Out-of-Scope Detection Evaluation

  • Evaluates system's ability to reject queries outside supported intent taxonomy using dedicated benchmark datasets with explicit OOS samples
  • Measures AU-IOC (Area Under In-scope and Out-of-scope Characteristic Curve) for comprehensive dual-performance assessment of classification accuracy and rejection capability
  • Combines threshold-based rejection with representation similarity analysis to prevent misclassification of non-domain queries into operational intent categories
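A minimal sketch of threshold-based rejection from top-class confidence follows; sweeping the threshold traces the in-scope/out-of-scope trade-off curve that metrics such as AU-IOC summarize. The 0.5 default is an illustrative assumption.

```python
import numpy as np

def oos_metrics(top_confidences, is_oos, threshold=0.5):
    """Evaluate confidence-threshold OOS rejection.

    top_confidences: (n,) top-class confidence per query
    is_oos:          (n,) booleans, True for ground-truth OOS queries
    A query is rejected as out-of-scope when confidence < threshold.
    """
    conf = np.asarray(top_confidences)
    oos = np.asarray(is_oos)
    rejected = conf < threshold
    return {
        "oos_recall": (rejected & oos).sum() / max(oos.sum(), 1),
        "in_scope_acceptance": (~rejected & ~oos).sum() / max((~oos).sum(), 1),
    }
```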

Latency Profiling Protocol

  • Measures inference time distribution under realistic passenger interaction loads with comprehensive P95, P99, and mean latency tracking
  • Evaluates Time to First Token (TTFT) < 200ms and total response latency < 500ms requirements for maintaining conversational flow during peak operations
  • Implements stress testing scenarios including concurrent user simulation, batch processing optimization, and hardware-specific performance profiling
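A single-stream version of this protocol might look like the sketch below; concurrent-load simulation and time-to-first-token measurement would require a load generator and a streaming client, which are omitted here.

```python
import time
import numpy as np

def profile_latency(classify_fn, queries, warmup=10):
    """Measure per-query latency and report mean / P95 / P99 in milliseconds."""
    for q in queries[:warmup]:           # warm caches before measuring
        classify_fn(q)
    samples = []
    for q in queries:
        start = time.perf_counter()
        classify_fn(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    lat = np.array(samples)
    return {"mean_ms": float(lat.mean()),
            "p95_ms": float(np.percentile(lat, 95)),
            "p99_ms": float(np.percentile(lat, 99))}
```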

Multilingual Consistency Assessment

  • Tests intent classification accuracy equivalence across multiple languages using professionally translated content and cross-lingual transfer protocols
  • Evaluates zero-shot performance degradation patterns and measures semantic consistency preservation across typologically diverse language families
  • Assesses cultural adaptation effectiveness through synthetic persona simulation and region-specific linguistic variation testing
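One simple agreement check is sketched below, assuming a `classify_fn` and pairs of professionally translated queries. It measures label consistency across languages only, not per-language accuracy, so it complements rather than replaces accuracy evaluation.

```python
def multilingual_consistency(classify_fn, parallel_pairs):
    """Fraction of translation pairs that receive the same intent label.

    parallel_pairs: list of (source_text, translated_text) tuples covering
    semantically identical requests, e.g.
    ("Can I change my flight?", "¿Puedo cambiar mi vuelo?")
    """
    if not parallel_pairs:
        return 0.0
    agree = sum(classify_fn(src) == classify_fn(tgt)
                for src, tgt in parallel_pairs)
    return agree / len(parallel_pairs)
```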

Drift Detection Methodology

  • Monitors temporal performance degradation using statistical distribution shift detection including Kolmogorov-Smirnov testing and Maximum Mean Discrepancy analysis
  • Implements sliding window data acquisition with adaptive threshold adjustment for proactive model degradation detection
  • Measures Mean Time to Detection (MTD) and False Discovery Rate (FDR) to optimize alert sensitivity and minimize false positive interventions
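A minimal sliding-window drift check using SciPy's two-sample Kolmogorov-Smirnov test is sketched below; the p-value threshold is illustrative, and Maximum Mean Discrepancy or multivariate tests would need additional machinery.

```python
from scipy.stats import ks_2samp

def detect_drift(reference_scores, window_scores, p_threshold=0.01):
    """Compare a recent window of model confidences (or any 1-D feature)
    against a reference distribution captured at deployment time."""
    statistic, p_value = ks_2samp(reference_scores, window_scores)
    return {"ks_statistic": statistic,
            "p_value": p_value,
            "drift_detected": p_value < p_threshold}
```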

Multi-Intent Detection Evaluation

  • Assesses multi-label classification performance using macro/micro F1-scores, Hamming loss, and label-wise accuracy for complex passenger queries
  • Evaluates joint intent-slot parsing accuracy with complete semantic frame correctness validation for compound service requests
  • Tests hierarchical intent taxonomy navigation and co-occurrence pattern recognition for realistic multi-service passenger interactions
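The core multi-label metrics come straight from scikit-learn; in the sketch below, the indicator matrices are hypothetical stand-ins for real compound-query annotations.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Rows = queries, columns = intents, e.g.
# [booking_modification, special_assistance, meal_preferences].
y_true = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1]])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("Hamming loss:", hamming_loss(y_true, y_pred))
```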

Robustness Assessment Framework

  • Applies systematic adversarial testing using character-level, word-level, and sentence-level perturbations with human validation protocols
  • Implements behavioral testing through Minimum Functionality Tests, Invariance Tests, and Directional Expectation Tests for natural language variations
  • Measures Attack Success Rate (ASR) under realistic input corruption scenarios including speech-to-text errors, typos, and grammatical inconsistencies
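The sketch below shows a character-level perturbation and an attack-success-rate measure over clean successes. The noise operations and rate are illustrative assumptions, far simpler than gradient-based attacks such as FGSM.

```python
import random

def perturb_chars(text, rate=0.05, seed=0):
    """Inject character-level noise: randomly swap or drop characters."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        if rng.random() < rate and i + 1 < len(chars):
            out.extend([chars[i + 1], chars[i]])  # swap adjacent characters
            i += 2
        elif rng.random() < rate:
            i += 1                                # drop a character
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

def attack_success_rate(classify_fn, labeled_queries, rate=0.05):
    """Share of correctly classified queries whose label flips under noise."""
    flipped = total = 0
    for text, label in labeled_queries:
        if classify_fn(text) != label:
            continue                              # only count clean successes
        total += 1
        if classify_fn(perturb_chars(text, rate)) != label:
            flipped += 1
    return flipped / total if total else 0.0
```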

Reproducibility Framework Assessment

  • Implements standardized experimental protocols with comprehensive documentation requirements following established research reproducibility checklists
  • Requires multi-seed evaluation with statistical significance testing and complete artifact version control for experiment replication
  • Utilizes containerized environments and automated pipeline management to ensure consistent results across different computational platforms and research teams
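The statistical side of this framework can be as small as the harness below; containerization and artifact versioning are process concerns that code alone cannot capture. The `run_experiment` hook is an assumed interface that must seed every source of randomness from its argument.

```python
import statistics

def multi_seed_eval(run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Repeat an experiment across fixed seeds and summarize the spread.

    `run_experiment(seed)` -> a scalar metric such as accuracy; it should
    seed data splits, model initialization, and sampling from `seed`.
    """
    scores = [run_experiment(seed) for seed in seeds]
    return {"mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
            "scores": scores}
```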

4.3 Method Comparison Matrix

| Method | Uncertainty Assessment | Reproducibility | Out-of-Scope Detection | Latency Profiling | Multilingual Consistency | Drift Detection | Multi-Intent Detection | Robustness Assessment | References |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Conformal Prediction Evaluation | High | High | High | Medium-High | High | Low | High | High | NAACL 2024; arXiv 2503.15850; arXiv 2309.06240 |
| Out-of-Scope Detection Evaluation | Low | High | High | Medium | Low | Low | Low | Medium-High | EMNLP 2019; Springer 2023; arXiv 2403.05640 |
| Latency Profiling Protocol | Low | High | Low | High | Medium | Low | Medium | Low | Databricks Blog; DZone Article; Latency Metrics |
| Multilingual Consistency Assessment | Low | High | Low | Medium | High | Low | Medium-High | High | ACL 2023; ResearchGate; EMNLP 2021 |
| Drift Detection Methodology | Low | High | Low | Low | Low | High | Low | Low | Microsoft Docs; Evidently AI; arXiv 2505.17043 |
| Multi-Intent Detection Evaluation | Low | High | Low | Medium | Medium | Low | High | Low | EMNLP 2023; Science Direct; FewNLU |
| Robustness Assessment Framework | Low | High | Low | Medium | High | Low | Low | High | GitHub Repo; EMNLP 2023; CMU Blog |
| Reproducibility Framework Assessment | Low | High | Low | Low | Medium | Medium | Medium | Low | arXiv 2407.10239; Nature; arXiv 2406.14325 |

4.4 Method Selection & Justification

Based on the comparative analysis of evaluation methods and the critical requirements of airline operations, this study adopts a focused evaluation strategy targeting four essential capabilities: uncertainty quantification for safety-critical decision-making, out-of-scope detection for operational boundary management, latency optimization for real-time passenger interactions, and multi-intent processing for complex service requests. These selected methods address the unique challenges of airline chatbot systems including regulatory compliance for automated decisions, prevention of dangerous query misrouting, operational efficiency during peak travel periods, and comprehensive understanding of compound passenger needs across booking modifications, service preferences, and support requests.

| Method | Rationale |
| --- | --- |
| Conformal Prediction Evaluation | Provides theoretical guarantees for uncertainty quantification essential in safety-critical airline operations where overconfident misclassifications can trigger inappropriate automated responses. Enables mathematically proven confidence intervals for human handoff decisions during operational disruptions and emergency scenarios. Critical for regulatory compliance requiring auditable decision-making processes in aviation safety systems. |
| Out-of-Scope Detection Evaluation | Essential for ensuring safety and reliability when chatbots are exposed to open-ended customer inputs beyond airline scope. Prevents dangerous misrouting of non-aviation queries (hotel bookings, rental cars, weather) into operational flight systems. Critical for maintaining service boundaries and preventing automated responses to queries outside airline operational scope, directly impacting passenger safety. |
| Latency Profiling Protocol | Reflects constraints in production environments such as check-in kiosks, mobile apps, and real-time customer service interactions. Models must demonstrate system responsiveness under realistic passenger interaction loads to maintain operational efficiency during peak travel periods. Essential for validating deployment viability in time-critical airline operations where response delays can cascade into operational disruptions. |
| Multi-Intent Detection Evaluation | Addresses complex passenger queries that span multiple services (booking modifications + seat preferences + meal requests) common in airline customer interactions. Validates system capability to handle compound service requests through multi-label classification performance assessment. Critical for reducing clarification overhead and improving passenger experience during complex transaction scenarios. |

5. Final Recommendations & Next Steps

5.1. Summary of Comparative Findings

The analysis conducted in this study shows that intent classification techniques exhibit diverse performance profiles depending on their methodological approach and the context of application. Prompting-based methods, such as Few-Shot In-Context Learning, are particularly useful during early development stages or prototyping, as they require no training and allow for rapid iteration. Embedding-based techniques, such as Dual Encoder, stand out for their computational efficiency, making them suitable for systems with latency or infrastructure constraints. Hybrid approaches, such as the Hybrid System with Uncertainty-Based Routing, offer a favorable balance between accuracy and inference time by combining lightweight classifiers with high-capacity LLMs. Generative methods, including IntentGPT and Gen-PINT, enable advanced tasks such as automatic intent discovery and semantic adaptation, although they often entail greater operational requirements.

These findings suggest that the selection of a technique should not rely on a single performance metric. Instead, it should account for the specific operational context, system constraints, and functional objectives relevant to each phase in the development and deployment of a conversational AI system.

5.2. Technique Recommendations by Use Case

The table below summarizes recommended techniques for common use cases in airline conversational systems:

| Use Case Scenario | Recommended Technique | Primary Justification |
| --- | --- | --- |
| Initial deployment / prototyping | Few-Shot In-Context Learning (ICL) | Requires no training, enables rapid iteration with minimal labeled data |
| Production scaling | Hybrid System with Uncertainty-Based Routing | High accuracy with reduced latency; optimal trade-off for real-time environments |
| Emerging or evolving intents | IntentGPT | Unsupervised intent discovery; useful when services or policies change dynamically |
| Domain-specific adaptation | PEFT (IA3 Adapters) | Efficient fine-tuning; low resource cost; allows for rapid domain specialization |

These recommendations are grounded in the comparative analysis in Section 3.3 and are aligned with the technical attributes and performance metrics presented in Sections 4.1 and 4.2.

5.3. Implementation Considerations

Before selecting or deploying an intent classification technique, the following practical aspects should be taken into account:

  • Infrastructure availability: some techniques require access to large language models or GPU-based inference environments.
  • Real-time response requirements: use cases such as automated check-in or airport support require low inference latency.
  • Out-of-scope (OOS) robustness: it is critical to detect irrelevant or unexpected inputs to avoid generating inappropriate responses.
  • Maintainability and adaptability: systems must be capable of rapid updates in response to changes in regulations, services, or query patterns.
  • Multilingual support and cultural variation: airline systems must deliver consistent classification across different languages and regions.

5.4. Future Work

This study provides a solid foundation for evaluating intent classification techniques in complex operational domains such as aviation. However, several lines of work are proposed to further strengthen the applicability and robustness of such systems:

  • Development of a domain-specific dataset for airlines, including real passenger utterances, expert annotations, and ambiguous cases.
  • Multilingual and cross-regional evaluation with native speakers, to validate semantic consistency and intent boundaries across cultural contexts.
  • Controlled experimentation with clarification techniques, such as TO-GATE or LLM-based interactive clarification, to improve real-time ambiguity resolution.
  • Design of domain-oriented benchmarks that combine multiple task types (classification, OOS detection, discovery, reasoning) and integrate evaluation metrics.

These steps will help advance toward more reliable, adaptive, and operationally aligned conversational systems for the airline industry.

6. Conclusion

This study presents a comprehensive analysis of intent classification techniques and evaluation methods, with a specific focus on their applicability to airline customer service systems. Through a detailed characterization of the airline domain, a structured comparison of 18 intent classification approaches, and an in-depth review of evaluation protocols, we highlight the strengths, limitations, and suitability of each technique under different operational scenarios.

Our findings show that no single method excels universally. Instead, successful deployment depends on aligning technique selection with specific contextual requirements—such as the availability of training data, latency constraints, or the need for out-of-scope handling and clarification. Prompting-based methods, hybrid architectures, and LLM-powered discovery systems each provide complementary strengths that can be leveraged at different stages of system development.

The evaluation strategy proposed in this study emphasizes few-shot adaptability, semantic robustness, and real-time responsiveness. These dimensions are critical to supporting the multilingual, dynamic, and operationally sensitive nature of airline applications.

Overall, this work provides actionable guidance for researchers and practitioners seeking to design or evaluate intent classification systems in complex real-world domains, offering both a theoretical foundation and practical pathways for scalable and reliable deployment.

Related Topics

intent classification · airline chatbots · natural language processing · airline operations · machine learning · customer service AI · NLP aviation · airline customer intent · conversational AI · multi-agent systems

About the Author

Luis Reynaldo

Applied AI Engineer, Kaiban

I'm passionate about AI and multi-agent systems, spending my time studying in-context learning techniques and prompt engineering. Coming from a background where I loved crafting beautiful UIs, I now enjoy exploring how these technologies can solve real aviation challenges through elegant solutions.
