A Comparative Study of Intent Classification Techniques and Evaluation Methods for Airline Applications
Detailed research on intent classification techniques and evaluation methods for airline applications, covering prompting-based, embedding-based, and hybrid approaches for understanding customer needs.


Abstract
This study analyzes intent classification techniques for airline customer service applications.
Airlines operate in a high-stakes environment where misunderstanding passenger intent can trigger operational disruption, and multilingual requirements and rapidly shifting operational contexts compound the difficulty. A single misclassified intent about a flight change or baggage issue can cascade through downstream systems, affecting schedules, crew assignments, and passenger safety.
We examine 18 techniques across seven categories, spanning prompting-based, embedding-based, generative, fine-tuning-based, hybrid, specialized, and LLM-based clarification approaches. Each is evaluated against accuracy, latency, few-shot learning capability, and cross-domain transfer.
Our analysis reveals Few-Shot In-Context Learning as the most effective approach for initial deployment, while Hybrid Systems with Uncertainty-Based Routing prove superior for production scaling. The evaluation framework covers four critical areas: conformal prediction for uncertainty quantification, out-of-scope detection for operational safety, latency profiling for real-time responsiveness, and multi-intent detection for complex queries.
This research provides practical guidance for deploying robust intent classification in safety-critical, multilingual airline environments.
1. Domain Characterization
1.1 Context
This study focuses on airlines as a domain that presents several characteristics relevant to intent classification research. Airlines operate in a complex operational environment with diverse customer interaction patterns.
Domain Criticality: Airline intent misclassification can have significant operational and financial consequences. Research shows that operating costs from delays constituted 86-134% of airline operating profits in 2007, with industry-wide delay costs ranging $8.3-13 billion annually (Gu et al., 2024). Misunderstanding customer requests during service disruptions, flight changes, or emergency situations can impact passenger experience and operational efficiency. The stakes are higher than in typical e-commerce or general customer service applications.
Multilingual and Cultural Complexity: International airlines must support intent classification across multiple languages and cultural contexts, where the same underlying need may be expressed differently across regions. Studies indicate that 74% of customers are more likely to repurchase from businesses offering customer service in their native language (HappyFox, 2025). Passenger queries reflect diverse linguistic patterns and cultural expectations for service delivery.
Operational Dynamics: Airlines operate in a constantly changing environment where flight schedules, policies, weather conditions, and regulatory requirements evolve continuously. Intent classification systems must handle queries about dynamic information while maintaining accuracy across changing operational contexts.
Contextual Dependencies: Airline customer interactions often involve multiple related intents and require understanding of travel context. Passengers frequently have compound requests that span booking modifications, service preferences, and operational concerns within a single conversation.
System Integration Requirements: Unlike many conversational AI applications, airline intent classification typically requires integration with real-time operational systems including reservations, flight operations, baggage tracking, and customer management platforms. This integration complexity adds constraints to system design and deployment.
These characteristics make airlines a challenging and representative domain for evaluating intent classification techniques, as solutions must demonstrate robustness, adaptability, and operational reliability.
1.2 Canonical User Intents for Airline Systems
To design and evaluate intent classification systems in the airline domain, it is essential to establish a comprehensive and functionally organized set of user intents. These intents represent the core types of queries and requests that passengers generate when interacting with airline customer service platforms—whether through chatbots, apps, or voice assistants—across all phases of the travel experience.
This study defines 35 canonical intents grouped into seven functional categories. These were derived through comparative analysis of public dialogue datasets including ATIS (Hou et al., 2024) with 17 intent categories, SGD (Rastogi et al., 2020) with 20 domains covering travel, events, and other services, and CLINC150 (Larson et al., 2019) with 150 intents across 10 domains, combined with analysis of real-world airline chatbot interfaces and relevant literature. The taxonomy reflects both traditional service needs and emerging user demands such as environmental responsibility, digital account management, and personalized onboard services.
1. Flight Planning & Reservation
These intents capture user interactions during the early planning and booking stages.
- flight_search – Find available flights based on destination, date, time, or price.
- flight_booking – Complete a flight reservation, including passenger details and add-ons.
- seat_selection – Select or change seat assignments, request upgrades.
- fare_rules_explained – Inquire about fare conditions, change penalties, refundability.
- booking_modification – Change flight details (e.g., dates, names) for an existing reservation.
- booking_cancellation – Cancel a booked ticket and receive applicable refund terms.
- upgrade_offer_bid – Submit or review upgrade offers (e.g., to business class).
- lounge_access – Ask about eligibility, pricing, and location of airport lounges.
2. Check-in & Boarding
These intents occur closer to departure and involve real-time or operational support.
- check_in_boarding – Perform check-in or retrieve digital boarding passes.
- boarding_group_info – Understand assigned boarding group or priority boarding rules.
- flight_status – Request real-time updates on flight departure, delays, gate changes.
- alternative_flights – Search for other flight options (e.g., in case of disruption).
- rebooking_assistance – Automatically rebook after a delay, cancellation, or missed connection.
- weather_updates – Check how weather may affect flights.
- airport_information – Ask about services, navigation, or amenities at a specific airport.
3. Baggage Services
These intents focus on all aspects of baggage policies, tracking, and incidents.
- baggage_policy – Ask about weight, size, and allowance rules.
- baggage_fees – Calculate or understand fees for checked or excess baggage.
- carry_on – Verify carry-on size, item restrictions, and allowances.
- baggage_tracking – Track the status or location of checked baggage.
- lost_baggage – Report or follow up on lost, delayed, or damaged baggage.
- special_equipment – Ask about traveling with sports gear, musical instruments, medical devices.
4. Post-Sale Support & Claims
These intents represent post-booking concerns, support, and compensation.
- refund_request – Request refund based on ticket conditions or travel disruption.
- compensation_claim – File a claim for delays, cancellations, baggage incidents.
- travel_insurance – Inquire about coverage, purchase options, or initiate a claim.
- loyalty_program – Ask about mileage accrual, redemption, tier benefits, or upgrades.
5. Documentation & Travel Requirements
These intents deal with regulatory and compliance matters before departure.
- visa_requirements – Check visa obligations for the destination.
- passport_validity – Verify passport expiration and entry rules.
- health_requirements – Ask about vaccination, testing, or medical certificate needs.
- travel_restrictions – Request information on entry bans, COVID-19 rules, or quarantine.
6. Special Services & Onboard Preferences
These intents cover optional or personalized services.
- meal_preferences – Select or inquire about special meals (e.g., vegetarian, halal).
- special_assistance – Request wheelchair service, escort, or medical accommodations.
- carbon_offset_program – Participate in sustainability programs or purchase offsets.
- wifi_connectivity – Ask about onboard Wi-Fi availability, pricing, and usage.
7. Digital Account & Mobile App Support
These intents involve digital access and personalization.
- digital_account_login – Troubleshoot login issues for user profiles or frequent flyer accounts.
- app_notification_settings – Configure alerts for delays, gate changes, promotions, or check-in reminders.
Note: This taxonomy represents a theoretical framework derived from dataset analysis and industry observation. The 35 intents proposed here are based on comparative analysis of existing dialogue datasets and airline service patterns, but empirical validation with real airline operational data would be required for production deployment.
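For implementation purposes, the taxonomy above can be captured as a simple data structure; a minimal Python sketch (the category keys are our own shorthand, not part of the taxonomy itself):

```python
# Canonical airline intent taxonomy: 35 intents in 7 functional categories.
AIRLINE_INTENTS = {
    "flight_planning_reservation": [
        "flight_search", "flight_booking", "seat_selection",
        "fare_rules_explained", "booking_modification",
        "booking_cancellation", "upgrade_offer_bid", "lounge_access",
    ],
    "check_in_boarding": [
        "check_in_boarding", "boarding_group_info", "flight_status",
        "alternative_flights", "rebooking_assistance",
        "weather_updates", "airport_information",
    ],
    "baggage_services": [
        "baggage_policy", "baggage_fees", "carry_on",
        "baggage_tracking", "lost_baggage", "special_equipment",
    ],
    "post_sale_support_claims": [
        "refund_request", "compensation_claim",
        "travel_insurance", "loyalty_program",
    ],
    "documentation_travel_requirements": [
        "visa_requirements", "passport_validity",
        "health_requirements", "travel_restrictions",
    ],
    "special_services_onboard": [
        "meal_preferences", "special_assistance",
        "carbon_offset_program", "wifi_connectivity",
    ],
    "digital_account_app": [
        "digital_account_login", "app_notification_settings",
    ],
}

# Flat list of all intent labels, with a sanity check on the count.
ALL_INTENTS = [i for intents in AIRLINE_INTENTS.values() for i in intents]
assert len(ALL_INTENTS) == 35 and len(set(ALL_INTENTS)) == 35
```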
1.3 Domain-Specific Requirements for Intent Classification Systems
The airline domain presents specific challenges for intent classification that influence how standard classification requirements manifest in this operational context. The following requirements emerge from the theoretical analysis of airline intents and how passengers typically express their needs in this domain.
Classification Accuracy Requirements
- Intent Disambiguation Capability - Airline intents frequently overlap semantically (baggage_policy vs baggage_fees, flight_search vs flight_status), requiring systems that distinguish between closely related categories based on subtle linguistic cues
- Domain Terminology Recognition - Classification must handle airline-specific vocabulary (layover, codeshare, PNR, oversold) and abbreviations that affect intent identification
- Context-Dependent Classification - Passengers may use the same phrase for different intents depending on their specific travel situation
Robustness Requirements
- Variation Handling - Passengers express the same intent using diverse terminology influenced by regional differences, airline branding, and travel experience levels
- Multilingual Classification Consistency - Intent boundaries must remain consistent across languages where travel concepts may have different cultural associations
- Out-of-Scope Detection for Domain Boundaries - Distinguish between airline-related queries and unrelated requests (hotel bookings, ground transportation)
Adaptability Requirements
- Seasonal Intent Recognition - Classification must adapt to temporal variations in how passengers express travel needs (holiday travel patterns, weather-related concerns, seasonal routes)
- Intent Granularity Flexibility - Ability to classify at different levels of specificity as airline services evolve
- New Intent Integration - Accommodate emerging intent categories as airline services change (new health requirements, technology services, environmental concerns)
These requirements specifically address the challenges of correctly identifying passenger intent categories within the airline domain's linguistic and operational constraints.
2. Guiding Research Questions
2.1. Questions for Intent Classification Techniques
- What are the main existing techniques for intent classification in conversational AI applications?
- What technical attributes and capabilities are used to compare these techniques (architecture, few-shot learning, latency, computational requirements)?
- How do these techniques perform across key metrics like accuracy, out-of-scope detection, and operational efficiency in airline-relevant scenarios?
- Which techniques are most suitable for airline chatbots considering domain-specific requirements and constraints?
- How can intent classification techniques be combined with clarification and out-of-scope detection to improve system robustness?
2.2. Questions for Evaluation Methods
- What are the main evaluation methodologies used for assessing intent classification systems?
- What metrics, datasets, and evaluation protocols are employed to measure performance and compare different approaches?
- How do these evaluation methods address airline-specific challenges like multilingual support, contextual dependencies, and operational criticality?
- What evaluation strategy is most appropriate for validating intent classification systems in airline applications?
3. Intent Classification Techniques
To select the most appropriate intent classification techniques for airline applications, it is necessary to systematically analyze the available options. This section compares the main techniques identified in the current literature, evaluating their specific characteristics in relation to the domain requirements outlined earlier.
3.1. Identified Techniques
Through a systematic review of current literature, 18 specific intent classification techniques were identified and categorized into seven main groups. These techniques represent the state of the art in the field, ranging from fully training-free approaches to sophisticated hybrid systems.
Prompting-Based Techniques
- Zero-Shot In-Context Learning (ICL) – A technique in which the language model classifies user intents using only textual descriptions of each intent, without any prior training examples. The LLM relies on its pretrained knowledge to map the user query to the most appropriate intent category based solely on the definitions provided (Parikh et al., 2023).
- Few-Shot In-Context Learning – This approach includes 1 to 10 representative examples of each intent directly within the prompt, along with the query to be classified. The LLM infers patterns and features from the examples to perform intent classification (Parikh et al., 2023; Zhang et al., 2024).
- Adaptive In-Context Learning – An extension of few-shot ICL that dynamically selects the most relevant examples for each query using semantic similarity. Embedding models are used to retrieve the K nearest examples to the input, and the prompt is constructed adaptively (Arora et al., 2024; Rodriguez et al., 2024).
- Chain-of-Thought (CoT) Prompting – A prompting technique that instructs the LLM to explain its reasoning step by step before producing a final classification. The model verbalizes the analysis process, highlighting key features of the input before issuing its decision (Arora et al., 2024).
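The zero-shot and few-shot variants above differ mainly in whether labeled demonstrations are packed into the prompt. A minimal sketch of prompt assembly (the prompt wording and intent descriptions are illustrative, not taken from the cited papers):

```python
def build_prompt(query, intent_descriptions, examples=None):
    """Assemble an intent-classification prompt for an LLM.

    Zero-shot ICL: only intent descriptions are provided.
    Few-shot ICL: labeled demonstrations are appended as well.
    """
    lines = ["Classify the user query into one of these intents:"]
    for intent, desc in intent_descriptions.items():
        lines.append(f"- {intent}: {desc}")
    if examples:  # few-shot ICL: include labeled demonstrations
        lines.append("")
        lines.append("Examples:")
        for text, label in examples:
            lines.append(f'Query: "{text}" -> Intent: {label}')
    lines.append("")
    lines.append(f'Query: "{query}" -> Intent:')
    return "\n".join(lines)
```

Adaptive ICL would replace the static `examples` list with the K nearest neighbors of the query retrieved by embedding similarity, and CoT prompting would add an instruction to reason step by step before the final label.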
Embedding-Based Techniques
- Dual Sentence Encoders (USE + ConveRT) – An architecture that combines the Universal Sentence Encoder (USE) for general-purpose representations with ConveRT, a conversationally optimized encoder, as fixed feature extractors. The resulting embeddings are concatenated and passed to a simple MLP classifier with one hidden layer. Both encoders remain frozen during training (Casanueva et al., 2020).
- SetFit (Sentence Transformer Fine-tuning) – A few-shot learning method that performs contrastive fine-tuning of sentence transformers using positive and negative pairs, followed by training a lightweight classifier on the learned representations. It combines contrastive learning with discriminative classification (Arora et al., 2024).
- Label-Aware BERT Attention Network (LABAN) – A neural architecture that builds an embedding space explicitly informed by the semantics of intent labels. It uses attention mechanisms to project user queries into this semantic space and classifies them based on projection weights toward each intent category (Wu et al., 2021).
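The dual-encoder idea of concatenating two frozen encoders' outputs before a small classification head can be sketched with toy stand-ins; here trivial word- and character-bigram encoders replace USE and ConveRT, and a nearest-centroid rule stands in for the single-hidden-layer MLP:

```python
import numpy as np

VOCAB = ["baggage", "bag", "fees", "flight", "status", "delayed", "track"]
BIGRAMS = ["ba", "fl", "st", "tr"]

def word_encoder(text):
    """Stand-in for USE: normalized bag-of-words over a fixed vocabulary."""
    toks = text.lower().split()
    v = np.array([float(toks.count(w)) for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n else v

def bigram_encoder(text):
    """Stand-in for ConveRT: normalized character-bigram counts."""
    s = text.lower()
    v = np.array([float(s.count(b)) for b in BIGRAMS])
    n = np.linalg.norm(v)
    return v / n if n else v

def encode(text):
    # Dual-encoder trick: concatenate both frozen encoders' outputs.
    return np.concatenate([word_encoder(text), bigram_encoder(text)])

def train_centroids(labeled):
    # Nearest-centroid rule standing in for the trained MLP head.
    return {label: np.mean([encode(t) for t in texts], axis=0)
            for label, texts in labeled.items()}

def classify(text, centroids):
    v = encode(text)
    return max(centroids, key=lambda label: float(v @ centroids[label]))
```

The design point carries over from the real system: because both encoders stay frozen, only the small head needs training, which is what makes the approach data-efficient in 10-shot settings.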
Generative-Based Techniques
- Text-to-Text Generation (Gen-PINT) – A complete reformulation of the classification problem as a free-form text generation task. Instead of selecting from predefined intent classes, the LLM generates the intent name or description directly as natural language using instruction tuning (Zhang et al., 2024).
- Intent Discovery with LLMs (IntentGPT) – A training-free system that employs a dual-LLM architecture for automatic intent discovery. It combines a contextual prompt generator, an intent predictor, and a semantic sampler that selects relevant examples through automatic clustering to identify emerging intent categories (Rodriguez et al., 2024).
- Generate-then-Refine – A two-stage pipeline in which synthetic utterances are first generated using LLMs in a zero-shot setting to expand the dataset, followed by a seq2seq refinement model that enhances the quality, coherence, and utility of the generated data (Lin et al., 2024).
Fine-Tuning-Based Techniques
- BERT Fine-tuning – Full fine-tuning of pretrained transformer models (e.g., BERT-Large) on domain-specific intent classification datasets. This approach requires updating all model parameters to adapt the model to the target task (Casanueva et al., 2020).
- PEFT (IA3 Adapters) – A parameter-efficient fine-tuning method that uses IA3 adapters (Infused Adapter by Inhibiting and Amplifying Inner Activations) to adapt pretrained models by training only a small subset of parameters, achieving competitive performance with very limited examples (Parikh et al., 2023).
Hybrid Techniques
- Hybrid System with Uncertainty-Based Routing – A system that combines lightweight models (e.g., SetFit) with large language models through an uncertainty-driven routing strategy. It uses confidence estimation to forward queries to more capable models only when uncertainty exceeds a predefined threshold, balancing accuracy and efficiency (Arora et al., 2024).
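The routing logic of such a hybrid system reduces to a confidence check; a minimal sketch, assuming the lightweight model returns a confidence score and the threshold is tuned offline (the 0.8 value is illustrative):

```python
def route(query, small_model, llm_model, threshold=0.8):
    """Uncertainty-based routing: answer with the lightweight model when
    it is confident, escalate to the LLM otherwise."""
    intent, confidence = small_model(query)
    if confidence >= threshold:
        return intent, "small_model"
    return llm_model(query), "llm"
```

Since most production traffic is routine, the majority of queries never reach the LLM, which is how the reported latency savings arise while accuracy stays close to full LLM inference.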
Specialized Techniques
- Out-of-Scope Detection – A set of specialized methods designed to detect user queries that do not belong to any predefined intent. These approaches include confidence-based thresholds, representation clustering, and similarity analysis to prevent misclassification of out-of-domain inputs (Arora et al., 2024).
- Two-Step OOS Detection – A two-stage methodology for out-of-scope detection. First, it predicts in-scope labels while ignoring potential OOS cases. Then, it compares the transformer's internal representations with training instances using cosine similarity to identify outliers (Arora et al., 2024).
- Multi-Intent Detection – Techniques specifically designed to detect and classify multiple intents within a single user query. These systems rely on multi-label architectures, attention mechanisms, or sequential decomposition to capture all distinct purposes present in complex utterances (Wu et al., 2021).
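The second stage of Two-Step OOS Detection can be sketched as a similarity check against training representations (the 0.7 threshold and the precomputed vectors are illustrative assumptions; the vectors would come from the classifier's internal representations):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_step_oos(query_vec, predicted_label, train_vecs_by_label, threshold=0.7):
    """Step 1 has already produced an in-scope label; step 2 compares the
    query representation with training instances of that label and flags
    the query as out-of-scope when even the best match is too dissimilar."""
    sims = [cosine(query_vec, v) for v in train_vecs_by_label[predicted_label]]
    return "oos" if max(sims) < threshold else predicted_label
```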
LLM-Based Clarification Techniques
- LLM Ambiguity Identification and Clarification – A system that leverages large language models (e.g., ChatGPT, GPT-4) to identify ambiguous user queries and generate appropriate clarification questions. It employs chain-of-thought reasoning and few-shot prompting to systematically detect and resolve ambiguities (Zhang et al., 2024).
- LLM-Based Interactive Clarification – A method that uses a "communicator" LLM to identify high-uncertainty and low-confidence segments in user problem descriptions. It then generates specific clarification questions to elicit additional information before proceeding with the task (Wu, 2023).
These 18 techniques represent the full spectrum of approaches available for intent classification, ranging from simple and efficient methods to complex systems with advanced capabilities for intent discovery and clarification.
Note: Some techniques represent conceptual approaches or recent developments that may require additional empirical validation for comparative evaluation in operational airline systems.
3.2. Technical Attributes for Comparison
To systematically evaluate the identified intent classification techniques, key attributes were defined based on a comprehensive review of the research literature. These attributes represent the most relevant dimensions used by researchers to characterize and compare methods, and are particularly applicable to airline-related use cases.
Primary Performance Metrics
- Accuracy - Primary metric used across all frameworks for overall classification correctness
- F1-Score - Harmonic mean of precision and recall, especially important for out-of-scope detection and imbalanced classes
Specialized Metrics
- Out-of-Scope Recall - Percentage of out-of-scope samples correctly identified as not belonging to any predefined intent category. Critical for production systems to prevent incorrect automated responses when passengers ask questions outside the chatbot's domain (e.g., asking about hotel bookings to an airline chatbot). High OOS recall prevents embarrassing misclassifications but must be balanced with precision to avoid rejecting valid queries.
- NMI (Normalized Mutual Information) - Clustering evaluation metric that measures the quality of discovered intent groupings compared to ground truth labels. Values range from 0 (random clustering) to 1 (perfect clustering). Used specifically for intent discovery tasks where the system automatically identifies new intent categories from unlabeled data, particularly relevant for discovering emerging customer needs in airline services.
- ARI (Adjusted Rand Index) - Clustering metric that compares predicted intent partitions against ground truth, adjusted for chance agreement. Unlike raw accuracy, ARI accounts for the expected similarity that would occur by random chance. Values range from -1 to 1, where 1 indicates perfect clustering. Used alongside NMI for comprehensive evaluation of intent discovery systems.
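NMI can be computed directly from the contingency counts of the two labelings; a minimal pure-Python sketch using the geometric-mean normalization (libraries such as scikit-learn provide equivalent, more robust implementations):

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same
    items: MI(true, pred) / sqrt(H(true) * H(pred))."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))   # contingency counts
    p_true = Counter(labels_true)
    p_pred = Counter(labels_pred)
    mi = sum(c / n * log((c * n) / (p_true[a] * p_pred[b]))
             for (a, b), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in p_true.values())
    h_pred = -sum(c / n * log(c / n) for c in p_pred.values())
    if h_true == 0 or h_pred == 0:   # degenerate single-cluster case
        return 1.0 if h_true == h_pred else 0.0
    return mi / sqrt(h_true * h_pred)
```

Note that NMI is invariant to permuting cluster IDs: a discovered clustering that matches the ground truth under relabeling still scores 1.0.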
Operational Metrics
- Latency/Inference Time - Time required to process a single query from input to intent prediction output. Critical for real-time customer service applications where delays impact user experience. Measured in milliseconds or seconds, with airline chatbots typically requiring sub-2-second responses to maintain conversational flow.
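Latency is better summarized by tail percentiles than by the mean for conversational systems; a minimal profiling sketch (the `classify` callable is a placeholder for any of the techniques in Section 3):

```python
import time

def profile_latency(classify, queries, runs=1):
    """Measure per-query wall-clock latency and report tail percentiles,
    which matter more than the mean for conversational responsiveness."""
    samples = []
    for _ in range(runs):
        for q in queries:
            start = time.perf_counter()
            classify(q)
            samples.append(time.perf_counter() - start)
    samples.sort()
    def pct(p):
        return samples[min(len(samples) - 1, int(p / 100 * len(samples)))]
    return {"p50": pct(50), "p95": pct(95), "max": samples[-1]}
```

Against the sub-2-second target mentioned above, the p95 and max values are the figures to watch, since occasional slow responses are what break conversational flow.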
Learning Configuration Performance
- Zero-Shot Performance - Model's ability to classify intents using only textual descriptions without any training examples. Evaluated by providing intent definitions (e.g., "flight_search: user wants to find available flights") and measuring classification accuracy on unseen queries. Crucial for rapidly deploying systems to new routes or services without collecting training data.
- Few-Shot Performance - Classification accuracy when trained with very limited examples per intent (typically 1, 5, or 10 examples). Measures data efficiency and practical deployment feasibility, as collecting extensive training data for every airline intent is resource-intensive. Performance curves show how accuracy improves with additional examples.
- Cross-Domain Transfer - Model's ability to maintain performance when applied to different airline contexts or related domains. For example, a model trained on domestic flight intents transferring to international travel queries, or adapting from one airline's terminology to another's. Measures generalization capability across operational contexts.
Additional Evaluation Metrics
- Semantic Similarity - Cosine similarity between generated intent names/descriptions and ground truth labels, measured using sentence embeddings. This metric is used when models generate free-form intent labels rather than selecting from predefined categories. It is particularly relevant for intent discovery systems (e.g., IntentGPT) or text-to-text generative models (e.g., Gen-PINT), where semantic matching is more informative than exact string comparison.
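Semantic-similarity scoring reduces to a cosine comparison between embedding vectors; a minimal sketch, assuming the label embeddings have already been computed with a sentence encoder (the 2-dimensional vectors in the usage example are toy values):

```python
import numpy as np

def best_match(generated_vec, gold_label_vecs):
    """Map a free-form generated intent label to the closest ground-truth
    label by cosine similarity of their (precomputed) sentence embeddings."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {label: cosine(generated_vec, v)
              for label, v in gold_label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

This is the matching step that lets generative systems such as IntentGPT or Gen-PINT be scored against a fixed label set without requiring exact string equality.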
3.3. Performance Comparison of Techniques
This section presents a systematic comparison of techniques using the attributes defined in the previous section. The techniques are grouped by methodological category to facilitate comparative analysis and highlight patterns across different approaches.
Prompting-based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Zero-Shot ICL | 73.9% (MASSIVE, Flan-T5-XXL) | Not reported | 0.97 (Benchmark01, GPT-3) | Not reported | Not reported | Not reported | Primary method | Not applicable | Not reported | Not reported |
Few-Shot ICL | 63% (MASSIVE, K=5, ELMSE) | Not reported | Not reported | Not reported | Not reported | Not reported | Not applicable | K=3: 57% (MASSIVE) | Not reported | Not reported |
Adaptive ICL | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Adaptive example retrieval | Not reported | Not reported |
Chain-of-Thought | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Emergent capability ≥100B parameters | Few-shot prompting | Cross-task evaluation | Not reported |
Embedding-based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Dual Encoder | 85.19% (BANKING77, 10-shot) | Not reported | Not reported | Not reported | Not reported | Reported as faster than BERT | Not reported | 10-shot performance demonstrated | Not reported | USE + ConveRT embeddings |
SetFit | Not reported | Not reported | Not reported | Not reported | Not reported | Reported as highly efficient | Not reported | 8–16 examples typical | Not reported | Contrastive fine-tuning approach |
LABAN | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Reported zero-shot capability | Not reported | Reported zero-shot transfer capability | Label-aware semantic space |
Generative-based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Gen-PINT | 77.38% (average across 8 datasets, 1-shot) | Not reported | Not reported | Not reported | Not reported | Not reported | Reported cross-domain capability | Specialized 1-shot performance | Reported domain-agnostic generalization | Not reported |
IntentGPT | 77.21% (BANKING77, 50-shot GPT-4) | Not reported | Not reported | 96.06% (CLINC150, 50-shot GPT-4) | 84.76% (CLINC150, 50-shot GPT-4) | Not reported | Training-free approach | 50-shot performance reported | Training-free generalization | SBERT embeddings + cosine similarity |
Generate-then-Refine | 76.9% (CLINC150, 1-shot) | Not reported | Not reported | Not reported | Not reported | Not reported | Reported cross-domain transfer | Optimized 1-shot generation | Reported domain generalization | Not reported |
Fine-Tuning-Based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
BERT Fine-tuning | 96.93% (CLINC150, full dataset) | Not reported | Not reported | Not reported | Not reported | Not reported | Requires full fine-tuning | Requires sufficient training data | Not reported | BERT embeddings |
PEFT (IA3) | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | One example per intent sufficient | Not reported | Not reported |
Hybrid-Based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Hybrid System with Uncertainty-Based Routing | Reported within 2% of full LLM accuracy | Not reported | Not reported | Not reported | Not reported | Reported 50% latency reduction | Combines lightweight and LLM approaches | Uncertainty-based optimization | Not reported | Hybrid architecture |
Specialized Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Out-of-Scope Detection | Not reported | Critical for evaluation | Primary metric for OOS | Not applicable | Not applicable | Not reported | Threshold-based methods | Adaptable to different configurations | Not reported | Threshold-based similarity |
Two-Step OOS | Not reported | Reported >5% improvement | Reported >5% improvement | Not applicable | Not applicable | Not reported | Compatible with base classifiers | Supports few-shot adaptation | Not reported | Internal representations similarity |
Multi-Intent Detection | Not reported | State-of-the-art performance | Not reported | Not applicable | Not applicable | Not reported | Multi-label classification capability | Architecture-dependent | Not reported | Multi-label architecture |
LLM-Based Clarification Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
LLM Ambiguity Identification | 54.25% (ChatGPT average) | 52.77% (ChatGPT average) | Reported to reduce ambiguity | Not applicable | Not applicable | Clarification overhead reported | Chain-of-Thought-based identification | Compatible with prompting examples | Not reported | Chain-of-Thought-based analysis |
LLM Interactive Clarification | Not reported | Not reported | Reported to handle high uncertainty | Not applicable | Not applicable | Multi-turn interaction overhead | Automatic identification capability | Adaptable framework | Not reported | Uncertainty-based interaction |
The comparative analysis reveals that no single technique consistently outperforms others across all evaluation metrics. Prompting-based methods, such as Zero-Shot ICL and Adaptive ICL, offer clear advantages in training-free scenarios, showing competitive performance in zero-shot classification and emerging capabilities in step-by-step reasoning. Embedding-based approaches, such as Dual Encoder, stand out for their operational efficiency and low inference time, making them appealing for production-grade systems. Generative techniques like IntentGPT and Gen-PINT introduce new opportunities for intent discovery and unsupervised classification, albeit with higher computational demands. Among hybrid methods, Hybrid System with Uncertainty-Based Routing emerges as a promising solution by balancing latency and accuracy through the combination of lightweight classifiers and high-capacity LLMs. Finally, specialized techniques, such as Two-Step OOS and LLM-based clarification systems, enable targeted handling of critical challenges like out-of-scope detection and interactive clarification.
These findings provide a foundation for the informed selection of techniques in the next section.
3.4. Technique Suitability for Airline Context
For the initial deployment of an intent classification system in the airline domain, we select Few-Shot In-Context Learning (ICL) as the baseline technique. This decision is grounded in operational simplicity, implementation feasibility, and alignment with early-stage scenarios where limited annotated data is available.
Technical Rationale
Few-Shot ICL is a prompt-based method that leverages pretrained large language models (LLMs) to perform classification without any additional fine-tuning. It operates by providing the model with a few labeled examples (typically 1 to 10) embedded directly in the prompt, followed by an unlabeled query. The LLM infers the most probable intent based on its prior knowledge and the observed input patterns.
Parikh et al. (2023) show that in-context prompting is effective for intent classification even with minimal data. On the MASSIVE benchmark, their ELMSE few-shot baseline achieved 63% accuracy with K=5 examples, while zero-shot Flan-T5-XXL reached 73.9%, indicating that both configurations are viable under different operational constraints (Parikh et al., 2023).
Furthermore, Few-Shot ICL requires no supervised training or modification of the model architecture, significantly reducing deployment complexity. Zhang et al. (2024) also highlight this method as a reliable baseline in low-resource classification experiments, especially in multilingual and cross-domain contexts (Zhang et al., 2024).
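As a concrete illustration, the prompt-construction step described above can be sketched as follows. The example utterances and intent labels are hypothetical, and the final LLM call is omitted since any modern model client could be substituted:

```python
# Minimal sketch of Few-Shot ICL prompt construction (hypothetical examples/labels).
FEW_SHOT_EXAMPLES = [
    ("I want to add a checked bag to my booking", "baggage_fees"),
    ("Can I move my flight to tomorrow morning?", "flight_change"),
    ("What is the carry-on size limit?", "baggage_policy"),
]

INTENTS = sorted({label for _, label in FEW_SHOT_EXAMPLES})

def build_prompt(query: str) -> str:
    """Embed K labeled examples directly in the prompt, followed by the query."""
    lines = ["Classify the passenger query into one of: " + ", ".join(INTENTS) + "."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Query: {text}\nIntent: {label}")
    lines.append(f"Query: {query}\nIntent:")
    return "\n\n".join(lines)

prompt = build_prompt("How much does an extra suitcase cost?")
# The prompt string would then be sent as-is to the chosen LLM; the model
# completes the trailing "Intent:" with the most probable label.
```

No training loop appears anywhere: the entire "deployment" is string construction plus one inference call, which is exactly what makes this technique attractive for a first iteration.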
Operational Advantages for First Deployment
- No fine-tuning required: Models can be used as-is with only prompt engineering.
- Architecture-agnostic: Compatible with any modern LLM (e.g., GPT-4, Claude, Mistral).
- Multilingual by design: LLMs exhibit multilingual competence in classification tasks (Chung et al., 2022; Winata et al., 2023).
- Benchmark-compatible: Easily evaluated using standard few-shot benchmarks like CLINC150 or BANKING77 in 1-shot, 5-shot, and 10-shot configurations (Casanueva et al., 2020).
Forward-Looking Enhancements
While Few-Shot ICL offers a robust starting point, more advanced techniques can be adopted in future iterations to meet specific operational demands:
- Hybrid System with Uncertainty-Based Routing, as proposed by Arora et al. (2024), combines lightweight models (e.g., SetFit) with LLMs using uncertainty-based routing. This method achieves accuracy within 2% of full LLM inference while reducing latency by 50%, making it ideal for production systems (Arora et al., 2024).
- IntentGPT, introduced by Rodriguez et al. (2024), enables training-free intent discovery. It achieves 96.06% NMI and 84.76% ARI on CLINC150, making it suitable for identifying emerging user needs and updating intent taxonomies without labeled data (Rodriguez et al., 2024).
- PEFT using IA3 adapters, explored by Parikh et al. (2023), offers parameter-efficient fine-tuning with strong performance using just one example per class, outperforming traditional full-model tuning in few-shot scenarios (Parikh et al., 2023).
These alternatives provide clear upgrade paths aligned with future system requirements such as latency optimization, out-of-scope detection, semantic scalability, and continuous integration of new airline services.
3.5 Combined Use: Intent Classification, Clarification, and Out-of-Scope Detection
In real-world airline applications, user queries often contain ambiguity, lack critical information, or fall entirely outside the scope of predefined intents. Such scenarios challenge the robustness of intent classification systems, particularly in high-stakes environments like airline operations, where misclassification can lead to operational errors or poor customer experience. To address these challenges, recent research has proposed combining intent classification with clarification mechanisms and out-of-scope (OOS) detection, resulting in more resilient and adaptive systems.
Integration Strategies
Three general integration strategies have emerged for combining intent classification, clarification, and OOS detection:
- Clarification-before-Classification: When a user query is ambiguous or underspecified, the system generates a clarification question to elicit additional information. Only after receiving a clarifying response does the system perform intent classification.
- OOS Filtering: Before or after attempting classification, the system evaluates whether the input falls outside the supported intent taxonomy. If the confidence score is low or the semantic similarity to in-scope intents is weak, the query is flagged as out-of-scope and handled accordingly.
- Uncertainty-Driven Routing: The system estimates its own confidence (e.g., via entropy, Monte Carlo dropout, or cosine similarity) and routes low-confidence inputs to a secondary module—such as a large language model (LLM) for clarification or a verification step using OOS thresholds.
These strategies can be implemented in sequence or conditionally, forming dynamic pipelines tailored to the nature of the input.
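The uncertainty-driven routing strategy above can be sketched in a few lines, assuming a lightweight classifier that returns a probability distribution over intents; the entropy threshold here is illustrative and would be tuned on validation data in practice:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(probs, intents, max_entropy=0.5):
    """Route low-confidence predictions to a secondary module (e.g., an LLM)."""
    if entropy(probs) > max_entropy:
        return ("escalate", None)  # hand off to clarification / OOS verification
    best = max(range(len(probs)), key=probs.__getitem__)
    return ("accept", intents[best])

intents = ["baggage_policy", "baggage_fees", "flight_change"]
print(route([0.97, 0.02, 0.01], intents))  # confident distribution → accept
print(route([0.40, 0.35, 0.25], intents))  # ambiguous distribution → escalate
```

Monte Carlo dropout or cosine similarity to intent prototypes could be swapped in for entropy without changing the routing skeleton.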
Improving System Robustness
The integration of clarification and OOS detection into intent classification systems improves robustness in the following ways:
- Error Prevention: OOS detection reduces false positives by rejecting inputs that do not belong to the supported domain (e.g., hotel bookings in an airline chatbot).
- Disambiguation Support: Clarification questions enable the system to resolve input ambiguity (e.g., distinguishing between baggage_policy and baggage_fees).
- Adaptive Processing: Uncertainty-based routing allows the system to dynamically escalate inputs to more capable modules when needed, rather than making uncertain predictions.
These improvements collectively reduce misclassification, increase user trust, and provide a more transparent conversational experience.
Functional Example
Consider the query: "What do I pay for my bags?"
This input could correspond to either baggage_policy (allowance rules) or baggage_fees (cost for additional bags).
A combined system would process this input as follows:
- The base classifier detects high uncertainty due to semantic overlap.
- The system triggers a clarification step: "Do you mean baggage allowance or extra fees?"
- The user responds: "Extra fees."
- The system now classifies the query as baggage_fees with high confidence.
Alternatively, for the input "Where can I book a hotel?", the OOS detection module would flag the query as out-of-domain and trigger a fallback response without attempting classification.
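The end-to-end behavior described above can be sketched as a conditional pipeline. The keyword-overlap "similarity" and both thresholds below are deliberately simplistic stand-ins for real embedding models and calibrated confidence scores:

```python
# Toy in-scope taxonomy with keyword sets standing in for intent embeddings.
IN_SCOPE = {
    "baggage_policy": {"bag", "bags", "baggage", "allowance", "carry-on"},
    "baggage_fees": {"bag", "bags", "baggage", "pay", "fee", "fees", "cost"},
}

def score(query, keywords):
    """Toy similarity: fraction of query tokens matching the intent's keywords."""
    tokens = set(query.lower().replace("?", "").split())
    return len(tokens & keywords) / max(len(tokens), 1)

def handle(query, oos_threshold=0.1, margin=0.2):
    scores = {intent: score(query, kw) for intent, kw in IN_SCOPE.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, s1), (_, s2) = ranked[0], ranked[1]
    if s1 < oos_threshold:
        return ("out_of_scope", None)            # fallback response, no classification
    if s1 - s2 < margin:
        return ("clarify", [i for i, _ in ranked[:2]])  # ask a clarification question
    return ("classified", top)

print(handle("Where can I book a hotel?"))    # flagged out-of-scope
print(handle("What do I pay for my bags?"))   # triggers clarification between intents
```

Replacing the keyword scorer with sentence-embedding cosine similarity, and the margin test with an entropy or conformal check, turns this sketch into the production pattern described above.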
Relevance to Airline Applications
Airline customer interactions are particularly prone to ambiguity, context dependency, and off-topic requests. Integrating clarification and OOS detection into the classification pipeline is essential to:
- Handle ill-specified user inputs, especially from novice or multilingual users.
- Avoid inappropriate automated responses to out-of-domain requests.
- Enhance transparency and reliability, especially during operational disruptions or service recovery.
Such integrated systems are more aligned with real-world airline needs, where robust understanding and reliable decision-making under uncertainty are critical.
4. Intent Classification Evaluation
4.1 Attributes for Comparing Evaluation Methods
Attribute Definitions
Uncertainty Assessment
- Measures the evaluation method's ability to assess how well models quantify prediction uncertainty and confidence
- Evaluates whether the method can determine if confidence scores accurately reflect actual prediction reliability
- Critical for airline operations where understanding prediction uncertainty prevents overconfident automated decisions and enables appropriate escalation to human agents
Reproducibility
- Ability to obtain identical evaluation results across multiple runs with consistent configuration and random seed control
- Ensures performance improvements/regressions are genuine rather than measurement artifacts or statistical noise
- Essential for regulatory compliance, audit trail maintenance, and trustworthy regression testing in safety-critical airline systems
Out-of-Scope Detection Performance
- Evaluates system's ability to identify and reject queries outside the supported intent taxonomy using specialized metrics like AU-IOC
- Prevents misclassification of hotel bookings, rental cars, and unrelated requests into airline intent categories
- Critical for maintaining service boundaries, operational safety, and preventing inappropriate automated responses in production chatbots
Latency Profiling
- Measures inference time distribution under realistic passenger interaction loads, including mean and P95 latency metrics
- Production systems typically target sub-200ms response times, with P95 latency monitored to catch the tail-latency degradation that harms user experience
- Essential for real-time passenger interactions during check-in, boarding, operational disruptions, and mobile app responsiveness
Multilingual Consistency
- Assesses intent classification accuracy equivalence across multiple languages for semantically identical passenger requests
- Validates that "¿Puedo cambiar mi vuelo?" maps to the same intent classification as "Can I change my flight?"
- Critical for international airlines serving diverse passenger populations across 50+ languages and cultural contexts
Drift Detection & Monitoring
- Systematic evaluation of model performance degradation over time using statistical methods and machine learning approaches
- Monitors feature distribution shifts, prediction accuracy decay, and concept drift in passenger language patterns
- Prevents silent performance deterioration and enables proactive model retraining before service quality impact
Multi-Intent Detection Capability
- Evaluates system's ability to identify and classify multiple intents within single passenger utterances using multi-label metrics
- Handles complex requests spanning booking modifications, service preferences, and operational concerns simultaneously
- Essential for natural conversation flow and reducing clarification overhead in airline customer interactions
Robustness Assessment
- Tests classification performance under adversarial conditions including noise injection, spelling errors, and grammatical inconsistencies
- Evaluates resilience against Fast Gradient Sign Method, character-level perturbations, and environmental signal degradation
- Critical for handling real-world passenger input variations and maintaining service quality across diverse interaction conditions
4.2 Evaluation Method Importance Table
Attribute | Why it matters for airline-grade evaluation | References |
---|---|---|
Uncertainty Assessment | Calibrated confidence estimates (e.g., temperature scaling, conformal prediction) reduce over-confident misclassifications, enable safe hand-offs for high-risk intents (visa denial, dangerous goods), and provide auditable KPIs that aviation regulators expect. | Uncertainty Assessment Evolution Temperature Scaling EMNLP 2020 NAACL 2024 |
Reproducibility | Ensures that evaluation results are identical across reruns with fixed seeds, documented code, data, and environment—vital for regulatory audits, safety-critical regression testing, and reliable A/B gating of new airline-chatbot models. | Reproducibility Study ML Reproducibility Crisis ACL 2023 |
Out-of-Scope Detection Performance | Detecting and rejecting queries that fall outside the airline's supported intent taxonomy (e.g., hotel bookings, car rentals, hostile/irrelevant requests) prevents the bot from giving misleading answers, breaching safety regulations, or triggering the wrong operational workflow. Robust OOS evaluation ensures the system maintains clear service boundaries and escalates unknown intents to human agents. | OOS Detection Study EMNLP 2024 ACL 2019 |
Latency Profiling | Sub-200ms response times required for real-time passenger interactions during check-in, boarding, and operational disruptions; P95 latency monitoring prevents service degradation | Inference Optimization AWS Edge Inference EMNLP Industry 2024 |
Multilingual Consistency | International airlines serve 50+ languages; evaluation must ensure equivalent performance across cultural contexts and linguistic variations without bias | ACL 2023 Multilingual NLP ICML 2020 |
Drift Detection & Monitoring | Passenger language patterns and service demands evolve continuously; systematic monitoring prevents silent performance degradation over operational periods | Distribution Shifts AI Monitoring Model Degradation |
Multi-Intent Detection Capability | Airline passengers frequently express compound requests (booking + special assistance + meal preferences); accurate multi-intent evaluation prevents conversation breakdown | EMNLP 2024 (1) EMNLP 2024 (2) EMNLP 2020 |
Robustness Assessment | Evaluation must account for noisy input conditions (background noise, spelling errors, informal language) common in real passenger interactions | Microsoft Research ACL 2020 |
Method Definitions
Conformal Prediction Evaluation
- Provides theoretical guarantees for uncertainty quantification using prediction sets with mathematically proven coverage probabilities
- Constructs calibrated confidence intervals that contain the true intent class with predetermined confidence levels (90%, 95%, 99%)
- Enables adaptive decision-making through prediction set size analysis and rejection option implementation for ambiguous queries
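A minimal sketch of split conformal prediction for intent classification follows, with synthetic softmax scores standing in for a real model. The guarantee is that, under exchangeability, the prediction set contains the true label with probability at least 1 − α:

```python
import math
import random

random.seed(0)
N_CLASSES, ALPHA = 5, 0.1

def synthetic_example():
    """Simulate a softmax output and a label drawn from it (stand-in for a model)."""
    logits = [random.gauss(0, 1) for _ in range(N_CLASSES)]
    logits[random.randrange(N_CLASSES)] += 2.0          # one intent tends to dominate
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    label = random.choices(range(N_CLASSES), weights=probs)[0]
    return probs, label

# Calibration: nonconformity score = 1 - probability assigned to the true class.
cal = [synthetic_example() for _ in range(2000)]
scores = sorted(1 - probs[y] for probs, y in cal)
k = math.ceil((len(scores) + 1) * (1 - ALPHA))          # conformal quantile index
qhat = scores[min(k, len(scores)) - 1]

def prediction_set(probs):
    """All intents whose nonconformity score is within the calibrated threshold."""
    return {c for c, p in enumerate(probs) if 1 - p <= qhat}

# Empirical coverage on fresh data should land near 1 - ALPHA (here, 90%).
test_data = [synthetic_example() for _ in range(2000)]
coverage = sum(y in prediction_set(probs) for probs, y in test_data) / len(test_data)
print(f"empirical coverage: {coverage:.3f}")
```

Large prediction sets signal ambiguous queries and can trigger the rejection option; a singleton set at 95% confidence can be safely automated.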
Out-of-Scope Detection Evaluation
- Evaluates system's ability to reject queries outside supported intent taxonomy using dedicated benchmark datasets with explicit OOS samples
- Measures AU-IOC (Area Under In-scope and Out-of-scope Characteristic Curve) for comprehensive dual-performance assessment of classification accuracy and rejection capability
- Combines threshold-based rejection with representation similarity analysis to prevent misclassification of non-domain queries into operational intent categories
Latency Profiling Protocol
- Measures inference time distribution under realistic passenger interaction loads with comprehensive P95, P99, and mean latency tracking
- Evaluates Time to First Token (TTFT) < 200ms and total response latency < 500ms requirements for maintaining conversational flow during peak operations
- Implements stress testing scenarios including concurrent user simulation, batch processing optimization, and hardware-specific performance profiling
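The P95/P99 tracking described above reduces to computing order statistics over recorded per-request latencies. A minimal sketch with simulated samples (the log-normal shape mimics the heavy right tail typical of production inference):

```python
import random

random.seed(42)

def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) over a list of latency samples."""
    ordered = sorted(samples)
    idx = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[idx]

# Simulated per-request latencies in milliseconds (illustrative distribution).
latencies = [random.lognormvariate(4.5, 0.4) for _ in range(10_000)]

mean_ms = sum(latencies) / len(latencies)
p95, p99 = percentile(latencies, 95), percentile(latencies, 99)
print(f"mean={mean_ms:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

Reporting only the mean would hide the tail: a system whose mean sits comfortably under 200ms can still violate the P95 budget during peak check-in load.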
Multilingual Consistency Assessment
- Tests intent classification accuracy equivalence across multiple languages using professionally translated content and cross-lingual transfer protocols
- Evaluates zero-shot performance degradation patterns and measures semantic consistency preservation across typologically diverse language families
- Assesses cultural adaptation effectiveness through synthetic persona simulation and region-specific linguistic variation testing
Drift Detection Methodology
- Monitors temporal performance degradation using statistical distribution shift detection including Kolmogorov-Smirnov testing and Maximum Mean Discrepancy analysis
- Implements sliding window data acquisition with adaptive threshold adjustment for proactive model degradation detection
- Measures Mean Time to Detection (MTD) and False Discovery Rate (FDR) to optimize alert sensitivity and minimize false positive interventions
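The Kolmogorov-Smirnov check mentioned above can be sketched without external dependencies. The 0.15 alarm threshold here is illustrative only; in practice it would come from KS critical-value tables or a permutation test:

```python
import random

random.seed(7)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, t):
        # Fraction of xs <= t; a linear scan is fine for a sketch.
        return sum(1 for x in xs if x <= t) / len(xs)

    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in a + b)

reference = [random.gauss(0.0, 1.0) for _ in range(500)]  # training-time feature
drifted = [random.gauss(0.8, 1.0) for _ in range(500)]    # shifted production data
same = [random.gauss(0.0, 1.0) for _ in range(500)]       # undrifted production data

print(ks_statistic(reference, drifted))  # large gap → raise drift alarm
print(ks_statistic(reference, same))     # small gap → no alarm
```

Running this comparison over a sliding window of recent production embeddings against a frozen training-time reference is the core of the monitoring loop described above.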
Multi-Intent Detection Evaluation
- Assesses multi-label classification performance using macro/micro F1-scores, Hamming loss, and label-wise accuracy for complex passenger queries
- Evaluates joint intent-slot parsing accuracy with complete semantic frame correctness validation for compound service requests
- Tests hierarchical intent taxonomy navigation and co-occurrence pattern recognition for realistic multi-service passenger interactions
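The macro/micro F1 computation above can be sketched directly from per-label counts; the labels and predictions below are illustrative:

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there are no positives at all."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def multilabel_f1(gold, pred, labels):
    """Micro- and macro-averaged F1 for multi-label intent predictions."""
    counts = {l: [0, 0, 0] for l in labels}  # tp, fp, fn per label
    for g, p in zip(gold, pred):
        for l in labels:
            if l in p and l in g:
                counts[l][0] += 1
            elif l in p:
                counts[l][1] += 1
            elif l in g:
                counts[l][2] += 1
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

labels = ["flight_change", "seat_preference", "meal_request"]
gold = [{"flight_change", "seat_preference"}, {"meal_request"}]
pred = [{"flight_change"}, {"meal_request", "seat_preference"}]
print(multilabel_f1(gold, pred, labels))
```

Micro-averaging weights labels by frequency, which favors common intents such as booking changes; macro-averaging exposes failures on rare but operationally important intents such as special-assistance requests, so both should be reported.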
Robustness Assessment Framework
- Applies systematic adversarial testing using character-level, word-level, and sentence-level perturbations with human validation protocols
- Implements behavioral testing through Minimum Functionality Tests, Invariance Tests, and Directional Expectation Tests for natural language variations
- Measures Attack Success Rate (ASR) under realistic input corruption scenarios including speech-to-text errors, typos, and grammatical inconsistencies
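The perturbation-and-ASR loop described above can be sketched as follows; the adjacent-character-swap noise and the keyword classifier are simplistic stand-ins for real attack suites and models:

```python
import random

random.seed(1)

def perturb(text, rate=0.1):
    """Character-level noise: randomly swap adjacent letters (typo simulation)."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def attack_success_rate(classify, queries, n_trials=20):
    """Fraction of perturbation trials that flip the classifier's prediction."""
    flips = trials = 0
    for q in queries:
        clean = classify(q)
        for _ in range(n_trials):
            trials += 1
            flips += classify(perturb(q)) != clean
    return flips / trials

# Toy keyword classifier standing in for the real model under test.
classify = lambda q: "baggage" if "bag" in q.lower() else "other"
asr = attack_success_rate(classify, ["Where are my bags?", "Seat upgrade please"])
print(f"ASR: {asr:.2f}")
```

Swapping in word-level substitutions or simulated speech-to-text errors for `perturb`, and the production classifier for `classify`, yields the behavioral test battery described above without changing the harness.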
Reproducibility Framework Assessment
- Implements standardized experimental protocols with comprehensive documentation requirements following established research reproducibility checklists
- Requires multi-seed evaluation with statistical significance testing and complete artifact version control for experiment replication
- Utilizes containerized environments and automated pipeline management to ensure consistent results across different computational platforms and research teams
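In practice the multi-seed protocol above reduces to two checks, sketched here with a stand-in "experiment" in place of a real evaluation run:

```python
import random
import statistics

def run_experiment(seed):
    """Stand-in for one evaluation run; identical seeds must yield identical scores."""
    rng = random.Random(seed)
    return statistics.mean(rng.random() for _ in range(100))  # placeholder "accuracy"

# Check 1: same seed, same pipeline → bit-identical result.
assert run_experiment(0) == run_experiment(0)

# Check 2: multi-seed evaluation → report mean ± std, not a single-run number.
scores = [run_experiment(s) for s in range(5)]
print(f"mean={statistics.mean(scores):.4f} std={statistics.stdev(scores):.4f}")
```

Containerization and artifact versioning extend the same guarantee from the random seed to the full environment: any collaborator re-running the pinned configuration should reproduce the reported mean and standard deviation exactly.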
4.3 Method Comparison Matrix
Method | Uncertainty Assessment | Reproducibility | Out-of-Scope Detection | Latency Profiling | Multilingual Consistency | Drift Detection | Multi-Intent Detection | Robustness Assessment | References |
---|---|---|---|---|---|---|---|---|---|
Conformal Prediction Evaluation | High | High | High | Medium-High | High | Low | High | High | NAACL 2024 arXiv 2503.15850 arXiv 2309.06240 |
Out-of-Scope Detection Evaluation | Low | High | High | Medium | Low | Low | Low | Medium-High | EMNLP 2019 Springer 2023 arXiv 2403.05640 |
Latency Profiling Protocol | Low | High | Low | High | Medium | Low | Medium | Low | Databricks Blog DZone Article Latency Metrics |
Multilingual Consistency Assessment | Low | High | Low | Medium | High | Low | Medium-High | High | ACL 2023 ResearchGate EMNLP 2021 |
Drift Detection Methodology | Low | High | Low | Low | Low | High | Low | Low | Microsoft Docs Evidently AI arXiv 2505.17043 |
Multi-Intent Detection Evaluation | Low | High | Low | Medium | Medium | Low | High | Low | EMNLP 2023 Science Direct FewNLU |
Robustness Assessment Framework | Low | High | Low | Medium | High | Low | Low | High | GitHub Repo EMNLP 2023 CMU Blog |
Reproducibility Framework Assessment | Low | High | Low | Low | Medium | Medium | Medium | Low | arXiv 2407.10239 Nature arXiv 2406.14325 |
4.4 Method Selection & Justification
Based on the comparative analysis of evaluation methods and the critical requirements of airline operations, this study adopts a focused evaluation strategy targeting four essential capabilities: uncertainty quantification for safety-critical decision-making, out-of-scope detection for operational boundary management, latency optimization for real-time passenger interactions, and multi-intent processing for complex service requests. These selected methods address the unique challenges of airline chatbot systems including regulatory compliance for automated decisions, prevention of dangerous query misrouting, operational efficiency during peak travel periods, and comprehensive understanding of compound passenger needs across booking modifications, service preferences, and support requests.
Method | Rationale |
---|---|
Conformal Prediction Evaluation | Provides theoretical guarantees for uncertainty quantification essential in safety-critical airline operations where overconfident misclassifications can trigger inappropriate automated responses. Enables mathematically proven confidence intervals for human handoff decisions during operational disruptions and emergency scenarios. Critical for regulatory compliance requiring auditable decision-making processes in aviation safety systems. |
Out-of-Scope Detection Evaluation | Essential for ensuring safety and reliability when chatbots are exposed to open-ended customer inputs beyond airline scope. Prevents dangerous misrouting of non-aviation queries (hotel bookings, rental cars, weather) into operational flight systems. Critical for maintaining service boundaries and preventing automated responses to queries outside airline operational scope, directly impacting passenger safety. |
Latency Profiling Protocol | Reflects constraints in production environments such as check-in kiosks, mobile apps, and real-time customer service interactions. Models must demonstrate system responsiveness under realistic passenger interaction loads to maintain operational efficiency during peak travel periods. Essential for validating deployment viability in time-critical airline operations where response delays can cascade into operational disruptions. |
Multi-Intent Detection Evaluation | Addresses complex passenger queries that span multiple services (booking modifications + seat preferences + meal requests) common in airline customer interactions. Validates system capability to handle compound service requests through multi-label classification performance assessment. Critical for reducing clarification overhead and improving passenger experience during complex transaction scenarios. |
5. Final Recommendations & Next Steps
5.1. Summary of Comparative Findings
The analysis conducted in this study shows that intent classification techniques exhibit diverse performance profiles depending on their methodological approach and the context of application. Prompting-based methods, such as Few-Shot In-Context Learning, are particularly useful during early development stages or prototyping, as they require no training and allow for rapid iteration. Embedding-based techniques, such as Dual Encoder, stand out for their computational efficiency, making them suitable for systems with latency or infrastructure constraints. Hybrid approaches, such as Adaptive ICL + CoT Hybrid Routing, offer a favorable balance between accuracy and inference time by combining lightweight classifiers with high-capacity LLMs. Generative methods, including IntentGPT and Gen-PINT, enable advanced tasks such as automatic intent discovery and semantic adaptation, although they often entail greater operational requirements.
These findings suggest that the selection of a technique should not rely on a single performance metric. Instead, it should account for the specific operational context, system constraints, and functional objectives relevant to each phase in the development and deployment of a conversational AI system.
5.2. Technique Recommendations by Use Case
The table below summarizes recommended techniques for common use cases in airline conversational systems:
Use Case Scenario | Recommended Technique | Primary Justification |
---|---|---|
Initial deployment / prototyping | Few-Shot In-Context Learning (ICL) | Requires no training, enables rapid iteration with minimal labeled data |
Production scaling | Adaptive ICL + CoT Hybrid Routing | High accuracy with reduced latency; optimal trade-off for real-time environments |
Emerging or evolving intents | IntentGPT | Unsupervised intent discovery; useful when services or policies change dynamically |
Domain-specific adaptation | PEFT (IA3 Adapters) | Efficient fine-tuning; low resource cost; allows for rapid domain specialization |
These recommendations are grounded in the comparative analysis in Section 3.3 and are aligned with the technical attributes and performance metrics presented in Sections 4.1 and 4.2.
5.3. Implementation Considerations
Before selecting or deploying an intent classification technique, the following practical aspects should be taken into account:
- Infrastructure availability: some techniques require access to large language models or GPU-based inference environments.
- Real-time response requirements: use cases such as automated check-in or airport support require low inference latency.
- Out-of-scope (OOS) robustness: it is critical to detect irrelevant or unexpected inputs to avoid generating inappropriate responses.
- Maintainability and adaptability: systems must be capable of rapid updates in response to changes in regulations, services, or query patterns.
- Multilingual support and cultural variation: airline systems must deliver consistent classification across different languages and regions.
5.4. Future Work
This study provides a solid foundation for evaluating intent classification techniques in complex operational domains such as aviation. However, several lines of work are proposed to further strengthen the applicability and robustness of such systems:
- Development of a domain-specific dataset for airlines, including real passenger utterances, expert annotations, and ambiguous cases.
- Multilingual and cross-regional evaluation with native speakers, to validate semantic consistency and intent boundaries across cultural contexts.
- Controlled experimentation with clarification techniques, such as TO-GATE or LLM-based interactive clarification, to improve real-time ambiguity resolution.
- Design of domain-oriented benchmarks that combine multiple task types (classification, OOS detection, discovery, reasoning) and integrate evaluation metrics.
These steps will help advance toward more reliable, adaptive, and operationally aligned conversational systems for the airline industry.
6. Conclusion
This study presents a comprehensive analysis of intent classification techniques and evaluation methods, with a specific focus on their applicability to airline customer service systems. Through a detailed characterization of the airline domain, a structured comparison of 18 intent classification approaches, and an in-depth review of evaluation protocols, we highlight the strengths, limitations, and suitability of each technique under different operational scenarios.
Our findings show that no single method excels universally. Instead, successful deployment depends on aligning technique selection with specific contextual requirements—such as the availability of training data, latency constraints, or the need for out-of-scope handling and clarification. Prompting-based methods, hybrid architectures, and LLM-powered discovery systems each provide complementary strengths that can be leveraged at different stages of system development.
The evaluation strategy proposed in this study emphasizes few-shot adaptability, semantic robustness, and real-time responsiveness. These dimensions are critical to supporting the multilingual, dynamic, and operationally sensitive nature of airline applications.
Overall, this work provides actionable guidance for researchers and practitioners seeking to design or evaluate intent classification systems in complex real-world domains, offering both a theoretical foundation and practical pathways for scalable and reliable deployment.