A Comparative Study of Intent Classification Techniques and Evaluation Methods for Airline Applications
Detailed research on intent classification techniques and evaluation methods for airline applications, covering prompting-based, embedding-based, and hybrid approaches for understanding customer needs.


Abstract
This study analyzes intent classification techniques for airline customer service applications.
Airlines operate in a high-stakes environment where misunderstanding passenger intent can trigger operational disruption, and multilingual requirements and rapidly shifting operational contexts compound the difficulty. A single misclassified intent about a flight change or baggage issue can cascade through downstream systems, affecting schedules, crew assignments, and passenger safety.
We examine 18 techniques across seven categories, spanning prompting-based, embedding-based, generative, fine-tuning-based, hybrid, specialized, and LLM-based clarification approaches. Each is evaluated against accuracy, latency, few-shot learning capability, and cross-domain transfer.
Our analysis reveals Few-Shot In-Context Learning as the most effective approach for initial deployment, while Hybrid Systems with Uncertainty-Based Routing prove superior for production scaling. The evaluation framework covers four critical areas: conformal prediction for uncertainty quantification, out-of-scope detection for operational safety, latency profiling for real-time responsiveness, and multi-intent detection for complex queries.
This research provides practical guidance for deploying robust intent classification in safety-critical, multilingual airline environments.
1. Domain Characterization
1.1 Context
This study focuses on airlines as a domain that presents several characteristics relevant to intent classification research. Airlines operate in a complex operational environment with diverse customer interaction patterns.
Domain Criticality: Airline intent misclassification can have significant operational and financial consequences. Research shows that operating costs from delays constituted 86-134% of airline operating profits in 2007, with industry-wide delay costs ranging $8.3-13 billion annually (Gu et al., 2024). Misunderstanding customer requests during service disruptions, flight changes, or emergency situations can impact passenger experience and operational efficiency. The stakes are higher than in typical e-commerce or general customer service applications.
Multilingual and Cultural Complexity: International airlines must support intent classification across multiple languages and cultural contexts, where the same underlying need may be expressed differently across regions. Studies indicate that 74% of customers are more likely to repurchase from businesses offering customer service in their native language (HappyFox, 2025). Passenger queries reflect diverse linguistic patterns and cultural expectations for service delivery.
Operational Dynamics: Airlines operate in a constantly changing environment where flight schedules, policies, weather conditions, and regulatory requirements evolve continuously. Intent classification systems must handle queries about dynamic information while maintaining accuracy across changing operational contexts.
Contextual Dependencies: Airline customer interactions often involve multiple related intents and require understanding of travel context. Passengers frequently have compound requests that span booking modifications, service preferences, and operational concerns within a single conversation.
System Integration Requirements: Unlike many conversational AI applications, airline intent classification typically requires integration with real-time operational systems including reservations, flight operations, baggage tracking, and customer management platforms. This integration complexity adds constraints to system design and deployment.
These characteristics make airlines a challenging and representative domain for evaluating intent classification techniques, as solutions must demonstrate robustness, adaptability, and operational reliability.
1.2 Canonical User Intents for Airline Systems
To design and evaluate intent classification systems in the airline domain, it is essential to establish a comprehensive and functionally organized set of user intents. These intents represent the core types of queries and requests that passengers generate when interacting with airline customer service platforms—whether through chatbots, apps, or voice assistants—across all phases of the travel experience.
This study defines 35 canonical intents grouped into seven functional categories. These were derived through comparative analysis of public dialogue datasets including ATIS (Hou et al., 2024) with 17 intent categories, SGD (Rastogi et al., 2020) with 20 domains covering travel, events, and other services, and CLINC150 (Larson et al., 2019) with 150 intents across 10 domains, combined with analysis of real-world airline chatbot interfaces and relevant literature. The taxonomy reflects both traditional service needs and emerging user demands such as environmental responsibility, digital account management, and personalized onboard services.
1. Flight Planning & Reservation
These intents capture user interactions during the early planning and booking stages.
- flight_search – Find available flights based on destination, date, time, or price.
- flight_booking – Complete a flight reservation, including passenger details and add-ons.
- seat_selection – Select or change seat assignments, request upgrades.
- fare_rules_explained – Inquire about fare conditions, change penalties, refundability.
- booking_modification – Change flight details (e.g., dates, names) for an existing reservation.
- booking_cancellation – Cancel a booked ticket and receive applicable refund terms.
- upgrade_offer_bid – Submit or review upgrade offers (e.g., to business class).
- lounge_access – Ask about eligibility, pricing, and location of airport lounges.
2. Check-in & Boarding
These intents occur closer to departure and involve real-time or operational support.
- check_in_boarding – Perform check-in or retrieve digital boarding passes.
- boarding_group_info – Understand assigned boarding group or priority boarding rules.
- flight_status – Request real-time updates on flight departure, delays, gate changes.
- alternative_flights – Search for other flight options (e.g., in case of disruption).
- rebooking_assistance – Automatically rebook after a delay, cancellation, or missed connection.
- weather_updates – Check how weather may affect flights.
- airport_information – Ask about services, navigation, or amenities at a specific airport.
3. Baggage Services
These intents focus on all aspects of baggage policies, tracking, and incidents.
- baggage_policy – Ask about weight, size, and allowance rules.
- baggage_fees – Calculate or understand fees for checked or excess baggage.
- carry_on – Verify carry-on size, item restrictions, and allowances.
- baggage_tracking – Track the status or location of checked baggage.
- lost_baggage – Report or follow up on lost, delayed, or damaged baggage.
- special_equipment – Ask about traveling with sports gear, musical instruments, medical devices.
4. Post-Sale Support & Claims
These intents represent post-booking concerns, support, and compensation.
- refund_request – Request refund based on ticket conditions or travel disruption.
- compensation_claim – File a claim for delays, cancellations, baggage incidents.
- travel_insurance – Inquire about coverage, purchase options, or initiate a claim.
- loyalty_program – Ask about mileage accrual, redemption, tier benefits, or upgrades.
5. Documentation & Travel Requirements
These intents deal with regulatory and compliance matters before departure.
- visa_requirements – Check visa obligations for the destination.
- passport_validity – Verify passport expiration and entry rules.
- health_requirements – Ask about vaccination, testing, or medical certificate needs.
- travel_restrictions – Request information on entry bans, COVID-19 rules, or quarantine.
6. Special Services & Onboard Preferences
These intents cover optional or personalized services.
- meal_preferences – Select or inquire about special meals (e.g., vegetarian, halal).
- special_assistance – Request wheelchair service, escort, or medical accommodations.
- carbon_offset_program – Participate in sustainability programs or purchase offsets.
- wifi_connectivity – Ask about onboard Wi-Fi availability, pricing, and usage.
7. Digital Account & Mobile App Support
These intents involve digital access and personalization.
- digital_account_login – Troubleshoot login issues for user profiles or frequent flyer accounts.
- app_notification_settings – Configure alerts for delays, gate changes, promotions, or check-in reminders.
Note: This taxonomy represents a theoretical framework derived from dataset analysis and industry observation. The 35 intents proposed here are based on comparative analysis of existing dialogue datasets and airline service patterns, but empirical validation with real airline operational data would be required for production deployment.
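For implementation purposes, the taxonomy above can be captured as a simple data structure; a minimal Python sketch (the category keys are our own shorthand, not part of the taxonomy itself):

```python
# Canonical airline intent taxonomy: 35 intents in 7 functional categories.
AIRLINE_INTENTS = {
    "flight_planning_reservation": [
        "flight_search", "flight_booking", "seat_selection",
        "fare_rules_explained", "booking_modification",
        "booking_cancellation", "upgrade_offer_bid", "lounge_access",
    ],
    "check_in_boarding": [
        "check_in_boarding", "boarding_group_info", "flight_status",
        "alternative_flights", "rebooking_assistance",
        "weather_updates", "airport_information",
    ],
    "baggage_services": [
        "baggage_policy", "baggage_fees", "carry_on",
        "baggage_tracking", "lost_baggage", "special_equipment",
    ],
    "post_sale_support_claims": [
        "refund_request", "compensation_claim",
        "travel_insurance", "loyalty_program",
    ],
    "documentation_travel_requirements": [
        "visa_requirements", "passport_validity",
        "health_requirements", "travel_restrictions",
    ],
    "special_services_onboard": [
        "meal_preferences", "special_assistance",
        "carbon_offset_program", "wifi_connectivity",
    ],
    "digital_account_app": [
        "digital_account_login", "app_notification_settings",
    ],
}

# Flat list of all intent labels, with a sanity check on the count.
ALL_INTENTS = [i for intents in AIRLINE_INTENTS.values() for i in intents]
assert len(ALL_INTENTS) == 35 and len(set(ALL_INTENTS)) == 35
```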
1.3 Domain-Specific Requirements for Intent Classification Systems
The airline domain presents specific challenges for intent classification that influence how standard classification requirements manifest in this operational context. The following requirements emerge from the theoretical analysis of airline intents and how passengers typically express their needs in this domain.
Classification Accuracy Requirements
- Intent Disambiguation Capability - Airline intents frequently overlap semantically (baggage_policy vs baggage_fees, flight_search vs flight_status), requiring systems that distinguish between closely related categories based on subtle linguistic cues
- Domain Terminology Recognition - Classification must handle airline-specific vocabulary (layover, codeshare, PNR, oversold) and abbreviations that affect intent identification
- Context-Dependent Classification - Passengers may use the same phrase for different intents depending on their specific travel situation
Robustness Requirements
- Variation Handling - Passengers express the same intent using diverse terminology influenced by regional differences, airline branding, and travel experience levels
- Multilingual Classification Consistency - Intent boundaries must remain consistent across languages where travel concepts may have different cultural associations
- Out-of-Scope Detection for Domain Boundaries - Distinguish between airline-related queries and unrelated requests (hotel bookings, ground transportation)
Adaptability Requirements
- Seasonal Intent Recognition - Classification must adapt to temporal variations in how passengers express travel needs (holiday travel patterns, weather-related concerns, seasonal routes)
- Intent Granularity Flexibility - Ability to classify at different levels of specificity as airline services evolve
- New Intent Integration - Accommodate emerging intent categories as airline services change (new health requirements, technology services, environmental concerns)
These requirements specifically address the challenges of correctly identifying passenger intent categories within the airline domain's linguistic and operational constraints.
2. Guiding Research Questions
2.1. Questions for Intent Classification Techniques
- What are the main existing techniques for intent classification in conversational AI applications?
- What technical attributes and capabilities are used to compare these techniques (architecture, few-shot learning, latency, computational requirements)?
- How do these techniques perform across key metrics like accuracy, out-of-scope detection, and operational efficiency in airline-relevant scenarios?
- Which techniques are most suitable for airline chatbots considering domain-specific requirements and constraints?
- How can intent classification techniques be combined with clarification and out-of-scope detection to improve system robustness?
2.2. Questions for Evaluation Methods
- What are the main evaluation methodologies used for assessing intent classification systems?
- What metrics, datasets, and evaluation protocols are employed to measure performance and compare different approaches?
- How do these evaluation methods address airline-specific challenges like multilingual support, contextual dependencies, and operational criticality?
- What evaluation strategy is most appropriate for validating intent classification systems in airline applications?
3. Intent Classification Techniques
To select the most appropriate intent classification techniques for airline applications, it is necessary to systematically analyze the available options. This section compares the main techniques identified in the current literature, evaluating their specific characteristics in relation to the domain requirements outlined earlier.
3.1. Identified Techniques
Through a systematic review of current literature, 18 specific intent classification techniques were identified and categorized into seven main groups. These techniques represent the state of the art in the field, ranging from fully training-free approaches to sophisticated hybrid systems.
Prompting-Based Techniques
- Zero-Shot In-Context Learning (ICL) – A technique in which the language model classifies user intents using only textual descriptions of each intent, without any prior training examples. The LLM relies on its pretrained knowledge to map the user query to the most appropriate intent category based solely on the definitions provided (Parikh et al., 2023).
- Few-Shot In-Context Learning – This approach includes 1 to 10 representative examples of each intent directly within the prompt, along with the query to be classified. The LLM infers patterns and features from the examples to perform intent classification (Parikh et al., 2023; Zhang et al., 2024).
- Adaptive In-Context Learning – An extension of few-shot ICL that dynamically selects the most relevant examples for each query using semantic similarity. Embedding models are used to retrieve the K nearest examples to the input, and the prompt is constructed adaptively (Arora et al., 2024; Rodriguez et al., 2024).
- Chain-of-Thought (CoT) Prompting – A prompting technique that instructs the LLM to explain its reasoning step by step before producing a final classification. The model verbalizes the analysis process, highlighting key features of the input before issuing its decision (Arora et al., 2024).
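The zero-shot and few-shot variants above differ mainly in whether labeled demonstrations are packed into the prompt. A minimal sketch of prompt assembly (the prompt wording and intent descriptions are illustrative, not taken from the cited papers):

```python
def build_prompt(query, intent_descriptions, examples=None):
    """Assemble an intent-classification prompt for an LLM.

    Zero-shot ICL: only intent descriptions are provided.
    Few-shot ICL: labeled demonstrations are appended as well.
    """
    lines = ["Classify the user query into one of these intents:"]
    for intent, desc in intent_descriptions.items():
        lines.append(f"- {intent}: {desc}")
    if examples:  # few-shot ICL: include labeled demonstrations
        lines.append("")
        lines.append("Examples:")
        for text, label in examples:
            lines.append(f'Query: "{text}" -> Intent: {label}')
    lines.append("")
    lines.append(f'Query: "{query}" -> Intent:')
    return "\n".join(lines)
```

Adaptive ICL would replace the static `examples` list with the K nearest neighbors of the query retrieved by embedding similarity, and CoT prompting would add an instruction to reason step by step before the final label.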
Embedding-Based Techniques
- Dual Sentence Encoders (USE + ConveRT) – An architecture that combines the Universal Sentence Encoder (USE) for general-purpose representations with ConveRT, a conversationally optimized encoder, as fixed feature extractors. The resulting embeddings are concatenated and passed to a simple MLP classifier with one hidden layer. Both encoders remain frozen during training (Casanueva et al., 2020).
- SetFit (Sentence Transformer Fine-tuning) – A few-shot learning method that performs contrastive fine-tuning of sentence transformers using positive and negative pairs, followed by training a lightweight classifier on the learned representations. It combines contrastive learning with discriminative classification (Arora et al., 2024).
- Label-Aware BERT Attention Network (LABAN) – A neural architecture that builds an embedding space explicitly informed by the semantics of intent labels. It uses attention mechanisms to project user queries into this semantic space and classifies them based on projection weights toward each intent category (Wu et al., 2021).
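The dual-encoder idea of concatenating two frozen encoders' outputs before a small classification head can be sketched with toy stand-ins; here trivial word- and character-bigram encoders replace USE and ConveRT, and a nearest-centroid rule stands in for the single-hidden-layer MLP:

```python
import numpy as np

VOCAB = ["baggage", "bag", "fees", "flight", "status", "delayed", "track"]
BIGRAMS = ["ba", "fl", "st", "tr"]

def word_encoder(text):
    """Stand-in for USE: normalized bag-of-words over a fixed vocabulary."""
    toks = text.lower().split()
    v = np.array([float(toks.count(w)) for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n else v

def bigram_encoder(text):
    """Stand-in for ConveRT: normalized character-bigram counts."""
    s = text.lower()
    v = np.array([float(s.count(b)) for b in BIGRAMS])
    n = np.linalg.norm(v)
    return v / n if n else v

def encode(text):
    # Dual-encoder trick: concatenate both frozen encoders' outputs.
    return np.concatenate([word_encoder(text), bigram_encoder(text)])

def train_centroids(labeled):
    # Nearest-centroid rule standing in for the trained MLP head.
    return {label: np.mean([encode(t) for t in texts], axis=0)
            for label, texts in labeled.items()}

def classify(text, centroids):
    v = encode(text)
    return max(centroids, key=lambda label: float(v @ centroids[label]))
```

The design point carries over from the real system: because both encoders stay frozen, only the small head needs training, which is what makes the approach data-efficient in 10-shot settings.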
Generative-Based Techniques
- Text-to-Text Generation (Gen-PINT) – A complete reformulation of the classification problem as a free-form text generation task. Instead of selecting from predefined intent classes, the LLM generates the intent name or description directly as natural language using instruction tuning (Zhang et al., 2024).
- Intent Discovery with LLMs (IntentGPT) – A training-free system that employs a dual-LLM architecture for automatic intent discovery. It combines a contextual prompt generator, an intent predictor, and a semantic sampler that selects relevant examples through automatic clustering to identify emerging intent categories (Rodriguez et al., 2024).
- Generate-then-Refine – A two-stage pipeline in which synthetic utterances are first generated using LLMs in a zero-shot setting to expand the dataset, followed by a seq2seq refinement model that enhances the quality, coherence, and utility of the generated data (Lin et al., 2024).
Fine-Tuning-Based Techniques
- BERT Fine-tuning – Full fine-tuning of pretrained transformer models (e.g., BERT-Large) on domain-specific intent classification datasets. This approach requires updating all model parameters to adapt the model to the target task (Casanueva et al., 2020).
- PEFT (IA3 Adapters) – A parameter-efficient fine-tuning method that uses IA3 adapters (Infused Adapter by Inhibiting and Amplifying Inner Activations) to adapt pretrained models by training only a small subset of parameters, achieving competitive performance with very limited examples (Parikh et al., 2023).
Hybrid Techniques
- Hybrid System with Uncertainty-Based Routing – A system that combines lightweight models (e.g., SetFit) with large language models through an uncertainty-driven routing strategy. It uses confidence estimation to forward queries to more capable models only when uncertainty exceeds a predefined threshold, balancing accuracy and efficiency (Arora et al., 2024).
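The routing logic of such a hybrid system reduces to a confidence check; a minimal sketch, assuming the lightweight model returns a confidence score and the threshold is tuned offline (the 0.8 value is illustrative):

```python
def route(query, small_model, llm_model, threshold=0.8):
    """Uncertainty-based routing: answer with the lightweight model when
    it is confident, escalate to the LLM otherwise."""
    intent, confidence = small_model(query)
    if confidence >= threshold:
        return intent, "small_model"
    return llm_model(query), "llm"
```

Since most production traffic is routine, the majority of queries never reach the LLM, which is how the reported latency savings arise while accuracy stays close to full LLM inference.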
Specialized Techniques
- Out-of-Scope Detection – A set of specialized methods designed to detect user queries that do not belong to any predefined intent. These approaches include confidence-based thresholds, representation clustering, and similarity analysis to prevent misclassification of out-of-domain inputs (Arora et al., 2024).
- Two-Step OOS Detection – A two-stage methodology for out-of-scope detection. First, it predicts in-scope labels while ignoring potential OOS cases. Then, it compares the transformer's internal representations with training instances using cosine similarity to identify outliers (Arora et al., 2024).
- Multi-Intent Detection – Techniques specifically designed to detect and classify multiple intents within a single user query. These systems rely on multi-label architectures, attention mechanisms, or sequential decomposition to capture all distinct purposes present in complex utterances (Wu et al., 2021).
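The second stage of Two-Step OOS Detection can be sketched as a similarity check against training representations (the 0.7 threshold and the precomputed vectors are illustrative assumptions; the vectors would come from the classifier's internal representations):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_step_oos(query_vec, predicted_label, train_vecs_by_label, threshold=0.7):
    """Step 1 has already produced an in-scope label; step 2 compares the
    query representation with training instances of that label and flags
    the query as out-of-scope when even the best match is too dissimilar."""
    sims = [cosine(query_vec, v) for v in train_vecs_by_label[predicted_label]]
    return "oos" if max(sims) < threshold else predicted_label
```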
LLM-Based Clarification Techniques
- LLM Ambiguity Identification and Clarification – A system that leverages large language models (e.g., ChatGPT, GPT-4) to identify ambiguous user queries and generate appropriate clarification questions. It employs chain-of-thought reasoning and few-shot prompting to systematically detect and resolve ambiguities (Zhang et al., 2024).
- LLM-Based Interactive Clarification – A method that uses a "communicator" LLM to identify high-uncertainty and low-confidence segments in user problem descriptions. It then generates specific clarification questions to elicit additional information before proceeding with the task (Wu, 2023).
These 18 techniques represent the full spectrum of approaches available for intent classification, ranging from simple and efficient methods to complex systems with advanced capabilities for intent discovery and clarification.
Note: Some techniques represent conceptual approaches or recent developments that may require additional empirical validation for comparative evaluation in operational airline systems.
3.2. Technical Attributes for Comparison
To systematically evaluate the identified intent classification techniques, key attributes were defined based on a comprehensive review of the research literature. These attributes represent the most relevant dimensions used by researchers to characterize and compare methods, and are particularly applicable to airline-related use cases.
Primary Performance Metrics
- Accuracy - Primary metric used across all frameworks for overall classification correctness
- F1-Score - Harmonic mean of precision and recall, especially important for out-of-scope detection and imbalanced classes
Specialized Metrics
- Out-of-Scope Recall - Percentage of out-of-scope samples correctly identified as not belonging to any predefined intent category. Critical for production systems to prevent incorrect automated responses when passengers ask questions outside the chatbot's domain (e.g., asking about hotel bookings to an airline chatbot). High OOS recall prevents embarrassing misclassifications but must be balanced with precision to avoid rejecting valid queries.
- NMI (Normalized Mutual Information) - Clustering evaluation metric that measures the quality of discovered intent groupings compared to ground truth labels. Values range from 0 (random clustering) to 1 (perfect clustering). Used specifically for intent discovery tasks where the system automatically identifies new intent categories from unlabeled data, particularly relevant for discovering emerging customer needs in airline services.
- ARI (Adjusted Rand Index) - Clustering metric that compares predicted intent partitions against ground truth, adjusted for chance agreement. Unlike raw accuracy, ARI accounts for the expected similarity that would occur by random chance. Values range from -1 to 1, where 1 indicates perfect clustering. Used alongside NMI for comprehensive evaluation of intent discovery systems.
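NMI can be computed directly from the contingency counts of the two labelings; a minimal pure-Python sketch using the geometric-mean normalization (libraries such as scikit-learn provide equivalent, more robust implementations):

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same
    items: MI(true, pred) / sqrt(H(true) * H(pred))."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))   # contingency counts
    p_true = Counter(labels_true)
    p_pred = Counter(labels_pred)
    mi = sum(c / n * log((c * n) / (p_true[a] * p_pred[b]))
             for (a, b), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in p_true.values())
    h_pred = -sum(c / n * log(c / n) for c in p_pred.values())
    if h_true == 0 or h_pred == 0:   # degenerate single-cluster case
        return 1.0 if h_true == h_pred else 0.0
    return mi / sqrt(h_true * h_pred)
```

Note that NMI is invariant to permuting cluster IDs: a discovered clustering that matches the ground truth under relabeling still scores 1.0.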
Operational Metrics
- Latency/Inference Time - Time required to process a single query from input to intent prediction output. Critical for real-time customer service applications where delays impact user experience. Measured in milliseconds or seconds, with airline chatbots typically requiring sub-2-second responses to maintain conversational flow.
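Latency is better summarized by tail percentiles than by the mean for conversational systems; a minimal profiling sketch (the `classify` callable is a placeholder for any of the techniques in Section 3):

```python
import time

def profile_latency(classify, queries, runs=1):
    """Measure per-query wall-clock latency and report tail percentiles,
    which matter more than the mean for conversational responsiveness."""
    samples = []
    for _ in range(runs):
        for q in queries:
            start = time.perf_counter()
            classify(q)
            samples.append(time.perf_counter() - start)
    samples.sort()
    def pct(p):
        return samples[min(len(samples) - 1, int(p / 100 * len(samples)))]
    return {"p50": pct(50), "p95": pct(95), "max": samples[-1]}
```

Against the sub-2-second target mentioned above, the p95 and max values are the figures to watch, since occasional slow responses are what break conversational flow.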
Learning Configuration Performance
- Zero-Shot Performance - Model's ability to classify intents using only textual descriptions without any training examples. Evaluated by providing intent definitions (e.g., "flight_search: user wants to find available flights") and measuring classification accuracy on unseen queries. Crucial for rapidly deploying systems to new routes or services without collecting training data.
- Few-Shot Performance - Classification accuracy when trained with very limited examples per intent (typically 1, 5, or 10 examples). Measures data efficiency and practical deployment feasibility, as collecting extensive training data for every airline intent is resource-intensive. Performance curves show how accuracy improves with additional examples.
- Cross-Domain Transfer - Model's ability to maintain performance when applied to different airline contexts or related domains. For example, a model trained on domestic flight intents transferring to international travel queries, or adapting from one airline's terminology to another's. Measures generalization capability across operational contexts.
Additional Evaluation Metrics
- Semantic Similarity - Cosine similarity between generated intent names/descriptions and ground truth labels, measured using sentence embeddings. This metric is used when models generate free-form intent labels rather than selecting from predefined categories. It is particularly relevant for intent discovery systems (e.g., IntentGPT) or text-to-text generative models (e.g., Gen-PINT), where semantic matching is more informative than exact string comparison.
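Semantic-similarity scoring reduces to a cosine comparison between embedding vectors; a minimal sketch, assuming the label embeddings have already been computed with a sentence encoder (the 2-dimensional vectors in the usage example are toy values):

```python
import numpy as np

def best_match(generated_vec, gold_label_vecs):
    """Map a free-form generated intent label to the closest ground-truth
    label by cosine similarity of their (precomputed) sentence embeddings."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {label: cosine(generated_vec, v)
              for label, v in gold_label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

This is the matching step that lets generative systems such as IntentGPT or Gen-PINT be scored against a fixed label set without requiring exact string equality.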
3.3. Performance Comparison of Techniques
This section presents a systematic comparison of techniques using the attributes defined in the previous section. The techniques are grouped by methodological category to facilitate comparative analysis and highlight patterns across different approaches.
Prompting-based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Zero-Shot ICL | 73.9% (MASSIVE, Flan-T5-XXL) | Not reported | 0.97 (Benchmark01, GPT-3) | Not reported | Not reported | Not reported | Primary method | Not applicable | Not reported | Not reported |
Few-Shot ICL | 63% (MASSIVE, K=5, ELMSE) | Not reported | Not reported | Not reported | Not reported | Not reported | Not applicable | K=3: 57% (MASSIVE) | Not reported | Not reported |
Adaptive ICL | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Adaptive example retrieval | Not reported | Not reported |
Chain-of-Thought | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Emergent capability ≥100B parameters | Few-shot prompting | Cross-task evaluation | Not reported |
Embedding-based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Dual Encoder | 85.19% (BANKING77, 10-shot) | Not reported | Not reported | Not reported | Not reported | Reported as faster than BERT | Not reported | 10-shot performance demonstrated | Not reported | USE + ConveRT embeddings |
SetFit | Not reported | Not reported | Not reported | Not reported | Not reported | Reported as highly efficient | Not reported | 8–16 examples typical | Not reported | Contrastive fine-tuning approach |
LABAN | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Reported zero-shot capability | Not reported | Reported zero-shot transfer capability | Label-aware semantic space |
Generative-based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Gen-PINT | 77.38% (average across 8 datasets, 1-shot) | Not reported | Not reported | Not reported | Not reported | Not reported | Reported cross-domain capability | Specialized 1-shot performance | Reported domain-agnostic generalization | Not reported |
IntentGPT | 77.21% (BANKING77, 50-shot GPT-4) | Not reported | Not reported | 96.06% (CLINC150, 50-shot GPT-4) | 84.76% (CLINC150, 50-shot GPT-4) | Not reported | Training-free approach | 50-shot performance reported | Training-free generalization | SBERT embeddings + cosine similarity |
Generate-then-Refine | 76.9% (CLINC150, 1-shot) | Not reported | Not reported | Not reported | Not reported | Not reported | Reported cross-domain transfer | Optimized 1-shot generation | Reported domain generalization | Not reported |
Fine-Tuning-Based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
BERT Fine-tuning | 96.93% (CLINC150, full dataset) | Not reported | Not reported | Not reported | Not reported | Not reported | Requires full fine-tuning | Requires sufficient training data | Not reported | BERT embeddings |
PEFT (IA3) | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | Not reported | One example per intent sufficient | Not reported | Not reported |
Hybrid-Based Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Hybrid System with Uncertainty-Based Routing | Reported within 2% of full LLM accuracy | Not reported | Not reported | Not reported | Not reported | Reported 50% latency reduction | Combines lightweight and LLM approaches | Uncertainty-based optimization | Not reported | Hybrid architecture |
Specialized Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
Out-of-Scope Detection | Not reported | Critical for evaluation | Primary metric for OOS | Not applicable | Not applicable | Not reported | Threshold-based methods | Adaptable to different configurations | Not reported | Threshold-based similarity |
Two-Step OOS | Not reported | Reported >5% improvement | Reported >5% improvement | Not applicable | Not applicable | Not reported | Compatible with base classifiers | Supports few-shot adaptation | Not reported | Internal representations similarity |
Multi-Intent Detection | Not reported | State-of-the-art performance | Not reported | Not applicable | Not applicable | Not reported | Multi-label classification capability | Architecture-dependent | Not reported | Multi-label architecture |
LLM-Based Clarification Techniques
Technique | Accuracy | F1-Score | Out-of-Scope Recall | NMI | ARI | Latency | Zero-Shot Performance | Few-Shot Performance | Cross-Domain Transfer | Semantic Similarity |
---|---|---|---|---|---|---|---|---|---|---|
LLM Ambiguity Identification | 54.25% (ChatGPT average) | 52.77% (ChatGPT average) | Reported to reduce ambiguity | Not applicable | Not applicable | Clarification overhead reported | Chain-of-Thought-based identification | Compatible with prompting examples | Not reported | Chain-of-Thought-based analysis |
LLM Interactive Clarification | Not reported | Not reported | Reported to handle high uncertainty | Not applicable | Not applicable | Multi-turn interaction overhead | Automatic identification capability | Adaptable framework | Not reported | Uncertainty-based interaction |
The comparative analysis reveals that no single technique consistently outperforms others across all evaluation metrics. Prompting-based methods, such as Zero-Shot ICL and Adaptive ICL, offer clear advantages in training-free scenarios, showing competitive performance in zero-shot classification and emerging capabilities in step-by-step reasoning. Embedding-based approaches, such as Dual Encoder, stand out for their operational efficiency and low inference time, making them appealing for production-grade systems. Generative techniques like IntentGPT and Gen-PINT introduce new opportunities for intent discovery and unsupervised classification, albeit with higher computational demands. Among hybrid methods, Hybrid System with Uncertainty-Based Routing emerges as a promising solution by balancing latency and accuracy through the combination of lightweight classifiers and high-capacity LLMs. Finally, specialized techniques, such as Two-Step OOS and LLM-based clarification systems, enable targeted handling of critical challenges like out-of-scope detection and interactive clarification.
These findings provide a foundation for the informed selection of techniques in the next section.
3.4. Technique Suitability for Airline Context
For the initial deployment of an intent classification system in the airline domain, we select Few-Shot In-Context Learning (ICL) as the baseline technique. This decision is grounded in operational simplicity, implementation feasibility, and alignment with early-stage scenarios where limited annotated data is available.
Technical Rationale
Few-Shot ICL is a prompt-based method that leverages pretrained large language models (LLMs) to perform classification without any additional fine-tuning. It operates by providing the model with a few labeled examples (typically 1 to 10) embedded directly in the prompt, followed by an unlabeled query. The LLM infers the most probable intent based on its prior knowledge and the observed input patterns.
Parikh et al. (2023) show that in-context prompting is effective for intent classification even with minimal data. On the MASSIVE benchmark, their ELMSE few-shot baseline achieved 63% accuracy with K=5 examples, while zero-shot Flan-T5-XXL reached 73.9%, indicating that both configurations are viable under different operational constraints (Parikh et al., 2023).
Furthermore, Few-Shot ICL requires no supervised training or modification of the model architecture, significantly reducing deployment complexity. Zhang et al. (2024) also highlight this method as a reliable baseline in low-resource classification experiments, especially in multilingual and cross-domain contexts (Zhang et al., 2024).
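As a concrete illustration, the prompt-construction step described above can be sketched as follows. The example utterances and intent labels are hypothetical, and the final LLM call is omitted since any modern model client could be substituted:

```python
# Minimal sketch of Few-Shot ICL prompt construction (hypothetical examples/labels).
FEW_SHOT_EXAMPLES = [
    ("I want to add a checked bag to my booking", "baggage_fees"),
    ("Can I move my flight to tomorrow morning?", "flight_change"),
    ("What is the carry-on size limit?", "baggage_policy"),
]

INTENTS = sorted({label for _, label in FEW_SHOT_EXAMPLES})

def build_prompt(query: str) -> str:
    """Embed K labeled examples directly in the prompt, followed by the query."""
    lines = ["Classify the passenger query into one of: " + ", ".join(INTENTS) + "."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Query: {text}\nIntent: {label}")
    lines.append(f"Query: {query}\nIntent:")
    return "\n\n".join(lines)

prompt = build_prompt("How much does an extra suitcase cost?")
# The prompt string would then be sent as-is to the chosen LLM; the model
# completes the trailing "Intent:" with the most probable label.
```

No training loop appears anywhere: the entire "deployment" is string construction plus one inference call, which is exactly what makes this technique attractive for a first iteration.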
Operational Advantages for First Deployment
- No fine-tuning required: Models can be used as-is with only prompt engineering.
- Architecture-agnostic: Compatible with any modern LLM (e.g., GPT-4, Claude, Mistral).
- Multilingual by design: LLMs exhibit multilingual competence in classification tasks (Chung et al., 2022; Winata et al., 2023).
- Benchmark-compatible: Easily evaluated using standard few-shot benchmarks like CLINC150 or BANKING77 in 1-shot, 5-shot, and 10-shot configurations (Casanueva et al., 2020).
Forward-Looking Enhancements
While Few-Shot ICL offers a robust starting point, more advanced techniques can be adopted in future iterations to meet specific operational demands:
- Hybrid System with Uncertainty-Based Routing, as proposed by Arora et al. (2024), combines lightweight models (e.g., SetFit) with LLMs using uncertainty-based routing. This method achieves accuracy within 2% of full LLM inference while reducing latency by 50%, making it ideal for production systems (Arora et al., 2024).
- IntentGPT, introduced by Rodriguez et al. (2024), enables training-free intent discovery. It achieves 96.06% NMI and 84.76% ARI on CLINC150, making it suitable for identifying emerging user needs and updating intent taxonomies without labeled data (Rodriguez et al., 2024).
- PEFT using IA3 adapters, explored by Parikh et al. (2023), offers parameter-efficient fine-tuning with strong performance using just one example per class, outperforming traditional full-model tuning in few-shot scenarios (Parikh et al., 2023).
These alternatives provide clear upgrade paths aligned with future system requirements such as latency optimization, out-of-scope detection, semantic scalability, and continuous integration of new airline services.
3.5 Combined Use: Intent Classification, Clarification, and Out-of-Scope Detection
In real-world airline applications, user queries often contain ambiguity, lack critical information, or fall entirely outside the scope of predefined intents. Such scenarios challenge the robustness of intent classification systems, particularly in high-stakes environments like airline operations, where misclassification can lead to operational errors or poor customer experience. To address these challenges, recent research has proposed combining intent classification with clarification mechanisms and out-of-scope (OOS) detection, resulting in more resilient and adaptive systems.
Integration Strategies
Three general integration strategies have emerged for combining intent classification, clarification, and OOS detection:
- Clarification-before-Classification: When a user query is ambiguous or underspecified, the system generates a clarification question to elicit additional information. Only after receiving a clarifying response does the system perform intent classification.
- OOS Filtering: Before or after attempting classification, the system evaluates whether the input falls outside the supported intent taxonomy. If the confidence score is low or the semantic similarity to in-scope intents is weak, the query is flagged as out-of-scope and handled accordingly.
- Uncertainty-Driven Routing: The system estimates its own confidence (e.g., via entropy, Monte Carlo dropout, or cosine similarity) and routes low-confidence inputs to a secondary module—such as a large language model (LLM) for clarification or a verification step using OOS thresholds.
These strategies can be implemented in sequence or conditionally, forming dynamic pipelines tailored to the nature of the input.
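The uncertainty-driven routing strategy above can be sketched in a few lines, assuming a lightweight classifier that returns a probability distribution over intents; the entropy threshold here is illustrative and would be tuned on validation data in practice:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(probs, intents, max_entropy=0.5):
    """Route low-confidence predictions to a secondary module (e.g., an LLM)."""
    if entropy(probs) > max_entropy:
        return ("escalate", None)  # hand off to clarification / OOS verification
    best = max(range(len(probs)), key=probs.__getitem__)
    return ("accept", intents[best])

intents = ["baggage_policy", "baggage_fees", "flight_change"]
print(route([0.97, 0.02, 0.01], intents))  # confident distribution → accept
print(route([0.40, 0.35, 0.25], intents))  # ambiguous distribution → escalate
```

Monte Carlo dropout or cosine similarity to intent prototypes could be swapped in for entropy without changing the routing skeleton.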
Improving System Robustness
The integration of clarification and OOS detection into intent classification systems improves robustness in the following ways:
- Error Prevention: OOS detection reduces false positives by rejecting inputs that do not belong to the supported domain (e.g., hotel bookings in an airline chatbot).
- Disambiguation Support: Clarification questions enable the system to resolve input ambiguity (e.g., distinguishing between baggage_policy and baggage_fees).
- Adaptive Processing: Uncertainty-based routing allows the system to dynamically escalate inputs to more capable modules when needed, rather than making uncertain predictions.
These improvements collectively reduce misclassification, increase user trust, and provide a more transparent conversational experience.
Functional Example
Consider the query: "What do I pay for my bags?"
This input could correspond to either baggage_policy (allowance rules) or baggage_fees (cost for additional bags).
A combined system would process this input as follows:
- The base classifier detects high uncertainty due to semantic overlap.
- The system triggers a clarification step: "Do you mean baggage allowance or extra fees?"
- The user responds: "Extra fees."
- The system now classifies the query as baggage_fees with high confidence.
Alternatively, for the input "Where can I book a hotel?", the OOS detection module would flag the query as out-of-domain and trigger a fallback response without attempting classification.
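The end-to-end behavior described above can be sketched as a conditional pipeline. The keyword-overlap "similarity" and both thresholds below are deliberately simplistic stand-ins for real embedding models and calibrated confidence scores:

```python
# Toy in-scope taxonomy with keyword sets standing in for intent embeddings.
IN_SCOPE = {
    "baggage_policy": {"bag", "bags", "baggage", "allowance", "carry-on"},
    "baggage_fees": {"bag", "bags", "baggage", "pay", "fee", "fees", "cost"},
}

def score(query, keywords):
    """Toy similarity: fraction of query tokens matching the intent's keywords."""
    tokens = set(query.lower().replace("?", "").split())
    return len(tokens & keywords) / max(len(tokens), 1)

def handle(query, oos_threshold=0.1, margin=0.2):
    scores = {intent: score(query, kw) for intent, kw in IN_SCOPE.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, s1), (_, s2) = ranked[0], ranked[1]
    if s1 < oos_threshold:
        return ("out_of_scope", None)            # fallback response, no classification
    if s1 - s2 < margin:
        return ("clarify", [i for i, _ in ranked[:2]])  # ask a clarification question
    return ("classified", top)

print(handle("Where can I book a hotel?"))    # flagged out-of-scope
print(handle("What do I pay for my bags?"))   # triggers clarification between intents
```

Replacing the keyword scorer with sentence-embedding cosine similarity, and the margin test with an entropy or conformal check, turns this sketch into the production pattern described above.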
Relevance to Airline Applications
Airline customer interactions are particularly prone to ambiguity, context dependency, and off-topic requests. Integrating clarification and OOS detection into the classification pipeline is essential to:
- Handle ill-specified user inputs, especially from novice or multilingual users.
- Avoid inappropriate automated responses to out-of-domain requests.
- Enhance transparency and reliability, especially during operational disruptions or service recovery.
Such integrated systems are more aligned with real-world airline needs, where robust understanding and reliable decision-making under uncertainty are critical.
4. Intent Classification Evaluation
4.1 Attributes for Comparing Evaluation Methods
Attribute Definitions
Uncertainty Assessment
- Measures the evaluation method's ability to assess how well models quantify prediction uncertainty and confidence
- Evaluates whether the method can determine if confidence scores accurately reflect actual prediction reliability
- Critical for airline operations where understanding prediction uncertainty prevents overconfident automated decisions and enables appropriate escalation to human agents
Reproducibility
- Ability to obtain identical evaluation results across multiple runs with consistent configuration and random seed control
- Ensures performance improvements/regressions are genuine rather than measurement artifacts or statistical noise
- Essential for regulatory compliance, audit trail maintenance, and trustworthy regression testing in safety-critical airline systems
Out-of-Scope Detection Performance
- Evaluates system's ability to identify and reject queries outside the supported intent taxonomy using specialized metrics like AU-IOC
- Prevents misclassification of hotel bookings, rental cars, and unrelated requests into airline intent categories
- Critical for maintaining service boundaries, operational safety, and preventing inappropriate automated responses in production chatbots
Latency Profiling
- Measures inference time distribution under realistic passenger interaction loads, including mean and P95 latency metrics
- Production systems typically target sub-200ms response times, with P95 latency monitored to catch the tail-latency degradation that harms user experience
- Essential for real-time passenger interactions during check-in, boarding, operational disruptions, and mobile app responsiveness
Multilingual Consistency
- Assesses intent classification accuracy equivalence across multiple languages for semantically identical passenger requests
- Validates that "¿Puedo cambiar mi vuelo?" maps to the same intent classification as "Can I change my flight?"
- Critical for international airlines serving diverse passenger populations across 50+ languages and cultural contexts
Drift Detection & Monitoring
- Systematic evaluation of model performance degradation over time using statistical methods and machine learning approaches
- Monitors feature distribution shifts, prediction accuracy decay, and concept drift in passenger language patterns
- Prevents silent performance deterioration and enables proactive model retraining before service quality impact
Multi-Intent Detection Capability
- Evaluates system's ability to identify and classify multiple intents within single passenger utterances using multi-label metrics
- Handles complex requests spanning booking modifications, service preferences, and operational concerns simultaneously
- Essential for natural conversation flow and reducing clarification overhead in airline customer interactions
Robustness Assessment
- Tests classification performance under adversarial conditions including noise injection, spelling errors, and grammatical inconsistencies
- Evaluates resilience against Fast Gradient Sign Method, character-level perturbations, and environmental signal degradation
- Critical for handling real-world passenger input variations and maintaining service quality across diverse interaction conditions
4.2 Evaluation Method Importance Table
Attribute | Why it matters for airline-grade evaluation | References |
---|---|---|
Uncertainty Assessment | Calibrated confidence estimates (e.g., temperature scaling, conformal prediction) reduce over-confident misclassifications, enable safe hand-offs for high-risk intents (visa denial, dangerous goods), and provide auditable KPIs that aviation regulators expect. | Uncertainty Assessment Evolution Temperature Scaling EMNLP 2020 NAACL 2024 |
Reproducibility | Ensures that evaluation results are identical across reruns with fixed seeds, documented code, data, and environment—vital for regulatory audits, safety-critical regression testing, and reliable A/B gating of new airline-chatbot models. | Reproducibility Study ML Reproducibility Crisis ACL 2023 |
Out-of-Scope Detection Performance | Detecting and rejecting queries that fall outside the airline's supported intent taxonomy (e.g., hotel bookings, car rentals, hostile/irrelevant requests) prevents the bot from giving misleading answers, breaching safety regulations, or triggering the wrong operational workflow. Robust OOS evaluation ensures the system maintains clear service boundaries and escalates unknown intents to human agents. | OOS Detection Study EMNLP 2024 ACL 2019 |
Latency Profiling | Sub-200ms response times required for real-time passenger interactions during check-in, boarding, and operational disruptions; P95 latency monitoring prevents service degradation | Inference Optimization AWS Edge Inference EMNLP Industry 2024 |
Multilingual Consistency | International airlines serve 50+ languages; evaluation must ensure equivalent performance across cultural contexts and linguistic variations without bias | ACL 2023 Multilingual NLP ICML 2020 |
Drift Detection & Monitoring | Passenger language patterns and service demands evolve continuously; systematic monitoring prevents silent performance degradation over operational periods | Distribution Shifts AI Monitoring Model Degradation |
Multi-Intent Detection Capability | Airline passengers frequently express compound requests (booking + special assistance + meal preferences); accurate multi-intent evaluation prevents conversation breakdown | EMNLP 2024 (1) EMNLP 2024 (2) EMNLP 2020 |
Robustness Assessment | Evaluation must account for noisy input conditions (background noise, spelling errors, informal language) common in real passenger interactions | Microsoft Research ACL 2020 |
Method Definitions
Conformal Prediction Evaluation
- Provides theoretical guarantees for uncertainty quantification using prediction sets with mathematically proven coverage probabilities
- Constructs calibrated confidence intervals that contain the true intent class with predetermined confidence levels (90%, 95%, 99%)
- Enables adaptive decision-making through prediction set size analysis and rejection option implementation for ambiguous queries
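A minimal sketch of split conformal prediction for intent classification follows, with synthetic softmax scores standing in for a real model. The guarantee is that, under exchangeability, the prediction set contains the true label with probability at least 1 − α:

```python
import math
import random

random.seed(0)
N_CLASSES, ALPHA = 5, 0.1

def synthetic_example():
    """Simulate a softmax output and a label drawn from it (stand-in for a model)."""
    logits = [random.gauss(0, 1) for _ in range(N_CLASSES)]
    logits[random.randrange(N_CLASSES)] += 2.0          # one intent tends to dominate
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    label = random.choices(range(N_CLASSES), weights=probs)[0]
    return probs, label

# Calibration: nonconformity score = 1 - probability assigned to the true class.
cal = [synthetic_example() for _ in range(2000)]
scores = sorted(1 - probs[y] for probs, y in cal)
k = math.ceil((len(scores) + 1) * (1 - ALPHA))          # conformal quantile index
qhat = scores[min(k, len(scores)) - 1]

def prediction_set(probs):
    """All intents whose nonconformity score is within the calibrated threshold."""
    return {c for c, p in enumerate(probs) if 1 - p <= qhat}

# Empirical coverage on fresh data should land near 1 - ALPHA (here, 90%).
test_data = [synthetic_example() for _ in range(2000)]
coverage = sum(y in prediction_set(probs) for probs, y in test_data) / len(test_data)
print(f"empirical coverage: {coverage:.3f}")
```

Large prediction sets signal ambiguous queries and can trigger the rejection option; a singleton set at 95% confidence can be safely automated.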
Out-of-Scope Detection Evaluation
- Evaluates system's ability to reject queries outside supported intent taxonomy using dedicated benchmark datasets with explicit OOS samples
- Measures AU-IOC (Area Under In-scope and Out-of-scope Characteristic Curve) for comprehensive dual-performance assessment of classification accuracy and rejection capability
- Combines threshold-based rejection with representation similarity analysis to prevent misclassification of non-domain queries into operational intent categories
Latency Profiling Protocol
- Measures inference time distribution under realistic passenger interaction loads with comprehensive P95, P99, and mean latency tracking
- Evaluates Time to First Token (TTFT) < 200ms and total response latency < 500ms requirements for maintaining conversational flow during peak operations
- Implements stress testing scenarios including concurrent user simulation, batch processing optimization, and hardware-specific performance profiling
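The P95/P99 tracking described above reduces to computing order statistics over recorded per-request latencies. A minimal sketch with simulated samples (the log-normal shape mimics the heavy right tail typical of production inference):

```python
import random

random.seed(42)

def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) over a list of latency samples."""
    ordered = sorted(samples)
    idx = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[idx]

# Simulated per-request latencies in milliseconds (illustrative distribution).
latencies = [random.lognormvariate(4.5, 0.4) for _ in range(10_000)]

mean_ms = sum(latencies) / len(latencies)
p95, p99 = percentile(latencies, 95), percentile(latencies, 99)
print(f"mean={mean_ms:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

Reporting only the mean would hide the tail: a system whose mean sits comfortably under 200ms can still violate the P95 budget during peak check-in load.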
Multilingual Consistency Assessment
- Tests intent classification accuracy equivalence across multiple languages using professionally translated content and cross-lingual transfer protocols
- Evaluates zero-shot performance degradation patterns and measures semantic consistency preservation across typologically diverse language families
- Assesses cultural adaptation effectiveness through synthetic persona simulation and region-specific linguistic variation testing
Drift Detection Methodology
- Monitors temporal performance degradation using statistical distribution shift detection including Kolmogorov-Smirnov testing and Maximum Mean Discrepancy analysis
- Implements sliding window data acquisition with adaptive threshold adjustment for proactive model degradation detection
- Measures Mean Time to Detection (MTD) and False Discovery Rate (FDR) to optimize alert sensitivity and minimize false positive interventions
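The Kolmogorov-Smirnov check mentioned above can be sketched without external dependencies. The 0.15 alarm threshold here is illustrative only; in practice it would come from KS critical-value tables or a permutation test:

```python
import random

random.seed(7)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, t):
        # Fraction of xs <= t; a linear scan is fine for a sketch.
        return sum(1 for x in xs if x <= t) / len(xs)

    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in a + b)

reference = [random.gauss(0.0, 1.0) for _ in range(500)]  # training-time feature
drifted = [random.gauss(0.8, 1.0) for _ in range(500)]    # shifted production data
same = [random.gauss(0.0, 1.0) for _ in range(500)]       # undrifted production data

print(ks_statistic(reference, drifted))  # large gap → raise drift alarm
print(ks_statistic(reference, same))     # small gap → no alarm
```

Running this comparison over a sliding window of recent production embeddings against a frozen training-time reference is the core of the monitoring loop described above.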
Multi-Intent Detection Evaluation
- Assesses multi-label classification performance using macro/micro F1-scores, Hamming loss, and label-wise accuracy for complex passenger queries
- Evaluates joint intent-slot parsing accuracy with complete semantic frame correctness validation for compound service requests
- Tests hierarchical intent taxonomy navigation and co-occurrence pattern recognition for realistic multi-service passenger interactions
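The macro/micro F1 computation above can be sketched directly from per-label counts; the labels and predictions below are illustrative:

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there are no positives at all."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def multilabel_f1(gold, pred, labels):
    """Micro- and macro-averaged F1 for multi-label intent predictions."""
    counts = {l: [0, 0, 0] for l in labels}  # tp, fp, fn per label
    for g, p in zip(gold, pred):
        for l in labels:
            if l in p and l in g:
                counts[l][0] += 1
            elif l in p:
                counts[l][1] += 1
            elif l in g:
                counts[l][2] += 1
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

labels = ["flight_change", "seat_preference", "meal_request"]
gold = [{"flight_change", "seat_preference"}, {"meal_request"}]
pred = [{"flight_change"}, {"meal_request", "seat_preference"}]
print(multilabel_f1(gold, pred, labels))
```

Micro-averaging weights labels by frequency, which favors common intents such as booking changes; macro-averaging exposes failures on rare but operationally important intents such as special-assistance requests, so both should be reported.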
Robustness Assessment Framework
- Applies systematic adversarial testing using character-level, word-level, and sentence-level perturbations with human validation protocols
- Implements behavioral testing through Minimum Functionality Tests, Invariance Tests, and Directional Expectation Tests for natural language variations
- Measures Attack Success Rate (ASR) under realistic input corruption scenarios including speech-to-text errors, typos, and grammatical inconsistencies
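The perturbation-and-ASR loop described above can be sketched as follows; the adjacent-character-swap noise and the keyword classifier are simplistic stand-ins for real attack suites and models:

```python
import random

random.seed(1)

def perturb(text, rate=0.1):
    """Character-level noise: randomly swap adjacent letters (typo simulation)."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def attack_success_rate(classify, queries, n_trials=20):
    """Fraction of perturbation trials that flip the classifier's prediction."""
    flips = trials = 0
    for q in queries:
        clean = classify(q)
        for _ in range(n_trials):
            trials += 1
            flips += classify(perturb(q)) != clean
    return flips / trials

# Toy keyword classifier standing in for the real model under test.
classify = lambda q: "baggage" if "bag" in q.lower() else "other"
asr = attack_success_rate(classify, ["Where are my bags?", "Seat upgrade please"])
print(f"ASR: {asr:.2f}")
```

Swapping in word-level substitutions or simulated speech-to-text errors for `perturb`, and the production classifier for `classify`, yields the behavioral test battery described above without changing the harness.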
Reproducibility Framework Assessment
- Implements standardized experimental protocols with comprehensive documentation requirements following established research reproducibility checklists
- Requires multi-seed evaluation with statistical significance testing and complete artifact version control for experiment replication
- Utilizes containerized environments and automated pipeline management to ensure consistent results across different computational platforms and research teams
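In practice the multi-seed protocol above reduces to two checks, sketched here with a stand-in "experiment" in place of a real evaluation run:

```python
import random
import statistics

def run_experiment(seed):
    """Stand-in for one evaluation run; identical seeds must yield identical scores."""
    rng = random.Random(seed)
    return statistics.mean(rng.random() for _ in range(100))  # placeholder "accuracy"

# Check 1: same seed, same pipeline → bit-identical result.
assert run_experiment(0) == run_experiment(0)

# Check 2: multi-seed evaluation → report mean ± std, not a single-run number.
scores = [run_experiment(s) for s in range(5)]
print(f"mean={statistics.mean(scores):.4f} std={statistics.stdev(scores):.4f}")
```

Containerization and artifact versioning extend the same guarantee from the random seed to the full environment: any collaborator re-running the pinned configuration should reproduce the reported mean and standard deviation exactly.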
4.3 Method Comparison Matrix
Method | Uncertainty Assessment | Reproducibility | Out-of-Scope Detection | Latency Profiling | Multilingual Consistency | Drift Detection | Multi-Intent Detection | Robustness Assessment | References |
---|---|---|---|---|---|---|---|---|---|
Conformal Prediction Evaluation | High | High | High | Medium-High | High | Low | High | High | NAACL 2024 arXiv 2503.15850 arXiv 2309.06240 |
Out-of-Scope Detection Evaluation | Low | High | High | Medium | Low | Low | Low | Medium-High | EMNLP 2019 Springer 2023 arXiv 2403.05640 |
Latency Profiling Protocol | Low | High | Low | High | Medium | Low | Medium | Low | Databricks Blog DZone Article Latency Metrics |
Multilingual Consistency Assessment | Low | High | Low | Medium | High | Low | Medium-High | High | ACL 2023 ResearchGate EMNLP 2021 |
Drift Detection Methodology | Low | High | Low | Low | Low | High | Low | Low | Microsoft Docs Evidently AI arXiv 2505.17043 |
Multi-Intent Detection Evaluation | Low | High | Low | Medium | Medium | Low | High | Low | EMNLP 2023 Science Direct FewNLU |
Robustness Assessment Framework | Low | High | Low | Medium | High | Low | Low | High | GitHub Repo EMNLP 2023 CMU Blog |
Reproducibility Framework Assessment | Low | High | Low | Low | Medium | Medium | Medium | Low | arXiv 2407.10239 Nature arXiv 2406.14325 |
4.4 Method Selection & Justification
Based on the comparative analysis of evaluation methods and the critical requirements of airline operations, this study adopts a focused evaluation strategy targeting four essential capabilities: uncertainty quantification for safety-critical decision-making, out-of-scope detection for operational boundary management, latency optimization for real-time passenger interactions, and multi-intent processing for complex service requests. These selected methods address the unique challenges of airline chatbot systems including regulatory compliance for automated decisions, prevention of dangerous query misrouting, operational efficiency during peak travel periods, and comprehensive understanding of compound passenger needs across booking modifications, service preferences, and support requests.
Method | Rationale |
---|---|
Conformal Prediction Evaluation | Provides theoretical guarantees for uncertainty quantification essential in safety-critical airline operations where overconfident misclassifications can trigger inappropriate automated responses. Enables mathematically proven confidence intervals for human handoff decisions during operational disruptions and emergency scenarios. Critical for regulatory compliance requiring auditable decision-making processes in aviation safety systems. |
Out-of-Scope Detection Evaluation | Essential for ensuring safety and reliability when chatbots are exposed to open-ended customer inputs beyond airline scope. Prevents dangerous misrouting of non-aviation queries (hotel bookings, rental cars, weather) into operational flight systems. Critical for maintaining service boundaries and preventing automated responses to queries outside airline operational scope, directly impacting passenger safety. |
Latency Profiling Protocol | Reflects constraints in production environments such as check-in kiosks, mobile apps, and real-time customer service interactions. Models must demonstrate system responsiveness under realistic passenger interaction loads to maintain operational efficiency during peak travel periods. Essential for validating deployment viability in time-critical airline operations where response delays can cascade into operational disruptions. |
Multi-Intent Detection Evaluation | Addresses complex passenger queries that span multiple services (booking modifications + seat preferences + meal requests) common in airline customer interactions. Validates system capability to handle compound service requests through multi-label classification performance assessment. Critical for reducing clarification overhead and improving passenger experience during complex transaction scenarios. |
5. Final Recommendations & Next Steps
5.1. Summary of Comparative Findings
The analysis conducted in this study shows that intent classification techniques exhibit diverse performance profiles depending on their methodological approach and the context of application. Prompting-based methods, such as Few-Shot In-Context Learning, are particularly useful during early development stages or prototyping, as they require no training and allow for rapid iteration. Embedding-based techniques, such as Dual Encoder, stand out for their computational efficiency, making them suitable for systems with latency or infrastructure constraints. Hybrid approaches, such as Adaptive ICL + CoT Hybrid Routing, offer a favorable balance between accuracy and inference time by combining lightweight classifiers with high-capacity LLMs. Generative methods, including IntentGPT and Gen-PINT, enable advanced tasks such as automatic intent discovery and semantic adaptation, although they often entail greater operational requirements.
These findings suggest that the selection of a technique should not rely on a single performance metric. Instead, it should account for the specific operational context, system constraints, and functional objectives relevant to each phase in the development and deployment of a conversational AI system.
5.2. Technique Recommendations by Use Case
The table below summarizes recommended techniques for common use cases in airline conversational systems:
Use Case Scenario | Recommended Technique | Primary Justification |
---|---|---|
Initial deployment / prototyping | Few-Shot In-Context Learning (ICL) | Requires no training, enables rapid iteration with minimal labeled data |
Production scaling | Adaptive ICL + CoT Hybrid Routing | High accuracy with reduced latency; optimal trade-off for real-time environments |
Emerging or evolving intents | IntentGPT | Unsupervised intent discovery; useful when services or policies change dynamically |
Domain-specific adaptation | PEFT (IA3 Adapters) | Efficient fine-tuning; low resource cost; allows for rapid domain specialization |
These recommendations are grounded in the comparative analysis in Section 3.3 and are aligned with the technical attributes and performance metrics presented in Sections 4.1 and 4.2.
5.3. Implementation Considerations
Before selecting or deploying an intent classification technique, the following practical aspects should be taken into account:
- Infrastructure availability: some techniques require access to large language models or GPU-based inference environments.
- Real-time response requirements: use cases such as automated check-in or airport support require low inference latency.
- Out-of-scope (OOS) robustness: it is critical to detect irrelevant or unexpected inputs to avoid generating inappropriate responses.
- Maintainability and adaptability: systems must be capable of rapid updates in response to changes in regulations, services, or query patterns.
- Multilingual support and cultural variation: airline systems must deliver consistent classification across different languages and regions.
5.4. Future Work
This study provides a solid foundation for evaluating intent classification techniques in complex operational domains such as aviation. However, several lines of work are proposed to further strengthen the applicability and robustness of such systems:
- Development of a domain-specific dataset for airlines, including real passenger utterances, expert annotations, and ambiguous cases.
- Multilingual and cross-regional evaluation with native speakers, to validate semantic consistency and intent boundaries across cultural contexts.
- Controlled experimentation with clarification techniques, such as TO-GATE or LLM-based interactive clarification, to improve real-time ambiguity resolution.
- Design of domain-oriented benchmarks that combine multiple task types (classification, OOS detection, discovery, reasoning) and integrate evaluation metrics.
These steps will help advance toward more reliable, adaptive, and operationally aligned conversational systems for the airline industry.
6. Conclusion
This study presents a comprehensive analysis of intent classification techniques and evaluation methods, with a specific focus on their applicability to airline customer service systems. Through a detailed characterization of the airline domain, a structured comparison of 18 intent classification approaches, and an in-depth review of evaluation protocols, we highlight the strengths, limitations, and suitability of each technique under different operational scenarios.
Our findings show that no single method excels universally. Instead, successful deployment depends on aligning technique selection with specific contextual requirements—such as the availability of training data, latency constraints, or the need for out-of-scope handling and clarification. Prompting-based methods, hybrid architectures, and LLM-powered discovery systems each provide complementary strengths that can be leveraged at different stages of system development.
The evaluation strategy proposed in this study emphasizes few-shot adaptability, semantic robustness, and real-time responsiveness. These dimensions are critical to supporting the multilingual, dynamic, and operationally sensitive nature of airline applications.
Overall, this work provides actionable guidance for researchers and practitioners seeking to design or evaluate intent classification systems in complex real-world domains, offering both a theoretical foundation and practical pathways for scalable and reliable deployment.