Zero-shot learning (ZSL) is revolutionizing machine learning (ML) by enabling models to classify or predict outcomes for concepts they’ve never encountered before, marking a departure from traditional approaches that require extensive labeled data. This guide explores how ZSL works, its applications, how it compares to few-shot learning (FSL), and its challenges and future potential.
Table of contents
- What is zero-shot learning?
- How zero-shot learning works
- Zero-shot learning vs. few-shot learning and one-shot learning
- Zero-shot learning vs. zero-shot prompting
- Applications of zero-shot learning
- Benefits of zero-shot learning
- Challenges of zero-shot learning
What is zero-shot learning (ZSL)?
ZSL allows machine learning models to make predictions about unseen categories without requiring specific training examples for those categories. Unlike traditional supervised learning models, which rely heavily on labeled datasets where every category must be explicitly represented, ZSL leverages auxiliary information—such as semantic embeddings or attributes—to generalize knowledge.
For instance, a supervised learning model trained to classify animals would need labeled examples of “dog,” “cat,” and “zebra” to recognize them, whereas a ZSL model trained on animal images could identify a zebra based on descriptive attributes like “striped” and “horse-like,” even without exposure to prior examples. This makes ZSL particularly useful for tasks involving large, unlabeled datasets or situations where collecting labeled data is impractical. Its applications span computer vision, natural language processing (NLP), robotics, and more.
How zero-shot learning works
ZSL models are first pre-trained on a large labeled dataset to create a knowledge base. The model extracts auxiliary information from the labeled data, including features such as color, shape, and sentiment.
It then uses those features to map semantic relationships between seen and unseen categories (or classes) of data. This process, called knowledge transfer, allows a ZSL model to understand, for example, that a duck and a goose are related because they both have beaks, feathers, and webbed feet.
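As a rough illustration of how shared features relate categories, the sketch below scores the overlap between attribute sets using Jaccard similarity. The attribute sets and the `jaccard` helper are invented for this example; a real ZSL model learns such features instead of reading them from a hand-written table.

```python
# Toy attribute sets, hand-written for illustration only.
ATTRIBUTES = {
    "duck": {"beak", "feathers", "webbed feet", "swims"},
    "goose": {"beak", "feathers", "webbed feet", "long neck"},
    "cat": {"fur", "whiskers", "claws", "four legs"},
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


duck_goose = jaccard(ATTRIBUTES["duck"], ATTRIBUTES["goose"])
duck_cat = jaccard(ATTRIBUTES["duck"], ATTRIBUTES["cat"])
print(f"duck~goose: {duck_goose:.2f}, duck~cat: {duck_cat:.2f}")
```

Here the duck and the goose share three of five distinct attributes, while the duck and the cat share none, so the first pair scores as far more closely related.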
The most common techniques are attribute-based ZSL, semantic embedding–based ZSL, and generalized ZSL. Below, we examine each.
Attribute-based zero-shot learning
Attribute-based ZSL models are most often used for computer vision tasks. They work by training on human-labeled datasets of images. The labels consist of attributes the person labeling considers useful. For each image, the person applies a text description of its features, such as color, shape, or other characteristics.
For example, in image classification, attributes like “gray,” “four-legged,” and “furry” might describe different categories. Through training, the model learns to associate these attributes with specific categories, such as “dog.”
When the model encounters an unseen category (for example, a wolf), it can recognize that the image resembles, but does not match, the classes seen in training. It then infers the class by matching the attributes it detects against those shared with learned categories, even though the “wolf” label wasn’t part of the training data. These human-interpretable attributes improve explainability and allow the model to generalize to new classes.
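The matching step can be sketched as a nearest-signature lookup. This is a minimal sketch, not a full attribute-based ZSL system: the attribute list, the class signatures, and the `predicted_scores` vector (standing in for the output of a trained attribute predictor) are all assumptions made up for illustration.

```python
# Attribute order used by every vector below.
ATTRIBUTE_NAMES = ["gray", "four-legged", "howls", "has feathers", "domesticated"]

# Class signatures: 1 means the class has the attribute. All values are
# hand-written assumptions; "wolf" has a signature but no training images.
CLASS_SIGNATURES = {
    "dog":    [0, 1, 0, 0, 1],  # seen during training
    "pigeon": [1, 0, 0, 1, 0],  # seen during training
    "wolf":   [1, 1, 1, 0, 0],  # unseen: described by attributes only
}


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(attribute_scores, signatures):
    """Return the class whose signature is closest to the predicted scores."""
    return min(
        signatures,
        key=lambda name: squared_distance(attribute_scores, signatures[name]),
    )


# Hypothetical output of an attribute predictor for one image of a wolf.
predicted_scores = [0.9, 0.95, 0.7, 0.05, 0.1]
prediction = classify(predicted_scores, CLASS_SIGNATURES)
print(prediction)  # wolf
```

Because the prediction is made through named attributes, it is easy to inspect why the model chose “wolf” over “dog”: the image scored high on “gray” and “howls” and low on “domesticated.”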
Semantic embedding–based zero-shot learning
This approach is similar to attribute-based ZSL, but instead of humans creating attribute labels for training, the model generates what are known as semantic embeddings of the training data. These semantic embeddings are encoded as vectors—mathematical ways of representing real-world objects—and then mapped in an embedding space.
The embedding space allows the model to organize its contextual knowledge by grouping related information closer together. For example, “dog” and “wolf” categories will be closer to each other in an embedding space than “dog” and “bird” categories will be, due to shared semantic features. This is similar to how large language models (LLMs) use semantic embeddings to cluster synonyms because of their similar meanings.
When the model is given unseen categories (classes that had no examples in its training data), it projects vectors for those new classes into the same embedding space and measures the distance between them and the vectors for classes it already knows about. This gives the model context for the unseen examples and allows it to infer semantic relationships between known and unknown classes.
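A small sketch of the distance measurement, assuming cosine similarity as the closeness measure. The three-dimensional vectors are invented for illustration; real embeddings are produced by a trained encoder and typically have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 means the same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy class embeddings (assumed values, not learned).
CLASS_EMBEDDINGS = {
    "dog":  [0.90, 0.80, 0.10],
    "wolf": [0.85, 0.90, 0.15],  # unseen class, embedded from its description
    "bird": [0.10, 0.20, 0.95],
}


def nearest_class(vector, class_embeddings):
    """Return the class whose embedding is most similar to the given vector."""
    return max(class_embeddings, key=lambda name: cosine(vector, class_embeddings[name]))


dog_wolf = cosine(CLASS_EMBEDDINGS["dog"], CLASS_EMBEDDINGS["wolf"])
dog_bird = cosine(CLASS_EMBEDDINGS["dog"], CLASS_EMBEDDINGS["bird"])
print(f"dog~wolf: {dog_wolf:.3f}, dog~bird: {dog_bird:.3f}")

# Hypothetical embedding of a new image projected into the same space.
query = [0.80, 0.95, 0.20]
print(nearest_class(query, CLASS_EMBEDDINGS))  # wolf
```

The “dog” and “wolf” vectors point in nearly the same direction, so their similarity is much higher than that of “dog” and “bird,” and the query lands closest to the unseen “wolf” class.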
Generalized zero-shot learning
Most zero-shot learning techniques train the model on one kind of data and then apply it to a different but related problem. That’s the idea of “zero shots”: the model doesn’t get exposed to any examples of the new classes before it encounters them in the wild.
However, real-world applications aren’t always so black and white. The dataset you want your ZSL model to classify might contain things from known classes alongside new classes.
The problem is that traditional ZSL models can show a strong bias toward mislabeling new classes as classes they already know when new and familiar data are mixed. So it’s useful to have a ZSL model that can generalize to a dataset that also contains classes seen in training.
In generalized ZSL, the model takes an additional step to reduce bias toward known categories. In one common approach, before it performs classification, it first decides whether the object in question belongs to a known or unknown class.
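One simple way to picture the known-versus-unknown decision is a similarity threshold: if the input is not close enough to any seen class, it is routed to the unseen classes instead. The sketch below assumes this gating strategy, a hypothetical `THRESHOLD` of 0.9, and invented embeddings; it is not the only way generalized ZSL is implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy embeddings (assumed values). Seen classes had training examples;
# unseen classes are known only through their descriptions.
SEEN = {"dog": [1.0, 0.2, 0.0], "bird": [0.0, 0.2, 1.0]}
UNSEEN = {"wolf": [0.7, 0.7, 0.0], "bat": [0.3, 0.6, 0.7]}
THRESHOLD = 0.9  # hypothetical cutoff; in practice tuned on validation data


def most_similar(vector, classes):
    """Return (class name, similarity) for the closest class."""
    name = max(classes, key=lambda c: cosine(vector, classes[c]))
    return name, cosine(vector, classes[name])


def gated_predict(vector):
    """Step 1: decide seen vs. unseen. Step 2: classify within that group."""
    seen_name, seen_score = most_similar(vector, SEEN)
    if seen_score >= THRESHOLD:
        return "seen", seen_name
    unseen_name, _ = most_similar(vector, UNSEEN)
    return "unseen", unseen_name


print(gated_predict([0.95, 0.25, 0.05]))  # ('seen', 'dog')
print(gated_predict([0.60, 0.80, 0.05]))  # ('unseen', 'wolf')
```

Without the gate, a plain nearest-class search restricted to seen classes would label the second input “dog,” which is the bias the gating step is meant to reduce.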
Zero-shot learning vs. few-shot learning and one-shot learning
Like ZSL, few-shot learning (FSL) and one-shot learning (OSL) enable deep learning models to perform new tasks with minimal or no new data. All three approaches rely on mapping the relationships between features of known examples to infer patterns in unknown examples. Their primary goal is to create models that are effective in real-world scenarios where data is scarce or where there’s no time to train a new model for a specific task.
The key difference lies in how they handle new data:
- FSL involves providing the model with a small number of labeled examples for the new class it needs to identify.
- OSL is a more specific case, where the model is shown just one labeled example of the new class.
Both FSL and OSL require an additional training step compared to ZSL, which increases the time needed to learn new tasks. However, this extra training equips them to handle tasks that deviate significantly from the model’s pre-trained knowledge, making them more adaptable in practice.
While ZSL is often seen as “flexible” because it doesn’t require labeled examples for new tasks, this flexibility is largely theoretical. In real-world applications, ZSL methods can struggle with:
- Tasks involving a mix of seen and unseen examples (e.g., generalized ZSL scenarios)
- Tasks that are substantially different from the model’s training data
ZSL models are also sensitive to factors like how datasets are split during pre-training and evaluation, which can affect performance. On the other hand, FSL and OSL offer more practical flexibility for task adaptation by incorporating new examples into the learning process, allowing them to perform better in diverse scenarios.
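To make the contrast concrete, the sketch below uses a nearest-prototype classifier, which is one simple way (among several) to incorporate a handful of labeled examples: the few-shot prototype is the mean of the example embeddings, and the one-shot prototype is a single example. The two-dimensional embeddings and the helper names are assumptions for illustration only.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, one dimension at a time."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(vector, prototypes):
    """Return the class whose prototype is closest to the vector."""
    return min(prototypes, key=lambda name: squared_distance(vector, prototypes[name]))


# Prototypes for classes the model already knows (assumed values).
known = {"dog": [1.0, 0.2], "bird": [0.0, 1.0]}

# Three labeled examples of the new class "wolf" (hypothetical embeddings).
support = [[0.7, 0.6], [0.8, 0.7], [0.6, 0.8]]

few_shot = dict(known, wolf=mean_vector(support))  # uses all three examples
one_shot = dict(known, wolf=support[0])            # uses a single example

query = [0.75, 0.65]
print(nearest(query, known))     # dog: no wolf prototype is available
print(nearest(query, few_shot))  # wolf
print(nearest(query, one_shot))  # wolf
```

With no labeled wolf examples and no description of the class, the classifier can only answer with a class it already has; adding even one example gives it a new reference point.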
Zero-shot learning vs. zero-shot prompting
ZSL is an approach to building and training models so that they can handle classes they never saw during training. In contrast, zero-shot prompting refers to asking an LLM like ChatGPT or Claude to generate an output without providing specific examples in the prompt to guide its response. In both cases, the model performs a task without explicit examples of what the task involves.
In zero-shot prompting, you don’t supply the model with any examples related to the task. Instead, you rely on the LLM’s pre-trained knowledge to infer and execute the task.
For instance, you could input the text of a restaurant review and ask the LLM to classify it as positive, neutral, or negative, without giving it any sample reviews to use as a reference. The LLM would draw on its pre-training to determine the appropriate label for the review.
While zero-shot learning and zero-shot prompting share the concept of performing tasks without examples, there is a key distinction:
- Zero-shot learning is a way of designing and training models so they can handle such tasks.
- Zero-shot prompting is a technique for interacting with an already-trained LLM; it does not change how the model is built or trained.
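The restaurant review example can be sketched as plain prompt construction. The wording of `INSTRUCTION` and the function names are assumptions, and sending the prompt to an LLM is deliberately left out; the point is only that a zero-shot prompt contains no worked examples, while a few-shot prompt does.

```python
INSTRUCTION = (
    "Classify the sentiment of this restaurant review as "
    "positive, neutral, or negative."
)


def few_shot_prompt(review, examples):
    """Build a prompt with zero or more (review, label) examples before the query."""
    shots = "".join(
        f"Review: {text}\nSentiment: {label}\n\n" for text, label in examples
    )
    return f"{INSTRUCTION}\n\n{shots}Review: {review}\nSentiment:"


def zero_shot_prompt(review):
    """A zero-shot prompt is the same template with no examples at all."""
    return few_shot_prompt(review, examples=[])


review = "The soup was cold, but the staff were friendly."
print(zero_shot_prompt(review))
print("---")
print(few_shot_prompt(review, [("Great pasta!", "positive"), ("Rude staff.", "negative")]))
```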
Applications of zero-shot learning
Because of its focus on helping deep learning models adapt to new tasks, ZSL has applications across many areas of ML, including computer vision, NLP, and robotics. It can be used in healthcare, sentiment analysis, customer service, document translation, and cybersecurity. For example:
- Sentiment analysis: When breaking news occurs, a zero-shot NLP model can perform sentiment analysis on public commentary to provide a nearly real-time look at the public’s reactions.
- Multilingual document processing: NLP zero-shot models trained to extract information from tax documents in English can perform the same extractions on tax documents in Spanish without additional training.
- Medical diagnostics: ZSL models have been used to identify X-rays of patients with COVID-19 without any visual examples. The identifications are based on textual descriptions, made by doctors working in the field, of what positive X-rays look like.
- More nuanced chatbots: ZSL NLP models can understand slang and idioms they haven’t encountered before during chats with people, allowing them to respond more meaningfully to questions they weren’t specifically trained to handle.
- Anomaly detection: ZSL can be used in cybersecurity to detect unusual patterns in network activity or label new kinds of hacking attacks as novel threats emerge.
Benefits of zero-shot learning
Traditional supervised learning approaches are often impractical for real-world applications, given the large datasets, training time, money, and computational resources they require. ZSL can mitigate some of those challenges. The benefits include reducing the costs associated with training a new model and coping with situations where data is scarce or not yet available.
Cost-effective development
Acquiring and curating the large labeled datasets required by supervised learning is expensive and time-consuming. Training a model on a high-quality labeled dataset can cost tens of thousands of dollars, in addition to the cost of servers, cloud computing space, and engineers.
ZSL shows promise in lowering the cost of ML projects by allowing institutions to repurpose models for new tasks without additional training. It also allows smaller entities or individuals to repurpose models built by others.
Solving problems with scarce data
The flexibility of ZSL makes it a good tool for situations where little data is available, or where data is still emerging. For example, it is useful for diagnosing new diseases when information is not yet widespread, or for disaster situations where information is evolving rapidly. ZSL is also useful for anomaly detection when there is too much data for human analysts to process.
Challenges of zero-shot learning
ZSL relies heavily on having high-quality training data during its pre-training phase to understand semantic relationships between categories well enough to generalize to new ones. Without high-quality data, ZSL can produce unreliable results that are sometimes difficult to evaluate.
Common issues that ZSL models face include trouble adapting to tasks that are dissimilar to those they were trained on, and problems with training data that cause them to rely too heavily on certain labels when predicting unseen classes.
Domain adaptation
ZSL models perform best when the new data comes from a domain that is not dramatically different from the one they were trained on. For example, a model trained on still photos will have difficulty classifying videos.
ZSL models rely on mapping auxiliary information from unknown data onto known data, so if the data sources are too different, the model has no way to generalize its knowledge to the new task.
The hubness problem
The hubness problem in ZSL occurs when a model assigns only a few labels to most of its predictions for unseen categories. It arises when a small number of points in the embedding space end up as the nearest neighbors of many other points, forming “hubs” that bias the model toward particular labels.
This can happen because of noise in the training data, class imbalance (too many examples of some kinds of data and not enough of others), or semantic embeddings that are not distinct enough.
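A simple way to check for this bias is to count how often each class ends up as the nearest neighbor across a batch of inputs; a heavily skewed count suggests a hub. The sketch below assumes squared Euclidean distance and uses invented two-dimensional points, with class "a" deliberately placed near the center so that it attracts most queries.

```python
from collections import Counter


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(vector, classes):
    """Return the class whose embedding is closest to the vector."""
    return min(classes, key=lambda name: squared_distance(vector, classes[name]))


# Toy class embeddings (assumed values). Class "a" sits near the middle.
CLASSES = {"a": [0.5, 0.5], "b": [1.0, 0.0], "c": [0.0, 1.0]}

# Hypothetical embedded inputs.
QUERIES = [[0.6, 0.4], [0.4, 0.6], [0.55, 0.5], [0.7, 0.35], [0.9, 0.1]]

# Start every class at zero so classes that are never chosen still appear.
counts = Counter({name: 0 for name in CLASSES})
counts.update(nearest(q, CLASSES) for q in QUERIES)

for name, count in counts.most_common():
    print(f"{name}: nearest neighbor for {count} of {len(QUERIES)} queries")
```

In this toy data, class "a" is chosen for four of the five queries and class "c" is never chosen, which is the kind of imbalance the hubness problem describes.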