Logistic regression is a cornerstone method in statistical analysis and machine learning (ML). This comprehensive guide will explain the basics of logistic regression and discuss various types, real-world applications, and the advantages and disadvantages of using this powerful technique.
Table of contents
- What is logistic regression?
- Types of logistic regression
- Logistic regression vs. linear regression
- How does logistic regression work?
- Applications of logistic regression in ML
- Advantages of logistic regression in ML
- Disadvantages of logistic regression in ML
What is logistic regression?
Logistic regression, also known as logit regression or the logit model, is a type of supervised learning algorithm used for classification tasks, especially for predicting the probability of a binary outcome (i.e., two possible classes). It is based on the statistical methods of the same name, which estimate the probability of a specific event occurring. For example, logistic regression can be used to predict the likelihood that an email is spam or that a customer will make a purchase or leave a website.
The model evaluates relevant properties of the event (called “predictor variables” or “features”). For example, if the event is “an email arrived,” relevant properties might include the source IP address, sender email address, or a content readability rating. It models the relationship between these predictors and the probability of the outcome using the logistic function, which has the following form:
f(x) = 1 / (1 + e^(-x))
This function outputs a value between 0 and 1, representing the estimated probability of the event (it might say, “This email is 80% likely to be spam”).
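As a minimal illustration, here is the logistic function written out in plain Python; the example scores are arbitrary and chosen only to show the range of outputs.

```python
import math

def logistic(x: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))

print(logistic(1.386))  # ~0.80, e.g., "this email is 80% likely to be spam"
print(logistic(0.0))    # 0.5, maximum uncertainty
print(logistic(-5.0))   # ~0.007, very unlikely
```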
Logistic regression is widely used in ML, particularly for binary classification tasks. The logistic function (a type of sigmoid function) is often used to convert the output of a binary classification model into a probability. Although logistic regression is simple, it serves as a foundational technique for more complex models, such as neural networks, where similar logistic functions are used to model probabilities. The term logit model refers to models that use the logit function to map input features to predicted probabilities.
Types of logistic regression
There are three main types of logistic regression: binary, multinomial, and ordinal.
Binary logistic regression
Also known as binary regression, this is the standard and most common form of logistic regression. When the term logistic regression is used without qualifiers, it usually refers to this type. The name “binary” comes from the fact that it considers exactly two outcomes; it can be thought of as answering yes or no questions. Binary regression can handle more complicated questions if they are reframed as chains of yes or no, or binary, questions.
Example: Imagine calculating the odds of three mutually exclusive options: whether a client will churn (i.e., stop using the product), sign up for a free version of a service, or sign up for the paid premium version. A chained binary regression might solve this problem by answering the following chain of questions (sketched in code after the list):
- Will the client churn (yes or no)?
- If not, will the client sign up for the free service (yes or no)?
- If not, will the client sign up for the paid premium service (yes or no)?
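Below is a minimal sketch of that chaining with scikit-learn. The customer features and data are synthetic inventions for illustration; each binary model answers one question in the chain, and the probabilities multiply through.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))           # e.g., usage, tenure, spend, visits
outcome = rng.integers(0, 3, size=1000)  # 0 = churn, 1 = free, 2 = premium

# Question 1: will the client churn?
churn_model = LogisticRegression().fit(X, outcome == 0)

# Question 2, asked only of non-churners: free sign-up or not?
non_churn = outcome != 0
free_model = LogisticRegression().fit(X[non_churn], outcome[non_churn] == 1)

x_new = X[:1]
p_churn = churn_model.predict_proba(x_new)[0, 1]
p_free = (1 - p_churn) * free_model.predict_proba(x_new)[0, 1]
p_premium = 1 - p_churn - p_free  # the remainder, since options are exclusive
print(p_churn, p_free, p_premium)
```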
Multinomial logistic regression
Also known as multinomial regression, this form of logistic regression is an extension of binary regression that can answer questions with more than two potential outcomes. It avoids the need for chaining questions to solve more complex problems. Multinomial regression assumes that the odds being calculated do not have any interdependencies or order to them and that the set of options considered covers all possible outcomes.
Example: Multinomial regression works well when predicting what color a customer is likely to want for a car they’re buying from a list of available colors. However, it doesn’t work well for calculating odds where order matters, such as evaluating the colors green, yellow, and red as severity tags for a customer support issue, where the issue always starts as green and might be escalated to yellow and then red (with yellow always following green and red always following yellow).
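For comparison, here is a brief multinomial sketch with scikit-learn, which handles targets with more than two classes directly; the car-color data and feature names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))  # e.g., age, income, region score
colors = rng.choice(["red", "blue", "silver"], size=500)

model = LogisticRegression().fit(X, colors)
probs = model.predict_proba(X[:1])[0]
print(dict(zip(model.classes_, probs.round(3))))  # one probability per color
```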
Ordinal logistic regression
Also known as the proportional odds model, this specialized form of logistic regression is designed for ordinal values, that is, situations where the relative order among outcomes matters. Ordinal logistic regression is used when the outcomes have a natural order but the distances between the categories are not known.
Example: It might be used to calculate the odds of where a hotel guest is likely to rank their stay on a five-part scale: very bad, bad, neutral, good, and very good. The relative order is important—bad is always worse than neutral, and it’s important to note which direction reviews will move on the scale. When order matters, ordinal regression can quantify the relationships between the values whose odds are being calculated (e.g., it might detect that bad tends to show up half as often as neutral).
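One way to fit such a model in Python is statsmodels' OrderedModel, a proportional-odds implementation; the guest-rating data below is synthetic and the predictor names are assumptions for the sketch.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
X = pd.DataFrame({"cleanliness": rng.normal(size=300),
                  "service": rng.normal(size=300)})
levels = ["very bad", "bad", "neutral", "good", "very good"]
ratings = pd.Series(pd.Categorical(rng.choice(levels, size=300),
                                   categories=levels, ordered=True))

# distr="logit" gives the ordinal logistic (proportional odds) model
result = OrderedModel(ratings, X, distr="logit").fit(method="bfgs", disp=False)
print(result.params)  # one slope per predictor plus ordered thresholds
```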
Logistic regression vs. linear regression
Though different, logistic regression and linear regression often show up in similar contexts, as they are part of a larger, related mathematical toolset. Logistic regression generally calculates probabilities for discrete outcomes, while linear regression calculates expected values for continuous outcomes.
For example, if one were to try to predict the most likely temperature for a day in the future, a linear regression model would be a good tool for the job. Logistic regression models, by contrast, attempt to calculate or predict the odds for two or more options out of a fixed list of choices. Instead of predicting a specific temperature, a logistic regression model might give the odds that a particular day will fall into warm, comfortable, or cold temperature ranges.
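The contrast can be sketched in a few lines of scikit-learn; the weather features, temperature bands, and synthetic data below are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))                    # e.g., humidity, pressure
temps = 20 + 5 * X[:, 0] + rng.normal(size=400)  # continuous target
bands = np.where(temps < 18, "cold",
                 np.where(temps < 24, "comfortable", "warm"))

# Linear regression: predicts a specific temperature
linear = LinearRegression().fit(X, temps)
print(linear.predict(X[:1]))

# Logistic regression: predicts odds for each temperature band
logistic = LogisticRegression().fit(X, bands)
print(dict(zip(logistic.classes_, logistic.predict_proba(X[:1])[0].round(2))))
```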
Since they are built to address separate use cases, the two models make different assumptions about the statistical properties of the values they’re predicting and are implemented with different statistical tools. Logistic regression typically assumes a statistical distribution that applies to discrete values, such as a Bernoulli distribution, while linear regression might use a Gaussian distribution. Logistic regression often requires larger datasets to work effectively, while linear regression is usually more sensitive to influential outliers. Additionally, logistic regression makes assumptions about the structure of the odds it’s calculating, whereas linear regression makes assumptions about how errors are distributed in the training dataset.
These differences make each model more accurate for its intended use case: logistic regression for predicting categorical values, and linear regression for predicting continuous values. The two techniques are often confused with each other, though, because their outputs can be repurposed with straightforward mathematical transformations. A logistic regression model's output can, after a transformation, be applied to the same kinds of problems as a linear model's output, saving the cost of training two separate models, but it won't work as well, and the same is true in reverse.
How does logistic regression work?
As a kind of supervised learning algorithm, logistic regression depends on learning from well-annotated datasets. The datasets usually contain lists of feature representations matched with the expected model output for each.
To gain a clearer understanding of logistic regression, it’s essential to first grasp the following key terminology:
- Predictor variables: Properties or features considered by the logistic model when calculating odds for outcomes. For example, predictor variables for estimating a customer’s likelihood to buy a product could include demographic data and browsing history.
- Feature representation: A specific instance of predictor variables. For example, if the predictor variables are “postal code,” “state,” and “income bracket,” one feature representation might be “90210,” “California,” and “75K+/year.”
- Link function: The mathematical function at the core of a regression model that connects predictor variables to the odds of a particular outcome. The function will follow the pattern:
θ = b(μ)
where θ is the odds per category to predict, b is a specific function (usually an S-shaped function, called a sigmoid), and μ represents the predicted value (from a continuous range of values).
- Logistic function: The specific link function used in logistic regression, defined as
σ(x) = 1 / (1 + e^(-x))
It normalizes the output to a probability between 0 and 1, converting additive changes in the predictor variables into multiplicative changes in the odds of the outcome.
- Logit function: The inverse of the logistic function, converting probability values into log-odds, which helps explain how predictor variables relate to the odds of an outcome. It is defined as:
logit(p) = σ⁻¹(p) = ln(p / (1 - p))
For a given probability p, it performs the inverse of the logistic function.
- Log loss: Also known as cross-entropy loss or logistic loss, it measures the difference between predicted probabilities and actual outcomes in classification models. For binary classification, it is often called “binary cross-entropy.”
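The logistic function, logit function, and log loss defined above are compact enough to write out directly; this NumPy sketch uses arbitrary example values.

```python
import numpy as np

def logistic(x):          # link from real-valued scores to probabilities
    return 1 / (1 + np.exp(-x))

def logit(p):             # inverse of the logistic: probabilities to log-odds
    return np.log(p / (1 - p))

def log_loss(y_true, p):  # binary cross-entropy, averaged over examples
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

p = logistic(np.array([-2.0, 0.0, 1.386]))
print(p.round(3))                        # [0.119 0.5   0.8  ]
print(logit(p).round(3))                 # recovers the original scores
print(log_loss(np.array([0, 0, 1]), p))  # lower is a better fit
```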
At the core of a logistic regression process is the choice of link function. For binary logistic regression, that is always the logistic function. More complex regressions use related functions; the most popular, softmax, generalizes the logistic function to multiple classes and is used very frequently in ML models and for multinomial regression use cases.
During training, the system also depends on a loss function, which measures how well the regression is performing, or its fit. The system's objective can be thought of as reducing the distance between a predicted outcome, or odds, and what happens in the real world (this distance is sometimes called "the surprise"). For logistic regression, the loss function is the log loss described above.
Logistic regression models are typically fit by maximum-likelihood estimation, which in practice is carried out with standard optimization algorithms such as gradient descent or stochastic gradient descent.
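To make the training loop concrete, here is a from-scratch gradient-descent sketch on synthetic data; the learning rate, iteration count, and true weights are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5])
# Sample binary labels whose probabilities follow a logistic model
y = (rng.random(500) < 1 / (1 + np.exp(-(X @ true_w)))).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)     # gradient of mean log loss w.r.t. w
    grad_b = np.mean(p - y)             # gradient w.r.t. the intercept
    w -= lr * grad_w
    b -= lr * grad_b

print(w.round(2))  # should approach the true weights [1.5, -2.0, 0.5]
```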
Applications of logistic regression in ML
Logistic regression ML models are typically used for classification tasks, that is, to predict classes from partial information. Use cases span many domains, including finance, healthcare, epidemiology, and marketing. Two of the most well-known applications are email spam detection and medical diagnosis.
Email spam detection
Logistic regression can be an effective tool for classifying communication, such as identifying emails as spam, though more advanced methods are often used in complex cases. All of an email's properties, such as the sender address, destination, message text, and source IP address, can be treated as predictor variables and factored into the odds that a given email is spam. Email spam filter tools rapidly train and update binary logistic models on new email messages, which lets them quickly detect and react to new spam strategies.

More advanced spam filters preprocess emails to make them easier to identify as spam. For example, a script could compute the percentage of past emails from a sender's IP address that were marked as spam and add that figure as an extra predictor variable for the regression to take into account.
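A toy version of such a filter can be sketched with scikit-learn; the four example emails, their labels, and the bag-of-words featurization are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting notes attached",
          "free money click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn words into count features, then fit a binary logistic model
spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(emails, labels)
print(spam_filter.predict_proba(["claim your free prize"])[:, 1])  # P(spam)
```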
Medical diagnosis
Logistic regression models are commonly used to assist in diagnosing medical conditions such as diabetes and breast cancer. They learn from and build on analysis performed by doctors and medical researchers.
For an image-heavy diagnosis, such as cancer detection, medical researchers and professionals build datasets from various tests, imaging, and scans. This data is then processed and transformed into lists of textual assessments. An image might be analyzed for details such as pixel density and the number and mean radius of various clusters of pixels. These measurements are then included in a list of predictor variables alongside the results of other tests and evaluations. Logistic regression systems learn from them and predict whether a patient is likely to be diagnosed with cancer.

Besides predicting medical diagnoses with high accuracy, logistic regression systems can also indicate which test results are most relevant to their evaluations. This information can help prioritize tests for a new patient, speeding up the diagnosis process.
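One common way to surface that relevance is to inspect the fitted coefficients, as in this sketch on scikit-learn's built-in breast cancer dataset; the features are standardized first so their coefficient magnitudes are comparable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # put features on one scale
model = LogisticRegression(max_iter=1000).fit(X, data.target)

# Rank features by the magnitude of their standardized coefficients
ranked = sorted(zip(data.feature_names, model.coef_[0]),
                key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked[:5]:
    print(f"{name}: {coef:+.2f}")
```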
Advantages of logistic regression in ML
Logistic regression is often favored for its simplicity and interpretability, particularly in cases where results need to be produced relatively quickly and where insights into the data are important.
Fast, practical results
From a practical standpoint, logistic regression is straightforward to implement and easy to interpret. It performs reliably and provides valuable insights even when the data doesn’t perfectly align with assumptions or expectations. The underlying mathematical models are efficient and relatively simple to optimize, making logistic regression a robust and practical choice for many applications.
Useful insights into data properties
Theoretically, logistic regression excels in binary classification tasks and is generally very fast at classifying new data. It can help identify which variables are associated with the outcome of interest, providing insight into where further data analysis should focus. Logistic regression often delivers high accuracy in simple use cases; even when accuracy diminishes for certain datasets, it still provides meaningful insights into the relative importance of variables and the direction of their impact (positive or negative).
Disadvantages of logistic regression in ML
Logistic regression makes assumptions about the data it analyzes, which keeps the underlying algorithms fast and easy to understand at the cost of limiting their usefulness. Logistic models can't model continuous results or nonlinear relationships, can fail if the relationships in the data are too complex, and can overfit on large, complex datasets.
Limited to discrete outcomes
Logistic regression can only be used to predict discrete outcomes. If the problem requires continuous predictions, techniques like linear regression are more suitable.
Assumes linear relationships
The model assumes a linear relationship between the predictor variables and the log-odds of the outcome, which is rarely exactly the case in real-world data. This often necessitates additional preprocessing and adjustments to improve accuracy. Additionally, logistic regression assumes that classification decisions can be made using simple linear functions, which may not reflect the complexities of real-world scenarios. As a result, logistic regression is often an approximation that may require regular optimization and updates to stay relevant.
May fail to model complex relationships
If a set of predictor variables doesn't have a linear relationship to the calculated log-odds, or if the predictor variables aren't sufficiently independent of one another, logistic regression may fail to work altogether. It may also detect only a subset of linear relationships when the system mixes linear properties with more complex ones.
May overfit large datasets
For larger and more complex datasets, logistic regression is prone to overfitting, where the model becomes too closely aligned with the specific data it was trained on, capturing noise and minor details rather than general patterns. This can result in poor performance on new, unseen data. Techniques such as regularization can help mitigate overfitting, but careful consideration is needed when applying logistic regression to complex data.
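As a closing sketch, here is how L2 regularization looks with scikit-learn's C parameter (the inverse of regularization strength); the high-dimensional synthetic data is contrived to exaggerate the effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))  # many features, few samples: overfitting risk
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

for C in [100.0, 1.0, 0.01]:
    model = LogisticRegression(C=C).fit(X, y)  # smaller C = stronger penalty
    print(C, np.abs(model.coef_).sum().round(2))  # coefficients shrink as C drops
```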