Multivariate logistic regression

Multivariate logistic regression is a type of data analysis that predicts any number of^[1] outcomes based on multiple^[2]^[3] independent variables.^[4]^[5]^[6] It is based on the assumption that the natural logarithm of the odds has a linear relationship with independent variables.^[7]

Procedure

First, the baseline odds of a specific outcome compared to not having that outcome are calculated, giving a constant (intercept).^[8] Next, the independent variables are incorporated into the model, giving a regression coefficient (beta) and a "P" value for each independent variable.^[9] The "P" value determines how significantly the independent variable impacts the odds of having the outcome or not.^[10]

It is desirable to use as few variables as necessary,^[11] and to have at least 10 - 20 times as many observations as independent variables.^[12]

Formula

Multivariate logistic regression uses a formula similar to univariate logistic regression,^[13] but with multiple independent variables.

\pi \left(x\right)={\frac {e^{\beta _{0}+\beta _{1}X_{1}+\beta _{2}X_{2}+\dots +\beta _{v}X_{v}}}{1+e^{\beta _{0}+\beta _{1}X_{1}+\beta _{2}X_{2}+\dots +\beta _{v}X_{v}}}}

where v is the number of independent variables. The following formula shows that multivariate logistic regression is simply a standard linear regression model:^[14]

\mathrm {logit} \left(\pi \left(x\right)\right)=\beta _{0}+\beta _{1}X_{1}+\beta _{2}X_{2}+\dots +\beta _{v}X_{v}

Types

The two main types of multivariate logistic regression are linear regression and logistic regression.

Linear regression

Linear regression produces results that show a linear relationship with a single independent variable (IV) and can be plotted on a graph as a straight line.^[15]

Logistic regression

In contrast, logistic regression produces results that show a nonlinear relationship. As a result, plotting the data on a graph produces a curved line called a sigmoid. Unlike linear regression, logistic regression produces results based on two or more independent variables.^[16]^[17]^[5]

The odds ratio associated with a single independent variable can change when other independent variables are accounted for as well.^[18] However, the changes are usually insignificant, but they can indicate errors.^[19]

Assumptions

Multivariate logistic regression assumes that the different observations are independent.^[20] It also assumes that the natural logarithm of the odds ratio and the dependent variables show a linear relationship. However, it does not assume a normal distribution of the dependent variables.

Null hypothesis

A null hypothesis is an assumption that the independent variables do not have any impact on the dependent variable.^[21]

Dependent variables

There are three main types of logistic regression dependent variables (DVs): Binary, multi-class, and ordinal.^[22]

Binary

A binary dependent variable is a variable with only two outcomes, and the possible values must be opposites of each other.^[23]

Multi-class

A multi-class dependent variable is a variable with at least three qualitative (non-numerical) outcomes, usually with a constant numerical stand-in.^[24]

Ordinal

An ordinal dependent variable is a variable with at least three possible outcomes, which are numerically different.^[25]

Models

Multivariate logistic regression produces the following models:^[26]

Logit models

Logit models distinguish independent and dependent variables.

Log-linear models

Unlike logit models, log-linear models do not distinguish between categories of variables.

Probit models

Probit models function similarly to logit models due to the similarities of normal and logistic distributions. However, since the independent variables are interpreted as standard deviations instead of odds ratios, these models are also more similar to linear models than logit models.

Uses

Scientists

When scientists use logistic regression, they usually include as many independent variables as necessary.^[5]

Doctors and physicians

Multivariate logistic regression is used by physicians to:^[27]

associate certain characteristics with certain outcomes^[28]
determine the effects of certain techniques
give people with certain conditions appropriate treatments
develop appropriate models^[29]

Market

Multivariate logistic regression is also used to analyze customer preferences for products.^[30]

Artificial intelligence

Multivariate logistic regressions are also used in machine learning.^[31]

In comparison to multivariable logistic regression

While both multivariate logistic regression and multivariable logistic regression correlate multiple independent variables to outcomes, multivariate logistic regression correlates independent variables to multiple outcomes, while multivariable logistic regression correlates independent variables to a single outcome.^[32]^[3]

References

^ "... while multivariate logistic regression deals with multiple outcomes simultaneously." - [1] (This vs. That)
^ "In contrast, multivariate logistic regression involves analyzing the relationship between multiple outcome variables and one or more predictors. This can lead to more complex interpretations, as researchers must consider the impact of predictors on each outcome variable." - [2] (This vs. That)
^ ^a ^b "Number of variables: One dependent (y), At least two independent (x)" - [3] (Ylva B. Almquist)
^ "Multivariate logistic regression is a type of analysis that can help predict results when you're working with multiple variables." - [4] (Indeed)
^ ^a ^b ^c Sperandei, Sandro (2014). "Understanding logistic regression analysis". Biochemia Medica. 24 (1): 12–18. doi:10.11613/BM.2014.003. ISSN 1330-0962. PMC 3936971. PMID 24627710.
^ "Use multiple logistic regression when you have one nominal variable and two or more measurement variables, and you want to know how the measurement variables affect the nominal variable." - [5] (Handbook of Biological Statistics)
^ "The multiple logistic regression equation is based on the premise that the natural log of odds (logit) is linearly related to independent variables." - [6] (Springer)
^ "The statistical program first calculates the baseline odds of having the outcome versus not having the outcome without using any predictor." - [7] (National Library of Medicine)
^ "Then, the chosen independent (input/predictor) variables are entered into the model, and a regression coefficient (known also as “beta”) and “P” value for each of these are calculated." - [8] (National Library of Medicine)
^ "The “P” value indicates whether the particular variable contributes significantly to the occurrence of the outcome or not." - [9] (National Library of Medicine)
^ "Whether the purpose of a multiple logistic regression is prediction or understanding functional relationships, you'll usually want to decide which variables are important and which are unimportant. In the bird example, if your purpose was prediction it would be useful to know that your prediction would be almost as good if you measured only three variables and didn't have to measure more difficult variables such as range and weight." - [10] (LibreTexts)
^ "A frequently seen rule of thumb is that you should have at least 10 to 20 times as many observations as you have independent variables." - [11] (LibreTexts)
^ "As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation for modeling the probability, we have: ..." - [12] (McGill Medical)
^ "... which shows that logistic regression is really just a standard linear regression model, ..." - [13] (McGill Medical)
^ "Linear regression has a continuous set of results that can easily be mapped on a graph as a straight line." - [14] (Indeed)
^ "Logistic regressions are non-linear and are portrayed on a graph with a curved shape called a sigmoid. Instead of a continuous set of results, a logistical regression has two or more categories for data." - [15] (Indeed)
^ "Logistic regression analysis is a statistical technique to evaluate the relationship between various predictor variables (either categorical or continuous) and an outcome which is binary (dichotomous)." - [16] (National Library of Medicine)
^ "A specific odds ratio from a simple logistic regression model can increase when other x-variables are included." - [17] (Ylva B. Almquist)
^ "Usually, it is just “noise”, i.e. not any large increases, and therefore not much to be concerned about. But it can also reflect that there is something going on that we need to explore further." - [18] (Ylva B. Almquist)
^ "Multiple logistic regression assumes that the observations are independent." - [19] (Statistics LibreTexts)
^ "he main null hypothesis of a multiple logistic regression is that there is no relationship between the X variables and the Y variable;" - [20] (LibreTexts)
^ "Logistic regression includes three basic types: ..." - [21] (Indeed)
^ "A binary output is a variable where there are only two possible outcomes. These outcomes must be opposite of each other and mutually exclusive." - [22] (Indeed)
^ "A multi-class has three or more categories without any numerical value, though they usually have a numerical stand-in for datasets." - [23] (Indeed)
^ "An ordinal output also has three or more categories, though they're in a ranked output." - [24] (Indeed)
^ [25] (Springer)
^ "Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used to (i) identify patient characteristics associated with an outcome (often called ‘risk factors’), (ii) determine the effect of a procedural technique on a particular outcome, (iii) adjust for differences between groups to allow a comparison of different treatment strategies, (iv) quantify the magnitude of an effect size, (v) develop a propensity score and (vi) develop risk-prediction models." - [26] (Oxford Academic)
^ "Epidemiologists use multiple logistic regression a lot, because they are concerned with dependent variables such as alive vs. dead or diseased vs. healthy, and they are studying people and can't do well-controlled experiments, so they have a lot of independent variables." - [27] (Handbook of Biological Statistics)
^ [28] (ScienceDirect)
^ "It is also used in market research to analyze consumer preferences among product categories." - [29] (This vs. That)
^ "This is a common classification algorithm used in data science and machine learning." - [30] (Indeed)
^ "Two statistical terms, multivariate and multivariable, are repeatedly and interchangeably used in the literature, when in fact they stand for two distinct methodological approaches. While the multivariable model is used for the analysis with one outcome (dependent) and multiple independent (a.k.a., predictor or explanatory) variables, multivariate is used for the analysis with more than 1 outcome[s] (eg, repeated measures) and multiple independent variables." - [31] (Oxford Academic)

[1] "... while multivariate logistic regression deals with multiple outcomes simultaneously." - [1] (This vs. That)

[2] "In contrast, multivariate logistic regression involves analyzing the relationship between multiple outcome variables and one or more predictors. This can lead to more complex interpretations, as researchers must consider the impact of predictors on each outcome variable." - [2] (This vs. That)

[statsapplied.com-3] "Number of variables: One dependent (y), At least two independent (x)" - [3] (Ylva B. Almquist)

[4] "Multivariate logistic regression is a type of analysis that can help predict results when you're working with multiple variables." - [4] (Indeed)

[:0-5] Sperandei, Sandro (2014). "Understanding logistic regression analysis". Biochemia Medica. 24 (1): 12–18. doi:10.11613/BM.2014.003. ISSN 1330-0962. PMC 3936971. PMID 24627710.

[6] "Use multiple logistic regression when you have one nominal variable and two or more measurement variables, and you want to know how the measurement variables affect the nominal variable." - [5] (Handbook of Biological Statistics)

[7] "The multiple logistic regression equation is based on the premise that the natural log of odds (logit) is linearly related to independent variables." - [6] (Springer)

[8] "The statistical program first calculates the baseline odds of having the outcome versus not having the outcome without using any predictor." - [7] (National Library of Medicine)

[9] "Then, the chosen independent (input/predictor) variables are entered into the model, and a regression coefficient (known also as “beta”) and “P” value for each of these are calculated." - [8] (National Library of Medicine)

[10] "The “P” value indicates whether the particular variable contributes significantly to the occurrence of the outcome or not." - [9] (National Library of Medicine)

[11] "Whether the purpose of a multiple logistic regression is prediction or understanding functional relationships, you'll usually want to decide which variables are important and which are unimportant. In the bird example, if your purpose was prediction it would be useful to know that your prediction would be almost as good if you measured only three variables and didn't have to measure more difficult variables such as range and weight." - [10] (LibreTexts)

[12] "A frequently seen rule of thumb is that you should have at least 10 to 20 times as many observations as you have independent variables." - [11] (LibreTexts)

[13] "As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation for modeling the probability, we have: ..." - [12] (McGill Medical)

[14] "... which shows that logistic regression is really just a standard linear regression model, ..." - [13] (McGill Medical)

[15] "Linear regression has a continuous set of results that can easily be mapped on a graph as a straight line." - [14] (Indeed)

[16] "Logistic regressions are non-linear and are portrayed on a graph with a curved shape called a sigmoid. Instead of a continuous set of results, a logistical regression has two or more categories for data." - [15] (Indeed)

[17] "Logistic regression analysis is a statistical technique to evaluate the relationship between various predictor variables (either categorical or continuous) and an outcome which is binary (dichotomous)." - [16] (National Library of Medicine)

[18] "A specific odds ratio from a simple logistic regression model can increase when other x-variables are included." - [17] (Ylva B. Almquist)

[19] "Usually, it is just “noise”, i.e. not any large increases, and therefore not much to be concerned about. But it can also reflect that there is something going on that we need to explore further." - [18] (Ylva B. Almquist)

[20] "Multiple logistic regression assumes that the observations are independent." - [19] (Statistics LibreTexts)

[21] "he main null hypothesis of a multiple logistic regression is that there is no relationship between the X variables and the Y variable;" - [20] (LibreTexts)

[22] "Logistic regression includes three basic types: ..." - [21] (Indeed)

[23] "A binary output is a variable where there are only two possible outcomes. These outcomes must be opposite of each other and mutually exclusive." - [22] (Indeed)

[24] "A multi-class has three or more categories without any numerical value, though they usually have a numerical stand-in for datasets." - [23] (Indeed)

[25] "An ordinal output also has three or more categories, though they're in a ranked output." - [24] (Indeed)

[26] [25] (Springer)

[27] "Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used to (i) identify patient characteristics associated with an outcome (often called ‘risk factors’), (ii) determine the effect of a procedural technique on a particular outcome, (iii) adjust for differences between groups to allow a comparison of different treatment strategies, (iv) quantify the magnitude of an effect size, (v) develop a propensity score and (vi) develop risk-prediction models." - [26] (Oxford Academic)

[28] "Epidemiologists use multiple logistic regression a lot, because they are concerned with dependent variables such as alive vs. dead or diseased vs. healthy, and they are studying people and can't do well-controlled experiments, so they have a lot of independent variables." - [27] (Handbook of Biological Statistics)

[29] [28] (ScienceDirect)

[30] "It is also used in market research to analyze consumer preferences among product categories." - [29] (This vs. That)

[31] "This is a common classification algorithm used in data science and machine learning." - [30] (Indeed)

[32] "Two statistical terms, multivariate and multivariable, are repeatedly and interchangeably used in the literature, when in fact they stand for two distinct methodological approaches. While the multivariable model is used for the analysis with one outcome (dependent) and multiple independent (a.k.a., predictor or explanatory) variables, multivariate is used for the analysis with more than 1 outcome[s] (eg, repeated measures) and multiple independent variables." - [31] (Oxford Academic)

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]