What is it? Logistic regression is a classifier that uses a set of weighted measurements to predict the class (e.g., healthy, diseased) to which a sample belongs based on probability.
When is it used? This model is used when 1) we want to compare ≤ 2 different groups to each other, 2) the samples are independent from each other, 3) the populations may or may not be normally distributed, 4) the population groups are known (e.g., healthy, diseased), and 5) data transformation by other methods results in nonsensical values. It is usually used when the response is binary: yes or no, healthy or diseased, etc.
How does it work?
Logistic Regression: Example
We analyze the protein profile of 1,000 proteins of 100 healthy patients and 100 cancer patients using an antibody-based microarray. We want to identify the specific biomarkers that will predict which future patients are healthy or diseased.
- Center & scale data by subtracting the mean of each patient dataset from itself (Figure 1B) and dividing each patient dataset with its standard deviation (Figure 1C), respectively. Now all datasets have a mean of 0 and a standard deviation of 1.
- Fit the logistic model based on a subset of variables (Figure 2). This is accomplished by adding different weights to the biomarkers. The data should follow an S-shape (i.e., sigmoid function).
- Evaluate the performance of the model by ROC curve analysis (Figure 3).
What does the data look like? The data can be presented in a table format, which would list the biomarkers and their corresponding coefficients (i.e., weights) that are used in the model. Performance of the logistic regression is evaluated by ROC curve analysis.