What is it? Random forest consists of hundreds or more decision trees, with each tree using a random subset of data. All of the decision trees cast a vote on the classification of a sample; the majority vote wins. When is it used? This analysis is one of the most…
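
The voting step described above can be sketched in a few lines. This is a minimal illustration, not a full random forest: the list `votes` stands in for the predictions of already-trained trees, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns the label with the highest count; on a tie it
    # returns the label that was seen first.
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical predictions from five trees for one sample.
votes = ["diseased", "healthy", "diseased", "diseased", "healthy"]
predicted = majority_vote(votes)
```

In practice a forest has hundreds of trees, so exact ties are rare, but the tie rule is worth stating explicitly.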

# Category: Biostatistics & Bioinformatics

## Support Vector Machine (SVM) Model

What is it? Support vector machine (SVM) determines the boundary that best separates the different groups from each other using a subset of variables (e.g., biomarkers) in multi-dimensional space. The boundary is a hyperplane in which a subset of data points closest to the hyperplane (called support vectors) have the…

## Linear Discriminant Analysis (LDA) Model

What is it? Linear discriminant analysis (LDA) separates samples into ≥ 2 classes based on the distance between class means and variance within each class. LDA can also serve to reduce data dimension. When is it used? This analysis is used when there are a lot of variables to consider…

## Logistic Regression Model

What is it? Logistic regression is a classifier that uses a set of weighted measurements to predict the sample class (e.g., healthy, diseased) based on probability. When is it used? This model is used when 1) we want to compare two different groups to each other, 2) the samples…
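
To make "weighted measurements" and "probability" concrete, here is a minimal sketch of how a fitted logistic model scores one sample. The weights, intercept, biomarker values, and the 0.5 cut-off are all made up for illustration; a real model would estimate the weights from training data.

```python
import math

def predict_probability(measurements, weights, intercept):
    """Logistic model: sigmoid of the weighted sum of the measurements."""
    z = intercept + sum(w * x for w, x in zip(weights, measurements))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters and one sample's two biomarker values.
weights = [0.8, -0.5]
intercept = -0.2
sample = [1.5, 0.4]

p_diseased = predict_probability(sample, weights, intercept)
# Assumed decision rule: call the sample "diseased" at probability >= 0.5.
label = "diseased" if p_diseased >= 0.5 else "healthy"
```

The 0.5 cut-off is only a common default; a different threshold trades sensitivity against specificity.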

## ROC Curve Analysis

What is it? A receiver operating characteristic (ROC) curve plots the sensitivity and specificity obtained at every value of a continuous measurement when that value is used as a cut-off to distinguish health status in a population of subjects. The area under the curve (AUC) reflects the measurement’s potential as a diagnostic tool. At a specific sensitivity, the…
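
As a rough sketch of how such a curve is built, the code below treats each observed value as a cut-off, computes sensitivity (true positive rate) and 1 − specificity (false positive rate) at that cut-off, and integrates the curve with the trapezoidal rule. The scores and labels are invented, and the function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cut-off.

    labels: 1 = diseased, 0 = healthy. A sample is called positive when its
    score is greater than or equal to the cut-off.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical measurement values and true health status for eight subjects.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

An AUC of 0.5 corresponds to a measurement that separates the groups no better than chance, and 1.0 to perfect separation.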

## PCA Analysis

What is it? Principal Component Analysis (PCA) transforms high-dimensional data into a lower-dimensional structure to improve data presentation, pattern recognition, and analysis. PCA determines which dimensions will result in the largest variability of measurements (e.g., expression of specific proteins) across all samples. It does not separate the different groups from…

## K-Clustering

What is it? K-clustering groups samples that are most similar to each other. One cluster (or group) is formed around one centroid; the number of centroids is determined by the user. When is it used? This test is performed to profile samples. You dictate how many clusters are made. How…
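
The centroid idea can be sketched as follows, assuming that "K-clustering" here refers to k-means and using Lloyd's algorithm on 2-D points. The data and the starting centroids are invented; real analyses usually pick starting centroids at random and work in many more dimensions.

```python
def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm on 2-D points with user-supplied starting centroids."""
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical samples forming two visibly separate groups; k = 2 is chosen
# by the user through the number of starting centroids.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
```

The number of clusters is fixed by how many starting centroids are passed in, which mirrors the point above that the user decides how many clusters are made.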

## Hierarchical Clustering

What is it? Hierarchical clustering characterizes how similar (or dissimilar) the samples are based on overall patterns of measurements. For example, the groups may be patients and the overall patterns derived from the expression across numerous proteins. Hierarchical clustering analyzes the similarity in a binary fashion starting from one sample.…

## SAM (Significance Analysis of Microarrays)

What is it? SAM is a method used for large-scale gene or protein expression data like those collected with microarrays. It addresses the issue of analyzing large-scale data in which a microarray experiment of 10,000 proteins would identify 100 proteins by chance using a p-value cut-off of 0.01. Therefore, SAM…
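
The "100 proteins by chance" figure follows from multiplying the number of tests by the p-value cut-off. The sketch below shows that arithmetic and checks it with a small simulation in which no protein truly differs, so every p-value is drawn uniformly between 0 and 1. The seed and variable names are arbitrary, and this illustrates the multiple-testing problem only, not the SAM procedure itself.

```python
import random

n_proteins = 10_000
p_cutoff = 0.01

# Expected number of proteins called significant when none truly differ.
expected_false_positives = n_proteins * p_cutoff

# Simulation under the null hypothesis: p-values are uniform on [0, 1).
random.seed(0)  # arbitrary seed, only for reproducibility
simulated_false_positives = sum(
    1 for _ in range(n_proteins) if random.random() < p_cutoff
)
```

The simulated count varies from run to run around the expected value, which is why a raw p-value cut-off alone is a poor filter for experiments of this size.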

## Wilcoxon Rank-Sum

What is it? The Wilcoxon rank-sum test is a method that determines whether two populations are statistically different from each other based on ranks rather than the original values of the measurements. In other words, it ranks all values to determine whether the values are or are not evenly distributed across…
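
The ranking step can be sketched as below: pool both groups, rank every value (tied values share the mean of their ranks), and sum the ranks of one group. The measurements are invented and the helper names are hypothetical. The sketch stops at the test statistic and does not compute a p-value, which requires the null distribution of the rank sum.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum(group_a, group_b):
    """Return the rank sum of group_a and the equivalent Mann-Whitney U."""
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[: len(group_a)])
    u = w - len(group_a) * (len(group_a) + 1) / 2.0
    return w, u

# Hypothetical measurements for two groups of four subjects each.
healthy = [1.1, 2.3, 1.8, 2.0]
diseased = [3.5, 2.9, 4.1, 2.3]
w, u = rank_sum(healthy, diseased)
```

A rank sum close to its minimum or maximum possible value means one group's values sit almost entirely below or above the other's, which is the uneven distribution the test looks for.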