Based on the tutorial “Principal Component Analysis (PCA) in R” by Statistics Globe
R ENVIRONMENT SET UP & DATA
Needed R Packages
We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session
options(scipen = 999)
# --- General
library(here)       # tools find your project's files, based on working directory
library(dplyr)      # A Grammar of Data Manipulation
library(skimr)      # Compact and Flexible Summaries of Data
library(magrittr)   # A Forward-Pipe Operator for R
library(readr)      # Read Rectangular Text Data
library(tidyr)      # Tidy Messy Data
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
# --- Plotting & data visualization
library(ggplot2)       # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(scatterplot3d) # 3D Scatter Plot
# --- Statistics
library(MASS)       # Support Functions and Datasets for Venables and Ripley's MASS
library(factoextra) # Extract and Visualize the Results of Multivariate Data Analyses
library(FactoMineR) # Multivariate Exploratory Data Analysis and Data Mining
library(rstatix)    # Pipe-Friendly Framework for Basic Statistical Tests
library(car)        # Companion to Applied Regression
library(ROCR)       # Visualizing the Performance of Scoring Classifiers
# --- Tidymodels (meta package)
library(rsample)    # General Resampling Infrastructure
library(broom)      # Convert Statistical Objects into Tidy Tibbles
LOGISTIC REGRESSION
(EXAMPLE of SUPERVISED ML ALGORITHM)
Logistic regression: review
Logistic regression is a classification model used with a binary response variable, e.g.:
yes|no, or 0|1, or True|False in a survey question;
success|failure in a clinical trial experiment;
benign|malignant in a biopsy experiment.
Logistic regression is a type of Generalized Linear Model (GLM): a more flexible version of linear regression that also works for categorical response variables or count data (e.g. Poisson regression).
When logistic regression fits the coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) to the data, it estimates them via the Maximum Likelihood Estimation method (as opposed to linear regression’s Least Squares estimation method).
Logistic regression: logit function
If we have predictor variables like \(X_{1}\), \(X_{2}\), …, \(X_{k}\) and a binary response variable \(Y\) (where \(y_i = 0\) or \(y_i = 1\)), we need a “special” function to transform the expected value of the response variable into the \([0,1]\) outcome we’re trying to predict.
The logit function (of the GLM family) helps us determine the coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) that best fit this sort of data and it is defined as: \[
logit (p_i) = \ln\left( \frac{p_i}{1-p_i} \right)= \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}
\] where \(p_i\) is the probability that \(y_i = 1\)
Via the inverse-logit (logistic) function, we solve for the probability \(p_i\), given values of the predictor variables, like so: \[
p_i = \frac{e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}{1 + e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}
\]
We will use:
a (toy) heart_data dataset from the book Data Analytics with R by Brian Machut and Nathan Cornwell;
a few clean datasets used in the “Core Statistics using R” course by Martin van Rongen.
Dataset on Heart Disease (heart_data)
Name: heart_data.csv
Documentation: Toy dataset prepared for teaching purposes. See the reference on the data in Data Analytics with R.
Sampling details: This dataset contains 10,000 observations on 4 variables.
# Use `here` in specifying all the subfolders AFTER the working directory
heart_data <- read.csv(file = here::here("practice", "data_input", "05_datasets", "heart_data.csv"),
                       header = TRUE,             # 1st line is the name of the variables
                       sep = ",",                 # which is the field separator character
                       na.strings = c("?", "NA"), # specific MISSING values
                       row.names = NULL)
heart_data variables with description

| Variable | Type | Description |
|---|---|---|
| heart_disease | int | whether an individual has heart disease (1 = yes; 0 = no) |
| coffee_drinker | int | whether the individual is a coffee drinker (1 = yes; 0 = no) |
| fast_food_spend | dbl | a numerical field corresponding to the annual spend of each individual on fast food |
| income | dbl | a numerical field corresponding to the individual’s annual income |
heart_data dataset splitting
In Machine Learning, it is good practice to split the data into training and testing sets.
We will use the training set (70%) to fit the model and then the testing set (30%) to evaluate the model’s performance.
set.seed(123)
# Obtain 2 sub-samples from the dataset: training and testing
sample <- sample(c(TRUE, FALSE), nrow(heart_data),
                 replace = TRUE, prob = c(0.7, 0.3))
heart_train <- heart_data[sample, ]
heart_test  <- heart_data[!sample, ]
Which results in:
# check the structure of the resulting datasets
dim(heart_train)
[1] 7048 4
dim(heart_test)
[1] 2952 4
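As a quick sanity check (not part of the original output), we can verify that the realized split is close to the 70/30 target:

# share of observations that ended up in the training set
nrow(heart_train) / nrow(heart_data) # 0.7048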
Convert binary variables to factors
Before examining the training dataset heart_train, we convert the binary variables heart_disease and coffee_drinker to factors (for better readability).
heart_train <- heart_train %>%
  # convert to factor with levels "Yes" and "No"
  mutate(heart_disease = factor(heart_disease, levels = c(0, 1),
                                labels = c("No_HD", "Yes_HD")),
         coffee_drinker = factor(coffee_drinker, levels = c(0, 1),
                                 labels = c("No_Coffee", "Yes_Coffee")))

# show the first 5 rows of the dataset
heart_train[1:5, ]
Let’s visualize the relationship between the binary outcome variable heart_disease and the continuous predictor variable fast_food_spend.
geom_jitter adds a bit of randomness to the points to avoid overplotting.
geom_boxplot shows the distribution of the continuous variable by the binary outcome variable.
# plot the distribution of heart disease status by fast food spend
heart_train %>%
  ggplot(aes(x = heart_disease, y = fast_food_spend, fill = heart_disease)) +
  geom_jitter(aes(fill = heart_disease), alpha = 0.3, shape = 21, width = 0.25) +
  scale_color_manual(values = c("#005ca1", "#9b2339")) +
  scale_fill_manual(values = c("#57b7ff", "#e07689")) +
  geom_boxplot(fill = NA, color = "black", linewidth = .7) +
  coord_flip() +
  theme(plot.title = element_text(size = 13, face = "bold", color = "#873c4a"),
        axis.text.x = element_text(size = 12, face = "italic"),
        axis.text.y = element_text(size = 12, face = "italic"),
        legend.position = "none") +
  labs(title = "Fast food expenditure by heart disease status") +
  xlab("Heart Disease (Y/N)") +
  ylab("Annual Fast Food Spend")
Plotting Y by X1 (continuous variable)
+ The boxplots indicate that subjects with heart disease (HD = 1) seem to spend higher amounts on fast food
+ Also, this sample has many more subjects without heart disease (HD = 0) than with heart disease (HD = 1)
Plotting Y by X2 (discrete variable)
Let’s see also the relationship between the binary outcome variable heart_disease and the binary predictor variable coffee_drinker.
We use the handy count() function from dplyr to count occurrences of combinations of categorical variables.
# Dataset manipulation
heart_train %>%
  # count the unique values per each group from 2 categorical variables' combinations
  dplyr::count(heart_disease, coffee_drinker, name = "count_by_group") %>%
  dplyr::group_by(coffee_drinker) %>%
  dplyr::mutate(total_coffee_class = sum(count_by_group),
                proportion = count_by_group / total_coffee_class) %>%
  dplyr::ungroup() %>%
  # filter only those with heart disease
  dplyr::filter(heart_disease == "Yes_HD") %>%
  # Plot
  ggplot(aes(x = coffee_drinker, y = proportion, fill = coffee_drinker)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#57b7ff", "#e07689")) +
  theme_minimal() +
  ylab("Percent with Heart Disease") +
  xlab("Coffee Drinker (Y/N)") +
  ggtitle("Figure 3: Percent of Coffee Drinkers with Heart Disease") +
  labs(fill = "Coffee Drinker") +
  scale_y_continuous(labels = scales::percent)
Plotting Y by X2 (discrete variable)
Drinking coffee also seems associated with a higher likelihood of heart disease (HD = 1).
Linear regression wouldn’t work!
In principle, we could use a linear regression model to study likelihood of having Heart Disease in relation to risk factors: \[ Y = \beta_0 + \beta_1\text{(US\$ Spent Fast Food)} + \beta_2\text{(Coffee Drinker = YES)} \]
But we’ll see why the logistic regression model is a better option.
Let’s compare the 2 models (for simplicity, we ignore the coffee_drinker variable for now):
# --- 1) Linear regression model
linear_mod <- lm(heart_disease ~ fast_food_spend # + coffee_drinker
                 , data = heart_data)

# --- 2) Logistic regression model
logit_mod <- glm(heart_disease ~ fast_food_spend # + coffee_drinker
                 , data = heart_data,
                 family = "binomial")
Compute alternative models’ predictions
We can now extract the coefficients from the linear regression and logistic models, then estimate the predicted outcomes from these models: lin_pred, logit_pred, and logistic_pred (i.e. the conversion from log(odds) to probability).
# --- 1) Extract coefficients from linear regression model
intercept_lin       <- coef(linear_mod)[1]
fast_food_spend_lin <- coef(linear_mod)[2]
coffee_drinker_lin  <- coef(linear_mod)[3]

# --- 2) Extract coefficients from logit regression model
intercept_logit       <- coef(logit_mod)[1]
fast_food_spend_logit <- coef(logit_mod)[2]
coffee_drinker_logit  <- coef(logit_mod)[3]

# --- Estimate predicted data from different models
heart_data <- heart_data %>%
  mutate(
    # Convert outcome variable to factor
    heart_disease_factor = factor(heart_disease, labels = c("No Disease (Y=0)", "Disease (Y=1)")),
    # 1) Linear model prediction
    lin_pred = intercept_lin + fast_food_spend_lin * fast_food_spend,
    # 2) Logit model prediction
    logit_pred = intercept_logit + fast_food_spend_logit * fast_food_spend,
    coffee_drinker,
    # 3) Convert logit to probability (logistic model prediction)
    logistic_pred = 1 / (1 + exp(-logit_pred))
  ) %>%
  arrange(fast_food_spend)
Plot alternative models’ outcomes
We can also plot the predicted outcomes from the 3 models to see how they differ.
# --- Plot
ggplot(heart_data, aes(x = fast_food_spend)) +
  # Actual dataset observations (Y=0, Y=1) using `color =`
  geom_jitter(aes(y = heart_disease, color = heart_disease_factor),
              width = 200, height = 0.03, alpha = 0.75, size = 2, shape = 16) +
  # Models' predictions (smooth trends)
  geom_smooth(aes(y = lin_pred, color = "Linear Regression"),
              method = "lm", se = FALSE, linewidth = 1.25, linetype = "dashed") +
  geom_smooth(aes(y = logit_pred, color = "Logit (Log-Odds)"),
              method = "lm", se = FALSE, linewidth = 1.25, linetype = "dotdash") +
  geom_smooth(aes(y = logistic_pred, color = "Logistic Regression"),
              method = "glm", method.args = list(family = "binomial"),
              se = FALSE, linewidth = 1.25, linetype = "solid") +
  # Separate legends: color for dots, color for lines
  scale_color_manual(name = "Actual Y values & Prediction Models",
                     values = c("No Disease (Y=0)" = "#A6A6A6",
                                "Disease (Y=1)" = "#4c4c4c",
                                "Linear Regression" = "#d02e4c",
                                "Logit (Log-Odds)" = "#239b85",
                                "Logistic Regression" = "#BD8723")) +
  # Define scales for the axes
  scale_x_continuous(breaks = seq(0, 6500, by = 500), limits = c(0, 6500), expand = c(0, 0)) +
  scale_y_continuous(breaks = seq(-3, 3, by = .25)) +
  coord_cartesian(ylim = c(-1.25, 1.25), xlim = c(0, 6500)) +
  theme_minimal() +
  labs(title = "Comparing Linear and Logistic Regression Predictions v. actual Y values",
       # subtitle = "(For simplicity, only fast food spend is considered)",
       y = "Y = Heart disease [0,1]",
       x = "Fast food spend [US$/yr]",
       color = "Actual Y values and Predictions")
Plot alternative models’ outcomes
+ (The actual data points are shown as the grey dots)
+ The linear model predicts values that are ≠ 0 and 1, which poorly fit the actual data
+ The logit model predicts log(odds) ranging from -Inf to +Inf, which is not interpretable
+ The logistic model squeezes probabilities between 0 and 1, which fits the data better
Linear regression didn’t work!
Besides the poor fit, recall that linear regression models imply certain assumptions:
Linear relationship between Y and the predictors X1 and X2
Residuals must be 1) approximately normally distributed and 2) uncorrelated
Homoscedasticity: residuals should have constant variance
Non collinearity: predictor variables should not be highly correlated with each other
These assumptions are not met in this case, as the diagnostic plots below show.
# Set graphical parameters to plot side by side
par(mfrow = c(1, 2))
# Diagnostic plots
plot(linear_mod, which = 1)
plot(linear_mod, which = 2)
# Reset graphical parameters (optional, avoids affecting later plots)
par(mfrow = c(1, 1))
Linear regression didn’t work!
Figure 1: “Diagnostic plots for a hypothetical linear regression model 👎🏻”
+ a) Residuals should be randomly scattered around 0
+ b) QQ plot should be close to the diagonal line
Fitting a logistic regression model
[Now that we are convinced…]
…let’s fit instead a logistic regression model to the heart_train data using:
the glm function for Generalized Linear Models,
with argument family = binomial to specify logistic regression, which uses logit as the link function,
(after ~) we include the 3 predictor variables in the model.
# Fit a logistic regression model
heart_model <- glm(heart_disease ~ coffee_drinker + fast_food_spend + income,
                   data = heart_train,
                   family = binomial(link = "logit"))
Model output
Table 1 (next) shows the model output, with the coefficient estimate for each predictor.
The broom::tidy function converts the model summary into a more readable data frame.
The odds ratio (= exponentiated coefficient estimate) is more interpretable than the coefficient itself.
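As a side note (not shown in the original slides), the same odds ratios can also be obtained directly from the fitted model; a minimal sketch:

# exponentiate the fitted coefficients to obtain odds ratios
exp(coef(heart_model))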
# Convert model's output summary into data frame
heart_model_coef <- broom::tidy(heart_model) %>%
  # improve readability of significance levels
  dplyr::mutate('signif. lev.' = case_when(`p.value` < 0.001 ~ "***",
                                           `p.value` < 0.01  ~ "**",
                                           `p.value` < 0.05  ~ "*",
                                           TRUE ~ "")) %>%
  # add odds ratio column
  dplyr::mutate(odds_ratio = exp(estimate)) %>%
  dplyr::relocate(odds_ratio, .after = estimate) %>%
  dplyr::mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  # format as table
  knitr::kable() %>%
  # reduce font size
  kable_styling(font_size = 20) %>%
  # add table title
  kableExtra::add_header_above(c("Logistic Regression Analysis of Heart Disease Risk Factors" = 7))

heart_model_coef
Model output
Table 1: Logistic regression model output
+ estimates of coefficients are in the form of the natural logarithm of the odds (log(odds)) of the event happening (Heart Disease)
+ a positive estimate indicates an increase in the odds of having Heart Disease
+ a negative estimate indicates a decrease in the odds of having Heart Disease
+ odds ratio = the exponentiated coefficient estimate
Logistic Regression Analysis of Heart Disease Risk Factors

| term | estimate | odds_ratio | std.error | statistic | p.value | signif. lev. |
|---|---|---|---|---|---|---|
| (Intercept) | -11.0554 | 0.0000 | 0.6040 | -18.3051 | 0.0000 | *** |
| coffee_drinkerYes_Coffee | -0.7296 | 0.4821 | 0.2910 | -2.5071 | 0.0122 | * |
| fast_food_spend | 0.0024 | 1.0024 | 0.0001 | 20.6018 | 0.0000 | *** |
| income | 0.0000 | 1.0000 | 0.0000 | -0.2299 | 0.8182 |  |
Interpreting the logistic coefficients
Intercept: gives the log-odds of heart disease when all predictor variables are zero. This is not generally interpreted, but the highly negative value suggests very low probability of heart disease in the sample of reference.
(If interpreted) the intercept term -11.0554 would translate into a probability \(P = e^{-11.0554} / (1 + e^{-11.0554}) \approx 0.00002\), or \(0.002\%\).
…which means: when \(\text{coffee_drinker} = NO\), and \(\text{fast_food_spend} = 0\), and \(\text{income} = 0\), the probability of heart disease is as low as \(0.002\%\).
Income: Based on a p-value = 0.8182, we conclude that income is not significantly associated with heart disease. (Anyhow, the odds ratio of ≈1 would suggest no change in odds based on income.)
The coefficient of fast food $$ 🍔🍟
| term | estimate | odds_ratio | std.error | statistic | p.value | signif. lev. |
|---|---|---|---|---|---|---|
| (Intercept) | -11.0554 | 0.0000 | 0.6040 | -18.3051 | 0 | *** |
| fast_food_spend | 0.0024 | 1.0024 | 0.0001 | 20.6018 | 0 | *** |
The positive coefficient of Fast Food Annual Spending (US$), 0.0024 (highly statistically significant, p-value < 0.001), suggests a positive association with heart disease.
This means that for each additional \(\Delta X_1 = +1 \; US\$\) spent on fast food annually:
the log(odds) of heart disease increases by: \(\Delta \log(\text{odds}) = \beta_{1} \times \Delta X_1 = 0.0024 \times 1 = 0.0024\)
the odds of heart disease (with spending) increase by a factor of: \(OR = e^{0.0024} \approx 1.0024\) (compared to without spending).
The probability of heart disease can be computed using the logistic function: \(P_{HD=1} = \frac{e^{\beta_0 + (\beta_1 \times X_1)}}{1 + e^{\beta_0 + (\beta_1 \times X_1)}}\)
Solving for \(\beta_0 = -11.0554\), \(\beta_1 = 0.0024\) and \(X_1 = 1\) gives: \[P_{HD=1} = \frac{e^{-11.0554 + (0.0024 \times 1)}}{1 + e^{-11.0554 + (0.0024 \times 1)}} \approx 0.0000159\]
which indicates a probability of heart disease as low as \(0.00159\%\) when fast_food_spend \(= 1\; US\$\).
The coefficient of fast food $$ 🍔🍟
However, considering that this is annual spending, we should use a more appropriate scale!
For example, for an additional \(\Delta X_1 = +100 \; US\$\) spent on fast food annually:
the log(odds) of heart disease increases by: \(\Delta \log(\text{odds}) = \beta_{1} \times \Delta X_1 = 0.0024 \times 100 = 0.24\)
the odds of heart disease (with spending) increase by a factor of: \(OR = e^{0.24} \approx 1.271\) (compared to without spending).
The probability of heart disease can be computed using the logistic function: \(P_{HD=1} = \frac{e^{\beta_0 + (\beta_1 \times X_1)}}{1 + e^{\beta_0 + (\beta_1 \times X_1)}}\)
Solving for \(\beta_0 = -11.0554\), \(\beta_1 = 0.0024\) and \(X_1 = 100\) gives: \[P_{HD=1} = \frac{e^{-11.0554 + (0.0024 \times 100)}}{1 + e^{-11.0554 + (0.0024 \times 100)}} \approx 0.00002008\]
which (this time) indicates a still very low probability of heart disease, \(0.002008\%\), when fast_food_spend \(= 100\; US\$\).
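These figures can be reproduced with a quick sketch in R, using the rounded coefficients from Table 1:

b0 <- -11.0554   # intercept (rounded)
b1 <- 0.0024     # fast_food_spend coefficient (rounded)
delta_x1 <- 100  # +100 US$ of annual fast food spend

b1 * delta_x1                                            # change in log(odds) = 0.24
exp(b1 * delta_x1)                                       # odds ratio ~ 1.271
exp(b0 + b1 * delta_x1) / (1 + exp(b0 + b1 * delta_x1))  # probability ~ 0.00002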
The coefficient of coffee drinking ☕️
| term | estimate | odds_ratio | std.error | statistic | p.value | signif. lev. |
|---|---|---|---|---|---|---|
| (Intercept) | -11.0554 | 0.0000 | 0.604 | -18.3051 | 0.0000 | *** |
| coffee_drinkerYes_Coffee | -0.7296 | 0.4821 | 0.291 | -2.5071 | 0.0122 | * |
The negative coefficient of Coffee Drinker (=YES), -0.7296 expressed in log-odds, means that coffee drinkers have lower odds of having heart disease (relative to non-coffee drinkers).
This effect is statistically significant as p-value = 0.0122.
Transforming the estimate into an odds ratio, we obtain \(\text{O.R.} = e^{-0.7296} = 0.48\), meaning coffee drinkers have 0.48 times (i.e. 48% of) the odds that non-coffee drinkers have of experiencing heart disease (holding all other predictors fixed). \[ \text{Odds Ratio} = e^{-0.7296} = 0.48 \]
How much do the odds drop? Since the O.R. is < 1, we have a percentage decrease: \((1 - 0.48) = 0.52\), i.e. \(52\%\) lower odds of having heart disease compared to non-coffee drinkers.
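The same numbers can be checked with a one-line sketch in R:

exp(-0.7296)      # odds ratio ~ 0.48
1 - exp(-0.7296)  # ~ 0.52, i.e. about 52% lower odds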
Wait, is drinking coffee good or bad? 🤔
Our plot above showed a higher proportion of coffee drinkers with heart disease compared to non-coffee drinkers. However, our model just told us that coffee drinking is associated with a decrease in the likelihood of having heart disease.
How can that be❓
It’s because coffee_drinker and fast_food_spend are correlated: on its own, coffee drinking would appear to be associated with heart disease, but only because it is also associated with fast food spend, which our model tells us is the real contributor to heart disease.
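One way to see this confounding in the data (a quick check, not shown in the original slides) is to compare the average fast food spend across coffee-drinking groups:

# Compare average annual fast food spend by coffee-drinking status
heart_train %>%
  dplyr::group_by(coffee_drinker) %>%
  dplyr::summarise(mean_fast_food_spend = mean(fast_food_spend))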
Understanding Odds
Odds represent the ratio of the probability of an event occurring to the probability of it not occurring: \[ \text{Odds} = \frac{p}{1-p} \]
Example: If the probability of heart disease is 0.25 (25%), then the odds are: \[ \text{Odds} = \frac{0.25}{1 - 0.25} = \frac{0.25}{0.75} \approx 0.33 \]
This means that for every 1 person without heart disease, there are 0.33 people with it.
(Or equivalently: for every 3 people without heart disease, 1 person has it.)
Understanding Odds Ratio
Odds Ratio (O.R.) compares the odds of an event occurring in one group (exposed) relative to another group (reference).
Example: If the odds of heart disease for coffee drinkers are 0.48, and the odds for non-coffee drinkers are 1, then the odds ratio is: \[ \text{O.R.} = \frac{0.48}{1} = 0.48 \]
Making predictions from logistic regression model ✍🏻
# Define values for individual case
x1 <- 1
x2 <- 5000
x3 <- 50000

# Coefficients from our logistic regression model
beta_0 <- -11.055360943337
b1 <- -0.729562678149
b2 <- 0.002376170127
b3 <- -0.000002304669

# Compute the exponent part
exponent <- beta_0 + (b1 * x1) + (b2 * x2) + (b3 * x3)

# Compute the predicted probability
probability <- 1 / (1 + exp(-exponent))

# Print the result
print(probability) # 0.4951735
[1] 0.4951735
Making predictions from logistic regression model 👩🏻💻
Now, let’s use the predict() function to make a prediction for the same individual.
individual <- data.frame(coffee_drinker = "Yes_Coffee", # factor
                         fast_food_spend = 5000,
                         income = 50000)

# Make a prediction
predict(heart_model, individual, type = "response")
1
0.4951735
Adding predictions to the training data
In fact, we can use the predict() function to make predictions for all the individuals observed in the training data heart_train.
heart_train$heart_disease_pred is the added column with the predicted probability of heart disease for each individual.
# Add calculated predictions to the training data
heart_train$heart_disease_pred <- predict(heart_model, type = "response")

# Subset of 3 with heart_disease = Yes_HD
heart_train[heart_train$heart_disease == "Yes_HD", ][1:3, ]
# Convert predictions to classifications
heart_train$heart_disease_pred_class <- ifelse(heart_train$heart_disease_pred > 0.5, 1, 0)
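Converting predictions to classes also allows a quick check (not in the original slides): a confusion matrix of observed vs. predicted classes at the 0.5 threshold, obtained with table():

# Cross-tabulate observed classes vs. predicted classes
table(observed  = heart_train$heart_disease,
      predicted = heart_train$heart_disease_pred_class)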
Evaluating the model
ROC curve
# Load the pROC package
library(pROC)

# Create a ROC curve
roc_curve <- roc(heart_train$heart_disease, heart_train$heart_disease_pred)

# Plot the ROC curve
plot(roc_curve, col = "blue", main = "ROC Curve", legacy.axes = TRUE)
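If useful, the area under the ROC curve (AUC) can also be extracted from the same object (a minimal sketch):

# Area under the ROC curve
pROC::auc(roc_curve)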
PRINCIPAL COMPONENT ANALYSIS (PCA)
The first step of PCA is to calculate the principal components. To accomplish this, we use the prcomp() function from the stats package.
With argument “scale = TRUE” each variable in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components (just like option Autoscaling in MetaboAnalyst)
# calculate principal components
biopsy_pca <- prcomp(data_biopsy,
                     # standardize variables
                     scale = TRUE)
Analyze Principal Components
Let’s check out the elements of our obtained biopsy_pca object
(All accessible via the $ operator)
names(biopsy_pca)
[1] "sdev" "rotation" "center" "scale" "x"
“sdev” = the standard deviation of the principal components
“sdev”^2 = the variance of the principal components (eigenvalues of the covariance/correlation matrix)
“rotation” = the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
“center” and “scale” = the means and standard deviations of the original variables before the transformation;
“x” = the principal component scores (after PCA the observations are expressed in principal component scores)
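For instance, the proportion of variance explained by each component can be derived directly from “sdev” (a minimal sketch):

# eigenvalues (variances) of the principal components
biopsy_pca$sdev^2
# proportion of total variance explained by each component
biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2)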
Analyze Principal Components (cont.)
We can see the summary of the analysis using the summary() function
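The call itself (not shown on the slide) is simply:

# summary of the PCA results
summary(biopsy_pca)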
The first row gives the Standard deviation of each component, which can also be retrieved via biopsy_pca$sdev.
The second row shows the Proportion of Variance, i.e. the percentage of explained variance.
The output suggests the 1st principal component explains around 65% of the total variance, the 2nd principal component explains about 9% of the variance, and this goes on with diminishing proportion for each component.
Cumulative Proportion of variance for components
The last row from the summary(biopsy_pca), shows the Cumulative Proportion of variance, which calculates the cumulative sum of the Proportion of Variance.
# Extracting Cumulative Proportion from summary
summary(biopsy_pca)$importance[3, ]
Once you have computed the PCA in R, you must decide how many components to retain based on the obtained results.
VISUALIZING PCA OUTPUTS
Scree plot
There are several ways to decide on the number of components to retain.
One helpful option is visualizing the percentage of explained variance per principal component via a scree plot.
Plotting with the fviz_eig() function from the factoextra package
# Scree plot shows the variance of each principal component
factoextra::fviz_eig(biopsy_pca,
                     addlabels = TRUE,
                     ylim = c(0, 70))
Visualization is essential in the interpretation of PCA results. Based on the number of retained principal components, which is usually the first few, the observations expressed in component scores can be plotted in several ways.
Scree plot
The obtained scree plot simply visualizes the output of summary(biopsy_pca).
Principal Component Scores
After a PCA, the observations are expressed as principal component scores.
We can retrieve the principal component scores for each observation by calling biopsy_pca$x, and store them in a new dataframe PC_scores.
Next we draw a scatterplot of the observations – expressed in terms of principal components
# Create new object with PC_scores
PC_scores <- as.data.frame(biopsy_pca$x)
head(PC_scores)
It is also important to visualize the observations along the new axes (principal components) to interpret the relations in the dataset:
Principal Component Scores plot (adding label variable)
When data includes a factor variable, like in our case, it may be interesting to show the grouping on the plot as well.
In such cases, the label variable class can be added to the PC set as follows.
# retrieve class variable
biopsy_no_na <- na.omit(biopsy)

# adding class grouping variable to PC_scores
PC_scores$Label <- biopsy_no_na$class
The visualization of the observation points (point cloud) could be in 2D or 3D.
Principal Component Scores plot (2D)
The Scores Plot can be visualized via the ggplot2 package.
grouping is indicated by the argument color = Label;
geom_point() is used for the point cloud.
ggplot(PC_scores, aes(x = PC1, y = PC2, color = Label)) +
  geom_point() +
  scale_color_manual(values = c("#245048", "#CC0066")) +
  ggtitle("Figure 1: Scores Plot") +
  theme_bw()
Principal Component Scores plot (2D)
Figure 1 shows the observations projected into the new data space made up of principal components
Principal Component Scores (2D Ellipse Plot)
Confidence ellipses can also be added to a grouped scatter plot visualized after a PCA. We use the ggplot2 package.
grouping is indicated by the argument color = Label;
geom_point() is used for the point cloud;
the stat_ellipse() function is called to add the ellipses per biopsy group.
ggplot(PC_scores, aes(x = PC1, y = PC2, color = Label)) +
  geom_point() +
  scale_color_manual(values = c("#245048", "#CC0066")) +
  stat_ellipse() +
  ggtitle("Figure 2: Ellipse Plot") +
  theme_bw()
Principal Component Scores (2D Ellipse Plot)
Figure 2 shows the observations projected into the new data space made up of principal components, with 95% confidence regions displayed.
Principal Component Scores plot (3D)
A 3D scatterplot of observations shows the first 3 principal components’ scores.
For this one, we need the scatterplot3d() function of the scatterplot3d package;
The color argument is assigned to the Label variable;
To add a legend, we use the legend() function and specify its coordinates via the xyz.convert() function.
# 3D scatterplot ...
plot_3d <- with(PC_scores,
                scatterplot3d::scatterplot3d(PC_scores$PC1, PC_scores$PC2, PC_scores$PC3,
                                             color = as.numeric(Label),
                                             pch = 19,
                                             main = "Figure 3: 3D Scatter Plot",
                                             xlab = "PC1", ylab = "PC2", zlab = "PC3"))
# ... + legend
legend(plot_3d$xyz.convert(0.5, 0.7, 0.5),
       pch = 19, yjust = -0.6, xjust = -0.9,
       legend = levels(PC_scores$Label),
       col = seq_along(levels(PC_scores$Label)))
Principal Component Scores plot (3D)
Figure 3 shows the observations projected into the new 3D data space made up of principal components.
Biplot: principal components v. original variables
Next, we create another special type of scatterplot (a biplot) to understand the relationship between the principal components and the original variables.
In the biplot each of the observations is projected onto a scatterplot that uses the first and second principal components as the axes.
For this plot, we use the fviz_pca_biplot() function from the factoextra package
We will specify the color for the variables, or rather, for the “loading vectors”
The habillage argument allows us to highlight the grouping by class with color (a possible call is sketched below)
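The exact call is not shown in the original slides; a possible sketch, assuming the biopsy_pca and biopsy_no_na objects created earlier, is:

# Biplot: observations + loading vectors, grouped (habillage) by class
factoextra::fviz_pca_biplot(biopsy_pca,
                            habillage = biopsy_no_na$class, # color observations by class
                            col.var = "black",              # color of the loading vectors
                            label = "var")                  # label only the original variables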
Biplot: principal components v. original variables
The axes show the principal component scores, and the vectors are the loading vectors
Interpreting biplot output
Biplots have two key elements: scores (the 2 axes) and loadings (the vectors). As in the scores plot, each point represents an observation projected in the space of principal components where:
Biopsies of the same class are located closer to each other, which indicates that they have similar scores on the first two principal components;
The loading vectors show strength and direction of association of original variables with new PC variables.
As expected from PCA, the single PC1 accounts for variance in almost all original variables, while V9 has the major projection along PC2.
Interpreting biplot output (cont.)
scores <- biopsy_pca$x
loadings <- biopsy_pca$rotation

# excerpt of first 2 components
loadings[, 1:2]
Motivated the choice of learning/using R for scientific quantitative analysis, and laid out some fundamental concepts in biostatistics with concrete R coding examples.
Consolidated understanding of inferential statistics, through R coding examples conducted on real biostatistics research data.
Discussed the relationship between any two variables, and introduced a widely used analytical tool: regression.
Presented a popular ML technique for dimensionality reduction (PCA), performed both with MetaboAnalyst and R.
Introduced power analysis to define the correct sample size for hypothesis testing, and discussed how ML approaches deal with available data.
Final thoughts
While the workshop only allowed for a brief overview of fundamental ideas, it hopefully provided a solid foundation on the most common statistical analyses you will likely run in your daily work:
Thorough understanding of the input data and the data collection process
Univariate and bivariate exploratory analysis (accompanied by visual intuition) to form hypotheses
Upon verifying the assumptions, we fit data to hypothesized model(s)
Assessment of the model performance (\(R^2\), \(Adj. R^2\), \(F-Statistic\), etc.)
You should now have a solid grasp of the R language to keep using and exploring the huge potential of this programming ecosystem.
We only scratched the surface in terms of ML classification and prediction models, but we got the hang of the fundamental steps and some useful tools that might also serve us in more advanced analyses.