Lab 5: Intro to Machine Learning

Practice session

Luisa M. Mimmi —   https://luisamimmi.org/

March 28, 2025

GOAL OF TODAY’S PRACTICE SESSION

Lab # 5

  • In this Lab session, we will focus on Machine Learning (ML), as introduced in Lecture 5
  • We will review examples of both supervised and unsupervised ML algorithms
    • Supervised ML algorithms examples:
      • Logistic regression
      • Classification and regression trees (CART)
      • 🌳 Random Forest classifier
    • Unsupervised ML algorithms examples:
      • K-means Clustering
      • PCA for dimension reduction
      • PLS-DA for classification, a supervised ML alternative to PCA

ACKNOWLEDGEMENTS

The examples and datasets in this Lab session closely follow two sources:

  1. The tutorial on “Data Analytics with R” by: Brian Machut, Nathan Cornwell

  2. The tutorial on “Principal Component Analysis (PCA) in R” by: Statistics Globe

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We may also use the packages below (specifying package::function for clarity).
# Prefer non-scientific notation
options(scipen = 999)

# Load pckgs for this R session
# --- General 
library(here)     # A Simpler Way to Find Your Files
library(dplyr)    # A Grammar of Data Manipulation
library(skimr)    # Compact and Flexible Summaries of Data
library(tibble)   # Simple Data Frames
library(magrittr) # A Forward-Pipe Operator for R 
library(readr)    # Read Rectangular Text Data
library(tidyr)    # Tidy Messy Data
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
# ---Plotting & data visualization
library(ggplot2)      # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(scatterplot3d) # 3D Scatter Plot
# --- Statistics
library(MASS)       # Support Functions and Datasets for Venables and Ripley's MASS
library(factoextra) # Extract and Visualize the Results of Multivariate Data Analyses
library(FactoMineR) # Multivariate Exploratory Data Analysis and Data Mining
library(rstatix)    # Pipe-Friendly Framework for Basic Statistical Tests
library(car)        # Companion to Applied Regression
library(ROCR)       # Visualizing the Performance of Scoring Classifiers
library(pROC)      # Display and Analyze ROC Curves
# --- Tidymodels (meta package)
library(rsample)    # General Resampling Infrastructure  
library(broom)      # Convert Statistical Objects into Tidy Tibbles

LOGISTIC REGRESSION

(EXAMPLE of SUPERVISED ML ALGORITHM)

Logistic regression: review

  • Logistic regression is a classification model used with a binary response variable, e.g.:
    • yes|no, or 0|1, or True|False in a survey question;
    • success|failure in a clinical trial experiment;
    • benign|malignant in a biopsy experiment.
  • Logistic regression is a type of Generalized Linear Model (GLM): a more flexible version of linear regression that also works for categorical response variables or count data (e.g. Poisson regression).
  • When logistic regression fits the coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) to the data, it estimates them via the Maximum Likelihood Estimation method (as opposed to linear regression’s Least Squares Estimation method).

Logistic regression: logit function

If we have predictor variables like \(X_{1}\), \(X_{2}\), …, \(X_{k}\) and a binary response variable \(Y\) (where \(y_i = 0\) or \(y_i = 1\)), we need a “special” function to transform the expected value of the response variable into the \([0,1]\) outcome we’re trying to predict.

The logit function (of the GLM family) helps us determine the coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) that best fit this sort of data and it is defined as: \[ logit (p_i) = \ln\left( \frac{p_i}{1-p_i} \right)= \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} \] where \(p_i\) is the probability that \(y_i = 1\)

Via the inverse-logit function (logistic), we solve for the probability \(p_i\), given values of the predictor variables, like so:

\[ P(y_i = 1 | x_{1,i}, \ldots, x_{k,i} ) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}} \]
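Before turning to the data, here is a minimal sketch of the inverse-logit in R (the helper name inv_logit is ours; stats::plogis() is the built-in equivalent):

# Minimal sketch: the inverse-logit (logistic) function
inv_logit <- function(z) 1 / (1 + exp(-z))

inv_logit(0)          # 0.5: log-odds of 0 means a fifty-fifty probability
stats::plogis(0)      # same result, using the built-in logistic CDF
inv_logit(c(-5, 5))   # probabilities close to 0 and to 1, respectively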

DATASET #1: Heart Disease

Dataset on Heart Disease (heart_data)

Name: heart_data.csv
Documentation: Toy dataset prepared for teaching purposes. See the reference on the data in Data Analytics with R
Sampling details: This dataset contains 10,000 observations on 4 variables.



# Use `here` in specifying all the subfolders AFTER the working directory 
heart_data <- utils::read.csv(file = here::here("practice", "data_input", "05_datasets",
                                      "heart_data.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

heart_data variables with description

Variable Type Description
heart_disease int whether an individual has heart disease (1 = yes; 0 = no)
coffee_drinker int whether an individual drinks coffee regularly (1 = yes; 0 = no)
fast_food_spend dbl a numerical field corresponding to the annual spend of each individual on fast food
income dbl a numerical field corresponding to the individual’s annual income

heart_data dataset splitting

In ML modeling, it is good practice to split the data into training and testing sets.

  • We will use the training set (70%) to fit the model and then the testing set (30%) to evaluate the model’s performance.
set.seed(123)

# Obtain 2 sub-samples from the dataset: training and testing
sample  <-  sample(c(TRUE, FALSE), nrow(heart_data), replace = TRUE , prob = c(0.7, 0.3) )
heart_train  <-  heart_data[sample,]
heart_test  <-  heart_data[!sample,]
  • Which results in:
# check the structure of the resulting datasets
dim(heart_train)
[1] 7048    4
dim(heart_test)
[1] 2952    4
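As a side note, the rsample package (already loaded above) offers an equivalent split; unlike the sample()-based approach, initial_split() yields an exact 70/30 partition. A minimal sketch (the object names heart_split, heart_train_alt, heart_test_alt are just illustrative):

# Alternative split with rsample (sketch)
set.seed(123)
heart_split     <- rsample::initial_split(heart_data, prop = 0.7)
heart_train_alt <- rsample::training(heart_split)
heart_test_alt  <- rsample::testing(heart_split)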

Convert binary variables to factors

Before examining the training dataset heart_train, we convert the binary variables heart_disease and coffee_drinker to factors (for better readability).

heart_train <- heart_train %>% 
  # convert to factor with levels "Yes" and "No"
  dplyr::mutate(heart_disease = factor(heart_disease, levels = c(0, 1),
                                labels = c("No_HD", "Yes_HD")),
         coffee_drinker = factor(coffee_drinker, levels = c(0, 1),
                                 labels = c("No_Coffee", "Yes_Coffee")) 
  )

# show the first 5 rows of the dataset
heart_train[1:5,]
  heart_disease coffee_drinker fast_food_spend    income
1         No_HD      No_Coffee        1823.816 44361.625
3         No_HD      No_Coffee        2683.873 31767.139
6         No_HD     Yes_Coffee        2298.971  7491.559
7         No_HD      No_Coffee        2063.783 24905.227
9         No_HD      No_Coffee        2902.645 37468.529

Plotting Y by X1 (continuous variable)

Let’s visualize the relationship between the binary outcome variable heart_disease and the continuous predictor variable fast_food_spend.

  • geom_jitter adds a bit of randomness to the points to avoid overplotting.
  • geom_boxplot shows the distribution of the continuous variable by the binary outcome variable.
# plot the distribution of heart disease status by fast food spend
heart_train %>% 
  ggplot2::ggplot(aes(x = heart_disease, y = fast_food_spend, fill = heart_disease)) + 
  geom_jitter(aes(fill = heart_disease), alpha = 0.3, shape = 21, width = 0.25) +  
  scale_color_manual(values = c("#005ca1", "#9b2339")) + 
  scale_fill_manual(values = c("#57b7ff", "#e07689")) + 
  geom_boxplot(fill = NA, color = "black", linewidth = .7) + 
  coord_flip() +
    theme(plot.title = element_text(size = 13,face="bold", color = "#873c4a"),
        axis.text.x = element_text(size=12,face="italic"), 
        axis.text.y = element_text(size=12,face="italic"),
        legend.position = "none") + 
  labs(title = "Fast food expenditure by heart disease status") + 
  xlab("Heart Disease (Y/N)") + 
  ylab("Annual Fast Food Spend") 

Plotting Y by X1 (continuous variable)

+ The boxplots indicate that subjects with heart disease (HD =1) seem to spend higher amounts on fast food
+ Also, this sample has many more subjects without heart disease (HD = 0) than with heart disease (HD = 1)

Plotting Y by X2 (discrete variable)

Let’s see also the relationship between the binary outcome variable heart_disease and the binary predictor variable coffee_drinker.

  • First, we need a little manipulation of the data:
    • With the handy dplyr::count function we count occurrences in categorical variable(s) combinations.
heart_train %>% 
  # 1) Count the unique values per each group from 2 categorical variables' combinations
  dplyr::count(heart_disease, coffee_drinker, name = "count_by_group") %>% 
  dplyr::group_by(coffee_drinker) %>% 
  dplyr::mutate(
    total_coffee_class = sum(count_by_group),
    proportion = count_by_group / total_coffee_class) %>% 
  dplyr::ungroup() %>% 
  # 2) Filter only those with heart disease
  dplyr::filter(heart_disease == "Yes_HD") %>% 
  # 3) Plot frequency of heart disease status by coffee drinking 
  ggplot2::ggplot(aes(x = coffee_drinker, y = proportion, fill = coffee_drinker)) + 
  geom_bar(stat = "identity") + 
  scale_fill_manual(values = c("#57b7ff", "#e07689")) + 
  theme_minimal() +
  ylab("Percent with Heart Disease") +
  xlab("Coffee Drinker (Y/N)") +
  ggtitle("Figure 3: Percent of HD incidence by CoffeeDrinking class") +
  labs(fill = "Coffee Drinker") + 
  scale_y_continuous(labels = scales::percent)

Plotting Y by X2 (discrete variable)

Drinking coffee also seems associated with a higher likelihood of heart disease (HD = 1)

Linear regression wouldn’t work!

  • In principle, we could use a linear regression model to study likelihood of having Heart Disease in relation to risk factors: \[ Y = \beta_0 + \beta_1\text{(US\$ Spent Fast Food)} + \beta_2\text{(Coffee Drinker = YES)} \]

  • But we’ll see why the logistic regression model is a better option.

\[ \log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1\text{(US\$ Spent Fast Food)} + \beta_2\text{(Coffee Drinker = YES)} \]

  • Let’s compare the 2 models (for simplicity, we ignore the coffee_drinker variable for now):
# --- 1) Linear regression model
linear_mod <- lm(heart_disease ~ fast_food_spend# + coffee_drinker
                 , data = heart_data)
# --- 2) Logistic regression model
logit_mod <- glm(heart_disease ~ fast_food_spend# + coffee_drinker
                 , data = heart_data, family = "binomial")

Compute alternative models’ predictions

We can now extract the coefficients from the linear and logistic regression models, then compute the predicted outcomes from these models: lin_pred, logit_pred, and logistic_pred (i.e. the conversion from log(odds) to probability).

# --- 1) Extract coefficients from linear regression model
intercept_lin <- coef(linear_mod)[1]
fast_food_spend_lin <- coef(linear_mod)[2]

# --- 2) Extract coefficients from logit regression model
intercept_logit <- coef(logit_mod)[1]
fast_food_spend_logit <- coef(logit_mod)[2]
# (no coffee_drinker coefficient: the variable was left out of these simplified models)

# --- Estimate predicted data from different models   
heart_data <- heart_data %>%
  dplyr::mutate(
    # Convert outcome variable to factor
    heart_disease_factor = factor(heart_disease, 
                                  labels = c("No Disease (Y=0)", "Disease (Y=1)")),
    # 1) Linear model prediction
    lin_pred = intercept_lin + fast_food_spend_lin * fast_food_spend, 
    # 2) Logit model prediction
    logit_pred = intercept_logit + fast_food_spend_logit * fast_food_spend,
    # 3) Convert logit to probability (logistic model prediction)
    logistic_pred = 1 / (1 + exp(-logit_pred))) %>%
  dplyr::arrange(fast_food_spend)   

Plot alternative models’ outcomes

We can also plot the predicted outcomes from the 3 models to see how they differ.

# --- Plot  
ggplot2::ggplot(heart_data, aes(x = fast_food_spend)) +
  # Actual dataset observations (Y=0, Y=1 ) using `color =`
  geom_jitter(aes(y = heart_disease, color = heart_disease_factor), 
              width = 200, height = 0.03, alpha = 0.75, size = 2, shape = 16) + 
  # Models' predictions (smooth trends)
  geom_smooth(aes(y = lin_pred, color = "Linear Regression"), method = "lm", se = FALSE, 
              linewidth = 1.25, linetype = "dashed") +
  geom_smooth(aes(y = logit_pred, color = "Logit (Log-Odds)"), method = "lm", se = FALSE, 
              linewidth = 1.25, linetype = "dotdash") +
  geom_smooth(aes(y = logistic_pred, color = "Logistic Regression"), method = "glm", 
              method.args = list(family = "binomial"), se = FALSE, 
              linewidth = 1.25, linetype = "solid") +
  # Separate legends: color for dots, color for lines
  scale_color_manual(name = "Actual Y values & Prediction Models", 
                     values = c("No Disease (Y=0)" = "#A6A6A6", "Disease (Y=1)" = "#4c4c4c",
                                "Linear Regression" = "#d02e4c","Logit (Log-Odds)" = "#239b85", 
                                "Logistic Regression" = "#BD8723")) +
  # Define scales for the axes
  scale_x_continuous(breaks = seq(0, 6500, by = 500), limits = c(0, 6500), expand = c(0, 0))+
  scale_y_continuous(breaks = seq(-3, 3, by = .25)) +
  coord_cartesian(ylim = c(-1.25,1.25), xlim = c(0, 6500))  + theme_minimal() +
  labs(title = "Comparing Linear and Logistic Regression Predictions v. actual Y values",
       #subtitle = "(For simplicity, only fast food spend is considered)",
       y = "Y = Heart disease [0,1]", x = "Fast food spend [US$/yr]", color = "Actual Y values and Predictions")

Plot alternative models’ outcomes

+ (The actual data points are shown as the grey dots)
+ The linear model predicts values other than 0 and 1 (even outside the [0, 1] range), which poorly fit the actual data
+ The logit model predicts log(odds) ranging from -Inf to +Inf, which is not interpretable
+ The logistic model squeezes probabilities between 0 and 1, which fits the data better

Linear regression didn’t work!

  • Besides the poor fit, recall that linear regression implies certain assumptions:
    • Linear relationship between Y and the predictors X1 and X2
    • Residuals must be: 1) approximately normally distributed and 2) uncorrelated
    • Homoscedasticity: residuals should have constant variance
    • Non collinearity: predictor variables should not be highly correlated with each other


  • Assumptions that are not met in this case
# Set graphical parameters to plot side by side
par(mfrow = c(1, 2))

# Diagnostic plots
plot(linear_mod, which = 1)
plot(linear_mod, which = 2)

# Reset graphical parameters (optional, avoids affecting later plots)
par(mfrow = c(1, 1))

Linear regression didn’t work!

Figure 1: “Diagnostic plots for a hypothetical linear regression model 👎🏻”
+ a) Residuals should be randomly scattered around 0
+ b) QQ plot should be close to the diagonal line

Fitting a logistic regression model

  • [Now that we are convinced…]
  • …let’s fit a logistic regression model to the heart_train data using:
    • the glm function for Generalized Linear Models,
      • with argument family = binomial to specify logistic regression which will use logit as the link function,
      • (after ~) we include the 3 predictor variables in the model.
# Fit a logistic regression model
heart_model <- stats::glm(heart_disease ~ coffee_drinker + fast_food_spend + income,
                   data = heart_train, 
                   family = binomial(link = "logit"))

Model output

  • Table 1 (next) shows the model output, with the coefficient estimate for each predictor.
    • broom’s tidy function converts the model summary into a more readable data frame.
    • The odds ratio (= exponentiated coefficient estimate) is more interpretable than the coefficient itself.
# Convert model's output summary into data frame
heart_model_coef <- broom::tidy(heart_model) %>% 
  # improve readability of significance levels
  dplyr::mutate('signif. lev.' = case_when(
    `p.value` < 0.001 ~ "***",
    `p.value` < 0.01 ~ "**",
    `p.value` < 0.05 ~ "*",
    TRUE ~ ""))%>%
  # add odds ratio column
  dplyr::mutate(odds_ratio = exp(estimate)) %>%
  dplyr::relocate(odds_ratio, .after = estimate) %>%
  dplyr::mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  # format as table
  knitr::kable() %>% 
  # reduce font size
  kable_styling(font_size = 20) %>% 
  # add table title
  kableExtra::add_header_above(c("Logistic Regression Analysis of Heart Disease Risk Factors"= 7))

heart_model_coef

Model output

Table 1: Logistic regression model output

+ estimates of coefficients are in the form of natural logarithm of the odds (log(odds)) of the event happening (Heart Disease)
+ a positive estimate indicates an increase in the odds of having Heart Disease
+ a negative estimate indicates a decrease in the odds of having Heart Disease
+ odds ratio = the exponentiated coefficient estimate

Logistic Regression Analysis of Heart Disease Risk Factors
term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0.0000 ***
coffee_drinkerYes_Coffee -0.7296 0.4821 0.2910 -2.5071 0.0122 *
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0.0000 ***
income 0.0000 1.0000 0.0000 -0.2299 0.8182

Interpreting the logistic coefficients

term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0.0000 ***
coffee_drinkerYes_Coffee -0.7296 0.4821 0.2910 -2.5071 0.0122 *
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0.0000 ***
income 0.0000 1.0000 0.0000 -0.2299 0.8182
  • Intercept: gives the log-odds of heart disease when all predictor variables are zero. This is not generally interpreted, but the highly negative value suggests very low probability of heart disease in the sample of reference.
    • (If interpreted) the intercept term -11.0554 would translate as probability \(P = e^{-11.0554} / (1 + e^{-11.0554}) \approx 0.00002\) or \(0.002\%\).
    • …which means: when \(\text{coffee_drinker} = NO\), and \(\text{fast_food_spend} = 0\), and \(\text{income} = 0\), the probability of heart disease is as low as \(0.002\%\).
  • Income: Based on a p-value = 0.8182, we conclude that income is not significantly associated with heart disease. (Anyhow, the odds ratio of ≈1 would suggest no change in odds based on income.)

The coefficient of fast food $$ 🍔🍟

term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0 ***
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0 ***
  • The positive coefficient of Fast Food Annual Spending (US$), 0.0024 (highly statistically significant as p-value = 0.0000), suggests a positive association with heart disease.

  • This means that for each additional \(\Delta X_1 = +1 \; US\$\) spent on fast food annually:

    • the log(odds) of heart disease increases by: \(\Delta \log(\text{odds}) = \beta_{1} \times \Delta X_1 = 0.0024 \times 1 = 0.0024\)

    • the odds of heart disease (with spending) increase by a factor of: \(OR = e^{0.0024} \approx 1.0024\) (compared to without spending).

  • The probability of heart disease can be computed using the logistic function: \(P_{HD=1} = \frac{e^{\beta_0 + (\beta_1 \times X_1)}}{1 + e^{\beta_0 + (\beta_1 \times X_1)}}\)
    • Solving for \(\beta_0 = -11.0554\), \(\beta_1 = 0.0024\) and \(X_1 = 1\) gives:
    \[P_{HD=1} = \frac{e^{-11.0554 + (0.0024 \times 1)}}{1 + e^{-11.0554 + (0.0024 \times 1)}} = 0.00001583981\]

which indicates that the probability of heart disease is as low as \(0.00158\%\) when fast_food_spend \(= 1\; US\$\).

The coefficient of fast food $$ 🍔🍟

term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0 ***
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0 ***
  • However, considering that this is annual spending, we should use a more adequate scale!

  • For example, for an additional \(\Delta X_1 = +100 \; US\$\) spent on fast food annually:

    • the log(odds) of heart disease increases by: \(\Delta \log(\text{odds}) = \beta_{1} \times \Delta X_1 = 0.0024 \times 100 = 0.24\)

    • the odds of heart disease (with spending) increase by a factor of: \(OR = e^{0.24} \approx 1.271\) (compared to without spending).

  • The probability of heart disease can be computed using the logistic function: \(P_{HD=1} = \frac{e^{\beta_0 + (\beta_1 \times X_1)}}{1 + e^{\beta_0 + (\beta_1 \times X_1)}}\)

  • Solving for \(\beta_0 = -11.0554\), \(\beta_1 = 0.0024\) and \(X_1 = 100\) gives:
    \[P_{HD=1} = \frac{e^{-11.0554 + (0.0024 \times 100)}}{1 + e^{-11.0554 + (0.0024 \times 100)}} \approx 0.00002008\]

    which (this time) indicates a still very low probability of heart disease, \(0.002008\%\), when fast_food_spend \(= 100\; US\$\).
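As a quick sanity check, the same probabilities can be computed with stats::plogis(), the built-in inverse-logit. A sketch using the rounded coefficients from Table 1:

# Check the hand calculations with the built-in inverse-logit
b0 <- -11.0554   # intercept (rounded as in Table 1)
b1 <- 0.0024     # fast_food_spend coefficient (rounded)

stats::plogis(b0 + b1 * 1)    # ~0.0000158, i.e. ~0.00158% at X1 = 1 US$
stats::plogis(b0 + b1 * 100)  # ~0.0000201, i.e. ~0.00201% at X1 = 100 US$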

The coefficient of coffee drinking ☕️

term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.604 -18.3051 0.0000 ***
coffee_drinkerYes_Coffee -0.7296 0.4821 0.291 -2.5071 0.0122 *
  • The negative coefficient of Coffee Drinker (=YES), -0.7296 expressed in log-odds, means that coffee drinkers have lower odds of having heart disease (relative to non-coffee drinkers).
  • This effect is statistically significant as p-value = 0.0122.
  • Transforming the estimate into odds ratio, we obtain \(\text{O.R.} = e^{-0.7296} = 0.48\) , meaning coffee-drinkers have 0.48 (or 48%) the odds that non-coffee drinkers have to experience heart disease (holding all other predictors fixed). \[ \text{Odds Ratio} = e^{-0.7296} = 0.48 \]
    • How much is the drop of the odds? Since O.R. is < 1, I have a percentage decrease:
      \((1 - 0.48) = 0.52\) or \(52\%\) lower odds of having heart disease compared to non-coffee drinkers.
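The same numbers can be pulled directly from the fitted model. A minimal sketch (the coefficient name matches the model output above):

# Odds ratio for coffee drinking, straight from the model
or_coffee <- exp(coef(heart_model)["coffee_drinkerYes_Coffee"])
or_coffee               # ~0.48
(1 - or_coffee) * 100   # ~52% lower odds for coffee drinkers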

Wait, is drinking coffee good or bad? 🤔

Our plot above showed that there was a higher proportion of coffee drinkers with heart disease as compared to non-coffee drinkers. However, our model just told us that coffee drinking is associated with a decrease in the likelihood of having heart disease.

❓ How can that be❓

It’s because coffee_drinker and fast_food_spend are correlated. On its own, coffee drinking would appear to be associated with heart disease, but only because it is also associated with fast food spending, which our model tells us is the real contributor to heart disease.

DIGRESSION

Understanding Odds

  • Odds represent the ratio of the probability of an event occurring to the probability of it not occurring:
    • Example: If the probability of heart disease is 0.25 (25%), then the odds are:

\[ \text{Odds} = \frac{P(\text{event})}{1 - P(\text{event})} \]

\[ \frac{0.25}{1 - 0.25} = \frac{0.25}{0.75} = 0.33 \]

This means that for every 1 person without heart disease, there are 0.33 people with it.
(Or equivalently: for every 3 people without heart disease, 1 person has it.)
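These conversions are one-liners in R; a tiny sketch (the helper names are ours):

# Probability <-> odds conversions
odds_from_prob <- function(p) p / (1 - p)
prob_from_odds <- function(o) o / (1 + o)

odds_from_prob(0.25)  # 0.33...
prob_from_odds(1/3)   # 0.25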

Understanding Odds Ratio

  • Odds Ratio (O.R.) compares the odds of an event occurring in one group (exposed) relative to another group (reference).

    • Example: If the odds of heart disease for coffee drinkers are 0.48, and for non-coffee drinkers are 1, then the odds ratio is:

\[\text{Odds Ratio} = \frac{\text{Odds}_{\text{exposed}}}{\text{Odds}_{\text{reference}}} = \frac{\text{Odds}_{\text{coffee drinkers}}}{\text{Odds}_{\text{non-coffee drinkers}}} = \frac{0.48}{1} = 0.48\]

  • Interpretation:
    • If O.R. > 1, the event is more likely in the exposed group.
    • If O.R. < 1, the event is less likely in the exposed group.
    • If O.R. = 1, there is no difference between groups.

Example: Coffee Drinkers vs. Non-Coffee Drinkers

If the odds of heart disease are:

  • Coffee drinkers (exposed group): \(0.48\)
  • Non-coffee drinkers (reference group): \(1\)

Then the Odds Ratio is:

\[\text{Odds Ratio} = \frac{\text{Odds}_{\text{coffee drinkers}}}{\text{Odds}_{\text{non-coffee drinkers}}} = \frac{0.48}{1} = 0.48\]

This means coffee drinkers have 52% lower odds of heart disease than non-coffee drinkers.

Odds Ratio (O.R.) Example O.R. Formula for % Change in Odds Calculation Interpretation
O.R. < 1 0.48 (1 - O.R.) × 100 (1 - 0.48) × 100 = 52% 52% lower odds of heart disease in the exposed group (e.g. coffee drinkers vs. non-coffee drinkers)
O.R. > 1 2.08 (O.R. - 1) × 100 (2.08 - 1) × 100 = 108% 108% higher odds of heart disease in the exposed group vs. the reference group

Making predictions from logistic regression model ✍🏻

Let’s use the logistic regression model to make predictions (on the training data). First, we’ll do it manually:

  • Consider an individual that drinks coffee, spends US$5,000 on fast food per year, and has an annual income of US$50,000.

  • We can calculate the probability of this individual having heart disease using our logistic regression model:

\[p_i(HD=YES) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3)}} = \]

\[ = \frac{1}{1 + e^{-(-11.0554 -0.7296 \times 1 + 0.0024 \times 5000 -0.0000023 \times 50000 )}} \approx 0.4951735\]

(the numerical result uses the full-precision coefficients shown on the next slide; plugging in the rounded values displayed here would give a slightly different number)

Making predictions from logistic regression model ✍🏻

# Define values for individual case 
x1 <- 1
x2 <- 5000
x3 <- 50000

# Coefficients from our logistic regression model
# (mind the decimals!)
beta_0 <- -11.055360943337
b1 <- -0.729562678149
b2 <- 0.002376170127
b3 <- -0.000002304669 

# Compute the exponent part
exponent <- beta_0 + (b1 * x1) + (b2 * x2) + (b3 * x3)

# Compute the predicted probability
probability <- 1 / (1 + exp(-exponent))

# Print the result
print(probability) # 0.4951735
[1] 0.4951735

Making predictions from logistic regression model 👩🏻‍💻

Now, let’s use the stats::predict() function to make a prediction for the same individual.

# Define values for individual case (as df)
individual <- data.frame(
  coffee_drinker = "Yes_Coffee", # x1 (factor)  
  fast_food_spend = 5000,  # x2  
  income = 50000) # x3 

# Make a prediction for individual outcome
predict(heart_model, individual, type = "response") # 0.4951735 
        1 
0.4951735 

Adding predictions to the training data

In fact, we can use the stats::predict() function to make predictions for ALL the individuals observed in the training data heart_train.

  • heart_train$heart_disease_pred = added column with the predicted probabilities of heart disease for each individual.
# Add calculated predictions to the whole training dataset
heart_train$heart_disease_pred <- predict(heart_model, type = "response")

Let’s take a look at a few obtained predictions for some individuals in the training data.

# Subset of 3 with heart_disease = Yes_HD
heart_train[heart_train$heart_disease == "Yes_HD", ][1:3,] 
    heart_disease coffee_drinker fast_food_spend   income heart_disease_pred
207        Yes_HD      No_Coffee        4723.998 48956.17         0.51420472
210        Yes_HD     Yes_Coffee        4748.477 20655.20         0.36601782
242        Yes_HD     Yes_Coffee        3932.141 14930.18         0.07756531
# Subset of 3 with heart_disease = No_HD
heart_train[heart_train$heart_disease == "No_HD", ][1:3,]
  heart_disease coffee_drinker fast_food_spend    income heart_disease_pred
1         No_HD      No_Coffee        1823.816 44361.625        0.001086288
3         No_HD      No_Coffee        2683.873 31767.139        0.008566989
6         No_HD     Yes_Coffee        2298.971  7491.559        0.001762176

Converting predictions into classifications

For simplicity, we will establish a threshold of \(0.5\) to classify the predictions as either Yes_HD (if \(pred>0.5\)) or No_HD.

  • using the R base function ifelse(test, yes, no)
# Convert predictions to classifications
heart_train <- heart_train %>%
  dplyr::mutate(heart_disease_pred_class = ifelse(heart_disease_pred > 0.5, 
                                           "pred_YES", "pred_NO"))
# (... same BASE WAY)
# heart_train$heart_disease_pred_class <- ifelse(heart_train$heart_disease_pred > 0.5, 1, 0)

Evaluating performance with the confusion matrix

Recall how the confusion matrix helps in evaluating a model’s performance.

Confusion matrix with threshold of 0.5

Then we compare the predicted classification to the actual training data.

# Confusion matrix at 0.5
confusion_matrix <- table(heart_train$heart_disease, heart_train$heart_disease_pred_class)

confusion_matrix %>% 
  knitr::kable()
pred_NO pred_YES
No_HD 6790 30
Yes_HD 148 80

Sensitivity, Specificity, Accuracy

  • Sensitivity (True Positive Rate): the proportion of actual positives that are correctly identified.
  • Specificity (True Negative Rate): the proportion of actual negatives that are correctly identified.
  • Accuracy: the proportion of true results (both true positives and true negatives) in the population.
sensitivity <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ]) # 80/228
specificity <- confusion_matrix[1, 1] / sum(confusion_matrix[1, ]) # 6790/6820
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix) # 6870/7048

# Print the results
sensitivity # 0.3508772
[1] 0.3508772
specificity # 0.9956012
[1] 0.9956012
accuracy # 0.9747446
[1] 0.9747446

Making predictions on test data

An even more meaningful step is to evaluate the model on the test data (heart_test), which (in principle) the model has not seen before.

# Prep test data to match training data
heart_test <- heart_test %>% 
  # convert to factor with levels "Yes" and "No"
  dplyr::mutate(heart_disease = factor(heart_disease, levels = c(0, 1),
                                labels = c("No_HD", "Yes_HD")),
         coffee_drinker = factor(coffee_drinker, levels = c(0, 1),
                                 labels = c("No_Coffee", "Yes_Coffee")) 
  ) 

heart_test_pred <- heart_test %>%
  # add prediction as column 
  mutate(heart_disease_pred = predict(heart_model, newdata = heart_test, type = "response"))

Confusion matrix with threshold of 0.5

Once again, we use a generic \(0.5\) threshold to classify the predictions (in test data).
Then we compare the predicted classification to the actual ones.

# Convert predictions to classifications
heart_test_pred$heart_disease_pred_class <- ifelse(
  heart_test_pred$heart_disease_pred > 0.5, "pred_YES", "pred_NO")
 
# Confusion matrix at 0.5
confusion_matrix_test <- table(heart_test_pred$heart_disease,
                               heart_test_pred$heart_disease_pred_class)

# Print the confusion matrix
confusion_matrix_test %>% 
  knitr::kable()
pred_NO pred_YES
No_HD 2838 9
Yes_HD 80 25

Sensitivity, Specificity, Accuracy (on test data)

  • Sensitivity: True Positive Rate
  • Specificity: True Negative Rate
  • Accuracy: Proportion of both true positives and true negatives in the population.
sensitivity_test <- confusion_matrix_test[2, 2] / sum(confusion_matrix_test[2, ]) # 25/105
specificity_test <- confusion_matrix_test[1, 1] / sum(confusion_matrix_test[1, ]) # 2838/2847
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test) # 2863/2952

# Print the results
sensitivity_test # 0.2380952
[1] 0.2380952
specificity_test # 0.9968388
[1] 0.9968388
accuracy_test    # 0.9698509
[1] 0.9698509
  • 🤔 Sensitivity is really quite low, which means that the model is not so good at identifying those who have heart disease.
    • This is quite problematic in a medical context, where it is critical to identify as many “positive” cases as possible.

Performance at different classification thresholds

What if we change the threshold to \(0.25\) or \(0.75\)? How would the model perform?

# Set different thresholds options
thresh_025 <- 0.25
thresh_075 <- 0.75

# Classify the predictions based on such alternative threshold
heart_test_pred <-  heart_test_pred %>% 
  mutate(heart_disease_pred_class_025 = ifelse(heart_disease_pred > thresh_025,
                                               "pred_YES", "pred_NO"),
         heart_disease_pred_class_075 = ifelse(heart_disease_pred > thresh_075,
                                               "pred_YES", "pred_NO"))
 
# Recompute confusion matrices for different thresholds
confusion_matrix_test_025 <- table(heart_test_pred$heart_disease,
                                   heart_test_pred$heart_disease_pred_class_025)
confusion_matrix_test_075 <- table(heart_test_pred$heart_disease,
                                   heart_test_pred$heart_disease_pred_class_075)

# Calculate sensitivity and specificity for each threshold
sensitivity_test_025 <- confusion_matrix_test_025[2, 2] / sum(confusion_matrix_test_025[2, ])
sensitivity_test_075 <- confusion_matrix_test_075[2, 2] / sum(confusion_matrix_test_075[2, ])
specificity_test_025 <- confusion_matrix_test_025[1, 1] / sum(confusion_matrix_test_025[1, ])
specificity_test_075 <- confusion_matrix_test_075[1, 1] / sum(confusion_matrix_test_075[1, ])
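Since we keep repeating the same three steps, a small helper can compute sensitivity and specificity at any threshold. A sketch (the function name is ours; it assumes both predicted classes occur at the chosen threshold):

# Helper: sensitivity & specificity at an arbitrary threshold
classify_metrics <- function(actual, pred_prob, threshold) {
  pred_class <- ifelse(pred_prob > threshold, "pred_YES", "pred_NO")
  cm <- table(actual, pred_class)
  c(threshold   = threshold,
    sensitivity = cm["Yes_HD", "pred_YES"] / sum(cm["Yes_HD", ]),
    specificity = cm["No_HD",  "pred_NO"]  / sum(cm["No_HD", ]))
}

# Apply it to several thresholds at once
sapply(c(0.25, 0.50, 0.75),
       function(t) classify_metrics(heart_test_pred$heart_disease,
                                    heart_test_pred$heart_disease_pred, t))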

Confusion matrices at different classification thresholds

Confusion matrix at threshold 0.25
pred_NO pred_YES
No_HD 2791 56
Yes_HD 54 51



Better threshold in terms of TRUE POSITIVES (sensitivity of the test goes from 0.238 to 0.486).

Confusion matrix at threshold = 0.75
pred_NO pred_YES
No_HD 2844 3
Yes_HD 94 11



Better threshold in terms of TRUE NEGATIVES (specificity of the test goes from 0.996 to 0.999).

Performance at different classification thresholds

As expected, a lower threshold (\(0.25\)) can “catch” more of the “TRUE POSITIVES”, hence it has a higher sensitivity but a lower specificity.

Threshold of 0.25

# Print the classification results at the threshold 0.25 
sensitivity_test_025 # 0.4857143 (a.k.a. TRUE POSITIVE rate)
[1] 0.4857143
specificity_test_025 # 0.9803302 (a.k.a. TRUE NEGATIVE rate)
[1] 0.9803302


Threshold of 0.75

Not surprisingly, the highest threshold (\(0.75\)) has the worst sensitivity.

# Print the classification results at the threshold 0.75
sensitivity_test_075 # 0.1047619 (a.k.a. TRUE POSITIVE rate)
[1] 0.1047619
specificity_test_075 # 0.9989463 (a.k.a. TRUE NEGATIVE rate)
[1] 0.9989463

ROC curve may help deciding

  • The ROC (“Receiver Operating Characteristic”) curve represents the trade-off between the True Positive Rate (sensitivity) and the False Positive Rate (1 − specificity) at various thresholds.

  • We use the pROC package:

    1. roc prepares the data for the ROC curve with confidence intervals.
    2. coords extracts the best threshold value, with corresponding specificity and sensitivity – which maximizes the Youden index (Sensitivity + Specificity - 1).
    3. auc computes the AUC (Area Under the Curve) of the ROC curve – how well the model can distinguish between classes
# Create the data needed for a ROC curve with confidence intervals
roc_curve_data <- pROC::roc(heart_test_pred$heart_disease,
                       heart_test_pred$heart_disease_pred,
                       ci = TRUE, # Compute confidence intervals
                       thresholds = "best", # Select the best threshold
                       print.thres = "best") 

# Extract the best threshold value directly from roc_curve_data
best_threshold <- pROC::coords(roc_curve_data, 
                         # looks for the best threshold
                         x = "best", 
                         # and returns: 
                         ret=c("threshold","specificity", "sensitivity"))
# Print the best threshold
best_threshold
  threshold specificity sensitivity
1 0.0225225   0.8401826   0.9142857
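As a follow-up sketch, we could re-classify the test predictions at this ROC-optimal threshold and inspect the resulting confusion matrix (the column name pred_class_best is just illustrative):

# Re-classify test predictions at the ROC-optimal threshold
heart_test_pred$pred_class_best <- ifelse(
  heart_test_pred$heart_disease_pred > best_threshold$threshold,
  "pred_YES", "pred_NO")

table(heart_test_pred$heart_disease, heart_test_pred$pred_class_best)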

ROC curve may help deciding

# Use function to store the plot as an object
roc_plot <- function() {
  plot(roc_curve_data, col = "#00589b", lwd = 2, 
       # main = "ROC Curve for Heart Disease Prediction",
       # sub = "Logistic regression model on test data",
       xlab = "(Specificity) False Positive Rate", 
       ylab = "(Sensitivity) True Positive Rate")
  # Add a horizontal line at the best threshold
  abline(h = best_threshold$sensitivity, col = "#cd6882", lty = 2)
  # Add the AUC text
  text(0.5, 0.5, paste("AUC =", round(pROC::auc(roc_curve_data), 4)), 
       col = "#887455", cex = 1.5)
}

# Call the function to plot the ROC curve
roc_plot()

ROC curve may help deciding

ROC Curve for Heart Disease Prediction: Logistic regression model on test data
(Ideally, the curve is as close to the top-left corner as possible.)

Conclusions

  • As confirmed also graphically by the ROC curve, a low prediction threshold of \(0.02\) would seem the best compromise: it achieves a high sensitivity (\(0.9142857\)) –of utmost importance in clinical tests– without compromising too much in terms of specificity (\(0.8401826\)).

  • However, such a low threshold would also increase the number of False Positives (people wrongly classified as having heart disease), which could lead to unnecessary treatments or interventions.

  • The choice of the threshold should be made considering the specific context and the consequences of False Positives and False Negatives.

  • Also, the sample distribution (i.e. having very few cases of heart disease in the observed cases) has a significant impact on the model’s prediction performance.

PCA: EXAMPLE of UNSUPERVISED ML ALGORITHM

Reducing high-dimensional data to a lower number of variables

DATASET #2: biopsy

Dataset on Breast Cancer Biopsy

Name: Biopsy Data on Breast Cancer Patients
Documentation: See reference on the data downloaded and conditioned for R here https://cran.r-project.org/web/packages/MASS/MASS.pdf
Sampling details: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. The dataset contains the original Wisconsin breast cancer data with 699 observations on 11 variables.

Importing Dataset biopsy

  • The data can be interactively obtained from the MASS R package
# (need to load MASS pkg before)
library(MASS)  

# ... so I can call 
utils::data(biopsy)
  • I can find info on the dataset using the ? operator
# Find info on the dataset
?biopsy

biopsy variables with description

Variable Type Description
ID character Sample ID
V1 integer 1 - 10 clump thickness
V2 integer 1 - 10 uniformity of cell size
V3 integer 1 - 10 uniformity of cell shape
V4 integer 1 - 10 marginal adhesion
V5 integer 1 - 10 single epithelial cell size
V6 integer 1 - 10 bare nuclei (16 values are missing)
V7 integer 1 - 10 bland chromatin
V8 integer 1 - 10 normal nucleoli
V9 integer 1 - 10 mitoses
class factor benign or malignant

biopsy dataset manipulation

We will:

  • exclude the non-numerical variables (id and class) before conducting the PCA.

  • exclude the individuals with missing values using the na.omit() or filter(complete.cases(.)) functions.

  • We can do both in 2 equivalent ways:


with base R (more compact)

#--- new (manipulated) dataset 
data_biopsy <- na.omit(biopsy[,-c(1,11)])

with dplyr (more explicit)

# --- new (manipulated) dataset 
data_biopsy <- biopsy %>% 
  # drop incomplete & non-integer columns
  dplyr::select(-ID, -class) %>% 
  # drop incomplete observations (rows)
  dplyr::filter(complete.cases(.))

biopsy dataset manipulation

We obtained a new dataset with 9 variables and 683 observations (instead of the original 699).

# check reduced dataset 
str(data_biopsy)
'data.frame':   683 obs. of  9 variables:
 $ V1: int  5 5 3 6 4 8 1 2 2 4 ...
 $ V2: int  1 4 1 8 1 10 1 1 1 2 ...
 $ V3: int  1 4 1 8 1 10 1 2 1 1 ...
 $ V4: int  1 5 1 1 3 8 1 1 1 1 ...
 $ V5: int  2 7 2 3 2 7 2 2 2 2 ...
 $ V6: int  1 10 2 4 1 10 10 1 1 1 ...
 $ V7: int  3 3 3 3 3 9 3 3 1 2 ...
 $ V8: int  1 2 1 7 1 7 1 1 1 1 ...
 $ V9: int  1 1 1 1 1 1 1 1 5 1 ...

Calculate Principal Components

The first step of PCA is to calculate the principal components. To accomplish this, we use the prcomp() function from the stats package.

  • With argument “scale = TRUE” each variable in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components (just like option Autoscaling in MetaboAnalyst)
# calculate principal component
biopsy_pca <- prcomp(data_biopsy, 
                     # standardize variables
                     scale = TRUE)
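As a check (a sketch), scale = TRUE should be equivalent to autoscaling the data first and then running prcomp() on the standardized values:

# Equivalence check: pre-scaling vs. scale = TRUE
biopsy_pca_manual <- prcomp(scale(data_biopsy))
all.equal(biopsy_pca_manual$sdev, biopsy_pca$sdev)  # expected: TRUE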

Analyze Principal Components

Let’s check out the elements of our obtained biopsy_pca object

  • (All accessible via the $ operator)
names(biopsy_pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"       

“sdev” = the standard deviation of the principal components

“sdev”^2 = the variance of the principal components (eigenvalues of the covariance/correlation matrix)

“rotation” = the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

“center” and “scale” = the means and standard deviations of the original variables before the transformation;

“x” = the principal component scores (after PCA the observations are expressed in principal component scores)
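These relationships are easy to verify in R; a small sketch of two cross-checks:

# Cross-checks of the prcomp() elements
# sdev^2 are the eigenvalues of the correlation matrix (variables were scaled)
all.equal(eigen(cor(data_biopsy))$values, biopsy_pca$sdev^2)   # expected: TRUE
# "center" holds the original column means
all.equal(biopsy_pca$center, colMeans(data_biopsy))            # expected: TRUE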

Analyze Principal Components (cont.)

We can see the summary of the analysis using the summary() function

  1. The first row gives the Standard deviation of each component, which can also be retrieved via biopsy_pca$sdev.
  2. The second row shows the Proportion of Variance, i.e. the percentage of explained variance.
summary(biopsy_pca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
                           PC8     PC9
Standard deviation     0.51062 0.29729
Proportion of Variance 0.02897 0.00982
Cumulative Proportion  0.99018 1.00000

Proportion of Variance for components

  1. The row with Proportion of Variance can be accessed either from the summary or calculated as follows:
# a) Extracting Proportion of Variance from summary
summary(biopsy_pca)$importance[2,]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9 
0.65550 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982 
# b) (same thing)
round(biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2), digits = 5)
[1] 0.65550 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982


The output suggests the 1st principal component explains around 65% of the total variance, the 2nd principal component explains about 9% of the variance, and this goes on with diminishing proportion for each component.

Cumulative Proportion of variance for components

  1. The last row of summary(biopsy_pca) shows the Cumulative Proportion of variance, i.e. the cumulative sum of the Proportion of Variance.
# Extracting Cumulative Proportion from summary
summary(biopsy_pca)$importance[3,]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9 
0.65550 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121 0.99018 1.00000 


Once you have computed the PCA in R, you must decide how many components to retain based on the obtained results; two common rules of thumb are sketched below.
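Two common informal retention rules, sketched in R (the 80% cutoff is just an example choice):

# a) Keep enough components to reach a cumulative variance target (e.g. 80%)
expl_var <- biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2)
which(cumsum(expl_var) >= 0.80)[1]   # index of the first PC reaching the cutoff

# b) Kaiser criterion: keep components with eigenvalue (sdev^2) > 1
sum(biopsy_pca$sdev^2 > 1)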

VISUALIZING PCA OUTPUTS

Scree plot

There are several ways to decide on the number of components to retain.

  • One helpful option is visualizing the percentage of explained variance per principal component via a scree plot.
    • Plotting with the fviz_eig() function from the factoextra package
# Scree plot shows the variance of each principal component 
factoextra::fviz_eig(biopsy_pca, 
                     addlabels = TRUE, 
                     ylim = c(0, 70))


Visualization is essential for interpreting PCA results. Based on the number of retained principal components (usually the first few), the observations, expressed as component scores, can be plotted in several ways.

Scree plot

The obtained scree plot simply visualizes the output of summary(biopsy_pca).

Principal Component Scores

After a PCA, the observations are expressed as principal component scores.

  1. We can retrieve the principal component scores for each observation by calling biopsy_pca$x, and store them in a new dataframe PC_scores.
  2. Next we draw a scatterplot of the observations – expressed in terms of principal components
# Create new object with PC_scores
PC_scores <- as.data.frame(biopsy_pca$x)
head(PC_scores)

It is also important to visualize the observations along the new axes (principal components) to interpret the relations in the dataset:

Principal Component Scores

        PC1         PC2         PC3         PC4         PC5         PC6
1  1.469095 -0.10419679  0.56527102  0.03193593 -0.15088743 -0.05997679
2 -1.440990 -0.56972390 -0.23642767  0.47779958  1.64188188  0.48268150
3  1.591311 -0.07606412 -0.04882192  0.09232038 -0.05969539  0.27916615
4 -1.478728 -0.52806481  0.60260642 -1.40979365 -0.56032669 -0.06298211
5  1.343877 -0.09065261 -0.02997533  0.33803588 -0.10874960 -0.43105416
6 -5.010654 -1.53379305 -0.46067165 -0.29517264  0.39155544 -0.11527442
         PC7        PC8          PC9
1 -0.3491471 -0.4200360 -0.005687222
2  1.1150819 -0.3792992  0.023409926
3 -0.2325697 -0.2096465  0.013361828
4  0.2109599  1.6059184  0.182642900
5 -0.2596714 -0.4463277 -0.038791241
6 -0.3842529  0.1489917 -0.042953075

Principal Component Scores plot (adding label variable)

  1. When data includes a factor variable, like in our case, it may be interesting to show the grouping on the plot as well.
  • In such cases, the label variable class can be added to the PC set as follows.
# retrieve class variable
biopsy_no_na <- na.omit(biopsy)
# adding class grouping variable to PC_scores
PC_scores$Label <- biopsy_no_na$class


The visualization of the observation points (point cloud) could be in 2D or 3D.

Principal Component Scores plot (2D)

The Scores Plot can be visualized via the ggplot2 package.

  • grouping is indicated by the argument color = Label;
  • geom_point() is used for the point cloud.
ggplot2::ggplot(PC_scores, 
       aes(x = PC1, 
           y = PC2, 
           color = Label)) +
  geom_point() +
  scale_color_manual(values=c("#245048", "#CC0066")) +
  ggtitle("Figure 1: Scores Plot") +
  theme_bw()

Principal Component Scores plot (2D)

Figure 1 shows the observations projected into the new data space made up of principal components

Principal Component Scores (2D Ellipse Plot)

Confidence ellipses can also be added to a grouped scatter plot visualized after a PCA. We use the ggplot2 package.

  • grouping is indicated by the argument color = Label;
  • geom_point() is used for the point cloud;
  • the stat_ellipse() function is called to add the ellipses per biopsy group.
ggplot2::ggplot(PC_scores, 
       aes(x = PC1, 
           y = PC2, 
           color = Label)) +
  geom_point() +
  scale_color_manual(values=c("#245048", "#CC0066")) +
  stat_ellipse() + 
  ggtitle("Figure 2: Ellipse Plot") +
  theme_bw()

Principal Component Scores (2D Ellipse Plot)

Figure 2 shows the observations projected into the new data space made up of principal components, with 95% confidence regions displayed.

Principal Component Scores plot (3D)

A 3D scatterplot of observations shows the first 3 principal components’ scores.

  • For this one, we need the scatterplot3d() function of the scatterplot3d package;
  • The color argument assigned to the Label variable;
  • To add a legend, we use the legend function and specify its coordinates via the xyz.convert() function.
# 3D scatterplot ...
plot_3d <- with(PC_scores, 
                scatterplot3d::scatterplot3d(PC_scores$PC1, 
                                             PC_scores$PC2, 
                                             PC_scores$PC3, 
                                             color = as.numeric(Label), 
                                             pch = 19, 
                                             main ="Figure 3: 3D Scatter Plot", 
                                             xlab="PC1",
                                             ylab="PC2",
                                             zlab="PC3"))

# ... + legend
legend(plot_3d$xyz.convert(0.5, 0.7, 0.5), 
       pch = 19, 
       yjust=-0.6,
       xjust=-0.9,
       legend = levels(PC_scores$Label), 
       col = seq_along(levels(PC_scores$Label)))

Principal Component Scores plot (3D)

Figure 3 shows the observations projected into the new 3D data space made up of principal components.

Biplot: principal components v. original variables

Next, we create another special type of scatterplot (a biplot) to understand the relationship between the principal components and the original variables.
In the biplot each of the observations is projected onto a scatterplot that uses the first and second principal components as the axes.

  • For this plot, we use the fviz_pca_biplot() function from the factoextra package
    • We will specify the color for the variables, or rather, for the “loading vectors”
    • The habillage argument allows to highlight with color the grouping by class
factoextra::fviz_pca_biplot(biopsy_pca, 
                repel = TRUE,
                col.var = "black",
                habillage = biopsy_no_na$class,
                title = "Figure 4: Biplot", geom="point")

Biplot: principal components v. original variables

The axes show the principal component scores, and the vectors are the loading vectors

Interpreting biplot output

Biplots have two key elements: scores (the 2 axes) and loadings (the vectors). As in the scores plot, each point represents an observation projected in the space of principal components where:

  • Biopsies of the same class are located closer to each other, which indicates that they have similar scores on the first two principal components;
  • The loading vectors show strength and direction of association of original variables with new PC variables.

As expected from PCA, PC1 alone accounts for variance in almost all the original variables, while V9 projects mainly onto PC2.

Interpreting biplot output (cont.)

scores <- biopsy_pca$x

loadings <- biopsy_pca$rotation
# excerpt of first 2 components
loadings[ ,1:2] 
          PC1         PC2
V1 -0.3020626 -0.14080053
V2 -0.3807930 -0.04664031
V3 -0.3775825 -0.08242247
V4 -0.3327236 -0.05209438
V5 -0.3362340  0.16440439
V6 -0.3350675 -0.26126062
V7 -0.3457474 -0.22807676
V8 -0.3355914  0.03396582
V9 -0.2302064  0.90555729
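As a final sketch, the scores in biopsy_pca$x are simply the autoscaled data multiplied by the loading matrix:

# Scores = (centered & scaled data) %*% loadings
scores_manual <- scale(data_biopsy) %*% biopsy_pca$rotation
all.equal(unname(scores_manual), unname(biopsy_pca$x))  # expected: TRUE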

Recap of the lab’s content

  • While we’ve only explored a small portion of the Machine Learning toolbox, we have covered two fundamental approaches and gained familiarity with the ML workflow and some key practical tools that extend to more advanced analyses.

  • Logistic Regression exemplifies Supervised Learning, where models learn from labeled data to make predictions—particularly useful for binary classification.

  • PCA (Principal Component Analysis) represents Unsupervised Learning, helping to reduce dimensionality, visualize high-dimensional data, and uncover hidden structures.

  • Analytical choices depend on the context and data—whether it’s setting a classification threshold in Logistic Regression or selecting the number of components in PCA to balance information retention and interpretability.