Practical 5

Lecture

Lab

Materials from this section can be downloaded from here.

Example Script

You can download the example_script.R here.

library(ggplot2)
library(dplyr)

#Script for modelling iris species based on sepal characteristics
## Iris Dataset Modelling
#The iris dataset is composed of 50 flowers from three different species. 
#Measurements include width and length (centimeters) for petals and sepals of each flower.

#Filtering out versicolor to run a binary example
#focus is on viridis and seratosa species
mini_iris <- iris %>%
  filter(Species != 'versicolor')



### Scatter Plot of Iris dataset with Sepal Length and Width as axes
### Points are colored by their species
ggplot(mini_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with some transparency
  theme_minimal() +  # Use a clean theme
  labs(
    title = "Iris Dataset: Sepal Dimensions by Species",
    x = "Sepal Length",
    y = "Sepal Width"
  )

#Objective
#Goal of this analysis is to identify features capable of distinguishing flower species with high accuracy. We achieve this task using the **logistic regression model**, focusing on 1 vs. all comparisons.


#Training logistic regression model
iris.fit <- glm(Species ~ Sepal.Length +  Sepal.Width,
                family = 'binomial',
                data = mini_iris)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Calculating class predictions
mini_iris$Prob <- predict.glm(iris.fit,
                              mini_iris[,c('Sepal.Length','Sepal.Width')],
                              type = 'response')

#Assigning class labels
mini_iris$Pred <- ifelse(mini_iris$Prob >= 0.5,"virginica", "setosa")

#Comparing to ground truth labels
mini_iris$Result <- mini_iris$Species == mini_iris$Pred


## Visualizing Sepal dimensions by species
ggplot(mini_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Pred)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with some transparency
  theme_minimal() +  # Use a clean theme
  labs(
    title = "Iris Dataset: Training Classification Performance",
    x = "Sepal Length",
    y = "Sepal Width"
  )

## Repeating analysis between Viridis and versicolor
mini_iris <- iris %>%
  filter(Species != 'setosa')
ggplot(mini_iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with some transparency
  theme_minimal() +  # Use a clean theme
  labs(
    title = "Iris Dataset: Dimensions by Species",
    x = "Petal Length",
    y = "Petal Width"
  )

iris.fit <- glm(Species ~ Sepal.Length + Sepal.Width,
                family = 'binomial',
                data = mini_iris)

iris_pred <- predict.glm(iris.fit,
                         mini_iris[,c('Sepal.Length','Sepal.Width')],
                         type = 'response')
table(mini_iris$Species == ifelse(iris_pred >= 0.5, 'virginica','versicolor'))
## 
## FALSE  TRUE 
##    25    75

Example Notebook

You can download the Example_Notebook.Rmd here.

library(tidyverse)

Iris Dataset Modelling

The iris dataset is composed of 50 flowers from three different species. Measurements include width and length (centimeters) for petals and sepals of each flower.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Feature visualization
mini_iris <- iris %>%
  filter(Species != 'versicolor')

ggplot(mini_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with some transparency
  theme_minimal() +  # Use a clean theme
  labs(
    title = "Iris Dataset: Sepal Dimensions by Species",
    x = "Sepal Length",
    y = "Sepal Width"
  )

Objective

Goal of this analysis is to identify features capable of distinguishing flower species with high accuracy. We achieve this task using the logistic regression model, focusing on 1 vs. all comparisons.

Logistic Regression

From a probabilistic perspective, the logistic regression follows a Bernoulli distribution

\[ y_i \sim Bernouilli(\pi_i) \]

Where \(y_i\) is the classification for observation \(i\). \(\pi_i\) represents the Bernoulli parameter for observation \(i\) and is calculated as

\[ \pi_i = \sigma(x\beta) = P(y_i = 1| x)= \frac{1}{1 + \exp(-x\beta)} \]

Logistic regression extends the linear model by using the sigmoid/logistic function \(\sigma(x\beta)\) which allows it to bound its outputs between 0 and 1.

Model Fitting

## Question: can you try using Petal length and width instead?
##Hint: how are we currently specifying what features are used in the model?
iris.fit <- glm(Species ~ Sepal.Length +  Sepal.Width,
                family = 'binomial',
                data = mini_iris)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
mini_iris$Prob <- predict.glm(iris.fit,
                              mini_iris,
                              type = 'response')

##Question: Can you change the two lines below into a pipe operation?
##Hint: you will need the mutate() function and the %>% operator
mini_iris$Pred <- ifelse(mini_iris$Prob >= 0.5,"virginica", "setosa")
mini_iris$Result <- mini_iris$Species == mini_iris$Pred


table(mini_iris$Result)
## 
## TRUE 
##  100

Model Visualization

##Question: Can you add an additional plot that shows the TRUE species labels?
#Hint: what was the name of the column storing the labels? What argument is being used to visualize the Predicted labels?
ggplot(mini_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Pred)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with some transparency
  theme_minimal() +  # Use a clean theme
  labs(
    title = "Iris Dataset: Training Classification Predictions",
    x = "Sepal Length",
    y = "Sepal Width"
  )

## Running a linear regression to compare to logistic regression
mini_iris$Num_Species <- ifelse(mini_iris$Species == 'setosa', 1,2)
iris.lm <- lm(Num_Species ~ Sepal.Length +  Sepal.Width,
              data = mini_iris)

mini_iris$Num_pred <- predict.lm(iris.lm,mini_iris)

summary(mini_iris$Num_pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8347  1.0740  1.4250  1.5000  1.8993  2.5424
summary(mini_iris$Prob)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.5     0.5     1.0     1.0
Feature selection and additional modelling

Exploring optimal features to compare viridis and versicolor species.

mini_iris <- iris %>%
  filter(Species != 'setosa')

##Question:can you change the plot to show Petal length and width?
##hint: aes(x = ?, y = ?)
ggplot(mini_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with some transparency
  theme_minimal() +  # Use a clean theme
  labs(
    title = "Iris Dataset: Dimensions by Species",
    x = "Sepal Length",
    y = "Sepal Width"
  )

Evaluation

Evaluation of the trained logistic regression model demonstrates satisfactory classification performance.

N.B: current estimates are based on training data. Must adapt analysis to be based on held-out samples for future updates.

iris.fit <- glm(Species ~ Sepal.Length + Sepal.Width,
                family = 'binomial',
                data = mini_iris)


iris_pred <- predict.glm(iris.fit,
                         mini_iris,
                         type = 'response')


## Question: is the below function hard to read? How could we fix it?
##Hint: pipe operator
as.data.frame(table(mini_iris$Species == ifelse(iris_pred >= 0.5, 'virginica','versicolor')))
##    Var1 Freq
## 1 FALSE   25
## 2  TRUE   75