Practical 3
Requirements
The tidyverse and nnet packages are required for this section and can be installed with:
install.packages(c("tidyverse","nnet"))
You will also need previous experience with probability to best benefit from the materials.
Lab
Linear regression
For this part of the practical, you can download this script.
Verifying the dataset
Let’s take the iris data set and check whether the petal length and width are linearly correlated
ggplot(iris) + geom_point(aes(Petal.Length, Petal.Width, col = Species))
cor(iris$Petal.Length, iris$Petal.Width)
The data seems to be linearly correlated. The species have different ranges of sizes, but they seem to all follow the same trend.
Fitting with MLE
1- Estimating beta1
\[\hat{\beta_1} = \frac{\sum^{N}_{i=1}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}} = \frac{SS_{xy}}{SS_{xx}}\]
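As a minimal sketch of this estimate on the iris petal data (the intercept used below, beta0 = mean(y) - beta1*mean(x), is the standard least-squares companion result and is not shown above; the object names are ours):
x <- iris$Petal.Length
y <- iris$Petal.Width
SS_xy <- sum((x - mean(x)) * (y - mean(y)))  # numerator: sum of cross-deviations
SS_xx <- sum((x - mean(x))^2)                # denominator: sum of squared deviations of x
beta1_hat <- SS_xy / SS_xx                   # slope estimate
beta0_hat <- mean(y) - beta1_hat * mean(x)   # intercept estimate
c(beta0_hat, beta1_hat)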
Built-in linear regression
lm is the built-in regression function
Plot the results to compare with the MLE
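A minimal sketch of this comparison (the object name iris_lm is ours):
iris_lm <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(iris_lm)  # should match beta0_hat and beta1_hat computed above

ggplot(iris) + geom_point(aes(Petal.Length, Petal.Width, col = Species)) +
  geom_abline(intercept = coef(iris_lm)[1], slope = coef(iris_lm)[2])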
Predicting results
Say we remove the versicolor species during model construction: can we still predict its Petal Width as a function of its Petal Length?
1- We remove the versicolor data and compute the model
iris_subset <- iris %>% filter(Species != 'versicolor')
iris_model <- lm(Petal.Width ~ Petal.Length, data = iris_subset)2- We extract the versicolor Petal Length
versicolor_p <- iris %>% filter(Species == 'versicolor') %>%
mutate(Predicted_pw = iris_model$coefficients[1] + Petal.Length*iris_model$coefficients[2])
Alternatively
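The alternative is not shown in this excerpt; a likely equivalent (assumed here, with versicolor_p_alt as a hypothetical name) uses the built-in predict() function on the fitted model:
versicolor_p_alt <- iris %>% filter(Species == 'versicolor')
# predict() applies the fitted coefficients to the new data for us
versicolor_p_alt$Predicted_pw <- predict(iris_model, newdata = versicolor_p_alt)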
3- Plot the results
ggplot(iris) + geom_point(aes(Petal.Length, Petal.Width, col = Species)) +
geom_point(aes(Petal.Length, Predicted_pw), data = versicolor_p, color = 'darkgreen', shape = '+', size = 5)
4- Compute the MSE
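The MSE computation itself is not shown above; mirroring the version used later for the third model (the name MSE1 is ours), it would be:
MSE1 = mean((versicolor_p$Petal.Width - versicolor_p$Predicted_pw)**2)  # mean squared prediction error on versicolor
print(MSE1)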
Using multiple variables
What if we also use the Sepal values to predict the Petal Width?
1- We compute the model after removal of versicolor
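The code for iris_model2 is not shown in this excerpt; based on the coefficients used below, it presumably includes both Sepal measurements:
iris_model2 <- lm(Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = iris_subset)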
2- We extract the versicolor Petal Length
versicolor_p2 <- iris %>% filter(Species == 'versicolor') %>%
mutate(Predicted_pw = iris_model2$coefficients[1] +
Petal.Length*iris_model2$coefficients[2] +
Sepal.Length*iris_model2$coefficients[3] +
Sepal.Width*iris_model2$coefficients[4])
3- Plot the results
ggplot(iris) + geom_point(aes(Petal.Length, Petal.Width, col = Species)) +
geom_point(aes(Petal.Length, Predicted_pw), data = versicolor_p2, color = 'darkgreen', shape = '+', size = 5)
4- Compute the MSE
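As above, the computation would be (MSE2 is our name):
MSE2 = mean((versicolor_p2$Petal.Width - versicolor_p2$Predicted_pw)**2)
print(MSE2)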
In this case, the MSE is a lot larger than that of the previous model, so adding more variables did not help. What are the correlations between the variables?
ggplot(iris) + geom_point(aes(Sepal.Width, Petal.Width, col = Species))
cor(iris$Sepal.Width, iris$Petal.Width)
ggplot(iris) + geom_point(aes(Sepal.Length, Petal.Width, col = Species))
cor(iris$Sepal.Length, iris$Petal.Width)
Quick test with only the Petal Length and Sepal Length
iris_model3 <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris_subset)
versicolor_p3 <- iris %>% filter(Species == 'versicolor') %>%
mutate(Predicted_pw = iris_model3$coefficients[1] +
Petal.Length*iris_model3$coefficients[2] +
Sepal.Length*iris_model3$coefficients[3])
MSE3 = mean((versicolor_p3$Petal.Width - versicolor_p3$Predicted_pw)**2)
ggplot(iris) + geom_point(aes(Petal.Length, Petal.Width, col = Species)) +
geom_point(aes(Petal.Length, Predicted_pw), data = versicolor_p3, color = 'darkgreen', shape = '+', size = 5)
print(MSE3)
This is the worst MSE of the 3, possibly because of overfitting.
Hands-on
Create your own linear model
Using mtcars, find the best linear model to predict the miles per gallon (mpg). Use the first 2/3 of the data as training data for the model and the last third as test data.
Hands-on solutions
The solutions for this part are in this script.
Solution
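The split into mtcars_train and mtcars_test is not shown in this excerpt; following the hands-on instructions, it is presumably along these lines:
# First 2/3 of the rows as training data, last third as test data
n_train <- floor(2/3 * nrow(mtcars))
mtcars_train <- mtcars[1:n_train, ]
mtcars_test <- mtcars[(n_train + 1):nrow(mtcars), ]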
# To help us select a set of variables to use, we can test the correlation of all variables with mpg
mtcars_train %>% apply(2, cor, y=mtcars_train$mpg)
# We have a good negative correlation with the number of cylinders, the displacement,
# horsepower and weight. We will construct a model with those
# 1- Compute the model
cars_model <- lm(mpg ~ cyl+disp+hp+wt, data = mtcars_train)
# 2- Compute the results
cars_p <- mtcars_test %>%
mutate(Predicted_mpg = cars_model$coefficients[1] +
cyl*cars_model$coefficients[2] +
disp*cars_model$coefficients[3] +
hp*cars_model$coefficients[4] +
wt*cars_model$coefficients[5])
# 3- Compute the MSE
MSE = mean((cars_p$mpg - cars_p$Predicted_mpg)**2)
Note: this may not be the overall best solution; it's only one example
Logistic regression
For this part of the practical, you can download this script.
Predicting species with a linear model
We remove the versicolor species so that only two species remain
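The subsetting code is not shown in this part; it is presumably the same as in the linear-regression part:
iris_subset <- iris %>% filter(Species != 'versicolor')  # keep only setosa and virginica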
Predict the species based on the petal dimensions
iris_subset$train_test <- sample(c('train', 'test'), nrow(iris_subset), replace = TRUE, prob = c(0.8, 0.2))
iris_model <- lm(Species ~ Petal.Length + Petal.Width, data = iris_subset[iris_subset$train_test == 'train',])
predictions <- predict.lm(iris_model, iris_subset[iris_subset$train_test == 'test',], type = 'response')
Convert the predictions to species and compare to the truth
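The conversion is left to the script; one possible sketch (ours, assuming the linear model effectively predicts the numeric codes of the Species factor, which is how lm treats a factor response) uses the midpoint of the two species' codes as the threshold:
codes <- sort(unique(as.integer(iris_subset$Species)))  # numeric codes of the two remaining species
predicted_species <- ifelse(predictions < mean(codes), 'setosa', 'virginica')
table(predicted_species, iris_subset[iris_subset$train_test == 'test', 'Species'])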
Predicting 2 species with a logistic model
iris_log_model <- glm(Species ~ Sepal.Length + Sepal.Width,
family = 'binomial',
data = iris_subset[iris_subset$train_test == 'train',])
predictions_log <- predict.glm(iris_log_model,
iris_subset[iris_subset$train_test == 'test',],
type = 'response')
predictions_log <- ifelse(predictions_log >= 0.5,"virginica", "setosa")
table(predictions_log, iris_subset[iris_subset$train_test == 'test', 'Species'])
Predicting 3 species with a logistic model
iris$train_test <- sample(c('train', 'test'), nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
iris_train <- iris %>% filter(train_test == 'train')
iris_test <- iris %>% filter(train_test == 'test')
iris_multinomial <- multinom(Species ~ Sepal.Length + Sepal.Width,
data = iris_train)
predictions_multi <- predict(iris_multinomial,
iris_test,
'prob')
Convert the probabilities to species by assigning each flower the most likely class
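This conversion is not shown here; a sketch mirroring the diamonds solution below (the name predicted_multi is ours) is:
# Assign each test flower the species with the highest predicted probability
predicted_multi <- colnames(predictions_multi)[apply(predictions_multi, 1, which.max)]
table(predicted_multi, iris_test$Species)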
Hands-on
From the diamonds data set, predict the cut
First, divide the data into training and testing data sets
Hands-on solution
The solutions for this part are in this script.
Solution
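The creation of diamonds_train and diamonds_test is not shown here; a plausible split (assumed, mirroring the 80/20 scheme used for iris) is:
diamonds_split <- diamonds %>%
  mutate(train_test = sample(c('train', 'test'), n(), replace = TRUE, prob = c(0.8, 0.2)))
diamonds_train <- diamonds_split %>% filter(train_test == 'train')
diamonds_test <- diamonds_split %>% filter(train_test == 'test')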
diamonds_multinomial <- multinom(cut ~ carat + color + clarity + depth,
data = diamonds_train)
predictions_diamonds <- predict(diamonds_multinomial,
diamonds_test,
'prob')
# Convert the probabilities to categories
categories_predictions <- predictions_diamonds %>%
apply(1, function(x) return(colnames(predictions_diamonds)[which.max(x)]))
table(categories_predictions, diamonds_test$cut)
Note: this may not be the overall best solution; it's only one example
Statistical testing
For this part of the practical, you can download this script.
Two-sample t-test
1- First, we create two random samples from Normal distributions
set.seed(20221011) # Seed for reproducibility
n <- 100 # Sample size
mu1 <- 0 # Mean of sample 1
mu2 <- 0.1 # Mean of sample 2
sd1 <- 1 # Standard deviation 1
sd2 <- 1 # Standard deviation 2
Generate two samples of size n of normally-distributed data.
Sample x1 has a mean of mu1 and standard deviation sd1
Sample x2 has a mean of mu2 and standard deviation sd2
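The generation code itself is not shown above; a minimal version is:
x1 <- rnorm(n, mean = mu1, sd = sd1)  # sample 1
x2 <- rnorm(n, mean = mu2, sd = sd2)  # sample 2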
2- We can take a look at the data with boxplots
x_df <- data.frame(Values = c(x1,x2),
Group = c(rep('G1',n),rep('G2',n)))
ggplot(x_df) + geom_boxplot(aes(y = Values, color = Group))
3- t-test
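The test itself is not shown in this excerpt; the standard call would be:
t.test(x1, x2)  # two-sample Welch t-test, two-sided by default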
4- The result of the test is influenced by the sample size
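As an illustration (the larger sample size n_big is our own choice), the same small difference in means typically becomes significant with many more observations:
n_big <- 10000  # hypothetical larger sample size
t.test(rnorm(n_big, mean = mu1, sd = sd1), rnorm(n_big, mean = mu2, sd = sd2))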
5- We can also do a one-sided test: check whether group 2 is greater than group 1 (i.e. the difference group 1 - group 2 is negative)
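In t.test this corresponds to the alternative argument; a sketch:
t.test(x1, x2, alternative = 'less')  # H1: mean of x1 is less than mean of x2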
ANOVA
1- Create multiple datasets
Our artificial data will have three different means
set.seed(20221011)
m1 <- rnorm(30, mean = 25, sd = 5)
m2 <- rnorm(30, mean = 30, sd = 6)
m3 <- rnorm(30, mean = 22, sd = 7)
ms <- c(m1, m2, m3) # Combine our three measures into one vector
# Our vector of group labels A, B, and C
ls <- as.factor(c(rep("A",30),rep("B",30),rep("C",30)))
2- Visualize the data
df_anova <- data.frame(Measure = ms, Group = ls)
ggplot(df_anova) + geom_boxplot(aes(y = Measure, color = Group))
3- ANOVA
We can look at the coefficients to find the difference across groups
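The ANOVA call is not shown above; following the same pattern as the solution further down (the name results_anova is ours), it would be:
results_anova <- aov(Measure ~ Group, data = df_anova)  # one-way ANOVA across the three groups
summary(results_anova)                                  # overall F test
results_anova$coefficients                              # effects of groups B and C relative to group A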
Hands-on
1- ANOVA vs multiple t-tests
Data PlantGrowth has multiple groups: one control and two treatments
Run an ANOVA, then separate t-tests between the control and each treatment. Do they lead to the same conclusion?
2- Multiple testing correction
Take the co2 data set (note: not CO2), which records the CO2 level for each month across several years. Test whether the CO2 levels from 1959 to 1978 were lower than those between 1979 and 1997, for each month separately. Then correct for multiple testing and compare the results (hint: see p.adjust). Do the conclusions change?
Hands-on solutions
The solutions for this part are in this script.
1- ANOVA vs multiple t-tests
Solution hands on 1
results_aov <- aov(weight ~ group, data = PlantGrowth)
summary(results_aov)
results_aov$coefficients
# At least one treatment is different, treatment 1 is lighter than control and
# treatment 2 is heavier
t.test(PlantGrowth$weight[PlantGrowth$group == 'ctrl'], PlantGrowth$weight[PlantGrowth$group == 'trt1'])
t.test(PlantGrowth$weight[PlantGrowth$group == 'ctrl'], PlantGrowth$weight[PlantGrowth$group == 'trt2'])
# Only treatment 2 is significantly different
2- Multiple testing correction
Solution hands on 2
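The construction of co2_df is not shown in this excerpt; a plausible version (an assumption) reshapes the monthly co2 series into one row per year and one column per month:
# co2 covers Jan 1959 to Dec 1997: 39 years x 12 months
co2_df <- as.data.frame(matrix(co2, ncol = 12, byrow = TRUE))
colnames(co2_df) <- month.abb  # one column per month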
p_values <- co2_df %>% apply(2, function(x) return(t.test(x[1:20], x[21:39])$p.value))
sum(p_values < 0.05) # 9 months are different
p_adj <- p.adjust(p_values, method = 'bonferroni')
sum(p_adj < 0.05) # 5 months are different
fdr <- p.adjust(p_values, method = 'fdr')
sum(fdr < 0.05) # 8 months are different