Practical 2
Requirements
The tidyverse package is required for this section and can be installed with:
install.packages('tidyverse')
Lab
For this practical, you can download this script.
Selected dplyr functions
1- Select
Select permits to extract, and potentially rename, one or more columns of a data frame.
5- Count
Count the number of individuals (rows) in a given category.
6- Mutate
Transform, combine and do operations on columns
Examples
Compute the Petal area (assuming diamond shapes)
Compute the sums of lengths and widths
Here is a link to a cheatsheet, that contains small descriptions of many dplyr functions: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf
Piping functions
Say that we want to group the iris data frame by species, then create summary statistics. We can do it…
…With piping
iris %>%
group_by(Species) %>%
summarize(mean(Petal.Length), mean(Petal.Width), mean(Sepal.Length), mean(Sepal.Width))The piping operator %>% permits to put multiple functions one after the other, where the output of the function of the left of the pipe operator is used as first argument of the function on the right side.
Hands on 1: Conversion of script
Convert the following script to its piped version:
grouped_cars <- group_by(mtcars, vs, hp >= 100) # group by vs (engine shape) and hp (horsepower)
ratio_mpg_wt <- mutate(grouped_cars,ratio = mpg/wt) # add a column with the ratio of mpg (miles per gallon) and wt (weight)
summarize(ratio_mpg_wt,
Mean_weight = mean(wt),
Mean_mpg = mean(mpg),
Mean_ratio = mean(ratio)) # summarize statisticsHands on 2:
Count the number of cars with (1) a sum of gears (gear) and carburators (carb) >= 6, (2) a number of cylinders (cyl) < 8 and (3) stratified by engine shape (vs)
tidyr pivot functions
The pivot functions (pivot_longer and pivot_wider) can be used to transform a data frame, such as to convert columns into row values and vice-versa. For example, if we have a data in wide format…
…we can convert it to long format…
billboard_long <- billboard_subset %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank")
print(billboard_long)…then back to wide
Managing missing values
Some data sets may have missing values, generally represented by NA.
These can be due to:
- Missed medical appointment
- Error of measuring instruments
- Sampling stochasticity
- …
We can deal with these misisng values in multiple ways.
1- Ignoring missing values
We want to have the average number and standard deviation of cases per country for positive smear in males between 0 and 14 years old (sp_m_014)
With na.rm
2- Dropping missing values
We want to clean the table once and for all. We can remove rows that have missing values in specified columns only
By default, drop_na will drop all rows with a missing value in any column
Alternatively, we can drop rows that only have missing data
3- Replacing missing values
We assume NA replace the absence of cases and are equivalent to 0. First, we create the named list used as parameter in replace_na()
replacing_list <- rep(list(0), ncol(who2))
names(replacing_list) <- colnames(who2)
replacing_list['country'] <- 'unknown'
replacing_list['year'] <- 'unknown'Then, we replace the NAs in the data frame
4- Imputing missing values
We can replace missing values by custom values for each entry. Imputation consists in inferring missing data, generally using other datapoints as reference Should be done on a case-by-case scenario, based on the data type and assumptions
Hands-on 4:
# 1- Drop the relapse columns
who2_ho4 <- who2[,grep('rel', colnames(who2), invert = TRUE)]
## 2- Remove empty rows
## Optional: remove rows with at least 80% of the data missing
## 3- Pivot to long, then create new columns such that the data frame contains columns
## country, year, method, sex, age. Hint: search the doc!
## 4- Produce statistics of your choice
## 5- Re-pivot to wideSolutions
The solutions are in this script.
Hands on 1: Conversion of script
Convert the following script to its piped version:
grouped_cars <- group_by(mtcars, vs, hp >= 100) # group by vs (engine shape) and hp (horsepower)
ratio_mpg_wt <- mutate(grouped_cars,ratio = mpg/wt) # add a column with the ratio of mpg (miles per gallon) and wt (weight)
summarize(ratio_mpg_wt,
Mean_weight = mean(wt),
Mean_mpg = mean(mpg),
Mean_ratio = mean(ratio)) # summarize statisticsSolution
Hands on 2:
Count the number of cars with (1) a sum of gears (gear) and carburators (carb) >= 6, (2) a number of cylinders (cyl) < 8 and (3) stratified by engine shape (vs)
Solution
Hands on 3: Find the errors
We want to get the number of flowers with a sepal area (assuming area = width * length / 2) > 10, stratified by species
iris %>%
mutate(area = Sepal.Length * Sepal.Width / 2) %>%
group_by(Species) %>%
summarize(sum(area))Solution
We want summary statistics by brand
#mtcars %>%
# mutate(brand = sub(' .*', '', row.names(mtcars))) %>%
# group_by(brand) %>%
# summarize(sum(brand), mean(gear), mean(carb))
# ERRORSolution
Hands-on 4:
# 1- Drop the relapse columns
who2_ho4 <- who2[,grep('rel', colnames(who2), invert = TRUE)]
## 2- Remove empty rows
## Optional: remove rows with at least 80% of the data missing
## 3- Pivot to long, then create new columns such that the data frame contains columns
## country, year, method, sex, age. Hint: search the doc!
## 4- Produce statistics of your choice
## 5- Re-pivot to wideSolution
## 1- Drop the relapse columns
who2_ho4 <- who2[,grep('rel', colnames(who2), invert = TRUE)]
## 2- Remove empty rows
## Optional: remove rows with at least 80% of the data missing
to_keep <- who2_ho4[, c(-1, -2)] %>% # remove the first 2 columns
is.na() %>% # test if each entry is an NA
rowSums()/(ncol(who2_ho4)-2) < 0.8 # remove rows with less than 80% of the data
who2_ho4 <- who2_ho4[to_keep,]
## 3- Pivot to long, then create new columns such that the data frame contains columns
## country, year, method, sex, age
who2_ho4 %>%
pivot_longer(cols = colnames(who2_ho4)[c(-1, -2)],
names_to = c("method", "sex", "age"),
names_pattern = "(.*)_(.)_(.*)",
values_to = "count") %>%
## 4- Produce statistics of your choice
group_by(country, method, age) %>%
summarise(Mean = mean(count, na.rm = TRUE),
SD = sd(count, na.rm = TRUE)) %>%
## 5- Re-pivot to wide
pivot_wider(names_from = "method",
values_from = c("Mean", "SD"))