Module 3: Modeling Part 1

Lecture

Regression

Generalized Linear Models

Lab

Load Libraries

First, let’s load any necessary libraries.

library(tidyverse) # General functions and plots
library(faraway) # Source of the gala dataset
library(car) # For some model diagnostics
library(arm) # For binned plot for logistic regression
library(mgcv) # For beta regression
library(microbiome) # Source of dietswap data
library(microbiomeDataSets) # Source of baboongut data
library(phyloseq) # to calculate diversity measures
library(GUniFrac) # for throat dataset
library(MGLM) # for rna seq dataset

First, we will cover multiple linear regression modeling in R with the lm() function. In regression modeling, it is always important to consider the data available to you. What is your response of interest? Is it a continuous, roughly normally distributed variable? Which predictors should be included in your model? Do these predictors have a linear relationship with the outcome?

Multiple Linear Regression and the Dietswap Data

Supposed researchers are interested in how nationality and bmi group are related to diversity of bacteria in the gut microbiome. In particular, researchers are interested to know whether nationality moderates the relationship between bmi group and Shannon diversity.

Read in Data

# load the dietswap data from the microbiome package
data(dietswap)

# look up description of dataset
?dietswap

# given that this is a phylogenetic class format, we get the data as follows:
otu_table(dietswap) # counts of different OTUs
sample_data(dietswap) # the metadata

# let's look at the summary of our sample data
summary(sample_data(dietswap))

Since we do not have a column for the Shannon diversity, we must calculate it ourselves with the estimate_richness() function from the phyloseq package.

# we don't have shannon diversity so we must calculate it given the OTU table
Shannon <- estimate_richness(otu_table(dietswap), measures = c("Shannon"))

# add shannon diversity to our sample data
dietswap_data <- cbind.data.frame(Shannon=Shannon$Shannon, sample_data(dietswap))

# look at histogram of Shannon diversity
hist(dietswap_data$Shannon)

# look at boxplot of Shannon diversity
boxplot(dietswap_data$Shannon)

Another potential issue we notice when reviewing the dataset is that each sample was observed at multiple time points which means this data will violate our independence assumption. To overcome this, we can filter our data to only look at the first time point.

# remove second time point for each group
dietswap_data_t1 <- dietswap_data %>%
                  filter(timepoint.within.group==1)

Since the researchers are interested to know whether there is an interaction effect, we must include the interaction and main effects in our linear model. To include an interaction between two variables, we use ’*’ in the model which will automatically include both main effects and the interaction effect of the 2 variables.

# fitting the full multiple linear regression in R
mod1 <- lm(Shannon~nationality*bmi_group+sex, data=dietswap_data_t1)
summary(mod1)

## 
## Call:
## lm(formula = Shannon ~ nationality * bmi_group + sex, data = dietswap_data_t1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32963 -0.36395  0.09898  0.33401  1.44820 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          2.7540     0.2296  11.993   <2e-16 ***
## nationalityAFR                      -0.3100     0.2212  -1.401   0.1640    
## bmi_groupoverweight                  0.2642     0.2275   1.162   0.2481    
## bmi_groupobese                      -0.2012     0.2208  -0.911   0.3643    
## sexmale                             -0.0509     0.1377  -0.370   0.7125    
## nationalityAFR:bmi_groupoverweight  -0.5859     0.2934  -1.997   0.0484 *  
## nationalityAFR:bmi_groupobese       -0.0190     0.2880  -0.066   0.9475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5512 on 105 degrees of freedom
## Multiple R-squared:  0.2439, Adjusted R-squared:  0.2007 
## F-statistic: 5.645 on 6 and 105 DF,  p-value: 4.112e-05

We see from the results of this linear model that there is very moderate evidence of an interaction effect between BMI group and nationality such that those with African nationality have a greater difference in diversity between the lean and overweight groups than those of African American nationality. Overall, the variables in the model accounted for 24.4% of variability in the diversity response.

Model Diagnostics

Again, after fitting the model, we must check for the following assumptions:

Linearity - we already checked this prior to fitting the model
Independence - we already fixed this by removing the second time point
Homoscedasticity - we can check this with a scatterplot of model residuals vs. fitted values
Normality - we can check this with a qq plot of model residuals
No multicollinearity - we will ignore this for this model for now since we have interaction terms and look at it for the model without interactions

# 3. homoscedasticity
res <- residuals(mod1)
fit <- fitted(mod1)
mod_res <- cbind.data.frame(res,fit)
# create the residuals vs fitted values plot
ggplot(data=mod_res,aes(x = fit, y = res)) +
  geom_point(color = "blue", size = 2) +  # Plot the residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a horizontal line at 0
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

# 4. normality 
# QQ-plot to assess normality
qqPlot(mod_res$res)

## [1] 30 97

Comparing Nested Models

When fitting multiple linear regression models, it’s often useful to explore whether a simpler, more parsimonious model can adequately explain the data. For example, when working with indicator variables, we can assess whether they should be included for a change in intercept (main effect), a change in slope (interaction effect), or excluded altogether. By comparing models without the indicator, with only the main effect, and with the interaction effect, we can evaluate the contribution of the indicator variable and determine its necessity in the final model.

Here, we will use the anova() function in R to compare nested models. The null hypothesis of the test run by this anova function is H0: reduced (simpler model) is adequate and the alternative hypothesis is H1: Full (more complex) model is needed to explain the data. Therefore, if the p value is small, then we have evidence that we should include the extra variable(s).

# fit model with only main effects for nationality
mod2 <- lm(Shannon~nationality+bmi_group+sex, data=dietswap_data_t1)
summary(mod2)

## 
## Call:
## lm(formula = Shannon ~ nationality + bmi_group + sex, data = dietswap_data_t1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35941 -0.34599  0.06785  0.32141  1.51472 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.97433    0.19853  14.982  < 2e-16 ***
## nationalityAFR      -0.52438    0.11508  -4.557 1.39e-05 ***
## bmi_groupoverweight -0.04098    0.16653  -0.246   0.8061    
## bmi_groupobese      -0.31923    0.16734  -1.908   0.0591 .  
## sexmale             -0.12343    0.13122  -0.941   0.3490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5606 on 107 degrees of freedom
## Multiple R-squared:  0.2029, Adjusted R-squared:  0.1731 
## F-statistic:  6.81 on 4 and 107 DF,  p-value: 6.374e-05

# fit model without nationality
mod3 <- lm(Shannon~bmi_group+sex, data=dietswap_data_t1)
summary(mod3)

## 
## Call:
## lm(formula = Shannon ~ bmi_group + sex, data = dietswap_data_t1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4473 -0.4295  0.1259  0.4826  1.3520 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.46753    0.17886  13.796   <2e-16 ***
## bmi_groupoverweight  0.23528    0.16870   1.395    0.166    
## bmi_groupobese      -0.07057    0.17206  -0.410    0.682    
## sexmale              0.02172    0.13846   0.157    0.876    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6098 on 108 degrees of freedom
## Multiple R-squared:  0.04824,    Adjusted R-squared:  0.02181 
## F-statistic: 1.825 on 3 and 108 DF,  p-value: 0.147

# compare nested models
anova(mod3,mod2)

## Analysis of Variance Table
## 
## Model 1: Shannon ~ bmi_group + sex
## Model 2: Shannon ~ nationality + bmi_group + sex
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    108 40.155                                  
## 2    107 33.629  1    6.5256 20.763 1.385e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(mod3,mod1)

## Analysis of Variance Table
## 
## Model 1: Shannon ~ bmi_group + sex
## Model 2: Shannon ~ nationality * bmi_group + sex
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    108 40.155                                  
## 2    105 31.901  3    8.2543 9.0563 2.189e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(mod2,mod1)

## Analysis of Variance Table
## 
## Model 1: Shannon ~ nationality + bmi_group + sex
## Model 2: Shannon ~ nationality * bmi_group + sex
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    107 33.629                              
## 2    105 31.901  2    1.7287 2.8449 0.06263 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From comparing the nested models, we see that nationality should be included in the model but that we do not require the interaction effect. We can go ahead and check the model diagnostics for mod2.

# 3. homoscedasticity
res <- residuals(mod2)
fit <- fitted(mod2)
mod_res <- cbind.data.frame(res,fit)
# create the residuals vs fitted values plot
ggplot(data=mod_res,aes(x = fit, y = res)) +
  geom_point(color = "blue", size = 2) +  # Plot the residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a horizontal line at 0
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

# 4.normality 
# QQ-plot to assess normality
qqPlot(mod_res$res)

## [1] 30 97

# 5. no multicollinearity
vif(mod2)

##                 GVIF Df GVIF^(1/(2*Df))
## nationality 1.166324  1        1.079965
## bmi_group   1.667090  2        1.136291
## sex         1.521859  1        1.233637

Visualize our Model

Now let’s plot the interaction between bmi_group and nationality on Shannon diversity to help us interpret our findings.

ggplot(dietswap_data_t1, aes(x = bmi_group, y = Shannon, color = nationality, group = nationality)) +
  stat_summary(fun = mean, geom = "point", size = 3) +  # Plot the means
  stat_summary(fun = mean, geom = "line") +             # Connect the means with lines
  labs(title = "Interaction between Nationality, BMI Group, and Shannon Diversity",
       x = "BMI Group",
       y = "Shannon Diversity Index",
       color = "Nationality") +
  theme_minimal()

Logistic Regression

Suppose researchers are interested to learn more about the association between shannon diversity of the throat bacteria and smoking. We can use logistic regression to help answer this question through looking at whether we can predict smoking status with shannon diversity of bacteria in the throat, age, and sex.

Loading the Data

# first, need to load in meta data and otu table
data(throat.meta)
data(throat.otu.tab)

# look up descriptions of the data
?throat.meta
?throat.otu.tab

# let's turn this data into a phyloseq object
throat.physeq <- phyloseq(
  otu_table(throat.otu.tab, taxa_are_rows = FALSE),
  sample_data(throat.meta)
)

# check our phyloseq object
otu_table(throat.physeq)
sample_data(throat.physeq)


# now we can estimate shannon diversity
Shannon <- estimate_richness(otu_table(throat.physeq), measures="Shannon")
throat_data <- cbind.data.frame(Shannon=Shannon$Shannon,sample_data(throat.physeq))

Fitting the Logistic Regression Model

In R, we can use the glm() function to fit a logistic regression model by setting family = “binomial” as an argument.

# now let's look at the effect of sex, age, and diversity on smoking
log_mod <- glm(SmokingStatus~Age+Sex+Shannon, data=throat_data, family=binomial)
summary(log_mod)

## 
## Call:
## glm(formula = SmokingStatus ~ Age + Sex + Shannon, family = binomial, 
##     data = throat_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.97623    2.54902   0.775   0.4382  
## Age          0.04757    0.02798   1.700   0.0891 .
## SexMale      0.80698    0.61972   1.302   0.1929  
## Shannon     -1.29961    0.80217  -1.620   0.1052  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82.911  on 59  degrees of freedom
## Residual deviance: 75.147  on 56  degrees of freedom
## AIC: 83.147
## 
## Number of Fisher Scoring iterations: 4

# get odds ratios
exp(coef(log_mod))

## (Intercept)         Age     SexMale     Shannon 
##   7.2154548   1.0487182   2.2411336   0.2726372

# 95% confidence intervals for odds ratios
exp(confint(log_mod))

##                  2.5 %      97.5 %
## (Intercept) 0.05231538 1437.791848
## Age         0.99472215    1.111987
## SexMale     0.68252995    7.976087
## Shannon     0.05015106    1.235533

It does not look like Shannon diversity has a significant association with smoking status. That is, the p-value for the Shannon diversity covariate is 0.105. We can interpret the coefficient on the logit scale as, increasing Shannon diversity by 1 unit is estimated to decrease log(p/1-p) of smoking by 1.2996. Alternatively, we can interpret the coefficient on the odds scale by exponentiating the coefficient. On the odds scale, we can say that increasing the Shannon diversity changes the odds of smoking by a factor of 0.273 (95% CI: 0.050, 1.236).

Model Diagnostic

Again, after fitting the model, we must check for the following assumptions:

Linearity - we can check the relationship between covariates and logit outcome with component-residual plots
Independence - we will assume the data observations are independent
Proper fitting distribution - we can check this with a scatterplot of model DEVIANCE residuals vs. linear predictor
No multicollinearity - we can check this with variance inflation factors

# checking assumptions
# 1. linearity assumption with component+residual plots
crPlot(log_mod, variable="Shannon")

crPlot(log_mod, variable="Age")

# 3. proper fitting distribution (look for no pattern/constant variance)
plot(residuals(log_mod)~predict(log_mod)) # notice it looks weird because we only have 2 response options

binnedplot(predict(log_mod), residuals(log_mod))

# 4. no multicollinearity
vif(log_mod) # a bit concerning - could be too high of a correlation

##      Age      Sex  Shannon 
## 1.074958 1.120760 1.095087

Visualization

# first we need smoking status to be numeric
throat_data$SmokingStatusNumeric <- ifelse(throat_data$SmokingStatus=="Smoker",1,0)
# create a ggplot object with the data points

p <- ggplot(throat_data, aes(x = Shannon, y = SmokingStatusNumeric)) +
  geom_point(alpha = 0.5) +  # Plot the actual data points
  stat_smooth(method = "glm", method.args = list(family = binomial), se = FALSE, color = "blue") +
  labs(title = "Logistic Regression Curve",
       x = "Shannon Diversity",
       y = "Probability of Smoking")

# display the plot
p

Binomial Regression

Suppose researchers are interested in the number of surviving trout eggs in boxes across different locations at different weeks after placement. In particular, the outcome of interest is the number of trout eggs alive in the box/the total number of trout eggs in the box.

Loading the Data

We will use the troutegg data from faraway for this analysis.

# first, let's load in the data
data(troutegg) # from faraway

# view description of the data
?troutegg

# view data
head(troutegg)

##   survive total location period
## 1      89    94        1      4
## 2     106   108        2      4
## 3     119   123        3      4
## 4     104   104        4      4
## 5      49    93        5      4
## 6      94    98        1      7

str(troutegg)

## 'data.frame':    20 obs. of  4 variables:
##  $ survive : int  89 106 119 104 49 94 91 100 80 11 ...
##  $ total   : int  94 108 123 104 93 98 106 130 97 113 ...
##  $ location: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5 ...
##  $ period  : Factor w/ 4 levels "4","7","8","11": 1 1 1 1 1 2 2 2 2 2 ...

Fitting the Binomial Regression

We will fit the binomial regression using the glm function and setting family=“binomial”. However, we have to tell R how many successes we have in our outcome and how many failures.

# fit binomial
binom_mod <- glm(cbind(survive,total-survive)~location+period, data=troutegg, family="binomial")
summary(binom_mod)

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = "binomial", data = troutegg)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
## location2    -0.4168     0.2461  -1.694   0.0903 .  
## location3    -1.2421     0.2194  -5.660 1.51e-08 ***
## location4    -0.9509     0.2288  -4.157 3.23e-05 ***
## location5    -4.6138     0.2502 -18.439  < 2e-16 ***
## period7      -2.1702     0.2384  -9.103  < 2e-16 ***
## period8      -2.3256     0.2429  -9.573  < 2e-16 ***
## period11     -2.4500     0.2341 -10.466  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

Model Diagnostics

Again, after fitting the model, we must check for the following assumptions:

Linearity - we do not have any continuous or numeric variables
Independence - we will assume the data observations are independent
Proper fitting distribution - we can check this with a scatterplot of model DEVIANCE residuals vs. linear predictor
No multicollinearity - we can check this with variance inflation factors

# checking assumptions
# 3. proper fitting distribution (look for no pattern/constant variance)
plot(residuals(binom_mod)~predict(binom_mod))

# 4. No multicollinearity
vif(binom_mod)

##             GVIF Df GVIF^(1/(2*Df))
## location 1.33796  4        1.037064
## period   1.33796  3        1.049721

Addressing Overdispersion in Binomial

The binomial model assumes that the variance is related to the our outcome, p, as p(1-p). If the variance is greater than what is assumed by the model, it is called ‘overdispersion’. We can estimate overdispersion as deviance/df residuals. If this ratio is greater than 1 we have overdispersion, if it is (roughly) 1 we have no overdispersion, and if it is less than 1, we have underdispersion.

If we have overdispersion of the binomial, we can try to fit a quasibinomial model instead which will estimate the dispersion parameter instead of assuming it to be 1.

# estimating the dispersion
binom_mod$deviance/binom_mod$df.residual

## [1] 5.374593

# if it is large, we can use 'quasibinomial' as the family instead to account for this
binom_mod2 <- glm(cbind(survive,total-survive)~location+period, data=troutegg, family="quasibinomial")
summary(binom_mod2)

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = "quasibinomial", data = troutegg)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6358     0.6495   7.138 1.18e-05 ***
## location2    -0.4168     0.5682  -0.734 0.477315    
## location3    -1.2421     0.5066  -2.452 0.030501 *  
## location4    -0.9509     0.5281  -1.800 0.096970 .  
## location5    -4.6138     0.5777  -7.987 3.82e-06 ***
## period7      -2.1702     0.5504  -3.943 0.001953 ** 
## period8      -2.3256     0.5609  -4.146 0.001356 ** 
## period11     -2.4500     0.5405  -4.533 0.000686 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 5.330358)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Poisson Regression

Loading the Data

We will use the faramea.csv data.

# read in the dataset
faramea <- read.csv('./datasets/faramea.csv') # Use your file path for this data

# look at first 6 rows
head(faramea)

##     X   UTM.EW  UTM.NS Precipitation Elevation Age Age.cat Geology
## 1  B0 625754.0 1011569        2530.0       120   3      c3      Tb
## 2 B49 626654.0 1011969        2530.0       120   3      c3      Tb
## 3  p1 614856.9 1031786        2993.2        20   2      c2      Tc
## 4  p2 613985.4 1030725        3072.0       100   3      c3      Tc
## 5  p3 614674.3 1023802        3007.4       180   1      c1      Tc
## 6  p4 615018.6 1023548        2999.8       180   1      c1      Tc
##   Faramea.occidentalis
## 1                   14
## 2                    7
## 3                    0
## 4                    0
## 5                    2
## 6                    1

# look at structure
str(faramea)

## 'data.frame':    43 obs. of  9 variables:
##  $ X                   : chr  "B0" "B49" "p1" "p2" ...
##  $ UTM.EW              : num  625754 626654 614857 613985 614674 ...
##  $ UTM.NS              : num  1011569 1011969 1031786 1030725 1023802 ...
##  $ Precipitation       : num  2530 2530 2993 3072 3007 ...
##  $ Elevation           : int  120 120 20 100 180 180 40 30 60 50 ...
##  $ Age                 : int  3 3 2 3 1 1 2 2 1 3 ...
##  $ Age.cat             : chr  "c3" "c3" "c2" "c3" ...
##  $ Geology             : chr  "Tb" "Tb" "Tc" "Tc" ...
##  $ Faramea.occidentalis: int  14 7 0 0 2 1 0 0 2 0 ...

# look at summary of variables
summary(faramea)

##       X                 UTM.EW           UTM.NS        Precipitation 
##  Length:43          Min.   :600714   Min.   : 962862   Min.   :1888  
##  Class :character   1st Qu.:625578   1st Qu.:1009776   1st Qu.:2406  
##  Mode  :character   Median :637732   Median :1012428   Median :2516  
##                     Mean   :635927   Mean   :1013984   Mean   :2643  
##                     3rd Qu.:645611   3rd Qu.:1023675   3rd Qu.:2996  
##                     Max.   :688165   Max.   :1045987   Max.   :4002  
##    Elevation          Age          Age.cat            Geology         
##  Min.   : 10.0   Min.   :1.000   Length:43          Length:43         
##  1st Qu.: 59.0   1st Qu.:1.000   Class :character   Class :character  
##  Median :120.0   Median :2.000   Mode  :character   Mode  :character  
##  Mean   :192.3   Mean   :2.093                                        
##  3rd Qu.:180.0   3rd Qu.:3.000                                        
##  Max.   :830.0   Max.   :3.000                                        
##  Faramea.occidentalis
##  Min.   : 0.000      
##  1st Qu.: 0.000      
##  Median : 0.000      
##  Mean   : 3.884      
##  3rd Qu.: 4.500      
##  Max.   :42.000

# convert age to numeric (helps with crPlots)
faramea$Age <- as.numeric(faramea$Age)

Fitting the Poisson Regression

Again, we will be using the glm() function here to fit our Poisson regression model and this time, we will set family=“poisson”.

# fitting the Poisson model
glm.poisson = glm(Faramea.occidentalis ~ Elevation+Age+Precipitation,
                  data = faramea,
                  family = poisson)
summary(glm.poisson)

## 
## Call:
## glm(formula = Faramea.occidentalis ~ Elevation + Age + Precipitation, 
##     family = poisson, data = faramea)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.8957542  0.6178206   7.924  2.3e-15 ***
## Elevation     -0.0018282  0.0006837  -2.674 0.007496 ** 
## Age           -0.3918512  0.1135936  -3.450 0.000561 ***
## Precipitation -0.0010039  0.0002904  -3.457 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 414.81  on 42  degrees of freedom
## Residual deviance: 341.86  on 39  degrees of freedom
## AIC: 419.74
## 
## Number of Fisher Scoring iterations: 8

Model Diagnostic

Again, after fitting the model, we must check for the following assumptions:

Linearity - use component-residual plots to assess
Independence - we will assume the data observations are independent
Proper fitting distribution- we can check this with a scatterplot of model DEVIANCE residuals vs. linear predictor
No multicollinearity - we can check this with variance inflation factors

# check assumptions
# 1. linearity
crPlot(glm.poisson, variable="Elevation")

crPlot(glm.poisson, variable="Age")

crPlot(glm.poisson, variable="Precipitation")

# 3. proper fitting distribution
plot(resid(glm.poisson)~predict(glm.poisson)) # fan pattern - likely due to the overdispersion

# 4. no multicollinearity
vif(glm.poisson)

##     Elevation           Age Precipitation 
##      1.193796      1.308440      1.397856

Addressing Overdispersion in Poisson

The Poisson model assumes that the variance is equal to the mean (i.e., E(Y) = VAR(Y)). If the variance is greater than what is assumed by the model, it is called ‘overdispersion’. We can estimate overdispersion as deviance/df residuals. If this ratio is greater than 1 we have overdispersion, if it is (roughly) 1 we have equidispersion, and if it is less than 1, we have underdispersion.

If we have overdispersion of the Poisson, we can try to fit a quasipoisson model instead which will estimate the dispersion parameter instead of assuming it to be 1.

# estimate dispersion
glm.poisson$deviance/glm.poisson$df.residual

## [1] 8.765551

# fit quasi-poisson to address overdispersion
glm.qpoisson = glm(Faramea.occidentalis ~ Elevation+Age+Precipitation,
                  data = faramea,
                  family = quasipoisson)
summary(glm.qpoisson)

## 
## Call:
## glm(formula = Faramea.occidentalis ~ Elevation + Age + Precipitation, 
##     family = quasipoisson, data = faramea)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    4.895754   2.159927   2.267    0.029 *
## Elevation     -0.001828   0.002390  -0.765    0.449  
## Age           -0.391851   0.397128  -0.987    0.330  
## Precipitation -0.001004   0.001015  -0.989    0.329  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 12.22231)
## 
##     Null deviance: 414.81  on 42  degrees of freedom
## Residual deviance: 341.86  on 39  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 8

Modeling Rates with Poisson

When modeling count data, there are scenarios where the counts observed may vary due to differences in the length of observation periods, the area covered, or the number of trials conducted. In such cases, it’s important to account for the differing observation levels using an offset in Poisson models.

**While the Poisson model can seem similar to the binomial distribution, particularly when dealing with rates of “successes”, there is a key distinction to keep in mind. The Poisson model is generally preferred when the count of successes is unbounded—that is, there is no theoretical limit to the number of successes we might observe based on the denominator. However, if the probability of success is extremely low, so much so that the numerator (the count of successes) would never be expected to approach the denominator, the Poisson model remains appropriate. In contrast, the binomial distribution is typically used when there is a fixed upper limit on the number of successes (and the denominator is this upper limit).

We will briefly review an example here where we have a rate instead of a count outcome. In this example, we will look at the dicentric dataset from faraway. The outcome of interest is the number of chromosomal abnormalities and the predictor of interest is the dose amount and dose rate. However, we also want to account for the number of cells observed as we would expect to see more abnormalities if we observed more cells. We will treat number of cells as our offset variable.

# load in the data
data(dicentric)

# view description of the data
?dicentric

# now let's look at the model without accounting for # of cells
dicentric$dose_fac <- as.factor(dicentric$doseamt)
mod_pois <- glm(ca ~ log(doserate)+dose_fac, data=dicentric, family=poisson)
summary(mod_pois) # clearly overdispersed - we are not properly accounting for the variability

## 
## Call:
## glm(formula = ca ~ log(doserate) + dose_fac, family = poisson, 
##     data = dicentric)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.53018    0.03439 131.740   <2e-16 ***
## log(doserate)  0.20443    0.01662  12.301   <2e-16 ***
## dose_fac2.5   -0.02830    0.04857  -0.583     0.56    
## dose_fac5      0.59294    0.04249  13.955   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 916.13  on 26  degrees of freedom
## Residual deviance: 460.69  on 23  degrees of freedom
## AIC: 644.1
## 
## Number of Fisher Scoring iterations: 4

# include log(cells) as covariate
mod_pois <- glm(ca ~ log(cells)+log(doserate)+dose_fac, data=dicentric, family=poisson) 
summary(mod_pois) # close to 1, can fix at 1 and treat as offset

## 
## Call:
## glm(formula = ca ~ log(cells) + log(doserate) + dose_fac, family = poisson, 
##     data = dicentric)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.08452    0.38201  -8.075 6.77e-16 ***
## log(cells)     1.04271    0.05158  20.217  < 2e-16 ***
## log(doserate)  0.21509    0.01674  12.850  < 2e-16 ***
## dose_fac2.5    1.72742    0.10226  16.892  < 2e-16 ***
## dose_fac5      2.89486    0.12107  23.910  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 916.127  on 26  degrees of freedom
## Residual deviance:  42.089  on 22  degrees of freedom
## AIC: 227.5
## 
## Number of Fisher Scoring iterations: 4

# include log(cells) as offset
mod_pois <- glm(ca ~ offset(log(cells))+log(doserate)+dose_fac, data=dicentric, family=poisson)
summary(mod_pois)

## 
## Call:
## glm(formula = ca ~ offset(log(cells)) + log(doserate) + dose_fac, 
##     family = poisson, data = dicentric)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.76958    0.03430  -80.74   <2e-16 ***
## log(doserate)  0.21447    0.01672   12.83   <2e-16 ***
## dose_fac2.5    1.65299    0.04857   34.03   <2e-16 ***
## dose_fac5      2.80095    0.04251   65.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4753.004  on 26  degrees of freedom
## Residual deviance:   42.776  on 23  degrees of freedom
## AIC: 226.18
## 
## Number of Fisher Scoring iterations: 4

We see that including the offset is important as it can explain a lot of the overdispersion in the model outcome. It is important to note that the GLMs reviewed here are not exhaustive and there are many scenarios you will encounter when these will not work for your data. However, this module is to provide you with the tools necessary to think critically about your data and model assumptions.

If your outcome data is highly overdispersed, it might be better to use more complex models to account for this such as the negative binomial (see below) or the zero-inflated Poisson if you have a large number of zero’s in your outcome.

Negative Binomial

We will very briefly review negative binomial as it is important for overdispersed count data that often occurs in gene expression data and taxonomic compositions.

Load the Data

The data we are looking at is from a simulated experiment based on RNA-Seq data for differential expression. The first 6 columns are the gene expression of different genes and we have covariates for age, treatment, gender, sex. We additionally want to account for total number of reads as an offset.

# load in dataset rnaseq
data(rnaseq)

# description of dataset
?rnaseq

# view first 6 rows of dataset
head(rnaseq)

##    X1  X2 X3  X4  X5 X6 totalReads treatment gender age
## 1 295  65 19 114  54 20   28317494         0      0  57
## 2 213 126 12  96  50  4   20015549         0      0  67
## 3 322 147 23 245  42 19   35318251         0      1  37
## 4 184  35  9 134  74  8   20421437         0      0  59
## 5 376 104  4  74  39  7   21940693         0      1  57
## 6 490  97 18 282 159 24   48645477         0      1  50

Fitting the Negative Binomial Regression

First, we will fit a Poisson regression to show how overdispersed gene expression data can be and how a negative binomial is more accommodating of this kind of data. We can fit this model with glm.nb() which will estimate the extra parameter, theta for us.

# treat X1 as outcome and fit poisson
mod_pois <- glm(X1~offset(log(totalReads))+treatment+gender+age, data=rnaseq, family=poisson)
summary(mod_pois)

## 
## Call:
## glm(formula = X1 ~ offset(log(totalReads)) + treatment + gender + 
##     age, family = poisson, data = rnaseq)
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -1.171e+01  2.557e-02 -458.043  < 2e-16 ***
## treatment   -7.581e-01  1.020e-02  -74.337  < 2e-16 ***
## gender      -8.130e-02  9.645e-03   -8.429  < 2e-16 ***
## age          2.407e-03  4.912e-04    4.901 9.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10798.2  on 199  degrees of freedom
## Residual deviance:  4890.3  on 196  degrees of freedom
## AIC: 6313.2
## 
## Number of Fisher Scoring iterations: 4

# dispersion?
sum_pois <- summary(mod_pois)
sum_pois$deviance/sum_pois$df.residual

## [1] 24.95035

# fit negative binomial
mod_nb <- glm.nb(X1~offset(log(totalReads))+treatment+gender+age, data=rnaseq)
summary(mod_nb)

## 
## Call:
## glm.nb(formula = X1 ~ offset(log(totalReads)) + treatment + gender + 
##     age, data = rnaseq, init.theta = 9.319170881, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.777840   0.126981 -92.753   <2e-16 ***
## treatment    -0.767008   0.047530 -16.137   <2e-16 ***
## gender       -0.043007   0.047568  -0.904    0.366    
## age           0.003463   0.002420   1.431    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(9.3192) family taken to be 1)
## 
##     Null deviance: 460.37  on 199  degrees of freedom
## Residual deviance: 203.51  on 196  degrees of freedom
## AIC: 2243.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  9.319 
##           Std. Err.:  0.965 
## 
##  2 x log-likelihood:  -2233.659

Bonus: Beta Regression

Beta regression is useful for modeling outcomes bounded between 0 and 1, such as proportions. However, it should be used when the denominator of the proportion is unknown. If the denominator is known, binomial regression is more appropriate.

Additionally, beta regression cannot handle outcomes exactly equal to 0 or 1; all values must fall strictly between these bounds.

For this example, we will look at whether there exists an association between DNA methylation at a specific CpG site and gender, age, and smoking status.

Load the Data

The dataset we will use is titled ‘smoker_epigenetic_df.csv’ in our folder.

# first, let's load in the data
smok_dat <- read.csv('./datasets/smoker_epigenetic_df.csv') # # Use your file path for this data

# view data
head(smok_dat)

##          GSM Smoking.Status Gender Age cg00050873 cg00212031 cg00213748
## 1 GSM1051525        current      f  67  0.6075634  0.4228427  0.3724549
## 2 GSM1051526        current      f  49  0.3450542  0.5686615  0.5005995
## 3 GSM1051527        current      f  53  0.3213497  0.3609091  0.3527315
## 4 GSM1051528        current      f  62  0.2772675  0.3044371  0.4752352
## 5 GSM1051529          never      f  33  0.4135991  0.1312511  0.3675446
## 6 GSM1051530        current      f  59  0.6228599  0.5016849  0.2632270
##   cg00214611 cg00455876 cg01707559 cg02004872 cg02011394 cg02050847 cg02233190
## 1  0.6215619  0.2907773  0.2671431  0.1791439  0.4802517  0.3276078  0.2411204
## 2  0.4986067  0.3745909  0.1902743  0.1559775  0.4180809  0.3464627  0.1754907
## 3  0.3738240  0.2306740  0.3147052  0.1057448  0.6151030  0.2375392  0.2464092
## 4  0.4862581  0.2951815  0.2957931  0.1112862  0.3010196  0.3045353  0.1770279
## 5  0.7611667  0.2357703  0.2505265  0.1691084  0.3929746  0.3062257  0.3017014
## 6  0.4157459  0.4751891  0.2539041  0.2607587  0.5097921  0.4052457  0.3852716
##   cg02494853  cg02839557 cg02842889 cg03052502 cg03155755 cg03244189 cg03443143
## 1 0.06706958 0.246993368  0.4692396  0.4002466  0.4150313  0.2214331  0.4758258
## 2 0.04693889 0.236742313  0.3074666  0.3770313  0.3973715  0.2171221  0.5444690
## 3 0.03823712 0.244611725  0.3577526  0.3050442  0.5212775  0.1850495  0.5370600
## 4 0.02671625 0.001641439  0.4457390  0.2714746  0.4344920  0.1654187  0.5079167
## 5 0.03701636 0.334319727  0.3950396  0.3265530  0.4300966  0.1811352  0.4054791
## 6 0.02583463 0.309210202  0.3218573  0.5333670  0.5715522  0.2109749  0.3778239
##   cg03683899 cg03695421 cg03706273
## 1  0.2077242  0.2091974 0.12998255
## 2  0.1844462  0.1937732 0.09853265
## 3  0.3931231  0.2680030 0.04024808
## 4  0.2812089  0.2178572 0.10151626
## 5  0.3107944  0.2800708 0.07785712
## 6  0.4693609  0.3433317 0.04577912

# look at structure
str(smok_dat)

## 'data.frame':    683 obs. of  24 variables:
##  $ GSM           : chr  "GSM1051525" "GSM1051526" "GSM1051527" "GSM1051528" ...
##  $ Smoking.Status: chr  "current" "current" "current" "current" ...
##  $ Gender        : chr  " f" " f" " f" " f" ...
##  $ Age           : int  67 49 53 62 33 59 66 51 55 37 ...
##  $ cg00050873    : num  0.608 0.345 0.321 0.277 0.414 ...
##  $ cg00212031    : num  0.423 0.569 0.361 0.304 0.131 ...
##  $ cg00213748    : num  0.372 0.501 0.353 0.475 0.368 ...
##  $ cg00214611    : num  0.622 0.499 0.374 0.486 0.761 ...
##  $ cg00455876    : num  0.291 0.375 0.231 0.295 0.236 ...
##  $ cg01707559    : num  0.267 0.19 0.315 0.296 0.251 ...
##  $ cg02004872    : num  0.179 0.156 0.106 0.111 0.169 ...
##  $ cg02011394    : num  0.48 0.418 0.615 0.301 0.393 ...
##  $ cg02050847    : num  0.328 0.346 0.238 0.305 0.306 ...
##  $ cg02233190    : num  0.241 0.175 0.246 0.177 0.302 ...
##  $ cg02494853    : num  0.0671 0.0469 0.0382 0.0267 0.037 ...
##  $ cg02839557    : num  0.24699 0.23674 0.24461 0.00164 0.33432 ...
##  $ cg02842889    : num  0.469 0.307 0.358 0.446 0.395 ...
##  $ cg03052502    : num  0.4 0.377 0.305 0.271 0.327 ...
##  $ cg03155755    : num  0.415 0.397 0.521 0.434 0.43 ...
##  $ cg03244189    : num  0.221 0.217 0.185 0.165 0.181 ...
##  $ cg03443143    : num  0.476 0.544 0.537 0.508 0.405 ...
##  $ cg03683899    : num  0.208 0.184 0.393 0.281 0.311 ...
##  $ cg03695421    : num  0.209 0.194 0.268 0.218 0.28 ...
##  $ cg03706273    : num  0.13 0.0985 0.0402 0.1015 0.0779 ...

# look at summary
summary(smok_dat) # some missingness in our outcome, we will not impute

##      GSM            Smoking.Status        Gender               Age       
##  Length:683         Length:683         Length:683         Min.   :18.00  
##  Class :character   Class :character   Class :character   1st Qu.:47.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :56.00  
##                                                           Mean   :53.82  
##                                                           3rd Qu.:62.00  
##                                                           Max.   :80.00  
##                                                                          
##    cg00050873       cg00212031        cg00213748       cg00214611     
##  Min.   :0.1186   Min.   :0.00695   Min.   :0.0000   Min.   :0.01247  
##  1st Qu.:0.4131   1st Qu.:0.06317   1st Qu.:0.3635   1st Qu.:0.06946  
##  Median :0.5052   Median :0.36554   Median :0.4713   Median :0.41575  
##  Mean   :0.5600   Mean   :0.30960   Mean   :0.5191   Mean   :0.34106  
##  3rd Qu.:0.8144   3rd Qu.:0.45981   3rd Qu.:0.7278   3rd Qu.:0.49745  
##  Max.   :0.8989   Max.   :0.70999   Max.   :0.9236   Max.   :0.80606  
##  NA's   :62       NA's   :62        NA's   :62       NA's   :62       
##    cg00455876        cg01707559        cg02004872        cg02011394    
##  Min.   :0.05917   Min.   :0.04333   Min.   :0.00262   Min.   :0.0000  
##  1st Qu.:0.29300   1st Qu.:0.11080   1st Qu.:0.04284   1st Qu.:0.4261  
##  Median :0.37968   Median :0.23873   Median :0.14933   Median :0.5157  
##  Mean   :0.44718   Mean   :0.21435   Mean   :0.15542   Mean   :0.6058  
##  3rd Qu.:0.66283   3rd Qu.:0.28061   3rd Qu.:0.24263   3rd Qu.:0.9412  
##  Max.   :0.85443   Max.   :0.46999   Max.   :0.47384   Max.   :0.9792  
##  NA's   :62        NA's   :62        NA's   :62        NA's   :62      
##    cg02050847        cg02233190        cg02494853        cg02839557     
##  Min.   :0.05234   Min.   :0.00863   Min.   :0.01162   Min.   :0.00000  
##  1st Qu.:0.33963   1st Qu.:0.08838   1st Qu.:0.02865   1st Qu.:0.06384  
##  Median :0.42754   Median :0.25982   Median :0.03695   Median :0.35042  
##  Mean   :0.54369   Mean   :0.23250   Mean   :0.04077   Mean   :0.30088  
##  3rd Qu.:0.95558   3rd Qu.:0.33702   3rd Qu.:0.04677   3rd Qu.:0.45786  
##  Max.   :0.98320   Max.   :0.51173   Max.   :0.28947   Max.   :0.82739  
##  NA's   :62        NA's   :62        NA's   :62        NA's   :62       
##    cg02842889        cg03052502       cg03155755       cg03244189     
##  Min.   :0.01346   Min.   :0.0000   Min.   :0.2020   Min.   :0.02972  
##  1st Qu.:0.05483   1st Qu.:0.4025   1st Qu.:0.4245   1st Qu.:0.11976  
##  Median :0.39757   Median :0.4940   Median :0.4962   Median :0.20397  
##  Mean   :0.32362   Mean   :0.5907   Mean   :0.5895   Mean   :0.19552  
##  3rd Qu.:0.47385   3rd Qu.:0.9631   3rd Qu.:0.8988   3rd Qu.:0.24921  
##  Max.   :0.85625   Max.   :0.9902   Max.   :0.9696   Max.   :0.54074  
##  NA's   :62        NA's   :62       NA's   :62       NA's   :62       
##    cg03443143        cg03683899        cg03695421       cg03706273     
##  Min.   :0.06496   Min.   :0.00788   Min.   :0.0949   Min.   :0.01120  
##  1st Qu.:0.40963   1st Qu.:0.06159   1st Qu.:0.2566   1st Qu.:0.03413  
##  Median :0.48314   Median :0.34422   Median :0.3208   Median :0.04961  
##  Mean   :0.56841   Mean   :0.28442   Mean   :0.3978   Mean   :0.05769  
##  3rd Qu.:0.85436   3rd Qu.:0.41866   3rd Qu.:0.5965   3rd Qu.:0.06916  
##  Max.   :0.93589   Max.   :0.65925   Max.   :0.8433   Max.   :0.34380  
##  NA's   :62        NA's   :62        NA's   :62       NA's   :62

# look closer at gender and smoking status
table(smok_dat$Gender)

## 
##   f   F   m   M 
## 440  13 181  49

table(smok_dat$Smoking.Status)

## 
## current   never 
##     490     193

# clean gender
smok_dat_c <- smok_dat %>% mutate(Gender=case_when(Gender==" f"~"F",
                                                   Gender==" F"~"F",
                                                   Gender==" m"~"M",
                                                   Gender==" M"~"M")) 
table(smok_dat_c$Gender)

## 
##   F   M 
## 453 230

Fitting Beta Regression

To fit beta regression, we need to use the gam() function from the mgcv package in R. We set family=betar() in this function.

# look at effect of smoking status, gender, and age on methylation
modb <- gam(cg00455876~Smoking.Status+Gender+Age, data=smok_dat_c, family=betar())
summary(modb)

## 
## Family: Beta regression(31.518) 
## Link function: logit 
## 
## Formula:
## cg00455876 ~ Smoking.Status + Gender + Age
## 
## Parametric coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.7581417  0.0731284 -10.367   <2e-16 ***
## Smoking.Statusnever  0.0655020  0.0328325   1.995    0.046 *  
## GenderM              1.6778920  0.0342145  49.040   <2e-16 ***
## Age                  0.0005844  0.0013288   0.440    0.660    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.832   Deviance explained = 82.5%
## -REML = -670.56  Scale est. = 1         n = 621

# check assumptions with gam.check
gam.check(modb)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [4.752114e-09,4.752114e-09]
## (score -670.5585 & scale 1).
## Hessian positive definite, eigenvalue range [317.2695,317.2695].
## Model rank =  4 / 4

The mgcv package has a useful function, gam.check, that allows us to check all of our usual model assumptions. In the QQ plot, we want the dots to follow roughly along the diagonal line if we assumed the correct distribution (beta) for our data. In the residual vs fitted plot, we are looking for a random scattering of points as usual.

Independent Work

Exercise 1 - Baboon Diet and Social Group Size

The baboongut dataset from the microbiomeDataSets contains data on baboons over a 14 year span to look at heritability of the gut microbiome within social groups. Let’s load in this dataset and take a closer look. First, we will filter the data to only keep the first observation for each baboon to make the data cross-sectional as opposed to longitudinal.

# load in dataset 
ts <- baboongut()
sample_data <- colData(ts) # colData gives us the metadata we need

# look at description
?baboongut

# we will only look at first sample per ID (so it is not longitudinal)
baboon_first <- sample_data[isUnique(sample_data$baboon_id), ]
dim(baboon_first)

## [1] 62 35

# we will only focus on the diet variables
baboon_first_c <- baboon_first %>%
  as.data.frame(.) %>%
                  dplyr::select(group_size, contains("diet_PC"))

dim(baboon_first_c)

## [1] 62 14

Now, do the following:

Create a new dataset called ‘baboon_first_c’ that only has ‘group_size’ and the first 7 “diet_PC’ columns. These are the variables you would get had you done a principal component analysis on diet variables to reduce the dimensionality of your data. We will be using these principal components as our new predictors.

# we can use select from tidyverse to do this
baboon_first_c <- baboon_first %>%
  as.data.frame(.) %>%
                  dplyr::select(group_size, diet_PC1:diet_PC7)

dim(baboon_first_c)

## [1] 62  8

Fit a linear regression model with the lm() function with group_size as the outcome and the principal components of diet variables (diet_PC) as your predictors.

# fit model 
mod1 <- lm(group_size~., baboon_first_c)
summary(mod1)

## 
## Call:
## lm(formula = group_size ~ ., data = baboon_first_c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.092 -12.943  -2.438  15.847  44.974 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.22209    3.16188  22.525  < 2e-16 ***
## diet_PC1     0.07252    0.09040   0.802   0.4259    
## diet_PC2     0.17070    0.23158   0.737   0.4643    
## diet_PC3    -0.52114    0.32445  -1.606   0.1141    
## diet_PC4     0.12835    0.40988   0.313   0.7554    
## diet_PC5     0.80878    0.73445   1.101   0.2757    
## diet_PC6     1.13856    0.53488   2.129   0.0379 *  
## diet_PC7     4.36778    0.98211   4.447 4.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.67 on 54 degrees of freedom
## Multiple R-squared:  0.4013, Adjusted R-squared:  0.3237 
## F-statistic:  5.17 on 7 and 54 DF,  p-value: 0.0001502

Check the model assumptions of this linear regression model. What violations can you find?

Tip

Think about all assumptions, not just those that we can test/visualize.

Can you think of how you would try to fix these violations?

# multicollinearity?
vif(mod1)

## diet_PC1 diet_PC2 diet_PC3 diet_PC4 diet_PC5 diet_PC6 diet_PC7 
## 1.051217 1.084132 1.092934 1.177352 1.814580 1.386233 1.504968

# linearity?
pairs(baboon_first_c)

# normality?
qqPlot(mod1)

## sample_12052-TAGATCCTCGGA-408 sample_12050-TATGCCAGAGAT-407 
##                            38                            52

# constant variance?
plot(resid(mod1)~predict(mod1))

The key violation here would likely be that we do not have independent observations as those within the same social group are more likely to be similar to one another in how they eat (and in this case, will have the same outcome). Other potential violations would be that there seems to be a bit of increasing variance in model residuals which could be because our outcome is a discrete count and Poisson might be more appropriate.

Exercise 2 - Healthcare Demand in U.S.A

For this example we will look at the demand of healthcare in the U.S.A based on certain risk factors such as age, sex, income, etc. We will first load in the dataset, ‘healthcare_demand.csv’ from our folder.

# 1. Load in the dataset.
health <- read.csv('./datasets/healthcare_demand.csv') # # Use your file path for this data

# 2. view data
str(health)

## 'data.frame':    4406 obs. of  18 variables:
##  $ visits   : int  5 1 13 16 3 17 9 3 1 0 ...
##  $ nvisits  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ovisits  : int  0 2 0 5 0 0 0 0 0 0 ...
##  $ novisits : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emergency: int  0 2 3 1 0 0 0 0 0 0 ...
##  $ hospital : int  1 0 3 1 0 0 0 0 0 0 ...
##  $ health   : chr  "average" "average" "poor" "poor" ...
##  $ chronic  : int  2 2 4 2 2 5 0 0 0 0 ...
##  $ adl      : chr  "normal" "normal" "limited" "limited" ...
##  $ region   : chr  "other" "other" "other" "other" ...
##  $ age      : num  6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
##  $ gender   : chr  "male" "female" "female" "male" ...
##  $ married  : chr  "yes" "yes" "no" "yes" ...
##  $ school   : int  6 10 10 3 6 7 8 8 8 8 ...
##  $ income   : num  2.881 2.748 0.653 0.659 0.659 ...
##  $ employed : chr  "yes" "no" "no" "no" ...
##  $ insurance: chr  "yes" "yes" "no" "yes" ...
##  $ medicaid : chr  "no" "no" "yes" "no" ...

head(health)

##   visits nvisits ovisits novisits emergency hospital  health chronic     adl
## 1      5       0       0        0         0        1 average       2  normal
## 2      1       0       2        0         2        0 average       2  normal
## 3     13       0       0        0         3        3    poor       4 limited
## 4     16       0       5        0         1        1    poor       2 limited
## 5      3       0       0        0         0        0 average       2 limited
## 6     17       0       0        0         0        0    poor       5 limited
##   region age gender married school income employed insurance medicaid
## 1  other 6.9   male     yes      6 2.8810      yes       yes       no
## 2  other 7.4 female     yes     10 2.7478       no       yes       no
## 3  other 6.6 female      no     10 0.6532       no        no      yes
## 4  other 7.6   male     yes      3 0.6588       no       yes       no
## 5  other 7.9 female     yes      6 0.6588       no       yes       no
## 6  other 6.6 female      no      7 0.3301       no        no      yes

# fix income (notice that some incomes are negative, we will set those to 0)
health$income[health$income<0] <- 0

Choose an appropriate regression model if visits is our outcome and our predictors are health, chronic, adl, income, gender, age, insurance, medicaid. Additionally include an interaction term between income and gender. Fit this model to the data.

# we will choose a Poisson regression given that the outcome is a discrete count
glm_mod <- glm(visits~health+chronic+adl+income*gender+age+insurance+medicaid, family="poisson", data=health)
summary(glm_mod)

## 
## Call:
## glm(formula = visits ~ health + chronic + adl + income * gender + 
##     age + insurance + medicaid, family = "poisson", data = health)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.699105   0.085126  19.960  < 2e-16 ***
## healthexcellent   -0.353853   0.030311 -11.674  < 2e-16 ***
## healthpoor         0.259091   0.018087  14.325  < 2e-16 ***
## chronic            0.163046   0.004534  35.964  < 2e-16 ***
## adlnormal         -0.089349   0.016605  -5.381 7.41e-08 ***
## income            -0.003374   0.003104  -1.087    0.277    
## gendermale        -0.137261   0.017031  -8.060 7.66e-16 ***
## age               -0.062546   0.010478  -5.969 2.38e-09 ***
## insuranceyes       0.378897   0.018923  20.023  < 2e-16 ***
## medicaidyes        0.248354   0.024639  10.080  < 2e-16 ***
## income:gendermale  0.017089   0.004123   4.145 3.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 26943  on 4405  degrees of freedom
## Residual deviance: 23818  on 4395  degrees of freedom
## AIC: 36615
## 
## Number of Fisher Scoring iterations: 5

Interpret the interaction term between income and gender

Income has a larger association with number of visits for males such that males with higher income are likely to have more visits while income does not have an effect on visits for females.

Check model assumptions (also look for overdispersion if appropriate for the model you chose)

Note

You cannot use crPlot since we included an interaction so we can assume the linearity assumption is met for now.

# constant variance of deviance residuals?
plot(resid(glm_mod)~predict(glm_mod))

# overdispersed?
glm_summary <- summary(glm_mod)
glm_summary$deviance/glm_summary$df.residual # overdispersed

## [1] 5.419233

Fix any model violations you may have noticed.

# we will log transform income (need to add 0.1 since we cannot take log of 0)
health$income_log  <- log(health$income+0.1)
# Fitting quasipoisson to account for overdisperion
glm_mod1 <- glm(visits~health+chronic+adl+income_log*gender+age+insurance+medicaid, family="quasipoisson", data=health)
summary(glm_mod1)

## 
## Call:
## glm(formula = visits ~ health + chronic + adl + income_log * 
##     gender + age + insurance + medicaid, family = "quasipoisson", 
##     data = health)
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.68935    0.22585   7.480 8.93e-14 ***
## healthexcellent       -0.35470    0.08038  -4.413 1.05e-05 ***
## healthpoor             0.26340    0.04803   5.484 4.40e-08 ***
## chronic                0.16305    0.01203  13.555  < 2e-16 ***
## adlnormal             -0.08755    0.04408  -1.986 0.047067 *  
## income_log            -0.01862    0.02621  -0.710 0.477459    
## gendermale            -0.16838    0.04704  -3.579 0.000348 ***
## age                   -0.06075    0.02784  -2.182 0.029129 *  
## insuranceyes           0.37319    0.05050   7.390 1.74e-13 ***
## medicaidyes            0.24667    0.06602   3.736 0.000189 ***
## income_log:gendermale  0.10583    0.04323   2.448 0.014410 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.032885)
## 
##     Null deviance: 26943  on 4405  degrees of freedom
## Residual deviance: 23794  on 4395  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Advanced Exercise - Multivariate Count Data

# we will log transform income (need to add 0.1 since we cannot take log of 0)
health$income_log  <- log(health$income+0.1)
# Fitting quasipoisson to account for overdisperion
glm_mod1 <- glm(visits~health+chronic+adl+income_log*gender+age+insurance+medicaid, family="quasipoisson", data=health)
summary(glm_mod1)

## 
## Call:
## glm(formula = visits ~ health + chronic + adl + income_log * 
##     gender + age + insurance + medicaid, family = "quasipoisson", 
##     data = health)
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.68935    0.22585   7.480 8.93e-14 ***
## healthexcellent       -0.35470    0.08038  -4.413 1.05e-05 ***
## healthpoor             0.26340    0.04803   5.484 4.40e-08 ***
## chronic                0.16305    0.01203  13.555  < 2e-16 ***
## adlnormal             -0.08755    0.04408  -1.986 0.047067 *  
## income_log            -0.01862    0.02621  -0.710 0.477459    
## gendermale            -0.16838    0.04704  -3.579 0.000348 ***
## age                   -0.06075    0.02784  -2.182 0.029129 *  
## insuranceyes           0.37319    0.05050   7.390 1.74e-13 ***
## medicaidyes            0.24667    0.06602   3.736 0.000189 ***
## income_log:gendermale  0.10583    0.04323   2.448 0.014410 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.032885)
## 
##     Null deviance: 26943  on 4405  degrees of freedom
## Residual deviance: 23794  on 4395  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Advanced Exercise - Multivariate Count Data

Often in gene expression data or compositional data, we have multiple counts. The package MGLM in R offers functions to fit multivariate count models such as the multinomial model and more advanced models to handle overdispersion of these counts.

Let’s look again at the rna seq dataset we used for the negative binomial example.

# load the dataset
data(rnaseq)

# look at description of the dataset
?rnaseq

# look at first 6 rows
head(rnaseq)

##    X1  X2 X3  X4  X5 X6 totalReads treatment gender age
## 1 295  65 19 114  54 20   28317494         0      0  57
## 2 213 126 12  96  50  4   20015549         0      0  67
## 3 322 147 23 245  42 19   35318251         0      1  37
## 4 184  35  9 134  74  8   20421437         0      0  59
## 5 376 104  4  74  39  7   21940693         0      1  57
## 6 490  97 18 282 159 24   48645477         0      1  50

# look at summary of variables
summary(rnaseq)

##        X1              X2               X3              X4       
##  Min.   : 29.0   Min.   : 15.00   Min.   :  3.0   Min.   : 54.0  
##  1st Qu.:132.8   1st Qu.: 43.00   1st Qu.: 15.0   1st Qu.:138.8  
##  Median :191.0   Median : 66.00   Median : 32.0   Median :201.5  
##  Mean   :215.6   Mean   : 73.72   Mean   : 68.1   Mean   :242.6  
##  3rd Qu.:275.8   3rd Qu.: 96.25   3rd Qu.:109.2   3rd Qu.:320.5  
##  Max.   :711.0   Max.   :186.00   Max.   :259.0   Max.   :684.0  
##        X5               X6          totalReads         treatment  
##  Min.   : 18.00   Min.   : 2.00   Min.   :20015549   Min.   :0.0  
##  1st Qu.: 40.00   1st Qu.: 8.00   1st Qu.:26675268   1st Qu.:0.0  
##  Median : 58.00   Median :12.50   Median :33177688   Median :0.5  
##  Mean   : 65.25   Mean   :14.69   Mean   :33748425   Mean   :0.5  
##  3rd Qu.: 81.00   3rd Qu.:19.00   3rd Qu.:40486800   3rd Qu.:1.0  
##  Max.   :183.00   Max.   :41.00   Max.   :49813815   Max.   :1.0  
##      gender           age       
##  Min.   :0.000   Min.   :26.00  
##  1st Qu.:0.000   1st Qu.:43.00  
##  Median :1.000   Median :50.00  
##  Mean   :0.525   Mean   :49.69  
##  3rd Qu.:1.000   3rd Qu.:56.25  
##  Max.   :1.000   Max.   :77.00

Look up the function MGLMreg to learn about it. What distributions can you fit with this function?

Multinomial (dist=“MN”), Dirichlet-multinomial (dist=“DM”), generalized Dirichlet-multinomial (dist=“GDM”), and negative multinomial (dist=“NM”).

Let’s fit a multinomial regression to the 6 gene expression columns with gender, age, and treatment as our covariates. We will include log(totalreads) as well.

# the multinomial outcome model would assume NO overdispersion:
mnreg <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
                                 treatment + age + gender, data = rnaseq, dist = "MN") 

print(mnreg)

## Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) + 
##     treatment + age + gender, data = rnaseq, dist = "MN")
## 
## Coefficients:
##             X1        X2        X3        X4        X5
## [1,]  4.942732  5.009649 -3.792216  4.435434  4.027689
## [2,] -0.112845 -0.170222  0.219277 -0.107260 -0.120928
## [3,] -0.022655 -0.043099  2.745277  1.405742  0.092246
## [4,] -0.006187 -0.009709 -0.005907 -0.010945 -0.009599
## [5,]  0.032676  0.100389  0.020663  0.103859  0.009514
## 
## Hypothesis test: 
##                  wald value    Pr(>wald)
## (Intercept)       144.88789 1.634268e-29
## log(totalReads)    69.92572 1.061922e-13
## treatment       18364.13260 0.000000e+00
## age                79.91023 8.762650e-16
## gender             52.33670 4.601575e-10
## 
## Distribution: Multinomial
## Log-likelihood: -7506.393
## BIC: 15145.24
## AIC: 15062.79
## Iterations: 6

Each row of the coefficient matrix in this output corresponds to our covariates (intercept, log(totalReads), treatment, age, gender) and each column corresponds to our outcomes (X1,…,X6). For example, the coefficient in the first row and first column (4.943) estimates the intercept of X1 which gives the log of baseline expression levels. The second row/first column coefficient (-0.113) gives the estimated association between log(totalReads) and gene expression of X1.

Given that this dataset is overdispersed, it is likely that we will need to use a different distribution to accommodate this. Choose one of the distributions available other than “MN”

# we can change "MN" in the previous code to "DM" to fit a Dirichlet-multinomial regression
dmreg <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
                   treatment + age + gender, data = rnaseq, dist = "DM") 
print(dmreg)

## Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) + 
##     treatment + age + gender, data = rnaseq, dist = "DM")
## 
## Coefficients:
##             X1        X2        X3        X4        X5        X6
## [1,] -0.895850 -1.096921 -8.997414 -1.736871 -1.774227 -5.646822
## [2,]  0.221988  0.186919  0.536572  0.252679  0.216672  0.347271
## [3,] -0.679291 -0.686881  1.835585  0.707954 -0.546469 -0.543134
## [4,]  0.010245  0.005227  0.009134  0.004252  0.006090  0.011642
## [5,] -0.026177  0.040244 -0.052842  0.023178 -0.058339 -0.039139
## 
## Hypothesis test: 
##                  wald value  Pr(>wald)
## (Intercept)       14.579069 0.02379579
## log(totalReads)    8.502549 0.20354699
## treatment       1851.437449 0.00000000
## age               13.131512 0.04099442
## gender             4.133364 0.65863419
## 
## Distribution: Dirichlet Multinomial
## Log-likelihood: -4386.941
## BIC: 8932.831
## AIC: 8833.882
## Iterations: 9

If we wanted to know which model provided a better fit, we could compare their information criterion. For instance, the Bayesian Information Criterion (BIC) is designed to balance model complexity and model fit. A lower BIC indicates a better fitting model. Let’s compare the BIC of our two models fit using the BIC() function in R. Which model fits better for this data?

BIC(dmreg)

## [1] 8932.831

BIC(mnreg)

## [1] 15145.24

Lab Completed!

Congratulations! You have completed Lab 3!