# Motivations

Capture functional relationship between a set of random variables:

• crop yield, temperature, amount of rain, soil type, etc
• stock return, trading volume, macroeconomic indices, etc
• severity of Alzheimer’s disease, age, gender, ethnicity, etc

# Data from Example 11.4

Soil pH and Growth delay index

  SoilpH GrowRet
1    3.3   17.78
2    3.4   21.59
3    3.4   23.84
4    3.5   15.13
5    3.6   23.45
6    3.6   20.87

Note: ties in measurements of SoilpH

# Data for simple regression

• SoilpH: $$x_1, x_2, \ldots, x_n$$

• GrowthRet: $$y_1, y_2, \ldots, y_n$$

• Observations: $$(x_1,y_1), (x_2,y_2),\ldots, (x_n,y_n)$$

# Scatter plot for data

SoilPH versus GrowRet

# Linear relationship?

What is special about the Blue Line $$y = b_0 + b_1 x$$?

# Least-squares method

Blue line $$y = b_0 + b_1 x$$

# Least-squares method

Blue line: $y = b_0 + b_1 x$ minimizes the sum of squares $\sum_{i=1}^n \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$ among all lines of the form $y = \beta_0 + \beta_1 x$

Caution: no statistics involved yet …

# Recap: regression line

Regression Line $$y = b_0 + b_1 x$$:

# Statistical Model

• $$y_1, y_2, \ldots, y_n$$ are observations from the random variable $$y$$

• $$x_1, x_2, \ldots, x_n$$ are observations from a deterministic variable $$x$$

• $$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$$ random errors with mean zero

Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ i.e., $E[y] = \beta_0 + \beta_1 x$

# A statistical solution

Find the best among the family $y = \beta_0 + \beta_1 x + \varepsilon$ or among the family $E[y] = \beta_0 + \beta_1 x$ that minimizes the sum of squares $\sum_{i=1}^n \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$

# The solution: regression line

The Blue Line = the Regression Line:

# General principle

• $$y_1, y_2, \ldots, y_n$$ are observations from the random variable $$y$$

• $$x_1, x_2, \ldots, x_n$$ are observations from a deterministic variable $$x$$

• Find the best among the family $E[y] = g(x; \boldsymbol {\beta})$ that minimizes a quantitative criterion

# Nonlinear model

Observations in signal processing:

# Questions

• For observations $$\{(x_i,y_i),i=1,\ldots,n\}$$ that are not random, can least-squares method be used to obtain the regression line?

• When will a simple regression model based on least-squares method be a bad modeling strategy?

• When observations $$\{(x_i,y_i),i=1,\ldots,n\}$$ are random, can you be certain about the value of the intercept and slope of the regression line?

# The fitted model

For the statistical model, we have:

• Estimated slope: $$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}=\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

• Estimated intercept: $$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Namely, the fitted model is $y = \hat{\beta}_0 + \hat{\beta}_1 x + \varepsilon$ and the Regression Line is $y = \hat{\beta}_0 + \hat{\beta}_1 x$

# Find Regression Line

Manual computation:

> bar_x = mean(SoilpH)
> bar_x
[1] 4.025
> bar_y = mean(GrowRet)
> bar_y
[1] 15.842
>
> S_xx = sum((SoilpH - bar_x)^2)
> S_xx
[1] 6.2375
> S_xy = sum((SoilpH - bar_x) * (GrowRet - bar_y))
> S_xy
[1] -49.022

# Find Regression Line

R command:

> ModFit = lm(y ~ x, data = youdata)
> ModFit$coefficients Illustration: > reg = lm(GrowRet ~ SoilpH, data = dataA) > reg$coefficients
(Intercept)      SoilpH
47.475435   -7.859238 

# Inference on coefficients

Model: $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Assumption: $$\varepsilon_i$$’s are Normally distributed with mean zero and variance $$\sigma^2$$

Model: $$y = \beta_0 + \beta_1 x + \varepsilon$$

Fitted model: $$y = \hat{\beta}_0 + \hat{\beta}_1 x + \varepsilon$$

# Inference on coefficients

Recall $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}=\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Under model assumptions:

• $$\hat{\beta}_1$$: Normal with mean $$\beta_1$$ and standard deviation $$\frac{\sigma}{\sqrt{S_{xx}}}$$
• $$\hat{\beta}_0$$: Normal with mean $$\beta_0$$ and standard deviation $$\sigma \sqrt{\frac{1}{n}+ \frac{\bar{x}^2}{S_{xx}}}$$

# Inference on coefficients (Extra)

$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \sum \left(\frac{1}{n} - \frac{\bar{x}\sum (x_i - \bar{x})}{\sum (x_i - \bar{x})^2}\right)y_i$

# Estimate error variance

Estimate error variance $$\sigma^2$$ by

$s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n-2}$ where each $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

is called the “fitted value” for $$y_i$$

Note: $$\hat{\varepsilon}_i = y_i - \hat{y}_i$$ is called the “i-th residual”

# Inference on coefficients

Under model assumptions:

Estimated slope: $$\hat{\beta}_1 \sim \textsf{Normal}\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)$$

Estimated intercept: $\hat{\beta}_0 \sim \textsf{Normal}\left(\beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right)$

Estimated error variance: $s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n-2}$

# Inference on coefficients

Please construct a procedure for testing $$H_0: \beta_1 =0$$ vs $$H_a: \beta_1 \ne 0$$ using Student’s t test; formula

$t = \frac{\hat{\beta}_1 - 0} {s_{\varepsilon}\sqrt{1/S_{xx}}}$

# Construct procedure

Manual computation:

> bar_x = mean(SoilpH)
> bar_x
[1] 4.025
>
> S_xx = sum((SoilpH - bar_x)^2)
> S_xx
[1] 6.2375
>
> n = length(SoilpH)
> n
[1] 20

# Construct procedure

Estimated error variance:

> EstErrorVar = sum(reg$residuals^2)/(length(reg$residuals) -
+     2)
> EstErrorVar
[1] 7.407196
>
> dfTestStat = reg$df.residual > dfTestStat [1] 18 > > qt(0.025, df = dfTestStat, lower.tail = F) [1] 2.100922 # Inference on coefficients R commands: > reg = lm(y ~ x, data = yourdata) > library(broom) # organize output > tidy(summary(reg)) Illustration: > reg = lm(GrowRet ~ SoilpH, data = dataA) > library(broom) > tidy(summary(reg)) term estimate std.error statistic p.value 1 (Intercept) 47.475435 4.428208 10.721138 3.026862e-09 2 SoilpH -7.859238 1.089737 -7.212052 1.038413e-06 # Inference on coefficients Illustration with full output: > reg = lm(GrowRet ~ SoilpH, data = dataA) > summary(reg) Call: lm(formula = GrowRet ~ SoilpH, data = dataA) Residuals: Min 1Q Median 3Q Max -5.1503 -1.2405 0.1382 1.6893 4.3552 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 47.475 4.428 10.721 3.03e-09 *** SoilpH -7.859 1.090 -7.212 1.04e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.722 on 18 degrees of freedom Multiple R-squared: 0.7429, Adjusted R-squared: 0.7286 F-statistic: 52.01 on 1 and 18 DF, p-value: 1.038e-06 # Confidence interval for slope $$(1-\alpha) \times 100 \%$$ CI: $\hat{\beta}_1 \pm t_{\alpha/2,n-2} s_{\varepsilon} \sqrt{1/S_{xx}}$ # Confidence interval for intercept $$(1-\alpha) \times 100 \%$$ CI: $\hat{\beta}_0 \pm t_{\alpha/2,n-2} s_{\varepsilon} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$ # CI for coefficients > bar_x = mean(SoilpH) > bar_x [1] 4.025 > > S_xx = sum((SoilpH - bar_x)^2) > S_xx [1] 6.2375 > > n = length(SoilpH) > n [1] 20 # CI for coefficients > EstErrorVar = sum(reg$residuals^2)/(length(reg$residuals) - + 2) > EstErrorVar [1] 7.407196 > > dfTestStat = reg$df.residual
> dfTestStat
[1] 18
>
> qt(0.025, df = dfTestStat, lower.tail = F)
[1] 2.100922

# CI for coefficients

R command:

> reg = lm(y ~ x, data = yourdata)
> confint(reg, level = 0.95)

Illustration:

> reg = lm(GrowRet ~ SoilpH, data = dataA)
> confint(reg, level = 0.95)
2.5 %    97.5 %
(Intercept)  38.17211 56.778756
SoilpH      -10.14869 -5.569786

# Coefficient of determination

The coefficient of determination: $R^2 = 1- \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$

Misunderstandings:

• A larger $$R^2$$ indicates that useful predictions can be made
• A larger $$R^2$$ indicates that the estimated linear model is a good fit
• $$R^2$$ close to zero indicates that $$y$$ and $$x$$ are not related

# Coefficient of determination

R command:

> # fit linear model
> reg = lm(GrowRet ~ SoilpH, data = dataA)
> # show R-squared
> summary(reg)$r.squared [1] 0.7429074 > # show F statistic > summary(reg)$fstatistic
value    numdf    dendf
52.01369  1.00000 18.00000 

# Questions

Suppose model assumptions are satisfied:

• Do you know how to test if $$\beta_0 = 0$$?

• Do you know another strategy to test if $$\beta_1 = 0$$? Think about ANOVA and $$R^2$$

• Can you summarize the rules to construct confidence intervals for $$\hat{\beta}_1$$ and $$\hat{\beta}_0$$?

# Predict a new mean

With a new observation $$x_{n+1}$$ from the explanatory variable,

$\hat{y}_{n+1} = \hat{\beta}_0 + \hat{\beta}_1 x_{n+1}$

predicts $$E(y_{n+1})$$

$$\hat{y}_{n+1}$$ has a Normal distribution with mean $$\beta_0 + \beta_1 x_{n+1}$$

The standard deviation of $$\hat{y}_{n+1}$$ can be estimated by

$s_{\varepsilon} \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$

# Predict a new mean

R commands:

> reg = lm(y ~ x, data = yourdata)
> newdata = data.frame(new_x_values)
> predict(reg, newdata, interval = "confidence", level = 0.95)

Illustration:

> regA = lm(GrowRet ~ SoilpH, data = dataA)
> new = data.frame(SoilpH = 4)
> predict(regA, new, interval = "confidence")
fit      lwr      upr
1 16.03848 14.75864 17.31832

# Predict a new response

With a new observation $$x_{n+1}$$ from the explanatory variable,

$\hat{y}_{n+1} = \hat{\beta}_0 + \hat{\beta}_1 x_{n+1}$

predicts $$y_{n+1}$$

The associated standard deviation for $$\hat{y}_{n+1}$$ is estimated by

$s_{\varepsilon} \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$

# Predict a new response

R commands:

> reg = lm(y ~ x, data = yourdata)
> newdata = data.frame(new_x_values)
> predict(reg, newdata, interval = "predict", level = 0.95)

Illustration:

> regA = lm(GrowRet ~ SoilpH, data = dataA)
> new = data.frame(SoilpH = 4)
> predict(regA, new, interval = "predict")
fit      lwr      upr
1 16.03848 10.17909 21.89787

# Predict a new mean and response

Regression line: $$\hat{y}_{n+1} = \hat{\beta}_0 + \hat{\beta}_1 x_{n+1}$$

• $$\hat{\beta}_0 = 47.47$$ and $$\hat{\beta}_1 = -7.86$$
• $$x_{n+1} = 5$$

# Envelope for Regression Line

Confidence envelope for expected value of response:

# Questions

• Is it easier to predict the mean $$E(y_{n+1})$$ than $$y_{n+1}$$?

• What affects the accuracy of predicting $$E(y_{n+1})$$?

# A suggested work flow

• Check for outliers
• Check for unequal variances
• Check for dependence among random errors
• Check for Normality

# High leverage point

A high leverage point is an outlier in explanatory variable

# High influence point

A high influence point is a pair of observations, each of whose coordinates is an outlier

# Error variances unequal

Unequal error varainces can occur when:

• observations are obtained from slightly different experimental codnitions
• there is an intrinsic relationship between mean and variance

Scatter plot:

Residual plot:

# Error variances unequal

Unequal error varainces can be mitigated by:

• Weighted Least-Squares (choosing weights can be very hard though)

• transforming response or regressor (not always sensible)

# Dependent random errors

Dependent random errors can be induced when:

• observations are generated from dynamic systems
• observations are repeated measurements

# Dependent random errors

Dependent random errors can be adjusted by techniques from time-series analysis:

• models for observations for the response
• models for random errors
• combining the above two

# Linearity not plausible

Implausibility of linear model can be mitigated by:

• adding nonlinear terms to Simple Regression Model, e.g., $y = {\beta}_0 + {\beta}_1 x + \beta_2 x^2+ \varepsilon$
• including more explanatory variables, e.g., $y = {\beta}_0 + {\beta}_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2+ \varepsilon$

• transforming response or regressor (not always sensible)

# Non-normal random error

Non-normal random error can be mitigated by transforming response, at the risk of an hard-to-interpret model.

For example, $$y$$ refers to yield; $$y^{1/3}$$ is hard to interpret.

One tool for this purpose is Box-Cox Transform. This transform can also deal with unequal variances and non-linearity.

# Box-Cox Transform

R command:

> library(MASS)
> boxcox(object, lambda = seq(-2, 2, 1/10), plotit = TRUE,
+     interp, eps = 1/50, xlab = expression(lambda),
+     ylab = "log-Likelihood", ...)

# Questions

• When random errors have unequal variances, is it hard to check Normality of random errors?
• When random errors are dependent, is it hard to check Normality of random errors?
• When will you consider transforming explanatory variable and/or response?

(Sometimes you can use bootstrap when model assumptions are violated)

# Motivation

• Using blood sugar, blood pressure, weight, triglycerides to predict cholesterol

• Using HW scores, participation scores, midterm scores to predict final exam scores

• Using precipitation, temperature, relative humidity to predict fruit growth

# Simple linear regression

• Model: $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

• Model: $$y = \beta_0 + \beta_1 x + \varepsilon$$ when $$\varepsilon_i$$ are i.i.d.

• Model: $$E[y] = \beta_0 + \beta_1 x$$

• $$\beta_1$$: change in units in $$E(y)$$ for unit change in $$x$$

In a model, the random error $$\varepsilon$$ term represents all other variables not explicitly included in the model

# Simple linear regression

• Not sufficient to caputre relationship between more than two variables

• Not sufficient to caputre nonlinear relationship between two variables

A fine decomposition of random error leads to a better but more complicated model

# Multiple linear regression

Multiple linear regression with one response

• one response variable $$y$$; observations $$y_1, y_2, \ldots, y_n$$
• $$p$$ covariates $$x_1$$, $$\cdots$$, $$x_p$$
• random error $$\varepsilon$$ with zero mean; realizations $$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$$

Say, $$p=3$$ covariates $$x_1$$, $$x_2$$ and $$x_3$$, then the model is:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_{i}$

# Multiple linear regression

In the model $y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$

if you obsorb the term $\beta_2 x_{2} + \beta_3 x_{3} + \varepsilon_{}$ into a new random error $$\delta$$, i.e., if you set

$\delta = \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon,$ you get Simple Linear Regression Model

$y = \beta_0 + \beta_1 x_{1} + \delta$

# Intepretation of model

Take $$p=3$$ for example. Consider the model: $y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$

Is the model for $$E(y)$$, i.e., for the expected value of $$y$$?

What does $$\beta_1$$ mean?

# Estimate regression coefficients

Still the Least-Squares Method is used, i.e., the fitted model minimizes the sum of squares $\sum_{i=1}^n \left[ y_i - (\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p) \right]^2$ with $$\beta_i, i=1,\cdots,p$$ as parameters

# Fit the model

• Model: $$\boldsymbol{y} = \boldsymbol{X \beta} + \boldsymbol{\varepsilon}$$

• Estimated coefficients: $$\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^{\prime}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{\prime}\boldsymbol{y}$$

• Fitted value: $$\hat{\boldsymbol{y}} = \boldsymbol{X} \hat{\boldsymbol{\beta}}$$

• Fitted model: $$\boldsymbol{y} = \boldsymbol{X} \hat{\boldsymbol{\beta}} + \boldsymbol{\varepsilon}$$

• Residuals: $$\hat{\boldsymbol{\varepsilon}} = \boldsymbol{y} - \hat{\boldsymbol{y}}$$

# Data matrix

Assume we have:

• observations $$y_1, y_2, \ldots, y_n$$ for $$y$$

• observations $$x_{i1}, x_{i2}, \cdots, x_{in}$$ for $$x_i$$, $$i=1,\cdots,p$$

Data matrix $$D$$ is $$n \times (p+1)$$. The $$j$$th row of $$D$$ is

$(y_j, x_{1j}, x_{2j}, \cdots, x_{pj})$

Data for multiple regression should have this form for software processing!

# Data matrix

Importing data:

  satisfaction age severity anxiety
1           48  50       51     2.3
2           57  36       46     2.3
3           66  40       48     2.2
4           70  41       44     1.8
5           89  28       43     1.8
6           36  49       54     2.9

$$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety

# Correlation plot

A correlation plot helps visually check

• if a covariate is correlated to response
• if some covariates are correlated with each other
• if a linear term for a covariate in the model is appropriate

# Correlation plot

R command:

> pairs(x, ...)

# Correlation plot

> pairs(sat)

# Fit multiple regression model

R command: lm(formula, data)

• “formula” is the expression for model
• it can be: y ~ x1 + x2 + x3
• it can be: y ~ x1 + x2 + x3 + x1x2 + x3x3

# Fit multiple regression model

$$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety

Model: $y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon,$ where

> regML = lm(satisfaction ~ age + severity + anxiety,
+     sat)
> regML$coefficients (Intercept) age severity anxiety 158.4912517 -1.1416118 -0.4420043 -13.4701632  # Correlation plot Does the fit comply with scatter plot? > pairs(sat) # Fitted model $$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety Fitted model: $y = 158.49 -1.14 x_{1} -0.44 x_{2} -13.47 x_{3} + \varepsilon$ • How to interpret the estimated regression coefficients? • Do you think the model caputures well the relationship between these variables? # Questions Recall the following: • $$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety • Fitted model: $y = 158.49 -1.14 x_{1} -0.44 x_{2} -13.47 x_{3} + \varepsilon$ Suppose you just fit the model $y = \beta_0 + \beta_1 x_{1} + \varepsilon$ • will the estimated $$\beta_1$$ obtained from this model be different from that obtained from the previous model? # Questions Fit model: $$y = \beta_0 + \beta_1 x_{1} + \varepsilon$$ > regMG = lm(satisfaction ~ age, sat) > regMG$coefficients
(Intercept)         age
119.943170   -1.520604 

Fitted model: $$y = 119.94 -1.52 x_{1} + \varepsilon$$

# Questions

Recall: $$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety

Compare the two fitted models:

• $$y = 119.94 -1.52 x_{1} + \varepsilon$$

• $y = 158.49 -1.14 x_{1} -0.44 x_{2} -13.47 x_{3} + \varepsilon$

Do you know what caused this difference?

# General principle

Assumptions:

• random errors are independent
• random errors are Normally distributed with zero mean and the same variance

Namely, random errors are independent and identically distributed (i.i.d) Normal. Then inference on regression coefficients can be based on Student’s t distributions or F distributions

# Inference on global null

Recall:

• $$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety

• Model: $$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$$

• Fitted model: $$y = 158.49 -1.14 x_{1} -0.44 x_{2} -13.47 x_{3} + \varepsilon$$

Are all coefficients zero, i.e., $$H_0: \beta_1 = \beta_2 = \beta_3 =0$$?

# Inference on global null

If $$\beta_1 = \beta_2 = \beta_3 =0$$, then the performance of the Reduced Model

$y = \beta_0 + \varepsilon$ should be very close to that of the Full Model

$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$

In other words, SS(Reduced) $\sum (y_i - \bar{y})^2$ (based on Reduced Model) should be close to SS(Full) $\sum (y_i - \hat{y}_i)^2$ (based on Full Model)

# Inference on global null

To assess $$H_0: \beta_1 = \beta_2 = \beta_3 =0$$ versus $$H_a$$, the test statistis is $F = \frac{[SS(Reduced)-SS(Full)]/3}{SS(Full)/(n-4)}$

It has an F distribution with $$df_1 = 3$$ and $$df_2 = n -4$$. Reject $$H_0$$ if $$F > F_{\alpha,df_1,df_2}$$

# Inference on global null

Test $$H_0: \beta_1 = \beta_2 = \beta_3 =0$$ using the following:

• SS(Reduced) = 13369.3; SS(Full) = 4248.841
• $$p = 3$$ and $$n=46$$
• $$F_{0.05,p,n-p-1} = 2.827$$

# Inference on global null

Illustation:

> regML = lm(satisfaction ~ age + severity + anxiety,
+     sat)
> summary(regML)$fstatistic value numdf dendf 30.05208 3.00000 42.00000  # Inference on global null In general, to assess $$H_0: \beta_1 = \beta_2 = \ldots = \beta_p =0$$ versus $$H_a$$, the test statistis is $F = \frac{[SS(Reduced)-SS(Full)]/p}{SS(Full)/(n-(p+1))}$ • SS(Reduced): residual sum of squares obtained by fitting the Reduced Model $$y = \beta_0+\varepsilon$$ • SS(Full) is the residual sum of squares obtained by fitting the Full Model $$y = \beta_0 + \beta_1 x_{1} + \ldots + \beta_p x_{p} + \varepsilon$$ The test stat has an F distribution $$df_1 = p$$ and $$df_2 = n -(p+1)$$. Reject $$H_0$$ if $$F > F_{\alpha,df_1,df_2}$$ # Inference on one coefficient Recall the model $y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$ Consider testing $$H_0: \beta_1 =0$$ vs $$H_a: \beta_1 \ne 0$$, which among t test and F test to use? # Inference on one coefficient Test $$\beta_1 =0$$ or not using the following: • $$\hat{\beta}_1 = -1.14$$ and $$s_{\hat{\beta}_1} = 0.215$$ • $$n=46$$ and $$p=3$$ • $$s_{\hat{\beta}_1}$$ has degrees freedom $$n-(p+1)$$ • $$t_{0.025,42} = 2.02$$ # Inference on one coefficient Illustation: > regML = lm(satisfaction ~ age + severity + anxiety, + sat) > library(broom) > tidy(summary(regML)) term estimate std.error statistic p.value 1 (Intercept) 158.4912517 18.1258887 8.7439162 5.260955e-11 2 age -1.1416118 0.2147988 -5.3147960 3.810252e-06 3 severity -0.4420043 0.4919657 -0.8984452 3.740702e-01 4 anxiety -13.4701632 7.0996608 -1.8972967 6.467813e-02 Value of test stat for if $$\beta_1 =0$$ is $$-5.314796$$ # Inference on one coefficient Test $$\beta_1 =0$$ or not using the following: • Full Model: $$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$$ • Residual sum of squares for Full Model: 4248.841 • Reduced Model when $$H_0$$ is true: $$y = \beta_0 + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$$ • Residual sum of squares for Reduced Model: 7106.394 • Full Model has $$1$$ more parameter than Reduced Model $F = \frac{[SS(Reduced)-SS(Full)]/1}{SS(Full)/(n-(p+1))}$ # Inference on one coefficient Illustation: > regML = lm(satisfaction ~ age + severity + anxiety, + sat) > RSSF1 = sum(regML$residuals^2)
>
> regMLR = lm(satisfaction ~ severity + anxiety, sat)
> RSSR1 = sum(regMLR$residuals^2) > > FFR = ((RSSR1 - RSSF1)/1)/(RSSF1/42) > FFR [1] 28.24706 value of test stat for if $$\beta_1 =0$$ is $$28.24706$$ Note $$(-5.314796)^2 = 28.24706$$ # Inference on subset of coef Say to assess $$H_0: \beta_1 = \beta_2 =0$$ in model $y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$ • Full Model: $$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$$ • Reduced Model when $$H_0$$ is true: $$y = \beta_0 + \beta_3 x_{3} + \varepsilon$$ • Full Model has $$2$$ more parameters than Reduced Model $F = \frac{[SS(Reduced)-SS(Full)]/2}{SS(Full)/(n-(p+1))}$ # Inference on subset of coef Test $$H_0: \beta_1 = \beta_2 =0$$ or not using the following: • Full Model: $$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$$ • Residual sum of squares for Full Model: 4248.841 • Reduced Model when $$H_0$$ is true: $$y = \beta_0 + \beta_3 x_{3} + \varepsilon$$ • Residual sum of squares for Reduced Model: 7814.391 • Full Model has $$2$$ more parameters than Reduced Model # Inference on subset of coef Illustation: > regML = lm(satisfaction ~ age + severity + anxiety, + sat) > RSSF1 = sum(regML$residuals^2)
[1] 4248.841
>
> regMLR2 = lm(satisfaction ~ anxiety, sat)
> RSSR2 = sum(regMLR2$residuals^2) > RSSR2 [1] 7814.391 > > FFR2 = ((RSSR2 - RSSF1)/2)/(RSSF1/42) > FFR2 [1] 17.62282 > > qf(0.05, df1 = 2, df2 = 42, ncp = 0, lower.tail = F) [1] 3.219942 # Inference on coefficients • Full Model has $$p+1$$ parameters ($$1$$ intercept and $$p$$ covariates) • Reduced Model is obtained by assuming $$H_0$$ is true, for which $$H_0$$ says that $$q$$ coefficients in Full Model are $$0$$ • Full model has $$q$$ more parameters than Reduced Model Under Model Assumptions, the test statistic is $F = \frac{[SS(Reduced)-SS(Full)]/q}{SS(Full)/(n-(p+1))}$ # Inference on coefficients R commands syntax: > # Full Model > regF = lm(y ~ all p covariates, yourdata) > > # Reduced Model > regR = lm(y ~ (p-q) covariates, yourdata) > > # testing (put Reduced Model ahead of Full Model) > anova(regR,regF) # Practice via software # Load data Importing data: > sat = read.table("http://math.wsu.edu/faculty/xchen/stat412/data/CH06PR15.txt", + header = F) > colnames(sat) = c("satisfaction", "age", "severity", + "anxiety") > head(sat) satisfaction age severity anxiety 1 48 50 51 2.3 2 57 36 46 2.3 3 66 40 48 2.2 4 70 41 44 1.8 5 89 28 43 1.8 6 36 49 54 2.9 # Pairwise correlations > pairs(sat) # Model • $$y$$ = satisfaction, $$x_1$$ = age, $$x_2$$ = serverity, $$x_3$$ = anxiety • Model: $$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} + \varepsilon$$ # Fit linear model > regML = lm(satisfaction ~ age + severity + anxiety, + sat) > regML$coefficients
(Intercept)         age    severity     anxiety
158.4912517  -1.1416118  -0.4420043 -13.4701632 

# Test global null

> regML = lm(satisfaction ~ age + severity + anxiety,
+     sat)
> library(broom)
>
> # get value of F statistic and its p-value
> glance(regML)$statistic [1] 30.05208 > glance(regML)$p.value
[1] 1.541973e-10

# Test one coefficient

> regML = lm(satisfaction ~ age + severity + anxiety,
+     sat)
> library(broom)
> tidy(regML)
term    estimate  std.error  statistic      p.value
1 (Intercept) 158.4912517 18.1258887  8.7439162 5.260955e-11
2         age  -1.1416118  0.2147988 -5.3147960 3.810252e-06
3    severity  -0.4420043  0.4919657 -0.8984452 3.740702e-01
4     anxiety -13.4701632  7.0996608 -1.8972967 6.467813e-02

# Test a subset of coef

$$H_0: \beta_1 = \beta_2 =0$$

> # Full Model
> regML = lm(satisfaction ~ age + severity + anxiety,
+     sat)
>
> # Reduced Model
> regMLR2 = lm(satisfaction ~ anxiety, sat)
>
> # testing
> anova(regMLR2, regML)
Analysis of Variance Table

Model 1: satisfaction ~ anxiety
Model 2: satisfaction ~ age + severity + anxiety
Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     44 7814.4
2     42 4248.8  2    3565.6 17.623 2.773e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Quick diagnostics

> regML = lm(satisfaction ~ age + severity + anxiety,
+     sat)
> par(mfrow = c(2, 2))
> plot(regML)

# On project

> # load data
> da1[1:3, 1:5]
Plant A.leafSize B.leafSize C.leafSize D.leafSize
1     1    27.7619     4.2460    15.5070    33.0101
2     2    27.8523    14.1577     5.0473    44.9680
3     3    21.3495     7.0279    18.3020    34.2074
> # data set is wide format and it needs to be in
> # long format

# Principle

For ANOVA, the data need to be in long format, for which one variable denotes the factor and the other numerical measurements.

# Wide format to Long format

> # Condition and leaf size
> A1 = da1[, 1:5]
> A1
Plant A.leafSize B.leafSize C.leafSize D.leafSize
1      1    27.7619     4.2460    15.5070    33.0101
2      2    27.8523    14.1577     5.0473    44.9680
3      3    21.3495     7.0279    18.3020    34.2074
4      4    31.9616     7.0698    16.0436    28.9766
5      5    19.4623     0.8091    10.2601    42.9229
6      6    12.2804    13.9385    19.0571    36.6827
7      7    21.0508    11.0130    17.1826    32.7229
8      8    19.5074    10.9680    16.6510    32.5668
9      9    26.2808     6.9112    18.8472    28.7695
10    10    26.1466     9.6041    12.4234    36.6952
> # wide to long: first way
> library(tidyr)
> A1L = A1 %>% gather(Condition, Size, 2:5)
> A1L
Plant  Condition    Size
1      1 A.leafSize 27.7619
2      2 A.leafSize 27.8523
3      3 A.leafSize 21.3495
4      4 A.leafSize 31.9616
5      5 A.leafSize 19.4623
6      6 A.leafSize 12.2804
7      7 A.leafSize 21.0508
8      8 A.leafSize 19.5074
9      9 A.leafSize 26.2808
10    10 A.leafSize 26.1466
11     1 B.leafSize  4.2460
12     2 B.leafSize 14.1577
13     3 B.leafSize  7.0279
14     4 B.leafSize  7.0698
15     5 B.leafSize  0.8091
16     6 B.leafSize 13.9385
17     7 B.leafSize 11.0130
18     8 B.leafSize 10.9680
19     9 B.leafSize  6.9112
20    10 B.leafSize  9.6041
21     1 C.leafSize 15.5070
22     2 C.leafSize  5.0473
23     3 C.leafSize 18.3020
24     4 C.leafSize 16.0436
25     5 C.leafSize 10.2601
26     6 C.leafSize 19.0571
27     7 C.leafSize 17.1826
28     8 C.leafSize 16.6510
29     9 C.leafSize 18.8472
30    10 C.leafSize 12.4234
31     1 D.leafSize 33.0101
32     2 D.leafSize 44.9680
33     3 D.leafSize 34.2074
34     4 D.leafSize 28.9766
35     5 D.leafSize 42.9229
36     6 D.leafSize 36.6827
37     7 D.leafSize 32.7229
38     8 D.leafSize 32.5668
39     9 D.leafSize 28.7695
40    10 D.leafSize 36.6952

# Wide format to Long format

> # wide to long: second second way stacking
> # measurements for size
> size = c(A1[, 2], A1[, 3], A1[, 4], A1[, 5])
> # create factor for condition
> n = nrow(A1)
> condition = c(rep("A", n), rep("B", n), rep("C", n),
+     rep("D", n))
> # combine into data frame
> B1 = data.frame(size, condition)

# Example

Grocery retailer tracks productivity and costs of its facilities weekly:

• $$x_1$$: number of cases shipped
• $$x_2$$: costs of total labour hours in percentage
• $$x_3$$: indicator of if a week has a holiday
• $$y$$: total labor hours

# Example

Grocery retailer tracks productivity and costs of its facilities weekly:

• $$x_3$$: indicator of if a week has a holiday
• $$x_3=1$$ if a week has a holiday and $$=0$$ otherwise

$$x_3$$ is a categorical variables

# Data

Part of data:

  LaborHours Sellings LaborCost Holiday
1       4264   305657      7.17       0
2       4496   328476      6.20       0
3       4317   317164      4.61       0
4       4292   366745      7.02       0
5       4945   265518      8.61       1
6       4325   301995      6.88       0

• any difference between total labor hours on weeks with a holiday compared to those without?
• how to predict total labor hours based on a requested number of cases to be shipped?

# Linear model

• $$x_1$$: number of cases shipped
• $$x_2$$: costs of total labour hours in percentage
• $$x_3$$: indicator of if a week has a holiday
• $$y$$: total labor hours
• Model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

# Linear model

Recall

• $$x_3$$: indicator of if a week has a holiday
• Model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

If $$x_3 = 0$$, we get $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

If $$x_3 = 1$$, we get $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 + \varepsilon$

$$\beta_3$$ measures difference in change of units in $$y$$ depending on if there is a holiday while holding $$x_1$$ and $$x_2$$ fixed

# Questions

Why do the correlation plots between $$x_3$$, having a holiday or not, and all other three variables look like that?

# Fit linear model

Fitted model: $y = 4149.8 - 0.0008 x_1 - 0.132 x_2 + 6.23 x_3 + \varepsilon$

> egRtl$Holiday = factor(egRtl$Holiday)
>
> regRetail = lm(LaborHours ~ Sellings + LaborCost +
+     Holiday, egRtl)
> regRetail$coefficients (Intercept) Sellings LaborCost Holiday1 4.149887e+03 7.870804e-04 -1.316602e+01 6.235545e+02  # Questions Recall • $$x_1$$: number of cases shipped • $$x_2$$: costs of total labour hours in percentage • $$x_3$$: indicator of if a week has a holiday • $$y$$: total labor hours • Fitted model: $y = 4149.8 - 0.0008 x_1 - 0.132 x_2 + 6.23 x_3 + \varepsilon$ How to interpret the fitted model? # Inference on coefficients Fitted model: $y = 4149.8 - 0.0008 x_1 - 0.132 x_2 + 6.23 x_3 + \varepsilon$  term estimate statistic p.value 1 (Intercept) 4.149887e+03 21.2199453 4.902653e-26 2 Sellings 7.870804e-04 2.1590228 3.587650e-02 3 LaborCost -1.316602e+01 -0.5701616 5.712274e-01 4 Holiday1 6.235545e+02 9.9544230 2.940869e-13 > F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12 What are your conclusions? # Diagnostic plots What are your conclusions on model assumptions? # Check residuals # Check Normality  Shapiro-Wilk normality test data: regRetail$residuals
W = 0.97575, p-value = 0.3644

# Check equal variances


Bartlett test of homogeneity of variances

data:  egRtl$LaborHours and egRtl$Holiday
Bartlett's K-squared = 1.6178, df = 1, p-value = 0.2034

# Questions

• based on the previous boxplot, do you think the random errors have equal variances?
• how would you proceed to check if error variances are equal?
• do you know how to make a prediction on the expected LaborHours?

# Extras

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 15063)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tidyr_0.7.1   broom_0.4.2   ggplot2_2.2.1 knitr_1.17

loaded via a namespace (and not attached):
[1] Rcpp_0.12.12     formatR_1.5      plyr_1.8.4       bindr_0.1
[5] tools_3.3.0      digest_0.6.12    evaluate_0.10.1  tibble_1.3.4
[9] gtable_0.2.0     nlme_3.1-127     lattice_0.20-33  pkgconfig_2.0.1
[13] rlang_0.1.2      psych_1.7.8      yaml_2.1.14      parallel_3.3.0
[17] bindrcpp_0.2     stringr_1.2.0    dplyr_0.7.4      revealjs_0.9
[21] tidyselect_0.2.0 rprojroot_1.2    grid_3.3.0       glue_1.1.1
[25] R6_2.2.2         foreign_0.8-66   rmarkdown_1.6    purrr_0.2.3
[29] reshape2_1.4.2   magrittr_1.5     backports_1.1.0  scales_0.5.0
[33] htmltools_0.3.6  assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[37] labeling_0.3     stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3