A grocery retailer tracks the productivity and costs of its facilities weekly:

- \(x_1\): number of cases shipped
- \(x_2\): cost of total labor hours, expressed as a percentage
- \(x_3\): indicator of whether the week contains a holiday; \(x_3=1\) if it does and \(x_3=0\) otherwise
- \(y\): total labor hours

Part of data:

```
LaborHours Sellings LaborCost Holiday
1 4264 305657 7.17 0
2 4496 328476 6.20 0
3 4317 317164 4.61 0
4 4292 366745 7.02 0
5 4945 265518 8.61 1
6 4325 301995 6.88 0
```

Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\)

Fitted model (from the output below): \[\hat{y} = 4149.9 + 0.0008 x_1 - 13.17 x_2 + 623.55 x_3\]

```
term estimate p.value
1 (Intercept) 4.149887e+03 4.902653e-26
2 Sellings 7.870804e-04 3.587650e-02
3 LaborCost -1.316602e+01 5.712274e-01
4 Holiday1 6.235545e+02 2.940869e-13
```
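The fit above can be reproduced with lm; a minimal sketch, assuming the data sit in a hypothetical data frame grocery with the columns shown earlier and Holiday stored as a factor (hence the coefficient name Holiday1):

```
library(broom)
# 'grocery' is a hypothetical data frame; Holiday is a factor
fit <- lm(LaborHours ~ Sellings + LaborCost + Holiday, data = grocery)
tidy(fit)[, c("term", "estimate", "p.value")]
```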

\(x_2\): cost of total labor hours, expressed as a percentage

\(x_3\): indicator of whether the week contains a holiday

Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3+ \varepsilon\)

How to interpret the “interaction” term \(x_2 x_3\)?


When \(x_3 = 0\), we get \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\]

When \(x_3 = 1\), we get \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 + \beta_4 x_2 + \varepsilon = (\beta_0 + \beta_3) + \beta_1 x_1 + (\beta_2 + \beta_4) x_2 + \varepsilon\]

How to interpret the difference between these two sub-models?

Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3+ \varepsilon\)

Fitted model:

```
term estimate p.value
1 (Intercept) 4.287187e+03 5.789198e-24
2 Sellings 7.372229e-04 4.874013e-02
3 LaborCost -2.983319e+01 2.646345e-01
4 Holiday1 1.132488e+02 7.815787e-01
5 LaborCost:Holiday1 6.748820e+01 2.097844e-01
```
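A sketch of the corresponding R call, under the same assumption of a hypothetical grocery data frame; in R, LaborCost * Holiday expands to both main effects plus their interaction:

```
library(broom)
fit2 <- lm(LaborHours ~ Sellings + LaborCost * Holiday, data = grocery)
tidy(fit2)[, c("term", "estimate", "p.value")]
```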

Interpretation …

- higher order terms can be added to model
- basis functions can be used in model
- model does not have to be “linear”
- R procedures: nls, gam
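Hedged sketches of these extensions, using a hypothetical data frame dat with response y and covariate x:

```
# Higher-order terms via a cubic polynomial
fit_poly <- lm(y ~ poly(x, 3), data = dat)

# A nonlinear mean function fitted with nls (starting values are guesses)
fit_nls <- nls(y ~ a * exp(b * x), data = dat, start = list(a = 1, b = 0.1))

# Basis functions / smooth terms via a generalized additive model
library(mgcv)
fit_gam <- gam(y ~ s(x), data = dat)
```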

Collinearity refers to covariates being highly correlated with each other. Consequences of collinearity include:

- difficulty or inability to build a model; e.g., hard to pick covariates or existence of many equally plausible models
- instability in inference on the model and in model performance; e.g., small changes in a covariate leading to very large changes in the estimated mean response

Blood pressure study (“BP” is “blood pressure”; “BSA” is “body surface area”):

```
Pt BP Age Weight BSA Dur Pulse Stress
1 1 105 47 85.4 1.75 5.1 63 33
2 2 115 49 94.2 2.10 3.8 70 14
3 3 116 49 95.3 1.98 8.2 72 10
4 4 117 50 94.7 2.01 5.8 73 99
5 5 112 51 89.4 1.89 7.0 72 95
6 6 121 48 99.5 2.25 9.3 71 10
```

Blood pressure study:

Correlation matrix:

```
BP Age Weight BSA
BP 1.0000000 0.6590930 0.9500677 0.8658789
Age 0.6590930 1.0000000 0.4073493 0.3784546
Weight 0.9500677 0.4073493 1.0000000 0.8753048
BSA 0.8658789 0.3784546 0.8753048 1.0000000
```

A large correlation (in absolute value) between two covariates signals collinearity; e.g., Weight and BSA, with correlation \(0.875\)

Variance Inflation Factor (VIF)

- obtained by regressing one covariate on all the other covariates: \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) is the R-squared from regressing the \(j\)th covariate on the rest
- large VIF indicates considerable collinearity
- illustration: model “BP = Age + Weight + BSA + Error” and VIF’s

```
Coefficients:
(Intercept)         Age      Weight         BSA
-13.6672473   0.7016198   0.9058223   4.6273883

VIFs:
     Age   Weight      BSA
1.201901 4.403645 4.286943
```
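A sketch of the commands behind the two blocks of output above, assuming the study data sit in a hypothetical data frame bloodpress; vif is from the car package:

```
library(car)
fit <- lm(BP ~ Age + Weight + BSA, data = bloodpress)
coef(fit)  # regression coefficients (first block above)
vif(fit)   # variance inflation factors (second block above)
```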

- use and interpret the model carefully
- drop certain covariates if possible; include correlated covariates in the model if needed
- Ridge regression: minimize the regularized sum of squares \[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \] (the intercept \(\beta_0\) is typically not penalized)
- R library: glmnet
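A minimal ridge sketch with glmnet (alpha = 0 selects the ridge penalty), again assuming the hypothetical bloodpress data frame:

```
library(glmnet)
# glmnet expects a numeric predictor matrix; drop the intercept column
x <- model.matrix(BP ~ Age + Weight + BSA, data = bloodpress)[, -1]
fit_ridge <- glmnet(x, bloodpress$BP, alpha = 0)
plot(fit_ridge, xvar = "lambda")  # coefficient paths as lambda varies
```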

Model the probability of an event; e.g., the proportion of students passing a test, or the likelihood of an accident

Caution: we can no longer model the probability as \(p = \beta_0 + \beta_1 x + \varepsilon\); instead, a “link function” is needed

Example model: \[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x\] where \(u \mapsto \log{\frac{u}{1-u}}\) is the “link function” (the logit)

Model: \(\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x\)

The model is equivalent to \(p = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)}\)
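The equivalence follows by inverting the link: \[\log\frac{p}{1-p} = \eta \;\Rightarrow\; \frac{p}{1-p} = e^{\eta} \;\Rightarrow\; p = \frac{e^{\eta}}{1+e^{\eta}}, \qquad \eta = \beta_0 + \beta_1 x\]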

Interpretation: \(\beta_1\) represents the change in the log odds per unit change in \(x\)

E.g., if \(\beta_1 = 5\), increasing Age by 1 year increases the log odds of the disease by 5 (equivalently, multiplies the odds by \(e^5\))

“Project Success” versus “Experience (in months)”

```
Experience Success
1 2 0
2 4 0
3 5 0
4 6 0
5 7 0
6 8 1
```

Comments: Response variable is recorded as 0 or 1 (but not as a factor)

Settings:

- Response variable \(y\): success or failure
- \(p\): probability of success
- \(x\): Experience in months

Model: \[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x\]

R command: glm(y ~ x, data = yourdata, family = binomial())

Fitting the model using R procedure “glm”:

```
(Intercept) Experience
-1.6842005 0.1193806
```

Fitted model: \[\log\left(\frac{p}{1-p}\right) = -1.68 + 0.12 x\]
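A sketch of the fit and a predicted probability, assuming the data above sit in a hypothetical data frame projects:

```
fit <- glm(Success ~ Experience, data = projects, family = binomial())
coef(fit)
# Predicted probability of success at, say, 20 months of experience
predict(fit, newdata = data.frame(Experience = 20), type = "response")
```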

Inference on logistic regression:

```
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.68420    0.94508  -1.782   0.0747 .
Experience    0.11938    0.05889   2.027   0.0427 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.617 on 24 degrees of freedom
Residual deviance: 29.374 on 23 degrees of freedom
AIC: 33.374
```

Inference is based on differences between the corresponding likelihood functions (e.g., the deviance)

Model diagnostics are harder than those for multiple linear regression with Normal errors


Less complicated models are usually easier to interpret but may not necessarily perform well.

Aim of model selection: balance model complexity with model performance

Modes of selection: forward, backward, stepwise, best subset, etc.

Selection criteria usually balance fit against model complexity; e.g., information criteria, adjusted R-squared, Mallows's \(C_p\), etc.

One popular choice: LASSO

Modeling Hitters’ Salary based on Hits, Runs, League, Division, Assists, Errors, etc

- number of covariates available: 20
- sample size: 263

Data:

```
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
CRuns CRBI CWalks League Division PutOuts Assists Errors
-Alan Ashby 321 414 375 N W 632 43 10
-Alvin Davis 224 266 263 A W 880 82 14
-Andre Dawson 828 838 354 N E 200 11 3
Salary NewLeague
-Alan Ashby 475 N
-Alvin Davis 480 A
-Andre Dawson 500 N
```

LASSO: minimize the penalized sum of squares \[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \vert \beta_j \vert\]

Solution path as \(\lambda\) changes:
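A LASSO sketch with glmnet (alpha = 1 selects the LASSO penalty), assuming the Hitters data (e.g., from the ISLR package) with rows containing missing values removed:

```
library(glmnet)
Hitters <- na.omit(Hitters)                # drop rows with missing values
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
fit_lasso <- glmnet(x, y, alpha = 1)
plot(fit_lasso, xvar = "lambda")           # solution path as lambda changes
```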

Model: E(Salary) = Hits+Runs+League+Division+Assists+Errors

Criterion: Akaike information criterion (AIC), \(2p - 2 \log L_{\max}\), where \(p\) is the number of model parameters and \(L_{\max}\) is the maximized likelihood

Rule: given a set of candidate models for the data, the preferred model is the one with the minimum AIC value

```
> y = Hitters$Salary
> step(lm(Salary ~ Hits + Runs + League + Division +
+ Assists + Errors, data = Hitters), direction = "backward")
Start: AIC=3155.47
Salary ~ Hits + Runs + League + Division + Assists + Errors
Df Sum of Sq RSS AIC
- Runs 1 25026 40528330 3153.6
- Assists 1 44143 40547447 3153.8
- League 1 286090 40789394 3155.3
<none> 40503304 3155.5
- Errors 1 346694 40849998 3155.7
- Division 1 1200823 41704127 3161.2
- Hits 1 1491635 41994939 3163.0
Step: AIC=3153.63
Salary ~ Hits + League + Division + Assists + Errors
Df Sum of Sq RSS AIC
- Assists 1 60502 40588831 3152.0
- League 1 267672 40796002 3153.4
<none> 40528330 3153.6
- Errors 1 341225 40869555 3153.8
- Division 1 1237598 41765927 3159.5
- Hits 1 10731933 51260263 3213.4
Step: AIC=3152.02
Salary ~ Hits + League + Division + Errors
Df Sum of Sq RSS AIC
- League 1 264686 40853517 3151.7
<none> 40588831 3152.0
- Errors 1 1023361 41612192 3156.6
- Division 1 1232043 41820874 3157.9
- Hits 1 10741565 51330396 3211.8
Step: AIC=3151.73
Salary ~ Hits + Division + Errors
Df Sum of Sq RSS AIC
<none> 40853517 3151.7
- Errors 1 898680 41752197 3155.5
- Division 1 1254682 42108199 3157.7
- Hits 1 10487876 51341393 3209.8
Call:
lm(formula = Salary ~ Hits + Division + Errors, data = Hitters)
Coefficients:
(Intercept) Hits DivisionW Errors
186.092 4.636 -138.685 -9.237
```

Variable selection:

- some criteria are equivalent
- a criterion may not provide a good or meaningful model
- a difficult task when the number of covariates greatly exceeds the sample size
- a very active research area in high-dimensional statistics

- Randomization aims to simulate independent observations
- Replication aims to provide an estimate of experimental error
- Control aims to reduce experimental error

- One factor representing \(t\) treatments
- A total of \(n = n_1 + n_2 + \ldots + n_t\) experimental units

Treatments are randomly assigned to experimental units, so that \(n_j\) experimental units receive treatment \(j\) for each \(j=1,\ldots,t\)

- \(t\) population means \(\mu_i, i=1,\ldots,t\)
- Random samples: \(y_{ij}, i=1,\ldots,t; j = 1,\ldots,n_i\)
- Model: \(y_{ij} = \mu_i + \varepsilon_{ij}\)
- Aim: to assess if all \(t\) population means are equal, i.e., to test \(H_0: \mu_1 = \ldots = \mu_t\)

Grand mean: \(\bar{y}_{..} = \frac{1}{n} \sum_{i=1}^t \sum_{j=1}^{n_i} y_{ij}\)

- Within-sample sum of squares: \[SSW = \sum_{i,j}(y_{ij} - \bar{y}_{i.})^2 \]
- Between-sample sum of squares: \[SSB = \sum_{i=1}^t n_i (\bar{y}_{i.}-\bar{y}_{..})^2\]

Assumption on \(y_{ij}\): independent Normal random samples with identical variance

Test statistic \[F = \frac{SSB/(t-1)}{SSW/(n-t)}\] follows an F distribution with \(df_1=t-1\) and \(df_2 = n-t\)

At Type I error probability \(\alpha\), reject \(H_0\) if \(F > F_{\alpha,df_1, df_2}\); otherwise, do not reject \(H_0\)
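A one-way ANOVA sketch in R, with a hypothetical data frame dat holding response y and treatment factor trt:

```
fit <- aov(y ~ trt, data = dat)
summary(fit)  # F test of H0: all t treatment means are equal
```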

Aim: to randomly assign treatment \(j\) to \(n_j\) experimental units

- \(n\) homogeneous experimental units numbered from \(1\) to \(n\)
- renumber the experimental units by the \(n\) numbers obtained by randomly permuting the numbers from \(1\) to \(n\)
- assign treatment \(1\) to the first \(n_1\) experimental units, treatment \(2\) to the very next \(n_2\) experimental units, and so on
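A sketch of this randomization, with hypothetical group sizes \(n_1 = n_2 = n_3 = 4\):

```
n_j <- c(4, 4, 4)                      # units receiving each treatment
perm <- sample(sum(n_j))               # random permutation of 1..n
treatment <- rep(seq_along(n_j), n_j)  # 1,1,1,1,2,2,2,2,3,3,3,3
data.frame(unit = perm, treatment = treatment)
```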

- Life span of cutting machine: cutting angle, cutting speed
- Strength of adhesive: temperature and humidity
- Potato yield: growth conditions and type of potato

How does cutting scheme affect the life of a machine tool?

- Design: factorial, completely randomized
- Response: Life Hours
- Factors: Cutting Speed, Tool Geometry, Cutting Angle
- 2 cutting speeds; 2 tool geometries; 2 cutting angles
- factor coding: -1 or 1

- Treatments: the 8 combinations of the factor levels
- Replication: 3 replicates for each treatment

Part of data

```
Cutting.Speed Tool.Geometry Cutting.Angle Life.Hours
1 -1 -1 -1 22
2 -1 -1 -1 31
3 -1 -1 -1 25
4 1 -1 -1 32
5 1 -1 -1 43
6 1 -1 -1 29
7 -1 1 -1 35
8 -1 1 -1 34
9 -1 1 -1 50
10 1 1 -1 55
```

Effects of Tool Geometry and Cutting Angle on Life Hours

Does the effect of Tool Geometry remain the same for different Cutting Angle settings? Why or why not?

Effects of Cutting Speed and Cutting Angle on Life Hours

Does the effect of Cutting Speed remain the same for different Cutting Angle settings? Why or why not?


Can you describe the interactions among the three factors in how they affect the average life hours of cutting machines?

Effects of Cutting Speed and Cutting Angle on Life Hours

Life.Hours: \(y_{ijk}\), observation on the \(k\)th experimental unit receiving the \(i\)th level of Cutting Speed and \(j\)th level of Cutting Angle; associated random error \(\varepsilon_{ijk}\)

Population means: \(\mu_{ij}\), mean Life.Hours of a cutting machine at the \(i\)th level of Cutting Speed and \(j\)th level of Cutting Angle

Decomposition: \(\mu_{ij} = \mu + \tau_i + \beta_j + \tau \beta_{ij}\)

Model: \(y_{ijk} = \mu + \tau_i + \beta_j + \tau \beta_{ij}+\varepsilon_{ijk}\)

\(\tau_i\): effect of level \(i\) of Cutting Speed

\(\beta_j\): effect of level \(j\) of Cutting Angle

\(\tau \beta_{ij}\): effect of interaction at level \(i\) of Cutting Speed and level \(j\) of Cutting Angle

Constraints on the last levels: \(\tau_a = 0\), \(\beta_b = 0\), and \(\tau \beta_{ij} = 0\) when \(i=a\) and/or \(j=b\), where \(a\) and \(b\) denote the numbers of levels of the two factors

Interpretations:

Overall mean: \(\mu = \mu_{ab}\)

Main effects of factor A: \(\tau_i = \mu_{ib} - \mu_{ab}\)

Main effects of factor B: \(\beta_j = \mu_{aj} - \mu_{ab}\)

Interaction effects: \(\tau \beta_{ij} = \mu_{ij} - \mu - \tau_i - \beta_j\), i.e., \[\tau \beta_{ij}= (\mu_{ij} - \mu_{ib}) - (\mu_{aj} - \mu_{ab})\]

Caution: last levels as baseline

Strategy: inference on the model is via ANOVA, comparing suitably scaled versions (mean squares) of the following sums of squares:

- Sum of squares for factor A
- Sum of squares for factor B
- Sum of squares for interaction AB
- Sum of squares for error
- Total sum of squares

When the random errors are independent and Normally distributed with mean zero and identical variance, testing of the main effects and the interaction effect is based on the F distribution.

Note that in this case, when testing whether a single coefficient is zero, the squared value of the t statistic equals the F statistic with numerator degrees of freedom 1.

Model with interaction term: \[y_{ijk} = \mu + \tau_i + \beta_j + \tau \beta_{ij}+\varepsilon_{ijk}\]
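A sketch of the R call behind the output below, assuming the tool-life data sit in a hypothetical data frame tool with Speed and Angle recoded as factors with levels A and B:

```
fit <- lm(Life.Hours ~ Speed * Angle, data = tool)
summary(fit)  # main effects plus the Speed:Angle interaction
```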

```
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     32.833      3.350   9.802 4.42e-09 ***
SpeedB           9.167      4.737   1.935  0.06724 .
AngleB          15.667      4.737   3.307  0.00352 **
SpeedB:AngleB  -17.667      6.699  -2.637  0.01580 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.205 on 20 degrees of freedom
F-statistic: 3.709 on 3 and 20 DF, p-value: 0.02857
```

Compare the fit with the visual check

Model: \[y_{ijk} = \mu + \tau_i + \beta_j + \tau \beta_{ij}+\varepsilon_{ijk}\]

Model estimates:

- (Intercept) for \(\mu\): \(32.833\)
- SpeedB for \(\tau_2\): \(9.167\)
- AngleB for \(\beta_2\): \(15.667\)
- SpeedB:AngleB for \(\tau \beta_{22}\): \(-17.667\)

Caution: first levels as baseline in R

Discussion on interpretation …

Completely randomized design:

- requires experimental units to be homogeneous
- not easy to implement when factors have many levels

Examples where experimental units are not homogeneous:

- Decrease in reflectivity of a reflective paint on highways, where the experimental unit is a section of a highway
- Effects of four levels of newspaper advertising saturation on sales volume where the experimental unit is a city
- Effects of three different incentive pay schemes on employee productivity of electronic assemblies, where the experimental unit is an employee

- \(t\) treatments and \(b\) blocks
- Each block consists of \(t\) homogeneous experimental units
- Within each block, treatments are randomly assigned to experimental units
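A sketch of the within-block randomization, with hypothetical \(t = 3\) treatments and \(b = 10\) blocks:

```
n_trt <- 3; n_blk <- 10
# Column j holds an independent random ordering of the treatments
# over the t homogeneous units in block j
assignment <- replicate(n_blk, sample(n_trt))
assignment
```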

Discussion on design …

Auditor training:

- Three training methods: self-teaching at home; training by local staff; training by national staff
- Employees are divided into 10 blocks
- Proficiency is measured at an assessment

Discussion on design …

Part of data:

```
Proficiency Block Method
1 73 1 1
2 81 1 2
3 92 1 3
4 76 2 1
5 78 2 2
6 89 2 3
```

Is there a Block effect?

Is there a Method effect?

Visualize both factors:

Model: \(y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}\)

\(\tau_i\): effect of level \(i\) of Method

\(\beta_j\): effect of level \(j\) of Block

Constraints on last levels: \(\tau_a = 0\) and \(\beta_b = 0\)

Interpretations:

Overall mean: \(\mu = \mu_{ab}\)

Main effects of Method: \(\tau_i = \mu_{ib} - \mu_{ab}\)

Main effects of Block: \(\beta_j = \mu_{aj} - \mu_{ab}\)

Caution: last levels as baseline

Strategy: inference on the model is via ANOVA, comparing suitably scaled versions (mean squares) of the following sums of squares:

- Sum of squares for factor A
- Sum of squares for factor B
- Sum of squares for error
- Total sum of squares

When the random errors are independent and Normally distributed with mean zero and identical variance, testing of the Method and Block effects is based on the F distribution.

Note that in this case, when testing whether a single coefficient is zero, the squared value of the t statistic equals the F statistic with numerator degrees of freedom 1.

Model: \(y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}\)
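A sketch of the R call behind the output below, assuming the auditor data sit in a hypothetical data frame auditor; Method and Block must be factors for R to report the treatment-contrast coefficients shown:

```
auditor$Method <- factor(auditor$Method)
auditor$Block  <- factor(auditor$Block)
fit <- lm(Proficiency ~ Method + Block, data = auditor)
summary(fit)
```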

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   75.500      1.580  47.786  < 2e-16 ***
Method2        4.000      1.117   3.580 0.002139 **
Method3       15.500      1.117  13.874 4.72e-11 ***
Block2        -1.000      2.040  -0.490 0.629872
Block3        -2.667      2.040  -1.307 0.207544
Block4        -1.667      2.040  -0.817 0.424554
Block5        -3.667      2.040  -1.798 0.089033 .
Block6        -4.000      2.040  -1.961 0.065533 .
Block7        -6.000      2.040  -2.942 0.008725 **
Block8        -8.667      2.040  -4.249 0.000483 ***
Block9        -9.000      2.040  -4.412 0.000336 ***
Block10      -12.333      2.040  -6.047 1.02e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic: 25.18 on 11 and 18 DF, p-value: 1.107e-08
```

Disadvantages of the randomized complete block design:

- best suited for a relatively small number of treatments (since experimental units within a block must be homogeneous)
- controls only one extraneous source of variability (due to blocks)
