Example

Grocery retailer tracks productivity and costs of its facilities weekly:

• $$x_1$$: number of cases shipped
• $$x_2$$: costs of total labor hours in percentage
• $$x_3$$: indicator of if a week has a holiday; $$x_3=1$$ if a week has a holiday and $$=0$$ otherwise
• $$y$$: total labor hours

Data

Part of data:

  LaborHours Sellings LaborCost Holiday
1       4264   305657      7.17       0
2       4496   328476      6.20       0
3       4317   317164      4.61       0
4       4292   366745      7.02       0
5       4945   265518      8.61       1
6       4325   301995      6.88       0

Fit linear model

Model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

Fitted model: $y = 4149.8 - 0.0008 x_1 - 0.132 x_2 + 6.23 x_3 + \varepsilon$

         term      estimate      p.value
1 (Intercept)  4.149887e+03 4.902653e-26
2    Sellings  7.870804e-04 3.587650e-02
3   LaborCost -1.316602e+01 5.712274e-01
4    Holiday1  6.235545e+02 2.940869e-13

Fit nonlinear model

$$x_2$$: costs of total labor hours in percentage

$$x_3$$: indicator of if a week has a holiday

Model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3+ \varepsilon$$

How to interpret the “interaction” term $$x_2 x_3$$?

Fit nonlinear model

$$x_2$$: costs of total labor hours in percentage

$$x_3$$: indicator of if a week has a holiday

Model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3+ \varepsilon$$

• when $$x_3 = 0$$, we get $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

• when $$x_3 = 1$$, we get $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 + \beta_4 x_2 + \varepsilon$

How to interpret the difference between these two sub-models?

Fit nonlinear model

• Model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3+ \varepsilon$$

• Fitted model:

                term      estimate      p.value
1        (Intercept)  4.287187e+03 5.789198e-24
2           Sellings  7.372229e-04 4.874013e-02
3          LaborCost -2.983319e+01 2.646345e-01
4           Holiday1  1.132488e+02 7.815787e-01
5 LaborCost:Holiday1  6.748820e+01 2.097844e-01

Interpretation …

Nonlinear model

• higher order terms can be added to model
• basis functions can be used in model
• model does not have be “linear”
• R procedures: nls, gam

Collinearity

Collinearity refers to “covariates are highly correlated with each other”. Consequences of collinearity include:

• difficulty or inability to build a model; e.g., hard to pick covariates or existence of many equally plausible models
• instability in inference on model and in model performance; e.g., small changes in a covariate leading to very large changes in mean response

Data

Blood pressure study: (“BP” is “blood pressure”;“BSA” is “body surface area”)

  Pt  BP Age Weight  BSA Dur Pulse Stress
1  1 105  47   85.4 1.75 5.1    63     33
2  2 115  49   94.2 2.10 3.8    70     14
3  3 116  49   95.3 1.98 8.2    72     10
4  4 117  50   94.7 2.01 5.8    73     99
5  5 112  51   89.4 1.89 7.0    72     95
6  6 121  48   99.5 2.25 9.3    71     10

Check for collinearity

Blood pressure study:

Check for collinearity

Correlation matrix:

              BP       Age    Weight       BSA
BP     1.0000000 0.6590930 0.9500677 0.8658789
Age    0.6590930 1.0000000 0.4073493 0.3784546
Weight 0.9500677 0.4073493 1.0000000 0.8753048
BSA    0.8658789 0.3784546 0.8753048 1.0000000

Large correlation (in absolute value) between two covariates indicates that they are highly correlated; e.g., Weight and BSA

Check for collinearity

Variance Inflation Factor (VIF)

• obtained by “regressing one covariate on all other covariates”
• large VIF indicates considerable collinearity
• illustration: model “BP = Age + Weight + BSA + Error” and VIF’s
(Intercept)         Age      Weight         BSA
-13.6672473   0.7016198   0.9058223   4.6273883
Age   Weight      BSA
1.201901 4.403645 4.286943 

Dealing with collinearity

• use and interpret model carefully
• drop certain covariates if possible; include correlated covariates into model if needed
• Ridge regression: minimize the regularized sum of squares $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} \beta_j^2$
• R library: glmnet

Model

• Model the probability of an event; e.g., proportion of students passing a test, likelihood of an accident

• Caution: can no longer mode the probability as $$p = \beta_0 + \beta_1 x + \varepsilon$$; instead, a “link function”" is needed

• Example model: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$ where $$\log{\frac{x}{1-x}}$$ is a “link function”

Model

Model: $$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

• The model is equivalent to $$p = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)}$$

• Interpretation: $$\beta_1$$ represents change in units of log odds ratio

• E.g., increasing Age by 1 year results in changes in log odds ratio of a disease by 5 folds

Illustration

“Project Success” versus “Experience (in month)”

  Experience Success
1          2       0
2          4       0
3          5       0
4          6       0
5          7       0
6          8       1

Comments: Response variable is recorded as 0 or 1 (but not as a factor)

Illustration

Settings:

• Response variable $$y$$: success or failure
• $$p$$: probability of success
• $$x$$: Experience in month

Model: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$

Illustration

R command: glm(y ~ x,data= yourdata,family=binomial())

Fitting the model using R procedure “glm”:

(Intercept)  Experience
-1.6842005   0.1193806 

Fitted model: $\log\left(\frac{p}{1-p}\right) = -1.68 + 0.12 x$

Illustration

Inference on logistic regression:

> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept) -1.68420    0.94508  -1.782   0.0747 .
> Experience   0.11938    0.05889   2.027   0.0427 *
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for binomial family taken to be 1)
>
>     Null deviance: 34.617  on 24  degrees of freedom
> Residual deviance: 29.374  on 23  degrees of freedom
> AIC: 33.374
> 

Inference on logistic model

• Inference is based on the difference between the corresponding likelihood functions (such as Deviance)

• Model diagnostics are harder than those for multiple linear regression with Normal errors

• R procedure: glm(y ~ x,data= yourdata,family=binomial())

Motivation

Less complicated models are usually easier to interpret but may not necessarily perform well.

Aim of model selection: balance model complexity with model performance

Basics

• Modes of selection: forward, backward, stepwise, subset, etc

• Selection criteria usually based on model complexity; e.g., information criterion, adjusted R square, Mallows’s Cp, etc

• One popular choice: LASSO

Illustration of LASSO

Modeling Hitters’ Salary based on Hits, Runs, League, Division, Assists, Errors, etc

• number of covariates available: 20
• sample size: 263

Illustration of LASSO

Data:

              AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Alan Ashby     315   81     7   24  38    39    14   3449   835     69
-Alvin Davis    479  130    18   66  72    76     3   1624   457     63
-Andre Dawson   496  141    20   65  78    37    11   5628  1575    225
CRuns CRBI CWalks League Division PutOuts Assists Errors
-Alan Ashby     321  414    375      N        W     632      43     10
-Alvin Davis    224  266    263      A        W     880      82     14
-Andre Dawson   828  838    354      N        E     200      11      3
Salary NewLeague
-Alan Ashby      475         N
-Alvin Davis     480         A
-Andre Dawson    500         N

Illustration of LASSO

Model: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} \vert \beta_j \vert$

Illustration of LASSO

Solution path as $$\lambda$$ changes:

Illustration of stepwise selection

• Model: E(Salary) = Hits+Runs+League+Division+Assists+Errors

• Criterion: Akaike information criterion (AIC) $$2 p - 2 \log L_{\max}$$

• Criterion: given a set of candidate models for the data, the preferred model is the one with the minimum AIC value

Illustration of stepwise selection

> y = Hitters\$Salary
> step(lm(Salary ~ Hits + Runs + League + Division +
+     Assists + Errors, data = Hitters), direction = "backward")
Start:  AIC=3155.47
Salary ~ Hits + Runs + League + Division + Assists + Errors

Df Sum of Sq      RSS    AIC
- Runs      1     25026 40528330 3153.6
- Assists   1     44143 40547447 3153.8
- League    1    286090 40789394 3155.3
<none>                  40503304 3155.5
- Errors    1    346694 40849998 3155.7
- Division  1   1200823 41704127 3161.2
- Hits      1   1491635 41994939 3163.0

Step:  AIC=3153.63
Salary ~ Hits + League + Division + Assists + Errors

Df Sum of Sq      RSS    AIC
- Assists   1     60502 40588831 3152.0
- League    1    267672 40796002 3153.4
<none>                  40528330 3153.6
- Errors    1    341225 40869555 3153.8
- Division  1   1237598 41765927 3159.5
- Hits      1  10731933 51260263 3213.4

Step:  AIC=3152.02
Salary ~ Hits + League + Division + Errors

Df Sum of Sq      RSS    AIC
- League    1    264686 40853517 3151.7
<none>                  40588831 3152.0
- Errors    1   1023361 41612192 3156.6
- Division  1   1232043 41820874 3157.9
- Hits      1  10741565 51330396 3211.8

Step:  AIC=3151.73
Salary ~ Hits + Division + Errors

Df Sum of Sq      RSS    AIC
<none>                  40853517 3151.7
- Errors    1    898680 41752197 3155.5
- Division  1   1254682 42108199 3157.7
- Hits      1  10487876 51341393 3209.8

Call:
lm(formula = Salary ~ Hits + Division + Errors, data = Hitters)

Coefficients:
(Intercept)         Hits    DivisionW       Errors
186.092        4.636     -138.685       -9.237  

Summary

Variable selection:

• some criteria are equivalent
• a criterion may not provide a good or meaningful model
• a difficult task when the number of covariates greatly exceeds the sample size
• very hot area in high-dimensional statistics

Principles of experimental design

• Randomization aims to simulate independent observations
• Replication aims to provide an estimate of experimental error
• Control aims to reduce experimental error

ANOVA with one factor

• One factor representing $$t$$ treatments
• A total of $$n = n_1 + n_2 + \ldots + n_t$$ experimental units

• Treatments are randomly assigned to experimental units, so that $$n_j$$ experimental units receive treatment $$j$$ for each $$j=1,\ldots,t$$

ANOVA with one factor

• $$t$$ population means $$\mu_i, i=1,\ldots,t$$
• Random samples: $$y_{ij}, i=1,\ldots,t; j = 1,\ldots,n_i$$
• Model: $$y_{ij} = \mu_i + \varepsilon_{ij}$$
• Aim: to assess if all $$t$$ population means are equal, i.e., to test $$H_0: \mu_1 = \ldots = \mu_t$$

ANOVA with one factor

• Grand mean: $$\bar{y}_{..} = \frac{1}{n} \sum_{i=1}^t \sum_{j=1}^{n_i} y_{ij}$$

• Within-sample sum of squares: $SSW = \sum_{i,j}(y_{ij} - \bar{y}_{i.})^2$
• Between-sample sum of squares: $SSB = \sum_{i=1}^t n_i (\bar{y}_{i.}-\bar{y}_{..})^2$

ANOVA with one factor

• Assumption on $$y_{ij}$$: independent Normal random samples with identical variance

• Test statistic $F = \frac{SSB/(t-1)}{SSW/(n_T -t)}$ follows an F distribution with $$df_1=t-1$$ and $$df_2 = n_T -t$$

• At Type I error probability $$\alpha$$, reject $$H_0$$ if $$F > F_{\alpha,df_1, df_2}$$; otherwise, conclude $$H_0$$

Completely randomized design

Aim: to randomly assign treatment $$j$$ to $$n_j$$ experimental units

• $$n$$ homogeneous experimental units numbers from $$1$$ to $$n$$
• renumber the experimental units by the $$n$$ numbers obtained by randomly permuting the numbers from $$1$$ to $$n$$
• assign treatment $$1$$ to the first $$n_1$$ experimental units, treatment $$2$$ to the very next $$n_2$$ experimental units, and so on

ANOVA with two factors

• Life span of cutting machine: cutting angle, cutting speed
• Strength of adhesive: temperature and humidity
• Patato yield: growth conditions and type of potato

Example

How does cutting scheme affect the life of a machine tool?

• Design: factorial, completely randomized
• Response: Life Hours
• Factors: Cutting Speed, Tool Geometry, Cutting Angle
• 2 cutting speeds; 2 tool geometries; 2 cutting angles
• factor coding: -1 or 1
• Treatments: the 8 combinations for the factors
• Replication: 3 replicates for each treatment

Example

Part of data

   Cutting.Speed Tool.Geometry Cutting.Angle Life.Hours
1             -1            -1            -1         22
2             -1            -1            -1         31
3             -1            -1            -1         25
4              1            -1            -1         32
5              1            -1            -1         43
6              1            -1            -1         29
7             -1             1            -1         35
8             -1             1            -1         34
9             -1             1            -1         50
10             1             1            -1         55

Example

Effects of Tool Geometry and Cutting Angle on Life Hours

Question

Does the effect of Tool Geometry remain the same for different Cutting Angle settings? Why or why not?

Example

Effects of Cutting Speed and Cutting Angle on Life Hours

Question

Does the effect of Cutting Speed remain the same for different Cutting Angle settings? Why or why not?

Example

Effects of Cutting Speed and Cutting Angle on Life Hours

Question

Can you please describe the interactions between the three factors on how they affect the average life hours of cutting machines?

Example

Effects of Cutting Speed and Cutting Angle on Life Hours

• Life.Hours: $$y_{ijk}$$, observation on the $$k$$th experimental unit receiving the $$i$$th level of Cutting Speed and $$j$$th level of Cutting Angle; associated random error $$\varepsilon_{ijk}$$

• Populations means: $$\mu_{ij}$$ , mean Life.Hours of cutting machine at the $$i$$th level of Cutting Speed and $$j$$th level of Cutting Angle

• Decomposition: $$\mu_{ij} = \mu + \tau_i + \beta_j + \tau \beta_{ij}$$

Model

Model: $$y_{ijk} = \mu + \tau_i + \beta_j + \tau \beta_{ij}+\varepsilon_{ijk}$$

• $$\tau_i$$: effect of level $$i$$ of Cutting Speed

• $$\beta_j$$: effect of level $$j$$ of Cutting Angle

• $$\tau \beta_{ij}$$: effect of interaction at level $$i$$ of Cutting Speed and level $$j$$ of Cutting Angle

• Constraints on last levels: $$\tau_a = 0$$, $$\beta_b = 0$$, $$\tau \beta_{ij} = 0$$ when $$i=a$$ and/or $$j=b$$

Model

Interpretations:

• Overall mean: $$\mu = \mu_{ab}$$

• Main effects of factor A: $$\tau_i = \mu_{ib} - \mu_{ab}$$

• Main effects of factor B: $$\beta_j = \mu_{aj} - \mu_{ab}$$

• Interaction effects: $$\tau \beta_{ij} = \mu_{ij} - \mu - \tau_i - \beta_j$$, i.e., $\tau \beta_{ij}= (\mu_{ij} - \mu_{ib}) - (\mu_{aj} - \mu_{ab})$

Caution: last levels as baseline

Inference on model

Strategy for inference on model is via ANOVA and to compare the standardized version of the following:

• Sum of squares for factor A
• Sum of squares for factor B
• Sum of squares for interaction AB
• Sum of squares for error
• Total sum of squares

Inference on model

When random errors are independent, Normally distributed with mean zero and identical variance, testing on main effect and interaction effect is based on F distribution.

Note that in this case, when testing if a coefficient is zero statistically, the squared value of a t test is that of an F test with numerator degrees of freedom 1.

Fitted model

Model with interaction term: $y_{ijk} = \mu + \tau_i + \beta_j + \tau \beta_{ij}+\varepsilon_{ijk}$

> Coefficients:
>                 Estimate Std. Error t value Pr(>|t|)
> (Intercept)     32.833      3.350   9.802 4.42e-09 ***
> SpeedB          9.167      4.737   1.935  0.06724 .
> AngleB          15.667      4.737   3.307  0.00352 **
> SpeedB:AngleB  -17.667      6.699  -2.637  0.01580 *
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 8.205 on 20 degrees of freedom
> F-statistic: 3.709 on 3 and 20 DF,  p-value: 0.02857

Fitting model

Compare the fit with the visual check

Fitted model

Model: $y_{ijk} = \mu + \tau_i + \beta_j + \tau \beta_{ij}+\varepsilon_{ijk}$

Model estimates:

• Estimate: $$\mu = 32.833$$
• SpeedB for $$\tau_2$$: $$9.167$$
• AngleB for $$\beta_2$$: $$15.667$$
• SpeedB:AngleB for $$\tau \beta_{22}$$: $$-17.667$$

Caution: first levels as baseline in R

Discussion on interpretation …

Summary

Completely randomized design:

• requires experimental units to be homogeneous
• not easy to implement when factors have many levels

Motivation

Experimental units are not homogeneous

• Decrease on relectivity of a reflective paint on highways, where the experimental unit is a section of a highway
• Effects of four levels of newspaper advertising saturation on sales volume where the experimental unit is a city
• Effects of three different incentive pay schemes on employee productivity of electronic assemblies, where the experimental unit is an employee

Definition

• $$t$$ treatments and $$b$$ blocks
• Each block consists of $$t$$ homogeneous experimental units
• Within each block, treatments are randomly assigned to experimental units

Discussion on design …

Example

Auditor training:

• Three training methods: self-teaching at home; training by local staff; training by national staff
• Employees are divided into 10 blocks
• Proficiency assigned at assessment

Discussion on design …

Example

Part of data:

  Proficiency Block Method
1          73     1      1
2          81     1      2
3          92     1      3
4          76     2      1
5          78     2      2
6          89     2      3

Example

Is there is Block effect?

Example

Is there a Method effect?

Example

Visualize both factors:

Model

Model: $$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$$

• $$\tau_i$$: effect of level $$i$$ of Method

• $$\beta_j$$: effect of level $$j$$ of Block

• Constraints on last levels: $$\tau_a = 0$$ and $$\beta_b = 0$$

Model

Interpretations:

• Overall mean: $$\mu = \mu_{ab}$$

• Main effects of Method: $$\tau_i = \mu_{ib} - \mu_{ab}$$

• Main effects of Block: $$\beta_j = \mu_{aj} - \mu_{ab}$$

Caution: last levels as baseline

Inference on model

Strategy for inference on model is via ANOVA and to compare the standardized version of the following:

• Sum of squares for factor A
• Sum of squares for factor B
• Sum of squares for error
• Total sum of squares

Inference on model

When random errors are independent, Normally distributed with mean zero and identical variance, testing on main effect and interaction effect is based on F distribution.

Note that in this case, when testing if a coefficient is zero statistically, the squared value of a t test is that of an F test with numerator degrees of freedom 1.

Fitting model

Model: $$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$$

>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   75.500      1.580  47.786  < 2e-16 ***
> Method2        4.000      1.117   3.580 0.002139 **
> Method3       15.500      1.117  13.874 4.72e-11 ***
> Block2        -1.000      2.040  -0.490 0.629872
> Block3        -2.667      2.040  -1.307 0.207544
> Block4        -1.667      2.040  -0.817 0.424554
> Block5        -3.667      2.040  -1.798 0.089033 .
> Block6        -4.000      2.040  -1.961 0.065533 .
> Block7        -6.000      2.040  -2.942 0.008725 **
> Block8        -8.667      2.040  -4.249 0.000483 ***
> Block9        -9.000      2.040  -4.412 0.000336 ***
> Block10      -12.333      2.040  -6.047 1.02e-05 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> F-statistic: 25.18 on 11 and 18 DF,  p-value: 1.107e-08

Summary

Disadvantages of randomized complete block design:

• best suited for a relatively small number of treatments (since experimental units within a block must be homogeneous)

• controls only one extraneous source of variability (due to blocks)

Extras

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 15063)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_2.2.1 dplyr_0.7.4   glmnet_2.0-13 foreach_1.4.3 Matrix_1.2-6
[6] car_2.1-5     broom_0.4.2   knitr_1.17

loaded via a namespace (and not attached):
[1] Rcpp_0.12.12       formatR_1.5        nloptr_1.0.4
[4] plyr_1.8.4         bindr_0.1          iterators_1.0.8
[7] tools_3.3.0        digest_0.6.12      lme4_1.1-14
[10] gtable_0.2.0       evaluate_0.10.1    tibble_1.3.4
[13] nlme_3.1-127       lattice_0.20-33    mgcv_1.8-12
[16] pkgconfig_2.0.1    rlang_0.1.2        psych_1.7.8
[19] yaml_2.1.14        parallel_3.3.0     SparseM_1.77
[22] bindrcpp_0.2       stringr_1.2.0      MatrixModels_0.4-1
[25] revealjs_0.9       rprojroot_1.2      grid_3.3.0
[28] nnet_7.3-12        glue_1.1.1         R6_2.2.2
[31] foreign_0.8-66     rmarkdown_1.6      minqa_1.2.4
[34] reshape2_1.4.2     tidyr_0.7.1        purrr_0.2.3
[37] magrittr_1.5       scales_0.5.0       codetools_0.2-14
[40] backports_1.1.0    htmltools_0.3.6    MASS_7.3-45
[43] splines_3.3.0      assertthat_0.2.0   pbkrtest_0.4-7
[46] mnormt_1.5-5       colorspace_1.3-2   labeling_0.3
[49] quantreg_5.33      stringi_1.1.5      lazyeval_0.2.0
[52] munsell_0.4.3