Proportion relates to Chance, and Chance to Probability:

- Preference between products
- Efficacy for drugs
- Reliability of mechanical systems

The binomial distribution describes the number of successes in independent trials:

- in each trial, the probability of a success is \(\pi\) and the probability of a failure is \(1-\pi\)
- \(n\) such independent trials have been conducted

Its probability mass function is:

\[P(Y = y) = \frac{n!}{y!(n-y)!}\pi^y (1-\pi)^{n-y}\] for \(y=0,\ldots,n\)

Probability mass function for \(\mathsf{Binomial}(50,0.5)\):
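
The plot for this slide is not reproduced here. A minimal sketch of how the probability mass function could be computed and drawn in base R follows; the variable names (`pmf_formula`, `pmf_builtin`) are illustrative and not part of the original slides.

```r
# Binomial(n = 50, pi = 0.5) probability mass function
n <- 50
p <- 0.5   # success probability (written pi in the slides)
y <- 0:n

# PMF from the formula n! / (y! (n - y)!) * pi^y * (1 - pi)^(n - y)
pmf_formula <- choose(n, y) * p^y * (1 - p)^(n - y)

# The same values from the built-in binomial density
pmf_builtin <- dbinom(y, size = n, prob = p)

# Spike plot of the PMF
plot(y, pmf_builtin, type = "h",
     xlab = "y", ylab = "P(Y = y)", main = "Binomial(50, 0.5)")
```

The two vectors should agree up to floating-point error, and the mass is largest at \(y = 25\).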

Method:

Observed proportion: \(\hat{\pi} = \frac{y}{n}\)

“Standard deviation”: \(\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\)

\(z = \frac{\hat{\pi} - \pi}{\hat{\sigma}_{\hat{\pi}}} \to \mathsf{Normal}(0,1)\) when \(n \to \infty\)

Heuristic: if \(n \ge \frac{5}{\min(\pi,1-\pi)}\), a Normal test can be used
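
The heuristic can be checked with a one-line helper. This is only a sketch: `normal_ok` is a hypothetical name, and the second call uses made-up values to show a case where the rule fails.

```r
# Heuristic from the slide: a Normal test is reasonable if
# n >= 5 / min(pi, 1 - pi)
normal_ok <- function(n, p) {
  n >= 5 / min(p, 1 - p)
}

normal_ok(n = 2500, p = 0.44)  # large sample, threshold is about 11.4
normal_ok(n = 20, p = 0.10)    # threshold is 5 / 0.10 = 50, so n = 20 is too small
```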

\((1 - \alpha)\times 100 \%\) Confidence interval when \(n\) is large is

\[\hat{\pi} \pm z_{\alpha/2} \hat{\sigma}_{\hat{\pi}}\]

where \(\hat{\pi} = \frac{y}{n}\) and \(\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\)
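
As a rough illustration, the large-sample interval can be wrapped in a small function. The name `prop_ci` and the counts in the call (40 successes in 100 trials) are assumptions made for the example, not data from the slides.

```r
# Large-sample (Wald) confidence interval for a single proportion
prop_ci <- function(y, n, conf_level = 0.95) {
  p_hat <- y / n                           # observed proportion
  se    <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard deviation of p_hat
  z     <- qnorm(1 - (1 - conf_level) / 2) # z_{alpha/2}
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Hypothetical data: 40 successes in 100 trials
ci <- prop_ci(y = 40, n = 100)
round(ci, 3)
```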

Caution: when \(n\) is small, adjustment is needed; see WAC adjustment
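
The slides do not spell out the adjustment. Assuming "WAC" refers to the Wilson-Agresti-Coull adjustment, which replaces \(y\) by \(y + z_{\alpha/2}^2/2\) and \(n\) by \(n + z_{\alpha/2}^2\) before applying the usual formula, a sketch could look as follows. The function name `wac_ci` and the small-sample counts are hypothetical.

```r
# Sketch of an adjusted interval, ASSUMING WAC = Wilson-Agresti-Coull:
# y is replaced by y + z^2 / 2 and n by n + z^2
wac_ci <- function(y, n, conf_level = 0.95) {
  z      <- qnorm(1 - (1 - conf_level) / 2)
  y_adj  <- y + z^2 / 2
  n_adj  <- n + z^2
  p_adj  <- y_adj / n_adj
  se_adj <- sqrt(p_adj * (1 - p_adj) / n_adj)
  c(lower = p_adj - z * se_adj, upper = p_adj + z * se_adj)
}

# Hypothetical small sample: 2 successes in 10 trials
ci_wac <- wac_ci(y = 2, n = 10)
round(ci_wac, 3)
```

The interval is centered at the adjusted proportion rather than at \(y/n\), which pulls it toward \(0.5\) when \(n\) is small.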

At type I error \(\alpha\), when \(n\) is large the test statistic

\[z = \frac{\hat{\pi} - \pi}{\hat{\sigma}_{\hat{\pi}}}\] based on Normal distribution can be used for hypothesis testing, where \(\hat{\pi} = \frac{y}{n}\) and \(\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\)
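
The rejection rule depends on the direction of the alternative. The helper below is a sketch under that reading; `z_decision` and its `alternative` argument are hypothetical names, and the value \(z = 1.8\) is made up for illustration.

```r
# Decide whether to reject H0 from a z statistic at level alpha
z_decision <- function(z, alpha = 0.05,
                       alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         greater   = z >= qnorm(1 - alpha),           # reject for large z
         less      = z <= qnorm(alpha),               # reject for small z
         two.sided = abs(z) >= qnorm(1 - alpha / 2))  # reject for large |z|
}

qnorm(1 - 0.05)      # one-sided critical value, about 1.645
qnorm(1 - 0.05 / 2)  # two-sided critical value, about 1.96

z_decision(1.8, alternative = "greater")    # TRUE
z_decision(1.8, alternative = "two.sided")  # FALSE
```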

National survey: \(44\%\) of U.S. college students engaged in binge drinking

\(2,500\) undergraduates surveyed from University A; \(1,200\) of them engaged in binge drinking

Is the percentage of students engaged in binge drinking at University A greater than the percentage from the national survey?

Formulate hypothesis

Conduct hypothesis testing

Construct confidence interval

Value of z-test statistic

`[1] 4.003`

Critical value for the one-sided test at \(\alpha = 0.05\): \(z_{0.05}=1.645\)

Conclusion on hypothesis testing?

\(\hat{\sigma}_{\hat{\pi}}\):

`[1] 0.009992`

95% CI: \[0.48 \pm 1.96 \times 0.01 = 0.48 \pm 0.0196\]
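
The slides show only the printed results. One way the quantities for this example might be reproduced in R is sketched below; the variable names are illustrative, and the one-sided critical value is taken as `qnorm(0.95)` because the question asks whether the proportion is greater than the national figure.

```r
# Binge drinking example: H0: pi = 0.44 versus Ha: pi > 0.44
y  <- 1200   # students at University A who engaged in binge drinking
n  <- 2500   # students surveyed
p0 <- 0.44   # national proportion

p_hat <- y / n                          # 0.48
se    <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard deviation of p_hat
z     <- (p_hat - p0) / se              # z-test statistic

signif(z, 4)    # 4.003
signif(se, 4)   # 0.009992

# One-sided test at alpha = 0.05
z > qnorm(0.95)

# 95% confidence interval for pi
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
round(ci, 4)
```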

Two binomial populations \(\textsf{Binomial}(n_1, \pi_1)\) and \(\textsf{Binomial}(n_2, \pi_2)\).

For the difference \(\pi_1 - \pi_2\):

Observed proportions: \(\hat{\pi}_1 = \frac{y_1}{n_1}\) and \(\hat{\pi}_2 = \frac{y_2}{n_2}\)

Difference: \(\hat{\pi}_1 - \hat{\pi}_2\)

“Standard deviation”: \(\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}\)

\((1 - \alpha)\times 100 \%\) Confidence interval when \(n_1\) and \(n_2\) are large is

\[\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2} \hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}\]

where \(\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}\)

Two binomial populations \(\textsf{Binomial}(n_1, \pi_1)\) and \(\textsf{Binomial}(n_2, \pi_2)\). For the difference \(\pi_1 - \pi_2\), the test statistic is

\[z= \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}}\]

Caution: applicable only if \(n_i \hat{\pi}_i\) and \(n_i (1-\hat{\pi}_i)\), for \(i=1,2\), are all at least \(5\); otherwise Fisher’s exact test is recommended
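
A compact sketch of the two-sample calculation, including the sample-size check from the caution, might look like this. The function name `two_prop_z` and the counts in the call are hypothetical.

```r
# Two-sample z statistic and confidence interval for pi_1 - pi_2
two_prop_z <- function(y1, n1, y2, n2, conf_level = 0.95) {
  p1 <- y1 / n1
  p2 <- y2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z  <- (p1 - p2) / se
  z_crit <- qnorm(1 - (1 - conf_level) / 2)
  # n_i * p_hat_i and n_i * (1 - p_hat_i) are the success and failure counts
  counts <- c(y1, n1 - y1, y2, n2 - y2)
  list(diff = p1 - p2, se = se, z = z,
       ci = (p1 - p2) + c(-1, 1) * z_crit * se,
       normal_ok = all(counts >= 5))
}

# Hypothetical data: 30 of 50 successes versus 20 of 50
res <- two_prop_z(y1 = 30, n1 = 50, y2 = 20, n2 = 50)
res$z
res$normal_ok
```

When `normal_ok` is `FALSE`, base R's `fisher.test()` can be applied to the 2 x 2 matrix of counts instead.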

Effectiveness of teaching English via a computer-aided method versus a traditional method

- Total \(300\) students
- \(125\) randomly picked and assigned to the computer-aided teaching method
- the remaining \(175\) assigned to the traditional teaching method

Data:

ExamResults | viaComputer | Traditional |
---|---|---|
Pass | 94 | 113 |
Fail | 31 | 62 |
Total | 125 | 175 |

Does instruction using computer software appear to increase the proportion of students passing the examination in comparison to the pass rate using the traditional method? Use Type I error probability \(0.05\)

Formulate hypothesis

Conduct hypothesis testing

Construct confidence interval

Quantities needed for hypothesis testing:

```
[1] 0.752
[1] 0.6457
[1] 0.05291
[1] 2.009
```

Conclusion on hypothesis testing?

95% Confidence interval:

```
[1] 0.1063
[1] 0.05291
[1] 2.009
```

CI: \(0.752-0.6457 \pm 1.96 \times 0.05291 = 0.106 \pm 0.104\)
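
The printed quantities for this example can be reproduced along the following lines. This is a sketch with illustrative variable names, using the one-sided critical value `qnorm(0.95)` because the question asks whether the computer-aided method increases the pass rate.

```r
# Teaching method example: H0: pi_1 = pi_2 versus Ha: pi_1 > pi_2
pass  <- c(computer = 94,  traditional = 113)
total <- c(computer = 125, traditional = 175)

p_hat <- pass / total                                   # observed pass rates
d     <- unname(p_hat["computer"] - p_hat["traditional"])
se    <- sqrt(sum(p_hat * (1 - p_hat) / total))
z     <- d / se

signif(p_hat, 4)  # 0.752 and 0.6457
signif(d, 4)      # 0.1063
signif(se, 4)     # 0.05291
signif(z, 4)      # 2.009

# One-sided test at alpha = 0.05
z > qnorm(0.95)

# 95% confidence interval for pi_1 - pi_2
ci <- d + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)
```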

- Severity of skin disease associated with patient’s age?
- Party Affiliation affects Opinion on Tax Reform?
- Daily Shifts affect Quality of Workmanship in manufacture?

Severity of skin disease related to patient’s age?

Severity | I | II | III | IV |
---|---|---|---|---|
Moderate | 15 | 32 | 18 | 5 |
Mildly Severe | 8 | 29 | 23 | 18 |
Severe | 1 | 20 | 25 | 22 |

To assess:

- \(H_0:\) the row and column variables are independent
- \(H_a:\) the row and column variables are dependent

the test statistic is \[\chi^2 = \sum_{i,j}\frac{( n_{ij}-\hat{E}_{ij} )^2}{\hat{E}_{ij}}\] where \(\hat{E}_{ij} = n \hat{\pi}_{ij}\)

To assess row-column independence

\(\chi^2\) has degrees of freedom \((r-1)(c-1)\)

Reject \(H_0\) if \(\chi^2 \ge \chi^2_{\alpha}\), where \(\chi^2_{\alpha}\) is such that

\[P(\chi^2 \ge \chi^2_{\alpha})=\alpha\]

Density with df=5:
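
The density plot is not reproduced here. A minimal base R sketch that draws it and marks the upper \(\alpha = 0.05\) critical value is given below; the choice of \(\alpha\) and the plotting range are assumptions made for the illustration.

```r
# Chi-square density with 5 degrees of freedom
df    <- 5
alpha <- 0.05
crit  <- qchisq(1 - alpha, df = df)  # upper-tail critical value

curve(dchisq(x, df = df), from = 0, to = 20,
      xlab = "chi-square", ylab = "density",
      main = "Chi-square density, df = 5")
abline(v = crit, lty = 2)  # rejection region lies to the right of this line

crit
```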

Get row and column totals

Severity | I | II | III | IV | All Ages |
---|---|---|---|---|---|
Moderate | 15 | 32 | 18 | 5 | 70 |
Mildly Severe | 8 | 29 | 23 | 18 | 78 |
Severe | 1 | 20 | 25 | 22 | 68 |
All Severities | 24 | 81 | 66 | 45 | 216 |

Get expected counts: \(\hat{E}_{ij} =\frac{n_{i\cdot} n_{\cdot j}}{n}\)

Severity | I | II | III | IV | All Ages |
---|---|---|---|---|---|
Moderate | 7.778 | 26.25 | 21.39 | 14.58 | 70 |
Mildly Severe | 8.667 | 29.25 | 23.83 | 16.25 | 78 |
Severe | 7.556 | 25.50 | 20.78 | 14.17 | 68 |
All Severities | 24.000 | 81.00 | 66.00 | 45.00 | 216 |

Compute the sum: \(\chi^2 = \sum_{i,j}\frac{( n_{ij}-\hat{E}_{ij} )^2}{\hat{E}_{ij}}\)

`[1] 27.13`
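
The steps above (totals, expected counts, sum) can be sketched in a few lines of base R. The object name `skin` and the dimension labels are illustrative choices, not taken from the slides.

```r
# Observed counts: severity (rows) by age group (columns)
skin <- matrix(c(15, 32, 18,  5,
                  8, 29, 23, 18,
                  1, 20, 25, 22),
               nrow = 3, byrow = TRUE,
               dimnames = list(Severity = c("Moderate", "Mildly Severe", "Severe"),
                               Age = c("I", "II", "III", "IV")))

n <- sum(skin)                                  # grand total, 216

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(skin), colSums(skin)) / n

# Pearson chi-square statistic and its degrees of freedom
chi_sq <- sum((skin - expected)^2 / expected)
df     <- (nrow(skin) - 1) * (ncol(skin) - 1)

round(expected, 2)
signif(chi_sq, 4)   # 27.13
df                  # 6
```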

Hypothesis testing at type I error \(0.05\): Chi-square table

- Critical value: 12.59 from Chi square distribution with df=6

```
> qchisq(0.05, df = 6, ncp = 0, lower.tail = FALSE)
[1] 12.59
```

Since \(27.13 > 12.59\), reject \(H_0\) that severity of skin disease is NOT associated with a patient’s age

Data as a matrix or data frame:

Severity | I | II | III | IV |
---|---|---|---|---|
Moderate | 15 | 32 | 18 | 5 |
Mildly Severe | 8 | 29 | 23 | 18 |
Severe | 1 | 20 | 25 | 22 |

Command:

```
> # dataset should be a matrix or data.frame
> chisq.test(dataset)
```

```
> chisq.test(SkinAge)
Pearson's Chi-squared test
data: SkinAge
X-squared = 27, df = 6, p-value = 1e-04
```
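
The slides do not show how `SkinAge` was created. One plausible construction, given here as an assumption, is a matrix of the observed counts; the components of the returned test object can then be inspected directly.

```r
# Assumed construction of SkinAge from the table of observed counts
SkinAge <- matrix(c(15, 32, 18,  5,
                     8, 29, 23, 18,
                     1, 20, 25, 22),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Moderate", "Mildly Severe", "Severe"),
                                  c("I", "II", "III", "IV")))

res <- chisq.test(SkinAge)

res$statistic   # Pearson chi-square statistic
res$parameter   # degrees of freedom
res$p.value     # p-value
res$expected    # expected counts under independence
```

The `expected` component is a convenient check that all expected counts are large enough for the chi-square approximation.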

```
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 15063)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.17
loaded via a namespace (and not attached):
[1] backports_1.1.0 magrittr_1.5 rprojroot_1.2 formatR_1.5
[5] tools_3.3.0 htmltools_0.3.6 revealjs_0.9 yaml_2.1.14
[9] Rcpp_0.12.12 stringi_1.1.5 rmarkdown_1.6 stringr_1.2.0
[13] digest_0.6.12 evaluate_0.10.1
```