# Motivations

Proportions relate to chance, and chance to probability. Typical applications:

• Preference between products
• Efficacy for drugs
• Reliability of mechanical systems

# Binomial distribution

The binomial distribution describes the number of successes in independent trials:

• in each trial, the probability of a success is $$\pi$$ and the probability of a failure is $$1-\pi$$
• $$n$$ such independent trials have been conducted

# Binomial distribution

Its probability mass function is:

$P(Y = y) = \frac{n!}{y!(n-y)!}\pi^y (1-\pi)^{n-y}$ for $$y=0,\ldots,n$$
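As a quick numerical check of the formula above, the following R sketch evaluates the mass function directly and compares it with R's built-in `dbinom`. The helper name `binom_pmf` and the values $$n = 10$$, $$\pi = 0.3$$ are illustrative assumptions, not part of the original example.

```r
# Binomial pmf computed directly from the formula (illustrative helper)
binom_pmf <- function(y, n, p) {
  choose(n, y) * p^y * (1 - p)^(n - y)
}

n <- 10   # number of trials (assumed for illustration)
p <- 0.3  # success probability (assumed for illustration)
y <- 0:n

manual  <- binom_pmf(y, n, p)
builtin <- dbinom(y, size = n, prob = p)

all.equal(manual, builtin)  # the two computations agree
sum(manual)                 # the probabilities sum to 1
```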

# Binomial distribution

Probability mass function for $$\mathsf{Binomial}(50,0.5)$$:
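A plot of this mass function can be drawn with base R graphics. The sketch below is one way to do it; the plotting options are choices made here and may differ from the original figure.

```r
n <- 50
p <- 0.5
y <- 0:n
pmf <- dbinom(y, size = n, prob = p)

# Vertical-line plot of the mass function
plot(y, pmf, type = "h", xlab = "y", ylab = "P(Y = y)",
     main = "Binomial(50, 0.5)")

y[which.max(pmf)]  # most likely value is n * p = 25
```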

# Inference on one proportion

Method:

• Observed proportion: $$\hat{\pi} = \frac{y}{n}$$

• “Standard deviation”: $$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}$$

• $$z = \frac{\hat{\pi} - \pi}{\hat{\sigma}_{\hat{\pi}}} \to \mathsf{Normal}(0,1)$$ when $$n \to \infty$$

Rule of thumb: if $$n \ge \frac{5}{\min(\pi,1-\pi)}$$, that is, if both $$n\pi \ge 5$$ and $$n(1-\pi) \ge 5$$, a Normal test can be used

# Inference on one proportion

$$(1 - \alpha)\times 100 \%$$ Confidence interval when $$n$$ is large is

$\hat{\pi} \pm z_{\alpha/2} \hat{\sigma}_{\hat{\pi}}$

where $$\hat{\pi} = \frac{y}{n}$$ and $$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}$$

Caution: when $$n$$ is small, adjustment is needed; see WAC adjustment
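The sketch below contrasts the large-sample interval with an adjusted one on a small sample. It assumes that the WAC adjustment refers to the Wilson-Agresti-Coull form, which adds $$z_{\alpha/2}^2/2$$ successes and $$z_{\alpha/2}^2$$ trials before applying the same formula. The function names `wald_ci` and `wac_ci` and the data (2 successes in 20 trials) are hypothetical.

```r
# Large-sample interval from the formula above
wald_ci <- function(y, n, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  p  <- y / n
  se <- sqrt(p * (1 - p) / n)
  c(lower = p - z * se, upper = p + z * se)
}

# Adjusted interval (assumed form of the WAC adjustment):
# add z^2 / 2 successes and z^2 trials, then use the same formula
wac_ci <- function(y, n, conf = 0.95) {
  z     <- qnorm(1 - (1 - conf) / 2)
  n_adj <- n + z^2
  p_adj <- (y + z^2 / 2) / n_adj
  se    <- sqrt(p_adj * (1 - p_adj) / n_adj)
  c(lower = p_adj - z * se, upper = p_adj + z * se)
}

# Hypothetical small sample: 2 successes in 20 trials
wald_ci(2, 20)  # lower limit falls below 0
wac_ci(2, 20)   # lower limit stays above 0
```

With so few successes the unadjusted interval extends below zero, which is the kind of problem the adjustment is meant to address.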

# Inference on one proportion

At type I error $$\alpha$$, when $$n$$ is large the test statistic

$z = \frac{\hat{\pi} - \pi}{\hat{\sigma}_{\hat{\pi}}}$ based on Normal distribution can be used for hypothesis testing, where $$\hat{\pi} = \frac{y}{n}$$ and $$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}$$

# Example 10.5 as an Exercise

• National survey: $$44\%$$ of U.S. college students engaged in binge drinking

• $$2,500$$ undergraduates surveyed from University A; $$1,200$$ of them engaged in binge drinking

• Is the percentage of students engaged in binge drinking at University A greater than the percentage from the national survey?

# Example 10.5 as an Exercise

• Formulate hypothesis

• Conduct hypothesis testing

• Construct confidence interval

# Example 10.5 as an Exercise

Value of z-test statistic

[1] 4.003

Critical value for the one-sided test at $$\alpha = 0.05$$: $$z_{0.05}=1.645$$

Conclusion on hypothesis testing?
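The test statistic can be reproduced in R from the survey counts. This is a minimal sketch that assumes a one-sided alternative (University A above the national rate) at level $$0.05$$; the variable names are chosen here for illustration.

```r
y   <- 1200   # binge drinkers in the University A sample
n   <- 2500   # sample size
pi0 <- 0.44   # national proportion under H0

pi_hat <- y / n
se     <- sqrt(pi_hat * (1 - pi_hat) / n)
z      <- (pi_hat - pi0) / se

z_crit  <- qnorm(0.95)                    # one-sided critical value, about 1.645
p_value <- pnorm(z, lower.tail = FALSE)   # upper-tail p-value

round(z, 3)   # 4.003
z > z_crit    # TRUE: reject H0
```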

# Example 10.5 as an Exercise

$$\hat{\sigma}_{\hat{\pi}}$$:

[1] 0.009992

95% CI: $0.48 \pm 1.96 \times 0.01 = 0.48 \pm 0.0196$
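The same interval can be computed without rounding the standard error. The sketch uses `qnorm(0.975)` in place of the tabulated value 1.96.

```r
y <- 1200
n <- 2500

pi_hat <- y / n
se     <- sqrt(pi_hat * (1 - pi_hat) / n)
z      <- qnorm(0.975)   # about 1.96

ci <- pi_hat + c(-1, 1) * z * se
round(ci, 3)   # approximately 0.460 to 0.500
```

The interval lies entirely above the national value of 0.44, in line with the test result.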

# Inference on two proportions

Two binomial populations $$\textsf{Binomial}(n_1, \pi_1)$$ and $$\textsf{Binomial}(n_2, \pi_2)$$.

For the difference $$\pi_1 - \pi_2$$:

• Observed proportions: $$\hat{\pi}_1 = \frac{y_1}{n_1}$$ and $$\hat{\pi}_2 = \frac{y_2}{n_2}$$

• Difference: $$\hat{\pi}_1 - \hat{\pi}_2$$

• “Standard deviation”: $$\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}$$

# Inference on two proportions

$$(1 - \alpha)\times 100 \%$$ Confidence interval when $$n_1$$ and $$n_2$$ are large is

$\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2} \hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}$

where $$\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}$$

# Inference on two proportions

Two binomial populations $$\textsf{Binomial}(n_1, \pi_1)$$ and $$\textsf{Binomial}(n_2, \pi_2)$$. For testing $$H_0: \pi_1 - \pi_2 = 0$$, the test statistic is

$z= \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}}$

Caution: applicable only if $$n_i \hat{\pi}_i$$ and $$n_i (1-\hat{\pi}_i)$$ with $$i=1,2$$ are at least $$5$$; otherwise Fisher’s exact test is recommended
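The check in the caution above can be sketched in R, falling back to `fisher.test` from the `stats` package when any count is below 5. The counts used here are hypothetical and serve only to trigger the fallback.

```r
# Hypothetical small-sample counts (not from the examples in this deck)
y <- c(3, 9)     # successes in groups 1 and 2
n <- c(12, 15)   # group sizes

pi_hat <- y / n
counts <- c(n * pi_hat, n * (1 - pi_hat))  # successes and failures per group

normal_ok <- all(counts >= 5)
normal_ok   # FALSE here, since group 1 has only 3 successes

# 2 x 2 table: rows = success/failure, columns = groups
tab <- rbind(success = y, failure = n - y)
if (!normal_ok) {
  res <- fisher.test(tab)
  res$p.value
}
```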

# Example 10.7 as an Exercise

Effectiveness of teaching English via a computer-aided method versus the traditional method

• $$300$$ students in total
• $$125$$ randomly picked and assigned to the computer-aided teaching method
• the remaining $$175$$ assigned to the traditional teaching method

Data:

| | Computer-aided | Traditional |
|---|---|---|
| Pass | 94 | 113 |
| Fail | 31 | 62 |
| Total | 125 | 175 |

# Example 10.7 as an Exercise

Does instruction using computer software appear to increase the proportion of students passing the examination in comparison to the pass rate using the traditional method? Use Type I error probability $$0.05$$

• Formulate hypothesis

• Conduct hypothesis testing

• Construct confidence interval

# Example 10.7 as an Exercise

Quantities needed for hypothesis testing:

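$$\hat{\pi}_1$$ (computer-aided):

[1] 0.752

$$\hat{\pi}_2$$ (traditional):

[1] 0.6457

$$\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}$$:

[1] 0.05291

Value of z-test statistic:

[1] 2.009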

Conclusion on hypothesis testing?
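These quantities can be reproduced from the pass counts. The sketch assumes the first data column (group total 125) is the computer-aided group and the second (group total 175) is the traditional group, and it uses a one-sided alternative at level $$0.05$$.

```r
y1 <- 94;  n1 <- 125   # passes and group size, computer-aided method
y2 <- 113; n2 <- 175   # passes and group size, traditional method

p1 <- y1 / n1
p2 <- y2 / n2
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z  <- (p1 - p2) / se

round(c(p1 = p1, p2 = p2, se = se, z = z), 4)

z > qnorm(0.95)                # TRUE: reject H0 in the one-sided test
pnorm(z, lower.tail = FALSE)   # upper-tail p-value
```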

# Example 10.7 as an Exercise

95% Confidence interval:

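$$\hat{\pi}_1 - \hat{\pi}_2$$:

[1] 0.1063

$$\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}$$:

[1] 0.05291

Value of z-test statistic:

[1] 2.009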

CI: $$0.752-0.6457 \pm 1.96 \times 0.05291 = 0.106 \pm 0.104$$
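The interval can also be computed in R directly from the counts. This is a minimal sketch; the names `d`, `margin`, and `ci` are chosen here for illustration.

```r
p1 <- 94 / 125    # pass rate, computer-aided
p2 <- 113 / 175   # pass rate, traditional
se <- sqrt(p1 * (1 - p1) / 125 + p2 * (1 - p2) / 175)

d      <- p1 - p2
margin <- qnorm(0.975) * se
ci     <- d + c(-1, 1) * margin

round(c(difference = d, margin = margin), 3)   # 0.106 and 0.104
round(ci, 3)
```

The lower limit is only slightly above zero, so the evidence for a higher pass rate is present but not strong.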

# Contingency table

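Is the severity of skin disease related to a patient’s age? Rows give severity; columns give age groups I to IV.

| Severity | I | II | III | IV |
|---|---|---|---|---|
| Moderate | 15 | 32 | 18 | 5 |
| Mildly Severe | 8 | 29 | 23 | 18 |
| Severe | 1 | 20 | 25 | 22 |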

# Pearson’s Chi-square test

To assess:

• $$H_0:$$ the row and column variables are independent
• $$H_a:$$ the row and column variables are dependent

the test statistic is $\chi^2 = \sum_{i,j}\frac{( n_{ij}-\hat{E}_{ij} )^2}{\hat{E}_{ij}}$ where $$n_{ij}$$ is the observed count in cell $$(i,j)$$ and $$\hat{E}_{ij} = n \hat{\pi}_{ij}$$ is the estimated expected count under $$H_0$$

# Pearson’s Chi-square test

To assess row-column independence

• For a table with $$r$$ rows and $$c$$ columns, $$\chi^2$$ has $$(r-1)(c-1)$$ degrees of freedom

• Reject $$H_0$$ if $$\chi^2 \ge \chi^2_{\alpha}$$, where $$\chi^2_{\alpha}$$ is such that

$P(\chi^2 \ge \chi^2_{\alpha})=\alpha$

Chi-square table

# Chi-square distribution

Density with df=5:
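A density curve of this kind can be drawn with `dchisq`. The sketch below also marks the upper 5% critical value; the plotting range and the choice of $$\alpha = 0.05$$ are assumptions made here.

```r
x    <- seq(0, 20, length.out = 400)
dens <- dchisq(x, df = 5)

plot(x, dens, type = "l", xlab = expression(chi^2), ylab = "Density",
     main = "Chi-square density, df = 5")

# Mark the upper 5% critical value
crit <- qchisq(0.95, df = 5)
abline(v = crit, lty = 2)
crit   # about 11.07
```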

# Example 10.12

Get row and column totals

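| Severity | I | II | III | IV | All Ages |
|---|---|---|---|---|---|
| Moderate | 15 | 32 | 18 | 5 | 70 |
| Mildly Severe | 8 | 29 | 23 | 18 | 78 |
| Severe | 1 | 20 | 25 | 22 | 68 |
| All Severities | 24 | 81 | 66 | 45 | 216 |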
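The totals can be obtained in R once the counts are stored as a matrix. The sketch below uses the object name `SkinAge` and base functions; the dimension names are labels chosen here.

```r
SkinAge <- matrix(c(15, 32, 18,  5,
                     8, 29, 23, 18,
                     1, 20, 25, 22),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Severity = c("Moderate", "Mildly Severe", "Severe"),
                                  Age = c("I", "II", "III", "IV")))

rowSums(SkinAge)      # 70, 78, 68
colSums(SkinAge)      # 24, 81, 66, 45
addmargins(SkinAge)   # table with a "Sum" row and column appended
```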

# Example 10.12

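Get expected counts: $$\hat{E}_{ij} =\frac{n_{i\cdot} n_{\cdot j}}{n}$$, the product of the row total and the column total divided by the grand total

| Severity | I | II | III | IV | All Ages |
|---|---|---|---|---|---|
| Moderate | 7.778 | 26.25 | 21.39 | 14.58 | 70 |
| Mildly Severe | 8.667 | 29.25 | 23.83 | 16.25 | 78 |
| Severe | 7.556 | 25.50 | 20.78 | 14.17 | 68 |
| All Severities | 24.000 | 81.00 | 66.00 | 45.00 | 216 |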
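In R, the expected counts are the outer product of the row and column totals divided by the grand total. A minimal sketch, with the counts re-entered so that it runs on its own:

```r
observed <- matrix(c(15, 32, 18,  5,
                      8, 29, 23, 18,
                      1, 20, 25, 22),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Moderate", "Mildly Severe", "Severe"),
                                   c("I", "II", "III", "IV")))

# E_ij = (row total i) * (column total j) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)
```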

# Example 10.12

Compute the sum: $$\chi^2 = \sum_{i,j}\frac{( n_{ij}-\hat{E}_{ij} )^2}{\hat{E}_{ij}}$$

[1] 27.13
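The sum can be evaluated cell by cell in one vectorized expression. The sketch below recomputes the expected counts first so that it is self-contained.

```r
observed <- matrix(c(15, 32, 18,  5,
                      8, 29, 23, 18,
                      1, 20, 25, 22),
                   nrow = 3, byrow = TRUE)

expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
round(chi_sq, 2)   # 27.13
```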

# Example 10.12

Hypothesis testing at type I error $$0.05$$: Chi-square table

• Critical value: 12.59 from the chi-square distribution with df = (3 - 1)(4 - 1) = 6
> qchisq(0.05, df = 6, ncp = 0, lower.tail = FALSE)
[1] 12.59

Since $$27.13 > 12.59$$, reject $$H_0$$, which states that severity of skin disease is not associated with a patient’s age
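The decision can equally be stated through a p-value. The sketch below takes the statistic 27.13 as given and uses `pchisq` for the upper-tail probability.

```r
chi_sq <- 27.13               # statistic computed in the example
df     <- (3 - 1) * (4 - 1)   # (r - 1)(c - 1) = 6

crit    <- qchisq(0.95, df = df)   # same value as qchisq(0.05, df, lower.tail = FALSE)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(crit, 2)   # 12.59
chi_sq > crit    # TRUE: reject H0
p_value          # on the order of 1e-04
```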

# Software implementation

Data as a matrix or data frame:

I II III IV
Moderate 15 32 18 5
Mildly Severe 8 29 23 18
Severe 1 20 25 22

# Software implementation

Command:

> # dataset should be a matrix or data.frame
> chisq.test(dataset)

# Software implementation

> chisq.test(SkinAge)

Pearson's Chi-squared test

data:  SkinAge
X-squared = 27, df = 6, p-value = 1e-04
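For completeness, here is one way the `SkinAge` object could be built and the test result inspected. The dimension names are labels chosen here, and the components accessed (`statistic`, `parameter`, `p.value`, `expected`) are standard elements of the object returned by `chisq.test`.

```r
SkinAge <- matrix(c(15, 32, 18,  5,
                     8, 29, 23, 18,
                     1, 20, 25, 22),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Severity = c("Moderate", "Mildly Severe", "Severe"),
                                  Age = c("I", "II", "III", "IV")))

res <- chisq.test(SkinAge)

res$statistic   # X-squared, about 27.13
res$parameter   # df = 6
res$p.value     # p-value of the test
res$expected    # expected counts under independence
```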

# Extras

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 15063)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitr_1.17

loaded via a namespace (and not attached):
[1] backports_1.1.0 magrittr_1.5    rprojroot_1.2   formatR_1.5
[5] tools_3.3.0     htmltools_0.3.6 revealjs_0.9    yaml_2.1.14
[9] Rcpp_0.12.12    stringi_1.1.5   rmarkdown_1.6   stringr_1.2.0
[13] digest_0.6.12   evaluate_0.10.1