Stat 412 — Week 9

Xiongzhi Chen

Washington State University

Fall 2017

Inference on proportions

Motivations

Proportions relate to chance, and chance to probability:

  • Preference between products
  • Efficacy for drugs
  • Reliability of mechanical systems

Binomial distribution

The binomial distribution describes the number of successes in independent trials:

  • in each trial, the probability of a success is \(\pi\) and the probability of a failure is \(1-\pi\)
  • \(n\) such independent trials are conducted

Binomial distribution

Its probability mass function is:

\[P(Y = y) = \frac{n!}{y!(n-y)!}\pi^y (1-\pi)^{n-y}\] for \(y=0,\ldots,n\)

Binomial distribution

Probability mass function for \(\mathsf{Binomial}(50,0.5)\):
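A minimal R sketch (not part of the original slides) that computes and plots this probability mass function with dbinom:

> # probability mass function of Binomial(n = 50, pi = 0.5)
> y <- 0:50
> pmf <- dbinom(y, size = 50, prob = 0.5)
> # spike plot of P(Y = y) against y
> plot(y, pmf, type = "h", xlab = "y", ylab = "P(Y = y)")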

Inference on one proportion

Inference on one proportion

Method:

  • Observed proportion: \(\hat{\pi} = \frac{y}{n}\)

  • “Standard deviation”: \(\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\)

  • \(z = \frac{\hat{\pi} - \pi}{\hat{\sigma}_{\hat{\pi}}} \to \mathsf{Normal}(0,1)\) when \(n \to \infty\)

Heuristic: if \(n \ge \frac{5}{\min(\pi,1-\pi)}\), a Normal-based test can be used
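A minimal R sketch of these quantities, using hypothetical counts (\(y = 30\) successes in \(n = 100\) trials) and a hypothetical null value \(\pi_0 = 0.25\):

> # hypothetical data: y successes out of n trials, null value pi0
> y <- 30; n <- 100; pi0 <- 0.25
> pihat <- y / n                            # observed proportion
> sdhat <- sqrt(pihat * (1 - pihat) / n)    # "standard deviation" of pihat
> z <- (pihat - pi0) / sdhat                # approximately Normal(0, 1) for large n
> n >= 5 / min(pi0, 1 - pi0)                # heuristic check for using a Normal test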

Inference on one proportion

The \((1 - \alpha)\times 100 \%\) confidence interval when \(n\) is large is

\[\hat{\pi} \pm z_{\alpha/2} \hat{\sigma}_{\hat{\pi}}\]

where \(\hat{\pi} = \frac{y}{n}\) and \(\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\)

Caution: when \(n\) is small, an adjustment is needed; see the WAC adjustment
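A minimal R sketch of the large-sample interval, again with hypothetical counts (\(y = 30\), \(n = 100\)) and \(\alpha = 0.05\):

> # large-sample (1 - alpha) * 100% confidence interval
> y <- 30; n <- 100; alpha <- 0.05
> pihat <- y / n
> sdhat <- sqrt(pihat * (1 - pihat) / n)
> pihat + c(-1, 1) * qnorm(1 - alpha / 2) * sdhat    # pihat -/+ z_{alpha/2} * sdhat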

Inference on one proportion

At Type I error rate \(\alpha\), when \(n\) is large the test statistic

\[z = \frac{\hat{\pi} - \pi}{\hat{\sigma}_{\hat{\pi}}}\] can be compared to the standard Normal distribution for hypothesis testing, where \(\pi\) is the proportion specified by the null hypothesis, \(\hat{\pi} = \frac{y}{n}\), and \(\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\)
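R's built-in prop.test carries out a closely related large-sample test; it uses the hypothesized proportion in its standard error and, by default, a continuity correction, so its result can differ slightly from the hand computation above. A sketch with hypothetical counts:

> # hypothetical data: 30 successes in 100 trials, testing pi = 0.25 vs pi > 0.25
> # correct = FALSE turns off the continuity correction
> prop.test(x = 30, n = 100, p = 0.25, alternative = "greater", correct = FALSE)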

Example 10.5 as an Exercise

  • National survey: \(44\%\) of U.S. college students engaged in binge drinking

  • \(2,500\) undergraduates surveyed from University A; \(1,200\) of them engaged in binge drinking

  • Is the percentage of students engaged in binge drinking at University A greater than the national percentage?

Example 10.5 as an Exercise

  • Formulate hypothesis

  • Conduct hypothesis testing

  • Construct confidence interval

Example 10.5 as an Exercise

Value of z-test statistic

[1] 4.003
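A minimal R sketch reproducing this value (with \(\hat{\pi}\) in the standard error, as on the previous slide):

> # Example 10.5: 1200 of 2500 students, national proportion 0.44
> pihat <- 1200 / 2500                         # 0.48
> sdhat <- sqrt(pihat * (1 - pihat) / 2500)    # about 0.009992
> (pihat - 0.44) / sdhat                       # about 4.003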

Critical value for this one-sided test: \(z_{0.05}=1.645\)

Conclusion on hypothesis testing?

Example 10.5 as an Exercise

\(\hat{\sigma}_{\hat{\pi}}\):

[1] 0.009992

95% CI: \[0.48 \pm 1.96 \times 0.01 = 0.48 \pm 0.0196\]
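A minimal R sketch reproducing this interval:

> # Example 10.5: 95% confidence interval for the proportion at University A
> pihat <- 1200 / 2500
> sdhat <- sqrt(pihat * (1 - pihat) / 2500)
> pihat + c(-1, 1) * qnorm(0.975) * sdhat      # about (0.460, 0.500)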

Inference on two proportions

Inference on two proportions

Two binomial populations \(\textsf{Binomial}(n_1, \pi_1)\) and \(\textsf{Binomial}(n_2, \pi_2)\).

For the difference \(\pi_1 - \pi_2\):

  • Observed proportions: \(\hat{\pi}_1 = \frac{y_1}{n_1}\) and \(\hat{\pi}_2 = \frac{y_2}{n_2}\)

  • Difference: \(\hat{\pi}_1 - \hat{\pi}_2\)

  • “Standard deviation”: \(\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}\)
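A minimal R sketch of these quantities, using hypothetical counts (\(y_1 = 40\) of \(n_1 = 80\) and \(y_2 = 30\) of \(n_2 = 90\)):

> # hypothetical data for two binomial samples
> y1 <- 40; n1 <- 80; y2 <- 30; n2 <- 90
> pihat1 <- y1 / n1; pihat2 <- y2 / n2         # observed proportions
> diffhat <- pihat1 - pihat2                   # estimated difference
> sddiff <- sqrt(pihat1 * (1 - pihat1) / n1 + pihat2 * (1 - pihat2) / n2)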

Inference on two proportions

The \((1 - \alpha)\times 100 \%\) confidence interval when \(n_1\) and \(n_2\) are large is

\[\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2} \hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}\]

where \(\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}\)

Inference on two proportions

Two binomial populations \(\textsf{Binomial}(n_1, \pi_1)\) and \(\textsf{Binomial}(n_2, \pi_2)\). For the difference \(\pi_1 - \pi_2\) (testing \(H_0: \pi_1 = \pi_2\)), the test statistic is

\[z= \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_2}}}\]

Caution: applicable only if \(n_i \hat{\pi}_i\) and \(n_i (1-\hat{\pi}_i)\) for \(i=1,2\) are all at least \(5\); otherwise Fisher’s exact test is recommended
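A minimal R sketch of this test for hypothetical counts, with Fisher's exact test (R's fisher.test on the 2-by-2 table of successes and failures) as the small-count alternative:

> # hypothetical data for two binomial samples
> y1 <- 40; n1 <- 80; y2 <- 30; n2 <- 90
> pihat1 <- y1 / n1; pihat2 <- y2 / n2
> sddiff <- sqrt(pihat1 * (1 - pihat1) / n1 + pihat2 * (1 - pihat2) / n2)
> (pihat1 - pihat2) / sddiff                   # z, compare to Normal(0, 1)
> # rule of thumb: all of these counts should be at least 5
> c(n1 * pihat1, n1 * (1 - pihat1), n2 * pihat2, n2 * (1 - pihat2))
> # small counts: Fisher's exact test instead
> fisher.test(matrix(c(y1, n1 - y1, y2, n2 - y2), nrow = 2))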

Example 10.7 as an Exercise

Effectiveness of teaching English via a computer-aided method versus a traditional method

  • Total \(300\) students
  • \(125\) randomly picked and assigned to the computer-aided teaching method
  • the remaining \(175\) assigned to the traditional teaching method

Example 10.7 as an Exercise

Data:

ExamResults    viaComputer   Traditional
Pass                    94           113
Fail                    31            62
Total                  125           175

Example 10.7 as an Exercise

Does instruction using computer software appear to increase the proportion of students passing the examination in comparison to the pass rate using the traditional method? Use Type I error probability \(0.05\)

  • Formulate hypothesis

  • Conduct hypothesis testing

  • Construct confidence interval

Example 10.7 as an Exercise

Quantities needed for hypothesis testing (in order: \(\hat{\pi}_1\), \(\hat{\pi}_2\), \(\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}\), and \(z\)):

[1] 0.752
[1] 0.6457
[1] 0.05291
[1] 2.009
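A minimal R sketch reproducing these four values:

> # Example 10.7: passes out of 125 (computer-aided) and 175 (traditional) students
> pihat1 <- 94 / 125                           # 0.752
> pihat2 <- 113 / 175                          # about 0.6457
> sddiff <- sqrt(pihat1 * (1 - pihat1) / 125 + pihat2 * (1 - pihat2) / 175)
> z <- (pihat1 - pihat2) / sddiff              # about 2.009
> c(pihat1, pihat2, sddiff, z)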

Conclusion on hypothesis testing?

Example 10.7 as an Exercise

95% Confidence interval (in order: the difference \(\hat{\pi}_1 - \hat{\pi}_2\), its standard error \(\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}\), and the \(z\) statistic):

[1] 0.1063
[1] 0.05291
[1] 2.009

CI: \(0.752-0.6457 \pm 1.96 \times 0.05291 = 0.106 \pm 0.104\)
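A minimal R sketch reproducing this interval:

> # Example 10.7: 95% confidence interval for pi1 - pi2
> pihat1 <- 94 / 125; pihat2 <- 113 / 175
> sddiff <- sqrt(pihat1 * (1 - pihat1) / 125 + pihat2 * (1 - pihat2) / 175)
> (pihat1 - pihat2) + c(-1, 1) * qnorm(0.975) * sddiff    # about (0.003, 0.210)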

Test of independence

Motivations

Contingency table

Severity of skin disease related to patient’s age?

                 I   II   III   IV
Moderate        15   32    18    5
Mildly Severe    8   29    23   18
Severe           1   20    25   22

Pearson’s Chi-square test

To assess:

  • \(H_0:\) the row and column variables are independent
  • \(H_a:\) the row and column variables are dependent

the test statistic is \[\chi^2 = \sum_{i,j}\frac{( n_{ij}-\hat{E}_{ij} )^2}{\hat{E}_{ij}}\] where \(\hat{E}_{ij} = n \hat{\pi}_{i\cdot} \hat{\pi}_{\cdot j}\) is the expected count of cell \((i,j)\) estimated under \(H_0\)

Pearson’s Chi-square test

To assess row-column independence

  • \(\chi^2\) has degrees of freedom \((r-1)(c-1)\), where \(r\) and \(c\) are the numbers of rows and columns

  • Reject \(H_0\) if \(\chi^2 \ge \chi^2_{\alpha}\), where \(\chi^2_{\alpha}\) is such that

\[P(\chi^2 \ge \chi^2_{\alpha})=\alpha\]

Chi-square table

Chi-square distribution

Density with df=5:
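A minimal R sketch (not part of the original slides) that draws this density with dchisq:

> # density of the chi-square distribution with 5 degrees of freedom
> x <- seq(0, 20, length.out = 400)
> plot(x, dchisq(x, df = 5), type = "l", xlab = "x", ylab = "density")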

Example 10.12

Get row and column totals

                 I   II   III   IV   All Ages
Moderate        15   32    18    5         70
Mildly Severe    8   29    23   18         78
Severe           1   20    25   22         68
All Severities  24   81    66   45        216

Example 10.12

Get expected counts: \(\hat{E}_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}\)

                    I      II      III      IV   All Ages
Moderate        7.778   26.25    21.39   14.58         70
Mildly Severe   8.667   29.25    23.83   16.25         78
Severe          7.556   25.50    20.78   14.17         68
All Severities 24.000   81.00    66.00   45.00        216

Example 10.12

Compute the sum: \(\chi^2 = \sum_{i,j}\frac{( n_{ij}-\hat{E}_{ij} )^2}{\hat{E}_{ij}}\)

[1] 27.13
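A minimal R sketch reproducing the expected counts and this value; it assumes the observed table is stored as the matrix SkinAge used with chisq.test later in these slides:

> # observed counts: severity (rows) by age class (columns)
> SkinAge <- matrix(c(15, 32, 18, 5, 8, 29, 23, 18, 1, 20, 25, 22), nrow = 3, byrow = TRUE)
> dimnames(SkinAge) <- list(c("Moderate", "Mildly Severe", "Severe"), c("I", "II", "III", "IV"))
> # expected counts under independence: row total * column total / grand total
> E <- outer(rowSums(SkinAge), colSums(SkinAge)) / sum(SkinAge)
> sum((SkinAge - E)^2 / E)                     # Pearson chi-square statistic, about 27.13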

Example 10.12

Hypothesis testing at type I error \(0.05\): Chi-square table

  • Critical value: \(12.59\) from the Chi-square distribution with df=6
> qchisq(0.05, df = 6, ncp = 0, lower.tail = F)
[1] 12.59

Since \(27.13 > 12.59\), reject \(H_0\) that the severity of skin disease is not associated with a patient’s age; the data suggest that severity and age are associated

Software implementation

Data as a matrix or data frame:

                 I   II   III   IV
Moderate        15   32    18    5
Mildly Severe    8   29    23   18
Severe           1   20    25   22

Software implementation

Command:

> # dataset should be a matrix or data.frame
> chisq.test(dataset)

Software implementation

> chisq.test(SkinAge)

    Pearson's Chi-squared test

data:  SkinAge
X-squared = 27, df = 6, p-value = 1e-04

Extras

License and Session Information

License

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 15063)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.17

loaded via a namespace (and not attached):
 [1] backports_1.1.0 magrittr_1.5    rprojroot_1.2   formatR_1.5    
 [5] tools_3.3.0     htmltools_0.3.6 revealjs_0.9    yaml_2.1.14    
 [9] Rcpp_0.12.12    stringi_1.1.5   rmarkdown_1.6   stringr_1.2.0  
[13] digest_0.6.12   evaluate_0.10.1