# Statistics: learning from data

Statistics is the science of collecting, analyzing, and drawing conclusions from data; in short, it is the science of learning from data.

# Learning from data: basic steps

1. Formulate the problem
2. Design studies/experiments to collect data
3. Analyze data and draw conclusions

The course will cover elementary techniques that are needed for each step involved in Learning from Data:

• Collecting data: surveys and experimental studies
• Summarizing and visualizing data: mean, histogram, etc.
• Analyzing data: inference on means and variances, regression models, analysis of variance, etc.

# General concepts

• Population, sample, random sampling
• Factors, control/treatment, response/covariates/explanatory variables
• Observational study, experimental study

# Two types of studies

• Observational study: the researcher does not interfere with the information generating process

• Experimental study: the researcher manipulates explanatory variables and records their effects on the response variables

# Example 1

• Problem: do students with higher ACT scores have higher GPAs?
• Data collected from University X and University Y
> gpaAct <- read.table("D:/Teaching/stat412/data/CH03PR03.txt",sep = "", header=FALSE)
> colnames(gpaAct) = c("GPA","ACT","IQ", "Rank") # assign names to columns
> class(gpaAct)
[1] "data.frame"

# Example 1 (cont’d)

Show data:

> gpaAct[1:10,]
GPA ACT  IQ Rank
1  3.897  21 122   99
2  3.885  14 132   71
3  3.778  28 119   95
4  2.540  22  99   75
5  3.028  21 131   46
6  3.865  31 139   77
7  2.962  32 113   85
8  3.961  27 136   99
9  0.500  29  75   13
10 3.178  26 106   97

# Example 1 (cont’d)

Eyeball check: are higher ACT scores associated with higher GPAs?

> plot(gpaAct$ACT, gpaAct$GPA, xlab="ACT score", ylab="GPA")
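As a numerical complement to the eyeball check, the sample correlation can be computed with `cor()`. The sketch below uses a hypothetical toy data frame `gpaToy` as a stand-in, since the course data file is local:

```r
# Hypothetical stand-in for gpaAct (the real CH03PR03.txt file is local)
gpaToy <- data.frame(GPA = c(3.9, 2.5, 3.0, 3.9, 2.9),
                     ACT = c(21, 22, 21, 31, 32))

# Sample correlation quantifies the linear association seen in the plot;
# values near 1 indicate a strong positive linear relationship
r <- cor(gpaToy$ACT, gpaToy$GPA)
r
```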

# Example 1 (cont’d)

• Type of study: observational
• Source of data: survey
• Population: college or university students
• Sampled population: students in Universities X and Y
• Sample: GPAs and ACT scores obtained

# Example 1 (cont’d)

Issues with collecting data using surveys:

• How to sample from population
• Nonresponse from participants
• Measurement problems

# Example 2

Problem: how does cutting scheme affect the life of a machine tool?

• 2 Cutting Speeds; 2 Tool Geometries; 2 Cutting Angles
• Coding: -1 or 1
> cutLifetime <- read.csv("D:/Teaching/stat412/data/MontegomeryChp6Prb1.csv", header=TRUE, sep=",")
> class(cutLifetime)
[1] "data.frame"

# Example 2 (cont’d)

Show data:

> cutLifetime[1:10,]
Cutting.Speed Tool.Geometry Cutting.Angle Life.Hours
1             -1            -1            -1         22
2             -1            -1            -1         31
3             -1            -1            -1         25
4              1            -1            -1         32
5              1            -1            -1         43
6              1            -1            -1         29
7             -1             1            -1         35
8             -1             1            -1         34
9             -1             1            -1         50
10             1             1            -1         55

# Example 2 (cont’d)

• Type of study: experimental
• Design: factorial, completely randomized
• Factors: Cutting Speeds, Tool Geometries, Cutting Angles
• Treatments: the 8 combinations for the factors
• Replication: 3 replicates for each treatment

# Example 2 (cont’d)

• Experimental unit: machine tool
• Explanatory variables: the three factors
• Response variables: Life Hours

# Example 2 (cont’d)

• Key ingredients of experimental design: Randomization and Replication
• Issues of experimental design: not always implementable, control of experimental errors

# Example 2 via SAS codes

PROC IMPORT OUT= cuttingeg2
            DATAFILE= "D:\Teaching\stat412\data\MontegomeryChp6Prb1.csv"
            DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;

PROC PRINT data= cuttingeg2;
RUN;

# Calculator

Operations on numbers: + - * / ^

$$(6 + 3 \times 4) \div 2 - 9^3$$

> (6 + 3*4)/2 - 9^3
[1] -720
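Two related operators worth knowing are `%%` (remainder) and `%/%` (integer division); note also that `^` binds more tightly than the unary minus:

```r
17 %% 5    # remainder: 2
17 %/% 5   # integer quotient: 3
-2^2       # parsed as -(2^2), so -4
(-2)^2     # parentheses give 4
```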

# Atomic Classes

Some atomic classes (or modes) of objects in R:

• character
• logical
• numeric (real number)
> x <- "stat412"; x = "stat412"  # character
> x <- TRUE     # logical
> x <- 3.14159  # numeric
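The `class()` function reports the class of each of these objects; an `L` suffix creates an integer, a fourth atomic class:

```r
class("stat412")  # "character"
class(TRUE)       # "logical"
class(3.14159)    # "numeric"
class(3L)         # "integer": the L suffix makes an integer literal
```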

# Evaluation

When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.

> x = 1
> x+2
[1] 3
> print(x)
[1] 1
> print(x+2)
[1] 3

# Functions

There are many useful functions included in R. Here are some examples of built-in functions:

> x <- 2
> print(x)
[1] 2
> sqrt(x)
[1] 1.414214
> log(x)
[1] 0.6931472
> class(x)
[1] "numeric"
> is.vector(x)
[1] TRUE
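You can also define your own functions with `function()`. A minimal sketch with a hypothetical name `standardize`, which converts a vector to z-scores:

```r
# A user-defined function (hypothetical example): center and scale a vector
standardize <- function(x) (x - mean(x)) / sd(x)

z <- standardize(c(2, 4, 6, 8))
mean(z)  # essentially 0
sd(z)    # exactly 1
```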

# Accessing Help in R

You can open the help file for any function by typing ? followed by the function's name. Here is an example:

> ?sqrt

There’s also a function help.search that can do general searches for help. You can learn about it by typing:

> ?help.search

It’s also useful to use Google: for example, “r help square root”. The R help files are also on the web.

# Variable Names

In the previous examples, we used x as our variable name. Do not use the following variable names, as they have special meanings in R:

c, q, s, t, C, D, F, I, T

When combining two words for a given variable, we recommend one of these options:

> my_variable <- 1
> myVariable <- 1

Variable names such as my.variable can be confusing because “.” also has special uses in R (for example, in S3 method names such as print.data.frame), so we recommend avoiding it in variable names.

# Vectors

The vector is the most basic object in R, and you can create vectors in a number of ways. One of the homework problems asks you to create a vector.

> x <- c(1.2, 5, -10, 20, 5)
> x
[1]   1.2   5.0 -10.0  20.0   5.0
> length(x)
[1] 5
> z <- seq(from=0, to=100, by=10)
> z
[1]   0  10  20  30  40  50  60  70  80  90 100
> length(z)
[1] 11
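Two more common constructions are `rep()` for repeated patterns and elementwise (vectorized) arithmetic:

```r
rep(1:3, times = 2)  # 1 2 3 1 2 3
rep(1:3, each = 2)   # 1 1 2 2 3 3

x <- c(1, 2, 3)
x * 10 + 1           # arithmetic applies to every element: 11 21 31
```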

# Vectors

• A vector can only contain elements of a single class:
> x <- c(1.2, 5, -10, 20, 5)
> x
[1]   1.2   5.0 -10.0  20.0   5.0
> x[1] # the first entry of x
[1] 1.2
> x[c(2,5)] # the 2nd and 5th entries of x
[1] 5 5
> z <- c(x, TRUE, FALSE)
> z # the vector z contains all numeric entries
[1]   1.2   5.0 -10.0  20.0   5.0   1.0   0.0
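The coercion above follows a fixed hierarchy, roughly logical < numeric < character; mixing in a character string coerces everything to character:

```r
c(1.2, TRUE)   # logical promoted to numeric: 1.2 1.0
c(1.2, "a")    # everything becomes character: "1.2" "a"

# Going the other way can fail: unparseable strings become NA (with a warning)
as.numeric(c("1.2", "a"))
```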

# Matrices

Like vectors, matrices are objects that can contain elements of only one class.

> m <- matrix(1:6, nrow=2, ncol=3)
> m
[,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> m[1,2] # the (1,2) entry of m
[1] 3
> m[1,] # the 1st row of m
[1] 1 3 5
> m[,1] # the 1st column of m
[1] 1 2
> m[,c(1,3)] # the 1st and 3rd columns of m
[,1] [,2]
[1,]    1    5
[2,]    2    6
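Matrix operations distinguish elementwise arithmetic from true matrix algebra; `t()` transposes and `%*%` multiplies:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

t(m)        # transpose: a 3 x 2 matrix
m %*% t(m)  # matrix product: a 2 x 2 matrix
m * 2       # elementwise multiplication, not matrix multiplication
```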

# Factors

In statistics, factors encode categorical data.

> paint <- factor(c("red", "white", "blue", "blue", "red",
+                   "red"))
> paint
[1] red   white blue  blue  red   red
Levels: blue red white
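`levels()` and `table()` summarize a factor; by default the levels are sorted alphabetically:

```r
paint <- factor(c("red", "white", "blue", "blue", "red", "red"))

levels(paint)  # "blue" "red" "white" (alphabetical by default)
table(paint)   # counts per level: blue 2, red 3, white 1
```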

# Data Frames

The data frame is one of the most important objects in R. Data sets very often come in tabular form with columns of mixed classes, and data frames are constructed exactly for this.

# Data Frames

> df <- data.frame(counting=1:3, char=c("a", "b", "c"),
+                  logic=c(TRUE, FALSE, TRUE))
> df
counting char logic
1        1    a  TRUE
2        2    b FALSE
3        3    c  TRUE
>
> nrow(df)
[1] 3
> ncol(df)
[1] 3
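Columns of a data frame are accessed with `$` or by name, and rows can be filtered with logical vectors:

```r
df <- data.frame(counting = 1:3, char = c("a", "b", "c"),
                 logic = c(TRUE, FALSE, TRUE))

df$counting                  # one column, returned as a vector
df[df$logic, ]               # rows where logic is TRUE
df[, c("counting", "char")]  # select columns by name
```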

# Data Frames

> dim(df)
[1] 3 3
>
> names(df)
[1] "counting" "char"     "logic"
> attributes(df) # gives the attributes (names, row names, etc.) of an object
$names
[1] "counting" "char"     "logic"

$row.names
[1] 1 2 3

# Summary statistics on GPA

> summary(gpaAct$GPA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.500   2.689   3.078   3.074   3.593   4.000
> var(gpaAct$GPA)
[1] 0.4151719
> sd(gpaAct$GPA)
[1] 0.6443383
>
> Mode <- function(x) {
+   ux <- unique(x)
+   ux[which.max(tabulate(match(x, ux)))]
+ }
> Mode(gpaAct$GPA)
[1] 3.885
>
> abline(v = mean(gpaAct$GPA), col = "blue", lwd = 2)
> # Ordered GPAs with median/mode
> plot(sort(gpaAct$GPA), ylab="GPA")
> abline(h = median(gpaAct$GPA), col = "blue", lwd = 2)
> abline(h = Mode(gpaAct$GPA), col = "red", lwd = 2, lty=2)

# Boxplot of GPA

Note: the GPA value 0.5 appears as an outlier

> boxplot(gpaAct$GPA)

# Frequency table

Frequency table of ACT score:

> ftable(gpaAct$ACT)
14 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

 1  1  3  7  3 10  9  4  5 12 10 10 10 10  7  7  4  4  1  1  1

Note: these counts could also be displayed as a pie chart

# Two important definitions

• Conditional probability: $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$ (what if $$P(B)=0$$?)
• Independence: events $$A$$ and $$B$$ are independent iff $$P(A \cap B) = P(A)P(B)$$
• Is $$P(A|B) = P(A)$$ equivalent to $$P(A \cap B) = P(A)P(B)$$?
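The definition of conditional probability can be checked by simulation. A minimal sketch for one fair die, with A = {even roll} and B = {roll at least 4}, where the exact answer is P(A and B)/P(B) = (2/6)/(3/6) = 2/3:

```r
set.seed(1)
rolls <- sample(1:6, 1e5, replace = TRUE)  # 100000 fair die rolls
A <- rolls %% 2 == 0  # even outcome
B <- rolls >= 4       # outcome at least 4

# Empirical P(A|B): relative frequency of A among the trials where B occurred
mean(A & B) / mean(B)  # close to 2/3
```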

# Discrete/continuous rv

• Discrete rv: takes at most countably many values, with law specified by $P(X=a_i)=c_i$
  • Number of heads in coin tosses (Binomial)
  • Number of customer visits to a service center in 10 minutes (Poisson)
• Continuous rv: takes a continuum of values, governed by a law $F(x)=\int_{-\infty}^x f(t)\,dt$ with a density $$f$$
  • Measuring temperature; measuring profit

# The Normal distribution

> curve(dnorm(x,0,1), xlim=c(-3,3), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x))) 

# The Normal distribution

• Symmetric around its mean
• Bell shaped
• A dominant proportion of Normal data are around the mean

# Histogram: standard Normal

> set.seed(123)
> data = rnorm(5000,mean=0,sd=1)
> hist(data, prob=TRUE) # obtain histogram and save prob
> lines(density(data))  # fit a density for data

# The Normal distribution

• Notation: $$Y$$ follows a Normal distribution with mean $$\mu$$ and variance $$\sigma^2$$, i.e., $$Y \sim \mathsf{N}(\mu,\sigma^2)$$
• Density: $$f(y) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp{\left[-\frac{(y -\mu)^2}{2 \sigma^2}\right]}$$
• If $$Y$$ follows $$\mathsf{N}(\mu,\sigma^2)$$, then $$Z = \frac{Y -\mu}{\sigma}$$ follows $$\mathsf{N}(0,1)$$
• The Normal density is very, very special
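The standardization property can be verified numerically with pnorm(): probabilities on the original scale agree with probabilities of the z-score.

```r
mu <- 2; sigma <- 3; y <- 5

pnorm(y, mean = mu, sd = sigma)  # P(Y <= 5) for Y ~ N(2, 9)
pnorm((y - mu) / sigma)          # P(Z <= 1) for Z ~ N(0, 1): the same number
```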

# Standard Normal curve areas

• Table 1 in textbook
• $$z = \frac{y-\mu}{\sigma}$$ is referred to as $$z$$-score when $$Y=y$$ and $$Y \sim \mathsf{N}(\mu,\sigma^2)$$

# Lower tail probability

$$P( X \le 1)$$ when $$X \sim \mathsf{N}(0,1)$$

# Lower tail probability (cont’d)

> cord.x <- c(-5,seq(-5,1,0.01),1); cord.y <- c(0,dnorm(seq(-5,1,0.01)),0)
> curve(dnorm(x,0,1), xlim=c(-5,5), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x)))
> polygon(cord.x,cord.y,col='skyblue') # Add the shaded area.

# Lower tail probability (cont’d)

If $$X \sim \mathsf{N}(0,1)$$, then $$P( X \le 1)$$

> pnorm(1, mean = 0, sd = 1)
[1] 0.8413447

# Upper tail probability

$$P( X \ge 1)$$ when $$X \sim \mathsf{N}(0,1)$$

# Upper tail probability

> cord.x <- c(1,seq(1,5,0.01),5); cord.y <- c(0,dnorm(seq(1,5,0.01)),0)
> curve(dnorm(x,0,1), xlim=c(-5,5), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x)))
> polygon(cord.x,cord.y,col='skyblue') # Add the shaded area.

# Upper tail probability (cont’d)

If $$X \sim \mathsf{N}(0,1)$$, then $$P( X \ge 1) = 1 - P( X \le 1)$$

> 1 - pnorm(1, mean = 0, sd = 1)
[1] 0.1586553

# Interval probability

$$P( -2 \le X \le -1)$$ when $$X \sim \mathsf{N}(0,1)$$

# Interval probability (cont’d)

> cord.x <- c(-2,seq(-2,-1,0.01),-1); cord.y <- c(0,dnorm(seq(-2,-1,0.01)),0)
> curve(dnorm(x,0,1), xlim=c(-3.5,3.5), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x)))
> polygon(cord.x,cord.y,col='skyblue')

# Interval probability (cont’d)

• $$X \sim \mathsf{N}(0,1)$$
• $$P( -2 \le X \le -1) = P(X \le -1) - P(X \le -2)$$
> p1 = pnorm(-1, mean = 0, sd = 1)
> p2 = pnorm(-2, mean = 0, sd = 1)
> p1- p2
[1] 0.1359051
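The same interval probability can be written in one call, since pnorm() is vectorized:

```r
# P(-2 <= X <= -1) for X ~ N(0, 1), via one vectorized call
diff(pnorm(c(-2, -1)))  # [1] 0.1359051
```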

# Random samples

• Random sampling: every different sample of a fixed size has an equal probability of being selected
• Sampling distribution: bridge between sample and population

# Histogram of random sample

> set.seed(123)
> xNormal = rnorm(1000,mean=0,sd=1)
> hist(xNormal, prob=TRUE)
> lines(density(xNormal))  # fit a density for data
> abline(v=0,col="blue")

# Histogram of random sample

> set.seed(123)
> xNormal = rnorm(2000,mean=2,sd=2)
> hist(xNormal, prob=TRUE)
> lines(density(xNormal))
> abline(v=2,col="blue")

# Distribution of sample mean

Assume $$x_1,x_2,\ldots, x_n$$ are i.i.d. $$\mathsf{N}(\mu,\sigma^2)$$ and set $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$

• $$\bar{x} \sim \mathsf{N}(\mu,\sigma^2/n)$$, i.e., $$\bar{x}$$ follows a Normal dist with mean $$\mu$$ and standard deviation $$\sigma/\sqrt{n}$$
• $$\bar{x}$$ has variance $$\sigma^2/n$$, smaller than the variance $$\sigma^2$$ of a single observation when $$n > 1$$, so its distribution is more concentrated around $$\mu$$
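A quick simulation illustrates the shrinking spread: for n = 25 draws from N(2, 4), the standard deviation of the sample mean should be near 2/sqrt(25) = 0.4. A sketch:

```r
set.seed(123)
n <- 25
# 2000 sample means, each computed from n i.i.d. N(mean = 2, sd = 2) draws
means <- replicate(2000, mean(rnorm(n, mean = 2, sd = 2)))

mean(means)  # close to 2
sd(means)    # close to 2 / sqrt(25) = 0.4
```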

# Distribution of sample mean

> curve( dnorm(x, mean=2,sd=2), -5, 9, col="blue",ylab="Density",ylim=c(0,1),lwd=2)
> curve( dnorm(x, mean=2,sd=2/5), -5, 9, add = TRUE, col="red",ylab="Density",lty=2,
+        lwd=2,ylim=c(0,1))

# Central limit theorem (CLT)

• General large sample behavior of sample mean
• Statement: Suppose $$\left\{X_1, X_2,\ldots, X_n\right\}$$ is a random sample of size $$n$$ from a distribution with mean $$\mu$$ and finite variance $$\sigma^2$$. Let $$S_n = \frac{1}{n}\sum_{i=1}^n X_i$$ be the sample mean. Then $\sqrt{n} \left(S_n - \mu\right) \stackrel{d}{\to} \mathsf{N}\left(0,\sigma^2\right)$

• Animation: painblogR
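The CLT can also be illustrated with a skewed parent distribution. The sketch below uses Exponential(rate = 1), which has mean 1 and variance 1, so the statistic sqrt(n)(S_n - 1) should look approximately N(0, 1):

```r
set.seed(123)
n <- 200
# 5000 replications of the CLT statistic sqrt(n) * (sample mean - mu)
zstat <- replicate(5000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))

mean(zstat)  # roughly 0
sd(zstat)    # roughly 1
qqnorm(zstat); qqline(zstat, col = 2)  # points should track the line
```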

# Central limit theorem: caution

• The CLT can fail when observations are dependent or not identically distributed; see the Lindeberg-Feller theorem for one generalization
• The Normal approximation given by the CLT may be very inaccurate when the sample size is not large enough; see the Berry-Esseen theorem for bounds on the approximation error
• CLT is a universality principle that depends only on the mean and variance of the generating distribution

# Check Normality

Given a random sample $$x_1,x_2,\ldots, x_n$$, how plausible is it that the sample comes from a $$\mathsf{N}(\mu,\sigma^2)$$ distribution?

# Check Normality: Example 1

> set.seed(123)
> xNonNormal = 0.8*rnorm(300,mean=0,sd=1) + 0.2*rt(300, df = 5)
> qqnorm(xNonNormal)
> qqline(xNonNormal, col = 2, lwd=2)

# Check Normality: Example 1

> hist(xNonNormal, main="Histogram of non-Normal sample", prob=TRUE)
> lines(density(xNonNormal))

# Check Normality: Example 1

> boxplot(xNonNormal)

# Check Normality: Example 2

> set.seed(123)
> xNormal = rnorm(1000,mean=0,sd=1)
> qqnorm(xNormal)
> qqline(xNormal, col = 2, lwd=2)

# Check Normality: Example 2

> hist(xNormal, main="Histogram of Normal sample", prob=TRUE)
> lines(density(xNormal))

# Check Normality: Example 2

> boxplot(xNormal)

# Check Normality: Example 3

> set.seed(123)
> xNormal = rnorm(100,mean=0,sd=1)
> ks.test(xNormal, "pnorm",0,1) #or ks.test(x, "pnorm", mean=0, sd=1)

One-sample Kolmogorov-Smirnov test

data:  xNormal
D = 0.093034, p-value = 0.3522
alternative hypothesis: two-sided
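Another common Normality check is the Shapiro-Wilk test via shapiro.test(), which, unlike the KS test above, does not require specifying the mean and sd in advance:

```r
set.seed(123)
x <- rnorm(100, mean = 0, sd = 1)

# Shapiro-Wilk test of the composite hypothesis "x is Normal"
shapiro.test(x)  # a large p-value is consistent with Normality
```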

# Check Normality: Example 4

>  tough = read.table("http://math.wsu.edu/faculty/xchen/stat412/data/stat412ASCII-tab/CH04/ex4-94.TXT", header=TRUE, sep="\t")
> tough1 = tough[,1]
> qqnorm(tough1)
> qqline(tough1, col = 2, lwd=2)

# Check Normality: Example 4

> sd(tough1)
[1] 0.1596723
>
> ks.test(tough1, "pnorm",mean(tough1),sd(tough1))

One-sample Kolmogorov-Smirnov test

data:  tough1
D = 0.12638, p-value = 0.9457
alternative hypothesis: two-sided

# Check Normality: caution

• When the sample size is very large, a Normality test can detect even tiny, practically unimportant departures from Normality, and sampling randomness alone still produces rejections at the significance level for truly Normal samples; a rejection should therefore be interpreted with care

# Week 1 assignments

• Please get familiar with R, RStudio and/or SAS, and study the code used in the source file of the lecture notes
• Please study the lecture notes and Chapters 1 to 2 of the textbook carefully, with the necessary programming practice, in order to fully understand the concepts and data processing techniques discussed so far

# Week 2 assignments

• Exercises 3.7, 3.16, 3.23, 3.35, 3.41
• Exercises 4.19(a), 4.65(d), 4.94(a); for 4.94(a), please give your conclusion on whether the data appear to follow a Normal distribution, for which you may use the KS test

# Extras

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] dplyr_0.4.2   ggplot2_2.1.0 knitr_1.17

loaded via a namespace (and not attached):
[1] Rcpp_0.12.0      magrittr_1.5     munsell_0.4.2
[4] colorspace_1.2-6 R6_2.1.0         stringr_1.2.0
[7] highr_0.5        plyr_1.8.3       tools_3.2.0
[10] revealjs_0.9     parallel_3.2.0   grid_3.2.0
[13] gtable_0.1.2     DBI_0.3.1        htmltools_0.3.5
[16] lazyeval_0.1.10  yaml_2.1.13      rprojroot_1.2
[19] digest_0.6.8     assertthat_0.1   reshape2_1.4.1
[22] evaluate_0.10.1  rmarkdown_1.6    labeling_0.3
[25] stringi_0.5-5    scales_0.4.0     backports_1.1.0