Stat 412 — Weeks 1 and 2

Xiongzhi Chen

Washington State University

Fall 2017

Overview

Statistics: learning from data

Statistics is the science of collecting, analyzing, and drawing conclusions from data; in short, it is the science of Learning from Data.

Learning from data: basic steps

  1. Formulate the problem
  2. Design studies/experiments to collect data
  3. Analyze data and draw conclusions

About the course

The course will cover elementary techniques that are needed for each step involved in Learning from Data:

  • Collecting data: surveys and experimental studies
  • Summarizing and visualizing data: mean, histogram, etc.
  • Analyzing data: inference on means and variances, regression models, analysis of variance, etc.

Course logistics

Programming languages

Collecting Data

General concepts

  • Population, sample, random sampling
  • Factors, control/treatment, response/covariates/explanatory variables
  • Observational study, experimental study

Two types of studies

  • Observational study: the researcher does not interfere with the information generating process

  • Experimental study: the researcher manipulates explanatory variables and records their effects on the response variables

Example 1

  • Problem: do students with higher ACT score have higher GPA?
  • Data collected from University X and University Y
> gpaAct <- read.table("D:/Teaching/stat412/data/CH03PR03.txt",sep = "", header=FALSE)
> colnames(gpaAct) = c("GPA","ACT","IQ", "Rank") # assign names to columns
> class(gpaAct)
[1] "data.frame"

Example 1 (cont’d)

Show data:

> gpaAct[1:10,]
     GPA ACT  IQ Rank
1  3.897  21 122   99
2  3.885  14 132   71
3  3.778  28 119   95
4  2.540  22  99   75
5  3.028  21 131   46
6  3.865  31 139   77
7  2.962  32 113   85
8  3.961  27 136   99
9  0.500  29  75   13
10 3.178  26 106   97

Example 1 (cont’d)

Eyeball check: is a higher ACT score associated with a higher GPA?

> plot(gpaAct$ACT, gpaAct$GPA, xlab="ACT score", ylab="GPA")

Example 1 (cont’d)

  • Type of study: observational
  • Source of data: survey
  • Population: college or university students
  • Sampled population: students in Universities X and Y
  • Sample: GPAs and ACT scores obtained

Example 1 (cont’d)

Issues with collecting data using surveys:

  • How to sample from population
  • Nonresponse from participants
  • Measurement problems

Example 2

Problem: how does cutting scheme affect the life of a machine tool?

  • 2 Cutting Speeds; 2 Tool Geometries; 2 Cutting Angles
  • Coding: each factor level is coded as -1 or 1
> cutLifetime <- read.csv("D:/Teaching/stat412/data/MontegomeryChp6Prb1.csv", header=TRUE,sep=",")
> class(cutLifetime)
[1] "data.frame"

Example 2 (cont’d)

Show data:

> cutLifetime[1:10,]
   Cutting.Speed Tool.Geometry Cutting.Angle Life.Hours
1             -1            -1            -1         22
2             -1            -1            -1         31
3             -1            -1            -1         25
4              1            -1            -1         32
5              1            -1            -1         43
6              1            -1            -1         29
7             -1             1            -1         35
8             -1             1            -1         34
9             -1             1            -1         50
10             1             1            -1         55

Example 2 (cont’d)

  • Type of study: experimental
  • Design: factorial, completely randomized
  • Factors: Cutting Speeds, Tool Geometries, Cutting Angles
  • Treatments: the 8 combinations for the factors
  • Replication: 3 replicates for each treatment

Example 2 (cont’d)

  • Experimental unit: machine tool
  • Explanatory variables: the three factors
  • Response variables: Life Hours

Example 2 (cont’d)

  • Key ingredients of experimental design: Randomization and Replication
  • Issues of experimental design: not always implementable, control of experimental errors

Example 2 via SAS codes

PROC IMPORT OUT= cuttingeg2 
            DATAFILE= "D:\Teaching\stat412\data\MontegomeryChp6Prb1.csv" 
            DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2; 
RUN;

proc PRINT data= cuttingeg2; 
run;

Getting started in R

Calculator

Operations on numbers: + - * / ^

\((6 + 3 \times 4) \div 2 - 9^3\)

> (6+3*4)/2- 9^3
[1] -720
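
One precedence detail worth remembering: exponentiation in R is right-associative, so chained ^ operators group from the right.

> 2^3^2   # parsed as 2^(3^2), not (2^3)^2
[1] 512
> (2^3)^2
[1] 64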

Atomic Classes

Some atomic classes (or modes) of objects in R:

  • character
  • logical
  • numeric (real number)
> x <- "stat412"  # character; x = "stat412" is equivalent
> x <- TRUE     # logical
> x <- 3.14159  # numeric

Note: Anything typed after the # sign is not evaluated. The # sign allows you to add comments to your code.

Evaluation

When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.

> x = 1
> x+2
[1] 3
> print(x)
[1] 1
> print(x+2)
[1] 3

Functions

There are many useful functions included in R. Here are some examples of built-in functions:

> x <- 2
> print(x)
[1] 2
> sqrt(x)
[1] 1.414214
> log(x)
[1] 0.6931472
> class(x)
[1] "numeric"
> is.vector(x)
[1] TRUE

Accessing Help in R

You can open the help file for any function by typing ? followed by the function's name. Here is an example:

> ?sqrt

There’s also a function help.search that can do general searches for help. You can learn about it by typing:

> ?help.search

It’s also useful to use Google: for example, “r help square root”. The R help files are also on the web.

Variable Names

In the previous examples, we used x as our variable name. Do not use the following variable names, as they have special meanings in R:

c, q, s, t, C, D, F, I, T

When combining two words for a given variable, we recommend one of these options:

> my_variable <- 1
> myVariable <- 1

Variable names such as my.variable are best avoided because "." plays a special role in R (for example, in function names such as is.vector and in method dispatch).

Vectors

The vector is the most basic object in R, and vectors can be created in a number of ways; you will need to create one for a homework problem.

> x <- c(1.2, 5, -10, 20, 5)
> x
[1]   1.2   5.0 -10.0  20.0   5.0
> length(x)
[1] 5
> z <- seq(from=0, to=100, by=10)
> z
 [1]   0  10  20  30  40  50  60  70  80  90 100
> length(z)
[1] 11

Vectors

  • A vector can only contain elements of a single class:
> x <- c(1.2, 5, -10, 20, 5)
> x
[1]   1.2   5.0 -10.0  20.0   5.0
> x[1] # the first entry of x
[1] 1.2
> x[c(2,5)] # the 2nd and 5th entries of x
[1] 5 5
> z <- c(x, TRUE, FALSE)
> z # TRUE/FALSE are coerced to 1/0, so z is all numeric
[1]   1.2   5.0 -10.0  20.0   5.0   1.0   0.0
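
Coercion follows a fixed hierarchy (logical, then numeric, then character): mixing classes silently converts every element to the most general class present. A small illustration:

> c(1, TRUE)  # logical is coerced to numeric
[1] 1 1
> c(1, "a")   # numeric is coerced to character
[1] "1" "a"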

Matrices

Like vectors, matrices are objects that can contain elements of only one class.

> m <- matrix(1:6, nrow=2, ncol=3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> m[1,2] # the (1,2) entry of m
[1] 3
> m[1,] # the 1st row of m
[1] 1 3 5
> m[,1] # the 1st column of m
[1] 1 2
> m[,c(1,3)] # the 1st and 3rd columns of m
     [,1] [,2]
[1,]    1    5
[2,]    2    6

Factors

In statistics, factors encode categorical data.

> paint <- factor(c("red", "white", "blue", "blue", "red", 
+                   "red"))
> paint
[1] red   white blue  blue  red   red  
Levels: blue red white
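
By default the levels are sorted alphabetically, regardless of the order in which the values appear; the levels function shows them:

> levels(paint)
[1] "blue"  "red"   "white"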

Data Frames

The data frame is one of the most important objects in R. Data sets very often come in tabular form with columns of different classes, and data frames are designed exactly for this.

Data Frames

> df <- data.frame(counting=1:3, char=c("a", "b", "c"), 
+                  logic=c(TRUE, FALSE, TRUE))
> df
  counting char logic
1        1    a  TRUE
2        2    b FALSE
3        3    c  TRUE
> 
> nrow(df)
[1] 3
> ncol(df)
[1] 3

Data Frames

> dim(df)
[1] 3 3
> 
> names(df)
[1] "counting" "char"     "logic"   
> attributes(df) # list the attributes (names, row.names, class) of an object
$names
[1] "counting" "char"     "logic"   

$row.names
[1] 1 2 3

$class
[1] "data.frame"

Names

Names can be assigned to columns and rows of vectors, matrices, and data frames.

> # load data from txt file online and assign data to variable gpaAct
> gpaAct <- read.table("http://math.wsu.edu/faculty/xchen/stat412/data/CH03PR03.txt",
+                      sep = "", header=FALSE)
> class(gpaAct) # by default gpaAct is a data.frame
[1] "data.frame"
> 
> gpaAct[1:3,] # show 1st 3 rows of data
     V1 V2  V3 V4
1 3.897 21 122 99
2 3.885 14 132 71
3 3.778 28 119 95

Names

The 1st column contains GPA, the 2nd ACT scores, the 3rd IQ, and the 4th Rank.

> # assign names to columns of gpaAct
> colnames(gpaAct) = c("GPA","ACT","IQ", "Rank")
> # access GPA by gpaAct$GPA and assign it to a new variable gpa
> gpa = gpaAct$GPA
> 
> gpa[1:5] # 1st 5 entries of gpa
[1] 3.897 3.885 3.778 2.540 3.028
> gpaAct[,1][1:5] # alternative to the above line
[1] 3.897 3.885 3.778 2.540 3.028

Install and load packages

To install a package from CRAN at https://cran.r-project.org/:

> install.packages("nameOfPackage")

To load a package so that its internal functions can be used:

> library("nameOfPackage")

Very useful packages: ggplot2, dplyr

Online R console

R Fiddle: http://www.r-fiddle.org

Some definitions and cautions

Measures of center

  • Sample Mean: the arithmetic average, i.e., \[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
  • Median: the value for which at least 50% of the measurements lie at or below it and at least 50% lie at or above it
  • Mode(s): the most frequent or probable measurement(s)
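
As a quick illustration, with a small vector whose values are easy to check by hand:

> x <- c(1, 2, 2, 3, 10)
> mean(x)    # (1+2+2+3+10)/5
[1] 3.6
> median(x)  # middle value of the sorted data
[1] 2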

Measures of variability

  • \(p\)th percentile: a value such that at most \(p\%\) of the ordered measurements lie strictly below it and at most \((100-p)\%\) lie strictly above it
  • Quartiles \(Q_1, Q_2, Q_3\): the 25th, 50th, and 75th percentiles
  • Sample variance: \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\)
  • Sample standard deviation: positive root of sample variance
  • Skewness: how asymmetric a distribution is
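
These measures are built in; for instance, with a vector chosen so the arithmetic is easy to verify:

> x <- c(1, 2, 2, 3, 10)
> var(x)  # sum of squared deviations from the mean 3.6, divided by n-1 = 4
[1] 13.3
> sd(x)   # square root of the sample variance
[1] 3.646917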

Measure of linear relationship

  • Pearson correlation \[r=\frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]
  • Examples in text: page 109
  • Wikipedia examples
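
The built-in cor function computes \(r\); an exactly linear increasing relationship gives \(r = 1\) (and \(r = -1\) for a decreasing one):

> x <- 1:5
> y <- 2*x + 1   # exact linear relationship
> cor(x, y)
[1] 1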

Caution on summary statistics

  • Summary statistics usually are not sufficient to determine the data generating process
  • Summary statistics may not contain accurate information on the data generating process
  • Evidence: very different datasets (e.g., the Datasaurus "T-rex" and donut scatterplots) can share nearly identical summary statistics

Describe data: Example 1

ACT score and GPA

Show data:

> gpaAct[1:10,]
     GPA ACT  IQ Rank
1  3.897  21 122   99
2  3.885  14 132   71
3  3.778  28 119   95
4  2.540  22  99   75
5  3.028  21 131   46
6  3.865  31 139   77
7  2.962  32 113   85
8  3.961  27 136   99
9  0.500  29  75   13
10 3.178  26 106   97

Histogram of GPA

Default binning: Sturges' rule

> hist(gpaAct$GPA, main = "Histogram of GPA", xlab="GPA")

Summary statistics on GPA

> summary(gpaAct$GPA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.500   2.689   3.078   3.074   3.593   4.000 
> var(gpaAct$GPA)
[1] 0.4151719
> sd(gpaAct$GPA)
[1] 0.6443383
> 
> Mode <- function(x) {
+   ux <- unique(x)
+   ux[which.max(tabulate(match(x, ux)))]
+ }
> 
> Mode(gpaAct$GPA)
[1] 3.885
> median(gpaAct$GPA)
[1] 3.0775

Finer histogram of GPA

Finer binning

> hist(gpaAct$GPA, breaks=20, main = "", xlab="GPA")
> abline(v = mean(gpaAct$GPA), col = "blue", lwd = 2)

Ordered GPAs with median/mode

> plot(sort(gpaAct$GPA), ylab="GPA")
> abline(h = median(gpaAct$GPA), col = "blue", lwd = 2)
> abline(h = Mode(gpaAct$GPA), col = "red", lwd = 2, lty=2)

Boxplot of GPA

Note: the GPA of 0.500 appears as an outlier

> boxplot(gpaAct$GPA)

Frequency table

Frequency table of ACT score

> ftable(gpaAct$ACT)
 14 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
                                                               
  1  1  3  7  3 10  9  4  5 12 10 10 10 10  7  7  4  4  1  1  1

Note: a pie chart should be based on the frequency table, not on the raw scores:

> pie(table(gpaAct$ACT))

Scatter plot: GPA and ACT score

> plot(gpaAct$ACT, gpaAct$GPA, xlab="ACT score", ylab="GPA")

Correlation

Correlation between GPA and ACT score

> cor(gpaAct$ACT, gpaAct$GPA, method="pearson")
[1] 0.2694818

Describe data: Example 2

Cutting scheme and life hours of machine tool

> cutLifetime[1:10,]
   Cutting.Speed Tool.Geometry Cutting.Angle Life.Hours
1             -1            -1            -1         22
2             -1            -1            -1         31
3             -1            -1            -1         25
4              1            -1            -1         32
5              1            -1            -1         43
6              1            -1            -1         29
7             -1             1            -1         35
8             -1             1            -1         34
9             -1             1            -1         50
10             1             1            -1         55

Create factors

  • Cutting.Speed: with 2 levels
  • Tool.Geometry: with 2 levels
  • Cutting.Angle: with 2 levels
> cutLifetime$Cutting.Speed = factor(cutLifetime$Cutting.Speed, labels = c("SpeedA","SpeedB"))
> cutLifetime$Tool.Geometry = factor(cutLifetime$Tool.Geometry, labels=c("GeometryA","GeometryB"))
> cutLifetime$Cutting.Angle = factor(cutLifetime$Cutting.Angle,labels=c("AngleA","AngleB"))

Sample means by treatments

> library(ggplot2)
> ggplot(cutLifetime, aes(x=Cutting.Speed,y=Life.Hours)) + stat_summary(fun.y="mean", geom="bar")+
+   facet_grid(Tool.Geometry~Cutting.Angle)+ylab("Mean of Life.Hours")+ theme_bw()

Sample std dev’s and means by treatments

> library(dplyr)
> cutLifetimeSummary = cutLifetime %>% group_by(Cutting.Speed,Tool.Geometry,Cutting.Angle) %>% 
+   summarize(stdDev = sd(Life.Hours), Means = mean(Life.Hours))
> cutLifetimeSummary
Source: local data frame [8 x 5]
Groups: Cutting.Speed, Tool.Geometry

  Cutting.Speed Tool.Geometry Cutting.Angle   stdDev    Means
1        SpeedA     GeometryA        AngleA 4.582576 26.00000
2        SpeedA     GeometryA        AngleB 3.785939 42.33333
3        SpeedA     GeometryB        AngleA 8.962886 39.66667
4        SpeedA     GeometryB        AngleB 5.033223 54.66667
5        SpeedB     GeometryA        AngleA 7.371115 34.66667
6        SpeedB     GeometryA        AngleB 2.081666 37.66667
7        SpeedB     GeometryB        AngleA 4.932883 49.33333
8        SpeedB     GeometryB        AngleB 4.163332 42.33333

Boxplots by treatments

> library(ggplot2)
> ggplot(data = cutLifetime) + geom_boxplot(aes(x=Cutting.Speed,y=Life.Hours))+
+   facet_grid(Tool.Geometry~Cutting.Angle) + theme_bw()

Histogram of life hours

Question: two modes?

> hist(cutLifetime$Life.Hours,breaks=25,main="",xlab="Life.Hours")

Random variables (rv) and Distributions

Two important definitions

  • Conditional probability: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\) (what if \(P(B)=0\)?)
  • Independence: events \(A\) and \(B\) are independent iff \(P(A \cap B) = P(A)P(B)\)
  • Question: is \(P(A|B) = P(A)\) equivalent to \(P(A \cap B) = P(A)P(B)\)?
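
Independence can be checked by enumeration. For example, when rolling two fair dice, the events "the first die shows 6" and "the sum is 7" are independent:

> d <- expand.grid(d1 = 1:6, d2 = 1:6)  # all 36 equally likely outcomes
> A <- d$d1 == 6
> B <- d$d1 + d$d2 == 7
> mean(A & B)        # P(A and B) = 1/36
[1] 0.02777778
> mean(A) * mean(B)  # P(A)P(B) = (1/6)(1/6)
[1] 0.02777778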

Discrete/continuous rv

  • Discrete rv: taking at most countably many values via \[P(X=a_i)=c_i\]
    • Number of heads in coin tosses (Binomial)
    • Number of customer visits to a service center in 10 minutes (Poisson)
  • Continuous rv: taking a continuum of values governed by a law \[F(x)=\int_{-\infty}^x f(t)\,dt\] with a density \(f\)
    • measuring temperature; measuring profit

The Normal distribution

> curve(dnorm(x,0,1), xlim=c(-3,3), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x))) 

The Normal distribution

  • Symmetric around its mean
  • Bell shaped
  • Most of its probability mass is close to the mean (about 68% within 1, 95% within 2, and 99.7% within 3 standard deviations)

Histogram: standard Normal

> set.seed(123)
> data = rnorm(5000,mean=0,sd=1)
> hist(data, prob=TRUE) # histogram on the density (probability) scale
> lines(density(data))  # fit a density for data

The Normal distribution

  • Notation: \(Y\) follows a Normal distribution with mean \(\mu\) and variance \(\sigma^2\), i.e., \(Y \sim \mathsf{N}(\mu,\sigma^2)\)
  • Density: \(f(y) = \frac{1}{\sqrt{2 \pi}\,\sigma} \exp{\left[-\frac{(y -\mu)^2}{2 \sigma^2}\right]}\)
  • If \(Y\) follows \(\mathsf{N}(\mu,\sigma^2)\), then \(Z = \frac{Y -\mu}{\sigma}\) follows \(\mathsf{N}(0,1)\)
  • The Normal density is very, very special
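
Standardization means any Normal probability can be computed from the standard Normal; in R, pnorm with mean and sd arguments agrees with pnorm applied to the z-score:

> pnorm(3, mean = 2, sd = 2)  # P(Y <= 3) for Y ~ N(2, 4)
[1] 0.6914625
> pnorm((3 - 2)/2)            # same probability via the z-score 0.5
[1] 0.6914625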

Standard Normal curve areas

  • Table 1 in textbook
  • \(z = \frac{y-\mu}{\sigma}\) is referred to as \(z\)-score when \(Y=y\) and \(Y \sim \mathsf{N}(\mu,\sigma^2)\)

Lower tail probability

\(P( X \le 1)\) when \(X \sim \mathsf{N}(0,1)\)

Lower tail probability (cont’d)

> cord.x <- c(-5,seq(-5,1,0.01),1); cord.y <- c(0,dnorm(seq(-5,1,0.01)),0) 
> curve(dnorm(x,0,1), xlim=c(-5,5), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x))) 
> polygon(cord.x,cord.y,col='skyblue') # Add the shaded area.

Lower tail probability (cont’d)

If \(X \sim \mathsf{N}(0,1)\), then \(P( X \le 1)\)

> pnorm(1, mean = 0, sd = 1)
[1] 0.8413447
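
The inverse of pnorm is qnorm, which returns the quantile for a given lower-tail probability:

> qnorm(0.975)  # the 97.5th percentile of N(0,1)
[1] 1.959964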

Upper tail probability

\(P( X \ge 1)\) when \(X \sim \mathsf{N}(0,1)\)

Upper tail probability

> cord.x <- c(1,seq(1,5,0.01),5); cord.y <- c(0,dnorm(seq(1,5,0.01)),0) 
> curve(dnorm(x,0,1), xlim=c(-5,5), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x))) 
> polygon(cord.x,cord.y,col='skyblue') # Add the shaded area.

Upper tail probability (cont’d)

If \(X \sim \mathsf{N}(0,1)\), then \(P( X \ge 1) = 1 - P( X \le 1)\)

> 1 - pnorm(1, mean = 0, sd = 1)
[1] 0.1586553

Interval probability

\(P( -2 \le X \le -1)\) when \(X \sim \mathsf{N}(0,1)\)

Interval probability (cont’d)

> cord.x <- c(-2,seq(-2,-1,0.01),-1); cord.y <- c(0,dnorm(seq(-2,-1,0.01)),0) 
> curve(dnorm(x,0,1), xlim=c(-3.5,3.5), main='Standard Normal density',
+       xlab = expression(x),ylab=expression(f(x))) 
> polygon(cord.x,cord.y,col='skyblue')

Interval probability (cont’d)

  • \(X \sim \mathsf{N}(0,1)\)
  • \(P( -2 \le X \le -1) = P(X \le -1) - P(X \le -2)\)
> p1 = pnorm(-1, mean = 0, sd = 1)
> p2 = pnorm(-2, mean = 0, sd = 1)
> p1- p2
[1] 0.1359051

Sampling distribution and CLT

Random samples

  • Random sampling (simple random sampling): every possible sample of a fixed size has the same probability of being selected
  • Sampling distribution: bridge between sample and population

Histogram of random sample

> set.seed(123)
> xNormal = rnorm(1000,mean=0,sd=1)
> hist(xNormal, prob=TRUE)
> lines(density(xNormal))  # fit a density for data
> abline(v=0,col="blue")

Histogram of random sample

> set.seed(123)
> xNormal = rnorm(2000,mean=2,sd=2)
> hist(xNormal, prob=TRUE)
> lines(density(xNormal))
> abline(v=2,col="blue")

Distribution of sample mean

Assume \(x_1,x_2,\ldots, x_n\) are i.i.d. \(\mathsf{N}(\mu,\sigma^2)\) and set \[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]

  • \(\bar{x} \sim \mathsf{N}(\mu,\sigma^2/n)\), i.e., \(\bar{x}\) follows a Normal distribution with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\)
  • \(\bar{x}\) has variance \(\sigma^2/n\), smaller than the variance \(\sigma^2\) of a single observation, so its distribution is more concentrated around \(\mu\)
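
A short simulation can illustrate this concentration: the standard deviation of many simulated sample means should be close to \(\sigma/\sqrt{n}\), here 2/sqrt(25) = 0.4:

> set.seed(123)
> xbar <- replicate(10000, mean(rnorm(25, mean = 2, sd = 2)))
> sd(xbar)  # should be close to 0.4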

Distribution of sample mean

> curve( dnorm(x, mean=2,sd=2), -5, 9, col="blue",ylab="Density",ylim=c(0,1),lwd=2)
> curve( dnorm(x, mean=2,sd=2/5), -5, 9, add = TRUE, col="red",ylab="Density",lty=2,
+        lwd=2,ylim=c(0,1))

Central limit theorem (CLT)

  • General large sample behavior of sample mean
  • Statement: Suppose \(\left\{X_1, X_2,\ldots, X_n\right\}\) is a random sample of size \(n\) from a distribution with mean \(\mu\) and finite variance \(\sigma^2\). Let \(S_n = \frac{1}{n}\sum_{i=1}^n X_i\) be the sample mean. Then \[\sqrt{n} \left(S_n - \mu\right) \stackrel{d}{\to} \mathsf{N}\left(0,\sigma^2\right)\]

  • Animation: painblogR
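
A simulation sketch of the CLT with clearly non-Normal data: exponential observations with mean 1 and variance 1, whose scaled sample means nevertheless look Normal:

> set.seed(123)
> m <- replicate(5000, mean(rexp(40, rate = 1)))
> hist(sqrt(40)*(m - 1), prob = TRUE, main = "CLT for exponential data", xlab = "")
> curve(dnorm(x, 0, 1), add = TRUE, col = "red", lwd = 2)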

Central limit theorem: caution

  • The CLT can fail when observations are dependent; see the Lindeberg-Feller theorem
  • The Normal approximation provided by the CLT may be very inaccurate when the sample size is not large enough; see the Berry-Esseen theorem
  • CLT is a universality principle that depends only on the mean and variance of the generating distribution

Check Normality

Given a random sample \(x_1,x_2,\ldots, x_n\), how likely is it from \(\mathsf{N}(\mu,\sigma^2)\)?

Check Normality: Example 1

> set.seed(123)
> xNonNormal = 0.8*rnorm(300,mean=0,sd=1) + 0.2*rt(300, df = 5)
> qqnorm(xNonNormal)
> qqline(xNonNormal, col = 2, lwd=2)

Check Normality: Example 1

> hist(xNonNormal, main="Histogram of non-Normal sample", prob=TRUE)
> lines(density(xNonNormal))

Check Normality: Example 1

> boxplot(xNonNormal)

Check Normality: Example 2

> set.seed(123)
> xNormal = rnorm(1000,mean=0,sd=1)
> qqnorm(xNormal)
> qqline(xNormal, col = 2, lwd=2)

Check Normality: Example 2

> hist(xNormal, main="Histogram of Normal sample", prob=TRUE)
> lines(density(xNormal))

Check Normality: Example 2

> boxplot(xNormal)

Check Normality: Example 3

> set.seed(123)
> xNormal = rnorm(100,mean=0,sd=1)
> ks.test(xNormal, "pnorm",0,1) #or ks.test(x, "pnorm", mean=0, sd=1) 

    One-sample Kolmogorov-Smirnov test

data:  xNormal
D = 0.093034, p-value = 0.3522
alternative hypothesis: two-sided

Check Normality: Example 4

>  tough = read.table("http://math.wsu.edu/faculty/xchen/stat412/data/stat412ASCII-tab/CH04/ex4-94.TXT", header=TRUE, sep="\t")
> tough1 = tough[,1]
> qqnorm(tough1)
> qqline(tough1, col = 2, lwd=2)

Check Normality: Example 4

> sd(tough1)
[1] 0.1596723
> 
> ks.test(tough1, "pnorm",mean(tough1),sd(tough1))

    One-sample Kolmogorov-Smirnov test

data:  tough1
D = 0.12638, p-value = 0.9457
alternative hypothesis: two-sided

Check Normality: caution

  • When the sample size is very large, a Normality test can detect even tiny departures from Normality and will often reject; moreover, even for an exactly Normal sample, a level-\(\alpha\) test rejects with probability \(\alpha\)

Homework assignments

Simple rules

  • If you use software to obtain your answers, please attach the code
  • Please place the code close to its associated answers

Week 1 assignments

  • Please get familiar with R, RStudio, and/or SAS, and study the code used in the source file of the lecture notes
  • Please study the lecture notes and Chapters 1 and 2 of the textbook carefully, with the needed programming practice, in order to fully understand the concepts and data processing techniques discussed so far

Week 2 assignments

  • Exercises 3.7, 3.16, 3.23, 3.35, 3.41
  • Exercises 4.19(a), 4.65(d), 4.94(a); for 4.94(a), please state your conclusion on whether the data appear to follow a Normal distribution, for which you may use the KS test

Extras

License and session Information

License

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dplyr_0.4.2   ggplot2_2.1.0 knitr_1.17   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0      magrittr_1.5     munsell_0.4.2   
 [4] colorspace_1.2-6 R6_2.1.0         stringr_1.2.0   
 [7] highr_0.5        plyr_1.8.3       tools_3.2.0     
[10] revealjs_0.9     parallel_3.2.0   grid_3.2.0      
[13] gtable_0.1.2     DBI_0.3.1        htmltools_0.3.5 
[16] lazyeval_0.1.10  yaml_2.1.13      rprojroot_1.2   
[19] digest_0.6.8     assertthat_0.1   reshape2_1.4.1  
[22] evaluate_0.10.1  rmarkdown_1.6    labeling_0.3    
[25] stringi_0.5-5    scales_0.4.0     backports_1.1.0 

Acknowledgements