R day 7

1/19/2011

Linear models: t-test and ANOVA

Simple linear models
predictions about the value of a single quantity
with one set of variables predicting another variable

factor() 因子オブジェクトを作る
contrasts() ある因子に関連するコントラストを設定する、見る
data.frame() データフレームを作る

names() 変数名の表示を変える。

names(d) <- c("Response", "Condition")

lm() 単回帰
lm(目的変数~説明変数)
m <- lm(Response ~ Condition, data=d)
summary(m)
> m <- lm(Response ~ Condition, data=d)
>
> m

Call:
lm(formula = Response ~ Condition, data = d)

Coefficients:
(Intercept)   ConditionT
          3            3

> summary(m)

Call:
lm(formula = Response ~ Condition, data = d)

Residuals:
   Min     1Q Median     3Q    Max
    -2     -1      0      1      2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0000     0.7071   4.243 0.00283 **
ConditionT    3.0000     1.0000   3.000 0.01707 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.581 on 8 degrees of freedom
Multiple R-squared: 0.5294,    Adjusted R-squared: 0.4706
F-statistic:     9 on 1 and 8 DF, p-value: 0.01707

The t statistic
t distribution is symmetric, like a Gaussian, and has a similar shape, but it has heavier tails (and it is mathematically very different) 正規分布する母集団から標本抽出したときの分布がｔ分布
It has a single parameter (a degrees of freedom)
it approaches a Gaussian as the degrees of freedom go to infinity
William Sealy Gosset

t.test()

t.test(Response ~ Condition, data=d, var.equal=T)

sleepというデータセットが最初から入っている。

plot(extra ~ group, data=sleep)

points(extra ~ as.numeric(group), pch=as.character(ID), data=sleep)

t.test(extra ~ group, data=sleep, var.equal=T)

paired t -test
t.test(extra ~ group, data=sleep, paired=T)

ex1 <- sleep[sleep$group=="1","extra"]
ex2 <- sleep[sleep$group=="2","extra"]

exdiff <- ex1 - ex2
vectorのそれぞれ対応しているものどうしでの引き算となる
> ex1
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
> ex2
[1] 1.9 0.8 1.1 0.1 -0.1 4.4 5.5 1.6 4.6 3.4
> exdiff <- ex1 - ex2
> exdiff
[1] -1.2 -2.4 -1.3 -1.3 0.0 -1.0 -1.8 -0.8 -4.6 -1.4

library(languageR)
この中に、englishというサンプルデータがある。反応時間データ。
youngとoldで反応時間が違うかどうかを調べるには、、、
> head(english)
RTlexdec RTnaming Familiarity   Word AgeSubject WordCategory WrittenFrequency WrittenSpokenFrequencyRatio FamilySize DerivationalEntropy
1 6.543754 6.145044        2.37    doe      young            N         3.912023                   1.0216512   1.386294             0.14144

RTlexdec をAgeSubjectごとにグループを分ける。グループ間での差の検定。
> t.test(RTlexdec ~ AgeSubject, data=english)

    Welch Two Sample t-test

data: RTlexdec by AgeSubject
t = 67.4682, df = 4534.555, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
           6.660958            6.439237

> t.test(RTlexdec ~ AgeSubject, data=english, var.equal=T)

    Two Sample t-test

data: RTlexdec by AgeSubject
t = 67.4682, df = 4566, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
           6.660958            6.439237

> summary(lm(RTlexdec ~ AgeSubject, data=english))

Call:
lm(formula = RTlexdec ~ AgeSubject, data = english)

Residuals:
     Min       1Q   Median       3Q      Max
-0.25776 -0.08339 -0.01669 0.06921 0.52685

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      6.660958   0.002324 2866.44   <2e-16 ***
AgeSubjectyoung -0.221721   0.003286 -67.47   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared: 0.4992,    Adjusted R-squared: 0.4991
F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16

Multiple comparison problem
multiple comparisons
Bonferroni correction

Mendelian genetics メンデル遺伝学

The Analysis of Variance
Are all the locations the same?
the warpbreaks data frame that comes with R

if there is a difference in the means, where should the variance come from—SSwithin or SSbetween ?
We can use our answer to set up the following ratio (almost):
SSbetween
SSwithin

this seems less and less likely is the ratio gets bigger.

an F distribution

the output of lm()
summary(m)

the ANOVA F statistic is the square of the relevant t statistic.

the warpbreaks data
In weaving, you have two kinds of yarn, the warp and the weft; the warp is the yarn that’s held taut on the loom

warpbreaks
# There are three levels of ’tension’: L (low)
# M (medium) and H (high)

m <- lm(breaks ~ tension, data=warpbreaks)

> m

Call:
lm(formula = breaks ~ tension, data = warpbreaks)

Coefficients:
(Intercept)     tensionM     tensionH
      36.39       -10.00       -14.72

> summary(m)

Call:
lm(formula = breaks ~ tension, data = warpbreaks)

Residuals:
    Min      1Q Median      3Q     Max
-22.389 -8.139 -2.667   6.333 33.611

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    36.39       2.80 12.995 < 2e-16 ***
tensionM      -10.00       3.96 -2.525 0.014717 *
tensionH      -14.72       3.96 -3.718 0.000501 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203,    Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753

aov(m, contrasts=list(tension="contr.helmert"))

Comments are closed.

R day 7

Author: sugiura

Archives

Categories