Linear models: t-test and ANOVA
Simple linear models
predictions about the value of a single quantity
with one set of variables predicting another variable
factor() 因子オブジェクトを作る
contrasts() ある因子に関連するコントラストを設定する、見る
data.frame() データフレームを作る
names() 変数名の表示を変える。
names(d) <- c("Response", "Condition")
lm() 単回帰
lm(目的変数~説明変数)
m <- lm(Response ~ Condition, data=d)
summary(m)
> m <- lm(Response ~ Condition, data=d)
>
> m
Call:
lm(formula = Response ~ Condition, data = d)
Coefficients:
(Intercept) ConditionT
3 3
> summary(m)
Call:
lm(formula = Response ~ Condition, data = d)
Residuals:
Min 1Q Median 3Q Max
-2 -1 0 1 2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0000 0.7071 4.243 0.00283 **
ConditionT 3.0000 1.0000 3.000 0.01707 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.581 on 8 degrees of freedom
Multiple R-squared: 0.5294, Adjusted R-squared: 0.4706
F-statistic: 9 on 1 and 8 DF, p-value: 0.01707
The t statistic
t distribution is symmetric, like a Gaussian, and has a similar shape, but it has heavier tails (and it is mathematically very different) 正規分布する母集団から標本抽出したときの分布がt分布
It has a single parameter (a degrees of freedom)
it approaches a Gaussian as the degrees of freedom go to infinity
William Sealy Gosset
t.test()
t.test(Response ~ Condition, data=d, var.equal=T)
sleepというデータセットが最初から入っている。
plot(extra ~ group, data=sleep)
points(extra ~ as.numeric(group), pch=as.character(ID), data=sleep)
t.test(extra ~ group, data=sleep, var.equal=T)
paired t -test
t.test(extra ~ group, data=sleep, paired=T)
ex1 <- sleep[sleep$group=="1","extra"]
ex2 <- sleep[sleep$group=="2","extra"]
exdiff <- ex1 - ex2
vectorのそれぞれ対応しているものどうしでの引き算となる
> ex1
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
> ex2
[1] 1.9 0.8 1.1 0.1 -0.1 4.4 5.5 1.6 4.6 3.4
> exdiff <- ex1 - ex2
> exdiff
[1] -1.2 -2.4 -1.3 -1.3 0.0 -1.0 -1.8 -0.8 -4.6 -1.4
library(languageR)
この中に、englishというサンプルデータがある。反応時間データ。
youngとoldで反応時間が違うかどうかを調べるには、、、
> head(english)
RTlexdec RTnaming Familiarity Word AgeSubject WordCategory WrittenFrequency WrittenSpokenFrequencyRatio FamilySize DerivationalEntropy
1 6.543754 6.145044 2.37 doe young N 3.912023 1.0216512 1.386294 0.14144
RTlexdec をAgeSubjectごとにグループを分ける。グループ間での差の検定。
> t.test(RTlexdec ~ AgeSubject, data=english)
Welch Two Sample t-test
data: RTlexdec by AgeSubject
t = 67.4682, df = 4534.555, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
6.660958 6.439237
> t.test(RTlexdec ~ AgeSubject, data=english, var.equal=T)
Two Sample t-test
data: RTlexdec by AgeSubject
t = 67.4682, df = 4566, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
6.660958 6.439237
> summary(lm(RTlexdec ~ AgeSubject, data=english))
Call:
lm(formula = RTlexdec ~ AgeSubject, data = english)
Residuals:
Min 1Q Median 3Q Max
-0.25776 -0.08339 -0.01669 0.06921 0.52685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.660958 0.002324 2866.44 <2e-16 ***
AgeSubjectyoung -0.221721 0.003286 -67.47 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
Multiple comparison problem
multiple comparisons
Bonferroni correction
Mendelian genetics メンデル遺伝学
The Analysis of Variance
Are all the locations the same?
the warpbreaks data frame that comes with R
if there is a difference in the means, where should the variance come from—SSwithin or SSbetween ?
We can use our answer to set up the following ratio (almost):
SSbetween
SSwithin
this seems less and less likely is the ratio gets bigger.
an F distribution
the output of lm()
summary(m)
the ANOVA F statistic is the square of the relevant t statistic.
the warpbreaks data
In weaving, you have two kinds of yarn, the warp and the weft; the warp is the yarn that’s held taut on the loom
warpbreaks
# There are three levels of ’tension’: L (low)
# M (medium) and H (high)
m <- lm(breaks ~ tension, data=warpbreaks)
> m
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Coefficients:
(Intercept) tensionM tensionH
36.39 -10.00 -14.72
> summary(m)
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-22.389 -8.139 -2.667 6.333 33.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.39 2.80 12.995 < 2e-16 ***
tensionM -10.00 3.96 -2.525 0.014717 *
tensionH -14.72 3.96 -3.718 0.000501 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203, Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753
aov(m, contrasts=list(tension="contr.helmert"))
Simple linear models
predictions about the value of a single quantity
with one set of variables predicting another variable
factor() 因子オブジェクトを作る
contrasts() ある因子に関連するコントラストを設定する、見る
data.frame() データフレームを作る
names() 変数名の表示を変える。
names(d) <- c("Response", "Condition")
lm() 単回帰
lm(目的変数~説明変数)
m <- lm(Response ~ Condition, data=d)
summary(m)
> m <- lm(Response ~ Condition, data=d)
>
> m
Call:
lm(formula = Response ~ Condition, data = d)
Coefficients:
(Intercept) ConditionT
3 3
> summary(m)
Call:
lm(formula = Response ~ Condition, data = d)
Residuals:
Min 1Q Median 3Q Max
-2 -1 0 1 2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0000 0.7071 4.243 0.00283 **
ConditionT 3.0000 1.0000 3.000 0.01707 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.581 on 8 degrees of freedom
Multiple R-squared: 0.5294, Adjusted R-squared: 0.4706
F-statistic: 9 on 1 and 8 DF, p-value: 0.01707
The t statistic
t distribution is symmetric, like a Gaussian, and has a similar shape, but it has heavier tails (and it is mathematically very different) 正規分布する母集団から標本抽出したときの分布がt分布
It has a single parameter (a degrees of freedom)
it approaches a Gaussian as the degrees of freedom go to infinity
William Sealy Gosset
t.test()
t.test(Response ~ Condition, data=d, var.equal=T)
sleepというデータセットが最初から入っている。
plot(extra ~ group, data=sleep)
points(extra ~ as.numeric(group), pch=as.character(ID), data=sleep)
t.test(extra ~ group, data=sleep, var.equal=T)
paired t -test
t.test(extra ~ group, data=sleep, paired=T)
ex1 <- sleep[sleep$group=="1","extra"]
ex2 <- sleep[sleep$group=="2","extra"]
exdiff <- ex1 - ex2
vectorのそれぞれ対応しているものどうしでの引き算となる
> ex1
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
> ex2
[1] 1.9 0.8 1.1 0.1 -0.1 4.4 5.5 1.6 4.6 3.4
> exdiff <- ex1 - ex2
> exdiff
[1] -1.2 -2.4 -1.3 -1.3 0.0 -1.0 -1.8 -0.8 -4.6 -1.4
library(languageR)
この中に、englishというサンプルデータがある。反応時間データ。
youngとoldで反応時間が違うかどうかを調べるには、、、
> head(english)
RTlexdec RTnaming Familiarity Word AgeSubject WordCategory WrittenFrequency WrittenSpokenFrequencyRatio FamilySize DerivationalEntropy
1 6.543754 6.145044 2.37 doe young N 3.912023 1.0216512 1.386294 0.14144
RTlexdec をAgeSubjectごとにグループを分ける。グループ間での差の検定。
> t.test(RTlexdec ~ AgeSubject, data=english)
Welch Two Sample t-test
data: RTlexdec by AgeSubject
t = 67.4682, df = 4534.555, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
6.660958 6.439237
> t.test(RTlexdec ~ AgeSubject, data=english, var.equal=T)
Two Sample t-test
data: RTlexdec by AgeSubject
t = 67.4682, df = 4566, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
6.660958 6.439237
> summary(lm(RTlexdec ~ AgeSubject, data=english))
Call:
lm(formula = RTlexdec ~ AgeSubject, data = english)
Residuals:
Min 1Q Median 3Q Max
-0.25776 -0.08339 -0.01669 0.06921 0.52685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.660958 0.002324 2866.44 <2e-16 ***
AgeSubjectyoung -0.221721 0.003286 -67.47 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
Multiple comparison problem
multiple comparisons
Bonferroni correction
Mendelian genetics メンデル遺伝学
The Analysis of Variance
Are all the locations the same?
the warpbreaks data frame that comes with R
if there is a difference in the means, where should the variance come from—SSwithin or SSbetween ?
We can use our answer to set up the following ratio (almost):
SSbetween
SSwithin
this seems less and less likely is the ratio gets bigger.
an F distribution
the output of lm()
summary(m)
the ANOVA F statistic is the square of the relevant t statistic.
the warpbreaks data
In weaving, you have two kinds of yarn, the warp and the weft; the warp is the yarn that’s held taut on the loom
warpbreaks
# There are three levels of ’tension’: L (low)
# M (medium) and H (high)
m <- lm(breaks ~ tension, data=warpbreaks)
> m
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Coefficients:
(Intercept) tensionM tensionH
36.39 -10.00 -14.72
> summary(m)
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-22.389 -8.139 -2.667 6.333 33.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.39 2.80 12.995 < 2e-16 ***
tensionM -10.00 3.96 -2.525 0.014717 *
tensionH -14.72 3.96 -3.718 0.000501 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203, Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753
aov(m, contrasts=list(tension="contr.helmert"))