ブログのアーカイブ

日本政府は大丈夫なのか？

1/31/2011

アメリカ人の知人が次のようなことを言っていた。
「この前初めて知ったけど、日本政府の財政赤字は先進国の中で最悪だってね。
国民はたくさん貯金してて、経済はまあまあで、円は強くて円高で、それで政府が大赤字ってのは、いったいどういうこと？なんか変じゃない？」
　
最近日本のニュースは見たり読んだり聞いたりしていませんが、「変だ」ってのは何とも説明つきませんよね。「変だ」としか言いようがありません。これで大丈夫だと思ってる（ふりしてる）ことが変ですよね。それとも変だから大丈夫と思ってる（ふりしてる）のかも。
　
やっぱりこれは、変だというべきです。「Speak up!」

"I will be on campus next Tuesday, for sure."

1/30/2011

来週火曜日は大学へ行きます。

ガソリン

1/30/2011

ガソリンタンクが四分の一を切ったので給油。
12.6ガロン　走ったのは271.1マイル　支払いは38.9ドル。1ガロン3.08ドル。マイレージは 21.5 ml/g。
換算すると、
47.9リットル　433.8キロ　3234円（83円／ドル）。1リッター67.5円。燃費は 9.1 km/l。

Power outage

1/27/2011

0℃前後で雷を伴って一晩降った雪のせいで、電線に雪が凍り付いてあちこちで切れて、停電。
昨晩8時ころから今日の夕方4時ころまで電気なし。
お湯を沸かすのも台所も全て電気。暖房はガスだけれど暖房装置を動かすのに電気が必要。
ケーブルを使っているテレビ・電話・インターネットもだめ。
ガレージのドアも電動なので、開かない。車も出せない。雪かきのシャベルも出せない。
小学校も大学も休校。
突然、時間だけができた。寝るしかないので、10時間くらい寝た。
さいわい、暖房が切れても家自体が暖まっていたので20℃が14℃に下がっただけで助かった。

Water main break

1/26/2011

水道管が破裂して、というのがやけに多い。
http://www.weather.com/outlook/videos/huge-water-main-break-1-million-damage-19471?from=video_email

破裂の状況はこちら：
http://www01.wsscwater.com/serviceadvisories/viewer.htm

Quite literally.

1/25/2011

Yes. のバリエーション。Exactly.というのが流行っていたが、最近ちょくちょく聞くようになった。「まったくそのとおりです」

-9.8℃

1/24/2011

室温は20℃に「Hold」。なので、温度差約30℃。家中がこの温度。国中がこの調子。
エネルギーを（必要以上に）大量消費しているといわれても否定できないだろう。
しかし、そうするだけのお金を払えるんだから文句ないだろうという主張。
しかし、エネルギー消費に規制をかけようということには反対。
経済が悪くなるから、生活の質が悪くなるから、という主張。
こういう環境に育っちゃうとね、これが「普通」って思っちゃうんだよね。
で、国自体が広いから、ほとんどの人が他の国のことを知る必要も機会もない。　
　
いくら室温が20℃でも、コタツと肩まで入れる

R day 9

1/21/2011

two-sample t test

> t.test (bp.change[treat==1], bp.change[treat==0], var.equal=FALSE)
Welch Two Sample t-test

> summary(lm(bp.change ~ treat + bp.start + sex + age + BMI))

単回帰モデルのあてはめ
lm(y ~ x)

重回帰分析
lm(y ~ x1 + x2)

交互作用を含めた重回帰分析
lm(y ~ x1 * x2)

lm() の返り値はobj
plot(obj)

回帰式の表示
formula(obj)

summary(obj)

分散分析表
anova(obj)

R のキーワード検索関数 help.search()

Random and mixed effects linear models

a word priming experiment

> head(primingHeid)
Subject         Word Trial       RT Condition Rating Frequency BaseFrequency
1     pp1   basaalheid    81 6.693324      heid   2.14         0      3.555348
2     pp1 markantheid    83 6.809039 baseheid   2.73         0      5.164786
3     pp1 ontroerdheid    89 6.508769 baseheid   3.32         0      5.549076
4     pp1 contentheid    95 6.576470      heid   2.50         0      4.499810
5     pp1    riantheid    96 6.857514      heid   2.36         0      4.532599
6     pp1 tembaarheid   111 6.349139 baseheid   3.73         0      0.000000
LengthInLetters FamilySize NumberOfSynsets ResponseToPrime RTtoPrime
1              10   0.000000               0         correct 6.884487
2              11   0.000000               0         correct 6.558198
3              12   0.000000               1         correct 6.651572
4              11   0.000000               1         correct 6.960348
5               9   0.000000               0         correct 6.826545
6              11   1.026672               0       incorrect 6.824374
>

multistratum ANOVA

a fixed effect　固定効果
a random effect　変量効果
assume that the by-subject intercepts (or slopes) are random variables from some (say) Normal distribution

a mixed effects model　混合効果モデル
説明変数に固定効果 (fixed effects) とランダム効果 (random effects) がまざっている

matrixとlistの違い
listはデータフレーム、行と列に見出し（ラベル）がついている。

一般化混合線形モデル(generalized linear mixed model: GLMM)

http://homepage2.nifty.com/fauves/education/GLMM.htm

一般線形モデル　General Linear Model
分散分析ANOVA
回帰Regression
共分散分析ANCOVA
基づいている原理は同じ(Grafen & Hails 2002)
独立変数が連続量かカテゴリカルかの違い。

GLMより一般化されたもの（正規分布でないような広範囲のデータを扱うことができる）
一般化線形モデル　Generalized Linear Model
ロジスティック回帰も対数線形も

ランダム要因random effect（変量要因）：考慮には入れたくない要因（独立変数）
固定要因Fixed effect：考慮したい要因

擬似反復pseudo-replicationを避けることができる。

一般化混合線形モデル(generalized linear mixed model: GLMM)

「モデル」：どの変数の組み合わせで調べている現象を説明するか。
モデルがどのくらいデータをよく説明しているかの基準
赤池情報基準量AIC
尤度を測るREML(restricted maximum likelihood estimation)
AIC を最小にするシンプルなモデルを選ぶ

lmer

Grafen & Hails, Modern Statistics for the life sciences (Oxford Univ Press)GLM の入門書
一般線形モデルによる生物科学のための現代統計学

R day 8

1/20/2011

t -tests

one-way ANOVAs

Mutiple categorical predictors

within groups and between groups

compare SSbetween and SSwithin using our F statistic

anova(m)
summary(aov(m))

the highest-level interactions

> m <- lm(breaks ~ wool+tension+wool:tension, contrasts=list(wool="contr.helmert", tension="contr.helmert"), data=warpbreaks)
> anova(m)
Analysis of Variance Table

Response: breaks
             Df Sum Sq Mean Sq F value    Pr(>F)
wool          1 450.7 450.67 3.7653 0.0582130 .
tension       2 2034.3 1017.13 8.4980 0.0006926 ***
wool:tension 2 1002.8 501.39 4.1891 0.0210442 *
Residuals    48 5745.1 119.69
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

with(warpbreaks, interaction.plot(wool, tension, breaks))
with is a generic function that evaluates expr in a local environment constructed from data.

interaction.plot()

the durationsOnt data set from the languageR package

> head(CO2)
Grouped Data: uptake ~ conc | Plant
<environment: R_EmptyEnv>
Plant   Type Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled 175   30.4
3   Qn1 Quebec nonchilled 250   34.8
4   Qn1 Quebec nonchilled 350   37.2
5   Qn1 Quebec nonchilled 500   35.3
6   Qn1 Quebec nonchilled 675   39.2

> with(CO2, interaction.plot(Treatment, Type, uptake))

library(languageR)

englishy <- english[english$AgeSubject=="young",]
被験者「young」だけの抽出。

plot(RTlexdec ~ WrittenFrequency, englishy)
「young」の被験者データについて、反応時間と頻度の対応関係をプロット。

右下がりの傾向＝＞頻度の高い単語は反応が速い。

> m <- lm(RTlexdec ~ WrittenFrequency, englishy)
>
> m

Call:
lm(formula = RTlexdec ~ WrittenFrequency, data = englishy)

Coefficients:
     (Intercept) WrittenFrequency
         6.62556          -0.03711

> summary(m)

Call:
lm(formula = RTlexdec ~ WrittenFrequency, data = englishy)

Residuals:
     Min       1Q   Median       3Q      Max
-0.34664 -0.05523 -0.00546 0.05167 0.34877

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       6.6255559 0.0049432 1340.34   <2e-16 ***
WrittenFrequency -0.0371069 0.0009242 -40.15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08142 on 2282 degrees of freedom
Multiple R-squared: 0.414,    Adjusted R-squared: 0.4137
F-statistic: 1612 on 1 and 2282 DF, p-value: < 2.2e-16

more than one continuous predictor

n <- lm(RTlexdec ~ WrittenFrequency + FrequencyInitialDiphoneSyllable, englishy)
> n <- lm(RTlexdec ~ WrittenFrequency + FrequencyInitialDiphoneSyllable, englishy)
>
> n

Call:
lm(formula = RTlexdec ~ WrittenFrequency + FrequencyInitialDiphoneSyllable,
    data = englishy)

Coefficients:
                    (Intercept)                 WrittenFrequency
                       6.614295                        -0.037193
FrequencyInitialDiphoneSyllable
                       0.001084

> summary(n)

Call:
lm(formula = RTlexdec ~ WrittenFrequency + FrequencyInitialDiphoneSyllable,
    data = englishy)

Residuals:
     Min       1Q   Median       3Q      Max
-0.34483 -0.05525 -0.00584 0.05185 0.34635

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                      6.614295   0.012171 543.435   <2e-16 ***
WrittenFrequency                -0.037193   0.000928 -40.077   <2e-16 ***
FrequencyInitialDiphoneSyllable 0.001084   0.001070   1.012    0.311
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08141 on 2281 degrees of freedom
Multiple R-squared: 0.4143,    Adjusted R-squared: 0.4137
F-statistic: 806.6 on 2 and 2281 DF, p-value: < 2.2e-16

Regression diagnostics

plot(n)

> q <- lm(RTlexdec ~ WrittenFrequency + I(WrittenFrequency^2), englishy)
>
> plot(q)

R day 7

1/19/2011

Linear models: t-test and ANOVA

Simple linear models
predictions about the value of a single quantity
with one set of variables predicting another variable

factor() 因子オブジェクトを作る
contrasts() ある因子に関連するコントラストを設定する、見る
data.frame() データフレームを作る

names() 変数名の表示を変える。

names(d) <- c("Response", "Condition")

lm() 単回帰
lm(目的変数~説明変数)
m <- lm(Response ~ Condition, data=d)
summary(m)
> m <- lm(Response ~ Condition, data=d)
>
> m

Call:
lm(formula = Response ~ Condition, data = d)

Coefficients:
(Intercept)   ConditionT
          3            3

> summary(m)

Call:
lm(formula = Response ~ Condition, data = d)

Residuals:
   Min     1Q Median     3Q    Max
    -2     -1      0      1      2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0000     0.7071   4.243 0.00283 **
ConditionT    3.0000     1.0000   3.000 0.01707 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.581 on 8 degrees of freedom
Multiple R-squared: 0.5294,    Adjusted R-squared: 0.4706
F-statistic:     9 on 1 and 8 DF, p-value: 0.01707

The t statistic
t distribution is symmetric, like a Gaussian, and has a similar shape, but it has heavier tails (and it is mathematically very different) 正規分布する母集団から標本抽出したときの分布がｔ分布
It has a single parameter (a degrees of freedom)
it approaches a Gaussian as the degrees of freedom go to infinity
William Sealy Gosset

t.test()

t.test(Response ~ Condition, data=d, var.equal=T)

sleepというデータセットが最初から入っている。

plot(extra ~ group, data=sleep)

points(extra ~ as.numeric(group), pch=as.character(ID), data=sleep)

t.test(extra ~ group, data=sleep, var.equal=T)

paired t -test
t.test(extra ~ group, data=sleep, paired=T)

ex1 <- sleep[sleep$group=="1","extra"]
ex2 <- sleep[sleep$group=="2","extra"]

exdiff <- ex1 - ex2
vectorのそれぞれ対応しているものどうしでの引き算となる
> ex1
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0
> ex2
[1] 1.9 0.8 1.1 0.1 -0.1 4.4 5.5 1.6 4.6 3.4
> exdiff <- ex1 - ex2
> exdiff
[1] -1.2 -2.4 -1.3 -1.3 0.0 -1.0 -1.8 -0.8 -4.6 -1.4

library(languageR)
この中に、englishというサンプルデータがある。反応時間データ。
youngとoldで反応時間が違うかどうかを調べるには、、、
> head(english)
RTlexdec RTnaming Familiarity   Word AgeSubject WordCategory WrittenFrequency WrittenSpokenFrequencyRatio FamilySize DerivationalEntropy
1 6.543754 6.145044        2.37    doe      young            N         3.912023                   1.0216512   1.386294             0.14144

RTlexdec をAgeSubjectごとにグループを分ける。グループ間での差の検定。
> t.test(RTlexdec ~ AgeSubject, data=english)

    Welch Two Sample t-test

data: RTlexdec by AgeSubject
t = 67.4682, df = 4534.555, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
           6.660958            6.439237

> t.test(RTlexdec ~ AgeSubject, data=english, var.equal=T)

    Two Sample t-test

data: RTlexdec by AgeSubject
t = 67.4682, df = 4566, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2152787 0.2281642
sample estimates:
mean in group old mean in group young
           6.660958            6.439237

> summary(lm(RTlexdec ~ AgeSubject, data=english))

Call:
lm(formula = RTlexdec ~ AgeSubject, data = english)

Residuals:
     Min       1Q   Median       3Q      Max
-0.25776 -0.08339 -0.01669 0.06921 0.52685

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      6.660958   0.002324 2866.44   <2e-16 ***
AgeSubjectyoung -0.221721   0.003286 -67.47   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared: 0.4992,    Adjusted R-squared: 0.4991
F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16

Multiple comparison problem
multiple comparisons
Bonferroni correction

Mendelian genetics メンデル遺伝学

The Analysis of Variance
Are all the locations the same?
the warpbreaks data frame that comes with R

if there is a difference in the means, where should the variance come from—SSwithin or SSbetween ?
We can use our answer to set up the following ratio (almost):
SSbetween
SSwithin

this seems less and less likely is the ratio gets bigger.

an F distribution

the output of lm()
summary(m)

the ANOVA F statistic is the square of the relevant t statistic.

the warpbreaks data
In weaving, you have two kinds of yarn, the warp and the weft; the warp is the yarn that’s held taut on the loom

warpbreaks
# There are three levels of ’tension’: L (low)
# M (medium) and H (high)

m <- lm(breaks ~ tension, data=warpbreaks)

> m

Call:
lm(formula = breaks ~ tension, data = warpbreaks)

Coefficients:
(Intercept)     tensionM     tensionH
      36.39       -10.00       -14.72

> summary(m)

Call:
lm(formula = breaks ~ tension, data = warpbreaks)

Residuals:
    Min      1Q Median      3Q     Max
-22.389 -8.139 -2.667   6.333 33.611

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    36.39       2.80 12.995 < 2e-16 ***
tensionM      -10.00       3.96 -2.525 0.014717 *
tensionH      -14.72       3.96 -3.718 0.000501 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203,    Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753

aov(m, contrasts=list(tension="contr.helmert"))

<<Previous