When?
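A method for statistically testing whether the means of two independent groups are equal or different.
Qualitative variable (1): two groups
Quantitative variable (1):
- Null hypothesis: there is no difference in allowance between non-graduates and graduates (mu1 = mu2).
- Alternative hypothesis: there is a difference in allowance between non-graduates and graduates (mu1 is not equal to mu2).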
by(twosampleDF$money, twosampleDF$group, shapiro.test)
<Result> (group labels in the output: 비 = non-graduate, 졸 = graduate)
twosampleDF$group: 비
Shapiro-Wilk normality test
data: dd[x, ]
W = 0.83701, p-value = 0.02885
----------------------------------------------------------------------------------
twosampleDF$group: 졸
Shapiro-Wilk normality test
data: dd[x, ]
W = 0.57538, p-value = 6.737e-05
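The normality assumption is violated in both groups (both p-values are below 0.05) => proceed to Wilcoxon's rank sum test as the second step.
- Null hypothesis: the two variances are equal.
- Alternative hypothesis: the two variances are not equal.
var.test(data$variable ~ data$variable)
var.test(quantitative variable ~ qualitative variable)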
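The decision rule above can be made explicit by extracting the per-group p-values and comparing them with the significance level. This is a minimal sketch on simulated data: `toyDF`, its values, and the labels "A" and "B" are made up for illustration and are not the `twosampleDF` data.

```r
# Simulated long-format data (hypothetical, not the data analysed above):
# one quantitative column (money) and one two-level grouping column (group).
set.seed(1)
toyDF <- data.frame(
  money = rexp(24, rate = 1 / 80),
  group = rep(c("A", "B"), each = 12)
)

# Shapiro-Wilk p-value for each group
p.normality <- sapply(split(toyDF$money, toyDF$group),
                      function(x) shapiro.test(x)$p.value)
print(p.normality)

# Treat the normality assumption as violated if any group rejects at 0.05
alpha <- 0.05
normality.ok <- all(p.normality >= alpha)
print(normality.ok)
```

Working with the extracted p-values instead of the printed output makes the next step (t-test or Wilcoxon) easy to choose programmatically.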
> var.test(twosampleDF$money ~ twosampleDF$group)
F test to compare two variances
data: twosampleDF$money by twosampleDF$group
F = 0.22298, num df = 10, denom df = 11, p-value = 0.02499
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.06324512 0.81720812
sample estimates:
ratio of variances
0.2229815
> by(twosampleDF$money, twosampleDF$group, var)
twosampleDF$group: 비
[1] 1280.455
----------------------------------------------------------------------------------
twosampleDF$group: 졸
[1] 5742.424
Conclusion: the p-value is 0.025, so at the 0.05 significance level the variances are unequal.
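As a check on the output above, the F statistic, its two-sided p-value, and the confidence interval can be recomputed from the two reported variances and degrees of freedom. This sketch only uses the numbers printed above; the variable names are chosen here for illustration.

```r
# Summary values copied from the output above
var.nongrad <- 1280.455   # variance of group 비 (numerator, df = 10)
var.grad    <- 5742.424   # variance of group 졸 (denominator, df = 11)
df1 <- 10
df2 <- 11

# F statistic = ratio of the two sample variances
F.stat <- var.nongrad / var.grad

# Two-sided p-value: twice the smaller tail probability of the F distribution
p.lower <- pf(F.stat, df1, df2)
p.value <- 2 * min(p.lower, 1 - p.lower)

# 95% confidence interval for the ratio of variances
ci <- c(F.stat / qf(0.975, df1, df2), F.stat / qf(0.025, df1, df2))

print(round(c(F = F.stat, p = p.value), 5))
print(round(ci, 5))
```

The results should agree with the `var.test()` output up to rounding of the copied variances.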
t.test(data$variable ~ data$variable, alternative = c("greater", "less", "two.sided"), var.equal = FALSE)
> t.test(twosampleDF$money ~ twosampleDF$group,
alternative = "two.sided",
var.equal = FALSE)
<Result>
Welch Two Sample t-test
data: twosampleDF$money by twosampleDF$group
t = -0.21741, df = 15.963, p-value = 0.8306
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-57.02012 46.41406
sample estimates:
mean in group 비 mean in group 졸
76.36364 81.66667
The p-value is 0.831, so at the 0.05 significance level there is no statistically significant difference in allowance between non-graduates and graduates.
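The Welch statistic reported above can be reproduced from the group means, variances, and sample sizes. This is a sketch using only numbers printed earlier; the sample sizes 11 and 12 are inferred from the F test's degrees of freedom (10 and 11), and the variable names are chosen here for illustration.

```r
# Summary values copied from the outputs above
m1 <- 76.36364; v1 <- 1280.455; n1 <- 11   # group 비
m2 <- 81.66667; v2 <- 5742.424; n2 <- 12   # group 졸

# Standard error of the difference without pooling the variances
se <- sqrt(v1 / n1 + v2 / n2)
t.stat <- (m1 - m2) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df.welch <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for mu1 - mu2
p.value <- 2 * pt(-abs(t.stat), df.welch)
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df.welch) * se

print(round(c(t = t.stat, df = df.welch, p = p.value), 4))
print(round(ci, 3))
```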
t.test(data$variable ~ data$variable, alternative = c("greater", "less", "two.sided"), var.equal = TRUE)
> t.test(twosampleDF$money ~ twosampleDF$group,
alternative = "two.sided",
var.equal = TRUE)
Two Sample t-test
data: twosampleDF$money by twosampleDF$group
t = -0.21122, df = 21, p-value = 0.8348
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-57.51554 46.90948
sample estimates:
mean in group 비 mean in group 졸
76.36364 81.66667
The p-value is 0.835, so at the 0.05 significance level there is no statistically significant difference in allowance between non-graduates and graduates.
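For comparison, the equal-variance statistic can be reproduced by pooling the two group variances. As before, this sketch relies only on the printed summary values, with sample sizes 11 and 12 inferred from the degrees of freedom; the variable names are illustrative.

```r
# Summary values copied from the outputs above
m1 <- 76.36364; v1 <- 1280.455; n1 <- 11   # group 비
m2 <- 81.66667; v2 <- 5742.424; n2 <- 12   # group 졸

# Pooled variance: weighted average of the group variances
df.pooled <- n1 + n2 - 2
sp2 <- ((n1 - 1) * v1 + (n2 - 1) * v2) / df.pooled

# Standard error and t statistic under the equal-variance assumption
se <- sqrt(sp2 * (1 / n1 + 1 / n2))
t.stat <- (m1 - m2) / se
p.value <- 2 * pt(-abs(t.stat), df.pooled)

print(round(c(t = t.stat, df = df.pooled, p = p.value), 4))
```

The only differences from the Welch version are the pooled standard error and the integer degrees of freedom (n1 + n2 - 2 = 21).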
wilcox.test(data$variable ~ data$variable, alternative = c("two.sided", "greater", "less"))
> wilcox.test(twosampleDF$money ~ twosampleDF$group,
alternative = "two.sided")
Wilcoxon rank sum test with continuity correction
data: twosampleDF$money by twosampleDF$group
W = 78, p-value = 0.4594
alternative hypothesis: true location shift is not equal to 0
The p-value is 0.459, so at the 0.05 significance level there is no statistically significant difference in allowance between non-graduates and graduates.
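The W statistic reported by `wilcox.test()` is the rank sum of the first sample minus its smallest possible value. The sketch below illustrates this on made-up samples without ties; `x` and `y` are hypothetical values, not the allowance data.

```r
# Made-up samples without ties (hypothetical values, for illustration only)
x <- c(1.1, 2.3, 3.8)
y <- c(2.0, 4.5, 5.2, 6.1)

# Rank all observations together, then sum the ranks of the first sample
r <- rank(c(x, y))
n.x <- length(x)
W <- sum(r[seq_along(x)]) - n.x * (n.x + 1) / 2
print(W)

# The same statistic as reported by wilcox.test()
result <- wilcox.test(x, y, alternative = "two.sided")
print(result$statistic)
```

Because the statistic depends only on ranks, it is insensitive to the heavy skew that caused the normality tests to reject.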
- yr_built : 1900 or later and before 2000 : group = "old"
- yr_built : 2000 or later : group = "new"
- Null hypothesis: there is no difference in price between old and new.
- Alternative hypothesis: price is lower for old than for new.
houseDF <- readxl::read_excel(path = "kc_house_data.xlsx",
sheet = 1,
col_names = TRUE)
houseDF$group <- cut(houseDF$yr_built,
breaks = c(1900, 2000, 2020),
right = FALSE)
levels(houseDF$group) <- c("old", "new")
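# Anderson-Darling normality test per group; ad.test() is assumed to come from the nortest package
by(houseDF$price, houseDF$group, nortest::ad.test)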
wilcox.test(houseDF$price ~ houseDF$group,
alternative = "less")
result <- var.test(houseDF$price ~ houseDF$group)
result$p.value
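The grouping step above depends on how `cut()` treats interval boundaries. The following sketch uses made-up construction years (hypothetical values) to show the effect of `right = FALSE` and sets the labels directly through the `labels` argument.

```r
# Made-up construction years (hypothetical values, for illustration only)
yr <- c(1900, 1999, 2000, 2015, 1899)

# right = FALSE makes the intervals closed on the left:
# [1900, 2000) becomes "old" and [2000, 2020) becomes "new"
grp <- cut(yr, breaks = c(1900, 2000, 2020), right = FALSE,
           labels = c("old", "new"))
print(data.frame(yr, grp))

# Values outside every interval (here 1899) become NA
print(sum(is.na(grp)))
```

Because "old" is the first factor level, `alternative = "less"` in the formula interface tests whether old is shifted below new, which matches the alternative hypothesis stated above.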
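Run the hypothesis test below for every variable except id, date, and yr_built.
- Null hypothesis: old and new are the same.
- Alternative hypothesis: new and old are not the same.
Format of the final result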
variableName | Normality | Method      | Equality | TW    | pvalue
price        | yes       | t.test      | yes      | 1.234 | 0.123
bedrooms     | no        | wilcox.test | non      | 1.234 | 0.123
houseDF <- readxl::read_excel(path = "path/kc_house_data.xlsx",
sheet = 1,
col_names = TRUE)
houseDF <- data.frame(houseDF)
exceptVariable <- c("id", "date", "yr_built")
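# Option 1 (requires library(dplyr)): drop the excluded columns and keep the remaining column names
analysis.variable <- houseDF %>%
  select(-one_of(exceptVariable)) %>%
  colnames()
# Option 2 (base R): select the names with a regular expression instead; this overwrites Option 1
analysis.variable <- colnames(houseDF)[-grep("^id|^date|^yr_built|^group",
                                             colnames(houseDF))]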
houseDF$group <- ifelse(houseDF$yr_built >= 2000, "new","old")
str(houseDF$group)
table(houseDF$group)
houseDF$group <- factor(houseDF$group,
levels = c("old", "new"),
labels = c("old", "new"))
table(houseDF$group)
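The preparation steps above (excluding columns, then building the two-level factor) can be sketched on a toy data frame. `toyDF` and its values are hypothetical; `setdiff()` is used here as a base R alternative to the regular expression, not as the approach taken above.

```r
# Toy data frame with hypothetical values (not kc_house_data)
toyDF <- data.frame(id = 1:4,
                    date = c("d1", "d2", "d3", "d4"),
                    price = c(300, 450, 500, 620),
                    bedrooms = c(2, 3, 3, 4),
                    yr_built = c(1955, 1987, 2003, 2014))

# Names of the variables to analyse: every column except the excluded ones
exceptVariable <- c("id", "date", "yr_built")
analysis.variable <- setdiff(colnames(toyDF), exceptVariable)
print(analysis.variable)

# Two-level factor with "old" as the first (reference) level
toyDF$group <- factor(ifelse(toyDF$yr_built >= 2000, "new", "old"),
                      levels = c("old", "new"))
print(table(toyDF$group))
```

Computing the variable names before adding `group` keeps the grouping column out of the analysis list without a special case.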
Normality <- c()
Method <- c()
Equality <- c()
TW <- c()
PValue <- c()
for(i in analysis.variable){
  # Normality test per group (ad.test() is assumed to come from the nortest package)
  result.normality <- by(unlist(houseDF[ , i]), houseDF$group, nortest::ad.test)
  if( (result.normality$old$p.value < 0.05) || (result.normality$new$p.value < 0.05)){
    # Normality rejected in at least one group: Wilcoxon rank sum test
    Normality <- c(Normality, "No")
    Method <- c(Method, "wilcox.test")
    Equality <- c(Equality, "Non")
    result.wilcox <- wilcox.test(unlist(houseDF[ , i]) ~ houseDF$group,
                                 alternative = "two.sided")
    TW <- c(TW, result.wilcox$statistic)
    PValue <- c(PValue, result.wilcox$p.value)
  }else{
    # Normality not rejected: t-test, with the variance test choosing var.equal
    Normality <- c(Normality, "Yes")
    Method <- c(Method, "t.test")
    result.equality <- var.test(unlist(houseDF[ , i]) ~ houseDF$group)
    if(result.equality$p.value < 0.05){
      # Unequal variances: Welch t-test
      Equality <- c(Equality, "No")
      result.ttest <- t.test(unlist(houseDF[ , i]) ~ houseDF$group,
                             alternative = "two.sided",
                             var.equal = FALSE)
      TW <- c(TW, result.ttest$statistic)
      PValue <- c(PValue, result.ttest$p.value)
    }else{
      # Equal variances: pooled t-test
      Equality <- c(Equality, "Yes")
      result.ttest <- t.test(unlist(houseDF[ , i]) ~ houseDF$group,
                             alternative = "two.sided",
                             var.equal = TRUE)
      TW <- c(TW, result.ttest$statistic)
      PValue <- c(PValue, result.ttest$p.value)
    }
  }
}
outputTest <- data.frame(Variable = analysis.variable,
Normality,
Method,
Equality,
TW,
PValue)
writexl::write_xlsx(outputTest, path = "outputTest.xlsx")
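The same decision flow can also be written as a helper function that returns one row per variable, which avoids growing five vectors inside a loop. This is a sketch under stated assumptions, not the method used above: `compareTwoGroups`, `simDF`, and its columns are hypothetical, the data are simulated, and `shapiro.test()` from base R is swapped in for `ad.test()` so the example runs without extra packages (note that `shapiro.test()` accepts at most 5000 observations, so it would not be a drop-in replacement for a large data set).

```r
# Choose and run a two-group test for a numeric vector x and a two-level
# factor g. Hypothetical helper written for this sketch, not a package function.
compareTwoGroups <- function(x, g, alpha = 0.05) {
  p.norm <- sapply(split(x, g), function(v) shapiro.test(v)$p.value)
  if (any(p.norm < alpha)) {
    # Normality rejected in at least one group: Wilcoxon rank sum test
    res <- wilcox.test(x ~ g, alternative = "two.sided")
    normality <- "No"; method <- "wilcox.test"; equality <- "Non"
  } else {
    # Normality not rejected: the variance test decides var.equal
    equal.var <- var.test(x ~ g)$p.value >= alpha
    res <- t.test(x ~ g, alternative = "two.sided", var.equal = equal.var)
    normality <- "Yes"; method <- "t.test"
    equality <- if (equal.var) "Yes" else "No"
  }
  data.frame(Normality = normality, Method = method, Equality = equality,
             TW = unname(res$statistic), PValue = res$p.value)
}

# Simulated data (hypothetical): one roughly normal and one skewed variable
set.seed(42)
simDF <- data.frame(
  normalVar = rnorm(60, mean = 10, sd = 2),
  skewedVar = rexp(60, rate = 1),
  group = factor(rep(c("old", "new"), each = 30), levels = c("old", "new"))
)

# One result row per variable, stacked into a single data frame
vars <- c("normalVar", "skewedVar")
outputSketch <- do.call(rbind, lapply(vars, function(v) {
  data.frame(Variable = v, compareTwoGroups(simDF[[v]], simDF$group))
}))
print(outputSketch)
```

The resulting data frame has the same columns as `outputTest`, so it could be written out with `writexl::write_xlsx()` in the same way.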