[Fastcampus] RDC lecture notes - instructor 이부일
Correlation Analysis
When?
A method for statistically testing whether two quantitative variables are related (a straight-line, i.e. linear, relationship).
Example data: cars (speed, dist), attitude
1. Scatter Plot
(1) Basics
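Because the method targets linear relationships only, a strong but non-linear relationship can still give a coefficient near zero. The following is a minimal sketch with made-up vectors (`x` and `y` are illustrative and not part of the lecture data):

```r
# Illustrative data (not from the lecture): y is fully determined by x, but not linearly.
x <- -10:10
y <- x^2

# Pearson's r only picks up the linear part of a relationship.
r_nonlinear <- cor(x, y, method = "pearson")

# For comparison, an exactly linear relationship.
r_linear <- cor(x, 2 * x + 1, method = "pearson")

print(r_nonlinear)
print(r_linear)
```

This is one reason to look at a scatter plot before relying on the coefficient alone.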
# Usage: plot(x = data$variable, y = data$variable)
plot(cars$speed, cars$dist)

# Draw several plots on one screen
par(mfrow = c(2, 3))
for (i in colnames(attitude)[2:7]) {
  plot(attitude[, i], attitude$rating,
       main = paste("rating vs ", i),
       xlab = i, ylab = "rating",
       col = "blue", pch = 12)
}
par(mfrow = c(1, 1))  # restore the single-plot layout

(2) Scatter plot matrix (SMP: Scatter Matrix Plot)
plot(iris[ , 1:4])
(3) 3D scatter plot: rgl and car packages
library(rgl)  # provides plot3d()
library(car)  # provides scatter3d()

with(iris, plot3d(Sepal.Length, Sepal.Width, Petal.Length,
                  type = "s", col = as.numeric(Species)))

scatter3d(x = iris$Sepal.Length, y = iris$Petal.Length, z = iris$Sepal.Width,
          groups = iris$Species, surface = FALSE, grid = FALSE,
          ellipsoid = TRUE, axis.col = c("black", "black", "black"))
(4) corrplot package
corrplot::corrplot(cor(iris[ , 1:4]), method = "circle")
2. Correlation Coefficient
Expresses the strength of the relationship (straight-line, i.e. linear) between two quantitative variables as a single number.
cor(data$variable, data$variable, method = c("pearson", "spearman", "kendall"))
> cor(cars$speed, cars$dist, method = "pearson")
[1] 0.8068949

> cor(attitude, method = "pearson")
              rating complaints privileges  learning    raises  critical   advance
rating     1.0000000  0.8254176  0.4261169 0.6236782 0.5901390 0.1564392 0.1550863
complaints 0.8254176  1.0000000  0.5582882 0.5967358 0.6691975 0.1877143 0.2245796
privileges 0.4261169  0.5582882  1.0000000 0.4933310 0.4454779 0.1472331 0.3432934
learning   0.6236782  0.5967358  0.4933310 1.0000000 0.6403144 0.1159652 0.5316198
raises     0.5901390  0.6691975  0.4454779 0.6403144 1.0000000 0.3768830 0.5741862
critical   0.1564392  0.1877143  0.1472331 0.1159652 0.3768830 1.0000000 0.2833432
advance    0.1550863  0.2245796  0.3432934 0.5316198 0.5741862 0.2833432 1.0000000

> round(cor(attitude, method = "pearson"), digits = 3)
           rating complaints privileges learning raises critical advance
rating      1.000      0.825      0.426    0.624  0.590    0.156   0.155
complaints  0.825      1.000      0.558    0.597  0.669    0.188   0.225
privileges  0.426      0.558      1.000    0.493  0.445    0.147   0.343
learning    0.624      0.597      0.493    1.000  0.640    0.116   0.532
raises      0.590      0.669      0.445    0.640  1.000    0.377   0.574
critical    0.156      0.188      0.147    0.116  0.377    1.000   0.283
advance     0.155      0.225      0.343    0.532  0.574    0.283   1.000

> round(cor(iris[ , 1:4], method = "pearson"), digits = 3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        1.000      -0.118        0.872       0.818
Sepal.Width        -0.118       1.000       -0.428      -0.366
Petal.Length        0.872      -0.428        1.000       0.963
Petal.Width         0.818      -0.366        0.963       1.000
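To make the number returned by `cor()` less of a black box, the sketch below recomputes Pearson's r for the cars data from its definition and compares it with the built-in result. It also calls the two rank-based methods listed in the `method` argument. The variable names (`r_manual`, `r_builtin`, and so on) are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Pearson's r: sum of cross-deviations divided by the square root of the
# product of the sums of squared deviations (equivalent to cov(x, y) / (sd(x) * sd(y))).
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y, method = "pearson")

# Rank-based alternatives offered by the same function.
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

print(c(manual = r_manual, builtin = r_builtin))
print(c(spearman = r_spearman, kendall = r_kendall))
```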
3. Correlation Analysis
- Null hypothesis: there is no relationship between speed and dist.
- Alternative hypothesis: there is a relationship between speed and dist.
cor.test(data$variable, data$variable, method = "pearson")
> cor.test(cars$speed, cars$dist, method = "pearson")

	Pearson's product-moment correlation

data:  cars$speed and cars$dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor
0.8068949
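The p-value (1.49e-12) is below 0.001, so at the 0.05 significance level there is a statistically significant positive correlation between speed and dist.
In other words, dist tends to increase as speed increases.
- Null hypothesis: there is no relationship between rating and complaints.
- Alternative hypothesis: there is a relationship between rating and complaints.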
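The t value and p-value reported by `cor.test()` follow from r and the sample size alone. As a rough sketch (variable names such as `t_manual` are illustrative), the test statistic under the null hypothesis of zero correlation is t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom:

```r
res <- cor.test(cars$speed, cars$dist, method = "pearson")

r  <- unname(res$estimate)  # sample correlation coefficient
n  <- nrow(cars)            # number of observations
df <- n - 2                 # degrees of freedom of the t distribution

# Test statistic under H0: true correlation = 0
t_manual <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = df, lower.tail = FALSE)

print(c(t = t_manual, p = p_manual))
print(c(t = unname(res$statistic), p = res$p.value))
```

Both printed lines should agree, which shows that the significance of a correlation depends on the sample size as well as on r itself.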
> cor.test(attitude$complaints, attitude$rating, method = "pearson")

	Pearson's product-moment correlation

data:  attitude$complaints and attitude$rating
t = 7.737, df = 28, p-value = 1.988e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6620128 0.9139139
sample estimates:
      cor
0.8254176
The p-value (1.988e-08) is below 0.001, so at the 0.05 significance level there is a statistically significant, very strong positive correlation between complaints and rating.
Quiz.
Test the relationship between rating and each of the other six variables.
# Using a for loop
for (i in colnames(attitude)[2:7]) {
  print(cor.test(attitude[, i], attitude$rating, method = "pearson"))
}

# Using corr.test() from the psych package
> psych::corr.test(attitude, method = "pearson")
Call:psych::corr.test(x = attitude, method = "pearson")
Correlation matrix
           rating complaints privileges learning raises critical advance
rating       1.00       0.83       0.43     0.62   0.59     0.16    0.16
complaints   0.83       1.00       0.56     0.60   0.67     0.19    0.22
privileges   0.43       0.56       1.00     0.49   0.45     0.15    0.34
learning     0.62       0.60       0.49     1.00   0.64     0.12    0.53
raises       0.59       0.67       0.45     0.64   1.00     0.38    0.57
critical     0.16       0.19       0.15     0.12   0.38     1.00    0.28
advance      0.16       0.22       0.34     0.53   0.57     0.28    1.00
Sample Size
[1] 30
Probability values (Entries above the diagonal are adjusted for multiple tests.)
           rating complaints privileges learning raises critical advance
rating       0.00       0.00       0.19     0.00   0.01     1.00    1.00
complaints   0.00       0.00       0.02     0.01   0.00     1.00    1.00
privileges   0.02       0.00       0.00     0.07   0.15     1.00    0.51
learning     0.00       0.00       0.01     0.00   0.00     1.00    0.03
raises       0.00       0.00       0.01     0.00   0.00     0.36    0.01
critical     0.41       0.32       0.44     0.54   0.04     0.00    0.90
advance      0.41       0.23       0.06     0.00   0.00     0.13    0.00

 To see confidence intervals of the correlations, print with the short=FALSE option
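The for loop prints six separate test reports, which are hard to compare at a glance. One possible alternative, sketched here with base R only (the names `predictors` and `results` are illustrative), collects the key numbers from each `cor.test()` call into a single data frame:

```r
predictors <- colnames(attitude)[2:7]

# Run cor.test() for each predictor against rating and keep r, t and the p-value.
results <- do.call(rbind, lapply(predictors, function(v) {
  res <- cor.test(attitude[, v], attitude$rating, method = "pearson")
  data.frame(variable = v,
             r = unname(res$estimate),
             t = unname(res$statistic),
             p.value = res$p.value)
}))

# Strongest correlations first
results <- results[order(-abs(results$r)), ]
print(results, digits = 3)
```

These p-values are unadjusted, so they correspond to the entries below the diagonal in the `psych::corr.test()` output.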
Quiz.
housing data: print the variable names, correlation coefficients, t values, and p-values of the six variables most strongly related to price.
library(dplyr)  # provides %>%, select(), one_of()

houseDF <- readxl::read_excel(path = "kc_house_data.xlsx", sheet = 1, col_names = TRUE)
View(houseDF)

# Age of the house, taking 2018 as the reference year
houseDF$year <- 2018 - houseDF$yr_built
remove.variables <- c("id", "date", "floors", "waterfront", "view", "condition",
                      "yr_built", "yr_renovated", "zipcode", "lat", "long")
houseDF2 <- houseDF %>% select(-one_of(remove.variables))

# psych::corr.test()
corr.result <- psych::corr.test(houseDF2)
str(corr.result)
corr.result$r[, 1]
str(corr.result$r)

# Column 1 is taken to be price; [2:7] skips price's correlation with itself (r = 1)
top6.variables <- names(sort(corr.result$r[, 1], decreasing = TRUE))[2:7]
# Look up r, t and p by variable name so that all three describe the same six variables
# (sorting t and p independently would pair them with the wrong variables)
top6.r      <- round(corr.result$r[top6.variables, 1], digits = 3)
top6.t      <- round(corr.result$t[top6.variables, 1], digits = 3)
top6.pvalue <- round(corr.result$p[top6.variables, 1], digits = 3)

plot(houseDF2[, c(top6.variables, "price")])

corrDF <- data.frame(Variables = top6.variables, r = top6.r, t = top6.t, pvalue = top6.pvalue)
writexl::write_xlsx(corrDF, path = "correlationResult.xlsx")