
[Fastcampus] RDC lecture notes - instructor 이부일

Correlation Analysis

When?
A statistical method for testing whether two quantitative variables are related, where "related" means a straight-line (linear) relationship.

Example data: cars (speed, dist), attitude


1. Scatter Plot

(1) Basics

# Syntax: plot(x = data$variable, y = data$variable)
plot(cars$speed, cars$dist)

# Draw several plots on one screen
par(mfrow = c(2, 3))
for(i in colnames(attitude)[2:7]){
    plot(attitude[ , i], attitude$rating,
         main = paste("rating vs ", i),
         xlab = i,
         ylab = "rating",
         col = "blue",
         pch = 12)
}
par(mfrow = c(1, 1))
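
To judge by eye whether the points follow a straight line, it can help to overlay a fitted line on the scatter plot. The following is a minimal sketch that is not part of the original lecture notes; it assumes a simple linear regression via lm() is an acceptable reference line for the built-in cars data.

```r
# Fit a simple linear regression of dist on speed (built-in cars data)
fit <- lm(dist ~ speed, data = cars)

plot(cars$speed, cars$dist,
     main = "dist vs speed",
     xlab = "speed",
     ylab = "dist",
     col  = "blue")

# Overlay the fitted straight line on the scatter plot
abline(fit, col = "red", lwd = 2)
```

If the points scatter evenly around the line, a linear correlation coefficient is a reasonable summary; a clear curve suggests it is not.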


(2) Scatter Matrix Plot (SMP)

plot(iris[ , 1:4])


(3) 3D scatter plot: rgl and car packages

# plot3d() comes from the rgl package
with(iris,
     rgl::plot3d(Sepal.Length,
                 Sepal.Width,
                 Petal.Length,
                 type = "s",
                 col = as.numeric(Species)))


# scatter3d() comes from the car package
car::scatter3d(x = iris$Sepal.Length,
               y = iris$Petal.Length,
               z = iris$Sepal.Width,
               groups = iris$Species,
               surface = FALSE,
               grid = FALSE,
               ellipsoid = TRUE,
               axis.col = c("black", "black", "black"))

(4) corrplot package

corrplot::corrplot(cor(iris[ , 1:4]), method = "circle")




2. Correlation Coefficient

A single number that expresses how strong the straight-line (linear) relationship between two quantitative variables is.
cor(data$variable, data$variable, method = c("pearson", "spearman", "kendall"))

> cor(cars$speed, cars$dist, method = "pearson")
[1] 0.8068949

> cor(attitude, method = "pearson")
              rating complaints privileges  learning    raises  critical   advance
rating     1.0000000  0.8254176  0.4261169 0.6236782 0.5901390 0.1564392 0.1550863
complaints 0.8254176  1.0000000  0.5582882 0.5967358 0.6691975 0.1877143 0.2245796
privileges 0.4261169  0.5582882  1.0000000 0.4933310 0.4454779 0.1472331 0.3432934
learning   0.6236782  0.5967358  0.4933310 1.0000000 0.6403144 0.1159652 0.5316198
raises     0.5901390  0.6691975  0.4454779 0.6403144 1.0000000 0.3768830 0.5741862
critical   0.1564392  0.1877143  0.1472331 0.1159652 0.3768830 1.0000000 0.2833432
advance    0.1550863  0.2245796  0.3432934 0.5316198 0.5741862 0.2833432 1.0000000

> round(cor(attitude, method = "pearson") , digits = 3)
           rating complaints privileges learning raises critical advance
rating      1.000      0.825      0.426    0.624  0.590    0.156   0.155
complaints  0.825      1.000      0.558    0.597  0.669    0.188   0.225
privileges  0.426      0.558      1.000    0.493  0.445    0.147   0.343
learning    0.624      0.597      0.493    1.000  0.640    0.116   0.532
raises      0.590      0.669      0.445    0.640  1.000    0.377   0.574
critical    0.156      0.188      0.147    0.116  0.377    1.000   0.283
advance     0.155      0.225      0.343    0.532  0.574    0.283   1.000

> round(cor(iris[ , 1:4], method = "pearson") , digits = 3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        1.000      -0.118        0.872       0.818
Sepal.Width        -0.118       1.000       -0.428      -0.366
Petal.Length        0.872      -0.428        1.000       0.963
Petal.Width         0.818      -0.366        0.963       1.000
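
As a rough illustration of what the method argument computes, the sketch below reproduces the Pearson coefficient from its textbook formula and the Spearman coefficient as Pearson applied to ranks. It is not from the lecture; the variable names (r.manual, rho.manual, u) are made up for the example. The last lines show that a perfect but non-linear relationship can still give a coefficient of about 0.

```r
x <- cars$speed
y <- cars$dist

# Pearson: co-variation of x and y divided by the product of their spreads
r.manual  <- sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r.builtin <- cor(x, y, method = "pearson")

# Spearman: the Pearson coefficient computed on the ranks of x and y
rho.manual  <- cor(rank(x), rank(y), method = "pearson")
rho.builtin <- cor(x, y, method = "spearman")

# A perfect quadratic relationship that is symmetric around 0
u <- -5:5
r.curve <- cor(u, u^2)

print(c(r.manual = r.manual, r.builtin = r.builtin))
print(c(rho.manual = rho.manual, rho.builtin = rho.builtin))
print(r.curve)
```

This is why the scatter plot comes first: the coefficient only measures the linear part of a relationship.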


3. Correlation Analysis

  • Null hypothesis: there is no relationship between speed and dist.
  • Alternative hypothesis: there is a relationship between speed and dist.
    cor.test(data$variable, data$variable, method = "pearson")
> cor.test(cars$speed, cars$dist, method = "pearson")

	Pearson's product-moment correlation

data:  cars$speed and cars$dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor
0.8068949

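Because the p-value (1.49e-12) is far below the 0.05 significance level, there is a statistically significant positive correlation between speed and dist (r = 0.807).
In other words, dist tends to increase as speed increases.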
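
The numbers in the cor.test() output can be rebuilt by hand, which makes it clearer where t, df, the p-value, and the confidence interval come from. The sketch below is an addition to the notes: it uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, and the Fisher z transformation for the interval. The names t.manual, p.manual, and ci.manual are illustrative only.

```r
r  <- cor(cars$speed, cars$dist)
n  <- nrow(cars)
df <- n - 2

# Test statistic and two-sided p-value under the null hypothesis of zero correlation
t.manual <- r * sqrt(df) / sqrt(1 - r^2)
p.manual <- 2 * pt(-abs(t.manual), df = df)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci.manual <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Compare with the built-in test
ct <- cor.test(cars$speed, cars$dist, method = "pearson")
print(c(t.manual = t.manual, t.builtin = unname(ct$statistic)))
print(c(p.manual = p.manual, p.builtin = ct$p.value))
print(rbind(manual = ci.manual, builtin = as.numeric(ct$conf.int)))
```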



  • Null hypothesis: there is no relationship between rating and complaints.
  • Alternative hypothesis: there is a relationship between rating and complaints.
> cor.test(attitude$complaints, attitude$rating, method = "pearson")

	Pearson's product-moment correlation

data:  attitude$complaints and attitude$rating
t = 7.737, df = 28, p-value = 1.988e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6620128 0.9139139
sample estimates:
      cor
0.8254176

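Because the p-value (1.988e-08) is far below the 0.05 significance level, there is a statistically significant, very strong positive correlation between complaints and rating (r = 0.825).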



Quiz.

Test the relationship between rating and each of the other six variables.

# Using a for loop
for(i in colnames(attitude)[2:7]){
    print(cor.test(attitude[ , i], attitude$rating, method = "pearson"))
}
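
The loop above prints six separate test reports. If a compact table is easier to read, one possible variation (not from the lecture) is to pull the estimate, statistic, and p-value out of each cor.test() result and bind them into a matrix. The object name result is an assumption made for this sketch.

```r
vars <- colnames(attitude)[2:7]

# One row per variable: correlation with rating, t statistic, p-value
result <- t(sapply(vars, function(v) {
    ct <- cor.test(attitude[[v]], attitude$rating, method = "pearson")
    c(r       = unname(ct$estimate),
      t       = unname(ct$statistic),
      p.value = ct$p.value)
}))

print(round(result, digits = 3))
```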

# Using corr.test() from the psych package
> psych::corr.test(attitude, method = "pearson")

Call:psych::corr.test(x = attitude, method = "pearson")
Correlation matrix
           rating complaints privileges learning raises critical advance
rating       1.00       0.83       0.43     0.62   0.59     0.16    0.16
complaints   0.83       1.00       0.56     0.60   0.67     0.19    0.22
privileges   0.43       0.56       1.00     0.49   0.45     0.15    0.34
learning     0.62       0.60       0.49     1.00   0.64     0.12    0.53
raises       0.59       0.67       0.45     0.64   1.00     0.38    0.57
critical     0.16       0.19       0.15     0.12   0.38     1.00    0.28
advance      0.16       0.22       0.34     0.53   0.57     0.28    1.00
Sample Size
[1] 30
Probability values (Entries above the diagonal are adjusted for multiple tests.)
           rating complaints privileges learning raises critical advance
rating       0.00       0.00       0.19     0.00   0.01     1.00    1.00
complaints   0.00       0.00       0.02     0.01   0.00     1.00    1.00
privileges   0.02       0.00       0.00     0.07   0.15     1.00    0.51
learning     0.00       0.00       0.01     0.00   0.00     1.00    0.03
raises       0.00       0.00       0.01     0.00   0.00     0.36    0.01
critical     0.41       0.32       0.44     0.54   0.04     0.00    0.90
advance      0.41       0.23       0.06     0.00   0.00     0.13    0.00

 To see confidence intervals of the correlations, print with the short=FALSE option
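
The note "adjusted for multiple tests" refers to a multiple-comparison correction; psych::corr.test() uses the Holm method by default (its adjust argument). The sketch below approximates that idea with base R only, assuming the correction is applied across all 21 variable pairs; it is meant to show the mechanism, not to reproduce the package internals exactly.

```r
# All 21 unordered pairs of the 7 attitude variables
pairs <- combn(colnames(attitude), 2)

# Unadjusted p-value of the Pearson test for each pair
p.raw <- apply(pairs, 2, function(v) {
    cor.test(attitude[[v[1]]], attitude[[v[2]]], method = "pearson")$p.value
})

# Holm correction across all pairs
p.holm <- p.adjust(p.raw, method = "holm")

adjusted <- data.frame(var1   = pairs[1, ],
                       var2   = pairs[2, ],
                       p.raw  = round(p.raw, 3),
                       p.holm = round(p.holm, 3))
print(adjusted)
```

Adjusted p-values are never smaller than the raw ones, which is why some pairs that look significant below the diagonal are no longer significant above it.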


Quiz.

housing data: print the names, correlation coefficients, t values, and p-values of the six variables most strongly related to price.

houseDF <- readxl::read_excel(path      = "kc_house_data.xlsx",
                              sheet     = 1,
                              col_names = TRUE)
View(houseDF)
houseDF$year <- 2018 - houseDF$yr_built
remove.variables <- c("id", "date", "floors", "waterfront", "view",
                      "condition", "yr_built", "yr_renovated",
                      "zipcode", "lat", "long")

# dplyr provides %>% and select()
library(dplyr)

houseDF2 <- houseDF %>%
    select(-one_of(remove.variables))

corr.result <- psych::corr.test(houseDF2)
str(corr.result)
corr.result$r[ , 1]
str(corr.result$r)

# Pick the six variables with the largest r against price (column 1; position 1 is price itself),
# then look up r, t, and p by name so that all three describe the same variables.
# Sorting r, t, and p separately would pair each variable with the wrong p-value.
top6.variables <- names(sort(corr.result$r[, 1], decreasing = TRUE))[2:7]
top6.r <- round(corr.result$r[top6.variables, 1], digits = 3)
top6.t <- round(corr.result$t[top6.variables, 1], digits = 3)
top6.pvalue <- round(corr.result$p[top6.variables, 1], digits = 3)
plot(houseDF2[ , c(top6.variables, "price")])

corrDF <- data.frame(Variables = top6.variables,
                     r = top6.r,
                     t = top6.t,
                     pvalue = top6.pvalue)
writexl::write_xlsx(corrDF, path = "correlationResult.xlsx")
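
Since kc_house_data.xlsx may not be at hand, the same ranking idea can be tried on a built-in data set. The sketch below is a stand-in, not the lecture's solution: it uses mtcars with mpg as a hypothetical target, ranks by the absolute value of r so that strong negative relationships are not missed, and applies one index to r, t, and p so the columns stay aligned. The t and p values are computed from t = r * sqrt(n - 2) / sqrt(1 - r^2).

```r
target <- "mpg"                       # hypothetical target, standing in for price
others <- setdiff(colnames(mtcars), target)
n      <- nrow(mtcars)

# Correlation of every other variable with the target
r <- cor(mtcars, method = "pearson")[others, target]

# t statistic and two-sided p-value for each correlation
t.value <- r * sqrt(n - 2) / sqrt(1 - r^2)
p.value <- 2 * pt(-abs(t.value), df = n - 2)

# One index, ordered by strength of relationship, reused for every column
idx <- order(abs(r), decreasing = TRUE)[1:6]

top6 <- data.frame(Variables = others[idx],
                   r         = round(r[idx], 3),
                   t         = round(t.value[idx], 3),
                   pvalue    = signif(p.value[idx], 3),
                   row.names = NULL)
print(top6)
```

Whether to rank by r or by its absolute value depends on the question; the quiz code in the notes ranks by signed r, which keeps only positive relationships near the top.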

