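Residuals in Statistics: A Complete Guide

Definition and Basic Concepts

A residual is the difference between an observed value and the value a model predicts for it, and it is a core concept in statistics [1][2]. Mathematically: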

$$ \text{residual} = \text{observed value} - \text{predicted value} $$
$$ e_i = y_i - \hat{y}_i $$

where:

  • $$ e_i $$: the residual of the i-th observation
  • $$ y_i $$: the i-th observed value
  • $$ \hat{y}_i $$: the i-th predicted value
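As a minimal sketch of the definition, the following computes $$ e_i = y_i - \hat{y}_i $$ for a simple straight-line fit. The data values and the helper name `fit_line` are made up for illustration, and only the Python standard library is used.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]                  # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # e_i = y_i - y_hat_i

for xi, yi, yh, ei in zip(x, y, y_hat, residuals):
    print(f"x={xi:.0f}  y={yi:.2f}  y_hat={yh:.2f}  e={ei:+.2f}")
```

Because the fit includes an intercept, the residuals sum to zero up to floating-point rounding.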

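Error vs. Residual

Error and residual are often confused, but they are distinct concepts [1][3]:

| Aspect | Error | Residual |
|---|---|---|
| Definition | Difference between an observed value and the true (population) regression function | Difference between an observed value and the regression function estimated from the sample |
| Computability | Generally cannot be computed | Can be computed |
| Notation | ε (epsilon) | e |
| Use | Theoretical concept | Used in practical diagnostics |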
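The distinction can be made concrete with a small simulation. This is only a sketch: the "true" line (intercept 1, slope 2), the noise level, and the seed are all assumptions chosen for illustration. In a simulation the errors are known because we generate them; with real data only the residuals are available.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


rng = random.Random(0)
true_b0, true_b1 = 1.0, 2.0                  # assumed population line
x = [float(i) for i in range(1, 21)]

# Errors: deviations from the TRUE line (observable only in a simulation).
errors = [rng.gauss(0.0, 1.0) for _ in x]
y = [true_b0 + true_b1 * xi + eps for xi, eps in zip(x, errors)]

# Residuals: deviations from the line ESTIMATED from the sample.
b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"estimated line: {b0:.3f} + {b1:.3f} x")
print(f"sum of errors:    {sum(errors):+.3f}")
print(f"sum of residuals: {sum(residuals):+.3f}")
```

The residuals sum to zero by construction of the least-squares fit, while the errors generally do not.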

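Types and Classification of Residuals

1. Raw Residuals

The most basic form of residual: simply the observed value minus the predicted value [4][5].

2. Standardized Residuals

A raw residual divided by its estimated standard deviation, which puts residuals on a common scale so they can be compared across observations and models [4][5]:

$$ r_i = \frac{e_i}{\sqrt{MSE(1-h_{ii})}} $$

where $$ h_{ii} $$ is the leverage of the i-th observation and MSE is the mean squared error of the fit.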
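A minimal sketch of this formula for simple linear regression follows. It assumes a model with an intercept and one predictor (so p = 2), in which case the leverage has the closed form $$ h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx} $$. The data and the function name `standardized_residuals` are illustrative.

```python
import math


def standardized_residuals(x, y):
    """Return (raw residuals, leverages, standardized residuals)."""
    n, p = len(x), 2                          # intercept + slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    mse = sum(ei ** 2 for ei in e) / (n - p)
    r = [ei / math.sqrt(mse * (1.0 - hi)) for ei, hi in zip(e, h)]
    return e, h, r


# Made-up data; the sixth point sits well above the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 9.0, 7.1, 7.9]

e, h, r = standardized_residuals(x, y)
for i, (ei, hi, ri) in enumerate(zip(e, h, r)):
    print(f"i={i}  e={ei:+.3f}  h={hi:.3f}  r={ri:+.3f}")
```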

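3. Studentized Residuals

Also called externally studentized residuals, these use a standard error computed with the observation in question left out [6][4][7]:

$$ r^*_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}} $$

Under the usual model assumptions this follows a t-distribution with n - p - 1 degrees of freedom, which makes it more effective for detecting outliers [8][7].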
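The leave-one-out standard error does not require refitting the model n times. The sketch below relies on the standard identity $$ SSE_{(i)} = SSE - e_i^2 / (1 - h_{ii}) $$ and again assumes simple linear regression with an intercept (p = 2); the data and function names are illustrative.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def externally_studentized(x, y):
    """Externally studentized residuals for a straight-line fit."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)

    t = []
    for ei, hi in zip(e, h):
        # Error variance estimated with observation i left out,
        # obtained without refitting the model.
        sigma2_loo = (sse - ei ** 2 / (1.0 - hi)) / (n - p - 1)
        t.append(ei / math.sqrt(sigma2_loo * (1.0 - hi)))
    return t


# Made-up data; the sixth point sits well above the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 9.0, 7.1, 7.9]

t = externally_studentized(x, y)
print([round(ti, 2) for ti in t])
```

Each value can be compared against a t-distribution with n - p - 1 degrees of freedom; an unusually large absolute value points to a possible outlier.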

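Purpose and Importance of Residual Analysis

Verifying Model Assumptions

Residual analysis is used to check the basic assumptions of a regression model [9][10]:

  1. Linearity: check that the residuals scatter randomly, with no systematic trend
  2. Homoscedasticity: check that the variance of the residuals is constant
  3. Independence: check that the residuals are not correlated with one another
  4. Normality: check that the residuals follow a normal distribution

Assessing Model Fit

A well-fitting model has residuals with the following properties [11][12]:

  • The residuals have mean 0
  • The residuals are uncorrelated with one another
  • The residuals have constant variance
  • The residuals follow a normal distribution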

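Residual Plots and Their Interpretation

1. Residuals vs Fitted Plot

The most commonly used diagnostic tool [13][14].

Ideal pattern:

  • Points scattered randomly around the zero line
  • No distinct pattern
  • Constant spread

Problematic patterns [15][16]:

  • U-shaped pattern: nonlinearity is present
  • Fan (funnel) pattern: heteroscedasticity is present
  • Curved pattern: a higher-order term is missing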
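The U-shaped pattern can be reproduced numerically without drawing anything. In this sketch a straight line is fitted to data that are exactly quadratic (an artificial example chosen for illustration), and the residuals are listed in order of the fitted variable.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# The true relationship is quadratic, but we fit a straight line.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Crude text version of a residual plot: one row per observation.
for xi, ei in zip(x, residuals):
    marker = "+" if ei > 0 else "-" if ei < 0 else "0"
    print(f"x={xi:+.0f}  e={ei:+.1f}  {marker * int(round(abs(ei)))}")
```

The residuals are positive at both ends and negative in the middle, which is the signature of a missing squared term.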

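2. Normal Q-Q Plot

A plot for checking the normality of the residuals [9][17]:

  • Points that follow a straight line indicate that normality holds
  • A curve or an S-shape indicates a departure from normality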
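The points of a Q-Q plot can be computed directly. This sketch pairs the sorted residuals with standard-normal quantiles at plotting positions (i - 0.5) / n, which is one common convention among several and is an assumption here. The correlation of the pairs is used as a rough numeric stand-in for "how straight the plot is"; the two samples are simulated.

```python
import math
import random
from statistics import NormalDist


def qq_points(residuals):
    """Pairs (theoretical normal quantile, ordered residual)."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def pearson(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    sab = sum((a - ma) * (b - mb) for a, b in pairs)
    saa = sum((a - ma) ** 2 for a, _ in pairs)
    sbb = sum((b - mb) ** 2 for _, b in pairs)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(42)
normal_like = [rng.gauss(0.0, 1.0) for _ in range(200)]       # simulated
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(200)]     # simulated

corr_normal = pearson(qq_points(normal_like))
corr_skewed = pearson(qq_points(skewed))
print(f"Q-Q correlation, normal-like residuals: {corr_normal:.3f}")
print(f"Q-Q correlation, skewed residuals:      {corr_skewed:.3f}")
```

A correlation close to 1 corresponds to points lying along a straight line; the skewed sample bends away from it.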

3. Scale-Location Plot

A plot for checking homoscedasticity; it plots the square root of the absolute standardized residuals against the fitted values [9][18].

Diagnosing Problems with Residuals

Heteroscedasticity

The variance of the residuals is not constant [19][20][21]:

Causes:

  • Model misspecification
  • Omitted variables
  • Measurement error

Tests:

  • White test [20][22]
  • Breusch-Pagan test [22]
  • Goldfeld-Quandt test [21]
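To show the idea behind these tests, here is a sketch of the studentized (Koenker) form of the Breusch-Pagan test for a single predictor: regress the squared residuals on x and use LM = n · R², which is compared with a chi-square distribution with 1 degree of freedom. The simulated data (noise standard deviation proportional to x), the seed, and the function name are assumptions for illustration; this is not the implementation used by any particular library.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def breusch_pagan_simple(x, y):
    """LM statistic and p-value (1 df) for one predictor."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    e2 = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]

    # Auxiliary regression of the squared residuals on x.
    a0, a1 = fit_line(x, e2)
    e2_bar = sum(e2) / n
    ss_tot = sum((v - e2_bar) ** 2 for v in e2)
    ss_res = sum((v - (a0 + a1 * xi)) ** 2 for xi, v in zip(x, e2))
    r2 = max(0.0, 1.0 - ss_res / ss_tot)

    lm = n * r2
    # Survival function of chi-square with 1 df: P(X > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


rng = random.Random(0)
x = [i / 10.0 for i in range(1, 201)]
# Simulated heteroscedastic data: the noise grows with x.
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]

lm, p_value = breusch_pagan_simple(x, y)
print(f"LM = {lm:.2f}, p-value = {p_value:.2g}")
```

A small p-value is evidence against constant variance.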

Autocorrelation

Correlation among the residuals, typically seen in time series data [19][23]:

Characteristics:

  • $$ Cov(\epsilon_t, \epsilon_{t-k}) \neq 0 $$ for some lag $$ k \neq 0 $$
  • A pattern that follows the time order

Tests:

  • Durbin-Watson test
  • Ljung-Box test [24][25]
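Both test statistics are short formulas on the residual sequence. The sketch below computes the Durbin-Watson statistic $$ \sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2 $$ and the Ljung-Box statistic $$ Q = n(n+2) \sum_{k=1}^{h} \hat{\rho}_k^2 / (n-k) $$ on two artificial residual sequences; the sequences and function names are illustrative, and p-values are left out.

```python
def durbin_watson(e):
    """Durbin-Watson statistic; values near 2 suggest no lag-1 correlation."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den


def ljung_box_q(e, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(e)
    mean = sum(e) / n
    d = [v - mean for v in e]
    c0 = sum(v * v for v in d)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = sum(d[t] * d[t - k] for t in range(k, n)) / c0
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


# Artificial residual sequences, for illustration only.
alternating = [1.0, -1.0] * 10            # strong negative autocorrelation
persistent = [1.0] * 10 + [-1.0] * 10     # strong positive autocorrelation

dw_alt = durbin_watson(alternating)
dw_per = durbin_watson(persistent)
q_per = ljung_box_q(persistent, max_lag=1)

print(f"Durbin-Watson, alternating: {dw_alt:.2f}")   # close to 4
print(f"Durbin-Watson, persistent:  {dw_per:.2f}")   # close to 0
print(f"Ljung-Box Q (lag 1), persistent: {q_per:.2f}")
```

Q is compared with a chi-square distribution whose degrees of freedom are the number of lags, reduced by the number of fitted ARMA parameters when the residuals come from such a model.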

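Detecting Outliers and Influential Observations

Leverage

Measures how far an observation's predictor values lie from those of the other observations, and therefore how much potential it has to pull the regression line toward itself [26][27][28]:

$$ h_{ii} = x_i^T(X^TX)^{-1}x_i $$

Properties:

  • $$ 0 \leq h_{ii} \leq 1 $$
  • $$ \sum h_{ii} = p $$ (the number of parameters)
  • $$ h_{ii} > 2p/n $$ is commonly treated as high leverage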
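For simple linear regression with an intercept (p = 2), the hat-matrix diagonal reduces to $$ h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx} $$, so the properties above can be checked without any matrix algebra. The sketch below uses that closed form on made-up predictor values with one far-away point.

```python
def leverages(x):
    """Leverage h_ii of each point in a straight-line fit with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Made-up predictor values; the last one is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
n, p = len(x), 2

h = leverages(x)
threshold = 2.0 * p / n
flagged = [i for i, hi in enumerate(h) if hi > threshold]

print("leverages:", [round(hi, 3) for hi in h])
print("sum of leverages:", round(sum(h), 6))   # equals p
print("high-leverage indices:", flagged)
```

Leverage depends only on the predictor values, not on y, which is why no response values appear in this example.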

Cook’s Distance

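A combined measure of the influence of an individual observation [29][30][31]:

$$ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1-h_{ii}} $$

where $$ r_i $$ is the standardized residual and $$ h_{ii} $$ the leverage of the i-th observation.

Rule of thumb: an observation with $$ D_i > 3 \times \text{mean}(D) $$ is treated as influential. Other cutoffs are also in use.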
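The following sketch combines the standardized residual and the leverage into Cook's distance for a straight-line fit (p = 2) and applies the 3 × mean(D) rule. The data are made up so that the last point has both high leverage and a y value off the trend of the others; the function name is illustrative.

```python
def cooks_distance(x, y):
    """Cook's distance of each observation in a straight-line fit."""
    n, p = len(x), 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    mse = sum(ei ** 2 for ei in e) / (n - p)

    d = []
    for ei, hi in zip(e, h):
        r2 = ei ** 2 / (mse * (1.0 - hi))      # squared standardized residual
        d.append(r2 / p * hi / (1.0 - hi))
    return d


# Made-up data: the first five points follow y = x closely, the last does not.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]

d = cooks_distance(x, y)
cutoff = 3.0 * sum(d) / len(d)
influential = [i for i, di in enumerate(d) if di > cutoff]

print("Cook's distances:", [round(di, 3) for di in d])
print("influential indices:", influential)
```

Note that the last point's raw residual is small because it drags the fitted line toward itself; Cook's distance still flags it through the leverage term.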

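Residual Analysis in Time Series Models

ARIMA Model Diagnostics

In a time series model, the residuals should behave like white noise [32][12][24]:

Items to check:

  • Autocorrelation function (ACF) of the residuals
  • Normality of the residuals
  • Homoscedasticity of the residuals
  • Ljung-Box test

Seasonal Model Diagnostics

For a seasonal ARIMA model, check for residual patterns at the seasonal period [25][33].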

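Practical Use and Remedies

Remedies for Heteroscedasticity

  1. Use robust standard errors [21][22]
  2. Weighted least squares (WLS) [21]
  3. Generalized least squares (GLS/FGLS) [21]
  4. Variable transformation [18][21]

Remedies for Autocorrelation

  1. ARIMA modeling [23][34]
  2. Differencing [34]
  3. Adding lagged variables [23]

Handling Outliers

  1. Use robust regression [35][36]
  2. Remove the outliers [31][35]
  3. Reduce their effect through transformation [36]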

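Residual Analysis in Software

Residual Analysis in R

# Fit an example model (built-in mtcars data, used here only for illustration)
model <- lm(mpg ~ wt, data = mtcars)

# Basic diagnostic plots
plot(model)

# Individual residual types
residuals(model)     # raw residuals
rstandard(model)     # standardized residuals
rstudent(model)      # studentized residuals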

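Residual Analysis in Python

import numpy as np
import statsmodels.api as sm

# Fit an example OLS model (synthetic data, used here only for illustration)
x = np.arange(20.0)
y = 2.0 + 0.5 * x + np.random.default_rng(0).normal(size=20)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Residual diagnostics
model.resid           # raw residuals
model.resid_pearson   # Pearson residuals
model.get_influence() # influence statistics (leverage, Cook's distance, ...)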

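Residual analysis is a core part of statistical modeling: it is how a model's fit is assessed and how directions for improvement are found. Done properly, it leads to more reliable and more accurate statistical models.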

Sources
[1] [회귀]오차와 잔차, 표준화 잔차 – Jangpiano Science – 티스토리 https://jangpiano-science.tistory.com/116
[2] Residuals – Statistics By Jim https://statisticsbyjim.com/glossary/residuals/
[3] [통계] 오차(error)와 잔차(residual)의 차이 – 홍시의 씽크탱크 – 티스토리 https://kimhongsi.tistory.com/entry/%ED%86%B5%EA%B3%84-%EC%98%A4%EC%B0%A8error%EC%99%80-%EC%9E%94%EC%B0%A8residual%EC%9D%98-%EC%B0%A8%EC%9D%B4
[4] [PDF] STAT 224 Lecture 10 Chapter 4 Model Diagnostics, Part 1 https://www.stat.uchicago.edu/~yibi/teaching/stat224/L10.pdf
[5] 9.3 – Identifying Outliers (Unusual Y Values) | STAT 462 https://online.stat.psu.edu/stat462/node/172/
[6] Studentized residuals https://www.ibm.com/docs/el/cognos-analytics/11.1.x?topic=terms-studentized-residuals
[7] 9.4 – Studentized Residuals | STAT 462 https://online.stat.psu.edu/stat462/node/247/
[8] Studentized residual test https://www.ibm.com/docs/en/cognos-analytics/11.1.0?topic=tests-studentized-residual-test
[9] Chapter 11 Testing regression assumptions – R for Survey Analysis https://bookdown.org/jimr1603/Intermediate_R_-R_for_Survey_Analysis/testing-regression-assumptions.html
[10] Residual Diagnostics https://cran.r-project.org/web/packages/olsrr/vignettes/residual_diagnostics.html
[11] 3.3 잔차 진단 | Forecasting: Principles and Practice – OTexts https://otexts.com/fppkr/residuals.html
[12] 5.3 Fitted values and residuals | Forecasting – OTexts https://otexts.com/fpp3/residuals.html
[13] 4.2 – Residuals vs. Fits Plot | STAT 462 https://online.stat.psu.edu/stat462/node/117/
[14] Residual plots for Fitted Line Plot – Support – Minitab https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/regression/how-to/fitted-line-plot/interpret-the-results/all-statistics-and-graphs/residual-plots/
[15] 5 Proven Ways to Interpret Residual Plots Accurately https://www.numberanalytics.com/blog/interpreting-residual-plots
[16] How to Interpret a Residual Plot | Algebra – Study.com https://study.com/skill/learn/how-to-interpret-a-residual-plot-explanation.html
[17] Understanding Diagnostic Plots for Linear Regression Analysis http://library.virginia.edu/data/articles/diagnostic-plots
[18] Residual Plots and Assumption Checking – StatsNotebook https://statsnotebook.io/blog/analysis/linearity_homoscedasticity/
[19] Heteroskedasticity and Autocorrelation simply explained https://www.finance-tutoring.fr/heteroskedasticity-and-autocorrelation-simply-explained?mobile=1
[20] 이분산 Heteroscedasticity – 우리들의 오늘을 기꺼이 이겨내가자 https://jinnnm-b.tistory.com/92
[21] 기초통계 – 이분산성(Heteroskedasticity) – Classic! – 티스토리 https://icefree.tistory.com/entry/%EA%B8%B0%EC%B4%88%ED%86%B5%EA%B3%84-%EC%9D%B4%EB%B6%84%EC%82%B0%EC%84%B1Heteroskedasticity
[22] OLS에서 잔차의 이분산성 완화 – velog https://velog.io/@watermelon870/OLS%EC%97%90%EC%84%9C-%EC%9E%94%EC%B0%A8%EC%9D%98-%EC%9D%B4%EB%B6%84%EC%82%B0%EC%84%B1-%EC%99%84%ED%99%94
[23] 1 https://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/statistics/07._regression_analysis_ii/01._autocorrelation_in_regression_i/et/9698_et_autocorrelation_i_text.pdf
[24] [PDF] Lecture 9-b ARIMA – Estimation & Diagnostic Testing https://bauer.uh.edu/rsusmel/4397/fec-9-b.pdf
[25] Residual analysis – I https://campus.datacamp.com/courses/arima-models-in-r/fitting-arma-models?ex=11
[26] Hat Matrix and Leverage – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/stats/hat-matrix-and-leverage.html
[27] Leverage (statistics) – Wikipedia https://en.wikipedia.org/wiki/Leverage_(statistics)
[28] 5.3 레버리지와 아웃라이어 – 데이터 사이언스 스쿨 https://datascienceschool.net/03%20machine%20learning/05.03%20%EB%A0%88%EB%B2%84%EB%A6%AC%EC%A7%80%EC%99%80%20%EC%95%84%EC%9B%83%EB%9D%BC%EC%9D%B4%EC%96%B4.html
[29] Cook’s Distance – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/stats/cooks-distance.html
[30] Using Cook’s Distance: Advanced Outlier Detection in Statistical … https://www.numberanalytics.com/blog/using-cooks-distance-advanced-outlier-detection
[31] Identifying Outliers in Linear Regression – Cook’s Distance https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a/
[32] Perform ARIMA Model Residual Diagnostics Using Econometric … https://www.mathworks.com/help/econ/perform-arima-model-residual-diagnostics-using-econometric-modeler.html
[33] Residual diagnostics for seasonal ARIMA model, time series analysis https://stats.stackexchange.com/questions/400775/residual-diagnostics-for-seasonal-arima-model-time-series-analysis
[34] Heteroscedasticity and Autocorrelation https://www.youtube.com/watch?v=q2t7byQ32sA
[35] Outlier detection using regression – Cross Validated – Stack Exchange https://stats.stackexchange.com/questions/104348/outlier-detection-using-regression
[36] Dealing with Outliers Using Three Robust Linear Regression Models https://developer.nvidia.com/blog/dealing-with-outliers-using-three-robust-linear-regression-models/
[37] The Ultimate Guide to Residual Analysis Techniques https://www.numberanalytics.com/blog/ultimate-guide-residual-analysis-techniques
[38] What are Residuals? – Displayr https://www.displayr.com/learn-what-are-residuals/
[39] Everything to Know About Residual Analysis – SixSigma.us https://www.6sigma.us/six-sigma-in-focus/residual-analysis/
[40] Residual Values (Residuals) in Regression Analysis – Statistics How … https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/residual/
[41] Residual Analysis – GeeksforGeeks https://www.geeksforgeeks.org/maths/residual-analysis/
[42] 오차, 잔차, 편차의 차이 (기초통계) python – DataAnalyst – 티스토리 https://signature95.tistory.com/49
[43] Residuals – Numeracy, Maths and Statistics – Academic Skills Kit https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/residuals.html
[44] 5 Key Techniques in Residual Analysis for Better Models https://www.numberanalytics.com/blog/5-key-techniques-residual-analysis-better-models
[45] Residual, Fitting Error 잔차 – [정보통신기술용어해설] http://www.ktword.co.kr/test/view/view.php?no=3832
[46] What Are Residuals in Statistics? – Statology https://www.statology.org/residuals/
[47] Statistics – Residuals, Analysis, Modeling – Britannica https://www.britannica.com/science/statistics/Residual-analysis
[48] 오차항(error) vs 잔차(residual) – 사고의 과정 – 티스토리 https://thought-process-ing.tistory.com/30
[49] Introduction to residuals (article) – Khan Academy https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/regression-library/a/introduction-to-residuals
[50] What Is Residual Analysis? – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/ident/ug/what-is-residual-analysis.html
[51] 5.4 잔차(residual) – 일반통계학(2017-1) https://enook.jbnu.ac.kr/contents/39/
[52] What Is a Residual in Stats? – Outlier Articles https://articles.outlier.org/what-is-a-residual-in-stats
[53] Documentation https://www.mathworks.com/help/releases/R2021a/stats/cooks-distance.html
[54] Residual Analysis – Scaler Topics https://www.scaler.com/topics/data-science/residual-analysis/
[55] Determines outliers using Cook’s Distance https://search.r-project.org/CRAN/refmans/referenceIntervals/html/cook.outliers.html
[56] Studentized Residuals https://www.youtube.com/watch?v=XiR9H6XeSOs
[57] Introduction to Residual Analysis https://www.youtube.com/watch?v=pbFyNsUuzV4
[58] Cook’s distance – Wikipedia https://en.wikipedia.org/wiki/Cook’s_distance
[59] Example: Residual Analysis https://support.ptc.com/help/mathcad/r10.0/en/PTC_Mathcad_Help/example_residual_analysis.html
[60] Residuals, standardized residuals, and Studentized residuals https://www.youtube.com/watch?v=y4hRD7EWdJ4
[61] What Is a Residual Plot? Definitions, Examples, and Applications https://dovetail.com/research/what-is-a-residual-plot/
[62] [linear regression] residual, SSR, OLS, linear … – 올리비아 코딩스쿨 https://olivia-blackcherry.tistory.com/593
[63] Statistics for the Social Sciences https://courses.lumenlearning.com/suny-hccc-wm-concepts-statistics/chapter/assessing-the-fit-of-a-line-2-of-4/
[64] Slide 1 https://mycourses.aalto.fi/pluginfile.php/1660393/mod_folder/content/0/lecture9a_Introduction.pdf
[65] Introduction https://www.stat.purdue.edu/~zhanghao/STAT514/Lecture_Notes/LectureNotes07-Checking-Assumptions-.html
[66] Origin Help – Residual Plot Analysis – OriginLab https://www.originlab.com/doc/origin-help/residual-plot-analysis
[67] =1=Heteroscedasticity and Autocorrelation https://www.math.stonybrook.edu/~gaston/print/Old/review/HeteroAuto.pdf
[68] Simple Linear Regression: Checking Assumptions with Residual Plots https://www.youtube.com/watch?v=iMdtTCX2Q70
[69] What Does A Good Residual Plot Look Like? – The Friendly Statistician https://www.youtube.com/watch?v=hkZhSMhxw-4
[70] Autocorrelation and heteroskedasticity in time series data [closed] https://stats.stackexchange.com/questions/313452/autocorrelation-and-heteroskedasticity-in-time-series-data
[71] Chapter 28 Assessing Assumptions | Extended R Examples for A First Course in Design and Analysis of Experiments, 2nd edition. http://users.stat.umn.edu/~gary/book/RExamples/assessing-assumptions.html
[72] Interpreting Residual Plots to Improve Your Regression – Qualtrics https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
[73] Time Series Regression VI: Residual Diagnostics – MathWorks https://www.mathworks.com/help/econ/time-series-regression-vi-residual-diagnostics.html
[74] How To Use Residuals For Time Series Forecasting – YouTube https://www.youtube.com/watch?v=owVs7bV1sZQ
[75] Residual Diagnostics – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/econ/compare-arma-models.html
[76] Microsoft Word – 9장 http://contents.kocw.net/document/ch9_5.pdf
[77] 3.3 Residual diagnostics | Forecasting: Principles and Practice (2nd … https://otexts.com/fpp2/residuals.html
[78] What are residuals in time series modeling? https://milvus.io/ai-quick-reference/what-are-residuals-in-time-series-modeling
[79] ARIMA Model Diagnostics & Residual Analysis https://apxml.com/courses/time-series-analysis-forecasting/chapter-4-arima-models-forecasting/arima-model-diagnostics
[80] 데이터과학을 위한 통계 리뷰 – 16일차 (가설검정,이분산성,영향값 … https://datacook.tistory.com/50
[81] Residuals in Time Series Models https://www.ucm.es/data/cont/docs/518-2013-11-11-JAM206.pdf
[82] [ SAS ] 이분산 (Heteroscedasticity) – bingsu’s Finance Diary – 티스토리 https://bing-su-b.tistory.com/137
[83] Time Series Regression VI: Residual Diagnostics https://it.mathworks.com/help/econ/time-series-regression-vi-residual-diagnostics.html
[84] 5.4 Residual diagnostics | Forecasting: Principles and Practice (3rd … https://otexts.com/fpp3/diagnostics.html
[85] 5.22 Outliers | Introduction to Regression Methods for Public Health … https://bookdown.org/rwnahhas/RMPH/mlr-outliers.html
[86] Hat matrix and leverages in classical multiple regression https://stats.stackexchange.com/questions/208242/hat-matrix-and-leverages-in-classical-multiple-regression
[87] Interpreting the residuals vs. fitted values plot for verifying the … https://stats.stackexchange.com/questions/76226/interpreting-the-residuals-vs-fitted-values-plot-for-verifying-the-assumptions
[88] Interpreting Residuals v Fitted – General – Posit Community https://forum.posit.co/t/interpreting-residuals-v-fitted/124776
[89] Outlier Detection and Effects on Modeling https://www.scirp.org/journal/paperinformation?paperid=102884
[90] [PDF] Regression in Practice – Brown Computer Science https://cs.brown.edu/courses/cs100/lectures/lecture18.pdf
[91] [PDF] Article – Survey weighted hat matrix and leverages https://www150.statcan.gc.ca/n1/pub/12-001-x/2009001/article/10881-eng.pdf
[92] Supervised outlier detection for classification and regression https://www.sciencedirect.com/science/article/pii/S0925231222002090
[93] [PDF] Statistical Leverage and Improved Matrix Algorithms https://www.stat.berkeley.edu/~mmahoney/talks/LeverageMatrix0308.pdf
[94] Unified methods for variable selection and outlier detection in a … http://www.csam.or.kr/journal/view.html?doi=10.29220%2FCSAM.2019.26.6.575
[95] 11.2 – Using Leverages to Help Identify Extreme x Values | STAT 501 https://online.stat.psu.edu/stat501/lesson/11/11.2
[96] Comparison Study of Outlier Detection Methods in a Regression … https://koreascience.kr/article/JAKO201319069652630.page
