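Definition and Basic Concepts of Residuals
In statistics, a residual is the difference between an observed value and the value a model predicts for it[1][2]. Formally:
$$ \text{residual} = \text{observed value} - \text{predicted value} $$
$$ e_i = y_i - \hat{y}_i $$
where:
- $$ e_i $$: the residual of the i-th observation
- $$ y_i $$: the i-th observed value
- $$ \hat{y}_i $$: the i-th predicted (fitted) value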
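As a minimal sketch of the definition above, the following fits a simple least-squares line to a small made-up dataset and computes each residual as observed minus predicted. The data values and variable names are assumptions chosen only for illustration.

```python
from statistics import mean

# Hypothetical data: x is the predictor, y the observed response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares slope and intercept for a simple linear regression.
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# Predicted values and residuals e_i = y_i - y_hat_i.
y_hat = [intercept + slope * xi for xi in x]
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

for xi, yi, yhi, ei in zip(x, y, y_hat, residuals):
    print(f"x={xi:.1f}  y={yi:.1f}  y_hat={yhi:.3f}  e={ei:+.3f}")
```

For a least-squares fit that includes an intercept, these residuals sum to zero up to rounding error.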
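Errors vs. Residuals
Errors and residuals are often confused, but they are distinct concepts[1][3]:
Item | Error | Residual |
---|---|---|
Definition | Difference between an observed value and the true (population) regression function | Difference between an observed value and the regression function estimated from the sample |
Computable? | Generally not computable | Computable from the data |
Notation | ε (epsilon) | e |
Use | Theoretical concept | Used in practical diagnostics |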
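Types of Residuals
1. Raw Residuals
The most basic form: the observed value minus the predicted value[4][5].
2. Standardized Residuals
A raw residual divided by its estimated standard deviation, which puts residuals on a common scale and makes them easier to compare across observations and models[4][5]:
$$ r_i = \frac{e_i}{\sqrt{MSE(1-h_{ii})}} $$
where MSE is the mean squared error of the fit and $$ h_{ii} $$ is the leverage of the i-th observation.
3. Studentized Residuals
Also called externally studentized residuals, these use a standard error computed with the observation in question left out[6][4][7]:
$$ r_i^* = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}} $$
Under the usual model assumptions this statistic follows a t-distribution with n - p - 1 degrees of freedom (n observations, p estimated parameters), which makes it more effective for outlier detection[8][7].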
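The two formulas above can be sketched with NumPy as follows. The data are made up, and the leave-one-out variance is obtained from the standard deletion identity rather than by refitting the model n times. That shortcut is a choice made for this sketch, not something the cited sources prescribe.

```python
import numpy as np

# Hypothetical data: 8 observations, one predictor plus an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 9.0, 7.1, 7.9])
X = np.column_stack([np.ones_like(x), x])   # design matrix
n, p = X.shape

# OLS fit, raw residuals, and leverages h_ii (diagonal of the hat matrix).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Standardized residuals: r_i = e_i / sqrt(MSE * (1 - h_ii)).
mse = e @ e / (n - p)
r = e / np.sqrt(mse * (1.0 - h))

# Leave-one-out variance sigma_hat_(i)^2 via the deletion identity
# SSE_(i) = SSE - e_i^2 / (1 - h_ii), used here instead of n refits.
sigma2_loo = ((n - p) * mse - e**2 / (1.0 - h)) / (n - p - 1)

# Studentized residuals: r*_i = e_i / (sigma_hat_(i) * sqrt(1 - h_ii)).
r_star = e / np.sqrt(sigma2_loo * (1.0 - h))

print(np.round(r, 3))
print(np.round(r_star, 3))
```

The sixth observation (y = 9.0) was deliberately placed off the trend. Its studentized residual is larger in magnitude than its standardized one, because leaving it out shrinks the variance estimate.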
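Purpose and Importance of Residual Analysis
Checking Model Assumptions
Residual analysis is used to check the basic assumptions of a regression model[9][10]:
- Linearity: the residuals scatter randomly, with no systematic trend
- Homoscedasticity: the variance of the residuals is constant
- Independence: the residuals are uncorrelated with one another
- Normality: the residuals follow a normal distribution
Assessing Model Fit
A good model produces residuals with the following properties[11][12]:
- The residuals have mean zero
- The residuals are uncorrelated with one another
- The residuals have constant variance
- The residuals follow a normal distribution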
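Residual Plots and Their Interpretation
1. Residuals vs Fitted Plot
The most commonly used diagnostic tool[13][14]:
Ideal pattern:
- Points scattered randomly around the zero line
- No discernible pattern
- Constant spread
Problematic patterns[15][16]:
- U-shaped pattern: nonlinearity
- Fanning (funnel) pattern: heteroscedasticity
- Curved pattern: a missing higher-order term
2. Normal Q-Q Plot
A plot for checking the normality of the residuals[9][17]:
- Points that follow a straight line indicate normality
- A curved or S-shaped pattern indicates a departure from normality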
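To make the Q-Q idea concrete without a plotting library, the sketch below computes the coordinates a normal Q-Q plot would draw and summarizes how straight they are with a correlation coefficient. The residual values are made up, and the plotting position (i - 0.5)/n is one common convention assumed here.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals from some fitted model.
residuals = [-1.8, -0.9, -0.6, -0.2, 0.0, 0.1, 0.4, 0.7, 1.0, 1.3]
n = len(residuals)

# Standardize and sort the residuals (sample quantiles).
m, s = mean(residuals), stdev(residuals)
sample_q = sorted((e - m) / s for e in residuals)

# Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n.
theory_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Correlation between the two sets of quantiles: values close to 1 mean
# the points lie near a straight line.
tq_bar, sq_bar = mean(theory_q), mean(sample_q)
num = sum((t - tq_bar) * (q - sq_bar) for t, q in zip(theory_q, sample_q))
den = (
    sum((t - tq_bar) ** 2 for t in theory_q)
    * sum((q - sq_bar) ** 2 for q in sample_q)
) ** 0.5
qq_corr = num / den

for t, q in zip(theory_q, sample_q):
    print(f"theoretical={t:+.3f}  sample={q:+.3f}")
print(f"Q-Q correlation: {qq_corr:.3f}")
```

Plotting `sample_q` against `theory_q` gives the Q-Q plot itself. The correlation is only a rough numeric summary and does not replace looking at the plot.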
3. Scale-Location Plot
A plot for checking homoscedasticity; it plots the square root of the absolute standardized residuals against the fitted values[9][18].
Diagnosing Problems with Residuals
Heteroscedasticity
The variance of the residuals is not constant[19][20][21]:
Causes:
- Model misspecification
- Omitted variables
- Measurement error
Tests:
- White test[20][22]
- Breusch-Pagan test[22]
- Goldfeld-Quandt test[21]
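As a rough sketch of how one of these tests works, the following implements the studentized (Koenker) form of the Breusch-Pagan test by hand: regress the squared OLS residuals on the regressors and use n times the R-squared of that auxiliary regression as the LM statistic. The simulated data, the seed, and all variable names are assumptions for illustration; a statistics library would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): the noise standard
# deviation grows with x, so the errors are heteroscedastic by construction.
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)
X = np.column_stack([np.ones(n), x])

# Step 1: OLS fit and squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ beta) ** 2

# Step 2: auxiliary regression of e^2 on the same regressors.
gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
fitted = X @ gamma
r2_aux = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)

# Step 3: LM statistic n * R^2, compared with a chi-square distribution whose
# degrees of freedom equal the number of non-constant regressors (1 here).
lm_stat = n * r2_aux
CHI2_CRIT_1DF_5PCT = 3.841  # 5% critical value of chi-square with 1 df

print(f"LM = {lm_stat:.2f}, reject homoscedasticity: {lm_stat > CHI2_CRIT_1DF_5PCT}")
```

With more than one non-constant regressor, the critical value must come from a chi-square distribution with correspondingly more degrees of freedom.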
Autocorrelation
In time series data, the residuals are correlated with one another[19][23]:
Characteristics:
- $$ Cov(\epsilon_t, \epsilon_{t-k}) \neq 0 $$ for some lag k ≠ 0
- Patterns that follow the time order
Tests:
- Durbin-Watson test
- Ljung-Box test[24][25]
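The Durbin-Watson statistic is simple enough to compute directly. The sketch below uses two tiny made-up residual sequences to show how the statistic moves away from 2 in each direction; the function name and data are illustrative only.

```python
def durbin_watson(residuals):
    """Return DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual sequences, in time order.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step
persistent = [1.0, 1.0, -1.0, -1.0]   # neighbours tend to share a sign

print(durbin_watson(alternating))  # 3.0
print(durbin_watson(persistent))   # 1.0
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.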
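Detecting Outliers and Influential Observations
Leverage
Leverage measures how far an observation's predictor values lie from those of the other observations, and therefore how much potential it has to influence the fitted regression line[26][27][28]:
$$ h_{ii} = x_i^T(X^TX)^{-1}x_i $$
Properties:
- $$ 0 \leq h_{ii} \leq 1 $$
- $$ \sum h_{ii} = p $$ (the number of parameters)
- $$ h_{ii} > 2p/n $$ indicates high leverage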
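The properties listed above can be checked numerically. This sketch computes every leverage value from the formula and applies the 2p/n rule; the predictor values are made up, with the last one placed far from the rest on purpose.

```python
import numpy as np

# Hypothetical predictor values; the last one lies far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
X = np.column_stack([np.ones_like(x), x])   # intercept + one predictor
n, p = X.shape

# h_ii = x_i^T (X^T X)^{-1} x_i, computed for every row at once.
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Rule of thumb from the text: h_ii > 2p/n indicates high leverage.
threshold = 2 * p / n
high = np.flatnonzero(h > threshold)

print(np.round(h, 3))
print(f"sum(h) = {h.sum():.3f}, threshold 2p/n = {threshold:.3f}, high-leverage rows: {high}")
```

Only the row with x = 20 exceeds the threshold, and the leverages sum to p = 2 as stated above.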
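Cook’s Distance
An overall measure of the influence of an individual observation, combining the size of its residual with its leverage[29][30][31]:
$$ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1-h_{ii}} $$
where $$ r_i $$ is the standardized residual. Rule of thumb: an observation with $$ D_i > 3 \times \text{mean}(D) $$ is treated as influential.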
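A minimal NumPy sketch of the formula and the rule of thumb above follows. The dataset is made up: ten points close to a line, plus one high-leverage point whose response is pushed far off the line.

```python
import numpy as np

# Hypothetical data: points near y = 1 + 2x, plus one high-leverage point
# (x = 25) whose response lies far below the line.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25], dtype=float)
y = 1.0 + 2.0 * x + 0.1 * np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)
y[-1] -= 30.0
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# OLS fit, leverages, and standardized residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
mse = e @ e / (n - p)
r = e / np.sqrt(mse * (1.0 - h))

# Cook's distance: D_i = r_i^2 / p * h_ii / (1 - h_ii).
D = r**2 / p * h / (1.0 - h)

# Rule of thumb from the text: D_i > 3 * mean(D).
flagged = np.flatnonzero(D > 3 * D.mean())
print(np.round(D, 3))
print("flagged as influential:", flagged)
```

The same values can be obtained by refitting without each observation and measuring how far the coefficient vector moves. The closed form used here avoids those refits.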
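Residual Analysis in Time Series Models
ARIMA Model Diagnostics
In a time series model, the residuals should behave like white noise[32][12][24]:
Items to check:
- Autocorrelation function (ACF) of the residuals
- Normality of the residuals
- Constant variance of the residuals
- Ljung-Box test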
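The first and last items in this checklist can be sketched in plain Python: a sample autocorrelation function and the Ljung-Box Q statistic built from it. The residual series is made up and deliberately far from white noise; the function names are illustrative.

```python
def autocorr(e, k):
    """Sample autocorrelation of the series e at lag k."""
    n = len(e)
    m = sum(e) / n
    denom = sum((v - m) ** 2 for v in e)
    return sum((e[t] - m) * (e[t - k] - m) for t in range(k, n)) / denom


def ljung_box(e, max_lag):
    """Ljung-Box Q = n (n + 2) * sum_{k=1..h} rho_k^2 / (n - k)."""
    n = len(e)
    return n * (n + 2) * sum(
        autocorr(e, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Hypothetical residuals that alternate in sign, i.e. clearly not white noise.
resid = [1.0, -1.0] * 10

q = ljung_box(resid, max_lag=1)
print(f"rho_1 = {autocorr(resid, 1):.2f}, Q = {q:.1f}")  # rho_1 = -0.95, Q = 20.9
```

Q is compared with a chi-square distribution whose degrees of freedom are the number of lags tested, usually reduced by the number of fitted ARMA parameters. A value of 20.9 is far above the 5% critical value of 3.841 for one degree of freedom, so these residuals would not pass as white noise.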
Seasonal Model Diagnostics
For a seasonal ARIMA model, check the residuals for patterns that recur at the seasonal period[25][33].
Practical Use and Remedies
Remedies for Heteroscedasticity
- Robust standard errors[21][22]
- Weighted least squares (WLS)[21]
- Generalized least squares (GLS/FGLS)[21]
- Variable transformation[18][21]
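As one illustration of these remedies, the sketch below applies weighted least squares to simulated data whose noise standard deviation is proportional to x. The weights 1/x² assume that variance structure is known, which is rarely true in practice; the data, seed, and names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (an assumption for illustration): noise sd proportional to x.
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)
X = np.column_stack([np.ones(n), x])

# Weights are the inverse of the assumed error variance, here proportional to x^2.
w = 1.0 / x**2

# WLS is equivalent to OLS on rows rescaled by sqrt(w_i).
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("OLS:", np.round(beta_ols, 3))
print("WLS:", np.round(beta_wls, 3))
```

Both estimators are unbiased under this setup. The benefit of WLS is efficiency: it gives noisy observations less weight, so its coefficient estimates vary less from sample to sample.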
Remedies for Autocorrelation
- ARIMA modeling[23][34]
- Differencing[34]
- Adding lagged variables[23]
Handling Outliers
- Robust regression[35][36]
- Removing outliers[31][35]
- Mitigation through transformation[36]
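Residual Analysis in Software
Residual Analysis in R
# Fit a linear model (mtcars is a built-in dataset, used here only as an example)
model <- lm(mpg ~ wt, data = mtcars)
# Basic diagnostic plots
plot(model)
# Individual residual types
residuals(model) # raw residuals
rstandard(model) # standardized residuals
rstudent(model) # studentized residuals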
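Residual Analysis in Python
import statsmodels.api as sm
# Fit an OLS model; X (predictors) and y (response) are assumed to exist already
results = sm.OLS(y, sm.add_constant(X)).fit()
# Residual diagnostics are attributes of the fitted results object
results.resid # raw residuals
results.resid_pearson # Pearson residuals
results.get_influence() # influence statistics (leverage, Cook's distance, etc.)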
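Residual analysis is a core part of statistical modeling: it assesses how well a model fits and points to how the model can be improved. Careful residual analysis leads to more reliable and more accurate statistical models.
Sources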
[1] [회귀]오차와 잔차, 표준화 잔차 – Jangpiano Science – 티스토리 https://jangpiano-science.tistory.com/116
[2] Residuals – Statistics By Jim https://statisticsbyjim.com/glossary/residuals/
[3] [통계] 오차(error)와 잔차(residual)의 차이 – 홍시의 씽크탱크 – 티스토리 https://kimhongsi.tistory.com/entry/%ED%86%B5%EA%B3%84-%EC%98%A4%EC%B0%A8error%EC%99%80-%EC%9E%94%EC%B0%A8residual%EC%9D%98-%EC%B0%A8%EC%9D%B4
[4] [PDF] STAT 224 Lecture 10 Chapter 4 Model Diagnostics, Part 1 https://www.stat.uchicago.edu/~yibi/teaching/stat224/L10.pdf
[5] 9.3 – Identifying Outliers (Unusual Y Values) | STAT 462 https://online.stat.psu.edu/stat462/node/172/
[6] Studentized residuals https://www.ibm.com/docs/el/cognos-analytics/11.1.x?topic=terms-studentized-residuals
[7] 9.4 – Studentized Residuals | STAT 462 https://online.stat.psu.edu/stat462/node/247/
[8] Studentized residual test https://www.ibm.com/docs/en/cognos-analytics/11.1.0?topic=tests-studentized-residual-test
[9] Chapter 11 Testing regression assumptions – R for Survey Analysis https://bookdown.org/jimr1603/Intermediate_R_-R_for_Survey_Analysis/testing-regression-assumptions.html
[10] Residual Diagnostics https://cran.r-project.org/web/packages/olsrr/vignettes/residual_diagnostics.html
[11] 3.3 잔차 진단 | Forecasting: Principles and Practice – OTexts https://otexts.com/fppkr/residuals.html
[12] 5.3 Fitted values and residuals | Forecasting – OTexts https://otexts.com/fpp3/residuals.html
[13] 4.2 – Residuals vs. Fits Plot | STAT 462 https://online.stat.psu.edu/stat462/node/117/
[14] Residual plots for Fitted Line Plot – Support – Minitab https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/regression/how-to/fitted-line-plot/interpret-the-results/all-statistics-and-graphs/residual-plots/
[15] 5 Proven Ways to Interpret Residual Plots Accurately https://www.numberanalytics.com/blog/interpreting-residual-plots
[16] How to Interpret a Residual Plot | Algebra – Study.com https://study.com/skill/learn/how-to-interpret-a-residual-plot-explanation.html
[17] Understanding Diagnostic Plots for Linear Regression Analysis http://library.virginia.edu/data/articles/diagnostic-plots
[18] Residual Plots and Assumption Checking – StatsNotebook https://statsnotebook.io/blog/analysis/linearity_homoscedasticity/
[19] Heteroskedasticity and Autocorrelation simply explained https://www.finance-tutoring.fr/heteroskedasticity-and-autocorrelation-simply-explained?mobile=1
[20] 이분산 Heteroscedasticity – 우리들의 오늘을 기꺼이 이겨내가자 https://jinnnm-b.tistory.com/92
[21] 기초통계 – 이분산성(Heteroskedasticity) – Classic! – 티스토리 https://icefree.tistory.com/entry/%EA%B8%B0%EC%B4%88%ED%86%B5%EA%B3%84-%EC%9D%B4%EB%B6%84%EC%82%B0%EC%84%B1Heteroskedasticity
[22] OLS에서 잔차의 이분산성 완화 – velog https://velog.io/@watermelon870/OLS%EC%97%90%EC%84%9C-%EC%9E%94%EC%B0%A8%EC%9D%98-%EC%9D%B4%EB%B6%84%EC%82%B0%EC%84%B1-%EC%99%84%ED%99%94
[23] 1 https://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/statistics/07._regression_analysis_ii/01._autocorrelation_in_regression_i/et/9698_et_autocorrelation_i_text.pdf
[24] [PDF] Lecture 9-b ARIMA – Estimation & Diagnostic Testing https://bauer.uh.edu/rsusmel/4397/fec-9-b.pdf
[25] Residual analysis – I https://campus.datacamp.com/courses/arima-models-in-r/fitting-arma-models?ex=11
[26] Hat Matrix and Leverage – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/stats/hat-matrix-and-leverage.html
[27] Leverage (statistics) – Wikipedia https://en.wikipedia.org/wiki/Leverage_(statistics)
[28] 5.3 레버리지와 아웃라이어 – 데이터 사이언스 스쿨 https://datascienceschool.net/03%20machine%20learning/05.03%20%EB%A0%88%EB%B2%84%EB%A6%AC%EC%A7%80%EC%99%80%20%EC%95%84%EC%9B%83%EB%9D%BC%EC%9D%B4%EC%96%B4.html
[29] Cook’s Distance – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/stats/cooks-distance.html
[30] Using Cook’s Distance: Advanced Outlier Detection in Statistical … https://www.numberanalytics.com/blog/using-cooks-distance-advanced-outlier-detection
[31] Identifying Outliers in Linear Regression – Cook’s Distance https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a/
[32] Perform ARIMA Model Residual Diagnostics Using Econometric … https://www.mathworks.com/help/econ/perform-arima-model-residual-diagnostics-using-econometric-modeler.html
[33] Residual diagnostics for seasonal ARIMA model, time series analysis https://stats.stackexchange.com/questions/400775/residual-diagnostics-for-seasonal-arima-model-time-series-analysis
[34] Heteroscedasticity and Autocorrelation https://www.youtube.com/watch?v=q2t7byQ32sA
[35] Outlier detection using regression – Cross Validated – Stack Exchange https://stats.stackexchange.com/questions/104348/outlier-detection-using-regression
[36] Dealing with Outliers Using Three Robust Linear Regression Models https://developer.nvidia.com/blog/dealing-with-outliers-using-three-robust-linear-regression-models/
[37] The Ultimate Guide to Residual Analysis Techniques https://www.numberanalytics.com/blog/ultimate-guide-residual-analysis-techniques
[38] What are Residuals? – Displayr https://www.displayr.com/learn-what-are-residuals/
[39] Everything to Know About Residual Analysis – SixSigma.us https://www.6sigma.us/six-sigma-in-focus/residual-analysis/
[40] Residual Values (Residuals) in Regression Analysis – Statistics How … https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/residual/
[41] Residual Analysis – GeeksforGeeks https://www.geeksforgeeks.org/maths/residual-analysis/
[42] 오차, 잔차, 편차의 차이 (기초통계) python – DataAnalyst – 티스토리 https://signature95.tistory.com/49
[43] Residuals – Numeracy, Maths and Statistics – Academic Skills Kit https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/residuals.html
[44] 5 Key Techniques in Residual Analysis for Better Models https://www.numberanalytics.com/blog/5-key-techniques-residual-analysis-better-models
[45] Residual, Fitting Error 잔차 – [정보통신기술용어해설] http://www.ktword.co.kr/test/view/view.php?no=3832
[46] What Are Residuals in Statistics? – Statology https://www.statology.org/residuals/
[47] Statistics – Residuals, Analysis, Modeling – Britannica https://www.britannica.com/science/statistics/Residual-analysis
[48] 오차항(error) vs 잔차(residual) – 사고의 과정 – 티스토리 https://thought-process-ing.tistory.com/30
[49] Introduction to residuals (article) – Khan Academy https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/regression-library/a/introduction-to-residuals
[50] What Is Residual Analysis? – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/ident/ug/what-is-residual-analysis.html
[51] 5.4 잔차(residual) – 일반통계학(2017-1) https://enook.jbnu.ac.kr/contents/39/
[52] What Is a Residual in Stats? – Outlier Articles https://articles.outlier.org/what-is-a-residual-in-stats
[53] Documentation https://www.mathworks.com/help/releases/R2021a/stats/cooks-distance.html
[54] Residual Analysis – Scaler Topics https://www.scaler.com/topics/data-science/residual-analysis/
[55] Determines outliers using Cook’s Distance https://search.r-project.org/CRAN/refmans/referenceIntervals/html/cook.outliers.html
[56] Studentized Residuals https://www.youtube.com/watch?v=XiR9H6XeSOs
[57] Introduction to Residual Analysis https://www.youtube.com/watch?v=pbFyNsUuzV4
[58] Cook’s distance – Wikipedia https://en.wikipedia.org/wiki/Cook’s_distance
[59] Example: Residual Analysis https://support.ptc.com/help/mathcad/r10.0/en/PTC_Mathcad_Help/example_residual_analysis.html
[60] Residuals, standardized residuals, and Studentized residuals https://www.youtube.com/watch?v=y4hRD7EWdJ4
[61] What Is a Residual Plot? Definitions, Examples, and Applications https://dovetail.com/research/what-is-a-residual-plot/
[62] [linear regression] residual, SSR, OLS, linear … – 올리비아 코딩스쿨 https://olivia-blackcherry.tistory.com/593
[63] Statistics for the Social Sciences https://courses.lumenlearning.com/suny-hccc-wm-concepts-statistics/chapter/assessing-the-fit-of-a-line-2-of-4/
[64] Slide 1 https://mycourses.aalto.fi/pluginfile.php/1660393/mod_folder/content/0/lecture9a_Introduction.pdf
[65] Introduction https://www.stat.purdue.edu/~zhanghao/STAT514/Lecture_Notes/LectureNotes07-Checking-Assumptions-.html
[66] Origin Help – Residual Plot Analysis – OriginLab https://www.originlab.com/doc/origin-help/residual-plot-analysis
[67] =1=Heteroscedasticity and Autocorrelation https://www.math.stonybrook.edu/~gaston/print/Old/review/HeteroAuto.pdf
[68] Simple Linear Regression: Checking Assumptions with Residual Plots https://www.youtube.com/watch?v=iMdtTCX2Q70
[69] What Does A Good Residual Plot Look Like? – The Friendly Statistician https://www.youtube.com/watch?v=hkZhSMhxw-4
[70] Autocorrelation and heteroskedasticity in time series data [closed] https://stats.stackexchange.com/questions/313452/autocorrelation-and-heteroskedasticity-in-time-series-data
[71] Chapter 28 Assessing Assumptions | Extended R Examples for A First Course in Design and Analysis of Experiments, 2nd edition. http://users.stat.umn.edu/~gary/book/RExamples/assessing-assumptions.html
[72] Interpreting Residual Plots to Improve Your Regression – Qualtrics https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
[73] Time Series Regression VI: Residual Diagnostics – MathWorks https://www.mathworks.com/help/econ/time-series-regression-vi-residual-diagnostics.html
[74] How To Use Residuals For Time Series Forecasting – YouTube https://www.youtube.com/watch?v=owVs7bV1sZQ
[75] Residual Diagnostics – MATLAB & Simulink – MathWorks https://www.mathworks.com/help/econ/compare-arma-models.html
[76] Microsoft Word – 9장 http://contents.kocw.net/document/ch9_5.pdf
[77] 3.3 Residual diagnostics | Forecasting: Principles and Practice (2nd … https://otexts.com/fpp2/residuals.html
[78] What are residuals in time series modeling? https://milvus.io/ai-quick-reference/what-are-residuals-in-time-series-modeling
[79] ARIMA Model Diagnostics & Residual Analysis https://apxml.com/courses/time-series-analysis-forecasting/chapter-4-arima-models-forecasting/arima-model-diagnostics
[80] 데이터과학을 위한 통계 리뷰 – 16일차 (가설검정,이분산성,영향값 … https://datacook.tistory.com/50
[81] Residuals in Time Series Models https://www.ucm.es/data/cont/docs/518-2013-11-11-JAM206.pdf
[82] [ SAS ] 이분산 (Heteroscedasticity) – bingsu’s Finance Diary – 티스토리 https://bing-su-b.tistory.com/137
[83] Time Series Regression VI: Residual Diagnostics https://it.mathworks.com/help/econ/time-series-regression-vi-residual-diagnostics.html
[84] 5.4 Residual diagnostics | Forecasting: Principles and Practice (3rd … https://otexts.com/fpp3/diagnostics.html
[85] 5.22 Outliers | Introduction to Regression Methods for Public Health … https://bookdown.org/rwnahhas/RMPH/mlr-outliers.html
[86] Hat matrix and leverages in classical multiple regression https://stats.stackexchange.com/questions/208242/hat-matrix-and-leverages-in-classical-multiple-regression
[87] Interpreting the residuals vs. fitted values plot for verifying the … https://stats.stackexchange.com/questions/76226/interpreting-the-residuals-vs-fitted-values-plot-for-verifying-the-assumptions
[88] Interpreting Residuals v Fitted – General – Posit Community https://forum.posit.co/t/interpreting-residuals-v-fitted/124776
[89] Outlier Detection and Effects on Modeling https://www.scirp.org/journal/paperinformation?paperid=102884
[90] [PDF] Regression in Practice – Brown Computer Science https://cs.brown.edu/courses/cs100/lectures/lecture18.pdf
[91] [PDF] Article – Survey weighted hat matrix and leverages https://www150.statcan.gc.ca/n1/pub/12-001-x/2009001/article/10881-eng.pdf
[92] Supervised outlier detection for classification and regression https://www.sciencedirect.com/science/article/pii/S0925231222002090
[93] [PDF] Statistical Leverage and Improved Matrix Algorithms https://www.stat.berkeley.edu/~mmahoney/talks/LeverageMatrix0308.pdf
[94] Unified methods for variable selection and outlier detection in a … http://www.csam.or.kr/journal/view.html?doi=10.29220%2FCSAM.2019.26.6.575
[95] 11.2 – Using Leverages to Help Identify Extreme x Values | STAT 501 https://online.stat.psu.edu/stat501/lesson/11/11.2
[96] Comparison Study of Outlier Detection Methods in a Regression … https://koreascience.kr/article/JAKO201319069652630.page