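In short, no study has been identified to date that directly analyzes or optimizes the amount (or share) of data left for the actual inference (prediction) phase after a model has been trained. Existing work instead falls along the following two axes.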

Optimizing the training vs. validation (hold-out) split ratio

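Dobbin & Simon (2011) used a nonparametric resampling method to analyze how the training/test split ratio in high-dimensional classifier development affects the mean squared error (MSE) of the prediction-accuracy estimate, and proposed optimal split proportions [1].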
Joseph & Vakayil's (2021) SPlit, the OptHoldoutSize package, and similar tools focus on how to choose a hold-out set of size $$n$$ from the $$N$$ available cases so as to optimize generalization performance [2].
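
The resampling idea behind split-ratio studies can be illustrated with a small simulation. The following is only a minimal sketch under assumptions of my own, not the procedure from [1] or [2]: two one-dimensional Gaussian classes, a nearest-centroid classifier, and, as the target, the exact accuracy of the classifier trained on all cases. The names `split_mse`, `fit_threshold`, and `true_accuracy` are hypothetical.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_accuracy(threshold, mu=1.0):
    """Exact accuracy of the rule 'predict class 1 if x > threshold'
    when class 0 ~ N(-mu, 1), class 1 ~ N(+mu, 1), with equal priors."""
    return 0.5 * phi(threshold + mu) + 0.5 * (1.0 - phi(threshold - mu))


def fit_threshold(x0, x1):
    """Nearest-centroid classifier in 1D: midpoint of the two class means."""
    return 0.5 * (sum(x0) / len(x0) + sum(x1) / len(x1))


def split_mse(train_fraction, n_per_class=50, repeats=1000, mu=1.0, seed=0):
    """Simulated MSE of the hold-out accuracy estimate for one split ratio.

    Target (an assumption of this sketch): the true accuracy of the
    classifier trained on all available cases.
    """
    rng = random.Random(seed)
    # Keep at least one case per class on each side of the split.
    k = max(1, min(n_per_class - 1, round(train_fraction * n_per_class)))
    total = 0.0
    for _ in range(repeats):
        x0 = [rng.gauss(-mu, 1.0) for _ in range(n_per_class)]
        x1 = [rng.gauss(mu, 1.0) for _ in range(n_per_class)]
        target = true_accuracy(fit_threshold(x0, x1), mu)
        # Train on the first k cases per class, test on the remainder.
        t = fit_threshold(x0[:k], x1[:k])
        correct = sum(x <= t for x in x0[k:]) + sum(x > t for x in x1[k:])
        estimate = correct / (2 * (n_per_class - k))
        total += (estimate - target) ** 2
    return total / repeats
```

Calling `split_mse(f)` for several fractions such as 0.3, 0.5, 0.7, and 0.9 shows the trade-off these studies examine: a very small test set makes the accuracy estimate noisy, while a very small training set yields a classifier unlike the one trained on all the data.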

Estimating data requirements from learning curves

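Domhan et al. (2015), Kim & Viering (2022), Hoiem et al. (2021), and others proposed methods that model the learning curve (in power-law form) to predict the training-set size $$n$$ needed to reach a target performance.
Nakkiran et al. (2019) presented Generic Holdout, a framework that gives safe generalization guarantees from a small validation set in the adaptive data analysis setting, but it likewise does not treat the amount of remaining prediction data as an object of study [3].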
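
The learning-curve approach can be sketched in a few lines. This is a simplified illustration, not the method of any paper named above: it assumes a pure power law, error(n) = a · n^(-b), with no irreducible-error term, fits it by least squares in log-log space, and inverts it for a target error. The function names and the pilot numbers in the usage note are hypothetical.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error(n) = a * n**(-b) by least squares on log(error) vs log(n).

    Returns (a, b). Assumes every size and error is strictly positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def required_size(a, b, target_error):
    """Smallest n with a * n**(-b) <= target_error under the fitted curve."""
    return math.ceil((a / target_error) ** (1.0 / b))
```

For example, errors measured in pilot runs at 100, 200, 400, and 800 training samples can be passed to `fit_power_law`, and `required_size(a, b, 0.05)` then extrapolates how many samples a 5% error would need. Real curves usually flatten toward a non-zero floor, so an extrapolation from this sketch tends to be optimistic.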

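The studies above focus on “the number of training samples needed relative to the number of model parameters” or “the train/validation split ratio within the full dataset,”
but literature that analyzes, theoretically or empirically, “the amount (or share) of data that will arrive as future prediction requests in production once the model is finished” is still absent.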

In summary, there is currently no dedicated research on the question “how much inference data must remain after a model is built in order to guarantee sufficient generalization?”

Sources
[1] Optimally splitting cases for training and testing high dimen- https://brb.nci.nih.gov/techreport/v2bmc_articleRev1RS4Feb.pdf
[2] optimal_holdout_size_emulation: Estimate optimal holdout size under semi-parametric… in OptHoldoutSize: Estimation of Optimal Size for a Holdout Set for Updating a Predictive Score https://rdrr.io/cran/OptHoldoutSize/man/optimal_holdout_size_emulation.html
[3] arXiv:1809.05596v1 [stat.ME] 14 Sep 2018 http://arxiv.org/pdf/1809.05596.pdf
[4] What does batch size mean in inference? : r/LocalLLaMA – Reddit https://www.reddit.com/r/LocalLLaMA/comments/17sbwo5/what_does_batch_size_mean_in_inference/
[5] Data reconstruction from machine learning models via inverse … https://www.nature.com/articles/s41598-025-96215-z
[6] Why It Matters: Inference for Two Proportions | Concepts in Statistics https://courses.lumenlearning.com/wm-concepts-statistics/chapter/introduction-8/
[7] admin:appserver-set-max-inference-size — MarkLogic Server 11.0 … https://docs.marklogic.com/admin:appserver-set-max-inference-size
[8] Forecasting remaining useful life: Interpretable deep learning … https://www.sciencedirect.com/science/article/pii/S0167923619301290
[9] An Introduction to Inference for Two Proportions https://www.youtube.com/watch?v=g0at6LpYvHc
[10] Generalization in Adaptive Data Analysis and Holdout Reuse – arXiv https://arxiv.org/abs/1506.02629
[11] 502: Preventing Overfitting with Holdout https://winder.ai/502-preventing-overfitting-with-holdout/
[12] Inference Scaling Laws: An Empirical Analysis of Compute-Optimal … https://arxiv.org/html/2408.00724v3
[13] 7.3 Inference for Two-Sample Proportions – Significant Statistics: An Introduction to Statistics https://pressbooks.lib.vt.edu/significantstatistics/chapter/inference-for-two-sample-proportions/
[14] Simulated example https://cran.r-project.org/web/packages/OptHoldoutSize/vignettes/simulated_example.pdf
[15] Post Training in Deep Learning https://arxiv.org/pdf/1611.04499.pdf
[16] Estimation of minimal data sets sizes for machine learning … – Nature https://www.nature.com/articles/s41746-024-01360-w
[17] STAT509: Statistical inference for proportion https://people.stat.sc.edu/houp/Stat509/notes/Inference%20on%20single%20proportion.pdf
[18] [PDF] Splitting strategies for post-selection inference – arXiv https://arxiv.org/pdf/2102.02159.pdf
[19] Papers with Code – How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence https://paperswithcode.com/paper/how-post-training-reshapes-llms-a-mechanistic

Models do not last forever

It is easy to assume that a model, once trained, will remain valid indefinitely. Practice and research, however, show the following.

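Data and concept drift are real
Over time, two kinds of drift occur steadily [1]:

Data drift, in which the distribution of the input data changes
Concept drift, in which the relationship between the inputs and the target variable changes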
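
Data drift on a single numeric feature can be checked by comparing a reference sample (for example, from training time) with a recent production sample. The sketch below uses a two-sample Kolmogorov-Smirnov statistic with the usual large-sample critical value. That choice of test is my assumption and is not taken from [1]. The function names are hypothetical. It does not cover concept drift, which needs labels to detect.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs, evaluated at every observed value."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    for x in ref + cur:
        f_ref = bisect.bisect_right(ref, x) / len(ref)
        f_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the asymptotic critical
    value c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical
```

In a monitoring job, `reference` would typically be a stored sample of a feature from the training data and `current` a sliding window of recent requests. Running the check per feature multiplies the chance of false alarms, so a stricter `alpha` may be appropriate.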

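Performance degradation and the need for retraining
As drift accumulates, accuracy falls below the level measured at validation time, and the model no longer handles new prediction requests well.
The model therefore has to be retrained on a fixed schedule or whenever a performance criterion (a monitoring threshold) is crossed [2][3].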
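
The two triggers described here, a fixed schedule and a monitoring threshold, can be combined in one small monitor. This is a minimal sketch: the class name `RetrainingMonitor`, the window size, the 0.85 accuracy floor, and the 30-day maximum age are all illustrative assumptions, not values from [2] or [3].

```python
from collections import deque


class RetrainingMonitor:
    """Tracks rolling accuracy on labelled production predictions and
    signals retraining on either a schedule or a threshold breach."""

    def __init__(self, window=200, min_accuracy=0.85,
                 min_samples=50, max_age_days=30):
        self.outcomes = deque(maxlen=window)  # True = correct prediction
        self.min_accuracy = min_accuracy
        self.min_samples = min_samples
        self.max_age_days = max_age_days

    def record(self, prediction, label):
        """Store whether one prediction matched its (possibly delayed) label."""
        self.outcomes.append(prediction == label)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self, days_since_training):
        # Schedule-based trigger: the model is simply too old.
        if days_since_training >= self.max_age_days:
            return True
        # Threshold-based trigger: wait for enough evidence first.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.rolling_accuracy() < self.min_accuracy
```

The `min_samples` guard keeps a handful of early mistakes from triggering a retrain. In practice labels often arrive late, so the rolling accuracy may lag behind the drift it is meant to catch.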
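
Managing models with a finite lifespan
– Manage the model from a lifecycle perspective: birth → operation → retirement.
– Define monitoring and retraining during operation, version control, and the retirement point explicitly [4].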
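
One way to make the lifecycle explicit is a small state machine that only allows forward transitions. The stage names and the `ModelVersion` class below are hypothetical and do not mirror the API of any particular model registry. Retraining is assumed to produce a new version rather than move an old one back to an earlier stage.

```python
from enum import Enum


class Stage(Enum):
    DEVELOPMENT = "development"  # "birth": trained, not yet serving
    PRODUCTION = "production"    # "operation": serving predictions
    RETIRED = "retired"          # "retirement": no longer served


# Allowed forward transitions; a retired version never comes back.
ALLOWED = {
    Stage.DEVELOPMENT: {Stage.PRODUCTION, Stage.RETIRED},
    Stage.PRODUCTION: {Stage.RETIRED},
    Stage.RETIRED: set(),
}


class ModelVersion:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.stage = Stage.DEVELOPMENT
        self.history = [Stage.DEVELOPMENT]

    def transition(self, new_stage):
        """Move to new_stage, or raise ValueError if the move is not allowed."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"{self.name} v{self.version}: "
                f"{self.stage.value} -> {new_stage.value} is not allowed"
            )
        self.stage = new_stage
        self.history.append(new_stage)
```

Keeping the transition history on each version gives a simple audit trail of when a model entered and left production.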
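
Future prediction request volume and retraining strategy
Although, as discussed above, research on the ‘amount of remaining inference data’ is lacking, in practice the retraining cadence and the maintenance strategy are set by weighing the following [5]:

Prediction request patterns (traffic)
Retraining cost
Rate of performance degradation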
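
How these three factors interact can be seen in a toy cost model. Everything in it is an illustrative assumption of mine rather than something taken from [5]: the error rate is assumed to grow linearly by `decay_per_day` after each retrain, every extra error costs `cost_per_error`, and traffic is constant at `requests_per_day`. The average daily cost of retraining every T days is then C/T + r·v·d·T/2, which is minimized at T* = sqrt(2C / (r·v·d)).

```python
import math


def cost_per_day(period_days, retrain_cost, requests_per_day,
                 cost_per_error, decay_per_day):
    """Average daily cost of retraining every period_days days."""
    retraining = retrain_cost / period_days
    # Under linear decay, the mean extra error rate over one cycle
    # is decay_per_day * period_days / 2.
    degradation = (requests_per_day * cost_per_error
                   * decay_per_day * period_days / 2.0)
    return retraining + degradation


def optimal_period(retrain_cost, requests_per_day,
                   cost_per_error, decay_per_day):
    """Period that minimizes cost_per_day under the linear-decay assumption."""
    return math.sqrt(
        2.0 * retrain_cost
        / (requests_per_day * cost_per_error * decay_per_day)
    )
```

The formula shows the expected directions: heavier traffic or faster degradation shortens the optimal period, and a more expensive retraining run lengthens it. Real degradation is rarely linear, so the result is at most a starting point to check against monitoring data.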

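In the end, a model does not stay in production ‘forever’. Only by extending its life through continuous monitoring and updates, and by replacing it with a new model at the right time, can a practical and trustworthy prediction system be maintained.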

Sources
[1] Concept drift detection explained | Sapien's AI Glossary – Sapien https://www.sapien.io/ko/glossary/definition/concept-drift-detection
[2] Model Retraining in 2025: Why & How to Retrain ML Models? https://research.aimultiple.com/model-retraining/
[3] What is model drift? | IBM https://www.ibm.com/kr-ko/think/topics/model-drift
[4] Amazon SageMaker Model Registry, now machine learning model lifecycle … https://aws.amazon.com/ko/about-aws/whats-new/2024/11/amazon-sagemaker-model-registry-defining-machine-learning-lifecycle-stages
[5] How Often Should You Retrain Machine Learning Models? – NILG.AI https://nilg.ai/202403/how-often-should-you-retrain-machine-learning-models/
[6] 03) Periodic model retraining – The Mysterious Transformation of Data – WikiDocs https://wikidocs.net/198989
[7] The Lifetime of a Machine Learning Model | Towards Data Science https://towardsdatascience.com/the-lifetime-of-a-machine-learning-model-392e1fadf84a/
[8] [Model Drift] Model Drift from A to Z # 2. Detection methods and Handling … https://calmmimiforest.tistory.com/120
[9] The Machine Learning Life Cycle Explained – DataCamp https://www.datacamp.com/blog/machine-learning-lifecycle-explained
[10] [Paper review] Unveiling Group-Specific Distributed Concept Drift https://www.themoonlight.io/ko/review/unveiling-group-specific-distributed-concept-drift-a-fairness-imperative-in-federated-learning
[11] Continual learning and testing in production – velog https://velog.io/@kyyle/%EC%97%B0%EC%86%8D-%ED%95%99%EC%8A%B5%EA%B3%BC-%ED%94%84%EB%A1%9C%EB%8D%95%EC%85%98-%ED%85%8C%EC%8A%A4%ED%8A%B8
[12] What Is A Model Lifecycle? – Yields.io https://www.yields.io/blog/what-is-model-lifecycle/
[13] The Retirement Decision in Dynamic Microsimulation Models: An Exploratory Review https://www.microsimulation.pub/articles/00287
[14] Forecasting remaining useful life: Interpretable deep learning approach via variational Bayesian inferences https://dl.acm.org/doi/10.1016/J.DSS.2019.113100
[15] [PDF] Model Lifecycle Management for MBSE – OMG Wiki https://www.omgwiki.org/MBSE/lib/exe/fetch.php?media=mbse%3Amodel_lifecycle_management_for_mbse_v4.pdf
[16] FINANCIAL SECURITY DIVISION https://longevity.stanford.edu/wp-content/uploads/2017/03/DTR-Recommendations-Final-9-21-16.pdf
[17] A machine learning based prediction model for life expectancy https://zenodo.org/records/7188338
[18] What is AI Lifecycle Management? – IBM https://www.ibm.com/think/topics/ai-lifecycle
[19] A Tax-Efficient Model Predictive Control Policy https://web.stanford.edu/~boyd/papers/pdf/retirement.pdf
[20] Development of a Machine Learning Model Using Limited Features … https://pmc.ncbi.nlm.nih.gov/articles/PMC9067363/
[21] [PDF] scoring and prediction of early retirement using machine learning … https://www.actuarios.org/wp-content/uploads/2019/12/Art6-Anales2019.pdf
[22] Register and work with models – Azure Machine Learning – Learn Microsoft https://learn.microsoft.com/ko-kr/azure/machine-learning/how-to-manage-models?view=azureml-api-2
[23] AI Mortality Model Predicts End of Life, Boosts Palliative Care https://www.targetedonc.com/view/ai-mortality-model-predicts-end-of-life-boosts-palliative-care
[24] Development of a retirement prediction model based on work-life profiles using machine learning methods https://koreascience.kr/article/JAKO201711656706941.page
[25] Model training – Amazon SageMaker AI https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/train-model.html
[26] Machine learning models to predict 6-month mortality risk in home … https://www.sciencedirect.com/science/article/pii/S2347562525000277
[27] [PDF] Machine learning predicts lifespan and underlying causes of death … https://www.biorxiv.org/content/10.1101/2024.03.20.585803v1.full.pdf
[28] Model management lifecycle – Finance | Dynamics 365 https://learn.microsoft.com/en-us/dynamics365/finance/finance-insights/model-manage-lifecycle
[29] Azure OpenAI in Azure AI Foundry Models model retirements https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/model-retirements
[30] MLflow for the ML model lifecycle – Azure Databricks | Microsoft Learn https://learn.microsoft.com/ko-kr/azure/databricks/mlflow/
