Below is a summary of the main studies that either directly optimize what fraction of the available data should be reserved for future prediction and inference (the hold-out ratio, i.e., the training/validation split ratio) or use learning curves to estimate how much data is needed.
- Optimal Ratio for Data Splitting
V. Roshan Joseph proposed, under a linear regression model, a training-to-validation split ratio of $$\sqrt{p}:1$$, where $$p$$ is the number of model parameters[1]. Rather than the classical 'sample complexity' question of how much data is needed relative to the parameter count, this work addresses theoretically how to carve a validation set of size $$n$$ out of a given dataset of size $$N$$ so as to minimize the overall generalization error.
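To make the rule concrete, here is a minimal sketch of the resulting split sizes. The helper name `joseph_split_sizes` and the rounding convention are our own illustration, not part of [1].

```python
import math

def joseph_split_sizes(n_total: int, n_params: int) -> tuple[int, int]:
    """Split n_total samples by the sqrt(p):1 training:validation rule
    of Joseph [1]; the training share is sqrt(p) / (sqrt(p) + 1)."""
    sqrt_p = math.sqrt(n_params)
    n_train = round(n_total * sqrt_p / (sqrt_p + 1))
    return n_train, n_total - n_train

# Example: 1,000 samples and a model with p = 9 parameters gives
# sqrt(9):1 = 3:1, i.e. a 750/250 split.
print(joseph_split_sizes(1000, 9))  # (750, 250)
```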
- SPlit: An Optimal Method for Data Splitting
Joseph & Vakayil (2021) proposed a deterministic splitting algorithm named SPlit, which uses support points to drive the variance of the validation-set generalization-error estimate down at an $$\mathcal{O}(1/n^2)$$ rate, and showed that the optimal split ratio can be estimated[2].
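The actual SPlit algorithm optimizes support points and then maps them back to data points; purely as an illustration of the underlying idea, the toy sketch below greedily picks a holdout subset whose energy distance to the full sample stays small. The greedy strategy and function name are our own simplification, not the method of [2].

```python
import numpy as np

def greedy_energy_holdout(X: np.ndarray, n_holdout: int) -> np.ndarray:
    """Toy stand-in for support-points-based splitting: greedily grow a
    holdout subset S keeping the energy distance between S and the full
    sample small. Quadratic in |S| per step: illustration only."""
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    chosen: list[int] = []
    for _ in range(n_holdout):
        best_i, best_val = -1, np.inf
        for i in range(N):
            if i in chosen:
                continue
            S = chosen + [i]
            m = len(S)
            # Energy-distance criterion, dropping the constant term
            # that depends only on the full sample:
            val = 2.0 * D[S].sum() / (m * N) - D[np.ix_(S, S)].sum() / m**2
            if val < best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
    return np.array(chosen)

# Usage: indices of a deterministic 20% holdout for toy data.
X = np.random.default_rng(0).normal(size=(100, 3))
print(greedy_energy_holdout(X, 20))
```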
- OptHoldoutSize Package
The R package OptHoldoutSize computes the $$n$$ that minimizes the cost function
$$\ell(n;k_1,N,\theta)=k_1\,n\;+\;k_2(n;\theta)\,(N-n)$$
where $$k_1$$ is the per-sample cost of predicting without the model and $$k_2(n;\theta)$$ is the per-sample prediction cost on the remaining $$(N-n)$$ points after training on $$n$$; given an appropriate $$\theta$$, the optimal hold-out size for the full dataset can be obtained[3].
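A brute-force sketch of the same minimization in Python. We assume a power-law per-sample cost $$k_2(n;\theta)=\theta_1 n^{-\theta_2}+\theta_3$$, similar in spirit to (but not necessarily identical with) the package's parametric form; all numbers are invented for illustration.

```python
import numpy as np

def total_cost(n, N, k1, theta):
    """ell(n) = k1*n + k2(n; theta)*(N - n), with an assumed power-law
    per-sample cost k2(n) = t1 * n^(-t2) + t3 after training on n."""
    t1, t2, t3 = theta
    k2 = t1 * n ** (-t2) + t3
    return k1 * n + k2 * (N - n)

# Grid search over feasible holdout sizes (a stand-in for the
# package's optimizer); parameter values are illustrative only.
N, k1, theta = 100_000, 0.35, (10.0, 0.5, 0.1)
ns = np.arange(100, N, 100)
costs = total_cost(ns, N, k1, theta)
print("optimal holdout size ~", ns[np.argmin(costs)])
```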
- On Optimal Data Split for Generalization Estimation and Model Selection
In the context of cross-validation-based model selection, Larsen & Goutte (1999) define the optimal split ratio $$y\in(0,1)$$ as

$$
y_{\mathrm{opt}} = \mathop{\mathrm{argmin}}_{y}\; \mathbb{E}\big[\text{true generalization error of the model selected by CV at split } y\big]
$$

and discuss the optimal ratios that, in theory, are frequently observed for nonparametric models[4].
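A brute-force empirical analogue of $$y_{\mathrm{opt}}$$: on a simulated task where a large independent pool stands in for the true generalization error, sweep the validation fraction, select a model at each split, and score the selected model. The task, candidate models, and all constants below are invented for illustration; this is not the estimator of [4].

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
coef = rng.normal(size=10)

def make_data(n):
    X = rng.normal(size=(n, 10))
    return X, X @ coef + rng.normal(scale=0.5, size=n)

X_pool, y_pool = make_data(200)          # the N samples we actually have
X_test, y_test = make_data(20_000)       # proxy for the true error

alphas = [0.01, 0.1, 1.0, 10.0]          # candidate models to select among
split_fracs = np.linspace(0.1, 0.9, 9)   # y = validation fraction

def selected_model_error(y_frac, n_repeats=50):
    errs = []
    for seed in range(n_repeats):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_pool, y_pool, test_size=y_frac, random_state=seed)
        # Model selection on the validation split...
        models = [Ridge(alpha=a).fit(X_tr, y_tr) for a in alphas]
        best = min(models, key=lambda m: np.mean((m.predict(X_val) - y_val) ** 2))
        # ...then score the selected model on the independent pool.
        errs.append(np.mean((best.predict(X_test) - y_test) ** 2))
    return np.mean(errs)

print("empirical y_opt ~", min(split_fracs, key=selected_model_error))
```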
- Generic Holdout Methodology for Adaptive Data Analysis
Nakkiran et al. (2018) proposed the 'Generic Holdout', showing that safely setting part of the data aside preserves generalization guarantees throughout adaptive hyperparameter tuning and model selection; in theory, even a small validation set supports guarantees against an exponential number of queries[5].
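The key mechanism is that the holdout set only ever answers one-bit threshold queries rather than releasing the statistic itself, which is what permits exponentially many adaptive queries. A minimal sketch of that interaction pattern follows; the class name and the accuracy statistic are our own choices.

```python
import numpy as np

class GenericHoldoutGate:
    """One-bit holdout oracle in the spirit of [5]: analysts may ask
    'does my model clear the bar on the holdout?' but never see the
    holdout statistic itself."""

    def __init__(self, X_holdout, y_holdout, threshold: float):
        self._X, self._y = X_holdout, y_holdout
        self.threshold = threshold

    def passes(self, model) -> bool:
        # Release only a single bit per query.
        acc = float(np.mean(model.predict(self._X) == self._y))
        return acc >= self.threshold
```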
- Learning Curve Extrapolation for Data Requirement Estimation
- Domhan et al. (2015) model the learning curves of deep networks with parametric families (including power laws) and extrapolate them to predict whether a configuration will reach a target performance, so that unpromising hyperparameter runs can be terminated early[6].
- Kim & Viering (2022) compare various curve-fitting and initialization strategies, studying how to estimate learning curves stably from limited data[7].
- Hoiem et al. (2021) define parameters that characterize the learning curve (an error bound $$e_N$$ and a data-reliance $$\beta_N$$) and analyze how network design and pretraining affect data dependence[8]; a power-law extrapolation sketch follows this list.
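A minimal sketch of the shared recipe behind [6][7][8]: fit a power-law curve $$\mathrm{err}(n)=a\,n^{-b}+c$$ to pilot runs at a few training-set sizes, then invert it to estimate the data needed for a target error. The pilot numbers below are made up, and real learning curves require care about fit instability, as [7] discusses.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Power-law learning curve err(n) = a * n^(-b) + c."""
    return a * n ** (-b) + c

# Pilot runs at a few training-set sizes (illustrative values).
n_pilot = np.array([500, 1_000, 2_000, 4_000, 8_000])
err_pilot = np.array([0.32, 0.27, 0.23, 0.20, 0.18])

(a, b, c), _ = curve_fit(power_law, n_pilot, err_pilot,
                         p0=(1.0, 0.5, 0.05), maxfev=10_000)

# Invert the fitted curve to estimate the n needed for a target error.
target = 0.15
if target > c:
    n_needed = (a / (target - c)) ** (1.0 / b)
    print(f"estimated samples for error {target}: ~{n_needed:,.0f}")
else:
    print(f"target {target} is below the fitted error floor c = {c:.3f}")
```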
These studies are not classical sample-complexity theory ("how many samples are needed relative to the number of model parameters"); they are representative attempts to directly optimize "how much of the data on hand should go to training and how much to validation to obtain a sufficiently well-generalized model", or to estimate the required data fraction from learning curves.
Going further, a framework that specifies a split cost function reflecting real costs and risks, and then estimates $$N$$, $$k_1$$, $$k_2(n)$$, and related quantities empirically or theoretically to compute the optimal $$n$$, is implemented in OptHoldoutSize[3] and is worth consulting.
Sources
[1] [2202.03326] Optimal Ratio for Data Splitting – arXiv https://arxiv.org/abs/2202.03326
[2] SPlit: An Optimal Method for Data Splitting – Taylor & Francis Online https://www.tandfonline.com/doi/full/10.1080/00401706.2021.1921037
[3] optimal_holdout_size: Estimate optimal holdout size under parametric assumptions in OptHoldoutSize: Estimation of Optimal Size for a Holdout Set for Updating a Predictive Score https://rdrr.io/cran/OptHoldoutSize/man/optimal_holdout_size.html
[4] [PDF] On Optimal Data Split for Generalization Estimation and Model … https://orbit.dtu.dk/files/5381207/Larsen.pdf
[5] arXiv:1809.05596v1 [stat.ME] 14 Sep 2018 http://arxiv.org/pdf/1809.05596.pdf
[6] Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves https://ml.informatik.uni-freiburg.de/wp-content/uploads/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf
[7] [PDF] Different Approaches to Fitting and Extrapolating the Learning Curve https://bnaic2022.uantwerpen.be/wp-content/uploads/BNAICBeNeLearn_2022_submission_1583.pdf
[8] Learning Curves for Analysis of Deep Networks http://proceedings.mlr.press/v139/hoiem21a/hoiem21a.pdf
[9] What is Sample Complexity? – DataCamp https://www.datacamp.com/blog/what-is-sample-complexity
[10] Generalization in Adaptive Data Analysis and https://www.cs.toronto.edu/~toni/Papers/nips-maxinfo.pdf
[11] 6 – Understanding Learning Curves: Solving a Manager’s Problem with Real-World Data – MadhavanSV https://www.youtube.com/watch?v=T-MP8wp0QLo
[12] Splitting Data Into Training & Testing in Data Preprocessing | Data Science ML (Lecture #4) https://www.youtube.com/watch?v=htAOwa7UKUY
[13] Better Generalization with Forecasts https://www.ijcai.org/Proceedings/13/Papers/246.pdf
[14] [PDF] Sample Complexity of Diffusion Model Training Without Empirical … https://arxiv.org/pdf/2505.18344.pdf
[15] Journal of Machine Learning Research 24 (2023) 1-51 https://www.jmlr.org/papers/volume24/22-1293/22-1293.pdf
[16] PRIOR https://prior.allenai.org/projects/lcurve
[17] Beyond Random Split for Assessing Statistical http://arxiv.org/pdf/2209.03346.pdf
[18] Dimensionality reduction to maximize prediction generalization… https://openreview.net/forum?id=b6rmaTkDnT
[19] [PDF] Transductive Rademacher Complexity and its Applications – arXiv https://arxiv.org/pdf/1401.3441.pdf
[20] Generalization Bounds and Stability https://ocw.mit.edu/courses/9-520-statistical-learning-theory-and-applications-spring-2006/9a5f87123d8e36531b5959b031920fa8_class14.pdf
[21] Transductive Rademacher Complexity and its Applications – arXiv https://arxiv.org/abs/1401.3441
[22] 06-head.dvi https://ocw.mit.edu/courses/9-520-statistical-learning-theory-and-applications-spring-2003/cbeff6178683ade20114d9eb05183b58_class06.pdf
[23] [PDF] Aggregated Hold-Out – Journal of Machine Learning Research https://jmlr.org/papers/volume22/19-624/19-624.pdf
[24] View of Transductive Rademacher Complexity and its Applications https://jair.org/index.php/jair/article/view/10608/25373
[25] [PDF] A Comparison of Tight Generalization Error Bounds https://icml.cc/Conferences/2005/proceedings/papers/052_Comparison_KaeaeriaeinenLangford.pdf
[26] Transductive Learning is Compact | OpenReview https://openreview.net/forum?id=YWTpmLktMj
[27] COS 511: Theoretical Machine Learning https://www.cs.princeton.edu/courses/archive/spring18/cos511/scribe_notes/0305.pdf
[28] Transductive Learning Is Compact – arXiv https://arxiv.org/html/2402.10360v3
[29] Generalization bounds for finite hypothesis classes https://alliance.seas.upenn.edu/~cis520/dynamic/2017/wiki/index.php?n=Lectures.PAC
[30] [PDF] Beating the Hold-Out: Bounds for K-fold and Progressive Cross … https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/pre2003-Beating_the_Holdout.pdf
[31] [PDF] Transductive Learning Is Compact – arXiv https://arxiv.org/pdf/2402.10360.pdf
[32] A Finite-Sample Generalization Bound for Semiparametric Regression https://proceedings.mlr.press/v33/huang14.html
[33] [PDF] Bounds for K-fold and Progressive Cross-Validation – TTIC https://home.ttic.edu/~avrim/Papers/kfold.pdf
[34] Trade-off between training and testing ratio in machine learning for … https://pmc.ncbi.nlm.nih.gov/articles/PMC11419616/
[35] A new improved generalized class of estimators for population … https://www.nature.com/articles/s41598-023-30150-9
[36] [Linear Models 2] Generalization – bomishot – Tistory https://bomishot.tistory.com/20
[37] New generalized class of estimators for estimation of finite … https://pmc.ncbi.nlm.nih.gov/articles/PMC10612467/
[38] Learning Learning Curves https://research.tudelft.nl/en/publications/learning-learning-curves
[39] Optimal ratio for data splitting – Joseph – 2022 – Wiley Online Library https://onlinelibrary.wiley.com/doi/full/10.1002/sam.11583
[40] Finite Population Survey Sampling: An https://arxiv.org/pdf/2306.10635.pdf
[41] Identifying Key Challenges of Hardness-Based Resampling – arXiv https://arxiv.org/abs/2504.07031
[42] [PDF] A General Theory of Holdouts – Xiaobo Yu https://xiaobo-yu.com/papers/Holdout.pdf
[43] Towards Bridging Sample Complexity and Model Capacity https://cdn.aaai.org/ojs/20092/20092-13-24105-1-2-20220628.pdf
[44] Training, Validation, Test Split for Machine Learning Datasets – Encord https://encord.com/blog/train-val-test-split/
[45] 4 – Holdout method – The Emerging Science of Machine Learning … https://www.mlbenchmarks.org/04-holdout-method.html
[46] Data Sampling Affects the Complexity of Online SGD over Dependent Data https://arxiv.org/pdf/2204.00006.pdf
[47] Splitting data set into training and test data, keeping the ratio https://stackoverflow.com/questions/52610332/splitting-data-set-into-training-and-test-data-keeping-the-ratio
[48] [PDF] Information-Theoretic Generalization Bounds for Transductive … http://www.jmlr.org/papers/volume25/23-1368/23-1368.pdf
[49] PAC https://alliance.seas.upenn.edu/~cis520/dynamic/2016/wiki/index.php?n=Lectures.PAC
[50] Information-Theoretic Generalization Bounds for Transductive … https://arxiv.org/abs/2311.04561
[51] 06. Transductive Support Vector Machines – Wikidocs https://wikidocs.net/190189
[52] Proceedings of Machine Learning Research vol 99:1–10, 2019 https://proceedings.mlr.press/v99/feldman19a/feldman19a.pdf
[53] Transductive Learning | Papers With Code https://paperswithcode.com/task/transductive-learning
[54] [PDF] Generalization Bounds and Representation Learning for Estimation … https://jmlr.org/papers/volume23/19-511/19-511.pdf
[55] A note on generalization bounds for losses with finite moments – arXiv https://arxiv.org/abs/2403.16681
[56] Machine Learning Theory – Part 2: Generalization Bounds https://mostafa-samir.github.io/ml-theory-pt2/
[57] Journal of Machine Learning Research 24 (2023) 1-51 https://jmlr.org/papers/volume24/22-1293/22-1293.pdf
[58] Information-Theoretic Bounds on the Moments of https://discovery.ucl.ac.uk/id/eprint/10138964/1/Gholomali_Information-Theoretic%20Bounds%20on%20the%20Moments%20of%20the%20Generalization%20Error%20of%20Learning%20Algorithms_AAM.pdf
[59] 1 https://www.osti.gov/servlets/purl/1476410
[60] A New Family of Generalization Bounds Using Samplewise … – arXiv https://arxiv.org/abs/2210.06422
[61] Which Algorithms Have Tight Generalization Bounds? – OpenReview https://openreview.net/forum?id=RFMdtKbff5
[62] Fantastic Generalization Measures are Nowhere to be Found https://openreview.net/forum?id=NkmJotfL42
[63] arXiv:2110.11216v4 [stat.ML] 9 Nov 2021 http://arxiv.org/pdf/2110.11216v4.pdf
[64] Understanding Hold-Out Methods for Training Machine Learning … https://www.comet.com/site/blog/understanding-hold-out-methods-for-training-machine-learning-models/
[65] ECE901 Spring 2007 Statistical Learning Theory https://nowak.ece.wisc.edu/SLT07/lecture10.pdf
[66] Generalization in Deep Learning https://lis.csail.mit.edu/pubs/kawaguchi-techreport18.pdf
[67] [PDF] Non-vacuous Generalization Bounds for Adversarial Risk in … https://proceedings.mlr.press/v238/mustafa24a/mustafa24a.pdf
[68] Understanding PAC (probably approximately correct) bounds on the realizable case (and finite hypothesis class) https://math.stackexchange.com/questions/761096/understanding-pac-probably-approximately-correct-bounds-on-the-realizable-case
[69] General Disclaimer https://ntrs.nasa.gov/api/citations/19710001064/downloads/19710001064.pdf
[70] [PDF] PAC-Bayesian Theory for Transductive Learning http://proceedings.mlr.press/v33/begin14.pdf
[71] Table of contents https://zief0002.github.io/epsy-5261/06-01-generalization.html
[72] Generalization Ability – a statistical approach https://www3.nd.edu/~lemmon/courses/deep-learning/spring-2024/slides/slide2.pdf
[73] SREE2015_SmallNGeneralization https://files.eric.ed.gov/fulltext/ED562348.pdf
[74] Is Transductive Learning Equivalent to PAC Learning? – arXiv https://arxiv.org/html/2405.05190v1
[75] Generalizability Theory https://web.pdx.edu/~newsomj/pmclass/generalizability%20theory.pdf
[76] [PDF] A Generalized Ratio-Type Estimator of Finite Population Variance … https://www.naturalspublishing.com/download.asp?ArtcID=20325
[77] Is Transductive Learning Equivalent to PAC Learning? – OpenReview https://openreview.net/forum?id=2XAAdvSlP2
[78] [PDF] A Transductive Local Rademacher Complexity Approach – arXiv https://arxiv.org/pdf/2309.16858.pdf
[79] carriers98.PDF https://galileo.phys.virginia.edu/classes/312/notes/carriers.pdf
[80] How to Use the Finite Population Correction – MeasuringU https://measuringu.com/finite-population-correction/
[81] str.dvi https://static.googleusercontent.com/media/research.google.com/en/pubs/archive/34678.pdf
[82] Generalization Bounds with Logarithmic Negative-Sample … https://openreview.net/forum?id=OaVi1yjdEc
[83] 1 http://arxiv.org/pdf/1602.00956.pdf
[84] Generalization Bounds via Information Density and Conditional Information Density https://research.chalmers.se/publication/540547/file/540547_Fulltext.pdf
[85] Solid State Physics https://www.ucl.ac.uk/~ucapahh/teaching/3C25/Lecture24s.pdf
[86] [PDF] Localized Complexities for Transductive Learning http://proceedings.mlr.press/v35/tolstikhin14.pdf
[87] [PDF] Comparing Comparators in Generalization Bounds https://proceedings.mlr.press/v238/hellstrom24a/hellstrom24a.pdf
[88] Generalization Bounds via Conditional $f$-Information | OpenReview https://openreview.net/forum?id=ocxVXe5XN1
[89] Understanding Train-Test Split | Kaggle https://www.kaggle.com/general/542494
[90] Improved Risk Bounds with Unbounded Losses for Transductive … https://openreview.net/forum?id=vjbIer5R2H
[91] Sampling from a Finite Population: Interval https://utstat.toronto.edu/~brunner/oldclass/utm218s07/FinitePop.pdf
[92] Optimal Ratio for Data Splitting https://arxiv.org/pdf/2202.03326.pdf
[93] Is There a rule of thumb for How to Divide a Dataset into Training and Validation Sets? https://intellipaat.com/blog/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validation/
[94] Simulated example https://cran.r-project.org/web/packages/OptHoldoutSize/vignettes/simulated_example.pdf
[95] Train Test Validation Split: Mastering Model Evaluation for Machine Learning Success – 33rd Square https://www.33rdsquare.com/train-test-validation-split/
[96] A Comparative study of data splitting algorithms for machine learning model selection http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1506870
[97] What is data splitting and why is it important? – TechTarget https://www.techtarget.com/searchenterpriseai/definition/data-splitting
[98] What is: Holdout Method https://statisticseasily.com/glossario/what-is-holdout-method-data-science/
[99] DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, https://www.diva-portal.org/smash/get/diva2:1506870/FULLTEXT01.pdf
[100] [PDF] Efficient Bayesian Learning Curve Extrapolation using Prior-Data … https://proceedings.neurips.cc/paper_files/paper/2023/file/3f1a5e8bfcc3005724d246abe454c1e5-Paper-Conference.pdf
[101] Under review as a conference paper at ICLR 2021 https://openreview.net/pdf?id=FsLTUzZlsgT
[102] How Much More Data Do I Need? Estimating Requirements For Downstream Tasks https://research.nvidia.com/labs/toronto-ai/estimatingrequirements/
[103] Architecture-Aware Learning Curve Extrapolation via Graph … – arXiv https://arxiv.org/abs/2412.15554
[104] Strategies and impact of learning curve estimation for CNN-based image classification https://ar5iv.labs.arxiv.org/html/2310.08470v1
[105] [PDF] Efficient Bayesian Learning Curve Extrapolation using Prior-Data … https://openreview.net/pdf?id=VQpqxucNX63
[106] Learning Curves¶ https://docs.datarobot.com/en/docs/modeling/analyze-models/other/learn-curve.html
[107] How Much More Data Do I Need? Estimating Requirements for Downstream Tasks | NVIDIA Toronto AI Lab https://research.nvidia.com/labs/toronto-ai/publication/2022_cvpr_how_much_data/
[108] [PDF] RATT: Leveraging Unlabeled Data to Guarantee Generalization http://proceedings.mlr.press/v139/garg21a/garg21a.pdf
[109] A simulation study: An enhanced generalized class of estimators for … https://www.sciencedirect.com/science/article/pii/S2405844023044778
[110] Estimating Population Proportions https://www.youtube.com/watch?v=4lAU-RAKGqo
[111] [PDF] Generalization Bounds and Representation Learning for Estimation … https://arxiv.org/pdf/2001.07426.pdf
[112] [PDF] Generalization Bound of Gradient Descent for Non-Convex Metric … https://proceedings.neurips.cc/paper/2020/file/6f5e4e86a87220e5d361ad82f1ebc335-Paper.pdf
[113] MS&E 226: “Small” Data http://web.stanford.edu/~rjohari/teaching/notes/226_lecture4_prediction.pdf
[114] 4.6. Generalization in Classification – Dive into Deep Learning https://d2l.ai/chapter_linear-classification/generalization-classification.html
[115] OptHoldoutSize: inst/doc/ASPRE_example.Rmd https://rdrr.io/cran/OptHoldoutSize/f/inst/doc/ASPRE_example.Rmd
[116] Optimal ratio for data splitting – Wiley Online Library https://onlinelibrary.wiley.com/doi/pdf/10.1002/sam.11583
[117] Machine Learning CS60050 https://cse.iitkgp.ac.in/~aritrah/course/theory/ML/Spring2021/scribes/2021-02-25_Thu_20RJ92R05+20RJ91R07.pdf
[118] Hold-out strategy for selecting learning models – ScienceDirect.com https://www.sciencedirect.com/science/article/abs/pii/S0022249622000372
[119] [PDF] Transductive Robust Learning Guarantees https://proceedings.mlr.press/v151/montasser22a/montasser22a.pdf
[120] [PDF] Extrapolating Learning Curves of Deep Neural Networks https://aad.informatik.uni-freiburg.de/wp-content/uploads/papers/14-AUTOML-ExtrapolatingLearningCurves4.pdf
[121] Generalization error – Wikipedia https://en.wikipedia.org/wiki/Generalization_error
[122] Holding out for an explanation – by Ben Recht – arg min https://www.argmin.net/p/holding-out-for-an-explanation
[123] [PDF] Comparing Learning Curve Extrapolation Methods in different … https://studenttheses.universiteitleiden.nl/access/item:3665258/view