Learning Roadmap
Work through it in this order: Foundations → Architecture families (CNN/RNN/Transformer/SSM/GNN/INR…) → Generative paradigms (AR/VAE/GAN/Diffusion/Flow/EBM) → Domain tracks (NLP/Vision/Speech & Audio/Time-series/Multimodal/RL/Bio) → Systems & scaling → Capstone. Search for papers and resources using the keywords in bold.
0) Foundations & Setup
- Math & Prob/Stats: linear algebra, calculus for DL, probability, information theory.
- Core DL: optimization (SGD/AdamW), initialization, overfit/underfit, bias–variance.
- Tooling: PyTorch, JAX (optional), Lightning/Accelerate, mixed precision, profiling.
- Data discipline: curation, splits, leakage checks, augmentations, reproducibility (seeds).
Milestone: Implement a small MNIST/CIFAR classifier and a toy RNN on character data; set up a clean training loop with logging & checkpoints.
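A minimal sketch of this milestone's training loop, assuming torchvision for MNIST; model size, batch size, and checkpoint path are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train(epochs: int = 3, lr: float = 1e-3, ckpt_path: str = "mnist.pt"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    ds = datasets.MNIST("data", train=True, download=True,
                        transform=transforms.ToTensor())
    loader = DataLoader(ds, batch_size=128, shuffle=True)
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 10)).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for step, (x, y) in enumerate(loader):
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
            if step % 100 == 0:  # lightweight logging; swap in W&B/TensorBoard
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
        # checkpoint model + optimizer state so runs are resumable
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(), "epoch": epoch}, ckpt_path)

# train()  # downloads MNIST on first run
```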
1) Architecture Families (structure level)
1.1 Feedforward & Modern MLPs
- MLP, Residual MLP, MLP-Mixer, gMLP, KAN (Kolmogorov-Arnold Networks).
Milestone: Reproduce MLP-Mixer on CIFAR; compare to a small CNN.
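For orientation before the reproduction, one token-/channel-mixing block in the style of the MLP-Mixer paper (dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing across patches, then channel-mixing."""
    def __init__(self, num_patches: int, dim: int, token_hidden: int, chan_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                      # x: (batch, patches, dim)
        y = self.norm1(x).transpose(1, 2)      # mix across the patch axis
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.chan_mlp(self.norm2(x))   # mix across the channel axis
        return x

x = torch.randn(8, 64, 128)                    # 64 patches, 128 channels
print(MixerBlock(64, 128, 256, 512)(x).shape)  # torch.Size([8, 64, 128])
```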
1.2 CNNs (classic → modern → lightweight)
- Backbones: ResNet, DenseNet, Inception, MobileNet, EfficientNet, ConvNeXt, RegNet, NFNet, CoAtNet (conv-attn hybrid).
- Modules: Dilated/Deformable/Depthwise/Grouped conv, SENet, CBAM, ECA.
- Lightweight: ShuffleNet, GhostNet.
Milestone: Train a modern small CNN on your dataset; ablate depthwise vs standard conv.
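A quick way to see why depthwise separable convs count as "lightweight" before running the ablation (parameter counts only; the ablation itself measures accuracy):

```python
import torch.nn as nn

# Standard 3x3 conv: every output channel sees every input channel.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Depthwise separable: per-channel 3x3 (groups=in_channels) + 1x1 pointwise mix.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1))                       # pointwise

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # ~73.9k vs ~9.0k parameters
```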
1.3 RNNs & Temporal Convs
- Vanilla RNN (Elman/Jordan), LSTM, GRU, Peephole, Bi-/Stacked; TCN; packing/masking for variable length.
Milestone: Build a sequence classifier with LSTM/GRU and TCN; handle variable lengths properly.
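The packing/masking mechanics for variable lengths, sketched with PyTorch's pack_padded_sequence utilities:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three sequences of different lengths, feature dim 8.
seqs = [torch.randn(n, 8) for n in (5, 3, 7)]
lengths = torch.tensor([5, 3, 7])

padded = pad_sequence(seqs, batch_first=True)        # (3, 7, 8), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)  # RNN skips the pad steps

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
out_packed, h_n = gru(packed)                        # h_n is the last *real* step
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
print(out.shape, h_n.shape)  # torch.Size([3, 7, 16]) torch.Size([1, 3, 16])
```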
1.4 Transformers (and efficiency)
- Families: Encoder-only (BERT), Decoder-only (GPT), Encoder–Decoder (T5/BART).
- Vision: ViT/DeiT/Swin, PVT, MaxViT; Detection: DETR/Deformable DETR, Mask2Former.
- Long/efficient: Transformer-XL, Longformer, BigBird, Reformer, Performer, Linformer, FlashAttention.
- Popular decoder-only models: Llama, Mistral/Mixtral, Gemma, Phi.
- Sparse/conditional compute: MoE (GShard, Switch Transformer, Mixtral), Mixture-of-Depths, Adaptive Computation Time.
- Retrieval/memory: RAG, REALM, RETRO, kNN-LM.
Milestone: Fine-tune a small encoder for classification and a small decoder for generation; add RAG on a private corpus.
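Every family above shares the same attention primitive; a minimal causal (decoder-style) version, without KV caching or FlashAttention kernels:

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, T, dim)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))                 # (B, heads, T, head_dim)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

print(CausalSelfAttention(64, 4)(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```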
1.5 State-Space & Long-Range Alternatives
- S4/S5/DSS, Mamba (and variants), Retentive Network, Hyena/H3, RWKV (RNN-like, parallel training).
Milestone: Replace Transformer with Mamba on a long-sequence task; compare throughput and accuracy.
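A toy diagonal linear SSM, just to make the recurrence concrete; real S4/Mamba learn the discretization and replace this Python loop with convolutions or hardware-aware parallel scans:

```python
import torch

def ssm_scan(A, B, C, u):
    """Toy diagonal linear SSM: h_t = A*h_{t-1} + B@u_t, y_t = C@h_t."""
    T, _ = u.shape
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A * h + B @ u[t]      # diagonal A -> elementwise decay per state
        ys.append(C @ h)
    return torch.stack(ys)

d_state, d_in, d_out, T = 16, 4, 4, 100
A = torch.rand(d_state) * 0.99            # stable decays in (0, 1)
B = torch.randn(d_state, d_in) * 0.1
C = torch.randn(d_out, d_state) * 0.1
print(ssm_scan(A, B, C, torch.randn(T, d_in)).shape)  # torch.Size([100, 4])
```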
1.6 GNNs (graphs)
- GCN, GAT, GraphSAGE, GIN, MPNN; equivariant: EGNN, SE(3)-Transformer; temporal: TGN; generative: GAE/VGAE.
Milestone: Node classification on Cora/Citeseer; then TGN on a temporal graph.
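A minimal dense-adjacency GCN layer (Kipf & Welling normalization); real graph workloads would use sparse ops or a library such as PyG:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = D^-1/2 (A+I) D^-1/2 H W."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):                 # adj: dense (N, N), 0/1
        a_hat = adj + torch.eye(adj.size(0))   # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return norm @ self.lin(x)              # aggregate, then transform

x = torch.randn(5, 8)                          # 5 nodes, 8 features
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()              # symmetrize
print(GCNLayer(8, 16)(x, adj).shape)           # torch.Size([5, 16])
```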
1.7 INR / 3D & Geometry
- INR/coordinate networks: SIREN, NeRF (Mip-NeRF, Instant-NGP, DVGO, Plenoxels), DeepSDF, Occupancy Nets.
- 3D nets: PointNet/PointNet++, DGCNN, Point Transformer, MinkowskiNet (sparse conv).
Milestone: Train a tiny NeRF on a few views; reconstruct a room/prop.
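One reusable ingredient of the NeRF line worth coding yourself: the Fourier positional encoding that lets a plain MLP fit high-frequency scene detail:

```python
import torch

def positional_encoding(x, n_freqs: int = 10):
    """NeRF-style Fourier features: [sin(2^k * pi * x), cos(2^k * pi * x)]."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi        # (n_freqs,)
    angles = x[..., None] * freqs                          # (..., D, n_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., D, 2*n_freqs)
    return enc.flatten(start_dim=-2)                       # (..., D*2*n_freqs)

pts = torch.rand(1024, 3)                 # 3D sample points along camera rays
print(positional_encoding(pts).shape)     # torch.Size([1024, 60])
```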
1.8 Associative/Other
- Modern Hopfield Networks, Hopfield-Transformer; CapsNet; Neural ODE, Latent ODE, ODE-RNN.
Milestone: Implement a small Neural ODE on toy dynamics; compare to GRU.
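A minimal Neural ODE sketch using fixed-step Euler integration; real experiments would typically use an adaptive solver (e.g. the torchdiffeq package) and the adjoint method:

```python
import torch
import torch.nn as nn

class NeuralODE(nn.Module):
    """Toy Neural ODE: dz/dt = f_theta(z), integrated with fixed-step Euler."""
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, dim))

    def forward(self, z0, t1: float = 1.0, steps: int = 20):
        z, dt = z0, t1 / steps
        for _ in range(steps):
            z = z + dt * self.f(z)   # Euler step; gradients flow through all steps
        return z

z0 = torch.randn(16, 2)              # batch of 2D initial states
print(NeuralODE(dim=2)(z0).shape)    # torch.Size([16, 2])
```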
2) Generative Model Paradigms (not architectures)
- Autoregressive: RNN-LM, GPT, PixelRNN/PixelCNN, WaveNet, ImageGPT, AudioLM, MusicLM.
- VAE: VAE, β-VAE, VQ-VAE / VQ-VAE-2, NVAE.
- GAN: GAN, DCGAN, WGAN/WGAN-GP, SNGAN, StyleGAN(1/2/3/XL), BigGAN, CycleGAN, Pix2Pix, SPADE/GauGAN.
- Diffusion / Score: DDPM / Improved DDPM, DDIM, Score-SDE (VE/VP), Latent Diffusion (Stable Diffusion), EDM, Consistency Models; Transformer-based: DiT, U-ViT.
- Flow / Flow-Matching: NICE, RealNVP, Glow, Flow Matching, Rectified Flow.
- EBM: EBM, score matching, NCE.
Milestone: Train a VQ-VAE and a DDPM on a small image dataset; sample and compare FID/precision–recall.
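To ground the diffusion entries, the epsilon-prediction DDPM training objective in a few lines; `model(x_t, t)` is an assumed noise-prediction interface (in practice a small U-Net), and the linear beta schedule follows the original paper:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, T: int = 1000):
    """Noise x0 to a random step t in closed form; train to predict the noise."""
    betas = torch.linspace(1e-4, 0.02, T)                 # linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative alphas
    t = torch.randint(0, T, (x0.size(0),))                # one t per sample
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast to x0
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                 # predict the noise

x0 = torch.randn(8, 3, 32, 32)
dummy = lambda x_t, t: torch.zeros_like(x_t)   # stand-in noise predictor
print(ddpm_loss(dummy, x0).item())             # ~1.0 for an all-zero predictor
```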
3) Domain Tracks (pick 2–3 first)
3.1 NLP / Long-Context & Retrieval
- Tokenizers; BERT/T5/GPT fine-tuning; instruction tuning; LoRA; RAG; long-context (ALiBi/RoPE, Longformer/BigBird).
Project: Domain Q&A with RAG; evaluate hallucination and grounding.
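The retrieval half of RAG, reduced to a toy cosine-similarity top-k; a real system would swap in a trained embedder and an ANN index such as FAISS:

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, doc_embs, docs, k: int = 2):
    """Toy dense retrieval: cosine similarity, then top-k documents."""
    sims = F.cosine_similarity(query_emb[None, :], doc_embs, dim=-1)
    top = sims.topk(k).indices
    return [docs[i] for i in top]

docs = ["LoRA adapts low-rank deltas.", "RoPE rotates query/key pairs.",
        "BigBird uses sparse attention."]
doc_embs = torch.randn(3, 64)       # stand-in embeddings; use a real encoder
query_emb = torch.randn(64)
context = "\n".join(retrieve(query_emb, doc_embs, docs))
prompt = f"Answer using only this context:\n{context}\n\nQ: What is LoRA?\nA:"
print(prompt)                       # feed to the generator LLM
```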
3.2 Vision (Detection/Segmentation/Foundations)
- Backbones (ResNet/ConvNeXt ↔ ViT/Swin), detectors (YOLO/RetinaNet/Faster R-CNN ↔ DETR), segmentation (U-Net/DeepLab ↔ Mask2Former).
Project: Build a multi-object detection pipeline; measure latency vs mAP.
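The latency-vs-mAP comparison rests on box IoU, the building block behind NMS and mAP evaluation; a vectorized sketch:

```python
import torch

def box_iou(a, b):
    """IoU between (N, 4) and (M, 4) boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.maximum(a[:, None, :2], b[None, :, :2])   # intersection top-left
    rb = torch.minimum(a[:, None, 2:], b[None, :, 2:])   # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

preds = torch.tensor([[0., 0., 10., 10.]])
gts = torch.tensor([[5., 5., 15., 15.], [20., 20., 30., 30.]])
print(box_iou(preds, gts))   # tensor([[0.1429, 0.0000]])
```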
3.3 Speech & Audio
- ASR: Conformer, RNN-T/CTC, Whisper; TTS: WaveNet, HiFi-GAN, VITS; audio generation: AudioLM/MusicLM (concepts).
Project: Fine-tune Whisper on custom accents; deploy streaming ASR.
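One ASR fundamental worth internalizing before touching Whisper: the CTC objective, shown here on random tensors just to make the shapes concrete:

```python
import torch
import torch.nn as nn

# CTC aligns unsegmented audio frames to text without per-frame labels; it is
# the training objective behind many ASR encoders and part of the RNN-T recipe.
T, B, C = 50, 4, 28                     # frames, batch, classes (blank = 0)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))  # label indices; blank never appears
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), 10)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                         # gradients flow to the acoustic model
print(loss.item())
```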
3.4 Time-Series / Forecasting
- TCN, Informer/Autoformer (Transformer), S4/Mamba, RWKV.
Project: Multi-horizon forecasting with exogenous features; compare MSE vs latency.
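The supervised windowing setup behind multi-horizon forecasting, as a sketch (lookback/horizon values and the target-column convention are illustrative):

```python
import torch

def make_windows(series, lookback: int, horizon: int):
    """Slice a (T, F) series into (lookback -> horizon) supervised pairs."""
    X, Y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i : i + lookback])                           # inputs
        Y.append(series[i + lookback : i + lookback + horizon, 0])   # target = col 0
    return torch.stack(X), torch.stack(Y)

series = torch.randn(500, 3)          # target + 2 exogenous features
X, Y = make_windows(series, lookback=96, horizon=24)
print(X.shape, Y.shape)  # torch.Size([381, 96, 3]) torch.Size([381, 24])
```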
3.5 Multimodal (VL/VLM/LMM)
- CLIP, BLIP/BLIP-2, Flamingo, LLaVA, GPT-4V, Gemini; alignment objectives (ITC/ITM), instruction tuning, evaluation (VQA/VLEP).
Project: Image-text retrieval + VQA; add RAG over captions/metadata.
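The ITC objective above is CLIP's symmetric contrastive loss; a minimal version (the 0.07 temperature follows the paper, though CLIP actually learns it):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Matched image-text pairs sit on the diagonal of the similarity matrix;
    symmetric cross-entropy pulls them together, pushes mismatches apart."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

img_emb, txt_emb = torch.randn(32, 256), torch.randn(32, 256)
print(clip_loss(img_emb, txt_emb).item())
```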
3.6 Reinforcement Learning
- DQN/Double/Rainbow, A2C/A3C, PPO, SAC/TD3; RL as sequence modeling: Decision Transformer; RLHF (SFT → RM → PPO/DPO).
Project: PPO on a continuous-control task; add reward shaping and curriculum.
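PPO's clipped surrogate objective, the piece that makes it stable enough to be the default; a sketch over stand-in tensors:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2):
    """Cap the policy ratio so one update cannot move the policy too far
    from the one that collected the data."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()   # maximize -> negate

logp_old = torch.randn(64)
logp_new = logp_old + 0.1 * torch.randn(64)   # stand-in for the updated policy
advantages = torch.randn(64)                  # e.g. from GAE
print(ppo_clip_loss(logp_new, logp_old, advantages).item())
```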
3.7 Protein/Biology
- Structure: AlphaFold (Evoformer), ESM; generation: RFdiffusion, ProtGPT2.
Project: Use ESM embeddings for property prediction; explore RFdiffusion samples (conceptually).
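A sketch of the project's downstream half: a linear probe over per-protein embeddings. Random features stand in for real ESM embeddings here (in practice you would mean-pool the model's per-residue representations; 1280 dims matches several ESM checkpoints), so the score should hover near chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1280))      # stand-in per-protein embeddings
y = rng.integers(0, 2, size=200)      # binary property, e.g. solubility

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())   # ~0.5 on random features
```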
3.8 3D/Graphics & Virtual Production (XR-friendly)
- Instant-NGP, Plenoxels, Point-NeRF and other point-based methods; camera-tracking fusion; real-time constraints.
Project: Scene capture → NeRF background → integrate with a tracked camera feed.
4) Training Recipes, Evaluation & Systems
- Optimization: AdamW, schedulers & warmup, gradient clipping, weight decay, EMA.
- Regularization: dropout, label smoothing, stochastic depth, mixup/cutmix.
- Scaling: DP/ZeRO, tensor/model/pipeline parallel, MoE, gradient checkpointing, LoRA/QLoRA.
- Data & Eval: robust splits, leakage tests, calibration, uncertainty, long-context eval, safety checks.
- Serving: quantization (INT8/FP8), distillation, Triton/FastAPI, streaming latency budgets (ASR/VLM), caching & KV reuse.
Milestone: Take one domain project to production-like serving with metrics dashboards and A/B tests.
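One ingredient from the optimization bullet above, linear warmup plus cosine decay, as a LambdaLR sketch (warmup/total step counts are illustrative):

```python
import math
import torch

def warmup_cosine(step, warmup: int = 500, total: int = 10000):
    """LR multiplier: linear ramp over `warmup` steps, then cosine to zero."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 10)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_cosine)
# each training step: opt.step(); sched.step()
```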
5) Capstones (pick 1)
- Multimodal Studio Assistant: RAG + LLaVA/CLIP for shot search, asset tagging, and scene notes; on-prem inference.
- Long-Sequence Forecasting: Mamba/RWKV vs Transformer on operational telemetry; deploy with alerts.
- Real-Time ASR→Captioning: Whisper/Conformer + latency budget + domain lexicon injection.
- NeRF-Driven VP Backdrops: Capture → Instant-NGP training → keyed talent compositing with tracked camera.
- Bio-Embedding Featurization: use ESM embeddings as features for property prediction; interpretability focus.
6) Paper-Reading & Repro Culture
- Triage: architecture vs training vs data contributions; reproduce small-scale first.
- Logs & ablations: keep a fixed template; isolate one variable per run.
- Share cards: “What changed? Why? At what cost (FLOPs/latency/memory)?”
7) Quick Decision Guide (cheat-sheet)
- Very long sequences: try SSM (S4/Mamba) or RWKV; if retrieval helps → RAG.
- Tight latency / mobile: MobileNet/ShuffleNet/GhostNet or distilled ViT; quantize.
- Structured graphs: GAT/GIN; temporal → TGN.
- 3D/scene capture: Instant-NGP/Plenoxels; sparse 3D → MinkowskiNet.
- Text generation: decoder (Llama/Mistral/Gemma/Phi) with LoRA; safety + eval.
- Image generation: start Latent Diffusion; scale to DiT when compute allows.