Learning Roadmap
Work through it in this order: Foundations → Architecture families (CNN/RNN/Transformer/SSM/GNN/INR…) → Generative paradigms (AR/VAE/GAN/Diffusion/Flow/EBM) → Domain tracks (NLP/Vision/Speech & Audio/Time-series/Multimodal/RL/Bio) → Systems & scaling → Capstone. Search for papers and resources using the keywords in bold.
0) Foundations & Setup
- Math & Prob/Stats: linear algebra, calculus for DL, probability, information theory.
- Core DL: optimization (SGD/AdamW), initialization, overfit/underfit, bias–variance.
- Tooling: PyTorch, JAX (optional), Lightning/Accelerate, mixed precision, profiling.
- Data discipline: curation, splits, leakage checks, augmentations, reproducibility (seeds).
Milestone: Implement a small MNIST/CIFAR classifier and a toy RNN on character data; set up a clean training loop with logging & checkpoints.
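A minimal sketch of this milestone's training loop, assuming torchvision for MNIST; model size, batch size, and checkpoint path are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train(epochs: int = 3, lr: float = 1e-3, ckpt_path: str = "mnist.pt"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    ds = datasets.MNIST("data", train=True, download=True,
                        transform=transforms.ToTensor())
    loader = DataLoader(ds, batch_size=128, shuffle=True)
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 10)).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for step, (x, y) in enumerate(loader):
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
            if step % 100 == 0:  # lightweight logging; swap in W&B/TensorBoard
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
        # checkpoint model + optimizer state so runs are resumable
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(), "epoch": epoch}, ckpt_path)

# train()  # downloads MNIST on first run
```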
1) Architecture Families (structure level)
1.1 Feedforward & Modern MLPs
- MLP, Residual MLP, MLP-Mixer, gMLP, KAN (Kolmogorov-Arnold Networks).
Milestone: Reproduce MLP-Mixer on CIFAR; compare to a small CNN.
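For orientation before the reproduction, one token-/channel-mixing block in the style of the MLP-Mixer paper (dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing across patches, then channel-mixing."""
    def __init__(self, num_patches: int, dim: int, token_hidden: int, chan_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                      # x: (batch, patches, dim)
        y = self.norm1(x).transpose(1, 2)      # mix across the patch axis
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.chan_mlp(self.norm2(x))   # mix across the channel axis
        return x

x = torch.randn(8, 64, 128)                    # 64 patches, 128 channels
print(MixerBlock(64, 128, 256, 512)(x).shape)  # torch.Size([8, 64, 128])
```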
1.2 CNNs (classic → modern → lightweight)
- Backbones: ResNet, DenseNet, Inception, MobileNet, EfficientNet, ConvNeXt, RegNet, NFNet, CoAtNet (conv-attn hybrid).
- Modules: Dilated/Deformable/Depthwise/Grouped conv, SENet, CBAM, ECA.
- Lightweight: ShuffleNet, GhostNet.
Milestone: Train a modern small CNN on your dataset; ablate depthwise vs standard conv.
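A quick way to see why depthwise separable convs count as "lightweight" before running the ablation (parameter counts only; the ablation itself measures accuracy):

```python
import torch.nn as nn

# Standard 3x3 conv: every output channel sees every input channel.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Depthwise separable: per-channel 3x3 (groups=in_channels) + 1x1 pointwise mix.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1))                       # pointwise

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # ~73.9k vs ~9.0k parameters
```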
1.3 RNNs & Temporal Convs
- Vanilla RNN (Elman/Jordan), LSTM, GRU, Peephole, Bi-/Stacked; TCN; packing/masking for variable length.
Milestone: Build a sequence classifier with LSTM/GRU and TCN; handle variable lengths properly.
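The packing/masking mechanics for variable lengths, sketched with PyTorch's pack_padded_sequence utilities:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three sequences of different lengths, feature dim 8.
seqs = [torch.randn(n, 8) for n in (5, 3, 7)]
lengths = torch.tensor([5, 3, 7])

padded = pad_sequence(seqs, batch_first=True)        # (3, 7, 8), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)  # RNN skips the pad steps

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
out_packed, h_n = gru(packed)                        # h_n is the last *real* step
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
print(out.shape, h_n.shape)  # torch.Size([3, 7, 16]) torch.Size([1, 3, 16])
```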
1.4 Transformers (and efficiency)
- Families: Encoder-only (BERT), Decoder-only (GPT), Encoder–Decoder (T5/BART).
- Vision: ViT/DeiT/Swin, PVT, MaxViT; Detection: DETR/Deformable DETR, Mask2Former.
- Long/efficient: Transformer-XL, Longformer, BigBird, Reformer, Performer, Linformer, FlashAttention.
- Popular decoder-only models: Llama, Mistral/Mixtral, Gemma, Phi.
- Sparse/conditional compute: MoE (GShard, Switch Transformer, Mixtral), Mixture-of-Depths, Adaptive Computation Time.
- Retrieval/memory: RAG, REALM, RETRO, kNN-LM.
Milestone: Fine-tune a small encoder for classification and a small decoder for generation; add RAG on a private corpus.
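Every family above shares the same attention primitive; a minimal causal (decoder-style) version, without KV caching or FlashAttention kernels:

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, T, dim)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))                 # (B, heads, T, head_dim)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

print(CausalSelfAttention(64, 4)(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```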
1.5 State-Space & Long-Range Alternatives
- S4/S5/DSS, Mamba (and variants), Retentive Network, Hyena/H3, RWKV (RNN-like, parallel training).
Milestone: Replace Transformer with Mamba on a long-sequence task; compare throughput and accuracy.
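A toy diagonal linear SSM, just to make the recurrence concrete; real S4/Mamba learn the discretization and replace this Python loop with convolutions or hardware-aware parallel scans:

```python
import torch

def ssm_scan(A, B, C, u):
    """Toy diagonal linear SSM: h_t = A*h_{t-1} + B@u_t, y_t = C@h_t."""
    T, _ = u.shape
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A * h + B @ u[t]      # diagonal A -> elementwise decay per state
        ys.append(C @ h)
    return torch.stack(ys)

d_state, d_in, d_out, T = 16, 4, 4, 100
A = torch.rand(d_state) * 0.99            # stable decays in (0, 1)
B = torch.randn(d_state, d_in) * 0.1
C = torch.randn(d_out, d_state) * 0.1
print(ssm_scan(A, B, C, torch.randn(T, d_in)).shape)  # torch.Size([100, 4])
```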
1.6 GNNs (graphs)
- GCN, GAT, GraphSAGE, GIN, MPNN; equivariant: EGNN, SE(3)-Transformer; temporal: TGN; generative: GAE/VGAE.
Milestone: Node classification on Cora/Citeseer; then TGN on a temporal graph.
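A minimal dense-adjacency GCN layer (Kipf & Welling normalization); real graph workloads would use sparse ops or a library such as PyG:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = D^-1/2 (A+I) D^-1/2 H W."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):                 # adj: dense (N, N), 0/1
        a_hat = adj + torch.eye(adj.size(0))   # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return norm @ self.lin(x)              # aggregate, then transform

x = torch.randn(5, 8)                          # 5 nodes, 8 features
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()              # symmetrize
print(GCNLayer(8, 16)(x, adj).shape)           # torch.Size([5, 16])
```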
1.7 INR / 3D & Geometry
- INR/coordinate networks: SIREN, NeRF (Mip-NeRF, Instant-NGP, DVGO, Plenoxels), DeepSDF, Occupancy Nets.
- 3D nets: PointNet/PointNet++, DGCNN, Point Transformer, MinkowskiNet (sparse conv).
Milestone: Train a tiny NeRF on a few views; reconstruct a room/prop.
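One reusable ingredient of the NeRF line worth coding yourself: the Fourier positional encoding that lets a plain MLP fit high-frequency scene detail:

```python
import torch

def positional_encoding(x, n_freqs: int = 10):
    """NeRF-style Fourier features: [sin(2^k * pi * x), cos(2^k * pi * x)]."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi        # (n_freqs,)
    angles = x[..., None] * freqs                          # (..., D, n_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., D, 2*n_freqs)
    return enc.flatten(start_dim=-2)                       # (..., D*2*n_freqs)

pts = torch.rand(1024, 3)                 # 3D sample points along camera rays
print(positional_encoding(pts).shape)     # torch.Size([1024, 60])
```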
1.8 Associative/Other
- Modern Hopfield Networks, Hopfield-Transformer; CapsNet; Neural ODE, Latent ODE, ODE-RNN.
Milestone: Implement a small Neural ODE on toy dynamics; compare to GRU.
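A minimal Neural ODE sketch using fixed-step Euler integration; real experiments would typically use an adaptive solver (e.g. the torchdiffeq package) and the adjoint method:

```python
import torch
import torch.nn as nn

class NeuralODE(nn.Module):
    """Toy Neural ODE: dz/dt = f_theta(z), integrated with fixed-step Euler."""
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, dim))

    def forward(self, z0, t1: float = 1.0, steps: int = 20):
        z, dt = z0, t1 / steps
        for _ in range(steps):
            z = z + dt * self.f(z)   # Euler step; gradients flow through all steps
        return z

z0 = torch.randn(16, 2)              # batch of 2D initial states
print(NeuralODE(dim=2)(z0).shape)    # torch.Size([16, 2])
```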
2) Generative Model Paradigms (not architectures)
- Autoregressive: RNN-LM, GPT, PixelRNN/PixelCNN, WaveNet, ImageGPT, AudioLM, MusicLM.
- VAE: VAE, β-VAE, VQ-VAE / VQ-VAE-2, NVAE.
- GAN: GAN, DCGAN, WGAN/WGAN-GP, SNGAN, StyleGAN(1/2/3/XL), BigGAN, CycleGAN, Pix2Pix, SPADE/GauGAN.
- Diffusion / Score: DDPM / Improved DDPM, DDIM, Score-SDE (VE/VP), Latent Diffusion (Stable Diffusion), EDM, Consistency Models; Transformer-based: DiT, U-ViT.
- Flow / Flow-Matching: NICE, RealNVP, Glow, Flow Matching, Rectified Flow.
- EBM: EBM, score matching, NCE.
Milestone: Train a VQ-VAE and a DDPM on a small image dataset; sample and compare FID/precision–recall.
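To ground the diffusion entries, the epsilon-prediction DDPM training objective in a few lines; `model(x_t, t)` is an assumed noise-prediction interface (in practice a small U-Net), and the linear beta schedule follows the original paper:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, T: int = 1000):
    """Noise x0 to a random step t in closed form; train to predict the noise."""
    betas = torch.linspace(1e-4, 0.02, T)                 # linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative alphas
    t = torch.randint(0, T, (x0.size(0),))                # one t per sample
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast to x0
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                 # predict the noise

x0 = torch.randn(8, 3, 32, 32)
dummy = lambda x_t, t: torch.zeros_like(x_t)   # stand-in noise predictor
print(ddpm_loss(dummy, x0).item())             # ~1.0 for an all-zero predictor
```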
3) Domain Tracks (pick 2–3 first)
3.1 NLP / Long-Context & Retrieval
- Tokenizers; BERT/T5/GPT fine-tuning; instruction tuning; LoRA; RAG; long-context (ALiBi/RoPE, Longformer/BigBird).
Project: Domain Q&A with RAG; evaluate hallucination and grounding.
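The retrieval half of RAG, reduced to a toy cosine-similarity top-k; a real system would swap in a trained embedder and an ANN index such as FAISS:

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, doc_embs, docs, k: int = 2):
    """Toy dense retrieval: cosine similarity, then top-k documents."""
    sims = F.cosine_similarity(query_emb[None, :], doc_embs, dim=-1)
    top = sims.topk(k).indices
    return [docs[i] for i in top]

docs = ["LoRA adapts low-rank deltas.", "RoPE rotates query/key pairs.",
        "BigBird uses sparse attention."]
doc_embs = torch.randn(3, 64)       # stand-in embeddings; use a real encoder
query_emb = torch.randn(64)
context = "\n".join(retrieve(query_emb, doc_embs, docs))
prompt = f"Answer using only this context:\n{context}\n\nQ: What is LoRA?\nA:"
print(prompt)                       # feed to the generator LLM
```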
3.2 Vision (Detection/Segmentation/Foundations)
- Backbones (ResNet/ConvNeXt ↔ ViT/Swin), detectors (YOLO/RetinaNet/Faster R-CNN ↔ DETR), segmentation (U-Net/DeepLab ↔ Mask2Former).
Project: Build a multi-object detection pipeline; measure latency vs mAP.
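The latency-vs-mAP comparison rests on box IoU, the building block behind NMS and mAP evaluation; a vectorized sketch:

```python
import torch

def box_iou(a, b):
    """IoU between (N, 4) and (M, 4) boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.maximum(a[:, None, :2], b[None, :, :2])   # intersection top-left
    rb = torch.minimum(a[:, None, 2:], b[None, :, 2:])   # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

preds = torch.tensor([[0., 0., 10., 10.]])
gts = torch.tensor([[5., 5., 15., 15.], [20., 20., 30., 30.]])
print(box_iou(preds, gts))   # tensor([[0.1429, 0.0000]])
```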
3.3 Speech & Audio
- ASR: Conformer, RNN-T/CTC, Whisper; TTS: WaveNet, HiFi-GAN, VITS; audio generation: AudioLM/MusicLM (concepts).
Project: Fine-tune Whisper on custom accents; deploy streaming ASR.
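One ASR fundamental worth internalizing before touching Whisper: the CTC objective, shown here on random tensors just to make the shapes concrete:

```python
import torch
import torch.nn as nn

# CTC aligns unsegmented audio frames to text without per-frame labels; it is
# the training objective behind many ASR encoders and part of the RNN-T recipe.
T, B, C = 50, 4, 28                     # frames, batch, classes (blank = 0)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))  # label indices; blank never appears
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), 10)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                         # gradients flow to the acoustic model
print(loss.item())
```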
3.4 Time-Series / Forecasting
- TCN, Informer/Autoformer (Transformer), S4/Mamba, RWKV.
Project: Multi-horizon forecasting with exogenous features; compare MSE vs latency.
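The supervised windowing setup behind multi-horizon forecasting, as a sketch (lookback/horizon values and the target-column convention are illustrative):

```python
import torch

def make_windows(series, lookback: int, horizon: int):
    """Slice a (T, F) series into (lookback -> horizon) supervised pairs."""
    X, Y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i : i + lookback])                           # inputs
        Y.append(series[i + lookback : i + lookback + horizon, 0])   # target = col 0
    return torch.stack(X), torch.stack(Y)

series = torch.randn(500, 3)          # target + 2 exogenous features
X, Y = make_windows(series, lookback=96, horizon=24)
print(X.shape, Y.shape)  # torch.Size([381, 96, 3]) torch.Size([381, 24])
```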
3.5 Multimodal (VL/VLM/LMM)
- CLIP, BLIP/BLIP-2, Flamingo, LLaVA, GPT-4V, Gemini; alignment objectives (ITC/ITM), instruction tuning, evaluation (VQA/VLEP).
Project: Image-text retrieval + VQA; add RAG over captions/metadata.
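The ITC objective above is CLIP's symmetric contrastive loss; a minimal version (the 0.07 temperature follows the paper, though CLIP actually learns it):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Matched image-text pairs sit on the diagonal of the similarity matrix;
    symmetric cross-entropy pulls them together, pushes mismatches apart."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

img_emb, txt_emb = torch.randn(32, 256), torch.randn(32, 256)
print(clip_loss(img_emb, txt_emb).item())
```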
3.6 Reinforcement Learning
- DQN/Double/Rainbow, A2C/A3C, PPO, SAC/TD3; RL as sequence modeling: Decision Transformer; RLHF (SFT → RM → PPO/DPO).
Project: PPO on a continuous-control task; add reward shaping and curriculum.
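PPO's clipped surrogate objective, the piece that makes it stable enough to be the default; a sketch over stand-in tensors:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2):
    """Cap the policy ratio so one update cannot move the policy too far
    from the one that collected the data."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()   # maximize -> negate

logp_old = torch.randn(64)
logp_new = logp_old + 0.1 * torch.randn(64)   # stand-in for the updated policy
advantages = torch.randn(64)                  # e.g. from GAE
print(ppo_clip_loss(logp_new, logp_old, advantages).item())
```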
3.7 Protein/Biology
- Structure: AlphaFold (Evoformer), ESM; generation: RFdiffusion, ProtGPT2.
Project: Use ESM embeddings for property prediction; explore RFdiffusion samples (conceptually).
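A sketch of the project's downstream half: a linear probe over per-protein embeddings. Random features stand in for real ESM embeddings here (in practice you would mean-pool the model's per-residue representations; 1280 dims matches several ESM checkpoints), so the score should hover near chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1280))      # stand-in per-protein embeddings
y = rng.integers(0, 2, size=200)      # binary property, e.g. solubility

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())   # ~0.5 on random features
```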
3.8 3D/Graphics & Virtual Production (XR-friendly)
- Instant-NGP, Plenoxels, Point-NeRF and other point-based methods; camera-tracking fusion; real-time constraints.
Project: Scene capture → NeRF background → integrate with a tracked camera feed.
4) Training Recipes, Evaluation & Systems
- Optimization: AdamW, schedulers & warmup, gradient clipping, weight decay, EMA.
- Regularization: dropout, label smoothing, stochastic depth, mixup/cutmix.
- Scaling: DP/ZeRO, tensor/model/pipeline parallel, MoE, gradient checkpointing, LoRA/QLoRA.
- Data & Eval: robust splits, leakage tests, calibration, uncertainty, long-context eval, safety checks.
- Serving: quantization (INT8/FP8), distillation, Triton/FastAPI, streaming latency budgets (ASR/VLM), caching & KV reuse.
Milestone: Take one domain project to production-like serving with metrics dashboards and A/B tests.
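One ingredient from the optimization bullet above, linear warmup plus cosine decay, as a LambdaLR sketch (warmup/total step counts are illustrative):

```python
import math
import torch

def warmup_cosine(step, warmup: int = 500, total: int = 10000):
    """LR multiplier: linear ramp over `warmup` steps, then cosine to zero."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 10)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_cosine)
# each training step: opt.step(); sched.step()
```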
5) Capstones (pick 1)
- Multimodal Studio Assistant: RAG + LLaVA/CLIP for shot search, asset tagging, and scene notes; on-prem inference.
- Long-Sequence Forecasting: Mamba/RWKV vs Transformer on operational telemetry; deploy with alerts.
- Real-Time ASR→Captioning: Whisper/Conformer + latency budget + domain lexicon injection.
- NeRF-Driven VP Backdrops: Capture → Instant-NGP training → keyed talent compositing with tracked camera.
- Bio-Embedding Featurization: use ESM embeddings as features for property prediction; interpretability focus.
6) Paper-Reading & Repro Culture
- Triage: architecture vs training vs data contributions; reproduce small-scale first.
- Logs & ablations: keep a fixed template; isolate one variable per run.
- Share cards: “What changed? Why? At what cost (FLOPs/latency/memory)?”
7) Quick Decision Guide (cheat-sheet)
- Very long sequences: try SSM (S4/Mamba) or RWKV; if retrieval helps → RAG.
- Tight latency / mobile: MobileNet/ShuffleNet/GhostNet or distilled ViT; quantize.
- Structured graphs: GAT/GIN; temporal → TGN.
- 3D/scene capture: Instant-NGP/Plenoxels; sparse 3D → MinkowskiNet.
- Text generation: decoder (Llama/Mistral/Gemma/Phi) with LoRA; safety + eval.
- Image generation: start Latent Diffusion; scale to DiT when compute allows.