A Practical Learning Roadmap

Work through the material in this order: foundations → architecture families (CNN/RNN/Transformer/SSM/GNN/INR…) → generative paradigms (AR/VAE/GAN/Diffusion/Flow/EBM) → domain tracks (NLP/vision/speech & audio/time-series/multimodal/RL/bio) → systems & scaling → capstone. Use the bold keywords to search for papers and resources.


Learning Roadmap

0) Foundations & Setup

  • Math & Prob/Stats: linear algebra, calculus for DL, probability, information theory.
  • Core DL: optimization (SGD/AdamW), initialization, overfit/underfit, bias–variance.
  • Tooling: PyTorch, JAX (optional), Lightning/Accelerate, mixed precision, profiling.
  • Data discipline: curation, splits, leakage checks, augmentations, reproducibility (seeds).

Milestone: Implement a small MNIST/CIFAR classifier and a toy RNN on character data; set up a clean training loop with logging & checkpoints.
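A minimal sketch of what that training loop can look like, assuming torchvision is available; the tiny MLP, file names, and hyperparameters are illustrative, not a recommendation:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

torch.manual_seed(0)  # reproducibility, per the data-discipline bullet above

# Small MLP classifier for 28x28 MNIST digits (illustrative architecture).
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

train_ds = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_ds, batch_size=128, shuffle=True)

for epoch in range(3):
    total = 0.0
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        total += loss.item()
        if step % 100 == 0:
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")  # lightweight logging
    print(f"epoch {epoch} mean loss {total / len(loader):.4f}")
    # Checkpoint model + optimizer state so runs can resume exactly.
    torch.save({"epoch": epoch, "model": model.state_dict(), "opt": opt.state_dict()},
               f"ckpt_epoch{epoch}.pt")
```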


1) Architecture Families (structure level)

1.1 Feedforward & Modern MLPs

  • MLP, Residual MLP, MLP-Mixer, gMLP, KAN (Kolmogorov–Arnold Networks).
    Milestone: Reproduce MLP-Mixer on CIFAR; compare to a small CNN.
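A single Mixer block is small enough to write out directly; a hedged sketch with illustrative hidden sizes:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing across patches, then channel-mixing."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):  # x: (batch, num_patches, dim)
        # Token mixing acts along the patch axis, so transpose in and out.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

x = torch.randn(8, 64, 128)          # 8 images, 64 patches, 128 channels
print(MixerBlock(64, 128)(x).shape)  # torch.Size([8, 64, 128])
```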

1.2 CNNs (classic → modern → lightweight)

  • Backbones: ResNet, DenseNet, Inception, MobileNet, EfficientNet, ConvNeXt, RegNet, NFNet, CoAtNet (conv-attn hybrid).
  • Modules: Dilated/Deformable/Depthwise/Grouped conv, SENet, CBAM, ECA.
  • Lightweight: ShuffleNet, GhostNet.
    Milestone: Train a modern small CNN on your dataset; ablate depthwise vs standard conv.
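The parameter gap that the ablation probes is visible before any training; a quick sketch (layer sizes are arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

cin, cout, k = 64, 128, 3
standard = nn.Conv2d(cin, cout, k, padding=1)
# Depthwise-separable = per-channel spatial conv + 1x1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(cin, cin, k, padding=1, groups=cin),  # depthwise
    nn.Conv2d(cin, cout, 1),                        # pointwise
)
print(n_params(standard), n_params(separable))  # 73856 vs 8960 parameters
```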

1.3 RNNs & Temporal Convs

  • Vanilla RNN (Elman/Jordan), LSTM, GRU, Peephole, Bi-/Stacked; TCN; packing/masking for variable length.
    Milestone: Build a sequence classifier with LSTM/GRU and TCN; handle variable lengths properly.
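A minimal sketch of the packing workflow on synthetic data; sequence lengths, feature sizes, and the 4-class head are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three sequences of different lengths, each step a 16-dim feature vector.
seqs = [torch.randn(L, 16) for L in (5, 3, 7)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)  # (3, 7, 16), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
_, (h_n, _) = lstm(packed)                 # packing makes the LSTM skip pad steps
logits = nn.Linear(32, 4)(h_n[-1])         # final hidden state -> 4-class logits
print(logits.shape)                        # torch.Size([3, 4])
```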

1.4 Transformers (and efficiency)

  • Families: Encoder-only (BERT), Decoder-only (GPT), Encoder–Decoder (T5/BART).
  • Vision: ViT/DeiT/Swin, PVT, MaxViT; Detection: DETR/Deformable DETR, Mask2Former.
  • Long/efficient: Transformer-XL, Longformer, BigBird, Reformer, Performer, Linformer, FlashAttention.
  • Popular decoder models: Llama, Mistral (and its MoE variant Mixtral), Gemma, Phi.
  • Sparse/conditional compute: MoE(GShard, Switch, Mixtral), Mixture-of-Depths, Adaptive Computation Time.
  • Retrieval/memory: RAG, REALM, RETRO, kNN-LM.
    Milestone: Fine-tune a small encoder for classification and a small decoder for generation; add RAG on a private corpus.
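Before fine-tuning, it helps to have written the core op once. A self-contained sketch of single-head causal self-attention (random, untrained projections, purely illustrative):

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.
    x: (batch, seq, dim); w_q/w_k/w_v: (dim, dim) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    seq = x.size(1)
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # no attention to future tokens
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 10, 64)
w = [torch.randn(64, 64) * 0.02 for _ in range(3)]
print(causal_self_attention(x, *w).shape)  # torch.Size([2, 10, 64])
```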

1.5 State-Space & Long-Range Alternatives

  • S4/S5/DSS, Mamba (and variants), Retentive Network, Hyena/H3, RWKV (RNN-like, parallel training).
    Milestone: Replace Transformer with Mamba on a long-sequence task; compare throughput and accuracy.
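The recurrence underlying S4/Mamba is compact; a sketch of the sequential (non-parallel-scan) form of a diagonal linear SSM, with scalar inputs for simplicity:

```python
import torch

def diagonal_ssm(u, a, b, c):
    """h_t = a * h_{t-1} + b * u_t ;  y_t = <c, h_t>.
    u: (seq,) scalar inputs; a, b, c: (state_dim,) diagonal SSM parameters.
    Real S4/Mamba implementations parallelize this scan; the math is the same."""
    h = torch.zeros_like(a)
    ys = []
    for u_t in u:
        h = a * h + b * u_t         # state update (elementwise: A is diagonal)
        ys.append(torch.dot(c, h))  # readout
    return torch.stack(ys)

state_dim = 16
a = torch.rand(state_dim) * 0.9     # stable dynamics: |a| < 1
b, c = torch.randn(state_dim), torch.randn(state_dim)
print(diagonal_ssm(torch.randn(100), a, b, c).shape)  # torch.Size([100])
```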

1.6 GNNs (graphs)

  • GCN, GAT, GraphSAGE, GIN, MPNN; equivariant: EGNN, SE(3)-Transformer; temporal: TGN; generative: GAE/VGAE.
    Milestone: Node classification on Cora/Citeseer; then TGN on a temporal graph.
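A dense-adjacency GCN layer is a good warm-up before reaching for a graph library; a sketch on a small random graph:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0))  # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm @ self.lin(h))

adj = (torch.rand(5, 5) < 0.4).float()
adj = ((adj + adj.t()) > 0).float()           # symmetrize
h = torch.randn(5, 8)                         # 5 nodes, 8 features each
print(GCNLayer(8, 16)(h, adj).shape)          # torch.Size([5, 16])
```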

1.7 INR / 3D & Geometry

  • INR/coordinate networks: SIREN, NeRF (Mip-NeRF, Instant-NGP, DVGO, Plenoxels), DeepSDF, Occupancy Nets.
  • 3D nets: PointNet/PointNet++, DGCNN, Point Transformer, MinkowskiNet (sparse conv).
    Milestone: Train a tiny NeRF on a few views; reconstruct a room/prop.
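One piece worth writing yourself is NeRF's frequency positional encoding; a sketch using the paper's L = 10 frequencies for 3-D points:

```python
import torch

def positional_encoding(x, n_freqs: int = 10):
    """NeRF-style encoding: map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)].
    x: (..., 3) points; returns (..., 3 * 2 * n_freqs)."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi        # 2^k * pi
    angles = x.unsqueeze(-1) * freqs                       # (..., 3, n_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 3, 2*n_freqs)
    return enc.flatten(start_dim=-2)

pts = torch.rand(1024, 3)
print(positional_encoding(pts).shape)  # torch.Size([1024, 60])
```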

1.8 Associative/Other

  • Modern Hopfield Networks, Hopfield-Transformer; CapsNet; Neural ODE, Latent ODE, ODE-RNN.
    Milestone: Implement a small Neural ODE on toy dynamics; compare to GRU.
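A hedged sketch of the idea using fixed-step forward Euler; production code would use an ODE solver library (e.g., torchdiffeq) with adaptive steps and the adjoint method:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Learned vector field f(h, t) defining dh/dt = f(h, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h, t):
        return self.net(h)  # t unused: an autonomous system, for simplicity

def euler_integrate(func, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step forward Euler; gradients flow through the unrolled loop."""
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * func(h, t0 + i * dt)
    return h

h0 = torch.randn(32, 2)                       # batch of 2-D initial states
print(euler_integrate(ODEFunc(2), h0).shape)  # torch.Size([32, 2])
```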

2) Generative Model Paradigms (not architectures)

  • Autoregressive: RNN-LM, GPT, PixelRNN/PixelCNN, WaveNet, ImageGPT, AudioLM, MusicLM.
  • VAE: VAE, β-VAE, VQ-VAE / VQ-VAE-2, NVAE.
  • GAN: GAN, DCGAN, WGAN/WGAN-GP, SNGAN, StyleGAN(1/2/3/XL), BigGAN, CycleGAN, Pix2Pix, SPADE/GauGAN.
  • Diffusion / Score: DDPM / Improved DDPM, DDIM, Score-SDE (VE/VP), Latent Diffusion (Stable Diffusion), EDM, Consistency Models; Transformer-based: DiT, U-ViT.
  • Flow / Flow-Matching: NICE, RealNVP, Glow, Flow Matching, Rectified Flow.
  • EBM: EBM, score matching, NCE.

Milestone: Train a VQ-VAE and a DDPM on a small image dataset; sample and compare FID/precision–recall.
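The DDPM training step reduces to closed-form noising plus an ε-prediction MSE; a sketch with the standard linear β schedule (the denoising model itself is omitted):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # DDPM linear beta schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product ᾱ_t

def noise_and_target(x0):
    """Sample t, form x_t = sqrt(ᾱ_t) x0 + sqrt(1 - ᾱ_t) ε; the model learns to predict ε."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return x_t, t, eps                         # training loss = MSE(model(x_t, t), eps)

x0 = torch.randn(16, 3, 32, 32)                # a batch of images scaled to [-1, 1]
x_t, t, eps = noise_and_target(x0)
print(x_t.shape, t.shape)                      # torch.Size([16, 3, 32, 32]) torch.Size([16])
```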


3) Domain Tracks (pick 2–3 first)

3.1 NLP / Long-Context & Retrieval

  • Tokenizers; BERT/T5/GPT fine-tuning; instruction tuning; LoRA; RAG; long-context (ALiBi/RoPE, Longformer/BigBird).
    Project: Domain Q&A with RAG; evaluate hallucination and grounding.
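LoRA is simple enough to sketch from scratch; a minimal wrapper around a frozen nn.Linear with illustrative rank and scaling (real runs typically use a library such as PEFT):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen Linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable params vs ~262k in the frozen base
```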

3.2 Vision (Detection/Segmentation/Foundations)

  • Backbones (ResNet/ConvNeXt ↔ ViT/Swin), detectors (YOLO/RetinaNet/Faster R-CNN ↔ DETR), segmentation (U-Net/DeepLab ↔ Mask2Former).
    Project: Build a multi-object detection pipeline; measure latency vs mAP.
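A sketch of the latency half of that measurement, assuming a recent torchvision (older versions spell the weights argument pretrained=True); mAP would come from a COCO-style evaluator:

```python
import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # COCO-pretrained
img = [torch.rand(3, 480, 640)]  # torchvision detectors take a list of CHW tensors

with torch.no_grad():
    model(img)                    # warm-up run
    t0 = time.perf_counter()
    for _ in range(10):
        out = model(img)
    latency_ms = (time.perf_counter() - t0) / 10 * 1000

print(f"{latency_ms:.1f} ms/image; {len(out[0]['boxes'])} boxes detected")
```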

3.3 Speech & Audio

  • ASR: Conformer, RNN-T/CTC, Whisper; TTS: WaveNet, HiFi-GAN, VITS; audio generation: AudioLM/MusicLM (concepts).
    Project: Fine-tune Whisper on custom accents; deploy streaming ASR.
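Whisper fine-tuning itself is library-specific, but the CTC objective listed above is easy to exercise in isolation; a sketch with random logits and labels (blank id, vocab size, and lengths are illustrative):

```python
import torch
import torch.nn as nn

# CTC aligns unsegmented label sequences to frame-level logits (index 0 = blank).
ctc = nn.CTCLoss(blank=0)
T, N, C = 120, 4, 30                        # frames, batch size, vocab (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 25))      # label ids, excluding the blank
input_lengths = torch.full((N,), T)
target_lengths = torch.randint(10, 26, (N,))
print(ctc(log_probs, targets, input_lengths, target_lengths).item())
```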

3.4 Time-Series / Forecasting

  • TCN, Informer/Autoformer (Transformer), S4/Mamba, RWKV.
    Project: Multi-horizon forecasting with exogenous features; compare MSE vs latency.
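The TCN building block is a left-padded dilated convolution; a sketch showing how stacked dilations preserve sequence length while growing the receptive field:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated 1-D conv that pads only on the left, so outputs never see the future."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# Dilations 1, 2, 4, 8 grow the receptive field exponentially with depth.
net = nn.Sequential(*[CausalConv1d(16, dilation=2 ** i) for i in range(4)])
print(net(torch.randn(8, 16, 200)).shape)  # torch.Size([8, 16, 200])
```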

3.5 Multimodal (VL/VLM/LMM)

  • CLIP, BLIP/BLIP-2, Flamingo, LLaVA, GPT-4V, Gemini; alignment objectives (ITC/ITM), instruction tuning, evaluation (VQA/VLEP).
    Project: Image-text retrieval + VQA; add RAG over captions/metadata.
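Retrieval quality hinges on the ITC objective; a sketch of the symmetric CLIP-style loss on random embeddings (embedding size and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix (the ITC objective).
    Matching image/text pairs sit on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

print(clip_loss(torch.randn(32, 512), torch.randn(32, 512)))
```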

3.6 Reinforcement Learning

  • DQN/Double/Rainbow, A2C/A3C, PPO, SAC/TD3; RL as sequence modeling: Decision Transformer; RLHF (SFT → RM → PPO/DPO).
    Project: PPO on a continuous-control task; add reward shaping and curriculum.
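The heart of PPO fits in a few lines; a sketch of the clipped surrogate loss (the surrounding rollout and advantage-estimation machinery is omitted):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate: keep the policy ratio within [1 - eps, 1 + eps]."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

logp_old = torch.randn(64)
logp_new = logp_old + 0.1 * torch.randn(64)  # would carry grad in real training
adv = torch.randn(64)
print(ppo_clip_loss(logp_new, logp_old, adv))
```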

3.7 Protein/Biology

  • Structure: AlphaFold (Evoformer), ESM; generation: RFdiffusion, ProtGPT2.
    Project: Use ESM embeddings for property prediction; explore RFdiffusion samples (conceptually).

3.8 3D/Graphics & Virtual Production (XR-friendly)

  • Instant-NGP, Plenoxels, PointNeRF/Point-based methods; camera-tracking fusion; real-time constraints.
    Project: Scene capture → NeRF background → integrate with a tracked camera feed.

4) Training Recipes, Evaluation & Systems

  • Optimization: AdamW, schedulers & warmup, gradient clipping, weight decay, EMA.
  • Regularization: dropout, label smoothing, stochastic depth, mixup/cutmix.
  • Scaling: DP/ZeRO, tensor/model/pipeline parallel, MoE, gradient checkpointing, LoRA/QLoRA.
  • Data & Eval: robust splits, leakage tests, calibration, uncertainty, long-context eval, safety checks.
  • Serving: quantization (INT8/FP8), distillation, Triton/FastAPI, streaming ASR/VLM latencies, caching & KV reuse.

Milestone: Take one domain project to production-like serving with metrics dashboards and A/B tests.
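As a first serving experiment, the quantization bullet above is nearly free to try with post-training dynamic INT8; a sketch assuming the torch.quantization entry point (recent versions also expose it as torch.ao.quantization):

```python
import os
import torch
import torch.nn as nn

# Dynamic quantization: weights stored as INT8, activations quantized on the fly.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10]) -- same interface as the fp32 model

def size_mb(m):
    torch.save(m.state_dict(), "tmp.pt")
    s = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return s

print(f"{size_mb(model):.2f} MB -> {size_mb(quantized):.2f} MB")  # roughly 4x smaller
```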


5) Capstones (pick 1)

  • Multimodal Studio Assistant: RAG + LLaVA/CLIP for shot search, asset tagging, and scene notes; on-prem inference.
  • Long-Sequence Forecasting: Mamba/RWKV vs Transformer on operational telemetry; deploy with alerts.
  • Real-Time ASR→Captioning: Whisper/Conformer + latency budget + domain lexicon injection.
  • NeRF-Driven VP Backdrops: Capture → Instant-NGP training → keyed talent compositing with tracked camera.
  • Bio-Embedding Feature Design: Use ESM embeddings as features for property prediction; interpretability focus.

6) Paper-Reading & Repro Culture

  • Triage: architecture vs training vs data contributions; reproduce small-scale first.
  • Logs & ablations: keep a fixed template; isolate one variable per run.
  • Share cards: “What changed? Why? At what cost (FLOPs/latency/memory)?”

7) Quick Decision Guide (cheat-sheet)

  • Very long sequences: try SSM (S4/Mamba) or RWKV; if retrieval helps → RAG.
  • Tight latency / mobile: MobileNet/ShuffleNet/GhostNet or distilled ViT; quantize.
  • Structured graphs: GAT/GIN; temporal → TGN.
  • 3D/scene capture: Instant-NGP/Plenoxels; sparse 3D → MinkowskiNet.
  • Text generation: decoder (Llama/Mistral/Gemma/Phi) with LoRA; safety + eval.
  • Image generation: start Latent Diffusion; scale to DiT when compute allows.
