  • A practical learning roadmap

    Work through it in this order: foundations → architecture families (CNN/RNN/Transformer/SSM/GNN/INR…) → generative paradigms (AR/VAE/GAN/Diffusion/Flow/EBM) → domain tracks (NLP / vision / speech & audio / time-series / multimodal / RL / bio) → systems & scaling → capstone. Use the bold keywords to search for papers and resources.



    Learning Roadmap

    0) Foundations & Setup

    • Math & Prob/Stats: linear algebra, calculus for DL, probability, information theory.
    • Core DL: optimization (SGD/AdamW), initialization, overfit/underfit, bias–variance.
    • Tooling: PyTorch, JAX (optional), Lightning/Accelerate, mixed precision, profiling.
    • Data discipline: curation, splits, leakage checks, augmentations, reproducibility (seeds).

    Milestone: Implement a small MNIST/CIFAR classifier and a toy RNN on character data; set up a clean training loop with logging & checkpoints.
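
    A minimal sketch of that loop in PyTorch, assuming torchvision for the dataset; the logging and checkpoint layout here are illustrative, not a fixed recipe:

    ```python
    # Minimal MNIST training loop with logging and checkpointing.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_ds = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    train_dl = DataLoader(train_ds, batch_size=128, shuffle=True)

    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        for step, (x, y) in enumerate(train_dl):
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if step % 100 == 0:  # plain-print logging; swap in TensorBoard/W&B later
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
        # Checkpoint everything needed to resume: weights, optimizer state, progress.
        torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "epoch": epoch},
                   f"ckpt_epoch{epoch}.pt")
    ```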


    1) Architecture Families (structure level)

    1.1 Feedforward & Modern MLPs

    • MLP, Residual MLP, MLP-Mixer, gMLP, KAN (Kolmogorov-Arnold Networks).
      Milestone: Reproduce MLP-Mixer on CIFAR; compare to a small CNN.
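
    A Mixer block is small enough to write out before reproducing the full model. A sketch following the structure in the paper; dimensions are illustrative:

    ```python
    # One MLP-Mixer block: token mixing across patches, then channel mixing across features.
    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        def __init__(self, num_patches, dim, token_hidden, channel_hidden):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.token_mlp = nn.Sequential(  # mixes information across the patch axis
                nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
            self.norm2 = nn.LayerNorm(dim)
            self.channel_mlp = nn.Sequential(  # mixes information across the feature axis
                nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

        def forward(self, x):                      # x: (batch, num_patches, dim)
            y = self.norm1(x).transpose(1, 2)      # -> (batch, dim, num_patches)
            x = x + self.token_mlp(y).transpose(1, 2)
            return x + self.channel_mlp(self.norm2(x))

    x = torch.randn(8, 64, 128)                    # e.g. 64 patches of a CIFAR image
    print(MixerBlock(64, 128, 256, 512)(x).shape)  # torch.Size([8, 64, 128])
    ```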

    1.2 CNNs (classic → modern → lightweight)

    • Backbones: ResNet, DenseNet, Inception, MobileNet, EfficientNet, ConvNeXt, RegNet, NFNet, CoAtNet (conv-attn hybrid).
    • Modules: Dilated/Deformable/Depthwise/Grouped conv, SENet, CBAM, ECA.
    • Lightweight: ShuffleNet, GhostNet.
      Milestone: Train a modern small CNN on your dataset; ablate depthwise vs standard conv.
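
    The ablation amounts to swapping one block for the other; a sketch comparing parameter counts (layer sizes illustrative):

    ```python
    # Depthwise-separable vs standard conv: swap one for the other in your backbone.
    import torch.nn as nn

    def standard_conv(c_in, c_out):
        return nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

    def depthwise_separable(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # depthwise: one filter per channel
            nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise: 1x1 channel mixing
        )

    def param_count(m):
        return sum(p.numel() for p in m.parameters())

    print(param_count(standard_conv(64, 128)), param_count(depthwise_separable(64, 128)))
    # ~73.9k vs ~9.0k parameters here: roughly 8x fewer for the separable version
    ```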

    1.3 RNNs & Temporal Convs

    • Vanilla RNN (Elman/Jordan), LSTM, GRU, Peephole, Bi-/Stacked; TCN; packing/masking for variable length.
      Milestone: Build a sequence classifier with LSTM/GRU and TCN; handle variable lengths properly.
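
    The packing step is the piece that usually goes wrong; a minimal sketch (vocabulary and sizes are placeholders):

    ```python
    # Variable-length sequence classification with an LSTM and packed sequences.
    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pack_padded_sequence

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab, embed=64, hidden=128, classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab, embed, padding_idx=0)
            self.lstm = nn.LSTM(embed, hidden, batch_first=True)
            self.head = nn.Linear(hidden, classes)

        def forward(self, tokens, lengths):
            # Pack so the LSTM skips padded positions entirely.
            packed = pack_padded_sequence(self.emb(tokens), lengths.cpu(),
                                          batch_first=True, enforce_sorted=False)
            _, (h, _) = self.lstm(packed)   # h holds the last *valid* step per sequence
            return self.head(h[-1])

    tokens = torch.tensor([[5, 2, 9, 0], [7, 3, 0, 0]])     # 0 = padding
    lengths = torch.tensor([3, 2])
    print(LSTMClassifier(vocab=10)(tokens, lengths).shape)  # torch.Size([2, 2])
    ```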

    1.4 Transformers (and efficiency)

    • Families: Encoder-only (BERT), Decoder-only (GPT), Encoder–Decoder (T5/BART).
    • Vision: ViT/DeiT/Swin, PVT, MaxViT; Detection: DETR/Deformable DETR, Mask2Former.
    • Long/efficient: Transformer-XL, Longformer, BigBird, Reformer, Performer, Linformer, FlashAttention.
    • Popular decoder models: Llama, Mistral/Mixtral, Gemma, Phi.
    • Sparse/conditional compute: MoE (GShard, Switch, Mixtral), Mixture-of-Depths, Adaptive Computation Time.
    • Retrieval/memory: RAG, REALM, RETRO, kNN-LM.
      Milestone: Fine-tune a small encoder for classification and a small decoder for generation; add RAG on a private corpus.
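
    LoRA, which recurs in the domain tracks below, fits in one module. A from-scratch sketch of the low-rank idea (note: this is not the peft library's API):

    ```python
    # LoRA in one module: freeze W, learn a low-rank update B @ A on top of it.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # pretrained weight stays frozen
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(512, 512))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 8192 trainable params vs 262,656 frozen
    ```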

    1.5 State-Space & Long-Range Alternatives

    • S4/S5/DSS, Mamba (and variants), Retentive Network, Hyena/H3, RWKV (RNN-like, parallel training).
      Milestone: Replace Transformer with Mamba on a long-sequence task; compare throughput and accuracy.
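
    For the throughput half, a small harness is enough; this sketch assumes each model takes a single (batch, length) token tensor, and must be run with identical batch shapes for both models:

    ```python
    # Fair throughput check for the swap: same batch, same length, warmed up.
    import time
    import torch

    @torch.no_grad()
    def tokens_per_sec(model, batch, iters=20):
        model.eval()
        for _ in range(3):                 # warmup: kernel autotuning, caches
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
        return batch.numel() * iters / (time.perf_counter() - start)
    ```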

    1.6 GNNs (graphs)

    • GCN, GAT, GraphSAGE, GIN, MPNN; equivariant: EGNN, SE(3)-Transformer; temporal: TGN; generative: GAE/VGAE.
      Milestone: Node classification on Cora/Citeseer; then TGN on a temporal graph.
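
    A single GCN layer can be written from scratch before reaching for PyTorch Geometric; the dense adjacency here is a toy simplification that is still workable at Cora scale:

    ```python
    # One GCN layer: symmetric-normalized adjacency times features times weights.
    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim, bias=False)

        def forward(self, x, adj):               # x: (N, in_dim), adj: (N, N) dense 0/1
            a = adj + torch.eye(adj.size(0))     # add self-loops
            d_inv_sqrt = a.sum(dim=1).pow(-0.5)
            a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
            return a_norm @ self.lin(x)

    x = torch.randn(5, 16)                       # 5 nodes, 16 features
    adj = (torch.rand(5, 5) > 0.5).float()
    adj = ((adj + adj.T) > 0).float()            # symmetrize the toy graph
    print(GCNLayer(16, 8)(x, adj).shape)         # torch.Size([5, 8])
    ```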

    1.7 INR / 3D & Geometry

    • INR/coordinate networks: SIREN, NeRF (Mip-NeRF, Instant-NGP, DVGO, Plenoxels), DeepSDF, Occupancy Nets.
    • 3D nets: PointNet/PointNet++, DGCNN, Point Transformer, MinkowskiNet (sparse conv).
      Milestone: Train a tiny NeRF on a few views; reconstruct a room/prop.
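
    Most tiny-NeRF reproductions start from the Fourier positional encoding of sample coordinates; a sketch (num_freqs=10 matches the paper's default for positions):

    ```python
    # NeRF-style positional encoding: map each coordinate to sin/cos at doubling frequencies.
    import torch

    def positional_encoding(x, num_freqs=10):
        """x: (..., dim) -> (..., dim * 2 * num_freqs)."""
        freqs = 2.0 ** torch.arange(num_freqs)   # 1, 2, 4, ..., 2^(L-1)
        angles = x[..., None] * freqs            # (..., dim, num_freqs)
        enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return enc.flatten(start_dim=-2)

    pts = torch.rand(1024, 3)                    # sampled 3D points along rays
    print(positional_encoding(pts).shape)        # torch.Size([1024, 60])
    ```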

    1.8 Associative/Other

    • Modern Hopfield Networks, Hopfield-Transformer; CapsNet; Neural ODE, Latent ODE, ODE-RNN.
      Milestone: Implement a small Neural ODE on toy dynamics; compare to GRU.
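
    A sketch of the Neural ODE half, assuming the torchdiffeq package (pip install torchdiffeq); the vector field and loss are toy placeholders:

    ```python
    # Neural ODE on toy dynamics: learn dy/dt = f(t, y), backprop through the solver.
    import torch
    import torch.nn as nn
    from torchdiffeq import odeint

    class ODEFunc(nn.Module):
        def __init__(self, dim=2, hidden=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

        def forward(self, t, y):
            return self.net(y)

    func = ODEFunc()
    y0 = torch.tensor([[2.0, 0.0]])
    t = torch.linspace(0.0, 1.0, 20)
    traj = odeint(func, y0, t)        # (20, 1, 2): solution at each requested time
    loss = traj[-1].pow(2).mean()     # toy loss on the final state
    loss.backward()
    print(traj.shape)
    ```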

    2) Generative Model Paradigms (not architectures)

    • Autoregressive: RNN-LM, GPT, PixelRNN/PixelCNN, WaveNet, ImageGPT, AudioLM, MusicLM.
    • VAE: VAE, β-VAE, VQ-VAE / VQ-VAE-2, NVAE.
    • GAN: GAN, DCGAN, WGAN/WGAN-GP, SNGAN, StyleGAN(1/2/3/XL), BigGAN, CycleGAN, Pix2Pix, SPADE/GauGAN.
    • Diffusion / Score: DDPM / Improved DDPM, DDIM, Score-SDE (VE/VP), Latent Diffusion (Stable Diffusion), EDM, Consistency Models; Transformer-based: DiT, U-ViT.
    • Flow / Flow-Matching: NICE, RealNVP, Glow, Flow Matching, Rectified Flow.
    • EBM: EBM, score matching, NCE.

    Milestone: Train a VQ-VAE and a DDPM on a small image dataset; sample and compare FID/precision–recall.
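
    The DDPM side reduces to a short objective: corrupt x0 with closed-form forward noise at a random timestep and regress the noise. A sketch with the paper's linear schedule; `model` stands for whatever U-Net you train:

    ```python
    # DDPM training objective: add closed-form noise at a random timestep, regress the noise.
    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)            # linear schedule (DDPM paper)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    def ddpm_loss(model, x0):
        """x0: (B, C, H, W); model(x_t, t) predicts the noise eps."""
        b = x0.size(0)
        t = torch.randint(0, T, (b,))
        eps = torch.randn_like(x0)
        ab = alphas_bar[t].view(b, 1, 1, 1)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # q(x_t | x_0) in closed form
        return F.mse_loss(model(x_t, t), eps)
    ```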


    3) Domain Tracks (pick 2–3 first)

    3.1 NLP / Long-Context & Retrieval

    • Tokenizers; BERT/T5/GPT fine-tuning; instruction tuning; LoRA; RAG; long-context (ALiBi/RoPE, Longformer/BigBird).
      Project: Domain Q&A with RAG; evaluate hallucination and grounding.
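
    The retrieval half fits in a few lines; `embed` and `llm` are placeholders for whichever embedding model and generator you pick:

    ```python
    # Retrieval skeleton for RAG: embed, rank by cosine similarity, stuff top-k into the prompt.
    import numpy as np

    def top_k(query_vec, doc_vecs, k=3):
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        return np.argsort(d @ q)[::-1][:k]             # cosine sim = dot product after L2 norm

    def answer(question, docs, embed, llm):
        doc_vecs = np.stack([embed(d) for d in docs])  # in practice: precompute and index offline
        idx = top_k(embed(question), doc_vecs)
        context = "\n\n".join(docs[i] for i in idx)
        prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
        return llm(prompt)
    ```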

    3.2 Vision (Detection/Segmentation/Foundations)

    • Backbones (ResNet/ConvNeXt ↔ ViT/Swin), detectors (YOLO/RetinaNet/Faster R-CNN ↔ DETR), segmentation (U-Net/DeepLab ↔ Mask2Former).
      Project: Build a multi-object detection pipeline; measure latency vs mAP.

    3.3 Speech & Audio

    • ASR: Conformer, RNN-T/CTC, Whisper; TTS: WaveNet, HiFi-GAN, VITS; audio generation: AudioLM/MusicLM (concepts).
      Project: Fine-tune Whisper on custom accents; deploy streaming ASR.
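
    On the CTC side, the shapes and the blank index are what usually trip people up; a sketch with illustrative sizes:

    ```python
    # CTC for ASR: per-frame log-probs against an unaligned transcript.
    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)                  # reserve index 0 for the CTC blank
    T, B, C = 50, 4, 30                        # frames, batch, labels incl. blank
    log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(-1)  # (T, B, C) required
    targets = torch.randint(1, C, (B, 12))     # label ids only, no frame alignment needed
    input_lengths = torch.full((B,), T)
    target_lengths = torch.full((B,), 12)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    ```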

    3.4 Time-Series / Forecasting

    • TCN, Informer/Autoformer (Transformer), S4/Mamba, RWKV.
      Project: Multi-horizon forecasting with exogenous features; compare MSE vs latency.
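
    The core data-handling step is a sliding-window dataset; a sketch (split by time before windowing, or future values leak into training):

    ```python
    # Sliding windows for multi-horizon forecasting: `lookback` steps in, `horizon` steps out.
    import torch
    from torch.utils.data import Dataset

    class WindowDataset(Dataset):
        def __init__(self, series, lookback, horizon):
            self.series, self.lookback, self.horizon = series, lookback, horizon

        def __len__(self):
            return len(self.series) - self.lookback - self.horizon + 1

        def __getitem__(self, i):
            x = self.series[i : i + self.lookback]                               # encoder input
            y = self.series[i + self.lookback : i + self.lookback + self.horizon]  # targets
            return x, y

    ds = WindowDataset(torch.arange(100.0), lookback=24, horizon=6)
    x, y = ds[0]
    print(x.shape, y.shape)  # torch.Size([24]) torch.Size([6])
    ```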

    3.5 Multimodal (VL/VLM/LMM)

    • CLIP, BLIP/BLIP-2, Flamingo, LLaVA, GPT-4V, Gemini; alignment objectives (ITC/ITM), instruction tuning, evaluation (VQA/VLEP).
      Project: Image-text retrieval + VQA; add RAG over captions/metadata.
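
    The ITC objective behind CLIP-style retrieval is a symmetric cross-entropy over the in-batch similarity matrix; a minimal sketch with random embeddings standing in for the two encoders:

    ```python
    # CLIP-style contrastive (ITC) loss: matched image/text pairs sit on the diagonal.
    import torch
    import torch.nn.functional as F

    def clip_loss(img_emb, txt_emb, temperature=0.07):
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.T / temperature   # (B, B) similarity matrix
        labels = torch.arange(img.size(0))   # diagonal entries are the positives
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    loss = clip_loss(torch.randn(32, 512), torch.randn(32, 512))
    print(loss.item())
    ```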

    3.6 Reinforcement Learning

    • DQN/Double/Rainbow, A2C/A3C, PPO, SAC/TD3; offline RL as sequence modeling: Decision Transformer; RLHF (SFT → RM → PPO/DPO).
      Project: PPO on a continuous-control task; add reward shaping and curriculum.
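
    Reward shaping slots in cleanly as an environment wrapper; a Gymnasium sketch where the bonus is a made-up example to redesign per task (the pattern carries over unchanged to continuous-control envs):

    ```python
    # Reward shaping as a wrapper: add a bonus term without modifying the env itself.
    import gymnasium as gym

    class ShapedReward(gym.Wrapper):
        def __init__(self, env, weight=0.1):
            super().__init__(env)
            self.weight = weight

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            bonus = -abs(obs[0])   # toy example: penalize cart distance from center
            return obs, reward + self.weight * bonus, terminated, truncated, info

    env = ShapedReward(gym.make("CartPole-v1"))
    obs, info = env.reset()
    obs, r, term, trunc, info = env.step(env.action_space.sample())
    ```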

    3.7 Protein/Biology

    • Structure: AlphaFold (Evoformer), ESM; generation: RFdiffusion, ProtGPT2.
      Project: Use ESM embeddings for property prediction; explore RFdiffusion samples (conceptually).

    3.8 3D/Graphics & Virtual Production (XR-friendly)

    • Instant-NGP, Plenoxels, PointNeRF/Point-based methods; camera-tracking fusion; real-time constraints.
      Project: Scene capture → NeRF background → integrate with a tracked camera feed.

    4) Training Recipes, Evaluation & Systems

    • Optimization: AdamW, schedulers & warmup, gradient clipping, weight decay, EMA.
    • Regularization: dropout, label smoothing, stochastic depth, mixup/cutmix.
    • Scaling: DP/ZeRO, tensor/model/pipeline parallel, MoE, gradient checkpointing, LoRA/QLoRA.
    • Data & Eval: robust splits, leakage tests, calibration, uncertainty, long-context eval, safety checks.
    • Serving: quantization (INT8/FP8), distillation, Triton/FastAPI, streaming ASR/VLM latencies, caching & KV reuse.

    Milestone: Take one domain project to production-like serving with metrics dashboards and A/B tests.
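
    Of the serving levers above, post-training dynamic quantization is usually the cheapest first win; a sketch using PyTorch's built-in pass (Linear layers only):

    ```python
    # Post-training dynamic quantization: weights stored as INT8, dequantized on the fly.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
    qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 512)
    print(qmodel(x).shape)   # same interface as the float model
    ```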


    5) Capstones (pick 1)

    • Multimodal Studio Assistant: RAG + LLaVA/CLIP for shot search, asset tagging, and scene notes; on-prem inference.
    • Long-Sequence Forecasting: Mamba/RWKV vs Transformer on operational telemetry; deploy with alerts.
    • Real-Time ASR→Captioning: Whisper/Conformer + latency budget + domain lexicon injection.
    • NeRF-Driven VP Backdrops: Capture → Instant-NGP training → keyed talent compositing with tracked camera.
    • Bio-inspired Feature Design: Use ESM embeddings as features for property prediction; interpretability focus.

    6) Paper-Reading & Repro Culture

    • Triage: architecture vs training vs data contributions; reproduce small-scale first.
    • Logs & ablations: keep a fixed template; isolate one variable per run.
    • Share cards: “What changed? Why? At what cost (FLOPs/latency/memory)?”

    7) Quick Decision Guide (cheat-sheet)

    • Very long sequences: try SSM (S4/Mamba) or RWKV; if retrieval helps → RAG.
    • Tight latency / mobile: MobileNet/ShuffleNet/GhostNet or distilled ViT; quantize.
    • Structured graphs: GAT/GIN; temporal → TGN.
    • 3D/scene capture: Instant-NGP/Plenoxels; sparse 3D → MinkowskiNet.
    • Text generation: decoder (Llama/Mistral/Gemma/Phi) with LoRA; safety + eval.
    • Image generation: start Latent Diffusion; scale to DiT when compute allows.