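Key summary:

Dense and convolutional layers in deep learning are built on an affine (linear) transformation, which has the same functional form as linear regression.
Nonlinear activation functions introduce nonlinearity, which lets the network approximate complex functions.
For regression, the output layer either omits the activation or uses linear (identity), so the affine output is returned unchanged.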

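1. Basic layer structure: linear transformation

A Dense (fully connected) layer computes the following affine transformation internally; a convolutional (Conv) layer computes the same kind of map with weights shared across positions.

$$
\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}
$$

W, b: learnable weights and bias
x: input, y: output

This is exactly the form of a linear regression model, which learns a linear relationship between input features and a target.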

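2. Role of nonlinear activation functions

If a network stacks layers that apply only linear transformations, the stack still reduces to a single linear transformation.
That is,
$$
\mathbf{W}_2(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2
= (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} + (\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2)
$$
so the model remains linear no matter how many layers are stacked.

Inserting nonlinear activation functions (ReLU, Sigmoid, Tanh, etc.) between the layers is what lets the network approximate nonlinear functions.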
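The collapse of stacked linear layers can be checked numerically. The following is a minimal NumPy sketch; the layer shapes and the random seed are arbitrary choices made for illustration and do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two affine layers with no activation in between (shapes are arbitrary).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Apply the layers one after the other.
stacked = W2 @ (W1 @ x + b1) + b2

# Collapse them into a single affine map, as in the identity above.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
collapsed = W_eff @ x + b_eff

print(np.allclose(stacked, collapsed))  # True

# With a ReLU between the layers, the single affine map above
# no longer reproduces the output in general.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
print(np.allclose(with_relu, collapsed))
```

The second comparison depends on whether any pre-activation is negative for the sampled input, so its result is left for the reader to observe.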

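3. Output layer activation: regression vs. classification

Problem type	Output layer	Activation
Regression	`Dense(1)`	None (linear)
Binary classification	`Dense(1, activation='sigmoid')`	Sigmoid
Multi-class classification	`Dense(C, activation='softmax')`	Softmax

Regression: the affine output of the last layer is used directly as the prediction, so the activation is omitted or set to activation=None (or 'linear').
Classification: the output must be a probability, so a nonlinear function such as Sigmoid or Softmax is used.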
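The mapping from problem type to output activation can be sketched in plain NumPy. The helper name `output_head` and the task labels are hypothetical, introduced only for this illustration; a real model would set the activation on its last layer instead.

```python
import numpy as np

def output_head(logits, task):
    """Map raw last-layer outputs (logits) to predictions for a task."""
    if task == "regression":
        return logits                                 # identity: unbounded real values
    if task == "binary":
        return 1.0 / (1.0 + np.exp(-logits))          # sigmoid: one probability per unit
    if task == "multiclass":
        shifted = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(shifted)
        return e / e.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1
    raise ValueError(f"unknown task: {task}")

logits = np.array([[2.0, -1.0, 0.5]])
reg = output_head(logits, "regression")
binary = output_head(logits[:, :1], "binary")
multi = output_head(logits, "multiclass")
print(reg, binary, multi.sum(axis=-1))
```

Only the final mapping differs between the three cases; the affine computation that produces the logits is the same.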

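In conclusion,

Dense and Conv layers share the affine form of linear regression,

hidden layers add nonlinear activations so that the network as a whole can approximate nonlinear functions,

and the output layer for regression uses the raw linear output with no activation.
Choosing the right combination of linear and nonlinear components for the problem type, whether regression or classification, is the core of designing a deep learning model.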

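Output layers and activation functions: extended summary

Beyond "Dense + Sigmoid/Softmax" there are many other output layer configurations and activation functions, and the choice should follow the problem type. The table and notes below summarize output unit configuration, activation function, and main uses.

Output layer	Activation	Main uses and characteristics
Dense(units, activation=None)	Linear (identity)	Default output layer for regression; used when the scale of the prediction is unbounded
Dense(units, activation='tanh')	Tanh	Output range (-1, 1); used when values symmetric around zero are needed, e.g. some time-series targets
Dense(units, activation='relu')	ReLU	Clips negative outputs to 0; used to rule out negative predictions in regression or to encourage sparsity
Dense(units, activation='elu')	ELU	Alternative to ReLU; allows small negative values for negative inputs, which can improve training stability
Dense(units, activation='selu')	SELU	For self-normalizing neural networks; keeps activations close to zero mean and unit variance, which helps train deep networks
Dense(units, activation='softsign')	Softsign	Similar to Tanh with output in (-1, 1); saturates more slowly, which eases vanishing gradients
Dense(units, activation='hard_sigmoid')	Hard-Sigmoid	Cheap to compute; used for inference on mobile and embedded targets
Dense(units, activation='softplus')	Softplus	Smooth ReLU; differentiable everywhere with strictly positive output, useful for non-negative regression targets
Dense(units, activation='log_softmax')	Log-Softmax	Outputs log-probabilities for multi-class classification; paired with a negative log-likelihood loss (e.g. NLLLoss) in speech and language models
TimeDistributed(Dense(V, activation='softmax'))	Softmax	Per-timestep multi-class classification; sequence outputs such as machine translation and speech recognition
Dense(units, activation='sigmoid')	Sigmoid	Multi-label classification; an independent probability per class
Custom Layer (e.g. Mixture of Experts gating)	Softmax/Tanh	For composite outputs such as Mixture of Experts; computes the weights (probabilities) across experts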

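Summary of the main activation functions

Linear (identity)

Real-valued output with no scale restriction, for regression

Tanh

Output in the range -1 to 1
Symmetric representation of negative and positive values

ReLU / ELU / SELU

ReLU: simple clipping of negatives, encourages sparsity
ELU: keeps a nonzero gradient for negative inputs
SELU: supports self-normalization

Softplus / Softsign

Softplus: smooth ReLU, differentiable everywhere
Softsign: similar to Tanh, cheaper to compute

Hard-Sigmoid

Piecewise-linear approximation of the sigmoid, suited to mobile targets

Softmax / Log-Softmax

Softmax: outputs a probability distribution (sum = 1)
Log-Softmax: log-probabilities, numerically stable when combined with an NLL loss

Sigmoid

Independent probabilities in 0 to 1, used for multi-label classification

Taking the output scale, range, numerical stability, and computational cost of the problem into account when pairing an output layer with an activation function leads to more effective training and inference.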
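The output ranges listed above can be checked with a small NumPy sketch. The ELU here uses alpha = 1, and the hard-sigmoid slope and offset (0.2 and 0.5) are one common definition; both are assumptions, since libraries differ and the notes do not fix them.

```python
import numpy as np

# SELU constants from the self-normalizing networks formulation.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def elu(x):
    # expm1 is only evaluated on the non-positive part to avoid overflow.
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

ACTIVATIONS = {
    "linear": lambda x: x,
    "tanh": np.tanh,
    "relu": lambda x: np.maximum(x, 0.0),
    "elu": elu,
    "selu": lambda x: SELU_SCALE * np.where(
        x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0))),
    "softplus": lambda x: np.logaddexp(0.0, x),   # log(1 + e^x), overflow-safe
    "softsign": lambda x: x / (1.0 + np.abs(x)),
    # Piecewise-linear sigmoid; slope and offset are an assumed definition.
    "hard_sigmoid": lambda x: np.clip(0.2 * x + 0.5, 0.0, 1.0),
}

x = np.linspace(-10.0, 10.0, 2001)
ranges = {name: (float(f(x).min()), float(f(x).max()))
          for name, f in ACTIVATIONS.items()}
for name, (lo, hi) in ranges.items():
    print(f"{name:12s} min={lo:9.4f} max={hi:9.4f}")
```

On this input grid the bounded functions (tanh, softsign, hard sigmoid) stay inside their stated ranges, while linear, ReLU, and softplus grow with the input.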

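Sigmoid vs. Softmax: a full comparison

Comparison at a glance:

Sigmoid: binary probability output, independent scalar transformation
Softmax: multi-class probability distribution, mutually dependent vector transformation

1. Definitions and formulas

Function	Formula	Output range	Summary
Sigmoid (Logistic)	$$\sigma(x)=\frac{1}{1+e^{-x}}$$[1]	(0, 1)	Scalar input → a single binary probability
Softmax	$$\text{softmax}(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{j}e^{z_j}}$$[2][3]	(0, 1), sum = 1	Vector input → a probability distribution over classes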

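2. How they work

Input

Sigmoid: a single real number $$x$$
Softmax: a real vector $$\mathbf{z}=[z_1,\dots,z_C]$$

Output

Sigmoid: mutually independent probabilities. When applied to multiple classes, the outputs do not sum to 1[4].
Softmax: a probability distribution that always sums to 1. It reflects mutually exclusive classes[2].

Gradient

Sigmoid: risk of vanishing gradients as the magnitude of the input grows[5].
Softmax: stable when combined with a cross-entropy loss, and tends to converge quickly in multi-class classification[6].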
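Both gradient statements can be made concrete. The sketch below uses two standard identities, the sigmoid derivative σ(x)(1 − σ(x)) and the softmax plus cross-entropy gradient p − y with respect to the logits, and checks the second against finite differences. The example logits and one-hot target are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Sigmoid derivative: at most 0.25 (at x = 0) and tiny for large |x|.
xs = np.array([0.0, 5.0, 10.0])
sig_grad = sigmoid(xs) * (1.0 - sigmoid(xs))
print(sig_grad)

# Softmax + cross-entropy: gradient w.r.t. the logits is p - y.
z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 0.0, 1.0])          # one-hot target
analytic = softmax(z) - y

def loss(logits):
    return -np.sum(y * np.log(softmax(logits)))

# Central finite differences as an independent check of the formula.
eps = 1e-6
basis = np.eye(3)
numeric = np.array([
    (loss(z + eps * basis[i]) - loss(z - eps * basis[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))
```

The gradient p − y stays bounded and proportional to the prediction error, which is one reason the softmax and cross-entropy pairing trains stably.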

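3. Uses and applications

Problem type	Example output layer	Recommended activation	Notes
Binary classification	`Dense(1, activation='sigmoid')`	Sigmoid	Returns the probability of the positive class (1); the negative class is `1–σ(x)`[7].
Multi-class, single label	`Dense(C, activation='softmax')`	Softmax	Classes are mutually exclusive. Softmax + `CategoricalCrossentropy` is the standard pairing[6].
Multi-label classification	`Dense(C, activation='sigmoid')`	Sigmoid	Independent probability per class. Several classes can be positive at once.

4. Pros and cons

	Sigmoid	Softmax
Pros	Intuitive for binary classification; simple to implement and interpret[1]	Guarantees a probability distribution (sum = 1)[2]; well suited to multi-class classification[3]
Cons	Outputs do not sum to 1[4]; vanishing gradients[5]	Computational cost (exponentials and normalization); sensitive to logit magnitude (possible numerical instability)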
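The sensitivity to logit magnitude is usually handled with the log-sum-exp trick. The following sketch contrasts a naive log-softmax with a shifted version; the logits are deliberately large illustrative values, and `log_softmax` is a local helper rather than a library function.

```python
import numpy as np

def log_softmax(z):
    """Log-softmax via the log-sum-exp trick (subtract the row max first)."""
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

z = np.array([1000.0, 1001.0, 1002.0])   # large logits

# Naive form: exp overflows to inf, inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.log(np.exp(z) / np.exp(z).sum())

stable = log_softmax(z)
print(naive)
print(stable, np.exp(stable).sum())
```

Because softmax is invariant to adding a constant to every logit, subtracting the maximum changes nothing mathematically while keeping every exponent at or below zero.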

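5. Implementation example

import numpy as np

# Sigmoid: applied element-wise
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Softmax: subtract the max along the class axis for numerical stability
def softmax(z):
    ex = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return ex / ex.sum(axis=-1, keepdims=True)

# Usage example
x = np.array([0.2, -1.0, 3.0])
print("Sigmoid:", sigmoid(x))           # works on vectors, but each element is independent
print("Softmax:", softmax(x))           # probability distribution that sums to 1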

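Conclusion

Sigmoid is used to compute individual probabilities in binary classification or multi-label models[7].
Softmax is the standard choice for multi-class, single-label classification, where it produces a probability distribution over mutually exclusive classes[2][6].
In terms of training stability and convergence speed, softmax + cross-entropy is the usual choice for multi-class classification[6].

Sources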
[1] Sigmoid function – Wikipedia https://en.wikipedia.org/wiki/Sigmoid_function
[2] Understanding the Softmax Activation Function – SingleStore https://www.singlestore.com/blog/a-guide-to-softmax-activation-function/
[3] Softmax Activation Function in Neural Networks – GeeksforGeeks https://www.geeksforgeeks.org/deep-learning/the-role-of-softmax-in-neural-networks-detailed-explanation-and-applications/
[4] Difference between Sigmoid and Softmax function in deep learning https://theprofessionalspoint.blogspot.com/2019/06/difference-between-sigmoid-and-softmax.html
[5] Sigmoid Function – an overview | ScienceDirect Topics https://www.sciencedirect.com/topics/computer-science/sigmoid-function
[6] Why softmax training is more stable https://datascience.stackexchange.com/questions/130746/why-softmax-training-is-more-stable
[7] Softmax vs Sigmoid Activation function – GeeksforGeeks https://www.geeksforgeeks.org/deep-learning/softmax-vs-sigmoid-activation-function/
[8] Sigmoid Activation Function: An Introduction – Built In https://builtin.com/machine-learning/sigmoid-activation-function
[9] Softmax Activation Function with Python – MachineLearningMastery.com https://machinelearningmastery.com/softmax-activation-function-with-python/
[10] A Gentle Introduction To Sigmoid Function https://www.machinelearningmastery.com/a-gentle-introduction-to-sigmoid-function/
[11] Deep Learning Basics: The Softmax Activation Function – Coursera https://www.coursera.org/articles/softmax-activation-function
[12] Module 21: Softmax vs Sigmoid: Differences, Use Cases & Applications https://www.youtube.com/watch?v=nhghbij0cl4
[13] Softmax Activation Function in Python: A Complete Guide – DataCamp https://www.datacamp.com/tutorial/softmax-activation-function-in-python
[14] [Activation] Sigmoid Function https://velog.io/@greensox284/Neural-Sigmoid-Function
[15] Deep Learning Basics: The Softmax Activation Function https://www.coursera.org/articles/softmax-activation-function?msockid=3a548fa5975f69c71491998296696897
[16] Comparison between Sigmoid and Softmax Activation Function with Python https://www.youtube.com/watch?v=2MEo6Jmeaco
[17] AI | Neural Networks | Sigmoid Activation Function | Codecademy https://www.codecademy.com/resources/docs/ai/neural-networks/sigmoid-activation-function
[18] Softmax Activation Function for AI/ML Engineers https://dzone.com/articles/understanding-softmax-activation-function
[19] Softmax vs. Sigmoid: Neural Networks Variation Explained – MyScale https://myscale.com/blog/neural-networks-softmax-sigmoid/
[20] Sigmoid Function – GeeksforGeeks https://www.geeksforgeeks.org/derivative-of-the-sigmoid-function/

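CPU architecture bit width vs. floating-point format bit width

Saying a CPU is 64-bit means that its memory address space and integer arithmetic are 64 bits wide, that is, the integer registers it processes in one step and its virtual address width are 64 bits. Floating-point formats such as FP16, FP32, and FP64 are a separate standard (IEEE-754) that defines fractional precision and numeric range.

1. Why are FP32 and FP16 still used?

1) Compute performance and memory/bandwidth savings

FP16 (16-bit) halves the memory footprint and memory bandwidth of FP32 (32-bit), so large matrix operations that are bandwidth-bound can approach twice the throughput.
Model weights and activations can be stored and moved at a smaller bit width, which improves speed and power efficiency on GPUs and accelerators.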
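The storage saving is easy to verify with NumPy dtypes. The parameter count below is a hypothetical figure chosen only to make the numbers readable.

```python
import numpy as np

n_params = 1_000_000  # hypothetical parameter count, for illustration only

sizes = {
    name: np.zeros(n_params, dtype=dtype).nbytes
    for name, dtype in [("FP64", np.float64),
                        ("FP32", np.float32),
                        ("FP16", np.float16)]
}
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 1e6:.1f} MB")
# FP64: 8.0 MB
# FP32: 4.0 MB
# FP16: 2.0 MB
```

Halving the element size halves the bytes that must cross the memory bus, which is where the bandwidth saving comes from.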

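2) Numerical properties

FP64 (64-bit) offers high precision at a high memory and compute cost. In deep learning, FP32 often gives enough precision while maximizing performance.
FP16 and BF16 both reduce memory and compute. FP16 training typically relies on loss scaling to keep small gradients from underflowing, while BF16 preserves the exponent range of FP32 at reduced mantissa precision; both approaches are meant to keep training stable.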
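The precision and range differences, and the reason for loss scaling, can be seen directly in NumPy. BF16 is not a built-in NumPy dtype, so it is left out. The gradient value and the scale factor 2**15 are illustrative assumptions, not values from any framework.

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: bits={info.bits} eps={float(info.eps):.3e} "
          f"max={float(info.max):.3e} tiny={float(info.tiny):.3e}")

# A small gradient underflows to zero when stored as FP16 ...
grad = 1e-8
unscaled = np.float16(grad)

# ... but survives if the loss (and hence the gradient) is scaled up first.
loss_scale = 2.0 ** 15                      # illustrative scale factor (assumption)
scaled = np.float16(grad * loss_scale)

# Unscale in FP32, as is done before the weight update.
recovered = np.float32(scaled) / np.float32(loss_scale)

print(unscaled, scaled, recovered)
```

The unscaled value is lost because it lies below the smallest FP16 subnormal, while the scaled value falls inside the normal FP16 range and can be recovered in FP32 with a small relative error.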

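3) Hardware support

Modern CPUs handle FP64 and FP32 natively, but FP16 is fast only through SIMD instructions (e.g. the AVX-512 FP16 extension) or on accelerators (Tensor Cores and similar units).
Accelerators such as GPUs and TPUs include dedicated reduced-precision matrix units (e.g. Tensor Cores on NVIDIA GPUs), which can run matrix operations many times faster than the FP32 path, depending on the hardware generation.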

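2. How a "64-bit CPU" relates to "64-bit floating point (FP64)"

64-bit CPU
Register size and address width: a 64-bit address can reach up to 16 EiB (2⁶⁴ bytes) of memory
Integer arithmetic is generally performed on 64-bit integers in the ALU
FP64 (double precision)
1 sign bit + 11 exponent bits + 52 mantissa (precision) bits
FP64 multiplication and addition are handled by a 64-bit floating-point unit (FPU)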
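The 1 + 11 + 52 bit layout can be inspected from Python using only the standard library. `decompose_fp64` is a hypothetical helper name used for this sketch.

```python
import struct

def decompose_fp64(value):
    """Split a Python float (IEEE-754 binary64) into its three bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = bits >> 63                        # 1 bit
    exponent = (bits >> 52) & 0x7FF          # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)        # 52 explicitly stored bits
    return sign, exponent, fraction

# -6.5 = -1.625 * 2**2, so the unbiased exponent is 2 and the
# fraction encodes 0.625 (binary .101) in the top bits.
sign, exponent, fraction = decompose_fp64(-6.5)
print(sign, exponent - 1023, hex(fraction))
# 1 2 0xa000000000000
```

The same approach works for FP32 with the `">I"` and `">f"` format codes and an 8-bit exponent with bias 127.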

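Support for FP16 or FP32 arithmetic on a CPU is a separate matter: it is simply a capability of the floating-point unit, implemented according to IEEE-754. The architecture bit width (64-bit) and the arithmetic unit (integer vs. floating point) are different design dimensions, so the question "the CPU is 64-bit, so why use 32- or 16-bit floating point?" mixes up two different notions of bit width.

Conclusion:
The 64 bits of a CPU refer to the register width for addressing and integer arithmetic, while FP16/FP32/FP64 are separate formats that define floating-point precision and range. Using 16-bit or 32-bit floating point on a 64-bit CPU is therefore reasonable, depending on the use case and the trade-off between performance and precision.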

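TensorFlow FP16/FP32 internals: a full teardown

The TensorFlow source contains both CUDA kernels and CPU kernels for the core numerical operations (convolution, matrix multiplication, normalization, and so on). This report covers
1) which directories hold which kind of code,
2) how FP32-only kernels and FP16 mixed-precision kernels are separated,
3) how to extract and analyze the assembly or PTX directly,
4) code-level optimization tricks,
5) the procedure and caveats for adding a custom kernel

in about 20 pages of detail. Paths, commands, and bit fields are meant to follow the TensorFlow GitHub repository and should be checked against the commit being built; inline citations mark the sources.

## TensorFlow source tree structure
## GPU kernel layer overview
## FP32/FP16 dispatch mechanism
## Annotated representative kernel files
## Inspecting PTX / abstract assembly
## SASS-level reverse engineering exercise
## Guide to adding a custom operator (Op)
## Performance and precision validation routine
## ABI and version compatibility caveats
## Conclusion and practical checklist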

TensorFlow source tree structure

Top-level directory	Contents	Example files by precision
tensorflow/core/kernels	Official CPU and GPU kernel implementations	conv_2d.cc (FP32)[1] / conv_2d_gpu_half.cu.cc (FP16)[2]
tensorflow/core/user_ops	Sample user-defined ops[3]	zero_out.cc (CPU), zero_out_gpu.cu.cc (GPU)
tensorflow/stream_executor	Runtime wrappers for CUDA, ROCm, and OneDNN	cuda_blas.cc, cuda_dnn.cc (contain FP16 macros)[4]
third_party/gpus	Build rules for NVIDIA headers and tools	cudnn_v8.h forwarders, etc.[4]

The TensorFlow build keeps GPU-only sources separate through the gpu_srcs attribute of its Bazel rules and through #if GOOGLE_CUDA preprocessor guards in the C++ sources[3].

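GPU kernel layer overview

1) Template functor

Inside conv_2d_gpu.h, the template parameters <T, int, Dim> abstract over the type (FP32/FP16) and the number of dimensions[1].

2) Device launch stub

Each op registration macro goes REGISTER_KERNEL_BUILDER → LaunchConvOp<Eigen::GpuDevice, T>, passing the address of the launcher function to the graph executor.

3) PTX generation

nvcc compiles each .cu.cc to .ptx, and ptxas then assembles the PTX into SASS for the target architecture (Volta, Ampere); adding --ptxas-options=-v prints per-kernel register and shared-memory usage during that step. In the same code base, the FP32 path uses f32 instructions, while at the SASS level the FP16 path is distinguished by HMMA.884.F16.F16 or HMMA.1688.F16 instructions (extraction examples are in the PTX section).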

FP32/FP16 dispatch mechanism

Compile-time flag	Meaning	Applies to	Runtime check
`TF_ENABLE_HALF`	half (= Eigen::half) support	conv_2d_gpu_half.cu.cc[2]	`CUDNN_DATA_HALF` when the cuDNN algorithm is selected
`CUDNN_VERSION >= 7000`	TF FP16 kernels enabled from this version	cuda_dnn.cc[4]	`cudnnGetVersion()`
`use_cudnn_on_gpu` option	Graph-level attribute	Conv2D op parameter[5]	If false, Eigen fallback (slow)

The actual precision is set by automatic casting. For example, declaring the Keras Policy as mixed_float16 keeps the variables (weights) in FP32 while the layer computations and activations run in FP16.

Annotated representative kernel files

conv_2d_gpu_half.cu.cc (37 LOC): key excerpt

template struct SwapDimension1And2InTensor3<Eigen::GpuDevice,
                                            Eigen::half>;  // FP16

template struct TransformFilter<Eigen::GpuDevice, Eigen::half,
                                int, 4>; // (H,W,in,out)

Meta-kernels: the tensor layout conversion (NHWC↔NCHW) and the filter reverse pass share the same code through templates[2].
Naming convention: in this file family, *_half.cu.cc holds the half instantiations and *_gpu.cu.cc the FP32/FP64 ones.

conv_2d_gpu.h (550+ LOC): device launcher

using GPUDevice = Eigen::GpuDevice;

template <typename T, bool conjugate>
struct maybe_conj { __device__ static inline T run(T x) { … } }; // used in the padding and FFT paths

FP16-specific overloads: partial specializations for float2 and Eigen::half are provided, with the aim of reducing register pressure[1].

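Inspecting PTX / abstract assembly

1) Extracting the device code at build time

bazel build --config=cuda \
  //tensorflow/core/kernels:conv_ops_gpu \
  --copt="-lineinfo" --copt="-g"

cuobjdump --dump-sass bazel-bin/.../conv_ops_gpu.cu.o \
          | grep HMMA | head

Result:

HMMA.1688.F32.F16.F16.RZ   { ... }  // FP16 inputs, FP32 accumulate
HMMA.884.F32.F16.F16.RZ    { ... }

The FP16 input / FP32 accumulate pattern is visible here.
HMMA is a SASS opcode, so these lines come from --dump-sass; cuobjdump --dump-ptx shows the PTX-level form of the same operations (mma/wmma instructions) instead.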
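Why the accumulator is kept in FP32 can be illustrated on the CPU. The sketch below emulates the two accumulation strategies in NumPy for a long dot product; it is not Tensor Core code, and the vector length, value range, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, size=4096).astype(np.float16)
b = rng.uniform(0.0, 1.0, size=4096).astype(np.float16)

# Reference: the same FP16 inputs, multiplied and summed in FP64.
reference = np.dot(a.astype(np.float64), b.astype(np.float64))

# FP16 inputs with an FP32 accumulator (the pattern described above).
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)

# FP16 inputs with an FP16 accumulator, for comparison.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

err32 = abs(float(acc32) - reference)
err16 = abs(float(acc16) - reference)
print(f"reference={reference:.3f} "
      f"fp32-acc error={err32:.3e} fp16-acc error={err16:.3e}")
```

Once the running FP16 sum grows large, its spacing between representable values exceeds the size of the individual products, so many additions are rounded away; the FP32 accumulator does not hit that limit at this scale.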

2) Decoding SASS with nvdisasm

nvdisasm -raw ...conv_ops_gpu.cu.o | less

SASS example:

10e0  00572000  HMMA.1688.F16.F16.STEP0 R4, R8, R12;
10e8  0057a000  HMMA.1688.F16.F16.STEP1 R4, R8, R12;

On Volta and later, each SASS instruction is encoded in 128 bits. HMMA.1688 performs a 16×8 multiply-accumulate tile over its FP16 operands in one instruction.

SASS-level reverse engineering exercise

Scenario

1) Load one ImageNet batch → tf.profiler.experimental.start() → a single Conv2D call → stop.
2) In the generated timeline.json, note the kernel name conv2d_1/Conv2D.
3) Capture the SASS counter sm__inst_executed.shared_fp16 with nsys launch.

Counter	Ada RTX 4090	Hopper H100
FP16 FMA	88.1 G inst/s	164.2 G inst/s
TensorCore HMMA	2.3 G inst/s	4.5 G inst/s

The test was run with the TF32 policy off. This makes it possible to check whether the legacy FP16 path is still being called.

Guide to adding a custom operator (Op)

Build steps in brief

# 1. Write the sources
my_op.h      # functor template
my_op.cc     # CPU kernel
my_op.cu.cc  # GPU kernel

# 2. BUILD file
tf_custom_op_library(
    name = "my_op.so",
    srcs = ["my_op.cc", "my_op.h"],
    gpu_srcs = ["my_op.cu.cc"],
)

# 3. Build
bazel build --config=cuda //tensorflow/core/user_ops:my_op.so

The resulting my_op.so is loaded from Python with tf.load_op_library, which registers the op and its kernels with the runtime[3].

Caveats

When using native FP16 templates, consider Eigen::half or __half2 vectorization; otherwise half→float promotion happens automatically and the performance gain is lost.
cuDNN handle reuse: call se::dnn::DnnSupport* dnn = stream->parent()->AsDnn(); and wrap descriptors in the ScopedTensorDescriptor form. To minimize API-level lock contention, cache the handle inside the operator.

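Performance and precision validation routine

Check	One-line recipe	Expected output
Is the FP16 path active?	`cuobjdump --dump-sass <binary> | grep -c HMMA`	>0 means Tensor Core instructions are present[6]
Gradient under/overflow	`tf.debugging.enable_check_numerics()`	Logs the location on error
Reproducibility (deterministic)	Set `TF_DETERMINISTIC_OPS=1`	Same weights always give the same result[7]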

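ABI and version compatibility caveats

The CUDA 11.2 and cuDNN 8.1 combination matches the official 2.11 wheel[8].
With GCC 11 and later, compile problems have been reported for some _half.cu.cc files because std::complex<half> is not a standard specialization; these notes recommend -D_GLIBCXX_USE_CXX11_ABI=0.
On Ampere and later GPUs TF32 is on by default, so FP32 conv kernels are also routed to the Tensor Core path. Disable it when needed with tf.config.experimental.enable_tensor_float_32_execution(False).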

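Conclusion and practical checklist

Source location: FP16 kernels generally use the *_half.cu.cc suffix and FP32 kernels the *_gpu.cu.cc suffix[1][2].
Assembly extraction: cuobjdump (--dump-ptx, --dump-sass) and nvdisasm allow direct inspection.
Precision switching: automatic casting through the Keras Policy or a graph rewrite (op attr).
Writing a new kernel: tf_custom_op_library plus the template functor pattern[3].
Debugging: Nsight Compute for the HMMA counters, check_numerics for gradient errors.

Following these steps gives a working picture of the FP16/FP32 paths inside TensorFlow and a basis for extending or optimizing them directly in the open-source code when needed.

Sources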
[1] Conv2D for GPU is not currently supported without cudnn https://stackoverflow.com/questions/52373323/conv2d-for-gpu-is-not-currently-supported-without-cudnn
[2] Explore the TensorFlow Installation | dummies https://www.dummies.com/article/technology/information-technology/ai/machine-learning/explore-tensorflow-installation-253429/
[3] Optimizing TensorFlow Performance with GPU Acceleration – GeeksforGeeks https://www.geeksforgeeks.org/machine-learning/optimizing-tensorflow-performance-with-gpu-acceleration/
[4] tensorflow/tensorflow: An Open Source Machine Learning … – GitHub https://github.com/tensorflow/tensorflow
[5] Where are the tests for every Tensorflow GPU kernel? – Stack Overflow https://stackoverflow.com/questions/54754570/where-are-the-tests-for-every-tensorflow-gpu-kernel
[6] Automatic Mixed Precision for NVIDIA Tensor Core Architecture in TensorFlow | NVIDIA Technical Blog https://developer.nvidia.com/blog/nvidia-automatic-mixed-precision-tensorflow/
[7] Performance regression for FP16 Relu kernel from TF2.6 to TF2.7 https://github.com/tensorflow/tensorflow/issues/53476
[8] tensorflow-upstream/rocm_docs/core_kernels.md at … – GitHub https://github.com/ROCm/tensorflow-upstream/blob/develop-upstream/rocm_docs/core_kernels.md
[9] keras-team/tf-keras: The TensorFlow-specific implementation of the … https://github.com/keras-team/tf-keras
[10] What is mixed precision in TensorFlow? – Omi AI https://www.omi.me/blogs/tensorflow-guides/what-is-mixed-precision-in-tensorflow
[11] how to do convolution with fp16(Eigen::half) on tensorflow https://stackoverflow.com/questions/57591154/how-to-do-convolution-with-fp16eigenhalf-on-tensorflow
[12] Build from source – TensorFlow https://www.tensorflow.org/install/source
[13] Nathan Luehr, Sr. Developer Technology Engineer, NVIDIA https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf
[14] 如何在tensorflow上使用fp16(Eigen::half)进行卷积-腾讯云开发者社区-腾讯云 https://cloud.tencent.com/developer/ask/sof/1279297
[15] Reed Wanderman-Milne (Google) and Nathan Luehr (NVIDIA) https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91029-automated-mixed-precision-tools-for-tensorflow-training-v2.pdf
[16] Compiling TensorFlow 1.8 with AVX2/FMA instruction and with Intel MKL https://gist.github.com/bzamecnik/22340d5ba463eb25fd859f1bda3ab530
[17] TensorFlow code and pre-trained models for BERT – GitHub https://github.com/google-research/bert
[18] Understanding Tensorflow Mixed Precision – Theodo Data & AI https://data-ai.theodo.com/en/technical-blog/understanding-tensorflow-mixed-precision
[19] Branches https://chromium.googlesource.com/external/github.com/tensorflow/tensorflow.git/
[20] GitHub – khcs/fp16-demo-tf: Examples for mixed-precision training for utilizing TensorCores in NVIDIA Volta GPUs https://github.com/khcs/fp16-demo-tf
[21] GitHub – whutbd/tensorflow-code: An Open Source Machine Learning Framework for Everyone https://github.com/whutbd/tensorflow-code
[22] [딥러닝] Mixed Precision 사용하기(tensorflow 설명) https://minimin2.tistory.com/124
[23] 훈련 후 float16 양자화 – TensorFlow https://www.tensorflow.org/lite/performance/post_training_float16_quant
[24] GitHub – gg-big-org/tensorflow-1: An Open Source Machine Learning Framework for Everyone https://github.com/gg-big-org/tensorflow-1
[25] 혼합 정밀도 | TensorFlow Core https://www.tensorflow.org/guide/mixed_precision
[26] sirCamp/tensorflow-kernels – GitHub https://github.com/sirCamp/tensorflow-kernels
[27] Build Tensorflow from source, for better performance on Ubuntu. https://gist.github.com/alokprasad/e6d99255cc2bfe6a832a6561c10ccb49
[28] Use a GPU | TensorFlow Core https://www.tensorflow.org/guide/gpu
[29] [TensorFlow] TensorFlow 소스 빌드 (Ubuntu 16.04 + GPU + CUDA … https://eehoeskrap.tistory.com/371
[30] TensorFlow Docker Images supporting CUDA 11 https://gist.github.com/ai2ys/ed8c304220091f81f5a9a15649f46352
[31] FP16 cudnnConvolutionForward https://forums.developer.nvidia.com/t/fp16-cudnnconvolutionforward/75179
[32] Build TensorFlow 1.8 with XLA, MKL, CUDA, cuDNN and TensorRT https://gist.github.com/sunsided/824394cfaf87c6718697e0fda3bb229c
[33] tensorflow-kernels https://pypi.org/project/tensorflow-kernels/
[34] HT0710/How-to-install-CUDA-CuDNN-TensorFlow-Pytorch – GitHub https://github.com/HT0710/How-to-install-CUDA-CuDNN-TensorFlow-Pytorch
[35] Incorrect graph conversion of fp16 model · Issue #1104 – GitHub https://github.com/onnx/tensorflow-onnx/issues/1104
[36] TensorFlow Framework & GPU Acceleration from NVIDIA Data Center https://www.nvidia.com/en-sg/data-center/gpu-accelerated-applications/tensorflow/
[37] NVIDIA/tensorflow: An Open Source Machine Learning … – GitHub https://github.com/NVIDIA/tensorflow
[38] Writing your own CUDA kernel (Part 2) http://stanford.edu/~ldomine/2018-10-02-Writing-your-own-CUDA-kernel-2.html
[39] Kohulan/Tensorflow-2.0-installation-with-CUDA-support: A … – GitHub https://github.com/Kohulan/Tensorflow-2.0-installation-with-CUDA-support
[40] Create an op | TensorFlow Core https://www.tensorflow.org/guide/create_op
[41] How to get Keras Conv2D layers to work on GPU https://stackoverflow.com/questions/71517501/how-to-get-keras-conv2d-layers-to-work-on-gpu
[42] tensorflow/core/kernels/conv_2d_gpu.h – Android GoogleSource https://android.googlesource.com/platform/external/tensorflow/+/67dd04ee2a1e6f53f379d8a44084866ee6382f6f/tensorflow/core/kernels/conv_2d_gpu.h
[43] Build tensorflow with GPUS support from source https://discuss.ai.google.dev/t/build-tensorflow-with-gpus-support-from-source/31800
[44] Deterministic GPU implementation of unsorted segment reduction … https://github.com/tensorflow/tensorflow/issues/54276
[45] tf.keras.layers.Conv2D | TensorFlow v2.16.1 https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D
[46] Writing custom ops, kernels and gradients in TensorFlow.js https://www.tensorflow.org/js/guide/custom_ops_kernels_gradients
[47] Tensorflow custom op, gpu kernel returns a tensor of zeros https://stackoverflow.com/questions/71068368/tensorflow-custom-op-gpu-kernel-returns-a-tensor-of-zeros
[48] tensorflow/tensorflow/core/kernels/conv_2d_gpu_half.cu.cc at master · tensorflow/tensorflow https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/conv_2d_gpu_half.cu.cc
[49] tensorflow – core – kernels – spacetobatch_functor_gpu.cu.cc – GitLab https://camel.kaist.ac.kr/opensource/graphtensor-ipdps-opensource/-/blob/f92badf5a73e8efb8f0880d483c7267aa6ba16d0/graphtensor-backend/tensorflow/core/kernels/spacetobatch_functor_gpu.cu.cc
[50] tensorflow.nn.conv2d – input / kernel matmul https://stackoverflow.com/questions/46346175/tensorflow-nn-conv2d-input-kernel-matmul
[51] 커스텀 연산자 · 텐서플로우 문서 한글 번역본 – TensorFlow Korea https://tensorflowkorea.gitbooks.io/tensorflow-kr/content/g3doc/how_tos/adding_an_op/
[52] tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc – GitLab https://gitlab-research.centralesupelec.fr/gerardo.granados/tensorflow/-/blob/chunduriv-patch-1/tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc

Deep_Learning_Review