llama_ggml


📌 Overall structure summary

This code is an example that performs a simple matrix multiplication (matmul) with ggml:

  • Runs the computation on the CPU or a GPU (Metal / CUDA) backend
  • Builds a ggml compute graph → executes it → checks the result
  • A good example of ggml's "graph-based computation model"

๐Ÿ—บ๏ธ ๊ตฌ์กฐ ๋‹จ๊ณ„๋ณ„ ๋ถ„์„

1๏ธโƒฃ ์ค€๋น„

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
  • ggml core, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น๊ธฐ, backend API ์‚ฌ์šฉ

2๏ธโƒฃ ๋ชจ๋ธ ์ •์˜

struct simple_model {
    struct ggml_tensor* a{};
    struct ggml_tensor* b{};
    ggml_backend_t backend = nullptr;
    ggml_backend_buffer_t buffer{};
    struct ggml_context* ctx{};
};
  • simple_model ๊ตฌ์กฐ์ฒด์— 2๊ฐœ์˜ ํ…์„œ a, b ์„ ์–ธ
  • backend ์„ค์ • (CPU / CUDA / Metal)
  • backend buffer ์‚ฌ์šฉ (a, b ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅ)
  • context๋Š” ggml์—์„œ ํ…์„œ ๋ฉ”ํƒ€ ์ •๋ณด, ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๋‹ด๋‹น

3๏ธโƒฃ ๋ฐ์ดํ„ฐ ์ค€๋น„

float matrix_A[4 * 2] = { ... };
float matrix_B[3 * 2] = { ... };
  • matrix_A: (4×2) matrix
  • matrix_B: (3×2) matrix โ†’ ๋‚ด๋ถ€์—์„œ transpose๋˜์–ด ๊ณฑํ•ด์ง

4๏ธโƒฃ Backend ์ดˆ๊ธฐํ™”

#ifdef GGML_USE_CUDA
    model.backend = ggml_backend_cuda_init(0);
#endif
#ifdef GGML_USE_METAL
    model.backend = ggml_backend_metal_init();
#endif
if (!model.backend) {
    model.backend = ggml_backend_cpu_init();
}
  • CUDA โ†’ Metal โ†’ CPU ์ˆœ์œผ๋กœ backend ์ดˆ๊ธฐํ™”
  • llama.cpp๋„ ๋น„์Šทํ•œ ๊ตฌ์กฐ (๋‹ค์–‘ํ•œ backend ์ง€์›)

5๏ธโƒฃ ggml context ์ƒ์„ฑ + ํ…์„œ ์ƒ์„ฑ

ggml_init_params params{...};
model.ctx = ggml_init(params);
model.a = ggml_new_tensor_2d(model.ctx, GGML_TYPE_F32, 2, 4);
model.b = ggml_new_tensor_2d(model.ctx, GGML_TYPE_F32, 2, 3);
  • ggml context ์ƒ์„ฑ โ†’ ํ…์„œ ์ƒ์„ฑ
  • ggml_new_tensor_2d ๋กœ 2D tensor ์ƒ์„ฑ (shape = cols, rows ์ฃผ์˜)

6๏ธโƒฃ Backend buffer ์‚ฌ์šฉ

model.buffer = ggml_backend_alloc_ctx_tensors(model.ctx, model.backend);
ggml_backend_tensor_set(model.a, a, ...);
ggml_backend_tensor_set(model.b, b, ...);
  • backend buffer์— ํ…์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ง์ ‘ ์—…๋กœ๋“œ
  • CPU โ†’ GPU ๋ฉ”๋ชจ๋ฆฌ ๋ณต์‚ฌ ์ง์ ‘ ์ปจํŠธ๋กค โ†’ llama.cpp์™€ ๋™์ผํ•œ ์„ค๊ณ„ ์ฒ ํ•™

7๏ธโƒฃ ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑ

ggml_context* ctx0 = ggml_init(params0);
ggml_cgraph* gf = ggml_new_graph(ctx0);
ggml_tensor* result = ggml_mul_mat(ctx0, model.a, model.b);
ggml_build_forward_expand(gf, result);
  • ๋ณ„๋„์˜ temporay context ์ƒ์„ฑ โ†’ ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑ
  • ggml_mul_mat : ํ–‰๋ ฌ ๊ณฑ ์—ฐ์‚ฐ ๋…ธ๋“œ ์ƒ์„ฑ
  • ๊ทธ๋ž˜ํ”„ ๋นŒ๋“œ ๋‹จ๊ณ„ โ†’ llama.cpp์—์„œ๋„ ํ•„์ˆ˜

8๏ธโƒฃ ๋ฉ”๋ชจ๋ฆฌ ์˜ˆ์•ฝ + ๊ทธ๋ž˜ํ”„ ์‹คํ–‰

ggml_gallocr_alloc_graph(allocr, gf);
ggml_backend_graph_compute(model.backend, gf);
  • ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น๊ธฐ(allocator)๋กœ ๊ทธ๋ž˜ํ”„ ์ตœ์ ํ™” ์‹คํ–‰
  • backend์—์„œ ๊ทธ๋ž˜ํ”„ ์‹คํ–‰ โ†’ CPU๋“  GPU๋“  ๋™์ผํ•œ API ์‚ฌ์šฉ (llama.cpp์—์„œ๋„ ๋™์ผ)

9๏ธโƒฃ ๊ฒฐ๊ณผ ๊ฐ€์ ธ์˜ค๊ธฐ + ์ถœ๋ ฅ

ggml_backend_tensor_get(result, out_data.data(), ...);
  • backend ๋ฉ”๋ชจ๋ฆฌ โ†’ CPU ๋ฉ”๋ชจ๋ฆฌ๋กœ ๊ฒฐ๊ณผ ๋ณต์‚ฌ
  • ๊ฒฐ๊ณผ ํ–‰๋ ฌ ์ถœ๋ ฅ

10๏ธโƒฃ ๋งˆ๋ฌด๋ฆฌ ์ •๋ฆฌ

ggml_gallocr_free(allocr);
ggml_free(model.ctx);
ggml_backend_buffer_free(model.buffer);
ggml_backend_free(model.backend);
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ backend clean-up

🚩 Key points

✅ ggml uses graph-based computation
✅ Memory control and backend control are exposed directly
✅ Structurally identical to llama.cpp:

  • llama.cpp also uses ggml graph compute
  • Uses the same backend abstraction
  • Memory management is explicit

✅ A structure that is very favorable for on-device optimization


๐Ÿ” ์ดํ‰

  • ๋งค์šฐ ์ข‹์€ ggml ์ƒ˜ํ”Œ
  • llama.cpp์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ ์ดํ•ด์— ๋”ฑ ์ ํ•ฉ
  • “C/C++ ๊ธฐ๋ฐ˜ + ๊ทธ๋ž˜ํ”„ ๊ธฐ๋ฐ˜ + backend abstraction” โ†’ llama.cpp๊ฐ€ ์ž„๋ฒ ๋””๋“œ์— ๊ฐ•ํ•œ ์ด์œ  ๊ทธ๋Œ€๋กœ ๋ณด์—ฌ์คŒ

🚀 Worthwhile directions to explore further

1️⃣ ggml_compute_forward vs ggml_backend_graph_compute → CPU-only vs multi-backend support
2️⃣ The role of ggml-alloc → runtime memory optimization
3️⃣ gguf file loading → in llama.cpp: gguf → build ggml_tensor → run ggml_graph



๐Ÿ—บ๏ธ ggml-simple.cpp vs llama.cpp ๊ตฌ์กฐ ๋น„๊ต

๐Ÿ—๏ธ 1๏ธโƒฃ ๊ณตํ†ต ๊ธฐ๋ณธ ๊ตฌ์กฐ

๋‹จ๊ณ„ggml-simple.cppllama.cpp
Backend ์ดˆ๊ธฐํ™”CUDA/Metal/CPU backend ์„ ํƒ๋™์ผ (llama.cpp๋„ backend layer ์กด์žฌ)
ggml_context ์ƒ์„ฑggml_init(params)ggml_init(params) (llama.cpp ๋‚ด๋ถ€ llama_init_from_file)
Tensor ๊ตฌ์„ฑggml_new_tensor_2d ์ง์ ‘ ํ˜ธ์ถœ๋ชจ๋ธ ๋กœ๋”ฉ์‹œ gguf โ†’ tensor ์ž๋™ ๊ตฌ์„ฑ
๋ฐ์ดํ„ฐ ์—…๋กœ๋“œggml_backend_tensor_set ์ˆ˜๋™ ์—…๋กœ๋“œgguf ๋กœ๋”ฉ์‹œ ์ž๋™ tensor ์ดˆ๊ธฐํ™” + backend buffer ์—…๋กœ๋“œ
๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑggml_mul_mat โ†’ ggml_build_forward_expandllama_decode_internal ๋‚ด์—์„œ ์—ฌ๋Ÿฌ ggml_xxx ์—ฐ์‚ฐ ํ˜ธ์ถœ ํ›„ build
๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ตœ์ ํ™”ggml_gallocr_alloc_graphllama_batch_decode์‹œ ๋‚ด๋ถ€ alloc ์‚ฌ์šฉ
๊ทธ๋ž˜ํ”„ ์‹คํ–‰ggml_backend_graph_computeggml_backend_graph_compute (llama.cpp๋„ ๋™์ผ ํ˜ธ์ถœ)
๊ฒฐ๊ณผ ์ถ”์ถœggml_backend_tensor_getllama_decode_internal์—์„œ logits ์ถ”์ถœ์‹œ ์‚ฌ์šฉ

๐Ÿ–ผ๏ธ ๊ตฌ์กฐ ํ๋ฆ„๋„ ์‹œ๊ฐํ™”

๐Ÿ“Œ ggml-simple.cpp

[main]
  โ†“
[backend init] โ†’ [ggml_init]
  โ†“
[create tensor A, B] โ†’ [upload tensor data]
  โ†“

[build graph: ggml_mul_mat]

โ†“

[alloc graph memory]

โ†“ [compute graph] โ†’ [get result tensor]

📌 llama.cpp (inference path)

[llama_init_from_file]
  ↓
[backend init] → [ggml_init]
  ↓
[load model.gguf → create all ggml_tensor]
  ↓
[upload tensor data to backend buffer]
  ↓
[user calls llama_decode or llama_batch_decode]
  ↓
[llama_decode_internal builds ggml_graph]
  ↓
[alloc graph memory (alloc + scratch)]
  ↓
[compute graph (ggml_backend_graph_compute)]
  ↓
[extract logits tensor → tokenizer / output]


๐ŸŽ ํ•ต์‹ฌ ์ฐจ์ด์ 

โœ… ggml-simple.cpp๋Š” ์ˆ˜๋™์œผ๋กœ ํ…์„œ ๋งŒ๋“ค๊ณ  ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑ
โœ… llama.cpp๋Š” gguf ๊ธฐ๋ฐ˜์œผ๋กœ ํ…์„œ ์ž๋™ ๊ตฌ์„ฑ + LLM ์•„ํ‚คํ…์ฒ˜(Attention/MLP/Norm ๋“ฑ) ๊ตฌ์„ฑ

  • llama.cpp ๋‚ด๋ถ€์˜ ํ•ต์‹ฌ ๊ทธ๋ž˜ํ”„๋Š”:
    • ggml_mul_mat
    • ggml_norm
    • ggml_silu
    • ggml_repeat
    • ggml_add / ggml_mul / ggml_reshape ๋“ฑ
      โ†’ ์ด ๋ชจ๋“  ๋…ธ๋“œ๋“ค์ด ggml_cgraph๋กœ ๋นŒ๋“œ๋˜๊ณ  ์‹คํ–‰๋จ โ†’ ์ง€๊ธˆ ์˜ฌ๋ ค์ฃผ์‹  ์ฝ”๋“œ ํ๋ฆ„๊ณผ ๊ฑฐ์˜ ๋™์ผํ•œ ๊ตฌ์กฐ

🚀 Summary

  • The ggml-simple.cpp above is an excellent sample for understanding the structure of llama.cpp → internally, llama.cpp also builds and executes a ggml_graph
  • llama.cpp:
    • on loading model.gguf → builds the tensors
    • on a llama_decode call → builds the ggml_graph
    • executes it with ggml_backend_graph_compute
  • Portability and the optimization points rest on the same ggml structure

🎨 Suggested learning order

1️⃣ Start from ggml_mul_mat in ggml-simple.cpp → practice building an MLP layer
2️⃣ Analyze the llama_decode_internal function in llama.cpp
3️⃣ Analyze the gguf file format (gguf = LLM weights + arch config + metadata)
4️⃣ Experiment with adding a custom ggml op to llama.cpp


๐Ÿ—บ๏ธ llama.cpp ์ตœ์‹ (2025) llama_decode_internal ๊ตฌ์กฐ ํ๋ฆ„

1๏ธโƒฃ llama_init_from_file

  • gguf ๋ชจ๋ธ ํŒŒ์ผ ๋กœ๋“œ
  • ๋ชจ๋“  ggml_tensor ์ƒ์„ฑ (weights, embeddings, norms ๋“ฑ)
  • backend init + backend buffer alloc + tensor data ์—…๋กœ๋“œ
  • โ†’ ๊ธฐ๋ณธ์ ์œผ๋กœ ggml_new_tensor_xd + ggml_backend_tensor_set ์‚ฌ์šฉ๋จ โ†’ ggml-simple.cpp์™€ ๋™์ผ

2๏ธโƒฃ llama_decode_internal ํ˜ธ์ถœ

  • ํ•ต์‹ฌ LLM “forward pass” ์ˆ˜ํ–‰
[llama_decode_internal]
  โ†“

[ggml_init graph context]

โ†“

[ggml_cgraph* gf = ggml_new_graph(ctx)]

โ†“

[build graph for N tokens โ†’ 1 layer์”ฉ loop]


3๏ธโƒฃ Per-layer graph ๊ตฌ์„ฑ

For each Transformer layer:

[ggml_rms_norm] โ†’ Norm
  โ†“
[ggml_mul_mat] Q_proj โ†’ Q
[ggml_mul_mat] K_proj โ†’ K
[ggml_mul_mat] V_proj โ†’ V
  โ†“
[ggml_rope] apply rotary positional embedding to Q,K
  โ†“
[ggml_mul_mat] Q x K^T โ†’ attention scores
[ggml_soft_max] attention scores โ†’ weights
[ggml_mul_mat] weights x V โ†’ attention output
  โ†“
[ggml_add] residual connection
  โ†“
[ggml_rms_norm] โ†’ Norm
  โ†“
[ggml_mul_mat] FFN_1 (intermediate proj)
[ggml_silu] activation
[ggml_mul] gated activation
[ggml_mul_mat] FFN_2 (final proj)
  โ†“
[ggml_add] residual connection

โ†’ Transformer layer 1๊ฐœ ๋‹น ์œ„์™€ ๊ฐ™์€ ggml node ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑ


4๏ธโƒฃ ์ „์ฒด Graph ์ตœ์ข… ๊ตฌ์„ฑ

[embedding lookup]
  โ†“

[layer loop: N layers โ†’ ggml nodes push]

โ†“

[ggml_norm + projection to logits]

โ†“

[ggml_cgraph build complete โ†’ ggml_build_forward_expand(gf, final_logits)]


5๏ธโƒฃ ๊ทธ๋ž˜ํ”„ ์‹คํ–‰

ggml_backend_graph_compute(backend, gf);
  • ๋ฐ”๋กœ ์•„๊นŒ ggml-simple.cpp์™€ ๋™์ผํ•œ ํ˜ธ์ถœ
  • backend(CPU, CUDA, Metal ๋“ฑ)์— ๋”ฐ๋ผ ์ตœ์ ํ™” ์‹คํ–‰

6๏ธโƒฃ ๊ฒฐ๊ณผ ์ถ”์ถœ

ggml_backend_tensor_get(final_logits_tensor, ...)
  • logits ๊ฒฐ๊ณผ ์ถ”์ถœ ํ›„ softmax / sampling โ†’ output token ๊ฒฐ์ •

๐ŸŽ ์ •๋ฆฌ

ggml-simple.cppllama.cpp ๋‚ด๋ถ€ ๊ตฌ์กฐ
ggml_mul_matggml_mul_mat (Q_proj, K_proj, V_proj, FFN ๋“ฑ)
ggml_build_forward_expandggml_build_forward_expand
ggml_backend_graph_computeggml_backend_graph_compute
tensor ์ˆ˜๋™ ์ž…๋ ฅgguf๋กœ ์ž๋™ ๋กœ๋“œ ํ›„ tensor ๊ตฌ์„ฑ
matmul ๊ฒฐ๊ณผ ์ถœ๋ ฅlogits ๊ฒฐ๊ณผ ์ถœ๋ ฅ

🚩 Conclusion

✅ llama.cpp ultimately implements the entire LLM forward pass on top of ggml_graph
✅ The ggml-simple.cpp above is structurally almost identical:

  • Backend init → tensor construction → graph construction → graph execution → result output

✅ llama.cpp builds a complete LLM by composing basic nodes such as ggml_mul_mat / ggml_norm / ggml_silu / ggml_rope

✅ It is inference-only (no training), so the graph goes through a build → run → destroy cycle (repeated for every token)


🎨 Summary picture

[Embedding] → [N layers (Attention + FFN)] → [Norm + Final projection] → [Logits] → [Sampling → Output token]
