Date: 2025-11-05 (Asia/Seoul)
TL;DR
Choosing an LLM becomes straightforward when you map agent purpose → required capabilities → operational constraints → safety & governance. Use the P-COS² Framework (Purpose, Capabilities, Operations, Safety & Governance, Spend) and the L0~L4 autonomy ladder to decide if you need a full Agentic stack or a simpler Workflow/AI automation setup. Validate with a repeatable AgentEval harness before rollout.
1) Executive Summary
With dozens of frontier and open models, “best” is meaningless without context. This whitepaper shows how to select the right LLM for your agent by:
- Translating business goals into capability requirements (tool use, browsing, long-context, coding, multilingual, structured output, etc.).
- Matching those requirements to model classes (frontier closed-source vs. fine‑tuned open‑source) and deployment modes (SaaS vs. on‑prem).
- Governing risk through HITL, guardrails, and policy budgets.
- Proving value via scenario-based evaluation and cost/latency modeling.
2) The Purpose-First Rule
Your AI agent’s purpose is the deciding factor: each model shines in different scenarios, so start from the mission, not the model.
Define:
- User & Task: research, support, coding, operations, analytics…
- Environment: data sensitivity, compliance scope, air-gapped?
- Success Criteria: quality (groundedness), time, cost, safety.
- Risk Tolerance: where HITL is mandatory vs. optional.
3) P‑COS² Decision Framework
P — Purpose: Concrete goals, KPIs, target autonomy level (L0~L4).
C — Capabilities (select what you truly need):
- Tool Use / Function Calling (including web browsing & scraping)
- Retrieval & Long‑Context (RAG, 200k+ tokens, table/PDF competence)
- Structured Output (JSON/DSL reliability, schema validation)
- Reasoning Style (stepwise planning, multi‑turn memory)
- Multimodality (vision/audio if required)
- Coding (generation, refactor, test synthesis)
- Multilingual (Korean ↔ English quality if applicable)
O — Operations:
- Latency (p50/p95), throughput, rate limits, caching
- Observability (traces, tool-call logs), rollout (canary, fallback)
- SLAs & uptime, version pinning
S — Safety & Governance:
- Guardrails: PII masking, allow/deny tool lists, budget caps
- Auditability: reproducible runs, decision logs, HITL checkpoints
- Policy violations & auto-halt actions
S — Spend (Economics):
- Cost-per-task (input+output tokens + tool calls)
- Scale forecast, ROI threshold (vs. manual or simpler automation)
Outcome: If Purpose+Capabilities imply dynamic planning with multi‑tool execution and the Operations/Safety/Spend boxes are checked, move to Agentic (L4). Otherwise, stay at L2/L3.
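As a minimal sketch, the outcome rule can be encoded as a small gate; the Assessment fields and the target_autonomy mapping below are illustrative assumptions, not part of the framework itself.

from dataclasses import dataclass

@dataclass
class Assessment:
    # Illustrative P-COS2 inputs; field names are assumptions, not a spec.
    dynamic_planning: bool   # Purpose + Capabilities imply multi-tool, dynamic planning
    ops_ready: bool          # latency, observability, SLA boxes checked
    safety_ready: bool       # guardrails, audit, HITL in place
    within_budget: bool      # cost-per-task and ROI threshold met

def target_autonomy(a: Assessment) -> str:
    """Map a P-COS2 assessment to a target autonomy level."""
    if a.dynamic_planning and a.ops_ready and a.safety_ready and a.within_budget:
        return "L4"   # full agentic: plan-act-observe under policy
    if a.dynamic_planning:
        return "L3"   # model decides, human approves (HITL)
    return "L2"       # static multi-step workflow is enough

# Example: planning is needed but safety controls are not ready yet -> stay at L3.
print(target_autonomy(Assessment(True, True, False, True)))   # L3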
4) Autonomy Ladder (L0~L4)
- L0 — Manual / Scripted: Human executes, scripts assist.
- L1 — API Triggered: One‑shot actions, no planning.
- L2 — Workflow Orchestrated: Multi‑step static flows.
- L3 — AI‑Assisted: Model predicts/decides; HITL approves.
- L4 — Agentic: Model plans‑acts‑observes‑improves under policy.
Guideline: Keep high‑risk tasks at L3 (HITL). Promote low‑risk, high‑volume tasks to L4 once KPIs & guardrails are stable.
5) Model Landscape by Agent Purpose (Examples)
Capabilities and licensing evolve rapidly; treat the models named below as examples and verify current tool/browsing/context specs before adoption.
A) Web Browsing & Research Agents
- Need: real‑time info extraction, rapid synthesis from diverse sources, strong tool use/structured output.
- Try: GPT‑4o (with browsing), Perplexity API, Gemini 1.5 Pro.
- Notes: Ensure rate limits and crawler compliance; implement grounded citations.
B) Document Analysis & RAG
- Need: long‑context reasoning, PDF/table layout awareness, strong retrieval integration.
- Try: GPT‑4o, Claude 3 Sonnet, Llama 3 (fine‑tuned), Mistral (fine‑tuned).
- Notes: Invest in chunking, hybrid search (BM25+vector), eval for groundedness.
C) Coding & Dev Assistants
- Need: code generation/refactor, test synthesis, multi‑file reasoning, tool use with repos/CI.
- Try: GPT‑4o, Claude 3 Opus, StarCoder2, CodeLlama‑70B.
- Notes: Enforce repo‑scoped permissions; cache embeddings; measure acceptance rate.
D) Specialized Domain Applications
- Need: domain terminology, regulation‑aware responses, predictable JSON.
- Try: Llama 3/Mistral fine‑tunes, Gemma 2B for lightweight on‑device.
- Notes: Prioritize fine‑tuning + RAG; add red‑teaming for compliance.
6) Router over One-Size-Fits-All
Use a Model Router that selects a model per task, with fallbacks and cost‑aware policies.
Policy Sketch
router:
  rules:
    - when: task == "code"
      primary: gpt-4o
      fallbacks: [claude-3-opus, starcoder2]
    - when: task == "web_browse"
      primary: gpt-4o-browse
      fallbacks: [gemini-1.5-pro, perplexity]
    - when: task == "rag_docs"
      primary: claude-3-sonnet
      fallbacks: [gpt-4o, llama3-ft]
  budgets:
    per_task_usd: 0.75
    daily_usd: 150
  guardrails:
    deny_tools: ["delete_db", "rotate_keys"]
    require_hitl: ["financial_trades", "policy_changes"]
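A minimal sketch of a router that consumes a policy like the one above, assuming it is loaded with PyYAML; call_model is a hypothetical adapter around your provider SDKs and should raise on failure so fallbacks get a turn.

import yaml  # pip install pyyaml

def route(task: str, policy: dict, call_model):
    """Try the primary model for a task, then fall back in order."""
    for rule in policy["router"]["rules"]:
        # The sketch above writes conditions like: task == "code".
        # For simplicity we match on the quoted task name.
        if f'"{task}"' in rule["when"]:
            for model in [rule["primary"], *rule.get("fallbacks", [])]:
                try:
                    return call_model(model, task)   # hypothetical provider adapter
                except Exception:
                    continue   # try the next fallback
    raise RuntimeError(f"no route (or all fallbacks failed) for task: {task}")

with open("router_policy.yaml") as f:   # the policy sketch above, saved to a file
    policy = yaml.safe_load(f)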
7) AgentEval — A Practical Evaluation Harness
Design an evaluation loop that mirrors real agent behavior (plan→act→observe):
Core Metrics
- Task Success Rate (pass/fail against acceptance tests)
- Groundedness (citation coverage, anti‑hallucination)
- Tool‑Call Accuracy (args validity, side‑effect checks)
- JSON Validity (schema pass rate)
- Latency/Cost (p50/p95, cost‑per‑task)
- Safety (policy violations/1k actions)
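Of these, JSON validity is the cheapest to automate. A minimal sketch using the jsonschema package; the schema itself is an illustrative placeholder.

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Illustrative output schema; replace with your agent's real contract.
SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
    "required": ["answer", "sources"],
}

def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as JSON and pass the schema."""
    passed = 0
    for raw in outputs:
        try:
            validate(instance=json.loads(raw), schema=SCHEMA)
            passed += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return passed / len(outputs) if outputs else 0.0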
Harness Pseudocode
# Illustrative harness loop: plan -> act -> observe -> score.
for scenario in scenarios:
    plan = agent.plan(scenario.goal, context=scenario.ctx)
    trace = []
    for step in plan:
        # Route each step to a tool and record the call for auditing.
        call = tool_router.select(step).execute(step.args)
        trace.append(call)
        if guardrails.violated(call):
            halt_and_notify(trace)   # auto-halt on policy violation
            break
    score = evaluator.judge(scenario, trace)   # LLM-as-judge or rule-based checks
    log_metrics(score, trace)
Datasets: Create in‑house gold tasks (doc QA, browsing, coding, JSON agents). Add adversarial prompts for safety.
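One lightweight way to represent such gold tasks, as a sketch; the goal/ctx fields mirror the pseudocode above, and the remaining fields and sample values are illustrative assumptions to adapt.

from dataclasses import dataclass, field

@dataclass
class Scenario:
    goal: str                                             # what the agent should accomplish
    ctx: dict = field(default_factory=dict)               # inputs: docs, URLs, repo paths, ...
    acceptance: list[str] = field(default_factory=list)   # checks behind Task Success Rate
    adversarial: bool = False                             # mark red-team / safety prompts

scenarios = [
    Scenario(
        goal="Summarize the attached policy document with citations",
        ctx={"doc_path": "docs/example_policy.pdf"},      # illustrative path
        acceptance=["every claim has a citation", "output passes the JSON schema"],
    ),
]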
8) RAG & Long‑Context Design Notes
- Use hybrid retrieval (sparse + dense) and domain‑specific rerankers.
- Prefer small chunks with semantic overlap; keep citation spans short.
- Track Answer Sources and Coverage; penalize ungrounded tokens.
- For very long docs, consider map‑reduce prompting or sectional routing.
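One common way to fuse the sparse and dense result lists is reciprocal rank fusion; a minimal sketch, assuming each retriever returns doc IDs ordered best-first.

from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-ID lists (e.g., BM25 and vector search) into one ranking."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)   # k=60 is the conventional constant
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and vector candidates before passing to a reranker.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])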
9) Cost & Latency Modeling (Quick Math)
Cost per Task ≈ (prompt_tokens × in_rate + gen_tokens × out_rate) ÷ 1,000,000 + tool_call_costs, with in_rate/out_rate quoted in USD per 1M tokens.
Example
- Prompt 12k, Gen 3k, in_rate $1/1M, out_rate $3/1M → $0.021
- Add browsing/tool calls: +$0.005 → $0.026/task
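The same arithmetic as a small helper (rates in USD per 1M tokens, matching the example figures above):

def cost_per_task(prompt_tokens: int, gen_tokens: int,
                  in_rate_per_m: float, out_rate_per_m: float,
                  tool_call_costs: float = 0.0) -> float:
    """Token cost (rates quoted per 1M tokens) plus any tool-call costs, in USD."""
    token_cost = (prompt_tokens * in_rate_per_m + gen_tokens * out_rate_per_m) / 1_000_000
    return token_cost + tool_call_costs

print(cost_per_task(12_000, 3_000, 1.0, 3.0))         # ≈ 0.021
print(cost_per_task(12_000, 3_000, 1.0, 3.0, 0.005))  # ≈ 0.026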
Use caching (embeddings/results) and smaller specialists for savings.
10) Safety & Governance Quickstart
- Budgets: per‑task & daily caps; kill‑switch on exceed.
- Allow/Deny: whitelisted tools; read/write separation.
- PII/Secrets: redact & scope; secrets never in prompts.
- Audit: immutable run logs; reproducible seeds/configs.
- HITL: approvals for high‑impact actions.
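A minimal sketch of a pre-execution gate combining these controls; the class and decision strings are illustrative and would plug into the router and harness above.

class Guardrails:
    """Illustrative pre-execution gate: deny list, budget caps, HITL triggers."""

    def __init__(self, deny_tools: set, require_hitl: set,
                 per_task_usd: float, daily_usd: float):
        self.deny_tools = deny_tools
        self.require_hitl = require_hitl
        self.per_task_usd = per_task_usd
        self.daily_usd = daily_usd
        self.spent_today = 0.0

    def check(self, tool: str, action: str, est_cost: float) -> str:
        if tool in self.deny_tools:
            return "deny"    # never execute
        if est_cost > self.per_task_usd or self.spent_today + est_cost > self.daily_usd:
            return "halt"    # kill-switch on budget breach
        if action in self.require_hitl:
            return "hitl"    # queue for human approval
        self.spent_today += est_cost
        return "allow"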
11) 30/60/90 Day Plan
- Day 0–30: Define Purpose & KPIs → stand up router + logging → run AgentEval v0 on 2–3 models.
- Day 31–60: Add RAG/browsing; deploy L3 (HITL) to a low‑risk use case; track success/cost.
- Day 61–90: Introduce L4 for low‑risk flows with budgets/guardrails; expand eval set; negotiate provider SLAs.
12) FAQ
- Is a frontier model always best? No. Use routers and specialists; pay for quality where it matters.
- When do I need full Agentic? When goals vary, multi‑tool planning is required, and you can enforce budgets/guardrails.
- How do I avoid vendor lock‑in? Abstraction layer for tools & traces; maintain open‑source fallbacks.
13) Korean Summary (translated)
Key Takeaways
- Selection simplifies to purpose → capabilities → operations → safety → spend (P‑COS²).
- Web research/browsing: GPT‑4o (browsing), Perplexity API, Gemini 1.5 Pro.
- Documents & RAG: GPT‑4o, Claude 3 Sonnet, fine‑tuned Llama 3, Mistral.
- Coding: GPT‑4o, Claude 3 Opus, StarCoder2, CodeLlama‑70B.
- Domain‑specific: fine‑tuned Llama 3/Mistral; Gemma 2B for lightweight deployments.
- A model router beats a single model in practice; validate quality, cost, and safety up front with AgentEval.
Checklist
- Define purpose and success criteria
- Select the capabilities you need (tool use, RAG, long context, structured output, multimodality, coding, multilingual)
- Operational metrics and observability (latency, cost, logs)
- Guardrails and policy (HITL, permissions, budgets, PII)
- Model routing, fallbacks, caching
- Pre-test with AgentEval, then roll out gradually
14) Credits & Community
- Access leading models in one place: https://thealpha.dev