Date: 2025-11-05 (Asia/Seoul)
TL;DR
Choosing an LLM becomes straightforward when you map agent purpose → required capabilities → operational constraints → safety & governance. Use the P-COS² Framework (Purpose, Capabilities, Operations, Safety & Governance, Spend) and the L0~L4 autonomy ladder to decide if you need a full Agentic stack or a simpler Workflow/AI automation setup. Validate with a repeatable AgentEval harness before rollout.
1) Executive Summary
With dozens of frontier and open models, “best” is meaningless without context. This whitepaper shows how to select the right LLM for your agent by:
- Translating business goals into capability requirements (tool use, browsing, long-context, coding, multilingual, structured output, etc.).
- Matching those requirements to model classes (frontier closed-source vs. fine‑tuned open‑source) and deployment modes (SaaS vs. on‑prem).
- Governing risk through HITL, guardrails, and policy budgets.
- Proving value via scenario-based evaluation and cost/latency modeling.
2) The Purpose-First Rule
Your AI agent’s purpose is the deciding factor: each model shines in different scenarios, so start from the mission, not the model.
Define:
- User & Task: research, support, coding, operations, analytics…
- Environment: data sensitivity, compliance scope, air-gapped?
- Success Criteria: quality (groundedness), time, cost, safety.
- Risk Tolerance: where HITL is mandatory vs. optional.
3) P‑COS² Decision Framework
P — Purpose: Concrete goals, KPIs, target autonomy level (L0~L4).
C — Capabilities (select what you truly need):
- Tool Use / Function Calling (including web browsing & scraping)
- Retrieval & Long‑Context (RAG, 200k+ tokens, table/PDF competence)
- Structured Output (JSON/DSL reliability, schema validation)
- Reasoning Style (stepwise planning, multi‑turn memory)
- Multimodality (vision/audio if required)
- Coding (generation, refactor, test synthesis)
- Multilingual (Korean ↔ English quality if applicable)
O — Operations:
- Latency (p50/p95), throughput, rate limits, caching
- Observability (traces, tool-call logs), rollout (canary, fallback)
- SLAs & uptime, version pinning
S — Safety & Governance:
- Guardrails: PII masking, allow/deny tool lists, budget caps
- Auditability: reproducible runs, decision logs, HITL checkpoints
- Policy violations & auto-halt actions
S — Spend (Economics):
- Cost-per-task (input+output tokens + tool calls)
- Scale forecast, ROI threshold (vs. manual or simpler automation)
Outcome: If Purpose+Capabilities imply dynamic planning with multi‑tool execution and the Operations/Safety/Spend boxes are checked, move to Agentic (L4). Otherwise, stay at L2/L3.
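As a minimal sketch, the outcome rule can be encoded as a small gate; the Assessment fields and the target_autonomy mapping below are illustrative assumptions, not part of the framework itself.

from dataclasses import dataclass

@dataclass
class Assessment:
    # Illustrative P-COS2 inputs; field names are assumptions, not a spec.
    dynamic_planning: bool   # Purpose + Capabilities imply multi-tool, dynamic planning
    ops_ready: bool          # latency, observability, SLA boxes checked
    safety_ready: bool       # guardrails, audit, HITL in place
    within_budget: bool      # cost-per-task and ROI threshold met

def target_autonomy(a: Assessment) -> str:
    """Map a P-COS2 assessment to a target autonomy level."""
    if a.dynamic_planning and a.ops_ready and a.safety_ready and a.within_budget:
        return "L4"   # full agentic: plan-act-observe under policy
    if a.dynamic_planning:
        return "L3"   # model decides, human approves (HITL)
    return "L2"       # static multi-step workflow is enough

# Example: planning is needed but safety controls are not ready yet -> stay at L3.
print(target_autonomy(Assessment(True, True, False, True)))   # L3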
4) Autonomy Ladder (L0~L4)
- L0 — Manual / Scripted: Human executes, scripts assist.
- L1 — API Triggered: One‑shot actions, no planning.
- L2 — Workflow Orchestrated: Multi‑step static flows.
- L3 — AI‑Assisted: Model predicts/decides; HITL approves.
- L4 — Agentic: Model plans‑acts‑observes‑improves under policy.
Guideline: Keep high‑risk tasks at L3 (HITL). Promote low‑risk, high‑volume tasks to L4 once KPIs & guardrails are stable.
5) Model Landscape by Agent Purpose (Examples)
Capabilities and licensing evolve rapidly; treat the models named below as examples and verify current tool/browsing/context specs before adoption.
A) Web Browsing & Research Agents
- Need: real‑time info extraction, rapid synthesis from diverse sources, strong tool use/structured output.
- Try: GPT‑4o (with browsing), Perplexity API, Gemini 1.5 Pro.
- Notes: Ensure rate limits and crawler compliance; implement grounded citations.
B) Document Analysis & RAG
- Need: long‑context reasoning, PDF/table layout awareness, strong retrieval integration.
- Try: GPT‑4o, Claude 3 Sonnet, Llama 3 (fine‑tuned), Mistral (fine‑tuned).
- Notes: Invest in chunking, hybrid search (BM25+vector), eval for groundedness.
C) Coding & Dev Assistants
- Need: code generation/refactor, test synthesis, multi‑file reasoning, tool use with repos/CI.
- Try: GPT‑4o, Claude 3 Opus, StarCoder2, CodeLlama‑70B.
- Notes: Enforce repo‑scoped permissions; cache embeddings; measure acceptance rate.
D) Specialized Domain Applications
- Need: domain terminology, regulation‑aware responses, predictable JSON.
- Try: Llama 3/Mistral fine‑tunes, Gemma 2B for lightweight on‑device.
- Notes: Prioritize fine‑tuning + RAG; add red‑teaming for compliance.
6) Router over One-Size-Fits-All
Use a Model Router that selects a model per task, with fallbacks and cost‑aware policies.
Policy Sketch
router:
  rules:
    - when: task == "code"
      primary: gpt-4o
      fallbacks: [claude-3-opus, starcoder2]
    - when: task == "web_browse"
      primary: gpt-4o-browse
      fallbacks: [gemini-1.5-pro, perplexity]
    - when: task == "rag_docs"
      primary: claude-3-sonnet
      fallbacks: [gpt-4o, llama3-ft]
  budgets:
    per_task_usd: 0.75
    daily_usd: 150
  guardrails:
    deny_tools: ["delete_db", "rotate_keys"]
    require_hitl: ["financial_trades", "policy_changes"]
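A minimal sketch of a router that consumes a policy like the one above, assuming it is loaded with PyYAML; call_model is a hypothetical adapter around your provider SDKs and should raise on failure so fallbacks get a turn.

import yaml  # pip install pyyaml

def route(task: str, policy: dict, call_model):
    """Try the primary model for a task, then fall back in order."""
    for rule in policy["router"]["rules"]:
        # The sketch above writes conditions like: task == "code".
        # For simplicity we match on the quoted task name.
        if f'"{task}"' in rule["when"]:
            for model in [rule["primary"], *rule.get("fallbacks", [])]:
                try:
                    return call_model(model, task)   # hypothetical provider adapter
                except Exception:
                    continue   # try the next fallback
    raise RuntimeError(f"no route (or all fallbacks failed) for task: {task}")

with open("router_policy.yaml") as f:   # the policy sketch above, saved to a file
    policy = yaml.safe_load(f)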
7) AgentEval — A Practical Evaluation Harness
Design an evaluation loop that mirrors real agent behavior (plan→act→observe):
Core Metrics
- Task Success Rate (pass/fail against acceptance tests)
- Groundedness (citation coverage, anti‑hallucination)
- Tool‑Call Accuracy (args validity, side‑effect checks)
- JSON Validity (schema pass rate)
- Latency/Cost (p50/p95, cost‑per‑task)
- Safety (policy violations/1k actions)
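Of these, JSON validity is the cheapest to automate. A minimal sketch using the jsonschema package; the schema itself is an illustrative placeholder.

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Illustrative output schema; replace with your agent's real contract.
SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
    "required": ["answer", "sources"],
}

def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as JSON and pass the schema."""
    passed = 0
    for raw in outputs:
        try:
            validate(instance=json.loads(raw), schema=SCHEMA)
            passed += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return passed / len(outputs) if outputs else 0.0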
Harness Pseudocode
# Illustrative harness loop: plan -> act -> observe -> score.
for scenario in scenarios:
    plan = agent.plan(scenario.goal, context=scenario.ctx)
    trace = []
    for step in plan:
        # Route each step to a tool and record the call for auditing.
        call = tool_router.select(step).execute(step.args)
        trace.append(call)
        if guardrails.violated(call):
            halt_and_notify(trace)   # auto-halt on policy violation
            break
    score = evaluator.judge(scenario, trace)   # LLM-as-judge or rule-based checks
    log_metrics(score, trace)
Datasets: Create in‑house gold tasks (doc QA, browsing, coding, JSON agents). Add adversarial prompts for safety.
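One lightweight way to represent such gold tasks, as a sketch; the goal/ctx fields mirror the pseudocode above, and the remaining fields and sample values are illustrative assumptions to adapt.

from dataclasses import dataclass, field

@dataclass
class Scenario:
    goal: str                                             # what the agent should accomplish
    ctx: dict = field(default_factory=dict)               # inputs: docs, URLs, repo paths, ...
    acceptance: list[str] = field(default_factory=list)   # checks behind Task Success Rate
    adversarial: bool = False                             # mark red-team / safety prompts

scenarios = [
    Scenario(
        goal="Summarize the attached policy document with citations",
        ctx={"doc_path": "docs/example_policy.pdf"},      # illustrative path
        acceptance=["every claim has a citation", "output passes the JSON schema"],
    ),
]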
8) RAG & Long‑Context Design Notes
- Use hybrid retrieval (sparse + dense) and domain‑specific rerankers.
- Prefer small chunks with semantic overlap; keep citation spans short.
- Track Answer Sources and Coverage; penalize ungrounded tokens.
- For very long docs, consider map‑reduce prompting or sectional routing.
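One common way to fuse the sparse and dense result lists is reciprocal rank fusion; a minimal sketch, assuming each retriever returns doc IDs ordered best-first.

from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-ID lists (e.g., BM25 and vector search) into one ranking."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)   # k=60 is the conventional constant
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and vector candidates before passing to a reranker.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])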
9) Cost & Latency Modeling (Quick Math)
Cost per Task ≈ (prompt_tokens × in_rate + gen_tokens × out_rate) ÷ 1,000,000 + tool_call_costs, with in_rate/out_rate quoted in USD per 1M tokens.
Example
- Prompt 12k, Gen 3k, in_rate $1/1M, out_rate $3/1M → $0.021
- Add browsing/tool calls: +$0.005 → $0.026/task
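The same arithmetic as a small helper (rates in USD per 1M tokens, matching the example figures above):

def cost_per_task(prompt_tokens: int, gen_tokens: int,
                  in_rate_per_m: float, out_rate_per_m: float,
                  tool_call_costs: float = 0.0) -> float:
    """Token cost (rates quoted per 1M tokens) plus any tool-call costs, in USD."""
    token_cost = (prompt_tokens * in_rate_per_m + gen_tokens * out_rate_per_m) / 1_000_000
    return token_cost + tool_call_costs

print(cost_per_task(12_000, 3_000, 1.0, 3.0))         # ≈ 0.021
print(cost_per_task(12_000, 3_000, 1.0, 3.0, 0.005))  # ≈ 0.026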
Use caching (embeddings/results) and smaller specialists for savings.
10) Safety & Governance Quickstart
- Budgets: per‑task & daily caps; kill‑switch on exceed.
- Allow/Deny: whitelisted tools; read/write separation.
- PII/Secrets: redact & scope; secrets never in prompts.
- Audit: immutable run logs; reproducible seeds/configs.
- HITL: approvals for high‑impact actions.
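A minimal sketch of a pre-execution gate combining these controls; the class and decision strings are illustrative and would plug into the router and harness above.

class Guardrails:
    """Illustrative pre-execution gate: deny list, budget caps, HITL triggers."""

    def __init__(self, deny_tools: set, require_hitl: set,
                 per_task_usd: float, daily_usd: float):
        self.deny_tools = deny_tools
        self.require_hitl = require_hitl
        self.per_task_usd = per_task_usd
        self.daily_usd = daily_usd
        self.spent_today = 0.0

    def check(self, tool: str, action: str, est_cost: float) -> str:
        if tool in self.deny_tools:
            return "deny"    # never execute
        if est_cost > self.per_task_usd or self.spent_today + est_cost > self.daily_usd:
            return "halt"    # kill-switch on budget breach
        if action in self.require_hitl:
            return "hitl"    # queue for human approval
        self.spent_today += est_cost
        return "allow"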
11) 30/60/90 Day Plan
- Day 0–30: Define Purpose & KPIs → stand up router + logging → run AgentEval v0 on 2–3 models.
- Day 31–60: Add RAG/browsing; deploy L3 (HITL) to a low‑risk use case; track success/cost.
- Day 61–90: Introduce L4 for low‑risk flows with budgets/guardrails; expand eval set; negotiate provider SLAs.
12) FAQ
- Is a frontier model always best? No. Use routers and specialists; pay for quality where it matters.
- When do I need full Agentic? When goals vary, multi‑tool planning is required, and you can enforce budgets/guardrails.
- How do I avoid vendor lock‑in? Abstraction layer for tools & traces; maintain open‑source fallbacks.
13) Korean Summary (translated)
Key Takeaways
- Selection simplifies to purpose → capabilities → operations → safety → spend (P‑COS²).
- Web research/browsing: GPT‑4o (browsing), Perplexity API, Gemini 1.5 Pro.
- Documents & RAG: GPT‑4o, Claude 3 Sonnet, fine‑tuned Llama 3, Mistral.
- Coding: GPT‑4o, Claude 3 Opus, StarCoder2, CodeLlama‑70B.
- Domain‑specific: fine‑tuned Llama 3/Mistral; Gemma 2B for lightweight deployments.
- A model router beats a single model in practice; validate quality, cost, and safety up front with AgentEval.
Checklist
- Define purpose and success criteria
- Select the capabilities you need (tool use, RAG, long context, structured output, multimodality, coding, multilingual)
- Operational metrics and observability (latency, cost, logs)
- Guardrails and policy (HITL, permissions, budgets, PII)
- Model routing, fallbacks, caching
- Pre-test with AgentEval, then roll out gradually
14) Credits & Community
- Access leading models in one place: https://thealpha.dev