📘 My Own Chatbot (RAG-Based)

1. Project Overview

1.1 Project Goal

Preprocess a data file, convert it into embedding vectors,
and store them in a **vector store** so that Retrieval Augmented Generation (RAG) can be used
to build a personal chatbot capable of general conversation.

1.2 Procedure

✅ Prepare and process the training data
✅ Embed the training data
✅ Store the embedding data
✅ On each question, retrieve the relevant context → generate an answer with the LLM
✅ Run the chatbot interaction


2. Implementation

2.1 Processing the training data: 01_chatbot_dataget.py

Load the example data (my_data.txt) and organize it into a CSV file.

import pandas as pd
import re

def remove_newlines(text):
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r' +', ' ', text)
    return text

def text_to_df(data_file):
    """Parse blank-line-separated sections: first line is the title, the rest is the body."""
    texts = []
    with open(data_file, 'r', encoding="utf-8") as file:
        text = file.read()
    for section in text.split('\n\n'):
        lines = section.strip().split('\n')
        # Skip empty or title-only sections; an empty body cannot be tokenized or embedded later
        if len(lines) < 2:
            continue
        fname = lines[0]
        content = ' '.join(lines[1:])
        texts.append([fname, content])
    df = pd.DataFrame(texts, columns=['title', 'text'])
    df['text'] = df['text'].apply(remove_newlines)
    return df

# Example run
df = text_to_df('my_data.txt')   # ← put your own data file here
df.to_csv('processed.csv', index=False, encoding='utf-8')
df.head()
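
The parser above implies a particular layout for my_data.txt: sections separated by a blank line, with the first line of each section used as the title and the remaining lines as the body. The sketch below illustrates that layout on a small in-memory sample. The sample text and the helper name `parse_sections` are made up for illustration, and the sketch assumes that sections without a body should be skipped.

```python
# Hypothetical sample in the layout the parser expects:
# sections separated by one blank line, first line = title, rest = body.
SAMPLE = """Opening hours
The shop opens at 9 am.
It closes at 6 pm.

Contact
Email us at any time."""


def parse_sections(text):
    """Return (title, body) pairs; a pandas-free stand-in for the parsing step."""
    pairs = []
    for section in text.split("\n\n"):
        lines = section.strip().split("\n")
        if len(lines) < 2:  # assumption: skip empty or title-only sections
            continue
        pairs.append((lines[0], " ".join(lines[1:])))
    return pairs


for title, body in parse_sections(SAMPLE):
    print(f"{title!r} -> {body!r}")
```

Each pair corresponds to one row of processed.csv, with the title in the first column and the joined body text in the second.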

2.2 Embedding the data: 02_chatbot_embedding.py

Use an OpenAI embedding model to convert the text into vectors.

!pip install openai tiktoken python-dotenv pandas

import os, time
import pandas as pd
import tiktoken
from openai import OpenAI
from dotenv import load_dotenv

# Load the API key
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 1500

df = pd.read_csv("processed.csv")
df.columns = ['title','text']

tokenizer = tiktoken.get_encoding(embedding_encoding)
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Embedding function
def get_embedding(text, model=embedding_model, max_retries=5):
    text = text.replace("\n", " ")
    retries = 0
    while retries < max_retries:
        try:
            response = client.embeddings.create(input=[text], model=model)
            return response.data[0].embedding
        except Exception as e:
            retries += 1
            print(f"Error: {e}. Retrying...")
            time.sleep(2**retries)
    raise Exception(f"Failed embedding: {text}")

df["embeddings"] = df.text.apply(lambda x: get_embedding(x))
df.to_csv("embeddings.csv", index=False)
df.head()
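
get_embedding retries a failed request and waits exponentially longer between attempts (2, 4, 8, ... seconds). The sketch below isolates that retry pattern so it can be tried without an API key. The names `with_retries` and `flaky` and the injectable `sleep` parameter are hypothetical, introduced only for illustration; unlike the function above, this variant does not wait after the final failed attempt.

```python
import time


def with_retries(func, max_retries=5, sleep=time.sleep):
    """Call func(); on failure wait 2**attempt seconds, then try again."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            sleep(2 ** attempt)


calls = {"n": 0}
waits = []


def flaky():
    """Fails twice, then succeeds (stands in for an API call)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Record the waits in a list instead of actually sleeping.
result = with_retries(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demonstration fast; in real use the default `time.sleep` would apply.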

2.3 Context retrieval function: 04_search.py

Compare the question with the stored embeddings to extract the context.

import ast

import numpy as np
from scipy import spatial

def distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine"):
    """Return the distance from the query embedding to each stored embedding."""
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    return [distance_metrics[distance_metric](query_embedding, emb) for emb in embeddings]

def create_context(question, df, max_len=1800):
    # Embed the question with the same model used for the documents
    # (client and embedding_model are defined in section 2.2)
    q_embeddings = client.embeddings.create(
        input=[question], model=embedding_model
    ).data[0].embedding

    # The CSV stores each embedding as a string; parse it with ast.literal_eval rather than eval
    df['distances'] = distances_from_embeddings(
        q_embeddings, df['embeddings'].apply(ast.literal_eval).apply(np.array).values
    )
    
    returns, cur_len = [], 0
    for _, row in df.sort_values('distances', ascending=True).iterrows():
        cur_len += row['n_tokens'] + 4
        if cur_len > max_len: break
        returns.append(row["text"])
    return "\n\n###\n\n".join(returns)
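
create_context ranks rows by cosine distance to the question and then adds texts, nearest first, until a token budget is reached. The sketch below reproduces both steps on toy data using only the standard library. The vectors, token counts, and the names `cosine_distance` and `pick_context` are invented for illustration; the 4-token overhead per entry mirrors the code above.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pick_context(query_vec, rows, max_len):
    """rows: list of (text, n_tokens, vector). Nearest first, stop at the budget."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query_vec, r[2]))
    chosen, cur_len = [], 0
    for text, n_tokens, _ in ranked:
        cur_len += n_tokens + 4  # overhead for the separator between entries
        if cur_len > max_len:
            break
        chosen.append(text)
    return chosen


# Toy 2-dimensional "embeddings"; real ones have many more dimensions.
rows = [
    ("about cats", 10, [1.0, 0.0]),
    ("about dogs", 10, [0.0, 1.0]),
    ("about pets", 10, [1.0, 1.0]),
]
print(pick_context([1.0, 0.1], rows, max_len=30))
```

With a budget of 30 tokens, the two nearest texts fit (14 tokens each including overhead) and the third is dropped.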

2.4 Answer generation function: rag_search.py

The LLM answers using the retrieved context. (Because this is a general-purpose chatbot, the hotel-specific context has been removed!)

def answer_question(question, conversation_history, df):
    context = create_context(question, df)
    prompt = f"""You are a friendly and smart AI assistant.
Answer the question using the given context.
If the answer cannot be found in the context, reply 'I am not sure.'

Context:
{context}

Question: {question}
Answer:"""
    
    conversation_history.append({"role":"user","content":prompt})
    
    response = client.chat.completions.create(
        model="gpt-4o-mini", 
        messages=conversation_history,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
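
answer_question sends conversation_history as `messages`, a list of role/content dictionaries, and every user entry it appends carries the full retrieved context, so the history grows quickly over a long chat. The sketch below shows one possible alternative, not the approach used above: the history keeps only plain turns, and the context is attached to the newest question alone. `build_messages`, `SYSTEM_PROMPT`, and the sample strings are hypothetical names and values chosen for illustration.

```python
SYSTEM_PROMPT = "You are a friendly and smart AI assistant."  # assumed wording


def build_messages(history, context, question):
    """Return a new messages list without mutating history.

    history holds plain {"role", "content"} turns from earlier exchanges;
    the retrieved context is attached to the newest question only.
    """
    user_content = f"Context:\n{context}\n\nQuestion: {question}"
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + list(history)
        + [{"role": "user", "content": user_content}]
    )


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
messages = build_messages(history, "Shop opens at 9 am.", "When do you open?")
for m in messages:
    print(m["role"], "|", m["content"][:40])
```

The returned list could be passed as `messages` to `client.chat.completions.create`, after which the plain question and the answer would be appended to the history.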

2.5 Running the chatbot: 05_rag_chatbot.py

import pandas as pd

df = pd.read_csv("embeddings.csv")
conversation_history = []

print("My own chatbot is starting! (type exit to quit)")

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    answer = answer_question(user_input, conversation_history, df)
    print("Bot:", answer)
    conversation_history.append({"role":"assistant","content":answer})

🚀 How to run (Google Colab)

  1. Open a new Colab notebook
  2. Upload your own data (my_data.txt)
  3. Run the code blocks above in order
  4. Chat freely in the final chatbot cell
