Uncategorized 201 - naver.HOW data lake by AI

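1. Project Overview

1.1 Project Goal

Preprocess the data files, convert them into embedding vectors,
and store them in a **vector store** (in this tutorial, a CSV file plays that role) so that Retrieval Augmented Generation (RAG)
can be used to build a personal chatbot capable of general conversation.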

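1.2 Procedure

✅ Prepare and process the training data
✅ Embed the training data
✅ Store the embeddings
✅ On each question, retrieve the relevant context → generate the LLM answer
✅ Run the chatbot interaction loop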
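
The steps above can be sketched end to end without any API calls. This is only an illustration of the flow: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and the two documents are made-up sample data.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1-3: prepare the data, embed it, and store the embeddings
documents = [
    "The library opens at nine in the morning",
    "Lunch is served in the cafeteria at noon",
]
store = [(doc, toy_embed(doc)) for doc in documents]

def retrieve(question, store):
    """Step 4a: return the stored text most similar to the question."""
    q = toy_embed(question)
    return max(store, key=lambda item: cosine_similarity(q, item[1]))[0]

def build_prompt(question, context):
    """Step 4b: combine the retrieved context and the question for the LLM."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "When does the library open"
context = retrieve(question, store)
prompt = build_prompt(question, context)
print(prompt)
```

In the real pipeline below, the word counts are replaced by OpenAI embeddings and the prompt is sent to a chat model.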

2. Implementation

2.1 Processing the training data : `01_chatbot_dataget.py`

Load the sample data (my_data.txt) and organize it into a CSV file.

import pandas as pd
import re

def remove_newlines(text):
    # Replace newlines with spaces, then collapse repeated spaces
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r' +', ' ', text)
    return text

def text_to_df(data_file):
    # Sections are separated by blank lines:
    # the first line of a section is its title, the rest is the body text
    texts = []
    with open(data_file, 'r', encoding="utf-8") as file:
        text = file.read()
    for section in text.split('\n\n'):
        section = section.strip()
        if not section:
            continue  # skip empty sections, e.g. extra blank lines at the end of the file
        lines = section.split('\n')
        fname = lines[0]
        content = ' '.join(lines[1:])
        texts.append([fname, content])
    df = pd.DataFrame(texts, columns=['title', 'text'])
    df['text'] = df['text'].apply(remove_newlines)
    return df

# Example run
df = text_to_df('my_data.txt')   # ← put your own data file here
df.to_csv('processed.csv', index=False, encoding='utf-8')
df.head()
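
The input layout that `text_to_df` expects can be inferred from its parsing logic: sections separated by a blank line, with the first line of each section used as the title. The sketch below shows that layout on a made-up sample, using a pandas-free helper (`parse_sections` is a hypothetical name) that mirrors the same splitting rules.

```python
# Hypothetical sample in the layout that text_to_df appears to expect
sample = """Opening hours
The library opens at nine.
It closes at six.

Cafeteria
Lunch is served at noon.
"""

def parse_sections(text):
    """Split text into (title, body) pairs, one per blank-line-separated section."""
    rows = []
    for section in text.split("\n\n"):
        lines = [line.strip() for line in section.split("\n") if line.strip()]
        if not lines:
            continue  # ignore empty sections
        rows.append((lines[0], " ".join(lines[1:])))
    return rows

rows = parse_sections(sample)
for title, body in rows:
    print(f"{title!r}: {body!r}")
```

A section that contains only a title produces an empty body, so such rows may need to be filtered out before embedding.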

2.2 Embedding the data : `02_chatbot_embedding.py`

Use an OpenAI embedding model to convert the text into vectors.

!pip install openai tiktoken python-dotenv pandas

import os, time
import pandas as pd
import tiktoken
from openai import OpenAI
from dotenv import load_dotenv

# Load the API key from a .env file or the environment
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 1500  # intended per-text token limit; defined here but not enforced below

df = pd.read_csv("processed.csv")
df.columns = ['title','text']
# Rows with an empty body are read back from CSV as NaN; drop them before tokenizing
df = df.dropna(subset=['text']).reset_index(drop=True)

# Count tokens per row (used later to budget the context length)
tokenizer = tiktoken.get_encoding(embedding_encoding)
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Embedding function with exponential backoff between retries
def get_embedding(text, model=embedding_model, max_retries=5):
    text = text.replace("\n", " ")
    retries = 0
    while retries < max_retries:
        try:
            response = client.embeddings.create(input=[text], model=model)
            return response.data[0].embedding
        except Exception as e:
            retries += 1
            print(f"Error: {e}. Retrying ({retries}/{max_retries})...")
            time.sleep(2**retries)
    raise RuntimeError(f"Failed to embed text after {max_retries} retries: {text[:50]}")

df["embeddings"] = df.text.apply(get_embedding)
df.to_csv("embeddings.csv", index=False)
df.head()
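
The script defines `max_tokens` but never applies it, so a very long section is embedded as a single text. One possible way to enforce the limit is to split long texts into smaller chunks first. The sketch below is an assumption, not part of the original scripts: `split_into_chunks` is a hypothetical helper, and its default token counter simply counts whitespace-separated words so that the example runs without tiktoken.

```python
def split_into_chunks(text, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept as its own oversized chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(". ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(". ".join(current))
    return chunks

chunks = split_into_chunks(
    "one two three. four five. six seven eight nine", max_tokens=5
)
print(chunks)
```

With the tokenizer from this section, the counter could be passed as `count_tokens=lambda s: len(tokenizer.encode(s))`, and each chunk would then become its own row before calling `get_embedding`.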

2.3 Context retrieval function : `04_search.py`

Compare the question against the stored embeddings and extract the most relevant context.

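import ast
import numpy as np
from scipy import spatial

# Note: `client` and `embedding_model` are the objects defined in section 2.2

def distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine"):
    # Distance from the query to every stored embedding (smaller means more similar)
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    return [distance_metrics[distance_metric](query_embedding, emb) for emb in embeddings]

def create_context(question, df, max_len=1800):
    # Embed the question with the same model used for the documents
    q_embeddings = client.embeddings.create(
        input=[question], model=embedding_model
    ).data[0].embedding

    # Embeddings were saved to CSV as strings; parse them back into arrays.
    # ast.literal_eval replaces eval so that code embedded in the CSV cannot run.
    df['distances'] = distances_from_embeddings(
        q_embeddings, df['embeddings'].apply(ast.literal_eval).apply(np.array).values
    )

    # Add the closest texts until the token budget (max_len) is used up
    returns, cur_len = [], 0
    for _, row in df.sort_values('distances', ascending=True).iterrows():
        cur_len += row['n_tokens'] + 4
        if cur_len > max_len:
            break
        returns.append(row["text"])
    return "\n\n###\n\n".join(returns)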
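
To make the retrieval logic concrete, here is a small worked sketch using only the standard library. It assumes tiny two-dimensional vectors and made-up token counts in place of real embeddings; `cosine_distance` and `select_context` are hypothetical helpers that mirror the ranking and the `n_tokens + 4` budget rule of `create_context`.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; smaller values mean more similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def select_context(rows, query, max_len):
    """rows: list of (text, embedding, n_tokens). Returns the texts that fit the budget."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query, r[1]))
    picked, cur_len = [], 0
    for text, _, n_tokens in ranked:
        cur_len += n_tokens + 4  # same per-text overhead as create_context
        if cur_len > max_len:
            break
        picked.append(text)
    return picked

# Made-up rows: (text, toy embedding, token count)
rows = [
    ("about cats", [1.0, 0.0], 10),
    ("about dogs", [0.0, 1.0], 10),
    ("cats and dogs", [1.0, 1.0], 10),
]
picked = select_context(rows, query=[1.0, 0.1], max_len=30)
print(picked)  # ['about cats', 'cats and dogs']
```

The query points almost entirely along the first axis, so "about cats" ranks first and "cats and dogs" second; the third text would push the running total to 42 tokens, past the budget of 30, so it is left out.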

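2.4 Answer generation function : `rag_search.py`

The LLM answers using the retrieved context. (This is a general-purpose chatbot, so the hotel-specific context has been removed.)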

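def answer_question(question, conversation_history, df):
    # Retrieve the context that is closest to the question
    context = create_context(question, df)
    prompt = f"""You are a kind and smart AI assistant.
Answer the question using the given context.
If the answer cannot be found in the context, reply 'I don't know'.

Context:
{context}

Question: {question}
Answer:"""

    # The full prompt (including the context) is stored in the history,
    # so the history grows with every turn
    conversation_history.append({"role":"user","content":prompt})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_history,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()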
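
Because `answer_question` stores the whole prompt, retrieved context included, in `conversation_history`, every later request resends all earlier contexts. One possible alternative, sketched here as an assumption rather than as the author's design, keeps only the raw question and answer in the history and attaches the context to the current turn alone. `build_messages` and `record_turn` are hypothetical helper names; the message dictionaries use the same `role`/`content` shape as the code above.

```python
SYSTEM_PROMPT = (
    "You are a kind and smart AI assistant. "
    "Answer the question using the given context. "
    "If the answer cannot be found in the context, reply 'I don't know'."
)

def build_messages(history, context, question):
    """Build the message list for one request without mutating the history."""
    system = {"role": "system", "content": SYSTEM_PROMPT}
    current = {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}",
    }
    return [system, *history, current]

def record_turn(history, question, answer):
    """Store only the raw question and answer, not the retrieved context."""
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": answer})

history = []
first = build_messages(history, "context A", "What is RAG?")
record_turn(history, "What is RAG?", "Retrieval Augmented Generation.")
second = build_messages(history, "context B", "How does it work?")
print(len(first), len(second))  # 2 4
```

The list returned by `build_messages` would be passed as `messages` to `client.chat.completions.create`, and `record_turn` would be called once the answer comes back.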

2.5 Running the chatbot : `05_rag_chatbot.py`

import pandas as pd

# Load the embeddings saved in section 2.2
df = pd.read_csv("embeddings.csv")
conversation_history = []

print("Your own chatbot is starting! (type exit to quit)")

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "exit":
        break
    answer = answer_question(user_input, conversation_history, df)
    print("Bot:", answer)
    # Keep the answer so that later turns can refer back to it
    conversation_history.append({"role":"assistant","content":answer})
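
In a long session the history list keeps growing, and the whole list is sent with every request. A minimal sketch of one way to bound it is shown below; `trim_history` and the default limit of 10 messages are assumptions for illustration, not values from the original scripts.

```python
def trim_history(history, max_messages=10):
    """Keep only the most recent max_messages entries, modifying the list in place."""
    excess = len(history) - max_messages
    if excess > 0:
        del history[:excess]
    return history

# Six dummy messages, trimmed down to the latest four
history = [{"role": "user", "content": str(i)} for i in range(6)]
trim_history(history, max_messages=4)
print([m["content"] for m in history])  # ['2', '3', '4', '5']
```

Calling this on `conversation_history` at the end of each loop iteration would keep the request size roughly constant, at the cost of the bot forgetting older turns.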

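🚀 How to Run (Google Colab)

Open a new Colab notebook
Upload your own data (my_data.txt) and make sure OPENAI_API_KEY is set (for example in a .env file)
Run the code blocks above in order
Chat freely in the final chatbot cell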
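
The setup steps can be sketched as a small Colab cell. This is a hedged example: `ensure_api_key` and `upload_in_colab` are hypothetical helper names, while `google.colab.files.upload()` is the Colab upload dialog and is only available inside a Colab runtime.

```python
import os
from getpass import getpass

def ensure_api_key(env=os.environ, ask=getpass):
    """Prompt for the OpenAI API key only if it is not already set."""
    if not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = ask("OpenAI API key: ")
    return env["OPENAI_API_KEY"]

def upload_in_colab():
    """Open the Colab upload dialog; return None when not running in Colab."""
    try:
        from google.colab import files
    except ImportError:
        return None
    return files.upload()  # dict mapping each uploaded file name to its bytes

# In a Colab cell, one would call:
#   ensure_api_key()
#   upload_in_colab()   # choose my_data.txt in the dialog
```

Since the scripts read the key with `os.getenv("OPENAI_API_KEY")`, setting it in the environment this way works even without a .env file.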
