Playwright and httpx

Playwright – a next-generation web automation tool

Key features

Login and cookie handling example

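import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()

        # Log in
        await page.goto('https://example.com/login')
        await page.fill('#username', 'user')
        await page.fill('#password', 'pass')
        await page.click('#submit')

        # The browser context stores cookies automatically; read them if needed
        cookies = await context.cookies()

        # The session persists, so protected pages are reachable while logged in
        await page.goto('https://example.com/protected')

        await browser.close()


# "async with" is only valid inside a coroutine, so run everything via asyncio
asyncio.run(main())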
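
One way to combine the two tools is to log in with Playwright and then hand the session cookies to a lighter HTTP client. The sketch below is only an illustration under assumptions: the helper name cookies_for_domain and the sample cookie values are hypothetical, and the input mimics the list of dicts (with name, value, domain and path keys) that context.cookies() returns. The resulting dict could then be passed as cookies= to an httpx client.

```python
from typing import Dict, Iterable, Mapping


def cookies_for_domain(playwright_cookies: Iterable[Mapping[str, str]],
                       domain: str) -> Dict[str, str]:
    """Return name/value pairs for cookies whose domain matches `domain`.

    Hypothetical helper: a cookie matches when its domain equals the host
    or is a parent domain of it (a leading dot is ignored).
    """
    selected: Dict[str, str] = {}
    for cookie in playwright_cookies:
        cookie_domain = cookie["domain"].lstrip(".")
        if domain == cookie_domain or domain.endswith("." + cookie_domain):
            selected[cookie["name"]] = cookie["value"]
    return selected


# Sample data shaped like the output of context.cookies() (values are made up)
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "tracker", "value": "zzz", "domain": "ads.example.net", "path": "/"},
]

print(cookies_for_domain(sample, "www.example.com"))
```

Filtering by domain keeps third-party cookies out of the follow-up requests; a real crawler would probably also want to respect path, secure and expiry attributes, which this sketch ignores.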

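httpx – a modern alternative to requests

Key features

  • Asynchronous HTTP client [github+1]
  • Automatic cookie management and session persistence within a client [github]
  • HTTP/2 support (opt-in: install the http2 extra and pass http2=True)
  • An API similar to requests [scrapfly]

Session and cookie handling example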

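import asyncio
from http.cookiejar import LWPCookieJar

import httpx

# Load previously saved cookies, if the file exists
cookiejar = LWPCookieJar(filename='cookies.dat')
try:
    # ignore_discard=True also restores session cookies (those without an expiry)
    cookiejar.load(ignore_discard=True)
except FileNotFoundError:
    pass


async def main():
    # The client stores response cookies in the jar passed in here
    async with httpx.AsyncClient(cookies=cookiejar) as client:
        # Log in
        response = await client.post('https://example.com/login',
                                     data={'user': 'name', 'pass': 'word'})
        response.raise_for_status()

        # Saving is explicit; without ignore_discard=True session cookies are skipped
        cookiejar.save(ignore_discard=True)

        # Fetch data while authenticated
        data = await client.get('https://example.com/api/data')


# "async with" is only valid inside a coroutine, so run everything via asyncio
asyncio.run(main())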
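
A detail that is easy to miss when persisting cookies with LWPCookieJar: by default, save() and load() skip session cookies (cookies marked to be discarded at the end of the session), which is often exactly what a login sets. The standard-library sketch below shows the difference; the cookie name, value and domain are made up for illustration, and make_session_cookie and roundtrip are hypothetical helper names.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_session_cookie(name: str, value: str, domain: str) -> Cookie:
    """Build a session cookie (no expiry, discard=True) for demonstration."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def roundtrip(ignore_discard: bool) -> list:
    """Save one session cookie to disk, reload it, and return the names found."""
    path = os.path.join(tempfile.mkdtemp(), "cookies.dat")
    jar = LWPCookieJar(filename=path)
    jar.set_cookie(make_session_cookie("sessionid", "abc123", "example.com"))
    jar.save(ignore_discard=ignore_discard)

    restored = LWPCookieJar(filename=path)
    restored.load(ignore_discard=ignore_discard)
    return [cookie.name for cookie in restored]


print(roundtrip(ignore_discard=False))  # the session cookie is not written
print(roundtrip(ignore_discard=True))   # the session cookie survives the round trip
```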

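Crawl4AI – an LLM-friendly crawler

AI-optimized crawling

Potatonet's likely technology stack

As you inferred, Potatonet probably uses something like the following:

Crawling layer

Modern asynchronous crawling stack:
- Playwright (browser automation, login handling)
- httpx (high-performance HTTP client)
- asyncio (asynchronous processing)
- Crawl4AI (AI-friendly data extraction)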
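
To make the role of asyncio in such a stack concrete, here is a minimal sketch of bounded concurrent fetching. It is not Potatonet's code: fake_fetch is a hypothetical stand-in that only sleeps, where a real crawler would call an HTTP client or a browser, and the URLs and the concurrency limit are assumptions.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request; only simulates latency."""
    await asyncio.sleep(0.01)
    return f"fetched:{url}"


async def crawl(urls, max_concurrency: int = 5):
    """Fetch all URLs while keeping at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


def run_demo():
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    for line in run_demo()[:3]:
        print(line)
```

The semaphore is the important part: without a cap, a large URL list would open every connection at once and overwhelm both the crawler and the target site.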

AI/ML layer

LLM-based analysis:
- PyTorch/Transformers (deep learning)
- LangChain (LLM pipelines)
- FastAPI (API serving)
- Vector DB (embedding storage)

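Given the scale of collecting 50 million URLs per day (about 580 URLs per second on average), it is very likely that they built a distributed, parallel processing architecture on a combination of Playwright + httpx + asyncio. [cathodicpro.tistory+1]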
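
A quick back-of-the-envelope calculation shows why a single process is unlikely to be enough at this scale. Only the 50 million per day figure comes from the text; the 2-second average latency and the 200 concurrent requests per worker are assumptions chosen purely for illustration.

```python
import math

URLS_PER_DAY = 50_000_000      # figure quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60

# Assumptions for illustration only
AVG_LATENCY_S = 2.0            # assumed average time per request
PER_WORKER_CONCURRENCY = 200   # assumed in-flight requests per worker process


def required_rate(urls_per_day: int) -> float:
    """Average requests per second needed to finish within one day."""
    return urls_per_day / SECONDS_PER_DAY


def required_in_flight(rate: float, avg_latency_s: float) -> int:
    """Little's law: concurrent requests = arrival rate x time in system."""
    return math.ceil(rate * avg_latency_s)


rate = required_rate(URLS_PER_DAY)
in_flight = required_in_flight(rate, AVG_LATENCY_S)
workers = math.ceil(in_flight / PER_WORKER_CONCURRENCY)

print(f"{rate:.1f} requests/s, {in_flight} in flight, {workers} workers")
```

These are averages; real traffic would need headroom for retries, slow sites and browser-rendered pages, which cost far more than plain HTTP requests.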

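In particular, given the special nature of deep web detection, it is presumably a sophisticated system that also includes Tor network integration and proxy rotation. [jonghoonpark]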
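
Proxy rotation can be as simple as cycling through a pool in round-robin order. The sketch below is an assumption-laden illustration, not a description of the actual system: ProxyRotator and assign_proxies are hypothetical names, the two HTTP proxy addresses are placeholders, and the SOCKS entry uses the default port of a locally running Tor client.

```python
from itertools import cycle
from typing import List, Sequence, Tuple

# Placeholder pool; 9050 is the default SOCKS port of a local Tor client
PROXIES = [
    "socks5://127.0.0.1:9050",
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies: Sequence[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._pool)


def assign_proxies(urls: Sequence[str], rotator: ProxyRotator) -> List[Tuple[str, str]]:
    """Pair each URL with the next proxy from the rotator."""
    return [(url, rotator.next_proxy()) for url in urls]


rotator = ProxyRotator(PROXIES)
pairs = assign_proxies([f"https://example.com/{i}" for i in range(4)], rotator)
for url, proxy in pairs:
    print(url, "->", proxy)
```

A production system would likely also track proxy health and back off from failing or blocked proxies rather than rotating blindly.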

  1. https://cathodicpro.tistory.com/entry/%EC%87%BC%ED%95%91%EB%AA%B0-%ED%81%AC%EB%A1%A4%EB%A7%81-%EA%B0%80%EC%9D%B4%EB%93%9C-Playwright%EC%99%80-PyQt%EB%A5%BC-%ED%99%9C%EC%9A%A9%ED%95%9C-%EC%9B%B9-%EC%8A%A4%ED%81%AC%EB%9E%98%ED%95%91
  2. https://jonghoonpark.com/2023/07/24/dcinside-crawling-using-playwright-python
  3. https://blog.hashscraper.com/playwright-web-browser-automation/
  4. https://velog.io/@imkkuk/Selenium-Playwright-%EC%A0%84%ED%99%98%EA%B8%B0-%EC%86%8D%EB%8F%84%EC%99%80-%EC%95%88%EC%A0%95%EC%84%B1%EC%9D%84-%EC%9E%A1%EB%8B%A4
  5. https://minding-deep-learning.tistory.com/251
  6. https://roundproxies.com/blog/playwright-vs-selenium/
  7. https://www.browserstack.com/guide/playwright-vs-selenium
  8. https://github.com/encode/httpx/discussions/2229
  9. https://scrapfly.io/blog/posts/web-scraping-with-python-httpx
  10. https://aiandgamedev.com/ai/ollama-7-crawl4ai-llm-crawing/
  11. https://dev.to/ali_dz/crawl4ai-the-ultimate-guide-to-ai-ready-web-crawling-2620
  12. https://brightdata.com/blog/web-data/crawl4ai-and-deepseek-web-scraping
  13. https://discuss.pytorch.kr/t/crawl4ai-llm-ai-crawler/5282
  14. https://dodonam.tistory.com/417
  15. https://imgzon.tistory.com/150
  16. https://beomi.github.io/2017/01/20/HowToMakeWebCrawler-With-Login/
  17. https://bravehangni-study.tistory.com/31
  18. https://thkim610.tistory.com/123
  19. https://developshrimp.com/entry/Spring-%EB%A1%9C%EA%B7%B8%EC%9D%B8-%EC%B2%98%EB%A6%AC-12-%EC%BF%A0%ED%82%A4Cookie%EC%99%80-%EC%84%B8%EC%85%98Session
  20. https://itstory1592.tistory.com/62
  21. https://thunderbit.com/ko/blog/python-web-scraping
  22. https://fleetwood.tistory.com/84
  23. https://velog.io/@sua0714/%ED%95%99%EC%8A%B5-%EC%A0%95%EB%A6%AC-%EC%BF%A0%ED%82%A4%EC%99%80-%EC%84%B8%EC%85%98-2025-03-20
  24. https://eliclosetshop.tistory.com/69
  25. https://velog.io/@rlfrkdms1/%EB%A1%9C%EA%B7%B8%EC%9D%B8-%EC%BF%A0%ED%82%A4-%EC%84%B8%EC%85%98
  26. https://tofof.tistory.com/25
  27. https://catsbi.oopy.io/0c27061c-204c-4fbf-acfd-418bdc855fd8
  28. https://apidog.com/kr/blog/python-requests-cookies-2/
  29. https://scrapfly.io/blog/answers/save-and-load-cookies-in-requests-python
  30. https://apidog.com/blog/python-requests-cookies/
  31. https://www.browsercat.com/post/playwright-vs-selenium-deep-comparison
  32. https://stackoverflow.com/questions/31554771/how-can-i-use-cookies-in-python-requests
  33. https://saucelabs.com/resources/blog/playwright-vs-selenium-guide
  34. https://github.com/unclecode/crawl4ai
  35. https://github.com/encode/httpx/discussions/1481
  36. https://www.scrapingbee.com/blog/crawl4ai/
  37. https://abstracta.us/blog/functional-software-testing/playwright-vs-selenium/
  38. https://www.youtube.com/watch?v=od6AaKhKYmg
  39. https://www.reddit.com/r/dotnet/comments/1im7oly/selenium_vs_playwright/
