[Project] Korean Conversation Classification and Summarization - Data Collection

gangee 2024. 5. 2. 21:19

Data Collection 1 - AI Hub

Collected Korean conversation data from AI Hub
- Free-form conversations between two people on a variety of topics (128 GB)

Data Collection 2 - Naver Knowledge iN (지식인)

Collected questions and answers for specific keywords by crawling
- 66,627 records, file size 64.21 MB
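
Figures like the record count and file size above can be recomputed from the crawl output. The sketch below assumes the per-keyword CSV files sit in a `result` directory, as the crawler in this post writes them, and that a "record" means one data row (header excluded). The function name `summarize_crawl_output` is hypothetical and is not part of the original project.

```python
import csv
from pathlib import Path


def summarize_crawl_output(directory="result"):
    """Return (data_row_count, total_size_in_mb) for all CSV files in `directory`."""
    total_rows = 0
    total_bytes = 0
    for path in sorted(Path(directory).glob("*.csv")):
        total_bytes += path.stat().st_size
        # utf-8-sig transparently drops the BOM written by the crawler's encoding.
        with path.open(newline="", encoding="utf-8-sig") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            # csv.reader handles answers that contain line breaks,
            # so this counts records rather than physical lines.
            total_rows += sum(1 for _ in reader)
    return total_rows, total_bytes / (1024 * 1024)
```

Counting through `csv.reader` rather than counting lines matters here because answer text often spans several lines inside one quoted field.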

Selenium, BeautifulSoup

import csv
import os
import time
from random import uniform

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

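# Crawl Naver Knowledge iN search results for one keyword.
# Relies on two module-level names defined elsewhere: `driver` (the Selenium
# WebDriver) and `sorted_kind` (the value of the `sort` query parameter).
def crawl_naver_kin(category, keyword, start_date, end_date):

    max_pages = 80
    period_txt = f"&period={start_date}.%7c{end_date}."

    # Create the output directory if it does not exist yet
    directory = 'result'
    os.makedirs(directory, exist_ok=True)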

    filename = f'{directory}/{keyword.replace(" ", "_")}_crawling_result.csv'

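    # Write the CSV header (category, keyword, title, question, answer).
    # newline='' is what the csv module expects; any other value corrupts line endings.
    with open(filename, mode='w', newline='', encoding='utf-8-sig') as file:
        writer = csv.writer(file)
        writer.writerow(['카테고리', '키워드', '제목', '질문', '답변'])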

        # Iterate over at most max_pages search-result pages
        for page_index in range(1, max_pages + 1):
            time.sleep(uniform(0.01, 1.0))
            driver.get(f'https://kin.naver.com/search/list.nhn?sort={sorted_kind}&query={keyword.replace(" ", "%20")}'
                       f'{period_txt}&section=kin&page={str(page_index)}')
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Extract the question links; stop when a page has none
            tags = soup.find_all('a', class_='_nclicks:kin.txt _searchListTitleAnchor')
            if not tags:
                break

            # Visit each question URL and extract the question and its answers
            for tag in tags:
                url = tag.get('href')
                driver.get(url)
                title = driver.find_element(By.CLASS_NAME, 'title').text
                question_elems = driver.find_elements(By.CLASS_NAME, 'c-heading__content')
                question_txt = question_elems[0].text if question_elems else ""
                answer_list = driver.find_elements(By.CLASS_NAME, "se-main-container")

                # Write one CSV row per answer, inside the per-question loop so that
                # every question is saved; only the first row carries the title and question
                for n, answer in enumerate(answer_list):
                    texts = answer.find_elements(By.TAG_NAME, 'span')
                    answer_txt = ''.join(text.text for text in texts)
                    if n == 0:
                        writer.writerow([category, keyword, title, question_txt, answer_txt])
                    else:
                        writer.writerow([category, keyword, "", "", answer_txt])

driver.quit()
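
Because the crawler writes the title and question only on the first answer row of each question, later answer rows have to be re-attached to their question before the data can be used. A minimal sketch of that step follows. It assumes the five-column layout written above and that a non-empty title marks the start of a new question; `group_answers` is a hypothetical helper, not part of the original project.

```python
import csv


def group_answers(csv_path):
    """Return one dict per question, with all of its answers collected in a list."""
    questions = []
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for category, keyword, title, question, answer in reader:
            if title or not questions:
                # A row with a title starts a new question.
                questions.append({
                    "category": category,
                    "keyword": keyword,
                    "title": title,
                    "question": question,
                    "answers": [answer],
                })
            else:
                # Rows with an empty title are extra answers to the previous question.
                questions[-1]["answers"].append(answer)
    return questions
```

For example, `group_answers('result/some_keyword_crawling_result.csv')` would return a list in which each item holds one question together with every answer collected for it (the file name here is illustrative).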

Chrome Driver & Proxy

Route requests through a local SOCKS5 proxy to avoid IP bans

chrome_options = Options()
chrome_options.add_argument('--proxy-server=socks5://127.0.0.1:9050')

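# Pass the options to the driver; otherwise the proxy setting is never applied
driver = webdriver.Chrome(options=chrome_options)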
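
If nothing is listening on the proxy address, Chrome fails to load any page, so it can help to check the port before starting a long crawl. The sketch below only tests that a TCP connection to `127.0.0.1:9050` succeeds. It does not verify that the service actually speaks SOCKS5. `proxy_is_listening` is a hypothetical helper name, and the defaults simply mirror the address used in this post.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A crawl script might call `proxy_is_listening()` once at startup and exit early with a clear message when it returns `False`.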
