
A Joyful AI Research Journey🌳😊

🌳AI & Quantum Computing Bootcamp 2024✨/AI Lecture Revision

*[3] 241101 KoNLPy, Selenium [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 3]

yjyuwisely 2024. 11. 1. 13:07

241101 Fri 3rd class

I wrote down the things worth remembering from what I learned today.


In an interview, you explain things out loud.
They may hire you based on your potential.

Projects alone are not enough (X)

Theory matters too.

If you know the principles, you can approach a problem easily.

When studying, set aside time to think: work out how you would solve it, search, and reason it through.

Try writing algorithms yourself; your skill grows.

Dependence on ChatGPT can develop.

To catch what it gets wrong and to control it, you need knowledge.

Study a lot -> you know the solutions. (ex. nltk)

If you only chase what is popular, you end up not studying what matters (X)

Engineers in their 2nd or 3rd year -> some do not know the basics. -> They do not last long.

LLM transfer learning


The final project is done in teams. Hands-on practice.

To run a service, you need basic knowledge.

It may look like a new project, but the principles are similar.

If you know the basics / know one thing properly, you can apply it to build things.

You need to know tokens, stopwords, and so on.

HTML -> w3schools.com is good

Stock investing -> web crawling -> helpful
Automatically organizing Excel tables


https://ldjwj.github.io/CLASS_PY_LIB_START/PYLIB_03_01_konlpy_nltk_v01_2411.html

 

unit01_01_konlpy_nltk_v01_2411

1-3 Installing and introducing konlpy. Korean natural language processing library: konlpy is a Python library for processing Korean text; it provides morphological analysis, part-of-speech tagging, syntactic parsing, and other features. Various forms

ldjwj.github.io


!pip install konlpy
import nltk
import matplotlib.pyplot as plt
import numpy as np
import konlpy


from konlpy.tag import Kkma

4-4 Practicing with konlpy.tag.Okt

# If you need to install the KoNLPy library, uncomment the next line and run it
# !pip install konlpy

from konlpy.tag import Okt

# 1. Create the analyzer object
okt = Okt()

# Example text ("Korean text data analysis is an important part of natural language processing.")
text = "한국어 텍스트 데이터 분석은 자연어 처리에서 중요한 부분입니다."

# 2. Tokenization (morphological analysis)
tokens = okt.morphs(text)
print("Tokenization result:", tokens)

# 3. Part-of-speech tagging
pos_tags = okt.pos(text)
print("POS tagging result:", pos_tags)

# 4. Stopword removal
# Stopword list (extend as needed)
stopwords = ["은", "는", "이", "가", "에서", "부분", "입니다"]

# Build the word list with stopwords removed
filtered_tokens = [word for word in tokens if word not in stopwords]
print("Stopword removal result:", filtered_tokens)

[Level-up problem 1] Load a txt file and analyze it with Kkma.

[Level-up problem 2] Analyze it with Hannanum.
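
One possible way to approach both level-up problems is sketched below. It is not the course's solution: the helper name `analyze_text_file` and the file name `sample.txt` are made up for illustration, and the function simply accepts any object that has a `pos(text)` method, which both `Kkma` and `Hannanum` provide.

```python
from pathlib import Path


def analyze_text_file(path, tagger, encoding="utf-8"):
    """Read a text file and POS-tag each non-empty line with `tagger`.

    `tagger` is any object with a pos(text) method, for example
    konlpy.tag.Kkma() or konlpy.tag.Hannanum().
    """
    text = Path(path).read_text(encoding=encoding)
    return [tagger.pos(line) for line in text.splitlines() if line.strip()]


# Usage with KoNLPy (needs konlpy and a Java runtime; "sample.txt" is a placeholder):
# from konlpy.tag import Kkma, Hannanum
# print(analyze_text_file("sample.txt", Kkma()))      # level-up problem 1
# print(analyze_text_file("sample.txt", Hannanum()))  # level-up problem 2
```

Passing the analyzer in as an argument means the same function serves both problems, and swapping in Okt later needs no code change.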


nltk: download a model that separates text into tokens, then use it

!pip install nltk

https://platform.openai.com/tokenizer

nltk.download('punkt')

Corpus: a collection of texts about a specific language or topic


STT (speech-to-text), voice

Adjective frequency analysis -> judging a person's tendencies
Speeches, word usage, richness of vocabulary, etc. -> analysis
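
The adjective-frequency idea can be sketched with `collections.Counter`. The tagged pairs below are hypothetical sample data written in the `(word, tag)` shape that `okt.pos(text)` returns, so the snippet runs without KoNLPy; in practice the list would come from the analyzer.

```python
from collections import Counter

# Hypothetical (word, tag) pairs in the shape returned by okt.pos(text).
tagged = [
    ("날씨", "Noun"), ("가", "Josa"), ("좋은", "Adjective"),   # weather / (particle) / good
    ("하루", "Noun"), ("좋은", "Adjective"), ("사람", "Noun"),  # day / good / person
    ("밝은", "Adjective"), ("미소", "Noun"),                    # bright / smile
]


def adjective_frequencies(pos_tags):
    """Count how often each word tagged 'Adjective' appears."""
    return Counter(word for word, tag in pos_tags if tag == "Adjective")


freq = adjective_frequencies(tagged)
print(freq.most_common(2))
```

Changing the tag in the filter (for example to "Noun") gives the same kind of frequency table for other parts of speech.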


Connecting to Colab from VSCode

https://dogfoot1.tistory.com/92

 

[VSCode] Using ipynb

Install the Jupyter notebook extension pack; run pip install jupyter in cmd; create an ipynb file; choose Select Kernel; choose Python Environments and click the Python environment you want. After that, ipynb files can be run in VSCode.

dogfoot1.tistory.com

 

pip install selenium
pip install webdriver-manager
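
The snippets further down use a `driver` object without showing how it was created. A minimal sketch of one common setup with the two packages above follows; the function name `create_chrome_driver` is made up here, and the class notebook may well have created the driver differently.

```python
def create_chrome_driver():
    """Start Chrome using a driver binary downloaded by webdriver-manager."""
    # Imported inside the function so this file still loads when the
    # packages are not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service)


# Usage (opens a real browser window):
# driver = create_chrome_driver()
# driver.get("https://news.naver.com/")
# driver.quit()
```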

Finding elements

  • Selenium provides several ways to find elements on a web page, such as the following
  • Reference URL : https://selenium-python.readthedocs.io/locating-elements.html

Accessing a single DOM (object) - element

from selenium.webdriver.common.by import By  

find_element(By.ID, "element_id"): find by the element's unique ID
find_element(By.NAME, "element_name"): find by the element's name attribute
find_element(By.XPATH, "//div[@class='element']"): find by an XPath expression
find_element(By.LINK_TEXT, "Link Text"): find by a link's visible text
find_element(By.PARTIAL_LINK_TEXT, "Partial Link"): find by part of a link's text
find_element(By.TAG_NAME, "div"): find by HTML tag name
find_element(By.CLASS_NAME, "element_class"): find by CSS class name
find_element(By.CSS_SELECTOR, "div.element"): find by CSS selector
 

Accessing multiple DOMs (objects) - elements

from selenium.webdriver.common.by import By  

find_elements(By.NAME, "name"): finds multiple elements by their name attribute.
find_elements(By.XPATH, "xpath"): finds multiple elements using an XPath expression.
find_elements(By.LINK_TEXT, "text"): finds multiple elements by a link's visible text.
find_elements(By.PARTIAL_LINK_TEXT, "text"): finds multiple elements by part of a link's text.
find_elements(By.TAG_NAME, "tag"): finds multiple elements by HTML tag name.
find_elements(By.CLASS_NAME, "class"): finds multiple elements by CSS class name.
find_elements(By.CSS_SELECTOR, "css"): finds multiple elements using a CSS selector.
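
To get a feel for the single-match versus all-matches distinction without opening a browser, here is a small sketch that swaps Selenium for the standard library's `xml.etree.ElementTree`, which understands a subset of XPath. The page fragment is invented for illustration, and the analogy is loose: Selenium's `find_element` raises `NoSuchElementException` when nothing matches, whereas `find` here returns `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, written as well-formed XML for the stdlib parser.
page = ET.fromstring(
    "<html><body>"
    "<h1 id='title'>my web page</h1>"
    "<div class='element'>first</div>"
    "<div class='element'>second</div>"
    "<a href='/a'>Link Text</a>"
    "</body></html>"
)

# Comparable to find_element: only the first match.
first_div = page.find(".//div[@class='element']")
# Comparable to find_elements: every match, as a list.
all_divs = page.findall(".//div[@class='element']")
# Attribute lookup, comparable to locating by id.
heading = page.find(".//*[@id='title']")

print(first_div.text)
print([d.text for d in all_divs])
print(heading.text)
```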

https://ldjwj.github.io/CLASS_PY_LIB_START/CL03_01_selenium_basic_V11.html

 

CL03_01_selenium_basic_V11

Used to access an a tag (anchor tag) by its link text. Hello! Continue Cancel find_element(By.LINK_TEXT, '') find_element(By.PARTIAL_LINK_TEXT, '')

ldjwj.github.io

from selenium.webdriver.common.by import By
select_id = driver.find_element(By.ID, 'rank')
print(select_id)
print(select_id.text)

Result)
<selenium.webdriver.remote.webelement.WebElement (session="3f8ca0841e16ae6ab12a480d0b6d218e", element="f.2985483FCEEE8B67D80A7646DF8E012E.d.F3E6E49034E1E8C7C8D999B464268F27.e.96")>
10. Getting ranking information (web crawling)


sel_tag_h1 = driver.find_element(By.TAG_NAME, 'h1')
print(sel_tag_h1.text)
print(sel_tag_h1)

my web page
<selenium.webdriver.remote.webelement.WebElement (session="3f8ca0841e16ae6ab12a480d0b6d218e", element="f.2985483FCEEE8B67D80A7646DF8E012E.d.F3E6E49034E1E8C7C8D999B464268F27.e.76")>


[Exercise 1] Get a-tag information
 
[Level-up exercise 1] Get all a tags
 
sel_tag_a1 = driver.find_elements(By.TAG_NAME, 'a')

print(type(sel_tag_a1) )

for one in sel_tag_a1:
    print(one.text)

Result) 

<class 'list'>
01. Getting the title (title)
02. Getting text (p)
03. Getting links (a)
04. Getting image information (img)
05. Getting list information (ul,ol)
06. Obtaining information using id
07. Obtaining information using class
08. Downloading a single image
09. Downloading multiple images
10. Getting ranking information (web crawling)

from selenium import webdriver  
from selenium.webdriver.common.by import By  

url = 'https://news.naver.com/'  

# Navigate to the given URL with the web driver.
# `driver` is the WebDriver instance created earlier in the session.
driver.get(url)

Search box: //*[@id="u_hs"]/div/div/input

Magnifier (search button): //*[@id="u_hs"]/div/div/button[2]

Integrated search: //*[@id="u_hs"]/button[1]

from selenium.webdriver.common.by import By  

# Find the search icon element
# /html/body/section/header/div[1]/div/div/div[2]/div[3]/a 
search_icon = driver.find_element(By.XPATH, '/html/body/section/header/div[1]/div/div/div[2]/div[3]/a')  
print(search_icon.tag_name)  
print(search_icon.text)  
search_icon.click()  


# Find the search box element
# //*[@id="u_hs"]/div/div/input
search_input = driver.find_element(By.XPATH, '//*[@id="u_hs"]/div/div/input')  
print(search_input.tag_name)  
print(search_input.text)  

# Find the search button element
# //*[@id="u_hs"]/div/div/button[2]  
search_button = driver.find_element(By.XPATH, '//*[@id="u_hs"]/div/div/button[2]')  
print(search_button.tag_name)  
print(search_button.text) 

# Type the search term ("패션" means "fashion") and run the search
search_input.send_keys("패션")  
search_button.click()



Exercise 1 - Try fetching information
 
Fetch it as a text file.
path = '//*[@id="ct"]/div/section[1]/div[2]/ ... '
sel_xpath = driver.find_element(By.XPATH, path)
print(sel_xpath.text)

 
Exercise 2 - Get the list of news titles
import time

# Save the handle of the current tab
current_tab = driver.current_window_handle  
print(current_tab)

# Get the handles of all tabs
all_tabs = driver.window_handles  
print(all_tabs)

# Switch to the new tab
for tab in all_tabs:  
    if tab != current_tab:  
        driver.switch_to.window(tab)  
        break  

# Get the URL from the new tab
time.sleep(2)  # wait for the page to load
current_url = driver.current_url  
print("URL of the new tab:", current_url)

Result)

2985483FCEEE8B67D80A7646DF8E012E
['2985483FCEEE8B67D80A7646DF8E012E', '51ABD8E439E25B4FCECDA7B1DC8807E8', '262E1EF0F66D8B7B4E75B29EE6A9B121']
URL of the new tab: https://search.naver.com/search.naver?where=news&ie=utf8&sm=nws_hty&query=%ED%8C%A8%EC%85%98
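
Two details of that result can be checked offline with a short sketch. The helper names `first_other_handle` and `search_query` are made up for illustration; the handles and URL are copied from the output above. The first helper mirrors the loop's selection rule, and the second decodes the percent-encoded `query` parameter back into the search term.

```python
from urllib.parse import parse_qs, urlparse


def first_other_handle(current, handles):
    """Return the first handle that differs from `current`, or None."""
    return next((h for h in handles if h != current), None)


def search_query(url):
    """Return the decoded `query` parameter of a search URL, or None."""
    values = parse_qs(urlparse(url).query).get("query")
    return values[0] if values else None


handles = [
    "2985483FCEEE8B67D80A7646DF8E012E",
    "51ABD8E439E25B4FCECDA7B1DC8807E8",
    "262E1EF0F66D8B7B4E75B29EE6A9B121",
]
url = ("https://search.naver.com/search.naver"
       "?where=news&ie=utf8&sm=nws_hty&query=%ED%8C%A8%EC%85%98")

print(first_other_handle(handles[0], handles))
print(search_query(url))
```

With three tabs open, the selection rule lands on the first handle that is not the current one, which is not necessarily the tab the search just opened, so checking `driver.current_url` after switching, as the notes do, is a sensible sanity check.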
## Getting information from the search results window
# //*[@id="sp_nws1"]/div[1]/div/div[2]/a[2]
 
path = '//*[@id="sp_nws1"]/div[1]/div/div[2]/a[2]'
sel_xpath = driver.find_element(By.XPATH, path)
print(sel_xpath.text)

Result)

A flower-shaped brooch alone costs 100 million won… G-Dragon's jaw-dropping fashion draws attention
