[4] 241104 Data Preprocessing, Word Cloud, ChatGPT API [Goorm All-In-One Pass! AI Project Master

Notice

Recent Posts

Recent Comments

Links

« 2025/08 »
일	월	화	수	목	금	토
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31

Tags more

Today

In Total

관리 메뉴

A Joyful AI Research Journey🌳😊

[4] 241104 Data Preprocessing, Word Cloud, ChatGPT API [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 4] 본문

🌳AI & Quantum Computing Bootcamp✨/AI Lecture Revision

[4] 241104 Data Preprocessing, Word Cloud, ChatGPT API [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 4]

yjyuwisely 2024. 11. 4. 11:53

241104 Mon 4th class

오늘 배운 것 중 기억할 것을 정리했다.

https://rowan-sail-868.notion.site/da602ba0748c4e2cb12b443178b16507

텍스트 데이터 분석 기본 | Notion

수업 자료로 접속하기

rowan-sail-868.notion.site

https://ldjwj.github.io/CHATGPT_AI_CLASS/01_TextPre_V10.html

01_TextPre_V10

ldjwj.github.io

비전공자, 대학원X, 4년 동안 올라감, 대회 2개 1등

집중해서 공부

하나를 하더라도 제대로 한다.

논문 읽고 대회 참여, 혼자 공부, 성장함

늦은 건 X

왔다갔다 X 우루루 쫓아다니지 X

대기업 문서 많음

과정 2-3개 한다고 지식 많아지는 건 아니다

초급 과정은 비슷하다

1년 공부해도 달라진 게 없다 늘은 건 X

많이 알아야 한다

필수 라이브러리는 필수다.

열심히 한다

01. 자연어 처리(NLP)란 무엇인가?

자연어 처리(NLP)는 인간의 언어를 컴퓨터가 이해하고 처리할 수 있도록 하는 기술
NLP는 텍스트와 음성 데이터를 분석하고, 이를 통해 의미를 추출하거나 새로운 텍스트를 생성하는 다양한 방법을 포함.

03. 자연어 처리의 기본 용어를 이해

토큰화(Tokenization): 텍스트를 작은 단위로 분할.
형태소 분석(Morphological Analysis): 단어를 형태소로 분리하고 품사 태깅.
어간 추출(Stem Extraction): 단어의 기본 형태 추출.
품사 태깅(part-of-speech tagging (POS) Tagging): 단어의 품사를 식별.
명명된 개체 인식(NER): 인물, 장소 등의 명명된 개체 식별.
단어 임베딩(Word Embedding): 단어를 벡터로 변환하여 의미를 수치적으로 표현.

https://ldjwj.github.io/CLASS_PY_LIB_LEVELUP/06_DATA_ANALYSIS/01_%ED%85%8D%EC%8A%A4%ED%8A%B8%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%B6%84%EC%84%9D1_%EB%B9%88%EB%8F%84%EB%B6%84%EC%84%9D_V10.html

텍스트데이터분석1_빈도분석_V10

ldjwj.github.io

빈도 분석

LLM 책에 다 나온다.

import nltk
print(nltk.__version__)

3.8.1

!pip install nltk

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1) Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7) Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.2) Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2024.9.11) Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.6)

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# NLTK 데이터 다운로드 (처음 한 번만 실행)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.

True

# 예제 문장
sentence = "Natural language processing makes it possible for computers to understand human language."

# 토큰화 (Tokenization)
tokens = word_tokenize(sentence)
print("토큰화 결과:", tokens)

# 품사 태깅 (POS Tagging)
tagged_tokens = pos_tag(tokens)
print("품사 태깅 결과:", tagged_tokens)

토큰화 결과: ['Natural', 'language', 'processing', 'makes', 'it', 'possible', 'for', 'computers', 'to', 'understand', 'human', 'language', '.']
품사 태깅 결과: [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('makes', 'VBZ'), ('it', 'PRP'), ('possible', 'JJ'), ('for', 'IN'), ('computers', 'NNS'), ('to', 'TO'), ('understand', 'VB'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')]

re 정규 표현식

https://docs.python.org/ko/3/library/re.html

re — Regular expression operations

Source code: Lib/re/ This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings ( str) as well as 8-...

docs.python.org

불용어처리

불용어 또는 제외어는 불용 목록에 있는 단어로, 자연어 데이터 처리 전후에 중요하지 않기 때문에 필터링된다. 모든 자연어 처리 도구에서 사용되는 단일한 범용 불용어 목록은 없으며, 불용어 식별을 위해 합의된 규칙도 없으며, 실제로 모든 도구가 이러한 목록을 사용하는 것도 아니다.

import os #리눅스: 폴더 이름 directory 
from collections import Counter #빈도 개수 확인 
import re
from nltk.corpus import stopwords #영어, 불용어 모아둠 
import matplotlib.pyplot as plt

# 텍스트 파일 경로
file_paths = [
    "01_다른경쟁사와간단비교.txt",
    "02_기업리서치관련정리.txt",
    "03_생성AI분석.txt"
]

for file_path in file_paths:
  with open(file_path, 'r', encoding='utf-8') as file: # r 읽기 전용
    text = file.read()
    print("파일명 : ", file_path)
    print("파일 내용 : ", text)

# 한국어 불용어 추가
# 수동으로 정의한 한국어 불용어 리스트
korean_stopwords = {
    '의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로',
    '자', '에', '와', '한', '하다', '에서', '것', '및', '위해', '그', '되다'
}

additional_stopwords = {'강점', '약점', '경쟁사'}  # 분석에 불필요한 단어 추가
korean_stopwords.update(additional_stopwords)

# 파일별 텍스트 처리 및 단어 빈도 계산
def process_text(text):
    # 텍스트 전처리: 소문자화, 특수 문자 제거, 불용어 제거
    text = text.lower() 
    text = re.sub(r'[^\w\s]', '', text) #특수 문자 제거
    words = text.split() # 공백으로 나눠줌 
    words = [word for word in words if word not in korean_stopwords and len(word) > 1] # 불용어X, 길이가 1보다 큼 
    return words

# 빈도 분석 결과 저장
word_frequencies = []

for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        words = process_text(text)
        word_freq = Counter(words) # 빈도 분석 
        word_frequencies.append(word_freq) # 추가함 

# 파일별로 가장 자주 등장한 상위 10개 단어 출력
for i, freq in enumerate(word_frequencies):
    print(f"\n파일 {i+1}의 상위 10개 단어:")
    print(freq.most_common(10))

# 시각화: 파일별 상위 10개 단어 빈도
for i, freq in enumerate(word_frequencies):
    common_words = freq.most_common(10)
    words, counts = zip(*common_words)

    plt.figure(figsize=(10, 5))
    plt.bar(words, counts)
    plt.title(f'파일 {i+1} 상위 10개 단어 빈도')
    plt.xticks(rotation=45)
    plt.show()

파일 1의 상위 10개 단어:
[('제품', 11), ('글로벌', 10), ('시장', 7), ('기술', 5), ('반도체', 5), ('삼성전자', 4), ('스마트폰', 4), ('디스플레이', 4), ('강력한', 4), ('a사', 4)]

파일 2의 상위 10개 단어:
[('있습니다', 19), ('반도체', 11), ('글로벌', 11), ('사업', 8), ('디스플레이', 7), ('스마트폰', 7), ('삼성전자는', 6), ('다양한', 6), ('특히', 6), ('기술', 5)]

파일 3의 상위 10개 단어:
[('있습니다', 20), ('글로벌', 12), ('삼성전자는', 11), ('있으며', 11), ('반도체', 10), ('스마트폰', 6), ('시장', 6), ('기술', 6), ('디스플레이', 5), ('대한', 5)]

3-5 [실습 1] 불용어를 추가해서, 최종적으로 그래프를 확인해 주세요.

# 한국어 불용어 추가
# 수동으로 정의한 한국어 불용어 리스트
korean_stopwords = { '있습니다', '있으며', '대한',  
    '의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로',
    '자', '에', '와', '한', '하다', '에서', '것', '및', '위해', '그', '되다'
}

additional_stopwords = {'강점', '약점', '경쟁사'}  # 분석에 불필요한 단어 추가
korean_stopwords.update(additional_stopwords)

# 파일별 텍스트 처리 및 단어 빈도 계산
def process_text(text):
    # 텍스트 전처리: 소문자화, 특수 문자 제거, 불용어 제거
    text = text.lower() 
    text = re.sub(r'[^\w\s]', '', text) #특수 문자 제거
    words = text.split() # 공백으로 나눠줌 
    words = [word for word in words if word not in korean_stopwords and len(word) > 1] # 불용어X, 길이가 1보다 큼 
    return words

# 빈도 분석 결과 저장
word_frequencies = []

for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        words = process_text(text)
        word_freq = Counter(words) # 빈도 분석 
        word_frequencies.append(word_freq) # 추가함 

# 파일별로 가장 자주 등장한 상위 10개 단어 출력
for i, freq in enumerate(word_frequencies):
    print(f"\n파일 {i+1}의 상위 10개 단어:")
    print(freq.most_common(10))

# 시각화: 파일별 상위 10개 단어 빈도
for i, freq in enumerate(word_frequencies):
    common_words = freq.most_common(10)
    words, counts = zip(*common_words)

    plt.figure(figsize=(10, 5))
    plt.bar(words, counts)
    plt.title(f'파일 {i+1} 상위 10개 단어 빈도')
    plt.xticks(rotation=45)
    plt.show()

파일 1의 상위 10개 단어:
[('제품', 11), ('글로벌', 10), ('시장', 7), ('기술', 5), ('반도체', 5), ('삼성전자', 4), ('스마트폰', 4), ('디스플레이', 4), ('강력한', 4), ('a사', 4)]

파일 2의 상위 10개 단어:
[('반도체', 11), ('글로벌', 11), ('사업', 8), ('디스플레이', 7), ('스마트폰', 7), ('삼성전자는', 6), ('다양한', 6), ('특히', 6), ('기술', 5), ('네트워크', 5)]

파일 3의 상위 10개 단어:
[('글로벌', 12), ('삼성전자는', 11), ('반도체', 10), ('스마트폰', 6), ('시장', 6), ('기술', 6), ('디스플레이', 5), ('모바일', 5), ('다양한', 5), ('특히', 5)]

3-6 [레벨업 실습 1] 불용어 파일을 만들어서 이를 불러와서 최종적으로 그래프를 확인한다.

```
코드
```
로 올리기

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]
cities = ['New York', 'London', 'Paris']

for idx, i in enumerate(ages):
  print(idx, i)

0 25
1 30
2 35

person_info = list(zip(names, ages, cities))
print(person_info)

[('Alice', 25, 'New York'), ('Bob', 30, 'London'), ('Charlie', 35, 'Paris')]

https://ldjwj.github.io/CLASS_PY_LIB_START/PYLIB_03_02_alice_extreme_V11_2411.html

unit01_02_alice_extreme_V11_2411

ldjwj.github.io

set 집합 중복 제거됨

## 집합 확인
s2 = set([1,2,3,4,5,1,2])
s2

{1, 2, 3, 4, 5}

### 불용어 단어 추가
x_words = set(STOPWORDS)
x_words.add("said")
x_words

Lexical dispersion is a measure of how frequently a word appears across the parts of a corpus.

Result)

https://openai.com/index/openai-api/

https://platform.openai.com/docs/overview

https://rowan-sail-868.notion.site/ChatGPT-API-1337d480b59380ec918bfe4d6c0f6c41

ChatGPT API 기본 시작하기 | Notion

1-1 ChatGPT API란 무엇인가요?

rowan-sail-868.notion.site

5달러 충전

https://velog.io/@sy508011/%EC%8A%A4%ED%8A%B8%EB%A6%BC%EB%A6%BF-Github%EB%A1%9C-%EC%97%B0%EB%8F%99%ED%95%B4%EC%84%9C-%ED%8E%98%EC%9D%B4%EC%A7%80-Deploy%ED%95%98%EA%B8%B0

[스트림릿] Github로 연동해서 페이지 Deploy하기

Github와 연동해서 App 배포하기.

velog.io

728x90

저작자표시 비영리 동일조건 (새창열림)

'🌳AI & Quantum Computing Bootcamp✨ > AI Lecture Revision' 카테고리의 다른 글

[6] 241106 Notion, Prompt Engineering [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 6] (1)	2024.11.06
[5] 241105 Streamlit [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 5] (0)	2024.11.05
[4] 241104 Text Analysis, Frequency Analysis [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 4] (0)	2024.11.04
*[3] 241101 Konply, Selenium [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 3] (1)	2024.11.01

'🌳AI & Quantum Computing Bootcamp✨/AI Lecture Revision' Related Articles

Comments

A Joyful AI Research Journey🌳😊

[4] 241104 Data Preprocessing, Word Cloud, ChatGPT API [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 4] 본문

[4] 241104 Data Preprocessing, Word Cloud, ChatGPT API [Goorm All-In-One Pass! AI Project Master - 4th Session, Day 4]

01. 자연어 처리(NLP)란 무엇인가?

03. 자연어 처리의 기본 용어를 이해

'🌳AI & Quantum Computing Bootcamp✨ > AI Lecture Revision' 카테고리의 다른 글

티스토리툴바