
[Text Analysis] 6. Sentiment Analysis of Naver Movie Ratings with KoNLPy

xod22 2022. 3. 19. 22:58
Why Korean NLP Is Difficult

 

In English, a spacing mistake usually produces a wrong or nonexistent word. In Korean, however, mis-spacing '아버지가 방에 들어가신다' ("Father enters the room") gives '아버지 가방에 들어가신다' ("Father enters the bag"), which is still a valid sentence but has a distorted meaning.

 

Particles (조사) attached to mark the subject or object are also tricky to remove during preprocessing.

 

In '너희 집은 어디 있니?' ("Where is your house?"), it is hard to tell whether the '은' in '집은' is a particle or the noun '은' (silver), as in 금은동 (gold, silver, bronze). It gets even harder when the spacing is wrong and it is written as '집 은'.

 

Because of problems like these, Korean is said to be harder to process than languages written in the Latin alphabet.
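
To make the spacing problem concrete, here is a minimal sketch using only plain Python string splitting (no morphological analyzer involved). The variable names are made up for illustration. The two sentences contain exactly the same characters, yet splitting on whitespace yields different words.

```python
# The two example sentences: same characters, different spacing.
correct = "아버지가 방에 들어가신다"    # "Father enters the room"
misspaced = "아버지 가방에 들어가신다"  # "Father enters the bag"

# Ignoring spaces, the two strings are identical.
same_characters = correct.replace(" ", "") == misspaced.replace(" ", "")

# Splitting on whitespace gives different tokens.
correct_tokens = correct.split()
misspaced_tokens = misspaced.split()

print(same_characters)
print(correct_tokens)
print(misspaced_tokens)
```

A whitespace tokenizer cannot tell which spacing was intended, which is one reason a morphological analyzer is used later in this post.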

 

 

Hands-on

 

KoNLPy is the best-known Python package for Korean morphological analysis. It wraps morphological analysis engines written in Java, so Java must be installed first.

์„ค์น˜๊ฐ€ ๋ณต์žกํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ตฌ๊ธ€์ด๋‚˜ ์ด ๊ธ€์„ ์ฐธ๊ณ ํ•˜์…”์„œ ์ฐจ๊ทผ์ฐจ๊ทผ ์ž๋ฐ”๋ถ€ํ„ฐ ์„ค์น˜ํ•˜์‹œ๋ฉด ์‹คํ–‰๋˜์‹ค๊บผ์—์š”..!

 

Reference: "How to install konlpy (as of July 2021)", for those who have suffered installing konlpy in Python (velog.io)

- If an error occurs while using the Okt package, see this post:

 

Reference: "[Natural Language Processing] konlpy installation errors, okt() error - already loaded in another classloader, SystemErro" (byeon-sg.tistory.com)

 

 

1. Loading the data

ratings_train.txt
13.95MB
ratings_test.txt
4.67MB

import pandas as pd
train_df=pd.read_csv("ratings_train.txt", sep='\t')
train_df.head(3)

 

- Check the ratio of 0 and 1 label values in the training set

(1 = positive, 0 = negative sentiment)

train_df['label'].value_counts()
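
value_counts() returns the raw count for each label; the class balance is each count divided by the total. As a minimal sketch of that calculation in plain Python, using a tiny made-up label list rather than the real data:

```python
from collections import Counter

# Hypothetical toy labels (not the real dataset): 1 = positive, 0 = negative.
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Count each label, then divide by the total to get its share.
counts = Counter(labels)
ratios = {label: count / len(labels) for label, count in counts.items()}

print(counts[0], counts[1])
print(ratios)
```

With pandas, train_df['label'].value_counts(normalize=True) gives the same kind of ratios directly.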

 

 

 

2. Preprocessing

 

The 'document' column of train_df contains null values, so replace them with a blank.

Digits also carry little meaning as words in this analysis, so use Python's regular expression module re to replace them with a blank as well.

import re

# Training data: replace nulls with a blank
train_df = train_df.fillna(' ')
# Replace digits with a blank using a regular expression (\d matches a digit)
train_df['document'] = train_df['document'].apply( lambda x : re.sub(r"\d+", " ", x) )

# Load the test data and replace nulls with a blank
test_df = pd.read_csv('ratings_test.txt', sep='\t')
test_df = test_df.fillna(' ')
# Replace digits with a blank
test_df['document'] = test_df['document'].apply( lambda x : re.sub(r"\d+", " ", x) )
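
The two cleaning steps can be tried on plain strings without loading the dataset. This is a minimal sketch: clean_document is a hypothetical helper name, not part of the original code, and None stands in for the NaN that pandas uses for missing values.

```python
import re

def clean_document(text):
    """Replace a missing value with a blank, then replace digit runs with a blank."""
    if text is None:
        text = " "
    # \d+ matches one or more consecutive digits.
    return re.sub(r"\d+", " ", text)

print(clean_document("rated 10 out of 10").split())
print(repr(clean_document(None)))
```

Replacing with a blank rather than an empty string keeps the words on either side of a number from being glued together.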

 

 

3. Tokenization

 

Use Okt (formerly Twitter) as the Korean morphological engine to tokenize each sentence into morphemes, then vectorize the tokens with TF-IDF using TfidfVectorizer.

from konlpy.tag import Okt

okt = Okt()

def tw_tokenizer(text):
    # Tokenize the input text into morphemes and return them as a list
    tokens_ko = okt.morphs(text)
    return tokens_ko

Related post (2022.03.22): [Text Analysis] KoNLPy - Fixing the Twitter error (xod22.tistory.com). Running from konlpy.tag import Twitter and twitter=Twitter() gives UserWarning: "Twitter" has changed to "Okt" since KoNLPy v0.4.5.

- tw_tokenizer() : the function passed as the tokenizer parameter of TfidfVectorizer; it converts a sentence into a list of morpheme tokens.

 

 

~Vectorization~

from sklearn.feature_extraction.text import TfidfVectorizer

# Use a tokenizer built on the Okt object's morphs() method
tfidf_vect = TfidfVectorizer(tokenizer= tw_tokenizer, ngram_range=(1,2), min_df=3, max_df=0.9)
tfidf_vect.fit(train_df['document'])
tfidf_matrix_train = tfidf_vect.transform(train_df['document'])

- min_df : sets the minimum document frequency

     DF is the number of documents a term appears in, not the number of times the term occurs.

     Terms whose DF is lower than min_df are excluded from the vocabulary (vocabulary_).

 

- max_df : sets the maximum document frequency

     Terms whose DF is higher than max_df are excluded from the vocabulary (vocabulary_).

     A float is a proportion of documents and an int is an absolute count (e.g. 0.80 = ignore terms that appear in more than 80% of documents, 10 = ignore terms that appear in more than 10 documents)

 

- ngram_range : sets the range of n-gram sizes

     ngram_range = (1, 1) : single tokens only (one, two, …)

     ngram_range = (1, 2) : single tokens and pairs of adjacent tokens (go back, good time, one, two, …)
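
To see how these three parameters shape the vocabulary, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: the helper names ngrams and filter_vocabulary and the toy documents are invented for illustration, and the input is assumed to be already tokenized.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for n in the given inclusive range."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def filter_vocabulary(docs, min_df=1, max_df=1.0, ngram_range=(1, 1)):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        # set(): a term counts once per document, however often it repeats.
        df.update(set(ngrams(tokens, ngram_range)))
    # An int is an absolute document count, a float a proportion of documents.
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, c in df.items() if min_count <= c <= max_count)

# Hypothetical toy corpus of four tokenized documents.
docs = [["good", "movie"], ["good", "time"],
        ["good", "movie", "time"], ["bad", "movie"]]

print(ngrams(["go", "back", "home"], (1, 2)))
print(filter_vocabulary(docs, min_df=2))
print(filter_vocabulary(docs, min_df=2, max_df=0.7))
```

In the last call, "good" and "movie" each appear in 3 of 4 documents (75%), which is above the 70% ceiling, so only "time" survives; "bad" was already dropped by min_df=2.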

 

 

5. Logistic regression, tuned with GridSearchCV

-> Find the best parameter value

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Use Logistic Regression as the sentiment classifier
lg_clf = LogisticRegression(random_state=0)

# Use GridSearchCV to tune the parameter C
params = { 'C': [1 ,3.5, 4.5, 5.5, 10 ] }
grid_cv = GridSearchCV(lg_clf , param_grid=params , cv=3 ,scoring='accuracy', verbose=1 )
grid_cv.fit(tfidf_matrix_train , train_df['label'] )
print(grid_cv.best_params_ , round(grid_cv.best_score_,4))

 

~Accuracy~

from sklearn.metrics import accuracy_score

# Transform the test data into TF-IDF features with the TfidfVectorizer fitted on the training data
tfidf_matrix_test = tfidf_vect.transform(test_df['document'])

# Reuse the classifier that GridSearchCV trained with the best parameters
best_estimator = grid_cv.best_estimator_
preds = best_estimator.predict(tfidf_matrix_test)

print('Logistic Regression accuracy: ',accuracy_score(test_df['label'],preds))
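
For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal plain-Python sketch of that calculation follows; the accuracy helper and the toy labels are hypothetical and are not results from the model above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical toy labels for illustration only.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred))
```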

 

 

6. Testing on an actual sentence

test_df['document'][100]

grid_cv.predict(tfidf_vect.transform([test_df['document'][100]]))

- Reading the review at index 100 of the test data and seeing that the sentiment analysis predicts 0 (negative), the model seems to work reasonably well.

- When calling transform on a single sentence, the sentence must be wrapped in a list.
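
The reason for the list wrapping: the vectorizer expects an iterable of documents, and a bare Python string is itself an iterable of single characters. A plain-Python sketch of the difference (no scikit-learn needed; the sample string is made up):

```python
review = "good movie"

# Iterating over a bare string yields one character at a time.
as_bare_string = list(review)

# Wrapping it in a list yields one document: the whole review.
as_document_list = list([review])

print(len(as_bare_string))    # number of characters
print(len(as_document_list))  # number of documents
```

scikit-learn's text vectorizers check for this case and raise an error when given a single string instead of a collection of documents.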

 

 

7. Applying the sentiment classifier

text = '시원하고 통쾌한 액션 최고였어요'  # "Refreshing, thrilling action, it was the best"
# predict_proba returns values in [0, 1], so multiply by 100 to print a percentage
proba = grid_cv.predict_proba(tfidf_vect.transform([text]))[0]
if grid_cv.predict(tfidf_vect.transform([text])) == 0:
    print(f'"{text}" -> {round(proba[0] * 100, 2)}% likely to be negative.')
else:
    print(f'"{text}" -> {round(proba[1] * 100, 2)}% likely to be positive.')

text = '여태 보았던 영화중에 제일 재미없네요'  # "The most boring movie I have seen so far"
# predict_proba returns values in [0, 1], so multiply by 100 to print a percentage
proba = grid_cv.predict_proba(tfidf_vect.transform([text]))[0]
if grid_cv.predict(tfidf_vect.transform([text])) == 0:
    print(f'"{text}" -> {round(proba[0] * 100, 2)}% likely to be negative.')
else:
    print(f'"{text}" -> {round(proba[1] * 100, 2)}% likely to be positive.')
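
Since the same if/else is repeated for every sentence, the message building could be pulled into a small helper. This is only a sketch: describe_sentiment is a hypothetical name, the probabilities below are made up rather than model output, and the row is assumed to be ordered (P(negative), P(positive)) with each value in [0, 1].

```python
def describe_sentiment(text, proba_row):
    """Build a message from one row of class probabilities.

    proba_row is assumed to be (P(negative), P(positive)), each in [0, 1].
    """
    negative, positive = proba_row
    if negative > positive:
        label, prob = "negative", negative
    else:
        label, prob = "positive", positive
    # Probabilities are fractions, so scale by 100 before printing a percent sign.
    return f'"{text}" -> {round(prob * 100, 2)}% likely to be {label}.'

# Hypothetical probabilities for illustration, not model output.
print(describe_sentiment("sample review", (0.2, 0.8)))
```

With the fitted objects from this post, the row would come from grid_cv.predict_proba(tfidf_vect.transform([text]))[0].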

 

 
