
[Text Analysis] 4-2. Unsupervised Learning-Based Sentiment Analysis

xod22 2022. 2. 27. 18:42

2022.02.26 - [Machine Learning | Deep Learning/Text Analysis] - [Text Analysis] 4-(1). Supervised Learning-Based Sentiment Analysis - IMDB Movie Reviews

 


Following the post on supervised sentiment analysis, this time let's work through unsupervised sentiment analysis!


Unsupervised Learning-Based Sentiment Analysis

 

: Unsupervised sentiment analysis is based on a Lexicon.

A Lexicon generally means a vocabulary, but here it refers to a sentiment lexicon: a dictionary of words built specifically for analyzing sentiment.

 

A sentiment lexicon assigns each word a numeric value (a sentiment score) that indicates how positive or negative it is. The score is determined with reference to the word's position, its surrounding words, the context, its POS (Part of Speech), and so on.
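To make the lexicon idea concrete, here is a minimal sketch of lexicon-based scoring. The `TOY_LEXICON` dictionary, its words, and its numbers are made up purely for illustration (they are not taken from SentiWordNet or any real lexicon), and the helper names `lexicon_polarity` and `classify` are hypothetical.

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1].
# The words and numbers are invented for illustration only.
TOY_LEXICON = {
    "great": 0.8,
    "fabulous": 0.75,
    "boring": -0.6,
    "terrible": -0.9,
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every known word; unknown words count as 0."""
    tokens = text.lower().split()
    return sum(lexicon.get(token, 0.0) for token in tokens)


def classify(text, threshold=0.0):
    """Return 1 (positive) if the summed polarity exceeds the threshold, else 0."""
    return 1 if lexicon_polarity(text) > threshold else 0


print(classify("a great and fabulous movie"))
print(classify("boring plot and terrible acting"))
```

No labeled training data is involved: the prediction comes entirely from the scores stored in the lexicon, which is why this approach counts as unsupervised.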

 

NLTK is the representative package that implements such sentiment lexicons. NLTK has many submodules, and the Lexicon module is one of them.

 

WordNet

 

: To understand the sentiment lexicons, you first need to know WordNet.

The WordNet module is a large lexical database shipped with NLTK that provides rich semantic information about words.

 

Even the same word or sentence can be expressed or understood differently in a different setting or context.

For example, the English word 'present' can mean a gift, but it can also mean the current time. Likewise, the Korean expression '밥 먹었어?' ("Have you eaten?") may simply ask whether someone has had a meal, but it can also be a way of asking how they are doing!

 

WordNet provides this kind of semantic information for words whose meaning changes with the situation.

 

Sentiment lexicons included in NLTK

 

  • SentiWordNet : Much like NLTK's WordNet, this is a WordNet built specifically for sentiment words. It assigns three sentiment scores:

- Positive score : how positive the word is in terms of sentiment

- Negative score : how negative the word is in terms of sentiment

- Objectivity score : the opposite notion to the positive/negative scores; how objective the word is, regardless of sentiment

 

  • VADER : A package for sentiment analysis of social media text. It gives good results and runs relatively fast, so it is often used on large volumes of text data.
  • Pattern : A package that has drawn a lot of attention for its predictive performance, though for a long time it ran only on Python 2.x.

=> Let's perform sentiment analysis using the SentiWordNet sentiment lexicon!

 

Practice

 

1. Importing packages

: SentiWordNet uses WordNet-based synsets, so let's first look at WordNet's synsets.

A synset (a set of synonyms that share one meaning) is the core concept of WordNet; it carries the context and semantic information of a word.

 

First we download all of NLTK's datasets and packages, then use WordNet's synsets() to extract the synset objects for the word 'present'!

import nltk
import pandas as pd
from nltk.corpus import wordnet as wn
nltk.download('all')

 

 

2. Getting the synsets (dictionary senses) for the word 'present'

synsets = wn.synsets('present')

print('synsets() return type: ', type(synsets))
print('synsets() number of returned values: ', len(synsets))
print('synsets() returned values: ', synsets)

A total of 18 synset objects, each with a different meaning, were returned.

In a name such as Synset('present.n.01'), the string 'present.n.01' encodes the word, its POS (Part Of Speech) tag, and a sense index.

n stands for noun, and 01 is an index that distinguishes the several meanings present has as a noun.
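The naming convention can be taken apart with plain string handling. The sketch below needs no NLTK data; `parse_synset_name` and `POS_TAGS` are hypothetical helpers written for this post, not part of NLTK. The single-letter codes listed are the ones WordNet uses.

```python
# WordNet's single-letter POS codes.
POS_TAGS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def parse_synset_name(name):
    """Split a name like 'present.n.01' into (lemma, pos, sense_index)."""
    # Split from the right so a lemma that itself contains a dot stays intact.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


lemma, pos, sense = parse_synset_name("present.n.01")
print(lemma, POS_TAGS[pos], sense)  # present noun 1
```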

 

for synset in synsets:
    print('##### Synset name: ', synset.name(), '#####')  # name of the synset
    print('POS: ', synset.lexname())  # lexical category (POS plus semantic field), e.g. noun.time
    print('Definition: ', synset.definition())  # meaning of this sense
    print('Lemmas: ', synset.lemma_names())  # lemmas: alternative words that share this sense

Multiple synsets exist to express the different meanings of a single word. For each synset we can look at its part of speech, its definition, and its alternative words (lemmas)!

 

- Synset('present.n.01') and Synset('present.n.02') are both nouns, but they have different meanings.

- Synset('present.n.01') has POS noun.time, and its Definition shows that it means 'the present' in the temporal sense.

- Synset('present.n.02') has POS noun.possession, and its Definition is 'a gift'.

 

In this way, each synset represents one of the several meanings a word can have as a separate object.

 

 

3. Checking similarity between words

# Create the synset for each word
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

 

Let's use the path_similarity() method of the synset object to examine the pairwise similarity of the words 'tree', 'lion', 'tiger', 'cat', and 'dog'.

entities = [tree, lion, tiger, cat, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]

# Iterate over each word's synset and measure its similarity to every synset in the list
for entity in entities:
    similarity = [round(entity.path_similarity(compared_entity), 2) for compared_entity in entities]
    similarities.append(similarity)

# Store the pairwise similarities in a DataFrame
similarity_df = pd.DataFrame(similarities, columns=entity_names, index=entity_names)
similarity_df
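For intuition about what path_similarity() measures: the score is 1 / (1 + length of the shortest path between two synsets in the hypernym/hyponym hierarchy), so identical synsets score 1.0 and distant ones approach 0. The sketch below applies that rule to a small made-up taxonomy. `TOY_HYPERNYMS` is hypothetical and far simpler than WordNet's real hierarchy, so its numbers will not match the table above.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); NOT WordNet's real hierarchy.
TOY_HYPERNYMS = {
    "lion": "big_cat",
    "tiger": "big_cat",
    "big_cat": "feline",
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "organism",
    "tree": "plant",
    "plant": "organism",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def toy_path_similarity(a, b, graph):
    """Apply the rule 1 / (1 + shortest path length)."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1 / (1 + dist)


graph = build_graph(TOY_HYPERNYMS)
for pair in [("lion", "lion"), ("lion", "tiger"), ("lion", "tree")]:
    # lion-lion is 0 edges apart, lion-tiger 2 edges, lion-tree 6 edges
    print(pair, round(toy_path_similarity(*pair, graph), 2))
```

In the toy graph, lion and tiger share a direct parent and so score higher than lion and tree, which only meet at the root.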

 

4. SentiWordNet - sentiment lexicon

: For sentiment analysis, SentiWordNet is generally used rather than the plain WordNet dictionary.

SentiWordNet has a SentiSynset class, similar to WordNet's Synset!

Like synsets(), the SentiWordNet module's senti_synsets() returns one SentiSynset object per sense. It returns them as an iterable rather than a list, so we wrap the call in list().

import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type: ', type(senti_synsets))
print('senti_synsets() number of returned values: ', len(senti_synsets))
print('senti_synsets() returned values: ', senti_synsets)

A total of 11 SentiSynset objects, each with a different meaning, were returned.

 

 

5. SentiWordNet - comparing father and fabulous

import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score: ', father.pos_score())
print('father negative score: ', father.neg_score())
print('father objectivity score: ', father.obj_score())
print('\n')
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score: ', fabulous.pos_score())
print('fabulous negative score: ', fabulous.neg_score())

The objectivity score is close to 0 for a sentiment word and 1 for an objective word that carries no sentiment; the positive, negative, and objectivity scores always add up to 1.

For a sentiment word (objectivity near 0), the polarity is expressed instead by the positive and negative scores!

 

- father is an objective word: its objectivity score is 1.0, and its positive and negative scores are both 0.

- fabulous is a sentiment word: its positive score is 0.875 and its negative score is 0.125.
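The relationship between the three scores can be written down directly: the objectivity score is whatever is left after the positive and negative scores, i.e. 1 - (positive + negative). The sketch below is plain Python with no NLTK data required. The `objectivity` and `polarity_label` helpers are hypothetical, and the (0.25, 0.125) pair is an invented example, not a real SentiWordNet entry.

```python
def objectivity(pos_score, neg_score):
    """Objectivity is what remains after the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)


def polarity_label(pos_score, neg_score):
    """Label a sense by whichever of the two sentiment scores is larger."""
    if pos_score > neg_score:
        return "positive"
    if neg_score > pos_score:
        return "negative"
    return "neutral"


# father.n.01 as reported above: both sentiment scores are 0.
print(objectivity(0.0, 0.0), polarity_label(0.0, 0.0))

# Invented scores for a mildly positive word (not from SentiWordNet).
print(objectivity(0.25, 0.125), polarity_label(0.25, 0.125))
```

A simple document-level classifier could sum the positive and negative scores over the words in a review and compare the totals, which is one way to apply these scores to the IMDB reviews from the previous post.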
