๐Ÿ ๋จธ์‹ ๋Ÿฌ๋‹ | ๋”ฅ๋Ÿฌ๋‹/ํ…์ŠคํŠธ ๋ถ„์„

[Text Analysis] 2-3. Text Preprocessing - Stemming and Lemmatization

xod22 2022. 2. 23. 22:06

2022.02.20 - [Machine Learning | Deep Learning/Text Analysis] - [Text Analysis] 2-(2). Text Preprocessing - Stopword Removal

Picking up from the last post on stopword removal, this time I'm writing about Stemming & Lemmatization!

1. Cleansing
2. Tokenization
3. Filtering / stopword removal / spelling correction
4. Stemming & Lemmatization (reducing words to their base form)

4. Stemming and Lemmatization

 

: Stemming and Lemmatization both find the base form of a word!

*They differ in how sophisticated they are, and therefore in how fast they run.

 

For example, the English word work becomes worked in the past tense, works in the third-person singular, and working in the progressive form! Stemming and Lemmatization map these inflected forms back to the base form.

 

Of the two, Lemmatization is the more sophisticated. It looks for the base form using grammatical and semantic information (such as the part of speech), so it returns an actual dictionary form (the lemma).

Stemming, on the other hand, uses simpler rules to cut an inflected word down, so it tends to return a damaged stem that is not a real word.

 

Because of this difference, Lemmatization takes longer to run!
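To make the contrast concrete, here is a toy sketch in plain Python. It is not how NLTK works internally: `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMA_TABLE` are all made-up names for illustration. The stemmer just chops off a suffix with no idea whether the result is a real word, while the lemmatizer looks the word up together with its part of speech.

```python
# Toy illustration only: NOT the algorithm NLTK uses.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rules, tried in this order

# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("working", "v"): "work",
    ("works", "v"): "work",
    ("worked", "v"): "work",
    ("amusing", "v"): "amuse",
    ("amuses", "v"): "amuse",
    ("amused", "v"): "amuse",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no knowledge of real words."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up by (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


for w in ("working", "works", "worked", "amusing", "amuses", "amused"):
    print(f"{w:8} stem={toy_stem(w):5} lemma={toy_lemmatize(w, 'v')}")
```

With these toy rules, the work family reduces cleanly either way, but the amuse family comes out as amus from the stemmer and amuse from the lookup. A real lemmatizer does more than a single dictionary lookup, which is part of why it is the slower of the two.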

 

 

~Package import~

from nltk.stem import LancasterStemmer

# Create a LancasterStemmer, which does the stemming, and name it stemmer
stemmer = LancasterStemmer()

 

~Stemming~

# stemmer.stem('word') returns the stem; print the results
print(stemmer.stem('working'), stemmer.stem('works'), stemmer.stem('worked'))
print(stemmer.stem('happiest'), stemmer.stem('happier'))
print(stemmer.stem('fancier'), stemmer.stem('fanciest'))
print(stemmer.stem('amuses'), stemmer.stem('amusing'), stemmer.stem('amused'))

1. Stemming was done with LancasterStemmer.

2. The inflected verbs built on work are simple variations, so the base form is found easily, but for comparatives and superlatives the results show that the base form is not found accurately.

3. The base form is really amuse, but amuses, amusing, and amused all look like amus plus an ending, so the stemmer infers amus as the stem.. T_T

-> In other words, the stem comes back damaged, not precise!
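LancasterStemmer is not the only stemmer in NLTK, and it is generally described as the most aggressive one. As a rough sketch for comparing, the snippet below runs the same kind of words through `LancasterStemmer`, `PorterStemmer`, and `SnowballStemmer("english")`, all of which live in `nltk.stem`. The helper `stem_table` and the word list are my own choices for illustration, and no outputs are listed here since they are best checked by running it.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three rule-based stemmers shipped with NLTK (no corpus download needed).
stemmers = {
    "lancaster": LancasterStemmer(),
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
}

words = ["working", "amusing", "happier", "happiest", "fancier", "fanciest"]


def stem_table(words, stemmers):
    """Return {word: {stemmer_name: stem}} so the outputs can be compared."""
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}


table = stem_table(words, stemmers)
for word, row in table.items():
    print(word, row)
```

If the stems are only used to group variants of a word together, a damaged stem can still be good enough, so it's worth looking at the printed table before deciding which stemmer fits.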

 

 

~Package import again~

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

# Create a WordNetLemmatizer, which does the lemmatization, and name it lemma
lemma = WordNetLemmatizer()

 

~Lemmatization~

# lemma.lemmatize('word', 'pos') returns the lemma; print the results
print(lemma.lemmatize('amusing','v'), lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'), lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'), lemma.lemmatize('fanciest','a'))

1. For Lemmatization, pass the part of speech next to the word: v for verbs, a for adjectives! (If it is left out, WordNetLemmatizer treats the word as a noun.)

2. Compared with Stemming, the base forms come out much better..!
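Typing the part of speech by hand works for a few words, but for real text the tags usually come from a tagger. NLTK's `nltk.pos_tag` returns Penn Treebank tags such as VBD or JJR, while the lemmatizer expects single letters like v and a. Below is a minimal sketch of a mapping helper; `penn_to_wordnet` is a hypothetical name, and the tagged pairs are written by hand to stand in for tagger output.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # noun, also used as the fallback


# Hand-written (word, tag) pairs standing in for a tagger's output.
tagged = [("amused", "VBD"), ("happier", "JJR"), ("dogs", "NNS")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)  # [('amused', 'v'), ('happier', 'a'), ('dogs', 'n')]
```

Each pair can then be passed along as lemma.lemmatize(word, pos).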

 


So far I've written about the preprocessing steps that text has to go through before it is turned into vector-valued features. How was it..?

The overall process is complicated, but taken one step at a time, each step seems to rest on a fairly simple principle!

In the next post, I plan to study the next stage of text analysis: turning text into vector-valued features ~_~

That's all!
