[Text Analysis] 2-3. Text Preprocessing - Stemming and Lemmatization

2022. 2. 23. 22:06 · Machine Learning | Deep Learning/Text Analysis

2022.02.20 - [Machine Learning | Deep Learning/Text Analysis] - [Text Analysis] 2-(2). Text Preprocessing - Stopword Removal

Following up on the previous post about stopword removal, this time I'm writing about Stemming & Lemmatization!

1. Cleansing
2. Tokenization
3. Filtering / stopword removal / spelling correction
4. Stemming & Lemmatization (reducing words to a base form)
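
As a rough picture of how the four steps chain together, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the tiny STOPWORDS set, the function names, and strip_plural_s, which only stands in for a real stemmer or lemmatizer such as the NLTK ones used later in this post.

```python
import re

# Hypothetical, tiny stopword set for illustration only (not NLTK's list).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "on", "in"}


def cleanse(text):
    """Step 1: lowercase, then replace anything that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Step 2: split on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Step 3: drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def normalize(tokens, to_base):
    """Step 4: map each token to a base form with the given function."""
    return [to_base(t) for t in tokens]


def strip_plural_s(word):
    """Placeholder for a real stemmer/lemmatizer: drops one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def preprocess(text, to_base=strip_plural_s):
    return normalize(remove_stopwords(tokenize(cleanse(text))), to_base)


print(preprocess("The cats are sleeping on the mats!"))
# ['cat', 'sleeping', 'mat']
```

Swapping to_base for stemmer.stem or a lemmatizer call is where step 4 of this post would plug in.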

4. Stemming and Lemmatization

 

: Stemming and Lemmatization both try to find the base form of a word!

*They differ in how sophisticated they are, and therefore in how fast they run.

For example, the English word work becomes worked in the past tense, works in the third person singular, and working in the progressive form. Stemming and Lemmatization map inflected words like these back to their base form.

Of the two, Lemmatization is the more sophisticated. It looks for the base form on grammatical and semantic grounds (for example, the word's part of speech), so it finds an accurate base form.

Stemming, on the other hand, converts inflected words in a more simplified way, so it tends to produce a damaged stem that is not always a real word.

Because of this difference, Lemmatization takes longer to run!

 

 

~Package import~

from nltk.stem import LancasterStemmer

# Create a LancasterStemmer instance (it does the stemming) and name it stemmer
stemmer = LancasterStemmer()

 

~Stemming~

# stemmer.stem('word') returns the stem; print the results
print(stemmer.stem('working'), stemmer.stem('works'), stemmer.stem('worked'))
print(stemmer.stem('happiest'), stemmer.stem('happier'))
print(stemmer.stem('fancier'), stemmer.stem('fanciest'))
print(stemmer.stem('amuses'), stemmer.stem('amusing'), stemmer.stem('amused'))

1. Stemming was done with LancasterStemmer.

2. The variants of work are simple inflections, so the base form is found easily, but for the comparatives and superlatives the results show that the base form is not recovered correctly.

3. The base form is really amuse, but amuses, amusing, and amused all look like amus plus an ending, so the stemmer infers amus as the stem..

-> In other words, the stem comes back crude and damaged!
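
To see why a stem like amus can come out, here is a minimal sketch of plain suffix stripping. The SUFFIXES list and the naive_stem function are made up for illustration; this is not LancasterStemmer's actual rule table, which is much larger.

```python
# Hypothetical suffix list for illustration; LancasterStemmer's real rules
# are NOT reproduced here.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, trying longer suffixes first."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        # Only strip when the suffix matches and enough of the word is left.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


examples = {
    "work": ["working", "works", "worked"],
    "amuse": ["amuses", "amusing", "amused"],
}
for base, variants in examples.items():
    print(base, "->", [naive_stem(v) for v in variants])
# work -> ['work', 'work', 'work']
# amuse -> ['amus', 'amus', 'amus']
```

The function never checks whether the result is a real word, so cutting "es", "ing", or "ed" off the amuse variants leaves amus every time.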

 

 

~Package import again~

import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data that WordNetLemmatizer looks words up in
nltk.download('wordnet')
nltk.download('omw-1.4')

# Create a WordNetLemmatizer instance (it does the lemmatization) and name it lemma
lemma = WordNetLemmatizer()

 

~Lemmatization~

# lemma.lemmatize('word', 'part of speech') returns the lemma; print the results
print(lemma.lemmatize('amusing', 'v'), lemma.lemmatize('amuses', 'v'), lemma.lemmatize('amused', 'v'))
print(lemma.lemmatize('happier', 'a'), lemma.lemmatize('happiest', 'a'))
print(lemma.lemmatize('fancier', 'a'), lemma.lemmatize('fanciest', 'a'))

1. For Lemmatization, pass the part of speech along with the word: v for verbs, a for adjectives! (If you leave it out, WordNetLemmatizer treats the word as a noun, n.)

2. Compared with Stemming, the base forms come out much more cleanly..!
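
Why the part of speech matters can be sketched with a toy lemmatizer that applies suffix rules per part of speech and only accepts a candidate if it is a known word. The VOCAB and RULES tables and the toy_lemmatize function are hypothetical stand-ins written for this example; this is not how NLTK implements WordNetLemmatizer, only the general idea of dictionary-checked rules.

```python
# Hypothetical mini-vocabulary per part of speech; in NLTK, WordNet plays
# this role with far more entries.
VOCAB = {
    "v": {"amuse", "work"},
    "a": {"happy", "fancy"},
    "n": {"work"},
}

# Hypothetical (suffix, replacement) rules per part of speech.
RULES = {
    "v": [("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", ""), ("es", "e"), ("s", "")],
    "a": [("iest", "y"), ("ier", "y"), ("est", ""), ("er", "")],
    "n": [("s", "")],
}


def toy_lemmatize(word, pos="n"):
    """Return the first rule-generated candidate found in the vocabulary.

    pos defaults to "n", like WordNetLemmatizer.lemmatize.
    If no candidate is a known word, the input is returned unchanged.
    """
    if word in VOCAB[pos]:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB[pos]:
                return candidate
    return word


print(toy_lemmatize("amusing", "v"), toy_lemmatize("amuses", "v"), toy_lemmatize("amused", "v"))
# amuse amuse amuse
print(toy_lemmatize("happier", "a"), toy_lemmatize("fanciest", "a"))
# happy fancy
print(toy_lemmatize("amusing"))  # default pos "n": no noun rule applies
# amusing
```

Because every candidate is checked against a vocabulary, the result is always a real word or the untouched input, and the extra lookups are one plausible reason a lemmatizer costs more time than a stemmer.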

 


So far I've written about the preprocessing steps that come before turning text into vector-valued features. How was it..

The whole process looks complicated, but taken one step at a time, each step seems to rest on a fairly simple principle!

In the next post, I'll study the next stage of text analysis: converting text into features with vector values ~_~

That's it!
