[Text Analysis] 2-1. Text Preprocessing - Cleansing, Tokenization

2022. 2. 20. 00:36 · 🐍 Machine Learning | Deep Learning/Text Analysis

2022.02.19 - [Machine Learning | Deep Learning/Text Analysis] - [Text Analysis] 1. Understanding Text Analysis

 


In the last post I wrote a short overview of what text analysis is!

Today I'm going to study text preprocessing and normalization, the first step in the text analysis process!


To analyze documents, we need to extract word-based features from them and then assign vector values to those features.

But before assigning vector values to the features, preprocessing the raw data is essential!

 

The preprocessing steps for text analysis can be summarized as follows.

1. Cleansing
2. Tokenization
3. Filtering / stopword removal / spelling correction
4. Stemming & Lemmatization

1. Cleansing

 

: Removing unnecessary characters and other noise that would only get in the way of text analysis!

For example, if you have data crawled from the internet, stripping out HTML markup and similar symbols beforehand belongs to this step.
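
As a rough illustration of the HTML example above, here is a minimal sketch that uses only the Python standard library. The function name `cleanse_html` and the sample string are made up for this post, and regex-based tag stripping is a simplification: for real crawled pages a proper HTML parser is usually the safer choice.

```python
import html
import re


def cleanse_html(raw: str) -> str:
    """Strip HTML tags and entities from crawled text (illustrative only)."""
    # Drop <script>/<style> blocks together with their contents
    text = re.sub(r"(?is)<(script|style)\b[^>]*>.*?</\1\s*>", " ", raw)
    # Replace every remaining tag with a space
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; or &nbsp;
    text = html.unescape(text)
    # Collapse runs of whitespace (including non-breaking spaces)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical crawled snippet
sample = "<p>The Matrix is <b>everywhere</b>&nbsp;&amp; all around us.</p>"
print(cleanse_html(sample))
```

Tags are replaced with a space rather than an empty string so that words separated only by markup do not get glued together.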

 

2. Tokenization

 

: Splitting raw text into sentences and words.

There are two main types of tokenization.

 

1) Sentence tokenization: splitting a document into sentences

2) Word tokenization: splitting a sentence into word tokens

 

In short: document -> sentences, sentence -> words

 

 

~Package import~

import nltk
from nltk import sent_tokenize

# Download the Punkt tokenizer data (newer NLTK versions may need 'punkt_tab' instead)
nltk.download('punkt')

 

~Sentence tokenization~

: Sentence tokenization mostly works by looking for symbols that mark the end of a sentence.

Typical end-of-sentence markers include the period (.) and the newline character (\n).

Sentence tokenization based on regular expressions is also possible.

text_sample = '''The Matrix is everywhere its all around us,
here even in this room. you can see it out your window or on your television.
you feel it when you go to work, or go to church or pay your taxes.'''

# sent_tokenize(text="text to split")
sentences = sent_tokenize(text=text_sample)

print("Result:", sentences)

# Count how many sentences the text contains
print("Number of sentences:", len(sentences))

Printing the result shows that the text has been split sentence by sentence and stored in a list!
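
The regular-expression approach mentioned above could look roughly like the sketch below. It is not how `sent_tokenize` works internally; the function name `regex_sent_tokenize` and the sample text are invented for illustration, and the rule is deliberately naive (it would wrongly split after abbreviations such as "Dr.").

```python
import re
from typing import List


def regex_sent_tokenize(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the sentence;
    # \s+ also covers newlines, so line breaks after a period split too.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Hypothetical sample document
doc = "The Matrix is everywhere. You can see it out your window! Do you feel it?"
for sentence in regex_sent_tokenize(doc):
    print(sentence)
```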

 

 

~Word tokenization~

: Word tokenization means splitting a sentence into word tokens.

By default, tokens are split on whitespace, commas (,), periods (.), newlines, and similar characters.

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room"
words = word_tokenize(sentence)

print(type(words))
print("Result:", words)

You can see that the words come back in a list, split on whitespace. Punctuation such as the comma after "us" becomes a token of its own.
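
To make the splitting rule more concrete, here is a small standard-library sketch of a word tokenizer that separates words from punctuation. It is only an approximation for illustration, not NLTK's actual algorithm, and `simple_word_tokenize` is a made-up name.

```python
import re
from typing import List


def simple_word_tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+      : a run of letters, digits, or underscores (a "word")
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = simple_word_tokenize("The Matrix is everywhere, here even in this room.")
print(tokens)
```

Whitespace never shows up in the result because neither alternative in the pattern matches it.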

 

 

~Combining sentence tokenization and word tokenization~

from nltk import word_tokenize, sent_tokenize

# Define a helper that tokenizes a document into sentences, then words
def tokenize_text(text):

    # Split the text into sentences
    sentences = sent_tokenize(text)

    # Word-tokenize each sentence separately
    word_tokens = [word_tokenize(sentence) for sentence in sentences]

    return word_tokens

# Pass text_sample to the function
word_tokens = tokenize_text(text_sample)

# Print the returned word_tokens
print(type(word_tokens), len(word_tokens))
print(word_tokens)

You can see that each of the 3 sentences has been word-tokenized and that they are all stored in a single list!

-> Three sentences, so three inner lists!
