xod22 2022. 2. 24. 16:46

2022.02.23 - [Machine Learning | Deep Learning/Text Analysis] - [Text Analysis] 2-(3). Text Preprocessing - Stemming and Lemmatization

 


Continuing from the previous post, this time I'm going to study Bag of Words (BOW), the step that comes right after text preprocessing!


Bag of Words - BOW

 

: The BOW model takes all the words in a document, ignores their context and order, and assigns each word a frequency value. Those frequencies become the extracted feature values.

As an analogy, think of a bag of seasoned fries. Pulling every word out of the document, dropping them into the bag, and shaking it all up is a good picture of what happens. The name Bag of Words itself comes from this kind of image!

BOW process

 

Suppose we have Sentence 1 and Sentence 2. First, extract every word from both sentences, drop the duplicates, and lay the unique words out as columns. Then set each individual sentence as an index (row), and for each row record, as the value, how many times each column's word occurs in that sentence.
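The process described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the two sentences are made up for the example, and the whitespace tokenization is an assumption (real text would go through the preprocessing steps from the earlier posts first).

```python
from collections import Counter

# Two made-up example sentences (not from the original post).
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Naive whitespace tokenization, assumed here for simplicity.
tokenized = [s.split() for s in sentences]

# Columns: every unique word across all sentences, sorted for a stable order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Rows: one count vector per sentence, aligned with the vocab columns.
bow = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow.append([counts[word] for word in vocab])

print(vocab)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for row in bow:
    print(row)
# [1, 0, 0, 1, 1, 1, 2]
# [0, 1, 1, 0, 1, 1, 2]
```

Each row is one sentence and each column is one word, so the word order inside a sentence is no longer recoverable from the matrix.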

 

 

Advantage: easy and quick to build

Disadvantages:

1. Feature values are set without regard to context or order, so the contextual meaning of the text is hard to capture

2. Sparse matrix problem: a matrix in which most column values are filled with 0

-> When there are multiple documents, BOW extracts the words from all of them and lays them out as columns. Each document then gets its own word counts as the values of those columns. Because the columns are built from the words of every document, the number of columns grows very large, yet each document naturally uses only a small and different part of that vocabulary, so most column values in any given row are likely to end up as 0. A sparse matrix like this generally increases computation time and can degrade the performance of machine learning models, which is why it counts as a drawback of the BOW model.
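To get a feel for how quickly the zeros pile up, here is a small sketch that measures the fraction of zero entries in a BOW matrix. The three documents are invented for the illustration and deliberately share no words, so treat the number as an extreme case and not a typical one.

```python
from collections import Counter

# Three made-up documents with no vocabulary overlap (illustrative only).
docs = [
    "apple banana apple",
    "car engine wheel road",
    "guitar piano drums violin",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})

# Dense BOW matrix: one row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[w] for w in vocab])

total = len(matrix) * len(vocab)
zeros = sum(value == 0 for row in matrix for value in row)
sparsity = zeros / total

print(f"{len(docs)} documents x {len(vocab)} words")  # 3 documents x 10 words
print(f"{zeros}/{total} entries are 0")               # 20/30 entries are 0
print(f"sparsity = {sparsity:.2f}")                   # sparsity = 0.67
```

With only three short documents, two thirds of the matrix is already 0. Every new document that brings its own words adds more columns that are 0 for all the other rows.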

 

BOW feature vectorization

 

1. Count-based vectorization: words with a high frequency count are treated as highly important. The catch is that words used commonly and heavily across all documents also end up rated as important..

2. TF-IDF-based vectorization: words with a high frequency are still treated as important, but words that are used widely across all documents receive a penalty.
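The difference between the two can be seen in a small hand-rolled sketch. The post does not give a formula, so this assumes the basic textbook weighting, count × log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ. The documents and the `tf_idf` helper are made up for this example.

```python
import math
from collections import Counter

# Made-up documents for illustration.
docs = [
    "the movie was great",
    "the movie was boring",
    "the food was great",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Raw count of `word` in one document times log(N / df) (assumed formula)."""
    tf = tokens.count(word)
    idf = math.log(n_docs / df[word])
    return tf * idf

first = tokenized[0]
for word in first:
    print(f"{word:>6}  count={first.count(word)}  tf-idf={tf_idf(word, first):.3f}")
#    the  count=1  tf-idf=0.000
#  movie  count=1  tf-idf=0.405
#    was  count=1  tf-idf=0.000
#  great  count=1  tf-idf=0.405
```

By raw count all four words look equally important, but "the" and "was" appear in every document, so the penalty pushes their weight down to 0.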


Solving the BOW sparse matrix problem: COO / CSR

 

Because BOW feature vectorization produces a sparse matrix (lots of 0s), it runs into memory problems. Stored as-is, a sparse matrix takes up an enormous amount of memory, so it has to be converted into a form that takes up less. The COO format and the CSR format exist for this purpose.

 

Of the two, CSR is generally the better choice over COO when it comes to storing a matrix and computing with it.

 

  • COO format

: The COO (Coordinate) format stores only the non-zero data in a separate data array, and stores the row and column positions of that data in separate arrays.

 

-> Building the row-position and column-position arrays

For the 2x3 matrix [[3,0,1],[0,2,0]], the non-zero data is [3,1,2], and the positions of the non-zero data, written as (row, col), are (0,0), (0,2), (1,1).

Storing row and column as separate arrays gives row = [0,0,1] and column = [0,2,1].

In Python, SciPy is the usual choice for sparse matrix conversion.

 

 

~Import packages~

import numpy as np
from scipy import sparse

 

~COO format~

dense = np.array([[3, 0, 1], [0, 2, 0]])

# Extract the non-zero data
data = np.array([3, 1, 2])

# Build the row positions and column positions as separate arrays
row_pos = np.array([0, 0, 1])
col_pos = np.array([0, 2, 1])

# Create the sparse matrix in COO format with coo_matrix from the sparse package
sparse_coo = sparse.coo_matrix((data, (row_pos, col_pos)))
print(sparse_coo)
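The three arrays above were typed in by hand. As a rough sketch, they can also be derived from the dense matrix with np.nonzero, which returns the row and column indices of the non-zero entries in row-by-row order. The explicit shape argument is an addition for this example and not part of the code above: without it, SciPy infers the shape from the largest indices, which would drop any trailing rows or columns that are entirely 0.

```python
import numpy as np
from scipy import sparse

dense = np.array([[3, 0, 1], [0, 2, 0]])

# Row and column indices of the non-zero entries, scanned row by row.
row_pos, col_pos = np.nonzero(dense)

# Pick out the values that sit at those positions.
data = dense[row_pos, col_pos]

print(data.tolist())     # [3, 1, 2]
print(row_pos.tolist())  # [0, 0, 1]
print(col_pos.tolist())  # [0, 2, 1]

# Build the COO matrix, stating the shape explicitly.
sparse_coo = sparse.coo_matrix((data, (row_pos, col_pos)), shape=dense.shape)

# Round-trip check: converting back to dense gives the original matrix.
print(np.array_equal(sparse_coo.toarray(), dense))  # True
```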

* The problem with the COO format

: For a larger matrix (the 6x6 one used in the CSR code below), converting to COO format gives the data array [1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1],

the row-position array [0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5], and the column-position array [2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0].

Looking closely at the row-position array, you can see the same value repeated in consecutive runs: 0, 0 -> 1, 1, 1, 1, 1 -> 2, 2, and so on.

 

 

  • CSR format

: The CSR (Compressed Sparse Row) format is what solves this problem with COO!

It is a conversion scheme that keeps only the starting position of each unique value in the row-position array, stored as a separate position array.

Since the row-position array is made up of values that increase sequentially from 0, the repetition can be removed by recording only where each unique value starts! (The total number of non-zero entries is appended as the final element, so the end of the last row is known too.)

As long as the starting positions of the unique values are known, the row-position array can be rebuilt at any time, so CSR needs less memory than COO and allows faster computation!
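The compression step can be sketched with NumPy alone. This is one possible way to compute the starting positions (count the entries per row, then take a running total), assuming the number of rows is known. The variable name `row_pos_ind` is borrowed from the code further down, and `n_rows`, `counts`, and `rebuilt` are names made up for this example.

```python
import numpy as np

# Row-position array of the 6x6 example matrix in COO form.
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])
n_rows = 6  # assumed known from the matrix shape

# Count the non-zero entries in each row, then take a running total.
counts = np.bincount(row_pos, minlength=n_rows)
row_pos_ind = np.concatenate(([0], np.cumsum(counts)))
print(row_pos_ind.tolist())  # [0, 2, 7, 9, 10, 12, 13]

# Going the other way: repeat each row number by its entry count.
rebuilt = np.repeat(np.arange(n_rows), np.diff(row_pos_ind))
print(np.array_equal(rebuilt, row_pos))  # True
```

Thirteen row positions shrink to seven start offsets here, and the offsets array only ever needs (number of rows + 1) entries no matter how many non-zero values the matrix holds.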

 

 

~Import packages~

import numpy as np
from scipy import sparse

 

~CSR format~

dense2 = np.array([[0, 0, 1, 0, 0, 5],
                   [1, 4, 0, 3, 2, 5],
                   [0, 6, 0, 3, 0, 0],
                   [2, 0, 0, 0, 0, 0],
                   [0, 0, 0, 7, 0, 8],
                   [1, 0, 0, 0, 0, 0]])

# Extract the non-zero data
data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])

# Build the row positions and column positions as separate arrays
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])

* In practice, you don't have to specify the non-zero data array and the ROW/COL position arrays yourself. You can simply call sparse.coo_matrix(array) or sparse.csr_matrix(array).

# Convert to COO format
sparse_coo = sparse.coo_matrix((data2, (row_pos, col_pos)))

# Build an array of the starting indices of each unique value in the row-position array
# (the final 13 is the total number of non-zero entries)
row_pos_ind = np.array([0, 2, 7, 9, 10, 12, 13])

# Convert to CSR format
sparse_csr = sparse.csr_matrix((data2, col_pos, row_pos_ind))

print('Convert the COO matrix back to dense to check the conversion')
print(sparse_coo.toarray())
print('Convert the CSR matrix back to dense to check the conversion')
print(sparse_csr.toarray())

You can see that the COO and CSR formats both store exactly the same matrix.

 

 

~Actual usage~

dense3 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

coo = sparse.coo_matrix(dense3)
csr = sparse.csr_matrix(dense3)

print('COO\n', coo)
print('CSR\n', csr)
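As a small follow-up sketch, the arrays SciPy builds internally can be inspected to confirm they match the hand-written ones from earlier, and to compare how many index entries each format keeps. The attribute names `data`, `indices`, `indptr`, `row`, and `col` are SciPy's own. The counting variables are made up for this example.

```python
import numpy as np
from scipy import sparse

dense3 = np.array([[0, 0, 1, 0, 0, 5],
                   [1, 4, 0, 3, 2, 5],
                   [0, 6, 0, 3, 0, 0],
                   [2, 0, 0, 0, 0, 0],
                   [0, 0, 0, 7, 0, 8],
                   [1, 0, 0, 0, 0, 0]])

coo = sparse.coo_matrix(dense3)
csr = sparse.csr_matrix(dense3)

# CSR keeps three arrays: values, column indices, and row start offsets.
print(csr.data.tolist())     # [1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1]
print(csr.indices.tolist())  # [2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0]
print(csr.indptr.tolist())   # [0, 2, 7, 9, 10, 12, 13]

# Index storage: COO needs a row and a column entry per non-zero value,
# CSR needs a column entry per value plus (number of rows + 1) offsets.
coo_index_count = coo.row.size + coo.col.size
csr_index_count = csr.indices.size + csr.indptr.size
print(coo_index_count, csr_index_count)  # 26 20
```

The saving looks small on a 6x6 matrix, but the gap widens as the number of non-zero values per row grows, since the offsets array depends only on the number of rows.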


Today we studied BOW, the step after preprocessing!

In the next post, I'm going to put this to use in a hands-on exercise!

 

The end!-!
