
[Preprocessing] Types of data processing and data cleaning

xod22 2022. 3. 4. 23:00
Data Processing

 

: Data Processing (data preprocessing) is the work of shaping data into a form that is suitable for analysis.

This is the step that takes the most effort..!

Data preprocessing falls broadly into the following types.

 

1. Data Cleaning - handling missing values, etc.

2. Data Transformation - normalization, etc.

3. Data Reduction - dimensionality reduction, etc.


*So let's start with the first type of preprocessing, Data Cleaning..!

 

~1. Data Cleaning~

Dropping missing values

1. Import packages and load the data

Data file: pima-indians-diabetes.csv (0.02MB)

from pandas import read_csv
dataset=read_csv('pima-indians-diabetes.csv', header=None)

 

2. Check the data

print(dataset.describe())
# min comes out as 0, and in this data 0 means a missing value

 

3. Count the missing values

num_missing=(dataset[[1,2,3,4,5]]==0).sum()
print(num_missing)

Counting the zeros in columns 1, 2, 3, 4, and 5 gives the number of missing values in each column!
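To see why comparing to 0 and summing works, here is a minimal sketch on a tiny made-up frame. The values and the name `toy` are only for illustration and are not taken from the CSV.

```python
import pandas as pd

# Made-up stand-in for a few columns of the dataset (illustrative values only).
toy = pd.DataFrame({
    1: [148, 0, 183, 89],
    2: [72, 66, 0, 0],
    3: [35, 29, 0, 23],
})

# toy == 0 gives a boolean frame; sum() counts the True values per column.
zero_counts = (toy == 0).sum()

# Dividing by the number of rows turns the counts into a share per column.
zero_share = zero_counts / len(toy)

print(zero_counts)
print(zero_share)
```

Looking at the share as well as the count can help when deciding later whether to drop rows or fill the values in.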

 

 

4. Replace 0 with a missing value (NaN)

import numpy as np
dataset=read_csv('pima-indians-diabetes.csv', header=None)
datasetorig=dataset.copy()

dataset[[1,2,3,4,5]]=dataset[[1,2,3,4,5]].replace(0, np.nan)
print(datasetorig.head(10))
print(dataset.head(10))
# Confirm that the zeros were replaced with NaN

Turning 0 into NaN lets pandas treat those cells as missing values!
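The replacement is limited to the selected columns, so zeros elsewhere stay as they are. A small sketch with made-up values (the frame `toy` and the list `cols` are hypothetical names) shows that a 0 outside the selected columns is kept:

```python
import numpy as np
import pandas as pd

# Made-up frame: column 0 is NOT in the list of columns to clean.
toy = pd.DataFrame({0: [6, 0, 8], 1: [148, 0, 183], 2: [72, 66, 0]})

cols = [1, 2]  # columns where 0 is treated as "missing" in this sketch
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

# isna().sum() counts the NaN cells per column.
print(cleaned)
print(cleaned.isna().sum())
```

Here `isna().sum()` is used as the check, since after the replacement the missing cells are NaN rather than 0.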

 

 

5. Drop the missing values

# The zeros are now NaN, so dropna removes every row that contains a NaN
# inplace=True : modify dataset itself instead of returning a new DataFrame
# axis=0 (default) -> drop rows
# axis=1 -> drop columns

dataset.dropna(inplace=True)

 

 

6. Compare before and after dropping

print("Shape before dropping :", datasetorig.shape)
print("Shape after dropping :", dataset.shape)

After dropping every row that contains a missing value, the number of rows shrinks a lot.

When the data loss is this large, imputing (filling in) the missing values is often used instead of dropping them!
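Before switching to imputation, it may also be worth knowing that `dropna` can be made less aggressive with its `subset` and `thresh` parameters. This is a sketch on made-up values, not something the steps above use:

```python
import numpy as np
import pandas as pd

# Made-up frame with NaN scattered across columns (illustrative only).
toy = pd.DataFrame({
    1: [148.0, np.nan, 183.0, 89.0],
    2: [72.0, 66.0, np.nan, 66.0],
    3: [np.nan, 29.0, np.nan, 23.0],
})

# Default: drop a row if ANY column is NaN.
drop_any = toy.dropna()

# subset: only look at column 1 when deciding which rows to drop.
drop_subset = toy.dropna(subset=[1])

# thresh: keep rows that have at least 2 non-NaN values.
drop_thresh = toy.dropna(thresh=2)

print(len(toy), len(drop_any), len(drop_subset), len(drop_thresh))
```

With this toy data the default keeps only one row, while the other two variants keep three, which shows how much the choice can matter.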

 

 

Imputing missing values

 

1. Load the data again and replace 0 with NaN (missing value)

import numpy as np
import pandas as pd
dataset=pd.read_csv('pima-indians-diabetes.csv', header=None)
datasetorig=dataset.copy()
dataset[[1,2,3,4,5]]=dataset[[1,2,3,4,5]].replace(0, np.nan)

 

 

2. Fill with the mode / median / mean

from sklearn.impute import SimpleImputer
import pandas as pd
values=dataset.values

# strategy='most_frequent' -> mode
# strategy='median' -> median
# strategy='mean' -> mean
imputer=SimpleImputer(missing_values=np.nan, strategy='mean')
dataimputed=pd.DataFrame(imputer.fit_transform(values))
# The number of rows stays the same
print("Shape before imputation :", datasetorig.shape)
print("Shape after imputation :", dataimputed.shape)

# No zeros remain in columns 1-5, so the missing values are gone
num_missing=(dataimputed[[1,2,3,4,5]]==0).sum()
print(num_missing)
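To make it concrete what `strategy='mean'` does, here is a minimal sketch on made-up numbers: each NaN is filled with the mean of the observed values in its own column. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame: column 1 has observed mean 150, column 2 has observed mean 70.
toy = pd.DataFrame({1: [100.0, np.nan, 200.0], 2: [np.nan, 60.0, 80.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns a NumPy array, so the column labels are put back by hand.
filled = pd.DataFrame(imputer.fit_transform(toy.to_numpy()), columns=toy.columns)

print(filled)
print(filled.isna().sum())  # direct check that no NaN is left
```

Swapping in `strategy='median'` or `strategy='most_frequent'` changes only the per-column statistic that is used to fill the gaps.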

 

 

3. Impute missing values with an algorithm

import numpy as np
import pandas as pd
dataset=pd.read_csv('pima-indians-diabetes.csv', header=None)
datasetorig=dataset.copy()
dataset[[1,2,3,4,5]]=dataset[[1,2,3,4,5]].replace(0, np.nan)
# IterativeImputer is experimental, so it has to be enabled explicitly first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer=IterativeImputer()
datatrans=pd.DataFrame(imputer.fit_transform(dataset))
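Roughly speaking, `IterativeImputer` predicts each column's missing values from the other columns, instead of using a single statistic. The sketch below runs it on a tiny made-up frame; the column names `x` and `y` and `random_state=0` are assumptions for the example, and the exact filled values depend on the estimator, so they are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up frame where y is roughly 2 * x (illustrative only).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})

imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Observed cells are left untouched; only the NaN cells get predicted values.
print(filled)
```

Comparing `filled` with `toy` side by side is a quick way to see which cells were estimated.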
# Run this in a terminal to install missingpy for the next two examples
pip install missingpy

 

~Impute with random forest values~

from missingpy import MissForest
imputer=MissForest()
datatrans=pd.DataFrame(imputer.fit_transform(dataset))

 

~Impute with KNN values~

from missingpy import KNNImputer
imputer=KNNImputer()
datatrans=pd.DataFrame(imputer.fit_transform(dataset))
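If `missingpy` is not available, scikit-learn (0.22 and later) ships its own `sklearn.impute.KNNImputer`, which could be used in a similar way. This is a swapped-in alternative rather than the package used above, and the small array and `n_neighbors=2` are just assumptions for the sketch:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small made-up array with two missing cells.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is filled with the mean of that feature over the 2 nearest rows
# (distances are computed on the features that both rows have observed).
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)

print(filled)
```

The missing cell in the first row becomes 4.0 (mean of 3 and 5) and the one in the third row becomes 5.5 (mean of 3 and 8).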