
[Preprocessing] Data Transformation - Normalization

xod22 2022. 3. 6. 00:25

2022.03.04 - [Data Analysis/02. Data Processing] - [Preprocessing] Types of data processing and data cleaning

 


In the previous post, I studied data cleaning.

In this post, I'm going to study Data Transformation, the second type of preprocessing!

 

1. Data Cleaning - handling missing values, etc.

2. Data Transformation - normalization, etc.

3. Data Reduction - dimensionality reduction, etc.


Data Transform

 

: Data is rarely generated in a controlled experimental setting, so it usually has to be transformed before analysis!

There are many normalization methods, but here I'll study two that are especially widely used:

MinMaxScaler, and StandardScaler, which is based on the z-score!

 

 

MinMaxScaler

: A normalization method that rescales the data so that the minimum becomes 0 and the maximum becomes 1
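As a rough illustration of what this means, min-max scaling maps each value x to (x - min) / (max - min). The sketch below is plain Python rather than scikit-learn; the helper name `min_max_scale` and the sample numbers are made up for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] (illustrative helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero, map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample values, not taken from the dataset
sample = [60, 70, 80, 100]
scaled = min_max_scale(sample)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value lands on 0, the largest on 1, and everything else keeps its relative position in between.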

 

1. Import packages

import pandas as pd
import numpy as np

 

2. Data cleaning - handling missing values

Dataset file: pima-indians-diabetes.csv (0.02MB)

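# Load the CSV; there is no header row, so the columns are numbered 0, 1, 2, ...
dataset=pd.read_csv("pima-indians-diabetes.csv", header=None)
# Treat 0 in columns 1-5 as a missing value, then drop every row that has one
dataset[[1,2,3,4,5]]=dataset[[1,2,3,4,5]].replace(0, np.nan)
dataset.dropna(inplace=True)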

 

3. Store only the columns to transform

# columns (2, 5) -> focus on the blood pressure and BMI data
# select with a list; recent pandas versions do not accept a set as an indexer
datablbm=dataset[[2,5]]

 

4. Apply MinMaxScaler

For normalization, the fit -> transform steps must always be carried out in that order: fit learns each column's minimum and maximum, and transform applies them.

from sklearn.preprocessing import MinMaxScaler
minmax=MinMaxScaler()

#fit&transform
minmax.fit(datablbm)
blbmmm=minmax.transform(datablbm)
# convert the result to a DataFrame
blbm=pd.DataFrame(blbmmm)
print(blbm.describe())

You can see that the min and max values have been transformed to 0 and 1!
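One way to see why fit and transform are separate steps: fit learns the minimum and maximum from one set of data, and transform reuses those stored values on whatever data is passed in later. The class below is a hypothetical plain-Python stand-in written for this illustration, not scikit-learn's implementation, and the numbers are made up.

```python
class TinyMinMax:
    """Hypothetical single-column stand-in for a min-max scaler."""

    def fit(self, values):
        # Learn and store the minimum and maximum from the data
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        # Reuse the stored minimum and maximum; nothing is re-learned here
        # (this sketch assumes max > min, so the span is never zero)
        span = self.max_ - self.min_
        return [(v - self.min_) / span for v in values]


scaler = TinyMinMax().fit([50, 75, 100])   # learns min=50, max=100
print(scaler.transform([50, 75, 100]))     # [0.0, 0.5, 1.0]
print(scaler.transform([60, 110]))         # [0.2, 1.2]
```

Values outside the fitted range land outside 0 and 1, which is why a scaler is usually fitted once and then applied unchanged to new data.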

 

 

StandardScaler

: A normalization method that rescales the data so that the mean becomes 0 and the standard deviation becomes 1
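As a rough illustration, the z-score maps each value x to (x - mean) / std. The sketch below is plain Python rather than scikit-learn; the helper name `z_score` and the sample numbers are made up, and it assumes the population standard deviation (dividing by n).

```python
from statistics import fmean, pstdev


def z_score(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]


# Made-up sample values, not taken from the dataset
sample = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std 2
standardized = z_score(sample)
print(standardized)
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each result says how many standard deviations the original value sits above or below the mean.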

 

1. Load the data and clean it

dataset=pd.read_csv("pima-indians-diabetes.csv", header=None)
dataset[[1,2,3,4,5]]=dataset[[1,2,3,4,5]].replace(0, np.nan)
dataset.dropna(inplace=True)

 

2. Store only the columns to normalize

# columns (2, 5) -> focus on the blood pressure and BMI data
# select with a list; recent pandas versions do not accept a set as an indexer
datablbm=dataset[[2,5]]

 

3. Apply StandardScaler

from sklearn.preprocessing import StandardScaler
stand=StandardScaler()
stand.fit(datablbm)
blbmst=stand.transform(datablbm)
# convert the result to a DataFrame
blbm2=pd.DataFrame(blbmst)
print(blbm2.describe())
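One thing to keep in mind when reading this output: the std row may show a value slightly above 1 rather than exactly 1. StandardScaler divides by the population standard deviation (dividing by n), while describe() reports the sample standard deviation (dividing by n - 1). The plain-Python sketch below uses made-up numbers, not the dataset, to show the gap.

```python
from statistics import fmean, pstdev, stdev

# Made-up sample values, not taken from the dataset
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mu = fmean(sample)
sigma = pstdev(sample)                      # population std (divides by n)
scaled = [(v - mu) / sigma for v in sample]

print(pstdev(scaled))   # 1.0
print(stdev(scaled))    # sample std (divides by n - 1), slightly above 1
```

The gap shrinks as the number of rows grows, so a std of roughly 1 in the output is what to expect.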
