
[Statistical Modeling] Time Series Analysis - Handling Missing Values and Smoothing

xod22 2022. 3. 20. 12:19

2022.03.19 - [Data Analysis/04. Data Analysis] - [Statistical Modeling] Time Series Analysis - Stationarity and Differencing

 


In the last post, I studied stationarity and differencing.

Today I'll continue with how to handle missing values in time series analysis, and with smoothing, a way to view the trend more smoothly!

 

 

Handling Missing Values

 

: Time series data sometimes contains missing values, meaning the data for that interval was not measured or is not available.

If the data is not stationary, you should not fill missing values with the "mean" of the series. In that case, forward-fill, which copies the previous value into the gap, is a crude but quick fix. Depending on the properties of the series, though, several other methods are available, as listed below!

 

1. forward/backward fill
2. Interpolation(๋ณด๊ฐ„๋ฒ•)
3. Mean of nearest neighbors

 

~1. forward/backward fill~

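import pandas as pd

df=pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')

#forward fill : ffill() copies the last observed value into each gap
#backward fill : bfill() copies the next observed value into each gap
df_ffill=df.ffill()
df_bfill=df.bfill()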
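
To see what the two calls do, and why the mean is a poor fill for a trending series, here is a minimal sketch on a made-up series. The values and the name `s` are illustrative assumptions, not taken from a10.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical upward-trending series with two gaps (not from a10.csv)
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

filled_ffill = s.ffill()          # carry the previous observation forward
filled_bfill = s.bfill()          # pull the next observation backward
filled_mean = s.fillna(s.mean())  # one global value for every gap

print(filled_ffill.tolist())  # [1.0, 2.0, 2.0, 4.0, 4.0, 6.0]
print(filled_bfill.tolist())  # [1.0, 2.0, 4.0, 4.0, 6.0, 6.0]
print(filled_mean.tolist())   # [1.0, 2.0, 3.25, 4.0, 3.25, 6.0]
```

With the mean fill, the second gap gets 3.25 even though the value just before it is already 4.0, so the filled point falls against the trend. Forward and backward fill at least stay between the neighboring observations.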

 

~2. Interpolation(๋ณด๊ฐ„๋ฒ•)~

 

- Linear

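import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

df=pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')

#use the row number as the x-axis and fit only on rows that have a value
df['rownum']=np.arange(df.shape[0])
df_nona=df.dropna(subset=['value'])

#when kind is not specified, the default is 'linear'
#note: interp1d raises an error if the first or last row is missing (outside the fitted range)
f=interp1d(df_nona['rownum'], df_nona['value'])
df['linear_fill']=f(df['rownum'])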

 

- Cubic : allows a finer fit!

import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

df=pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')

df['rownum']=np.arange(df.shape[0])
df_nona=df.dropna(subset=['value'])

#kind='cubic' fits a cubic spline through the observed points
f2=interp1d(df_nona['rownum'], df_nona['value'], kind='cubic')
df['cubic_fill']=f2(df['rownum'])
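
pandas also offers `Series.interpolate`, which can stand in for the manual `rownum` plus `interp1d` steps when only the gaps need filling. This is an alternative route, not the same code path as above, and the toy values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-point gap
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

# method="linear" treats the observations as equally spaced,
# which matches interpolating over the row number
linear_fill = s.interpolate(method="linear")
print(linear_fill.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

`method="cubic"` is also accepted, but pandas hands it off to SciPy, so SciPy still has to be installed.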

 

~3. Mean of nearest neighbors~

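import numpy as np

def knn_mean(ts, n):
    #replace each NaN with the mean of up to n/2 neighbors on each side
    out=np.copy(ts)
    n_by_2=int(np.ceil(n/2))
    for i, val in enumerate(ts):
        if np.isnan(val):
            lower=max(0, i-n_by_2)
            upper=min(len(ts), i+n_by_2+1)
            #skip position i itself; nanmean ignores neighbors that are also NaN
            ts_near=np.concatenate([ts[lower:i], ts[i+1:upper]])
            out[i]=np.nanmean(ts_near)
    return out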
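
The same idea can be sketched with a pandas rolling window instead of an explicit loop. The helper name `rolling_neighbor_fill` and the sample values are hypothetical, and the window here is counted in total points (including the center) rather than in neighbors.

```python
import numpy as np
import pandas as pd

def rolling_neighbor_fill(s: pd.Series, window: int) -> pd.Series:
    """Fill NaNs with the mean of a centered window (hypothetical helper)."""
    # min_periods=1 lets the mean be computed from whatever non-NaN
    # values fall inside the window; NaNs are skipped
    local_mean = s.rolling(window, center=True, min_periods=1).mean()
    # only the missing positions are replaced; observed values stay as they are
    return s.fillna(local_mean)

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
filled = rolling_neighbor_fill(s, window=3)
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

If every value inside a window is missing, the gap stays NaN, so a wider window or a second pass may be needed for long runs of missing data.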

 

 

Smoothing

 

: Smoothing removes the fine fluctuations between time steps,

so the trend is easier to see => the effect of noise (errors) is reduced and the series becomes easier to explain.

 

- Moving average : the mean of the values at t-1, t, and t+1
- Loess smoothing : looks at the local trend rather than each individual value (locally weighted regression)
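
As a quick worked example of the moving average, here is a minimal sketch on a made-up five-point series (the numbers are assumptions chosen to make the arithmetic easy to follow).

```python
import pandas as pd

# Hypothetical series with a spike in the middle
s = pd.Series([2.0, 4.0, 9.0, 4.0, 2.0])

# window=3 with center=True averages each point with its two neighbors
ma3 = s.rolling(3, center=True).mean()
print(ma3.round(2).tolist())  # [nan, 5.0, 5.67, 5.0, nan]
```

The spike of 9 is pulled down to (4 + 9 + 4) / 3, about 5.67, and the two end points are NaN because they have no neighbor on one side.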

 

 

1. Load the data

import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

df_orig=pd.read_csv('elecequip.csv', parse_dates=['date'], index_col='date')

 

2. moving average

#centered 3-point window: mean of the previous, current, and next value
df_ma=df_orig.value.rolling(3, center=True).mean()

 

3. Loess Smoothing(5% and 15%)

x=np.arange(len(df_orig.value))
#frac is the fraction of the data used for each local fit; column 1 of the result holds the smoothed values
df_loess_5=pd.DataFrame(lowess(df_orig.value, x, frac=0.05)[:, 1], index=df_orig.index, columns=['value'])
df_loess_15=pd.DataFrame(lowess(df_orig.value, x, frac=0.15)[:, 1], index=df_orig.index, columns=['value'])
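
To get a feel for what `frac` changes, here is a minimal sketch on synthetic data rather than elecequip.csv. The noisy sine signal and the `roughness` helper are assumptions made up for this illustration.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.arange(200)
# Synthetic signal: a slow sine wave plus Gaussian noise
y = np.sin(x / 20) + rng.normal(scale=0.3, size=x.size)

# lowess returns an (n, 2) array sorted by x: column 0 is x, column 1 the fit
fit_5 = lowess(y, x, frac=0.05)[:, 1]
fit_15 = lowess(y, x, frac=0.15)[:, 1]

def roughness(v):
    """Sum of squared second differences; smaller means smoother (hypothetical helper)."""
    return float(np.sum(np.diff(v, n=2) ** 2))

print(roughness(y), roughness(fit_5), roughness(fit_15))
```

A larger `frac` uses more points for each local regression, so the 15% curve should come out smoother than the 5% curve, at the cost of following short-lived bumps less closely.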

 

4. plot

 

~original plot~

import matplotlib.pyplot as plt

df_orig.plot()
plt.show()

 

~5%, 15% smoothing plot~

df_loess_5.plot()
df_loess_15.plot()
plt.show()

 

~moving average plot~

df_ma.plot()
plt.show()

 
