
[톡계적 λͺ¨λΈλ§] μ„ ν˜•νšŒκ·€, λ‘œμ§€μŠ€ν‹±νšŒκ·€

xod22 2022. 3. 18. 00:20
Linear Regression Overview

 

1. Import packages

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

 

2. 데이터 생성

y=np.array([1,2,3,4,5])
x1=np.array([6,7,8,9,10])
x2=np.array([11,12,13,14,15])

data={"y":y, "x1":x1, "x2":x2}

 

3. Create the model

# dmatrices returns the response and the design matrix (note: this rebinds y)
y, X = patsy.dmatrices("y ~ 1 + x1 + x2 + x1:x2", data, return_type="dataframe")
model = sm.OLS(y, X)
result = model.fit()
result.params

 

~Use "*" instead of ":"~

y,X=patsy.dmatrices("y~x1*x2", data, return_type="dataframe")
model2=sm.OLS(y,X)
result2=model2.fit()
result2.params

This gives the same result as the formula above, because x1*x2 expands to x1 + x2 + x1:x2..!
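As a quick check that the two formulas agree, the sketch below builds both design matrices from the same toy data and compares them. The names `X_explicit` and `X_star` are made up here for illustration only.

```python
import numpy as np
import patsy

# Same toy data as in the post
data = {
    "y": np.array([1, 2, 3, 4, 5]),
    "x1": np.array([6, 7, 8, 9, 10]),
    "x2": np.array([11, 12, 13, 14, 15]),
}

# Explicit main effects plus interaction
_, X_explicit = patsy.dmatrices("y ~ 1 + x1 + x2 + x1:x2", data, return_type="dataframe")
# Shorthand: x1*x2 means x1 + x2 + x1:x2 (the intercept is added implicitly)
_, X_star = patsy.dmatrices("y ~ x1 * x2", data, return_type="dataframe")

explicit_columns = list(X_explicit.columns)
star_columns = list(X_star.columns)
same_design = np.allclose(X_explicit.values, X_star.values)

print(explicit_columns)
print(star_columns)
print(same_design)
```

Since the design matrices are identical, the two OLS fits must be identical as well.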

 

y,X=patsy.dmatrices("y~x1+x2", data, return_type="dataframe")
model3=sm.OLS(y,X)
result3=model3.fit()
result3.params

 

~Log and trigonometric functions can also be used in a formula~

y,X=patsy.dmatrices("y~np.log(x1)+np.cos(x2)+np.sin(x1+x2)", data, return_type="dataframe")
model4=sm.OLS(y,X)
result4=model4.fit()
result4.params
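To see what Patsy actually does with function calls in a formula, the sketch below compares the generated columns with the same transforms computed by hand in NumPy. Note that inside a function call, `x1 + x2` is ordinary arithmetic rather than "add two terms". The `expected` and `matches` names are illustrative.

```python
import numpy as np
import patsy

data = {
    "y": np.array([1, 2, 3, 4, 5]),
    "x1": np.array([6, 7, 8, 9, 10]),
    "x2": np.array([11, 12, 13, 14, 15]),
}

_, X_fn = patsy.dmatrices(
    "y ~ np.log(x1) + np.cos(x2) + np.sin(x1 + x2)", data, return_type="dataframe"
)

# The same transforms computed directly with NumPy
expected = {
    "log": np.log(data["x1"]),
    "cos": np.cos(data["x2"]),
    "sin": np.sin(data["x1"] + data["x2"]),
}

# For each transform, check that some column of the design matrix matches it
matches = {
    name: any(np.allclose(X_fn[col].values, values) for col in X_fn.columns)
    for name, values in expected.items()
}

print(list(X_fn.columns))
print(matches)
```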

 

Creating categories

 

: The Patsy package can create categorical variables.

Wrapping a numeric variable manually as C(x1) makes Patsy treat it as categorical..!

y,X=patsy.dmatrices("y~-1+C(x1)", data=data, return_type="dataframe")
print(X)
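The effect of `-1` in the formula above is easier to see side by side. In this sketch (names `X_cat` and `X_cat_intercept` are illustrative), dropping the intercept yields one indicator column per level, while keeping it makes Patsy use the first level as the reference and encode only the remaining levels.

```python
import numpy as np
import patsy

data = {
    "y": np.array([1, 2, 3, 4, 5]),
    "x1": np.array([6, 7, 8, 9, 10]),
}

# No intercept: one 0/1 indicator column for each of the five levels
_, X_cat = patsy.dmatrices("y ~ -1 + C(x1)", data, return_type="dataframe")

# With intercept: the first level is absorbed into the intercept,
# so only four indicator columns remain next to it
_, X_cat_intercept = patsy.dmatrices("y ~ C(x1)", data, return_type="dataframe")

is_one_hot = np.array_equal(X_cat.values, np.eye(5))

print(X_cat)
print(X_cat_intercept)
print(is_one_hot)
```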

 


1. Linear regression analysis

 

# Ice cream data: ice cream consumption, customer income, ice cream price, and temperature

dataset=sm.datasets.get_rdataset("Icecream", "Ecdat")
model=smf.ols("cons~1+income+price+temp", data=dataset.data)
result=model.fit()
print(result.summary())

 

~ Regression of consumption on price and temperature, excluding income ~

# Regression of consumption on price and temperature
model=smf.ols("cons~1+price+temp", data=dataset.data)
result=model.fit()
print(result.summary())
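One way to judge whether dropping income costs anything is to compare the two nested fits directly instead of reading two summaries. The sketch below does this with `compare_f_test`. To keep it runnable offline it uses synthetic data with the same column names; the numbers in the data-generating line are assumptions invented for illustration, not properties of the real Icecream dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Icecream data (hypothetical values)
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "income": rng.normal(85, 6, n),
    "price": rng.normal(0.275, 0.01, n),
    "temp": rng.normal(50, 16, n),
})
df["cons"] = (
    0.2 + 0.003 * df["temp"] - 1.0 * df["price"] + 0.003 * df["income"]
    + rng.normal(0, 0.03, n)
)

full = smf.ols("cons ~ 1 + income + price + temp", data=df).fit()
reduced = smf.ols("cons ~ 1 + price + temp", data=df).fit()

# F-test of the full model against the restricted (nested) model
f_value, p_value, df_diff = full.compare_f_test(reduced)

print("R-squared:", full.rsquared, reduced.rsquared)
print("AIC:", full.aic, reduced.aic)
print("F-test:", f_value, p_value, df_diff)
```

The same three calls should work unchanged on `dataset.data` from the post; a small p-value would suggest that income adds explanatory power beyond price and temperature.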

 


2. Discrete regression analysis: logistic regression analysis

 

~Load the iris data~

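# Load iris and keep only the two species to be separated
df = sm.datasets.get_rdataset("iris").data
df_subset = df[df.Species.isin(["versicolor", "virginica"])].copy()
# Encode the binary response: versicolor -> 1, virginica -> 0
df_subset.Species = df_subset.Species.map({"versicolor": 1, "virginica": 0})
# Replace the dots in the column names so they can be used in formulas
df_subset.rename(columns={"Sepal.Length": "Sepal_Length", "Sepal.Width": "Sepal_Width", "Petal.Length": "Petal_Length", "Petal.Width": "Petal_Width"}, inplace=True)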

 

~Create the logistic regression model~

model = smf.logit("Species ~ Petal_Length + Petal_Width", data=df_subset)
result = model.fit()
print(result.summary())
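Beyond the summary table, a fitted logit model can be turned into class predictions by thresholding the predicted probabilities at 0.5. The sketch below shows that step. It runs on synthetic two-class data with the same column names so that it needs no download; the cluster means and spreads are assumptions chosen for illustration, not the real iris values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for df_subset (hypothetical, overlapping clusters)
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "Petal_Length": np.concatenate([rng.normal(5.2, 0.6, n), rng.normal(4.6, 0.6, n)]),
    "Petal_Width": np.concatenate([rng.normal(1.9, 0.3, n), rng.normal(1.5, 0.3, n)]),
    "Species": np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)]),
})

result = smf.logit("Species ~ Petal_Length + Petal_Width", data=df).fit(disp=0)

# Predicted probability that Species == 1 for each row
prob = result.predict(df)
# Classify with a 0.5 threshold and measure in-sample accuracy
pred_class = (prob >= 0.5).astype(int)
accuracy = (pred_class == df["Species"]).mean()
# 2x2 table: rows are actual classes, columns are predicted classes
confusion = result.pred_table()

print(accuracy)
print(confusion)
```

In-sample accuracy is optimistic; for a fairer estimate the data would be split into training and test sets first.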

 

~plot~

# Scatter plot of both species with the fitted decision boundary
params = result.params
# Boundary where the predicted probability is 0.5:
# Petal_Width = alpha0 + alpha1 * Petal_Length
alpha0 = -params['Intercept'] / params['Petal_Width']
alpha1 = -params['Petal_Length'] / params['Petal_Width']
_x = np.array([3.0, 7.0])
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.plot(df_subset[df_subset.Species==0].Petal_Length.values, df_subset[df_subset.Species==0].Petal_Width.values, 's', label='virginica')
ax.plot(df_subset[df_subset.Species==1].Petal_Length.values, df_subset[df_subset.Species==1].Petal_Width.values, 's', label='versicolor')
ax.plot(_x, alpha0 + alpha1 * _x)
ax.set_xlabel('Petal length')
ax.set_ylabel('Petal width')
ax.legend()
plt.show()

=> 잘 λΆ„λ₯˜ν•˜κ³  μžˆμŒμ„ 확인..!

728x90