
[Python] Analyzing the Distribution of General Hospitals in Seoul

xod22 2022. 3. 24. 15:22
Data
 

소상공인시장진흥공단_상가(상권)정보_의료기관_20190930 (Small Enterprise and Market Service, store (commercial district) information, medical institutions, 2019-09-30)

This dataset covers the medical institutions in the store (commercial district) information. For each institution it provides the business name, address, and the mid-level and sub-level commercial category names.

www.data.go.kr

-> Download the data from the link.

 

 

Hypothesis

Are general hospitals evenly distributed across Seoul?

 


1. Loading the data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# render plots inside the notebook
%matplotlib inline

# set a font that supports Korean
plt.rc('font', family='Malgun Gothic')
# use the ASCII hyphen for minus so it does not render as a broken glyph
plt.rc('axes', unicode_minus=False)

- With matplotlib's default settings, Korean text and the minus sign render as broken glyphs, so the font has to be changed to one that supports Korean.

- plt.rc('font', family='Malgun Gothic'): sets the font matplotlib uses to Malgun Gothic, which supports Korean.
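Malgun Gothic ships with Windows, so the setting above only helps on that platform. A minimal sketch of picking the font per operating system follows. The helper names `pick_korean_font` and `apply_korean_font` are hypothetical, and the macOS and Linux font names (AppleGothic, NanumGothic) are assumptions that only work if those fonts are actually installed.

```python
import platform


def pick_korean_font(system=None):
    """Return a font family name that usually supports Korean on the given OS."""
    system = system or platform.system()
    if system == "Windows":
        return "Malgun Gothic"
    if system == "Darwin":  # macOS
        return "AppleGothic"
    # Assumption: NanumGothic has been installed; it is not present by default.
    return "NanumGothic"


def apply_korean_font():
    """Apply the chosen font and keep the minus sign readable."""
    # Imported here so pick_korean_font() stays usable without matplotlib.
    import matplotlib.pyplot as plt

    plt.rc("font", family=pick_korean_font())
    plt.rc("axes", unicode_minus=False)
```

Calling `apply_korean_font()` once at the top of the notebook would replace the two `plt.rc` lines.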

 

df=pd.read_csv('소상공인시장진흥공단_상가업소정보_의료기관_201909.csv', encoding='cp949')

# number of rows and columns
df.shape

 


2. Preprocessing

~Count missing values~

null_count=df.isnull().sum()
null_count

 

 

~Check missing values visually~

null_count.plot.barh(figsize=(5,10))

 

 

~DataFrame of missing-value counts -> rename columns~

df_null_count=null_count.reset_index()
print(df_null_count.head())

# rename the columns
df_null_count.columns=['컬럼명','결측치수']
print(df_null_count.head())

-> The columns are now named 컬럼명 (column name) and 결측치수 (missing-value count).
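To see what `reset_index()` does to the counts before the rename, here is a small sketch on a made-up three-column frame. The toy data and the English column names are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "name": ["A", "B", None],
    "floor": [np.nan, np.nan, 3.0],
    "city": ["Seoul", "Busan", "Seoul"],
})

null_count = toy.isnull().sum()           # Series: column name -> number of missing values
df_null_count = null_count.reset_index()  # DataFrame with default columns "index" and 0
df_null_count.columns = ["column", "missing"]

print(df_null_count)
```

The default column names `index` and `0` are not descriptive, which is why the rename step is worth doing.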

 

 

~Sort by missing-value count in descending order and keep the top 10~

df_null_count_top=df_null_count.sort_values(by="결측치수", ascending=False).head(10)
df_null_count_top

 

 

~Save the names of the columns with the most missing values as a list~

drop_columns=df_null_count_top["컬럼명"].tolist()
drop_columns

 

 

~Drop those columns~

print("Original data:", df.shape)
df=df.drop(drop_columns, axis=1)
print("After drop:", df.shape)
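Dropping the ten columns with the most missing values is a fixed cut-off. As an alternative (not what this post does), columns could be dropped by their missing-value ratio. A sketch on toy data, where the 0.5 threshold and the column names are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "floor": [np.nan, np.nan, np.nan, 2.0],   # 75% missing
    "unit": [np.nan, "101", "102", "103"],    # 25% missing
})

threshold = 0.5                      # assumption: drop columns that are more than half empty
missing_ratio = toy.isnull().mean()  # fraction of missing values per column
drop_columns = missing_ratio[missing_ratio > threshold].index.tolist()

cleaned = toy.drop(columns=drop_columns)
print(drop_columns, cleaned.shape)
```

A ratio-based rule keeps working when the number of columns changes, whereas a top-10 rule always removes exactly ten.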

 


3. Visualization

~Count medical institutions by "시도명" (province/city name)~

city=df["시도명"].value_counts()
city

 

 

~Normalize (convert counts to proportions)~

city_normalize=df["시도명"].value_counts(normalize=True)
city_normalize
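The difference between the two calls is easiest to see on a tiny example. The labels below are made up and only show the shape of the result.

```python
import pandas as pd

# Made-up labels, not taken from the dataset.
sido = pd.Series(["Seoul", "Seoul", "Busan", "Seoul", "Daegu"])

counts = sido.value_counts()                 # absolute counts, largest first
shares = sido.value_counts(normalize=True)   # the same counts divided by the total

print(counts)
print(shares)
```

The normalized values always sum to 1, which is what makes them suitable for the pie chart later on.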

 

 

~Visualize the number of medical institutions by province/city~

# horizontal bar chart
# city => the value_counts() result from above
city.plot.barh()

# seaborn countplot instead of barh
sns.countplot(data=df, y='시도명')

# pie chart
city_normalize.plot.pie(figsize=(7,7))

 


4. Analysis for testing the hypothesis

~Extract the Seoul general hospital data~

df_seoul_hospital=df[(df["상권업종소분류명"]=="종합병원") & (df["시도명"]=="서울특별시")]
df_seoul_hospital

- Select the rows where "상권업종소분류명" (sub-level commercial category) is "종합병원" (general hospital) and "시도명" is "서울특별시" (Seoul).

- Assign the result to df_seoul_hospital.
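The two-condition filter can be sketched on a toy frame to show how the boolean masks combine. The English column names and values here are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "category": ["general hospital", "dentist", "general hospital", "general hospital"],
    "city": ["Seoul", "Seoul", "Busan", "Seoul"],
    "name": ["A Hospital", "B Dental", "C Hospital", "D Hospital"],
})

is_general = toy["category"] == "general hospital"   # boolean Series
in_seoul = toy["city"] == "Seoul"                     # boolean Series

# & combines the masks element-wise. When the comparisons are written inline,
# each one needs its own parentheses because & binds more tightly than ==.
seoul_general = toy[is_general & in_seoul].copy()
print(seoul_general)
```

Taking a `.copy()` of the filtered frame is optional here, but it avoids chained-assignment warnings if the subset is modified later.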

The Seoul general hospital rows have been extracted, but the business names (상호명) show that some rows are not actually general hospitals.
=> These need to be removed.

 

 

~Find rows whose business name does not contain "종합병원"~

df_seoul_hospital.loc[~df_seoul_hospital["상호명"].str.contains("종합병원"), "상호명"].unique()

- The names that do not contain "종합병원" include places unrelated to general hospitals, such as 꽃배달 (flower delivery), 의료기 (medical devices), 장례식장 (funeral hall), 상담소 (counseling center), and 어린이집 (daycare center). => Delete them.
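The `~` in front of `str.contains` flips the match. A small sketch with made-up English names shows the effect, and also how `na=False` keeps the mask purely boolean when a name is missing (without it, a missing name yields a missing value in the mask, and negating that can fail).

```python
import pandas as pd

# Made-up names; None stands in for a missing business name.
names = pd.Series([
    "Alpha General Hospital",
    "Alpha Flower Delivery",
    None,
    "Beta General Hospital",
])

# na=False treats missing names as "no match".
has_keyword = names.str.contains("General Hospital", na=False)

# ~ flips the mask: True now marks names WITHOUT the keyword (including the missing one).
without_keyword = names[~has_keyword]
print(without_keyword.tolist())
```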

 

 

~Remove 꽃배달/의료기/장례식장/상담소/어린이집 rows, which are unrelated to general hospitals~

# index of the rows to drop
drop_row=df_seoul_hospital[df_seoul_hospital["상호명"].str.contains("꽃배달|의료기|장례식장|상담소|어린이집")].index

# convert the index to a list
drop_row=drop_row.tolist()
drop_row

 

 

~Names ending in "의원" (clinic) cannot be treated as general hospitals either, so remove them~

# index of the rows to drop
drop_row2=df_seoul_hospital[df_seoul_hospital["상호명"].str.endswith("의원")].index

# convert the index to a list
drop_row2=drop_row2.tolist()
drop_row2
# add these rows to drop_row
drop_row=drop_row+drop_row2
print(len(drop_row))

# drop the rows and compare the row counts before and after
print("Before drop:", df_seoul_hospital.shape)
df_seoul_hospital=df_seoul_hospital.drop(drop_row, axis=0)
print("After drop:", df_seoul_hospital.shape)
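The same clean-up can also be written with boolean masks instead of collecting index lists. This is an alternative sketch, not the approach used above, and the English names and keywords are made up for illustration.

```python
import pandas as pd

# Toy frame (names are made up).
toy = pd.DataFrame({"name": [
    "Alpha General Hospital",
    "Alpha Hospital Funeral Hall",
    "Beta Clinic",
    "Gamma General Hospital Daycare",
    "Delta General Hospital",
]})

# | inside the pattern means "or" in the regular expression.
unrelated = toy["name"].str.contains("Funeral Hall|Daycare", na=False)
clinic = toy["name"].str.endswith("Clinic", na=False)

# Keep only rows matched by neither mask; no index lists are needed.
kept = toy[~(unrelated | clinic)]
print("before:", toy.shape, "after:", kept.shape)
```

Because the masks are combined with `|`, a row that matches both conditions is still counted only once.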

 


5. Visualization after the analysis

~Visualize the number of general hospitals by district ("시군구명")~

df_seoul_hospital["시군구명"].value_counts().plot.bar()

-> Gangnam-gu, Yeongdeungpo-gu, and Jung-gu have the most general hospitals, in that order.

# the same chart with countplot
plt.figure(figsize=(15,4))
sns.countplot(data=df_seoul_hospital, x="시군구명", order=df_seoul_hospital["시군구명"].value_counts().index)

-> This chart shows the same ranking.

 

 

Conclusion

Q: Are general hospitals evenly distributed across Seoul?

- Gangnam-gu, Yeongdeungpo-gu, Jung-gu, and Yangcheon-gu have the most general hospitals, in that order.

- Gangnam-gu has by far the most general hospitals, so general hospitals in Seoul cannot be said to be evenly distributed.
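One way to put a number on "not evenly distributed" is to compare the observed per-district counts with the counts expected if hospitals were spread uniformly. The sketch below computes a chi-square goodness-of-fit statistic. The district labels and counts are made up for illustration and are not the results of this analysis, and the helper name `chi_square_uniform` is hypothetical.

```python
def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against a uniform expectation."""
    total = sum(counts.values())
    expected = total / len(counts)  # the same count in every district
    return sum((observed - expected) ** 2 / expected for observed in counts.values())


# Made-up counts for four hypothetical districts.
example = {"district_a": 10, "district_b": 4, "district_c": 3, "district_d": 3}

stat = chi_square_uniform(example)
top_share = max(example.values()) / sum(example.values())
print(f"chi-square = {stat:.2f}, largest share = {top_share:.0%}")
```

A statistic of 0 means perfectly even counts, and larger values mean the hospitals are more concentrated in a few districts.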
