
[Crawling] Collecting Tabular Data from the Web

xod22 2022. 3. 4. 00:02

Here is a method you can use when you want to scrape table data from the web directly with Python!

 

Hands-on

 

Let's use the "Presidential election results" table on Wikipedia as an example.

https://en.wikipedia.org/wiki/Politics_of_Pennsylvania 

 


 

On that page, the table we want to fetch is "Presidential election results"!

 

 

1. Fetching the web data

import pandas as pd
import numpy as np
table_PA = pd.read_html('http://en.wikipedia.org/wiki/Politics_of_Pennsylvania')

 

 

2. Extracting only the Presidential election results table

len(table_PA)  # the page contains 5 tables (all stored in table_PA); we only want the presidential election results
table_PA = pd.read_html('http://en.wikipedia.org/wiki/Politics_of_Pennsylvania', match='Presidential election results')  # fetch only the tables whose text matches 'Presidential election results'
len(table_PA)  # returns 1, meaning exactly one table matched!
df = table_PA[0]

* pd.read_html returns a list of DataFrames, so even with a single match we store table_PA[0] in df!

-> Checking the first 5 rows with df.head() confirms that the table was collected correctly!
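
As a rough offline illustration of how match narrows the list that pd.read_html returns, here is a sketch that runs against a small made-up HTML snippet instead of the Wikipedia page. The two tables, their captions, and every value in them are hypothetical, and the sketch assumes an HTML parser such as lxml is installed, since pd.read_html needs one.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, invented for illustration (not the Wikipedia markup).
html = """
<table>
  <caption>Presidential election results</caption>
  <tr><th>Year</th><th>Democratic</th><th>Republican</th></tr>
  <tr><td>2000</td><td>50.0%</td><td>46.0%</td></tr>
  <tr><td>2004</td><td>51.0%</td><td>48.0%</td></tr>
</table>
<table>
  <caption>Some other table</caption>
  <tr><th>Office</th><th>Holder</th></tr>
  <tr><td>Example office</td><td>Example person</td></tr>
</table>
"""

# Without match, every <table> on the page comes back as its own DataFrame.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are returned (still as a list).
matched = pd.read_html(StringIO(html), match="Presidential election results")

df_demo = matched[0]  # pick the single matching DataFrame out of the list
print(len(all_tables), len(matched))
print(df_demo.head())
```

Because the result is always a list, indexing with [0] is needed even when only one table matches.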

 

 

3. Preprocessing (converting to numbers)

For analysis, the values need to be converted to a numeric type, but they contain "%", so .apply(pd.to_numeric) cannot be applied to them as-is..!

So we have to remove the "%" first and only then apply the function!

df.head()
df.info()  # this shows the values are stored as strings, not numbers, so we convert them

# remove the "%" character
df['Democratic'] = df['Democratic'].str[:4]  # keep only the first 4 characters of Democratic; this works because all the percentage strings have the same length
df['Republican'] = df['Republican'].str[:4]

df[['Democratic', 'Republican']] = df[['Democratic', 'Republican']].apply(pd.to_numeric)
df.info()
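
The 4-character slice above depends on every value having the same width. As a hedged alternative (not the method used in this post), the sketch below strips the trailing "%" with str.rstrip, so values of different lengths also convert cleanly. The sample DataFrame is made up for illustration and is not taken from the Wikipedia table.

```python
import pandas as pd

# Hypothetical sample values for illustration only (not real election data).
df_pct = pd.DataFrame({
    "Democratic": ["50.0%", "9.5%", "100.0%"],
    "Republican": ["46.0%", "88.25%", "0.0%"],
})

cols = ["Democratic", "Republican"]
for col in cols:
    # Strip the trailing "%" no matter how many characters come before it.
    df_pct[col] = df_pct[col].str.rstrip("%")

# errors="coerce" turns anything that still is not a number into NaN instead of raising.
df_pct[cols] = df_pct[cols].apply(pd.to_numeric, errors="coerce")

print(df_pct.dtypes)
print(df_pct)
```

With fixed-width data like the table in this post, both approaches should give the same numbers; the difference only shows up when a value is shorter or longer than 4 characters.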

 

 

4. Confirm that the data was converted correctly!


 
