헬창 개발자

4. Are Self-Service Gas Stations Really Cheaper?

Data Analysis

찬배 2022. 2. 15. 21:34

Prerequisite reading

Selenium Basics

https://health-coding.tistory.com/37

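Learning Objective

Understand the Selenium module

0. Before We Begin

Selenium VS BeautifulSoup

Selenium	BeautifulSoup
Drives a web browser	Parses HTML
Easily retrieves data that is generated by JavaScript rendering.	Downloads the HTML with requests (a third-party package) or the built-in urllib, then extracts the data with BeautifulSoup.
Because it actually runs a web browser, it is much slower and uses relatively more memory.	Because it only downloads the HTML from the server, sites that need JavaScript rendering are hard to crawl.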
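
To make the static-parsing column concrete, here is a minimal sketch. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. The HTML string and the `price` class name are made up for illustration. The parser only sees what is literally in the downloaded HTML, so the value that the script would fill in later never shows up.

```python
from html.parser import HTMLParser

# Hypothetical page: the second price is meant to be filled in by JavaScript.
STATIC_HTML = """
<ul>
  <li class="price">1650</li>
  <li class="price" id="late"></li>
</ul>
<script>document.getElementById("late").textContent = "1599";</script>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <li class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True
            self.prices.append("")

    def handle_data(self, data):
        # Script text also arrives here, but in_price is False at that point.
        if self.in_price:
            self.prices[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_price = False


parser = PriceParser()
parser.feed(STATIC_HTML)
print(parser.prices)  # the second entry stays empty: no JavaScript was run
```

A browser driven by Selenium would execute the script first, which is why it can read such values at the cost of speed and memory.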

What is XPath?

<!-- What we need to find is not just a single tag such as body.
Each tag also has attributes, attribute values, and content. -->

<html>
<body style="s2" class="top"> test </body>
</html>


<?xml version="1.0" encoding="UTF-8"?>
 
<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>
 
<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>

To select the first book: /bookstore/book[1]
To apply a condition (predicate): /bookstore/book[price>35.00]
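
The two expressions can be tried out on the bookstore document. This is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath: positional predicates such as book[1] work, but numeric comparisons such as price>35.00 do not, so the second query is emulated here with a plain Python filter. A full XPath 1.0 engine such as lxml would accept the expression directly.

```python
import xml.etree.ElementTree as ET

# Same bookstore document as above (XML declaration omitted for brevity)
XML = """<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# /bookstore/book[1]: relative to the root this is "book[1]" (XPath counts from 1)
first = root.find("book[1]")
print(first.findtext("title"))

# /bookstore/book[price>35.00]: emulated, since ElementTree has no numeric predicates
expensive = [
    book.findtext("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35.00
]
print(expensive)
```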

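1. Checking Gas Station Information

Fuel prices in South Korea, by region: Opinet (https://www.opinet.co.kr/searRgSelect.do)
- Target data: brand, price, self-service or not, location
- The URL does not change when the displayed data changes, so the data is hard to reach with BeautifulSoup alone

Getting the gas station price data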

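from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service('../BI/driver/chromedriver.exe'))
driver.get("https://www.opinet.co.kr/searRgSelect.do")

time.sleep(3)  # give the page time to load
# XPath of the "cheapest gas stations" menu item
xpath = '//*[@id="quick_ul"]/li[2]/a/span'
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
driver.find_element(By.XPATH, xpath).click()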

2. Saving Gas Station Price Data

Accessing the city and district dropdowns with XPath

from selenium.webdriver.common.by import By
from tqdm import notebook

# XPath of the city/province option
xpath = '//*[@id="SIDO_NM0"]/option[2]'
driver.find_element(By.XPATH, xpath).click()

for n in notebook.tqdm(range(1, 27)):
    # XPath of the n-th district option
    xpath = '//*[@id="SIGUNGU_NM0"]/option[' + str(n) + ']'
    driver.find_element(By.XPATH, xpath).click()
    time.sleep(3)
    # Click the Excel download button
    xpath = '//*[@id="glopopd_excel"]/span'
    driver.find_element(By.XPATH, xpath).click()
    time.sleep(3)
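
The fixed time.sleep(3) after each download click assumes that every file arrives within three seconds. As one possible alternative, the sketch below polls the download folder until a new .xls file appears. The helper name wait_for_new_file, the timeout values, and the download folder in the usage comment are illustrative assumptions, not part of the original post.

```python
import time
from pathlib import Path


def wait_for_new_file(folder, known, pattern="*.xls", timeout=10.0, poll=0.2):
    """Return the first file matching `pattern` that is not in `known`.

    Raises TimeoutError if nothing new appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        new_files = sorted(set(Path(folder).glob(pattern)) - set(known))
        if new_files:
            return new_files[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new {pattern} file in {folder}")
        time.sleep(poll)


# Usage inside the loop (hypothetical download folder):
# before = set(Path("C:/Users/BI/Downloads").glob("*.xls"))
# driver.find_element(By.XPATH, xpath).click()
# wait_for_new_file("C:/Users/BI/Downloads", before)
```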

Create a new folder and move all the downloaded files into it.
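
This step can be done by hand, or scripted. A minimal sketch with the standard library follows; the function name collect_downloads and its arguments are hypothetical, and the folder paths are whatever your browser and project use.

```python
import shutil
from pathlib import Path


def collect_downloads(download_dir, target_dir, pattern="*.xls"):
    """Move every file matching `pattern` from download_dir into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the new folder if needed
    moved = []
    for src in sorted(Path(download_dir).glob(pattern)):
        dst = target / src.name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

The glob is not recursive, so the target folder may safely live inside the download folder.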

Checking the Excel files
The first two rows of each file are not needed.
Loading the Excel files all at once

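from glob import glob
import pandas as pd

# Collect every downloaded Excel file in the folder
stations_files = glob("C:/Users/BI/Downloads/새 폴더/*.xls")
# header=2: the third row holds the column names, so the first two rows are skipped
tmp = pd.read_excel(stations_files[0], header=2)
tmp.head(2)

Combining the loaded data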

tmp_raw = []

for file_name in stations_files:
    tmp = pd.read_excel(file_name, header=2)
    tmp_raw.append(tmp)

station_raw = pd.concat(tmp_raw)
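
pd.concat keeps each file's own row labels, so the combined frame has repeated index values (0, 1, 2, ... once per file). A small sketch with two invented frames shows the effect, along with the ignore_index=True option, which renumbers the rows at concatenation time. The station names and prices are made up for illustration.

```python
import pandas as pd

# Two tiny stand-ins for the per-district Excel files (invented values)
a = pd.DataFrame({"상호": ["A주유소", "B주유소"], "휘발유": [1650, 1700]})
b = pd.DataFrame({"상호": ["C주유소"], "휘발유": [1599]})

kept = pd.concat([a, b])                           # original labels are kept
renumbered = pd.concat([a, b], ignore_index=True)  # rows are renumbered from 0

print(list(kept.index))
print(list(renumbered.index))
```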

Checking the data
```
station_raw.info()
```

Confirmed: there are no missing values!

Extracting only the needed columns

# Extract the columns we need
stations = pd.DataFrame(
    {
        "상호" : station_raw.get("상호"),     # station name
        "주소" : station_raw.get("주소"),     # address
        "가격" : station_raw.get("휘발유"),   # gasoline price
        "셀프" : station_raw.get("셀프여부"), # self-service flag
        "상표" : station_raw.get("상표"),     # brand
    }
)

# Add the district ("구"): the second token of the address
stations["구"] = [eachAddress.split()[1] for eachAddress in stations.get("주소")]

# Convert the price column to a numeric type
# Drop rows with no price ("-"); copy() avoids a SettingWithCopyWarning below
stations = stations[stations["가격"] != "-"].copy()
stations["가격"] = stations["가격"].astype("float")

# Renumber the index and discard the old one
stations.reset_index(drop=True, inplace=True)

stations.tail(2)
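
The two cleaning rules, taking the second address token as the district and dropping rows whose price is "-", can be seen in isolation on a few invented rows. This is a plain-Python sketch of the same logic, not the pandas code itself; the addresses and prices are made up.

```python
# Invented rows in the same shape as the downloaded data
rows = [
    {"주소": "서울 강남구 예시로 1", "휘발유": "1650"},
    {"주소": "서울 종로구 예시로 2", "휘발유": "-"},  # no gasoline price listed
    {"주소": "서울 중구 예시로 3", "휘발유": "1720"},
]

cleaned = [
    {"구": row["주소"].split()[1], "가격": float(row["휘발유"])}
    for row in rows
    if row["휘발유"] != "-"  # skip stations without a price
]
print(cleaned)
```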

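4. Visualizing Gas Station Prices

Saving the data and plotting by brand

# The encoding argument of to_excel was removed in pandas 2.0, so it is omitted here
stations.to_excel('C:/Users/BI/Downloads/새 폴더/05. stations_final.xlsx')
stations = pd.read_excel('C:/Users/BI/Downloads/새 폴더/05. stations_final.xlsx', index_col = 0)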

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc
%matplotlib inline
# get_ipython().run_line_magic("matplotlib", "inline")
plt.rcParams["axes.unicode_minus"] = False  # render minus signs correctly
rc('font', family='Malgun Gothic')          # font with Korean glyphs (Windows)
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

# Price distribution per brand, split by the self-service flag
plt.figure(figsize=(12, 8))
sns.boxplot(x='상표', y='가격', hue = '셀프', data=stations, palette="Set3")
plt.grid()
plt.show()
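
The boxplot addresses the title question visually; a numeric summary can back it up. A minimal sketch, assuming the 셀프 column holds "Y"/"N" flags (an assumption, so check the actual values in your download) and using invented brands and prices:

```python
import pandas as pd

# Invented sample in the same shape as `stations`
sample = pd.DataFrame(
    {
        "상표": ["A", "A", "B", "B"],
        "셀프": ["Y", "N", "Y", "N"],
        "가격": [1600.0, 1700.0, 1620.0, 1800.0],
    }
)

# Average price for self-service ("Y") and full-service ("N") stations
by_self = sample.groupby("셀프")["가격"].mean()
print(by_self)

# Positive gap means self-service is cheaper on average
gap = by_self["N"] - by_self["Y"]
print(gap)
```

On the real data the same call would be stations.groupby("셀프")["가격"].mean(); adding "상표" to the groupby keys gives the per-brand breakdown that the boxplot shows.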

Average price by district

# aggfunc="mean" is equivalent to np.mean and avoids a FutureWarning in recent pandas
gu_data = pd.pivot_table(
                            data = stations, 
                            index = ["구"], 
                            values = ["가격"], 
                            aggfunc = "mean"
                        )
gu_data.head()

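Map visualization

import json
import folium

geo_path = "C:/Users/BI/DataScience/data/02. skorea_municipalities_geo_simple.json"
with open(geo_path, encoding="utf-8") as f:
    geo_str = json.load(f)

# Note: "Stamen Toner" tiles may not be available in recent folium releases;
# a built-in style such as "CartoDB positron" can be used instead.
my_map = folium.Map(
                        location=[37.5502, 126.982], 
                        zoom_start=10.5, 
                        tiles="Stamen Toner"
                    )

# Color each district by its average price; feature ids must match the index of gu_data
folium.Choropleth(
    geo_data = geo_str,
    data = gu_data,
    columns = [gu_data.index, "가격"],
    key_on = "feature.id",
    fill_color = "PuRd"
).add_to(my_map)

my_map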
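
Choropleth joins the data to the map by comparing the key column (here the index of gu_data) with the field named in key_on. Names that do not match on both sides are not colored by the data, so it can help to check the join beforehand. The sketch below uses a made-up two-feature GeoJSON and made-up averages; the helper name unmatched_keys is hypothetical.

```python
def unmatched_keys(geojson, data_keys):
    """Compare GeoJSON feature ids with the keys of the data to be plotted."""
    feature_ids = {feature["id"] for feature in geojson["features"]}
    data_keys = set(data_keys)
    return {
        "missing_on_map": sorted(data_keys - feature_ids),   # in data, not in GeoJSON
        "missing_in_data": sorted(feature_ids - data_keys),  # in GeoJSON, not in data
    }


# Invented stand-ins for geo_str and gu_data
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "종로구", "properties": {}, "geometry": None},
        {"type": "Feature", "id": "중구", "properties": {}, "geometry": None},
    ],
}
averages = {"종로구": 1800.0, "용산구": 1790.0}

result = unmatched_keys(geo, averages.keys())
print(result)
```

With the real objects the call would be unmatched_keys(geo_str, gu_data.index).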

The map shows that Jongno-gu, Jung-gu, and Yongsan-gu are comparatively expensive.
