New York City Airbnb (I) EDA
Credit: https://techcrunch.com/2018/01/31/nyc-new-york-airbnb-study-mcgill/
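The data for this post comes from the New York City Airbnb dataset on Kaggle:
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
The analysis is split across two posts. This first one focuses on EDA: visualizing the Airbnb data across New York's areas and computing basic descriptive statistics.
The variables are as follows: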
id: listing ID
name: name of the listing
host_id: host ID
host_name: name of the host
neighbourhood_group: location
neighbourhood: area
latitude: latitude coordinates
longitude: longitude coordinates
room_type: listing space type
price: price in dollars (the variable to predict)
minimum_nights: minimum number of nights
number_of_reviews: number of reviews
last_review: date of the latest review
reviews_per_month: number of reviews per month
calculated_host_listings_count: number of listings per host
availability_365: number of days when the listing is available for booking
Nothing special here: read the file and load the packages we need.
#library
library(dplyr)
library(tidyverse)
library(ggthemes)
library(GGally)
library(ggExtra)
library(caret)
library('glmnet')
library(corrplot)
library(leaflet)
library("kableExtra")
library(RColorBrewer)
library(plotly)
library(mice)
library(VIM)
airbnb <- read_csv("AB_NYC_2019.csv")
Convert a few variables to factors, then take a look at the dataset as a whole.
#factor
airbnb$neighbourhood_group<-as.factor(airbnb$neighbourhood_group)
airbnb$neighbourhood <- as.factor(airbnb$neighbourhood)
airbnb$room_type <- as.factor(airbnb$room_type)
glimpse(airbnb)
## Observations: 48,895
## Variables: 16
## $ id <dbl> 2539, 2595, 3647, 3831, 5022, 5...
## $ name <chr> "Clean & quiet apt home by the ...
## $ host_id <dbl> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name <chr> "John", "Jennifer", "Elisabeth"...
## $ neighbourhood_group <fct> Brooklyn, Manhattan, Manhattan,...
## $ neighbourhood <fct> Kensington, Midtown, Harlem, Cl...
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type <fct> Private room, Entire home/apt, ...
## $ price <dbl> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews <dbl> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review <date> 2018-10-19, 2019-05-21, NA, 20...
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365 <dbl> 365, 355, 365, 194, 0, 129, 0, ...
Set the date format and extract the year.
#lubridate
library(lubridate)
airbnb$last_review = as.Date(airbnb$last_review, "%m/%d/%Y")
airbnb$last_year = as.factor(format(airbnb$last_review, "%Y"))
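As a minimal sketch of this year-extraction step, here is the same idea on a few made-up dates. The vector `dates_raw` is hypothetical and not taken from the dataset. It shows that a missing date stays NA in the derived factor instead of becoming its own level.

```r
# Hypothetical sample: two ISO-formatted dates and one missing value
dates_raw <- c("2018-10-19", "2019-05-21", NA)

# Parse to Date; the NA stays NA
last_review <- as.Date(dates_raw, format = "%Y-%m-%d")

# Extract the year as a factor, mirroring the step above
last_year <- as.factor(format(last_review, "%Y"))

print(last_year)
print(sum(is.na(last_year)))
```

This is why the NA count for last_year ends up matching the NA count for last_review.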
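Looking at the NA values, the main gap is the last_review date, which is why reviews_per_month and last_year show the same number of NAs (10,052). Beyond that, name and host_name have a few missing values. Plotting the missingness makes the pattern easier to see.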
#NA
sapply(airbnb,function(x) sum(is.na(x)))
## id name
## 0 16
## host_id host_name
## 0 21
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 10052 10052
## calculated_host_listings_count availability_365
## 0 0
## last_year
## 10052
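# labels must come from the data frame itself; names(data) would refer to the base function data()
aggr_plot <- aggr(airbnb, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(airbnb), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))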
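For a quick check without any plotting package, the sketch below counts missing values per column and the share of incomplete rows in base R. The data frame `toy` is a hypothetical stand-in, not the Airbnb data.

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  name = c("a", NA, "c", "d"),
  last_review = as.Date(c("2019-01-01", NA, NA, "2018-06-30")),
  reviews_per_month = c(0.5, NA, NA, 1.2)
)

# Number of NAs per column (equivalent to the sapply() call above)
na_counts <- colSums(is.na(toy))

# Share of rows that have at least one missing value
share_incomplete <- mean(!complete.cases(toy))

print(na_counts)
print(share_incomplete)
```

Columns that are always missing together, like last_review and reviews_per_month here, show identical counts.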
Next, the summary:
Manhattan accounts for the most listings (21,661), and the most common room type is Entire home/apt (25,409).
#summary
summary(airbnb)
##        id               name              host_id
##  Min.   :    2539   Length:48895       Min.   :     2438
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033
##  Median :19677284   Mode  :character   Median : 30793816
##  Mean   :19017143                      Mean   : 67620011
##  3rd Qu.:29152178                      3rd Qu.:107434423
##  Max.   :36487245                      Max.   :274321313
##
##   host_name         neighbourhood_group   neighbourhood
##  Length:48895       Bronx        : 1091   Williamsburg      : 3920
##  Class :character   Brooklyn     :20104   Bedford-Stuyvesant: 3714
##  Mode  :character   Manhattan    :21661   Harlem            : 2658
##                     Queens       : 5666   Bushwick          : 2465
##                     Staten Island:  373   Upper West Side   : 1971
##                                           Hell's Kitchen    : 1958
##                                           (Other)           :32209
##     latitude       longitude        room_type
##  Min.   :40.50   Min.   :-74.24   Entire home/apt:25409
##  1st Qu.:40.69   1st Qu.:-73.98   Private room   :22326
##  Median :40.72   Median :-73.96   Shared room    : 1160
##  Mean   :40.73   Mean   :-73.95
##  3rd Qu.:40.76   3rd Qu.:-73.94
##  Max.   :40.91   Max.   :-73.71
##
##      price         minimum_nights    number_of_reviews
##  Min.   :    0.0   Min.   :   1.00   Min.   :  0.00
##  1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00
##  Median :  106.0   Median :   3.00   Median :  5.00
##  Mean   :  152.7   Mean   :   7.03   Mean   : 23.27
##  3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00
##  Max.   :10000.0   Max.   :1250.00   Max.   :629.00
##
##   last_review         reviews_per_month   calculated_host_listings_count
##  Min.   :2011-03-28   Min.   : 0.010      Min.   :  1.000
##  1st Qu.:2018-07-08   1st Qu.: 0.190      1st Qu.:  1.000
##  Median :2019-05-19   Median : 0.720      Median :  1.000
##  Mean   :2018-10-04   Mean   : 1.373      Mean   :  7.144
##  3rd Qu.:2019-06-23   3rd Qu.: 2.020      3rd Qu.:  2.000
##  Max.   :2019-07-08   Max.   :58.500      Max.   :327.000
##  NA's   :10052        NA's   :10052
##  availability_365   last_year
##  Min.   :  0.0      2019   :25209
##  1st Qu.:  0.0      2018   : 6050
##  Median : 45.0      2017   : 3205
##  Mean   :112.8      2016   : 2707
##  3rd Qu.:227.0      2015   : 1393
##  Max.   :365.0      (Other):  279
##                     NA's   :10052
Visualization:
Yes, time to draw some plots.
First up is price. The raw plot is hard to read, so we will want a closer look.
#EDA~price
ggplot(airbnb) +
geom_histogram(aes(price), fill = '#fd5c63', alpha = 0.85, binwidth = 10) +
theme_minimal(base_size = 13) + xlab("Price") + ylab("Number") +
ggtitle("The Distribution of Price")
Price on a log scale:
#EDA~price
ggplot(airbnb, aes(price)) +
geom_histogram(bins = 30, aes(y = ..density..), fill = "#fd5c63") +
geom_density(alpha = 0.2, fill = "#fd5c63") +ggtitle("Transformed distribution of price",subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis")) +
scale_x_log10()
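One thing worth keeping in mind with the log axis: the summary above shows a minimum price of 0, and log10(0) is -Inf, so such rows cannot be drawn on a log scale. The sketch below illustrates this on a hypothetical price vector. Filtering out zeros first is my assumption about one reasonable way to handle it, not something the original analysis does.

```r
# Hypothetical right-skewed prices, including a zero like the dataset's minimum
price <- c(0, 60, 80, 100, 150, 225, 10000)

# log10(0) is -Inf, so a zero price has no position on a log axis
log_price <- log10(price)
print(log_price)

# One option (an assumption, not the original author's choice): keep positive prices only
positive <- price[price > 0]
print(range(log10(positive)))
```

After the transformation, the extreme value of 10000 sits only about two units above the typical prices, which is why the log-scale histogram is so much easier to read.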
Boroughs (neighbourhood_group): Manhattan and Brooklyn dominate.
#neighbourhood_group
# geom_bar() counts rows per category by default, so stat = "count" is not needed
ggplot(airbnb) + geom_bar(aes(neighbourhood_group, fill = neighbourhood_group), alpha = 0.85) +
theme_minimal(base_size=13) + xlab("") + ylab("") +theme(legend.position="none") +
ggtitle("The Number of Properties in Each Area")
airbnb_nh <- airbnb %>%
group_by(neighbourhood_group) %>%
summarise(price = round(mean(price), 2))
airbnb_nh
## # A tibble: 5 x 2
## neighbourhood_group price
## <fct> <dbl>
## 1 Bronx 87.5
## 2 Brooklyn 124.
## 3 Manhattan 197.
## 4 Queens 99.5
## 5 Staten Island 115.
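Manhattan has the highest average price, followed by Brooklyn, but we need a closer look. Plotting the price distribution for each borough makes things clearer.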
ggplot(airbnb, aes(price)) +
geom_histogram(bins = 30, aes(y = ..density..), fill = "#fd5c63") +
geom_density(alpha = 0.2, fill = "#fd5c63") +ggtitle("Transformed distribution of price",subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis")) +
scale_x_log10() + facet_wrap(~neighbourhood_group)
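The takeaway is that prices in Manhattan are generally higher, and on the log scale none of the five boroughs' distributions looks badly skewed.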
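Because raw prices are right-skewed (the maximum is 10,000 while the median is 106), a group mean can be pulled up by a handful of expensive listings. As a hedged sketch, the snippet below compares mean and median per group on a hypothetical toy data frame; the numbers are made up purely to show the effect.

```r
# Hypothetical toy data: one extreme price in Manhattan
toy <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Brooklyn", "Brooklyn"),
  price = c(100, 200, 3000, 80, 90, 100)
)

# Mean and median price per group
mean_price <- tapply(toy$price, toy$neighbourhood_group, mean)
median_price <- tapply(toy$price, toy$neighbourhood_group, median)

print(round(cbind(mean = mean_price, median = median_price), 2))
```

Reporting the median next to the mean is a cheap way to check whether a gap between boroughs is driven by typical listings or by a few outliers.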
What about room type (room_type)?
#roomtype
airbnb_rt <- airbnb %>%
group_by(room_type) %>%
summarise(price = round(mean(price), 2))
airbnb_rt
## # A tibble: 3 x 2
## room_type price
## <fct> <dbl>
## 1 Entire home/apt 212.
## 2 Private room 89.8
## 3 Shared room 70.1
Entire home/apt listings are clearly more expensive. Let's look at the distributions right away.
ggplot(airbnb, aes(price)) +
geom_histogram(bins = 30, aes(y = ..density..), fill = "#fd5c63") +
geom_density(alpha = 0.2, fill = "#fd5c63") +ggtitle("Transformed distribution of price",subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis")) +
scale_x_log10() + facet_wrap(~room_type)
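Entire home/apt is clearly higher, and the distributions show no obvious skew on the log scale.
Now let's look at the relationship between borough and room type.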
#neighbourhood_group x roomtype
ggplot(airbnb) + geom_bar(aes(neighbourhood_group, fill = room_type), alpha = 0.85, position = 'fill') +
theme_minimal(base_size=13) + xlab("") + ylab("") +
ggtitle("The Proportion of Room Type in Each Area")
Manhattan has the largest share of Entire home/apt listings, which seems to explain why its prices are the highest.
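The proportions behind a filled bar chart like this can also be read off as numbers. Here is a minimal sketch with base R's `table()` and `prop.table()`, using a hypothetical toy data frame rather than the real listings.

```r
# Hypothetical toy data: room types in two boroughs
toy <- data.frame(
  neighbourhood_group = c(rep("Manhattan", 4), rep("Brooklyn", 4)),
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room",
                "Entire home/apt", "Private room", "Private room", "Private room")
)

# Cross-tabulate counts, then convert each row to proportions
counts <- table(toy$neighbourhood_group, toy$room_type)
shares <- prop.table(counts, margin = 1)  # margin = 1: each borough sums to 1

print(round(shares, 2))
```

With `margin = 1` each borough's row sums to 1, which corresponds to what `position = 'fill'` draws.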
What about the number of reviews?
#number of reviews
ggplot(airbnb, aes(number_of_reviews, price)) +
theme(axis.title = element_text(), axis.title.x = element_text()) +
geom_point(aes(size = price), alpha = 0.05, color = "slateblue") +
xlab("Number of reviews") +
ylab("Price") +
ggtitle("Relationship between number of reviews and price",
subtitle = "The most expensive objects have small number of reviews (or 0)")
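No clear pattern emerges, except that the most expensive listings have almost no reviews.
What about the year?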
ggplot(airbnb) +
geom_bar(aes(last_year), fill = '#fd5c63', alpha = 0.85) +
theme_minimal(base_size=13)+xlab("")+ylab("") +
ggtitle("The Number of Properties by Year of Last Review")
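There are a lot of NAs, and 2019 is by far the largest group. Note that last_year is derived from last_review, so it marks the year of a listing's most recent review rather than the year the listing was created.
How do prices look across these years?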
airbnb_yr <- airbnb %>%
group_by(last_year) %>%
summarise(price = round(mean(price), 2))
airbnb_yr
## # A tibble: 10 x 2
##    last_year price
##    <fct>     <dbl>
##  1 2011       169
##  2 2012       158.
##  3 2013       256.
##  4 2014       160.
##  5 2015       157.
##  6 2016       152.
##  7 2017       135.
##  8 2018       139.
##  9 2019       142.
## 10 NA         193.
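The average price for 2013 is notably high, but there are simply too many NAs to read much into it.
Next, let's look at the correlations among the variables.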
#correlation
airbnb_cor <- airbnb[, sapply(airbnb, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor, method = "spearman")
corrplot(correlation_matrix, method = "color")
No variable stands out as strongly correlated, apart from longitude.
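The correlation above uses method = "spearman", which works on ranks and so picks up monotonic relationships even when they are not linear. As a small sketch on hypothetical vectors `x` and `y` (not columns from the dataset), compare it with Pearson:

```r
# Hypothetical data: y grows monotonically with x, but not linearly
x <- c(1, 2, 3, 4, 5, 6)
y <- exp(x)

# Pearson measures linear association; Spearman measures rank (monotonic) association
pearson <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

print(round(c(pearson = pearson, spearman = spearman), 3))
```

Since the ranks of `x` and `y` agree perfectly, Spearman is 1 while Pearson is lower. For heavily skewed variables such as price, a rank-based measure is also less sensitive to extreme values.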
That's about it for this round of EDA~
Still thinking about whether there are other angles worth digging into...
Still a long way~