New York City Airbnb (I) EDA

Credit: https://techcrunch.com/2018/01/31/nyc-new-york-airbnb-study-mcgill/
 
The data for this post comes from the New York City Airbnb dataset on Kaggle:
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png

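The analysis is split into two posts. This first one focuses on EDA: visualizing the Airbnb listings across the areas of New York and computing basic descriptive statistics.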

The variables in the data are:

id: listing ID
name: name of the listing
host_id: host ID
host_name: name of the host
neighbourhood_group: location (borough)
neighbourhood: area
latitude: latitude coordinates
longitude: longitude coordinates
room_type: listing space type
price: price in dollars (the variable to predict)
minimum_nights: minimum number of nights
number_of_reviews: number of reviews
last_review: date of the latest review
reviews_per_month: number of reviews per month
calculated_host_listings_count: number of listings per host
availability_365: number of days when the listing is available for booking

Let's get started!

Nothing fancy here: load the packages we need and read the file.
# Load packages, then read the data
library(dplyr)
library(tidyverse)
library(ggthemes)
library(GGally)
library(ggExtra)
library(caret)
library('glmnet')
library(corrplot)
library(leaflet)
library("kableExtra")
library(RColorBrewer)
library(plotly)
library(mice)
library(VIM)
airbnb <- read_csv("AB_NYC_2019.csv")

Convert a few variables to factors, then take a look at the whole dataset.
# Convert categorical columns to factors
airbnb$neighbourhood_group<-as.factor(airbnb$neighbourhood_group)
airbnb$neighbourhood <- as.factor(airbnb$neighbourhood)
airbnb$room_type <- as.factor(airbnb$room_type)
glimpse(airbnb)
## Observations: 48,895
## Variables: 16
## $ id                             <dbl> 2539, 2595, 3647, 3831, 5022, 5...
## $ name                           <chr> "Clean & quiet apt home by the ...
## $ host_id                        <dbl> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name                      <chr> "John", "Jennifer", "Elisabeth"...
## $ neighbourhood_group            <fct> Brooklyn, Manhattan, Manhattan,...
## $ neighbourhood                  <fct> Kensington, Midtown, Harlem, Cl...
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type                      <fct> Private room, Entire home/apt, ...
## $ price                          <dbl> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights                 <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews              <dbl> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review                    <date> 2018-10-19, 2019-05-21, NA, 20...
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365               <dbl> 365, 355, 365, 194, 0, 129, 0, ...

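Set the date format and extract the year.
# lubridate
library(lubridate)
# read_csv() already parsed last_review as a Date (see the glimpse output above),
# so as.Date() is only a safeguard here
airbnb$last_review <- as.Date(airbnb$last_review)
airbnb$last_year <- as.factor(year(airbnb$last_review))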

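Check the NA values. The missing values are mainly in last_review, which is why reviews_per_month and last_year show the same number of NAs (10,052). name and host_name also have a few missing values. Visualizing the missingness pattern makes this easier to see.
# Count NA values per column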
sapply(airbnb,function(x) sum(is.na(x)))
##                             id                           name 
##                              0                             16 
##                        host_id                      host_name 
##                              0                             21 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0 
##                      last_year 
##                          10052
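# labels must come from the airbnb data frame (names(data) would refer to the base function `data`)
aggr_plot <- aggr(airbnb, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(airbnb), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))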
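The matching NA counts suggest that the same rows are missing in each column, and that they may simply be listings with no reviews. One way to check this is to compare the missing-value flags directly. The following is a minimal sketch on a small made-up data frame; `toy` and its values are assumptions for illustration, and with the real data `toy` would be replaced by `airbnb`.

```r
# Toy stand-in for the airbnb data (values are made up for illustration)
toy <- data.frame(
  number_of_reviews = c(9, 0, 270, 0, 5),
  last_review       = as.Date(c("2018-10-19", NA, "2019-07-05", NA, "2019-05-21")),
  reviews_per_month = c(0.21, NA, 4.64, NA, 0.10)
)

# Do last_review and reviews_per_month go missing on exactly the same rows?
same_rows <- all(is.na(toy$last_review) == is.na(toy$reviews_per_month))

# Cross-tabulate "has zero reviews" against "last_review is missing"
na_vs_zero <- table(
  zero_reviews = toy$number_of_reviews == 0,
  missing_date = is.na(toy$last_review)
)

same_rows
na_vs_zero
```

If the off-diagonal cells of the table are zero on the real data, the NAs mark listings that have never been reviewed rather than data-entry gaps.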


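Next, the summary:
Manhattan has the most listings (21,661), with Brooklyn close behind (20,104), and the most common room type is Entire home/apt (25,409).
# summary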
summary(airbnb)
##        id               name              host_id         
##  Min.   :    2539   Length:48895       Min.   :     2438  
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033  
##  Median :19677284   Mode  :character   Median : 30793816  
##  Mean   :19017143                      Mean   : 67620011  
##  3rd Qu.:29152178                      3rd Qu.:107434423  
##  Max.   :36487245                      Max.   :274321313  
##                                                           
##   host_name            neighbourhood_group            neighbourhood  
##  Length:48895       Bronx        : 1091    Williamsburg      : 3920  
##  Class :character   Brooklyn     :20104    Bedford-Stuyvesant: 3714  
##  Mode  :character   Manhattan    :21661    Harlem            : 2658  
##                     Queens       : 5666    Bushwick          : 2465  
##                     Staten Island:  373    Upper West Side   : 1971  
##                                            Hell's Kitchen    : 1958  
##                                            (Other)           :32209  
##     latitude       longitude                room_type    
##  Min.   :40.50   Min.   :-74.24   Entire home/apt:25409  
##  1st Qu.:40.69   1st Qu.:-73.98   Private room   :22326  
##  Median :40.72   Median :-73.96   Shared room    : 1160  
##  Mean   :40.73   Mean   :-73.95                          
##  3rd Qu.:40.76   3rd Qu.:-73.94                          
##  Max.   :40.91   Max.   :-73.71                          
##                                                          
##      price         minimum_nights    number_of_reviews
##  Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
##  Median :  106.0   Median :   3.00   Median :  5.00   
##  Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
##  3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
##  Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
##                                                       
##   last_review         reviews_per_month calculated_host_listings_count
##  Min.   :2011-03-28   Min.   : 0.010    Min.   :  1.000               
##  1st Qu.:2018-07-08   1st Qu.: 0.190    1st Qu.:  1.000               
##  Median :2019-05-19   Median : 0.720    Median :  1.000               
##  Mean   :2018-10-04   Mean   : 1.373    Mean   :  7.144               
##  3rd Qu.:2019-06-23   3rd Qu.: 2.020    3rd Qu.:  2.000               
##  Max.   :2019-07-08   Max.   :58.500    Max.   :327.000               
##  NA's   :10052        NA's   :10052                                   
##  availability_365   last_year    
##  Min.   :  0.0    2019   :25209  
##  1st Qu.:  0.0    2018   : 6050  
##  Median : 45.0    2017   : 3205  
##  Mean   :112.8    2016   : 2707  
##  3rd Qu.:227.0    2015   : 1393  
##  Max.   :365.0    (Other):  279  
##                   NA's   :10052

Visualization:
Yes, time to draw some plots.

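First, price. The plot is hard to read at this scale, so the next step is to zoom in.
# EDA: price (binwidth belongs to geom_histogram; geom_bar ignores it)
ggplot(airbnb) + 
  geom_histogram(aes(price), fill = '#fd5c63', alpha = 0.85, binwidth = 10) + 
  theme_minimal(base_size = 13) + xlab("Price") + ylab("Number") + 
  ggtitle("The Distribution of Price")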


Now look at price on a log scale:
# EDA: price on a log10 x-axis
ggplot(airbnb, aes(price)) +
  geom_histogram(bins = 30, aes(y = ..density..), fill = "#fd5c63") + 
  geom_density(alpha = 0.2, fill = "#fd5c63") +
  ggtitle("Transformed distribution of price",
          subtitle = expression("With" ~ 'log'[10] ~ "transformation of x-axis")) +
  scale_x_log10()
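The summary above shows a minimum price of 0, and log10(0) is -Inf, so a log-scaled axis cannot display those rows and ggplot2 drops them with a warning. A minimal sketch of checking this up front, using a made-up `prices` vector (an assumption for illustration, not the real data):

```r
# Made-up prices, including a zero like the minimum seen in summary(airbnb)
prices <- c(0, 69, 106, 175, 10000)

# log10(0) is -Inf, which cannot be placed on a log-scaled axis
log_prices <- log10(prices)
n_dropped  <- sum(!is.finite(log_prices))

# Keep only strictly positive prices before plotting on a log scale
positive_prices <- prices[prices > 0]

n_dropped
range(log10(positive_prices))
```

On the real data, `sum(airbnb$price == 0)` would give the number of listings that silently disappear from the log-scale plots.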


By area: Manhattan and Brooklyn account for most of the listings.
# neighbourhood_group
ggplot(airbnb) + 
  geom_bar(aes(neighbourhood_group, fill = neighbourhood_group), alpha = 0.85) + 
  theme_minimal(base_size = 13) + xlab("") + ylab("") + theme(legend.position = "none") + 
  ggtitle("The Number of Properties in Each Area")

airbnb_nh <- airbnb %>%
  group_by(neighbourhood_group) %>%
  summarise(price = round(mean(price), 2))
airbnb_nh
## # A tibble: 5 x 2
##   neighbourhood_group price
##   <fct>               <dbl>
## 1 Bronx                87.5
## 2 Brooklyn            124. 
## 3 Manhattan           197. 
## 4 Queens               99.5
## 5 Staten Island       115.
Manhattan has the highest average price, followed by Brooklyn, but we need to look more closely...
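Because the overall price distribution is strongly right-skewed (a maximum of 10,000 against a median of 106 in the summary above), a group mean can be pulled up by a handful of very expensive listings. A hedged sketch of putting the median next to the mean, written in base R on a made-up `toy` data frame (both the data and the column name `area` are assumptions for illustration):

```r
# Toy data: one extreme price in Manhattan (values are made up)
toy <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Bronx",
                          "Manhattan", "Manhattan", "Manhattan"),
  price               = c(60, 80, 100, 150, 200, 5000)
)

# Mean and median price per area
mean_price   <- tapply(toy$price, toy$neighbourhood_group, mean)
median_price <- tapply(toy$price, toy$neighbourhood_group, median)

price_by_area <- data.frame(
  area   = names(mean_price),
  mean   = round(as.numeric(mean_price), 2),
  median = as.numeric(median_price)
)
price_by_area
```

A large gap between the two columns for one area is a sign that its mean is driven by outliers rather than by typical listings.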

We can plot the price distribution for each area, which makes the comparison clear.
ggplot(airbnb, aes(price)) +
  geom_histogram(bins = 30, aes(y = ..density..), fill = "#fd5c63") + 
  geom_density(alpha = 0.2, fill = "#fd5c63") +
  ggtitle("Transformed distribution of price",
          subtitle = expression("With" ~ 'log'[10] ~ "transformation of x-axis")) +
  scale_x_log10() + facet_wrap(~neighbourhood_group)

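The conclusion here is that prices in Manhattan are generally higher, and on the log scale none of the five areas shows a heavily skewed distribution.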

What about room type?
# room_type
airbnb_rt <- airbnb %>%
  group_by(room_type) %>%
  summarise(price = round(mean(price), 2))
airbnb_rt
## # A tibble: 3 x 2
##   room_type       price
##   <fct>           <dbl>
## 1 Entire home/apt 212. 
## 2 Private room     89.8
## 3 Shared room      70.1
Entire home/apt listings are clearly priced higher. Let's look at their distributions right away.
ggplot(airbnb, aes(price)) +
  geom_histogram(bins = 30, aes(y = ..density..), fill = "#fd5c63") + 
  geom_density(alpha = 0.2, fill = "#fd5c63") +
  ggtitle("Transformed distribution of price",
          subtitle = expression("With" ~ 'log'[10] ~ "transformation of x-axis")) +
  scale_x_log10() + facet_wrap(~room_type)

Entire homes are clearly more expensive, and on the log scale the distributions show no obvious skew.

Now let's look at the relationship between area and room type.
# neighbourhood_group x room_type
ggplot(airbnb) + 
  geom_bar(aes(neighbourhood_group, fill = room_type), alpha = 0.85, position = 'fill') + 
  theme_minimal(base_size = 13) + xlab("") + ylab("") + 
  ggtitle("The Proportion of Room Type in Each Area")

Manhattan has the largest share of entire homes, which seems to explain why its prices are the highest.
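The stacked bar chart with position = 'fill' shows shares within each area, not counts. The same numbers can be read off a row-normalized contingency table. A minimal sketch in base R on a made-up `toy` data frame (the data are an assumption for illustration):

```r
# Toy listings (values are made up for illustration)
toy <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn"),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Entire home/apt", "Private room", "Private room", "Shared room")
)

# Counts of each room type per area
counts <- table(toy$neighbourhood_group, toy$room_type)

# margin = 1 normalizes each row, so every area sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

With the real data, `table(airbnb$neighbourhood_group, airbnb$room_type)` would take the place of the toy table.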

What about the number of reviews?
# number of reviews
ggplot(airbnb, aes(number_of_reviews, price)) +
  geom_point(aes(size = price), alpha = 0.05, color = "slateblue") +
  xlab("Number of reviews") +
  ylab("Price") +
  ggtitle("Relationship between number of reviews and price",
          subtitle = "The most expensive objects have small number of reviews (or 0)")
There is no clear pattern, other than that the most expensive listings have almost no reviews.


What about the year?
# year of the most recent review
ggplot(airbnb) + 
  geom_bar(aes(last_year), fill = '#fd5c63', alpha = 0.85) + 
  theme_minimal(base_size = 13) + xlab("") + ylab("") + 
  ggtitle("The Number of Listings by Year of Last Review")

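There are many NAs, and most of the remaining listings were last reviewed in 2019. Note that last_year is the year of the most recent review, not the year a listing was created.
So how do prices compare across years?
airbnb_yr <- airbnb %>%
  group_by(last_year) %>%
  summarise(price = round(mean(price), 2))
airbnb_yr
## # A tibble: 10 x 2
##    last_year price
##    <fct>     <dbl>
##  1 2011       169 
##  2 2012       158.
##  3 2013       256.
##  4 2014       160.
##  5 2015       157.
##  6 2016       152.
##  7 2017       135.
##  8 2018       139.
##  9 2019       142.
## 10 NA         193.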
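Listings last reviewed in 2013 have a notably high average price, but the 2011 to 2014 groups together hold only 279 listings (the "(Other)" row in the summary above), and there are far too many NAs.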
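A group mean is easier to judge when the group size sits next to it. The following is a minimal sketch in base R on a made-up `toy` data frame; the data and the "no review" label for missing years are assumptions for illustration.

```r
# Toy data: one expensive listing last reviewed in 2013 (values are made up)
toy <- data.frame(
  last_year = c("2013", "2018", "2018", "2019", "2019", "2019", NA),
  price     = c(900, 120, 140, 100, 150, 200, 190)
)

# Give missing years an explicit label so they form their own group
year_label <- ifelse(is.na(toy$last_year), "no review", toy$last_year)

# Group size and mean price per year
n_listings <- tapply(toy$price, year_label, length)
mean_price <- tapply(toy$price, year_label, mean)

summary_by_year <- data.frame(
  last_year  = names(mean_price),
  n          = as.integer(n_listings),
  mean_price = round(as.numeric(mean_price), 2)
)
summary_by_year
```

In the toy output the 2013 mean rests on a single listing, which is the kind of situation the n column is meant to expose.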

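Let's look at the correlations between the variables.
# correlation (numeric columns only, complete cases, Spearman rank correlation)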
airbnb_cor <- airbnb[, sapply(airbnb, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor, method = "spearman")
corrplot(correlation_matrix, method = "color")

No variable stands out as particularly correlated, apart from longitude.
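A color grid can be hard to read for one variable of interest. Since price is the variable to predict, one option is to pull its row out of the Spearman matrix and sort by absolute value. A minimal sketch on a made-up `toy` data frame (the data are an assumption for illustration); note also that the numeric columns selected above include id and host_id, which are identifiers and could reasonably be dropped first.

```r
# Toy numeric data (values are made up for illustration)
toy <- data.frame(
  price             = c(50, 80, 120, 200, 400),
  longitude         = c(-73.80, -73.90, -73.95, -73.98, -74.00),
  number_of_reviews = c(10, 3, 40, 1, 5)
)

# Spearman correlation works on ranks, so it captures monotonic relationships
cor_mat <- cor(toy, method = "spearman")

# Correlation of every other variable with price, strongest first
price_cor <- cor_mat["price", colnames(cor_mat) != "price"]
price_cor <- price_cor[order(abs(price_cor), decreasing = TRUE)]
price_cor
```

With the real data, `correlation_matrix` from the block above would take the place of `cor_mat`.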


That is roughly it for this round of EDA~
I'm still thinking about which angles are worth digging into...
Still a long way to go~



