In [48]:
import pandas as pd
In [71]:
df = pd.read_csv("AB_NYC_2019.csv")
In [50]:
df.head()
Out[50]:
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 19-10-2018 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 21-05-2019 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 05-07-2019 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 19-11-2018 | 0.10 | 1 | 0 |
In [75]:
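# Cast ID and coordinate columns to strings so that numeric summaries
# such as describe() and corr(numeric_only=True) leave them out.
df["id"] = df["id"].astype(str)
df["host_id"] = df["host_id"].astype(str)
df["latitude"] = df["latitude"].astype(str)
df["longitude"] = df["longitude"].astype(str)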
What do the summary statistics of the numeric columns look like?
In [52]:
df.describe()
Out[52]:
| | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
---|---|---|---|---|---|---|
count | 48906.000000 | 48906.000000 | 48906.000000 | 38854.000000 | 48906.000000 | 48906.000000 |
mean | 152.711324 | 7.031612 | 23.300454 | 1.373151 | 7.142702 | 112.782031 |
std | 240.128713 | 20.512489 | 44.607175 | 1.680270 | 32.948926 | 131.620370 |
min | 0.000000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 |
25% | 69.000000 | 1.000000 | 1.000000 | 0.190000 | 1.000000 | 0.000000 |
50% | 106.000000 | 3.000000 | 5.000000 | 0.720000 | 1.000000 | 45.000000 |
75% | 175.000000 | 5.000000 | 24.000000 | 2.020000 | 2.000000 | 227.000000 |
max | 10000.000000 | 1250.000000 | 629.000000 | 58.500000 | 327.000000 | 365.000000 |
The minimum-nights requirement across listings ranges from 1 to 1250.
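The range can also be read directly from the column instead of off the `describe()` table. A minimal sketch, assuming a small made-up frame named `toy` in place of `df` (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df; values are hypothetical, chosen only for illustration.
toy = pd.DataFrame({"minimum_nights": [1, 3, 5, 30, 1250]})

# agg with a list of function names returns a Series indexed by those names,
# so the smallest and largest values come back in a single call.
lo, hi = toy["minimum_nights"].agg(["min", "max"])
print(f"minimum_nights ranges from {lo} to {hi}")
```

On the real data the same call would be made on `df["minimum_nights"]`.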
Categorical Data
In [53]:
df.nunique()
Out[53]:
id                                48895
name                              47896
host_id                           37457
host_name                         11452
neighbourhood_group                   5
neighbourhood                       221
latitude                          19048
longitude                         14718
room_type                             3
price                               674
minimum_nights                      109
number_of_reviews                   394
last_review                        1764
reviews_per_month                   937
calculated_host_listings_count       47
availability_365                    366
dtype: int64
In [54]:
df.columns
Out[54]:
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'], dtype='object')
In [55]:
df["room_type"].value_counts()
Out[55]:
room_type
Entire home/apt    25414
Private room       22332
Shared room         1160
Name: count, dtype: int64
In [56]:
df["room_type"].value_counts(normalize=True)
Out[56]:
room_type
Entire home/apt    0.519650
Private room       0.456631
Shared room        0.023719
Name: proportion, dtype: float64
In [57]:
# df["neighbourhood_group"].value_counts(normalize=True)*100
percentage_counts = df["neighbourhood_group"].value_counts(normalize=True) * 100
print(percentage_counts.map("{:.3f}%".format))
neighbourhood_group
Manhattan        44.307%
Brooklyn         41.114%
Queens           11.585%
Bronx             2.231%
Staten Island     0.763%
Name: proportion, dtype: object
In [58]:
df["neighbourhood"].value_counts().reset_index().rename(columns={"count": "No. of Listings"})
Out[58]:
| | neighbourhood | No. of Listings |
---|---|---|
0 | Williamsburg | 3921 |
1 | Bedford-Stuyvesant | 3715 |
2 | Harlem | 2658 |
3 | Bushwick | 2465 |
4 | Upper West Side | 1974 |
... | ... | ... |
216 | Fort Wadsworth | 1 |
217 | Richmondtown | 1 |
218 | New Dorp | 1 |
219 | Rossville | 1 |
220 | Willowbrook | 1 |
221 rows × 2 columns
Numerical Data
In [59]:
df["price"].value_counts(bins=5)
Out[59]:
(-10.001, 2000.0]    48820
(2000.0, 4000.0]        54
(4000.0, 6000.0]        16
(6000.0, 8000.0]         9
(8000.0, 10000.0]        7
Name: count, dtype: int64
In [60]:
bins = [-10, 0, 50, 100, 200, 500, 800, 2000, 4000, 10000]
df["price"].value_counts(bins=bins)
Out[60]:
(50.0, 100.0]        17373
(100.0, 200.0]       16588
(200.0, 500.0]        7340
(0.0, 50.0]           6550
(500.0, 800.0]         624
(800.0, 2000.0]        334
(2000.0, 4000.0]        54
(4000.0, 10000.0]       32
(-10.001, 0.0]          11
Name: count, dtype: int64
Binning with `value_counts(bins=...)` like this is mainly helpful for small datasets.
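When the bins should carry readable names, `pd.cut` is one alternative to passing `bins` to `value_counts`. The sketch below is only an illustration: the frame `toy`, the bin edges, and the band labels ("budget", "mid-range", "premium") are all made up and are not part of the analysis above.

```python
import pandas as pd

# Toy stand-in for df["price"]; values are hypothetical.
toy = pd.DataFrame({"price": [0, 45, 80, 120, 150, 300, 950, 5000]})

# Hypothetical bin edges and labels, chosen only for this example.
edges = [0, 100, 500, 10000]
labels = ["budget", "mid-range", "premium"]

# include_lowest=True makes the first interval closed on the left,
# so a price of exactly 0 falls into the first band instead of becoming NaN.
toy["price_band"] = pd.cut(
    toy["price"], bins=edges, labels=labels, include_lowest=True
)

# Count listings per band.
band_counts = toy["price_band"].value_counts(sort=False)
print(band_counts)
```

The labelled column can then be reused in later steps, for example as a `groupby` key, which the interval index returned by `value_counts(bins=...)` does not offer directly.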
Measures of central tendency
In [61]:
df["price"].mean()
Out[61]:
152.71132376395533
In [62]:
df["price"].median()
Out[62]:
106.0
In [63]:
df["price"].std()
Out[63]:
240.1287131622509
In [64]:
df["minimum_nights"].mean()
Out[64]:
7.031611663190611
In [65]:
df["minimum_nights"].median()
Out[65]:
3.0
Measures of Shape
In [66]:
df["price"].skew()
Out[66]:
19.120831694826197
In [67]:
df["price"].kurt()  # Excess kurtosis: how heavy the tails of the price distribution are (0 for a normal distribution)
Out[67]:
585.7930484394186
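To build some intuition for these two numbers, the sketch below compares a symmetric sample with one that has a single large outlier. Both series are made up for illustration; they are not drawn from the dataset.

```python
import pandas as pd

# Two tiny hypothetical samples: one symmetric, one with a single large value.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 3, 4, 100])

# skew() is 0 for perfectly symmetric data and positive when the right tail is longer.
print("skew, symmetric:   ", symmetric.skew())
print("skew, right-skewed:", right_skewed.skew())

# kurt() reports excess kurtosis, which grows when outliers make the tails heavier.
print("kurt, symmetric:   ", symmetric.kurt())
print("kurt, right-skewed:", right_skewed.kurt())
```

The large positive skew and kurtosis of `price` above point the same way: a long right tail driven by a small number of very expensive listings.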
How many listings are available throughout the year (365 days)?
In [68]:
df[df["availability_365"]==365].shape[0]
Out[68]:
1295
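The same boolean mask can give both the count and the share of such listings. A minimal sketch on a made-up frame named `toy` (values are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration.
toy = pd.DataFrame({"availability_365": [365, 0, 120, 365, 45]})

# True where the listing is available all 365 days.
full_year = toy["availability_365"] == 365

# Summing a boolean Series counts the True values; its mean is their proportion.
n_full_year = int(full_year.sum())
share_full_year = full_year.mean()

print(n_full_year, f"{share_full_year:.1%}")
```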
In [79]:
# DataFrame.corr() computes the pairwise correlation between columns (Pearson by default).
# Null values are excluded pair by pair; numeric_only=True restricts the result to numeric columns.
df.corr(numeric_only=True)
Out[79]:
| | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
---|---|---|---|---|---|---|
price | 1.000000 | 0.042771 | -0.048014 | -0.030608 | 0.057478 | 0.081817 |
minimum_nights | 0.042771 | 1.000000 | -0.080093 | -0.121772 | 0.127917 | 0.144146 |
number_of_reviews | -0.048014 | -0.080093 | 1.000000 | 0.549291 | -0.072375 | 0.172002 |
reviews_per_month | -0.030608 | -0.121772 | 0.549291 | 1.000000 | -0.009414 | 0.185818 |
calculated_host_listings_count | 0.057478 | 0.127917 | -0.072375 | -0.009414 | 1.000000 | 0.225680 |
availability_365 | 0.081817 | 0.144146 | 0.172002 | 0.185818 | 0.225680 | 1.000000 |
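The pairwise handling of nulls can be checked on a small example. The sketch below uses a made-up frame named `toy` (all values hypothetical) and compares the matrix entry with a correlation computed after dropping the incomplete rows by hand.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are invented for illustration.
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "number_of_reviews": [50, 40, 30, 20, 10],
    "reviews_per_month": [2.0, np.nan, 1.0, 1.5, np.nan],
    "room_type": ["Private room", "Shared room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

# numeric_only=True leaves the string column room_type out of the matrix.
corr = toy.corr(numeric_only=True)

# For each pair of columns, only rows where both values are present are used.
# Reproduce one entry manually by dropping rows with a missing reviews_per_month.
complete = toy.dropna(subset=["reviews_per_month"])
manual = complete["price"].corr(complete["reviews_per_month"])

# Rank the other numeric columns by their correlation with price.
price_corr = corr["price"].drop("price").sort_values()
print(price_corr)
```

Because each pair uses its own set of complete rows, entries involving `reviews_per_month` in the matrix above rest on fewer listings than the others.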