CS109 Data Science, Harvard

Methodology.

We define city infrastructure data, such as current housing prices, energy consumption of the area, neighborhood income, transportation accessibility, neighborhood green space, specific housing features, and crime rates, as “top-down data”. We define informal crowd-sourced data, such as Twitter texts, Instagram tags, or Yelp reviews, as “bottom-up data”. A third category, which has not yet been widely deployed, is what we define as “holistic visual data”: the general impression of your visual surroundings when you are at a specific spot, which we believe can be captured by Google Street View.

By testing the visual-surroundings data as predictor variables, we hope to capture information missing from the top-down data and thereby determine whether the visual environment is a significant feature in a housing price prediction model. We use machine learning methods to pre-process Google Street View images and assign them certain attributes. To combine the different datasets (from top-down to bottom-up) into one file for machine learning, we use a graph data structure: we collect information at different locations and merge it into pixelized, grid-based data points.
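The grid-based merge described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the cell size, the source names, the attribute names (`crime_rate`, `greenery`), and the helpers `cell_of`, `merge_on_grid`, and `neighbors` are all hypothetical. Treating grid cells as nodes connected to their four adjacent cells is one possible reading of "graph data structure".

```python
from collections import defaultdict

CELL_SIZE_DEG = 0.005  # hypothetical grid resolution, in degrees


def cell_of(lat, lon, size=CELL_SIZE_DEG):
    """Map a coordinate to the integer (row, col) index of its grid cell."""
    return (int(lat // size), int(lon // size))


def merge_on_grid(sources, size=CELL_SIZE_DEG):
    """Average every attribute of every source within each grid cell.

    sources maps a source name to a list of (lat, lon, {attr: value}) records.
    Returns {cell: {"source.attr": mean value}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for name, records in sources.items():
        for lat, lon, attrs in records:
            cell = cell_of(lat, lon, size)
            for attr, value in attrs.items():
                key = f"{name}.{attr}"
                sums[cell][key] += value
                counts[cell][key] += 1
    return {
        cell: {key: sums[cell][key] / counts[cell][key] for key in sums[cell]}
        for cell in sums
    }


def neighbors(cell, cells):
    """Return the 4-connected neighbours of a cell that exist in the grid."""
    row, col = cell
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [c for c in candidates if c in cells]


# Made-up records near downtown Boston, for illustration only.
sources = {
    "topdown": [
        (42.3601, -71.0589, {"crime_rate": 3.0}),
        (42.3603, -71.0587, {"crime_rate": 5.0}),
    ],
    "visual": [(42.3602, -71.0588, {"greenery": 0.4})],
}
grid = merge_on_grid(sources)
for cell, features in grid.items():
    print(cell, features, "neighbours:", neighbors(cell, grid))
```

Each grid cell becomes one row of the merged file, with one column per source attribute, so records from different sources that fall in the same cell end up in the same data point.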

With the two combined datasets of housing and rent prices in Boston, we conduct both regression and classification to propose relevant models for value prediction and to suggest the best features among about 50 features drawn from different sources.

Posted by Eille Jungmin Han on September 18, 2016

Learning Model.

We propose RandomForest regressors as the final regression model for housing value prediction and RandomForest classifiers as the suggested classification model, with an R_squared of 0.664 and a model accuracy of 0.77 on the test sets, respectively. After applying several feature selection methods, we grouped the important features for housing prices and for rent prices into three classes: Baseline, Socio-economic, and Visual. Each class contains 5-10 features, which can be used in the proposed models.
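A minimal sketch of how a test-set R_squared and accuracy of this kind can be computed is shown below. It assumes scikit-learn and uses synthetic data, so the numbers it prints are unrelated to the 0.664 and 0.77 reported above. The feature matrix, the `price` target, and the median split used to create price classes are all assumptions made for illustration; the project's actual class definition is not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged feature table (assumption, not project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
price = 3.0 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=500)

# Hypothetical classification target: above or below the median price.
price_class = (price > np.median(price)).astype(int)

# Hold out 25% of the rows as a test set for both tasks.
X_train, X_test, y_train, y_test, c_train, c_test = train_test_split(
    X, price, price_class, test_size=0.25, random_state=0
)

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, c_train)

r2 = r2_score(y_test, reg.predict(X_test))
acc = accuracy_score(c_test, clf.predict(X_test))
print(f"R^2 on test set: {r2:.3f}")
print(f"Accuracy on test set: {acc:.3f}")
```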

After feature selection, each class was evaluated separately with models ranging from simple linear regression to RandomForest regressors. Building on the per-class prediction results, we combined the three classes into an optimized feature set for prediction, with weighted values for all classes. While conducting parametric studies with the different feature sets, we tuned model-specific parameters to optimize the R_squared and scores of the selected models.
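The per-class evaluation can be sketched roughly as follows, assuming scikit-learn and synthetic data. The column assignments for the Baseline, Socio-economic, and Visual classes are hypothetical, and the class weighting mentioned above is not reproduced because its form is not specified; the combined set here is simply the union of all columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: one informative column per feature class (assumption).
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 9))
y = (2.0 * X[:, 0] + 1.5 * X[:, 3] + np.sin(2.0 * X[:, 6])
     + rng.normal(scale=0.5, size=n))

# Hypothetical mapping from feature class to column indices.
feature_classes = {
    "Baseline": [0, 1, 2],
    "Socio-economic": [3, 4, 5],
    "Visual": [6, 7, 8],
}
feature_classes["Combined"] = sorted(
    col for cols in feature_classes.values() for col in cols
)

# Factories so that every evaluation starts from a fresh, unfitted model.
models = {
    "LinearRegression": LinearRegression,
    "RandomForestRegressor": lambda: RandomForestRegressor(
        n_estimators=100, random_state=0
    ),
}

# Mean 5-fold cross-validated R^2 for every (feature class, model) pair.
scores = {}
for class_name, cols in feature_classes.items():
    for model_name, make_model in models.items():
        cv_r2 = cross_val_score(make_model(), X[:, cols], y, cv=5, scoring="r2")
        scores[(class_name, model_name)] = cv_r2.mean()
        print(f"{class_name:15s} {model_name:22s} mean R^2 = {cv_r2.mean():.3f}")
```

Comparing the rows for a single class against the "Combined" rows shows how much each feature class contributes on its own, and comparing the two models within a class shows whether a tree ensemble picks up structure that a linear fit misses.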

Finally, the Random Forest model was chosen for both regression and classification because of its distinctly higher prediction accuracy. This suggests that tree-based models work better for predicting housing values from visual data of the built environment. For value prediction that uses multiple features, including visual data and environmental information (which can be categorized by geolocation), we strongly recommend the Random Forest model.

Built Environment Assessment for the Housing Value Prediction

Is there any relationship between the built environment and housing and rent prices?
