Parsing, cleaning, and data structure

The data come from Google Street View, Google Places, Craigslist Boston, and Zillow.


There are two kinds of data sets: 1) socioeconomic data, representing the social aspects of the City of Boston, such as land price, crime, and energy usage; and 2) urban spatial data, representing the City of Boston as an urban

Socioeconomic Data:

Rental price data from Trulia [Jupyter Notebook]: download
data type: numeric and categorical
features: number of rooms and bathrooms, zip, address, date, rent price, SQFT (square feet)

Housing price data from Zillow [Jupyter Notebook]: download
data type: numeric and categorical
features: geographic coordinate, last_sold_price, number of bathrooms and bedrooms, zestimate amount, prices, property size, zip, tax year, status, address, last sold date, tax value, year built, home_size, home_type, property

Energy data in Boston from Boston Data [csv]: download
data type: numeric and categorical
features: Site EUI, geographic coordinate, Tax Parcel, Years Repo, Site Energ, ZIP, ShapeSTLen, Energy Star, Pct Gas, Reported, Pct_Electr, Address, GHG Intens, Year Built, Property T, Water Inte

Crime data in Boston from Boston Data [JSON]: download
data type: numeric, string, and categorical
features: year, type of crime, geographic coordinate

Property assessment data in Boston from Boston Craigslist [JSON]: download
data type: numeric, string, and categorical
features: average land values, geographic coordinate

House and Room post data in Boston from Boston Craigslist [JSON]: download
data type: numeric, string, and categorical
features: date, title of post, content of post, geographic coordinate

Urban Spatial Data:

Google Places data from Google Places API [Jupyter Notebook]: download
data type: numeric, string, and categorical
features: geographic coordinate, type of place (school, food, MBTA, etc.)

Google Street View images from Google Street View API [Jupyter Notebook]: download
[JSON]: download
data type: image
features: RGB values and numerical features computed from the images


Processing anything more complex than a single datum requires a data structure. Well-known formats include tabular, SQL-style (Structured Query Language) data such as CSV (Comma-Separated Values) and TSV (Tab-Separated Values), and NoSQL-style ("not only SQL") data such as JSON (JavaScript Object Notation) or graph structures. Unlike these storage formats, the data structure used here should let each data set interact and compute with its neighboring or connected data sets as efficiently as possible.

Data Processing:

In addition to the top-down data, bottom-up data (post-processed Google Street View images) can be mapped for predicting housing prices. These data are currently being processed and have not been analyzed yet. To process the Google Street View data, there are two data structures (a pixel structure and a graph structure) in which individual data are populated and computed.

The pixel data structure is a matrix that discretizes a city or district into a finite grid for analysis. Each pixel has a relationship with its neighbors and computes its own data on the basis of its neighbors' states, so that urban data can be naturally addressed and computed in their spatial context.
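The discretization step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the bounding box values, the function names to_pixel and rasterize, and the choice of averaging values per pixel are all assumptions made for the example.

```python
# Hypothetical bounding box roughly covering Boston (assumed values,
# not taken from the project).
LAT_MIN, LAT_MAX = 42.22, 42.40
LON_MIN, LON_MAX = -71.20, -70.98


def to_pixel(lat, lon, rows, cols):
    """Map a (lat, lon) coordinate to a (row, col) pixel index.

    Row 0 is the northern edge and column 0 is the western edge.
    Coordinates outside the box are clamped to the nearest border pixel.
    """
    frac_r = (LAT_MAX - lat) / (LAT_MAX - LAT_MIN)
    frac_c = (lon - LON_MIN) / (LON_MAX - LON_MIN)
    r = min(rows - 1, max(0, int(frac_r * rows)))
    c = min(cols - 1, max(0, int(frac_c * cols)))
    return r, c


def rasterize(points, rows, cols):
    """Place (lat, lon, value) records into a rows x cols grid.

    Each pixel holds the mean of the values that fall inside it,
    or None when no record falls inside it.
    """
    sums = [[0.0] * cols for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]
    for lat, lon, value in points:
        r, c = to_pixel(lat, lon, rows, cols)
        sums[r][c] += value
        counts[r][c] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else None
         for c in range(cols)]
        for r in range(rows)
    ]
```

Any geocoded record from the data sets listed earlier (a rent price, a crime count, a place type) could be placed on the grid this way; the grid resolution is a free parameter.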

Pixel Structure:

The pixel data structure is a two-dimensional matrix of individual pixels, each of which contains diverse data internally. As the parent of each pixel in the hierarchy, the pixel structure governs the computation and emergence of new data by processing not only each child pixel but also its neighbors. Just as in image processing, data affect their neighbors according to given algorithms, so that the effects of the data propagate through the given relationships and new patterns of data emerge.

example: data blending in pixel structure

In this way, each individual datum lives in its closest pixel and forms relationships with the neighboring pixels across the City of Boston.
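The text does not specify which blending algorithm is used, so the sketch below assumes the simplest option: each pixel moves toward the mean of its up to eight neighbors, much like a blur filter in image processing. The function name blend and the weight parameter are hypothetical.

```python
def blend(grid, weight=0.5):
    """Run one blending step on a 2D grid of numbers.

    Each pixel becomes a mix of its own value and the mean of its
    (up to eight) neighbors. weight=0 leaves the grid unchanged;
    weight=1 replaces each pixel with its neighbor mean.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect neighbor values, staying inside the grid borders.
            neighbors = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if neighbors:
                neighbor_mean = sum(neighbors) / len(neighbors)
            else:
                neighbor_mean = grid[r][c]  # a 1x1 grid has no neighbors
            out[r][c] = (1 - weight) * grid[r][c] + weight * neighbor_mean
    return out


# A single "hot" pixel in the center spreads into its surroundings.
hot = [[0.0, 0.0, 0.0],
       [0.0, 9.0, 0.0],
       [0.0, 0.0, 0.0]]
blended = blend(hot)
```

Calling blend repeatedly spreads a value further across the grid, which is one way a point observation could influence the pixels around it.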

Graph Structure:

A graph data structure can be deployed as a similar technique. A graph is a mathematical object that consists of nodes and edges, and graphs are widely used to represent relational data. A city's street network, a highway system, and a subway map are examples of objects whose graphs closely resemble their physical form. Thus, the graph structure will be deployed to process urban data in their spatial relationships in order to produce features for the housing price prediction model.

example: graph interacting with the data computed in each node, and visualizing the path from one node to another

example: Graph interacting with orphan nodes computing internal data based on the change

Thus, the graph makes it possible to relate the Google Places map to each pixel location, not by Euclidean distance but by Manhattan distance along the roads and streets of Boston, so that the actual travel distance in the urban context can be computed.
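The difference between straight-line distance and distance along streets can be illustrated with a toy street graph. The sketch below uses Dijkstra's shortest-path algorithm, which is an assumed choice (the text does not name an algorithm), and the node names, coordinates, and edge lengths are invented for the example.

```python
import heapq
import math


def network_distance(graph, source, target):
    """Length of the shortest path from source to target.

    graph maps each node to a dict {neighbor: edge_length}.
    Returns math.inf when target cannot be reached (an orphan node).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return math.inf


# Toy street network: A at (0, 0), B at (1, 0), C at (1, 1).
# Streets run A-B and B-C only; D is an orphan node with no streets.
streets = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0},
    "D": {},
}

along_streets = network_distance(streets, "A", "C")  # path A -> B -> C
straight_line = math.hypot(1.0, 1.0)                 # Euclidean A to C
```

Here the walk along the streets is longer than the straight line between the same two points, which is the gap the graph structure is meant to capture.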


Basically, these data are stored in a discretized grid that captures the City of Boston as data, which makes it possible to compute on these data sets both efficiently and comprehensively, synthesizing multiple data sets in each pixel.
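One way to picture the synthesis of multiple data sets in each pixel is to stack several grids of the same shape into a single grid of per-pixel feature records. This is only a sketch under that assumption; the function name stack_layers and the layer names "crime" and "rent" are hypothetical labels chosen for the example.

```python
def stack_layers(layers):
    """Combine named grids of equal shape into one grid of feature dicts.

    layers maps a layer name to a 2D list. The result has the same shape,
    and each cell is a dict {layer_name: value_at_that_pixel}.
    """
    if not layers:
        return []
    names = sorted(layers)
    first = layers[names[0]]
    rows, cols = len(first), len(first[0])
    for name in names:
        grid = layers[name]
        if len(grid) != rows or any(len(row) != cols for row in grid):
            raise ValueError(f"layer {name!r} does not match the grid shape")
    return [
        [{name: layers[name][r][c] for name in names} for c in range(cols)]
        for r in range(rows)
    ]


# Two tiny 1 x 2 layers with made-up values.
stacked = stack_layers({
    "crime": [[1, 2]],
    "rent": [[1000, 1500]],
})
```

Each cell of the result can then serve as one row of features for a prediction model, with the pixel location as its key.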