Our love-hate relationship with heatmaps and how we use kriging to make them
As stated before, depending on the tuning parameters (in this case the kernel bandwidth), you draw different conclusions. The plot on the left, with the original data, might even be the best one. The hate part of the love-hate relationship comes from this endless tweaking of the tuning parameter. The one-dimensional equivalent of this problem is choosing the number of bins of a histogram. As there is no goodness-of-fit measure that describes the quality of your histogram or heatmap, you end up with a number of different maps that lead to different conclusions.
But there is an important difference between the heatmap of the robberies in Springfield (the subject of Kenneth Field's blog post) and a heatmap of house prices. When you study the occurrence of robberies over space, you are doing a point pattern analysis. The question you are trying to answer is: where do the robberies occur?
With geospatial statistics you try to answer a different question though. For the Simpson police that question would be: what is the expected severity of a robbery at each location? A typical point pattern analysis question for residential real estate would be: where are most houses for sale? A geospatial statistics question is: where are houses the most expensive? As you are basically trying to predict a continuous variable, you can use a goodness-of-fit measure, such as the prediction error on listings you held back, to evaluate the quality of your map.
So we assume that the value of a house depends on its location. Can we build a model that predicts the value of each location in Belgium? In collaboration with our partner Zimmo.be, we investigated whether we could extract the value of a location from their online listings. Or in other words, how do we model the value of a location where there has never been a transaction before?
Thanks to the widespread availability of computing power, there are a number of straightforward ways to solve this. In the past, the most common way would have been to bin the online listings, for example by postal code. That is a natural first idea, but not without limitations. Doing so, one loses all variance within a postal code (as illustrated in this screenshot from an article in De Standaard).
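To make the limitation of binning concrete, here is a minimal sketch in Python. The listings, postal codes and prices are made up for illustration and are not Zimmo.be data. Each postal code collapses to a single mean, and the spread inside the bin no longer appears on the map.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical listings: (postal code, asking price in EUR). Made-up numbers.
listings = [
    ("9000", 250_000), ("9000", 410_000), ("9000", 330_000),
    ("3000", 300_000), ("3000", 320_000),
]

def bin_by_postal_code(rows):
    """Group prices per postal code and summarise each bin."""
    bins = defaultdict(list)
    for code, price in rows:
        bins[code].append(price)
    return {
        code: {"mean": mean(prices), "spread": pstdev(prices), "n": len(prices)}
        for code, prices in bins.items()
    }

summary = bin_by_postal_code(listings)
for code, stats in summary.items():
    # A choropleth map would only show "mean"; "spread" is what gets lost.
    print(code, stats)
```

Both bins end up with a similar mean, yet the spread within the first one is several times larger, which a map coloured by postal code cannot show.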
You could use a more granular binning, for example by neighborhood. We are not big fans of this method either, as not every neighborhood has the same number of listings. A neighborhood in a city center might have 50 historical listings while a neighborhood in the countryside might have none. But we still want to know the value of the neighborhood in the countryside as well! One argument in favor of this method is that the boundaries add a lot of information. A river, for example, might be a very good natural boundary for house prices. An interpolation method that does not know about such boundaries has to figure out by itself whether there is some kind of structural break.
Today the most common technique is to simply include the x,y coordinates as features in a model such as gradient boosting. This is a very fast method that gives results that are good enough in terms of predictive power, but the resulting maps are not very beautiful. Below you can see an example of such a heatmap, where we tried to predict the air pollution in Belgium (CurieuzeNeuzen dataset).
We claim that we are a geo-data-science company (we love maps!), so we want to do better. By doing better, we mean both modeling the location more accurately and making a more convincing map. As a small recap: what we are trying to do is use discrete points in time and space to model a surface. That sounds exactly like interpolation. From Wikipedia:
In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points.
There are a number of interpolation algorithms, such as nearest neighbors. But the most powerful interpolation algorithm for our problem is kriging, known as Gaussian processes to Bayesian statisticians. It would take us too far to explain the method completely, but this is an excellent introduction.
Kriging is a technique named after Danie G. Krige, a South African mining engineer who developed its empirical basis in the early 1950s; Georges Matheron formalized the theory in the early 1960s. Krige had to assess quickly whether a block of raw material of one cubic meter was worth processing. If he assessed that it contained a sufficient amount of gold, the block would be processed further. If not, the block would be left untouched. He made his decision based on a couple of measurements and his interpolation technique.
Kriging has two tuning parameters. The first one is the kernel width, an obvious bias-variance trade-off. The larger your kernel, the smoother your map but the less detail it shows. The smaller your kernel, the more detail you get, but the map might overfit the data. The second tuning parameter is the amount of noise you allow on your measurements. Are repeated measurements consistent? If, for example, you measure the presence of a material at a location a few times, you will measure more or less the same value each time, so the noise on your measurements is quite low. If, on the contrary, the sale of a house is your measurement of the value of its location, the noise will be very high, as the value of a house does not depend only on the location. Two houses next to each other might be sold at very different prices, although the value of the location is the same. Below you can see the optimal kernel width and noise level for 4 random areas in Belgium.
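As a side note on how such optimal values can be searched for, here is a minimal NumPy sketch of simple kriging (the posterior mean of a Gaussian process) with exactly these two knobs. It is an illustration under assumptions, not our production code: the data are synthetic, the kernel is a squared-exponential with unit variance, and the names `krige_mean`, `length_scale` and `noise_var` are our own for this example. Each candidate pair is scored by its error on held-out points.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """Squared-exponential covariance between two sets of 2-D points."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def krige_mean(x_train, y_train, x_new, length_scale, noise_var):
    """Posterior mean of a GP fitted to mean-centred targets (simple kriging)."""
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_nt = rbf_kernel(x_new, x_train, length_scale)
    # noise_var on the diagonal: how much we distrust a single observation.
    weights = np.linalg.solve(k_tt + noise_var * np.eye(len(x_train)),
                              y_train - y_mean)
    return y_mean + k_nt @ weights

# Synthetic stand-in for listings: a smooth "location value" plus house noise.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 10, size=(200, 2))
price = np.sin(xy[:, 0]) + np.cos(xy[:, 1]) + rng.normal(0, 0.5, size=200)

train, test = slice(0, 150), slice(150, 200)

def holdout_rmse(length_scale, noise_var):
    """Goodness of fit: prediction error on points the model never saw."""
    pred = krige_mean(xy[train], price[train], xy[test], length_scale, noise_var)
    return float(np.sqrt(np.mean((pred - price[test]) ** 2)))

# Small, arbitrary grid of candidate (kernel width, noise level) pairs.
grid = [(ls, nv) for ls in (0.1, 0.5, 1.5, 5.0) for nv in (0.01, 0.25, 1.0)]
scores = {params: holdout_rmse(*params) for params in grid}
best = min(scores, key=scores.get)
print("best (length_scale, noise_var):", best, "RMSE:", round(scores[best], 3))
```

Running the same search on different regions would give different winners, because both the smoothness of the surface and the noise on the observations vary from place to place. In practice one would more likely maximise the marginal likelihood than loop over a grid, but the hold-out version makes the trade-off easier to see.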
Here we notice something important: the optimal tuning parameters depend on the region. In machine learning you tend to pick the tuning parameters that lead to the highest predictive power, but when making a map, you also take the aesthetics into account. If 51% of the listings were in a city and 49% in the countryside, the best kernel for the map is not necessarily the one that optimizes the predictive power in the city.
A last huge benefit of kriging is that it doesn’t only predict the mean value for each location but also the uncertainty. You can cut off locations where the uncertainty is too high and make no prediction there. These are either locations where you have too few observations or locations where the noise on your observations is too high. We used this in combination with the Belgian open data on neighborhood density to only predict the value of houses in meaningful locations.
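The cut-off can be sketched in a few lines of NumPy. Again this is a sketch under assumptions: the observations are synthetic and clustered in one corner of the map, the kernel has unit prior variance, and the threshold `max_std` is an arbitrary value chosen for the example. Grid cells whose posterior standard deviation exceeds the threshold are left blank.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """Squared-exponential covariance between two sets of 2-D points."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def krige_with_std(x_train, y_train, x_new, length_scale, noise_var):
    """Posterior mean and standard deviation of the underlying surface."""
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_tt += noise_var * np.eye(len(x_train))
    k_nt = rbf_kernel(x_new, x_train, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - y_mean))
    v = np.linalg.solve(chol, k_nt.T)
    var = 1.0 - (v**2).sum(axis=0)  # prior variance is 1 for this kernel
    return y_mean + k_nt @ alpha, np.sqrt(np.clip(var, 0.0, None))

# Synthetic observations, all located in the lower-left corner of the map.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 3, size=(60, 2))
value = np.sin(xy[:, 0]) + rng.normal(0, 0.3, size=60)

# Regular prediction grid covering the whole 10 x 10 map.
gx, gy = np.meshgrid(np.linspace(0, 10, 21), np.linspace(0, 10, 21))
grid = np.column_stack([gx.ravel(), gy.ravel()])
mean, std = krige_with_std(xy, value, grid, length_scale=1.0, noise_var=0.09)

max_std = 0.5  # hypothetical threshold, to be chosen per application
masked = np.where(std <= max_std, mean, np.nan)  # NaN = no prediction shown
print(f"{np.isnan(masked).mean():.0%} of grid cells left blank")
```

Far away from any observation the standard deviation falls back to the prior value, so those cells are dropped, while cells surrounded by data keep their prediction.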
We hope this was helpful. Don’t forget to take a look at the resulting heatmap, and happy kriging ;).