Address matching: as much as possible or better safe than sorry?
We work with geodata and addresses. Let's take the concrete example of an insurer. For each customer, the insurer knows the address and the product the customer bought. The insurer has the following questions:
- In which region do I have which market share?
- Can you find a correlation between the risk I take on with fire insurance and the features Rockestate calculates from open data for the client's house? (e.g. is the cost of a claim correlated with the volume of a house?)
For both questions, the insurer expects us to match the addresses. But what does it actually mean to match an address? The two questions place different requirements on an address match. In the first case, we do not need an address match with open data. We just need a geolocation:
- Nieuwpoortsesteenweg 410 8400 Oostende → (51.223803 lat, 2.906891 lon)
It is not very important that the geolocation is exact. The impact on the calculated market share would not be severe if the coordinates of the address were, for example, those of the neighbouring house. As the insurer is interested in his market share in Oostende, he wants to include as many clients as possible, and therefore doesn’t really care if the geolocation of some clients is only approximate.
The second question can only be answered if the address is matched with a building ID from an open data source. For the previous example, a match means we have to find a building ID in the open data source for Flanders, which is the “Grootschalig Referentie Bestand” or GRB:
- Nieuwpoortsesteenweg 410 8400 Oostende → grb gbgoidn 928443
On the image below, you can see this match in greater detail in a screenshot of the Geopunt application of Informatie Vlaanderen. Thanks to this match, we know the contour of this building in the form of a polygon.
From this polygon we derive many features. Not only the area of the polygon is important, but also features like “is the polygon touching other polygons?” or “what is the distance to the closest street?”. And as these features matter in a risk analysis like the one in the second question, we do not want to mix up the polygon of one house with the polygon of its neighbour. So in this case, we would rather match fewer addresses, as long as we are 100% sure that each address we do match is linked to the right building ID.
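To make these features concrete, here is a minimal Python sketch of three of them: area, touching a neighbour, and distance to the closest street. It is an illustration under assumptions, not the actual feature pipeline. It assumes polygons are lists of (x, y) vertices in a projected coordinate system in metres and that shapes do not overlap or cross. The function names and the coordinates of the houses and the street are made up for the example.

```python
import math

def polygon_area(ring):
    """Area via the shoelace formula; ring is a list of (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p on the line through a and b, clamped to the segment.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def edges(points, closed):
    """Consecutive vertex pairs; closed=True adds the closing edge of a polygon."""
    pairs = list(zip(points, points[1:]))
    if closed:
        pairs.append((points[-1], points[0]))
    return pairs

def boundary_distance(edges_a, edges_b):
    """Minimum distance between two sets of segments that do not cross."""
    best = math.inf
    for a1, a2 in edges_a:
        for b1, b2 in edges_b:
            best = min(
                best,
                point_segment_distance(a1, b1, b2),
                point_segment_distance(a2, b1, b2),
                point_segment_distance(b1, a1, a2),
                point_segment_distance(b2, a1, a2),
            )
    return best

def touches(poly_a, poly_b, tolerance=0.01):
    """Assumption: polygons never overlap, so distance ~0 means they touch."""
    return boundary_distance(edges(poly_a, True), edges(poly_b, True)) <= tolerance

# Made-up example: a terraced house, its neighbour, a detached house, a street.
house = [(0, 0), (10, 0), (10, 8), (0, 8)]
neighbour = [(10, 0), (18, 0), (18, 8), (10, 8)]
detached = [(30, 0), (40, 0), (40, 8), (30, 8)]
street = [(-5, -6), (25, -6)]

area = polygon_area(house)
street_distance = boundary_distance(edges(house, True), edges(street, False))
```

In this example the house shares a wall with its neighbour, so `touches(house, neighbour)` is true while `touches(house, detached)` is not. Swapping the polygon of the house for that of the detached building would silently change every one of these features, which is why a wrong building ID is so harmful here.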
The following is an address we were not able to match:
- Streetname: Jules De Troozlaan
- Housenumber: 50/52
- Postalcode: 8370
- Municipality: Blankenberge
It is not difficult to imagine where this customer lives. Neither the street name nor the municipality seems to contain typos. But what do we do with the house number? Is there really a house with house number “50/52”? Or did the client insure two houses? A quick search in Google Maps already illustrates this problem. Besides the fact that Google doesn’t know the name of the street, Google did not find a house with the exact house number 50/52. As Google absolutely wants to return a result, it simply approximated the address by taking number 52. For 99% of Google Maps users, this is sufficient information.
What about the open data? That source does not know the full street name either, only the abbreviation ‘J. De Troozlaan’. Our matching algorithm will deal with this problem. But it seems there is only one building polygon for all the addresses between number 46 and number 54. Assuming that the house numbers on that side of the street are even, this polygon contains the addresses with house numbers 48, 50 and 52. If we click on the polygon, Geopunt tells us that the closest address is number 52.
All this makes little sense. Even if we matched the address with that large polygon, how would we know which fraction of the polygon corresponds to number “50/52”? Is “50/52” one house or actually two houses? Given these unanswerable questions, we had better not assume this polygon corresponds to that address, because that would lead to mistakes in the risk analysis. Although we can easily geolocate it, we have to accept that this address brings too many uncertainties, and we therefore do not match it.
And a quick look at Google Street View reveals that we can be happy we did not match it. There are clearly two different addresses, 50 and 52. On top of that, each of those addresses consists of many different apartments. Numbers 48, 50 and 52 all seem to have been built at the same time, by the same architect, construction firm and project developer.
For most insurers it remains challenging to differentiate between a situation where the address 50/52 actually exists and a scenario in which the customer decided to insure two different buildings located at numbers 50 and 52 respectively. (And we don’t have time to go over thousands of addresses manually.) We consider this a serious data quality issue. Cleaning up these addresses after the fire insurance is sold is hard work. It is a lot easier to be sure of an address match at the moment of intake. Therefore we always advise our clients, where possible, to work with a dropdown address list based on open data that is provided and maintained by the corresponding regional authorities. That way the client purchasing the fire insurance manually confirms the right address.
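As an illustration of how an abbreviated street name can still be matched, here is a small Python sketch. It is not our actual matching algorithm, just one simple approach: compare the names word by word and accept a word ending in a dot as an abbreviation of a word that starts with the same letters. The function names are hypothetical.

```python
def tokens(name):
    """Lowercase words; a dot is kept as the end of an abbreviated word."""
    return name.lower().replace(".", ". ").split()

def token_matches(a, b):
    """Two words match if equal, or if one is a dotted abbreviation of the other."""
    if a == b:
        return True
    for short, full in ((a, b), (b, a)):
        if short.endswith(".") and len(short) > 1 and full.startswith(short[:-1]):
            return True
    return False

def same_street(name_a, name_b):
    """Conservative comparison: same number of words, every word must match."""
    words_a, words_b = tokens(name_a), tokens(name_b)
    if len(words_a) != len(words_b):
        return False
    return all(token_matches(x, y) for x, y in zip(words_a, words_b))

# The street from the example, as written by the client and in the open data.
match = same_street("Jules De Troozlaan", "J. De Troozlaan")
```

With this rule, “Jules De Troozlaan” and “J. De Troozlaan” are treated as the same street, while a different first name without a dot, such as “Jan De Troozlaan”, is not.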
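The difference between “as much as possible” and “better safe than sorry” can be sketched in a few lines of Python. This is a simplified illustration, not our production logic. The polygon IDs and the lookup table are hypothetical, and the rule that also rejects a polygon shared by several house numbers is an assumption of this sketch.

```python
import re

SIMPLE = re.compile(r"\d+[A-Za-z]?")          # e.g. "52" or "52A"
COMPOSITE = re.compile(r"\d+\s*[/-]\s*\d+")   # e.g. "50/52" or "50-52"

def classify_house_number(raw):
    """Return 'simple', 'composite' or 'unknown' for a raw house number."""
    value = raw.strip()
    if SIMPLE.fullmatch(value):
        return "simple"
    if COMPOSITE.fullmatch(value):
        return "composite"
    return "unknown"

def lenient_number(raw):
    """Geolocation use case: any number in the string is good enough."""
    found = re.search(r"\d+", raw)
    return found.group() if found else None

def strict_building_match(raw, polygon_by_number):
    """Risk use case: return a polygon ID only when there is no ambiguity."""
    if classify_house_number(raw) != "simple":
        return None  # "50/52": one house or two? We cannot know.
    polygon = polygon_by_number.get(raw.strip())
    if polygon is None:
        return None
    sharing = [n for n, p in polygon_by_number.items() if p == polygon]
    if len(sharing) > 1:
        return None  # which fraction of the polygon belongs to this number?
    return polygon

# Hypothetical lookup table for one side of the street in the example.
polygon_by_number = {
    "46": "poly-A",
    "48": "poly-B", "50": "poly-B", "52": "poly-B",
    "54": "poly-C",
}

approximate = lenient_number("50/52")                        # "50"
strict = strict_building_match("50/52", polygon_by_number)   # None
```

The lenient function always returns something usable for a market share map, while the strict one prefers returning nothing over returning a building it cannot be sure about.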
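A dropdown like this can be very simple. The sketch below, in Python, filters a reference list on whatever the customer has typed so far. The function name, the address format and the three reference addresses are assumptions for the example; a real list would come from the regional open data source.

```python
def suggest(query, reference_addresses, limit=5):
    """Return reference addresses containing the typed text, case-insensitively."""
    needle = " ".join(query.lower().split())
    hits = [a for a in reference_addresses if needle in a.lower()]
    return sorted(hits)[:limit]

# Made-up extract of a reference list; note there is no "50/52" entry.
reference_addresses = [
    "J. De Troozlaan 48, 8370 Blankenberge",
    "J. De Troozlaan 50, 8370 Blankenberge",
    "J. De Troozlaan 52, 8370 Blankenberge",
]

options = suggest("De Troozlaan 5", reference_addresses)
```

Because the customer can only pick an entry that exists in the reference list, an address such as “50/52” never enters the system. A customer who wants to insure both buildings has to select number 50 and number 52 separately.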