Address Cleanse

Author

Doaa Kurdi

Updated

2022-08-22

What is Address Cleansing

Address Cleansing is the process of verifying, correcting, formatting and completing address data. If an address is missing values, performing address cleansing means filling in absent details with accurate results. If an address is not in the correct format, cleansing organises the address components to adhere to the local authoritative postal service address guidelines.

Ideal Postcodes' Address Data Cleansing solution attempts to return the closest matching address for any given address input. We also return a number of Match Level indicators that describe the degree to which the suggested address matches the input address. The more impaired the input address, the harder it is to cleanse.

We accept address inputs in CSV format.

We will return the cleanse results in a CSV format, which preserves the original data. The suggested address and additional Match Indicators will be appended in subsequent columns.

Use the Confidence Score and Match Indicators to infer which addresses can be reliably cleansed.

Input

We require an input file in CSV format. The first row must contain header names.

Required Columns

  • id. Any unique ID you have assigned to the address so that it can later be identified.
  • address. The complete address to cleanse.

Optional Columns

  • postcode
  • post_town
  • udprn
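As a rough illustration, an input file could be checked against the requirements above before it is submitted. This is a minimal sketch, not part of the cleanse service: the function name `validate_input_csv` and the sample rows are hypothetical, and only the `id` and `address` column names come from the lists above.

```python
import csv
import io

# Column names taken from the "Required Columns" list above.
REQUIRED_COLUMNS = {"id", "address"}


def validate_input_csv(text):
    """Return a list of problems found in input CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    headers = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - headers
    if missing:
        return ["missing required columns: " + ", ".join(sorted(missing))]

    problems = []
    seen_ids = set()
    # Rows are counted from 1, excluding the header row.
    for row_number, row in enumerate(reader, start=1):
        row_id = (row["id"] or "").strip()
        if not row_id:
            problems.append(f"row {row_number}: empty id")
        elif row_id in seen_ids:
            problems.append(f"row {row_number}: duplicate id {row_id}")
        seen_ids.add(row_id)
        if not (row["address"] or "").strip():
            problems.append(f"row {row_number}: empty address")
    return problems


# Hypothetical sample: the second row reuses an id and has no address.
sample = "id,address\n1,12 Pye Green Road\n1,\n"
for problem in validate_input_csv(sample):
    print(problem)
```

The check treats ids as unique strings after trimming whitespace, which is one reasonable reading of "unique ID"; adjust it if your ids follow a different convention.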

Output

The cleanse process will yield an output CSV file containing the original inputs appended with the suggested address and Match Indicators.

Suggested Address

Immediately after the input columns, we append standard address fields:

  • Address Line 1
  • Address Line 2
  • Address Line 3
  • Post Town
  • Postcode

Match Indicators

  • Count. The number of addresses we matched to the input. We return the closest match by default.
  • Organisation Match
  • Premise Match
  • Postcode Match
  • Thoroughfare Match
  • Locality Match
  • Post Town Match
  • Fit. A score between 0 and 1. Fit compares the address elements present in your query against the corresponding elements of the matching address. Elements you have not supplied are not counted in the score. A partial address (e.g. 12 Pye Green Road) will have a fit of 1 even though it is missing post town and postcode. Its confidence score, however, will be less than 1 because it is missing some crucial elements.
  • Confidence. A confidence score between 0 and 1. 1 indicates a full match. 0 indicates that no elements match completely.

Match Levels

For any given address attribute (e.g. premise, thoroughfare, locality), we return a match level that describes how closely the input address conforms to the suggested address. The full list is below:

  • FULL. A character for character match has been detected.
  • PARTIAL. A close match has been detected.
  • INCORRECT. The suggested address attribute and input address attribute do not match.
  • MISSING. The suggested address has this attribute while the input address does not.
  • NA. The suggested address does not use this particular attribute and so it is considered not applicable.
  • NO_MATCH. No match whatsoever could be found for the input address.
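When processing the output programmatically, the six levels above can be modelled as an enumeration. The sketch below is illustrative only: the `MatchLevel` class, the helper names and the choice of which levels count as "acceptable" are assumptions, not something the cleanse output prescribes.

```python
from enum import Enum


class MatchLevel(Enum):
    """The match levels listed above, keyed by their output value."""
    FULL = "FULL"
    PARTIAL = "PARTIAL"
    INCORRECT = "INCORRECT"
    MISSING = "MISSING"
    NA = "NA"
    NO_MATCH = "NO_MATCH"


# Assumption: levels that do not contradict the suggested address.
# Tighten or loosen this set to suit your own project.
ACCEPTABLE = {MatchLevel.FULL, MatchLevel.PARTIAL, MatchLevel.NA}


def parse_match_level(value):
    """Convert a cell value such as ' full ' into a MatchLevel."""
    return MatchLevel(value.strip().upper())


def is_acceptable(value):
    """Return True if the cell value is one of the ACCEPTABLE levels."""
    return parse_match_level(value) in ACCEPTABLE


print(is_acceptable("FULL"))
print(is_acceptable("MISSING"))
```

An unrecognised value raises `ValueError`, which makes unexpected cells visible early rather than silently treating them as a match.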

Confidence Score

The confidence score is a number between 0 and 1, where 1 implies a full match and 0 implies that no major elements match completely. Each incorrect, missing or misspelled element subtracts from the overall confidence score.

Deciding on an Acceptable Confidence Score Threshold

Different address cleanse projects can have radically different inputs. However, within each project, the inputs tend to repeat the same errors. For instance, some input datasets may have been entered entirely by hand and be prone to typos. Others may have a persistently missing datapoint such as organisation name or postcode. For this reason, there is no absolute Confidence Score threshold. Instead, the acceptable confidence score must be determined on a project-by-project basis, based on the systematic errors present in the data and on your business goals.

When determining an acceptable Confidence Score threshold, load a subset of the dataset into a spreadsheet application like Excel and sort by the score from highest to lowest. Scrolling from top to bottom, you will be able to observe matches from best to worst. As you start to hit the lower quality matches, you will be able to roughly determine:

  • Which confidence scores indicate ambiguous matches (i.e. up to building level only)
  • Which confidence scores indicate a poor or no match (i.e. the nearest matching address is too far from the input address)
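The same review can be done in a script instead of a spreadsheet. The following is a minimal sketch that assumes the output file has a column headed `Confidence`; check the header in your own output file. The function names and sample rows are hypothetical, and treating a blank score as 0 is an assumption.

```python
import csv
import io

CONFIDENCE_COLUMN = "Confidence"  # assumed header name; check your output file


def confidence(row, column=CONFIDENCE_COLUMN):
    """Parse the score, treating a blank cell as 0 (an assumption)."""
    value = (row.get(column) or "").strip()
    return float(value) if value else 0.0


def load_sorted(text, column=CONFIDENCE_COLUMN):
    """Read output CSV text and return its rows sorted from best to worst."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return sorted(rows, key=lambda row: confidence(row, column), reverse=True)


def count_at_or_above(rows, threshold, column=CONFIDENCE_COLUMN):
    """Count how many rows a candidate threshold would accept."""
    return sum(1 for row in rows if confidence(row, column) >= threshold)


# Hypothetical output rows, reduced to the columns needed here.
sample = (
    "id,address,Confidence\n"
    "a,12 Pye Green Road,0.6\n"
    "b,Example input two,0.95\n"
    "c,Example input three,\n"
)
rows = load_sorted(sample)
for row in rows:
    print(row["id"], confidence(row))
```

Reading the sorted rows from the top and calling `count_at_or_above` with a few candidate values shows how much of the dataset each threshold would accept.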

Depending on your business goals, you can also use the Match Levels to determine an acceptable match. For instance, do you need to match up to the thoroughfare or building name only? Are accurate organisation names an important feature?
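One way to combine a confidence threshold with Match Level requirements is sketched below. It assumes the output columns are headed `Confidence`, `Premise Match` and `Thoroughfare Match`, mirroring the indicator names listed earlier; the actual headers in your file may differ. The function name, the rule set and the 0.8 threshold are hypothetical examples, not recommended values.

```python
def meets_requirements(row, min_confidence, required_levels):
    """Return True if a row passes the threshold and every match level rule.

    required_levels maps a column name to the set of match levels
    that are acceptable for that column.
    """
    score = float(row.get("Confidence") or 0)  # blank treated as 0 (assumption)
    if score < min_confidence:
        return False
    for column, allowed in required_levels.items():
        level = (row.get(column) or "").strip().upper()
        if level not in allowed:
            return False
    return True


# Example rule set: the premise may be a close match,
# but the thoroughfare must match exactly.
rules = {
    "Premise Match": {"FULL", "PARTIAL"},
    "Thoroughfare Match": {"FULL"},
}

# Hypothetical output row, reduced to the columns the rules use.
row = {"Confidence": "0.9", "Premise Match": "PARTIAL", "Thoroughfare Match": "FULL"}
print(meets_requirements(row, 0.8, rules))
```

Keeping the rules in a dictionary makes it straightforward to add further columns, such as the organisation match, if accurate organisation names matter for your project.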