Validating & cleaning ZIP codes in Python
When I’m working with data, I probably spend 90% of my time cleaning. In a recent project, I ran into an issue where I needed to make ZIP codes consistent – some were five-digit, some had the plus-four, some had a hyphen while others didn’t, and some were just invalid.
I searched for a simple script to validate and standardize ZIP codes, but couldn’t find any, so I created a simple script.
A quick word of caution when working with ZIP codes in Python: You’ll want to make sure they are stored as strings, otherwise a 00412 ZIP code would be 412 as an integer.
The code
First, we’ll import our libraries and create a test dataframe.
import pandas as pd import numpy as np import re d = {'NAME': ['bob', 'linda', 'tina', 'gene', 'louise'], 'ZIP': ['21310', '00412', '21310-1234', '004125555', '21310-']} df = pd.DataFrame(data=d)
When we print the dataframe, this is what we get. As you can see, the last 2 are problematic.
NAME ZIP 0 bob 21310 1 linda 00412 2 tina 21310-1234 3 gene 004125555 4 louise 21310-
So to solve this, we’ll create a simple for loop to iterate over each cell in the ZIP column.
for i, rows in df.iterrows(): ifor_val = df.iloc[i]['ZIP'] if re.match(r"^\d{9}$", ifor_val): zipc = ifor_val[:5] plus4 = ifor_val[5:] strings = [zipc,plus4] ifor_val = '-'.join(strings) elif re.match(r"^\d{5}-$", ifor_val): zipc = ifor_val[:5] ifor_val = zipc elif re.match(r"^\d{5}-\d{4}$", ifor_val): ifor_val = df.iloc[i]['ZIP'] elif re.match(r"^\d{5}$", ifor_val): ifor_val = df.iloc[i]['ZIP'] else: ifor_val = 'invalid' df.at[i,'ZIP'] = ifor_val
Let’s break down this code.
We start with creating a variable (ifor_val) defined as the ZIP column value for each row.
In the first if statement, we’re testing if the value is 9 contiguous numbers. If true, we’ll format to XXXXX-XXXX using slices and the join method.
In the second if statement, we’re testing whether the value is 5 contiguous numbers and ends with a hyphen. If true, we’ll remove the hyphen.
In the third if statement, we’re testing whether the value matches a regular expression where there are 5 contiguous numbers, a hyphen, then 4 contiguous numbers. If true, we’ll keep the value.
In the fourth if statement, we’re testing whether the value matches a regular expression where there are 5 contiguous numbers. If true, we’ll keep the value.
Finally, if the value doesn’t match any of these conditions, the cell will be updated to “invalid.”
When we print the dataframe again, this is what we get:
NAME ZIP 0 bob 21310 1 linda 00412 2 tina 21310-1234 3 gene 00412-5555 4 louise 21310