python data analysis concept

Validating & cleaning ZIP codes in Python

When I’m working with data, I probably spend 90% of my time cleaning. In a recent project, I ran into an issue where I needed to make ZIP codes consistent – some were five-digit, some had the plus-four, some had a hyphen while others didn’t, and some were just invalid.

I searched for a simple script to validate and standardize ZIP codes, but couldn’t find any, so I created a simple script.

A quick word of caution when working with ZIP codes in Python: You’ll want to make sure they are stored as strings, otherwise a 00412 ZIP code would be 412 as an integer.

The code

First, we’ll import our libraries and create a test dataframe.

import pandas as pd
import numpy as np
import re

d = {'NAME': ['bob', 'linda', 'tina', 'gene', 'louise'], 'ZIP': ['21310', '00412', '21310-1234', '004125555', '21310-']}
df = pd.DataFrame(data=d)

When we print the dataframe, this is what we get. As you can see, the last 2 are problematic.

	NAME	ZIP
0	bob	21310
1	linda	00412
2	tina	21310-1234
3	gene	004125555
4	louise	21310-

So to solve this, we’ll create a simple for loop to iterate over each cell in the ZIP column.

for i, rows in df.iterrows():
    ifor_val = df.iloc[i]['ZIP']
    if re.match(r"^\d{9}$", ifor_val):
        zipc = ifor_val[:5]
        plus4 = ifor_val[5:]
        strings = [zipc,plus4]
        ifor_val = '-'.join(strings)
    elif re.match(r"^\d{5}-$", ifor_val):
        zipc = ifor_val[:5]
        ifor_val = zipc
    elif re.match(r"^\d{5}-\d{4}$", ifor_val):
        ifor_val = df.iloc[i]['ZIP']  
    elif re.match(r"^\d{5}$", ifor_val):
        ifor_val = df.iloc[i]['ZIP']
    else:
        ifor_val = 'invalid'
    df.at[i,'ZIP'] = ifor_val

Let’s break down this code.

We start with creating a variable (ifor_val) defined as the ZIP column value for each row.

In the first if statement, we’re testing if the value is 9 contiguous numbers. If true, we’ll format to XXXXX-XXXX using slices and the join method.

In the second if statement, we’re testing whether the value is 5 contiguous numbers and ends with a hyphen. If true, we’ll remove the hyphen.

In the third if statement, we’re testing whether the value matches a regular expression where there are 5 contiguous numbers, a hyphen, then 4 contiguous numbers. If true, we’ll keep the value.

In the fourth if statement, we’re testing whether the value matches a regular expression where there are 5 contiguous numbers. If true, we’ll keep the value.

Finally, if the value doesn’t match any of these conditions, the cell will be updated to “invalid.”

When we print the dataframe again, this is what we get:

	NAME	ZIP
0	bob	21310
1	linda	00412
2	tina	21310-1234
3	gene	00412-5555
4	louise	21310

You may also like