September 30, 2020October 1, 2020 by Marcus Rauhut

Python for Marketers: Pulling data from the NPI registry

What this is for: Extracting bulk data from the NPI registry and downloading as CSV
Requirements: Knowledge of JSON, Pandas dataframes
Concepts covered: Making an API request, flattening JSON, combining dataframes
Download: Download the Jupyter notebook

If you work with health care provider data, you are no doubt familiar with NPI (National Provider Identifier) data. The NPI registry provides a database of health care providers across the country, including names, addresses, and specialties.

The Centers for Medicare & Medicaid Services makes the data available for download. However, there are a couple drawbacks to working with the downloaded data. First, it’s a massive file that will take up a lot of hard disk space and will be too big for Excel. Second, you would have to download updates constantly as the data changes.

Luckily, there is a public API available that allows quick searches with a little bit of Python code. The API is free and doesn’t require any authentication token, so if you’re looking to build your first application with an API, this is a great project.

In this tutorial, we’ll extract bulk data from the NPI registry and convert to a CSV. Like other APIs, the application will return results in a JSON format, so there will be some steps in getting the data how we need it.

Making the request

First, we’ll import libraries:

import requests
import json
import pandas as pd
import numpy as np
from flatten_json import flatten

We’ll make use of an incredibly handy Python library, Flatten JSON. If you don’t already have this library installed, just follow the directions on the page to install in your Python environment.

Our API request is made by making a request to a URL with parameters such as name, city, state, etc. To understand the available parameters and how to structure searches, you should explore the API demo and documentation.

For this tutorial, we’ll pull 10 providers in the 90210 zip code. So our API endpoint will look like this:

https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=90201&country_code=&limit=10&skip=&version=2.0

The code will make a request and then print the results:

r = requests.get('https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=90201&country_code=&limit=10&skip=&version=2.0')
results_text = r.text
print(results_text)

Working with JSON

At this point, we have our data, but it’s not in a useable format. First, we’ll need to convert to a JSON object.

results_json = r.json()

From here we can easily access records or fields. For example, if we wanted to print the first name of the first record, the code would look like this:

print(results_json['results'][0]['basic']['first_name'])

Let’s break this down. results_json is the JSON object we just created. The provider records are nested under “results”: so we must add [‘results’]. [0] is the index, so it pulls the first record. Name fields are nested under “basic”: so we must add [‘basic’]. Then the field name we want is “first_name”, so we add [‘first_name’].

Flattening the JSON

The above method can capture a specific field within a specific record, but our goal is to dump all of the data into a CSV.

There are two major challenges. First, much of the data is nested under other objects. For example, name fields are all nested under [‘basic’]. The second challenge is each record has different available fields. For example, some providers may have 1 address, while others may more.

To tackle the first challenge, we’ll need to flatten the JSON.

What this does is remove the nesting hierarchy. From there, we can make a dataframe where each field is its own column. To do this for the first provider record, it will look like the code below. The first line flattens the record. The second line takes the dictionary and converts to a dataframe.

first_record = flatten(results_json['results'][0])
df1 = pd.DataFrame.from_dict(first_record,orient='index').T
df1

Converting to CSV

The script above works for one record, but we’ll need to flatten all of the records. To do this, we’ll create a basic loop to iterate over each record.

First, we’ll need to know the record count. In our example, we limited to the results to 10, but if you didn’t specify a limit, we’ll need to pull this number dynamically.

To do this, we’ll need to access the result_count field within the JSON. This will store the number as a variable and print it. Then we’ll make our loop.

result_count = results_json['result_count']
print(result_count)

#Create loop parameters
i = 0
#Create loop
while i < result_count:
    record = flatten(results_json['results'][i])
    print(record)
    data = pd.DataFrame.from_dict(record,orient='index').T
    if i == 0:
        df = data
    else:
        df = pd.concat([df,data], axis=0, ignore_index=True)
    i+=1

Let’s break this down. The i variable establishes the index of the record, starting at 0. The loop will run until the index number reaches the total number of record results.

Within the loop, we’ll flatten each record, print it, then store as a dataframe. The first record will create the dataframe, then each subsequent loop will use the concat() function to add to the dataframe.

After running the loop, the dataframe will look like this:

Finally, just save the dataframe to a CSV.

df.to_csv('NPI_Export.csv')

Done!

in Analytics 0 comments

Python for Marketers: Pulling data from the NPI registry

Making the request

Working with JSON

Flattening the JSON

Converting to CSV

You may also like

Working with weighted survey data in Power BI

Validating & cleaning ZIP codes in Python

How accurate is sampled Google Analytics data? A simple experiment to find out