How to scrape and clean data into a CSV using Microsoft Word

Converting raw, non-delimited text into usable formats

If you’ve spent as much time working with datasets as I have, you’ve probably run into a situation at least once where the data you need does not come in a neat, tidy, or even usable format.

In a perfect world, raw data would be delimited in some way – either by a comma or tab. Even data in HTML tables and other formats can be fairly easily scraped into a CSV.

But what if your data just comes out as a bunch of plain text with no formatting and is not delimited?

I ran into this issue as I started exporting data out of an EMR. The data came out as an HTML file, and while it is at least consistent, it was not structured. Here’s what it looked like:

But after setting up some macros in Word and tweaking the VBA script a little bit, I’m now able to achieve this with a couple clicks:

If you have data scraper software (or some advanced programming knowledge) there are other and probably better ways to do this. But if you’re looking for a quick, easy way to do this repeatedly with common desktop software, it’s possible Microsoft Word. But note: for this to work, your data must follow a predictable pattern.

Basically, this process uses the Find & Replace function to remove white spaces with commas to separate the values.

Process

  1. Copy and paste your data into a new Word document.
  2. Start recording a new macro. (View -> Macros -> Record New Macro)
  3. First thing’s first – delete any commas. The end goal is to create a CSV, so any existing commas are going to mess up the data structure. Use Find & Replace to find any commas in your document and delete them (leave the Replace field blank).
  4. Use the Find & Replace function to replace whitespaces (^w) with commas where there is a break in data values. Depending on the data, this may require some advanced knowledge of regular expressions in building your search queries. In this example, anywhere I saw “(space)thru(space)“ I knew this represented a break in values, so I used find and replace to replace “^wthru^w” with “,” throughout the document.
  5. From here you should starting seeing the data to develop some structure. Continue using Find & Replace to get rid of unnecessary elements. Again, this may require some knowledge of building advanced search queries in order to do this consistently through the document.
  6. Once you have the data looking like a true CSV, save it as a text file and stop recording the macro.
  7. Open the file in Windows Explorer and change the file extension to .CSV. Open it up in Excel to validate that the columns are lining up and your data is breaking at the right point. If they aren’t, go back to your document and tweak your Find & Replace queries. This will take some experimentation to get it right.
  8. The next steps are optional but will improve your workflows. Edit your macro (View -> Macros -> Edit).
  9. In the VBA script editor, find the line near the end where it saves the file name. It should look something like this:

  10. Edit the file name to include the today’s date and change the extension to .CSV. It should look something like this:

Voila! Now when I get the HTML file from the EMR system, I simply run the macro and within seconds I have a CSV.

This isn’t a perfect method, but if you’re not a programmer and don’t have access to an advanced data scraper, it’s a quick workaround in Word.

You may also like