Address Validation with IBM InfoSphere DataStage

Erich Guenther

/ 2014-01-07

Maintaining a database of postal mail addresses requires them to be maintained over time and validated. Conventional wisdom is usually that an address database should be largely static and never change after data entry unless someone physically moves. However, as in many cases, conventional wisdom is not correct in this case. Both the United States Geological Survey (USGS) and the United States Postal Service (USPS) recommend that physical street addresses be maintained at least once a quarter via a process named geocoding, which includes street name, city, state and zip code validation.

Even if a client does not relocate or physically move their mailing address can be subject to change as zip code boundaries are redrawn, street names change over time and sometimes land swaps or annexations occur between cities shifting an address from one city or town to another.

Over the years, a number of different methods were used to survey the United States. This resulted in different methods of street layouts and addressing. Additionally, house numbering systems can also vary by region of the country. Instead of attempting to develop software to compensate for these anomalies and attempt to get it certified by the USPS or USGS, it is my opinion that you should consider off the shelf software for street address validation that has already met the standards of the USPS. There is more than one such package on the market but for the purposes of this article, and my past experience, I will discuss the Pitney Bowes Product, formerly referred to as Group One Address Validation. It is now called “CODE-1 PLUS.” You can click on the link below for a quick overview of this product.

https://www.pb.com/software/Mailing-and-Postal-Compliance/CASS-Certification/CODE-1-Plus.shtml

Diving into the specific details of how this software works is well beyond the scope of this brief article. The Pitney Bowes documentation is very detailed in how the software works, what kind of detailed results it returns, and how to setup the software to work as an API or Java web service. Many large organizations have already purchased the software where it is installed on its own servers somewhere in their IT departments. All that is typically required is to encapsulate or wrap the C-based API into a Java Web Service wrapper and publish the Web service WSDL file for internal company access (there are usually license limitations to using the software external to the company).

IBM’s InfoSphere DataStage is a great tool that can be setup and used in conjunction with Address Validation software such as CODE-1 PLUS and via a recurring and scheduled batch process to keep address databases fresh. First it is important to understand the components of an address and an address breakdown. Most mailing or service delivery addresses have several components that an address validation/maintenance program will recognize. They are:

  1. Longitude (optional but recommended)
  2. Latitude (optional but recommended)
  3. Street Number
  4. Street Name
  5. City
  6. State
  7. Zip Code

It is important to keep these items separate.

Clients often ask why it is important to also include “geocoding” information such as longitude and latitude within the street address validation. This information can be used as a good second source of information to validate the address as well as it can be used with billing systems particularly in tax valuation processes to assign subcomponents of sales tax if an item or service is sold to the specific address. For example if there is a transit tax or sewage district tax embedded in the sales tax, longitude and latitude will give you a good indication if the street address falls within the specific tax district for these taxes.

Address validation is an iterative process and setting up InfoSphere DataStage to perform it is quite straightforward. The first step in the process is to obtain the Java Web Service*.WSDL location. Once you obtain the *.WSDL address all you need to do is import the *.WSDL file into the InfoSphere DataStage Web Services Transformer. Upon completion of a successful import of the *.WSDL page, the Transformer should detail what is needed for input and what is needed for output for the web service to work via the Web Services Transformer.

You then create an input link from a source that provides the inputs requested for the web service to produce output and the return codes from the web service so you can design your next InfoSphere DataStage course of action. If you are new to the CODE-1PLUS software, a good intermediate step at this point is to design an output as either a viewable dataset or ASCII file so you can eyeball the various return codes and results that are output and you can proceed with your iterative design from there. However, I will outline how a good and basic street address validation system should work in batch. It should be very similar to how you use address validation in a data entry system GUI.

As stated above address validation is an iterative process. First you need to validate the City, State and Zip combination. This is logical as you want to slowly shrink the possibilities from the largest area to the smallest. You first check to see that the zip code is valid for the state, then check to see if the zip is valid for the city within the state. There is no point in going further in the validation process if you fail on these two edits because the remaining results will not be correct.

When you pass the zip code validation you want to then proceed to the street name validation. Does the street name exist within the stated city, state and zip code? If the answer is yes then you want to proceed onto the street number. Is the street number valid on this specific street name? When the answer is yes, you have completed the street address validation.

Keep a close eye on the return codes from the street validations because in some cases you will get a return code that says the street is only validated down to the Centrex zip code. This means the longitude and latitude for this specific street address is placed right in the center of the zip code area. When this happens it means the software could only validate that the street exists within the zip code but cannot determine exactly where within the zip code it exists. Typically this is acceptable (letter delivery) because in most cases of small to medium-sized Post Offices if you can get the letter to the correct Post Office, the letter carrier will find the street instead bouncing the mail back. In the cases of taxes, however, a Centrex zip code might not be a precise enough location to determine taxing authority or assess a component of a sales tax properly.

In some cases the software will not be able to validate the exact street name as you have it spelled in your database. For example, a street named ELMBROOK (without a space) might only be accepted as ELM BROOK (with a space) by the USPS and the CODE-1 PLUS software. In the case of a tie or multiple matches, the software will return a list of street addresses that are similarly named within an array and you will have to use some logic to attempt to pick one. It is probably more efficient to just dump these unvalidated streets into another table or file and use additional logic in a later process in an attempt to find an exact match vs slowing the batch cycle first pass down with match details on a small percentage of records.

When you design your system and after you become more fluent with the capabilities of CODE-1 PLUS, you can add additional validations or granularity onto the street address validation processes. For example, adding in logic to respond to the various software return codes and shooting for a goal of an exact match xx% of the time. You then have various options to get close to 100% matching (you may not ever hit the 100% goal). You can dump the remainder unvalidated addresses to an output file and outsource to another company to see if they can work the remainder. You can also send the unvalidated addresses to be reconfirmed via an internal client service call center. You might have other options as well. It is unlikely you will get an exact match for every single address in your database and will have to sort the addresses based on what accuracy level the software has indicated for the address validations.

Why would you want to outsource unvalidated street addresses to another company? In some cases, third-party vendors have access to the USPS Change of Address database. This is a database of scanned Change of Address forms that are mailed or completed at the local post office. In some cases this allows these vendors to follow a client as they physically move from address to address. These changes can then be captured as part of your update system. This type of matching is not always possible or accurate and has to also be validated in some way.

Summary

Every client realizes they should maintain their mail address database. It makes good business sense. It is relatively straightforward to do with IBM’s InfoSphere DataStage and saves postal costs (potentially giving the internal project a payback return on investment). It also leads to increased client satisfaction because it eliminates the client needing to call in and inform you their zip code and/or street name has changed.

Share This Story, Choose Your Platform!

Share This Story

Drive your business forward!

iOLAP experts are here to assist you