Projects / Technical Work
GitHub URL: https://github.com/selintunali
For more information about my projects and code samples, please email me at selintunali.tunali@gmail.com
1. $3.8 million Graduation Project
Partnered with a healthcare provider company, Zelis, on improving their existing payment processing algorithm. Zelis receives documents from clients and sends them to healthcare providers around the nation. However, Zelis’s clients do not have a unified way of formatting the end destination addresses, which results in pieces of mail being sent to the same address separately when they could be packaged together.
We improved the data cleaning algorithm and developed new standardization rules, leveraging machine learning and NLP techniques (NER, stemming, and lemmatization), to reduce the number of packages sent to the same address. We also developed efficiency metrics to measure how much the new algorithm improves mail grouping compared to the current algorithm, and formulated a similarity score algorithm to find the best combination of pieces of mail into packages for mailing. Finally, we performed an economic analysis using these efficiency metrics. We were able to decrease the number of packages sent by 3% on average, saving Zelis $3.8 million each year.
KEYWORDS: address, algorithm, grouping, cleaning, Python, consolidation, mail, package, string match, standardization, machine learning, NLP techniques
Figure 1: Zelis Company Flowchart
Data Cleaning
The first part of the algorithm, data cleaning, aims to standardize the raw data by following the steps below:
Capitalizing all letters
Removing punctuation
Applying mapping tables in the form of JSON files to standardize directionals, states, business suffixes and prefixes
Tagging each address into components using the usaddress Python package
Parsing recipient and company names into components with Python's string split function
Mapping tables are used in Step 3 because they quickly and easily replace known incorrect spellings in the recipient or company name column. This method is applied to state abbreviations, directional abbreviations, and street type abbreviations, as well as incorrect spellings of abbreviations, such as “CORP”, “CORPORAT”, and “CORPORA”.
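Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping entries, the choice of “CORP” as the standard form, and the function name clean_field are hypothetical, and the usaddress tagging and name parsing steps are left out so the example runs with the standard library alone.

```python
import json
import string

# Hypothetical mapping table. The project stores tables like this in JSON files;
# the entries and the choice of standard forms here are assumptions.
MAPPING_JSON = """
{
  "CORPORAT": "CORP",
  "CORPORA": "CORP",
  "CORPORATION": "CORP",
  "STREET": "ST",
  "NORTH": "N"
}
"""
MAPPING = json.loads(MAPPING_JSON)

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_field(raw, mapping=MAPPING):
    """Uppercase, strip punctuation, then standardize tokens via the mapping table."""
    upper = raw.upper()                      # Step 1: capitalize all letters
    no_punct = upper.translate(_PUNCT_TABLE)  # Step 2: remove punctuation
    tokens = no_punct.split()
    # Step 3: replace each token that has an entry in the mapping table
    return " ".join(mapping.get(tok, tok) for tok in tokens)


print(clean_field("Abc Corporat."))
print(clean_field("123 North Main Street,"))
```

A token-level dictionary lookup keeps the replacement step fast and makes the rules easy to extend by editing the JSON file rather than the code.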
Data Grouping
The data grouping algorithm assigns a weighted score to each component of the address and name information, with the weights summing to a total score of 100. The weights assigned to each component were chosen in consultation with Zelis employees. Scoring each component differently prevents separate packages from being shipped to recipients that have different names but the same address. For example, a company “ABC” that is written as “ABC LLP” in one record and “ABC LP” in another, with the same address under both names, returns an address score of 100, corresponding to a perfect address match. This lets us identify incorrect suffixes or differing notations that the cleaning algorithm does not catch. The similarity score must take into account both the address and name scores and their corresponding weights in order to identify packages that are meant for the same location and recipient. Our client emphasized that address and name components are equally important, so both are weighted equally in the overall score. The exception is PO Box information: because it is a more specific piece of information, it receives a slightly higher weight.
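The weighted scoring idea can be illustrated with a short sketch. The component names, the weight values, and the use of difflib.SequenceMatcher for per-component similarity are all assumptions made for this example; the actual weights were set with Zelis and are not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical weights; each table sums to 100. The real values are not shown here.
ADDRESS_WEIGHTS = {"street_number": 30, "street_name": 40, "zip_code": 30}
NAME_WEIGHTS = {"base_name": 80, "suffix": 20}


def component_similarity(a, b):
    """Similarity of two strings in [0, 1] (assumed measure: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_score(rec_a, rec_b, weights):
    """Sum of weight * similarity over the listed components (0 to 100)."""
    return sum(
        weight * component_similarity(rec_a.get(key, ""), rec_b.get(key, ""))
        for key, weight in weights.items()
    )


def overall_score(rec_a, rec_b, address_share=0.5):
    """Combine address and name scores; 0.5 weights them equally."""
    address = weighted_score(rec_a, rec_b, ADDRESS_WEIGHTS)
    name = weighted_score(rec_a, rec_b, NAME_WEIGHTS)
    return address_share * address + (1 - address_share) * name


a = {"street_number": "123", "street_name": "MAIN ST", "zip_code": "07921",
     "base_name": "ABC", "suffix": "LLP"}
b = dict(a, suffix="LP")

print(weighted_score(a, b, ADDRESS_WEIGHTS))  # identical addresses give a perfect address score
print(overall_score(a, b))
```

In this sketch the two records share a perfect address score, while the suffix mismatch lowers only the name score slightly, so the pair can still clear a high grouping threshold. The address_share parameter is a hypothetical knob that could be raised slightly for PO Box records.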
From a technical perspective, the data grouping algorithm begins by creating a hashmap whose keys are tuples containing the PO Box (if present), zip code, and street name components. A value is associated with each key. Initially, this value is an empty list. Later, a name and address are added to this list if their key matches the key they are compared against and their similarity score meets the specified threshold of 95%.
Each tuple functions as an initial filter that decreases the algorithm's runtime, because the algorithm does not compare records that don’t share the same zip code, street name, and PO Box value (a boolean value of ‘True’ if the address is a PO Box type or ‘False’ if it is a full address). For each record, the algorithm checks whether its tuple already exists in the dictionary. If a tuple match exists, the algorithm compares the record against the list of address and name values stored under that tuple key. Instead of comparing each address and name combination with all previous entries, the algorithm compares a new entry only with the first “comparer” element. If the similarity score from this comparison meets the 95% threshold, the new name and address are added to the list of values for that tuple key. The algorithm repeats this process for all items in the cleaned dataset.
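The grouping pass described above can be sketched as follows. Several details are assumptions made for illustration: the record field names, the stand-in similarity function (a plain difflib ratio rather than the weighted component score), and the choice to start a new group under the same key when a record does not meet the threshold against the comparer.

```python
from collections import defaultdict
from difflib import SequenceMatcher

THRESHOLD = 95.0  # similarity threshold on a 0-100 scale


def similarity(rec_a, rec_b):
    """Stand-in score on a 0-100 scale (assumption: plain string ratio)."""
    text_a = f"{rec_a['name']} {rec_a['address']}"
    text_b = f"{rec_b['name']} {rec_b['address']}"
    return 100.0 * SequenceMatcher(None, text_a, text_b).ratio()


def blocking_key(rec):
    """Tuple key used as the initial filter: PO Box flag, zip code, street name."""
    return (rec["is_po_box"], rec["zip_code"], rec["street_name"])


def group_records(records, threshold=THRESHOLD):
    """Map each key to a list of groups; group[0] is the 'comparer' record."""
    groups_by_key = defaultdict(list)
    for rec in records:
        candidate_groups = groups_by_key[blocking_key(rec)]
        for group in candidate_groups:
            # Compare only against the first element of the group.
            if similarity(rec, group[0]) >= threshold:
                group.append(rec)
                break
        else:
            # Assumption: an unmatched record starts a new group under the same key.
            candidate_groups.append([rec])
    return groups_by_key


def make_record(name, address, zip_code, street_name, is_po_box=False):
    return {"name": name, "address": address, "zip_code": zip_code,
            "street_name": street_name, "is_po_box": is_po_box}


records = [
    make_record("ABC CORP", "123 MAIN ST", "07921", "MAIN"),
    make_record("ABC CORPP", "123 MAIN ST", "07921", "MAIN"),  # typo in name
    make_record("XYZ LLC", "123 MAIN ST", "07921", "MAIN"),    # different recipient
    make_record("ABC CORP", "9 OAK AVE", "10001", "OAK"),      # different key
]

for key, groups in group_records(records).items():
    print(key, [[r["name"] for r in g] for g in groups])
```

Because records are only ever compared within their own key and only against one comparer per group, the number of similarity computations stays far below the all-pairs count.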
Figure 2: Data grouping tuple diagram
Figure 3: Different weights used
Experimental Results & Conclusion
Our first objective was to create efficacy metrics for algorithm performance. We devised our own metric, the consolidation rate:
In general, we found that as the threshold for data grouping decreases, money saved increases but accuracy decreases. Zelis emphasized the importance of high accuracy, so we ran the data grouping algorithm at two high thresholds, 95% and 100%, to measure the consolidation rate. A threshold of 95% means that two records are grouped only when their addresses and names, compared against one another, have a similarity score of 95% or greater. We found that lower thresholds, such as 75%, grouped together some packages that should not be grouped. Note that a 100% threshold results in an algorithm that is very similar to Zelis’s old algorithm, which uses exact string matching.
Figure 4: Scoring algorithm consolidation rates
The 95% threshold had an average consolidation rate of 49.4%, while the 100% threshold had an average consolidation rate of 47.2%. Compared to Zelis’s old algorithm, which had an average consolidation rate of 46.4%, even the 100% threshold reduces the number of packages shipped. However, we do not believe that a 100% threshold is necessary, as items with misspelled names and addresses should still be packaged together. We find the 95% threshold satisfactory: it increases consolidation by 3 percentage points over the old algorithm, which decreases spending by $75,000 each week. Scaling to a year’s worth of shipping, the new algorithm would potentially save Zelis $3.9 million.
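The arithmetic behind these figures can be checked with a few lines of Python. The rates and the weekly savings come from the results above; the 52-week year is an assumption used for the annual scaling.

```python
# Average consolidation rates reported above, in percent.
OLD_RATE = 46.4
RATE_AT_100 = 47.2
RATE_AT_95 = 49.4

WEEKLY_SAVINGS_USD = 75_000
WEEKS_PER_YEAR = 52  # assumption: shipping runs every week of the year

# Gains over the old algorithm, in percentage points.
gain_at_95 = round(RATE_AT_95 - OLD_RATE, 1)
gain_at_100 = round(RATE_AT_100 - OLD_RATE, 1)

annual_savings_usd = WEEKLY_SAVINGS_USD * WEEKS_PER_YEAR

print(f"Gain at 95% threshold: {gain_at_95} percentage points")
print(f"Gain at 100% threshold: {gain_at_100} percentage points")
print(f"Annual savings: ${annual_savings_usd:,}")
```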
Figure 5: Algorithm Flow Chart