This is a tool I created to shrink estimates of population statistics in the CPS using weights that decay with distance, but it could certainly be used for other purposes. The datasets contain two distance measures:
1. the minimum number of borders one must cross to travel from one state to any other, and
2. the distance between state centers, in miles.
LINKS FOR DOWNLOAD:
Distance files for each state
Source files used to create distance data
A browser view-able version of the code
Details below the jump...
Need a Stata function that does spellcheck? OK, not quite spellcheck: the spellchecker in MS Word does a lot of checking for transpositions and misspelling probabilities that this won't do, but it's definitely less crude than counting the position-specific differences between two strings.
Levenshtein distance is a metric designed to measure the similarity of two strings. Basically, it is the minimum number of insertions, deletions, or substitutions needed to transform one string into the other.
As with many math-related topics, Wikipedia does a pretty good job of explaining the mechanics. Also, here are some implementations in other languages.
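The program itself is written in Stata, but the underlying recurrence is language-neutral. As a minimal sketch (the function name and the two-row memory layout are mine, not from the original program), here is the standard dynamic-programming formulation in Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # prev holds distances from the empty prefix of a to each prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the table is enough because each cell depends only on the current and previous rows.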
Programming a Mata function for this would be fairly easy, as would building a matrix for each pair of words. This program instead uses temporary variables, which can be somewhat computationally intensive but is well suited to making lots of comparisons simultaneously (after all, Stata is good at vector manipulation).
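To see what the temp-variable approach buys you, here is a rough Python/NumPy analogue (my own sketch, not a port of the Stata code): every cell of the dynamic-programming table is a vector with one entry per observation, so all string pairs advance through the recurrence together, the way columns of temp variables do in Stata.

```python
import numpy as np

def levenshtein_many(a_list, b_list):
    """Column-wise Levenshtein over many string pairs at once."""
    n = len(a_list)
    la = np.array([len(s) for s in a_list])
    lb = np.array([len(s) for s in b_list])
    max_a, max_b = la.max(), lb.max()
    # character matrices, padded with a null byte past each string's end
    A = np.array([list(s.ljust(max_a, "\0")) for s in a_list])
    B = np.array([list(s.ljust(max_b, "\0")) for s in b_list])
    # D[i][j] is a length-n vector: the DP cell (i, j) for every pair
    D = [[None] * (max_b + 1) for _ in range(max_a + 1)]
    D[0][0] = np.zeros(n, dtype=int)
    for i in range(1, max_a + 1):
        D[i][0] = np.full(n, i)
    for j in range(1, max_b + 1):
        D[0][j] = np.full(n, j)
    for i in range(1, max_a + 1):
        for j in range(1, max_b + 1):
            cost = (A[:, i - 1] != B[:, j - 1]).astype(int)
            D[i][j] = np.minimum.reduce([D[i - 1][j] + 1,
                                         D[i][j - 1] + 1,
                                         D[i - 1][j - 1] + cost])
    # each pair reads its answer at its own (len_a, len_b) cell, so the
    # padding never contaminates the result
    return np.array([D[la[k]][lb[k]][k] for k in range(n)])
```

The inner loops run over character positions, not over observations, so the per-observation work is handled as whole-vector operations.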
I wrote this program to do record linkage using names, which can be accomplished by using joinby and then comparing the strings of candidate matches.
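The linkage idea itself is simple: form all candidate pairs (the role joinby plays in Stata) and keep the pairs whose names are within a small edit distance. A hedged Python sketch, where the function names and the distance threshold are illustrative choices of mine rather than anything from the original program:

```python
from itertools import product

def levenshtein(a, b):
    # minimal edit-distance helper (insert / delete / substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def link_records(names_a, names_b, max_distance=2):
    """Cross every record in A with every record in B, then keep
    near-matches. max_distance is an illustrative threshold."""
    return [(a, b) for a, b in product(names_a, names_b)
            if levenshtein(a, b) <= max_distance]

print(link_records(["SMITH, JOHN"], ["SMITH, JON", "DOE, JANE"]))
# [('SMITH, JOHN', 'SMITH, JON')]
```

In practice you would block the cross join on something cheap (state, birth year) first, since the number of candidate pairs grows as the product of the two file sizes.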
Here’s the link – enjoy!