Removing Point Outliers
In my previous post on removing point outliers, I tried using R and PL/R in PostgreSQL. Although I only scratched the surface of R's spatial analysis capabilities, I needed something more extensible for my internet purposes. I decided on Python for its pragmatism and ease of programming. The idea was to pull the vector points out of PostGIS, process them with an algorithm (ideally a minimum convex hull, though that could get expensive later on), and then remove the outliers.
NumPy, a scientific Python library, fits in easily: it provides basic functions for array computations such as mean, median, standard deviation and variance. For now, the algorithm takes a 90% threshold, taken from “Dealing with ‘Outliers’: Maintain Your Data’s Integrity”:
Consider this collection of 10 scores, sorted from smallest to largest:

    x:   8   25   35   41   50   75   75   79   92   129

The median of these 10 values of x is 62.5, computed as (75+50)/2. Next, calculate the absolute value of the deviation of the original data from the median:

      x     med    abs_dev
     50    62.5     12.5
     75    62.5     12.5
     75    62.5     12.5
     79    62.5     16.5
     41    62.5     21.5   ->|
     35    62.5     27.5   ->|  MEDIAN(abs_dev) = 24.5 = (21.5+27.5)/2
     92    62.5     29.5
     25    62.5     37.5
      8    62.5     54.5
    129    62.5     66.5

Next, compute a test statistic, which is the column of absolute values computed above divided by the median of the absolute values:

    Test Stat = abs_dev / (Med of abs_dev)

                           Med of     Test
      x   Median  abs_dev  abs_dev  Statistic  Outlier?
      8    62.5    54.5     24.5     2.22449
     25    62.5    37.5     24.5     1.53061
     35    62.5    27.5     24.5     1.12245
     41    62.5    21.5     24.5     0.87755
     50    62.5    12.5     24.5     0.51020
     75    62.5    12.5     24.5     0.51020
     75    62.5    12.5     24.5     0.51020
     79    62.5    16.5     24.5     0.67347
     92    62.5    29.5     24.5     1.20408
    129    62.5    66.5     24.5     2.71429    Yes

The decision rule then is to compare this test statistic with an arbitrary cutoff point. A cutoff of 2.5 is conservative; 4.5 or 5 is more rigorous. If the Test Statistic > Critical value (=2.5), then define the observed value as an outlier. According to this cutoff value, the data above have one outlier (x=129).
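Before moving on to the point data, the arithmetic in the quoted example can be checked with a few lines of NumPy. This is only a sketch for verification; the variable names (scores, mad, test_stat, cutoff) are mine and do not come from the quoted article.

```python
import numpy as np

# The 10 scores from the quoted example
scores = np.array([8, 25, 35, 41, 50, 75, 75, 79, 92, 129], dtype=float)

med = np.median(scores)            # median of the raw scores (62.5)
abs_dev = np.abs(scores - med)     # absolute deviation of each score from the median
mad = np.median(abs_dev)           # median of the absolute deviations (24.5)
test_stat = abs_dev / mad          # test statistic for each score

cutoff = 2.5                       # the cutoff used in the quoted example
outliers = scores[test_stat > cutoff]
print(outliers)                    # [129.]
```

Only 129 exceeds the 2.5 cutoff, which matches the table above.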
Implementing this in Python…
P = 116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,116.329918 39.904675,116.330047 39.904683
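    from numpy import median

    # getPointsString() and getPointArray() are helper functions not shown in this post
    multipoints = getPointsString()
    print(multipoints)
    pobj = getPointArray(multipoints)
    p = pobj.p
    x = pobj.x
    y = pobj.y

    # print("Median:", median(p, axis=0))
    # print("Std:", p.std(axis=0))
    # print("Min:", p.min(axis=0))
    # print("Max:", p.max(axis=0))

    # Assuming p holds one (x, y) pair per row, take the medians per column
    pmed = median(p, axis=0)             # median of the points
    pdev = p - pmed                      # deviation from the median
    pdev_abs = abs(pdev)                 # absolute deviation
    med_pdev = median(pdev_abs, axis=0)  # median of the absolute deviations
    pfinal = pdev_abs / med_pdev         # test statistic for each point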
Here getPointsString() returns a string such as “116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,116.329918 39.904675,116.330047 39.904683..”, a list of point geometries. We can easily get the median, the standard deviation (std), and even the minimum (min) and maximum (max) values of the array.
Here the original dots are marked in red, while the dots that remain after removing the outliers are colored green.
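For readers without the helper functions, here is a minimal, self-contained sketch of the whole step, from the point string to the filtered points. The functions parse_points and mad_filter are hypothetical names of mine, not the helpers used in this post, and applying the test to x and y independently (dropping a point if either coordinate exceeds the cutoff) is my assumption about how the statistic carries over to two dimensions.

```python
import numpy as np

def parse_points(text):
    """Parse a string like 'x y,x y,...' into an (n, 2) float array."""
    return np.array([[float(v) for v in pair.split()]
                     for pair in text.split(",")])

def mad_filter(points, cutoff=2.5):
    """Split points into (kept, removed) using the median absolute deviation test.

    Assumption: x and y are tested independently, and a point is removed
    if either coordinate has a test statistic above the cutoff.
    """
    med = np.median(points, axis=0)          # per-column median
    abs_dev = np.abs(points - med)           # absolute deviations per coordinate
    mad = np.median(abs_dev, axis=0)         # per-column median absolute deviation
    mad = np.where(mad == 0, np.nan, mad)    # avoid dividing by zero; NaN never exceeds the cutoff
    stat = abs_dev / mad
    is_outlier = (stat > cutoff).any(axis=1)
    return points[~is_outlier], points[is_outlier]

# The five sample points listed earlier in this post
sample = ("116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,"
          "116.329918 39.904675,116.330047 39.904683")

pts = parse_points(sample)
kept, removed = mad_filter(pts)
print("kept:", len(kept), "removed:", len(removed))
```

Five points is far too small a sample to say anything meaningful; it is used here only because it is the data shown above. With a cutoff of 2.5 the first and last points are flagged, and raising the cutoff to 4.5 or 5 changes how aggressively points are dropped.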