Removing Point Outliers

In my previous post to remove point outliers, I tried using R and PLR in PostGres. Although, I only scratched the surface on the spatial analyzing capabilities of R, I needed something more extensible for my internet purposes. I decided to use Python’s pragmatic benefits and ease in programming. Idea was to pull out the vector points from PostGIS, process it using an algorithm (ideally minimum convex hull but it could be expensive later on) and then remove the outliers.

Numpy, a scientific python library, blends easily by using basic functions for mathematical array computations such as mean, median, standard deviation and variance. For now, the algorithm takes a 90% threshold, taken from “Dealing with ‘Outliers': Maintain Your Data’s Integrity”

Consider this collection of 10 scores, sorted from smallest to largest:
 
  x    8 25 35 41 50   75 75 79 92 129
                     ^
The median of these 10 values of x is 62.5, computed as (75+50)/2.
 
Next, calculate the absolute value of the deviation of original data from median:
 
   x     med  abs_dev
 
  50    62.5    12.5
  75    62.5    12.5
  75    62.5    12.5
  79    62.5    16.5
  41    62.5    21.5 -&gt;|
  35    62.5    27.5 -&gt;|  MEDIAN(abs_dev) = 24.5 = (21.5+27.5)/2
  92    62.5    29.5
  25    62.5    37.5
   8    62.5    54.5
 129    62.5    66.5
 
Next, compute a test statistic which is the column of absolute values computed above, divided by the mediate of the absolute values:
 
Test Stat = abs_dev / (Med of abs Dev)
 
                           Med of       Test
  x    Median  abs_dev    abs dev    Statistic  Outlier?   
 
  8     62.5     54.5       24.5      2.22449
 25     62.5     37.5       24.5      1.53061
 35     62.5     27.5       24.5      1.12245
 41     62.5     21.5       24.5      0.87755
 50     62.5     12.5       24.5      0.51020
 75     62.5     12.5       24.5      0.51020
 75     62.5     12.5       24.5      0.51020
 79     62.5     16.5       24.5      0.67347
 92     62.5     29.5       24.5      1.20408
129     62.5     66.5       24.5      2.71429       Yes
 
The decision rule then is to compare this test statistic with an arbitrary cutoff point. A cutoff of 2.5 is conservative; 4.5 or 5 is more rigorous. If the Test Statistic &gt; Critical value (=2.5), then define the observed value as an outlier. According to this cutoff value, the data above have one outlier (x=129).

Implementing this in Python…

P = 116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,116.329918 39.904675,116.330047 39.904683

    multipoints = getPointsString()
    print multipoints
 
    pobj = getPointArray(multipoints)
    p = pobj.p;
    x = pobj.x;
    y = pobj.y;
 
    #print "Median:",  median(p)
    #print "Std:",  p.std(axis=0)
    #print "Min:", p.min(axis=0)
    #print "Max:", p.max(axis=0)
 
    pmed = median(p)
    pdev = p - pmed
    pdev_abs = abs(pdev)
    med_pdev = median( pdev_abs )
    pfinal = pdev_abs / med_pdev

Where getPointsString() = “116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,116.329918 39.904675,116.330047 39.904683..” a list of point geometries. We can easily get the median, std, and even minimum (min) and maximum (max) values in the array.

Here the original dots are marked as red, while the final dots after removing the outliers were colored as green.