By Rupert
Algorithms
Removing Point Outliers
Nov 19th
In my previous post to remove point outliers, I tried using R and PLR in PostGres. Although, I only scratched the surface on the spatial analyzing capabilities of R, I needed something more extensible for my internet purposes. I decided to use Python’s pragmatic benefits and ease in programming. Idea was to pull out the vector points from PostGIS, process it using an algorithm (ideally minimum convex hull but it could be expensive later on) and then remove the outliers.
Numpy, a scientific python library, blends easily by using basic functions for mathematical array computations such as mean, median, standard deviation and variance. For now, the algorithm takes a 90% threshold, taken from “Dealing with ‘Outliers’: Maintain Your Data’s Integrity”
Consider this collection of 10 scores, sorted from smallest to largest: x 8 25 35 41 50 75 75 79 92 129 ^ The median of these 10 values of x is 62.5, computed as (75+50)/2. Next, calculate the absolute value of the deviation of original data from median: x med abs_dev 50 62.5 12.5 75 62.5 12.5 75 62.5 12.5 79 62.5 16.5 41 62.5 21.5 ->| 35 62.5 27.5 ->| MEDIAN(abs_dev) = 24.5 = (21.5+27.5)/2 92 62.5 29.5 25 62.5 37.5 8 62.5 54.5 129 62.5 66.5 Next, compute a test statistic which is the column of absolute values computed above, divided by the mediate of the absolute values: Test Stat = abs_dev / (Med of abs Dev) Med of Test x Median abs_dev abs dev Statistic Outlier? 8 62.5 54.5 24.5 2.22449 25 62.5 37.5 24.5 1.53061 35 62.5 27.5 24.5 1.12245 41 62.5 21.5 24.5 0.87755 50 62.5 12.5 24.5 0.51020 75 62.5 12.5 24.5 0.51020 75 62.5 12.5 24.5 0.51020 79 62.5 16.5 24.5 0.67347 92 62.5 29.5 24.5 1.20408 129 62.5 66.5 24.5 2.71429 Yes The decision rule then is to compare this test statistic with an arbitrary cutoff point. A cutoff of 2.5 is conservative; 4.5 or 5 is more rigorous. If the Test Statistic > Critical value (=2.5), then define the observed value as an outlier. According to this cutoff value, the data above have one outlier (x=129).
Implementing this in Python…
P = 116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,116.329918 39.904675,116.330047 39.904683
multipoints = getPointsString() print multipoints pobj = getPointArray(multipoints) p = pobj.p; x = pobj.x; y = pobj.y; #print "Median:", median(p) #print "Std:", p.std(axis=0) #print "Min:", p.min(axis=0) #print "Max:", p.max(axis=0) pmed = median(p) pdev = p - pmed pdev_abs = abs(pdev) med_pdev = median( pdev_abs ) pfinal = pdev_abs / med_pdev
Where getPointsString() = “116.32977 39.905319,116.329906 39.90464,116.329907 39.90464,116.329918 39.904675,116.330047 39.904683..” a list of point geometries. We can easily get the median, std, and even minimum (min) and maximum (max) values in the array.
Here the original dots are marked as red, while the final dots after removing the outliers were colored as green.
Geometric Algorithms in GIS
Nov 16th
Here is a couple of Geometric Algorithms used in GIS.
- Convex hull problem: for a set of points, determine the smallest convex set that contains all.
- Line segment intersection: for a set of line segments, determine all intersections.
- Voronoi diagram computation: for a set of points, determine the subdivision of the plane into cells such that inside some cell, one and the same point of the set is closest.
- Delaunay triangulation: for a set of points, determine a planar subdivision by creating edges between the input points in such a way that no two edges intersect, all faces are triangles, no more edges can be added with the given constraints, and no circumcircle of any triangle contains an input point in its interior.
- Minkowski sum: for two simple polygons P and Q, compute the shape that consist exactly of the sum of all points of P and all points of Q, where sum is interpreted as the vector sum.
- Rectangular range search: for a set of points in the plane, design a data structure on those points, such that for every axis-parallel query rectangle, all points in the data structure that lie in the query rectangle can be reported efficiently. Algorithms are needed for the construction of the data structure and for the execution of a query.
Reference:
M Kreveld, Computational Geometry: Its objectives and relation to GIS, Institute of Information and Computing Sciences, Utrecht University


Comments