I am excited to announce that this is supposed to be my first article published also on r-bloggers.com :)

The processing of data needs to take dimensionality into account as usual metrics change their behaviour in subtle ways, which impacts the efficiency of algorithms and methods that are based on distances / similarities of data points. This has been tagged the “curse of dimensionality“. Just as well, in some cases high dimensionality can aid us when investigating data – “blessing of dimensionality”. But in general it is, as usual, a good thing to know what’s going on and so let’s have a look at what dimensionality does to data.

Hyper-Spheres and Neighbourhoods

R code for above chart # full code at: # https://github.com/joyofdata/miscellaneous/blob/master/high-dim/sphere_volume.R library(ggplot2) library(grid) vol_of_sphere <- function(dim, N) { m <- matrix(runif(dim*N,-1,1), ncol=dim) sum(sqrt(rowSums(m^2)) <= 1) / N } df <- data.frame( dim <- 1:10, v <- sapply(1:10, function(dim) vol_of_sphere(dim,1000000)) ) # plotting aove computed data # [...] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 # full code at: # https://github.com/joyofdata/miscellaneous/blob/master/high-dim/sphere_volume.R library ( ggplot2 ) library ( grid ) vol_of_sphere < - function ( dim , N ) { m < - matrix ( runif ( dim * N , - 1 , 1 ) , ncol = dim ) sum ( sqrt ( rowSums ( m ^ 2 ) ) <= 1 ) / N } df < - data . frame ( dim < - 1 : 10 , v < - sapply ( 1 : 10 , function ( dim ) vol_of_sphere ( dim , 1000000 ) ) ) # plotting aove computed data # [...]

Within the euclidian spaces of conventional dimensions 1,2 and 3 we tend to think of the neighbourhood of a point, containing close or similar points, as the interior of an interval, circle or sphere. Those are all canonical concepts based on the euclidian distance from center to the respective point being at most some constant. In above chart this constant is simply 1 and we compare the volume of the hyper-sphere with radius 1 to the volume of the smallest hyper-box containing it – hence edges being of length 2. For a one dimensional space the box, as well as the “hyper-sphere” is an interval of length 2 – and on we go. Reaching the 10th dimension the ratio is no longer visually distiguishable from 0. And 10 dimensions is not much at all for real data sets. The effect is that points are rarely any longer close to points we would (interpreting the data) consider similar (meaning “living in the same neighbourhood”).

Linear Separability

R code for above chart # full code at: # https://github.com/joyofdata/miscellaneous/blob/master/high-dim/random_soft_linsep.R library(kernlab) library(ggplot2) library(grid) err_svm_for_dim <- function(num_samples, num_points_per_class, dim, C) { rand_svm <- function(N,dim, C) { v <- as.factor(c(rep("a",N),rep("b",N))) m <- rbind(matrix(rnorm(2*N*dim), ncol=dim)) svm <- ksvm(m,v,kernel="vanilladot",C=C,kpar=list()) return(svm@error) } v_res <- mapply(rand_svm, rep(num_points_per_class, num_samples), rep(dim, num_samples), rep(C, num_samples)) q <- quantile(v_res, c(.1,.5,.9)) names(q) <- NULL return(c( q01 = q[1], q05 = q[2], q09 = q[3], ls0 = sum(v_res == 0) / num_samples, ls1 = sum(v_res <= 1/(2*num_points_per_class)) / num_samples, ls2 = sum(v_res <= 1/num_points_per_class) / num_samples )) } res_N10 <- sapply(1:50, function(d) err_svm_for_dim( num_samples=1000,num_points_per_class=10,dim=d,C=10)) res_N20 <- sapply(1:50, function(d) err_svm_for_dim( num_samples=1000,num_points_per_class=20,dim=d,C=10)) res_N30 <- sapply(1:50, function(d) err_svm_for_dim( num_samples=1000,num_points_per_class=30,dim=d,C=10)) # plotting above computed data # [...] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 # full code at: # https://github.com/joyofdata/miscellaneous/blob/master/high-dim/random_soft_linsep.R library ( kernlab ) library ( ggplot2 ) library ( grid ) err_svm_for_dim < - function ( num_samples , num_points_per_class , dim , C ) { rand_svm < - function ( N , dim , C ) { v < - as . factor ( c ( rep ( "a" , N ) , rep ( "b" , N ) ) ) m < - rbind ( matrix ( rnorm ( 2 * N * dim ) , ncol = dim ) ) svm < - ksvm ( m , v , kernel = "vanilladot" , C = C , kpar = list ( ) ) return ( svm @ error ) } v_res < - mapply ( rand_svm , rep ( num_points_per_class , num_samples ) , rep ( dim , num_samples ) , rep ( C , num_samples ) ) q < - quantile ( v_res , c ( . 1 , . 5 , . 9 ) ) names ( q ) < - NULL return ( c ( q01 = q [ 1 ] , q05 = q [ 2 ] , q09 = q [ 3 ] , ls0 = sum ( v_res == 0 ) / num_samples , ls1 = sum ( v_res <= 1 / ( 2 * num_points_per_class ) ) / num_samples , ls2 = sum ( v_res <= 1 / num_points_per_class ) / num _ samples ) ) } res_N10 < - sapply ( 1 : 50 , function ( d ) err_svm_for_dim ( num_samples = 1000 , num_points_per_class = 10 , dim = d , C = 10 ) ) res_N20 < - sapply ( 1 : 50 , function ( d ) err_svm_for_dim ( num_samples = 1000 , num_points_per_class = 20 , dim = d , C = 10 ) ) res_N30 < - sapply ( 1 : 50 , function ( d ) err_svm_for_dim ( num_samples = 1000 , num_points_per_class = 30 , dim = d , C = 10 ) ) # plotting above computed data # [...]

This chart shows how the probability of two sets of N points each being linearly separable quickly approaches 1 with the dimension increasing. Effectively this implies that for example for N = 10 (that is 20 points) from 7th dimension on an observed linear separability can no longer be statistically significant at level 5%. In general we see that that from a certain dimension on linear separability is practically guaranteed.

For technical reasons the “true” probability functions are likely to ascend sooner. I used a linear kernel SVM which will in some separable cases settle for a not fully separating discriminant and then falsely be counted as not linearly separable. Then again data always contains noise and outliers – so we would mostly tend to interprete an almost separable set as actually separable. (See my article on using linear programming for non-ambiguous and faster determination of linear separability)

Distances and Metrics

R code for above chart # full code at: # https://github.com/joyofdata/miscellaneous/blob/master/high-dim/rand_dist_origin.R library(ggplot2) library(grid) max_dim <- 100 rand_dist_origin <- function(N,dim) { m <- matrix(rnorm(N*dim),ncol=dim) v <- sqrt(rowSums(m^2)) return(c(avg = mean(v), min = min(v), max = max(v))) } v <- mapply(rand_dist_origin, rep(100000,99), 2:max_dim) # setting up data frame and plotting it with ggplot2 # [...] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 # full code at: # https://github.com/joyofdata/miscellaneous/blob/master/high-dim/rand_dist_origin.R library ( ggplot2 ) library ( grid ) max_dim < - 100 rand_dist_origin < - function ( N , dim ) { m < - matrix ( rnorm ( N * dim ) , ncol = dim ) v < - sqrt ( rowSums ( m ^ 2 ) ) return ( c ( avg = mean ( v ) , min = min ( v ) , max = max ( v ) ) ) } v < - mapply ( rand_dist_origin , rep ( 100000 , 99 ) , 2 : max_dim ) # setting up data frame and plotting it with ggplot2 # [...]

As the chart indicates (extrapoloating the green and blue trails) the relative difference between the maximum observed distance and the minimum observed distance vanishes with increase of dimension – this means variations in distance convey less and less information on similarity and this handicaps clustering using methods like k-NN (assuming it applies the euclidian distance).

It has been argued in [6], that under certain reasonable assumptions on the data distribution, the ratio of the distances of the nearest and farthest neighbors to a given target in high dimensional space is almost 1 for a wide variety of data distributions and distance functions. In such a case, the nearest neighbor problem becomes ill defined, since the contrast between the distances to different data points does not exist. In such cases, even the concept of proximity may not be meaningful from a qualitative perspective: a problem which is even more fundamental than the performance degradation of high dimensional algorithms. [Agg]

“On the Surprising Behavior of Distance Metrics in High Dimensional Space” [Agg] suggests using fractional norms to define metrics to fight the loss of meaning of distance for high dimensions. Basically a canonical extension of the L(k)-norms which are usually only defined for k = 1,2,… with k=2 defining the euclidian distance and k=1 defining Manhattan distance into positive fractions below 1. For k=1/2 this would give:

[Agg] “On the Surprising Behavior of Distance Metrics in High Dimensional Space” Aggarwal, Hinneburg, Keim

(original article published on www.joyofdata.de)