Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data
thesisposted on 16.01.2019 by Aritra Chakravorty
In order to distinguish essays and pre-prints from academic theses, we have a separate category. These are often much longer text based documents than a paper.
In Divide & Recombine (D&R), data are divided into subsets, analytic methodsare applied to each subset independently, with no communication between processes;then the subset outputs for each method are recombined. For big data, this providesalmost all of the analytic tasking needed when data are analyzed. It also provideshigh computational performance because typically most of the computation is em-barrassingly parallel, the simplest parallel computation.
Another kind of tasking must address computational performance and numericaccuracy: the computing of functions of all of the data, or “statistics”. For data bigand small, it is often important to compute such statistics for all of the data, whichcan be summaries of the data, such as sample quantiles of continuous variables, orcan process the data into a form that helps analysis, such as dividing the data intorepresentative subsets. Development of computational methods to compute thesestatistics can be challenging.
D&R can be a very effective framework for computing statistics. To supportthis, we introduce the concept of embarrassingly parallel (EP) statistics, both weakand strong. The concept of EP statistics is not entirely new, but has had littledevelopment. The existing methodology is mainly sums of sums. For example, this isdone when computing the necessary statistics for least squares where sums of productsand cross productions are carried out on subsets then summed across subsets. Ourtreatment of EP statistics has taken the concept much further. The outcome is abilityto use EP statistics in conjunction with the use a Fourier series to approximate an optimization criteria. The series terms, which are strongly EP statistics, are summedacross subsets, and the result is optimized. These are EP-F computational methods.
We have so far developed two EP-F computational methods for two widely usedstatistic computations. EP-F-Quantile is for quantiles of big data, and EP-F-KDtreeis for KD-trees. Speed and accuracy of EPF-Quantile are compared with that of thewell-known binning method, which also can be formulated in terms of EP statistics. EPF-KDtree is the first parallel KD-tree computational method of which we areaware. EP and EPF computational methods have potentially many other applicationsto computing statistics.