Improving Stability and Parameter Selection of Data Processing Programs

2020-01-07T21:15:49Z (GMT) by Wen-Chuan Lee
Data-processing programs are becoming increasingly important in the Big-data era. However, two notable problems of these programs may cause sub-optimal data- processing results. On one hand, these programs contain large number of floating-point computations. Due to the limited precision of floating-point representations, errors are introduced, propagated and accumulated in series of computations, making the computation results unreliable. We call this problem as floating-point instability. On the other hand, these programs are heavily parameterized. As no universal optimal parameter configuration exists for all possible inputs, the setting of program parameters should be carefully chosen and tuned for each input. Otherwise, the result would be sub-optimal. Manual tuning is infeasible because the number of parameters and the range of each parameter value may be big.

We try to address these two challenges in this dissertation. For floating-point instability problem, we develop a novel runtime technique to capture different output variations in the presence of instability. It features the idea of transforming every floating point value to a vector of multiple values $-$ the values added to create the vector are obtained by introducing artificial errors that are upper bounds of actual errors. The propagation of artificial errors models the propagation of actual errors. When values in vectors result in discrete execution differences (e.g., following different paths), the execution is forked to capture the resulting output variations.

For parameterized data-processing programs, we develop a white-box program tuning framework to tune the program parameter configuration for optimal data-processing result of each program input.
To further reduce the parameter configuration overhead, we propose the first general framework to inject artificial intelligence (AI) in the program, so the intelligent program is able to predict the parameter configuration for each incoming input directly. However, similar to many other ML/AI applications, the crucial challenge lies in feature selection, i.e., selection of the feature variables for predicting the target parameter specified by the users.
Thus, we propose a novel approach by combining program analysis and statistical analysis for better program feature variables selection which further helps better target parameter prediction and improves the result.