Supplementary material description

This page contains additional details about the experimental setup and results discussed in the paper "Data Pipeline Selection and Optimization", submitted to the 21st International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2019), co-located with the EDBT/ICDT joint conference.

Table of Contents:

  1. Experiments short description
  2. Experiment 1: SMBO for DPSO
    1. Detailed experimental protocol
    2. Measures and analysis
    3. Results
  3. Experiment 2: Algorithm-specific Configuration
    1. Detailed experimental protocol
    2. Measures and analysis
    3. Results
    4. NMAD Calculation

Experiments short description

The paper contains two experiments:

  1. Study of the application of Sequential Model-Based Optimization (SMBO) to the Data Pipeline Selection and Optimization (DPSO) problem.
  2. Study of whether an optimal pipeline configuration is specific to an algorithm or general to the dataset.

Experiment 1: SMBO for DPSO

Detailed experimental protocol

For each dataset, there is a baseline pipeline that performs no preprocessing.

Step 1

For each dataset and method, we performed an exhaustive search over the configuration space defined below. For each of the 4750 configurations, a 10-fold cross-validation was performed, and the score measure is the accuracy.
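The exhaustive step can be sketched as follows. The grid below is a small illustrative stand-in (the real space has 4750 configurations), and the operator choices are assumptions for the sketch, not the paper's exact operator set:

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical (much smaller) configuration grid: one choice per
# pipeline step, with None meaning "skip this step".
normalizers = [None, StandardScaler(), MinMaxScaler()]
feature_ops = [None, PCA(n_components=2)]

results = {}
for norm, feat in product(normalizers, feature_ops):
    steps = [(name, op) for name, op in
             [("normalizer", norm), ("features", feat)] if op is not None]
    steps.append(("clf", DecisionTreeClassifier(random_state=0)))
    # Score each configuration by 10-fold cross-validated accuracy.
    scores = cross_val_score(Pipeline(steps), X, y, cv=10, scoring="accuracy")
    results[(type(norm).__name__, type(feat).__name__)] = scores.mean()

best = max(results, key=results.get)
print(best, results[best])
```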

Step 2

For each dataset and method, we performed a search using Sequential Model-Based Optimization (implementation provided by hyperopt) with a budget of 100 configurations to visit (about 2% of the whole configuration space). As in Step 1, we measure the accuracy obtained over a 10-fold cross-validation.

Measures and analysis

We want to answer the following questions:

  1. Q1: Do configurations exist that improve the baseline score?
  2. Q2: What is the probability of improving the baseline score, and by how much?
  3. Q3: Can the SMBO search improve the baseline score?
  4. Q4: Is SMBO more likely to improve the baseline than an exhaustive search?
  5. Q5: How many configurations does SMBO need to visit before improving the baseline?

To answer those questions, we generate two kinds of plots:

  1. Density of configurations as a function of accuracy, for the exhaustive grid and for the SMBO search. If the density is non-zero for accuracies higher than the baseline score, then there exist configurations that improve the baseline score (answer to Q1). The probability of improving the baseline score (and by how much) can be read from the proportion of the area to the right of the baseline-score vertical marker (answer to Q2). Similarly, if the SMBO density has some support above the baseline score, the SMBO search can improve on it (answer to Q3). If the area above the baseline-score marker is larger for SMBO than for the exhaustive search, then SMBO is more likely to improve the baseline than an exhaustive search (answer to Q4).
  2. Evolution of the best score, configuration after configuration, during the SMBO search. The improvement interval spans from the baseline score to the best score obtained by the exhaustive search. To answer Q5, we plot the improvement interval horizontally, together with the best score obtained iteration after iteration. SMBO has improved on the baseline as soon as the best score enters the improvement interval. To help visualization, we plot vertically the number of configurations needed to enter the interval and the number of configurations to visit before reaching the best score obtained over the budget of 100 configurations. For both markers, the lower the better.
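The quantities read off these plots can also be computed directly. A sketch with synthetic accuracy values standing in for the real exhaustive and SMBO results:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the accuracies of the 4750 exhaustive
# configurations and the 100 configurations visited by SMBO.
exhaustive_acc = rng.normal(loc=0.80, scale=0.05, size=4750).clip(0, 1)
smbo_acc = rng.normal(loc=0.84, scale=0.04, size=100).clip(0, 1)
baseline = 0.82

# Probability of improving the baseline = proportion of the density
# lying to the right of the baseline marker (Q2 / Q4).
p_exhaustive = (exhaustive_acc > baseline).mean()
p_smbo = (smbo_acc > baseline).mean()

# Number of SMBO iterations before the running best enters the
# improvement interval (Q5): index of the first score above baseline.
running_best = np.maximum.accumulate(smbo_acc)
improved = running_best > baseline
entered = int(np.argmax(improved)) if improved.any() else None

print(p_exhaustive, p_smbo, entered)
```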

Pipeline prototype and configuration

The configuration space for the pipeline is composed of three operations. The three operations have 4, 5, and 4 possible operators respectively. Each operator has between 0 and 3 specific parameters, and each parameter has between 2 and 4 possible values. The final pipeline configuration space has a total of 4750 possible configurations.

Pipeline prototype

The pipeline prototype is composed of three sequential operations:

  1. rebalance: to handle imbalanced datasets with oversampling or undersampling techniques.
  2. normalizer: to normalize or scale features.
  3. features: to select the most important features or reduce the input vector space dimension.
Pipeline illustration

Pipeline operators

For the step rebalance, the possible methods to instantiate are:

For the step normalizer, the possible methods to instantiate are:

For the step features, the possible methods to instantiate are:

REMARK: The baseline pipeline corresponds to the triple (None, None, None).

Pipeline operator specific configuration

Results

The results are sorted by dataset.

Breast dataset

Configuration density depending on accuracy - Random Forest
SMBO results - Random Forest
Configuration density depending on accuracy - Decision Tree
SMBO results - Decision Tree
Configuration density depending on accuracy - Neural Net
SMBO results - Neural Net
Configuration density depending on accuracy - SVM
SMBO results - SVM

Iris dataset

Configuration density depending on accuracy - Random Forest
SMBO results - Random Forest
Configuration density depending on accuracy - Decision Tree
SMBO results - Decision Tree
Configuration density depending on accuracy - Neural Net
SMBO results - Neural Net
Configuration density depending on accuracy - SVM
SMBO results - SVM

Wine dataset

Configuration density depending on accuracy - Random Forest
SMBO results - Random Forest
Configuration density depending on accuracy - Decision Tree
SMBO results - Decision Tree
Configuration density depending on accuracy - Neural Net
SMBO results - Neural Net
Configuration density depending on accuracy - SVM
SMBO results - SVM

Experiment 2: Algorithm-specific Configuration

Detailed experimental protocol

Each document is preprocessed using a data pipeline consisting of tokenization and stop-word removal, followed by n-gram generation. The processed documents are combined, and the k top tokens across the corpus are kept to form the dictionary. Each case is then turned into a Bag-of-Words representation using the dictionary.

There are two hyperparameters in the preprocessing phase: n, the size of the n-grams, and k, the number of tokens in the dictionary.
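A sketch of this preprocessing, assuming scikit-learn's CountVectorizer as a stand-in for the paper's implementation (its max_features keeps the k most frequent tokens and ngram_range controls n; the documents below are made-up examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the court held that the application was admissible",
    "the applicant complained of a violation of article six",
    "the court found no violation in the present case",
]

n, k = 2, 50  # the two preprocessing hyperparameters

# Tokenization, stop-word removal, n-gram generation, and the top-k
# dictionary expressed in one vectorizer.
vectorizer = CountVectorizer(stop_words="english",
                             ngram_range=(1, n),
                             max_features=k)
bow = vectorizer.fit_transform(docs)  # Bag-of-Words matrix: docs x tokens
print(bow.shape)
```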

Results

The results are sorted by dataset.

ECHR dataset

Heatmap of the accuracy depending on the configuration - Random Forest
Heatmap of the accuracy depending on the configuration - Decision Tree
Heatmap of the accuracy depending on the configuration - Neural Net
Heatmap of the accuracy depending on the configuration - SVM

Newsgroup dataset

Heatmap of the accuracy depending on the configuration - Random Forest
Heatmap of the accuracy depending on the configuration - Decision Tree
Heatmap of the accuracy depending on the configuration - Neural Net
Heatmap of the accuracy depending on the configuration - SVM

NMAD calculation

In this section, we provide the intermediate steps to compute the NMAD indicator. In particular, the first table displays the optimal points expressed in the normalized configuration space, the second table the sample to consider for each optimal point, and the third one the NMAD value for each optimal point.
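For illustration, a sketch of one plausible NMAD computation: the mean absolute deviation between an optimal point and a sample of points, both expressed in the normalized (unit-cube) configuration space. The exact definition used in the paper should be preferred, and the points below are hypothetical:

```python
import numpy as np

def nmad(point, sample):
    """Mean absolute deviation of `sample` from `point`, both in the
    normalized [0, 1] configuration space, so the result is in [0, 1].
    This mirrors, but may not exactly match, the paper's definition."""
    point = np.asarray(point, dtype=float)
    sample = np.asarray(sample, dtype=float)
    return float(np.mean(np.abs(sample - point)))

# Hypothetical normalized optimal points (n, k) per method.
optimal = {"Decision Tree": [0.2, 0.4], "Neural Network": [0.2, 0.6],
           "Random Forest": [0.4, 0.4], "Linear SVM": [0.2, 0.4]}
sample = np.array(list(optimal.values()))
for name, pt in optimal.items():
    print(name, round(nmad(pt, sample), 3))
```

An NMAD of 0 for a point means every optimum in the sample coincides with it, while larger values indicate that the optimal configurations disagree across methods.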

ECHR

Optimal points (normalized configuration space):

  Method           Optimal point(s)
  Decision Tree
  Neural Network
  Random Forest    , ,
  Linear SVM       , ,

Sample ECHR

  Point   NMAD
          0
          0.275
          0.213
          0.175
          0.094

Newsgroup

Optimal points (normalized configuration space):

  Method           Optimal point(s)
  Decision Tree    ,
  Neural Network
  Random Forest
  Linear SVM

Sample Newsgroup

  Point   NMAD
          0.306
          0.300
          0.356
          0.294
          0.362
