# Supplementary material description

This page contains additional details about the experimental setup and results discussed in the paper *Data Pipeline Selection and Optimization*, submitted to the 21st International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2019), co-located with the EDBT/ICDT joint conference.

Table of Contents:

1. Experiments short description
2. Experiment 1: SMBO for DPSO
3. Experiment 2: Algorithm-specific Configuration

# Experiments short description

The paper contains two experiments:

1. Study of the application of Sequential Model-Based Optimization (SMBO) to the Data Pipeline Selection and Optimization (DPSO) problem.
2. Study of whether an optimal pipeline configuration is specific to an algorithm or general to the dataset.

# Experiment 1: SMBO for DPSO

## Detailed experimental protocol

For each dataset, there is a baseline pipeline that consists of not applying any preprocessing.

### Step 1

For each dataset and method, we performed an exhaustive search over the configuration space defined in detail below. For each of the 4750 configurations, a 10-fold cross-validation was performed, and the score measure is accuracy.
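The evaluation protocol can be sketched as follows, using only the standard library. Here `evaluate` is a hypothetical callback (not from the paper's code) that trains a pipeline and classifier on the training folds and returns the accuracy on the held-out fold:

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k shuffled, near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(evaluate, n_samples, k=10):
    """Mean accuracy over k folds; `evaluate(train_idx, test_idx)` returns
    the accuracy of a model trained on train_idx and scored on test_idx."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # train on all folds except the held-out one
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

Running this once per configuration (4750 times in Step 1) yields the accuracy distribution analysed below.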

### Step 2

For each dataset and method, we performed a search using Sequential Model-Based Optimization (implementation provided by `hyperopt`) with a budget of 100 configurations to visit (about 2% of the whole configuration space). As in Step 1, we measure the accuracy obtained over a 10-fold cross-validation.
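The experiments rely on `hyperopt`'s implementation; the stand-alone sketch below only illustrates the SMBO principle (evaluate, consult a surrogate fitted on the history, evaluate the candidate the surrogate predicts best). The 1-nearest-neighbour surrogate is an illustrative assumption, not the TPE model `hyperopt` actually uses:

```python
import random

def smbo_maximize(objective, space, budget=100, n_candidates=20, seed=0):
    """Budgeted SMBO loop: at each step, draw random candidate configurations
    and spend the expensive evaluation only on the one the surrogate prefers."""
    rng = random.Random(seed)
    history = []  # (configuration, measured score)

    def sample():
        return tuple(rng.choice(dim) for dim in space)

    def surrogate(cfg):
        # toy surrogate: predicted score = score of the most similar
        # already-evaluated configuration (similarity = matching dimensions)
        return max(history, key=lambda h: sum(a == b for a, b in zip(cfg, h[0])))[1]

    for _ in range(budget):
        cfg = max((sample() for _ in range(n_candidates)), key=surrogate) if history else sample()
        history.append((cfg, objective(cfg)))
    return max(history, key=lambda h: h[1])
```

Calling `smbo_maximize(cross_val_score_of, space, budget=100)` returns the best (configuration, score) pair visited within the budget.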

### Measures and analysis

We want:

- (Q1) to quantify the achievable improvement compared to the baseline pipeline.
- (Q2) to measure how likely it is to improve the baseline score across the configuration space.
- (Q3) to determine whether SMBO is capable of improving the baseline score.
- (Q4) to measure how likely SMBO is to improve the baseline score with a restricted budget.
- (Q5) to measure how fast SMBO is likely to improve the baseline score with a restricted budget.

To answer those questions, we generate two kinds of plots:

1. Density of the configurations as a function of accuracy, for the exhaustive grid and for the SMBO search. If the density is non-zero for accuracies higher than the baseline score, then there exist configurations that improve the baseline score (answer to Q1). The probability of improving the baseline score (and by how much) can be read from the proportion of the area beyond the baseline-score vertical marker (answer to Q2). Similarly, if the density for SMBO has some support above the baseline score, it means the SMBO search could improve it (answer to Q3). If the area above the baseline-score vertical marker is larger for SMBO than for the exhaustive search, then SMBO is more likely to improve the baseline than an exhaustive search (answer to Q4).
2. Evolution of the score obtained configuration after configuration during the SMBO search. The improvement interval spans from the baseline score to the best score obtained by the exhaustive search. To answer Q5, we plot the improvement interval horizontally, together with the best score obtained iteration after iteration. SMBO has improved the baseline as soon as the best score enters the improvement interval. To help visualization, we plot vertically the number of configurations needed to enter the interval and the number of configurations to visit before reaching the best score obtained over the budget of 100 configurations. For both markers, the lower the better.
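The quantities behind Q2 and Q5 can be computed directly from the recorded scores; a minimal sketch (variable names are illustrative, the accuracy lists would come from Steps 1 and 2):

```python
def prob_improvement(scores, baseline):
    """Q2: fraction of configurations that strictly beat the baseline score
    (the area to the right of the baseline marker in the density plot)."""
    return sum(s > baseline for s in scores) / len(scores)

def iterations_to_improve(smbo_scores, baseline):
    """Q5: 1-based index of the first SMBO iteration whose score enters the
    improvement interval, or None if the baseline is never beaten."""
    for i, score in enumerate(smbo_scores, start=1):
        if score > baseline:
            return i
    return None
```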

## Pipeline prototype and configuration

The configuration space for the pipeline is composed of three operations with, respectively, 4, 5 and 4 possible operators. Each operator has between 0 and 3 specific parameters, and each parameter has between 2 and 4 possible values. The final pipeline configuration space has a total of 4750 possible configurations.

### Pipeline prototype

The pipeline prototype is composed of three sequential operations:

1. rebalance: to handle imbalanced datasets with oversampling or undersampling techniques.
2. normalizer: to normalize or scale features.
3. features: to select the most important features or reduce the input vector space dimension.

### Pipeline operators

For the step rebalance, the possible methods to instantiate are: `None`, `NearMiss`, `CondensedNearestNeighbour`, `SMOTE`.

For the step normalizer, the possible methods to instantiate include: `None`, `StandardScaler`, `RobustScaler` (5 operators in total).

For the step features, the possible methods to instantiate are: `None`, `PCA`, `SelectKBest`, `PCA+SelectKBest`.

REMARK: The baseline pipeline corresponds to the triple `(None, None, None)`.

### Pipeline operator specific configuration

- `NearMiss`:
  - `n_neighbors`: [1, 2, 3]
- `CondensedNearestNeighbour`:
  - `n_neighbors`: [1, 2, 3]
- `SMOTE`:
  - `k_neighbors`: [5, 6, 7]
- `StandardScaler`:
  - `with_mean`: [True, False]
  - `with_std`: [True, False]
- `RobustScaler`:
  - `quantile_range`: [(25.0, 75.0), (10.0, 90.0), (5.0, 95.0)]
  - `with_centering`: [True, False]
  - `with_scaling`: [True, False]
- `PCA`:
  - `n_components`: [1, 2, 3, 4]
- `SelectKBest`:
  - `k`: [1, 2, 3, 4]
- `PCA+SelectKBest`:
  - `n_components`: [1, 2, 3, 4]
  - `k`: [1, 2, 3, 4]
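The stated total of 4750 configurations follows from these lists: a parameterless operator (such as `None`) contributes one configuration, and a parameterized operator contributes the product of its value-list sizes, giving 10 rebalance, 19 normalizer and 25 features configurations, and 10 × 19 × 25 = 4750. This assumes the two normalizer operators not detailed above are parameterless. A sketch of that arithmetic:

```python
from math import prod

# per-operator parameter value counts, taken from the lists above;
# [] denotes a parameterless operator, which contributes one configuration
rebalance = [[], [3], [3], [3]]               # None, NearMiss, CNN, SMOTE
normalizer = [[], [2, 2], [3, 2, 2], [], []]  # None, StandardScaler, RobustScaler,
                                              # plus 2 operators assumed parameterless
features = [[], [4], [4], [4, 4]]             # None, PCA, SelectKBest, PCA+SelectKBest

def op_count(operators):
    """Number of distinct configurations of one pipeline operation."""
    return sum(prod(params) for params in operators)  # prod([]) == 1

total = prod(op_count(step) for step in [rebalance, normalizer, features])
```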

## Results

The results are sorted by dataset.

# Experiment 2: Algorithm-specific Configuration

## Detailed experimental protocol

- Datasets: European Court of Human Rights (1000 cases, prevalence 50%), 20newsgroup (855 documents from the categories atheism and religion).
- Methods: SVM, Random Forest, Neural Network, Decision Tree.
- Dataset split: 60% for the training set, 40% for the test set.
- Pipeline configuration space size: 35 configurations.

Each document is preprocessed using a data pipeline consisting of tokenization and stopword removal, followed by n-gram generation. The processed documents are combined and the top k tokens across the corpus are kept to form the dictionary. Each document is then turned into a Bag-of-Words vector using the dictionary.

There are two hyperparameters in the preprocessing phase: n, the size of the n-grams, and k, the number of tokens in the dictionary.
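The preprocessing steps above can be sketched with the standard library alone; the regex tokenizer and the stopword list are simplistic placeholders, not the ones used in the paper:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # toy list for illustration

def ngrams(doc, n):
    """Tokenize, drop stopwords, then emit word n-grams."""
    tokens = [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionary(corpus, n, k):
    """Keep the k most frequent n-grams across the whole corpus."""
    counts = Counter(g for doc in corpus for g in ngrams(doc, n))
    return [g for g, _ in counts.most_common(k)]

def bag_of_words(doc, dictionary, n):
    """Turn one document into its Bag-of-Words vector over the dictionary."""
    counts = Counter(ngrams(doc, n))
    return [counts[g] for g in dictionary]
```

The two hyperparameters n and k appear as the `n` and `k` arguments of `build_dictionary`.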

## Results

The results are sorted by dataset.

### Newsgroup dataset

In this section, we provide the intermediate steps used to compute the NMAD indicator. In particular, the first table displays the optimal points expressed in the normalized configuration space, the second table the sample to consider for each optimal point, and the third one the NMAD value for each optimal point.
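A minimal sketch of the computation, assuming NMAD(p) is the mean, over the points q of the sample S(p), of the per-coordinate mean absolute difference between p and q (this reading reproduces the values in the tables below; the precise definition is given in the paper):

```python
def nmad(p, sample):
    """Mean over the sample of the per-coordinate mean absolute difference."""
    def mad(p, q):
        return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
    return sum(mad(p, q) for q in sample) / len(sample)
```

For example, for the ECHR point $p_3=(0.5,0.1)$ with sample $S(p_3)=\{p_1,p_2,p_3,p_6\}$, this gives the value 0.275 reported below.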

### ECHR

| Method | $(n,k)$ |
|---|---|
| Decision Tree | $p_1=(1,0.5)$ |
| Neural Network | $p_2=(1,0.5)$ |
| Random Forest | $p_3=(0.5,0.1)$, $p_4=(0.75,0.1)$, $p_5=(1,0.5)$ |
| Linear SVM | $p_6=(0.5,0.5)$, $p_7=(0.75,0.5)$, $p_8=(1,0.5)$ |

| Sample (ECHR) |
|---|
| $S(p_1)=S(p_2)=S(p_5)=S(p_8)=\{p_1, p_2, p_5, p_8\}$ |
| $S(p_3)=\{p_1, p_2, p_3, p_6\}$ |
| $S(p_4)=\{p_1, p_2, p_4, p_7\}$ |
| $S(p_6)=\{p_1, p_2, p_3, p_6\}$ |
| $S(p_7)=\{p_1, p_2, p_5, p_7\}$ |

| $(n,k)$ | NMAD |
|---|---|
| $(5,50000)$ | 0 |
| $(3,10000)$ | 0.275 |
| $(4,10000)$ | 0.213 |
| $(3,50000)$ | 0.175 |
| $(4,50000)$ | 0.094 |

### Newsgroup

| Method | $(n,k)$ |
|---|---|
| Decision Tree | $p_1=(0.75,0.05)$, $p_2=(0.75,1.0)$ |
| Neural Network | $p_3=(1.0,0.50)$ |
| Random Forest | $p_4=(0.5,0.10)$ |
| Linear SVM | $p_5=(0.25,1.0)$ |

| Sample (Newsgroup) |
|---|
| $S(p_1)=S(p_4)=\{p_1, p_3, p_4, p_5\}$ |
| $S(p_2)=S(p_3)=S(p_5)=\{p_2, p_3, p_4, p_5\}$ |

| $(n,k)$ | NMAD |
|---|---|
| $(4,5000)$ | 0.306 |
| $(4,100000)$ | 0.300 |
| $(5,50000)$ | 0.356 |
| $(3,10000)$ | 0.294 |
| $(2,100000)$ | 0.362 |