Gap-filling in Streamflow Sensor Data

Created by: Taimoor Akhtar

This notebook explores different statistical models and analyzes their capabilities for filling gaps in streamflow sensor data. Data for three correlated flow sites is provided, and the problem statement is to fill data gaps at the most downstream site, i.e., site A.

In [4]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
from utils import *
from models import single_experiment

2. Data Exploration

The general data exploration strategy is to explore trends in the flow data at all sites through a) time-series plots, b) cross-correlations, c) auto-correlations and d) visualizations of gains. While analysis of auto-correlations is customary in time series analysis, lagged cross-correlations and gains are also analyzed here due to their importance in hydrological flow and routing.

In [6]:
# Loading data and plotting the time-series at A, B and C
df = pd.read_csv("ts_data.csv", index_col='DateTime', parse_dates=True) # parse the DateTime index so date-based slicing works
df_orig = df.copy()

# Removing the date intervals specified for prediction in the Problem Statement
df["A"][(df.index >= "2018-02-03") & (df.index <= "2018-02-20")] = np.nan
df["A"][(df.index >= "2018-04-13") & (df.index <= "2018-04-28")] = np.nan
df["A"][(df.index >= "2018-06-17") & (df.index <= "2018-07-01")] = np.nan
df["A"][(df.index >= "2018-11-01") & (df.index <= "2018-11-15")] = np.nan
In [7]:
fig = plt.figure(figsize=[20,5])
ax = fig.add_subplot(1,1,1)
df.plot(ax=ax)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f8d8ebc10>
In [9]:
# Understanding cross-correlations
df_A = df['A']
df_B = df['B']
df_C = df['C']

rb = [crosscorr(df_A, df_B, lag) for lag in range(300)]
rc = [crosscorr(df_A, df_C, lag) for lag in range(300)]

f,ax=plt.subplots(figsize=(14,3))
ax.plot(rb, color = 'g', label = 'Site B')
ax.plot(rc, color = 'r', label = 'Site C')
ax.set(title='Cross-Correlation of Site A with B and C', xlabel='Lag (15 Minute Intervals)', ylabel='Correlation (Pearson r)');
plt.legend()
Out[9]:
<matplotlib.legend.Legend at 0x7f8f923358e0>
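
The helper crosscorr is imported from utils.py and is not shown in this notebook; a minimal sketch consistent with its usage above (a lagged Pearson correlation) would be:

def crosscorr(series_a, series_b, lag=0):
    # Pearson r between series_a at time t and series_b at time t - lag;
    # pandas aligns on the index and skips NaN pairs
    return series_a.corr(series_b.shift(lag))
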
In [10]:
df_A = df['A'][df.index <= "2018-01-31"]
fig = plt.figure(figsize=[20,10])
ax = fig.add_subplot(3,1,1)
sns.distplot(df_A, ax=ax)
ax = fig.add_subplot(3,1,2)
plot_acf(df_A, lags=2000, ax=ax)
ax = fig.add_subplot(3,1,3)
plot_pacf(df_A, lags=100, ax=ax, method='ols');
In [11]:
# Understanding gains and any distributions of gains
df_diff_1 = df['A'] - df_B # use the full series at A (df_A was truncated in the previous cell)
df_diff_2 = df['A'] - df_C
fig = plt.figure(figsize=[20,5])
ax = fig.add_subplot(1,1,1)
df_diff_1.plot(ax=ax, label='Difference B')
df_diff_2.plot(ax=ax, label='Difference C')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f8fc50dc0>

3. Modelling Methodology

3.1. Comparison Baseline and Metric

This work compares the proposed modeling strategies against two baselines, i.e., i) 'fill missing data with the global time-series mean' (using SimpleImputer from sklearn) and ii) 'interpolation' (the Pandas built-in method). Moreover, the Mean Absolute Percentage Error (MAPE) and the Mean Percentage Error (MPE) are used to evaluate the models proposed / analyzed in this work. MPE is used to also highlight any biases in the predictions from the algorithms (this is very important in hydrologic problems). Both evaluation metrics are described below:

$\text{MAPE} = \frac{1}{n} \sum_i \left| \frac{\text{pred}_i - \text{obs}_i}{\text{obs}_i} \right|$

$\text{MPE} = \frac{1}{n} \sum_i \frac{\text{pred}_i - \text{obs}_i}{\text{obs}_i}$
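
The metric implementations live in utils.py and are not shown; a minimal sketch consistent with the definitions above (the names mape / mpe and reporting in percent are assumptions) is:

import numpy as np

def mape(obs, pred):
    # Mean Absolute Percentage Error, reported in percent; NaN pairs are skipped
    mask = ~np.isnan(obs) & ~np.isnan(pred)
    return 100 * np.mean(np.abs((pred[mask] - obs[mask]) / obs[mask]))

def mpe(obs, pred):
    # Mean Percentage Error (signed, in percent); highlights systematic bias
    mask = ~np.isnan(obs) & ~np.isnan(pred)
    return 100 * np.mean((pred[mask] - obs[mask]) / obs[mask])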

3.2. Model Selection

The literature review summarised in Section 1 highlighted multiple promising algorithms for filling missing data in multivariate time series. This work evaluates one such strategy, i.e., the Multivariate Imputation by Chained Equations (MICE) [1] method (we actually use a variation of the original MICE). [2] uses MICE for filling missing data in cross-correlated hourly energy demand data.

Findings from Exploratory Analysis

The exploratory analysis indicates that:

  • The order of flow sites (from upstream to downstream) is B-C-A (inferred from flow magnitudes and lags at site A).
  • Cross-correlation of A-C peaks at a lag of approximately 12 hours.
  • Cross-correlation of A-B peaks at a lag of approximately 36 hours (a quick numeric check of these peak lags follows this list).
  • Long data gaps exist at site B in 2016.
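
As a quick numeric check, the peak lags can be read directly off the rb and rc arrays computed in In [9] (one step = 15 minutes):

print('A-B peak lag:', int(np.argmax(rb)), 'steps')  # ~144 steps = 36 hours
print('A-C peak lag:', int(np.argmax(rc)), 'steps')  # ~48 steps = 12 hours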

4. Data Preparation

We divide the data set into training and test sets: years 2016 and 2017 are used for model training and the year 2018 is used as the test set.

In [12]:
# The training set is approximately 70 percent of the total data, i.e., years 2016-17 (excluding time periods used in evaluation)
df_train_orig = df[df.index <= "2017-12-31"].copy()
df_test_orig = df[df.index > "2017-12-31"].copy()
In [13]:
## Let's create one training and one test dataset with events that have missing sensor data
df_train = df_train_orig.copy()
df_train_A = df_train['A']
df_train_A = create_gaps(df_train_A, perc=0.4)
df_train['A'] = df_train_A

df_test = df_test_orig.copy()
df_test_A = df_test['A']
df_test_A = create_gaps(df_test_A, perc=0.4)
df_test['A'] = df_test_A

4.1. Training Data Preparation

A key preliminary step prior to training and testing different models for gap filling is the transformation of the training and test datasets to introduce synthetic 'gap events' (please note that the 4 events mentioned in the problem statement are already excluded from both data sets). The assumptions made in generating the synthetic data sets are as follows:

  • Gaps are only introduced in the time series for site A (since this is our series of interest).
  • The length of a data gap is random and ranges between 5-20 days (note: two synthetic gap events may overlap, resulting in a bigger gap). Hence, all algorithms are trained under the assumption that the maximum gap in the data is 20 days (the biggest gap amongst the 4 events mentioned was approximately 18 days). A minimum of 5 days was chosen on the assumption that it may take a few days to fix a sensor. These numbers can, however, be changed in future analyses.
  • A synthetic gap scenario (i.e., a single training / test set with missing-data events) is generated by removing 40 percent of the flow time series at site A (by introducing gaps).
  • Important assumption: in all synthetic gap scenarios, synthetic gaps are only introduced in the time series at flow site A. However, existing actual gaps in the other series remain.

The preceding code snippet (In [13]) creates the training / test sets / scenarios with gaps, using the function create_gaps from the utils.py file.
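
create_gaps itself is not shown in this notebook; the sketch below is illustrative only, i.e., the parameter names, the 15-minute step count per day, and the random-number handling are assumptions rather than the actual utils.py implementation:

import numpy as np

def create_gaps(series, perc=0.4, min_days=5, max_days=20, steps_per_day=96, seed=None):
    # Inject random gap events (NaN) until `perc` of the series is missing;
    # overlapping events simply merge into longer gaps
    rng = np.random.default_rng(seed)
    s = series.copy()
    n = len(s)
    while s.isna().sum() < int(perc * n):
        length = int(rng.integers(min_days * steps_per_day, max_days * steps_per_day + 1))
        start = int(rng.integers(0, max(n - length, 1)))
        s.iloc[start:start + length] = np.nan
    return s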

In [14]:
## Fit simple models as baseline that use strategies like filling with mean etc.
from sklearn.impute import SimpleImputer

## Create two dataframes that will store all training and test series from all models 
df_train_all = df_train_orig.copy()
df_train_all['Original'] = df_train_all['A']
df_train_all = df_train_all.drop(['A', 'B', 'C'], axis=1)
df_test_all = df_test_orig.copy()
df_test_all['Original'] = df_test_all['A']
df_test_all = df_test_all.drop(['A', 'B', 'C'], axis=1)

## Use global mean as predictor: Baseline 1
imp_mean = SimpleImputer()
imp_mean.fit(df_train)
Y = imp_mean.transform(df_train)
df_train_all['Mean'] = Y[:,0]
Yt = imp_mean.transform(df_test)
df_test_all['Mean'] = Yt[:,0]

## Use Interpolation as predictor: Baseline 2
df_train_temp = df_train.copy()
df_train_ip = df_train_temp['A'].interpolate()
df_train_all['Interpolate'] = df_train_ip
df_test_temp = df_test.copy()
df_test_ip = df_test_temp['A'].interpolate()
df_test_all['Interpolate'] = df_test_ip

The preceding code snippet (In [14]) implements the two baseline gap-filling strategies, i.e., fill by global mean and fill by interpolation.

In [15]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df_train_mf1 = df_train.copy()
df_train_mf1 = df_train_mf1[['B', 'C', 'A']] #order of filling is upstream to downstream
df_train_mf1_array = df_train_mf1.to_numpy()
df_train_mf1_array = np.log(df_train_mf1_array) #data is log-transformed

df_test_mf1 = df_test.copy()
df_test_mf1 = df_test_mf1[['B', 'C', 'A']] #order of filling is upstream to downstream
df_test_mf1_array = df_test_mf1.to_numpy()
df_test_mf1_array = np.log(df_test_mf1_array) #data is log-transformed

imp_mf1 = IterativeImputer(random_state=0, imputation_order='roman')
imp_mf1.fit(df_train_mf1_array)
Y = imp_mf1.transform(df_train_mf1_array)
Yt = imp_mf1.transform(df_test_mf1_array)
df_train_all['MICE-Flow-1'] = np.exp(Y[:,2])
df_test_all['MICE-Flow-1'] = np.exp(Yt[:,2])

Selected Model (MICE)

As mentioned earlier, the model selected for this analysis is a variation of the statistical method MICE. The basic strategy behind MICE is to use chained regression equations to sequentially fill gaps in multiple time series. In our problem, MICE iteratively fills gaps in the time series of flow stations A, B and C (the order can be specified, or the series with the fewest gaps is filled first). In each iteration, one of the variables becomes the dependent variable while the others are independent variables.

Model Implementation Overview

We first implement MICE with A, B and C as the input series. In Python, a variation of MICE is implemented within sklearn's IterativeImputer. IterativeImputer employs Bayesian Ridge Regression (to introduce regularization) between the series to fill missing values. Consequently, the following equation represents the basic MICE model (variant of MICE) used for filling gaps in the data at flow site A:

$X^t_A = \alpha_B X^t_B + \alpha_C X^t_C + \epsilon$

Please note that time series B and C also have gaps (from the original data). These gaps are filled first in this variant of the MICE model (also using Bayesian Ridge Regression). In the original MICE model, gaps are filled in ascending order of the number of data gaps in a series.
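
In sklearn's IterativeImputer, this ordering is controlled by the imputation_order argument: 'roman' (used in In [15]) imputes columns left to right, i.e., B, C, then A, while 'ascending' reproduces the fewest-missing-first ordering of the original MICE. For example:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# 'ascending' imputes features with the fewest missing values first
imp = IterativeImputer(random_state=0, imputation_order='ascending')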

The above-mentioned implementation of MICE is called MICE-Flow-1 in subsequent discussion. Some further assumptions are as follows:

  • Missing data at site A is filled using cross-correlations with other sites only.
  • Missing data is filled at sites B and C first (from upstream to downstream order in river channel).
  • All data is log-transformed before model fitting. It is evident from the initial data analysis that the distribution of the data is not normal and is skewed (which is expected of any hydrologic series).

MICE-Flow-1 is implemented in the preceding code snippet (In [15]).

In [16]:
import plotly.io as pio
fig = interactive_ts_plot(df_train_all)
pio.show(fig)
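
interactive_ts_plot is another utils.py helper that is not shown; a minimal stand-in using plotly express (wide-form input, one line per column, DateTime index on the x-axis) could be:

import plotly.express as px

def interactive_ts_plot(df):
    # One interactive line per column; hover and zoom come for free with plotly
    return px.line(df, labels={'value': 'Flow', 'variable': 'Series'})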

Proposed Model 1: MICE-Flow-2

MICE-Flow-1 only accounts for correlations of site A with upstream river locations; auto-correlations (local trends within time series A) are not considered. Furthermore, the model does not take into account lagged cross-correlations with the other streamflow sites (lagged correlations were noticed during data exploration). Finally, missing data at the other locations appears to cause sudden jumps (noise) in the predictions of MICE-Flow-1; for instance, see the data of July 6, 2016 in the above model comparison plot (in the training time period).

Feature Engineering

To alleviate the aforementioned shortcomings of MICE-Flow-1, some new input features are proposed to i) introduce the effect of auto-correlations in the prediction (by using a 1-week rolling mean), ii) introduce the effect of lagged cross-correlations, and iii) improve model stability by adding a 'local mean' (one month). Also, since there are very large temporal gaps in the data for upstream site B, it is excluded as a model feature. The mathematical representation of the modified MICE model (for site A) is as follows:

$X^t_A = \alpha_C X^{t-L}_C + \gamma_{1}\overline{X}^{Month}_A + \gamma_{2}\overline{X}^{Week}_A + \epsilon$

In the above equation, $X^t_A$ is the time series at flow location A, $X^{t-L}_C$ is the lagged time series at location C (with lag $L$ = 48 steps, i.e., 12 hours), $\overline{X}^{Month}_A$ is the 30-day rolling mean up to time $t$, and $\overline{X}^{Week}_A$ is the 7-day rolling mean up to time $t$.

The proposed model is called MICE-Flow-2. Some further assumptions for model fitting are as follows:

  • Only a limited amount of data at site C is missing. Hence, it is filled first using simple interpolation, which helps to improve the overall fit. Consequently, the performance of MICE-Flow-2 is strongly dependent on the availability of an upstream site with minimal data gaps.
  • All data is log-transformed before model-fitting.

The code for implementation of MICE-Flow-2 is in the following snippet.

In [17]:
## Second Level of the MICE method
df_train_mf2 = df_train.copy()
df_train_mf2['Mavg'] = df_train_mf2['A'].rolling(96*30, min_periods=10).mean() # 30-day rolling mean (96 15-minute steps per day)
df_train_mf2['Wavg'] = df_train_mf2['A'].rolling(96*7, min_periods=10).mean() # 7-day rolling mean
df_train_mf2['C'] = df_train_mf2['C'].shift(48) # lag C by 48 steps = 12 hours (peak cross-correlation)
df_train_mf2['C'] = df_train_mf2['C'].interpolate() # fill C's small gaps by interpolation
df_train_mf2 = df_train_mf2.drop(['B'], axis=1)
df_train_mf2_array = df_train_mf2.to_numpy()
df_train_mf2_array = np.log(df_train_mf2_array) #data is log-transformed

df_test_mf2 = df_test.copy()
df_test_mf2['Mavg'] = df_test_mf2['A'].rolling(96*30, min_periods=10).mean()
df_test_mf2['Wavg'] = df_test_mf2['A'].rolling(96*7, min_periods=10).mean()
df_test_mf2['C'] = df_test_mf2['C'].shift(48)
df_test_mf2['C'] = df_test_mf2['C'].interpolate()
df_test_mf2 = df_test_mf2.drop(['B'], axis=1)
df_test_mf2_array = df_test_mf2.to_numpy()
df_test_mf2_array = np.log(df_test_mf2_array) #data is log-transformed

imp_mf2 = IterativeImputer(random_state=0)
imp_mf2.fit(df_train_mf2_array)
Y = imp_mf2.transform(df_train_mf2_array)
Yt = imp_mf2.transform(df_test_mf2_array)
df_train_all['MICE-Flow-2'] = np.exp(Y[:,0])
df_test_all['MICE-Flow-2'] = np.exp(Yt[:,0])
/Users/taimoorakhtar/opt/anaconda3/envs/hydro-env/lib/python3.8/site-packages/sklearn/impute/_iterative.py:669: ConvergenceWarning:

[IterativeImputer] Early stopping criterion not reached.

In [18]:
import plotly.io as pio
fig = interactive_ts_plot(df_test_all[['Original', 'Mean', 'MICE-Flow-1', 'MICE-Flow-2']])
pio.show(fig)

Repeat the comparative experiment for different missing-data scenarios (synthetically generated as described above)
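
single_experiment is imported from models.py (see the imports at the top of the notebook) and bundles one full trial. A simplified sketch, assuming the helpers defined earlier and omitting the two MICE pipelines (which are built exactly as in In [15] and In [17]), might look like:

import numpy as np
from sklearn.impute import SimpleImputer
from utils import create_gaps

def single_experiment(df_train_orig, df_test_orig, perc=0.4):
    # One synthetic-gap trial: returns train/test frames holding 'Original',
    # 'Missing' (NaN where gaps were injected) and one column per method
    results = []
    for df_orig in (df_train_orig, df_test_orig):
        gapped = df_orig.copy()
        gapped['A'] = create_gaps(gapped['A'], perc=perc)
        res = df_orig[[]].copy()  # empty frame on the same index
        res['Original'] = df_orig['A']
        res['Missing'] = gapped['A']
        res['Mean'] = SimpleImputer().fit_transform(gapped)[:, 0]  # 'A' is column 0
        res['Interpolate'] = gapped['A'].interpolate()
        # 'MICE-Flow-1' / 'MICE-Flow-2': log-transform, run IterativeImputer
        # as in the cells above, then exp back
        results.append(res)
    return tuple(results)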

In [19]:
ntrials = 40 # number of synthetic-gap trials / experiments
results_train = []
results_test = [] 
for trial in range(ntrials):
    df_train_all, df_test_all = single_experiment(df_train_orig, df_test_orig)
    results_train.append(df_train_all)
    results_test.append(df_test_all)
    print('Trial ' + str(trial) + ' finished')

nmodels = 4
models = ['Mean', 'Interpolate', 'MICE-Flow-1', 'MICE-Flow-2']
mape_vals = np.empty([ntrials, nmodels])
mae_vals = np.empty([ntrials, nmodels])

i = 0
for df_test in results_test:
    j = 0
    for model in models:
        # evaluate only at time steps where data was missing ('Missing' is NaN)
        y = df_test[df_test['Missing'].isnull()]['Original'].to_numpy()
        yhat = df_test[df_test['Missing'].isnull()][model].to_numpy()
        mape_vals[i,j] = mape(y, yhat)
        mae_vals[i,j] = mae(y, yhat)
        j = j + 1
    i = i + 1
/Users/taimoorakhtar/opt/anaconda3/envs/hydro-env/lib/python3.8/site-packages/sklearn/impute/_iterative.py:669: ConvergenceWarning:

[IterativeImputer] Early stopping criterion not reached.

Trial 0 finished
...
Trial 39 finished

(The ConvergenceWarning above was emitted repeatedly across the 40 trials; the repeated output is condensed here.)
In [20]:
mape_vals_train = np.empty([ntrials, nmodels])
mae_vals_train = np.empty([ntrials, nmodels])
i = 0
for df_train in results_train:
    j = 0
    for model in models:
        y = df_train[df_train['Missing'].isnull()]['Original'].to_numpy()
        yhat = df_train[df_train['Missing'].isnull()][model].to_numpy()
        mape_vals_train[i,j] = mape(y, yhat)
        mae_vals_train[i,j] = mae(y, yhat)
        j = j + 1
    i = i + 1

5. Model Evaluation

Metric-based Results Analysis

In the previous code block we generated 40 different synthetic training and test scenarios (method discussed previously) with missing values in the time series. We use the results of these 40 training and test experiments / trials to understand how three methods, i.e., Interpolation, MICE-Flow-1, and MICE-Flow-2, compare.

The following code generates the comparison box-plots.

In [21]:
## Show results in boxplot
df_mape = pd.DataFrame(data=mape_vals, index = range(ntrials), columns = models)
df_mae = pd.DataFrame(data=mae_vals, index = range(ntrials), columns = models)
df_mape_train = pd.DataFrame(data=mape_vals_train, index = range(ntrials), columns = models)
df_mae_train = pd.DataFrame(data=mae_vals_train, index = range(ntrials), columns = models)

fig = plt.figure(figsize=[10,8])
ax = fig.add_subplot(2,2,1)
df_mape_train.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAPE: Training Set')
ax = fig.add_subplot(2,2,2)
df_mae_train.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAE: Training Set')
ax = fig.add_subplot(2,2,3)
df_mape.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAPE: Test Set')
ax = fig.add_subplot(2,2,4)
df_mae.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAE: Test Set')
Out[21]:
Text(0.5, 1.0, 'MAE: Test Set')

Monthly Error Analysis

The overall performance-metric analysis (via the boxplots) shows that MICE-Flow-2 is the best amongst all strategies tested, with the lowest average absolute percentage error. Surprisingly, the MAPE and MAE values of MICE-Flow-2 and Interpolation are not very different (a hypothesis test could be performed to confirm this). However, when we examine the time-series plots of all prediction methods (as shown previously), it is clear that MICE-Flow-2 is the best amongst the methods tested.

We can investigate the error dynamics further by grouping individual percentage errors by month. This strategy enables us to understand how the different imputation algorithms perform in different seasons. For instance, it is important to understand how they perform in the baseflow-dominant winter months.

The monthly error analysis is prepared in the following code block.

In [22]:
import math
month_vals = []
trial_num = 0
for df_test in results_test:
    j = 0
    df_test.index = pd.to_datetime(df_test.index)
    for model in models:
        if trial_num == 0:
            month_vals.append([])
        for month in range(1, 13):
            if trial_num == 0:
                month_vals[j].append([])
            y = df_test[(df_test['Missing'].isnull()) & (df_test.index.month == month)]['Original'].to_numpy()
            yhat = df_test[(df_test['Missing'].isnull()) & (df_test.index.month == month)][model].to_numpy()
            for i in range(len(y)):
                if not math.isnan(y[i]):
                    month_vals[j][month-1].append(100*np.abs(y[i] - yhat[i])/y[i])
        j = j + 1
    trial_num = trial_num + 1
In [23]:
fig = plt.figure(figsize=[15,5])
ax = fig.add_subplot(1,3,1)
ax.boxplot(month_vals[1], showfliers=False)
ax.set_title("Interpolation")
ax.set_xlabel("Month Number")
ax.set_ylabel("Percentage Error")
ax = fig.add_subplot(1,3,2)
ax.boxplot(month_vals[2], showfliers=False)
ax.set_title("MICE-Flow-1")
ax.set_xlabel("Month Number")
ax.set_ylabel("Percentage Error")
ax = fig.add_subplot(1,3,3)
ax.boxplot(month_vals[3], showfliers=False)
ax.set_title("MICE-Flow-2")
ax.set_xlabel("Month Number")
ax.set_ylabel("Percentage Error")
Out[23]:
Text(0, 0.5, 'Percentage Error')

The above plots show the distributions of monthly absolute percentage errors (combined over multiple trials) for the three imputation methods. The third sub-figure illustrates that MICE-Flow-2 is more robust than the other algorithms tested, with a narrower overall range of error across months.

Another advantage of MICE-Flow-1 and MICE-Flow-2 is that these methods can also provide prediction bounds for their estimates. This is possible when multiple MICE trials are performed (we performed 40), each with a different synthetic data set.

Evaluation Results (Drawing Prediction Intervals)

This work now uses the 40 model trials of MICE-Flow-2 to develop predictions and prediction intervals for the 4 events provided in the problem statement, for final evaluation. Since multiple model trials are available (using different missing-data scenarios), multiple predictions are available for each missing data point. The average of these predictions is used as the estimate for a missing data point, and the interquartile range (25th-75th percentiles) is used to develop the prediction bounds.

The code snippet below creates the predictions (at provided evaluations points - 4 gap events) and prediction intervals.

In [24]:
#df_orig.set_index('DateTime', inplace=True)
df_eval_obs = df_orig[df_orig.index > "2018-01-01"]
## Extract predictions and put in a dataframe
num_trials = ntrials # one column of MICE-Flow-2 predictions per trial (40 trials)
col_names = []
for i in range(num_trials):
    cname = "Trial_" + str(i+1) 
    col_names.append(cname)
df_eval_e1 = pd.DataFrame(columns=col_names)
df_eval_e2 = pd.DataFrame(columns=col_names)
df_eval_e3 = pd.DataFrame(columns=col_names)
df_eval_e4 = pd.DataFrame(columns=col_names)
trial_num = 1
for df_test in results_test:
    cname = "Trial_" + str(trial_num)
    df_eval_e1[cname] = 0
    df_eval_e1[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-02-03") & (df_test.index <= "2018-02-20")]
    df_eval_e2[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-04-13") & (df_test.index <= "2018-04-28")]
    df_eval_e3[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-06-17") & (df_test.index <= "2018-07-01")]
    df_eval_e4[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-11-01") & (df_test.index <= "2018-11-15")]
    trial_num = trial_num + 1

col_names = ['Mean', 'ub', 'lb']
df_eval_e1_pred = pd.DataFrame(columns=col_names)
df_eval_e1_pred['Mean'] = df_eval_e1.mean(axis=1)
df_eval_e1_pred['ub'] = df_eval_e1.quantile(0.75, axis=1)
df_eval_e1_pred['lb'] = df_eval_e1.quantile(0.25, axis=1)

df_eval_e2_pred = pd.DataFrame(columns=col_names)
df_eval_e2_pred['Mean'] = df_eval_e2.mean(axis=1)
df_eval_e2_pred['ub'] = df_eval_e2.quantile(0.75, axis=1)
df_eval_e2_pred['lb'] = df_eval_e2.quantile(0.25, axis=1)

df_eval_e3_pred = pd.DataFrame(columns=col_names)
df_eval_e3_pred['Mean'] = df_eval_e3.mean(axis=1)
df_eval_e3_pred['ub'] = df_eval_e3.quantile(0.75, axis=1)
df_eval_e3_pred['lb'] = df_eval_e3.quantile(0.25, axis=1)

df_eval_e4_pred = pd.DataFrame(columns=col_names)
df_eval_e4_pred['Mean'] = df_eval_e4.mean(axis=1)
df_eval_e4_pred['ub'] = df_eval_e4.quantile(0.75, axis=1)
df_eval_e4_pred['lb'] = df_eval_e4.quantile(0.25, axis=1)
In [25]:
## Creating final evaluation plot with confidence bounds
import plotly.graph_objects as go
import plotly
import plotly.io as pio

plot_data = []
trace = go.Scatter(
            x=df_eval_obs.index,
            y=df_eval_obs['A'],
            name='Observed',
            line=dict(color='#636EFA', width=2))
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e1_pred.index,
            y=df_eval_e1_pred['Mean'],
            name='Event 1',
            line=dict(color='red', width=2, dash='dash'))
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e1_pred.index,
            y=df_eval_e1_pred['lb'],
            fill=None,
            line_color = 'red',
            mode='lines',
            line = dict(width=0.5),
            showlegend=False)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e1_pred.index,
            y=df_eval_e1_pred['ub'],
            fill='tonexty',
            mode='lines',
            name='Event 1 PI',
            line_color = 'red',
            line = dict(width=0.5),
            opacity = 0.5)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e2_pred.index,
            y=df_eval_e2_pred['Mean'],
            name='Event 2',
            line=dict(color='yellow', width=2, dash='dash'))
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e2_pred.index,
            y=df_eval_e2_pred['lb'],
            fill=None,
            line_color = 'yellow',
            mode='lines',
            line = dict(width=0.5),
            showlegend=False)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e2_pred.index,
            y=df_eval_e2_pred['ub'],
            fill='tonexty',
            mode='lines',
            name='Event 2 PI',
            line_color = 'yellow',
            line = dict(width=0.5),
            opacity = 0.5)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e3_pred.index,
            y=df_eval_e3_pred['Mean'],
            name='Event 3',
            line=dict(color='indigo', width=2, dash='dash'))
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e3_pred.index,
            y=df_eval_e3_pred['lb'],
            fill=None,
            line_color = 'indigo',
            mode='lines',
            line = dict(width=0.5),
            showlegend=False)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e3_pred.index,
            y=df_eval_e3_pred['ub'],
            fill='tonexty',
            mode='lines',
            name='Event 3 PI',
            line_color = 'indigo',
            line = dict(width=0.5),
            opacity = 0.5)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e4_pred.index,
            y=df_eval_e4_pred['Mean'],
            name='Event 4',
            line=dict(color='green', width=2, dash='dash'))
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e4_pred.index,
            y=df_eval_e4_pred['lb'],
            fill=None,
            line_color = 'green',
            mode='lines',
            line = dict(width=0.5),
            showlegend=False)
plot_data.append(trace)

trace = go.Scatter(
            x=df_eval_e4_pred.index,
            y=df_eval_e4_pred['ub'],
            fill='tonexty',
            mode='lines',
            name='Event 4 PI',
            line_color = 'green',
            line = dict(width=0.5),
            opacity = 0.5)
plot_data.append(trace)

layout = dict(
        title='Evaluation of Selected Imputation Technique',
        titlefont=dict(
            size=18
        ),
        xaxis=dict(
            title="Date",
            titlefont_size=14,
            tickfont_size=12,
            type='date'
        ),
        yaxis=dict(
            title="Flow",
            titlefont_size=14,
            tickfont_size=12,
        ),
        legend=dict(
            x=0.01,
            y=0.99,
            bordercolor="Black",
            yanchor='top',
            borderwidth=1
        )
    )
fig = dict(data=plot_data, layout=layout)
pio.show(fig)

Final Thoughts on Evaluation Results:

Results reported in the above figure show that MICE-Flow-2 performs reasonably on the evaluation data set, with an improvement over interpolation. However, there is potential to develop other models that significantly improve upon MICE-Flow-2. For instance, MICE-Flow-2 does not adequately represent auto-correlations in the flow time series where data needs to be filled. Some strategies were tested to incorporate such correlations; for example, lagged time series (with forward and backward lags) at site A were introduced as features in the model. However, due to feedback and error propagation, their inclusion deteriorated model performance. Also, the model is not able to extract knowledge from site B, since it was excluded due to large data gaps. Moreover, a critical requirement of the model is a second site (for cross-correlated prediction) with minimal data gaps. Given these issues, other learning models that are more flexible in terms of feature engineering may be more successful for this problem.

6. Future Recommendations

One promising machine learning strategy that could be explored and compared to MICE in the future is a tree-based algorithm [4]. The use of tree-based methods gives more freedom in terms of formulating the problem and selecting features. A tree-based model based on the following features is proposed here:

$X^t_A = F(X^{t}_B, X^{t-L}_B, X^{t}_C, X^{t-L}_C, {X}^{Month}_A, {X}^{start}_A, {X}^{next}_A, \delta, r_{gap})$

  • $X^t_A$ is the missing value in series A at time step $t$
  • $X^{t}_B$, $X^{t-L}_B$, $X^{t}_C$ and $X^{t-L}_C$ are values from the other series, with and without lag
  • ${X}^{Month}_A$ is the rolling monthly mean for series A
  • ${X}^{start}_A$ is the flow value at the previous observed time step (i.e., the most recent non-missing value) and ${X}^{next}_A$ is the flow at the next available time step
  • $\delta$ is the number of time steps from ${X}^{start}_A$ to $t$, and $r_{gap}$ is $\frac{\delta}{t_{delta}}$, where $t_{delta}$ is the total length of the gap that contains the current missing value (i.e., the missing value at time $t$)

The above features can be used to fit a tree-based algorithm like XGBoost, incorporating auto-correlations, cross-correlations, and patterns in other parts of the multivariate time series.
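
A sketch of how such a model could be set up is shown below. It uses XGBoost (any gradient-boosted tree library would do) and trains on the synthetic-gap scenarios generated earlier, assuming df_train and df_test hold gapped A/B/C frames as in In [13]; make_features, the lag L, and the hyperparameters are illustrative assumptions rather than a tested implementation.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor  # tree booster; handles NaN feature values natively

L = 48  # 12-hour lag in 15-minute steps, from the cross-correlation analysis

def make_features(df):
    # Assemble the feature set proposed above (column names are illustrative)
    X = pd.DataFrame(index=df.index)
    X['B'] = df['B']
    X['B_lag'] = df['B'].shift(L)
    X['C'] = df['C']
    X['C_lag'] = df['C'].shift(L)
    X['A_month'] = df['A'].rolling(96 * 30, min_periods=10).mean()
    observed = df['A'].notna()
    X['A_start'] = df['A'].ffill()  # last observed value before the gap
    X['A_next'] = df['A'].bfill()   # next observed value after the gap
    # delta: steps since the last observation; r_gap: relative position in the gap
    grp = observed.cumsum()
    delta = (~observed).groupby(grp).cumsum()
    gap_len = delta.groupby(grp).transform('max')
    X['delta'] = delta
    X['r_gap'] = (delta / gap_len.replace(0, np.nan)).fillna(0)
    return X

# Train on rows where a synthetic gap hides a known value...
X = make_features(df_train)
fit_rows = df_train['A'].isna() & df_train_orig['A'].notna()
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=6)
model.fit(X[fit_rows], np.log(df_train_orig.loc[fit_rows, 'A']))

# ...then fill the gaps at site A in a test scenario
Xt = make_features(df_test)
gap_rows = df_test['A'].isna()
df_test.loc[gap_rows, 'A'] = np.exp(model.predict(Xt[gap_rows]))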

References

  1. Stef van Buuren, Karin Groothuis-Oudshoorn (2011). “mice: Multivariate Imputation by Chained Equations in R”. Journal of Statistical Software 45: 1-67.
  2. Ruggles, T.H., Farnham, D.J., Tong, D. et al. Developing reliable hourly electricity demand data through screening and imputation. Sci Data 7, 155 (2020). https://doi.org/10.1038/s41597-020-0483-x
  3. Daniel J. Stekhoven, Peter Bühlmann, MissForest—non-parametric missing value imputation for mixed-type data, Bioinformatics, Volume 28, Issue 1, 1 January 2012, Pages 112–118, https://doi.org/10.1093/bioinformatics/btr597
  4. Xu, X., Liu, X., Kang, Y. et al. A Multi-directional Approach for Missing Value Estimation in Multivariate Time Series Clinical Data. J Healthc Inform Res (2020). https://doi.org/10.1007/s41666-020-00076-2
  5. Che, Z., Purushotham, S., Cho, K. et al. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep 8, 6085 (2018). https://doi.org/10.1038/s41598-018-24271-9