Created by: Taimoor Akhtar
This notebook explores different statistical models and analyzes their ability to fill gaps in streamflow sensor data. Data for three correlated flow sites is provided, and the problem statement is to fill data gaps at the most downstream site, i.e., site A.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
from utils import *
from models import single_experiment
The general data exploration strategy is to examine trends in the flow data at all sites through a) time-series plots, b) cross-correlations, c) auto-correlations and d) visualizations of gains. While analysis of auto-correlations is customary in time-series analysis, lagged cross-correlations and gains are also analyzed here because of their importance in hydrological flow and routing.
# Loading data and plotting the time-series at A, B and C
df = pd.read_csv("ts_data.csv", parse_dates=["DateTime"])  # parse the timestamp column explicitly
df.set_index('DateTime', inplace=True)
df_orig = df.copy()
# Removing the date intervals specified for prediction in the Problem Statement
df["A"][(df.index >= "2018-02-03") & (df.index <= "2018-02-20")] = np.nan
df["A"][(df.index >= "2018-04-13") & (df.index <= "2018-04-28")] = np.nan
df["A"][(df.index >= "2018-06-17") & (df.index <= "2018-07-01")] = np.nan
df["A"][(df.index >= "2018-11-01") & (df.index <= "2018-11-15")] = np.nan
fig = plt.figure(figsize=[20,5])
ax = fig.add_subplot(1,1,1)
df.plot(ax=ax)
# Understanding cross-correlations
df_A = df['A']
df_B = df['B']
df_C = df['C']
rb = [crosscorr(df_A, df_B, lag) for lag in range(300)]
rc = [crosscorr(df_A, df_C, lag) for lag in range(300)]
f,ax=plt.subplots(figsize=(14,3))
ax.plot(rb, color = 'g', label = 'Site B')
ax.plot(rc, color = 'r', label = 'Site C')
ax.set(title='Cross-Correlation of Site A with B and C', xlabel='Lag (15 Minute Intervals)', ylabel='Correlation (Pearson r)');
plt.legend()
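The `crosscorr` helper imported from `utils` is assumed here to compute a lag-shifted Pearson correlation between two series; a minimal sketch of such a helper (an assumption about `utils`, not its actual source) is shown below.
def crosscorr_sketch(x: pd.Series, y: pd.Series, lag: int = 0) -> float:
    """Pearson correlation between x(t) and y(t - lag); pandas aligns the shifted series and drops NaNs."""
    return x.corr(y.shift(lag))
# Example: correlation of A with C lagged by 48 intervals (12 hours)
# r_48 = crosscorr_sketch(df_A, df_C, lag=48)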
df_A = df['A'][df.index <= "2018-01-31"]
fig = plt.figure(figsize=[20,10])
ax = fig.add_subplot(3,1,1)
sns.distplot(df_A.dropna(), ax=ax)  # distribution of flow at site A (data up to end of January 2018)
ax = fig.add_subplot(3,1,2)
plot_acf(df_A, lags=2000, ax=ax)
ax = fig.add_subplot(3,1,3)
plot_pacf(df_A, lags=100, ax=ax, method='ols');
# Understanding gains and any distributions of gains
df_diff_1 = df_A - df_B
df_diff_2 = df_A - df_C
fig = plt.figure(figsize=[20,5])
ax = fig.add_subplot(1,1,1)
df_diff_1.plot(ax=ax, label='Gain relative to B (A - B)')
df_diff_2.plot(ax=ax, label='Gain relative to C (A - C)')
ax.legend()
This work compares the modeling strategy(ies) proposed here with two baselines: i) filling missing data with the global time-series mean (using SimpleImputer from sklearn) and ii) interpolation (the Pandas built-in method). The Mean Absolute Percentage Error (MAPE) and the Mean Percentage Error (MPE) are used to evaluate the models analyzed in this work. MPE is included to highlight any biases in the predictions, which is very important in hydrologic problems. Both evaluation metrics are described below:
$\text{MAPE} = \frac{1}{n} \sum_i \left| \frac{\text{pred}_i - \text{obs}_i}{\text{obs}_i} \right|$
$\text{MPE} = \frac{1}{n} \sum_i \frac{\text{pred}_i - \text{obs}_i}{\text{obs}_i}$
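The `mape` and `mae` helpers used later in this notebook come from `utils`; the sketch below simply mirrors the two formulas above (expressed as percentages, and assuming NaN-free, non-zero observations) and is not the `utils` implementation.
def mape_sketch(obs, pred):
    """Mean absolute percentage error, in percent."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs((pred - obs) / obs)) * 100)

def mpe_sketch(obs, pred):
    """Mean percentage error, in percent; its sign exposes systematic over/under-prediction."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean((pred - obs) / obs) * 100)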
The literature review summarised in Section 1 highlighted multiple promising algorithms for filling missing data in multivariate time series. This work evaluates one such strategy, i.e., the Multivariate Imputation by Chained Equations (MICE) [2] method (a variation of the original MICE is actually used). [2] applies MICE to fill missing data in cross-correlated hourly energy demand data.
The exploratory analysis indicates that:
- flow at site A is strongly cross-correlated with the upstream sites B and C, with the peak correlation occurring at a non-zero lag;
- flow at site A is highly persistent, i.e., shows strong auto-correlation at short lags; and
- site B has large temporal gaps in its record, which limits its usefulness as a predictor.
We divide the data set into training and test sets: years 2016 and 2017 are used for model training and year 2018 is used as the test set.
# The training set is approximately 70 percent of the total data, i.e., years 2016-17 (excluding the time periods reserved for evaluation)
df_train_orig = df[df.index <= "2017-12-31"].copy()
df_test_orig = df[df.index > "2017-12-31"].copy()
## Let's create one training and one test dataset with events that have missing sensor data
df_train = df_train_orig.copy()
df_train_A = df_train['A']
df_train_A = create_gaps(df_train_A, perc=0.4)
df_train['A'] = df_train_A
df_test = df_test_orig.copy()
df_test_A = df_test['A']
df_test_A = create_gaps(df_test_A, perc=0.4)
df_test['A'] = df_test_A
A key preliminary step, prior to training and testing different models for gap filling, is the transformation of the training and test datasets to introduce synthetic 'gap events' (note that the 4 evaluation events specified in the problem statement have already been removed, i.e., set to missing, in the data). The synthetic data sets are generated under the following assumptions:
- synthetic gaps are introduced only in the series for site A; sites B and C retain their original records (including their original gaps);
- a substantial fraction of site A's record (controlled by `perc=0.4` in `create_gaps`) is removed in each synthetic scenario; and
- the missing data is introduced as contiguous 'gap events' rather than isolated missing points, to mimic real sensor outages.
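`create_gaps` is imported from `utils`; the sketch below shows one plausible way such a function could inject contiguous synthetic gap events covering roughly `perc` of a series (this is an assumption about its behaviour, not the actual implementation).
def create_gaps_sketch(s: pd.Series, perc: float = 0.4, n_events: int = 20, seed=None) -> pd.Series:
    """Blank out approximately perc of the series as n_events contiguous gap events."""
    rng = np.random.default_rng(seed)
    out = s.copy()
    gap_len = max(1, int(len(s) * perc / n_events))  # samples removed per synthetic event
    starts = rng.integers(0, len(s) - gap_len, size=n_events)
    for start in starts:
        out.iloc[start:start + gap_len] = np.nan
    return out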
## Fit simple models as baseline that use strategies like filling with mean etc.
from sklearn.impute import SimpleImputer
## Create two dataframes that will store all training and test series from all models
df_train_all = df_train_orig.copy()
df_train_all['Original'] = df_train_all['A']
df_train_all = df_train_all.drop(['A', 'B', 'C'], axis=1)
df_test_all = df_test_orig.copy()
df_test_all['Original'] = df_test_all['A']
df_test_all = df_test_all.drop(['A', 'B', 'C'], axis=1)
## Use global mean as predictor: Baseline 1
imp_mean = SimpleImputer()
imp_mean.fit(df_train)
Y = imp_mean.transform(df_train)
df_train_all['Mean'] = Y[:,0]
Yt = imp_mean.transform(df_test)
df_test_all['Mean'] = Yt[:,0]
## Use Interpolation as predictor: Baseline 2
df_train_temp = df_train.copy()
df_train_ip = df_train_temp['A'].interpolate()
df_train_all['Interpolate'] = df_train_ip
df_test_temp = df_test.copy()
df_test_ip = df_test_temp['A'].interpolate()
df_test_all['Interpolate'] = df_test_ip
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df_train_mf1 = df_train.copy()
df_train_mf1 = df_train_mf1[['B', 'C', 'A']] #order of filling is upstream to downstream
df_train_mf1_array = df_train_mf1.to_numpy()
df_train_mf1_array = np.log(df_train_mf1_array) #data is log-transformed
df_test_mf1 = df_test.copy()
df_test_mf1 = df_test_mf1[['B', 'C', 'A']] #order of filling is upstream to downstream
df_test_mf1_array = df_test_mf1.to_numpy()
df_test_mf1_array = np.log(df_test_mf1_array) #data is log-transformed
imp_mf1 = IterativeImputer(random_state=0, imputation_order='roman')
imp_mf1.fit(df_train_mf1_array)
Y = imp_mf1.transform(df_train_mf1_array)
Yt = imp_mf1.transform(df_test_mf1_array)
df_train_all['MICE-Flow-1'] = np.exp(Y[:,2])
df_test_all['MICE-Flow-1'] = np.exp(Yt[:,2])
As mentioned earlier, the model selected for this analysis is a variation of the statistical method MICE. The basic strategy behind MICE is to use chained regression equations to sequentially fill gaps in multiple time series. In our problem, MICE iteratively fills gaps in the time series of flow stations A, B and C (the order can be specified; by default, the series with the fewest gaps is filled first). In each iteration, one of the variables becomes the dependent variable while the others act as independent variables.
We first implement MICE with A, B and C as the input series. In Python, a variation of MICE is implemented within sklearn's IterativeImputer. IterativeImputer employs Bayesian Ridge Regression (to introduce regularization) between the series to fill missing values. Consequently, the following equation represents the basic MICE model (a variant of MICE) used for filling gaps in the data of flow site A:
$X^t_A = \alpha_B X^t_B + \alpha_C X^t_C + \epsilon$
Please note that time series B and C also have gaps (in the original data). These gaps are filled first in this variant of the MICE model (also using Bayesian Ridge Regression). In the original MICE model, gaps are filled in ascending order of the number of data gaps in a series.
The above-mentioned implementation of MICE is called MICE-Flow-1 in the subsequent discussion. Some further assumptions are as follows:
- all series are log-transformed before imputation and back-transformed (exponentiated) afterwards;
- the imputation (filling) order is upstream to downstream, i.e., B, C and then A, specified via `imputation_order='roman'`; and
- sklearn's default Bayesian Ridge Regression estimator is used for the chained regressions.
MICE-Flow-1 is implemented in the code snippet above; the interactive plot below compares its training-period predictions with the baselines and the original series.
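For clarity, the defaults relied on above can be written out explicitly: IterativeImputer uses BayesianRidge as its estimator, and `imputation_order='roman'` fills columns left to right (here B, C, then A), whereas the default `'ascending'` order matches the original MICE convention of filling the series with the fewest gaps first. A sketch of the equivalent explicit configuration:
from sklearn.linear_model import BayesianRidge

imp_mf1_explicit = IterativeImputer(
    estimator=BayesianRidge(),   # regularised chained regressions (the sklearn default)
    imputation_order='roman',    # left to right: B, C, then A
    max_iter=10,                 # default number of imputation rounds
    random_state=0,
)
# imp_mf1_explicit.fit(df_train_mf1_array) would reproduce imp_mf1 above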
import plotly.io as pio
fig = interactive_ts_plot(df_train_all)
pio.show(fig)
MICE-Flow-1 only takes into account correlations between site A and the upstream river locations; auto-correlations (local trends within time series A) are not considered. Furthermore, the model does not account for lagged cross-correlations with the other streamflow sites (lagged correlations were noticed during data exploration). Finally, missing data at the other locations appears to cause sudden jumps (noise) in the predictions of MICE-Flow-1; for instance, see the data of July 6, 2016 in the above model comparison plot (in the training period).
To alleviate the aforementioned shortcomings of MICE-Flow-1, new input features are proposed to i) introduce the effect of auto-correlations into the prediction (via a 1-week rolling mean), ii) introduce the effect of lagged cross-correlations, and iii) improve model stability by adding a 'local mean' (a 1-month rolling mean). Also, since there are very large temporal gaps in the data for upstream site B, it is excluded as a model feature. The mathematical representation of the modified MICE model (for site A) is as follows:
$X^t_A = \alpha_C X^{t-L}_C + \gamma_{1}\overline{X}^{Month}_A + \gamma_{2}\overline{X}^{Week}_A + \epsilon$
In the above equation, $X^t_A$ is the time series at flow location A, $X^{t-L}_C$ is the lagged time series at location C, $\overline{X}^{Month}_A$ is a 30-day rolling mean of the series at A, and $\overline{X}^{Week}_A$ is a 7-day rolling mean of the series at A (both rolling means are trailing windows ending at time $t$, as in the implementation below).
The proposed model is called MICE-Flow-2. Some further assumptions for model fitting are as follows:
- the lag $L$ for site C is fixed at 48 fifteen-minute intervals, i.e., 12 hours (one way this lag could be chosen from the cross-correlation sweep is sketched after this list);
- gaps in the lagged series for site C are filled by interpolation before imputation;
- the rolling means use trailing windows of 96×30 and 96×7 intervals (30 days and 7 days at 15-minute resolution), with a minimum of 10 observations;
- site B is dropped from the feature set because of its large data gaps; and
- as in MICE-Flow-1, all series are log-transformed before imputation and exponentiated afterwards.
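One hedged way to choose the 12-hour lag mentioned above would be to read it off the cross-correlation sweep computed during data exploration (the `rc` list); the notebook itself simply fixes L = 48 intervals.
# Lag (in 15-minute intervals) at which the A-C cross-correlation peaks
best_lag_C = int(np.nanargmax(rc))
print(f"Peak A-C cross-correlation at lag {best_lag_C} intervals (~{best_lag_C * 15 / 60:.1f} hours)")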
The code for implementation of MICE-Flow-2 is in the following snippet.
## Second Level of the MICE method
df_train_mf2 = df_train.copy()
df_train_mf2['Mavg'] = df_train_mf2['A'].rolling(96*30, min_periods=10).mean()
df_train_mf2['Wavg'] = df_train_mf2['A'].rolling(96*7, min_periods=10).mean()
df_train_mf2['C'] = df_train_mf2['C'].shift(48)
df_train_mf2['C'] = df_train_mf2['C'].interpolate()
df_train_mf2 = df_train_mf2.drop(['B'], axis=1)
df_train_mf2_array = df_train_mf2.to_numpy()
df_train_mf2_array = np.log(df_train_mf2_array) #data is log-transformed
df_test_mf2 = df_test.copy()
df_test_mf2['Mavg'] = df_test_mf2['A'].rolling(96*30, min_periods=10).mean()
df_test_mf2['Wavg'] = df_test_mf2['A'].rolling(96*7, min_periods=10).mean()
df_test_mf2['C'] = df_test_mf2['C'].shift(48)
df_test_mf2['C'] = df_test_mf2['C'].interpolate()
df_test_mf2 = df_test_mf2.drop(['B'], axis=1)
df_test_mf2_array = df_test_mf2.to_numpy()
df_test_mf2_array = np.log(df_test_mf2_array) #data is log-transformed
imp_mf2 = IterativeImputer(random_state=0)
imp_mf2.fit(df_train_mf2_array)
Y = imp_mf2.transform(df_train_mf2_array)
Yt = imp_mf2.transform(df_test_mf2_array)
df_train_all['MICE-Flow-2'] = np.exp(Y[:,0])
df_test_all['MICE-Flow-2'] = np.exp(Yt[:,0])
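As an optional sanity check (not part of the original notebook), IterativeImputer.transform only fills entries that were missing, so the predictions should match the (log-transformed) observations of site A wherever data was present; column 0 corresponds to site A, consistent with the `Y[:,0]` indexing above.
obs_mask = ~np.isnan(df_test_mf2_array[:, 0])          # rows where site A was observed
assert np.allclose(Yt[obs_mask, 0], df_test_mf2_array[obs_mask, 0]), \
    "the imputer should leave observed values of A untouched"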
import plotly.io as pio
fig = interactive_ts_plot(df_test_all[['Original', 'Mean', 'MICE-Flow-1', 'MICE-Flow-2']])
pio.show(fig)
ntrials = 40  # number of synthetic gap experiments (training/test pairs)
results_train = []
results_test = []
for trial in range(ntrials):
    df_train_all, df_test_all = single_experiment(df_train_orig, df_test_orig)
    results_train.append(df_train_all)
    results_test.append(df_test_all)
    print('Trial ' + str(trial) + ' finished')
nmodels = 4
models = ['Mean', 'Interpolate', 'MICE-Flow-1', 'MICE-Flow-2']
mape_vals = np.empty([ntrials, nmodels])
mae_vals = np.empty([ntrials, nmodels])
i = 0
for df_test in results_test:
    j = 0
    for model in models:
        y = df_test[df_test['Missing'].isnull()]['Original'].to_numpy()
        yhat = df_test[df_test['Missing'].isnull()][model].to_numpy()
        mape_vals[i,j] = mape(y, yhat)
        mae_vals[i,j] = mae(y, yhat)
        j = j + 1
    i = i + 1
mape_vals_train = np.empty([ntrials, nmodels])
mae_vals_train = np.empty([ntrials, nmodels])
i = 0
for df_train in results_train:
    j = 0
    for model in models:
        y = df_train[df_train['Missing'].isnull()]['Original'].to_numpy()
        yhat = df_train[df_train['Missing'].isnull()][model].to_numpy()
        mape_vals_train[i,j] = mape(y, yhat)
        mae_vals_train[i,j] = mae(y, yhat)
        j = j + 1
    i = i + 1
In the previous code block we generated `ntrials` (40) different synthetic training and test sets (using the method discussed previously) with missing values in the time series. We use the results of these training and test experiments / trials to compare three methods, i.e., Interpolation, MICE-Flow-1, and MICE-Flow-2.
The following code generates the comparison box-plots.
## Show results in boxplot
df_mape = pd.DataFrame(data=mape_vals, index = range(ntrials), columns = models)
df_mae = pd.DataFrame(data=mae_vals, index = range(ntrials), columns = models)
df_mape_train = pd.DataFrame(data=mape_vals_train, index = range(ntrials), columns = models)
df_mae_train = pd.DataFrame(data=mae_vals_train, index = range(ntrials), columns = models)
fig = plt.figure(figsize=[10,8])
ax = fig.add_subplot(2,2,1)
df_mape_train.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAPE: Training Set')
ax = fig.add_subplot(2,2,2)
df_mae_train.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAE: Training Set')
ax = fig.add_subplot(2,2,3)
df_mape.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAPE: Test Set')
ax = fig.add_subplot(2,2,4)
df_mae.boxplot(column = ['Interpolate', 'MICE-Flow-1', 'MICE-Flow-2'], ax=ax)
ax.set_title('MAE: Test Set')
The overall performance-metric analysis (via boxplots) shows that MICE-Flow-2 achieves the lowest errors amongst all strategies tested. Surprisingly, the MAPE and MAE values of MICE-Flow-2 and Interpolation are not very different (a hypothesis test could be used to confirm this). However, when the time-series plots of all prediction methods are examined (as shown previously), it is clear that MICE-Flow-2 is the best amongst the methods tested.
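The hypothesis test alluded to above could, for example, be a paired Wilcoxon signed-rank test on the per-trial test-set MAPE values of Interpolation versus MICE-Flow-2; a sketch (not run as part of the original analysis):
from scipy.stats import wilcoxon

# Paired across the synthetic-gap trials: is the MAPE difference significant?
stat, p_value = wilcoxon(df_mape['Interpolate'], df_mape['MICE-Flow-2'])
print(f"Wilcoxon signed-rank statistic = {stat:.2f}, p-value = {p_value:.4f}")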
We can investigate the error dynamics further by grouping the individual percentage errors by month. This will show how the different imputation algorithms perform across seasons; for instance, it is important to understand how they perform in the baseflow-dominated winter months.
The monthly error analysis is prepared in the following code block.
import math
month_vals = []
trial_num = 0
for df_test in results_test:
    j = 0
    df_test.index = pd.to_datetime(df_test.index)
    for model in models:
        if trial_num == 0:
            month_vals.append([])
        for month in range(1, 13):
            if trial_num == 0:
                month_vals[j].append([])
            y = df_test[(df_test['Missing'].isnull()) & (df_test.index.month == month)]['Original'].to_numpy()
            yhat = df_test[(df_test['Missing'].isnull()) & (df_test.index.month == month)][model].to_numpy()
            for i in range(len(y)):
                if not math.isnan(y[i]):
                    month_vals[j][month-1].append(100*np.abs(y[i] - yhat[i])/y[i])
        j = j + 1
    trial_num = trial_num + 1
fig = plt.figure(figsize=[15,5])
ax = fig.add_subplot(1,3,1)
ax.boxplot(month_vals[1], showfliers=False)
ax.set_title("Interpolation")
ax.set_xlabel("Month Number")
ax.set_ylabel("Percentage Error")
ax = fig.add_subplot(1,3,2)
ax.boxplot(month_vals[2], showfliers=False)
ax.set_title("MICE-Flow-1")
ax.set_xlabel("Month Number")
ax.set_ylabel("Percentage Error")
ax = fig.add_subplot(1,3,3)
ax.boxplot(month_vals[3], showfliers=False)
ax.set_title("MICE-Flow-2")
ax.set_xlabel("Month Number")
ax.set_ylabel("Percentage Error")
This work now uses the full set of MICE-Flow-2 trials (one per synthetic-gap experiment) to develop predictions and prediction intervals for the 4 events provided in the problem statement, for final evaluation. Since multiple model trials are available (each using a different missing-data scenario), multiple predictions exist for each missing data point. The average of these predictions is used as the estimate for a missing data point, and the inter-quartile range (25th to 75th percentiles) across trials is used to develop the prediction bounds.
The code snippet below creates the predictions (at the provided evaluation points, i.e., the 4 gap events) and the prediction intervals.
df_eval_obs = df_orig[df_orig.index > "2018-01-01"]
## Extract predictions and put in a dataframe
num_trials = ntrials  # one prediction column per synthetic gap trial
col_names = []
for i in range(num_trials):
    cname = "Trial_" + str(i+1)
    col_names.append(cname)
df_eval_e1 = pd.DataFrame(columns=col_names)
df_eval_e2 = pd.DataFrame(columns=col_names)
df_eval_e3 = pd.DataFrame(columns=col_names)
df_eval_e4 = pd.DataFrame(columns=col_names)
trial_num = 1
for df_test in results_test:
    cname = "Trial_" + str(trial_num)
    df_eval_e1[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-02-03") & (df_test.index <= "2018-02-20")]
    df_eval_e2[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-04-13") & (df_test.index <= "2018-04-28")]
    df_eval_e3[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-06-17") & (df_test.index <= "2018-07-01")]
    df_eval_e4[cname] = df_test["MICE-Flow-2"][(df_test.index >= "2018-11-01") & (df_test.index <= "2018-11-15")]
    trial_num = trial_num + 1
col_names = ['Mean', 'ub', 'lb']
df_eval_e1_pred = pd.DataFrame(columns=col_names)
df_eval_e1_pred['Mean'] = df_eval_e1.mean(axis=1)
df_eval_e1_pred['ub'] = df_eval_e1.quantile(0.75, axis=1)
df_eval_e1_pred['lb'] = df_eval_e1.quantile(0.25, axis=1)
df_eval_e2_pred = pd.DataFrame(columns=col_names)
df_eval_e2_pred['Mean'] = df_eval_e2.mean(axis=1)
df_eval_e2_pred['ub'] = df_eval_e2.quantile(0.75, axis=1)
df_eval_e2_pred['lb'] = df_eval_e2.quantile(0.25, axis=1)
df_eval_e3_pred = pd.DataFrame(columns=col_names)
df_eval_e3_pred['Mean'] = df_eval_e3.mean(axis=1)
df_eval_e3_pred['ub'] = df_eval_e3.quantile(0.75, axis=1)
df_eval_e3_pred['lb'] = df_eval_e3.quantile(0.25, axis=1)
df_eval_e4_pred = pd.DataFrame(columns=col_names)
df_eval_e4_pred['Mean'] = df_eval_e4.mean(axis=1)
df_eval_e4_pred['ub'] = df_eval_e4.quantile(0.75, axis=1)
df_eval_e4_pred['lb'] = df_eval_e4.quantile(0.25, axis=1)
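The four per-event blocks above repeat the same pattern; a hypothetical helper such as the one sketched below could build each prediction/interval frame from a date window in a single call (the function name and signature are illustrative, not part of the original notebook).
def event_prediction_frame(results, start, end, model='MICE-Flow-2'):
    """Gather one model's predictions for a date window across all trials and
    summarise them as a mean plus an inter-quartile band."""
    trials = pd.concat(
        {f"Trial_{k + 1}": r[model][(r.index >= start) & (r.index <= end)]
         for k, r in enumerate(results)},
        axis=1,
    )
    return pd.DataFrame({
        'Mean': trials.mean(axis=1),
        'ub': trials.quantile(0.75, axis=1),
        'lb': trials.quantile(0.25, axis=1),
    })

# Example, equivalent to df_eval_e1_pred above:
# df_eval_e1_pred = event_prediction_frame(results_test, "2018-02-03", "2018-02-20")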
## Creating final evaluation plot with confidence bounds
import plotly.graph_objects as go
import plotly
import plotly.io as pio
plot_data = []
trace = go.Scatter(
x=df_eval_obs.index,
y=df_eval_obs['A'],
name='Observed',
line=dict(color='#636EFA', width=2))
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e1_pred.index,
y=df_eval_e1_pred['Mean'],
name='Event 1',
line=dict(color='red', width=2, dash='dash'))
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e1_pred.index,
y=df_eval_e1_pred['lb'],
fill=None,
line_color = 'red',
mode='lines',
line = dict(width=0.5),
showlegend=False)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e1_pred.index,
y=df_eval_e1_pred['ub'],
fill='tonexty',
mode='lines',
name='Event 1 PI',
line_color = 'red',
line = dict(width=0.5),
opacity = 0.5)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e2_pred.index,
y=df_eval_e2_pred['Mean'],
name='Event 2',
line=dict(color='yellow', width=2, dash='dash'))
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e2_pred.index,
y=df_eval_e2_pred['lb'],
fill=None,
line_color = 'yellow',
mode='lines',
line = dict(width=0.5),
showlegend=False)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e2_pred.index,
y=df_eval_e2_pred['ub'],
fill='tonexty',
mode='lines',
name='Event 2 PI',
line_color = 'yellow',
line = dict(width=0.5),
opacity = 0.5)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e3_pred.index,
y=df_eval_e3_pred['Mean'],
name='Event 3',
line=dict(color='indigo', width=2, dash='dash'))
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e3_pred.index,
y=df_eval_e3_pred['lb'],
fill=None,
line_color = 'indigo',
mode='lines',
line = dict(width=0.5),
showlegend=False)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e3_pred.index,
y=df_eval_e3_pred['ub'],
fill='tonexty',
mode='lines',
name='Event 3 PI',
line_color = 'indigo',
line = dict(width=0.5),
opacity = 0.5)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e4_pred.index,
y=df_eval_e4_pred['Mean'],
name='Event 4',
line=dict(color='green', width=2, dash='dash'))
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e4_pred.index,
y=df_eval_e4_pred['lb'],
fill=None,
line_color = 'green',
mode='lines',
line = dict(width=0.5),
showlegend=False)
plot_data.append(trace)
trace = go.Scatter(
x=df_eval_e4_pred.index,
y=df_eval_e4_pred['ub'],
fill='tonexty',
mode='lines',
name='Event 4 PI',
line_color = 'green',
line = dict(width=0.5),
opacity = 0.5)
plot_data.append(trace)
layout = dict(
title='Evaluation of Selected Imputation Technique',
titlefont=dict(
size=18
),
xaxis=dict(
title="Date",
titlefont_size=14,
tickfont_size=12,
type='date'
),
yaxis=dict(
title="Flow",
titlefont_size=14,
tickfont_size=12,
),
legend=dict(
x=0.01,
y=0.99,
bordercolor="Black",
yanchor='top',
borderwidth=1
)
)
fig = dict(data=plot_data, layout=layout)
pio.show(fig)
The results reported in the above figure show that MICE-Flow-2 performs reasonably on the evaluation data set, with an improvement over interpolation. However, there is potential to develop other models that significantly improve upon MICE-Flow-2. For instance, MICE-Flow-2 does not adequately represent auto-correlations in the flow time series where data needs to be filled. Some strategies were tested to incorporate such correlations; for example, lagged versions of the site A series (with forward and backward lags) were introduced as model features. However, due to feedback and error propagation, their inclusion deteriorated model performance. Also, the model cannot extract knowledge from site B, since it was excluded because of its large data gaps. Moreover, a critical requirement of the model is a second site (for cross-correlated prediction) with minimal data gaps. Given these issues, other learning models that are more flexible in terms of feature engineering may be more successful for this problem.
One promising machine learning strategy that could be explored and compared with MICE in future work is a tree-based algorithm [4]. Tree-based methods give more freedom in terms of formulating the problem and selecting features. A tree-based model based on the following features is proposed here:
$X^t_A = F(X^{t}_B, X^{t-L}_B, X^{t}_C, X^{t-L}_C, {X}^{Month}_A, {X}^{start}_A, {X}^{next}_A, \delta, r_{gap})$
The above features can be used to fit a tree-based algorithm such as XGBoost, incorporating auto-correlations, cross-correlations, and patterns in other parts of the multivariate time series.
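As a starting point for such future work, the sketch below (an illustration only, using sklearn's HistGradientBoostingRegressor rather than XGBoost because it tolerates NaN inputs out of the box and requires scikit-learn >= 0.24) fits a tree-based model on the subset of the proposed features that are already defined in this notebook; the gap-specific features ($X^{start}_A$, $X^{next}_A$, $\delta$ and $r_{gap}$) are omitted here.
from sklearn.ensemble import HistGradientBoostingRegressor

LAG = 48  # 12 hours of 15-minute intervals, as in MICE-Flow-2

def build_tree_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Assemble upstream, lagged-upstream and rolling-mean features for predicting site A."""
    feats = pd.DataFrame(index=frame.index)
    feats['B'] = frame['B']
    feats['B_lag'] = frame['B'].shift(LAG)
    feats['C'] = frame['C']
    feats['C_lag'] = frame['C'].shift(LAG)
    feats['A_month_mean'] = frame['A'].rolling(96 * 30, min_periods=10).mean()
    return feats

X_tree = build_tree_features(df_train)
y_tree = df_train['A']
fit_mask = y_tree.notna()                       # train only where site A was observed

tree_model = HistGradientBoostingRegressor(random_state=0)
tree_model.fit(X_tree[fit_mask], y_tree[fit_mask])

# Predictions at the synthetic gap locations in the training set:
# a_hat = tree_model.predict(X_tree[y_tree.isna()])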