LAB PROJECT: HUMAN ACTIVITY RECOGNITION¶

Luis Cerdán¶

17/05/2020¶

1) Introduction¶

In this lab project we have to infer the activity (1-Running, 2-Walking, 3-Standing, 4-Sitting, 5-Lying) performed by two unlabeled subjects using our Machine or Deep Learning algorithm of preference. For that we have been provided data for 8 labelled sequences corresponding to 8 different persons. For each subject, there are 5 signals with a sampling frequency of 16 Hz simultaneously acquired from the accelerometer (z and x&y axes) and gyroscope (3-axes) of a wearable device.

The steps that I have followed to attain this goal are summarized as follows:

Segmentation and feature extracton steps: I have applied a sort of Sliding Window approach coupled to the extraction of time-domain features. In particular, I have computed for each temporal observation the moving/rolling/running statistics in the time interval (window size) preceeding that observation. In this way, the features contain as many temporal observations as the original dataset. The running statistics have been obtained one person at a time, in such a way that there is no information leakage from one person to the other. I didn't find necessary to apply dimensionality reduction techniques.
Classification step: I have resorted to a gradient boosting decision tree algorithm. I have used hold-out to select the test and train partitions, and K-fold cross-validation for the training and validation partitions. I have ensured that the instances from different persons are not in different partitions. For the hyperparameter tunning phase, I have used a Bayesian Search algorithm. I have set the sliding window size as an external hyperparameter.

Finally, I have assessed the feature importance to know which is the most predictive statistic, and I have checked the performance of the classifier inspecting the confussion matrix for the prediction and visualizing the predictions on the samples whose activity must be inferred.

2) Data loading and inspection¶

Read matlab database:

from scipy.io import loadmat, savemat
HAR_database = loadmat('HAR_database.mat')
training_dataset=HAR_database['database_training']
target_dataset=HAR_database['database_test']

To visualize the dataset, I have plotted the temporal signals for a few of the training people assigning to each activity a different color. This way we can indentify which are the temporal signatures of each of them. Let's display the dataset:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from matplotlib.colors import ListedColormap, BoundaryNorm

num_p = 4 # number of people to plot

fig, axs = plt.subplots(5,num_p,figsize=(22,10))
for p in range(num_p):
    for i in range(5):
        # target person data
        person=training_dataset[p]
        y = person[0]
        y = y[i]
        x = np.arange(0,len(y))
        col_ind = person[1].flatten()
        
        points = np.array([x, y]).T.reshape(-1, 1, 2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        # Create a continuous norm to map from data points to colors
        norm = plt.Normalize(1, 5)
        lc = LineCollection(segments, cmap='jet', norm=norm)
        # Set the values used for colormapping
        lc.set_array(col_ind)
        lc.set_linewidth(1)
        line = axs[i,p].add_collection(lc)
        axs[i,p].set_xlim(x.min(), x.max())
        axs[i,p].set_ylim(1.1*np.min(y), 1.1*np.max(y))
        if i==0:
            axs[i, p].set_title('Person ' + str(p+1), fontsize=18)
        if p==0:
            axs[i, p].set_ylabel('Signal ' + str(i+1), fontsize=18)

fig.tight_layout(pad=0.1)
plt.show()

3) Segmentation and feature extraction¶

As I have commented in the introduction, I have applied a sort of Sliding Window approach coupled to the extraction of time-domain features. In particular, as features I will use moving (or running) statistics for each of the temporal signals. In other words, the feature value extracted from, and assigned to, the temporal observation t is the statistic (e.g., the mean) of the signal values in the time interval defined by the window size window preceeding, and including, that observation.

Going beyond the example of Avci et al., I have found convenient to use the mean, the median, the variance and the skewness as statistics. Accordingly, each temporal signal from the wearable device, will be converted into 4 temporal signals of the same length. In this sense, I have multiplied by 4 the dimension of the problem, but I have included a lot of new information.

A problem that arises with this sliding approach is that for the first window temporal observations, there are not enough observations to compute the statistic. To partially overcome this problem I have padded the signals at negative times with the reflection of the signal mirrored on the first values of the signal.

Accordingly, I have defined functions to return a strided array that contains the current temporal observation and the previous window observations, and to calculate the corresponding moving (or running) statistics.

from scipy.stats import skew

def running_window(a, window):
    pad = np.zeros(len(a.shape), dtype=np.int32)
    pad[-1] = window-1
    pad = list(zip(pad, np.zeros(len(a.shape), dtype=np.int32)))
    a = np.pad(a, pad,mode='reflect')
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def feat_running(signal,window):
    rolled_signal=running_window(signal, window)
    feat=np.mean(rolled_signal, axis=-1)
    feat=np.vstack((feat,np.median(rolled_signal, axis=-1)))
    feat=np.vstack((feat,np.var(rolled_signal, axis=-1)))
    feat=np.vstack((feat,skew(rolled_signal, axis=-1)))
    return feat.T

Let's visualize how these running statistics evolve in time for a given signal and two different window sizes:

window=[8,64]

feat_list=["Signal 1","Mean","Median","Variance","Skewness"]

fig, axs = plt.subplots(len(feat_list),len(window),figsize=(20,10))
for p in range(len(window)):
    # retrieve signals and activities for a given person
    person=training_dataset[4]    
    Y=np.transpose(person[1])
    # compute running statistics
    feat=feat_running(person[0],window[p])
    for i in range(len(feat_list)):
        # target person data
        if i==0:
            y = person[0]
            y = y[0,:]
        else:
            y = feat[:,5*(i-1)]
    
        # assign color to each activity
        x = np.arange(0,len(y))
        col_ind = person[1].flatten()        
        points = np.array([x, y]).T.reshape(-1, 1, 2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        # Create a continuous norm to map from data points to colors
        norm = plt.Normalize(1, 5)
        lc = LineCollection(segments, cmap='jet', norm=norm)
        # Set the values used for colormapping
        lc.set_array(col_ind)
        lc.set_linewidth(2)
        line = axs[i,p].add_collection(lc)
        axs[i,p].set_xlim(x.min(), x.max())
        axs[i,p].set_ylim(1.1*np.min(y), 1.1*np.max(y))
        if i==0:
            axs[i, p].set_title('Window size: ' + str(window[p]), fontsize=18)
        if p==0:
            axs[i, p].set_ylabel(feat_list[i], fontsize=18)

fig.tight_layout()
plt.show()

It can be seen that the chosen statistics evolve in time quite differently. This indicates that these features could be relevant for predicting the activity. The mean and the median are more similar, but the latter is less sensitive to outliers or extreme events. Alghough it could seem that they are redundant, it will be seen that both are relevant for the classification.

In addition, the choice of the sliding window size has an important impact in the temporal evolution. The larger the window size is, the smoother the feature evolution becomes. In addition, the transitions from activity to activity are smoother. This indicates that we are lossing information at short time scales and, hence, I will not be interested in using large window sizes.

Comment on feature extraction, cross-validation and data leakage

We have to be very careful to avoid data leakage from person to person (or from training to validation or test) while doing hyperparameter tunning or when evaluating the future performance of the model. In other words, information from one person in the training partition cannot appear either in the validation or test partitions. As I will use hold-out (sklearn train_test_split) and K-fold cross validation (sklearn Kfold) to define, respectively, the train and test partitions, and the training and validation partitions, I have found convenient to make all features of the same temporal length L. This way, we guarantee that the total number of instances in the full, training, validation, and testing partition are multiples of L. This makes the splitting with hold-out and K-fold easier.

Notice that with this approach I will loose some instances for the training, but it is worthy. To make this loss the less damaging, I will discard the first observations, which are the ones more affected by the padding. In any case, once I have found the best hyperparameters, I will train the final model with all the instances.

Hence, the first thing to do is to find the length of the shortest temporal signal.

# find the minimum signal length (L) of the training dataset
signal_lengths=np.ones(8)
for i in range(8):
    person=training_dataset[i]
    labels=person[1].flatten()
    signal_lengths[i]=len(labels)
    
L=int(signal_lengths.min())

I will be using different sliding window sizes window as 'manual' hyperparameter. Hence, I have defined a function to calculate the running statistics for each person's signals and stack them to obtain a single array (for ALL people), the columns of which are the running statistics (features) and activity labels (target). As I have explained above, the signals are truncated so that their lengths are L. In addition, the subjects are ordered randomly (NOT the observations of each person).

def feat_ext_train(data_set,window):
    # we radomize the order in which the people are chosen
    np.random.seed(123)
    index=np.random.permutation(8)
    ### for the first random person
    # select data
    person=data_set[index[0]]
    # compute running statistics and get activity labels
    run_feat=feat_running(person[0],window)
    act=np.transpose(person[1])
    # select last L elements
    feat=run_feat[-L:,:]
    labels=act[-L:]
    ### repeat for the rest
    for i in index[1:]:
        person=data_set[i]
        run_feat=feat_running(person[0],window)
        act=np.transpose(person[1])
        feat=np.vstack((feat,run_feat[-L:,:]))
        labels=np.vstack((labels,act[-L:]))
    # stack features and labels
    feat=np.column_stack((feat,labels))
    return feat

4) Training phase¶

Model definition¶

As I have explained in the introduction, for the classification step, I have resorted to a gradient boosting decision tree algorithm. The reason to choose this kind algorithm is three-fold: First, gradient boosing implementations are known to be excellent performers in regression and multi-class classification tasks, second, they can partially deal with imbalanced classes (like in this dataset), and, third, it is straight-forward to have access to the feature importance. In particular, I have chosen LightGBM, which is a fast and efficient implementation of gradient boosting.

Hyperparameter tuning¶

Among the many parameters that can be optimized in LightGBM, I decided to perform a hyperparameter tunning with the number of trees (n_estimators) and the learning rate (learning_rate). To find the best hyperparameters, I will use sci-kit optimize BayesSearchCV. As it is stated in the documentation "The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings. In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.". This allows reducing the dimension of the parameter space that must be explored, as BayesSearchCV estimantes which is the best set of new parameters that has to be tried in the next iteration. As metric, I will use the balanced accuracy (BAC) which is convenient for imbalanced datasets.

In addition, as I commented before, I will use the sliding window size as a 'manual' hyperparameter. By manual I mean that I will iterate the window size and, for each iteration, I will extract the temporal features, I will run the hyperparameter tuning and will evaluate the corresponding future performance. I will select that model with the best performance.

So let's fix the model parameter space:

from skopt.space import Integer, Real
from sklearn import metrics

## Model Parameters

# window sizes. [0.5s,1s,2s]*16Hz
win_size=[8,16,32]

# Number of trees
n_estimators = Integer(200,500)
# Shrinkage/learning rate
learning_rate = Real(0.00001,1)

# set parameter space
param_grid_lgb = {'n_estimators': n_estimators,
             'learning_rate': learning_rate}

# create vectors to store best parameters and accuracy for each window size
BACs_skopt=[] # balanced accuracy
estimators_skopt = [] # number of trees
lr_skopt = [] # learning rate

I have applied hold-out (without shuffling) to obtain the train (+validation) and test partitions. In this case, 2 persons for test, and 6 for train. For the hyperparameter tunning, I have resorted to K-fold cross-validation. I have chosen 3 folds so that 2/3 of the train dataset (4 people) goes for training and the rest (2 people) for validation. Again, I do not shuffle to avoid validation data leakage.

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

cv_grid = KFold(n_splits=3, shuffle=False)

Let's give it a spin!

#pip install lightgbm
from skopt import BayesSearchCV
import lightgbm as lgb
from sklearn import metrics

# to store the models
models = []

for w in range(len(win_size)):
    # extract features
    training_set=feat_ext_train(training_dataset,win_size[w])
    # split into training/test sets
    X = training_set[:,:-1]
    y = training_set[:,-1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2/8, shuffle=False)
    
    # initialize the model. Set class_weight="balanced" so that it takes into account that class 1 is less frequent
    np.random.seed(1004)
    clf_lgb = BayesSearchCV(lgb.LGBMClassifier(class_weight="balanced"), 
                    param_grid_lgb,
                    scoring='balanced_accuracy',
                    cv=cv_grid,    
                    n_jobs=-1, verbose=1,
                    n_iter=10
                    )

    # train the model with train partition
    models.append(clf_lgb.fit(X_train,y_train))

    # make prediction on test partition
    y_test_pred_lgb = models[w].predict(X_test)
    
    # Store Metric
    error=metrics.balanced_accuracy_score(y_test, y_test_pred_lgb)
    BACs_skopt.append(error)
    
    # Store best parameters
    params = models[w].best_params_
    estimators_skopt.append(params["n_estimators"])
    lr_skopt.append(params["learning_rate"])

Selection of best model¶

The Balance Accuracy (BAC) and the best model parameters for each of the chosen window sizes are:

import pandas as pd

P=np.vstack((win_size,BACs_skopt,estimators_skopt,lr_skopt))
row_names=("Window size","BAC","n_estimators","learning rate")
P=pd.DataFrame(P.T,columns=row_names)
P[['Window size','n_estimators']] = P[['Window size','n_estimators']].astype(int)
display(P)

By a fractional marging the best model is that trained with a window size of 16, i.e., using the statistics of 1 second of signal. It is quite remarkable that the model achieves above a 95% balanced accuracy. One could suspect that there is overfitting, but the BAC score has been obtained from the test partition, that the model had not seen so far.

Evaluation of performance and feature importance¶

Let's recompute the predictions for the best model and display the confusion matrix:

# index of optimum model
i_opt=np.argmax(BACs_skopt)

# make prediction
y_test_pred = models[i_opt].predict(X_test)  

###################################
#  CONFUNSION MATRIX
###################################
from sklearn.metrics import confusion_matrix

# define labels for confusion matrices
labels=["1","2","3","4","5"]

M=confusion_matrix(y_test, y_test_pred)
# normalize by row
M=M/M.sum(axis=1, keepdims=True)

display(pd.DataFrame(M,index=labels,columns=labels))

In the confusion matrix one can see that the activity that is the worst predicted is activity 1 (running), probably because it is the minority class. In any case, it is correctly classified almost 87% of the times. In contrast, activity 2 (walking) is correctly classified 98% of the times.

As I have used a decission tree based algorithm, the importance of each feature can be retrieved using the method feature_importances_

# which are the most relevant features?
importances=models[i_opt].best_estimator_.feature_importances_

x_values = list(range(len(importances)))

# select colors for bars
from pylab import *
num_feat=4
cmap = cm.get_cmap('Spectral', num_feat)    # PiYG
colors=[]
for i in range(cmap.N):
    rgb = cmap(i)[:3] # will return rgba, we take only first 3 so we get rgb
    for j in range(5):
        colors.append(matplotlib.colors.rgb2hex(rgb))

plt.bar(x_values, importances, color=colors, edgecolor='black', orientation = 'vertical')
plt.text(1, 2300, 'Mean', bbox=dict(facecolor=cmap(0), alpha=0.3))
plt.text(5.5, 2300, 'Median', bbox=dict(facecolor=cmap(1), alpha=0.3))
plt.text(10.5, 2300, 'Variance', bbox=dict(facecolor=cmap(2), alpha=0.3))
plt.text(15.6, 2300, 'Skewness', bbox=dict(facecolor=cmap(3), alpha=0.3))
plt.ylim(0,2500)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.title('Feature Importances')
plt.show()

It can be seen that all the features that I have used are more or less equally important. Perhaps the variance is somewhat more relevant than the rest of statistics when doing predictions on the activity (overall that of signal 1).

5) Final model training and activity prediction¶

Now that there is a pesimistic estimation of the future performance (95% balanced accuracy), I can train the model with all the training data. For that, I will extract the features without truncating the signals to have a length L.

# set running window size to optimal value
window=win_size[i_opt]

# for the first person
person=training_dataset[0]
X=feat_running(person[0],window)
Y=np.transpose(person[1])
# for the rest
for i in range(1,8):
    person=training_dataset[i]
    X=np.vstack((X,feat_running(person[0],window)))
    Y=np.vstack((Y,np.transpose(person[1])))

Y=Y.flatten() # the model asks for this to have a shape (n_samples,)

Initialize and train the model with the best parameters obtained before:

# initialize model
clf_final=lgb.LGBMClassifier(n_estimators=estimators_skopt[i_opt],
                             learning_rate=lr_skopt[i_opt],
                             class_weight="balanced")

# fit model
clf_final.fit(X,Y)

LGBMClassifier(boosting_type='gbdt', class_weight='balanced',
               colsample_bytree=1.0, importance_type='split',
               learning_rate=0.29072039942736305, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=305, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

To make the final prediction, I have first to extract the test features:

# for the first test person
person=target_dataset[0]
X_pred=feat_running(person[0],window)
# for the second test person
person=target_dataset[1]
X_pred=np.vstack((X_pred,feat_running(person[0],window)))

# make prediction
y_pred = clf_final.predict(X_pred)

Visual inspection of the prediction¶

Of course we don't know the labels and cannot determine the accuracy of the prediction, but we can see a representation of the test signals, with the predicted activity coloured as was done in the beginning of the report.

fig, axs = plt.subplots(5,2,figsize=(18,9))
for p in range(2):
    for i in range(5):
        # test person data
        person=target_dataset[p]
        y = person[0]
        y = y[i]
        x = np.arange(0,len(y))
        if p==0:
            col_ind = y_pred[:len(y)]
        else:
            col_ind = y_pred[-len(y):]
        
        points = np.array([x, y]).T.reshape(-1, 1, 2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        # Create a continuous norm to map from data points to colors
        norm = plt.Normalize(1, 5)
        lc = LineCollection(segments, cmap='jet', norm=norm)
        # Set the values used for colormapping
        lc.set_array(col_ind)
        lc.set_linewidth(1)
        line = axs[i,p].add_collection(lc)
        axs[i,p].set_xlim(x.min(), x.max())
        axs[i,p].set_ylim(1.1*np.min(y), 1.1*np.max(y))
        if i==0:
            axs[i, p].set_title('Test Person ' + str(p+1), fontsize=14)
        axs[i,p].set_ylabel('Signal ' + str(i+1), fontsize=14)

plt.show()

In general, it seems that the prediction is not disastrous. In fact its quite accurate, overall for the Test Person 1. Test person 2 seems to show a noisier set of signals, as compared with the rest of persons both in train or test. In fact, it is difficult to determine if this person even performed activity 1 (running, in dark blue). Accordingly, for this person the accuracy seems to diminish.

Save prediction¶

First, store the predictions in the adequate format.

# load test signals
target_dataset=HAR_database['database_test']
# for test person 1
person_1=target_dataset[0]
# store signals and signal temporal length
X=person_1[0]
signal_length=len(X[0])
# store the label prediction as Y
Y=y_pred[:signal_length]
# store X and Y together
test1=[X,Y[np.newaxis,:]]

# repeat for test person 2
person_2=target_dataset[1]
X=person_2[0]
signal_length=len(X[0])
Y=y_pred[-signal_length:]
test2=[X,Y[np.newaxis,:]]

# concatenate both test persons
prediction=np.concatenate(([test1],[test2]))

Finally, create a dictionary with the signals and labels for both the train and test subjects, and save to a .mat file

HAR_prediction = {
    "database_training": training_dataset,
    "database_test" : prediction
    }

# save to .mat file
savemat('predictions_100417976.mat', HAR_prediction)

	1	2	3	4	5
1	0.868852	0.114754	0.016393	0.000000	0.000000
2	0.002751	0.985118	0.011131	0.001001	0.000000
3	0.011620	0.017430	0.947595	0.023240	0.000116
4	0.000000	0.013438	0.005913	0.963019	0.017631
5	0.000000	0.014576	0.009016	0.013824	0.962585

	Window size	BAC	n_estimators	learning rate
0	8	0.942635	471	0.237052
1	16	0.952394	305	0.290720
2	32	0.938435	471	0.237052