%\chapter{Materials and Methods}
%\label{chp:mat}
\chapter{Scientific background}
\label{chp:background}
\section{Communication between Neurons and Machines}
\subsection{Brain-Computer-Interfaces}
The idea of BCIs began to spread in the 1970s when Vidal published his paper (\cite{Vidal73}).\\
The connection between brain and computer can help humans in different ways. A wide field of possibilities already exists, ranging from implants that restore hearing and sight in one direction to the control of machines by brainwaves or communication despite locked-in syndrome in the other. However, most applications require extensive training and are sometimes quite far from natural behavior. Binary decisions, for example, are usually made through an excited or relaxed mental state, which can easily be detected in brain activity.\\
\subsubsection{Methods of recording}
The first approaches used invasive BCIs, initially in animals (rodents and monkeys) and later also in humans. Invasive BCIs in humans were mostly implanted when the patient was undergoing brain surgery for another reason, such as the treatment of epilepsy. A problem of invasive BCIs is the need to cut through skull and dura mater, which can lead to infections and severe brain damage.\\
An improvement came with less invasive BCIs such as electrocorticography (ECoG), where the electrodes are placed inside the skull but outside the dura, which greatly decreases the risk of infection.\\
Measuring outside the skull entails even less risk; the dura and skull, however, lower the quality of the signal considerably. With some improvements EEG reaches a spatial resolution of 2-3 cm (cf. \cite{Babiloni01}). This is poor compared to the single neurons one can observe with invasive methods. However, since activity usually increases in a whole area and not only in a single neuron, EEG meets the requirements here.\\
In addition, EEG is much cheaper and easier to use than other techniques. There is no need for surgery (unlike for invasive methods) and the hardware can be bought for less than 100\euro{}, while functional Magnetic Resonance Imaging (fMRI) hardware costs far above 100,000\euro{}. This is one of the reasons EEG is far more available than other techniques. There are some newer inventions, but not as much work has been done with them, which is why they are not as well known and as widely distributed as EEG.\\
Another advantage of EEG is that the device is head-mounted: the user may move during the measurement without a large impact on the tracking of activity. This is essential for any BCI used in daily life.
\subsection{EEG}
When using electroencephalography (EEG), one measures the electrical fields on the scalp that are generated by the activity of neurons in the brain. These measurements allow some interpretation of what is happening inside the skull. Here the recorded signals are used directly to train an SVM or as predictors for regression.
The foundation of EEG was laid in 1875 when Richard Caton found electrical activity around the brains of monkeys. After the method had been tested on animals, the first human EEG was recorded by Hans Berger in Jena in 1924. He also coined the term electroencephalography and is regarded as the inventor of EEG.
The frequencies typically examined for movement prediction in EEG are about 8-24 Hz (\cite{Blokland15},\cite{Ahmadian13},\cite{Wang09}).
EEG is often used for non-invasive BCIs because it is cheap and easier to use than e.g. fMRI. The electrodes have to be spread over the scalp. To allow for comparability, there are standardized placement schemes, which also define a naming convention.
\subsubsection{10-20 system}
In this standard, adjacent electrodes are placed either 10\% or 20\% of the total front-back or left-right distance apart. This standardization also makes it possible to name each electrode or rather its position. This is done with capital letters for the lobes (Frontal, \qq{Central}, Parietal, Occipital and Temporal) and numbers for the specific place over the lobe. Even numbers are on the right side of the head, odd numbers on the left; larger numbers are closer to the ears, lower numbers closer to the midline. The number refers to the distance from the midline: $$\left\lceil\frac{x}{2}\right\rceil\cdot \frac{d}{10},$$ where $x$ is the number and $d$ the diameter of the scalp. Electrodes on the midline are named with a lower case $z$, e.g. $Cz$.\\
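For example, electrode $C3$ lies $\left\lceil\frac{3}{2}\right\rceil\cdot\frac{d}{10}=\frac{d}{5}$, i.e. 20\% of the total left-right distance, to the left of $Cz$ along the central line.\\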
Electrodes between two lobes (10\% instead of 20\% distance) are named with both adjacent lobes (anterior first) e.g. $FCz$ (between frontal and central lobe).\\
The naming convention according to the 10-20 system is shown in figure~\ref{fig:10-20}.
\begin{figure}[!p]
\centering
\includegraphics[width=\textwidth]{pictures/eeg_electrodes_10-20.png}
\caption{Naming according to 10-20 system\\(Author: Marius 't Hart,\\License: https://creativecommons.org/licenses/by-sa/3.0/nl/deed.en\_GB)}
\label{fig:10-20}
\end{figure}
\subsubsection{Frequency bands}
EEG waves can be divided into several bands according to their frequencies. Usually this also corresponds to the form of the waves and allows insights into the mental state, from resting to highly aroused. According to \cite{Gerrard07} the following bands can be distinguished:
\begin{itemize}
\item Delta: 1-4 Hz
\item Theta: 4-7 Hz
\item Alpha: 7.5-12.5 Hz (depending also on age)
\item Beta: 13-20 Hz
\end{itemize}
There are different definitions of the limits of the bands, but for a rough estimation these limits are suitable. For more exact results an analysis of wave patterns would be necessary.
In a frequency range similar to that of the alpha band, mu waves are measured. They are associated with mirror neurons in the motor cortex; the arc-shaped mu rhythm is suppressed while the subject is moving, and this suppression can be detected in the EEG.
\subsection{Low Frequencies}
In the 2000s, new techniques began to be used to record ultrafast and infraslow brainwaves (above 50 Hz and below 1 Hz). These were found to be of some importance for movement prediction (cf. \cite{Vanhatalo04}).\\
Low-frequency components were also found to carry information for movement prediction, for example by \cite{Liu11} and \cite{Antelis13}. \citeauthor{Antelis13} found correlations between hand movement and the low-frequency signal of $(0.29,0.15,0.37)$ in the three dimensions, respectively.\\
\cite{Lew14} state that low frequencies are mainly involved in spontaneous, self-induced movement and can be found before the movement starts. They may therefore offer a way to lower the reaction time of neuroprostheses, for example.
\subsection{EMG}
Muscles contract after a signal arriving via an efferent nerve activates them. The contraction of muscles also produces measurable electrical activity, which is used for electromyography (EMG). There are intramuscular applications of EMG, but here only surface EMG is used.\\
From surface EMG the activity of muscles can be estimated, although not very precisely without repetition. Since the muscles used for arm movements in this setting are quite large, EMG allows relatively precise estimates of the underlying muscle activity.
EMG was mainly developed for diagnostic tasks. However, it is also used in research to track muscle activity, as is done here.
\section{Signal Processing}
\subsection{Power estimation}
\subsubsection{EEG}
One way of using data from EEG is to analyze the occurring frequencies and their respective power.\\
To obtain these from the continuous signal, windows can be used, within which the signal is finite and a Fourier transform can be estimated. The quantity estimated here is the power spectral density (PSD).
\subsubsection{Power spectral density estimation}
The PSD is the power per frequency, where power refers to the square of the amplitude.\\
If the Fourier transform exists, the PSD can be calculated from it, e.g. as a periodogram. If not, it has to be estimated. One way to do so is via the fast Fourier transform (FFT); another - used here - is parametrization with an autoregressive (AR) model. For this, one assumes a linear relationship between $p$ consecutive samples and the following one. This leads to an equation with only $p$ parameters, which can be estimated in different ways. Here Burg's method (\matlab{}'s \texttt{pburg}) is used.\\
In Figure~\ref{fig:psd} we see the difference between autoregressive \texttt{pburg} and periodogram \texttt{pwelch} PSD estimation.
\begin{figure}
\includegraphics[width=\textwidth]{pictures/psd.png}
\caption{PSD with FFT (top) or an autoregressive model (bottom) respectively. The signal was unfiltered EEG data from channel \textit{Cz}, second run of the second session with subject AO}
\label{fig:psd}
\end{figure}
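The two estimates compared in Figure~\ref{fig:psd} could, as a minimal sketch, be obtained as follows (hypothetical variable names; \texttt{eeg} is assumed to hold one channel of the recorded signal and \texttt{fs} its sampling rate):
\begin{verbatim}
order = 250;                                    % model order used in this thesis
nfft  = 1024;
[pxxBurg,  f] = pburg(eeg, order, nfft, fs);    % parametric (AR) estimate
[pxxWelch, f] = pwelch(eeg, [], [], nfft, fs);  % periodogram-based estimate
plot(f, 10*log10(pxxBurg), f, 10*log10(pxxWelch));
xlabel('Frequency (Hz)'); ylabel('PSD (dB/Hz)');
legend('pburg', 'pwelch');
\end{verbatim}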
\subsubsection{Burg's method - Autoregressive Model}
\label{mat:burg}
Burg's method (\cite{Burg75}) is a special case of parametric PSD estimation. It interprets the Yule-Walker equations as a least squares problem and iteratively estimates solutions.\\
According to \cite{Huang14} Burg's method works well in cases that require high resolution.\\
The Burg and Levinson-Durbin algorithms are examples of PSD estimation which use an autoregressive model instead of the fast Fourier transform. The approach is described well by \cite{Spyers98}. The idea is to lower the number of parameters determining the production of the signal. The number of parameters used is called the \textit{model order} (250 in this thesis, lower in most applications). These parameters are estimated from the original data. For PSD estimation, the modeled values are used, which allows an easier transformation since the data is generated by a known process.\\
Often the Rational Transfer Function Modeling is used in the general form of $$x_n=-\sum\limits_{k=1}^p a_kx_{n-k}+ \sum\limits_{k=0}^q b_ku_{n-k},$$ where $x_n$ is the output, $u_n$ the input. $a,b$ are the system parameters which have to be estimated from original data. If there is unknown input the output can only be estimated which simplifies the formula as follows $$\hat{x}_n=-\sum\limits_{k=1}^p a_k\hat{x}_{n-k}.$$
Estimating the parameters is done by minimizing the forward prediction error $E$: $$E=\frac{1}{N}\sum\limits_{i=1}^N \left(x_i-\hat{x}_i\right)^2$$
The minimum has zero slope and can be found by setting the derivative to zero:$$\frac{\partial E}{\partial a_k}=0,\text{ for } 1\le k\le p$$
This yields a set of equations called \emph{Yule-Walker-Equations} (cf. \cite{Yule27},\cite{Walker31}).\\
Using forward and backward prediction the parameters ($a_k$) are then estimated based on the Yule-Walker-Equations.
\subsection{Filtering}
Filtering of the recorded EEG signal is necessary for different reasons. First, there are power line artifacts at 50 Hz. These can be filtered out with bandstop filters.\\
Secondly, we need to concentrate on the interesting frequencies (for classical EEG 1-50 Hz). This is done by applying lowpass or highpass filters respectively. This is necessary because the PSD at lower frequencies is a lot higher than at higher frequencies. The relation $$PSD(f)=\frac{c}{f^\gamma}$$ holds for constants $c$ and $\gamma$ (\cite{Demanuele07}).\\
The Butterworth filter (\cite{Butterworth30}) was invented by Stephen Butterworth in 1930. Its advantage was uniform sensitivity to all wanted frequencies. In comparison to other filters, Butterworth's is smoother because it is flat in the pass band and monotonic over all frequencies. This, however, leads to a less steep roll-off, meaning a higher portion of frequencies beyond the cutoff passes the filter.
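As a small illustration, a second order Butterworth lowpass and a bandstop against power line noise could be designed and inspected as follows (the sampling rate \texttt{fs} is assumed here for illustration only):
\begin{verbatim}
fs = 500;                                            % assumed sampling rate in Hz
[bLow,  aLow]  = butter(2, 49/(fs/2), 'low');        % 2nd-order lowpass at 49 Hz
[bStop, aStop] = butter(2, [48 52]/(fs/2), 'stop');  % 50 Hz bandstop
freqz(bLow, aLow, 1024, fs);   % flat passband, monotonic roll-off beyond cutoff
\end{verbatim}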
\section{Synergies}
\label{back:synergies}
The movement of the arm (and other parts of the body) is under-determined, meaning that for a given trajectory different muscle contractions are possible. One idea of how this problem could be solved by our nervous system is synergies. Proposed by Bernstein in 1967 (\cite{Bernstein67}), they describe the goal of the movement (e.g. the trajectory) instead of controlling single muscles. This would mean, however, that predicting the activity of single muscles from EEG is harder than predicting a synergy, which in turn determines the contraction of the muscles.\\
Evidence for the use of synergies in the nervous system was found e.g. by Bizzi et al. (\cite{Bizzi08}) and Byadarhaly et al. (\cite{Byadarhaly12}). They also showed that synergies meet the necessary requirements to be able to build predictable trajectories.\\
Synergies are usually obtained from the EMG signal through a principal component analysis (PCA, cf. \ref{mat:pca}), non-negative matrix factorization (NMF, cf. \ref{mat:nmf}) or autoencoders (a form of neural network, cf. \ref{mat:autoenc}).
\subsection{Principal Component Analysis}
\label{mat:pca}
Principal Component Analysis (PCA) is probably the most common technique for dimensionality reduction. The idea is to use the dimensions with the highest variance in order to keep as much information as possible in the lower-dimensional space.\\
PCA was invented in 1901 by Karl Pearson (\cite{Pearson01}). The intention was to get the line closest to a set of data. This line also is the one that explains most variance.\\
The PCA of data can be computed in different ways. One is to calculate the eigenvectors of the covariance matrix. The principal component is the eigenvector with the highest eigenvalue. The other eigenvectors follow, ordered by their eigenvalues.\\
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{GaussianScatterPCA.jpg}
\caption{Gaussian Scatter with both eigenvectors, the principal component (long arrow) explaining most, the other least variance}
\label{fig:pca}
\end{figure}
In Figure~\ref{fig:pca} we see the eigenvectors of the data. The longer vector is the principal component, the shorter one is orthogonal to it and explains the remaining variance. The second component here also is the component which explains least variance, since most variance is orthogonal to it.
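As a minimal sketch (toy data, not the EMG used in this thesis), the principal components can be obtained from the eigen-decomposition of the covariance matrix or directly with \matlab{}'s \texttt{pca}:
\begin{verbatim}
rng(0);                                     % for reproducibility
X  = randn(1000, 2) * [2 0.5; 0 0.3];       % toy data with correlated dimensions
Xc = bsxfun(@minus, X, mean(X));            % center the data
[V, D]   = eig(cov(Xc));                    % eigenvectors of the covariance matrix
[~, idx] = sort(diag(D), 'descend');        % order by explained variance
V        = V(:, idx);                       % first column = principal component
coeff    = pca(X);                          % same directions (up to sign)
score1   = Xc * V(:, 1);                    % data projected onto the first component
\end{verbatim}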
\subsection{Non-Negative Matrix Factorization}
\label{mat:nmf}
In some applications non-negative matrix factorization (NMF) is preferred over PCA (cf. \cite{Lee99}). This is because it does not learn eigenvectors but decomposes the input into parts which may all be present in the input. Seen as a matrix factorization, PCA yields matrices of arbitrary sign, where one represents the eigenvectors and the other the specific mixture of them. Because entries may be negative, cancellation is possible. This leads to an unintuitive representation in the first matrix.\\
NMF in contrast only allows non-negative entries. This leads to \qq{what is in, is in}, meaning no cancellation, which in turn yields more intuitive matrices. The first contains possible parts of the data, the second lists how strongly they are represented in the current input.\\
The formula for NMF is
$$Input\approx \mathbf{WH}$$
where Input is $n\times m$, $\mathbf{W}$ is $n\times r$ and $\mathbf{H}$ is $r\times m$ with $r\ll\min\{m,n\}$. So $\mathbf{WH}$ is only an approximation of the input, however with significantly lower dimension - the number of synergies used.\\
The factorization is learned with an update rule that may be chosen. The \matlab{} default, an alternating least squares (ALS) algorithm, is used here. It can be described as in Algorithm~\ref{alg:als} (cf. \cite{Berry07}).\\
\begin{algorithm}
\begin{algorithmic}
\State initialize $\mathbf{W}$ randomly
\State $i\gets0$
\While{$i <$ numberOfIterations}
\State Solve $\mathbf{W^TWH=W^TA}$ for $\mathbf{H}$
\State set all negative values in $H$ to zero
\State Solve $\mathbf{HH^TW^T=HA^T}$ for $\mathbf{W}$
\State set all negative values in $\mathbf{W}$ to zero
\State $i\gets i+1$
\EndWhile
\State\Return $\mathbf{W}$
\end{algorithmic}
\caption{Alternating Least Squares in NMF}
\label{alg:als}
\end{algorithm}
This version uses some simplifications (such as setting negative values to zero to enforce non-negativity); a slightly improved form is used in \matlab{}.\\
ALS usually converges faster and with a better result than multiplicative update algorithms which would be the alternative in \matlab{}.
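In \matlab{} this factorization is available as \texttt{nnmf}; a minimal usage sketch (hypothetical non-negative data matrix \texttt{A}, e.g. waveform-length EMG):
\begin{verbatim}
r = 3;                                     % number of synergies
[W, H] = nnmf(A, r, 'algorithm', 'als');   % alternating least squares (the default)
approx = W * H;                            % low-rank approximation of A
\end{verbatim}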
\subsection{Autoencoders}
\label{mat:autoenc}
Autoencoders are a specific type of artificial neural network (ANN). They work like typical ANNs by adjusting weights between the layers; however, they do not predict an unknown output but reconstruct their own input. The interesting part is the size of the hidden layer, which has to be smaller than the input layer. In the hidden layer the information of the input can then be found in a condensed form (e.g. synergies instead of single muscle activity).\\
Autoencoders have been successfully used by Spüler et al. to extract synergies from EMG (\cite{Spueler16}). Especially with a lower number of synergies, autoencoders should perform better than PCA or NMF because linear models fail to discover the agonist-antagonist relations that are typical for muscle movements. These, however, can be detected by autoencoders which allows for good estimations with half the synergies.\\
An autoencoder's input layer has as many neurons as there are input dimensions (e.g. one for each EMG channel). The number of hidden layer neurons may be varied, here usually 3 are used. The output layer is of the same size as the input layer. This autoencoder is shown in Figure~\ref{fig:autoenc}.
\begin{figure}
\centering
\input{autoencoder.tikz}
\caption{Autoencoder with 6 input and 6 output neurons and a hidden layer of size 3}
\label{fig:autoenc}
\end{figure}
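Assuming the Neural Network Toolbox, such an autoencoder could be trained as in the following sketch (hypothetical variable names; the columns of \texttt{emg} are samples, its rows the EMG channels):
\begin{verbatim}
hiddenSize = 3;                               % number of synergies
autoenc    = trainAutoencoder(emg, hiddenSize);
synergies  = encode(autoenc, emg);            % hidden-layer activity, 3-by-samples
emgHat     = decode(autoenc, synergies);      % reconstruction from the synergies
\end{verbatim}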
\section{Machine Learning}
\subsection{Support-Vector Machines}
Support-vector machines (SVMs) are used for classification of data. This is done by separating data in feature space by a hyperplane. Additional data is classified with respect to the side of the hyperplane that it is located on.\\
This hyperplane is considered optimal if the margins on both sides (the distance to the nearest data points) are maximal, to allow for as much noise as possible. This means the separating hyperplane can be constructed from the nearest points (3 in 2-D) of both classes. These points, however, may differ between attempts, as a different angle in some dimension may make different points the nearest (cf. Figure~\ref{fig:hyperplanes}).\\
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{pictures/hyperplanes.png}
\caption{Two sets (red and blue) separated by possible hyperplanes where the orange one has larger margins}
\label{fig:hyperplanes}
\end{figure}
The hyperplane is defined as $\vec{w}\cdot \vec{x}-b=0$ while the hyperplanes tangential to the classes are $\vec{w}\cdot \vec{x}-b=1$ or $\vec{w}\cdot \vec{x}-b=-1$ respectively (see Figure~\ref{fig:svm}). The margin is $\frac{2}{||\vec{w}||}$. The margin has to be maximized while all data is classified correctly; since maximizing $\frac{2}{||\vec{w}||}$ is equivalent to minimizing $||\vec{w}||$, the problem can be formulated as:
$$\min\limits_{\vec{w},b} ||\vec{w}||\ s.t.\ y_i(\vec{w}\cdot\vec{x_i}-b)\geq 1,\ i=1,\dots,n,$$
where $y_i$ is the class (+1 or -1) of the corresponding data.
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{pictures/svm.png}
\caption{Margins and hyperplane for an SVM (Figure by Cyc and Peter Buch)}
\label{fig:svm}
\end{figure}
This prototype of an SVM is only able to separate two classes of linearly separable data. For other data some improvements are necessary.
\subsubsection{Multiclass SVM}
If there are more than two classes to separate, this can be done with SVMs in different ways. One approach is \emph{one-vs-one}, meaning each pair of classes is compared with its own SVM, which votes for one of the two classes. This is done for all pairs and the class with the most votes is picked.\\
Another approach is \emph{one-vs-all}, where every class is compared against all remaining ones. Here scores are used to determine which class matches best, i.e. in which class the data is farthest from the separating hyperplane.
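Both strategies are available in \matlab{} via \texttt{fitcecoc}; a minimal sketch with hypothetical data \texttt{X} (samples by features) and labels \texttt{Y}:
\begin{verbatim}
mdlOvO = fitcecoc(X, Y, 'Coding', 'onevsone');   % one-vs-one (the default coding)
mdlOvA = fitcecoc(X, Y, 'Coding', 'onevsall');   % one-vs-all
pred   = predict(mdlOvO, X);                     % predicted class labels
\end{verbatim}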
\subsubsection{Soft-margin SVM}
If the data is not separable, soft-margin SVMs have to be used. They allow misclassifications but try to minimize them. The problem can be formulated as
$$\text{Minimize }\frac{1}{N}\sum\limits_{i=1}^N\max\{0,1-y_i(\vec{w}\cdot\vec{x_i}-b)\}+\lambda ||\vec{w}||^2,$$
where $\lambda$ is the parameter that adjusts the trade-off between large margins and wrong classifications (if $\lambda$ has a higher value, there is more weight on large margins).
\subsubsection{Kernel trick}
Data like that in figure~\ref{fig:kernel} are not \emph{linearly} separable. The idea here is to apply the \emph{kernel trick}.\\
The kernel trick consists of mapping the input data to a higher-dimensional space and separating it there. In many applications this mapping would be computationally expensive; for SVMs, however, we only need the inner products, e.g. $$\varphi(\vec{x_i})^T\varphi(\vec{x_j})=K(\vec{x_i},\vec{x_j}),$$ where $\varphi$ is the mapping, $x_i$ are data points and $K$ is the kernel.
We do not have to compute - or even know - $\varphi$, since $K$ only depends on the original input. This is far less costly and can be done quite fast, while the SVM is solved in the higher-dimensional space as before.
\begin{figure}[p]
\input{pictures/kernel.tikz}
\caption{Data separable with the kernel trick; left in the original space with features $x$ and $y$, right in the space where the distance from the origin is shown and the data is linearly separable}
\label{fig:kernel}
\end{figure}
Common kernels are polynomial, Gaussian and hyperbolic kernels, where an example for a polynomial kernel would be $K(x_i,x_j)=(x_i^Tx_j)^3$.
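In \matlab{}, a soft-margin SVM with such a kernel could be trained as follows (a sketch with hypothetical data \texttt{X} and labels \texttt{Y} in $\{-1,+1\}$):
\begin{verbatim}
mdlRbf  = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'BoxConstraint', 1);
mdlPoly = fitcsvm(X, Y, 'KernelFunction', 'polynomial', 'PolynomialOrder', 3);
[label, score] = predict(mdlRbf, X);   % score: signed distance to the hyperplane
\end{verbatim}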
\subsection{Regression}
Regression is the idea of finding $\beta$ so that $$y= X\beta+\epsilon$$ where X is the $n\times p$ input matrix and y the $n\times 1$ output vector of a system. Using this $\beta$, from given input, the output can be predicted.\\
There are different ways to find this $\beta$. One common approach is the \emph{ordinary least squares}-Algorithm. $$\hat{\beta}=\arg\min\limits_{b\in\mathds{R}^p} \left(y-Xb\right)^T\left(y-Xb\right),$$ meaning the chosen $\hat\beta$ is that $b$ which produces the lowest error since $Xb$ should - apart from noise $\epsilon$ - be the same as $y$.\\
Choosing $\beta$ like this brings the risk of overfitting. The training data is matched well, but new data points are predicted poorly. This problem is addressed with RIDGE-Regression.
\subsubsection{RIDGE-Regression}
\label{mm:ridge}
Instead of minimizing only the error in RIDGE-Regression (also called \emph{Tikhonov regularization}) the size of vector $\beta$ is also minimized: $$\hat{\beta}=\arg\min\limits_{b\in\mathds{R}^p} \left(y-Xb\right)^T\left(y-Xb\right)+\lambda b^Tb.$$
$\lambda$ decides how big the influence of the size of $\beta$ or $b$ respectively is.
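The minimization has the closed-form solution $\hat{\beta}=(X^TX+\lambda I)^{-1}X^Ty$, which could be computed as in the following sketch (hypothetical variable names; the Statistics Toolbox function \texttt{ridge} provides an equivalent, standardized variant):
\begin{verbatim}
lambda  = 10;                                      % regularization strength
p       = size(X, 2);
betaHat = (X' * X + lambda * eye(p)) \ (X' * y);   % closed-form ridge solution
yHat    = X * betaHat;                             % predicted output
% b = ridge(y, X, lambda, 0);                      % Statistics Toolbox equivalent
\end{verbatim}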
\subsection{Cross Validation}
$k$-fold cross validation means splitting the data into $k$ equally sized parts, training the model on $k-1$ parts and validating on the remaining one (see Figure~\ref{fig:crossValidation}).
\begin{figure}[p]
\includegraphics[width=\textwidth]{pictures/K-fold_cross_validation_EN.jpg}
\caption{Principle of k-fold cross validation (here $k=4$)(picture by Joan.domenech91 and Fabian Flöck)}
\label{fig:crossValidation}
\end{figure}
This is done to obtain a measure of how good the fit is. When using cross validation, all the data is used for training and all of it is predicted once. This reduces effects of randomness.\\
The reason for e.g. 10-fold cross validation is that a lot of data ($9/10$) is left for training the predictor. If there is enough data, one may want to use 2-fold cross validation only, to lower computation times.
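A plain (non-nested) 10-fold cross validation could, for example, look like the following sketch (hypothetical feature matrix \texttt{X} and numeric labels \texttt{Y}):
\begin{verbatim}
cvp = cvpartition(numel(Y), 'KFold', 10);      % random split into 10 parts
acc = zeros(cvp.NumTestSets, 1);
for i = 1:cvp.NumTestSets
    mdl    = fitcsvm(X(training(cvp, i), :), Y(training(cvp, i)));
    pred   = predict(mdl, X(test(cvp, i), :));
    acc(i) = mean(pred == Y(test(cvp, i)));    % accuracy on the held-out part
end
meanAccuracy = mean(acc);                      % fit measure over all 10 folds
\end{verbatim}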
Cross validation can also be nested. This is necessary, for example, if it is used for parameter optimization as well as to measure fitness, or if there is more than one parameter. When nesting, the computation time grows exponentially with the number of nested cross validation steps (see Algorithm~\ref{alg:cv}).
\begin{algorithm}
\begin{algorithmic}
\State split data into 10 parts data$\{0\},\dots,$data$\{9\}$
\For{$i\gets 0,\dots,9$}
\State remainingData $\gets$ data$\{0,\dots,9\}\setminus$ data$\{i\}$
\For{$k\gets0,\dots,n-1$}\Comment $n$ is the number of parameter values tested
\For{$j\gets0,\dots,9$}
\State remainingData' $\gets$ remainingData $\setminus$ remainingData$\{j\}$
\State train with remainingData' and parameter $k$
\State validate on remainingData$\{j\}$ $\rightarrow$ fit$_j$
\EndFor
\State fit for parameter $k$ is the mean of fit$_j$
\EndFor
\State choose the best parameter $k$
\State train with remainingData and the best parameter
\State validate on data$\{i\}$ $\rightarrow$ fit$_i$
\EndFor
\State the overall fit is the mean of fit$_i$
\end{algorithmic}
\caption{Nested 10-fold Cross Validation with parameter optimization}
\label{alg:cv}
\end{algorithm}
\section{Evaluation Methods}
\subsection{Confusion Matrix}
\label{mm:cm}
The confusion matrix is a visualization of classifications. In it, for every class, the number of samples classified as each class is shown. This is interesting, as it can show biases and give a feeling for which classes are similar with respect to the features.\\
In the 2-class case the well-known table of true and false positives and negatives (Table~\ref{tab:tptnftfn}) is a confusion matrix. From it we can derive sensitivity and specificity as follows:
$$\text{sensitivity}=TP/(TP+FN)$$
$$\text{specificity}=TN/(TN+FP)$$
\begin{table}
\centering
\begin{math}
\begin{array}
{c||c|c}
&\text{predicted }\true&\text{predicted }\false\\\hline\hline
\text{is }\true& TP & FN\\\hline
\text{is }\false& FP & TN
\end{array}
\end{math}
\caption{2D confusion matrix}
\label{tab:tptnftfn}
\end{table}
For more than two classes \matlab{} uses color-coded maps as seen in Figure~\ref{fig:exampleCM}. In this thesis scaled confusion matrices are used, where each row adds up to 1.
\begin{figure}
\includegraphics[width=\textwidth]{pictures/results/cmEEGfull.png}
\caption{Example for a color coded confusion matrix with 5 classes}
\label{fig:exampleCM}
\end{figure}
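A scaled confusion matrix as used here could, as a sketch, be computed as follows (hypothetical label vectors; for the two-class case the positive class is assumed to be listed first):
\begin{verbatim}
cm       = confusionmat(yTrue, yPred);         % rows: true class, columns: predicted
cmScaled = bsxfun(@rdivide, cm, sum(cm, 2));   % each row adds up to 1
imagesc(cmScaled); colorbar;                   % color-coded map
% two-class case:
sensitivity = cm(1,1) / (cm(1,1) + cm(1,2));   % TP / (TP + FN)
specificity = cm(2,2) / (cm(2,2) + cm(2,1));   % TN / (TN + FP)
\end{verbatim}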
\subsection{ANOVA}
Analysis of Variance (ANOVA) is a way of checking if there is a main effect of a variable.\\
The hypotheses tested are that all group means are equal ($H_0$) or that they are not ($H_1$). To check these, ANOVA compares the deviation of the group means from the overall mean to the deviation within the groups. If a lot of the variance in the data can be explained by the groups (meaning the within-group variance is lower than the variance between groups), it is quite likely that the proposed groups have different means.\\
Whether this is significant is decided based on the $p$-value, which represents the probability of observing an at least as extreme ratio of between-group to within-group variance if $H_0$ were true. $H_0$ is rejected if $p$ is lower than a defined threshold (often $0.05$, $0.01$ or $0.001$).
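In \matlab{} a one-way ANOVA can be computed with \texttt{anova1}; a minimal sketch with hypothetical variables (e.g. one correlation value per session and a grouping variable):
\begin{verbatim}
% corrs: one value per session, groups: factor level of each session
p = anova1(corrs, groups);    % p below the threshold -> reject H0 of equal means
\end{verbatim}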
\subsection{Boxplot}
To plot data and show their distribution boxplots are used.
A boxplot contains information about the median (red line), 0.25 and 0.75 quantiles (ends of the box) and about the highest and lowest values that are not classified as outliers.\\
A data point $y$ is classified as outlier if $y > q_3+1.5\cdot(q_3-q_1)$ or $y < q_1-1.5\cdot(q_3-q_1)$, where $q_1,q_3$ are the first and third quartile (which are also defining the box).
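This is the same rule \matlab{}'s \texttt{boxplot} applies by default; as a sketch:
\begin{verbatim}
q     = quantile(y, [0.25 0.75]);                    % first and third quartile
iqrY  = q(2) - q(1);
isOut = y > q(2) + 1.5*iqrY | y < q(1) - 1.5*iqrY;   % outlier rule from above
boxplot(y);                                          % median, box, whiskers, outliers
\end{verbatim}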
\chapter{Materials and Methods}
\label{chp:mat}
All the processing was done on Ubuntu \texttt{14.04 / 3.19.0-39} with \matlab{} \texttt{R2016a (9.0.0.341360) 64-bit (glnxa64) February 11, 2016}.
\section{Experimental design}
\label{mm:design}
The data used for this work was mainly recorded by Farid Shiman, Nerea Irastorza-Landa, and Andrea Sarasola-Sanz for their work (\cite{Shiman15},\cite{Sarasola15}). I was allowed to use it for further analysis.\\
There were 9 right-handed subjects with an average age of 25 (variance 6.67, minimum 20, maximum 28). Three female and 6 male subjects were tested. All the tasks were performed with the dominant right hand.\\
The task was a center-out reaching task to one of four targets (see Figure~\ref{fig:experimentalDesign}), while 32-channel EEG, 6-channel%
\footnote{\texttt{'AbdPolLo', 'Biceps', 'Triceps', 'FrontDelt', 'MidDelt'} and \texttt{'BackDelt'} were recorded for every subject, others only in some. Only the 6 channels tracked in every session were used.} %
surface EMG and 7-DOF kinematics were tracked.
\begin{figure}
\centering
\includegraphics{experimentalDesign.jpg}
\caption{Center-out reaching task with four color-coded targets (picture from \cite{Shiman15})}
\label{fig:experimentalDesign}
\end{figure}
Of the kinematic information tracked only position ($x,y$) and angle ($\theta$, rotation around $z$-axis) of the hand are used.\\
Only complete sessions were used for these analyses to ensure better comparability.\\
One session consists of 5 runs with 40 trials each. The trials are separated by resting phases of varying length (2-3s, randomly assigned). Each trial is a grasp to one of four targets and begins with an auditory cue specifying the random but equally distributed target for this trial. This leads to 50 reaches to the same target each session.\\
After the auditory cue the participants should \qq{perform the movement and return to the starting position at a comfortable pace but within 4 seconds} (\cite{Shiman15}).\\
For each subject there are 4 to 6 sessions, each recorded on a different day. All in all there are 255 runs in 51 sessions. Each session is analyzed independently as one continuous trial.
\section{Data Preprocessing}
\subsection{Loading of data}
The data recorded with BCI2000 (\cite{Schalk04}) can be loaded into \matlab{} with a specific \texttt{.mex} file. The corresponding \texttt{.mex} files are available precompiled from BCI2000 for several platforms (Windows, macOS, Linux).\\
The signal, together with the corresponding state data and the parameters, is loaded as shown in Algorithm~\ref{alg:load_bcidat}.
\begin{algorithm}
\begin{algorithmic}
\State [signal, states, params] = load\_bcidat('dataFile.dat','-calibrated');
\State\Comment loading using microvolts
\State [signal, states, params] = load\_bcidat('dataFile.dat');\Comment integers
\end{algorithmic}
\caption{Usage of \texttt{load\_bcidat}}
\label{alg:load_bcidat}
\end{algorithm}
\subsubsection{Signal}
The signal is loaded as a matrix of 41 channels (see Table~\ref{tab:channelNames}). All the values are integers corresponding to the voltage and can also be loaded as floating point values representing microvolts. Since the representation should not make any difference when analyzing the spectrum, the smaller representation as integers is used here.
\begin{table}
\centering
\begin{tabular}{c|c|l}
Number & Name &\ Meaning\\\hline
1 & Fp1&\ cf. Figure~\ref{fig:10-20}\\
2&Fp2&\ cf. Figure~\ref{fig:10-20}\\
3&F7&\ cf. Figure~\ref{fig:10-20}\\
4&F3&\ cf. Figure~\ref{fig:10-20}\\
5&Fz&\ cf. Figure~\ref{fig:10-20}\\
6&F4&\ cf. Figure~\ref{fig:10-20}\\
7&F8&\ cf. Figure~\ref{fig:10-20}\\
8&FC5&\ cf. Figure~\ref{fig:10-20}\\
9&FC1&\ cf. Figure~\ref{fig:10-20}\\
10&FC2&\ cf. Figure~\ref{fig:10-20}\\
11&FC6&\ cf. Figure~\ref{fig:10-20}\\
12&T7&\ cf. Figure~\ref{fig:10-20}\\
13&C3&\ cf. Figure~\ref{fig:10-20}\\
14&Cz&\ cf. Figure~\ref{fig:10-20}\\
15&C4&\ cf. Figure~\ref{fig:10-20}\\
16&T8&\ cf. Figure~\ref{fig:10-20}\\
17&TP9&\ cf. Figure~\ref{fig:10-20}\\
18&CP5&\ cf. Figure~\ref{fig:10-20}\\
19&CP1&\ cf. Figure~\ref{fig:10-20}\\
20&CP2&\ cf. Figure~\ref{fig:10-20}\\
21&CP6&\ cf. Figure~\ref{fig:10-20}\\
22&TP10&\ cf. Figure~\ref{fig:10-20}\\
23&P7&\ cf. Figure~\ref{fig:10-20}\\
24&P3&\ cf. Figure~\ref{fig:10-20}\\
25&Pz&\ cf. Figure~\ref{fig:10-20}\\
26&P4&\ cf. Figure~\ref{fig:10-20}\\
27&P8&\ cf. Figure~\ref{fig:10-20}\\
28&PO9&\ cf. Figure~\ref{fig:10-20}\\
29&O1&\ cf. Figure~\ref{fig:10-20}\\
30&Oz&\ cf. Figure~\ref{fig:10-20}\\
31&O2&\ cf. Figure~\ref{fig:10-20}\\
32&PO10&\ cf. Figure~\ref{fig:10-20}\\
33&AbdPolLo&\ abductor pollicis longus\\
34&Biceps&\ long head of the biceps\\
35&Triceps&\ the external head of triceps\\
36&FrontDelt&\ anterior portion of deltoid\\
37&MidDelt&\ lateral portion of deltoid\\
38&BackDelt&\ posterior portion of deltoid,\\
&&\ teres minor and infraspinatus muscles\\
39&HEOG&\ \textit{not used}\\
40&Synchro&\ Synchronization trigger for kinematic recording\\
41&Zero&\ Zero reference\\
\end{tabular}
\caption{Channel numbers and names}
\label{tab:channelNames}
\end{table}
\subsubsection{States}
The main information contained by the \texttt{states} \matlab{}\texttt{-struct} is the currently presented stimulus. The \texttt{struct} has the same length as the signal so that for every entry in the signal there is corresponding state information.\\
There were some adjustments necessary since the states did not match the movements (cf. Section~\ref{mm:newClass}).
\subsubsection{Parameters}
All the settings from the BCI2000 recording are saved and loaded as \texttt{parameters}.\\
Examples are the names of the channels, the random seed for BCI2000 and the sounds, meaning and duration for different stimuli.
\subsection{Filtering}
For filtering I use second order Butterworth filters.
For normal EEG:
\begin{enumerate}
\item Highpass at \texttt{minEEGFreq} (usually 2 Hz)
\item Lowpass at \texttt{maxEEGFreq} (usually 49 Hz)
\item Bandstop at 48-52 Hz and 148-152 Hz to eliminate (possibly relevant) power line noise
\end{enumerate}
For low frequencies:
\begin{enumerate}
\item Highpass at 0.01 Hz
\item Lowpass at 1 Hz
\end{enumerate}
For EMG data:
\begin{enumerate}
\item Highpass at 2 Hz to eliminate movement artifacts
\item Bandstop at 48-52 Hz and 148-152 Hz to eliminate (possibly relevant) power line noise
\end{enumerate}
For filtering, \matlab{}'s \texttt{filtfilt} function is used to avoid phase shifts caused by the multiple filtering steps.
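As a sketch, the EEG filter chain above could look as follows (the sampling rate \texttt{fs} is assumed here for illustration; in the actual analysis it is taken from the BCI2000 parameters):
\begin{verbatim}
fs  = 500;                                        % assumed sampling rate in Hz
nyq = fs/2;
[bHi,  aHi ] = butter(2, 2/nyq,  'high');         % highpass at minEEGFreq = 2 Hz
[bLo,  aLo ] = butter(2, 49/nyq, 'low');          % lowpass at maxEEGFreq = 49 Hz
[b50,  a50 ] = butter(2, [48 52]/nyq,   'stop');  % power line noise
[b150, a150] = butter(2, [148 152]/nyq, 'stop');  % first harmonic
eegF = filtfilt(bHi,  aHi,  eeg);                 % filtfilt: zero-phase filtering
eegF = filtfilt(bLo,  aLo,  eegF);
eegF = filtfilt(b50,  a50,  eegF);
eegF = filtfilt(b150, a150, eegF);
\end{verbatim}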
\subsection{Windowing}
To process continuous EEG or EMG data it is necessary to define windows. These windows are shifted over the data to obtain discrete values for the further steps.\\
Defaults for EEG are:
\begin{itemize}
\item window $1 s$
\item shift $200ms$
\end{itemize}
Defaults for EMG are:
\begin{itemize}
\item window $0.2s$
\item shift $50ms$.
\end{itemize}
For re-synchronization, the shift is passed to every function that needs to align EMG and EEG data; these functions then operate on \emph{time from start}.
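A sketch of how such windows can be cut from the continuous signal (hypothetical names; \texttt{x} is the signal with one column per channel):
\begin{verbatim}
winLen = round(1.0 * fs);                  % EEG default: 1 s window
shift  = round(0.2 * fs);                  % EEG default: 200 ms shift
starts = 1:shift:(size(x,1) - winLen + 1);
for w = 1:numel(starts)
    idx    = starts(w):(starts(w) + winLen - 1);
    window = x(idx, :);                    % one window, e.g. for PSD estimation
end
\end{verbatim}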
\subsection{Waveform length}
For the representation of the EMG data the waveform length as described by \cite{Phinyomark12} is used, which gives a measure of the amount of change within the window. It is calculated as the sum over the absolute differences of consecutive measurements in the time window:
$$\sum\limits_{i=1}^N \left| x_i-x_{i-1}\right|,$$
where $N$ is the number of measurements in the time window and $x_i$ are the measurements themselves.\\
It corresponds to the path length of a needle plotting the EMG on a continuously moving strip of paper.
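In \matlab{} this is a one-liner (for a window with one column per channel, hypothetical variable name):
\begin{verbatim}
wl = sum(abs(diff(window)));   % waveform length per channel of one window
\end{verbatim}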
\subsection{Classification}
\label{mm:newClass}
Very bad results when classifying EMG into Move/Rest made me further inspect the data. The actual movement came quite a while after the stimulus.\\
To address this problem I did a re-classification of the data according to actual movements (cf. Appendix~\ref{code:classify}).\\
To decide whether the subject is moving or not the mean EMG activity (from Waveform Length) is compared to a threshold (10,000 by default).\\
If there is movement, the class occurring the most in the second before is defined as the current task. If there is movement but the stimulus indicates resting, the last active stimulus is assigned.\\
In addition, the second before movement onset is removed from the data (classified as -1), and optionally the half second before movement onset is classified as belonging to the following stimulus (cf. \ref{mat:pause}).\\
Finally some smoothing is done by taking the most frequently occurring class from the second before and after the current time step as its class.\\
As last step the length of the stimulus-vector is adjusted to the length of the EEG data.\\
In the further analysis, only data points classified differently from -1 are used, meaning they are either clear rest or clear movement.
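The following is a strongly simplified sketch of this idea only (hypothetical names, with rest coded as 0); the actual implementation is given in Appendix~\ref{code:classify}:
\begin{verbatim}
% wl: waveform length per window (windows-by-channels), stim: stimulus per window
isMoving  = mean(wl, 2) > threshold;   % movement if mean EMG activity exceeds 10000
newClass  = stim;
newClass(~isMoving) = 0;               % windows without movement count as rest
winPerSec = round(1 / shiftEMG);       % windows per second (shiftEMG = 0.05 s)
smoothed  = newClass;
for t = 1:numel(newClass)              % most frequent class within +-1 s
    lo = max(1, t - winPerSec);
    hi = min(numel(newClass), t + winPerSec);
    smoothed(t) = mode(newClass(lo:hi));
end
\end{verbatim}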
\subsection{Synergies}
\label{mat:synergies}
Synergies are generated based on different options for dimensionality reduction (cf. \ref{back:synergies}).\\
EMG data (as waveform length) is reduced to $n$ dimensions, where $n$ is the desired number of Synergies.\\
Using PCA this is done by taking the first $n$ components. Then the EMG data is transformed into the $n$-dimensional space spanned by the components.\\
NMF is done with $n$ as inner dimension. Then EMG data is multiplied with the resulting matrix to transform it to $n$-dimensional data.\\
Autoencoders eventually are trained with a hidden layer of size $n$ and afterwards EMG data is encoded with the learned weights. This is equivalent to taking the hidden layer activity for the corresponding time step.\\
Since synergies are generated from EMG, they have the same size in the first dimension\footnote{depending only on window size and shift for the EMG data and the recording duration} and $n$ in the second.
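One possible realization of these three variants (a sketch, not necessarily identical to the code used here; \texttt{emgWL} is the windows-by-channels waveform-length EMG and \texttt{n} the number of synergies):
\begin{verbatim}
[~, score] = pca(emgWL);
synPCA = score(:, 1:n);                   % first n principal components over time
[W, H] = nnmf(emgWL, n);                  % NMF with inner dimension n
synNMF = W;                               % synergy activations (windows-by-n)
autoenc = trainAutoencoder(emgWL', n);    % hidden layer of size n
synAE   = encode(autoenc, emgWL')';       % hidden-layer activity per window
\end{verbatim}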
\subsection{Kinematics}
Kinematic data is used either as movement or as position. The position was recorded directly; the movement is calculated as the first derivative of the position with respect to time.\\
The recording of kinematics was started after that of the EEG. In the synchronization channel\footnote{cf. Table~\ref{tab:channelNames}} there is a peak when the kinematic recording is started. This was used to align the movement with the EEG and EMG data. In addition, the kinematic data is adjusted to the EMG window and shift to be able to use corresponding data for the same time step. This was done by summing all differences (for movement) or by calculating the mean position in the time window.\\
This data has the same length as the EMG and synergy data but only three features per time step, since only the position and orientation ($x,y$ and $\theta$) of the hand and no information about the fingers are used.
\section{Data Analysis}
Figure~\ref{fig:overview} shows the regression steps of this work. EEG, EMG and positions were recorded, Synergies and velocities were calculated from them. To check the performance of the methods the relations between them were predicted.
\begin{figure}
\centering
\input{pictures/overview.tikz}
\caption{Overview: What is predicted?\\Normal arrows indicate prediction, fat ones computation and dashed ones prediction with an intermediate step}
\label{fig:overview}
\end{figure}
\subsection{Classification}
In addition to the regressions, classifications were done to have a benchmark and a possibility to compare with results from other work.
Classification can be done in different ways. The first approach is discriminating Movement from Rest. This is done by training an SVM and testing its results with 10-fold cross validation. Here this is done with EMG, EEG and low-frequency (LF) data. EMG in this setting is trivial since it was the basis for the classification (cf. \ref{mm:newClass}).\\
In a second step discrimination of movement in different directions is done, also with an SVM trained on EMG, EEG or LF data respectively. The fit of the model is also checked with 10-fold cross validation.\\
For unbiased classification it is necessary to train with equally sized classes. For that purpose, and to lower computation time, only 250 samples per class (by default) are used.\\
The parameter $c$ for the support vector machine is found with an additional step of cross validation or set to 1. (Results in Section~\ref{res:maxC}).\\
To learn about biased classifications and about similar brain activity for different movements the confusion matrix is created (cf. Section~\ref{mm:cm}).\\
The resulting accuracy is the mean of each of the 10 cross validation steps.
\subsection{Regression}
\subsubsection{Predicting Kinematics}
The prediction of kinematics is done with ridge regression. Since there are more data for kinematics than for EEG the mean position or movement are used and predicted.\\
The regression is done in 10-fold cross validation for each dimension ($x,y,\theta$) and the parameter $\lambda$ (cf. ~\ref{mm:ridge}) is ascertained with an additional cross validation. The resulting correlation is the mean correlation of each of the 10 parts with the best parameter $\lambda$ each while the correlation for each dimension is calculated independently.
\subsubsection{Predicting Synergies}
Predicting synergies works similarly to predicting the kinematics. The only difference is that the synergies may have a different dimensionality. Nevertheless, each synergy is predicted from all EEG data as one output and the correlation is calculated for each synergy.
\subsubsection{Predicting EMG}
When predicting EMG data, the sum of the waveform length over the time window corresponding to the EEG data is used. As the EMG data was already summed to obtain the waveform length, this is a reasonable approach.\\
The remaining steps are the same as for kinematics and Synergies.
\subsubsection{Prediction with interim step}
All these analyses only show the accuracy of one step. To get a measure of the overall performance, synergies are predicted from EEG and used to predict EMG or kinematics respectively.\\
The resulting correlation is the mean of the correlations of a 10-fold cross validation where the same unknown synergies are predicted from EEG and used to predict EMG or kinematics. So there is no correction step between the steps and EMG or kinematics are predicted from EEG via the Synergies. Here also different methods to determine Synergies are compared (see Section~\ref{res:differentSynergiesVia}).
\subsection{Parameters}
Some structural parameters were introduced to check their influence on the predictions and classifications.
\subsubsection{EEG offset}
\label{mat:offset}
Since it takes some time for commands to travel from the brain to the muscles, a variable offset between EEG and the other data is used. The offset is given as a number of shifts, so by default it is a multiple of 200 ms.\\
Results are given in Sections~\ref{res:offsetEEG} and~\ref{res:offsetLF}.
\subsubsection{Pause}
\label{mat:pause}
A pause is used before movement onset. This pause means that the second before movement onset is not taken into account when analyzing the data. If there is no pause, only the interval from 1 s to 0.5 s before movement onset is left out and the last 0.5 s before movement onset are classified as belonging to the following task.\\
This was necessary since the data about presentation of stimuli did not match the recordings and reclassification was necessary (cf. section \ref{mm:newClass}).
\subsection{Multiple Sessions}
Each session (cf. Section~\ref{mm:design}) is analyzed independently meaning there are 51 independent results for each analysis. These are used for the statistical evaluation in Chapter~\ref{chp:results}.\\
Some analyses are only done on one session - if so it will be clearly stated.
\subsection{Topographical Plots}
Sometimes the interpretation of EEG data is easier if plotted topographically, meaning visualized according to the corresponding positions on a modeled head.
Usually differences between different classes (e.g. movement or rest) are plotted in a band and not in a single frequency. Examples are given in the Results section in figures \ref{fig:topoAlpha} and \ref{fig:topoBeta}.\\
To obtain zero-centered data the relation can be calculated as $$\frac{\text{Move}}{\text{Rest}}-1,$$ where Move and Rest are the mean activity in the band. Dividing instead of subtracting provides a more intuitive measure for the strength of the desynchronization.
\subsection{Default values}
\label{mat:default}
The values of the variables used in \texttt{'Default'} are given in table~\ref{tab:default}.
\begin{table}
\centering
\begin{tabular}{r|c|l}
Variable & default & Meaning\\\hline
allSubjects & \true & is the computation done for all 51 sessions\\
&&or only for one randomly chosen?\\
eegOffset & 0 & amount of offset applied for EEG data (cf. \ref{mat:offset})\\
$k$ & 10 & iterations of cross validation (do not change)\\
maxEEGFreq & 49 & Frequency for a Butterworth low-pass filter\\
minEEGFreq & 2 & Frequency for a Butterworth high-pass filter\\
maxExpC & 0 & SVM tries values $10^{-x}$ to $10^{x}$\\
&& with steps of 1 in $x$\\
maxFile & 5 & number of files per session\\
maxPerClass & 250 & maximum number of data points in one svm class\\
&& (only for training)\\
noLFsamples & 5 & number of samples out of one time window\\ && used for LF predictions\\
noSynergies & 3 & number of Synergies used\\
pause & 0 & apply pause or not? (cf. \ref{mat:pause})\\
pBurgOrder & 250 & order of model for Burg's model (cf. \ref{mat:burg})\\
ridgeParams & 100 & Array of parameters tried in cross \\&&validation for ridge\\
shiftEEG & 0.2 & shift of the EEG window in each step\\
shiftEMG & 0.05 & shift of the EMG window in each step\\
threshold & 10000 & threshold for classification as movement (cf. \ref{mm:newClass})\\
windowEEG & 1 & size of the EEG window \\
windowEMG & 0.2 & size of the EMG window \\
\end{tabular}
\caption{Default values for all variables}
\label{tab:default}
\end{table}