\chapter{Materials and Methods}
\label{mat}
\section{Scientific background}
\label{mat:background}
\subsection{BCIs}
The idea of BCIs began to spread in the 1970s, when Vidal published his paper (\cite{Vidal73}).\\
The first approaches used invasive BCIs, initially in animals (rodents and monkeys) and later also in humans. In humans, invasive BCIs were mostly implanted when the patient was already undergoing brain surgery for another reason, such as the treatment of epilepsy. The main drawback of invasive BCIs is the need to cut through skull and dura mater, which can lead to infections and severe brain damage.\\
An improvement were less invasive BCIs, e.g. ECoG, where the electrodes are placed inside the skull but outside the dura, which greatly reduces the risk of infection.\\
Measuring outside the skull entails even less risk; the dura and the skull, however, lower the quality of the signal considerably. With some improvements EEG reaches a spatial resolution of 2-3\,cm (cf. \cite{Babiloni01}), which is poor compared to the single neurons one can observe with invasive methods. For our task, however, we are more interested in the activity of whole areas than in single cells, so EEG meets our requirements.\\
In addition, EEG is much cheaper and easier to use than other techniques. There is no need for surgery (as with invasive methods) and the hardware can be bought for less than 100\euro{}, while fMRI hardware costs far above 100,000\euro{}. This is one of the reasons why EEG is far more available than other techniques. There are some more recent techniques, but much less work has been done with them, which is why they are neither as well known nor as widely distributed as EEG.\\
Another advantage of EEG is that the device is head-mounted, so the user may move during the measurement without a large impact on the tracking of brain activity. This is essential for any BCI used in daily life.
\subsection{EEG}
When using Electroencephalography (EEG) one measures the electrical fields on the scalp that are generated by the activity of neurons in the brain. These measurements allow some interpretation of what is happening inside the skull. In our application we use the recorded signals directly to train an SVM or as predictors for regression.
The foundation for EEG was laid in 1875, when Richard Caton found electrical activity around the brains of monkeys. After the method had been tested on animals, the first human EEG was recorded in 1924 by Hans Berger in Jena. He also coined the term electroencephalography and is regarded as the inventor of EEG.
The frequencies typically used for movement prediction from EEG are about 8-24\,Hz (\cite{Blokland15}, \cite{Ahmadian13}, \cite{Wang09}).
EEG is often used for non-invasive BCIs because it is cheap and easier to use than, e.g., fMRI. The electrodes have to be distributed over the scalp; to allow for comparability there are standardized placement schemes, which also come with a naming convention.
\subsubsection{10-20 system}
In this standard, adjacent electrodes are placed either 10\% or 20\% of the total front-back or left-right distance of the head apart. This standardization also makes it possible to name each electrode, or rather each position. This is done with capital letters for the lobes (Frontal, \qq{Central}, Parietal, Occipital and Temporal) and numbers for the specific position over the lobe. Even numbers lie on the right side of the head, odd numbers on the left; larger numbers are closer to the ears, smaller numbers closer to the midline. The number $x$ encodes the distance from the midline: $$\left\lceil\frac{x}{2}\right\rceil\cdot \frac{d}{10},$$ where $d$ is the diameter of the scalp. Electrodes on the midline are named with a lower case $z$, e.g. $Cz$.\\
Electrodes placed between two lobes (10\% instead of 20\% distance) are named after both adjacent lobes (anterior first), e.g. $FCz$ (between frontal and central lobe).
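For example, electrode $C3$ lies over the central region of the left hemisphere, at a distance of $\left\lceil\frac{3}{2}\right\rceil\cdot\frac{d}{10}=\frac{d}{5}$ from the midline electrode $Cz$.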
See also Figure~\ref{fig:10-20}.
\begin{figure}[!p]
\centering
\includegraphics[width=\textwidth]{eeg_electrodes_10-20.png}
\caption{Full 10-20 system}
\label{fig:10-20}
\end{figure}
\subsection{Power estimation}
\subsubsection{EEG}
One way to use data from EEG is to analyze the occurring frequencies and their respective power.\\
There are different methods to obtain these from the continuous signal. The intuitive approach would be the Fourier transform; however, the Fourier transform need not exist for a continuous signal. Therefore we used power spectral density (PSD) estimation.
\subsubsection{Power spectral density estimation}
The PSD describes the power per frequency, where power refers to the square of the amplitude.
If the Fourier transform exists, the PSD can be calculated from it, e.g. as a periodogram. Otherwise it has to be estimated. One way to do so is a parametric approach based on an autoregressive (AR) model. Here one assumes that each sample is correlated with the $p$ preceding samples, which leads to a model with only $p$ parameters that can be estimated in different ways. We used Burg's method (\texttt{pburg} from the \matlab{} library).\\
Figure~\ref{fig:psd} shows the difference between the autoregressive (\texttt{pburg}) and the periodogram-based (\texttt{pwelch}) PSD estimates.
\begin{figure}
\includegraphics[width=\textwidth]{psd.png}
\caption{PSD with Autoregressive model and FFT respectively\protect\footnotemark}
\label{fig:psd}
\end{figure}
\footnotetext{The signal was unfiltered EEG data from channel \textit{Cz}, second run of the second session of subject AO}
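As an illustration of how such estimates can be obtained in \matlab{}, Algorithm~\ref{alg:psdSketch} gives a minimal sketch; the sampling rate is a placeholder that has to be taken from the recording parameters, \texttt{signal} is the raw data matrix as loaded later in Algorithm~\ref{alg:load_bcidat}, and the model order is a free parameter (cf. Section~\ref{mat:burg}).
\begin{algorithm}
\begin{algorithmic}
\State fs = 500; \Comment{sampling rate in Hz (placeholder)}
\State x = double(signal(:,14)); \Comment{channel \textit{Cz}, cf. Table~\ref{tab:channelNames}}
\State [pxxAR, fAR] = pburg(x, 250, [], fs); \Comment{parametric estimate with an AR model of order 250}
\State [pxxP, fP] = pwelch(x, [], [], [], fs); \Comment{periodogram-based estimate with default windowing}
\end{algorithmic}
\caption{PSD estimation with \texttt{pburg} and \texttt{pwelch} (sketch)}
\label{alg:psdSketch}
\end{algorithm}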
\subsubsection{Burg's method - Autoregressive Model}
\label{mat:burg}
Burg's method (\cite{Burg75}) is a special case of parametric PSD estimation. It interprets the Yule-Walker equations as a least squares problem and estimates their solution iteratively.\\
According to \cite{Huang14}, Burg's method is well suited for cases where a high resolution is needed.\\
The Burg and Levinson-Durbin algorithms are examples of PSD estimation where an autoregressive model is used instead of the Fast Fourier Transform. The approach is described well by Spyers-Ashby et al. (\cite{Spyers98}). The idea is to reduce the number of parameters that determine how the signal is generated. The number of parameters used is called the \textit{model order} (250 in our example, lower in most cases). These parameters are estimated from the original data. For the PSD estimation the modeled values are used, which allows an easier transformation since the data is generated by a known process.\\
Often a rational transfer function model is used, which has the general form $$x_n=-\sum\limits_{k=1}^p a_kx_{n-k}+ \sum\limits_{k=0}^q b_ku_{n-k},$$ where $x_n$ is the output and $u_n$ the input; $a$ and $b$ are the system parameters, which have to be estimated from the original data. Since the input is unknown in our application, the output can only be estimated, which simplifies the formula to $$\hat{x}_n=-\sum\limits_{k=1}^p a_k\hat{x}_{n-k}.$$
The parameters are estimated by minimizing the forward prediction error $E$: $$E=\frac{1}{N}\sum\limits_{i=1}^N \left(x_i-\hat{x}_i\right)^2.$$
The minimum has zero slope and can be found by setting the derivatives to zero: $$\frac{\partial E}{\partial a_k}=0,\quad\text{for } 1\le k\le p.$$
This yields a set of equations called the \emph{Yule-Walker equations} (cf. \cite{Yule27}, \cite{Walker31}).\\
Burg's method then estimates the parameters $a_k$ based on the Yule-Walker equations, using both forward and backward prediction.
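As an illustration of the underlying AR model (not of the exact implementation inside \texttt{pburg}), the coefficients can also be estimated explicitly with \texttt{arburg} and used for one-step forward prediction; Algorithm~\ref{alg:arSketch} gives a minimal sketch with the same placeholder signal \texttt{x} and model order as in Algorithm~\ref{alg:psdSketch}.
\begin{algorithm}
\begin{algorithmic}
\State [a, e] = arburg(x, 250); \Comment{AR coefficients $a$ and noise variance $e$ via Burg's method}
\State xhat = filter([0 -a(2:end)], 1, x); \Comment{one-step prediction $\hat{x}_n=-\sum_{k=1}^p a_k x_{n-k}$}
\State err = x - xhat; \Comment{forward prediction error}
\end{algorithmic}
\caption{Fitting an AR model with Burg's method (sketch)}
\label{alg:arSketch}
\end{algorithm}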
\subsection{Low Frequencies}
In the 2000s, new techniques began to be used to record ultrafast and infraslow brain waves (above 50\,Hz and below 1\,Hz), which were found to carry relevant information (cf. \cite{Vanhatalo04}).\\
Low frequencies were also found to be informative for predicting movements, for example by Liu et al. (\cite{Liu11}) and Antelis et al. (\cite{Antelis13}). Antelis et al. found correlations between hand movement and the low-frequency signal of $0.29$, $0.15$ and $0.37$ for the three dimensions, respectively.\\
Lew et al. (\cite{Lew14}) state that low frequencies are mainly involved in spontaneous, self-initiated movement and can be observed before the movement starts. They may therefore be a good way to lower the reaction time of neuroprostheses, for example.
%TODO: more details (idea, possible explanantion)
\subsection{Filtering}
Filtering of the recorded EEG signal is necessary for several reasons. First, there are artifacts from the 50\,Hz power line, which can be removed with bandstop filters.\\
Secondly, we need to concentrate on the frequencies of interest (1-50\,Hz for classical EEG). This is done by applying lowpass and highpass filters, respectively. It is necessary because the PSD at low frequencies is much higher than at high frequencies: the relation $$PSD(f)=\frac{c}{f^\gamma}$$ holds for constants $c$ and $\gamma$ (\cite{Demanuele07}).\\
The Butterworth filter (\cite{Butterworth30}) was introduced by Stephen Butterworth in 1930. Its advantage is a uniform sensitivity to all frequencies in the passband: compared to other filters, the Butterworth filter is smoother because it is flat in the passband and monotonic over all frequencies. This, however, comes at the cost of a less steep roll-off, i.e. a larger portion of frequencies beyond the cutoff passes the filter.
\subsection{EMG}
Muscles contract after a signal arriving via an efferent nerve activates them. The contraction of a muscle also produces measurable electrical activity, which is used for electromyography (EMG). There are intramuscular applications of EMG, but we only used surface EMG.\\
From surface EMG the activity of muscles can be estimated, although not very precisely without repetitions. Since the muscles used for arm movements are quite large, in our setting surface EMG allows relatively precise estimates of the underlying muscle activity.
EMG was mainly developed for diagnostic tasks, but it is also used in research to track muscle activity. %TODO?
\subsection{Synergies}
\label{back:synergies}
Movements of the arm (and other parts of the body) are under-determined: for a given trajectory, different combinations of muscle contractions are possible. One idea of how our nervous system could solve this problem are synergies. Proposed by Bernstein in 1967 (\cite{Bernstein67}), they encode the goal of the movement (e.g. the trajectory) instead of controlling single muscles. This would imply, however, that predicting the activity of single muscles from EEG is harder than predicting a synergy, which in turn determines the contraction of the muscles.\\
Evidence for the use of synergies in the nervous system was found, e.g., by Bizzi et al. (\cite{Bizzi08}) and Byadarhaly et al. (\cite{Byadarhaly12}). They also showed that synergies meet the requirements necessary to build predictable trajectories.\\
Synergies are usually obtained from the EMG signal through principal component analysis (PCA, cf. Section~\ref{mat:pca}), non-negative matrix factorization (NMF, cf. Section~\ref{mat:nmf}) or autoencoders (a form of artificial neural network, cf. Section~\ref{mat:autoenc}).
\subsection{PCA}
\label{mat:pca}
Principal Component Analysis (PCA) is probably the most common technique for dimensionality reduction. The idea is to keep the directions with the highest variance in order to preserve as much information as possible in the lower-dimensional space.\\
PCA was invented in 1901 by Karl Pearson (\cite{Pearson01}). The intention was to find the line closest to a set of data points; this line is also the one that explains most of the variance.\\
The PCA of a data set can be computed in different ways. One is to calculate the eigenvectors of the covariance matrix: the first principal component is the eigenvector with the largest eigenvalue, and the remaining components follow ordered by their eigenvalues.\\
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{GaussianScatterPCA.jpg}
\caption{Eigenvectors of the covariance matrix of Gaussian-distributed data}
\label{fig:pca}
\end{figure}
Figure~\ref{fig:pca} shows the eigenvectors of such a data set. The longer vector is the first principal component; the shorter one is orthogonal to it and explains the remaining variance. The second component here is also the component that explains the least variance, since most of the variance is orthogonal to it.
%TODO? formula, ...
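A minimal \matlab{} sketch of this reduction is given in Algorithm~\ref{alg:pcaSketch}; \texttt{emg} is a placeholder for the EMG feature matrix with one row per time step, and $n$ is the desired number of components.
\begin{algorithm}
\begin{algorithmic}
\State [coeff, score] = pca(emg); \Comment{columns of coeff are the eigenvectors, ordered by eigenvalue}
\State reduced = score(:, 1:n); \Comment{data projected onto the first n principal components}
\end{algorithmic}
\caption{Dimensionality reduction with PCA (sketch)}
\label{alg:pcaSketch}
\end{algorithm}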
\subsection{NMF}
\label{mat:nmf}
In some applications non-negative matrix factorization (NMF) is preferred over PCA (cf. \cite{Lee99}), because it does not learn eigenvectors but decomposes the input into parts that may all be present in the input. Seen as a matrix factorization, PCA yields matrices with entries of arbitrary sign, where one matrix holds the eigenvectors and the other the specific mixture of them. Because entries may be negative, cancellation is possible, which leads to an unintuitive representation in the first matrix.\\
NMF, in contrast, only allows non-negative entries. This leads to \qq{what is in, is in}, meaning no cancellation, which in turn yields more intuitive matrices: the first contains possible parts of the data, the second how strongly they are represented in the current input.\\
The formula for NMF is
$$\mathbf{A}\approx \mathbf{WH},$$
where the input $\mathbf{A}$ is $n\times m$, $\mathbf{W}$ is $n\times r$ and $\mathbf{H}$ is $r\times m$ with $r\ll\min\{m,n\}$. Thus $\mathbf{WH}$ is only an approximation of the input, but one of significantly lower rank $r$, the number of synergies used.\\
The factorization is learned with an update rule that can be chosen. We used the \matlab{} default, an alternating least squares (ALS) algorithm. It can be described as in Algorithm~\ref{alg:als} (cf. \cite{Berry07}).\\
\begin{algorithm}
\begin{algorithmic}
\State initialize $\mathbf{W}$ randomly
\State $i\gets0$
\While{$i <$ numberOfIterations}
\State Solve $\mathbf{W^TWH=W^TA}$ for $\mathbf{H}$
\State set all negative values in $\mathbf{H}$ to zero
\State Solve $\mathbf{HH^TW^T=HA^T}$ for $\mathbf{W}$
\State set all negative values in $\mathbf{W}$ to zero
\State $i\gets i+1$
\EndWhile
\Return $\mathbf{W}$
\end{algorithmic}
\caption{Alternating Least Squares in NMF}
\label{alg:als}
\end{algorithm}
This version uses some simplifications (such as setting negative values to zero to enforce non-negativity); a slightly improved form is used in \matlab{}.\\
ALS usually converges faster and to a better result than multiplicative update algorithms, which would be the alternative in \matlab{}.
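A minimal \matlab{} sketch using the built-in \texttt{nnmf} function is given in Algorithm~\ref{alg:nmfSketch}; \texttt{emg} is again a placeholder for the EMG feature matrix with one row per time step.
\begin{algorithm}
\begin{algorithmic}
\State [W, H] = nnmf(emg, n, 'algorithm', 'als'); \Comment{factorize into n components with ALS}
\State reduced = W; \Comment{per-time-step activation of the n components (if rows of emg are time steps)}
\end{algorithmic}
\caption{Dimensionality reduction with NMF (sketch)}
\label{alg:nmfSketch}
\end{algorithm}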
\subsection{Autoencoders}
\label{mat:autoenc}
Autoencoders are a specific type of artificial neural network (ANN). They work like typical ANNs by adjusting the weights between layers; however, they do not predict an unknown output but reproduce their own input. The interesting part is the size of the hidden layer, which is chosen smaller than the input size. The hidden layer then contains the information of the input in a condensed form (e.g. synergies instead of single muscle activities).\\
Autoencoders have been used successfully by Spüler et al. to extract synergies from EMG (\cite{Spueler16}). Especially with a low number of synergies, autoencoders should perform better than PCA or NMF, because linear models fail to capture the agonist-antagonist relations that are typical for muscle movements. These relations can be detected by autoencoders, which allows good estimates with half the number of synergies.\\
An autoencoder's input layer has as many neurons as there are input dimensions (e.g. one for each EMG channel). The number of hidden-layer neurons may be varied; we mostly used 3. The output layer has the same size as the input layer. Such an autoencoder is shown in Figure~\ref{fig:autoenc}.
\begin{figure}
\centering
\input{autoencoder.tikz}
\caption{Autoencoder (6-3-6)}
\label{fig:autoenc}
\end{figure}
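A minimal \matlab{} sketch using the Neural Network Toolbox is given in Algorithm~\ref{alg:aeSketch}; \texttt{emg} is again a placeholder with one row per time step and is therefore transposed, since \texttt{trainAutoencoder} expects one column per sample.
\begin{algorithm}
\begin{algorithmic}
\State autoenc = trainAutoencoder(emg', n); \Comment{autoencoder with a hidden layer of size n}
\State reduced = encode(autoenc, emg')'; \Comment{hidden-layer activations, one row per time step}
\end{algorithmic}
\caption{Dimensionality reduction with an autoencoder (sketch)}
\label{alg:aeSketch}
\end{algorithm}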
\subsection{Support-vector Machines}
Support-vector machines (SVMs) are used for the classification of data. This is done by separating the data in feature space with a hyperplane. New data is classified according to the side of the hyperplane on which it is located in feature space.\\
The hyperplane is considered optimal if the margins on both sides (the distance to the nearest data point) are maximal, to allow for as much noise as possible. This means the separating hyperplane can be constructed from the nearest points of both classes (the support vectors, 3 in 2-D). These points, however, may differ between attempts, as a different angle in some dimension may make different points the nearest (cf. Figure~\ref{fig:hyperplanes}).\\
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{pictures/hyperplanes.png}
\caption{Two possible separating hyperplanes; the orange one has the larger margin}
\label{fig:hyperplanes}
\end{figure}
The hyperplane is defined by $\vec{w}\cdot \vec{x}-b=0$, while the hyperplanes touching the two classes are $\vec{w}\cdot \vec{x}-b=1$ and $\vec{w}\cdot \vec{x}-b=-1$, respectively (see Figure~\ref{fig:svm}). The margin is $\frac{2}{||\vec{w}||}$. The margin has to be maximized while all data points are classified correctly; since maximizing $\frac{2}{||\vec{w}||}$ is equivalent to minimizing $||\vec{w}||$, the problem can be formulated as
$$\min\limits_{\vec{w},b} ||\vec{w}||\quad s.t.\quad y_i(\vec{w}\cdot\vec{x_i}-b)\geq 1,\ i=1,\dots,n,$$
where $y_i$ is the class ($+1$ or $-1$) of the corresponding data point.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{pictures/svm.png}
\caption{Margins and hyperplane (Figure by Cyc and Peter Buch)}
\label{fig:svm}
\end{figure}
This prototype of an SVM is only able to separate two classes of linearly separable data. For other data some extensions are necessary.
\subsubsection{Multiclass SVM}
If there are more than two classes to separate, this can be done with SVMs in different ways. One approach is \emph{one-vs-one}: for every pair of classes an SVM is trained, and each SVM votes for one of its two classes. This is done for all pairs, and the class with the most votes is chosen.\\
Another approach is \emph{one-vs-all}, where each class is compared against all remaining ones. Here scores are used to determine which class matches best, i.e. for which class the data point lies farthest from the separating hyperplane.
\subsubsection{Soft-margin SVM}
If the data are not separable, soft-margin SVMs have to be used. They allow misclassifications but try to minimize them. The problem can be formulated as
$$\text{minimize }\frac{1}{N}\sum\limits_{i=1}^N\max\{0,1-y_i(\vec{w}\cdot\vec{x_i}-b)\}+\lambda ||\vec{w}||^2,$$
where $\lambda$ is the parameter that adjusts the trade-off between a large margin and misclassifications (a higher value of $\lambda$ puts more weight on a large margin).
\subsubsection{Kernel trick}
Data like those in Figure~\ref{fig:kernel} are not \emph{linearly} separable. The idea here is to apply the \emph{kernel trick}, i.e. to transform the data into a different space in which they are linearly separable. In the example this is accomplished by using the distance from the origin as a feature and separating in that space.
\begin{figure}
\input{pictures/kernel.tikz}
\caption{Data separable with the kernel trick}
\label{fig:kernel}
\end{figure}
Common kernels are polynomial, Gaussian (RBF) and hyperbolic tangent kernels.
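A minimal \matlab{} sketch of a multiclass soft-margin SVM with a Gaussian kernel is given in Algorithm~\ref{alg:svmSketch}; the feature matrix \texttt{X} and the label vector \texttt{Y} are placeholders.
\begin{algorithm}
\begin{algorithmic}
\State t = templateSVM('KernelFunction', 'gaussian', 'BoxConstraint', 1); \Comment{soft-margin SVM with Gaussian kernel, $c=1$}
\State model = fitcecoc(X, Y, 'Learners', t, 'Coding', 'onevsone'); \Comment{one-vs-one multiclass scheme}
\State pred = predict(model, X); \Comment{predicted class labels}
\end{algorithmic}
\caption{Multiclass SVM classification (sketch)}
\label{alg:svmSketch}
\end{algorithm}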
\subsection{Confusion Matrix}
\label{mm:cm}
%TODO: 2 classes: specifity ...
The confusion matrix is a visualization of classification results. For every true class it shows how many samples were assigned to each predicted class. This is interesting because it can reveal biases and give an impression of which classes are similar with respect to the features.\\
%TODO: figure Convusion matrix
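In \matlab{} such a matrix can be obtained, for example, with \texttt{confusionmat}, as sketched in Algorithm~\ref{alg:cmSketch}; the vectors of true and predicted labels are placeholders.
\begin{algorithm}
\begin{algorithmic}
\State C = confusionmat(Ytrue, Ypred); \Comment{rows correspond to true classes, columns to predicted classes}
\end{algorithmic}
\caption{Computing a confusion matrix (sketch)}
\label{alg:cmSketch}
\end{algorithm}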
\subsection{Regression}
Regression is the idea of finding $\beta$ such that $$y= X\beta+\epsilon,$$ where $X$ is the $n\times p$ input matrix and $y$ the $n\times 1$ output vector of a system. With this $\beta$, the output can be predicted from a given input.\\
There are different ways to find $\beta$. One common approach is \emph{ordinary least squares}: $$\hat{\beta}=\arg\min\limits_{b\in\mathds{R}^p} \left(y-Xb\right)^T\left(y-Xb\right),$$ i.e. the chosen $\hat\beta$ is the $b$ that produces the lowest error, since $Xb$ should, apart from the noise $\epsilon$, be the same as $y$.\\
Choosing $\beta$ like this carries the risk of overfitting: the training data are matched well, but new data are predicted poorly. This problem is addressed with ridge regression.
\subsubsection{Ridge Regression}
\label{mm:ridge}
In ridge regression (also called \emph{Tikhonov regularization}), not only the error but also the size of the vector $\beta$ is minimized: $$\hat{\beta}=\arg\min\limits_{b\in\mathds{R}^p} \left(y-Xb\right)^T\left(y-Xb\right)+\lambda b^Tb.$$
The parameter $\lambda$ determines how large the influence of the size of $b$ (and thus of the resulting $\hat\beta$) is.
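The minimizer has the closed form $\hat{\beta}=\left(X^TX+\lambda I\right)^{-1}X^Ty$, which follows from setting the derivative of the objective to zero. A minimal \matlab{} sketch of this solution (unlike the built-in \texttt{ridge} function, which additionally normalizes the predictors) is given in Algorithm~\ref{alg:ridgeSketch}.
\begin{algorithm}
\begin{algorithmic}
\State betaHat = (X'*X + lambda*eye(size(X,2))) \textbackslash{} (X'*y); \Comment{ridge solution for a given lambda}
\State yPred = X*betaHat; \Comment{predicted output}
\end{algorithmic}
\caption{Ridge regression in closed form (sketch)}
\label{alg:ridgeSketch}
\end{algorithm}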
\subsection{Cross Validation}
$k$-fold cross validation means splitting the data into $k$ equally sized parts, training the model on $k-1$ parts and validating on the remaining one (see Figure~\ref{fig:crossValidation}). This is repeated $k$ times so that each part serves as the validation set once.
\begin{figure}
\includegraphics[width=\textwidth]{pictures/K-fold_cross_validation_EN.jpg}
\caption{k-fold cross validation (picture by Joan.domenech91 and Fabian Flöck)}
\label{fig:crossValidation}
\end{figure}
This is done to obtain a measure of how good the fit is. With cross validation, all of the data is used for training and all of it is also predicted once, which reduces effects of randomness.\\
The reason for using, e.g., 10-fold cross validation is that a large part of the data ($9/10$) is available for training the predictor. If there is enough data, one may want to use only 2-fold cross validation to lower computation time.
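A minimal \matlab{} sketch of a plain (non-nested) 10-fold cross validation is given in Algorithm~\ref{alg:cvSketch}; \texttt{trainModel} and \texttt{evaluateModel} are placeholder functions.
\begin{algorithm}
\begin{algorithmic}
\State cvp = cvpartition(size(X,1), 'KFold', 10); \Comment{random split into 10 folds}
\For{$i\gets 1,...,10$}
\State model = trainModel(X(training(cvp,i),:), Y(training(cvp,i)));
\State fit(i) = evaluateModel(model, X(test(cvp,i),:), Y(test(cvp,i)));
\EndFor
\State overallFit = mean(fit); \Comment{average over the folds}
\end{algorithmic}
\caption{Plain 10-fold cross validation (sketch)}
\label{alg:cvSketch}
\end{algorithm}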
Cross validation can also be nested. This is necessary, for example, if it is used both for parameter optimization and to measure the fit, or if there is more than one parameter. When nesting, computation time grows exponentially in the number of nested cross validation levels (see Algorithm~\ref{alg:cv}).
\begin{algorithm}
\begin{algorithmic}
\State split data into parts data$\{0\},\dots,$ data$\{9\}$
\For{$i\gets 0,...,9$}
\State remainingData $\gets$ data$\{0,...,9\}\setminus$ data$\{i\}$
\For{$k\gets0,...,n-1$}\Comment{$n$ is the number of parameter values tested}
\For{$j\gets0,...,9$ with $j\neq i$}
\State remainingData' $\gets$ remainingData $\setminus$ data$\{j\}$
\State train with remainingData' and parameter $k$
\State validate on data$\{j\}$ $\rightarrow$ fit$_j$
\EndFor
\State fit for parameter $k$ is the mean of the fit$_j$
\EndFor
\State choose the best parameter $k$
\State train with remainingData and the best parameter
\State validate on data$\{i\}$ $\rightarrow$ fit$_i$
\EndFor
\State overall fit is the mean of the fit$_i$
\end{algorithmic}
\caption{Nested 10-fold Cross Validation with parameter optimization}
\label{alg:cv}
\end{algorithm}
\section{Experimental design}
\label{mm:design}
The data used for this work were mainly recorded by Farid Shiman, Nerea Irastorza-Landa, and Andrea Sarasola-Sanz for their work (\cite{Shiman15},\cite{Sarasola15}). We were allowed to use them for further analysis.\\
There were 9 right-handed subjects (3 female, 6 male) with an average age of 25 (variance 6.67, minimum 20, maximum 28). All tasks were performed with the dominant right hand.\\
The task was a center-out reaching task to one of four targets (see Figure~\ref{fig:experimentalDesign}), while 32-channel EEG, at least%
\footnote{\texttt{'AbdPolLo', 'Biceps', 'Triceps', 'FrontDelt', 'MidDelt'} and \texttt{'BackDelt'} were recorded for every subject, others only in some. Only the 6 channels tracked in every session were used.} %
6-channel surface EMG and 7-DOF kinematics were recorded.
\begin{figure}
\centering
\includegraphics{experimentalDesign.jpg}
\caption{Center-out reaching task with four color-coded targets}
\label{fig:experimentalDesign}
\end{figure}
Of the kinematic information tracked, we only used the position ($x,y$) and the angle ($\theta$, rotation around the $z$-axis) of the hand.\\
Only complete sessions were used in our analysis to ensure better comparability.\\
One session consists of 5 runs with 40 trials each. The trials were separated by resting phases of varying length (2-3\,s, randomly assigned). Each trial began with an auditory cue specifying the target for this trial, which was chosen randomly but with equal frequency. This leads to 50 reaches to each target per session.
After the cue the participants were instructed to \qq{perform the movement and return to the starting position at a comfortable pace but within 4 seconds}.\footnote{\cite{Shiman15}}\\
For each subject there were 4 to 6 sessions, each recorded on a different day. In total, there were 255 runs in 51 sessions. Each session was analyzed independently as one continuous recording.
\section{Data Acquisition}
\subsection{Loading of data}
The data recorded with BCI2000 (\cite{Schalk04}) can be loaded into \matlab{} with a specific \texttt{.mex} file. The corresponding \texttt{.mex} files are available precompiled from BCI2000 for several platforms (Windows, macOS, Linux).\\
We load the signal together with the corresponding state data and the parameters (see Algorithm~\ref{alg:load_bcidat}).
\begin{algorithm}
\begin{algorithmic}
\State [signal, states, params] = load\_bcidat('dataFile.dat');
\end{algorithmic}
\caption{Usage of \texttt{load\_bcidat}}
\label{alg:load_bcidat}
\end{algorithm}
\subsubsection{Signal}
The signal is loaded as a matrix with 41 channels (see Table~\ref{tab:channelNames}). By default all values are integers, %TODO: representing what?
but they can also be loaded as floating point values representing microvolts. Since the representation should not make any difference when analyzing the spectrum, we use the smaller integer representation. %TODO: check
\begin{table}
\centering
\begin{tabular}{c|c|l}
Number & Name &\ Meaning\\\hline
1 & Fp1&\ cf. Figure~\ref{fig:10-20}\\
2&Fp2&\ cf. Figure~\ref{fig:10-20}\\
3&F7&\ cf. Figure~\ref{fig:10-20}\\
4&F3&\ cf. Figure~\ref{fig:10-20}\\
5&Fz&\ cf. Figure~\ref{fig:10-20}\\
6&F4&\ cf. Figure~\ref{fig:10-20}\\
7&F8&\ cf. Figure~\ref{fig:10-20}\\
8&FC5&\ cf. Figure~\ref{fig:10-20}\\
9&FC1&\ cf. Figure~\ref{fig:10-20}\\
10&FC2&\ cf. Figure~\ref{fig:10-20}\\
11&FC6&\ cf. Figure~\ref{fig:10-20}\\
12&T7&\ cf. Figure~\ref{fig:10-20}\\
13&C3&\ cf. Figure~\ref{fig:10-20}\\
14&Cz&\ cf. Figure~\ref{fig:10-20}\\
15&C4&\ cf. Figure~\ref{fig:10-20}\\
16&T8&\ cf. Figure~\ref{fig:10-20}\\
17&TP9&\ cf. Figure~\ref{fig:10-20}\\
18&CP5&\ cf. Figure~\ref{fig:10-20}\\
19&CP1&\ cf. Figure~\ref{fig:10-20}\\
20&CP2&\ cf. Figure~\ref{fig:10-20}\\
21&CP6&\ cf. Figure~\ref{fig:10-20}\\
22&TP10&\ cf. Figure~\ref{fig:10-20}\\
23&P7&\ cf. Figure~\ref{fig:10-20}\\
24&P3&\ cf. Figure~\ref{fig:10-20}\\
25&Pz&\ cf. Figure~\ref{fig:10-20}\\
26&P4&\ cf. Figure~\ref{fig:10-20}\\
27&P8&\ cf. Figure~\ref{fig:10-20}\\
28&PO9&\ cf. Figure~\ref{fig:10-20}\\
29&O1&\ cf. Figure~\ref{fig:10-20}\\
30&Oz&\ cf. Figure~\ref{fig:10-20}\\
31&O2&\ cf. Figure~\ref{fig:10-20}\\
32&PO10&\ cf. Figure~\ref{fig:10-20}\\
33&AbdPolLo&\ abductor pollicis longus\\
34&Biceps&\ long head of the biceps\\
35&Triceps&\ the external head of triceps\\
36&FrontDelt&\ anterior portion of deltoid\\
37&MidDelt&\ lateral portion of deltoid\\
38&BackDelt&\ posterior portion of deltoid,\\
&&\ teres minor and infraspinatus muscles\\
39&HEOG&\ \textit{not used}\\
40&Synchro&\ Synchronization trigger for kinematic recording\\
41&Zero&\ Zero reference\\
\end{tabular}
\caption{Channel numbers and names}
\label{tab:channelNames}
\end{table}
\subsubsection{States}
The main information contained in the \texttt{states} \matlab{} \texttt{struct} is the currently presented stimulus. The \texttt{struct} has the same length as the signal, so for every entry in the signal there is corresponding state information.\\
Some adjustments were necessary, since the stimulus did not match the actual movements (cf. Section~\ref{mm:newClass}).
\subsubsection{Parameters}
All settings of the BCI2000 recording are stored in and loaded from the \texttt{parameters}.\\
Examples are the names of the channels, the random seed used by BCI2000, and the sounds, meanings and durations of the different stimuli.
\subsection{Filtering}
For filtering we use second-order Butterworth filters.
For standard EEG we use:
\begin{enumerate}
\item Highpass at \texttt{minEEGFreq} (usually 2\,Hz)
\item Lowpass at \texttt{maxEEGFreq} (usually 49\,Hz)
\item Bandstop at 48-52\,Hz and 148-152\,Hz to eliminate (possibly remaining) power line noise and its first harmonic
\end{enumerate}
For low frequencies we use:
\begin{enumerate}
\item Highpass at 0.01\,Hz
\item Lowpass at 1\,Hz
\end{enumerate}
For EMG data we use:
\begin{enumerate}
\item Highpass at 2\,Hz to eliminate movement artifacts
\item Bandstop at 48-52\,Hz and 148-152\,Hz to eliminate (possibly remaining) power line noise and its first harmonic
\end{enumerate}
For filtering we use \matlab{}'s \texttt{filtfilt} function, which applies each filter forward and backward and thus avoids phase shifts from the multiple filtering steps.
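A minimal \matlab{} sketch of the EEG filter chain described above is given in Algorithm~\ref{alg:filterSketch}; the sampling rate is a placeholder that must be taken from the recording parameters, and the second bandstop at 148-152\,Hz is handled analogously.
\begin{algorithm}
\begin{algorithmic}
\State fs = 500; \Comment{sampling rate in Hz (placeholder)}
\State [bh, ah] = butter(2, 2/(fs/2), 'high'); \Comment{highpass at minEEGFreq = 2 Hz}
\State [bl, al] = butter(2, 49/(fs/2), 'low'); \Comment{lowpass at maxEEGFreq = 49 Hz}
\State [bs, as] = butter(2, [48 52]/(fs/2), 'stop'); \Comment{bandstop around the 50 Hz power line}
\State x = filtfilt(bh, ah, x);
\State x = filtfilt(bl, al, x);
\State x = filtfilt(bs, as, x); \Comment{zero-phase filtering of signal x}
\end{algorithmic}
\caption{EEG filter chain with Butterworth filters (sketch)}
\label{alg:filterSketch}
\end{algorithm}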
\subsection{Windowing}
To process continuous EEG or EMG data it is necessary to define windows. These windows are shifted over the data to obtain discrete feature values for the subsequent steps.\\
As defaults we chose for EEG
\begin{itemize}
\item a window length of $1\,s$
\item a shift of $200\,ms$
\end{itemize}
and for EMG
\begin{itemize}
\item a window length of $0.2\,s$
\item a shift of $50\,ms$.
\end{itemize}
For re-synchronization we pass the shift to every function that needs to align EMG and EEG data and operate there on the \emph{time from start}.
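A minimal \matlab{} sketch of such a sliding window for one channel \texttt{x} is given in Algorithm~\ref{alg:windowSketch}; \texttt{someFeature} is a placeholder for the feature computed per window (e.g. PSD or waveform length).
\begin{algorithm}
\begin{algorithmic}
\State winLen = round(1.0*fs); shiftLen = round(0.2*fs); \Comment{1 s window and 200 ms shift for EEG}
\State nWin = floor((length(x) - winLen)/shiftLen) + 1; \Comment{number of windows that fit into the signal}
\For{$w\gets 1,...,$ nWin}
\State idx = (w-1)*shiftLen + (1:winLen); \Comment{sample indices of window $w$}
\State feature(w) = someFeature(x(idx)); \Comment{placeholder feature computation}
\EndFor
\end{algorithmic}
\caption{Sliding window over a continuous signal (sketch)}
\label{alg:windowSketch}
\end{algorithm}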
\subsection{Waveform length}
For the representation of the EMG data we used the waveform length as described by Phinyomark et al. (\cite{Phinyomark12}), which is a measure of the amount of change within the window. It is calculated as the sum of the absolute differences of consecutive samples in the time window:
$$\sum\limits_{i=2}^N \left| x_i-x_{i-1}\right|,$$
where $N$ is the number of samples in the time window and $x_i$ are the samples themselves.\\
Intuitively, it describes the length of the path a needle would travel when plotting the EMG on a continuously moving strip of paper.
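In \matlab{} this is a one-liner for a window vector \texttt{seg} (a placeholder name), as sketched in Algorithm~\ref{alg:wlSketch}.
\begin{algorithm}
\begin{algorithmic}
\State wl = sum(abs(diff(seg))); \Comment{waveform length of the samples in the window}
\end{algorithmic}
\caption{Waveform length of one window (sketch)}
\label{alg:wlSketch}
\end{algorithm}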
\subsection{Classification}
\label{mm:newClass}
Very poor results when classifying EMG into movement vs. rest made us inspect the data more closely: the actual movement started a considerable time after the stimulus.\\
To address this problem we re-labeled the data according to the actual movements (cf. Appendix~\ref{code:classify}). To decide whether the subject is moving, we compare the mean EMG activity (from the waveform length) to a threshold (10,000 by default).\\
If there is movement, we assign the class that occurred most often during the preceding second as the current task. If there is movement while the stimulus indicates rest, we assign the last active stimulus.\\
In addition, we remove the second before movement onset from the data (labeled as -1) and (optionally) assign the half second before movement onset to the following stimulus.\\
Finally, we smooth the labels by assigning to each time step the class that occurs most often from one second before to one second after it.\\
As a last step we adjust the length of the stimulus vector to the length of the EEG data.\\
In the further analysis we only use data labeled differently from -1, i.e. data that are either clearly rest or clearly movement.
\subsection{Synergies}
We generate synergies using the different options for dimensionality reduction (cf. Section~\ref{back:synergies}).\\
The EMG data (as waveform length) is reduced to $n$ dimensions, where $n$ is the desired number of synergies.\\
With PCA this is done by taking the first $n$ components; the EMG data is then projected into the $n$-dimensional space spanned by these components.\\
NMF is computed with $n$ as the inner dimension; the EMG data is then multiplied with the resulting matrix to transform it into $n$-dimensional data.\\
Finally, autoencoders are trained with a hidden layer of size $n$, and afterwards the EMG data is encoded with the learned weights. This is equivalent to taking the hidden-layer activity for the corresponding time step.\\
Since the synergies are generated from the EMG, they have the same size in the first dimension\footnote{depending only on the window size and shift of the EMG data and the recording duration} and $n$ in the second.
\subsection{Kinematics}
We used the kinematic data either as movement or as position. The position was recorded directly; the movement is the first temporal derivative of the position.\\
The kinematic recording was started after the EEG recording. The synchronization channel\footnote{cf. Table~\ref{tab:channelNames}} shows a peak when the kinematic recording starts, which was used to align the movement with the EEG and EMG data. In addition, we adapted the kinematic data to the EMG window and shift so that corresponding data can be used for the same time step. This was done by summing all differences within the time window (for movement) or by calculating the mean position.\\
This data has the same length as the EMG and synergy data but only three features per time step, since we used only the position ($x$, $y$) and orientation ($\theta$) of the hand and no information about the fingers.
\section{Data Analysis}
%\subsection{Overview}
Figure~\ref{fig:overview} shows which quantities are predicted from which data.
\begin{figure}
\centering
\input{pictures/overview.tikz}
\caption{Overview: What is predicted?}
\label{fig:overview}
\end{figure}
\subsection{Classification}
Classification can be done in different ways. The first approach is discriminating movement from rest. This is done by training an SVM and testing its results with 10-fold cross validation. We do this with EMG, EEG and LF data. For EMG this setting is trivial, since the EMG was the basis of the labeling (cf. Section~\ref{mm:newClass}).\\
In a second step we try to discriminate movements in the different directions, again with an SVM trained on EMG, EEG or LF data, respectively. The fit of the model is again assessed with 10-fold cross validation.\\
For unbiased classification it is necessary to train with equally sized classes. For that purpose, and to lower computation time, we only take 250 samples per class by default (cf. the comparison of results in Section~\ref{res:maxPerClass}).\\
The parameter $c$ of the support vector machine is either found with an additional step of cross validation or set to 1 (results in Section~\ref{res:maxC}).\\
To learn about biased classifications and about similar brain activity for different movements, the confusion matrix is created (cf. Section~\ref{mm:cm}).\\
The resulting accuracy is the mean over the 10 cross validation folds.
\subsection{Predicting Kinematics}
The prediction of kinematics is done with ridge regression. Since there are more kinematic samples than EEG samples, we use the mean position or movement per EEG time step and predict these.\\
The regression is done in 10-fold cross validation for each dimension ($x,y,\theta$). The parameter $\lambda$ (cf. Section~\ref{mm:ridge}) is determined with an additional cross validation. The resulting correlation is the mean correlation over the 10 folds, each with its best parameter $\lambda$. The correlations for the dimensions are calculated independently.
\subsection{Predicting Synergies}
Predicting synergies works similarly to predicting kinematics. The only change is that the synergies may have a different dimensionality. Each synergy is predicted from all EEG data as one output, and the correlation is calculated for each synergy.
\subsection{Predicting EMG}
When predicting EMG data we use the sum of the waveform length over the time window corresponding to the EEG data. Since the waveform length itself is a sum over the samples, this is a reasonable approach.\\
The remaining steps are the same as for kinematics and synergies.
\subsection{EEG offset}
Since it takes some time for commands to travel from the brain to the muscles, we introduced a variable offset between the EEG and the other data. The offset has to be given as a number of shifts, so by default it is a multiple of 200\,ms.\\
Results are given in Section~\ref{res:offset}.
\subsection{Prediction with interim step}
All these analyses only show the accuracy of a single step. To get a measure of the overall performance, we predict synergies from EEG and use them to predict EMG or kinematics, respectively.\\
The resulting correlation is the mean of the correlations of a 10-fold cross validation in which the same unknown synergies are predicted from EEG and then used to predict EMG or kinematics. Thus there is no correction step in between, and EMG or kinematics are predicted from EEG via the synergies. Here, different methods of determining the synergies are also compared (see Section~\ref{res:differentSynergiesVia}).
\subsection{Multiple Sessions}
We analyze each session (cf. Section~\ref{mm:design}) independently, which means there are 51 independent results for each analysis. These are used for the statistical evaluation in Chapter~\ref{chp:results}.\\
Some analyses are only performed on a single session; where this is the case, it is clearly stated. %TODO: check, out if not necessary
\subsection{Evaluation}
%TODO: evaluation