diff --git a/07_final_assignment/paper/main.tex b/07_final_assignment/paper/main.tex
index 93c1525..053a2f4 100644
--- a/07_final_assignment/paper/main.tex
+++ b/07_final_assignment/paper/main.tex
@@ -70,19 +70,19 @@ \subsection{Experimental Code}
-The simulation code is split in three parts: the creation of the trials, the learning of the monkey and the analysis of the learning results. We implemented the experiment using the \emph{R Programming Language} \parencite{Rcore}. The complete code used both for simulation as well as for the data analysis is found in the appendix of this paper.
+The simulation code is split into three parts: the creation of the trials, the learning of the monkey, and the analysis of the learning results. We implemented the experiment using the \emph{R Programming Language} \parencite{Rcore}. The complete code used both for the simulation and for the data analysis can be found in the appendix of this paper.
 \subsubsection{Trial creation}
 The algorithm follows as closely as possible the structure defined in the reference paper and supplemental materials and described above. The word-nonword corpus is the one used by the monkey DAN in \cite{Grainger245}. Unfortunately, we had no knowledge about some aspects of the original experiment; in such cases we had to make our own decisions. E.g., trials are always created in blocks of 100.
-To ensure this constraint the new word block part can be replaced by learned words if there is no new word left in the corpus and vice versa if there's no word learned the learned word part will be filled by the new word.
+To ensure this constraint, the new-word part of a block is filled with learned words if there is no new word left in the corpus; conversely, if no word has been learned yet, the learned-word part is filled with new words.
 The new words, learned words and nonwords are picked randomly out of their respective pools, with repetition allowed.
 \subsubsection{Monkey learning}
-After a block the presented new word can be marked as learned by the definition in \cite{Grainger245}. The rescorla wagner learner therefore has to learn a block, return the guesses and then continue learning with the next block.
-This is not easily possible with \emph{ndl} \parencite{Rndl} where for we implemented a rescorla wagner learner ourself.
+After a block the presented new word can be marked as learned according to the definition in \cite{Grainger245}. The Rescorla-Wagner learner therefore has to learn a block, return its guesses and then continue learning with the next block.
+Since this is not easily possible with \emph{ndl} \parencite{Rndl}, we implemented a Rescorla-Wagner learner ourselves.
 Since preliminary experiments showed that the monkeys performed with very high accuracies (>90\%), we decided to introduce a random parameter $ r $ in the experiment, defined as the fraction of times the monkey would make a random guess instead of an experience-based prediction. When trying to account for this discrepancy only by lowering the learning rates, we encountered a restriction in the form of having to use very small floating-point numbers, which might have led to unforeseeable behavior. Therefore, we chose to use the random parameter instead.
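+For illustration, the following is a minimal sketch of such a block-wise learner with the random-guess parameter $ r $ (this is not the code from the appendix: all function, variable and column names are placeholders, cues are assumed to be the single letters of the presented string, and choosing the outcome with the higher activation is only one possible decision rule):
+\begin{verbatim}
+# W starts as an empty weight matrix:
+#   W <- matrix(0, 0, 2, dimnames = list(NULL, c("word", "nonword")))
+rw_learn_block <- function(W, block, alpha, beta, lambda = 1, r = 0.65) {
+  outcomes <- colnames(W)                      # "word" and "nonword"
+  guesses  <- character(nrow(block))
+  for (i in seq_len(nrow(block))) {
+    cues <- unique(strsplit(block$stimulus[i], "")[[1]])  # letter cues
+    new  <- setdiff(cues, rownames(W))
+    if (length(new) > 0) {                     # grow the weight matrix
+      W <- rbind(W, matrix(0, length(new), ncol(W),
+                           dimnames = list(new, outcomes)))
+    }
+    act <- colSums(W[cues, , drop = FALSE])    # activation of both outcomes
+    if (runif(1) < r) {
+      guesses[i] <- sample(outcomes, 1)        # random guess (fraction r)
+    } else {
+      guesses[i] <- outcomes[which.max(act)]   # experience-based prediction
+    }
+    present <- as.numeric(outcomes == block$type[i])
+    W[cues, ] <- W[cues, ] +                   # Rescorla-Wagner update
+      alpha * beta * rep(lambda * present - act, each = length(cues))
+  }
+  list(weights = W, guesses = guesses)
+}
+\end{verbatim}
+The learner is called once per block of 100 trials; the returned weight matrix is passed on to the next block, and the returned guesses are used to decide which new words count as learned.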
@@ -95,28 +95,27 @@ \subsubsection{Random Parameter}\label{sec:randparam}
 The random parameter $ r $ was set to 0.65, which proved to be a reasonable value in preliminary experiments. That means that in 65\% of the cases the monkey would guess either word or nonword with equal probability.
 Therefore, the maximum possible performance $ p_{max} $ is: $$ p_{max} = 1 - \frac{r}{2} = 0.675$$ In other words, the maximum possible performance is no longer 1.0 (for a very intelligent monkey) but rather restricted by $ r $. If a monkey's performance is slightly better than $ p_{max} $, this can only be due to chance.
-\subsubsection{Alpha and Beta} How fast a Cue-Outcome connection is learned or unlearned depends on a learn rate which determines which fraction of the activation difference will be added or removed per learn event. The learn rate in a event is the multiplication of the learn rate of the cue $\alpha$ and the learn rate of the outcome $\beta$.
-In our case we keep the learn rate constant over all cues and outcomes.
-Both $ \alpha $ and $ \beta $ were our independent variables which we manipulated over the course of the experiments. We gathered data for every possible combination of $ \alpha $ and $ \beta $ values within an equally spaced range from 0.0 to 0.3. A total of 15 values for each $ \alpha $ and $ \beta $ were combined to $ 15 \cdot 15 = 225 $ possible combinations. Since $ \alpha $ and $ \beta $ were internally multiplied to a single value, we expected the outcome to be more or less symmetrical due to the commutativity of the multiplication operation and therefore calculated each combination of $ \alpha $ and $ \beta $ only once, which we used as a trick to improve the overall runtime. Therefore, $\sum_{i=1}^{15}i = 120$ combinations remained to be explored.
+\subsubsection{Alpha and Beta} How fast a Cue-Outcome connection is learned or unlearned depends on a learning rate which determines what fraction of the activation difference is added or removed per learning event. The learning rate in an event is the product of the learning rate of the cue, $\alpha$, and the learning rate of the outcome, $\beta$.
+In our case we keep the learning rate constant over all cues and outcomes within an experiment.
+Both $ \alpha $ and $ \beta $ were our independent variables, which we manipulated over the course of the different experiments. We gathered data for every possible combination of $ \alpha $ and $ \beta $ values within an equally spaced range from 0.0 to 0.3. A total of 15 values for each of $ \alpha $ and $ \beta $ yields $ 15 \cdot 15 = 225 $ possible combinations. Since $ \alpha $ and $ \beta $ are internally multiplied into a single value, we expected the outcome to be more or less symmetrical due to the commutativity of multiplication and therefore calculated each unordered combination of $ \alpha $ and $ \beta $ only once, which roughly halved the overall runtime. Therefore, $\sum_{i=1}^{15}i = 120$ combinations remained to be explored.
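+For clarity, the explored grid can be written down in a few lines (a sketch, not the appendix code):
+\begin{verbatim}
+rates <- seq(0, 0.3, length.out = 15)              # 15 equally spaced values
+grid  <- expand.grid(alpha = rates, beta = rates)  # 225 ordered pairs
+grid  <- grid[grid$alpha <= grid$beta, ]           # one of each commutative pair
+nrow(grid)                                         # 120 combinations actually run
+\end{verbatim}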
 \subsubsection{Lambda} The independent variable $\lambda$ represents the maximum activation in the Rescorla-Wagner model and therefore limits the learning.
-It makes it possible to modulate saliency of a stimulus. A more salient stimulus could not only have higher learning rates but also a higher maximum activation. In the original experiment the stimulus were same colored words and nonwords with four letters on an equally colored background. We assume the single words and nonwords are equally salient and keep therefore $\lambda$ constant to a value of 1.
+It makes it possible to modulate the saliency of a stimulus: a more salient stimulus could not only have higher learning rates but also a higher maximum activation.
+In the original experiment the stimuli were equally colored words and nonwords with four letters on an equally colored background. We therefore have no reason to assume the single words and nonwords are anything other than equally salient and thus keep $\lambda$ constant at a value of 1.
 \subsection{Running Parallelized Experiments}
 Running an experiment with a single combination of $ \alpha $ and $ \beta $ on a normal desktop computer took about 75 minutes. Therefore, the parameter space one could explore within a reasonable amount of time was quite restricted. We decided to write a parallelized version of the code to reduce the overall runtime. Using the R packages foreach \parencite{Rforeach}, parallel \parencite{Rparallel} and doParallel \parencite{RdoParallel}, we restructured the experiment. Since conflicts can easily occur when more than one core is trying to access a shared data structure at the same time, we implemented a parallelized version that is able to run without even containing critical sections. Instead, each worker has its own data structure, a .txt file, and in the end the results are harvested and combined. This version of the experiment ran on a cluster with 15 cores, each performing a total of eight experiments. Altogether, 120 combinations of $ \alpha $ and $ \beta $ were explored overnight, which would have taken about 150 hours (!) in a non-parallelized version.
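+The structure of the parallel driver is sketched below; \texttt{run\_experiment} stands in for the function that runs one full simulation for a given pair of learning rates, and the file names are placeholders (again, the actual code is in the appendix):
+\begin{verbatim}
+library(foreach); library(parallel); library(doParallel)
+
+cl <- makeCluster(15)                  # one worker per core
+registerDoParallel(cl)
+foreach(i = seq_len(nrow(grid))) %dopar% {
+  res <- run_experiment(alpha = grid$alpha[i], beta = grid$beta[i])
+  write.table(res, sprintf("results_%03d.txt", i), row.names = FALSE)
+}
+stopCluster(cl)
+
+# harvest: combine the per-worker files afterwards
+files   <- list.files(pattern = "^results_.*\\.txt$")
+results <- do.call(rbind, lapply(files, read.table, header = TRUE))
+\end{verbatim}
+Because every iteration writes to its own file, no critical section is needed; the files are simply read back in and combined once all workers have finished.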
 \section{Results}
-The number of words learned by the actual monkeys ranged between 87 and 308. With the chosen range for $\alpha$ and $\beta$, we obtained between 275 and 307 learned words, however, it is important to note that we only presented 307 different words, so the model reached maximum learning potential even for small learn rates (see \autoref{fig:numwords}).
+The number of words learned by the actual monkeys ranged between 87 and 308. With the chosen range for $\alpha$ and $\beta$, we obtained between 275 and 307 learned words; however, it is important to note that we only presented 307 different words, so the model reached its maximum learning potential even for small learning rates (see \autoref{fig:numwords}).
 The general accuracy for the real monkeys lay between 71.14\% and 79.81\%, while our accuracies moved between 60\% and 68\% with random parameter $r=0.65$, depending on the learning rates used.
-Because the absolute accuracy depends heavily on the random parameter (see above), we could easily match increase the accuracy with modifying it. A more interesting property is the range of word accuracy which is $.0867$ for the monkeys and $.08$ for the simulation.
+Because the absolute accuracy depends heavily on the random parameter (see above), we could easily have increased the accuracy to match by modifying it. A more interesting property is the range of word accuracies, which is $.0867$ for the monkeys and $.08$ for the simulation.
 The order and random selection of events did not have a large influence, since the variance of the results of all our experiments is quite small. It seems that 50k trials are enough to marginalize the influence of the random structure in the presented data.
-Using non-liner regression models (GAM) we find non linear ($df>1$) main effects for learn rates predicting the word or nonword accuracy with fixed random parameter without an interactive effect.
-Because of the multiplicatory commutativity one of $\alpha$, $\beta$ is enough for explanation as learn rates here. In \autoref{fig:accuracy}, second row we see the accuracies growing fast and converging to a plateau in one dimension with almost no effect of the other learn rate dimension.
+Using nonlinear regression models (GAMs) we find nonlinear ($df>1$) main effects of the learning rates predicting the word or nonword accuracy, with the random parameter held fixed and without an interaction effect.
+As is clearly visible in the second row of \autoref{fig:accuracy}, the accuracies grow fast and converge to a plateau along one dimension, with almost no effect of the other learning-rate dimension.
 It seems that $ \alpha $ as a predictor has a very large effect (i.e., $ \alpha $ explains a lot of the variance that is in the data), whereas $ \beta $ has only a minor influence. We attribute this mainly to the multiplicatory commutativity, since $ \alpha $ and $ \beta $ are internally multiplied to a single variable.
 The complete result data is attached in the appendix of this paper.
-%TODO we need a section explaining the results of the plots. What does that mean? -> small influence of parameters as a major finding, however there ARE effects -> perhaps explain that along with the GAM, where we definitely have to explain what the predictors included in the model are: Otherwise it's black magic. I think it is crucial that we explain our findings (= our contribution): We explored the whole parameter space (which others probably couldn't), and we found this and that influence.
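+The predictors in these models are simply the two learning rates, entered as smooth main effects; in the notation of the \emph{mgcv} package (used here purely for illustration, with placeholder data-frame and column names) the fitted models look as follows:
+\begin{verbatim}
+library(mgcv)
+# one accuracy value per (alpha, beta) combination, r fixed at 0.65
+m_main <- gam(word_accuracy ~ s(alpha) + s(beta), data = results)
+m_int  <- gam(word_accuracy ~ s(alpha) + s(beta) + ti(alpha, beta),
+              data = results)
+summary(m_main)                        # edf > 1 indicates a nonlinear effect
+anova(m_main, m_int, test = "Chisq")   # test for an interaction effect
+\end{verbatim}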
 \begin{figure*}[ht]
 \onecolumn
@@ -148,12 +147,14 @@ \section{Discussion}
-We meticulously simulated the learning experience that monkeys were exposed to in an experiment by \textcite{Grainger245} by systematically exploring the parameter space of Rescorla-Wagner equations \parencite{rescorla1972theory}.\\
-Since preliminary results indicated that without restricting the model performance to a ceiling, the model performs way too accurate compared to the original monkeys, we introduced a random choice in some cases. We were thereby able to model the performance, and to show that the influence of the alpha and beta values was surprisingly small.
+We meticulously simulated the learning experience that monkeys were exposed to in an experiment by \textcite{Grainger245} by systematically exploring the parameter space of the Rescorla-Wagner equations \parencite{rescorla1972theory}, examining every possible combination of the learning rates $ \alpha $ and $ \beta $ within a broad range.\\
+Since preliminary results indicated that, without restricting the model performance to a ceiling, the model performs far too accurately compared to the original monkeys, we introduced a random choice in some cases. We were thereby able to model the performance, and showed that the influence of the $ \alpha $ and $ \beta $ values was surprisingly small.
-As a limitation, we have to note that the definition of a word being learned in our impression isn't perfect: It is defined as the moment a word had 80\% accuracy of recognition. We would expect this definition to become problematic when a word was 'almost' learned, but not quite reaching the 80\%. In the next block with that word, the learning would be a lot quicker than for an actually new word. It might be a good idea to monitor and save the knowledge level concerning one specific word an measuring the actual number of repetitions a word needed to become known.
+To come back to the original experiment by \textcite{Grainger245}, we know that the 'original' monkeys performed with different success rates. Interestingly, from our experiments we can conclude that $ \alpha $ and $ \beta $ did not have enough influence to account for such different success rates, whereas the random parameter that we introduced is indeed able to restrict the performance to any value between a random choice (50\%) and more than 90\% accuracy (no random choices at all). Therefore, one could say that the different monkeys employed different random parameters (i.e., the fraction of times they made a random guess differed). Still, it is not clear what exactly the random parameter corresponds to within a monkey's mind. One could argue that the monkeys differed in motivation (i.e., some decided randomly more often than others), but this remains mere speculation.
-Concerning the code, we decided to write it as clear as possible and not as fast as possible. We therefore assume that the overall runtime can be enhanced quite a bit. That would enable future researchers to re-run the experiments with more words to see if there are changes in the later learning process which we now could not explore. The mode of presentation could be reassessed, as well as whether the number of letters changes the behavior of the model.
+As a limitation, we have to note that the definition of a word being learned is, in our impression, not perfect: a word counts as learned from the moment it reaches 80\% recognition accuracy. We would expect this definition to become problematic when a word was 'almost' learned but did not quite reach the 80\%. In the next block containing that word, the learning would be a lot quicker than for an actually new word. It might be a good idea to monitor and save the knowledge level for each specific word and to measure the actual number of repetitions a word needs to become known.
+
+Concerning the code, we decided to write it as clearly as possible rather than as fast as possible in cases where we had to make a tradeoff. We therefore assume that the overall runtime can be improved quite a bit. That would enable future researchers to re-run the experiments with more words to see if there are changes in the later learning process which we could not explore here. The mode of presentation could be reassessed, as well as whether the number of letters changes the behavior of the model.
 Furthermore, different models could be used in the experiment to see if other models fit the results of the actual monkeys even better. It would also be interesting to explore the influence of different values for $ \lambda $: Although this would go beyond a simulation of the original experiment and therefore fulfill a different purpose, one would be able to explore the parameter space not only in 2D ($ \alpha $ and $ \beta $) but in 3D. A lot remains to be discovered!