\documentclass[a4paper, 12pt,longtable]{apa6}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[american]{babel}
\usepackage[style=apa,sortcites=true,sorting=nyt,backend=biber]{biblatex}
\usepackage{hyperref}
\DeclareLanguageMapping{american}{american-apa}
\addbibresource{references.bib}
\usepackage{listings}
\usepackage{color}
\usepackage{todonotes}
\title{Simulation of \textcite{Grainger245} with Rescorla-Wagner equations}
\shorttitle{Grainger et al. (2012) simulation with RW equations}
\author{R. Geirhos (3827808), K. Grethen (3899962), D.-E. Künstle (3822829),\\ A.-K. Mahlke (3897867), J. Maier (3879869), F. Saar (3818590), M. Weller (3837283)}
\leftheader{Geirhos, Grethen, Künstle, Mahlke, Maier, Saar, Weller}
\affiliation{Linguistics for Cognitive Science Course, University of Tübingen}
\abstract{In \citeyear{Grainger245}, \citeauthor{Grainger245} conducted a word-learning experiment with baboons. Interestingly, the monkeys were able to discriminate words from non-words with high accuracy. We simulate this learning experience with the Rescorla-Wagner learning model \parencite{rescorla1972theory}.
Running 120 parallelized experiments on a cluster, which cover a $15 \times 15$ grid of learning rates, we show that the plain Rescorla-Wagner model reaches even higher accuracies than the monkeys; the learning parameters by themselves cannot slow learning down enough to make it comparable to the monkeys' learning. We therefore introduced a random parameter that makes the model guess randomly in 65\% of the trials. That way, we successfully model the monkeys' performance.}
\lstset{ %
basicstyle=\footnotesize, % the size of the fonts that are used for the code
captionpos=b, % sets the caption-position to bottom
frame=single, % adds a frame around the code
keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
keywordstyle=\color{blue}, % keyword style
numbers=left, % where to put the line-numbers; possible values are (none, left, right)
numbersep=5pt, % how far the line-numbers are from the code
rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
showstringspaces=false, % underline spaces within strings only
showtabs=false, % show tabs within strings adding particular underscores
stepnumber=2, % the step between two line-numbers. If it's 1, each line will be numbered
tabsize=2, % sets default tabsize to 2 spaces
title=\lstname % show the filename of files included with \lstinputlisting; also try caption instead of title
}
\begin{document}
\maketitle
\section{Introduction}
Computational models have long been used to represent real events and phenomena. They help scientists formulate their hypotheses more precisely and test them by applying the model to different situations. Models are also used to make predictions about future behavior and events.
Linguistics is a field in which computational models are becoming increasingly important, especially in the study of language learning. The modeling presented in this paper belongs to that context.
In \citeyear{Grainger245}, \citeauthor{Grainger245} tested baboons on their ability to learn and recognize words, as opposed to non-words, when these were presented in written form on a computer screen. Their goal was to show that orthographic information can be processed without knowledge of its semantic component and hence that it is possible to learn to `read' (to a certain extent) without prior language knowledge.
To achieve that goal, they trained baboons (who had, of course, no knowledge of human language) to discriminate words from non-words presented on a computer screen. The baboons could independently start blocks of 100 trials, each consisting of 50 non-word trials, 25 trials with already known words, and 25 trials that all showed the same unknown word, in random order. They responded by pressing one of two buttons on the screen and were rewarded with food every time they answered correctly. A word was regarded as known once 80\% of the responses to it within one block of trials were correct. It was then added to the group of already known words and used as such in subsequent trials. From the monkeys' point of view, the difference between words and non-words was that words appeared repeatedly, whereas any single non-word was shown only a few times.
The results show that response accuracy for both words and nonwords rose above chance very quickly, and that overall word accuracy was slightly higher than nonword accuracy.
In this paper, we model the results of \textcite{Grainger245} using Naive Discriminative Learning (NDL), an approach to modeling learning (and also the name of an R package) that is based on the Rescorla-Wagner model \parencite{rescorla1972theory} and the equilibrium equations by Danks (2003).
\subsection{Naive Discrimination Learning}
Since the first experiments in modern learning theory by Ivan Pavlov, it has become increasingly clear that learning does not only consist of forming associations between co-occurring cues and outcomes, but also enables subjects to discriminate which cues predict the presence and which the absence of an outcome \parencite{baayen2015abstraction}.
A naive discrimination learning (\emph{ndl}) model, a two-layer network implementation of the established learning rules described by \textcite{rescorla1972theory}, captures this and has been applied successfully to language \parencite[e.g.,][]{baayen2016comprehension}.
In an attempt to simulate as closely as possible the learning experience of the monkeys in \textcite{Grainger245}, the paper that inspired this work, we model the learning of (virtual) monkeys and record their performance.
\section{Simulations}
\subsection{Stimuli}
As stimuli, we used the words given in the supplementary material of the original paper. The list contains 307 four-letter words and 7832 four-letter non-words. In every trial, the word or non-word was presented split into overlapping trigrams (for example, for the word \emph{atom}: \#at, ato, tom, om\#), one trigram after the other, as proposed by \textcite{baayen2016comprehension}.
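As a rough illustration of this cue representation, the following R sketch splits a string into overlapping letter trigrams. The helper name \texttt{trigrams} is an assumption made for this example and need not match the code in the appendix.

\begin{lstlisting}[language=R, title={Sketch: trigram cues}]
# Hypothetical helper (name assumed for illustration):
# pad the string with the boundary marker "#" and
# return all overlapping letter trigrams.
trigrams <- function(word) {
  padded <- paste0("#", word, "#")
  n <- nchar(padded)
  substring(padded, 1:(n - 2), 3:n)
}

trigrams("atom")  # "#at" "ato" "tom" "om#"
\end{lstlisting}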
\subsection{Experimental Code}
The simulation code is split into three parts: the creation of the trials, the learning of the monkey, and the analysis of the learning results. We implemented the experiment in the \emph{R} programming language \parencite{Rcore}. The complete code used both for the simulation and for the data analysis can be found in the appendix of this paper.
\subsubsection{Trial creation}
The algorithm follows as closely as possible the structure defined in the reference paper and its supplemental materials, as described above.
The word-nonword corpus is the one used by the monkey DAN in \textcite{Grainger245}.
Unfortunately, some aspects of the original experiment were unknown to us; in such cases we had to make our own decisions. For example, trials are always created in blocks of 100.
To satisfy this constraint, the new-word part of a block is filled with learned words if no new word is left in the corpus; conversely, if no word has been learned yet, the learned-word part is filled with the new word.
The new words, learned words and nonwords are picked randomly from their respective pools, with repetition allowed.
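The block structure described above could be sketched in R as follows. This is a minimal sketch and not the code from the appendix: the function name \texttt{make\_block}, its arguments, and the convention of passing \texttt{NA} when no new word is left are assumptions made for illustration, and the nonwords in the usage example are arbitrary placeholder strings rather than corpus items.

\begin{lstlisting}[language=R, title={Sketch: creating one block of 100 trials}]
# new_word:    the word currently being learned,
#              or NA if the corpus is exhausted
# known_words: character vector of learned words
# nonwords:    character vector of nonwords
make_block <- function(new_word, known_words,
                       nonwords) {
  if (is.na(new_word)) {
    # no new word left: fill with learned words
    new_part <- sample(known_words, 25,
                       replace = TRUE)
  } else {
    new_part <- rep(new_word, 25)
  }
  if (length(known_words) == 0) {
    # nothing learned yet: fill with the new word
    known_part <- rep(new_word, 25)
  } else {
    known_part <- sample(known_words, 25,
                         replace = TRUE)
  }
  stimuli <- c(new_part, known_part,
               sample(nonwords, 50, replace = TRUE))
  is_word <- rep(c(TRUE, FALSE), each = 50)
  ord <- sample(100)  # random trial order
  data.frame(stimulus = stimuli[ord],
             is_word = is_word[ord],
             stringsAsFactors = FALSE)
}

# Usage with placeholder nonwords:
block <- make_block("atom", character(0),
                    c("xqzt", "bnrw"))
\end{lstlisting}

Each call returns exactly 100 trials, half of them word trials, regardless of which fallback is used.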
\subsubsection{Monkey learning}
After each block, the presented new word can be marked as learned according to the definition in \textcite{Grainger245}. The Rescorla-Wagner learner therefore has to learn one block, return its guesses, and then continue learning with the next block.
Since this is not easily possible with \emph{ndl} \parencite{Rndl}, we implemented a Rescorla-Wagner learner ourselves.
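A blockwise learner of this kind might look like the following R sketch. It is an illustration under stated assumptions rather than the implementation in the appendix: the names \texttt{rw\_update} and \texttt{learn\_block}, the weight matrix layout (cues as rows, the outcomes \texttt{word} and \texttt{nonword} as columns), and the tie-breaking rule are all our own choices for this example.

\begin{lstlisting}[language=R, title={Sketch: blockwise Rescorla-Wagner learner}]
# One Rescorla-Wagner update for a single trial.
# W: weight matrix with cue names as row names and
#    outcome names as column names (assumed layout)
rw_update <- function(W, cues, outcome, alpha, beta,
                      lambda = 1) {
  cues <- unique(cues)
  act <- colSums(W[cues, , drop = FALSE])
  target <- ifelse(colnames(W) == outcome, lambda, 0)
  delta <- alpha * beta * (target - act)
  W[cues, ] <- sweep(W[cues, , drop = FALSE],
                     2, delta, "+")
  W
}

# Learn one block and return the guesses together
# with the weights, which carry over to the next block.
learn_block <- function(W, block_cues, block_outcomes,
                        alpha, beta) {
  guesses <- character(length(block_cues))
  for (t in seq_along(block_cues)) {
    cues <- unique(block_cues[[t]])
    act <- colSums(W[cues, , drop = FALSE])
    # guess before feedback; ties go to column 1
    guesses[t] <- names(which.max(act))
    W <- rw_update(W, cues, block_outcomes[t],
                   alpha, beta)
  }
  list(W = W, guesses = guesses)
}
\end{lstlisting}

Returning the weight matrix together with the guesses is what allows the caller to decide after each block whether the new word counts as learned before starting the next block.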
Since preliminary experiments showed that the simulated monkeys performed with very high accuracies ($>90\%$), we decided to introduce a random parameter $ r $, defined as the fraction of trials in which the monkey makes a random guess instead of an experience-based prediction. When we tried to account for the discrepancy solely by lowering the learning rates, we ran into limitations of floating-point arithmetic, which might have led to unforeseeable behavior. Therefore, we chose to use the random parameter instead.
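The role of $ r $ in a single trial can be sketched in R as follows. The function name \texttt{respond} and the representation of the activations as a named vector are assumptions made for this illustration.

\begin{lstlisting}[language=R, title={Sketch: response with random guessing}]
# act: named activations, e.g.
#      c(word = 0.9, nonword = 0.1)
# r:   fraction of trials answered by guessing
respond <- function(act, r) {
  if (runif(1) < r) {
    # random guess, both answers equally likely
    sample(names(act), 1)
  } else {
    # experience-based prediction
    names(act)[which.max(act)]
  }
}

respond(c(word = 0.9, nonword = 0.1), r = 0.65)
\end{lstlisting}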
\subsubsection{Data analysis}
To compare the accuracies obtained with different learning rates, we used not only standard tools such as linear regression models (\emph{lm}) and \emph{anova} \parencite{Rcore}, but also more flexible nonlinear generalized additive models (GAMs) provided by the package \emph{mgcv} \parencite{Rmgcv}, which we compared and visualized with \emph{itsadug} \parencite{Ritsadug}.
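To make the kind of model comparison concrete, the following R sketch fits a GAM with two smooth main effects and one with an additional interaction term. It uses only \emph{mgcv} and runs on synthetic toy data generated inside the example; these numbers are not the results of this paper, and the variable names are assumptions.

\begin{lstlisting}[language=R, title={Sketch: GAM with and without interaction (toy data)}]
library(mgcv)

# Toy data (NOT the results of this paper):
# accuracy saturates in alpha, is flat in beta.
set.seed(1)
rates <- seq(0.02, 0.3, length.out = 15)
d <- expand.grid(alpha = rates, beta = rates)
d$acc <- 0.60 + 0.08 * (1 - exp(-20 * d$alpha)) +
  rnorm(nrow(d), sd = 0.005)

m_main <- gam(acc ~ s(alpha) + s(beta),
              data = d, method = "ML")
m_int <- gam(acc ~ s(alpha) + s(beta) +
               ti(alpha, beta),
             data = d, method = "ML")

# edf clearly above 1 suggests a nonlinear effect
summary(m_main)$s.table
# does the interaction term improve the fit?
AIC(m_main, m_int)
\end{lstlisting}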
\subsection{Choice of Parameters}
\subsubsection{Number of Trials}
The six monkeys in the original experiment completed different numbers of trials (min: 43,041, max: 61,142, mean: 52,812). For the sake of simplicity, we presented exactly 50,000 trials in each of our experiments.
\subsubsection{Random Parameter}\label{sec:randparam} The random parameter $ r $ was set to 0.65, which proved to be a reasonable value in preliminary experiments. That is, in 65\% of the trials the monkey guesses word or nonword with equal probability. Therefore, the maximum possible performance $ p_{max} $ is:
\[ p_{max} = 1 - \frac{r}{2} = 0.675 \]
In other words, the maximum possible performance is no longer 1.0 (for a very intelligent monkey) but is limited by $ r $. If a monkey's performance is slightly better than $ p_{max} $, this can only be due to chance.
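To spell out where this bound comes from, consider a short derivation under a simplifying assumption: let $ a $ denote the probability that the experience-based prediction is correct (a symbol introduced only for this illustration). In a fraction $ r $ of the trials the answer is a guess that is correct half of the time, and in the remaining fraction $ 1 - r $ it is correct with probability $ a $, so the expected accuracy is
\[ p = (1 - r)\,a + \frac{r}{2}. \]
For a perfect learner ($ a = 1 $) this reduces to $ p_{max} = 1 - r/2 $. For example, with $ r = 0.65 $ and $ a = 0.9 $ one obtains $ p = 0.35 \cdot 0.9 + 0.325 = 0.64 $.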
\subsubsection{Alpha and Beta} How fast a cue-outcome connection is learned or unlearned depends on a learning rate, which determines what fraction of the activation difference is added or removed per learning event. The learning rate in an event is the product of the learning rate of the cue, $\alpha$, and the learning rate of the outcome, $\beta$.
In our case we keep the learning rate constant over all cues and outcomes within an experiment.
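For reference, the Rescorla-Wagner update can be written in its standard form with constant learning rates. The symbols $ V_{ij} $ (weight from cue $ i $ to outcome $ j $), $ C_t $ (set of cues present in trial $ t $) and $ \lambda_j^{(t)} $ are notation introduced here for illustration. For every cue $ i \in C_t $,
\[ \Delta V_{ij} = \alpha \beta \Bigl( \lambda_j^{(t)} - \sum_{k \in C_t} V_{kj} \Bigr), \qquad \lambda_j^{(t)} = \begin{cases} \lambda & \text{if outcome } j \text{ is present in trial } t, \\ 0 & \text{otherwise,} \end{cases} \]
while the weights of absent cues remain unchanged. As an idealized example, assume that the four trigram cues of a word occur in no other stimulus and that all weights start at zero. The summed activation $ A_n $ of the correct outcome after $ n $ presentations then follows
\[ A_n = \lambda \bigl( 1 - (1 - 4 \alpha \beta)^n \bigr), \]
so that for $ \alpha = \beta = 0.3 $ and $ \lambda = 1 $ the activation already reaches $ 1 - 0.64^5 \approx 0.89 $ after five presentations. Real stimuli share trigrams, so this is only an indication of how quickly the weights can converge.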
Both $ \alpha $ and $ \beta $ were our independent variables which we manipulated over the course of the different experiments. We gathered data for every possible combination of $ \alpha $ and $ \beta $ values within an equally spaced range from 0.0 to 0.3. A total of 15 values for each $ \alpha $ and $ \beta $ were combined to $ 15 \cdot 15 = 225 $ possible combinations. Since $ \alpha $ and $ \beta $ were internally multiplied to a single value, we expected the outcome to be more or less symmetrical due to the commutativity of the multiplication operation and therefore calculated each combination of $ \alpha $ and $ \beta $ only once, which we used as a trick to improve the overall runtime. Therefore, $\sum_{i=1}^{15}i = 120$ combinations remained to be explored.
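The counting argument can be checked with a few lines of R. This is a minimal sketch that assumes the 15 values include both endpoints of the range, which the text does not state explicitly; the variable names are our own.

\begin{lstlisting}[language=R, title={Sketch: parameter grid and symmetric half}]
# Assumption: 15 equally spaced values including
# both endpoints 0.0 and 0.3.
rates <- seq(0, 0.3, length.out = 15)
grid <- expand.grid(alpha = rates, beta = rates)

# Keep one member of each symmetric pair
# (alpha, beta) / (beta, alpha), plus the diagonal.
half <- grid[grid$alpha <= grid$beta, ]

nrow(grid)  # 225
nrow(half)  # 120
\end{lstlisting}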
\subsubsection{Lambda}
The parameter $\lambda$ represents the maximum activation in the Rescorla-Wagner model and therefore limits the learning.
It makes it possible to modulate the salience of a stimulus: a more salient stimulus could have not only higher learning rates but also a higher maximum activation. In the original experiment, the stimuli were equally colored four-letter words and nonwords on an equally colored background. We therefore have no reason to assume that the individual words and nonwords are anything other than equally salient, and thus keep $\lambda$ constant at a value of 1.
\subsection{Running Parallelized Experiments}
Running an experiment with a single combination of $ \alpha $ and $ \beta $ on a normal desktop computer took about 75 minutes. Therefore, the parameter space one could explore within a reasonable amount of time was quite restricted. We decided to write a parallelized version of the code to reduce the overall runtime. Using the R packages foreach \parencite{Rforeach}, parallel \parencite{Rparallel} and doParallel \parencite{RdoParallel}, we restructured the experiment. Since conflicts can easily occur when more than one core tries to access a shared data structure at the same time, we implemented a parallelized version that runs without any critical sections. Instead, each thread has its own data structure, a .txt file, and in the end the results are collected and combined. This version of the experiment ran on a cluster with 15 cores, each performing eight experiments. Altogether, 120 combinations of $ \alpha $ and $ \beta $ were explored overnight, which would have taken about 150 hours (!) in a non-parallelized version.
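The scheme of separate result files without shared state can be sketched with \emph{foreach} and \emph{doParallel} as follows. This is an illustration under assumptions, not the code that ran on the cluster: \texttt{run\_experiment} is a placeholder that only echoes its parameters, and the file names, the output directory and the number of workers are chosen for the example.

\begin{lstlisting}[language=R, title={Sketch: parallel runs with one result file per chunk}]
library(foreach)
library(doParallel)

# Placeholder for one full simulation (assumption):
# it only echoes its parameters.
run_experiment <- function(alpha, beta) {
  data.frame(alpha = alpha, beta = beta,
             accuracy = NA_real_)
}

rates <- seq(0, 0.3, length.out = 15)
params <- expand.grid(alpha = rates, beta = rates)
params <- params[params$alpha <= params$beta, ]

# 15 chunks of 8 runs; each chunk writes its own file.
chunks <- split(seq_len(nrow(params)),
                rep(1:15, each = 8))
out_dir <- file.path(tempdir(), "rw_results")
dir.create(out_dir, showWarnings = FALSE)

cl <- makeCluster(2)  # number of workers: adjust
registerDoParallel(cl)
files <- foreach(rows = chunks,
                 .combine = c) %dopar% {
  res <- do.call(rbind, lapply(rows, function(i) {
    run_experiment(params$alpha[i], params$beta[i])
  }))
  f <- file.path(out_dir,
                 sprintf("chunk_%03d.txt", rows[1]))
  write.table(res, f, row.names = FALSE)
  f
}
stopCluster(cl)

# Collect the per-chunk files into one data frame.
results <- do.call(rbind,
  lapply(files, read.table, header = TRUE))
\end{lstlisting}

Because every chunk writes only to its own file, no locking is needed, and the results are merged in a single sequential step at the end.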
\section{Results}
The number of words learned by the actual monkeys ranged between 87 and 308. With the chosen range for $\alpha$ and $\beta$, we obtained between 275 and 307 learned words. It is important to note, however, that we only presented 307 different words, so the model reached its maximum learning potential even for small learning rates (see \autoref{fig:numwords}).
The overall accuracy of the real monkeys lay between 71.14\% and 79.81\%, while our accuracies ranged between 60\% and 68\% with random parameter $r=0.65$, depending on the learning rates used.
Because the absolute accuracy depends heavily on the random parameter (see above), we could easily match or increase the accuracy by modifying it. A more interesting property is the range of word accuracy, which is $.0867$ for the monkeys and $.08$ for the simulation. The order and random selection of events did not have a large influence, since the variance of the results across all our experiments is quite small. It seems that 50,000 trials are enough to make the influence of the random structure in the presented data negligible.
Using nonlinear regression models (GAMs), we find nonlinear ($df>1$) main effects of the learning rates on word and nonword accuracy at the fixed random parameter, without an interaction effect.
As is clearly visible in the second row of \autoref{fig:accuracy}, the accuracies grow quickly and converge to a plateau along one dimension, with almost no effect of the other learning-rate dimension. It seems that $ \alpha $ as a predictor has a very large effect (i.e., $ \alpha $ explains much of the variance in the data), whereas $ \beta $ has only a minor influence. We attribute this mainly to the commutativity of multiplication, since $ \alpha $ and $ \beta $ are internally multiplied to a single variable.
The complete result data can be found in the appendix of this paper.
\begin{figure*}[ht]
\onecolumn
\centering
\includegraphics[width=0.9\textwidth]{../plots/plot_accuracy}
\caption{
The top row shows the model's accuracies as a function of the varied parameters $ \alpha $ and $ \beta $.
The second row visualizes the corresponding nonlinear regressions (GAM).
Accuracy appears to approach a maximum as $ \alpha $ and $ \beta $ grow.
The GAM plots show the small influence of one of the parameters.
This indicates that the results might be approximated with just one nonlinear parameter.
}
\label{fig:accuracy}
\twocolumn
\end{figure*}
\begin{figure*}[ht]
\onecolumn
\centering
\includegraphics[width=0.9\textwidth]{../plots/plot_numwords}
\caption{
The left plot shows the raw number of words learned by the model as a function of the varied parameters $ \alpha $ and $ \beta $. The corresponding nonlinear regression plot (middle) does not reflect our initial hypothesis that more words are learned with increasing parameter values.
This does not necessarily mean that the hypothesis is wrong; it may rather be caused by the uninformative data, in which a count of around 305 learned words occurs very frequently while all other counts are rare (right plot).
}
\label{fig:numwords}
\twocolumn
\end{figure*}
\section{Discussion}
We carefully simulated the learning experience that monkeys were exposed to in an experiment by \textcite{Grainger245}, systematically exploring the parameter space of the Rescorla-Wagner equations \parencite{rescorla1972theory} by examining every possible combination of learning rates $ \alpha $ and $ \beta $ within a broad range.
Since preliminary results indicated that, without a ceiling on model performance, the model performs far too accurately compared to the original monkeys, we introduced a random choice in a fixed fraction of trials. We were thereby able to model the performance, and showed that the influence of the $ \alpha $ and $ \beta $ values was surprisingly small.
To come back to the original experiment by \textcite{Grainger245}, we know that the `original' monkeys performed with different success rates. Interestingly, from our experiments we can conclude that $ \alpha $ and $ \beta $ did not have enough influence to account for such different success rates; the random parameter that we introduced, however, is able to restrict the performance to any value in the range between random choice (50\%) and accuracies of more than 90\% (no random choice at all). Therefore, one could say that the different monkeys employed different random parameters (i.e., the fraction of trials in which they made a random guess differed). Still, it is not clear what exactly corresponds to a random parameter within a monkey's mind. One could argue that the monkeys differed in motivation (i.e., some monkeys decided randomly more often than others), but this remains mere speculation.
As a limitation, we have to note that the definition of a word being learned is, in our view, not perfect: a word counts as learned from the moment it is recognized with 80\% accuracy within a block. We would expect this definition to become problematic when a word is `almost' learned but does not quite reach the 80\%. In the next block with that word, learning would be a lot quicker than for an actually new word. It might be a good idea to monitor and save the knowledge level for each specific word and to measure the actual number of repetitions a word needs to become known.
Concerning the code, we decided to write it as clearly as possible rather than as fast as possible in cases where we had to make a tradeoff. We therefore assume that the overall runtime can be reduced quite a bit. That would enable future researchers to re-run the experiments with more words to see whether there are changes in the later learning process, which we could not explore here. The mode of presentation could be reassessed, as well as whether the number of letters changes the behavior of the model.
Furthermore, different models could be used in the experiment to see whether they fit the results of the actual monkeys even better. It would also be interesting to explore the influence of different values for $ \lambda $: although this would go beyond a simulation of the original experiment and therefore serve a different purpose, one would be able to explore the parameter space not only in 2D ($ \alpha $ and $ \beta $) but in 3D. A lot remains to be discovered!
\newpage
\printbibliography{}
\appendix
\onecolumn
\input{result_tables.tex}
\newpage
\lstinputlisting[language=R]{../baboonSimulation.R}
\end{document}