
---
title: "Linguistics for Cognitive Science: Assignment 2"
author:
- Felicia Saar (3818590)
- David-Elias Künstle (3822829)
date: 5.11.2015
documentclass: article
geometry: "margin=1in"
urlcolor: black
header-includes:
- \usepackage{listings}
- \lstset{ numbers=left, title=\lstname, frame=single }
---

To create a word corpus from a text such as Grimms' Fairy Tales, we remove punctuation, convert the text to lower case, replace every space with a newline, and remove empty lines. The result is a list with exactly one lower-case word per line. All of this can be done with the Unix tool tr. In addition, for UTF-8 text sources created on Windows, as used by Project Gutenberg, we remove the byte order mark, if there is one, with the Unix tool sed. The word list is created by the bash script src/wordify.sh, which you can find in the Appendix.
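The steps above can be sketched as a short pipeline. This is only an illustration under assumptions, not the authors' src/wordify.sh (that script is listed in the Appendix): the function name `wordify` is hypothetical, the `\xHH` escapes in the sed expression assume GNU sed, and the character classes are assumed to cover ASCII text only.

```shell
# Hypothetical sketch of the word-list steps; not the actual src/wordify.sh.
wordify() {
    LC_ALL=C sed '1s/^\xEF\xBB\xBF//' |  # drop a UTF-8 byte order mark on line 1 (GNU sed)
    tr -d '[:punct:]' |                   # remove punctuation
    tr '[:upper:]' '[:lower:]' |          # convert to lower case
    tr -s '[:space:]' '\n' |              # whitespace becomes one newline per run
    sed '/^$/d'                           # remove any remaining empty line
}

# Usage: prints "the", "frogking", "the", "frog", one word per line.
printf 'The Frog-King, the frog!\n' | wordify
```

Note that deleting punctuation also joins hyphenated words ("Frog-King" becomes "frogking"); whether that is desirable depends on how the corpus is meant to be counted.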

The R script src/plotfreq.r reads a list of words from the standard input stream, counts and ranks them, and creates the plots shown in the figure.
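The counting and ranking step can be illustrated without R. The following is a shell stand-in under assumptions, not the logic of src/plotfreq.r: the function name `rank_words` is hypothetical, ties are broken alphabetically and still receive distinct ranks, and plotting is omitted.

```shell
# Hypothetical stand-in for the count-and-rank step; the authors use R for this.
rank_words() {
    LC_ALL=C sort |                       # bring identical words together
    uniq -c |                             # count each distinct word
    LC_ALL=C sort -k1,1nr -k2 |           # highest count first, ties alphabetical
    awk '{ print NR "\t" $1 "\t" $2 }'    # columns: rank, frequency, word
}

# Usage: the first output line is "1<TAB>3<TAB>the".
printf 'the\nfrogking\nthe\nfrog\nthe\nend\n' | rank_words
```

The resulting rank and frequency columns are the kind of table from which a rank-frequency plot can be drawn.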

We apply these two scripts in the makefile, which also compiles this summary with pandoc from Markdown via LaTeX to PDF.

Figure: Vocabulary distribution in Grimms' Fairy Tales (www.gutenberg.org/ebooks/2591) \label{mylabel}

\newpage

APPENDIX

\lstinputlisting[language=bash]{src/wordify.sh}
\lstinputlisting[language=R]{src/plotfreq.r}
\lstinputlisting[language=bash]{makefile}