diff --git a/02_word_freq/src/summary.md b/02_word_freq/src/summary.md
index 3e7bc39..2406288 100644
--- a/02_word_freq/src/summary.md
+++ b/02_word_freq/src/summary.md
@@ -13,6 +13,12 @@
 - \lstset{ numbers=left, title=\lstname, frame=single }
 ---
 
+To create a word corpus from a text like *Grimms Fairy Tales*, we remove punctuation, convert the text to lower case, replace every space with a newline, and remove empty lines, so that we get a list with exactly one lower-case word on each line. All of this can be done with the Unix tool *tr*. For UTF-8 text sources created on Windows, such as those from Project Gutenberg, we additionally remove the byte order mark, if present, with the Unix tool *sed*.
+The word list is created by the bash script *src/wordify.sh*, which is listed in the *Appendix*.
+
+The R script *src/plotfreq.r* reads a list of words from standard input, counts and ranks them, and creates the plots shown in the figure.
+
+We apply these two scripts in the *makefile*, which also compiles this summary with pandoc from Markdown via LaTeX to PDF.
 
 [plot]: res/grimms_fairy_tales.pdf
 ![Vocabulary distribution in Grimms Fairy Tales (www.gutenberg.org/ebooks/2591) \label{mylabel}][plot]
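
The word-list step described in the added paragraph could look roughly like the sketch below. It is a hypothetical illustration, not the contents of *src/wordify.sh*: the function name `wordify` and the exact `tr` and `sed` invocations are assumptions, and the `\xHH` escapes rely on GNU sed. It also uses a final `sed` call to drop any remaining empty line, which the summary attributes to *tr*.

```shell
# Hypothetical sketch of a word-list filter (assumed name: wordify).
# Reads text on standard input, writes one lower-case word per line.
wordify() {
  # Strip a UTF-8 byte order mark (EF BB BF) from the first line, if present.
  # Assumption: GNU sed, which understands \xHH escapes; LC_ALL=C makes it byte-oriented.
  LC_ALL=C sed '1s/^\xEF\xBB\xBF//' |
    # Remove punctuation characters.
    tr -d '[:punct:]' |
    # Convert upper case to lower case.
    tr '[:upper:]' '[:lower:]' |
    # Turn every run of whitespace (spaces, tabs, CR, LF) into a single newline.
    tr -s '[:space:]' '\n' |
    # Drop an empty line left over when the input starts with whitespace.
    sed '/^$/d'
}

# Example use (the file names here are placeholders):
#   wordify < input.txt > words.txt
```

Squeezing whitespace with `tr -s` handles Windows line endings as a side effect, because the carriage return belongs to the `[:space:]` class.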