Hi there. In this page, I use the R programming language to do text analysis and text mining to obtain wordcounts and wordclouds from the Dr. Seuss book Green Eggs & Ham. The topic of bigrams (two-word phrases) is not covered here this time around.


 


 

Source: http://mommyneedsabottle.com/wp-content/uploads/2015/08/GreenEggs_Ad.png

 


Introduction & Getting Started

 

One of the very first children’s books I was introduced to was Dr. Seuss - Green Eggs & Ham. I would read this book a lot at the doctor’s office when I was young.

A .txt version of the book can be found online through this link. Since there is no title or weird characters, there is no need for data cleaning in R.

Wordcounts and wordclouds are created in the tidy way as explained in the (online) book Text Mining with R: A Tidy Approach by Julia Silge and David Robinson.

 

Loading Libraries In R

 

The R packages of interest are dplyr, tidyr, ggplot2, tidytext, wordcloud and gridExtra.

 

# Load libraries into R:
# Install packages with install.packages("pkg_name")

library(dplyr)     # Data manipulation
library(tidyr)     # Data wrangling
library(ggplot2)   # Data visualization
library(tidytext)  # For text mining and analysis
library(wordcloud) # Wordcloud capabilities
library(gridExtra) # Multiple plots in one

 


Wordcounts & Wordclouds In Green Eggs & Ham

 

With the tidytext package in R, you can obtain wordcounts from pieces of text. To be able to generate wordclouds, you need the wordcloud R package. My other text mining posts mention creating wordclouds with the tm package, but in this case I am using the tidytext and wordcloud packages.

There is a text version of the Green Eggs & Ham book online here. This text file is the book itself, so there is no need for data cleaning. To read in the file, use the readLines() function in R.
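The read-in step looks roughly like the sketch below. The file name is just a placeholder for wherever the downloaded .txt copy of the book is saved (a URL to the online text would also work with readLines()).

# Read the book into a one-column tibble (the file path is a placeholder):
greenEggs_book <- tibble(text = readLines("green_eggs_and_ham.txt"))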

 

# 1) Wordcounts in Green Eggs and Ham

# Preview the first 15 lines of the book:
head(greenEggs_book, n = 15)

## # A tibble: 15 x 1
##    text                         
##    <chr>                        
##  1 I am Sam                     
##  2 Sam I am                     
##  3 ""                           
##  4 That Sam-I-am!               
##  5 That Sam-I-am!               
##  6 I do not like that Sam-I-am! 
##  7 ""                           
##  8 "Do you like "               
##  9 green eggs and ham?          
## 10 I do not like them, Sam-I-am.
## 11 I do not like                
## 12 green eggs and ham.          
## 13 ""                           
## 14 "Would you like them "       
## 15 here or there?               

From the tidytext package, the unnest_tokens() function converts the text so that each row contains a single word.

 

# Unnest tokens: have each word in a row:
greenEggs_words <- greenEggs_book %>% unnest_tokens(output = word, input = text)

# Preview with head() function:
head(greenEggs_words, n = 10)

## # A tibble: 10 x 1
##    word 
##    <chr>
##  1 i    
##  2 am   
##  3 sam  
##  4 sam  
##  5 i    
##  6 am   
##  7 that 
##  8 sam  
##  9 i    
## 10 am   
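Note that unnest_tokens() also lowercases each token and strips punctuation by default, which is why Sam shows up as sam above. If you ever want to keep the original capitalization, the to_lower argument can be switched off (not used in this post):

# Optional: keep original capitalization (not used in this post):
greenEggs_book %>% unnest_tokens(output = word, input = text, to_lower = FALSE)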

Normally, I want to remove stopwords from the text as they carry very little meaning on their own. This time around, I will obtain word counts in Green Eggs & Ham with the stopwords filtered out, as well as the word counts of the original book itself. To filter out the stop words, the anti_join() function from R’s dplyr package is used. The variable associated with the filtered text is greenEggs_words_filt.

 

# Remove English stop words from Green Eggs and Ham:
# Stop words include me, you, for, myself, he, she
greenEggs_words_filt <- greenEggs_words %>% anti_join(stop_words)

## Joining, by = "word"
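If you are curious about what anti_join() is matching against, tidytext ships a stop_words tibble (with the columns word and lexicon). A quick peek, just for illustration:

# Peek at the stop word lexicons bundled with tidytext:
head(stop_words)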

With the use of dplyr’s pipe operator (%>%) and its count() function, counts for each word can be obtained for the filtered case and the non-filtered case.

 

# Word counts in Green Eggs and Ham (stopwords kept):
greenEggs_wordcounts <- greenEggs_words %>% count(word, sort = TRUE)

# Word counts in Green Eggs and Ham (stopwords removed):
greenEggs_wordcounts_filt <- greenEggs_words_filt %>% count(word, sort = TRUE)

# Print top 15 words:
head(greenEggs_wordcounts, n = 15)

## # A tibble: 15 x 2
##    word      n
##    <chr> <int>
##  1 i        84
##  2 not      84
##  3 them     61
##  4 a        59
##  5 like     45
##  6 in       41
##  7 do       37
##  8 you      34
##  9 would    26
## 10 and      25
## 11 eat      24
## 12 will     21
## 13 with     19
## 14 sam      18
## 15 am       15

head(greenEggs_wordcounts_filt, n = 15)

## # A tibble: 15 x 2
##    word      n
##    <chr> <int>
##  1 eat      24
##  2 sam      18
##  3 eggs     11
##  4 green    11
##  5 ham      10
##  6 train     9
##  7 house     8
##  8 mouse     8
##  9 box       7
## 10 car       7
## 11 dark      7
## 12 fox       7
## 13 tree      6
## 14 goat      4
## 15 rain      4
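For readers newer to dplyr, count(word, sort = TRUE) is shorthand for grouping, tallying and ordering. A minimal equivalent spelled out with the individual verbs (not part of the pipeline above) would be:

# Equivalent to count(word, sort = TRUE), written out with separate dplyr verbs:
greenEggs_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))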


Generating The Plots

 

Case One: Wordcounts Plot and Wordcloud with Stopwords

Plots are produced with the use of R’s ggplot2 data visualization package. The plots are saved into variables which will later be passed to the grid.arrange() function for multiple plots.

From the unfiltered version, I take the top 15 most common words in the Green Eggs & Ham book. The results from the plot are not too interesting besides the name Sam.

 

## a) Plot & Wordcloud With StopWords

# Bar graph (top 15 words):
green_wordcounts_plot <- greenEggs_wordcounts[1:15, ] %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col(fill = "#807af5") + 
  coord_flip() + 
  labs(x = "Word ", y = " Count ", 
       title = "The 15 Most Common Words In Green Eggs and Ham ") + 
  geom_text(aes(label = n), hjust = 1, colour = "white", fontface = "bold", size = 3.5) + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.ticks.x = element_blank(), 
        axis.title.x = element_text(face = "bold", colour = "darkblue", size = 12), 
        axis.title.y = element_text(face = "bold", colour = "darkblue", size = 12)) 

# Print plot:
green_wordcounts_plot

[Plot: The 15 Most Common Words In Green Eggs and Ham]

 

Most of the preprocessing has already been done with the dplyr functions. Generating the wordcloud does not take much extra code.

 

# Wordcounts wordcloud:
# min.freq drops words appearing fewer than 2 times, max.words caps the cloud at 100 words,
# and rot.per sets the share of words rotated 90 degrees.
greenEggs_wordcounts %>% 
  with(wordcloud(words = word, freq = n, min.freq = 2, max.words = 100,
                 random.order = FALSE, rot.per = 0.35, colors = rainbow(30)))

[Wordcloud: Green Eggs and Ham word counts, stopwords included]

 

Case Two: Wordcounts Plot and Wordcloud without Stopwords

 

The code is not much different from case one. In this case, the filtered version of the word counts is used.

 

## b) Plot & Wordcloud With No StopWords

# Bar graph (top 15 words):
green_wordcounts_plot_filt <- greenEggs_wordcounts_filt[1:15, ] %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col(fill = "#d9232f") + 
  coord_flip() + 
  labs(x = "Word ", y = " Count ", 
       title = "The 15 Most Common Words In Green Eggs and Ham (No Stopwords) ") + 
  geom_text(aes(label = n), hjust = 1, colour = "white", fontface = "bold", size = 3.5) + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.ticks.x = element_blank(), 
        axis.title.x = element_text(face = "bold", colour = "darkblue", size = 12), 
        axis.title.y = element_text(face = "bold", colour = "darkblue", size = 12)) 

# Print plot:
green_wordcounts_plot_filt

[Plot: The 15 Most Common Words In Green Eggs and Ham (No Stopwords)]

 

From the results, the top words include:

 

eat
sam
green
eggs
ham
mouse
house
fox

 

These top words indicate that the book has something to do with Sam, eggs, ham, eating and the colour green.

Generating the wordcloud in R with the wordcloud package is not much different from the first case.

 

# Wordcounts wordcloud (stopwords removed):
greenEggs_wordcounts_filt %>% 
  with(wordcloud(words = word, freq = n, min.freq = 2, max.words = 100,
                 random.order = FALSE, rot.per = 0.35, colors = rainbow(30)))

[Wordcloud: Green Eggs and Ham word counts, stopwords removed]

 


Combining The Bar Plots Into One Graph With grid.arrange()

The horizontal bar graphs from earlier were saved into variables. From the gridExtra package in R, the two variables containing the plots can be passed to the grid.arrange() function to create a plot with multiple graphs.


 

## Bar graphs together
grid.arrange(green_wordcounts_plot, green_wordcounts_plot_filt, ncol = 2)

[Combined plot: bar graphs with and without stopwords side by side]
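If you want to keep the combined figure as an image file, one option (not part of the original code above) is to open a graphics device around the grid.arrange() call. The file name below is just a placeholder:

# Save the combined figure to a .png file (file name is a placeholder):
png("greenEggs_wordcount_plots.png", width = 900, height = 450)
grid.arrange(green_wordcounts_plot, green_wordcounts_plot_filt, ncol = 2)
dev.off()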

 

There is a clear and definite difference between the graphs when English stopwords such as I, the, of, will and with are removed. The results carry more meaning.

 


References & Resources

 

R Graphics Cookbook by Winston Chang

Text Mining with R: A Tidy Approach by Julia Silge and David Robinson