Publication word cloud

I wanted to play with word clouds for some time. Usually the problem is to obtain a nice corpus. Luckily PubMedWordCloud was recently updated and I had something to play with.

The initial cloud looks not good enough - I did not like how plurals are counted separately. Since I couldn’t clean it up easy enough I ended up going back to the tm package and roll my own script. Unfortunately it’s far from publishable.

Anyway the result is way more to my liking. clean word cloud

My final script looks like this - I still use getAbstracts from PubMedWordCloud.

library(PubMedWordcloud)  
library(tm)
library(wordcloud)
library(magrittr)

pmids <- getPMIDs(author="Clemens Ager",dFrom=2002,dTo=2016,n=30)

abstracts <- getAbstracts(pmids)
corp <- VCorpus(VectorSource(abstracts))

f <- content_transformer(function(x, pattern) gsub(pattern, "", x))
c1 <- tm_map(corp, removePunctuation) %>% 
  tm_map(removeNumbers) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(content_transformer(function(x) gsub("([^si])s\\>", "\\1", x))) %>% 
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removeWords, c("also", "thus", "three", "calu", "hbepc", "gcm", "ptrm")) %>%
  tm_map(stripWhitespace) %>%
  tm_map(f, "\\<[[:alpha:]]{,2}\\>\\W+")


wordcloud(c1, random.order=F, min.freq = 6, 
		  colors = brewer.pal(9,"Set1"),
		  rot.per= 0.3)

I already have some ideas on where to improve - basically I could crawl our pdfs as well. But for today I’m done.