The goal of this project is to create a user-friendly API for an integrated workflow covering typical text mining, natural language processing, and topic modelling tasks. It would cover the complete topic modelling process:

  • Loading data, both from text files on a local filesystem and by harvesting texts from the internet (via the packages stylo/tm)
  • Data transformation, including calculating word frequencies (stylo/tm)
  • Text stemming and tagging (koRpus/SnowballC)
  • Data subsampling (stylo)
  • Topic modelling (mallet/lda/topicmodels)
  • Visualizations (wordcloud/networkD3/ggplot2)
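
To make the intended shape of such a workflow concrete, the sketch below chains three of the stages listed above (loading, transformation into word frequencies, and subsampling) behind a single entry point. It is only an illustration under stated assumptions: it uses plain Python rather than the R packages named above, and every function name in it (`tokenize`, `subsample`, `word_frequencies`, `run_pipeline`) is hypothetical and not part of stylo, tm, or any other package. Stemming, topic modelling, and visualization are left out.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def subsample(tokens, size, seed=0):
    """Draw up to `size` tokens at random, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(tokens, min(size, len(tokens)))


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def run_pipeline(texts, sample_size=50):
    """Run the tokenize -> subsample -> count stages for each named text."""
    results = {}
    for name, text in texts.items():
        tokens = tokenize(text)
        sample = subsample(tokens, sample_size)
        results[name] = word_frequencies(sample)
    return results


# Hypothetical in-memory corpus standing in for files loaded from disk.
corpus = {
    "doc1": "The cat sat on the mat. The cat slept.",
    "doc2": "Dogs chase cats; cats chase mice.",
}
frequencies = run_pipeline(corpus, sample_size=100)
```

The point of the single `run_pipeline` entry point is that a user does not have to wire the individual stages together by hand, which is the kind of convenience the proposed API aims for.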

In the first stage, I plan to integrate a few of the packages mentioned above. Future development would build a package that integrates more tools, in a fashion similar to what caret does for predictive modelling. The Google Summer of Code work is planned as just the outset of a larger project.




  • Tomasz Melcer
  • Maciej Eder