The goal of this project is to create a user-friendly API for an integrated workflow that performs typical text mining, natural language processing, and topic modelling tasks. It would cover the complete topic modelling process:
- Loading data, both from text files on a local filesystem and by harvesting texts from the internet (stylo/tm)
- Data transformation, including calculating word frequencies (stylo/tm)
- Text stemming and tagging (koRpus/SnowballC)
- Data subsampling (stylo)
- Topic modelling (mallet/LDA/topicmodels)
- Visualizations (wordcloud/networkD3/ggplot2)
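As a rough illustration of how such steps could be chained behind a single entry point, here is a minimal, language-neutral sketch written in Python using only the standard library. It is not the API of stylo, tm, or any other package listed above; every function name (`tokenize`, `word_frequencies`, `subsample`, `run_pipeline`) is hypothetical, and it covers only the transformation and subsampling steps.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Return the relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def subsample(tokens, size):
    """Split tokens into consecutive, non-overlapping samples of `size` tokens.

    A trailing remainder shorter than `size` is dropped.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def run_pipeline(texts, sample_size):
    """Chain the steps: tokenize, subsample, then frequencies per sample."""
    result = {}
    for name, text in texts.items():
        tokens = tokenize(text)
        for k, sample in enumerate(subsample(tokens, sample_size)):
            result[f"{name}_{k}"] = word_frequencies(sample)
    return result


# Illustrative input only; a real workflow would load files or harvested texts.
texts = {"doc": "The cat sat on the mat. The dog sat on the log."}
tables = run_pipeline(texts, sample_size=6)
```

The point of the sketch is the design: each stage takes the previous stage's output, so a single wrapper call can hide the choice of underlying package from the user.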
In the first stage, I plan to integrate a few of the packages mentioned above. Future development would build a package that integrates more tools, in a similar fashion to what caret does for predictive modelling. The Google Summer of Code project is intended to be just the outset of a larger project.