Clustering is an important data mining technique with wide applications in medicine, biology, social network analysis, and image segmentation, to name a few. Density-based clustering is an intuitive and efficient way to group similar objects together, and DBSCAN is a state-of-the-art density-based clustering algorithm. Without a spatial index, however, DBSCAN has quadratic time complexity, because every neighborhood query scans the whole data set, which makes it unsuitable for Big Data applications. I propose to implement a distributed R-Tree based DBSCAN algorithm in Mahout, which has a complexity of O(n log n). After due discussion, I will then implement an optimized version of the distributed DBSCAN algorithm.
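
To make the quadratic cost concrete, here is a minimal single-machine sketch of DBSCAN in plain Java. It is not Mahout code and not the proposed distributed implementation; the class name NaiveDbscan, the method names, and the parameters eps and minPts are assumptions chosen for illustration. The neighborhood query is a linear scan, so n queries cost O(n^2) in total; an R-Tree based version would replace only that scan with an index lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
public class NaiveDbscan {
    static final int UNVISITED = 0;
    static final int NOISE = -1;

    // Linear scan over all points: O(n) per query. This is the step a
    // spatial index such as an R-Tree would speed up.
    static List<Integer> regionQuery(double[][] points, int p, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int q = 0; q < points.length; q++) {
            double sum = 0.0;
            for (int d = 0; d < points[p].length; d++) {
                double diff = points[p][d] - points[q][d];
                sum += diff * diff;
            }
            if (sum <= eps * eps) {
                neighbors.add(q); // includes p itself
            }
        }
        return neighbors;
    }

    // Returns one label per point: cluster ids start at 1, NOISE is -1.
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length]; // all UNVISITED initially
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (labels[p] != UNVISITED) {
                continue;
            }
            List<Integer> seeds = regionQuery(points, p, eps);
            if (seeds.size() < minPts) {
                labels[p] = NOISE; // may later be claimed as a border point
                continue;
            }
            clusterId++;
            labels[p] = clusterId;
            // Expand the cluster; the seed list grows while we iterate.
            for (int i = 0; i < seeds.size(); i++) {
                int q = seeds.get(i);
                if (labels[q] == NOISE) {
                    labels[q] = clusterId; // border point, not expanded
                }
                if (labels[q] != UNVISITED) {
                    continue;
                }
                labels[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) {
                    seeds.addAll(qNeighbors); // q is a core point
                }
            }
        }
        return labels;
    }
}
```

Each point triggers at most one call to regionQuery, so the running time is dominated by n linear scans. Calling NaiveDbscan.cluster(points, 1.5, 3) on a small array of 2-D points returns one integer label per point, with -1 marking noise.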


A. S. Aditya Sarma


  • Trevor Grant