DBSCAN Clustering in Mahout
- Mentors: Trevor Grant
- Organization: Apache Software Foundation
Clustering is an important data mining technique with wide applications in medicine, biology, social network analysis, and image segmentation, to name a few. Density-based clustering is an intuitive and efficient way to group similar objects together, and DBSCAN is a state-of-the-art density-based clustering algorithm. However, a straightforward DBSCAN implementation has quadratic time complexity, which makes it unsuitable for Big Data applications. I propose to implement a distributed R-tree-based DBSCAN algorithm in Mahout with a complexity of O(n log n) and then, after due discussion, an optimized version of the distributed DBSCAN algorithm.
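To make the complexity argument concrete, here is a minimal single-machine sketch of DBSCAN in Python. It is not Mahout code and not the proposed distributed design; the names `region_query`, `dbscan`, `eps`, and `min_pts` are chosen here for illustration only. The neighborhood lookup is a linear scan over all points, and it runs once per point, which is where the quadratic cost comes from.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i].

    This is a linear scan, so each call costs O(n). A spatial index
    such as an R-tree would replace this function (assumption: that is
    the role the index plays in the proposal).
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        pending = list(neighbors)
        while pending:
            j = pending.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                pending.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Because `region_query` is called once per point and scans every point each time, the total work is O(n²). Swapping the scan for an index-backed range query is what brings the average cost down toward O(n log n).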