Security is a serious concern today, and anomaly detection is a promising way to address it. As data volumes become extremely large, conventional computing methods can no longer handle them. Running the analysis on the Spark ecosystem is a reasonable alternative, since Spark is built to process data at large scale. However, most of the data is unlabeled, and labeling it is both costly and dependent on domain expertise, which makes supervised learning algorithms impractical. Semi-supervised learning is not a good fit either, because its assumptions hold only for particular data distributions. Unsupervised, clustering-based learning is therefore a good first step towards this problem. In this project, I approach the problem in two phases: retrieving data through Elasticsearch-Hadoop, and building a statistical analysis tool on Spark Streaming that works from a clustering perspective. The goal is to detect patterns in a dataset that behave abnormally.
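To make the clustering perspective concrete, here is a minimal sketch in plain Python rather than Spark. It rests on assumptions that are not part of the project description: k-means is used as the clustering method, centroids are chosen with a farthest-first initialisation, and a record is flagged when its distance to the nearest centroid exceeds a hand-picked threshold. The names `kmeans`, `anomaly_score` and `flag_anomalies` are hypothetical and only illustrate the idea of fitting clusters on unlabeled history and scoring newly arriving records against them.

```python
import math

def init_centroids(points, k):
    """Farthest-first initialisation (an assumption): deterministic and well spread."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on unlabeled points; returns k centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def anomaly_score(point, centroids):
    """Distance from a record to its nearest cluster centre."""
    return min(math.dist(point, c) for c in centroids)

def flag_anomalies(batch, centroids, threshold):
    """Return the records of an incoming batch that lie far from every cluster."""
    return [p for p in batch if anomaly_score(p, centroids) > threshold]

# Unlabeled history forming two tight groups (made-up illustrative data).
history = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids = kmeans(history, k=2)

# A newly arrived micro-batch, scored against the fitted clusters.
incoming = [(0.4, 0.6), (10, 11), (50, 50)]
print(flag_anomalies(incoming, centroids, threshold=3.0))
```

In a Spark Streaming setting the same two steps, fitting cluster centres and scoring each incoming batch, would run on distributed data. The threshold of 3.0 is arbitrary and would need to be derived from the data in practice.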




  • Charitha Elvitigala
  • Ruwan Geeganage
  • Sameera