I aim to detect and recognize objects in videos using RGB and depth data as input. The deep learning algorithm will be an extension of YOLO (You Only Look Once), a convolutional neural network for object detection. The project has two phases. In the first phase, since YOLO is natively implemented in Darknet, a framework written in C, I will explore ports of YOLO to a friendlier framework (e.g., Keras or TensorFlow) and test their accuracy against the Darknet implementation.
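One way a port's detections might be compared against Darknet's is by the intersection over union (IoU) of corresponding bounding boxes. The following is only a minimal sketch under stated assumptions: boxes are given as `(x_min, y_min, x_max, y_max)` tuples, and the function names `iou` and `mean_iou` are hypothetical helpers, not part of Darknet, Keras, or TensorFlow.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Assumption: each box is (x_min, y_min, x_max, y_max).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def mean_iou(boxes_a, boxes_b):
    """Average IoU over boxes paired by position in the two lists.

    Assumption: the two lists are already matched one-to-one, e.g. the
    i-th box from each implementation refers to the same object.
    """
    if not boxes_a or len(boxes_a) != len(boxes_b):
        raise ValueError("expected two non-empty lists of equal length")
    return sum(iou(a, b) for a, b in zip(boxes_a, boxes_b)) / len(boxes_a)
```

A mean IoU close to 1 between the Darknet output and a port's output on the same frames would suggest the port reproduces the original detections; a full evaluation would also need to handle unmatched boxes and class labels.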
In the second phase, I will extend Darknet to accept RGB and depth data as input. An accuracy script will be written in Darknet to evaluate three models: the original YOLO (RGB only), an extended YOLO (depth only), and YOLO-D (RGB + depth). My hypothesis is that including depth will increase detection accuracy.
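To illustrate what an RGB + depth input could look like, the sketch below stacks a depth map onto an RGB frame as a fourth channel. It is an assumption about the input format, not the Darknet extension itself: the helper name `stack_rgbd`, the `max_depth` normalization constant, and the choice to scale both inputs to the range [0, 1] are all hypothetical.

```python
import numpy as np


def stack_rgbd(rgb, depth, max_depth):
    """Combine an RGB frame and a depth map into one 4-channel array.

    Assumptions (hypothetical, for illustration only):
      rgb       -- uint8 array of shape (height, width, 3)
      depth     -- array of shape (height, width), same units as max_depth
      max_depth -- depth value that should map to 1.0 after scaling
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("RGB frame and depth map must have the same size")

    # Scale colors to [0, 1].
    rgb_scaled = rgb.astype(np.float32) / 255.0

    # Scale depth to [0, 1], clipping readings beyond max_depth.
    depth_scaled = np.clip(depth.astype(np.float32) / max_depth, 0.0, 1.0)

    # Append depth as a fourth channel: result has shape (height, width, 4).
    return np.concatenate([rgb_scaled, depth_scaled[..., np.newaxis]], axis=-1)


# Example with synthetic data: a black 4x6 frame and a constant depth map.
frame = np.zeros((4, 6, 3), dtype=np.uint8)
depth_map = np.full((4, 6), 5000, dtype=np.uint16)
rgbd = stack_rgbd(frame, depth_map, max_depth=10000.0)
```

Under this layout, the RGB-only model would read the first three channels, the depth-only model the last channel, and YOLO-D all four, so the first convolutional layer would need a matching number of input channels in each case.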