The internet today is rich in image content, from entire websites such as Instagram and Pinterest that are dedicated to curating and displaying images, to Facebook and Reddit, where much of the content is in image form. Non-visual users find it challenging to navigate these websites and use them for their intended purpose. The information in images, whether on the internet or stored locally, is likewise inaccessible to non-visual users. NVDA (NonVisual Desktop Access) is a free, open-source, portable screen reader for Microsoft Windows. It already includes OCR that can recognize text within images, but it lacks any functionality to describe image content or to let users interact with the objects within an image. This project aims to overcome these issues through two modules:

  1. A machine learning module that generates descriptive captions for images on the user’s screen or for images supplied manually by the user.
  2. An object detection model that draws bounding boxes around recognized objects and announces an object’s label when the user’s pointer enters its bounding box.

The output of both modules could be presented to the user through NVDA’s existing mechanisms, such as speech or braille.
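
The pointer-driven lookup described for the second module can be illustrated with a minimal sketch. It assumes the detector has already produced labelled boxes in screen coordinates. The names `DetectedObject`, `object_at`, and `PointerTracker` are hypothetical and not part of NVDA, and the `announce` callback merely stands in for whatever speech or braille output the add-on would use. Choosing the smallest box when several overlap is also an assumption, not something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class DetectedObject:
    """A labelled bounding box (hypothetical type, not an NVDA class)."""
    label: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom

    @property
    def area(self) -> int:
        return (self.right - self.left) * (self.bottom - self.top)


def object_at(objects: Iterable[DetectedObject], x: int, y: int) -> Optional[DetectedObject]:
    """Return the smallest box containing (x, y), or None if there is none.

    Preferring the smallest box is an assumed tie-break for nested objects.
    """
    hits = [obj for obj in objects if obj.contains(x, y)]
    return min(hits, key=lambda obj: obj.area) if hits else None


class PointerTracker:
    """Announces a label only when the pointer enters a different box."""

    def __init__(self, objects: Iterable[DetectedObject], announce: Callable[[str], None]):
        self._objects = list(objects)
        self._announce = announce  # placeholder for speech or braille output
        self._current: Optional[DetectedObject] = None

    def on_pointer_move(self, x: int, y: int) -> None:
        hit = object_at(self._objects, x, y)
        if hit is not None and hit != self._current:
            self._announce(hit.label)
        self._current = hit
```

Tracking the current box means a label is announced once on entry rather than on every pointer movement inside the same object, which would otherwise flood the user with repeated speech.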



Shubham Dilip Jain


  • Michael Curran
  • Reef Turner