The main motivation for this project is to create a more automated and simplified video annotation pipeline using both image and textual data for the FrameNet webtool. The existing version relies on manual annotation which is a highly tedious task. In order to annotate multimodal corpora, individual frames or ideas depicted within the video need to be extracted using corresponding audio transcriptions and identified objects. This is important to obtain fine grained semantic representations of events and entities. Moreover, the video may be presented in multiple languages that need to be detected and translated to Portuguese. Also individual objects within a video need to be tracked and identified to generate corresponding textual frame elements that can speed up the annotation process.



Prishita Ray


  • Marcelo Viridiano
  • Tiago Timponi Torrent
  • Ely Matos
  • Fred Belcavello