Genome Assembly and Annotation

Providing freely accessible genomic data

Technologies
python, mysql, docker, pytorch, nextflow
Topics
machine learning, genomics, big data, cloud, hpc
The Genome Assembly and Annotation section of EMBL-EBI brings together key reference resources in the field of genomics:
- Ensembl (http://www.ensembl.org) was created in 1999 in preparation for the publication of the first draft of the human genome, to allow researchers and clinicians to start translating the secrets hidden within the human genome into real-world applications. Ensembl has grown into a champion of biodiversity, providing data for tens of thousands of species across our vertebrate, metazoa, plant, fungi and bacterial divisions.
- MGnify (http://www.ebi.ac.uk/metagenomics) provides a free-to-use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in particular environments. Over the past 2 years, MGnify has more than doubled the number of publicly available analysed datasets held within the resource.
- WormBase (https://wormbase.org/) is one of the world's oldest active bioinformatics resources, more than 20 years old. We scan all published literature and datasets on the model organism C. elegans to create a comprehensive resource covering genomics, strains, experiments, papers and people, aimed at accelerating research and discoveries in fundamental biology as well as human health.
- The HUGO Gene Nomenclature Committee (HGNC) and its sibling project, the Vertebrate Gene Nomenclature Committee (VGNC), are jointly responsible for defining the official names of genes in human and key vertebrate species. This official nomenclature ensures that studies and results on the same gene can easily be aggregated.
Given the rapid pace at which genomics and sequencing data are generated, we support a fast-evolving software stack and are constantly investigating new solutions for data storage, processing, distribution and display. Please visit our projects page for ideas on potential GSoC projects: https://www.ensembl.info/about/projects/
2022 Program

Successful Projects

Contributor
Rohit Shrivastava
Mentor
Andy Yates
Organization
Genome Assembly and Annotation
Accessing Ensembl data with Presto and AWS Athena
The goal of this project is to build a next-generation replacement for the BioMart tool that provides a way to download custom reports of genes, transcripts,...
Contributor
Sunny Tarawade
Mentor
Alexey Sokolov
Organization
Genome Assembly and Annotation
New FAANG backend with Elasticsearch and GraphQL
Current limitations: The current back end for the Functional Annotation of Animal Genomes (FAANG) project provides users with a public REST API to...
Contributor
Yantong
Mentor
Jose Perez-Silva, William Stark, Francesca Tricomi, Leanne Haggerty
Organization
Genome Assembly and Annotation
Using Machine Learning to Identify and Classify Repeat Features
A number of tools exist for identifying repeat features, but it remains a problem that the DNA sequence of some genes can be identified as being a...
Contributor
KevinGao
Mentor
Ivana Piližota, David Thybert
Organization
Genome Assembly and Annotation
Investigating and Implementing Compact Data Representation of Homology Relationship
A key challenge surrounding modern bioinformatics is to manage and store the growing amount of biological data with both space efficiency and...
Contributor
Malay Joshi
Mentor
MagdalenaZ
Organization
Genome Assembly and Annotation
Extract important information from scientific papers
During GSoC 2021, a “Named Entity Recognition” (NER) system based on BioBERT and RegEx/string matching was developed to recognize and extract...
Contributor
kshitijsoni
Mentor
MagdalenaZ
Organization
Genome Assembly and Annotation
GSoC 2022 Proposal - Extract text from tables in Scientific Papers by Kshitij Soni
PyTesseract is really helpful. The first time I came across PyTesseract, I used it directly to detect some short text, and the result was satisfying. Then,...