Dr. Kevelin Barbosa-Xavier
Lead Bioinformatician
Cannamatrix
POSTER PRESENTER
CULTIVATION

A pangenomic panel to enhance low-coverage genotyping and GWAS for machine learning in Cannabis

Machine learning (ML) applications and genomic-assisted selection in Cannabis sativa require rich, representative, and reliable genotype datasets. However, standard single-reference genome approaches introduce mapping bias, miss structural and multiallelic variation, and lose sensitivity in low-coverage data. These limitations reduce detection power in GWAS and degrade data quality for predictive models. To overcome them, we built a graph-based pangenome from 20 genomes (10 phased, 10 unphased) using the Minigraph-Cactus pipeline. From the graph we extracted a graph-VCF containing ~13.6 million variant sites (10.8M SNPs, 2.5M indels, 973k MNPs), 1.87M of which were multiallelic, highlighting the allelic complexity preserved by the graph. To make this resource usable for imputation and genotyping, we normalized the graph-VCF, split multiallelic sites into biallelic records, and applied panel-level quality filters (Beagle DR2 ≥ 0.8). The result was a high-quality biallelic panel of 8.6 million variants that retains a diverse set of SNPs, indels, and MNPs. We then used this panel to genotype and impute 176 public samples (with cannabinoid data available) via two complementary routes: Glimpse2, focused on SNPs/indels (75,820 variants with INFO ≥ 0.8), and Kage2, which uses the graph to genotype additional complex variants (confidently called in >75% of the cohort). The pangenome approach has three advantages over single-reference methods. First, it reduces reference bias by representing multiple haplotypes. Second, it improves the capture of structural and multiallelic variation, variant classes that are crucial for secondary metabolite biosynthesis such as cannabinoid and terpene production. Third, it increases imputation and phasing accuracy in low-coverage data, yielding genotypes with lower systematic error and less missingness.
These improvements translate to greater sensitivity in GWAS and to more informative, less noisy input data for ML models, which can then capture haplotype signals and interactions. Together, they make the panel a cost-effective solution for large-scale genotyping in breeding programs. We are now scaling this framework to a 152-genome pangenome (56 phased, 96 unphased), aiming for an even more robust panel for GWAS, genomic selection, and chemotypic prediction in Cannabis sativa.
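
The panel-preparation step described in the abstract (splitting multiallelic sites into biallelic records, then keeping only variants with Beagle DR2 ≥ 0.8) can be illustrated with a minimal Python sketch. This is not the presenter's pipeline: in practice a dedicated tool such as bcftools norm would typically handle normalization and splitting on real VCF files. The Variant record, the function names, and the example values below are hypothetical, and the sketch assumes one DR2 value per ALT allele. It also skips the left-alignment and allele trimming that real normalization performs.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified stand-in for one VCF record."""
    chrom: str
    pos: int
    ref: str
    alts: List[str]   # one or more ALT alleles
    dr2: List[float]  # assumption: one DR2 value per ALT allele


def split_multiallelic(variant: Variant) -> Iterator[Variant]:
    """Yield one biallelic record per ALT allele of the input record."""
    for alt, dr2 in zip(variant.alts, variant.dr2):
        yield Variant(variant.chrom, variant.pos, variant.ref, [alt], [dr2])


def build_panel(variants: Iterable[Variant], min_dr2: float = 0.8) -> List[Variant]:
    """Split every record, then keep biallelic records with DR2 >= min_dr2."""
    panel = []
    for variant in variants:
        for biallelic in split_multiallelic(variant):
            if biallelic.dr2[0] >= min_dr2:
                panel.append(biallelic)
    return panel


# Made-up records for illustration only; not data from the study.
records = [
    Variant("chr1", 1042, "A", ["G"], [0.95]),               # biallelic SNP
    Variant("chr1", 2210, "AT", ["A", "ATT"], [0.91, 0.42]),  # multiallelic indel
    Variant("chr2", 5300, "C", ["T", "G"], [0.60, 0.88]),     # multiallelic SNP
]

panel = build_panel(records)
for v in panel:
    print(v.chrom, v.pos, v.ref, v.alts[0], v.dr2[0])
```

Filtering after the split, rather than before, lets a well-imputed allele at a multiallelic site survive even when another allele at the same site fails the threshold. That is one way a panel can retain some of the allelic diversity the graph preserves.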

Learning Objectives:

Understand the limitations of single-reference genomes for Cannabis structural variation

Explore how a graph-pangenome panel reduces reference bias and improves imputation accuracy

Apply pangenome-aware genotyping to enhance GWAS and ML models in low-coverage data

BIO

Kevelin Barbosa-Xavier is an Agricultural Technician and Biologist, with an MSc in Genetics and Plant Breeding and a PhD in Plant Biotechnology. As Lead Bioinformatician at Cannamatrix, she works at the intersection of bioinformatics, data science, and biotechnology, leading research in genomics, GWAS, pangenomics, machine learning, and phenohunting. Her work focuses on uncovering the genetic basis of key Cannabis sativa chemical traits, including cannabinoids, terpenoids, and volatile sulfur compounds (VSCs). She is passionate about translating complex biological data into actionable insights and innovative solutions.

View CannMed Resources Below:

Nov. 13, 2024

Analyzing Cannabis Gene Expression with Kevelin Barbosa-Xavier