An Automated Analysis Pipeline for Solving the GWAS Puzzle – Gene to Genome

Today’s guest post was contributed by Neil Halterman, assistant professor in the Department of Molecular and Human Genetics at Baylor College of Medicine, who combines team science, genetics and neuroscience to study the mechanisms driving arthritis. Nele is passionate about science communication and advocacy: she runs a blog for early career scientists (ecrLife) and promotes open, reproducible science. You can follow Nele on LinkedIn.

Genome-wide association studies (GWAS) have helped researchers unravel the genetic basis of many complex traits and common diseases. By comparing the genomes of large groups of people (from a few hundred to over a million) with or without a given trait, GWAS identify common genetic variants that co-vary with the phenotype. This approach has identified thousands of genetic associations across diverse traits and diseases, providing insight into human biology and the mechanisms that influence disease risk. However, despite millions of genotyped variants and very large sample sizes, standard GWAS analysis typically explains only a fraction of the expected genetic contribution, a gap known as missing heritability.

Part of this missing heritability stems from the fact that traditional GWAS examines variants independently, whereas many traits are complex and shaped by the combined, interactive effects of multiple variants. Therefore, scientists are developing a variety of approaches to study how variants may interact with each other to produce a given phenotype. One such approach is kernel-based association testing, which applies statistical modeling to evaluate the joint effects of many GWAS variants simultaneously, enabling the discovery of genetic effects that are additive, nonlinear, interactive, or functionally informed. Researchers have also begun integrating external datasets, such as transcriptomic or structural data, into kernel-based analysis to further enhance its analytical power. However, as genomic and multiomic datasets continue to expand in scale and complexity, there is a growing need for robust, scalable, and reproducible tools that streamline such integrated, kernel-based analyses.
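To make the idea of testing many variants jointly more concrete, here is a minimal sketch of a kernel score statistic with a permutation p-value. It is an illustration under simple assumptions (a linear kernel, no covariates, unrelated samples), not the method used by any particular tool. The function names and the toy genotypes and phenotypes are made up for this example.

```python
import random

def center(values):
    """Subtract the mean so the values act as residuals from an intercept-only model."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def linear_kernel(genotypes):
    """K[i][j] = dot product of the genotype rows of samples i and j."""
    n = len(genotypes)
    return [
        [sum(a * b for a, b in zip(genotypes[i], genotypes[j])) for j in range(n)]
        for i in range(n)
    ]

def kernel_score(residuals, kernel):
    """Quadratic form Q = r^T K r: large when genetically similar samples have similar residuals."""
    n = len(residuals)
    return sum(
        residuals[i] * kernel[i][j] * residuals[j]
        for i in range(n)
        for j in range(n)
    )

def permutation_p_value(phenotype, genotypes, n_perm=999, seed=0):
    """Compare the observed Q with Q computed on shuffled phenotypes."""
    rng = random.Random(seed)
    kernel = linear_kernel(genotypes)
    residuals = center(phenotype)
    observed = kernel_score(residuals, kernel)
    shuffled = residuals[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if kernel_score(shuffled, kernel) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (invented): 8 samples, 3 variants coded as minor-allele counts.
genotypes = [
    [0, 0, 1], [1, 0, 0], [2, 1, 1], [0, 0, 0],
    [1, 1, 2], [2, 2, 1], [0, 1, 0], [2, 1, 2],
]
phenotype = [0.1, 0.9, 2.2, -0.2, 1.8, 2.9, 0.4, 2.7]

q_stat, p_value = permutation_p_value(phenotype, genotypes)
print("Q =", q_stat, "permutation p =", p_value)
```

With a linear kernel, Q equals the sum of the squared per-variant scores, so the whole set of variants is tested at once rather than one at a time. Other kernels can be swapped in to capture nonlinear or interactive effects.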

To meet this need, new research led by Dr. David Anuma and Dr. Jangini Hai, published in G3: Genes|Genomes|Genetics, introduces MOKA, an automated pipeline for integrated, kernel-based GWAS analysis.

MOKA, which is open source and available on GitHub, allows researchers to define relevant external datasets and leverage them to compute a variety of biologically informed functional weights. Such datasets may include imaging-derived annotations, evolutionary conservation scores, transcription factor occupancy, neural network predictions, or other data that indicate a variant's likely functional impact. The pipeline then combines these variant-level functional weights with genotypes in a kernel-based association model to jointly assess the contribution of multiple variants within a gene region while accounting for population structure and relatedness. Additionally, the platform enables automated downstream annotation and analysis through visualization, disease database validation, gene ontology enrichment, and KEGG pathway analysis, allowing researchers to identify functional themes and biological processes shared among significantly associated genes. Finally, the pipeline includes an external validation step, in which the results are compared to curated knowledge bases such as DisGeNET and the proportion of associated genes that are transcribed in independent datasets is assessed.
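As a rough illustration of how variant-level functional weights can enter a kernel, the sketch below builds a weighted linear kernel K = G W G^T from per-variant annotation scores. This is not MOKA's code or API: the function names, the normalization choice, and the toy conservation scores are all assumptions made for this example.

```python
def normalize_weights(scores):
    """Scale non-negative annotation scores so that they sum to 1."""
    if any(s < 0 for s in scores):
        raise ValueError("annotation scores must be non-negative")
    total = sum(scores)
    if total == 0:
        # No annotation signal: fall back to equal weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def weighted_kernel(genotypes, weights):
    """K[i][j] = sum over variants of w * g_i * g_j, i.e. K = G W G^T with W diagonal."""
    n = len(genotypes)
    return [
        [
            sum(w * a * b for w, a, b in zip(weights, genotypes[i], genotypes[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

def quadratic_form(residuals, kernel):
    """Kernel score statistic Q = r^T K r."""
    n = len(residuals)
    return sum(
        residuals[i] * kernel[i][j] * residuals[j]
        for i in range(n)
        for j in range(n)
    )

def variant_contributions(genotypes, weights, residuals):
    """Per-variant share of Q: w_j * (sum_i g_ij * r_i)^2."""
    contributions = []
    for j, w in enumerate(weights):
        score = sum(row[j] * r for row, r in zip(genotypes, residuals))
        contributions.append(w * score ** 2)
    return contributions

# Toy data (invented): 6 samples, 3 variants in one gene region.
genotypes = [[0, 1, 2], [1, 0, 0], [2, 1, 1], [0, 0, 1], [1, 2, 0], [2, 1, 2]]
phenotype = [0.2, 1.1, 2.3, 0.0, 0.9, 2.1]
mean = sum(phenotype) / len(phenotype)
residuals = [y - mean for y in phenotype]

# Hypothetical conservation scores, one per variant (higher = more likely functional).
conservation_scores = [0.9, 0.1, 0.0]
weights = normalize_weights(conservation_scores)

kernel = weighted_kernel(genotypes, weights)
print("Q =", quadratic_form(residuals, kernel))
print("per-variant contributions:", variant_contributions(genotypes, weights, residuals))
```

In this setup, a variant with a weight of zero drops out of the statistic entirely, while highly weighted variants dominate it. That is the sense in which external annotations can steer a joint test toward variants that are more likely to be functional.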

In summary, MOKA GWAS offers a one-stop shop for scientists embarking on a journey of robust, multiomics-informed genetic discovery.
