Mengyao Zhou

Profile

Research Interests

I am interested in developing algorithms for genomic sequence analysis, such as tools for detecting and representing genomic variation.

Education and Qualifications

Ph.D. Program in Bioinformatics: Boston College, MA, USA

Sep. 2009 - Present

Ph.D. Program Courses Taken: Biological Chemistry, Advanced Genetics, Advanced Cell Biology, Eukaryotic Gene Expression, Viruses, Genes and Evolution, Graduate Bioinformatics, Algorithms/Computational Biology

Current Projects

SIMD Smith-Waterman C/C++ Library, Profile Hidden Markov Model-based genomic variation detector

Master of Science in Genomics and Pathway Biology

Division of Pathway Medicine, The University of Edinburgh, UK

Sep. 2006 - Aug. 2007

Final Mark: 73/100 (A, Distinction)

Graduate Thesis: “The integration of the dynamic pathway information into the Medical Microbiology Database and the improvement of the Boolean network modeling method”.

M.Sc. Program Courses Taken: Bioinformatics

Courses Taken at the Graduate University of Chinese Academy of Sciences (GUCAS) as part of employee education at the Beijing Genomics Institute, P. R. China

Sep. 2005 - Jun. 2006

Molecular Biology, Biological Statistics

Bachelor of Science in Computer Science

Beijing Normal University, P.R. China

Sep. 2000 - Aug. 2004

GPA: 3.4/4.0

Undergraduate Thesis: “DNA Sequencing and Human Leukocyte Antigen (HLA) Matching Software” (completed at the Beijing Genomics Institute), awarded Excellent Graduation Thesis (95/100), 2004

Awarded Third Prize Scholarship, 2001

Employment History

Teaching Assistant, Boston College, USA

Sep. 2009 - May 2010, Sep. 2012 - Dec. 2012

Courses taught: Introduction to Molecular and Cell Biology, Anatomy and Physiology Lab, Graduate Bioinformatics

Computational Biologist, Wellcome Trust Sanger Institute, UK

Nov. 2007 - Aug. 2009

Group: Human Evolution, Group leader: Chris Tyler-Smith

Main work: Coalescence history analysis based on the SNP density distribution of six whole-genome sequences from the YRI, CEU and CHB populations (obtained from the 1000 Genomes Project).

Other work: Facilitated data processing for other projects in the group.

Scientific Programmer and Database Curator, European Bioinformatics Institute, UK

Aug. 2007 - Nov. 2007

Group: Microarray Informatics Team, Group leader: Alvis Brazma

Main work: Manually curated data for the ArrayExpress database. Used Perl and DBI with SQLite to build a curation management system that supports the manual curation.

Scientific Programmer (visiting), Wellcome Trust Sanger Institute, UK

Apr. 2007

Group: Signal Transduction in C. elegans, Group leader: Andrew Fraser

Main work: Used Perl to calculate the distance in the Gene Ontology graph between each pair of terms annotated to C. elegans genes.

Research Assistant, Beijing Genomics Institute (BGI), Chinese Academy of Sciences, P. R. China

Jun. 2004 - Aug. 2006

Group: Algorithm Development Team, Group leader: Heng Li 

Main work: Re-annotated the rice genome. Contributed to the development of heterozygote detection software for capillary reads. Documented the data processing pipeline for HLA typing. Sequenced salmon genes.

Other work: Helped to establish BGI's employee training and external training systems. Worked as a teaching assistant for the graduate course “An Introduction of Bioinformatics” at GUCAS.

Current Projects

1) SIMD Smith-Waterman (SSW) C/C++ Library

SSW is a fast implementation of the Smith-Waterman (SW) algorithm that uses Single-Instruction Multiple-Data (SIMD) instructions to parallelize the algorithm within a single processor. The SSW library provides an API that can be used flexibly by programs written in C, C++ and other languages, including multithreaded programs. We also provide demo software that performs protein and genome alignment directly. The current version of our implementation is ~50 times faster than an ordinary Smith-Waterman implementation. It returns the Smith-Waterman score, alignment location and traceback path (CIGAR) of the optimal alignment exactly, and returns the score and location of a sub-optimal alignment heuristically.

License: MIT

Features:

Efficient: ~50 times faster than an ordinary Smith-Waterman, with ~100 times less memory usage.
Accurate results and more: The library always gives optimal alignment results identical to those of an ordinary Smith-Waterman. A suboptimal alignment is also returned.
Lightweight and easy to use: 1) Copy ssw.h and ssw.c from https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library and paste them into your source code folder. 2) Add #include "ssw.h" where you would like to use the API. 3) Compile them together with your own code.
example.c and example.cpp are two simple examples that show how to use the API in C and C++ style respectively.
Robust demo software is provided for you to experiment with in whatever way you like.
Double-layer parallelization: 1) SIMD parallelization performs pairwise alignment in parallel within each processor. 2) Multithreaded use of SSW runs alignments on multiple processors simultaneously.
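
For readers unfamiliar with the baseline, the "ordinary Smith-Waterman" that SSW is compared against can be sketched as a plain dynamic program with affine gap penalties. The sketch below is written in Python for readability and is not taken from the SSW source. The function name, the match/mismatch scores and the gap convention (a gap of length k costs gap_open + (k - 1) * gap_extend) are assumptions made for illustration only.

```python
def smith_waterman(query, ref, match=2, mismatch=-2, gap_open=3, gap_extend=1):
    """Scalar Smith-Waterman with affine gaps (illustrative sketch).

    Returns (best_score, query_end, ref_end). End positions are 0-based
    and inclusive; they are (-1, -1) when no positive-scoring alignment exists.
    """
    neg_inf = float("-inf")
    cols = len(ref) + 1
    h_prev = [0] * cols        # H: best score ending at each cell, previous row
    f_prev = [neg_inf] * cols  # F: best score ending in a vertical gap, previous row
    best = (0, -1, -1)
    for i in range(1, len(query) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # E: best score ending in a horizontal gap, current row
        for j in range(1, cols):
            # Open a new gap from H or extend an existing one.
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            s = match if query[i - 1] == ref[j - 1] else mismatch
            # Local alignment: scores are never allowed to drop below zero.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            if h_cur[j] > best[0]:
                best = (h_cur[j], i - 1, j - 1)
        h_prev, f_prev = h_cur, f_cur
    return best


print(smith_waterman("ACGT", "TTACGTTT"))  # (8, 3, 5)
```

Each cell depends on its left, upper and upper-left neighbours, which is why this scalar version processes one cell at a time. SIMD implementations such as SSW instead compute many cells per instruction.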

See more at: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library

Pictured Above: Running time for different SW implementations. Log-scaled running time is shown on the y-axis for SSW without (blue) and with (red) detailed alignment, Farrar’s implementation (green) and SSEARCH (pink). (A) Running times for searching protein sequences against the full UniProt database (left) and one half of the TrEMBL database (right). All SW implementations used the BLOSUM50 scoring matrix with gap open penalty -12 and extension penalty -2. (B) Running times for aligning 1,000 simulated Illumina reads to human reference sequences of various lengths. (C) Running times for aligning 1,000 real sequencing reads to various microorganism genomes and to human chromosome 1. Farrar’s implementation cannot handle sequences as long as human chromosome 1, so its running time for that case is not shown.

2) Profile Hidden Markov Model-based genomic variation detector

A variety of algorithms have been developed to detect SNPs, insertions and deletions (INDELs), and other complex genomic variations using next-generation sequencing data. However, detection of INDELs and other complex variations still faces significant problems because of sequencing and alignment uncertainties. Moreover, most existing algorithms are designed for high-quality short reads, whereas upcoming technologies tend to produce longer reads at the cost of a higher sequencing error rate. This makes variation detection more challenging.

We are therefore developing a genomic variation caller (PHV) based on a Profile Hidden Markov Model (PHMM), which jointly models the sequencing and alignment uncertainties that other callers usually model separately. While this tool is expected to show its greatest advantage on more error-prone data, it also demonstrates improved variation detection on data from currently available short-read technologies.
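
To illustrate what modeling sequencing and alignment uncertainty jointly can mean, here is a minimal sketch of the forward algorithm for a textbook three-state (match/insert/delete) HMM. It sums the probability of a read over all possible alignments to a candidate sequence instead of committing to a single alignment. This is not PHV's actual model: it is a simplified, position-independent stand-in for a profile HMM, and the function name and all parameter values (err, gap_open, gap_extend) are assumptions chosen for illustration.

```python
def forward_likelihood(read, ref, err=0.01, gap_open=0.01, gap_extend=0.1):
    """P(read | ref) summed over all global alignments (illustrative sketch).

    err        -- assumed per-base sequencing error probability
    gap_open   -- assumed probability of opening an insertion or a deletion
    gap_extend -- assumed probability of extending an open gap
    """
    n, m = len(read), len(ref)
    # f_m / f_i / f_d[i][j]: probability of emitting read[:i] against ref[:j]
    # and ending in the match / insert / delete state.
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_i = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_d = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # start in a virtual match state before either sequence
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Sequencing uncertainty: a read base may differ from the reference.
                emit = 1.0 - err if read[i - 1] == ref[j - 1] else err / 3.0
                f_m[i][j] = emit * (
                    (1.0 - 2.0 * gap_open) * f_m[i - 1][j - 1]
                    + (1.0 - gap_extend) * (f_i[i - 1][j - 1] + f_d[i - 1][j - 1])
                )
            if i > 0:
                # Insertion: a read base with no reference partner, uniform over ACGT.
                f_i[i][j] = 0.25 * (gap_open * f_m[i - 1][j] + gap_extend * f_i[i - 1][j])
            if j > 0:
                # Deletion: a reference base skipped by the read, no emission.
                f_d[i][j] = gap_open * f_m[i][j - 1] + gap_extend * f_d[i][j - 1]
    return f_m[n][m] + f_i[n][m] + f_d[n][m]


# A read that matches the candidate sequence is far more likely than one with a mismatch.
print(forward_likelihood("ACGT", "ACGT") > forward_likelihood("ACTT", "ACGT"))  # True
```

A practical implementation would work in log space or rescale each row to avoid numerical underflow on long reads. The plain-probability form is kept here only to keep the recurrences easy to read.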

Other variation detectors’ pipeline:

PHV’s pipeline: