MATCH-G - Richard Maraia Lab | NICHD - Eunice Kennedy Shriver National Institute of Child Health and Human Development

MATCH-G Program

The MATCH-G toolset

MATCH-G (Mutational Analysis Toolset Comparing wHole Genomes)

Download the Toolset (ZIP 6 MB)

The toolset is prepared for use with pombe genomes and a sample genome from Sanger is included. However, it is written in such a way as to allow it to utilize any set or number of chromosomes. Simply follow the naming convention "chromosomeX.contig.embl" where X=1,2,3 etc.

For use in terminal windows in Mac OSX or Unix-like environments where Perl is present by default.

Toolset Description

Version History

2.0 – Bug fixes related to scaling to other genomes. First version of the GUI.

1.0 – Initial Release. Includes build, alignment, snp, copy, and gap resolution routines in original form.

Please note that this project is under development and will undergo significant changes.

Project Description

The MATCH-G toolset has been developed to facilitate the evaluation of whole genome sequencing data from yeasts. Currently, the toolset is written for pombe strains; however, it is easily scalable to other yeasts or other sequenced genomes. The included tools assist in the identification of SNP mutations, short (and long) insertions/deletions, and changes in gene/region copy number between a parent strain and a mutant or revertant strain.

Currently, the toolset is run from the command line in a Unix or similar terminal and requires the few additional programs noted below.

Installation

It is suggested that a separate folder be generated for the toolset and generated files. Free disk space proportional to the original read datasets is recommended.

Additional Free Programs

The toolset utilizes and requires a few additional free programs:

Bowtie rapid alignment software:

http://bowtie-bio.sourceforge.net/index.shtml

(Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.)

SAMtools:

http://samtools.sourceforge.net/

(Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.)

Bioperl (not present in a default perl installation):

http://www.bioperl.org/

MUSCLE:

(for the current incarnation of gap resolution; will likely not be used in later revisions of this toolset)

http://www.drive5.com/muscle/muscle.html

(Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97)

Please note the installation paths of Bowtie and SAMtools. MUSCLE should be placed either in the same directory as matchg.pl and config.txt or somewhere in the environment path.

Files config.txt and matchg.pl may be placed anywhere as long as they share the same directory.

Read datasets should be in fastq format and placed into the Bowtie path in the "reads" subdirectory.

Your template genome should be in embl format. Several common genomes are available from The Sanger Institute, including pombe http://www.sanger.ac.uk/Projects/S_pombe/

A sample pombe genome is included (from Sanger, November 2010). The toolset is written for pombe but allows other genomes to be used. Support for fetching GenBank genomes will be implemented in future releases.

Place chromosomal and mitochondrial embl files in your working project directory. A numbered naming convention for genomes should be followed: chromosomeX.contig.embl (where X=1,2,3, etc.). The final number should match the number in the configuration file.
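The naming convention above can be checked before running a build. The following is a minimal Python sketch, not part of the toolset: the function names are hypothetical, and it only covers the numbered chromosome files (the name expected for the mitochondrial file is not stated here, so it is not checked).

```python
from pathlib import Path


def expected_genome_files(num_chromosomes):
    """Return the file names implied by the chromosomeX.contig.embl convention."""
    return [f"chromosome{i}.contig.embl" for i in range(1, num_chromosomes + 1)]


def missing_genome_files(working_dir, num_chromosomes):
    """List the expected EMBL files that are absent from working_dir."""
    root = Path(working_dir)
    return [
        name
        for name in expected_genome_files(num_chromosomes)
        if not (root / name).is_file()
    ]


if __name__ == "__main__":
    # Hypothetical usage: check the current directory for a 3-chromosome genome.
    for name in missing_genome_files(".", 3):
        print("missing:", name)
```

An empty result means every numbered chromosome file implied by the chromosome count is present in the working directory.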

Modify the text file config.txt to reflect naming schemes desired, directory paths of program installations, and name of fastq read files. Additional settings can be left at default or altered if desired and are described below.

Configuration File Description

To be modified in any text editor. A brief description of the settings follows:

[Working Directory]

Points to the directory to house files generated by the toolset as well as the genome to serve as the template for read alignment (embl format).

[WT FASTQ Read File]

Reads for the wild-type dataset in fastq format. Names the file housed in the reads subdirectory of the bowtie installation.

[MUT FASTQ Read File]

Reads for the mutant strain derived from the WT. Names the file housed in the reads subdirectory of the bowtie installation.

[Bowtie Path]

Defines the path of the Bowtie program installation.

[Samtools Path]

Defines the path of the SAMtools program installation.

[Reference Genome Name]

Gives the user-defined name of the project genome. Used to distinguish between modified genome builds.

[Number Chromosomes]

For pombe, leave at the default 3.

[Default Name WT]

User defined name for the WT or Parent dataset. Will define file names resulting from alignment of WT reads and which files will be used for comparison to MUT in associated SNP and Copy Number analysis.

[Default Name MUT]

User defined name for the MUT dataset. Will define file names resulting from alignment of MUT reads and which files will be used for comparison to WT in associated SNP and Copy Number analysis.

[SNP Min Depth]

The minimum number of reads mapping to a genome location required to define a base identity. (Default 4)

[SNP Min Identity]

Fraction of bases required in agreement in order to define a base identity. (Default 0.8)

[Zero Flank Step]

For coverage gap resolution. Number of bases beyond a zero mapped read location to move before defining a sequence flank for resolving gaps (insertion/deletion event). (Default 5)

[Zero Flank Seed Length]

Length in nucleotides of a sequence flank surrounding a zero-mapped read region of the genome used in resolving a gap (insertion/deletion event). Ideally, this will be long enough to unambiguously define a sequence of the genome but shorter than half of the length of reads. (Default 20)

[Copy Num Averaging Window]

Averaging window in bases to use in scanning the genome for regions demonstrating a change in copy number. Shorter windows tend to result in more noise. Large windows may mask short regions of duplication/loss of copy number. (Default 400)

[Copy Num Ratio Cutoff]

Ratio in read depth between WT and MUT datasets used as a cutoff in scanning the genome for regions demonstrating a change in copy number. (Default 1.8)
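As an illustration of how the [SNP Min Depth] and [SNP Min Identity] settings above interact, the following Python sketch applies them as a simple consensus rule to the bases observed at one genome position. It is an assumption about the general idea, not the toolset's actual code: the function name is hypothetical, and treating both thresholds as inclusive is a guess.

```python
from collections import Counter


def call_base(bases, min_depth=4, min_identity=0.8):
    """Return the consensus base at one position, or None if no call is made.

    bases: the read bases covering the position, e.g. "AAAAC".
    min_depth and min_identity mirror the documented defaults (4 and 0.8).
    """
    depth = len(bases)
    if depth < min_depth:
        return None  # too few reads to define a base identity
    base, count = Counter(bases).most_common(1)[0]
    if count / depth < min_identity:
        return None  # reads disagree too much
    return base


if __name__ == "__main__":
    print(call_base("AAAAC"))  # 4 of 5 reads agree
    print(call_base("AAA"))    # below minimum depth
```

Raising either value makes calls more conservative: fewer positions receive a base identity, but each call is supported by more reads or stronger agreement.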

Command Line Execution

In the terminal window, navigate to the location of matchg.pl and run with the following switches according to desired functionality:

matchg.pl --build

Reads genome embl files from the working directory to construct fasta and indexes for alignment as well as feature index tables for targeted evaluation of changes in copy number. Should be run first unless the same genome has been built previously without modifications.

matchg.pl --all

Executes all subroutines described below.

matchg.pl --align

Performs alignment of WT and MUT reads to a built genome. Generates "sorted.bam" and index files suitable for viewing alignments in a program like IGV (http://www.broadinstitute.org/igv/ ). Generates plaintext pileup files suitable for subsequent parsing or by-eye evaluation of alignments.

matchg.pl --stats

Provides some basic statistics related to an alignment such as average mapped read depth.

matchg.pl --SNP

Calls single nucleotide polymorphisms (SNPs) from WT and MUT alignments. SNPs are called relative to the provided reference genome as well as between WT and MUT. Writes a tab-delimited file named after MUT with suffix SNPS defining each SNP location and associated mapping score (0-255).

Further post-processes the SNP calls and generates an "_Annotated.SNPS" spreadsheet which defines SNPs which fall within gene regions. For each such SNP, it identifies whether it is synonymous or non-synonymous and for the latter describes the amino acid mutation. Links each gene name to the UniProt database. Also generates a copy of the chromosomal EMBL files of the reference which contains 'polymorphism' features for each SNP. This resulting database, which corresponds to the above-mentioned spreadsheet, is suitable for viewing in the Artemis sequence viewer (https://www.sanger.ac.uk/tool/artemis/ ).

Generates ".zero" files which identify regions of the genome that lack read coverage.

matchg.pl --flanks

Parses MUT.zero file and clips sequence flanks to attempt to resolve insertion/deletion mutations.

matchg.pl --gaps

Utilizes a clipped flank file to search reads for gap-spanning sequence. Combines reads and aligns them with the MUSCLE alignment tool to generate report.txt in the align subdirectory. Currently, alignments must be manually evaluated and compared to the template genome to identify insertion or deletion events. Through multiple iterations, sequence can be "walked" across larger gaps to resolve large regions such as plasmid integrations or the context of larger rearrangements and gene deletions. Functionality and usability of this routine will be expanded in future versions.

matchg.pl --AnCopy

Utilizes mapped read depth of WT and MUT across defined sequence elements such as protein coding genes to identify changes in copy number. Generates a report file in tab-delimited format "MUT_AnCopy.tab" for further sorting and processing that covers all annotated features of the genome.

matchg.pl --InCopy

Scans across WT and MUT genomes to identify regions in a context-insensitive manner that demonstrate a change in copy number. Flags bases using a scanning averaging window across each chromosome that exceed a cutoff fold-change in mapped read depth. Combines consecutive positions and generates a report in tab-delimited format "MUT_InCopy.tab".
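The window-based scan described for --InCopy can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the toolset's implementation: the function names are hypothetical, the window is centered and truncated at chromosome ends, the input depths are assumed to be already normalized, and a change is flagged in either direction (ratio at or above the cutoff, or at or below its reciprocal).

```python
def window_means(depth, window):
    """Centered moving average of per-base depth; windows are truncated at the ends."""
    n = len(depth)
    prefix = [0]
    for d in depth:
        prefix.append(prefix[-1] + d)  # prefix sums give O(1) window totals
    half = window // 2
    means = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        means.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return means


def flag_regions(wt_depth, mut_depth, window=400, cutoff=1.8):
    """Return half-open (start, end) index ranges of consecutive flagged bases."""
    wt = window_means(wt_depth, window)
    mut = window_means(mut_depth, window)
    regions, start = [], None
    for i, (w, m) in enumerate(zip(wt, mut)):
        if w == 0 and m == 0:
            flagged = False          # no coverage in either dataset
        elif w == 0 or m == 0:
            flagged = True           # coverage lost in one dataset only
        else:
            ratio = w / m
            flagged = ratio >= cutoff or ratio <= 1 / cutoff
        if flagged and start is None:
            start = i                # open a new region
        elif not flagged and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(wt)))
    return regions
```

In this sketch a larger window smooths the edges of a duplicated region, so the flagged range shrinks relative to the true region. This is the trade-off noted for the [Copy Num Averaging Window] setting.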

Working with the Graphic Interface

To work with the graphic interface, ensure prerequisite programs are installed and a reference genome in embl format is present as above, and execute matchgGUI.pl.

Settings described above may be altered within the dropdown File menu.

The program will detect whether or not you have built a reference genome by your given naming convention previously. If it has not been done previously or if the genome has been altered, click "Build Reference Genome" to construct your reference and indexes from embl files within your working directory.

With a constructed reference, choose the different functions required by the associated checkboxes then click on "Execute Selected Commands" to run them. These each function similarly to what is described above for command line execution. Time required will vary depending on hardware and functions selected.

Description of Generated Files and Reports

The toolset generates a number of report files through its various functions. Below is a brief description of each.

MUT.SNPS – Tab delimited file that defines the locations where nucleotide identities from alignments of WT and MUT reads do not agree. Also includes locations where either or both differ from the reference genome. Includes the position of each, reference, WT, and MUT base identities. Number of reads covering the given location for each, as well as the number of reads calling that particular base identity. Also includes the mapping score describing the confidence with which that base is determined (scale from 0-255).

MUT_Annotated.SNPS – Tab delimited excel file generated from MUT.SNPS which defines protein coding mutations and hyperlinks to gene information.

chromosomeX.contig.out.embl – Modified version of the chromosomal embl file which includes annotations for each identified SNP.

cXtable.txt – Index table of annotated genome features. Used in comparison of copy numbers of each between genomes.

MUT_AnCopy.tab – Tab delimited file describing estimated copy number of each genome feature when comparing WT and MUT alignments. Each entry defines the chromosome, location, and description of each genome annotation. Depth is defined as the average mapped read depth across the entire feature (or gene) coordinates expressed as a multiple of the average mapped depth for the entire genome. This serves as an approximation of region copy number. Ratio is the ratio of WT depth to MUT depth.

MUT_InCopy.tab – Tab delimited file describing estimated copy number exceeding a cutoff when the entire genome is scanned using an averaging window. An entry defines a chromosome and location, along with the consecutive length that exceeds the cutoff threshold. Avg Depth defines the depth averaged across that region as a multiple of the global average mapped read depth. Local Ratio defines the ratio between these averages for WT and MUT datasets. W Avg Depth includes the boundaries of the averaging window itself, and the corresponding ratio is given in WindowRatio. Note that consecutive units are strict and do not include the averaging window boundaries themselves (may be changed in future releases).

MUT.zero – Tab delimited file containing the chromosome and location of each region corresponding to zero mapped read depth. Defines the areas of the genome for which reads fail to map.

MUT.flanks – File containing sequence flanks clipped proximal to zero read depth regions. Contains basic position and orientation information. For use in gap resolution.

Report.txt – Located in the align subdirectory. Contains a plain text output where reads corresponding to the edges of zero mapped read depth regions have been aligned. Alignments must be manually evaluated and compared with reference sequence to determine insertion and deletion events that have led to the absence of coverage.
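The Depth and Ratio values described for MUT_AnCopy.tab above can be illustrated with a short Python sketch. It is an assumption about the arithmetic only, not the toolset's code: the function names are hypothetical, feature coordinates are taken as a half-open index range, and a single per-base depth list stands in for the whole genome.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def relative_depth(depth, start, end):
    """Average depth over [start, end) as a multiple of the overall average depth."""
    return mean(depth[start:end]) / mean(depth)


def feature_ratio(wt_depth, mut_depth, start, end):
    """Return (WT depth, MUT depth, WT/MUT ratio) for one annotated feature.

    The ratio is None when the MUT feature has no coverage.
    """
    wt = relative_depth(wt_depth, start, end)
    mut = relative_depth(mut_depth, start, end)
    ratio = wt / mut if mut > 0 else None
    return wt, mut, ratio


if __name__ == "__main__":
    wt = [10] * 10
    mut = [8] * 5 + [24] * 5  # second half of the region has extra coverage in MUT
    print(feature_ratio(wt, mut, 5, 10))
```

Expressing each depth as a multiple of its own dataset's average makes WT and MUT comparable even when the two sequencing runs differ in total read count.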

Copyright and License

Authors:

James Iben ibenjame@mail.nih.gov
Jonathan Epstein Jonathan_Epstein@nih.gov

Public Domain Notice

Eunice Kennedy Shriver National Institute of Child Health and Human Development

This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Institutes of Health and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NIH and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NIH and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite the author in any work or product based on this material.

This toolset is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.5 or, at your option, any later version of Perl 5 you may have available.