mrCaNaVaR analyzes the mapping read
depth to discover large segmental duplications and deletions. It
also has the capability of predicting absolute copy numbers of
genomic intervals.
Citation
Personalized copy number and segmental duplication maps using next-generation sequencing. Can Alkan, Jeffrey M. Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O. Kitzman, Carl Baker, Maika Malig, Onur Mutlu, S. Cenk Sahinalp, Richard A. Gibbs, Evan E. Eichler. Nature Genetics, Oct, 41(10):1061-1067, 2009.Basics & Definitions
Pronunciation: mrCaNaVaR is
pronounced as /mɪstər
ʤanavar/, which sounds like "mister juh-nuh-vuhR".Requirements:
- mrFAST or mrsFAST
for read mapping stage. No other
mapper is recommended to use with mrCaNaVaR for characterization
of duplications and absolute copy number prediction.
Deletion discovery *might* work, but other mappers are untested.
- NOTE: Our SOLiD mapper drFAST
is untested for CNV detection at this moment.
- zlib for the ability to read compressed SAM files.
- C compiler (mrCaNaVaR is developed with gcc version 4.1.2)
Coordinate space: mrCaNaVaR uses BED-style coordinates for genomic locations (chrom/chromStart/chromEnd) to facilitate easier integration with genome browsers and tools like BEDtools. The first base in a chromosome is numbered 0. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
Sliding windows: mrCaNaVaR operates on three classses of sliding windows over the reference genome to estimate absolute copy numbers and discover duplicated and deleted regions. These three windows are:
- Long Windows (LW) are
used to find locations of deletions and duplications. The LWs are
overlapping. By default the window size size is 5000 bp of
non-masked characters (i.e. A, C, G, T only), and the slide size
is 1000 bp of any characters. For example, in repeat masked NCBI
Build 35, the first two LWs are:
chr1 0 6792
chr1 1000 6792
chr1 2000 7203
Here, the first window has 6792 any characters, but 5000 non-N characters (thus 1792 N characters). We then slide the window by 1000 "any" base pairs, starting the second window at position 1000. The end of the second window is again 6792 since all of the first 1000 characters of chr1 is masked. The third window starts at position 2000 and contains 5203 any characters, where 5000 are non-N.
- Short Windows (SW) are
used to refine breakpoints of detected duplication and deletion
intervals. Default window size is 1000 bp of non-masked
characters, and the slide size is 1000 bp of any characters,
similar to LW.
- Copy Windows (CW) are
used to predict absolute copy numbers. CWs are non-overlapping
windows. Default window size is 1000 bp of non-masked characters.
Before using mrCaNaVaR
Mask the repeats in your reference genome assembly. For the human genome assemblies (NCBI Build 35, 36, GRCh37) we recommend both:
- RepeatMasker with
sensitive "-s" parameter enabled. The repeat masked versions of
many genome assemblies are available at the UCSC
Genome Browser.
- Tandem Repeats Finder (parameters: trf input_file 2 7 7 80 10 50 500 -m)
Read Mapping:
Please use either mrFAST or mrsFAST in single-read mapping mode (not paired-end). Recommended edit/hamming distance threshold is 5% of the read length (i.e. 2 for 36 bp reads, 4 for 76 bp reads, etc.). Using other aligners might be possible if all map locations (including suboptimal alignments) are reported, however, we strongly recommend mrFAST or mrsFAST that were specifically developed for SV/CNV analysis.
Use the same repeat masked version of the reference genome described above for building the mrFAST/mrsFAST index and read mapping.
Using mrCaNaVaR:
- Mask the common and tandem repeats in your reference genome assembly of choice. Replace repeat characters with N.
- Build mrFAST or mrsFAST search index using the same genome.
- Map your reads with mrFAST or mrsFAST.
- Prepare genome configuration file with mrCaNaVaR (PREP mode; see below).
- Calculate read depth values from the mapping information generated by mrFAST or mrsFAST using mrCaNaVaR (READ mode; see below).
- Call segmental duplication and deletion intervals, and predict
absolute copy numbers using mrCaNaVaR (CALL mode; see below).
Run Modes
mrCaNaVaR has three run modes
1. PREP mode (--prep) is used to prepare a reference genome for analysis. Users need to run mrCaNaVaR in this mode only once for a new reference genome assembly (or versions of it with different repeat masking or different window sizes). In this mode mrCaNaVaR will:
- Load reference genome file (FASTA format), calculate the coordinates and GC content of three window classes (LW, SW, and CW).
- Save this information in a "configuration" file.
- -fasta: Repeat masked reference genome in FASTA format (e.g. hg17_masked.fa)
- -gaps: Coordinates of assembly gaps in the reference
genome in BED format (e.g. hg19.gaps.bed)
- -conf: Reference configuration file. This file is in a native binary format. (e.g. hg17.cnvr)
mrcanavar --prep -fasta hg17_masked.fa -gaps hg17.gaps.bed -conf hg17.cnvr
Optional
parameters:
-lw_size <lw_size> : Long window span size. Default is 5000.
-lw_slide <lw_slide> : Long window slide size. Default is 1000.
-sw_size <sw_size> : Short window span size. Default is 1000.
-sw_slide <sw_slide> : Short window slide size. Default is 1000.
-cw_size <cw_size> : Copy number window size. Default is 1000.
-pseudoa <bed_file> : Coordinates for pseudoautosomal regions in the
reference genome in BED format.
2. READ mode (--read) is used after read mapping stage to load mapping information and calculate read depths over the LW, SW, and CW intervals. The mapping information should be in SAM format; either uncompressed (*.sam), or compressed with gzip (*.sam.gz). Both mrFAST and mrsFAST are capable of generating map files in plain SAM format and gzip-compressed SAM. In this mode mrCaNaVaR will:
- Load reference configuration file.
- Scan a given directory for *.sam or *.sam.gz files, and calculate the read depth in three window classes.
- Save the read depth information in a depth output file in a native binary format. In addition it will generate three human-readable text filesfor read depths and GC contents of LW, SW, and CW intervals.
-
- Perform GC correction of the read depth values over the LW, SW, and CW intervals. This is new due to the CONC mode and multiple library support.
- -conf: Reference configuration file created using the PREP mode.
- -samdir: Plain SAM files (*.sam) or gzip-compressed
SAM files (*.sam.gz) in a directory (e.g.
sam_files_are_here/).
- -depth: Read depth file in a native binary format.
- Three human readable text files that contain the same information as the file above. These three files are for manual inspection purposes only, if needed, and will not be loaded and needed by mrCaNaVaR.
mrcanavar --read -conf hg17.cnvr -samdir sam_files_are_here/ -depth mysample.depth
Optional
parameters:
--gz : Indicates the SAM files are compressed in gzip format.
-nsam <first> <second> : Choose a sequence of SAM files between two indexes.
3. CONC mode (--conc) is used to merge >=2 *.depth files created using the READ mode. The CONC mode is optional, and needed only if the SAM files are read into a number of *.depth files; or, if there are more than one library preparation (which should be read into different *.depth files). In this mode, mrCaNaVaR will:
Required
files:
- -conf: Reference configuration file created using the PREP mode.
- -concdepth:
List of read
depth files to
merge.
- -depth: Read depth file in a native binary format.
- Three human readable text files that contain the same information as the file above. These three files are for manual inspection purposesonly, if needed, and will not be loaded and needed by mrCaNaVaR.
mrcanavar --conc -conf hg17.cnvr -concdepth lib1.depth lib2.depth lib3.depth -depth merged.depth
4. CALL mode (--call) is the final step in CNV analysis using mrCaNaVaR. In this mode, mrCaNaVaR will:
- Load reference configuration file.
- Load read depth file.
- Perform GC correction of the read depth values over the LW, SW, and CW intervals if the *.depth file is created with mrCaNaVaR version earlier than 0.5.
- Predict absolute copy number over the CW intervals.
- Predict duplicated and deleted regions in the analyzed sample using the LW and SW intervals.
- -conf: Reference configuration file created using the PREP mode.
- -depth: Read depth file created using the READ mode.
- -o: Output file prefix (out_prefix).
Output files:
- out_prefix.log:
Log file that describes what mrCaNaVaR did.
Includes read depth statistics.
- out_prefix.cw_norm.bed:
GC-corrected read depth values for the
inferred control regions for CW intervals.
- out_prefix.copynumber.bed:
Estimated copy numbers over CW intervals.
Sample command line:
mrcanavar --call -conf hg17.cnvr -depth mysample.depth -o mysample
Optional
parameters:
--xx : Set gender of the sequenced sample as female. Mammalian genomes only.
Default is autodetect.
--xy : Set gender of the sequenced sample as male. Mammalian genomes only.
Default is autodetect.
--verbose : Verbose output.
NOT YET IMPLEMENTED:
-cont_win <cwin_number> : Contiguous window number to look for high/low depth windows.
Default is 7.
-cut_win <cwin_cutoff> : Window number cutoff. Default is 6.