Mercurial > repos > jjohnson > crest
comparison README @ 0:acc8d8bfeb9a
Uploaded
| author | jjohnson |
|---|---|
| date | Wed, 08 Feb 2012 16:59:24 -0500 |
| parents | |
| children |
comparison
equal
deleted
inserted
replaced
| -1:000000000000 | 0:acc8d8bfeb9a |
|---|---|
| 1 This document has the information on how to run CREST for structural | |
| 2 variation detection. | |
| 3 | |
| 4 ============= | |
| 5 Requirements: | |
| 6 ============= | |
| 7 Before running CREST, you need to make sure that several pieces of software | |
| 8 and/or modules are installed on the system: | |
| 9 1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT | |
| 10 can be obtained from these links: | |
| 11 BLAT for academic use: http://www.soe.ucsc.edu/~kent | |
| 12 BLAT commercial license: http://www.kentinformatics.com/ | |
| 13 2. CAP3 assembly program, available here: | |
| 14 CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html | |
| 15 CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech, | |
| 16 rakolehm@mtu.edu or (906)487-2228. | |
| 17 3: SAMtools library for accessing SAM/BAM files, available from SourceForge: | |
| 18 SAMtools: http://sourceforge.net/projects/samtools/files/ | |
| 19 4. BioPerl and Bio::DB::Sam modules. They are usually available as | |
| 20 packages on most Linux distributions, but are also available at this link: | |
| 21 BioPerl: http://www.bioperl.org/ | |
| 22 Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm | |
| 23 Important: you must install SAMtools library before install Bio::DB::Sam. | |
| 24 5. ptrfinder is needed if you want to remove short tandem repeat mediated | |
| 25 SVs, the executable is included in the download package, put it on the path. | |
| 26 | |
| 27 Note: | |
| 28 1. You can use your own programs in place of BLAT and CAP3, but you need to | |
| 29 implement the run method in SVExtTools.pm. | |
| 30 2. The pipeline uses gfServer to mimic a standard blat server, so you need | |
| 31 to setup your own gfServer. Details on setting up the server can be found in | |
| 32 the BLAT package. Using a query server can significantly increase the speed of | |
| 33 the pipeline. | |
| 34 | |
| 35 Your BAM files must contain soft-clipping signatures at the breakpoints. If | |
| 36 they do not, you will not get any results. For more information see the | |
| 37 section "About Soft-Clipping" at the end of this document. | |
| 38 | |
| 39 ===================== | |
| 40 Running the pipeline: | |
| 41 ===================== | |
| 42 | |
| 43 Make sure that all the required perl modules are in @INC. One simple way | |
| 44 is to put all .pm and .pl scripts in the same directory and run them from | |
| 45 this same directory. Also, the input bam file, must be sorted and indexed | |
| 46 before running the pipeline. | |
| 47 | |
| 48 We are going to use two sample bam files (tumor.bam and germline.bam) to | |
| 49 illustate how to run the pipeline. The examples assume you want to find SV | |
| 50 in tumor.bam and you also have the matched germline sample bam file. | |
| 51 | |
| 52 Important: indexing all bam files before running the pipeline is required. | |
| 53 | |
| 54 1. Get soft-clipping positions. | |
| 55 | |
| 56 The program extractSClip.pl will extract all soft-clipping positions first, | |
| 57 and identify those positions with a cluster of soft-clipped reads. The | |
| 58 program requires only the BAM file and the reference genome's FASTA file | |
| 59 The following is an example to extract all positions: | |
| 60 | |
| 61 extractSClip.pl -i tumor.bam --ref_genome hg18.fa | |
| 62 | |
| 63 Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated | |
| 64 for use in the next step. | |
| 65 | |
| 66 Note: The program can use either paired or single-end sequencing data. For | |
| 67 single-end data, use the --nopaired parameter. | |
| 68 | |
| 69 For whole genome sequencing project, we highly suggest running the procedure | |
| 70 in parallel by dividing the genome into pieces. One natural way is by | |
| 71 chromosome. The following is an example to extract all positons on chr4. | |
| 72 | |
| 73 extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4 | |
| 74 | |
| 75 Important: The genome file used in this pipeline must be the same as the one | |
| 76 used to map reads, so the chromosome names need to agree. In this example, | |
| 77 the genome file and bam file all have the chromosome name as 4 instead of | |
| 78 chr4 you may encounter. | |
| 79 | |
| 80 Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated | |
| 81 for use in the next step. So it's very easy to run this step in parallel and | |
| 82 combine the results together to form a final result. | |
| 83 | |
| 84 The output files for this step have names with suffixes of *.cover and | |
| 85 *.sclip.txt. The .cover file is a tab-delimited text file, with columns: | |
| 86 chr, position, strand, number of soft-clipped reads, and coverage at that | |
| 87 position. The strand is just left-clipped or right-clipped to help identify | |
| 88 the SV orientation. The .sclip.txt file has the detailed information for | |
| 89 all soft-clipped reads including sequence and quality values. This file is | |
| 90 also tab-delimited with the following columns: chr, posiiton, strand, read | |
| 91 name, sequence, and quality. | |
| 92 | |
| 93 Example of part of a *.cover file: | |
| 94 4 125892327 + 1 28 | |
| 95 4 125892458 + 1 27 | |
| 96 4 125893225 + 1 28 | |
| 97 4 125893227 + 5 29 | |
| 98 4 125893365 - 1 26 | |
| 99 4 125893979 - 1 16 | |
| 100 10 66301086 - 1 33 | |
| 101 10 66301858 + 4 14 | |
| 102 10 66301865 - 8 21 | |
| 103 10 66301871 - 1 22 | |
| 104 10 66302136 + 1 51 | |
| 105 | |
| 106 Example of part of a *.sclip.txt file: | |
| 107 4 125892327 + HWUSI-EAS1591_6113C:3:17:12332:19420#0 CC CC | |
| 108 4 125892458 + HWUSI-EAS1591_6113C:4:91:6281:9961#0 GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC | |
| 109 4 125893225 + HWI-EAS90_614M9:5:18:17924:10181#0 CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC | |
| 110 | |
| 111 If you run this step in parallel, you need to combine the outputs by | |
| 112 concatenating the files. Tumor and germline files must be concatenated | |
| 113 separately, for example: | |
| 114 cat tumor.bam.*.cover > tumor.bam.cover | |
| 115 cat germline.bam.*.cover > germline.bam.cover | |
| 116 | |
| 117 2.Remove germline events (optional) | |
| 118 | |
| 119 Running step 1 on both germline and tumor samples, you will get the | |
| 120 soft-clipping posiitons in both samples. This step will remove any position | |
| 121 in the tumor sample that also appears in germline sample, so germline events | |
| 122 will be removed. This step does not use any sequence information and could | |
| 123 remove true events. By our observations, true events are rarely removed. | |
| 124 You can skip this step and the program will do germline clean up at later step | |
| 125 (see the -g parameter for CREST.pl). | |
| 126 | |
| 127 The script for this step is countDiff.pl and it only requires two parameters | |
| 128 to specity the two output files from previous step. | |
| 129 | |
| 130 countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt | |
| 131 | |
| 132 A file named tumor.bam.cover.somatic.cover will be generated for next step. | |
| 133 | |
| 134 The program will generate a file with suffix *.somatic.cover, and it will | |
| 135 be used for the next step. The file has the same format as *.cover generated | |
| 136 in the previous step. | |
| 137 | |
| 138 The standard output will show the coverage distribution. For every read count | |
| 139 in the range 1-999, it will show the number of breakpoints supported by that | |
| 140 many soft-clipped reads. | |
| 141 | |
| 142 3. Running the SV detection script. | |
| 143 | |
| 144 This is the core step in the detection process. The program is CREST.pl. | |
| 145 | |
| 146 The program needs quite a few parameters, but you can think about what you | |
| 147 will need. Here is a partial list of required and common parameters: | |
| 148 | |
| 149 -f The input soft-clipped coverage file produced in step 1 or 2. | |
| 150 -d The disease or tumor bam file | |
| 151 -g The germline bam file. If you want to identify somatic SVs only, you | |
| 152 should provide this parameter. If you also want to identify germline | |
| 153 events, you can leave this parameter unspecified. When treat your | |
| 154 germline file as disease without specify -g parameter, the program can | |
| 155 be used to identify germline events, or SV polymorphism. | |
| 156 --ref_genome The reference genome in fa format (used by bam file) | |
| 157 -t The reference genome in 2bit format (used by gfClient), this file can | |
| 158 be generated by using faToTwoBit program in BLAT program suit. This | |
| 159 file must be the same as the one you used to setup gfServer. | |
| 160 --blatserver The name or IP address of blat server, you need to use your | |
| 161 own one instead of using the public one at UCSC. | |
| 162 --blatport The port number for the blat server. | |
| 163 --nopaired Tell the program the reads are not paired. | |
| 164 | |
| 165 If all of the required programs are on the path then you won't need to | |
| 166 specify them again, otherwise you need to specify the paths to the programs | |
| 167 using the corresponding parameters. Please use CREST.pl --man to show the | |
| 168 man page, which provides a detailed parameter list. | |
| 169 | |
| 170 An example of running this step is: | |
| 171 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \ | |
| 172 hg18.fa -t hg18.2bit | |
| 173 | |
| 174 There is also a -r parameter to specify the range to be searched and it's highly | |
| 175 recommended to run using -r as below: | |
| 176 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \ | |
| 177 hg18.fa -t /genome/hg18.2bit -r chr1 | |
| 178 So it's very easy to run the program in parallel by spliting the genome into | |
| 179 pieces. | |
| 180 | |
| 181 The program will generate a *.predSV.txt file. The filename will be the input | |
| 182 bam with .predSV.txt appended unless you specify the -p parameter. Also the | |
| 183 STDERR output has the full list of SVs, including rejected ones. The output | |
| 184 file *.predSV.txt has the following tab-delimited columns: left_chr, left_pos, | |
| 185 left_strand, # of left soft-clipped reads, right_chr, right_pos, right_strand, | |
| 186 # right soft-clipped reads, SV type, coverage at left_pos, coverage at | |
| 187 right_pos, assembled length at left_pos, assembled length at right_pos, | |
| 188 average percent identity at left_pos, percent of non-unique mapping reads at | |
| 189 left_pos, average percent identity at right_pos, percent of non-unique mapping | |
| 190 reads at right_pos, start position of consensus mapping to genome, | |
| 191 starting chromosome of consensus mapping, position of the genomic mapping of | |
| 192 consensus starting position, end position of consensus mapping to genome, | |
| 193 ending chromsome of consnesus mapping, position of genomic mapping of | |
| 194 consensus ending posiiton, and consensus sequences. For inversion(INV), the | |
| 195 last 7 fields will be repeated to reflect the fact two different breakpoints | |
| 196 are needed to identify an INV event. | |
| 197 | |
| 198 Example of the tumor.predSV.txt file: | |
| 199 4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 0.895173453996983 0.230769230769231 0.735384615384615 0.5 1 4 125893135 176 10 66301773 TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT | |
| 200 5 7052198 - 0 10 66301865 + 8 CTX 0 22 0 81 0.761379310344828 0.482758620689655 0 0 1 5 7052278 164 10 66301947 AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG | |
| 201 10 66301858 + 4 4 125893225 - 1 CTX 15 28 71 81 0.735384615384615 0.5 0.889507154213037 0.243243243243243 1 10 66301777 153 4 125893154 TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA | |
| 202 | |
| 203 If there are no or very few results, there may be a lack of soft-clipping. See | |
| 204 the section "About Soft-Clipping" at the end of this document. | |
| 205 | |
| 206 4. Visulization of the detailed alignment at breakpoint (optional) | |
| 207 | |
| 208 The bam2html.pl script builds an html view of the multiple alignment for | |
| 209 the breakpoint, so you can manually check the soft-clipping and other things. | |
| 210 | |
| 211 bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \ | |
| 212 --ref_genome /genome/hg18 -o diag.bam.predSV.html | |
| 213 | |
| 214 The output file is specified by -o option. | |
| 215 | |
| 216 ==================== | |
| 217 About Soft-Clipping: | |
| 218 ==================== | |
| 219 | |
| 220 CREST uses soft-clipping signatures to identify breakpoints. Soft-clipping is | |
| 221 indicated by "S" elements in the CIGAR for SAM/BAM records. Soft-clipping may | |
| 222 not occur, depending on the mapping algorithm and parameters and sometimes even | |
| 223 the library preparation. | |
| 224 | |
| 225 With bwa sampe: | |
| 226 --------------- | |
| 227 | |
| 228 One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end | |
| 229 reads). When BWA successfully maps one read in a pair but is not able to map | |
| 230 the other, it will attempt a more permissive Smith-Waterman alignment of the | |
| 231 unmapped read in the neighborhood of the mapped mate. If it is only able to | |
| 232 align part of the read, then it will soft-clip the portion on the end that it | |
| 233 could not align. Often this occurs at the breakpoints of structural | |
| 234 variations. | |
| 235 | |
| 236 In some cases when the insert sizes approach the read length, BWA will not | |
| 237 perform Smith-Waterman alignment. Reads from inserts smaller than the read | |
| 238 length will contain primer and/or adapter and will often not map. When the | |
| 239 insert size is close to the read length, this creates a skewed distribution | |
| 240 of inferred insert sizes which may cause BWA to not attempt Smith-Waterman | |
| 241 realignment. This is indicated by the error message "weird pairing". Often | |
| 242 in these cases there are also unusually low mapping rates. | |
| 243 | |
| 244 One way to fix this problem is to remap unmapped reads bwasw. To do this, | |
| 245 extract the unmapped reads as FASTQ files (this may be done with a combination | |
| 246 of samtools view -f 4 and Picard's SamToFastq). Realign using bwa bwasw and | |
| 247 build a BAM file. Then, re-run CREST on this new BAM file, and you may pick | |
| 248 up events that would have been missed otherwise. | |
| 249 | |
| 250 With other aligners: | |
| 251 -------------------- | |
| 252 | |
| 253 Consult the documentation or mailing list(s) for your mapper to determine its | |
| 254 behavior with regard to soft-clipping. |
