comparison MACARON-GenMed-LabEx/README.md @ 0:c9636a827049 draft default tip

Uploaded
author waqas
date Wed, 12 Sep 2018 08:45:03 -0400
-1:000000000000 0:c9636a827049
1 MACARON User Guide
2 ================
3
4 # Table of Contents
5
6 [//]: # (BEGIN automated TOC section, any edits will be overwritten on next source refresh)
7
8 * [Introduction](#introduction)
9 * [Installation](#installation)
10 * [Operating System Guidelines](#operating-system-guidelines)
11 * [Runtime Pre-requisite](#runtime-pre-requisite)
12 * [Software Dependencies](#software-dependencies)
13 * [Downloading the Source Code](#downloading-the-source-code)
14 * [Contents of the Folder MACARON_GenMed](#contents-of-the-folder-macaron_genmed)
15 * [Running the MACARON](#running-the-macaron)
16 * [Input Requirements](#input-requirements)
17 * [Default Options](#default-options)
18 * [demo Folder](#demo-folder)
19 * [Advanced Options](#advanced-options)
20 * [MACARON Reporting Format](#macaron-reporting-format)
21 * [Validating SNVs Existed on the Same Reads](#validating-snvs-existed-on-the-same-reads)
22 * [References](#references)
23 * [Citation](#citation)
24
25 [//]: # (END automated TOC section, any edits will be overwritten on next source refresh)
26
27 # Introduction
28
29 MACARON (Multi-bAse Codon-Associated variant Re-annotatiON) is a python framework to identify and re-annotate multi-base affected codons in whole genome/exome sequence data. Starting from a standard VCF file, MACARON identifies, re-annotates and predicts the amino acid change resulting from multiple single nucleotide variants (SNVs) within the same genetic codon.
30
31 The information below describes how to install and run MACARON to filter a list of variant records (from a VCF file) called by any existing SNP-based variant caller, in order to identify SNVs within the same genetic codon and correct their corresponding amino acid change.
32
33 See latest [News](https://github.com/waqasuddinkhan/MACARON-GenMed-LabEx/wiki/News???) and [Updates](https://github.com/waqasuddinkhan/MACARON-GenMed-LabEx/wiki#updates) on [MACARON-GenMed-LabEx Wiki page](https://github.com/waqasuddinkhan/MACARON-GenMed-LabEx/wiki).
34
35 # Installation
36
37 ### Operating System Guidelines
38
39 MACARON is known to run on LINUX UBUNTU 16.04 LTS. However, MACARON can be run on any other LINUX version.
40
41 ### Runtime Pre-requisite
42
43 __1.__ MACARON runs under __PYTHON v2.7 or later__. If multiple PYTHON versions are installed, please make sure that your running environment is set to the required version of PYTHON.
44
45 __2.__ Check your __JAVA__ version as MACARON is tested with:
46
47 java -version
48 openjdk version __"1.8.0_151"__
49 OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
50 OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
51
52 ### Software Dependencies
53
54 Before running MACARON, please make sure that the following software is installed properly:
55
56 __1.__ __Genome-Analysis Toolkit__ (https://software.broadinstitute.org/gatk/download/).
57
58 __2.__ __SnpEff__ (tested with __v4.3__, build 2017-05-05 18:41); however, MACARON can also run with any older or newer version (http://snpeff.sourceforge.net/download.html).
59
60 __3.__ __SAMTools__ (tested with version __0.1.19__), however any version can be used.
61
62 __4.__ __Human Reference Genome__: Depends on user’s input.
63
64 __5.__ __SnpEff’s Human Annotation Database__: Depends on user’s input.
65
66 For __1__ and __2__, any version that is compatible with the installed JAVA should work with MACARON.
67
68 ### Downloading the Source Code
69
70 The preferred way to get the latest version of MACARON is:
71
72 git clone https://github.com/waqasuddinkhan/MACARON-GenMed-LabEx.git
73
74 or download the ZIP folder.
75
76 MACARON source code can also be downloaded from http://www.genmed.fr/images/publications/data/MACARON_GenMed.zip
77
78 After acquiring a release distribution of the source code, the build procedure is to unpack the zip file:
79
80 unzip MACARON_GenMed.zip
81
82 ### Contents of the folder MACARON_GenMed
83
84 * *MACARON* – The MACARON python code
85 * *MACARON_validate.sh* – a BASH-shell script to validate multi-SNVs located on the same read that affect the same genetic codon
86
87 # Running the MACARON
88
89 ### Input Requirements
90
91 Before running MACARON, check these __input technical notes__, as the following limitations exist for the input VCF file and for the required software dependencies:
92
93 * Chromosome (chr) notation should be the same in the input VCF file and the Human Reference Genome file,
94
95 * Sequence dictionaries of input VCF file and Human Reference Genome file should be the same,
96
97 * Input VCF file should preferably be annotated with ANNOVAR, and additionally with any other annotation software, e.g., VEP (https://www.ensembl.org/info/docs/tools/vep/index.html), if the user wants the full functionality of the `-f` option (see [Advanced Options](#advanced-options) below),
98
99 * The Human Reference Genome file used for MACARON should be the same one that was used earlier for alignment and/or variant calling,
100
101 * Versions of input VCF file, Human Reference Genome file and SnpEff database file should be the same (hg19 / GRCh37 = SnpEff GRCh37.75) or (hg38 / GRCh38 = SnpEff GRCh38.86).
102
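The first requirement in the list above (matching chromosome notation) can be checked before a run. The following is a minimal sketch, not part of MACARON: the function names are hypothetical, and it only compares the CHROM column of the VCF records with the sequence names in the FASTA headers.

```python
def vcf_contigs(vcf_lines):
    """Collect chromosome names from the CHROM column of VCF records."""
    names = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def fasta_contigs(fasta_lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def missing_from_reference(vcf_lines, fasta_lines):
    """Return VCF chromosome names that do not occur in the reference."""
    return sorted(vcf_contigs(vcf_lines) - fasta_contigs(fasta_lines))


# Toy data (made up for illustration): the VCF uses 'chr22', the reference uses '22'.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr22\t21349676\trs412470\tT\tA",
]
ref = [">22 dna:chromosome", "ACGT"]
print(missing_from_reference(vcf, ref))  # ['chr22']
```

An empty list means every chromosome name in the VCF is also present in the reference. For real files, pass open file handles instead of the toy lists.
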
103 ### Default Options
104
105 For a full list of MACARON executable options, run:
106
107 python MACARON -h
108
109 By default, MACARON depends on the `GLOBAL VARIABLES` set in the script before running:
110
111 ## GLOBAL VARIABLES (IMPORTANT: You can set the default values here)
112 GATK="/home/wuk/software/GenomeAnalysisTK.jar"
113 #GATK="/home/wuk/software/gatk-4.0.1.2/gatk-package-4.0.1.2-local.jar"
114 HG_REF="/home/wuk/Working/gnme_refrnces/Homo_sapiens_assembly19.fasta"
115 SNPEFF="/home/wuk/software/snpEff/snpEff.jar"
116 SNPEFF_HG="GRCh37.75" ## SnpEff genome version
117
118 To run MACARON with __GATK <4.0__ versions, simply type:
119
120 python MACARON -i test_input.vcf
121
122 If running with __GATK >= 4.0__ versions, make the following changes:
123
124 #GATK="/home/wuk/software/GenomeAnalysisTK.jar"
125 GATK="/home/wuk/software/gatk-4.0.1.2/gatk-package-4.0.1.2-local.jar"
126 HG_REF="/home/wuk/Working/gnme_refrnces/Homo_sapiens_assembly19.fasta"
127 SNPEFF="/home/wuk/software/snpEff/snpEff.jar"
128 SNPEFF_HG="GRCh37.75" ## SnpEff genome version
129
130 and run with:
131
132 python MACARON -i test_input.vcf --gatk4
133
134 ### demo Folder
135
136 To help verify a successful installation, MACARON includes a small demo data set:
137
138 * *variants_of_interest.vcf* – a test VCF file to check the functionality of MACARON
139 * *MACARON_output.txt* – The output file generated by running the MACARON
140 * *sub1.chr22_21349676-21349677.sample02.bam* – a subset of BAM file used as input for MACARON_validate.sh
141 * *MACARON_validate.txt* – The output file with read count information of concerned pcSNV in sample02 (in this case).
142 (All files are referenced with hg19)
143
144 `cd` to `demo` folder and run:
145
146 python ../MACARON -i variants_of_interest.vcf
147
148 MACARON_output.txt is the default output file name of MACARON. The user can change it with the `-o` option.
149
150 python ../MACARON -i variants_of_interest.vcf -o variants_of_interest.txt
151
152 ### Advanced Options
153
154 MACARON can also be run with the paths set directly on the command line:
155
156 ```bash
157 python ../MACARON -i variants_of_interest.vcf --GATK /home/wuk/software/GenomeAnalysisTK.jar --HG_REF /home/wuk/Working/gnme_refrnces/Homo_sapiens_assembly19.fasta --SNPEFF /home/wuk/software/snpEff/snpEff.jar --SNPEFF_HG GRCh37.75
158 ```
159 * For __GATK >= 4.0__ versions:
160
161 ```bash
162 python ../MACARON -i variants_of_interest.vcf --gatk4 --GATK /home/wuk/software/ --HG_REF /home/wuk/Working/gnme_refrnces/Homo_sapiens_assembly19.fasta --SNPEFF /home/wuk/software/snpEff/snpEff.jar --SNPEFF_HG GRCh37.75
163 ```
164 MACARON can add additional fields, besides the default (see [MACARON Reporting Format](#macaron-reporting-format)), by using the `-f` option:
165
166 * `-f CSQ` (if input VCF file is additionally annotated with VEP, the output txt file also has the same complete annotation for each variant record)
167
168 * `-f EFF` (if user wants to output SnpEff annotations in output txt file), or `-f ANN` (if SnpEff is used without the `-formatEff` option)
169
170 * `-f QUAL,DP,AF,Func.refGene,Gene.refGene,GeneDetail.refGene` (this will keep any other default annotations of input VCF file and of ANNOVAR to output txt file)
171
172 `-f` accepts several fields at once as a comma-separated list, e.g.,
173
174 * `-f CSQ,DP,Func.refGene`
175 or
176 * `-f FILTER,EFF,CSQ,AF`
177
178 The order of the fields in the output txt file depends on the order of INFO field headers used in `-f`.
179
180 ```bash
181 python ../MACARON -i variants_of_interest.vcf --gatk4 --GATK /home/wuk/software/ --HG_REF /home/wuk/Working/gnme_refrnces/Homo_sapiens_assembly19.fasta --SNPEFF /home/wuk/software/snpEff/snpEff.jar --SNPEFF_HG GRCh37.75 -f QUAL,FILTER,SIFT_pred
182 ```
183 Without the `-f` option, the `QUAL` field is output by default. To keep `QUAL` along with any other field, `-f` should mention `QUAL` in addition to the other field headers: `-f QUAL,FILTER,SIFT_pred`. If only `-f SIFT_pred` is used, the `QUAL` field is replaced by the `SIFT_pred` field.
184
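The `-f` behaviour described above (default `QUAL`, replaced rather than extended when `-f` is given, order preserved) can be summarised in a few lines. This is only an illustrative sketch of the documented behaviour, not MACARON's own code; `extra_fields` and `DEFAULT_FIELDS` are hypothetical names.

```python
DEFAULT_FIELDS = ["QUAL"]  # reported when -f is absent, as described in the text


def extra_fields(f_option=None):
    """Return the optional output fields, in the order given to -f.

    Without -f, only the default QUAL field is reported. With -f, the
    listed fields replace the default, so QUAL must be named explicitly
    to be kept.
    """
    if not f_option:
        return list(DEFAULT_FIELDS)
    return [name.strip() for name in f_option.split(",") if name.strip()]


print(extra_fields())                          # ['QUAL']
print(extra_fields("SIFT_pred"))               # ['SIFT_pred']
print(extra_fields("QUAL,FILTER,SIFT_pred"))   # ['QUAL', 'FILTER', 'SIFT_pred']
```
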
185 # MACARON Reporting Format
186
187 MACARON outputs a table text file with the following format specifications:
188
189 ```
190 chr22 21349676 rs412470 T A LZTR1 423 T/T T/A T/T 0/0 0/1 0/0 MISSENSE S92T Tct Act ATt I 0 0
191 chr22 21349677 rs376419 C T LZTR1 423 C/C C/T C/C 0/0 0/1 0/0 MISSENSE S92F tCt tTt 0 I 0 0
192 ```
193 Field Number | Field Name | Description
194 --- | --- | ---
195 1 |CHROM | Chromosome number
196 2 | POS | Chromosomal position / coordinates of SNV
197 3 | ID | dbSNP rsID
198 4 | REF | Reference base
199 5 | ALT | Alternate base
200 6 | Gene_Name | Name of a gene in which SnpCluster is located
201 7 | QUAL | Quality of the ALT base called
202 8 | [SAMPLE NAME].GT | Genotype of each sample, in base notation (e.g. T/A) as well as binary notation (e.g. 0/1)
203 9 | Protein_coding_EFF | Functional Effect of Variant on protein
204 10 | AA-Change | Amino acid change by individual SNV
205 11 | REF-codon | Reference Codon
206 12 | ALT-codon | Alternate Codon
207 13 | ALT-codon_merge-2VAR | A new codon formed by the combination of two Alt-codons (pcSNV codon; see [MACARON](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty382/4992149?redirectedFrom=fulltext))
208 14 | AA-Change-2VAR | Re-annotated amino acid formed by pcSNV codon
209 15 | ALT-codon_merge-3VAR | A new codon formed by the combination of three Alt-codons
210 16 | AA-Change-3VAR | Re-annotated amino acid formed by the combination of three Alt-codons
211
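To make fields 11 to 14 concrete, the example rows above can be reproduced by hand: the reference codon `Tct` (S) carries two SNVs, giving `Act` (T) and `tTt` (F) individually, and `ATt` (I) when combined. The sketch below illustrates that logic with the standard genetic code; it is an assumption-laden illustration (the function names are hypothetical) and not MACARON's implementation.

```python
# Standard genetic code, laid out in the usual TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def merge_alt_codons(ref_codon, alt_codons):
    """Combine the changed bases of several ALT codons into one codon.

    Changed positions are written in uppercase and unchanged positions
    in lowercase, mirroring the notation of the example rows.
    """
    ref = ref_codon.upper()
    merged = list(ref.lower())
    for alt in alt_codons:
        for pos, base in enumerate(alt.upper()):
            if base != ref[pos]:
                merged[pos] = base  # uppercase marks a changed base
    return "".join(merged)


def translate(codon):
    """Translate a single codon (any letter case) to its amino acid."""
    return CODON_TABLE[codon.upper()]


merged = merge_alt_codons("Tct", ["Act", "tTt"])
print(merged, translate(merged))  # ATt I
```

Translating the codons individually gives S for `Tct`, T for `Act` and F for `tTt`, which matches the `S92T` and `S92F` annotations in the example rows.
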
212 This default MACARON output can be changed by using the `-f` option. For example, if MACARON is run with `-f QUAL,FILTER,SIFT_pred`, the new output looks like:
213
214 Field Number | Field Name | Description
215 --- | --- | ---
216 1 |CHROM | Chromosome number
217 2 | POS | Chromosomal position / coordinates of SNV
218 3 | ID | dbSNP rsID
219 4 | REF | Reference base
220 5 | ALT | Alternate base
221 6 | Gene_Name | Name of a gene in which SnpCluster is located
222 7 | QUAL | Quality of the ALT base called
223 8 | FILTER | Filter (PASS) tag
224 9 | SIFT_pred | Functional effect prediction of SNV on protein
225 10 | [SAMPLE NAME].GT | Genotype of each sample, in base notation (e.g. T/A) as well as binary notation (e.g. 0/1)
226 11 | Protein_coding_EFF | Functional Effect of Variant on protein
227 12 | AA-Change | Amino acid change by individual SNV
228 13 | REF-codon | Reference Codon
229 14 | ALT-codon | Alternate Codon
230 15 | ALT-codon_merge-2VAR | A new codon formed by the combination of two Alt-codons (pcSNV codon; see [MACARON](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty382/4992149?redirectedFrom=fulltext))
231 16 | AA-Change-2VAR | Re-annotated amino acid formed by pcSNV codon
232 17 | ALT-codon_merge-3VAR | A new codon formed by the combination of three Alt-codons
233 18 | AA-Change-3VAR | Re-annotated amino acid formed by the combination of three Alt-codons
234
235 # Validating SNVs Existed on the Same Reads
236
237 **NB: You do not need to run this step if you already used a phased VCF file to run MACARON**
238
239 To confirm the existence of multi-SNVs within the same genetic codon, an accessory BASH-shell script [MACARON_validate.sh](MACARON_validate.sh) calculates the read count information of affected bases. This script requires as input a subset of each BAM file (the same BAM files that were used to generate the input VCF file) covering 50 bps over each SnpCluster.
240
241 Subset of any BAM file can be generated by using the following command:
242
243 `
244 samtools view -hb -L sub1.bed sample02.bam > sub1.chr22_21349676-21349677.sample02.bam
245 `
246
247 In this case, our big BAM file `sample02.bam` (not provided here) is subsetted as `sub1.chr22_21349676-21349677.sample02.bam` (see [demo](demo) folder) for the position `chr22:21349676`. The output BAM file should follow the same naming format. The `sub1.bed` file has 1 tab-separated line:
248
249 `chr22 21349676`
250
251 representing the first position of SnpCluster (SNV1 only).
252
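When there are many SnpClusters, the `.bed` line and the subset BAM file name can be generated rather than typed. The sketch below assumes the naming pattern is `sub<N>.<chr>_<first>-<last>.<sample>.bam`, which is inferred only from the single demo file name above; both helper functions are hypothetical and not part of MACARON.

```python
def bed_line(chrom, first_pos):
    """Tab-separated line for the first position of a SnpCluster (SNV1 only)."""
    return "{0}\t{1}".format(chrom, first_pos)


def subset_bam_name(index, chrom, first_pos, last_pos, sample):
    """File name for a subset BAM, following the pattern of the demo file."""
    return "sub{0}.{1}_{2}-{3}.{4}.bam".format(
        index, chrom, first_pos, last_pos, sample
    )


print(bed_line("chr22", 21349676))
print(subset_bam_name(1, "chr22", 21349676, 21349677, "sample02"))
# sub1.chr22_21349676-21349677.sample02.bam
```
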
253 Once subset BAM file(s) are generated, run MACARON_validate.sh:
254
255 `MACARON_validate.sh sub1.chr22_21349676-21349677.sample02.bam`
256
257 This will generate an output text file (`MACARON_validate.txt`) that the user can use for further analysis.
258
259 sub1 chr22:21349676-21349677 sample02
260 1 AA
261 1 T
262 11 AT
263 14 TC
264
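As a starting point for that further analysis, the counts can be loaded into a dictionary. The sketch below assumes the layout shown above (one header line, then lines of the form `count bases`) and assumes that `AT` stands for reads carrying both ALT bases of the demo SnpCluster (A at 21349676, T at 21349677); these interpretations and the function names are assumptions, so please refer to the Wiki for the authoritative reading.

```python
def parse_validate_counts(lines):
    """Map each observed base string to its read count.

    The first line (cluster id, region, sample) is skipped.
    """
    counts = {}
    for line in lines[1:]:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = counts.get(parts[1], 0) + int(parts[0])
    return counts


def fraction_of_reads(counts, bases):
    """Fraction of all counted reads that show the given base string."""
    total = sum(counts.values())
    return counts.get(bases, 0) / float(total) if total else 0.0


demo = [
    "sub1 chr22:21349676-21349677 sample02",
    "1 AA",
    "1 T",
    "11 AT",
    "14 TC",
]
counts = parse_validate_counts(demo)
print(counts["AT"])                     # 11
print(fraction_of_reads(counts, "AT"))  # 11 of 27 counted reads
```
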
265 See [MACARON-GenMed-LabEx Wiki page](https://github.com/waqasuddinkhan/MACARON-GenMed-LabEx/wiki) for more details, and interpretations of the [demo](demo) data.
266
267 # References
268
269 __1.__ [Van der Auwera G.A., et al. (2013) From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline, Curr Protoc Bioinformatics, 43:11.10.1-11.10.33](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471250953.bi1110s43).
270
271 __2.__ [Cingolani, P., et al. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3, Fly, 6, 80-92](https://www.tandfonline.com/doi/full/10.4161/fly.19695).
272
273 __3.__ [McLaren, W., et al. (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor, Bioinformatics, 26, 2069-2070](https://academic.oup.com/bioinformatics/article/26/16/2069/217748).
274
275 __4.__ [Wang, K., Li, M. and Hakonarson, H. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data, Nucleic Acids Res, 38, e164](https://academic.oup.com/nar/article/38/16/e164/1749458).
276
277 # Citation
278
279 If you use [MACARON](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty382/4992149?redirectedFrom=fulltext) in your research, please cite:
280
281 *Khan W. et al. MACARON: a python framework to identify and re-annotate multi-base affected codons in whole genome/exome sequence data, Bioinformatics 2018*
282
283 *CONTACT: david-alexandre.tregouet@inserm.fr; waqasnayab@gmail.com*
284
285 *VERSION: 0.7*
286 *VERSION DATE: September 5, 2018*