annotate README.rst @ 0:64e75e21466e draft default tip

Uploaded
author pmac
date Wed, 01 Jun 2016 03:38:39 -0400

.. class:: warningmark

**WARNING** This tool requires the 'dbscan' (https://cran.r-project.org/web/packages/dbscan/index.html) and 'flashpcaR' (https://github.com/gabraham/flashpca/releases) R packages to be installed on the Galaxy instance.

======================================
Principal Component Analysis Pipeline
======================================

:Author: Adrian Cheung
:Contact: adrian.che0222@gmail.com
:Date: 15-01-2015

Contents
--------

- `Overview`_
- `Primary Input`_
- `Primary Output`_
- `Options/Secondary Inputs`_
- `Other Output`_
- `Command Line Interface`_
- `Implementation Details`_

Overview
--------
A tool that performs iterative principal component analysis (PCA).
The general idea is to separate patient samples based on their ethnicity by performing PCA on the variant data of each sample.
After each analysis step, outliers are identified. The PCA is then repeated with the outliers removed.
This process continues for a number of iterations specified by the user. After the pipeline completes, the user can view a
detailed summary and access the outliers identified at each iteration.
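
The control flow described above can be sketched in a few lines of Python. This is only an illustration of the loop, not the tool's actual code: ``iterative_trim`` is a hypothetical name, and the PCA step and the outlier rule are passed in as caller-supplied functions (``project`` and ``find_outliers``), both of which are assumptions made for the example.

```python
def iterative_trim(samples, n_iterations, project, find_outliers):
    """Repeat: project the remaining samples, find outliers, drop them.

    samples       -- dict mapping sample id to its data
    project       -- stand-in for the PCA step; returns scores for the samples
    find_outliers -- returns the ids to drop, given those scores
    """
    remaining = dict(samples)
    outliers_per_iteration = []
    for _ in range(n_iterations):
        scores = project(remaining)
        outliers = set(find_outliers(scores))
        outliers_per_iteration.append(sorted(outliers))
        # Outliers found so far are excluded from every later iteration.
        remaining = {sid: row for sid, row in remaining.items()
                     if sid not in outliers}
    return remaining, outliers_per_iteration
```

The per-iteration lists returned here correspond to the outliers the user can inspect after the pipeline completes.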

Primary Input
-------------
As primary input the tool accepts a single file, which may be in one of the following formats:

- **Variant data file:** This should be a tab-delimited text file, with each row containing data about a single variant site from a single person. If this option is selected, the names of the columns that contain important information must also be specified, either via a configuration file (see below) or through the tool's form fields.
- **Numeric ped file:** See http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml for detailed information on the PED format. This tool requires the genotype at each site to be specified numerically, i.e.:

  - 0 = homozygous reference
  - 1 = heterozygous
  - 2 = homozygous alternate

  rather than as a pair of alleles for each site.
- **RData file:** File containing stored data from an R session. For this tool the input must meet certain requirements:

  - The file can only contain a SINGLE R object, which must be a list.
  - The list must contain a named 'bed' element.
  - The 'bed' element must be an n x m matrix/data frame, where n = number of samples and m = number of unique SNPs found in all the samples.
  - The (i, j)th entry in the 'bed' matrix should give the genotype of the ith sample at the jth SNP site, coded according to the key for numeric ped files (as above).
  - The row names of the 'bed' matrix must contain the ids of the samples.
  - The column names of the 'bed' matrix must contain the ids of the SNPs.

  If these very specific criteria are not met, the tool WILL fail.

- **RDS file (command line only):** File containing a single R object. The object must follow the same specifications as for the RData file.
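
To make the numeric coding concrete, here is a minimal sketch of how per-variant rows might be turned into a samples-by-SNPs matrix using the 0/1/2 key. It rests on several assumptions that are not stated in this README: genotypes are written as VCF-style strings such as ``0/1``, sites that are not listed for a sample are treated as homozygous reference, and the function names are hypothetical.

```python
def encode_genotype(gt):
    """Map a genotype string such as '0/1' to 0, 1 or 2 (count of alternate alleles)."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"unsupported genotype: {gt!r}")
    return sum(a == "1" for a in alleles)


def build_matrix(rows):
    """rows: iterable of (sample_id, snp_id, genotype) tuples.

    Returns (sample_ids, snp_ids, matrix), where matrix[i][j] is the coded
    genotype of sample i at SNP j. Unlisted sites default to 0 (assumption).
    """
    rows = list(rows)
    sample_ids = sorted({sample for sample, _, _ in rows})
    snp_ids = sorted({snp for _, snp, _ in rows})
    row_index = {sample: i for i, sample in enumerate(sample_ids)}
    col_index = {snp: j for j, snp in enumerate(snp_ids)}
    matrix = [[0] * len(snp_ids) for _ in sample_ids]
    for sample, snp, gt in rows:
        matrix[row_index[sample]][col_index[snp]] = encode_genotype(gt)
    return sample_ids, snp_ids, matrix
```

The returned ids play the role of the row and column names required of the 'bed' matrix.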

Primary Output
---------------

HTML file containing plots of the PCA for each iteration.
Possible plots, depending on user-specified options:

- **Control vs Cases Plot:** If control and/or cases tags are provided, this plot will be output. ALL samples are plotted, with controls shown in blue, cases in red, and unknown samples in black.
- **Cluster Plot:** Output if the user opts to do clustering. Samples are plotted, with clusters colour-coded. Outliers as identified by DBSCAN are always red and use an open circle as the icon. Trimmed clusters use a cross for the icon instead of a circle. Both the outliers (open circles) AND the rejected clusters (crosses) will be dropped in the next iteration.
- **Outliers Plot:** Output if the user does NOT opt to do clustering. Samples which are considered outliers (as described under 'Detecting outliers without clustering' in `Options/Secondary Inputs`_) are plotted as red open circles; all other samples are plotted as green filled circles.
- **Standard Deviations Plot:** Samples are colour-coded by standard deviation. Samples which fall within 1 standard deviation of the median are red, <= 2 sds are green, <= 3 sds are blue, and > 3 sds are purple.
- **Ethnicity Plot:** Each ethnicity uses a specific colour and symbol. Fairly self-explanatory. The plot is only output if an ethnicity data file is provided as input.

Beneath the plots there are also two expandable lists. 'Samples excluded' shows which samples were not part of the PCA for this iteration; this list is cumulative. 'Outliers' shows the outliers detected in THIS iteration. Any available data from the ethnicity file (if provided) is also displayed for each excluded sample.
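
The rule behind the Outliers Plot and the colour bands of the Standard Deviations Plot can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: the sample standard deviation is used, a sample's distance is taken as the larger of its distances on the first two components, and all function names are hypothetical.

```python
from statistics import median, stdev


def sd_units(values):
    """Distance of each value from the median, in standard deviations."""
    centre = median(values)
    spread = stdev(values)  # sample SD; needs at least two values
    if spread == 0:
        return [0.0 for _ in values]
    return [abs(v - centre) / spread for v in values]


def sample_distances(pc1, pc2):
    """Per sample, the larger of its PC1 and PC2 distances (assumption)."""
    return [max(a, b) for a, b in zip(sd_units(pc1), sd_units(pc2))]


def find_outliers(sample_ids, pc1, pc2, n=3.0):
    """Ids of samples more than n SDs from the median on either component."""
    distances = sample_distances(pc1, pc2)
    return [sid for sid, d in zip(sample_ids, distances) if d > n]


def band_colour(distance):
    """Colour band for a distance: <=1 red, <=2 green, <=3 blue, else purple."""
    for limit, colour in ((1, "red"), (2, "green"), (3, "blue")):
        if distance <= limit:
            return colour
    return "purple"
```

Here ``n`` plays the role of the strictness setting when clustering is not used.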

Options/Secondary Inputs
------------------------
- **Type of input data file:** Either a ped file or a text file, as specified above.
- **Number of iterations to complete:** A single iteration involves performing PCA on the input data, then identifying and removing outliers. Two iterations involve performing PCA again with the outliers identified in the first iteration excluded; three iterations exclude the outliers from the first two iterations, and so on.
- **Detecting outliers without clustering:** This is done by obtaining the standard deviations of the first two principal components. Any sample whose score for either of these two components falls more than 'n' standard deviations from the component median is considered an outlier.
- **Clustering:** The user may select from a range of algorithms that try to identify clusters in the data, with each cluster ideally corresponding to an ethnic group.
- **Clustering methods:**

  - *DBSCAN (Density-based spatial clustering of applications with noise):*

    Forms clusters based on the density of points, and does not require the number of clusters to be specified beforehand. Good for irregularly shaped, non-spherical clusters. Does NOT require all points to be part of a cluster, and produces a set of 'outliers', i.e. points which do not belong to any cluster.

  - *Hierarchical Clustering:*

    Forms clusters based on the distance between points. Tends to result in spherical clusters, but is able to handle clusters of varying density. Forces every point to belong to a cluster. The number of clusters is determined separately, using the silhouette scores of all the points as a heuristic.

- **Cluster trimming methods:** All these methods first involve finding the centre of each cluster.

  - *Standard Deviations:*

    If the centroid of a cluster lies more than ‘n’ standard deviations (n is passed in as a parameter by the user) from the centroid of the entire dataset in either the x or y direction, the entire cluster is cut. If DBSCAN is selected, the outliers it identifies are also cut.

  - *Mean Cluster Distance:*

    Obtain the average distance between clusters by computing the distance between all pairs of clusters and taking the mean. For each cluster, also compute an “isolation” value, which is the mean of the distances between that cluster and all other clusters. If a cluster’s isolation value is larger than the average cluster distance (multiplied by the strictness weighting), that cluster is considered an outlier and is cut from the next iteration. If DBSCAN is selected, the outliers it identifies are also cut.

  - *DBSCAN outliers only:*

    Only cut the points identified by the DBSCAN algorithm as not belonging to any cluster. No entire clusters are cut. This method is only applicable if DBSCAN is selected as the clustering method. THE TOOL WILL NOT RUN IF YOU SELECT THIS OPTION TOGETHER WITH 'Hierarchical Clustering' AS THE CLUSTERING METHOD.

- **Strictness:** A multiplier used to determine how 'strict' the outlier cutting methods are. For example, if strictness = 1 and we are not doing clustering, all points which lie more than 1 sd from the median are cut. If strictness = 2, all points which lie more than 2 sd from the median are cut, etc.

- **Control Tag:** A pattern present in the ids of all the control samples, e.g. "LP"

- **Cases Tag:** A pattern present in the ids of all the case samples, e.g. "HAPS"

- **Configuration file:** A configuration file to accompany an input variant text file. The config file has a rather specific format; an example is given below::

    #control
    control_tag,#Sample,HAPS
    cases_tag,#Sample,LP
    #column_names
    genotype_column,GT
    reference_column,REF
    alternate_column,ALT
    sample_id_column,#Sample
    chromosome_column,CHROM
    position_column,POS
    variant_id_column,ID
    #numeric_filters
    strand_bias_filter,Fraction_with_strand_bias,<,0.03
    position_bias_filter,Fraction_with_positional_bias,<,0.03
    count_filter,Num_samples_variant,>,1
    pass_filter,Fraction_samples_passed_filter,>,0.9
    #string_filters
    variant_type_filter,Variant_Type,exact,accept
    SNV
    genotype_filter,GT,exact,accept
    '0/1,'1/1

  The file consists of up to four sections, each of which starts with a line beginning with a hash character (#).

  - *'#control' section:* Indicates substrings found in the ids of controls and cases.
  - *'#column_names' section:* This is the only required section. The first column is a fixed key, and the second column gives the name of the corresponding column in the variant text file. The keys (left-most values) must be exactly those shown in the example, e.g. sample_id_column; the right-hand values must match the column names in the variant data file. If using a generated config file, only modify the right-hand column, and DO NOT REMOVE ANY rows from this section.
  - *'#numeric_filters' section:* Each filter takes up a single line, which is separated into 4 fields by commas.

    - Column 1: Name of the filter, which is arbitrary
    - Column 2: The name of the column in the variant file to filter on. If this column is not found, a warning is displayed
    - Column 3: The comparison which must be satisfied for a row to be accepted, e.g. less than, greater than
    - Column 4: The cutoff value to be compared against

  - *'#string_filters' section:* Each filter takes up two lines.

    - Line 1, Column 1: Arbitrary filter name
    - Line 1, Column 2: Column name to filter on
    - Line 1, Column 3: Whether the patterns have to be exact matches or just substrings. E.g. if the pattern is "HAPS" and the string being compared is "HAPS-909090", then with 'exact' this is not a successful match, whereas with 'not_exact' it is.
    - Line 1, Column 4: What to do with the row if a successful match is detected, e.g. accept or reject
    - Line 2: A comma-separated list of patterns to match on


- **Ethnicity file:** A file containing ethnicity data, and possibly other data, on the samples. Note that this data is not used to sort the input and has no effect on the PCA itself. It is used only to label the results in the output.

  Requirements:

  - tab delimited
  - Must have at least two columns
  - First column has sample IDs
  - Second column has ethnicities
  - First row must be a header

  The first few lines of a correctly formatted ethnicity file are given below::

    IID population Halo1.or.2. BloodAge SalivaAge COB ethnicity
    LP-10000001 AUSTRALIAN Halo2 - LP-BC 67 NA Australia australian
    LP-10000003 AUSTRALIAN Halo1 45 NA Australia australian southern_european
    LP-10000005 AUSTRALIAN Halo1 73 NA Australia australian southern_european
    LP-10000008 EUROPE Halo1 54 NA South Eastern Europe south_east_european
    LP-10000009 OTHER Halo1 65 NA Southern & East Africa jewish

- **Exclude samples file:** A text file containing exact ids of samples to exclude from the PCA.

  Requirements:

  - single column
  - sample ids separated by newlines
  - one sample id per line

  Example::

    HAPS-90573
    HAPS-90578R
    HAPS-110542
    HAPS-110605
    HAPS-110620
    HAPS-110638
    HAPS-110649
    HAPS-110668
    HAPS-110799
    HAPS-110813
    HAPS-110959
    HAPS-111186
    HAPS-111298
    HAPS-111404
    HAPS-111493
    HAPS-111512
    HAPS-111538

- **Exclude SNPS file:** A text file containing exact ids of SNPs to exclude from the PCA.

  Requirements:

  - single column
  - snp ids separated by newlines
  - one snp id per line

  Example::

    rs72896283
    rs7534447
    rs4662775
    rs10932813
    rs10932816
    rs12330369
    rs1802904
    rs10902762
    rs9996817
    rs6446393
    rs871133
    rs4301095
    rs941849
    rs6917467
    rs75834296
    rs142922667

- **Required Column Headers:** If a variant text file is the primary input, the following information MUST be provided, either through the config file or by filling out the corresponding fields in the tool submission form.

  - Sample IDs: Name of the column containing the sample ids
  - Chromosome: Name of the column indicating which chromosome the SNP is found on
  - Position: Name of the column indicating the position on the chromosome at which the SNP is found
  - Genotype: Name of the column containing the genotype of the sample for this site
  - Reference: Name of the column containing the reference ('normal'/'common') allele for this site
  - Alternate: Name of the column containing the alternate allele for this site
  - Variant IDs: Name of the column indicating the ID of the SNP

- **Numeric Filters:** See Configuration file section
- **String Filters:** See Configuration file section
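
As an illustration of the 'Mean Cluster Distance' trimming rule listed under the cluster trimming methods, the sketch below computes the average pairwise distance between cluster centres and each cluster's isolation value. It assumes that a cluster's centre is the mean of its points and that distances are Euclidean; neither detail is confirmed by this README, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean


def centroid(points):
    """Centre of a cluster, taken here as the mean of its (x, y) points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def clusters_to_cut(clusters, strictness=1.0):
    """clusters: dict mapping a cluster label to a list of (x, y) points.

    Returns the labels whose isolation value exceeds the average
    cluster distance multiplied by the strictness weighting.
    """
    centres = {label: centroid(points) for label, points in clusters.items()}
    if len(centres) < 2:
        return []  # nothing to compare against
    pair_distance = {
        frozenset((a, b)): dist(centres[a], centres[b])
        for a, b in combinations(centres, 2)
    }
    average = mean(pair_distance.values())
    cut = []
    for label in centres:
        # Isolation: mean distance from this cluster to every other cluster.
        isolation = mean(d for pair, d in pair_distance.items() if label in pair)
        if isolation > strictness * average:
            cut.append(label)
    return cut
```

Raising ``strictness`` makes the threshold larger, so fewer clusters are cut.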

Other Output
-------------

- The tool will output a root folder containing the HTML file and all the plots, placed in directories separated by iteration.
- If the input data was a variant file, the output folder will also contain a numeric ped file, generated before the first iteration, as well as a config file. The config file is either the exact one passed in by the user or one automatically generated from the form input, and can be used for future PCA runs.
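
Since the config file may be reused or edited between runs, it can help to see how a file in this format could be read and applied. The following is a rough sketch based only on the format described in this README, not the tool's own parser: the function names are hypothetical, only the ``<`` and ``>`` comparisons from the example are supported, and filters on columns that are missing from a row are silently skipped (an assumption).

```python
import operator

SECTIONS = {"#control", "#column_names", "#numeric_filters", "#string_filters"}
COMPARE = {"<": operator.lt, ">": operator.gt}


def parse_config(text):
    """Parse config text into a dict with one entry per section."""
    config = {"control": {}, "column_names": {},
              "numeric_filters": [], "string_filters": []}
    section = None
    pending = None  # first line of a two-line string filter
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in SECTIONS:
            section = line[1:]
            continue
        fields = line.split(",")
        if section == "control":
            key, column, tag = fields
            config["control"][key] = (column, tag)
        elif section == "column_names":
            key, column = fields
            config["column_names"][key] = column
        elif section == "numeric_filters":
            name, column, op, cutoff = fields
            config["numeric_filters"].append((name, column, op, float(cutoff)))
        elif section == "string_filters":
            if pending is None:
                pending = fields
            else:
                name, column, match, action = pending
                config["string_filters"].append((name, column, match, action, fields))
                pending = None
    return config


def keep_row(row, config):
    """Return True if a row (dict of column name to string) passes all filters."""
    for _, column, op, cutoff in config["numeric_filters"]:
        if column in row and not COMPARE[op](float(row[column]), cutoff):
            return False
    for _, column, match, action, patterns in config["string_filters"]:
        if column not in row:
            continue
        value = row[column]
        if match == "exact":
            hit = value in patterns
        else:
            hit = any(p in value for p in patterns)
        # 'accept' keeps only matching rows; 'reject' drops matching rows.
        if hit != (action == "accept"):
            return False
    return True
```

A row would typically come from reading the tab-delimited variant file with a header, for example via ``csv.DictReader`` with ``delimiter="\t"``.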


Command Line Interface
-----------------------
To run the tool via the command line, make sure you have the following files/folders::

  <root_dir>
    |-> iterative_pca.py
    |-> iterative_pca_plot.R
    |-> pedify.py
    |-> pca_report.html
    |-> R_functions
      |-> plotting_functions.R
      |-> clustering_functions.R
      |-> outlier_trimming.R
      |-> pca_helpers.R
      |-> pipeline_code.R
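
Before a first run it may be convenient to check that this layout is in place. The snippet below is a small convenience sketch, not part of the tool: ``missing_files`` is a hypothetical helper, and it assumes that the five helper ``.R`` files listed after ``R_functions`` live inside that folder.

```python
from pathlib import Path

# Paths relative to <root_dir>; the R_functions nesting is an assumption.
REQUIRED = [
    "iterative_pca.py",
    "iterative_pca_plot.R",
    "pedify.py",
    "pca_report.html",
    "R_functions/plotting_functions.R",
    "R_functions/clustering_functions.R",
    "R_functions/outlier_trimming.R",
    "R_functions/pca_helpers.R",
    "R_functions/pipeline_code.R",
]


def missing_files(root_dir):
    """Return the required relative paths that are not present under root_dir."""
    root = Path(root_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]
```

An empty list means every expected file was found.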


**Output Directory Structure**

By default, the root output directory will be called "full_output_<basename>", where basename is the name of the primary input data file with no extensions. The root output directory will contain the HTML report file, a log file, and possibly some other outputs. Each iteration is stored in its own subdirectory inside the root output directory; these are called "output_<basename>_<iteration number>" and contain the pngs for the plots, as well as the outliers file and the excluded samples file. The output basename can be set with the --out <custom_basename> command line argument. Here is a section of an example output directory tree, if the basename was 'data1'::

  full_output_data1
    |-> data1.html
    |-> data1.log
    |-> output_data1_iteration_0
      |-> data1_outliers.txt
      |-> data1_xsamples.txt
      |-> data1_plot1
      |-> data1_plot2
      ...etc
    |-> output_data1_iteration_1
      ...etc
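
The naming scheme can be expressed as a short helper, which may be handy when post-processing results in a script. This is a sketch under assumptions: ``output_layout`` is a hypothetical function, iteration directories are numbered from 0 and named as in the example tree above, and the basename is taken as the file name up to its first dot.

```python
from pathlib import PurePosixPath


def output_layout(input_file, n_iterations, out=None):
    """Predict output paths for a run; `out` mirrors the --out argument."""
    basename = out or PurePosixPath(input_file).name.split(".")[0]
    root = PurePosixPath(f"full_output_{basename}")
    return {
        "root": root,
        "report": root / f"{basename}.html",
        "log": root / f"{basename}.log",
        "iterations": [
            root / f"output_{basename}_iteration_{i}"
            for i in range(n_iterations)
        ],
    }
```

For an input named ``halo1_numeric.ped`` with no ``--out`` value, the predicted root directory is ``full_output_halo1_numeric``.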

**Example Usage**

To see more help text on how to run the program, run::

  python iterative_pca.py -h

If we have a directory containing some valid input files at <root_dir>/input, then here are some example
use cases:

- *Text file containing variant data input*::

    python iterative_pca.py <root_dir>/input/parsed_all_Halo1_150528_wGT.txt variant_data 10 --config_file <root_dir>/input/halo1_pca.config --ethnicity_file <root_dir>/input/Halo_ethnicity_rf.txt --clustering_flag --clustering_method dbscan --cluster_trimming sd --out halo1_out

  Does 10 iterations on the input file, using the ethnicity data from 'Halo_ethnicity_rf.txt', and outputs data to 'full_output_halo1_out'. Uses the specified config file, uses DBSCAN for clustering, and trims by standard deviations.

- *Numeric ped file input*::

    python iterative_pca.py <root_dir>/input/halo1_numeric.ped numeric_ped 5

  Does 5 iterations on the input file. Trims outliers purely by standard deviations (no clustering), and outputs data to 'full_output_halo1_numeric'.

- *RDS file input*::

    python iterative_pca.py <root_dir>/input/HapMap3_flashPCA_data.rds rds 20 --ethnicity_file <root_dir>/input/HapMap3_ethnicity_rf.txt --clustering_flag --clustering_method hclust --cluster_trimming mcd --out hapmap3_test1 --control LP --cases HAPS

  Does 20 iterations on the input rds file, using ethnicity data from 'HapMap3_ethnicity_rf.txt'. Uses hierarchical clustering, and trims by mean cluster distance. Indicates that ids of control samples contain the pattern "LP", and ids of case samples contain the pattern "HAPS".
64e75e21466e Uploaded
pmac
parents:
diff changeset
304
64e75e21466e Uploaded
pmac
parents:
diff changeset
305
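When scripting several runs, the example invocations above can also be assembled programmatically. The following is a minimal sketch only: ``build_command`` and the ``root_dir`` placeholder are hypothetical and not part of the pipeline, and the sketch uses only the positional arguments and options that appear in the examples above.

```python
import shlex


def build_command(input_file, data_type, iterations, flags=(), **options):
    """Assemble an iterative_pca.py argument list (hypothetical helper).

    Positional order follows the documented examples: input file, data
    type, number of iterations. Keyword options become '--name value'
    pairs and entries in 'flags' become bare '--name' switches.
    """
    cmd = ["python", "iterative_pca.py", input_file, data_type, str(iterations)]
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    for name in flags:
        cmd.append("--" + name)
    return cmd


root_dir = "root_dir"  # placeholder; substitute the real directory

cmd = build_command(
    root_dir + "/input/HapMap3_flashPCA_data.rds", "rds", 20,
    flags=["clustering_flag"],
    ethnicity_file=root_dir + "/input/HapMap3_ethnicity_rf.txt",
    clustering_method="hclust",
    cluster_trimming="mcd",
    out="hapmap3_test1",
)

# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

The resulting list can be handed to ``subprocess.run(cmd, check=True)`` from the directory containing ``iterative_pca.py``; building a list rather than a string avoids shell quoting problems with paths that contain spaces.
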
Implementation Details
----------------------
The program consists of two main scripts: a top level Python script and an R script. The Python script prepares the input files so they can be read by the R script. The R script is where most of the heavy lifting is done; the iterative PCA steps and all of the plotting occur here. After the R script finishes, the Python script generates the HTML output using the Jinja2 templating engine and then exits.

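The division of labour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code: the function names, the arguments passed to ``Rscript``, the template text and the report field are all hypothetical, and the standard library's ``string.Template`` stands in for Jinja2 so that the sketch has no third-party dependencies.

```python
import subprocess
from string import Template

# Stand-in for the pca_report.html template (the real pipeline uses Jinja2).
REPORT_TEMPLATE = Template(
    "<html><body><h1>PCA report</h1>"
    "<p>Iterations completed: $iterations</p></body></html>"
)


def r_command(prepared_input, iterations, out_dir):
    """Build the Rscript call; this argument layout is an assumption."""
    return ["Rscript", "iterative_pca_plot.R", prepared_input,
            str(iterations), out_dir]


def render_report(iterations):
    """Fill the HTML template with data gathered from the R step."""
    return REPORT_TEMPLATE.substitute(iterations=iterations)


def run_pipeline(prepared_input, iterations, out_dir, execute=False):
    """Python prepares the inputs, R does the heavy lifting, Python writes HTML."""
    cmd = r_command(prepared_input, iterations, out_dir)
    if execute:
        # Needs R and the pipeline scripts to be installed and on the path.
        subprocess.run(cmd, check=True)
    return cmd, render_report(iterations)


# Dry run: build the R command and the report without invoking R.
cmd, html = run_pipeline("prepared_input.ped", 10, "full_output_example")
```

Using ``check=True`` makes a failing R step raise an exception, so in this sketch the report is only rendered after the R script has finished successfully.
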
Here is a brief overview of what each file does:

Python:

- **iterative_pca.py**: Top level script which manages the entire pipeline. Its main functions are preparing the input files for the R script, writing to the log file, and creating the HTML file at the end.
- **pedify.py**: Contains functions for parsing the configuration file and generating ped and map files from an input variant text file. Also responsible for filtering out unwanted samples/SNPs when parsing the text file.

R:

- **iterative_pca_plot.R**: Reads in data, then runs multiple PCA iterations, keeping track of the data. This data is then used to generate the output folders, plots and files.
- **plotting_functions.R**: Functions used to prepare the different kinds of plots.
- **clustering_functions.R**: Clustering functions. Contains functions to optimise the DBSCAN and hclust algorithms, including ways to find k, the number of clusters. Also contains different methods for identifying cluster outliers.
- **outlier_trimming.R**: Functions to trim outliers (no clustering).
- **pca_helpers.R**: Functions to read in data and perform PCA, along with some other utility functions.
- **pipeline_code.R**: A function to run a single iteration of the pipeline.

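As an illustration of trimming outliers without clustering, the sketch below drops samples whose score on any principal component lies more than a chosen number of standard deviations from that component's mean. It is written in Python for consistency with the top level script, whereas the pipeline's own trimming lives in ``outlier_trimming.R``; the function name, the data layout and the default threshold are assumptions made for this example.

```python
from statistics import mean, pstdev


def trim_by_sd(scores, threshold=3.0):
    """Return the sample ids kept after standard deviation trimming.

    'scores' maps a sample id to its list of principal component scores.
    A sample is dropped if any of its scores is more than 'threshold'
    standard deviations from the mean of that component.
    """
    ids = list(scores)
    n_components = len(scores[ids[0]])
    keep = set(ids)
    for c in range(n_components):
        column = [scores[i][c] for i in ids]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # no spread on this component, nothing to trim
        for i in ids:
            if abs(scores[i][c] - mu) > threshold * sd:
                keep.discard(i)
    # Preserve the original sample order.
    return [i for i in ids if i in keep]


# Toy data: sample "s5" sits far from the rest on the first component.
scores = {
    "s1": [0.1, 0.1],
    "s2": [-0.1, -0.1],
    "s3": [0.0, 0.1],
    "s4": [0.05, -0.1],
    "s5": [9.0, 0.0],
}
kept = trim_by_sd(scores, threshold=1.5)
print(kept)
```

This is a single trimming pass. In an iterative setting, PCA would be recomputed on the remaining samples before trimming again, since removing an extreme sample changes both the components and their spread.
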
Other:

- **pca_report.html**: HTML template which is filled in with the iteration data after completing a run successfully.
