view README.rst @ 0:64e75e21466e draft default tip

Uploaded
author pmac
date Wed, 01 Jun 2016 03:38:39 -0400
parents
children
line wrap: on
line source

.. class:: warningmark 

'''WARNING''' This tool requires the 'dbscan' (https://cran.r-project.org/web/packages/dbscan/index.html) and 'flashpcaR' (https://github.com/gabraham/flashpca/releases) R packages to be installed on the galaxy instance.

======================================
Principle Component Analysis Pipeline 
======================================

	:Author: Adrian Cheung
	:Contact: adrian.che0222@gmail.com
	:Date: 15-01-2015

Contents
--------

- `Overview`_
- `Primary Input`_
- `Primary Output`_
- `Options/Secondary Inputs`_
- `Other Output`_
- `Command Line Interface`_
- `Implementation Details`_

Overview
--------
A tool which performs iterative principle component analysis. 
The general idea is to seperate patient samples based on their ethnicity, by performing PCA on the variant data of each sample. 
After each analysis step, outliers are identified. The PCA is then repeated, with the outliers removed.
This process continues for a set number of iterations specified by the user. After the pipeline completes, the user can see a 
detailed summary, as well as have access to the outliers identified at each iteration.

Primary Input
-------------
As primary input the tools accepts a single file, which may be formatted in the following ways:

- **Variant data file:** This should be a tab-delimited text file, with each row containing data about a single variant site from a single person. If this option is selected, the column names which contain important information must also be specified, either via a configuration file (see below), or through the tool's form fields.
- **Numeric ped file:** See http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml for detailed information on PED format. This tool requires the affection status of each site to be specified numerically i.e.:

	- 0 = homozygous reference 
	- 1 = heterozygous
	- 2 = homozygous alternate
  
  rather than consisting of pairs of genotypes for each site.
- **RData file:** File containing stored data from an R session. For this tool the input must meet certain requirements:
 	
 	- The file can only contain a SINGLE R object, which must be a list. 
 	- The list must contain a named 'bed' element. 
 	- The 'bed' element must be an n x m matrix/data frame, where n = number of samples, m = number of unique snps found in all the samples. 
 	- The A(i,j)th entry in the 'bed' matrix should indicate affectation status of the ith sample at the jth SNP site, according to the key for numeric ped files (as above). 
 	- The row names of the 'bed' matrix must contain the ids of the samples. 
 	- The column names of the 'bed' matrix must contain the ids of the SNPs. 

  If these very specific criteria are not met, the tool WILL fail.

- **RDS file (command line only):** File containing a single R object. Object must follow same specifications as the RData file

Primary Output
---------------

HTML file containing plots of the PCA for each iteration.
Possible plots, depending on user specified options:

- **Control vs Cases Plot:** If control and/or cases tags are provided, this plot will be output. ALL samples are plotted, with controls shown in blue, cases in red, unknown samples in black.
- **Cluster Plot:** Output if user opts to do clustering. Samples are plotted, with clusters colour-coded. Outliers as identified by DBSCAN are always read and use an open circle as the icon. Trimmed clusters use a cross for the icon, instead of a circle. Both the outliers (open circles) AND the rejected clusters (crosses) will be dropped in the next iteration.
- **Outliers Plot:** Output if user does NOT opt to do clustering. Samples which are considered outliers (as described above in 'Detecting outliers without clustering') are plotted as red open circles; all other samples are plotted as green full circles.
- **Standard Deviations Plot:** Samples are colour-coded by standard deviation. Samples which fall within 1 standard devaiton of the median are red, <= 2 sds are green, <= 3 sds are blue, > 3 sds are purple. 
- **Ethnicity Plot:** Each ethnicity uses a specific colour and symbol. Fairly self-explanotory. Plot is only output if an ethnicity data file is provided as input.

Beneath the plots there are also two expandable lists. Samples excluded shows which samples were not part of the PCA for this iteration. This is cumulative. Outliers shows the outliers detected in THIS iteration. Any available data from the ethnicity file (if provided) is also displayed for each excluded sample.

Options/Secondary Inputs
------------------------
- **Type of input data file:** Either a ped file or a text file as specified above
- **Number of iterations to complete:** A single iteration would involve performing PCA on the input data, then identifying and removing outliers. Two iterations would involve performing PCA again with the outliers identified from the first iteration excluded, three iterations would exclude the outliers from the first 2 stages, and so on and so forth.
- **Detecting outliers without clustering:** This is done by obtaining the standard deviations of the first two principle components. Any samples whose scores for either of these first two components falls more than 'n' number of standard deviations away from the component median are considered outliers.
- **Clustering:** The user may select from a range of algorithms which will try to identify clusters in the data, with each cluster hopefully corresponding to an ethnic group.
- **Clustering methods:**

	- *DBSCAN (Density based spatial clustering of applications with noise):*

	  Forms clusters based on density of points, and does not require the number of clusters to be specified beforehand. Good for irregularly shaped, non-spherical clusters. Does NOT require all points to be part of clusters, and produces a set of 'outliers', i.e. points which do not belong to any clusters.
	
	- *Hierarchical Clustering:* 

	  Forms clusters based on distance between points. Tends to result in spherical clusters, but able to handle clusters of varying density. Forces all points to be part of a single cluster. The number of clusters is determined seperately, using the silhouette scores of all the points as a heuristic.

- **Cluster trimming methods:** All these methods first involve finding the centres of each cluster.

	- *Standard Deviations:* 

	  If the centroid of a cluster lies more than ‘n’ standard deviations (n is passed in as a parameter by the user) from the centroid of the entire dataset in either the x or y directions, the entire cluster is cut. If DBSCAN is selected, the outliers it identifies are also cut.

	- *Mean Cluster Distance:* 

	  Obtain the average distance between clusters, done by computing the distance between all pairs of clusters and taking the mean. For each cluster, we also compute an average “isolation” value, which is the mean of the distances between that particular cluster and all other clusters. If a cluster’s isolation value is larger than the average cluster distance (multiplied by the strictness weighting), then that cluster is considered an outlier and cut from the next iteration. If DBSCAN is selected, the outliers it identifies are also cut.

	- *DBSCAN outliers only:* 

	  Only cut the points identified by the DBSCAN algorithm as not belonging to any cluster. No entire clusters are cut. Obviously this method is only applicable if DBSCAN is selected as the clustering method. THE TOOL WILL NOT RUN IF YOU SELECT THIS OPTION TOGETHER WITH 'Hierarchical Clustering' AS THE CLUSTERING METHOD.

- **Strictness:** A multiplier used to determine how 'strict' the outlier cutting methods are. For example, if strictness = 1, and we are not doing clustering, all points which lie more than 1 sd from the median are cut. If strictness = 2, all points which lie more than 2 sd from the median are cut, etc.

- **Control Tag:** A pattern present in the ids of all the control samples, e.g. "LP"

- **Cases Tag:** A pattern present in the ids of all the cases samples, e.g. "HAPS"

- **Configuration file:** A configuration file to accompany an input variant text file. The config file has a rather specific format, an example is given below::

	#control
	control_tag,#Sample,HAPS
	cases_tag,#Sample,LP
	#column_names
	genotype_column,GT
	reference_column,REF
	alternate_column,ALT
	sample_id_column,#Sample
	chromosome_column,CHROM
	position_column,POS
	variant_id_column,ID
	#numeric_filters
	strand_bias_filter,Fraction_with_strand_bias,<,0.03
	position_bias_filter,Fraction_with_positional_bias,<,0.03
	count_filter,Num_samples_variant,>,1
	pass_filter,Fraction_samples_passed_filter,>,0.9
	#string_filters
	variant_type_filter,Variant_Type,exact,accept
	SNV
	genotype_filter,GT,exact,accept
	'0/1,'1/1

  File consists of up to four sections, the starts of which are marked by lines beginning with an octothorpe.

	- *'#control' section:* Indicates substrings found in ids of controls and cases
	- *'#column_names' section:* This is the only required section. First column indicates what column name (in the variant text file) the second column specifies. The same keys i.e. left most column values, as shown in the example must be used, e.g. sample_id_column, the RHS column names must match the names in the variant data file. If using a generated config file, only modify the RHS column, and DO NOT REMOVE ANY rows from this section.
	- *'#numeric_filters' section:* Each filter takes up a single line, and is seperated into 4 sections by commas.

		- Column 1: Name of the filter, which is arbitrary
		- Column 2: The name of the column in the variant file to filter on. If this column is not found, a warning is displayed
		- Column 3: The criteria of the filter which must be passed in order for us to accept a particular row. E.g. less than, greater than
		- Column 4: The cutoff value to be compared against.

	- *'#string_filters' section:* Each filter takes up two lines.

		- Line 1, Column 1: Arbitrary filter name
		- Line 1, Column 2: Column name to filter on 
		- Line 1, Column 3: Do the patterns have to be exact matches, or just a substrings? E.g. if pattern = "HAPS" and string being compared = "HAPS-909090", if exact was true this would not be a successfull match, whereas if not_exact was true it would be a match.
		- Line 1, Column 4: What to do with the row if a successful match is detected, e.g. accept or reject
		- Line 2: A comma seperated list of patterns to match on


- **Ethnicity file:** An ethnicity file containing ethnicity data, and possible other data, on the samples. Note this data is not used to sort the input and has no effect on the PCA itself. It is used only to label the results of the output.

  Requirements:

 	- tab delimited
 	- Must have at least two columns
 	- First column has sample ID's
 	- Second column has ethnicities
 	- First row must be a header

  First few lines of a correctly formatted ethnicity file given below::

	IID	population	Halo1.or.2.	BloodAge	SalivaAge	COB	ethnicity
	LP-10000001	AUSTRALIAN	Halo2 - LP-BC	67	NA	Australia	australian
	LP-10000003	AUSTRALIAN	Halo1	45	NA	Australia	australian  southern_european
	LP-10000005	AUSTRALIAN	Halo1	73	NA	Australia	australian  southern_european
	LP-10000008	EUROPE	Halo1	54	NA	South Eastern Europe	south_east_european
	LP-10000009	OTHER	Halo1	65	NA	Southern & East Africa	jewish

- **Exclude samples file:** A text file containing exact ids of samples to exclude from the PCA.

  Requirements:

	- single column
	- sample ids seperated by newlines
	- one sample id per line
  
  Example::

	HAPS-90573
	HAPS-90578R
	HAPS-110542
	HAPS-110605
	HAPS-110620
	HAPS-110638
	HAPS-110649
	HAPS-110668
	HAPS-110799
	HAPS-110813
	HAPS-110959
	HAPS-111186
	HAPS-111298
	HAPS-111404
	HAPS-111493
	HAPS-111512
	HAPS-111538

- **Exclude SNPS file:** A text file containing exact ids of SNPs to exclude from the PCA.
  
  Requirements:

	- single column
	- snp ids seperated by newlines
	- one snp id per line

  Example::

	rs72896283
	rs7534447
	rs4662775
	rs10932813
	rs10932816
	rs12330369
	rs1802904
	rs10902762
	rs9996817
	rs6446393
	rs871133
	rs4301095
	rs941849
	rs6917467
	rs75834296
	rs142922667

- **Required Column Headers:** If a variant text file is the primary input, the following information MUST be provided, either through the config file, or by filling out the corresponding fields in the tool submission form.

	- Sample IDs: Name of the column containing the sample ids
	- Chromosome: Name of the column indicating what chromosome the SNP is found on
	- Position: Name of the column indicating at which position on the chromosome the SNP is found
	- Genotype: The genotype of the sample for this site
	- Reference: The 'normal'/'common' genotype for this site
	- Alternate: The alternate genotype for this site
	- Variant IDs: Name of the column indicating the ID of the SNP 

- **Numeric Filters:** See Configuration file section
- **String Filters:** See Configuration file section

Other Output
-------------

- Tool will output a root folder containing the HTML file and all the plots, placed in directories seperated by iteration.
- If the input data was a variant file, the output folder will also contain a numeric ped file, generated before the first iteration, as well as a config file. The config file is either the exact one passed in by the user, or one automatically generated from the form input, which can be used for future PCA runs.


Command Line Interface
-----------------------
To run the tool via the command line, make sure you have the following files/folders::

	<root_dir>
	|-> iterative_pca.py
	|-> iterative_pca_plot.R
	|-> pedify.py
	|-> pca_report.html
	|-> R_functions
		|-> plotting_functions.R
		|-> clustering_functions.R
		|-> outlier_trimming.R
		|-> pca_helpers.R
		|-> pipeline_code.R


**Output Directory Structure**

By default, the root output directory will be called "full_output_<basename>", where basename is the name of the primary input data file with no extensions. The root output dir will contain the HTML report file, as well as a log file, as well as some other possible outputs. Each iteration will be stored in subdirectories inside the output root dir, these will be called "output_<basename>_<iteration number>", and will contain the pngs for the plots, as well as the outliers file and excluded samples file. The output basename can be set with the --out <custom_basename> command line argument. Here is a section of an example output directory tree, if the basename was 'data1'::

	full_output_data1
	|-> data1.html
	|-> data1.log
	|-> output_data1_iteration_0
		|-> data1_outliers.txt
		|-> data1_xsamples.txt
		|-> data1_plot1
		|-> data1_plot2
		...etc
	|-> output_data1_iteration_1
	...etc

**Example Usage**

To see more help text on how to run the program, do::
	
	python iterative_pca.py -h

If we have a directory containing some valid input files at <root_dir>/input, then some here are some example
use cases:

- *Text file containing variant data input*::

	python iterative_pca.py <root_dir>/input/parsed_all_Halo1_150528_wGT.txt variant_data 10 --config_file <root_dir>/input/halo1_pca.config --ethnicity_file <root_dir>/input/Halo_ethnicity_rf.txt --clustering_flag --clustering_method dbscan --cluster_trimming sd --out halo1_out	

  Does 10 iterations on the input file, using the ethnicity data from 'Halo_ethnicity_rf.txt', and outputs data to 'full_output_halo1_out'. Use the specified config file, and use DBSCAN for clustering, and trim by standard deviations. 

- *Numeric ped file input*::

	python iterative_pca.py <root_dir>/input/halo1_numeric.ped numeric_ped 5

  Does 5 iterations on the input file. Trim outliers purely by standard deviations (no clustering), and output data to 'full_output_halo1_numeric'.

- *RDS file input*::

	python iterative_pca.py <root_dir>/input/HapMap3_flashPCA_data.rds rds 20 --ethnicity_file <root_dir>/input/HapMap3_ethnicity_rf.txt --clustering_flag --clustering_method hclust --cluster_trimming mcd --out hapmap3_test1 --control LP --cases HAPS
  
  Do 20 iterations on the input rds file, using ethnicity data from 'HapMap3_ethnicity_rf.txt'. Use hierarchical clustering, and trim by mean cluster distance. Indicate that ids of control samples contain the pattern "LP", and ids of case samples contain the pattern "HAPS".


Implementation Details
----------------------
The program consists of two main scripts: a top level python script and an R script. The python script prepares the input files so they can be read into the R script. The R script is where most of the heavy lifting is done; the iterative pca steps as well as the plotting all occurs here. After this script finishes, the python script will generate the HTML output using the Jinja2 templating engine, before exiting.

Here is a brief overview of each of the files' functions:

Python:

- **iterative_pca.py**: Top level script which manages the entire pipeline. Main functions are preparing the input files for the R script and  writing to the log file, and creating the HTML file at the end.
- **pedify.py**: Contains functions for parsing the configuration file and generating ped and map files from an input variant text file. Also responsible for filtering out unwanted samples/SNPs when parsing the text file.

R:

- **iterative_pca_plot.R**: reads in data, then runs multiple PCA iterations, keeping track of the data. This data is then used to generate the output folders, plots and files.
- **plotting_functions.R**: Functions used to prepare the different kinds of plots.
- **clustering_functions.R**: Clustering functions. Contains functions to optimise the DBSCAN and hclust algorithms, including ways to find k, the number of clusters. Also contains different methods for identifying cluster outliers.
- **outlier_trimming.R**: Functions to trim outliers (no clustering).
- **pca_helpers.R**: Functions to read in data and perform PCA, along with some other utility functions.
- **pipeline_code.R**: A function to run a single iteration of the pipeline.

Other:

- **pca_report.html**: HTML template which is filled in with the iteration data after completing a run successfully.