# HG changeset patch # User nikhil-joshi # Date 1426034226 14400 # Node ID 0a70eb1e643287ad12bac08fd412b7a85f48b771 # Parent e01c19b112618568d066c1c57d220efbc6612652 Uploaded diff -r e01c19b11261 -r 0a70eb1e6432 scythe/LICENSE --- a/scythe/LICENSE Tue Aug 06 23:17:08 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,20 +0,0 @@ -MIT License -Permission is hereby granted, free of charge, to any person -obtaining a copy of this software and associated documentation -files (the "Software"), to deal in the Software without -restriction, including without limitation the rights to use, copy, -modify, merge, publish, distribute, sublicense, and/or sell copies -of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be -included in all copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, -EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF -MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS -BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN -ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN -CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. diff -r e01c19b11261 -r 0a70eb1e6432 scythe/README.md --- a/scythe/README.md Tue Aug 06 23:17:08 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,191 +0,0 @@ -# Scythe - A very simple adapter trimmer (version 0.981 BETA) - -Scythe and all supporting documentation -Copyright (c) Vince Buffalo, 2011-2012 - -Contact: Vince Buffalo (with the poly-A tail removed) - -If you wish to report a bug, please open an issue on Github -(http://github.com/vsbuffalo/scythe/issues) so that it can be -tracked. You can contact me as well, but please open an issue first. - -## About - -Scythe uses a Naive Bayesian approach to classify contaminant -substrings in sequence reads. It considers quality information, which -can make it robust in picking out 3'-end adapters, which often include -poor quality bases. - -Most next generation sequencing reads have deteriorating quality -towards the 3'-end. It's common for a quality-based trimmer to be -employed before mapping, assemblies, and analysis to remove these poor -quality bases. However, quality-based trimming could remove bases that -are helpful in identifying (and removing) 3'-end adapter -contaminants. Thus, it is recommended you run Scythe *before* -quality-based trimming, as part of a read quality control pipeline. - -The Bayesian approach Scythe uses compares two likelihood models: the -probability of seeing the matches in a sequence given contamination, -and not given contamination. Given that the read is contaminated, the -probability of seeing a certain number of matches and mistmatches is a -function of the quality of the sequence. Given the read is not -contaminated (and is thus assumed to be random sequence), the -probability of seeing a certain number of matches and mismatches is -chance. The posterior is calculated across both these likelihood -models, and the class (contaminated or not contaminated) with the -maximum posterior probability is the class selected. - -## Requirements - -Scythe can be compiled using GCC or Clang; compilation during -development used the latter. Scythe relies on Heng Li's kseq.h, which -is bundled with the source. - -Scythe requires Zlib, which can be obtained at . - -## Building and Installing Scythe - -To build Scythe, enter: - - make build - -Then, copy or move "scythe" to a directory in your $PATH. - -## Usage - -Scythe can be run minimally with: - - scythe -a adapter_file.fasta -o trimmed_sequences.fasta sequences.fastq - -By default, the prior contamination rate is 0.05. This can be changed -(and one is encouraged to do so!) with: - - scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq - -If you'd like to use standard out, it is recommended you use the ---quiet option: - - scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq - -Also, more detailed output about matches can be obtained with: - - scythe -a adapter_file.fasta -o trimmed_sequences.fasta -m matches.txt sequences.fastq - -By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger -or Solexa (pipeline < 1.3) qualities can be specified with -q: - - scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fasta sequences.fastq - -Lastly, a minimum match length argument can be specified with -n : - - scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fasta sequences.fastq - -The default is 5. If this pre-processing is upstream of assembly on a -very contaminated lane, decreasing this parameter could lead to *very* -liberal trimming, i.e. of only a few bases. - -## Notes - -Scythe only checks for 3'-end contaminants, up to the adapter's length -into the 3'-end. For reads with contamination in *any* position, the -program TagDust () -is recommended. Scythe has the advantages of allowing fuzzier matching -and being base quality-aware, while TagDust has the advantages of very -fast matching (but allowing few mismatches, and not considering -quality) and FDR. TagDust also removes contaminated reads *entirely*, while -Scythe trims off contaminants. - -A possible pipeline would run FASTQ reads through Scythe, then -TagDust, then a quality-based trimmer, and finally through a read -quality statistics program such as qrqc -() or FASTqc -(). - -## FAQ - -### Does Scythe work with paired-end data? - -Scythe does work with paired-end data. Each file must be run -separately, but Scythe will not remove reads entirely leaving -mismatched pairs. - -In some cases, barcodes are ligated to both the 3'-end and 5'-end of -reads. 5'-end removal is trivial since base calling is near-perfect -there, but 3'-end removal can be trickier. Some users have created -Scythe adapter files that contain all possible barcodes concatenated -with possible adapters, so that both can be recognized and -removed. This has worked well and is recommended for cases when 3'-end -quality deteriorates and prevents barcode removal. Newer Illumina -chemistry has the barcode separated from the fragment, so that it -appears as an entirely separate read and is used to demultiplex sample -reads by Illumina's CASAVA pipeline. - -### Does Scythe work on 5'-end or other contaminants? - -No. Embracing the Unix tool philosophy that tools should do one thing -very well, Scythe just removes 3'-end contaminants where there could -be multiple base mismatches due to poor base quality. N-mismatch -algorithms (such as TagDust) don't consider base qualities. Scythe -will allow more mismatches in an alignment if the mismatched bases are -of low quality. - -**Scythe only checks as far in as the entire adapter contaminant's -length.** However, some investigation has shown that Illumina -pipelines sometimes produce reads longer than the read length + -adapter length. The extra bases have always been observed to be -A's. Some testing has shown this can be addressed by appending A's to -the adapters in the adapters file. Since Scythe begins by checking for -contamination from the 5'-end of the adapter, this won't affect the -normal adapter contaminant cases. - -### What does the numeric output from Scythe mean? - -For each adapter in the file, the contaminants removed by position are -returned via standard error. For example: - - Adapter 1 'fake adapter' contamination occurences: - [10, 2, 4, 5, 6] - -indicates that "fake adapter" is 5 bases long (the length of the array -returned), and that there were 10 contaminants found of first base (-n -was set to 0 then), 2 of the first two bases, 4 contaminants of the -first 3 bases, 5 of the first 4 bases, etc. - -### Does Scythe work on FASTA files? - -No, as these have no quality information. - -### How can I report a bug? - -See the section below. - -### How does Scythe compare to program "x"? - -As far as I know, Scythe is the only program that employs a Bayesian -model that allows prior contaminant estimates to be used. This prior -is a more realistic approach than setting a fixed number of mismatches -because we can visually estimate it with the Unix tool `less`. - -Scythe also looks at base-level qualities, *not* just a fixed level of -mismatches. A fixed number of mismatches is a bad approach with data -our group (the UC Davis Bioinformatics Core) has seen, as a small bad -quality run can quickly exhaust even a high numbers of fixed -mismatches and lead to higher false negatives. - -## Reporting Bugs - -Scythe is free software and is proved without a warranty. However, I -am proud of this software and I will do my best to provide updates, -bug fixes, and additional documentation as needed. Please report all -bugs and issues to Github's issue tracker -(http://github.com/vsbuffalo/scythe/issues). If you want to email me, -do so in addition to an issue request. - -If you have a suggestion or comment on Scythe's methods, you can email -me directly. - -## Is there a paper about Scythe? - -I am currently writing a paper on Scythe's methods. In my preliminary -testing, Scythe has fewew false positives and false negatives than -it competitors. \ No newline at end of file diff -r e01c19b11261 -r 0a70eb1e6432 scythe/illumina_adapters.fa --- a/scythe/illumina_adapters.fa Tue Aug 06 23:17:08 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ ->Multiplexing_adapter_1 -GATCGGAAGAGCACACGTCT ->Multiplexing_adapter_2 -CACTCTTTCCCTACACGACGCTCTTCCGATCT diff -r e01c19b11261 -r 0a70eb1e6432 scythe/scythe Binary file scythe/scythe has changed diff -r e01c19b11261 -r 0a70eb1e6432 scythe/scythe.xml --- a/scythe/scythe.xml Tue Aug 06 23:17:08 2013 -0400 +++ b/scythe/scythe.xml Tue Mar 10 20:37:06 2015 -0400 @@ -1,8 +1,8 @@ - + Trimming adapters/contaminants using a Naive Bayesian classifier - scythe --quiet -a $adapter_file + scythe -a $adapter_file #if $input_fastq.ext == "fastq": -q sanger @@ -34,7 +34,7 @@ -m $output_matches #end if - -o $output_trimmed $input_fastq 2> /dev/null + -o $output_trimmed $input_fastq 2>&1 diff -r e01c19b11261 -r 0a70eb1e6432 scythe/truseq_adapters.fasta --- a/scythe/truseq_adapters.fasta Tue Aug 06 23:17:08 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ ->TruSeq_forward_contam -AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[NNNNNN]ATCTCGTATGCCGTCTTCTGCTTGAAAAA ->TruSeq_reverse_contam -AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA