annotate README @ 3:43724ea1c85f
Add cd-hit for protein fastas
author | Jim Johnson <jj@umn.edu> |
---|---|
date | Thu, 27 Jun 2013 21:37:08 -0500 |
parents | cca0838c1597 |
children |
rev | line source |
---|---|
2
cca0838c1597
Add an environment variable for the -M and -T options for memory and thread allocation
Jim Johnson <jj@umn.edu>
CD-HIT-EST
CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters. Because eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes. CD-HIT-EST is therefore best suited to sequences without introns, such as ESTs.
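As an illustration of the cluster list described above, here is a minimal Python sketch that reads such a file. The line layout it expects (a ">Cluster N" header followed by member lines such as "0<TAB>1200nt, >name... *", with "*" marking the representative) is an assumption based on typical CD-HIT .clstr output, not something this README specifies. The names parse_clusters, MEMBER_RE and SAMPLE, and the sample sequence names, are hypothetical.

```python
import re

# ASSUMPTION: member lines look like "<index>\t<length>nt, >name... *"
# (representative) or "<index>\t<length>nt, >name... at +/97.50%" (other members).
MEMBER_RE = re.compile(r">(?P<name>.+?)\.\.\.\s*(?P<rest>.*)$")


def parse_clusters(lines):
    """Return a list of dicts: {"representative": name or None, "members": [names]}."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A header line opens a new, initially empty cluster.
            clusters.append({"representative": None, "members": []})
            continue
        match = MEMBER_RE.search(line)
        if match is None or not clusters:
            continue  # skip anything that does not look like a member line
        name = match.group("name")
        clusters[-1]["members"].append(name)
        if match.group("rest").strip() == "*":
            clusters[-1]["representative"] = name
    return clusters


# Hypothetical sample in the assumed layout.
SAMPLE = """\
>Cluster 0
0\t1200nt, >est_a... *
1\t1150nt, >est_b... at +/97.50%
>Cluster 1
0\t900nt, >est_c... *
"""

if __name__ == "__main__":
    for number, cluster in enumerate(parse_clusters(SAMPLE.splitlines())):
        print(number, cluster["representative"], len(cluster["members"]))
```

If your CD-HIT version writes the cluster file differently, only MEMBER_RE and the header check should need adjusting.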
Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics (2010) 26:680.
From: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide
CD-HIT was originally a protein clustering program. Its main advantage is speed: it can be hundreds of times faster than other clustering programs such as BLASTCLUST, so it can handle very large databases like NR.
The first version of this program, CD-HI, was published and released in 2001. The second version, CD-HIT, was published in 2002 with significant improvements. Since 2004, CD-HIT has been hosted at bioinformatics.org as an open source project.
Since its release, CD-HIT has become increasingly popular. It has a significant user base, which I estimate at several thousand users, and it is used at many research and educational institutions. For example, UniProt uses CD-HIT to generate the UniRef reference data sets (http://www.pir.uniprot.org/database/DBDescription.shtml), and PDB uses it to treat redundant sequences (http://rutgers.rcsb.org/pdb/redundancy.html).
In 2006, the third major update was published and released. It can perform various jobs, such as clustering a protein database, clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), and generating protein families, among others.
The CD-HIT web server was implemented in 2009. It allows users to cluster or compare sequences without running CD-HIT from the command line. The server provides an interactive interface and additional visualization tools, as well as pre-calculated and regularly updated sequence clusters for several widely used databases.
CD-HIT-454, a special version of CD-HIT, was implemented in 2010 to cluster artificial duplicate reads in pyrosequencing (454) data.
Currently, the CD-HIT package contains many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-para, cd-hit-2d-para, psi-cd-hit, psi-cd-hit-2d, and cd-hit-454. I also developed some utility tools, written in Perl, to help run and analyze CD-HIT jobs.
NOTE to installer:
The tool_dependency sets an environment variable, "CDHIT_SITE_OPTIONS", to -M 4000 -T 0, and its value is included in the command line.
You can adjust the values of -M (memory limit in MB) and -T (number of threads; 0 uses all available CPUs) to match the memory and thread capabilities of your site.
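To make the role of CDHIT_SITE_OPTIONS concrete, the following Python sketch shows one way a wrapper could append the variable's value to a CD-HIT command line. It is an illustration under stated assumptions, not the wrapper's actual code: the function build_cdhit_command, the file names, and the fallback to "-M 4000 -T 0" when the variable is unset are all hypothetical. The -i (input), -o (output) and -c (identity threshold) flags are standard CD-HIT options.

```python
import os
import shlex

# ASSUMPTION: fall back to the value this README says the tool_dependency sets.
DEFAULT_SITE_OPTIONS = "-M 4000 -T 0"


def build_cdhit_command(program, input_fasta, output_prefix, identity, env=None):
    """Return an argument list for a CD-HIT program with the site options appended."""
    env = os.environ if env is None else env
    site_options = env.get("CDHIT_SITE_OPTIONS", DEFAULT_SITE_OPTIONS)
    command = [program, "-i", input_fasta, "-o", output_prefix, "-c", str(identity)]
    # shlex.split turns "-M 4000 -T 0" into ["-M", "4000", "-T", "0"].
    command.extend(shlex.split(site_options))
    return command


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(build_cdhit_command("cd-hit-est", "reads.fasta", "reads_nr", 0.95)))
```

A site with more memory and a fixed core allocation might, for example, export CDHIT_SITE_OPTIONS="-M 16000 -T 8" before the tool runs; the sketch would then append those values instead of the defaults.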