Mercurial > repos > xuebing > sharplabtool
view tools/rgenetics/rgfakePhe.py @ 1:cdcb0ce84a1b
Uploaded
author | xuebing |
---|---|
date | Fri, 09 Mar 2012 19:45:15 -0500 |
parents | 9071e359b9a3 |
children |
line wrap: on
line source
""" fakephe.py ross lazarus sept 30 2007 This is available under the LGPL as defined then. use the pedigree data for ids use pythons generators to literally generate a bunch of random phenotype measurements Plink format at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#pheno is To specify an alternate phenotype for analysis, i.e. other than the one in the *.ped file (or, if using a binary fileset, the *.fam file), use the --pheno option: plink --file mydata --pheno pheno.txt where pheno.txt is a file that contains 3 columns (one row per individual): Family ID Individual ID Phenotype NOTE The original PED file must still contain a phenotype in column 6 (even if this is a dummy phenotype, e.g. all missing), unless the --no-pheno flag is given. If an individual is in the original file but not listed in the alternate phenotype file, that person's phenotype will be set to missing. If a person is in the alternate phenotype file but not in the original file, that entry will be ignored. The order of the alternate phenotype file need not be the same as for the original file. If the phenotype file contains more than one phenotype, then use the --mpheno N option to specify the Nth phenotype is the one to be used: plink --file mydata --pheno pheno2.txt --mpheno 4 where pheno2.txt contains 5 different phenotypes (i.e. 7 columns in total), this command will use the 4th for analysis (phenotype D): Family ID Individual ID Phenotype A Phenotype B Phenotype C Phenotype D Phenotype E Alternatively, your alternate phenotype file can have a header row, in which case you can use variable names to specify which phenotype to use. If you have a header row, the first two variables must be labelled FID and IID. All subsequent variable names cannot have any whitespace in them. For example, FID IID qt1 bmi site F1 1110 2.3 22.22 2 F2 2202 34.12 18.23 1 ... then plink --file mydata --pheno pheno2.txt --pheno-name bmi --assoc will select the second phenotype labelled "bmi", for analysis Finally, if there is more than one phenotype, then for basic association tests, it is possible to specify that all phenotypes be tested, sequentially, with the output sent to different files: e.g. if bigpheno.raw contains 10,000 phenotypes, then plink --bfile mydata --assoc --pheno bigpheno.raw --all-pheno will loop over all of these, one at a time testing for association with SNP, generating a lot of output. You might want to use the --pfilter command in this case, to only report results with a p-value less than a certain value, e.g. --pfilter 1e-3. WARNING Currently, all phenotypes must be numerically coded, including missing values, in the alternate phenotype file. The default missing value is -9, change this with --missing-phenotype, but it must be a numeric value still (in contrast to the main phenotype in the PED/FAM file. This issue will be fixed in future releases. Covariate files =========================== rgfakePhe.xml <tool id="fakePhe1" name="Fake phenos"> <description>for multiple null fake phenotype</description> <command interpreter="python2.4">rgfakePhe.py $input1 '$title1' $out_file1 $log_file1 $script_file</command> <inputs> <page> <param name="input1" type="library" format="lped" label="Pedigree from Dataset"/> <param name="title1" type="text" value="My fake phenos" size="60" label="Title for outputs"/> </page> <page> <repeat name="fakePhe" title="Phenotypes to Fake"> <param name="pName" type="text" label="Phenotype Name"> </param> <conditional name="series"> <param name="phetype" type="select" label="Phenotype Type"> <option value="rnorm" selected="true">Random normal variate</option> <option value="cat">Random categorical</option> </param> <when value="rnorm"> <param name="Mean" type="float" value="0.0" label="Mean"> </param> <param name="SD" type="float" label="SD" value="1.0"> </param> </when> <when value="cat"> <param name="values" type="text" value="1,2,3,fiddle" label="comma separated values to choose from"> </param> </when> </conditional> </repeat> </page> </inputs> <configfiles> <configfile name="script_file"> #for $n, $f in enumerate($fakePhe) #if $f.series.phetype=='rnorm' {'pN':'$f.pName','pT':'$f.series.phetype','pP':"{'Mean':'$f.series.Mean', 'SD':'$f.series.SD'}"} #elif $f.series.phetype=='cat' {'pN':'$f.pName','pT':'$f.series.phetype','pP':"{'values':'$f.series.values'}"} #end if #end for </configfile> </configfiles> <outputs> <data format="pphe" name="out_file1" /> <data format="text" name="log_file1" parent="out_file1"/> </outputs> <help> .. class:: infomark This tool allows you to generate an arbitrary (sort of) synthetic phenotype file with measurements drawn from normal, gamma, or categorical distributions. These are for testing under the null hypothesis of no association - the values are random but from user specified distributions. ----- .. class:: warningmark This tool is very experimental ----- - **Pedigree** is a library pedigree file - the id's will be used in the synthetic null phenotypes - **Title** is a name to give to the output phenotype file On the next page, you can add an unlimited number of various kinds of phenotypes including choices for categorical ones or distributions with specific parameters Just keep adding new ones until you're done then use the Execute button to run the generation **Example from another tool to keep you busy and in awe** Input file:: 1 68 4.1 2 71 4.6 3 62 3.8 4 75 4.4 5 58 3.2 6 60 3.1 7 67 3.8 8 68 4.1 9 71 4.3 10 69 3.7 Create a two series XY plot on the above data: - Series 1: Red Dashed-Line plot between columns 1 and 2 - Series 2: Blue Circular-Point plot between columns 3 and 2 .. image:: ./static/images/xy_example.jpg </help> </tool> """ import random,copy,sys,os,time,string,math doFbatphe = False galhtmlprefix = """<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="Galaxy %s tool output - see http://g2.trac.bx.psu.edu/" /> <title></title> <link rel="stylesheet" href="/static/style/base.css" type="text/css" /> </head> <body> <div class="document"> """ def timenow(): """return current time as a string """ return time.strftime('%d/%m/%Y %H:%M:%S', time.localtime(time.time())) def poisson(lamb=2): """ algorithm poisson random number (Knuth): init: Let L = e^-lamb, k = 0 and p = 1. do: k = k + 1. Generate uniform random number u in [0,1] and let p = p u. while p >= L. return k - 1. """ lamb = float(lamb) l = math.e**(-lamb) k=0 p=1 while 1: while p >= l: k += 1 u = random.uniform(0.0,1.0) p *= u yield '%e' % (k - 1) def gammaPhe(alpha=1,beta=1): """generator for random values from a gamma """ while 1: # an iterator for a random phenotype dat = random.gammavariate(float(alpha),float(beta)) yield '%e' % dat def weibullPhe(alpha=1,beta=1): """generator for random values from a weibull distribution """ while 1: # an iterator for a random phenotype dat = random.weibullvariate(float(alpha),float(beta)) yield '%e' % dat def normPhe(mean=0,sd=1): """generator for random values from a normal distribution """ while 1:# an iterator for a random phenotype dat = random.normalvariate(float(mean),float(sd)) yield '%e' % dat def expoPhe(mean=1): """generator for random values from an exponential distribution """ lamb = 1.0/float(mean) while 1:# an iterator for a random phenotype dat = random.expovariate(lamb) yield '%e' % dat def catPhe(values='1,2,3'): """ Schrodinger's of course. """ v = values.split(',') while 1:# an iterator for a random phenotype dat = random.choice(v) yield dat def uniPhe(low=0.0,hi=1.0): """generator for a uniform distribution what if low=-5 and hi=-2 """ low = float(low) hi = float(hi) while 1: # unif phenotype v = random.uniform(low,hi) # 0-1 yield '%e' % v def getIds(indir='foo'): """read identifiers - first 2 cols from a pedigree file or fam file """ inpath = os.path.split(indir)[0] # get root infam = '%s.fam' % indir inped = '%s.ped' % indir # if there's a ped flist = os.listdir(inpath) if len(flist) == 0: print >> sys.stderr, '### Error - input path %s is empty' % indir sys.exit(1) if os.path.exists(infam): pfname = infam elif os.path.exists(inped): pfname = inped else: print >> sys.stderr, '### Error - input path %s contains no ped or fam' % indir sys.exit(1) f = file(pfname,'r') ids = [] for l in f: ll = l.split() if len(ll) > 5: # ok line? ids.append(ll[:2]) return ids def makePhe(phes = [],ids=[]): """Create the null phenotype values and append them to the case id phes is the (generator, headername) for each column for a phe file, ids are the case identifiers for the phenotype file res contains the final strings for the file each column is derived by iterating over the generator actions set up by makePhes """ header = ['FID','IID'] # for plink res = copy.copy(ids) for (f,fname) in phes: header.append(fname) for n,subject in enumerate(ids): res[n].append(f.next()) # generate the right value res.insert(0,header) res = [' '.join(x) for x in res] # must be space delim for fbat return res def makePhes(pheTypes=[], pheNames=[], pheParams=[]): """set up phes for makePhe each has to have an iterator (action) and a name """ action = None phes = [] for n,pt in enumerate(pheTypes): name = pheNames[n] if pt == 'rnorm': m = pheParams[n].get('Mean',0.0) s = pheParams[n].get('SD',1.0) action = normPhe(mean=m,sd=s) # so we can just iterate over action elif pt == 'rgamma': m = pheParams[n].get('Alpha',0.0) s = pheParams[n].get('Beta',1.0) action = gammaPhe(alpha=m,beta=s) if pt == 'exponential': m = pheParams[n].get('Mean',1.0) action = expoPhe(mean=m) # so we can just iterate over action elif pt == 'weibull': m = pheParams[n].get('Alpha',0.0) s = pheParams[n].get('Beta',1.0) action = weibullPhe(alpha=m,beta=s) elif pt == 'cat': v = pheParams[n].get('values',['?',]) action = catPhe(values=v) elif pt == 'unif': low = pheParams[n].get('low',0.0) hi = pheParams[n].get('hi',1.0) action = uniPhe(low=low,hi=hi) elif pt == 'poisson': lamb = pheParams[n].get('lamb',1.0) action = poisson(lamb=lamb) phes.append((action,name)) return phes def doImport(outfile='test',flist=[],expl='',mylog=[]): """ import into one of the new html composite data types for Rgenetics Dan Blankenberg with mods by Ross Lazarus October 2007 """ outf = open(outfile,'w') progname = os.path.basename(sys.argv[0]) outf.write(galhtmlprefix % progname) if len(flist) > 0: outf.write('<ol>\n') for i, data in enumerate( flist ): outf.write('<li><a href="%s">%s %s</a></li>\n' % (os.path.split(data)[-1],os.path.split(data)[-1],expl)) outf.write('</ol>\n') else: outf.write('No files found') outf.write('<hr/><h4>Log of run follows:</h4><br/><pre>%s\n</pre><br/><hr/>' % ('\n'.join(mylog))) outf.write("</div></body></html>\n") outf.close() def test(): """test case need to get these out of a galaxy form - series of pages - get types on first screen, names on second, params on third? holy shit. this actually works I think """ pT = ['rnorm','rnorm','rnorm','rnorm','cat','unif'] pN = ['SysBP','DiaBP','HtCM','WtKG','Race','age'] pP = [{'Mean':'120','SD':'10'},{'Mean':'90','SD':'15'},{'Mean':'160','SD':'20'},{'Mean':'60','SD':'20'}, \ {'values':'Blink,What,Yours,green'},{'low':16,'hi':99}] phes = makePhes(pheTypes=pT, pheNames=pN, pheParams=pP) ids = [] for i in range(10): ids.append(['fid%d' % i,'iid%d' % i]) pheres = makePhe(phes=phes,ids=ids) res = [''.join(x) for x in pheres] print '\n'.join(res) if __name__ == "__main__": """ <command interpreter="python">rgfakePhe.py '$infile1.extra_files_path/$infile1.metadata.base_name' "$title1" '$ppheout' '$ppheout.files_path' '$script_file ' </command> The xml file for this tool is complex, and allows any arbitrary number of phenotype columns to be specified from a couple of optional types - rnorm, cat are working now. Note that we create new files in their respective library directories and link to them in the output file so they can be displayed and downloaded separately """ killme = string.punctuation + string.whitespace trantab = string.maketrans(killme,'_'*len(killme)) progname = os.path.basename(sys.argv[0]) cl = '## at %s, %s got cl= %s' % (timenow(),progname,' '.join(sys.argv)) print >> sys.stdout,cl if len(sys.argv) < 5: test() else: inped = sys.argv[1] title = sys.argv[2].translate(trantab) ppheout = sys.argv[3] pphe_path = sys.argv[4] scriptfile = sys.argv[5] ind = file(scriptfile,'r').readlines() mylog = [] s = '## %s starting at %s<br/>\n' % (progname,timenow()) mylog.append(s) mylog.append(cl) s = '## params = %s<br/>\n' % (' '.join(sys.argv[1:])) mylog.append(s) s = '\n'.join(ind) mylog.append('Script file %s contained %s<br/>\n' % (scriptfile,s)) pT = [] pN = [] pP = [] for l in ind: l = l.strip() if len(l) > 1: adict = eval(l) pT.append(adict.get('pT',None)) pN.append(adict.get('pN',None)) pP.append(eval(adict.get('pP',None))) s = '## pt,pn,pp=%s,%s,%s<br/>\n' % (str(pT),str(pN),str(pP)) mylog.append(s) phes = makePhes(pheTypes=pT, pheNames=pN, pheParams=pP) ids = getIds(indir=inped) # for labelling rows pheres = makePhe(phes=phes,ids=ids) # random values from given distributions try: os.makedirs(pphe_path) except: pass outname = os.path.join(pphe_path,title) pphefname = '%s.pphe' % outname f = file(pphefname, 'w') f.write('\n'.join(pheres)) f.write('\n') f.close() if doFbatphe: try: os.makedirs(fphe_path) except: pass outname = os.path.join(fphe_path,title) fphefname = '%s.phe' % outname f = file(fphefname, 'w') header = pheres[0].split() pheres[0] = ' '.join(header[2:])# remove iid fid from header for fbat f.write('\n'.join(pheres)) f.close() doImport(outfile=fpheout,flist=[fphefname,],expl='(FBAT phenotype format)',mylog=mylog) #doImport(outfile='test',flist=[],expl='',mylog=[]): doImport(outfile=ppheout,flist=[pphefname,],expl='(Plink phenotype format)',mylog=mylog)