sharplabtool: tools/rgenetics/rgfakePhe.py comparison

comparison tools/rgenetics/rgfakePhe.py @ 0:9071e359b9a3

Uploaded

author	xuebing
date	Fri, 09 Mar 2012 19:37:19 -0500
parents
children

comparison

equal deleted inserted replaced

--1:000000000000
+:9071e359b9a3
+"""
+fakephe.py
+ross lazarus sept 30 2007
+This is available under the LGPL as defined then.
+use the pedigree data for ids
+use pythons generators to literally generate a bunch of random phenotype measurements
+Plink format at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#pheno
+is
+To specify an alternate phenotype for analysis, i.e. other than the one in the *.ped file (or, if using a binary fileset, the
+*.fam file), use the --pheno option:
+plink --file mydata --pheno pheno.txt
+where pheno.txt is a file that contains 3 columns (one row per individual):
+Family ID
+Individual ID
+Phenotype
+NOTE The original PED file must still contain a phenotype in column 6 (even if this is a dummy phenotype, e.g. all missing),
+unless the --no-pheno flag is given.
+If an individual is in the original file but not listed in the alternate phenotype file, that person's phenotype will be set to
+missing. If a person is in the alternate phenotype file but not in the original file, that entry will be ignored. The order of
+the alternate phenotype file need not be the same as for the original file.
+If the phenotype file contains more than one phenotype, then use the --mpheno N option to specify the Nth phenotype is the one
+to be used:
+plink --file mydata --pheno pheno2.txt --mpheno 4
+where pheno2.txt contains 5 different phenotypes (i.e. 7 columns in total), this command will use the 4th for analysis
+(phenotype D):
+Family ID
+Individual ID
+Phenotype A
+Phenotype B
+Phenotype C
+Phenotype D
+Phenotype E
+Alternatively, your alternate phenotype file can have a header row, in which case you can use variable names to specify which
+phenotype to use. If you have a header row, the first two variables must be labelled FID and IID. All subsequent variable names
+cannot have any whitespace in them. For example,
+FID    IID      qt1   bmi    site
+F1     1110     2.3   22.22  2
+F2     2202     34.12 18.23  1
+...
+then
+plink --file mydata --pheno pheno2.txt --pheno-name bmi --assoc
+will select the second phenotype labelled "bmi", for analysis
+Finally, if there is more than one phenotype, then for basic association tests, it is possible to specify that all phenotypes
+be tested, sequentially, with the output sent to different files: e.g. if bigpheno.raw contains 10,000 phenotypes, then
+plink --bfile mydata --assoc --pheno bigpheno.raw --all-pheno
+will loop over all of these, one at a time testing for association with SNP, generating a lot of output. You might want to use
+the --pfilter command in this case, to only report results with a p-value less than a certain value, e.g. --pfilter 1e-3.
+WARNING Currently, all phenotypes must be numerically coded, including missing values, in the alternate phenotype file. The
+default missing value is -9, change this with --missing-phenotype, but it must be a numeric value still (in contrast to the
+main phenotype in the PED/FAM file. This issue will be fixed in future releases.
+Covariate files
+===========================
+rgfakePhe.xml
+<tool id="fakePhe1" name="Fake phenos">
+<description>for multiple null fake phenotype</description>
+<command interpreter="python2.4">rgfakePhe.py $input1 '$title1' $out_file1 $log_file1 $script_file</command>
+<inputs>
+<page>
+<param name="input1"
+type="library" format="lped"
+label="Pedigree from Dataset"/>
+<param name="title1" type="text"
+value="My fake phenos" size="60"
+label="Title for outputs"/>
+</page>
+<page>
+<repeat name="fakePhe" title="Phenotypes to Fake">
+<param name="pName" type="text" label="Phenotype Name">
+</param>
+<conditional name="series">
+<param name="phetype" type="select" label="Phenotype Type">
+<option value="rnorm" selected="true">Random normal variate</option>
+<option value="cat">Random categorical</option>
+</param>
+<when value="rnorm">
+<param name="Mean" type="float" value="0.0" label="Mean">
+</param>
+<param name="SD" type="float" label="SD" value="1.0">
+</param>
+</when>
+<when value="cat">
+<param name="values" type="text" value="1,2,3,fiddle" label="comma separated values to choose from">
+</param>
+</when>
+</conditional>
+</repeat>
+</page>
+</inputs>
+<configfiles>
+<configfile name="script_file">
+#for $n, $f in enumerate($fakePhe)
+#if $f.series.phetype=='rnorm'
+{'pN':'$f.pName','pT':'$f.series.phetype','pP':"{'Mean':'$f.series.Mean', 'SD':'$f.series.SD'}"}
+#elif $f.series.phetype=='cat'
+{'pN':'$f.pName','pT':'$f.series.phetype','pP':"{'values':'$f.series.values'}"}
+#end if
+#end for
+</configfile>
+</configfiles>
+<outputs>
+<data format="pphe" name="out_file1" />
+<data format="text" name="log_file1" parent="out_file1"/>
+</outputs>
+<help>
+.. class:: infomark
+This tool allows you to generate an arbitrary (sort of)
+synthetic phenotype file with measurements drawn from normal,
+gamma, or categorical distributions. These are for testing under
+the null hypothesis of no association - the values are random but
+from user specified distributions.
+-----
+.. class:: warningmark
+This tool is very experimental
+-----
+- **Pedigree** is a library pedigree file - the id's will be used in the synthetic null phenotypes
+- **Title** is a name to give to the output phenotype file
+On the next page, you can add an unlimited number of various kinds of phenotypes including choices for
+categorical ones or distributions with specific parameters
+Just keep adding new ones until you're done then use the Execute button to run the generation
+**Example from another tool to keep you busy and in awe**
+Input file::
+1   68  4.1
+2   71  4.6
+3   62  3.8
+4   75  4.4
+5   58  3.2
+6   60  3.1
+7   67  3.8
+8   68  4.1
+9   71  4.3
+10  69  3.7
+Create a two series XY plot on the above data:
+- Series 1: Red Dashed-Line plot between columns 1 and 2
+- Series 2: Blue Circular-Point plot between columns 3 and 2
+.. image:: ./static/images/xy_example.jpg
+</help>
+</tool>
+"""
+import random,copy,sys,os,time,string,math
+doFbatphe = False
+galhtmlprefix = """<?xml version="1.0" encoding="utf-8" ?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+<meta name="generator" content="Galaxy %s tool output - see http://g2.trac.bx.psu.edu/" />
+<title></title>
+<link rel="stylesheet" href="/static/style/base.css" type="text/css" />
+</head>
+<body>
+<div class="document">
+"""
+def timenow():
+"""return current time as a string
+"""
+return time.strftime('%d/%m/%Y %H:%M:%S', time.localtime(time.time()))
+def poisson(lamb=2):
+"""
+algorithm poisson random number (Knuth):
+init:
+Let L = e^-lamb, k = 0 and p = 1.
+do:
+k = k + 1.
+Generate uniform random number u in [0,1] and let p = p u.
+while p >= L.
+return k - 1.
+"""
+lamb = float(lamb)
+l = math.e**(-lamb)
+k=0
+p=1
+while 1:
+while p >= l:
+k += 1
+u = random.uniform(0.0,1.0)
+p *= u
+yield '%e' % (k - 1)
+def gammaPhe(alpha=1,beta=1):
+"""generator for random values from a gamma
+"""
+while 1: # an iterator for a random phenotype
+dat = random.gammavariate(float(alpha),float(beta))
+yield '%e' % dat
+def weibullPhe(alpha=1,beta=1):
+"""generator for random values from a weibull distribution
+"""
+while 1: # an iterator for a random phenotype
+dat = random.weibullvariate(float(alpha),float(beta))
+yield '%e' % dat
+def normPhe(mean=0,sd=1):
+"""generator for random values from a normal distribution
+"""
+while 1:# an iterator for a random phenotype
+dat = random.normalvariate(float(mean),float(sd))
+yield '%e' % dat
+def expoPhe(mean=1):
+"""generator for random values from an exponential distribution
+"""
+lamb = 1.0/float(mean)
+while 1:# an iterator for a random phenotype
+dat = random.expovariate(lamb)
+yield '%e' % dat
+def catPhe(values='1,2,3'):
+""" Schrodinger's of course.
+"""
+v = values.split(',')
+while 1:# an iterator for a random phenotype
+dat = random.choice(v)
+yield dat
+def uniPhe(low=0.0,hi=1.0):
+"""generator for a uniform distribution
+what if low=-5 and hi=-2
+"""
+low = float(low)
+hi = float(hi)
+while 1: # unif phenotype
+v = random.uniform(low,hi) # 0-1
+yield '%e' % v
+def getIds(indir='foo'):
+"""read identifiers - first 2 cols from a pedigree file or fam file
+"""
+inpath = os.path.split(indir)[0] # get root
+infam = '%s.fam' % indir
+inped = '%s.ped' % indir # if there's a ped
+flist = os.listdir(inpath)
+if len(flist) == 0:
+print >> sys.stderr, '### Error - input path %s is empty' % indir
+sys.exit(1)
+if os.path.exists(infam):
+pfname = infam
+elif os.path.exists(inped):
+pfname = inped
+else:
+print >> sys.stderr, '### Error - input path %s contains no ped or fam' % indir
+sys.exit(1)
+f = file(pfname,'r')
+ids = []
+for l in f:
+ll = l.split()
+if len(ll) > 5: # ok line?
+ids.append(ll[:2])
+return ids
+def makePhe(phes = [],ids=[]):
+"""Create the null phenotype values and append them to the case id
+phes is the (generator, headername) for each column
+for a phe file, ids are the case identifiers for the phenotype file
+res contains the final strings for the file
+each column is derived by iterating
+over the generator actions set up by makePhes
+"""
+header = ['FID','IID'] # for plink
+res = copy.copy(ids)
+for (f,fname) in phes:
+header.append(fname)
+for n,subject in enumerate(ids):
+res[n].append(f.next()) # generate the right value
+res.insert(0,header)
+res = [' '.join(x) for x in res] # must be space delim for fbat
+return res
+def makePhes(pheTypes=[], pheNames=[], pheParams=[]):
+"""set up phes for makePhe
+each has to have an iterator (action) and a name
+"""
+action = None
+phes = []
+for n,pt in enumerate(pheTypes):
+name = pheNames[n]
+if pt == 'rnorm':
+m = pheParams[n].get('Mean',0.0)
+s = pheParams[n].get('SD',1.0)
+action = normPhe(mean=m,sd=s) # so we can just iterate over action
+elif pt == 'rgamma':
+m = pheParams[n].get('Alpha',0.0)
+s = pheParams[n].get('Beta',1.0)
+action = gammaPhe(alpha=m,beta=s)
+if pt == 'exponential':
+m = pheParams[n].get('Mean',1.0)
+action = expoPhe(mean=m) # so we can just iterate over action
+elif pt == 'weibull':
+m = pheParams[n].get('Alpha',0.0)
+s = pheParams[n].get('Beta',1.0)
+action = weibullPhe(alpha=m,beta=s)
+elif pt == 'cat':
+v = pheParams[n].get('values',['?',])
+action = catPhe(values=v)
+elif pt == 'unif':
+low = pheParams[n].get('low',0.0)
+hi = pheParams[n].get('hi',1.0)
+action = uniPhe(low=low,hi=hi)
+elif pt == 'poisson':
+lamb = pheParams[n].get('lamb',1.0)
+action = poisson(lamb=lamb)
+phes.append((action,name))
+return phes
+def doImport(outfile='test',flist=[],expl='',mylog=[]):
+""" import into one of the new html composite data types for Rgenetics
+Dan Blankenberg with mods by Ross Lazarus
+October 2007
+"""
+outf = open(outfile,'w')
+progname = os.path.basename(sys.argv[0])
+outf.write(galhtmlprefix % progname)
+if len(flist) > 0:
+outf.write('<ol>\n')
+for i, data in enumerate( flist ):
+outf.write('<li><a href="%s">%s %s</a></li>\n' % (os.path.split(data)[-1],os.path.split(data)[-1],expl))
+outf.write('</ol>\n')
+else:
+outf.write('No files found')
+outf.write('<hr/><h4>Log of run follows:</h4><br/><pre>%s\n</pre><br/><hr/>' % ('\n'.join(mylog)))
+outf.write("</div></body></html>\n")
+outf.close()
+def test():
+"""test case
+need to get these out of a galaxy form - series of pages - get types
+on first screen, names on second, params on third?
+holy shit. this actually works I think
+"""
+pT = ['rnorm','rnorm','rnorm','rnorm','cat','unif']
+pN = ['SysBP','DiaBP','HtCM','WtKG','Race','age']
+pP = [{'Mean':'120','SD':'10'},{'Mean':'90','SD':'15'},{'Mean':'160','SD':'20'},{'Mean':'60','SD':'20'}, \
+{'values':'Blink,What,Yours,green'},{'low':16,'hi':99}]
+phes = makePhes(pheTypes=pT, pheNames=pN, pheParams=pP)
+ids = []
+for i in range(10):
+ids.append(['fid%d' % i,'iid%d' % i])
+pheres = makePhe(phes=phes,ids=ids)
+res = [''.join(x) for x in pheres]
+print '\n'.join(res)
+if __name__ == "__main__":
+"""
+<command interpreter="python">rgfakePhe.py '$infile1.extra_files_path/$infile1.metadata.base_name'
+"$title1" '$ppheout' '$ppheout.files_path' '$script_file '
+</command>
+The xml file for this tool is complex, and allows any arbitrary number of
+phenotype columns to be specified from a couple of optional types - rnorm, cat
+are working now.
+Note that we create new files in their respective library directories and link to them in the output file
+so they can be displayed and downloaded separately
+"""
+killme = string.punctuation + string.whitespace
+trantab = string.maketrans(killme,'_'*len(killme))
+progname = os.path.basename(sys.argv[0])
+cl = '## at %s, %s got cl= %s' % (timenow(),progname,' '.join(sys.argv))
+print >> sys.stdout,cl
+if len(sys.argv) < 5:
+test()
+else:
+inped = sys.argv[1]
+title = sys.argv[2].translate(trantab)
+ppheout = sys.argv[3]
+pphe_path = sys.argv[4]
+scriptfile = sys.argv[5]
+ind = file(scriptfile,'r').readlines()
+mylog = []
+s = '## %s starting at %s<br/>\n' % (progname,timenow())
+mylog.append(s)
+mylog.append(cl)
+s = '## params = %s<br/>\n' % (' '.join(sys.argv[1:]))
+mylog.append(s)
+s = '\n'.join(ind)
+mylog.append('Script file %s contained %s<br/>\n' % (scriptfile,s))
+pT = []
+pN = []
+pP = []
+for l in ind:
+l = l.strip()
+if len(l) > 1:
+adict = eval(l)
+pT.append(adict.get('pT',None))
+pN.append(adict.get('pN',None))
+pP.append(eval(adict.get('pP',None)))
+s = '## pt,pn,pp=%s,%s,%s<br/>\n' % (str(pT),str(pN),str(pP))
+mylog.append(s)
+phes = makePhes(pheTypes=pT, pheNames=pN, pheParams=pP)
+ids = getIds(indir=inped) # for labelling rows
+pheres = makePhe(phes=phes,ids=ids) # random values from given distributions
+try:
+os.makedirs(pphe_path)
+except:
+pass
+outname = os.path.join(pphe_path,title)
+pphefname = '%s.pphe' % outname
+f = file(pphefname, 'w')
+f.write('\n'.join(pheres))
+f.write('\n')
+f.close()
+if doFbatphe:
+try:
+os.makedirs(fphe_path)
+except:
+pass
+outname = os.path.join(fphe_path,title)
+fphefname = '%s.phe' % outname
+f = file(fphefname, 'w')
+header = pheres[0].split()
+pheres[0] = ' '.join(header[2:])# remove iid fid from header for fbat
+f.write('\n'.join(pheres))
+f.close()
+doImport(outfile=fpheout,flist=[fphefname,],expl='(FBAT phenotype format)',mylog=mylog)
+#doImport(outfile='test',flist=[],expl='',mylog=[]):
+doImport(outfile=ppheout,flist=[pphefname,],expl='(Plink phenotype format)',mylog=mylog)

Mercurial > repos > xuebing > sharplabtool

comparison tools/rgenetics/rgfakePhe.py @ 0:9071e359b9a3