Mercurial > repos > vipints > rdiff
view rDiff/src/locfit/m/locfit.m @ 2:233c30f91d66
updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
author | vipints <vipin@cbio.mskcc.org> |
---|---|
date | Tue, 08 Oct 2013 07:15:44 -0400 |
parents | 0f80a5141704 |
children |
line wrap: on
line source
function fit=locfit(varargin) % Smoothing noisy data using Local Regression and Likelihood. % % arguments still to add: dc maxit % % Usage: fit = locfit(x,y) % local regression fit of x and y. % fit = locfit(x) % density estimation of x. % % Smoothing with locfit is a two-step procedure. The locfit() % function evaluates the local regression smooth at a set of points % (can be specified through an evaluation structure). Then, use % the predict() function to interpolate this fit to other points. % % Additional arguments to locfit() are specified as 'name',value pairs, e.g.: % locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 ); % % % Data-related inputs: % % x is a vector or matrix of the independent (or predictor) variables. % Rows of x represent subjects, columns represent variables. % Generally, local regression would be used with 1-4 independent % variables. In higher dimensions, the curse-of-dimensionality, % as well as the difficulty of visualizing higher dimensional % surfaces, may limit usefulness. % % y is the column vector of the dependent (or response) variable. % For density families, 'y' is omitted. % NOTE: x and y are the first two arguments. All other arguments require % the 'name',value notation. % % 'weights' Prior weights for observations (reciprocal of variance, or % sample size). % 'cens' Censoring indicators for hazard rate or censored regression. % The coding is '1' (or 'TRUE') for a censored observation, and % '0' (or 'FALSE') for uncensored observations. % 'base' Baseline parameter estimate. If a baseline is provided, % the local regression model is fitted as % Y_i = b_i + m(x_i) + epsilon_i, % with Locfit estimating the m(x) term. For regression models, % this effectively subtracts b_i from Y_i. The advantage of the % 'base' formulation is that it extends to likelihood % regression models. % 'scale' A scale to apply to each variable. This is especially % important for multivariate fitting, where variables may be % measured in non-comparable units. It is also used to specify % the frequency for variables with the 'a' (angular) style. % 'sty' Character string (length d) of styles for each predictor variable. % n denotes `normal'; a denotes angular (or periodic); l and r % denotes one-sided left and right; c is conditionally parametric. % % % Smoothing Parameters and Bandwidths: % The bandwidth (or more accurately, half-width) of the smoothing window % controls the amount of smoothing. Locfit allows specification of constant % (fixed), nearest neighbor, certain locally adaptive variable bandwidths, % and combinations of these. Also related to the smoothing parameter % are the local polynmial degree and weight function. % % 'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5 % means that the width of each smoothing neighborhood is chosen % to cover 50% of the data. % % 'h' A constant (or fixed) bandwidth parameter. For example, 'h',2 % means that the smoothing windows have constant half-width % (or radius) 2. Note that h is applied after scaling. % % 'pen' penalty parameter for adaptive smoothing. Needs to be used % with care. % % 'alpha' The old way of specifying smoothing parameters, as used in % my book. alpha is equivalent to the vector [nn,h,pen]. % If multiple componenents are non-zero, the largest corresponding % bandwidth is used. The default (if none of alpha,nn,h,pen % are provided) is [0.7 0 0]. % % 'deg' Degree of local polynomial. Default: 2 (local quadratic). % Degrees 0 to 3 are supported by almost all parts of the % Locfit code. Higher degrees may work in some cases. % % 'kern' Weight function, default = 'tcub'. Other choices are % 'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'. % Choices may be restricted when derivatives are % required; e.g. for confidence bands and some bandwidth % selectors. % % 'kt' Kernel type, 'sph' (default); 'prod'. In multivariate % problems, 'prod' uses a simplified product model which % speeds up computations. % % 'acri' Criterion for adaptive bandwidth selection. % % % Derivative Estimation. % Generally I recommend caution when using derivative estimation % (and especially higher order derivative estimation) -- can you % really estimate derivatives from noisy data? Any derivative % estimate is inherently more dependent on an assumed smoothness % (expressed through the bandwidth) than the data. Warnings aside... % % 'deriv' Derivative estimation. 'deriv',1 specifies the first derivative % (or more correctly, an estimate of the local slope is returned. % 'deriv',[1 1] specifies the second derivative. For bivariate fits % 'deriv',2 specifies the first partial derivative wrt x2. % 'deriv',[1 2] is mixed second-order derivative. % % Fitting family. % 'family' is used to specify the local likelihood family. % Regression-type families are 'gaussian', 'binomial', % 'poisson', 'gamma' and 'geom'. If the family is preceded % by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is % used; in particular, a dispersion estimate is computed. % Preceding by an 'r' makes an attempt at robust (outlier-resistant) % estimation. Combining q and r (e.g. 'family','qrpois') may % work, if you're lucky. % Density estimation-type families are 'dens', 'rate' and 'hazard' % (hazard or failure rate). Note that `dens' scales the output % to be a statistical density estimate (i.e. scaled to integrate % to 1). 'rate' estimates the rate or intensity function (events % per unit time, or events per unit area), which may be called % density in some fields. % The default family is 'qgauss' if a response (y argument) has been % provided, and 'dens' if no response is given. % 'link' Link function for local likelihood fitting. Depending on the % family, choices may be 'ident', 'log', 'logit', % 'inverse', 'sqrt' and 'arcsin'. % % Evaluation structures. % By default, locfit chooses a set of points, depending on the data % and smoothing parameters, to evaluate at. This is controlled by % the evaluation structure. % 'ev' Specify the evaluation structure. Default is 'tree'. % Other choices include 'phull' (triangulation), 'grid' (a grid % of points), 'data' (each data point), 'crossval' (data, % but use leave-one-out cross validation), 'none' (no evaluation % points, effectively producing the global parametric fit). % Alternatively, a vector/matrix of evaluation points may be % provided. % (kd trees not currently supported in mlocfit) % 'll' and 'ur' -- row vectors specifying the upper and lower limits % for the bounding box used by the evaluation structure. % They default to the data range. % 'mg' For the 'grid' evaluation structure, 'mg' specifies the % number of points on each margin. Default 10. Can be either a % single number or vector. % 'cut' Refinement parameter for adaptive partitions. Default 0.8; % smaller values result in more refined partitions. % 'maxk' Controls space assignment for evaluation structures. For the % adaptive evaluation structures, it is impossible to be sure % in advance how many vertices will be generated. If you get % warnings about `Insufficient vertex space', Locfit's default % assigment can be increased by increasing 'maxk'. The default % is 'maxk','100'. % % 'xlim' For density estimation, Locfit allows the density to be % supported on a bounded interval (or rectangle, in more than % one dimension). The format should be [ll;ul] (ie, matrix with % two rows, d columns) where ll is the lower left corner of % the rectangle, and ur is the upper right corner. % One-sided bounds, such as [0,infty), are not supported, but can be % effectively specified by specifying a very large upper % bound. % % 'module' either 'name' or {'name','/path/to/module',parameters}. % % Density Estimation % 'renorm',1 will attempt to renormalize the local likelihood % density estimate so that it integrates to 1. The llde % (specified by 'family','dens') is scaled to estimate the % density, but since the estimation is pointwise, there is % no guarantee that the resulting density integrates exactly % to 1. Renormalization attempts to achieve this. % % The output of locfit() is a Matlab structure: % % fit.data.x (n*d) % fit.data.y (n*1) % fit.data.weights (n*1 or 1*1) % fit.data.censor (n*1 or 1*1) % fit.data.baseline (n*1 or 1*1) % fit.data.style (string length d) % fit.data.scales (1*d) % fit.data.xlim (2*d) % % fit.evaluation_structure.type (string) % fit.evaluation_structure.module.name (string) % fit.evaluation_structure.module.directory (string) % fit.evaluation_structure.module.parameters (string) % fit.evaluation_structure.lower_left (numeric 1*d) % fit.evaluation_structure.upper_right (numeric 1*d) % fit.evaluation_structure.grid (numeric 1*d) % fit.evaluation_structure.cut (numeric 1*d) % fit.evaluation_structure.maxk % fit.evaluation_structure.derivative % % fit.smoothing_parameters.alpha = (nn h pen) vector % fit.smoothing_parameters.adaptive_criterion (string) % fit.smoothing_parameters.degree (numeric) % fit.smoothing_parameters.family (string) % fit.smoothing_parameters.link (string) % fit.smoothing_parameters.kernel (string) % fit.smoothing_parameters.kernel_type (string) % fit.smoothing_parameters.deren % fit.smoothing_parameters.deit % fit.smoothing_parameters.demint % fit.smoothing_parameters.debug % % fit.fit_points.evaluation_points (d*nv matrix) % fit.fit_points.fitted_values (matrix, nv rows, many columns) % fit.fit_points.evaluation_vectors.cell % fit.fit_points.evaluation_vectors.splitvar % fit.fit_points.evaluation_vectors.lo % fit.fit_points.evaluation_vectors.hi % fit.fit_points.fit_limits (d*2 matrix) % fit.fit_points.family_link (numeric values) % fit.fit_points.kappa (likelihood, degrees of freedom, etc) % % fit.parametric_component % % % The OLD format: % % fit{1} = data. % fit{2} = evaluation structure. % fit{3} = smoothing parameter structure. % fit{4}{1} = fit points matrix. % fit{4}{2} = matrix of fitted values etc. % Note that these are not back-transformed, and may have the % parametric component removed. % (exact content varies according to module). % fit{4}{3} = various details of the evaluation points. % fit{4}{4} = fit limits. % fit{4}{5} = family,link. % fit{5} = parametric component values. % % Minimal input validation if nargin < 1 error( 'At least one input argument required' ); end xdata = double(varargin{1}); d = size(xdata,2); n = size(xdata,1); if ((nargin>1) && (~ischar(varargin{2}))) ydata = double(varargin{2}); if (any(size(ydata) ~= [n 1])); error('y must be n*1 column vector'); end; family = 'qgauss'; na = 3; else ydata = 0; family = 'density'; na = 2; end; if mod(nargin-na,2)==0 error( 'All arguments other than x, y must be name,value pairs' ); end wdata = ones(n,1); cdata = 0; base = 0; style = 'n'; scale = 1; xl = zeros(2,d); alpha = [0 0 0]; deg = 2; link = 'default'; acri = 'none'; kern = 'tcub'; kt = 'sph'; deren = 0; deit = 'default'; demint= 20; debug = 0; ev = 'tree'; ll = zeros(1,d); ur = zeros(1,d); mg = 10; maxk = 100; deriv=0; cut = 0.8; mdl = struct('name','std', 'directory','', 'parameters',0 ); while na < length(varargin) inc = 0; if (varargin{na}=='y') ydata = double(varargin{na+1}); family = 'qgauss'; inc = 2; if (any(size(ydata) ~= [n 1])); error('y must be n*1 column vector'); end; end if (strcmp(varargin{na},'weights')) wdata = double(varargin{na+1}); inc = 2; if (any(size(wdata) ~= [n 1])); error('weights must be n*1 column vector'); end; end if (strcmp(varargin{na},'cens')) cdata = double(varargin{na+1}); inc = 2; if (any(size(cdata) ~= [n 1])); error('cens must be n*1 column vector'); end; end if (strcmp(varargin{na},'base')) % numeric vector, n*1 or 1*1. base = double(varargin{na+1}); if (length(base)==1); base = base*ones(n,1); end; inc = 2; end if (strcmp(varargin{na},'style')) % character string of length d. style = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'scale')) % row vector, length 1 or d. scale = varargin{na+1}; if (scale==0) scale = zeros(1,d); for i=1:d scale(i) = sqrt(var(xdata(:,i))); end; end; inc = 2; end; if (strcmp(varargin{na},'xlim')) % 2*d numeric matrix. xl = varargin{na+1}; inc = 2; end if (strcmp(varargin{na},'alpha')) % row vector of length 1, 2 or 3. alpha = [varargin{na+1} 0 0 0]; alpha = alpha(1:3); inc = 2; end if (strcmp(varargin{na},'nn')) % scalar alpha(1) = varargin{na+1}; inc = 2; end if (strcmp(varargin{na},'h')) % scalar alpha(2) = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'pen')) % scalar alpha(3) = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'acri')) % string acri = varargin{na+1}; inc = 2; end if (strcmp(varargin{na},'deg')) % positive integer. deg = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'family')) % character string. family = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'link')) % character string. link = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'kern')) % character string. kern = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'kt')) % character string. kt = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'ev')) % char. string, or matrix with d columns. ev = varargin{na+1}; if (isnumeric(ev)); ev = ev'; end; inc = 2; end; if (strcmp(varargin{na},'ll')) % row vector of length d. ll = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'ur')) % row vector of length d. ur = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'mg')) % row vector of length d. mg = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'cut')) % positive scalar. cut = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'module')) % string. mdl = struct('name',varargin{na+1}, 'directory','', 'parameters',0 ); inc = 2; end; if (strcmp(varargin{na},'maxk')) % positive integer. maxk = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'deriv')) % numeric row vector, up to deg elements. deriv = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'renorm')) % density renormalization. deren = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'itype')) % density - integration type. deit = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'mint')) % density - # of integration points. demint = varargin{na+1}; inc = 2; end; if (strcmp(varargin{na},'debug')) % debug level. debug = varargin{na+1}; inc = 2; end; if (inc==0) disp(varargin{na}); error('Unknown Input Argument.'); end; na=na+inc; end fit.data.x = xdata; fit.data.y = ydata; fit.data.weights = wdata; fit.data.censor = cdata; fit.data.baseline = base; fit.data.style = style; fit.data.scales = scale; fit.data.xlim = xl; fit.evaluation_structure.type = ev; fit.evaluation_structure.module = mdl; fit.evaluation_structure.lower_left = ll; fit.evaluation_structure.upper_right = ur; fit.evaluation_structure.grid = mg; fit.evaluation_structure.cut = cut; fit.evaluation_structure.maxk = maxk; fit.evaluation_structure.derivative = deriv; if (alpha==0); alpha = [0.7 0 0]; end; fit.smoothing_parameters.alpha = alpha; fit.smoothing_parameters.adaptive_criterion = acri; fit.smoothing_parameters.degree = deg; fit.smoothing_parameters.family = family; fit.smoothing_parameters.link = link; fit.smoothing_parameters.kernel = kern; fit.smoothing_parameters.kernel_type = kt; fit.smoothing_parameters.deren = deren; fit.smoothing_parameters.deit = deit; fit.smoothing_parameters.demint = demint; fit.smoothing_parameters.debug = debug; [fpc pcomp] = mexlf(fit.data,fit.evaluation_structure,fit.smoothing_parameters); fit.fit_points = fpc; fit.parametric_component = pcomp; return