transcan {Hmisc} | R Documentation |
transcan is a nonlinear additive transformation and imputation
function, and there are several functions for using and operating on
its results. transcan automatically transforms continuous and
categorical variables to have maximum correlation with the best linear
combination of the other variables. There is also an option to use a
substitute criterion: maximum correlation with the first principal
component of the other variables. Continuous variables are expanded
as restricted cubic splines and categorical variables are expanded as
contrasts (e.g., dummy variables). By default, the first canonical
variate is used to find optimum linear combinations of component
columns. This function is similar to ace except that
transformations for continuous variables are fitted using restricted
cubic splines, monotonicity restrictions are not allowed, and NAs are
allowed. When a variable has any NAs, transformed scores for that
variable are imputed using least squares multiple regression
incorporating optimum transformations, or NAs are optionally set to
constants. Shrinkage can be used to safeguard against overfitting
when imputing. Optionally, imputed values on the original scale are
also computed and returned. For this purpose, recursive partitioning
or multinomial logistic models can optionally be used to impute
categorical variables, using what is predicted to be the most
probable category.
By default, transcan imputes NAs with “best guess” expected values
of transformed variables, back transformed to the original scale.
Values thus imputed are most like conditional medians assuming the
transformations make variables' distributions symmetric (imputed
values are similar to conditional modes for categorical variables). By
instead specifying n.impute, transcan does approximate multiple
imputation from the distribution of each variable conditional on all
other variables. This is done by sampling n.impute residuals from the
transformed variable, with replacement (a la bootstrapping), or by
default, using Rubin's approximate Bayesian bootstrap, where a sample
of size n with replacement is selected from the residuals on n
non-missing values of the target variable, and then a sample of size m
with replacement is chosen from this sample, where m is the number of
missing values needing imputation for the current multiple imputation
repetition. Neither of these bootstrap procedures assumes normality
or even symmetry of residuals.
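The two residual-sampling schemes described above can be sketched in base R. This is only an illustration of the sampling idea, not the code transcan runs internally; the helper name draw_residuals and the simulated residuals are assumptions made for the example.

```r
# Hypothetical helper (not part of Hmisc): draw m residuals for one
# multiple-imputation repetition, given the n residuals observed on the
# non-missing values of the target variable.
draw_residuals <- function(res, m,
                           method = c("approximate bayesian", "simple")) {
  method <- match.arg(method)
  n <- length(res)
  pool <- if (method == "approximate bayesian") {
    # Stage 1: sample of size n, with replacement, from the n residuals
    res[sample.int(n, n, replace = TRUE)]
  } else {
    # Simple bootstrap: sample directly from the observed residuals
    res
  }
  # Final stage: sample of size m, with replacement, from the pool
  pool[sample.int(length(pool), m, replace = TRUE)]
}

set.seed(1)
res    <- rnorm(50)   # stand-in residuals on the transformed scale
abb    <- draw_residuals(res, m = 8)
simple <- draw_residuals(res, m = 8, method = "simple")
```

Neither scheme makes a distributional assumption: every drawn value is one of the observed residuals, so skewness or heavy tails in the residuals carry over to the imputations.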
For sometimes-missing categorical variables, optimal scores are
computed by adding the “best guess” predicted mean score to random
residuals off this score. Then categories having scores closest to
these predicted scores are taken as the random multiple imputations
(impcat = "tree" or "rpart" are not currently allowed with n.impute).
The literature recommends using n.impute = 5 or greater.
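The "closest score" step for categorical variables can be illustrated as follows. The helper nearest_category and the category scores are made up for the example; this is a sketch of the idea, not transcan's own code.

```r
# Hypothetical helper: assign each randomly perturbed predicted score to
# the category whose optimal score is closest to it.
nearest_category <- function(pred, cat_scores) {
  # cat_scores: named numeric vector, one transformed score per category
  d <- abs(outer(pred, cat_scores, "-"))   # d[i, j] = |pred_i - score_j|
  names(cat_scores)[apply(d, 1, which.min)]
}

cat_scores <- c(a = -1.2, b = 0.1, c = 1.5)   # made-up category scores
pred       <- c(-0.9, 0.4, 2.0, 0.9)          # predicted mean + residual
imputed    <- nearest_category(pred, cat_scores)
imputed
# [1] "a" "b" "c" "c"
```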
transcan provides only an approximation to multiple imputation,
especially since it “freezes” the imputation model before drawing the
multiple imputations rather than using different estimates of
regression coefficients for each imputation. For multiple imputation,
the aregImpute function provides a much better approximation to the
full Bayesian approach while still not requiring linearity assumptions.
When you specify n.impute to transcan you can use fit.mult.impute
to re-fit any model n.impute times based on n.impute completed
datasets (if there are any sometimes missing variables not specified
to transcan, some observations will still be dropped from these
fits). After fitting n.impute models, fit.mult.impute will return
the fit object from the last imputation, with coefficients replaced
by the average of the n.impute coefficient vectors and with a
component var equal to the imputation-corrected variance-covariance
matrix. fit.mult.impute can also use the object created by the mice
function in the MICE library to draw the multiple imputations, as
well as objects created by aregImpute.
The summary method for transcan prints the function call,
R-squares achieved in transforming each variable, and for each variable
the coefficients of all other transformed variables that are used to
estimate the transformation of the initial variable. If
imputed = TRUE was used in the call to transcan, summary also uses
the describe function to print a summary of imputed values. If
long = TRUE, it also prints all imputed values with observation
identifiers. There is also a simple function print.transcan
which merely prints the transformation matrix and the function call. It
has an optional argument long, which if set to TRUE causes
detailed parameters to be printed. Instead of plotting while
transcan is running, you can plot the final transformations
after the fact using plot.transcan, if the option trantab = TRUE
was specified to transcan. If in addition the option imputed = TRUE
was specified to transcan, plot.transcan will show the location of
imputed values (including multiples) along the axes.
impute does imputations for a selected original data variable, on
the original scale (if imputed = TRUE was given to transcan). If you
do not specify a variable to impute, it will do imputations for all
variables given to transcan which had at least one missing value.
This assumes that the original variables are accessible (i.e., they
have been attached) and that you want the imputed variables to have
the same names as the original variables. If n.impute was specified
to transcan you must tell impute which imputation to use.
predict computes predicted variables and imputed values from a
matrix of new data. This matrix should have the same column variables
as the original matrix used with transcan, and in the same order
(unless a formula was used with transcan).
Function is a generic function generator. Function.transcan creates
S functions to transform variables using transformations created by
transcan. These functions are useful for getting predicted values
with predictors set to values on the original scale.
vcov methods are defined here so that imputation-corrected
variance-covariance matrices are readily extracted from
fit.mult.impute objects, and so that fit.mult.impute can easily
compute traditional covariance matrices for individual completed
datasets.
The subscript function preserves attributes.
The invertTabulated function does either inverse linear
interpolation or uses sampling to sample qualifying x-values having
y-values near the desired values. The latter is used to get inverse
values having a reasonable distribution (e.g., no floor or ceiling
effects) when the transformation has a flat or nearly flat segment,
resulting in a many-to-one transformation in that region. Sampling
weights are a combination of the frequency of occurrence of x-values
that are within tolInverse times the range of y and the squared
distance between the associated y-values and the target y-value (aty).
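To see why a flat segment is a problem for inverse linear interpolation, here is a small base R sketch using approx on a tabulated transformation. It illustrates the general technique only and is not the internal code of invertTabulated; the choice of ties = mean for the flat region is an assumption made for this example.

```r
# Tabulated transformation y = f(x) that is flat for x = 5, ..., 8
x <- 1:10
y <- c(1, 2, 3, 4, 5, 5, 5, 5, 9, 10)

# Inverse linear interpolation: swap the roles of x and y.
# ties = mean collapses the flat segment to the average of its x-values.
aty <- c(2.5, 5, 7)
inv <- approx(x = y, y = x, xout = aty, ties = mean, rule = 2)$y
inv
# [1] 2.50 6.50 7.75
```

Every target value of 5 is mapped to the single value 6.5, so repeated inversions pile up at one point. Sampling among the qualifying x-values (5 through 8 here) avoids that artificial spike.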
transcan(x, method=c("canonical","pc"),
         categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
         boot.method=c('approximate bayesian', 'simple'),
         trantab=FALSE, transformed=FALSE,
         impcat=c("score", "multinom", "rpart", "tree"),
         mincut=40,
         inverse=c('linearInterp','sample'), tolInverse=.05,
         pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
         imputed.actual=c('none','datadensity','hist','qq','ecdf'),
         iter.max=50, eps=.1, curtail=TRUE,
         imp.con=FALSE, shrink=FALSE, init.cat="mode",
         nres=if(boot.method=='simple')200 else 400,
         data, subset, na.action, treeinfo=FALSE,
         rhsImp=c('mean','random'), details.impcat='', ...)

## S3 method for class 'transcan'
summary(object, long=FALSE, ...)

## S3 method for class 'transcan'
print(x, long=FALSE, ...)

## S3 method for class 'transcan'
plot(x, ...)

## S3 method for class 'transcan'
impute(x, var, imputation, name, where.in, data,
       where.out=1, frame.out, list.out=FALSE, pr=TRUE, check=TRUE, ...)

fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
                dtrans, derived, pr=TRUE, subset, ...)

## S3 method for class 'transcan'
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
        type=c("transformed","original"),
        inverse, tolInverse, check=FALSE, ...)

Function(object, ...)

## S3 method for class 'transcan'
Function(object, prefix=".", suffix="", where=1, ...)

invertTabulated(x, y, freq=rep(1,length(x)),
                aty, name='value',
                inverse=c('linearInterp','sample'),
                tolInverse=0.05, rule=2)

## Default S3 method:
vcov(object, regcoef.only=FALSE, ...)

## S3 method for class 'fit.mult.impute'
vcov(object, ...)
x |
a matrix containing continuous variable values and codes for categorical
variables. The matrix must have column names ( |
formula |
any S model formula |
fitter |
any S or Design modeling function (not in quotes) that computes a
vector of |
xtrans |
an object created by |
method |
use |
categorical |
a character vector of names of variables in |
asis |
a character vector of names of variables that are not to be transformed.
For these variables, the guts of |
nk |
number of knots to use in expanding each continuous variable (not listed
in |
imputed |
Set to |
n.impute |
number of multiple imputations. If omitted, single predicted expected
value imputation is used. |
boot.method |
default is to use the approximate Bayesian bootstrap (sample with
replacement from sample with replacement of the vector of residuals).
You can also specify |
trantab |
Set to |
transformed |
set to |
impcat |
This argument tells how to impute categorical variables on the original
scale.
The default is |
mincut |
If |
inverse |
By default, imputed values are back-solved on the original scale using
inverse linear interpolation on the fitted tabulated transformed values.
This will cause distorted distributions of imputed values (e.g., floor
and ceiling effects) when the estimated transformation has a flat or
nearly flat section. To instead use the |
tolInverse |
the multiplier of the range of transformed values, weighted by |
pr |
For |
pl |
Set to |
allpl |
Set to |
show.na |
Set to |
imputed.actual |
The default is "none" to suppress plotting of actual vs. imputed
values for all variables having any |
iter.max |
maximum number of iterations to perform for |
eps |
convergence criterion for |
curtail |
for |
imp.con |
for |
shrink |
default is |
init.cat |
method for initializing scorings of categorical variables. The default, "mode", uses a dummy variable set to 1 if the value is the most frequent value. Use "random" to use a random 0-1 variable. Set to "asis" to use the original integer codes as starting scores. |
nres |
number of residuals to store if |
data |
|
subset |
an integer or logical vector specifying the subset of observations to fit |
na.action |
These may be used if |
treeinfo |
Set to |
rhsImp |
Set to "random" to use random draw imputation when a sometimes
missing variable is moved to be a predictor of other sometimes missing
variables. Default is |
details.impcat |
set to a character scalar that is the name of a
category variable to include in the resulting |
... |
arguments passed to |
long |
for |
var |
For |
imputation |
specifies which of the multiple imputations to use for filling in
|
name |
name of variable to impute, for |
where.in |
location in |
where.out |
location in the |
frame.out |
Instead of specifying |
list.out |
If |
check |
set to |
newdata |
a new data matrix for which to compute transformed variables.
Categorical variables must use the same integer codes as were used
in the call to |
fit.reps |
set to |
dtrans |
provides an approach to creating derived variables from a single
filled-in dataset. The function specified as |
derived |
an expression containing S expressions for computing derived
variables that are used in the model formula. This is useful when
multiple imputations are done for component variables but the actual
model uses combinations of these (e.g., ratios or other derivations).
For a single derived variable you can specify, for example,
|
type |
By default, the matrix of transformed variables is returned, with imputed
values on the transformed scale. If you had specified |
object |
an object created by |
prefix |
|
suffix |
When creating separate S functions for each variable in |
where |
position in |
y |
a vector corresponding to |
freq |
a vector of frequencies corresponding to cross-classified |
aty |
vector of transformed values at which inverses are desired |
rule |
see |
regcoef.only |
set to |
The starting approximation to the transformation for each variable
is taken to be the original coding of the variable. The initial
approximation for each missing value is taken to be the median of
the non-missing values for the variable (for continuous ones) or
the most frequent category (for categorical ones). Instead, if
imp.con is a vector, its values are used for imputing NA values.
When using each variable as a dependent variable, NA values on that
variable cause all observations to be temporarily deleted. Once a new
working transformation is found for the variable, along with a model
to predict that transformation from all the other variables, that
latter model is used to impute NA values in the selected dependent
variable if imp.con is not specified. When that variable is used
to predict a new dependent variable, the current working imputed values
are inserted. Transformations are updated after each variable becomes
a dependent variable, so the order of variables on x could conceivably
make a difference in the final estimates. For obtaining out-of-sample
predictions/transformations, predict uses the same iterative
procedure as transcan for imputation, with the same starting
values for fill-ins as were used by transcan. It also (by default)
uses a conservative approach of curtailing transformed variables to
be within the range of the original ones.
Even when method = "pc" is specified, canonical variables are used
for imputing missing values.
Note that fitted transformations, when evaluated at imputed variable
values (on the original scale), will not precisely match the transformed
imputed values returned in xt. This is because transcan uses an
approximate method based on linear interpolation to back-solve for
imputed values on the original scale.
Shrinkage uses the method of Van Houwelingen and Le Cessie (1990)
(similar to Copas, 1983). The shrinkage factor is
[1 - (1 - R2)(n-1)/(n-k-1)]/R2,
where R2 is the apparent R-squared for predicting the variable, n is
the number of non-missing values, and k is the effective number of
degrees of freedom (aside from intercepts). A heuristic estimate is
used for k:
A - 1 + sum(max(0,Bi-1))/m + m,
where A is the number of d.f. required to represent the variable being
predicted, the Bi are the number of columns required to represent all
the other variables, and m is the number of all other variables.
Division by m is done because the transformations for the other
variables are fixed at their current transformations the last time
they were being predicted. The + m term comes from the number of
coefficients estimated on the right hand side, whether by least
squares or canonical variates. If a shrinkage factor is negative, it
is set to 0. The shrinkage factor is the ratio of the adjusted
R-squared to the ordinary R-squared. The adjusted R-squared is
1 - (1 - R2)(n-1)/(n-k-1),
which is also set to zero if it is negative. If shrink = FALSE and the
adjusted R-squares are much smaller than the ordinary R-squares, you
may want to run transcan with shrink = TRUE.
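The shrinkage formulas above translate directly into a few lines of base R. The function names effective_df and shrinkage_factor are hypothetical, chosen for this sketch; they are not exported by Hmisc.

```r
# Heuristic effective d.f.: k = A - 1 + sum(max(0, Bi - 1))/m + m
# A: d.f. needed for the variable being predicted
# B: vector of column counts for each of the m other variables
effective_df <- function(A, B) {
  m <- length(B)
  A - 1 + sum(pmax(0, B - 1)) / m + m
}

# Shrinkage factor = adjusted R-squared / apparent R-squared, floored at 0
shrinkage_factor <- function(R2, n, k) {
  adj <- 1 - (1 - R2) * (n - 1) / (n - k - 1)
  max(0, adj / R2)
}

k1 <- effective_df(A = 1, B = c(1, 1))    # two linear predictors: k = 2
s1 <- shrinkage_factor(R2 = 0.5, n = 21, k = k1)    # 8/9, mild shrinkage
k2 <- effective_df(A = 4, B = c(4, 2))    # 3 + (3 + 1)/2 + 2 = 7
s2 <- shrinkage_factor(R2 = 0.05, n = 21, k = k1)   # negative, so set to 0
```

With a weak apparent R-squared and a small sample the adjusted R-squared goes negative and the factor is truncated to zero, meaning the imputations fall back to the overall mean of the transformed variable.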
Canonical variates are scaled to have variance of 1.0, by multiplying
canonical coefficients from cancor by sqrt(n-1).
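The effect of the sqrt(n-1) scaling can be checked with cancor from the stats package. The simulated data below are an assumption for illustration only.

```r
set.seed(2)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
Y <- cbind(y1 = X[, 1] + rnorm(n), y2 = rnorm(n))

cc <- cancor(X, Y)

# First canonical variate of X, using the centering that cancor applied
Xc     <- scale(X, center = cc$xcenter, scale = FALSE)
raw    <- drop(Xc %*% cc$xcoef[, 1])   # unit sum of squares
scaled <- raw * sqrt(n - 1)            # unit variance

c(var_raw = var(raw), var_scaled = var(scaled))
```

cancor returns coefficients that give variates with unit sum of squares, so the raw variance is 1/(n-1) and the rescaled variate has variance 1.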
When specifying a non-Design library fitting function to
fit.mult.impute (e.g., lm, glm), running the result of
fit.mult.impute through that fit's summary method will not use the
imputation-adjusted variances. You may obtain the new variances using
fit$var or vcov(fit).
When you specify a Design function to fit.mult.impute (e.g.,
lrm, ols, cph, psm, bj), automatically computed transformation
parameters (e.g., knot locations for rcs) that are estimated for the
first imputation are used for all other imputations. This ensures
that knot locations will not vary, which would change the meaning of
the regression coefficients.
Warning: even though fit.mult.impute takes imputation into account
when estimating variances of regression coefficients, it does not take
into account the variation that results from estimation of the shapes
and regression coefficients of the customized imputation equations.
Specifying shrink = TRUE solves a small part of this problem. To fully
account for all sources of variation you should consider putting the
transcan invocation inside a bootstrap or loop, if execution time
allows. Better still, use aregImpute or one of the libraries such
as MICE that uses real Bayesian posterior realizations to multiply
impute missing values correctly.
It is strongly recommended that you use the Hmisc naclus function to
determine if there is a good basis for imputation. naclus will tell
you, for example, if systolic blood pressure is missing whenever
diastolic blood pressure is missing. If the only variable that is
well correlated with diastolic bp is systolic bp, there is no basis
for imputing diastolic bp in this case.
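The kind of question naclus answers can be approximated with a few lines of base R. The helper comissing below is hypothetical and much simpler than naclus (it does no clustering); it only tabulates how often one variable is missing given that another is missing, on made-up data.

```r
# Hypothetical helper: entry [i, j] is the proportion of observations
# with variable j missing among those with variable i missing.
comissing <- function(d) {
  na   <- is.na(as.matrix(d)) * 1   # 0/1 missingness indicators
  both <- crossprod(na)             # counts of joint missingness
  both / diag(both)                 # row i divided by count of i missing
}

d <- data.frame(
  sbp = c(120, NA, NA, 130, NA, 110),
  dbp = c( 80, NA, NA,  85, 70,  75),
  age = c( 50, 61, 47,  58, 66,  39)
)
m <- comissing(d)
m["dbp", "sbp"]   # 1: sbp is missing whenever dbp is missing
```

Rows for never-missing variables (age here) are NaN because there is nothing to condition on.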
At present, predict does not work with multiple imputation.
When calling fit.mult.impute with glm as the fitter argument, if
you need to pass a family argument to glm do it by quoting the
family, e.g., family = "binomial".
fit.mult.impute will not work with proportional odds models when
regression imputation was used (as opposed to predictive mean
matching). That's because regression imputation will create values of
the response variable that did not exist in the dataset, altering the
intercept terms in the model.
You should be able to use a variable in the formula given to
fit.mult.impute as a numeric variable in the regression model even
though it was a factor variable in the invocation of transcan. Use
for example fit.mult.impute(y ~ codes(x), lrm, trans) (thanks to
Trevor Thompson trevor@hp5.eushc.org).
For transcan, a list of class transcan with elements
call |
(with the function call) |
iter |
(number of iterations done) |
rsq, rsq.adj |
containing the R-squares and adjusted R-squares achieved in predicting each variable from all the others |
categorical |
the values supplied for |
asis |
the values supplied for |
coef |
the within-variable coefficients used to compute the first canonical variate |
xcoef |
the (possibly shrunk) across-variables coefficients of the first canonical variate that predicts each variable in turn |
parms |
the parameters of the transformation (knots for splines, contrast matrix for categorical variables) |
fillin |
the initial estimates for missing
values ( |
ranges |
the matrix of ranges of the transformed variables (min and max in first and second row) |
scale |
a vector of scales used to determine convergence for a transformation |
formula |
the formula (if |
The list optionally also contains a vector of shrinkage
factors used for predicting each variable from the others. For
asis variables, the scale is the average absolute difference about
the median. For other variables it is unity, since canonical
variables are standardized. For xcoef, row i has the coefficients
to predict transformed variable i, with the column for the
coefficient of variable i set to NA. If imputed = TRUE was given, an
optional element imputed also appears. This is a list with the
vector of imputed values (on the original scale) for each variable
containing NAs. Matrices rather than vectors are returned if
n.impute is given. If trantab = TRUE, the trantab element also
appears, as described above. If n.impute > 0, transcan also returns
a list residuals that can be used for future multiple imputation.
impute returns a vector (the same length as var) of class "impute"
with NA values imputed. predict returns a matrix with the same number
of columns or variables as were in x.
fit.mult.impute returns a fit object that is a modification of the
fit object created by fitting the completed dataset for the final
imputation. The var matrix in the fit object has the
imputation-corrected variance-covariance matrix. coefficients is
the average (over imputations) of the coefficient vectors,
variance.inflation.impute is a vector containing the ratios of
the diagonals of the between-imputation variance matrix to the diagonals
of the average apparent (within-imputation) variance matrix.
missingInfo is Rubin's “rate of missing information” and
dfmi is Rubin's degrees of freedom for a t-statistic for testing
a single parameter. The last two objects are vectors corresponding to
the diagonal of the variance matrix.
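The averaging of coefficients and the within/between variance decomposition described above follow the usual Rubin combining rule, sketched below in base R. This is an illustration under the assumption that the standard rule applies; the helper pool_rubin and the toy numbers are hypothetical, and the sketch does not reproduce every quantity fit.mult.impute returns.

```r
# Hypothetical helper: combine M sets of estimates with Rubin's rule.
# coefs: M x p matrix of coefficient vectors (one row per imputation)
# vcovs: list of M apparent (within-imputation) p x p covariance matrices
pool_rubin <- function(coefs, vcovs) {
  M    <- nrow(coefs)
  qbar <- colMeans(coefs)            # average coefficient vector
  W    <- Reduce(`+`, vcovs) / M     # average within-imputation variance
  B    <- var(coefs)                 # between-imputation variance
  list(coefficients = qbar,
       var          = W + (1 + 1 / M) * B,   # imputation-corrected variance
       ratio        = diag(B) / diag(W))     # between / within, per coefficient
}

coefs  <- rbind(c(1, 2), c(2, 2), c(3, 2))   # toy estimates, M = 3
vcovs  <- rep(list(diag(0.5, 2)), 3)         # toy within-imputation variances
pooled <- pool_rubin(coefs, vcovs)
diag(pooled$var)
```

The second coefficient does not vary across imputations, so its corrected variance equals the within-imputation variance; the first picks up an extra (1 + 1/M) times the between-imputation variance.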
prints, plots, and impute.transcan creates new variables.
Frank Harrell
Department of Biostatistics
Vanderbilt University
f.harrell@vanderbilt.edu
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265–1323, 1990.
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 9:1303–1325, 1990.
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.
He X, Shen L: Linear regression after spline transformation. Biometrika 84:474–481, 1997.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.
Rubin DB, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585–598, 1991.
Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184–191, 2002.
aregImpute, impute, naclus, naplot, ace, avas, cancor, prcomp,
rcspline.eval, lsfit, approx, datadensity, mice
## Not run: 
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans)  #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed)   #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age            <- impute(x.trans, age)
disease        <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH             <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH)       #uses summary.impute to tell about imputations
                  #and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans     <- predict(x.trans, xnew)
w              <- predict(x.trans, xnew, type="original")
age            <- w[,"age"]            #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans)  #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
                    imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
        blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code),  # disease.code categorical
              transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
plot(z$transformed)
## End(Not run)

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y  <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)
h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
diag(vcov(h))
h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(vcov(h.complete))
# Note: had Design's ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances

# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable
## Not run: 
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z  <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
# Here is how to create a completed dataset
d <- data.frame(x1, x2)
z <- transcan(~x1 + I(x2), n.impute=5, data=d)
imputed <- impute(z, imputation=1, data=d,
                  list.out=TRUE, pr=FALSE, check=FALSE)
sapply(imputed, function(x)sum(is.imputed(x)))
sapply(imputed, function(x)sum(is.na(x)))
## End(Not run)

# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these
set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run: 
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio   <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)

# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame. The following is based on the fact
# that the default output location for impute.transcan is
# given by where.out=1 (search position 1)
## Not run: 
xt <- transcan(~. , data=mine,
               imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, pos=1, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)

# Example of using invertTabulated outside transcan
x    <- c(1,2,3,4,5,6,7,8,9,10)
y    <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1)      # so can reproduce
for(inverse in c('linearInterp','sample'))
  print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))

# Test inverse='sample' when the estimated transformation is
# flat on the right. First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
  par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
  # cat('Click mouse on graph to proceed\n')
  # locator(1)
}