nlpred: Small-sample optimized estimators of nonlinear risks

When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. Often, the performance of a prediction algorithm must be estimated using the same data that are used to train the algorithm. To correct for overly optimistic assessments of performance, cross-validation is often used. However, standard K-fold cross-validated risk estimates may perform poorly in small samples when one considers nonlinear risks, including several popular risk criteria such as the area under the receiver operating characteristic (ROC) curve (AUC). The nlpred package implements several estimators that are tailored for improved cross-validated estimation of nonlinear risks. This vignette provides a brief overview of the motivation for these estimators and demonstrates how they are implemented in the nlpred package.
Prediction is important in many areas of research. Modern technology has led to the collection of vast amounts of data; for example, in biomedical studies we now routinely collect genetic sequences, gene expressions, proteomics, and metabolomics. Relative to the amount of information measured on each unit, the total number of units available may be quite modest. Many practical applications thus require methodology that enables researchers to simultaneously develop and evaluate the performance of prediction algorithms in small samples. It is well known that estimating the performance of an algorithm using the same data that were used to train the algorithm often leads to an overly optimistic estimate of performance. To correct for this, it is common to use K-fold cross-validation (CV). K-fold CV partitions the data into several distinct groups. The prediction algorithm is developed using all but one group, and the prediction metric is estimated in the remaining group. This is repeated until each partition has been used to estimate the risk once. The K-fold CV risk estimate is the average of these partition-specific estimates, as sketched below.
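To fix ideas, here is a minimal, self-contained sketch of standard K-fold CV risk estimation, using the mean squared error of a logistic regression as the risk. This illustrates the general recipe only; it is not nlpred's internal implementation.
# illustrative K-fold CV risk estimate (MSE of a logistic regression)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- rbinom(100, 1, plogis(dat$x))
K <- 5
# randomly assign each observation to one of K folds
fold <- sample(rep(seq_len(K), length.out = nrow(dat)))
fold_risks <- sapply(seq_len(K), function(k) {
  # train on all folds except fold k
  fit <- glm(y ~ x, data = dat[fold != k, ], family = binomial())
  # estimate the risk on the held-out fold k
  preds <- predict(fit, newdata = dat[fold == k, ], type = "response")
  mean((dat$y[fold == k] - preds)^2)
})
# the K-fold CV risk estimate is the average of fold-specific estimates
mean(fold_risks)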
Theoretical frameworks have been developed for estimation of K-fold CV risks that apply to arbitrary learning algorithms (Hubbard, Kherad-Pajouh, and Laan 2016). Moreover, it is often possible to construct closed-form, computationally efficient confidence intervals and hypothesis tests based on K-fold CV estimates, e.g., (LeDell, Petersen, and Laan 2015). However, these estimators can behave poorly for certain risks; in particular, for risks that are non-linear in the data-generating distribution. Whereas linear metrics can be estimated using estimators that are themselves linear (i.e., behave like means), non-linear metrics typically require asymptotically linear estimators. Such estimators behave as a mean plus a remainder term. While the remainder is generally negligible in large samples, it may contribute substantially to the behavior of the estimator in small samples.
In our recent work (references to be added when available), we have developed improved estimators of nonlinear risk functions. Our approach involves viewing the risk criterion as a statistical function of certain key parameters of the data-generating distribution. Standard CV estimators use the validation data to estimate these parameters, for example, using nonparametric maximum likelihood. However, as discussed above, this is problematic when validation samples are small. Thus, our proposal uses the training sample twice: once to train the prediction algorithm, and then again to estimate the relevant parameters of the data-generating distribution that are needed to evaluate the risk criterion of interest. Because of this double usage of the data, these estimates may exhibit bias, which motivates some form of bias correction. nlpred implements three asymptotically equivalent approaches: CV targeted minimum loss estimation (CVTMLE), one-step estimation, and estimating equations. We refer readers to our publication for further details on how the estimators are constructed.
The area under the ROC curve (henceforth, AUC) is a popular risk criterion for evaluating prediction algorithms. Suppose we have a prediction algorithm ψ that maps a feature vector X to a predicted probability of a binary outcome Y. The AUC of ψ can be defined as follows. Consider drawing (X1, Y1) at random from the population of units with Y = 1 and (X2, Y2) independently from the population of units with Y = 0. The AUC can be interpreted as P(ψ(X1) > ψ(X2) | Y1 = 1, Y2 = 0); that is, the probability that the predicted risk of X1 is higher than that of X2.
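This definition suggests a simple empirical estimate: the proportion of case/control pairs whose predicted risks are correctly ordered. A minimal sketch (for illustration only; this is not nlpred's estimator):
# illustrative empirical AUC: proportion of case/control pairs for which
# the case's predicted risk exceeds the control's (ties count one half)
empirical_auc <- function(pred, y) {
  case_pred <- pred[y == 1]
  control_pred <- pred[y == 0]
  comparisons <- outer(case_pred, control_pred, ">") +
    0.5 * outer(case_pred, control_pred, "==")
  mean(comparisons)
}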
Estimates of CV AUC can be computed in nlpred as follows.
# load the nlpred package
library(nlpred)
## Loading required package: data.table
# turn off messages from np package
options(np.messages = FALSE)
# simulate data
n <- 200
p <- 10
X <- data.frame(matrix(rnorm(n*p), nrow = n, ncol = p))
Y <- rbinom(n, 1, plogis(X[,1] + X[,10]))
# get cv auc estimates for logistic regression
logistic_cv_auc_ests <- cv_auc(Y = Y, X = X, K = 5, nested_K = 5,
                               learner = "glm_wrapper",
                               nested_cv = TRUE)
logistic_cv_auc_ests
## est se cil ciu
## cvtmle 0.7832779 0.03069815 0.7231107 0.8434452
## onestep 0.7836500 0.03117726 0.7225437 0.8447563
## esteq 0.7776170 0.03117726 0.7165107 0.8387234
## standard 0.7677368 0.03298084 0.7030956 0.8323781
The main options to the cv_auc function are Y, the outcome, X, the features, and K, the number of cross-validation folds. The learner option specifies the learning algorithm of interest; see the Writing wrappers section below for more details. For now, it suffices to say that "glm_wrapper" corresponds to a main-effects logistic regression. The nested_cv option is an important option in the estimation procedure: it specifies whether to use nested CV to estimate the nuisance parameters in each training sample. Requiring nested cross-validation adds considerable computational expense, so it is natural to ask when it is necessary. In general, we recommend nested CV for more aggressive learning algorithms. Because logistic regression is a fairly stable algorithm, it may not be necessary in this case, as illustrated by the call sketched below.
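For example, a call along these lines (a sketch using the same cv_auc interface) skips nested CV for the logistic regression learner:
# nested CV can be skipped for a stable learner like logistic regression
logistic_cv_auc_nonnested <- cv_auc(Y = Y, X = X, K = 5,
                                    learner = "glm_wrapper",
                                    nested_cv = FALSE)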
The printed output shows four estimates of CV AUC: three based on the different bias corrections (cvtmle, onestep, esteq), as well as the standard CV AUC estimate (standard) (LeDell, Petersen, and Laan 2015). Also shown are the influence function-based standard error estimate (se) and the limits of a, by default, 95% confidence interval. The level of the confidence interval is controlled by the ci_level option of the print.cv_auc method.
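For example, a call along the following lines requests 90% intervals, consistent with the narrower limits shown below:
# print with 90% confidence intervals
print(logistic_cv_auc_ests, ci_level = 0.9)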
## est se cil ciu
## cvtmle 0.7832779 0.03069815 0.7327840 0.8337719
## onestep 0.7836500 0.03117726 0.7323680 0.8349320
## esteq 0.7776170 0.03117726 0.7263350 0.8288991
## standard 0.7677368 0.03298084 0.7134882 0.8219855
We now consider a more aggressive algorithm: random forests (as implemented in the ranger package).
# load the ranger package
library(ranger)
# set a seed for reproducibility
set.seed(123)
# get cv auc estimates for random forest,
# using nested cross-validation for nuisance parameter estimation
rf_cv_auc_ests <- cv_auc(Y = Y, X = X, K = 5,
                         learner = "ranger_wrapper",
                         nested_cv = TRUE)
rf_cv_auc_ests
## est se cil ciu
## cvtmle 0.8065077 0.03086060 0.7460220 0.8669934
## onestep 0.8072512 0.03157018 0.7453747 0.8691276
## esteq 0.7982185 0.03157018 0.7363421 0.8600949
## standard 0.8000075 0.03118587 0.7388843 0.8611307
By default, cv_auc will use K - 1 folds for the nested cross-validation. This is a sensible default because certain learners can be re-used across different partitions of the data, so it requires considerably less learner training than, e.g., two full layers of K-fold CV. However, if one wishes to control the number of nested folds, there is the nested_K option, illustrated below.
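For example, here is a sketch of a call that uses three nested folds instead of the default K - 1 = 4:
# use 3 nested folds rather than the default K - 1
rf_cv_auc_k3 <- cv_auc(Y = Y, X = X, K = 5, nested_K = 3,
                       learner = "ranger_wrapper",
                       nested_cv = TRUE)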
The table below shows wrappers for learners that are available in nlpred.

Wrapper | Description | Reference
---|---|---
glm_wrapper | logistic regression |
stepglm_wrapper | stepwise logistic regression |
xgboost_wrapper | eXtreme gradient boosting | Chen and Guestrin (2016)
ranger_wrapper | random forests | Wright and Ziegler (2017)
randomforest_wrapper | random forests | Liaw and Wiener (2002)
superlearner_wrapper | super learner | Polley et al. (2019)
glmnet_wrapper | elastic net regression | Simon et al. (2011)
It is not difficult to write your own function that is compatible with cv_auc and the other functions in the nlpred package. Let's examine the glm_wrapper function.
glm_wrapper
## function (train, test)
## {
## if (!is.data.frame(train$X)) {
## train$X <- data.frame(train$X)
## }
## if (!is.data.frame(test$X)) {
## test$X <- data.frame(test$X)
## }
## glm_fit <- stats::glm(train$Y ~ ., data = train$X, family = stats::binomial())
## train_pred <- stats::predict(glm_fit, newdata = train$X,
## type = "response")
## test_pred <- stats::predict(glm_fit, newdata = test$X, type = "response")
## return(list(test_pred = test_pred, train_pred = train_pred,
## model = NULL, train_y = train$Y, test_y = test$Y))
## }
## <bytecode: 0x55f64ea3d7b0>
## <environment: namespace:nlpred>
We see that the function expects input in the form of two lists: train and test. In these lists, we expect to find entries X and Y corresponding to the features and outcomes, respectively, in a given training sample and validation/testing sample. The salient points of the workflow of this function are: fit a main-terms logistic regression and store it in the glm_fit object; obtain predictions on both the training data and the testing data; and structure the output in a particular format. In particular, the output should be a list with named entries test_pred, predictions in the test sample; train_pred, predictions in the training sample; model, the fitted model (optional; only needed if you wish to examine it in the $prediction_list entry of the output of cv_auc); train_y, the training sample outcomes (copied from train$Y); and test_y, the testing sample outcomes (copied from test$Y). A sketch of a custom wrapper following this contract is shown below.
The sensitivity-constrained rate of negative prediction (SCRNP) can be described as follows. Suppose again that we have a prediction function ψ that maps features X to a predicted probability of a binary outcome Y. Suppose we choose a cutoff c0 such that P(ψ(X) > c0 | Y = 1) ≥ ρ for a user-defined ρ; that is, we enforce that the sensitivity of a classifier based on ψ is at least ρ. The SCRNP is then P(ψ(X) ≤ c0); that is, the proportion of all data units that would be classified as a "control" (i.e., Y = 0).
To motivate SCRNP, consider developing a prediction algorithm for breast cancer incidence in women. We would like to identify a large proportion of women who will eventually develop breast cancer; that is, we would like to enforce that our procedure for classifying women at high risk has high sensitivity. However, women with high predicted risk of cancer may be recommended to undergo more invasive screening procedures. So beyond our sensitivity constraint, we would like to maximize the proportion of women who are not required to undergo additional screening. The SCRNP is exactly this proportion. Zheng et al. (2018) discuss SCRNP in the context of HIV prevention.
Estimating SCRNP using traditional K-fold CV approaches is particularly challenging because we need to estimate a quantile of the distribution of ψ(X) amongst those with Y = 1. If there are few observations with Y = 1 in the validation fold, then this estimation will be highly unstable and will cause downstream instability in the estimate of CV SCRNP. This makes our approach particularly appealing for this problem.
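To make the definition concrete, here is a naive empirical plug-in version of SCRNP (for illustration only; this is not the estimator nlpred implements):
# naive plug-in SCRNP: choose the cutoff as the (1 - rho) quantile of
# predicted risks among cases, then compute the proportion of all units
# falling at or below that cutoff
empirical_scrnp <- function(pred, y, rho = 0.95) {
  c0 <- quantile(pred[y == 1], probs = 1 - rho, type = 1)
  mean(pred <= c0)
}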
The syntax to estimate CV SCRNP is via the cv_scrnp function, as shown below.
# get cv scrnp estimates for logistic regression
logistic_cv_scrnp_ests <- cv_scrnp(Y = Y, X = X, K = 5,
                                   learner = "glm_wrapper",
                                   nested_cv = FALSE)
logistic_cv_scrnp_ests
## est se cil ciu
## cvtmle 0.2205570 0.04599747 0.13040359 0.3107104
## onestep 0.1496639 0.04700371 0.05753832 0.2417895
## esteq 0.1496639 0.04700371 0.05753832 0.2417895
## standard 0.2008648 0.04142204 0.11967909 0.2820505
# get cv scrnp estimates for random forest
# using nested cross-validation for nuisance parameter estimation
rf_cv_scrnp_ests <- cv_scrnp(Y = Y, X = X, K = 5,
                             learner = "ranger_wrapper",
                             nested_cv = TRUE)
rf_cv_scrnp_ests
## est se cil ciu
## cvtmle 0.1946194 0.05575193 0.08534760 0.3038912
## onestep 0.1909077 0.05582234 0.08149787 0.3003174
## esteq 0.1909077 0.05582234 0.08149787 0.3003174
## standard 0.1990411 0.04732361 0.10628854 0.2917937
To compare the novel methods in nlpred to existing approaches, the package includes several alternative approaches to estimating these performance metrics.
The functions boot_auc and boot_scrnp can be used to compute bootstrap-based estimates of the performance of prediction algorithms. Several approaches are available. In particular, each function implements a standard bootstrap correction (Harrell, Lee, and Mark 1996), as well as the 0.632 correction described in Chapter 7 of Friedman, Hastie, and Tibshirani (2001).
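For reference, the 0.632 correction is a weighted combination of the apparent (training-sample) performance and the average out-of-bag bootstrap performance. A minimal sketch of the combination step (illustrative only; the function name is hypothetical):
# 0.632 correction: weighted combination of apparent performance and
# average out-of-bag bootstrap performance
auc_632 <- function(apparent_auc, oob_auc) {
  0.368 * apparent_auc + 0.632 * oob_auc
}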
# get bootstrap estimated auc of logistic regression
boot_auc_est <- boot_auc(Y = Y, X = X, learner = "glm_wrapper",
                         correct632 = FALSE)
boot_auc_est
## $auc
## [1] 0.7851946
##
## $n_valid_boot
## [1] 500
# with 0.632 correction
boot632_auc_est <- boot_auc(Y = Y, X = X, learner = "glm_wrapper",
                            correct632 = TRUE)
boot632_auc_est
## $auc
## [1] 0.7823117
##
## $n_valid_boot
## [1] 500
# get bootstrap estimated scrnp of logistic regression
boot_scrnp_est <- boot_scrnp(Y = Y, X = X, learner = "glm_wrapper",
                             correct632 = FALSE)
boot_scrnp_est
## $scrnp
## [1] 0.170875
##
## $n_valid_boot
## [1] 200
# with 0.632 correction
boot632_scrnp_est <- boot_scrnp(Y = Y, X = X, learner = "glm_wrapper",
                                correct632 = TRUE)
boot632_scrnp_est
## $scrnp
## [1] 0.08021718
##
## $n_valid_boot
## [1] 200
Another proposal for estimating AUC is leave-pairs-out CV (Airola et al. 2011). In this approach, one random observation with Y = 0 and one with Y = 1 are left out; the algorithm is trained on the remaining data, and predictions are made on the two held-out observations. The estimate is the proportion of these pairs for which the Y = 1 observation had a higher predicted risk than the Y = 0 observation. Because it can be quite computationally expensive to retrain the algorithm for every pair, we include an option max_pairs to specify the maximum number of pairs to leave out. If left equal to NULL, then every possible case/control pair is left out in turn.
# leave out at most 250 pairs
lpo_cv_auc_est <- lpo_auc(Y = Y, X = X, learner = "glm_wrapper",
                          max_pairs = 250)
lpo_cv_auc_est
## $auc
## [1] 0.804
Unfortunately, parallelization is not yet available in nlpred, but will be added as a feature soon.
From extensive simulation studies, here are a few relevant observations:
- K = 10 or K = 20 tended to yield the best performance across a variety of settings.
- For the nested cross-validation, nested_K = 5 tended to yield the best performance across a variety of settings.
A call reflecting these observations is sketched below.
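Based on these observations, a reasonable default analysis might use a call along these lines (a sketch using the same interface as above):
# defaults suggested by the simulation studies
cv_auc_default <- cv_auc(Y = Y, X = X, K = 10, nested_K = 5,
                         learner = "glm_wrapper",
                         nested_cv = TRUE)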