Title: Estimators of Non-Linear Cross-Validated Risks Optimized for Small Samples
Description: Methods for obtaining improved estimates of non-linear cross-validated risks using targeted minimum loss-based estimation, estimating equations, and one-step estimation (Benkeser, Petersen, van der Laan (2019), <doi:10.1080/01621459.2019.1668794>). Cross-validated area under the receiver operating characteristic curve (LeDell, Petersen, van der Laan (2015), <doi:10.1214/15-EJS1035>) and other metrics are included.
Authors: David Benkeser [aut, cre]
Maintainer: David Benkeser <[email protected]>
License: MIT + file LICENSE
Version: 1.0.1
Built: 2024-11-04 03:49:19 UTC
Source: https://github.com/benkeser/nlpred
Compute one of the terms of the efficient influence function
.Dy(full_long_data, y)
full_long_data |
A long form data set |
y |
Which portion of the EIF to compute |
Vector of one piece of EIF evaluated at estimates in full_long_data
An estimating function for cvAUC
.estim_fn(auc = 0.5, prediction_list, gn)
auc |
The value of auc to find root for |
prediction_list |
Entry in prediction_list |
gn |
Marginal probability of outcome |
A numeric value of the estimating function evaluated at the current auc estimate.
An estimating function for cvAUC with initial estimates generated via nested cross-validation
.estim_fn_nested_cv(auc = 0.5, prediction_list, folds, gn, K)
auc |
The value of auc to find root for |
prediction_list |
Entry in prediction_list |
folds |
Cross-validation folds |
gn |
Marginal probability of outcome |
K |
Number of CV folds |
A numeric value of the estimating function evaluated at the current auc estimate.
See ?.get_psi_distribution to understand the expected input format.
.get_auc(dist_y0, dist_y1)
dist_y0 |
Distribution of psi given Y = 0 |
dist_y1 |
Distribution of psi given Y = 1 |
Numeric value of AUC
Helper function to turn prediction_list into CV estimate of SCRNP
.get_cv_estim(prediction_list, sens, gn, quantile_type = 8, ...)
prediction_list |
Properly formatted list of predictions. |
sens |
The sensitivity constraint. |
gn |
The marginal probability that Y = 1. |
quantile_type |
The type of quantile estimate to use. |
... |
Other options (not currently used) |
Function to estimate density needed to evaluate standard errors.
.get_density(x, c0, bounded_kernel = FALSE, x_name = "train_pred", y_name = "train_y", nested_cv = FALSE, prediction_list = NULL, folds = NULL, maxDens = 1000, ...)
x |
An entry in prediction_list. |
c0 |
The point at which the density estimate is evaluated. |
bounded_kernel |
Should a bounded kernel be used? Default is FALSE. |
x_name |
Name of variable to compute density of. |
y_name |
Name of variable to stratify density computation on. |
nested_cv |
Use nested CV to estimate density? |
prediction_list |
Properly formatted list of predictions. |
folds |
Cross-validation fold assignments. |
maxDens |
The maximum allowed value for the density. |
... |
Other options (not currently used) |
Helper function to get the quantile for a single training fold when nested CV is used.
.get_nested_cv_quantile(x, p, prediction_list, folds, quantile_type = 8)
x |
An entry in prediction_list. |
p |
The quantile to get. |
prediction_list |
Properly formatted list of predictions. |
folds |
Cross-validation fold assignments. |
quantile_type |
The type of quantile estimate to use. |
Helper function to get results for a single cross-validation fold
.get_one_fold(x, sens, gn, quantile_type = 8, ...)
x |
An entry in prediction_list. |
sens |
The sensitivity constraint. |
gn |
An estimate of the marginal probability that Y = 1. |
quantile_type |
The type of quantile estimate to use. |
... |
Other options (not currently used) |
Worker function for fitting prediction functions (possibly in parallel)
.get_predictions(learner, Y, X, K = 10, folds, parallel, nested_cv = FALSE, nested_K = K - 1)
learner |
The wrapper to use |
Y |
The outcome |
X |
The predictors |
K |
The number of folds |
folds |
Vector of CV fold assignments |
parallel |
Whether to compute things in parallel using the future package |
nested_cv |
Is nested CV being used? |
nested_K |
How many folds of nested CV? |
A list of the result of the wrapper executed in each fold
Compute the conditional (given Y = y) estimated distribution of psi
.get_psi_distribution(x, y, epsilon = 0)
x |
An entry in the output from .get_predictions |
y |
The value of Y at which to compute the distribution estimate. |
epsilon |
A vector of estimated coefficients from TMLE fluctuation submodels. |
A data.frame with the distribution of psi given Y = y with names psix (what value estimates are evaluated at), dFn (density estimates), Fn (cdf estimates)
Compute the conditional (given Y = y) CV-estimated distribution of psi
.get_psi_distribution_nested_cv(x, y, prediction_list, folds, epsilon = 0)
x |
The outer validation fold withheld |
y |
The value of Y at which to compute the distribution estimate. |
prediction_list |
List output from .get_predictions. |
folds |
Cross validation fold indicator. |
epsilon |
A vector of estimated coefficients from TMLE fluctuation submodels. |
A data.frame with the distribution of psi given Y = y with names psix (what value estimates are evaluated at), dFn (density estimates), Fn (cdf estimates)
Helper function to get the quantile for a single training fold when nested CV is NOT used.
.get_quantile(x, p, quantile_type = 8)
x |
An entry in prediction_list. |
p |
The quantile to get. |
quantile_type |
The type of quantile estimate to use. |
Worker function to make long form data set needed for CVTMLE targeting step
.make_long_data(x, gn, update = FALSE, epsilon_0 = 0, epsilon_1 = 0, tol = 0.001)
x |
An entry in the "predictions list" that has certain
named values (see |
gn |
An estimate of the probability that |
update |
A boolean of whether this is called for initial construction of the long data set or as part of the targeting loop. If the former, empirical "density" estimates are used. If the latter these are derived from the targeted cdf. |
epsilon_0 |
If |
epsilon_1 |
Same as for |
tol |
A truncation level when taking logit transformations. |
A long form data list of a particular set up. Columns are named id (multiple rows per observation in validation sample), u (if Yi = 0, these are the values of psi(x) in the training sample for obs with Y = 1; if Yi = 1, these are values of psi(x) in the training sample for obs. with Y = 0), Yi (this observation's value of Y), Fn (estimated value of the cdf of psi(X) given Y = Yi in the training sample), dFn (estimated value of the density of psi(X) given Y = (1-Yi) in the training sample), psi (the value of this observation's Psihat(P_n,B_n^0)), gn (estimate of marginal of Y, e.g., computed in whole sample), outcome (indicator that psix <= u), logit_Fn (the cdf estimate on the logit scale, needed for offset in targeting model).
Worker function to make long form data set needed for CVTMLE targeting step when nested cv is used
.make_long_data_nested_cv(x, prediction_list, folds, gn, update = FALSE, epsilon_0 = 0, epsilon_1 = 0, tol = 0.001)
x |
The outer validation fold |
prediction_list |
The full prediction list |
folds |
Vector of CV folds |
gn |
An estimate of the marginal dist. of Y |
update |
Boolean of whether this is called for initial construction of the long data set or as part of the targeting loop. If the former, cross-validated empirical "density" estimates are used. If the latter these are derived from the targeted cdf. |
epsilon_0 |
If |
epsilon_1 |
Same as epsilon_0. |
tol |
A truncation level when taking logit transformations. |
A long form data list of a particular set up. Columns are named id (multiple per obs. in validation sample), u (if Yi = 0, these are the unique values of psi(x) in the inner validation samples for psi fit on inner training samples for obs with Y = 1; if Yi = 1, these are values of psi(x) in the inner validation samples for psi fit on inner training samples for obs. with Y = 0), Yi (this id's value of Y), Fn (cross-validation estimated value of the cdf of psi(X) given Y = Yi in the training sample), dFn (cross-validated estimate of the density of psi(X) given Y = (1-Yi) in the training sample), psi (the value of this observation's Psihat(P_n,B_n^0)), gn (estimate of marginal of Y, e.g., computed in whole sample), outcome (indicator that psix <= u), logit_Fn (the cdf estimate on the logit scale, needed for offset in targeting model).
Helper function for making data set in proper format for CVTMLE
.make_targeting_data(x, prediction_list, quantile_list, density_list, folds, nested_cv = FALSE, gn)
x |
A numeric identifier of which entry in prediction_list to use. |
prediction_list |
Properly formatted list of predictions. |
quantile_list |
List of estimated quantile for each fold. |
density_list |
List of density estimates for each fold. |
folds |
Cross-validation fold assignments. |
nested_cv |
A boolean indicating whether nested CV was used in estimation. |
gn |
An estimate of the marginal probability that Y = 1. |
Unexported function from cvAUC package
.process_input(predictions, labels, label.ordering = NULL, folds = NULL, ids = NULL, confidence = NULL)
predictions |
A vector, matrix, list, or data frame containing the predictions. |
labels |
A vector, matrix, list, or data frame containing the true class labels. Must have the same dimensions as predictions. |
label.ordering |
The default ordering of the classes can be changed by supplying a vector containing the negative and the positive class label (negative label first, positive label second). |
folds |
If specified, this must be a vector of fold ids equal in length to predictions. |
ids |
Vector of ids |
confidence |
Confidence interval level. |
The "Adult" data set from the UCI machine learning repository. Raw data have been processed and an outcome column added.
Description (copied from UCI):
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
Prediction task is to determine whether a person makes over 50K a year (column outcome).
Listing of attributes:
>50K, <=50K
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
https://archive.ics.uci.edu/ml/datasets/Adult
http://robotics.stanford.edu/~ronnyk/nbtree.pdf
Bank data from the UCI Machine Learning Repository. The raw bank data have been processed and an outcome column added.
Description (copied from UCI):
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess if the product (bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:
1) (included in nlpred) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).
Attribute Information:
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown') 6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
This estimator is computed by re-sampling with replacement (i.e., bootstrap sampling) from the data. The AUC is computed for the learner trained on the full data. The AUC is then computed for the learner trained on each bootstrap sample. The average difference between the full data-trained learner and the bootstrap-trained learner is computed to estimate the bias in the full-data-estimated AUC. The final estimate of AUC is given by the difference in the full-data AUC and the estimated bias.
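The optimism-bootstrap arithmetic described above can be illustrated with a minimal sketch in plain R, using logistic regression as the learner and a rank-based (Mann-Whitney) AUC. This is only an illustration of the bias-correction recipe; boot_auc's actual internals (learner wrappers, edge-case handling) differ.

```r
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))
dat <- cbind(X, Y = Y)

# rank-based AUC: P(psi_case > psi_control), via the Mann-Whitney statistic
auc <- function(psi, y) {
  r <- rank(psi)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# apparent AUC: learner trained and evaluated on the full data
fit_full <- glm(Y ~ x1, data = dat, family = binomial())
apparent <- auc(predict(fit_full, type = "response"), Y)

B <- 50  # small for illustration; boot_auc defaults to B = 500
optimism <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  fit_b <- glm(Y ~ x1, data = dat[idx, ], family = binomial())
  # optimism = (AUC of bootstrap-trained learner on its own sample)
  #          - (AUC of the same learner on the original data)
  auc(predict(fit_b, type = "response"), Y[idx]) -
    auc(predict(fit_b, newdata = dat, type = "response"), Y)
})

# final estimate: apparent AUC minus the estimated bias
auc_corrected <- apparent - mean(optimism)
```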
boot_auc(Y, X, B = 500, learner = "glm_wrapper", correct632 = FALSE, ...)
Y |
A numeric vector of outcomes, assumed to equal 0 or 1. |
X |
A data.frame of predictors. |
B |
The number of bootstrap samples. |
learner |
A wrapper that implements the desired method for building a
prediction algorithm. See |
correct632 |
A boolean indicating whether to use the .632 correction. |
... |
Other options, not currently used. |
A list with $auc as the bootstrap-corrected AUC estimate and $n_valid_boot as the number of bootstrap samples where learner successfully executed.
# simulate data
X <- data.frame(x1 = rnorm(50))
Y <- rbinom(50, 1, plogis(X$x1))
# compute bootstrap estimate of auc for logistic regression
# use small B for fast run
boot <- boot_auc(Y = Y, X = X, B = 25, learner = "glm_wrapper")
This estimator is computed by re-sampling with replacement (i.e., bootstrap sampling) from the data. The SCRNP is computed for the learner trained on the full data. The SCRNP is then computed for the learner trained on each bootstrap sample. The average difference between the full data-trained learner and the bootstrap-trained learner is computed to estimate the bias in the full-data-estimated SCRNP. The final estimate of SCRNP is given by the difference in the full-data SCRNP and the estimated bias.
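The correct632 option references the classic .632 correction, which blends the optimistic apparent (full-data) estimate with the pessimistic average out-of-bag estimate using Efron's weights. The sketch below illustrates that blend for a plug-in SCRNP with a glm learner; boot_scrnp's actual internals may differ in detail.

```r
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))
dat <- cbind(X, Y = Y)
sens <- 0.95

# plug-in SCRNP: cutoff chosen so sensitivity ~ sens, then the rate of
# negative prediction at that cutoff (quantile type 8, as in cv_scrnp)
scrnp_hat <- function(psi, y) {
  c0 <- quantile(psi[y == 1], probs = 1 - sens, type = 8)
  mean(psi <= c0)
}

fit_full <- glm(Y ~ x1, data = dat, family = binomial())
apparent <- scrnp_hat(predict(fit_full, type = "response"), Y)

B <- 50
oob <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  out <- setdiff(seq_len(n), idx)  # out-of-bag observations
  fit_b <- glm(Y ~ x1, data = dat[idx, ], family = binomial())
  psi_oob <- predict(fit_b, newdata = dat[out, ], type = "response")
  scrnp_hat(psi_oob, Y[out])
})

# Efron's .632 weights: shrink the apparent estimate toward the
# out-of-bag estimate
scrnp_632 <- 0.368 * apparent + 0.632 * mean(oob)
```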
boot_scrnp(Y, X, B = 200, learner = "glm_wrapper", sens = 0.95, correct632 = FALSE, ...)
Y |
A numeric vector of outcomes, assumed to equal 0 or 1. |
X |
A data.frame of predictors. |
B |
The number of bootstrap samples. |
learner |
A wrapper that implements the desired method for building a
prediction algorithm. See |
sens |
The sensitivity constraint to use. |
correct632 |
A boolean indicating whether to use the .632 correction. |
... |
Other options, not currently used. |
A list with $scrnp as the bootstrap-corrected estimate of SCRNP and $n_valid_boot as the number of bootstrap samples where learner successfully executed.
# simulate data
X <- data.frame(x1 = rnorm(50))
Y <- rbinom(50, 1, plogis(X$x1))
# compute bootstrap estimate of scrnp for logistic regression
# use small B for fast run
boot <- boot_scrnp(Y = Y, X = X, B = 25, learner = "glm_wrapper")
Cardiotocography data from the UCI machine learning repository. Raw data have been cleaned and an outcome column added that is a binary indicator of NSP (described below) = 2.
Data Set Information: 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore the dataset can be used either for 10-class or 3-class experiments.
Attribute Information:
LB - FHR baseline (beats per minute)
AC - # of accelerations per second
FM - # of fetal movements per second
UC - # of uterine contractions per second
DL - # of light decelerations per second
DS - # of severe decelerations per second
DP - # of prolonged decelerations per second
ASTV - percentage of time with abnormal short term variability
MSTV - mean value of short term variability
ALTV - percentage of time with abnormal long term variability
MLTV - mean value of long term variability
Width - width of FHR histogram
Min - minimum of FHR histogram
Max - Maximum of FHR histogram
Nmax - # of histogram peaks
Nzeros - # of histogram zeros
Mode - histogram mode
Mean - histogram mean
Median - histogram median
Variance - histogram variance
Tendency - histogram tendency
CLASS - FHR pattern class code (1 to 10)
NSP - fetal state class code (N=normal; S=suspect; P=pathologic)
https://archive.ics.uci.edu/ml/datasets/Cardiotocography
Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318
This function is nearly verbatim the ci.cvAUC function from the cvAUC package. The only difference is that it additionally returns estimated influence functions.
ci.cvAUC_withIC(predictions, labels, label.ordering = NULL, folds = NULL, confidence = 0.95)
predictions |
A vector, matrix, list, or data frame containing the predictions. |
labels |
A vector, matrix, list, or data frame containing the true class labels. Must have the same dimensions as predictions. |
label.ordering |
The default ordering of the classes can be changed by supplying a vector containing the negative and the positive class label (negative label first, positive label second). |
folds |
If specified, this must be a vector of fold ids equal in length to predictions. |
confidence |
A number between 0 and 1 representing the confidence level. |
A list containing the following named elements:
cvAUC |
Cross-validated area under the curve estimate. |
se |
Standard error. |
ci |
A vector of length two containing the upper and lower bounds for the confidence interval. |
confidence |
A number between 0 and 1 representing the confidence. |
ic |
A vector of the influence function evaluated at observations. |
This function computes K-fold cross-validated estimates of the area under the receiver operating characteristics (ROC) curve (hereafter, AUC). This quantity can be interpreted as the probability that a randomly selected case will have higher predicted risk than a randomly selected control.
cv_auc(Y, X, K = 10, learner = "glm_wrapper", nested_cv = TRUE, nested_K = K - 1, parallel = FALSE, max_cvtmle_iter = 10, cvtmle_ictol = 1/length(Y), prediction_list = NULL, ...)
Y |
A numeric vector of outcomes, assumed to equal 0 or 1. |
X |
A data.frame of predictors. |
K |
The number of cross-validation folds (default is 10). |
learner |
A wrapper that implements the desired method for building a prediction algorithm. See |
nested_cv |
A boolean indicating whether nested cross-validation should be used to estimate the distribution of the prediction function. Default is TRUE. |
nested_K |
If nested cross-validation is used, how many inner folds should there be? Default is K - 1. |
parallel |
A boolean indicating whether prediction algorithms should be trained in parallel. Default is FALSE. |
max_cvtmle_iter |
Maximum number of iterations for the bias correction step of the CV-TMLE estimator (default 10). |
cvtmle_ictol |
The CV-TMLE will iterate until the empirical mean of the efficient influence function is smaller than this tolerance, or until max_cvtmle_iter iterations are reached. |
prediction_list |
For power users: a list of predictions made by learner. |
... |
Other arguments, not currently used |
To estimate the AUC of a particular prediction algorithm, K-fold cross-validation is commonly used: data are partitioned into K distinct groups and the prediction algorithm is developed using K-1 of these groups. In standard K-fold cross-validation, the AUC of this prediction algorithm is estimated using the remaining fold. This can be problematic when the number of observations is small or the number of cross-validation folds is large.
Here, we estimate relevant nuisance parameters in the training sample and use
the validation sample to perform some form of bias correction – either through
cross-validated targeted minimum loss-based estimation, estimating equations,
or one-step estimation. When aggressive learning algorithms are applied, it is
necessary to use an additional layer of cross-validation in the training sample
to estimate the nuisance parameters. This is controlled via the nested_cv
option below.
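As a point of reference for the estimators above, the standard K-fold empirical CV AUC (the est_empirical baseline that the corrected estimators improve on) can be sketched with a glm learner and a rank-based AUC. This is an illustrative sketch, not cv_auc's internals.

```r
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))
dat <- cbind(X, Y = Y)

K <- 5
folds <- sample(rep(seq_len(K), length.out = n))  # random fold assignments

# rank-based AUC: P(psi_case > psi_control)
auc <- function(psi, y) {
  r <- rank(psi)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# train on K-1 folds, evaluate AUC on the held-out fold, then average
fold_aucs <- sapply(seq_len(K), function(k) {
  train <- folds != k
  fit <- glm(Y ~ x1, data = dat[train, ], family = binomial())
  psi_valid <- predict(fit, newdata = dat[!train, ], type = "response")
  auc(psi_valid, Y[!train])
})
cv_auc_empirical <- mean(fold_aucs)
```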
An object of class "cvauc".
est_cvtmle
cross-validated targeted minimum loss-based estimator of K-fold CV AUC
iter_cvtmle
iterations needed to achieve convergence of CVTMLE algorithm
cvtmle_trace
the value of the CVTMLE at each iteration of the targeting algorithm
se_cvtmle
estimated standard error based on targeted nuisance parameters
est_init
plug-in estimate of CV AUC where nuisance parameters are estimated in the training sample
est_empirical
the standard K-fold CV AUC estimator
se_empirical
estimated standard error for the standard estimator
est_onestep
cross-validated one-step estimate of K-fold CV AUC
se_onestep
estimated standard error for the one-step estimator
est_esteq
cross-validated estimating equations estimate of K-fold CV AUC
se_esteq
estimated standard error for the estimating equations estimator (same as for one-step)
folds
list of observation indexes in each validation fold
ic_cvtmle
influence function evaluated at the targeted nuisance parameter estimates
ic_onestep
influence function evaluated at the training-fold-estimated nuisance parameters
ic_esteq
influence function evaluated at the training-fold-estimated nuisance parameters
ic_empirical
influence function evaluated at the validation-fold estimated nuisance parameters
prediction_list
a list of output from the cross-validated model training; see the individual wrapper function documentation for further details
# simulate data
n <- 200
p <- 10
X <- data.frame(matrix(rnorm(n*p), nrow = n, ncol = p))
Y <- rbinom(n, 1, plogis(X[,1] + X[,10]))
# get cv auc estimates for logistic regression
cv_auc_ests <- cv_auc(Y = Y, X = X, K = 5, learner = "glm_wrapper")
# get cv auc estimates for random forest
# using nested cross-validation for nuisance parameter estimation
fit <- cv_auc(Y = Y, X = X, K = 5, learner = "randomforest_wrapper", nested_cv = TRUE)
This function computes K-fold cross-validated estimates of the sensitivity-constrained rate of negative prediction (SCRNP). This quantity can be interpreted as the rate of negative classification for a fixed constraint on the sensitivity of a prediction algorithm. Thus, if an algorithm has a high SCRNP, it will also have a high positive predictive value.
cv_scrnp(Y, X, K = 10, sens = 0.95, learner = "glm_wrapper", nested_cv = TRUE, nested_K = K - 1, parallel = FALSE, max_cvtmle_iter = 10, cvtmle_ictol = 1/length(Y), quantile_type = 8, prediction_list = NULL, ...)
Y |
A numeric vector of outcomes, assumed to equal 0 or 1. |
X |
A data.frame of predictors. |
K |
The number of cross-validation folds (default is 10). |
sens |
The sensitivity constraint imposed on the rate of negative prediction (see description). |
learner |
A wrapper that implements the desired method for building a prediction algorithm. |
nested_cv |
A boolean indicating whether nested cross-validation should be used to estimate the distribution of the prediction function. Default is TRUE. |
nested_K |
If nested cross-validation is used, how many inner folds should there be? Default is K - 1. |
parallel |
A boolean indicating whether prediction algorithms should be trained in parallel. Default is FALSE. |
max_cvtmle_iter |
Maximum number of iterations for the bias correction step of the CV-TMLE estimator (default 10). |
cvtmle_ictol |
The CV-TMLE will iterate until the empirical mean of the efficient influence function is smaller than this tolerance, or until max_cvtmle_iter iterations are reached. |
quantile_type |
Type of quantile estimator to be used. See quantile for description. |
prediction_list |
For power users: a list of predictions made by learner. |
... |
Other arguments, not currently used |
Estimating the SCRNP using standard K-fold cross-validation is problematic. If
data are partitioned into K distinct groups, depending on the sample size
and choice of K, the validation sample may be quite small. In order to estimate
SCRNP, we require estimation of a quantile of the predictor's distribution. More extreme
quantiles (which correspond to high sensitivity constraints) are difficult to estimate
using few observations. Here, we estimate relevant nuisance parameters in the training sample and use
the validation sample to perform some form of bias correction – either through
cross-validated targeted minimum loss-based estimation, estimating equations,
or one-step estimation. When aggressive learning algorithms are applied, it is
necessary to use an additional layer of cross-validation in the training sample
to estimate the nuisance parameters. This is controlled via the nested_cv
option below.
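The quantile estimation that makes SCRNP hard in small validation samples can be seen in a simple plug-in computation (no cross-validation or bias correction): the cutoff is the (1 - sens) quantile of the predictions among cases, so that sensitivity is approximately sens, and SCRNP is the rate of prediction at or below that cutoff. This is an illustrative sketch with a glm learner, using quantile type 8 as in cv_scrnp's default.

```r
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))
dat <- cbind(X, Y = Y)

fit <- glm(Y ~ x1, data = dat, family = binomial())
psi <- predict(fit, type = "response")

sens <- 0.95
# cutoff c0: the (1 - sens) quantile of psi among cases, so that
# P(psi > c0 | Y = 1) is approximately sens
c0 <- quantile(psi[Y == 1], probs = 1 - sens, type = 8)

# SCRNP: the rate of negative prediction at that cutoff
scrnp <- mean(psi <= c0)
```

With a high sensitivity constraint like 0.95, c0 is an extreme (5th percentile) quantile of the case predictions, which is exactly the quantity that is hard to estimate from a small validation fold.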
An object of class "scrnp".
est_cvtmle
cross-validated targeted minimum loss-based estimator of K-fold CV SCRNP
iter_cvtmle
iterations needed to achieve convergence of CVTMLE algorithm
cvtmle_trace
the value of the CVTMLE at each iteration of the targeting algorithm
se_cvtmle
estimated standard error based on targeted nuisance parameters
est_init
plug-in estimate of CV SCRNP where nuisance parameters are estimated in the training sample
est_empirical
the standard K-fold CV SCRNP estimator
se_empirical
estimated standard error for the standard estimator
est_onestep
cross-validated one-step estimate of K-fold CV SCRNP
se_onestep
estimated standard error for the one-step estimator
est_esteq
cross-validated estimating equations estimate of K-fold CV AUC (here, equivalent to one-step, since the estimating equation is linear in SCRNP)
se_esteq
estimated standard error for the estimating equations estimator (same as one-step)
folds
list of observation indexes in each validation fold
ic_cvtmle
influence function evaluated at the targeted nuisance parameter estimates
ic_onestep
influence function evaluated at the training-fold-estimated nuisance parameters
ic_esteq
influence function evaluated at the training-fold-estimated nuisance parameters
ic_empirical
influence function evaluated at the validation-fold estimated nuisance parameters
prediction_list
a list of output from the cross-validated model training; see the individual wrapper function documentation for further details
# simulate data
n <- 200
p <- 10
X <- data.frame(matrix(rnorm(n*p), nrow = n, ncol = p))
Y <- rbinom(n, 1, plogis(X[,1] + X[,10]))

# estimate cv scrnp of logistic regression
scrnp_ests <- cv_scrnp(Y = Y, X = X, K = 5, nested_cv = FALSE,
                       learner = "glm_wrapper")

# estimate cv scrnp of random forest with nested
# cross-validation for nuisance parameter estimation
scrnp_ests <- cv_scrnp(Y = Y, X = X, K = 5, nested_cv = TRUE,
                       learner = "randomforest_wrapper")
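A hedged sketch of how these output elements might be used downstream; the fit object and its values below are made up for illustration, whereas a real analysis would use the est_cvtmle and se_cvtmle elements of an actual cv_scrnp result:

```r
# Hypothetical illustration: form a Wald-style 95% confidence interval
# from the est_cvtmle and se_cvtmle elements documented above.
# `fit` is a stand-in list with made-up values, not a real cv_scrnp result.
fit <- list(est_cvtmle = 0.35, se_cvtmle = 0.04)
z <- qnorm(0.975)                                   # standard normal quantile
ci <- fit$est_cvtmle + c(-1, 1) * z * fit$se_cvtmle
round(ci, 3)
```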
"Drug consumption (quantified) Data Set" from UCI Machine Learning Repository.
Raw data have been processed and an outcome
(heroin use) column added.
Data Set Information (copied from UCI library):
The database contains records for 1885 respondents. For each respondent, 12 attributes are known: personality measurements, which include NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking); level of education; age; gender; country of residence; and ethnicity. All input attributes are originally categorical and are quantified; after quantification, the values of all input features can be considered real-valued. In addition, participants were questioned concerning their use of 18 legal and illegal drugs (alcohol, amphetamines, amyl nitrite, benzodiazepines, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine, and volatile substance abuse) and one fictitious drug (Semeron), which was introduced to identify over-claimers. For each drug, respondents selected one of the answers: never used the drug, used it over a decade ago, or used it in the last decade, year, month, week, or day.
The database contains 18 classification problems. Each of the independent label variables contains seven classes: "Never Used", "Used over a Decade Ago", "Used in Last Decade", "Used in Last Year", "Used in Last Month", "Used in Last Week", and "Used in Last Day".
Problems that can be solved:
* Seven-class classification for each drug separately.
* The problem can be transformed to binary classification by merging subsets of classes into a new class. For example, "Never Used" and "Used over a Decade Ago" form the class "Non-user", and all other classes form the class "User".
* The best binarization of classes for each attribute.
* Evaluation of the risk of being a drug consumer, for each drug.
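The binarization described above can be sketched in a few lines of R; the class labels are those listed earlier, while the variable names are purely illustrative:

```r
# Collapse the seven usage classes into "Non-user" vs "User",
# following the binarization example described above.
usage <- c("Never Used", "Used over a Decade Ago", "Used in Last Week")
binary <- ifelse(usage %in% c("Never Used", "Used over a Decade Ago"),
                 "Non-user", "User")
binary  # "Non-user" "Non-user" "User"
```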
A detailed description of the database and of the data quantification process is presented in E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan and A. N. Gorban, "The Five Factor Model of personality and evaluation of drug consumption risk," arXiv:1506.06297, 2015.
The paper above solves the binary classification problem for all drugs. For most drugs, sensitivity and specificity are greater than 75%.
https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29
https://arxiv.org/abs/1506.06297
Compute the targeted conditional cumulative distribution of the learner at a point
F_nBn_star(psi_x, y, train_pred, train_y, epsilon = 0, tol = 0.001)
psi_x |
Value to compute conditional (on Y=y) cdf of learner |
y |
Value of Y to condition on |
train_pred |
Values of Psi_nBn(X) from training sample |
train_y |
Values of Y from training sample |
epsilon |
Vector of fluctuation parameter estimates |
tol |
Truncation level for logistic transformation |
Numeric value of CDF at psi_x
Compute the targeted conditional cumulative distribution of the learner at a point where the initial distribution is based on cross validation
F_nBn_star_nested_cv(psi_x, y, inner_valid_prediction_and_y_list, epsilon = 0, tol = 0.001)
psi_x |
Value to compute conditional (on Y=y) cdf of learner |
y |
Value of Y to condition on |
inner_valid_prediction_and_y_list |
A list of predictions and y's from |
epsilon |
Vector of fluctuation parameter estimates |
tol |
A truncation level when taking logit transformations. |
Numeric value of CDF at psi_x
Helper function for CVTMLE grid search
fluc_mod_optim_0(epsilon, fld, tol = 0.001)
epsilon |
Fluctuation parameter |
fld |
The full_long_data_list |
tol |
Tolerance on predictions close to 0 or 1 |
A numeric value of negative log-likelihood
Helper function for CVTMLE grid search
fluc_mod_optim_1(epsilon, fld, tol = 0.001)
epsilon |
Fluctuation parameter |
fld |
full_long_data_list |
tol |
Tolerance on predictions close to 0 or 1 |
A numeric value of negative log-likelihood
Wrapper for fitting a logistic regression using glm.
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
glm_wrapper(train, test)
train |
A list with named objects |
test |
A list with named objects |
This particular wrapper implements a logistic regression using glm. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit glm
glm_wrap <- glm_wrapper(train = train, test = test)
Wrapper for fitting a penalized regression using glmnet.
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
glmnet_wrapper(train, test, alpha = 1, nfolds = 5, nlambda = 100, use_min = TRUE, loss = "deviance", ...)
train |
A list with named objects |
test |
A list with named objects |
alpha |
See glmnet for further description. |
nfolds |
See glmnet for further description. |
nlambda |
See glmnet for further description. |
use_min |
See glmnet for further description. |
loss |
See glmnet for further description. |
... |
Other options (passed to |
This particular wrapper implements glmnet. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# load the glmnet package
library(glmnet)

# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50), x2 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50), x2 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit glmnet
glmnet_wrap <- glmnet_wrapper(train = train, test = test)
This estimator is computed by leaving out a pair of one case (Y = 1) and one control (Y = 0). The learner is trained on the remaining observations and predicted values are obtained for the left-out pair. The estimate is given by the proportion of left-out pairs for which the case had a higher predicted risk than the control.
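The procedure can be sketched with base R and glm; this is a hypothetical illustration of the algorithm, not the package's internal implementation, and the variable names are made up:

```r
# Hypothetical sketch of leave-pair-out AUC with logistic regression.
set.seed(123)
X <- data.frame(x1 = rnorm(100))
Y <- rbinom(100, 1, plogis(X$x1))
cases <- which(Y == 1)
controls <- which(Y == 0)
pairs <- expand.grid(case = cases, control = controls)
pairs <- pairs[seq_len(min(50, nrow(pairs))), ]   # cap pairs (cf. max_pairs)
concordant <- apply(pairs, 1, function(p) {
  keep <- setdiff(seq_along(Y), p)                # leave the pair out
  fit <- glm(Y[keep] ~ x1, data = X[keep, , drop = FALSE], family = binomial)
  preds <- predict(fit, newdata = X[p, , drop = FALSE], type = "response")
  unname(preds[1] > preds[2])                     # case ranked above control?
})
lpo_est <- mean(concordant)                       # proportion of concordant pairs
```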
lpo_auc(Y, X, learner = "glm_wrapper", max_pairs = NULL, parallel = FALSE, ...)
Y |
A numeric vector of outcomes, assumed to equal 0 or 1. |
X |
A |
learner |
A wrapper that implements the desired method for building a
prediction algorithm. See |
max_pairs |
The maximum number of pairs to leave out. |
parallel |
A boolean indicating whether prediction algorithms should be
trained in parallel. Defaults to FALSE. |
... |
Other options (not currently used) |
# simulate data
X <- data.frame(x1 = rnorm(50))
Y <- rbinom(50, 1, plogis(X$x1))

# compute lpo_auc for logistic regression
lpo <- lpo_auc(Y = Y, X = X, learner = "glm_wrapper")
Internal function used to perform one bootstrap sample. The function uses try to fit learner on a bootstrap sample. If for some reason (e.g., the bootstrap sample contains no observations with Y = 1) the learner fails, then the function returns NA. These NAs are ignored later when computing the bootstrap-corrected estimate.
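Downstream, the failed replicates can simply be dropped when aggregating; an illustrative sketch with made-up values:

```r
# NA entries represent bootstrap samples on which the learner failed;
# they are excluded when averaging across bootstrap replicates.
boot_ests <- c(0.71, NA, 0.68, 0.74, NA)
mean(boot_ests, na.rm = TRUE)  # 0.71
```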
one_boot_auc(Y, X, n, correct632, learner)
Y |
A numeric binary outcome |
X |
A |
n |
Number of observations |
correct632 |
A boolean indicating whether to use the .632 correction. |
learner |
A wrapper that implements the desired method for building a
prediction algorithm. See |
If learner executes successfully, a numeric estimate of AUC on this bootstrap sample. Otherwise, the function returns NA.
Internal function used to perform one bootstrap sample. The function uses try to fit learner on a bootstrap sample. If for some reason (e.g., the bootstrap sample contains no observations with Y = 1) the learner fails, then the function returns NA. These NAs are ignored later when computing the bootstrap-corrected estimate.
one_boot_scrnp(Y, X, n, correct632, learner, sens)
Y |
A numeric binary outcome |
X |
A |
n |
Number of observations |
correct632 |
A boolean indicating whether to use the .632 correction. |
learner |
A wrapper that implements the desired method for building a
prediction algorithm. See |
sens |
The sensitivity constraint to use. |
If learner executes successfully, a numeric estimate of SCRNP on this bootstrap sample. Otherwise, the function returns NA.
Print results of cv_auc
## S3 method for class 'cvauc' print(x, ci_level = 0.95, se_type = "std", ...)
x |
An object of class "cvauc" |
ci_level |
Level of confidence interval to print. Defaults to 0.95. |
se_type |
The type of standard error (currently only "std") |
... |
Other options (not currently used) |
Print results of cv_scrnp
## S3 method for class 'scrnp' print(x, se_type = "std", ci_level = 0.95, ...)
x |
An object of class "scrnp" |
se_type |
The type of standard error (currently only "std") |
ci_level |
Level of confidence interval to print. Defaults to 0.95. |
... |
Other options (not currently used) |
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
randomforest_wrapper(train, test, mtry = floor(sqrt(ncol(train$X))), ntree = 1000, nodesize = 1, maxnodes = NULL, importance = FALSE, ...)
train |
A list with named objects |
test |
A list with named objects |
mtry |
See randomForest. |
ntree |
See randomForest. |
nodesize |
See randomForest. |
maxnodes |
See randomForest. |
importance |
See randomForest. |
... |
Other options (passed to |
This particular wrapper implements the randomForest ensemble methodology. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit random forest
rf_wrap <- randomforest_wrapper(train = train, test = test)
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
ranger_wrapper(train, test, num.trees = 500, mtry = floor(sqrt(ncol(train$X))), write.forest = TRUE, probability = TRUE, min.node.size = 5, replace = TRUE, sample.fraction = ifelse(replace, 1, 0.632), num.threads = 1, verbose = TRUE, ...)
train |
A list with named objects |
test |
A list with named objects |
num.trees |
See ranger. |
mtry |
See ranger. |
write.forest |
See ranger. |
probability |
See ranger. |
min.node.size |
See ranger. |
replace |
See ranger. |
sample.fraction |
See ranger. |
num.threads |
See ranger. |
verbose |
See ranger. |
... |
Other options (passed to |
This particular wrapper implements the ranger ensemble methodology. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit ranger
rf_wrap <- ranger_wrapper(train = train, test = test)
Wrapper for fitting a stepwise logistic regression using glm.
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
stepglm_wrapper(train, test)
train |
A list with named objects |
test |
A list with named objects |
This particular wrapper implements a forward stepwise logistic regression using glm and step. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit stepwise glm
step_wrap <- stepglm_wrapper(train = train, test = test)
Wrapper for fitting a super learner using SuperLearner.
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
superlearner_wrapper(train, test, SL.library = c("SL.mean"), ...)
train |
A list with named objects |
test |
A list with named objects |
SL.library |
|
... |
Other options (passed to |
This particular wrapper implements the SuperLearner ensemble methodology. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# load the SuperLearner package
library(SuperLearner)

# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit super learner
sl_wrap <- superlearner_wrapper(train = train, test = test,
                                SL.library = c("SL.mean", "SL.glm"))
"Wine Quality" data set from UCI Machine Learning Repository. The red and white wine data sets have been combined with an added attribute for red vs. white.
Data Set Information (copied from UCI):
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
Attribute Information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
https://archive.ics.uci.edu/ml/datasets/Wine+Quality
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
https://doi.org/10.1016/j.dss.2009.05.016
Wrapper for fitting eXtreme gradient boosting using xgboost.
Compatible learner wrappers for this package should have a specific format. Namely, they should take as input a list called train that contains named objects $Y and $X, which contain, respectively, the outcomes and predictors in a particular training fold. Other options may be passed in to the function as well. The function must output a list with the following named objects: test_pred = predictions of test$Y based on the learner fit using train$X; train_pred = predictions of train$Y based on the learner fit using train$X; model = the fitted model (only necessary if you desire to look at this model later; not used for internal computations); train_y = a copy of train$Y; test_y = a copy of test$Y.
xgboost_wrapper(test, train, ntrees = 500, max_depth = 4, shrinkage = 0.1, minobspernode = 2, params = list(), nthread = 1, verbose = 0, save_period = NULL)
test |
A list with named objects |
train |
A list with named objects |
ntrees |
See xgboost |
max_depth |
See xgboost |
shrinkage |
See xgboost |
minobspernode |
See xgboost |
params |
See xgboost |
nthread |
See xgboost |
verbose |
See xgboost |
save_period |
See xgboost |
This particular wrapper implements eXtreme gradient boosting using xgboost. We refer readers to the original package's documentation for more details.
A list with named objects (see description).
# simulate data
# make list of training data
train_X <- data.frame(x1 = runif(50))
train_Y <- rbinom(50, 1, plogis(train_X$x1))
train <- list(Y = train_Y, X = train_X)

# make list of test data
test_X <- data.frame(x1 = runif(50))
test_Y <- rbinom(50, 1, plogis(test_X$x1))
test <- list(Y = test_Y, X = test_X)

# fit xgboost
xgb_wrap <- xgboost_wrapper(train = train, test = test)