Fits a model on the original data, then repeatedly fits on perturbed
versions to collect coefficient estimates, p-values, and predictions for
downstream stability analysis. Four modeling backends are supported:
ordinary least squares ("lm"), generalized linear models
("glm"), robust regression ("rlm" via MASS), and
penalized regression ("glmnet" via glmnet).
Arguments
- formula
A model formula.
- data
A data frame.
- B
Integer. Number of perturbation iterations. Default is
200.- method
Perturbation method passed to
perturb_data. One of"bootstrap"(default),"subsample", or"noise".- alpha
Significance threshold for p-value and selection stability. Default is
0.05.- frac
Subsampling fraction. Passed to
perturb_data. Default is0.8.- noise_sd
Noise level. Passed to
perturb_data. Default is0.05.- predict_newdata
Optional data frame for out-of-sample prediction stability. Defaults to
data.- family
A GLM family object (e.g.
stats::binomial(),stats::poisson()). Used only whenbackend = "glm"(or whenfamilyis non-NULLandbackend = "lm", in which case the backend is silently promoted to"glm").- backend
Modeling backend. One of
"lm"(default),"glm","rlm"(robust M-estimation viaMASS::rlm), or"glmnet"(penalized regression viaglmnet::glmnet).- en_alpha
Elastic-net mixing parameter passed to
glmnet::glmnet:1(default) gives the LASSO,0gives ridge, intermediate values give elastic net. Ignored for other backends.- lambda
Regularization parameter for
glmnet. WhenNULL(default) the penalty is selected byglmnet::cv.glmnetusinglambda.min. Ignored for other backends.- perturb_response
Logical. When
FALSE(default) andmethod = "noise", noise is added only to predictor columns; the response variable is left unchanged. Set toTRUEto also perturb the response (e.g. to simulate measurement error in the outcome). Has no effect formethod = "bootstrap"or"subsample", where row-level resampling naturally carries the response along.
Value
An object of class "reprostat", a list with components:
callThe matched call.
formulaThe model formula.
BNumber of iterations used.
alphaSignificance threshold used.
base_fitModel fitted on the original data.
y_trainResponse vector from the original data (used internally for RI computation).
coef_matB x p matrix of coefficient estimates.
p_matB x p matrix of p-values, or all-
NAforbackend = "glmnet"(where p-values are not defined).pred_matn x B matrix of predictions.
methodPerturbation method used.
familyGLM family used, or
NULL.backendBackend used.
en_alphaElastic-net mixing parameter used (only relevant for
backend = "glmnet", otherwiseNA).
Examples
set.seed(1)
# Linear model (small B for a quick check)
diag_lm <- run_diagnostics(mpg ~ wt + hp, data = mtcars, B = 20)
print(diag_lm)
#> ReproStat Diagnostics
#> ---------------------
#> Formula : mpg ~ wt + hp
#> Backend : lm
#> Method : bootstrap
#> Iterations: 20
#> Terms : (Intercept), wt, hp
# \donttest{
# Logistic regression
diag_glm <- run_diagnostics(am ~ wt + hp + qsec, data = mtcars, B = 50,
family = stats::binomial())
reproducibility_index(diag_glm)
#> $index
#> [1] 75.00492
#>
#> $components
#> coef pvalue selection prediction
#> 0.2497427 1.0000000 0.8533333 0.8971209
#>
# Robust regression (requires MASS)
if (requireNamespace("MASS", quietly = TRUE)) {
diag_rlm <- run_diagnostics(mpg ~ wt + hp, data = mtcars, B = 50,
backend = "rlm")
reproducibility_index(diag_rlm)
}
#> $index
#> [1] 96.84662
#>
#> $components
#> coef pvalue selection prediction
#> 0.9190617 0.9800000 1.0000000 0.9748032
#>
# Penalized regression / LASSO (requires glmnet)
if (requireNamespace("glmnet", quietly = TRUE)) {
diag_glmnet <- run_diagnostics(mpg ~ wt + hp + disp + qsec, data = mtcars,
B = 50, backend = "glmnet", en_alpha = 1)
reproducibility_index(diag_glmnet)
}
#> $index
#> [1] 82.57931
#>
#> $components
#> coef pvalue selection prediction
#> 0.7604751 NA 0.7450000 0.9719042
#>
# }