| Title: | Hybrid Genetic Algorithm and Simulated Annealing for Variable Selection with Interaction Effects |
|---|---|
| Description: | A hybrid genetic algorithm combined with simulated annealing for variable selection in high-dimensional linear regression with two-way interaction effects. The package uses an indexed model formulation for efficient storage and minimizes the ABC (Adaptive Best-subset with Complexity) criterion from Ye and Yang (2019) <doi:10.1109/TIT.2019.2913417>. Features include: (1) support for all three heredity conditions (strong, weak, and no heredity), (2) adaptive temperature control in simulated annealing, (3) LASSO-based screening for ultra-high-dimensional problems, and (4) automatic sigma and lambda tuning. The package scales to problems with thousands of predictors. |
| Authors: | Leiyue Li [aut, cre], Zhengzhong You [aut], Zeya Wang [aut], Chenglong Ye [aut] |
| Maintainer: | Leiyue Li <[email protected]> |
| License: | GPL-2 |
| Version: | 2.3.0 |
| Built: | 2026-06-03 07:21:49 UTC |
| Source: | https://github.com/lli289/hysaint |
Computes the ABC (Adaptive Best-subset with Complexity) score for a fitted model following Ye and Yang (2019). Lower ABC scores indicate better model fit.
ABC( X, y, heredity = "Strong", sigma = NULL, main_idx = NULL, inter_idx = NULL, lambda = 10, p = NULL )ABC( X, y, heredity = "Strong", sigma = NULL, main_idx = NULL, inter_idx = NULL, lambda = 10, p = NULL )
X |
Input data matrix of dimension |
y |
Response variable. A numeric vector of length |
heredity |
Heredity constraint: "Strong", "Weak", or "No". Default is "Strong". |
sigma |
The standard deviation of noise. If NULL, it is estimated automatically. |
main_idx |
Indices of selected main effects (integers from 1 to p). |
inter_idx |
Interaction indices. Can be either:
|
lambda |
ABC penalty parameter. Must satisfy |
p |
Total number of main effects. If NULL, uses ncol(X). |
The complexity term depends on the heredity condition:
Strong heredity:
Weak heredity:
where .
No heredity:
The function uses lchoose for numerical stability with large binomial coefficients.
A list with components:
ABC |
The ABC score of the fitted model. |
SSE |
Sum of squared errors. |
r_I |
Rank of the design matrix (number of parameters). |
C_I |
Complexity penalty term. |
k1 |
Number of main effects in the model. |
k2 |
Number of interactions in the model. |
Ye, C. and Yang, Y. (2019). High-dimensional adaptive minimax sparse estimation with interactions. IEEE Transactions on Information Theory, 65(9), 5367-5379. doi:10.1109/TIT.2019.2913417
hySAINT for the main variable selection function.
set.seed(123) n <- 100; p <- 10 X <- matrix(rnorm(n * p), n, p) y <- X[,1] + X[,2] + 0.5*X[,1]*X[,2] + rnorm(n, 0, 0.5) # Compute ABC for a model with main effects 1, 2 and interaction 1:2 result <- ABC(X, y, heredity = "Strong", sigma = 0.5, main_idx = c(1, 2), inter_idx = matrix(c(1, 2), nrow = 1)) print(result$ABC) # Compare with a smaller model (main effects only) result2 <- ABC(X, y, heredity = "Strong", sigma = 0.5, main_idx = c(1, 2), inter_idx = NULL) print(result2$ABC) # Should be higher (worse) since true model has interactionset.seed(123) n <- 100; p <- 10 X <- matrix(rnorm(n * p), n, p) y <- X[,1] + X[,2] + 0.5*X[,1]*X[,2] + rnorm(n, 0, 0.5) # Compute ABC for a model with main effects 1, 2 and interaction 1:2 result <- ABC(X, y, heredity = "Strong", sigma = 0.5, main_idx = c(1, 2), inter_idx = matrix(c(1, 2), nrow = 1)) print(result$ABC) # Compare with a smaller model (main effects only) result2 <- ABC(X, y, heredity = "Strong", sigma = 0.5, main_idx = c(1, 2), inter_idx = NULL) print(result2$ABC) # Should be higher (worse) since true model has interaction
Creates an initial population of candidate models for the genetic algorithm. Each individual represents a model with a random subset of main effects and interactions satisfying the heredity constraint. The first few individuals use top-ranked variables to seed the population with good solutions.
generate_initial_population( main_pool, inter_pool, r1, r2, pop_size, heredity, p )generate_initial_population( main_pool, inter_pool, r1, r2, pop_size, heredity, p )
main_pool |
Ranked main effect candidates (from |
inter_pool |
Ranked interaction candidates (from |
r1 |
Maximum number of main effects per individual. |
r2 |
Maximum number of interactions per individual. |
pop_size |
Number of individuals in the population. |
heredity |
Heredity constraint: "Strong", "Weak", or "No". |
p |
Total number of main effects in the original data. |
The function uses a biased sampling scheme where top-ranked variables (from LASSO screening) have higher probability of selection. This helps the algorithm converge faster by starting with promising candidates.
A list of individuals, where each individual is a list with:
main_idx |
Integer vector of selected main effect indices. |
inter_idx |
Integer vector of selected interaction linear indices. |
hySAINT which calls this function,
crossover and mutate for genetic operators.
# This function is typically called internally by hySAINT # Example for illustration: main_pool <- c(1, 2, 3, 5, 8, 10) inter_pool <- matrix(c(1, 2, 101, 1, 3, 102, 2, 3, 103), ncol = 3, byrow = TRUE) colnames(inter_pool) <- c("i", "j", "idx") pop <- generate_initial_population(main_pool, inter_pool, r1 = 4, r2 = 2, pop_size = 10, heredity = "Strong", p = 50) length(pop) # 10 individuals# This function is typically called internally by hySAINT # Example for illustration: main_pool <- c(1, 2, 3, 5, 8, 10) inter_pool <- matrix(c(1, 2, 101, 1, 3, 102, 2, 3, 103), ncol = 3, byrow = TRUE) colnames(inter_pool) <- c("i", "j", "idx") pop <- generate_initial_population(main_pool, inter_pool, r1 = 4, r2 = 2, pop_size = 10, heredity = "Strong", p = 50) length(pop) # 10 individuals
Generates candidate interaction effects based on selected main effects and the heredity constraint. The pool size is controlled by the heredity condition:
Strong: Only pairs where both variables are in main_idx
Weak: Pairs where at least one variable is in main_idx
No: All pairs among candidate main effects
generate_inter_pool(main_idx, p, heredity = "Strong", all_main_pool = NULL)generate_inter_pool(main_idx, p, heredity = "Strong", all_main_pool = NULL)
main_idx |
Indices of selected main effects. |
p |
Total number of main effects in the original data. |
heredity |
Heredity constraint: "Strong", "Weak", or "No". Default is "Strong". |
all_main_pool |
Full pool of candidate main effects. Used for weak heredity to determine which additional interactions to consider. |
A matrix with three columns:
i |
First variable in the interaction pair. |
j |
Second variable in the interaction pair (j > i). |
idx |
Linear index of the interaction. |
screen_main_effects for screening main effects,
rank_interactions for ranking interaction candidates.
# Strong heredity: only interactions among selected main effects main_selected <- c(1, 2, 5, 8) pool <- generate_inter_pool(main_selected, p = 100, heredity = "Strong") print(nrow(pool)) # Should be choose(4, 2) = 6 # Weak heredity: more interactions allowed pool_weak <- generate_inter_pool(main_selected, p = 100, heredity = "Weak", all_main_pool = 1:20) print(nrow(pool_weak)) # Larger than strong heredity# Strong heredity: only interactions among selected main effects main_selected <- c(1, 2, 5, 8) pool <- generate_inter_pool(main_selected, p = 100, heredity = "Strong") print(nrow(pool)) # Should be choose(4, 2) = 6 # Weak heredity: more interactions allowed pool_weak <- generate_inter_pool(main_selected, p = 100, heredity = "Weak", all_main_pool = 1:20) print(nrow(pool_weak)) # Larger than strong heredity
This is the main function of the hySAINT package. It implements a hybrid genetic algorithm combined with simulated annealing for variable selection in high-dimensional linear regression with two-way interaction effects. The algorithm minimizes the ABC (Adaptive Best-subset with Complexity) criterion from Ye and Yang (2019), which achieves selection consistency under all three heredity conditions.
hySAINT( X, y, heredity = c("Strong", "Weak", "No"), r1, r2, sigma = NULL, lambda = NULL, pop_size = 50, max_iter = 100, patience = 30, verbose = TRUE )hySAINT( X, y, heredity = c("Strong", "Weak", "No"), r1, r2, sigma = NULL, lambda = NULL, pop_size = 50, max_iter = 100, patience = 30, verbose = TRUE )
X |
Input data matrix of dimension |
y |
Response variable. A numeric vector of length |
heredity |
Heredity constraint: "Strong", "Weak", or "No". Default is "Strong".
|
r1 |
Maximum number of main effects to select. |
r2 |
Maximum number of interaction effects to select. |
sigma |
The standard deviation of noise. If NULL (default), it is estimated automatically using the median absolute deviation (MAD) estimator. |
lambda |
ABC penalty parameter. If NULL (default), uses the theoretical minimum
|
pop_size |
Population size for the genetic algorithm. Default is 50. |
max_iter |
Maximum number of GA iterations. Default is 100. |
patience |
Early stopping patience. The algorithm stops if no improvement is found for this many consecutive iterations. Default is 30. |
verbose |
Logical. Print progress messages? Default is TRUE. |
The algorithm proceeds in four main steps:
Screening: Main effects are screened using LASSO (via glmnet). This reduces the search space from p to a manageable number of candidates.
Interaction Pool: Candidate interactions are generated based on the heredity condition and ranked by partial correlation with the response.
Genetic Algorithm: A population of candidate models evolves through selection, crossover, and mutation operations.
Simulated Annealing: The mutation step uses a Metropolis acceptance criterion with adaptive temperature control, allowing the algorithm to escape local optima.
The ABC criterion balances goodness-of-fit with model complexity:
where is the model rank and is a heredity-specific complexity term
derived from minimum description length theory.
An object of class "hySAINT" with components:
main_effects |
Indices of selected main effects (1 to p). |
interactions |
Matrix of selected interaction pairs (each row is a pair of indices). |
ABC_score |
Final ABC score of the selected model. |
coefficients |
Fitted OLS coefficients for the selected model. |
sigma |
Estimated or provided sigma value. |
lambda |
Lambda value used. |
runtime |
Total computation time in seconds. |
convergence |
List with convergence information including best scores per iteration. |
Ye, C. and Yang, Y. (2019). High-dimensional adaptive minimax sparse estimation with interactions. IEEE Transactions on Information Theory, 65(9), 5367-5379. doi:10.1109/TIT.2019.2913417
ABC for computing the ABC criterion directly,
screen_main_effects for the screening step,
generate_initial_population for population initialization.
# Example 1: Low-dimensional case (p < n) set.seed(123) n <- 200; p <- 20 X <- matrix(rnorm(n * p), n, p) # True model: y = X1 + X2 + X3 + X1*X2 + noise y <- X[,1] + X[,2] + X[,3] + X[,1]*X[,2] + rnorm(n, 0, 0.5) result <- hySAINT(X, y, heredity = "Strong", r1 = 5, r2 = 3, max_iter = 50, verbose = FALSE) print(result) # Example 2: High-dimensional case (p > n) ## Not run: set.seed(456) n <- 100; p <- 500 X <- matrix(rnorm(n * p), n, p) y <- X[,1] + X[,2] + 0.5*X[,1]*X[,2] + rnorm(n, 0, 1) result <- hySAINT(X, y, heredity = "Strong", r1 = 10, r2 = 5) print(result) ## End(Not run)# Example 1: Low-dimensional case (p < n) set.seed(123) n <- 200; p <- 20 X <- matrix(rnorm(n * p), n, p) # True model: y = X1 + X2 + X3 + X1*X2 + noise y <- X[,1] + X[,2] + X[,3] + X[,1]*X[,2] + rnorm(n, 0, 0.5) result <- hySAINT(X, y, heredity = "Strong", r1 = 5, r2 = 3, max_iter = 50, verbose = FALSE) print(result) # Example 2: High-dimensional case (p > n) ## Not run: set.seed(456) n <- 100; p <- 500 X <- matrix(rnorm(n * p), n, p) y <- X[,1] + X[,2] + 0.5*X[,1]*X[,2] + rnorm(n, 0, 1) result <- hySAINT(X, y, heredity = "Strong", r1 = 10, r2 = 5) print(result) ## End(Not run)
Screens main effects using the LASSO regularization path. Variables that enter the model earlier and more frequently across the regularization path are ranked higher. This approach is more robust than simple correlation screening because it accounts for correlations among predictors.
screen_main_effects(X, y, n.lambda = 50, max.main = NULL)screen_main_effects(X, y, n.lambda = 50, max.main = NULL)
X |
Input data matrix of dimension |
y |
Response variable. A numeric vector of length |
n.lambda |
Number of lambda values in the LASSO path. Default is 50. |
max.main |
Maximum number of main effects to return. If NULL (default),
uses |
The function fits a LASSO path using glmnet and ranks variables by:
Order of entry (earlier = more important)
Frequency of selection across the lambda path
If glmnet is not available, the function falls back to correlation-based screening.
A list with components:
ranked_main |
Indices of main effects ranked by importance (earlier entry in LASSO path and higher selection frequency). |
selection_freq |
Number of times each variable was selected across the lambda path. |
first_entry |
Index along the path where each variable first entered. |
method |
Screening method used ("glmnet" or "correlation" as fallback). |
hySAINT which uses this function internally,
generate_inter_pool for generating interaction candidates.
set.seed(123) n <- 100; p <- 50 X <- matrix(rnorm(n * p), n, p) y <- X[,1] + X[,2] + X[,3] + rnorm(n) screened <- screen_main_effects(X, y, max.main = 20) print(screened$ranked_main[1:10]) # Top 10 main effectsset.seed(123) n <- 100; p <- 50 X <- matrix(rnorm(n * p), n, p) y <- X[,1] + X[,2] + X[,3] + rnorm(n) screened <- screen_main_effects(X, y, max.main = 20) print(screened$ranked_main[1:10]) # Top 10 main effects