Notebook-style Walkthrough

Evolutionary Optimization in R

Author: William  •  Date: 2025-10-27

Four families of evolutionary algorithms with code you can copy or download: Genetic Algorithms (GA), Differential Evolution (DE), CMA‑ES, and Genetic Programming (GP). Clear intuition and rigorous detail for readers from intro stats to PhD level.

Overview

Evolutionary Algorithms (EAs) are population‑based, stochastic optimizers inspired by natural selection. They shine on non‑convex, non‑differentiable, multi‑modal, and mixed discrete/continuous problems where gradients are unavailable or unreliable.

  • GA: selection, crossover, mutation, elitism — continuous and binary encodings.
  • DE: vector‑based mutations on real vectors, simple and powerful.
  • CMA‑ES: adapts a full covariance matrix to follow the local landscape curvature.
  • GP: evolves programs or formulas (symbolic regression/classifiers) under a grammar.
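
All four share the same loop: evaluate a population, keep what works, and perturb it to explore. Here is a minimal, package-free sketch of that loop on a toy 2‑D problem; it is purely illustrative (the objective and operator choices are mine), and the packages used below implement far better operators.

R
# A tiny EA (mutation + truncation selection) minimizing a 2-D sphere function
set.seed(1)
obj <- function(x) sum(x^2)                   # toy objective (minimum at the origin)
pop <- matrix(runif(20 * 2, -5, 5), 20, 2)    # 1. random initial population (20 candidates)
for (gen in 1:100) {
  fit <- apply(pop, 1, obj)                   # 2. evaluate every candidate
  parents  <- pop[order(fit)[1:10], , drop = FALSE]              # 3. keep the best half
  children <- parents + matrix(rnorm(10 * 2, sd = 0.3), 10, 2)   # 4. mutate copies of the parents
  pop <- rbind(parents, children)             # 5. next generation (elitist: parents survive)
}
pop[which.min(apply(pop, 1, obj)), ]          # best solution found (close to the origin)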

Why this interests me

I’ve always loved nature documentaries — particularly the gritty stories of survival of the fittest. That lens made evolutionary computation click for me: maintain diversity, adapt to the environment (the objective), and select what works. In practice, that means searching complex spaces without assuming smoothness or convexity — exactly what many data science and OR problems look like.

When would you use EAs?

  • Vast model spaces: feature selection with hundreds/thousands of variables; pipeline search; rule/scorecard construction.
  • Non‑differentiable/black‑box objectives: CV AUC, F1, fairness constraints, simulation outputs.
  • Mixed discrete–continuous decisions: e.g., integer programming with real hyperparameters; architecture search.
  • Multi‑modal landscapes: many local optima (e.g., Rastrigin, Ackley) where restarts/gradients struggle.
  • Constraints & business rules: budget, cardinality, monotonicity — handled via penalties or repair.
  • Expensive evaluations: use small populations, early stopping, caching, and parallelism (a minimal caching wrapper is sketched below).
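
A memoised wrapper keeps repeated candidates from being re-scored. A minimal sketch; make_cached_fitness is an illustrative helper of mine, not part of any package.

R
# Cache fitness values keyed on the (rounded) candidate vector
make_cached_fitness <- function(fitness, digits = 6) {
  cache <- new.env(hash = TRUE)
  function(x) {
    key <- paste(round(x, digits), collapse = ",")
    if (exists(key, envir = cache, inherits = FALSE)) return(get(key, envir = cache))
    val <- fitness(x)
    assign(key, val, envir = cache)
    val
  }
}
# usage later on, e.g.: ga(..., fitness = make_cached_fitness(fitness_rastrigin), ...)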

Genetic Algorithm (continuous optimization)

We minimize the Rastrigin function (non‑convex, many local minima) to illustrate GA on a continuous domain.

R
# --- Setup (packages) ---
# install.packages(c("GA")) # uncomment if needed
library(GA)
set.seed(42)

# Rastrigin function (global minimum at x = 0)
rastrigin <- function(x) {
  10 * length(x) + sum(x^2 - 10 * cos(2 * pi * x))
}

# ga() maximizes the fitness, so we maximize the negative of the objective
fitness_rastrigin <- function(x) -rastrigin(x)

# Run GA in 5 dimensions with real-valued genes
D <- 5
GA_fit <- ga(
  type = "real-valued",
  fitness = fitness_rastrigin,
  lower = rep(-5.12, D), upper = rep(5.12, D),
  popSize = 80, maxiter = 200, run = 50,
  pmutation = 0.10, pcrossover = 0.80, elitism = 3
  # optional: parallel = TRUE
)

summary(GA_fit)
plot(GA_fit)  # fitness over iterations

Explanation:
  • Encoding: Real‑valued vector x ∈ ℝ^D.
  • Selection → crossover → mutation → elitism provide exploitation/exploration balance.
  • Stopping: run=50 stops if no improvement for 50 generations or maxiter reached.
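
The fitted object stores the best chromosome(s) in its solution slot; recover it and score it on the original (un‑negated) objective:

R
best_x <- GA_fit@solution[1, ]   # best real-valued chromosome found
best_x
rastrigin(best_x)                # objective value at that point (0 at the global minimum)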

GA for Feature Selection (binary encoding)

We select a parsimonious subset for a logistic regression on a credit‑like dataset, balancing AUC against model size with a sparsity penalty.

R
# --- Data: read previously generated CSV or simulate ---
# install.packages(c("pROC"))
library(pROC)
set.seed(42)

if (file.exists("synthetic_credit_data.csv")) {
  df <- read.csv("synthetic_credit_data.csv")
} else {
  # quick synthetic fallback (smaller)
  n <- 2000
  Revenue <- rlnorm(n, 10, 1)
  Profit  <- Revenue * runif(n, 0.05, 0.2)
  Score   <- pmin(1000, pmax(500, rnorm(n, 750, 100) + (Revenue/1e5)*50 + (Profit/5e4)*30))
  Loan    <- pmin(5e4, pmax(1e3, rgamma(n, 5, 1/5000))) * (1 - 0.2 * ((1000-Score)/1000))
  pd      <- pmin(0.9, pmax(0.01, 0.3 - (Score-500)/2000 + (Loan-1000)/1e5))
  Default <- rbinom(n, 1, pd)
  df <- data.frame(Default, Revenue, Profit, Score, Loan)
}

response   <- "Default"
predictors <- setdiff(names(df), response)
p <- length(predictors)

# 5-fold CV AUC for a selected subset
cv_auc <- function(mask) {
  idx <- which(mask == 1)
  if (length(idx) == 0) return(0)  # no features → poor
  Xvars <- predictors[idx]
  fold <- sample(rep(1:5, length.out = nrow(df)))
  aucs <- numeric(5)
  for (k in 1:5) {
    train <- df[fold != k, ]
    test  <- df[fold == k, ]
    fmla  <- as.formula(paste(response, "~", paste(Xvars, collapse = "+")))
    m     <- glm(fmla, data = train, family = binomial())
    pr    <- predict(m, newdata = test, type = "response")
    aucs[k] <- tryCatch(as.numeric(auc(roc(test[[response]], pr))), error = function(e) 0.5)
  }
  mean(aucs)
}

# Fitness with sparsity penalty (maximize)
lambda <- 0.10  # penalty per fraction of selected features
fitness_fs <- function(bits) {
  bits <- round(bits)              # defensive; ga(type = "binary") already supplies 0/1 bits
  auc  <- cv_auc(bits)
  sel_frac <- sum(bits)/p
  auc - lambda * sel_frac
}

# Run GA (binary)
library(GA)
GA_fs <- ga(
  type = "binary",
  fitness = fitness_fs,
  nBits = p,
  popSize = 60, maxiter = 120, run = 40,
  pmutation = 0.08, pcrossover = 0.85, elitism = 2
)

summary(GA_fs)
plot(GA_fs)

best_mask   <- GA_fs@solution[1, ]
chosen_vars <- predictors[as.logical(best_mask)]
final_fmla  <- as.formula(paste(response, "~", paste(chosen_vars, collapse = "+")))
final_mod   <- glm(final_fmla, data = df, family = binomial())
summary(final_mod)

Notes:
  • Encoding: Binary vector of length p; 1 = include feature.
  • Fitness: CV AUC − λ·(fraction selected) balances accuracy vs parsimony.
  • Reliability: Report nested‑CV scores or bootstrap selection frequency for stability.
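
One way to gauge that stability is to tally how often each predictor is selected across bootstrap resamples. A rough sketch, reusing df, predictors, p, response and lambda from above; in-sample AUC keeps the illustration cheap, so substitute cv_auc on the resample for a fairer score.

R
B <- 20                                       # bootstrap resamples (small, for illustration)
sel_count <- setNames(numeric(p), predictors)
for (b in seq_len(B)) {
  boot_df <- df[sample(nrow(df), replace = TRUE), ]
  fit_b <- function(bits) {                   # fitness that closes over the bootstrap sample
    idx <- which(round(bits) == 1)
    if (length(idx) == 0) return(0)
    fmla <- as.formula(paste(response, "~", paste(predictors[idx], collapse = "+")))
    m    <- glm(fmla, data = boot_df, family = binomial())
    a    <- tryCatch(as.numeric(auc(roc(boot_df[[response]], fitted(m)))),
                     error = function(e) 0.5)
    a - lambda * length(idx) / p
  }
  g <- ga(type = "binary", fitness = fit_b, nBits = p,
          popSize = 30, maxiter = 40, run = 15, monitor = FALSE)
  sel_count <- sel_count + g@solution[1, ]
}
sel_count / B                                 # selection frequency per predictor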

Differential Evolution (continuous hyperparameter tuning)

Tune SVM hyperparameters (C, gamma) with DEoptim using 5‑fold CV AUC. DE is robust on rugged continuous landscapes.

R
# --- Setup ---
# install.packages(c("DEoptim", "e1071", "pROC"))
library(DEoptim)
library(e1071)
library(pROC)
set.seed(42)

# Reuse df from above if present; otherwise create a quick numeric-only frame
if (!exists("df")) {
  n <- 1500
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  eta <- 1 + 1.2*x1 - 0.8*x2 + 0.5*x3
  prob <- plogis(eta)
  y    <- rbinom(n, 1, prob)   # avoid reusing p, which holds the predictor count above
  df  <- data.frame(Default = y, x1, x2, x3)
}

# 5-fold CV AUC for given (C, gamma)
cv_auc_svm <- function(C, gamma) {
  folds <- sample(rep(1:5, length.out = nrow(df)))
  aucs  <- numeric(5)
  for (k in 1:5) {
    train <- df[folds != k, ]
    test  <- df[folds == k, ]
    train$Default <- factor(train$Default)  # classification needs a factor response (else svm() fits regression)
    m <- svm(Default ~ ., data = train, kernel = "radial",
             cost = C, gamma = gamma, probability = TRUE)
    pr <- attr(predict(m, newdata = test, probability = TRUE), "probabilities")[, "1"]
    aucs[k] <- tryCatch(as.numeric(auc(roc(test$Default, pr))), error = function(e) 0.5)
  }
  mean(aucs)
}

# DEoptim minimizes; minimize 1 - AUC; search on log-scale
obj <- function(par) {
  C     <- exp(par[1])
  gamma <- exp(par[2])
  1 - cv_auc_svm(C, gamma)
}

lower <- log(c(1e-3, 1e-4))
upper <- log(c(1e+3, 1e-1))

res <- DEoptim(obj, lower = lower, upper = upper,
               control = DEoptim.control(NP = 60, itermax = 80, CR = 0.9, F = 0.7))

best_par <- exp(res$optim$bestmem)
cat("Best C, gamma:", best_par, "\n")

m_final <- svm(Default ~ ., data = df, kernel = "radial",
               cost = best_par[1], gamma = best_par[2], probability = TRUE)
summary(m_final)

Explanation:
  • Mutation (DE/rand/1): v = a + F*(b - c); binomial crossover with rate CR; 1‑to‑1 selection.
  • Scale & bounds: Search on log‑scale for C, gamma, then exponentiate.
  • Objective: Minimizing 1 − AUC ≡ maximizing AUC.
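
The update in the first bullet is easy to write out by hand. A toy one-step sketch for a single target vector; a, b, c and CR are as in the bullet (Fw stands in for F), and this is not DEoptim's internal code.

R
# One DE/rand/1 step for a single target vector x
set.seed(1)
D  <- 2
x  <- runif(D); a <- runif(D); b <- runif(D); c <- runif(D)   # target and three distinct population members
Fw <- 0.7; CR <- 0.9                                          # Fw plays the role of F in the bullet above
v  <- a + Fw * (b - c)                                        # mutation: donor vector
jr <- sample(D, 1)                                            # guarantee at least one donor coordinate
u  <- ifelse(runif(D) < CR | seq_len(D) == jr, v, x)          # binomial crossover: trial vector
u  # the trial replaces x in the next generation only if it scores at least as well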

CMA‑ES (Covariance Matrix Adaptation Evolution Strategy)

CMA‑ES adapts a full covariance matrix of the search distribution to match local curvature, yielding strong performance on non‑separable, ill‑conditioned problems.

R
# --- CMA-ES on Rastrigin ---
# install.packages("cmaes")  # uncomment if needed
library(cmaes)
set.seed(42)

rastrigin <- function(x) 10 * length(x) + sum(x^2 - 10 * cos(2 * pi * x))

D <- 5
start <- runif(D, -5, 5)
res <- cma_es(
  par = start,
  fn  = rastrigin,
  lower = rep(-5.12, D), upper = rep(5.12, D),
  control = list(sigma = 2.0, maxit = 1500, stopfitness = 1e-8)
)

res$par      # best solution found (global minimum is at x = 0)
res$value    # best objective value (0 at the global minimum)

Explanation:
  • Distribution: Samples from 𝒩(m, σ²C); updates m (mean), σ (step size), and C (covariance).
  • Invariance: Rank‑based updates make CMA‑ES invariant to monotone fitness transforms and to linear coordinate transforms.
  • When to prefer: Non‑separable, ill‑conditioned problems; lower tuning burden than GA/DE.
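
The sampling step in the first bullet can be written out directly. A toy sketch of one generation's proposal step, reusing rastrigin from the block above; the variable names are mine and this is not the cmaes package's internal code, whose job is precisely the subsequent update of m, σ and C from the ranked samples.

R
# Draw lambda candidates from N(m, sigma^2 * C), then evaluate them
set.seed(1)
m     <- c(0, 0)                              # current mean of the search distribution
sigma <- 1.5                                  # current step size
Cov   <- matrix(c(1, 0.6, 0.6, 2), 2, 2)      # current covariance (correlated coordinates)
lam   <- 8                                    # offspring per generation
U     <- chol(Cov)                            # Cov = t(U) %*% U
Z     <- matrix(rnorm(lam * 2), lam, 2)       # standard normal draws, one row per candidate
X     <- sweep(sigma * (Z %*% U), 2, m, "+")  # each row ~ N(m, sigma^2 * Cov)
fvals <- apply(X, 1, rastrigin)               # fitness enters only through ranks (invariance)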

Genetic Programming (symbolic regression)

GP evolves expressions (trees) instead of parameter vectors, discovering human‑readable formulas. We use gramEvol for grammar‑guided GP.

R
# --- Grammar-guided GP with gramEvol ---
# install.packages("gramEvol")
library(gramEvol)
set.seed(42)

# Synthetic regression data
n  <- 1000
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
x3 <- runif(n, -2, 2)
y  <- sin(x1) + 0.5 * x2^2 - 0.3 * x3 + rnorm(n, 0, 0.1)
data <- data.frame(x1, x2, x3, y)

# Grammar: expressions built from variables, ops and functions
ruleDef <- list(
  expr = grule(op(expr, expr), func(expr), var),
  op   = grule(`+`, `-`, `*`, `/`),
  func = grule(sin, cos, exp, log),
  var  = grule(x1, x2, x3)
)
G <- CreateGrammar(ruleDef)

# Cost: RMSE on a validation fold (GrammaticalEvolution minimizes this)
fitnessFunction <- function(expr) {
  # train/validation split (only the validation part is needed to score the expression)
  idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
  va  <- data[-idx, ]
  # Evaluate the candidate expression using the validation columns
  pred <- tryCatch(eval(expr, envir = va), error = function(e) rep(NA_real_, nrow(va)))
  if (length(pred) != nrow(va) || any(!is.finite(pred))) return(1e6)  # invalid expression -> large cost
  sqrt(mean((va$y - pred)^2))
}

GE <- GrammaticalEvolution(
  grammarDef = G,
  evalFunc = fitnessFunction,   # gramEvol minimizes this cost
  terminationCost = 0.15,       # stop once validation RMSE drops to about 0.15
  max.depth = 7, popSize = 80, iterations = 120
)

best_expr <- GE$best$expressions
cat("Best expression:\n", deparse(best_expr), "\n")

Why GP:
  • Interpretability: Produces closed‑form formulas; great for model cards and stakeholder communication.
  • Inductive bias: Grammars enforce valid forms (e.g., monotone terms, or excluding forbidden functions).
  • Use‑cases: symbolic regression, rule extraction, scorecard discovery, feature construction.
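
Assuming GE, best_expr and data from the block above, the evolved formula can be scored like any other model:

R
pred <- eval(best_expr, envir = data)   # evaluate the winning expression over the data frame's columns
sqrt(mean((data$y - pred)^2))           # in-sample RMSE of the discovered formula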

Advanced Notes & Practical Guidance

  • Constraint handling: Penalize infeasible solutions or repair (project back into the feasible set). For cardinality (a feature budget), use a ramped penalty or a hard cap (see the repair sketch after this list).
  • Nested CV: Outer CV for unbiased performance; inner EA for tuning. Report outer metrics.
  • Diversity control: Increase mutation, use crowding/fitness sharing if the population collapses.
  • Anytime optimization: Log and checkpoint best solutions; EAs degrade gracefully if interrupted.
  • Runtime economics: Evaluations × cost(fitness). Cache precomputations; use parallelism; stop when gains flatten.
  • Defaults that work: pop 50–100; GA pc≈0.8, pm≈0.05–0.15 (binary); DE F≈0.5–0.9, CR≈0.7–0.95; CMA‑ES initial σ roughly 20–30% of the search range (the example above uses 2.0 on [−5.12, 5.12]).
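
For the hard cap mentioned in the constraint-handling bullet, a repair step is often simpler than tuning a penalty. A minimal sketch against the binary feature-selection setup above; k_max, repair and fitness_capped are illustrative names of mine, not package functions.

R
# Repair: enforce a budget of at most k_max selected features before scoring
k_max  <- 10
repair <- function(bits) {
  idx <- which(bits == 1)
  if (length(idx) > k_max) bits[sample(idx, length(idx) - k_max)] <- 0   # drop random extras
  bits
}
fitness_capped <- function(bits) fitness_fs(repair(round(bits)))
# then: ga(type = "binary", fitness = fitness_capped, nBits = p, ...)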