Key Objectives
- Generate realistic financial and credit-related variables.
- Introduce logical correlations to reflect real-world credit risk modelling.
- Incorporate macroeconomic indicators to simulate external economic influences.
Core Business Variables
| Feature | Description |
|---|---|
| Revenue | Annual revenue (Log-normal distribution) |
| Time_in_Business | Years since business started (Exponential distribution) |
| Credit_Score | Business credit score (500–1000, Truncated Normal) |
| Loan_Amount | Loan requested ($1,000–$50,000, Gamma distribution) |
| Industry | Business sector (Retail, Manufacturing, etc.) |
| Business_Size | Small, Medium, or Large |
| Years_with_Bank | Years the business has been a bank client |
| Number_of_Employees | Headcount |
| Annual_Profit | Calculated as % of revenue |
Macroeconomic Indicators
| Feature | Description |
|---|---|
| Unemployment Rate (%) | Higher → More defaults |
| Inflation Rate (%) | Higher → Increased loan risk |
| GDP Growth Rate (%) | Negative growth → More defaults |
| Market Volatility Index (VIX) | Measures financial market uncertainty |
| Interest Rate (%) | Higher rates → More expensive loans |
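
The table above gives only the direction of each macro effect, and the listing that follows does not generate these indicators. One way they could be wired in is sketched below, as a multiplier on a baseline default probability. Every mean, spread, weight, and bound is an illustrative assumption, and `macro_risk_multiplier` is a hypothetical helper, not part of the original code. The sketch uses `np.random.default_rng` rather than the global seed so it does not disturb the main listing's random stream.

```python
import numpy as np

rng = np.random.default_rng(42)
num_samples = 10_000

# Hypothetical macro draws; all means, spreads, and bounds are assumptions.
unemployment = np.clip(rng.normal(5.5, 1.5, num_samples), 2.0, 15.0)    # %
inflation = np.clip(rng.normal(3.0, 1.0, num_samples), 0.0, 10.0)       # %
gdp_growth = rng.normal(2.0, 1.5, num_samples)                          # %, can be negative
vix = np.clip(rng.lognormal(np.log(18.0), 0.3, num_samples), 9.0, 80.0)
interest_rate = np.clip(rng.normal(4.0, 1.0, num_samples), 0.5, 12.0)   # %


def macro_risk_multiplier(unemployment, inflation, gdp_growth, vix, interest_rate):
    """Scale a baseline default probability by macro conditions (assumed weights)."""
    score = (
        0.05 * (unemployment - 5.5)             # higher unemployment -> more defaults
        + 0.03 * (inflation - 3.0)              # higher inflation -> more loan risk
        + 0.04 * np.maximum(-gdp_growth, 0.0)   # only negative growth adds risk
        + 0.01 * (vix - 18.0)                   # market uncertainty
        + 0.03 * (interest_rate - 4.0)          # more expensive loans
    )
    return np.clip(1.0 + score, 0.5, 2.0)


macro_factor = macro_risk_multiplier(
    unemployment, inflation, gdp_growth, vix, interest_rate
)
```

Multiplying the default probability by `macro_factor` before the final clip would let these indicators shift default rates in the directions listed in the table.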
Distributions & Relationships (Code)
Python
# Synthetic Credit Risk Dataset (reproducible, pure NumPy)
# --------------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(42)
num_samples = 10000
# Helper: truncated normal via rejection
def truncated_normal(mean, std, low, high, size):
    out = np.empty(size, dtype=float)
    filled = 0
    while filled < size:
        batch = np.random.normal(mean, std, size - filled)
        batch = batch[(batch >= low) & (batch <= high)]
        take = min(batch.size, size - filled)
        if take:
            out[filled:filled+take] = batch[:take]
            filled += take
    return out
# 1) Base features
mu, sigma = 10.0, 1.0
revenue = np.random.lognormal(mean=mu, sigma=sigma, size=num_samples) # right-skewed
credit_score = truncated_normal(750, 100, 500, 1000, num_samples) # bounded 500–1000
loan_amount = np.clip(np.random.gamma(5.0, 5000.0, num_samples), 1000, 50000) # right-skewed
industry = np.random.choice(["Retail","Manufacturing","Services","Hospitality"],
                            p=[0.4,0.25,0.25,0.1], size=num_samples)
business_size = np.random.choice(["Small","Medium","Large"], p=[0.6,0.3,0.1], size=num_samples)
years_in_business = np.clip(np.random.exponential(5.0, num_samples).astype(int), 1, 20)
years_with_bank = (years_in_business * np.random.uniform(0.2,0.8, num_samples)).astype(int)
num_employees = np.empty(num_samples, dtype=int)
mask_s = business_size == "Small"
mask_m = business_size == "Medium"
mask_l = business_size == "Large"
# randint's upper bound is exclusive: Small 1–19, Medium 20–99, Large 100–499
num_employees[mask_s] = np.random.randint(1, 20, mask_s.sum())
num_employees[mask_m] = np.random.randint(20, 100, mask_m.sum())
num_employees[mask_l] = np.random.randint(100, 500, mask_l.sum())
# 2) Logical correlations
size_mult = {"Small":1.0, "Medium":2.0, "Large":5.0}
revenue = revenue * np.vectorize(size_mult.get)(business_size)
annual_profit = np.clip(revenue * np.random.uniform(0.05,0.20, num_samples), 0, None)
credit_score = np.clip(
    credit_score + (revenue/1e5)*50 + (annual_profit/5e4)*30 + years_in_business*5, 500, 1000
)
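# Risk adjustment scales each amount by 0.9–1.0, so values can fall slightly below the $1,000 floor
loan_amount = loan_amount * (1.0 - 0.2 * ((1000.0 - credit_score)/1000.0)) # risk-adjusted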
# 3) Default probability & label
industry_risk = {"Retail":1.2, "Manufacturing":0.8, "Services":1.0, "Hospitality":1.3}
ind_factor = np.vectorize(industry_risk.get)(industry)
pd_base = 0.3 - (credit_score - 500.0)/2000.0 + (loan_amount - 1000.0)/100000.0
# Named default_prob rather than pd so the pandas alias is not shadowed
default_prob = np.clip(pd_base, 0.05, 0.5) * ind_factor
default_prob = np.clip(default_prob, 0.01, 0.9)
default = np.random.binomial(1, default_prob, num_samples)
# 4) DataFrame
df = pd.DataFrame({
    "Revenue": revenue,
    "Time_in_Business": years_in_business,
    "Credit_Score": credit_score,
    "Loan_Amount": loan_amount,
    "Industry": industry,
    "Default": default,
    "Business_Size": business_size,
    "Years_with_Bank": years_with_bank,
    "Number_of_Employees": num_employees,
    "Annual_Profit": annual_profit
})
# 5) Plotting: one figure per chart (no seaborn)
# Save figures as PNG; replace paths as needed
fig = plt.figure(); plt.hist(df["Revenue"], bins=100); plt.title("Revenue (log-normal)"); fig.savefig("revenue_hist.png"); plt.show()
fig = plt.figure(); plt.hist(df["Credit_Score"], bins=50); plt.title("Credit Score (trunc-normal + lift)"); fig.savefig("credit_score_hist.png"); plt.show()
fig = plt.figure(); plt.hist(df["Loan_Amount"], bins=60); plt.title("Loan Amount (gamma, adjusted)"); fig.savefig("loan_amount_hist.png"); plt.show()
fig = plt.figure(); plt.hist(df["Annual_Profit"], bins=80); plt.title("Annual Profit (5–20% margin)"); fig.savefig("annual_profit_hist.png"); plt.show()
# Relationships
fig = plt.figure(); plt.scatter(df["Revenue"], df["Credit_Score"], s=5, alpha=0.5); plt.title("Score vs Revenue"); fig.savefig("score_vs_revenue.png"); plt.show()
fig = plt.figure(); plt.scatter(df["Credit_Score"], df["Loan_Amount"], s=5, alpha=0.5); plt.title("Loan vs Score"); fig.savefig("loan_vs_score.png"); plt.show()
Why these distributions & plots?
- Revenue (log-normal): firm revenues grow through multiplicative processes and show heavy right tails; the log-normal captures many small firms and a few very large ones. The histogram should be strongly right-skewed.
- Credit Score (truncated normal): scores are bounded (500–1000 here). Truncation respects that business rule, and the base distribution centers near prime credit (mean 750) with mass trimmed at the edges. The lift from revenue, profit, and years in business is clipped to the same range, so expect some extra mass at the 1000 cap.
- Loan Amount (gamma): loan sizes are positive and right-skewed; a gamma matches the empirical pattern of many small/moderate loans and fewer big ones. Amounts are clipped to $1,000–$50,000 and then scaled down by up to 10% for lower scores.
- Years in Business (exponential): a survivorship-shaped curve with many young firms and a decaying tail of older survivors. The code floors values to whole years and clips them to 1–20.
- Annual Profit (proportional to revenue): simple margins (5–20%) create a realistic spread consistent with heterogeneous cost structures.
- Relationship plots: Score vs Revenue sanity-checks monotonicity (better performance → higher score). Loan vs Score verifies risk-based lending (larger loans concentrate at higher scores).
- Categorical checks (Industry mix & default rate by industry): ensure priors and relative risks behave as intended before modelling.
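
The categorical checks mentioned in the last bullet are not part of the plotting code, so here is a minimal sketch of what they might look like. It builds a small stand-in frame with the same `Industry` and `Default` columns. The industry mix and risk multipliers are taken from the listing, while the flat 10% base rate and the name `demo` are assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
industries = ["Retail", "Manufacturing", "Services", "Hospitality"]
mix = [0.4, 0.25, 0.25, 0.1]
industry_risk = {"Retail": 1.2, "Manufacturing": 0.8, "Services": 1.0, "Hospitality": 1.3}

# Stand-in data: a flat 10% base rate (assumption) scaled by the industry multiplier.
industry = rng.choice(industries, p=mix, size=n)
default_prob = 0.10 * pd.Series(industry).map(industry_risk).to_numpy()
demo = pd.DataFrame({"Industry": industry, "Default": rng.binomial(1, default_prob)})

# Check 1: observed industry shares should sit close to the sampling priors.
industry_share = demo["Industry"].value_counts(normalize=True)

# Check 2: default rates should be ordered like the risk multipliers.
default_rate = demo.groupby("Industry")["Default"].mean()

summary = pd.DataFrame({"share": industry_share, "default_rate": default_rate})
print(summary.round(3))
```

On the real dataset the same two lines (`value_counts(normalize=True)` and `groupby(...).mean()`) can be run against `df` directly.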
Notes
- Replace `dataset_path` with your own relative path if deploying on your site (e.g., `assets/data/synthetic_credit_data.csv`).
- Consider generating the dataset deterministically with a fixed seed and documenting correlations in a separate section.
- Add a Modeling section to demonstrate train/validation splits, calibration curves, and lift charts.
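
As a possible starting point for that Modeling section, the sketch below shows a shuffled train/validation split and a decile table that compares predicted and observed default rates and reports lift. It uses NumPy only. The stand-in `score` and `label` arrays and the helper `decile_table` are hypothetical; in practice the scores would come from a model fitted on the training rows.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Stand-in scores and labels (assumption): labels are drawn from the scores,
# so the scores are calibrated by construction.
score = rng.uniform(0.01, 0.5, n)
label = rng.binomial(1, score)

# Shuffled 80/20 train/validation split.
idx = rng.permutation(n)
cut = int(0.8 * n)
train_idx, valid_idx = idx[:cut], idx[cut:]


def decile_table(score, label, bins=10):
    """Return rows of (mean predicted, observed rate, lift), riskiest bin first."""
    order = np.argsort(score)[::-1]          # sort by descending score
    base_rate = label.mean()
    rows = []
    for group in np.array_split(order, bins):
        observed = label[group].mean()
        rows.append((score[group].mean(), observed, observed / base_rate))
    return np.array(rows)


table = decile_table(score[valid_idx], label[valid_idx])
for predicted, observed, lift in table:
    print(f"predicted={predicted:.3f}  observed={observed:.3f}  lift={lift:.2f}")
```

Plotting the first two columns against each other gives a simple calibration curve, and the third column is the data behind a lift chart.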