This is a hands-on machine learning tutorial for beginners using Python. You will engineer features, select the best columns, train sklearn models, and on the same churn problem, compare them to NucleusIQ (Direct mode few-shot + Standard mode agent with tools), so you see when to use classical ML vs an agent framework in 2026.
If you have not read our roadmap post yet, start with How to Start Machine Learning from Scratch in 2026 for the bigger picture (AI + ML combined learning). This tutorial is Week 3-4 of that path, the "build" track, where code and metrics finally click.
We align with techniques from our deeper dives (building logistic regression in Python, model evaluation metrics, and logistic regression applications) but keep the flow beginner-friendly and runnable in one afternoon.
TL;DR
- Same problem, three approaches: predict telecom customer churn using (A) sklearn, (B) NucleusIQ Direct (few-shot), (C) NucleusIQ Standard (agent + tools), so you see what each is good at.
- Track A (ML): feature engineering → feature selection → logistic regression + random forest → ROC-AUC on held-out test data.
- Track B (NucleusIQ Direct): few-shot churn classification via a Direct mode agent, with no raw OpenAI SDK.
- Track C (NucleusIQ Standard): Standard mode agent with @tool functions (score_churn, etc.): ML for scores, NucleusIQ for workflow.
- Metrics: accuracy is not enough; use classification report, confusion matrix, ROC-AUC (metrics guide).
- Clone the GitHub repo for runnable scripts (src/train.py, src/nucleusiq_few_shot.py, src/nucleusiq_agent_churn.py).
Python Libraries Used in This Tutorial (And Why)
| Library | Role in this project |
|---|---|
| pandas | Load CSV, clean text fields, one-hot encode, inspect churn rates |
| NumPy | Under the hood for sklearn; you rarely call it directly as a beginner |
| matplotlib / seaborn | EDA charts for stakeholders |
| scikit-learn | Train/test split, pipelines, models, metrics |
| joblib | Save/load trained pipelines |
| nucleusiq + nucleusiq-openai | Track B (Direct) and Track C (Standard + @tool) |
You do not need PyTorch, TensorFlow, or LangChain for this lab. Tracks B/C use NucleusIQ instead of hand-rolled OpenAI SDK calls; see our start-from-scratch 2026 guide.
What You Will Build
By the end of this tutorial you will have:
| Output | Description |
|---|---|
| Cleaned dataset | Numeric features ready for sklearn |
| Trained pipeline | Scaler + model in one object |
| Evaluation report | Precision, recall, F1, ROC-AUC on held-out test data |
| Saved model file | churn_model.joblib for inference |
| Feature selector | selector.joblib fit only on train data |
| NucleusIQ Direct few-shot benchmark | Same test rows, same metrics as ML |
| NucleusIQ Standard agent demo | @tool + retention workflow |
| README story | Side-by-side comparison table for your portfolio |
Business question (fixed for the whole tutorial): Will this customer churn (leave the service)?
That is binary classification, the same problem we solve in building logistic regression in Python and logistic regression applications. What changes is how you solve it, not what you solve.
One Problem, Three Approaches (Read This First)
Beginners in 2026 often ask: "Can I just use ChatGPT for churn instead of scikit-learn?"
Honest answer: You can try, but you must compare both on the same test customers with the same metrics. That is what this tutorial does.
| Approach | What it is | Strengths | Weaknesses |
|---|---|---|---|
| Track A: Classical ML | Features → algorithm → probability | Cheap at scale, reproducible, auditable metrics | Needs feature work; weak on raw unstructured text |
| Track B: NucleusIQ Direct | Few-shot in system prompt → single-pass classify | Fast to prototype; provider-portable | Costly per row at scale; evaluate on test set |
| Track C: NucleusIQ Standard | Agent + @tool → calls your sklearn model | Guardrails, tool loops, production patterns | Needs Track A model saved first |
                      Problem: Will customer churn?
                                   |
        +--------------------------+--------------------------+
        v                          v                          v
 Track A: sklearn      Track B: NucleusIQ Direct   Track C: NucleusIQ Standard
(engineer features)      (few-shot in prompt)         (@tool + ML model)
        |                          |                          |
        +--------------------------+--------------------------+
                                   v
              Same test set → precision / recall / F1
Tracks A and B are the learning core. Track C shows how production teams combine them using NucleusIQ, the same framework we use in our Direct mode and Standard mode with tools guides.
Prerequisites
From the 2026 ML roadmap:
- Python 3.10+ installed
- Basic comfort with variables, functions, and pip install
- 2-3 hours uninterrupted time
Helpful but optional: read AI vs ML vs DL vs Data Science if you confuse those terms.
Step 0: Environment Setup
Create a virtual environment and install dependencies:
python -m venv ml-tutorial-env
# Windows
ml-tutorial-env\Scripts\activate
# macOS/Linux
source ml-tutorial-env/bin/activate
pip install pandas numpy matplotlib seaborn scikit-learn joblib python-dotenv nucleusiq nucleusiq-openai
Verify sklearn:
import sklearn
print(sklearn.__version__) # e.g. 1.4+
Step 1: Load the Dataset
We use the Telco Customer Churn dataset, a public benchmark that matches the telecom examples in our multivariate logistic regression post.
Option A: OpenML (no manual download):
from sklearn.datasets import fetch_openml
# IBM Telco Customer Churn on OpenML (id=42178)
churn = fetch_openml(data_id=42178, as_frame=True, parser="auto")
df = churn.frame
target_col = "Churn" if "Churn" in df.columns else churn.target_names[0]
Option B: CSV from GitHub (recommended for the companion notebook):
Download telco_churn.csv from the repo linked at the end of this post, then:
import pandas as pd
df = pd.read_csv("data/telco_churn.csv")
target_col = "Churn"
Option C: Your own CSV from the IBM Telco churn Kaggle dataset; align column names with the preprocessing below.
Quick inspection:
print(df.shape)
print(df.head())
print(df[target_col].value_counts(normalize=True))
You will often see ~73% No churn / ~27% Yes churn. That class imbalance matters when you interpret accuracy later, which is exactly what we discuss in model evaluation metrics.
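To see why the imbalance matters, here is a minimal sketch in plain Python. The label counts are hypothetical and only mirror the rough 73/27 split mentioned above; the "model" simply predicts the majority class for everyone.

```python
# Hypothetical labels mirroring the ~73% / ~27% split (0 = no churn, 1 = churn).
labels = [0] * 73 + [1] * 27

# A do-nothing baseline: always predict the majority class (no churn).
predictions = [0] * len(labels)

# Accuracy: share of rows where the prediction equals the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall for churn: share of real churners the baseline flagged.
churners = sum(labels)
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = caught / churners

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The baseline scores 73% accuracy while catching no churners at all, so any model you train later should be judged against that number, not against 50%.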
The Machine Learning Workflow (Map This Tutorial)
Every beginner Python ML project follows the same spine, whether you are predicting churn, loan default, or email spam:
Business question → EDA → Feature engineering → Split → Feature selection
→ Train ML → Evaluate
→ (compare) Few-shot LLM on same test rows
→ (optional) Agent tools wrapping ML
This tutorial walks that spine end to end, then adds AI tracks on the same churn question. When you read our older building logistic regression article, you will recognize the same churn story: merge tables, dummy variables, train/test discipline. Here we use pandas + sklearn pipelines because that is what most teams hire for in 2026.
Parametric vs non-parametric (one sentence): logistic regression assumes a smooth linear boundary in feature space; random forests do not. Our parametric vs non-parametric guide goes deeper; read it after you finish this lab.
Step 2: Exploratory Data Analysis (EDA)
EDA is not optional decoration. It prevents training on broken data. See our EDA explainer for mindset; here is the minimum for this tutorial.
import matplotlib.pyplot as plt
import seaborn as sns
# Missing values
print(df.isnull().sum().sort_values(ascending=False).head(10))
# Numeric summary
print(df.describe())
# Churn rate by contract type (example business slice)
if "Contract" in df.columns:
    sns.countplot(data=df, x="Contract", hue=target_col)
    plt.title("Churn by contract type")
    plt.xticks(rotation=15)
    plt.tight_layout()
    plt.show()
Write down three observations in your notebook README, for example:
- Month-to-month contracts may show higher churn.
- TotalCharges sometimes loads as text because of empty strings.
- Class imbalance means accuracy alone is misleading.
These bullets become portfolio gold and mirror how we reason in production case studies like financial recommendation systems.
Track A: Classical Machine Learning (Steps 3-10)
Everything in Steps 3-10 is Track A. Do not skip to the LLM section until you have test metrics from sklearn; otherwise you cannot compare fairly.
Step 3: Data Cleaning
3.1 Drop ID columns
IDs do not generalize; they cause memorization.
drop_cols = [c for c in ["customerID", "customerId"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")
3.2 Fix TotalCharges
A common Telco issue: TotalCharges stored as strings.
if "TotalCharges" in df.columns:
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
3.3 Encode target + keep readable features for Track B/C
y = df[target_col].copy()
if y.dtype == object:
    y = y.map({"Yes": 1, "No": 0, "yes": 1, "no": 0})
y = y.astype(int)
# Human-readable rows for few-shot LLM + agent tools (before heavy encoding)
df_model = df.drop(columns=[target_col]).copy()
Step 4: Feature Engineering (Create Signal, Do Not Only Encode)
Feature engineering means building new columns from domain logic before modeling. This is where many beginners stop too early: they only one-hot encode and wonder why ROC-AUC is flat.
Our dedicated post Feature Engineering Techniques for Machine Learning goes deeper; here is a Telco churn starter set:
import numpy as np
fe = df_model.copy()
# Numeric fixes (if not done in Step 3)
if "TotalCharges" in fe.columns:
    fe["TotalCharges"] = pd.to_numeric(fe["TotalCharges"], errors="coerce")
    fe["TotalCharges"] = fe["TotalCharges"].fillna(fe["TotalCharges"].median())
# --- Engineered features ---
if "tenure" in fe.columns:
    fe["tenure_bucket"] = pd.cut(
        fe["tenure"],
        bins=[-1, 12, 36, 72, 999],
        labels=["0-12m", "13-36m", "37-72m", "72m+"],
    )
if "MonthlyCharges" in fe.columns and "tenure" in fe.columns:
    # clip(lower=1) avoids dividing by zero for brand-new customers
    fe["avg_charge_per_tenure_month"] = fe["MonthlyCharges"] / (fe["tenure"].clip(lower=1))
# Count how many add-on services are active (Yes = 1)
service_cols = [c for c in fe.columns if fe[c].dtype == object and fe[c].isin(["Yes", "No"]).any()]
for c in service_cols:
    # Compare instead of map(): values such as "No internet service" become 0, not NaN
    fe[c] = (fe[c] == "Yes").astype(int)
if service_cols:
    fe["num_active_services"] = fe[service_cols].sum(axis=1)
# Contract risk flag (business intuition from EDA)
if "Contract" in fe.columns:
    fe["is_month_to_month"] = (fe["Contract"] == "Month-to-month").astype(int)
print(fe[["tenure", "avg_charge_per_tenure_month", "num_active_services", "is_month_to_month"]].head())
Why these features help
- tenure_bucket: churn risk often spikes early in tenure.
- avg_charge_per_tenure_month: separates heavy spenders from long-tenure low payers.
- num_active_services: engagement proxy.
- is_month_to_month: aligns with contract plots from EDA and our multicollinearity / regression discussions when features overlap.
Rule: every engineered feature must be computable at inference time from data you will actually have before churn happens, with no future information.
4.1 One-hot encode categoricals (after engineering)
X = pd.get_dummies(fe, drop_first=True)
print("Feature count after engineering + encoding:", X.shape[1])
Leakage check: drop post-churn columns. When in doubt, remove suspicious fields.
4.2 Why we use drop_first=True in get_dummies
With one-hot encoding, the last category can be inferred from the others (dummy variable trap). drop_first=True avoids perfect multicollinearity, which relates to the issues we discuss in multicollinearity in regression. For random forest it matters less; for logistic regression it keeps coefficients more stable.
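The redundancy behind the dummy variable trap can be checked by hand. The sketch below uses plain Python lists rather than pandas, with a handful of hypothetical contract values; the one_hot helper is illustrative and not part of any library.

```python
# Category names match the Telco "Contract" column; the sample rows are hypothetical.
categories = ["Month-to-month", "One year", "Two year"]
contracts = ["One year", "Month-to-month", "Two year", "Month-to-month"]


def one_hot(value, cats):
    """Return a full one-hot row: one 0/1 entry per category."""
    return [1 if value == c else 0 for c in cats]


full = [one_hot(v, categories) for v in contracts]

# Dropping the first column mimics what drop_first=True does.
reduced = [row[1:] for row in full]

# The dropped column is fully determined by the ones that remain.
recovered_first = [1 - sum(row) for row in reduced]
original_first = [row[0] for row in full]

print(recovered_first)
print(original_first)
```

Because the dropped column can be rebuilt exactly from the others, keeping it adds no information and only makes the linear model's coefficients ambiguous.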
4.3 Keep a list of training columns for production
import json
import os
training_columns = list(X.columns)
os.makedirs("models", exist_ok=True)  # create the output folder if it does not exist yet
with open("models/training_columns.json", "w") as f:
    json.dump(training_columns, f)
At inference time, align new data to these columns (missing columns → 0). This prevents the "works in notebook, breaks in API" failure mode.
Step 5: Train / Test Split (Before Feature Selection)
Split after engineering, before selecting features, so the selector never sees test labels.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)
# Human-readable rows for Track B/C (same indices)
train_idx, test_idx = X_train.index, X_test.index
df_train_readable = df_model.loc[train_idx]
df_test_readable = df_model.loc[test_idx]
print("Train size:", len(X_train), "Test size:", len(X_test))
stratify=y keeps the churn ratio stable, which is critical for imbalanced data (evaluation metrics post).
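What stratification does can be sketched in a few lines of plain Python. This is an illustration of the idea, not scikit-learn's implementation; stratified_split and churn_rate are hypothetical helper names, and the labels are synthetic.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within the class
        n_test = round(len(idxs) * test_size)   # same fraction taken from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def churn_rate(labels, idx):
    """Share of churners among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)


# Synthetic labels: 270 churners and 730 non-churners.
labels = [1] * 270 + [0] * 730
train_idx, test_idx = stratified_split(labels)
print(round(churn_rate(labels, train_idx), 3), round(churn_rate(labels, test_idx), 3))
```

Both parts end up with the same churn rate as the full dataset, which is the property that keeps test metrics comparable to training conditions.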
Step 6: Feature Selection (Reduce Noise, Fit on Train Only)
With 50-80 columns after get_dummies, many are weak or redundant. Feature selection picks a subset that helps generalization.
Important: fit the selector on X_train only, then transform X_test.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import joblib
K = 35 # tune: try 20, 35, 50 and compare CV ROC-AUC
selector = SelectKBest(score_func=mutual_info_classif, k=min(K, X_train.shape[1]))
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
selected_mask = selector.get_support()
selected_features = X_train.columns[selected_mask].tolist()
print("Selected features:", len(selected_features))
print(selected_features[:10], "...")
joblib.dump(selector, "models/feature_selector.joblib")
What mutual_info_classif does: scores each feature by how much it reduces uncertainty about churn, which is useful for non-linear relationships before tree models.
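For intuition, mutual information between two discrete variables can be computed directly from counts. The sketch below is the textbook definition in plain Python, not scikit-learn's estimator (which also handles continuous features); the toy data and the mutual_information helper are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in count_xy.items():
        joint = count / n
        # Compare the observed joint frequency with what independence would predict.
        mi += joint * math.log(joint / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# Toy data: one feature copies the label, the other is unrelated to it.
churn = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 1, 0, 0, 0, 0]
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]

mi_informative = mutual_information(informative, churn)
mi_unrelated = mutual_information(unrelated, churn)
print(round(mi_informative, 4), round(mi_unrelated, 4))
```

A feature that fully determines churn scores the maximum possible value for this label distribution, while an unrelated feature scores zero, which is the ordering SelectKBest relies on.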
Alternative selectors (try in exercises):
| Method | When to use |
|---|---|
| SelectKBest(f_classif) | Fast linear screening |
| RFECV with logistic regression | Smaller, interpretable set; slower |
| Tree feature_importances_ | After RF train; good for reporting, not always for leakage-safe selection |
Convert back to DataFrames for readable column names:
X_train_sel = pd.DataFrame(X_train_sel, columns=selected_features, index=X_train.index)
X_test_sel = pd.DataFrame(X_test_sel, columns=selected_features, index=X_test.index)
Step 7: Build a sklearn Pipeline (Best Practice)
A pipeline chains preprocessing and model so you do not accidentally fit the scaler on test data (a classic beginner bug).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
logistic_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
])
class_weight="balanced" gives more attention to the minority churn class, which is useful before you tune thresholds.
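The weights behind the "balanced" setting follow the heuristic n_samples / (n_classes * count_of_class) described in the scikit-learn documentation. Here is a small sketch of that arithmetic in plain Python; the label counts are hypothetical and balanced_class_weights is an illustrative helper, not a library function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of rows in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels with a 73/27 split (0 = no churn, 1 = churn).
labels = [0] * 730 + [1] * 270
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Each churn row counts for more than each non-churn row, so both classes contribute the same total weight to the loss.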
Train:
logistic_pipe.fit(X_train_sel, y_train)
Train on X_train_sel, not the full wide matrix; this is the pipeline you compare against LLM approaches.
7.1 What logistic regression is doing (intuition)
Logistic regression outputs a probability between 0 and 1 using the sigmoid function. If you want the math story (sigmoid, log-odds, likelihood), read Logistic Regression for Machine Learning using Python and cost function in logistic regression. For this tutorial, remember: coefficients tell direction, probabilities tell risk, and thresholds turn risk into yes/no decisions.
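The chain from coefficients to a decision can be traced by hand. In this sketch the intercept, coefficients, and customer values are all made up for illustration; they are not taken from a fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real-valued log-odds score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters (not from the Telco model).
intercept = -1.0
coefficients = {"is_month_to_month": 1.2, "tenure_scaled": -0.8}

# Hypothetical customer: month-to-month contract, tenure below average after scaling.
customer = {"is_month_to_month": 1, "tenure_scaled": -0.5}

# Coefficients tell direction: positive terms push the log-odds up.
log_odds = intercept + sum(coefficients[k] * customer[k] for k in coefficients)

# Probabilities tell risk.
probability = sigmoid(log_odds)

# Thresholds turn risk into a yes/no decision.
label = "Yes" if probability >= 0.5 else "No"
print(f"log_odds={log_odds:.2f} probability={probability:.3f} label={label}")
```

The positive contract coefficient raises the risk and the negative tenure coefficient lowers it for long-tenure customers, which is the kind of reading that makes logistic regression easy to explain.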
Step 8: Predict and Evaluate, Track A (Do Not Skip)
8.1 Class predictions and probabilities
y_pred = logistic_pipe.predict(X_test_sel)
y_proba = logistic_pipe.predict_proba(X_test_sel)[:, 1]
8.2 Classification report
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
print(classification_report(y_test, y_pred, target_names=["No Churn", "Churn"]))
print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))
How to read this (short version):
- Precision (Churn): of customers you flagged as churn, how many actually churned?
- Recall (Churn): of all real churners, how many did you catch? (Sensitivity in our metrics blog)
- F1: balance when classes are imbalanced
- ROC-AUC: ranking quality across thresholds
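"Ranking quality" has a concrete meaning: ROC-AUC equals the share of (churner, non-churner) pairs in which the churner receives the higher score. The brute-force sketch below shows that reading on toy data; roc_auc_pairwise is an illustrative helper and is far slower than sklearn's roc_auc_score on real datasets.

```python
def roc_auc_pairwise(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted churn probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]

auc = roc_auc_pairwise(y_true, scores)
perfect = roc_auc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(auc, 4), perfect)
```

Five of the six pairs are ordered correctly in the toy data, and no threshold was needed to compute the score, which is why ROC-AUC complements the threshold-dependent metrics above.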
8.3 Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN, FP],
# [FN, TP]]
Map this to TP, TN, FP, FN exactly as in Fig-1 of our model evaluation metrics post. If your recall for churn is low, the model is missing customers who leave, which is often more expensive than a false alarm.
8.4 Accuracy (with caution)
from sklearn.metrics import accuracy_score
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
If accuracy is ~80% but recall for churn is ~55%, you have a model that looks "fine" but fails the business goal. That is why we teach metrics beyond accuracy.
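That situation is easy to reproduce from a confusion matrix. The counts below are hypothetical, chosen so that accuracy lands at 80% while churn recall is only 55%; metrics_from_confusion is an illustrative helper that mirrors the standard definitions.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 for the positive (churn) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 2,000 customers, 540 of whom churned.
m = metrics_from_confusion(tn=1303, fp=157, fn=243, tp=297)
print({name: round(value, 3) for name, value in m.items()})
```

The model is right four times out of five overall, yet 243 of the 540 churners slip through, which is the number a retention team actually cares about.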
Step 9: Compare a Second Model (Random Forest)
Logistic regression is a strong interpretable baseline. Random Forest often improves tabular performance with non-linear patterns.
from sklearn.ensemble import RandomForestClassifier
rf_pipe = Pipeline([
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=12,
        class_weight="balanced",
        random_state=42,
        n_jobs=-1,
    )),
])
rf_pipe.fit(X_train_sel, y_train)
rf_pred = rf_pipe.predict(X_test_sel)
rf_proba = rf_pipe.predict_proba(X_test_sel)[:, 1]
print("=== Random Forest ===")
print(classification_report(y_test, rf_pred, target_names=["No Churn", "Churn"]))
print("ROC-AUC:", round(roc_auc_score(y_test, rf_proba), 4))
Compare in a table (fill with your numbers):
| Model | Accuracy | Churn Recall | Churn F1 | ROC-AUC |
|---|---|---|---|---|
| Logistic Regression | ? | ? | ? | ? |
| Random Forest | ? | ? | ? | ? |
Pick the model that matches your business cost. If missing churn is expensive, optimize recall (possibly lower the probability threshold), the same tradeoff we illustrate with ROC curves in our metrics article.
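One way to make "business cost" concrete is to attach a price to each error type and pick the threshold with the lowest total. The costs, labels, and probabilities below are all hypothetical; in practice you would estimate the costs with the retention team and, ideally, choose the threshold on validation data rather than the final test set.

```python
def expected_cost(y_true, probs, threshold, cost_fn, cost_fp):
    """Total cost of missed churners (FN) and unnecessary offers (FP) at one threshold."""
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    return fn * cost_fn + fp * cost_fp


# Toy labels and predicted churn probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.8, 0.45, 0.35, 0.6, 0.3, 0.25, 0.1, 0.05]

COST_FN = 500  # hypothetical: revenue lost when a churner is missed
COST_FP = 50   # hypothetical: discount wasted on a loyal customer

thresholds = [0.2, 0.3, 0.4, 0.5, 0.6]
costs = {t: expected_cost(y_true, probs, t, COST_FN, COST_FP) for t in thresholds}
best_threshold = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold={t:.1f} cost={c}")
print("best threshold:", best_threshold)
```

Because a missed churner is priced ten times higher than a false alarm here, the cheapest threshold sits well below the default 0.5.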
9.1 Feature importance (Random Forest)
import pandas as pd
rf_model = rf_pipe.named_steps["model"]
importances = pd.Series(rf_model.feature_importances_, index=X_train_sel.columns)
print(importances.sort_values(ascending=False).head(15))
Use this to sanity-check the model: if customerID sneaks in as top feature, your pipeline has a bug. If is_month_to_month (or one of the Contract dummy columns) ranks high, that matches business intuition from EDA, which is a good sign.
9.2 Cross-validation (more honest than one lucky split)
A single train/test split can flatter or punish you by accident. Cross-validation trains on several folds and averages scores:
from sklearn.model_selection import cross_val_score
# Ideally the selector sits inside the CV pipeline; here we run CV on the already-selected train matrix for simplicity
cv_scores = cross_val_score(
    rf_pipe, X_train_sel, y_train, cv=5, scoring="roc_auc", n_jobs=-1
)
print("CV ROC-AUC mean:", cv_scores.mean().round(4))
print("CV ROC-AUC std:", cv_scores.std().round(4))
Report mean ± std in your README. Low std means stable; high std means you need more data or simpler models.
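For a stricter estimate, the selector can be placed inside the pipeline so that each fold re-selects features from its own training part. The sketch below assumes scikit-learn is installed and runs on synthetic data from make_classification instead of the Telco matrix; it swaps in f_classif for speed and determinism, and the names X_demo, y_demo, and leak_free_pipe are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn matrix (roughly 73% / 27% class split).
X_demo, y_demo = make_classification(
    n_samples=400,
    n_features=20,
    n_informative=5,
    weights=[0.73, 0.27],
    random_state=42,
)

# Selection, scaling, and the model are all refit inside every CV fold.
leak_free_pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=8)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

cv_scores_demo = cross_val_score(
    leak_free_pipe, X_demo, y_demo, cv=5, scoring="roc_auc"
)
print(f"CV ROC-AUC: {cv_scores_demo.mean():.3f} ± {cv_scores_demo.std():.3f}")
```

With this layout no fold's validation rows influence which features are kept, so the reported mean is a more honest preview of held-out performance.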
Step 10: Tune the Decision Threshold (Optional but Valuable)
Default threshold is 0.5. For churn, you may want 0.3 or 0.4 to catch more leavers.
import numpy as np
from sklearn.metrics import recall_score, precision_score
thresholds = np.arange(0.2, 0.6, 0.05)
for t in thresholds:
    # Flag a customer as churn when the predicted probability reaches the threshold
    preds_t = (y_proba >= t).astype(int)
    print(
        f"threshold={t:.2f} "
        f"precision={precision_score(y_test, preds_t, zero_division=0):.3f} "
        f"recall={recall_score(y_test, preds_t, zero_division=0):.3f}"
    )
This connects directly to the cutoff analysis section in our evaluation metrics post, where we sweep probability cutoffs for telecom churn.
Step 11: Save the Model for Reuse
import joblib
best_model = rf_pipe  # or logistic_pipe, whichever you chose
joblib.dump(best_model, "models/churn_model.joblib")
Load later:
loaded = joblib.load("models/churn_model.joblib")
sample_pred = loaded.predict(X_test_sel.iloc[:5])  # the model was trained on the selected features
Step 12: Inference on New Customers (Track A)
new_customer = X_test_sel.iloc[[0]] # replace with real preprocessed row
prob_churn = loaded.predict_proba(new_customer)[0, 1]
print(f"Churn probability: {prob_churn:.2%}")
In production you would:
- Apply the same encoding steps (same dummy columns; missing columns = 0).
- Log predictions and outcomes.
- Retrain on a schedule.
Aligning raw rows to training columns
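def align_features(raw_df: pd.DataFrame, training_cols: list) -> pd.DataFrame:
    # raw_df should already contain the Step 4 engineered columns.
    # No drop_first here: on a single row it would drop the only category present.
    # reindex() discards columns the model never saw and fills missing ones with 0.
    encoded = pd.get_dummies(raw_df)
    aligned = encoded.reindex(columns=training_cols, fill_value=0)
    return aligned
# Example: align to the columns the saved model was trained on (selected_features from Step 6)
# aligned = align_features(new_raw, selected_features)
# prob = loaded.predict_proba(aligned)[0, 1]
This function is the bridge between "data science notebook" and "backend API", and it is worth committing to your GitHub repo under src/preprocess.py.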
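One pitfall worth checking when you align a single raw row: get_dummies with drop_first=True drops the first category it sees, and on a one-row frame that is the only category. The sketch below assumes pandas is installed and uses a tiny hypothetical frame with two Telco-style columns to compare both variants.

```python
import pandas as pd

# Hypothetical training data with two Telco-style columns.
train_raw = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [70.0, 55.5, 40.0, 60.0],
})
training_cols = list(pd.get_dummies(train_raw, drop_first=True).columns)

# A single new customer on a one-year contract.
new_row = pd.DataFrame({"Contract": ["One year"], "MonthlyCharges": [58.0]})

# drop_first=True on one row removes its only category, so the signal is lost.
lossy = pd.get_dummies(new_row, drop_first=True).reindex(
    columns=training_cols, fill_value=0
)

# Encoding every category and letting reindex pick the training columns keeps it.
safe = pd.get_dummies(new_row).reindex(columns=training_cols, fill_value=0)

lossy_flag = int(lossy["Contract_One year"].iloc[0])
safe_flag = int(safe["Contract_One year"].iloc[0])
print(training_cols)
print("lossy:", lossy_flag, "safe:", safe_flag)
```

The reindex step already drops the baseline category column because it is not in the training list, so skipping drop_first at inference time does not change the feature layout the model expects.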
Plot ROC Curve (Visual Evaluation)
Numbers are mandatory; plots help you communicate to non-technical stakeholders.
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("Logistic Regression: ROC Curve")
plt.show()
Compare both models on one chart:
rf_proba = rf_pipe.predict_proba(X_test_sel)[:, 1]  # same selected features used for training
fig, ax = plt.subplots()  # one shared axis so both curves land on the same chart
RocCurveDisplay.from_predictions(y_test, rf_proba, name="Random Forest", ax=ax)
RocCurveDisplay.from_predictions(y_test, y_proba, name="Logistic Regression", ax=ax)
plt.legend()
plt.show()
Tie this chart to the ROC explanation in our model evaluation metrics post, especially the tradeoff between sensitivity and false positive rate.
Track B: Few-Shot Churn With NucleusIQ Direct Mode
Track B solves churn without sklearn training, using only labeled examples in the prompt. Instead of wiring the raw OpenAI SDK, we use NucleusIQ Direct mode: one agent, one pass, low overhead, exactly what we document in the Direct mode beginner guide.
Why NucleusIQ here (vs raw API)?
- Same provider portability as the rest of our stack (BaseOpenAI today, BaseGemini tomorrow).
- Usage tracking on agent.last_usage (tokens per task).
- Plugins later (ModelCallLimitPlugin when you batch-score many customers).
- Same Agent + Task API you will reuse in Track C; only the execution mode changes.
B.1 Install NucleusIQ + provider
pip install nucleusiq nucleusiq-openai python-dotenv
# .env (never commit)
OPENAI_API_KEY=sk-...
B.2 Shared helpers (same as Track A)
import asyncio
import json
import pandas as pd
from dotenv import load_dotenv
load_dotenv()
def row_to_customer_dict(row: pd.Series) -> dict:
    return {
        k: (None if pd.isna(v) else v)
        for k, v in row.to_dict().items()
        if k not in ("customerID", "customerId")
    }
def build_few_shot_examples(df_readable, y_series, n_per_class=3):
    examples = []
    for label, name in [(1, "Yes"), (0, "No")]:
        for i in y_series[y_series == label].index[:n_per_class]:
            examples.append({
                "customer": row_to_customer_dict(df_readable.loc[i]),
                "churn": name,
            })
    return examples
def parse_churn_label(text: str) -> str:
    t = (text or "").strip().lower()
    if t.startswith("yes"):
        return "Yes"
    if t.startswith("no"):
        return "No"
    return "No"  # conservative default for parsing
FEW_SHOT = build_few_shot_examples(df_train_readable, y_train, n_per_class=3)
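A silent "No" default can hide replies the parser did not understand. As a hedged alternative, the sketch below returns None for anything that does not start with a whole-word Yes or No, so you can count unparseable replies during evaluation. parse_churn_label_strict is a hypothetical helper and the sample replies are made up.

```python
import re
from typing import Optional


def parse_churn_label_strict(text: str) -> Optional[str]:
    """Return "Yes" or "No" when the reply starts with that whole word, else None."""
    match = re.match(r"\s*(yes|no)\b", text or "", flags=re.IGNORECASE)
    if match is None:
        return None
    return "Yes" if match.group(1).lower() == "yes" else "No"


# Made-up model replies, including a few the parser should refuse to guess on.
replies = ["Yes", " no.", "Notable risk", "I think yes", ""]
parsed = [parse_churn_label_strict(r) for r in replies]
unparsed = sum(1 for p in parsed if p is None)

print(parsed)
print("unparsed replies:", unparsed)
```

Reporting the unparsed count next to precision and recall makes it clear how much of the LLM's score depends on formatting rather than on judgment.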
B.3 Build the few-shot system prompt
def few_shot_system_block(examples: list) -> str:
    shots = "\n\n".join(
        f"Customer: {json.dumps(ex['customer'])}\nChurn: {ex['churn']}"
        for ex in examples
    )
    return f"""You are a telecom churn analyst.
Study these labeled examples:
{shots}
Rules:
- For each new customer, reply with exactly one word: Yes or No.
- Use only the fields in the customer JSON.
- Do not explain unless asked."""
FEW_SHOT_SYSTEM = few_shot_system_block(FEW_SHOT)
B.4 Create a NucleusIQ Direct mode classifier agent
from nucleusiq.agents import Agent
from nucleusiq.agents.config import AgentConfig, ExecutionMode
from nucleusiq.agents.task import Task
from nucleusiq.prompts.zero_shot import ZeroShotPrompt
from nucleusiq.plugins.builtin.model_call_limit import ModelCallLimitPlugin
from nucleusiq_openai import BaseOpenAI
def create_few_shot_churn_agent() -> Agent:
    return Agent(
        name="churn-few-shot",
        role="Churn classifier",
        objective="Classify telecom churn from customer profiles",
        prompt=ZeroShotPrompt().configure(
            system=FEW_SHOT_SYSTEM,
            user="Classify the customer in the task. Reply Yes or No only.",
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        config=AgentConfig(execution_mode=ExecutionMode.DIRECT),
        plugins=[ModelCallLimitPlugin(max_calls=3)],  # safety for batch loops
    )
few_shot_agent = create_few_shot_churn_agent()
ExecutionMode.DIRECT = fast single-pass classification with no tool loop. See choosing agent modes when you outgrow this.
B.5 Classify one customer (async)
async def predict_churn_nucleusiq(agent: Agent, customer: dict) -> str:
    await agent.initialize()
    result = await agent.execute(
        Task(
            id="churn-cls",
            objective=f"Customer JSON:\n{json.dumps(customer)}",
        )
    )
    return parse_churn_label(str(result.output))
# Example
sample = row_to_customer_dict(df_test_readable.iloc[0])
label = asyncio.run(predict_churn_nucleusiq(few_shot_agent, sample))
print("Predicted churn:", label)
print("Tokens:", few_shot_agent.last_usage.total.total_tokens)
B.6 Batch evaluate on the same test set as Track A
from sklearn.metrics import classification_report, accuracy_score
EVAL_N = 80
test_slice = df_test_readable.iloc[:EVAL_N]
y_true = y_test.iloc[:EVAL_N]
async def eval_few_shot_track():
    agent = create_few_shot_churn_agent()
    await agent.initialize()
    preds = []
    for _, row in test_slice.iterrows():
        customer = row_to_customer_dict(row)
        result = await agent.execute(
            Task(
                id=f"churn-{len(preds)}",
                objective=f"Customer JSON:\n{json.dumps(customer)}",
            )
        )
        preds.append(1 if parse_churn_label(str(result.output)) == "Yes" else 0)
    return preds
llm_preds = asyncio.run(eval_few_shot_track())
print("=== Track B: NucleusIQ Direct (few-shot) ===")
print(classification_report(y_true, llm_preds, target_names=["No Churn", "Churn"]))
print("Accuracy:", accuracy_score(y_true, llm_preds))
What you will often see
- Few-shot via Direct mode can look reasonable on easy rows but miss imbalance nuance unless examples are balanced.
- agent.last_usage helps you estimate cost per 1,000 classifications, which is still far more than sklearn at scale.
- You still need classification_report on held-out data; NucleusIQ does not replace evaluation.
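The cost estimate itself is simple arithmetic once you have token counts per call. In the sketch below every number is a placeholder: the token counts stand in for what you would read from agent.last_usage, and the prices are illustrative values, not any provider's actual pricing.

```python
def cost_per_1000(tokens_in, tokens_out, price_in_per_million, price_out_per_million):
    """Estimated cost of 1,000 classifications, given per-call token counts and prices."""
    per_call = (
        tokens_in * price_in_per_million + tokens_out * price_out_per_million
    ) / 1_000_000
    return per_call * 1000


# Placeholder inputs: a long few-shot prompt and a one-word answer.
estimate = cost_per_1000(
    tokens_in=1200,              # hypothetical prompt tokens per customer
    tokens_out=2,                # hypothetical completion tokens ("Yes" / "No")
    price_in_per_million=0.15,   # illustrative price, check your provider
    price_out_per_million=0.60,  # illustrative price, check your provider
)
print(f"Estimated cost per 1,000 classifications: ${estimate:.4f}")
```

Most of the cost comes from the few-shot prompt being resent for every customer, so trimming examples or fields is usually the first lever to pull.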
Track C: Churn Agent With NucleusIQ Standard Mode + Tools
Track C is the production pattern: your Track A model becomes a tool; NucleusIQ Standard mode runs the tool loop and writes the retention narrative. This matches our Standard mode with tools and plugins guides.
Prerequisite: complete Track A and save models/churn_model.joblib + selected_features.
C.1 Load ML artifacts (Track A output)
import joblib
loaded_model = joblib.load("models/churn_model.joblib")
# selected_features from Step 6: list of column names after SelectKBest
C.2 Register tools with @tool
NucleusIQ turns Python functions into agent tools automatically (schemas from type hints + docstrings):
from nucleusiq.tools.decorators import tool
def _score_churn_core(customer_json: str) -> str:
    """Shared scoring logic for tools and offline evaluation."""
    # A one-row DataFrame keeps numeric fields numeric (a transposed Series would be all object dtype)
    frame = pd.DataFrame([json.loads(customer_json)])
    # NOTE: apply the same Step 4 feature engineering to frame here; otherwise the
    # engineered columns are filled with 0 and scores will not match Track A.
    # No drop_first on a single row: it would drop the only category present.
    encoded = pd.get_dummies(frame)
    aligned = encoded.reindex(columns=selected_features, fill_value=0)
    proba = float(loaded_model.predict_proba(aligned)[0, 1])
    return json.dumps({
        "churn_probability": round(proba, 4),
        "churn_label": "Yes" if proba >= 0.4 else "No",
        "threshold": 0.4,
    })
@tool
def score_churn(customer_json: str) -> str:
    """Score churn probability using the trained sklearn model from Track A.
    Args:
        customer_json: JSON object of customer fields (same schema as training rows).
    """
    return _score_churn_core(customer_json)
@tool
def get_customer_profile(customer_json: str) -> str:
    """Return the customer profile JSON for display (no scoring).
    Args:
        customer_json: JSON object of customer fields.
    """
    return customer_json
The LLM should call score_churn for numbers and never invent probabilities in free text.
C.3 Build the Standard mode retention agent
from nucleusiq.plugins.builtin.model_call_limit import ModelCallLimitPlugin
from nucleusiq.plugins.builtin.tool_call_limit import ToolCallLimitPlugin
def create_churn_retention_agent() -> Agent:
    return Agent(
        name="churn-retention-agent",
        role="Retention analyst",
        objective="Assess churn and recommend one action",
        prompt=ZeroShotPrompt().configure(
            system=(
                "You help telecom retention teams. "
                "Always call score_churn before recommending action. "
                "Never guess churn probability; use the tool result only. "
                "Give one concrete retention step."
            ),
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        tools=[score_churn, get_customer_profile],
        config=AgentConfig(execution_mode=ExecutionMode.STANDARD),
        plugins=[
            ModelCallLimitPlugin(max_calls=8),
            ToolCallLimitPlugin(max_calls=5),
        ],
    )
ExecutionMode.STANDARD enables the multi-step tool loop: the agent can call score_churn, read the JSON, then respond. For high-stakes review workflows, you could move to Autonomous mode later; churn triage usually starts at Standard.
C.4 Run the agent for one customer
async def run_retention_agent(customer_row: pd.Series) -> str:
    agent = create_churn_retention_agent()
    await agent.initialize()
    profile = row_to_customer_dict(customer_row)
    result = await agent.execute(
        Task(
            id="retention-1",
            objective=(
                "Assess churn risk and suggest one retention action.\n"
                f"Customer profile JSON:\n{json.dumps(profile)}"
            ),
        )
    )
    print(f"Tool calls: {result.tool_call_count}")
    print(f"Tokens: {agent.last_usage.total.total_tokens}")
    return str(result.output)
print(asyncio.run(run_retention_agent(df_test_readable.iloc[0])))
C.5 Evaluate ML tool scores vs Track A (apples to apples)
The agent's marketing text is not your classification metric; the score_churn tool output is. For a fair comparison, call the same tool logic on test rows (or parse tool results from traces):
def score_churn_direct(customer_row: pd.Series) -> float:
    payload = json.loads(_score_churn_core(json.dumps(row_to_customer_dict(customer_row))))
    return payload["churn_probability"]
agent_probs = [
    score_churn_direct(df_test_readable.loc[i])
    for i in df_test_readable.iloc[:EVAL_N].index
]
from sklearn.metrics import roc_auc_score
print("Track C tool ROC-AUC (should match Track A):",
      roc_auc_score(y_true, agent_probs))
If ROC-AUC matches the Track A score computed on the same EVAL_N rows, your tool wiring is correct: the agent layer adds workflow, not a new model.
C.6 Optional: stream tool visibility
For demos and debugging, use execute_stream() as in our Standard mode guide:
from nucleusiq.streaming.events import StreamEventType
async def stream_retention(customer_row: pd.Series):
    agent = create_churn_retention_agent()
    await agent.initialize()
    profile = row_to_customer_dict(customer_row)
    async for event in agent.execute_stream(
        Task(id="ret-s", objective=f"Score and recommend:\n{json.dumps(profile)}")
    ):
        if event.type == StreamEventType.TOOL_CALL_START:
            print("Tool:", event.data.get("tool_name"))
        elif event.type == StreamEventType.TOKEN:
            print(event.data.get("content", ""), end="", flush=True)
NucleusIQ Track Picker (Quick Reference)
| Goal | NucleusIQ mode | This tutorial |
|---|---|---|
| Few-shot classify one JSON profile | Direct | Track B |
| Tool loop + ML model + narrative | Standard | Track C |
| Verify + revise high-stakes analysis | Autonomous | Not needed for first churn lab |
Docs: NucleusIQ docs · GitHub
Head-to-Head: sklearn vs NucleusIQ Direct vs NucleusIQ Standard (Same Test Customers)
Fill this table after running all tracks on the same EVAL_N rows:
| Criterion | Track A: sklearn | Track B: NucleusIQ Direct | Track C: NucleusIQ Standard |
|---|---|---|---|
| Framework | scikit-learn | NucleusIQ ExecutionMode.DIRECT | NucleusIQ ExecutionMode.STANDARD + @tool |
| Trains on data | Yes (weights) | No (few-shot in prompt) | ML tool uses Track A model |
| Reproducible | High (fixed seed) | Medium (prompt/version drift) | High for scores; medium for narrative |
| Cost at 1M rows/day | Low | Very high (LLM per row) | High (LLM + cheap ML tool) |
| Probabilities | Native | No (Yes/No only) | From score_churn tool |
| Guardrails | You build | ModelCallLimitPlugin | Call limits + tool limits |
| Best for | Batch scoring, regulation | Label-scarce prototypes | CRM workflow + retention copy |
| Metrics you trust | ROC-AUC, F1 | Same F1 on test set | Tool ROC-AUC + human review of text |
Decision guide (2026)
- Deploy Track A for nightly churn scores in a data pipeline.
- Use Track B (NucleusIQ Direct) to prototype when labels are scarce, then measure on held-out data.
- Use Track C (NucleusIQ Standard) when retention teams need tool-backed scores plus natural-language actions, after Track A is saved.
Wrong: "We replaced sklearn with NucleusIQ."
Right: "sklearn scores customers; NucleusIQ runs few-shot experiments and agent workflows."
This is the combined learning promise from our 2026 roadmap: same problem statement, different tools, measured the same way.
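The cost row in the table comes down to simple arithmetic: an LLM call per row scales linearly with row count and prompt size. Here is a back-of-the-envelope sketch; the token count and price are placeholder assumptions, not quoted rates, so substitute your own prompt size and your provider's current pricing.

```python
def daily_llm_cost(rows_per_day, tokens_per_row, usd_per_million_tokens):
    """Rough daily spend when every row needs one LLM call."""
    return rows_per_day * tokens_per_row * usd_per_million_tokens / 1_000_000


# Placeholder inputs (assumptions): replace with your own measurements.
ROWS_PER_DAY = 1_000_000
TOKENS_PER_ROW = 800       # assumed: few-shot prompt + one profile + answer
USD_PER_M_TOKENS = 0.50    # assumed blended price, not a quoted rate

cost = daily_llm_cost(ROWS_PER_DAY, TOKENS_PER_ROW, USD_PER_M_TOKENS)
print(f"Estimated LLM spend: ${cost:,.2f} per day")
```

The absolute figure matters less than the shape: doubling the few-shot examples roughly doubles the bill, while a saved sklearn pipeline scores the same rows without any per-row API charge.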
Optional: NucleusIQ Direct Explains Track A (Narration Only)
After Track A metrics look good, use another Direct mode agent to explain the prediction. The probability still comes from sklearn:
def create_explainer_agent() -> Agent:
    return Agent(
        name="churn-explainer",
        role="Analyst",
        objective="Explain model output",
        prompt=ZeroShotPrompt().configure(
            system=(
                "Explain churn risk in 3 short bullets. "
                "Use ONLY the customer JSON and the model probability provided. "
                "Do not invent fields."
            ),
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        config=AgentConfig(execution_mode=ExecutionMode.DIRECT),
    )


async def explain_prediction(customer: dict, probability: float) -> str:
    agent = create_explainer_agent()
    await agent.initialize()
    result = await agent.execute(
        Task(
            id="explain-1",
            objective=(
                f"Customer:\n{json.dumps(customer)}\n"
                f"Model churn probability: {probability:.2%}"
            ),
        )
    )
    return str(result.output)


row = df_test_readable.iloc[0]
p = loaded.predict_proba(X_test_sel.iloc[[0]])[0, 1]
print(asyncio.run(explain_prediction(row_to_customer_dict(row), p)))
Project Structure for GitHub
Publish a small repo so readers can run the tutorial end-to-end:
ml-beginners-python/
├── README.md
├── requirements.txt
├── .env.example
├── data/
│   └── telco_churn.csv
├── notebooks/
│   └── 01_churn_ml_vs_llm.ipynb
├── src/
│   ├── train.py                  # Track A: FE + selection + sklearn
│   ├── nucleusiq_few_shot.py     # Track B: Direct mode
│   ├── nucleusiq_agent_churn.py  # Track C: Standard + tools
│   └── compare_tracks.py         # same test metrics table
└── models/
    ├── churn_model.joblib
    ├── feature_selector.joblib
    └── training_columns.json
requirements.txt:
pandas>=2.0
numpy>=1.24
scikit-learn>=1.3
matplotlib>=3.7
seaborn>=0.13
joblib>=1.3
python-dotenv>=1.0
nucleusiq>=0.6.0
nucleusiq-openai>=0.6.0
src/train.py skeleton (move the tutorial code into functions):
"""Train churn classifier. Run: python src/train.py"""
from pathlib import Path

import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

DATA = Path("data/telco_churn.csv")
MODEL_OUT = Path("models/churn_model.joblib")


def load_and_prepare(path: Path):
    df = pd.read_csv(path)
    # ... same cleaning as tutorial ...
    # X (features) and y (target) are produced by the cleaning steps above.
    return X, y


def main():
    X, y = load_and_prepare(DATA)
    # Stratify so train and test keep the same churn rate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    pipe = Pipeline([("model", RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42))])
    pipe.fit(X_train, y_train)
    # Probabilities feed ROC-AUC; hard labels feed the classification report.
    proba = pipe.predict_proba(X_test)[:, 1]
    preds = pipe.predict(X_test)
    print(classification_report(y_test, preds))
    print("ROC-AUC:", roc_auc_score(y_test, proba))
    MODEL_OUT.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, MODEL_OUT)


if __name__ == "__main__":
    main()
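The project tree also lists models/training_columns.json. A minimal sketch of why that file is useful: one-hot encoding a single new customer only creates dummy columns for the values that customer has, so the frame must be reindexed to the training layout before predict. The two-column toy frame and the temporary path are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy training frame (assumption): one categorical and one numeric column.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [70.0, 55.5, 20.0],
})
X_train = pd.get_dummies(train, columns=["Contract"], dtype=int)

# Persist the exact training column order next to the model.
cols_path = Path(tempfile.mkdtemp()) / "training_columns.json"
cols_path.write_text(json.dumps(list(X_train.columns)))

# At inference, one customer only yields a dummy for their own contract type.
new_customer = pd.DataFrame({"Contract": ["One year"], "MonthlyCharges": [80.0]})
X_new = pd.get_dummies(new_customer, columns=["Contract"], dtype=int)

# Reindex to the saved layout: missing dummies become 0, unseen extras are dropped.
training_columns = json.loads(cols_path.read_text())
X_new_aligned = X_new.reindex(columns=training_columns, fill_value=0)
print(X_new_aligned)
```

Without the reindex step, `X_new` has two columns while the model expects four, which is the ValueError many beginners hit on their first inference run.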
Star or fork: https://github.com/nucleusbox/ml-beginners-python (replace with your live URL when the repo is public).
Common Errors Beginners Hit (And Fixes)
| Error | Symptom | Fix |
|---|---|---|
| Fit scaler on full data | Test scores "too good" | Use Pipeline |
| Different columns at inference | ValueError on predict | Save training column list; align dummies |
| Ignoring imbalance | High accuracy, bad churn recall | class_weight, threshold tuning, F1 |
| Data leakage | Near-perfect test score | Audit features; time-based split if needed |
| Only reading AI explanations | Confident wrong story | Always print classification_report first |
| Skipping await agent.initialize() | NucleusIQ runtime errors | Call initialize() before execute() |
| Few-shot examples from test set | Inflated Track B scores | Build FEW_SHOT from df_train_readable only |
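The imbalance row mentions threshold tuning. As a minimal sketch on synthetic data (an assumption standing in for the churn features), the idea is to sweep the probability cutoff on validation data and pick the one that maximizes churn-class F1 instead of accepting the default 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 25% positives) standing in for churn.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.75],
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Sweep cutoffs on validation data and keep the one with the best churn F1.
thresholds = np.linspace(0.10, 0.90, 17)
f1s = [f1_score(y_val, (proba >= t).astype(int), zero_division=0)
       for t in thresholds]
best_t = float(thresholds[int(np.argmax(f1s))])

recall_default = recall_score(y_val, (proba >= 0.5).astype(int))
recall_best = recall_score(y_val, (proba >= best_t).astype(int))
print(f"best threshold={best_t:.2f}  F1={max(f1s):.3f}")
print(f"churn recall at 0.50: {recall_default:.3f} | at best: {recall_best:.3f}")
```

Choose the threshold on a validation split, not on the final test set, so the reported test metrics stay honest.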
How This Tutorial Fits the Nucleusbox ML Series
| You are here | Next depth on Nucleusbox |
|---|---|
| This tutorial: full sklearn workflow | Building logistic regression in Python: telecom merge, dummies, VIF |
| Evaluation section above | Model evaluation metrics: sensitivity, ROC, cutoffs |
| Theory craving | Logistic regression for ML using Python: sigmoid, MLE |
| Real-world motivation | Logistic regression applications |
| Bigger 2026 path | Start ML from scratch in 2026 |
Practice Exercises (Do These Before Moving On)
- Feature engineering: add a MonthlyCharges * tenure interaction. Does ROC-AUC improve after selection?
- Feature selection: compare K=20 vs K=50 with cross-validation. Which is more stable?
- Track B: increase few-shot examples to 5 per class. Does LLM recall improve on the same 80 test rows?
- Track B vs A: on rows where ML is correct and LLM is wrong, what pattern do you see?
- Track C: add a draft_retention_email tool that only runs if churn_probability > 0.5.
- Plot the ROC curve for Track A (see our metrics post).
- Portfolio README: fill the head-to-head table with real numbers and state which track you would deploy in production.
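For the feature-selection exercise, a possible starting point is sketched below. It assumes `f_classif` as the SelectKBest score function and uses synthetic data with 60 columns so that both k values are valid; swap in your engineered churn features and whichever score function you used in Track A.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 60 columns so both k=20 and k=50 are valid.
X, y = make_classification(n_samples=1000, n_features=60, n_informative=10,
                           weights=[0.73], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for k in (20, 50):
    # Scaler and selector live inside the pipeline, so each fold fits them
    # on its own training part only (no leakage into the validation part).
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    results[k] = (scores.mean(), scores.std())
    print(f"k={k}: ROC-AUC mean={scores.mean():.3f}  std={scores.std():.3f}")
```

The standard deviation across folds is a rough proxy for "more stable": a similar mean with a smaller spread is usually the safer choice.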
From Tutorial to Portfolio README (Template)
Paste this into your GitHub README.md and fill the blanks:
## Telco Churn: ML vs Few-Shot vs Agent (Same Test Set)
**Problem:** Predict whether a customer will churn.
**Data:** IBM Telco Customer Churn (OpenML #42178).
**Track A (sklearn):** Feature engineering + SelectKBest (k=35) + Random Forest.
**Track B (NucleusIQ Direct):** 3 examples per class, N=80 test rows.
**Track C (NucleusIQ Standard):** score_churn `@tool` + retention recommendation.
| Track | Churn Recall | F1 (Churn) | Notes |
|-------|--------------|------------|-------|
| A: ML | ___ | ___ | Production candidate |
| B: Few-shot | ___ | ___ | Prototype / expensive at scale |
| C: Agent tool | ___ | ___ | Same scores as A; LLM for UX |
**Decision:** Deploy Track A for scoring; Track C for CRM workflow.
**Limitations:** Class imbalance; LLM eval subset; no causal claims.
This is what hiring managers skim in thirty seconds.
How This Tutorial Connects to "Learn ML + AI Together"
Our 2026 scratch roadmap puts Weeks 3-4 on sklearn and Weeks 7-8 on Gen AI layers. This single tutorial merges both on one churn dataset so you feel the difference immediately, not six months apart.
You now have evidence for three statements interviewers like:
- "I can engineer features and select them without leakage."
- "I can evaluate any classifier, including an LLM, with precision and recall on a held-out set."
- "I know when to use agents for workflow, not as a replacement for ML scores."
FAQ
Is this tutorial enough to get a job?
It is one portfolio-quality baseline project. Combine it with the hybrid projects in our 2026 roadmap (ML + LLM layer) and ideas from Top 7 AI Projects for High-Paying Jobs.
Why logistic regression and random forest?
Logistic regression teaches linear baselines and probabilities. Random forest teaches non-linear tabular strength without neural network complexity, which makes the pair ideal for beginners in 2026.
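To see the linear versus non-linear contrast in isolation, here is a small sketch on a toy dataset with a curved class boundary (scikit-learn's make_moons, an assumption chosen for illustration rather than the churn data). Both models expose predict_proba, so both can be scored with ROC-AUC in the same loop.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: a toy dataset with a curved class boundary.
X, y = make_moons(n_samples=1000, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # both models expose probabilities
    aucs[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC-AUC={aucs[name]:.3f}")
```

On curved data like this the forest would typically score higher; on the churn features the gap is often smaller, which is exactly why it is worth training both.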
Do I need TensorFlow or PyTorch here?
Not for this tutorial. Master sklearn pipelines first; add deep learning when your problem is images, audio, or large unstructured text.
Where is the Jupyter notebook?
Clone the GitHub repo: notebooks/01_churn_ml_vs_llm.ipynb mirrors every step in this post. You can also run src/train.py from the terminal without Jupyter.
How is this different from the older Nucleusbox churn posts?
Our building logistic regression in Python walkthrough merges three CSV files manually, builds dummies column by column, and steps through VIF and statsmodels, which is excellent for depth. This tutorial teaches the same churn problem with sklearn pipelines so you can ship faster and avoid copy-paste errors between train and test. Do both: this post for execution speed, the older post for statistical intuition.
Can I use AI to write this code for me?
Yes, as long as you run every cell, change one parameter, and explain the metrics in your own words. Copilots accelerate boilerplate; they do not replace confusion matrices. That is the same "build + understand" rule from our 2026 roadmap.
Should I skip sklearn and only use NucleusIQ Direct for churn?
Only for a prototype. Production churn pipelines need Track A for cost, latency, and auditability. Use NucleusIQ Direct (Track B) to experiment with few-shot examples; use Standard (Track C) for retention workflows, not to skip ML evaluation.
What is the difference between NucleusIQ Direct and Standard here?
Direct = one pass, few-shot classification (Track B). Standard = tool loop where score_churn returns probabilities from sklearn (Track C). Both use the same Agent class with a different ExecutionMode in AgentConfig.
Do I need the raw OpenAI SDK?
No, not for Tracks B/C. Install nucleusiq + nucleusiq-openai (or another provider package). Swap BaseOpenAI for BaseGemini without rewriting agent logic; see Why NucleusIQ?.
Summary
You learned a complete Python machine learning tutorial for beginners, plus how it differs from solving the same churn problem with AI:
Track A: Classical ML
- Load and explore data
- Feature engineering (tenure buckets, service counts, contract flags)
- Feature selection (SelectKBest on train only)
- Split train/test with stratify
- Train logistic regression and random forest on selected features
- Evaluate with recall, precision, F1, ROC-AUC
- Save model + selector for production inference
Track B: NucleusIQ Direct (few-shot)
- Build labeled examples from train only
- Configure ZeroShotPrompt + ExecutionMode.DIRECT
- Batch-classify test customers; score with the same metrics as Track A
Track C: NucleusIQ Standard (tools)
- Register score_churn with @tool (wraps Track A model)
- Run Standard mode agent with plugins: scores from tools, copy from LLM
Continue with model evaluation metrics, the 2026 ML + AI roadmap, NucleusIQ Direct mode, and Standard mode with tools.
Written by Nucleusbox. More tutorials: Machine Learning archive. Code: ml-beginners-python on GitHub.
That's it, we are done. If you have any questions or suggestions, please feel free to comment. I'll cover more topics on Machine Learning and Data Engineering soon. If you like my work, please comment and subscribe; suggestions are always welcome and appreciated.