Machine Learning Tutorial for Beginners Using Python (Step-by-Step)

This is a hands-on machine learning tutorial for beginners using Python. You will engineer features, select the best columns, and train sklearn models. Then, on the same churn problem, you will compare them to NucleusIQ (Direct mode few-shot and Standard mode agent with tools), so you can see when to use classical ML versus an agent framework in 2026.

If you have not read our roadmap post yet, start with How to Start Machine Learning from Scratch in 2026 for the bigger picture (AI + ML combined learning). This tutorial is Week 3–4 of that path, the “build” track, where code and metrics finally click.

We align with techniques from our deeper dives—building logistic regression in Python, model evaluation metrics, and logistic regression applications—but keep the flow beginner-friendly and runnable in one afternoon.

TL;DR

Same problem, three approaches: predict telecom customer churn using (A) sklearn, (B) NucleusIQ Direct (few-shot), (C) NucleusIQ Standard (agent + tools)—so you see what each is good at.
Track A (ML): feature engineering → feature selection → logistic regression + random forest → ROC-AUC on held-out test data.
Track B (NucleusIQ Direct): few-shot churn classification via a Direct mode agent—no raw OpenAI SDK.
Track C (NucleusIQ Standard): Standard mode agent with @tool functions (score_churn, etc.)—ML for scores, NucleusIQ for workflow.
Metrics: accuracy is not enough—use classification report, confusion matrix, ROC-AUC (metrics guide).
Clone the GitHub repo for runnable scripts (src/train.py, src/nucleusiq_few_shot.py, src/nucleusiq_agent_churn.py).

Python Libraries Used in This Tutorial (And Why)

Library	Role in this project
pandas	Load CSV, clean text fields, one-hot encode, inspect churn rates
NumPy	Under the hood for sklearn; you rarely call it directly as a beginner
matplotlib / seaborn	EDA charts for stakeholders
scikit-learn	Train/test split, pipelines, models, metrics
joblib	Save/load trained pipelines
nucleusiq + nucleusiq-openai	Track B (Direct) and Track C (Standard + `@tool`)

You do not need PyTorch, TensorFlow, or LangChain for this lab. Tracks B/C use NucleusIQ instead of hand-rolled OpenAI SDK calls—see our start-from-scratch 2026 guide.

What You Will Build

By the end of this tutorial you will have:

Output	Description
Cleaned dataset	Numeric features ready for sklearn
Trained pipeline	Scaler + model in one object
Evaluation report	Precision, recall, F1, ROC-AUC on held-out test data
Saved model file	`churn_model.joblib` for inference
Feature selector	`selector.joblib` fit only on train data
NucleusIQ Direct few-shot benchmark	Same test rows, same metrics as ML
NucleusIQ Standard agent demo	`@tool` + retention workflow
README story	Side-by-side comparison table—for your portfolio

Business question (fixed for the whole tutorial): Will this customer churn (leave the service)?

That is binary classification—the same problem we solve in building logistic regression in Python and logistic regression applications. What changes is how you solve it—not what you solve.

One Problem, Three Approaches (Read This First)

Beginners in 2026 often ask: “Can I just use ChatGPT for churn instead of scikit-learn?”

Honest answer: You can try, but you must compare both on the same test customers with the same metrics. That is what this tutorial does.

Approach	What it is	Strengths	Weaknesses
Track A — Classical ML	Features → algorithm → probability	Cheap at scale, reproducible, auditable metrics	Needs feature work; weak on raw unstructured text
Track B — NucleusIQ Direct	Few-shot in system prompt → single-pass classify	Fast to prototype; provider-portable	Costly per row at scale; evaluate on test set
Track C — NucleusIQ Standard	Agent + `@tool` → calls your sklearn model	Guardrails, tool loops, production patterns	Needs Track A model saved first

                    ┌─────────────────────────────────────┐
                    │  Problem: Will customer churn?      │
                    └─────────────────────────────────────┘
                                      │
          ┌───────────────────────────┼───────────────────────────┐
          ▼                           ▼                           ▼
   Track A: sklearn              Track B: NucleusIQ Direct   Track C: NucleusIQ Standard
   (engineer features)           (few-shot in prompt)        (@tool + ML model)
          │                           │                           │
          └───────────────────────────┴───────────────────────────┘
                                      ▼
                         Same test set → precision / recall / F1

Tracks A and B are the learning core. Track C shows how production teams combine them using NucleusIQ—the same framework we use in our Direct mode and Standard mode with tools guides.

Prerequisites

From the 2026 ML roadmap:

Python 3.10+ installed
Basic comfort with variables, functions, and pip install
2–3 hours uninterrupted time

Helpful but optional: read AI vs ML vs DL vs Data Science if you confuse those terms.

Step 0: Environment Setup

Create a virtual environment and install dependencies:

python -m venv ml-tutorial-env
# Windows
ml-tutorial-env\Scripts\activate
# macOS/Linux
source ml-tutorial-env/bin/activate

pip install pandas numpy matplotlib seaborn scikit-learn joblib python-dotenv nucleusiq nucleusiq-openai

Verify sklearn:

import sklearn
print(sklearn.__version__)  # e.g. 1.4+

Step 1: Load the Dataset

We use the Telco Customer Churn dataset—a public benchmark that matches the telecom examples in our multivariate logistic regression post.

Option A — OpenML (no manual download):

from sklearn.datasets import fetch_openml

# IBM Telco Customer Churn on OpenML (id=42178)
churn = fetch_openml(data_id=42178, as_frame=True, parser="auto")
df = churn.frame
target_col = "Churn" if "Churn" in df.columns else churn.target_names[0]

Option B — CSV from GitHub (recommended for the companion notebook):

Download telco_churn.csv from the repo linked at the end of this post, then:

import pandas as pd

df = pd.read_csv("data/telco_churn.csv")
target_col = "Churn"

Option C — Your own CSV from the IBM Telco churn Kaggle dataset; align column names with the preprocessing below.

Quick inspection:

print(df.shape)
print(df.head())
print(df[target_col].value_counts(normalize=True))

You will often see ~73% No churn / ~27% Yes churn. That class imbalance matters when you interpret accuracy later—exactly what we discuss in model evaluation metrics.
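
To see why that imbalance matters, consider a "model" that always predicts No. The sketch below uses synthetic labels that only mirror the 73/27 ratio (they are not taken from the dataset), so treat it as an illustration rather than a result:

```python
import numpy as np

# Synthetic labels mirroring the ~73/27 split (illustrative, not the real data)
y_true = np.array([0] * 73 + [1] * 27)

# A useless baseline: always predict "No churn"
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
# Recall for churn = caught churners / all real churners
churn_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy={accuracy:.2f}  churn_recall={churn_recall:.2f}")
```

The baseline scores 73% accuracy while catching zero churners, which is the floor any real model has to beat.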

The Machine Learning Workflow (Map This Tutorial)

Every beginner Python ML project follows the same spine—whether you are predicting churn, loan default, or email spam:

Business question → EDA → Feature engineering → Split → Feature selection
    → Train ML → Evaluate
    → (compare) Few-shot LLM on same test rows
    → (optional) Agent tools wrapping ML

This tutorial walks that spine end to end, then adds AI tracks on the same churn question. When you read our older building logistic regression article, you will recognize the same churn story: merge tables, dummy variables, train/test discipline. Here we use pandas + sklearn pipelines because that is what most teams hire for in 2026.

Parametric vs non-parametric (one sentence): logistic regression assumes a linear decision boundary in feature space (linear in the log-odds); random forests make no such assumption. Our parametric vs non-parametric guide goes deeper, and is best read after you finish this lab.
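
A small sketch of that difference, using made-up XOR-style data where no straight line can separate the classes. The data, variable names, and hyperparameters here are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(1000, 2))
# Label is 1 when both coordinates share a sign: no single line separates this
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

lr_acc = LogisticRegression().fit(Xtr, ytr).score(Xte, yte)
rf_acc = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr).score(Xte, yte)
print(f"logistic regression: {lr_acc:.2f}   random forest: {rf_acc:.2f}")
```

On data like this the linear model should hover near chance while the forest does well. Real churn data is rarely this extreme, which is why logistic regression remains a strong baseline.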

Step 2: Exploratory Data Analysis (EDA)

EDA is not optional decoration. It prevents training on broken data. See our EDA explainer for mindset; here is the minimum for this tutorial.

import matplotlib.pyplot as plt
import seaborn as sns

# Missing values
print(df.isnull().sum().sort_values(ascending=False).head(10))

# Numeric summary
print(df.describe())

# Churn rate by contract type (example business slice)
if "Contract" in df.columns:
    sns.countplot(data=df, x="Contract", hue=target_col)
    plt.title("Churn by contract type")
    plt.xticks(rotation=15)
    plt.tight_layout()
    plt.show()

Write down three observations in your notebook README—for example:

Month-to-month contracts may show higher churn.
TotalCharges sometimes loads as text because of empty strings.
Class imbalance means accuracy alone is misleading.

These bullets become portfolio gold and mirror how we reason in production case studies like financial recommendation systems.

Track A — Classical Machine Learning (Steps 3–10)

Everything in Steps 3–10 is Track A. Do not skip to the LLM section until you have test metrics from sklearn—otherwise you cannot compare fairly.

Step 3: Data Cleaning

3.1 Drop ID columns

IDs do not generalize; they cause memorization.

drop_cols = [c for c in ["customerID", "customerId"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")

3.2 Fix TotalCharges

A common Telco issue: TotalCharges stored as strings.

if "TotalCharges" in df.columns:
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
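
If it is unclear what errors="coerce" does, here is a minimal sketch on a toy Series (the values are invented; only the blank string mimics the Telco issue):

```python
import pandas as pd

# Toy column with one blank string, like the blanks that make TotalCharges load as text
raw = pd.Series(["29.85", " ", "1889.5", "108.15"])

as_num = pd.to_numeric(raw, errors="coerce")   # " " cannot be parsed, so it becomes NaN
filled = as_num.fillna(as_num.median())        # median() ignores NaN by default

print(as_num.isna().sum())   # number of unparseable values
print(filled.tolist())
```

Checking the NaN count before filling is a cheap way to confirm how many rows were affected.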

3.3 Encode target + keep readable features for Track B/C

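y = df[target_col].copy()
# Handle string or categorical labels (OpenML may load the target as a category dtype)
if not pd.api.types.is_numeric_dtype(y):
    y = y.astype(str).str.strip().str.lower().map({"yes": 1, "no": 0})
y = y.astype(int)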

# Human-readable rows for few-shot LLM + agent tools (before heavy encoding)
df_model = df.drop(columns=[target_col]).copy()

Step 4: Feature Engineering (Create Signal, Do Not Only Encode)

Feature engineering means building new columns from domain logic before modeling. This is where many beginners stop too early—they only one-hot encode and wonder why ROC-AUC is flat.

Our dedicated post Feature Engineering Techniques for Machine Learning goes deeper; here is a Telco churn starter set:

import numpy as np

fe = df_model.copy()

# Numeric fixes (if not done in Step 3)
if "TotalCharges" in fe.columns:
    fe["TotalCharges"] = pd.to_numeric(fe["TotalCharges"], errors="coerce")
    fe["TotalCharges"] = fe["TotalCharges"].fillna(fe["TotalCharges"].median())

# --- Engineered features ---
if "tenure" in fe.columns:
    fe["tenure_bucket"] = pd.cut(
        fe["tenure"],
        bins=[-1, 12, 36, 72, 999],
        labels=["0-12m", "13-36m", "37-72m", "72m+"],
    )

if "MonthlyCharges" in fe.columns and "tenure" in fe.columns:
    fe["avg_charge_per_tenure_month"] = fe["MonthlyCharges"] / (fe["tenure"].clip(lower=1))

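# Count how many Yes/No-style fields are "Yes" (add-on services plus flags like Partner)
service_cols = [c for c in fe.columns if fe[c].dtype == object and fe[c].isin(["Yes", "No"]).any()]
for c in service_cols:
    # Values such as "No internet service" would map to NaN and break sklearn later,
    # so treat anything other than "Yes" as 0.
    fe[c] = (fe[c] == "Yes").astype(int)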

if service_cols:
    fe["num_active_services"] = fe[service_cols].sum(axis=1)

# Contract risk flag (business intuition from EDA)
if "Contract" in fe.columns:
    fe["is_month_to_month"] = (fe["Contract"] == "Month-to-month").astype(int)

print(fe[["tenure", "avg_charge_per_tenure_month", "num_active_services", "is_month_to_month"]].head())

Why these features help

tenure_bucket — churn risk often spikes early tenure.
avg_charge_per_tenure_month — separates heavy spenders from long-tenure low payers.
num_active_services — engagement proxy.
is_month_to_month — aligns with contract plots from EDA and our multicollinearity / regression discussions when features overlap.

Rule: every engineered feature must be computable at inference time from data you will actually have before churn happens—no future information.

4.1 One-hot encode categoricals (after engineering)

X = pd.get_dummies(fe, drop_first=True)
print("Feature count after engineering + encoding:", X.shape[1])

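Leakage check: drop any column whose value is only known after the customer has already churned (some extended versions of the Telco data include fields such as a churn reason or churn score). When in doubt, remove suspicious fields.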

4.2 Why we use `drop_first=True` in `get_dummies`

With one-hot encoding, the last category can be inferred from the others (dummy variable trap). drop_first=True avoids perfect multicollinearity—related to the issues we discuss in multicollinearity in regression. For random forest it matters less; for logistic regression it keeps coefficients more stable.
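
A quick sketch of what drop_first changes, on a toy Contract column (the four rows are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (alphabetical) is dropped

print(list(full.columns))
print(list(reduced.columns))
# In `full`, the dummy columns sum to 1 on every row: that redundancy is what drop_first removes
print(full.sum(axis=1).tolist())
```

Note that the dropped category here is Month-to-month, so it becomes the implicit baseline: a row with zeros in both remaining columns is a month-to-month customer.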

4.3 Keep a list of training columns for production

training_columns = list(X.columns)

import json
import os

os.makedirs("models", exist_ok=True)  # later steps also save artifacts here
with open("models/training_columns.json", "w") as f:
    json.dump(training_columns, f)

At inference time, align new data to these columns (missing columns → 0). This prevents the “works in notebook, breaks in API” failure mode.

Step 5: Train / Test Split (Before Feature Selection)

Split after engineering, before selecting features—so the selector never sees test labels.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

# Human-readable rows for Track B/C (same indices)
train_idx, test_idx = X_train.index, X_test.index
df_train_readable = df_model.loc[train_idx]
df_test_readable = df_model.loc[test_idx]

print("Train size:", len(X_train), "Test size:", len(X_test))

stratify=y keeps churn ratio stable—critical for imbalanced data (evaluation metrics post).
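
To check what stratify actually buys you, the sketch below splits synthetic labels with a 27% positive rate both ways. The data is made up; in the tutorial you would compare y_train.mean() and y_test.mean() instead:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 27% positive rate (illustrative only)
y_demo = np.array([1] * 270 + [0] * 730)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
_, _, y_tr_r, y_te_r = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print("stratified   train/test churn rate:", y_tr_s.mean(), y_te_s.mean())
print("unstratified train/test churn rate:", y_tr_r.mean(), y_te_r.mean())
```

The stratified split holds the rate at 27% in both parts, while the plain random split can drift by a point or two, which is enough to blur small metric differences.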

Step 6: Feature Selection (Reduce Noise, Fit on Train Only)

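After get_dummies you can easily have dozens of columns (check X.shape[1] from Step 4.1), and many are weak or redundant. Feature selection picks a subset that helps generalization. If K is at least the number of columns, the selector simply keeps them all.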

Important: fit the selector on X_train only, then transform X_test.

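from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import joblib

K = 35  # tune: try 20, 35, 50 and compare CV ROC-AUC

# mutual_info_classif is randomized; fixing random_state keeps the selected set reproducible
score_func = partial(mutual_info_classif, random_state=42)
selector = SelectKBest(score_func=score_func, k=min(K, X_train.shape[1]))
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)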

selected_mask = selector.get_support()
selected_features = X_train.columns[selected_mask].tolist()
print("Selected features:", len(selected_features))
print(selected_features[:10], "...")

joblib.dump(selector, "models/feature_selector.joblib")

What mutual_info_classif does: scores each feature by how much it reduces uncertainty about churn—useful for non-linear relationships before tree models.
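
A sketch of the non-linear point, on synthetic data where the label depends on the absolute value of one column. The data and names are assumptions for illustration. A linear screen such as f_classif compares class means, which are both close to zero here, so it tends to rate the informative column poorly:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
noise = rng.normal(size=2000)
# Label depends on |signal|: a U-shaped, non-linear relationship
y_mi = (np.abs(signal) > 1).astype(int)
X_mi = np.column_stack([signal, noise])

mi = mutual_info_classif(X_mi, y_mi, random_state=0)
f_stat, _ = f_classif(X_mi, y_mi)

print("mutual information [signal, noise]:", mi.round(3))
print("ANOVA F-statistic  [signal, noise]:", f_stat.round(3))
```

Mutual information should score the signal column well above the noise column, even though the two classes have nearly the same mean.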

Alternative selectors (try in exercises):

Method	When to use
`SelectKBest(f_classif)`	Fast linear screening
`RFECV` with logistic regression	Smaller, interpretable set; slower
Tree `feature_importances_`	After RF train; good for reporting, not always for leakage-safe selection

Convert back to DataFrames for readable column names:

X_train_sel = pd.DataFrame(X_train_sel, columns=selected_features, index=X_train.index)
X_test_sel = pd.DataFrame(X_test_sel, columns=selected_features, index=X_test.index)

Step 7: Build a sklearn Pipeline (Best Practice)

A pipeline chains preprocessing and model so you do not accidentally fit the scaler on test data (a classic beginner bug).

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

class_weight="balanced" gives more attention to the minority churn class—useful before you tune thresholds.

Train:

logistic_pipe.fit(X_train_sel, y_train)

Train on X_train_sel, not the full wide matrix—this is the pipeline you compare against LLM approaches.

7.1 What logistic regression is doing (intuition)

Logistic regression outputs a probability between 0 and 1 using the sigmoid function. If you want the math story—sigmoid, log-odds, likelihood—read Logistic Regression for Machine Learning using Python and cost function in logistic regression. For this tutorial, remember: coefficients tell direction, probabilities tell risk, and thresholds turn risk into yes/no decisions.
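
The coefficient, probability, and threshold chain can be sketched in a few lines. The coefficients below are hypothetical values chosen for illustration, not fitted ones:

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (not fitted values)
intercept = -1.0
coef_month_to_month = 1.5   # positive: pushes churn risk up
coef_tenure_scaled = -0.8   # negative: longer tenure pulls risk down

# One customer: month-to-month contract, scaled tenure of 0.5
z = intercept + coef_month_to_month * 1 + coef_tenure_scaled * 0.5
prob = sigmoid(z)
label = int(prob >= 0.5)     # the threshold turns risk into a yes/no decision
print(f"log-odds={z:.2f}  probability={prob:.3f}  label={label}")
```

In the fitted pipeline, the real coefficients live in logistic_pipe.named_steps["model"].coef_, and they apply to the scaled features.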

Step 8: Predict and Evaluate — Track A (Do Not Skip)

8.1 Class predictions and probabilities

y_pred = logistic_pipe.predict(X_test_sel)
y_proba = logistic_pipe.predict_proba(X_test_sel)[:, 1]

8.2 Classification report

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print(classification_report(y_test, y_pred, target_names=["No Churn", "Churn"]))
print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))

How to read this (short version):

Precision (Churn): of customers you flagged as churn, how many actually churned?
Recall (Churn): of all real churners, how many did you catch? (Sensitivity in our metrics blog)
F1: balance when classes are imbalanced
ROC-AUC: ranking quality across thresholds

8.3 Confusion matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN, FP],
#  [FN, TP]]

Map this to TP, TN, FP, FN exactly as in Fig-1 of our model evaluation metrics post. If your recall for churn is low, the model is missing customers who leave—often more expensive than a false alarm.
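
To connect the four cells to the metrics, this sketch unpacks a confusion matrix and recomputes precision, recall, and F1 by hand. The labels are toy values, not tutorial output:

```python
from sklearn.metrics import confusion_matrix

# Toy labels (1 = churn) for illustration only
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# For binary labels [0, 1], ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of flagged customers, how many churned
recall = tp / (tp + fn)      # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp, precision, recall, f1)
```

These hand-computed values should agree with the Churn row of classification_report when run on the same labels.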

8.4 Accuracy (with caution)

from sklearn.metrics import accuracy_score
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))

If accuracy is ~80% but recall for churn is ~55%, you have a model that looks “fine” but fails the business goal. That is why we teach metrics beyond accuracy.

Step 9: Compare a Second Model (Random Forest)

Logistic regression is a strong interpretable baseline. Random Forest often improves tabular performance with non-linear patterns.

from sklearn.ensemble import RandomForestClassifier

rf_pipe = Pipeline([
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=12,
        class_weight="balanced",
        random_state=42,
        n_jobs=-1,
    )),
])

rf_pipe.fit(X_train_sel, y_train)
rf_pred = rf_pipe.predict(X_test_sel)
rf_proba = rf_pipe.predict_proba(X_test_sel)[:, 1]

print("=== Random Forest ===")
print(classification_report(y_test, rf_pred, target_names=["No Churn", "Churn"]))
print("ROC-AUC:", round(roc_auc_score(y_test, rf_proba), 4))

Compare in a table (fill with your numbers):

Model	Accuracy	Churn Recall	Churn F1	ROC-AUC
Logistic Regression	?	?	?	?
Random Forest	?	?	?	?

Pick the model that matches your business cost. If missing churn is expensive, optimize recall (possibly lower the probability threshold)—the same tradeoff we illustrate with ROC curves in our metrics article.
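
One way to fill that table without copying numbers by hand is a small helper. This is a sketch: summarize is a hypothetical name, and the toy inputs exist only so the block runs on its own:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

def summarize(name, y_true, y_pred, y_score):
    """Collect the four table columns for one model (positive class = 1)."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Churn Recall": recall_score(y_true, y_pred),
        "Churn F1": f1_score(y_true, y_pred),
        "ROC-AUC": roc_auc_score(y_true, y_score),
    }

# Toy inputs; in the tutorial you would pass y_test with (y_pred, y_proba)
# for logistic regression and (rf_pred, rf_proba) for the random forest.
toy_true = [0, 0, 1, 1]
rows = [
    summarize("Model 1", toy_true, [0, 0, 1, 0], [0.1, 0.4, 0.8, 0.45]),
    summarize("Model 2", toy_true, [0, 1, 1, 1], [0.2, 0.6, 0.9, 0.7]),
]
comparison = pd.DataFrame(rows).set_index("Model").round(3)
print(comparison)
```

Both toy models have the same accuracy but very different churn recall, which is the kind of gap the table is meant to surface.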

9.1 Feature importance (Random Forest)

import pandas as pd

rf_model = rf_pipe.named_steps["model"]
importances = pd.Series(rf_model.feature_importances_, index=X_train_sel.columns)
print(importances.sort_values(ascending=False).head(15))

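Use this to sanity-check the model: if customerID sneaks in as a top feature, your pipeline has a bug. If is_month_to_month ranks high, that matches business intuition from EDA, which is a good sign. (Contract_Month-to-month itself will not appear, because drop_first=True drops the first category of each column.)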

9.2 Cross-validation (more honest than one lucky split)

A single train/test split can flatter or punish you by accident. Cross-validation trains on several folds and averages scores:

from sklearn.model_selection import cross_val_score

# Strictly, the selector should be refit inside each CV fold. For simplicity we cross-validate
# on the already-selected training matrix, which can make these scores slightly optimistic.
cv_scores = cross_val_score(
    rf_pipe, X_train_sel, y_train, cv=5, scoring="roc_auc", n_jobs=-1
)
print("CV ROC-AUC mean:", cv_scores.mean().round(4))
print("CV ROC-AUC std:", cv_scores.std().round(4))

Report mean ± std in your README. Low std means stable; high std means you need more data or simpler models.
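
For a stricter estimate, the selector can live inside the pipeline so it is refit on each training fold. The sketch below uses synthetic data as a stand-in for the wide X_train, and swaps in f_classif for speed; both are assumptions, not the tutorial's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the wide training matrix (X_train, y_train in the tutorial)
X_wide, y_wide = make_classification(
    n_samples=600, n_features=40, n_informative=6, weights=[0.73, 0.27], random_state=42
)

leak_free_pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=15)),   # refit inside every fold
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(leak_free_pipe, X_wide, y_wide, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because selection happens inside each fold, the held-out fold never influences which columns are kept.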

Step 10: Tune the Decision Threshold (Optional but Valuable)

Default threshold is 0.5. For churn, you may want 0.3 or 0.4 to catch more leavers.

import numpy as np
from sklearn.metrics import precision_score, recall_score

thresholds = np.arange(0.2, 0.6, 0.05)
for t in thresholds:
    preds_t = (y_proba >= t).astype(int)
    print(
        f"threshold={t:.2f}  "
        f"precision={precision_score(y_test, preds_t, zero_division=0):.3f}  "
        f"recall={recall_score(y_test, preds_t, zero_division=0):.3f}"
    )

This connects directly to the cutoff analysis section in our evaluation metrics post—where we sweep probability cutoffs for telecom churn.
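
If the business gives you a recall target instead of a threshold, precision_recall_curve can find the cutoff for you. A sketch, where threshold_for_recall is a hypothetical helper and the scores are toy values:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_score, target_recall=0.8):
    """Return the highest threshold whose recall still meets the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; ignore the final point
    ok = np.where(recall[:-1] >= target_recall)[0]
    best = ok[-1]  # recall falls as the threshold rises, so take the last match
    return thresholds[best], precision[best], recall[best]

# Toy scores for illustration; in the tutorial pass y_test and y_proba
toy_y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
toy_p = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9])

t, p, r = threshold_for_recall(toy_y, toy_p, target_recall=0.75)
print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Choosing the highest qualifying threshold keeps precision as high as possible while still meeting the recall target. Tune the cutoff on validation data rather than the final test set if you plan to report test metrics.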

Step 11: Save the Model for Reuse

import joblib

best_model = rf_pipe  # or logistic_pipe—whichever you chose
joblib.dump(best_model, "models/churn_model.joblib")

Load later:

loaded = joblib.load("models/churn_model.joblib")
sample_pred = loaded.predict(X_test_sel.iloc[:5])  # the model was trained on the selected columns

Step 12: Inference on New Customers (Track A)

new_customer = X_test_sel.iloc[[0]]  # replace with real preprocessed row
prob_churn = loaded.predict_proba(new_customer)[0, 1]
print(f"Churn probability: {prob_churn:.2%}")

In production you would:

Apply the same encoding steps (same dummy columns—missing columns = 0).
Log predictions and outcomes.
Retrain on a schedule.

Aligning raw rows to training columns

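def align_features(raw_df: pd.DataFrame, training_cols: list) -> pd.DataFrame:
    # raw_df must already have the Step 3-4 cleaning and engineered columns applied.
    # Encode with drop_first=False: a single row has one category per column, so
    # drop_first=True would drop every dummy. Reindexing to the training columns
    # discards the baseline categories instead.
    encoded = pd.get_dummies(raw_df, drop_first=False)
    aligned = encoded.reindex(columns=training_cols, fill_value=0)
    return aligned

# Example after loading training_columns.json
# aligned = align_features(new_raw, training_columns)
# The saved model was fit on the selected columns, so subset before predicting:
# prob = loaded.predict_proba(aligned[selected_features])[0, 1]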

This function is the bridge between “data science notebook” and “backend API”—worth committing to your GitHub repo under src/preprocess.py.
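
One pitfall to watch when encoding a single incoming row: with drop_first=True, the row's only category counts as the "first" one and is dropped, so every dummy ends up as zero after alignment. A minimal sketch with made-up column names:

```python
import pandas as pd

# Columns as they might look after training-time encoding (illustrative names)
training_cols_demo = ["tenure", "Contract_One year", "Contract_Two year"]

one_row = pd.DataFrame([{"tenure": 5, "Contract": "Two year"}])

dropped = pd.get_dummies(one_row, drop_first=True)   # the only category is "first", so it vanishes
kept = pd.get_dummies(one_row, drop_first=False)

aligned_bad = dropped.reindex(columns=training_cols_demo, fill_value=0)
aligned_ok = kept.reindex(columns=training_cols_demo, fill_value=0)

print(aligned_bad.astype(int).iloc[0].to_dict())
print(aligned_ok.astype(int).iloc[0].to_dict())
```

Encoding without drop_first and letting reindex discard the baseline columns reproduces the training layout for any number of rows.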

Plot ROC Curve (Visual Evaluation)

Numbers are mandatory; plots help you communicate to non-technical stakeholders.

from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("Logistic Regression — ROC Curve")
plt.show()

Compare both models on one chart:

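fig, ax = plt.subplots()  # share one axes so both curves land on the same chart
rf_proba = rf_pipe.predict_proba(X_test_sel)[:, 1]  # same selected columns the model was trained on
RocCurveDisplay.from_predictions(y_test, rf_proba, name="Random Forest", ax=ax)
RocCurveDisplay.from_predictions(y_test, y_proba, name="Logistic Regression", ax=ax)
ax.legend()
plt.show()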

Tie this chart to the ROC explanation in our model evaluation metrics post—especially the tradeoff between sensitivity and false positive rate.
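
A useful way to read ROC-AUC: it is the fraction of (churner, non-churner) pairs in which the churner receives the higher score. The sketch below checks that on toy scores (invented values) against roc_auc_score:

```python
import itertools
from sklearn.metrics import roc_auc_score

# Toy labels and scores for illustration only
y_rank = [0, 0, 1, 0, 1, 1]
s_rank = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

pos = [s for s, y in zip(s_rank, y_rank) if y == 1]
neg = [s for s, y in zip(s_rank, y_rank) if y == 0]

# Count pairs where the churner outranks the non-churner (ties count as half)
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in itertools.product(pos, neg)
)
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc, roc_auc_score(y_rank, s_rank))
```

The two numbers agree, which is why ROC-AUC is described as a ranking metric: it does not depend on any single threshold.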

Track B — Few-Shot Churn With NucleusIQ Direct Mode

Track B solves churn without sklearn training—only labeled examples in the prompt. Instead of wiring the raw OpenAI SDK, we use NucleusIQ Direct mode: one agent, one pass, low overhead—exactly what we document in the Direct mode beginner guide.

Why NucleusIQ here (vs raw API)?

Same provider portability as the rest of our stack (BaseOpenAI today, BaseGemini tomorrow).
Usage tracking on agent.last_usage (tokens per task).
Plugins later (ModelCallLimitPlugin when you batch-score many customers).
Same Agent + Task API you will reuse in Track C—only the execution mode changes.

B.1 Install NucleusIQ + provider

pip install nucleusiq nucleusiq-openai python-dotenv

# .env (never commit)
OPENAI_API_KEY=sk-...

B.2 Shared helpers (same as Track A)

import asyncio
import json
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

def row_to_customer_dict(row: pd.Series) -> dict:
    return {
        k: (None if pd.isna(v) else v)
        for k, v in row.to_dict().items()
        if k not in ("customerID", "customerId")
    }

def build_few_shot_examples(df_readable, y_series, n_per_class=3):
    examples = []
    for label, name in [(1, "Yes"), (0, "No")]:
        for i in y_series[y_series == label].index[:n_per_class]:
            examples.append({
                "customer": row_to_customer_dict(df_readable.loc[i]),
                "churn": name,
            })
    return examples

def parse_churn_label(text: str) -> str:
    t = (text or "").strip().lower()
    if t.startswith("yes"):
        return "Yes"
    if t.startswith("no"):
        return "No"
    return "No"  # conservative default for parsing

FEW_SHOT = build_few_shot_examples(df_train_readable, y_train, n_per_class=3)

B.3 Build the few-shot system prompt

def few_shot_system_block(examples: list) -> str:
    shots = "\n\n".join(
        f"Customer: {json.dumps(ex['customer'])}\nChurn: {ex['churn']}"
        for ex in examples
    )
    return f"""You are a telecom churn analyst.

Study these labeled examples:
{shots}

Rules:
- For each new customer, reply with exactly one word: Yes or No.
- Use only the fields in the customer JSON.
- Do not explain unless asked."""

FEW_SHOT_SYSTEM = few_shot_system_block(FEW_SHOT)

B.4 Create a NucleusIQ Direct mode classifier agent

from nucleusiq.agents import Agent
from nucleusiq.agents.config import AgentConfig, ExecutionMode
from nucleusiq.agents.task import Task
from nucleusiq.prompts.zero_shot import ZeroShotPrompt
from nucleusiq.plugins.builtin.model_call_limit import ModelCallLimitPlugin
from nucleusiq_openai import BaseOpenAI

def create_few_shot_churn_agent() -> Agent:
    return Agent(
        name="churn-few-shot",
        role="Churn classifier",
        objective="Classify telecom churn from customer profiles",
        prompt=ZeroShotPrompt().configure(
            system=FEW_SHOT_SYSTEM,
            user="Classify the customer in the task. Reply Yes or No only.",
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        config=AgentConfig(execution_mode=ExecutionMode.DIRECT),
        plugins=[ModelCallLimitPlugin(max_calls=3)],  # safety for batch loops
    )

few_shot_agent = create_few_shot_churn_agent()

ExecutionMode.DIRECT = fast single-pass classification—no tool loop. See choosing agent modes when you outgrow this.

B.5 Classify one customer (async)

async def predict_churn_nucleusiq(agent: Agent, customer: dict) -> str:
    await agent.initialize()
    result = await agent.execute(
        Task(
            id="churn-cls",
            objective=f"Customer JSON:\n{json.dumps(customer)}",
        )
    )
    return parse_churn_label(str(result.output))

# Example
sample = row_to_customer_dict(df_test_readable.iloc[0])
label = asyncio.run(predict_churn_nucleusiq(few_shot_agent, sample))
print("Predicted churn:", label)
print("Tokens:", few_shot_agent.last_usage.total.total_tokens)

B.6 Batch evaluate on the same test set as Track A

from sklearn.metrics import classification_report, accuracy_score

EVAL_N = 80
test_slice = df_test_readable.iloc[:EVAL_N]
y_true = y_test.iloc[:EVAL_N]

async def eval_few_shot_track():
    agent = create_few_shot_churn_agent()
    await agent.initialize()
    preds = []
    for _, row in test_slice.iterrows():
        customer = row_to_customer_dict(row)
        result = await agent.execute(
            Task(
                id=f"churn-{len(preds)}",
                objective=f"Customer JSON:\n{json.dumps(customer)}",
            )
        )
        preds.append(1 if parse_churn_label(str(result.output)) == "Yes" else 0)
    return preds

llm_preds = asyncio.run(eval_few_shot_track())
print("=== Track B: NucleusIQ Direct (few-shot) ===")
print(classification_report(y_true, llm_preds, target_names=["No Churn", "Churn"]))
print("Accuracy:", accuracy_score(y_true, llm_preds))

What you will often see

Few-shot via Direct mode can look reasonable on easy rows but miss imbalance nuance unless examples are balanced.
agent.last_usage helps you estimate cost per 1,000 classifications—still far more than sklearn at scale.
You still need classification_report on held-out data—NucleusIQ does not replace evaluation.
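
The cost point is simple arithmetic once you have a token count. In this sketch every input is a placeholder: estimate_llm_cost is a hypothetical helper, the token count should come from agent.last_usage on a few real calls, and the price (here one blended rate for input and output tokens) should come from your provider's current pricing:

```python
def estimate_llm_cost(tokens_per_call: float, n_rows: int, usd_per_million_tokens: float) -> float:
    """Rough cost of classifying n_rows with one LLM call per row."""
    return tokens_per_call * n_rows * usd_per_million_tokens / 1_000_000

# Placeholder inputs, not measured values or real prices
tokens_per_call = 900
usd_per_million_tokens = 0.50

for n in (1_000, 1_000_000):
    cost = estimate_llm_cost(tokens_per_call, n, usd_per_million_tokens)
    print(f"{n:>9,} rows -> ${cost:,.2f}")
```

Few-shot prompts resend every example on each call, so tokens_per_call grows with the number of shots. That is worth measuring before you scale up EVAL_N.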

Track C — Churn Agent With NucleusIQ Standard Mode + Tools

Track C is the production pattern: your Track A model becomes a tool; NucleusIQ Standard mode runs the tool loop and writes the retention narrative. This matches our Standard mode with tools and plugins guides.

Prerequisite: complete Track A and save models/churn_model.joblib + selected_features.

C.1 Load ML artifacts (Track A output)

import joblib

loaded_model = joblib.load("models/churn_model.joblib")
# selected_features from Step 6 — list of column names after SelectKBest

C.2 Register tools with `@tool`

NucleusIQ turns Python functions into agent tools automatically (schemas from type hints + docstrings):

from nucleusiq.tools.decorators import tool

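def _score_churn_core(customer_json: str) -> str:
    """Shared scoring logic for tools and offline evaluation."""
    # A one-row DataFrame keeps numeric fields numeric
    # (Series.to_frame().T would turn every column into object dtype).
    raw = pd.DataFrame([json.loads(customer_json)])
    # NOTE: apply the same Step 3-4 cleaning and feature engineering to `raw` here;
    # otherwise engineered columns are zero-filled and scores will not match Track A.
    # drop_first=False: one row has a single category per column, so drop_first=True
    # would remove every dummy; reindex drops the baseline categories instead.
    encoded = pd.get_dummies(raw, drop_first=False)
    aligned = encoded.reindex(columns=selected_features, fill_value=0)
    proba = float(loaded_model.predict_proba(aligned)[0, 1])
    return json.dumps({
        "churn_probability": round(proba, 4),
        "churn_label": "Yes" if proba >= 0.4 else "No",
        "threshold": 0.4,
    })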

@tool
def score_churn(customer_json: str) -> str:
    """Score churn probability using the trained sklearn model from Track A.

    Args:
        customer_json: JSON object of customer fields (same schema as training rows).
    """
    return _score_churn_core(customer_json)

@tool
def get_customer_profile(customer_json: str) -> str:
    """Return the customer profile JSON for display (no scoring).

    Args:
        customer_json: JSON object of customer fields.
    """
    return customer_json

The LLM should call score_churn for numbers—never invent probabilities in free text.

C.3 Build the Standard mode retention agent

from nucleusiq.plugins.builtin.model_call_limit import ModelCallLimitPlugin
from nucleusiq.plugins.builtin.tool_call_limit import ToolCallLimitPlugin

def create_churn_retention_agent() -> Agent:
    return Agent(
        name="churn-retention-agent",
        role="Retention analyst",
        objective="Assess churn and recommend one action",
        prompt=ZeroShotPrompt().configure(
            system=(
                "You help telecom retention teams. "
                "Always call score_churn before recommending action. "
                "Never guess churn probability—use the tool result only. "
                "Give one concrete retention step."
            ),
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        tools=[score_churn, get_customer_profile],
        config=AgentConfig(execution_mode=ExecutionMode.STANDARD),
        plugins=[
            ModelCallLimitPlugin(max_calls=8),
            ToolCallLimitPlugin(max_calls=5),
        ],
    )

ExecutionMode.STANDARD enables the multi-step tool loop—the agent can call score_churn, read the JSON, then respond. For high-stakes review workflows, you could move to Autonomous mode later; churn triage usually starts at Standard.

C.4 Run the agent for one customer

async def run_retention_agent(customer_row: pd.Series) -> str:
    agent = create_churn_retention_agent()
    await agent.initialize()
    profile = row_to_customer_dict(customer_row)
    result = await agent.execute(
        Task(
            id="retention-1",
            objective=(
                "Assess churn risk and suggest one retention action.\n"
                f"Customer profile JSON:\n{json.dumps(profile)}"
            ),
        )
    )
    print(f"Tool calls: {result.tool_call_count}")
    print(f"Tokens: {agent.last_usage.total.total_tokens}")
    return str(result.output)

print(asyncio.run(run_retention_agent(df_test_readable.iloc[0])))

C.5 Evaluate ML tool scores vs Track A (apples to apples)

The agent’s marketing text is not your classification metric—the score_churn tool output is. For a fair comparison, call the same tool logic on test rows (or parse tool results from traces):

def score_churn_direct(customer_row: pd.Series) -> float:
    payload = json.loads(_score_churn_core(json.dumps(row_to_customer_dict(customer_row))))
    return payload["churn_probability"]

agent_probs = [
    score_churn_direct(df_test_readable.loc[i])
    for i in df_test_readable.iloc[:EVAL_N].index
]

from sklearn.metrics import roc_auc_score
print("Track C tool ROC-AUC (should match Track A):",
      roc_auc_score(y_true, agent_probs))

If this ROC-AUC matches the Track A model's ROC-AUC computed on the same EVAL_N rows, your tool wiring is correct: the agent layer adds workflow, not a new model.

C.6 Optional: stream tool visibility

For demos and debugging, use execute_stream() as in our Standard mode guide:

from nucleusiq.streaming.events import StreamEventType

async def stream_retention(customer_row: pd.Series):
    agent = create_churn_retention_agent()
    await agent.initialize()
    profile = row_to_customer_dict(customer_row)
    async for event in agent.execute_stream(
        Task(id="ret-s", objective=f"Score and recommend:\n{json.dumps(profile)}")
    ):
        if event.type == StreamEventType.TOOL_CALL_START:
            print("Tool:", event.data.get("tool_name"))
        elif event.type == StreamEventType.TOKEN:
            print(event.data.get("content", ""), end="", flush=True)

NucleusIQ Track Picker (Quick Reference)

Goal	NucleusIQ mode	This tutorial
Few-shot classify one JSON profile	Direct	Track B
Tool loop + ML model + narrative	Standard	Track C
Verify + revise high-stakes analysis	Autonomous	Not needed for first churn lab

Docs: NucleusIQ docs · GitHub

Head-to-Head: sklearn vs NucleusIQ Direct vs NucleusIQ Standard (Same Test Customers)

Fill this table after running all tracks on the same EVAL_N rows:

Criterion	Track A — sklearn	Track B — NucleusIQ Direct	Track C — NucleusIQ Standard
Framework	scikit-learn	NucleusIQ `ExecutionMode.DIRECT`	NucleusIQ `ExecutionMode.STANDARD` + `@tool`
Trains on data	Yes (weights)	No (few-shot in prompt)	ML tool uses Track A model
Reproducible	High (fixed seed)	Medium (prompt/version drift)	High for scores; medium for narrative
Cost at 1M rows/day	Low	Very high (LLM per row)	High (LLM + cheap ML tool)
Probabilities	Native	No (Yes/No only)	From `score_churn` tool
Guardrails	You build	`ModelCallLimitPlugin`	Call limits + tool limits
Best for	Batch scoring, regulation	Label-scarce prototypes	CRM workflow + retention copy
Metrics you trust	ROC-AUC, F1	Same F1 on test set	Tool ROC-AUC + human review of text
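To fill the metric rows consistently, it helps to push every track's Yes/No predictions through one function. The following is a minimal sketch in plain Python; the label lists are toy values for illustration only, not results from this tutorial, and `churn_metrics` is a hypothetical helper name.

```python
def churn_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Precision, recall, and F1 for the positive (churn = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = churn, 0 = stay. Every track is scored on the SAME rows.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
track_preds = {
    "A - sklearn": [1, 0, 1, 0, 0, 0, 1, 0],
    "B - Direct": [1, 1, 0, 0, 0, 0, 1, 0],
}
for name, preds in track_preds.items():
    m = churn_metrics(y_true, preds)
    print(f"{name}: recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

In practice `sklearn.metrics.classification_report` gives the same numbers; the hand-rolled version just makes it obvious that an LLM's labels are scored exactly like a model's.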

Decision guide (2026)

Deploy Track A for nightly churn scores in a data pipeline.
Use Track B (NucleusIQ Direct) to prototype when labels are scarce—then measure on held-out data.
Use Track C (NucleusIQ Standard) when retention teams need tool-backed scores plus natural-language actions—after Track A is saved.

Wrong:  "We replaced sklearn with NucleusIQ."
Right:  "sklearn scores customers; NucleusIQ runs few-shot experiments and agent workflows."

This is the combined learning promise from our 2026 roadmap: same problem statement, different tools, measured the same way.

Optional: NucleusIQ Direct Explains Track A (Narration Only)

After Track A metrics look good, use another Direct mode agent to explain—the probability still comes from sklearn:

def create_explainer_agent() -> Agent:
    return Agent(
        name="churn-explainer",
        role="Analyst",
        objective="Explain model output",
        prompt=ZeroShotPrompt().configure(
            system=(
                "Explain churn risk in 3 short bullets. "
                "Use ONLY the customer JSON and the model probability provided. "
                "Do not invent fields."
            ),
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        config=AgentConfig(execution_mode=ExecutionMode.DIRECT),
    )

async def explain_prediction(customer: dict, probability: float) -> str:
    agent = create_explainer_agent()
    await agent.initialize()
    result = await agent.execute(
        Task(
            id="explain-1",
            objective=(
                f"Customer:\n{json.dumps(customer)}\n"
                f"Model churn probability: {probability:.2%}"
            ),
        )
    )
    return str(result.output)

row = df_test_readable.iloc[0]
p = loaded.predict_proba(X_test_sel.iloc[[0]])[0, 1]
print(asyncio.run(explain_prediction(row_to_customer_dict(row), p)))

Project Structure for GitHub

Publish a small repo so readers can run the tutorial end-to-end:

ml-beginners-python/
├── README.md
├── requirements.txt
├── .env.example
├── data/
│   └── telco_churn.csv
├── notebooks/
│   └── 01_churn_ml_vs_llm.ipynb
├── src/
│   ├── train.py              # Track A: FE + selection + sklearn
│   ├── nucleusiq_few_shot.py   # Track B — Direct mode
│   ├── nucleusiq_agent_churn.py # Track C — Standard + tools
│   └── compare_tracks.py     # same test metrics table
└── models/
    ├── churn_model.joblib
    ├── feature_selector.joblib
    └── training_columns.json

requirements.txt:

pandas>=2.0
numpy>=1.24
scikit-learn>=1.3
matplotlib>=3.7
seaborn>=0.13
joblib>=1.3
python-dotenv>=1.0
nucleusiq>=0.6.0
nucleusiq-openai>=0.6.0

src/train.py skeleton—move the tutorial code into functions:

"""Train churn classifier. Run: python src/train.py"""
from pathlib import Path
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

DATA = Path("data/telco_churn.csv")
MODEL_OUT = Path("models/churn_model.joblib")

def load_and_prepare(path: Path):
    df = pd.read_csv(path)
    # ... same cleaning as tutorial ...
    # The elided steps must define X (feature DataFrame) and y (0/1 churn target).
    return X, y

def main():
    X, y = load_and_prepare(DATA)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    pipe = Pipeline([("model", RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42))])
    pipe.fit(X_train, y_train)
    proba = pipe.predict_proba(X_test)[:, 1]
    preds = pipe.predict(X_test)
    print(classification_report(y_test, preds))
    print("ROC-AUC:", roc_auc_score(y_test, proba))
    MODEL_OUT.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, MODEL_OUT)

if __name__ == "__main__":
    main()
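A quick way to confirm the saved file is usable is a save, reload, and compare round trip. The sketch below assumes synthetic data from `make_classification` in place of the prepared churn features and writes to a temporary directory instead of `models/`; it illustrates the check and is not part of the repo scripts.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; the real script would use the prepared churn features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
pipe = Pipeline([("model", RandomForestClassifier(n_estimators=20, random_state=42))])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "churn_model.joblib"
    joblib.dump(pipe, model_path)      # same call train.py uses
    loaded = joblib.load(model_path)   # what an inference script would do

# The reloaded pipeline should reproduce the original probabilities.
original = pipe.predict_proba(X[:5])[:, 1]
reloaded = loaded.predict_proba(X[:5])[:, 1]
assert (original == reloaded).all()
print(reloaded.round(3))
```

Load the model with the same scikit-learn version that saved it; joblib files are not guaranteed to be portable across versions.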

Star or fork: https://github.com/nucleusbox/ml-beginners-python (replace with your live URL when the repo is public).

Common Errors Beginners Hit (And Fixes)

Error	Symptom	Fix
Fit scaler on full data	Test scores “too good”	Use `Pipeline`
Different columns at inference	`ValueError` on predict	Save training column list; align dummies
Ignoring imbalance	High accuracy, bad churn recall	`class_weight`, threshold tuning, F1
Data leakage	Near-perfect test score	Audit features; time-based split if needed
Only reading AI explanations	Confident wrong story	Always print `classification_report` first
Skipping `await agent.initialize()`	NucleusIQ runtime errors	Call `initialize()` before `execute()`
Few-shot examples from test set	Inflated Track B scores	Build `FEW_SHOT` from `df_train_readable` only
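The "different columns at inference" row is the one that bites most often, so here is a small sketch of the fix. It assumes one-hot encoding with `pd.get_dummies` and a toy two-column frame; the column values are illustrative, and in the project the column list would be the one stored in `models/training_columns.json`.

```python
import pandas as pd

# Toy training frame: three contract types seen at training time.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 24, 60],
})
training_columns = list(pd.get_dummies(train, dtype=int).columns)

# A single new customer only produces ONE contract dummy column.
new_customer = pd.DataFrame({"Contract": ["One year"], "tenure": [12]})
raw = pd.get_dummies(new_customer, dtype=int)

# Align to the training layout: add missing dummies as 0, drop unseen ones,
# and restore the original column order.
aligned = raw.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Without the `reindex` step, `raw` has two columns while the model expects four, which is exactly the `ValueError` described in the table.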

How This Tutorial Fits the Nucleusbox ML Series

You are here	Next depth on Nucleusbox
This tutorial — full sklearn workflow	Building logistic regression in Python — telecom merge, dummies, VIF
Evaluation section above	Model evaluation metrics — sensitivity, ROC, cutoffs
Theory craving	Logistic regression for ML using Python — sigmoid, MLE
Real-world motivation	Logistic regression applications
Bigger 2026 path	Start ML from scratch in 2026

Practice Exercises (Do These Before Moving On)

Feature engineering: add MonthlyCharges * tenure interaction—does ROC-AUC improve after selection?
Feature selection: compare K=20 vs K=50 with cross-validation—which is more stable?
Track B: increase few-shot examples to 5 per class—does LLM recall improve on the same 80 test rows?
Track B vs A: on rows where ML is correct and LLM is wrong, what pattern do you see?
Track C: add a draft_retention_email tool that only runs if churn_probability > 0.5.
Plot ROC curve for Track A—link to our metrics post.
Portfolio README: fill the head-to-head table with real numbers and state which track you would deploy in production.
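For the feature-selection exercise, one leak-safe way to compare values of K is to put the selector inside a `Pipeline` so it is refit on each training fold. This is a sketch under assumptions: the data is synthetic (`make_classification`), `f_classif` is chosen as the scoring function for illustration, and logistic regression stands in for whichever model you are tuning.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered churn features (60 columns, 10 informative).
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=10, random_state=42
)

results = {}
for k in (20, 50):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=k)),  # refit per fold
        ("model", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    results[k] = (scores.mean(), scores.std())
    print(f"k={k}: ROC-AUC mean={scores.mean():.3f} std={scores.std():.3f}")
```

Read "more stable" as the smaller standard deviation across folds, and prefer the smaller K when the means are close.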

From Tutorial to Portfolio README (Template)

Paste this into your GitHub README.md and fill the blanks:

## Telco Churn — ML vs Few-Shot vs Agent (Same Test Set)

**Problem:** Predict whether a customer will churn.
**Data:** IBM Telco Customer Churn (OpenML #42178).
**Track A (sklearn):** Feature engineering + SelectKBest (k=35) + Random Forest.
**Track B (NucleusIQ Direct):** 3 examples per class, N=80 test rows.
**Track C (NucleusIQ Standard):** score_churn `@tool` + retention recommendation.

| Track | Churn Recall | F1 (Churn) | Notes |
|-------|--------------|------------|-------|
| A — ML | ___ | ___ | Production candidate |
| B — Few-shot | ___ | ___ | Prototype / expensive at scale |
| C — Agent tool | ___ | ___ | Same scores as A; LLM for UX |

**Decision:** Deploy Track A for scoring; Track C for CRM workflow.
**Limitations:** Class imbalance; LLM eval subset; no causal claims.

This is what hiring managers skim in thirty seconds.

How This Tutorial Connects to “Learn ML + AI Together”

Our 2026 scratch roadmap puts Week 3–4 on sklearn and Week 7–8 on Gen AI layers. This single tutorial merges both on one churn dataset so you feel the difference immediately—not six months apart.

You now have evidence for three statements interviewers like:

“I can engineer features and select them without leakage.”
“I can evaluate any classifier—including an LLM—with precision and recall on a held-out set.”
“I know when to use agents for workflow, not as a replacement for ML scores.”

FAQ

Is this tutorial enough to get a job?

It is one portfolio-quality baseline project. Combine it with the hybrid projects in our 2026 roadmap (ML + LLM layer) and ideas from Top 7 AI Projects for High-Paying Jobs.

Why logistic regression and random forest?

Logistic regression teaches linear baselines and probabilities. Random forest teaches non-linear tabular strength without neural network complexity—ideal for beginners in 2026.

Do I need TensorFlow or PyTorch here?

Not for this tutorial. Master sklearn pipelines first; add deep learning when your problem is images, audio, or large unstructured text.

Where is the Jupyter notebook?

Clone the GitHub repo: notebooks/01_churn_ml_vs_llm.ipynb mirrors every step in this post. You can also run src/train.py from the terminal without Jupyter.

How is this different from the older Nucleusbox churn posts?

Our building logistic regression in Python walkthrough merges three CSV files manually, builds dummies column by column, and steps through VIF and statsmodels—excellent for depth. This tutorial teaches the same churn problem with sklearn pipelines so you can ship faster and avoid copy-paste errors between train and test. Do both: this post for execution speed, the older post for statistical intuition.

Can I use AI to write this code for me?

Yes—as long as you run every cell, change one parameter, and explain the metrics in your own words. Copilots accelerate boilerplate; they do not replace confusion matrices. That is the same “build + understand” rule from our 2026 roadmap.

Should I skip sklearn and only use NucleusIQ Direct for churn?

Only for a prototype. Production churn pipelines need Track A for cost, latency, and auditability. Use NucleusIQ Direct (Track B) to experiment with few-shot examples; use Standard (Track C) for retention workflows—not to skip ML evaluation.
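The cost argument is easy to check with back-of-envelope arithmetic. The sketch below is a hypothetical helper, and every input (tokens per row, price per million tokens) is a placeholder assumption, not a quoted price; substitute your provider's current rates.

```python
def daily_llm_cost(rows_per_day: int, tokens_per_row: int,
                   usd_per_million_tokens: float) -> float:
    """Rough daily spend when every row needs one LLM call."""
    total_tokens = rows_per_day * tokens_per_row
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Placeholder inputs for illustration only.
cost = daily_llm_cost(
    rows_per_day=1_000_000,       # nightly scoring volume from the comparison table
    tokens_per_row=800,           # assumed prompt + few-shot examples + answer
    usd_per_million_tokens=0.50,  # assumed blended price
)
print(f"${cost:,.2f} per day")
```

A saved sklearn pipeline scores the same rows on a single machine with no per-row fee, which is why Track A remains the production scorer.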

What is the difference between NucleusIQ Direct and Standard here?

Direct = one pass, few-shot classification (Track B). Standard = tool loop where score_churn returns probabilities from sklearn (Track C). Same Agent class—different ExecutionMode in AgentConfig.

Do I need the raw OpenAI SDK?

Not for Tracks B and C. Install nucleusiq + nucleusiq-openai (or another provider package). Swap BaseOpenAI for BaseGemini without rewriting agent logic; see Why NucleusIQ?

Summary

You worked through a complete Python machine learning tutorial for beginners, and saw how classical ML differs from solving the same churn problem with an LLM agent:

Track A — Classical ML

Load and explore data
Feature engineering (tenure buckets, service counts, contract flags)
Split train/test with stratify
Feature selection (SelectKBest fit on train only)
Train logistic regression and random forest on selected features
Evaluate with recall, precision, F1, ROC-AUC
Save model + selector for production inference

Track B — NucleusIQ Direct (few-shot)

Build labeled examples from train only
Configure ZeroShotPrompt + ExecutionMode.DIRECT
Batch-classify test customers; score with the same metrics as Track A

Track C — NucleusIQ Standard (tools)

Register score_churn with @tool (wraps Track A model)
Run Standard mode agent with plugins—scores from tools, copy from LLM

Continue with model evaluation metrics, the 2026 ML + AI roadmap, NucleusIQ Direct mode, and Standard mode with tools.

Written by Nucleusbox. More tutorials: Machine Learning archive. Code: GitHub — ml-beginners-python.
