Machine Learning Tutorial for Beginners Using Python (Step-by-Step)

This is a hands-on machine learning tutorial for beginners using Python. You will engineer features, select the best columns, train sklearn models, and on the same churn problem, compare them to NucleusIQ (Direct mode few-shot + Standard mode agent with tools), so you see when to use classical ML vs an agent framework in 2026.

If you have not read our roadmap post yet, start with How to Start Machine Learning from Scratch in 2026 for the bigger picture (AI + ML combined learning). This tutorial is Week 3-4 of that path, the "build" track, where code and metrics finally click.

We align with techniques from our deeper dives (building logistic regression in Python, model evaluation metrics, and logistic regression applications) but keep the flow beginner-friendly and runnable in one afternoon.

TL;DR

  • Same problem, three approaches: predict telecom customer churn using (A) sklearn, (B) NucleusIQ Direct (few-shot), (C) NucleusIQ Standard (agent + tools), so you see what each is good at.
  • Track A (ML): feature engineering → feature selection → logistic regression + random forest → ROC-AUC on held-out test data.
  • Track B (NucleusIQ Direct): few-shot churn classification via a Direct mode agent, with no raw OpenAI SDK.
  • Track C (NucleusIQ Standard): Standard mode agent with @tool functions (score_churn, etc.): ML for scores, NucleusIQ for workflow.
  • Metrics: accuracy is not enough; use classification report, confusion matrix, ROC-AUC (metrics guide).
  • Clone the GitHub repo for runnable scripts (src/train.py, src/nucleusiq_few_shot.py, src/nucleusiq_agent_churn.py).

Python Libraries Used in This Tutorial (And Why)

Library | Role in this project
pandas | Load CSV, clean text fields, one-hot encode, inspect churn rates
NumPy | Under the hood for sklearn; you rarely call it directly as a beginner
matplotlib / seaborn | EDA charts for stakeholders
scikit-learn | Train/test split, pipelines, models, metrics
joblib | Save/load trained pipelines
nucleusiq + nucleusiq-openai | Track B (Direct) and Track C (Standard + @tool)

You do not need PyTorch, TensorFlow, or LangChain for this lab. Tracks B/C use NucleusIQ instead of hand-rolled OpenAI SDK calls; see our start-from-scratch 2026 guide.


What You Will Build

By the end of this tutorial you will have:

Output | Description
Cleaned dataset | Numeric features ready for sklearn
Trained pipeline | Scaler + model in one object
Evaluation report | Precision, recall, F1, ROC-AUC on held-out test data
Saved model file | churn_model.joblib for inference
Feature selector | feature_selector.joblib fit only on train data
NucleusIQ Direct few-shot benchmark | Same test rows, same metrics as ML
NucleusIQ Standard agent demo | @tool + retention workflow
README story | Side-by-side comparison table for your portfolio

Business question (fixed for the whole tutorial): Will this customer churn (leave the service)?

That is binary classification, the same problem we solve in building logistic regression in Python and logistic regression applications. What changes is how you solve it, not what you solve.


One Problem, Three Approaches (Read This First)

Beginners in 2026 often ask: "Can I just use ChatGPT for churn instead of scikit-learn?"

Honest answer: You can try, but you must compare both on the same test customers with the same metrics. That is what this tutorial does.

Approach | What it is | Strengths | Weaknesses
Track A: Classical ML | Features → algorithm → probability | Cheap at scale, reproducible, auditable metrics | Needs feature work; weak on raw unstructured text
Track B: NucleusIQ Direct | Few-shot in system prompt → single-pass classify | Fast to prototype; provider-portable | Costly per row at scale; evaluate on test set
Track C: NucleusIQ Standard | Agent + @tool → calls your sklearn model | Guardrails, tool loops, production patterns | Needs Track A model saved first

                    ┌─────────────────────────────────────┐
                    │  Problem: Will customer churn?      │
                    └─────────────────────────────────────┘
                                      │
          ┌───────────────────────────┼───────────────────────────┐
          ▼                           ▼                           ▼
   Track A: sklearn              Track B: NucleusIQ Direct   Track C: NucleusIQ Standard
   (engineer features)           (few-shot in prompt)        (@tool + ML model)
          │                           │                           │
          └───────────────────────────┴───────────────────────────┘
                                      ▼
                         Same test set → precision / recall / F1

Tracks A and B are the learning core. Track C shows how production teams combine them using NucleusIQ, the same framework we use in our Direct mode and Standard mode with tools guides.


Prerequisites

From the 2026 ML roadmap:

  • Python 3.10+ installed
  • Basic comfort with variables, functions, and pip install
  • 2-3 hours of uninterrupted time

Helpful but optional: read AI vs ML vs DL vs Data Science if you confuse those terms.


Step 0: Environment Setup

Create a virtual environment and install dependencies:

python -m venv ml-tutorial-env
# Windows
ml-tutorial-env\Scripts\activate
# macOS/Linux
source ml-tutorial-env/bin/activate

pip install pandas numpy matplotlib seaborn scikit-learn joblib python-dotenv nucleusiq nucleusiq-openai

Verify sklearn:

import sklearn
print(sklearn.__version__)  # e.g. 1.4+

Step 1: Load the Dataset

We use the Telco Customer Churn dataset, a public benchmark that matches the telecom examples in our multivariate logistic regression post.

Option A โ€” OpenML (no manual download):

from sklearn.datasets import fetch_openml

# IBM Telco Customer Churn on OpenML (id=42178)
churn = fetch_openml(data_id=42178, as_frame=True, parser="auto")
df = churn.frame
target_col = "Churn" if "Churn" in df.columns else churn.target_names[0]

Option B โ€” CSV from GitHub (recommended for the companion notebook):

Download telco_churn.csv from the repo linked at the end of this post, then:

import pandas as pd

df = pd.read_csv("data/telco_churn.csv")
target_col = "Churn"

Option C โ€” Your own CSV from the IBM Telco churn Kaggle dataset; align column names with the preprocessing below.

Quick inspection:

print(df.shape)
print(df.head())
print(df[target_col].value_counts(normalize=True))

You will often see ~73% No churn / ~27% Yes churn. That class imbalance matters when you interpret accuracy later, exactly as we discuss in model evaluation metrics.
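
To see why the imbalance matters, here is a minimal sketch with made-up labels (100 hypothetical customers at roughly the ratio above, not real Telco rows). A "model" that always answers No scores 73% accuracy while catching no churners at all.

```python
# Illustrative labels only: 73 non-churners (0) and 27 churners (1).
y_true = [0] * 73 + [1] * 27

# A "model" that always predicts the majority class (No churn).
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_churn = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}  churn recall={recall_churn:.2f}")
```

Any real model has to beat this baseline on recall and ROC-AUC, not only on accuracy.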


The Machine Learning Workflow (Map This Tutorial)

Every beginner Python ML project follows the same spine, whether you are predicting churn, loan default, or email spam:

Business question → EDA → Feature engineering → Split → Feature selection
    → Train ML → Evaluate
    → (compare) Few-shot LLM on same test rows
    → (optional) Agent tools wrapping ML

This tutorial walks that spine end to end, then adds AI tracks on the same churn question. When you read our older building logistic regression article, you will recognize the same churn story: merge tables, dummy variables, train/test discipline. Here we use pandas + sklearn pipelines because that is what most teams hire for in 2026.

Parametric vs non-parametric (one sentence): logistic regression assumes a smooth linear boundary in feature space; random forests do not. Our parametric vs non-parametric guide goes deeper; read it after you finish this lab.
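
The difference is easy to see on a toy problem. The sketch below uses synthetic XOR-style data (an assumption for illustration, not the Telco set) where the class depends on the sign of x0 * x1, so no single straight line separates the classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic XOR-style data: label is 1 when both features share a sign.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(2000, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Linear boundary vs. tree ensemble on the same split.
lr_acc = LogisticRegression().fit(Xd_train, yd_train).score(Xd_test, yd_test)
rf_acc = (
    RandomForestClassifier(n_estimators=100, random_state=0)
    .fit(Xd_train, yd_train)
    .score(Xd_test, yd_test)
)
print(f"logistic regression accuracy: {lr_acc:.2f}")
print(f"random forest accuracy:       {rf_acc:.2f}")
```

On data like this the linear model should hover near chance while the forest does far better. Real churn data is less extreme, which is why Step 9 compares both models instead of assuming a winner.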


Step 2: Exploratory Data Analysis (EDA)

EDA is not optional decoration. It prevents training on broken data. See our EDA explainer for mindset; here is the minimum for this tutorial.

import matplotlib.pyplot as plt
import seaborn as sns

# Missing values
print(df.isnull().sum().sort_values(ascending=False).head(10))

# Numeric summary
print(df.describe())

# Churn rate by contract type (example business slice)
if "Contract" in df.columns:
    sns.countplot(data=df, x="Contract", hue=target_col)
    plt.title("Churn by contract type")
    plt.xticks(rotation=15)
    plt.tight_layout()
    plt.show()

Write down three observations in your notebook README, for example:

  1. Month-to-month contracts may show higher churn.
  2. TotalCharges sometimes loads as text because of empty strings.
  3. Class imbalance means accuracy alone is misleading.

These bullets become portfolio gold and mirror how we reason in production case studies like financial recommendation systems.


Track A: Classical Machine Learning (Steps 3-12)

Everything in Steps 3-12 is Track A. Do not skip to the LLM section until you have test metrics from sklearn; otherwise you cannot compare fairly.


Step 3: Data Cleaning

3.1 Drop ID columns

IDs do not generalize; they cause memorization.

drop_cols = [c for c in ["customerID", "customerId"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")

3.2 Fix TotalCharges

A common Telco issue: TotalCharges stored as strings.

if "TotalCharges" in df.columns:
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

3.3 Encode target + keep readable features for Track B/C

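y = df[target_col].copy()
# Map Yes/No labels to 1/0. This covers object and categorical dtypes
# (OpenML can return the target as a category).
if not pd.api.types.is_numeric_dtype(y):
    y = y.astype(str).str.strip().str.lower().map({"yes": 1, "no": 0})
y = y.astype(int)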

# Human-readable rows for few-shot LLM + agent tools (before heavy encoding)
df_model = df.drop(columns=[target_col]).copy()

Step 4: Feature Engineering (Create Signal, Do Not Only Encode)

Feature engineering means building new columns from domain logic before modeling. This is where many beginners stop too early: they only one-hot encode and wonder why ROC-AUC is flat.

Our dedicated post Feature Engineering Techniques for Machine Learning goes deeper; here is a Telco churn starter set:

import numpy as np

fe = df_model.copy()

# Numeric fixes (if not done in Step 3)
if "TotalCharges" in fe.columns:
    fe["TotalCharges"] = pd.to_numeric(fe["TotalCharges"], errors="coerce")
    fe["TotalCharges"] = fe["TotalCharges"].fillna(fe["TotalCharges"].median())

# --- Engineered features ---
if "tenure" in fe.columns:
    fe["tenure_bucket"] = pd.cut(
        fe["tenure"],
        bins=[-1, 12, 36, 72, 999],
        labels=["0-12m", "13-36m", "37-72m", "72m+"],
    )

if "MonthlyCharges" in fe.columns and "tenure" in fe.columns:
    fe["avg_charge_per_tenure_month"] = fe["MonthlyCharges"] / (fe["tenure"].clip(lower=1))

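# Count how many add-on services are active (Yes = 1).
# Values such as "No internet service" become 0 here; a plain
# .map({"Yes": 1, "No": 0}) would turn them into NaN and break model fitting.
service_cols = [c for c in fe.columns if fe[c].dtype == object and fe[c].isin(["Yes", "No"]).any()]
for c in service_cols:
    fe[c] = (fe[c] == "Yes").astype(int)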

if service_cols:
    fe["num_active_services"] = fe[service_cols].sum(axis=1)

# Contract risk flag (business intuition from EDA)
if "Contract" in fe.columns:
    fe["is_month_to_month"] = (fe["Contract"] == "Month-to-month").astype(int)

print(fe[["tenure", "avg_charge_per_tenure_month", "num_active_services", "is_month_to_month"]].head())

Why these features help

  • tenure_bucket: churn risk often spikes in early tenure.
  • avg_charge_per_tenure_month: separates heavy spenders from long-tenure low payers.
  • num_active_services: engagement proxy.
  • is_month_to_month: aligns with contract plots from EDA and our multicollinearity / regression discussions when features overlap.

Rule: every engineered feature must be computable at inference time from data you will actually have before churn happens. No future information.

4.1 One-hot encode categoricals (after engineering)

X = pd.get_dummies(fe, drop_first=True)
print("Feature count after engineering + encoding:", X.shape[1])

Leakage check: drop post-churn columns. When in doubt, remove suspicious fields.

4.2 Why we use drop_first=True in get_dummies

With one-hot encoding, the last category can be inferred from the others (dummy variable trap). drop_first=True avoids perfect multicollinearity, which relates to the issues we discuss in multicollinearity in regression. For random forest it matters less; for logistic regression it keeps coefficients more stable.
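
A small sketch makes the trap concrete. It uses a toy Contract column (four made-up rows, not the full Telco frame) and compares the encoding with and without drop_first.

```python
import pandas as pd

# Toy column for illustration only.
contract = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

full = pd.get_dummies(contract, drop_first=False, dtype=int)
reduced = pd.get_dummies(contract, drop_first=True, dtype=int)

# With every level kept, the dummies in each row always sum to 1,
# so any one column is fully determined by the others.
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

Note which level disappears: pandas drops the first level in sorted order, here Month-to-month, so that category becomes the implicit baseline.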

4.3 Keep a list of training columns for production

import json
import os

training_columns = list(X.columns)

os.makedirs("models", exist_ok=True)  # create the output folder if it is missing
with open("models/training_columns.json", "w") as f:
    json.dump(training_columns, f)

At inference time, align new data to these columns (missing columns → 0). This prevents the "works in notebook, breaks in API" failure mode.


Step 5: Train / Test Split (Before Feature Selection)

Split after engineering, before selecting features, so the selector never sees test labels.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

# Human-readable rows for Track B/C (same indices)
train_idx, test_idx = X_train.index, X_test.index
df_train_readable = df_model.loc[train_idx]
df_test_readable = df_model.loc[test_idx]

print("Train size:", len(X_train), "Test size:", len(X_test))

stratify=y keeps the churn ratio stable across train and test, which is critical for imbalanced data (evaluation metrics post).
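
You can check the effect of stratify yourself. This sketch uses synthetic labels at roughly the Telco ratio (27% churn, an assumption for illustration) and a placeholder feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 270 churners (1) and 730 non-churners (0).
y_toy = np.array([1] * 270 + [0] * 730)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both parts keep (almost exactly) the same churn share.
print(f"train churn rate: {y_tr.mean():.3f}")
print(f"test churn rate:  {y_te.mean():.3f}")
```

Rerun it without stratify=y_toy and a few different random_state values to see how far the two rates can drift apart on a small test set.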


Step 6: Feature Selection (Reduce Noise, Fit on Train Only)

After get_dummies the matrix can hold dozens of columns (print X.shape[1] to check yours), and many are weak or redundant. Feature selection picks a subset that helps generalization.

Important: fit the selector on X_train only, then transform X_test.

from sklearn.feature_selection import SelectKBest, mutual_info_classif
import joblib

K = 35  # tune: try 20, 35, 50 and compare CV ROC-AUC

selector = SelectKBest(score_func=mutual_info_classif, k=min(K, X_train.shape[1]))
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

selected_mask = selector.get_support()
selected_features = X_train.columns[selected_mask].tolist()
print("Selected features:", len(selected_features))
print(selected_features[:10], "...")

joblib.dump(selector, "models/feature_selector.joblib")

What mutual_info_classif does: scores each feature by how much it reduces uncertainty about churn, which is useful for non-linear relationships before tree models.
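
As a quick intuition check, the sketch below runs mutual_info_classif on synthetic data (not the churn matrix) where one column fully drives the label and the other is pure noise. The column names are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: "signal" determines the label, "noise" is unrelated.
rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
noise = rng.normal(size=2000)
X_mi = np.column_stack([signal, noise])
y_mi = (signal > 0).astype(int)

# random_state fixes the estimator's internal randomness for repeatability.
mi_scores = mutual_info_classif(X_mi, y_mi, random_state=0)
for name, score in zip(["signal", "noise"], mi_scores):
    print(f"{name}: {float(score):.3f}")
```

The informative column should score far above the noise column, whose score sits near zero. SelectKBest simply keeps the k columns with the highest scores.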

Alternative selectors (try in exercises):

Method | When to use
SelectKBest(f_classif) | Fast linear screening
RFECV with logistic regression | Smaller, interpretable set; slower
Tree feature_importances_ | After RF train; good for reporting, not always for leakage-safe selection

Convert back to DataFrames for readable column names:

X_train_sel = pd.DataFrame(X_train_sel, columns=selected_features, index=X_train.index)
X_test_sel = pd.DataFrame(X_test_sel, columns=selected_features, index=X_test.index)

Step 7: Build a sklearn Pipeline (Best Practice)

A pipeline chains preprocessing and model so you do not accidentally fit the scaler on test data (a classic beginner bug).

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

class_weight="balanced" gives more attention to the minority churn class, which is useful before you tune thresholds.

Train:

logistic_pipe.fit(X_train_sel, y_train)

Train on X_train_sel, not the full wide matrix. This is the pipeline you compare against LLM approaches.

7.1 What logistic regression is doing (intuition)

Logistic regression outputs a probability between 0 and 1 using the sigmoid function. If you want the math story (sigmoid, log-odds, likelihood), read Logistic Regression for Machine Learning using Python and cost function in logistic regression. For this tutorial, remember: coefficients tell direction, probabilities tell risk, and thresholds turn risk into yes/no decisions.
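
Here is that chain (coefficients → log-odds → probability → decision) as a tiny worked example. The intercept and coefficients are hypothetical values chosen for illustration, not numbers fitted on the Telco data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (not fitted values).
intercept = -1.0
coef_month_to_month = 1.5   # positive sign: raises churn risk
coef_tenure_scaled = -0.8   # negative sign: lowers churn risk

# One customer: month-to-month contract, tenure one std below the mean.
log_odds = intercept + coef_month_to_month * 1 + coef_tenure_scaled * (-1.0)
probability = sigmoid(log_odds)
decision = "Yes" if probability >= 0.5 else "No"

print(f"log-odds={log_odds:.2f}  probability={probability:.3f}  churn={decision}")
```

The sign of each coefficient gives the direction, the sigmoid turns the summed score into a risk, and the 0.5 cutoff (tunable in Step 10) turns that risk into a label.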


Step 8: Predict and Evaluate - Track A (Do Not Skip)

8.1 Class predictions and probabilities

y_pred = logistic_pipe.predict(X_test_sel)
y_proba = logistic_pipe.predict_proba(X_test_sel)[:, 1]

8.2 Classification report

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print(classification_report(y_test, y_pred, target_names=["No Churn", "Churn"]))
print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))

How to read this (short version):

  • Precision (Churn): of customers you flagged as churn, how many actually churned?
  • Recall (Churn): of all real churners, how many did you catch? (Sensitivity in our metrics blog)
  • F1: balance when classes are imbalanced
  • ROC-AUC: ranking quality across thresholds

8.3 Confusion matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN, FP],
#  [FN, TP]]

Map this to TP, TN, FP, FN exactly as in Fig-1 of our model evaluation metrics post. If your recall for churn is low, the model is missing customers who leave, which is often more expensive than a false alarm.
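
To connect the matrix to the report, this sketch derives the headline metrics by hand. The four counts are hypothetical, laid out the same way sklearn prints the matrix.

```python
# Hypothetical counts for illustration: [[TN, FP], [FN, TP]].
cm_example = [[900, 135], [120, 254]]
(tn, fp), (fn, tp) = cm_example

precision = tp / (tp + fp)   # of customers flagged as churn, share who churned
recall = tp / (tp + fn)      # of real churners, share the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Swap in the four numbers from your own cm and the results should agree with the Churn row of classification_report.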

8.4 Accuracy (with caution)

from sklearn.metrics import accuracy_score
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))

If accuracy is ~80% but recall for churn is ~55%, you have a model that looks "fine" but fails the business goal. That is why we teach metrics beyond accuracy.


Step 9: Compare a Second Model (Random Forest)

Logistic regression is a strong interpretable baseline. Random Forest often improves tabular performance with non-linear patterns.

from sklearn.ensemble import RandomForestClassifier

rf_pipe = Pipeline([
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=12,
        class_weight="balanced",
        random_state=42,
        n_jobs=-1,
    )),
])

rf_pipe.fit(X_train_sel, y_train)
rf_pred = rf_pipe.predict(X_test_sel)
rf_proba = rf_pipe.predict_proba(X_test_sel)[:, 1]

print("=== Random Forest ===")
print(classification_report(y_test, rf_pred, target_names=["No Churn", "Churn"]))
print("ROC-AUC:", round(roc_auc_score(y_test, rf_proba), 4))

Compare in a table (fill with your numbers):

Model | Accuracy | Churn Recall | Churn F1 | ROC-AUC
Logistic Regression | ? | ? | ? | ?
Random Forest | ? | ? | ? | ?

Pick the model that matches your business cost. If missing churn is expensive, optimize recall (possibly by lowering the probability threshold). It is the same tradeoff we illustrate with ROC curves in our metrics article.

9.1 Feature importance (Random Forest)

import pandas as pd

rf_model = rf_pipe.named_steps["model"]
importances = pd.Series(rf_model.feature_importances_, index=X_train_sel.columns)
print(importances.sort_values(ascending=False).head(15))

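Use this to sanity-check the model: if customerID sneaks in as a top feature, your pipeline has a bug. If is_month_to_month or a Contract dummy such as Contract_Two year ranks high, that matches business intuition from EDA, which is a good sign. (Contract_Month-to-month itself does not appear, because drop_first=True removes the first level.)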

9.2 Cross-validation (more honest than one lucky split)

A single train/test split can flatter or punish you by accident. Cross-validation trains on several folds and averages scores:

from sklearn.model_selection import cross_val_score

# Strictly, the selector should be refit inside each CV fold; here we cross-validate on the already-selected train matrix for simplicity
cv_scores = cross_val_score(
    rf_pipe, X_train_sel, y_train, cv=5, scoring="roc_auc", n_jobs=-1
)
print("CV ROC-AUC mean:", cv_scores.mean().round(4))
print("CV ROC-AUC std:", cv_scores.std().round(4))

Report mean ± std in your README. Low std means stable; high std means you need more data or simpler models.
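
The stricter version keeps the selector inside the pipeline, so every fold refits it on that fold's training part only. The sketch below shows the pattern on a synthetic stand-in from make_classification (40 columns, 8 informative, both assumptions). It swaps in f_classif as the score function purely to keep the demo fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn matrix.
X_cv, y_cv = make_classification(
    n_samples=1000, n_features=40, n_informative=8,
    weights=[0.73, 0.27], random_state=42,
)

# The selector is a pipeline step, so the held-out part of each fold
# never influences which columns survive.
leak_free_pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=15)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

leak_free_scores = cross_val_score(
    leak_free_pipe, X_cv, y_cv, cv=5, scoring="roc_auc"
)
print("CV ROC-AUC mean:", round(leak_free_scores.mean(), 4))
print("CV ROC-AUC std:", round(leak_free_scores.std(), 4))
```

To apply this to the tutorial data, pass the unselected X_train and y_train to cross_val_score with the same kind of pipeline.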


Step 10: Tune the Decision Threshold (Optional but Valuable)

Default threshold is 0.5. For churn, you may want 0.3 or 0.4 to catch more leavers.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Sweep cutoffs on the logistic regression probabilities from Step 8
thresholds = np.arange(0.2, 0.6, 0.05)
for t in thresholds:
    preds_t = (y_proba >= t).astype(int)
    print(
        f"threshold={t:.2f}  "
        f"precision={precision_score(y_test, preds_t, zero_division=0):.3f}  "
        f"recall={recall_score(y_test, preds_t, zero_division=0):.3f}"
    )

This connects directly to the cutoff analysis section in our evaluation metrics post, where we sweep probability cutoffs for telecom churn.
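
One way to choose among the swept cutoffs is to attach a cost to each error type and pick the cheapest threshold. The sketch below is a toy illustration: the probabilities, labels, and the two cost figures are all hypothetical placeholders, not model output or real business numbers.

```python
# Toy probabilities and labels for illustration (not model output).
toy_proba = [0.10, 0.25, 0.35, 0.45, 0.55, 0.70, 0.80, 0.30, 0.15, 0.60]
toy_true = [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]

# Hypothetical business costs: a missed churner costs 5x a wasted offer.
COST_FN = 50.0
COST_FP = 10.0

def total_cost(threshold: float) -> float:
    """Sum the cost of false negatives and false positives at a cutoff."""
    cost = 0.0
    for p, t in zip(toy_proba, toy_true):
        pred = 1 if p >= threshold else 0
        if t == 1 and pred == 0:
            cost += COST_FN   # churner we failed to flag
        elif t == 0 and pred == 1:
            cost += COST_FP   # loyal customer we flagged anyway
    return cost

candidates = [0.2, 0.3, 0.4, 0.5, 0.6]
costs = {t: total_cost(t) for t in candidates}
best_threshold = min(costs, key=costs.get)
print(costs)
print("lowest-cost threshold:", best_threshold)
```

With these toy numbers the cheapest cutoff falls below 0.5, because each missed churner is priced higher than a false alarm. Replace the lists with y_proba and y_test, and the costs with figures from your retention team, to repeat the exercise on real predictions.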


Step 11: Save the Model for Reuse

import joblib

best_model = rf_pipe  # or logistic_pipe, whichever you chose
joblib.dump(best_model, "models/churn_model.joblib")

Load later:

loaded = joblib.load("models/churn_model.joblib")
sample_pred = loaded.predict(X_test_sel.iloc[:5])  # the model was fit on the selected columns

Step 12: Inference on New Customers (Track A)

new_customer = X_test_sel.iloc[[0]]  # replace with real preprocessed row
prob_churn = loaded.predict_proba(new_customer)[0, 1]
print(f"Churn probability: {prob_churn:.2%}")

In production you would:

  1. Apply the same encoding steps (same dummy columns; missing columns = 0).
  2. Log predictions and outcomes.
  3. Retrain on a schedule.

Aligning raw rows to training columns

def align_features(raw_df: pd.DataFrame, training_cols: list) -> pd.DataFrame:
    encoded = pd.get_dummies(raw_df, drop_first=True)
    aligned = encoded.reindex(columns=training_cols, fill_value=0)
    return aligned

# Example after loading training_columns.json
# aligned = align_features(new_raw, training_columns)
# prob = loaded.predict_proba(aligned)[0, 1]

This function is the bridge between "data science notebook" and "backend API", and it is worth committing to your GitHub repo under src/preprocess.py.
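
One pitfall to watch when you score a single row: drop_first=True drops the first level of every categorical column, and with one row the only level present is the first one. The sketch below demonstrates this on a toy frame (two made-up columns, with hypothetical names such as training_cols_demo) and shows one possible workaround, not the only valid approach.

```python
import pandas as pd

# Toy training frame and one new raw row (illustrative columns only).
train_raw = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 24, 60],
})
training_cols_demo = list(pd.get_dummies(train_raw, drop_first=True).columns)

new_row = pd.DataFrame({"Contract": ["Two year"], "tenure": [12]})

# Pitfall: "Two year" is the only (and therefore first) level in this row,
# so drop_first=True removes it and reindex fills the column with 0.
wrong = pd.get_dummies(new_row, drop_first=True).reindex(
    columns=training_cols_demo, fill_value=0
)

# Safer sketch: keep every level for the new row, then let reindex discard
# any column the training matrix did not have.
right = pd.get_dummies(new_row, drop_first=False).reindex(
    columns=training_cols_demo, fill_value=0
)

print(training_cols_demo)
print("drop_first=True :", int(wrong["Contract_Two year"].iloc[0]))
print("drop_first=False:", int(right["Contract_Two year"].iloc[0]))
```

The same care applies to engineered columns: a raw row needs the Step 4 transformations before alignment, otherwise those columns are silently filled with 0.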


Plot ROC Curve (Visual Evaluation)

Numbers are mandatory; plots help you communicate to non-technical stakeholders.

from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("Logistic Regression - ROC Curve")
plt.show()

Compare both models on one chart:

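rf_proba = rf_pipe.predict_proba(X_test_sel)[:, 1]  # the forest was fit on the selected columns

# Draw both curves on a shared Axes so they appear on one chart
fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_test, rf_proba, name="Random Forest", ax=ax)
RocCurveDisplay.from_predictions(y_test, y_proba, name="Logistic Regression", ax=ax)
ax.legend()
plt.show()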

Tie this chart to the ROC explanation in our model evaluation metrics post, especially the tradeoff between sensitivity and false positive rate.


Track B: Few-Shot Churn With NucleusIQ Direct Mode

Track B solves churn without sklearn training, using only labeled examples in the prompt. Instead of wiring the raw OpenAI SDK, we use NucleusIQ Direct mode: one agent, one pass, low overhead, exactly what we document in the Direct mode beginner guide.

Why NucleusIQ here (vs raw API)?

  • Same provider portability as the rest of our stack (BaseOpenAI today, BaseGemini tomorrow).
  • Usage tracking on agent.last_usage (tokens per task).
  • Plugins later (ModelCallLimitPlugin when you batch-score many customers).
  • Same Agent + Task API you will reuse in Track C; only the execution mode changes.

B.1 Install NucleusIQ + provider

pip install nucleusiq nucleusiq-openai python-dotenv
# .env (never commit)
OPENAI_API_KEY=sk-...

B.2 Shared helpers (same as Track A)

import asyncio
import json
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

def row_to_customer_dict(row: pd.Series) -> dict:
    return {
        k: (None if pd.isna(v) else v)
        for k, v in row.to_dict().items()
        if k not in ("customerID", "customerId")
    }

def build_few_shot_examples(df_readable, y_series, n_per_class=3):
    examples = []
    for label, name in [(1, "Yes"), (0, "No")]:
        for i in y_series[y_series == label].index[:n_per_class]:
            examples.append({
                "customer": row_to_customer_dict(df_readable.loc[i]),
                "churn": name,
            })
    return examples

def parse_churn_label(text: str) -> str:
    t = (text or "").strip().lower()
    if t.startswith("yes"):
        return "Yes"
    if t.startswith("no"):
        return "No"
    return "No"  # conservative default for parsing

FEW_SHOT = build_few_shot_examples(df_train_readable, y_train, n_per_class=3)

B.3 Build the few-shot system prompt

def few_shot_system_block(examples: list) -> str:
    shots = "\n\n".join(
        f"Customer: {json.dumps(ex['customer'])}\nChurn: {ex['churn']}"
        for ex in examples
    )
    return f"""You are a telecom churn analyst.

Study these labeled examples:
{shots}

Rules:
- For each new customer, reply with exactly one word: Yes or No.
- Use only the fields in the customer JSON.
- Do not explain unless asked."""

FEW_SHOT_SYSTEM = few_shot_system_block(FEW_SHOT)

B.4 Create a NucleusIQ Direct mode classifier agent

from nucleusiq.agents import Agent
from nucleusiq.agents.config import AgentConfig, ExecutionMode
from nucleusiq.agents.task import Task
from nucleusiq.prompts.zero_shot import ZeroShotPrompt
from nucleusiq.plugins.builtin.model_call_limit import ModelCallLimitPlugin
from nucleusiq_openai import BaseOpenAI

def create_few_shot_churn_agent() -> Agent:
    return Agent(
        name="churn-few-shot",
        role="Churn classifier",
        objective="Classify telecom churn from customer profiles",
        prompt=ZeroShotPrompt().configure(
            system=FEW_SHOT_SYSTEM,
            user="Classify the customer in the task. Reply Yes or No only.",
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        config=AgentConfig(execution_mode=ExecutionMode.DIRECT),
        plugins=[ModelCallLimitPlugin(max_calls=3)],  # safety for batch loops
    )

few_shot_agent = create_few_shot_churn_agent()

ExecutionMode.DIRECT = fast single-pass classification with no tool loop. See choosing agent modes when you outgrow this.

B.5 Classify one customer (async)

async def predict_churn_nucleusiq(agent: Agent, customer: dict) -> str:
    await agent.initialize()
    result = await agent.execute(
        Task(
            id="churn-cls",
            objective=f"Customer JSON:\n{json.dumps(customer)}",
        )
    )
    return parse_churn_label(str(result.output))

# Example
sample = row_to_customer_dict(df_test_readable.iloc[0])
label = asyncio.run(predict_churn_nucleusiq(few_shot_agent, sample))
print("Predicted churn:", label)
print("Tokens:", few_shot_agent.last_usage.total.total_tokens)

B.6 Batch evaluate on the same test set as Track A

from sklearn.metrics import classification_report, accuracy_score

EVAL_N = 80
test_slice = df_test_readable.iloc[:EVAL_N]
y_true = y_test.iloc[:EVAL_N]

async def eval_few_shot_track():
    agent = create_few_shot_churn_agent()
    await agent.initialize()
    preds = []
    for _, row in test_slice.iterrows():
        customer = row_to_customer_dict(row)
        result = await agent.execute(
            Task(
                id=f"churn-{len(preds)}",
                objective=f"Customer JSON:\n{json.dumps(customer)}",
            )
        )
        preds.append(1 if parse_churn_label(str(result.output)) == "Yes" else 0)
    return preds

llm_preds = asyncio.run(eval_few_shot_track())
print("=== Track B: NucleusIQ Direct (few-shot) ===")
print(classification_report(y_true, llm_preds, target_names=["No Churn", "Churn"]))
print("Accuracy:", accuracy_score(y_true, llm_preds))

What you will often see

  • Few-shot via Direct mode can look reasonable on easy rows but miss imbalance nuance unless examples are balanced.
  • agent.last_usage helps you estimate cost per 1,000 classifications, which is still far more than sklearn at scale.
  • You still need classification_report on held-out data; NucleusIQ does not replace evaluation.
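
The cost estimate mentioned above is simple arithmetic. This sketch uses a hypothetical helper, estimate_cost_usd, and placeholder inputs: neither the token count nor the price comes from a real run or a real price list, and it treats input and output tokens as one blended rate for simplicity.

```python
def estimate_cost_usd(
    tokens_per_call: int,
    n_calls: int,
    usd_per_million_tokens: float,
) -> float:
    """Rough spend for n_calls classifications at a blended token price."""
    return tokens_per_call * n_calls * usd_per_million_tokens / 1_000_000

# Placeholder inputs: read tokens_per_call from agent.last_usage after a real
# run, and take the price from your provider's current pricing page.
example_cost = estimate_cost_usd(
    tokens_per_call=1_200, n_calls=1_000, usd_per_million_tokens=0.50
)
print(f"Estimated cost per 1,000 classifications: ${example_cost:.2f}")
```

Because the few-shot examples are resent with every call, tokens per call grows with n_per_class, so measure it again whenever you change the prompt.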

Track C: Churn Agent With NucleusIQ Standard Mode + Tools

Track C is the production pattern: your Track A model becomes a tool; NucleusIQ Standard mode runs the tool loop and writes the retention narrative. This matches our Standard mode with tools and plugins guides.

Prerequisite: complete Track A and save models/churn_model.joblib + selected_features.

C.1 Load ML artifacts (Track A output)

import joblib

loaded_model = joblib.load("models/churn_model.joblib")
# selected_features from Step 6: list of column names after SelectKBest

C.2 Register tools with @tool

NucleusIQ turns Python functions into agent tools automatically (schemas from type hints + docstrings):

from nucleusiq.tools.decorators import tool

def _score_churn_core(customer_json: str) -> str:
    """Shared scoring logic for tools and offline evaluation."""
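    row = pd.DataFrame([json.loads(customer_json)])
    # drop_first=False: a single row has only one level per column, and
    # drop_first=True would drop that level, zeroing every dummy after reindex.
    # Caveat: columns transformed in Step 4 (engineered features, Yes/No flags)
    # are not rebuilt here and fall back to 0; apply the Step 4 logic first
    # if you need scores that match Track A.
    encoded = pd.get_dummies(row, drop_first=False)
    aligned = encoded.reindex(columns=selected_features, fill_value=0)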
    proba = float(loaded_model.predict_proba(aligned)[0, 1])
    return json.dumps({
        "churn_probability": round(proba, 4),
        "churn_label": "Yes" if proba >= 0.4 else "No",
        "threshold": 0.4,
    })

@tool
def score_churn(customer_json: str) -> str:
    """Score churn probability using the trained sklearn model from Track A.

    Args:
        customer_json: JSON object of customer fields (same schema as training rows).
    """
    return _score_churn_core(customer_json)

@tool
def get_customer_profile(customer_json: str) -> str:
    """Return the customer profile JSON for display (no scoring).

    Args:
        customer_json: JSON object of customer fields.
    """
    return customer_json

The LLM should call score_churn for numbers and never invent probabilities in free text.

C.3 Build the Standard mode retention agent

from nucleusiq.plugins.builtin.model_call_limit import ModelCallLimitPlugin
from nucleusiq.plugins.builtin.tool_call_limit import ToolCallLimitPlugin

def create_churn_retention_agent() -> Agent:
    return Agent(
        name="churn-retention-agent",
        role="Retention analyst",
        objective="Assess churn and recommend one action",
        prompt=ZeroShotPrompt().configure(
            system=(
                "You help telecom retention teams. "
                "Always call score_churn before recommending action. "
                "Never guess churn probability; use the tool result only. "
                "Give one concrete retention step."
            ),
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        tools=[score_churn, get_customer_profile],
        config=AgentConfig(execution_mode=ExecutionMode.STANDARD),
        plugins=[
            ModelCallLimitPlugin(max_calls=8),
            ToolCallLimitPlugin(max_calls=5),
        ],
    )

ExecutionMode.STANDARD enables the multi-step tool loop: the agent can call score_churn, read the JSON, then respond. For high-stakes review workflows, you could move to Autonomous mode later; churn triage usually starts at Standard.

C.4 Run the agent for one customer

async def run_retention_agent(customer_row: pd.Series) -> str:
    agent = create_churn_retention_agent()
    await agent.initialize()
    profile = row_to_customer_dict(customer_row)
    result = await agent.execute(
        Task(
            id="retention-1",
            objective=(
                "Assess churn risk and suggest one retention action.\n"
                f"Customer profile JSON:\n{json.dumps(profile)}"
            ),
        )
    )
    print(f"Tool calls: {result.tool_call_count}")
    print(f"Tokens: {agent.last_usage.total.total_tokens}")
    return str(result.output)

print(asyncio.run(run_retention_agent(df_test_readable.iloc[0])))

C.5 Evaluate ML tool scores vs Track A (apples to apples)

The agent's marketing text is not your classification metric; the score_churn tool output is. For a fair comparison, call the same tool logic on test rows (or parse tool results from traces):

def score_churn_direct(customer_row: pd.Series) -> float:
    payload = json.loads(_score_churn_core(json.dumps(row_to_customer_dict(customer_row))))
    return payload["churn_probability"]

agent_probs = [
    score_churn_direct(df_test_readable.loc[i])
    for i in df_test_readable.iloc[:EVAL_N].index
]

from sklearn.metrics import roc_auc_score
print("Track C tool ROC-AUC (should match Track A):",
      roc_auc_score(y_true, agent_probs))

If ROC-AUC matches Track A, your tool wiring is correct: the agent layer adds workflow, not a new model.
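A stricter check than comparing two ROC-AUC numbers is to compare the probabilities row by row. The sketch below is self-contained and makes several assumptions: it uses synthetic data instead of the Telco CSV, a plain logistic regression instead of the saved Track A model, and a hypothetical `score_tool` function that only imitates the JSON-in, JSON-out shape of the tutorial's tool core.

```python
import json

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn table (assumption, not the tutorial data).
X_arr, y = make_classification(n_samples=400, n_features=6, random_state=42)
cols = [f"f{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=cols)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def score_tool(customer_json: str) -> str:
    """Hypothetical tool body: JSON profile in, JSON with churn_probability out."""
    row = pd.DataFrame([json.loads(customer_json)])[cols]
    prob = float(model.predict_proba(row)[0, 1])
    return json.dumps({"churn_probability": prob})


# Path 1: score the test set directly. Path 2: go through the JSON tool per row.
direct_probs = model.predict_proba(X_test)[:, 1]
tool_probs = np.array([
    json.loads(
        score_tool(json.dumps({k: float(v) for k, v in row.items()}))
    )["churn_probability"]
    for _, row in X_test.iterrows()
])

wiring_ok = bool(np.allclose(direct_probs, tool_probs))
auc_direct = roc_auc_score(y_test, direct_probs)
auc_tool = roc_auc_score(y_test, tool_probs)
print("probabilities match:", wiring_ok)
print("ROC-AUC direct vs tool:", auc_direct, auc_tool)
```

If the probabilities match but the ROC-AUC does not, the likely culprit is a mismatch between the rows scored and the labels in y_true, not the model.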

C.6 Optional: stream tool visibility

For demos and debugging, use execute_stream() as in our Standard mode guide:

from nucleusiq.streaming.events import StreamEventType

async def stream_retention(customer_row: pd.Series):
    agent = create_churn_retention_agent()
    await agent.initialize()
    profile = row_to_customer_dict(customer_row)
    async for event in agent.execute_stream(
        Task(id="ret-s", objective=f"Score and recommend:\n{json.dumps(profile)}")
    ):
        if event.type == StreamEventType.TOOL_CALL_START:
            print("Tool:", event.data.get("tool_name"))
        elif event.type == StreamEventType.TOKEN:
            print(event.data.get("content", ""), end="", flush=True)

NucleusIQ Track Picker (Quick Reference)

| Goal | NucleusIQ mode | This tutorial |
|------|----------------|---------------|
| Few-shot classify one JSON profile | Direct | Track B |
| Tool loop + ML model + narrative | Standard | Track C |
| Verify + revise high-stakes analysis | Autonomous | Not needed for first churn lab |

Docs: NucleusIQ docs · GitHub


Head-to-Head: sklearn vs NucleusIQ Direct vs NucleusIQ Standard (Same Test Customers)

Fill this table after running all tracks on the same EVAL_N rows:

| Criterion | Track A: sklearn | Track B: NucleusIQ Direct | Track C: NucleusIQ Standard |
|-----------|------------------|---------------------------|-----------------------------|
| Framework | scikit-learn | NucleusIQ ExecutionMode.DIRECT | NucleusIQ ExecutionMode.STANDARD + @tool |
| Trains on data | Yes (weights) | No (few-shot in prompt) | ML tool uses Track A model |
| Reproducible | High (fixed seed) | Medium (prompt/version drift) | High for scores; medium for narrative |
| Cost at 1M rows/day | Low | Very high (LLM per row) | High (LLM + cheap ML tool) |
| Probabilities | Native | No (Yes/No only) | From score_churn tool |
| Guardrails | You build | ModelCallLimitPlugin | Call limits + tool limits |
| Best for | Batch scoring, regulation | Label-scarce prototypes | CRM workflow + retention copy |
| Metrics you trust | ROC-AUC, F1 | Same F1 on test set | Tool ROC-AUC + human review of text |

Decision guide (2026)

  • Deploy Track A for nightly churn scores in a data pipeline.
  • Use Track B (NucleusIQ Direct) to prototype when labels are scarce, then measure on held-out data.
  • Use Track C (NucleusIQ Standard) when retention teams need tool-backed scores plus natural-language actions, after Track A is saved.
Wrong:  "We replaced sklearn with NucleusIQ."
Right:  "sklearn scores customers; NucleusIQ runs few-shot experiments and agent workflows."

This is the combined learning promise from our 2026 roadmap: same problem statement, different tools, measured the same way.


Optional: NucleusIQ Direct Explains Track A (Narration Only)

After Track A metrics look good, use another Direct mode agent to explain the prediction; the probability still comes from sklearn:

def create_explainer_agent() -> Agent:
    return Agent(
        name="churn-explainer",
        role="Analyst",
        objective="Explain model output",
        prompt=ZeroShotPrompt().configure(
            system=(
                "Explain churn risk in 3 short bullets. "
                "Use ONLY the customer JSON and the model probability provided. "
                "Do not invent fields."
            ),
        ),
        llm=BaseOpenAI(model_name="gpt-4o-mini"),
        config=AgentConfig(execution_mode=ExecutionMode.DIRECT),
    )

async def explain_prediction(customer: dict, probability: float) -> str:
    agent = create_explainer_agent()
    await agent.initialize()
    result = await agent.execute(
        Task(
            id="explain-1",
            objective=(
                f"Customer:\n{json.dumps(customer)}\n"
                f"Model churn probability: {probability:.2%}"
            ),
        )
    )
    return str(result.output)

row = df_test_readable.iloc[0]
p = loaded.predict_proba(X_test_sel.iloc[[0]])[0, 1]
print(asyncio.run(explain_prediction(row_to_customer_dict(row), p)))

Project Structure for GitHub

Publish a small repo so readers can run the tutorial end-to-end:

ml-beginners-python/
├── README.md
├── requirements.txt
├── .env.example
├── data/
│   └── telco_churn.csv
├── notebooks/
│   └── 01_churn_ml_vs_llm.ipynb
├── src/
│   ├── train.py                  # Track A: FE + selection + sklearn
│   ├── nucleusiq_few_shot.py     # Track B: Direct mode
│   ├── nucleusiq_agent_churn.py  # Track C: Standard + tools
│   └── compare_tracks.py         # same test metrics table
└── models/
    ├── churn_model.joblib
    ├── feature_selector.joblib
    └── training_columns.json

requirements.txt:

pandas>=2.0
numpy>=1.24
scikit-learn>=1.3
matplotlib>=3.7
seaborn>=0.13
joblib>=1.3
python-dotenv>=1.0
nucleusiq>=0.6.0
nucleusiq-openai>=0.6.0

src/train.py skeleton (move the tutorial code into functions):

"""Train churn classifier. Run: python src/train.py"""
from pathlib import Path
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

DATA = Path("data/telco_churn.csv")
MODEL_OUT = Path("models/churn_model.joblib")

def load_and_prepare(path: Path):
    df = pd.read_csv(path)
    # ... same cleaning as tutorial ...
    return X, y

def main():
    X, y = load_and_prepare(DATA)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    pipe = Pipeline([("model", RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42))])
    pipe.fit(X_train, y_train)
    proba = pipe.predict_proba(X_test)[:, 1]
    preds = pipe.predict(X_test)
    print(classification_report(y_test, preds))
    print("ROC-AUC:", roc_auc_score(y_test, proba))
    MODEL_OUT.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, MODEL_OUT)

if __name__ == "__main__":
    main()
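The skeleton ends at joblib.dump, so the natural next step is loading the saved pipeline and scoring new rows. Here is a minimal sketch of that round trip, assuming synthetic data and a temporary directory in place of data/telco_churn.csv and models/churn_model.joblib; the smaller forest is only there to keep the example fast.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the prepared churn features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
pipe = Pipeline([
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X, y)

# Save and reload, as a scoring job would do in a separate process.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "churn_model.joblib"
    joblib.dump(pipe, model_path)
    loaded = joblib.load(model_path)

# Score a few "new" rows with the reloaded pipeline.
new_rows = X[:5]
loaded_probs = loaded.predict_proba(new_rows)[:, 1]
same_scores = bool(np.allclose(loaded_probs, pipe.predict_proba(new_rows)[:, 1]))
print("loaded model reproduces scores:", same_scores)
```

Because the whole Pipeline is saved, any preprocessing steps you add to it later travel with the model file. Load joblib files only from sources you trust, and with a scikit-learn version compatible with the one used for training.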

Star or fork: https://github.com/nucleusbox/ml-beginners-python (replace with your live URL when the repo is public).


Common Errors Beginners Hit (And Fixes)

| Error | Symptom | Fix |
|-------|---------|-----|
| Fit scaler on full data | Test scores “too good” | Use Pipeline |
| Different columns at inference | ValueError on predict | Save training column list; align dummies |
| Ignoring imbalance | High accuracy, bad churn recall | class_weight, threshold tuning, F1 |
| Data leakage | Near-perfect test score | Audit features; time-based split if needed |
| Only reading AI explanations | Confident wrong story | Always print classification_report first |
| Skipping await agent.initialize() | NucleusIQ runtime errors | Call initialize() before execute() |
| Few-shot examples from test set | Inflated Track B scores | Build FEW_SHOT from df_train_readable only |
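The "different columns at inference" fix is the one beginners most often ask to see in code. The sketch below uses a tiny made-up frame with Contract and tenure columns (illustrative values, not rows from the dataset) and keeps the saved column list in a JSON string, where the project would write models/training_columns.json.

```python
import json

import pandas as pd

# Training data sees all three contract types.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 24, 60],
})
X_train = pd.get_dummies(train, columns=["Contract"], dtype=int)
training_columns = list(X_train.columns)
saved = json.dumps(training_columns)  # stand-in for training_columns.json

# An inference batch may contain only one category, so get_dummies
# produces fewer columns than the model was trained on.
new = pd.DataFrame({"Contract": ["One year"], "tenure": [12]})
X_new = pd.get_dummies(new, columns=["Contract"], dtype=int)

# Align to the training layout: missing dummies become 0,
# unseen extra columns are dropped, and the order is restored.
X_new = X_new.reindex(columns=json.loads(saved), fill_value=0)
print(X_new)
```

The reindex call is what prevents the ValueError: the model always receives the same columns in the same order, whatever categories happen to appear in a given batch.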

How This Tutorial Fits the Nucleusbox ML Series

| You are here | Next depth on Nucleusbox |
|--------------|--------------------------|
| This tutorial: full sklearn workflow | Building logistic regression in Python: telecom merge, dummies, VIF |
| Evaluation section above | Model evaluation metrics: sensitivity, ROC, cutoffs |
| Theory craving | Logistic regression for ML using Python: sigmoid, MLE |
| Real-world motivation | Logistic regression applications |
| Bigger 2026 path | Start ML from scratch in 2026 |

Practice Exercises (Do These Before Moving On)

  1. Feature engineering: add a MonthlyCharges * tenure interaction. Does ROC-AUC improve after selection?
  2. Feature selection: compare K=20 vs K=50 with cross-validation. Which is more stable?
  3. Track B: increase few-shot examples to 5 per class. Does LLM recall improve on the same 80 test rows?
  4. Track B vs A: on rows where ML is correct and LLM is wrong, what pattern do you see?
  5. Track C: add a draft_retention_email tool that only runs if churn_probability > 0.5.
  6. Plot the ROC curve for Track A (see our metrics post).
  7. Portfolio README: fill the head-to-head table with real numbers and state which track you would deploy in production.
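As a possible starting point for exercise 2, the sketch below compares two values of k with stratified cross-validation. It rests on assumptions: the data is synthetic with only 30 features, so it uses k=5 and k=20 rather than 20 and 50, and f_classif is a stand-in score function that may differ from the one you used in Track A. Selection sits inside the Pipeline so each fold selects features from its own training part only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the engineered churn features (assumption).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=6,
    weights=[0.73, 0.27], random_state=42,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for k in (5, 20):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    # Scaler and selector are refit inside every fold, so no fold leaks into another.
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    results[k] = (scores.mean(), scores.std())
    print(f"k={k}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Read "more stable" as the smaller standard deviation across folds, then weigh it against the mean; on the real data, swap in your own feature matrix and the K values from the exercise.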

From Tutorial to Portfolio README (Template)

Paste this into your GitHub README.md and fill the blanks:

## Telco Churn: ML vs Few-Shot vs Agent (Same Test Set)

**Problem:** Predict whether a customer will churn.
**Data:** IBM Telco Customer Churn (OpenML #42178).
**Track A (sklearn):** Feature engineering + SelectKBest (k=35) + Random Forest.
**Track B (NucleusIQ Direct):** 3 examples per class, N=80 test rows.
**Track C (NucleusIQ Standard):** score_churn `@tool` + retention recommendation.

| Track | Churn Recall | F1 (Churn) | Notes |
|-------|--------------|------------|-------|
| A: ML | ___ | ___ | Production candidate |
| B: Few-shot | ___ | ___ | Prototype / expensive at scale |
| C: Agent tool | ___ | ___ | Same scores as A; LLM for UX |

**Decision:** Deploy Track A for scoring; Track C for CRM workflow.
**Limitations:** Class imbalance; LLM eval subset; no causal claims.

This is what hiring managers skim in thirty seconds.
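One way to fill the blanks in the README table is a single helper that every track passes through, so the numbers are computed identically. The sketch below is an assumption about how src/compare_tracks.py could look, not its actual contents; `summarize_track` is a hypothetical name, and the labels are toy values for illustration, not results from this tutorial.

```python
from typing import Optional, Sequence

from sklearn.metrics import f1_score, recall_score, roc_auc_score


def summarize_track(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    y_prob: Optional[Sequence[float]] = None,
) -> dict:
    """Churn recall and F1 from hard labels; ROC-AUC only when probabilities exist."""
    summary = {
        "churn_recall": recall_score(y_true, y_pred, pos_label=1),
        "f1_churn": f1_score(y_true, y_pred, pos_label=1),
        "roc_auc": None,
    }
    if y_prob is not None:
        summary["roc_auc"] = roc_auc_score(y_true, y_prob)
    return summary


# Toy values only (1 = churn).
y_true = [1, 0, 1, 0, 1, 0]
track_a_prob = [0.9, 0.2, 0.6, 0.4, 0.3, 0.1]          # model probabilities
track_a_pred = [int(p >= 0.5) for p in track_a_prob]    # 0.5 threshold
track_b_pred = [1, 0, 1, 1, 0, 0]                       # parsed Yes/No answers

a = summarize_track(y_true, track_a_pred, track_a_prob)
b = summarize_track(y_true, track_b_pred)
print("Track A:", a)
print("Track B:", b)
```

Track B reports no ROC-AUC because it returns Yes/No labels only, which is why the README table compares tracks on recall and F1.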


How This Tutorial Connects to “Learn ML + AI Together”

Our 2026 scratch roadmap puts Weeks 3–4 on sklearn and Weeks 7–8 on Gen AI layers. This single tutorial merges both on one churn dataset so you feel the difference immediately, not six months apart.

You now have evidence for three statements interviewers like:

  1. “I can engineer features and select them without leakage.”
  2. “I can evaluate any classifier, including an LLM, with precision and recall on a held-out set.”
  3. “I know when to use agents for workflow, not as a replacement for ML scores.”

FAQ

Is this tutorial enough to get a job?

It is one portfolio-quality baseline project. Combine it with the hybrid projects in our 2026 roadmap (ML + LLM layer) and ideas from Top 7 AI Projects for High-Paying Jobs.

Why logistic regression and random forest?

Logistic regression teaches linear baselines and probabilities. Random forest teaches non-linear tabular strength without neural network complexity, which makes the pair ideal for beginners in 2026.

Do I need TensorFlow or PyTorch here?

Not for this tutorial. Master sklearn pipelines first; add deep learning when your problem is images, audio, or large unstructured text.

Where is the Jupyter notebook?

Clone the GitHub repo: notebooks/01_churn_ml_vs_llm.ipynb mirrors every step in this post. You can also run src/train.py from the terminal without Jupyter.

How is this different from the older Nucleusbox churn posts?

Our building logistic regression in Python walkthrough merges three CSV files manually, builds dummies column by column, and steps through VIF and statsmodels, which is excellent for depth. This tutorial teaches the same churn problem with sklearn pipelines so you can ship faster and avoid copy-paste errors between train and test. Do both: this post for execution speed, the older post for statistical intuition.

Can I use AI to write this code for me?

Yes, as long as you run every cell, change one parameter, and explain the metrics in your own words. Copilots accelerate boilerplate; they do not replace confusion matrices. That is the same “build + understand” rule from our 2026 roadmap.

Should I skip sklearn and only use NucleusIQ Direct for churn?

Only for a prototype. Production churn pipelines need Track A for cost, latency, and auditability. Use NucleusIQ Direct (Track B) to experiment with few-shot examples; use Standard (Track C) for retention workflows, not to skip ML evaluation.

What is the difference between NucleusIQ Direct and Standard here?

Direct = one pass, few-shot classification (Track B). Standard = tool loop where score_churn returns probabilities from sklearn (Track C). Both use the same Agent class with a different ExecutionMode in AgentConfig.

Do I need the raw OpenAI SDK?

Not for Tracks B and C. Install nucleusiq + nucleusiq-openai (or another provider package). Swap BaseOpenAI for BaseGemini without rewriting agent logic; see Why NucleusIQ?.


Summary

You learned a complete Python machine learning tutorial for beginners, plus how it differs from solving the same churn problem with AI:

Track A: Classical ML

  1. Load and explore data
  2. Feature engineering (tenure buckets, service counts, contract flags)
  3. Feature selection (SelectKBest on train only)
  4. Split train/test with stratify
  5. Train logistic regression and random forest on selected features
  6. Evaluate with recall, precision, F1, ROC-AUC
  7. Save model + selector for production inference

Track B: NucleusIQ Direct (few-shot)

  1. Build labeled examples from train only
  2. Configure ZeroShotPrompt + ExecutionMode.DIRECT
  3. Batch-classify test customers; score with the same metrics as Track A

Track C: NucleusIQ Standard (tools)

  1. Register score_churn with @tool (wraps Track A model)
  2. Run Standard mode agent with plugins: scores from tools, copy from LLM

Continue with model evaluation metrics, the 2026 ML + AI roadmap, NucleusIQ Direct mode, and Standard mode with tools.


Written by Nucleusbox. More tutorials: Machine Learning archive. Code: GitHub (ml-beginners-python).

That’s it for this tutorial. If you have any questions or suggestions, please leave a comment. More topics on Machine Learning and Data Engineering are coming soon; if you like this work, please subscribe.
