Predicting High-Opportunity Customers (Top 25%)

Jennifer Smith

June 23, 2022

Project Overview

This notebook builds a repeatable framework for preparing customer data for modeling, building and evaluating several models, and performing parameter tuning on two of them. Additional comments are included throughout the steps below.

This dataset was sourced from Kaggle and contains customer demographics and purchase information for a presumably hypothetical grocery-type store. I decided to see whether I could build a predictive model that correctly identifies high-opportunity customers, which for this project I define as the top 25% of customers by total spend. Such a model would allow a business to focus its marketing efforts and dollars on the high-opportunity customers it identifies, improving ROI and revenue.

The dataset was retrieved from the link below. It is tab separated and didn't import correctly at first, so I did a text-to-columns edit in Excel to fix the issue before importing it into my Jupyter notebook. https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
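As an alternative to the Excel text-to-columns fix, pandas can usually read a tab-separated file directly. A minimal sketch, assuming the file keeps its original tab-delimited form from Kaggle (not a cell from the original run):

import pandas as pd

# Telling pandas the delimiter avoids the manual text-to-columns step in Excel.
marketing_data = pd.read_csv('marketing_campaign.csv', sep='\t')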

Target Feature

The top 25% flag is not included in the dataset, so one of my first steps was to identify those customers and record them in a new binary column. That new column, 'Top25%', is my target.

Predictor Features

My predictor features are the demographic and behavioral columns: essentially everything else in the dataset except the columns beginning with 'Mnt', because those are category-level sales amounts that are already included in the total used to define the target.

Data Preparation

Before building my models, I performed the following data preparation tasks.

  • Filled in missing incomes.
    • I looked at averages by education and marital status, but since they were all close to the overall mean and median, I chose the overall median for simplicity (a sketch of that comparison appears after this list).
  • Created the target column.
  • Created a column with the number of days a customer has been active.
  • Converted the Education and Marital_Status columns to numerical categories.
  • Deleted unneeded columns: the customer ID, a couple of columns with the same value throughout, and the original date column.
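The income comparison mentioned in the first bullet isn't shown as a cell in this notebook; here is a minimal sketch of what it might look like, assuming the raw data is loaded the same way as in the cells further below:

import pandas as pd

marketing_data = pd.read_csv('marketing_campaign.csv')  # same file used later in the notebook

# Compare group medians to the overall mean/median before choosing a fill value.
print('Overall mean / median:', marketing_data['Income'].mean(), marketing_data['Income'].median())
print(marketing_data.groupby('Education')['Income'].median())
print(marketing_data.groupby('Marital_Status')['Income'].median())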
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

pd.options.display.max_columns = None

Data Prep Function

Fills in missing incomes, drops unnecessary columns, converts categorical columns to numerical values, adds a column with the number of days as a customer, adds a binary column identifying customers in the top 25%, splits the data into train and test groups, identifies the predictor and target columns, and finally creates scaled versions of the train and test predictors.

In [2]:
def prepare_customer_data(data):

    # filled missing incomes with median
    data['Income'] = data['Income'].fillna(data['Income'].median())

    # remove unnecessary columns
    data_prep = data.drop(columns=['ID','Z_CostContact','Z_Revenue'])

    # replace Education with numeric categories
    data_prep['Education'] = data_prep['Education'].replace(to_replace={'Basic':0, 'Graduation':1,'Master':2, '2n Cycle':3, 'PhD':4})

    # replace Marital Status with numeric categories
    data_prep['Marital_Status'] = data_prep['Marital_Status'].replace(to_replace={'Single':0, 'Alone':0, 'YOLO':0, 'Absurd':0, 'Together':1,'Married':2, 'Divorced':3, 'Widow':4})

    # convert date column to Pandas datetime & add new column with integer number of days (models don't work with datetime)
    data_prep['Dt_Customer'] = pd.to_datetime(data_prep['Dt_Customer'])
    data_prep['DaysCust'] = (data_prep['Dt_Customer'].max() - data_prep['Dt_Customer']).dt.days.astype('int16')

    # remove original Dt_Customer column
    data_prep = data_prep.drop(columns='Dt_Customer')

    # add column classifying Top 25% customers (high value customers)
    data_prep['Top25%'] = np.where(data_prep['MntTotal'] >= data_prep['MntTotal'].quantile(q=0.75), 1, 0)

    # list of column names for predictors
    predictors = ['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
           'Teenhome', 'Recency','NumDealsPurchases', 'NumWebPurchases',
           'NumCatalogPurchases', 'NumStorePurchases', 'NumTotalPurchases',
           'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5',
           'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response', 'DaysCust']

    # column name for target
    target = 'Top25%'
    
    # split into a training and testing set
    train, test = train_test_split(data_prep)
    
    # create scaled train and test sets
    ss = StandardScaler()
    ss.fit(train[predictors])
    scaled_train = ss.transform(train[predictors])
    scaled_test = ss.transform(test[predictors])

    return train, test, predictors, target, scaled_train, scaled_test

Model creation & evaluation function

The model function is built to use the scaled data, because most models improved once the data was standardized.

In [3]:
def create_scaled_model(model_type,train_data,test_data,scaled_train_data,scaled_test_data,predictors,target):

    clf = model_type
    clf.fit(scaled_train_data, train_data[target])
    predictions = clf.predict(scaled_test_data)
    accuracy = metrics.accuracy_score(test_data[target], predictions)
    cm = metrics.confusion_matrix(test_data[target], predictions)
    sns.heatmap(cm, annot=True, fmt='.0f')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Top25%')
    class1_error_rate = cm[1][0] / (cm[1][0] + cm[1][1])  # false negatives / actual positives
    class0_error_rate = cm[0][1] / (cm[0][1] + cm[0][0])  # false positives / actual negatives
    print(model_type)
    print('Accuracy:',accuracy)
    print('Class 1 Error Rate:', class1_error_rate, '(i.e., wrong on customers who are in top 25%)')
    print('Class 0 Error Rate:', class0_error_rate, '(i.e., wrong on customers who are not in top 25%)')
    plt.show()
    return class1_error_rate, class0_error_rate, accuracy

Data loaded and models run

I ran four types of predictive models - KNeighborsClassifier, DecisionTreeClassifier, LogisticRegression, and RandomForestClassifier - each first with default settings and then with its weighting parameter set to the option that produced the best results (weights='distance' for KNeighborsClassifier, class_weight='balanced' for the others). I thought these would be good starting points for trying some parameter tuning later in the notebook.

For the purpose of this experiment, I decided to focus primarily on the Class 1 Error Rate, on the assumption that the current business priority is to identify as many high-opportunity customers (the top 25%) as possible with the goal of optimizing revenue. With that in mind, the best-performing model of those tried here was the LogisticRegression model with balanced class weights, which consistently produces a single-digit (percentage) Class 1 Error Rate.
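For reference, the Class 1 Error Rate computed in the function above is equivalent to 1 minus recall on the positive class. A minimal cross-check using scikit-learn's recall_score, assuming a fitted classifier clf plus the test and scaled_test objects produced in the cells below (not a cell from the original run):

from sklearn.metrics import recall_score

# Class 1 Error Rate = 1 - recall on the positive ('Top25%') class.
# clf, scaled_test, and test are assumed to exist from the surrounding cells.
predictions = clf.predict(scaled_test)
class1_error_rate = 1 - recall_score(test['Top25%'], predictions)
print('Class 1 Error Rate (cross-check):', class1_error_rate)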

In [11]:
marketing_data = pd.read_csv('marketing_campaign.csv')
train, test, predictors, target, scaled_train, scaled_test = prepare_customer_data(marketing_data)

create_scaled_model(KNeighborsClassifier(),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(KNeighborsClassifier(weights='distance'),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(DecisionTreeClassifier(),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(DecisionTreeClassifier(class_weight='balanced'),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(LogisticRegression(),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(LogisticRegression(class_weight='balanced'),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(RandomForestClassifier(),train,test,scaled_train,scaled_test,predictors,target)
create_scaled_model(RandomForestClassifier(class_weight='balanced'),train,test,scaled_train,scaled_test,predictors,target)
KNeighborsClassifier()
Accuracy: 0.8714285714285714
Class 1 Error Rate: 0.2440944881889764 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09468822170900693 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(weights='distance')
Accuracy: 0.8857142857142857
Class 1 Error Rate: 0.2283464566929134 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.08083140877598152 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier()
Accuracy: 0.9053571428571429
Class 1 Error Rate: 0.14960629921259844 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.07852193995381063 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced')
Accuracy: 0.8821428571428571
Class 1 Error Rate: 0.2047244094488189 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09237875288683603 (i.e., wrong on customers who are not in top 25%)
LogisticRegression()
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.2283464566929134 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.08775981524249422 (i.e., wrong on customers who are not in top 25%)
LogisticRegression(class_weight='balanced')
Accuracy: 0.8642857142857143
Class 1 Error Rate: 0.06299212598425197 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.15704387990762125 (i.e., wrong on customers who are not in top 25%)
RandomForestClassifier()
Accuracy: 0.9214285714285714
Class 1 Error Rate: 0.11023622047244094 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.06928406466512702 (i.e., wrong on customers who are not in top 25%)
RandomForestClassifier(class_weight='balanced')
Accuracy: 0.9214285714285714
Class 1 Error Rate: 0.11811023622047244 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.06697459584295612 (i.e., wrong on customers who are not in top 25%)
Out[11]:
(0.11811023622047244, 0.06697459584295612, 0.9214285714285714)
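As a quick illustration of the earlier note that most models improved once the data was standardized, the same KNeighbors model can be fit on the unscaled predictors for comparison. A minimal sketch using the train/test objects returned by prepare_customer_data above (not part of the original run):

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Without standardization, KNN distances are dominated by wide-range features
# such as Income, which typically hurts accuracy relative to the scaled run.
clf_raw = KNeighborsClassifier()
clf_raw.fit(train[predictors], train[target])
raw_predictions = clf_raw.predict(test[predictors])
print('Unscaled KNN accuracy:', metrics.accuracy_score(test[target], raw_predictions))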

Parameter Tuning Experiment 1: K Neighbors

First I tried a loop with the KNeighborsClassifier model, trying every odd value of k from 1 to 19. I then plotted both the error rates and the overall accuracy as line graphs.

In [5]:
k_neighbors = [1,3,5,7,9,11,13,15,17,19]
c1er_list = []
c0er_list = []
accuracy_list = []

for k in k_neighbors:
    c1er, c0er, accuracy =create_scaled_model(KNeighborsClassifier(weights='distance',n_neighbors=k),train,test,scaled_train,scaled_test,predictors,target)
    c1er_list.append(c1er)
    c0er_list.append(c0er)
    accuracy_list.append(accuracy)
KNeighborsClassifier(n_neighbors=1, weights='distance')
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.291970802919708 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.06382978723404255 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=3, weights='distance')
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.30656934306569344 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.0591016548463357 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(weights='distance')
Accuracy: 0.8964285714285715
Class 1 Error Rate: 0.2773722627737226 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.04728132387706856 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=7, weights='distance')
Accuracy: 0.875
Class 1 Error Rate: 0.3357664233576642 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.05673758865248227 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=9, weights='distance')
Accuracy: 0.8857142857142857
Class 1 Error Rate: 0.3284671532846715 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.04491725768321513 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=11, weights='distance')
Accuracy: 0.8928571428571429
Class 1 Error Rate: 0.30656934306569344 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.0425531914893617 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=13, weights='distance')
Accuracy: 0.8892857142857142
Class 1 Error Rate: 0.31386861313868614 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.04491725768321513 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=15, weights='distance')
Accuracy: 0.8875
Class 1 Error Rate: 0.3284671532846715 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.0425531914893617 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=17, weights='distance')
Accuracy: 0.8857142857142857
Class 1 Error Rate: 0.31386861313868614 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.04964539007092199 (i.e., wrong on customers who are not in top 25%)
KNeighborsClassifier(n_neighbors=19, weights='distance')
Accuracy: 0.8839285714285714
Class 1 Error Rate: 0.3284671532846715 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.04728132387706856 (i.e., wrong on customers who are not in top 25%)
In [6]:
plt.plot(k_neighbors,c1er_list,label='Class 1 Error Rate')
plt.plot(k_neighbors,c0er_list,label='Class 0 Error Rate')
plt.title('K Neighbors Tuning Experiment: Error Rates')
plt.xlabel('n_neighbors (k) =')
plt.legend()
plt.show()

plt.plot(k_neighbors,accuracy_list,label='Accuracy')
plt.title('K Neighbors Tuning Experiment: Overall Accuracy')
plt.xlabel('n_neighbors (k) =')
plt.show()

Parameter Tuning Experiment 2: Decision Tree Max Depth

Next I tried a loop with my DecisionTree model with max_depth set to every value from 1 to 20, again plotting the results as line graphs. This experiment produced much more interesting results, which I comment on at the end of the notebook.

In [9]:
depth = list(range(1, 21))
c1er_list = []
c0er_list = []
accuracy_list = []

for d in depth:
    c1er, c0er, accuracy =create_scaled_model(DecisionTreeClassifier(class_weight='balanced',max_depth = d),train,test,scaled_train,scaled_test,predictors,target)
    c1er_list.append(c1er)
    c0er_list.append(c0er)
    accuracy_list.append(accuracy)
DecisionTreeClassifier(class_weight='balanced', max_depth=1)
Accuracy: 0.8142857142857143
Class 1 Error Rate: 0.08759124087591241 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.21749408983451538 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=2)
Accuracy: 0.8821428571428571
Class 1 Error Rate: 0.0948905109489051 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.12529550827423167 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=3)
Accuracy: 0.8821428571428571
Class 1 Error Rate: 0.06569343065693431 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.1347517730496454 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=4)
Accuracy: 0.8875
Class 1 Error Rate: 0.08759124087591241 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.12056737588652482 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=5)
Accuracy: 0.8928571428571429
Class 1 Error Rate: 0.13138686131386862 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09929078014184398 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=6)
Accuracy: 0.8625
Class 1 Error Rate: 0.11678832116788321 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.14420803782505912 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=7)
Accuracy: 0.8946428571428572
Class 1 Error Rate: 0.12408759124087591 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09929078014184398 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=8)
Accuracy: 0.8875
Class 1 Error Rate: 0.17518248175182483 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09219858156028368 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=9)
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.20437956204379562 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09219858156028368 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=10)
Accuracy: 0.875
Class 1 Error Rate: 0.21897810218978103 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09456264775413711 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=11)
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.21897810218978103 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.08747044917257683 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=12)
Accuracy: 0.8767857142857143
Class 1 Error Rate: 0.21897810218978103 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.09219858156028368 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=13)
Accuracy: 0.8785714285714286
Class 1 Error Rate: 0.24087591240875914 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.08274231678486997 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=14)
Accuracy: 0.8875
Class 1 Error Rate: 0.22627737226277372 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.07565011820330969 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=15)
Accuracy: 0.8821428571428571
Class 1 Error Rate: 0.31386861313868614 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.054373522458628844 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=16)
Accuracy: 0.8821428571428571
Class 1 Error Rate: 0.2773722627737226 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.06619385342789598 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=17)
Accuracy: 0.8875
Class 1 Error Rate: 0.2846715328467153 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.05673758865248227 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=18)
Accuracy: 0.8767857142857143
Class 1 Error Rate: 0.32116788321167883 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.0591016548463357 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=19)
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.30656934306569344 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.0591016548463357 (i.e., wrong on customers who are not in top 25%)
DecisionTreeClassifier(class_weight='balanced', max_depth=20)
Accuracy: 0.8803571428571428
Class 1 Error Rate: 0.291970802919708 (i.e., wrong on customers who are in top 25%)
Class 0 Error Rate: 0.06382978723404255 (i.e., wrong on customers who are not in top 25%)
In [10]:
plt.plot(depth,c1er_list,label='Class 1 Error Rate')
plt.plot(depth,c0er_list,label='Class 0 Error Rate')
plt.title('Decision Tree Depth Tuning Experiment: Error Rates')
plt.xlabel('depth')
plt.legend()
plt.show()

plt.plot(depth,accuracy_list,label='Accuracy')
plt.title('Decision Tree Depth Tuning Experiment: Overall Accuracy')
plt.xlabel('depth')
plt.show()

Interpreting Results

All of my models performed better than random guessing. I was trying to predict whether customers would be in the top 25%, so simply guessing "no" across the board would be accurate about 75% of the time.
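That 75% baseline can be confirmed with a majority-class dummy model. A minimal sketch using the same train/test split as above (not part of the original run):

from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Always predicts the majority class, i.e., 'not in the top 25%'.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(train[predictors], train[target])
baseline_accuracy = metrics.accuracy_score(test[target], baseline.predict(test[predictors]))
print('Majority-class baseline accuracy:', baseline_accuracy)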

Since all of the models performed better than random chance and different models optimize different metrics, choosing the right one will depend on business objectives and priorities:

  • If revenue growth is most important, minimizing false negatives would be ideal because it would result in the largest number of high-opportunity customers being identified.
  • On the other hand, if cost control is most important, limiting false positives would be best, as it would reduce wasted spend on customers who will ultimately not rise to the top.
  • If balance is desired, the best strategy would be to choose the model with the highest overall accuracy.

I didn't feel that the KNeighbors tuning experiment told me anything particularly interesting. The Class 0 Error Rate slowly declined as the value of k increased, while the Class 1 Error Rate generally trended upward. If I'm looking for a model with the lowest Class 1 Error Rate, none of these are good options.

I was surprised by my second experiment, though, where changing the depth has a large effect on both the Class 1 and Class 0 Error Rates. With a max depth of 3, this model actually performs similarly to the LogisticRegression with balanced class weights that had previously been my best model. An advantage of this DecisionTree model is that by changing a single parameter I can get either error rate down to single digits, which could make it a really practical model to put into practice. Business budgets can change frequently, and priorities along with them, so it could be useful that this model can be set to optimize revenue with a max depth of 3 (a low Class 1 Error Rate, i.e., few false negatives), and then easily shifted to a more conservative, cost-controlling approach with a max depth of 15 (a low Class 0 Error Rate, i.e., few false positives). A sketch of that switch follows below.
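To make that switch concrete, here is a minimal sketch of choosing the depth from a business-priority flag; the priority variable is illustrative and not part of the original notebook, and it reuses the scaled data and target from the cells above:

from sklearn.tree import DecisionTreeClassifier

# 'revenue' priority -> shallow tree (low Class 1 Error Rate, few false negatives);
# 'cost' priority    -> deeper tree (low Class 0 Error Rate, few false positives).
priority = 'revenue'  # illustrative flag
chosen_depth = 3 if priority == 'revenue' else 15

final_model = DecisionTreeClassifier(class_weight='balanced', max_depth=chosen_depth)
final_model.fit(scaled_train, train[target])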

In [ ]: