Predicting Customer Churn Using Greenplum and gpmlbot

This blog summarizes a demonstration delivered to a customer interested in understanding how Greenplum can be used for machine learning, without exporting data to external systems. The objective was to showcase how churn modeling can be prototyped and iterated entirely within Greenplum using the gpmlbot utility, which automates feature preparation, model training, and evaluation.

While this demo was not intended as a production-ready implementation, it illustrates the possibilities for teams that want to explore in-database machine learning workflows using familiar SQL and open-source tools like MADlib and PostgresML.

The Problem: Customer Churn

Customer churn is a critical metric for any company that has competitors. Proactively identifying which customers are likely to leave allows businesses to take preemptive action, improve retention, and reduce costs. We showcased this classification problem using the open-source Telco Customer Churn dataset, enhanced with automated feature engineering and correlation analysis.

Step 1: Load the CSV File

We used Greenplum Sailfish (covered in our post Automating Agile Data Onboarding with Greenplum Sailfish) to load the telco_customers.csv file into the database, producing a telco_customers table with 7,043 rows.
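Sailfish automates schema inference and parallel loading; for a file of this size, a plain COPY reaches the same end state. A minimal sketch, assuming the telco_customers table already exists with matching columns and the CSV sits at a hypothetical path:

-- Load the raw CSV (hypothetical path; Sailfish normally handles this,
-- including creating the table from the file's header)
copy telco_customers
from '/data/telco_customers.csv'
with (format csv, header true);

-- Confirm the expected row count
select count(*) from telco_customers;  -- expect 7043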

Step 2: Data Preparation and Feature Engineering

Using SQL automation (churn_prep.sql), we transformed categorical columns into numerical features suitable for machine learning algorithms. Techniques included:

  • Ordinal (label-style) encoding of categorical columns using dense ranking (e.g., gender, contract, internet_service)

  • Conversion of churn labels from 'Yes'/'No' to 1/0

  • Clear separation of engineered columns via a _feature suffix for traceability

-- churn_prep.sql
-- Create a table for customer churn and copy the telco_customers dataset into it
drop table if exists telco_customer_churn;
create table telco_customer_churn as select * from telco_customers
distributed by (customerid);

-- alter numeric columns and make them numeric if used as features or prediction in the model
update telco_customer_churn set churn_label = '0' where churn_label = 'No';
update telco_customer_churn set churn_label = '1' where churn_label = 'Yes';
alter table telco_customer_churn alter column churn_label set data type int using churn_label::int;

-- Add numeric "feature" columns for the columns used as model features
alter table telco_customer_churn
      add column zip_code_feature bigint
    , add column tenure_months_feature int
    , add column gender_feature int
    , add column senior_citizen_feature int
    , add column dependents_feature int
    , add column partner_feature int
    , add column phone_service_feature int
    , add column multiple_lines_feature int
    , add column internet_service_feature int
    , add column online_security_feature int
    , add column online_backup_feature int
    , add column device_protection_feature int
    , add column tech_support_feature int
    , add column streaming_tv_feature int
    , add column streaming_movies_feature int
    , add column contract_feature int;

-- Convert the "feature" columns from text to numerics based on distinct values
with ranked_data as (
    select
        customerid
        , 100 + dense_rank() over (order by zip_code) as zip_code_number
        , 200 + dense_rank() over (order by tenure_months) as tenure_months_number
        , 300 + dense_rank() over (order by gender) as gender_number
        , 400 + dense_rank() over (order by senior_citizen) as senior_citizen_number
        , 500 + dense_rank() over (order by dependents) as dependents_number
        , 600 + dense_rank() over (order by partner) as partner_number
        , 700 + dense_rank() over (order by phone_service) as phone_service_number
        , 800 + dense_rank() over (order by multiple_lines) as multiple_lines_number
        , 900 + dense_rank() over (order by internet_service) as internet_service_number
        , 1000 + dense_rank() over (order by online_security) as online_security_number
        , 1100 + dense_rank() over (order by online_backup) as online_backup_number
        , 1200 + dense_rank() over (order by device_protection) as device_protection_number
        , 1300 + dense_rank() over (order by tech_support) as tech_support_number
        , 1400 + dense_rank() over (order by streaming_tv) as streaming_tv_number
        , 1500 + dense_rank() over (order by streaming_movies) as streaming_movies_number
        , 1600 + dense_rank() over (order by contract) as contract_number
    from telco_customer_churn
)
update telco_customer_churn a
set zip_code_feature = ranked_data.zip_code_number
	, tenure_months_feature = ranked_data.tenure_months_number
	, gender_feature = ranked_data.gender_number
	, senior_citizen_feature = ranked_data.senior_citizen_number
	, dependents_feature = ranked_data.dependents_number
	, partner_feature = ranked_data.partner_number
	, phone_service_feature = ranked_data.phone_service_number
	, multiple_lines_feature = ranked_data.multiple_lines_number
	, internet_service_feature = ranked_data.internet_service_number
	, online_security_feature = ranked_data.online_security_number
	, online_backup_feature = ranked_data.online_backup_number
	, device_protection_feature = ranked_data.device_protection_number
	, tech_support_feature = ranked_data.tech_support_number
	, streaming_tv_feature = ranked_data.streaming_tv_number
	, streaming_movies_feature = ranked_data.streaming_movies_number
	, contract_feature = ranked_data.contract_number
from ranked_data
where a.customerid = ranked_data.customerid;
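With the features in place, two quick checks close the loop: verifying that each categorical value received exactly one code, and running the correlation analysis mentioned in the introduction. A minimal sketch using MADlib's correlation function; the column list is trimmed for readability, and any of the _feature columns can be added:

-- Spot-check the encoding: each distinct contract value should map to
-- exactly one code in the 1600+ band assigned above
select contract, contract_feature, count(*)
from telco_customer_churn
group by 1, 2
order by 2;

-- Pairwise Pearson correlations across selected features (MADlib)
drop table if exists churn_corr, churn_corr_summary;
select madlib.correlation(
    'telco_customer_churn'
    , 'churn_corr'
    , 'tenure_months_feature, contract_feature, internet_service_feature, churn_label'
);
select * from churn_corr order by column_position;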
  

Step 3: Training with gpmlbot

The gpmlbot utility was used to orchestrate model training. It accepts its configuration in TOML format and trains every model defined there in a single pass on the Greenplum engine. Both the MADlib and PostgresML extensions were exercised so results could be compared across algorithms.

[trainings.training]
database = 'gpadmin'
prediction_column = 'churn_label'
feature_columns = ['contract_feature', 'dependents_feature', 'device_protection_feature', 'gender_feature', 'internet_service_feature', 'multiple_lines_feature', 'online_backup_feature', 'online_security_feature', 'partner_feature', 'phone_service_feature', 'senior_citizen_feature', 'streaming_movies_feature', 'streaming_tv_feature', 'tech_support_feature', 'tenure_months_feature', 'zip_code_feature']
algorithm_type = 'classification'
algorithms = ['decision tree', 'random forest', 'support vector machines', 'lightgbm', 'xgboost', 'multilayer perceptron']

Step 4: Model Evaluation and Results

The models were trained on current releases of Greenplum, MADlib, and PostgresML, orchestrated by gpmlbot (exact versions are listed under Environment Details below). Below are the results:

Rank  Algorithm                 Accuracy  Precision  Duration
1     Decision Tree             77.50%    61.83%     42s
2     LightGBM                  76.30%    61.31%     2s
3     Multilayer Perceptron     76.22%    66.97%     44s
4     Random Forest             75.23%    59.00%     59s
5     Support Vector Machines   68.28%    45.27%     15s
6     XGBoost                   29.10%    29.10%     23s

The Decision Tree model (using MADlib) produced the best overall accuracy, while the Multilayer Perceptron achieved the highest precision, which may be useful when minimizing false positives matters most.
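To reproduce the winning model without the orchestration layer, the equivalent direct MADlib calls look roughly like this; the feature list is shortened for readability, and gpmlbot generates comparable calls behind the scenes:

-- Train a decision tree classifier directly with MADlib
drop table if exists churn_dt_model, churn_dt_model_summary;
select madlib.tree_train(
    'telco_customer_churn'     -- source table
    , 'churn_dt_model'         -- output model table
    , 'customerid'             -- id column
    , 'churn_label'            -- dependent variable
    , 'contract_feature, tenure_months_feature, internet_service_feature'
    , null                     -- columns to exclude
    , 'gini'                   -- split criterion
);

-- Score the same table and measure accuracy against the true labels
-- (on the training set, for illustration only)
drop table if exists churn_dt_pred;
select madlib.tree_predict('churn_dt_model', 'telco_customer_churn', 'churn_dt_pred', 'response');

select count(*) filter (where p.estimated_churn_label = c.churn_label)::numeric
       / count(*) as accuracy
from churn_dt_pred p
join telco_customer_churn c using (customerid);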

Why This Matters

Greenplum users can now evaluate machine learning models at scale without ever moving data out of the platform. Whether using open-source MADlib or Python-backed PostgresML, the gpmlbot utility simplifies comparative analysis and speeds up experimentation cycles.
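As a taste of the PostgresML path, training one of the comparison models reduces to a single call. A minimal sketch, assuming the pgml extension is installed; the view name, trimmed feature list, and encoded values in the predict call are illustrative:

-- Expose only numeric features plus the label to PostgresML
create or replace view telco_churn_features as
select churn_label, contract_feature, tenure_months_feature, internet_service_feature
from telco_customer_churn;

-- Train and deploy an XGBoost classifier
select * from pgml.train(
    project_name => 'telco_churn'
    , task => 'classification'
    , relation_name => 'telco_churn_features'
    , y_column_name => 'churn_label'
    , algorithm => 'xgboost'
);

-- Score one hypothetical feature vector (values must follow the training column order)
select pgml.predict('telco_churn', array[1601, 250, 902]::real[]);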

Want to see this in action or adapt it to your own business case? Contact us to explore how Mugnano Data Consulting can help you accelerate your Greenplum analytics and machine learning initiatives.

Environment Details

This demo was executed in a small Docker container running the following component versions:

Component        Version
Greenplum DB     7.5.2
MADlib           2.2.0
PostgresML       2.8.5
gpmlbot Utility  1.2.0
Python           3.11.7
OS               CentOS Stream release 8