Predicting Customer Churn Using Greenplum and gpmlbot
This blog post summarizes a demonstration delivered to a customer who wanted to understand how Greenplum can be used for machine learning without exporting data to external systems. The objective was to show how churn modeling can be prototyped and iterated entirely within Greenplum using the gpmlbot utility, which automates feature preparation, model training, and evaluation.
While this demo was not intended as a production-ready implementation, it illustrates the possibilities for teams that want to explore in-database machine learning workflows using familiar SQL and open-source tools like MADlib and PostgresML.
The Problem: Customer Churn
Customer churn is a critical metric for any company that has competitors. Proactively identifying which customers are likely to leave allows businesses to take preemptive action, improve retention, and reduce costs. We showcased this classification problem using an open-source Telco Customer Churn dataset (here) and enhanced it with automated feature engineering and correlation analysis.
Step 1: Load the CSV File
We used Greenplum Sailfish (Blog post) to load the telco_customers.csv file into the database. The load produced a table named telco_customers containing 7043 rows.
Step 2: Data Preparation and Feature Engineering
Using SQL automation (churn_prep.sql), we transformed categorical columns into numerical features suitable for machine learning algorithms. Techniques included:
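Integer encoding of categorical columns using dense ranking (e.g., gender, contract, internet_service), so that each distinct category maps to one integer code in a single column rather than to separate one-hot columns
Conversion of churn labels from 'Yes'/'No' to 1/0
Naming of every feature column with the suffix _feature for traceability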
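The dense-ranking and label-conversion steps above can be sketched as follows. This is only an illustration, not the contents of churn_prep.sql: the sample rows and the customer_id column are made up, and an in-memory SQLite database stands in for Greenplum so the snippet runs anywhere. The DENSE_RANK() window function and CASE expression are standard SQL and are also available in Greenplum.

```python
import sqlite3

# Hypothetical sample rows; the real telco_customers table has 7043 rows.
rows = [
    ("C1", "Female", "Month-to-month", "Yes"),
    ("C2", "Male", "Two year", "No"),
    ("C3", "Male", "One year", "No"),
    ("C4", "Female", "Month-to-month", "Yes"),
]

# SQLite is used here only as a stand-in for Greenplum.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telco_customers "
    "(customer_id TEXT, gender TEXT, contract TEXT, churn TEXT)"
)
conn.executemany("INSERT INTO telco_customers VALUES (?, ?, ?, ?)", rows)

# DENSE_RANK assigns consecutive integer codes (1, 2, 3, ...) to the
# distinct values of each categorical column, in sorted order.
# The CASE expression turns the 'Yes'/'No' label into 1/0.
encoded = conn.execute(
    """
    SELECT customer_id,
           DENSE_RANK() OVER (ORDER BY gender)   AS gender_feature,
           DENSE_RANK() OVER (ORDER BY contract) AS contract_feature,
           CASE WHEN churn = 'Yes' THEN 1 ELSE 0 END AS churn_label
    FROM telco_customers
    ORDER BY customer_id
    """
).fetchall()

for row in encoded:
    print(row)
```

Integer codes imply an ordering between categories. Tree-based models usually tolerate that, while linear models and neural networks may respond better to true one-hot columns.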
Step 3: Training with gpmlbot
The gpmlbot utility was used to orchestrate model training. It accepts configuration in TOML format and runs all models defined in a single pass using the Greenplum engine. Both MADlib and PostgresML extensions were used to compare results across algorithms.
[trainings.training]
database = 'gpadmin'
prediction_column = 'churn_label'
feature_columns = [
    'contract_feature',
    'dependents_feature',
    'device_protection_feature',
    'gender_feature',
    'internet_service_feature',
    'multiple_lines_feature',
    'online_backup_feature',
    'online_security_feature',
    'partner_feature',
    'phone_service_feature',
    'senior_citizen_feature',
    'streaming_movies_feature',
    'streaming_tv_feature',
    'tech_support_feature',
    'tenure_months_feature',
    'zip_code_feature'
]
algorithm_type = 'classification'
algorithms = [
    'decision tree',
    'random forest',
    'support vector machines',
    'lightgbm',
    'xgboost',
    'multilayer perceptron'
]
Step 4: Model Evaluation and Results
The models were trained using the versions of Greenplum, MADlib, PostgresML, and gpmlbot listed under Environment Details. The results below are from the most recent run, ranked by accuracy:
Rank | Algorithm | Accuracy | Precision | Duration |
---|---|---|---|---|
1 | Decision Tree | 77.50% | 61.83% | 42s |
2 | LightGBM | 76.30% | 61.31% | 2s |
3 | Multilayer Perceptron | 76.22% | 66.97% | 44s |
4 | Random Forest | 75.23% | 59.00% | 59s |
5 | Support Vector Machines | 68.28% | 45.27% | 15s |
6 | XGBoost | 29.10% | 29.10% | 23s |
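The Decision Tree model (using MADlib) produced the highest accuracy (77.50%), while the Multilayer Perceptron achieved the highest precision (66.97%), which may be useful when minimizing false positives matters most. LightGBM came within 1.2 percentage points of the top accuracy while training in 2 seconds. XGBoost's identical accuracy and precision (29.10%) is the pattern a model produces when it predicts churn for every customer, so that run may reflect a configuration or label-handling issue rather than the algorithm's real capability.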
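For readers less familiar with the two metrics in the results table, the sketch below computes accuracy and precision from confusion-matrix counts. The counts are hypothetical, since the size of the demo's test set is not given here. They are chosen only to show how a model that labels every customer as a churner ends up with identical accuracy and precision, as in the XGBoost row.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int, tn: int = 0, fn: int = 0) -> float:
    """Share of predicted churners that actually churned."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical 1000-row test set containing 291 actual churners.
reasonable_model = dict(tp=150, fp=80, tn=629, fn=141)
# A degenerate model that predicts "churn" for every single row.
always_churn = dict(tp=291, fp=709, tn=0, fn=0)

for name, counts in [("reasonable", reasonable_model), ("always churn", always_churn)]:
    print(f"{name}: accuracy={accuracy(**counts):.2%}, precision={precision(**counts):.2%}")
```

When every row is predicted positive, both metrics collapse to the share of actual churners in the test set, so matching values are worth investigating before comparing that model against the others.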
Why This Matters
Greenplum users can now evaluate machine learning models at scale without ever moving data out of the platform. Whether using open-source MADlib or Python-backed PostgresML, the gpmlbot utility simplifies comparative analysis and speeds up experimentation cycles.
Want to see this in action or adapt it to your own business case? Contact us to explore how Mugnano Data Consulting can help you accelerate your Greenplum analytics and machine learning initiatives.
Environment Details
This demo was executed on a small Docker image running the following component versions:
Component | Version |
---|---|
Greenplum DB | 7.5.2 |
MADlib | 2.2.0 |
PostgresML | 2.8.5 |
gpmlbot Utility | 1.2.0 |
Python | 3.11.7 |
OS | CentOS Stream release 8 |