Despite the era of GenAI hype, classical machine learning is still alive! Personally, I used to use ChatGPT (e.g. for idea generation), but I recently stopped. Thus, I believe OpenAI also needs (or probably already uses) a churn model to predict which customers will stop using their services. Not only can they predict the probability of churn for a particular user, but, with the help of model explanation tools, they can also find the reasons why people become dissatisfied with the tool and then improve it by fixing bugs or developing new features.
In this article I will guide you through the key elements of building a churn model from a business perspective (mainly, as there will be a few technical/machine learning tips). You will discover what the main challenges are when defining what churn means for you, how business people are crucial in the process of creating features for a machine learning model, and how to translate these business thoughts on the reasons for churn into numbers.
One of the main challenges when building a machine learning model predicting churn is the definition of churn itself. It may seem easy at first glance, but sometimes it can take weeks to finally decide what the definition should be - especially in big corporations or banks. The reasoning behind this is:
business units and different people can understand churn differently
the definition should be a mathematical formula, based on selected datasets and sometimes there are a few “sources of truth”
there are multiple business exceptions that, depending on the organisation, should not be treated as churn:
there are technical exceptions:
Another important dimension of the definition is time.
Consider the two variants below:
Probability that the customer will churn on the last day of subscription
A. calculated on any day of the subscription
B. calculated 30 days before the last day of the subscription
These two variants have an immense impact on the way variables will be built and what the training set should look like. Version A is more informative as we can track the subscription score day to day and observe how it changes due to various events the customer generates. However, it’s more complex to implement and requires more data points.
Version B generates scores only for subscriptions that are about to expire. This makes the training set more homogeneous and produces scores exactly when customer service can still try to persuade the customer to prolong the subscription. Other variants are possible as well, and which one to choose always needs to be a joint decision with the customer.
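To make variant B more concrete, here is a minimal sketch of how the scoring date and the label could be derived for each subscription. The record layout, the field names and the rule "not renewed by the last day means churn" are assumptions made for illustration; your own definition of churn may well differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Subscription:
    # Hypothetical record layout; the field names are assumptions.
    customer_id: str
    end_date: date               # last day of the subscription
    renewed_on: Optional[date]   # date of renewal, or None if never renewed


def snapshot_date(sub: Subscription, days_before: int = 30) -> date:
    """Variant B: the single day on which the score is calculated."""
    return sub.end_date - timedelta(days=days_before)


def churn_label(sub: Subscription) -> int:
    """Assumed definition: 1 if not renewed by the last day, else 0."""
    renewed_in_time = sub.renewed_on is not None and sub.renewed_on <= sub.end_date
    return 0 if renewed_in_time else 1


def training_rows(subs):
    """One row per subscription: (customer_id, snapshot date, label)."""
    return [(s.customer_id, snapshot_date(s), churn_label(s)) for s in subs]


subs = [
    Subscription("c1", date(2023, 4, 1), None),
    Subscription("c2", date(2023, 4, 1), date(2023, 3, 20)),
]
rows = training_rows(subs)
```

Variant A would instead produce one row per subscription per day, which is why it needs more data points and a more careful design of the training set.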
As you can imagine, the points above can raise many challenges when defining churn in an organisation. This is why some organisations choose to have multiple definitions of churn, track them and use them in separate machine learning models. However, using a single definition is the easiest road when it comes to model development, business understanding of the results and maintenance of the solution (including running and evaluating marketing campaigns).
All of these factors show that there is no golden rule when it comes to implementing a churn machine learning model in an organisation - each company is different and requires a custom-made solution to meet their business needs.
Without good quality features, even the most robust machine learning model will not perform well. Here, a feature brainstorming workshop comes in handy. To make the most of it, multiple business units need to take part: sales, marketing, customer service, IT and data specialists. When stakes are high (reducing churn will bring lots of money), it’s useful to organise a bigger meeting and brainstorm over the subject for a few hours. The result of such a meeting should be:
List of data sources that can be used in churn modelling
a. crucial (e.g. subscription data)
b. nice-to-have (e.g. inbound calls to customer service)
List of data sources that cannot be used in the project at the moment, but have information that seems important when predicting churn. This unavailability can be caused by:
a. Historical data that has not been collected
b. A key to match records from the source to our customer/subscription not being available (although it is possible to develop such an identifier)
c. Data not being collected at all (e.g. transcription of inbound calls to customer service or even recordings of such)
List of customer behaviours/features (in business language) that impact the decision making when it comes to subscription renewal (both positively and negatively), e.g.:
On a side note, such a workshop is also a great opportunity to meet the team and the people who are interested in the results of the project. It will benefit future cooperation and, ultimately, the quality of the whole solution.
The next step is reflecting the outcome of the brainstorming in data using statistical aggregations. In my opinion, this is one of the most interesting parts of a Data Scientist’s job - describing reality with numbers and trying to be as close as possible. Let’s try to create the features that would cover the ideas from the previous paragraph:
“when a customer is dissatisfied with our service, they will not prolong their subscription”
1.1 Count of incoming customer calls within last year
1.2 Count of customer emails complaining about our services within last 6 months
“some of our customers are students that only need the subscription for a few months to learn how to use it”
2.1 If the registration e-mail contains any university address (e.g. @uw.edu.pl)
2.2 If they are a declared student (e.g. marked during registration of a new customer)
“last autumn there was a huge campaign with discounts from our competitor”
3.1 The daily competitor’s price of a similar product
3.2 Simple indicator: if the competitor has a campaign with discounts
“due to inflation, we needed to impose a 50% price increase for subscription renewal starting this January”
4.1 Price of the product paid by the customer last year
4.2 Price of the product that the customer will need to pay to renew the subscription
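A few of the features above can be sketched in plain Python as follows. The function names, the list of university domains and the price ratio (which combines 4.1 and 4.2 into one number) are assumptions made for illustration, not a prescribed implementation.

```python
from datetime import date, timedelta

UNIVERSITY_DOMAINS = ("uw.edu.pl",)  # assumed list; extend with real domains


def calls_last_year(call_dates, as_of):
    """Feature 1.1: incoming calls in the 365 days up to and including as_of."""
    start = as_of - timedelta(days=365)
    return sum(1 for d in call_dates if start < d <= as_of)


def has_university_email(email):
    """Feature 2.1: 1 if the registration e-mail is at a known university domain."""
    domain = email.rsplit("@", 1)[-1].lower()
    return int(any(domain == u or domain.endswith("." + u) for u in UNIVERSITY_DOMAINS))


def renewal_price_ratio(price_paid, renewal_price):
    """Features 4.1 and 4.2 combined: renewal price relative to the price paid."""
    return renewal_price / price_paid


as_of = date(2023, 3, 1)  # the day on which the features are calculated
features = {
    "calls_last_year": calls_last_year(
        [date(2022, 2, 1), date(2022, 11, 5), date(2023, 2, 20)], as_of
    ),
    "has_university_email": has_university_email("jan.kowalski@students.uw.edu.pl"),
    "renewal_price_ratio": renewal_price_ratio(100.0, 150.0),
}
```

Each function takes an explicit `as_of` date or only static inputs, so the same code can be reused for any scoring day.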
Sometimes we need to be very creative to try to reflect the information in data, as often we don't have such historical data available. In this case, you need to treat such ideas as a trigger to start collecting more data sources that you find useful for your business.
The most important things when creating variables:
No information “from the future” should be provided in the model. For example, if we know that a subscription churned on 1st April 2023 (this was the last day of the subscription) and we want our model to predict the event a month earlier, we need to use the information that was available on 1st March 2023. Below please find two examples of variables where the first one leaks information from the future and the second one does not:
❌ Total count of products purchased last calendar year (meaning the whole of 2023, which includes the months after 1st March)
✅ Count of products purchased by the customer during the last 12 months (calculated on 1st March 2023)
When the distribution of a variable changes over time (by its design, not because of behavioural changes), there will always be data drift and the model will start performing worse. In the examples below, the first variable will always grow for customers that registered many years ago; the second variable will be rather stable, as long as our offer stays the same.
❌ Total count of products purchased by the customer
✅ Count of products purchased by the customer during the last 12 months
❌ Average value of active product_id=87673
✅ Average value of active Advertising Subscriptions
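The first two rules above (no information from the future, and a fixed-length window instead of an ever-growing total) can be illustrated with a small sketch. The purchase dates are made up and the helper names are hypothetical.

```python
from datetime import date


def months_back(d, months):
    """Same day-of-month `months` earlier (day clamped to 28 to stay valid)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def purchases_last_12_months(purchase_dates, as_of):
    """Count purchases in the 12 months before as_of; later events are ignored."""
    start = months_back(as_of, 12)
    return sum(1 for d in purchase_dates if start <= d < as_of)


def purchases_calendar_year(purchase_dates, year):
    """Leaky variant: counts the whole calendar year, including future months."""
    return sum(1 for d in purchase_dates if d.year == year)


# Made-up purchase history of one customer.
purchases = [
    date(2022, 1, 10),
    date(2022, 6, 3),
    date(2023, 2, 14),
    date(2023, 3, 25),
    date(2023, 5, 2),
]
as_of = date(2023, 3, 1)  # the day on which the score is calculated

safe = purchases_last_12_months(purchases, as_of)   # uses only data before as_of
leaky = purchases_calendar_year(purchases, 2023)    # also sees March and May 2023
```

The safe variant only looks at purchases made before 1st March 2023, while the leaky one also counts purchases that had not yet happened on the scoring day.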
When it comes to technology, I personally recommend preparing such variables in the form of multiple marts (and not just making transformations as a pipeline to generate scores). For that, I used the dbt framework, where the code is just SQL, some yaml configuration and a bit of Python. With that solution, results of our work - if properly documented - can be used by other team members for their machine learning models or some insightful dashboards for management.
Now you can run your first machine learning models. A few recommendations from my side:
The machine learning model itself won’t stop customers from cancelling their subscriptions - what you need is a plan for how to use it to maximise the business outcome. It’s best to start such planning right at the beginning of a churn modelling project.
Last but not least, when executing a marketing campaign, the best way is to run it using A/B tests. This way you will be able to:
Also, if possible, ask your customer service representatives to comment on the scores - they know customers and somehow have a feeling as to which ones are more likely to churn and which ones will definitely prolong their subscription.
Such feedback, along with thorough analysis of the campaign’s results, will help you to enhance your model with additional features or eliminate bugs that cause wrong predictions.
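One common way to analyse the result of an A/B-tested retention campaign is a two-proportion z-test on the churn rates of the control and treatment groups. The sketch below uses only the Python standard library; the group sizes and churn counts are invented for illustration and are not results from any real campaign.

```python
import math


def two_proportion_z_test(churned_a, n_a, churned_b, n_b):
    """Two-sided z-test for a difference between two churn rates."""
    p_a, p_b = churned_a / n_a, churned_b / n_b
    pooled = (churned_a + churned_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a - p_b, z, p_value


# Made-up numbers for illustration only.
diff, z, p_value = two_proportion_z_test(
    churned_a=120, n_a=1000,   # control: no retention offer
    churned_b=90, n_b=1000,    # treatment: retention offer sent
)
```

A small p-value suggests that the difference in churn rates is unlikely to be due to chance alone; whether the uplift justifies the cost of the campaign remains a business decision.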
Churn modelling is still a valuable contributor when it comes to diminishing churn in your customer base. When done wisely, it can give you the tools to make your retention campaigns more sophisticated and effective.
If you have any questions or require a deeper understanding, sign up for a free consultation with our experts, and don’t forget to subscribe to our newsletter for more updates.