Churn Rate by Contract Type: Customers with a prepaid or rather a month-to-month connection have a very high probability to churn compared to their peers on 1 or 2 years contracts. It is very expensive to win them back once lost, not even thinking that they will not do the best word to mouth marketing if unsatisfied. We use Canvas to perform the following steps: For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. services= ['PhoneService','MultipleLines'. Note: All the code in this article is executed using the Spyder IDE for Python. Similarly, in the case of Germany, you can see a 1 in the Germany column and a 0 in the Spain column. The attributes are as follows: The last attribute, Churn?, is the attribute that we want the ML model to predict. Saved model pickle files in the directory. It's important to mention that the data for the independent variables was collected 6 months before the data for the dependent variable, since the task is to develop a machine learning model that can predict whether a customer will leave the bank after 6 months, depending on the current feature values. Instead dont be scared, go out and engage with your customers. Analyze the distribution of categorical variables: 9.2.1. 9.3.3. Why is this bad? Predict Customer Churn. Execute the following script: The get_dummies method of the pandas library converts categorical columns to numeric columns. Step 14: Conduct Feature Scaling: Its quite important to normalize the variables before conducting any machine learning (classification) algorithms so that all the training and test variables are scaled within a range of 0 to 1. 2. Execute the following code to do so: The second step is to load the dataset from the local CSV file into your Python program. I Build Scalable Data Science, Analytics & BI Solutions for Marketing | srees.org. Test those hypotheses against customer data to start building your prediction model. Random Forest achieves the best performance on the test set: strong confusion matrix although still generating false negative which could be an issue given our objective to detect churn likelyhood: The influence of each feature on the prediction to churn can be visualized using SHAP module (feature pushing towards churn to the right of the y-axis): This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. This is suitable for our use case because we have only two possible prediction values: True or False, so we go with the recommendation Canvas made. We can directly perform an interactive prediction on the Predict tab, either in batch or single (real-time) prediction. Categorical Features: The following are categorical features/variables CustomerID,Country,State,City,Lat Long,Gender,Senior Citizen,Partner,Dependents,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Total Charges,Churn Label,Churn Reason. We can run a prediction, and our model returns a confidence score of 93.2% that this customer will churn (True). On the Scoring tab, we can review a visual plot of our predictions mapped to their outcomes. We want to get a quick view into whether our target column can be predicted by the other columns. plt.title('Customers by Contract Type \n', plt.legend(loc='top right', fontsize = "medium"), x_labels = np.array(contract_split[["No. - GitHub - Ifegwu/predict-customer-churn-with-clean-code-: In this project . To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. Dont expect a perfect model, but expect something you can use in your own company / project today! predict-churn-py. These are if the model thinks a customer in the dataset will churn and they actually dont (false positive), or if the model thinks the customer will churn and they actually do (false negative). This set tracks . classifier = LogisticRegression(random_state = 0. accuracies = cross_val_score(estimator = classifier. Step 15.3. Formatting is required for Lat Long and Total Charges columns. Here, 1 refers to the case where the customer left the bank after 6 months, and 0 is the case where the customer didn't leave the bank after 6 months. There are 7043 records and 33 features in the dataset. In Python's scikit-learn library, you can use built-in functions to find all of these values. Therefore with the to_numeric function we can change the format and prepare the data for our machine learning model. Outside work, Chaoran loves spending time with his family and two dogs, Biubiu and Coco. First 5 records of all columns in the dataset, Modifying the dataset based on observations from previous steps, Now lets quickly check if the modifications are done. The first project in the Machine Learning DevOps Nanodegree by Udacity. It can be observed that some variables have a positive relation to our predicted variable and some have a negative relation. The meta information shown below is available on Kaggle where the dataset is also available, hence you can check out the meta info using the same link which we used to download the dataset. You signed in with another tab or window. You can see how easy and straightforward it is to create a machine learning model for classification tasks. When the geography is Spain, you can see a 1 in the Spain column and a 0 in the Germany column. Templates let you quickly answer FAQs or store snippets for re-use. 5 Things to Know About Churn Prediction. The dataset contains 7043 rows and 21 columns and there seem to be no missing values in the dataset. Tools to predict churn in python. We might now choose to provide promotion discounts to retain this customer. He achieves this by working with customers to help them achieve their business goals using the AWS Cloud. If we could figure out why a customer leaves and when they leave with reasonable accuracy, it would immensely help the organization to strategize their retention initiatives manifold. Step 9: Exploratory Data Analysis: Lets try to explore and visualize our data set by doing distribution of independent variables to better understand the patterns in the data and to potentially form some hypothesis. (3) what retention strategies can be implemented based on the results to diminish prospective customer churn? This works for columns with only two categories. Changing the data types of Zip Code and Total Charges columns. As data is rarely shared publicly, we take an available dataset you can find on IBMs website as well as on other pages like Kaggle: Telcom Customer Churn Dataset. Revalidate NAs: Its always a good practice to revalidate and ensure that we dont have any more null values in the dataset. Why: Even though the Zip Code column is in the form of numbers, we cannot consider it as a Numerical type as their values cannot be used meaningfully in any kind of calculations. plt.title('Percentage of Churn in Dataset'), data.drop(['customerID'], axis=1, inplace=True), data['TotalCharges'] = pd.to_numeric(data['TotalCharges']), data["Churn"] = data["Churn"].astype(int), from sklearn.linear_model import LogisticRegression, # To get the weights of all the variables, Learn all about the basics of customer churn in one of my previous articles. They can still re-publish the post if they are not suspended. Python is one of the most frequently used programming languages for financial data analysis, with plenty of useful libraries and built-in functionality. From the look of it, we can presume that the dataset contains several numerical and categorical columns providing various information on the customer transactions. Predicting Customer Churn with Python - Nolan Greenup However, we'll use the random forest algorithm, since it's simple and one of the most powerful algorithms for classification problems. In this article, you'll see how Python's machine learning libraries can be used for customer churn prediction. We do this by implementing a predictive model with the help of python. Step 1: Import relevant libraries: Import all the relevant python libraries for building supervised machine learning algorithms. These two columns contain data in textual format; we need to convert them to numeric columns. Furthermore we import Pandas, which puts our data in an easy-to-use structure for data analysis and data transformation. The pillars of any company are Customers and Employees, and its always expensive to acquire a new customer or to hire a good employee. Solution Build a machine learning model to identify/predict the customers who are likely to churn. One Talk to your team.We did not only find out which customers are likely to churn, but also which features have the most impact on a customer leaving. Compare the quarter results with the same quarter last year or the year before and share the outcome with the senior management of your organization. Once unsuspended, mage_ai will be able to comment and publish posts again. If this continues, the company will incur huge losses. Since we need to classify customers as either churn or no-churn, we'll train a simple-yet-powerful classification model. We can also remove spaces between the feature names and convert all the feature names into lower case (For eg: Total Charges to total_charges). The problem with this solution is that, giving huge discounts to all the customers may cause losses or the company may produce very little profit. Predict Customer Churn with Clean Code This is a classification problem. The first project for the ML DevOps Engineer Nanodegree by Udacity.. print("Total Charges: ",df['Total Charges'].dtype,"\nZip Code: ", df['Zip Code'].dtype) This method belongs to the supervised learning category, just in case you needed one more buzzing expression. Here's an overview of the steps we'll take in this article: The first step, as always, is to import the required libraries. To avoid incurring future session charges, log out of SageMaker Canvas. Looking for some advice to build a data science portfolio that will put you ahead of other aspiring data scientists? Learn all about the basics of customer churn in one of my previous articles. Lucky us for today, but important to know usually we have to handle this. If we are working with a real time dataset, then at this stage it is recommended to save the cleaned dataset in the cloud databases for future usage. Lets have a look into the positive and negative correlations graphically in the next step. Customer Churn Prediction with Python In this process, we take our categories (France, Germany, Spain) and represent them with columns. 2023, Amazon Web Services, Inc. or its affiliates. This step is known as algorithm training. DEV Community 2016 - 2023. Also, they are paying bills via credit card, bank transfer or electronic checks. 3.Total Charges values are numerical and are of float data type, so let's convert this column into numerical. Project Predict Customer Churn of ML DevOps Engineer Nanodegree Udacity; Project Description. To get more detailed insights beyond what is displayed in the Sankey diagram, business analysts can use a confusion matrix analysis for their business solutions. Variance problem occurs when we get good accuracy while running the model on a training set and a test set but then the accuracy looks different when the model is run on another test set. First, connect your dataset. In this article, we'll use this library for customer churn prediction. Step 9.5. This gives us an insight into what columns impact the performance of our model the most. of customers"]]). Such parameters are called the hyperparameters; a set of configurable values external to a model that cannot be determined by the data, and that we are trying to optimize through Parameter Tuning techniques like Random Search or Grid Search. After we confirm that the imported dataset is ready, we create our model. Step 7: Take care of missing data: As we saw earlier, the data provided has no missing values and hence this step is not required for the chosen dataset. Build a machine learning model to identify/predict the customers who are likely to churn. So we can later evaluate the performance of our machine learning model, let's also divide the data into a training and test set. In this article, you'll see how Python's machine learning libraries can be used for customer churn prediction. Let's discuss each column one by one: After careful observation of the features, we'll remove the RowNumber, CustomerId, and Surname columns from our feature set. code of conduct because it is harassing, offensive or spammy. We make an instance of the ModelStep 3. High numbers for either might make us think more on if we can use the model to make decisions. Customers with a month-to-month connection have a very high probability to churn that too if they have subscribed to pay via electronic checks. Interested in exploring some other applications of Python for financial data analysis? For a column like Geography with three or more categories, you can use the values 0, 1, and 2 for the three countries of France, Germany, and Spain. As shown in the following screenshot, the Phone and State columns have much less impact on our prediction. Customer attrition (a.k.a customer churn) is one of the biggest expenditures of any organization. Churn Rate by Payment Method Type: Customers who pay via bank transfers seem to have the lowest churn rate among all the payment method segments. Only if the team knows where to put emphasis on, the team is able to to guide a customer to features that make him/her stick around longer. Approach customers likely to churn, but make sure that you come up with relevant things that may fit their individual needs. Note: Records are known as Instances or Rows, Features are known as Variables or Columns. This can accelerate ML-based value creation and help scale improved outcomes faster. Source code and the raw data that I have used to build the machine learning classifiers can be found in this repository for reference. Distribution of contract type: Most of the customers seem to have a prepaid connection with the telecom company. The customer_data data frame still contains all the columns. To keep things simple, we'll use an open source dataset Telco Customer Churn for this blog. We can see this in the Sankey diagram, but want more insights, so we choose Advanced metrics. Almost half of the customers in our dataset are female whilst the other half are male. Numerical Features: The following are numerical features/variables Count,Zip Code,Latitude,Longitude,Tenure Months,Monthly Charges,Churn Value,Churn Score,CLTV. A simple example in the Telcom dataset is the gender. Most of the customers in the dataset are younger people. Hence, we can say that churn prediction is always an important strategy that every company should consider. Its a telecommunications company that provides home phone and internet services to residents in the USA. This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. Therefore, let us write a for loop that iterates 20 to 30 times and gives the accuracy at each iteration so as to figure out the optimal number of K neighbors for the KNN Model. plt.ylabel('Proportion of Customers',horizontalalignment="center", plt.xlabel('Churn',horizontalalignment="center",fontstyle = "normal", fontsize = "large", fontfamily = "sans-serif"). Therefore we set the coefficients in our model to zero and look at the weights of each variable. We require customer data, list of services, plans and cost details etc. models.append(('Logistic Regression', LogisticRegression(solver='liblinear', random_state = 0, models.append(('SVC', SVC(kernel = 'linear', random_state = 0))), models.append(('Kernel SVM', SVC(kernel = 'rbf', random_state = 0))), models.append(('KNN', KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2))), models.append(('Gaussian NB', GaussianNB())). You can verify this by executing the following code: In the output, you should see the following list : Not all columns affect the customer churn. Hence, let's try to use Logistic Regression and evaluate its performance in the forthcoming sections. datatype of Total Charges and Zip code changed after modifications, Lets take a look at the sample of modified dataset. Are you sure you want to create this branch? Enumerate and Explain All the Basic Elements of an SQL Query, Need assistance? As we got a brief idea about what their business is, so lets start gathering the data. Steps required to build a model: DEV Community A constructive and inclusive social network for software developers. Identify Numerical and Categorical features in the gathered data and check if formatting is required or not. Udacity project#1 machine Learning DevOps Engineer Nano degree. Plot Correlation Matrix of all independent variables: Correlation matrix helps us to discover the bivariate relationship between independent variables in a dataset. You can use the concat function from pandas to horizontally concatenate two data frames as shown below: Our data is now ready, and we can train our machine learning model. So, in order to fix the variance problem, k-fold cross-validation basically split the training set into 10 folds and train the model on 9 folds (9 subsets of the training dataset) before testing it on the test fold. Below, I simply drag-and-drop a CSV file of my churn data into the platform. Looks like Formatting is required for the Zip Code column. So they immediately worked on a plan to retain their customers. #Unique values in each categorical variable: dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'],errors='coerce'), dataset['TotalCharges'] = dataset['TotalCharges'].astype("float"), na_cols = na_cols[na_cols == True].reset_index(), # Label Encoding will be used for columns with 2 or less unique, vals = np.size(dataset2.iloc[:, i].unique()), contract_split = dataset[[ "customerID", "Contract"]]. For evaluating the performance of a classification algorithm, the most commonly used metrics are the F1 measure, precision, recall, and accuracy. Thats considered quite good for a first run, especially when we look which impact each variable has and if that makes sense. If a feature is of int or float data type, then we say that the features are Numerical and if the features are of object datatype or have string values, then we say that the features are Categorical. Made with love and Ruby on Rails. By the end of this article, lets attempt to solve some of the key business challenges pertaining to customer attrition like say, (1) what is the likelihood of an active customer leaving an organization? Step 15.4. I first outline the data cleaning and preprocessing procedures I implemented to prepare the data for modeling. One should always remember that the way we define the objective, the way we gather data and the way we clean/format the data will vary depending on the requirements and the data we have. 3. srees1988/predict-churn-py: Predict Customer Churn in Python We might now choose to provide promotional discounts to retain this customer. Finally, as a side note, lets briefly understand the importance of Churn Prediction. We can get a fast view into the models estimated accuracy and column impact(the estimated importance of each column in predicting the target column). Therefore, we will convert the Zip Code feature into object data type. A quick describe method reveals that the telecom customers are staying on average for 32 months and are paying $64 per month. Therefore, companies always find ways to retain their customers and employees. Be aware that the better we prepare our data for the machine learning model, the better our prediction will be. The dataset does not have any missing or erroneous data values. So let's restart the session, clear the cache and start afresh! We can see here that the Monthly Charges and Total Charges have a high VIF value. Here, the phone number is just the equivalent of an account numbernot of value in predicting other accounts likelihood of churn, and the customers state doesnt impact our model much. ax = contract_split[["No. A very common and high value use case for ML. It is a critical prediction for many businesses because acquiring new clients often costs more than retaining existing ones. Customers who have availed Online Backup, Device Protection, Technical Support and Online Security features are a minority. This allows us a deeper insight into our model. In this post, we show you how business analysts can build a customer churn ML model with Amazon SageMaker Canvas, no code required. For this post, we assume the role of a marketing analyst in the marketing department of a mobile phone operator. A tag already exists with the provided branch name. Use Git or checkout with SVN using the web URL. The Sankey diagram in the following screenshot shows how the model performed on the test set. If you didnt get a chance to go through it, feel free to check out this blog, where I explained in detail all the steps required to build and optimize a ML model. In this project, you will implement your learnings to identify credit card customers that are most likely to churn. Identify the optimal number of trees for Random Forest Model: Quite similar to the iterations in the KNN model, here we are trying to find the optimal number of decision trees to compose the best random forest. The data scientists can view the Canvas model in Amazon SageMaker Studio, where they can explore the choices Canvas AutoML made, validate model results, and even productionalize the model with a few clicks. As you can see below, the data set is imbalanced with a high proportion of active customers compared to their churned counterparts. Train and build the churn model. Lets make use of a customer transaction dataset from Kaggle to understand the key steps involved in predicting customer attrition in Python. For that reason data scientists spend so much time on preparing the data. Zip Code values are numerical but we should convert them to string as discussed in the above Step-3.2. See LICENSE for more information. Once unpublished, all posts by mage_ai will become hidden and only accessible to themselves. Here we can verify that our data is correct. In practice we conduct the following steps to make these precise predictions: To predict if a customer will churn or not, we are working with Python and its amazing open source libraries. Also, feel free to reach out to me if you need any help in understanding the fundamentals of supervised machine learning algorithms in Python. Generally data is gathered from multiple resources in real time. Here, X is our feature set; it contains all the columns except the one that we have to predict (Exited). As a final step, let's see which features play the most important role in the identification of customer churn. print("Logistic Regression Classifier Accuracy: rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1]). Enroll in our Python Basics course to gain more hands-on experience. Now if you open the Geography and customer_data data frames in the Variable Explorer pane, you should see something like this: In accordance with our earlier explanation, the Geography data frame contains two columns instead of three. Are you sure you want to hide this comment? Please note that of course it makes sense to understand the theory behind the model in detail, but in this case our goal is to make use of the predictions we wont go through this in this article. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up. Plot positive & negative correlations: Step 9.6. . Step 17:Predict Feature Importance: Logistic Regression allows us to determine the key features that have significance in predicting the target attribute (Churn in this project). The marketing team can take actions to prevent customer churn based on these learnings. Running one prediction is great for individual what-if analysis, but we also need to run predictions on many records at once. Lets try to drop one of the correlated features to see if it help us in bringing down the multicollinearity between correlated features: In our example, after dropping the Total Charges variable, VIF values for all the independent variables have decreased to a considerable extent. This project is part of Unit 2: Clean Code Principles.
Plus Size One Shoulder Crop Top, Cheap Trousers Women's, Blackboard Smart Notebook Set - Letter Size, Surly Midnight Special Hot Mayonnaise, Yukon Solo Stove Size, Jeep Wrangler 2 Door For Sale Craigslist, Socket Programming In Python,