Bank Deposit Prediction
As the final step of the Holberton School curriculum, students are expected to build a project from scratch, with the freedom to choose and develop their own ideas using the experience and skills gathered over the last two years. My team chose to work on Bank Deposit Prediction.
Introduction
In this project, we develop and evaluate the performance and predictive power of a model trained and tested on the Bank Marketing Dataset.
About the Project
Banks store huge amounts of data about their customers. This data can be used to build and maintain a clear relationship with customers and to target them individually with specific products or banking offers. Selected customers are usually contacted directly, in person, by telephone or mobile phone, by mail, or by email, to advertise a new product or service or to make an offer; this kind of marketing is called direct marketing. Direct marketing is, in fact, a core strategy of many banks and insurance companies for interacting with their customers.
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not. So our application is Bank Deposit Prediction.
Problem Statement:
Bank marketing campaigns depend on huge amounts of electronic customer data, and these data sources are far too large for a human analyst to extract the insights needed for decision-making. Data mining models can greatly improve the performance of such campaigns. Our purpose is to increase campaign effectiveness by identifying the main characteristics that affect success, based on a handful of algorithms that we will test (e.g. Logistic Regression, Random Forests, Decision Trees and others). Using the experimental results, we will assess the performance of the models with statistical metrics such as accuracy, sensitivity, precision and recall; the higher these scores, the better a model is at predicting which campaign contacts are likely to end in a deposit subscription. The aim of the marketing campaign was to get customers to subscribe to a bank term deposit product; whether they did so is the variable ‘y’ in the data set. The bank in question is considering how to optimize this campaign in the future.
Analysis:
Data Exploration:
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Input variables:
1. age (numeric)
2. job : type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
3. marital : marital status (categorical: ‘divorced’,’married’,’single’,’unknown’; note: ‘divorced’ means divorced or widowed)
4. education (categorical:‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
5. default : has credit in default? (categorical: ‘no’,’yes’,’unknown’)
6. housing : has housing loan? (categorical: ‘no’,’yes’,’unknown’)
7. loan : has personal loan? (categorical: ‘no’,’yes’,’unknown’)
8. contact : contact communication type (categorical: ‘cellular’,’telephone’)
9. month : last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10. day_of_week : last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12. campaign : number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous : number of contacts performed before this campaign and for this client (numeric)
15. poutcome : outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
16. emp.var.rate : employment variation rate — quarterly indicator (numeric)
17. cons.price.idx : consumer price index — monthly indicator (numeric)
18. cons.conf.idx : consumer confidence index — monthly indicator (numeric)
19. euribor3m : euribor 3 month rate — daily indicator (numeric)
20. nr.employed : number of employees — quarterly indicator (numeric)
Output variable (desired target):
21. y — has the client subscribed a term deposit? (binary: ‘yes’,’no’)
The dataset has 41188 rows and 21 columns: 20 features and one response variable.
Data distribution
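As a minimal sketch of how the data can be loaded and its distribution inspected with Pandas (the file name bank-additional-full.csv and the ‘;’ separator are assumptions based on the standard UCI download; adjust them to your local copy):

```python
import pandas as pd

# Load the Bank Marketing dataset (file name and separator assumed from the UCI release)
df = pd.read_csv("bank-additional-full.csv", sep=";")

# Basic shape check: 41188 rows x 21 columns are expected
print(df.shape)

# Distribution of the target variable: how many clients subscribed?
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True))

# Summary statistics for the numeric features
print(df.describe())
```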
Architecture
The machine learning architecture describes the layers of the machine learning cycle and the steps our pipeline goes through to turn raw data into predictions.
1. Data Acquisition
Data acquisition, often referred to as DAQ, is the process of digitizing data from the world around us so it can be displayed, analyzed, and stored in a computer. In our case, this means obtaining the Bank Marketing Dataset.
2. Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format. It is an essential step in data mining, since machine learning and data mining algorithms cannot work directly with raw data, and the quality of the data should be checked before they are applied.
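A minimal sketch of the kind of preprocessing this step involves, assuming the dataframe `df` from the loading sketch above (the exact transformations used in the project may differ):

```python
import pandas as pd

# Target: map 'yes'/'no' to 1/0
y = (df["y"] == "yes").astype(int)

# Drop 'duration': as noted in the dataset description, it is only known
# after the call ends, so keeping it would leak the outcome.
X = df.drop(columns=["y", "duration"])

# One-hot encode the categorical features (job, marital, education, ...)
X = pd.get_dummies(X, drop_first=True)

print(X.shape)  # the 20 raw features expand into more columns after encoding
```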
3. Data Processing
Data processing is the manipulation of data by a computer. It includes the conversion of raw data into machine-readable form, the flow of data through the CPU and memory to output devices, and the formatting or transformation of the output.
4. Data Modeling
Decision tree learning or induction of decision trees is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves).
5. Execution
Algorithms and Techniques:
A Decision Tree is a robust and transparent Machine Learning model. The tree starts with a single node and then branches out, with a decision being made at every branch point. It can be used to understand whether a particular variable mattered in the customer’s decision to subscribe or not to the bank’s term deposit. The given dataset poses a typical supervised learning problem, for which tree-based models performed noticeably better than the other approaches we tried.
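A minimal sketch of training the Decision Tree on the preprocessed features, assuming the `X` and `y` from the preprocessing sketch above (the split ratio and hyperparameters here are illustrative, not the project’s exact settings):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out part of the data for testing, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a decision tree; with no depth limit it can memorize the training set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
```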
Power BI Report
Power BI is a business analytics service by Microsoft. It aims to provide interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards.
We created Power BI reports for interactive data visualization and business analytics.
Web Application
We created a web application using Streamlit, an open-source app framework for Machine Learning and Data Science teams.
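A minimal sketch of what such a Streamlit app can look like; the file name model.pkl and the small set of input fields are assumptions, and a real app must feed the model exactly the feature columns used at training time:

```python
import pickle

import pandas as pd
import streamlit as st

st.title("Bank Deposit Prediction")

# Load the trained model (assumed to have been saved earlier with pickle)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# A few illustrative inputs; the real app exposes every feature the model expects
age = st.number_input("Age", min_value=18, max_value=100, value=35)
campaign = st.number_input("Contacts during this campaign", min_value=1, value=1)
euribor3m = st.number_input("Euribor 3 month rate", value=4.0)

if st.button("Predict"):
    row = pd.DataFrame([{"age": age, "campaign": campaign, "euribor3m": euribor3m}])
    # NOTE: the columns must match the training pipeline (same one-hot columns and order)
    prediction = model.predict(row)[0]
    st.write("Will subscribe" if prediction == 1 else "Will not subscribe")
```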
Successes
Measuring the effectiveness of an ML approach requires the tracking of model performance.
During the building of the ML product, offline metrics or model metrics are useful for defining success.
DecisionTreeClassifier report:
Train score 1.0
Test score 0.9706592386258125
Train confusion matrix:
[[24386 0]
[ 0 25873]]
Test confusion matrix:
[[ 9823 628]
[ 4 11085]]
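A report like the one above can be produced with scikit-learn; a sketch, reusing the fitted `clf` and the splits from the training sketch:

```python
from sklearn.metrics import confusion_matrix

# Accuracy on the training and test splits
print("Train score", clf.score(X_train, y_train))
print("Test score", clf.score(X_test, y_test))

# Confusion matrices: rows are true classes, columns are predicted classes
print("Train confusion matrix:")
print(confusion_matrix(y_train, clf.predict(X_train)))
print("Test confusion matrix:")
print(confusion_matrix(y_test, clf.predict(X_test)))
```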
Failures
Failure is part of the learning process. Unfortunately, it is also a frequent part of the machine learning development process. ML projects can be doomed from conception due to a misalignment between product metrics and model metrics.
We tried several models, such as RandomForestClassifier and XGBClassifier, before settling on DecisionTreeClassifier.
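A sketch of how the candidate models can be compared on the same split (XGBClassifier comes from the separate xgboost package, whose installation is assumed; the variables reuse the training sketch above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "XGBClassifier": XGBClassifier(random_state=42),
}

# Train every candidate on the same split and compare test accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, accuracy_score(y_test, preds))
```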
Technologies
Pandas
NumPy
Seaborn
scikit-learn
Streamlit
Pickle
Microsoft Power BI (business data analytics)
What can we improve?
Looking for more data
In this case, one option is to increase the size of our training set. We will try enlarging our sample by providing new data, which could translate into new cases or new features.
Selecting features and examples
If the model’s variance is high and the algorithm relies on many features, pruning some features can give better results. In this context, reducing the number of features in our data matrix by keeping those with the highest predictive value is advisable.
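One way to do this with a tree-based model is to rank features by importance and keep only the strongest ones; a sketch using the fitted tree and splits from the earlier sketches (the cut-off of 15 features is arbitrary):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Rank the encoded features by how much the fitted tree relies on them
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
top_features = importances.sort_values(ascending=False).head(15)
print(top_features)

# Retrain on the reduced feature set and check whether performance holds up
clf_small = DecisionTreeClassifier(random_state=42)
clf_small.fit(X_train[top_features.index], y_train)
print("Test score (top 15 features)",
      clf_small.score(X_test[top_features.index], y_test))
```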
Conclusion
Machine learning architecture draws major industry interest today, as every process looks to optimize available resources and output based on historical data; coupled with data science, machine learning also brings major advantages for forecasting and predictive analytics. The machine learning architecture defines the various layers involved in the machine learning cycle and the major steps carried out in transforming raw data into training data sets that enable a system’s decision-making.
The most important and time-consuming part of the problem was data cleansing and processing. Once the data was prepared, the next challenge was to pick the algorithm best suited to the problem we chose to solve. Based on experience, prior knowledge, and the accuracy obtained on the training data, we observed that decision tree classification performed the best.
You can find the complete project, documentation, and dataset on my GitHub page.