In this project, we use four different classification techniques to predict the class label. However, the final result does not depend only on the data and the model. Credit scoring, or credit risk assessment, is an important research issue in the banking industry.
If the applicant is a good credit risk, i. The main aim of this project is to compare the performance of the typical methods.
At the same time, a longer duration increases risk, and the higher the risk, the less likely a positive response should be.
All models use the same fold cross-validation resampling option, without any manual tuning. There is no full correlation only because some instances have no data about the purpose, i. Although the number of foreigners in the sample is low, less than 0.
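As a rough illustration of evaluating every model on the same resampling scheme, the sketch below builds one fixed set of folds that could be reused for each model. The fold count `k=10`, the seed, and the helper names `kfold_indices` and `train_test_splits` are assumptions made for this example; the text does not state them.

```python
import random

def kfold_indices(n_rows, k, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_rows, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = kfold_indices(n_rows, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# k=10 is an assumed value; 1000 is the row count of the public German Credit data.
splits = list(train_test_splits(n_rows=1000, k=10))
```

Fixing the seed means every model sees exactly the same partitions, so differences in accuracy are not caused by a lucky or unlucky split.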
Therefore, the three attributes related to gender and marital status i. The outputs of these models are shown below in tabular form. Credit scoring models have been widely studied in the areas of statistics, machine learning, and artificial intelligence.
The nice thing about this dataset is that it has a lot of the challenges faced by data scientists on a daily basis. We decided to remove these from our analysis. We hope that the Principal Component Analysis will be able to join these highly correlated attributes. Attribute OBS is clearly an identifier, which is not relevant for the evaluation of the credit risk.
As a result, we created a new attribute called MALE which contains true in case the applicant is a man.
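A minimal sketch of deriving such a flag, assuming the gender and marital status attributes are 0/1 indicators that are each set only for male applicants. The column names `MALE_DIV`, `MALE_SINGLE` and `MALE_MAR_or_WID`, the helper `add_male_flag`, and the sample rows are assumptions for illustration.

```python
# Hypothetical indicator columns; each is assumed to be 1 only for a male applicant.
MALE_COLUMNS = ("MALE_DIV", "MALE_SINGLE", "MALE_MAR_or_WID")

def add_male_flag(record):
    """Return a copy of the record with a boolean MALE attribute added."""
    updated = dict(record)
    updated["MALE"] = any(record[col] == 1 for col in MALE_COLUMNS)
    return updated

# Made-up rows, not taken from the data set.
rows = [
    {"MALE_DIV": 0, "MALE_SINGLE": 1, "MALE_MAR_or_WID": 0},
    {"MALE_DIV": 0, "MALE_SINGLE": 0, "MALE_MAR_or_WID": 0},
]
flagged = [add_male_flag(r) for r in rows]
```

The original records are left unchanged, so the source indicators stay available if the marital status information is needed later.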
We have shown that the optimal solution for the logistic regression model in terms of profit is achieved by setting the cut-off value to around 0. We noticed that, in general, models created using Naive Bayes and logistic regression had the best accuracy.
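One way to search for a profit-optimal cut-off is to scan candidate values and keep the one with the highest total profit. The sketch below does this under stated assumptions: the gain for a good applicant (`gain_good`), the loss for a bad one (`loss_bad`), the predicted probabilities and the outcomes are all made up, and the function names are hypothetical.

```python
def profit_at_cutoff(probs, labels, cutoff, gain_good, loss_bad):
    """Total profit when credit is granted to applicants with P(good) >= cutoff.

    labels: 1 = good credit risk, 0 = bad credit risk.
    """
    profit = 0.0
    for p, y in zip(probs, labels):
        if p >= cutoff:
            profit += gain_good if y == 1 else -loss_bad
    return profit

def best_cutoff(probs, labels, gain_good, loss_bad, steps=100):
    """Scan cut-offs 0.00, 0.01, ..., 1.00 and return the (cutoff, profit) pair with the highest profit."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(
        ((c, profit_at_cutoff(probs, labels, c, gain_good, loss_bad)) for c in candidates),
        key=lambda pair: pair[1],
    )

# Made-up predicted probabilities of "good" and observed outcomes.
probs = [0.95, 0.80, 0.65, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]
# Assumed payoffs: a bad loan costs five times what a good loan earns.
cutoff, profit = best_cutoff(probs, labels, gain_good=100, loss_bad=500)
```

Because a bad loan is assumed to cost more than a good loan earns, the best cut-off in this toy example sits well above 0.5; with different payoffs it would move.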
However, this constraint is not valid, as shown in the following pivot table, which contains the record count for every value combination of those attributes. The correlation matrix including response reveals that there is a low, yet significant correlation between response and check account 0.
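A pivot table of record counts per value combination can be sketched with a counter over tuples. The constraint checked here (exactly one indicator is set), the column names `NEW_CAR` and `USED_CAR`, and the rows are assumptions chosen for illustration, since the attributes and the constraint are not named at this point in the text.

```python
from collections import Counter

def combination_counts(records, columns):
    """Count records for each combination of values in the given columns."""
    return Counter(tuple(rec[col] for col in columns) for rec in records)

# Hypothetical 0/1 indicator columns and made-up rows.
columns = ("NEW_CAR", "USED_CAR")
records = [
    {"NEW_CAR": 1, "USED_CAR": 0},
    {"NEW_CAR": 0, "USED_CAR": 1},
    {"NEW_CAR": 0, "USED_CAR": 0},  # no indicator set
    {"NEW_CAR": 1, "USED_CAR": 1},  # two indicators set
]
counts = combination_counts(records, columns)

# Combinations that break the assumed "exactly one indicator is set" constraint.
violations = {combo: n for combo, n in counts.items() if sum(combo) != 1}
```

Any non-empty `violations` dictionary means the constraint does not hold for the data, which is the kind of evidence a pivot table gives.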
This model's results were just slightly better, as one can see from the table below. More on that later. In this part we go through the data in order to find outliers, missing values, and inconsistent data. We want to obtain a model that may be used to determine if new applicants present a good or bad credit risk.
With the growth of the credit industry and the large loan portfolios under management today, the credit industry is actively developing more accurate credit scoring models.

Exploratory Analysis

Now that we have the data, it's important that we understand the data before we attempt to model it.
Besides, it has qualitative and quantitative information about the individuals. The German Credit data set has data on past credit applicants, described by 20 attributes. Surprisingly, our data analysis showed that our assumption was wrong.
Therefore, our objective is to find a classification model that surpasses this benchmark. The results of this experiment are shown in the tables below. The solution in this case was also to delete the outliers.
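The text does not say which rule was used to identify outliers before deleting them. A common choice, assumed here, is the box-plot convention of dropping values more than 1.5 interquartile ranges outside the quartiles. The helper names and the sample amounts below are made up.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences: quartiles widened by k times the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

# Made-up credit amounts with one extreme value.
amounts = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 18000]
kept = drop_outliers(amounts)
```

The multiplier `k` controls how aggressive the filter is; a larger value keeps more of the tail.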
We see that the first principal component is most affected by the weights of duration and amount, while the second measures the balance between two quantities, duration versus age.
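To show how such loadings can be read off, the sketch below computes the first principal component of three standardized attributes. It uses plain power iteration on the correlation matrix as a stand-in for a full PCA routine, and the values for duration, amount and age are made up for illustration, not taken from the data set.

```python
import math

def standardize(column):
    """Scale one column to zero mean and unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def correlation_matrix(columns):
    """Correlation matrix of the given columns (lists of equal length)."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def first_component(matrix, iterations=200):
    """Approximate the leading eigenvector of a symmetric matrix by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Made-up values for illustration only.
duration = [6, 12, 18, 24, 36, 48]
amount = [1000, 2100, 2900, 4200, 6100, 7900]
age = [35, 52, 23, 41, 30, 47]

loadings = first_component(correlation_matrix([duration, amount, age]))
weights = dict(zip(["duration", "amount", "age"], loadings))
```

In this toy data duration and amount rise together, so their loadings on the first component are larger in magnitude than the loading of age.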
Later on, we are going to apply the correlation matrix and the Principal Component Analysis to the attributes that are numerical.
Comparison of different models in tabular form.

This sometimes helps, since it provides a grouping effect for the levels of the categorical variable.
Therefore, we proceeded with the following box plot analysis. Values with a red background show the number of inconsistencies for that combination. We can see above that the German credit data is a case of an unbalanced dataset, with most of the individuals being classified as having good credit.
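With an unbalanced class distribution, a natural reference point is the accuracy of always predicting the most common class. The sketch below computes that baseline; the helper name `majority_baseline` and the label counts are made up and do not restate the real class proportions.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for always predicting the most common class."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Made-up labels for illustration.
labels = ["good"] * 6 + ["bad"] * 2
majority_class, accuracy = majority_baseline(labels)
```

A classifier is only useful if it beats this number, and even then accuracy alone can hide poor performance on the minority class.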
We do not intend to use all of these attributes, in order to reduce the overlap of information so that the data mining algorithms operate more efficiently. The goal is a credit scoring rule that can be used to determine if a new applicant is a good credit risk or a bad credit risk, based on values for one or more of the predictor variables.
Statlog (German Credit Data) Data Set. Abstract: This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric).
Also comes with a cost matrix.

Feb 12. Modeling is one of the topics I will be writing about a lot on this blog.
Because of that I thought it would be nice to introduce some datasets that I will use in the illustration of models and methods later on.
In this post I describe the German credit data, very popular within the. The German Credit Data contains data on 20 variables and the classification of whether a loan applicant is considered a Good or a Bad credit risk.
analysis (Gibson et al.). For the advantages of R and introductory tutorials see

Credit Scoring in R

R Code Examples

In the credit scoring examples below, the German Credit Data set is used (Asuncion et al.).