Python: Neural Networks

Written and run on a Windows 10 laptop – the South Working effect

I use the conda environment py3_tf

1
~$ conda activate py3_tf

Installed module version

~$ pip show tensorflow
Name: tensorflow
Version: 2.2.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /home/user/miniconda3/envs/py3_tf/lib/python3.7/site-packages
Requires: h5py, keras-preprocessing, opt-einsum, wrapt, six, protobuf, tensorboard, wheel, gast, tensorflow-estimator, google-pasta, termcolor, absl-py, astunparse, scipy, numpy, grpcio
Required-by: 

Useful for monitoring the GPU

1
~$ nvidia-smi

Index

  1. Neural Nets and Deep Learning
    • Perceptron Model
    • Neural Networks
    • Activation functions
    • Cost Functions and Gradient Descent
    • Backpropagation
    • Keras Neural Network Example
    • House Sales in King County (Regression)
    • Breast cancer Wisconsin (Classification)
    • LendingClub dataset (Classification)

Neural Nets and Deep Learning

Perceptron Model


\(\hat{y}=\sum_{i=1}^{n}x_iw_i+b\)
The bias term can be read as a threshold (when negative) that must be exceeded before an input can have a positive impact on the output.
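
A minimal NumPy sketch of the perceptron output above (the input, weight and bias values are made up for illustration):

import numpy as np

# hypothetical inputs, weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 2.0

# y_hat = sum(x_i * w_i) + b
y_hat = np.dot(x, w) + b
print(y_hat)  # 0.4 - 0.12 - 1.2 + 2.0 = 1.08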

Neural Networks

Connecting many perceptrons into layers (input, hidden and output layers) gives a neural network. The Universal Approximation Theorem states that a feed-forward network with a single hidden layer and enough neurons can approximate any continuous function arbitrarily well.

Activation functions

They are used to constrain the output, for example when a classification output is required:

  • the logistic (sigmoid) function as activation (or hyperbolic functions such as tanh)
  • the Rectified Linear Unit (ReLU), defined as \(\max(0,z)\); ReLU is good at containing the vanishing gradient
  • others

N.B. Z and X are often written in uppercase to denote a tensor (multiple values). A short sketch of these functions follows.
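
A quick NumPy sketch of the activations just mentioned, evaluated on a small array (the values are only illustrative):

import numpy as np

z = np.array([-2.0, 0.0, 3.0])

sigmoid = 1 / (1 + np.exp(-z))   # squashes into (0, 1)
tanh = np.tanh(z)                # squashes into (-1, 1)
relu = np.maximum(0, z)          # max(0, z)

print(sigmoid)  # [0.119 0.5   0.953]
print(tanh)     # [-0.964  0.     0.995]
print(relu)     # [0. 0. 3.]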

Multiclass Activation Functions

  1. Non-Exclusive Classes
    More than one label can be assigned to each observation, e.g. multiple tags. The logistic activation works well: you get a probability for each class and, once a threshold is chosen, you assign one or more labels.
  2. Mutually Exclusive Classes
    Only one class is assigned. The Softmax function is used
    \(\sigma\left (\textbf{z}\right )_i=\frac{e^{z_i}}{\sum_{j=1}^Ke^{z_j}}\) for \(i=1,...,K\) classes. The result is a probability distribution summing to one; the chosen class is the one with the highest probability.

The network is structured so that it has one output node per class; a small numeric sketch of the softmax follows.
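
A small NumPy sketch of the softmax above: the outputs sum to one and the predicted class is the argmax (the scores are arbitrary).

import numpy as np

z = np.array([2.0, 1.0, 0.1])            # raw scores for K=3 classes
softmax = np.exp(z) / np.exp(z).sum()

print(softmax)           # [0.659 0.242 0.099]
print(softmax.sum())     # 1.0
print(softmax.argmax())  # 0 -> index of the chosen class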

Cost Functions and Gradient Descent

The cost function is used to monitor how the weight updates are progressing during training.

Quadratic Cost Function

\(C=\frac{1}{2n}\sum_x\left \|y(x)-a^L(x) \right \|^2\)
where
\(a^L(x)\) = predicted values (activations of the output layer)
\(y(x)\) = observed values
More generally, the cost function of a NN can be thought of as
\(C(W,B,S^r,E^r)\)
where
\(W =\) weights of the neural network
\(B =\) biases
\(S^r =\) input of a single training sample
\(E^r =\) desired output of a single training sample

We need to find the \(W_{\min}\) that minimizes the cost function. Since the problem is n-dimensional, Gradient Descent is used: with a convex cost function, the minimum is reached through steps (all of them parametrizable, e.g. via the learning rate) until the weights drive the first derivative of the cost function to 0 (or close to it). Adam is a common choice of optimizer. With more than one dimension we move from derivatives to the gradient, so we compute
\(\nabla C\left (w_1,w_2,...,w_n\right )\)
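
A minimal sketch of gradient descent on a one-dimensional quadratic cost, just to make the update rule concrete (learning rate and starting point are arbitrary):

# C(w) = (w - 3)^2 has its minimum at w = 3, with dC/dw = 2*(w - 3)
w = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # step against the gradient

print(w)  # ~3.0, where the first derivative is (almost) zero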

Cross-Entropy

For classification problems the cross-entropy loss function is often used.
\(C=-\left (y \log\left (p\right )+\left (1-y\right ) \log\left (1-p\right )\right )\)
and for a multiclass problem
\(C=-\sum_{c=1}^{M}y_{o,c}\log\left (p_{o,c}\right )\)
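
A numeric sketch of the binary cross-entropy above, for a single observation with \(y=1\):

import numpy as np

y = 1      # observed class
p = 0.9    # predicted probability of class 1

print(-(y * np.log(p) + (1 - y) * np.log(1 - p)))  # 0.105 -> confident and correct, low cost

p = 0.1    # a poor prediction makes the cost explode
print(-(y * np.log(p) + (1 - y) * np.log(1 - p)))  # 2.303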

Backpropagation

Chain-rule derivatives are used to iteratively update the weights, starting from the last layer and working backwards, while minimizing the cost function.
Hadamard Product (the elementwise product, the way * works in numpy): \(\begin{bmatrix} 1\\ 2 \end{bmatrix} \odot \begin{bmatrix} 3\\ 4 \end{bmatrix} = \begin{bmatrix} 1\cdot 3\\ 2\cdot 4 \end{bmatrix} = \begin{bmatrix} 3\\ 8 \end{bmatrix}\)
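
In NumPy the Hadamard product is simply the elementwise * operator:

import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])
print(a * b)  # [3 8] -> elementwise (Hadamard) product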

Keras Neural Network Example

# lib
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf
import random as rn
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_absolute_error,mean_squared_error
from tensorflow.keras.models import load_model
# df (goal: predict the price)
df = pd.read_csv('fake_reg.csv')
df.head()
price feature1 feature2
0 461.527929 999.787558 999.766096
1 548.130011 998.861615 1001.042403
2 410.297162 1000.070267 998.844015
3 540.382220 999.952251 1000.440940
4 546.024553 1000.446011 1000.338531
# pairplot
sns.set_style('whitegrid')
sns.pairplot(df, palette='red')
1
<seaborn.axisgrid.PairGrid at 0x2722a98fec8>

png

# X and y as np arrays
X = df[['feature1','feature2']].values
y = df['price'].values
# train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
1
2
(700, 2)
(300, 2)
1
2
# min max scaling -- help
help(MinMaxScaler)
Help on class MinMaxScaler in module sklearn.preprocessing._data:

class MinMaxScaler(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  MinMaxScaler(feature_range=(0, 1), *, copy=True)
 |  
 |  Transform features by scaling each feature to a given range.
 |  
 |  This estimator scales and translates each feature individually such
 |  that it is in the given range on the training set, e.g. between
 |  zero and one.
 |  
 |  The transformation is given by::
 |  
 |      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
 |      X_scaled = X_std * (max - min) + min
 |  
 |  where min, max = feature_range.
 |  
 |  This transformation is often used as an alternative to zero mean,
 |  unit variance scaling.
 |  ...
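
A tiny NumPy check of the formula shown in the help (with the default feature_range=(0, 1) the second line of the transformation is the identity):

import numpy as np

X = np.array([[1.0], [3.0], [5.0]])
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_std.ravel())  # [0.  0.5 1. ] -> the same values MinMaxScaler().fit_transform(X) returns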
# min max scaling
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
1
2
print('Train Min-Max: ', X_train.min(), X_train.max())
print('Test Min-Max: ', X_test.min(), X_test.max())
1
2
Train Min-Max:  0.0 1.0
Test Min-Max:  -0.014108392024496652 1.0186515935232023

Choosing an optimizer and loss

The main supervised setups in Keras are listed below:

  1. Multi-class classification problem
    1
    
    # model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    
  2. Binary classification problem
    1
    
    # model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    
  3. Mean squared error regression problem
    1
    
    # model.compile(optimizer='rmsprop', loss='mse')
    
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
# define the model one line at a time (the Dense layers can also be passed to Sequential as a list)
model = Sequential()
model.add(Dense(units=4,activation='relu')) # units are the nodes
model.add(Dense(units=4,activation='relu'))
model.add(Dense(units=4,activation='relu'))
model.add(Dense(units=1)) # final layer (we only want the price)

model.compile(optimizer='rmsprop',loss='mse')
# n.b. this step eats the GPU

Training

Description of the main parameters in Keras:

  • Sample: one element of a dataset.
    • Example: one image is a sample in a convolutional network
    • Example: one audio file is a sample for a speech recognition model
  • Batch: a set of N samples. The samples in a batch are processed independently, in parallel. If training, a batch results in only one update to the model. A batch generally approximates the distribution of the input data better than a single input: the larger the batch, the better the approximation; however, a larger batch takes longer to process and still results in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches usually result in faster evaluation/prediction).
  • Epoch: an arbitrary cutoff, generally defined as “one pass over the entire dataset”, used to separate training into distinct phases, which is useful for logging and periodic evaluation.
  • When using validation_data or validation_split with the fit method of Keras models, evaluation will be run at the end of every epoch (see the short sketch after this list).
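
As a short sketch of how these parameters appear in practice (the values are hypothetical and reuse the model and data defined above):

# batch_size: number of samples per gradient update
# epochs: passes over the entire training set
# validation_split: fraction of the training data held out and evaluated at the end of every epoch
model.fit(X_train, y_train, batch_size=32, epochs=250, validation_split=0.1)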
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
# training model
model.fit(X_train,y_train,epochs=250)
Train on 700 samples
Epoch 1/250
700/700 [==============================] - 1s 2ms/sample - loss: 256557.3441
Epoch 2/250
700/700 [==============================] - 0s 73us/sample - loss: 256365.9997
...
Epoch 250/250
700/700 [==============================] - 0s 58us/sample - loss: 24.1132

<tensorflow.python.keras.callbacks.History at 0x27236051688>

Evaluation

1
2
# loss trend
loss = model.history.history['loss']
# plot loss trend
sns.lineplot(x=range(len(loss)),y=loss)
plt.title("Training Loss per Epoch")
1
Text(0.5, 1.0, 'Training Loss per Epoch')

png

# Loss (MSE in this case)
training_score = model.evaluate(X_train,y_train,verbose=0)
test_score = model.evaluate(X_test,y_test,verbose=0)
print('training Score:', training_score)
print('test Score:', test_score)
1
2
training Score: 23.728277675083707
test Score: 25.146427205403647
1
2
# predictions
test_predictions = model.predict(X_test)
# predicted values
test_predictions = pd.Series(test_predictions.reshape(300,))
# observed values
pred_df = pd.DataFrame(y_test,columns=['Test Y'])
pred_df.head()
Test Y
0 402.296319
1 624.156198
2 582.455066
3 578.588606
4 371.224104
# predicted and observed values
pred_df = pd.concat([pred_df,test_predictions],axis=1)
pred_df.columns = ['Test Y','Model Predictions']
pred_df.head()
Test Y Model Predictions
0 402.296319 405.533844
1 624.156198 623.994934
2 582.455066 592.561340
3 578.588606 572.621155
4 371.224104 366.802795
1
2
3
# scatter predict vs observed
sns.set_style('whitegrid')
sns.scatterplot(x='Test Y',y='Model Predictions',data=pred_df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x27238970d08>

png

# distribution errors
pred_df['Error'] = pred_df['Test Y'] - pred_df['Model Predictions']
sns.distplot(pred_df['Error'],bins=50)
1
<matplotlib.axes._subplots.AxesSubplot at 0x27235ddbd88>

png

# metrics
print('MAE:',mean_absolute_error(pred_df['Test Y'],pred_df['Model Predictions']))
print('MSE:',mean_squared_error(pred_df['Test Y'],pred_df['Model Predictions']))
print('MSE (from model.evaluate):',test_score)
print('RMSE:',test_score**0.5)
MAE: 4.023428708666904
MSE: 25.14642937056938
MSE (from model.evaluate): 25.146427205403647
RMSE: 5.014621342175663
# The MAE of about 4 is an error of less than 1% of the mean price
df['price'].describe()
count    1000.000000
mean      498.673029
std        93.785431
min       223.346793
25%       433.025732
50%       502.382117
75%       564.921588
max       774.407854
Name: price, dtype: float64

New observation to predict

# [[Feature1, Feature2]]
new_gem = [[998,1000]]
# scaling
new_gem = scaler.transform(new_gem)
# predict
print(model.predict(new_gem))
1
[[419.92566]]

Saving model

1
2
# working directory
os.getcwd()
1
'F:\\Python\\Course 001'
1
2
# save
model.save('Keras_Neural_Network_Example.h5')  # creates a HDF5 file
1
2
# load
later_model = load_model(r'F:\GitHub\AlbGri.github.io\assets\files\Python\Course 001\Keras_Neural_Network_Example.h5')
1
WARNING:tensorflow:Sequential models without an `input_shape` passed to the first layer cannot reload their optimizer state. As a result, your model isstarting with a freshly initialized optimizer.
1
2
# prediction with loaded model
print(later_model.predict(new_gem))
1
[[420.05133]]

House Sales in King County

Kaggle: Predict house price using regression

# lib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
1
2
3
# df
df = pd.read_csv('/kc_house_data.csv')
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     21597 non-null  int64  
 9   view           21597 non-null  int64  
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  int64  
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   21597 non-null  int64  
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long           21597 non-null  float64
 19  sqft_living15  21597 non-null  int64  
 20  sqft_lot15     21597 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
# convert id to string (so the numeric summary statistics skip it automatically)
df['id'] = df['id'].apply(str)

EDA

1
2
# check missing
df.isnull().sum().sum()
1
0
# show thousands separators as dots and decimals as commas
dot_sep = lambda x: format(round(x,2) if abs(x) < 1 else round(x,1) if abs(x) < 10 else int(x), ',').replace(",", "X").replace(".", ",").replace("X", ".")
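
For example (a quick check of the formatter just defined):

print(dot_sep(1234567.89))  # 1.234.567 -> integers get a dot as thousands separator
print(dot_sep(0.456))       # 0,46 -> values below 1 keep two decimals, with a comma
print(dot_sep(6.55))        # 6,5 -> values below 10 keep one decimal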
# describe; I could use .transpose, but I prefer this layout and tweak the decimals
df.describe(percentiles=[0.25,0.5,0.75,0.999]).applymap(dot_sep)
# df.describe(percentiles=[0.25,0.5,0.75,0.999]).style.format("{:.1f}")
# strong outliers are present
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 month year
count 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597
mean 540.296 3,4 2,1 2.080 15.099 1,5 0,01 0,23 3,4 7,7 1.788 291 1.970 84 47 -122 1.986 12.758 6,6 2.014
std 367.368 0,93 0,77 918 41.412 0,54 0,09 0,77 0,65 1,2 827 442 29 401 0,14 0,14 685 27.274 3,1 0,47
min 78.000 1,0 0,5 370 520 1,0 0,0 0,0 1,0 3,0 370 0,0 1.900 0,0 47 -122 399 651 1,0 2.014
25% 322.000 3,0 1,8 1.430 5.040 1,0 0,0 0,0 3,0 7,0 1.190 0,0 1.951 0,0 47 -122 1.490 5.100 4,0 2.014
50% 450.000 3,0 2,2 1.910 7.618 1,5 0,0 0,0 3,0 7,0 1.560 0,0 1.975 0,0 47 -122 1.840 7.620 6,0 2.014
75% 645.000 4,0 2,5 2.550 10.685 2,0 0,0 0,0 4,0 8,0 2.210 560 1.997 0,0 47 -122 2.360 10.083 9,0 2.015
99.9% 3.480.600 8,0 5,5 7.290 495.972 3,0 1,0 4,0 5,0 12 6.114 2.372 2.015 2.014 47 -121 5.012 303.191 12 2.015
max 7.700.000 33 8,0 13.540 1.651.359 3,5 1,0 4,0 5,0 13 9.410 4.820 2.015 2.015 47 -121 6.210 871.200 12 2.015
1
2
3
4
# (continuous) distribution of price
sns.set_style('whitegrid')
plt.figure(figsize=(12,8))
sns.distplot(df['price'])
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba0584148>

png

1
2
# (discrete) distribution of bedrooms
sns.countplot(df['bedrooms'])
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2b8e148>

png

1
2
# correlations with the target
df.corr()['price'].sort_values()[-10:]
lat              0.306692
bedrooms         0.308787
sqft_basement    0.323799
view             0.397370
bathrooms        0.525906
sqft_living15    0.585241
sqft_above       0.605368
grade            0.667951
sqft_living      0.701917
price            1.000000
Name: price, dtype: float64
1
2
3
# scatterplot price e sqft_living
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='sqft_living',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba1a2c0c8>

png

1
2
# boxplot bedrooms e price
sns.boxplot(x='bedrooms',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba29eb148>

png

1
2
# boxplot waterfront e price
sns.boxplot(x='waterfront',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba4e746c8>

png

Geographical Properties

1
2
3
# does the price vary with longitude?
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='long',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2eda608>

png

1
2
3
# does the price vary with latitude?
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='lat',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2f71108>

png

1
2
3
4
# plot latitude and longitude (King County)
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',data=df,hue='price')
# the hue is not well distributed because strong outliers push the colour scale down
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2ebc508>

png

1
2
# top 10 outliers per price
df.select_dtypes(include=np.number).sort_values('price',ascending=False).head(10).applymap(dot_sep)
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 month year
7245 7.700.000 6 8,0 12.050 27.600 2,5 0 3 4 13 8.570 3.480 1.910 1.987 47 -122 3.940 8.800 10 2.014
3910 7.060.000 5 4,5 10.040 37.325 2,0 1 2 3 11 7.680 2.360 1.940 2.001 47 -122 3.930 25.449 6 2.014
9245 6.890.000 6 7,8 9.890 31.374 2,0 0 4 3 13 8.860 1.030 2.001 0 47 -122 4.540 42.730 9 2.014
4407 5.570.000 5 5,8 9.200 35.069 2,0 0 0 3 13 6.200 3.000 2.001 0 47 -122 3.560 24.345 8 2.014
1446 5.350.000 5 5,0 8.000 23.985 2,0 0 4 3 12 6.720 1.280 2.009 0 47 -122 4.600 21.750 4 2.015
1313 5.300.000 6 6,0 7.390 24.829 2,0 1 4 4 12 5.000 2.390 1.991 0 47 -122 4.320 24.619 4 2.015
1162 5.110.000 5 5,2 8.010 45.517 2,0 1 4 3 12 5.990 2.020 1.999 0 47 -122 3.430 26.788 10 2.014
8085 4.670.000 5 6,8 9.640 13.068 1,0 1 4 3 12 4.820 4.820 1.983 2.009 47 -122 3.270 10.454 6 2.014
2624 4.500.000 5 5,5 6.640 40.014 2,0 1 4 3 12 6.350 290 2.004 0 47 -122 3.030 23.408 8 2.014
8629 4.490.000 4 3,0 6.430 27.517 2,0 0 0 3 12 6.430 0 2.001 0 47 -122 3.720 14.592 6 2.014
1
2
# exclude the top 1% tail of the dataset, i.e. 216 observations
non_top_1_perc = df.sort_values('price',ascending=False).iloc[216:]
1
2
3
4
5
# plot latitude and longitude without the top 1% tail
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',
                data=non_top_1_perc,hue='price',
                palette='RdYlGn',edgecolor=None,alpha=0.2)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba4fdd148>

png

Feature Engineering

1
2
# drop id
df = df.drop('id',axis=1)
1
2
3
4
5
6
7
# engineering the date
# convert from string to datetime so it is easier to extract the info
df['date_string'] = df['date']
df['date'] = pd.to_datetime(df['date_string'])
df['month'] = df['date'].apply(lambda x: x.month)
df['year'] = df['date'].apply(lambda x: x.year)
df[['date_string','date','month','year']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date_string  21597 non-null  object        
 1   date         21597 non-null  datetime64[ns]
 2   month        21597 non-null  int64         
 3   year         21597 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 675.0+ KB
1
2
# boxplot anno price
sns.boxplot(x='year',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba4e86308>

png

1
2
# boxplot mese price
sns.boxplot(x='month',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba5505408>

png

1
2
# andamento prezzo medio per mese
df.groupby('month').mean()['price'].plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba197cac8>

png

1
2
# andamento prezzo medio per anno
df.groupby('year').mean()['price'].plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba5090648>

png

1
2
3
# escludo le variabili con le date complete
df = df.drop(['date','date_string'],axis=1)
df.columns
1
2
3
4
5
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'month', 'year'],
      dtype='object')
1
2
3
4
# zipcode should either be engineered or dropped; we drop it
# if included as-is the model would treat it as numeric
# there are 70 different zipcodes, so dummies are impractical; they could be grouped into richer/poorer zipcodes, or geographically (e.g. north/centre/south)
df = df.drop('zipcode',axis=1)
1
2
3
# most yr_renovated values are 0: it could be discretized as renovated vs not renovated
# there is also a correlation: the more recent the renovation, the more cases there are
df['yr_renovated'].value_counts()
0       20683
2014       91
2013       37
2003       36
2000       35
        ...  
1934        1
1959        1
1951        1
1948        1
1944        1
Name: yr_renovated, Length: 70, dtype: int64
1
2
# most sqft_basement values are 0: it could be discretized as having a basement vs not
df['sqft_basement'].value_counts()
0       13110
600       221
700       218
500       214
800       206
        ...  
792         1
2590        1
935         1
2390        1
248         1
Name: sqft_basement, Length: 306, dtype: int64
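
A hedged sketch of the discretization suggested in the two comments above (done on a copy, with hypothetical column names, so the dataframe used by the models below is left untouched):

# hypothetical binary flags: 1 if the house was ever renovated / has a basement, 0 otherwise
df_fe = df.copy()
df_fe['was_renovated'] = (df_fe['yr_renovated'] > 0).astype(int)
df_fe['has_basement'] = (df_fe['sqft_basement'] > 0).astype(int)
print(df_fe[['was_renovated','has_basement']].mean())  # share of houses with each flag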

Models

1
2
3
# X e y
X = df.drop('price',axis=1)
y = df['price']
1
2
# train test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
1
2
3
4
5
6
# Min-Max scaling
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train) # careful: fit only on the train set
X_test = scaler.transform(X_test)
print(X_train.shape)
print(X_test.shape)
1
2
(15117, 19)
(6480, 19)
# define the model
model = Sequential()

model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(1))

model.compile(optimizer='adam',loss='mse')
1
2
3
4
5
# fit the model
model.fit(x=X_train,y=y_train.values,
          validation_data=(X_test,y_test.values),
          verbose=2,batch_size=128,epochs=400)
# the validation data is not used for tuning, only for monitoring
1
2
3
4
5
6
7
8
9
Train on 15117 samples, validate on 6480 samples
Epoch 1/400
15117/15117 - 0s - loss: 96382199451.0191 - val_loss: 92658238633.4025
Epoch 2/400
...
Epoch 400/400
15117/15117 - 0s - loss: 29373463675.0805 - val_loss: 27157900968.1383

<tensorflow.python.keras.callbacks.History at 0x18bafce7548>
1
2
3
4
# loss trend (MSE)
losses = pd.DataFrame(model.history.history)
losses.plot()
# val_loss is computed on the test set and is useful to spot overfitting; we are not overfitting, so we could keep training for more epochs
1
<matplotlib.axes._subplots.AxesSubplot at 0x18bafd17f48>

png

1
2
# predictions
predictions = model.predict(X_test)
1
2
3
4
5
# metrics
print('MAE:',dot_sep(mean_absolute_error(y_test,predictions)))
print('MSE:',dot_sep(mean_squared_error(y_test,predictions)))
print('RMSE:',dot_sep(mean_squared_error(y_test,predictions)**0.5))
print('Explained Var Score:',dot_sep(explained_variance_score(y_test,predictions)))
1
2
3
4
MAE: 99.996
MSE: 27.157.901.022
RMSE: 164.796
Explained Var Score: 0,8
1
2
3
print('Media Price:',dot_sep(df['price'].mean()))
print('Mediana Price:',dot_sep(df['price'].median()))
# the MAE is about 100k, i.e. roughly 20% of the mean price
1
2
Media Price: 540.296
Mediana Price: 450.000
1
2
3
# plot observed vs predicted
plt.scatter(y_test,predictions,edgecolors='black',alpha=0.5)
plt.plot(y_test,y_test,'r')
1
[<matplotlib.lines.Line2D at 0x18bafcd3948>]

png

1
2
3
4
# error distribution (the arrays need to be put into the same format)
print(type(y_test))
print(type(predictions))
print(type(y_test.values.reshape(6480, 1)))
1
2
3
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
1
2
3
# error distribution
errors = y_test.values.reshape(6480, 1) - predictions
sns.distplot(errors)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18bb3093288>

png

Prediction on new observation

1
2
3
# example observation to predict
single_house = df.drop('price',axis=1).iloc[0]
single_house
bedrooms            3.0000
bathrooms           1.0000
sqft_living      1180.0000
sqft_lot         5650.0000
floors              1.0000
waterfront          0.0000
view                0.0000
condition           3.0000
grade               7.0000
sqft_above       1180.0000
sqft_basement       0.0000
yr_built         1955.0000
yr_renovated        0.0000
lat                47.5112
long             -122.2570
sqft_living15    1340.0000
sqft_lot15       5650.0000
month              10.0000
year             2014.0000
Name: 0, dtype: float64
1
2
3
# it must be scaled and reshaped into a row vector
single_house = scaler.transform(single_house.values.reshape(-1, 19))
single_house
1
2
3
4
array([[0.2       , 0.08      , 0.08376422, 0.00310751, 0.        ,
        0.        , 0.        , 0.5       , 0.4       , 0.10785619,
        0.        , 0.47826087, 0.        , 0.57149751, 0.21760797,
        0.16193426, 0.00582059, 0.81818182, 0.        ]])
1
2
# we know what the observed target is
df.head(1)
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 month year
0 221900.0 3 1.0 1180 5650 1.0 0 0 3 7 1180 0 1955 0 47.5112 -122.257 1340 5650 10 2014
1
2
3
# prediction
model.predict(single_house)
# the model could be improved by excluding the top 1% of outliers and doing more feature engineering
1
array([[294433.44]], dtype=float32)

Breast cancer Wisconsin

Keras Classification

# lib
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report,confusion_matrix

import random as rn
import os

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.callbacks import EarlyStopping
1
2
3
# df
df = pd.read_csv('cancer_classification.csv')
df.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension benign_0__mal_1
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

1
2
# controllo missing
df.isnull().sum().sum()
1
0
1
2
# show thousands separators as dots and decimals as commas
dot_sep = lambda x: format(round(x,2) if abs(x) < 1 else round(x,1) if abs(x) < 10 else int(x), ',').replace(",", "X").replace(".", ",").replace("X", ".")
1
2
# describe; I could use .transpose, but I prefer this layout and tweak the decimals
df.describe(percentiles=[0.25,0.5,0.75,0.999]).applymap(dot_sep)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension benign_0__mal_1
count 569 569 569 569 569 569 569 569 569 569 ... 569 569 569 569 569 569 569 569 569 569
mean 14 19 91 654 0,1 0,1 0,09 0,05 0,18 0,06 ... 25 107 880 0,13 0,25 0,27 0,11 0,29 0,08 0,63
std 3,5 4,3 24 351 0,01 0,05 0,08 0,04 0,03 0,01 ... 6,1 33 569 0,02 0,16 0,21 0,07 0,06 0,02 0,48
min 7,0 9,7 43 143 0,05 0,02 0,0 0,0 0,11 0,05 ... 12 50 185 0,07 0,03 0,0 0,0 0,16 0,06 0,0
25% 11 16 75 420 0,09 0,06 0,03 0,02 0,16 0,06 ... 21 84 515 0,12 0,15 0,11 0,06 0,25 0,07 0,0
50% 13 18 86 551 0,1 0,09 0,06 0,03 0,18 0,06 ... 25 97 686 0,13 0,21 0,23 0,1 0,28 0,08 1,0
75% 15 21 104 782 0,11 0,13 0,13 0,07 0,2 0,07 ... 29 125 1.084 0,15 0,34 0,38 0,16 0,32 0,09 1,0
99.9% 27 36 187 2.499 0,15 0,33 0,43 0,2 0,3 0,1 ... 48 238 3.787 0,22 0,99 1,2 0,29 0,61 0,19 1,0
max 28 39 188 2.501 0,16 0,35 0,43 0,2 0,3 0,1 ... 49 251 4.254 0,22 1,1 1,3 0,29 0,66 0,21 1,0

9 rows × 31 columns

# check the target distribution
ax = sns.countplot(x='benign_0__mal_1',data=df)
# it is not too imbalanced

# add the frequencies on top of the bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.0f}'.format(height), # '{:1.2f}'.format(height/float(len(df)))
            ha="center")

png

1
2
3
# heatmap
plt.figure(figsize=(12,12))
sns.heatmap(df.corr())
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b02b93048>

png

1
2
# head and tail of the variables most correlated with the target
pd.concat([df.corr()['benign_0__mal_1'].sort_values().head(),df.corr()['benign_0__mal_1'].sort_values().tail()])
worst concave points     -0.793566
worst perimeter          -0.782914
mean concave points      -0.776614
worst radius             -0.776454
mean perimeter           -0.742636
symmetry error            0.006522
texture error             0.008303
mean fractal dimension    0.012838
smoothness error          0.067016
benign_0__mal_1           1.000000
Name: benign_0__mal_1, dtype: float64
1
2
# correlations with the target (excluding the target itself)
df.corr()['benign_0__mal_1'][:-1].sort_values().plot(kind='bar')
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b02bbbb88>

png

Model

1
2
3
# X e y (come numpy arrays)
X = df.drop('benign_0__mal_1',axis=1).values
y = df['benign_0__mal_1'].values
1
2
# train e test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=101)
1
2
3
4
5
6
7
# scaling data
scaler = MinMaxScaler()
# scaler.fit(X_train) # estimates the parameters
# X_train = scaler.transform(X_train) # applies the parameters
# X_test = scaler.transform(X_test)
X_train = scaler.fit_transform(X_train) # estimates and applies the parameters in a single call
X_test = scaler.transform(X_test) # note: fit must not be applied to the test set; I had mistakenly fitted it here before, so results will differ slightly
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
# define the neural network structure for classification
model = Sequential()
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=15,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
9
%%time
# fit the model (deliberately overfitting)
# https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network
# https://datascience.stackexchange.com/questions/18414/are-there-any-rules-for-choosing-the-size-of-a-mini-batch
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=0
          )
1
2
3
Wall time: 35.6 s

<tensorflow.python.keras.callbacks.History at 0x26b07120608>
1
2
3
# loss crossentropy
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b05f654c8>

png

Early Stopping

1
2
# 'Stop training when a monitored quantity has stopped improving'
help(EarlyStopping)
Help on class EarlyStopping in module tensorflow.python.keras.callbacks:

class EarlyStopping(Callback)
 |  EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto', baseline=None, restore_best_weights=False)
 |  
 |  Stop training when a monitored quantity has stopped improving.
 |  
 |  Arguments:
 |      monitor: Quantity to be monitored.
 |      min_delta: Minimum change in the monitored quantity
 |          to qualify as an improvement, i.e. an absolute
 |          change of less than min_delta, will count as no
 |          improvement.
 |      patience: Number of epochs with no improvement
 |          after which training will be stopped.
 |      verbose: verbosity mode.
 |      mode: One of `{"auto", "min", "max"}`. In `min` mode,
 |          training will stop when the quantity
 |          monitored has stopped decreasing; in `max`
 |          mode it will stop when the quantity
 |          monitored has stopped increasing; in `auto`
 |          mode, the direction is automatically inferred
 |          from the name of the monitored quantity.
 |      baseline: Baseline value for the monitored quantity.
 |          Training will stop if the model doesn't show improvement over the
 |          baseline.
 |      restore_best_weights: Whether to restore model weights from
 |          the epoch with the best value of the monitored quantity.
 |          If False, the model weights obtained at the last step of
 |          training are used.
 |  
 |  ...
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
# define the neural network structure for classification
model = Sequential()
model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=15,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
1
2
# define the early stopping callback
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
%%time
# fit the model with early stopping to limit overfitting
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=0,
          callbacks=[early_stop]
          )
1
2
3
4
Epoch 00047: early stopping
Wall time: 3.06 s

<tensorflow.python.keras.callbacks.History at 0x26b08b5b2c8>
1
2
3
# loss crossentropy
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b08f1bb48>

png

DropOut Layers

1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
# define the neural network structure for classification
model = Sequential()
model.add(Dense(units=30,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=15,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
%%time
# fit the model with early stopping and dropout to limit overfitting
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=0,
          callbacks=[early_stop]
          )
1
2
3
4
Epoch 00107: early stopping
Wall time: 8.93 s

<tensorflow.python.keras.callbacks.History at 0x26b091d5b48>
1
2
3
# loss crossentropy
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b0583fc08>

png

Metrics

1
2
# predictions
predictions = model.predict_classes(X_test)
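
Note: predict_classes works in the TensorFlow version used here but was removed in later releases; for a single sigmoid output the equivalent is to threshold the predicted probabilities, e.g.:

# equivalent to predict_classes for one sigmoid output unit
predictions = (model.predict(X_test) > 0.5).astype('int32')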
1
2
3
4
5
# metrics
print('\nConfusion Matrix:')
print(confusion_matrix(y_test,predictions))
print('\nClassification metrics:')
print(classification_report(y_test,predictions))
Confusion Matrix:
[[54  1]
 [ 6 82]]

Classification metrics:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94        55
           1       0.99      0.93      0.96        88

    accuracy                           0.95       143
   macro avg       0.94      0.96      0.95       143
weighted avg       0.95      0.95      0.95       143

LendingClub dataset

Kaggle: Predict default loans with classification

# lib
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report,confusion_matrix

import random as rn
import os

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
1
2
3
# df
df = pd.read_csv('/lending_club_loan_two.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  dti                   396030 non-null  float64
 16  earliest_cr_line      396030 non-null  object 
 17  open_acc              396030 non-null  float64
 18  pub_rec               396030 non-null  float64
 19  revol_bal             396030 non-null  float64
 20  revol_util            395754 non-null  float64
 21  total_acc             396030 non-null  float64
 22  initial_list_status   396030 non-null  object 
 23  application_type      396030 non-null  object 
 24  mort_acc              358235 non-null  float64
 25  pub_rec_bankruptcies  395495 non-null  float64
 26  address               396030 non-null  object 
dtypes: float64(12), object(15)
memory usage: 81.6+ MB

Step 1: EDA

# check the target distribution
ax = sns.countplot(x='loan_status',data=df)
# somewhat imbalanced: we should expect a high accuracy, but precision and recall will be the hard part

# add the frequencies on top of the bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height+3000,
            '{:,.0f}'.format(height).replace(",", "X").replace(".", ",").replace("X", "."), # '{:1.2f}'.format(height/float(len(df)))
            ha="center")

png

1
2
3
4
# histogram di loan_amnt
plt.figure(figsize=(12,4))
sns.distplot(df['loan_amnt'],kde=False,color='b',bins=40,hist_kws=dict(edgecolor='grey'))
plt.xlim(0,45000)
1
(0.0, 45000.0)

png

1
2
3
4
# heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),annot=True,cmap='coolwarm',linecolor='white',linewidths=1)
# plt.ylim(10, 0)
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db149a6f88>

png

1
2
3
# mezzo-pairplot (parte 1)
df_pair1 = df[df.columns[0:18]].sample(n=10000, random_state=1311)
sns.pairplot(df_pair1,hue='loan_status',diag_kind='hist',diag_kws=dict(edgecolor='black',alpha=0.6,bins=30),plot_kws=dict(alpha=0.4))
1
<seaborn.axisgrid.PairGrid at 0x18e3e6a8cc8>

png

1
2
3
# mezzo-pairplot (parte 2)
df_pair2 = df.iloc[:, np.r_[12,18:27]].sample(n=10000, random_state=1311) # indexer
sns.pairplot(df_pair2,hue='loan_status',diag_kind='hist',diag_kws=dict(edgecolor='black',alpha=0.6,bins=30),plot_kws=dict(alpha=0.4))
1
<seaborn.axisgrid.PairGrid at 0x18e0a62d088>

png

1
2
# scatterplot 
sns.scatterplot(x='installment',y='loan_amnt',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db16fec1c8>

png

1
2
# boxplot loan_status e loan amount
sns.boxplot(x='loan_status',y='loan_amnt',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db177d3ac8>

png

1
2
3
4
5
# show thousands separators as dots and decimals as commas
dot_sep = lambda x: format(round(x,2) if abs(x) < 1 else round(x,1) if abs(x) < 10 else int(x), ',').replace(",", "X").replace(".", ",").replace("X", ".")

# loan amount by loan status
df.groupby('loan_status')['loan_amnt'].describe().applymap(dot_sep)
count mean std min 25% 50% 75% max
loan_status
Charged Off 77.673 15.126 8.505 1.000 8.525 14.000 20.000 40.000
Fully Paid 318.357 13.866 8.302 500 7.500 12.000 19.225 40.000
1
2
# countplot per grade stratificato per target
sns.countplot(x='grade',hue='loan_status',data=df,order=sorted(df['grade'].unique()))
1
<matplotlib.axes._subplots.AxesSubplot at 0x2a843721688>

png

1
2
3
# countplot per subgrade
plt.figure(figsize=(12,4))
sns.countplot(x='sub_grade',data=df,order=sorted(df['sub_grade'].unique()),palette='coolwarm')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db14a6f548>

png

1
2
3
# countplot per subgrade stratificato per target
plt.figure(figsize=(12,4))
sns.countplot(x='sub_grade',data=df,order=sorted(df['sub_grade'].unique()),palette='coolwarm',hue='loan_status')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db16a31888>

png

1
2
3
4
5
# countplot per subgrade (F e G) stratificato per target
plt.figure(figsize=(12,4))
df_FG = df[(df['grade']=='G') | (df['grade']=='F')]
# df_FG = df[df['sub_grade'].apply(lambda x: x[0] in ['G','F'])]
sns.countplot(x='sub_grade',data=df_FG,order=sorted(df_FG['sub_grade'].unique()),palette='coolwarm',hue='loan_status')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db1782ed48>

png

1
2
3
4
5
6
7
8
# the map method is the fastest way to remap a variable
mapping_dict = {'Fully Paid': 1, 'Charged Off': 0}
df['loan_status'].map(mapping_dict).value_counts() # if the mapping is not exhaustive, the original value can be kept instead of NaN with .fillna(df['loan_status'])

# Identical results but slower:
# df['loan_status'].replace(mapping_dict).value_counts()
# df.replace({'loan_status': mapping_dict})['loan_status'].value_counts() # this way the column must be specified and it acts on the whole df
# df['loan_status'].replace(to_replace=['Fully Paid', 'Charged Off'], value=[1, 0]).value_counts()
1
2
3
1    318357
0     77673
Name: loan_status, dtype: int64
1
2
3
4
5
6
# create a dummy / dichotomize the target loan_status
df['loan_repaid'] = df['loan_status'].map({'Fully Paid':1,'Charged Off':0})
print(df.shape)
# contingency table
df.groupby(["loan_repaid", "loan_status"]).size().reset_index(name="Frequenza")
# pd.crosstab(df['loan_repaid'],df['loan_status'])
1
(396030, 28)
loan_repaid loan_status Frequenza
0 0 Charged Off 77673
1 1 Fully Paid 318357
1
2
3
# correlation of the target with the other numeric variables
df.corr()['loan_repaid'][:-1].sort_values().plot(kind='bar')
# df.corr()['loan_repaid'].sort_values().drop('loan_repaid').plot(kind='bar')
1
<matplotlib.axes._subplots.AxesSubplot at 0x18e17784e48>

png

Step 2: Data Preprocessing

Section Goals: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert categorical string features to dummy variables.

1
2
# df numero record
len(df)
1
396030
1
2
# missing data
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db17a29148>

png

1
2
3
4
5
6
7
# missing data count
missing_counts = pd.concat(
    [df.isnull().sum()[df.isnull().sum()>0].apply(dot_sep),
    (df.isnull().sum()[df.isnull().sum()>0]/len(df)*100).apply(dot_sep)], 
    axis=1)
missing_counts.columns = ['Freq', 'Freq %']
missing_counts
Freq Freq %
emp_title 22.927 5,8
emp_length 18.301 4,6
title 1.755 0,44
revol_util 276 0,07
mort_acc 37.795 9,5
pub_rec_bankruptcies 535 0,14
1
2
3
4
5
6
# unique employment job titles
print(df['emp_title'].value_counts())
print('\nUnivoci:',dot_sep(df['emp_title'].nunique()))
# too many to create dummies; I drop the column, although the titles could be grouped
df.drop('emp_title',inplace=True,axis=1)
print(df.shape)
Teacher                     4389
Manager                     4250
Registered Nurse            1856
RN                          1846
Supervisor                  1830
                            ... 
Annunciation                   1
Atos Inc                       1
chevy parts maneger            1
Architectural Intern           1
GroupSystems Corporation       1
Name: emp_title, Length: 173105, dtype: int64

Univoci: 173.105
(396030, 27)
1
2
3
4
# sorted(df['emp_length'].dropna().unique())
emp_length_order = ['Missing','< 1 year', '1 year', '2 years', '3 years', 
                    '4 years', '5 years', '6 years', '7 years', 
                    '8 years', '9 years', '10+ years']
# countplot emp_length
plt.figure(figsize=(12,4))
ax = sns.countplot(x='emp_length',data=df[['emp_length']].fillna('Missing'),order=emp_length_order)

# aggiungo le frequenze
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1000,
            '{:,.0f}'.format(height).replace(",", "X").replace(".", ",").replace("X", "."),
            ha="center")

png

1
2
3
# countplot emp_length
plt.figure(figsize=(12,4))
sns.countplot(x='emp_length',data=df[['emp_length','loan_status']].fillna('Missing'),order=emp_length_order, hue='loan_status')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db203a15c8>

png

1
2
3
4
# charge-off (failure) rate for each emp_length
# emp_len = df[df['loan_status']=="Charged Off"].groupby("emp_length").count()['loan_status']/df.groupby("emp_length").count()['loan_status']
emp_len = pd.DataFrame(df[['emp_length','loan_status']].fillna('Missing').groupby(['emp_length','loan_status']).size().groupby(level=0).apply(lambda x: x / x.sum()).xs('Charged Off',level='loan_status'),columns=['% Failure']).reset_index()
emp_len
emp_length % Failure
0 1 year 0.199135
1 10+ years 0.184186
2 2 years 0.193262
3 3 years 0.195231
4 4 years 0.192385
5 5 years 0.192187
6 6 years 0.189194
7 7 years 0.194774
8 8 years 0.199760
9 9 years 0.200470
10 < 1 year 0.206872
11 Missing 0.275286
1
2
3
4
5
6
7
# barplot of the failure rate
plt.figure(figsize=(12,4))
# emp_len.plot(kind='bar')
sns.barplot(x='emp_length',y='% Failure',data=emp_len,order=emp_length_order,palette='coolwarm')
# there is no strong evidence, so I drop 'emp_length'
df.drop('emp_length',axis=1,inplace=True)
df.shape
1
(396030, 26)

png

1
2
3
4
5
6
7
# missing data count
missing_counts = pd.concat(
    [df.isnull().sum()[df.isnull().sum()>0].apply(dot_sep),
    (df.isnull().sum()[df.isnull().sum()>0]/len(df)*100).apply(dot_sep)], 
    axis=1)
missing_counts.columns = ['Freq', 'Freq %']
missing_counts
Freq Freq %
title 1.755 0,44
revol_util 276 0,07
mort_acc 37.795 9,5
pub_rec_bankruptcies 535 0,14
1
2
3
# title e purpose
print(df['purpose'].value_counts().head(7), end='\n\n')
print(df['title'].value_counts().head(7))
debt_consolidation    234507
credit_card            83019
home_improvement       24030
other                  21185
major_purchase          8790
small_business          5701
car                     4697
Name: purpose, dtype: int64

Debt consolidation         152472
Credit card refinancing     51487
Home improvement            15264
Other                       12930
Debt Consolidation          11608
Major purchase               4769
Consolidation                3852
Name: title, dtype: int64
1
2
3
4
5
6
# title and purpose
print('Purpose nunique:',df['purpose'].nunique())
print('Title nunique:',df['title'].nunique())
# drop title: it is a subcategory of purpose and it would not make sense to turn it into dummies
df.drop('title',axis=1,inplace=True)
print(df.shape)
1
2
3
Purpose nunique: 14
Title nunique: 48817
(396030, 25)
1
2
3
4
5
6
7
# missing data count
missing_counts = pd.concat(
    [df.isnull().sum()[df.isnull().sum()>0].apply(dot_sep),
    (df.isnull().sum()[df.isnull().sum()>0]/len(df)*100).apply(dot_sep)], 
    axis=1)
missing_counts.columns = ['Freq', 'Freq %']
missing_counts
Freq Freq %
revol_util 276 0,07
mort_acc 37.795 9,5
pub_rec_bankruptcies 535 0,14
1
2
3
# mort_acc (number of mortgage accounts)
print('mort_acc nunique:', df['mort_acc'].nunique())
print(df['mort_acc'].value_counts().head().apply(dot_sep))
1
2
3
4
5
6
7
mort_acc nunique: 33
0.0    139.777
1.0     60.416
2.0     49.948
3.0     38.049
4.0     27.887
Name: mort_acc, dtype: object
1
2
# mort_acc correlations
df.corr()['mort_acc'].sort_values()
int_rate               -0.082583
dti                    -0.025439
revol_util              0.007514
pub_rec                 0.011552
pub_rec_bankruptcies    0.027239
loan_repaid             0.073111
open_acc                0.109205
installment             0.193694
revol_bal               0.194925
loan_amnt               0.222315
annual_inc              0.236320
total_acc               0.381072
mort_acc                1.000000
Name: mort_acc, dtype: float64
1
df[['mort_acc','total_acc']].head()
mort_acc total_acc
0 0.0 25.0
1 3.0 27.0
2 0.0 26.0
3 0.0 13.0
4 1.0 43.0
1
2
3
# Mean of mort_acc column per total_acc
print('Mean of mort_acc column per total_acc:')
print(df.groupby('total_acc').mean()['mort_acc'])
Mean of mort_acc column per total_acc:
total_acc
2.0      0.000000
3.0      0.052023
4.0      0.066743
5.0      0.103289
6.0      0.151293
           ...   
124.0    1.000000
129.0    1.000000
135.0    3.000000
150.0    2.000000
151.0    0.000000
Name: mort_acc, Length: 118, dtype: float64
# fillna by group for mort_acc

# lambda function with two variables
# def fill_mort_acc(total_acc,mort_acc):
#     if np.isnan(mort_acc):
#         return total_acc_avg[total_acc]
#     else:
#         return mort_acc
# df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis=1)

df['mort_acc'] = df[['mort_acc','total_acc']].groupby('total_acc').transform(lambda x: x.fillna(x.mean()))
df.shape
1
(396030, 25)
1
2
3
4
5
# drop records for the columns with less than 0.5% missing
print('Pre rimozione missing:',len(df))
df.dropna(inplace=True)
print('Post rimozione missing:',len(df))
df.shape
1
2
3
4
Pre rimozione missing: 396030
Post rimozione missing: 395219

(395219, 25)
1
2
# verifica missing rimasti
df.isnull().sum().sum()
1
0

Step 3: Categorical Variables and Dummy Variables

String values due to the categorical columns

1
2
# distribuzione tipi colonne
df.dtypes.value_counts()
1
2
3
4
float64    12
object     12
int64       1
dtype: int64
1
2
3
# seleziono colonne non numeriche
df.select_dtypes(['object']).columns
# df.select_dtypes(exclude=['float64','int64']).columns
1
2
3
4
Index(['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status',
       'issue_d', 'loan_status', 'purpose', 'earliest_cr_line',
       'initial_list_status', 'application_type', 'address'],
      dtype='object')
1
2
3
4
5
# codifico term in numeric
print('Pre codifica:', df['term'].unique())
df['term'] = df['term'].map({' 36 months':36,' 60 months':60})
# df['term'] = df['term'].apply(lambda x: int(x[:3]))
print('Post codifica:', df['term'].unique())
1
2
Pre codifica: [' 36 months' ' 60 months']
Post codifica: [36 60]
1
2
3
4
5
# sub_grade is a subcategory of grade but it can be turned into dummies, so we drop grade
print('Univoci grade:', df['grade'].nunique())
print('Univoci sub_grade:', df['sub_grade'].nunique())
df.drop('grade',axis=1,inplace=True)
print(df.shape)
1
2
3
Univoci grade: 7
Univoci sub_grade: 35
(395219, 24)
1
2
3
4
5
# dummyfication
df = pd.get_dummies(df,columns=['sub_grade'],drop_first=True)
# subgrade_dummies = pd.get_dummies(df['sub_grade'],drop_first=True,prefix='sub_grade')
# df = pd.concat([df.drop('sub_grade',axis=1),subgrade_dummies],axis=1)
df.columns
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'home_ownership',
       'annual_inc', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'initial_list_status',
       'application_type', 'mort_acc', 'pub_rec_bankruptcies', 'address',
       'loan_repaid', 'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4',
       'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3',
       'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2',
       'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1',
       'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5',
       'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4',
       'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3',
       'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2',
       'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5'],
      dtype='object')
1
2
# verifico le colonne non numeriche rimaste
df.select_dtypes(['object']).columns
1
2
3
4
Index(['home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'earliest_cr_line', 'initial_list_status',
       'application_type', 'address'],
      dtype='object')
1
2
3
4
5
# rimanenti variabili categoriche
print('verification_status nunique:',df['verification_status'].nunique())
print('application_type nunique:',df['application_type'].nunique())
print('initial_list_status nunique:',df['initial_list_status'].nunique())
print('purpose nunique:',df['purpose'].nunique())
1
2
3
4
verification_status nunique: 3
application_type nunique: 3
initial_list_status nunique: 2
purpose nunique: 14
1
2
# dummy encoding
df = pd.get_dummies(df,columns=['verification_status', 'application_type','initial_list_status','purpose'],drop_first=True)
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
2
3
Index(['home_ownership', 'issue_d', 'loan_status', 'earliest_cr_line',
       'address'],
      dtype='object')
1
2
3
4
5
# collapse the rare home_ownership categories (NONE, ANY) into OTHER
print(df['home_ownership'].value_counts())
print('\n')
# df['home_ownership'].map({'NONE':'OTHER','ANY':'OTHER'}).fillna(df['home_ownership']).value_counts()
print(df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER').value_counts())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          110
NONE            29
ANY              3
Name: home_ownership, dtype: int64


MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          142
Name: home_ownership, dtype: int64
1
2
3
4
# dummy encoding of home_ownership
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
df = pd.get_dummies(df,columns=['home_ownership'],drop_first=True)
df.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'annual_inc', 'issue_d',
       'loan_status', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'mort_acc',
       'pub_rec_bankruptcies', 'address', 'loan_repaid', 'sub_grade_A2',
       'sub_grade_A3', 'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1',
       'sub_grade_B2', 'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5',
       'sub_grade_C1', 'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4',
       'sub_grade_C5', 'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3',
       'sub_grade_D4', 'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2',
       'sub_grade_E3', 'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1',
       'sub_grade_F2', 'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5',
       'sub_grade_G1', 'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4',
       'sub_grade_G5', 'verification_status_Source Verified',
       'verification_status_Verified', 'application_type_INDIVIDUAL',
       'application_type_JOINT', 'initial_list_status_w',
       'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement', 'purpose_house',
       'purpose_major_purchase', 'purpose_medical', 'purpose_moving',
       'purpose_other', 'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'home_ownership_OTHER',
       'home_ownership_OWN', 'home_ownership_RENT'],
      dtype='object')
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
Index(['issue_d', 'loan_status', 'earliest_cr_line', 'address'], dtype='object')
1
2
3
4
# feature engineering address column
print('address nunique:',df['address'].nunique())
print('\n')
print(df['address'].value_counts().head(10))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
address nunique: 392898


USS Johnson\nFPO AE 48052     8
USCGC Smith\nFPO AE 70466     8
USS Smith\nFPO AP 70466       8
USNS Johnson\nFPO AE 05113    8
USNS Johnson\nFPO AP 48052    7
USNS Johnson\nFPO AA 70466    6
USNV Brown\nFPO AA 48052      6
USS Smith\nFPO AP 22690       6
USCGC Miller\nFPO AA 22690    6
USCGC Smith\nFPO AA 70466     6
Name: address, dtype: int64
1
2
3
# create zip_code variable (last 5 characters of address)
df['zip_code'] = df['address'].apply(lambda x: x[-5:])
df['zip_code'].unique()
1
2
array(['22690', '05113', '00813', '11650', '30723', '70466', '29597',
       '48052', '86630', '93700'], dtype=object)
1
2
3
4
# dummy zip code
df.drop('address',axis=1,inplace=True)
df = pd.get_dummies(df,columns=['zip_code'],drop_first=True)
df.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'annual_inc', 'issue_d',
       'loan_status', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'mort_acc',
       'pub_rec_bankruptcies', 'loan_repaid', 'sub_grade_A2', 'sub_grade_A3',
       'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2',
       'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1',
       'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5',
       'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4',
       'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3',
       'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2',
       'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1',
       'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5',
       'verification_status_Source Verified', 'verification_status_Verified',
       'application_type_INDIVIDUAL', 'application_type_JOINT',
       'initial_list_status_w', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase',
       'purpose_medical', 'purpose_moving', 'purpose_other',
       'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'home_ownership_OTHER',
       'home_ownership_OWN', 'home_ownership_RENT', 'zip_code_05113',
       'zip_code_11650', 'zip_code_22690', 'zip_code_29597', 'zip_code_30723',
       'zip_code_48052', 'zip_code_70466', 'zip_code_86630', 'zip_code_93700'],
      dtype='object')
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
Index(['issue_d', 'loan_status', 'earliest_cr_line'], dtype='object')
1
2
3
# issue_d is data leakage: we would only know it once the target has been realized, so it must be excluded
df.drop('issue_d',axis=1,inplace=True)
df.shape
1
(395219, 80)
1
2
3
4
5
6
# from earliest_cr_line we can extract the year
print(df['earliest_cr_line'].head())
print('\n')
print('String length:',df['earliest_cr_line'].apply(lambda x: len(x)).unique()) # all strings have constant length 8, so the year can be taken from the end
df['earliest_cr_year'] = df['earliest_cr_line'].apply(lambda x: int(x[-4:]))
df.drop('earliest_cr_line',axis=1,inplace=True)
1
2
3
4
5
6
7
8
0    Jun-1990
1    Jul-2004
2    Aug-2007
3    Sep-2006
4    Mar-1999
Name: earliest_cr_line, dtype: object

String length: [8]
1
2
3
# back up the df, in case I make a mistake
df_backup = df.copy()
df_backup.shape
1
(395219, 80)
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
Index(['loan_status'], dtype='object')
1
2
3
4
# drop loan_status so that only the encoded target ('loan_repaid') remains
df.drop('loan_status',axis=1,inplace=True)
df.shape
# we are ready!
1
(395219, 79)

Step 4: Model

Tuning NN
Dropout: links 1, 2, 3 (a tuned variant is sketched below)
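A couple of the tuning knobs noted in the model definition further down (dropout rate, number of epochs) can be explored with a lighter dropout and an EarlyStopping callback, so the epoch count does not have to be fixed by hand. This is only a sketch of a possible variant, assuming the same X_train/X_test split and scaling defined below; it is not the configuration actually fitted in this post:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# same layer sizes as the model below, but with dropout 0.2 instead of 0.5
model_alt = Sequential([
    Dense(78, activation='relu'),
    Dropout(0.2),
    Dense(39, activation='relu'),
    Dropout(0.2),
    Dense(19, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),
])
model_alt.compile(loss='binary_crossentropy', optimizer='adam')

# stop when val_loss has not improved for 5 consecutive epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# model_alt.fit(X_train, y_train, validation_data=(X_test, y_test),
#               batch_size=256, epochs=100, callbacks=[early_stop], verbose=2)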

1
2
3
4
# I could work on a sample for faster experimentation
df_sample = df.sample(frac=0.1,random_state=101)
print('full df:',len(df))
print('sampled df:',len(df_sample))
1
2
full df: 395219
sampled df: 39522
1
2
3
# X and y (as numpy arrays)
X = df.drop('loan_repaid',axis=1).values
y = df['loan_repaid'].values
1
2
# train and test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20,random_state=101)
1
2
3
4
# normalizing data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # I had mistakenly used fit here; only transform must be applied to the test set, so results will differ slightly
1
2
3
4
5
# set seeds to reduce the non-determinism of the GPU fit
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
9
10
# define the neural network architecture for classification
model = Sequential()
model.add(Dense(units=78,activation='relu'))
model.add(Dropout(0.5)) # 0.2 worth trying
model.add(Dense(units=39,activation='relu'))
model.add(Dropout(0.5)) # 0.2 worth trying
model.add(Dense(units=19,activation='relu'))
model.add(Dropout(0.5)) # 0.2 worth trying
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
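The first hidden layer has 78 units, matching the 78 input features; since no input shape is declared, Keras builds the model lazily at the first fit() call. A minimal sketch of an equivalent definition with an explicit input_dim, which lets model.summary() be inspected before training (model_explicit is just an illustrative name):

# equivalent definition with an explicit input dimension (78 features)
model_explicit = Sequential()
model_explicit.add(Dense(units=78, activation='relu', input_dim=78))
model_explicit.add(Dropout(0.5))
model_explicit.add(Dense(units=39, activation='relu'))
model_explicit.add(Dropout(0.5))
model_explicit.add(Dense(units=19, activation='relu'))
model_explicit.add(Dropout(0.5))
model_explicit.add(Dense(units=1, activation='sigmoid'))
model_explicit.compile(loss='binary_crossentropy', optimizer='adam')
model_explicit.summary()  # printable immediately, since the input shape is known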
1
2
3
4
5
# set seeds to reduce the non-determinism of the GPU fit
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
%%time
# fit the model
model.fit(x=X_train,y=y_train,
          validation_data=(X_test,y_test),
          verbose=2,batch_size=128,epochs=25) # batch_size=256 worth trying
1
2
3
4
5
6
7
8
9
Train on 316175 samples, validate on 79044 samples
Epoch 1/25
316175/316175 - 8s - loss: 0.3197 - val_loss: 0.2726
...
Epoch 25/25
316175/316175 - 6s - loss: 0.2629 - val_loss: 0.2713
Wall time: 2min 51s

<tensorflow.python.keras.callbacks.History at 0x2a80300b748>
1
2
# save the model
model.save('LendingClub_Keras_Model.h5')  # creates an HDF5 file
1
'F:\\Python\\Course 001'
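The saved HDF5 file can be reloaded later without re-training; a minimal sketch, assuming the file written above is still on disk:

from tensorflow.keras.models import load_model

# restores architecture, weights and the compiled optimizer/loss state
loaded_model = load_model('LendingClub_Keras_Model.h5')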
1
2
3
# cross-entropy loss (sample run)
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x2a8573fa248>

[plot: training and validation loss per epoch]

1
2
3
# cross-entropy loss
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x2a80381b2c8>

[plot: training and validation loss per epoch]

1
2
# predictions
predictions = model.predict_classes(X_test)
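Note that predict_classes() is available only on Sequential models and has been removed in later TensorFlow releases; an equivalent and more portable formulation thresholds the predicted probabilities of the sigmoid output at 0.5:

# equivalent to predict_classes for a single sigmoid output unit
probs = model.predict(X_test)
predictions_alt = (probs > 0.5).astype(int)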
1
2
3
4
5
6
# metrics
print('\nConfusion Matrix:')
print(confusion_matrix(y_test,predictions))
print('\nClassification metrics:')
print(classification_report(y_test,predictions))
# not great, because the recall of class 0 is low
1
2
3
4
5
6
7
8
9
10
11
12
13
Confusion Matrix:
[[ 7176  8482]
 [  406 62980]]

Classification metrics:
              precision    recall  f1-score   support

           0       0.95      0.46      0.62     15658
           1       0.88      0.99      0.93     63386

    accuracy                           0.89     79044
   macro avg       0.91      0.73      0.78     79044
weighted avg       0.89      0.89      0.87     79044
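The low recall on class 0 is largely a class-imbalance effect: about 80% of the loans are repaid, so the network can reach a high accuracy while missing many defaults. A common remedy, sketched here under the assumption that the same model and split are reused, is to re-weight the loss with the class_weight argument of fit() so that errors on the minority class cost more:

import numpy as np

# weight each class inversely to its frequency in the training set
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {int(c): len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}

# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           batch_size=128, epochs=25, class_weight=class_weight, verbose=2)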

New data prediction

1
2
3
4
5
# pick a random record from df
rn.seed(101)
random_ind = rn.randint(0,len(df))
new_customer = df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer
1
2
3
4
5
6
7
8
9
10
11
12
loan_amnt           25000.00
term                   60.00
int_rate               18.24
installment           638.11
annual_inc          61665.00
                      ...   
zip_code_48052          0.00
zip_code_70466          0.00
zip_code_86630          0.00
zip_code_93700          0.00
earliest_cr_year     1996.00
Name: 305323, Length: 78, dtype: float64
1
2
3
4
5
6
# scaling
new_customer = scaler.transform(new_customer.values.reshape(1, 78))
# predict
print('Predicted probability:', model.predict(new_customer))
print('Predicted class:', model.predict_classes(new_customer))
print('Observed class:', df.iloc[random_ind]['loan_repaid'])
1
2
3
Predicted probability: [[0.54277164]]
Predicted class: [[1]]
Observed class: 1.0
1
2
# reset
%reset -f