Python: Neural Networks

Written and run on a Windows 10 laptop – the South Working effect

I use the conda environment py3_tf

1
~$ conda activate py3_tf

Installed module version

~$ pip show tensorflow
Name: tensorflow
Version: 2.2.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /home/user/miniconda3/envs/py3_tf/lib/python3.7/site-packages
Requires: h5py, keras-preprocessing, opt-einsum, wrapt, six, protobuf, tensorboard, wheel, gast, tensorflow-estimator, google-pasta, termcolor, absl-py, astunparse, scipy, numpy, grpcio
Required-by: 

Useful for monitoring the GPU

1
~$ nvidia-smi

Index

  1. Neural Nets and Deep Learning
    • Perceptron Model
    • Neural Networks
    • Activation functions
    • Cost Functions and Gradient Descent
    • Backpropagation
    • Keras Neural Network Example
    • House Sales in King County (Regression)
    • Breast cancer Wisconsin (Classification)
    • LendingClub dataset (Classification)

Neural Nets and Deep Learning

Perceptron Model


\(\hat{y}=\sum_{i=1}^{n}x_iw_i+b\)
The bias term can be read as a threshold (when negative) that must be exceeded before an input can have a positive impact on the output.
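
A minimal NumPy sketch of the perceptron output above (the input, weight and bias values are made up for illustration):

import numpy as np

# hypothetical inputs, weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 2.0

# y_hat = sum(x_i * w_i) + b
y_hat = np.dot(x, w) + b
print(y_hat)  # 0.4 - 0.12 - 1.2 + 2.0 = 1.08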

Neural Networks

Connecting many perceptrons into layers (input, hidden and output layers) gives a neural network. The Universal Approximation Theorem states that a feed-forward network with a single hidden layer and enough neurons can approximate any continuous function arbitrarily well.

Activation functions

They are used to constrain the output, for example when a classification output is required:

  • the logistic (sigmoid) function as activation (or hyperbolic functions such as tanh)
  • the Rectified Linear Unit (ReLU), defined as \(\max(0,z)\); ReLU is good at containing the vanishing gradient
  • others

N.B. Z and X are often written in uppercase to denote a tensor (multiple values). A short sketch of these functions follows.
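
A quick NumPy sketch of the activations just mentioned, evaluated on a small array (the values are only illustrative):

import numpy as np

z = np.array([-2.0, 0.0, 3.0])

sigmoid = 1 / (1 + np.exp(-z))   # squashes into (0, 1)
tanh = np.tanh(z)                # squashes into (-1, 1)
relu = np.maximum(0, z)          # max(0, z)

print(sigmoid)  # [0.119 0.5   0.953]
print(tanh)     # [-0.964  0.     0.995]
print(relu)     # [0. 0. 3.]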

Multiclass Activation Functions

  1. Non-Exclusive Classes
    More than one label can be assigned to each observation, e.g. multiple tags. The logistic activation works well: you get a probability for each class and, once a threshold is chosen, you assign one or more labels.
  2. Mutually Exclusive Classes
    Only one class is assigned. The Softmax function is used
    \(\sigma\left (\textbf{z}\right )_i=\frac{e^{z_i}}{\sum_{j=1}^Ke^{z_j}}\) for \(i=1,...,K\) classes. The result is a probability distribution summing to one; the chosen class is the one with the highest probability.

The network is structured so that it has one output node per class; a small numeric sketch of the softmax follows.
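
A small NumPy sketch of the softmax above: the outputs sum to one and the predicted class is the argmax (the scores are arbitrary).

import numpy as np

z = np.array([2.0, 1.0, 0.1])            # raw scores for K=3 classes
softmax = np.exp(z) / np.exp(z).sum()

print(softmax)           # [0.659 0.242 0.099]
print(softmax.sum())     # 1.0
print(softmax.argmax())  # 0 -> index of the chosen class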

Cost Functions and Gradient Descent

The cost function is used to monitor how the weight updates are progressing during training.

Quadratic Cost Function

\(C=\frac{1}{2n}\sum_x\left \|y(x)-a^L(x) \right \|^2\)
where
\(a^L(x)\) = predicted values (activations of the output layer)
\(y(x)\) = observed values
More generally, the cost function of a NN can be thought of as
\(C(W,B,S^r,E^r)\)
where
\(W =\) weights of the neural network
\(B =\) biases
\(S^r =\) input of a single training sample
\(E^r =\) desired output of a single training sample

We need to find the \(W_{\min}\) that minimizes the cost function. Since the problem is n-dimensional, Gradient Descent is used: with a convex cost function, the minimum is reached through steps (all of them parametrizable, e.g. via the learning rate) until the weights drive the first derivative of the cost function to 0 (or close to it). Adam is a common choice of optimizer. With more than one dimension we move from derivatives to the gradient, so we compute
\(\nabla C\left (w_1,w_2,...,w_n\right )\)
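
A minimal sketch of gradient descent on a one-dimensional quadratic cost, just to make the update rule concrete (learning rate and starting point are arbitrary):

# C(w) = (w - 3)^2 has its minimum at w = 3, with dC/dw = 2*(w - 3)
w = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # step against the gradient

print(w)  # ~3.0, where the first derivative is (almost) zero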

Cross-Entropy

For classification problems the cross-entropy loss function is often used.
\(C=-\left (y \log\left (p\right )+\left (1-y\right ) \log\left (1-p\right )\right )\)
and for a multiclass problem
\(C=-\sum_{c=1}^{M}y_{o,c}\log\left (p_{o,c}\right )\)
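
A numeric sketch of the binary cross-entropy above, for a single observation with \(y=1\):

import numpy as np

y = 1      # observed class
p = 0.9    # predicted probability of class 1

print(-(y * np.log(p) + (1 - y) * np.log(1 - p)))  # 0.105 -> confident and correct, low cost

p = 0.1    # a poor prediction makes the cost explode
print(-(y * np.log(p) + (1 - y) * np.log(1 - p)))  # 2.303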

Backpropagation

Chain-rule derivatives are used to iteratively update the weights, starting from the last layer and working backwards, while minimizing the cost function.
Hadamard Product (the elementwise product, the way * works in numpy): \(\begin{bmatrix} 1\\ 2 \end{bmatrix} \odot \begin{bmatrix} 3\\ 4 \end{bmatrix} = \begin{bmatrix} 1\cdot 3\\ 2\cdot 4 \end{bmatrix} = \begin{bmatrix} 3\\ 8 \end{bmatrix}\)
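
In NumPy the Hadamard product is simply the elementwise * operator:

import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])
print(a * b)  # [3 8] -> elementwise (Hadamard) product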

Keras Neural Network Example

# lib
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf
import random as rn
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_absolute_error,mean_squared_error
from tensorflow.keras.models import load_model
# df (goal: predict the price)
df = pd.read_csv('fake_reg.csv')
df.head()
price feature1 feature2
0 461.527929 999.787558 999.766096
1 548.130011 998.861615 1001.042403
2 410.297162 1000.070267 998.844015
3 540.382220 999.952251 1000.440940
4 546.024553 1000.446011 1000.338531
# pairplot
sns.set_style('whitegrid')
sns.pairplot(df, palette='red')
1
<seaborn.axisgrid.PairGrid at 0x2722a98fec8>

png

# X and y as np arrays
X = df[['feature1','feature2']].values
y = df['price'].values
# train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
1
2
(700, 2)
(300, 2)
1
2
# min max scaling -- help
help(MinMaxScaler)
Help on class MinMaxScaler in module sklearn.preprocessing._data:

class MinMaxScaler(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  MinMaxScaler(feature_range=(0, 1), *, copy=True)
 |  
 |  Transform features by scaling each feature to a given range.
 |  
 |  This estimator scales and translates each feature individually such
 |  that it is in the given range on the training set, e.g. between
 |  zero and one.
 |  
 |  The transformation is given by::
 |  
 |      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
 |      X_scaled = X_std * (max - min) + min
 |  
 |  where min, max = feature_range.
 |  
 |  This transformation is often used as an alternative to zero mean,
 |  unit variance scaling.
 |  ...
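
A tiny NumPy check of the formula shown in the help (with the default feature_range=(0, 1) the second line of the transformation is the identity):

import numpy as np

X = np.array([[1.0], [3.0], [5.0]])
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_std.ravel())  # [0.  0.5 1. ] -> the same values MinMaxScaler().fit_transform(X) returns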
# min max scaling
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
1
2
print('Train Min-Max: ', X_train.min(), X_train.max())
print('Test Min-Max: ', X_test.min(), X_test.max())
1
2
Train Min-Max:  0.0 1.0
Test Min-Max:  -0.014108392024496652 1.0186515935232023

Choosing an optimizer and loss

The main supervised setups in Keras are listed below:

  1. Multi-class classification problem
    1
    
    # model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    
  2. Binary classification problem
    1
    
    # model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    
  3. Mean squared error regression problem
    1
    
    # model.compile(optimizer='rmsprop', loss='mse')
    
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
# define the model one line at a time (the Dense layers can also be passed to Sequential as a list)
model = Sequential()
model.add(Dense(units=4,activation='relu')) # units are the nodes
model.add(Dense(units=4,activation='relu'))
model.add(Dense(units=4,activation='relu'))
model.add(Dense(units=1)) # final layer (we only want the price)

model.compile(optimizer='rmsprop',loss='mse')
# n.b. this step eats the GPU

Training

Description of the main parameters in Keras:

  • Sample: one element of a dataset.
    • Example: one image is a sample in a convolutional network
    • Example: one audio file is a sample for a speech recognition model
  • Batch: a set of N samples. The samples in a batch are processed independently, in parallel. If training, a batch results in only one update to the model. A batch generally approximates the distribution of the input data better than a single input: the larger the batch, the better the approximation; however, a larger batch takes longer to process and still results in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches usually result in faster evaluation/prediction).
  • Epoch: an arbitrary cutoff, generally defined as “one pass over the entire dataset”, used to separate training into distinct phases, which is useful for logging and periodic evaluation.
  • When using validation_data or validation_split with the fit method of Keras models, evaluation will be run at the end of every epoch (see the short sketch after this list).
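
As a short sketch of how these parameters appear in practice (the values are hypothetical and reuse the model and data defined above):

# batch_size: number of samples per gradient update
# epochs: passes over the entire training set
# validation_split: fraction of the training data held out and evaluated at the end of every epoch
model.fit(X_train, y_train, batch_size=32, epochs=250, validation_split=0.1)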
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
# training model
model.fit(X_train,y_train,epochs=250)
Train on 700 samples
Epoch 1/250
700/700 [==============================] - 1s 2ms/sample - loss: 256557.3441
Epoch 2/250
700/700 [==============================] - 0s 73us/sample - loss: 256365.9997
...
Epoch 250/250
700/700 [==============================] - 0s 58us/sample - loss: 24.1132

<tensorflow.python.keras.callbacks.History at 0x27236051688>

Evaluation

1
2
# loss trend
loss = model.history.history['loss']
# plot loss trend
sns.lineplot(x=range(len(loss)),y=loss)
plt.title("Training Loss per Epoch")
1
Text(0.5, 1.0, 'Training Loss per Epoch')

png

# Loss (MSE in this case)
training_score = model.evaluate(X_train,y_train,verbose=0)
test_score = model.evaluate(X_test,y_test,verbose=0)
print('training Score:', training_score)
print('test Score:', test_score)
1
2
training Score: 23.728277675083707
test Score: 25.146427205403647
1
2
# predictions
test_predictions = model.predict(X_test)
# predicted values
test_predictions = pd.Series(test_predictions.reshape(300,))
# observed values
pred_df = pd.DataFrame(y_test,columns=['Test Y'])
pred_df.head()
Test Y
0 402.296319
1 624.156198
2 582.455066
3 578.588606
4 371.224104
# predicted and observed values
pred_df = pd.concat([pred_df,test_predictions],axis=1)
pred_df.columns = ['Test Y','Model Predictions']
pred_df.head()
Test Y Model Predictions
0 402.296319 405.533844
1 624.156198 623.994934
2 582.455066 592.561340
3 578.588606 572.621155
4 371.224104 366.802795
1
2
3
# scatter predict vs observed
sns.set_style('whitegrid')
sns.scatterplot(x='Test Y',y='Model Predictions',data=pred_df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x27238970d08>

png

# distribution errors
pred_df['Error'] = pred_df['Test Y'] - pred_df['Model Predictions']
sns.distplot(pred_df['Error'],bins=50)
1
<matplotlib.axes._subplots.AxesSubplot at 0x27235ddbd88>

png

# metrics
print('MAE:',mean_absolute_error(pred_df['Test Y'],pred_df['Model Predictions']))
print('MSE:',mean_squared_error(pred_df['Test Y'],pred_df['Model Predictions']))
print('MSE (from model.evaluate):',test_score)
print('RMSE:',test_score**0.5)
MAE: 4.023428708666904
MSE: 25.14642937056938
MSE (from model.evaluate): 25.146427205403647
RMSE: 5.014621342175663
# The MAE of about 4 is an error of less than 1% of the mean price
df['price'].describe()
count    1000.000000
mean      498.673029
std        93.785431
min       223.346793
25%       433.025732
50%       502.382117
75%       564.921588
max       774.407854
Name: price, dtype: float64

New observation to predict

# [[Feature1, Feature2]]
new_gem = [[998,1000]]
# scaling
new_gem = scaler.transform(new_gem)
# predict
print(model.predict(new_gem))
1
[[419.92566]]

Saving model

1
2
# working directory
os.getcwd()
1
'F:\\Python\\Course 001'
1
2
# save
model.save('Keras_Neural_Network_Example.h5')  # creates a HDF5 file
1
2
# load
later_model = load_model(r'F:\GitHub\AlbGri.github.io\assets\files\Python\Course 001\Keras_Neural_Network_Example.h5')
1
WARNING:tensorflow:Sequential models without an `input_shape` passed to the first layer cannot reload their optimizer state. As a result, your model isstarting with a freshly initialized optimizer.
1
2
# prediction with loaded model
print(later_model.predict(new_gem))
1
[[420.05133]]

House Sales in King County

Kaggle: Predict house price using regression

# lib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
1
2
3
# df
df = pd.read_csv('/kc_house_data.csv')
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     21597 non-null  int64  
 9   view           21597 non-null  int64  
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  int64  
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   21597 non-null  int64  
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long           21597 non-null  float64
 19  sqft_living15  21597 non-null  int64  
 20  sqft_lot15     21597 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
# convert id to string (so the numeric summary statistics skip it automatically)
df['id'] = df['id'].apply(str)

EDA

1
2
# check missing
df.isnull().sum().sum()
1
0
# show thousands separators as dots and decimals as commas
dot_sep = lambda x: format(round(x,2) if abs(x) < 1 else round(x,1) if abs(x) < 10 else int(x), ',').replace(",", "X").replace(".", ",").replace("X", ".")
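
For example (a quick check of the formatter just defined):

print(dot_sep(1234567.89))  # 1.234.567 -> integers get a dot as thousands separator
print(dot_sep(0.456))       # 0,46 -> values below 1 keep two decimals, with a comma
print(dot_sep(6.55))        # 6,5 -> values below 10 keep one decimal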
# describe; I could use .transpose, but I prefer this layout and tweak the decimals
df.describe(percentiles=[0.25,0.5,0.75,0.999]).applymap(dot_sep)
# df.describe(percentiles=[0.25,0.5,0.75,0.999]).style.format("{:.1f}")
# strong outliers are present
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 month year
count 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597 21.597
mean 540.296 3,4 2,1 2.080 15.099 1,5 0,01 0,23 3,4 7,7 1.788 291 1.970 84 47 -122 1.986 12.758 6,6 2.014
std 367.368 0,93 0,77 918 41.412 0,54 0,09 0,77 0,65 1,2 827 442 29 401 0,14 0,14 685 27.274 3,1 0,47
min 78.000 1,0 0,5 370 520 1,0 0,0 0,0 1,0 3,0 370 0,0 1.900 0,0 47 -122 399 651 1,0 2.014
25% 322.000 3,0 1,8 1.430 5.040 1,0 0,0 0,0 3,0 7,0 1.190 0,0 1.951 0,0 47 -122 1.490 5.100 4,0 2.014
50% 450.000 3,0 2,2 1.910 7.618 1,5 0,0 0,0 3,0 7,0 1.560 0,0 1.975 0,0 47 -122 1.840 7.620 6,0 2.014
75% 645.000 4,0 2,5 2.550 10.685 2,0 0,0 0,0 4,0 8,0 2.210 560 1.997 0,0 47 -122 2.360 10.083 9,0 2.015
99.9% 3.480.600 8,0 5,5 7.290 495.972 3,0 1,0 4,0 5,0 12 6.114 2.372 2.015 2.014 47 -121 5.012 303.191 12 2.015
max 7.700.000 33 8,0 13.540 1.651.359 3,5 1,0 4,0 5,0 13 9.410 4.820 2.015 2.015 47 -121 6.210 871.200 12 2.015
1
2
3
4
# (continuous) distribution of price
sns.set_style('whitegrid')
plt.figure(figsize=(12,8))
sns.distplot(df['price'])
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba0584148>

png

1
2
# (discrete) distribution of bedrooms
sns.countplot(df['bedrooms'])
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2b8e148>

png

1
2
# correlations with the target
df.corr()['price'].sort_values()[-10:]
lat              0.306692
bedrooms         0.308787
sqft_basement    0.323799
view             0.397370
bathrooms        0.525906
sqft_living15    0.585241
sqft_above       0.605368
grade            0.667951
sqft_living      0.701917
price            1.000000
Name: price, dtype: float64
1
2
3
# scatterplot price e sqft_living
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='sqft_living',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba1a2c0c8>

png

1
2
# boxplot bedrooms e price
sns.boxplot(x='bedrooms',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba29eb148>

png

1
2
# boxplot waterfront e price
sns.boxplot(x='waterfront',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba4e746c8>

png

Geographical Properties

1
2
3
# does the price vary with longitude?
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='long',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2eda608>

png

1
2
3
# does the price vary with latitude?
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='lat',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2f71108>

png

1
2
3
4
# plot latitude and longitude (King County)
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',data=df,hue='price')
# the hue is not well distributed because strong outliers push the colour scale down
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba2ebc508>

png

1
2
# top 10 outliers per price
df.select_dtypes(include=np.number).sort_values('price',ascending=False).head(10).applymap(dot_sep)
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 month year
7245 7.700.000 6 8,0 12.050 27.600 2,5 0 3 4 13 8.570 3.480 1.910 1.987 47 -122 3.940 8.800 10 2.014
3910 7.060.000 5 4,5 10.040 37.325 2,0 1 2 3 11 7.680 2.360 1.940 2.001 47 -122 3.930 25.449 6 2.014
9245 6.890.000 6 7,8 9.890 31.374 2,0 0 4 3 13 8.860 1.030 2.001 0 47 -122 4.540 42.730 9 2.014
4407 5.570.000 5 5,8 9.200 35.069 2,0 0 0 3 13 6.200 3.000 2.001 0 47 -122 3.560 24.345 8 2.014
1446 5.350.000 5 5,0 8.000 23.985 2,0 0 4 3 12 6.720 1.280 2.009 0 47 -122 4.600 21.750 4 2.015
1313 5.300.000 6 6,0 7.390 24.829 2,0 1 4 4 12 5.000 2.390 1.991 0 47 -122 4.320 24.619 4 2.015
1162 5.110.000 5 5,2 8.010 45.517 2,0 1 4 3 12 5.990 2.020 1.999 0 47 -122 3.430 26.788 10 2.014
8085 4.670.000 5 6,8 9.640 13.068 1,0 1 4 3 12 4.820 4.820 1.983 2.009 47 -122 3.270 10.454 6 2.014
2624 4.500.000 5 5,5 6.640 40.014 2,0 1 4 3 12 6.350 290 2.004 0 47 -122 3.030 23.408 8 2.014
8629 4.490.000 4 3,0 6.430 27.517 2,0 0 0 3 12 6.430 0 2.001 0 47 -122 3.720 14.592 6 2.014
1
2
# exclude the top 1% tail of the dataset, i.e. 216 observations
non_top_1_perc = df.sort_values('price',ascending=False).iloc[216:]
1
2
3
4
5
# plot latitude and longitude without the top 1% tail
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',
                data=non_top_1_perc,hue='price',
                palette='RdYlGn',edgecolor=None,alpha=0.2)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba4fdd148>

png

Feature Engineering

1
2
# drop id
df = df.drop('id',axis=1)
1
2
3
4
5
6
7
# engineering the date
# convert from string to datetime so it is easier to extract the info
df['date_string'] = df['date']
df['date'] = pd.to_datetime(df['date_string'])
df['month'] = df['date'].apply(lambda x: x.month)
df['year'] = df['date'].apply(lambda x: x.year)
df[['date_string','date','month','year']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date_string  21597 non-null  object        
 1   date         21597 non-null  datetime64[ns]
 2   month        21597 non-null  int64         
 3   year         21597 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 675.0+ KB
1
2
# boxplot anno price
sns.boxplot(x='year',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba4e86308>

png

1
2
# boxplot mese price
sns.boxplot(x='month',y='price',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba5505408>

png

1
2
# andamento prezzo medio per mese
df.groupby('month').mean()['price'].plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba197cac8>

png

1
2
# andamento prezzo medio per anno
df.groupby('year').mean()['price'].plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x18ba5090648>

png

1
2
3
# escludo le variabili con le date complete
df = df.drop(['date','date_string'],axis=1)
df.columns
1
2
3
4
5
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'month', 'year'],
      dtype='object')
1
2
3
4
# zipcode should either be engineered or dropped; we drop it
# if included as-is the model would treat it as numeric
# there are 70 different zipcodes, so dummies are impractical; they could be grouped into richer/poorer zipcodes, or geographically (e.g. north/centre/south)
df = df.drop('zipcode',axis=1)
1
2
3
# most yr_renovated values are 0: it could be discretized as renovated vs not renovated
# there is also a correlation: the more recent the renovation, the more cases there are
df['yr_renovated'].value_counts()
0       20683
2014       91
2013       37
2003       36
2000       35
        ...  
1934        1
1959        1
1951        1
1948        1
1944        1
Name: yr_renovated, Length: 70, dtype: int64
1
2
# most sqft_basement values are 0: it could be discretized as having a basement vs not
df['sqft_basement'].value_counts()
0       13110
600       221
700       218
500       214
800       206
        ...  
792         1
2590        1
935         1
2390        1
248         1
Name: sqft_basement, Length: 306, dtype: int64
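
A hedged sketch of the discretization suggested in the two comments above (done on a copy, with hypothetical column names, so the dataframe used by the models below is left untouched):

# hypothetical binary flags: 1 if the house was ever renovated / has a basement, 0 otherwise
df_fe = df.copy()
df_fe['was_renovated'] = (df_fe['yr_renovated'] > 0).astype(int)
df_fe['has_basement'] = (df_fe['sqft_basement'] > 0).astype(int)
print(df_fe[['was_renovated','has_basement']].mean())  # share of houses with each flag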

Models

1
2
3
# X e y
X = df.drop('price',axis=1)
y = df['price']
1
2
# train test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
1
2
3
4
5
6
# Min-Max scaling
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train) # careful: fit only on the train set
X_test = scaler.transform(X_test)
print(X_train.shape)
print(X_test.shape)
1
2
(15117, 19)
(6480, 19)
# define the model
model = Sequential()

model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(1))

model.compile(optimizer='adam',loss='mse')
1
2
3
4
5
# fit the model
model.fit(x=X_train,y=y_train.values,
          validation_data=(X_test,y_test.values),
          verbose=2,batch_size=128,epochs=400)
# the validation data is not used for tuning, only for monitoring
1
2
3
4
5
6
7
8
9
Train on 15117 samples, validate on 6480 samples
Epoch 1/400
15117/15117 - 0s - loss: 96382199451.0191 - val_loss: 92658238633.4025
Epoch 2/400
...
Epoch 400/400
15117/15117 - 0s - loss: 29373463675.0805 - val_loss: 27157900968.1383

<tensorflow.python.keras.callbacks.History at 0x18bafce7548>
1
2
3
4
# loss trend (MSE)
losses = pd.DataFrame(model.history.history)
losses.plot()
# val_loss is computed on the test set and is useful to spot overfitting; we are not overfitting, so we could keep training for more epochs
1
<matplotlib.axes._subplots.AxesSubplot at 0x18bafd17f48>

png

1
2
# predictions
predictions = model.predict(X_test)
1
2
3
4
5
# metrics
print('MAE:',dot_sep(mean_absolute_error(y_test,predictions)))
print('MSE:',dot_sep(mean_squared_error(y_test,predictions)))
print('RMSE:',dot_sep(mean_squared_error(y_test,predictions)**0.5))
print('Explained Var Score:',dot_sep(explained_variance_score(y_test,predictions)))
1
2
3
4
MAE: 99.996
MSE: 27.157.901.022
RMSE: 164.796
Explained Var Score: 0,8
1
2
3
print('Media Price:',dot_sep(df['price'].mean()))
print('Mediana Price:',dot_sep(df['price'].median()))
# the MAE is about 100k, i.e. roughly 20% of the mean price
1
2
Media Price: 540.296
Mediana Price: 450.000
1
2
3
# plot observed vs predicted
plt.scatter(y_test,predictions,edgecolors='black',alpha=0.5)
plt.plot(y_test,y_test,'r')
1
[<matplotlib.lines.Line2D at 0x18bafcd3948>]

png

1
2
3
4
# error distribution (the arrays need to be put into the same format)
print(type(y_test))
print(type(predictions))
print(type(y_test.values.reshape(6480, 1)))
1
2
3
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
1
2
3
# error distribution
errors = y_test.values.reshape(6480, 1) - predictions
sns.distplot(errors)
1
<matplotlib.axes._subplots.AxesSubplot at 0x18bb3093288>

png

Prediction on new observation

1
2
3
# example observation to predict
single_house = df.drop('price',axis=1).iloc[0]
single_house
bedrooms            3.0000
bathrooms           1.0000
sqft_living      1180.0000
sqft_lot         5650.0000
floors              1.0000
waterfront          0.0000
view                0.0000
condition           3.0000
grade               7.0000
sqft_above       1180.0000
sqft_basement       0.0000
yr_built         1955.0000
yr_renovated        0.0000
lat                47.5112
long             -122.2570
sqft_living15    1340.0000
sqft_lot15       5650.0000
month              10.0000
year             2014.0000
Name: 0, dtype: float64
1
2
3
# it must be scaled and reshaped into a row vector
single_house = scaler.transform(single_house.values.reshape(-1, 19))
single_house
1
2
3
4
array([[0.2       , 0.08      , 0.08376422, 0.00310751, 0.        ,
        0.        , 0.        , 0.5       , 0.4       , 0.10785619,
        0.        , 0.47826087, 0.        , 0.57149751, 0.21760797,
        0.16193426, 0.00582059, 0.81818182, 0.        ]])
1
2
# we know what the observed target is
df.head(1)
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 month year
0 221900.0 3 1.0 1180 5650 1.0 0 0 3 7 1180 0 1955 0 47.5112 -122.257 1340 5650 10 2014
1
2
3
# prediction
model.predict(single_house)
# the model could be improved by excluding the top 1% of outliers and doing more feature engineering
1
array([[294433.44]], dtype=float32)

Breast cancer Wisconsin

Keras Classification

# lib
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report,confusion_matrix

import random as rn
import os

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.callbacks import EarlyStopping
1
2
3
# df
df = pd.read_csv('cancer_classification.csv')
df.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension benign_0__mal_1
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

1
2
# controllo missing
df.isnull().sum().sum()
1
0
1
2
# show thousands separators as dots and decimals as commas
dot_sep = lambda x: format(round(x,2) if abs(x) < 1 else round(x,1) if abs(x) < 10 else int(x), ',').replace(",", "X").replace(".", ",").replace("X", ".")
1
2
# describe; I could use .transpose, but I prefer this layout and tweak the decimals
df.describe(percentiles=[0.25,0.5,0.75,0.999]).applymap(dot_sep)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension benign_0__mal_1
count 569 569 569 569 569 569 569 569 569 569 ... 569 569 569 569 569 569 569 569 569 569
mean 14 19 91 654 0,1 0,1 0,09 0,05 0,18 0,06 ... 25 107 880 0,13 0,25 0,27 0,11 0,29 0,08 0,63
std 3,5 4,3 24 351 0,01 0,05 0,08 0,04 0,03 0,01 ... 6,1 33 569 0,02 0,16 0,21 0,07 0,06 0,02 0,48
min 7,0 9,7 43 143 0,05 0,02 0,0 0,0 0,11 0,05 ... 12 50 185 0,07 0,03 0,0 0,0 0,16 0,06 0,0
25% 11 16 75 420 0,09 0,06 0,03 0,02 0,16 0,06 ... 21 84 515 0,12 0,15 0,11 0,06 0,25 0,07 0,0
50% 13 18 86 551 0,1 0,09 0,06 0,03 0,18 0,06 ... 25 97 686 0,13 0,21 0,23 0,1 0,28 0,08 1,0
75% 15 21 104 782 0,11 0,13 0,13 0,07 0,2 0,07 ... 29 125 1.084 0,15 0,34 0,38 0,16 0,32 0,09 1,0
99.9% 27 36 187 2.499 0,15 0,33 0,43 0,2 0,3 0,1 ... 48 238 3.787 0,22 0,99 1,2 0,29 0,61 0,19 1,0
max 28 39 188 2.501 0,16 0,35 0,43 0,2 0,3 0,1 ... 49 251 4.254 0,22 1,1 1,3 0,29 0,66 0,21 1,0

9 rows × 31 columns

# check the target distribution
ax = sns.countplot(x='benign_0__mal_1',data=df)
# it is not too imbalanced

# add the frequencies on top of the bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.0f}'.format(height), # '{:1.2f}'.format(height/float(len(df)))
            ha="center")

png

1
2
3
# heatmap
plt.figure(figsize=(12,12))
sns.heatmap(df.corr())
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b02b93048>

png

1
2
# head and tail of the variables most correlated with the target
pd.concat([df.corr()['benign_0__mal_1'].sort_values().head(),df.corr()['benign_0__mal_1'].sort_values().tail()])
worst concave points     -0.793566
worst perimeter          -0.782914
mean concave points      -0.776614
worst radius             -0.776454
mean perimeter           -0.742636
symmetry error            0.006522
texture error             0.008303
mean fractal dimension    0.012838
smoothness error          0.067016
benign_0__mal_1           1.000000
Name: benign_0__mal_1, dtype: float64
1
2
# correlations with the target (excluding the target itself)
df.corr()['benign_0__mal_1'][:-1].sort_values().plot(kind='bar')
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b02bbbb88>

png

Model

1
2
3
# X e y (come numpy arrays)
X = df.drop('benign_0__mal_1',axis=1).values
y = df['benign_0__mal_1'].values
1
2
# train e test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=101)
1
2
3
4
5
6
7
# scaling data
scaler = MinMaxScaler()
# scaler.fit(X_train) # estimates the parameters
# X_train = scaler.transform(X_train) # applies the parameters
# X_test = scaler.transform(X_test)
X_train = scaler.fit_transform(X_train) # estimates and applies the parameters in a single call
X_test = scaler.transform(X_test) # note: fit must not be applied to the test set; I had mistakenly fitted it here before, so results will differ slightly
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
# define the neural network structure for classification
model = Sequential()
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=15,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
9
%%time
# fit the model (deliberately overfitting)
# https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network
# https://datascience.stackexchange.com/questions/18414/are-there-any-rules-for-choosing-the-size-of-a-mini-batch
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=0
          )
1
2
3
Wall time: 35.6 s

<tensorflow.python.keras.callbacks.History at 0x26b07120608>
1
2
3
# loss crossentropy
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b05f654c8>

png

Early Stopping

1
2
# 'Stop training when a monitored quantity has stopped improving'
help(EarlyStopping)
Help on class EarlyStopping in module tensorflow.python.keras.callbacks:

class EarlyStopping(Callback)
 |  EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto', baseline=None, restore_best_weights=False)
 |  
 |  Stop training when a monitored quantity has stopped improving.
 |  
 |  Arguments:
 |      monitor: Quantity to be monitored.
 |      min_delta: Minimum change in the monitored quantity
 |          to qualify as an improvement, i.e. an absolute
 |          change of less than min_delta, will count as no
 |          improvement.
 |      patience: Number of epochs with no improvement
 |          after which training will be stopped.
 |      verbose: verbosity mode.
 |      mode: One of `{"auto", "min", "max"}`. In `min` mode,
 |          training will stop when the quantity
 |          monitored has stopped decreasing; in `max`
 |          mode it will stop when the quantity
 |          monitored has stopped increasing; in `auto`
 |          mode, the direction is automatically inferred
 |          from the name of the monitored quantity.
 |      baseline: Baseline value for the monitored quantity.
 |          Training will stop if the model doesn't show improvement over the
 |          baseline.
 |      restore_best_weights: Whether to restore model weights from
 |          the epoch with the best value of the monitored quantity.
 |          If False, the model weights obtained at the last step of
 |          training are used.
 |  
 |  ...
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
# define the neural network structure for classification
model = Sequential()
model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=15,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
1
2
# define the early stopping callback
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
%%time
# fit the model with early stopping to limit overfitting
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=0,
          callbacks=[early_stop]
          )
1
2
3
4
Epoch 00047: early stopping
Wall time: 3.06 s

<tensorflow.python.keras.callbacks.History at 0x26b08b5b2c8>
1
2
3
# loss crossentropy
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b08f1bb48>

png

DropOut Layers

1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
# define the neural network structure for classification
model = Sequential()
model.add(Dense(units=30,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=15,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
1
2
3
4
5
# set seeds to reduce the non-determinism of fitting on the GPU
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
%%time
# fit the model with early stopping and dropout to limit overfitting
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=0,
          callbacks=[early_stop]
          )
1
2
3
4
Epoch 00107: early stopping
Wall time: 8.93 s

<tensorflow.python.keras.callbacks.History at 0x26b091d5b48>
1
2
3
# loss crossentropy
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x26b0583fc08>

png

Metrics

1
2
# predictions
predictions = model.predict_classes(X_test)
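
Note: predict_classes works in the TensorFlow version used here but was removed in later releases; for a single sigmoid output the equivalent is to threshold the predicted probabilities, e.g.:

# equivalent to predict_classes for one sigmoid output unit
predictions = (model.predict(X_test) > 0.5).astype('int32')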
1
2
3
4
5
# metrics
print('\nConfusion Matrix:')
print(confusion_matrix(y_test,predictions))
print('\nClassification metrics:')
print(classification_report(y_test,predictions))
Confusion Matrix:
[[54  1]
 [ 6 82]]

Classification metrics:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94        55
           1       0.99      0.93      0.96        88

    accuracy                           0.95       143
   macro avg       0.94      0.96      0.95       143
weighted avg       0.95      0.95      0.95       143

LendingClub dataset

Kaggle: Predict default loans with classification

# lib
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report,confusion_matrix

import random as rn
import os

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
1
2
3
# df
df = pd.read_csv('/lending_club_loan_two.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  dti                   396030 non-null  float64
 16  earliest_cr_line      396030 non-null  object 
 17  open_acc              396030 non-null  float64
 18  pub_rec               396030 non-null  float64
 19  revol_bal             396030 non-null  float64
 20  revol_util            395754 non-null  float64
 21  total_acc             396030 non-null  float64
 22  initial_list_status   396030 non-null  object 
 23  application_type      396030 non-null  object 
 24  mort_acc              358235 non-null  float64
 25  pub_rec_bankruptcies  395495 non-null  float64
 26  address               396030 non-null  object 
dtypes: float64(12), object(15)
memory usage: 81.6+ MB

Step 1: EDA

# check the target distribution
ax = sns.countplot(x='loan_status',data=df)
# somewhat imbalanced: we should expect a high accuracy, but precision and recall will be the hard part

# add the frequencies on top of the bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height+3000,
            '{:,.0f}'.format(height).replace(",", "X").replace(".", ",").replace("X", "."), # '{:1.2f}'.format(height/float(len(df)))
            ha="center")

png

1
2
3
4
# histogram di loan_amnt
plt.figure(figsize=(12,4))
sns.distplot(df['loan_amnt'],kde=False,color='b',bins=40,hist_kws=dict(edgecolor='grey'))
plt.xlim(0,45000)
1
(0.0, 45000.0)

png

1
2
3
4
# heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),annot=True,cmap='coolwarm',linecolor='white',linewidths=1)
# plt.ylim(10, 0)
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db149a6f88>

png

1
2
3
# mezzo-pairplot (parte 1)
df_pair1 = df[df.columns[0:18]].sample(n=10000, random_state=1311)
sns.pairplot(df_pair1,hue='loan_status',diag_kind='hist',diag_kws=dict(edgecolor='black',alpha=0.6,bins=30),plot_kws=dict(alpha=0.4))
1
<seaborn.axisgrid.PairGrid at 0x18e3e6a8cc8>

png

1
2
3
# mezzo-pairplot (parte 2)
df_pair2 = df.iloc[:, np.r_[12,18:27]].sample(n=10000, random_state=1311) # indexer
sns.pairplot(df_pair2,hue='loan_status',diag_kind='hist',diag_kws=dict(edgecolor='black',alpha=0.6,bins=30),plot_kws=dict(alpha=0.4))
1
<seaborn.axisgrid.PairGrid at 0x18e0a62d088>

png

1
2
# scatterplot 
sns.scatterplot(x='installment',y='loan_amnt',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db16fec1c8>

png

1
2
# boxplot loan_status e loan amount
sns.boxplot(x='loan_status',y='loan_amnt',data=df)
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db177d3ac8>

png

1
2
3
4
5
# show thousands separators as dots and decimals as commas
dot_sep = lambda x: format(round(x,2) if abs(x) < 1 else round(x,1) if abs(x) < 10 else int(x), ',').replace(",", "X").replace(".", ",").replace("X", ".")

# loan amount by loan status
df.groupby('loan_status')['loan_amnt'].describe().applymap(dot_sep)
count mean std min 25% 50% 75% max
loan_status
Charged Off 77.673 15.126 8.505 1.000 8.525 14.000 20.000 40.000
Fully Paid 318.357 13.866 8.302 500 7.500 12.000 19.225 40.000
1
2
# countplot per grade stratificato per target
sns.countplot(x='grade',hue='loan_status',data=df,order=sorted(df['grade'].unique()))
1
<matplotlib.axes._subplots.AxesSubplot at 0x2a843721688>

png

1
2
3
# countplot per subgrade
plt.figure(figsize=(12,4))
sns.countplot(x='sub_grade',data=df,order=sorted(df['sub_grade'].unique()),palette='coolwarm')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db14a6f548>

png

1
2
3
# countplot per subgrade stratificato per target
plt.figure(figsize=(12,4))
sns.countplot(x='sub_grade',data=df,order=sorted(df['sub_grade'].unique()),palette='coolwarm',hue='loan_status')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db16a31888>

png

1
2
3
4
5
# countplot per subgrade (F e G) stratificato per target
plt.figure(figsize=(12,4))
df_FG = df[(df['grade']=='G') | (df['grade']=='F')]
# df_FG = df[df['sub_grade'].apply(lambda x: x[0] in ['G','F'])]
sns.countplot(x='sub_grade',data=df_FG,order=sorted(df_FG['sub_grade'].unique()),palette='coolwarm',hue='loan_status')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db1782ed48>

png

1
2
3
4
5
6
7
8
# the map method is the fastest way to remap a variable
mapping_dict = {'Fully Paid': 1, 'Charged Off': 0}
df['loan_status'].map(mapping_dict).value_counts() # if the mapping is not exhaustive, the original value can be kept instead of NaN with .fillna(df['loan_status'])

# Identical results but slower:
# df['loan_status'].replace(mapping_dict).value_counts()
# df.replace({'loan_status': mapping_dict})['loan_status'].value_counts() # this way the column must be specified and it acts on the whole df
# df['loan_status'].replace(to_replace=['Fully Paid', 'Charged Off'], value=[1, 0]).value_counts()
1
2
3
1    318357
0     77673
Name: loan_status, dtype: int64
1
2
3
4
5
6
# create a dummy / dichotomize the target loan_status
df['loan_repaid'] = df['loan_status'].map({'Fully Paid':1,'Charged Off':0})
print(df.shape)
# contingency table
df.groupby(["loan_repaid", "loan_status"]).size().reset_index(name="Frequenza")
# pd.crosstab(df['loan_repaid'],df['loan_status'])
1
(396030, 28)
loan_repaid loan_status Frequenza
0 0 Charged Off 77673
1 1 Fully Paid 318357
1
2
3
# correlation of the target with the other numeric variables
df.corr()['loan_repaid'][:-1].sort_values().plot(kind='bar')
# df.corr()['loan_repaid'].sort_values().drop('loan_repaid').plot(kind='bar')
1
<matplotlib.axes._subplots.AxesSubplot at 0x18e17784e48>

png

Step 2: Data Preprocessing

Section Goals: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert categorical string features to dummy variables.

1
2
# df numero record
len(df)
1
396030
1
2
# missing data
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db17a29148>

png

1
2
3
4
5
6
7
# missing data count
missing_counts = pd.concat(
    [df.isnull().sum()[df.isnull().sum()>0].apply(dot_sep),
    (df.isnull().sum()[df.isnull().sum()>0]/len(df)*100).apply(dot_sep)], 
    axis=1)
missing_counts.columns = ['Freq', 'Freq %']
missing_counts
Freq Freq %
emp_title 22.927 5,8
emp_length 18.301 4,6
title 1.755 0,44
revol_util 276 0,07
mort_acc 37.795 9,5
pub_rec_bankruptcies 535 0,14
1
2
3
4
5
6
# unique employment job titles
print(df['emp_title'].value_counts())
print('\nUnivoci:',dot_sep(df['emp_title'].nunique()))
# too many to create dummies; I drop the column, although the titles could be grouped
df.drop('emp_title',inplace=True,axis=1)
print(df.shape)
Teacher                     4389
Manager                     4250
Registered Nurse            1856
RN                          1846
Supervisor                  1830
                            ... 
Annunciation                   1
Atos Inc                       1
chevy parts maneger            1
Architectural Intern           1
GroupSystems Corporation       1
Name: emp_title, Length: 173105, dtype: int64

Univoci: 173.105
(396030, 27)
1
2
3
4
# sorted(df['emp_length'].dropna().unique())
emp_length_order = ['Missing','< 1 year', '1 year', '2 years', '3 years', 
                    '4 years', '5 years', '6 years', '7 years', 
                    '8 years', '9 years', '10+ years']
# countplot emp_length
plt.figure(figsize=(12,4))
ax = sns.countplot(x='emp_length',data=df[['emp_length']].fillna('Missing'),order=emp_length_order)

# aggiungo le frequenze
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1000,
            '{:,.0f}'.format(height).replace(",", "X").replace(".", ",").replace("X", "."),
            ha="center")

png

1
2
3
# countplot emp_length
plt.figure(figsize=(12,4))
sns.countplot(x='emp_length',data=df[['emp_length','loan_status']].fillna('Missing'),order=emp_length_order, hue='loan_status')
1
<matplotlib.axes._subplots.AxesSubplot at 0x2db203a15c8>

png

1
2
3
4
# charge-off (failure) rate for each emp_length
# emp_len = df[df['loan_status']=="Charged Off"].groupby("emp_length").count()['loan_status']/df.groupby("emp_length").count()['loan_status']
emp_len = pd.DataFrame(df[['emp_length','loan_status']].fillna('Missing').groupby(['emp_length','loan_status']).size().groupby(level=0).apply(lambda x: x / x.sum()).xs('Charged Off',level='loan_status'),columns=['% Failure']).reset_index()
emp_len
emp_length % Failure
0 1 year 0.199135
1 10+ years 0.184186
2 2 years 0.193262
3 3 years 0.195231
4 4 years 0.192385
5 5 years 0.192187
6 6 years 0.189194
7 7 years 0.194774
8 8 years 0.199760
9 9 years 0.200470
10 < 1 year 0.206872
11 Missing 0.275286
1
2
3
4
5
6
7
# barplot of the failure rate
plt.figure(figsize=(12,4))
# emp_len.plot(kind='bar')
sns.barplot(x='emp_length',y='% Failure',data=emp_len,order=emp_length_order,palette='coolwarm')
# there is no strong evidence, so I drop 'emp_length'
df.drop('emp_length',axis=1,inplace=True)
df.shape
1
(396030, 26)

png

1
2
3
4
5
6
7
# missing data count
missing_counts = pd.concat(
    [df.isnull().sum()[df.isnull().sum()>0].apply(dot_sep),
    (df.isnull().sum()[df.isnull().sum()>0]/len(df)*100).apply(dot_sep)], 
    axis=1)
missing_counts.columns = ['Freq', 'Freq %']
missing_counts
Freq Freq %
title 1.755 0,44
revol_util 276 0,07
mort_acc 37.795 9,5
pub_rec_bankruptcies 535 0,14
1
2
3
# title e purpose
print(df['purpose'].value_counts().head(7), end='\n\n')
print(df['title'].value_counts().head(7))
debt_consolidation    234507
credit_card            83019
home_improvement       24030
other                  21185
major_purchase          8790
small_business          5701
car                     4697
Name: purpose, dtype: int64

Debt consolidation         152472
Credit card refinancing     51487
Home improvement            15264
Other                       12930
Debt Consolidation          11608
Major purchase               4769
Consolidation                3852
Name: title, dtype: int64
1
2
3
4
5
6
# title and purpose
print('Purpose nunique:',df['purpose'].nunique())
print('Title nunique:',df['title'].nunique())
# drop title: it is a subcategory of purpose and it would not make sense to turn it into dummies
df.drop('title',axis=1,inplace=True)
print(df.shape)
1
2
3
Purpose nunique: 14
Title nunique: 48817
(396030, 25)
1
2
3
4
5
6
7
# missing data count
missing_counts = pd.concat(
    [df.isnull().sum()[df.isnull().sum()>0].apply(dot_sep),
    (df.isnull().sum()[df.isnull().sum()>0]/len(df)*100).apply(dot_sep)], 
    axis=1)
missing_counts.columns = ['Freq', 'Freq %']
missing_counts
Freq Freq %
revol_util 276 0,07
mort_acc 37.795 9,5
pub_rec_bankruptcies 535 0,14
1
2
3
# mort_acc (number of mortgage accounts)
print('mort_acc nunique:', df['mort_acc'].nunique())
print(df['mort_acc'].value_counts().head().apply(dot_sep))
1
2
3
4
5
6
7
mort_acc nunique: 33
0.0    139.777
1.0     60.416
2.0     49.948
3.0     38.049
4.0     27.887
Name: mort_acc, dtype: object
1
2
# mort_acc correlations
df.corr()['mort_acc'].sort_values()
int_rate               -0.082583
dti                    -0.025439
revol_util              0.007514
pub_rec                 0.011552
pub_rec_bankruptcies    0.027239
loan_repaid             0.073111
open_acc                0.109205
installment             0.193694
revol_bal               0.194925
loan_amnt               0.222315
annual_inc              0.236320
total_acc               0.381072
mort_acc                1.000000
Name: mort_acc, dtype: float64
1
df[['mort_acc','total_acc']].head()
mort_acc total_acc
0 0.0 25.0
1 3.0 27.0
2 0.0 26.0
3 0.0 13.0
4 1.0 43.0
1
2
3
# Mean of mort_acc column per total_acc
print('Mean of mort_acc column per total_acc:')
print(df.groupby('total_acc').mean()['mort_acc'])
Mean of mort_acc column per total_acc:
total_acc
2.0      0.000000
3.0      0.052023
4.0      0.066743
5.0      0.103289
6.0      0.151293
           ...   
124.0    1.000000
129.0    1.000000
135.0    3.000000
150.0    2.000000
151.0    0.000000
Name: mort_acc, Length: 118, dtype: float64
# fillna by group for mort_acc

# lambda function with two variables
# def fill_mort_acc(total_acc,mort_acc):
#     if np.isnan(mort_acc):
#         return total_acc_avg[total_acc]
#     else:
#         return mort_acc
# df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis=1)

df['mort_acc'] = df[['mort_acc','total_acc']].groupby('total_acc').transform(lambda x: x.fillna(x.mean()))
df.shape
1
(396030, 25)
1
2
3
4
5
# drop records for the columns with less than 0.5% missing
print('Pre rimozione missing:',len(df))
df.dropna(inplace=True)
print('Post rimozione missing:',len(df))
df.shape
1
2
3
4
Pre rimozione missing: 396030
Post rimozione missing: 395219

(395219, 25)
1
2
# verifica missing rimasti
df.isnull().sum().sum()
1
0

Step 3: Categorical Variables and Dummy Variables

String values due to the categorical columns

1
2
# distribuzione tipi colonne
df.dtypes.value_counts()
1
2
3
4
float64    12
object     12
int64       1
dtype: int64
1
2
3
# seleziono colonne non numeriche
df.select_dtypes(['object']).columns
# df.select_dtypes(exclude=['float64','int64']).columns
1
2
3
4
Index(['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status',
       'issue_d', 'loan_status', 'purpose', 'earliest_cr_line',
       'initial_list_status', 'application_type', 'address'],
      dtype='object')
1
2
3
4
5
# codifico term in numeric
print('Pre codifica:', df['term'].unique())
df['term'] = df['term'].map({' 36 months':36,' 60 months':60})
# df['term'] = df['term'].apply(lambda x: int(x[:3]))
print('Post codifica:', df['term'].unique())
1
2
Pre codifica: [' 36 months' ' 60 months']
Post codifica: [36 60]
1
2
3
4
5
# sub_grade is a subcategory of grade but it can be turned into dummies, so we drop grade
print('Univoci grade:', df['grade'].nunique())
print('Univoci sub_grade:', df['sub_grade'].nunique())
df.drop('grade',axis=1,inplace=True)
print(df.shape)
1
2
3
Univoci grade: 7
Univoci sub_grade: 35
(395219, 24)
1
2
3
4
5
# dummyfication
df = pd.get_dummies(df,columns=['sub_grade'],drop_first=True)
# subgrade_dummies = pd.get_dummies(df['sub_grade'],drop_first=True,prefix='sub_grade')
# df = pd.concat([df.drop('sub_grade',axis=1),subgrade_dummies],axis=1)
df.columns
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'home_ownership',
       'annual_inc', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'initial_list_status',
       'application_type', 'mort_acc', 'pub_rec_bankruptcies', 'address',
       'loan_repaid', 'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4',
       'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3',
       'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2',
       'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1',
       'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5',
       'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4',
       'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3',
       'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2',
       'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5'],
      dtype='object')
1
2
# verifico le colonne non numeriche rimaste
df.select_dtypes(['object']).columns
1
2
3
4
Index(['home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'earliest_cr_line', 'initial_list_status',
       'application_type', 'address'],
      dtype='object')
1
2
3
4
5
# rimanenti variabili categoriche
print('verification_status nunique:',df['verification_status'].nunique())
print('application_type nunique:',df['application_type'].nunique())
print('initial_list_status nunique:',df['initial_list_status'].nunique())
print('purpose nunique:',df['purpose'].nunique())
1
2
3
4
verification_status nunique: 3
application_type nunique: 3
initial_list_status nunique: 2
purpose nunique: 14
1
2
# dummy encoding
df = pd.get_dummies(df,columns=['verification_status', 'application_type','initial_list_status','purpose'],drop_first=True)
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
2
3
Index(['home_ownership', 'issue_d', 'loan_status', 'earliest_cr_line',
       'address'],
      dtype='object')
1
2
3
4
5
# collapse the rare home_ownership categories (NONE, ANY) into OTHER
print(df['home_ownership'].value_counts())
print('\n')
# df['home_ownership'].map({'NONE':'OTHER','ANY':'OTHER'}).fillna(df['home_ownership']).value_counts()
print(df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER').value_counts())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          110
NONE            29
ANY              3
Name: home_ownership, dtype: int64


MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          142
Name: home_ownership, dtype: int64
1
2
3
4
# dummy encoding of home_ownership
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
df = pd.get_dummies(df,columns=['home_ownership'],drop_first=True)
df.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'annual_inc', 'issue_d',
       'loan_status', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'mort_acc',
       'pub_rec_bankruptcies', 'address', 'loan_repaid', 'sub_grade_A2',
       'sub_grade_A3', 'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1',
       'sub_grade_B2', 'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5',
       'sub_grade_C1', 'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4',
       'sub_grade_C5', 'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3',
       'sub_grade_D4', 'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2',
       'sub_grade_E3', 'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1',
       'sub_grade_F2', 'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5',
       'sub_grade_G1', 'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4',
       'sub_grade_G5', 'verification_status_Source Verified',
       'verification_status_Verified', 'application_type_INDIVIDUAL',
       'application_type_JOINT', 'initial_list_status_w',
       'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement', 'purpose_house',
       'purpose_major_purchase', 'purpose_medical', 'purpose_moving',
       'purpose_other', 'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'home_ownership_OTHER',
       'home_ownership_OWN', 'home_ownership_RENT'],
      dtype='object')
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
Index(['issue_d', 'loan_status', 'earliest_cr_line', 'address'], dtype='object')
1
2
3
4
# feature engineering address column
print('address nunique:',df['address'].nunique())
print('\n')
print(df['address'].value_counts().head(10))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
address nunique: 392898


USS Johnson\nFPO AE 48052     8
USCGC Smith\nFPO AE 70466     8
USS Smith\nFPO AP 70466       8
USNS Johnson\nFPO AE 05113    8
USNS Johnson\nFPO AP 48052    7
USNS Johnson\nFPO AA 70466    6
USNV Brown\nFPO AA 48052      6
USS Smith\nFPO AP 22690       6
USCGC Miller\nFPO AA 22690    6
USCGC Smith\nFPO AA 70466     6
Name: address, dtype: int64
1
2
3
# create zip_code variable (last 5 characters of address)
df['zip_code'] = df['address'].apply(lambda x: x[-5:])
df['zip_code'].unique()
1
2
array(['22690', '05113', '00813', '11650', '30723', '70466', '29597',
       '48052', '86630', '93700'], dtype=object)
1
2
3
4
# dummy zip code
df.drop('address',axis=1,inplace=True)
df = pd.get_dummies(df,columns=['zip_code'],drop_first=True)
df.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'annual_inc', 'issue_d',
       'loan_status', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'mort_acc',
       'pub_rec_bankruptcies', 'loan_repaid', 'sub_grade_A2', 'sub_grade_A3',
       'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2',
       'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1',
       'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5',
       'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4',
       'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3',
       'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2',
       'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1',
       'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5',
       'verification_status_Source Verified', 'verification_status_Verified',
       'application_type_INDIVIDUAL', 'application_type_JOINT',
       'initial_list_status_w', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase',
       'purpose_medical', 'purpose_moving', 'purpose_other',
       'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'home_ownership_OTHER',
       'home_ownership_OWN', 'home_ownership_RENT', 'zip_code_05113',
       'zip_code_11650', 'zip_code_22690', 'zip_code_29597', 'zip_code_30723',
       'zip_code_48052', 'zip_code_70466', 'zip_code_86630', 'zip_code_93700'],
      dtype='object')
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
Index(['issue_d', 'loan_status', 'earliest_cr_line'], dtype='object')
1
2
3
# issue_d is data leakage: we would only know it once the target has been realized, so it must be excluded
df.drop('issue_d',axis=1,inplace=True)
df.shape
1
(395219, 80)
1
2
3
4
5
6
# from earliest_cr_line we can extract the year
print(df['earliest_cr_line'].head())
print('\n')
print('String length:',df['earliest_cr_line'].apply(lambda x: len(x)).unique()) # all strings have constant length 8, so the year can be taken from the end
df['earliest_cr_year'] = df['earliest_cr_line'].apply(lambda x: int(x[-4:]))
df.drop('earliest_cr_line',axis=1,inplace=True)
1
2
3
4
5
6
7
8
0    Jun-1990
1    Jul-2004
2    Aug-2007
3    Sep-2006
4    Mar-1999
Name: earliest_cr_line, dtype: object

String length: [8]
1
2
3
# back up the df, in case I make a mistake
df_backup = df.copy()
df_backup.shape
1
(395219, 80)
1
2
# check the remaining non-numeric columns
df.select_dtypes(['object']).columns
1
Index(['loan_status'], dtype='object')
1
2
3
4
# drop loan_status so that only the encoded target ('loan_repaid') remains
df.drop('loan_status',axis=1,inplace=True)
df.shape
# we are ready!
1
(395219, 79)

Step 4: Model

Tuning NN
Dropout: links 1, 2, 3 (a tuned variant is sketched below)
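A couple of the tuning knobs noted in the model definition further down (dropout rate, number of epochs) can be explored with a lighter dropout and an EarlyStopping callback, so the epoch count does not have to be fixed by hand. This is only a sketch of a possible variant, assuming the same X_train/X_test split and scaling defined below; it is not the configuration actually fitted in this post:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# same layer sizes as the model below, but with dropout 0.2 instead of 0.5
model_alt = Sequential([
    Dense(78, activation='relu'),
    Dropout(0.2),
    Dense(39, activation='relu'),
    Dropout(0.2),
    Dense(19, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),
])
model_alt.compile(loss='binary_crossentropy', optimizer='adam')

# stop when val_loss has not improved for 5 consecutive epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# model_alt.fit(X_train, y_train, validation_data=(X_test, y_test),
#               batch_size=256, epochs=100, callbacks=[early_stop], verbose=2)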

1
2
3
4
# I could work on a sample for faster experimentation
df_sample = df.sample(frac=0.1,random_state=101)
print('full df:',len(df))
print('sampled df:',len(df_sample))
1
2
full df: 395219
sampled df: 39522
1
2
3
# X and y (as numpy arrays)
X = df.drop('loan_repaid',axis=1).values
y = df['loan_repaid'].values
1
2
# train and test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20,random_state=101)
1
2
3
4
# normalizing data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # I had mistakenly used fit here; only transform must be applied to the test set, so results will differ slightly
1
2
3
4
5
# set seeds to reduce the non-determinism of the GPU fit
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
6
7
8
9
10
# define the neural network architecture for classification
model = Sequential()
model.add(Dense(units=78,activation='relu'))
model.add(Dropout(0.5)) # 0.2 worth trying
model.add(Dense(units=39,activation='relu'))
model.add(Dropout(0.5)) # 0.2 worth trying
model.add(Dense(units=19,activation='relu'))
model.add(Dropout(0.5)) # 0.2 worth trying
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
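The first hidden layer has 78 units, matching the 78 input features; since no input shape is declared, Keras builds the model lazily at the first fit() call. A minimal sketch of an equivalent definition with an explicit input_dim, which lets model.summary() be inspected before training (model_explicit is just an illustrative name):

# equivalent definition with an explicit input dimension (78 features)
model_explicit = Sequential()
model_explicit.add(Dense(units=78, activation='relu', input_dim=78))
model_explicit.add(Dropout(0.5))
model_explicit.add(Dense(units=39, activation='relu'))
model_explicit.add(Dropout(0.5))
model_explicit.add(Dense(units=19, activation='relu'))
model_explicit.add(Dropout(0.5))
model_explicit.add(Dense(units=1, activation='sigmoid'))
model_explicit.compile(loss='binary_crossentropy', optimizer='adam')
model_explicit.summary()  # printable immediately, since the input shape is known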
1
2
3
4
5
# set seeds to reduce the non-determinism of the GPU fit
os.environ['PYTHONHASHSEED'] = '13111990'
np.random.seed(13)
rn.seed(11)
tf.random.set_seed(1990)
1
2
3
4
5
%%time
# fit the model
model.fit(x=X_train,y=y_train,
          validation_data=(X_test,y_test),
          verbose=2,batch_size=128,epochs=25) # batch_size=256 worth trying
1
2
3
4
5
6
7
8
9
Train on 316175 samples, validate on 79044 samples
Epoch 1/25
316175/316175 - 8s - loss: 0.3197 - val_loss: 0.2726
...
Epoch 25/25
316175/316175 - 6s - loss: 0.2629 - val_loss: 0.2713
Wall time: 2min 51s

<tensorflow.python.keras.callbacks.History at 0x2a80300b748>
1
2
# save the model
model.save('LendingClub_Keras_Model.h5')  # creates an HDF5 file
1
'F:\\Python\\Course 001'
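The saved HDF5 file can be reloaded later without re-training; a minimal sketch, assuming the file written above is still on disk:

from tensorflow.keras.models import load_model

# restores architecture, weights and the compiled optimizer/loss state
loaded_model = load_model('LendingClub_Keras_Model.h5')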
1
2
3
# cross-entropy loss (sample run)
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x2a8573fa248>

[plot: training and validation loss per epoch]

1
2
3
# cross-entropy loss
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
1
<matplotlib.axes._subplots.AxesSubplot at 0x2a80381b2c8>

[plot: training and validation loss per epoch]

1
2
# predictions
predictions = model.predict_classes(X_test)
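Note that predict_classes() is available only on Sequential models and has been removed in later TensorFlow releases; an equivalent and more portable formulation thresholds the predicted probabilities of the sigmoid output at 0.5:

# equivalent to predict_classes for a single sigmoid output unit
probs = model.predict(X_test)
predictions_alt = (probs > 0.5).astype(int)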
1
2
3
4
5
6
# metrics
print('\nConfusion Matrix:')
print(confusion_matrix(y_test,predictions))
print('\nClassification metrics:')
print(classification_report(y_test,predictions))
# not great, because the recall of class 0 is low
1
2
3
4
5
6
7
8
9
10
11
12
13
Confusion Matrix:
[[ 7176  8482]
 [  406 62980]]

Classification metrics:
              precision    recall  f1-score   support

           0       0.95      0.46      0.62     15658
           1       0.88      0.99      0.93     63386

    accuracy                           0.89     79044
   macro avg       0.91      0.73      0.78     79044
weighted avg       0.89      0.89      0.87     79044
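The low recall on class 0 is largely a class-imbalance effect: about 80% of the loans are repaid, so the network can reach a high accuracy while missing many defaults. A common remedy, sketched here under the assumption that the same model and split are reused, is to re-weight the loss with the class_weight argument of fit() so that errors on the minority class cost more:

import numpy as np

# weight each class inversely to its frequency in the training set
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {int(c): len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}

# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           batch_size=128, epochs=25, class_weight=class_weight, verbose=2)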

New data prediction

1
2
3
4
5
# pick a random record from df
rn.seed(101)
random_ind = rn.randint(0,len(df))
new_customer = df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer
1
2
3
4
5
6
7
8
9
10
11
12
loan_amnt           25000.00
term                   60.00
int_rate               18.24
installment           638.11
annual_inc          61665.00
                      ...   
zip_code_48052          0.00
zip_code_70466          0.00
zip_code_86630          0.00
zip_code_93700          0.00
earliest_cr_year     1996.00
Name: 305323, Length: 78, dtype: float64
1
2
3
4
5
6
# scaling
new_customer = scaler.transform(new_customer.values.reshape(1, 78))
# predict
print('Predicted probability:', model.predict(new_customer))
print('Predicted class:', model.predict_classes(new_customer))
print('Observed class:', df.iloc[random_ind]['loan_repaid'])
1
2
3
Predicted probability: [[0.54277164]]
Predicted class: [[1]]
Observed class: 1.0
1
2
# reset
%reset -f