PySpark to Run Machine Learning Models
Note: The data used for this article is fictitious, contains no PII (Personally Identifiable Information), and doesn’t relate to any real entity whatsoever.
Ok, let’s begin!
I have been learning PySpark for a couple of months, and I thought it would be nice to post my project details here to share the knowledge. PySpark is a great tool for analysing huge amounts of data. It combines Python with the Spark engine, so you get the best of both worlds.
Project Background
Let’s create a fictitious project for this article. The domain is the shipping industry, and the goal is to predict a key variable, carrier_base_pay (the base amount, excluding taxes, that a customer would owe to ship a package), using the associated feature variables. The feature variables would be the dimensions of the package, weight, source location, destination location, type of shipping (express, overnight, or leisure), etc.
PySpark Modules and Zeppelin Notebook
You may have to install the PySpark modules by following online tutorials. Once that step is out of the way, the focus will be on using a Zeppelin notebook for the project. A Zeppelin notebook is very similar to a Jupyter notebook, except that here it runs on top of the PySpark engine.
Initiating a PySpark Session
To initiate a PySpark session and get the data into it, we start by importing the HiveWarehouseSession library. Next, we use SQL queries through Apache Hive to extract data from our underlying HDFS flat files.
%LivyPy3.pyspark
### start the environment
from pyspark_llap.sql.session import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
Importing data
Let’s run a basic SQL query, i.e., SELECT * FROM <tablename>, to import data into the session.
%LivyPy3.pyspark
##Load the data to dataframe FS_orders_hive
FS_orders_hive = hive.executeQuery("SELECT * FROM LOGISTICS_SDBX_RAW.FS_FY2019_4")
FS_orders_hive.show()
Output
+---------+-------------+--------------------+---------+-------------+--------------------+-----------------+--------------------+--------------+-------+--------------+-----+------+----------+------------+--------------+------+-------+-----------+-------------+----------+----------------+
| id|order_type_id| order_type_desc|bill_date|bill_distance| commodity|equipment_type_id| equipment_type_desc|freight_charge| weight| pu_city|pu_st|pu_zip|pu_arrival|pu_departure| del_city|del_st|del_zip|del_arrival|del_departure|carrier_id|carrier_base_pay|
+---------+-------------+--------------------+---------+-------------+--------------------+-----------------+--------------------+--------------+-------+--------------+-----+------+----------+------------+--------------+------+-------+-----------+-------------+----------+----------------+
| null|ORDER_TYPE_ID| ORDER_TYPE_DESC|BILL_DATE| null| COMMODITY|EQUIPMENT_TYPE_ID| EQUIPMENT_TYPE_DESC| null| null| PU_CITY|PU_ST|PU_ZIP|PU_ARRIVAL|PU_DEPARTURE| DEL_CITY|DEL_ST| null|DEL_ARRIVAL|DEL_DEPARTURE|CARRIER_ID| null|
|74243580.0| FLAT| FLATBED|10-Oct-18| 357.0| MACHINERY| FT|Flatbed w/Tarps (...| 1250.0| 5500.0| POMONA| CA| 91768| 28-Sep-18| 28-Sep-18| TEMPE| AZ|85280.0| 1-Oct-18| 1-Oct-18| SAFRJAFL| 900.0|
|74235585.0| VAN| VAN|12-Oct-18| 2799.0|CONSUMER GOODS (R...| V53| 53' Dry Van| 4900.0|18832.0| ALBANY| GA| 31705| 28-Sep-18| 28-Sep-18| EUGENE| OR|97402.0| 4-Oct-18| 4-Oct-18| MBGLELIL| 4600.0|
|74235787.0| VAN| VAN|10-Oct-18| 2011.0| DISPLAY MATERIAL| V53| 53' Dry Van| 3850.0|20000.0| IRVINE| CA| 92602| 28-Sep-18| 28-Sep-18| ANTIOCH| IL|60002.0| 2-Oct-18| 2-Oct-18| HIGHELIL| 3400.0|
|74235481.0| VAN| VAN| 8-Oct-18| 1640.0| PACKAGING MATERIAL| V53| 53' Dry Van| 3180.0|20370.0| WACO| TX| 76712| 28-Sep-18| 28-Sep-18| CLIFTON| NJ| 7014.0| 1-Oct-18| 1-Oct-18| HUNTPANJ| 3500.0|
|74235583.0| VAN| VAN|10-Oct-18| 2004.0| DISPLAY MATERIAL| V53| 53' Dry Van| 3850.0|15000.0| IRVINE| CA| 92602| 27-Sep-18| 27-Sep-18|MOUNT PROSPECT| IL|60056.0| 3-Oct-18| 3-Oct-18| PROLWOIL| 3300.0|
|74234584.0| LTL| LTL ORDER|22-Oct-18| 1238.0| HARDWARE| LTLS| Standard LTL| 452.55| 2305.0| KENTWOOD| MI| 49512| 28-Sep-18| 28-Sep-18| HOUSTON| TX|77041.0| 10-Oct-18| 10-Oct-18| CLEAININ| null|
|74228170.0| VAN| VAN| 8-Oct-18| 970.0|SHAMPOOS/CONDITIO...| V53| 53' Dry Van| 1600.0|26857.0| POOLER| GA| 31322| 28-Sep-18| 28-Sep-18| NILES| IL|60714.0| 1-Oct-18| 1-Oct-18| ANGTELIL| 1300.0|
|74228471.0| VAN| VAN| 3-Oct-18| 970.0|SHAMPOOS/CONDITIO...| V53| 53' Dry Van| 1600.0|18363.0| POOLER| GA| 31322| 26-Sep-18| 26-Sep-18| NILES| IL|60714.0| 28-Sep-18| 28-Sep-18| LYNXWAIA| 1300.0|
|74272873.0| VAN| VAN|11-Oct-18| 809.0|COMSUMER GOODS (R...| V53| 53' Dry Van| 2150.0|27366.0| POOLER| GA| 31322| 26-Sep-18| 26-Sep-18| MOONACHIE| NJ| 7074.0| 28-Sep-18| 28-Sep-18| FREIPLNJ| 1800.0|
|74228794.0| RAIL| INTERMODAL|18-Oct-18| 2105.0| MUSICAL INSTRUMENTS| C53'|Intermodal Contai...| 3500.0|20000.0|SAN BERNARDINO| CA| 92408| 26-Sep-18| 26-Sep-18| FORT WAYNE| IN|46818.0| 8-Oct-18| 8-Oct-18| CELTTIIL| null|
|74422876.0| VAN| VAN|12-Oct-18| 1.0| CLOTHING OR APPAREL| SRTL|Straight Truck w/...| 572.4|25000.0| POMONA| CA| 91766| 2-Oct-18| 2-Oct-18| SAN FRANCISCO| CA|94102.0| 2-Oct-18| 2-Oct-18| TOMAFRCA| 515.68|
|74228577.0| VAN| VAN|12-Oct-18| 1.0| CLOTHING OR APPAREL| SRTL|Straight Truck w/...| 307.93|25000.0| POMONA| CA| 91766| 2-Oct-18| 2-Oct-18| EMERYVILLE| CA|94608.0| 2-Oct-18| 2-Oct-18| WALTHACA| 335.51|
|74212878.0| VAN| VAN|12-Oct-18| 1.0| CLOTHING OR APPAREL| SRTL|Straight Truck w/...| 244.48|25000.0| POMONA| CA| 91766| 4-Oct-18| 4-Oct-18| SANTA CLARA| CA|95050.0| 4-Oct-18| 4-Oct-18| TOMAFRCA| 220.25|
|74222879.0| VAN| VAN|12-Oct-18| 1.0| CLOTHING OR APPAREL| SRTL|Straight Truck w/...| 276.26|25000.0| POMONA| CA| 91766| 2-Oct-18| 2-Oct-18| SANTA CLARA| CA|95050.0| 2-Oct-18| 2-Oct-18| TOMAFRCA| 248.88|
|74228472.0| VAN| VAN| 3-Oct-18| 493.0|COMSUMER GOODS (R...| V53| 53' Dry Van| 1530.0|27883.0| POOLER| GA| 31322| 26-Sep-18| 26-Sep-18| MIAMI| FL|33172.0| 27-Sep-18| 27-Sep-18| AAPCMIFL| 1300.0|
|74228830.0| VAN| VAN| 8-Oct-18| 1.0| CLOTHING OR APPAREL| SRTL|Straight Truck w/...| 149.22|25000.0| POMONA| CA| 91766| 2-Oct-18| 2-Oct-18| COSTA MESA| CA|92626.0| 2-Oct-18| 2-Oct-18| ADRLCOCA| 134.43|
|74228569.0| PARTIAL|Partial Truckload...|15-Oct-18| 1468.0|coated film **Par...| V53| 53' Dry Van| 3375.0|21130.0| IOWA CITY| IA| 52240| 27-Sep-18| 27-Sep-18| MIRAMAR| FL|33025.0| 1-Oct-18| 1-Oct-18| F5EXBOIL| null|
|74228875.0| RAIL| INTERMODAL|18-Oct-18| 2105.0| MUSICAL INSTRUMENTS| C53'|Intermodal Contai...| 3500.0|20000.0|SAN BERNARDINO| CA| 92408| 28-Sep-18| 29-Sep-18| FORT WAYNE| IN|46818.0| 11-Oct-18| 11-Oct-18| CELTTIIL| null|
|74228881.0| VAN| VAN| 8-Oct-18| 1.0| CLOTHING OR APPAREL| SRTL|Straight Truck w/...| 266.13|25000.0| POMONA| CA| 91766| 2-Oct-18| 2-Oct-18| SANTA MONICA| CA|90401.0| 2-Oct-18| 2-Oct-18| ADRLCOCA| 239.76|
+---------+-------------+--------------------+---------+-------------+--------------------+-----------------+--------------------+--------------+-------+--------------+-----+------+----------+------------+--------------+------+-------+-----------+-------------+----------+----------------+
only showing top 20 rows
The table may well hold more than 100,000 records. The query itself returns all of them; it is show() that displays only the first 20 rows by default. The records shown are in no particular order.
This is our chance to take a good look at the columns and the associated data that we are dealing with.
Importing required ML libraries
Because this project deals mainly with Machine Learning techniques, we need to import the associated libraries. We’ll rely heavily on sklearn, a.k.a. scikit-learn. We also need data manipulation libraries to transform the data; for that, we’ll use numpy and pandas. Finally, to visualize the output, we’ll use the matplotlib.pyplot library.
%LivyPy3.pyspark
########### Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn import metrics
Understanding the data
Let’s take a closer look at the data!
Let’s get the record count before and after dropping NA values. NA stands for “not available”, meaning missing values: these records contain blank or null data in at least one column.
%LivyPy3.pyspark
######### Converting dataset to pandas
pdfFS_orders_hive = FS_orders_hive.toPandas()
####### Misc commands
##pdfFS_orders_hive.count()
####### Total records
print('Total records:', len(pdfFS_orders_hive))
######## Dropping null/improper records
pdfFS_orders_hive_dropna = pdfFS_orders_hive.dropna()
print('Total records after dropping na:', len(pdfFS_orders_hive_dropna))
Output
Total records: 86864
Total records after dropping na: 76285
$\frac{86864 - 76285}{86864} \times 100\% \approx 12.2\%$
As you can see, more than 12% of the records contain at least one NA value. It’s best to drop them, since at this point we are just trying to cleanse the data rather than enrich it.
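Before dropping rows, it can help to see which columns actually hold the missing values. The sketch below uses a tiny made-up frame (the column names mirror the table above, the values are invented for illustration) and standard pandas calls:

```python
import pandas as pd

# Toy stand-in for pdfFS_orders_hive; values are invented for illustration.
orders = pd.DataFrame({
    "weight": [5500.0, None, 20000.0, 2305.0],
    "pu_city": ["POMONA", "ALBANY", None, "KENTWOOD"],
    "carrier_base_pay": [900.0, 4600.0, 3400.0, None],
})

# Number of missing values in each column.
missing_per_column = orders.isna().sum()

# dropna() removes every row that has at least one missing value.
clean = orders.dropna()

# Share of rows lost, as a percentage.
dropped_pct = (len(orders) - len(clean)) / len(orders) * 100

print(missing_per_column)
print("Rows dropped: %.1f%%" % dropped_pct)
```

On the real data, the same two calls would be applied to pdfFS_orders_hive in place of the toy frame.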
Split the data into Train and Test datasets
Once we are comfortable with the data we have, we move to the next step, i.e., splitting the data into a training set and a testing set. This lets us keep a part of the original dataset aside so that we can use it to test our final machine learning model. You could also use other sampling methods if you want to, but for this project, we’ll keep it simple.
A common split is 80%/20%: use 80% of the data for training and 20% for testing. Let’s do the same here.
%LivyPy3.pyspark
##### split dataset 80%(train) 20%(test)
pdfFS_orders_hive_dropna_copy = pdfFS_orders_hive_dropna.copy()
train_set = pdfFS_orders_hive_dropna.sample(frac=0.80, random_state=0)
print('Total train records count:', len(train_set))
test_set = pdfFS_orders_hive_dropna.drop(train_set.index)
print('Total test records count:', len(test_set))
Output
Total train records count: 61028
Total test records count: 15257
80% of our roughly 76K records gives us about 61K records for training, and the remaining 15K or so are for testing.
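The train_test_split helper imported earlier does the same job in one call. Here is a minimal sketch on an invented ten-row frame; it will not pick the same rows as the sample/drop approach above, so treat it as an alternative rather than a reproduction of the article's split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for pdfFS_orders_hive_dropna (invented values).
orders = pd.DataFrame({
    "weight": [float(w) for w in range(10)],
    "carrier_base_pay": [100.0 * w for w in range(10)],
})

# 80% train / 20% test; random_state makes the split repeatable.
train_demo, test_demo = train_test_split(orders, test_size=0.20, random_state=0)

print(len(train_demo), len(test_demo))
```

Either way, the two sets should share no rows, which is what makes the test set a fair check later on.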
Plotting using matplotlib (Optional)
You do not have to plot the values, but it is just another way of looking at the data to understand it better.
%LivyPy3.pyspark
#####Plot train and test dataset - not happening with Zeppelin!!!
from matplotlib import pyplot as plt
from matplotlib import style
style.use('ggplot')
x = ['train','test','total']
y = [len(train_set),len(test_set),len(pdfFS_orders_hive_dropna_copy)]
x , y
plt.bar(x, y, align='center')
plt.title('bar chart to plot the counts')
plt.ylabel('record counts')
plt.xlabel('type')
plt.show()
Output
(['train', 'test', 'total'], [61028, 15257, 76285])
PENDING Plot output
Data Transformation - I
Let’s convert some of the non-numeric fields to numeric fields to ease the training process.
%LivyPy3.pyspark
##### Convert dataframes to numeric.
### Train_set - train_set_to_numeric
train_set_to_numeric = train_set.apply(pd.to_numeric , errors='ignore')
#### Checking data definition of the train_set
train_set_to_numeric.dtypes
len(train_set_to_numeric)
###### Test_set - test_set_to_numeric
test_set_to_numeric = test_set.apply(pd.to_numeric , errors='ignore')
test_set_to_numeric.dtypes
len(test_set_to_numeric)
##### more stats
len(train_set_to_numeric.del_zip.unique())
train_set_to_numeric.shape
train_set_to_numeric.describe()
train_set_to_numeric.isnull().any()
Output
id False
order_type_id False
order_type_desc False
bill_date False
bill_distance False
commodity False
equipment_type_id False
equipment_type_desc False
freight_charge False
weight False
pu_city False
pu_st False
pu_zip False
pu_arrival False
pu_departure False
del_city False
del_st False
del_zip False
del_arrival False
del_departure False
carrier_id False
carrier_name False
carrier_base_pay False
dtype: bool
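A note on the errors option used above: with errors='ignore', a column is left as it is when any of its values cannot be parsed, and recent pandas releases flag that option as deprecated. A hedged alternative is errors='coerce', which converts what it can and turns the rest into NaN. The values below are invented for illustration:

```python
import pandas as pd

raw = pd.Series(["357.0", "2799.0", "BILL_DISTANCE"])  # last value is not a number

# errors='coerce' converts what it can and marks the rest as NaN.
as_numbers = pd.to_numeric(raw, errors="coerce")

print(as_numbers.tolist())   # two floats followed by nan
print(as_numbers.dtype)      # float64
```

With 'coerce', text columns such as city names would become all-NaN, so it should be applied only to the columns that are meant to be numeric.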
Data Transformation - II
Now that the data has been converted to numeric fields where possible, we need to transform some of the remaining text fields into categorical fields so that they work with the ML libraries and their parameters.
%LivyPy3.pyspark
### del [train_set_to_numeric]
########### Misc commands
##train_set_to_numeric.dtypes
##train_set_to_numeric.order_type_id
##train_set_to_numeric.order_type_desc
##train_set_to_numeric.bill_distance[47189]
##len(train_set_to_numeric.bill_distance.unique())
##len(train_set_to_numeric.commodity.unique())
##train_set_to_numeric.pu_arrival
##len(train_set_to_numeric.carrier_id.unique())
################ Converting 6 object variables to categorical variables to train a model.
#### Train set
train_set_to_numeric.order_type_id = train_set['order_type_id'].astype('category')
train_set_to_numeric.commodity = train_set['commodity'].astype('category')
train_set_to_numeric.equipment_type_id = train_set['equipment_type_id'].astype('category')
train_set_to_numeric.pu_city = train_set['pu_city'].astype('category')
train_set_to_numeric.del_city = train_set['del_city'].astype('category')
train_set_to_numeric.carrier_id = train_set['carrier_id'].astype('category')
train_set_to_numeric.dtypes
#### Test set
test_set_to_numeric.order_type_id = test_set['order_type_id'].astype('category')
test_set_to_numeric.commodity = test_set['commodity'].astype('category')
test_set_to_numeric.equipment_type_id = test_set['equipment_type_id'].astype('category')
test_set_to_numeric.pu_city = test_set['pu_city'].astype('category')
test_set_to_numeric.del_city = test_set['del_city'].astype('category')
test_set_to_numeric.carrier_id = test_set['carrier_id'].astype('category')
test_set_to_numeric.dtypes
Output
id float64
order_type_id category
order_type_desc object
bill_date object
bill_distance float64
commodity category
equipment_type_id category
equipment_type_desc object
freight_charge float64
weight float64
pu_city category
pu_st object
pu_zip object
pu_arrival object
pu_departure object
del_city category
del_st object
del_zip float64
del_arrival object
del_departure object
carrier_id category
carrier_name object
carrier_base_pay float64
dtype: object
Feature Engineering
Understanding the features (the columns, or attributes, of the records) that we need to train the machine learning model is one of the crucial steps in the data science lifecycle. Looking at the data we have, we need to find correlations between the target variable that we are trying to predict and the feature matrix (columns) we were given.
%LivyPy3.pyspark
######### Creating datasets with required features for learning.
############# Train_set
train_set_to_numeric_features = train_set_to_numeric.drop(columns = ['id', 'order_type_desc', 'bill_date', 'equipment_type_desc', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_name', 'carrier_base_pay'])
train_set_to_numeric_features.columns
############ Test_set
test_set_to_numeric_features = test_set_to_numeric.drop(columns = ['id', 'order_type_desc', 'bill_date', 'equipment_type_desc', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_name', 'carrier_base_pay'])
test_set_to_numeric_features.columns
Output
Index(['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id'],
dtype='object')
Extracting label field from the train and test sets
You could probably do this in a much simpler way, but I thought, let’s just use the same drop command to remove every field except carrier_base_pay, which is our target label. 😄
%LivyPy3.pyspark
########### Creating labels/dependent variable for test and train datasets
##### Train_set
### train_set_to_numeric_label = train_set_to_numeric.copy()
train_set_to_numeric_label = train_set_to_numeric.drop(columns = ['id', 'order_type_id', 'order_type_desc', 'bill_date', 'bill_distance', 'commodity', 'equipment_type_id', 'equipment_type_desc', 'freight_charge', 'weight', 'pu_city', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_city', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_id', 'carrier_name'])
train_set_to_numeric_label.dtypes
### Test_set
test_set_to_numeric_label = test_set_to_numeric.drop(columns = ['id', 'order_type_id', 'order_type_desc', 'bill_date', 'bill_distance', 'commodity', 'equipment_type_id', 'equipment_type_desc', 'freight_charge', 'weight', 'pu_city', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_city', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_id', 'carrier_name'])
test_set_to_numeric_label.dtypes
Output
carrier_base_pay float64
dtype: object
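For reference, the simpler way mentioned above could look like this: selecting the one column directly instead of dropping the other twenty-two. The frame is a toy with three of the article's columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "weight": [5500.0, 18832.0],
    "freight_charge": [1250.0, 4900.0],
    "carrier_base_pay": [900.0, 4600.0],
})

# Double brackets keep the result a one-column DataFrame,
# the same shape that dropping every other column produces.
label_by_selection = demo[["carrier_base_pay"]]
label_by_drop = demo.drop(columns=["weight", "freight_charge"])

print(label_by_selection.equals(label_by_drop))  # True
```

Applied to the article's data, that would be train_set_to_numeric[['carrier_base_pay']] and the same for the test set.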
Data Transformation - III
Assigning categorical codes to our features
Because we have chosen to treat some of our features as categorical, we need to replace each category with its integer code so that the ML algorithms, which expect numeric input, can use these features during training.
%LivyPy3.pyspark
### Assigning categorical codes to 6 features/attributes
train_set_to_numeric_features_cat_codes = train_set_to_numeric_features.copy()
test_set_to_numeric_features_cat_codes = test_set_to_numeric_features.copy()
train_set_to_numeric_features_cat_codes['order_type_id'] = train_set_to_numeric_features['order_type_id'].cat.codes
train_set_to_numeric_features_cat_codes['commodity'] = train_set_to_numeric_features['commodity'].cat.codes
train_set_to_numeric_features_cat_codes['equipment_type_id'] = train_set_to_numeric_features['equipment_type_id'].cat.codes
train_set_to_numeric_features_cat_codes['pu_city'] = train_set_to_numeric_features['pu_city'].cat.codes
train_set_to_numeric_features_cat_codes['del_city'] = train_set_to_numeric_features['del_city'].cat.codes
train_set_to_numeric_features_cat_codes['carrier_id'] = train_set_to_numeric_features['carrier_id'].cat.codes
test_set_to_numeric_features_cat_codes['order_type_id'] = test_set_to_numeric_features['order_type_id'].cat.codes
test_set_to_numeric_features_cat_codes['commodity'] = test_set_to_numeric_features['commodity'].cat.codes
test_set_to_numeric_features_cat_codes['equipment_type_id'] = test_set_to_numeric_features['equipment_type_id'].cat.codes
test_set_to_numeric_features_cat_codes['pu_city'] = test_set_to_numeric_features['pu_city'].cat.codes
test_set_to_numeric_features_cat_codes['del_city'] = test_set_to_numeric_features['del_city'].cat.codes
test_set_to_numeric_features_cat_codes['carrier_id'] = test_set_to_numeric_features['carrier_id'].cat.codes
##train_set_to_numeric_features.dtypes
##test_set_to_numeric_features.dtypes
##train_set_to_numeric_features_cat_codes
##train_set_to_numeric_features_cat_codes['order_type_id'][47189], train_set_to_numeric_features['order_type_id'][47189]
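One caveat worth knowing: .cat.codes numbers the categories within each frame, so when the train and test sets are converted to categories independently, as in the cells above, the same city can end up with a different code in each set. The sketch below shows one way to keep the codes aligned by reusing the training categories. The city lists are invented, and mapping unseen test values to -1 is an assumption about how you would want to handle them:

```python
import pandas as pd

train_city = pd.Series(["POMONA", "IRVINE", "WACO"])
test_city = pd.Series(["WACO", "POMONA", "TEMPE"])  # TEMPE never appears in train

# Learn the category list from the training data only.
city_categories = train_city.astype("category").cat.categories

# Apply the same category list to both sets so the codes line up.
train_codes = pd.Categorical(train_city, categories=city_categories).codes
test_codes = pd.Categorical(test_city, categories=city_categories).codes

# For comparison: encoding the test set on its own gives different codes.
independent_test_codes = test_city.astype("category").cat.codes

print(list(city_categories))         # ['IRVINE', 'POMONA', 'WACO']
print(list(train_codes))             # [1, 0, 2]
print(list(test_codes))              # [2, 1, -1]  (-1 marks a value unseen in training)
print(list(independent_test_codes))  # [2, 0, 1]
```

In the shared encoding POMONA is 1 in both sets, while the independent encoding gives it 0 in the test set, which is the mismatch to watch for.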
Assigning names - ‘X’ and ‘Y’ to the final datasets (Optional)
Just for fun, let’s name the features 'X' and the labels 'Y'. The shorter names keep the code compact and bring us closer to the ML training steps.
%LivyPy3.pyspark
############ Final features (x) and labels (y)
#### X train
x_train = train_set_to_numeric_features_cat_codes.copy()
##train_set_to_numeric_features
#### X test
x_test = test_set_to_numeric_features_cat_codes.copy()
##train_set_to_numeric_features
#### Y train
y_train = train_set_to_numeric_label.copy()
#### Y test
y_test = test_set_to_numeric_label.copy()
y_test[113:114]
Output
carrier_base_pay
639 750.0
Machine Learning Training
Let the fun begin! This is what we are here for: training the ML model. Our features are now all numeric (including the categorical codes) and our label is a continuous value, so we’ll use regression techniques to train the model.
Linear Regression (our best loyal friend)
One of the basic regression techniques to try out first is Linear Regression. Technically speaking, a linear regression can only capture a linear relationship between the features and the target. The model is also not penalized for its choice of weights, full stop. This means that during training, if the model finds one feature particularly important, it may place a large weight on that feature, which might lead to overfitting on small datasets. By big data standards, a dataset under 100K records isn’t big at all. Hence, regularized methods such as LASSO may do better.
%LivyPy3.pyspark
###### Training the MULTIPLE LINEAR REGRESSION algorithm -- linear
## LR
lr = LinearRegression()
lr_train = lr.fit(x_train, y_train)
print('lr_train complete!')
############################## Stats
####lr_train.summary()
### Key coefficients chosen by the LR algorithm
##coeff_df = pd.DataFrame(lr.coef_, x_train.columns, columns=['Coefficient'])
##coeff_df
##lr_train(lr.coef_, X.columns, columns=['Coefficient'])
##lr_train.score(x_train ,y_train)
##lr_train.coef_ , lr_train.intercept_
print('lr_train.coef_ and x_train.columns\n', lr_train.coef_, '\n', x_train.columns)
Output
lr_train complete!
lr_train.coef_ and x_train.columns
[[-3.33850643e+00 1.15221818e-01 -1.83800535e-02 4.90240105e+00
8.29745310e-01 2.76253066e-03 -6.24481228e-04 8.62335947e-04
1.25402509e-03]]
Index(['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id'],
dtype='object')
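To make the idea of fitted weights concrete, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. LinearRegression solves the same kind of least-squares problem for all nine features at once; the function name and the numbers below are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

distance = [100.0, 200.0, 300.0, 400.0]
pay = [150.0, 250.0, 350.0, 450.0]
slope, intercept = fit_line(distance, pay)
print(slope, intercept)  # 1.0 50.0
```

Each value in lr_train.coef_ plays the role of slope for one feature, with lr_train.intercept_ as the shared offset.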
LASSO
Lasso is a modified form of Linear Regression in which the model is penalized for the sum of the absolute values of the weights. As a result, the absolute values of the weights are (in general) reduced, and many tend to become exactly zero. The alpha value passed as a parameter to the Lasso constructor sets the strength of the penalty.
%LivyPy3.pyspark
###### Training the LASSO algorithm -- linear
lasso = linear_model.Lasso(alpha=50)
lasso_train = lasso.fit(x_train, y_train)
print('lasso_train complete!')
print('\n lasso_train summary\n', lasso_train)
lasso_train.coef_ , lasso_train.intercept_
print('\n lasso_train.coef_ and lasso_train.intercept_\n', lasso_train.coef_, '\n', lasso_train.intercept_)
Output
lasso_train complete!
lasso_train summary
Lasso(alpha=50, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
lasso_train.coef_ and lasso_train.intercept_
[-0.00000000e+00 1.13367273e-01 -1.86165577e-02 3.68935985e+00
8.31349235e-01 2.75292477e-03 -8.13115624e-04 6.88206445e-04
1.31730816e-03]
[-209.36205307]
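The first coefficient in the output above is exactly zero, which is the penalty at work. In the textbook special case of a single standardized feature, the Lasso weight is the unpenalized weight pulled toward zero by the penalty and clipped at zero, an operation known as soft-thresholding. The sketch below illustrates only that special case; scikit-learn's solver works over all features by coordinate descent, and the weights here are invented:

```python
def soft_threshold(weight, penalty):
    """Shrink weight toward zero by penalty; return 0.0 if it would cross zero."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0

ols_weights = [-3.0, 0.5, 4.0]   # invented unpenalized weights
shrunk = [soft_threshold(w, 1.0) for w in ols_weights]
print(shrunk)  # [-2.0, 0.0, 3.0]
```

A weight smaller in magnitude than the penalty is set to zero outright, which is why Lasso can act as a rough feature selector.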
Random Forest
Let’s also try a couple of non-linear models, starting with an ensemble method, to see which does better.
%LivyPy3.pyspark
###### Training the RANDOM FOREST algorithm - non linear
rf = RandomForestRegressor(n_estimators=20, random_state=1) ###around 10 minutes
##rf = RandomForestRegressor(n_estimators=200, random_state=1) ###around 10 minutes
rf_train = rf.fit(x_train, y_train)
print('rf_train complete!')
print('\n rf_train summary:\n', rf_train)
print('\n rf_train feature importances:\n', rf_train.feature_importances_)
##rf_train.coef_ , rf_train.intercept_
##print('\n rf_train.coef_ and rf_train.intercept_\n', rf_train.coef_, '\n', rf_train.intercept_)
Output
rf_train complete!
rf_train summary:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
oob_score=False, random_state=1, verbose=0, warm_start=False)
rf_train feature importances:
[0.00234888 0.01285913 0.00740664 0.00392913 0.94589572 0.00968901
0.00614 0.00543848 0.00629303]
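The importances are easier to read next to their column names. This sketch pairs the values printed above with the feature names from x_train.columns, assuming (as scikit-learn does) that the importances come back in column order:

```python
# Column names and importances copied from the outputs above.
feature_names = ['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
                 'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id']
importances = [0.00234888, 0.01285913, 0.00740664, 0.00392913, 0.94589572,
               0.00968901, 0.00614, 0.00543848, 0.00629303]

# Sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<18} {score:.4f}")
```

In the notebook itself, zip(x_train.columns, rf_train.feature_importances_) would give the same pairs. freight_charge dominates at roughly 0.95, with every other feature below 0.02.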
SVR (Support Vector Regression)
%LivyPy3.pyspark
###### Training the SUPPORT VECTOR REGRESSION algorithm - non linear
##svr = SVR(gamma='scale', C=1.0, epsilon=0.2)
svr = SVR() ## ETA 5 minutes
svr_train = svr.fit(x_train, y_train)
##print('svr_train complete!')
print('\n svr_train summary:\n', svr_train)
Output
svr_train summary:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
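An RBF-kernel SVR is sensitive to feature scale, and its default C and epsilon are small next to a target measured in hundreds or thousands. If you want to experiment, one common pattern is to standardize the features inside a pipeline. The sketch below runs on synthetic data; the C and epsilon values are illustrative guesses, and whether scaling would help on this dataset is an assumption the article does not test:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on a large scale, loosely like weight and distance.
rng = np.random.default_rng(0)
X = rng.uniform(0, 30000, size=(200, 2))
y = 0.1 * X[:, 0] + 0.05 * X[:, 1]

# Standardize features first, then fit the SVR on the scaled values.
scaled_svr = make_pipeline(StandardScaler(), SVR(C=1000.0, epsilon=1.0))
scaled_svr.fit(X, y)

pred = scaled_svr.predict(X[:5])
print(pred.shape)
```

The pipeline applies the same scaling at prediction time, so it could be used in place of the bare SVR() without changing the rest of the notebook.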
Predicting the label using Test datasets
Now that the training is completed, we need to move to the next stage where we can run the model against the test dataset.
Testing against Linear Regression
%LivyPy3.pyspark
######################## Algorithm #1
### Do prediction on the test data using LR
y_pred_lr = lr.predict(x_test)
print('y_pred_lr complete!')
Output
y_pred_lr complete!
Testing against LASSO
%LivyPy3.pyspark
###################### Algorithm #2
###### Do prediction on the test data using LASSO
y_pred_lasso = lasso.predict(x_test)
print('y_pred_lasso complete!')
Output
y_pred_lasso complete!
Testing against Random Forest
%LivyPy3.pyspark
######################## Algorithm #3
###### Do prediction on the test data using RF
y_pred_rf = rf.predict(x_test)
print('y_pred_rf complete!')
Output
y_pred_rf complete!
Testing against SVR (support vector regression)
%LivyPy3.pyspark
######################## Algorithm #4
###### Do prediction on the test data using SVR
y_pred_svr = svr.predict(x_test)
print('y_pred_svr complete!')
Output
y_pred_svr complete!
Verifying the results
%LivyPy3.pyspark
#### verifying the results
##df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
##df1 = df.head(25)
##len(y_pred_lr)
##results = [y_test, y_pred]
##y_test[100], y_pred[100]
##y_test[3]
##y_test_copy = y_test.copy()
##y_test_appended = y_test_copy.append(y_pred_lr, ignore_index=True)
y_test
print('verifying the results!')
print('y_pred_lasso - ', y_pred_lasso[103])
print('y_pred_lr - ', y_pred_lr[103])
print('y_pred_rf - ', y_pred_rf[103])
print('y_pred_svr - ', y_pred_svr[103])
print('y_test - ', y_test[103:104])
Output
verifying the results!
y_pred_lasso - 3300.649436439301
y_pred_lr - [3287.63222526]
y_pred_rf - 3248.15
y_pred_svr - 1085.4400155267074
y_test - carrier_base_pay
576 3200.0
Calculating Metrics
There are a couple of ways to understand how well our models perform. One way is to calculate error metrics, i.e., Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error. Lower values are better for all three.
%LivyPy3.pyspark
### Calculating Metrics LR
print('### Calculating Metrics - LR')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_lr))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_lr))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_lr)))
### Calculating Metrics LASSO
print('\n### Calculating Metrics - LASSO')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_lasso))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_lasso))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_lasso)))
### Calculating Metrics RF
print('\n### Calculating Metrics - RF')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_rf))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_rf))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rf)))
### Calculating Metrics SVR
print('\n### Calculating Metrics - SVR')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_svr))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_svr))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_svr)))
Output
### Calculating Metrics - LR
Mean Absolute Error: 196.5480533740987
Mean Squared Error: 105866.09745638819
Root Mean Squared Error: 325.3707077417821
### Calculating Metrics - LASSO
Mean Absolute Error: 195.7809924834218
Mean Squared Error: 105562.7762233446
Root Mean Squared Error: 324.90425701019154
### Calculating Metrics - RF
Mean Absolute Error: 199.25110190502312
Mean Squared Error: 110547.55835529094
Root Mean Squared Error: 332.48692960068513
### Calculating Metrics - SVR
Mean Absolute Error: 996.5762893287749
Mean Squared Error: 1933387.4782769545
Root Mean Squared Error: 1390.4630445563646
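For readers who want to see what these three numbers measure, here is a plain-Python sketch of the definitions. The function names and the two predictions are invented for illustration; the sklearn functions used above compute the same quantities:

```python
import math

def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean of the squared errors; large misses count more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, back in the units of the target."""
    return math.sqrt(mse(actual, predicted))

actual = [900.0, 4600.0]
predicted = [903.0, 4596.0]   # invented predictions, off by 3 and 4
print(mae(actual, predicted))   # 3.5
print(mse(actual, predicted))   # 12.5
print(rmse(actual, predicted))
```

Because RMSE is in the same units as carrier_base_pay, the values of roughly 325 for LR and LASSO can be read directly as a typical error in pay.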
Calculating R^2 score
%LivyPy3.pyspark
###### R^2 score
##### The coefficient of determination R^2 of the prediction.
#### MORE INFO - The best possible score is 1.0 and it can be negative
## (because the model can be arbitrarily worse).
#### A constant model that always predicts the expected value of y,
## disregarding the input features, would get a R^2 score of 0.0.
print('R Score for Multiple LR Algorithm: ', lr.score(x_test, y_test))
print('R Score for LASSO Algorithm: ', lasso.score(x_test, y_test))
print('R Score for RF Algorithm with 20 estimators: ', rf.score(x_test, y_test))
print('R Score for SVR Algorithm: ', svr.score(x_test, y_test))
Output
R Score for Multiple LR Algorithm: 0.9402042234156325
R Score for LASSO Algorithm: 0.9403755467110041
R Score for RF Algorithm with 20 estimators: 0.9375600191167579
R Score for SVR Algorithm: -0.09202481700704723
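The comments in the cell above describe how R^2 behaves; the plain-Python sketch below shows the definition behind them, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the toy values are invented for illustration:

```python
def r2_score_manual(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [100.0, 200.0, 300.0]
print(r2_score_manual(actual, actual))                 # 1.0  perfect predictions
print(r2_score_manual(actual, [200.0, 200.0, 200.0]))  # 0.0  always predicts the mean
print(r2_score_manual(actual, [300.0, 200.0, 100.0]))  # -3.0 worse than the mean
```

The third case shows how a score can drop below zero, as it does for SVR here: the predictions are further from the actual values than a constant guess of the mean would be.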
Final Conclusion
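Linear Regression and LASSO performed best. The higher the R^2 score, the better the fit, and LASSO came out on top with an R^2 score of 0.9403, narrowly ahead of Linear Regression (0.9402) and Random Forest (0.9376), while SVR scored below zero (-0.0920). The error metrics point to the same ranking.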
Hope you enjoyed the long post! Thanks.