PySpark to Run Machine Learning Models

Note: The data used in this article is fictitious, contains no PII (Personally Identifiable Information), and doesn’t relate to any entity whatsoever.

Ok, let’s begin!

I have been learning PySpark for a couple of months, and I thought it would be nice to post my project details here to share the knowledge. PySpark is a great tool for analysing huge amounts of data. It combines Python with the Spark engine, so you get the best of both worlds.

Project Background

Let’s create a fictitious project for this article. The domain is the shipping industry, and the goal is to predict a key variable, carrier_base_pay (the base amount, excluding taxes, that a customer would owe to ship a package), from the associated feature variables. The feature variables would be the dimensions of the package, weight, source location, destination location, type of shipping (express, overnight or leisure), etc.

PySpark Modules and Zeppelin Notebook

You may have to install the PySpark modules by following online tutorials. Once that step is out of the way, the focus is on using a Zeppelin notebook for the project. A Zeppelin notebook is very similar to a Jupyter notebook, except that it’ll be running on top of the PySpark engine.

Initiating a PySpark Session

To initiate a PySpark session and get the data into it, we start by importing the HiveWarehouseSession library. Next, we use SQL queries through Apache Hive to extract data from our underlying HDFS flat files.

%LivyPy3.pyspark

### start the environment
from pyspark_llap.sql.session import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

Importing data

Let’s run a basic SQL command, i.e., SELECT * FROM <tablename>, to import data into the session.

%LivyPy3.pyspark

##Load the data to dataframe FS_orders_hive
FS_orders_hive = hive.executeQuery("SELECT * FROM LOGISTICS_SDBX_RAW.FS_FY2019_4")
FS_orders_hive.show()

Output

+---------+-------------+--------------------+---------+-------------+--------------------+-----------------+--------------------+--------------+-------+--------------+-----+------+----------+------------+--------------+------+-------+-----------+-------------+----------+----------------+
|       id|order_type_id|     order_type_desc|bill_date|bill_distance|           commodity|equipment_type_id| equipment_type_desc|freight_charge| weight|       pu_city|pu_st|pu_zip|pu_arrival|pu_departure|      del_city|del_st|del_zip|del_arrival|del_departure|carrier_id|carrier_base_pay|
+---------+-------------+--------------------+---------+-------------+--------------------+-----------------+--------------------+--------------+-------+--------------+-----+------+----------+------------+--------------+------+-------+-----------+-------------+----------+----------------+
|      null|ORDER_TYPE_ID|     ORDER_TYPE_DESC|BILL_DATE|         null|           COMMODITY|EQUIPMENT_TYPE_ID| EQUIPMENT_TYPE_DESC|          null|   null|       PU_CITY|PU_ST|PU_ZIP|PU_ARRIVAL|PU_DEPARTURE|      DEL_CITY|DEL_ST|   null|DEL_ARRIVAL|DEL_DEPARTURE|CARRIER_ID|            null|
|74243580.0|         FLAT|             FLATBED|10-Oct-18|        357.0|           MACHINERY|               FT|Flatbed w/Tarps (...|        1250.0| 5500.0|        POMONA|   CA| 91768| 28-Sep-18|   28-Sep-18|         TEMPE|    AZ|85280.0|   1-Oct-18|     1-Oct-18|  SAFRJAFL|           900.0|
|74235585.0|          VAN|                 VAN|12-Oct-18|       2799.0|CONSUMER GOODS (R...|              V53|         53' Dry Van|        4900.0|18832.0|        ALBANY|   GA| 31705| 28-Sep-18|   28-Sep-18|        EUGENE|    OR|97402.0|   4-Oct-18|     4-Oct-18|  MBGLELIL|          4600.0|
|74235787.0|          VAN|                 VAN|10-Oct-18|       2011.0|    DISPLAY MATERIAL|              V53|         53' Dry Van|        3850.0|20000.0|        IRVINE|   CA| 92602| 28-Sep-18|   28-Sep-18|       ANTIOCH|    IL|60002.0|   2-Oct-18|     2-Oct-18|  HIGHELIL|          3400.0|
|74235481.0|          VAN|                 VAN| 8-Oct-18|       1640.0|  PACKAGING MATERIAL|              V53|         53' Dry Van|        3180.0|20370.0|          WACO|   TX| 76712| 28-Sep-18|   28-Sep-18|       CLIFTON|    NJ| 7014.0|   1-Oct-18|     1-Oct-18|  HUNTPANJ|          3500.0|
|74235583.0|          VAN|                 VAN|10-Oct-18|       2004.0|    DISPLAY MATERIAL|              V53|         53' Dry Van|        3850.0|15000.0|        IRVINE|   CA| 92602| 27-Sep-18|   27-Sep-18|MOUNT PROSPECT|    IL|60056.0|   3-Oct-18|     3-Oct-18|  PROLWOIL|          3300.0|
|74234584.0|          LTL|           LTL ORDER|22-Oct-18|       1238.0|            HARDWARE|             LTLS|        Standard LTL|        452.55| 2305.0|      KENTWOOD|   MI| 49512| 28-Sep-18|   28-Sep-18|       HOUSTON|    TX|77041.0|  10-Oct-18|    10-Oct-18|  CLEAININ|            null|
|74228170.0|          VAN|                 VAN| 8-Oct-18|        970.0|SHAMPOOS/CONDITIO...|              V53|         53' Dry Van|        1600.0|26857.0|        POOLER|   GA| 31322| 28-Sep-18|   28-Sep-18|         NILES|    IL|60714.0|   1-Oct-18|     1-Oct-18|  ANGTELIL|          1300.0|
|74228471.0|          VAN|                 VAN| 3-Oct-18|        970.0|SHAMPOOS/CONDITIO...|              V53|         53' Dry Van|        1600.0|18363.0|        POOLER|   GA| 31322| 26-Sep-18|   26-Sep-18|         NILES|    IL|60714.0|  28-Sep-18|    28-Sep-18|  LYNXWAIA|          1300.0|
|74272873.0|          VAN|                 VAN|11-Oct-18|        809.0|COMSUMER GOODS (R...|              V53|         53' Dry Van|        2150.0|27366.0|        POOLER|   GA| 31322| 26-Sep-18|   26-Sep-18|     MOONACHIE|    NJ| 7074.0|  28-Sep-18|    28-Sep-18|  FREIPLNJ|          1800.0|
|74228794.0|         RAIL|          INTERMODAL|18-Oct-18|       2105.0| MUSICAL INSTRUMENTS|             C53'|Intermodal Contai...|        3500.0|20000.0|SAN BERNARDINO|   CA| 92408| 26-Sep-18|   26-Sep-18|    FORT WAYNE|    IN|46818.0|   8-Oct-18|     8-Oct-18|  CELTTIIL|            null|
|74422876.0|          VAN|                 VAN|12-Oct-18|          1.0| CLOTHING OR APPAREL|             SRTL|Straight Truck w/...|         572.4|25000.0|        POMONA|   CA| 91766|  2-Oct-18|    2-Oct-18| SAN FRANCISCO|    CA|94102.0|   2-Oct-18|     2-Oct-18|  TOMAFRCA|          515.68|
|74228577.0|          VAN|                 VAN|12-Oct-18|          1.0| CLOTHING OR APPAREL|             SRTL|Straight Truck w/...|        307.93|25000.0|        POMONA|   CA| 91766|  2-Oct-18|    2-Oct-18|    EMERYVILLE|    CA|94608.0|   2-Oct-18|     2-Oct-18|  WALTHACA|          335.51|
|74212878.0|          VAN|                 VAN|12-Oct-18|          1.0| CLOTHING OR APPAREL|             SRTL|Straight Truck w/...|        244.48|25000.0|        POMONA|   CA| 91766|  4-Oct-18|    4-Oct-18|   SANTA CLARA|    CA|95050.0|   4-Oct-18|     4-Oct-18|  TOMAFRCA|          220.25|
|74222879.0|          VAN|                 VAN|12-Oct-18|          1.0| CLOTHING OR APPAREL|             SRTL|Straight Truck w/...|        276.26|25000.0|        POMONA|   CA| 91766|  2-Oct-18|    2-Oct-18|   SANTA CLARA|    CA|95050.0|   2-Oct-18|     2-Oct-18|  TOMAFRCA|          248.88|
|74228472.0|          VAN|                 VAN| 3-Oct-18|        493.0|COMSUMER GOODS (R...|              V53|         53' Dry Van|        1530.0|27883.0|        POOLER|   GA| 31322| 26-Sep-18|   26-Sep-18|         MIAMI|    FL|33172.0|  27-Sep-18|    27-Sep-18|  AAPCMIFL|          1300.0|
|74228830.0|          VAN|                 VAN| 8-Oct-18|          1.0| CLOTHING OR APPAREL|             SRTL|Straight Truck w/...|        149.22|25000.0|        POMONA|   CA| 91766|  2-Oct-18|    2-Oct-18|    COSTA MESA|    CA|92626.0|   2-Oct-18|     2-Oct-18|  ADRLCOCA|          134.43|
|74228569.0|      PARTIAL|Partial Truckload...|15-Oct-18|       1468.0|coated film **Par...|              V53|         53' Dry Van|        3375.0|21130.0|     IOWA CITY|   IA| 52240| 27-Sep-18|   27-Sep-18|       MIRAMAR|    FL|33025.0|   1-Oct-18|     1-Oct-18|  F5EXBOIL|            null|
|74228875.0|         RAIL|          INTERMODAL|18-Oct-18|       2105.0| MUSICAL INSTRUMENTS|             C53'|Intermodal Contai...|        3500.0|20000.0|SAN BERNARDINO|   CA| 92408| 28-Sep-18|   29-Sep-18|    FORT WAYNE|    IN|46818.0|  11-Oct-18|    11-Oct-18|  CELTTIIL|            null|
|74228881.0|          VAN|                 VAN| 8-Oct-18|          1.0| CLOTHING OR APPAREL|             SRTL|Straight Truck w/...|        266.13|25000.0|        POMONA|   CA| 91766|  2-Oct-18|    2-Oct-18|  SANTA MONICA|    CA|90401.0|   2-Oct-18|     2-Oct-18|  ADRLCOCA|          239.76|
+---------+-------------+--------------------+---------+-------------+--------------------+-----------------+--------------------+--------------+-------+--------------+-----+------+----------+------------+--------------+------+-------+-----------+-------------+----------+----------------+
only showing top 20 rows

The table may well hold more than 100,000 records. The query returns all of them, but show() only displays the first 20 rows by default. The records shown are in no particular order.

This is our chance to take a good look at the columns and the associated data that we are dealing with.

Importing required ML libraries

For this project, because we’ll mainly be dealing with Machine Learning techniques, we need to import the associated libraries. We’ll be using sklearn, a.k.a. scikit-learn, throughout. We also need data manipulation libraries to transform the data; for that, we’ll use numpy and pandas. Finally, to visualize the output, we’ll use the matplotlib.pyplot library.

%LivyPy3.pyspark

########### Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

from sklearn.model_selection import train_test_split
from sklearn import metrics

Understanding the data

Let’s take a closer look at the data!

Let’s get the record count “before” and “after” dropping NA values. NA signifies “not available”, meaning missing values. These records possess either blank or null data.

%LivyPy3.pyspark

######### Converting dataset to pandas
pdfFS_orders_hive = FS_orders_hive.toPandas()

####### Misc commands
##pdfFS_orders_hive.count()

####### Total records
print('Total records:', len(pdfFS_orders_hive))

######## Dropping null/improper records
pdfFS_orders_hive_dropna = pdfFS_orders_hive.dropna()
print('Total records after dropping na:', len(pdfFS_orders_hive_dropna))

Output

Total records: 86864
Total records after dropping na: 76285

$\frac{86864 - 76285}{86864} \times 100\% \approx 12.2\%$

As you can see, more than 12% of the records in the dataset have NA values. It’s best to drop them, as at this point we are just trying to cleanse the data rather than enrich it.
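The percentage can be double-checked with a few lines of plain Python. This is only a sanity-check sketch: the variable names are my own, and the two counts are copied from the output above.

```python
# Record counts copied from the output above.
total_records = 86864
records_after_dropna = 76285

# Rows removed by dropna() and their share of the original dataset.
dropped_records = total_records - records_after_dropna
dropped_pct = dropped_records / total_records * 100

print(f"Dropped {dropped_records} records ({dropped_pct:.1f}% of the data)")
```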

Split the data into Train and Test datasets

Once we are comfortable with the data we have, we move to the next step, i.e., splitting the data into a training set and a testing set. This lets us set aside a part of our original dataset so that we can use it to test our final machine learning model. You could also use sampling methods if you want to, but for this project, we’ll keep it simple.

The usual split is 80% and 20%, meaning 80% is used for training and 20% for testing. Let’s do the same in our case.

%LivyPy3.pyspark

##### split dataset 80%(train) 20%(test)
pdfFS_orders_hive_dropna_copy = pdfFS_orders_hive_dropna.copy()

train_set = pdfFS_orders_hive_dropna.sample(frac=0.80, random_state=0)
print('Total train records count:', len(train_set))

test_set = pdfFS_orders_hive_dropna.drop(train_set.index)
print('Total test records count:', len(test_set))

Output

Total train records count: 61028
Total test records count: 15257

80% of our 76K records gives us 61K records for training, and the rest is for testing.
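To make the split itself less of a black box, here is a minimal plain-Python sketch of the same idea: shuffle the row positions with a fixed seed, take the first 80% for training, and keep the rest for testing. The helper name split_indices is hypothetical, and the shuffle will not reproduce the exact rows pandas picks with sample(frac=0.80, random_state=0); only the counts are expected to match.

```python
import random

def split_indices(n_rows, train_frac=0.80, seed=0):
    """Return (train_positions, test_positions) for a shuffled 80/20 style split."""
    rng = random.Random(seed)           # fixed seed keeps the split reproducible
    positions = list(range(n_rows))
    rng.shuffle(positions)
    n_train = round(n_rows * train_frac)
    return positions[:n_train], positions[n_train:]

# 76285 is the record count after dropping NA values.
train_idx, test_idx = split_indices(76285)
print('Train rows:', len(train_idx), 'Test rows:', len(test_idx))
```

Because the test positions are simply whatever is left after the training positions are taken, no record can end up in both sets.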

Plotting using matplotlib (Optional)

You do not have to plot the values, but it is just another way of looking at the data to understand it better.

%LivyPy3.pyspark
#####Plot train and test dataset - not happening with Zeppelin!!!
from matplotlib import pyplot as plt
from matplotlib import style
 
style.use('ggplot')
 
x = ['train','test','total']
y = [len(train_set),len(test_set),len(pdfFS_orders_hive_dropna_copy)]
x , y 

plt.bar(x, y, align='center')
plt.title('bar chart to plot the counts')
plt.ylabel('record counts')
plt.xlabel('type')
plt.show()

Output

(['train', 'test', 'total'], [61028, 15257, 76285])

PENDING Plot output

Data Transformation - I

Let’s convert some of the non-numeric fields to numeric fields to ease the training process.

%LivyPy3.pyspark

##### Convert dataframes to numeric.
### Train_set - train_set_to_numeric
train_set_to_numeric = train_set.apply(pd.to_numeric , errors='ignore')

#### Checking data definition of the train_set
train_set_to_numeric.dtypes
len(train_set_to_numeric)

###### Test_set - test_set_to_numeric
test_set_to_numeric = test_set.apply(pd.to_numeric , errors='ignore')
test_set_to_numeric.dtypes
len(test_set_to_numeric)

##### more stats
len(train_set_to_numeric.del_zip.unique())
train_set_to_numeric.shape
train_set_to_numeric.describe()
train_set_to_numeric.isnull().any()

Output

id                     False
order_type_id          False
order_type_desc        False
bill_date              False
bill_distance          False
commodity              False
equipment_type_id      False
equipment_type_desc    False
freight_charge         False
weight                 False
pu_city                False
pu_st                  False
pu_zip                 False
pu_arrival             False
pu_departure           False
del_city               False
del_st                 False
del_zip                False
del_arrival            False
del_departure          False
carrier_id             False
carrier_name           False
carrier_base_pay       False
dtype: bool

Data Transformation - II

Now that the data has been converted to numeric fields where possible, we need to transform some of the remaining fields into categorical ones so that they work with the ML libraries and their parameters.

%LivyPy3.pyspark
### del [train_set_to_numeric]

########### Misc commands

##train_set_to_numeric.dtypes
##train_set_to_numeric.order_type_id
##train_set_to_numeric.order_type_desc
##train_set_to_numeric.bill_distance[47189]
##len(train_set_to_numeric.bill_distance.unique())

##len(train_set_to_numeric.commodity.unique())
##train_set_to_numeric.pu_arrival
##len(train_set_to_numeric.carrier_id.unique())

################ Converting 6 object variables to categorical variables to train a model.
#### Train set
train_set_to_numeric.order_type_id = train_set['order_type_id'].astype('category')
train_set_to_numeric.commodity = train_set['commodity'].astype('category')
train_set_to_numeric.equipment_type_id = train_set['equipment_type_id'].astype('category')
train_set_to_numeric.pu_city = train_set['pu_city'].astype('category')
train_set_to_numeric.del_city = train_set['del_city'].astype('category')
train_set_to_numeric.carrier_id = train_set['carrier_id'].astype('category')

train_set_to_numeric.dtypes

#### Test set
test_set_to_numeric.order_type_id = test_set['order_type_id'].astype('category')
test_set_to_numeric.commodity = test_set['commodity'].astype('category')
test_set_to_numeric.equipment_type_id = test_set['equipment_type_id'].astype('category')
test_set_to_numeric.pu_city = test_set['pu_city'].astype('category')
test_set_to_numeric.del_city = test_set['del_city'].astype('category')
test_set_to_numeric.carrier_id = test_set['carrier_id'].astype('category')

test_set_to_numeric.dtypes

id                      float64
order_type_id          category
order_type_desc          object
bill_date                object
bill_distance           float64
commodity              category
equipment_type_id      category
equipment_type_desc      object
freight_charge          float64
weight                  float64
pu_city                category
pu_st                    object
pu_zip                   object
pu_arrival               object
pu_departure             object
del_city               category
del_st                   object
del_zip                 float64
del_arrival              object
del_departure            object
carrier_id             category
carrier_name             object
carrier_base_pay        float64
dtype: object

Feature Engineering

Understanding the features (columns/attributes of the records) that we need to train the machine learning model is one of the crucial steps in the data science lifecycle. Looking at the data we have, we need to find correlations between the target variable that we are trying to predict and the feature matrix (columns) we were given.

%LivyPy3.pyspark
######### Creating datasets with required features for learning.

############# Train_set
train_set_to_numeric_features = train_set_to_numeric.drop(columns = ['id', 'order_type_desc', 'bill_date', 'equipment_type_desc', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_name', 'carrier_base_pay'])

train_set_to_numeric_features.columns

############ Test_set
test_set_to_numeric_features = test_set_to_numeric.drop(columns = ['id', 'order_type_desc', 'bill_date', 'equipment_type_desc', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_name', 'carrier_base_pay'])

test_set_to_numeric_features.columns

Output


Index(['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
       'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id'],
      dtype='object')
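One simple way to look for the correlations mentioned in this section is the Pearson correlation coefficient between a candidate feature and the label. The sketch below is an illustration only: it uses six (freight_charge, carrier_base_pay) pairs copied from the sample rows printed earlier, and the helper pearson is a hypothetical name. Six rows are far too few to draw conclusions; on the real data you would run this over the whole training set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First six sample rows that have a non-null carrier_base_pay.
freight_charge = [1250.0, 4900.0, 3850.0, 3180.0, 3850.0, 1600.0]
carrier_base_pay = [900.0, 4600.0, 3400.0, 3500.0, 3300.0, 1300.0]

r = pearson(freight_charge, carrier_base_pay)
print(f"Correlation between freight_charge and carrier_base_pay: {r:.3f}")
```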

Extracting label field from the train and test sets

You could probably do this in a much simpler way, but I thought let’s just use the same command to drop all fields except the carrier_base_pay, which is our target label. 😄

%LivyPy3.pyspark
########### Creating labels/dependent variable for test and train datasets

##### Train_set
### train_set_to_numeric_label = train_set_to_numeric.copy()

train_set_to_numeric_label = train_set_to_numeric.drop(columns = ['id', 'order_type_id', 'order_type_desc', 'bill_date', 'bill_distance', 'commodity', 'equipment_type_id', 'equipment_type_desc', 'freight_charge', 'weight', 'pu_city', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_city', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_id', 'carrier_name'])

train_set_to_numeric_label.dtypes

### Test_set
test_set_to_numeric_label = test_set_to_numeric.drop(columns = ['id', 'order_type_id', 'order_type_desc', 'bill_date', 'bill_distance', 'commodity', 'equipment_type_id', 'equipment_type_desc', 'freight_charge', 'weight', 'pu_city', 'pu_st', 'pu_zip', 'pu_arrival', 'pu_departure', 'del_city', 'del_st', 'del_zip', 'del_arrival', 'del_departure', 'carrier_id', 'carrier_name'])

test_set_to_numeric_label.dtypes

Output

carrier_base_pay    float64
dtype: object

Data Transformation - III

Assigning categorical codes to our features

Because we have chosen to treat some of our features as categorical, we need to assign each category a numeric code so that the ML algorithms can use these features during training.

%LivyPy3.pyspark

### Assigning categorical codes to 6 features/attributes
train_set_to_numeric_features_cat_codes = train_set_to_numeric_features.copy()

test_set_to_numeric_features_cat_codes = test_set_to_numeric_features.copy()

train_set_to_numeric_features_cat_codes['order_type_id'] = train_set_to_numeric_features['order_type_id'].cat.codes
train_set_to_numeric_features_cat_codes['commodity'] = train_set_to_numeric_features['commodity'].cat.codes
train_set_to_numeric_features_cat_codes['equipment_type_id'] = train_set_to_numeric_features['equipment_type_id'].cat.codes
train_set_to_numeric_features_cat_codes['pu_city'] = train_set_to_numeric_features['pu_city'].cat.codes
train_set_to_numeric_features_cat_codes['del_city'] = train_set_to_numeric_features['del_city'].cat.codes
train_set_to_numeric_features_cat_codes['carrier_id'] = train_set_to_numeric_features['carrier_id'].cat.codes

test_set_to_numeric_features_cat_codes['order_type_id'] = test_set_to_numeric_features['order_type_id'].cat.codes
test_set_to_numeric_features_cat_codes['commodity'] = test_set_to_numeric_features['commodity'].cat.codes
test_set_to_numeric_features_cat_codes['equipment_type_id'] = test_set_to_numeric_features['equipment_type_id'].cat.codes
test_set_to_numeric_features_cat_codes['pu_city'] = test_set_to_numeric_features['pu_city'].cat.codes
test_set_to_numeric_features_cat_codes['del_city'] = test_set_to_numeric_features['del_city'].cat.codes
test_set_to_numeric_features_cat_codes['carrier_id'] = test_set_to_numeric_features['carrier_id'].cat.codes


##train_set_to_numeric_features.dtypes 
##test_set_to_numeric_features.dtypes

##train_set_to_numeric_features_cat_codes
##train_set_to_numeric_features_cat_codes['order_type_id'][47189], train_set_to_numeric_features['order_type_id'][47189]
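One thing to watch with category codes: pandas derives each code from the position of the value in that column's own sorted list of categories, so converting the train and test sets independently can give the same value different codes. The plain-Python sketch below mimics that behaviour to show the effect and one possible remedy (reusing the training categories). The helper category_codes and the sample values are hypothetical; this is not the code used in the project.

```python
def category_codes(values, categories=None):
    """Mimic pandas' .cat.codes: a code is the position in the sorted category list.

    Values that are not in the category list get -1, as pandas does for missing values.
    """
    if categories is None:
        categories = sorted(set(values))
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup.get(value, -1) for value in values], categories

# Hypothetical order types; 'FLAT' happens to be absent from the test sample.
train_values = ['VAN', 'FLAT', 'LTL', 'VAN']
test_values = ['VAN', 'LTL']

train_codes, train_categories = category_codes(train_values)
independent_test_codes, _ = category_codes(test_values)
shared_test_codes, _ = category_codes(test_values, train_categories)

print('train codes:           ', train_codes)
print('test codes, independent:', independent_test_codes)
print('test codes, shared:     ', shared_test_codes)
```

With independent coding, 'VAN' is 2 in the training data but 1 in the test data; with the shared category list it is 2 in both. In pandas, the same idea can be expressed by building the test column with pd.Categorical(values, categories=...) using the training categories.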

Assigning names - ‘X’ and ‘Y’ to the final datasets (Optional)

Just for fun, let’s name the features 'X' and the labels 'Y'. The shorter variable names bring us closer to the ML training steps.

%LivyPy3.pyspark

############ Final attributes or x or features and labels or y
#### X train (features)
x_train = train_set_to_numeric_features_cat_codes.copy()
##train_set_to_numeric_features 

### X test (features)
x_test = test_set_to_numeric_features_cat_codes.copy()
##test_set_to_numeric_features

#### Y train (labels)
y_train = train_set_to_numeric_label.copy()

#### Y test (labels)
y_test = test_set_to_numeric_label.copy()

y_test[113:114]

Output

carrier_base_pay
639             750.0

Machine Learning Training

Let the fun begin! This is what we are here for, to train the ML model. As we have categorical codes in our dataset and our final label is a continuous value, we’ll be using regression techniques to train the model.

Linear Regression (our best loyal friend)

One of the basic regression techniques to try first is Linear Regression. Technically speaking, a linear regression can only capture relationships between the features and the label that are linear. The model is not penalized for its choice of weights, full stop. This means that, during training, if the model finds one particular feature especially important, it may place a large weight on that feature. This might lead to overfitting on small datasets. In big data terms, a dataset under 100K records isn’t big at all. Hence, regularized methods such as LASSO may do better.

%LivyPy3.pyspark
###### Training the MULTIPLE LINEAR REGRESSION algorithm -- linear
## LR
lr = LinearRegression()  
lr_train = lr.fit(x_train, y_train)
print('lr_train complete!')

############################## Stats
####lr_train.summary()
### Key coefficients chosen by lr algorithm
##coeff_df = pd.DataFrame(lr.coef_, x_train.columns, columns=['Coefficient'])  
##coeff_df
##lr_train(lr.coef_, X.columns, columns=['Coefficient'])
##lr_train.score(x_train ,y_train)
##lr_train.coef_ , lr_train.intercept_

print('lr_train.coef_ and x_train.columns\n', lr_train.coef_, '\n', x_train.columns)

Output

lr_train complete!
lr_train.coef_ and x_train.columns
 [[-3.33850643e+00  1.15221818e-01 -1.83800535e-02  4.90240105e+00
   8.29745310e-01  2.76253066e-03 -6.24481228e-04  8.62335947e-04
   1.25402509e-03]] 
 Index(['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
       'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id'],
      dtype='object')

LASSO

Lasso is a modified form of Linear Regression in which the model is penalized for the sum of the absolute values of the weights. Thus, the absolute values of the weights will (in general) be reduced, and many will tend to be zero. The alpha value passed as a parameter to the Lasso constructor decides the strength of the penalty.
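For reference, the objective that scikit-learn's Lasso minimizes can be written as below, where $n$ is the number of training rows, $X$ the feature matrix, $y$ the label vector, $w$ the weights, and $\alpha$ the penalty parameter (50 in the code that follows). Setting $\alpha = 0$ removes the penalty and gives back ordinary least squares, i.e., plain Linear Regression.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```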

%LivyPy3.pyspark
###### Training the LASSO algorithm -- linear 

lasso = linear_model.Lasso(alpha=50)
lasso_train = lasso.fit(x_train, y_train)
print('lasso_train complete!')

print('\n lasso_train summary\n', lasso_train)

lasso_train.coef_ , lasso_train.intercept_

print('\n lasso_train.coef_ and lasso_train.intercept_\n', lasso_train.coef_, '\n', lasso_train.intercept_)

Output

lasso_train complete!

 lasso_train summary
 Lasso(alpha=50, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

 lasso_train.coef_ and lasso_train.intercept_
 [-0.00000000e+00  1.13367273e-01 -1.86165577e-02  3.68935985e+00
  8.31349235e-01  2.75292477e-03 -8.13115624e-04  6.88206445e-04
  1.31730816e-03] 
 [-209.36205307]
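To read these numbers, it helps to pair each coefficient with its feature name and to remember that a linear model predicts intercept + sum of (coefficient × feature value). The sketch below does both in plain Python, using the coefficients and intercept printed above and the column order from the Feature Engineering output. The helper predict_one and the sample row are hypothetical (the category codes in it are made up), so the printed prediction is for illustration only.

```python
feature_names = ['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
                 'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id']

# Values copied from the LASSO output above.
lasso_coef = [-0.00000000e+00, 1.13367273e-01, -1.86165577e-02, 3.68935985e+00,
              8.31349235e-01, 2.75292477e-03, -8.13115624e-04, 6.88206445e-04,
              1.31730816e-03]
lasso_intercept = -209.36205307

# Features whose weight LASSO shrank all the way to zero.
zeroed_features = [name for name, coef in zip(feature_names, lasso_coef) if coef == 0]
print('Zeroed features:', zeroed_features)

def predict_one(row):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    return lasso_intercept + sum(coef * row[name]
                                 for name, coef in zip(feature_names, lasso_coef))

# Hypothetical row: the category codes (order_type_id, commodity, ...) are invented.
sample_row = {'order_type_id': 3, 'bill_distance': 357.0, 'commodity': 10,
              'equipment_type_id': 5, 'freight_charge': 1250.0, 'weight': 5500.0,
              'pu_city': 100, 'del_city': 200, 'carrier_id': 300}
print('Illustrative prediction:', round(predict_one(sample_row), 2))
```

The only weight driven to exactly zero is the one for order_type_id, which means LASSO effectively dropped that feature from the model.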

Random Forest

Let’s also try an ensemble model to see whether it does better.

%LivyPy3.pyspark
###### Training the RANDOM FOREST algorithm - non linear
rf = RandomForestRegressor(n_estimators=20, random_state=1)  ###around 10 minutes
##rf = RandomForestRegressor(n_estimators=200, random_state=1)  ###around 10 minutes
rf_train = rf.fit(x_train, y_train)

print('rf_train complete!')
print('\n rf_train summary:\n', rf_train)
print('\n rf_train feature importances:\n', rf_train.feature_importances_)

##rf_train.coef_ , rf_train.intercept_
##print('\n rf_train.coef_ and rf_train.intercept_\n', rf_train.coef_, '\n', rf_train.intercept_)

Output


rf_train complete!

 rf_train summary:
 RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
           oob_score=False, random_state=1, verbose=0, warm_start=False)

 rf_train feature importances:
 [0.00234888 0.01285913 0.00740664 0.00392913 0.94589572 0.00968901
 0.00614    0.00543848 0.00629303]
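The importances are printed in the same order as the columns of x_train, so pairing them with the feature names makes the result easier to read. The sketch below copies the values from the output above and sorts them; it is plain Python and does not touch the fitted model.

```python
feature_names = ['order_type_id', 'bill_distance', 'commodity', 'equipment_type_id',
                 'freight_charge', 'weight', 'pu_city', 'del_city', 'carrier_id']

# Values copied from the rf_train.feature_importances_ output above.
importances = [0.00234888, 0.01285913, 0.00740664, 0.00392913, 0.94589572,
               0.00968901, 0.00614, 0.00543848, 0.00629303]

# Sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<20}{importance:.4f}")
```

freight_charge accounts for roughly 95% of the total importance, which suggests the forest relies on it almost exclusively.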

SVR (Support Vector Regression)

%LivyPy3.pyspark
###### Training the SUPPORT VECTOR REGRESSION algorithm - non linear

##svr = SVR(gamma='scale', C=1.0, epsilon=0.2)
svr = SVR() ## ETA 5 minutes
svr_train = svr.fit(x_train, y_train)

##print('svr_train complete!')
print('\n svr_train summary:\n', svr_train)

Output

svr_train summary:
 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Predicting the label using Test datasets

Now that the training is completed, we need to move to the next stage where we can run the model against the test dataset.

Testing against Linear Regression

%LivyPy3.pyspark

######################## Algorithm #1
### Do prediction on the test data using LR
y_pred_lr = lr.predict(x_test)

print('y_pred_lr complete!')

y_pred_lr complete!

Testing against LASSO

%LivyPy3.pyspark

###################### Algorithm #2
###### Do prediction on the test data using LASSO
y_pred_lasso = lasso.predict(x_test)

print('y_pred_lasso complete!')

y_pred_lasso complete!

Testing against Random Forest

%LivyPy3.pyspark
######################## Algorithm #3
###### Do prediction on the test data using RF
y_pred_rf = rf.predict(x_test)

print('y_pred_rf complete!')

y_pred_rf complete!

Testing against SVR (support vector regression)

%LivyPy3.pyspark
######################## Algorithm #4
###### Do prediction on the test data using SVR
y_pred_svr = svr.predict(x_test)

print('y_pred_svr complete!')

y_pred_svr complete!

Verifying the results

%LivyPy3.pyspark
#### verifying the results


##df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
##df1 = df.head(25)
##len(y_pred_lr)
##results = [y_test, y_pred]
##y_test[100],  y_pred[100]
##y_test[3]
##y_test_copy = y_test.copy()
##y_test_appended = y_test_copy.append(y_pred_lr, ignore_index=True)
y_test

print('verifying the results!')
print('y_pred_lasso - ', y_pred_lasso[103])
print('y_pred_lr - ', y_pred_lr[103])
print('y_pred_rf - ', y_pred_rf[103])
print('y_pred_svr - ', y_pred_svr[103])
print('y_test - ', y_test[103:104])


verifying the results!
y_pred_lasso -  3300.649436439301
y_pred_lr -  [3287.63222526]
y_pred_rf -  3248.15
y_pred_svr -  1085.4400155267074
y_test -       carrier_base_pay
576            3200.0
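A quick way to compare the four predictions for this one record is to compute each model's absolute error against the actual value. The numbers below are copied from the output above; the dictionary and variable names are my own.

```python
# Actual label and the four predictions for the same test record.
actual = 3200.0
predictions = {
    'lasso': 3300.649436439301,
    'lr': 3287.63222526,
    'rf': 3248.15,
    'svr': 1085.4400155267074,
}

# Absolute error per model, printed from smallest to largest.
abs_errors = {model: abs(pred - actual) for model, pred in predictions.items()}
for model, err in sorted(abs_errors.items(), key=lambda item: item[1]):
    print(f"{model:<6}{err:10.2f}")
```

A single record proves nothing on its own, which is why the next section aggregates the errors over the whole test set.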

Calculating Metrics

There are a couple of ways to understand how well our models perform. One way is to calculate error metrics, i.e., Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.

%LivyPy3.pyspark
### Calculating Metrics LR
print('### Calculating Metrics - LR')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_lr))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_lr))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_lr)))


### Calculating Metrics LASSO
print('\n### Calculating Metrics - LASSO')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_lasso))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_lasso))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_lasso)))


### Calculating Metrics RF
print('\n### Calculating Metrics - RF')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_rf))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_rf))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rf)))

### Calculating Metrics SVR
print('\n### Calculating Metrics - SVR')
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_svr))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_svr))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_svr)))

### Calculating Metrics - LR
Mean Absolute Error: 196.5480533740987
Mean Squared Error: 105866.09745638819
Root Mean Squared Error: 325.3707077417821

### Calculating Metrics - LASSO
Mean Absolute Error: 195.7809924834218
Mean Squared Error: 105562.7762233446
Root Mean Squared Error: 324.90425701019154

### Calculating Metrics - RF
Mean Absolute Error: 199.25110190502312
Mean Squared Error: 110547.55835529094
Root Mean Squared Error: 332.48692960068513

### Calculating Metrics - SVR
Mean Absolute Error: 996.5762893287749
Mean Squared Error: 1933387.4782769545
Root Mean Squared Error: 1390.4630445563646
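For readers who want to see what these three metrics compute, here is a small plain-Python sketch. The helper functions and the three sample values are hypothetical; on real data the sklearn functions used above should be preferred.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; large errors weigh more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values.
y_true = [3200.0, 900.0, 1300.0]
y_pred = [3300.0, 850.0, 1300.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)  # same unit as the label, unlike MSE
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
```

Because RMSE is just the square root of MSE, the two always rank the models in the same order.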

Calculating R^2 score

%LivyPy3.pyspark

###### R^2 score
##### The coefficient of determination R^2 of the prediction.
#### MORE INFO - The best possible score is 1.0 and it can be negative 
## (because the model can be arbitrarily worse). 
#### A constant model that always predicts the expected value of y, 
## disregarding the input features, would get a R^2 score of 0.0.

print('R^2 score for Multiple LR Algorithm: ', lr.score(x_test, y_test)) 
print('R^2 score for LASSO Algorithm: ', lasso.score(x_test, y_test))
print('R^2 score for RF Algorithm: ', rf.score(x_test, y_test))
print('R^2 score for SVR Algorithm: ', svr.score(x_test, y_test))

R^2 score for Multiple LR Algorithm:  0.9402042234156325
R^2 score for LASSO Algorithm:  0.9403755467110041
R^2 score for RF Algorithm:  0.9375600191167579
R^2 score for SVR Algorithm:  -0.09202481700704723
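The score() method of these regressors returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. A minimal plain-Python sketch of that formula follows; the helper r2_score_manual and the sample values are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual values.
y_true = [900.0, 1300.0, 3200.0, 4600.0]
mean_value = sum(y_true) / len(y_true)

perfect = r2_score_manual(y_true, y_true)                         # predictions equal the actuals
constant = r2_score_manual(y_true, [mean_value] * len(y_true))    # always predict the mean
worse = r2_score_manual(y_true, [0.0] * len(y_true))              # always predict zero

print('perfect:', perfect, 'constant:', constant, 'worse:', worse)
```

A negative score, like the one SVR got above, means the model did worse on the test set than always predicting the mean of the actual values.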

Final Conclusion

Linear Regression and LASSO performed best. The higher the R^2 score, the better the algorithm, and LASSO came out on top with an R^2 score of 0.9403. The error metrics point to the same result.

Hope you enjoyed the long post! Thanks.
