
NOVEMBER 19, 2019

Demystifying H2O.ai | Using Python

 

H2O provides an interface for Python developers to interact with an H2O cluster. Through the Python API, we can invoke the Java code that performs the actual machine learning inside the cluster.

Installation:

Installing H2O:

1. Download H2O. This is a zip file that contains everything you need to get started.

2. From your terminal, run:


cd ~/Downloads
unzip h2o-3.22.1.6.zip
cd h2o-3.22.1.6
java -jar h2o.jar

3. Point your browser to http://localhost:54321

Installing H2O Python API:

1. Prerequisite: Python 2.7.x, 3.5.x, or 3.6.x

2. Install dependencies (prepending with `sudo` if needed):



pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future

3. At the command line, copy and paste these commands one line at a time:



# The following command removes the H2O module for Python.
pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-xu/6/Python/h2o-3.22.1.6-py2.py3-none-any.whl

Loading Data to an H2O Dataframe

The example we are going to look at is a regression problem: predicting the quality of white wine from numeric attributes.

You can download the dataset in a CSV file from:


https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

Let’s see the different methods of loading data into an H2O data frame.

Note: An H2O data frame is similar to a Pandas or R data frame, but it resides in the H2O cluster’s shared memory. Because it lives in memory, operations on it are fast.

Method 1: Loading CSV from a local file.


# Method 1 - Import data from a local CSV file
data_from_csv = h2o.import_file("winequality-white.csv")
data_from_csv.head(5)

Method 2: Loading data from any external web source


# Method 2 - Import data from the web
data_from_web = h2o.import_file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv")
data_from_web.head(5)

Method 3: Loading data into H2O frame from a Pandas data frame


# Method 3 - Convert Python data frame into H2O data frame
## Import Wine Quality data using Pandas
import pandas as pd
wine_df = pd.read_csv('winequality-white.csv', sep = ';')
wine_df.head(5)
## Convert Pandas data frame into H2O data frame
data_from_df = h2o.H2OFrame(wine_df)
data_from_df.head(5)

Basic Operations on Dataframe

Now that we have seen different methods to load the dataset, let’s continue with our example.

wine = h2o.import_file("winequality-white.csv")

Note: the “wine” data frame is not stored in Python memory; it resides in H2O cluster memory, and only a pointer to the data is held in Python.

Let’s have a look at all the features we are dealing with. Our target feature is ‘quality’.


# Define features (or predictors)
features = list(wine.columns)  # we want to use all the information
features.remove('quality')     # exclude the target 'quality'
                               # (otherwise there is nothing to predict)
features
Output:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

To check model performance, we can split the data into train and test sets using the split_frame() function.


# Split the H2O data frame into training/test sets
# so we can evaluate out-of-bag performance
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)
wine_train = wine_split[0]  # using 80% for training
wine_test = wine_split[1]   # using the remaining 20% for out-of-bag evaluation

wine_train.shape, wine_test.shape


(3932, 12) (966, 12)

Building a Model on H2O platform

Now that our data is nicely split, it is time for some modeling. I chose to use GBM (Gradient Boosting Machine) for this example, but you can use any of the algorithms supported by the H2O platform.

Let’s first try a GBM model with default settings, to get a baseline to compare against after optimization.


# Build a Gradient Boosting Machines (GBM) model with default settings
# Import the function for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator
# Set up GBM for regression
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', seed = 1234)
# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)
# Check the GBM model summary
gbm_default
--------------------------------------------------------------------
Output:
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_default


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.33754946668043595
RMSE: 0.5809900745111193
MAE: 0.4582897982543992
RMSLE: 0.0859869651179757
Mean Residual Deviance: 0.33754946668043595

As you can see in the output, H2O automatically reports a set of metrics tailored to regression problems. The output also contains the scoring history and feature importances, which I omit here for space.

We can check the model performance on the data which the model hasn't seen.


# Check the model performance on test dataset
gbm_default.model_performance(wine_test)
--------------------------------------------------------------------
Output:
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.4569904494269438
RMSE: 0.6760106873614823
MAE: 0.5248612169030329
RMSLE: 0.10032043310648843
Mean Residual Deviance: 0.4569904494269438

To make predictions on new data, we can call the model’s predict() function, passing the new data.


# Use GBM model to make predictions
yhat_test_gbm = gbm_default.predict(wine_test)
yhat_test_gbm.head(5)
--------------------------------------------------------------------
Output:
gbm prediction progress: |████████████████████████████████████████████████| 100%
predict
5.78661
5.96088
5.32867
6.19424
5.7198

Setting the Hyperparameters of the algorithm

Let’s try to improve our model. One way is to set the algorithm’s parameters manually. The following are some of the settings that can be tuned; I won’t explain the theory behind each, as this post is focused on H2O.

If you want to learn more about improving model performance and tuning the hyper-parameters of the GBM algorithm, you can refer to the following article: Hyperparameter Tuning in Gradient Boosting Machine (in Python)


# increase the number of trees for more accuracy
ntrees = 10000,
# random row sampling for better generalization
sample_rate = 0.9,
# random column sampling for better generalization
col_sample_rate = 0.9,
# add cross-validation
nfolds = 5,
# early stopping: let it determine the optimal number of trees
stopping_metric = 'mse',
stopping_rounds = 15,
score_tree_interval = 1

Let’s look at the model instantiation function call with these settings.


# Build a GBM with manual settings, CV and early stopping
# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual_cv_es = H2OGradientBoostingEstimator(
                                      model_id = 'gbm_manual_cv_es', 
                                       seed = 1234,
                                       ntrees = 10000,
                                       sample_rate = 0.9,
                                       col_sample_rate = 0.9,
                                       nfolds = 5,
                                       stopping_metric = 'mse',
                                       stopping_rounds = 15,
                                       score_tree_interval = 1) 
# Use .train() to build the model
gbm_manual_cv_es.train(x = features, 
                       y = 'quality', 
                       training_frame = wine_train)
# Check the model summary
gbm_manual_cv_es.summary()
# Check the cross-validation model performance
gbm_manual_cv_es
--------------------------------------------------------------------
Output:
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_manual_cv_es


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.20686547419217588
RMSE: 0.4548246631309431
MAE: 0.34894778424095163
RMSLE: 0.06741983008017692
Mean Residual Deviance: 0.20686547419217588

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.43719485835036376
RMSE: 0.6612071221261638
MAE: 0.5071563697468089
RMSLE: 0.09876420394757868
Mean Residual Deviance: 0.43719485835036376

Let’s check the performance on the Test set.


# Check the model performance on test dataset
gbm_manual_cv_es.model_performance(wine_test)
--------------------------------------------------------------------
Output:
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.426937189802319
RMSE: 0.6534043080683805
MAE: 0.49245882283169323
RMSLE: 0.09727407043431956
Mean Residual Deviance: 0.426937189802319

The metrics clearly show the improvement in the model’s performance.

But how do we find the best values for these settings?

Grid Search and Random Search in H2O

One way is to define a set of candidate values for each setting, run every combination, and compare the results to find the best set of values. This is called “Grid Search”. If there are too many settings and too many combinations to run exhaustively, we can instead cap the number of randomly chosen combinations to stay within computational constraints. This is called “Random Grid Search”. Let’s look at how to do both in H2O.
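The difference between the two strategies can be illustrated in plain Python, independent of H2O (the parameter grid mirrors the one used later in this post):

```python
import itertools
import random

hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

# Grid search ("Cartesian"): evaluate every combination
keys = list(hyper_params)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*hyper_params.values())]
print(len(all_combos))   # 27 combinations

# Random grid search ("RandomDiscrete"): evaluate a capped random subset
random.seed(1234)
sampled = random.sample(all_combos, 9)   # max_models = 9
print(len(sampled))      # 9 combinations
```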

Grid Search:

First set the search criteria and define the set of values for hyperparameters.

Note: Settings, parameters, and hyper-parameters are different names for the same thing in this context.



# define the criteria for full grid search
search_criteria = {'strategy': "Cartesian"}
# define the range of hyper-parameters for grid search
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9]}

Let’s look at the Grid search function call.


# Import the grid search class
from h2o.grid.grid_search import H2OGridSearch
# Set up GBM grid search
# Add a seed for reproducibility
gbm_full_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_full_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)
# Use .train() to start the grid search
gbm_full_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)

Now to get the best model with the best set of hyper-parameters.


# Extract the best model from full grid search
gbm_full_grid_sorted = gbm_full_grid.get_grid(sort_by='mse', decreasing=False)
best_model_id = gbm_full_grid_sorted.model_ids[0]
best_gbm_from_full_grid = h2o.get_model(best_model_id)
best_gbm_from_full_grid.summary()


# Check the model performance on test dataset
best_gbm_from_full_grid.model_performance(wine_test)
--------------------------------------------------------------------
Output:
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.4225488547806245
RMSE: 0.6500375795141574
MAE: 0.49083490901261195
RMSLE: 0.09684966383616216
Mean Residual Deviance: 0.4225488547806245
Random Grid Search:

The only change needed to turn a grid search into a random grid search is to set the search strategy to “RandomDiscrete” and add a max_models entry to the search criteria.


# define the criteria for random grid search
search_criteria = {'strategy': "RandomDiscrete",
                   'max_models': 9,
                   'seed': 1234}

After this you can add more settings and more combinations.


# define the range of hyper-parameters for grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],  
                'col_sample_rate': [0.7, 0.8, 0.9],     
                'max_depth': [3, 5, 7]}

After this step, it is pretty much the same code seen above. Let's look at the results for this.


# Check the model performance on test dataset
best_gbm_from_rand_grid.model_performance(wine_test)
--------------------------------------------------------------------
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.38887438717699785
RMSE: 0.6235979371173367
MAE: 0.45221978663123497
RMSLE: 0.09308978440162519
Mean Residual Deviance: 0.38887438717699785

Clearly, this is a great improvement over the base model’s performance. H2O provides more such optimization features to improve model performance at scale.

Stacking in H2O

H2O also provides a signature feature for stacking different models to achieve state-of-the-art results. This is especially helpful for winning Kaggle competitions.

Stacking Ensembles:

First, let’s build the best models using different estimators. In this example, I create the best GBM, DRF, and DNN models using a random grid search. Note that stacking requires each base model to keep its cross-validation predictions (keep_cross_validation_predictions = True) with a consistent fold assignment (fold_assignment = "Modulo").

GBM model:

# define the range of hyper-parameters for GBM grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        fold_assignment = "Modulo",               
                        keep_cross_validation_predictions = True, 
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='mse', decreasing=False)

# Extract the best model from random grid search
best_gbm_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_gbm_model_id)
DRF Model:

# define the range of hyper-parameters for DRF grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid = H2OGridSearch(
                    H2ORandomForestEstimator(
                        model_id = 'drf_rand_grid', 
                        seed = 1234,
                        ntrees = 200,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                 
                        keep_cross_validation_predictions = True),  
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)
# Use .train() to start the grid search
drf_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)
drf_rand_grid_sorted = drf_rand_grid.get_grid(sort_by='mse', decreasing=False)

# Extract the best model from random grid search
best_drf_model_id = drf_rand_grid_sorted.model_ids[0]
best_drf_from_rand_grid = h2o.get_model(best_drf_model_id)

DNN Model:


# define the range of hyper-parameters for DNN grid search
# 81 combinations in total
hyper_params = {'activation': ['tanh', 'rectifier', 'maxout'],
                'hidden': [[50], [50,50], [50,50,50]],
                'l1': [0, 1e-3, 1e-5],
                'l2': [0, 1e-3, 1e-5]}
# Set up DNN grid search
# Add a seed for reproducibility
dnn_rand_grid = H2OGridSearch(
                    H2ODeepLearningEstimator(
                        model_id = 'dnn_rand_grid', 
                        seed = 1234,
                        epochs = 20,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                
                        keep_cross_validation_predictions = True), 
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)
# Use .train() to start the grid search
dnn_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)
dnn_rand_grid_sorted = dnn_rand_grid.get_grid(sort_by='mse', decreasing=False)
# Extract the best model from random grid search
best_dnn_model_id = dnn_rand_grid_sorted.model_ids[0]
best_dnn_from_rand_grid = h2o.get_model(best_dnn_model_id)
Model Stacking:

# Import the stacked ensemble estimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
# Define a list of models to be stacked
# i.e. best model from each grid
all_ids = [best_gbm_model_id, best_drf_model_id, best_dnn_model_id]
# Set up the stacked ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "my_ensemble",
                                       base_models = all_ids)
# use .train to start model stacking
# GLM as the default metalearner
ensemble.train(x = features, 
               y = 'quality', 
               training_frame = wine_train)
ensemble.model_performance(wine_test)
--------------------------------------------------------------------
Output:
Stacked Ensembles        (MSE) :  0.39948493548786057

The above example covered a regression problem extensively, but the code for classification is pretty much the same. H2O auto-detects whether a problem is regression or classification based on the type of the target variable: if it is a categorical (factor) column, H2O invokes classification; if it is numeric, it invokes regression. Specifying binary or multi-class classification is a mere change of the family attribute in the instantiation call.


# Set up GLM for binary classification
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
glm_default = H2OGeneralizedLinearEstimator(family = 'binomial', model_id = 'glm_default')

This post gives a clear picture of how easy it is to use the Python API to invoke H2O, and how much sophistication H2O provides for machine learning work.

WRITTEN BY

Rehan Ahmad

AI Expert at wavelabs.ai
