H2O.ai: An Overview
NOVEMBER 12, 2019
Demystifying H2O.ai | An Overview
H2O.ai is an open source machine learning platform that has been gaining a lot of traction lately, and for good reasons.
The crux of the H2O platform is distributed in-memory computing. This means that all the computation, the data, and everything else involved in machine learning lives in the distributed memory of the H2O cluster itself.
You can think of a cluster as a group of nodes sharing memory and computation. A node could be a server, an EC2 instance, or your laptop. This gives H2O an edge over the traditional approach of doing machine learning on a single instance with everything loaded into Python memory. H2O's core is written in Java, which gives it an additional boost in speed. Together, these features make machine learning faster and make H2O well suited to processing large amounts of data.
Let us look at different aspects of H2O.
Data imports to H2O:
H2O can import data from many common sources, such as the local file system, remote files, SQL databases, S3, HDFS, JDBC, and Hive.
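As a rough sketch of what importing looks like from Python, assuming the h2o package is installed and a cluster is reachable: h2o.import_file reads a path the cluster can see, and h2o.import_sql_table pulls a table over JDBC. The function name import_examples, the file path, the JDBC URL, the table name, and the credentials are placeholders for illustration, not values from this article.

```python
def import_examples(file_path, jdbc_url=None, table=None, user=None, password=None):
    """Load data into the H2O cluster and return the resulting frame handles."""
    import h2o  # imported lazily so the sketch can be read without h2o installed

    h2o.init()  # start a local H2O instance or attach to one already running

    # A local path, HTTP URL, S3 URI or HDFS URI visible to the cluster.
    frames = {"file": h2o.import_file(path=file_path)}

    # Optionally pull a table over JDBC (the driver must be on H2O's classpath).
    if jdbc_url and table:
        frames["sql"] = h2o.import_sql_table(jdbc_url, table, user, password)
    return frames


# Hypothetical usage; the path is a placeholder:
# frames = import_examples("s3://my-bucket/data/train.csv")
```

The returned objects are H2OFrame handles; the data itself stays on the cluster.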
Interface support in H2O:
H2O offers interfaces for languages such as Scala, Python, R, and more. This means we can write code in the language we know; the framework takes care of translating it into Java code to run on the cluster and returns the results in the language in which we wrote the code.
Algorithmic Support in H2O:
The H2O framework supports a range of algorithms, carefully optimized to make the most of the underlying distributed framework. Supported algorithms: Distributed Random Forest, Linear Regression, Logistic Regression, XGBoost, Gradient Boosting Machine, Deep Learning (multilayer perceptron with backpropagation and stochastic gradient descent), K-means Clustering, PCA, Naïve Bayes classifier, and word2vec. It also supports stacking and ensembles to get the most out of individual algorithms working together.
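To make this concrete, here is a minimal sketch of training one of the listed algorithms (Gradient Boosting Machine) from Python. It assumes the data is already loaded as an H2OFrame; the helper name train_gbm and the default values for ntrees and seed are illustrative choices, not recommendations from this article.

```python
def train_gbm(train_frame, target, predictors=None, ntrees=50, seed=42):
    """Train a Gradient Boosting Machine on an H2OFrame and return the model."""
    # Imported lazily so the sketch can be read without h2o installed.
    from h2o.estimators import H2OGradientBoostingEstimator

    if predictors is None:
        # Use every column except the target as a predictor.
        predictors = [c for c in train_frame.columns if c != target]

    model = H2OGradientBoostingEstimator(ntrees=ntrees, seed=seed)
    model.train(x=predictors, y=target, training_frame=train_frame)
    return model
```

For a classification problem, the target column should be converted to a categorical first, for example with train_frame[target] = train_frame[target].asfactor(); otherwise H2O treats a numeric target as regression.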
Data Manipulation in H2O Frames (like Pandas and R):
Supported data manipulations in H2O Frames are Combining Columns from Two Datasets, Combining Rows from Two Datasets, Filling NAs, Group By, Imputing Data, Merging Two Datasets, Pivoting Tables, Replacing Values in a Frame, Slicing Columns, Slicing Rows, Sorting Columns, Splitting Datasets into Training/Testing/Validating, and Target Encoding.
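A few of these manipulations can be sketched as follows. The function assumes it receives an H2OFrame; the name prepare_frames, the column arguments, and the 70/15/15 split are hypothetical choices for illustration only.

```python
def prepare_frames(frame, group_col, value_col, seed=42):
    """Apply a few common H2OFrame manipulations and return the pieces."""
    # Slice columns by name; the result is another frame held on the cluster.
    subset = frame[[group_col, value_col]]

    # Group by one column and aggregate: row count and mean per group.
    summary = subset.group_by(group_col).count().mean(value_col).get_frame()

    # Split rows into roughly 70% / 15% / 15% for training, validation, testing.
    train, valid, test = subset.split_frame(ratios=[0.7, 0.15], seed=seed)
    return summary, train, valid, test
```

Each step runs on the cluster; Python only receives handles to the resulting frames.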
Supported metrics in H2O:
Metrics are auto-detected based on the type of machine learning problem we are dealing with (Regression or classification). At the end of training, it provides results with a bunch of metrics to measure the model performance.
For Regression based problems: R2 Score, RMSE, MSE, MAE, etc.
For Classification Problems: AUC, Log loss, Accuracy, F1, F2 Score, Gini Coefficient, Confusion Matrix.
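For readers who want to see what some of these numbers mean, the sketch below computes a few of them from scratch in plain Python. It only illustrates the standard definitions; it is not H2O's implementation, and the function names are made up for this example.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R2 for two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": 1.0 - ss_res / ss_tot}


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion-matrix counts (beta=1 gives F1, beta=2 gives F2)."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
```

With H2O itself there is no need to compute these by hand: the trained model reports them based on the problem type.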
H2O’s AutoML feature:
H2O can automatically search for the best model for a given dataset. The AutoML feature requires the pandas module to be installed. It can be invoked by simply calling H2OAutoML(). The results can be viewed in a TensorBoard-like dashboard.
At the end of the modeling phase, H2O provides functionality to save models as POJOs or MOJOs.
POJO: Plain Old Java Object. MOJO: Model Object, Optimized.
These objects can be used for predictions in any production environment with Java installed, by writing wrapper classes around them.
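A minimal AutoML run might look like the sketch below. It assumes an H2OFrame is already available; the helper name run_automl and the limits (10 models, 300 seconds) are arbitrary illustrative values.

```python
def run_automl(train_frame, target, max_models=10, max_runtime_secs=300, seed=1):
    """Run H2O AutoML and return the best model together with the leaderboard."""
    # Imported lazily so the sketch can be read without h2o installed.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(max_models=max_models, max_runtime_secs=max_runtime_secs, seed=seed)
    # With x omitted, every column except the target is used as a predictor.
    aml.train(y=target, training_frame=train_frame)
    return aml.leader, aml.leaderboard
```

The leaderboard is itself an H2OFrame that ranks the trained models by a default metric, and the leader is the model at the top of it.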
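As a small sketch of the export step from Python, a trained H2O model exposes download_mojo, which writes the MOJO as a zip file. The helper name export_mojo and the default output directory are assumptions for illustration.

```python
def export_mojo(model, out_dir="."):
    """Write the model's MOJO zip under out_dir and return the written file path."""
    # get_genmodel_jar=True also downloads h2o-genmodel.jar, which the Java
    # wrapper code needs on its classpath to score with the MOJO.
    return model.download_mojo(path=out_dir, get_genmodel_jar=True)
```

The Java side then loads the zip with the classes shipped in h2o-genmodel.jar; that wrapper code is not shown here.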
Requirements to support H2O:
Operating system: Windows 7 or later, OS X 10.9 or later, Ubuntu 12.04 or later, or CentOS 6 or later. Java 7 or later is required to install and run H2O.
Setting up an H2O Multi-node cluster:
To download H2O, including the .jar, go to the H2O downloads page and choose the version that is right for your environment.
Make sure the same h2o.jar file is available on every host.
The best way to get multiple H2O nodes to find each other is to provide a flat file which lists the set of nodes.
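Create a flatfile.txt with the IP and port for each H2O instance. Put one entry per line, in the form ip:port. For example, the node that produces the sample output later in this section would be listed as 192.168.1.163:54321.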
(Note that the -flatfile option tells one H2O node where to find the others. It is not a substitute for the -ip and -port specification.)
Copy the flatfile.txt to each node in your cluster.
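If the node list is kept in a script, the flat file can be generated rather than typed by hand. The sketch below is plain Python and writes one ip:port entry per line; the function name write_flatfile is made up, and the second address in the usage comment is a hypothetical host.

```python
from pathlib import Path


def write_flatfile(nodes, path="flatfile.txt"):
    """Write one 'ip:port' entry per line and return the text that was written."""
    lines = [f"{ip}:{port}" for ip, port in nodes]
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text)
    return text


# Hypothetical two-node cluster; replace with your own hosts:
# write_flatfile([("192.168.1.163", 54321), ("192.168.1.164", 54321)])
```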
The -Xmx option on the java command line specifies the amount of memory allocated to one H2O node. The cluster's memory capacity is the sum across all H2O nodes in the cluster.
For example, if a user creates a cluster with four 20 GB nodes (by specifying -Xmx20g), H2O will have a total of 80 GB of memory available.
For best performance, we recommend sizing your cluster to be about four times the size of your data (but to avoid swapping, -Xmx must not be larger than the physical memory on any given node). Giving all nodes the same amount of memory is strongly recommended (H2O works best with symmetric nodes).
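The sizing rules above reduce to simple arithmetic, sketched here in plain Python. The function names are invented for this example, and the factor of four is the rule of thumb from the text, not a hard limit.

```python
import math


def cluster_memory_gb(num_nodes, xmx_gb):
    """Total cluster memory: the sum of -Xmx across symmetric nodes."""
    return num_nodes * xmx_gb


def nodes_needed(data_gb, xmx_gb, factor=4):
    """Nodes required so cluster memory is about `factor` times the data size."""
    return math.ceil(factor * data_gb / xmx_gb)


def xmx_is_safe(xmx_gb, physical_gb):
    """-Xmx must not exceed a node's physical memory, to avoid swapping."""
    return xmx_gb <= physical_gb
```

For instance, cluster_memory_gb(4, 20) gives 80, matching the four-node example, and a 50 GB dataset with 20 GB nodes would call for nodes_needed(50, 20), which is 10 nodes.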
Note the optional -ip (not shown in the example below) and -port options tell this H2O node what IP address and ports (port and port+1 are used) to bind to. The -ip option is especially helpful for hosts that have multiple network interfaces.
$ java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321
You will see output similar to the following:
08:35:33.553 main INFO WATER: ----- H2O started -----
08:35:33.555 main INFO WATER: Build git branch: master
08:35:33.555 main INFO WATER: Build git hash: f253798433c109b19acd14cb973b45f255c59f3f
08:35:33.555 main INFO WATER: Build git describe: f253798
08:35:33.555 main INFO WATER: Build project version: 22.214.171.1240
08:35:33.555 main INFO WATER: Built by: 'jenkins'
08:35:33.555 main INFO WATER: Built on: 'Thu Sep 12 00:01:52 PDT 2013'
08:35:33.556 main INFO WATER: Java availableProcessors: 32
08:35:33.558 main INFO WATER: Java heap totalMemory: 1.92 gb
08:35:33.559 main INFO WATER: Java heap maxMemory: 17.78 gb
08:35:33.559 main INFO WATER: ICE root: '/tmp/h2o-tomk'
08:35:33.580 main INFO WATER: Internal communication uses port: 54322
+ Listening for HTTP and REST traffic on http://192.168.1.163:54321/
08:35:33.613 main INFO WATER: H2O cloud name: 'MyClusterName'
08:35:33.613 main INFO WATER: (v126.96.36.1990) 'MyClusterName' on /192.168.1.163:54321, static configuration based on -flatfile flatfile.txt
08:35:33.615 main INFO WATER: Cloud of size 1 formed [/192.168.1.163:54321]
08:35:33.747 main INFO WATER: Log dir: '/tmp/h2o-tomk/h2ologs'
As you add more nodes to your cluster, the H2O output will inform you:
INFO WATER: Cloud of size 2 formed [/...]...
Optionally, access the H2O Web UI with your browser: point it to the HTTP link given by “Listening for HTTP and REST traffic on…” in the H2O output.
How to connect to the H2O multi-node cluster:
Once a cluster is created using the above steps, get the IP address and port number from the output.
INFO WATER: Internal communication uses port: 54322
+ Listening for HTTP and REST traffic on http://192.168.1.163:54321/
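If the startup output is captured to a file or variable, the address and the current cluster size can be pulled out with a small parser. This is a plain-Python sketch that relies only on the two log phrases shown in this article; the function names are made up, and other H2O versions may word their logs differently.

```python
import re
from urllib.parse import urlparse

LISTEN_RE = re.compile(r"Listening for HTTP and REST traffic on\s+(\S+)")
CLOUD_RE = re.compile(r"Cloud of size (\d+) formed")


def rest_endpoint(log_text):
    """Return (ip, port) from the 'Listening for HTTP and REST traffic' line."""
    match = LISTEN_RE.search(log_text)
    if match is None:
        return None
    url = urlparse(match.group(1))
    return url.hostname, url.port


def cloud_size(log_text):
    """Return the most recently reported cluster size, or None if absent."""
    sizes = [int(n) for n in CLOUD_RE.findall(log_text)]
    return sizes[-1] if sizes else None
```

The returned IP and port are the values to pass when connecting from Python.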
Step 1: From your Python script or notebook, import the H2O module
# Import module
import h2o
Step 2: Connect to the H2O cluster with its IP address
# To connect to an H2O cluster in the cloud, specify its IP address
h2o.connect(ip = "xxx.xxx.xxx.xxx") # fill in the real IP
That is all we need to do. Now we can leverage the cluster's speed and computing power from a Python notebook on any local machine.
Note: The data and the manipulations done in the notebook are not actually stored in Python memory but on the cluster itself (distributed among nodes); the notebook only holds a pointer that refers to the data.
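The pointer idea can be seen directly from Python. The sketch below assumes a reachable cluster and an installed pandas; the function name show_remote_frame, the default port, and the toy columns x and y are illustrative only.

```python
def show_remote_frame(ip, port=54321):
    """Upload a tiny table to the cluster and inspect the handle Python holds."""
    import h2o  # imported lazily so the sketch can be read without h2o installed

    h2o.connect(ip=ip, port=port)

    # The data is sent to the cluster; `frame` is only a reference to it.
    frame = h2o.H2OFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print("Frame key on the cluster:", frame.frame_id)
    print("Rows, columns:", frame.dim)

    # Only this call copies the data back into local Python memory (as pandas).
    return frame.as_data_frame()
```

Until as_data_frame is called, operations on the frame are executed on the cluster and only small summaries travel back to the notebook.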
Follow the rest of the series to learn more about the H2O platform.