Tutte (184.108.40.206) / January 12, 2017
|Written in||Java, Python, and R|
|Operating system||Linux, macOS, and Microsoft Windows|
|Platform||Apache Hadoop Distributed File System; Amazon EC2, Google Compute Engine, and Microsoft Azure.|
|Standard(s)||Databricks certified on Spark.|
|Type||big data analytics, machine learning, statistical learning theory|
|License||Apache license 2.0|
|As of||1 June 2015|
H2O is open-source software for big-data analysis. It is produced by the company H2O.ai (formerly 0xdata), which launched in 2011 in Silicon Valley. H2O allows users to fit thousands of potential models as part of discovering patterns in data.
H2O's mathematical core is developed under the leadership of Arno Candel, who was named one of Fortune's 2014 "Big Data All Stars". The firm's scientific advisors are experts on statistical learning theory and mathematical optimization.
The H2O software can be called from the statistical package R, Python, and other environments. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache Hadoop Distributed File System, as well as on the conventional operating systems Linux, macOS, and Microsoft Windows. The H2O software is written in Java, Python, and R. Its graphical user interface is compatible with four browsers: Chrome, Safari, Firefox, and Internet Explorer.
The H2O project aims to develop an analytical interface for cloud computing, providing users with tools for data analysis.
H2O's chief executive, SriSatish Ambati, helped to start Platfora, a big-data firm that develops software for the Apache Hadoop distributed file system. Ambati was frustrated with the performance of the R programming language on large datasets and started the development of the H2O software with encouragement from John Chambers, who created the S programming language at Bell Labs and who is a member of R's core team (which leads the development of R).
Ambati co-founded 0xdata with Cliff Click, who served as the chief technical officer of H2O and helped create much of H2O's product. Click helped to write the HotSpot Server Compiler and worked with Azul Systems to construct a big-data Java virtual machine (JVM). Click left H2O in February 2016. Leland Wilkinson, author of The Grammar of Graphics, serves as Chief Scientist and provides visualization leadership.
H2O's Scientific Advisory Council lists three mathematical scientists, who are all professors at Stanford University: Professor Stephen P. Boyd is an expert in convex minimization and applications in statistics and electrical engineering. Robert Tibshirani, a collaborator with Bradley Efron on bootstrapping, is an expert on generalized additive models and statistical learning theory. Trevor Hastie, a collaborator of John Chambers on S, is an expert on generalized additive models and statistical learning theory.
The software is open-source and freely distributed. The company receives fees for providing customer service and customized extensions. In November 2014, its twenty clients included Cisco, eBay, Nielsen, and PayPal, according to VentureBeat.
Very large datasets can be too large to analyze with traditional software such as R. The H2O software provides data structures and methods suitable for big data. H2O allows users to analyze and visualize whole sets of data without using the Procrustean strategy of studying only a small subset with a conventional statistical package. H2O's statistical algorithms include K-means clustering, generalized linear models, distributed random forests, gradient boosting machines, naive Bayes, principal component analysis, and generalized low rank models.
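The idea of analyzing a whole dataset in pieces, rather than a single small subset, can be illustrated with a minimal sketch. The Python example below is not H2O code and does not use the H2O API; the names `Stats`, `chunk_stats`, `fit_line`, and `demo_chunks` are hypothetical. It fits a one-variable least-squares line by summarizing each chunk of rows separately and then merging the summaries, so that no single step needs every row at once.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[float, float]  # (x, y)


@dataclass
class Stats:
    """Sufficient statistics for the least-squares line y = a + b*x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        # Summaries from different chunks combine by simple addition.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def chunk_stats(chunk: Sequence[Row]) -> Stats:
    """Summarize one chunk; each chunk could be handled by a different worker."""
    s = Stats()
    for x, y in chunk:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit_line(chunks: Iterable[Sequence[Row]]) -> Tuple[float, float]:
    """Return (intercept, slope) computed from every row in every chunk."""
    total = Stats()
    for chunk in chunks:
        total = total.merge(chunk_stats(chunk))
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope


def demo_chunks(n_rows: int = 1000, chunk_size: int = 100) -> List[List[Row]]:
    """Synthetic rows on the line y = 1 + 2x, split into fixed-size chunks."""
    rows = [(float(i), 1.0 + 2.0 * i) for i in range(n_rows)]
    return [rows[i:i + chunk_size] for i in range(0, n_rows, chunk_size)]


if __name__ == "__main__":
    print(fit_line(demo_chunks()))
```

Because the per-chunk summaries are additive, the fitted line does not depend on how the rows are split into chunks.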
H2O is also able to run on Spark.
H2O uses iterative methods that provide quick answers using all of the client's data. When a client cannot wait for an optimal solution, the client can interrupt the computations and use an approximate solution. In its approach to deep learning, H2O divides all the data into subsets and then analyzes each subset simultaneously using the same method. These processes are combined to estimate parameters by using the Hogwild scheme, a parallel stochastic gradient method. These methods allow H2O to provide answers that use all the client's data, rather than throwing away most of it and analyzing a subset with conventional software.
The H2O software can be run on conventional operating systems: Microsoft Windows (7 or later), Mac OS X (10.9 or later), and Linux (Ubuntu 12.04; RHEL/CentOS 6 or later). It also runs on big-data systems, particularly the Apache Hadoop Distributed File System (HDFS), in several popular distributions: Cloudera (5.1 or later), MapR (3.0 or later), and Hortonworks (HDP 2.1 or later). It also operates in cloud computing environments, for example Amazon EC2, Google Compute Engine, and Microsoft Azure. The H2O Sparkling Water software is Databricks-certified on Apache Spark.
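The Hogwild update pattern can be sketched in a few lines of Python. This is an illustration of the general scheme under simplifying assumptions, not H2O's implementation: the function names `sgd_worker`, `hogwild_fit`, and `make_rows` are hypothetical, the model is a one-variable linear regression rather than a neural network, and the learning rate and epoch count are arbitrary choices. Each worker thread runs stochastic gradient descent on its own subset of the rows and writes to a shared weight vector without taking any lock.

```python
import random
import threading
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # (x, y)


def sgd_worker(weights: List[float], rows: Sequence[Row],
               lr: float, epochs: int, seed: int) -> None:
    """Run SGD on one data subset, updating the shared weights without locks."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = rows[i]
            # The weights read here may already be stale: other workers
            # write to the same list concurrently.
            err = weights[0] + weights[1] * x - y
            weights[0] -= lr * err        # gradient step for the intercept
            weights[1] -= lr * err * x    # gradient step for the slope


def hogwild_fit(rows: Sequence[Row], n_workers: int = 4,
                lr: float = 0.1, epochs: int = 50) -> List[float]:
    """Fit y = w0 + w1*x with several unsynchronized SGD workers."""
    weights = [0.0, 0.0]  # shared by all workers
    shards = [rows[k::n_workers] for k in range(n_workers)]  # one subset each
    threads = [
        threading.Thread(target=sgd_worker, args=(weights, shard, lr, epochs, k))
        for k, shard in enumerate(shards)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights


def make_rows(n: int = 1000) -> List[Row]:
    """Synthetic noise-free rows on the line y = 0.5 + 3x, with x in [0, 1)."""
    return [(i / n, 0.5 + 3.0 * (i / n)) for i in range(n)]


if __name__ == "__main__":
    print(hogwild_fit(make_rows()))
```

In standard CPython the global interpreter lock makes the threads interleave rather than run truly in parallel, so the sketch shows the lock-free update pattern but not the speed-up that the scheme is meant to deliver on multi-core hardware. Occasional overwritten updates are tolerated by design; the estimate still approaches the true coefficients on this synthetic data.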