Monday, September 29, 2014

Comparative Test Report on esProc, Hive, Impala Clusters (Part VI)

6. Analysis of the Test Results

In general, looking at the results from each use case, the following patterns can be observed:

1. In most cases the three products rank in a consistent order: esProc performs best, Impala comes second, and Hive is slowest. esProc stays comfortably ahead of Impala and is several times faster than Hive.

2. The big-group use case is a special case: there Hive performs considerably better than Impala, and even in the common-group use case Impala sometimes falls behind Hive.

3. Impala is not sensitive to the amount of computation; its performance degrades only slowly even as the computation load keeps growing.

The reasons why the three tools show these characteristics are as follows:
1. esProc's top-level performance likely comes from hard disk IO. esProc can access the hard disk directly, bypassing HDFS, whereas Hive and Impala both depend on HDFS for data access. Since most of the time in big data computation is spent on the hard disk, esProc reaps the benefit of faster disk IO. Of course, this advantage only makes sense for small and mid-sized clusters; in a large cluster environment esProc still has to ensure data safety through HDFS or some other redundancy mechanism. A sketch of the two access paths follows.
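
To make the contrast concrete, here is a minimal Java sketch of the two access paths. It assumes the Hadoop client library is on the classpath, and the file paths and NameNode address are hypothetical; the point is only that the local read talks straight to the operating system's file system, while the HDFS read goes through NameNode metadata lookups and DataNode streaming.

    // A sketch only: the data paths and the namenode address are made-up values.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AccessPaths {
        public static void main(String[] args) throws Exception {
            byte[] buf = new byte[8192];

            // esProc-style access: read a local data segment directly from disk.
            try (InputStream local = new FileInputStream("/data/orders/part-0001.dat")) {
                while (local.read(buf) != -1) { /* process the block */ }
            }

            // Hive/Impala-style access: read the same data through HDFS,
            // which adds NameNode lookups and DataNode network streaming.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
            try (FileSystem fs = FileSystem.get(conf);
                 InputStream remote = fs.open(new Path("/warehouse/orders/part-0001.dat"))) {
                while (remote.read(buf) != -1) { /* process the block */ }
            }
        }
    }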

2. In most cases Impala performs better than Hive, which likely comes down to how intermediate data is exchanged. Impala supports in-memory computation and can exchange data between stages in memory, while Hive only supports out-of-memory computation and has to exchange data through the hard disk. On the data exchange itself, Impala outperforms Hive by at least an order of magnitude; in overall performance, however, the 3-90x gap that Cloudera claims never appears in these tests, and Impala generally surpasses Hive by only 2-3 times. The two exchange styles are sketched below.
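The following plain-Java sketch, which assumes nothing about either engine's actual internals, contrasts the two exchange styles: stage 1 handing its intermediate result to stage 2 through memory, versus spilling it to disk and reading it back the way a MapReduce shuffle does. The row count is arbitrary.

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class ExchangeStyles {
        public static void main(String[] args) throws IOException {
            int rows = 1_000_000;

            // In-memory exchange (Impala-style): the next stage consumes the
            // intermediate rows straight from the heap.
            List<long[]> buffer = new ArrayList<>(rows);
            for (int i = 0; i < rows; i++) buffer.add(new long[]{i, i * 2L});
            long inMemorySum = 0;
            for (long[] row : buffer) inMemorySum += row[1];

            // Disk exchange (Hive/MapReduce-style): stage 1 writes every row
            // out, and stage 2 must read them all back before it can start.
            Path spill = Files.createTempFile("shuffle", ".bin");
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(spill)))) {
                for (int i = 0; i < rows; i++) {
                    out.writeLong(i);
                    out.writeLong(i * 2L);
                }
            }
            long diskSum = 0;
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(spill)))) {
                for (int i = 0; i < rows; i++) {
                    in.readLong();            // key
                    diskSum += in.readLong(); // value
                }
            }
            Files.delete(spill);

            // Same result either way; the disk round trip is the extra cost.
            System.out.println(inMemorySum == diskSum);
        }
    }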

3. In the big-group operation Impala is far inferior to Hive, and in the common-group use case Impala also performs worse than Hive on rare occasions. Both cases appear once the data volume becomes large. Since Impala only supports in-memory computation, the likely explanation is that the data volume is large enough to hit the memory limit, forcing the engine to swap data in and out of memory frequently. In fact, when the data amount increases further, Impala runs out of memory altogether, and unless additional physical memory is made available the computation cannot complete. The sketch below shows why grouping in particular is memory-hungry.
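This minimal sketch, in plain Java with made-up row and key counts, shows the mechanism: a hash aggregation keeps one accumulator per distinct grouping key, so heap usage grows with the number of groups rather than the number of rows.

    import java.util.HashMap;
    import java.util.Map;

    public class GroupMemorySketch {

        static int countGroups(long rows, long distinctKeys) {
            Map<Long, Double> accumulators = new HashMap<>();
            for (long i = 0; i < rows; i++) {
                // One accumulator per distinct key stays alive for the whole scan.
                accumulators.merge(i % distinctKeys, 1.0, Double::sum);
            }
            return accumulators.size();
        }

        public static void main(String[] args) {
            // Common group: a handful of groups, trivially fits in memory.
            System.out.println(countGroups(10_000_000L, 10));

            // Big group: millions of groups. On a small heap this is the call
            // that overflows memory, mirroring the Impala behavior above.
            System.out.println(countGroups(10_000_000L, 10_000_000L));
        }
    }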

4. Impala is not sensitive to the amount of computation and degrades only slowly as the computation keeps growing. This is because Impala supports dynamic native code generation, while Hive and esProc run interpreted on the JVM, which leaves a big gap in raw execution efficiency. Big data computation, however, spends most of its time on hard disk IO rather than on code execution, so Impala's advantage in native code generation does not usually improve its overall performance much. For example, if 90 seconds of a 100-second query go to disk IO, making the remaining computation ten times faster only shortens the query to 91 seconds.

