6. Analysis of the Results
In general, observing the results of each use case, the following patterns emerge:
1. In most cases the three tools rank in a consistent descending order: esProc first, Impala second, and Hive last. esProc stays comfortably ahead of Impala, and outperforms Hive several times over.
2. The big-group case is an exception: there, Hive performs much better than Impala. In the common-group use case, Impala is also occasionally inferior to Hive.
3. Impala is not sensitive to the amount of computation; its performance degrades only slowly even as the computation amount continues to grow.
The reasons the three tools show these characteristics are as follows:
1. esProc's top-ranking performance can be attributed to hard disk IO. esProc accesses the hard disk directly, bypassing HDFS, while Hive and Impala depend on HDFS for data access. Since most of the time in big data computation is spent on hard disk IO, esProc gains a performance boost there. Of course, this advantage only makes sense for small and medium-sized clusters; in a large cluster environment, esProc would still need to ensure data safety through HDFS or some other redundancy mechanism.
2. In most cases Impala performs better than Hive, which comes down to data exchange. Impala supports in-memory computation and can exchange intermediate data in memory, while Hive only supports out-of-memory computation and must exchange data through the hard disk. In data exchange alone, Impala outperforms Hive by at least an order of magnitude; in overall performance, however, the 3-90x gap that Cloudera claims does not appear at all — Impala is generally only 2-3 times faster than Hive.
3. When performing a big group operation, Impala is far inferior to Hive, and in the common-group use case Impala occasionally performs worse than Hive as well. Both cases arise when the volume of data is large. Since Impala only supports in-memory computation, the likely explanation is that the data volume is large enough to reach the limit of available memory, at which point the JVM must perform frequent memory swaps. Indeed, Impala runs out of memory when the data amount increases further; unless additional physical memory is available, the computation cannot complete.
4. Impala is not sensitive to the amount of computation, and it degrades only slowly even as the computation amount continues to grow. This is because Impala supports dynamic native code generation, while Hive and esProc are interpreted in Java, leaving a big gap in execution efficiency. Big data computation, however, spends most of its time on hard disk IO rather than code execution, so Impala's advantage in native code generation usually does little to improve its overall performance.
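The big-group behavior in point 3 can be sketched with a toy in-memory hash aggregation. An in-memory GROUP BY keeps one accumulator per distinct key, so its memory footprint grows with group cardinality, not row count; an engine that can only aggregate in memory hits a wall once the hash table no longer fits, while a disk-spilling engine keeps going, just more slowly. The row counts and key patterns below are illustrative assumptions, not data from the benchmark:

```python
# Toy in-memory hash aggregation (GROUP BY key, SUM(value)).
# Memory use is proportional to the number of DISTINCT keys, so a
# high-cardinality "big group" needs vastly more memory than a
# "common group" over the same number of rows.

def group_sum(rows):
    """In-memory GROUP BY key with SUM(value); one dict entry per distinct key."""
    table = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value
    return table

# Common group: 100,000 rows but only 10 distinct keys -> tiny hash table.
common = [(i % 10, 1) for i in range(100_000)]
# Big group: same 100,000 rows, every key distinct -> 100,000 accumulators.
big = [(i, 1) for i in range(100_000)]

print(len(group_sum(common)))  # 10 accumulators
print(len(group_sum(big)))     # 100000 accumulators
```

The point of the sketch is only the ratio: the two inputs have identical row counts, yet the hash table for the big group is four orders of magnitude larger, which is consistent with an in-memory-only engine degrading or overflowing exactly there.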
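The limited payoff of native code generation in point 4 can be checked with a quick Amdahl's-law estimate. The 90% IO share and 10x code-execution speedup below are assumed figures for illustration, not measurements from the benchmark:

```python
# Amdahl's-law sketch: if hard disk IO dominates the running time, speeding
# up only the CPU-bound execution part barely improves the total.
# Assumed figures: IO takes 90% of the time; codegen makes execution 10x faster.

def overall_speedup(io_fraction, exec_speedup):
    """Total speedup when only the non-IO (execution) part is accelerated."""
    exec_fraction = 1.0 - io_fraction
    return 1.0 / (io_fraction + exec_fraction / exec_speedup)

s = overall_speedup(io_fraction=0.9, exec_speedup=10.0)
print(f"overall speedup: {s:.2f}x")  # about 1.10x despite 10x faster code
```

Under these assumed numbers, a tenfold execution speedup yields only about a 10% overall gain, which matches the observation that Impala's codegen advantage rarely shows up in end-to-end results.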