1. Objective
This report compares the computational performance of esProc, Hive, and Impala clusters by running separate tests on each of them in the same hardware environment.
2. Test Content and Method
Cluster scale: 4 nodes.
Data volume: a 125G wide fact table and a 143G narrow fact table are used as the primary test data; their volume is considerably larger than the physical memory of a compute node.
Algorithm classification: Five typical SQL algorithms are tested separately: scan, group, join, join with a big dimension table across nodes, and big group. Note that these simple algorithms were chosen only to make the testing process easy to follow and the performance comparison intuitive; it does not mean esProc and SQL are fully equivalent. In fact, the two emphasize different strengths: esProc is good at computations with relatively complex business logic, while SQL is suited to common, well-defined computations. A complex algorithm in SQL may be executed with a different query plan that is beyond manual control, which makes fair comparison difficult, so such tests are not performed.
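The first three algorithm categories can be sketched as plain SQL statements. The following is a minimal, runnable illustration using Python's sqlite3 on a toy in-memory schema; the table and column names (`T2`, `DL2`, `dim_id`, `amount`) are illustrative stand-ins, not the actual benchmark schema or engines.

```python
import sqlite3

# Toy miniature of a narrow fact table and one dimension table.
# Names and fields are hypothetical, chosen only to show the query shapes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T2 (id INTEGER, dim_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE DL2 (dim_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO T2 VALUES (?,?,?)",
                [(1, 10, 5.0), (2, 10, 7.5), (3, 20, 2.5)])
cur.executemany("INSERT INTO DL2 VALUES (?,?)", [(10, "A"), (20, "B")])

# 1. Scan: aggregate over the full fact table.
total = cur.execute("SELECT SUM(amount) FROM T2").fetchone()[0]

# 2. Group: aggregate per grouping key.
groups = cur.execute(
    "SELECT dim_id, SUM(amount) FROM T2 GROUP BY dim_id ORDER BY dim_id"
).fetchall()

# 3. Join: fact table joined to a dimension table, then aggregated.
joined = cur.execute(
    "SELECT d.name, SUM(t.amount) FROM T2 t "
    "JOIN DL2 d ON t.dim_id = d.dim_id GROUP BY d.name ORDER BY d.name"
).fetchall()

print(total)   # 15.0
print(groups)  # [(10, 12.5), (20, 2.5)]
print(joined)  # [('A', 12.5), ('B', 2.5)]
```

In the actual test these query shapes run against the 125G/143G fact tables on each cluster, where data volume rather than query complexity dominates the cost.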
Use case categories: Several sets of use cases are designed for the test process according to table width, data types, and computation load.
Storage structure: Row-based storage.
Note: The Comparative Test Process for esProc, Hive, Impala Clusters is annexed to this report. For the specific data structures, test code, reproducibility steps, and other details, refer to that document.
3. Environment Description
Hardware:
Number of PCs: 4
CPU: Intel Core i5 2500 (4 cores)
RAM: 16G
HDD: 2T / 7200rpm
Ethernet adapter: 1000M
Software:
OS: CentOS 6.4
JDK: 1.7
Hadoop/HDFS: 2.2.0
Test Objects:
esProc 3.1
Impala 1.2.0
4. Data Description
Data scale is defined based on the exported files. The file format that demonstrates the highest performance of each test object is used: esProc uses its proprietary binary files, while Hive and Impala use text files.
4.1 Data Tables and Associative Tables
Fact Table T1
This is a wide table with 100 fields, used to simulate a fact case with a large number of fields.
Fact Table T2
This is a narrow table with 11 fields, used to simulate a fact case with fewer fields.
The fact tables are the primary data source in this test and are used in the scan, group, and join operations.
Dimension Tables DL2, DL6, DD2, DD6, DC2, DC6
The dimension tables are used only to test the join (and multi-level join) use cases. They join with the fact table and also with each other, and hold a smaller amount of data.
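A multi-level join of this kind, where the fact table joins a dimension table that in turn joins another dimension table, can be sketched as follows. The schema (`fact`, `dim1`, `dim2`, `region`) is hypothetical, used only to show the query shape, and again uses Python's sqlite3 rather than the tested engines.

```python
import sqlite3

# Hypothetical multi-level join: fact row -> dimension -> sub-dimension.
# Table and column names are illustrative, not the benchmark schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact (id INTEGER, d1_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE dim1 (d1_id INTEGER, d2_id INTEGER)")
cur.execute("CREATE TABLE dim2 (d2_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO fact VALUES (?,?,?)", [(1, 1, 4.0), (2, 2, 6.0)])
cur.executemany("INSERT INTO dim1 VALUES (?,?)", [(1, 100), (2, 200)])
cur.executemany("INSERT INTO dim2 VALUES (?,?)", [(100, "east"), (200, "west")])

# The fact table joins dim1, and dim1 in turn joins dim2 (multi-level join).
rows = cur.execute(
    "SELECT d2.region, SUM(f.amount) FROM fact f "
    "JOIN dim1 d1 ON f.d1_id = d1.d1_id "
    "JOIN dim2 d2 ON d1.d2_id = d2.d2_id "
    "GROUP BY d2.region ORDER BY d2.region"
).fetchall()
print(rows)  # [('east', 4.0), ('west', 6.0)]
```

Because the dimension tables are small, each engine can hold them in memory during the join; the cross-node case below is what happens when a dimension table outgrows a single node.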
DC11, a dimension table distributed across nodes, is too large to fit into one computer's memory and must be loaded in segments into the memory of different machines. The cluster loads DC11 across the nodes to complete the computation.
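The idea of loading a dimension table in segments can be sketched as hash-partitioning its keys across nodes, so each node keeps only its own slice in memory and join probes are routed to the owning node. This is a minimal sketch under assumed conventions (4 nodes as in this test, modulo hashing); it is not esProc's actual distribution mechanism.

```python
# A minimal sketch of segmenting a large dimension table across nodes.
# NODE_COUNT and the partition rule are assumptions for illustration.
NODE_COUNT = 4

def node_for_key(key, node_count=NODE_COUNT):
    """Decide which node's memory holds the dimension row for this key."""
    return key % node_count

# Each node keeps only its own segment of the dimension table in memory.
dimension_rows = {key: "dim-%d" % key for key in range(10)}
segments = [dict() for _ in range(NODE_COUNT)]
for key, row in dimension_rows.items():
    segments[node_for_key(key)][key] = row

# A join probe for a fact-table key is routed to the node owning that key.
def lookup(key):
    return segments[node_for_key(key)].get(key)

print(lookup(7))  # 'dim-7'
```

The design point is that no single node ever needs the whole table: memory use per node shrinks roughly by the node count, at the cost of network traffic for probes that land on remote segments.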