Monday, September 22, 2014

Comparative Test Report on esProc, Hive, Impala Clusters (Part I)

1. Objective

To test the computation capabilities of the esProc, Hive and Impala clusters separately in the same hardware environment, and to compare their performance.

2. Test Content and Method

Cluster Scale: 4 Nodes.

Data volume: a 125GB wide fact table and a 143GB narrow fact table are used as the primary test data; both are considerably larger than the physical memory of a single node.

Algorithm classification: five typical SQL algorithms are tested separately: scan, group, join, join with a big dimension table across nodes, and big group (illustrated in the sketch below). Note that these simple algorithms are chosen only to make the test process easy to follow and the performance comparison intuitive; it does not mean that esProc and SQL are fully equivalent. In fact, the two emphasize different key functions: esProc is good at computations with relatively complex business logic, while SQL fits some common complex computations. A complex algorithm in SQL may be executed with a different execution plan that is out of manual control, which is not good for comparison, so such tests are not performed.
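To give a rough sense of what these categories mean in SQL terms, a sketch of HiveQL-style queries of the same flavor is shown below. The column names used (amount, region, dl2_key, dim_name, big_key) are placeholders for illustration only, not the actual test schema, which is defined in the annexed test process document; the cross-node join case is illustrated in section 4.1.

       -- Scan: read through the fact table with a simple filter.
       SELECT * FROM T1 WHERE amount > 1000;

       -- Group: aggregate the fact table on a low-cardinality column.
       SELECT region, SUM(amount) FROM T1 GROUP BY region;

       -- Join: join the fact table with a small dimension table, then aggregate.
       SELECT d.dim_name, SUM(t.amount)
       FROM T1 t JOIN DL2 d ON t.dl2_key = d.dl2_key
       GROUP BY d.dim_name;

       -- Big group: group on a high-cardinality key, producing a large result set.
       SELECT big_key, COUNT(*) FROM T1 GROUP BY big_key;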

Category of use cases: several sets of use cases are designed according to table width, data type and computation load.

Storage structure: Row-based storage.

Note: The Comparative Test Process for esProc, Hive, Impala Clusters is attached to this report; refer to it for the specific data structures, test code, test reproduction steps and other details.

3. Environment Description

Hardware:
       Number of PCs: 4
       CPU: Intel Core i5 2500 (4 cores)
       RAM: 16GB
       HDD: 2TB, 7200rpm
       Ethernet adapter: 1000Mbps

Software:
       OS: CentOS 6.4
       JDK: 1.7
       Hadoop/HDFS: 2.2.0

Test Objects:
       Hive: 0.11.0
       esProc: 3.1
       Impala: 1.2.0

4. Data Description

The data scale is defined based on the size of the exported files.

Each test object uses the file format that gives it the highest performance: esProc uses its proprietary binary files, while Hive and Impala use text files.

4.1 Data Table and Associative Table

Fact Table T1
      
This is a wide table with 100 fields, used to simulate a fact case with a large number of fields.
      
Fact Table T2
      
This is a narrow table with 11 fields, used to simulate a fact case with fewer fields.

The fact tables are the primary data source in this test and are used in the scan, group and join tests.
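To make the table shapes concrete, the following is a minimal sketch of how a narrow fact table like T2 might be declared as a delimited text table in Hive (text files being the format used for Hive and Impala in this test). All column names and types are assumed placeholders; the actual 100-field T1 and 11-field T2 schemas are given in the annexed test process document.

       -- Placeholder schema: 11 illustrative columns, not the actual test schema.
       CREATE TABLE T2 (
         id        BIGINT,
         dl2_key   INT,
         dl6_key   INT,
         dd2_key   INT,
         dd6_key   INT,
         dc2_key   INT,
         dc6_key   INT,
         dc11_key  BIGINT,
         num1      DOUBLE,
         num2      DOUBLE,
         str1      STRING
       )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       STORED AS TEXTFILE;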

Dimension Tables DL2, DL6, DD2, DD6, DC2, DC6
The dimension tables are only used to test the join (and multi-level join) use cases. They are joined with the fact tables, and also with each other; their data volume is smaller.
      
DC11 is a cross-node dimension table: it is too large to fit into the memory of a single machine, so it must be segmented and loaded into the memory of different nodes, and the cluster must access it across nodes to complete the computation.
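On the SQL side this use case is written as an ordinary join; the difficulty lies in how each engine handles a dimension table that exceeds a single node's memory (presumably Hive and Impala through their distributed join strategies, and esProc by segmenting DC11 into the memories of different nodes as described above). A sketch of the query shape, with placeholder column names (dc11_key, attr, num1):

       -- Cross-node join: DC11 is too large to fit in one node's memory.
       -- Column names are placeholders for illustration only.
       SELECT d.attr, SUM(t.num1)
       FROM T2 t
       JOIN DC11 d ON t.dc11_key = d.dc11_key
       GROUP BY d.attr;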

4.2 Data Scale


