2014年9月30日星期二

Preparing Testing Data with esProc

Test data preparation is a critical part of software testing. High-quality test data simulates business scenarios more faithfully, making it possible to evaluate software performance in a timely and effective way, or to find potential issues in software builds. In most cases the amount of data used in testing is relatively large, and the data needs to be generated randomly according to specific requirements. Sometimes there are relationships between data items, and sometimes data must be retrieved from an existing database. The preparation of test data therefore often means complexity and a huge workload.

esProc is a handy tool for testing data preparation.

Now we need to prepare test data for employee information in text format, including employee number, name, gender, date of birth, and city and state of residence. Through this example, we can see how test data can be prepared.

We have the following requirements for the test data: employee numbers are generated sequentially; names and genders are randomly generated; birthdays are randomly generated, but the current age of each employee must be between 18 and 55 years; cities and states are randomly drawn from a table in the database.

Three text files, Top100MaleNames.txt, Top100FemaleNames.txt and Top100Surnames.txt, store the 100 most common male names, female names and surnames respectively.

The cities of employees are retrieved randomly from the CITIES table in the database: 

According to the STATEID field in the CITIES table, we can retrieve the abbreviation of the employee's state from the STATES table: 

Note that when generating employee information, an employee's name is related to his/her gender. Therefore we need to read the text data first, combine the most common male and female names, and add a gender field to them: 

After sorting, we can see in C2 the following table sequence of names and genders: 

Similarly, city and state abbreviation are also related. After retrieving the data from the database, the state abbreviation is added to the city information: 

The City table is: 

Next, the basic settings for data generation are defined, including the data structure of the employee information table and the amount of test data to be generated: 

Here, the number in C5 defines the buffer size: after every 1500 generated records, the data is written to the text file once, which keeps the memory footprint under control. In B6 the data structure of the employee information table is written to the text file.

As the next step, we can now run a loop to generate the testing data for every employee: 

B7 generates a random sequence number as an index into the names, while C7 generates one for the surnames; together they retrieve the name and gender for each employee. According to the requirements, B11 randomly generates the age, and based on that age, line 12 selects a random date in the corresponding year as the employee's birthday. Lines 13 and 14 randomly select a record to get the city and state for the employee. After all required data are generated, B15 appends them to the table sequence of employee information created in A5. A16 controls the output, writing the data to the text file after every 1500 records. After each write, A5 is cleared to avoid occupying too much memory.
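The esProc script itself only runs inside esProc, but the same generation logic can be sketched in plain Java. The name, surname and city arrays below are tiny hypothetical stand-ins for the text files and the CITIES/STATES tables, and the gender/name linkage and the 1500-record file buffering are omitted for brevity:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class EmployeeDataGen {
    // Tiny hypothetical stand-ins for the name files and the CITIES/STATES tables
    static final String[] NAMES = {"James", "Mary", "John", "Linda"};
    static final String[] SURNAMES = {"Smith", "Johnson", "Brown"};
    static final String[][] CITIES = {{"Austin", "TX"}, {"Miami", "FL"}};

    // One record: sequential id, random name and city, birthday keeping age in [18, 55]
    static String[] randomEmployee(int id, Random rnd) {
        String fullName = NAMES[rnd.nextInt(NAMES.length)] + " "
                        + SURNAMES[rnd.nextInt(SURNAMES.length)];
        String[] city = CITIES[rnd.nextInt(CITIES.length)];
        int age = 18 + rnd.nextInt(38); // 18..55 inclusive
        LocalDate birthday = LocalDate.ofYearDay(LocalDate.now().getYear() - age,
                                                 1 + rnd.nextInt(365));
        return new String[]{String.valueOf(id), fullName, birthday.toString(),
                            city[0], city[1]};
    }

    public static List<String[]> generate(int count, long seed) {
        Random rnd = new Random(seed);
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i <= count; i++)
            rows.add(randomEmployee(i, rnd)); // in real use, flush to file every 1500 rows
        return rows;
    }

    public static void main(String[] args) {
        for (String[] r : generate(5, 42)) System.out.println(String.join("\t", r));
    }
}
```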

After all data are written, the text file looks as follows: 

When preparing test data with esProc, we can loop to generate large amounts of random data. Meanwhile, in the loop, we can easily retrieve existing database or text data, generating data according to business needs without writing complex programs.

2014年9月29日星期一

Comparative Test Report on esProc, Hive, Impala Clusters (part VI)

6. Analysis of Results

In general, the results of the use cases show the following patterns:

1. In most cases the three products rank in descending order of performance: esProc first, Impala second, and Hive last. esProc stays comfortably ahead of Impala, and is several times faster than Hive.

2. The big-group case is special: there Hive performs considerably better than Impala. In the common-group use case, Impala is also occasionally inferior to Hive.

3. Impala is not sensitive to computation amount; its performance degrades only slowly as the computation amount grows.

The reasons why the three tools behave this way are as follows: 
1. esProc's top-level performance likely benefits from hard disk I/O. esProc can access the hard disk directly, bypassing HDFS, while Hive and Impala depend on HDFS. Since most of the time in big data computation is spent on the hard disk, esProc gains a performance boost from faster disk I/O. Of course, this advantage only makes sense for small and medium-sized clusters; in a large cluster environment, esProc would still need to ensure data safety through HDFS or other redundancy mechanisms.

2. In most cases, Impala demonstrates better performance than Hive, which probably comes down to data exchange. Impala supports in-memory computation and can exchange data in memory, while Hive can only compute out of memory and has to exchange data through the hard disk. Where data exchange dominates, Impala outperforms Hive by at least an order of magnitude; in overall performance, however, the 3-90x gap that Cloudera claims does not appear at all. Generally Impala is only 2-3 times faster than Hive. 

3. In the big-group case, Impala is far inferior to Hive, and in the common-group case Impala occasionally performs worse than Hive as well. Both situations arise when the mass of data is large. Since Impala only supports in-memory computation, the likely cause is that the data volume reaches the limit of available memory, forcing the JVM into frequent memory swapping. In fact, Impala suffers memory overflow when the data amount increases further; unless additional physical memory is available, the computation cannot complete.

4. Impala is not sensitive to the computation amount, degrading only slowly as it grows. This is because Impala supports dynamic native code generation, while Hive and esProc are interpreted on the JVM, a large gap in execution efficiency. Big data computation, however, spends most of its time on hard disk I/O rather than code execution, so Impala's advantage in native code generation does not usually improve its overall performance. 


2014年9月28日星期日

Comparative Test Report on esProc, Hive, Impala Clusters (part V)

5.4 Use Case for Dimension Table across Nodes

This case tests the performance of foreign key joins over big data. The dimension table here is too large to fit in memory, so it is segmented and loaded across several sub-nodes.


Use Case Number Description:

Tested Results:

Data Features:
1. esProc, Impala and Hive degrade in turn, and esProc shows an evident advantage.

2. Data type has only a minimal impact on the performance of the three test objects.

5.5 Use Case for Big Group

This case tests the performance of big data grouping where the number of groups in the result far exceeds physical memory, so the computation cannot be done entirely in memory.

Use Case Description: 

Tested Results 

Data Features:
1. As we can see, the ranking changes here: esProc, Hive and Impala degrade in turn. esProc still leads significantly, while Impala shows the worst performance. 

2. Impala's performance decreases sharply when the mass of data is huge.

3. Data type has only a minimal impact on the performance of the three test objects.

2014年9月25日星期四

Comparative Test Report on esProc, Hive, Impala Clusters (part IV)

5.3 Joining Use Case

This case tests the performance of foreign key joins over big data, covering integer, numeric and string join keys as well as single-level and multi-level joins.


Use Case Number Description:

 Tested Results:



Data Features:
1. esProc, Impala and Hive degrade in turn. At the top performance level, esProc is nearly twice as fast as Impala, and 4-6 times as fast as Hive. 

2. Impala is not sensitive to computation amount. Presumably, if the computation amount continued to increase, Impala's performance could eventually exceed that of esProc. 

3. When the number of join levels increases, the performance of esProc and Impala changes little, but Hive's performance drops significantly.

2014年9月24日星期三

Comparative Test Report on esProc, Hive, Impala Clusters (part III)

5.2 Use Case for Group

This case tests the performance of big data grouping, separately considering integer, numeric, string, date and other grouping keys. The number of groups in the result is relatively small, less than the physical memory of a node.

Use Case Number Description:

Tested Results:

Data Feature:
In general, esProc, Impala and Hive degrade in turn. At the top performance level, esProc is 2-3 times faster than Impala, and 3-4 times faster than Hive. The exceptions are use cases 32, 36 and 40, which operate on the wide table with a large amount of data; in those, Hive performs better than Impala.

2014年9月23日星期二

Comparative Test Report on esProc, Hive, Impala Clusters (part II)

5. Test Case

For ease of understanding, the operation logic of each test is described in SQL. During the test, Hive and Impala execute the SQL statements directly, while for esProc we write code that achieves the equivalent of the SQL operations.

5.1 Use Case for Scan

This case tests the performance of full-table scanning over big data, separately considering simple counting; integer, float and numeric summation; and filtering on integer, numeric, string and date fields.


Use case description:


Tested Results

Note: Time unit is in seconds.

1. esProc, Impala and Hive degrade in turn. At the top performance level, esProc is 2-3 times faster than Impala, and 3-4 times faster than Hive.

2. Computation amount has little impact on the performance of the three test objects.

3. Data type has only a minimal impact on the performance of the three test objects.

2014年9月22日星期一

A Handy Method of Performing Cross-database Relational Operations in Java

Program development in Java sometimes involves cross-database relational operations. The following example illustrates Java's way of handling them: the sales table is in a db2 database and the employee table is in a mysql database. The task is to join sales with employee through the sellerid field of sales and the eid field of employee, and then filter out the rows where state="California".


The data and structure of sales table are as follows:

The data and structure of employee table are as follows:

Since the two tables come from different databases, a single SQL statement cannot join them. But RowSet, Java's class library for data computing, can, because its JoinRowSet and FilteredRowSet classes support cross-database computing.

Java's way of programming is as follows:
1. Retrieve the data of the sales and employee tables from the db2 and mysql databases respectively, and store them in CachedRowSet objects.

2. Use JoinRowSet to complete the inner join of the two tables.

3.   Use FilteredRowSet to complete the conditioned filtering.

4.   Print out the resulting data.

In the following code, two functions import the data from db2 and mysql (the import lines shown at the top are needed; the *Impl classes are the reference implementations shipped in the Oracle JDK):

import java.sql.*;
import javax.sql.RowSet;
import javax.sql.rowset.*;   // CachedRowSet, JoinRowSet, FilteredRowSet, Predicate
import com.sun.rowset.*;     // CachedRowSetImpl, JoinRowSetImpl, FilteredRowSetImpl

public static RowSet db2() throws Exception {
    String drive = "com.ibm.db2.jcc.DB2Driver";
    String url = "jdbc:db2://127.0.0.1:50000/demo";
    String DBUSER = "db2admin";
    String password = "db2admin";
    Class.forName(drive);
    Connection conn = DriverManager.getConnection(url, DBUSER, password);
    Statement stmt = conn.createStatement();
    ResultSet result = stmt.executeQuery("SELECT * FROM sales");
    CachedRowSet cachedRS = new CachedRowSetImpl();
    cachedRS.populate(result);
    result.close();
    stmt.close();
    conn.close();
    return cachedRS;
}

public static RowSet mysql() throws Exception {
    String drive = "com.mysql.jdbc.Driver";
    String url = "jdbc:mysql://127.0.0.1:3306/test";
    String DBUSER = "root";
    String password = "root";
    Class.forName(drive);
    Connection conn = DriverManager.getConnection(url, DBUSER, password);
    Statement stmt = conn.createStatement();
    ResultSet result1 = stmt.executeQuery("SELECT * FROM employee");
    CachedRowSet cachedRS = new CachedRowSetImpl();
    cachedRS.populate(result1);
    result1.close();
    stmt.close();
    conn.close();
    return cachedRS;
}

Then the following function joins the two tables and filters the data.
public static void myJoin() throws Exception {
    // retrieve data from the two databases
    RowSet mysqlRS = mysql();
    RowSet db2RS = db2();
    // join the two tables
    JoinRowSet joinRS = new JoinRowSetImpl();
    joinRS.addRowSet(db2RS, "SELLERID");
    joinRS.addRowSet(mysqlRS, "EID");
    // perform the conditioned filtering
    FilteredRowSet filterRS = new FilteredRowSetImpl();
    filterRS.populate(joinRS);
    StateRange range = new StateRange(); // filtering condition; see the class below
    filterRS.setFilter(range);
    while (filterRS.next()) { // print out the result
        int ORDERID = filterRS.getInt("ORDERID");
        int SELLERID = filterRS.getInt("SELLERID");
        String NAME = filterRS.getString("NAME");
        String STATE = filterRS.getString("STATE");
        System.out.print("ORDERID=" + ORDERID + ";");
        System.out.print("SELLERID=" + SELLERID + ";");
        System.out.print("NAME=" + NAME + ";");
        System.out.print("STATE=" + STATE + ";");
    }
}

During the process, the StateRange class needs to be written by ourselves, as the following inner class:
public static class StateRange implements Predicate {
    public StateRange() {}
    public boolean evaluate(RowSet rs) {
        try {
            if (rs.getString("STATE").equals("California"))
                return true; // if "state" equals "California", return true
        } catch (SQLException e) {
            // do nothing
        }
        return false;
    }
    public boolean evaluate(Object value, int column) throws SQLException {
        return false;
    }
    public boolean evaluate(Object value, String columnName) throws SQLException {
        return false;
    }
}

The above code completes the task of cross-database relational operation between db2 and mysql, along with the filtering, but it is far from ideal. First, JoinRowSet only supports inner join, not outer join. Second, our tests show that db2, mysql and hsql work with JoinRowSet, but joining oracle11g with another database yields an empty result set even though no error is reported; two users within the same oracle11g instance, however, can join correctly with JoinRowSet. So the JDBC drivers provided by different database vendors can affect the result of this method. Third, the code is rather complicated.

The task can be made easier by introducing esProc to the operation. As a programming language specializing in processing structured (semi-structured) data, esProc can perform cross-database relational operations easily, and integrate with Java seamlessly, enabling Java to perform cross-database data computing as flexibly as SQL does. esProc supports various databases, including oracle, db2, mysql, sqlserver, sybase, postgre, and so on, all of which can perform cross-database relational operations like inner join and outer join.
To fulfill the above task, 8 lines of code in esProc are enough, which you can see below:

A1: Connect to the db2 data source configured in advance.

A2: Connect to the mysql data source configured in advance. Other databases, such as oracle, could be used here as well.

A3, A4: Retrieve the table sequences sales and employee from db2 and mysql respectively. esProc's IDE can display the retrieved data visually, as shown in the right part of the above figure.

A5: Join the table sequences sales and employee through sellerid=eid using esProc's object reference mechanism.

A6: Filter the two table sequences according to state="California".

A7: Generate a new table sequence and get the required fields.

A8: Return the result to the caller of the esProc program.

Lastly, call this piece of esProc code from Java using the JDBC driver provided by esProc. Save the above esProc code as the file test.dfx; then Java calls it with the following code:

// create a connection using the esProc jdbc driver
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// call the esProc program (the stored procedure); test is the name of the dfx file
com.esproc.jdbc.InternalCStatement st;
st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test()");
// execute the esProc stored procedure
st.execute();
// get the result set
ResultSet set = st.getResultSet();

Comparative Test Report on esProc, Hive, Impala Clusters (part I)

1. Objective

This report compares the computational capabilities of esProc, Hive and Impala clusters by testing each of them separately in the same hardware environment and demonstrating the performance differences.

2. Test Content and Method

Cluster Scale: 4 Nodes.

Data volume: a 125G wide fact table and a 143G narrow fact table are used as the primary test data; their volume is considerably larger than the physical memory of a compute node.

Algorithm classification: five typical SQL algorithms are tested separately: scan, group, join, join with a big dimension table across nodes, and big group. Note that these simple algorithms were chosen only to make the testing process easy to understand and the performance comparison intuitive; it does not mean esProc and SQL are fully equivalent. In fact, the two emphasize different key functions: esProc is good at computations with relatively complex business logic, while SQL fits common computations. A complex algorithm in SQL may be executed with different query plans beyond manual control, which is not good for comparison, so we do not test such cases.

Category of use cases: several sets of use cases are designed for the test process according to table width, data types and computation amount.

Storage structure: Row-based storage.

Note: The Comparative Test Process for esProc, Hive, Impala Clusters is annexed to this report. For specific data structure, test code, test reproducibility and other contents, refer to this document.

3. Environment Description

Hardware:
       Number of PCs: 4
       CPU: Intel Core i5 2500 (4 cores)
       RAM: 16G
       HDD: 2T/7200rpm
       Ethernet adapter: 1000M

Software:
       OS: CentOS6.4
       JDK1.7
       Hadoop/hdfs 2.2.0

Test Objects:
       Hive  0.11.0
       esProc     3.1
       Impala    1.2.0

4. Data Description

Data scale is defined based on the exported files.

The file format that demonstrates the highest performance of each object is used: esProc uses its proprietary binary files, while Hive and Impala use text files.

4.1 Data Table and Associative Table

Fact Table T1
      
This is a wide table with 100 fields, used to simulate a fact case with many fields.
      
Fact Table T2
      
This is a narrow table with 11 fields, used to simulate a fact case with fewer fields.

The fact tables are the primary data source in this test, used in the scanning, grouping and joining processes.

Dimension Tables DL2, DL6, DD2, DD6, DC2, DC6
The dimension tables are only used to test the use cases for join (and multi-level join) operations. These dimension tables join with the fact table, and also with each other; their data amounts are smaller.
      
DC11 is a dimension table spanning nodes: it is too large to load into one computer's memory, so it must be segmented and loaded into the memory of several computers. The cluster loads DC11 across nodes to complete the computation.

4.2 Data Scale



2014年9月21日星期日

Examples of esProc as Used in Set Operations

Set operations are frequently used in statistical analysis of structured data: for example, listing all students who have published papers, listing all staff who have participated in every previous training, or selecting qualified examinees for a re-exam. Within esProc, sets are everywhere: the most commonly used data types, the sequence and the table sequence, are both sets. A better understanding and use of sets therefore helps complete data computations in a more reasonable and faster way.

For example, the table below contains some sales data:

Now we need to select the customers who were among the top 20 revenue contributors (Top 20 customers) in every month of 2013. To solve this problem we can first select all sales data for 2013, group them, and get the statistics for each month. Then we loop over the months to select the Top 20 customers for each. The intersection of the twelve monthly Top 20 lists contains the customers we want. Complex problems like this are difficult to handle in SQL or stored procedures.

With esProc, we can split complex problems into different steps, and do the computations step by step to get the final result. First, from the sales data we can retrieve those for 2013, and group them by month:

esProc's grouping of data is real grouping: it actually separates the data into different groups according to the criterion. This is different from SQL, where the "group by" clause can only return the aggregated result of a grouping. After grouping, the data in A3 are as follows:

Before grouping, the data are sorted automatically. Each group is a set of sales records; for example, the data for March are as follows:

To know the total sales revenue of each customer in every month, we need to further split the data by customer. In esProc, we only need to loop over the data of each month and group them by customer respectively. We can use A.(x) to loop over set members without writing loop code.

After this further grouping, the monthly data in A4 are sets of sets.

Now, the data for March is as following:
We can see that each group in data for March is the transaction data for certain customer.

The sets used in esProc differ from the mathematical concept: they are ordered sets, and can therefore meet statistical needs such as sorting and selection by position. Now we can find the Top 20 customers for each month:

In A5, loop over each month's data to get its Top 20 customers; in A6, list the names and monthly revenues of these customers. The computation result in A6 is as follows:

Finally, we can further solve the problem:

A7 generates the name list of the Top 20 customers for each month, and finally A8 computes the intersection of the twelve monthly Top 20 lists, as follows:
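For readers more familiar with Java, the whole procedure (group by month, total revenue per customer, take each month's Top 20, intersect the lists) can be sketched as below. The Sale holder class and the method names are illustrative assumptions, not part of the esProc script:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class Top20Customers {
    // Minimal holder for one sales record (illustrative)
    public static class Sale {
        public final int month;
        public final String customer;
        public final double amount;
        public Sale(int month, String customer, double amount) {
            this.month = month; this.customer = customer; this.amount = amount;
        }
    }

    // Customers ranking in the top n by revenue in EVERY month present in the data
    public static Set<String> topInEveryMonth(List<Sale> sales, int n) {
        // month -> (customer -> total revenue): the "group by month, then by customer" steps
        Map<Integer, Map<String, Double>> byMonth = new HashMap<>();
        for (Sale s : sales)
            byMonth.computeIfAbsent(s.month, m -> new HashMap<>())
                   .merge(s.customer, s.amount, Double::sum);
        Set<String> result = null;
        for (Map<String, Double> totals : byMonth.values()) {
            // this month's top-n customers by total revenue
            Set<String> top = totals.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toSet());
            if (result == null) result = new HashSet<>(top);
            else result.retainAll(top); // intersection across months
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```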

Loop Code for esProc

In esProc, loops are mainly implemented with the for statement, which repeats the code block whose main cell is the for cell. The for statement has several forms, listed below:

1. The for loop

An unconditional loop. The value of the main cell is, in turn, the count of the current iteration; the loop usually needs to be exited with a break command.

The unconditional (endless) loop is the simplest structure. For example:

In the above example, A2 executes an unconditional loop. Using the loop count in A2, B2 retrieves one employee record each time, and the rest of the code saves the information of Texas employees into B1. After the first 10 Texas employees have been retrieved, the loop exits via the break statement in C6. Alternatively, the break C command can be used instead, to jump out of the loop body whose main cell is C.


After execution, B1 contains the following results: 
Note that when using an unconditional loop, we must make sure the break statement can actually be reached; otherwise the program will never end.

2. for A

Loop over the members of sequence A; the main cell value is, in turn, the current member of A.

In esProc, the most common loop is over every member of a sequence. For example: 

In this example, all Texas employees are selected in A2. A3 loops over each Texas employee to compute their age, storing the largest value in B1. When the loop completes, B1 holds the maximum age of all Texas employees.

For relatively simple loops, a sequence loop function can sometimes achieve the same result. For example:

3. for n

Loop n times, with the main cell value as the current loop count.

In addition to looping over all members of a sequence, looping a specific number of times is also very common.

For example, suppose we have a piece of paper 1000mm x 800mm in size and 0.1mm thick. If we fold it in half ten times, what will be its final length, width and thickness?

In the above example, the folding loop runs ten times. Each fold halves the length and doubles the thickness; if the width then exceeds the length, the two are swapped. After execution, A1, B1 and C1 contain the final length, width and thickness of the paper:
The for n statement can also be seen as a simple form of for to(1,n), meaning "loop from 1 to n".
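The folding logic can be sketched in Java as follows; the fold method is a hypothetical name, and it mirrors the halve-length, double-thickness, swap-if-wider steps described above:

```java
public class PaperFolding {
    // Fold a sheet n times: each fold halves the length and doubles the thickness;
    // if the width then exceeds the length, swap them so "length" stays the longer side.
    public static double[] fold(double length, double width, double thickness, int times) {
        for (int i = 0; i < times; i++) {
            length /= 2;
            thickness *= 2;
            if (width > length) { double t = length; length = width; width = t; }
        }
        return new double[]{length, width, thickness};
    }

    public static void main(String[] args) {
        double[] r = fold(1000, 800, 0.1, 10);
        System.out.printf("length=%.2f width=%.2f thickness=%.1f%n", r[0], r[1], r[2]);
    }
}
```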

4. for a, b, s

Loop from a to b at a step of s; that is, loop over each member of to(a,b).step(s). If s is absent, the default is 1.

Sometimes the loop does not start from 1, or the step is not 1. We can then use the for a, b, s loop. For example: 

An n-digit number is called a daffodil (narcissistic) number if the sum of the n-th powers of its digits equals the number itself. The following example finds all three-digit daffodil numbers:

The loop code in A2 iterates over all three-digit numbers. After execution, the result can be seen in A1: 
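The same search can be sketched in Java; the three-digit case is hard-coded with cubes, matching the example above (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class Narcissistic {
    // A 3-digit number is a daffodil (narcissistic) number when the sum of the
    // cubes of its digits equals the number itself, e.g. 153 = 1^3 + 5^3 + 3^3.
    public static List<Integer> threeDigit() {
        List<Integer> result = new ArrayList<>();
        for (int n = 100; n <= 999; n++) {
            int h = n / 100, t = n / 10 % 10, u = n % 10;
            if (h * h * h + t * t * t + u * u * u == n) result.add(n);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(threeDigit());
    }
}
```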

5. for x

Loop while x is true. The main cell value is the computed value of x.

By specifying a loop condition, we can control when the loop ends. For example, the problem in section 1 can also be solved with a for statement with a loop condition:

In the main cell of a for x loop, the cell value is either true or false, the result of evaluating the loop condition each time, so it cannot be used as the loop count. Therefore, to retrieve employee records in B2, we use #C to get the current iteration number, where C is the main cell of the loop. After the code is executed, the result is the same as in section 1.

6. for cs,n:x

Loop over cursor cs, retrieving n records each time, or retrieving records until the value of x changes; each retrieval returns a table sequence. The cursor is closed after reading.

When retrieving data with a cursor, we can use the for statement to loop over the data in the cursor. 

7. Nested loop

A loop body may itself contain loop statements; these are called nested loops, or multi-layer loops. For example:

In the example, A1 stores the sequence of primes found so far, and A2 loops over the integers from 2 to 1000. The prime sequence is looped over in B2. If the number in A2 is divisible by some prime, the command next A2 moves on to the next number in A2. If the number in A2 is not divisible by any prime found so far, a new prime has been found, and it is appended to the sequence in A1. The next C command in D2 skips the rest of the loop body and enters the next round of the loop whose main cell is C. After execution, A1 contains the primes below 1000:
When using the next statement, the main cell C may be omitted; in that case the deepest loop containing the next statement is the one skipped.
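A rough Java analogue of this nested loop uses a labeled continue where esProc uses next A2 (the label and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class Primes {
    // Trial division against the primes found so far; the labeled continue plays
    // the role of next A2, and falling through the inner loop appends a new prime.
    public static List<Integer> primesUpTo(int limit) {
        List<Integer> primes = new ArrayList<>();
        candidate:
        for (int n = 2; n <= limit; n++) {
            for (int p : primes) {
                if (p * p > n) break;               // no divisor can exist beyond sqrt(n)
                if (n % p == 0) continue candidate; // composite: try the next n
            }
            primes.add(n);
        }
        return primes;
    }
}
```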

Consider the classic hundred chickens puzzle: a rooster costs 5 dollars, a hen 3 dollars, and three chicks cost one dollar. If we buy exactly 100 chickens with exactly 100 dollars, how many roosters, hens and chicks do we actually buy? We can solve the puzzle in the following way:

In the code, A2 loops over the possible number of roosters. B2 judges: if the roosters alone cost more than 100, there are too many, and the break command in C2 ends the loop. B3 then loops over the possible number of hens. If the current total price exceeds 100 in D4, there are too many hens, and next A2 increases the number of roosters and tries again. In D5, when the current total price is still less than 100, we continue to increase the number of hens. The next statement here skips out of the deepest loop, which has B3 as its main cell.
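A Java sketch of the same search, with a labeled continue standing in for next A2 (names are illustrative; prices are scaled by 3 to stay in integers):

```java
import java.util.ArrayList;
import java.util.List;

public class HundredChickens {
    // Buy exactly 100 birds for exactly 100 dollars: roosters cost 5, hens 3,
    // and chicks are 3 for a dollar. Each solution is {roosters, hens, chicks}.
    public static List<int[]> solve() {
        List<int[]> solutions = new ArrayList<>();
        roosters:
        for (int r = 0; r * 5 <= 100; r++) {        // too many roosters ends the search
            for (int h = 0; r + h <= 100; h++) {
                int c = 100 - r - h;                // the rest must be chicks
                int cost3 = r * 15 + h * 9 + c;     // total cost times 3
                if (cost3 > 300) continue roosters; // over budget: next rooster count
                if (cost3 == 300) solutions.add(new int[]{r, h, c});
            }
        }
        return solutions;
    }
}
```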