2014年7月31日星期四

An Illustration of Processing Big Text Files with esProc Cursor

esProc can process big text files conveniently by providing cursor data object. The following example is to illustrate this.
Let’s assume that there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now compute each salesman’s total Amount of big orders in the past four years. The big orders refer to those whose amount is above 2000. 

esProc code


Code interpretation
A1: If all the ten million records are read into memory simultaneously, memory will overflow. So the job will be done in batches.
A2: Read by looping, 100,000 rows at a time.
B3: Filter each batch of data, select those records whose amount is above 2000 after the year of 2011.
B4: Group and summarize the filtered data, seek the sales of each salesperson in this batch of data. 
B5: Add the computed result of this batch of data to a certain variable (B1), and move on to the computation of next batch of data. 
B6: After all the computations, sales of each salesperson in every batch of data will be found in B1. Last, group and summarize these sales data and seek each salesperson’s total sales amount. 

Analysis
In cell A1, esProc cursor is created with function cursor. The cell name is the cursor's variable name. When the cursor is created, data will not be read in the memory directly. Read-in will only be executed while fetch operation or other equal operations are going on, e.g., the code for A1,100000 in cell A2 represents reading data from cursor by looping with 100,000 rows at a time. We can see that the data size in memory is always kept in a relatively small level and no overflows will occur.
select and groups are computation functions specially used with structured data. After the data is read in the memory with esProc cursor, they can be processed and analyzed by employing functions of professional structured data computation library. This is more convenient than writing underlying code by hand.   
Equipped with functions and grammar of semi-structured data processing, e.g., function for data split and merging, looping and traversal statement and branch judgment statement, esProc cursor can do complex task of data cleansing and arrangement and form easily computed structured data.    

Splitting and analyzing
For instance, the format of weblog is too complex to be computed and analyzed directly. A typical web blog text need to be transformed into a two-dimensional table of standard format in order to be used in structured data computation or be stored in a database.   
A record in the original weblog: 
10.10.10.145 - - [01/May/2013:03:24:56 -0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 "http://www.xxx.com/xxxx.html""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" 0 tankLee in 9fc0792eb272b78916c3872e9ad –
A record in the original weblog: 

The following case omits the process of file accesses and the final merging of multiple batches of data(refer to the previous example), and lists the code for splitting and analyzing directly. 
 

Data cleansing
Let’s see a typical example of data cleansing. Since the employee table read in from a file is not a standard format, it need to be reorganized into standard structural data in batches. Data of the current batch will be stored in cell D3 temporarily. The rule for reorganizing is:
1.The record is invalid if UserID and firstName is null or blank string. 
2.UserID can only contain digits; the record is invalid if letters appear in it. 
3.For repeated UserID, only the last entered record is kept.
4.Delete possible spacing before and after the data.
5.Capitalize all the first letters of firstName. 
6.Full name is combined in the form of firstName+”.”+”lastName”. But, if lastName is null or blank string, fullname equals to firstName.

The following table also omits the process of file accesses and the merging of multiple batches of data, and only lists the code for data cleansing: 


Importing Excel Data into Access with esProc

In daily work we have frequent use of text data or spreadsheets, we need to import these data into database for further statistical analytics. For this task, esProc is a very handy tool.
In the following example, we will import Excel data into an Access database, to demonstrate how to migrate text data into database with esProc.
In the directory “D:\files\BoxOffice” we stored some box office data for movies. They are Excel files with the extension of “xlsx”. The first row of the Excel files contains field names, such as:
Now we need to store each filesin Access database, for better data analytics.
Within esProc we need to establish a data connection for Access first, either through system ODBC data source or direct use of mdb or accdb files:
 
Connect to Access data source, and then we need to find the list of files to be imported:
 

Once we get the data file list, we can then run a loop to read the data from each Excel file and import them to Access:

 

When processing each file, we must first read the datasheet, and use its file name (without file extension) as the name for tables in Access database. In order to make the data and table names more in line with the standards for database, spaces in table name will be replaced by "_", and possible blank data records will be deleted. Then, according to the structure and content of data tables, table creation statements will be generated in B5, and data update statements will be generated in C5. When generating the statements, field names and data in the first row must be read, and determine the data types:


Line 17 executes table creation statements. At this point, we must consider the situation where a table with the same name could already exists in the database. If this is true, A1.rollback () must be executed to do a rollback.
Line 19 imports the data from datasheets to Access database.
When the loop is done, we need to close the data source to avoid the existence of too many connections:

From this example, we can learn how to meet complicated requirements in esProc with simple codes. In particular, the same approach works not only for Excel files and Access database, but also for txt/xml files and various databases.
There are also some other approaches to import text data into the database. For example, the most primitive method is manual data input, which is obviously time consuming, laborious, boring and error prone. Programmers often solve these problems by coding. However, coding with common high-level languages (Java, C#) or scripting languages (Perl, Python) means a lot of workload, and is quite difficult to complete. Of course, Excel and Access are all Microsoft Office products. You can import Excel data directly from inside Access, with Excel as an external source. However, every time you can only import a single file, which makes it too troublesome when there are too many files to be imported. Also, this only works for Access database. In contrast, when you need to import batch text data into database, esProc is a nice tool.

Solve A System of Linear Equations with N Unknowns with esProc

For a system of linear equations with n unknowns, the n unknowns require n equations to form an equation set. e.g.: 
    3x+4y+5z=26
    5x+6y+10z=47
    4x+8y+7z=41
Then, how do you solve this system of linear equations with n unknowns with esProc? First, the system of equations should be written in a standard form, and parameters are filled in a text file. Parameters in the same line are separated with Tab. As listed below: 

How to solve a system of linear equations with n unknowns by programming? Elimination by addition and subtraction is a common method. See the following system of equations:
    a11x1+a12x2+a13x3+…+a1nxn=a10
    a21x1+a22x2+a23x3+…+a2nxn=a20 
    a31x1+a32x2+a33x3+…+a3nxn=a30
    ……
    an1x1+an2x2+an3x3+…+annxn=an0

If a11 is not equal to zero, we multiply each term of the first equation simultaneously by a21/a11,  subtract the first equation from the second one, then the coefficient of x1 in the second equation becomes zero. If a11 equals to zero, we need to find an equation in which the coefficient of x1 is not zero, and make it the first. In this way, we can eliminate all coefficients of x1 in the equations after the first one. See below:
    a11x1+a12x2+a13x3+…+a1nxn=a10
    0+b22x2+b23x3+…+b2nxn=b20
    0+b32x2+b33x3+…+b3nxn=b30
    ……
    0+bn2x2+bn3x3+…+bnnxn=bn0

Now let’s look at the case of b22. In the same method, we continue to eliminate coefficients of x2 in all equations after the second one... We do this continuously until the coefficient of xn-1 in the nth equation is eliminated. At this point, the system of equations will be like the following: 
    a11x1+a12x2+a13x3+…+a1nxn=a10
    0+b22x2+b23x3+…+b2nxn=b20
    0+0+c33x3+…+c3nxn=c30
    ……
    0+0+0+…+mnnxn=mn0
And we can solve it step by step from the nth equation:
    xn=mn0/mnn
    ……
    x3=(c30-…-c3nxn)/c33
    x2=(c30-b23x3-…-b2nxn)/b22
    x1=(c30-a12x2-a13x3-…-a1nxn)/a11

If, during the calculation, zero appears in parameters like a11,b22,c33, the system of equations has no solution or infinitely many solutions. 
The following shows how to solve a system of linear equations with n unknowns with esProc:
 

In B1, data is read from text files; parameters of system of equations is transformed into sequence groups in A2: 
 
C1 seeks the number of equations in a system of equations, then the result gets looping execution in A3, and coefficients of xn will be gradually eliminated with the method of elimination by addition and subtraction. During looping execution in B4, we look for the first equation in which the coefficient of xk is not zero from the kth equation; if we cannot find one like this, it means that the term cannot get eliminated. Among the code in line 7, move the equation in which the coefficient of xk is not zero to line k. In line 8 and 9, we eliminate coefficient of xk in equations starting from line k+1. 

Having done the elimination of every step, we can start to work out solution gradually from the last equation in A10, during which a sequence called result will be used to store solutions.If there is no solution or infinite many solutions, return Error. Considering that there could be errors with double-precision number during computation, the solutions will be kept to four decimal places. 
Result can be checked in A13 after all computations: 
 
If new system of equations is required to be solved:
    7x1+2x2+9x3-x4=0
    2x1+9x2-x4=0
    9x1+11x3-x4=0
    x1+x2+x3 =1

We only need to modify text EquInput.txt:
 

With the same program, we compute and read result in A13. 
 

In addition, list of parameters is unnecessary to read-in with text. Instead, it can write directly in esProc. The computing progress is similar to that we’ve explained.

2014年7月30日星期三

Differences and Similarities between esProc TSeq and SQL Data Table

esProc TSeq and SQL data table are all the structured 2-dimensional data objects. The records, indexes, and primary keys can all be applied to the structured data computing. Although both of them can be applied to the structured data computing, their application scenarios differ distinctly. TSeq is fit for solving the complex computing problem involving not-so-big data volume, while data table is fit for the common computation involving possible big data volume.
Their differences are determined by the underlying mechanism.
TSeq is ordered. Each record and data in each column has a specific sequence number. TSeq supports the explicit set, and set operations can be performed directly between multiple TSeq; TSeq is also the generic set. The basic element can be numeric value, reference, or another set.
SQL data table does not have the above features. But SQL data table is transparent to the memory and external storage. A consistent syntax is used to access to the memory, external storage, or blended data.
In the below sections, we will further discuss their differences and similarities.

Common basic functions

Both esProc TSeq and SQL data table are the structured 2-dimensional data objects. In other words, they are row-style 2-dimensional table based on records with one row for one record. The 2-dimensional table together with multiple column names forms a complete data structure. Because the structures are largely similar, their basic usages are almost the same.
Example 1:Query the data in the object. Find the orders in which the number of Freight is greater than 100 and placed before 2013.
SQL: SELECT * FROM Orders WHERE Freight > 100 AND OrderDate<'2013-01-01'
TSeq: =Orders.select(Freight > 100 &&OrderDate< date("'2013-01-01"))
Note:In this example, the data object name is Orders, and another data object of Customers will be used in the subsequent sections.

Example 2: Sort. Sort the orders by EmployeeID in ascending order, then sort by Freight in reverse order.
SQL: SELECT * FROM Orders ORDER BY EmployeeID ,Freight DESC

Example 3:Grouping and summarizing. Group by employee, summarize the freight charge, and count the orders.
SQL: SELECT EmployeeID, COUNT(OrderID), SUM(Freight) FROM Orders GROUP BY EmployeeID
TSeq: = Orders.groups(EmployeeID;sum(Freight),count(OrderID))

Example 4:Join. Join the two data objects Orders and Customers to form a new data object. Use the left join and the join field is CustomerID.
SQL: Select * from Orders left join Customers on Orders. CustomerID =Customers. CustomerID
TSeq: =join@1(Orders:, CustomerID;  Customers:, CustomerID)

Besides the above several basic usages, esProc TSeq and SQL data table are very alike in the algorithms like distinct, count, sum up, average, maximum, and minimum, just a few example here.

Difference of being in order

Thanks to the ordered record set of TSeq, the order-related computing can be solved easily. SQL data table lacks the sequence number and sequence-number-related access method, making it a bit inconvenient to handle the order-related computing.
Example 1: As for the sales data object, compute the sales increment of this month compared with that of the previous month.
SQL:
select salesAmount, salesMonth,
     (case when
prev_price ! =0 then ((salesAmount)/prev_price)-1
else 0
end) compValue
from (select salesMonth, salesAmount,
lag(salesAmount,1,0) over(order by salesMonth) prev_price
from sales) t
TSeq:
sales.derive(salesAmount / salesAmount [-1]-1: compValue)

Comparison:
The monthly sales have nothing to do with orders. It is represented as salesAmount in the TSeq and the data table. The one related to the order is the Sales amount of previous month, which is equivalent to the salesAmount of the previous record relative to the current record. TSeq is ordered, sosalesAmount[-1] can be used to represent the sales of previous month. By comparison, SQL data table is not ordered. Since SQL2003 standard was introduced, the order-related features are added to the window functions. It is so tedious to use the unavoidable  complex method lag(salesAmount,1,0) over(order by salesMonth) to compute the sales of the previous month.

TSeq can also be used to represent the relative interval, for example, the total five months comprising the current month, the two months before, and two months after the current month. With TSeq, it can be represented like this: salesAmount{-2,2}. In SQL, the window function can also be used to represent the summarizing, but much more troublesome.

Example 2: For the sales data table, find the first 10 records with the highest sales for each product.
SQL:
select salesMan, product ,amount
where ranking <=10
TSeq:
= sales.group(product).(~.top(-amount;10))
Comparison:
In this example, the most intuitive thinking pattern to implement is to group data by product, and then perform the order-related computations in the groups. The simplest method is to retrieve the first ten records with the highest amount in the group. The intuitive method is to sort the data in the group in reverse order by amount, and then get the records whose sequence numbers in the group is from 1 to 10.
TSeq supports the order-related computation quite well. For TSeq, not only the function top can be used to implement the first algorithm, as demonstrated in the example, but also the sort function and the sequence number of record can be used to implement the second algorithm, as shown below: =sales.group(product).(~.sort(Amount:-1)).(~([to(10)])).
The record in the SQL data table is not ordered. We must compute out a sequence number or field to act as the sequence number, such as ranking. The algorithm adopted in the above example is to compute the data rankings in the group, and then get the records among the top 10. Obviously, SQL syntax is a bit zigzag and difficult to understand. SQL users will have to use the window function over (partition by…… order by……) and the sub-query which is hard to track and debug.
By comparison, TSeq is more straightforward and simpler for easy track and debug. For another example, programmers can firstly write =sales.group(product) code for test. This code represents that the data will be grouped. This code snippet can run and display the result independently. If the grouping results meet your expectation, then programmers can proceed to add the second paragraph of codes: sort the data in the group in reverse order, that is, .(~.sort(Amount:-1)). In which, the "." indicates the former computing result will be taken as a whole for further processing, "~" indicates the data in the current group, and “-1” indicates the reverse order. The current code line is =sales.group(product).(~.sort(Amount:-1)), which can run and display the result independently. If the grouping results meet your expectation, then programmers can proceed to add the third paragraph of codes: retrieve the records whose sequence numbers in the group is 1-10, i.e. .(~([to(10)])).
As can be seen, the computational procedure of TSeq can be pushed forward step by step. In solving the complex computing problems, the computation can be simplified to avoid errors. In facts, the above three continuous paragraphs of code can be written into three rows of codes step by step, so as to decompose the computing goal more clearly:
        

        
One thing worthy of noticing is that SQL cannot perform data sorting before order-related computing,even with the temporary table.The two steps must be combined into one step. The reason for this is that the data table does not support the explicit set and generic set. It is rather weak regarding its syntax representation.
In addition, although SQL uses the window function up to the ANSI standard, the database vendor does not implement it fully to the standard. The writing styles vary for various databases and some databases does not provide the window function at all. The function syntax of TSeq is independent of data source. For whatever data sources such as database, TXT file, Excel file, and binary file, programmers are not required to modify the code when using TSeq for computing.

Difference between explicit sets

Although SQL has the concept of set, the explicit set is not provided, and cannot be taken as the independent variable. The set operations can only be implemented with temporary table. TSeq is the true explicit set for implementing the set operation.
Example: Compute over the Contract data object. Suppose the contract with over 40 order quantity is the big contract, and the unit price over 2000 is the important contract. Please find the contract settled in this year which are both the big contract and the important contract, and all other contracts except these contracts.
SQL:
         select SellDate,Quantity,Amount,Client from Contract where to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000
         select SellDate,Quantity,Amount,Client from Contract where not(to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000)
TSeq:
         =thisYear= Contract.select(year(SellDate)=2014)
         =big= Contract.select(Quantity>40)
         =importance = Contract.select(AMOUNT>2000)
         =answer=thieYear^big^ importance
         =others= Contract\answer
Comparison and Analysis:
The typical natural thinking pattern is to find the contracts that are big contract and also the important contract of this year. The intersection operation is most intuitive. If the big contract is defined as big, the important contract is defined as importance, and the contract in the current year is defined as thisYear, then we can write the pseudocode easily: big∩importance∩thisYear. TSeq is the explicit set capable of representing the equivalent expression very intuitively, i.e. thieYear^big^importance. SQL does not allow for representing it with the variables for set. So, we have to find another way, for example, converting it to the Boolean condition, as demonstrated in the example: to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000.
It is easy to solve the first problem because the respective development difficulty with these two tools are not great. With the further research into the problems, their differences become evident.
Problem 2: “Other contacts except for those big and important contracts” - it is also a typical thinking pattern - the computing goal can be achieved with difference set in one step. TSeq expression:Contract\answer, very intuitive. Using Boolean conditions, SQL can also be used to compute out the answer. But the expression writing style and business description vary greatly, as shown below: not(to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000).
         With SQL, we can solve it using the set operation, and the algorithm is also quite intuitive. However, the code may appear very lengthy since the data table cannot be represented with the set variables.
(select SellDate,Quantity,Amount,Client from Contract)
minus
(Select select SellDate,Quantity,Amount,Client from Contract from(
    (select SellDate,Quantity,Amount,Client from Contract where to_char(SellDate,'yyyy')='2012')
    Intersect
    (select SellDate,Quantity,Amount,Client from Contract where quantity>=40)
    Intersect
    (select SellDate,Quantity,Amount,Client from Contract where AMOUNT>=2000))

Because the code is lengthy, many people would rather use Boolean condition to implement the set operations indirectly.
Undoubtedly, in most cases, it is more convenient for data table to use operation set than Boolean condition, for example, the set operations between multiple physical tables, or the set operations between multi-level sub-queries. In this case, the cost would be high if converting the set operation to the Boolean conditions, and programmers have to adopt the lengthy set operation.

Difference between generic sets

TSeq is the generic set to store both the physic data and the reference to the associated data. With this advantage, TSeq can implement the associated computing through the intuitive object reference. By comparison, the data table can only be used to store the physical data. The equivalent computation can only be completed using the associated complex statements.
Example:Please compute which department managers received the President Award, and of these managers, which subordinates are the annual outstanding employees. This involves two data object:department and employee, in which, the deptName field of department and the empDept field of employee is in the one-to-one relation. Still, the manager field in department and the empName field in employee is also the one-to-one relation. In addition, the code for president award is PA; and the code for annual outstanding employee is EOY.
Data table:
    SELECT A.*
    FROM employee A,department B,employee
    WHERE A.empDept=B.deptName AND B.manager=C.empName AND A.empHonor=‘EOY’ AND C.empHornor=‘PA’
TSeq:
    employee.select(empHonor: "EOY",empDept.manager.empHornor:"PA")
Comparison and Analysis:
The SQL solution can definitely work out.But the relevant associating statement is fairly complex that the average programmer could feel hard to comprehend. By comparison, TSeq solution is rather intuitive, empHonor:"EOY"is one of the conditions: Who are Year Outstanding Employees, while empDept.manager.empHornor indicates the “the department.the department manager.awards received by the department manager”. Obviously, if this value is PA, then the condition 2 in the problem is met: "Department managers who received the president award". This is the object reference by definition.
The object reference enables programmers to use "." operator to make reference to the related data. By this way, the business association relation can be translated into the computer language intuitively, so that multilevel relations can be represented conveniently, and the associating computation can be performed intuitively.

Differences regarding Memory and External Storage Transparencies

SQL
Because SQL data table does not support the generic and set data,no data will be lost when writing the data from memory into the external storage. Thus, it is transparent to the computation over memory and external storage. On accessing the data table for the first time, data are usually from the external storage; then, for the subsequent accesses to the same data table, data can be from the memory cache; for the data table with relatively large volume of data, part of the data can be from the external storage, and part from the memory. No matter the data is from memory or external storage, and data volume is big or small, the syntax for retrieving data table is always the same, and programmers need not to write different SQL statements for it.
TSeq
TSeq supports the generic type (reference in particular) and set data. Data could lost when writing the data from memory to the external storage, and can not always ensure to be loaded properly, which making the computation intransparent to the memory or external storage. TSeq is the pure memory data object, and is only capable of handling limited data volume; If the data volume is relatively big, then cursor (another data object of esProc) should be used for the computation in external storage, with different syntaxes for the cursor and the TSeq; esProc programmers also must convert the data between cursor and TSeq if they want to improve performance or handle the computation involving complex business logics.
Comparisons
TSeq is not transparent to the memory and external storage. esProc programmers need to write different codes to meet the requirement of computing in memory, external storage, or mixed computing. In addition, they need to modify the codes to meet the needs of data volume increase. Therefore, the workload for the initial design and post-maintenance is relatively great. SQL data table is transparent to the memory and external storage. Programmers only need to write one set of code to fit various scale of data. The designing and maintenance workloads are relatively small.

Through the above comparison, we can conclude: Characterized with the ordered data, explicit set, and generic set, TSeq can easily solve the order-related complex problem and reduce the complexity of set operations, with support for the intuitive object reference to handle the complex multi-table association. SQL data table is transparent to the memory and external storage, and SQL code is commoner.


An Illustration of esProc’s Parallel Processing of Big Text File

esProc can parallelly process big text files conveniently. The following case will illustrate its operating method.
Let's assume that there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now compute each salesman’s total Amount of big orders in the past four years.The big orders refer to those whose amount is above 2000. 
To do parallel process, first we must segment the file. esProc provides the cursor data object and its function, which segment big text files conveniently and read easily. Take file(“e:/sales.txt”).cursor@tz(;, 3:24) for an example, it means, basically, the file isdivided averagely into 24 segments by bytes, then the third one will be read. By doing so, half line of data, or half line of record, will occur, and further programming and processing are required. However, we have to traversal across all the previous rows if segmenting the file by rows, which cannot reach an expected high efficiency by adopting segmenting algorithm and parallel processing. Unlike these two types of solution, esProc can do rounding off job automatically to ensure the data’s validity when segmenting the file. 
With esProc at hand, only simple parallel processing will be required after the file is segmented. The code is as follow: 
Main Program 
 
A1: Set the number of parallel tasks as 24, that is, divide the file into 24 segments.
A2: Call the subprogram to do multi-threading parallel computation. There are two task parameters: to(A1),A1. The value of to(A1) is [1,2,3…24], which corresponds the serial number of each segment; A1 represents the total segments. When all the tasks is done, the computed results will be uniformly stored in A2.
A3: Merge the computed result of each task in A2 according to Sellerld.
A4: Group and summarize the merged results to seek each salesman’s amount.

Subprogram
 

Both segment and total, which represent respectively the current segment and total segments, are parameters of the subprogram. Say, the value of segment’s parameter is 3 and that of total is a permanent 24 in task 3. 
A1: Read the file with cursor. Note that which segment should be processed in current task should be decided according to parameter delivered by main program.
A2: Select those records whose order amount is above 2000 after the year of 2011. 
A3: Group and summarize the filtered data.
A4: Return the task’s computed result to main program. 

Code description
For CPU with N cores, it appears natural to set N tasks. The fact is that some tasks always operate faster than others, say, according to the differences of filtered data. So the common situation is that some cores are in an idle state after finishing those faster tasks, but a few other cores are still in operation with slower tasks. Comparatively, if each core performs multiple tasks in turn, the speed of different tasks will be averaged out and the general operation will become more stable. In the above example, therefore, the task is divided into 24 parts and given to CPU’s 8 cores to process (how many parallel threads are allowed at the same time to configurate in the environment of esProc). But, on the other hand, too many segmented tasks may have weaknesses. One is the decline of whole performance, the other is that there will be more computed results produced by each task when added together, and they will occupy more memory.
callx makes develop process simpler by encapsulating complex multithread computation, which enables programmers to focus on business algorithm instead of being distracted by complicated semaphore control. 
Computed results of A3 in the above main program have been sorted automatically according to Sellerld, so it is not necessary to sort again in A4 when grouping and summarizing. @o, the function option of groups, can group and summarize efficiently without sorting.  

Further illustration
Sometimes when the data size of a text file amount to several TB, it is required to use multiple node based on cluster to make parallel computation. Now esProc can do parallel computation easily. Because its cursor and relative function support non-expensive scale-out and distributed file system (DFS). As to the above example, all that is needed is a node list added to A2 in the main program. It will be like this: =callx("sub.dfx", to(A1),A1; ["192.168.1.200:8281", "192.168.1.201:8281", ”......”]). 

Basic Data Type in Data Processing Programming Language

Programming languages focus on various basic data types, subject to their different design goals. Languages such as Java and C# are designed to develop the common applications. Their basic data types are character strings, number, boolean, and other atomic data type, array and common object. SQL, PowerBuilder, R, esProc, and other alike languages are designed to process data. So their basic data types are the structured 2-dimentional data sheet object. Take this SQL statement, for example,SELECT T1.id,T1.name,T1.value FROM T1 LEFT JOIN T2 ON T1.id=T2.id. Of which, the T1, T2, and the computed result just use such data type. With the multiple fields to form one record and the multiple records to form the 2-dimentional data, the combination of such data and its field name is the structured 2-dimenional data table object.
Why not use the atomic data type and the common object as the basic data type for the data processing languages? If representing the T1 and T2 from the above-mentioned SQL statement with the array or ArrayList object, you will find: The complexity will increase for several times, and the length of codes will also increase sharply for dozens of times.
The basic data types of data processing languages are the structured 2-dimentional data table object. This is not a coincidence, but there are subtle reasons instead.

Correspond to actual business. In the real world, most business data is the structured data. As an example, the Payroll list has the employee number, employee name, department, date, pre-tax salary, and post-tax salary; For another example, the retail record has the order time, outlet number, checkout counter number, cashier number, product name, and unit price; The last example of business data is the Website log, which comprises the browse time, URL, visitor IP, browser version, and other properties. These properties are equivalent to the field. Each of the records has the same structure. Though they are stored in text while not the database, they are actually still the structured data in nature. So, it is only natural to use the 2-dimentional data table to represent it. The structured 2-dimentional data table object can be used to represent the business data intuitively. Representing the actual business in the most faithful way, no matter the storage, computing, exchange or sharing. Such kind of data is the easiest for users to understand in a most convenient way.

Easy for massive processing. Business data are mostly the data of the same structure, for example, the Payroll table, Retail record, and Website log mentioned above. In processing such data, in some cases, we will handle a certain data of a certain record, but in most cases, we take a certain record as unit to process all data, for example: Compute the after-tax wage based on the pre-tax wage; Compute the amount based on the unit price and quantity of commodity. Count the daily on-line duration for each IP. The above-mentioned processing mode is just the massive data processing. To implement the batch processing, we can traverse every member of array in loops by row number and column number just as the operations for Java. Alternatively, we can operation on the data with the business field name directly as we would do for SQL and esProc. The latter resolution is simpler and easier-to-use without having to write loop statements. Programmers can thus operate on data intuitively from business perceptive, and the corresponding code become more concise and readable.

Compatible with the Relational Algebra. The relational algebra is the underlying theory developed for data processing and query. By which, the association and laws of operations among business data can be expressed in full details using the basic operation along with the join operation, aggregation operation, and division operation. Theoretically, any computation problem of any degree of difficulty can be implemented and solved by relational algebra in the respects of data processing and data query. Because the relational algebra is concise and complete, databases are largely designed based on this theory. E.F. Codd is thus called as the father of relational database. The structured 2-dimentional data table object is just the data type recommended by E.F. Codd. This data type can be used to express various operations of relational algebra, so as to solve the computation problem in data processing easily. In facts, the database result set is the earliest structured 2-dimentional data table object.

As can be seen, all kinds of programming languages adopt the structured 2-dimeintal data table object as the basic data type because it is corresponding to the real business data, and easy to implement the massive computation, making it compatible to the relational algebra theory. With the 2-dimentional data table object, codes can be simple and easy to understand, and the development efficiency is improved. Let me explain it with a few more examples below:

Result set of SQL (resultSet): Group by the book type to compute the average price of the books whose average price is greater than 15 yuan.
  select avg(price),type from books group by type having avg(price)>15

Table sequence of esProc (TSeq): Group by department to find the top 10 best sellers for each department.
  products. group(department). (~.top(quantity;10)

Data window of PowerBuilder (datawindow): Sort the order by price
  Order.SetSort('value d')
  Order.Sort()

R language data frame (data.frame): Left-join the orders table and customer table by customerID.
  merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

SQL, esProc, and R code comparison: Group the order data by department, and summarize the order data and sales amount of each department.
         
SQL:  Select count(*),sum(sales) from orders group by Dept
esProc:  orders.groups(Dept; count(~), sum(sales))
R language:
result<-aggregate(orders$ sales,list(orders $ Dept),sum) 
result$count<-tapply(orders $ sales, orders $ Dept,length)
        
Let's take a close look on result set, table sequence, data window, and data frame. Although they are all structured 2-dimentional data table objects with basically the same function. There are some slight differences between them.

SQL result set is rich in various materials, widely applied, universal, and simple to use. It is the top mainstream data type of all data processing languages. However, SQL did not implement the relational algebra to the full, making it a bit inconvenient for some computations, such as the set division.

DataWindow usually retrieves number from SQL, and return the final result to the database. It mainly servers the purpose of breaking through any barrier between the data and UI controls, so that programmers can design and deliver the database application with high interactivity soon. Another major function of DataWindow is to render and edit data. It can be only used for form computation, and the data processing capability is relatively poor.

Data Frame is capable of handling the structured computation to some extent. As can be seen from the above example, its syntax is obscure, and it is relatively complex to implement the same functions with it. This is because the major functions of R are scientific and statistical computing, focusing on the data types of array and matrix. As a additional data type, the data frame was later introduced to implement the structured data computing. Considering this point, data frame is not so dedicated as that of the other three tools.

TSeq is quite dedicated in data processing. Having incorporated all common strong points of SQL result sets, TSeq can fully and completely implement the relational algebra. TSeq is generic and sorted, especially fit for the order-related complex computing in data processing, for example: Yearly link relative ratio, year-on-year comparison, ranking, relative position computing, and interval computing. TSeq is also charactered for it is generic, and easier to establish relations between data and provide access to data with multi-level association easily by object. Compared with SQL, TSeq is unable to directly process the big data because it is the pure memory object.

As can be seen, the structured 2-dimeintional data table object is directly related to the degree of dedication for the data processing languages. The more powerful the former one is, the higher the degree of dedication for the latter one would be, and vice versa. If a programing language lack the structured 2-dimentional data table object, then this language can hardly be regarded as a dedicated one in processing the data. To research and examine if a programing language can be used to develop any application for data analysis and processing efficiently, the key is to find out if it offers the dedicated 2-dimentional data table object and the appropriate class library.

Perl is often used to retrieve character string and is capable of processing the data to some extent. However, since its code is lengthy and complex, it is not the dedicated data processing language. For example, to complete the simplest algorithm of grouping and summarizing, the code of Perl is shown below:
  %groups=();                       
  foreach(@carts){             
    $name = $_->[1];
    if($groups{$name} == null){    
         $groups{$name}=[$_];
    }
    else{
         push($groups{$name},$_);     
         }
  }
  my @result=();                           
  foreach( keys(%groups)){        
         $value=0;
    while($row=pop $groups{$_}){        
         $value += $row->[2];                
    }
    push @result,[$_,$value];
  }

Python is a bit simpler to write, but far more inefficient than SQL, esProc, and R in developing by order of magnitude. The sample code is shown below:
  result=[]
  for key, items in groupby(data, itemgetter(0)):      
    value1=0
    value2=0
    for subitem in items:                                            
        value1+=subitem[1]
        value2+=subitem[2]
    result.append([key,value1,value2])                  
  print(result)

Perl and Python is not the dedicated tools on data processing. Most importantly, they lack the structured 2-dimentional table data object.



The TSeq of esProc is not only the structured 2-dimentional table data object, but also is characterized with its being order, generic, step-by-step computation, making it more dedicated than other alike languages. For example, to implement a relatively complex computational goal: Find the shares having been rising on 5 consecutive days. esProc solution code is shown below:

2014年7月29日星期二

Configuration for esProc’s Access to Databases

esProc supports multiple heterogenous data sources, one of which is database. We’ll illustrate how esProc access databases with a few examples.

esProc can connect to database's jdbc driver, or to a database through jdbc-odbc bridge. Because of the problem of copyright, programmers who use esProc need to prepare jdbc or odbc driver of a database by themselves. When jdbc jar package driver is prepared, it need to be put into /common/jdbc of esProc's IDE installation directory, e.g., the directory C:\Program Files (x86)\MicroInsight\common\jdbc.

ODBC interface configuration of esProc's integrated development environment is as follows:


esProc's integrated development environment provides jdbc configure prompts of multiple databases, including SQL server, Oracle, DB2, Sybase, Access, mysql, hsql, teradata, postgres, etc. If the database waiting to be connected doesn't fall into the list, it can be added with other type. The interface configuration is as follows:
 

After the jar package driver is prepared and configuration is finished, the database will be conveniently connected within IDE and table data will be fetched:
 

In the above figure, cell A1 is connected to a hsql database named demo. Cell A2 uses SQL statement to query the table of employee information, and stores the table in this cell, which is a variable, as esProc's table sequence; arg1 is an input parameter. Cell A3 closes database connection. Cell A4 returns the query result outwards. The red box in the esProc IDE's bottom right corner displays demo's table name and field name, which is convenient for programmers to write SQL statement.

As the other functions provided by esProc, query function contains options and parameters. Take the expression query@1("select * from employee") as an example, @1 represents that 1 option is used, and by looking up the function reference, only the first record fetched by SQL statement will be returned. There are parameters in the parentheses. In the above figure, there is only a string of SQL statement and no other parameters in the parentheses, which shows that all other parameters have been set by default.

The cellset in the above figure can be integrated in Java application, and, acting as an esProc jdbc driver, it could be called by Java program . Steps include:
1. Prepare dfx file.
Save esProc program as test.dfx.

2. Deploy esProc jars . 
Putnecessary jars for calling esProc program in classpath of Java application. They could be put in WEB-INF/lib directory for a web application. These jars are located in the esProc IDE’s installation directory \esProc\lib, which includes:
    dm.jar esProc computing engine and JDBC driver
    poi-3.7-20101029.jar process reading and writing of Excel
    log4j_128.jar process logs
    icu4j_3_4_5.jar process internationalization
    dom4j-1.6.1.jar analyze configuration

3. Deploy database drvier jar .
Put database jdbc drivers needed for esProc to connect to the database in Java application's class path. For instance, hsql.jar of demo database.

4. Configure dfxConfig.xml and config.xml files
Config.xml file contains basic configuration information for esProc, such as registration code, address search path, master directory, configuration of data sources. They can be found in directory esProc\config in esProc's installation path, in which the information is the same as that set in the esProc's option page. dfxConfig.xml can be found in directory esProc\classes in installation path. In this article, we’ll illustrate some of the configuration for esProc to connect to a database. For the other, please see A Course of esProc.
1) Configuration method one: Directly configure connection parameters of database data source.
  Config.xml file:
    <DBList>
<!-- name of data source must be in consistent with that in dfx file -->
<DB name="demo">
    <property name="url" value="jdbc:hsqldb:hsql://127.0.0.1/demo"/>
    <property name="driver" value="org.hsqldb.jdbcDriver"/>
    <property name="type" value="HSQL"/>
    <property name="user" value="sa"/>
    <property name="password" value=""/>
    <property name="batchSize" value="1000"/>
    <!--
    Automatically connect or not. If the setting is true, db.query() function can be directly used to access a database; if it is false, the database cannot be automatically connected and connect(db) statement must be used for connection before db.query() function does its job.
    -->
    <property name="autoConnect" value="true"/>
    <property name="useSchema" value="false"/>
    <property name="addTilde" value="false"/>
     </DB>
     </DBList>
2)Configuration method two: Configure connection pool and jndi in Java application, and designate name of jndi in dfxConfig.xml file.
dfxConfig.xml file:
    <jndi-ds-configs>
<!--jndi prefix -->
<jndi-prefix>java:comp/env</jndi-prefix>
<!-- name of data source must be in consistent with that in dfx file -->
<jndi-ds-config>
<name>demo</name>
<dbType>HSQL</dbType>
<dbCharset>ISO-8859-1</dbCharset>
<clientCharset>ISO-8859-1</clientCharset>
<needTranContent>false</needTranContent>
<needTranSentence>false</needTranSentence>
<!--
Automatically connect or not. If the setting is true, db.query() function can be directly used to access a database; if it is false, the database cannot be automatically connected and connect(db) statement must be used for connection before db.query() function does its job.
-->
<autoConnect>true</autoConnect>
</jndi-ds-config>
     </jndi-ds-configs>

Please note that:
Name of configuration files must be config.xml and dfxConfig.xml and cannot be changed.
Reentering is forbidden and esProc JDBC itself cannot be used as a data source and configured as a database connection.
If the two configurations have the data sources of the same name, the data source in config.xml will prevail.

5. Deploy dfxConfig.xml, config.xml and test.dfx files. 
Put dfxConfig.xml and config.xml files in classpath of Java application, or directly package them into dm.jar.
Put test.dfx file in classpath of Java application, or put it in the absolute path designated by <paths/> node of dfxConfig.xml file.

6. Call test.dfx in java program.
If ...?config=... is used in connecting string of esProc JDBC, configuration of .xml will be used and that of config.xml will be ignored. Default setting will be enabled if there is no config parameter in connecting string.
For instance, configuration of myconfig.xml is used in the expression con= DriverManager.getConnection("jdbc:esproc:local://?config=myconfig.xml") .
Code sample is as follows:
    publicvoid testDataServer(){
Connection con = null;
com.esproc.jdbc.InternalCStatementst;
com.esproc.jdbc.InternalCStatement st2;
try{
// establish connection
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
//call the stored procedure, in which test is the file name of dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
//set parameters
st.setObject(1,"3");
//the following statement has the same effect as that of the previous call
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(3)");
//execute the stored procedure
st.execute();
//get result set
ResultSet set = st.getResultSet();
}
catch(Exception e){
System.out.println(e);
}
finally{
//close connection
if (con!=null) {
try {
con.close();
}
catch(Exception e) {
System.out.println(e);
}
}
}
    }