2014年7月30日星期三

Basic Data Type in Data Processing Programming Language

Programming languages focus on various basic data types, subject to their different design goals. Languages such as Java and C# are designed to develop the common applications. Their basic data types are character strings, number, boolean, and other atomic data type, array and common object. SQL, PowerBuilder, R, esProc, and other alike languages are designed to process data. So their basic data types are the structured 2-dimentional data sheet object. Take this SQL statement, for example,SELECT T1.id,T1.name,T1.value FROM T1 LEFT JOIN T2 ON T1.id=T2.id. Of which, the T1, T2, and the computed result just use such data type. With the multiple fields to form one record and the multiple records to form the 2-dimentional data, the combination of such data and its field name is the structured 2-dimenional data table object.
Why not use the atomic data type and the common object as the basic data type for the data processing languages? If representing the T1 and T2 from the above-mentioned SQL statement with the array or ArrayList object, you will find: The complexity will increase for several times, and the length of codes will also increase sharply for dozens of times.
The basic data types of data processing languages are the structured 2-dimentional data table object. This is not a coincidence, but there are subtle reasons instead.

Correspond to actual business. In the real world, most business data is the structured data. As an example, the Payroll list has the employee number, employee name, department, date, pre-tax salary, and post-tax salary; For another example, the retail record has the order time, outlet number, checkout counter number, cashier number, product name, and unit price; The last example of business data is the Website log, which comprises the browse time, URL, visitor IP, browser version, and other properties. These properties are equivalent to the field. Each of the records has the same structure. Though they are stored in text while not the database, they are actually still the structured data in nature. So, it is only natural to use the 2-dimentional data table to represent it. The structured 2-dimentional data table object can be used to represent the business data intuitively. Representing the actual business in the most faithful way, no matter the storage, computing, exchange or sharing. Such kind of data is the easiest for users to understand in a most convenient way.

Easy for massive processing. Business data are mostly the data of the same structure, for example, the Payroll table, Retail record, and Website log mentioned above. In processing such data, in some cases, we will handle a certain data of a certain record, but in most cases, we take a certain record as unit to process all data, for example: Compute the after-tax wage based on the pre-tax wage; Compute the amount based on the unit price and quantity of commodity. Count the daily on-line duration for each IP. The above-mentioned processing mode is just the massive data processing. To implement the batch processing, we can traverse every member of array in loops by row number and column number just as the operations for Java. Alternatively, we can operation on the data with the business field name directly as we would do for SQL and esProc. The latter resolution is simpler and easier-to-use without having to write loop statements. Programmers can thus operate on data intuitively from business perceptive, and the corresponding code become more concise and readable.

Compatible with the Relational Algebra. The relational algebra is the underlying theory developed for data processing and query. By which, the association and laws of operations among business data can be expressed in full details using the basic operation along with the join operation, aggregation operation, and division operation. Theoretically, any computation problem of any degree of difficulty can be implemented and solved by relational algebra in the respects of data processing and data query. Because the relational algebra is concise and complete, databases are largely designed based on this theory. E.F. Codd is thus called as the father of relational database. The structured 2-dimentional data table object is just the data type recommended by E.F. Codd. This data type can be used to express various operations of relational algebra, so as to solve the computation problem in data processing easily. In facts, the database result set is the earliest structured 2-dimentional data table object.

As can be seen, all kinds of programming languages adopt the structured 2-dimeintal data table object as the basic data type because it is corresponding to the real business data, and easy to implement the massive computation, making it compatible to the relational algebra theory. With the 2-dimentional data table object, codes can be simple and easy to understand, and the development efficiency is improved. Let me explain it with a few more examples below:

Result set of SQL (resultSet): Group by the book type to compute the average price of the books whose average price is greater than 15 yuan.
  select avg(price),type from books group by type having avg(price)>15

Table sequence of esProc (TSeq): Group by department to find the top 10 best sellers for each department.
  products. group(department). (~.top(quantity;10)

Data window of PowerBuilder (datawindow): Sort the order by price
  Order.SetSort('value d')
  Order.Sort()

R language data frame (data.frame): Left-join the orders table and customer table by customerID.
  merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

SQL, esProc, and R code comparison: Group the order data by department, and summarize the order data and sales amount of each department.
         
SQL:  Select count(*),sum(sales) from orders group by Dept
esProc:  orders.groups(Dept; count(~), sum(sales))
R language:
result<-aggregate(orders$ sales,list(orders $ Dept),sum) 
result$count<-tapply(orders $ sales, orders $ Dept,length)
        
Let's take a close look on result set, table sequence, data window, and data frame. Although they are all structured 2-dimentional data table objects with basically the same function. There are some slight differences between them.

SQL result set is rich in various materials, widely applied, universal, and simple to use. It is the top mainstream data type of all data processing languages. However, SQL did not implement the relational algebra to the full, making it a bit inconvenient for some computations, such as the set division.

DataWindow usually retrieves number from SQL, and return the final result to the database. It mainly servers the purpose of breaking through any barrier between the data and UI controls, so that programmers can design and deliver the database application with high interactivity soon. Another major function of DataWindow is to render and edit data. It can be only used for form computation, and the data processing capability is relatively poor.

Data Frame is capable of handling the structured computation to some extent. As can be seen from the above example, its syntax is obscure, and it is relatively complex to implement the same functions with it. This is because the major functions of R are scientific and statistical computing, focusing on the data types of array and matrix. As a additional data type, the data frame was later introduced to implement the structured data computing. Considering this point, data frame is not so dedicated as that of the other three tools.

TSeq is quite dedicated in data processing. Having incorporated all common strong points of SQL result sets, TSeq can fully and completely implement the relational algebra. TSeq is generic and sorted, especially fit for the order-related complex computing in data processing, for example: Yearly link relative ratio, year-on-year comparison, ranking, relative position computing, and interval computing. TSeq is also charactered for it is generic, and easier to establish relations between data and provide access to data with multi-level association easily by object. Compared with SQL, TSeq is unable to directly process the big data because it is the pure memory object.

As can be seen, the structured 2-dimeintional data table object is directly related to the degree of dedication for the data processing languages. The more powerful the former one is, the higher the degree of dedication for the latter one would be, and vice versa. If a programing language lack the structured 2-dimentional data table object, then this language can hardly be regarded as a dedicated one in processing the data. To research and examine if a programing language can be used to develop any application for data analysis and processing efficiently, the key is to find out if it offers the dedicated 2-dimentional data table object and the appropriate class library.

Perl is often used to retrieve character string and is capable of processing the data to some extent. However, since its code is lengthy and complex, it is not the dedicated data processing language. For example, to complete the simplest algorithm of grouping and summarizing, the code of Perl is shown below:
  %groups=();                       
  foreach(@carts){             
    $name = $_->[1];
    if($groups{$name} == null){    
         $groups{$name}=[$_];
    }
    else{
         push($groups{$name},$_);     
         }
  }
  my @result=();                           
  foreach( keys(%groups)){        
         $value=0;
    while($row=pop $groups{$_}){        
         $value += $row->[2];                
    }
    push @result,[$_,$value];
  }

Python is a bit simpler to write, but far more inefficient than SQL, esProc, and R in developing by order of magnitude. The sample code is shown below:
  result=[]
  for key, items in groupby(data, itemgetter(0)):      
    value1=0
    value2=0
    for subitem in items:                                            
        value1+=subitem[1]
        value2+=subitem[2]
    result.append([key,value1,value2])                  
  print(result)

Perl and Python is not the dedicated tools on data processing. Most importantly, they lack the structured 2-dimentional table data object.



The TSeq of esProc is not only the structured 2-dimentional table data object, but also is characterized with its being order, generic, step-by-step computation, making it more dedicated than other alike languages. For example, to implement a relatively complex computational goal: Find the shares having been rising on 5 consecutive days. esProc solution code is shown below:

没有评论:

发表评论