2014年7月22日星期二

The Disadvantages of SQL Computation (I)

SQL is invented primarily to provide a method to access structured data in order to transparentise the physical storage scheme, so a lot of various types of English vocabularies and syntaxes are used in SQL to reduce the difficulty in understanding and writing it. And the relational algebra as the basic theory of SQL is a complete computation system, which can compute everything in principle. In terms of this, we certainly should use SQL to satisfy various demands for data computation.

But, though relational database has achieved a huge success, evidently SQL fails to realize its original aim of invention. Except very few simple queries can be completed by end user using SQL, most of SQL users are still technical personnel, and even many complex queries are no easy job for technical personnel.

Why? We inspect the disadvantage of SQL in computation through a very simple example.
Suppose there is a sales performance table consisting of three fields (to simplify the problem, date information is omitted):
sales_amount
Sales performance table
sales
Name of salesman, suppose there is no duplicate name.
product
Products sold
amount
Sales amount of the salesman on the product

Now we want to know the name list of the salespersons whose sales amounts rank among the top 10 places both in air-conditioners and TV sets.
This question is rather simple and people will very naturally design out the computation process as follows:
1) Arrange the sequence according to the sales amount of air-conditioner and find out the top 10 places.
2) Arrange the sequence according to the sales amount of TV and find out the top 10 places.
3) Get the intersection of the results of 1 and 2 and obtain the answer.
   
Now we use SQL to do it.
1) Find out the top 10 places of the sales amount of air-conditioner. This is very simple:
select top 10 sales from sales_amount where product='AC' order by amount desc

2) Find out the top 10 places of the sales amount of TV. The action is the same:
select top 10 sales from sales_amount where product='TV' order by amount desc

3) Seek the intersection of 1 and 2. This is somewhat troublesome, as SQL does not support computation by steps. The computation result of the above two steps cannot be saved, and thus it is necessary to copy it once again:
select * from
( select top 10 sales from sales_amount where product='AC' order by amount desc )
intersect
( select top 10 sales from sales_amount where product='TV' order by amount desc )

A simple 3-step computation has to be written like this using SQL, and daily computations of more than 10 steps are in great numbers. So this evidently goes beyond the acceptability of many people.

In this way, we know the first important disadvantage of SQL: Do not support computation by steps. Dividing complex computation into several steps can reduce the difficulty of a problem to a great extent. On the contrary, completing many steps of computation into one step can increase the difficulty of a problem to a great extent.

It can be imagined that, if a teacher requires pupils to create only one calculation formula to complete the calculation in solving application problems, how distressed the pupils will feel (of course, there are certain clever children who can solve the problem)!

SQL query cannot by conducted by steps, but the stored procedure written out with SQL can operate by steps. Then, is it possible to use the stored procedure to conveniently solve this problem?

For the time being, we just ignore how complex is the technical environment in which the stored procedure is used (this is enough to make most people give it up) and the incompatibility caused by differences of databases. We only try to know theoretically whether it is possible to use SQL to make this computation simpler and faster.

1) Compute the top 10 places sales amount of air-conditioners. The statement is still the same, but we need to save the result for use by Step 3, while in SQL, it is only possible to use table to store set data. So we need to create a temporary table: 
create temporary table x1 as
select top 10 sales from sales_amount where product='AC' order by amount desc

2) Compute the top 10 places of the sales amount of TV. Similarly
create temporary table x2 as
select top 10 sales from sales_amount where product='TV' order by amount desc

3) Seek the intersection, the preceding steps are troublesome but this step is simpler.
select * from x1 intersect x2

After the computation is done in steps, the working thought becomes clear, but it still appears over-elaborate to use a temporary table. In the computation of mass structured data, temporary set, as intermediate result, is rather common. If the temporary table is created for storage in all cases, the computation efficiency is low and it is not intuitive.
  
Moreover, SQL does not allow the value of a certain field to be a set (namely temporary table), so in this way, it is impossible to implement some computations even if we tolerate the over elaborate.

If we change the problem into computing the salespersons whose sales amounts of all products rank among the top 10 places, try thinking how to compute it. By continuing to use the above-mentioned working thought, it is very easy to get the below points:
1) Group the data according to products, arrange the sequence of each group, and get the top 10 places;
2) Get the intersection of the top 10 places of all products; 

As we do not know beforehand how many products there are, so it is necessary to also store the grouping result in a temporary table. There is a field in this table that needs to store the corresponding group members, which is not supported by SQL, so the method is unfeasible.

If supported by window function (SQL2003 standard), it is possible to change the working thought. After grouping by product, compute the number of times each salesman appears in the top 10 places of the sales amounts of all product category group. If the number of times is the same as the total number of the product categories, it indicates this salesman is within the top 10 places regarding the sales amounts of all product categories.
select sales
from ( select sales,
       from ( select sales,
              rank() over (partition by product order by amount desc ) ranking
              from sales_amount)
      where ranking <=10 )
group by sales
having count(*)=(select count(distinct product) from sales_amount)

How many people can write such complex SQL?

Moreover, in many databases, the window functions are not supported. Then, it is only possible to use the stored procedure to develop a loop, according to the sequence, the top 10 places of each product, and seek the intersection of the result of the preceding time. This process is not very much simpler than using high level language to develop, and it is also necessary to cope with the triviality of the temporary table.

Now, we know the second important disadvantage of SQL: Set-lization is not complete. Though SQL has the concept of set, it fails to provide set as a kind of basic data type, which makes it necessary to transform a lot of natural set computations in thinking and writing. 

没有评论:

发表评论