2014年10月12日星期日

Code Examples of External Memory Grouping in esProc

In data analysis, we often need to group the data, and then compute the summary value of each group, or compute it in each group. In esProc, the result of data grouping and summarizing is computed using groups function, records in the table are grouped using group function for the convenience of subsequent computations. However, external memory grouping is required when the data being grouped is big and cannot be loaded to the memory in its entirety and thus the common methods for grouping become useless.


Let's create a table with big data, which will contain 1,000,000 records simulating those of call duration of mobile phone users, and store it in the binary file PhoneBill

In the records, the mobile phone numbers are 8-digit integers in which the first four integers are a fixed 1234 and the rest of four integers are randomly generated. The records start on a randomly generated day of August of 2014. The call duration is also a randomly generated integer whose generation has a 90 percent chance of setting the call duration as one minute, with the maximum limit as 20 minutes. After the data file is completed, the first 1,000 rows of data are imported in B11. The data is shown as follows:
Now it is required to perform the following computations based on the data in PhoneBill:
     The total call duration and the average duration of each call of all users each day in August.
     The total call duration of each user in August.
     Store the call records of each day respectively in the file.
     The numbers of five users who make the longest call in total in each day of the August.

To solve the first problem, all data needs to be grouped by DateTime and then summarized. As there are only 31 days in the month of August, the result set won’t contain big data and the summarizing operation can be performed completely in the memory:

A2 generates cursors with the binary text data in A1. The result of grouping and summarizing data in the cursors is computed in A3:

Because the result contains no big data, during grouping and summarizing, the data in the cursors only needs to be traversed once and groups function can be used. There is no need to use the external memory grouping. To compute the average time of every call in a day, we need to first work out the total call duration and the total number of calls each day. The result in A4 is as follows:

It is a different case for the second problem. As the number of users is far more than the number of days in August, we should consider if the grouping and summarizing result can be returned to the memory all at once. There are 10,000 users at most in this example, which cannot, actually, be counted as big. We merely use it to illustrate the computation of big result sets. It is assumed that, in the computation, the memory can only hold 1,000 records:

groupx function is needed to compute the result of grouping and summarizing big data. The function imports data from the cursors step by step according to the number of buffer rows set in advance, using the external memory. In A3, the data in cursors is computed and grouped by Phone; the total call duration is computed and the result is summarized and the number of buffer rows is set as 1,000. The result of grouping and summarizing big data is the cursor. The method of importing data from it is completely the same as that from any other cursor. The result in A3 is as follows:

In A4, the results of total call duration of the first 1,000 users are fetched:

In addition, if the data of the cursors hasn't been imported entirely, the cs.close()function needs to be called to clean the temporary files from the external memory in time.

The third problem requires grouping data by the DateTime and then storing the data of each day in the file:

Different from common grouping operations, the operation of grouping big data requires specifying the group numbers directly in the grouping expression. In A3, the days in DateTime are specified as the group numbers. There is no summary parameter in the groupx function and the result is different from that in the above. It is the sequence of cursors that will be returned. Each cursor corresponds to a group:

The fourth and fifth lines of code store respectively the cursor data of each day in the file. Data of four days is selected in A6. The first 1,000 records are fetched in A7, as shown below:

For the fourth problem, group data by DateTime, and then group and summarize data of each group and, finally, select the required mobile phone numbers according to the summarizing result:
In A3, the days of DateTime are specified as the group numbers in the grouping expression, and a sequence of cursors will be returned after grouping:

The fourth to sixth lines of code make loops on the cursor data of each group, summarize respectively the total call duration per user per day, and, according to this summarizing result, summarize the information of users whose call duration is in the top five. Please note that when summary function topx is used to sort data in descending order, just add a negative sign before the sorting expression, like –TotalDuration in B5. Select the data of five numbers that have the longest call duration and store it in the table sequence in B3. After the loop is finished, the final result can be viewed in B3 as follows:
Besides grouping data of total call charges, we can also use the file data generated in handling the third problem to solve this one. 

没有评论:

发表评论