The operation of grouping and summarizing
includes grouping one or more certain fields of two-dimensional structured data
and then summarizing fields of each group. The following will introduce methods
of grouping and summarizing in R language through an example. In order to make
the example more typical, we’ll set two fields to be grouped and two
summarizing operations.

**Case description:**

Please group data frame

*orders*according to CLIENT and SELLERID, and then summate field AMOUNT and seek its maximum value respectively in each group.
Note:

*orders*contains records of sales orders. Its source can be a database or a file. Such as orders<-read.table("sales.txt",sep="\t", header=TRUE).The first rows of data are as follows:

**Method 1:**

*aggregate*function**Code:**

result1<-aggregate(orders$AMOUNT,
orders[,c("SELLERID","CLIENT")],sum)

result2<-aggregate(orders$AMOUNT,
orders[,c("SELLERID","CLIENT")],max)

result<-cbind(result1,result2$x)

**Part of the computed result:**

1.The name

*aggregate*implies that it is a function specializing in grouping and summarizing. Both its input parameters and computed result are data frame and its usage is relatively simple.
2.

*aggregate*function cannot perform multiple summarizing operations on grouped data, thus two lines of code are required to realize the operations of seeking sum and maxrespectively, then their results are combined using*cbind*. Obviously, the code is not satisfactory in performance and usability.
3.

*aggregate*function has a strange requirement about the order of the fields to be grouped, that is, the fields must be in reversed order. In view of this, the code for grouping CLIENT first and then SELLERID should only be written as orders[,c("SELLERID","CLIENT")]. The code written according to the normal way of thinking will be incorrect instead.
4. Not only the code is written in
an unnatural way, but the computed result is weird too by putting filed
SELLERID before CLIENT. In reality, the code should be improved in order to
make the computed result conform to the business logic.

**Summary:**

*aggregate*function manages to perform the task after a fashion. But it is not good in performance and usability because the way of coding, computed result and business logic are inconsistent with each other.

**Code:**

sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)

result1<-lapply(sp,FUN=function(x)
sum(x$AMOUNT))

result2<-lapply(sp,FUN=function(x)
max(x$AMOUNT))

result<-cbind(result1,result2)

**Part of the computed result:**

**Code interpretation:**

1. The role of

*split*function is to group the data frame according to specified fields. No further computation is involved.*lapply*function can perform the same operation on data of each group. By working with each other,*split*and*lapply*can fulfill the task.
2. Because the grouped data can be
reused, this operation performs better than that using

*aggregate*function.
3. As

*lapply*function doesn't support multiple statistical approaches, two lines of code are required too to realize the operations of seeking sum and max respectively, and then use*cbind*to combine the results. What’s more, this operation requires an extra*split*function, so instead of enhancing the usability, it reduces it.
4. The grouping order is still unnatural
and the code has to be written reversely as orders[,c("SELLERID","CLIENT")].

5. The
computed result needs a lot of modification which brings great inconvenience.
It can be seen that the first column of the computed result is, in fact, the "SELLERID.CLIENT". The column
needs to be split into two columns whose orders should be exchanged.

**Summary:**This operation improves some performance but the usability is obviously poor with inconsistency in the aspects of way of coding, business logic and the computed result.

*lapply*belongs to the family of

*apply*function. Similar functions include

*sapply*and

*tapply*, whose usages differ on parameters. For example:

sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)

result1<-sapply(sp,simplify=FALSE,FUN=function(x)
sum(x$AMOUNT))

result2<-sapply(sp,simplify=FALSE,FUN=function(x)
max(x$AMOUNT))

result<-cbind(result1,result2)

*tapply*specializes in data frame, which, by rights, is the most suitable one for fulfilling this task. But it isn't in fact. It applies only to the situation where a single field is required to be grouped. When it is used to group two fields together, the result will be two-dimensional matrix. This requires users to make further complicated processing. For example, the computed result of the line of code tapply(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],function(x) sum(x))is as follows:

**Third-party library functions**

There are various disadvantages when using
R's built-in functions to group and summarize. In response to the problem, we
may consider using the third-party library functions, such as

*reshape*,*stack*, etc. The stability and computational efficiency of these library functions is generally not as good as those of the built-in functions, and the information for their use is not many. Therefore it is difficult for them to fulfil the task. Here we won't go into any example about their use.**Third-party languages**

Python,
esProc and Perl can also be employed to fulfil this task. All of them can
perform grouping and summarizing as well as structured data computing as R
language can. We’ll briefly introduce solutions of esProc and python.

**esProc**

esProc can fulfil this task by simply using

*groups*function. Its syntax is concise and easy to understand, as well as in line with the natural way of thinking. The code is as follows: result=orders.groups(CLIENT,SELLERID;sum(Amount),max(Amount))**Python(pandas)**

If
python's built-in functions are used to deal with this task, the code will be
rather complicated. Here pandas, the third-party function library, comes to
help. pandas will first perform grouping operation using

*groupby*function, then summarize using*agg*function. Its code, which is simpler than R languagebut not as good as esProc,is written like this: result=orders.groupby(['CLIENT','SELLERID']).agg({'AMOUNT':[sum,max]}).Pandas' computed result and syntax are highly consistent with the business logic. Part of the computed result is as follows:

The
step-by-step computational mode of grouping first then summarizing adopted by
pandas is not always bad. Such as, it can increase performance in situations
where grouping result is reused. esProc can also perform the step-by-step
operation, the equivalent code is group(CLIENT,SELLERID).new(CLIENT,SELLERID,sum(AMOUNT),max(AMOUNT)).

You should also check the R package "data.table" which is great and fast in aggregations etc.

回复删除