The operation of grouping and summarizing
includes grouping one or more certain fields of two-dimensional structured data
and then summarizing fields of each group. The following will introduce methods
of grouping and summarizing in R language through an example. In order to make
the example more typical, we’ll set two fields to be grouped and two
summarizing operations.
Case description:
Please group data frame orders according to CLIENT and SELLERID,
and then summate field AMOUNT and seek its maximum value respectively in each
group.
Note: orders
contains records of sales orders. Its source can be a database
or a file. Such as orders<-read.table("sales.txt",sep="\t",
header=TRUE).The first rows of data are as follows:
Method 1: aggregate function
Code:
result1<-aggregate(orders$AMOUNT,
orders[,c("SELLERID","CLIENT")],sum)
result2<-aggregate(orders$AMOUNT,
orders[,c("SELLERID","CLIENT")],max)
result<-cbind(result1,result2$x)
Part of the computed result:
1.The name aggregate implies that it is a function specializing in grouping
and summarizing. Both its input parameters and computed result are data frame
and its usage is relatively simple.
2. aggregate function cannot perform
multiple summarizing operations on grouped data, thus two lines of code are
required to realize the operations of seeking sum and maxrespectively, then
their results are combined using cbind.
Obviously, the code is not satisfactory in performance and usability.
3. aggregate function has a strange requirement
about the order of the fields to be grouped, that is, the fields must be in
reversed order. In view of this, the code for grouping CLIENT first and then
SELLERID should only be written as orders[,c("SELLERID","CLIENT")]. The code
written according to the normal way of thinking will be incorrect instead.
4. Not only the code is written in
an unnatural way, but the computed result is weird too by putting filed
SELLERID before CLIENT. In reality, the code should be improved in order to
make the computed result conform to the business logic.
Summary:aggregate function manages to perform the
task after a fashion. But it is not good in performance and usability because
the way of coding, computed result and business logic are inconsistent with
each other.
Code:
sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)
result1<-lapply(sp,FUN=function(x)
sum(x$AMOUNT))
result2<-lapply(sp,FUN=function(x)
max(x$AMOUNT))
result<-cbind(result1,result2)
Part of the computed result:
Code interpretation:
1. The role of split function is to group the data
frame according to specified fields. No further computation is involved. lapply function can perform the same
operation on data of each group. By working with each other, split and lapply can fulfill the task.
2. Because the grouped data can be
reused, this operation performs better than that using aggregate function.
3. As lapply function doesn't support multiple statistical approaches,
two lines of code are required too to realize the operations of seeking sum and
max respectively, and then use cbind
to combine the results. What’s more, this operation requires an extra split function, so instead of enhancing
the usability, it reduces it.
4. The grouping order is still unnatural
and the code has to be written reversely as orders[,c("SELLERID","CLIENT")].
5. The
computed result needs a lot of modification which brings great inconvenience.
It can be seen that the first column of the computed result is, in fact, the "SELLERID.CLIENT". The column
needs to be split into two columns whose orders should be exchanged.
Summary:This operation improves some performance but the usability is
obviously poor with inconsistency in the aspects of way of coding, business
logic and the computed result.
lapply belongs to the family of apply
function. Similar functions include sapply
and tapply, whose usages differ on
parameters. For example:
sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)
result1<-sapply(sp,simplify=FALSE,FUN=function(x)
sum(x$AMOUNT))
result2<-sapply(sp,simplify=FALSE,FUN=function(x)
max(x$AMOUNT))
result<-cbind(result1,result2)
Third-party library functions
There are various disadvantages when using
R's built-in functions to group and summarize. In response to the problem, we
may consider using the third-party library functions, such as reshape, stack, etc. The stability and computational efficiency of these
library functions is generally not as good as those of the built-in functions,
and the information for their use is not many. Therefore it is difficult for
them to fulfil the task. Here we won't go into any example about their use.
Third-party languages
Python,
esProc and Perl can also be employed to fulfil this task. All of them can
perform grouping and summarizing as well as structured data computing as R
language can. We’ll briefly introduce solutions of esProc and python.
esProc
esProc can fulfil this task by simply using
groups function. Its syntax is
concise and easy to understand, as well as in line with the natural way of
thinking. The code is as follows: result=orders.groups(CLIENT,SELLERID;sum(Amount),max(Amount))
Python(pandas)
If
python's built-in functions are used to deal with this task, the code will be
rather complicated. Here pandas, the third-party function library, comes to
help. pandas will first perform grouping operation using groupby function, then summarize using agg function. Its code, which is simpler than R languagebut not as
good as esProc,is written like this: result=orders.groupby(['CLIENT','SELLERID']).agg({'AMOUNT':[sum,max]}).
Pandas' computed result and syntax are highly consistent with the business logic. Part of the computed result is as follows:
The
step-by-step computational mode of grouping first then summarizing adopted by
pandas is not always bad. Such as, it can increase performance in situations
where grouping result is reused. esProc can also perform the step-by-step
operation, the equivalent code is group(CLIENT,SELLERID).new(CLIENT,SELLERID,sum(AMOUNT),max(AMOUNT)).
You should also check the R package "data.table" which is great and fast in aggregations etc.
回复删除