Now let's look at the method of computing link relative ratio and year-on-year comparison in R language through an example.
Case description: Compute the link relative ratio and year-on-year comparison of each month's sales amount during a specified period of time. The data come from orders table sales, in which column Amount contains order amount and column OrderDate contains order dates. Some of the data are as follows:
Code:
filtered<-subset(sales,as.POSIXlt(OrderDate)>=as.POSIXlt('2011-01-01
00:00:00') &as.POSIXlt(OrderDate)<=as.POSIXlt('2014-08-29 00:00:00'))
filtered$y<-format(as.POSIXlt(filtered$OrderDate),'%Y')
filtered$m<-format(as.POSIXlt(filtered$OrderDate),'%m')
agged<-aggregate(filtered$Amount,
filtered[,c("m","y")],sum)
result<-agged[order(agged$m),]
result$yoy<-NA
for(i in 1:nrow(result)){
if(i>1
&& result[i,]$m==result[i-1,]$m){
}
}
Code interpretation:
1.The first four lines of code
are easy to understand. read.table is
used to read data from the table and subset
to filter data, and two format
functions are used to generate year and month respectively. Note that the
beginning and ending time should be output dynamically from the console using scan function; here they are simplified
as fixed constants.
2.agged<-aggregate(filtered$Amount,
filtered[,c("m","y")],sum), this
line of code summates the order amount of each month of each year. Note that in
the code, the month must be written before the year though data are grouped by
the year and the month according to business logic. Otherwise R language will
perform grouping first by the month, then by the year, which will get result
inconsistent with business logic and make data viewing inconvenient.
3. agged$lrr<- c(0,
(agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)]),this line of code computes link relative ratio. The
result will be stored in the new column Irr.
Business logic is (order amount of the current month – order amount of the
previous month)\order amount of the previous month.
Note: [-N] in the
code represents that the Nth row of data is removed. So agged$x[-1]means the first row of data is
removed and agged$x[-length(agged$x)]means
the last row of data is removed. By performing certain operation between the
two, link relative ratio can be obtained indirectly. But the result won’t include
the link relative ratio of the first month (i.e. January 2011), so a zero
should be added to the code. We can see that the code logic and the business
logic share some similarities but are quite different. The code is difficult to
understand.
At
this point, some of the values of data frame aggedare:
4.result<-agged[order(agged$m),],
this line of code sorts data by the month and the year. Since the data of the
year are ordered, we just need to perform sorting by the month. result$yoy<-NA initializes
a new column which will be used to store the year-on-year comparison of sales
amount.
5. The loop judgment in the last four lines of code is to compute the
year-on-year comparison. Business logic: (order amount of the current month –
order amount of the previous month)\order amount of the previous month. Code
logic: from the second line, if the month in the current line is the same as
that in the previous line, the code will compute year-on-year comparison.
Detailed code is result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x. We can see that the code written in
this way is easy to understand and its logic is quite similar to the business
logic.
The only weakness of
this piece of code is that it cannot use the loop function of R language, which makes it a little lengthy. But
compared with the difficult operation of link relative ratio, maybe a longer
but simple code is better.
Summary:
R language can compute link relative ratio
and year-on-year comparison, but the operation of link relative ratio is
difficult to understand and the code of year-on-year comparison is a little lengthy.
The codes of both operations are not easy to learn.
The third-party solution
Python,
esProc and Perl, all of which can perform structured data computation, can be
used to handle this case. In the following, we’ll briefly introduce esProc and
Python's solutions.
esProc
esProc is good at expressing business logic
freely with agile syntax. Its code is concise and easy, as shown below:
In the above code, groups function is used to group and summarize data by the year and
the month. The derive functions in A4
and A6 generate link relative ratio and year-on-year comparison respectively.
As can be seen from the code, esProc also
uses[-N]. Different from [-N] in R language, it doesn't represent removing the
Nth row; it represents the Nth row counted from the current line. For example,
[-1] is the previous line. In this way, the operation of link relative ratio
can be simply expressed as (x-x[-1])/x[-1].But R language hasn't expressions
for relative positions, which makes its code difficult to understand.
In the year-on-year comparison operation,
esProc uses judgment function if in loop function, making it avoid the
lengthy loop statement and its code simpler. While R language only has the
judgment statement but hasn't the judgment function. This is the reason why its
code is lengthy.
Finally, these are the computed results:
Python(Pandas)
Pandas isPython’s third-party
package. Its basic data type is created by imitating R’s dataframe but gets
improved greatly. At present, its latest version is 0.14. Its code for handling
this case is as follows:
sales = pandas.read_csv('E:\\salesGroup.txt',sep='\t')
sales['OrderDate']=pandas.to_datetime(sales.OrderDate,format='%Y-%m-%d
%H:%M:%S')
filtered=sales[(sales.OrderDate>='2011-01-01
00:00:00') & (sales.OrderDate<='2014-08-29 00:00:00')]
filtered['y']=filtered.OrderDate.apply(lambda x:
x.year)
filtered['m']=filtered.OrderDate.apply(lambda x: x.month)
grouped=filtered.groupby(['y','m'],as_index=False)
agged=grouped.agg({'Amount':[sum]})
agged['lrr']=agged['Amount'].pct_change()
result.reset_index(drop=True,inplace=True)
result['yoy']=result.apply(lambda _:numpy.nan, axis=1)
if(row_index>0
and result.ix[row_index,'m']==result.ix[row_index-1,'m']):
result.ix[row_index,'yoy']=(result.ix[row_index,'Amount']-result.ix[row_index-1,'Amount'])/result.ix[row_index-1,'Amount']
In the code, pct_change() function is used to directly compute the link relative
ratio, which is more convenient than the method used by R language and esProc.
But this kind of function is not universal and can only deal with isolated
cases. When it is required to compute link relative ratio or year-on-year
comparison, Pandas can only complete
the task by combining div function
and shift function, which makes its
code more difficult to understand than R’s.
In computing year-on-year comparison,
Pandas' code is as lengthy as R’s. This is because Pandas also cannot use if
function in loop function. I’m afraid
cooperation of apply function and lambda syntax is needed if we want to
write simpler code.
Please pay attention to the following
easy-to-get-wrong details:
1. The code must be sort_index(by=['m','y'])when
we perform sorting by the month and the year. The simple form sort(m), which used in R
language and esProc, is not allowed.
2. Pandas has the assignment syntax asresult.loc[row_index,‘yoy’]=value.
But when assigning value to a certain element in data frame, we should write
the code as result.ix[row_index,'yoy']=value.
3.When iterrows()is used to perform loop, its loop number
row_indexis index
instead of row number. To make the row number conform to the index, reset_index() should be used
to reset the indexes.
没有评论:
发表评论