Here is a 1GB file, sales.txt, which contains a large number of sales order records. We want to group the records by the CLIENT field and summarize the AMOUNT field. "\t" (tab) is used in the file as the column separator. The first rows of data are as follows:
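(A hypothetical illustration of the layout: the column names match those used in the code below, but the values are made up.)

ORDERID	CLIENT	SELLERID	AMOUNT	ORDERDATE
1	UJRNP	17	392	2008-11-02
2	SJCH	6	4802	2008-11-09
3	UJRNP	16	13500	2008-11-05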
R's solution:

con <- file("E:\\sales.txt", "r")
# First batch: header=TRUE takes the column names from the file's first line
result <- read.table(con, nrows = 100000, sep = "\t", header = TRUE)
# Group by CLIENT (2nd column) and sum AMOUNT (4th column)
result <- aggregate(result[, 4], list(result[, 2]), sum)
# Remaining batches: read.table raises an error at the end of the file,
# so tryCatch turns that into an empty data frame, which ends the loop
while (nrow(databatch <- tryCatch(
    read.table(con, header = FALSE, nrows = 100000, sep = "\t",
               col.names = c("ORDERID", "Group.1", "SELLERID", "x", "ORDERDATE")),
    error = function(e) data.frame())) != 0) {
  databatch <- databatch[, c(2, 4)]   # keep CLIENT and AMOUNT
  result <- rbind(result, databatch)  # merge the batch into result
  result <- aggregate(result[, 2], list(result[, 1]), sum)  # re-aggregate
}
close(con)
Code interpretation:
The first line opens the file handle. The next two lines import the first batch of 100,000 rows of data, group and summarize them, and save the result in result. The while loop then imports the rest of the data in batches of 100,000 rows, storing each batch in the variable databatch; it keeps the second and fourth fields, i.e. CLIENT and AMOUNT, merges databatch into result, and performs the grouping operation again. As you can see, at any given moment only databatch, which holds 100,000 rows of data, and result, the summarized data, occupy memory. The latter is usually small and will not cause a memory overflow. The last line closes the file handle.
Matters needing attention:
Data frame. Because R's data frame cannot directly compute over big files, a loop statement is needed to do the job in this case. The steps are: import a batch of data and merge it into the data frame result; group and summarize result; then import the next batch. As you can see, this looping code is a little complicated.
Column name. Since the first row of the file holds the column names, header=TRUE can be used for the first batch to set the column names directly. The subsequent batches have no column names, so header=FALSE must be used to import them. With header=FALSE the default column names are V1, V2 and so forth, but after grouping and summarizing the column names become Group.1 and x, so col.names is needed to rename the imported columns. This keeps the structure consistent before and after grouping and summarizing, and sets the stage for the subsequent merging. The code about column names is worth noticing because it is easy to get wrong.
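To see where the Group.1 and x names come from, here is a minimal R sketch (the toy data is made up for illustration):

df <- data.frame(CLIENT = c("A", "B", "A"), AMOUNT = c(10, 20, 30))
agg <- aggregate(df$AMOUNT, list(df$CLIENT), sum)
names(agg)  # "Group.1" "x" -- the default names produced by aggregate
agg         # Group.1: A, B; x: 40, 20

Because each imported batch reuses these names via col.names, rbind(result, databatch) lines the columns up correctly.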
Alternative solutions:
Python, esProc and Perl can also perform the same operation: grouping and summarizing data from big text files, followed by structured data computing, just as R does. We'll briefly introduce the coding methods of esProc and Python.
esProc can process data in batches automatically, which requires no manual loop control from the programmer and produces quite simple code:
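A sketch of the esProc script, assuming SPL's cursor@t() and groups() functions (the exact cell code may vary by version):

A1  =file("E:\\sales.txt").cursor@t()
A2  =A1.groups(CLIENT; sum(AMOUNT))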
Cursor is a data type used for structured data computing in esProc. Its usage is similar to that of the data frame, but it is better at processing big files and performing complicated computations. What's more, the @t option in the code indicates that the first line of the file contains the column names, so the column names can conveniently be used directly in subsequent computations.
Python's code structure, which also requires manual loop control, is similar to that of R. But Python itself lacks a structured data type such as the data frame or the cursor, so its code works at a lower level:
from itertools import groupby
from operator import itemgetter

result = []
myfile = open("E:\\sales.txt", 'r')
BUFSIZE = 10240000
myfile.readline()  # skip the header line
lines = myfile.readlines(BUFSIZE)
value = 0
while lines:
    for line in lines:
        record = line.split('\t')
        # keep CLIENT and AMOUNT only
        result.append([record[1], float(record[3])])
    # sort before grouping
    result = sorted(result, key=lambda x: x[0])
    batch = []
    # group with the groupby function
    for key, items in groupby(result, itemgetter(0)):
        value = 0
        for subItem in items:
            value += subItem[1]
        # merge the summarized results into a two-dimensional list
        batch.append([key, value])
    result = batch
    lines = myfile.readlines(BUFSIZE)
myfile.close()