However,
all these problems can be avoided if we access Excel files using gdata package and meanwhile, write to
Excel with WriteXLS. Both of the two
packages support Excel2003 and Excel2007, operate stably, have easy and intuitive
code and require no manual work. The following example is used to illustrate
the method of accessing Excel with the two function packages.
Target:
There
are multiple Excel files of same structure in the directory ordersData. Among these files containing
sales order over the years, some are in the format of Excel2007, others are in
the format of Excel2003. Please load them, compute the total sales amount of
each client and write the result to result.xlsx.
The following is some of the data of 2011.xlsx:
Some
of the data of result.xlsx are as
follows:
Code interpretation:
Code:
library(gdata)
setwd("E: /ordersData")
orders<-read.xls(fileList[1])
for (file in fileList[2:length(fileList)]){
orders<-rbind(orders,read.xls(file))
}
Code interpretation:
1. The two lines of code library(gdata) and library(WriteXLS)aim to import
two third-party function packages, which have read.xls function and WriteXLS function to read and write Excel
respectively.
2. The line of code fileList<-dir()lists all
the files in the directory. The following for
statement read files by loop and merge data into the data frame orders. If there are other files in the
directory, they should be removed using wildcard characters.
3.
This line of code result<-aggregate(orders[,4], orders[c(2)],sum)executes grouping and summarizing, in which orders[,4]represents summarizing column (i.e.
Amount) and orders[c(2)] represents grouping column (i.e.
Client).
4.
Both read.xls and WriteXLS support the data type data.frame though
they come from different packages, therefore, they can coordinate rather well.Besides,
read.xls function can
automatically identify the format of both Excel2003 and Eexcel2007, and is
quite convenient to use.
5. All the code is concise and easy to grasp for beginners.
Note for use:
1. Versions
gdata and WriteXLS
are not R language’s built-in library functions, they are the third-party packages
needing download and installation. What’s more, both of them require the Perl
environment, so it is particularly important to choose an appropriate version. Through
our trials, we find that 2.15.0 version of R
language gets along well with 2.13.3 version of gdata and 3.5.0 version of WriteXLS.
But something may go wrong if they operate with the newest Perl version and an
older 5.14.2 version is thus required. Otherwise the following error report
will appear:
Error in xls2sep(xls, sheet, verbose = verbose,
..., method = method, :
Intermediate file 'C:\Users\Thim\AppData\Local\Temp\RtmpMHvLZS\file224060624738.csv'
missing!
2. Performance
gdata and WriteXLS
have no problem in accessing small files, but they perform badly in handling
bigger files (maybe because of Perl). For example, it takes 8 to 10 minutes to
read an Excel file of 8 columns and 200,000 rows. To achieve a better performance,
we recommend xlsx function package.
But, of course, Excel2003 will be of no use in this occasion. In fact, xlsx performs just slightly better than gdata does. Therefore, in order to truly
improve performance, it is recommended that all Excel files be transferred into
2007 format and xml files in them be
uncompressed and data be read through resolving these xml files.
Alternative methods:
For the problems of version
conflicts and poor performance that R language has, we have alternative solutions
like Python, esProc, Perl etc. As R language, they can also access Excel files
and perform data computing. In the following, we’ll introduce briefly esProc
and Python.
esProc integrates the
function of accessing EXCEL into its installation package, so it is no need for
it to download the extra third-party packages. It can access Excel2003,
Excel2007, Excel 2010 and even the older versions. Its code is as follows:
esProc's performance is satisfactory.
It takes only 20 to 30 seconds for it to read an Excel file of 8 columns and
200,000 rows.
Python has a
rather excellent performance, except that it requires the third-party packages
as R language does. Pandas should have
been able to complete the task of accessing xls
file easily, but its installation under windows failed (after all, xls files are mainly produced under
windows). Finally, we succeeded in performing this operation by using packages
of both xlrd and xlwt3. Unfortunately, the two packages support only Excel2003 and
produce much more complicated code:
import
xlwt3
import xlrd
from itertools
import groupby
from
operator import itemgetter
import os
dir="E:/ordersData/"
fileList
=os.listdir(dir)
rowList
= []
for
f in fileList:
book=xlrd.open_workbook(dir+f) #open read-only workbook by loop
sheet=book.sheet_by_index(0)
nrows
= sheet.nrows
ncols
= sheet.ncols
for
i in range(1,nrows):
row_data
= sheet.row_values(i)
rowList.append(row_data) #all records are appended to rowList
rowList=sorted(rowList,key=lambda
x:(x[1])) #sort the data before grouping
result=[]
for
key, items in groupby(rowList, itemgetter(1)): #
group using groupby function
value1=0
forsubItem
in items:value1+=subItem[3]
result.append([key,value1]) #merge the summarized result into 2D
array in the end
wSheet=wBook.add_sheet("sheet
1")
wSheet.write(0,0,"Client")
wSheet.write(0,1,"Sum")
for
row in range(len(result)): #write
data to the file by loop
wSheet.write(row+1,0,result[row][0])
wSheet.write(row+1,1,result[row][1])
wBook.save(dir+"result.xls") #save the file
It is a far more complicated method than R
language.
没有评论:
发表评论