2014年9月2日星期二

Methods of Accessing Excel Files by R language

There are many ways for R language to access Excel files, but each has its weaknesses. For example, xlsx package has complicated code and supports only Excel2007; RODBC has too many restrictions, is difficult to understand and unstable and goes wrong strangely. Though the method of saving the file in csv format is relatively common and stable, it operates inconveniently and lacks ability to process multiple files with program. Another method is to extract xml, but the complicated steps and code forbid us to use it.It's also not ideal to transform the file with a clipboard because part of the operation need to be done manually and we'd rather save the file in csv format.

However, all these problems can be avoided if we access Excel files using gdata package and meanwhile, write to Excel with WriteXLS. Both of the two packages support Excel2003 and Excel2007, operate stably, have easy and intuitive code and require no manual work. The following example is used to illustrate the method of accessing Excel with the two function packages.

Target:

There are multiple Excel files of same structure in the directory ordersData. Among these files containing sales order over the years, some are in the format of Excel2007, others are in the format of Excel2003. Please load them, compute the total sales amount of each client and write the result to result.xlsx. The following is some of the data of 2011.xlsx:
Code:
library(gdata)                                     
library(WriteXLS)                       
setwd("E: /ordersData")                  
orders<-read.xls(fileList[1])                                         
for (file in fileList[2:length(fileList)]){                         
orders<-rbind(orders,read.xls(file))
}
WriteXLS("result","result.xlsx")                                 

Some of the data of result.xlsx are as follows:

Code interpretation:
1The two lines of code library(gdata) and library(WriteXLS)aim to import two third-party function packages, which have read.xls function and WriteXLS function to read and write Excel respectively.
2. The line of code fileList<-dir()lists all the files in the directory. The following for statement read files by loop and merge data into the data frame orders. If there are other files in the directory, they should be removed using wildcard characters.
3. This line of code result<-aggregate(orders[,4], orders[c(2)],sum)executes grouping and summarizing, in which orders[,4]represents summarizing column (i.e. Amount) and orders[c(2)] represents grouping column (i.e. Client).
4. Both read.xls and WriteXLS support the data type data.frame though they come from different packages, therefore, they can coordinate rather well.Besides, read.xls function can automatically identify the format of both Excel2003 and Eexcel2007, and is quite convenient to use.
5All the code is concise and easy to grasp for beginners.

Note for use:
1.  Versions
gdata and WriteXLS are not R language’s built-in library functions, they are the third-party packages needing download and installation. What’s more, both of them require the Perl environment, so it is particularly important to choose an appropriate version. Through our trials, we find that 2.15.0 version of R language gets along well with 2.13.3 version of gdata and 3.5.0 version of WriteXLS. But something may go wrong if they operate with the newest Perl version and an older 5.14.2 version is thus required. Otherwise the following error report will appear:
Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method,  :
  Intermediate file 'C:\Users\Thim\AppData\Local\Temp\RtmpMHvLZS\file224060624738.csv' missing!
2. Performance
gdata and WriteXLS have no problem in accessing small files, but they perform badly in handling bigger files (maybe because of Perl). For example, it takes 8 to 10 minutes to read an Excel file of 8 columns and 200,000 rows. To achieve a better performance, we recommend xlsx function package. But, of course, Excel2003 will be of no use in this occasion. In fact, xlsx performs just slightly better than gdata does. Therefore, in order to truly improve performance, it is recommended that all Excel files be transferred into 2007 format and xml files in them be uncompressed and data be read through resolving these xml files.

Alternative methods:

For the problems of version conflicts and poor performance that R language has, we have alternative solutions like Python, esProc, Perl etc. As R language, they can also access Excel files and perform data computing. In the following, we’ll introduce briefly esProc and Python.

esProc integrates the function of accessing EXCEL into its installation package, so it is no need for it to download the extra third-party packages. It can access Excel2003, Excel2007, Excel 2010 and even the older versions. Its code is as follows:

esProc's performance is satisfactory. It takes only 20 to 30 seconds for it to read an Excel file of 8 columns and 200,000 rows.

Python has a rather excellent performance, except that it requires the third-party packages as R language does. Pandas should have been able to complete the task of accessing xls file easily, but its installation under windows failed (after all, xls files are mainly produced under windows). Finally, we succeeded in performing this operation by using packages of both xlrd and xlwt3. Unfortunately, the two packages support only Excel2003 and produce much more complicated code:
    import xlwt3
    import xlrd
    from itertools import groupby
    from operator import itemgetter
    import os
    dir="E:/ordersData/"
    fileList =os.listdir(dir)
    rowList = []
    for f in fileList:
    book=xlrd.open_workbook(dir+f)    #open read-only workbook by loop
    sheet=book.sheet_by_index(0)
    nrows = sheet.nrows
    ncols = sheet.ncols
    for i in range(1,nrows):
    row_data = sheet.row_values(i)
    rowList.append(row_data)      #all records are appended to rowList
    rowList=sorted(rowList,key=lambda x:(x[1]))          #sort the data before     grouping
    result=[]
    for key, items in groupby(rowList, itemgetter(1)): # group using groupby function
        value1=0
    forsubItem in items:value1+=subItem[3]
    result.append([key,value1])              #merge the summarized result into 2D array in the end
    wBook=xlwt3.Workbook()                   # create a new writable workbook
    wSheet=wBook.add_sheet("sheet 1")
    wSheet.write(0,0,"Client")
    wSheet.write(0,1,"Sum")
    for row in range(len(result)):         #write data to the file by loop
    wSheet.write(row+1,0,result[row][0])
    wSheet.write(row+1,1,result[row][1])
    wBook.save(dir+"result.xls")                     #save the file

It is a far more complicated method than R language.

没有评论:

发表评论