file function and import function can be used if data in text files
need to be read in. For example, the following text file, empolyee.txt, stores
employee information:
EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
…
It is convenient to import the file code:
In the above figure, cell A1 uses file function to open a designated file and cell A2 uses the file's import function to import the file's content. We can see in the figure that esProc’s integrated development environment provides debugging functions, like single step, breakpoint and so on. The imported file content can be displayed visually in the window on the right, making debugging program and the observation of result more convenient.
As the other functions provided by esProc, import function contains
options and parameters. @t in import@t() in the above figure shows option t is
used, and by looking up in the function reference, @t means that the data file’s
first row is the field name. It is the parameter that the parentheses contains.
The blank in parentheses in the above figure shows that all parameters are
default.
Let's look at how to code with those
functions in esProc program:
1. Separator
In the above figure, because tab is used as a separator in text file
empolyee.txt, no input parameter is used in the parentheses of import function.
The default character tab is used directly instead. If other separators need to
be imported, they can be defined with parameters. An example: comma is used to
separate fields in csv file supported by excel, so the code for import function
is import@t(;",").
2. Import a portion of the fields
When a portion of the fields, instead of all of them, is to be
imported, parameters can be used to define them, e.g.,
import@t(EID,NAME,BIRTHDAY).
3. Data type designated to be imported
esProc
automatically judges the type of imported fields. For instance, BIRTHDAY is in
consistent with the format of date and time, so esProc will automatically
import the datetime type. If the field is required to be imported as a string,
the code may be like this: import@t(EID,NAME,BIRTHDAY:string). The data types
compatible with esProc include bool, int, long, float, decimal, number, string,
date, time and datetime.
4. Importing data piecewise
It is required
to piecewise import data when the data size of a text file is larger than the
memory. import function of esProc provides two input parameters to read data
from designated starting bytes to end bytes. For instance,
import@t(EID,NAME,BIRTHDAY;,1000:2000) reads data from the 1000th
bytes to the 2000th bytes. Usually, text file data gets its meaning
when read by rows. import function automatically adjusts the position of
starting point and end point to ensure a full data row. In order to ensure that
the imported data consists exclusively of full rows, when the starting point
happens to be in the middle of a data row, the import function will skip the
row and begin with the next row; but if it is the end point that happens to be
in the same position, the reading will continue until the end of the row. Reading
data piecewise rules that the starting character will not be read while the end
character will. So by making each end point to be the starting one of the next
reading, all data rows will be imported without omission and repetition. Thus,
in the preceding example, import@t(EID,NAME,BIRTHDAY;,2000:3000) is the
appropriate code for the next reading.
It's worth
pointing out that because the data should not be segmented according to rows,
the current row number can only be determined after the traversal of all
preceding rows. Otherwise, the expected excellent performance of importing data
piecewise won't be achieved.
esProc also
segments data automatically. For example, if a file is divided into three parts
and the second one is to be read, the code is like this:
import@tz(EID,NAME,BIRTHDAY:string;,2:3). With the option @z, meanings of the
two parameters become the total number of parts and the serial number of
current part waiting to be read. In this method, it is convenient for multiple
parallel tasks to piecewise process a big file.
5. The process and export of big files
After the file
is imported to the memory, it can be exported or re-imported into the database.
For example, after empolyee.txt is imported, NAME and SURNAME could be merged
into FULLNAME to be exported to employee_result.txt. If data size of the file
is big enough, text cursor can be used to process it. esProc cursor can take
out file data one by one and make computations. In addition, it provides quite
a lot of cursor functions, making code more concise.
A complete esProc code is as
follows:
In the above
figure, cell A2 defines a cursor of file; cell A3 calculates full name; cell A4
exports the result of A3 to the result file.
6. Access hdfs files.
It’s easy to
access hdfs files with esProc. Your just need to change file functions to
hdfsfile functions, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.txt").For
a compressed file, change the extension, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.gzip").
Thus hdfs compresses or decompresses according to the extensions of files when
reading and writing hdfs files. Compress methods includes gzip, LZO and LZ4
etc.
7. General file operations
We can make
general file operations with esProc. For example, functions, such as move,
exists, date, size, property and filename, respectively have the function of
move (delete), checking whether a file exists, modifying date, size checking,
reading property of configuration files and splitting full path, as well as
obtaining file names and extensions.
没有评论:
发表评论