Monday, August 18, 2014

A Handy Method of Conditional Filtering in Text Files with Java

Processing data stored in text files is a common task. Here we'll use an example to look at how to perform conditional filtering on a text file with Java: read employee information from the text file employee.txt and select the female employees who were born on or after January 1, 1981.

The text file employee.txt has the following format:
EID  NAME      SURNAME  GENDER  STATE         BIRTHDAY    HIREDATE    DEPT       SALARY
1    Rebecca   Moore    F       California    1974-11-20  2005-03-11  R&D        7000
2    Ashley    Wilson   F       New York      1980-07-19  2008-03-16  Finance    11000
3    Rachel    Johnson  F       New Mexico    1970-12-17  2010-12-01  Sales      9000
4    Emily     Smith    F       Texas         1985-03-07  2006-08-15  HR         7000
5    Ashley    Smith    F       Texas         1975-05-13  2004-07-30  R&D        16000
6    Matthew   Johnson  M       California    1984-07-07  2005-07-07  Sales      11000
7    Alexis    Smith    F       Illinois      1972-08-16  2002-08-16  Sales      9000
8    Megan     Wilson   F       California    1979-04-19  1984-04-19  Marketing  11000
9    Victoria  Davis    F       Texas         1983-12-07  2009-12-07  HR         3000
10   Ryan      Johnson  M       Pennsylvania  1976-03-12  2006-03-12  R&D        13000
11   Jacob     Moore    M       Texas         1974-12-16  2004-12-16  Sales      12000
12   Jessica   Davis    F       New York      1980-09-11  2008-09-11  Sales      7000
13   Daniel    Davis    M       Florida       1982-05-14  2010-05-14  Finance    10000

The Java approach is to read the file line by line, save each record in a List, traverse the List, and save the eligible records in a result List. Finally, the program prints the number of eligible employees.

The detailed code is as follows:
       // requires: import java.io.*; import java.text.SimpleDateFormat; import java.util.*;
       public static void myFilter() throws Exception {
              List<Map<String,String>> sourceList = new ArrayList<Map<String,String>>();
              List<Map<String,String>> resultList = new ArrayList<Map<String,String>>();
              // try-with-resources closes the reader even if reading fails
              try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("D:\\employee.txt")))) {
                     if (br.readLine() == null) return; // skip the header line; exit if the file is empty
                     String line;
                     while ((line = br.readLine()) != null) { // load the file into memory
                            String[] info = line.split("\t");
                            Map<String,String> emp = new HashMap<String,String>();
                            emp.put("EID", info[0]);
                            emp.put("NAME", info[1]);
                            emp.put("SURNAME", info[2]);
                            emp.put("GENDER", info[3]);
                            emp.put("STATE", info[4]);
                            emp.put("BIRTHDAY", info[5]);
                            sourceList.add(emp);
                     }
              }
              SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
              Date cutoff = sdf.parse("1981-01-01"); // parse the cutoff date once, outside the loop
              for (Map<String,String> emp : sourceList) { // process the data record by record
                     // keep female employees whose birthday is not before the cutoff
                     if ("F".equals(emp.get("GENDER")) && !sdf.parse(emp.get("BIRTHDAY")).before(cutoff)) {
                            resultList.add(emp);
                     }
              }
              System.out.println("count=" + resultList.size()); // print the number of eligible employees
       }
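
The date test in myFilter is written as "not before the cutoff" rather than "after the cutoff", so that an employee born exactly on January 1, 1981 is kept. The following is a minimal sketch of that comparison on its own; the class name BirthdayCheck and the method onOrAfter are made up for illustration and are not part of the program above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper class, used only to illustrate the date comparison.
class BirthdayCheck {
    // Returns true when birthday falls on or after cutoff (both in yyyy-MM-dd form).
    static boolean onOrAfter(String birthday, String cutoff) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        Date b = sdf.parse(birthday);
        Date c = sdf.parse(cutoff);
        return !b.before(c); // "not before" keeps the cutoff day itself
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(onOrAfter("1981-01-01", "1981-01-01")); // true: the cutoff day is included
        System.out.println(onOrAfter("1980-12-31", "1981-01-01")); // false: one day too early
        System.out.println(onOrAfter("1985-03-07", "1981-01-01")); // true
    }
}
```

Using after(cutoff) instead would silently drop anyone born on the cutoff day.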

The filtering condition of myFilter is fixed. If the condition changes, the conditional statement in the program has to be modified accordingly. Several versions of the code are needed if there are several conditions, and the program cannot handle ad hoc, dynamic conditions. Now we'll rewrite the code to make it somewhat more general by slightly changing the loop that traverses sourceList:
       SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
       for (int i = 0, len = sourceList.size(); i < len; i++) {
              Map<String,String> emp = (Map<String,String>) sourceList.get(i);
              Date birthday = sdf.parse(emp.get("BIRTHDAY")); // parse the birthday once per record
              boolean isRight = true;
              if (gender != null && !gender.equals(emp.get("GENDER"))) { // process the gender condition
                     isRight = false;
              }
              if (start != null && birthday.before(start)) { // process the start condition of BIRTHDAY
                     isRight = false;
              }
              if (end != null && birthday.after(end)) { // process the end condition of BIRTHDAY
                     isRight = false;
              }
              if (isRight) resultList.add(emp); // save the eligible records in the result list
       }

In the rewritten code, gender, start and end are input parameters of the function myFilter. The program keeps a record when its GENDER field equals the input value gender and its BIRTHDAY field is greater than or equal to the input value start and less than or equal to the input value end. If any of the input values is null, the corresponding condition is ignored. The conditions are joined by AND.
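
To see how the null parameters behave, here is a minimal, self-contained sketch of such a function. It is an assumption about how the pieces fit together, not the original program: this version takes the already loaded list as a parameter and returns the result list, and the class EmployeeFilter and the helpers emp and sample are made up for illustration. The four sample rows are copied from employee.txt.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EmployeeFilter {
    // Keeps the records that satisfy every non-null condition (conditions are joined by AND).
    static List<Map<String, String>> myFilter(List<Map<String, String>> sourceList,
            String gender, Date start, Date end) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        List<Map<String, String>> resultList = new ArrayList<>();
        for (Map<String, String> emp : sourceList) {
            Date birthday = sdf.parse(emp.get("BIRTHDAY"));
            boolean isRight = true;
            if (gender != null && !gender.equals(emp.get("GENDER"))) isRight = false;
            if (start != null && birthday.before(start)) isRight = false;
            if (end != null && birthday.after(end)) isRight = false;
            if (isRight) resultList.add(emp);
        }
        return resultList;
    }

    // Hypothetical helper: builds one record with only the fields the filter reads.
    static Map<String, String> emp(String name, String gender, String birthday) {
        Map<String, String> m = new HashMap<>();
        m.put("NAME", name);
        m.put("GENDER", gender);
        m.put("BIRTHDAY", birthday);
        return m;
    }

    // Hypothetical helper: four rows taken from the employee.txt sample.
    static List<Map<String, String>> sample() {
        List<Map<String, String>> list = new ArrayList<>();
        list.add(emp("Rebecca", "F", "1974-11-20"));
        list.add(emp("Emily", "F", "1985-03-07"));
        list.add(emp("Matthew", "M", "1984-07-07"));
        list.add(emp("Victoria", "F", "1983-12-07"));
        return list;
    }

    public static void main(String[] args) throws ParseException {
        Date start = new SimpleDateFormat("yyyy-MM-dd").parse("1981-01-01");
        System.out.println(myFilter(sample(), "F", start, null).size());   // 2: Emily and Victoria
        System.out.println(myFilter(sample(), null, start, null).size());  // 3: gender is ignored
        System.out.println(myFilter(sample(), null, null, null).size());   // 4: no condition at all
    }
}
```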

If we want to make myFilter an even more general function, for example one that joins conditions with OR or allows computation between fields, the code becomes more complicated: it needs a program that parses and evaluates dynamic expressions. Such a program can be as flexible and general as SQL in a database, but it is really difficult to develop.
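
As a partial step in that direction, conditions can be passed to the filter as objects and combined with AND and OR. The following sketch uses java.util.function.Predicate from Java 8 for this; it is a different technique from the expression parser described above, and the class PredicateFilter with its helpers select, emp, sample and condition is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class PredicateFilter {
    // Keeps the records for which the supplied condition holds.
    static List<Map<String, String>> select(List<Map<String, String>> source,
            Predicate<Map<String, String>> condition) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> emp : source) {
            if (condition.test(emp)) result.add(emp);
        }
        return result;
    }

    // Hypothetical helper: builds one record with the fields used below.
    static Map<String, String> emp(String name, String surname, String gender, String birthday) {
        Map<String, String> m = new HashMap<>();
        m.put("NAME", name);
        m.put("SURNAME", surname);
        m.put("GENDER", gender);
        m.put("BIRTHDAY", birthday);
        return m;
    }

    // Hypothetical helper: four rows taken from the employee.txt sample.
    static List<Map<String, String>> sample() {
        return Arrays.asList(
            emp("Rebecca", "Moore", "F", "1974-11-20"),
            emp("Emily", "Smith", "F", "1985-03-07"),
            emp("Matthew", "Johnson", "M", "1984-07-07"),
            emp("Victoria", "Davis", "F", "1983-12-07"));
    }

    // (born on or after 1981-01-01 AND female) OR NAME+SURNAME equals "RebeccaMoore"
    static Predicate<Map<String, String>> condition() {
        // yyyy-MM-dd strings sort chronologically, so a string comparison is enough here
        Predicate<Map<String, String>> bornFrom1981 = e -> e.get("BIRTHDAY").compareTo("1981-01-01") >= 0;
        Predicate<Map<String, String>> female = e -> "F".equals(e.get("GENDER"));
        Predicate<Map<String, String>> rebeccaMoore = e -> "RebeccaMoore".equals(e.get("NAME") + e.get("SURNAME"));
        return bornFrom1981.and(female).or(rebeccaMoore);
    }

    public static void main(String[] args) {
        // Prints the matching employees in source order: Rebecca Moore, Emily Smith, Victoria Davis
        for (Map<String, String> emp : select(sample(), condition())) {
            System.out.println(emp.get("NAME") + " " + emp.get("SURNAME"));
        }
    }
}
```

The conditions here still have to be written in Java and compiled with the program. A condition that arrives as text at run time would still need the expression parser mentioned above.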

In view of this, we can turn to esProc to assist with this task. esProc is a programming language designed for processing structured and semi-structured data. It performs the above general query task easily and integrates with Java seamlessly, so that Java can access and process text file data as flexibly as SQL does.

For example, to query the female employees who were born on or after January 1, 1981, esProc can receive an external input parameter "where" as the dynamic condition, as shown in the following chart:

The value of "where" is: BIRTHDAY>=date(1981,1,1) && GENDER=="F". esProc needs only three lines of code, as follows:

A1: Define a file object and import the data from it. The first row is the header row, and tab is the default field separator. esProc’s IDE can display the imported data visually, as shown on the right of the above chart.
A2: Filter according to the condition. A macro is used here to parse the expression dynamically; “where” is the input parameter. esProc first computes the expression enclosed in ${…}, substitutes the computed result for ${…} as the macro string value, and then interprets and executes the resulting code. In this example, the code finally executed is =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").
A3: Return the eligible result set to the external program.
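
The macro step in A2 is, in effect, text substitution followed by execution. The Java sketch below imitates only the substitution part. It is an illustration, not how esProc is implemented: the cell text =A1.select(${where}) is inferred from the expanded code quoted in the A2 description, and the class MacroDemo and the method expand are made up for this example.

```java
// Hypothetical illustration of macro expansion as plain text substitution.
class MacroDemo {
    // Replaces the ${name} placeholder in the template with the given text.
    static String expand(String template, String name, String value) {
        // String.replace works on literal text, so "$", "{" and "}" need no escaping
        return template.replace("${" + name + "}", value);
    }

    public static void main(String[] args) {
        String where = "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\"";
        // Assumed cell text; only the expanded result appears in the description above
        System.out.println(expand("=A1.select(${where})", "where", where));
        // prints: =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
    }
}
```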

When the filtering condition changes, we only need to change the parameter “where”, without rewriting the code. For example, suppose the condition becomes: female employees who were born on or after January 1, 1981, or employees whose NAME+SURNAME equals “RebeccaMoore”. The value of the where parameter can be written like this: BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After execution, the result set in A2 is shown in the following chart:

Finally, call this piece of esProc code from Java through the JDBC driver provided by esProc to get the filtering result. With the above esProc code saved as the file test.dfx, the Java code that calls it is as follows:
       // requires: import java.sql.*;
       // create the esProc JDBC connection
       Class.forName("com.esproc.jdbc.InternalDriver");
       Connection con = DriverManager.getConnection("jdbc:esproc:local://");
       // call the esProc program (the stored procedure); test is the name of the dfx file
       com.esproc.jdbc.InternalCStatement st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test(?)");
       // set the parameter: the dynamic filtering condition
       st.setObject(1, "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\" || NAME+SURNAME==\"RebeccaMoore\"");
       // execute the esProc stored procedure
       st.execute();
       // get the result set: the set of eligible employees
       ResultSet set = st.getResultSet();
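
Once the ResultSet is available, it can be read like any other JDBC result. The helper below is a minimal sketch that copies the rows into a list of maps using only standard java.sql calls. It assumes the esProc driver implements ResultSetMetaData in the usual way, which is not confirmed here, and the class ResultSetRows and the method toRows are made up for illustration.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of esProc or of the code above.
class ResultSetRows {
    // Copies every remaining row of the result set into a map keyed by column label.
    static List<Map<String, Object>> toRows(ResultSet rs) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        int columns = meta.getColumnCount();
        List<Map<String, Object>> rows = new ArrayList<>();
        while (rs.next()) {
            // LinkedHashMap keeps the columns in the order the driver reports them
            Map<String, Object> row = new LinkedHashMap<>();
            for (int i = 1; i <= columns; i++) { // JDBC column indexes start at 1
                row.put(meta.getColumnLabel(i), rs.getObject(i));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

With the code above, ResultSetRows.toRows(set) would return one map per eligible employee. The statement and the connection should be closed once the rows have been read.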

When the script is relatively simple, we can write the esProc code directly in the Java code that calls esProc JDBC. This saves us from having to write the esProc script file (test.dfx):
st=(com.esproc.jdbc.InternalCStatement)con.createStatement();
ResultSet set=st.executeQuery("=file(\"D:\\\\esProc\\\\employee.txt\").import@t().select(BIRTHDAY>=date(1981,1,1)&&GENDER==\"F\" || NAME+SURNAME==\"RebeccaMoore\")");

This piece of Java code directly executes one line of esProc script: it gets the data from the text file, filters it according to the specified condition and returns the result to set, the ResultSet object.

