Monday, March 31, 2014

Data Source Computing Issue in Reporting

Reporting tools are by now mature and well developed. We have embraced emerging tools like QlikView, Tableau, and Spotfire, and we still have classic choices like SAP Visual Intelligence, JasperReport, and Cognos. In report development, however, one key obstacle remains: data source computation.

The goal of data source computation is to produce a convenient, ready-to-use result set for the report. SQL and stored procedures (SQL/SP) are the traditional means of doing this, but SQL/SP is relatively hard to design, debug, and migrate. Worse, when the computation spans several databases or involves non-database data sources, the traditional method breaks down entirely, to say nothing of JasperReport and the other tools that are restricted to a single data source. To clear this obstacle we need other techniques; the typical ones are the data warehouse, the data source computation layer, the semantic layer, table-joining, and API programming. Based on my personal experience, I would like to share the respective pros and cons of these methods.
        
Data Warehouse. A data warehouse builds a single, consistent data view through ETL, and many proven, mature tools support it, for example Teradata FastLoad, the DB2 data warehouse offerings, and Oracle Warehouse Builder. Its advantages are numerous: cross-database computation, non-database computation, historical data tracking, high performance, and low load on the business database.

The major drawback of the data warehouse is its high cost, which comes mainly from building and maintaining it rather than from purchasing the product. It requires dual experts who are skilled in both IT and business: people fluent in Perl, VBScript, Java, and SQL/SP as well as in the business logic. They must perform complicated data splitting and consolidation, design the normal forms and the incremental-update algorithms, and schedule tasks across multiple levels by priority. Keeping the warehouse efficient and usable requires more than one ETL process, and a warehouse project can easily fail because these costs spiral out of control.
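
To make that maintenance burden concrete, here is a minimal sketch of the kind of watermark-based incremental load an ETL developer must hand-build and keep running. The connection strings, the orders tables, and the etl_control watermark table are all hypothetical:

```java
import java.sql.*;

// A minimal watermark-based incremental load: copy only the rows added since
// the last run from the business database into the warehouse. All table and
// column names here are hypothetical.
public class IncrementalLoad {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:oracle:thin:@//erp:1521/ORCL", "user", "pwd");
             Connection dw  = DriverManager.getConnection("jdbc:postgresql://dwhost:5432/dw", "user", "pwd")) {

            // 1. Read the watermark left by the previous run.
            Timestamp watermark;
            try (Statement s = dw.createStatement();
                 ResultSet rs = s.executeQuery("SELECT last_load_time FROM etl_control WHERE job = 'orders'")) {
                rs.next();
                watermark = rs.getTimestamp(1);
            }

            // 2. Pull only the rows created after the watermark.
            PreparedStatement read = src.prepareStatement(
                "SELECT order_id, client_id, amount, created_at FROM orders WHERE created_at > ?");
            read.setTimestamp(1, watermark);

            PreparedStatement write = dw.prepareStatement(
                "INSERT INTO dw_orders (order_id, client_id, amount, created_at) VALUES (?, ?, ?, ?)");

            Timestamp newWatermark = watermark;
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setLong(2, rs.getLong(2));
                    write.setBigDecimal(3, rs.getBigDecimal(3));
                    write.setTimestamp(4, rs.getTimestamp(4));
                    write.addBatch();
                    if (rs.getTimestamp(4).after(newWatermark)) newWatermark = rs.getTimestamp(4);
                }
            }
            write.executeBatch();

            // 3. Advance the watermark so the next run starts where this one ended.
            PreparedStatement mark = dw.prepareStatement("UPDATE etl_control SET last_load_time = ? WHERE job = 'orders'");
            mark.setTimestamp(1, newWatermark);
            mark.executeUpdate();
        }
    }
}
```

Every source table needs its own variant of this routine, plus scheduling and failure handling, and that is exactly where the cost accumulates.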

Data source computation layer. An emerging approach of recent years, the data source computation layer tries to solve the data source computation problem at its root with a general-purpose method. Using scripts designed specifically for structured data, it can retrieve data from various databases and non-database sources, and it even computes across heterogeneous data sources interactively. It also offers professional development and debugging tools that reduce the difficulty of complex computation. All in all, this method is more efficient than SQL/SP. It is also more portable, because by nature it is independent of any specific database or reporting tool.

esProc and R are typical data source computation layers. esProc can output its result via JDBC so the report can reference it directly, but it does not support correlation analysis and other statistical models. Conversely, R supports various ready-made analysis models, yet it is awkward to invoke from a report.

This method typically computes fully in memory, so its standalone performance is poorer than that of a data warehouse. However, both esProc and R handle distributed computation well, because both support big data frameworks such as Hadoop.
        
Semantic Layer. The semantic layer was introduced when the data warehouse proved disappointing. It redefines the physical data model, which used to be quite difficult to understand, and it does not strictly differentiate between database, text, Excel, and XML sources, so report designers can work with the data in a more user-friendly way. SAP Universe and QlikView are the most characteristic examples.

Both the semantic layer and the data warehouse are designed to build an easy-to-use data source, so their advantages are basically the same. However, building a semantic layer consumes fewer resources and less time, because it involves fewer procedures and is comparatively more agile. Its cost of establishment and maintenance is lower than that of a data warehouse, though higher than that of a data source computation layer.

The main drawback of the semantic layer is its limited computational capability. It is built around wizards, complemented by elementary functions for data source consolidation and computed columns, but it lacks a scripting language tailored to structured data computation. As a result, the semantic layer can rarely perform data source computation on its own; assistance from a simple data warehouse or an ETL process is often necessary. Its full-memory computation is also weak in performance. Lastly, the semantic layer is usually bundled with a reporting tool and is therefore poorly portable.

Table-joining. This is one of the basic functions of the semantic layer; many reporting tools expose it separately as a simple way to solve multi-data-source problems in a single-source report. For example, JasperReport and BIRT support only a single data set per report, so if data from Oracle and MSSQL must appear in one report at the same time, you can take advantage of their table-joining functionality.

Table-joining beats the other methods on workload and staffing cost, but it carries more feature restrictions: only the simplest joins can be performed, and ordinary data source computation is not possible; support for non-database data sources is insufficient; the bundled reporting tool and database are hard to migrate away from; and performance is quite low because the join runs entirely in memory.
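
To make that last restriction concrete, the sketch below shows in plain Java roughly what a reporting tool's table-joining does internally: both result sets are materialized in memory and matched on a key, so memory, not the database, becomes the bottleneck. The generic row layout is my own simplification:

```java
import java.util.*;

// Roughly what a reporting tool's table-joining does internally: hold both
// result sets in memory and match rows on a key (an inner equi-join).
public class MemoryJoin {
    public static List<Object[]> join(List<Object[]> left, List<Object[]> right,
                                      int leftKey, int rightKey) {
        // Index the right-hand rows by their key column.
        Map<Object, Object[]> index = new HashMap<>();
        for (Object[] r : right) index.put(r[rightKey], r);

        // Match each left-hand row; unmatched rows are dropped.
        List<Object[]> result = new ArrayList<>();
        for (Object[] l : left) {
            Object[] r = index.get(l[leftKey]);
            if (r == null) continue;
            Object[] merged = Arrays.copyOf(l, l.length + r.length);
            System.arraycopy(r, 0, merged, l.length, r.length);
            result.add(merged);
        }
        return result;
    }
}
```

Both inputs must fit in memory at once, which is why performance degrades quickly on large result sets.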

API Programming. Here the reporting tool's API is used to compute the data source programmatically and ultimately return a result set. Most reporting tools provide such a mechanism, for example SAP's CDS framework and Jasper's Bean data source.

This method offers the utmost flexibility and can, in theory, solve any data source computation problem. It also involves the greatest development complexity of all the methods discussed here.

To access non-database data sources, this method generally relies on Java or another high-level language. Unlike languages designed for data computation, such as SQL, esProc, or R, these languages need chunks of code to implement even a simple algorithm, so even simple requirements can be difficult and labor-intensive. In practice, the value of API programming lies largely in complementing the semantic layer or table-joining.
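
As an illustration, here is a minimal sketch of Jasper's Bean data source route using JasperReports' JRBeanCollectionDataSource. The ClientOrder bean, the template name, and the sample rows are invented for the example:

```java
import java.util.*;
import net.sf.jasperreports.engine.*;
import net.sf.jasperreports.engine.data.JRBeanCollectionDataSource;

// Compute the rows in Java first, then hand them to the report as a Bean
// data source. ClientOrder is a hypothetical bean whose getters must match
// the report's field names.
public class BeanSourceReport {
    public static class ClientOrder {
        private final String client;
        private final double amount;
        public ClientOrder(String client, double amount) {
            this.client = client;
            this.amount = amount;
        }
        public String getClient() { return client; }
        public double getAmount() { return amount; }
    }

    public static void main(String[] args) throws JRException {
        // Any hand-written computation can happen here before filling.
        List<ClientOrder> rows = Arrays.asList(
            new ClientOrder("Acme", 1200.0),
            new ClientOrder("Globex", 950.0));

        JasperPrint print = JasperFillManager.fillReport(
            "orders.jasper",                          // compiled report template
            new HashMap<String, Object>(),            // report parameters
            new JRBeanCollectionDataSource(rows));    // the Bean data source
        JasperExportManager.exportReportToPdfFile(print, "orders.pdf");
    }
}
```

The computation itself stays entirely in hand-written Java, which is exactly why this route grows expensive as the logic grows.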
        

In conclusion, the five methods each fit their own scenarios: table-joining suits simple joins between databases; API programming suits simple non-database computations; the data source computation layer suits computations that demand portability or a complex data environment; the data warehouse suits LAN environments with high performance demands; and the semantic layer suits business users who process large numbers of simple reports.

Wednesday, March 26, 2014

Consolidate Data from Multiple Sources in a Single-Source Report

Report data sources include the result sets of SQL queries or stored procedures and the two-dimensional tables read from text or Excel files. For reasons of technical capability or product edition, many reporting tools, such as JasperReport, Quiee, BIRT, and Crystal Reports, support only a single data source.

Yet users often face cases where a report has multiple data sources that must be consolidated into one. SQL can handle some simple cases, for example joining two tables in the same database, but real-world scenarios are often far too complex for SQL alone.

Many people turn to stored procedures, whether for nested queries and subqueries across multiple tables or for complex computations involving branching, loops, and temporary tables. However, most report developers are better at presenting reports than at composing SQL, so writing complex stored procedures forces them to seek help from DBAs and other IT experts. That is time-consuming to coordinate and communicate and inconvenient to debug, not to mention the ever-growing workload of future amendments and upgrades. Judging from this, the stored procedure is not the best way to consolidate multiple data sources.

What's more, many core business databases forbid stored procedures altogether for safety's sake. The usual workaround is to trade space for safety, for example by setting up another server that replicates the data synchronously, or by going through ETL, a data warehouse, or similar indirect means. This obviously incurs extra cost and compromises performance, so it is not the best choice either.

Worse still, some cases are simply unsolvable with stored procedures. For example, if the CRM runs on MSSQL and the ERP on Oracle, a cross-database computation is needed to compute each client's yearly increase in order value from the ERP and present it on the report together with the client details from the CRM. And when some data lives in TXT, Excel, or XML files, computation between database and non-database sources is required. Many people make the grave mistake of solving such problems with Java or another high-level language.
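
To see why that is a mistake, here is a minimal sketch of what the hand-coded Java route looks like for the CRM/ERP scenario above. Every connection string, table, and column name is hypothetical:

```java
import java.math.BigDecimal;
import java.sql.*;
import java.util.*;

// Hand-coding the CRM/ERP scenario: read yearly order totals from Oracle,
// client details from MSSQL, and join them in application memory.
// Connection strings and table layouts are hypothetical.
public class CrossDbReportData {
    public static void main(String[] args) throws SQLException {
        Map<Long, BigDecimal[]> totals = new HashMap<>(); // clientId -> {thisYear, lastYear}

        try (Connection erp = DriverManager.getConnection("jdbc:oracle:thin:@//erp:1521/ORCL", "u", "p");
             Statement s = erp.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT client_id, EXTRACT(YEAR FROM order_date) yr, SUM(amount) total " +
                 "FROM orders GROUP BY client_id, EXTRACT(YEAR FROM order_date)")) {
            while (rs.next()) {
                BigDecimal[] t = totals.computeIfAbsent(rs.getLong(1), k -> new BigDecimal[2]);
                int yr = rs.getInt(2);
                if (yr == 2014) t[0] = rs.getBigDecimal(3);
                if (yr == 2013) t[1] = rs.getBigDecimal(3);
            }
        }

        try (Connection crm = DriverManager.getConnection("jdbc:sqlserver://crm:1433;databaseName=CRM", "u", "p");
             Statement s = crm.createStatement();
             ResultSet rs = s.executeQuery("SELECT client_id, client_name FROM clients")) {
            while (rs.next()) {
                BigDecimal[] t = totals.get(rs.getLong(1));
                if (t == null || t[0] == null || t[1] == null) continue;
                // Yearly increase = this year's total minus last year's.
                System.out.println(rs.getString(2) + "\t" + t[0].subtract(t[1]));
            }
        }
    }
}
```

Every new report requirement means another round of this plumbing, which is exactly the workload that languages designed for data computation avoid.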

So how can multiple data sources be consolidated completely and conveniently?

Drawing on years of experience, I've summarized several methods to share and discuss with you.

Virtual Data Source for JasperReport Commercial Server

The Virtual Data Source provided in the JasperReport commercial edition can solve this problem. Through a wizard, a virtual table is created by joining two entity tables, each of which can be a physical table or a SQL query. The interface is shown below:

[Screenshot: the JasperReport Virtual Data Source wizard]

The consolidation could hardly be simpler: I only need to select the type of relation. It also supports quite a few data source types, covering not only computation between databases but also computation between HQL/EJBQL sources and a database.

But the method's computational capability is poor. First, it can only consolidate two tables at a time, so the procedure becomes exceptionally cumbersome for multi-table or multi-level relations. Moreover, with no native support for computational scripts, it offers only three ways to consolidate, which is far from enough even for ordinary data source computation.

esProc for reporting tools
Unlike the bundled options above, esProc is a data source development tool that is completely independent of any database or reporting tool, which makes it convenient to pair with whichever reporting tool you use. Its computational interface looks like this:

[Screenshot: the esProc grid-style computational interface]

esProc's advantage is its strong computational capability, which makes complex goals easy to achieve. It is more powerful than SQL thanks to its intuitive grid-style interface, professional IDE, and convenient debugging. Its support for hybrid computation between database and non-database sources is particularly impressive.
        
esProc computes fully in memory, so its standalone performance is worse than a data warehouse's. To partly remedy this, esProc also supports Hadoop; in other words, it can handle big data computation in a distributed environment.
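
For completeness, this is roughly how a Java host (or a reporting tool's JDBC data source) would call an esProc script. I have not verified the driver class, URL, or call syntax shown here, so treat all three as assumptions and check the esProc documentation:

```java
import java.sql.*;

// Invoking an esProc script through its JDBC driver, so the result can be
// consumed like any SQL result set. The driver class, URL, and call syntax
// below are assumptions used to illustrate the pattern; the script name
// orderStats is hypothetical.
public class EsprocCall {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");              // assumed driver class
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://"); // assumed URL
             CallableStatement st = con.prepareCall("call orderStats(?)")) {
            st.setInt(1, 2014);                                       // script parameter
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
                }
            }
        }
    }
}
```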

Data Sources Join for BIRT
BIRT is the famous open source Java reporting tool. Despite its rather plain interface, its report development capability is widely recognized. Its Data Sources Join can likewise solve many multi-data-source consolidation problems. The interface is as follows:

[Screenshot: the BIRT Data Sources Join dialog]

As can be seen, BIRT's feature is almost the same as Jasper's. Since they share the same advantages and disadvantages, I will not dwell on it; refer to the Jasper introduction above.

Merge Dimensions for SAP BO Report
BO provides several functions for data source consolidation, such as IDT (Information Design Tool), Database Expert, and Merge Dimensions. Since these features do not differ much in nature, I will discuss their pros and cons through the Merge Dimensions feature of Web Intelligence, a major BO offering, shown in the figure below.

[Screenshot: Merge Dimensions in SAP BusinessObjects Web Intelligence]

As can be seen, query1 and query2 are equivalent to Jasper's left and right tables, and the year and quarter they correspond to are Jasper's joined fields. By and large, Merge Dimensions is weaker than Jasper's feature: functions that Jasper implements in a single interface require more than one BO tool, each of which is operated by switching among several screens.

These four methods are the ones I am most familiar with. Why not share your own?