23 October 2012

After years of basic data extraction and a couple years of data warehousing with time-based analysis, I began designing a system with more interesting query capabilities based on aggregating multiple dimensions. The implementation techniques were generally iterative based on the sophisticated functionality benefits that I gained in similar batch generation projects, but those techniques inevitably run into speed challenges. I had investigated MapReduce as one of the sexy new technologies, but the biggest implementations (Hadoop, Amazon EMR) were designed for big data with some delay for batch processing. That solution does not scale down to what I wanted. However, armed with my rough notes, I began searching for BI and analytics solutions that fit well. As is often the case, it just took some lucky searches to come across the right jargon. Once you have the right search terms, the world opens up.

Most of the queries I was implementing against my data warehouse could be replicated with some effort using a PivotTable in Excel, and reading on server-side implentations of PivotTables eventually led me to the OLAP cube. The OLAP cube was exactly what I was looking for to address some of my use cases. It provides a conceptual framework for communicating and standardizing data structures for BI analysis. Research into OLAP led me to the MDX query language used along with OLAP, and it had such a rich and applicable syntax for addressing multi-dimensional queries. Such queries in SQL require multiple complex (and often inefficient) joins.

At the time of this research, the wikipedia entry for data warehouse only references OLAP as the 17th "see also" link. The term is used more in BI circles. I likely saw OLAP multiple times without paying much attention for one good reason -- it is basically inaccessible to the PHP developer. The leading PHP tool appears to be olap4php.org, but they have not reached a 1.0 release yet. Even worse, the servers are dominated by Java and Microsoft solutions, which rarely fit comfortably in a LAMP stack.

I've thrown some new terms into my vocabulary, and my research continues. If OLAP and XMLA were as ubiquitous as MySQL and PHP, then I'd have my solution. As it stands, I am currently left with the unfortunate choice of adding the complexity of integrating with a system with minimal PHP library and IDE support or continuing down the route of investing time in a more limited but accessible custom solution. Time will tell, but you can guess my next search terms!

If I choose to use an open source OLAP server, Mondrian (from Pentaho) is the definite winner in my book. ocCube appears more user-friendly, but its free version has strict limitations. If I choose to go a custom route, at least I have the option of implementing as a subset of a standard to allow for a more graceful transition later.

Some References for OLAP, MDX/XMLA and MapReduce

  1. Introduction to OLAP cubes
  2. Pivot tables are low-level OLAP clients
  3. MDX is used to query this type of data
  4. MDX over SQL
  5. MDX to create hierarchies
  6. Examples of MDX queries
    1. Basic examples of calculated members without aggregation.
    2. Examples that can be executed, including aggregation.
  7. icCube is an open source OLAP database
  8. Virtuoso provides XMLA support (not free)
  9. Other open source OLAP databases exist.
  10. But is OLAP dead according to Microsoft?
  11. Mondrian (Pentaho): Dynamically load the datasources.xml file to detect changes
    1. Example datasource xml: http://trac.spatialytics.com/geomondrian/browser/trunk/geomondrian/FoodMart.xml
  12. Aggregates and Big Data | StMcPherson's Blog
  13. MapReduce vs Data Warehouse
  14. The aggregate reducer/combiner package


blog comments powered by Disqus