1. Integration with OLAP (Real Time & Batch)
  2. StreamHorizon OLAP-JDBC mode of operation
  3. StreamHorizon OLAP-BULK mode of operation
  4. Refreshing OLAP via SQL procedures & functions
  5. Batch oriented OLAP integration
  6. Real Time OLAP integration
  7. Flushing StreamHorizon dimension caches in data processing (online) mode
  8. In what programming language is StreamHorizon platform written?

Integration with OLAP (Real Time & Batch)

StreamHorizon supports generic events that allow you to integrate seamlessly with any OLAP server.

By deploying the StreamHorizon demo on adequate hardware you will get an idea of the power of the platform in terms of data throughput and processing efficiency. A quick browse of the XML tutorial provided in the config folder of the StreamHorizon distribution will give you a good idea of the source and target connectors you could use, and of the tuning parameters of the platform, which are designed to bring maximum performance out of your hardware. Upon a decision to proceed with the purchase of the StreamHorizon platform, our technical staff will be in touch with you to discuss the best deployment strategy, and a licensed copy of the software, including guides and manuals, will be provided.

StreamHorizon OLAP-JDBC mode of operation

If you use the JDBC mode of operation of StreamHorizon, all you need to do is implement the event <afterFeedProcessingCompletion>, which is called after every successful load/processing of data (if data comes from files, the event is invoked after every file is loaded into the target database; the same logic applies if data comes from SQL queries or message queues). The event implementation should invoke your executable (please see below), which accepts parameters such as feedName and/or feedID (or any other variable) identifying which data needs to be loaded from your database table into the OLAP cube.

<command type="shell">
d:/olapCubeUploader.bat ${feedName} ${feedID} ${cubeName} ${serverName}
</command>
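The uploader executable itself lives outside StreamHorizon. As a sketch only, assuming a Microsoft Analysis Services target (where cube processing is driven by XMLA) and purely hypothetical object IDs, the script could submit a request along these lines for the partition corresponding to the loaded feed:

```xml
<!-- Hypothetical XMLA processing request issued by the uploader script;
     SalesDW/SalesCube/Sales are example IDs, and the partition ID would be
     derived from the ${feedName}/${feedID} parameters passed by the event -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>SalesDW</DatabaseID>
    <CubeID>SalesCube</CubeID>
    <MeasureGroupID>Sales</MeasureGroupID>
    <PartitionID>Sales_Feed_${feedID}</PartitionID>
  </Object>
  <Type>ProcessAdd</Type>
</Process>
```

For other OLAP vendors the same script would issue that vendor's equivalent incremental processing command.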

StreamHorizon OLAP-BULK mode of operation

In bulk mode of operation the instructions are the same as for JDBC mode. The only difference is that, as the database threadpool (rather than the ETL threadpool) executes the load of data into the database, the event which needs overriding is <afterBulkLoadSuccess>:

<command type="shell">
d:/olapCubeUploader.bat ${feedName} ${feedID} ${cubeName} ${serverName}
</command>

Refreshing OLAP via SQL procedures & functions

Rather than pushing data into the OLAP server via an external executable or OS command, you can supply stored procedures or functions which achieve the same goal. The only difference is that the <command> element type needs to be changed to sql (from shell) and a suitable procedure name supplied. The bulk mode OLAP refresh given above would become:

<command type="sql">
CALL proc_olapCubeUploader ( ${feedName}, ${feedID}, ${cubeName}, ${serverName} )
</command>

The same logic applies to the JDBC mode of operation.

Batch oriented OLAP integration

If your Data Warehouse/Business Intelligence stack does not have to operate in real time, the simplest and most robust (in terms of stability) mode of operation is to refresh all cube dimensions after the ETL processes have loaded the entire batch into the data mart. This ensures that each dimension is updated only once; in Analysis Services terms, this is the time to execute “ProcessUpdate” against all changed dimensions. Dimensional processing should be followed by the data upload into the cube, which can be designed to load all data (preferably in batches if data volumes are too big for a single upload).
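In Analysis Services terms, the batch refresh described above can be sketched as a single XMLA batch (object IDs below are purely illustrative), with each changed dimension updated once and the cube data loaded afterwards:

```xml
<!-- Illustrative batch refresh: ProcessUpdate each changed dimension once,
     then reload the cube data; commands inside a Batch run sequentially -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>SalesDW</DatabaseID>
      <DimensionID>DimCustomer</DimensionID>
    </Object>
    <Type>ProcessUpdate</Type>
  </Process>
  <Process>
    <Object>
      <DatabaseID>SalesDW</DatabaseID>
      <CubeID>SalesCube</CubeID>
    </Object>
    <Type>ProcessData</Type>
  </Process>
</Batch>
```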

Real Time OLAP integration

Real time OLAP integration requires constant uploading of data from the database into the cube as data gets loaded into the Data Warehouse. To make OLAP processing as efficient as possible, it is recommended to use ProcessAdd (addition of new dimensional records) rather than updating whole dimensions (ProcessUpdate). The terminology used is Analysis Services terminology; however, the principle is the same for all OLAP vendors.

ProcessUpdate can be time consuming when dimensions are large; it can also trigger validation of all cube partitions (not only the partition that data is loaded into). The bigger the cube and the bigger the dimension in question, the more delay is introduced by ProcessUpdate execution.

If you configure your StreamHorizon dimensions to call a stored procedure while inserting new records into the database, you can easily implement the stored procedure to:

  • Insert/update the new record in the dimension table
  • Insert a record with the key of the newly created dimensional record into a control table or a queue

The idea is that the process which executes ProcessAdd instructions against the OLAP cube (adding new records into the dimension(s)) has a list of identifiers of the new records which need to be added (uploaded) to the OLAP dimensions.
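As a sketch in Analysis Services XMLA (with hypothetical object IDs; restricting ProcessAdd to just the new keys from the control table would normally be done with an out-of-line binding, omitted here for brevity), adding new records to a dimension looks like this:

```xml
<!-- Illustrative incremental dimension refresh: ProcessAdd only appends
     new members and avoids revalidating existing cube partitions -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>SalesDW</DatabaseID>
    <DimensionID>DimCustomer</DimensionID>
  </Object>
  <Type>ProcessAdd</Type>
</Process>
```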

After all dimensions are refreshed by ProcessAdd, the upload of data into the cube can commence. Note that you must ensure that all surrogate keys (dimensional keys) contained in the cube data upload (data from the fact table) are already present in the dimensions, having been added by either ProcessAdd or ProcessUpdate (or any other processing type, such as ProcessData or ProcessFull, for that matter).

For large volumes (100+ million records), the OLAP data push should be implemented as a process which batches multiple data entities (multiple files, for example) and uploads them to the fact table in one single upload. The ideal size of such a batch should be tested and is deployment specific.

Using parallel processing, the upload of data into the cube will parallelize data uploads on the OLAP server side. Note that the process uploading data into the cube is a single process in such a case, as parallelism is achieved by a parallel-processing XMLA statement. “Parallel processing” is Microsoft Analysis Services terminology; however, the same paradigm applies to all OLAP vendors in terms of architecture and design of ETL processes.
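As an illustration of such a parallel-processing XMLA statement (Analysis Services syntax, with example object IDs), a single request can ask the server to process several fact partitions concurrently:

```xml
<!-- Illustrative parallel upload: one XMLA statement submitted by a single
     process, but the server processes the listed partitions concurrently -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>SalesDW</DatabaseID>
        <CubeID>SalesCube</CubeID>
        <MeasureGroupID>Sales</MeasureGroupID>
        <PartitionID>Sales_2024_Q1</PartitionID>
      </Object>
      <Type>ProcessAdd</Type>
    </Process>
    <Process>
      <Object>
        <DatabaseID>SalesDW</DatabaseID>
        <CubeID>SalesCube</CubeID>
        <MeasureGroupID>Sales</MeasureGroupID>
        <PartitionID>Sales_2024_Q2</PartitionID>
      </Object>
      <Type>ProcessAdd</Type>
    </Process>
  </Parallel>
</Batch>
```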

Indexing the underlying tables in your database schema may improve data retrieval performance. This applies to large dimensions and especially to fact tables.

The OLAP server should be set up to issue a single query per dimension rather than one query per dimensional attribute during refresh (which may be the default in some OLAP engines). The latter retrieval mechanism may cause refresh failures due to inconsistent data snapshots; this issue manifests in Real Time Data Warehouses with very high data throughput.

Flushing StreamHorizon dimension caches in data processing (online) mode

StreamHorizon offers the ability to flush dimension caches while an instance is online, even while it is processing data. A cache flush for any cached data set can be achieved simply by issuing an HTTP call containing the given dimension name.

Flushing the cache is a useful feature in cases when:

  • The target database has been modified by another process (external to StreamHorizon)
  • Referential data (mapping, lookup or any other) has changed and the StreamHorizon instance needs to be notified

Upon receiving the HTTP request, the StreamHorizon instance will behave as follows:

  • If an ETL thread is in the middle of processing when the HTTP request is received:
    1. It finishes processing the current feed (file, message, SQL query or other) using the existing cached data
    2. Before a new feed (file, message, SQL query or other) is consumed off the processing queue, the StreamHorizon engine flushes the existing cache(s) for the given ETL thread and continues processing with newly cached data
  • If an ETL thread is idle (not processing) when the HTTP request is received:
    1. The StreamHorizon engine flushes the cache(s)
    2. The ETL thread becomes available for processing of a new feed (file, message, SQL query or other) off the processing queue

Every cache flush is propagated to all ETL threads (the entire ETL threadpool) of a given StreamHorizon instance.

HTTP cache flush example: assuming that your StreamHorizon configuration XML file defines a dimension named myDimension1, a single HTTP request containing that dimension name will flush the cached data for that particular dimension. The request is issued against the StreamHorizon instance itself, which in a default local deployment runs on your local machine at port 21000; the generic dimension cache flush call follows the same syntax with the appropriate dimension name substituted.

In what programming language is StreamHorizon platform written?

StreamHorizon is written in the Java programming language. It uses industry-standard open-source libraries and APIs. Our team of experienced engineers has spent a lot of time optimizing the code for highly concurrent, low-latency execution.

StreamHorizon developers used the latest Java language features to provide a lock-free, extremely efficient and highly extensible platform.

Please note that certain support packages include full access to the StreamHorizon git repository.