StreamHorizon FAQ

ETL Grid integrated with & cohabiting on Compute Grid

StreamHorizon architecture is compatible with Compute Grid architectures on all deployments, both conventional and Big Data (Hadoop). Your existing Compute Grid nodes can host both your Compute Grid calculation engines and StreamHorizon instances. This allows you to run your ETL Grid and Compute Grid on the same hardware, minimizing movement of data across the network and significantly reducing hardware costs.

For more information, please refer to the Deployment Topologies chapter in the StreamHorizon guides and presentations available on the Resources & Downloads page.

ETL Grid (ETL Farm)

ETL Grid is StreamHorizon's response to the long-established concept of Computing Grids (CPU Grids) in the world of massively parallel, calculation-intensive computing. The StreamHorizon ETL Grid deployment architecture enables you to deliver your own ETL Grid by simply running the StreamHorizon engine across all flavours of filesystems (Hadoop & non-Hadoop), operating systems, clouds & virtual deployments...

For more information, please refer to the Deployment Strategies chapter in the StreamHorizon guides and presentations available on the Resources & Downloads page.

Java backed dimensions (custom ETL dimensional processing)

StreamHorizon lets end users define and create dimensional logic implemented entirely in Java. This makes it possible to use the full power of the Java language to perform:

  1. Pre-caching of dimensional data
  2. Key lookups during processing
  3. Direct cache manipulation (manipulation of Infinispan/Hazelcast or any other configured In-Memory Data Grid)

This feature effectively enables your StreamHorizon deployment to interact with any data source which can be accessed via Java.
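
As a rough illustration of the three capabilities listed above, the sketch below shows what a Java-backed dimension might look like. The CustomDimension interface, the CountryDimension class and the use of a ConcurrentHashMap as the cache are all hypothetical stand-ins chosen for this example; they are not the StreamHorizon plugin API, which is described in the StreamHorizon guides.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical contract for a Java-backed dimension (NOT the StreamHorizon API).
interface CustomDimension {
    void preCache();                   // 1. load known keys before processing starts
    int lookupKey(String naturalKey);  // 2. resolve a surrogate key during processing
}

// Illustrative implementation. A ConcurrentHashMap stands in for the
// configured In-Memory Data Grid (Infinispan, Hazelcast or another provider).
class CountryDimension implements CustomDimension {
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();
    private final AtomicInteger nextKey = new AtomicInteger(1);

    @Override
    public void preCache() {
        // In a real deployment this data could come from any source reachable from Java.
        for (String code : new String[] {"GB", "US", "DE"}) {
            cache.put(code, nextKey.getAndIncrement());
        }
    }

    @Override
    public int lookupKey(String naturalKey) {
        // 3. Direct cache manipulation: an unseen value is assigned a new key atomically.
        return cache.computeIfAbsent(naturalKey, k -> nextKey.getAndIncrement());
    }
}

class CustomDimensionDemo {
    public static void main(String[] args) {
        CustomDimension dim = new CountryDimension();
        dim.preCache();
        System.out.println("US -> " + dim.lookupKey("US")); // pre-cached value
        System.out.println("FR -> " + dim.lookupKey("FR")); // value added on first lookup
    }
}
```

Keeping pre-caching and lookup behind one small interface means the backing store can be swapped (local map, data grid, remote key-value store) without touching the processing code.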

Infinite Dimensional collections - Hadoop & Non-Hadoop platforms

StreamHorizon supports Hadoop (HBase) and non-Hadoop (Redis, memcached, Coherence) key-value stores, which enable StreamHorizon users to manipulate effectively unbounded key-value collections (so-called dimensional caches) for extremely high cardinality lookups/dimensions. This feature comes in addition to the already supported In-Memory Data Grid solutions such as Infinispan & Hazelcast.

Apache Thrift Connectivity

StreamHorizon delivers a Thrift connector to provide scalable cross-language service connectivity. This is another step in making StreamHorizon more extensible and ensuring it can fit into almost any deployment.

Thrift enables StreamHorizon to be seamlessly integrated with Java, C#, C++, Python, PHP, Perl, Haskell, Smalltalk, JavaScript, OCaml, Delphi, Node.js and other languages.

Hadoop Ecosystem - Full Data Processing Integration

StreamHorizon has positioned itself as a data processing bridge between non-Hadoop and Hadoop platforms by implementing Hadoop-specific connectivity for read and write functionality. Apart from being able to talk to both non-Hadoop and Hadoop platforms within a single ETL process, StreamHorizon can operate fully within the Hadoop ecosystem.

StreamHorizon I/O Fine Tuning Framework

The quest to overcome the I/O bottlenecks of a typical deployment has led the StreamHorizon development teams to extend the tuning configuration settings, which help Development and Infrastructure teams deliver maximum performance given the constraints of their I/O system.

The following two modes of I/O bottleneck testing enable Development teams to estimate the performance of their ETL stack prior to investing money in new I/O infrastructure.

I/O Tuning Approach 1: Exclusion of I/O from your ETL processes

StreamHorizon enables you to test your ETL stream by configuring ETL processes to read source data which is cached internally within the StreamHorizon JVM (this effectively eliminates read I/O latency). The data output of the StreamHorizon ETL framework can also be discarded, which effectively eliminates write I/O from your ETL pipeline (data is transformed but not persisted).
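
The idea can be sketched outside StreamHorizon in a few lines of Java: hold the source rows in memory, run the transform, and hand the result to a sink that throws it away. The class name IoExclusionSketch, the sample rows and the delimiter-swapping transform are all invented for this illustration; in StreamHorizon itself the equivalent behaviour is switched on through configuration rather than code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class IoExclusionSketch {
    // Runs the transform over in-memory rows and hands each result to the sink.
    // Returns the number of rows processed.
    static long run(List<String> cachedSource,
                    Function<String, String> transform,
                    Consumer<String> sink) {
        long rows = 0;
        for (String line : cachedSource) {
            sink.accept(transform.apply(line));
            rows++;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Source data cached in the JVM: no read I/O during the timed run.
        List<String> cached = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            cached.add(i + ",EUR," + (i % 100));
        }
        // Hypothetical transform: swap the field delimiter.
        Function<String, String> transform = line -> line.replace(',', '|');
        // Sink that discards its input: no write I/O.
        Consumer<String> discard = row -> { };

        long start = System.nanoTime();
        long rows = run(cached, transform, discard);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d rows in %.3f s with I/O excluded%n", rows, seconds);
    }
}
```

The figure measured this way is an upper bound on what the same transform can achieve once real read and write I/O are put back, which is what makes it useful for judging whether new I/O hardware would pay off.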

I/O Tuning Approach 2: Fine Read & Write I/O buffer tuning

StreamHorizon lets you set the size of your ETL read and write buffers. In combination with the number of parallel ETL processes configured in StreamHorizon, this lets you derive maximum performance from your current I/O system. Buffer tuning combined with ETL process parallelism enables developers to arrive at I/O utilization patterns which are optimal for a given I/O system.
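
To give a feel for this kind of experiment, the sketch below times a plain file copy for several buffer sizes and thread counts using only the Java standard library. It is an independent illustration, not StreamHorizon code: the class BufferTuningSketch, the 32 MB test file and the chosen buffer sizes are assumptions made for the example, and StreamHorizon exposes its own settings for buffers and parallelism.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BufferTuningSketch {
    // Copies source to target through explicitly sized read and write buffers.
    static long copy(Path source, Path target, int readBufferBytes, int writeBufferBytes) {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source), readBufferBytes);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(target), writeBufferBytes)) {
            // Small record-sized reads, so the buffers decide the size of the underlying I/O calls.
            byte[] record = new byte[512];
            long total = 0;
            int n;
            while ((n = in.read(record)) != -1) {
                out.write(record, 0, n);
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs `parallelism` copies concurrently and returns the elapsed milliseconds.
    static long timeRun(Path source, int parallelism, int readBuf, int writeBuf) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            long start = System.nanoTime();
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                Path target = Files.createTempFile("sh-out-" + i + "-", ".tmp");
                target.toFile().deleteOnExit();
                results.add(pool.submit(() -> copy(source, target, readBuf, writeBuf)));
            }
            for (Future<Long> f : results) {
                f.get(); // wait for completion and propagate any failure
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path source = Files.createTempFile("sh-in-", ".tmp");
        source.toFile().deleteOnExit();
        Files.write(source, new byte[32 * 1024 * 1024]); // 32 MB of test data
        for (int buf : new int[] {4 * 1024, 64 * 1024, 1024 * 1024}) {
            for (int threads : new int[] {1, 4}) {
                System.out.printf("buffer=%d bytes, threads=%d -> %d ms%n",
                        buf, threads, timeRun(source, threads, buf, buf));
            }
        }
    }
}
```

Timings from a run like this depend heavily on the operating system's file cache and on the storage device, so they are best read as relative comparisons on one machine.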

Increasing JDBC bulk load performance by 50%

The StreamHorizon R&D team has developed and filed a patent for a design which effectively increases the throughput of a single JDBC connection by 50%. StreamHorizon ETL threads are able to share a single JDBC connection which operates in bulk mode. This boosts overall throughput without creating additional locks against the target table.
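
The details of the StreamHorizon design are not described here, so the sketch below should not be read as that design. It only illustrates the general pattern of several worker threads feeding one bulk writer through a queue, which is one common way to share a single connection. SharedBulkWriterSketch and its load method are hypothetical names, and the bulkSink callback stands in for the database: with real JDBC it would typically call PreparedStatement.addBatch() for each row and executeBatch() once per batch. No claim is made that this sketch reproduces the 50% figure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

class SharedBulkWriterSketch {
    // Unique sentinel object, compared by identity, marking the end of one producer's rows.
    private static final String END = new String("END");

    // Each inner list is produced by its own thread; the calling thread is the
    // single writer that owns the (simulated) connection. Returns batches written.
    static int load(List<List<String>> rowsPerThread, int batchSize,
                    Consumer<List<String>> bulkSink) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        int producerCount = 0;
        for (List<String> rows : rowsPerThread) {
            Thread producer = new Thread(() -> {
                try {
                    for (String row : rows) {
                        queue.put(row);
                    }
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.start();
            producerCount++;
        }

        int batches = 0;
        int finished = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try {
            while (finished < producerCount) {
                String row = queue.take();
                if (row == END) {
                    finished++;
                    continue;
                }
                batch.add(row);
                if (batch.size() == batchSize) {
                    bulkSink.accept(new ArrayList<>(batch)); // one bulk write
                    batch.clear();
                    batches++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("writer interrupted", e);
        }
        if (!batch.isEmpty()) {
            bulkSink.accept(new ArrayList<>(batch)); // flush the final partial batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
                Arrays.asList("a1", "a2", "a3"),
                Arrays.asList("b1", "b2", "b3", "b4"));
        int batches = load(input, 3, b -> System.out.println("bulk write of " + b.size() + " rows"));
        System.out.println(batches + " batches written over one shared writer");
    }
}
```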

Shield against Non-Atomic I/O Operations

StreamHorizon provides feed and bulk file acceptance delay configuration settings. These settings are crucial where I/O operations are non-atomic (for example, a file that is still being written): by delaying acceptance of a file, they effectively deliver atomicity of I/O operations.
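
One common way to implement such a delay is a quiet-period check: a file is picked up only after it has gone unmodified for a configured interval. The sketch below shows that check in plain Java. It assumes the delay is measured from the file's last-modified time, which is an assumption of this example rather than a documented StreamHorizon behaviour, and the class name AcceptanceDelaySketch and the two-second value are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

class AcceptanceDelaySketch {
    // A file is accepted only once it has been left unmodified for at least `delay`.
    static boolean isReady(Instant lastModified, Instant now, Duration delay) {
        return !lastModified.plus(delay).isAfter(now);
    }

    // Convenience overload that reads the last-modified time from the filesystem.
    static boolean isReady(Path file, Duration delay) {
        try {
            return isReady(Files.getLastModifiedTime(file).toInstant(), Instant.now(), delay);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Path feed = Files.createTempFile("feed-", ".csv");
        feed.toFile().deleteOnExit();
        Duration delay = Duration.ofSeconds(2); // hypothetical setting value
        System.out.println("just written: ready=" + isReady(feed, delay));
        Thread.sleep(2500);
        System.out.println("after quiet period: ready=" + isReady(feed, delay));
    }
}
```

The delay should be chosen to exceed the longest pause a writer is likely to make while producing a file; too short a value risks picking up partial files, too long a value only adds latency.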


StreamHorizon delivers a connector which utilizes Unix/Linux pipes, enabling StreamHorizon ETL streams to deliver data to the target database via bulk load mechanisms (such as External Tables or SQL*Loader) without persisting bulk files as an intermediate step (thereby fully eliminating bulk file I/O from the ETL processing pipeline).

In Memory Data Grid & StreamHorizon

After extensive performance testing, StreamHorizon incorporates both Infinispan and Hazelcast as its preferred embedded In-Memory Data Grid (IMDG) solutions for dimensional caches. Both Infinispan and Hazelcast are integrated, packaged and distributed with official StreamHorizon releases. The StreamHorizon architecture also allows embedding of other (custom) IMDG providers.