Explore four alternatives for Batch Processing and Grid Computing from among the many Open Source and Commercial options available.
Scheduling and Load Balancing are two closely knit concepts. Most people use the term “Load Balancing” in the context of Network Load Balancing – for example, https://en.wikipedia.org/wiki/Load_balancing_(computing). That article describes (network) load balancing in the context of a web server farm. But what options are available today if what you need is a grid for analytic/scientific computations? We do a lot of batch processing in the hedge fund and investment banking space – from (S)FTP file transfers to/from brokers (deal/execution submissions, allocations) to derivatives pricing/risk/stressing/PnL calculations (real time, day-end, month-end). Many firms still build their infrastructure from scratch.
Wikipedia offers a preliminary survey under two separate categories: “Job Scheduler” and “Grid Computing Software/Middleware”.
The question remains: how do they measure up, and against what set of criteria?
Adding to the confusion, should you decide to build your own batch processing/grid infrastructure, many Open Source libraries are available. Many support Scheduling but not Load Balancing, and vice versa. Others are simply too immature or lack a following – you can tell from the number of downloads, the number of broken links and the absence of documentation. For example, NGrid supports load balancing, but not scheduling. Quartz.NET, on the other hand, supports both scheduling and “Clustering”, but with specific limitations: the job must be coded in .NET, and it must implement the “IJob” interface (less restrictive than NGrid, where you need to subclass “GObject”).
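To make the distinction concrete, here is a minimal sketch of the two responsibilities side by side. It is written in Python for brevity and is not taken from NGrid or Quartz.NET; the names `Worker`, `next_run`, `pick_worker` and `dispatch` are hypothetical, and "least-loaded node first" is just one assumed balancing policy among many.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Worker:
    """A grid node, tracked only by how many jobs it is currently running."""
    name: str
    running_jobs: int = 0


def next_run(now: datetime, run_at: time) -> datetime:
    """Scheduling: decide WHEN a daily job should fire next."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so roll over to tomorrow.
        candidate += timedelta(days=1)
    return candidate


def pick_worker(workers: list[Worker]) -> Worker:
    """Load balancing: decide WHERE a job should run (assumed policy: least loaded)."""
    return min(workers, key=lambda w: w.running_jobs)


def dispatch(workers: list[Worker]) -> Worker:
    """Assign one job to the chosen worker and update its load."""
    worker = pick_worker(workers)
    worker.running_jobs += 1
    return worker


if __name__ == "__main__":
    grid = [Worker("node-a", running_jobs=2), Worker("node-b")]
    print("next run:", next_run(datetime.now(), time(18, 0)))
    print("runs on:", dispatch(grid).name)
```

A library that offers only the first function is a scheduler; one that offers only the second is a load balancer. A complete batch infrastructure needs both, plus the glue between them.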
The objective of this article is to explore what options we have from both Commercial and Open Source spaces.
We Don’t Need a Scheduler for Everything
Before we explore our options further, I want to establish first that while scheduling and load balancing are closely knit concepts, we do NOT need a scheduler for everything.
Fig 1. Real-time updating of derivatives sensitivities is an example where we don’t need a Scheduler
A “Market Data Feed Adapter/Server” may be listening on a socket (the Bloomberg Desktop API, for example) and publish arriving ticks to a Message Bus that is accessible only to the firm’s applications within the intranet environment. A calculation grid monitoring the Message Bus picks up the newly published market data, runs its calculations, and publishes the results back to the Message Bus. Clients, desktop or web, subscribe to updates from the Message Bus. In this scenario, you do NOT need a scheduler – what you need is a Message Bus, RabbitMQ for example. If, say, your calculation grid is built in .NET, your primary concern should be integrating jobs implemented in different languages: unmanaged C++, Java, Perl scripts, .NET.
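The event-driven flow described here can be sketched with a toy, in-process publish/subscribe bus. This is only an illustration in Python: the `MessageBus` class is a hypothetical stand-in for a real broker such as RabbitMQ, the topic names `ticks` and `results` are assumptions, and the "calculation" is a placeholder (price times position size) rather than a real sensitivity calculation.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class MessageBus:
    """Toy in-process stand-in for a message broker (hypothetical, not RabbitMQ's API)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver synchronously to every handler registered on the topic.
        for handler in list(self._subscribers[topic]):
            handler(message)


def make_calc_node(bus: MessageBus, position_size: float) -> Handler:
    """Calculation-grid node: reacts to each tick and publishes a result."""
    def on_tick(tick: dict) -> None:
        # Placeholder calculation: market value = price * position size.
        value = tick["price"] * position_size
        bus.publish("results", {"symbol": tick["symbol"], "value": value})
    return on_tick


def run_demo() -> list[dict]:
    bus = MessageBus()
    received: list[dict] = []                    # plays the role of a desktop/web client
    bus.subscribe("ticks", make_calc_node(bus, position_size=100))
    bus.subscribe("results", received.append)
    # The feed adapter publishes an arriving tick; nothing is scheduled.
    bus.publish("ticks", {"symbol": "XYZ", "price": 10.5})
    return received


if __name__ == "__main__":
    print(run_demo())
```

Every step is triggered by the arrival of a message, never by a clock, which is why a scheduler adds nothing in this scenario.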
Fig 2. Day-end/Month-end processing is an example where we DO need a scheduler
A typical day-end batch includes mark-to-market, PnL and risk calculations, stressing/scenario analysis, and aggregation of position-level data to different levels (book/account/strategy/country levels, etc.). In this case, we need a scheduler.
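One reason a scheduler earns its keep here is that day-end jobs depend on each other. The sketch below, in Python, orders a batch by its dependencies using the standard library's `graphlib.TopologicalSorter`. The job names follow the list above, but the dependency graph itself is an assumption made for illustration; a real batch will have its own.

```python
from graphlib import TopologicalSorter

# job -> jobs that must finish first (assumed dependencies, for illustration only)
DAY_END_BATCH = {
    "mark_to_market": set(),
    "pnl": {"mark_to_market"},
    "risk": {"mark_to_market"},
    "stress_scenarios": {"mark_to_market"},
    "aggregate_positions": {"pnl", "risk", "stress_scenarios"},
}


def run_batch(jobs: dict[str, set[str]]) -> list[list[str]]:
    """Return the batch as 'waves': jobs in the same wave have no mutual dependency."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()                  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)           # a real grid would dispatch these in parallel
        sorter.done(*ready)           # mark the wave finished to unlock dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(run_batch(DAY_END_BATCH), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The scheduler decides the order of the waves; the jobs within a wave are what a load balancer would spread across the grid.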
Anatomy of a Complete Batch Processing Infrastructure with Load Balancing Capability
As mentioned, Wikipedia is a good starting point for getting a grasp of what tools are available today if you need batch processing and load balancing capability in your firm, or in the new application you’ll be building. In the following passage, we try to make detailed comparisons using the following criteria:
- Cost estimates and Time-to-Delivery (in the case where you decide to build your own Scheduler/Grid)
- Platform Compatibility
- Scheduling facilities
- Load Balancing facilities
- Built-in ERP adapters
- Built-in ETL commands (And Open Source libraries available, from the perspective of a .NET developer, if you build your own)
- Persistence of execution status/timestamps, execution parameters and execution result (Actual Data)
- GUI (Desktop/Web/Mobile)
- Support for Change Management and Security Audits (Common requirement in Enterprise Computing)
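Of these criteria, persistence is the easiest to underestimate when building your own. As a rough sketch of what "execution status/timestamps, execution parameters and execution result" could look like in storage, the Python example below records each run in SQLite. The `job_run` table, its columns and the `run_and_record` helper are all hypothetical; they do not describe how any of the products compared here store their data.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per job execution.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_run (
    run_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    job_name    TEXT NOT NULL,
    parameters  TEXT NOT NULL,   -- JSON-encoded execution parameters
    status      TEXT NOT NULL,   -- RUNNING / SUCCEEDED / FAILED
    started_at  TEXT NOT NULL,   -- ISO 8601 timestamp, UTC
    finished_at TEXT,
    result      TEXT             -- JSON-encoded result, or repr() of the exception
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def run_and_record(conn: sqlite3.Connection, job_name: str, func, **params) -> int:
    """Run func(**params) and persist its parameters, status, timestamps and result."""
    cur = conn.execute(
        "INSERT INTO job_run (job_name, parameters, status, started_at) "
        "VALUES (?, ?, 'RUNNING', ?)",
        (job_name, json.dumps(params), _now()),
    )
    run_id = cur.lastrowid
    try:
        status, payload = "SUCCEEDED", json.dumps(func(**params))
    except Exception as exc:          # record the failure instead of losing it
        status, payload = "FAILED", repr(exc)
    conn.execute(
        "UPDATE job_run SET status = ?, finished_at = ?, result = ? WHERE run_id = ?",
        (status, _now(), payload, run_id),
    )
    conn.commit()
    return run_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    run_and_record(conn, "pnl", lambda book: {"book": book, "pnl": 1.5}, book="FX")
    print(conn.execute("SELECT job_name, status, result FROM job_run").fetchall())
```

Keeping the parameters and the actual result next to the status is also what makes audit and change management requirements easier to satisfy later.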
Fig 3. The Comparison - “Open Source Stack” refers to the case where you decide to build your own Batch Infrastructure & Grid. BMC, Applied Algo, and Schedulix are Standalone Applications.
We have explored four options available today.
- Build your own (We’ve also described the number of modules you need to build in order to “Connect-the-dots” and the number of Open Source libraries available, in particular Quartz.NET+RabbitMQ, from the perspective of a .NET Developer. They also have the most polished GUI – parent-child jobs are displayed in a Flow Chart Diagram).
- BMC Control-M (Commercial Scheduler+Load Balancer; the most expensive, but with ERP adapters and built-in ETL commands).
- Applied Algo ETL Suite (Commercial Scheduler+Load Balancer, best suited for anyone who does a lot of number crunching – its persistence mechanism, which automatically stores processed data from FTP transfers through to output from a Time Series Analysis, is particularly geared toward quantitative/scientific analysis. Applied Algo ETL Suite also bundled the most).
- Schedulix (Open Source Scheduler+Load Balancer, with a support contract available from independIT. Everything you need for General IT automation purposes. It has no built-in ERP adapters or ETL commands, however).
This article presented only four alternatives among the many options available from both the Open Source and Commercial spaces. Readers are welcome to submit additions via the comments below. Please, however, use the comments to make suggestions, not to market your product.