Sunday, July 10, 2011

What is SQL Server 2008 R2 - StreamInsight?

Background
A couple of weeks ago, the second CTP for SQL Server 2008 R2 was released. I downloaded it immediately. Still the product is far from finished, and the CTP didn't contain any preview of the "Gemini" component, but I could at least test the StreamInsight part. The finished product is scheduled to be released during the first half of year 2010, according to Microsoft's web site. [Notice: This article has been updated for the November CTP of StreamInsight]
What's new in SQL Server 2008 R2? I would say the most important news are:
  • StreamInsight
  • Managed Self Service Business Intelligence (the so called "Gemini" project)
  • Application and Multi-Server Management
This blog post will focus on StreamInsight.
StreamInsight
What is StreamInsight and what is it good for? StreamInsight is a platform for developing and deploying applications that handle high-speed streaming data. It could be used for near real-time processing of data from production environments, structured data such as financial information and unstructured data such as Facebook, Twitter and blogs. Multiple sources can be combined and refined before they are being output. This technology is called CEP - complex event processing.
Combine StreamInsight with data mining, and you can have real-time fraud detection, stock market prediction, you name it... Imagine a real-time Business Intelligence-application where the management can actually see the KPI gauges moving on the screen. I would say that monitoring real-time flow of information is one of the key success factors for tomorrow's Business Intelligence. This is also supported by Ralph Kimball, saying in an interview that Business Intelligence is moving from the strategic level towards operational level processes such as customer support, logistics and sales. At the operational level data must be continuously updated, not just once per day. I would add also that the new generation, that has grown up with Facebook and Twitter, will make it necessary to monitor new sources for successful Business Intelligence.
How does it work?
You develop your application in Visual Studio 2008, using libraries from StreamInsight. Basically you develop (or use existing) input and output adapters and connect them through LINQ queries. The picture below (copied from Microsoft's StreamInsight documentation) shows a good overview of the product:
(Click on image to enlarge)
There are three different development models supported by StreamInsight:
  • Explicit model, which provides full control and access to the CEP-server.
  • Implicit model, which can handle only a single query and runs all in memory. Good for debugging.
  • IObservable/IObserver interface, which hides much of the complexity of the CEP-server API
All data in StreamInsight is organized into event streams. The events are of two kinds:
  • INSERT events, that contains a header with timestamp information and a payload with application specific data. These events can be of different models, such as Interval, Point and Edge, to support different temporal characteristics.
  • CTI events, that indicates the completeness of previous events in the stream.
Adapters can be of two types:
  • Untyped adapters that work with multiple types of INSERT events
  • Typed adapters that work with a single type of INSERT events
Queries are written in LINQ with the possibility to do calculations, filtering, grouping, windowing, aggregations, combining and invoke user-defined functions. A query template is bound to specific input and output adapters to produce a query instance.
An example - building an input adapter for Yahoo Finance stock quotes
For testing StreamInsight, I chose to build an input adapter that reads stock quotes from Yahoo Finance. Note however that quotes from Yahoo Finance are delayed at least 15 minutes unless you pay for their premium service. To test the input adapter I also built an output adapter the displays the quotes on the screen (just a simple console application). These I joined together in an application that I built using the implicit development model.
The point here is not the functionality of the application, but that you could easily replace the output adapter with something much more useful. By adding SQL Server 2008 Data Mining, using any algorithm for predictive analysis, you could build something much more useful. Adding an output adapter that places buy and sell orders into a trading system could make you rich. Maybe I will write about that in a later blog, maybe not...
The picture below shows the Visual Studio 2008 project.
(Click on image to enlarge)
ScreenScraper is a helper class that reads the HTML contents of a web page and extracts a string using pattern matching (a regular expression). YahooFinance is a helper class that provides the ScreenScraper with suitable URL and pattern matching expression for the desired stock symbol. StockPayload contains the actual data and is defined as below:
public class StockPayload
{
   ///
   /// Unique ID of stock or index
   ///
   public string StockID { get; set; }

   ///
   /// Current value of stock or index
   ///
   public double Value { get; set; }
}
The data members can only be basic types such as numbers, strings, etc. This is probably because the serialization mechanism in StreamInsight requires a fixed size payload.
The initial rows of the input adapter is shown below.
(Click on image to enlarge)
Note that the input adapter inherits generic class TypedPointInputAdapter, which is used for typed (single kind of event) point (point in time events) adapters. Basically the adapter has to implement methods for Start() and Resume(), monitor the AdapterState and enqueue events to the output stream. The input adapter can work with asynchronous sources through callback functions, and synchronous sources. I've implemented the input adapter as a loop that reads data from Yahoo Finance and then waits for a determined time. A CTI event is sent after each quote, to indicate there are no other events awaiting.
A factory class is provided to instantiate the adapters. The factory implements a Create method that takes a configuration and an event shape as parameter. Based on these parameters it instantiates the input adapter.
The output adapter is built similarly as the input adapter, so I don't describe it here. The adapters are tied together by a query defined by the following statements. Note that this solution is specific for the implicit development model. In the explicit model, the query would instead be bound through the CreateQuery method of the application.
// Join input adapters with a simple LINQ query
var combinedInputStream = (from e in input1Stream
   select e).Union
         (from e in input2Stream
   select e);

// Connect input adapters with output adapter
var query = combinedInputStream.ToQuery(outputSink);
Finally the query is being run:
// Run the query
query.Start();
adapterStopSignal.WaitOne();
query.Stop();
Testing and debugging
SQL Server 2008 R2 StreamInsight comes with a tool, StreamInsight Debugger, that can visualize a recorded event flow. To use it you first need to write tracing data to a file. Enable tracing, run your application, and the disable tracing. This is done from command-line as shown below:
Opening the trace data file, sample.etl, in StreamInsight Debugger you can monitor the stream pipeline. The picture below shows data from the input adapter instances and how they are unioned in the LINQ query.

No comments:

Post a Comment