Building the Right Environment to Support AI, Machine Learning and Deep Learning
Microsoft SQL Server 2005 provides a completely new enterprise extraction, transformation, and loading (ETL) platform called SQL Server Integration Services (SSIS) that ships with the features, tools, and functionality to build both classic and innovative kinds of ETL-based applications. This article examines some of the exciting SSIS features that you can use to build ETL applications. Along the way, you will also learn how to build a simple package using the new Business Intelligence Development Studio, which is a key component of SSIS features.
A History Lesson
Back in the days when Microsoft SQL Server 6.5 was still a hot product, the bulk copy program was the only way to back up or export databases from SQL Server to other data sources, such as other database servers or text files. However, in subsequent SQL Server versions, Microsoft added a large number of features that were aimed at improving the productivity of SQL Server developers and DBAs. One such feature is DTS (Data Transformation Services). By creating DTS packages, you can combine several tasks into one process and use any programming language that supports automation (such as Visual Basic.NET, Visual C#, or Managed C++) to execute these packages and monitor their progress for errors.
With the upcoming SQL Server 2005 release, Microsoft has raised the bar by introducing SSIS, the brand new ETL tool. Similar to DTS, it provides functions for moving data from one place to another and manipulating that data at run time. However, SSIS has been completely redesigned from scratch to be an enterprise ETL platform. SSIS provides the breadth of features—and very high-level performance—necessary to build enterprise-class ETL applications.
SSIS is fully programmable, embeddable, and extensible, which makes it an ideal ETL platform. It has a great development environment hosted in a Visual Studio shell with cool capabilities for building workflows and pipelines through a rich set of pre-built or custom components. SQL Server 2005 ships with the following development and management environments for designing and managing packages:
- The SQL Server Management Studio is an environment for managing the storage and execution of deployed packages. It has special features for doing this, including integration with the DTS Service, and the ability to enumerate packages on remote servers. But it is not a design environment.
- The Business Intelligence Design Studio is an environment for designing packages, organizing them in Solutions and Projects, debugging them, and managing source and version control for multi-user projects.
These two environments offer powerful facilities for deploying, debugging, and monitoring deployed packages. In addition, SQL Server 2005 provides rich workflow capabilities that you can use for performing sophisticated data manipulations.
There are too many SSIS features to cover in a single article, so this piece highlights some of the important ones and then moves on to creating a simple package with Business Intelligence Development Studio. The following are the most notable features:
- You can use SSIS to transfer millions of rows of data to and from heterogeneous data sources, but SSIS functionality doesn't stop there. The tool leverages the end-to-end BI suite by offering complete data integration, movement, and shaping, which means that SSIS provides data cleansing, extensibility, and interoperability.
- SSIS comes with a lot of pre-built data-cleansing functionality, including completely integrated functions, such as fuzzy matching and fuzzy grouping, that use algorithms to match or group disparate data to a configurable degree of accuracy.
- SSIS offers broad extensibility points for third-party component vendors, meaning that if functionality is not available out of the box, SSIS lets you build your own components or add a third-party component to solve your problem.
- SSIS can load data directly into Analysis Services cubes, and it also offers robust data-mining features for including scalable data-mining model creation, training, and predictions.
- SSIS is seamlessly integrated with SQL Server Reporting Services integration, which lets you treat an SSIS package as the data source for reporting.
- SSIS also has greatly improved performance and scalability that allow you to host complex, high-volume ETL applications on lightweight servers, enabling you to scale down.
- SSIS can also help reduce ETL data staging areas and help minimize performance costs associated with data staging (disk I/O and serial processing). This is made possible by the ability to perform complex data transformations, data cleansing, and high-volume lookups—all inline from source to destination.
- SSIS also provides a new feature, the Slowly Changing Dimension (SCD) wizard. Through the SCD interface, you can rapidly generate all the steps and required code to add unique handling of history to multiple attributes in a given dimension.
- The development environment for SSIS, known as Business Intelligence Development Studio, is hosted in Visual Studio, enabling scripting and other programming tasks that take advantage of that enterprise development environment.
- SSIS now fully supports the Microsoft .NET Framework, allowing developers to program SSIS in their choice of .NET-compliant languages, as well as native code.
- The Data Transformation run-time engine is exposed both as a native COM object model and as an entirely managed object model. Although the Data Transformation engine is written in native code, it is available though a signed Primary Interop Assembly (PIA) that enables full, managed access to it.
Now that you have a general understanding of the SSIS features, consider an example that shows the steps involved in constructing and executing a package using the Business Intelligence Development Studio environment.
Creating a Simple Package
Consider a simple example wherein you transfer data from one table to another table by executing a stored procedure. For each row in the source table, you will invoke the stored procedure to load the data into the destination table.
Before looking at the package design, create the required tables and the stored procedure. You will create these database objects in the AdventureWorks database, which is one of the sample databases that ships with SQL Server 2005.
To start, create the source table using the following definition:
CREATE TABLE [dbo].[EmpName]( [FirstName] [varchar](50) NOT NULL, [LastName] [varchar](50) NOT NULL ) ON [PRIMARY]
The destination table definition is as follows:
CREATE TABLE [dbo].[Emp]( [EmpID] [int] IDENTITY(1,1) NOT NULL, [EmpName] [varchar](100) NOT NULL ) ON [PRIMARY]
The stored procedure to load the data into the destination table is as follows:
CREATE PROCEDURE [dbo].[InsertEmp] @FirstName [varchar](50), @LastName [varchar](50) WITH EXECUTE AS CALLER AS Insert into Emp(EmpName) Values (@LastName + ', ' + @FirstName) GO
Using the package, you will retrieve the FirstName and LastName columns from the EmpName table and load them into the Emp Table through the InsertEmp stored procedure. Inside the stored procedure, you will concatenate the FirstName and LastName columns and assign the concatenated value to the EmpName column of the Emp table. Now that you understand the package functionality, take a look at the design of the package, as shown in the next section.