ETL, what you need to know (1/3)

Very often companies use online assessment to test ETL/bi knowledge, I think putting all together would be an advantage for everyone is going to work on an ETL role; let me know of what you think and if you have any suggestion, thanks (mamatucci@gmail.com)

To be able to work in an ETL project you need some basic knowledge divided for areas of interest, let’ see each one and some examples, reference found on the web.

Topics to know

This list comes directly from the summary of the test so, I’ll go topic but topic and provide some info, exercise and links I’ve found on the web

Area

Topic

What to know

Business Intelligence

Business Intelligence

identifies understanding of business intelligence aspect of data warehousing.

Business Intelligence

Data Mining and OLAP

measures understanding of data mining and the Online Analytical Processing (OLAP) aspect of data warehousing.

Data warehouse

Data Warehouse Layers

assesses knowledge of the layers of the data warehouse, including presentation, business model, and physical data mart.

Data warehouse

Data Warehousing Overview and Architecture

measures understanding of the overall data warehouse principles and solution framework.

ETL

Extract, Transform, and Load (ETL) Procedures

assesses the proficiency in the Extract, Transform, and Load (ETL) procedures used in data warehousing.

Data warehouse

Modeling Practices

measures proficiency in data warehouse modeling, and covers the design of measure and fact dimensions

SQL

Implementing Data Integrity

tests understanding of how various types of data integrity are enforced within SQL Server, including data types, constraints, defaults, rules, and triggers.

SQL

Managing Transactions and Locks

tests knowledge in using transactions, SQL Server locking, and managing locks.

SQL

SQL Server Overview and Architecture

tests understanding of SQL Server terminology, databases, and applications, and integration with Microsoft Windows operating systems.

SQL

Transact-SQL

tests knowledge of the Transact-SQL programming language, including data control statements, data definition statements, data manipulation statements, and query processing.

SQL

Using Cursors

tests the ability to use Transact-SQL cursors and understanding of methods of calling cursors.

PL-SQL

Bulk Binding and Dynamic SQL

determines the knowledge of passing and fetching bulk data using Oracle 10g PL/SQL bulk binding. Also assesses knowledge and use of dynamic and native dynamic SQL for executing dynamic SQL strings and data definition and data control operations.

PL-SQL

PL/SQL Constructs and Language Structure

measures knowledge of features and benefits of PL/SQL and different PL/SQL program constructs used in Oracle 10g.

 

Let’s start point by point…

Business Intelligence

You can quickly check this https://mamatucci.wordpress.com/2011/05/05/olap-modeling-concepts-a-summary/. Generally speaking the first 2 chapters of the book FoundBIR2 are enough…

 

Data Mining and OLAP

Spend 1h watching the Webinar IntroDM; I don’t think is the core for ETL anyway, but nice to have this knowledge, especially to connect with business analyst that needs to discover patterns on the data, prediction of future data.

Some of the most common DM algorithms are:

Decision tree: to create a model that predicts the value of a target variable based on several input variables

Clustering: to understand how the data is organized and reveal how strange behavior of some data. Used for fraud detection in SSIS, or erroneous data…

Time Series: forecasting of data, to predict the values based on historical data (maintained in the data sources)

Association Rules: to discover which data is related with others, for example some products bought together in some related circumstances, this helpful to undercover rule inside the data itself

 

Data Warehouse Layers

Check chapter 3 of DelivBI2008, the main concepts are:

The business transactions are kept the OLTP system; here the data is highly normalized (aka organized in tables that put list together correlated business entities) to provide fast insert/update (not really matter the delete as usually you kept all your data and use some flag to change the status,…). The analysis of the data itself needs aggregation of data as the business guys need to have a global view of the data (on aggregated info or get new insight of the business from the data itself). The problem arises from the denormalization approach needed to have fast (insert/delete) OLTP system for the daily transactions we need to create datamarts; they are set of historical data (for a specific business) still relation systems but organized like ‘star schema’ or ‘snow flake schema’ to achieve better performances in the creation of the OLAP cubes. A lot of time the data comes from heterogeneous data source, the ETL procedure will consolidate, clean and put in the datamarts

The Extract, Transform, and Load (ETL) process extracts data to copy from one or more OLTP systems, performs any required data cleansing to transform the data into a consistent format, and loads the cleansed data by inserting it into the data mart

The data can be divided in 4 groups; measures, dimensions, attributes, hierarchies

Data Warehouse Layers

In general, all data warehouse systems have the following layers:

  • Data Source Layer
  • Data Extraction Layer
  • Staging Area
  • ETL Layer
  • Data Storage Layer
  • Data Logic Layer
  • Data Presentation Layer
  • Metadata Layer
  • System Operations Layer

The picture below shows the relationships among the different components of the data warehouse architecture:

 

Metadata Layer

 

This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that’s in the metadata layer.

 

Data Source Layer

 

This represents the different data sources that feed data into the data warehouse. The data source can be of any format — plain text file, relational database, other types of database, Excel file, … can all act as a data source.

All these data sources together form the Data Source Layer.

 

Data Extraction Layer

 

Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely any major data transformation.

 

Staging Area

 

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.

 

ETL Layer

 

This is where data gains its “intelligence”, as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.

 

Data Storage Layer

 

This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS)

 

Data Logic Layer

 

This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but does affect what the report looks like.

 

Data Presentation Layer

 

This refers to the information that reaches the users. This can be in a form of a tabular / graphical report in a browser, an emailed report

 

System Operations Layer

 

This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

 

[Data Storage Layer] Data Warehouse is an ‘offline’ integration of data, whereas Operational Data Store is an ‘online’ integration of data.

 

http://www.1keydata.com/datawarehousing/data-warehouse-architecture.html

http://www.executionmih.com/data-warehouse/definition-concept.php

 

Data Warehousing Overview and Architecture

Data warehouses and their architectures vary depending upon the specifics of an organization’s situation. Three common architectures are:

  • Data Warehouse Architecture (Basic)
  • Data Warehouse Architecture (with a Staging Area)
  • Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

  • Data Sources (operational systems and flat files)
  • Warehouse (metadata, summary data, and raw data)
  • Users (analysis, reporting, and mining)

 

A simple architecture for a data warehouse: End users directly access data derived from several source systems through the data warehouse; the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable; they pre-compute long operations in advance. A summary in Oracle is called a materialized view.

Data Warehouse Architecture (with a Staging Area)

You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.

  • Staging Area (where data sources go before the warehouse)

Data Warehouse Architecture (with a Staging Area and Data Marts)

You may want to customize your warehouse’s architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business

http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/concept.htm#50822

http://informaticatutors.blogspot.com/2011/03/data-warehouse-architecture.html

 

Book References

FoundBIR2: Foundations of SQL Server 2008 R2 Business Intelligence, http://www.apress.com/9781430233244

IntroDM: Introduction to Data Mining, Brian Knight, http://pragmaticworks.com/resources/webinars/

DelivBI2008: Delivering Business Intelligence with Microsoft SQL Server 2008, http://www.mhprofessional.com/book.php?isbn=0071549447

 

 

 

 

One thought on “ETL, what you need to know (1/3)

  1. Pingback: ETL, what you need to know (2/3) « Obar1 aka Mario Amatucci

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s