ETL, what you need to know (2/3)

Companies very often use online assessments to test ETL/BI knowledge, so I think putting it all together will be an advantage for everyone who is going to work in an ETL role; let me know what you think and if you have any suggestions, thanks (mamatucci@gmail.com)

This is the second part; just use it to get inspiration. It's not a complete reference, but it helps for sure. Check the first part here

Topics to know

This list comes directly from the summary of the test, so I'll go topic by topic and provide some info, exercises, and links I've found on the web.

Area | Topic | What to know
Business Intelligence | Business Intelligence | Identifies understanding of the business intelligence aspect of data warehousing.
Business Intelligence | Data Mining and OLAP | Measures understanding of data mining and the Online Analytical Processing (OLAP) aspect of data warehousing.
Data warehouse | Data Warehouse Layers | Assesses knowledge of the layers of the data warehouse, including presentation, business model, and physical data mart.
Data warehouse | Data Warehousing Overview and Architecture | Measures understanding of the overall data warehouse principles and solution framework.
ETL | Extract, Transform, and Load (ETL) Procedures | Assesses proficiency in the Extract, Transform, and Load (ETL) procedures used in data warehousing.
Data warehouse | Modeling Practices | Measures proficiency in data warehouse modeling, and covers the design of measure and fact dimensions.
SQL | Implementing Data Integrity | Tests understanding of how various types of data integrity are enforced within SQL Server, including data types, constraints, defaults, rules, and triggers.
SQL | Managing Transactions and Locks | Tests knowledge of using transactions, SQL Server locking, and managing locks.
SQL | SQL Server Overview and Architecture | Tests understanding of SQL Server terminology, databases, and applications, and integration with Microsoft Windows operating systems.
SQL | Transact-SQL | Tests knowledge of the Transact-SQL programming language, including data control statements, data definition statements, data manipulation statements, and query processing.
SQL | Using Cursors | Tests the ability to use Transact-SQL cursors and understanding of methods of calling cursors.
PL-SQL | Bulk Binding and Dynamic SQL | Determines knowledge of passing and fetching bulk data using Oracle 10g PL/SQL bulk binding; also assesses knowledge and use of dynamic and native dynamic SQL for executing dynamic SQL strings and data definition and data control operations.
PL-SQL | PL/SQL Constructs and Language Structure | Measures knowledge of features and benefits of PL/SQL and different PL/SQL program constructs used in Oracle 10g.

Let’s start point by point…

Extract, Transform, and Load (ETL) Procedures

Loading a Data Warehouse with SSIS

http://www.analyticsinaction.com/learning-ssis/

a) Extract (Data Flow Sources). This step extracts data from one or more files. Different data flow sources are used to connect to different file types. For example, a flat file connection manager is used to connect to CSV or text files, an Excel connection manager is used to connect to Excel files, an ADO.NET connection manager is used to connect to SQL database tables, etc.

b) Transform (Data Flow Transformations): There are dozens of transformations that can be used in SSIS, but the following nine are perhaps the most commonly used, and most solutions can be built with them. The list below contains links to YouTube videos demonstrating how to use each transformation; some of the links cover multiple transformations.

  1. Data Conversion (covered in the same video as the aggregate function).
  2. Derived Column (Covered in the standardizing date formats video)
  3. Merge Join
  4. Conditional Split (Covered in the 3-part, 15-minute overview video above).
  5. Sort (Covered in the Merge Join video).
  6. Multicast
  7. Aggregate (Covered in the Aggregate and Data conversion Video)
  8. Union ALL
  9. Lookup

An explanation of each of these nine transformations is provided in Table 1 below.

c) Load (Data Flow Destinations). The function is the same as a source connection manager, but rather than piping data into SSIS, it pipes data off to a range of different file types and destinations.

Table 1. Key data flow transformations.

Transformation | Description | Example of use
Merge Join | Joins two data sources using a common field (join key). Prior to joining, the join key needs to be sorted. | Linking a list of customer leads to a master customer database using the customer's phone number.
Conditional Split | Accepts an input and determines which destination to pipe the data into based on the result of an expression. | Filtering on a customer's location or product holdings to restrict a dataset.
Sort | Sorts data in ascending or descending order. | Primarily used before a Merge Join transformation; the join key in both data sources needs to be sorted first for a Merge Join to work.
Data Conversion | Converts input columns from one data type to another, for example text to numeric formats. Useful when importing CSV or .txt files where, by default, all of the imported fields are in a string format. | Converting currency data in a CSV file into currency format so that one currency field can be added to another.
Multicast | Pipes the entire dataset to multiple destinations. | Enables one copy of a customer leads file to be loaded into a CRM system, and another copy into a customer leads database.
Derived Column | Calculates a new column value based on an existing column or multiple columns. | Removing leading and trailing spaces from a column; adding a title of courtesy (Mr, Mrs, Dr, etc.) to a name; restructuring dates into the correct format.
Aggregate | Calculates aggregations such as SUM, COUNT, AVG, MIN, and MAX based on the values of a given numeric column. Note that MIN, MAX, and AVG don't work on columns formatted as text, so data is often converted into numeric or currency format prior to an Aggregate transformation. | Calculating the SUM of the purchases a customer has made.
Lookup | Performs a similar function to a Merge Join, but is more efficient and doesn't require pre-sorted inputs. | Merging an Excel data source with a SQL table.
Union ALL | Appends two or more datasets together, i.e. it pastes one dataset below one or more other datasets. | Appending 5 days of customer sales or leads into a single file.
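For intuition, the two workhorse transformations map onto plain SQL: Merge Join corresponds roughly to a JOIN, and Aggregate to a GROUP BY. A minimal sketch, assuming hypothetical customers, leads, and sales tables:

-- Merge Join: link customer leads to a master customer table on a common key
SELECT c.customer_id, l.lead_source
FROM customers c
JOIN leads l ON l.phone_number = c.phone_number;

-- Aggregate: the SUM of the purchases each customer has made
SELECT customer_id, SUM(amount) AS total_purchases
FROM sales
GROUP BY customer_id;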

Modeling Practices

Facts and Dimensions

A fact table stores the numeric measures of a business process (sales amount, quantity sold, etc.), while dimension tables store the descriptive context (customer, product, date) used to slice and aggregate those measures. Each fact row references its dimensions through foreign keys, which produces the classic star schema.
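A minimal star schema sketch (table and column names are illustrative):

CREATE TABLE dim_date (date_key INT NOT NULL PRIMARY KEY,
full_date DATE NOT NULL);

CREATE TABLE dim_product (product_key INT NOT NULL PRIMARY KEY,
product_name VARCHAR(50) NOT NULL);

-- the fact table references each dimension and carries the measures
CREATE TABLE fact_sales (date_key INT NOT NULL REFERENCES dim_date (date_key),
product_key INT NOT NULL REFERENCES dim_product (product_key),
sales_amount DECIMAL(10,2) NOT NULL,
quantity INT NOT NULL);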

Implementing Data Integrity

Various types of data integrity are enforced within SQL Server, including data types, constraints, defaults, rules, and triggers.

Data values of a column must be of the same data type. (The only exception is the SQL_VARIANT data type.) Transact-SQL uses different data types, which can be categorized as follows:

  • Numeric data types
  • Character data types // strings of single-byte characters or strings of Unicode characters.
  • Temporal (date and/or time) data types // SQL Server 2008 introduces the new data types DATE and TIME; the new DATETIME2 data type stores date and time values with high precision
  • Miscellaneous data types
    • BINARY and VARBINARY are the two binary data types.
    • Large objects (LOBs) are data objects with a maximum length of 2GB
    • UNIQUEIDENTIFIER data type is a unique identification number stored as a 16-byte binary string.
    • The SQL_VARIANT data type can be used to store values of various data types
    • The HIERARCHYID data type is used to store an entire hierarchy
  • DECIMAL with VARDECIMAL storage type
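A small sketch showing several of these categories in a single table definition (table and column names are illustrative):

CREATE TABLE orders (order_id UNIQUEIDENTIFIER NOT NULL, -- 16-byte unique identification number
customer_name NVARCHAR(40) NOT NULL, -- Unicode character data
order_date DATETIME2 NOT NULL, -- high-precision temporal type
amount DECIMAL(10,2) NOT NULL, -- numeric type
scan VARBINARY(MAX) NULL); -- binary LOB, up to 2GB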

Integrity Constraint

The constraints that are used to check the modification or insertion of data are called integrity constraints.

There are two groups of integrity constraints handled by a DBMS:

  • Declarative integrity constraints // defined using the DDL statements CREATE TABLE and ALTER TABLE
  • Procedural integrity constraints that are handled by triggers

All declarative constraints can be categorized into several groups:

  • DEFAULT clause
  • UNIQUE clause // CLUSTERED or NONCLUSTERED

http://www.w3schools.com/sql/sql_unique.asp
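A minimal sketch combining a DEFAULT clause with a NONCLUSTERED UNIQUE constraint (table and names are illustrative):

CREATE TABLE department (dept_no CHAR(4) NOT NULL,
dept_name CHAR(25) NOT NULL,
location CHAR(30) NOT NULL DEFAULT 'Dallas', -- DEFAULT supplies a value when the INSERT omits one
CONSTRAINT unique_dept UNIQUE NONCLUSTERED (dept_name));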

  • PRIMARY KEY clause // a column or group of columns whose values are different in each row; the columns must be NOT NULL, and the default index type is CLUSTERED

CREATE TABLE employee (emp_no INTEGER NOT NULL,
emp_fname CHAR(20) NOT NULL,
emp_lname CHAR(20) NOT NULL,
dept_no CHAR(4) NULL,
CONSTRAINT prim_empl PRIMARY KEY (emp_no));

  • CHECK clause // The check constraint specifies conditions for the data inserted into a column

http://www.w3schools.com/sql/sql_check.asp

CREATE TABLE Persons
(
P_Id int NOT NULL CHECK (P_Id>0),
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255)
)

  • FOREIGN KEY clause and referential integrity // A foreign key is a column or group of columns in one table that contains values that match the primary key values in the same or another table

http://www.w3schools.com/sql/sql_foreignkey.asp

CREATE TABLE works_on (emp_no INTEGER NOT NULL,
project_no CHAR(4) NOT NULL,
job CHAR(15) NULL,
enter_date DATE NULL,
CONSTRAINT prim_works PRIMARY KEY (emp_no, project_no),
CONSTRAINT foreign_works FOREIGN KEY (emp_no)
REFERENCES employee (emp_no))

If you try to modify the values of a primary key, and these modifications would result in inconsistencies in the corresponding foreign key, the database system applies the referential action specified in the ON DELETE / ON UPDATE clause of the FOREIGN KEY definition:

  • NO ACTION (the default: the modification is rejected)
  • CASCADE
  • SET NULL
  • SET DEFAULT
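A minimal sketch of declaring an explicit referential action, re-creating the foreign key from the works_on example above with ON DELETE CASCADE:

ALTER TABLE works_on DROP CONSTRAINT foreign_works;

ALTER TABLE works_on ADD CONSTRAINT foreign_works
FOREIGN KEY (emp_no) REFERENCES employee (emp_no)
ON DELETE CASCADE ON UPDATE NO ACTION;

With ON DELETE CASCADE, deleting an employee row also deletes the matching works_on rows; NO ACTION on update means an emp_no change that would orphan rows is rejected.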

Trigger (chapter 14)

A trigger is a mechanism that is invoked when a particular action occurs on a particular table.

  • DML triggers execute when a user tries to modify data through a data manipulation language (DML) event
    • INSERT, UPDATE, or DELETE statements on a table or view // regardless of whether or not any table rows are affected
  • DDL triggers execute in response to a variety of data definition language (DDL) events
    • CREATE, ALTER, and DROP statements
  • Logon triggers fire in response to the LOGON event that is raised when a user session is being established.

SQL Server allows for creating multiple triggers for any specific statement.

CREATE TRIGGER modify_budget
ON project AFTER UPDATE
AS IF UPDATE(budget)
BEGIN
DECLARE // INSERT …
END

http://msdn.microsoft.com/en-us/library/ms189799.aspx
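A fleshed-out sketch of the skeleton above, assuming a project table keyed by project_no with a budget column, plus a hypothetical audit_budget log table:

CREATE TRIGGER modify_budget
ON project AFTER UPDATE
AS IF UPDATE(budget)
BEGIN
-- deleted and inserted are the virtual tables holding the old and new row images
INSERT INTO audit_budget (project_no, old_budget, new_budget, changed_at)
SELECT d.project_no, d.budget, i.budget, GETDATE()
FROM deleted d JOIN inserted i ON d.project_no = i.project_no;
END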

Managing Transactions and Locks

A transaction specifies a sequence of Transact-SQL statements that is used by database programmers to package together read and write operations, so that the database system can guarantee the consistency of data. There are two forms of transactions:

  • Implicit // any single INSERT, UPDATE, or DELETE statement treated as a transaction unit.
  • Explicit // generally a group of Transact-SQL statements, where the beginning and the end of the group are marked using statements such as BEGIN TRANSACTION, COMMIT, and ROLLBACK.

BEGIN TRANSACTION /* The beginning of the transaction */
UPDATE employee
SET emp_no = 39831
WHERE emp_no = 10102
IF (@@error <> 0)
ROLLBACK /* Rollback of the transaction */
UPDATE works_on
SET emp_no = 39831
WHERE emp_no = 10102
IF (@@error <> 0)
ROLLBACK
COMMIT /* The end of the transaction */

Note: the global variable @@error is used to test the execution of each Transact-SQL statement.

Transactions have the following properties, which are known collectively by the acronym ACID:

  • Atomicity // either all statements of the transaction are executed or none are
  • Consistency // a transaction moves the database from one consistent state to another
  • Isolation // concurrent transactions do not see each other's intermediate changes
  • Durability // once committed, the effects of a transaction survive system failures

Locking

Concurrency can lead to several negative effects, such as the reading of nonexistent data or the loss of modified data.

Locking has several different aspects:

  • Lock duration
  • Lock modes
  • Lock granularity

// The following discussion concerns the pessimistic concurrency model.

Lock Modes

Lock modes specify different kinds of locks. The choice of which lock mode to apply depends on the resource that needs to be locked.

For row- and page-level locking:

  • Shared (S)
    • Reserves a resource (page or row) for reading only => several processes can read the locked resource concurrently
  • Exclusive (X)
    • Reserves a page or row for the exclusive use of a single transaction
  • Update (U)
    • Can be placed only if no other update or exclusive lock exists; it announces an intent to modify the resource and is promoted to an exclusive lock when the modification actually happens
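A minimal sketch of requesting an update lock explicitly with a table hint, reusing the works_on table from the earlier examples:

BEGIN TRANSACTION
-- UPDLOCK places update locks on the rows read, so no other transaction can
-- take an update or exclusive lock on them until this transaction ends
SELECT job FROM works_on WITH (UPDLOCK)
WHERE emp_no = 10102
UPDATE works_on SET job = 'Manager'
WHERE emp_no = 10102
COMMIT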

Lock Granularity

Lock granularity specifies which resource is locked by a single lock attempt. Database Engine can lock the following resources:

  • Row
  • Page
  • Index key or range of index keys
  • Table
  • Extent
  • Database itself
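To see granularity in practice, one can inspect the locks the current session holds (a sketch using the sys.dm_tran_locks dynamic management view):

-- resource_type shows the granularity: KEY or RID (row), PAGE, OBJECT (table), EXTENT, DATABASE...
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;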

A deadlock is a special concurrency problem in which two transactions block the progress of each other.

Row Versioning

Database Engine supports an optimistic concurrency control mechanism based on row versioning. When data is modified using row versioning, logical copies of the data are maintained in tempdb.
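A minimal sketch of switching a session to row versioning via snapshot isolation (assuming a database named sample_db and the project table from the earlier examples):

ALTER DATABASE sample_db SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION
-- this read takes no shared locks; it reads a row version kept in tempdb
SELECT budget FROM project WHERE project_no = 'p1';
COMMIT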
