Companies very often use online assessments to test ETL/BI knowledge, so I think putting it all together would be an advantage for everyone who is going to work in an ETL role; let me know what you think and if you have any suggestions, thanks (firstname.lastname@example.org)
This is the second part; just use it to get inspiration. It's not a complete reference, but it helps for sure. Check the first part here
Topics to know
This list comes directly from the summary of the test, so I'll go topic by topic and provide some info, exercises and links I've found on the web
| Topic | What to know |
| --- | --- |
| Business Intelligence | Identifies understanding of the business intelligence aspect of data warehousing. |
| Data Mining and OLAP | Measures understanding of data mining and the Online Analytical Processing (OLAP) aspect of data warehousing. |
| Data Warehouse Layers | Assesses knowledge of the layers of the data warehouse, including presentation, business model, and physical data mart. |
| Data Warehousing Overview and Architecture | Measures understanding of the overall data warehouse principles and solution framework. |
| Extract, Transform, and Load (ETL) Procedures | Assesses proficiency in the Extract, Transform, and Load (ETL) procedures used in data warehousing. |
| Modeling Practices | Measures proficiency in data warehouse modeling, and covers the design of measures, facts, and dimensions. |
| Implementing Data Integrity | Tests understanding of how various types of data integrity are enforced within SQL Server, including data types, constraints, defaults, rules, and triggers. |
| Managing Transactions and Locks | Tests knowledge of using transactions, SQL Server locking, and managing locks. |
| SQL Server Overview and Architecture | Tests understanding of SQL Server terminology, databases, applications, and integration with Microsoft Windows operating systems. |
| Transact-SQL | Tests knowledge of the Transact-SQL programming language, including data control statements, data definition statements, data manipulation statements, and query processing. |
| Using Cursors | Tests the ability to use Transact-SQL cursors and understanding of methods of calling cursors. |
| Bulk Binding and Dynamic SQL | Determines knowledge of passing and fetching bulk data using Oracle 10g PL/SQL bulk binding. Also assesses knowledge and use of dynamic and native dynamic SQL for executing dynamic SQL strings and data definition and data control operations. |
| PL/SQL Constructs and Language Structure | Measures knowledge of the features and benefits of PL/SQL and the different PL/SQL program constructs used in Oracle 10g. |
Let’s start point by point…
Extract, Transform, and Load (ETL) Procedures
Loading a Data Warehouse with SSIS
a) Extract (Data Flow Sources). This step extracts data from one or more sources. Different data flow sources are used to connect to different source types. For example, a Flat File connection manager is used to connect to CSV or text files, an Excel connection manager is used to connect to Excel files, an ADO.NET connection manager is used to connect to SQL Server database tables, and so on.
b) Transform (Data Flow Transformations). There are dozens of transformations available in SSIS, but the ones below are perhaps the most commonly used, and most solutions can be built with just these. The list below contains links to YouTube videos demonstrating how to use each transformation; some of the links cover multiple transformations.
- Data Conversion (covered in the same video as the aggregate function).
- Derived Column (Covered in the standardizing date formats video)
- Merge Join
- Conditional Split (Covered in the 3 part 15 minute Overview video above).
- Sort (Covered in the Merge Join video).
- Aggregate (Covered in the Aggregate and Data conversion Video)
- Union ALL
An explanation of each of these transformations is provided in Table 1 below.
c) Load (Data Flow Destinations). A destination performs the same function as a source connection manager, but rather than piping data into SSIS, it pipes data out to a range of different file types and destinations.
Table 1. Key data flow transformations.
| Transformation | Description | Example of use |
| --- | --- | --- |
| Merge Join | Joins two data sources using a common field (the join key). Prior to joining, both inputs must be sorted on the join key. | Linking a list of customer leads to a master customer database using the customer's phone number. |
| Conditional Split | Accepts an input and determines which destination to pipe each row into based on the result of an expression. | Filtering on a customer's location or product holdings to restrict a dataset. |
| Sort | Sorts data in ascending or descending order. | Primarily used before a Merge Join transformation, since the join key in both data sources needs to be sorted for the Merge Join to work. |
| Data Conversion | Converts input columns from one data type to another, for example text to numeric formats. This transformation is useful when importing CSV or .txt files where, by default, all of the imported fields are in a string format. | Converting currency data in a CSV file into a currency format so that one currency field can be added to another. |
| Multicast | Pipes the entire dataset to multiple destinations. | Enables one copy of a customer leads file to be loaded into a CRM system, and another copy into a customer leads database. |
| Derived Column | Calculates a new column value based on one or more existing columns. | Removing leading and trailing spaces from a column; adding a title of courtesy (Mr, Mrs, Dr, etc.) to the name; restructuring dates into the correct format. |
| Aggregate | Calculates aggregations such as SUM, COUNT, AVG, MIN and MAX based on the values of a given numeric column. Note that MIN, MAX and AVG don't work on columns formatted as text, so data is often converted into a numeric or currency format prior to an Aggregate transformation. | Calculating the SUM of the purchases a customer has made. |
| Lookup | Performs a similar function to a Merge Join, but is more efficient and doesn't require pre-sorted inputs. | Merging an Excel data source with a SQL table. |
| Union All | Appends two or more datasets together, i.e. it pastes one dataset below one or more other datasets. | Appending 5 days of customer sales or leads into a single file. |
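Although SSIS transformations are configured in the designer rather than in code, the most common ones map closely onto plain Transact-SQL, which can help when reasoning about them in a test. A minimal sketch, assuming hypothetical customers and sales tables:

-- Conceptual T-SQL analogue of Merge Join + Aggregate + Conditional Split
-- (customers and sales are made-up tables used only for illustration)
SELECT c.customer_id,
       c.phone_number,
       SUM(s.purchase_amount) AS total_purchases   -- Aggregate (SUM per customer)
FROM customers AS c
JOIN sales AS s                                    -- Merge Join on the join key
  ON s.customer_id = c.customer_id
GROUP BY c.customer_id, c.phone_number
HAVING SUM(s.purchase_amount) > 1000;              -- Conditional Split-style filter on the result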
Facts and Dimensions
// add notes from book
Implementing Data Integrity
Various types of data integrity are enforced within SQL Server using data types, constraints, defaults, rules, and triggers.
All data values in a column must be of the same data type (the only exception is a column of the SQL_VARIANT data type). Transact-SQL supports a range of data types, which can be categorized as follows:
- Numeric data types
- Character data types // strings of single-byte characters or strings of Unicode characters.
- Temporal (date and/or time) data types // SQL Server 2008 introduces the new DATE and TIME data types; DATETIME2 is also new and stores date and time values with higher precision (fractional seconds) than DATETIME
- Miscellaneous data types
  - BINARY and VARBINARY are the two binary data types.
  - Large objects (LOBs) are data objects with a maximum length of 2GB.
  - The UNIQUEIDENTIFIER data type is a unique identification number stored as a 16-byte binary string.
  - The SQL_VARIANT data type can be used to store values of various data types.
  - The HIERARCHYID data type is used to store an entire hierarchy.
  - DECIMAL with the VARDECIMAL storage format.
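To make the categories above concrete, here is a minimal sketch of a table that uses one type from each group (the type_demo table and its columns are made up for illustration):

CREATE TABLE type_demo (id INT NOT NULL,              -- numeric
name NVARCHAR(50) NOT NULL,                           -- Unicode character string
hired_on DATE NULL,                                   -- temporal (introduced in SQL Server 2008)
hired_at DATETIME2(7) NULL,                           -- high-precision date and time
photo VARBINARY(MAX) NULL,                            -- binary LOB (up to 2GB)
row_guid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),   -- 16-byte unique identifier
org_node HIERARCHYID NULL);                           -- position within a hierarchy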
The constraints that are used to check the modification or insertion of data are called integrity constraints.
There are two groups of integrity constraints handled by a DBMS:
- Declarative integrity constraints // defined using the DDL statements CREATE TABLE and ALTER TABLE
- Procedural integrity constraints that are handled by triggers
All declarative constraints can be categorized into several groups:
- DEFAULT clause
- UNIQUE clause // CLUSTERED or NONCLUSTERED
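The DEFAULT and UNIQUE clauses have no example further down, so here is a minimal sketch (the department table is hypothetical, chosen to match the employee table used below):

CREATE TABLE department (dept_no CHAR(4) NOT NULL,
dept_name CHAR(25) NOT NULL,
location CHAR(30) NULL DEFAULT 'Dallas',                  -- DEFAULT clause: used when no value is supplied
CONSTRAINT unique_dept UNIQUE NONCLUSTERED (dept_name));  -- UNIQUE clause: no two departments share a name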
- PRIMARY KEY clause // a column or group of columns whose values are different in each row; the columns must be NOT NULL, and the index created for the constraint defaults to CLUSTERED
CREATE TABLE employee (emp_no INTEGER NOT NULL,
emp_fname CHAR(20) NOT NULL,
emp_lname CHAR(20) NOT NULL,
dept_no CHAR(4) NULL,
CONSTRAINT prim_empl PRIMARY KEY (emp_no));
- CHECK clause // The check constraint specifies conditions for the data inserted into a column
CREATE TABLE Persons (P_Id int NOT NULL CHECK (P_Id>0),
LastName varchar(255) NOT NULL);
- FOREIGN KEY clause and referential integrity // A foreign key is a column or group of columns in one table that contains values that match the primary key values in the same or another table
CREATE TABLE works_on (emp_no INTEGER NOT NULL,
project_no CHAR(4) NOT NULL,
job CHAR (15) NULL,
enter_date DATE NULL,
CONSTRAINT prim_works PRIMARY KEY(emp_no, project_no),
CONSTRAINT foreign_works FOREIGN KEY(emp_no)
REFERENCES employee (emp_no))
If you try to update or delete primary key values and these modifications would result in inconsistencies in the corresponding foreign key, the Database Engine resolves the conflict according to the referential action defined on the constraint: NO ACTION (the change is rejected), CASCADE, SET NULL, or SET DEFAULT.
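The rule to apply is declared on the foreign key itself with the ON DELETE and ON UPDATE options; a minimal sketch that redefines the foreign_works constraint from above:

ALTER TABLE works_on DROP CONSTRAINT foreign_works;
ALTER TABLE works_on ADD CONSTRAINT foreign_works
FOREIGN KEY (emp_no) REFERENCES employee (emp_no)
ON DELETE CASCADE    -- deleting an employee also deletes the matching works_on rows
ON UPDATE NO ACTION; -- changing employee.emp_no is rejected while works_on rows refer to it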
Trigger (chapter 14)
A trigger is a mechanism that is invoked when a particular action occurs on a particular table.
- DML triggers execute when a user tries to modify data through a data manipulation language (DML) event
  - INSERT, UPDATE, or DELETE statements on a table or view // they fire regardless of whether or not any table rows are affected
- DDL triggers execute in response to a variety of data definition language (DDL) events
  - CREATE, ALTER, and DROP statements
- Logon triggers fire in response to the LOGON event that is raised when a user session is being established.
SQL Server allows for creating multiple triggers for any specific statement.
CREATE TRIGGER modify_budget
ON project AFTER UPDATE
AS IF UPDATE(budget)
DECLARE // INSERT …
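The skeleton above stops short of the trigger body. A possible completion, assuming a hypothetical audit_budget table used to log the change:

CREATE TRIGGER modify_budget
ON project AFTER UPDATE
AS IF UPDATE(budget)
BEGIN
  -- audit_budget(project_no, old_budget, new_budget, changed_at) is assumed to exist
  INSERT INTO audit_budget (project_no, old_budget, new_budget, changed_at)
  SELECT d.project_no, d.budget, i.budget, GETDATE()
  FROM deleted AS d
  JOIN inserted AS i ON i.project_no = d.project_no; -- deleted/inserted hold the before/after rows
END;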
Managing Transactions and Locks
A transaction specifies a sequence of Transact-SQL statements that is used by database programmers to package together read and write operations, so that the database system can guarantee the consistency of data. There are two forms of transactions:
- Implicit Specifies any single INSERT, UPDATE, or DELETE statement as a transaction unit.
- Explicit Generally a group of Transact-SQL statements, where the beginning and the end of the group are marked using statements such as BEGIN TRANSACTION, COMMIT, and ROLLBACK.
BEGIN TRANSACTION /* The beginning of the transaction */
UPDATE employee
SET emp_no = 39831
WHERE emp_no = 10102
IF (@@error <> 0)
ROLLBACK /* Rollback of the transaction */
UPDATE works_on
SET emp_no = 39831
WHERE emp_no = 10102
IF (@@error <> 0)
ROLLBACK
COMMIT /* The end of the transaction */
Note: The global variable @@error is used to test the execution of each Transact-SQL statement.
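On SQL Server 2005 and later the same pattern is usually written with TRY/CATCH rather than checking @@error after every statement; a minimal sketch using the same hypothetical updates:

SET XACT_ABORT ON; -- any run-time error rolls the whole transaction back
BEGIN TRY
  BEGIN TRANSACTION;
  UPDATE employee SET emp_no = 39831 WHERE emp_no = 10102;
  UPDATE works_on SET emp_no = 39831 WHERE emp_no = 10102;
  COMMIT;
END TRY
BEGIN CATCH
  IF @@TRANCOUNT > 0 ROLLBACK; -- undo both updates on any error
END CATCH;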
Transactions have the following properties, which are known collectively by the acronym ACID: atomicity, consistency, isolation, and durability.
Concurrency can lead to several negative effects, such as the reading of nonexistent data or loss of modified data
Locking has several different aspects:
- Lock duration
- Lock modes
- Lock granularity
// The following discussion concerns the pessimistic concurrency model.
Lock modes specify different kinds of locks. The choice of which lock mode to apply depends on the resource that needs to be locked.
For row- and page-level locking:
- Shared (S)
  - Reserves a resource (page or row) for reading only => several processes can read the locked resource at the same time
- Exclusive (X)
  - Reserves a page or row for the exclusive use of a single transaction => no other transaction can read or modify the resource until the lock is released
- Update (U)
  - Can be acquired only if no other update or exclusive lock exists on the resource; it reserves the resource for a later update and is converted to an exclusive lock when the modification actually happens
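A small sketch of requesting and observing lock modes under the pessimistic model, using the UPDLOCK table hint and the sys.dm_tran_locks view (the employee table is reused from the examples above):

BEGIN TRANSACTION;
SELECT emp_lname FROM employee WITH (UPDLOCK) WHERE emp_no = 10102; -- asks for an update (U) lock instead of a shared (S) lock
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID; -- shows the locks held by the current session
COMMIT;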
Lock granularity specifies which resource is locked by a single lock attempt. The Database Engine can lock the following resources:
- Row
- Page
- Extent (a contiguous group of eight pages)
- Index key or range of index keys
- Table
- Database itself
A deadlock is a special concurrency problem in which two transactions block the progress of each other.
Database Engine supports an optimistic concurrency control mechanism based on row versioning. When data is modified using row versioning, logical copies of the data are maintained in tempdb.
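Row versioning is switched on per database; a minimal sketch (SampleDb is a hypothetical database name):

ALTER DATABASE SampleDb SET READ_COMMITTED_SNAPSHOT ON; -- statement-level versioning: readers see the last committed row version instead of blocking
ALTER DATABASE SampleDb SET ALLOW_SNAPSHOT_ISOLATION ON; -- enables SET TRANSACTION ISOLATION LEVEL SNAPSHOT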