OLAP modeling concepts, a summary

Simple definitions

Data warehouse: A data warehouse is an enterprise data repository that houses a single, unified version of business data

Data mart: A defined subset of a data warehouse (a business unit, a business subject area)

Cube: A storage structure in place of many (normalized) tables; they present a summarized, aggregated view of enterprise data, as opposed to normalized table sources that present detailed data

ETL: The data is moved from source systems to the destination cubes via extract, transform, and load (ETL) processes

Normalization means reducing duplicate data by using keys or IDs to relate rows of information from one table to another

Denormalization means the opposite, which is deliberately duplicating data in one or more structures.

Normalization improves the efficiency of inserting, updating, or deleting data; Denormalization improves the efficiency of reading or selecting data and reduces the number of tables the data engine has to access or the number of calculations it has to perform to provide information.

Data mining models: objects that contain source data (either relational or multidimensional) that have been processed using a particular type of data mining algorithm to classify (group) only or classify and predict one or more column values

Microsoft BI stack (R2)

SQL Server 2008 R2: staging and source location for BI solutions

SQL Server Integration Services (SSIS): toolset used to import, cleanse, and validate data prior to making the data available to data warehouses, data marts

SQL Server Analysis Services (SSAS): SSAS provides multidimensional storage for the data used in cubes for your data warehouse

Data mining using SSAS: to create data mining structures and data mining models.

SQL Server Reporting Services (SSRS): report component creation to answer business questions (without writing queries)

Report Builder 3.0: it provides authorized end users with a simplified interface for report development.

Excel: a BI visualization client

PowerPivot: Excel 2010 component to perform data analysis and data mining from Excel.

SharePoint: workgroup collaboration and web publishing server.

Visio: This Microsoft modeling tool for BI projects

BIDS: development environment to create SSIS/SSAS/SSRS projects

BI Languages

MDX: This is the language used to query OLAP cubes.

DMX: This is the language used to query data mining structures

XMLA: This is the language used to perform administrative tasks in SSAS.

Pain Point OLAP Solutions

Pain points Olap solution
Slow-to-execute queries Cubes that are optimized for read-only queries of aggregate data.
General system slowdowns OLAP systems use optimized, multi-dimensional storage that do not use locks, except during processing.
Manual query writing Allow end users to click to query (click and drag on a pivot table)
Disparate data sources Combine data into central repositories or cubes using ETL packages using SSIS
Invalid/inconsistent report data Cubes provide a consistent and unified view of enterprise data
Too much data Data mining alg can find patterns in large amounts of data automatically.

OLAP concepts

Let’s review the concepts you need to understand an OLAP project

Star Schema

A star schema is a normalized models that consists of at least one fact table and a number of dimension tables

A star schema fact table consists of at least two types of columns: keys and measures. Keys are foreign key (FK) values that relate rows in the fact table to rows in the dimension tables. Facts (which may also be called measures) are numeric values that express the business metrics; FK values are said to “give context or meaning to” the facts. Fact tables can also contain columns that are neither keys nor facts. These columns are the basis for a special type of dimension called a degenerate dimension: they are often used as a track-back, or pointer, to data in the source system

Facts are individual values stored in rows in the fact table, and measures are those values as stored and displayed in an OLAP cube

A dimension table describes a particular business entity or aspect of the fact table entries; they should consist of three types of columns. The first is a newly generated primary key (PK) for each row, called surrogate key; the second is the original PK, and the third group consists of columns to describe the business entity, called business key.

The attributes are built from the source columns in the dimension source tables

Each dimension’s source data is put into a single table. Typically, these tables are very wide, that is, having a large number of columns, and not especially deep.

Snowflakes schema

A snowflake is a type of source schema used for dimensional modeling; it means basing a dimension on more than one source relational table. The most common case is to use two source tables. However, if more than two tables are used as the basis of a snowflake dimension, there must be a key relationship between each of the tables containing the dimension information.

A typical business situation where using of a snowflake dimension design is one that would reduce the size of the dimension table by removing one or more attributes that are not commonly used to a separate dimension table, when this value is very sparse

Diff star/snow

The most typical relationship type between the dimension and fact tables is the Regular (or star) type: there is a zero or one-to-many relationship between the rows in the fact table and the dimension table, based on the listed key.

The snowflake relationship is between the dimension and fact tables is the Referenced type: there is a one-to-many relationship between the rows in the fact table and the dimension table and includes an additional table, still based on the listed key

Hierarchy

It’s a grouping mechanism, and serves two purposes. The first is largely a convenience for end users. This type of hierarchy is called a browse (or navigational) hierarchy. The second type is called a natural hierarchy, when is due to the nature of the data (ex geographical hierarchy)

OLAP Terms

Dimension Entity and all attributes related to that entity Customers

Hierarchy Grouping of attribute values for an entity Customers by Geography

Member Instance of entity, including attributes ID=50, Name = Langit…

Level Name of rollup position in hierarchy State Level

Key Primary identifier, two types: surrogate (or new) and original NewID = 1, OldID = 101

The Slowly Changing Dimension (SCD)

In OLAP modeling, inserting new dimension members is not considered a change. The only two cases you must be concerned with here are updates and deletes. This type of modeling is called slowly changing dimension (SCD) modeling.

Type 1, 2, 3 SCD Solutions

• Type 1 means overwriting previous dimension member values, which is sometimes also called last change wins

• Type 2 means adding a new record (or row value) when the dimension member value changes.

• Type 3 means adding additional attributes (or column values) when the dimension member value changes.

The Rapidly Changing Dimension (RCD)

The rapidly changing dimension (RCD) is a dimension whose member values change constantly.

Writeback Dimension

Writeback is the ability for authorized end users to update the data members in a dimension: insert, update, or delete.

Fact (Measure) Modeling

One key and core part of your BI solution is the business facts that you choose to include. Measures are the key metrics by which you ascertain the success of your business.

The built-in aggregation types available for use in SSAS are :

Additive means to roll up the one ultimate total.

Semiadditive means to roll up to a total for each level, but not cumulatively

Nonadditive means to not roll up, that is, only shows that particular value.

Calculated Measure vs. Derived Measure

A derived measure it is derived, or created, when the cube is loaded, rather than simply retrieved using a SELECT statement from the source fact table(s); done via tsql

Calculated measure values are calculated at query time by SSAS. Calculated measures execute based on queries that you write against the OLAP cube data; done via mdx

Data mining

Data mining is the ability to use predefined algorithms to detect patterns in the data

Key performance indicator (KPI)

It’s a method (usually visual) of showing one or more key business metrics: the current state, comparison to goal, trend over time

Source

http://www.apress.com/9781430233244

2 thoughts on “OLAP modeling concepts, a summary

  1. Pingback: TSql Practical SQL Examples (Joining Tables) | Obar1 aka Mario Amatucci

  2. Pingback: ETL, what you need to know (1/3) | Obar1 aka Mario Amatucci

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s