Data warehouse: A data warehouse is an enterprise data repository that houses a single, unified version of business data
Data mart: A defined subset of a data warehouse (a business unit, a business subject area)
Cube: A storage structure in place of many (normalized) tables; they present a summarized, aggregated view of enterprise data, as opposed to normalized table sources that present detailed data
ETL: The data is moved from source systems to the destination cubes via extract, transform, and load (ETL) processes
Normalization means reducing duplicate data by using keys or IDs to relate rows of information from one table to another
Denormalization means the opposite, which is deliberately duplicating data in one or more structures.
Normalization improves the efficiency of inserting, updating, or deleting data; Denormalization improves the efficiency of reading or selecting data and reduces the number of tables the data engine has to access or the number of calculations it has to perform to provide information.
Data mining models: objects that contain source data (either relational or multidimensional) that have been processed using a particular type of data mining algorithm to classify (group) only or classify and predict one or more column values
Microsoft BI stack (R2)
SQL Server 2008 R2: staging and source location for BI solutions
SQL Server Integration Services (SSIS): toolset used to import, cleanse, and validate data prior to making the data available to data warehouses, data marts
SQL Server Analysis Services (SSAS): SSAS provides multidimensional storage for the data used in cubes for your data warehouse
Data mining using SSAS: to create data mining structures and data mining models.
SQL Server Reporting Services (SSRS): report component creation to answer business questions (without writing queries)
Report Builder 3.0: it provides authorized end users with a simplified interface for report development.
Excel: a BI visualization client
PowerPivot: Excel 2010 component to perform data analysis and data mining from Excel.
SharePoint: workgroup collaboration and web publishing server.
Visio: This Microsoft modeling tool for BI projects
BIDS: development environment to create SSIS/SSAS/SSRS projects
MDX: This is the language used to query OLAP cubes.
DMX: This is the language used to query data mining structures
XMLA: This is the language used to perform administrative tasks in SSAS.
Pain Point OLAP Solutions
|Pain points||Olap solution|
|Slow-to-execute queries||Cubes that are optimized for read-only queries of aggregate data.|
|General system slowdowns||OLAP systems use optimized, multi-dimensional storage that do not use locks, except during processing.|
|Manual query writing||Allow end users to click to query (click and drag on a pivot table)|
|Disparate data sources||Combine data into central repositories or cubes using ETL packages using SSIS|
|Invalid/inconsistent report data||Cubes provide a consistent and unified view of enterprise data|
|Too much data||Data mining alg can find patterns in large amounts of data automatically.|
Let’s review the concepts you need to understand an OLAP project
A star schema is a normalized models that consists of at least one fact table and a number of dimension tables
A star schema fact table consists of at least two types of columns: keys and measures. Keys are foreign key (FK) values that relate rows in the fact table to rows in the dimension tables. Facts (which may also be called measures) are numeric values that express the business metrics; FK values are said to “give context or meaning to” the facts. Fact tables can also contain columns that are neither keys nor facts. These columns are the basis for a special type of dimension called a degenerate dimension: they are often used as a track-back, or pointer, to data in the source system
Facts are individual values stored in rows in the fact table, and measures are those values as stored and displayed in an OLAP cube
A dimension table describes a particular business entity or aspect of the fact table entries; they should consist of three types of columns. The first is a newly generated primary key (PK) for each row, called surrogate key; the second is the original PK, and the third group consists of columns to describe the business entity, called business key.
The attributes are built from the source columns in the dimension source tables
Each dimension’s source data is put into a single table. Typically, these tables are very wide, that is, having a large number of columns, and not especially deep.
A snowflake is a type of source schema used for dimensional modeling; it means basing a dimension on more than one source relational table. The most common case is to use two source tables. However, if more than two tables are used as the basis of a snowflake dimension, there must be a key relationship between each of the tables containing the dimension information.
A typical business situation where using of a snowflake dimension design is one that would reduce the size of the dimension table by removing one or more attributes that are not commonly used to a separate dimension table, when this value is very sparse
The most typical relationship type between the dimension and fact tables is the Regular (or star) type: there is a zero or one-to-many relationship between the rows in the fact table and the dimension table, based on the listed key.
The snowflake relationship is between the dimension and fact tables is the Referenced type: there is a one-to-many relationship between the rows in the fact table and the dimension table and includes an additional table, still based on the listed key
It’s a grouping mechanism, and serves two purposes. The first is largely a convenience for end users. This type of hierarchy is called a browse (or navigational) hierarchy. The second type is called a natural hierarchy, when is due to the nature of the data (ex geographical hierarchy)
Dimension Entity and all attributes related to that entity Customers
Hierarchy Grouping of attribute values for an entity Customers by Geography
Member Instance of entity, including attributes ID=50, Name = Langit…
Level Name of rollup position in hierarchy State Level
Key Primary identifier, two types: surrogate (or new) and original NewID = 1, OldID = 101
The Slowly Changing Dimension (SCD)
In OLAP modeling, inserting new dimension members is not considered a change. The only two cases you must be concerned with here are updates and deletes. This type of modeling is called slowly changing dimension (SCD) modeling.
Type 1, 2, 3 SCD Solutions
• Type 1 means overwriting previous dimension member values, which is sometimes also called last change wins
• Type 2 means adding a new record (or row value) when the dimension member value changes.
• Type 3 means adding additional attributes (or column values) when the dimension member value changes.
The Rapidly Changing Dimension (RCD)
The rapidly changing dimension (RCD) is a dimension whose member values change constantly.
Writeback is the ability for authorized end users to update the data members in a dimension: insert, update, or delete.
Fact (Measure) Modeling
One key and core part of your BI solution is the business facts that you choose to include. Measures are the key metrics by which you ascertain the success of your business.
The built-in aggregation types available for use in SSAS are :
Additive means to roll up the one ultimate total.
Semiadditive means to roll up to a total for each level, but not cumulatively
Nonadditive means to not roll up, that is, only shows that particular value.
Calculated Measure vs. Derived Measure
A derived measure it is derived, or created, when the cube is loaded, rather than simply retrieved using a SELECT statement from the source fact table(s); done via tsql
Calculated measure values are calculated at query time by SSAS. Calculated measures execute based on queries that you write against the OLAP cube data; done via mdx
Data mining is the ability to use predefined algorithms to detect patterns in the data
Key performance indicator (KPI)
It’s a method (usually visual) of showing one or more key business metrics: the current state, comparison to goal, trend over time