Niraj Bhatt works as an Enterprise Architect for a Fortune 500 company and has an innate passion for building and studying software systems. He is a top-rated speaker at various technical forums including Tech·Ed, MCT Summit, Developer Summit, and Virtual Tech Days, among others. Having run a successful startup for four years, Niraj enjoys working on IT innovations that can impact an enterprise’s bottom line, streamlining IT budgets through IT consolidation, architecture and integration of systems, performance tuning, and review of enterprise applications. He has received the Microsoft MVP award for ASP.NET, Connected Systems, and most recently Windows Azure. When he is away from his laptop, you will find him taking deep dives into automobiles, pottery, rafting, photography, cooking, and financial statements, though not necessarily in that order. He is also a manager/speaker at BDOTNET, Asia’s largest .NET user group.
Here is the guest post by Niraj Bhatt.
As the data in your applications grows, it’s the database that usually becomes the bottleneck. It’s hard to scale a relational DB, and the preferred approach for large-scale applications is to create separate databases for writes and reads. These databases are referred to as the transactional database and the reporting database. Though there are tools and techniques that allow you to create a snapshot of your transactional database for reporting purposes, sometimes they don’t quite fit the reporting requirements of an enterprise. These requirements typically include data analytics, an effective schema (one that an information worker can use to self-serve), historical data, better performance (flat data, no joins), etc. This is where the need for a data warehouse or an OLAP system arises.
A key point to remember is that a data warehouse is mostly a relational database. It’s built on top of the same concepts: tables, rows, columns, primary keys, foreign keys, etc. Before we talk about how data warehouses are typically structured, let’s understand the key components that create a data flow between OLTP systems and OLAP systems. There are 3 major areas to it:
a) The OLTP system should be capable of tracking its changes, as all these changes should go back to the data warehouse for historical recording. For example, if an OLTP transaction moves a customer from the silver to the gold category, the OLTP system needs to ensure that this change is tracked and sent to the data warehouse for reporting purposes. A report in this context could be how many customers, broken down by geography, moved from the silver to the gold category. In data warehouse terminology this process is called Change Data Capture. There are quite a few systems that leverage database triggers to move these changes to corresponding tracking tables. There are also out-of-the-box features provided by some databases; e.g. SQL Server 2008 offers Change Data Capture and Change Tracking for addressing such requirements.
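As a minimal sketch, here is how SQL Server 2008’s Change Data Capture could be switched on for a hypothetical dbo.Customer table (the table name and role choice are assumptions for illustration):

-- Enable Change Data Capture at the database level (run once per database)
EXEC sys.sp_cdc_enable_db;
GO

-- Start capturing changes for the hypothetical dbo.Customer table; SQL Server
-- creates a change table and capture jobs behind the scenes
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;  -- NULL = no gating role; lock this down in production
GO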
b) After we make the OLTP system capable of tracking its changes, we need to provision a batch process that runs periodically, takes these changes from the OLTP system, and dumps them into the data warehouse. There are many tools out there that can help you fill this gap; SQL Server Integration Services happens to be one of them.
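Continuing the sketch above, a periodic extraction job (hand-rolled or wrapped inside an SSIS package) could read pending changes through the table-valued function that CDC generates; the dbo_Customer capture instance is carried over from the previous example:

-- LSNs (log sequence numbers) bound the window of changes to extract
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Customer');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- Pull every insert, update, and delete captured for dbo.Customer; a real job
-- would persist @to_lsn and use it as the starting point of the next run
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');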
c) So we have an OLTP system that knows how to track its changes, and we have jobs that run periodically to move those changes to the warehouse. The question that remains is: how will the warehouse record these changes? This structural challenge in the data warehouse arena is usually addressed by something called a Slowly Changing Dimension (SCD). While we will talk about dimensions in a while, an SCD can be applied to pure relational tables too. An SCD enables a database structure to capture historical data. Since this creates multiple records for a given entity, data warehouses prefer to assign their own primary key, often known as a surrogate key.
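To make this concrete, here is a minimal sketch of a Type 2 SCD, the variant that keeps one row per version of an entity (the dbo.DimCustomer table and its columns are illustrative):

-- Each change to a customer creates a new row, preserving history;
-- CustomerKey is the warehouse's own surrogate key
CREATE TABLE dbo.DimCustomer
(
    CustomerKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerID    INT           NOT NULL,         -- business key from the OLTP system
    CustomerName  NVARCHAR(100) NOT NULL,
    Category      NVARCHAR(20)  NOT NULL,         -- e.g. Silver / Gold
    EffectiveFrom DATE          NOT NULL,
    EffectiveTo   DATE          NULL,             -- NULL = current version
    IsCurrent     BIT           NOT NULL DEFAULT 1
);

-- When customer 42 moves from Silver to Gold, expire the current row...
UPDATE dbo.DimCustomer
SET EffectiveTo = '2011-06-01', IsCurrent = 0
WHERE CustomerID = 42 AND IsCurrent = 1;

-- ...and insert a new version that records the change
INSERT dbo.DimCustomer (CustomerID, CustomerName, Category, EffectiveFrom)
VALUES (42, N'John Doe', N'Gold', '2011-06-01');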
As I mentioned, a data warehouse is just a relational database, but the industry often attributes a specific schema style to data warehouses. These styles are the Star Schema and the Snowflake Schema. The motivation behind these styles is to create a flat database structure (as opposed to a normalized one), which is easy to understand and use, easy to query, and easy to slice and dice. A star schema is a database structure made up of dimensions and facts. Facts are generally the numbers (sales, quantity, etc.) that you want to slice and dice. Fact tables hold these numbers and have references (foreign keys) to a set of tables that provide context around those facts. E.g. if you record 10,000 USD in sales, that number would go into a sales fact table and could have foreign keys attached to it referring to the sales agent responsible for the sale and to a time table containing the date on which the sale was made. These agent and time tables are called dimensions, and they provide context to the numbers stored in fact tables. This schema structure, with the fact at the center surrounded by dimensions, is called a Star schema. A similar structure, with the difference that the dimension tables are normalized, is called a Snowflake schema.
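A minimal star around the sales example might look like this (all table and column names are illustrative):

-- Dimension tables provide the context
CREATE TABLE dbo.DimAgent
(
    AgentKey  INT IDENTITY(1,1) PRIMARY KEY,
    AgentName NVARCHAR(100) NOT NULL,
    Region    NVARCHAR(50)  NOT NULL
);

CREATE TABLE dbo.DimDate
(
    DateKey      INT PRIMARY KEY,      -- e.g. 20110607
    CalendarDate DATE     NOT NULL,
    [Month]      TINYINT  NOT NULL,
    [Year]       SMALLINT NOT NULL
);

-- The fact table sits at the center of the star: numbers plus foreign keys
CREATE TABLE dbo.FactSales
(
    AgentKey    INT NOT NULL REFERENCES dbo.DimAgent(AgentKey),
    DateKey     INT NOT NULL REFERENCES dbo.DimDate(DateKey),
    SalesAmount DECIMAL(18,2) NOT NULL,  -- e.g. the 10,000 USD sale
    Quantity    INT NOT NULL
);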
This relational structure of facts and dimensions serves as an input for another analysis structure called a Cube. Though physically a Cube is a special structure supported by commercial products like SQL Server Analysis Services, logically it’s a multidimensional structure where dimensions define the sides of the cube and facts define the content. Facts are often called Measures inside a cube. Dimensions often tend to form a hierarchy. E.g. Product may be broken into categories, and categories in turn into individual items. Category and Item are often referred to as Levels, their constituents as Members, and the overall structure is called a Hierarchy. Measures are rolled up as per the dimensional hierarchy, and these rolled-up measures are called Aggregates. This may seem like an overwhelming vocabulary to deal with, but don’t worry, it will sink in as you start working with cubes.
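Cubes are typically queried with MDX rather than SQL, but the idea of rolling measures up a hierarchy has a plain T-SQL analogue; this sketch assumes the fact table also carries a ProductKey pointing at a hypothetical dbo.DimProduct dimension with Category and Item levels:

-- Roll SalesAmount up the Item -> Category hierarchy; ROLLUP emits one row per
-- item, a subtotal per category, and a grand total, mimicking cube aggregates
SELECT p.Category,
       p.Item,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales f
JOIN dbo.DimProduct p ON p.ProductKey = f.ProductKey
GROUP BY ROLLUP (p.Category, p.Item);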
Let’s look at a few other terms we run into when talking about data warehouses.
ODS, or an Operational Data Store, is a frequently misused term. There will be a few users in your organization who want to report on the most current data and can’t afford to miss a single transaction in their report. Then there is another set of users who typically don’t care how current the data is: mostly senior-level executives interested in trending, mining, forecasting, strategizing, etc., who don’t care about one specific transaction. This is where an ODS can come in handy. An ODS can use the same star schema and OLAP cubes we saw earlier. The only difference is that the data inside an ODS is short lived, i.e. kept for a few months, and the ODS syncs with the OLTP system every few minutes. The data warehouse can then periodically sync with the ODS, either daily or weekly, depending on business drivers.
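Since ODS data is short lived, a scheduled job typically trims anything beyond the retention window; here is a sketch, assuming a hypothetical ods.FactSales table with a LoadDate column and a three-month window:

-- Purge rows older than three months so the ODS stays small and fast
DELETE FROM ods.FactSales
WHERE LoadDate < DATEADD(MONTH, -3, GETDATE());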
Data marts are another frequently talked about topic in data warehousing. They are subject-specific data warehouses. Data warehouses that try to span an entire enterprise are normally too big to scope, build, manage, track, etc. Hence they are often scaled down to something called a Data mart that supports a specific segment of the business, like sales, marketing, or support. Data marts, too, are often designed using the star schema model discussed earlier. The industry is divided when it comes to the use of data marts. Some experts prefer having data marts along with a central data warehouse; the data warehouse here acts as an information staging and distribution hub, with the spokes being data marts connected via data feeds serving summarized data. Others eliminate the need for a centralized data warehouse, citing that most users want to report on detailed data.
Reference: Pinal Dave (http://blog.SQLAuthority.com)