How SQL Server 2008 can help to build fast track data warehouse?
Beginning of this year I have posted here about Build your BI skills in fast track data warehouse and based on that I have had few questions and feedback on how best we can take help of SQL Server 2008 features to build the fast track data warehouse.
Based on the material and resources available on Microsoft site the approach must be on data access layer, configuration (hardware, software) and maintaining the normalization practices.When you talk about Business Intelligence then data access layer is key to work on consecutive I/O access for data warehouse workloads.The platform must be scalable in terms of performance & access methods on hardware side in addition to data layer side. By design the data warehouse store will have large file reads with less writes, this involves one-block operation on disk level to obtain the data having a penalty (minimum) of latency with the disk drive head having to move to a
new location. Whereas the hardware configuration of Random I/O, on the other hand, involves large numbers of seeks and
rotations, and is usually much slower.
As an Analyst for BI implementation you need to understand the performance constraints related to configuration techniques on hardware, software & application (if needed) to maximize the sequential I/O data access methods. By design the I/O from an instance of the SQL Server Database Engine includes logical
and physical reads, such logical read occurs every time the DB Engine
requests a page from the buffer cache. If
the page is not currently in the buffer cache, a physical read first copies the
page from disk into the cache.You can help the DB engine optimizer in a way that relational engine determines the most effective access method (such as a table
scan, an index scan, or a keyed read); the access methods and buffer manager
components of the storage engine determine the general pattern of reads to
perform, and optimize the reads required to implement the access method.
Such performance optimization methods are called as read-ahead, BOL clarifies that "....the data and index pages needed to
fulfill a query execution plan and brings the pages into the buffer cache before
they are actually used by the query. This allows computation and I/O to overlap,
taking full advantage of both the CPU and the disk". This mechanism helps the DB engine to manage upto 64
contiguous pages (512KB) from one file. The read is performed as a single
scatter-gather read to the appropriate number of (probably non-contiguous)
buffers in the buffer cache. This is where you will observe table scans and index scans, this is entirely a different topic that will be covered on performance tuning blog section on this site.
Coming back to reference of configuration model (Fast
Track Data Warehouse) approach from Microsoft, it is key that maintaining performance over time is
preserving the sequential order of the data on disk by preventing and/or
minimizing the fragmentation of the data under normal operating conditions. The programme is highlighted with an objectives such as:
- Providing an initial SSDW sequential I/O data access experience with known
performance and scalability characteristics.
- Developing a method that is focused on understanding the performance
characteristics, limitations, and the cause-and-effect relationships of each
component contained in the configuration.
- Providing proven recommendations of SSDW best practices for sequential I/O
data access that are simple, easy to implement, and include holistic
recommendations for database setup, configuration, and data loading.
One of the big question was How the fast track approach is different?
Microsoft documentation highlights that approach is specifically focused on building
scalable CPU core-balanced configurations to support SSDW sequential I/O data
access workloads. A holistic and vendor-agnostic approach is used to address the
CPU to I/O balance of the system to efficiently design a configuration that is
intentionally targeted at and optimized for sequential data access. This is
designed to achieve the best possible performance by matching all internal
components to their individual maximum performance level while not exceeding the
capacity of any components above it in the stack, and then applying principles
in physical and logical layout that allow the hardware to provide this maximum
performance in a real environment.
So you need to be aware that to attain such a scalable platform you need to determine the right mix of
hardware, software, layout, and configuration settings differs significantly
from conventional reference architectures because the focus is on determining
the right file and software configuration while maintaining maximum performance
of all items in the hardware stack, starting with the CPU and the cores, and
working back down to the storage.
The underlying the optimization technique includes usage of building-block approach, to keep performance
constant per MB of data processed, there must be enough complete building blocks
used to scale up the CPU and scale out storage together. Scaling up the CPU
cores in unison with storage components provides near constant performance for
larger data volumes. It is also possible to increase storage capacity without
adding more CPU cores by using larger drives or more drives per LUN; however
this implies that queries covering a wider range of data will have lower
performance. Important to note is that adding more data but still querying the
same sized ranges per query will not decrease performance.
Lastly you can download Spreadsheet Calculator that has been developed to assist with
the optimization of the software configuration on the selected hardware based
upon specified component performance ratings and internal data feed requirements
for SQL Server.
- The spreadsheet helps to advise the number of storage
arrays, disks, and LUN’s required to achieve the specified throughput given the
hardware detailed in the calculator.
- The spreadsheet has pre-populated ratings determined from
component testing and research for both the EMC CX4-240 and HP MSA2000 storage
options. See above for details on how to determine the SQL Server per-core
requirement for any new compute node server configuration.
- New configurations and new hardware can be added to the
calculator by cloning an existing EMC or HP sheet, renaming it to the new
configuration, and entering the specific factors for the hardware being
reviewed. The new sheet will then utilize the new factors when doing its
calculations for LUN’s, drives, and the number of required storage arrays.
Review the Implementing a SQL Server Fast Track Data
Warehouse & An
Introduction to Fast Track Data Warehouse Architectures links too.
**__________________________________**
SQL Server MVP, Sr. DBA & industry expert.
-
Knowledge is of two kinds. We know a subject ourselves or we know where we can find information on it. It is also a power and you will gain by sharing it.