Topic 2: Mapping the Data Warehouse to a Multiprocessor Architecture (Data Warehouse and Data Mining)
Mapping a data warehouse to a
multiprocessor architecture involves adapting the relational database
management system (RDBMS) underlying the data warehouse to leverage parallel
processing across multiple processors or nodes. This enables efficient handling
of large-scale data queries, analysis, and storage by distributing workloads.
Data warehouses, which store integrated historical data for decision support,
benefit from this mapping through improved performance, scalability, and faster
query execution. The core idea is to use parallelism in database operations,
where tasks are divided and executed concurrently, based on multiprocessor
hardware configurations.
The primary goals are achieving linear
speedup (reducing query response time proportionally by adding processors)
and linear scaleup (maintaining performance as data volume grows by
scaling resources). This is particularly crucial for data warehouses dealing
with terabytes of data and complex analytical queries.
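The speedup and scaleup goals above can be made concrete with a small calculation. This is an illustrative sketch with invented timings, not measurements from any real system:

```python
# Speedup = time on 1 processor / time on N processors; linear speedup
# means the ratio equals N. Scaleup compares a small system on a small
# problem against an N-times-larger system on an N-times-larger problem;
# linear scaleup means the elapsed times stay equal (ratio == 1).

def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of single-processor time to N-processor time."""
    return t_serial / t_parallel

def scaleup(t_small: float, t_large: float) -> float:
    """Elapsed-time ratio when data volume and processors grow together."""
    return t_small / t_large

# Hypothetical timings: a table scan takes 120 s on 1 CPU, 32 s on 4 CPUs.
s = speedup(120.0, 32.0)
print(f"speedup on 4 CPUs: {s:.2f} (linear would be 4.00)")   # 3.75

# 1 CPU / 1 TB takes 120 s; 4 CPUs / 4 TB takes 130 s.
u = scaleup(120.0, 130.0)
print(f"scaleup factor: {u:.2f} (linear would be 1.00)")      # ~0.92
```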
Types of Parallelism in Data Warehouses
Parallelism allows multiple
operations to run simultaneously, optimizing resource use:
- Inter-Query Parallelism: Different queries or requests are handled concurrently by separate server threads or processes. This is ideal for multi-user environments where independent queries can run without interference.
- Intra-Query Parallelism: A single query is decomposed into sub-operations (e.g., scan, join, sort, aggregate) that execute in parallel. It includes:
  - Horizontal Parallelism: Data is partitioned across multiple disks or nodes, and the same operation runs on different data subsets simultaneously (e.g., scanning partitions in parallel).
  - Vertical Parallelism: Query steps are pipelined, so the output of one operation (e.g., scan) feeds directly into the next (e.g., join) without waiting for completion.
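Horizontal (partitioned) parallelism can be sketched in a few lines: the same scan-and-filter operation runs concurrently over disjoint data partitions, and the partial results are merged. The row data and the filter predicate are invented for illustration; a real DBMS would run these tasks on separate processes or nodes rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows):
    """Scan one partition, keeping rows that satisfy the predicate."""
    return [r for r in rows if r["amount"] > 100]

def parallel_scan(partitions):
    """Run the same scan over every partition concurrently, then merge."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(scan_partition, partitions)  # one task per partition
    merged = []
    for part in results:
        merged.extend(part)
    return merged

partitions = [
    [{"id": 1, "amount": 50}, {"id": 2, "amount": 150}],
    [{"id": 3, "amount": 200}, {"id": 4, "amount": 75}],
]
print(parallel_scan(partitions))
# [{'id': 2, 'amount': 150}, {'id': 3, 'amount': 200}]
```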
Mapping the Data Warehouse Architecture to a Multiprocessor Architecture
Data Partitioning Strategies
Partitioning divides data across
processors or disks to facilitate parallelism and reduce I/O bottlenecks. It
ensures data locality and efficient access:
- Random Partitioning: Data is distributed evenly without regard to content, using methods such as round-robin (sequential assignment) or random striping.
- Intelligent Partitioning: The DBMS tracks data locations for targeted access:
  - Hash Partitioning: Applies a hash function to a key column to assign rows to partitions.
  - Key Range Partitioning: Groups rows by ranges of key values (e.g., customer IDs 1–1000 in partition 1).
  - Schema Partitioning: Assigns entire tables to specific disks; useful for small, frequently joined reference tables.
  - User-Defined Partitioning: Custom rules or expressions define partitions for specialized needs.
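The partitioning schemes listed above can be sketched as simple routing functions. The row data, key names, and partition counts are invented for illustration:

```python
def round_robin_partition(rows, n_parts):
    """Random/round-robin: assign rows in rotation, ignoring content."""
    parts = [[] for _ in range(n_parts)]
    for i, row in enumerate(rows):
        parts[i % n_parts].append(row)
    return parts

def hash_partition(rows, key, n_parts):
    """Hash partitioning: route each row by hashing its key column."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def range_partition(rows, key, boundaries):
    """Key-range partitioning: route each row to the first range whose
    upper bound covers its key; boundaries = [1000, 2000] gives ranges
    <=1000, <=2000, and a final catch-all partition."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        for i, bound in enumerate(boundaries):
            if row[key] <= bound:
                parts[i].append(row)
                break
        else:
            parts[-1].append(row)
    return parts

rows = [{"cust_id": k} for k in (1, 500, 1500, 2500)]
print([len(p) for p in range_partition(rows, "cust_id", [1000, 2000])])
# [2, 1, 1]
```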
Multiprocessor Architectures for Data Warehouses
Data warehouses map to three main
DBMS architectures for parallel processing, each with trade-offs in
scalability, communication, and fault tolerance. These are often implemented in
symmetric multiprocessing (SMP) or massively parallel processing (MPP) systems.
| Architecture | Description | Advantages | Disadvantages | Typical Use Cases |
| --- | --- | --- | --- | --- |
| Shared Memory (Shared Everything) | Multiple processing units (PUs) share a common memory pool via a high-speed bus; all processors access the same memory and disks. | Simple implementation; fast communication via memory; a single system image for easy administration. | Limited scalability due to bus-bandwidth bottlenecks, memory contention, and latency issues. | Small to medium data warehouses on SMP machines like the Sun SparcServer. |
| Shared Disk | Each node has private memory and PUs but shares disks via a high-speed interconnect; a Distributed Lock Manager (DLM) handles synchronization and cache consistency. | High availability (data remains accessible if a node fails); incremental growth; unified database view. | Overhead from DLM synchronization; relies on a fast interconnect; poor performance if workloads aren't well partitioned. | Clustered environments like VAX or Sun clusters for fault-tolerant data warehouses. |
| Shared Nothing (Distributed Memory) | Each node has its own CPU, memory, and disks; no shared resources. Data is partitioned across nodes, with minimal inter-node communication. | Excellent scalability (nodes and disks are added easily); fault isolation (a node failure doesn't affect others); cost-effective for large-scale growth. | Higher coordination cost for cross-node queries; latency in remote data access; less ideal for heavy updates without optimizations like data-dependent routing. | Large, read-heavy data warehouses in MPP systems, such as decision-support queries. |
Some systems use a combined
architecture, blending shared nothing for inter-server parallelism (across
clusters) and shared memory/disk for intra-server parallelism (within a node).
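The data-dependent routing mentioned for shared-nothing systems can be sketched briefly: a query on a partitioning-key value is sent only to the node that owns the matching hash partition, avoiding cross-node traffic. The node names and query text are invented for illustration:

```python
# Hypothetical shared-nothing cluster: each node owns one hash partition.
NODES = ["node-a", "node-b", "node-c"]

def owner_node(key_value):
    """Pick the node that stores the hash partition for this key."""
    return NODES[hash(key_value) % len(NODES)]

def route_query(key_value):
    """Send a single-key query only to the owning node."""
    node = owner_node(key_value)
    return f"SELECT ... WHERE key = {key_value!r} executed on {node}"

print(route_query("cust-42"))  # only one of the three nodes is involved
```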
Parallel DBMS Features and Vendor Examples
Modern parallel DBMSs include features such as parallel query optimization, application transparency (queries run in parallel without code changes), and tools for monitoring and tuning. Vendors supporting these architectures include:
- Oracle: Shared disk with a parallel query option; supports hash, key range, and round-robin partitioning; parallel joins, scans, and index builds.
- Informix: Supports all three architectures; offers round-robin, hash, schema, key range, and user-defined partitioning; parallel inserts, updates, and deletes.
- IBM DB2: Shared nothing; hash partitioning; parallel loads, recovery, and index creation.
- Sybase MPP: Shared nothing; hash, key range, and schema partitioning; combines horizontal and vertical parallelism.
This mapping is essential for data
warehouses in industries like retail, finance, and healthcare, where quick
insights from massive datasets are critical.