Topic - 2 Mapping the Data Warehouse to a Multiprocessor Architecture (Data Warehouse and Data Mining)

 


Mapping the Data Warehouse to a Multiprocessor Architecture

Mapping a data warehouse to a multiprocessor architecture involves adapting the relational database management system (RDBMS) underlying the data warehouse to leverage parallel processing across multiple processors or nodes. This enables efficient handling of large-scale data queries, analysis, and storage by distributing workloads. Data warehouses, which store integrated historical data for decision support, benefit from this mapping through improved performance, scalability, and faster query execution. The core idea is to exploit parallelism in database operations: tasks are divided and executed concurrently across whatever processors the underlying hardware configuration provides.

The primary goals are achieving linear speedup (reducing query response time proportionally by adding processors) and linear scaleup (maintaining performance as data volume grows by scaling resources). This is particularly crucial for data warehouses dealing with terabytes of data and complex analytical queries.
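The two goals can be made concrete with a small sketch. The helper names below are illustrative, not part of any DBMS API; they simply encode the standard definitions of speedup and scaleup:

```python
def speedup(t_one_cpu, t_n_cpus):
    """Speedup = T(1 processor) / T(n processors); 'linear' means speedup == n."""
    return t_one_cpu / t_n_cpus

def scaleup(t_small_job_small_system, t_scaled_job_scaled_system):
    """Scaleup = T(job, system) / T(n-times job, n-times system);
    'linear' means the ratio stays at 1.0 as data and hardware grow together."""
    return t_small_job_small_system / t_scaled_job_scaled_system

# A 400-second query dropping to 100 seconds on 4 processors is linear speedup:
print(speedup(400, 100))   # 4.0
# Constant elapsed time as data and hardware both quadruple is linear scaleup:
print(scaleup(100, 100))   # 1.0
```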

Types of Parallelism in Data Warehouses

Parallelism allows multiple operations to run simultaneously, optimizing resource use:

  • Inter-Query Parallelism: Different queries or requests are handled concurrently by separate server threads or processes. This is ideal for multi-user environments where independent queries can run without interference.
  • Intra-Query Parallelism: A single query is decomposed into sub-operations (e.g., scan, join, sort, aggregate) that execute in parallel. It includes:
    • Horizontal Parallelism: Data is partitioned across multiple disks or nodes, and the same operation runs on different data subsets simultaneously (e.g., scanning partitions in parallel).
    • Vertical Parallelism: Query steps are pipelined, where the output of one operation (e.g., scan) feeds directly into the next (e.g., join) without waiting for completion.
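Horizontal parallelism can be sketched with a thread pool that runs the same scan-and-aggregate operation on every partition at once, followed by a final merge. The table data and function names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Four horizontal partitions of a hypothetical 'sales' table: (region, amount) rows.
partitions = [
    [("east", 120), ("east", 80)],
    [("west", 200)],
    [("north", 50), ("north", 90)],
    [("south", 300)],
]

def scan_and_aggregate(rows):
    """The same operation (scan + local sum) runs on each partition's subset."""
    return sum(amount for _, amount in rows)

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_sums = list(pool.map(scan_and_aggregate, partitions))

total = sum(partial_sums)  # final merge of the per-partition results
print(total)  # 840
```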



Figure: Mapping the data warehouse architecture to a multiprocessor architecture

Data Partitioning Strategies

Partitioning divides data across processors or disks to facilitate parallelism and reduce I/O bottlenecks. It ensures data locality and efficient access:

  • Random Partitioning: Data is distributed evenly without regard to content, using methods like round-robin (sequential assignment) or random striping.
  • Intelligent Partitioning: The DBMS tracks data locations for targeted access:
    • Hash Partitioning: Applies a hash function to a key column to assign rows to partitions.
    • Key Range Partitioning: Groups rows by ranges of key values (e.g., customer IDs 1–1000 in partition 1).
    • Schema Partitioning: Assigns entire tables to specific disks, useful for small, frequently joined reference tables.
    • User-Defined Partitioning: Custom rules or expressions define partitions for specialized needs.
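The first three strategies above can be sketched in a few lines of Python. Real DBMSs implement these inside the storage engine; the function names here are illustrative only:

```python
def round_robin_partition(rows, n):
    """Random partitioning: rows assigned in sequence, content ignored."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, key, n):
    """Hash partitioning: a hash of the key column picks the partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def key_range_partition(rows, key, boundaries):
    """Key range partitioning: boundaries [1000, 2000] gives three partitions
    for keys <= 1000, 1001-2000, and > 2000."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        idx = sum(row[key] > b for b in boundaries)
        parts[idx].append(row)
    return parts

customers = [{"id": i} for i in (5, 1500, 2500, 999)]
print(key_range_partition(customers, "id", [1000, 2000]))
```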

Multiprocessor Architectures for Data Warehouses

Data warehouses map to three main DBMS architectures for parallel processing, each with trade-offs in scalability, communication, and fault tolerance. These are often implemented in symmetric multiprocessing (SMP) or massively parallel processing (MPP) systems.

Shared Memory (Shared Everything)

  • Description: Multiple processing units (PUs) share a common memory pool via a high-speed bus. All processors access the same memory and disks.
  • Advantages: Simple implementation, fast communication via memory, and a single system image for easy administration.
  • Disadvantages: Limited scalability due to bus bandwidth bottlenecks, memory contention, and latency issues.
  • Typical Use Cases: Small to medium data warehouses on SMP machines like Sun SparcServer.

Shared Disk

  • Description: Each node has private memory and PUs but shares disks via a high-speed interconnect. A Distributed Lock Manager (DLM) handles synchronization and cache consistency.
  • Advantages: High availability (data remains accessible if a node fails), incremental growth, and a unified database view.
  • Disadvantages: Overhead from DLM synchronization; relies on a fast interconnect; poor performance if workloads aren't well-partitioned.
  • Typical Use Cases: Clustered environments like VAX or Sun clusters for fault-tolerant data warehouses.

Shared Nothing (Distributed Memory)

  • Description: Each node has its own CPU, memory, and disks; no shared resources. Data is partitioned across nodes, with minimal inter-node communication.
  • Advantages: Excellent scalability (add nodes/disks easily); fault isolation (a node failure doesn't affect others); cost-effective for large-scale growth.
  • Disadvantages: Higher coordination for cross-node queries; latency in remote data access; less ideal for heavy updates without optimizations like data-dependent routing.
  • Typical Use Cases: Large read-heavy data warehouses in MPP systems, such as decision support queries.
Some systems use a combined architecture, blending shared nothing for inter-server parallelism (across clusters) and shared memory/disk for intra-server parallelism (within a node).
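A shared-nothing query is typically executed as a scatter-gather: a coordinator sends the same sub-query to every node, each node aggregates its own local partition, and the coordinator merges the partial results. The node names, data, and functions below are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" owns its own (product, quantity) rows; nothing is shared.
node_partitions = {
    "node1": [("laptop", 3), ("phone", 7)],
    "node2": [("laptop", 2), ("tablet", 4)],
    "node3": [("phone", 1)],
}

def local_group_sum(rows):
    """Runs on a single node against its local data only."""
    counts = {}
    for product, qty in rows:
        counts[product] = counts.get(product, 0) + qty
    return counts

def scatter_gather(parts):
    """Coordinator: scatter the sub-query to all nodes, then merge partial aggregates."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = pool.map(local_group_sum, parts.values())
    merged = {}
    for partial in partials:
        for product, qty in partial.items():
            merged[product] = merged.get(product, 0) + qty
    return merged

print(scatter_gather(node_partitions))  # {'laptop': 5, 'phone': 8, 'tablet': 4}
```

Only the small per-node aggregates cross the interconnect, not the raw rows, which is why this pattern scales well for read-heavy decision-support queries.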







Parallel DBMS Features and Vendor Examples

Modern parallel DBMS include features like parallel query optimization, application transparency (queries run parallel without code changes), and tools for monitoring/tuning. Vendors supporting these architectures include:

  • Oracle: Shared disk with parallel query option; supports hash, key range, and round-robin partitioning; parallel joins, scans, and indexes.
  • Informix: Supports all three architectures; offers round-robin, hash, schema, key range, and user-defined partitioning; parallel inserts, updates, and deletes.
  • IBM DB2: Shared nothing; hash partitioning; parallel loads, recoveries, and index creation.
  • Sybase MPP: Shared nothing; hash, key range, and schema partitioning; combines horizontal and vertical parallelism.

This mapping is essential for data warehouses in industries like retail, finance, and healthcare, where quick insights from massive datasets are critical.

