Data Transformation: Importance of Structural Transformation

It is crucial to transform the structure of data when moving it from operational source systems to Business Intelligence (BI) systems because operational systems and analytical systems have fundamentally different requirements for data organization [1–7].

Operational Systems (OLTP)

Designed for transactional efficiency, focusing on capturing, processing, and storing individual business events. Data is often highly normalized (e.g., header/detail patterns) to optimize inserts, updates, and consistency, even if that means complex joins for queries [1, 5, 8–13].

BI & Analytical Systems (OLAP)

Optimized for analysis, reporting, and decision support. Structures like the star schema provide quick access to aggregated historical data [6, 14–16].

Why Structural Transformation Is Important

Analytical Suitability: Operational data is often granular and fragmented—“not ready for analytics.” Analytical needs derived measures and changes in granularity (e.g., daily to monthly) [2, 3, 22–27].
Ease of Use: Business users struggle with highly normalized schemas. Star schemas and other business-oriented models simplify querying and understanding [1, 17, 19–21, 28].
Performance Optimization: Denormalized schemas reduce joins and speed up analytical queries [1, 19, 21, 29].
Relevance of Data: Drop purely operational details or aggregate them (e.g., full addresses to postal codes) for analysis [2, 30–32].
Adding Context: Introduce time or other dimensions crucial for analysis if not already present [2, 24, 25, 33].

Data Cleansing and Quality Management

Data Cleansing

Data cleansing is the process of detecting and correcting or removing corrupt, incorrect, or irrelevant data from a dataset [8, 29, 34]. Even if operational data quality is high, small inconsistencies can derail analytical results [8, 29, 34, 35].

How to Identify Data Quality Issues

Check Against Business Rules: Validate duplicates, missing or impossible values, inconsistent formats [8, 37, 38, 40].
Data Profiling & Cluster Analysis: Identify outliers or odd distributions during exploratory analysis [39, 41].
Verify Against Real Values: Confirm with actual sources (e.g., calling customers) for highest accuracy [39, 41, 42].

Data Quality Feedback Loop

Feed BI-discovered quality issues back to operational systems to prevent recurrence [37, 40, 42–44].

Identification: Errors found during cleansing/analysis.
Root Cause Analysis: Investigate input forms, definitions, integration issues.
Feedback: Inform IT (system fixes) and business users (process changes) [42–46].
Prevention: Implement validation rules, improve information culture, redesign processes [42–45].
Monitoring: Continuously track data quality within Business Performance Management (BPM) [43–46].

Stale Data Errors

Occur when data correctly entered at the time loses context as business definitions change, leading to misinterpretation in historical analysis [46–49].

Example

The code “GM” meant “General Motors” in 1995 but “General Mills” after 2017. Without handling, BI queries aggregate both under “GM,” corrupting trends [46–48, 51–53].

Impact

Inaccurate analyses, distorted trends and segmentation [50, 54].
Poor business decisions from misleading data [50, 54].
Loss of trust in BI if reports are erroneous [50, 54].

Solution

Use Slowly Changing Dimensions (SCDs) to track and manage dimension changes over time, preserving historical accuracy [55, 56].

Roche’s Maxim of Data Transformation

“Data should be transformed as far upstream as possible, and as far downstream as necessary” [19, 57, 58].

As Far Upstream as Possible

Address quality and transformations close to the data source (validation rules, MDM) to prevent error propagation [19, 57, 59, 60].

As Far Downstream as Necessary

Perform additional transformations near analytical applications (e.g., star schemas, aggregations) to meet specific reporting or machine learning needs [19, 61–64].

Multidimensional Analytics: Measures, Dimensions, and Facts

Facts: Events or processes stored in a fact table (e.g., sales transactions) [62, 68–71].
Measures: Numeric values (e.g., quantity, sales value) that can be aggregated [68, 70, 72, 73].
Dimensions: Context attributes (e.g., time, product, customer) used for filtering and grouping [18, 65, 75–80].

Star Schema

Structure

A central fact table links to descriptive dimension tables. Fact tables hold measures and foreign keys; dimensions hold context attributes [17–20, 69, 71, 83].

Benefits

Simplicity: Intuitive for business users [1, 17, 19, 21].
Performance: Fewer joins speed up queries [1, 19, 21].
Flexibility: Easy slicing and dicing [65, 66].

Levels and Hierarchies

Hierarchies organize dimension members into levels for drill-down and roll-up [75, 85, 86].

Leveled Hierarchies: Uniform depth (e.g., Year > Quarter > Month > Day) [86–89].
Ragged Hierarchies: Varying depths (e.g., Country > State > City) [88–91].

Slowly Changing Dimensions (SCD)

Type 1: Overwrite old values—history lost [96–99].
Type 2: Create new records with surrogate keys and validity dates—full history preserved [100–106].
Type 3: Add columns for previous values—limited history [95, 107].
Type 4: Split frequently changing attributes into a separate mini-dimension [95, 108, 109].
Type 5: Hybrid of Types 1/4—current values in main table, history in mini-dimension [94, 110].

Nonadditive and Semiadditive Measures

Additive: Can sum across all dimensions (e.g., sales value) [111].
Nonadditive: Cannot sum without losing meaning (e.g., ratios, percentages) [73, 111].
Semiadditive: Additive across some dimensions but not others (e.g., inventory level over time) [73, 111, 112].

BI tools default to summation—explicit aggregation functions (AVERAGE, LAST_VALUE) are needed for nonadditive measures [111]. Missing values treated as zero can also skew results [112, 113].

Calculation Precedencies and Custom Rollups

When custom rollups are required, the order of operations matters for non-commutative formulas [113–116]. For example, summing individual Actual/Plan ratios differs from ratio of summed Actual to summed Plan [115, 116].

OLAP Operations: Drill-Down, Roll-Up, Slice, and Dice

Drill-Down: Move from summary to detail (e.g., Year → Quarter → Month) [117, 118].
Roll-Up: Move from detail to summary (e.g., Month → Quarter) [117, 118].
Slice: Filter on one dimension to create a subcube (e.g., sales in East Region) [117–119].
Dice: Filter on multiple dimensions (e.g., Electronics in East Region in Q1) [119, 120].

Other Related Operations

Pivot (Rotate): Swap axes in the view without changing data [119, 120].
Drill Across: Analyze across fact tables sharing dimensions [119, 120].
Drill Through: Access detailed relational data beyond the cube’s lowest level [119, 121].