Data Transformation: Importance of Structural Transformation
It is crucial to transform the structure of data when moving it from
operational source systems to Business Intelligence (BI) systems because
operational systems and analytical systems have fundamentally different
requirements for data organization [1–7].
Operational Systems (OLTP)
Designed for transactional efficiency, focusing on capturing, processing,
and storing individual business events. Data is often highly normalized
(e.g., header/detail patterns) to optimize inserts, updates, and
consistency, even if that means complex joins for queries [1, 5, 8–13].
BI & Analytical Systems (OLAP)
Optimized for analysis, reporting, and decision support. Structures like
the star schema provide quick access to aggregated historical data [6,
14–16].
Why Structural Transformation Is Important
-
Analytical Suitability: Operational data is often
granular and fragmented—“not ready for analytics.” Analytical needs
derived measures and changes in granularity (e.g., daily to monthly)
[2, 3, 22–27].
-
Ease of Use: Business users struggle with highly
normalized schemas. Star schemas and other business-oriented models
simplify querying and understanding [1, 17, 19–21, 28].
-
Performance Optimization: Denormalized schemas reduce
joins and speed up analytical queries [1, 19, 21, 29].
-
Relevance of Data: Drop purely operational details or
aggregate them (e.g., full addresses to postal codes) for analysis
[2, 30–32].
-
Adding Context: Introduce time or other dimensions
crucial for analysis if not already present [2, 24, 25, 33].
Data Cleansing and Quality Management
Data Cleansing
Data cleansing is the process of detecting and correcting or removing
corrupt, incorrect, or irrelevant data from a dataset [8, 29, 34]. Even
if operational data quality is high, small inconsistencies can derail
analytical results [8, 29, 34, 35].
How to Identify Data Quality Issues
-
Check Against Business Rules: Validate duplicates,
missing or impossible values, inconsistent formats [8, 37, 38, 40].
-
Data Profiling & Cluster Analysis: Identify outliers
or odd distributions during exploratory analysis [39, 41].
-
Verify Against Real Values: Confirm with actual
sources (e.g., calling customers) for highest accuracy [39, 41, 42].
Data Quality Feedback Loop
Feed BI-discovered quality issues back to operational systems to prevent
recurrence [37, 40, 42–44].
- Identification: Errors found during cleansing/analysis.
-
Root Cause Analysis: Investigate input forms,
definitions, integration issues.
-
Feedback: Inform IT (system fixes) and business users
(process changes) [42–46].
-
Prevention: Implement validation rules, improve
information culture, redesign processes [42–45].
-
Monitoring: Continuously track data quality within
Business Performance Management (BPM) [43–46].
Stale Data Errors
Occur when data correctly entered at the time loses context as business
definitions change, leading to misinterpretation in historical analysis
[46–49].
Example
The code “GM” meant “General Motors” in 1995 but “General Mills” after
2017. Without handling, BI queries aggregate both under “GM,” corrupting
trends [46–48, 51–53].
Impact
- Inaccurate analyses, distorted trends and segmentation [50, 54].
- Poor business decisions from misleading data [50, 54].
- Loss of trust in BI if reports are erroneous [50, 54].
Solution
Use Slowly Changing Dimensions (SCDs) to track and manage
dimension changes over time, preserving historical accuracy [55, 56].
Roche’s Maxim of Data Transformation
“Data should be transformed as far upstream as possible, and as far
downstream as necessary” [19, 57, 58].
As Far Upstream as Possible
-
Address quality and transformations close to the data source
(validation rules, MDM) to prevent error propagation [19, 57, 59, 60].
As Far Downstream as Necessary
-
Perform additional transformations near analytical applications
(e.g., star schemas, aggregations) to meet specific reporting or
machine learning needs [19, 61–64].
Multidimensional Analytics: Measures, Dimensions, and Facts
-
Facts: Events or processes stored in a fact table
(e.g., sales transactions) [62, 68–71].
-
Measures: Numeric values (e.g., quantity, sales value)
that can be aggregated [68, 70, 72, 73].
-
Dimensions: Context attributes (e.g., time, product,
customer) used for filtering and grouping [18, 65, 75–80].
Star Schema
Structure
A central fact table links to descriptive dimension tables. Fact tables
hold measures and foreign keys; dimensions hold context attributes
[17–20, 69, 71, 83].
Benefits
- Simplicity: Intuitive for business users [1, 17, 19, 21].
- Performance: Fewer joins speed up queries [1, 19, 21].
- Flexibility: Easy slicing and dicing [65, 66].
Levels and Hierarchies
Hierarchies organize dimension members into levels for drill-down and
roll-up [75, 85, 86].
-
Leveled Hierarchies: Uniform depth (e.g., Year > Quarter
> Month > Day) [86–89].
-
Ragged Hierarchies: Varying depths (e.g., Country >
State > City) [88–91].
Slowly Changing Dimensions (SCD)
-
Type 1: Overwrite old values—history lost [96–99].
-
Type 2: Create new records with surrogate keys and
validity dates—full history preserved [100–106].
-
Type 3: Add columns for previous values—limited history
[95, 107].
-
Type 4: Split frequently changing attributes into a
separate mini-dimension [95, 108, 109].
-
Type 5: Hybrid of Types 1/4—current values in main
table, history in mini-dimension [94, 110].
Nonadditive and Semiadditive Measures
-
Additive: Can sum across all dimensions (e.g., sales
value) [111].
-
Nonadditive: Cannot sum without losing meaning (e.g.,
ratios, percentages) [73, 111].
-
Semiadditive: Additive across some dimensions but not
others (e.g., inventory level over time) [73, 111, 112].
BI tools default to summation—explicit aggregation functions (AVERAGE,
LAST_VALUE) are needed for nonadditive measures [111]. Missing values
treated as zero can also skew results [112, 113].
Calculation Precedencies and Custom Rollups
When custom rollups are required, the order of operations matters for
non-commutative formulas [113–116]. For example, summing individual
Actual/Plan ratios differs from ratio of summed Actual to summed Plan
[115, 116].
OLAP Operations: Drill-Down, Roll-Up, Slice, and Dice
-
Drill-Down: Move from summary to detail (e.g., Year →
Quarter → Month) [117, 118].
-
Roll-Up: Move from detail to summary (e.g., Month →
Quarter) [117, 118].
-
Slice: Filter on one dimension to create a subcube
(e.g., sales in East Region) [117–119].
-
Dice: Filter on multiple dimensions (e.g., Electronics
in East Region in Q1) [119, 120].
Other Related Operations
-
Pivot (Rotate): Swap axes in the view without changing
data [119, 120].
-
Drill Across: Analyze across fact tables sharing
dimensions [119, 120].
-
Drill Through: Access detailed relational data beyond
the cube’s lowest level [119, 121].