Data Visualization Guidelines
Data visualization is crucial for Business Intelligence & Analytics
(BI&A) because it helps to present complex information in a graphical
form (e.g., charts, maps, flow diagrams) to facilitate understanding and
pattern recognition [1]. It is key for rapid understanding and
communication of insights derived from data analysis [1].
Digital dashboards, for instance, are designed to present key information
on a single screen, are typically highly personalized, and display
aggregated data (though users can often drill down for more detail)
[2, 3]. The primary purpose of visualization is to make complex data
accessible and actionable for users, especially for decision-making [1].
Big Data Definition and Characteristics
Big Data refers to data that cannot be handled with traditional
approaches, tools, or technologies [4]. It is characterized by its immense
scale and scope, often exemplified by the data generated by large digital
companies like Google, Amazon, and Facebook [5].
The concept of Big Data is commonly described by the Four V's [5]:
- Volume: The sheer amount of data. Data volume is scaling faster than
computing resources, which poses challenges for distributed storage and
processing and for operating in dynamic environments [5, 6].
- Velocity: The speed at which data flows through systems, including the
rapid collection of huge amounts of data in short periods [6].
Traditional SQL databases are often not designed for such streaming data
[6]. The challenge is to maintain both accuracy and speed [6].
- Variety: Data comes from multiple sources in different
shapes and formats: structured, semi-structured, and unstructured [6].
- Veracity: The uncertainty and imprecision of data.
Big Data can be imprecise, noisy, or wrong [6, 7]. High velocity and
volume often preclude traditional ETL, leading to raw data storage
and a “good enough” quality focus [7].
Big Data Technology Categories
When dealing with Big Data, technologies primarily fall into three key
categories [8]:
- Storage: Addresses the volume component [8].
- Processing: Aligns with the velocity component [8].
- Analytics: Methods to gain insights from stored and
processed information, spanning both volume and variety [8].
Specific Problems That Have to Be Resolved (for Big Data)
- Data Volume Scalability: Data grows faster than
computing resources, requiring distributed solutions [5, 6].
- Handling Unstructured Data: Big Data includes
unstructured data, which calls for technologies such as Hadoop and NoSQL
databases [9–11].
- Real-time Processing Needs: High velocity demands
stream processing systems [6, 12].
- Data Quality (Veracity): Imprecise or noisy data
needs “good enough” quality strategies [6, 7].
- Complexity of Data Pipelines: More nodes increase
pipeline complexity; they must be fault-tolerant [13].
Cloud BI – Why Is It Important, Disadvantages
Advantages
- Cost-Effectiveness: Near-infinite, low-cost storage with
pay-as-you-go pricing [14–16].
- Scalability: Easy scaling without heavy in-house
infrastructure [9, 14, 16, 17].
- Reliability & Availability: High durability
guarantees [9, 15, 17].
- Management & Security Outsourcing: Vendor-managed
warehousing and security [14, 16].
- Flexibility: Options for regions and storage classes
(e.g., Amazon S3) [15, 18].
Disadvantages: Not explicitly listed in the sources.
Columnar Storage
Columnar storage stores data by columns rather than rows [19]. The
primary key maps back to row IDs, which optimizes analytical queries
that read only a few columns at a time.
Main advantages include:
- Efficient Processing: Fast aggregations on single
columns [19, 20].
- High Compression Rates: Less space (e.g., Parquet)
[20, 21].
- Self-Describing: Includes metadata and schema
[21].
- Performance Enhancement: The VertiPaq engine in Power BI
leverages it for speed [22] (see the sketch below).
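As a hedged illustration of the single-column benefit (table and column names are hypothetical): a VertiPaq-backed measure that aggregates one column only has to scan that column's compressed segments, not entire rows.

```
-- Only Sales[Amount] (plus any columns used for filtering) is scanned;
-- the other columns of the Sales table are never touched
Total Amount = SUM ( Sales[Amount] )
```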
Data Lake vs. Data Warehouse
Data Lakes hold massive raw data in native formats (schema-on-read),
designed for low-cost storage and data science exploration [23–27]. They
complement, not replace, data warehouses.
| Feature | Data Warehouse (DWH) | Data Lake |
| --- | --- | --- |
| Nature of Data | Structured, processed | Any raw/native format |
| Schema | On-write (predefined) | On-read / NoSQL |
| Data Preparation | Up front (ETL) | On demand (ELT) |
| Costs | Expensive at scale | Low-cost storage |
| Agility | Fixed configuration | Highly agile |
| Users | Business professionals | Data scientists |
One Platform vs. Multiple
Data Fabric (Single)
- Integration & Unification: Unified architecture
for data engineering, science, analytics, and BI [28–30].
- Centralized Management: Connects silos into one
infrastructure [29].
- Benefits: Eliminates silos, standardizes and
automates data practices, democratizes access [31].
Data Mesh (Decentralized)
- Domain-Oriented Ownership: Teams own their data
domains [32, 33].
- Organizational Shift: Requires cultural and team
changes [33, 34].
- Benefits: Scales analytics across distributed teams,
fosters agility in large organizations [32, 33].
BI&A Information Quality
Focusing on two of Eppler’s dimensions [35]:
1. Consistency
Problem: Fragmented systems yield multiple “truths”
(e.g., differing customer addresses) [36–39].
Solution: A DWH provides one source of truth via ETL
cleansing; master data management (MDM) creates a golden record [40–51].
2. Comprehensiveness
Problem: Siloed operational data limits holistic analysis
(e.g., marketing vs. sales) [45, 52–55].
Solution: BI&A integrates multiple sources into a
unified warehouse, enriching analysis with external data [35, 41, 45, 59,
60].
Encouraging BI System Adoption
- Perceived Effort & Benefit: Low effort, clear
performance gains drive use [62–64].
- Result Demonstrability: Visible improvements boost
adoption [62].
- Social Influence: Peer usage encourages uptake [62].
- Management Support: Active leadership fosters
voluntary use [62].
- Training: Ensures correct interpretation and builds
an information culture [38, 65–69].
- Relevance: Tailored data marts improve applicability
[35, 43, 70].
- Information Quality: Trusted, consistent data
encourages reliance [41, 43, 71, 72].
DAX Concepts
DAX is used in models (e.g., Power BI) for calculated measures, columns,
and tables. Context is key.
Calculated Columns vs. Measures
- Columns:
- Stored per row [74–75].
- Consume memory [76–77].
- Used for slicing/filtering [78].
- Measures:
- Calculated on the fly [75–77].
- Operate on aggregated data [74–75].
- Consume CPU at query time [76–77].
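A minimal sketch of the contrast above, assuming a hypothetical Sales table with Quantity and UnitPrice columns:

```
-- Calculated column on Sales: evaluated row by row at data refresh and stored in memory
Line Amount = Sales[Quantity] * Sales[UnitPrice]

-- Measure: evaluated at query time over whatever filter context the visual supplies
Total Sales = SUM ( Sales[Line Amount] )
```

The column costs RAM for every row; the measure costs CPU only when a visual asks for it.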
Filtering Contexts
- Filter Context: Filters applied by visuals or CALCULATE [80–81].
- Row Context: Per-row evaluation, used by iterators [78, 80].
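A small sketch, reusing the hypothetical Total Sales measure from above: the visual supplies the outer filter context, and CALCULATE adds to it (row context is shown in the iterator sketch further below).

```
-- The visual's slicers and axes define the outer filter context;
-- CALCULATE adds one more filter for this expression only
Red Sales = CALCULATE ( [Total Sales], Products[Color] = "Red" )
```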
The Total Is Not Always the Sum!
- Non-additive measures (balance, ratios) need custom rollups [82–84].
- Use last-value, average, or tailored aggregation [82, 85].
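Two hedged sketches of such rollups (measure and table names are hypothetical): a ratio that must be recomputed at every level rather than summed, and a semi-additive balance taken at the last date in the current context.

```
-- Ratio: recompute at every level instead of summing row-level ratios
Margin % = DIVIDE ( [Total Margin], [Total Sales] )

-- Semi-additive balance: take the value on the last date visible in the filter context
Closing Balance =
CALCULATE ( SUM ( Balances[Amount] ), LASTDATE ( 'Date'[Date] ) )
```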
Iterators (X-functions)
- SUMX, AVERAGEX, etc., iterate row by row [78].
- Two params: table and expression [78].
- Can be computationally expensive [74].
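A sketch of the two-parameter pattern, again using the hypothetical Sales table:

```
-- First parameter: the table to iterate; second: the expression evaluated
-- in a row context for each row, after which the results are summed
Total Cost = SUMX ( Sales, Sales[Quantity] * Sales[UnitCost] )
```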
Relational Functions
- RELATED: N:1 lookup [76].
- RELATEDTABLE: 1:N lookup [86].
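A sketch assuming a Products–Sales relationship (one product, many sales rows):

```
-- Calculated column on Sales: RELATED follows the many-to-one side of the relationship
Product Category = RELATED ( Products[Category] )

-- Calculated column on Products: RELATEDTABLE returns the related Sales rows
Sales Row Count = COUNTROWS ( RELATEDTABLE ( Sales ) )
```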
Filter Manipulation
- FILTER: Returns a filtered table [86].
- ALL: Ignores existing filters [86].
- CALCULATE:
- Changes filter context for an expression [81, 87].
- Removes existing filters on specified columns [81, 87].
- Transforms row context to filter context when inside iterators
[87].
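Two sketches combining these functions with the hypothetical measures above:

```
-- FILTER returns the Sales rows passing the condition; CALCULATE applies
-- that table as a filter to the expression
Big Ticket Sales =
CALCULATE ( [Total Sales], FILTER ( Sales, Sales[Line Amount] > 1000 ) )

-- ALL removes existing filters on Sales, so the denominator is the grand total
Sales % of Total =
DIVIDE ( [Total Sales], CALCULATE ( [Total Sales], ALL ( Sales ) ) )
```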
Identifying Non-Working Code
- Missing Aggregation: Measures need SUM, AVERAGE, etc.
[75, 88].
- Incorrect Context: Misused relationships or filters
[81].
- Division by Zero: Use DIVIDE() to avoid errors [76].
- Iterator Misuse: Row operations need X-functions [78].
- Relationship Issues: RELATED/RELATEDTABLE require
proper relationships.
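Two broken/fixed pairs as hedged illustrations of the missing-aggregation and division-by-zero points (the table, columns, and the [Total Returns] and [Total Orders] measures are hypothetical):

```
-- Broken measure: a bare column reference has no row context to resolve
Revenue = Sales[Quantity] * Sales[UnitPrice]
-- Fixed: an iterator supplies the row context, then aggregates the results
Revenue = SUMX ( Sales, Sales[Quantity] * Sales[UnitPrice] )

-- Broken: raw division misbehaves when the denominator is zero or blank
Return Rate = [Total Returns] / [Total Orders]
-- Fixed: DIVIDE returns BLANK (or an alternate result) instead
Return Rate = DIVIDE ( [Total Returns], [Total Orders] )
```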