Mastering Data Analysis & Business Intelligence Terms with Microsoft Power Tools
Lecture by excelisfun • 103,958 views • Jun 24, 2018
A comprehensive guide to key concepts like granularity, data models, keys, and the star schema for effective data analysis using Excel Power Query and Power Pivot.
Blurb
This video from ExcelIsFun's Microsoft Power Tools for Data Analysis class dives into essential data analysis and business intelligence terminology needed to build effective data models and reports. Key points include:
- Defining data analysis and business intelligence in a business context
- Understanding raw data vs. proper data sets with smart, consistent naming
- Importance of Excel Tables for dynamic data updates
- Granularity (grain) and its impact on data detail and aggregation
- Primary and foreign keys and their role in relationships
- Differentiating fact tables and dimension tables
- Overview of data models: flat tables, star schema, and snowflake schema
- How relationships enable efficient filtering and calculations
- Introduction to columnar databases and their performance benefits
- Cleaning, transforming, and importing data using Power Query
- Concepts of ETL, data warehousing, and SQL relevance
This foundational lecture sets the stage for practical use of Power Query and Power Pivot in upcoming videos.
Highlighted Clips
Defining Data Analysis and Business Intelligence
Explains the difference and overlap between data analysis and business intelligence, emphasizing actionable, refreshable information for decision makers.
Understanding Raw Data and Proper Data Sets
Shows examples of raw data versus properly structured data sets with field names and consistent data types.
Excel Tables and Naming Conventions
Demonstrates converting data ranges to Excel tables and the importance of smart, consistent naming for tables and fields.
Granularity (Grain) in Data Analysis
Defines granularity as the level of detail in data and its effect on aggregation and filtering in reports.
Introduction and Overview of Data Analysis and Business Intelligence
The video opens with a warm welcome to the second installment of the Microsoft Power Tools for Data Analysis series. The instructor, Mike Girvin, emphasizes the importance of understanding foundational terms before diving into tools like Power Query and Power Pivot. He highlights that this video is primarily informational, setting the stage for practical application in later lessons.
"Now, before we can talk about the tools like Power Query and Power Pivot, we've got to get busy defining data analysis and business intelligence terms."
Key points:
- The video focuses on defining essential terms related to data analysis and business intelligence.
- Downloadable Excel files and PDF notes are available for deeper study.
- The class aims to convert raw data into actionable information for decision makers.
- Business intelligence is framed as data analysis within a business context, emphasizing refreshable and actionable insights.
Defining Data Analysis, Business Intelligence, and Query
Mike defines data analysis as converting raw data into useful information for decision makers. He then refines this for business contexts as business intelligence, which adds the qualities of being actionable and refreshable. The term query is introduced as simply asking a question of the data, a concept familiar in databases but also applicable in Excel when filtering or using pivot tables.
"Query just means to ask a question... when we change the criteria or filter in a slicer, or drag and drop a field into a pivot table, we are changing the question or the query."
Key points:
- Data analysis transforms raw data into useful information.
- Business intelligence focuses on actionable, refreshable data for business decisions.
- Querying is the act of asking questions of data, whether in SQL or Excel.
- Pivot tables and slicers in Excel are practical examples of queries.
Raw Data and Proper Data Sets
The video contrasts unusable raw data with proper data sets. Raw data is described as data stored in its smallest form, often messy and difficult to analyze directly. Proper data sets have field names in the first row and records in subsequent rows, with consistent naming and defined data types.
"Raw data is always the data stored in its smallest form... Proper data set, we all probably know this. But let's make sure we're on the same page. Field names in the first row, records in subsequent rows."
Key points:
- Raw data is granular but often unstructured or messy.
- Proper data sets have clear headers and consistent formatting.
- Field names (also called columns or headers) define the type of data.
- Data types (date, whole number, decimal, currency) must be specified for accurate analysis.
- Naming conventions should be smart and consistent to avoid confusion.
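The idea of a "proper data set" can be sketched with a small, hypothetical example: field names as column labels, one record per row, and an explicit data type per field (the table contents below are invented for illustration).

```python
import pandas as pd

# Hypothetical sales records: field names label the columns,
# each row is one record, and every field has an explicit data type.
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2018-06-01", "2018-06-02"]),
    "Product": ["Quad", "Carlota"],
    "Units": [3, 5],                 # whole number
    "Revenue": [43.47, 62.55],       # decimal / currency
})

# Consistent types make aggregation reliable: summing a text column
# would concatenate strings rather than add numbers.
print(sales.dtypes)
print(sales["Revenue"].sum())
```

The same discipline applies in Excel: declaring each column's data type (date, whole number, decimal) is what lets later aggregation and filtering behave predictably.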
Excel Tables and Naming Conventions
Mike stresses the importance of converting data ranges into Excel tables for use with Power Tools. Excel tables automatically expand and contract as data changes, ensuring that formulas, pivot tables, and Power Query connections update dynamically. He also demonstrates how to name tables smartly using keyboard shortcuts.
"If your data is coming from an Excel spreadsheet and you're actually going to use it in one of the tools in this workbook, then you have to convert it to an Excel table."
Key points:
- Excel tables are essential for dynamic data connections.
- Empty cells around the data set are necessary before creating a table.
- Naming tables with meaningful, consistent names (e.g., prefixing dimension tables with 'd' and fact tables with 'f') is critical.
- Keyboard shortcuts (Control-T to create a table, Alt-J-T-A to name it) speed up workflow.
Grain or Granularity in Data
Granularity is explained as the level of detail in a data set. Mike uses sales data examples to show how more granular data (product-level sales) contains smaller, detailed numbers, while aggregated data (invoice-level totals) has larger, summarized numbers. He also connects granularity to time dimensions like day, month, and year.
"The grain or granularity of the sales table is at the product or invoice line level... The grain or granularity of this table is at the invoice level."
Key points:
- Granularity refers to the size or level of detail in data.
- More granular data allows for detailed filtering and analysis.
- Aggregated data has less granularity and fewer filtering options.
- Time-based granularity (day, month, year) affects how data can be grouped and reported.
- Understanding granularity is crucial for designing data models and writing formulas.
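The effect of grain can be shown in a short sketch with invented data: rolling product-line rows up to invoice level makes the numbers bigger and removes the product detail entirely.

```python
import pandas as pd

# Hypothetical sales at product-line grain: one row per product per invoice.
lines = pd.DataFrame({
    "Invoice": [101, 101, 102, 102, 102],
    "Product": ["Quad", "Carlota", "Quad", "Aspen", "Carlota"],
    "Revenue": [20.0, 15.0, 40.0, 10.0, 25.0],
})

# Roll up to invoice grain: fewer rows, larger summarized numbers.
invoices = lines.groupby("Invoice", as_index=False)["Revenue"].sum()

# At the coarser grain you can no longer filter by Product --
# that column simply no longer exists.
print(invoices.columns.tolist())
```

This is exactly why the grain of a fact table constrains which report filters are possible: detail that is aggregated away cannot be recovered downstream.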
Primary Keys and Foreign Keys
The video introduces primary keys as unique identifiers in dimension tables and foreign keys as potentially duplicated identifiers in fact tables. These keys form the basis of relationships in data models, enabling accurate lookups and filtering.
"Primary key is simply a unique identifier that prevents duplicate rows... Foreign key, there can be many duplicates."
Key points:
- Primary keys uniquely identify each record in a dimension table.
- Foreign keys appear multiple times in fact tables, linking back to dimension tables.
- Relationships between primary and foreign keys enable one-to-many connections.
- These relationships replace traditional Excel lookups like VLOOKUP.
- Keys are essential for merging tables in Power Query and building data models.
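A minimal pandas sketch (with hypothetical tables) makes the key rules concrete: the primary key is unique in the dimension table, the foreign key repeats in the fact table, and a validated join replaces a column of VLOOKUP formulas.

```python
import pandas as pd

# Hypothetical dimension table: the primary key "ProductID" is unique.
dProducts = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "Product": ["Quad", "Carlota", "Aspen"],
    "Price": [14.49, 12.51, 9.99],
})

# Hypothetical fact table: "ProductID" is a foreign key and may repeat.
fSales = pd.DataFrame({
    "ProductID": [1, 1, 2, 3, 2],
    "Units": [3, 2, 5, 1, 4],
})

# Primary key must have no duplicates; the foreign key usually does.
assert dProducts["ProductID"].is_unique
assert not fSales["ProductID"].is_unique

# validate="many_to_one" makes pandas enforce the relationship --
# one join instead of a VLOOKUP formula copied down every row.
joined = fSales.merge(dProducts, on="ProductID", validate="many_to_one")
```

If the dimension table accidentally contained a duplicate key, `validate="many_to_one"` would raise an error rather than silently producing extra rows, which mirrors why data models insist on unique primary keys.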
Fact Tables and Dimension Tables
Mike defines fact tables as containing measurable business data (e.g., sales, units) and dimension tables as containing descriptive attributes (e.g., product details, sales rep info). He explains how fact tables hold foreign keys, while dimension tables have primary keys and attributes used for filtering and lookup.
"Fact table... has the numbers we want to make calculations on... Dimension tables... have first column as our primary key or unique identifier. The remaining columns are attributes."
Key points:
- Fact tables store quantitative data for analysis.
- Dimension tables store descriptive data for filtering and categorization.
- Measures (DAX formulas) summarize fact table data.
- Attributes in dimension tables serve as slicers or filters in reports.
- Helper columns in dimension tables (like MonthNumber) assist with sorting and hierarchies.
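The fact/dimension split can be sketched end to end with invented data: a "measure" is just an aggregation over the fact table, grouped or filtered by attributes pulled in from a dimension table.

```python
import pandas as pd

# Hypothetical star-schema fragment.
fSales = pd.DataFrame({           # fact table: quantitative data
    "ProductID": [1, 1, 2, 3],
    "Units": [3, 2, 5, 1],
    "Revenue": [43.47, 28.98, 62.55, 9.99],
})
dProducts = pd.DataFrame({        # dimension table: descriptive attributes
    "ProductID": [1, 2, 3],
    "Product": ["Quad", "Carlota", "Aspen"],
    "Category": ["Board", "Board", "Ski"],
})

# A measure (total revenue) summarized by a dimension attribute.
report = (fSales.merge(dProducts, on="ProductID")
                .groupby("Category", as_index=False)["Revenue"].sum())
```

In Power Pivot the same shape is expressed as a DAX measure sliced by an attribute in a pivot table or slicer; the pandas version only illustrates the mechanics.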
Importance of Grain in Fact Tables
Returning to granularity, Mike emphasizes that the grain of the fact table determines which filters and criteria can be used in reports. If the fact table is too aggregated, certain dimension filters become unusable.
"What's so important about the grain of the fact table is the grain... will define what criteria conditions filter and dimensions we can use in our reporting."
Key points:
- The fact table's granularity must match the desired level of analysis.
- Too coarse a grain limits filtering options.
- Proper grain ensures flexibility in report criteria.
- Dimension tables can have multiple granularities (e.g., day, month, year).
Data Model Requirements and Concepts
Mike outlines five key requirements for a good data model: containing necessary data, ease of use, fast queries, easy updates with new data, and easy structural updates. He explains that the star schema is the preferred model for this class due to its simplicity and efficiency.
"The structure of the data model comes from the type of measures and reports that the business decision maker requires."
Key points:
- Data models must meet business needs and be user-friendly.
- Star schema data models are easy to use and fast.
- Naming conventions and hiding unnecessary columns improve usability.
- Power Query and DAX allow easy updates to data and structure.
- Data models include fact tables, dimension tables, relationships, calculated columns, and measures.
Flat Table vs. Star Schema Data Models
The video contrasts the old flat table approach, where all data is combined into one table with many formulas, with the star schema approach, which separates fact and dimension tables connected by relationships. The flat table method is inefficient for large data sets.
"The problem with this is if you have lots of data, that means lots of formulas, and calculation times slow down."
Key points:
- Flat tables require complex formulas and slow down with large data.
- Star schema separates data into fact and dimension tables.
- Star schema improves performance and simplifies reporting.
- Relationships in star schema enable dynamic filtering and calculations.
Relationships in Data Models
Mike explains the power of relationships between tables, especially the one-to-many relationship from dimension to fact tables. He highlights how filters applied on dimension tables flow to fact tables, enabling efficient calculations on large data sets.
"That filter will flow across the one-to-many relationship, and it will actually filter the fact table down to just the records with the Quad product."
Key points:
- Relationships enable multi-table data models.
- One-to-many relationships are most common.
- Filters on dimension tables propagate to fact tables.
- This filtering optimizes calculations by reducing data scope.
- Bi-directional filters exist but are less common and more complex.
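Filter propagation down a one-to-many relationship can be sketched as a semi-join (hypothetical data again): filter the dimension table, and only the matching fact rows survive for the calculation.

```python
import pandas as pd

dProducts = pd.DataFrame({        # one side: dimension table
    "ProductID": [1, 2, 3],
    "Product": ["Quad", "Carlota", "Aspen"],
})
fSales = pd.DataFrame({           # many side: fact table
    "ProductID": [1, 2, 1, 3, 1],
    "Revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Filter the dimension table to one product...
picked = dProducts.loc[dProducts["Product"] == "Quad", "ProductID"]

# ...and let that filter flow across the relationship: only matching
# fact rows remain, so the measure scans far fewer records.
filtered = fSales[fSales["ProductID"].isin(picked)]
total = filtered["Revenue"].sum()
```

This is the efficiency argument for relationships: the engine reduces the fact table first, then computes, instead of computing over everything and discarding most of it.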
Star Schema Data Model
The star schema is described as a central fact table surrounded by dimension tables, resembling a star. This model is easy to use, allows fast queries, and supports accurate calculations regardless of which dimension is filtered.
"The star schema data model will be easier to use than most models. It will allow faster queries than most other models."
Key points:
- Star schema has one fact table and multiple dimension tables.
- It supports flexible filtering and slicing.
- Measures calculate correctly regardless of filter choice.
- Naming conventions and consistent granularity are important.
- This model is the standard for business intelligence solutions.
Snowflake Data Model
The snowflake model is a more complex variation with normalized dimension tables branching into sub-tables. While functional, it is more complicated and less efficient than the star schema, especially for large data sets and creating hierarchies.
"It's overly complicated, especially if you have lots of these sub tables... you can't do hierarchies in a snowflake data model."
Key points:
- Snowflake model normalizes dimension tables into multiple related tables.
- It can complicate relationships and reporting.
- Hierarchies are difficult or impossible to create.
- Performance can degrade due to relationship overhead.
- Suitable only for small data sets or when data is already structured this way.
Relational Databases and Columnar Databases
Mike briefly discusses relational databases as excellent for storing raw data with no redundancy but complicated to query directly. He then introduces the columnar database engine used by Power Pivot and Power BI (known as xVelocity or VertiPaq), which stores data by columns and compresses it efficiently for fast calculations.
"Columnar database... breaks apart every column... and stores only a unique list... that's why it can take big data and condense it down to a much smaller file size."
Key points:
- Relational databases focus on data integrity and minimal redundancy.
- Querying relational databases directly can be complex.
- Columnar databases store data by columns, compressing duplicates.
- This compression reduces file size and speeds up calculations.
- Columnar databases enable handling hundreds of millions of rows efficiently.
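A toy sketch of dictionary encoding shows the core trick behind this compression: each column is reduced to its unique values plus small integer codes (this is a simplified illustration, not the actual VertiPaq algorithm).

```python
# A column with heavy repetition, as fact-table columns typically have.
column = ["Quad", "Quad", "Carlota", "Quad", "Carlota", "Aspen"] * 1000

# Dictionary encoding: store each distinct value once...
unique_values = sorted(set(column))
index = {v: i for i, v in enumerate(unique_values)}

# ...and replace every row with a tiny integer code.
codes = [index[v] for v in column]

# 6,000 repeated strings become 3 strings plus 6,000 small integers;
# decoding is a simple lookup, and scans over integers are fast.
decoded = [unique_values[c] for c in codes]
assert decoded == column
```

The fewer the distinct values in a column, the better this works, which is why columnar engines can condense hundreds of millions of rows into a comparatively small file.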
Cleaning, Transforming, and Importing Data
The video distinguishes between cleaning raw data (fixing unusable data) and transforming data sets (reshaping data for analysis). Both processes are primarily done in Power Query. Mike also introduces ETL (Extract, Transform, Load), data warehousing, and SQL as related concepts.
"We'll clean the raw data, which means we'll fix unusable raw data... Transforming data sets, that's when we fix unusable data sets... We'll do these two things mostly in Power Query."
Key points:
- Cleaning fixes raw data issues like merged fields or inconsistent formatting.
- Transforming reshapes data sets by adding/removing columns, merging, appending, or unpivoting.
- Power Query is the primary tool for cleaning, transforming, and importing data.
- ETL is a data warehousing term describing extracting, transforming, and loading data.
- SQL is the language for querying databases, but this class focuses on easier tools.
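One of the transformations mentioned, unpivoting, can be sketched in pandas with a hypothetical cross-tab: month names are really data, not field names, so the table must be reshaped into a proper data set.

```python
import pandas as pd

# Hypothetical cross-tab that needs unpivoting: "Jan" and "Feb"
# are values of a Month field, not separate fields.
wide = pd.DataFrame({
    "Product": ["Quad", "Carlota"],
    "Jan": [10, 20],
    "Feb": [30, 40],
})

# melt reshapes it so each row is one record -- the same operation
# Power Query exposes as "Unpivot Columns".
tall = wide.melt(id_vars="Product", var_name="Month", value_name="Units")
```

After unpivoting, the data set has proper grain (one row per product per month) and can feed a pivot table or data model directly.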
Summary and Closing Remarks
Mike wraps up by summarizing the journey from bad data to useful information through cleaning, transforming, and modeling. He encourages viewers to like, comment, and subscribe for upcoming videos, including the next one focused on Power Query.
"In this DA and BI class, we would be able to connect to lots of source data... clean it, transform it, and load it to become a star schema data model, and then into useful information."
Key points:
- The class will cover connecting to diverse data sources.
- Emphasis on cleaning, transforming, and modeling data for business intelligence.
- Upcoming videos will introduce practical tool usage.
- Encouragement to engage with the channel for continued learning.
This detailed breakdown captures the essence and flow of the video, preserving Mike Girvin’s clear, methodical teaching style and the practical focus on preparing data for Microsoft Power Tools in business intelligence.
Key Questions
What is the difference between data analysis and business intelligence?
Data analysis is converting raw data into useful information for decision makers, while business intelligence focuses on creating useful, actionable, and refreshable information specifically for business decision makers.