1. What is a Data Warehouse?
Ans. A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process (Bill Inmon’s classic definition).
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
Ralph Kimball provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
2. What are Data Marts?
Ans. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.
3. What is the difference between OLTP and OLAP?
Ans. OLTP – OnLine Transaction Processing
OLAP – OnLine Analytical Processing
An OLTP system supports the operational transactions of the business.
An OLAP system collects, manages, processes and presents multidimensional data for analysis purposes.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized, whereas OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.
For example: a cashier at a bank counter keys in all the data relevant to a transaction, which gets stored in the OLTP database, whereas if the regional manager of the bank wants to look at a branch-wise performance report, his query will be addressed to the OLAP database.
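The banking example above can be sketched in miniature. This is an illustrative toy (in practice the OLTP and OLAP workloads run against separate, differently structured databases); the table and column names are hypothetical, and SQLite stands in for both systems:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP side: one row per transaction, written as the cashier keys data in
# (INSERT-heavy workload).
cur.execute("CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, branch TEXT, amount REAL)")
cur.executemany("INSERT INTO txn (branch, amount) VALUES (?, ?)",
                [("North", 100.0), ("North", 250.0), ("South", 80.0)])

# OLAP side: the regional manager's branch-wise report is an aggregate SELECT.
cur.execute("SELECT branch, SUM(amount) FROM txn GROUP BY branch ORDER BY branch")
report = cur.fetchall()
print(report)  # [('North', 350.0), ('South', 80.0)]
```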
5. What is a dimension?
Ans. Dimensions are descriptive details about various objects (facts) allowing for their detailed analysis. In other words, Dimensions provide context to the data from Facts.
For example: a fact table containing the numbers 100, 150 and 200 is meaningless until it is joined with a dimension table telling us that these values are the average annual rainfall at three different places on the globe. Put differently, dimensions provide structured labeling information for otherwise unordered numeric measures.
A dimension table typically has two types of columns: a primary key that fact tables reference as a foreign key, and textual/descriptive attributes.
The primary functions of dimensions are threefold: to provide filtering, grouping and labeling.
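The rainfall example can be made concrete. In this sketch the place names, regions and schema are all made up for illustration; the point is that the WHERE, GROUP BY and output labels all come from the dimension, not the fact:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: a primary key plus textual/descriptive attributes.
cur.execute("CREATE TABLE dim_place (place_key INTEGER PRIMARY KEY, "
            "place_name TEXT, region TEXT)")
# Fact table: bare numeric measures, meaningless without the dimension.
cur.execute("CREATE TABLE fact_rainfall (place_key INTEGER, rainfall_cm REAL)")

cur.executemany("INSERT INTO dim_place VALUES (?, ?, ?)",
                [(1, "Place A", "North"), (2, "Place B", "South"),
                 (3, "Place C", "South")])
cur.executemany("INSERT INTO fact_rainfall VALUES (?, ?)",
                [(1, 100.0), (2, 150.0), (3, 200.0)])

# The dimension supplies filtering (WHERE), grouping (GROUP BY) and labeling:
cur.execute("""
    SELECT d.region, AVG(f.rainfall_cm)
    FROM fact_rainfall f
    JOIN dim_place d ON f.place_key = d.place_key
    WHERE d.region = 'South'
    GROUP BY d.region
""")
result = cur.fetchall()
print(result)  # [('South', 175.0)]
```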
6. What are the types of Dimensions?
Ans. There are basically four types of dimensions:
i.) Conformed Dimension – A conformed dimension is a dimension that has exactly the same meaning and content when being referred from different fact tables.
Date is a common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have the same meaning when joined to any fact table.
Usually the only rule to consider with the time dimension is whether there is a fiscal year in addition to the calendar year and the definition of a week.
Building a conformed dimension is also part of the process in master data management, or MDM.
ii.) Junk Dimension – A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables.
It is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.
An example is a flag indicating whether a package has arrived: instead of storing it in the fact table as “yes” or “no”, it is stored in the junk dimension as a descriptive value such as “arrived” or “pending”.
iii.) Degenerate Dimensions – A degenerate dimension is a key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table’s primary key.
iv.) Role playing dimensions – Dimensions are often recycled for multiple applications within the same database. For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”. This is often referred to as a “role-playing dimension”.
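A role-playing dimension is typically realized by joining the same physical dimension table under different aliases. A minimal sketch, with hypothetical table names and SQLite standing in for the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One physical date dimension.
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT)")
# The fact table references it twice, once per role.
cur.execute("CREATE TABLE fact_order (order_id INTEGER, "
            "sale_date_key INTEGER, delivery_date_key INTEGER)")

cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(20240101, "2024-01-01"), (20240105, "2024-01-05")])
cur.execute("INSERT INTO fact_order VALUES (1, 20240101, 20240105)")

# The single dim_date table plays two roles via table aliases:
cur.execute("""
    SELECT s.full_date AS date_of_sale, d.full_date AS date_of_delivery
    FROM fact_order f
    JOIN dim_date s ON f.sale_date_key = s.date_key
    JOIN dim_date d ON f.delivery_date_key = d.date_key
""")
row = cur.fetchone()
print(row)  # ('2024-01-01', '2024-01-05')
```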
7. What is MDM (Master Data Management)?
Ans. Master Data Management (MDM) refers to the process of creating and managing data that an organization must have as a single master copy, called the master data.
MDM is important because it offers the enterprise a single version of the truth. Without clearly defined master data, the enterprise runs the risk of having multiple copies of data that are inconsistent with one another.
A particularly big challenge to maintaining master data occurs when there is a merger or acquisition: each organization will have its own master data, and merging the two sets will be challenging.
The bigger the organization, the more disparate systems exist within the company, and the greater the difficulty of providing a single source of truth.
9. What is a STAR-schema?
Ans. A STAR schema is used in data warehouse models where one centralized fact table references a number of dimension tables: the primary keys of the dimension tables flow into the fact table as foreign keys, alongside the measures.
This entity-relationship diagram looks like a star, hence the name.
10. What is a snowflake schema?
Ans. A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake in shape.
The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.
“Snowflaking” is a method of normalizing the dimension tables in a STAR schema.
The principle behind snowflaking is normalization of the dimension tables by removing low-cardinality attributes and forming separate tables.
The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema’s dimensions are denormalized, with each dimension represented by a single table.
When to use Star Schema and when Snowflake Schema?
Star schemas should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature. Snowflake schemas are often better with more sophisticated query tools that create a layer of abstraction between the users and the raw table structures, and in environments with numerous queries using complex criteria.
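Snowflaking can be sketched by normalizing a low-cardinality attribute (here, a hypothetical product category) out of the dimension into its own table. Note the extra join that queries now need to reach it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- The low-cardinality 'category' attribute is normalized out of the
-- product dimension into its own table: this is snowflaking.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT,
                           category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE fact_sales   (product_key  INTEGER REFERENCES dim_product(product_key),
                           revenue REAL);
INSERT INTO dim_category VALUES (1, 'Hardware');
INSERT INTO dim_product  VALUES (10, 'Widget', 1), (11, 'Gadget', 1);
INSERT INTO fact_sales   VALUES (10, 500.0), (11, 300.0);
""")

# Queries now need an extra join to reach the normalized attribute:
cur.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON f.product_key  = p.product_key
    JOIN dim_category c ON p.category_key = c.category_key
    GROUP BY c.category_name
""")
by_category = cur.fetchall()
print(by_category)  # [('Hardware', 800.0)]
```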
11. What is ETL?
Ans. ETL stands for Extract, Transform and Load.
ETL refers to a process in database usage and especially in data warehousing that:
· Extracts data from outside sources
· Transforms it to fit operational needs, which can include quality levels
· Loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)
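The three steps above can be sketched end to end. This is a toy pipeline under stated assumptions: the source rows, field names and target table are all invented for illustration, and SQLite stands in for the end target:

```python
import sqlite3

# --- Extract: pull raw rows from an outside source (a hypothetical feed) ---
raw_rows = [
    {"product": " widget ", "qty": "3", "price": "2.50"},
    {"product": "GADGET",   "qty": "1", "price": "10.00"},
    {"product": "",         "qty": "x", "price": ""},      # bad record
]

# --- Transform: fit the target's needs, enforcing basic quality levels ---
def transform(row):
    name = row["product"].strip().lower()
    if not name:
        return None                      # reject rows failing quality checks
    try:
        qty, price = int(row["qty"]), float(row["price"])
    except ValueError:
        return None
    return (name, qty, qty * price)      # derive a revenue measure

clean = [t for r in raw_rows if (t := transform(r)) is not None]

# --- Load: write into the end target (here, a data-mart table) ---
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (product TEXT, qty INTEGER, revenue REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
cur.execute("SELECT product, revenue FROM sales ORDER BY product")
loaded = cur.fetchall()
print(loaded)  # [('gadget', 10.0), ('widget', 7.5)]
```

The bad third record is dropped during the transform step rather than loaded, which is where the "quality levels" mentioned above are enforced in practice.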