Data Warehousing - Overview

The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on any data such as a product, a supplier, or any consumer data; the executive will have no data available to analyze, because the previous data has been updated by transactions. A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining. Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

Understanding a Data Warehouse
Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases due to the following reasons −
Data Warehouse Features

The key features of a data warehouse are discussed below −
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields −
Types of Data Warehouse

Information processing, analytical processing, and data mining are the three types of data warehouse applications that are discussed below −
Data Warehousing - Concepts

What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information

There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it, and make decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains −
Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches −
Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. In this approach, wrappers and integrators are built on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query-Driven Approach
Disadvantages
Update-Driven Approach

This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages −
Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −
Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.

Data Warehousing - Terminologies

In this chapter, we will discuss some of the most commonly used terms in data warehousing.

Metadata

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows −
Metadata Repository

A metadata repository is an integral part of a data warehouse system. It contains the following metadata −
Data Cube

A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves its records.

Illustration of a Data Cube

Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location. These dimensions allow it to keep track of monthly sales and the branch at which the items were sold. There is a table associated with each dimension, known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand. The following table represents the 2-D view of sales data for a company with respect to the time, item, and location dimensions. But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of items sold. If we want to view the sales data with one more dimension, say, the location dimension, then the 3-D view would be useful. The 3-D view of the sales data with respect to time, item, and location is shown in the table below − The above 3-D table can be represented as a 3-D data cube as shown in the following figure −

Data Mart

Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to Remember About Data Marts
The following figure shows a graphical representation of data marts.

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.

Data Warehousing - Delivery Process

A data warehouse is never static; it evolves as the business expands. As the business evolves, its requirements keep changing, and therefore a data warehouse must be designed to keep pace with these changes. Hence a data warehouse system needs to be flexible. Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse projects normally suffer from various issues that make it difficult to complete tasks and deliverables in the strict and ordered fashion demanded by the waterfall method. Most of the time, the requirements are not understood completely. The architectures, designs, and build components can be completed only after gathering and studying all the requirements.

Delivery Method

The delivery method is a variant of the joint application development approach, adopted for the delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss here does not reduce the overall delivery time-scales, but ensures the business benefits are delivered incrementally through the development process. Note − The delivery process is broken into phases to reduce the project and delivery risk. The following diagram explains the stages in the delivery process −

IT Strategy

Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.

Business Case

The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated.
If a data warehouse does not have a clear business case, then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.

Education and Prototyping

Organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse before settling for a solution. This is addressed by prototyping. It helps in understanding the feasibility and benefits of a data warehouse. Prototyping activity on a small scale can promote the educational process as long as −
The following points are to be kept in mind to produce an early release and deliver business benefits.
Business Requirements

To provide quality deliverables, we should make sure the overall requirements are understood. If we understand the business requirements for both the short term and the medium term, then we can design a solution to fulfil the short-term requirements. The short-term solution can then be grown into a full solution. The following aspects are determined in this stage −
Technical Blueprint

This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following.
Building the Version

In this stage, the first production deliverable is produced. This production deliverable is the smallest component of a data warehouse that adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase, we do not add new entities, but additional physical tables would probably be created to store the increased data volumes. Let us take an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information will allow the user to analyze only the recent trends and address the short-term issues. The user in this case cannot identify annual and seasonal trends. To help him do so, the last 2 years' sales history could be loaded from the archive. Now the 40 GB of data is extended to 400 GB. Note − The backup and recovery procedures may become complex; therefore it is recommended to perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool that is used to operate the data warehouse. These tools can generate the database query. Note − It is recommended not to use these access tools when the database is being substantially modified.

Automation

In this phase, operational management processes are fully automated. These would include −
Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways −
Note − This phase should be performed separately, since it involves substantial effort and complexity.

Requirements Evolution

From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system. This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries. The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, where new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

Data Warehousing - System Processes

We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques for them, such as using normalized data and keeping tables small. These techniques are suitable for delivering a solution. But in the case of decision-support systems, we do not know what queries and operations will need to be executed in the future. Therefore, techniques applied on operational databases are not suitable for data warehouses. In this chapter, we will discuss how to build data warehousing solutions on top of open-system technologies like Unix and relational databases.

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse −
Extract and Load Process

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse. Note − Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

Controlling the Process

Controlling the process involves determining when to start data extraction and performing consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user. For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are no associated subscriptions.

Loading the Data

After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent. Note − Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process

Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved in cleaning and transforming −
Clean and Transform the Loaded Data into a Structure

Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent −
Transforming involves converting the source data into a structure. Structuring the data increases query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.

Partition the Data

Partitioning will optimize the hardware performance and simplify the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.

Backup and Archive the Data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required. For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data being kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.

Query Management Process

This process performs the following functions −
The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.

Data Warehousing - Architecture

In this chapter, we will discuss the business analysis framework for data warehouse design and the architecture of a data warehouse.

Business Analysis Framework

The business analyst gets the information from the data warehouse to measure the performance and make critical adjustments in order to win over other businesses in the market. Having a data warehouse offers the following advantages −
To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows −
Three-Tier Data Warehouse Architecture

Generally a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.
The following diagram depicts the three-tier architecture of a data warehouse −

Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data warehouse models −
Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.

Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects. Points to remember about data marts −
Enterprise Warehouse
Load Manager

This component performs the operations required to extract and load data. The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions −
Extract Data from Source

The data is extracted from the operational databases or from external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load
Simple Transformations

While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
Warehouse Manager

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of warehouse managers vary between specific solutions.

Warehouse Manager Architecture

A warehouse manager includes the following −
Operations Performed by Warehouse Manager
Note − A warehouse manager also analyzes query profiles to determine whether indexes and aggregations are appropriate.

Query Manager
Query Manager Architecture

The following diagram shows the architecture of a query manager. It includes the following:
Detailed Information

Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data. The following diagram shows a pictorial impression of where detailed information is stored and how it is used. Note − If detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.

Summary Information

Summary information is a part of the data warehouse that stores predefined aggregations. These aggregations are generated by the warehouse manager. Summary information must be treated as transient. It changes on-the-go in order to respond to changing query profiles. The points to note about summary information are as follows −
Data Warehousing - OLAP

Online Analytical Processing (OLAP) servers are based on the multidimensional data model. They allow managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP and statistical databases and OLTP.

Types of OLAP Servers

We have four types of OLAP servers −
Relational OLAP

ROLAP servers are placed between the relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS. ROLAP includes the following −
Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large data volumes of detailed information. The aggregations are stored separately in a MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations

Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data. Here is the list of OLAP operations −
Roll-up

Roll-up performs aggregation on a data cube in any of the following ways −
The following diagram illustrates how roll-up works.
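Roll-up by climbing a concept hierarchy can also be sketched in plain Python. This is a minimal illustration, not a real OLAP engine; the sales figures and the city-to-country mapping below are invented for the example:

```python
from collections import defaultdict

# Hypothetical sales measured at the city level of the location hierarchy
sales_by_city = {"New Delhi": 500, "Mumbai": 300, "Toronto": 200, "Vancouver": 150}

# Concept hierarchy for location: city -> country (assumed mapping)
city_to_country = {"New Delhi": "India", "Mumbai": "India",
                   "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up(measures, hierarchy):
    """Aggregate measures from a lower level (city) up to a higher level (country)."""
    totals = defaultdict(int)
    for city, amount in measures.items():
        totals[hierarchy[city]] += amount
    return dict(totals)

print(roll_up(sales_by_city, city_to_country))  # {'India': 800, 'Canada': 350}
```

The aggregation function here is a sum, which matches the most common roll-up; any other aggregate (count, average) could be substituted.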
Drill-down

Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
The following diagram illustrates how drill-down works −
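Stepping down a concept hierarchy can be sketched as the inverse of the roll-up above. Again a toy illustration; the monthly figures and the month-to-quarter hierarchy are invented:

```python
# Hypothetical sales stored at the month level; drill-down moves from quarter to month
sales_by_month = {"Jan": 100, "Feb": 120, "Mar": 90, "Apr": 110}

# Concept hierarchy for time: month -> quarter (assumed mapping)
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def drill_down(measures, hierarchy, quarter):
    """Expose the finer-grained monthly figures behind one quarterly total."""
    return {month: value for month, value in measures.items()
            if hierarchy[month] == quarter}

print(drill_down(sales_by_month, month_to_quarter, "Q1"))
# {'Jan': 100, 'Feb': 120, 'Mar': 90}
```

Note that drill-down requires the detailed data to be available: you cannot descend below the grain at which the cube stores its facts.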
Slice

The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works.
Dice

The dice operation selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation. The dice operation on the cube based on the following selection criteria involves three dimensions.
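Slice and dice can both be sketched over a tiny cube held as a Python dict keyed by (time, item, location) tuples. This is a minimal illustration with invented figures, not how an OLAP server stores its cells:

```python
# A toy cube keyed by (time, item, location); the sales figures are invented
cube = {
    ("Q1", "Mobile", "Delhi"): 605,
    ("Q1", "Modem", "Delhi"): 825,
    ("Q2", "Mobile", "Delhi"): 680,
    ("Q1", "Mobile", "Kolkata"): 310,
}

def slice_cube(cube, dim_index, value):
    """Slice: fix ONE dimension at a single value, yielding a sub-cube."""
    return {cell: v for cell, v in cube.items() if cell[dim_index] == value}

def dice_cube(cube, criteria):
    """Dice: restrict TWO OR MORE dimensions; criteria maps dim index -> allowed values."""
    return {cell: v for cell, v in cube.items()
            if all(cell[i] in allowed for i, allowed in criteria.items())}

print(slice_cube(cube, 0, "Q1"))                    # all Q1 cells
print(dice_cube(cube, {0: {"Q1"}, 2: {"Delhi"}}))   # Q1 cells located in Delhi
```

The distinction is visible in the signatures: slice takes one (dimension, value) pair, while dice takes a set of criteria over several dimensions.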
Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram that shows the pivot operation.

OLAP vs OLTP
Data Warehousing - Relational OLAP

Relational OLAP servers are placed between the relational back-end server and client front-end tools. To store and manage the warehouse data, relational OLAP uses a relational or extended-relational DBMS. ROLAP includes the following −
Points to Remember
Relational OLAP Architecture

ROLAP includes the following components −
Advantages
Disadvantages
Data Warehousing - Multidimensional OLAP

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the dataset is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse datasets.

Points to Remember −
MOLAP Architecture

MOLAP includes the following components −
Advantages
Disadvantages
MOLAP vs ROLAP
Data Warehousing - Schemas

A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse.

Star Schema
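A star schema can be sketched as physical tables using SQLite from Python. This is a minimal illustration only; the table layouts and rows below are invented, loosely following the sales example used for the DMQL definitions in this chapter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One dimension table per dimension, each linked to the central fact table by a key
cur.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT)")
cur.execute("CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT)")
cur.execute("""CREATE TABLE fact_sales (
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER)""")

# Invented sample rows
cur.execute("INSERT INTO dim_item VALUES (1, 'Mobile', 'BrandX')")
cur.execute("INSERT INTO dim_location VALUES (1, 'Vancouver', 'Canada')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 605.0, 10)")

# A typical star join: the fact table joined out to its dimension tables
row = cur.execute("""SELECT i.item_name, l.city, f.dollars_sold
                     FROM fact_sales f
                     JOIN dim_item i ON f.item_key = i.item_key
                     JOIN dim_location l ON f.location_key = l.location_key""").fetchone()
print(row)  # ('Mobile', 'Vancouver', 605.0)
```

The single fact table at the center, joined radially to denormalized dimension tables, is what gives the schema its "star" shape.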
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.

Snowflake Schema
Note − Due to normalization in the Snowflake schema, the redundancy is reduced; therefore, it becomes easy to maintain and saves storage space.

Fact Constellation Schema
Schema Definition

A multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.

Syntax for Cube Definition

define cube <cube_name> [<dimension_list>]: <measure_list>

Syntax for Dimension Definition

define dimension <dimension_name> as (<attribute_or_dimension_list>)

Star Schema Definition

The star schema that we have discussed can be defined using DMQL as follows −

define cube sales star [time, item, branch, location]: dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

Snowflake Schema Definition

A snowflake schema can be defined using DMQL as follows −

define cube sales snowflake [time, item, branch, location]: dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))

Fact Constellation Schema Definition

A fact constellation schema can be defined using DMQL as follows −

define cube sales [time, item, branch, location]: dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]: dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales

Data Warehousing - Partitioning Strategy

Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of the data warehouse by partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons −
For Easy Management

The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of the fact table is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery

If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance of the system. Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only. We can then put these partitions into a state where they cannot be modified. Then they can be backed up. This means only the current partition is to be backed up.

To Enhance Performance

By partitioning the fact table into sets of data, query procedures can be enhanced. Query performance is enhanced because the query now scans only those partitions that are relevant. It does not have to scan the whole data.

Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.

Partitioning by Time into Different-sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.

Points to Note
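Time-based horizontal partitioning can be sketched in Python. This is a toy illustration of the routing logic only; real warehouses partition at the storage layer, and the fact rows below are invented:

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (transaction_date, value)
facts = [
    (date(2023, 1, 5), 100),
    (date(2023, 1, 20), 50),
    (date(2023, 2, 3), 75),
    (date(2023, 3, 14), 60),
]

def partition_by_month(rows):
    """Route each fact row to a monthly partition, mimicking
    horizontal partitioning by time into equal segments."""
    partitions = defaultdict(list)
    for txn_date, value in rows:
        partitions[(txn_date.year, txn_date.month)].append((txn_date, value))
    return dict(partitions)

parts = partition_by_month(facts)
print(sorted(parts))  # one partition key per month: [(2023, 1), (2023, 2), (2023, 3)]
```

A query restricted to January would then scan only the (2023, 1) partition, which is exactly the performance benefit described above; older partitions can be archived or marked read-only for backup.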
Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let's take an example. Suppose a marketing function has been structured into distinct regional departments, say on a state-by-state basis. If each region wants to query on information captured within its region, it would prove to be more effective to partition the fact table into regional partitions. This will cause the queries to speed up because they do not require scanning information that is not relevant.

Points to Note
Note − We recommend performing the partitioning only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.

Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension, then we should partition the fact table on the basis of its size. We can set a predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created.

Points to Note
Partitioning Dimensions

If a dimension contains a large number of entries, then it is required to partition the dimension. Here we have to check the size of the dimension. Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.

Round Robin Partitions

In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition. This technique makes it easy to automate table management facilities within the data warehouse.

Vertical Partition

Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done. Vertical partitioning can be performed in the following two ways −
Normalization

Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space. Take a look at the following tables that show how normalization is performed.

Table before Normalization
Table after Normalization
Row SplittingRow splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size. Note − While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions. Identify Key to PartitionIt is crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let's have an example. Suppose we want to partition the following table. Account_Txn_Table transaction_id account_id transaction_type value transaction_date region branch_name We can choose to partition on any key. The two possible keys could be
Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region. If we partition by transaction_date instead of region, then the latest transactions from every region will be in one partition. Now a user who wants to look at data within his own region has to query across multiple partitions. Hence it is worth determining the right partitioning key. Data Warehousing - Metadata ConceptsWhat is Metadata?Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.
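The effect of the partition-key choice on Account_Txn_Table can be demonstrated with a small sketch: count how many partitions a region-restricted query would have to scan under each candidate key. The sample rows are invented for illustration:

```python
def partitions_touched(rows, partition_key, region):
    """Distinct partitions a query restricted to one region must scan,
    given the column the table is partitioned on."""
    return {row[partition_key] for row in rows if row["region"] == region}

txns = [
    {"transaction_id": 1, "region": "R1", "transaction_date": "2024-01-01"},
    {"transaction_id": 2, "region": "R1", "transaction_date": "2024-01-02"},
    {"transaction_id": 3, "region": "R2", "transaction_date": "2024-01-01"},
]
# Partitioned on region: an R1 query hits exactly one partition.
by_region = partitions_touched(txns, "region", "R1")
# Partitioned on transaction_date: the same R1 query is spread
# across every date partition that holds R1 rows.
by_date = partitions_touched(txns, "transaction_date", "R1")
```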
Note − In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data. Categories of MetadataMetadata can be broadly categorized into three categories −
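A minimal sketch of the kind of record meant by "metadata for time-stamping extracted data and its source" — every field name here is an assumption for illustration:

```python
from datetime import datetime, timezone

def make_extract_metadata(table_name, source_system, row_count):
    """Metadata recorded alongside each extract: what was pulled,
    which source it came from, and a time stamp for the extraction."""
    return {
        "table": table_name,
        "source": source_system,
        "rows": row_count,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

meta = make_extract_metadata("sales_fact", "epos_source", 1200)
```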
Role of MetadataMetadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from the warehouse data, yet it plays an important role. The various roles of metadata are explained below.
The following diagram shows the roles of metadata. Metadata RepositoryMetadata repository is an integral part of a data warehouse system. It has the following metadata −
Challenges for Metadata ManagementThe importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformations, and ensures the accuracy of calculations. Metadata also conveys the definitions of business terms to business end-users. With all these uses, metadata also has its challenges. Some of the challenges are discussed below.
Data Warehousing - Data MartingWhy Do We Need a Data Mart?Listed below are the reasons to create a data mart −
Note − Do not create a data mart for any other reason, since the operating cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution. Cost-effective Data MartingFollow the steps given below to make data marting cost-effective −
Identify the Functional SplitsIn this step, we determine if the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tends to be in isolation from the rest of the organization. Let's have an example. Consider a retail organization, where each merchant is accountable for maximizing the sales of a group of products. For this, the following information is valuable −
As the merchant is not interested in the products they are not dealing with, the data mart is a subset of the data dealing with the product group of interest. The following diagram shows data marting for different users. Given below are the issues to be taken into account while determining the functional split −
Note − We need to determine the business benefits and technical feasibility of using a data mart. Identify User Access Tool RequirementsWe need data marts to support user access tools that require internal data structures. The data in such structures is outside the control of the data warehouse but needs to be populated and updated on a regular basis. There are some tools that populate directly from the source system but some cannot. Therefore, additional requirements outside the scope of the tool need to be identified for the future. Note − In order to ensure consistency of data across all access tools, the data should not be directly populated from the data warehouse, rather each tool must have its own data mart. Identify Access Control IssuesThere should be privacy rules to ensure the data is accessed by authorized users only. For example, a data warehouse for a retail banking institution ensures that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank. Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data. Designing Data MartsData marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. It helps in maintaining control over database instances. The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all dimension data in the starflake schema. Cost of Data MartingThe cost measures for data marting are as follows −
Hardware and Software CostAlthough data marts are created on the same hardware, they require some additional hardware and software. To handle user queries, it requires additional processing power and disk storage. If detailed data and the data mart exist within the data warehouse, then we would face additional cost to store and manage replicated data. Note − Data marting is more expensive than aggregations, therefore it should be used as an additional strategy and not as an alternative strategy. Network AccessA data mart could be on a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process. Time Window ConstraintsThe extent to which a data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The determination of how many data marts are possible depends on −
Data Warehousing - System ManagersSystem management is mandatory for the successful implementation of a data warehouse. The most important system managers are −
System Configuration Manager
Note − The most important configuration tool is the I/O manager. System Scheduling ManagerSystem Scheduling Manager is responsible for the successful implementation of the data warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The list of features a system scheduling manager must have is as follows −
Note − The above list can be used as parameters for evaluating a good scheduler. Some important jobs that a scheduler must be able to handle are as follows −
Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture. System Event ManagerThe event manager is a piece of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore we need a tool that automatically handles all the events without any intervention of the user. Note − The event manager monitors event occurrences and deals with them. The event manager also tracks the myriad things that can go wrong on this complex data warehouse system. EventsEvents are actions that are generated by the user or the system itself. An event is a measurable, observable occurrence of a defined action. Given below is a list of common events that are required to be tracked.
The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as event handler. This code is executed whenever an event occurs. System and Database ManagerSystem and database manager may be two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and the database manager are as follows −
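The event/handler mechanism described above — handler code registered for a named event, executed automatically whenever the event occurs — can be sketched as follows. The event name and message are hypothetical examples:

```python
class EventManager:
    """Maps event names to handler code that runs automatically
    when the event fires, with no user intervention."""
    def __init__(self):
        self.handlers = {}
        self.log = []

    def register(self, event, handler):
        self.handlers[event] = handler

    def fire(self, event, **info):
        if event in self.handlers:
            # the event handler is the code executed when the event occurs
            self.log.append(self.handlers[event](**info))

mgr = EventManager()
mgr.register("disk_full", lambda device: f"paging operator: {device} is full")
mgr.fire("disk_full", device="/dev/dsk1")
```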
System Backup Recovery ManagerThe backup and recovery tool makes it easy for operations and management staff to back-up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features that are required for the management of backups are as follows −
Backups are taken only to protect against data loss. Following are the important points to remember −
Data Warehousing - Process ManagersProcess managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers −
Data Warehouse Load ManagerThe load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager varies between specific solutions from one data warehouse to another. Load Manager ArchitectureThe load manager performs the following functions −
Extract Data from SourceThe data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows the client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways. Fast Load
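The gateway pattern — a client program generates SQL that executes at the source server — can be illustrated with Python's built-in `sqlite3` standing in for an operational source reached through ODBC/JDBC. The `orders` table and its columns are invented for the example:

```python
import sqlite3

# In-memory SQLite database acts as the operational source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

def extract(connection, sql):
    """Run extraction SQL at the source and return the rows, as a
    gateway would on behalf of the load manager."""
    return connection.execute(sql).fetchall()

rows = extract(conn, "SELECT id, amount FROM orders ORDER BY id")
```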
Simple TransformationsWhile loading, it may be required to perform simple transformations. After completing the simple transformations, we can do complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks −
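Such load-time simple transformations might look like the sketch below: keep only the columns the warehouse needs and coerce each value to its required data type. The exact checks are solution-specific, and every column name here is an assumption:

```python
def transform_epos_row(raw, required_columns, column_types):
    """Simple load-time transformation: project away columns the
    warehouse does not need and convert values to the required types."""
    return {col: column_types[col](raw[col]) for col in required_columns}

raw = {"store": "S01", "qty": "3", "price": "9.99", "till_id": "T7"}
clean = transform_epos_row(
    raw,
    required_columns=["store", "qty", "price"],  # till_id is dropped
    column_types={"store": str, "qty": int, "price": float},
)
```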
Warehouse ManagerThe warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions. Warehouse Manager ArchitectureA warehouse manager includes the following −
Functions of Warehouse ManagerA warehouse manager performs the following functions −
Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate. Query ManagerThe query manager is responsible for directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posted by the user. Query Manager ArchitectureA query manager includes the following components −
Functions of Query Manager
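The query manager's core job of directing queries to suitable tables can be sketched as a simple rewrite: if a precomputed summary (aggregation) table can answer the query, substitute it for the detailed fact table. The table names are hypothetical, and a real query manager would also check that the aggregation matches the query's grain:

```python
def route_query(query_tables, aggregation_tables):
    """Direct a query to suitable tables: substitute an aggregation
    table for a fact table wherever one is available."""
    return [aggregation_tables.get(t, t) for t in query_tables]

aggs = {"sales_fact": "sales_by_month_agg"}
routed = route_query(["sales_fact", "product_dim"], aggs)
# The query now reads the small summary table instead of scanning
# the detailed fact table, speeding up the response.
```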
Data Warehousing - SecurityThe objective of a data warehouse is to make large amounts of data easily accessible to the users, hence allowing the users to extract information about the business as a whole. But we know that there could be some security restrictions applied to the data that can be an obstacle to accessing the information. If the analyst has a restricted view of the data, then it is impossible to capture a complete picture of the trends within the business. The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. Since the aggregation of summaries is not the same as aggregating the data as a whole, it is possible to miss some information trends in the data unless someone is analyzing the data as a whole. Security RequirementsAdding security features affects the performance of the data warehouse, therefore it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live. During the design phase of the data warehouse, we should keep in mind what data sources may be added later and what would be the impact of adding those data sources. We should consider the following possibilities during the design phase.
This situation arises when the future users and the data sources are not well known. In such a situation, we need to use our knowledge of the business and the objective of the data warehouse to identify the likely requirements. The following activities get affected by security measures −
User AccessWe need to first classify the data and then classify the users on the basis of the data they can access. Data Classification The following two approaches can be used to classify the data −
There are some issues in the second approach. To understand, let's have an example. Suppose you are building the data warehouse for a bank. Consider that the data being stored in the data warehouse is the transaction data for all the accounts. The question here is, who is allowed to see the transaction data. The solution lies in classifying the data according to the function. User classification The following approaches can be used to classify the users −
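Classifying data by sensitivity and users by clearance reduces the access decision to a simple comparison. A minimal sketch; the level names are invented for illustration:

```python
LEVELS = ("public", "internal", "restricted", "highly_sensitive")

def can_access(user_clearance, data_sensitivity, levels=LEVELS):
    """Allow access only when the user's clearance is at least as
    high as the data's sensitivity classification."""
    return levels.index(user_clearance) >= levels.index(data_sensitivity)

# A user cleared for 'restricted' data may read 'internal' data,
# but a user cleared only for 'internal' may not read 'restricted' data.
```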
Classification on basis of Department Let's have an example of a data warehouse where the users are from the sales and marketing departments. We can have security by a top-down company view, with access centered on the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram. But if each department accesses different data, then we should design the security access for each department separately. This can be achieved by departmental data marts. Since these data marts are separated from the data warehouse, we can enforce separate security restrictions on each data mart. This approach is shown in the following figure. Classification Based on Role If the data is generally available to all the departments, then it is useful to follow the role access hierarchy. In other words, if the data is generally accessed by all the departments, then apply security restrictions as per the role of the user. The role access hierarchy is shown in the following figure. Audit RequirementsAuditing is a subset of security and a costly activity. Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware; therefore, it is recommended that wherever possible, auditing should be switched off. Audit requirements can be categorized as follows −
Note − For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important, because failures can highlight unauthorized or fraudulent access. Network RequirementsNetwork security is as important as other forms of security. We cannot ignore the network security requirements. We need to consider the following issues −
These restrictions need to be considered carefully. Following are the points to remember −
Data MovementThere exist potential security implications while moving the data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised −
If we talk about the backup of these flat files, the following questions are raised −
Some other forms of data movement like query result sets also need to be considered. The questions raised while creating the temporary table are as follows −
We should avoid the accidental violation of security restrictions. If a user with access to the restricted data can generate accessible temporary tables, data can be visible to non-authorized users. We can overcome this problem by having a separate temporary area for users with access to restricted data. DocumentationThe audit and security requirements need to be properly documented. This will be treated as part of the justification. This document can contain all the information gathered from −
Impact of Security on DesignSecurity affects the application code and the development timescales. Security affects the following areas −
Application DevelopmentSecurity affects the overall application development and it also affects the design of important components of the data warehouse such as the load manager, warehouse manager, and query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. Also, there may be requirements for extra metadata to handle any extra objects. To create and maintain extra views, the warehouse manager may require extra code to enforce security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available. The query manager requires changes to handle any access restrictions. The query manager will need to be aware of all extra views and aggregations. Database DesignThe database layout is also affected because when security measures are implemented, there is an increase in the number of views and tables. Adding security increases the size of the database and hence increases the complexity of the database design and management. It will also add complexity to the backup management and recovery plan. TestingTesting the data warehouse is a complex and lengthy process. Adding security to the data warehouse also affects the testing time complexity. It affects the testing in the following two ways −
Data Warehousing - BackupA data warehouse is a complex system and it contains a huge volume of data. Therefore it is important to back up all the data so that it becomes available for recovery in the future as required. In this chapter, we will discuss the issues in designing the backup strategy. Backup TerminologiesBefore proceeding further, you should know some of the backup terminologies discussed below.
Hardware BackupIt is important to decide which hardware to use for the backup. The speed of processing the backup and restore depends on the hardware being used, how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available and their pros and cons. These choices are as follows −
Tape TechnologyThe tape choice can be categorized as follows −
Tape Media There exist several varieties of tape media. Some tape media standards are listed in the table below −
Other factors that need to be considered are as follows −
Standalone Tape Drives The tape drives can be connected in the following ways −
There could be issues in connecting the tape drives to a data warehouse.
Tape StackersA device that loads multiple tapes into a single tape drive is known as a tape stacker. The stacker dismounts the current tape when it has finished with it and loads the next tape, hence only one tape is available at a time to be accessed. The price and the capabilities may vary, but the common ability is that they can perform unattended backups. Tape SilosTape silos provide large storage capacities. Tape silos can store and manage thousands of tapes. They can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for the silo to be connected remotely over a network or a dedicated link. We should ensure that the bandwidth of the connection is up to the job. Disk BackupsMethods of disk backups are −
These methods are used in the OLTP system. These methods minimize the database downtime and maximize the availability. Disk-to-Disk Backups Here the backup is taken on disk rather than on tape. Disk-to-disk backups are done for the following reasons −
Backing up the data from disk to disk is much faster than to tape. However, it is an intermediate step of backup; later the data is backed up on tape. The other advantage of disk-to-disk backups is that they give you an online copy of the latest backup. Mirror Breaking The idea is to have disks mirrored for resilience during the working day. When a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups. Note − The database may need to be shut down to guarantee consistency of the backup. Optical JukeboxesOptical jukeboxes allow the data to be stored near line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is that it has a slower write speed than disks. But optical media provides long life and reliability, which makes them a good choice of medium for archiving. Software BackupsThere are software tools available that help in the backup process. These software tools come as a package. These tools not only take backups, they can also effectively manage and control the backup strategies. There are many software packages available in the market. Some of them are listed in the following table −
Criteria for Choosing Software PackagesThe criteria for choosing the best software package are listed below −
Data Warehousing - TuningA data warehouse keeps evolving and it is unpredictable what queries users are going to post in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter, we will discuss how to tune the different aspects of a data warehouse such as performance, data load, queries, etc. Difficulties in Data Warehouse TuningTuning a data warehouse is a difficult procedure due to the following reasons −
Note − It is very important to have a complete knowledge of data warehouse. Performance AssessmentHere is a list of objective measures of performance −
Following are the points to remember.
Data Load TuningData load is a critical part of overnight processing. Nothing else can run until the data load is complete. This is the entry point into the system. Note − If there is a delay in transferring the data, or in the arrival of data, then the entire system is affected badly. Therefore it is very important to tune the data load first. There are various approaches to tuning data load that are discussed below −
Integrity ChecksIntegrity checking highly affects the performance of the load. Following are the points to remember −
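One common way to keep integrity checking from slowing the load is to defer it: load the rows first, then run a set-based referential check afterwards instead of checking row by row during the load. A minimal sketch; the column name `product_key` and the sample rows are assumptions for the example:

```python
def find_orphan_rows(fact_rows, valid_product_keys):
    """Set-based referential integrity check run after the load,
    reporting fact rows whose dimension key has no match."""
    return [r for r in fact_rows if r["product_key"] not in valid_product_keys]

facts = [
    {"txn": 1, "product_key": "P1"},
    {"txn": 2, "product_key": "P9"},  # no matching dimension row
]
orphans = find_orphan_rows(facts, valid_product_keys={"P1", "P2"})
```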
Tuning QueriesWe have two kinds of queries in a data warehouse −
Fixed QueriesFixed queries are well defined. Following are the examples of fixed queries −
Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot changes in data size and data skew, as these will cause the execution plan to change. Note − We cannot do much more on the fact table, but while dealing with dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be used to tune these queries. Ad hoc QueriesTo understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each user or group of users, you need to know the following −
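The execution-plan tracking described for fixed queries reduces to storing the most successful plan per query and flagging any divergence, since a changed plan can signal growing data volume or data skew. The plan strings and query name below are invented for illustration:

```python
def plan_changed(stored_plans, query_id, current_plan):
    """Compare the stored most-successful execution plan with the
    current one; a difference warrants investigation."""
    return stored_plans.get(query_id) != current_plan

stored = {"monthly_sales": "index_scan(sales_by_month)"}
# The optimizer has switched this fixed query to a full scan —
# possibly because the data size or skew has changed.
changed = plan_changed(stored, "monthly_sales", "full_table_scan")
```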
Points to Note
Data Warehousing - TestingTesting is very important for data warehouse systems to make them work correctly and efficiently. There are three basic levels of testing performed on a data warehouse −
Unit Testing
Integration Testing
System Testing
Test ScheduleFirst of all, the test schedule is created in the process of developing the test plan. In this schedule, we predict the estimated time required for the testing of the entire data warehouse system. There are different methodologies available to create a test schedule, but none of them are perfect because the data warehouse is very complex and large. Also the data warehouse system is evolving in nature. One may face the following issues while creating a test schedule −
Note − Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing. Testing Backup RecoveryTesting the backup recovery strategy is extremely important. Here is the list of scenarios for which this testing is needed −
Testing Operational EnvironmentThere are a number of aspects that need to be tested. These aspects are listed below.
Testing the DatabaseThe database is tested in the following three ways −
Testing the Application
Logistics of the TestThe aim of the system test is to test all of the following areas −
Note − The most important point is to test the scalability. Failure to do so will leave us with a system design that does not work when the system grows. Data Warehousing - Future AspectsFollowing are the future aspects of data warehousing.
Hence the future shape of the data warehouse will be very different from what is being created today.