
DATA WAREHOUSE VERSUS DATA MART: THE GREAT DEBATE

Customers exploring the field of business intelligence for the first time often lead with: What is the difference between a data warehouse and a data mart? The next question follows as predictably as night follows day: which one does my company need?
Let me start by saying that the two terms are often confused. Indeed, some people in the industry use them virtually interchangeably, which is unfortunate, because they do reflect a valuable hierarchical difference.
The Data Warehouse
A "data warehouse" will typically contain the full range of business intelligence available to a company from all sources. That data consists of transaction-processing records, corporate and marketing data, and other business operations information; for example, a bank might include loans, credit card statements, and demand deposits data, along with basic customer information. This internal data is frequently combined with statistical and demographic information obtained from outside sources. The cross-divisional nature of the data on file explains why a data warehouse is often called an "enterprise warehouse" -- because the wealth of data it gathers supports the informational needs of the corporate enterprise as a whole.
The Data Mart
Here we move to the next level down in the information hierarchy. A company's marketing, purchasing and finance departments will all make use of data stored in the enterprise warehouse. In many cases they will use the same data, but each department will massage that data in different ways. So each department sets up its own "data mart" designed to extract data from the enterprise warehouse. The key point here is that each mart processes the data in a form which suits its own departmental needs.
Differences defined
Here we see the difference between the two hierarchical levels. At the top of the information chain, a data or enterprise warehouse is "application-neutral." The task of the warehouse is first to store, and then supply, information to different users.
By contrast, a data mart is "application-specific." Data held in the warehouse will be accessed by several departments or divisions, each of which has a specific "single-subject" interest in the warehoused data, be it finance, human resources or marketing. Each area will set up its own "data mart" to service closely defined user-specific needs. From this we see that a mart is designed to solve one business problem well; it is not set up to cope with a variety of needs.
Ideally, warehouses and marts should coexist, complementing each other's roles. Technology cannot keep pace while human minds dream up ever more complex business demands, so we have to evolve strategies to determine when to build marts and when to build warehouses. Let's assume that a major corporate division's IT group finds that several marts are using much the same data. The question arises: "Why not combine the marts?" The reason is usually that one of the marts is delivering a specific price/performance objective, one that would be impossible to meet if data for other subject areas were merged into that mart.
In such cases, the desired analysis depends on some specific denormalization to achieve its goal. But even if no such overriding objective stands in the way of merging two marts, one still has to ask: "Does the combined data model give me an analytical capacity I didn't have before?" This question usually yields several answers, many of which point in the direction of creating a warehouse. After sober second thought, the debate usually turns to more important considerations than an immediate need to reduce "redundant" storage costs. This tells us that, if combining the loans data mart with the credit card mart and the checking account mart will enhance our understanding of customers' purchasing habits, we should be thinking about setting up a warehouse to service several needs.
The decision to set up a mart usually originates in that part of an organization with the most business "pain", and, thus, the opportunity for greatest gain.
A warehouse usually comes into being when a senior executive notices a business problem recurring in several departments. Subsequent discussion reveals a greater need for cross-divisional data analysis than anyone thought. So a warehouse is born to collectively help all divisions behave as a single corporation.
In summary, a mart is born from a single department's urgent need to solve a problem. A warehouse is born when executives discover a common problem affecting different departments AND decide that they can obtain added value by getting a cross-departmental view of the data. Ideally, the warehouse is the best place to start, but that may not reflect the real world. Ultimately you pick your starting point knowing that, over time, you may end up with several warehouses and marts.
So the debate should not be mart versus warehouse, but rather which applications are best served by a mart or warehouse design, followed by how the highest priority implementation will fit into a three to five year plan. And the difference itself, summarized in business terms, depends on whether the system is inter-departmental, pulling data from multiple major subsystems (loans, credit, trust, checking, etc.). A mart might model a small number of major entities; a warehouse by definition models several. Thus, a mart will model just customer loans, payments and forecasts, whereas the warehouse would combine this with checking account transactions, credit card transactions, and so on.
Marts tend toward heavy summarization. The epitome of this characteristic is represented by the Essbase OLAP cube -- all summary, no details. Thus marts focus on solving predefined questions. Knowing that lets us fine-tune their performance responses.
Let's phrase the question "either/or," not "versus"
I hope by now it is clear that we are not looking at a "Warehouse versus Mart" debate. There is nothing adversarial about the process. Selection revolves around complementary roles.
Accuracy of data marts and data warehouses
As long as I'm on the subject of warehouses and marts, I feel it's my duty to talk about the importance of transforming data into useful context. For data analysis to be carried out efficiently and effectively, it's critical for a data warehouse and/or data mart to provide "a single version of the truth," a term one hears often in the industry. To accomplish this, data must be extracted and transformed in one place, for the marts and the warehouses alike, so that we have consistency across them. That way, Joe Brown in the mart is not confused with J.P. Brown in the warehouse. The process of eliminating these differences in accuracy and context saves the company considerably -- mailing costs are high as it is, so why mail the same piece twice to a single person?
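As a small illustration of that conforming step, here is a minimal Python sketch of collapsing duplicate customer records before they are loaded; the matching rule (surname plus postal code) and the field names are illustrative assumptions, not a recommended master-data algorithm.

    # Minimal sketch: conform customer records so the same person is not loaded twice.
    def match_key(record):
        """Build a crude matching key from surname and postal code (illustrative only)."""
        return (record["surname"].strip().lower(), record["postal_code"].strip())

    def deduplicate(records):
        """Keep the first record seen for each matching key."""
        seen = {}
        for rec in records:
            seen.setdefault(match_key(rec), rec)
        return list(seen.values())

    customers = [
        {"name": "Joe Brown",  "surname": "Brown", "postal_code": "60601"},
        {"name": "J.P. Brown", "surname": "Brown", "postal_code": "60601"},
    ]
    print(deduplicate(customers))   # one Brown record survives, so only one piece gets mailed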
Analysis: good for what ails you
Some situations are best analyzed at a warehouse level; other, more specific conditions are best examined in a data mart. A useful analogy is to imagine the difference between a data warehouse and a data mart as equivalent to the difference between a general practitioner and a specialist. The GP has a broad knowledge of many disciplines, not least of which is understanding human nature. But one does not expect a GP to perform open-heart surgery or cataract operations, procedures appropriate to specialist training. However, one does expect a GP to diagnose heart disease or eye problems, along with a host of other conditions, referring patients to specialists for treatment, as appropriate. Similarly, competent analysis at a data warehouse level -- the GP -- will give senior management an excellent overview of the whole operation, or of customers' buying habits. That may be enough to prescribe an overall treatment and a few symptom-specific prescriptions, too. Beyond that, if you want to solve a single problem very well, send for a specialist, call in the mart.
At IBM, we come at each customer's challenge from a very specific point of view -- the customer's point of view. It is essential to establish exactly what the customer wants. Only when vendor and customer have worked out the precise parameters of a business problem should they move to the next step: identifying the appropriate solution. We think that if we as a vendor do our homework right, both we and our customer will come out of negotiations happy -- and our customer will prosper.


Building a Data Warehouse (Tools)
In general, building any data warehouse consists of the following steps:
1.  Extracting the transactional data from the data sources into a staging area
2.  Transforming the transactional data
3.  Loading the transformed data into a dimensional database
4.  Building pre-calculated summary values to speed up report generation
5.  Building (or purchasing) a front-end reporting tool
Extracting Transactional Data
A large part of building a DW is pulling data from various data sources and placing it in a central storage area. In fact, this can be the most difficult step to accomplish due to the reasons mentioned earlier: Most people who worked on the systems in place have moved on to other jobs. Even if they haven't left the company, you still have a lot of work to do: You need to figure out which database system to use for your staging area and how to pull data from various sources into that area.
Fortunately for many small to mid-size companies, Microsoft has come up with an excellent tool for data extraction. Data Transformation Services (DTS), which is part of Microsoft SQL Server 7.0 and 2000, allows you to import and export data from any OLE DB or ODBC-compliant database as long as you have an appropriate provider. This tool is available at no extra cost when you purchase Microsoft SQL Server. The sad reality is that you won't always have an OLE DB or ODBC-compliant data source to work with, however. If not, you're bound to make a considerable investment of time and effort in writing a custom program that transfers data from the original source into the staging database.
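DTS itself is configured inside SQL Server rather than written by hand, but the extraction pattern it automates looks roughly like the following sketch; the pyodbc package, the DSN names, and the loan table layout are illustrative assumptions rather than anything DTS prescribes.

    # Sketch: copy rows from an ODBC source into a staging table in batches.
    import pyodbc

    source  = pyodbc.connect("DSN=LoanSystem")    # operational source (assumed DSN)
    staging = pyodbc.connect("DSN=StagingArea")   # staging database (assumed DSN)

    src_cur = source.cursor()
    dst_cur = staging.cursor()

    src_cur.execute("SELECT loan_id, customer_id, amount, booked_on FROM loans")
    while True:
        batch = src_cur.fetchmany(1000)           # pull the data in manageable batches
        if not batch:
            break
        dst_cur.executemany(
            "INSERT INTO stg_loans (loan_id, customer_id, amount, booked_on) "
            "VALUES (?, ?, ?, ?)",
            batch,
        )
    staging.commit()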
Transforming Transactional Data
An equally important and challenging step after extracting is transforming and relating the data extracted from multiple sources. As I said earlier, your source systems were most likely built by many different IT professionals. Let's face it. Each person sees the world through their own eyes, so each solution is at least a bit different from the others. The data model of your mainframe system might be very different from the model of the client-server system.
Most companies have their data spread out in a number of different database management systems: MS Access, MS SQL Server, Oracle, Sybase, and so on. Many companies will also have much of their data in flat files, spreadsheets, mail systems and other types of data stores. When building a data warehouse, you need to relate data from all of these sources and build some type of staging area that can handle data extracted from any of these source systems. After all the data is in the staging area, you have to massage it and give it a common shape. Prior to massaging the data, you need to figure out a way to relate the tables and columns of one system to the tables and columns coming from the other systems.
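One way to picture that relating step is a simple mapping layer that renames each source system's columns to a shared staging layout before any massaging begins; the system names and column names in this sketch are invented for illustration.

    # Sketch: conform column names from two source systems to one staging layout.
    COLUMN_MAP = {
        "mainframe_loans":  {"CUST_NO": "customer_id", "LN_AMT": "amount"},
        "branch_access_db": {"CustomerID": "customer_id", "LoanAmount": "amount"},
    }

    def conform(source_system, row):
        """Rename a source row's columns to the shared staging names."""
        mapping = COLUMN_MAP[source_system]
        return {mapping[col]: value for col, value in row.items() if col in mapping}

    print(conform("mainframe_loans",  {"CUST_NO": 42, "LN_AMT": 1500.0}))
    print(conform("branch_access_db", {"CustomerID": 42, "LoanAmount": 900.0}))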
Creating a Dimensional Model
The third step in building a data warehouse is coming up with a dimensional model. Most modern transactional systems are built using the relational model. The relational database is highly normalized; when designing such a system, you try to get rid of repeating columns and make all columns dependent on the primary key of each table. The relational systems perform well in the On-Line Transaction Processing (OLTP) environment. On the other hand, they perform rather poorly in the reporting (and especially DW) environment, in which joining multiple huge tables just is not the best idea.
The relational format is not very efficient when it comes to building reports with summary and aggregate values. The dimensional approach, on the other hand, provides a way to improve query performance without affecting data integrity. However, the query performance improvement comes with a storage space penalty; a dimensional database will generally take up much more space than its relational counterpart. These days, storage space is fairly inexpensive, and most companies can afford large hard disks with minimal effort.
The dimensional model consists of the fact and dimension tables. The fact tables consist of foreign keys to each dimension table, as well as measures. The measures are a factual representation of how well (or how poorly) your business is doing (for instance, the number of parts produced per hour or the number of cars rented per day). Dimensions, on the other hand, are what your business users expect in the reports—the details about the measures. For example, the time dimension tells the user that 2000 parts were produced between 7 a.m. and 7 p.m. on the specific day; the plant dimension specifies that these parts were produced by the Northern plant.
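The parts-per-hour example can be sketched as a tiny star schema; the table and column names are illustrative, and SQLite merely stands in here for whichever database actually hosts the dimensional model.

    # Sketch: a one-fact, two-dimension star schema for the production example.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE dim_time  (time_key  INTEGER PRIMARY KEY, day TEXT, hour_band TEXT);
    CREATE TABLE dim_plant (plant_key INTEGER PRIMARY KEY, plant_name TEXT, region TEXT);
    CREATE TABLE fact_production (
        time_key       INTEGER REFERENCES dim_time(time_key),
        plant_key      INTEGER REFERENCES dim_plant(plant_key),
        parts_produced INTEGER   -- the measure
    );
    INSERT INTO dim_time  VALUES (1, '2000-06-01', '7am-7pm');
    INSERT INTO dim_plant VALUES (1, 'Northern plant', 'North');
    INSERT INTO fact_production VALUES (1, 1, 2000);
    """)
    for row in db.execute("""
        SELECT p.plant_name, t.day, t.hour_band, SUM(f.parts_produced)
        FROM fact_production f
        JOIN dim_time  t ON t.time_key  = f.time_key
        JOIN dim_plant p ON p.plant_key = f.plant_key
        GROUP BY p.plant_name, t.day, t.hour_band
    """):
        print(row)   # ('Northern plant', '2000-06-01', '7am-7pm', 2000)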
Like any modeling exercise, dimensional modeling is not to be taken lightly. Figuring out the needed dimensions is a matter of discussing the business requirements with your users over and over again. When you first talk to the users, they have very minimal requirements: "Just give me those reports that show me how each portion of the company performs." Figuring out what "each portion of the company" means is your job as a DW architect. The company may consist of regions, each of which reports to a different vice president of operations. Each region might in turn consist of areas, which might consist of individual stores. Each store could have several departments. When the DW is complete, splitting the revenue among the regions won't be enough. That's when your users will demand more features and additional drill-down capabilities. Instead of waiting for that to happen, an architect should take proactive measures to gather all the necessary requirements ahead of time.
It's also important to realize that not every field you import from each data source may fit into the dimensional model. Indeed, if you have a sequential key on a mainframe system, it won't have much meaning to your business users. Other columns might have had significance eons ago when the system was built; since then, management might have changed its mind about the relevance of such columns. So don't worry if not all of the columns you imported end up in your dimensional model.
Loading the Data
After you've built a dimensional model, it's time to populate it with the data in the staging database. This step only sounds trivial. It might involve combining several columns together or splitting one field into several columns. You might have to perform several lookups before calculating certain values for your dimensional model.
Keep in mind that such data transformations can be performed at either of the two stages: while extracting the data from their origins or while loading data into the dimensional model. I wouldn't recommend one way over the other—make a decision depending on the project. If your users need to be sure that they can extract all the data first, wait until all data is extracted prior to transforming it. If the dimensions are known prior to extraction, go on and transform the data while extracting it.
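A minimal sketch of such load-time work, assuming made-up staging field names and a small in-memory dimension lookup:

    # Sketch: split one staging field into two columns and resolve a surrogate key.
    plant_lookup = {"Northern plant": 1, "Southern plant": 2}   # keys from dim_plant

    def transform_row(staging_row):
        first, _, last = staging_row["contact_name"].partition(" ")   # split one field
        return {
            "first_name": first,
            "last_name": last,
            "plant_key": plant_lookup[staging_row["plant_name"]],    # lookup before load
            "parts_produced": staging_row["parts"],
        }

    print(transform_row({"contact_name": "Joe Brown",
                         "plant_name": "Northern plant",
                         "parts": 2000}))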
Generating Precalculated Summary Values
The next step is generating the precalculated summary values which are commonly referred to as aggregations. This step has been tremendously simplified by SQL Server Analysis Services (or OLAP Services, as it is referred to in SQL Server 7.0). After you have populated your dimensional database, SQL Server Analysis Services does all the aggregate generation work for you. However, remember that depending on the number of dimensions you have in your DW, building aggregations can take a long time. As a rule of thumb, the more dimensions you have, the more time it'll take to build aggregations. However, the size of each dimension also plays a significant role.
Prior to generating aggregations, you need to make an important choice about which dimensional model to use: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), or HOLAP (Hybrid OLAP). The ROLAP model builds additional tables for storing the aggregates, but this takes much more storage space than a dimensional database, so be careful! The MOLAP model stores the aggregations as well as the data in multidimensional format, which is far more efficient than ROLAP. The HOLAP approach keeps the data in the relational format, but builds aggregations in multidimensional format, so it's a combination of ROLAP and MOLAP.
Regardless of which dimensional model you choose, ensure that SQL Server has as much memory as possible. Building aggregations is a memory-intensive operation, and the more memory you provide, the less time it will take to build aggregate values.
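Analysis Services builds the aggregations for you, but it may help to see what an aggregation boils down to; this hand-rolled sketch on made-up fact rows simply precomputes one summary so that reports would not have to rescan the detail.

    # Sketch: precalculate a summary value (total parts per plant) from detail rows.
    from collections import defaultdict

    fact_rows = [
        {"plant": "Northern", "day": "2000-06-01", "parts": 2000},
        {"plant": "Northern", "day": "2000-06-02", "parts": 1800},
        {"plant": "Southern", "day": "2000-06-01", "parts": 1500},
    ]

    parts_by_plant = defaultdict(int)
    for row in fact_rows:
        parts_by_plant[row["plant"]] += row["parts"]   # the precalculated aggregation

    print(dict(parts_by_plant))   # reports read this instead of scanning every fact row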
Building (or Purchasing) a Front-End Reporting Tool
After you've built the dimensional database and the aggregations you can decide how sophisticated your reporting tools need to be. If you just need the drill-down capabilities, and your users have Microsoft Office 2000 on their desktops, the Pivot Table Service of Microsoft Excel 2000 will do the job. If the reporting needs are more than what Excel can offer, you'll have to investigate the alternative of building or purchasing a reporting tool. The cost of building a custom reporting (and OLAP) tool will usually outweigh the purchase price of a third-party tool. That is not to say that OLAP tools are cheap (not in the least!).
There are several major vendors on the market that have top-notch analytical tools. In addition to the third-party tools, Microsoft has just released its own tool, Data Analyzer, which can be a cost-effective alternative. Consider purchasing one of these suites before delving into the process of developing your own software because reinventing the wheel is not always beneficial or affordable. Building OLAP tools is not a trivial exercise by any means.


Paradigm Shift in DSS Models

What is a Data Mart?
A data mart is a collection of subject areas organized for decision support based on the needs of a given department. Finance has its data mart, marketing has its own, sales has its own, and so on. And the data mart for marketing only faintly resembles anyone else's data mart.
Perhaps most importantly, the individual departments OWN the hardware, software, data and programs that constitute the data mart. The rights of ownership allow the departments to bypass any means of control or discipline that might coordinate the data found in the different departments.
Each department has its own interpretation of what a data mart should look like and each department's data mart is peculiar to and specific to its own needs. Typically, the database design for a data mart is built around a star-join structure that is optimal for the needs of the users found in the department. In order to shape the star join, the requirements of the users for the department must be gathered. The data mart contains only a modicum of historical information and is granular only to the point that it suits the needs of the department. The data mart is typically housed in multidimensional technology which is great for flexibility of analysis but is not optimal for large amounts of data. Data found in data marts is highly indexed.
There are two kinds of data marts--dependent and independent. A dependent data mart is one whose source is a data warehouse. An independent data mart is one whose source is the legacy applications environment. All dependent data marts are fed by the same source--the data warehouse. Each independent data mart is fed uniquely and separately by the legacy applications environment. Dependent data marts are architecturally and structurally sound. Independent data marts are unstable and architecturally unsound, at least for the long haul. The problem with independent data marts is that their deficiencies do not make themselves manifest until the organization has built multiple independent data marts.
Fundamental Focus
The classic approach to data warehousing is, for all intents and purposes, a business process that is:
§  Business Driven
§  Market Focused
§  Technology Based
The traditional data warehouse can be viewed as a decision support database that is maintained separately from an organization's transaction processing (or operational) databases. A more rigorous definition is that a data warehouse is a subject oriented, integrated, time variant, non-volatile collection of data that is used primarily to aid organizational decision making.
W. H. Inmon
Bill Inmon, considered by some to be the father of data warehousing, and a prolific writer and champion of the Data Warehouse (DW) concept, has defined data warehousing as a database containing Subject Oriented, Integrated, Time Variant and Non-volatile information used to support the decision making process. Here we take a closer look at those terms and what they mean in the context of data warehousing.
Subject Oriented
Operational databases, such as order processing and payroll databases, are organized around business processes or functional areas. These databases grew out of the applications they served. Thus, the data was relative to the order processing application or the payroll application. Data on a particular subject, such as products or employees, was maintained separately (and usually inconsistently) in a number of different databases. In contrast, a data warehouse is organized around subjects. This subject orientation presents the data in a much easier-to-understand format for end users and non-IT analysts.
Integrated
Integration of data within a warehouse is accomplished by making the data consistent in format, naming, and other aspects. Operational databases, for historic reasons, often have major inconsistencies in data representations. For example, a set of operational databases may represent "male" and "female" by using codes such as "m" and "f", by "1" and "2", or by "b" and "g". Often, the inconsistencies are more complex and subtle. In a data warehouse, on the other hand, data is always maintained in a consistent fashion.
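Using the gender-code example above, the conforming step might look like the following minimal sketch; only the mapping dictionary itself is an assumption beyond the codes quoted in the text.

    # Sketch: conform three different operational encodings of gender to one value.
    GENDER_CODES = {
        "m": "male", "f": "female",   # system A
        "1": "male", "2": "female",   # system B
        "b": "male", "g": "female",   # system C
    }

    def conform_gender(source_value):
        return GENDER_CODES[str(source_value).lower()]

    print(conform_gender("m"), conform_gender(2), conform_gender("g"))   # male female female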
Time Variant
Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data. Operational databases, in contrast, contain only the most current, up-to-date data values. Furthermore, they generally maintain this information for no more than a year (and often much less). In contrast, data warehouses contain data that is generally loaded from the operational databases daily, weekly, or monthly which is then typically maintained for a period of 3 to 10 years. This is a major difference between the two types of environments.
Historical information is of high importance to decision makers, who often want to understand trends and relationships between data. For example, the product manager for a Liquefied Natural Gas soda drink may want to see the relationship between coupon promotions and sales. This is information that is almost impossible - and certainly in most cases not cost effective - to determine with an operational database.
Non-Volatile
Non-volatility, the final primary aspect of data warehouses, means that after the data is loaded into the data warehouse, there are no changes, inserts, or deletes performed against the historical data. The data warehouse is, of course, first loaded with transformed data that originated in the operational databases.
The data warehouse is subsequently reloaded or, more likely, appended on a periodic basis (usually nightly, weekly, or monthly) with new transformed data from the operational databases. Outside of these loading processes, the data in the data warehouse generally stays static. Because of this non-volatility, the data warehouse can be heavily optimized for query processing.
Steps for Designing a Data Warehouse - Kimball's Approach
One of the popular approaches to building a data warehouse, popularized by Kimball, is known as the "Divide-and-Conquer" or "Building Data Marts and Uniting" approach. An enterprise data warehouse emerges as the union of a set of separate data marts implemented over a period of time, possibly by different design teams, and possibly on different hardware and software platforms.
The important steps are as follows:
1.  Choose the Business Process - a typically insular IT approach
2.  Choose the Grain - data modelling - starting with the lowest level of granularity
3.  Identify and conform the Dimensions - data modelling
4.  Choose the Facts - data modelling
5.  Store pre-calculations in the Fact tables - data modelling
6.  Round out the Dimension Tables - more data modelling
7.  Choose the duration of the Database - data modelling
8.  Track slowly changing dimensions - data modelling (a short sketch of this step follows the list)
9.  Decide on the query priorities and the query modes - operational aspects
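As one hedged illustration of step 8, the sketch below tracks a slowly changing dimension in the common "Type 2" style: when an attribute changes, the current row is expired and a new row with a new surrogate key is added. The column names and in-memory table are assumptions, not Kimball's own code.

    # Sketch: Type 2 slowly changing dimension handling for a customer's city.
    import datetime

    dim_customer = [
        {"customer_key": 1, "customer_id": "C42", "city": "Chicago",
         "valid_from": datetime.date(1999, 1, 1), "valid_to": None},
    ]

    def apply_scd2(dim, customer_id, new_city, change_date):
        current = next(r for r in dim
                       if r["customer_id"] == customer_id and r["valid_to"] is None)
        if current["city"] != new_city:
            current["valid_to"] = change_date                  # expire the old version
            dim.append({"customer_key": len(dim) + 1,          # new surrogate key
                        "customer_id": customer_id, "city": new_city,
                        "valid_from": change_date, "valid_to": None})

    apply_scd2(dim_customer, "C42", "Denver", datetime.date(2000, 6, 1))
    for row in dim_customer:
        print(row)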
The Kimball "approach" evolved from a need to allow small-time vendors to enter the DW solution space with their silo approaches to business information; that tendency, which is a facet of intellectual sloppiness in IT, has continued to be embraced and extended.


Definition of metadata

Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:
  • Means of creation of the data
  • Purpose of the data
  • Time and date of creation
  • Creator or author of data
  • Placement on a computer network where the data was created
  • Standards used
For example, a digital image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.
Metadata is data. As such, metadata can be stored and managed in a database, often called a registry or repository. However, it is impossible to identify metadata just by looking at it because a user would not know when data is metadata or just data.[1]


Metadata and data warehousing

A data warehouse (DW) is a repository of an organization's electronically stored data. Data warehouses are designed to manage and store the data, whereas Business Intelligence (BI) focuses on the use of data to facilitate reporting and analysis.[17]
The purpose of a data warehouse is to house standardized, structured, consistent, integrated, correct, cleansed and timely data, extracted from various operational systems in an organization. The extracted data is integrated in the data warehouse environment in order to provide an enterprise wide perspective, one version of the truth. Data is structured in a way to specifically address the reporting and analytic requirements.
An essential component of a data warehouse/business intelligence system is the metadata and tools to manage and retrieve metadata. Ralph Kimball[18] describes metadata as the DNA of the data warehouse as metadata defines the elements of the data warehouse and how they work together.
Kimball et al.[19] refer to three main categories of metadata: technical metadata, business metadata and process metadata. Technical metadata is primarily definitional, while business metadata and process metadata are primarily descriptive. Keep in mind that the categories sometimes overlap.
  • Technical metadata defines the objects and processes in a DW/BI system, as seen from a technical point of view. It includes the system metadata that defines data structures such as tables, fields, data types, indexes and partitions in the relational engine, as well as databases, dimensions, measures, and data mining models. Technical metadata also defines the data model and the way it is displayed to the users, including the reports, schedules, distribution lists and user security rights.
  • Business metadata is content from the data warehouse described in more user-friendly terms. The business metadata tells you what data you have, where it comes from, what it means and what its relationship is to other data in the data warehouse. Business metadata may also serve as documentation for the DW/BI system. Users who browse the data warehouse are primarily viewing the business metadata.
  • Process metadata is used to describe the results of various operations in the data warehouse. Within the ETL process, all key data from tasks are logged on execution: start time, end time, CPU seconds used, disk reads, disk writes and rows processed. When troubleshooting the ETL or query process, this sort of data becomes valuable. Process metadata amounts to the fact measurements of building and using a DW/BI system. Some organizations make a living out of collecting and selling this sort of data to companies; in that case the process metadata becomes the business metadata for the fact and dimension tables. Process metadata is also of interest to business people, who can use it to identify the users of their products, which products they are using and what level of service they are receiving. (A small sketch of such logging follows this list.)
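A minimal sketch of capturing process metadata during an ETL run follows; the task name and the in-memory log are invented stand-ins for a real metadata repository.

    # Sketch: log the facts about each ETL task run so the load itself can be analyzed.
    import time

    process_metadata_log = []

    def run_etl_task(name, task):
        started = time.time()
        rows = task()                       # the task reports how many rows it processed
        process_metadata_log.append({
            "task": name,
            "start_time": started,
            "end_time": time.time(),
            "rows_processed": rows,
        })

    run_etl_task("load_stg_loans", lambda: 12500)   # stand-in for a real load step
    print(process_metadata_log)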
Requirements Gathering for Data Warehouse Design
  • Brainstorming
  • Interviews
  • Dimensional data modeling
  • Process and context modeling
  • Prototyping
  • Storyboarding



ROLAP

The underlying data in this model is stored in relational databases. Because the data is stored in relational databases, this model gives the appearance of traditional OLAP's slicing and dicing functionality.

Advantages -
The advantages of this model are that it can handle a large amount of data and can leverage all the functionalities of the relational database.

Disadvantages -
The disadvantages are that performance is slow and each ROLAP report is an SQL query, with all the limitations of the genre. It is also limited by SQL functionalities.

ROLAP vendors have tried to mitigate this problem by building complex out-of-the-box functions into the tool, as well as providing users with the ability to define their own functions.

MOLAP

This is the traditional mode in OLAP analysis. In MOLAP, data is stored in the form of multidimensional cubes and not in relational databases.

Advantages -
The advantages of this mode are that it provides excellent query performance and the cubes are built for fast data retrieval. All calculations are pre-generated when the cube is created and can easily be applied while querying data.

Disadvantages -
The disadvantages of this model are that it can handle only a limited amount of data. Since all calculations are pre-built when the cube is created, the cube cannot be derived from a large volume of data. This deficiency can be bypassed by including only summary-level calculations while constructing the cube. This model also requires a huge additional investment, as cube technology is proprietary and the knowledge base may not exist in the organization.



As noted on the DSS page, On-line Analytical Processing permits a sophisticated multi-dimensional analysis of data that can, in turn, be used for decision-making purposes. Though the boundaries of OLAP with respect to other forms of decision support are somewhat vague, OLAP products must provide at least the following minimal set of functions (a short sketch after the list illustrates them):
  • Roll-up. The roll-up operation collapses the dimension hierarchy along a particular dimension(s) so as to present the remaining dimensions at a coarser level of granularity.
  • Drill-down. In contrast, the drill-down function allows users to obtain a more detailed view of a given dimension.
  • Slice. Here, the objective is to extract a slice of the original cube corresponding to a single value of a given dimension. No aggregation is required with this option. Instead, we are allowing the user to focus in on values of interest.
  • Dice. A related operation is the dice. In this case, we are defining a subcube of the original space. In other words, by specifying value ranges on one or more dimensions, the user can highlight meaningful blocks of aggregated data.
  • Pivot. The pivot is a simple but effective operation that allows OLAP users to visualize cube values in more natural and intuitive ways.
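To make the five operations concrete, here is a small sketch on a toy cube; pandas merely stands in for an OLAP server, and the region/quarter/product data is made up.

    # Sketch: roll-up, drill-down, slice, dice and pivot on a toy data cube.
    import pandas as pd

    cube = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "product": ["loans", "loans", "cards", "cards"],
        "revenue": [100, 120, 80, 95],
    })

    rollup = cube.groupby("region")["revenue"].sum()                 # roll-up: coarser grain
    drill  = cube.groupby(["region", "quarter"])["revenue"].sum()    # drill-down: finer grain
    slice_ = cube[cube["quarter"] == "Q1"]                           # slice: one dimension value
    dice   = cube[(cube["region"] == "North") & (cube["product"] == "loans")]  # dice: a subcube
    pivot  = cube.pivot_table(index="region", columns="quarter",
                              values="revenue", aggfunc="sum")       # pivot: re-orient the view
    print(rollup, drill, slice_, dice, pivot, sep="\n\n")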
While the previous list formally describes the key OLAP functions, it is often helpful for prospective customers or users to think of OLAP systems in more informal terms. The OLAP Report has defined a metric they call FASMI or Fast Analysis of Shared Multidimensional Information. In essence, FASMI is a means by which to grade or compare OLAP products. The FASMI criteria are presented in the following list:
  • Fast. Vendors must be able to efficiently trade off pre-calculation costs and storage requirements with real-time query response. Studies have shown that users are likely to abort queries that take longer than thirty seconds to complete.
  • Analysis. Tools should not only provide the five fundamental operations but extras such as times series analysis, currency translation, and data mining capabilities. Most of the business and analytical logic should be available without sophisticated 4GL programming.
  • Shared. Security and concurrency control should be available when required. It must be noted however that most OLAP systems assume that user-level updates will not be necessary.
  • Multidimensional. This is the key FASMI requirement. Whether implemented with ROLAP or MOLAP (discussed below), the user must see the data in subject-oriented hierarchies.
  • Information. Applications must be able to handle vast amounts of data. Again, regardless of the server model that is used, good OLAP applications may have to support data cubes that scale to the terabyte range.
Physical Architecture

While OLAP products can be as simple as a spreadsheet extension, larger systems are almost invariably associated with an underlying data warehouse. The classic definition of the data warehouse was provided by W. H. Inmon when he described it as a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process." What does this imply? In a nutshell, the definition tells us that, in contrast to transaction-oriented databases (OLTP), a data warehouse is organized around subjects rather than atomic transactions; it represents aggregated or summarized information from a variety of sources; it houses data collected over very long periods, typically years; and it is tuned for read-only access.
In general, data warehouses can be seen as three-tiered data models. As depicted in the figure below, information is first extracted from operational sources and then cleaned, transformed and loaded into the data warehouse. Though this first step is itself outside the scope of the data warehouse proper, it is nonetheless a crucial activity that has received considerable attention from OLAP vendors. Often, the production data resides in a collection of remote, heterogeneous repositories and must undergo considerable massaging before it can be integrated into a single clean store.
The OLAP Architecture
In any case, once the data has been culled from the remote sources, it is placed into the data warehouse, which at this point in time is almost always a relational DB. The data warehouse itself may be constructed as a monolithic enterprise-wide entity and/or a series of data marts, each containing some subset of the corporate data. In either case, it will be the job of the OLAP server at level three to actually supply analytical functionality for the DSS system. In practice, there are two forms of OLAP servers, known as ROLAP and MOLAP (discussed in greater detail below), that may be used for this purpose, though conceptually at least their aims are quite similar. Finally, in the top-tier, we find the front end tools that provide a user-friendly (often graphical) interface to the knowledge workers who will exploit the system.
MOLAP versus ROLAP

One of the most important elements of OLAP environments is their reliance upon multi-dimensional data values. As mentioned previously, data warehouses represent subject-oriented records rather than transaction-oriented ones. As such, aggregated values can be viewed as existing within a logical cube, where the user is free to index the cube on one or more dimensional axes. In the nomenclature of OLAP, this type of conceptual representation of the data gives rise to what is known as a data cube. It is this cube-like model, in fact, that is the focus of the second tier of the DSS architecture, the OLAP server.
Since the data cube suggests a multi-dimensional interpretation of the data space, a number of OLAP vendors have chosen to physically model the cube as a multi-dimensional array. These MOLAP (multi-dimensional OLAP) products offer rapid response time on OLAP queries since it is possible to index directly into the data cube structure to retrieve subsets of aggregated data. Unfortunately, MOLAP solutions do not always scale effectively to large, high-dimensionality data sets (though MOLAP capacity has grown significantly in recent years). The problem is that as the number of dimensions grows, the data in the data cube becomes increasingly sparse. In other words, many of the attribute combinations represented by the data cube structure do not contain any aggregated data. As such, a fully materialized MOLAP array can contain an enormous number of empty cells, resulting in unacceptable storage requirements. Compression techniques can be employed to alleviate this problem, though doing so partially destroys the natural indexing that makes MOLAP so appealing. Nevertheless, it must be noted that virtually all of today's successful OLAP products are in fact MOLAP solutions.
In contrast, relational OLAP (ROLAP) seeks to exploit the maturity and power of the relational paradigm. Instead of a multi-dimensional array, the ROLAP data cube is implemented as a collection of up to 2^d relational tables (where d is the number of dimensions), each representing a particular cuboid or view. Because the cuboids are now conventional database tables, they can be processed and queried with traditional RDBMS techniques (e.g., indexes and joins). Moreover, they may be more efficient on large data warehouses since only those data cube cells that actually contain data are housed within the tables. On the downside, there is no "built-in" indexing with a ROLAP cube as there would be with a MOLAP implementation. Instead, all attribute values within the record must be included with the aggregated or summary values so that the record's position within the cube can be determined (one might liken this to a fully-qualified path name in operating system terminology). This additional overhead might offset some of the space savings, particularly in dense environments. Furthermore, the absence of an implicit index implies that an explicit one must be provided. In practice, this can be a challenge of considerable importance since efficient multi-dimensional indexing techniques are notoriously difficult to design.
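The "up to 2^d tables" figure follows because each cuboid corresponds to one subset of the d dimensions being aggregated over; the short sketch below enumerates them for an assumed three-dimensional cube.

    # Sketch: enumerate the 2**d cuboids (group-by combinations) of a 3-dimensional cube.
    from itertools import combinations

    dimensions = ["region", "quarter", "product"]          # d = 3
    cuboids = [combo for r in range(len(dimensions) + 1)
               for combo in combinations(dimensions, r)]

    for c in cuboids:
        print(c or ("<grand total>",))                     # the empty subset is the grand total
    print(len(cuboids), "cuboids = 2 **", len(dimensions))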


