Anurag
Information
Technology (IT) has historically influenced organizational performance and
competitive standing. The increasing processing power and sophistication of
analytical tools and techniques have put the strong foundation for the product
called data warehouse. There are a number of reasons that any organization
should consider a data warehouse, which can be the critical tool for maximizing
the organization’s investment in the information it has collected and stored
throughout the enterprise. IT managers need to understand the rationale and
benefits of data warehouses because they may need to design and implement, or
procure this kingpin of business intelligence.
The
data warehouses are supposed to provide storage, functionality and
responsiveness to queries beyond the capabilities of today's
transaction-oriented databases. Also data warehouses are set to improve the
data access performance of databases. Traditional databases balance the
requirement of data access with the need to ensure integrity of data. In
present day organizations, users of data are often completely removed from the
data sources. Many people only need read-access to data, but still need a very
rapid access to a larger volume of data than can conveniently by downloaded to
the desktop. Often such data comes from multiple databases. Because many of the
analyses performed are recurrent and predictable, software vendors and systems
support staff have begun to design systems to support these functions. Currently
there comes a necessity for providing decision makers from middle management
upward with information at the correct level of detail to support
decision-making. Data warehousing, online analytical processing (OLAP) and data
mining provide this functionality.
Here,
we are to discuss what is all about data warehouse, how it helps to gain a
competitive edge for an organization.
Data Warehouse - An Introduction
A
data warehouse is defined as a subject-oriented, integrated, nonvolatile,
time-variant collection of data in support of management's decisions. More
generally, data warehousing is a collection of decision support technologies,
aimed at enabling the knowledge worker, such as executive, manager, and
analyst, to arrive at better and faster decisions. Data warehouses provide
access to data for complex analysis, knowledge discovery, and decision-making.
style="mso-spacerun: They support high performance demands on an
organization's data and information. It provides an enormous amount of
historical and static data from three tiers:
1.
Relational
databases
2.
Multidimensional
OLAP applications
3.
Client
analysis tools
Several
types of applications such as online analytical processing (OLAP), decision-support
systems (DSS) and data mining are being supported. OLAP is a term
used to describe the analysis of complex data from the data warehouse.
OLAP
is a software technology that allows users to easily and quickly analyze and
view data from multiple points-of-view. OLAP provides dynamic and
multi-dimensional support to executives and managers who need to understand
different aspects of the data. Activities that are supported include:
§ Analyzing financial
trends
§ Creating slices of
data
§ Finding new
relationships among the data
§ Drilling down into
sales statistics
§ Doing calculations
through different dimensions where each category of data (that is, product,
location, sales numbers, time period, etc.) is considered a dimension.
There
are OLAP tools that use distributed computing capabilities for analyses that
require more storage and processing power than can be economically and
efficiently located on an individual desktop.
DSS
support an organization's leading decision makers with higher-level data for
complex and critical decisions. A DSS queries a data warehouse or an OLAP
database for relevant information that can be compared in order to make a
business decision and predict the impact of that decision.
Finally,
data mining is being used for knowledge discovery, the process of searching
data for unanticipated new knowledge.
Knowledge
workers and decision makers use tools ranging from parametric queries to ad hoc
queries to data mining. Thus, the access component of the data warehouse must
provide support of structured queries (both parametric and ad hoc). These together
make up a managed query environment.
Databases Vs Data Warehouses
A
database is a collection of related data and a database system is a database
and database software together.
A
data warehouse is also a collection of information as well as a supporting
system.
Databases
are transactional such as relational, object-oriented, network or hierarchical.
Traditional databases support on-line transaction processing (OLTP), which
includes insertions, updates, and deletions, while also supporting information
query requirements. Traditional databases are optimized to process queries that
may touch a small part of the database and transactions that deal with
insertions or updates of a few tuples per relation to process.
Thus
databases must strike a balance between efficiency in transaction processing
and supporting query requirements (ad hoc user requests), That is, they can't
further optimized for the applications such as OLAP, DSS and data mining.
But
a data warehouse is typically optimized for access from a decision maker's
needs. Data warehouses are designed specifically to support efficient
extraction, processing and presentation for analytic and decision-making
purposes.
In
contrast to databases, data warehouses generally contain very large amounts of data
from multiple sources that may include databases from different data models and
sometimes files acquired from independent systems and platforms.
Multidatabases provide access to
disjoint and usually heterogeneous databases and are volatile. Whereas a data
warehouse is frequently a store of integrated data from multiple sources,
processed for storage in a multidimensional model and nonvolatile. Data
warehouses also support time-series and trend analysis, both of which require
more historical data.
In
transactional systems, transactions are the unit and are the agent of change to
the database, but data warehouse information is much more coarse-grained and is
refreshed according to a careful choice of incremental refresh policy.
Warehouse updates are handled by the warehouse's acquisition component that
provides all required processing. As data warehouses encompass large volumes of
data, they are more or less double the size of source databases.
The
sheer volume of data likely to be in terabytes is an issue that has been dealt
with through enterprise-wide data warehouses, virtual data warehouses and data
marts. Enterprise-wide data warehouses are huge projects in need of massive
investment of time and resources. Virtual data warehouses are bound to provide
views of operational databases that are materialized for efficient access. A
data mart is an easy-to-access repository of a subset of highly focused data
for a single function or department (i.e., finance, sales, marketing) and is
considerably smaller than a data warehouse. The data comes form operational
information that is needed by a particular group of employees for analysis,
content, presentations all in terms that are familiar to them. Data for a data
mart is derived from a data warehouse or from more specialized access.
Distinctive Characteristics of Data Warehouses
a.
Data
warehouses are supposed to be blessed with the following unique features.
b.
Multidimensional
conceptual view and generic dimensionality,
c.
Unlimited
dimensions and aggregation levels and unrestricted cross-dimensional
operations,
d.
Dynamic
sparse matrix handling,
e.
Client/server
architecture and multi-user support,
f.
Accessibility
and transparency, intuitive data manipulation and consistent reporting
performance
As
data warehouses are not much particular about transaction processing, there is
an increased efficiency in query processing. There are some specialized tools
and techniques. They are query transformation, index intersection and union,
special ROLAP (Relational OLAP), MOLAP (multidimensional OLAP), DOLAP
(Database OLAP) and WOLAP (Web OLAP) functions, SQL extensions, advanced
join methods, and intelligent scanning.
Traditional
OLAP products are also known as multidimensional OLAP. Relational OLAP tools
take data from traditional two-dimensional or relational databases and create
multidimensional views upon request rather than being prepared in advance as in
OLAP. ROLAP is often used on complex data with a wide number of fields, such as
customer data. DOLAP is a relational database management system designed to
perform OLAP calculations. WOLAP refers to OLAP data that can be reached from a
Web server.
Parallel
processing can enhance the performance of data warehouse. Parallel server
architectures include symmetric
Data Modeling for Data Warehouses
Multidimensional
models highly take advantage of inherent relationships existing in data to
populate data in multidimensional matrices referred to as data cubes. If the
dimensional of the matrix is greater than three, then it is called hypercubes.
Query performance in multidimensional matrices for data that lend themselves to
dimensional formatting can be much better than in the relational data model.
For a corporate data warehouse, three examples of dimensions would be the
corporation's fiscal periods, products and regions.
Multidimensional
models lend themselves readily to hierarchical views such as roll-up display
and drill-down display. Roll-up display moves up the hierarchy, grouping into
larger units along a dimension. A drill-down display provides the opposite
capability, furnishing a finer-grained view through disaggregating process.
The
Multidimensional storage model involves two types of tables: dimension
tables and fact tables. A dimension table consists of tuples of attributes
of the dimension. A fact table can be thought of as having tuples, one per a
recorded fact. This fact contains some measured or observed variables and
identifies them with pointers to dimension tables.
Two
common multidimensional schemas are the star schema and the snowflake
schema. The star schema consists of a fact table with a single table for
each dimension. The snowflake schema is a variation on the star schema in which
the dimensional tables from a star schema are organized into a hierarchy by
normalizing them. A fact constellation is a set of fact tables that
share some dimension tables.
Data
warehouse storage also utilizes indexing techniques to support high performance
access. A technique called bitmap indexing constructs a bit vector for each value
in a domain (column) being indexed. It does well for domains of
low-cardinality. Bitmap indexing can provide considerable input/output and
storage space advantages in low-cardinality domains. With bit vectors a bitmap
index can provide dramatic improvements in comparison, aggregation, and join
performance.
In
a star schema, dimensional data can be indexed to tuples in the fact table by
join indexing. Join indexes are traditional indexes to maintain relationships
between primary key and foreign key values. They relate the values of a
dimension of a star schema to rows in the fact table. Data warehouse storage
can facilitate access to summary data by taking further advantage of the
nonvolatility of data warehouses and a degree of predictability of the analyses
that will be performed using them.
Building a Data Warehouse
Warehouse
builders should take a broad view of the anticipated use of the warehouse. The
design should specifically support adhoc querying that is, accessing data with
any meaningful combination of values for the attributes in the dimension or
fact tables. The following steps are being involved during the data acquisition
phase.
1.
The data must be extracted from multiple, heterogeneous sources such as
databases or other data feeds.
2.
Data must be formatted for consistency within the data warehouse. Names,
meanings and domains of data from unrelated sources must be reconciled.
3.
The data must be cleaned to ensure validity. Data cleaning is an important part
in building a data warehouse and it is being touted as the largest
labor-demanding one. For input data, cleaning process has to be performed
before the data are loaded in the warehouse. Data warehouse builders have to
check for validity and quality when the input data must be examined and
formatted consistently.
4.
The data must be fitted into the data model of the warehouse. Data from the
various sources must be installed in the data model of the warehouse. Data may
have to be converted from relational, object-oriented, or legacy databases.
5.
The data must be loaded into the warehouse. The sheer volume of data in the
warehouse makes loading the data a significant task. Monitoring tools for loads
as well as methods to recover from incomplete or incorrect loads are required.
With the huge volume of data in the warehouse, incremental updating is usually
the only feasible approach. The refresh policy will probably emerge as a
compromise that takes into account the answers to the following questions.
How
up-to-date the data must be?
Can
the warehouse go off-line, and for how long?
What
are the data interdependencies?
What
is the storage availability?
What
are the distribution requirements such as for replication and partitioning?
What
is the loading time including cleaning, formatting, copying, transmitting and
overhead such as index rebuilding?
Data
storage in a data warehouse involves the following processes:
Storing
the data according to the data model of the warehouse
Creating
and maintaining required data structures
Creating
and maintaining appropriate access paths
Providing
for time-variant data as new data are added
Supporting
the updating of warehouse data
Refreshing
and purging the data
The
sheer volume of data in the warehouse generally makes it impossible to simply
reload the warehouse in its entirety later on. There are a couple of
alternatives for this problem such as selective refreshing of data and separate
warehouse versions. When the warehouse uses an incremental data refreshing
mechanism, data may need to be periodically purged.
Data
warehouses should also be designed with by taking the environment in which they
reside into account. The important points behind the data warehouse design are
.
Usage projections
.
The fit of the data model
.
Characteristics of available sources
.
Design of the metadata component
.
Modular component design
.
Design for manageability and change
There
are mainly two types of data warehouses. They are distributed warehouse and
federated warehouse. For a distributed data warehouse, all the issues such as
replication, partitioning, communication and consistency, of distributed
databases are taken into account. As usual, the benefits of distribution, such
as load balancing, scalability of performance and higher availability, are available
with distributed data warehouse. A single replicated metadata repository would
reside at each distribution site.
The
idea of the federated warehouse is like that of the federated database: a
decentralized confederation of autonomous data warehouses, each with its own
metadata repository. Given the magnitude of the challenge inherent to data
warehouses, it is likely that such federations will consist of smaller-scale
components, such as data marts. Large organizations may choose to federate data
marts rather than build huge data warehouses.
Functionality of Data Warehouses
Data
warehouses exist to facilitate complex, data-intensive and frequent adhoc
queries. Data warehouses must provide far greater and more efficient query
support than is demanded of transactional databases. The data warehouse access
component supports enhanced spreadsheet functionality, efficient query
processing, structured queries, adhoc queries, data mining and materialized
views. Particularly enhanced spreadsheet functionality includes support for
state-of-the art spreadsheet applications as well as for OLAP applications
programs. These provide preprogrammed functionalities such as the following:
Roll-up: Data is summarized
with increasing generalization
Drill-down: Increasing levels
of detail are revealed
Pivot: Cross tabulation
that is, rotation, performed
Slice
and dice:
Performing projection operations on the dimensions
Sorting: Data is sorted by
ordinal value
Selection: Data is available
by value or range
Derived
or computer attributes: Attributes are computed by operations on stored and
derived values.
Benefits of Data Warehouses
The
core benefits include
·
Historical
information for comparative and competitive analysis
·
Enhanced
data quality and completeness
·
Supplementing
disaster recovery plans with another data back up source.
Among
the greatest benefits of a data warehouse is the ability to analyze and execute
business decisions based on data from multiple sources. For example, an
organization has collected valuable data and stored it in 30 databases. A data
warehouse is not only a convenient way to analyze and compare data in all the
databases, but it can also give historical data and perspective. Thus data
warehouse is a one-stop shop, but it is also a one-stop shop from an historical
perspective as well. Using data warehouse, one can look at past trends, whether
they be product sales or customers or whatever and may be do some predictions
of what is going to happen in the future.
Also
data retrieved from multiple databases is not constrained by the tables in each
of those databases. A data warehouse receives application neutral data.
Whatever database application is supplying the information to the data
warehouse is not preconditioning the data to be presented in a way the
originator of the data requires. That means, the data from the inventory
system, the financial system, or the sales system is sent to the data warehouse
for processing as application neutral data that is not formatted to answer only
queries from an inventory database, finance database, or sales database
program. If not for application-neutral data, the data warehouse would be
nothing more than a collection of data marts.
A
data warehouse by itself does not create value, but value comes from the use of
the data in the warehouse. In support of a low cost strategy, the data
warehouse can provide savings in billing processes, reduce fraud losses, and
reduce the cost of reporting. The data warehouses can provide analysts with
precalculated reports and graphs. This increases the productivity of business
analysts.
Most
companies can benefit from a data warehouse when the proper tools are in place
and users are trained in analysis of results.
Conclusion
However,
data warehouses are still an expensive solution and typically found in large
firms. The development of a central warehouse is a huge undertaking and capital
intensive with large, potentially unmanageable risks.
Comments
Post a Comment