February 2, 2011

Enterprise-specific Data Management

Traditionally, the database market divides into transaction processing (OLTP) and analytical processing (OLAP) workloads. OLTP workloads are characterized by a mix of reads and writes to a few rows at a time, typically through a B+Tree or other index structures. Conversely, OLAP applications are characterized by bulk updates and large sequential scans spanning few columns but many rows of the database, for example to compute aggregate values. Typically, those two workloads are supported by two different types of database systems transaction processing systems and warehousing systems.
In fact, enterprise applications today are primarily focused on the day-to-day transaction processing needed to run the business while the analytical processing necessary to understand and manage the business is added on after the fact. In contrast to this classification, single applications such as Available-To-Promise (ATP) or Demand Planning exist, which cannot be exclusively referred to one or the other workload category. These applications initiate a mixed workload in that they process small sets of transactional data at a time including write operations and simple read queries as well as complex, unpredictable mostly-read operations on large sets of data with a projectivity on a few columns only. Having a mixed workload is nothing new and has been analyzed on database level a decade ago - the insight that it is originated by a single application is new. Given this and the fact that databases are either built for OLTP or OLAP, it is evident that there is no database management system that adequately addresses the needed characteristics for these complex enterprise applications.
For example, within sales order processing systems, the decision about the ability to deliver the product at the requested time relies on the ATP check. The execution of this results in a confirmation for the sales order containing information about the product quantity and the delivery date. Consequently, the checking operation leads to a database request summing up all available resources in the context of the specific product. Apparently, materialized aggregates could be seen as one solution to tackle the expensive operation of on-the-fly aggregation. However, they fail in processing real-time order rescheduling due to incoming high priority orders leading to a reallocation of all products. Considering this operation as the essential part of the present ATP application encompasses characteristics of analytical workloads with regards to low selectivity and low projectivity as well as aggregation functionality are used and read-only queries are executed. Alongside the aforementioned check operation the write operations declare products as promised to customers work on fine-granular transactional level. While looking at the characteristics of these write operations it is obvious that they belong to the OLTP category as these write-queries are inherent of a high selectivity and high projectivity.
Furthermore, there is an increasing demand for “real-time analytics” – that is, up-to-the moment reporting on business processes that have traditionally been handled by data warehouse systems. Although warehouse vendors are doing as much as possible to improve response times (e.g., by reducing load times), the explicit separation between transaction processing and analytics systems introduces a fundamental bottleneck in analytics scenarios. While the predefinition of data to be extracted and transformed to the analytics system leads to the fact that analytics-based decisions are made on a subset of potential information the separation of system prevents transactional applications from using analytics functionality throughout the transaction processing due to the latency that is inherent in the data transfer.
The aforementioned, simplified example of a complex enterprise application shows workload characteristics, which match with those associated with OLTP and OLAP. As a consequence, nowadays database management systems cannot fulfill the requirements of specific enterprise applications since they are optimized for one or the other category leading to a mismatch of enterprise applications regarding the underlying data management layer. This is mainly because conventional database management systems cannot execute certain important complex operations in a timely manner. While this problem is widely recognized for analytical applications, it also pertains to sophisticated transactional applications. To meet this issue, enterprise applications have become increasingly complicated to make up for shortcomings in the data management infrastructure. One of these solutions has been the packaging of operations as long-running batch jobs. Consequently, this approach slows down the rate at which business processes can be completed, possibly exceeding external requirements. Maintaining pre-computed, materialized results of the operations is another solution that both introduces a higher system complexity due to the maintenance of redundant data and decreases flexibility of applications with the necessity to predefine materialization strategies.

Trends in Enterprise Data Management

Enterprise applications heavily rely on database management systems to take care of the storage and pro- cessing of their data. A common assumption of how enterprise applications work (row based, many updates) has led to decades of database research. A rising trend in database research shows how important it is to rethink how persistence should be managed to leverage new hardware possibilities and discard parts of the over 20-year old data management infrastructure.
The overall goal is to define application persistence based on data characteristics and usage patterns of the consuming applications in realistic customer environments. Of course it has to be considered that some of the characteristics may be weakened due to the fact that they use “old” data management software. Stonebraker et. al proposes a complete redesign of database architectures considering latest trends in hardware and based on actual usage of data in [5, 6]. Besides, Vogels et. al describe in [2] the design and implementation of a key-value storage system that sacrifices consistency under certain failure scenarios and makes use of object versioning and application-assisted conflict resolution.
We analyzed the typical setup for enterprise applications consisting of a transactional OLTP part and an analytical OLAP part. The major difference between both applications is the way how data is stored and that data is typically kept redundantly available in both systems.
While OLTP is well supported by traditional row-based DBMS, OLAP applications are less efficient on such a layout, which has led to the development of several OLAP-specific storage schemes, in particular multidimensional schemas. However, these schemas are often proprietary and difficult to be integrated with the bulk of enterprise data that is stored in a relational DBMS. This poses serious problems to applications which are to support both, the OLAP and the OLTP world. Given that, complex OLTP queries, such as the computation of a stock for a certain material in a specific location cannot be done on the actual material movements events but is done by using pre-computed aggregates on a pre-defined granularity level. However, three recent developments have the potential to lead to new architectures which may well cope with such demands [4]:
  • In-Memory Databases,
  • Column-Oriented Storage Schemas, and
  • Query-aware light-weight Compression.
With up to several terabytes of main memory available to applications as well as the increase in computing power with new multi core hardware architectures holding entire databases in main memory becomes feasible; the application of these in-memory databases is especially promising in the field of enterprise applications. Based on characteristics derived from analyzing customer systems storage techniques such as column-orientation and lightweight compression, this work reevaluates the applicability of those in an enterprise environment. To address the above mentioned mixed workload the database management layer has to be aware of this fact and optimized towards these contradicting workloads by leveraging current advances in hardware while reevaluating data storage techniques and characteristics derived from analyzing customer systems.

Enterprise Application Characteristics

In order to define the requirements of an enterprise-specific data management realistic enterprise applications and customer data have been analyzed. The most important findings based on the research so far and their implications on database design are:
  • Enterprise applications typically present data by building a context for a view, modifications to the data only happen rarely. In fact, over 80% of the workload in an OLTP environment are read operations. Hence column-oriented, in-memory databases that are optimized for read operations perform especially well in enterprise application scenarios.
  • Tables for transactional data typically consist of 100-300 columns and only a narrow set of attributes is accessed in typical queries. Column-oriented databases benefit significantly from this characteristic as entire columns, rather than entire rows, can be read in sequence.
  • Enterprise data is sparse data with a well known value domain and a relatively low number of distinct values. Therefore data of enterprise applications qualifies very well for data compression as these techniques exploit redundancy within data and knowledge about the data domain for optimal results. Abadi et al. have shown in [1] that compression applies particularly well to columnar storages. Since all data within a column a) has the same data type and b) typically has similar semantics and thus low information entropy, i.e. there are few distinct values in many cases.
  • Enterprise applications typically reveal a mix of OLAP and OLTP characteristics [3]. To support both, the data storage of in-memory databases are split into two parts, one optimized for reading and one for writing.
  • Data growth in enterprise systems has not shown the same growth rate as for example social networks. Despite the fact of a growing number of captured events in enterprise environments all events are based on actual events related to the business and have an inherent processing limit due to the size of the company.
Given these findings on enterprise application our approach to build an application-specific data management is focused on in-memory data processing with data compression and column-wise data representation in order to utilize today's hardware as best as possible.


[1] Daniel J. Abadi, Samuel R. Madden,2In SIGMOD Conference, 2006.
[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. In SOSP ’07, 2007.
[3] Jens Krueger, Christian Tinnefeld, Martin Grund, Alexander Zeier, and Hasso Plattner. A Case for Online Mixed Workload Processing. In DBTest, 2010.
[4] Hasso Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD Conference, 2009.
[5] Michael Stonebraker and Ugur Cetintemel. ”One Size Fits All”: An Idea Whose Time Has Come and Gone. In ICDE, 2005.
[6] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The End of an Architectural Era (It’s Time for a Complete Rewrite). In VLDB, 2007.

Author: Jens Krueger

No comments:

Post a Comment