November 25, 2011

Active and Passive Data Store

By default, all data is stored in-memory to achieve high-speed data access. However, not all data is read or updated frequently and needs to reside in memory, as this unnecessarily increases the required amount of main memory. This so-called historic or passive data can be kept in a dedicated passive data store based on less expensive storage media, such as SSDs or hard disks, which still provides sufficient performance for occasional accesses at lower cost. The dynamic transition from active to passive data is supported by the database, based on custom rules defined according to customer needs. We define two categories of data stores: active and passive.

We refer to data as active when it is accessed frequently and updates are expected, e.g. access rules. In contrast, we refer to data as passive when it is neither updated nor read frequently. Passive data is used purely for analytical and statistical purposes, or in exceptional situations where specific investigations require it. For example, the tracking events of a pharmaceutical product that was sold five years ago can be considered passive data.

Why is this feasible? Firstly, from the business perspective, the pharmaceutical carries a best-before date of two years after its manufacturing date, i.e. even if the product is handled now, it is no longer allowed to be sold. Secondly, the product was sold to a customer four years ago, i.e. it left the supply chain and has typically been used within this timespan. Therefore, the probability that details about this particular pharmaceutical are queried is very low. Nonetheless, the tracking history needs to be preserved by law, e.g. to prove the path taken within the supply chain, or when sales details are analyzed to build a new long-term forecast based on historical data. This example gives an understanding of active and passive data.

Furthermore, introducing the concept of passive data reduces the amount of data that needs to be accessed in real time and enables archiving. As a result, once data is moved to a passive data store, it no longer consumes fast, directly accessible main memory and frees hardware resources. Dealing with passive data stores involves a memory hierarchy ranging from fast but expensive to slow and cheap. A possible storage hierarchy is: memory registers, cache memory, main memory, flash storage, solid state disks, SAS hard disk drives, SATA hard disk drives, tapes, etc. Consequently, rules for migrating data from one store to another need to be defined; we refer to them as the aging strategy or aging rules. The process of aging data, i.e. migrating it from a faster store to a slower one, is a background task that occurs on a regular basis, e.g. weekly or daily. Since this process involves reorganizing the entire data set, it should run during times with low data access, e.g. during nights or weekends.
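
As a rough sketch of such aging rules in Python: the two-year cutoff, the store layout, and the field names below are purely illustrative, and two plain lists stand in for the active (in-memory) and passive stores.

```python
from datetime import datetime, timedelta

# Hypothetical aging rule: tracking events whose sale lies further back
# than the cutoff are migrated from the active to the passive store.
AGE_CUTOFF = timedelta(days=2 * 365)

def is_passive(event, now):
    """Custom aging rule: the product has left the supply chain long ago."""
    return now - event["sold_at"] > AGE_CUTOFF

def run_aging_job(active_store, passive_store, now=None):
    """Background task, e.g. scheduled nightly or on weekends."""
    now = now or datetime.utcnow()
    aged = [e for e in active_store if is_passive(e, now)]
    passive_store.extend(aged)                                  # archive on cheaper media
    active_store[:] = [e for e in active_store if not is_passive(e, now)]
    return len(aged)

# Usage with made-up tracking events.
active = [{"product": "aspirin-123", "sold_at": datetime(2007, 5, 1)},
          {"product": "ibuprofen-7", "sold_at": datetime(2011, 10, 2)}]
passive = []
moved = run_aging_job(active, passive, now=datetime(2011, 11, 25))
print(moved, "event(s) moved to the passive store")   # -> 1
```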

Please also see our podcast on this technology concept.

November 23, 2011

Insert-only for Time Travel

Insert-only or append-only describes how data is managed when inserting new data. The principal idea of insert-only is that changes to existing data are handled by appending new tuples to the data storage. In other words, the database does not allow applications to perform updates or deletions on physically stored tuples of data.

This design approach allows the introduction of a specific write-optimized data store for fast writing and a read-optimized data store for fast reading. Traditional database systems support four operations for data manipulation: inserting, selecting, updating, and deleting data. The latter two are considered destructive, since the original data is no longer available after their execution. In other words, it is neither possible to detect nor to reconstruct all values of a certain attribute; only the latest value is available. Insert-only enables storing the complete history of value changes as well as the latest value of a certain attribute. For instance, this is also the foundation of all bookkeeping systems to guarantee transparency. For history-based access control, insert-only builds the basis for storing the entire history of queries used in access decisions. In addition, insert-only enables tracing of access decisions, which can be used to perform incident analysis.
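
A minimal sketch of the insert-only idea in Python (the class and method names are ours, not any product's API): logical updates are appended as new tuples, so both the latest value and the full history remain available.

```python
from datetime import datetime

class InsertOnlyTable:
    """Sketch of an append-only store: logical updates and deletes are
    expressed as newly appended tuples, so the complete value history
    of every attribute stays reconstructable (time travel)."""

    def __init__(self):
        self._rows = []   # physical storage is never modified in place

    def insert(self, key, value):
        # Each tuple carries its insertion time; the append order breaks ties.
        self._rows.append({"key": key, "value": value,
                           "valid_from": datetime.utcnow()})

    def latest(self, key):
        """Current value: the most recently appended tuple for the key."""
        versions = [r for r in self._rows if r["key"] == key]
        return versions[-1]["value"] if versions else None

    def history(self, key):
        """Time travel: every value the key ever had, in insertion order."""
        return [(r["valid_from"], r["value"]) for r in self._rows if r["key"] == key]

t = InsertOnlyTable()
t.insert("access_rule_42", "deny")
t.insert("access_rule_42", "allow")   # logical update = physical append
print(t.latest("access_rule_42"))     # -> allow
print(t.history("access_rule_42"))    # complete change history is preserved
```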

Please also see our podcast on this technology concept.

November 20, 2011

In-Memory Computing

High performance in-memory computing will change how enterprises work. Currently, enterprise data is split into two databases for performance reasons. Usually, disk-based row-oriented database systems are used for operational data and column-oriented databases are used for analytics (e.g. “sum of all sales in China grouped by product”). While analytical databases are often kept in-memory, they are also mixed with disk-based storage media.

Transactional data and analytical data are not stored in the same database: analytical data resides in separate data warehouses, to which it is replicated in batch jobs. In consequence, flexible real-time reporting is not possible and leaders are forced to make decisions based on insufficient information in very short time frames.

This is about to change, since hardware architectures have evolved dramatically during the past decade. Multi-core architectures and the availability of large amounts of main memory at low cost are enabling new breakthroughs in the software industry. It has become possible to store the data sets of entire Fortune 500 companies in main memory. At the same time, performance that is orders of magnitude faster than with disk-based systems can be achieved.

Traditional disks are one of the last remaining mechanical devices in a world of silicon and are about to become what tape drives are today: a device only necessary for backup. With in-memory computing and hybrid databases using both row and column-oriented storage where appropriate, transactional and analytical processing can be unified.
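
The following sketch illustrates the column-oriented side of such a hybrid store with made-up sales data: an aggregate query like the one mentioned above only needs to touch the columns it actually uses.

```python
from collections import defaultdict

# Illustrative sales records (made-up values).
rows = [
    {"product": "A", "country": "China",   "amount": 10},
    {"product": "B", "country": "China",   "amount": 20},
    {"product": "A", "country": "Germany", "amount": 5},
    {"product": "A", "country": "China",   "amount": 7},
]

# Column-oriented layout: one contiguous array per attribute.
columns = {name: [r[name] for r in rows]
           for name in ("product", "country", "amount")}

def sales_by_product_in(country):
    """'Sum of all sales in <country> grouped by product' on the column store."""
    totals = defaultdict(int)
    for product, ctry, amount in zip(columns["product"],
                                     columns["country"],
                                     columns["amount"]):
        if ctry == country:
            totals[product] += amount
    return dict(totals)

print(sales_by_product_in("China"))   # -> {'A': 17, 'B': 20}
```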

November 17, 2011

No Disk

For a long time the available amount of main memory on large server systems was not enough to hold the complete transactional data set of large enterprise applications.

Today, the situation has changed. Modern servers provide multiple terabytes of main memory and allow keeping the complete transactional data in memory. This eliminates multiple I/O layers and simplifies database design, allowing for high throughput of transactional and analytical queries.

Please also see our podcast on this technology concept.

November 15, 2011

MapReduce

MapReduce is a programming model to parallelize the work on large amounts of data. MapReduce took the data analysis world by storm, because it dramatically reduces the development overhead of parallelizing such tasks. With MapReduce the developer only needs to implement a map and reduce function, while the execution engine transparently parallelizes the processing of these functions among available resources.

HANA emulates the MapReduce programming model and allows the developer to define map functions as user-defined procedures. Support for the MapReduce programming model enables developers to implement specific analysis algorithms on HANA faster, without worrying about parallelization, which is handled efficiently by HANA's calculation engine.
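
A compact illustration of the programming model itself (not HANA's actual interface): the developer supplies only a map and a reduce function, while a small driver, standing in for the execution engine, parallelizes the map phase and groups intermediate results.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool   # thread pool as a stand-in execution engine

def map_fn(record):
    """User-defined map: emit (key, value) pairs, here (product, amount)."""
    product, amount = record
    return [(product, amount)]

def reduce_fn(key, values):
    """User-defined reduce: aggregate all values for one key."""
    return key, sum(values)

def map_reduce(records, map_fn, reduce_fn, workers=4):
    # The "engine" parallelizes the map phase transparently.
    with Pool(workers) as pool:
        mapped = pool.map(map_fn, records)
    # Shuffle: group intermediate pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

sales = [("A", 10), ("B", 20), ("A", 7)]
print(map_reduce(sales, map_fn, reduce_fn))   # -> {'A': 17, 'B': 20}
```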

Please also see our podcast on this technology concept.

November 11, 2011

Bulk Load

The In-Memory Database allows for switching between transactional inserts and a bulk load mode. The latter can be used to insert large sets of data without the transactional overhead and thus enables significant speed-ups when setting up systems or restoring previously collected data. Furthermore, scenarios with extremely high loads can be handled by buffering events and bulk-inserting them in chunks. This comes at the cost of reducing real-time capabilities to near-real-time capabilities, with an offset according to the defined buffering period, which is often acceptable for business scenarios.
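
As a hedged sketch of the buffering idea, the snippet below collects events in a buffer and bulk-inserts them in chunks; SQLite's executemany merely stands in for a bulk load interface, and the chunk size is arbitrary.

```python
import sqlite3

BUFFER_SIZE = 10_000   # chunk size; in practice derived from the buffering period

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

buffer = []

def record_event(event):
    """Buffer incoming events instead of inserting them one by one."""
    buffer.append(event)
    if len(buffer) >= BUFFER_SIZE:
        flush()

def flush():
    """Bulk-insert the buffered chunk without per-row transactional overhead."""
    if buffer:
        conn.executemany("INSERT INTO events VALUES (?, ?)", buffer)
        conn.commit()
        buffer.clear()

for i in range(25_000):
    record_event((i, "tracking event"))
flush()   # flush the remainder at the end of the buffering period
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])   # -> 25000
```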

Please also see our podcast on this technology concept.

November 9, 2011

Partitioning

We distinguish between two partitioning approaches, vertical and horizontal partitioning; a combination of both approaches is also possible.

Vertical partitioning refers to rearranging individual database columns. It is achieved by splitting the columns of a database table into two or more column sets. Each of the column sets can be distributed across individual database servers. This can also be used to build up database columns with different orderings to achieve better search performance while guaranteeing high availability of data. Key to the success of vertical partitioning is a thorough understanding of the application’s data access patterns. Attributes that are accessed in the same query should reside in the same partition, since locating and joining across additional partitions may degrade overall performance. In contrast, horizontal partitioning addresses long database tables and how to divide them into smaller pieces of data. As a result, each piece of the database table contains a subset of the complete data within the table.

Splitting data into equally sized horizontal partitions is used to support search operations and better scalability. For example, a scan of the request history results in a full table scan. Without any partitioning, a single thread needs to access all individual history entries and check the selection predicate. When using a naïve round-robin horizontal partitioning across ten partitions, the full table scan can be performed in parallel by ten simultaneously processing threads, reducing the response time to roughly one tenth of that of the single-threaded full table scan. This example shows that the resulting partitions depend on the chosen partitioning strategy. For example, rather than using round-robin as the partitioning strategy, attribute ranges can be used, e.g. inquirers are partitioned in groups of 1,000 with the help of their user id or requested product id.
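
The sketch below illustrates both strategies on made-up request-history data: a round-robin split into ten partitions scanned by one thread each, and a range-based alternative keyed on an attribute. Partition counts, range widths, and field names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 10

def round_robin_partition(table):
    """Round-robin horizontal partitioning: row i goes to partition i mod n."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for i, row in enumerate(table):
        parts[i % NUM_PARTITIONS].append(row)
    return parts

def range_partition(table, key, width=1_000):
    """Range partitioning: rows are grouped into key ranges of the given width."""
    parts = {}
    for row in table:
        parts.setdefault(row[key] // width, []).append(row)
    return list(parts.values())

def scan(partition, predicate):
    return [row for row in partition if predicate(row)]

def parallel_scan(partitions, predicate):
    """Full table scan executed by one thread per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(lambda p: scan(p, predicate), partitions)
    return [row for part in results for row in part]

history = [{"user_id": i, "product_id": i % 500} for i in range(100_000)]
parts = round_robin_partition(history)                  # or: range_partition(history, "user_id")
hits = parallel_scan(parts, lambda r: r["product_id"] == 42)
print(len(hits))   # -> 200
```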

Please also see our podcast on this technology concept.

November 8, 2011

Reduction of Layers

In application development, layers refer to levels of abstraction. Each application layer encapsulates specific logic and offers certain functionality. Although abstraction helps to reduce complexity, it also introduces obstacles. The latter result from various aspects, e.g. a) functionality is hidden within a layer, and b) each layer offers a variety of functionality of which only a small subset is in use.

From the data’s perspective, layers are problematic since data is marshaled and unmarshaled to transform it into each layer-specific format. As a result, identical data is kept redundantly in various layers, and a reduction of layers increases the efficient use of hardware resources. Moving application logic to the data it operates on results in a smaller application stack and therefore less code. Furthermore, reducing the amount of code also improves maintainability.
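
A small illustration of the idea, using SQLite as a stand-in for the database layer: aggregating in the application layer requires marshaling every row across the boundary, whereas pushing the logic down to the data returns only the small result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("A", 10), ("B", 20), ("A", 7)])

# Layered approach: every row is marshaled into the application layer,
# only to be aggregated there.
totals = {}
for product, amount in conn.execute("SELECT product, amount FROM sales"):
    totals[product] = totals.get(product, 0) + amount

# Logic moved to the data: the aggregation runs where the data resides,
# and only the result set crosses the layer boundary.
totals_in_db = dict(conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"))

print(totals == totals_in_db)   # -> True
```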

Please also see our podcast on this technology concept.

November 3, 2011

Dynamic Multithreading within Nodes

Parallel execution is key to achieving sub-second response times for queries processing large sets of data. The independence of tuples within columns enables easy partitioning and therefore supports parallel processing. We leverage this fact by partitioning database tasks on large data sets into as many jobs as threads are available on a given node. This way, maximal utilization of any supported hardware can be achieved.
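
As an illustrative sketch (not the actual engine), the following snippet splits a column aggregation into as many jobs as the node offers hardware threads and executes them in a thread pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One job: aggregate an independent slice of a column."""
    return sum(chunk)

def parallel_column_sum(column):
    """Split the column into as many jobs as threads are available on this node."""
    threads = os.cpu_count() or 1
    size = -(-len(column) // threads)   # ceiling division
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(partial_sum, chunks))

sales_amounts = list(range(1_000_000))   # made-up column values
print(parallel_column_sum(sales_amounts) == sum(sales_amounts))   # -> True
```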

Please also see our podcast on this technology concept.