December 1, 2010

In-Memory Computing for Source Code Search

In the past few years, in-memory computing has found application in many areas. The most notable application is still its use in enterprise software systems, both analytical and transactional. However, the performance advantages of in-memory computing can also be exploited in a number of other scenarios. Today, I will describe our experience of applying an in-memory computing engine to source code search.

Without a doubt, source code search is an important software engineering activity. Empirical studies report that up to 30% of development tasks are related to search [1]. In our own analysis we found that 36% of search queries were structural patterns, keywords, or identifiers. Moreover, the most frequent search target during software maintenance is a statement.

Currently, developers use a simple text-based search engine to locate source code of interest. Alternatively, they use “where-used lists” to navigate to the target piece of source code. However, current technologies are not capable of processing fine-grained structural information. For example, if I compose a search query from terms such as “if” and “delmdtflag”, I will retrieve a list of classes where these two terms merely occur together but are not necessarily related. To make the query more precise, one can formulate it as a pattern over the abstract syntax tree (AST). In this case the results contain only those classes where “delmdtflag” is used in the clause of an “if” statement. Until recently, the large amount of AST data and the poor performance of tree matching operations were the major showstoppers for this scenario.

In our experiment we found that an in-memory computing engine is capable of storing and processing large amounts of tree-shaped structural information. We analyzed a large software system (72M statements). The source code was parsed and the resulting ASTs were stored in an in-memory column-oriented database. Since XPath is a standard query language for tree-shaped data, we decided to formulate AST patterns as XPath queries and implemented an XPath query engine on top of the database.
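To make the idea more concrete, here is a minimal sketch in Python using lxml. The XML encoding of the AST, the element names, and the example class are illustrative assumptions, not our actual storage format or query engine.

    # Minimal sketch: matching an AST pattern with XPath (element names and the
    # XML encoding of the AST are assumptions for illustration, not our schema).
    from lxml import etree

    # A toy AST for: IF delmdtflag = 'X'. ... ENDIF.
    ast = etree.XML("""
    <class name="ZCL_EXAMPLE">
      <method name="PROCESS">
        <if>
          <condition>
            <identifier name="delmdtflag"/>
            <literal value="X"/>
          </condition>
          <then>
            <call name="delete_mdt"/>
          </then>
        </if>
      </method>
    </class>
    """)

    # Find classes where "delmdtflag" occurs inside the condition of an "if" statement,
    # rather than merely co-occurring with the keyword "if" somewhere in the class.
    pattern = "//if/condition//identifier[@name='delmdtflag']/ancestor::class/@name"
    print(ast.xpath(pattern))   # ['ZCL_EXAMPLE']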

Our preliminary results show that processing even large amounts of tree-shaped structural data can be significantly accelerated. Moreover, the values in the columns of the AST table repeat many times, because the number of identifiers, literals, and AST vertex types is finite and each vertex value occurs frequently. Therefore, applying compression algorithms to the columns seems reasonable. In total, the source code repository requires 8.5 GB, which is reasonable for a database of this size and granularity. In comparison, the original source code in text files occupied 3.2 GB.
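As a rough illustration of why column compression pays off here, the following sketch shows dictionary encoding of a vertex-type column; the values and the encoding choice are my own illustration, not the database's actual compression scheme.

    # Sketch of dictionary encoding for an AST column (illustrative values only).
    # Each distinct value is stored once; the column keeps only small integer codes.
    vertex_types = ["IF", "IDENTIFIER", "LITERAL", "IF", "CALL", "IDENTIFIER", "IF"]

    dictionary = sorted(set(vertex_types))          # ['CALL', 'IDENTIFIER', 'IF', 'LITERAL']
    code_of = {value: code for code, value in enumerate(dictionary)}
    encoded = [code_of[v] for v in vertex_types]    # [2, 1, 3, 2, 0, 1, 2]

    # Decoding is a simple lookup, and scans can compare the integer codes directly.
    decoded = [dictionary[c] for c in encoded]
    assert decoded == vertex_types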

For more information visit our project page.

References

[1] Timothy Lethbridge, Janice Singer: Studies of the Work Practices of Software Engineers. In: Advances in Software Engineering: Comprehension, Evaluation, and Evolution

Author: Oleksandr Panchenko

November 2, 2010

New Approach for a Cloud Based OLTP System

In this blog entry, I want to summarize some emerging ideas about how OLTP can be performed on distributed systems by weakening consistency.
Scaling out is a common approach to achieve higher performance or throughput: more servers are used to handle the workload. Using more servers typically involves partitioning the data and, consequently, distributed transactions. However, distributed ACID transactions, which are usually implemented with two-phase commit, are expensive, as they increase transaction latency and weaken availability or resilience to network partitions.
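To illustrate why two-phase commit is costly, here is a minimal sketch of the protocol. The participant objects and their behavior are hypothetical; a real implementation adds logging, timeouts, and recovery.

    # Minimal sketch of two-phase commit (illustrative only). The coordinator needs
    # two network round trips, and participants stay blocked (holding locks) while
    # they wait for the global decision.
    class Participant:
        def prepare(self, txn_id):
            # Write redo/undo information, acquire locks, then vote.
            return True  # "yes" vote

        def commit(self, txn_id):
            pass  # make changes durable and release locks

        def abort(self, txn_id):
            pass  # undo changes and release locks

    def two_phase_commit(txn_id, participants):
        # Phase 1: ask every participant to prepare (first round trip).
        votes = [p.prepare(txn_id) for p in participants]
        # Phase 2: broadcast the global decision (second round trip).
        if all(votes):
            for p in participants:
                p.commit(txn_id)
            return "committed"
        for p in participants:
            p.abort(txn_id)
        return "aborted"

    print(two_phase_commit("txn-42", [Participant(), Participant()]))  # committed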

CAP Theorem and ACID 2.0

According to the CAP theorem, consistency, availability, and partition tolerance cannot all be achieved at the same time. Since availability is a crucial business factor and a system should also tolerate network partitions, it is common in key-value stores to offer reduced consistency guarantees in order to achieve full availability.
According to Helland (Helland 2007), a transaction should only involve a single entity, stored on a single instance. If a transaction needs to involve more than one entity, the business logic developer has to deal with potential inconsistencies. Between groups of entities, the system could support store-and-forward messaging.
Helland defines a new kind of consistency called ACID 2.0, which stands for Associative, Commutative, Idempotent, and Distributed (Helland 2009). The main goal of ACID 2.0 is that work succeeds even if its pieces happen at least once, anywhere in the system, and in any order.
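A simple example of an update with these properties is inserting an element into a set. The sketch below is my own illustration of the idea, not taken from the papers: applying such updates repeatedly and in any order yields the same result.

    # Sketch: an update that is associative, commutative, and idempotent, so it
    # succeeds even if applied more than once and in any order.
    def apply_update(entity_state, item):
        entity_state = set(entity_state)
        entity_state.add(item)     # set insertion: duplicates and ordering do not matter
        return entity_state

    replica_a = set()
    replica_b = set()

    for update in ["order-1", "order-2", "order-1"]:   # duplicate delivery
        replica_a = apply_update(replica_a, update)
    for update in ["order-2", "order-1"]:              # different delivery order
        replica_b = apply_update(replica_b, update)

    assert replica_a == replica_b == {"order-1", "order-2"}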

Fault Tolerance

Another issue is fault resilience. A common approach to increase fault resilience and availability is replication, i.e., storing multiple copies of the data. However, storing two copies of the data synchronously increases latency. Helland proposes to replicate updates asynchronously to a second server and let the business application developer cope with the possibility that data may be lost in case of errors. For transactions involving low business value, it might be acceptable to live with inconsistencies or lost data. In contrast, for transactions with high business value, the system should provide guaranteed consistency and durability.
To provide fault tolerance, individual transactions need to be idempotent, which means that executing a transaction twice should not lead to a substantive change of an entity, where the application itself defines what substantive means. To achieve idempotent transactions, each transaction has to be versioned, or hash values of the request have to be stored, so that transactions that have already been executed can be rejected. For this purpose, Helland proposes to store all activities of related entities, e.g. the messages received.
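Here is a minimal sketch of this de-duplication idea; the request format, the hashing scheme, and the in-memory bookkeeping are assumptions made purely for illustration.

    # Sketch: rejecting already-executed transactions by remembering a hash of each
    # processed request (request format and storage are illustrative assumptions).
    import hashlib
    import json

    processed = set()          # in a real system: part of the entity's activity state
    account = {"balance": 100}

    def handle(request):
        digest = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
        if digest in processed:
            return "duplicate, ignored"   # idempotent: re-execution changes nothing
        account["balance"] += request["amount"]
        processed.add(digest)
        return "applied"

    msg = {"id": "msg-7", "amount": 50}
    print(handle(msg))   # applied
    print(handle(msg))   # duplicate, ignored
    print(account)       # {'balance': 150}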
The ideas proposed by Helland are very interesting. However, further research has to show whether it is really feasible and viable to let the business application developer cope with data inconsistencies.

References

(Helland 2007) Pat Helland: Life beyond Distributed Transactions: an Apostate's Opinion. CIDR 2007: 132-141
(Helland 2009) Pat Helland, David Campbell: Building on Quicksand. CIDR 2009

Author: Benjamin Eckart

May 14, 2010

The Case for a Combined OLTP and OLAP Benchmark

In our research area "In-Memory Data Management for Enterprise Applications" we propose to reunite two systems supporting major enterprise operations: Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) systems. The two systems basically work on the same set of business data, but use differently optimized data structures for their specific use cases. The latest developments in in-memory database technology and column-oriented data organization, explored, for example, in the HYRISE project or the "In-Memory Data Management" bachelor's project, encourage the idea of integrating the two workloads, or at least major parts of them, into one system.
In this blog post I will make the case for a new benchmark that combines transactional and analytical processing in order to analyze the aforementioned combined systems and explain what differences are introduced when compared to existing benchmarks.
Benchmarking as referred to in this post relates to evaluating the performance of software systems for managing business data, such as data management systems for transaction processing and reporting. Separate benchmarks exist in this specific area of application for evaluating systems built for transaction processing and analytics. However, only limited statements can be made concerning the ability of existing data management systems to handle a combination of the different workloads since OLTP and OLAP have so far been treated separately.
In any case, it is obvious that existing data management systems cannot cope particularly well with a mixed workload of OLTP- and OLAP-style queries; otherwise the mentioned separation would not have occurred. However, new systems that tackle mixed workload scenarios face the greater challenge of keeping up with the performance of either specialized system on a pure workload, precisely because they have to manage a wider range of query types.

Existing Benchmarks

The benchmarks of the Transaction Processing Performance Council (TPC) have become the standard benchmarks for OLTP and OLAP. TPC's currently active benchmarks are TPC-C and its successor TPC-E for OLTP and TPC-H for OLAP. TPC-DS, a new benchmark for decision support systems, is now available as a draft version. The star schema benchmark (SSB) takes the decision support benchmark TPC-H a step further by deriving a pure star schema from the schema layout of TPC-H in order to evaluate the performance of data management systems built for pure data warehouse environments.
Existing benchmarks could be applied to a combined architecture for OLTP and operational reporting by simply running them in parallel, but this would give only a partial picture of the actual performance of such a system, measuring merely the effects of hardware resource contention. The reason is that the underlying data schemes of the different benchmarks differ to a great extent and are hard to integrate into a data schema that is sufficiently relevant and still realistic for both workloads. However, conflicts arising from data access are one characteristic of particular interest for a combined workload. Consequently, new means are needed to evaluate and compare new solutions with each other and with existing ones, integrating the aspect of the mixed workload while covering the desired business scenarios. These aspects we investigate in our Combined Benchmark for Transactions and Reporting (CBTR).

Real Data

CBTR is not just defined to closely resemble the behavior of the original systems of an existing enterprise; it actually operates on real data instead of generated data. Most data generators consider statistical distributions of data in real enterprises. These, however, still lead to idealized data, as real data usually deviates in one way or another from statistical distributions. Therefore, generated data sets never exhibit the same behavior as operating on original data. The best way of getting the real picture is still testing within the enterprise itself, which, however, is not possible due to high availability requirements and the costly implementation of the benchmark activity. Consequently, CBTR, which takes a snapshot of real systems, is as close to a real scenario as possible.

More Benchmarking Dimensions

[Figure: Benchmarking Dimensions]
CBTR utilizes three scales to focus on different parameters of the database system under test. The first is the commonly used scale of data set size, which will not be discussed further here.
The second scale is the variation of the logical database schema, including physical structures for optimization such as indexes and views, from being optimized for transaction processing to being optimized for reporting. The two ends of this scale are clear: the OLTP-optimized schema can be taken directly from established OLTP systems, and the OLAP end, a star or snowflake schema optimized for reporting queries, is also well understood. The schema variants between these two end points are the most interesting, however, since neither of the two aforementioned schemes is ideal for the other workload.
The third scale is a variation of the workload, starting from pure OLTP, going over a mixed workload, and ending in a pure analytical workload. Applying the pure OLTP workload to the OLTP schema, and the pure OLAP workload to the OLAP schema, serves as a baseline against which the mixed workload is compared. As a result, we can quantify performance benefits and penalties of the underlying database schema and the physical optimizations, and create a basis for comparison with existing data management systems that are optimized for pure workloads. Applying a mixed workload yields insights into the effects the two distinct workloads have on each other when released upon one and the same set of data.
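For illustration only, the following sketch contrasts an OLTP-style and an OLAP-style query released upon the same set of data; the toy schema and queries are my own, not CBTR's.

    # Illustration of a mixed workload on one data set (toy schema, not CBTR).
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales_order (id INTEGER PRIMARY KEY, customer TEXT, "
               "product TEXT, amount REAL, order_date TEXT)")

    # OLTP-style: short write and point read touching a single order.
    db.execute("INSERT INTO sales_order VALUES (1, 'ACME', 'Widget', 99.0, '2010-05-14')")
    print(db.execute("SELECT * FROM sales_order WHERE id = 1").fetchone())

    # OLAP-style: aggregation scanning many rows of the same table.
    print(db.execute("SELECT customer, SUM(amount) FROM sales_order "
                     "GROUP BY customer").fetchall())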

Author: Anja Bog

February 16, 2010

Single Cloud Chip Symposium at Intel Labs in Santa Clara


A few months back, in December 2009, Intel Labs announced a new many-core research prototype called SCC. SCC stands for Single-chip Cloud Computer and integrates 48 Intel Architecture (IA) cores on a single chip, the largest number ever put on a single CPU. Last week, on February 12th, Intel Labs held the SCC Symposium, inviting researchers to get to know this chip in more detail. The goal of the symposium was to allow researchers to take a really close look at the chip and its capabilities. The idea is that attending researchers apply for access to such a system in order to explore the possibilities of many-core computing.


In this blog post I will give you an introduction to the SCC architecture, its challenges and opportunities.

SCC Platform

As mentioned before, the SCC platform is a single chip with 48 IA cores. The 48 cores are arranged in a two-dimensional array of 6 x 4 tiles, and each tile holds two cores, which are Pentium processors running at up to 1.0 GHz. Each core has a 16 kB L1 cache and a 256 kB L2 cache, both private to the core. Main memory is attached to the chip via 4 DDR memory controllers.
For on-chip communication, all cores are connected to each other via an on-chip mesh network with 256 GB/s bisection bandwidth. Furthermore, the SCC introduces hardware support for message passing: the two cores on a tile share an additional 16 kB message passing buffer (MPB). Access to the MPB is very fast, and each core can write into and read from any other core's MPB. The MPB has the same structure and size as each core's L1 cache, but instead of being available only as private memory, the MPB is shared.

Intel Labs engineers took special care to make inter-core communication as fast as possible, and the communication latency is amazingly low. Since each message has to be sent via the routers in the mesh network, the latency depends on the number of hops a message takes, with each hop adding 4 cycles of latency. Another interesting fact is that the router network runs at a different clock speed than the actual cores; Intel engineers said this design decision was made on purpose, as the over-provisioned mesh network prevents the cores from maxing out the available on-chip bandwidth.
Because the SCC lacks hardware support for cache coherence, it is not possible to simply boot any given operating system on the SCC platform and use all 48 cores at the same time. Instead, Intel Labs decided to boot a separate Linux instance on each core. Inter-core communication is handled by RCCE (pronounced "Rocky"), a communication library based on message passing. How future operating systems could use all cores of the SCC is an open question, and Intel has put it out to the research community; for example, the Barrelfish project at ETH Zuerich is thinking about porting its next-generation multi-core OS to the SCC platform.

Challenges

The system design of the SCC allows ultra-fast communication on the chip but introduces a couple of challenges that software engineers have to deal with in order to exploit the chip to the maximum extent. For the sake of a simple chip design, the maximum address space is limited to 64 GB and, more importantly, to an address length of 32 bit. As a consequence, each core can only address 4 GB of main memory. Of course the lack of 64-bit addresses can be problematic, especially in the world of main memory databases, but Intel expects the research community to scale down their problems and promises to add more features over time as the SCC platform evolves. Since system design and fabrication is a long and expensive process, the goal is to identify the sets of features required by software engineers and add them to the platform in order of priority.



Another important fact is that there is no hardware support for cache coherence, so each core must take care of this problem itself. The reason for this decision is that with hardware cache coherence the system bus would be saturated too early by snoop messages triggered by changes to cached memory, slowing down the complete system. To ease memory management, the SCC introduces lookup tables (LUTs). Each LUT can be configured and modified dynamically at runtime, allowing changes in how the available physical memory is mapped into a core's local address space. The challenge here is to use application-level cache coherence protocols for managing memory access.
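As a rough sketch of what application-level coherence means for the programmer, the following toy model is my own illustration (not the SCC or RCCE API): without hardware coherence, writes must be flushed and stale copies invalidated explicitly before data is shared between cores.

    # Toy model of software-managed cache coherence (illustration only, not SCC/RCCE).
    shared_memory = {}                 # off-chip memory visible to all cores

    class Core:
        def __init__(self):
            self.cache = {}            # private, non-coherent cache

        def write(self, key, value):
            self.cache[key] = value    # only the local copy changes

        def flush(self, key):
            shared_memory[key] = self.cache[key]   # write back explicitly

        def invalidate(self, key):
            self.cache.pop(key, None)  # drop the possibly stale local copy

        def read(self, key):
            return self.cache.setdefault(key, shared_memory.get(key))

    core0, core1 = Core(), Core()
    core0.write("row_count", 42)
    core0.flush("row_count")           # make the update visible to other cores
    core1.invalidate("row_count")      # discard any stale copy before reading
    print(core1.read("row_count"))     # 42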

Opportunities

The SCC platform offers a wide range of possible research topics that could not be addressed until now. Besides topics related to operating systems and clusters, I see the following interesting topics with regard to main memory databases:
Power Management - The SCC allows very fine-grained control of power management. The chip is divided into multiple frequency and voltage islands. Each tile with two cores can run at its own frequency, and each block of 4 tiles (8 cores) can run at an individual voltage. Both frequency and voltage can be changed dynamically, allowing the system to run in a power consumption range between 125 W and 25 W(!). From my point of view, fine-grained power control is a very important factor in densely packed data centers and in scenarios where power consumption can directly be mapped to TCO (e.g. in SaaS applications).
Compute location latency - With 4 memory controllers and 24 routers on the chip, it is likely that data is fetched from far away, so the hop distance must be optimized. When I think about what a query plan of a main memory DBMS looks like, this issue becomes even more important, since the query scheduler has to assign sub-plans to specific cores while making sure that a certain maximum distance is never exceeded (see the sketch after this list).
Message passing based query processing - It would be really interesting to see how one could implement query processing using message passing on a single chip and to see how this might affect the layout of data in memory (i.e. rows, columns, or hybrid).
Database OS - Besides the typical Linux environment, the SCC allows running bare-metal C applications and thereby implementing a DBMS as close to the CPU as possible. Admittedly, this is a tedious task, but it comes with all the opportunities mentioned above.
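The following sketch illustrates the placement problem mentioned under compute location latency: on the 6 x 4 mesh, the hop distance between a core's tile and the tile next to the data's memory controller determines network latency (using the 4-cycles-per-hop figure from above). The greedy placement heuristic is my own toy example, not an actual SCC or DBMS scheduler.

    # Toy example: choosing the tile for a sub-plan on the 6 x 4 SCC mesh so that
    # the hop distance to the data stays small (illustrative heuristic only).
    CYCLES_PER_HOP = 4

    def hops(tile_a, tile_b):
        # Manhattan distance on the 2-D mesh of tiles.
        return abs(tile_a[0] - tile_b[0]) + abs(tile_a[1] - tile_b[1])

    mesh = [(x, y) for x in range(6) for y in range(4)]   # 24 tiles
    data_tile = (5, 3)                                    # tile closest to the data's memory controller

    # Greedy placement: run the sub-plan on the free tile closest to its data.
    busy = {(5, 3), (5, 2)}
    candidates = [t for t in mesh if t not in busy]
    best = min(candidates, key=lambda t: hops(t, data_tile))
    print(best, hops(best, data_tile) * CYCLES_PER_HOP, "cycles of network latency")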

Summary

To summarize my impressions from the SCC Symposium, I have to say that I really enjoyed it and would like to thank Intel for their great work. What impressed me most was Intel's very open attitude towards the research community. Instead of hiding source code and tools behind corporate walls, they announced that they will publish all the tools they have developed and, in exchange, expect the community to build new tools, extending the already available tools and documentation.
By the way, Intel and HPI are currently in the process of setting up a research cooperation. As part of the Future SOC Lab initiative, Intel intends to make the SCC physically available to HPI as one of the first institutions in Europe.


Author: Martin Grund