|
![]() |
||
|
Picking the Right Inheritance StrategyMarch 2005, revised in 2007In this paper, by Eric Samson, CTO of Xcalia, we'll go through several common strategies to map inheritance to relational databases, and the results might surprise you! Most Java developers who use data access products and O/R mapping tools know that mapping inheritance is complex. A commonly held view is that products featuring the “per-hierarchy” mapping strategy are the most efficient and perform best. This paper will briefly review the different inheritance mapping strategies and present a realistic example that will allow everyone to better understand which mapping strategy is really best suited to which situation. We believe that most readers will be surprised by the results. Inheritance in object models, benefits and limitations Inheritance is one of the most advanced features of object models yet it is one of the most complicated to reproduce in a database. When object technologies were first being adopted, inheritance was overused. At that time, object programmers defined large, deep inheritance hierarchies. Inheritance allows one to program systems by “generalization/specialization”:
Since multiple inheritance can result in many programming drawbacks, it is less frequently used. In most current industrial object languages, specialization is limited to overloading existing behaviors or adding new ones, not removing existing ones or limiting them. For instance, you cannot change the type of a given attribute in a subclass. Usually, inheritance means class inheritance, where you can refine the behavior of one class in subclasses. Some research projects propose highly reflexive models where inheritance can even be used on objects, not just on classes. Inheritance provides a powerful programming mechanism where the initial effort forces you to be more structured, which produces important benefits for all subsequent projects:
However, designing class hierarchies that are too large can lead to some software issues because the defined model can become too static. Over time, engineers naturally reduce the size of class hierarchies, and use delegation instead of inheritance, which allows them to replace “is a kind of” relationships by “plays the role of” relationships. This leads to more dynamic systems where roles can evolve over time. It is important to point out that inheritance tends to be in conflict with encapsulation within the hierarchy, as information from the superclasses is made directly available to the subclasses. It is the opinion of this author, that inheritance should take precedence over encapsulation, allowing subclasses to have direct access to data in superclasses. Nevertheless, Java allows this kind of decision to be made by the individual programmer with the protected and private keywords. That said, when used correctly, inheritance is a very efficient mechanism and thus is massively present in most Java applications today. Mapping inheritance Aside from object databases, the notion of inheritance is not available in most data stores. For instance, there is no straightforward way to store inherited classes in a relational database. Indeed, managing inheritance in a RDBMS is one of the goals of a data access layer. As with most of the features in the data access layer, performance and flexibility are also important for inheritance management. Let us now examine the three well-known inheritance mapping strategies in a relational model. Per-class inheritance mapping The first strategy, which we will refer to as “per-class”, tends to reproduce the object model in the database. In this strategy, there is one table per class in the hierarchy. Each table contains the attributes defined at each class level, and each table associated with a subclass contains a foreign key to the table associated with the superclass. There is one table per class even if the parent classes are abstract. This strategy is also known as “vertical mapping” or “pure vertical mapping”. Given the following object model,
the associated database model could be as follows, depending on the various mapping options that are activated:
In this example, Joe is an employee and Jack is a student. Note: XIC does not require that all subclasses to use the same column to establish the link to the parent class. The benefits of this mapping strategy are that
Conversely, it has some detriments:
For instance, finding all Person objects will generate a query to retrieve all persons, then all employees, and finally all students. When navigating from an object that has a reference to a Person object, we need to be able to determine the actual type of that person, as it could be a person, an employee or a student. Identifying the actual subclass
This problem of identifying the actual subclass can be solved in four different ways. When the object identifiers (OIDs) are managed by the data access layer,
If you need to access a preexisting schema where primary keys have already been defined arbitrarily,
The last algorithm is obviously the most resource-consuming as it implies queries on all subclasses; thus, it should only be used when imposed by an existing schema. A good data access product should be able to deal with all of these situations, as all of them could be necessary in a real application. Per-hierarchy inheritance mapping The second strategy, which we will refer to as “per-hierarchy”, tends to limit the number of statements needed to store and retrieve objects. With this strategy there is only one single table for the whole class hierarchy, this table contains all attributes defined in all classes of the hierarchy. This strategy is also known as “flat mapping” or “filtered mapping”. Using the same object model, the associated database model could be as follows (depending on the various mapping options that are activated):
The benefits of this mapping strategy are that
Conversely, it has some detriments.
Per-concrete-class inheritance mapping The third and last strategy, which we will refer to as “per-concrete-class”, tends to limit both the space and the number of statements needed to store objects. With this strategy, there is one table for each concrete class in the hierarchy. This table contains all the attributes defined in the class itself plus all the attributes inherited from all the superclasses. This strategy is also known as “horizontal mapping”. Given the previously introduced object model, the associated database model could be as follows (depending on the various mapping options that are activated):
Depending on whether Person is an abstract or a concrete class, it may or may not be associated to a table. Please note that a product like XIC lets you define different column names in tables for subclasses for the same inherited attributes. This kind of flexibility is mandatory when dealing with existing schemas. The benefits of this mapping strategy are that
Conversely, its detriments are significant.
Comparing the three mapping strategies Anyone can easily understand that the per-class strategy shows very poor performance due to the multiple tables that must be used for object storage. Performance suffers in terms of object manipulation and queries. In any case, it is mandatory to support this kind of mapping because some existing models use it. I have even seen customers claiming it could be very interesting under the following circumstances:
The big question is which of the two remaining strategies makes the most sense to developers and architects. Most O/R mapping tools initially focused on the per-hierarchy strategy probably because it is the easiest to implement and because many developers think it is the most efficient in terms of queries. To better illustrate the pros and cons of these strategies, let me introduce the following theoretical object model.
Comparing the space in the database In order to stay independent from how different relational database engines are actually managing space allocation on disk for data storage, we just compare the number of cells that are created in the database with each inheritance strategy. A database cell is defined to be the intersection between a row and a column. The total number of cells of a table is its number of rows multiplied by its number of columns. A cell used is a cell that has a non-null value. ![]()
These figures are computed for 100,000 instances in each class. Here is a detail of the formulas used above:
Even on a simple class hierarchy we can see the waste of space implied by the per-hierarchy strategy. Please note the per-concrete-class strategy is even more optimized than the per-class one. Comparing the performance of manipulation operations The population process fills the database. It creates instances in the database, based on the selected mapping strategy and on parameters. We ran it several times changing the mapping strategies and the number of instances. Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs): ![]()
As expected, the per-class strategy is the least efficient. You can also see the per-hierarchy strategy is 50% slower than the per-concrete-class strategy. This becomes more evident as the volume or the number of subclasses increases, but is difficult to notice in a typical benchmark configuration when volumes are small and the number of subclasses is limited to three. This can lead to misleading conclusions at the all-important product evaluation stage. The result demonstrated above would be exactly the same for UPDATE and DELETE operations. Comparing the performance of queries If we push the comparison even further than most benchmarks by querying the root class and then querying on the subclasses, we will see that the performance difference is further entrenched. Queries on the root class ( Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs): ![]()
Once again, as expected, the performance of the per-class strategy is very poor. More surprisingly, however, is the fact that the performance of the other two strategies is almost identical. Remember that in the case of the per-hierarchy algorithm, there is only one query being executed while in the per-concrete-class, 5 queries are executed (one for each class in the hierarchy). To explain these results, one should keep in mind the huge impact on the database server and the network layer imposed by the per-hierarchy strategy. The reason is that from the database point of view a single big query (with 50 attributes in the result set projection) on one huge table is not faster than 5 small queries (with 10, 20 or 30 attributes in the result set projection) on 5 smaller tables. Please note that among vendor solutions, XIC best optimizes the per-hierarchy algorithm:
At this point most lab benchmarks will have stopped, but we are going to push the comparison even further. Queries on one intermediate class ( Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs): ![]()
The performance of the per-class strategy is still the poorest, but, as you can see, the performance of the per-hierarchy strategy noticeably decreases when compared to the per-concrete strategy. Remember that this is because the per-concrete-class case now requires only 3 simple queries, while the per-hierarchy strategy now requires one query with a complex filter containing 2 “OR” boolean operators. This is imposed because we need to filter the right subclasses (Class2, Class21 and Class22 in this particular case). Please note once again that XIC fully optimizes the per-hierarchy algorithm:
Queries on one leaf class ( Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs): ![]()
In this case, the performance of the per-hierarchy strategy becomes as poor as the performance of the per-class strategy. It would be even worse should you have more subclasses in the hierarchy or more instances per class. Why not see for yourself? To better illustrate the pros and cons of these strategies, we have created a project that allows you to discover the implications on your application and on your data store. We encourage anyone interested in this study and the associated results to go
to http://www.xcalia.com/downloads where you can download XIC.
Once you've downloaded and installed XIC, We used a MySQL engine to run our tests, but you can set the configuration parameters and run it over your preferred database engine. There are only two files that you need to edit in order to run the project on your system:
The application is delivered with an ANT build file. Once you've configured the project for your local environment, simply check the values in params.properties, then run
Conclusion Choosing the correct mapping strategy for inheritance is a key step in the design of most Java applications as it can have an enormous impact on performance when database size becomes critical. It is of primary importance to check that all flexibility options, as seen above are available in the mapping definition. As opposed to widely held belief, the best choice by default is the per-concrete-class strategy, as it optimizes the disk usage and it reduces the load on the network layer and the database server. This same conclusion was reached by Scott Ambler a few years ago in his now famous article about "The design of a robust persistence layer for relational databases". Because database resources are expensive, it is critical to check how each data persistence vendor implements each inheritance strategy. Products that focus on a per-hierarchy strategy might look good during simple benchmarks but can lead to disastrous situations in production.
|
||
|
|
|||