SourceForge.net Logo
Main Overview Wiki Issues Forum Build Fisheye
Issue Details (XML | Word | Printable)

Key: CMP-477
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Shay Banon
Reporter: Ramon Garcia Fernandez
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Compass

Improve ScrollableHibernateIndexEntitiesIndexer (use Hibernate Criteria)

Created: 29/Oct/07 06:43 AM   Updated: 26/Jul/08 10:05 AM
Component/s: Compass::Gps
Affects Version/s: 1.2 GA
Fix Version/s: 2.0 M1

File Attachments: 1. File compass-1.2.diff (6 kB)



 Description  « Hide
My company, Kotasoft S.L. wants to contribute this patch to fix memory issues with ScrollableHibernateIndexEntitiesIndexer. This patch was developed as part of a project for a customer, and in exchange for receiving this software for free we want to share this improvement with the rest of the comunity.

The most important issue with the current implementation of the scrolling indexer is that the memory goes up until eventually eating all memory. The reason of the issue is Hibernate caches all the objects got in a transaction forever. Compass tries to free the objects obtained by calling hibernateSession.evict(object), but if the object as references to other objects obtained by database relations, these objects are not released. For this reason, this call should be replaced by hibernateSession.evictAll(). With this change, we can run the indexer without any memory problems.

There are other changes in this patch.

We changed the indexer to respect better hibernate settings. If Hibernate is configured to use outer joins, Compass should respect that. This patch makes the indexer use Hibernate criteria queries instead of hsql queries, because the latter do inconditionally no outer join, independently of what was configured in the hibernate bindings.

We made more changes for supporting outer joins and batch processing. When hibernate is queried for a list of objects and the underlying SQL query is an outer join, Hibernate returns several times the same instance (one for each 1:N relation). See http://www.hibernate.org/117.html#A12 . For supporting this case we changed the query to perform an ORDER BY with the primary key, and the indexing loop places an object after it has been completely fetched. This approach is not optimal: ordering may introduce some overhead (though the primary key should be already indexed). Perhaps this behaviour should be turned into a per-class option.

For batch processing, instead of getting an object and indexing it, we get several objects (fetchSize() at a time) and then we index them. The reason is that, if Hibernate is configured to use batch queries, and lazy download of collections is enabled, Hibernate downloads several the related collections of several objects at once, with queries like:

SELECT .... FROM related_table WHERE related_table.id IN (?,?, ... ?)

This reduces the database load by reduced the number of queries delivered. Although it makes the loop code somewhat more complex, it does not add significant overhead.

As already commented, this patch should be improved by making the ordering request optional, if possible, as an option for each table.

Best regards,

Ramon Garcia Fernandez
Programmer
ramon.garcia@kotasoft.com
Kotasoft S.L.
http://www.kotasoft.com



 All   Comments   Change History      Sort Order: Ascending order - Click to sort in descending order
Ramon Garcia Fernandez added a comment - 29/Oct/07 06:44 AM
Patch to fix described issues.

Shay Banon added a comment - 09/Nov/07 08:04 AM
Applied the patch. Changed some things within it in order to be as much as possible backward compatible (it still works with Query as well).

Also, did not understand the prex/item logic when scrolling though the data. I kept it how it was before for now. If you can explain why you did it like that, I would be happy to hear.

Anyhow, thanks for the patch!. I did not know about the outer join difference between Criteria and Query, strange ...


Ramon Garcia Fernandez added a comment - 09/Nov/07 01:00 PM
The logic of comparison with previous is to avoid indexing repeated objects. This is necessary to support OUTER JOIN queries. When Hibernate performs such a query (for getting records of a table that has an 1:N relation with another table) it returns several times the same object. This would create an index with repeated entries. To do the correct indexing, the object must be indexed when it is complete, that is, when all the 1:N references have been fetched. When we see that the current record is different from the previous one, that means that all the 1:N relations of the previous row have been got, and then the previous object is indexed (actually, scheduled for indexing by storing it).

The behaviour of Hibernate with OUTER JOIN is documented in http://www.hibernate.org/117.html#A12


Shay Banon added a comment - 10/Nov/07 04:52 AM
And I am guessing .setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY) does not work with scroll, right?

Ramon Garcia Fernandez added a comment - 10/Nov/07 07:58 AM
Although setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY) works, it is inefficient, because it loads everything and then removes duplicates in memory, thus requiring to have all the rows in memory at the same time.

Shay Banon added a comment - 10/Nov/07 08:18 AM
That is what I thought, since it is not possible to do that with LinkedHashSet and scroll. Ok, I have added the mentioned logic (which will only work with the mentioned order by). Last question, why do you have the RowBuffer class, is there some additional caching that need to be done that I am missing? I have done it without it. Also, can you please check that the changes applied works for you?

Ramon Garcia Fernandez added a comment - 10/Nov/07 12:10 PM
The purpose of the RowBuffer class is to enhance the performance when one is not using OUTER JOIN, but lazy loading and batch processing (see batch processing in http://www.hibernate.org/hib_docs/v3/reference/en/html/batch.html , though the explanation is more about changes than about SELECTS). When batch processing are enabled, Hibernate will load the related rows of several rows instead of one by one. This works in the following way: if one access a related row, and there are several related rows pending to be loaded, then Hibernate will some of them in a single query. Hibernate performs a query like:

SELECT * from related_table WHERE foreign_key in (row1, row2, row3, row4);

If one enables lazy loading, hibernate will load the related rows when they are accessed, that is, when the row is indexed. For this to work, Hibernate must have several objects with related rows pending to be loaded. The RowBuffer class delays the loading of the related rows of one object (caused by the indexing) until other objects have been already loaded. The previous way of working:

load object
index object
evict object

at any time, Hibernate has only one row in its memory, and, when access to its related row is requested, cannot fetch related rows of other objects, because it knows about none.


Shay Banon added a comment - 10/Nov/07 12:41 PM
I thought so. But, isn't it enough to remove the evict method and rely on Hibernate first level cache to keep that. Why do we need to build our own mechanism as well?

Ramon Garcia Fernandez added a comment - 10/Nov/07 01:33 PM
There are two different issues.

If one does not evict, Hibernate keeps everything in memory, and one eventually gets an OutOfMemoryException. This is neccessary to make object identity reliable (that is object1 == object2 if both come from the same database row); if the one calls evictAll() it is under one's responsability. This behavior is somewhat odd. Note that Hibernate is designed for transaction applications, and not for batch processing like indexing. It does not work too bad, but is somewhat odd.

There is another issue.

With the loop {load ; index; } the batch processing cannot do its job. When the related rows of some row are being loaded, there is no other row that has pending related rows. All the rows previously loaded have been completely loaded because they have been already indexed.

What is necessary is something like

load 1;
load 2;
load 3;
load 4;
load 5;
index 1;
index 2;
index 3;
index 4;
index 5;

in this way, when one indexes row 1, Hibernate notices, I have here rows 2,3, 4, 5 pending to be loaded, and I can load them, and issues a select statement to load all of them at once:

SELECT fields FROM related WHERE main_id IN (id1, id2, id3, id4, id5)


Shay Banon added a comment - 11/Nov/07 03:19 PM
I think I understand partially what you are saying. First, basically, the code (without the call to evict, but with the call to clear) (this is not how it is now in svn, basically, the evict call needs to be removed), does not evict on each row. So basically, Hibernate can act as the row buffer by using its first level cache.

What you are suggesting, if I understand correctly, is that by keeping things in memory, and then indexing a "fetchCount" objects that we keep in memory, will cause less sql queries initiated against the database by Hibernate? This I do not fully understand. The call to index 1 will cause object 1 to be indexed, and if it has a relationship to other objects, they will be loaded or have already been preloaded, regardless of the fact that I loaded 2. Right?

Also, with your change using prev and item, we actually only index the last similar object. Object equality is kept by Hibernate first level cache anyhow.

Note, I am not suggesting that what you are saying is wrong, I just want to understand precisely what you are suggesting before committing this (I have not directly worked with Hibernate for over 2 years now).


Ramon Garcia Fernandez added a comment - 12/Nov/07 02:32 AM
In order to Hibernate batch processing work for related objects, Hibernate must have in its first cache level objects that have been loaded, but whose related objects have not been loaded. Hibernate batch processing happens when in addition to the current object, there are other candidate objects in the same table that are pending to be loaded. In this case, "pending" is a related object whose referring object is loaded.

The call to index 1 will cause object 1 to be indexed, and if it has a relationship to other objects, the will be loaded. But if object 2 is loaded, then, when loading the related objects of obj 1, the related objects of obj2 will be also loaded.

load 1
load 2
index 1 -> Hibernate loads related objects of 1 and 2
index 2

load 1
index 1 -> Hibernate loads related objects of 1
load 2
index 2 -> Hibernate loads related objects of 2
....


Shay Banon added a comment - 12/Nov/07 12:38 PM
If Hibernate does that, then it makes perfect sense. Can you post a link here that points to this feature in Hibernate?

Ramon Garcia Fernandez added a comment - 13/Nov/07 02:55 AM
The behaviour was deduced by a mixture of documentation (the linked chapter on batch processing http://www.hibernate.org/hib_docs/v3/reference/en/html/batch.html , that I see that you have read) and by expermimentation, by setting the hibernate property show_sql = true.

I will setup a simple and reproducible test case. This is neccessary anyway, to test the patches as applied.


Shay Banon added a comment - 13/Nov/07 02:33 PM
An example would be great. It is very strange for me that Hibernate would do that since it might load data for things already loaded in its first level cache which might never be used. For example, if I only call on index1 and thats it.

If it does that that, then it would be great to understand how it decides how much to load, since people would probably want to control it.


Ramon Garcia Fernandez added a comment - 13/Nov/07 02:50 PM
You are right that batch fetching is counterintuitive. For this reason it is disabled by default. For cases where it improves the performance, like indexing, the user can enable it.

The documentation about batch fetching (sorry, I had forgotten about it) is in

http://www.hibernate.org/hib_docs/v3/reference/en/html/performance.html#performance-fetching-batch

The decission of how much to load is given by the parameter batch-size of an object. The default can be set in the hibernate property hibernate.default_batch_fetch_size


Shay Banon added a comment - 14/Nov/07 03:21 AM
I knew about the collection part, but did not know that Hibernate is smart enough to do it on class level as well. Pretty cool. Now, the row buffer makes perfect sense, I will add it.

Shay Banon added a comment - 14/Nov/07 03:33 AM
Ok, patch applied (with some changes to support backward compatibility). Can you please check it the latest SNAPSHOT and see if it works?

Ramon Garcia Fernandez added a comment - 14/Nov/07 04:08 AM
OK, I will try a small set of test cases.

Ramon Garcia Fernandez added a comment - 10/Dec/07 11:07 AM
Sorry for the delay.

I have just performed a small set of tests (a couple of tables book, author) and it seems to works fine with all combinations of join type, lazyness and batch-size.