Bug 351659

Summary: Parallelization of C/C++ indexer
Product: [Tools] CDT
Reporter: Volker Diesel <volker.diesel>
Component: cdt-indexer
Assignee: Project Inbox <cdt-indexer-inbox>
Status: REOPENED
Resolution: ---
QA Contact: Jonah Graham <jonah>
Severity: normal
Priority: P3
CC: benshadwick, cdtdoug, dennis.dengpan, dion, dongseong.hwang, dsalinas, eclipse.sprigogin, eostroukhov, haojunmin, janboe.ye, jens.elmenthaler, jerome.correnoz, malaperle, martin.axelsson, mikhail.barg, pillepalle73, recoskie, swimbunny2000, yevshif
Version: 8.0
Target Milestone: ---
Hardware: All
OS: All
Whiteboard:
Bug Depends on: 368160
Bug Blocks:
Attachments (all flags: none):
  Profiler data without project references
  Profiler data with project references
  Comparison between runs without and with project references
  First version of parallel indexer
  Parallel indexer with lock contention reduced to almost zero
  Bug fixes, more convenient progress handling, UI beautification
  git patch for this bug based on CDT master

Description Volker Diesel CLA 2011-07-10 12:39:02 EDT
Build Identifier: 201106081058

CDT's C/C++ indexer should use multiple threads to speed up indexing. Modern CPUs tend to gain cores rather than clock frequency (cores may even run slower), so a single-threaded indexer will never get faster and may even get slower.

A first idea is to start not just one PDOMIndexerJob but a configurable number of jobs, with a fixed mapping of each project to one of these jobs and one task queue per indexer job (instead of one global task queue), letting each job pick its next task from its own queue.
This would at least speed up the indexing of multiple C++ projects.
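
A minimal sketch of that first idea, assuming a hypothetical IndexTask type and using the Eclipse Jobs API (this is not the real PDOMIndexerJob):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.eclipse.core.runtime.IProgressMonitor;
    import org.eclipse.core.runtime.IStatus;
    import org.eclipse.core.runtime.Status;
    import org.eclipse.core.runtime.jobs.Job;

    // Hypothetical placeholder for one unit of indexing work (e.g. one file).
    interface IndexTask {
        void run(IProgressMonitor monitor);
    }

    // One job per project, each draining its own task queue.
    class ProjectIndexerJob extends Job {
        private final BlockingQueue<IndexTask> queue = new LinkedBlockingQueue<>();

        ProjectIndexerJob(String projectName) {
            super("Indexing " + projectName);
        }

        void enqueue(IndexTask task) {
            queue.add(task);
            schedule(); // if the job is already running, it is rescheduled when it finishes
        }

        @Override
        protected IStatus run(IProgressMonitor monitor) {
            IndexTask task;
            while ((task = queue.poll()) != null) {
                if (monitor.isCanceled())
                    return Status.CANCEL_STATUS;
                task.run(monitor); // parse and store results for one task
            }
            return Status.OK_STATUS;
        }
    }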

A second idea is to prepare the parsing layer before making the actual parse call. If a batch of files needs to be reindexed, the parsing layer should first be informed about these files and their scanner configuration. With that information, the parsing layer could start parsing in the background with multiple threads and keep the ASTs until the indexer requests them.
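
Again only a sketch, with a made-up Parser callback standing in for the real parser entry point and its scanner configuration:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Parse a batch of files in background threads and keep the ASTs
    // until the indexer asks for them.
    class AstPrefetcher<AST> {
        interface Parser<AST> {
            AST parse(String file); // hypothetical; the real call would carry scanner config
        }

        private final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        private final Map<String, Future<AST>> pending = new ConcurrentHashMap<>();
        private final Parser<AST> parser;

        AstPrefetcher(Parser<AST> parser) {
            this.parser = parser;
        }

        void submitBatch(List<String> files) {
            for (String file : files)
                pending.put(file, pool.submit(() -> parser.parse(file)));
        }

        // Blocks only if the AST for 'file' has not been parsed yet;
        // 'file' must have been part of a submitted batch.
        AST awaitAst(String file) throws Exception {
            return pending.remove(file).get();
        }
    }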

Comments and further ideas are welcome :-)

Reproducible: Always

Steps to Reproduce:
1. Index large number of large C++ projects :-)
Comment 1 Markus Schorn CLA 2011-07-11 04:41:34 EDT
(In reply to comment #0)
> A first idea is to start not only one PDOMIndexerJob, but a configurable number
> of jobs and to have a fixed mapping of one project to one of these jobs and to
> have one task queue per indexer job (instead of one global task queue) and let
> the jobs pick their next task from the appropriate task queue.
> This would at least speedup indexing of multiple C++ projects.
Sounds like a reasonable approach to me. You need to watch out for dependent projects though, where the order in which they are indexed is important. Projects that relate to each other via project references should probably be indexed in the same thread.

> Second idea is to prepare the parsing layer before doing the actual parse call.
> If a bulk of files needs to be reindexed, the parsing layer should first be
> informed about these files and their scanner configuration. With that
> information, the parsing layer could start parsing in background with multiple
> threads and keep the ASTs until the indexer requests them.

When parsing a translation unit we reuse information about headers that have been parsed before. Therefore a simple parallelization is not possible here.

There are other options, so we should base what we do on some numbers. Where do we spend CPU time? How much time do we spend on read/write operations on the database?

Other options:
(1) Run preprocessor in parallel to the syntax parser.
(2) Run name resolution (including template instantiation) in parallel to the 
    syntax parser of the next translation unit.
(3) Run syntax parser on multiple translation-units in parallel.
(4) Run name resolution of multiple translation-units in parallel. There are
    some restrictions on when we can do that, but it should be manageable.

Note that we'll still have the potential bottleneck of writing to the database from a single thread (a sketch of that single-writer shape follows at the end of this comment).

I don't expect a lot of problems in implementing (1) and (2). It can at least put some load on 3 cores per project.
Also (3) would not be hard to implement; however, I expect that we spend more CPU time on name resolution than on syntax parsing, so doing (3) may not have an effect without doing (4). An implementation of (4) will be more challenging.
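
To make the single-writer constraint concrete, a rough sketch (hypothetical parseFile/writeToIndex methods, plain java.util.concurrent): parse on many threads, funnel all index writes through one thread.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Many parser threads, one writer thread: the database is only ever
    // touched by "index-writer", so its write lock is never contended.
    class ParseThenWrite {
        private final ExecutorService parsers =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        private final BlockingQueue<Object> parsed = new LinkedBlockingQueue<>();

        ParseThenWrite() {
            Thread writer = new Thread(() -> {
                try {
                    while (true)
                        writeToIndex(parsed.take()); // single thread drains all results
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "index-writer");
            writer.setDaemon(true);
            writer.start();
        }

        void index(String file) {
            parsers.submit(() -> {
                parsed.add(parseFile(file)); // queue the AST for the writer
            });
        }

        private Object parseFile(String file) { return file; } // hypothetical parser call
        private void writeToIndex(Object ast) { /* hypothetical PDOM write */ }
    }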
Comment 2 Volker Diesel CLA 2011-07-11 06:28:49 EDT
(In reply to comment #1)

Hi Markus.
Thanks for your reply.
Let's start with idea one... parallel indexing of multiple projects.

> Sounds like a reasonable approach to me. You need to watch out for dependent
> projects though, where the order in which they are indexed is important.
> Projects that relate to each other via project references should probably be
> indexed in the same thread.

I didn't get the point about project references and why dependent projects must be indexed in a particular order. My (naive) understanding was that the index only contains information about source and object files, no matter in which project these files are located and whether there are dependencies on other projects. Can you please explain in more detail why dependent projects must be indexed in a specific order? And if this holds true... does this mean that wrong project dependencies will result in wrong/corrupted indexes? And btw... how about cyclic project dependencies?

Thanks + kind regards,
Volker
Comment 3 Markus Schorn CLA 2011-07-11 08:59:08 EDT
(In reply to comment #2)
> (In reply to comment #1)
> Hi Markus.
> Thanks for your reply.
> Let's start with idea one... parallel indexing of multiple projects.
Yes, let's do that.

> I didn't get the point about project references and why dependent projects 
> must be indexed in appropriate order....
In case you have a project 'B' that depends on project 'A', and sources in 'B' include headers from 'A', then the indexer will read the information about the header from the index of project 'A' rather than parsing the header again. For that to work, project 'A' needs to be indexed before project 'B'.

Wrong or cyclic dependencies will at the worst cause information to be duplicated in the indexes of multiple projects.
Comment 4 Volker Diesel CLA 2011-07-11 12:19:43 EDT
(In reply to comment #3)

Hi Markus.
Thanks for the prompt reply.

> In case you have a project 'B' that depends on project 'A' and sources in 'B'
> that includes headers from 'A', then the indexer will read the information
> about the header from the index of project 'A' rather than parsing the header
> again. For that to work project 'A' needs to be indexed before project 'B'.
> 
> Wrong or cyclic dependencies will at the worst cause information to be
> duplicated in the indexes of multiple projects.

Okay, that makes things a bit more complicated, but I think it's still manageable.

First...
If I got you right, and if I didn't miss something, dependent projects don't necessarily need to be indexed in one and the same thread.
In your example, the indexer thread for project B must wait for the indexer thread for project A to finish. I think this is an important difference with regard to parallelization.
Consider a more realistic situation:
project A
projects B, C, D, E --> depend on A
project F --> depends on B, C
project G --> depends on B, D, E
Battle plan:
1) run indexer on A
2) run indexers on B, C, D, E in parallel
3) as soon as B and C are finished, run indexer on F in parallel
4) as soon as B, D, and E are finished, run indexer on G in parallel

If this is right, I would suggest associating one indexer job with each project and modeling these jobs in a graph-like structure, with each job being blocked by still-running parent jobs (see the sketch below).
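
For illustration, a minimal sketch of such a job graph, using a plain executor instead of Eclipse's job framework (all names made up): each project carries a count of unfinished dependencies and is submitted once that count reaches zero.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // One indexer task per project; a task is submitted once every project
    // it depends on has finished. Assumes the whole graph is registered
    // up front and contains no cycles.
    class DependencyScheduler {
        private final ExecutorService pool = Executors.newFixedThreadPool(8);
        private final Map<String, Integer> remainingDeps = new HashMap<>();
        private final Map<String, List<String>> dependents = new HashMap<>();
        private final Map<String, Runnable> tasks = new HashMap<>();

        synchronized void register(String project, List<String> deps, Runnable task) {
            tasks.put(project, task);
            remainingDeps.put(project, deps.size());
            for (String d : deps)
                dependents.computeIfAbsent(d, k -> new ArrayList<String>()).add(project);
        }

        synchronized void start() {
            for (Map.Entry<String, Integer> e : remainingDeps.entrySet())
                if (e.getValue() == 0)
                    submit(e.getKey()); // roots: projects with no dependencies
        }

        private void submit(String project) {
            pool.submit(() -> {
                tasks.get(project).run(); // index this project
                onFinished(project);
            });
        }

        private synchronized void onFinished(String project) {
            List<String> waiters = dependents.get(project);
            if (waiters == null)
                return;
            for (String w : waiters)
                if (remainingDeps.merge(w, -1, Integer::sum) == 0)
                    submit(w); // last dependency done: project becomes runnable
        }
    }

With the battle plan above, registering A with no dependencies, B..E depending on A, F on B and C, and G on B, D, E, then calling start(), runs the projects in exactly the order of steps 1) to 4).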

Any objections or other ideas?

Second...
Integration of parallel indexing into the huge Eclipse world.
1) Adding, deleting, closing(?) projects must be handled
2) Changing project dependencies must be handled
3) Stopping one or more jobs must stop dependent jobs as well
What else?

Thanks + kind regards,
Volker
Comment 5 Markus Schorn CLA 2011-07-12 07:17:21 EDT
(In reply to comment #4)
> Okay, that makes things a bit more complicated, but I think it's still
> manageable.
I agree.

> First...
> If I got you right and if I didn't miss something, dependent projects don't
> need to be indexed in one and the same thread necessarily.
> In your example, thread of indexer of project B must wait for thread of indexer
> of project A to finish. I think this is an important difference with regards to
> parallelization.
You are right.

> Consider a more realistic situation:
> project A
> project B, C, D, E --> depend on A
> project F --> depends on B, C
> project G --> depends on B, D, E
> Battle plan:
> 1) run indexer on A
> 2) run indexers on B, C, D, E in parallel
> 3) as soon as B, C is finished run indexer on F in parallel
> 4) as soon as B, D, E is finished run indexer on G in parallel
> If this is right, I would suggest associating one indexer job to one project
> and model these jobs in a graph-like structure, with each job being blocked by
> running parent jobs.
> Any objections or other ideas?
With the plan you'll win the battle.

> Second...
> Integration of parallel indexing into the huge eclipse world.
> 1) Adding, deleting, closing(?) projects must be handled
> 2) Changing project dependencies must be handled
> 3) Stopping one or more jobs must stop dependent jobs as well
> What else?

I can think of 
4) New jobs (e.g. because user saved a file) need to be merged into the pending
   jobs.
Comment 6 Volker Diesel CLA 2011-07-13 15:27:23 EDT
(In reply to comment #5)

Hi Markus,
thanks for your reply.

> I can think of 
> 4) New jobs (e.g. because user saved a file) need to be merged into the pending
>    jobs.

Good point! That made me think for a moment :-)

Let's assume project A with file A.h and project B (depending on A) with B.h (which #include's A.h), and let's assume I run a full indexing while editing A.h and saving my changes every few seconds.

1) Current implementation
I assume that A.h and B.h are added to the job queue again and again, in the appropriate order (first A.h, then B.h). After a moment, my index is up to date. Right?

2) Future parallel implementation
Now, we have two different cases

2.1) Indexer A is still running
A.h will be added to indexer job A again and again, B.h will be added to indexer job B again and again. Indexer A will keep running and running and indexer B will never get started, because it will be waiting for A. Would that be acceptable?
Well, from my point of view it would, for two reasons...
First: If I don't like that behaviour, I could set the number of parallel queues to "1" in the indexer configuration and would be back to the old behaviour.
Second: As long as I am editing project A, I don't care much about the index of project B.
What do you think about that?

2.2) Indexer A has finished and indexer B is running
I modify A.h and it will be added to indexer job A and B.h will be added to indexer job B. Indexer A must be restarted and therefore indexer B must be stopped. And now we are back at 2.1)
Right?

So my conclusion is...
Before a new task is put into one of the indexer jobs, each and every indexer job has to be paused first, and after all new tasks are in, the entire job tree must be started again, beginning with the most basic project (the root of the job tree).
Right?

BTW... cyclic dependencies need to be broken anyway, otherwise it won't work. But that's no different from the current implementation.

Comments welcome :-)

Kind regards,
Volker
Comment 7 Volker Diesel CLA 2011-07-19 15:36:46 EDT
Hi, CDT people.

Coming back to comment 1:

>>
>> You need to watch out for dependent projects...
>>

!!!!!!!!!!!!!!!!!!!!
!!!NO, BETTER NOT!!!
!!!!!!!!!!!!!!!!!!!!

I tried it in our real-life project...
171 large C++ projects in the workspace, and we used to have no project references (no project had references to other projects). Full C/C++ indexing took about three hours (by far too long... for comparison, a scratch make including archiving and linking takes about 40 minutes).

I thought it would be a good idea to write a plugin that generates appropriate project references according to our build descriptions, hoping that this would speed up CDT indexing... And I did...

Well, writing that plugin might have been a good idea, but running the CDT indexer with appropriate project references in place wasn't.
I started a full indexer job (with appropriate project references) today at 10:50am and canceled it just now at 9:30pm, after it had managed to index one percent of our sources.

It seems to me that parsing files and building the AST is at least 1000 times faster than looking up references in the index (at least in our real-life project).

Could anyone please comment on that?
Thanks + kind regards
Volker
Comment 8 Markus Schorn CLA 2011-07-20 02:47:25 EDT
(In reply to comment #7)
> ...
> I started a full indexer job (with appropriate project references) today at
> 10:50am and canceled it right now at 9:30pm, after it had managed to index one
> percent of our sources.
> It seems to me that parsing files and building the AST is at least 1000 times
> faster than looking up references in the index (at least in our real-life
> project).
> Could anyone please comment on that?
> Thanks + kind regards
> Volker
When you have project references, the index has to span multiple databases (one for each project). This makes access to the index slower. I assume that this is the cause of the extreme slowdown.
Not watching out for project references when parallelizing the indexing does not solve this problem, though. Instead, we need to consider indexing each project on its own, regardless of project references. By your measurements, this would speed up the indexing, and it would make it easier to parallelize the indexer. The drawback is that information about headers included from multiple projects gets duplicated.
Comment 9 Volker Diesel CLA 2011-07-20 03:08:59 EDT
(In reply to comment #8)

Hi Markus.
Thanks for your reply.

> When you have project references, then the index has to span multiple databases
> (one for each project). This makes access to the index slower. I assume that
> this is the cause of the extreme slowdown.
> Not watching out for project references when parallelizing the indexing does
> not solve this problem, though. Rather than that we need to consider to index
> each project on its own, regardless of project references. By your
> measurements, this would speed up the indexing and it would make it easier to
> parallelize the indexer. The drawback is, that information about headers
> included from multiple projects gets duplicated.

First of all, the root cause should be found instead of working around the performance issue. Why is looking up references in other indexes so much slower than parsing the files again, and how can it be made faster? I did a simple test and added debug output before and after the 'getASTTranslationUnit' call. It turned out that this call takes between one and three seconds in most cases, and then it takes ages (sometimes 20-30 seconds) until 'getASTTranslationUnit' is called again. I am currently doing some analysis and will come back to that point later (maybe I will open an additional bug for that issue).

Regards,
Volker
Comment 10 Markus Schorn CLA 2011-07-20 03:52:42 EDT
(In reply to comment #9)
In both cases things are looked up in the index. Without project references the index consists of a single database (the one of the project), whereas with project references the index consists of a bunch of databases. Therefore my top candidates for causing the slowdown are:
* The layer that combines the databases is inefficient, or
* The number of entities to be considered during overload resolution increases;
  this may slow down the parser. This may especially hurt when there are a
  lot of overloaded operators. You can try turning off indexing of implicit
  references and see whether this makes a great difference.
Comment 11 Volker Diesel CLA 2011-07-20 09:40:32 EDT
Created attachment 199996 [details]
Profiler data without project references
Comment 12 Volker Diesel CLA 2011-07-20 09:40:54 EDT
Created attachment 199997 [details]
Profiler data with project references
Comment 13 Volker Diesel CLA 2011-07-20 09:41:38 EDT
Created attachment 199998 [details]
Comparison between runs without and with project references
Comment 14 Volker Diesel CLA 2011-07-20 09:43:18 EDT
I reduced the number of projects (and thereby the number of project references) from 170 to 20 and then...

1) ran full indexing without project references -> 22min13sec
2) ran full indexer WITH project references -> 44min24sec

It seems that some code does not scale linearly with the number of projects / number of references. With 170 projects and many more references, indexing is not just two times slower but a hundred or a thousand times slower.

3) I ran both tests again with profiling enabled. It seems that the entire time difference results from AST visiting (and the actions therein).

See attached profiler data and diff file.
Comment 15 Volker Diesel CLA 2011-07-24 08:39:21 EDT
Hi.

I implemented a first version of indexer parallelization at the project level (see attached patch, based on origin/master 0b330a4c9e) and tested it with our 170 C++ projects on a 16-core machine. Full indexing time is now down from ~3hrs to ~40mins. This is a factor of about 4.5, which means that the parallelization does not really scale with the number of cores. I tried some things but have not yet found out why it scales so badly.

I encountered another severe issue when running the parallel indexer with project references configured. It turned out that project references kill parallelization completely, because all parallel jobs synchronize on WritablePDOM's write lock.

I stopped at that point (actually I did not spend much effort on honouring project references for indexing, since parallelization is currently not usable anyway when many project references are in place).

To me it seems the entire indexer infrastructure is currently not really ready for multithreading and that a larger refactoring will be necessary.

Could some of you please review my changes and comment on how to continue with that issue?

Thanks + kind regards,
Volker
Comment 16 Volker Diesel CLA 2011-07-24 08:40:44 EDT
Created attachment 200240 [details]
First version of parallel indexer
Comment 17 Volker Diesel CLA 2011-07-28 17:29:09 EDT
Hi.

Coming back to the issue mentioned in comment 15...

I found three major reasons why parallel indexing at the project level does not scale with the number of CPU cores / parallel jobs, and I resolved two of them.

First of all... I am talking about parallel indexing WITHOUT any references between projects. Parallel indexing WITH project references is still a nightmare :-(

Reason 1: Lock contention in class ChunkCache
I resolved this by introducing multiple cache slots and assigning chunks to these slots round-robin (sketched below). When chunks are added to or removed from the ChunkCache, only the slot associated with that chunk needs to be locked, not the entire cache.
This brought a 25% speedup during parallel indexing (full-indexing time for our 170 C++ projects decreased from ~40mins to ~30mins).
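
Not the actual patch, just a minimal sketch of the lock-striping idea (slot count and types are made up):

    import java.util.HashMap;
    import java.util.Map;

    // Chunks are assigned to slots by id; only the owning slot is locked,
    // so threads working on different slots never contend.
    class StripedChunkCache {
        private static final int SLOTS = 16; // illustrative

        private final Object[] locks = new Object[SLOTS];
        private final Map<Long, byte[]>[] slots;

        @SuppressWarnings("unchecked")
        StripedChunkCache() {
            slots = new Map[SLOTS];
            for (int i = 0; i < SLOTS; i++) {
                locks[i] = new Object();
                slots[i] = new HashMap<Long, byte[]>();
            }
        }

        private int slotOf(long chunkId) {
            return (int) Math.floorMod(chunkId, (long) SLOTS);
        }

        byte[] get(long chunkId) {
            int s = slotOf(chunkId);
            synchronized (locks[s]) {
                return slots[s].get(chunkId);
            }
        }

        void put(long chunkId, byte[] data) {
            int s = slotOf(chunkId);
            synchronized (locks[s]) {
                slots[s].put(chunkId, data);
            }
        }
    }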

Reason 2: Lock contention in class Database
I resolved this by no longer locking Database chunk operations on the global ChunkCache, but by introducing a lock per Database.
This brought another 33% speedup during parallel indexing (full-indexing time for our 170 C++ projects decreased from ~30mins to ~20mins).

I don't think these changes introduce any new race conditions, but it would be great if someone could review my changes and comment on them.

Reason 3: Java garbage collection
Yes, I know what kind of discussion I am starting here... Anyway...
The only thing that prevents the parallel indexer from scaling linearly with the number of CPU cores / indexer jobs is Java garbage collection. And the only solution I see is pooling and reusing objects as much as possible instead of creating and destroying them over and over again. In fact, if garbage collection did not happen, the parallel indexer would already scale almost linearly with the number of jobs / CPU cores.
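
For illustration only (not taken from the patch), the kind of object pooling meant here; in a parallel indexer you would probably want one pool per thread, so the pool itself does not become a new point of contention:

    import java.util.ArrayDeque;
    import java.util.function.Supplier;

    // Reuse instances instead of letting the garbage collector churn
    // through short-lived objects.
    class Pool<T> {
        private final ArrayDeque<T> free = new ArrayDeque<T>();
        private final Supplier<T> factory;

        Pool(Supplier<T> factory) {
            this.factory = factory;
        }

        synchronized T acquire() {
            T obj = free.poll();
            return obj != null ? obj : factory.get(); // create only when the pool is empty
        }

        synchronized void release(T obj) {
            free.push(obj); // caller must reset the object's state first
        }
    }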

So, at the end of the day... I started with 3hrs full-indexing time for our 170 C++ projects and ended at ~20mins with the attached patch.

It would be great to have this at least in the next CDT release (and maybe also in one of the next updates for CDT 8).

How to proceed?
Open topics?
Objections?

Kind regards
Volker
Comment 18 Volker Diesel CLA 2011-07-28 17:34:07 EDT
Created attachment 200571 [details]
Parallel indexer with lock contention reduced to almost zero
Comment 19 Sergey Prigogin CLA 2011-07-28 17:40:22 EDT
(In reply to comment #17)
Do you really have 170 projects without dependencies between them?
Comment 20 Volker Diesel CLA 2011-07-28 18:28:49 EDT
(In reply to comment #19)

> Do you really have 170 projects without dependencies between them?

Of course, our 170 projects have a lot of dependencies, but as already mentioned in comment 7:

>>
>> You need to watch out for dependent projects...
>>

!!!!!!!!!!!!!!!!!!!!
!!!NO, BETTER NOT!!!
!!!!!!!!!!!!!!!!!!!!

If I set up my workspace with 170 C++ projects and appropriate project dependencies, C++ indexing takes approx. 1000 times longer than without dependencies. As already mentioned before, it normally takes about 3hrs for full indexing (without project dependencies), and WITH project dependencies I canceled full indexing after 10hrs, after the indexer had managed to index approx. 1% of our sources.

So... We definitely have project dependencies, but the C++ indexer is completely unusable when these dependencies are configured.
Comment 21 Sergey Prigogin CLA 2011-07-28 18:41:26 EDT
(In reply to comment #20)
I have a gut feeling that in spite of the great performance improvements you were able to achieve with projects without dependencies, inter-project parallelization is a dead-end approach. Parallelization between files, although harder to implement, is much more promising than the inter-project one and would benefit everybody, not just those who use a lot of projects.

My 2c.
Comment 22 Volker Diesel CLA 2011-07-29 02:19:56 EDT
(In reply to comment #21)
> (In reply to comment #20)
> I have a gut feeling that in spite of the great performance improvements you
> were able to achieve with projects without dependencies, inter-project
> parallelization is a dead-end approach. Parallelization between files, although
> harder to implement, is much more promising than the inter-project one and
> would benefit everybody, not just those who use a lot of projects.

Hi Sergey.
Sorry, but I don't share your opinion for several reasons.

First, I question that it is possible to utilize a 4-core machine 100% with parallelization at the parser level only (and I consider anything other than 100% utilization a waste of resources). And normal desktop PCs will probably have many more than 4 cores in two or three years. So I think parallelization ONLY at the parser level is the dead end (even without taking CDT code complexity into account).

Second, I wonder: for whom is CDT built? Probably not for "hello world" style projects. Hello-world projects probably don't have performance issues with CDT indexing, right? And large projects have a lot of subcomponents (and hence a lot of potential CDT projects). What does a larger C++ project look like? You probably have some executables and a bunch of shared libraries, and at least each shared library must be in its own CDT project, because in real life each shared library will have its own set of #define's that is different from any other shared library's (at least a #define for DLL symbol import/export will be different in each shared library project). So I think non-hello-world C++ projects will naturally consist of a lot of CDT projects.

This leads me to point three (another CDT dead end)...
If projects differ in even a single #define (and real-life projects do, as mentioned above), it doesn't make much sense to honour project references during indexing, because reuse of indexing information across projects will be almost zero. Not to mention that it is a thousand times slower today (that could probably be fixed). But it simply doesn't make much sense in real C++ projects, and it currently prevents inter-project parallelization when project references are configured.

And last but not least, why not have both? Why not have inter-project parallelization AND a reasonable parallelization at the parser level (reasonable with regard to code complexity)? Is there anything wrong with inter-project parallelization?
At least it has been easy to achieve with a few hundred lines of code, and if you review my changes, you will find that most of my code has not even been invented from scratch but simply reorganized / moved from one class to another.

But maybe this is more a topic for the cdt-dev mailing list. If you don't mind, I would like to post this discussion there.

Kind regards,
Volker
Comment 23 Chris Recoskie CLA 2011-07-29 11:17:45 EDT
(In reply to comment #20)
> If I set up my workspace with 170 C++ projects and appropriate project
> dependencies, C++ indexing takes approx. 1000 times longer than without
> dependencies. As already mentioned before, it normally takes about 3hrs for full
> indexing (without project dependencies) and WITH project dependencies, I
> canceled full indexing after 10hrs and after indexer had managed to index
> approx. 1% of our sources.
> So... We definitely have project dependencies, but C++ indexer is completely
> unusable when configuring these dependencies.

This confuses me.

Let's assume for a moment there are no dependency cycles.  This means that for N projects, in the worst case you will have

Project N-1 depends on N-2, which depends on N-3 ... depends on Project 0

That means that at worst, you cannot parallelize any indexing at all, and you index each project sequentially, which ought to be the same as the current behaviour without any of your patches applied.  There'd be some additional overhead to look at those dependencies, but it should not be a great deal, and it would scale linearly with the number of projects.

There might be some confusion between what you think of as a project dependency and what some of the rest of us think of as a dependency.  I think of project A as depending on B if, in A's project properties, B is listed as a project that A depends on.  This is a user-configured relationship that is stored, not computed automatically in any way.  Resolving whether an arbitrary project X depends on Y in this case is a simple matter of looking into X's properties and seeing if Y is listed in the dependencies.  Hence, it's not a huge amount of overhead in and of itself.

If you are talking about trying to automatically resolve whether the code in project A actually calls/includes code in project B, then I can see that taking a huge amount of time and making the entire exercise impractical.
Comment 24 Volker Diesel CLA 2011-07-29 12:38:11 EDT
(In reply to comment #23)

Hi Chris.

Let me make this a bit more clear...

There are three different things with regards to project references (and yes, I mean project references between eclipse projects, NOT references between source files).

First, take CDT8 as it was shipped with Indigo, import 170 C++ projects, configure many references between these projects, and try to run a full index... But be careful... You will need much patience and a lot of coffee. It takes ages!
Remove the dependencies, run the full indexer again, and it is faster by a factor of 1000.
This is a bug in official CDT8 and has nothing to do with parallelization.

Second (and disregarding the above bug), if you try to run the indexer in parallel on projects with many references, you will fail, because all parallel jobs synchronize on the write lock of the PDOM index, which spans many, if not all, project PDOMs.

Third, I do not see much reason why the indexer should honour project references at all. As already explained in my previous comment, in real life almost all projects have different sets of #define's, and therefore reuse of indexer information across projects is minimal anyway.

Kind regards,
Volker
Comment 25 Chris Recoskie CLA 2011-07-29 12:53:20 EDT
(In reply to comment #24)
> > First, take CDT8 as it was shipped with Indigo, import 170 C++ projects,
> configure many references between these projects and try to run a full
> indexer... But be careful... You will need much patience and a lot of coffee.
> It takes ages!
> Remove the dependencies, run full indexer again and it is faster by a factor of
> 1000.
> This is a bug in official CDT8 and has nothing to do with parallelization.

Hmm... what do your dependencies look like?  Are there any cycles in the dependency graph?  Otherwise I can't imagine why that ought to take 1000 times longer, since, as I said, without cycles it should behave close to the same as without dependencies.  This is purely conjecture, but if there were cycles in the graph, then perhaps the indexer is reindexing your projects over and over again as a result of the changes it generates with each successive indexing pass?

> Second (and disregarding the above bug), if you try to run indexer in parallel
> on projects with many references, you will fail, because all parallel jobs
> synchronize on write lock of PDOM index, which spans over many -if not all-
> project PDOMs.

I don't dispute that there is a bottleneck right now in some parts of the index code, which you've outlined.  Even so, without resolving that bottleneck, it should still be faster than doing it sequentially, as you'll at least parallelize the parsing.

> Third, I do not see much reason why indexer should honour project references at
> all. As already explained in my previous comment, in real life almost all
> projects have different sets of #define's and therefore reuse of indexer
> information across projects is minimal anyway.
> Kind regards,
> Volker

Well, as Markus stated, we use the index of the other project to help speed up indexing by getting information out of it rather than reparsing the headers that project provides.  In theory that speeds things up a lot.
Comment 26 Volker Diesel CLA 2011-07-29 14:58:25 EDT
(In reply to comment #25)

Hi Chris. Thanks for your reply.

> Hmm... what do your dependencies look like?  Are there any cycles in the
> dependency graph?  Otherwise I can't imagine why that ought to take 1000 times
> longer, since as I said, without cycles, it should behave close to the same as
> without dependencies.  This is purely conjecture, but if there were cycles in
> the graph, then perhaps the indexer is reindexing your projects over and over
> again as a result of the changes it generates with each successive indexing
> pass?

I did not explicitly check our dependencies for cycles, but I am 100% sure there are cycles in our project dependencies. I asked Markus Schorn about cyclic dependencies in comment #3 and he replied:

> Wrong or cyclic dependencies will at the worst cause information to be
> duplicated in the indexes of multiple projects.

Is that right? Or will cyclic dependencies lead to cyclic indexing, as you suspect?

As already mentioned earlier in comment #14, my impression is that some indexer code does not scale linearly with the number of projects / project references.
With 170 projects, full indexing takes three hours WITHOUT project references and 10 hours for the first percent of full indexing WITH project references (after ten hours and one percent progress, I cancelled the test).
I reduced the workspace size from 170 to 20 projects (and consequently fewer project references). Now the difference was 20 mins WITHOUT project references compared to 40 mins WITH project references (official CDT8 without any of the parallel stuff).
See also the profiler data attached to this bug.

> I don't dispute that there is a bottle neck right now in some parts of the
> index code, which you've outlined.  Even still though, without resolving that
> bottleneck, it should still be faster than doing it sequentially, as you'll at
> least parallelize the parsing,

That's what I had hoped for, but it doesn't work that way :-(
You can try it with my attached patch. When complex project references are in place, parallel indexing is basically single-threaded. Whenever you stop in the debugger, you see 1 of N jobs parsing and N - 1 jobs waiting on WritablePDOM's write lock.

> Well as Markus stated, we use the index of the other project to help speed up
> indexing by getting information out of it rather than reparsing headers that
> project provides.  In theory that speeds things up a lot.

Maybe I still didn't get the point...
1) Consider project P1 with a set of include paths I1 and a set of #defines D1.
2) Consider project P2 with a set of include paths I2 and a set of #defines D2.
3) Consider that P2 has a reference to P1.
How can you reuse indexing information from P1 while indexing P2 if either I1 != I2 or D1 != D2?
As far as I understand, the indexer for P2 will not be able to reuse any information from P1's index.
Am I right, or did I miss something?

Thanks + kind regards.
Volker
Comment 27 Sergey Prigogin CLA 2011-07-30 00:13:54 EDT
(In reply to comment #22)
> First, I question that it is possible to utilize a 4 core machine 100% with
> parallelization on parser level only (and I consider anything else than 100%
> utilization a waste of resources). And normal desktop PCs will probably have
> much more than 4 cores in two or three years. So I think, parallelization ONLY
> on parser level is the dead end (even without taking CDT code complexity into
> account).

Better parallelization is achieved when work units are smaller and therefore there are more of them. 170 project-level work units is enough for efficient parallelization when there are no dependencies between them. With dependencies between projects 170 may no longer be a big enough number for efficient parallelization. Parallelization at file level would benefit from thousands of work units. Even with constraints due to dependencies between files, there will be more parallelizable work units than in the case of projects.

The above assertion is based on the assumption that writing to the index database constitutes a small fraction of indexing time. If this assumption is incorrect, writing to the database will become a bottleneck precluding efficient parallelization.

> Second, I wonder for whom CDT is built? Probably not for "hello world" style
> projects. Hello world projects probably don't have performance issues with CDT
> indexing, right? And large projects have a lot of subcomponents (and hence a
> lot of potential CDT projects). How does a larger C++ project look like? You
> probably have some executables and a bunch of shared libraries, and at least
> each shared library must be in its own CDT project, because in real life, each
> shared library will have it's own set of #define's, that is different from any
> other shared library (at least a #define for DLL symbol import/export will be
> different in each shared library project). So I think, non-hello-world C++
> projects will naturally consist of a lot of CDT projects.

CDT projects used for large bodies of code are usually monolithic, not consisting of many small projects. This tendency reflects the fact that given a set of complex makefiles it's easier to create a single project out of them than many small projects.

At Google, monolithic projects are used to work with one of the largest codebases in the industry.

> And last but not least, why not have both? Why not have inter-project
> parallelization AND a reasonable parallelization on parser level (reasonable
> with regards to code complexity)? Is there anything wrong with inter-project
> parallelization?

Having both is fine, but intra-project parallelization would likely achieve larger performance gains and would benefit more users.
 
> But maybe this is more a topic for the cdt-dev mailing list. If you don't mind, I
> would like to post this discussion there.

Absolutely.

> Kind regards,
> Volker
Comment 28 Volker Diesel CLA 2011-07-31 14:25:23 EDT
I created a third patch that makes parallel indexing more convenient (e.g. one job progress report for the entire job queue), fixes bugs in cancel handling and job scheduling, and adds some beautification to the indexer preferences UI.
It would be great if some of you would review and comment on this.
I will post a message on cdt-dev to fuel further discussion.
Comment 29 Volker Diesel CLA 2011-07-31 14:27:03 EDT
Created attachment 200634 [details]
Bug fixes, more convenient progress handling, UI beautification
Comment 30 Volker Diesel CLA 2011-07-31 15:34:45 EDT
(In reply to comment #27)

Hi Sergey. Thanks for your reply.

> Parallelization at file level would benefit from thousands of
> work units. 

Okay, now I got you... You have been talking about parallelization at the file level while I have been talking about parallelization at the intra-parser level.
I already asked about parallelization at the file level (see comment #1), but there was no reply.
The discussion instead moved on to parallelization at the parser level (see comment #2).

> At Google monolithic projects are used to work with one of the largest
> codebases  in the industry.

Well, that's one point of view. In our project, we have a different one. We are using our company's legacy make infrastructure... no make/gmake files, etc. One may question whether that is smart, but at the end of the day, that's the way it is. And it has the advantage of being able to model fine-grained projects with fine-grained project dependencies. And I am talking about "one of the largest code bases in the industry" as well :-)
I don't think CDT should be developed only for monolithic projects; it should also support fine-grained project configurations with many fine-grained dependencies.

And I wonder how you manage situations where library A needs to be built with "#define X 1" and library B with "#define X 0"?
How do you model that situation in one monolithic project that contains both library A and library B? I don't see how you can get consistent indexer information in your monolithic environment.
Maybe we are doing something the wrong way. Could we achieve a consistent monolithic configuration in our environment as well?

Thanks + kind regards.
Volker
Comment 31 Markus Schorn CLA 2012-01-11 01:39:24 EST
We are about to change the behavior of the indexer with respect to dependent projects (see bug 368160), which opens up the path for parallelizing the indexer across projects.
Comment 32 Markus Schorn CLA 2012-01-27 06:42:11 EST
We have changed the behavior of the indexer such that the indexing of projects is entirely independent. This makes parallelization of the indexer across projects easier.
Volker, if you are willing to port your patch so that it applies to the master branch, I'll work on getting it committed.
Comment 33 Volker Diesel CLA 2012-01-28 09:09:11 EST
Created attachment 210230 [details]
git patch for this bug based on CDT master

I created and attached a patch for indexer parallelization across projects. Please let me know if you encounter any build or runtime issues.
Comment 34 Marc-André Laperle CLA 2012-04-25 15:09:42 EDT
*** Bug 377686 has been marked as a duplicate of this bug. ***
Comment 35 Junmin Hao CLA 2012-04-25 18:54:51 EDT
(In reply to comment #30)
> Okay, now I got you... You have been talking about parallelization at the file level
> while I have been talking about parallelization at the intra-parser level.
> I already asked about parallelization on file level (see comment #1), but there
> was no reply.
> Discussion instead went on to parallelization on parser level (see comment #2).
> 
Yes, parallelization at the file level is desperately needed at Google.  I agree there is no easy solution for file-level parallelization, but solving hard problems is more fun.
Comment 36 Jerome Correnoz CLA 2013-03-04 10:03:13 EST
Hi all,

Is this enhancement already available in a given CDT 8.x version? If not, are there any plans to deliver it?

Sorry in advance if the answer is known by all (but me :-( ); I'm trying to catch up on this thread, and the final status is not clear to me.

Regards,
  Jerome
Comment 37 Sergey Prigogin CLA 2013-03-04 13:08:03 EST
(In reply to comment #36)
AFAIK nobody is working on indexer parallelization at the moment.
Comment 38 Pan Deng CLA 2013-04-26 05:16:30 EDT
Why has this work stopped? It is a really useful feature!

I have used CDT with Chromium and WebKit, which contain many, many files; the single-threaded indexing procedure takes more than 4 hours on my 4-core / 8-thread CPU, and during that time the UI is laggy or unresponsive...

CDT is really good, but this may be the only reason I give it up.

thanks!

Pan
Comment 39 Axelsson Martin CLA 2013-09-25 05:51:00 EDT
Hi,
Has the investigation of this problem stopped?
Steve: Do you know anything about this work?

Thanks Martin
Comment 40 Benjamin Shadwick CLA 2016-08-25 12:04:58 EDT
FYI, Eclipse CDT Neon is still using single-threaded indexing. It takes a very long time to index my 5 projects one at a time, with 3 VM cores sitting idle.
Comment 41 Dongseong Hwang CLA 2018-01-02 18:17:27 EST
Oops, I clicked 'rebuild' by accident and have to wait for 3 hours. Meanwhile, I am reading this thread in 2018.