  • Awesome work, @pedroacarranza! I love doing and reading performance tests. Some questions this raises for me (many of which I think you already mentioned in the DevOps meeting):

    • What effect could multithreading have on the import process? (from your experiments it sounds like... a lot)
    • What effect could not having the memory-leak have? (not only would it not crash the server, but I wonder if it would trigger fewer garbage collections and thus be faster too)
    • How many of those errors in the Concurrent Requests table were time-outs waiting for a free SQLConnection from the connection pool? (we found significant improvements upping the connection pool limit)
    • How much of the wait time is constructing the SdmxObjects, vs serializing them, vs waiting for the output stream?
    • How much of a problem is the fact that pretty much every single call in the NSI services is blocking?
  • Pedro Carranza @pedroacarranza (Author, Developer)

    Thanks for addressing these questions. I believe they nicely summarize the critical areas of opportunity where we could largely improve the services. Unfortunately, as of now, I don't have the answers.

    I will try to answer with what I think these changes would imply; later I would really like to give a precise answer.

    What effect could multithreading have on the import process? (from your experiments it sounds like... a lot)

    It would bring two main improvements: processing time and scalability with respect to resources.

    The way I see it, there could be two levels of parallelization:

    • First level: parallelize the processing of observations and observation-level attributes, dimension-group-level attributes, and dataset attributes.
    • Second level: parallelize the processing of each row (IObservation). It is not about reading in parallel, but mapping and validating in parallel.

    I believe the first level of parallelization would cut the total import time by half, and the second level would reduce it to one third. A rough sketch of both levels follows this list.
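
    To make the idea concrete, here is a minimal sketch of the two levels, assuming hypothetical ProcessObservations / ProcessGroupAttributes / ProcessDatasetAttributes stages and a per-row map-and-validate step; the actual transfer-service types will of course differ.

    ```csharp
    // Hypothetical sketch only: the real reader/mapper/writer types in the
    // transfer service will differ.
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    public class ParallelImportSketch
    {
        // Level 1: treat observations (+ obs-level attributes), dimension-group
        // attributes and dataset attributes as three independent streams and
        // process them concurrently.
        public Task ImportAsync(IEnumerable<IObservation> observations,
                                IEnumerable<object> groupAttributes,
                                IEnumerable<object> datasetAttributes)
        {
            return Task.WhenAll(
                Task.Run(() => ProcessObservations(observations)),
                Task.Run(() => ProcessGroupAttributes(groupAttributes)),
                Task.Run(() => ProcessDatasetAttributes(datasetAttributes)));
        }

        // Level 2: reading stays sequential, but mapping and validation of each
        // row run in parallel before a single bulk insert.
        private void ProcessObservations(IEnumerable<IObservation> observations)
        {
            var mapped = new ConcurrentBag<MappedRow>();

            Parallel.ForEach(observations, obs =>
            {
                var row = MapToInternalCodes(obs);   // placeholder mapping step
                Validate(row);                       // placeholder validation step
                mapped.Add(row);
            });

            BulkInsert(mapped);                      // e.g. SqlBulkCopy in one go
        }

        // Placeholders standing in for the real implementations.
        public interface IObservation { }
        public class MappedRow { }
        private void ProcessGroupAttributes(IEnumerable<object> attrs) { }
        private void ProcessDatasetAttributes(IEnumerable<object> attrs) { }
        private MappedRow MapToInternalCodes(IObservation obs) => new MappedRow();
        private void Validate(MappedRow row) { }
        private void BulkInsert(IEnumerable<MappedRow> rows) { }
    }
    ```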

    I would really appreciate some feedback on this topic!

    What effect could not having the memory-leak have? (not only would it not crash the server, but I wonder if it would trigger fewer garbage collections and thus be faster too)

    As you might have seen, when the import is very large, the garbage collector ends up taking all of the processing time before the out-of-memory exception is thrown, blocking the remaining rows from being processed. Fixing this will not improve the import time for small files, but it will definitely improve the throughput of large imports and allow more concurrent imports.
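
    One simple way to quantify that (a sketch, not something currently in the tests) would be to compare GC collection counts and elapsed time around an import run before and after the fix:

    ```csharp
    // Minimal GC-pressure probe; runImport stands in for the actual import call.
    using System;
    using System.Diagnostics;

    public static class GcPressureProbe
    {
        public static void Measure(Action runImport)
        {
            int gen0 = GC.CollectionCount(0);
            int gen1 = GC.CollectionCount(1);
            int gen2 = GC.CollectionCount(2);
            var stopwatch = Stopwatch.StartNew();

            runImport();

            stopwatch.Stop();
            Console.WriteLine(
                $"Elapsed: {stopwatch.Elapsed}, " +
                $"Gen0: {GC.CollectionCount(0) - gen0}, " +
                $"Gen1: {GC.CollectionCount(1) - gen1}, " +
                $"Gen2: {GC.CollectionCount(2) - gen2}");
        }
    }
    ```

    Fewer Gen2 collections per imported row after the fix would confirm the "fewer garbage collections, therefore faster" hypothesis.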

    How many of those errors in the Concurrent Requests table were time-outs waiting for a free SQLConnection from the connection pool? (we found significant improvements upping the connection pool limit)

    I don't have the answer. Taking into account the high number of connections required for a single transaction, it totally makes sense that this is a bottleneck for concurrent requests.
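
    For reference, the pool limit is set in the connection string (default 100), and pool exhaustion surfaces as an InvalidOperationException on Open(), so counting those would answer the question. A sketch, with placeholder connection-string values:

    ```csharp
    // Sketch: raising the connection-pool limit and spotting pool-exhaustion
    // time-outs. Server/database values are placeholders.
    using System;
    using Microsoft.Data.SqlClient;   // or System.Data.SqlClient, depending on the project

    public static class PoolCheck
    {
        // "Max Pool Size" defaults to 100; raising it can help when concurrent
        // requests each need several connections.
        private const string ConnectionString =
            "Server=myServer;Database=myDb;Integrated Security=true;" +
            "Max Pool Size=200;Connect Timeout=15;";

        public static void Run()
        {
            try
            {
                using var connection = new SqlConnection(ConnectionString);
                connection.Open();
                // ... run the request's queries ...
            }
            catch (InvalidOperationException ex)
            {
                // Thrown when no pooled connection becomes free within the
                // timeout ("Timeout expired... all pooled connections were in
                // use and max pool size was reached").
                Console.WriteLine($"Possible pool exhaustion: {ex.Message}");
            }
        }
    }
    ```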

    How much of the wait time is constructing the SdmxObjects, vs serializing them, vs waiting for the output stream?

    I did a simple test where I read an SDMX file entirely into memory and then inserted only the observations (via SqlBulkCopy) into a heap table.

    1. Test 1: serialize the rows into IObservation objects, map the SDMX codes to internal codes, and then bulk insert.
    2. Test 2: map the SDMX codes to internal codes and then bulk insert (skipping the IObservation step).

    And the result: test 2 took roughly a quarter of the time of test 1.

    It is not a proper comparison, since the SDMX-CSV reader does a lot of the work, but perhaps not all of that information is needed in this context.
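
    The bulk-insert half of that experiment looks roughly like the sketch below; the table and column names are placeholders, and the mapping step stands in for the real SDMX-to-internal code mapping.

    ```csharp
    // Sketch of the "map then bulk insert" path (test 2); names are placeholders.
    using System.Collections.Generic;
    using System.Data;
    using Microsoft.Data.SqlClient;

    public static class BulkInsertSketch
    {
        public static void InsertObservations(IEnumerable<string[]> csvRows,
                                              string connectionString)
        {
            var table = new DataTable();
            table.Columns.Add("DimensionIds", typeof(string));
            table.Columns.Add("ObsValue", typeof(string));

            foreach (var row in csvRows)
            {
                // MapToInternalCodes stands in for the SDMX-to-internal mapping.
                table.Rows.Add(MapToInternalCodes(row), row[^1]);
            }

            using var connection = new SqlConnection(connectionString);
            connection.Open();

            using var bulkCopy = new SqlBulkCopy(connection)
            {
                DestinationTableName = "dbo.ObservationsHeap"   // heap table
            };
            bulkCopy.WriteToServer(table);
        }

        private static string MapToInternalCodes(string[] row) =>
            string.Join(":", row[..^1]);   // placeholder mapping
    }
    ```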

  • Re: the parallelisation, I'll have to read the transfer service in greater detail to have anything useful to say, but I need to do that anyway, so hopefully I can contribute soon! :-)

    In terms of the wait time, I was referring more to the output tests, though I completely failed to make that clear 😄
