break up the nbnxn_cuda module into multiple compilation units - Redmine #1444

Now that we have 120 kernels compiled for up to four different target architectures, the nbnxn_cuda module takes a very long time to build (1.5-2 min on a fast Intel CPU) and can become the bottleneck during compilation.

To split the module, we may need to introduce intermediate wrapper functions because, AFAIK, the host-side launch call and the kernel it launches need to be in the same compilation unit (need to double-check).
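
A rough sketch of the wrapper idea, written as host-only C++ so it compiles without nvcc. It assumes the constraint above holds (which, as far as I know, is the case when relocatable device code is not enabled). All names here (`KernelArgs`, `launchForceOnly`, `dispatch`, etc.) are hypothetical and not taken from the GROMACS sources. In a real CUDA build, each wrapper would live in its own .cu file next to the `__global__` kernels it launches, and only the wrapper declarations would be shared through a header.

```cpp
// Hypothetical argument bundle; the real kernels take GROMACS-specific structs.
struct KernelArgs
{
    int numPairs;
};

// Each wrapper stands in for a host function defined in its own compilation
// unit. In CUDA, the body would contain the <<<...>>> launch of a kernel
// defined in that same file. Here each one just returns an ID so that the
// dispatch can be checked on the host.
int launchForceOnly(const KernelArgs&)        { return 0; } // unit 1: no energy, no pruning
int launchForceEnergy(const KernelArgs&)      { return 1; } // unit 2: energy, no pruning
int launchForcePrune(const KernelArgs&)       { return 2; } // unit 3: no energy, pruning
int launchForceEnergyPrune(const KernelArgs&) { return 3; } // unit 4: energy and pruning

using LaunchFn = int (*)(const KernelArgs&);

// Table of wrapper pointers indexed as [calcEnergy][doPrune]. The caller only
// needs the wrapper declarations, never the kernel definitions themselves.
constexpr LaunchFn launchTable[2][2] = {
    { launchForceOnly,   launchForcePrune },
    { launchForceEnergy, launchForceEnergyPrune }
};

// Picks the wrapper matching the requested kernel flavor and calls it.
int dispatch(bool calcEnergy, bool doPrune, const KernelArgs& args)
{
    return launchTable[calcEnergy ? 1 : 0][doPrune ? 1 : 0](args);
}
```

With this layout the calling code would do something like `dispatch(true, false, args)` and never reference a kernel symbol directly, which is what allows the kernels to move out of the file that calls them.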

(from redmine: issue id 1444, created on 2014-02-28 by pszilard, closed on 2016-02-15)

Changesets:
- Revision 61db73ad by Szilárd Páll on 2016-01-09T11:42:39Z:

split NBNXN CUDA kernels into four compilation units

The CUDA nonbonded kernels are no longer included in nbnxn_cuda.cu,
but are built in four different compilation units (with/without
energy, with/without pruning) when this is supported. Since we only
support CUDA >= v5.0, the condition is that only CC >= 3.0 devices
are targeted.

Note that for CC 2.x devices all current CUDA compilers, including 7.0,
generate incorrect kernel code (hence the criterion above).

Switching back to a single compilation unit happens automatically
whenever the nvcc flags are auto-generated (as {sm,compute}_20 is
added by default).
To switch manually, use the
GMX_CUDA_NB_SINGLE_COMPILATION_UNIT cmake option.

Fixes #1444

Change-Id: If4eeaa5b58a35c5cd59babd20ef1179c7f27782e