When working with large datasets, using more system resources such as multiple CPU threads (for shared-memory parallelization) can drastically reduce computation time. However, it isn't straightforward to determine the thread count that gives the lowest runtime for an operation or, alternatively, the number of threads up to which the speedup still scales efficiently.
Given an R package like data.table, where most operations make use of parallelization, it is convenient to automatically determine the number of threads that gives the fastest execution time for a particular routine, without the user needing to rely on ad-hoc experiments. Likewise, it would be handy if the user could set the thread count to target maximum scalability (or a user-defined fraction of it) in terms of the speedup obtained, which can otherwise be tricky or time-consuming to figure out manually.
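For context, a manual experiment of the kind described above might look like the following minimal sketch. It relies only on data.table's own setDTthreads() and getDTthreads(); the synthetic data, the repetition count, and the choice of a sort as the benchmarked operation are assumptions made for illustration.

```r
library(data.table)

# Synthetic data, kept small so the sketch runs quickly (an assumption).
set.seed(1)
dt <- data.table(x = sample(1e6))

# Remember the current setting so it can be restored afterwards.
originalThreads <- getDTthreads()
threadCounts <- seq_len(originalThreads)

# Median elapsed time of a few repetitions of an ordering operation
# for each thread count.
timings <- sapply(threadCounts, function(n) {
  setDTthreads(n)
  median(replicate(3, system.time(dt[order(x)])[["elapsed"]]))
})

# Restore the original thread count.
setDTthreads(originalThreads)

# Thread count with the lowest median runtime in this run.
fastestThreads <- threadCounts[which.min(timings)]
print(data.frame(threads = threadCounts, medianSeconds = timings))
```

Repeating this by hand for every routine and data size is exactly the kind of effort the package aims to remove.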
This is where data.table.threads comes in: a package designed to assist in finding the most suitable thread count for the various parallelizable routines within data.table.
Key Features
In terms of user-facing functions, findOptimalThreadCount(rowCount, columnCount) runs a set of predefined benchmarks for each applicable function across varying numbers of threads (iteratively from one to the number available as per the user's system/configuration) and returns a data.table containing the optimal thread count for each function. The returned object is of a custom class, for which print and plot methods are provided.
Printing the results lists, for each function, the fastest median runtime (in milliseconds) along with the thread count that achieved it.
The plot method computes the speedup and, in addition to the optimal/ideal case, a 'recommended' thread count for each function. The recommended value is the point of near-maximum gain in thread-use efficiency, and corresponds to the highest intersection between the sub-optimal (50% speedup efficiency) line and the measured speedup data. Altogether, the speedup plot shows the ideal, sub-optimal, and measured performance trends for each of the benchmarked data.table functions.
Here is an example:
(benchmarkData <- data.table.threads::findOptimalThreadCount(1e7, 10, verbose = FALSE))
plot(benchmarkData)
Output:
data.table function   Thread count   Fastest median runtime (ms)
-------------------   ------------   ---------------------------
forder                8              82.736011
GForce_sum            6              15.670897
subsetting            6              54.386931
frollmean             6              23.329410
fcoalesce             5              7.319135
between               6              22.716911
fifelse               10             18.825437
nafill                10             7.006490
CJ                    1              3.194330
In the plot above, the black lines represent the measured speedup (case-wise for each function), while the light blue and red lines represent the recommended and ideal speedup values respectively.
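The relationship between measured, ideal, and sub-optimal speedup can be sketched in a few lines of base R. The runtimes below are hypothetical values chosen for illustration, not package measurements, and reading the "highest intersection" as the largest thread count whose measured speedup still meets the 50% efficiency line is an assumption about how the recommendation is derived.

```r
# Hypothetical median runtimes (ms) for one function at 1..8 threads;
# illustrative numbers only, not results produced by the package.
medianRuntimeMs <- c(100, 55, 42, 36, 33, 31, 30, 31)
threadCount <- seq_along(medianRuntimeMs)

# Speedup relative to the single-threaded run.
measuredSpeedup <- medianRuntimeMs[1] / medianRuntimeMs

# Ideal case: perfect linear scaling with the thread count.
idealSpeedup <- threadCount

# Sub-optimal line: 50% speedup efficiency.
subOptimalSpeedup <- 0.5 * threadCount

# Optimal: the thread count with the lowest median runtime.
optimalThreads <- threadCount[which.min(medianRuntimeMs)]

# Recommended (assumed interpretation): the largest thread count whose
# measured speedup is still at or above the sub-optimal line.
recommendedThreads <- max(threadCount[measuredSpeedup >= subOptimalSpeedup])

print(c(optimal = optimalThreads, recommended = recommendedThreads))
```

With these numbers the fastest run uses 7 threads, while 6 is the last thread count that still delivers at least half of the ideal speedup.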
Finally, setThreadCount(benchmarkData, functionName, efficiencyFactor) sets the thread count based on the observed results for a user-specified function and a speedup efficiency factor (in the range [0, 1], with a default of 0.5):
setThreadCount(benchmarkData, functionName = "forder", verbose = TRUE)
getDTthreads()
Output:
The number of threads that data.table will use has been set to 3, based on an efficiency factor of 0.5 for data.table::forder() based on the performed benchmarks.
[1] 3
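To illustrate how the efficiency factor influences the chosen thread count, here is a small base R sketch. The helper pickThreadCount() and the runtimes are hypothetical and are not part of data.table.threads; the selection rule (largest thread count whose speedup is at least efficiencyFactor times the ideal) is an assumed reading of the description above.

```r
# Hypothetical helper, not part of data.table.threads: returns the largest
# thread count whose measured speedup is at least efficiencyFactor * ideal.
pickThreadCount <- function(runtimesMs, efficiencyFactor = 0.5) {
  stopifnot(efficiencyFactor >= 0, efficiencyFactor <= 1)
  threads <- seq_along(runtimesMs)
  speedup <- runtimesMs[1] / runtimesMs
  max(threads[speedup >= efficiencyFactor * threads])
}

# Illustrative median runtimes (ms) at 1..8 threads, not real measurements.
runtimesMs <- c(100, 55, 42, 36, 33, 31, 30, 31)

# A stricter efficiency requirement leads to fewer threads being chosen.
factors <- c(0.25, 0.5, 0.75, 1)
chosen <- sapply(factors, function(f) pickThreadCount(runtimesMs, f))
print(data.frame(efficiencyFactor = factors, threads = chosen))
```

A value picked this way could then be applied with data.table::setDTthreads(), which is the setting that getDTthreads() reports.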
Setup
Use remotes or devtools to install the development version of the package directly from its GitHub repository (either helper can itself be installed with install.packages() if it is missing):
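# Option 1: using remotes
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("Anirban166/data.table.threads")

# Option 2: using devtools
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("Anirban166/data.table.threads")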