With the GSoC’20 coding period underway, it’s finally time to write up the testComplexity functions one by one as they are completed, along with their functionality and implementation details - which is one purpose these blog posts serve!
Proceeding with my thought-out proposal and the first objective as laid out here, the timings-quantifying function asymptoticTimings has been written (subject to changes if required), and it is the topic of discussion in this post. There are issues on the GitHub repository pertaining to this function, where we discussed implementation-specific details; the outcome of those discussions is the current state of asymptoticTimings.
Brief outlook on its necessity
To get the complexity class for a user-given algorithm/function (an expression is used instead of a function for convenience), we must first obtain the timings with their corresponding data sizes, since the complexity classification takes place through glm-based formulae relating the data (the timings here) to the data sizes, both of which we must procure before passing them on to the time-complexity classifying function.
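To give an idea of why the timings and data sizes are needed together, here is a purely hypothetical sketch of how a glm-based classification could proceed (an illustration under my own assumptions, not the exact routine used by the classifying function): one glm is fit per candidate complexity class, and the best-fitting formula wins.
sizes <- 10^(1:5)
timings <- sizes * log(sizes) + rnorm(5, sd = 10)  # fake loglinear timings, for illustration only
fits <- list(
  constant  = glm(timings ~ 1),
  linear    = glm(timings ~ sizes),
  loglinear = glm(timings ~ I(sizes * log(sizes))),
  quadratic = glm(timings ~ I(sizes^2))
)
sapply(fits, deviance)  # the smallest deviance suggests the best-fitting class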
Function Details : Parameters
- e: An expression in the form of a function operating on N, where N resembles the size of the input, as used in the general notation for denoting asymptotic complexity classes, i.e. big-O notation. (Eg: O(N^2) denotes quadratic complexity in N.) The numeric set of values for N is given by data.sizes (the next argument).
- data.sizes: A set of data sizes, which should preferably be a sequence of powers of ten. It is highly recommended to use by with a suitable value to incorporate sub-values between the powers of ten, which improves the accuracy of results. Example: data.sizes = 10^seq(1, 5, by = 0.5)
- max.seconds: The maximum time, in seconds, that the benchmarked timings may take on average per iteration for a particular data size. The mean is considered here to avoid the additional computation cost of comparing each individual timing value with our threshold (instead, only one comparison against the mean of the timings is made). If the set threshold is crossed (which is checked after the rest of the code in the loop has been evaluated), the function stops further computation by breaking out of the loop, avoiding extra time spent on larger data sizes; a short sketch of this check follows the list. In short, this parameter can control the time invested while running this function. However, it can be a double-edged sword depending on the time limit the user provides: if a very small time bound is given, the accuracy is lowered or the predicted complexity might be incorrect, in which case it is recommended to re-test with a higher time limit value (in seconds) to verify. The default value has been set to one second, or 10^9 nanoseconds, as observed by the correct prediction of complexity classes for the test functions using the data frames obtained by this function.
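As mentioned above, here is a small sketch (with made-up timing values) of how the mean-based threshold check works; only a single comparison against the mean is made per data size:
max.seconds <- 1
time.limit <- max.seconds * 10^9   # convert seconds to nanoseconds
timings <- c(9e8, 1.1e9, 1.05e9)   # hypothetical benchmarked timings (in nanoseconds)
if(mean(timings) > time.limit) {
  message("Time limit exceeded; skipping the remaining (larger) data sizes.")
}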
Function Details : Implementation
It passes the expression supplied as a parameter to microbenchmark::microbenchmark(), which computes the timings for the different dataset sizes given by data.sizes, inside a loop that iterates over them. Inside the loop, the timings obtained at each iteration are put into a list along with the corresponding data size (appended sequentially), and the time constraint given by max.seconds is checked before moving on to the next iteration.
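For reference, microbenchmark::microbenchmark() runs the given expression several times (100 by default) and records each run's time in nanoseconds; coercing its result to a data frame exposes the time column used here:
library(microbenchmark)
bm <- as.data.frame(microbenchmark(sum(rnorm(100))))
head(bm$time)  # raw timings in nanoseconds, one per run (100 runs by default)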
Lastly, we combine the data collected at each iteration in our list into a single data frame through a call to rbind(). The efficient use of rbind’s functionality here was suggested by my mentor Toby Dylan Hocking, who wrote about the contrast between its usage inside and outside a loop, with the former approach leading to higher complexity (often quadratic) than the latter (linear).
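The contrast in a nutshell: calling rbind() inside the loop recopies all previously accumulated rows on every iteration (roughly quadratic work overall), whereas collecting the pieces in a list and binding them once at the end stays linear. A minimal illustration:
pieces <- list()
for(i in 1:10) {
  pieces[[i]] <- data.frame(time = rnorm(5), size = i)  # store each piece, don't bind yet
}
result <- do.call(rbind, pieces)  # a single linear bind at the end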
Function Details : Return value
The function returns a data frame with numeric columns ‘Timings’ and ‘Data sizes’, which are composed of the benchmarked timings and the corresponding dataset sizes respectively. This data frame can then be passed as a parameter to:
- the asymptoticTimeComplexityClass function, to get the time complexity class.
- the plotTimings function, to plot a graph via ggplot2 for the same.
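For instance, assuming df holds the data frame returned by asymptoticTimings (as in the Usage section below), the downstream calls would simply be:
asymptoticTimeComplexityClass(df)  # returns the time complexity class, e.g. "loglinear"
plotTimings(df)                    # plots the timings against the data sizes via ggplot2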
Usage
Simply pass the required algorithm as a function of N, with a suitable time limit (optional, with the default being set to one second per iteration):
library(PeakSegOptimal)
library(data.table)
df <- asymptoticTimings(PeakSegOptimal::PeakSegPDPA(rpois(N, 1), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
data.table(df)
Timings Data sizes
1: 318703 10.000
2: 174002 10.000
3: 237002 10.000
4: 197901 10.000
5: 221702 10.000
---
596: 90820902 3162.278
597: 99308102 3162.278
598: 102231901 3162.278
599: 118894800 3162.278
600: 118237202 3162.278
Code
The roxygen-documented version can be found here, but here’s a fully-commented one for quick reference:
asymptoticTimings <- function(e, data.sizes, max.seconds)
{ # Check for NaNs, infinities (Inf/-Inf) and NA/missing values:
if(!all(is.finite(data.sizes))) stop("data.sizes must not contain any NA/NaN/Infinite value.")
# Collect the parse tree for our expression with all the variables found locally in the environment (env):
lang.obj <- substitute(e)
# Assign the user-provided data sizes to the variable N:
N <- data.sizes
# Create a local function which takes N as its formal argument:
fun.obj <- function(N)
{ # Evaluate our substituted (previously unevaluated) expression object, so as to make our expression a function of N:
eval(lang.obj)
}
# Set the time limit as provided by the user, otherwise default to 10^9 nanoseconds or 1 second:
time.limit <- ifelse(missing(max.seconds), 10^9, max.seconds*10^9)
# Create a list to store a data frame of timings + data sizes:
timings.list <- list()
# Initialize loop variables:
i <- 1
continue <- TRUE
# Run a loop through the data sizes:
while(i <= length(data.sizes) && continue)
{ # Run the benchmark for our function object:
benchmarked.timings <- as.data.frame(microbenchmark::microbenchmark(fun.obj(data.sizes[i])))
# Collect the dataset size once for each iteration:
benchmarked.timings$data.size <- data.sizes[i]
# Create a data frame composed of our collected timings and corresponding data sizes, then store it in a list per iteration:
timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
# Stop further computation if the mean value of timings from the current iteration exceeds the time limit: (else continue)
if(mean(benchmarked.timings$time) > time.limit) continue <- FALSE
# Increment index:
i <- i + 1
}
# Call rbind on our list:
resultant.df <- do.call(rbind, timings.list)
# Assign proper column names to our data frame:
colnames(resultant.df) <- c("Timings", "Data sizes")
# Return it:
return(resultant.df)
}
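As a side note, the substitute()/eval() combination used above is a standard R metaprogramming pattern. Here is a stripped-down, standalone sketch of it; capture_and_run is a hypothetical helper written only for illustration, not part of the package:
capture_and_run <- function(e, size)
{ # Collect the unevaluated parse tree for the expression:
  lang.obj <- substitute(e)
  # N gets bound to the supplied size when the local function is called:
  fun.obj <- function(N) eval(lang.obj)
  fun.obj(size)
}
capture_and_run(length(rnorm(N)), 1000)  # returns 1000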
Anirban | 06/10/2020 |