Although memory complexity analysis was scheduled for July–August in my proposal, I was able to cover it within June, having accomplished my goals earlier thanks to the head start from a bit of research done before the GSoC selection was announced. A separate branch, “Memtest”, was created to deal with memory complexity; the prior work (time complexity analysis) was left untouched until I merged this branch once the Travis and Codacy (build/code-review) checks passed. It is still a work in progress, being in the test-run stages before I publish full-fledged examples, but some of them can be found in this thread.
Focusing on `asymptoticMemoryUsage`, this post is a go-to for anyone interested in quantifying memory usage (for an expression over a range of data sizes). Like the other posts in this series, it serves as a ready reference for package users, as mentioned in `@details` for each of the functions.
Function Details: Parameters

- `e`: An expression in the form of a function operating on `N`, where `N` represents the size of the input, as used in the general notation for denoting asymptotic complexity classes. The numeric set of values for `N` is given by the `data.sizes` argument.
- `data.sizes`: A set of data sizes, preferably a sequence over powers of ten. It is highly recommended to use small step sizes so as to include a substantial number of intermediate values between the powers of ten, which improves the accuracy of the results. Example: `data.sizes = 10^seq(1, 5, by = 0.1)` or `data.sizes = 10^seq(1, 5, by = 0.01)`.
- `max.bytes`: The limit (in bytes) on the benchmarked memory allocation per iteration. If this threshold is crossed (which is checked after the rest of the code in the loop body is evaluated), further computation stops by breaking out of the loop, avoiding extra time spent on larger data sizes and thereby controlling the time invested in running this function.
It’s recommended to test with higher values of this threshold for better accuracy, since the function altogether runs pretty fast. In fact, it runs faster than `asymptoticTimings`, because `bench` computes the memory allocations without spending extra time on replicates - it sums up to just one value per data size.
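The break-on-threshold behaviour described above can be sketched in base R. Here `measure()` is a hypothetical stand-in for the real benchmark (it simply returns bytes proportional to `N`; the actual function uses `bench`):

```r
data.sizes <- 10^seq(1, 4, by = 0.5)
measure <- function(N) 100 * N  # hypothetical stand-in returning "bytes"
max.bytes <- 10^5

break.bool <- TRUE
results <- lapply(data.sizes, function(N) {
  if (break.bool) {
    bytes <- measure(N)
    # The current result is still kept; only subsequent iterations are skipped:
    if (bytes > max.bytes) break.bool <<- FALSE
    data.frame(bytes = bytes, size = N)
  }
})
df <- do.call(rbind, results)
nrow(df)  # 6: sizes past the first limit-crossing one are never measured
```

Skipped iterations return `NULL`, which `rbind` silently drops, so only the sizes up to (and including) the first one crossing the limit remain.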
Function Details: Implementation

The expression passed as a parameter is wrapped in `bench::bench_memory()`, which computes the raw memory allocations for the different dataset sizes (given by the input `data.sizes`) within an `lapply` loop that iterates over them.
Inside the ‘lapply-loop’, the memory-use size obtained at each iteration is paired with the corresponding data size (collected sequentially inside the loop) and returned as a data frame forming one list element; the memory size constraint given by `max.bytes` is checked before moving on to the next iteration.
Lastly, the data collected across iterations is combined from the list into a single data frame through a call to `rbind()`.
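The `substitute()`/`eval()` trick that turns the passed expression into a function of `N` (used in the implementation below) can be isolated into a minimal sketch:

```r
make_fun_of_N <- function(e) {
  lang.obj <- substitute(e)   # capture the unevaluated expression
  function(N) eval(lang.obj)  # evaluate it with N bound as the formal argument
}

f <- make_fun_of_N(sum(seq_len(N)))
f(10)   # 55
f(100)  # 5050
```

Because `substitute()` captures the promise without forcing it, `N` need not exist when the expression is passed in; it is resolved later, inside the returned closure's frame.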
Function Details: Return value

The function returns a data frame with the numeric columns ‘Memory usage’ and ‘Data sizes’, composed of the benchmarked memory allocations and the corresponding dataset sizes respectively. This data frame can then be passed as a parameter to:

- the `asymptoticMemoryComplexityClass` function, to get the memory complexity class.
- the `plotMemoryUsage` function, to plot a graph of the same via `ggplot2`.
Usage
Pass the `data.sizes` to run against, along with your algorithm (expressed as a function of `N`) and, optionally, a suitable memory size limit:
```r
library(PeakSegDP)
asymptoticMemoryUsage(cDPA(rpois(N, 10), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
```
```
  Memory usage  Data sizes
1         6256    10.00000
2         9416    31.62278
3        20032   100.00000
4        51136   316.22777
5       149632  1000.00000
6       460960  3162.27766
7      1445632 10000.00000
```
Here the default memory size limit (`max.bytes`) of 10^6 bytes (1 MB) applies.
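That 1 MB default relies on R's `missing()` check on the optional argument; a minimal sketch of the pattern:

```r
resolve_limit <- function(max.bytes) {
  # Fall back to 10^6 bytes (1 MB) when the caller omits the argument:
  if (missing(max.bytes)) 10^6 else max.bytes
}

resolve_limit()      # 1e+06
resolve_limit(2048)  # 2048
```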
Code
```r
asymptoticMemoryUsage <- function(e, data.sizes, max.bytes)
{ # Stop if data.sizes contains any NaN, infinite (Inf/-Inf) or NA/missing values:
  if (any(!is.finite(data.sizes)))
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  # Collect the parse tree for our expression with all the variables found locally in the environment:
  lang.obj <- substitute(e)
  # Assign the variable N with the user-supplied data sizes:
  N <- data.sizes
  # Create a local function which takes N as its formal argument:
  fun.obj <- function(N)
  { # Evaluate our substituted (previously unevaluated) expression object, so as to make our expression a function of N:
    eval(lang.obj)
  }
  # Set the memory size limit (in bytes) as provided by the user, otherwise use the default of 10^6 bytes (1 megabyte):
  memory.size.limit <- if (missing(max.bytes)) 10^6 else max.bytes
  # Set a boolean for breaking out of the lapply loop if the above memory size limit is exceeded:
  break.bool <- TRUE
  # Use an lapply (returning a list) over a function which runs through the data sizes:
  memory.metrics.list <- lapply(seq_along(data.sizes), function(i)
  { # Proceed only if the memory limit has not been breached:
    if (break.bool)
    { # Compute the benchmarked memory allocation via bench::bench_memory()$mem_alloc:
      benchmarked.memory.size <- bench::bench_memory(fun.obj(data.sizes[i]))$mem_alloc
      # Toggle the boolean to FALSE to discontinue further iterations: (the current one still runs)
      if (benchmarked.memory.size > memory.size.limit) break.bool <<- FALSE
      # Collect the dataset size for this iteration:
      data.size <- data.sizes[i]
      # Return the memory metrics obtained as a data frame composed of the memory allocation and corresponding data size:
      return(data.frame(benchmarked.memory.size, data.size))
    }
  })
  # Combine the per-iteration data frames by calling rbind on our list:
  resultant.df <- do.call(rbind, memory.metrics.list)
  # Assign proper column names to our data frame: (as expected by asymptoticMemoryComplexityClass)
  colnames(resultant.df) <- c("Memory usage", "Data sizes")
  # Return it:
  return(resultant.df)
}
```
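As a rough, dependency-free illustration of the same control flow, here is a variant with `utils::object.size` standing in for `bench::bench_memory` - a sketch under that substitution, not the package's actual measurement (`object.size` reports the size of the result, not the allocations made while computing it):

```r
asymptoticMemoryUsageSketch <- function(e, data.sizes, max.bytes = 10^6) {
  if (any(!is.finite(data.sizes)))
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  lang.obj <- substitute(e)
  fun.obj <- function(N) eval(lang.obj)
  break.bool <- TRUE
  memory.metrics.list <- lapply(seq_along(data.sizes), function(i) {
    if (break.bool) {
      # Stand-in measurement: size of the result object, in bytes.
      bytes <- as.numeric(object.size(fun.obj(data.sizes[i])))
      if (bytes > max.bytes) break.bool <<- FALSE
      data.frame(bytes, data.sizes[i])
    }
  })
  resultant.df <- do.call(rbind, memory.metrics.list)
  colnames(resultant.df) <- c("Memory usage", "Data sizes")
  resultant.df
}

# Allocating a numeric vector of length N grows linearly in memory:
df <- asymptoticMemoryUsageSketch(numeric(N), data.sizes = 10^seq(1, 6, by = 1))
```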
A roxygen-documented version of the same can be found here.
Anirban | 06/20/2020 |