data.table work diary: Jun 10 - 14

June 10, ‘24:

1) Helped Doris create an atime-generated plot that shows only the time measurements rather than both time and memory, and reviewed 6175. (2 hours)

2) Continued the testing work from last week. Found, via a grep on the R files for usage, that calling the C code through the .External function is faster. Paired it with the symbol ‘Cfastmean’ (which is used in two R files) in my test code. Created more test cases; the ones that failed last week for base::mean vs. fast mean still fail, which may mean there is a bug in the fast mean code. The results are the same with and without the change to fast mean, though. (6 hours)

June 11, ‘24:

1) Gave Doris more suggestions for making the single-unit plot for her presentation, and tried to debug an error in my code for it. (1 hour)

2) Shared my testing results briefly in an issue (6176). Started reading fastmean.c itself to understand it and see how it could be made simpler. (5 hours)

June 12, ‘24:

1) Incorporated Toby and Michael’s feedback into my testing process, and tried to come up with more critical cases (stress tests, corner cases, etc.). For example: (3 hours)

library(data.table)

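meanComparison <- function(x, na.rm)
{
  # Use the caller's na.rm in both computations instead of hard-coding TRUE
  baseR <- mean(x, na.rm = na.rm)
  # fastmean <- .External("Cfastmean", x, ...)
  options(datatable.optimize=1)
  # Wrap the input in a one-column table so that `values` exists in the query below
  dt <- data.table(values = x)
  fastmean <- dt[, mean(values, na.rm = na.rm), verbose = TRUE]
  cat("Results as computed by:\nBase R's mean:", baseR, "\ndata.table's fast mean:", fastmean, "\n")
  # identical() requires an exact match, so any floating-point difference counts as a failure
  fifelse(identical(baseR, fastmean), "Passed", "Failed")
}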

testInputs <- list(
  c(.Machine$double.xmax, -.Machine$double.xmax, 0),
  c(1e10, -1e10, 1e-10, -1e-10, 0),
  c(Inf, -Inf, NaN), 
  c(rep(1, 1e6), rep(-1, 1e6)),
  rnorm(1e6, mean = 0, sd = 1e5),
  runif(1e6, min = -1e5, max = 1e5),
  c(rnorm(1e6, mean = 0, sd = 1e5), rep(1, 1e6), .Machine$double.xmax),
  c(rep(.Machine$double.eps, 1e6), rep(-.Machine$double.eps, 1e6)),
  c(rep(.Machine$double.xmin, 1e6), rep(-.Machine$double.xmin, 1e6)),
  c(rep(1e308, 1e3), rep(-1e308, 1e3))
)

for(i in seq_along(testInputs))
{
  cat("Test case", i, ":\n")
  cat(meanComparison(testInputs[[i]], na.rm = TRUE), "\n")
}
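
Since identical() demands an exact match, a "Failed" above does not say whether the two means differ in the last few bits or by a large amount. A minimal sketch of a helper that separates the two cases follows; the name compareMeans is hypothetical and not part of my test script, and the tolerance is simply all.equal()'s default.

```r
# Hypothetical helper: classify how closely two scalar results agree.
compareMeans <- function(expected, actual) {
  if (identical(expected, actual)) return("identical")
  # all.equal() returns a character description on mismatch, hence isTRUE()
  if (isTRUE(all.equal(expected, actual))) return("equal within tolerance")
  "different"
}

compareMeans(0.3, 0.1 + 0.2)  # differs only by floating-point rounding
compareMeans(1, 2)            # differs by a large amount
compareMeans(NaN, NaN)        # identical() treats NaN as equal to NaN
```

Swapping the last line of meanComparison for a call like this would show which failing cases are rounding-level differences.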

2) Solved the errors with my atime.R modifications and tried to help Doris create a single time unit plot. Created issue #53 in atime for that, and asked Toby whether an option for showing only time or only memory would be viable. (3 hours)

3) Refactored setThreadCount to include the new ‘type’ argument (derived the closest points logic for the recommended speedup type from my plot method), updated the roxygen comments and messages, and removed some redundancy I noticed (for example, by moving the conditional logic inside the setDTthreads() call and computing the speedup only when necessary). (3 hours)

June 13, ‘24:

1) Found and fixed bugs that show up in data.table.threads when functions are called in a specific order:
i) Running plot.data_table_threads_benchmark after using setThreadCount causes the plot to be misconfigured (this issue shows the implications). Figured out that the cause was the thread count not being reset in the plot method. Sent and merged PR 11 to fix it.
ii) setThreadCount does not run for the recommended type after using the plot method. Initially I thought I was modifying in memory the data.table I take as an argument, so I made a deep copy of it via copy(x) and also renamed the variables common to both functions (such as merged and closestPoints). That didn’t change anything, so I discarded that change, and I figured out that the execution path is not being reached because of my switch case logic. (8 hours)
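
The thread count problem in i) comes down to a function changing global state and not putting it back. A minimal sketch of the usual pattern is below; withTemporaryThreads is a hypothetical name and not a function from data.table.threads, while getDTthreads() and setDTthreads() are the regular data.table calls.

```r
library(data.table)

# Hypothetical wrapper: run `code` with a temporary thread count.
withTemporaryThreads <- function(threads, code) {
  oldThreads <- getDTthreads()
  # Restore the caller's thread count even if `code` throws an error
  on.exit(setDTthreads(oldThreads), add = TRUE)
  setDTthreads(threads)
  # `code` is a lazily evaluated argument, so it runs only now, after the switch
  force(code)
}

before <- getDTthreads()
withTemporaryThreads(1L, getDTthreads())  # reports the temporary count
getDTthreads() == before                  # the original count is back afterwards
```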

June 14, ‘24:

1) Tried to make a helper function, computeSpeedup, that takes care of the optimal and recommended speedup computations and adds them to the data.table holding the benchmark data, so that we can call it here and then in the plot method. I am facing issues when using the recommended type, though, and am waiting for Toby to give the green light on this approach before I push. Reviewed more PRs, a few over the weekend too (6179, 6181, 6182, 6178, 6184). (9 hours)
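
To illustrate the idea of the helper, here is a rough sketch under stated assumptions. It is not the actual computeSpeedup: the function name computeSpeedupSketch and the column names threadCount and medianTime are made up for the example, speedup is taken as the single-thread time divided by the time at each thread count, and the recommended type is left out.

```r
library(data.table)

# Hypothetical sketch; the column names threadCount and medianTime are assumptions.
computeSpeedupSketch <- function(benchmarkData) {
  # Work on a copy so the caller's table is not modified by reference
  result <- copy(benchmarkData)
  baseline <- result[threadCount == 1L, medianTime]
  result[, speedup := baseline / medianTime]
  result[]
}

# Made-up timings, for illustration only
timings <- data.table(threadCount = 1:4, medianTime = c(8, 4, 3.2, 4))
withSpeedup <- computeSpeedupSketch(timings)

# Thread count with the highest speedup
withSpeedup[which.max(speedup), threadCount]
```

Returning a new table with the extra column means both the thread-setting function and the plot method can read the same values without recomputing them.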