- June 3, ‘24:
1) Worked on changes to data.table.threads: `closestPoints` now has `type` (recommended), and I discarded `use.names` and the loop for column order after verifying through debugging that `rbind` is correctly binding things by position; also searched for the still-missing columns and noted them in a comment. (5 hours)
2) Gave Doris more feedback and revised my own slides a bit. (2 hours)
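As a side note on the rbind behavior verified above, a minimal illustration (toy tables, not the package's data) of positional binding with `use.names = FALSE`:

```r
library(data.table)

a <- data.table(x = 1, y = 2)
b <- data.table(y = 3, x = 4)

# use.names = FALSE ignores column names and binds by position,
# so b's first column (y = 3) lands under a's first column (x):
rbind(a, b, use.names = FALSE)
```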
- June 4, ‘24:
1) Conducted Tuesday’s Zoom meetings; helped Doris again (taught her how to update branch names locally and on GitHub). (4 hours)
2) Reviewed PRs 6167 and 6169. (1 hour)
3) Made final changes to my refactoring PR in data.table.threads, added essentials and documentation (including a NEWS.md). (4 hours)
- June 5, ‘24:
1) Made a PR adding a function that sets the thread count for data.table operations based on the optimal performance (in benchmarks) of a user-specified data.table function; tested the function, wrote its roxygen documentation, and updated the examples to use a more practical row count. (8 hours)
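The idea behind that PR can be sketched roughly as follows (hypothetical function and argument names, not the actual PR code; only `setDTthreads()`/`getDTthreads()` are data.table's real API):

```r
library(data.table)

# Hypothetical sketch: time a user-supplied data.table expression at each
# thread count and keep the fastest setting.
findOptimalThreadCount <- function(expr, maxThreads = getDTthreads(), times = 3)
{
  elapsed <- vapply(seq_len(maxThreads), function(n) {
    setDTthreads(n)
    median(replicate(times, system.time(eval(expr))[["elapsed"]]))
  }, numeric(1))
  best <- which.min(elapsed)
  setDTthreads(best) # Leave the best-performing thread count in effect.
  best
}

# Usage: pass the benchmarked operation as a quoted expression.
dt <- data.table(g = sample(1e3, 1e6, TRUE), v = rnorm(1e6))
findOptimalThreadCount(quote(dt[, mean(v), by = g]))
```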
- June 6, ‘24:
1) Started looking into the mutation-testing cases mentioned in issue 6114. For a first go, I went with what Toby suggested: a mutant of fastmean.c line 78 that passes tests. Currently trying to find a way to use that C code from R; from what I've found, `.Call` appears to do the job. Compiling the file directly via gcc always complains about the missing `R.h` header, but using `R CMD SHLIB` to generate the shared object and then loading it via `dyn.load` is working. (8 hours)
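The working compile-and-load route above can be sketched like this (hypothetical file and function names as a stand-in for fastmean.c; `R CMD SHLIB`, `dyn.load()`, and `.Call()` are the actual tools mentioned):

```r
# Write a minimal C file. R CMD SHLIB supplies the include path for R.h,
# which is why a bare gcc invocation complains about the missing header.
writeLines(c(
  "#include <R.h>",
  "#include <Rinternals.h>",
  "SEXP myFastMean(SEXP x) {",
  "  double s = 0;",
  "  R_xlen_t n = XLENGTH(x);",
  "  for (R_xlen_t i = 0; i < n; i++) s += REAL(x)[i];",
  "  return ScalarReal(s / (double) n);",
  "}"
), "myFastMean.c")

system("R CMD SHLIB myFastMean.c")  # builds the shared object
dyn.load(paste0("myFastMean", .Platform$dynlib.ext))
.Call("myFastMean", as.numeric(1:5))  # should agree with mean(1:5)
```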
- June 7, ‘24:
1) Continuing the work above. Now that I'm able to compile and test changes to the C code, I'm creating test cases with `testthat`, covering some simple edge cases. For example:
```r
library(data.table)
library(testthat)

testFastMean <- function(x, na.rm)
{
  dt <- data.table(x = x) # Name the column x so j below uses the column, not the vector in the calling frame.
  options(datatable.optimize = 1)
  result <- dt[, mean(x, na.rm = na.rm)] # Updated this as per Michael's suggestion.
  expected <- base::mean(x, na.rm = na.rm)
  expect_equal(result, expected, tolerance = .Machine$double.eps^0.5)
}

test_that("fast mean tests",
{
  # Edge case for precision testing with large numbers:
  x <- c(1e10, 1e10 + 2, 1e10 + 4)
  testFastMean(x, TRUE)
  # Edge case for precision testing with small differences:
  x <- c(1e-10, 1e-10 + 1e-12, 1e-10 + 2e-12, 1e-10 + 3e-12, 1e-10 + 4e-12)
  testFastMean(x, TRUE)
  # Mixed values:
  x <- c(1e10, 1e-10, 1e12, 1e-12)
  testFastMean(x, TRUE)
  # Large vector with a small increment + testing accumulated precision:
  x <- c(rep(1e10, 1e6), rep(1e-10, 1e6))
  testFastMean(x, TRUE)
})
```
Also testing base R’s `mean` against data.table’s fast mean to check for discrepancies. For example:
```r
test_that("fastmean matches base mean",
{
  x <- c(NA, NA, NA)
  testFastMean(x, TRUE)
  testFastMean(x, FALSE)
  x <- numeric(0)
  testFastMean(x, TRUE)
  x <- c(1L, 2L, 3L, NA, 5L, NA) # NAL was a typo; plain NA coerces to NA_integer_ here.
  testFastMean(x, TRUE)
  testFastMean(x, FALSE)
  x <- c(TRUE, FALSE, NA, TRUE, FALSE)
  testFastMean(x, TRUE)
  testFastMean(x, FALSE)
  x <- rnorm(1e6)
  testFastMean(x, TRUE)
})
```
Got a few breaking results from that, though. For example:

```r
# Inputs to the mean functions that produced discrepancies:
testInputs <- list(
  c(rep(1e308, 1e3), rep(-1e308, 1e3)),
  c(rnorm(1e6, mean = 0, sd = 1e5), rep(1, 1e6), .Machine$double.xmax)
)
```

Nothing for the mutant, though: I'm getting the same results both with and without the change. (8 hours)