- June 17, ‘24:
1) Created my own version of fastmean.c with a more condensed setup based on the execution path:
SEXP fastmean(SEXP args)
{
double *x;
R_len_t n;
double sum = 0.0;
// Extracting the numeric vector:
if(!isReal(args))
{
error("Input must be numeric");
}
x = REAL(args);
n = length(args);
// Computing the sum:
for(R_len_t i = 0; i < n; ++i)
{
sum += x[i];
}
sum *= n; // This line is the change that I'm testing!
// Computing the mean and returning it as a numeric vector:
double mean = sum / n;
SEXP result = PROTECT(allocVector(REALSXP, 1));
REAL(result)[0] = mean;
UNPROTECT(1);
return result;
}
None of my current test cases distinguish the mutant from the original yet: every test that passes for the original also passes for the mutant. (8 hours)
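To see what the mutation does in isolation, here is a minimal plain-C sketch without the SEXP wrapper. The names mean_original, mean_mutant and kills_mutant are hypothetical and not part of data.table. Since sum *= n followed by sum / n gives back roughly the sum, the mutant should differ from the original whenever n > 1 and the sum is nonzero, at least under this simplified model.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for the unmutated routine: plain arithmetic mean. */
double mean_original(const double *x, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i) sum += x[i];
  return sum / (double)n;
}

/* Hypothetical stand-in for the mutant: identical except for sum *= n. */
double mean_mutant(const double *x, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i) sum += x[i];
  sum *= (double)n;            /* the mutation under test */
  return sum / (double)n;      /* cancels the factor, so this is about the sum */
}

/* Returns 1 when the input distinguishes ("kills") the mutant, 0 otherwise.
   Two NaN results are treated as equal, since NaN != NaN in C. */
int kills_mutant(const double *x, size_t n)
{
  double a = mean_original(x, n);
  double b = mean_mutant(x, n);
  if (isnan(a) && isnan(b)) return 0;
  return a != b;
}
```

For the input {10, 3, 5} the original gives 6 and the mutant gives 18, while single-element or zero-sum inputs cannot tell the two apart. If inputs of the first kind do not fail in the real test suite, it may be worth confirming that the tests actually run the mutated build.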
2) Responded to emails from GSoC students, reviewed 6179. (1 hour)
- June 18, ‘24:
1) Made suggestions as per Ben’s review comments on my PR improving the openmp-utils.Rd documentation. (2 hours)
2) Reviewed PRs (6187, 6189) and led meetings, tried more test cases, and incorporated test case titles. For example: (5 hours)
newTestCases <- list(
list(data = c(rep(1e300, 1e6), rep(-1e300, 1e6)), name = "Large values"),
list(data = c(1e-300, -1e-300, 1e-300, -1e-300), name = "Precision testing (5th one)"),
list(data = c(1 + 7i, 11 + 8i, 5 + 21i), name = "Complex numbers"),
list(data = c(rep(1, 1e6 - 1), NA), name = "Missing values")
)
runAdditionalTests <- function(testCases)
{
for(tc in testCases)
{
data <- tc$data
name <- tc$name
dt <- as.data.table(list(values = data))
options(datatable.optimize = 1)
dt.fastmean.result <- dt[, mean(values, na.rm = TRUE), verbose = TRUE]
baseR.result <- mean(data, na.rm = TRUE)
result <- ifelse(identical(baseR.result, dt.fastmean.result), "Passed", "Failed")
cat(name, ":\n Results as computed by:\n Base R's mean:", baseR.result, "\n data.table's fast mean:", dt.fastmean.result, "\n ", result, "\n\n")
}
}
runAdditionalTests(newTestCases)
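Every comparison above passes na.rm = TRUE, so both sides are expected to skip missing values before averaging. A minimal plain-C sketch of that behaviour follows. It assumes missing values are represented as NaN (R's NA_real_ is a NaN with a specific payload), and mean_na_rm is a hypothetical name, not data.table's actual routine.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: mean of the non-NaN elements of x.
   Returns NaN when no element is kept, mirroring mean() of an empty vector. */
double mean_na_rm(const double *x, size_t n)
{
  double sum = 0.0;
  size_t kept = 0;
  for (size_t i = 0; i < n; ++i) {
    if (!isnan(x[i])) {   /* skip missing values */
      sum += x[i];
      ++kept;
    }
  }
  return kept ? sum / (double)kept : NAN;
}
```

Note that the divisor is the count of kept values, not n, which is the detail a "Missing values" test case is meant to exercise.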
- June 19, ‘24:
1) Thinking of switching to another mutation testing case, as this one is not bearing any results. Last day of trying to come up with more test cases that would break for the mutant but not for the original. Wrote a script to double-check that I’m using the correct data.table installation for testing changes to my C code. Also tried to check (mostly out of curiosity) the difference between the Cfastmean routine and the use of fast mean via mean inside data.table when datatable.optimize=1: (8 hours)
# Function to remove data.table package
removeDT <- function(libPath)
{
installedPackages <- installed.packages(lib.loc = libPath)
if("data.table" %in% rownames(installedPackages))
{
remove.packages("data.table", lib = libPath)
message(paste("Removed data.table from", libPath))
}
else
{
message(paste("data.table not found in", libPath))
}
}
# Unloading from current session:
if("package:data.table" %in% search())
{
detach("package:data.table", unload = TRUE)
message("Unloaded data.table from the session")
}
libraryPaths <- .libPaths()
lapply(libraryPaths, removeDT)
# Checking for any remaining directories and deleting them:
for(libPath in libraryPaths)
{
data_table_dir <- file.path(libPath, "data.table")
if(dir.exists(data_table_dir))
{
unlink(data_table_dir, recursive = TRUE)
message(paste("Deleted directory:", data_table_dir))
}
}
# Verify if data.table has been completely removed
installed_packages <- lapply(libraryPaths, installed.packages)
if(!any(sapply(installed_packages, function(pkg) "data.table" %in% rownames(pkg))))
{
message("data.table has been successfully removed from all library paths.")
} else
{
message("data.table is still installed in some library paths.")
}
# Taking care of .Rprofile files that can load DT:
RprofileFiles <- c("~/.Rprofile", file.path(Sys.getenv("R_HOME"), "etc", "Rprofile.site"))
for(Rprofile in RprofileFiles)
{
if(file.exists(Rprofile))
{
RprofileContent <- readLines(Rprofile)
if(any(grepl("data.table", RprofileContent)))
{
message(paste("data.table reference found in:", Rprofile))
} else
{
message(paste("No data.table reference in:", Rprofile))
}
}
}
# Verifying that data.table is installed in the current directory: (after removing from .libPaths() directory via remove.packages without lib spec)
dt.path <- try(find.package("data.table", lib.loc = getwd()), silent = TRUE)
if(inherits(dt.path, "try-error"))
{
cat("data.table is not installed in the current directory.\n")
devtools::install(".")
} else
{
cat("data.table is installed in the current directory at:", dt.path, "\n")
}
printStack <- function()
{
cat("Call stack:\n")
for(i in 1:sys.nframe())
{
if(exists("sys.calls", frame = i))
{
call <- sys.calls()[[i]]
cat(deparse(call), "\n")
}
}
}
# Reloading data.table, since it was detached from the session above:
library(data.table)
DT <- data.table(x = c(10, 3, NA, 5))
# Approach A: (Using optimize=1 and calling mean inside a data.table scope)
options(datatable.optimize = 1)
result.dt <- DT[, mean(x, na.rm = TRUE)]
printStack()
# Approach B: (Directly calling Cfastmean via .External)
result.Cfastmean <- .External("Cfastmean", DT$x, na.rm = TRUE)
printStack()
# if("Cfastmean" %in% ls(envir = baseenv()) && length(body(mean)) >= 2 && identical(body(mean)[[2]], quote(.External("Cfastmean", ...)))) { print("mean uses Cfastmean") }
- June 20, ‘24:
1) Trying to create test cases that break for the coalesce mutant. fcoalesce fails for raw inputs: (6 hours)
test_that("Single element raw vector test.",
{
x <- list(as.raw(c(NA, 2, NA)), as.raw(1))
result <- data.table:::fcoalesce(x)
expect_equal(result, as.raw(c(1, 2, 1)))
})
── Warning: coalesce works correctly for raw vectors ───────────────────────────
out-of-range values treated as 0 in coercion to raw
── Error: coalesce works correctly for raw vectors ─────────────────────────────
Error in `data.table:::fcoalesce(x)`: Type 'raw' is not supported
Backtrace:
▆
1. └─data.table:::fcoalesce(x)
Error:
! Test failed
Backtrace:
▆
1. ├─testthat::test_that(...)
2. │ └─withr (local) `<fn>`()
3. └─reporter$stop_if_needed()
4. └─rlang::abort("Test failed", call = NULL)
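The warning in the output above ("out-of-range values treated as 0 in coercion to raw") is a reminder that raw vectors have no NA representation, so as.raw(c(NA, 2, NA)) already holds 00 02 00 before fcoalesce is called. For contrast, here is a minimal plain-C sketch of coalescing over doubles, where NaN can stand in for a missing value. The name coalesce_double and its signature are hypothetical, and the sketch assumes each input has length n or length 1 (recycled).

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: out[i] becomes the first non-NaN value found at
   position i across vecs[0], vecs[1], ...; NaN if every candidate is NaN.
   Assumes lens[v] is either n or 1; length-1 inputs are recycled. */
void coalesce_double(const double *const *vecs, const size_t *lens,
                     size_t nvec, double *out, size_t n)
{
  for (size_t i = 0; i < n; ++i) {
    out[i] = NAN;
    for (size_t v = 0; v < nvec; ++v) {
      double cand = vecs[v][lens[v] == 1 ? 0 : i];
      if (!isnan(cand)) {   /* first non-missing candidate wins */
        out[i] = cand;
        break;
      }
    }
  }
}
```

With the inputs {NaN, 2, NaN} and {1}, this fills the gaps to give {1, 2, 1}, which is the shape of result the raw test was expecting.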
2) Discussed with Michael the difference between calling Cfastmean via the .External function vs fast mean via use of mean within a data.table with datatable.optimize=1, gave an example based on my understanding so far, and also discussed the FAQ entry for avoiding T/F. (2 hours)
- June 20, ‘24:
1) Made the changes Toby suggested for switching from type to efficiencyFactor for setThreadCount. (2 hours)
2) Continued with the testing for fcoalesce and made some progress towards the blog post on my GHA for performance testing that was integrated into data.table’s CI. (6 hours)