data.table work diary: Jul 22 - 26

July 22, ’24:

1) Tested the atime performance test for the historical regression in transform: (9 hours)

  # Fixed in: https://github.com/Rdatatable/data.table/pull/5493 (off-branch)
  # Merged to master in: https://github.com/Rdatatable/data.table/commit/2d1a0575f87cc50e90f64825c30d7a6cb6b05dd7
  "transform improved in #5493" = atime::atime_test(
    N = 10^seq(1, 20),
    setup = {
      df <- data.frame(x = runif(N))
      dt <- as.data.table(df)
    },
    expr = data.table:::`[.data.table`(transform(dt, y = round(x))),
    Slow = "bf499090c0e6fd5cb492bf8b1603d93c1ee21dfb",
    Fast = "2d1a0575f87cc50e90f64825c30d7a6cb6b05dd7"
  )

I’m using the commit where the changes were introduced to data.table’s master branch (Fast) and the parent of that commit (Slow), because the PR branch was merged into a dev branch and not into master. Even so, the versions do not produce sensible performance plots: Slow and Fast sit together with the CRAN version rather than with base or HEAD, using the same commit SHA for Before and Regression puts them far apart, and so on. See #14 for an example.
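One way to rule out a wrong SHA is to confirm the parent relationship directly against a local clone. The helpers below are a minimal sketch, not part of atime or data.table: `parent_sha()` and `is_parent_of()` are hypothetical names, and the code assumes the `git` command-line tool is on the PATH.

```r
# Return the first parent of `sha` in the git repository at `repo`.
# Assumes the `git` command-line tool is available on the PATH.
parent_sha <- function(repo, sha) {
  system2(
    "git",
    c("-C", shQuote(repo), "rev-parse", shQuote(paste0(sha, "^"))),
    stdout = TRUE
  )
}

# TRUE when `candidate` is the first parent of `child`.
# Both arguments are expected to be full 40-character SHAs.
is_parent_of <- function(repo, candidate, child) {
  identical(parent_sha(repo, child), candidate)
}
```

With a local clone of data.table (the path is up to you), calling `is_parent_of(clone_path, Slow, Fast)` with the two SHAs from the test above should return TRUE if Slow really is the first parent of Fast.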

July 23, ’24:

1) Further GHA-based testing of the transform regression PRs on my fork, still without a sensible result. (4 hours)

2) Multiple Zoom meetings. (3 hours)

July 24, ’24:

1) Briefly reviewed the research paper/article that Doris will publish and started drafting a segment for the GHA part (\section{GitHub Action}) that she wants me to write. (4 hours)

2) Reviewed #6307, Zoom meetings. (3 hours)

July 25 and 26, ’24:

1) Edited my slides a bit. I am waiting to finalize them based on the feedback (from Tyson/Kelly) and on what I see happening in data.table (e.g. the current translation projects). (8 hours)

2) Resolved an issue with digital verification (signatures, addresses, etc.) for the conference hotel that Lawson booked, with back-and-forth communication with both the hotel staff and Lawson. Also clarified how to collect and share information for reimbursement. (6 hours)

3) Revised my blog posts a bit. Wrote the ‘Key Features’ segment for my GHA post: (3 hours)

## Key Features

- Predefined customizable tests: The action uses the `atime` package to run the test cases defined in `.ci/atime/tests.R` on different versions of `data.table`. Each test is based on either a documented historical regression or a performance improvement.

- Automated commenting: Using `cml`, my GHA publishes results in a comment on the pull request thread. To avoid clutter, the action keeps only one comment per PR and updates it with new information as needed. This comment includes:
    - A plot with subplots for each test case, showing time and memory trends across different `data.table` versions.
    - The commit SHA that generated the results.
    - A link to download the artifact containing all results.
    - Timing details for setup and test execution.

    The plot shows the following `data.table` versions for visual comparison:
        - HEAD (PR source)
        - Base (PR target)
        - Merge-base (common ancestor of base and HEAD)
        - CRAN (latest version)
        - Before (pre-regression commit)
        - Regression (the commit responsible for the performance degradation, or the range of commits if the source is unknown or distributed)
        - Fixed (the commit where the performance was restored or the source of the regression was fixed)
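As a rough illustration of the first feature, a test definition in `.ci/atime/tests.R` might look like the sketch below. It assumes the `atime` package is installed and that `atime::atime_test_list()` collects named `atime::atime_test()` entries into `test.list`, which is my understanding of the package. The test shown is the transform case I have been working on, not a definitive template.

```r
# Sketch of .ci/atime/tests.R, assuming the atime package is installed.
# To my understanding, atime_test() only records its arguments here,
# so nothing is timed until the action runs the tests.
test.list <- atime::atime_test_list(
  "transform improved in #5493" = atime::atime_test(
    N = 10^seq(1, 20),
    setup = {
      df <- data.frame(x = runif(N))
      dt <- as.data.table(df)
    },
    expr = data.table:::`[.data.table`(transform(dt, y = round(x))),
    # Historical versions to compare, given as full commit SHAs.
    Slow = "bf499090c0e6fd5cb492bf8b1603d93c1ee21dfb",
    Fast = "2d1a0575f87cc50e90f64825c30d7a6cb6b05dd7"
  )
)

# Each named entry corresponds to one test case (one subplot in the comment).
names(test.list)
```

The named arguments `Slow` and `Fast` are the historical version labels for this test. They are plotted alongside the versions that the action adds on its own (HEAD, base, merge-base, CRAN).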

Currently, the action is not tied to a specific OS, and it consists of a single job whose steps all execute on the same runner.