
data.table work diary

Greetings reader! \(^^)/

This is a blog-like page where I briefly document my day-to-day progress on tasks that I’m working on to support and improve the data.table project. I started writing this on the 27th of February, 2024, and I intend to update it daily at the end of the day. (My workweeks prior to this mostly consisted of activities for the NSF POSE Winter ‘24 program, such as interviews and presentations, apart from smaller tasks such as documentation improvements.)

  • What I’m currently working on!
  • Any feedback?
  • Presentations
  • Blog posts

Workweeks

  • February 27 - March 1 ‘24: Researched ways to include a link to an artifact generated from a GitHub Actions (GHA) workflow, got the feature to work after exploring upload-artifact@v4, fixed issues related to my GHA along the way (unidentified references to branch names, specifying the safe directory exception with appropriate authentication, corrupted package databases), refactored my action to work inside another repository instead of having to run it from my Autocomment-atime-results repository, fixed local git issues for Doris and small errors based on atime test code, reviewed older posts on The Raft.

  • March 4 - 8 ‘24: Made slides for my presentation ‘GitHub Actions: Automated regression testing on pull requests’ (for data.table) and delivered it, made changes to my action based on the feedback I received, made an example of my action working on another R package (binsegRcpp, on a fork of it), investigated ways to update a GitHub-bot comment and ended up with an easy cml feature-based fix, investigated ways to benchmark steps in a workflow.

  • March 11 - 15 ‘24: Solved an error with git rev-parse not finding references, got merge-base to appear on plots, made a bunch of changes to my action to get it published on the Marketplace and running appropriately (after iterative testing), made PRs to demonstrate its working on a refork of data.table, worked on changes Toby suggested.

  • March 18 - 22 ‘24: Tried to understand the C code involving parallelization (via OpenMP) for documentation purposes, and after thorough digging, wrote about all 12 applicable cases (files/functions) mentioned in the docs in as much detail as I could provide (sent a PR).

  • March 25 - 29 ‘24: Added information to my OpenMP documentation PR on better speedups for a large number of rows vs columns in the input data after writing and running some benchmarks for the same, created a first draft of a step to check and update openmp-utils.Rd based on files where OpenMP is used (to be used within the repo-meta-tests.yaml), tried to divide the timed segment for the atime step in my action into separate timings for installation and test runs.

  • April 1 - 5 ‘24: Fixed timing code, tried to set up CRAN mirror within the Rprofile in my runner (solved it by making a copy) and avoided repetition in all other places in my GHA script, updated my action (published new releases with the changes), tried to fix a ‘Before’ label and did some research as to why certain commit SHAs are failing to be installed, created two detailed historical regression mirroring PRs for showcasing to the data.table community.

  • April 8 - 12 ‘24: Created new PRs following the reset of my data.table fork’s cache with the help of GitHub support, posted an issue in data.table conveying and demonstrating my work on the GitHub Action with as much detail as I could include, made the PR to add my GHA along with two tests and implemented suggested changes to them and my workflow script, reviewed blog posts for The Raft.

  • April 15 - 19 ‘24: Made a follow-up PR to the one introducing my GHA, in which I added another test and moved the tests to .ci along with suggested changes, debugged errors and completed a first draft of my function that gets the optimal number of threads for different data.table functions where parallelization is possible.

  • April 22 - 26 ‘24: Prepared an initial set of slides for a talk at JSM ‘24 followed by a presentation in the weekly lab meeting, reverted some changes made last week (made based on Michael’s comment) and fixed #6094, tried to create an ideal speedup plot, helped Doris fix and showcase performance improvement test cases, communicated with Joshua regarding GSoC.

  • April 29 - May 3 ‘24: Proposed solutions for errors with pkg.edit.fun and git2r::revparse, created a GitHub repository data.table.threads which aims to benchmark different parallelizable data.table functions and find the optimal thread count, implemented suggestions for the code that generates speedup plots, changed work log format to weekly, updated CODEOWNERS.

  • May 6 - 10 ‘24: Incorporated feedback and made the second version of my JSM conf. slides and presented this week (with Kelly and Tyson’s feedback in addition to lab members), created an R package for data.table.threads, introduced S3 methods to override plot and print for a class assigned to the output of findOptimalThreadCount, documented all four functions and two issues, wrote a basic readme for the repository to get started and then a brief summary for each workweek on this page.

  • May 13 - 17 ‘24: Wrote logic to get the maximum speedup value among the points where the sub-optimal and measured speedup lines (for different data.table functions) visually intersect or come closest to each other. Extensively refactored the entire codebase to strictly use data.table operations instead of data.frame ones. Made use of more efficient methods (e.g. rbindlist) and removed redundant bits of code such as multiple calls to geom_line/point/text after accommodating the required changes.

  • May 20 - 24 ‘24: Introduced new arguments to benchmarking functions (such as times and verbose), made several suggested changes in simplifying data.table.threads code apart from making it easier to use, wrote about takeaways from a rubric, implemented custom legends, led GSoC community bonding period.

  • May 27 - 31 ‘24: Shepherded GSoC students over the first week of work + reviewed several PRs, helped Doris and tested the new atime test cases that are to be incorporated into data.table, made more changes/simplifications to my S3 plot method for generating speedup plots.

  • June 3 - 7 ‘24: Made progress in running data.table’s C code (to discuss in a GH issue next week: Compiling the C code and linking the .so to run fastmean, comparing against base R’s mean, and creation of a test that breaks for a change in fastmean.c), continued working with GSoC students and making changes to data.table.threads as per suggestions, added NEWS items and updated docs, provided feedback to Doris’ presentation and taught a bit of git.

  • June 10 - 14 ‘24: Found a better way to call fast mean (.External + Cfastmean, and learned from Michael to use optimize=1), worked on and tested setThreadCount (detected and fixed bugs), reviewed PRs for data.table and commented on issues, helped Doris with creating a plot for just time instead of both time/memory using atime.

  • June 17 - 21 ‘24: Created a condensed version of fastmean.c and a few R scripts (for testing purposes, please check the link of this week for more details), tried to create tests for other C files, reviewed PRs, switched from type to efficiencyFactor for setThreadCount.

  • June 24 - 28 ‘24: Tested mutant changes to utility functions, sent PRs to the Raft for various small improvements and reviewed a few PRs in data.table, wrote some parts for my GHA blog post.

  • July 1 - 5 ‘24: Tried to create test cases that fail for mutants of rbindlist.c, fifelse.c, forder.c, and subset.c, continued writing more content for the Raft (will send a PR for my GHA post after I come back from my break), helped with an issue revolving around the incorrect printing of integer64 columns, reviewed PRs.
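The mutation testing idea from the last few weeks can be shown with a minimal Python sketch. It is not the actual setup, which targets data.table’s C sources; the `choose` function, its mutant, and both tests are hypothetical toy stand-ins. A mutant is a copy of the code with one small deliberate change, and a test "kills" it if the test passes on the original but fails on the mutant.

```python
def choose(test, yes, no):
    """Original: pick yes[i] where test[i] is true, otherwise no[i]."""
    return [y if t else n for t, y, n in zip(test, yes, no)]


def choose_mutant(test, yes, no):
    """Mutant: the two branches are swapped (one small deliberate change)."""
    return [n if t else y for t, y, n in zip(test, yes, no)]


def strong_test(impl):
    """Uses distinct yes/no values, so a swapped branch changes the result."""
    return impl([True, False], [1, 2], [10, 20]) == [1, 20]


def weak_test(impl):
    """Uses identical yes/no values, so it cannot tell the branches apart."""
    return impl([True, False], [5, 5], [5, 5]) == [5, 5]


def kills(test, original, mutant):
    """A test kills a mutant if it passes on the original and fails on the mutant."""
    return test(original) and not test(mutant)


print(kills(strong_test, choose, choose_mutant))  # True: the mutant is detected
print(kills(weak_test, choose, choose_mutant))    # False: the mutant survives
```

A surviving mutant points at behaviour the test suite does not actually check, which is what makes it useful for finding gaps in tests.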

(Vacation till July 16 ‘24)

  • July 16 - 19 ‘24: Started testing my GHA on non-data.table packages (ones that use Rcpp), starting with dplyr by testing a performance regression with dplyr::summarise on many groups, reviewed #6296 and made a performance test to show the improvement, reviewed PRs based on atime test cases, made Wiki edits and writeups.

  • July 22 - 26 ‘24: Performed some GHA-based testing for a performance regression case with transform, briefly reviewed a paper, revised my slides a bit, resolved a hotel reservation issue with Lawson for my upcoming stay in Portland for my talk in JSM, Zoom calls, reviewing PRs, blog post draft revisions.

  • July 29 - August 2 ‘24: Created a fresh set of detailed slides for my presentation (worked on received feedback and went the extra mile in design), created a performance test for forder caching, reviewed PRs, prepared for my trip to Portland, shared details about our session and the presenting platform with the team.

  • August 5 - 9 ‘24: Rehearsed and gave my talk, met project members (Kelly, Tyson) and Michael, attended and engaged in JSM sessions, communicated with Lawson multiple times to discuss financial matters.

  • August 12 - 16 ‘24: Reviewed Doris’ research article, continued work on atime test cases for forder and transform, reviewed PRs and posts on the Raft, discussed record management (NAU) and trip reimbursement, wrote for the NSF annual report.

  • August 19 - 23 ‘24: Finished my section in the research article and edited a few other parts/sections of it, reviewed Seal of Approval posts @Raft and PRs, revised the forder atime case a bit, propagated .tex to .Rmd conversion and other topics in issues.

  • August 26 - 30 ‘24: Wrote a blog post for mutation testing results, reviewed the OpenMP availability checking shell logic in data.table, showed how to convert .tex to .rmd, and then to .pdf and .html using rmarkdown for the research article and extensively reviewed Doris’ writing, Zoom meetings.

  • September 2 - 6 ‘24: Reviewed various PRs, added an atime test for performance improvement in DT[by, verbose = TRUE] cases, revised my blog post on mutation testing as per Toby’s suggestions, more reviewing and feedback cycles for the research article, etc.

  • September 9 - 13 ‘24: Revised atime test cases, tested and solved the issue of the time-consuming setup in my action caused by installing the dependencies in atime’s ‘Suggests’ field, worked on a blog post on data.table.threads and made a follow-up to fix the closestPoints logic.

  • September 16 - 20 ‘24: Wrapped up my blog post on data.table.threads, revised atime test cases, worked on a GHA to keep the OpenMP manual updated, filled out forms.

  • September 23 - 27 ‘24: Uploaded data.table.threads to CRAN after making required changes, updated my performance tests GHA, pushed an atime test for melt, reviewed PRs, etc.

  • September 30 - October 4 ‘24: Reviewed PRs, made the required changes from the manual tests on CRAN for data.table.threads and resubmitted, created a PR for adding an atime test on forderv improvement, made progress towards blog post content.

  • October 7 - 11 ‘24: Introduced two features to data.table.threads: 1) users can now access the plot data (speedup trends and key points) via attributes of the data.table returned from findOptimalThreadCount, and 2) the recommendedEfficiency parameter can now be used to dictate the slope of the recommended line and the point designating the recommended thread count in speedup plots generated by the plot method. Also updated the package on CRAN and reviewed PRs.
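To make the recommendedEfficiency idea concrete, here is a minimal Python sketch. It is not the package’s R implementation. It assumes that speedup is the single-thread time divided by the n-thread time, that the recommended line is speedup = efficiency × threads (so the efficiency is its slope), and that the recommended thread count is the largest one whose measured speedup stays on or above that line. The timings and function names are made up for illustration.

```python
def speedups(median_seconds):
    """Map thread count -> speedup relative to the single-thread timing."""
    baseline = median_seconds[1]
    return {threads: baseline / secs for threads, secs in sorted(median_seconds.items())}


def recommended_thread_count(median_seconds, efficiency=0.5):
    """Largest thread count whose speedup is on or above efficiency * threads.

    This selection rule is an assumption made for this sketch.
    """
    best = 1
    for threads, speedup in speedups(median_seconds).items():
        if speedup >= efficiency * threads:
            best = max(best, threads)
    return best


# Hypothetical median timings (seconds) for one function at 1, 2, 4 and 8 threads.
timings = {1: 8.0, 2: 4.4, 4: 2.5, 8: 1.9}

print(speedups(timings))
print(recommended_thread_count(timings, efficiency=0.5))  # 8
print(recommended_thread_count(timings, efficiency=0.9))  # 2
```

A stricter efficiency gives a steeper line, so fewer thread counts clear it and the recommendation drops.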

(Off-time from Oct 14 - 18 ‘24)

  • October 21 - 25 ‘24: Reviewed PRs, made more changes to data.table.threads (including the addition of addRecommendedEfficiency, a function for adding and computing plot data based on a user-specified speedup efficiency), wrote a blog post about the package for the Raft.

  • October 28 - November 1 ‘24: Explored new mutant cases, pushed another version of my package to CRAN, fixed an issue with the libgit2 installation (it was incompatible with git2r’s requirement) and came up with a potential solution for the issue with missing branch references.

  • November 4 - 8 ‘24: Created tests for more mutants, ensured that the CRAN version of data.table.threads is in sync with the latest updates on GitHub, made and tested changes to the branch reference retrieving part of my GHA.

  • November 11 - 15 ‘24: Started revising the research article on atime bit by bit, explored more C files for mutation testing and tried to create more tests, created a PR to use hyperlinks for linking between data.table vignettes.

  • November 18 - 22 ‘24: Implemented custom benchmarks for findOptimalThreadCount, proofreading/editing the atime paper and blog posts.

(Vacation for Nov 25 ‘24)

  • November 26 - 27 ‘24: Mostly worked on things for The Raft.

(Holidays from Nov 28 to 29 ‘24)

  • December 2 - 6 ‘24: Worked on fixes for data.table.threads, a shell script to keep openmp-utils.Rd in sync with src/* files for OpenMP usage, wrote some sections for a GSoC project and a bit for the GHA post, reviewed some parts for the atime article.

  • December 9 - 13 ‘24: Completed writing the GSoC project’s wiki page and my blog posts.

  • January 15 ‘25: Updated requirements for the GSoC project (commit/link) based on extending my GHA to help contributors who are not project members or who lack permissions to trigger/run the action in data.table, in addition to the original goal of CI time reduction.

Feedback!

If you would like to provide feedback in any form, you can either reach out to me via email or post an issue on the repositories I’m working on:

  • Autocomment-atime-results
  • data.table.threads
  • Rdatatable/data.table (tag @Anirban166)

Presentations

  • “Creating a self-sustaining ecosystem for data.table” at JSM 2024 (Portland, Oregon) on 8/6/24: Slides

  • “GitHub Actions: Automated performance regression testing on pull requests” at NAU SICCS (Flagstaff, AZ) on 3/5/24: Slides

Writeups

  • Conducting interviews for NSF POSE
  • Mutation testing for data.table
  • Continuous performance testing using GitHub Actions
  • data.table.threads - find the best thread count!

Thank you! - Ani