Greetings reader! \(^^)/
This is a blog-like page where I’ll be briefly documenting my day-to-day progress on the tasks I’m working on to help and improve the data.table project. I started this on the 27th of February, 2024 and I intend to update it daily at the end of the day (weeks prior to this mostly include activities for the NSF POSE Winter ‘24 program such as interviews and presentations, apart from smaller tasks such as documentation improvements).
Workweeks
-
February 27 - March 1: Researched ways to include a link to an artifact generated from a GitHub Actions (GHA) workflow, got the feature to work after exploring upload-artifact@v4, fixed issues related to my GHA along the way (unidentified references to branch names, specifying the safe directory exception with appropriate authentication, corrupted package databases), refactored my action to work inside another repository instead of having to run it from my Autocomment-atime-results repository, fixed local git issues for Doris and small errors based on atime test code, reviewed older posts on The Raft.
-
March 4 - 8: Made slides for my presentation ‘GitHub Actions: Automated regression testing on pull requests’ (for data.table) and delivered it, made changes to my action based on received feedback, made an example of my action working on another R package (bingsegRcpp, on a fork of it), investigated ways to update a GitHub-bot comment and ended up with an easy cml feature-based fix, investigated ways to benchmark steps in a workflow.
-
March 11 - 15: Solved an error with git rev-parse not finding references, got merge-base to appear on plots, made a bunch of changes to my action to get it published on the Marketplace and running appropriately (after iterative testing), made PRs to demonstrate its working on a refork of data.table, worked on changes Toby suggested.
-
March 18 - 22: Tried to understand the C code involving parallelization (via OpenMP) for documentation purposes, and after thorough digging, wrote about all 12 applicable cases (files/functions) mentioned in the docs in as much detail as I could provide (sent a PR).
-
March 25 - 29: Added information to my OpenMP documentation PR on better speedups for a large number of rows vs columns in the input data after writing and running some benchmarks for the same, created a first draft of a step to check and update openmp-utils.Rd based on files where OpenMP is used (to be used within the repo-meta-tests.yaml), tried to divide the timed segment for the atime step in my action into separate timings for installation and test runs.
-
April 1 - 5: Fixed timing code, tried to set up CRAN mirror within the Rprofile in my runner (solved it by making a copy) and avoided repetition in all other places in my GHA script, updated my action (published new releases with the changes), tried to fix a ‘Before’ label and did some research as to why certain commit SHAs are failing to be installed, created two detailed historical regression mirroring PRs for showcasing to the data.table community.
-
April 8 - 12: Created new PRs following the reset of my data.table fork’s cache with the help of GitHub support, posted an issue in data.table conveying and demonstrating my work on the GitHub Action with as much detail as I could include, made the PR to add my GHA along with two tests and implemented suggested changes to them and my workflow script, reviewed blog posts for The Raft.
-
April 15 - 19: Made a follow-up PR to the one introducing my GHA, where I added another test and moved the tests to .ci along with suggested changes, debugged errors and completed a first draft of my function that gets the optimal number of threads for different data.table functions where parallelization is possible.
-
April 22 - 26: Prepared initial set of slides for a talk in JSM ‘24 followed by a presentation in the weekly lab meeting, reverted some changes made last week (made based on Michael’s comment) and fixed #6094, tried to create an ideal speedup plot, assisted Doris to fix and showcase performance improvement test cases, communicated with Joshua regarding GSoC.
-
April 29 - May 3: Proposed solutions for errors with pkg.edit.fun and git2r::revparse, created a GitHub repository data.table.threads which aims to benchmark different data.table functions which are parallelizable and find the optimal thread count, implemented suggestions to speedup plot generating code, changed work log format to weekly, updated CODEOWNERs.
-
May 6 - 10: Incorporated feedback and made the second version of my JSM conf. slides and presented this week (with Kelly and Tyson’s feedback in addition to lab members), created an R package for data.table.threads, introduced S3 methods to override plot and print for a class assigned to the output of findOptimalThreadCount, documented all four functions and two issues, wrote a basic readme for the repository to get started and then a brief summary for each workweek in this page.
-
May 13 - 17: Wrote logic to get the maximum speedup value among points that are visually intersecting or closest to each other on the sub-optimal and measured speedup lines (for different data.table functions). Extensively refactored the entire codebase to use data.table strictly instead of data.frame operations. Made use of more efficient methods (rbindlist, for example) and removed redundant bits of code such as multiple calls to geom_line/point/text after accommodating required changes.
-
May 20 - 24: Introduced new arguments to benchmarking functions (such as times and verbose), made several suggested changes in simplifying data.table.threads code apart from making it easier to use, wrote about takeaways from a rubric, implemented custom legends, led GSoC community bonding period.
-
May 27 - 31: Shepherded GSoC students over the first week of work + reviewed several PRs, helped Doris and tested the new atime test cases that are to be incorporated into data.table, made more changes/simplifications to my S3 plot method for generating speedup plots.
-
June 3 - 7: Made progress in running data.table’s C code (to discuss in a GH issue next week: Compiling the C code and linking the .so to run fastmean, comparing against base R’s mean, and creation of a test that breaks for a change in fastmean.c), continued working with GSoC students and making changes to data.table.threads as per suggestions, added NEWS items and updated docs, provided feedback to Doris’ presentation and taught a bit of git.
-
June 10 - 14: Found a better way to call fast mean (.External + Cfastmean, and learned from Michael to use optimize=1), worked on and tested setThreadCount (detected and fixed bugs), reviewed PRs for data.table and commented on issues, helped Doris with creating a plot for just time instead of both time/memory using atime.
-
June 17 - 21: Created a condensed version of fastmean.c and a few R scripts (for testing purposes, please check the link of this week for more details), tried to create tests for other C files, reviewed PRs, switched from type to efficiencyFactor for setThreadCount.
-
June 24 - 28: Tested mutant changes to utility functions, sent PRs to the Raft for various small improvements and reviewed a few PRs in data.table, wrote some parts for my GHA blog post.
-
July 1 - 5: Tried to create test cases that fail for mutants of rbindlist.c, fifelse.c, forder.c, and subset.c, continued writing more content for the Raft (will send a PR for my GHA post after I come back from my break), helped with an issue revolving around the incorrect printing of integer64 columns, reviewed PRs.
(Vacation till July 16)
-
July 16 - 19: Started testing my GHA on non-data.table packages (ones that use Rcpp), starting with dplyr (testing a performance regression in dplyr::summarise with many groups), reviewed #6296 and made a performance test to show the improvement, reviewed PRs based on atime test cases, Wiki edits and writeups.
-
July 22 - 26: Performed some GHA-based testing for a performance regression case with transform, briefly reviewed a paper, revised my slides a bit, resolved a hotel reservation issue with Lawson for my upcoming stay in Portland for my talk in JSM, Zoom calls, reviewing PRs, blog post draft revisions.
-
July 29 - August 2: Created a fresh set of detailed slides for my presentation (worked on received feedback and went the extra mile in design), created a performance test for forder caching, reviewed PRs, prepared for my trip to Portland, shared details about our session and the presenting platform with the team.
-
August 5 - 9: Rehearsed and gave my talk, met project members (Kelly, Tyson) and Michael, attended and engaged in JSM sessions, communicated with Lawson multiple times to discuss financial matters.
-
August 12 - 16: Reviewed Doris’ research article, continued work on atime test cases for forder and transform, reviewed PRs and posts on the Raft, discussed record management (NAU) and trip reimbursement, wrote for the NSF annual report.
-
August 19 - 23: Finished my section in the research article and edited a few other parts/sections of it, reviewed Seal of Approval posts @Raft and PRs, revised the forder atime case a bit, propagated .tex to .Rmd conversion and other topics in issues.
-
August 26 - 30: Wrote a blog post for mutation testing results, reviewed the OpenMP availability checking shell logic in data.table, showed how to convert .tex to .rmd, and then to .pdf and .html using rmarkdown for the research article and extensively reviewed Doris’ writing, Zoom meetings.
-
September 2 - 6: Reviewed various PRs, added an atime test for performance improvement in DT[by, verbose = TRUE] cases, revised my blog post on mutation testing as per Toby’s suggestions, more reviewing and feedback cycles for the research article, etc.
-
September 9 - 13: Revised atime test cases, tested and solved the issue of the time-consuming setup based on installation of atime’s ‘Suggests’ field dependencies for my action, worked on a blog post on data.table.threads and made a follow-up to fix the closestPoints logic.
-
September 16 - 20: Wrapped up my blog post on data.table.threads, revised atime test cases, worked on a GHA to keep the OpenMP manual updated, filled out forms.
-
September 23 - 27: Uploaded data.table.threads to CRAN after making required changes, updated my performance tests GHA, pushed an atime test for melt, reviewed PRs, etc.
-
September 30 - October 4: Reviewed PRs, made the required changes from the manual tests on CRAN for data.table.threads and resubmitted, created a PR for adding an atime test on forderv improvement, made progress towards blog post content.
-
October 7 - 11: Introduced two features to data.table.threads: 1) users can now access the plot data (speedup trends and key points) via attributes of the data.table returned from findOptimalThreadCount, and 2) the recommendedEfficiency parameter can now be used to dictate the slope of the recommended line and the point designating the recommended thread count in speedup plots generated by the plot method; updated the package on CRAN, reviewed PRs.
(Off-time from Oct 14 - 18)
-
October 21 - 25: Reviewed PRs, made more changes to data.table.threads (including the addition of addRecommendedEfficiency, a function for adding and computing plot data based on a user-specified speedup efficiency), wrote a blog post about the package for the Raft.
-
October 28 - November 1: Explored new mutant cases, pushed another version of my package to CRAN, fixed an issue with libgit2 installation (it was incompatible with the git2r requirement) and came up with a potential solution for the issue with missing branch references.
-
November 4 - 8: Created tests for more mutants, ensured that the CRAN version of data.table.threads is in sync with the latest updates on GitHub, made and tested changes to the branch reference retrieving part of my GHA.
-
November 11 - 15: Started revising the research article on atime bit by bit, explored more C files for mutation testing and tried to create more tests, created a PR to use hyperlinks for linking between data.table vignettes.
-
November 18 - 22: Implemented custom benchmarks for findOptimalThreadCount, proofread and edited the atime paper and blog posts.
Feedback!
If you would like to provide feedback in any form, you can either reach out to me via email or post an issue on the repositories I’m working on:
Presentations
-
“Creating a self-sustaining ecosystem for data.table” at JSM 2024 (Portland, Oregon) on 8/6/24: Slides
-
“GitHub Actions: Automated performance regression testing on pull requests” at NAU SICCS (Flagstaff, AZ) on 3/5/24: Slides
Writeups
Thank you! - Ani