- May 13, ’24:
1) Talked with Lawson and finalized the JSM ’24 travel route after researching hotels and transportation, and registered for the conference. (2 hours)
2) Moved on to creating a test for #5054. Having a headache, so it’s on the to-do list (to complete by this week). (2 hours)
- May 14, ’24:
1) Trying to SSH into Monsoon to test my R package with a large number of threads - neither https://ondemand.hpc.nau.edu/pun/sys/dashboard nor `ssh -y ac4743@monsoon.hpc.nau.edu` is working for me. Called ITS - I need additional permissions, apart from needing to use the VPN or be within the NAU network. (2 hours)
2) Working on getting the maximum speedup value among the points that are close to intersecting between the sub-optimal and the measured speedup lines (for different `data.table` functions). (7 hours)
A bit tricky since: i) the slope-0.5 line has no points that exactly match the measured speedup values, even when comparing just the y-values (speedup), so I am using absolute deviations for that at the moment; ii) picking the point with the least deviation (the closest point) does not strictly correspond to the highest point of intersection, as can be observed from the plot. (A small sketch of this follows today’s entry.)
Another thought, related to the low speedups some functions show (e.g., with 1e7 rows and 10 columns): since some of them derive more benefit from parallelization when the data has more columns, it might be appropriate for the user to instead input the total size of the data (nrow x ncol), after which I can separate the benchmarks into two parts - one with more rows, and one with more columns.
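As a rough illustration of the closest-points idea, here is a minimal sketch; the column names, the tolerance value, and the slope-0.5 reference line through the origin are all my assumptions for the example, not the package’s actual structure:

```r
library(data.table)

# Hypothetical data: one row per benchmarked thread count, with the
# measured speedup and the slope-0.5 (sub-optimal) reference line.
dt <- data.table(threads  = 1:8,
                 measured = c(1, 1.8, 2.4, 2.9, 3.1, 3.2, 3.1, 3.0))
dt[, suboptimal := 0.5 * threads]

# The y-values never match exactly, so use absolute deviations.
dt[, deviation := abs(measured - suboptimal)]

# Since the least deviation does not strictly give the highest point of
# intersection, take the maximum measured speedup among all points that
# fall within a tolerance of the reference line.
tol <- 0.5  # hypothetical tolerance
dt[deviation <= tol][which.max(measured)]
```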
- May 15, ’24:
1) Finally finished the code for the second point I worked on yesterday, and added a legend for the points. (7 hours)
2) Ran some tests modifying `runBenchmarks` and `findOptimalThreads` to see better speedup gains when dividing the benchmarks into two sets: one with more rows for functions that show better parallel scaling there, and one with more columns likewise. Need Toby’s thoughts on whether this is a good idea, since it involves changing the input to the total size only, rather than rows and columns (maybe add configurable sets of rows and columns as user input?). (3 hours) A rough sketch of the split is below.
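To make the idea concrete for the discussion, a minimal sketch of how a total-size input could be split into the two benchmark shapes; `shapes_for_size`, the fixed column counts, and the split rule are hypothetical, not anything `runBenchmarks` currently does:

```r
# Hypothetical sketch: the user supplies only the total size (nrow x ncol),
# and the benchmarks get split into a rows-heavy and a columns-heavy shape.
shapes_for_size <- function(total_size, narrow_ncol = 10L, wide_ncol = 50L) {
  list(
    rows_heavy = list(nrow = total_size %/% narrow_ncol, ncol = narrow_ncol),
    cols_heavy = list(nrow = total_size %/% wide_ncol,   ncol = wide_ncol)
  )
}

str(shapes_for_size(1e8))
# rows_heavy: 1e7 rows x 10 cols; cols_heavy: 2e6 rows x 50 cols
```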
- May 16, ’24:
1) Discussed things with Toby. (2 hours)
2) Removed redundant wording in my plot’s legend, fixed the issue where the `geom_point()` representing the recommended speedup was not showing up, removed the color-based distinction between lines, switched from using the mean (from `microbenchmark`’s `summary`) to the median, and ensured the `data.frame`s containing the ideal-speedup and sub-optimal-speedup values have a consistent structure. (7 hours) (A small sketch of the mean-to-median switch is below.)
- May 17, ’24:
1) Worked on refactoring my code to use `data.table` strictly instead of `data.frame`, and to avoid redundancy. Created a PR for that (did a bit of the work on Saturday). `runBenchmarks` and `findOptimalThreadCount` now return a `data.table` (I’m also using `rbindlist` in the latter, which I researched and learned about). Extensively refactored the S3 plot and print methods to replace all `data.frame` operations with `data.table`-specific ones. For the plot code, I refactored it to use only one call each to `geom_line`, `geom_point`, and `geom_text`, apart from changes to the scales to accommodate them. Removed redundant bits of code and renamed my variables appropriately wherever I noticed the need. Fixed all the errors that popped up in the process and tested that everything works correctly. (11 hours) A sketch of the `rbindlist` plus single-`geom_line` pattern follows.
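A minimal sketch of that pattern; the function names and speedup numbers are made up for illustration, and the `linetype` mapping reflects the earlier removal of color-based distinction:

```r
library(data.table)
library(ggplot2)

# Hypothetical per-function results, combined with rbindlist() rather
# than repeated rbind() calls on data.frames.
per_function <- list(
  data.table(fn = "forder", threads = 1:4, speedup = c(1, 1.7, 2.2, 2.5)),
  data.table(fn = "GForce", threads = 1:4, speedup = c(1, 1.5, 1.8, 1.9))
)
results <- rbindlist(per_function)

# With long-format data, a single geom_line()/geom_point() call draws
# every line at once via the group mapping, instead of one layer each;
# linetype keeps the lines distinguishable without color.
ggplot(results, aes(threads, speedup, group = fn, linetype = fn)) +
  geom_line() +
  geom_point()
```

Keeping the results in long format is what allows one layer per geom: the grouping aesthetic replaces the per-line layers the old plot code needed.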