
data.table work diary: May 6 - 10

  • May 6, ‘24: (7)

1) Made some of the changes Toby mentioned for the speedup plot in #2.

2) Prepared slides for my presentation tomorrow (edited old ones and created a few new ones).

  • May 7, ‘24: (4)

1) Presented and collected feedback from Toby, Kelly, Tyson, and NAU-ML lab members.

2) Made changes to the speedup-plot-generating code together with Toby: findOptimalThreads now returns an object of class data_table_threads_benchmark, and there is a separate plot method dispatching on that class (plot.data_table_threads_benchmark).
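
The pattern we settled on can be sketched roughly as follows; the function body and column names here are illustrative placeholders, not the actual package code:

```r
# Hypothetical sketch of the S3 pattern: the benchmark function tags its
# result with a class, and a plot method dispatches on that class.
findOptimalThreads <- function(rowCount, colCount) {
  # ... run benchmarks across thread counts, collect mean timings ...
  results <- data.frame(threadCount = 1:2, meanTime = c(2.0, 1.2))
  class(results) <- "data_table_threads_benchmark"
  results
}

plot.data_table_threads_benchmark <- function(x, ...) {
  # Called automatically when plot() is invoked on the benchmark object.
  plot(x$threadCount, x$meanTime, type = "b",
       xlab = "Threads", ylab = "Mean time (ms)", ...)
}

b <- findOptimalThreads(1e7, 10)
class(b)  # "data_table_threads_benchmark"
```

This keeps benchmarking and visualization decoupled: callers can inspect the returned data directly or simply call plot() on it.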

3) Discussed with Lawson the details about my travel and lodging for the talk in August.

  • May 8, ‘24: (7)

1) Added a geom_line() to my plot with half the slope of the ideal speedup line (slope 1). Since halving the ideal values would make it start at (0.5, 0.5), below the point (1, 1) where the thread count and speedup both begin, I had to manually modify it to start at (1, 1) and then progress with a slope of 0.5 (ending at 5.5 for 10 threads, for example).
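
In other words, instead of y = 0.5x (which sits at (0.5, 0.5) for one thread), the line drawn is y = 0.5x + 0.5, which starts at (1, 1) and keeps a slope of 0.5:

```r
threads <- 1:10
ideal <- threads                   # ideal speedup: y = x, through (1, 1)
halfIdeal <- 0.5 * threads + 0.5   # slope 0.5, shifted to also pass through (1, 1)
halfIdeal[1]   # 1   -- coincides with the ideal line at one thread
halfIdeal[10]  # 5.5 -- ends at 5.5 for 10 threads
```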

2) Added a geom_ribbon() to show variance/noise (experimented and found a band of 0.3 on both sides to look decent).
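
Together, the reference lines and the noise band can be sketched like this; the speedup numbers below are made up for illustration, not measured results, and the column names are assumptions:

```r
library(ggplot2)

# Illustrative numbers only: mean speedup per thread count, with the
# fixed 0.3 band on both sides mentioned above.
df <- data.frame(threads = 1:10,
                 speedup = c(1, 1.8, 2.5, 3.1, 3.6, 4.0, 4.3, 4.5, 4.6, 4.7))
df$lower <- df$speedup - 0.3
df$upper <- df$speedup + 0.3

ggplot(df, aes(threads, speedup)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) + # noise band
  geom_line() +                                               # measured speedup
  geom_line(aes(y = threads), linetype = "dashed") +          # ideal speedup (slope 1)
  geom_line(aes(y = 0.5 * threads + 0.5), colour = "grey50")  # half-ideal line
```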

3) Organized my code into three functions and created examples.

  • May 9, ‘24: (10)

1) Documented the functions for my R package.
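
The documentation is written as roxygen2 comments above each function, from which devtools::document() generates the help pages. A hedged sketch, with illustrative parameter descriptions (the package's actual roxygen blocks may differ):

```r
#' Find the optimal thread count for parallelized data.table routines
#'
#' Benchmarks data.table functions across thread counts and records
#' the mean timings.
#'
#' @param rowCount Number of rows in the benchmark data.
#' @param colCount Number of columns in the benchmark data.
#' @return An object of class data_table_threads_benchmark.
#' @export
findOptimalThreadCount <- function(rowCount, colCount) {
  # ... benchmark body elided ...
}
```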

2) Iteratively ran devtools::check() and corrected the 2 errors and 7 warnings that arose. Finished creating my R package (pushed on commit acfb5a0) and tested it locally:

> install()
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

 1: All                                                
 2: CRAN packages only                                 
 3: None                                               
 4: data.table  (6f008bdd9... -> eaf869eb4...) [GitHub]

Enter one or more numbers, or an empty line to skip updates: 
── R CMD build ────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/Users/anirban166/data.table.threads/DESCRIPTION’ ...
─  preparing ‘data.table.threads’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘data.table.threads_0.1.1.tar.gz’
   
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /var/folders/9_/qc989n050_d2sbtw92scjshr0000gn/T//RtmpePtdKG/data.table.threads_0.1.1.tar.gz \
  --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library’
* installing *source* package ‘data.table.threads’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (data.table.threads)
> library(data.table.threads)
> (benchmarkedData <- data.table.threads::findOptimalThreadCount(10000000, 10))
Running benchmarks with 1 thread, 10000000 rows, and 10 columns.
Running benchmarks with 2 threads, 10000000 rows, and 10 columns.
Running benchmarks with 3 threads, 10000000 rows, and 10 columns.
Running benchmarks with 4 threads, 10000000 rows, and 10 columns.
Running benchmarks with 5 threads, 10000000 rows, and 10 columns.
Running benchmarks with 6 threads, 10000000 rows, and 10 columns.
Running benchmarks with 7 threads, 10000000 rows, and 10 columns.
Running benchmarks with 8 threads, 10000000 rows, and 10 columns.
Running benchmarks with 9 threads, 10000000 rows, and 10 columns.
Running benchmarks with 10 threads, 10000000 rows, and 10 columns.
$threadCount
 [1]  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4
[36]  4  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  7  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8
[71]  8  8  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10 10

$expr
 [1] "forder"     "GForce_sum" "subsetting" "frollmean"  "fcoalesce"  "between"    "fifelse"    "nafill"    
 [9] "CJ"         "forder"     "GForce_sum" "subsetting" "frollmean"  "fcoalesce"  "between"    "fifelse"   
[17] "nafill"     "CJ"         "forder"     "GForce_sum" "subsetting" "frollmean"  "fcoalesce"  "between"   
[25] "fifelse"    "nafill"     "CJ"         "forder"     "GForce_sum" "subsetting" "frollmean"  "fcoalesce" 
[33] "between"    "fifelse"    "nafill"     "CJ"         "forder"     "GForce_sum" "subsetting" "frollmean" 
[41] "fcoalesce"  "between"    "fifelse"    "nafill"     "CJ"         "forder"     "GForce_sum" "subsetting"
[49] "frollmean"  "fcoalesce"  "between"    "fifelse"    "nafill"     "CJ"         "forder"     "GForce_sum"
[57] "subsetting" "frollmean"  "fcoalesce"  "between"    "fifelse"    "nafill"     "CJ"         "forder"    
[65] "GForce_sum" "subsetting" "frollmean"  "fcoalesce"  "between"    "fifelse"    "nafill"     "CJ"        
[73] "forder"     "GForce_sum" "subsetting" "frollmean"  "fcoalesce"  "between"    "fifelse"    "nafill"    
[81] "CJ"         "forder"     "GForce_sum" "subsetting" "frollmean"  "fcoalesce"  "between"    "fifelse"   
[89] "nafill"     "CJ"        

$meanTime
 [1] 238.604075  15.834843  82.098118  25.305659  11.333574  47.862823  33.384304   8.923383   5.471632
[10] 149.147105  15.756613  67.731085  25.256690  10.199715  39.860884  27.033305   8.434186   4.575443
[19] 120.739087  15.740213  63.735838  25.050872  10.023808  30.423629  23.726084   8.581675   4.583418
[28] 105.888774  15.795719  58.474678  26.241693   8.626748  28.835158  22.174659   9.445956   4.237654
[37]  98.232321  15.756840  58.480091  25.956710   8.847574  26.856262  21.225636   8.266084   4.638750
[46]  93.186126  15.779605  59.211183  25.948519   9.663049  25.499239  20.452548   8.451822   4.282420
[55]  89.772309  15.771334  57.352493  25.581512   9.894759  25.990644  20.695062  10.041181   4.296897
[64]  88.713179  15.740802  58.772491  24.697033   8.976329  27.440822  20.559703   9.042136   5.672482
[73]  90.864280  15.764039  60.712491  25.285269   8.923695  25.864757  22.138314   9.320046   3.663480
[82]  89.682569  15.791384  61.186428  24.114304   9.222357  31.012210  22.981579   8.163546   4.477042

attr(,"row.names")
 [1] "forder"      "GForce_sum"  "subsetting"  "frollmean"   "fcoalesce"   "between"     "fifelse"    
 [8] "nafill"      "CJ"          "forder1"     "GForce_sum1" "subsetting1" "frollmean1"  "fcoalesce1" 
[15] "between1"    "fifelse1"    "nafill1"     "CJ1"         "forder2"     "GForce_sum2" "subsetting2"
[22] "frollmean2"  "fcoalesce2"  "between2"    "fifelse2"    "nafill2"     "CJ2"         "forder3"    
[29] "GForce_sum3" "subsetting3" "frollmean3"  "fcoalesce3"  "between3"    "fifelse3"    "nafill3"    
[36] "CJ3"         "forder4"     "GForce_sum4" "subsetting4" "frollmean4"  "fcoalesce4"  "between4"   
[43] "fifelse4"    "nafill4"     "CJ4"         "forder5"     "GForce_sum5" "subsetting5" "frollmean5" 
[50] "fcoalesce5"  "between5"    "fifelse5"    "nafill5"     "CJ5"         "forder6"     "GForce_sum6"
[57] "subsetting6" "frollmean6"  "fcoalesce6"  "between6"    "fifelse6"    "nafill6"     "CJ6"        
[64] "forder7"     "GForce_sum7" "subsetting7" "frollmean7"  "fcoalesce7"  "between7"    "fifelse7"   
[71] "nafill7"     "CJ7"         "forder8"     "GForce_sum8" "subsetting8" "frollmean8"  "fcoalesce8" 
[78] "between8"    "fifelse8"    "nafill8"     "CJ8"         "forder9"     "GForce_sum9" "subsetting9"
[85] "frollmean9"  "fcoalesce9"  "between9"    "fifelse9"    "nafill9"     "CJ9"        
attr(,"class")
[1] "data_table_threads_benchmark"
> plot(benchmarkedData)

  • May 10, ‘24:

1) Added a print.data_table_threads_benchmark method for the S3 generic print() (commit acfb5a0).
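
A print method for an S3 class follows the same dispatch pattern as the plot method. A minimal sketch, where the summary text and the sample object are my own illustration (the actual method in the package likely prints more detail):

```r
# Hypothetical sketch: print() dispatches to this method for the class.
print.data_table_threads_benchmark <- function(x, ...) {
  cat("data.table thread benchmark:",
      length(unique(x$threadCount)), "thread counts,",
      length(unique(x$expr)), "expressions\n")
  invisible(x)  # print methods return their input invisibly by convention
}

b <- structure(list(threadCount = rep(1:2, each = 2),
                    expr = rep(c("forder", "CJ"), 2)),
               class = "data_table_threads_benchmark")
print(b)
```

Returning the input invisibly means the object still prints cleanly at the console yet can be piped or assigned without clutter.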

2) Documented two issues in the process: #6, #4

3) Wrote a basic readme.

4) Will be making changes to the main page over the weekend (including a short summary of each week and, going forward, the hours spent working each day).