data.table work diary: Sep 9 - 13

  • Sep 9, ’24:

1) Resolved the time-consuming setup caused by installing the dependencies listed in atime’s ‘Suggests’ field (#35), filed #6492 to fix inconsistencies in .ci/atime/tests.R, and made the suggested changes. Tested the two changes separately in two PRs on my fork. (7 hours)

2) Zoom meeting. (1 hour)

  • Sep 10, ’24:

1) Tried to make the suggested changes to the closestPoints computation logic in data.table.threads. (5 hours)
Stopped at this version/implementation of the plot method:

plot.data_table_threads_benchmark <- function(x, ...)
{
  x[, `:=`(speedup = median[threadCount == 1] / median, type = "Measured"), by = expr]

  setDTthreads(0)
  systemThreadCount <- getDTthreads()
  functions <- unique(x$expr)

  speedupData <- data.table(
    expr = rep(functions, each = systemThreadCount),
    threadCount = rep(1:systemThreadCount, length(functions)),
    speedup = c(rep(seq(1, systemThreadCount), length(functions)), 
                rep(seq(1, systemThreadCount / 2, length.out = systemThreadCount), length(functions))),
    type = rep(c("Ideal", "Recommended"), each = systemThreadCount * length(functions))
  )

  maxSpeedup <- x[, .(threadCount = threadCount[which.max(speedup)], 
                      speedup = max(speedup), 
                      type = "Ideal"), 
                  by = expr]

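  closestPoints <- x[, {
    # Keep only threadCount and speedup so the merge does not add a second 'expr' column
    recommendedSubset <- speedupData[type == "Recommended" & expr == .BY$expr, .(threadCount, speedup)]
    mergedData <- merge(.SD, recommendedSubset, by = "threadCount", suffixes = c("", "_recommended"))
    filteredRows <- mergedData[speedup > speedup_recommended]
    if(nrow(filteredRows) > 0)
    {
      filteredRows[which.max(speedup)]
    }
    else
    {
      # NULL skips this group; a bare NA would not match the columns returned by other groups
      NULL
    }
  }, by = expr, .SDcols = c("speedup", "threadCount")]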
  
  closestPoints <- closestPoints[!is.na(threadCount)]
  closestPoints[, type := "Recommended"]

  combinedLineData <- rbind(speedupData, x, fill = TRUE)
  combinedPointData <- rbind(maxSpeedup, closestPoints, fill = TRUE)

  ggplot(x, aes(x = threadCount, y = speedup)) +
    geom_line(data = combinedLineData, aes(color = type), size = 1) +
    geom_point(data = combinedPointData, aes(color = type), size = 3) +
    geom_text(data = combinedPointData, aes(label = threadCount), vjust = -0.5, size = 4, na.rm = TRUE) +
    facet_wrap(. ~ expr) +
    coord_equal() +
    labs(x = "Threads", y = "Speedup", title = "data.table functions") +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_x_continuous(breaks = 1:systemThreadCount, labels = 1:systemThreadCount) +
    scale_color_manual(values = c("Measured" = "black", "Ideal" = "#f79494", "Recommended" = "#93c4e0")) +
    guides(color = guide_legend(title = "Type"))
}
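To make the closestPoints idea easier to follow outside the plot method, here is a minimal sketch on toy data. The benchmark medians, the function names fA and fB, and the thread count of 4 are all made up for illustration; only the speedup formula and the "Recommended" line (rising linearly from 1 to half the thread count) are taken from the code above. It is not the package's implementation, just one way to express the same selection.

```r
library(data.table)

# Toy benchmark medians (made-up numbers, for illustration only)
bench <- data.table(
  expr = rep(c("fA", "fB"), each = 4),
  threadCount = rep(1:4, times = 2),
  median = c(8, 4, 4, 2, 6, 6, 3, 4)
)

# Speedup relative to the single-threaded median, per expression
bench[, speedup := median[threadCount == 1] / median, by = expr]

# "Recommended" reference line: rises linearly from 1 to half the thread count
n <- 4L  # assumed thread count for this toy example
recommended <- data.table(
  threadCount = 1:n,
  speedup_recommended = seq(1, n / 2, length.out = n)
)

# Keep measured points above the reference line, then the fastest one per expression
merged <- merge(bench, recommended, by = "threadCount")
closest <- merged[speedup > speedup_recommended,
                  .SD[which.max(speedup)],
                  by = expr]
print(closest[, .(expr, threadCount, speedup)])
```

With these numbers, fA ends up at 4 threads and fB at 3 threads. An expression with no point above the reference line simply produces no row, so no NA placeholder is needed.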

2) Filed #45 so that post images blend consistently into the background of The Raft website, whatever its color. (2 hours)

  • Sep 11, ’24:

1) Opened PR #20@data.table.threads, which wraps up the changes that Toby suggested. (3 hours)

2) Writing a blog post about data.table.threads. (5 hours)

  • Sep 12, ’24:

1) Created an atime test case for testing a memory efficiency improvement for data.table::melt() in #20@Anirban166/data.table. (4 hours)

2) Continued the writeup for data.table.threads. (4 hours)

  • Sep 13, ’24:

1) Tried to revise the transform regression atime case (#14). (5 hours)

2) Reviewed and made changes to the code in PRs #6290 and #6295 to help Doris get those test cases merged. (3 hours)

3) Continued my writeup for data.table.threads. I plan to publish it on my own blog this weekend or early next week, then ping Toby for review and send a PR to The Raft afterwards. (1 hour)