substring_and_gregexpr.Rmd
substring()
and gregexpr()
(perl = TRUE
variant) showed quadratic time complexity for R version 3.5.2.Initially, the PCRE (perl set) version of regexpr repeatedly scanned the entire input string to ensure it is valid UTF-8, which accounts for the time being proportional to the square of the number of pattern matches in the input strings, resulting in quadratic complexity. (same was observed in the PCRE version of strsplit()
as well)
After this issue was resolved with a following patch for the gregexpr function, a similar case of quadratic time complexity over the expected linear complexity was observed in substring()
and the functions from other packages such as rex, rematch2 and namedCapture which interally use it. A figure showcasing the quadratic trend of functions from the mentioned packages which use substring()
internally, namely rex::re_matches
, rematch2::re_match_all
and namedCapture::str_match_all_named
in comparison to the linear functions such as gregexpr after the patch, can be found here.
After avoiding the repeated calls to strlen
, the extra check for UTF-8 validity and buffering for the same via this commit, substring()
now shows linear time complexity.
The HTML pages which provide ready reference for the aforementioned issues can be found below. (source for extracting information on the topic)
This figure highlights the quadratic trend in the substring function (source code), as provided by Toby Dylan Hocking. One would require an R version prior to 3.6.0 to reproduce that, which reflects the origin of the issue with those functions. The issues and their timeline as per tdhock’s activity are discussed in this article for the package namedCapture.
This is where testComplexity can be useful, so as to verify the change in time complexity after resolution of the issues. A run with our time complexity analysis functions asymptoticTimings
and asymptoticTimeComplexityClass
with suitable parameters show that both substring and gregexpr yield linear time complexity:
library(testComplexity) # substring function test: asymptoticTimeComplexityClass(asymptoticTimings(substring(paste(rep("A", N), collapse = ""), 1:N, 1:N), data.sizes = 10^seq(1, 5, by = 0.5), max.seconds = 1)) #> [1] "linear" # gregexpr function test: asymptoticTimeComplexityClass(asymptoticTimings(gregexpr("a", paste(collapse = "", rep("ab", N)), perl = TRUE), data.sizes = 10^seq(1, 5, by = 0.5), max.seconds = 1)) #> [1] "linear"
Created on 2020-08-17 by the reprex package (v0.3.0)
Do note that sometimes the predicted complexity class may be log linear instead of linear, as the complexity classification procedure relies on the least error from each of the complexity classification models, and depending on the benchmarks run (which always vary) within asymptoticTimings, it is prone to misprediction by the stochastic nature of timings obtained. It is well recieved that it never shows quadratic in any case (also, plots such as this highlight the stark contrast between quadratic time algorithms and the linear ones such as substring and gregexpr), which supports the resolution of the issues.