Quadratic to linear transition for substring() and gregexpr()

History

The functions substring() and gregexpr() (perl = TRUE variant) showed quadratic time complexity for R version 3.5.2.
They are linear in nature (time complexity wise) starting from R version 3.6.0, after the commits to substring and gregexpr by Tomas Kalibera, as suggested by Toby Dylan Hocking and mentioned here.

Issues

Initially, the PCRE (perl set) version of regexpr repeatedly scanned the entire input string to ensure it is valid UTF-8, which accounts for the time being proportional to the square of the number of pattern matches in the input strings, resulting in quadratic complexity. (same was observed in the PCRE version of strsplit() as well)

After this issue was resolved with a following patch for the gregexpr function, a similar case of quadratic time complexity over the expected linear complexity was observed in substring() and the functions from other packages such as rex, rematch2 and namedCapture which interally use it. A figure showcasing the quadratic trend of functions from the mentioned packages which use substring() internally, namely rex::re_matches, rematch2::re_match_all and namedCapture::str_match_all_named in comparison to the linear functions such as gregexpr after the patch, can be found here.

After avoiding the repeated calls to strlen, the extra check for UTF-8 validity and buffering for the same via this commit, substring() now shows linear time complexity.

The HTML pages which provide ready reference for the aforementioned issues can be found below. (source for extracting information on the topic)

References

gregexpr: https://stat.ethz.ch/pipermail/r-devel/2017-January/073577.html
substring: https://stat.ethz.ch/pipermail/r-devel/2019-February/077318.html

Reproducibility

This figure highlights the quadratic trend in the substring function (source code), as provided by Toby Dylan Hocking. One would require an R version prior to 3.6.0 to reproduce that, which reflects the origin of the issue with those functions. The issues and their timeline as per tdhock’s activity are discussed in this article for the package namedCapture.

Testing

This is where testComplexity can be useful, so as to verify the change in time complexity after resolution of the issues. A run with our time complexity analysis functions asymptoticTimings and asymptoticTimeComplexityClass with suitable parameters show that both substring and gregexpr yield linear time complexity:

library(testComplexity)
# substring function test:
asymptoticTimeComplexityClass(asymptoticTimings(substring(paste(rep("A", N), collapse = ""), 1:N, 1:N), data.sizes = 10^seq(1, 5, by = 0.5), max.seconds = 1))
#> [1] "linear"
# gregexpr function test:
asymptoticTimeComplexityClass(asymptoticTimings(gregexpr("a", paste(collapse = "", rep("ab", N)), perl = TRUE), data.sizes = 10^seq(1, 5, by = 0.5), max.seconds = 1))
#> [1] "linear"

^{Created on 2020-08-17 by the reprex package (v0.3.0)}

Do note that sometimes the predicted complexity class may be log linear instead of linear, as the complexity classification procedure relies on the least error from each of the complexity classification models, and depending on the benchmarks run (which always vary) within asymptoticTimings, it is prone to misprediction by the stochastic nature of timings obtained. It is well recieved that it never shows quadratic in any case (also, plots such as this highlight the stark contrast between quadratic time algorithms and the linear ones such as substring and gregexpr), which supports the resolution of the issues.

Anirban Chetia

2020-07-20

History

Issues

References

Reproducibility

Testing