gfpop::gfpop(), a log-linear time algorithm for constrained changepoint detection

Abstract

The generalized functional pruning and optimal partitioning (gfpop) is a constrained changepoint detection algorithm that jointly estimates the number of peaks and their locations by minimizing a cost function which consists of a data fitting term and a penalty for each changepoint.

As the name suggests, its a generalization of the fpop algorithm which allows optimal inference in models with inequality constraints between adjacent means. It can be used to compute optimal peak models for genomic data in log-linear time, which was empirically found as discussed below.

Computational complexity

The overall theoretical time/space complexity for gfpop is O(NI), where I is the number of intervals (candidate changepoints) stored in the optimal cost functions. During test runs to estimate its complexity, it was observed that the empirical mean/max number of intervals increases logarithmically with data set size, i.e. I = O(logN) was observed.

This makes the overall complexity O(NlogN) or log-linear in the number of data points (N) to segment, (same for a disk-based storage as well, as found via the package PeakSegDisk) with the number of intervals in real (theoretically there could be some pathological data sets for which the algorithm computes I = O(N) intervals, which results in the worst-case complexity of O(N²)) data being found to be I = O(logN) empirically.

Reference

The pdf proposing the algorithm and enacting as the source for me to extract the above information can be found here.

Testing

We can use the testComplexity functions asymptoticTimings and asymptoticTimeComplexityClass to verify the empirical observation of the log-linear trend followed in the number of data points to segment:

library(testComplexity)
df <- asymptoticTimings(gfpop::gfpop(data = dataGenerator(as.integer(N), c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1), c(0, 0.5, 1, 1.5, 2, 2.5, 3), sigma = 1), mygraph = gfpop::graph(penalty = 2*log(as.integer(N)), type = "isotonic"), type = "mean"), data.sizes = 10^seq(1, 4, by = 0.5))
asymptoticTimeComplexityClass(df)
#> [1] "loglinear"

^{Created on 2020-08-17 by the reprex package (v0.3.0)}

Anirban Chetia

2020-08-02

Abstract

Computational complexity

Reference

Testing