Throughout the development of testComplexity, I've come across a bunch of convenient R packages covering different aspects of software development & workflow automation, most of which I'll be discussing in this post, along with the git VCS.
Contents/Hyperlinks: Version Control | Unit Testing | Continuous Integration | Reproducible Examples
Version Control
Before getting started with any code, setting up a version control system is a must, so that updates/changes to your project are saved (committed) and kept track of. This becomes crucial in certain situations (such as the unforeseen need to rewind to a previous state) and for working collaboratively with other people on the project. Undoubtedly, Git is the most favourable choice in terms of convenience, given its integration with GitHub and the fact that it is prominent enough to outclass the others (Mercurial, AccuRev, Subversion, etc.).
Setup
The first step is to set up Git on your system if you haven't already. After opening a terminal/shell session with git available, set up your credentials if this is your first time (or in case the configuration has been removed at some point):
git config --global user.name "your name"
git config --global user.email "your email"
This allows Git to recognize you by your name and email, with any updates done by you or changes incorporated under your supervision being labelled under those credentials. For example, if you're the owner or a collaborator (basically, a role with write access) of a repository, you can confirm the merging of branches for pull requests, and when you do so, the merge is recorded under your name. (Internally, git identifies every commit by a SHA-1 hash; commits and tags can additionally be signed cryptographically with GPG if you want verifiable authorship.)
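As an aside, if you do want signed commits, git supports GPG signing; a minimal sketch, where the key ID below is a placeholder for one listed by gpg --list-secret-keys:
git config --global user.signingkey ABCD1234          # placeholder key ID
git commit -S -m "a cryptographically signed commit"  # -S signs this particular commit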
To use Git with RStudio, the VCS choice must be specified by going into “Project Options > Git/SVN” and then correspondingly changing the ‘version control’ option from none to Git (restart if prompted). To get started with your local repository, initialize git by running git init via RStudio’s shell interface (“Tools > Shell”).
If you would like to have a GitHub repository for the project set up with Git, you’ll need to add a remote for your local version:
git remote add origin git@github.com:username/reponame.git
Note that the remote version is called ‘origin’ here, and you can start by pushing your current work from the local branch:
git push -u origin master
Further changes to the local master branch can be incorporated into the remote branch either via the above push command run from a terminal, or by simply using the green ‘up’ button (signifying a push operation) available in RStudio’s top right panel. (The same goes for pulling changes from the remote repository via the blue ‘down’ button.)
As you may have observed, the ‘git’ section in the same panel keeps track of all files for the current branch (toggleable via the dropdown), and switching branches from that dropdown performs a git checkout operation under the hood.
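For reference, the shell equivalents of what that panel does (the branch name below is a placeholder):
git status                # list modified, staged and untracked files, as the panel does
git checkout OtherBranch  # what the branch dropdown runs when you switch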
Branches
To implement or experiment with new features, or to push changes separately without affecting the main stream of development, different branches can be created. Ideally, we would want to branch out from our master (branch) to have the existing functionality:
git checkout -b NewBranchName master
The -b flag is the option for creating a new branch, as you might have guessed. (Similarly, git branch -d deletes one.)
All of the branches that exist (i.e. created and not deleted) will be available in the top right panel’s branch dropdown menu, or their names can be listed via the shell with a git branch command. Note that the current branch is marked with an asterisk (and a different colour, typically green, with default highlighting).
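A few common branch-management commands for reference (OldBranch is a placeholder name):
git branch                # list local branches; the current one is starred
git branch -a             # also list remote-tracking branches
git branch -d OldBranch   # delete a branch that has been merged
git branch -D OldBranch   # force-delete an unmerged branch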
Pull Requests
The prime reason why branches are created is to implement new features or fix bugs/issues, all of which are addressed as pull requests to the development branch. These pull requests (referred to as PRs henceforth) can be created on GitHub, but in order to have your updates available on the remote, you’ll first need to push the committed changes from the branch you’re working on:
git push origin NewFeatureBranch
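Putting it together, a typical feature-branch-to-PR workflow looks like this (the file and branch names are placeholders):
git checkout -b NewFeatureBranch master   # branch off master
git add R/new-feature.R                   # stage your changes
git commit -m "Add new feature"           # commit them locally
git push -u origin NewFeatureBranch       # publish the branch, then open the PR on GitHub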
Git Operations
Feel free to play around with basic git commands, since you’re not restricted to only working on another branch and creating a PR from it. For instance, you can make changes in your local master branch itself, transfer them to another branch via a merge operation, remove the updates from master via a reset operation, and then send a PR from that branch:
git checkout NewBranch    # switch to the other branch
git merge master          # merge master's changes into it
git checkout master       # switch back to master
git reset --hard HEAD~n   # drop the last n commits from master
git checkout NewBranch    # return to the branch carrying the changes
git push origin NewBranch # publish it and open a PR from it
Note that n in the reset operation above designates the number of commits to roll back, and must be an integer (alternatively, you can pass the SHA-1 hash of the commit to reset to). For going back only one step, i.e. reverting to the state before the last commit, use HEAD^ instead.
There are different ways to perform the merge operation as well, such as a fast-forward merge, which simply moves the current branch’s tip forward to the target branch’s tip/head when the histories haven’t diverged. For example, if I’m working in a branch ExampleBranch which is a few commits ahead of master and I want to merge it into master directly without the need for a PR, I can use:
git checkout master && git merge ExampleBranch
This tends to be useful since you can merge commits from another branch into master without making a PR. Making a PR isn’t inherently bad in any sense, but for my project, I’m keeping my PRs restricted to the addition of much-needed features; for a series of small changes without any major additions/fixes, I prefer to merge the changes into master via this technique. (For example, this commit from testComplexity.)
If you prefer a single commit in place of multiple incoming commits in a merge operation, the squash option fits right in. With git merge --squash, all the changes from the incoming branch are staged together, to be recorded as one consolidated commit.
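A minimal sketch of the squash workflow (the branch name is a placeholder); note that --squash stages the combined changes but leaves the actual commit to you:
git checkout master
git merge --squash NewFeatureBranch          # stage all of the branch's changes at once
git commit -m "Add feature X in one commit"  # record them as a single commit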
Also, if a commit is made to the wrong branch (i.e. not the one it was intended for), cherry-picking works as well. A git cherry-pick takes a given commit from any branch and applies it to a different branch, without having to apply any other changes from that commit’s history.
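For instance (the SHA below is a placeholder for the misplaced commit’s hash, as shown by git log):
git checkout IntendedBranch    # placeholder branch name
git cherry-pick abc1234        # apply just that one commit here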
Do note that a git merge operation preserves the histories of both branches in complete detail, which can sometimes make the overall history of the project difficult to follow. If you would like to maintain a compact and linear history, a git rebase would do the job: it rewrites the commit history of one branch by replaying its commits on top of the other branch’s tip.
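A sketch of rebasing a feature branch onto the latest master (the branch name is a placeholder):
git checkout NewFeatureBranch
git rebase master    # replay this branch's commits on top of master's tip
Like other history rewrites, a rebased branch that was already pushed will need a force push afterwards.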
If you happened to create a PR but haven’t merged it yet, don’t worry! It’s possible to undo it completely, first by hard-resetting your local repository to the commit before the PR and then by force-pushing the update to GitHub:
git reset --hard HEAD^ && git push -f
If you’ve merged it though, you can remove the commits but you cannot remove the existence of it in GitHub’s history. (unless GitHub support removes that for you)
You can change the message of your previous commit as well:
git commit --amend -m "New Commit Message"
This locally resets your last commit’s message to the newly stated one. However, if you’ve already pushed the commit to GitHub, you’ll need to force push the change with git push --force (or -f). Note that the old and new commits will be different objects, possessing unique commit IDs (SHA-1 hashes).
The aforementioned commands/techniques should be enough to cover all the basics, but there are many more intriguing operations and several ways to achieve a given result with git, which I’ll leave to the reader’s curiosity.
Unit Testing
Creating test cases for all the functions in your package should be a priority (unless you are certain that every corner case is taken care of, and that further debugging isn’t required). However, it becomes monotonous to manually run chunks of code several times in order to test them against different functions and use cases.
Automated Testing
A quintessential way is to use automated testing with the testthat package, with expectations expressed via the expect_* functions it offers, grouped inside test_that() blocks within test files. The files typically follow a naming convention of test-xyz.R and are stored under ./tests/testthat/ in your package. devtools::test() or devtools::check() (which incorporates the former) runs all the tests under your testthat directory (Ctrl+Shift+T works as well), but if you want to test a specific file or a folder of files, you can do that via testthat::test_file("./FilePath") and testthat::test_dir("./DirectoryPath") respectively.
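For instance, a minimal test file might look like this (hypothetical expectations, not from testComplexity):
# tests/testthat/test-basics.R
library(testthat)
test_that("basic expectations hold", {
  expect_equal(mean(c(1, 2, 3)), 2)   # numeric equality (with tolerance)
  expect_identical(sum(1:10), 55L)    # exact equality
  expect_error(stop("boom"), "boom")  # the error message is matched too
})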
The way this becomes automated is that the tests are run time and again for each R CMD check, and subsequently for each commit you push to GitHub (via CI services, discussed further below).
Do note that testthat only checks the specific input cases you write (such as null/NA/NaN checks), provided we handle those in our function(s) pre-emptively, and not a generalized test-case scenario. (That is, you’ll need to ensure your function fits all cases yourself; testthat doesn’t generate cases for you.)
Pros include grouping multiple test cases inside a test_that() block, plus chaining of expectations. You can also create your own expect_ functions, i.e. furnish custom expectations.
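A custom expectation is just a function built on testthat::expect(); a minimal sketch following the pattern from testthat’s documentation (expect_positive is a hypothetical name):
expect_positive <- function(object) {
  # capture the value and a readable label (rlang is a dependency of testthat)
  act <- testthat::quasi_label(rlang::enquo(object), arg = "object")
  testthat::expect(
    all(act$val > 0),
    sprintf("%s is not strictly positive.", act$lab)
  )
  invisible(act$val)
}
# usage inside a test_that() block:
# expect_positive(var(c(1, 5, 9)))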
Code Coverage
Testing code is fine, as long as you can determine what portion of your source code is actually covered by your tests, and then subsequently cover the remnants when possible. The metric governing this ratio is ‘code coverage’, which can be measured in R using the covr package, along with third-party services like Codecov and Coveralls.
In order to diagnose your source code and obtain the degree of code coverage, store the results in a variable via a run of covr::package_coverage():
cc <- covr::package_coverage()
If you enter this variable at the console, you’ll observe that it prints the overall percentage of code coverage for your package, plus the percentage for each individual function covered by the automated tests (via testthat).
To diagnose the code for each function (i.e. to check which lines are not covered by your tests), simply run covr::report() with the above variable (which contains the computed code coverage data) as its argument:
covr::report(cc)
Instantly, you will notice a neat delineation of code coverage metrics for your package in the ‘Viewer’ tab. From there, you can click through to the functions (with source code displayed) that could do with better coverage (as per the %), and correspondingly write more tests to include the lines highlighted in red (similar to deletions in a git diff).
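Relatedly, if you just want the uncovered lines as data rather than an HTML report, covr also provides zero_coverage(), which takes the same coverage object:
covr::zero_coverage(cc)   # data frame of source lines with zero test coverage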
Automated Coverage
To automate the process, you can use either codecov or coveralls which will provide a measure of your code coverage after every commit you stage and push. Follow the steps below to get started:
- Run usethis::use_coverage() in the console and specify the one you want for the type parameter:
usethis::use_coverage(type = "codecov")
I’m proceeding with Codecov here, for this step and the rest that follow.
- The freshly generated codecov.yml only needs the following line (which disables Codecov’s PR comments):
comment: false
The remaining configuration belongs in your .travis.yml if you’re using Travis:
language: r
sudo: false
cache: packages
after_success:
- Rscript -e 'covr::codecov()'
The after_success step at the end is what makes Travis run covr and upload the results, generating automated code coverage reports from Codecov after every commit.
- You’ll need to log into Codecov with your GitHub account and grant it access to your repository (a one-time step), which will generate a token for you to copy.
- Run covr::codecov() and supply the copied value (from the last step) to the token parameter:
covr::codecov(token = "TokenValue")
- This will upload your code coverage results (the same as measured by covr::package_coverage(), but then you get a neat badge!) to Codecov and give you access to the dashboard and various graphs to view your coverage.
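Speaking of the badge, it’s just a line of markdown for your README (replace username/reponame accordingly):
[![codecov](https://codecov.io/gh/username/reponame/branch/master/graph/badge.svg)](https://codecov.io/gh/username/reponame)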
The process should be strikingly similar for coveralls, although I haven’t tried it yet.
Continuous Integration
Using CI is the best way to automate the checks for your package: an R CMD check covering the written tests is run every time you (or another contributor) push a commit to the project’s repository on GitHub.
There are quite a few CI service providers (most of which are free, unless your repository is private or you want to run the services on an array of repositories/projects, like an organization). Like many out there, I’ve been using Travis, with no trouble so far (though most likely I haven’t done any complex stuff yet!).
After setting up Travis via usethis::use_travis(), you need to edit the .travis.yml file to configure it as per your requirements. If your Travis CI builds throw an error, there are three possibilities:
(a) You didn’t run an R CMD check (via devtools::check()) locally and just pushed the commit or PR, in which case the build throws the same errors as you would observe in your native development environment. Debugging is simple here, as you only need to fix the issues locally.
Note that you can modify the script or R CMD commands as well. For instance, if you want to build like CRAN and include vignette compression, you can include the --as-cran flag for the check and --compact-vignettes=gs+qpdf (or both) for the build (the equivalent from R being devtools::build(args = "--compact-vignettes=both")):
script:
- R CMD build . --compact-vignettes=gs+qpdf
- R CMD check *.tar.gz --as-cran
(b) You’ve tested it in your local environment with whatever OS you’re running, but it’s not working on some other operating systems, as observed via Travis CI jobs running on them. You’ll need to either resolve those cross-platform dependencies/errors or label your package unfit for those operating systems. Either way, Travis just helped you know your current project status for cross-platform deployment.
Concerning different operating systems, you can specify them in the matrix pool, with a separate job run for each:
matrix:
include:
- r: release
os: osx
This covers the remaining one (OS X), provided I’m done with the others. (The native development OS in which I run and test my builds is Windows, and Travis by default runs on Ubuntu Xenial at the moment.)
If you want to go for all of them, just include them under the OS level:
os:
- linux
- osx
- windows
This can be compared with the analogous jobs matrix in a GitHub Actions R-CMD-check workflow:
jobs:
R-CMD-check:
runs-on: ${{ matrix.config.os }}
name: ${{ matrix.config.os }} (${{ matrix.config.r }})
strategy:
fail-fast: false
matrix:
config:
- {os: macOS-latest, r: 'release'}
- {os: windows-latest, r: 'release'}
- {os: ubuntu-16.04, r: 'release', rspm: "https://packagemanager.rstudio.com/cran/__linux__/xenial/latest"}
You can specify different R versions (under the r: key) as well, such as the development version (devel), the current release (release) and the previous release (oldrel). (Bioconductor versions are handled separately.)
r:
- oldrel
- release
- devel
(c) You didn’t configure Travis properly, or it’s an issue isolated to Travis CI. The best option for resolving the former is to dive into the Travis docs (such as the build environment & R-specific guides). For the latter, you might need to contact their support.
Reproducible Examples
If the code in your package results in a bug/flaw which needs to be addressed/rectified, it needs to be reproducible for other developers to experiment with, either in a cloud-based virtual environment or on their local setups; for this, we generally include a copy-pasteable version of our code which replicates the error.
This isn’t always ideal, since the code may not work in an isolated environment, or even locally for someone else trying to reproduce the bug from a copy of it. This is where the reprex package kicks in, providing a function of the same name which first tests whether your code runs in a standalone way (in a fresh session, without hidden dependencies), and then copies the rendered output to your clipboard, targeted at code-posting sites (the venue can be specified, defaulting to GitHub-flavoured markdown). The rendering itself will trigger errors if your code is not in a reproducible form, indicating what could possibly be wrong. If it triggers no errors and shows the incorrect output or bug (as you discovered while debugging), then you can paste the rendered markdown into a GitHub issue or a question on Stack Overflow or the RStudio Community, in the hope of getting answers.
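For example, a minimal reprex run could look like this (the snippet inside is a stand-in for whatever code exhibits your bug):
library(reprex)
reprex({
  x <- c(1, 2, NA)
  mean(x)        # returns NA, the behaviour being demonstrated
}, venue = "gh") # "gh" (GitHub-flavoured markdown) is the default venue
The rendered markdown lands on your clipboard, ready to be pasted into an issue or question.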
Anirban | 08/20/2020 |