Open code, git and GitHub

Advanced R Course

Ella Kaye | Department of Statistics | University of Warwick

January 16, 2023

Overview

  • Open code and software

  • git

  • GitHub

  • git and GitHub with RStudio

Open code and software

From https://www.heatherturner.net/talks/gregynog2022/#1

Open practices

Recognised as integral to healthy research culture

Working openly means that our work is

  • Accessible, for free
  • Open to scrutiny (verifiable)
  • Reproducible

Code and Software

Scripts: e.g analysis/simulation

  • Customised
  • Reproducible workflow
    • read data
    • analyse
    • summarise
    • report

Software: Package, tool, dashboard

  • Optimized
  • Reusable
  • Sharable

Why share code and software?

Beyond general benefits of open practices

  • Increased impact and reputation
  • Faster translation into practice
  • Additional, citable, outputs

Benefits of coding in the open

  • Encourages good practices
  • Facilitates collaboration
  • Can give access to software engineering tools

Sharing code in an online repository

Pros

  • Version control (commit history + releases)
  • README.md for quick documentation
  • Facilitates contribution (bug reports, patches)
  • Can use CITATION.cff file for clear citation GitHub
  • Links to Zenodo for publishing releases

Cons

  • Learning curve to take full advantage

More on open code and software

See https://www.heatherturner.net/talks/gregynog2022/#1

git

Version Control

When developing code, we often want to keep old versions.

We might save with different files names

code                               
  ¦--simulations.R                  
  ¦--simulations_correct_sd.R       
  ¦--simulations_return_parameters.R

or comment on changes (especially when collaborating with others)

## EK 2023-01-10 use geom_bar instead of geom_histogram
## p <- p + geom_histogram(stat = "identity")
p <- p + geom_bar(stat = "identity")
## HT 2023-01-09 remove legend from plot
p <- p + theme(legend.position = "none")

Either way it can get messy and hard to track/revert changes!

git

git is a version control system that allows us to record changes made to files in a repository or repo.

Each version has a unique ID and metadata:

  • Who created the new version
  • A short description of changes made
  • When the version was made

Versions can be compared, restored and merged.

git repository

To get started, a repository must be created locally (within a working directory on your computer) or on a remote hosting platform (we’ll use GitHub).

git can then track when files/folders are

  • Added
  • Modified
  • Deleted

Repositories can have multiple branches of development. We will work on a single branch, with the default name of main.

Staging and committing

Versions are created in a commit.

We prepare the commit by staging changes we want to record:

  • Untracked files (git treats the whole content as new)
  • Tracked files that have been modified or deleted since the last commit

Think of it like taking photographs: we stage the scene by adding/removing people, or changing people’s outfits, when we have a scene we want to save we take a photograph.

GitHub

git + Github

The full power comes by connecting a local repo to GitHub.

  • You can make changes locally and push them to GitHub
  • You can make changes via the GitHub website and later pull them into your local copy.
  • Collaborators can also push/pull changes to the repo.

Diagram showing arrows to and from your repo and a central remote repo (GitHub). There are more arrows to and from GitHub and the web UI. There are also arrows to and from GitHub and "their repo" representing another person.

GitHub

Warwick GitHub Enterprise: https://github.warwick.ac.uk/

  • Use for Warwick-specific, private projects
  • Username will be ITS username

GitHub: https://github.com/

  • Free version allows private/public projects
  • Can share/collaborate with people outside Warwick
  • Develop your personal portfolio

Features

GitHub repositories have some nice features:

  • README.md displayed as HTML
  • Browsable commit history
  • Issues where you/others can note bug reports/TODO/feature requests
  • Pull requests (advanced) where you/others can propose changes to the code to be reviewed
  • Actions (advanced) automated actions when you commit to the repo
  • Projects (advanced) to organize issues (To Do/In Progress/Done)
  • Wiki for project documentation
  • Deploy websites with pages

Further benefits of GitHub

Research compendia

Potential to use Binder to so that people can run your code in the browser

  • For short analysis (<10 min) on small data (< 10MB)

  • Zero to Binder tutorial [Julia, Python, R]

Software packages

  • Users can install Julia/Python/R packages from GitHub

  • Can deploy package websites using GitHub pages

  • Can benefit from GitHub Actions, e.g. automatically running tests

Your turn! Explore some GitHub repositories

Your turn! Create a GitHub repository

  • Login or sign up for a free account at https://www.github.com.

    • Recommend to use personal email

    • Username recommendations:

      • Incorporate your actual name
      • Reuse your username from other contexts.
      • Use a name you can share in professional contexts.
      • Use all lower case letters.
  • Click on + in the top right to create a new, public, repository called github-intro.

    • Choose to initialize the repository with a README file

Your turn! Edit on GitHub

  • Click the pencil icon on the README.md file to edit.

    • Update the title
    • Add some example content using markdown syntax
    • Use the “Preview changes” tab to check your edits
  • Scroll down to the Commit changes section

    • Add a short description of your changes in the first dialog box, e.g. add basic information to README

    • Click the green Commit changes button

    • This will stage and commit the file in one go.

  • View the commit history (look for clock icon with anti-clockwise arrow) and look at the diff for your commit.

See Basic writing and formatting syntax for more on editing markdown.

Your turn! Continue experimenting

Further exercises to do while other people set up authentication.

  • Try uploading a picture from Unsplash. Go to Add file > Upload files. Edit your README to add the image.

  • Go to Add file > Create new file. Type subfolder/ in the “Name your file box” to create a subfolder. Now type README.md in the “Name your file box”. Add some content to the README and commit - try some new markdown syntax, e.g. emoji or a table.

  • Build a stunning README for your GitHub profile

Configuring git and GitHub

Using git locally

Check if you already have git by running the following command in a terminal (e.g. Terminal tab on RStudio).

On MacOS/Linux (or Windows with Rtools)

which git

On Windows

where git

MacOS: If asked to install the Xcode command line tools, say yes - this will install git.

Installing git

Windows:

  • Use the installer from https://git-scm.com/downloads

  • Check RStudio can find the git executable

    • Go to Tools > Global Options > Git/SVN to check
    • Restart RStudio before trying to use git

Linux:

sudo apt-get install git

or

sudo yum install git

Configure git (globally)

Set the default name and email to associate with your git commits:

library(usethis)
use_git_config(
  user.name = "Ada Lovelace",     # your full name
  user.email = "ada@example.com"  # email associated with GitHub account
  )

To keep your email private:

Authentication

GitHub requires authentication with a Personal Access Token or SSH key.

SSH keys PAT
More setup (on Windows) usethis helpers
Once per computer+host Renew every 30 days (recommended)
Need for HPC Need for usethis/other R packages

Recommend: PAT now (easier), SSH later if needed

Main recommendations

From {usethis} vignette Managing Git(Hub) Credentials:

Our main recommendations are:

  • Adopt HTTPS as your Git transport protocol.
  • Turn on two-factor authentication for your GitHub account.
  • Use a personal access token (PAT) for all Git remote operations from the command line or from R.
  • Allow tools to store and retrieve your credentials from the Git credential store. If you have previously set your GitHub PAT in .Renviron, stop doing that.

Highly recommend reading this entire vignette and following all guidance

sitrep and vaccinate

library(usethis) # make sure > v2.0.0
git_sitrep() # current situation report
git_vaccinate() # add files to global .gitignore (best practice)

Get a personal access token (PAT)

First, make sure you’re signed into GitHub. Then run

usethis::create_github_token()
  • Add Note describing use-case (e.g. personal-macbook-pro-2021, project-xyz)
  • Select expiration (recommend default 30 days)
  • Check scope
  • Click ‘Generate Token’
  • Important! Copy token to clipboard, do not close window until stored!
  • You may want to store token in a secure vault, like 1Password

Put your PAT into the local Git credential store

Assume token is on your clipboard

gitcreds::gitcreds_set()
  • If you don’t have a PAT stored, will prompt you to enter: paste!
  • If you do, you will be given a choice to keep/replace/see the password
    • choose as appropriate
    • if replacing, paste!

PAT maintenance

  • By default token will expire after 30 days.
  • Return to https://github.com/settings/tokens and click on its Note
    • or else click on link in e-mail telling you token is about to expire!
  • Regenerate token
  • rerun gitcreds::gitcreds_set()

A bit of a pain to do this every month, but only takes a couple of minutes.

Your turn!

If you need to, generate and store a PAT, as described over last few slides and in the vignette.

If you already have a PAT set, read the vignette and follow through on other best pracice recommendations.

Create an RStudio project from a GitHub repo

Now you can clone your GitHub repo locally.

  • Go to your repo homepage on GitHub. Click the green “Code” button, copy the HTTPS address (of the form https://github.com/USERNAME/REPO.git).

  • In RStudio, go to File > New Project… > Version Control > Git.

    • Enter the repo URL you just copied. The project directory name will be filled automatically.
    • Browse to a location you want the project directory to be created.
    • Click Create Project.

Git tab

A Git tab is added to the pane that is in the top right by default, usually with Environment, History, and Connections tabs.

The initial view is equivalent to the output of the terminal command git status.

We can stage changes to commit

Underlying command:
git add <file/folder>.

Then commit a set of changes

Underlying command:
git commit -m "commit message"

.gitignore

At this stage you should see two untracked files in your Git Pane that were created when setting up the project: an .Rproj file and a .gitignore file.

The .gitignore file specifies files that git should ignore - they won’t appear in the git pane even as untracked files.

Examples of how to specify files in .gitignore:

  • Single file: .Rhistory
  • File pattern: *.log (all files with .log extension)
  • Directory (and files in it): /dirname/

The .gitignore file must at least be staged to have an effect.

First commit from RStudio

  • Stage and commit the .Rproj file and .gitignore file, with the message “setup RStudio project”.

  • Click on the clock icon in the Git pane to view the history of previous commits.

  • Close the “Review Changes” window. Now click the green up arrow to push your changes to GitHub.

  • Go to the repo on GitHub and verify your changes have been pushed.

Pulling changes from GitHub

  • Edit the README once more on GitHub in a new commit.

  • Back in RStudio, on the Git tab, click the blue down arrow to pull the changes from GitHub

  • The changes to README should now appear in your local copy

Avoid conflicts

If you work on both the local and GitHub copy, it’s possible to get out of sync and end up with conflicting versions of the same file.

It is possible to fix this, but it can be tricky/confusing. It’s best to avoid problem in first place!

Recommended practice:

  • Always commit and push changes at the end of an RStudio session
  • Always pull changes at the beginning of an RStudio session

Set .Rprofile to check git status

This is a neat trick (credit: Lisa De Bruine)

Open your .Rprofile

usethis::edit_r_profile()

and add the following (credit: Lisa De Bruine)

cat(cli::col_blue(system("git status -u no", TRUE)))

This will run git status from the command line when you start R, giving a (blue coloured!) message e.g.

On branch main Your branch is behind 'origin/main' by 1 commit,
and can be fast-forwarded.   (use "git pull" to update your  
local branch)  nothing to commit, working tree clean

If your main branch is behind the main branch on origin (GitHub), you should pull changes before making new edits.

Advanced: Amend commit (before pushing)

Sometimes we don’t stage everything we intended to include in a commit, e.g. we committed a file before saving the latest changes.

If we haven’t yet pushed the commit to GitHub, simply stage the extra commits and check the “Amend previous commit” box under the commit message.

The original commit message will be shown - you can edit this to change the message for the amended commit (useful if you forgot to reference a GitHub issue number)

Advanced: Undo last commit (before pushing)

Alternatively, you can undo a commit before pushing.

To undo the commit, keeping files as they are

git reset HEAD~1

(change the 1 to a higher number to go back more than 1 commit).

To undo the commit and all the changes in that commit

git reset --hard HEAD~1

This goes back to the version at the last commit.

Advanced: Undo last commit (after pushing)

It is best practice to create a new commit that undoes the changes. Run

git revert HEAD

This edits the files to undo the changes in your last commit. You should then commit these edits, with a relevant message.

It is possible to use git reset --hard to undo a commit and then git push origin main --force to force this change onto GitHub. Sometimes repository maintainers do not allow this as it rewrites the history, which can cause problems for people that have cloned or forked your repo.

General workflow

  • Commit regularly, once you’ve got a small complete change, e.g. a working draft of a function, a bug fix, a draft of a README.

    • It is easier to review/revert changes if they relate to a single file or common issue
  • Ideally, make a commit everytime you make a substantial, coherent set of changes.

  • At least make a commit every time you take a break, especially when leaving at the end of your working session

  • Push often enough that GitHub is a useful backup

Adding an existing project to GitHub

The simplest approach is to create a GitHub repo with just a README as before, create the corresponding RStudio project, copy your files into the new directory, stage and commit them.

If you are already using git and want to move the project to GitHub, see Adding a local repository to GitHub using git. Once the project is on GitHub you can clone it into an RStudio project.

Some other tools

  • gert R package has functions for interacting with git

  • It can be worth using dedicated software for interacting with git/GitHub

Summary

  • Working openly encourages good practices

  • Your code/software is an asset!

  • We can make steps to improve our own practice

  • git and GitHub are invaluable resources for version control and collaboration

Learning more

End Matter

License

Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).