Friday, January 31, 2014

Access Google Spreadsheet directly in bash and in R

Google Doc is a good way to share/manage documents between you and your colleagues, but sometime you want to directly access the data in terminal (e.g. bash) or in program (e.g. R), without downloading the data first.

For example, I have a Google Spreadsheet here:

https://docs.google.com/spreadsheet/pub?key=0Aumm3V3g3dF7dFR6eWRJQjE0UHJpdU8yVTlxb2hUMFE

I want to open it in R or select some of columns in bash. Here are the tips for that:

Step1: publish the tab you want to access to the web [howto], in format of CVS or TXT (which is a tab-delimited file actually)

Step2: copy the published URL. At the end, you will see "&output=txt" part for output format, and "&gid=0" to indicate which tab to access (if you have multiple tab, specify tab number here, which starts from 0).

Step3: To access the file externally. Done.

For example, to access the 1st and 3rd columns in bash:

wget --no-check-certificate -q -O - 'https://docs.google.com/spreadsheet/pub?key=0Aumm3V3g3dF7dFR6eWRJQjE0UHJpdU8yVTlxb2hUMFE&single=true&gid=0&output=txt' | cut -f1,3

or
curl -s 'https://docs.google.com/spreadsheet/pub?key=0Aumm3V3g3dF7dFR6eWRJQjE0UHJpdU8yVTlxb2hUMFE&single=true&gid=0&output=txt'

Note: the quote on the URL is a must.
In R,

myURL="https://docs.google.com/spreadsheet/pub?key=0Aumm3V3g3dF7dFR6eWRJQjE0UHJpdU8yVTlxb2hUMFE&gid=0&output=tsv"
require(RCurl)
myTable=read.delim(textConnection(getURL(myURL)))


Note: the above code still works as a simple solution. Just note that Google has changed the txt into tsv, as I highlighted above (2016-Feb-2)

5 comments:

  1. R throws the following error (I use windows). Do you know of a solution to this problem?

    Error in function (type, msg, asError = TRUE) :
    SSL certificate problem, verify that the CA cert is OK. Details:
    error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

    ReplyDelete
  2. Thanks for this. It is really useful. However, I have the same issue as Unknown. Is there something else that has to be done?

    ReplyDelete
  3. hey, Rob and Unknown
    Thanks for visiting the blog. You are right; the error is due to unproper publishing of the Google Spreadsheet. I've updated it (exactly as the first 2 steps as I described above). Please try it again. Thanks

    ReplyDelete
  4. Here's a simpler way to download google spreadsheets in R, without having to publish to the web:

    install.packages('gsheet')
    library(gsheet)
    gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')

    Full disclosure: I wrote this, because I had a spreadsheet that I was happy to share by link, but didn't want fully published.
    More detail is here: https://github.com/maxconway/gsheet

    ReplyDelete