
R packages to interact with data source APIs
Although it's great that we can read HTML tables, CSV files, and JSON and XML data, and even parse raw HTML documents to store some parts of those in a dataset, there is no sense in spending too much time developing custom tools unless we have no other option. First, always take a quick look at the Web Technologies and Services CRAN Task View; also search R-bloggers, StackOverflow, and GitHub for any existing solution before getting your hands dirty with custom XPath selectors and JSON list magic.
Socrata Open Data API
Let's do this for our previous examples by searching for Socrata, the Open Data Application Program Interface used by the Consumer Financial Protection Bureau. Yes, there is a package for that:
> library(RSocrata)
Loading required package: httr
Loading required package: RJSONIO

Attaching package: 'RJSONIO'

The following objects are masked from 'package:rjson':

    fromJSON, toJSON
As a matter of fact, the RSocrata package uses the same JSON sources (or CSV files) as we did before. Please note the message about masked objects, which tells us that RSocrata depends on a different JSON parser R package than the one we used, so some function names conflict. It's probably wise to detach('package:rjson') before RJSONIO is automatically loaded alongside RSocrata.
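For instance, a minimal sketch of this load order, assuming rjson is still attached from the earlier examples:

# Detach rjson first so that its fromJSON and toJSON functions
# do not clash with the ones loaded with RJSONIO
detach('package:rjson')
library(RSocrata)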
Loading the Customer Complaint Database from the given URL is pretty easy with RSocrata:
> df <- read.socrata(paste0(u, '/25ei-6bcr'))
> str(df)
'data.frame':   18894 obs. of  11 variables:
 $ Complaint.ID        : int  2240 2243 2260 2254 2259 2261 ...
 $ Product             : chr  "Credit card" "Credit card" ...
 $ Submitted.via       : chr  "Web" "Referral" "Referral" ...
 $ Date.received       : chr  "12/01/2011" "12/01/2011" ...
 $ ZIP.code            : chr  ...
 $ Issue               : chr  ...
 $ Date.sent.to.company: POSIXlt, format: "2011-12-19" ...
 $ Company             : chr  "Citibank" "HSBC" ...
 $ Company.response    : chr  "Closed without relief" ...
 $ Timely.response.    : chr  "Yes" "Yes" "No" "Yes" ...
 $ Consumer.disputed.  : chr  "No" "No" "" "No" ...
We got numeric values for numbers, and the dates were also automatically parsed to POSIXlt!
Similarly, the Web Technologies and Services CRAN Task View lists more than a hundred R packages for interacting with Web data sources in fields such as ecology, genetics, chemistry, weather, finance, economics, and marketing; we can also find R packages there to fetch texts, bibliographic resources, Web analytics, news, maps, and social media data, among other topics. Due to page limitations, here we will focus only on the most frequently used packages.
Finance APIs
Yahoo! and Google Finance are pretty standard free data sources for everyone working in the industry. Fetching, for example, stock, metal, or foreign exchange prices is extremely easy with the quantmod package and these service providers. For example, let us see the most recent stock prices for Agilent Technologies, which trades under the A ticker symbol:
> library(quantmod)
Loading required package: Defaults
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Loading required package: TTR
Version 0.4-0 included new data defaults. See ?getSymbols.
> tail(getSymbols('A', env = NULL))
           A.Open A.High A.Low A.Close A.Volume A.Adjusted
2014-05-09  55.26  55.63 54.81   55.39  1287900      55.39
2014-05-12  55.58  56.62 55.47   56.41  2042100      56.41
2014-05-13  56.63  56.98 56.40   56.83  1465500      56.83
2014-05-14  56.78  56.79 55.70   55.85  2590900      55.85
2014-05-15  54.60  56.15 53.75   54.49  5740200      54.49
2014-05-16  54.39  55.13 53.92   55.03  2405800      55.03
By default, getSymbols assigns the fetched results to the parent.frame (usually the global) environment under the name of the symbol, while specifying NULL as the desired environment simply returns the fetched results as an xts time-series object, as seen earlier.
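As a quick sketch of this default behavior (no output shown here):

# The fetched xts object is auto-assigned under the ticker
# symbol's name, and only the symbol name is returned as a string
getSymbols('A')    # returns "A"
head(A)            # the downloaded data is now available as A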
Foreign exchange rates can be fetched just as easily:
> getFX("USD/EUR")
[1] "USDEUR"
> tail(USDEUR)
           USD.EUR
2014-05-13  0.7267
2014-05-14  0.7281
2014-05-15  0.7293
2014-05-16  0.7299
2014-05-17  0.7295
2014-05-18  0.7303
The string returned by getSymbols refers to the R variable in which the data was saved inside .GlobalEnv. To see all the available data sources, let's query the related S3 methods:
> methods(getSymbols)
 [1] getSymbols.csv    getSymbols.FRED   getSymbols.google
 [4] getSymbols.mysql  getSymbols.MySQL  getSymbols.oanda
 [7] getSymbols.rda    getSymbols.RData  getSymbols.SQLite
[10] getSymbols.yahoo
So besides some offline data sources, we can query Google, Yahoo!, and OANDA for recent financial information. To see the full list of available symbols, the already loaded TTR package might help:
> str(stockSymbols())
Fetching AMEX symbols...
Fetching NASDAQ symbols...
Fetching NYSE symbols...
'data.frame':   6557 obs. of  8 variables:
 $ Symbol   : chr  "AAMC" "AA-P" "AAU" "ACU" ...
 $ Name     : chr  "Altisource Asset Management Corp" ...
 $ LastSale : num  841 88.8 1.3 16.4 15.9 ...
 $ MarketCap: num  1.88e+09 0.00 8.39e+07 5.28e+07 2.45e+07 ...
 $ IPOyear  : int  NA NA NA 1988 NA NA NA NA NA NA ...
 $ Sector   : chr  "Finance" "Capital Goods" ...
 $ Industry : chr  "Real Estate" "Metal Fabrications" ...
 $ Exchange : chr  "AMEX" "AMEX" "AMEX" "AMEX" ...
Note
Find more information on how to handle and analyze similar datasets in Chapter 12, Analyzing Time-series.
Fetching time series with Quandl
Quandl provides access to millions of similar time series in a standard format, via a custom API, from around 500 data sources. In R, the Quandl package provides easy access to all this open data from industries all around the world. As an example, let us look at the dividends paid by Agilent Technologies, as published by the U.S. Securities and Exchange Commission. To do so, simply search for "Agilent Technologies" on the http://www.quandl.com homepage, and pass the code of the desired dataset from the search results to the Quandl function:
> library(Quandl)
> Quandl('SEC/DIV_A')
        Date Dividend
1 2013-12-27    0.132
2 2013-09-27    0.120
3 2013-06-28    0.120
4 2013-03-28    0.120
5 2012-12-27    0.100
6 2012-09-28    0.100
7 2012-06-29    0.100
8 2012-03-30    0.100
9 2006-11-01    2.057
Warning message:
In Quandl("SEC/DIV_A") :
  It would appear you aren't using an authentication token. Please visit
  http://www.quandl.com/help/r or your usage may be limited.
As you can see, the API is rather limited without a valid authentication token, which can be obtained on the Quandl homepage for free. To set your token, simply pass it to the Quandl.auth function.
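A minimal sketch, where 'YOUR_AUTH_TOKEN' is a placeholder for the token registered on the Quandl homepage:

# Set the authentication token once per R session
Quandl.auth('YOUR_AUTH_TOKEN')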
This package also lets you:

- Fetch data filtered by time
- Perform some transformations of the data on the server side, such as cumulative sums and the first differential
- Sort the data
- Define the desired class of the returned object, such as ts, zoo, and xts
- Download some meta-information on the data source
The latter is saved as attributes of the returned R object. So, for example, to see the frequency of the queried dataset, call:
> attr(Quandl('SEC/DIV_A', meta = TRUE), 'meta')$frequency
[1] "quarterly"
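To illustrate some of the other options listed above, here is a sketch; the argument names reflect the package version used in this chapter and may differ in newer releases, so please double-check ?Quandl in your installed version:

# Fetch the dividends as an xts object, restricted to two years,
# sorted in ascending order, with server-side cumulative sums
Quandl('SEC/DIV_A',
    type = 'xts',                 # class of the returned object
    start_date = '2012-01-01',    # fetch data filtered by time
    end_date = '2013-12-31',
    sort = 'asc',                 # sort the data
    transformation = 'cumul')     # server-side transformation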
Google documents and analytics
You might, however, be more interested in loading your own or custom data from Google Docs, for which the RGoogleDocs package, available for download from the http://www.omegahat.org/ homepage, is a great help. It provides authenticated access to Google spreadsheets with both read and write access.
Unfortunately, this package is rather outdated and uses some deprecated API functions, so you might be better off trying some newer alternatives, such as the recently released googlesheets package, which can manage Google Spreadsheets (but not other documents) from R.
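A minimal sketch with googlesheets, where 'My sheet' is a placeholder title and the function names are those published in the package at the time of writing:

library(googlesheets)
gs_ls()                        # list your spreadsheets after authenticating
sheet <- gs_title('My sheet')  # register a spreadsheet by its title
df <- gs_read(sheet)           # read its contents into a data frame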
Similar packages are also available to interact with Google Analytics or Google AdWords for all those who would like to analyze page visits or ad performance in R.
Online search trends
On the other hand, we can also interact with APIs to download public data. Google provides access to some public data of the World Bank, IMF, US Census Bureau, and so on at http://www.google.com/publicdata/directory, and also to some of its own internal data in the form of search trends at http://google.com/trends.
The latter can be queried extremely easily with the GTrendsR package, which is not yet available on CRAN, but it at least lets us practice how to install R packages from other sources. The GTrendsR code repository can be found on BitBucket, from where it's really convenient to install it with the devtools package:
Tip
To make sure you install the same version of GTrendsR as used in the following examples, you can specify the branch, commit, or other reference in the ref argument of the install_bitbucket (or install_github) function. Please see the References section in the Appendix at the end of the book for the commit hash.
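For example, a sketch of such a pinned installation, where 'COMMIT_HASH' is a placeholder for the actual reference from the Appendix:

# Pin the installation to a specific branch, tag, or commit
install_bitbucket('GTrendsR', 'persican', ref = 'COMMIT_HASH')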
> library(devtools)
> install_bitbucket('GTrendsR', 'persican', quiet = TRUE)
Installing bitbucket repo(s) GTrendsR/master from persican
Downloading master.zip from https://bitbucket.org/persican/gtrendsr/get/master.zip
arguments 'minimized' and 'invisible' are for Windows only
So installing R packages from BitBucket or GitHub is as easy as providing the name of the code repository and the author's username, and letting devtools do the rest: downloading the sources and compiling them.

Windows users should install Rtools prior to compiling packages from source: http://cran.r-project.org/bin/windows/Rtools/. We also enabled quiet mode to suppress the compilation log and other boring details.
After the package has been installed, we can load it in the traditional way:
> library(GTrendsR)
First, we have to authenticate with a valid Google username and password before being able to query the Google Trends database. Our search term will be "how to install R":
> conn <- gconnect('some Google username', 'some Google password')
> df <- gtrends(conn, query = 'how to install R')
> tail(df$trend)
         start        end how.to.install.r
601 2015-07-05 2015-07-11               86
602 2015-07-12 2015-07-18               70
603 2015-07-19 2015-07-25              100
604 2015-07-26 2015-08-01               75
605 2015-08-02 2015-08-08               73
606 2015-08-09 2015-08-15               94
The returned dataset includes weekly metrics on the relative volume of search queries about installing R. The data shows that the highest activity was recorded in the middle of July, while the volume at the beginning of the following month was only around 75 percent of that peak. So Google does not publish raw search query statistics, but relative values, with which comparative studies can be done across different search terms and time periods.
Historical weather data
There are also various packages providing access to data sources of interest to R users working in the Earth sciences. For example, the RNCEP package can download historical weather data from the National Centers for Environmental Prediction, covering more than one hundred years at a six-hourly resolution. The weatherData package provides direct access to http://wunderground.com. For a quick example, let us download the daily average temperatures for the past seven days in London:
> library(weatherData)
> getWeatherForDate('London', start_date = Sys.Date()-7, end_date = Sys.Date())
Retrieving from: http://www.wunderground.com/history/airport/London/2014/5/12/CustomHistory.html?dayend=19&monthend=5&yearend=2014&req_city=NA&req_state=NA&req_statename=NA&format=1
Checking Summarized Data Availability For London
Found 8 records for 2014-05-12 to 2014-05-19
Data is Available for the interval.
Will be fetching these Columns:
[1] "Date"              "Max_TemperatureC"  "Mean_TemperatureC"
[4] "Min_TemperatureC"
        Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC
1 2014-05-12               18                13                9
2 2014-05-13               16                12                8
3 2014-05-14               19                13                6
4 2014-05-15               21                14                8
5 2014-05-16               23                16                9
6 2014-05-17               23                17               11
7 2014-05-18               23                18               12
8 2014-05-19               24                19               13
Please note that an unimportant part of the preceding output was suppressed, but what happened here is quite straightforward: the package fetched the specified URL (which, by the way, points to a CSV file), then parsed it along with some additional information. Setting opt_detailed to TRUE would also return intraday data at a 30-minute resolution.
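For instance, a sketch of the detailed query (expect a much larger data frame in return):

# Request intraday records at a roughly 30-minute resolution
# instead of the daily summary
getWeatherForDate('London', start_date = Sys.Date() - 7,
    end_date = Sys.Date(), opt_detailed = TRUE)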
Other online data sources
Of course, this short chapter cannot provide an overview of all the available online data sources and their R implementations, but please consult the Web Technologies and Services CRAN Task View, R-bloggers, StackOverflow, and the resources in the References chapter at the end of the book to look for existing R packages or helper functions before creating your own crawler R scripts.