
R packages to interact with data source APIs
Although it's great that we can read HTML tables, CSV files, and JSON and XML data, and even parse raw HTML documents to store some parts of those in a dataset, there is no sense in spending too much time developing custom tools unless we have no other option. First, always take a quick look at the Web Technologies and Services CRAN Task View; also search R-bloggers, StackOverflow, and GitHub for any existing solution before getting your hands dirty with custom XPath selectors and JSON list magic.
Socrata Open Data API
Let's do this for our previous examples by searching for Socrata, the Open Data Application Program Interface used by the Consumer Financial Protection Bureau. Yes, there is a package for that:
> library(RSocrata)
Loading required package: httr
Loading required package: RJSONIO

Attaching package: 'RJSONIO'

The following objects are masked from 'package:rjson':

    fromJSON, toJSON
As a matter of fact, the RSocrata package uses the same JSON sources (or CSV files) as we did before. Please note the message about masked objects, which tells us that RSocrata depends on a different JSON parser R package than the one we used, so some function names conflict. It's probably wise to detach('package:rjson') before RJSONIO is automatically loaded alongside RSocrata.
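For instance, a minimal sketch of this load order, assuming rjson is still attached from the earlier examples:

# Detach rjson first so that its fromJSON and toJSON functions
# do not clash with the ones loaded with RJSONIO
detach('package:rjson')
library(RSocrata)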
Loading the Customer Complaint Database from the given URL is pretty easy with RSocrata:
> df <- read.socrata(paste0(u, '/25ei-6bcr'))
> str(df)
'data.frame':   18894 obs. of  11 variables:
 $ Complaint.ID        : int  2240 2243 2260 2254 2259 2261 ...
 $ Product             : chr  "Credit card" "Credit card" ...
 $ Submitted.via       : chr  "Web" "Referral" "Referral" ...
 $ Date.received       : chr  "12/01/2011" "12/01/2011" ...
 $ ZIP.code            : chr  ...
 $ Issue               : chr  ...
 $ Date.sent.to.company: POSIXlt, format: "2011-12-19" ...
 $ Company             : chr  "Citibank" "HSBC" ...
 $ Company.response    : chr  "Closed without relief" ...
 $ Timely.response.    : chr  "Yes" "Yes" "No" "Yes" ...
 $ Consumer.disputed.  : chr  "No" "No" "" "No" ...
We got numeric values for numbers, and the dates were also automatically parsed to POSIXlt!
Similarly, the Web Technologies and Services CRAN Task View lists more than a hundred R packages for interacting with Web data sources in fields such as ecology, genetics, chemistry, weather, finance, economics, and marketing; we can also find R packages there to fetch texts, bibliographic resources, Web analytics, news, maps, and social media data, among other topics. Due to page limitations, here we will focus only on the most frequently used packages.
Finance APIs
Yahoo! and Google Finance are pretty standard free data sources for everyone working in the industry. Fetching, for example, stock, metal, or foreign exchange prices is extremely easy with the quantmod package and these service providers. For example, let us see the most recent stock prices for Agilent Technologies, which trades under the A ticker symbol:
> library(quantmod)
Loading required package: Defaults
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Loading required package: TTR
Version 0.4-0 included new data defaults. See ?getSymbols.
> tail(getSymbols('A', env = NULL))
           A.Open A.High A.Low A.Close A.Volume A.Adjusted
2014-05-09  55.26  55.63 54.81   55.39  1287900      55.39
2014-05-12  55.58  56.62 55.47   56.41  2042100      56.41
2014-05-13  56.63  56.98 56.40   56.83  1465500      56.83
2014-05-14  56.78  56.79 55.70   55.85  2590900      55.85
2014-05-15  54.60  56.15 53.75   54.49  5740200      54.49
2014-05-16  54.39  55.13 53.92   55.03  2405800      55.03
By default, getSymbols assigns the fetched results to the parent.frame (usually the global) environment under the name of the symbol, while specifying NULL as the desired environment simply returns the fetched results as an xts time-series object, as seen earlier.
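As a quick sketch of this default behavior (no output shown here):

# The fetched xts object is auto-assigned under the ticker
# symbol's name, and only the symbol name is returned as a string
getSymbols('A')    # returns "A"
head(A)            # the downloaded data is now available as A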
Foreign exchange rates can be fetched just as easily:
> getFX("USD/EUR")
[1] "USDEUR"
> tail(USDEUR)
           USD.EUR
2014-05-13  0.7267
2014-05-14  0.7281
2014-05-15  0.7293
2014-05-16  0.7299
2014-05-17  0.7295
2014-05-18  0.7303
The string returned by getSymbols refers to the R variable in which the data was saved inside .GlobalEnv. To see all the available data sources, let's query the related S3 methods:
> methods(getSymbols)
 [1] getSymbols.csv    getSymbols.FRED   getSymbols.google
 [4] getSymbols.mysql  getSymbols.MySQL  getSymbols.oanda
 [7] getSymbols.rda    getSymbols.RData  getSymbols.SQLite
[10] getSymbols.yahoo
So besides some offline data sources, we can query Google, Yahoo!, and OANDA for recent financial information. To see the full list of available symbols, the already loaded TTR package might help:
> str(stockSymbols())
Fetching AMEX symbols...
Fetching NASDAQ symbols...
Fetching NYSE symbols...
'data.frame':   6557 obs. of  8 variables:
 $ Symbol   : chr  "AAMC" "AA-P" "AAU" "ACU" ...
 $ Name     : chr  "Altisource Asset Management Corp" ...
 $ LastSale : num  841 88.8 1.3 16.4 15.9 ...
 $ MarketCap: num  1.88e+09 0.00 8.39e+07 5.28e+07 2.45e+07 ...
 $ IPOyear  : int  NA NA NA 1988 NA NA NA NA NA NA ...
 $ Sector   : chr  "Finance" "Capital Goods" ...
 $ Industry : chr  "Real Estate" "Metal Fabrications" ...
 $ Exchange : chr  "AMEX" "AMEX" "AMEX" "AMEX" ...
Note
Find more information on how to handle and analyze similar datasets in Chapter 12, Analyzing Time-series.
Fetching time series with Quandl
Quandl provides access to millions of similar time series in a standard format, via a custom API, from around 500 data sources. In R, the Quandl package provides easy access to all this open data from industries all around the world. As an example, let us look at the dividends paid by Agilent Technologies, as published by the U.S. Securities and Exchange Commission. To do so, simply search for "Agilent Technologies" on the http://www.quandl.com homepage, and pass the code of the desired dataset from the search results to the Quandl function:
> library(Quandl)
> Quandl('SEC/DIV_A')
        Date Dividend
1 2013-12-27    0.132
2 2013-09-27    0.120
3 2013-06-28    0.120
4 2013-03-28    0.120
5 2012-12-27    0.100
6 2012-09-28    0.100
7 2012-06-29    0.100
8 2012-03-30    0.100
9 2006-11-01    2.057
Warning message:
In Quandl("SEC/DIV_A") :
  It would appear you aren't using an authentication token. Please visit
  http://www.quandl.com/help/r or your usage may be limited.
As you can see, the API is rather limited without a valid authentication token, which can be obtained on the Quandl homepage for free. To set your token, simply pass it to the Quandl.auth function.
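A minimal sketch, where 'YOUR_AUTH_TOKEN' is a placeholder for the token registered on the Quandl homepage:

# Set the authentication token once per R session
Quandl.auth('YOUR_AUTH_TOKEN')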
This package also lets you:

- Fetch data filtered by time
- Perform some transformations of the data on the server side, such as cumulative sums and the first differential
- Sort the data
- Define the desired class of the returned object, such as ts, zoo, and xts
- Download some meta-information on the data source
The latter is saved as attributes of the returned R object. So, for example, to see the frequency of the queried dataset, call:
> attr(Quandl('SEC/DIV_A', meta = TRUE), 'meta')$frequency
[1] "quarterly"
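To illustrate some of the other options listed above, here is a sketch; the argument names reflect the package version used in this chapter and may differ in newer releases, so please double-check ?Quandl in your installed version:

# Fetch the dividends as an xts object, restricted to two years,
# sorted in ascending order, with server-side cumulative sums
Quandl('SEC/DIV_A',
    type = 'xts',                 # class of the returned object
    start_date = '2012-01-01',    # fetch data filtered by time
    end_date = '2013-12-31',
    sort = 'asc',                 # sort the data
    transformation = 'cumul')     # server-side transformation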
Google documents and analytics
You might, however, be more interested in loading your own or custom data from Google Docs, for which the RGoogleDocs package, available for download from the http://www.omegahat.org/ homepage, is a great help. It provides authenticated access to Google spreadsheets with both read and write access.
Unfortunately, this package is rather outdated and uses some deprecated API functions, so you might be better off trying some newer alternatives, such as the recently released googlesheets package, which can manage Google Spreadsheets (but not other documents) from R.
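A minimal sketch with googlesheets, where 'My sheet' is a placeholder title and the function names are those published in the package at the time of writing:

library(googlesheets)
gs_ls()                        # list your spreadsheets after authenticating
sheet <- gs_title('My sheet')  # register a spreadsheet by its title
df <- gs_read(sheet)           # read its contents into a data frame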
Similar packages are also available to interact with Google Analytics or Google AdWords for all those who would like to analyze page visits or ad performance in R.
Online search trends
On the other hand, we can also interact with APIs to download public data. Google provides access to some public data of the World Bank, IMF, US Census Bureau, and so on at http://www.google.com/publicdata/directory, and also to some of its own internal data in the form of search trends at http://google.com/trends.
The latter can be queried extremely easily with the GTrendsR package, which is not yet available on CRAN, but it at least lets us practice how to install R packages from other sources. The GTrendsR code repository can be found on BitBucket, from where it's really convenient to install it with the devtools package:
Tip
To make sure you install the same version of GTrendsR as used in the following examples, you can specify the branch, commit, or other reference in the ref argument of the install_bitbucket (or install_github) function. Please see the References section in the Appendix at the end of the book for the commit hash.
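For example, a sketch of such a pinned installation, where 'COMMIT_HASH' is a placeholder for the actual reference from the Appendix:

# Pin the installation to a specific branch, tag, or commit
install_bitbucket('GTrendsR', 'persican', ref = 'COMMIT_HASH')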
> library(devtools)
> install_bitbucket('GTrendsR', 'persican', quiet = TRUE)
Installing bitbucket repo(s) GTrendsR/master from persican
Downloading master.zip from https://bitbucket.org/persican/gtrendsr/get/master.zip
arguments 'minimized' and 'invisible' are for Windows only
So installing R packages from BitBucket or GitHub is as easy as providing the name of the code repository and the author's username, and letting devtools do the rest: downloading the sources and compiling them.

Windows users should install Rtools prior to compiling packages from source: http://cran.r-project.org/bin/windows/Rtools/. We also enabled quiet mode to suppress the compilation log and other boring details.
After the package has been installed, we can load it in the traditional way:
> library(GTrendsR)
First, we have to authenticate with a valid Google username and password before being able to query the Google Trends database. Our search term will be "how to install R":
> conn <- gconnect('some Google username', 'some Google password')
> df <- gtrends(conn, query = 'how to install R')
> tail(df$trend)
         start        end how.to.install.r
601 2015-07-05 2015-07-11               86
602 2015-07-12 2015-07-18               70
603 2015-07-19 2015-07-25              100
604 2015-07-26 2015-08-01               75
605 2015-08-02 2015-08-08               73
606 2015-08-09 2015-08-15               94
The returned dataset includes weekly metrics on the relative volume of search queries about installing R. The data shows that the highest activity was recorded in the middle of July, while the volume at the beginning of the following month was only around 75 percent of that peak. So Google does not publish raw search query statistics, but relative values, with which comparative studies can be done across different search terms and time periods.
Historical weather data
There are also various packages providing access to data sources of interest to R users working in the Earth sciences. For example, the RNCEP package can download historical weather data from the National Centers for Environmental Prediction, covering more than one hundred years at a six-hourly resolution. The weatherData package provides direct access to http://wunderground.com. For a quick example, let us download the daily average temperatures for the past seven days in London:
> library(weatherData)
> getWeatherForDate('London', start_date = Sys.Date()-7, end_date = Sys.Date())
Retrieving from: http://www.wunderground.com/history/airport/London/2014/5/12/CustomHistory.html?dayend=19&monthend=5&yearend=2014&req_city=NA&req_state=NA&req_statename=NA&format=1
Checking Summarized Data Availability For London
Found 8 records for 2014-05-12 to 2014-05-19
Data is Available for the interval.
Will be fetching these Columns:
[1] "Date"              "Max_TemperatureC"  "Mean_TemperatureC"
[4] "Min_TemperatureC"
        Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC
1 2014-05-12               18                13                9
2 2014-05-13               16                12                8
3 2014-05-14               19                13                6
4 2014-05-15               21                14                8
5 2014-05-16               23                16                9
6 2014-05-17               23                17               11
7 2014-05-18               23                18               12
8 2014-05-19               24                19               13
Please note that an unimportant part of the preceding output was suppressed, but what happened here is quite straightforward: the package fetched the specified URL (which, by the way, points to a CSV file), then parsed it along with some additional information. Setting opt_detailed to TRUE would also return intraday data at a 30-minute resolution.
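For instance, a sketch of the detailed query (expect a much larger data frame in return):

# Request intraday records at a roughly 30-minute resolution
# instead of the daily summary
getWeatherForDate('London', start_date = Sys.Date() - 7,
    end_date = Sys.Date(), opt_detailed = TRUE)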
Other online data sources
Of course, this short chapter cannot provide an overview of all the available online data sources and their R implementations, but please consult the Web Technologies and Services CRAN Task View, R-bloggers, StackOverflow, and the resources in the References chapter at the end of the book to look for existing R packages or helper functions before creating your own crawler R scripts.