
Other popular online data formats
Structured data is often available on the Web in XML or JSON formats. These two formats are popular because both are human-readable, easy to handle programmatically, and can represent any kind of hierarchical data structure, not just the simple tabular design of CSV files.
Note
JSON stands for JavaScript Object Notation, and it has recently become one of the most widely used standards for human-readable data exchange. JSON is considered a low-overhead alternative to XML, built on attribute-value pairs, and it supports a variety of value types such as numbers, strings, Booleans, ordered lists, and associative arrays. JSON is heavily used in Web applications, services, and APIs.
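As a quick illustration of these value types, the following snippet parses a toy JSON string (made up for this example, not related to the data used below) into an R list via the rjson package that we will rely on in a minute:
> library(rjson)
> str(fromJSON('{"id": 1, "tags": ["a", "b"], "active": true}'))
List of 3
 $ id    : num 1
 $ tags  : chr [1:2] "a" "b"
 $ active: logi TRUE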
Of course, R also supports loading (and saving) data in JSON. Let's demonstrate that by fetching some data from the previous example via the Socrata API (more on that later in the R packages to interact with data source APIs section of this chapter), provided by the Consumer Financial Protection Bureau. The full documentation of the API is available at http://www.consumerfinance.gov/complaintdatabase/technical-documentation.
The endpoint of the API, where we can query the background database without authentication, is http://data.consumerfinance.gov/api/views. To get an overall picture of the structure of the data, open this URL in a browser and skim the returned JSON list.

As JSON is extremely easy to read, it's often very helpful to skim through the structure manually before parsing. Now let's load that tree list into R with the rjson package:
> library(rjson)
> u <- 'http://data.consumerfinance.gov/api/views'
> fromJSON(file = u)
[[1]]
[[1]]$id
[1] "25ei-6bcr"

[[1]]$name
[1] "Credit Card Complaints"

[[1]]$averageRating
[1] 0
…
Well, it does not seem to be the same data we saw before in the comma-separated values file! After a closer look at the documentation, it's clear that this endpoint of the API returns metadata on the available views instead of the raw tabular data found in the CSV file. So let's look at the view with the ID of 25ei-6bcr, fetching the first five rows by opening the related URL in a browser.

The structure of the resulting JSON list has changed for sure. Now let's read that hierarchical list into R:
> res <- fromJSON(file = paste0(u, '/25ei-6bcr/rows.json?max_rows=5'))
> names(res)
[1] "meta" "data"
We managed to fetch the data along with some further meta-information on the view, columns, and so on, which is not something that we are interested in at the moment. As fromJSON returned a list object, we can simply drop the metadata and work with the data rows from now on:
> res <- res$data
> class(res)
[1] "list"
This is still a list, which we usually want to transform into a data.frame instead. So we have a list with five elements, each holding 19 nested children. Please note that one of those children, the 13th sub-element, is a list again, holding five vectors in each of the five rows. This means that transforming the tree list into a tabular format is not straightforward, even less so when we realize that one of those vectors holds multiple values in an unprocessed JSON format. So, for the sake of simplicity and as a proof-of-concept demo, let's simply ditch the location-related values now and transform all the other values to data.frame:
> df <- as.data.frame(t(sapply(res, function(x) unlist(x[-13]))))
> str(df)
'data.frame': 5 obs. of 18 variables:
 $ V1 : Factor w/ 5 levels "16756","16760",..: 3 5 ...
 $ V2 : Factor w/ 5 levels "F10882C0-23FC-4064-979C-07290645E64B" ...
 $ V3 : Factor w/ 5 levels "16756","16760",..: 3 5 ...
 $ V4 : Factor w/ 1 level "1364270708": 1 1 ...
 $ V5 : Factor w/ 1 level "403250": 1 1 ...
 $ V6 : Factor w/ 5 levels "1364274327","1364274358",..: 5 4 ...
 $ V7 : Factor w/ 1 level "546411": 1 1 ...
 $ V8 : Factor w/ 1 level "{\n}": 1 1 ...
 $ V9 : Factor w/ 5 levels "2083","2216",..: 1 2 ...
 $ V10: Factor w/ 1 level "Credit card": 1 1 ...
 $ V11: Factor w/ 2 levels "Referral","Web": 1 1 ...
 $ V12: Factor w/ 1 level "2011-12-01T00:00:00": 1 1 ...
 $ V13: Factor w/ 5 levels "Application processing delay",..: 5 1 ...
 $ V14: Factor w/ 3 levels "2011-12-01T00:00:00",..: 1 1 ...
 $ V15: Factor w/ 5 levels "Amex","Bank of America",..: 2 5 ...
 $ V16: Factor w/ 1 level "Closed without relief": 1 1 ...
 $ V17: Factor w/ 1 level "Yes": 1 1 ...
 $ V18: Factor w/ 2 levels "No","Yes": 1 1 ...
So we applied a simple function that drops the location information from each element of the list (by removing the 13th element of each x), automatically simplified the result to a matrix (by using sapply instead of lapply to iterate through each element of the list), transposed it (via t), and then coerced the resulting object to data.frame.
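To see why sapply fits here, consider a toy example (unrelated to the complaints data): when every element of the result has the same length, sapply simplifies the output to a vector or matrix, while lapply always returns a list:
> l <- list(a = 1:2, b = 3:4)
> lapply(l, rev)   # always a list
$a
[1] 2 1

$b
[1] 4 3

> sapply(l, rev)   # simplified to a matrix with one column per element
     a b
[1,] 2 4
[2,] 1 3
This is exactly what happened above: each row of res was unlisted into 18 values, so sapply could simplify the result to a matrix, which we then transposed.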
Well, we can also use some helper functions instead of manually tweaking all the list elements as we did earlier. The plyr package (please find more details in Chapter 3, Filtering and Summarizing Data, and Chapter 4, Restructuring Data) includes some extremely useful functions to split and combine data:
> library(plyr)
> df <- ldply(res, function(x) unlist(x[-13]))
It looks a lot more familiar now, although the variable names are missing, and all values were converted to character vectors or factors, even the dates, which were stored as UNIX timestamps. We can easily fix these problems with the help of the provided metadata (res$meta): for example, let's set the variable names by extracting (via the [ operator) the name field of all columns except for the dropped (13th) location data:
> names(df) <- sapply(res$meta$view$columns, `[`, 'name')[-13]
One might also identify the object classes with the help of the provided metadata. For example, the renderTypeName field would be a good place to start checking, and using as.numeric for the number fields and as.POSIXct for all calendar_date fields would resolve most of the preceding issues.
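The following is only a rough sketch of that idea, under the assumption that the metadata exposes renderTypeName values such as number and calendar_date, and that the dates arrive as ISO 8601 strings; please verify these details against the live metadata before relying on them:
> ## assumption: renderTypeName holds 'number' / 'calendar_date' labels
> types <- sapply(res$meta$view$columns, `[[`, 'renderTypeName')[-13]
> for (i in which(types == 'number'))
+     df[, i] <- as.numeric(as.character(df[, i]))
> for (i in which(types == 'calendar_date'))
+     df[, i] <- as.POSIXct(as.character(df[, i]),
+         format = '%Y-%m-%dT%H:%M:%S')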
Well, did you ever hear that around 80 percent of data analysis is spent on data preparation?
Parsing and restructuring JSON and XML into a data.frame can take a long time, especially when you are dealing with primarily hierarchical lists. The jsonlite package tries to overcome this issue by transforming R objects into a conventional JSON data structure and vice versa, instead of performing a raw conversion. From a practical point of view, this means that jsonlite::fromJSON will result in a data.frame instead of a raw list where possible, which makes the data interchange format even more seamless. Unfortunately, we cannot always transform lists into a tabular format; in such cases, the list transformations can be sped up by, for example, the rlist package. Please find more details on list manipulations in Chapter 14, Analyzing the R Community.
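As a minimal illustration of this behavior, note how jsonlite simplifies a toy JSON array of objects (made up for this example) straight into a data.frame:
> library(jsonlite)
> fromJSON('[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]')
  id name
1  1  foo
2  2  bar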
Note
Extensible Markup Language (XML) was originally developed by the World Wide Web Consortium in 1996 to store documents in a format that is both human-readable and machine-readable. This popular syntax is used, for example, in the Microsoft Office Open XML and Open/LibreOffice OpenDocument file formats, in RSS feeds, and in various configuration files. As the format is also widely used for data interchange over the Internet, data is often available in XML as the only option, especially with some older APIs.
Let us also see how we can handle this other popular online data interchange format besides JSON. The XML API can be used in a similar way, but we must define the desired output format in the endpoint URL: http://data.consumerfinance.gov/api/views.xml. Opening this URL in a browser shows the returned XML document.

It seems that the XML output of the API differs from what we have seen in the JSON format, as it simply includes the rows that we are interested in. This way, we can parse the XML document, extract the rows from the response, and then transform them to a data.frame:
> library(XML)
> doc <- xmlParse(paste0(u, '/25ei-6bcr/rows.xml?max_rows=5'))
> df <- xmlToDataFrame(nodes = getNodeSet(doc, "//response/row/row"))
> str(df)
'data.frame': 5 obs. of 11 variables:
 $ complaint_id        : Factor w/ 5 levels "2083","2216",..: 1 2 ...
 $ product             : Factor w/ 1 level "Credit card": 1 1 ...
 $ submitted_via       : Factor w/ 2 levels "Referral","Web": 1 1 ...
 $ date_recieved       : Factor w/ 1 level "2011-12-01T00:00:00" ...
 $ zip_code            : Factor w/ 1 level "": 1 1 ...
 $ issue               : Factor w/ 5 levels ...
 $ date_sent_to_company: Factor w/ 3 levels "2011-12-01T00:00:00" ...
 $ company             : Factor w/ 5 levels "Amex" ...
 $ company_response    : Factor w/ 1 level "Closed without relief" ...
 $ timely_response     : Factor w/ 1 level "Yes": 1 1 ...
 $ consumer_disputed   : Factor w/ 2 levels "No","Yes": 1 1 ...
Although we could manually set the desired classes of the variables via the colClasses argument passed to xmlToDataFrame, just like in read.table, we can also fix this issue afterwards with a quick helper function:
> is.number <- function(x)
+     all(!is.na(suppressWarnings(as.numeric(as.character(x)))))
> for (n in names(df))
+     if (is.number(df[, n]))
+         df[, n] <- as.numeric(as.character(df[, n]))
So we tried to guess whether a column includes only numbers, and converted those to numeric where our helper function returned TRUE. Please note that we first convert the factor to character before transforming it to a number, as a direct conversion from factor to numeric would return the factor order instead of the real values. One might also try to resolve this issue with the type.convert function, which is used by default in read.table.
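For example, a one-line sketch using type.convert, which with its default settings turns number-like strings into numeric vectors and everything else back into factor:
> df[] <- lapply(df, function(x) type.convert(as.character(x)))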
Note
To test similar APIs and JSON or XML resources, you may find it interesting to check out the APIs of Twitter, GitHub, or probably any other online service provider. On the other hand, there is also another open-source service, based on R, that can return XML, JSON, or CSV files from any R code. Please find more details at http://www.opencpu.org.
So now we can process structured data from various kinds of downloadable data formats. But, as there are still some other data source options to master, I promise you it's worth it to keep reading.