---
title: "Vignette 2: Creating Distance Objects"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette 2: Creating Distance Objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE, # hide code
  message = FALSE, # hide messages
  warning = FALSE, # hide warnings
  fig.width = 12,
  fig.height = 7,
  out.width = "100%"
)
```

```{r setup}
library(gsClusterDetect)
```

# Overview

When using this package to identify clusters in a set of date-location-count data, one of the key inputs is the geographical information about the units of clustering. Specifically, we need to provide the `find_clusters()` function with an object that contains information about pairwise distances (see `vignette("basic_demo")` for an overview of finding clusters). Here we describe the utilities the package provides to generate these objects.

# Types of Distance Objects:

There are two types of distance objects that can be passed to the `find_clusters()` function: square pairwise distance matrices, and a constrained version (a distance list) that keeps only the pairs of locations within a threshold distance of each other.

## Square pairwise distance matrices

If we have a set of locations among which we would like to identify clusters (say all the counties in Minnesota, or all the zip codes in Maryland), we can create a matrix containing all the pairwise distances between the centroids of these locations. The package provides functions to do this for census tracts, zip codes, counties, and states.

- `county_distance_matrix()`
- `zip_distance_matrix()`
- `tract_distance_matrix()`

Each of these functions must be called with a state-level parameter (`st`) to constrain the output to a particular state (to combine states, see the section on "Custom Distance Matrices").
There is also a `unit` parameter, which takes a string, `"miles"` (default), `"kilometers"`, or `"meters"`, to obtain the distance estimates in alternative units.

All of these functions return a two-element named list containing:

- `loc_vec`: a vector containing the names of the `N` locations in the geographic target (e.g. "Minnesota")
- `distance_matrix`: an `N x N` matrix containing the pairwise distances between all locations

For all of these functions, locations in the rows and columns of the `distance_matrix` have the same ordering as in the `loc_vec` vector.

### Example: county distance matrix:

```{r eval=TRUE, echo=TRUE}
minnesota_counties <- county_distance_matrix(st = "MN")

# show the length and first 10 values of the `loc_vec`
length(minnesota_counties$loc_vec)
minnesota_counties$loc_vec |> head(10)

# show the dimension and upper left section of the `distance_matrix`
dim(minnesota_counties$distance_matrix)
minnesota_counties$distance_matrix[1:5, 1:5]
```

### Example: zip code distance matrix:

```{r eval=TRUE, echo=TRUE}
maryland_zips <- zip_distance_matrix(st = "MD")

# show the length and first 10 values of the `loc_vec`
length(maryland_zips$loc_vec)
maryland_zips$loc_vec |> head(10)

# show the dimension and upper left section of the `distance_matrix`
dim(maryland_zips$distance_matrix)
maryland_zips$distance_matrix[1:5, 1:5]
```

### Example: tract distance matrix:

The package also provides a function to generate distance matrices between census tracts. However, unlike at the zip code and county levels, there is no built-in data set that contains the centroids of these tracts across the United States. Therefore, to use the `tract_distance_matrix()` function, the user will need to have the package `tigris` installed. In addition to the `st` and `unit` parameters, the user can restrict the tracts within a state by providing a vector of 3-character FIPS codes in the `county` parameter.
Furthermore, to prevent the `tigris` package from using a local cache (which does have the benefit of speed for repeated calls), the user can turn the cache off with `use_cache = FALSE`.

```{r echo=TRUE, eval=TRUE}
cook_county_tracts <- tract_distance_matrix(
  st = "IL",
  county = "031" # the full 5-digit code for Cook County, IL is 17031
)

# show the length and first 10 values of the `loc_vec`
length(cook_county_tracts$loc_vec)
cook_county_tracts$loc_vec |> head(10)

# show the dimension and upper left section of the `distance_matrix`
dim(cook_county_tracts$distance_matrix)
cook_county_tracts$distance_matrix[1:5, 1:5]
```

As described above, the default unit for these objects is "miles". However, for all of these functions, one can also obtain distances in other units by passing "kilometers" or "meters" to the `unit` parameter.

```{r echo=TRUE, eval=FALSE}
maryland_zip <- zip_distance_matrix(st = "MD", unit = "kilometers")
```

## Custom Distance Matrices

Notice that all of the above functions require the user to indicate a state parameter, `st`, thus restricting the output to a single state. We also provide a function, `custom_distance_matrix()`, that allows users to supply any type of geo-spatial unit.
The function has a different set of parameters than the other functions described above:

- `df`: a `data.frame` with one row per unit, a unique label, and columns containing the latitude and longitude of the centroid of each unit
- `unit`: as before, this allows the user to obtain distances in miles, kilometers, or meters
- `label_var`: the string name of the column containing the name/label of the unit
- `lat_var`: the name of the column containing the latitude of the centroid (required)
- `long_var`: the name of the column containing the longitude of the centroid (required)

Below, we demonstrate how this function could be used to get a pairwise distance matrix for a collection of contiguous states, in this case Maryland, Delaware, and Virginia.

```{r echo=TRUE, eval=TRUE}
# Use the built-in `counties` dataset to get a data frame of
# counties in the states of interest
states <- c("Delaware", "Maryland", "Virginia")
delmarva_counties <- counties[state_name %in% states]
head(delmarva_counties, 3)

# Use the custom function to get the distance matrix
delmarva_dm <- custom_distance_matrix(
  df = delmarva_counties,
  label_var = "fips",
  lat_var = "latitude",
  long_var = "longitude"
)
```

The output structure of the `custom_distance_matrix()` function is the same as that of the other distance matrix functions.

```{r eval = TRUE, echo=TRUE}
# show the length and first 10 values of the `loc_vec`
length(delmarva_dm$loc_vec)
delmarva_dm$loc_vec |> head(10)

# show the dimension and upper left section of the `distance_matrix`
dim(delmarva_dm$distance_matrix)
delmarva_dm$distance_matrix[1:5, 1:5]
```

## Distance Lists

Notice that the square matrices in all the above examples are of size `N x N`, where `N` is the number of unique locations in your target geography. These matrices can become very large, and as they become large, they also become slower to calculate.
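
To give a rough sense of this growth, the sketch below counts the entries in a dense `N x N` numeric matrix and estimates its size in memory. The value 621 is the Maryland zip code count that appears elsewhere in this vignette; the other values of `N` are arbitrary, and the size estimate assumes plain double-precision storage with no overhead.

```r
# Number of entries and approximate size of a dense N x N numeric matrix.
# 621 is the Maryland zip code count used in this vignette; the other
# values of N are arbitrary illustrations, not counts from the package data.
n_locations <- c(621, 5000, 30000)
n_entries <- n_locations^2

# R stores each double in 8 bytes; convert bytes to megabytes
approx_mb <- n_entries * 8 / 1024^2

data.frame(n_locations, n_entries, approx_mb = round(approx_mb, 1))
```

Because the number of entries grows with the square of `N`, a tenfold increase in the number of locations means roughly a hundredfold increase in the number of pairwise distances.
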
Furthermore, many of the pairs are never really used in subsequent analyses, because the distance between them is larger than any typical radius constraint that a user might want to place on their cluster-finding technique (again, see `vignette("basic_demo")` for more information on using the `find_clusters()` function and setting the maximum distance to consider when constructing clusters). For example, we typically recommend starting at 50 miles as the radius for county-level cluster finding, 15 miles for zip-level cluster finding, and 3 miles for tract-level cluster finding. Whatever the radius used, say `r`, none of the pairs in the square distance matrices above whose centroid-to-centroid distance exceeds `r` are needed for the calculation.

Therefore, we also provide a way to construct distance objects for `find_clusters()` that include only pairs where the centroid-to-centroid distance is within some threshold. These objects can be returned using the function `create_dist_list()`, which has the following parameters:

- `level`: the geographic level, one of "tract", "zip", "county", or "state"
- `threshold`: the threshold distance, expressed in units given by the `unit` argument, used to constrain paired locations; for each location `x`, the function returns a vector with only those locations `y1`, `y2`, `y3`, `...` where the distance between `x` and `y_i` is less than `threshold`
- `st`: a two-character state abbreviation, which is required for `level="tract"` but is otherwise optional. When `st` is not specified, the function returns distances of pairs within `threshold` for all zips or counties in the entire United States.
- `county`: as in the `tract_distance_matrix()` function, the user can restrict the estimation of tract-to-tract distances to a vector of 3-character FIPS codes
- `unit`: as in the other functions discussed above, distances can be returned in miles (default), kilometers, or meters.
The function assumes that `threshold` is always given in terms of `unit` (i.e. if `unit` is set to "kilometers", be sure to express `threshold` in kilometers).

The output structure of this function is a list of vectors. The list is of length `N`, where `N` is the number of unique locations in the target geography. Each element of the list is a named vector of distances to the location(s) (i.e. the `y_i`) that are within `threshold` units of the current location. An example will help illustrate:

```{r echo=TRUE, eval=TRUE}
maryland_zip_list <- create_dist_list(
  level = "zip",
  # using a small distance for zip code clustering for demo purposes
  threshold = 7,
  st = "MD"
)

# this returns a list
class(maryland_zip_list)

# the list is of length equal to the number of unique zip codes.
# Recall from above that we produced a 621 x 621 matrix; in this
# case, our list is of length 621
length(maryland_zip_list)

# the names of the list are the locations
names(maryland_zip_list) |> head(10)

# each element of the list is a named vector with distances to
# those locations within threshold units
maryland_zip_list |> tail(3)
```

This approach is much faster than the `_distance_matrix()` functions described above when the number of locations is large but only the nearby pairs are needed. A good example of this is obtaining the zip code level distance information for Texas.

``` r
system.time(texas_zip_dm <- zip_distance_matrix(st="TX"))
#>    user  system elapsed
#>    4.88    0.08    4.97

system.time(texas_zip_dl <- create_dist_list(level="zip", threshold=15, st="TX"))
#>    user  system elapsed
#>    0.76    0.03    0.80
```

## Custom Distance Lists

As above, we provide a way to generate distance lists for custom location units. In this case, we use the function `create_custom_dist_list()`. Again, we use the example of creating a distance list for the joint Maryland, Delaware, and Virginia region, but any user-defined data set can be used for this.
The requirements are basically the same as for the `custom_distance_matrix()` function, except that we additionally require a `threshold`. Another advantage of requiring a threshold for custom lists is that, if a user specifies non-contiguous states covering possibly large distances, a distance list limited by a reasonable threshold avoids calculating many long distances between faraway locations that would be omitted from clusters by the maximum cluster radius anyway.

```{r echo=TRUE, eval=TRUE}
# As before, we use the built-in `counties` dataset to get a data frame of counties
states <- c("Delaware", "Maryland", "Virginia")
delmarva_counties <- counties[state_name %in% states]
head(delmarva_counties, 3)

# Use the custom function to get the distance list
delmarva_dl <- create_custom_dist_list(
  df = delmarva_counties,
  label_var = "fips",
  lat_var = "latitude",
  long_var = "longitude",
  threshold = 50
)

# this is a list
class(delmarva_dl)

# with length equal to all the counties in the Delmarva region
length(delmarva_dl)

# first three elements (i.e. locations) in this list
delmarva_dl[1:3]
```

## Other Functions

1.  We also provide `us_distance_matrix()`, which is similar to the other `_distance_matrix()` functions except that it only takes the `unit` parameter. It returns a county-level distance matrix (and vector) for all the counties in the US. In general, unless all pairwise distances are truly desired, we strongly recommend that users who want national-level distance information pre-determine a threshold and use the `create_dist_list()` function, leaving the `st` parameter as `NULL`, as that approach will be substantially faster.
2.  We provide a function `state_distance_matrix()`, which acts identically to the other `_distance_matrix()` functions, except that it does not take an `st` parameter. It returns a distance matrix for all states in the US, and is included for completeness and convenience, even if its practical usage is limited.
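
To make the relationship between the two object types in this vignette concrete, here is a minimal base R sketch that thresholds a toy square distance matrix into a list of named vectors. The labels, distances, and the helper `matrix_to_dist_list()` are hypothetical and not part of the package. We also assume here that each location keeps its own zero distance, which may differ from what `create_dist_list()` returns. In practice, the package functions avoid building the full matrix in the first place, which this sketch does not attempt.

```r
# Toy square distance object with the same two-element shape described
# in this vignette. Labels and distances are made up for illustration.
loc_vec <- c("A", "B", "C", "D")
distance_matrix <- matrix(
  c( 0,  5, 12, 40,
     5,  0,  9, 33,
    12,  9,  0, 28,
    40, 33, 28,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(loc_vec, loc_vec)
)

# Hypothetical helper (not a gsClusterDetect function): for each row,
# keep only the entries strictly below the threshold.
matrix_to_dist_list <- function(dm, threshold) {
  out <- lapply(rownames(dm), function(x) {
    d <- dm[x, ]        # named vector of distances from location x
    d[d < threshold]    # names are preserved by logical subsetting
  })
  names(out) <- rownames(dm)
  out
}

toy_list <- matrix_to_dist_list(distance_matrix, threshold = 10)

# neighbours of "B" within the threshold (including "B" itself)
toy_list[["B"]]

# total number of stored distances, compared with the full matrix
sum(lengths(toy_list))
length(distance_matrix)
```

The list has one element per location, like the output described in the "Distance Lists" section, but stores only the pairs that a radius-constrained analysis would actually use.
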