--- title: "Introduction to treeSS: reproducing Cançado et al. (2025)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to treeSS: reproducing Cançado et al. (2025)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview The **treeSS** package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff's (1997) circular spatial scan by searching jointly over circular spatial zones $z$ and tree branches $g$, returning the (zone, branch) pair that maximises the likelihood ratio. The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a *location* and a *categorical hierarchy* (ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause–area combinations that pure spatial or pure tree scans would miss. The package provides: | Function | Purpose | |:---------|:--------| | `treespatial_scan()` | Tree-spatial scan (main entry point) | | `circular_scan()` | Kulldorff's circular spatial scan (tree-free) | | `tree_scan()` | Tree-based scan (space-free) | | `filter_clusters()` | Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 | | `sequential_scan()` | Sequential adjustment of Zhang, Assunção & Kulldorff (2010) | | `get_cluster_regions()` | Tidy region-by-cluster table for plotting | | `aggregate_tree()` | Roll counts up the tree | All scans accept the Poisson and Binomial models of the original paper and run the Monte Carlo step under OpenMP via the `n_cores` argument. The package ships four example datasets: | Dataset | Country | Domain | Regions | Tree | |:--------|:--------|:-------|:--------|:-----| | `rj_mortality` + `rj_tree` | Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) | | `chicago_crimes` + `chicago_tree` | USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) | | `fl_deaths` | USA | Public health | 65 counties | (built by user from raw data) | | `london_collisions` + `london_tree` | UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) | This vignette walks through `rj_mortality` end-to-end, reproducing the published analysis of Cançado et al. (2025). A companion vignette, `vignette("florida", package = "treeSS")`, is pedagogical: it builds the inputs from raw long-form data and the ICD-10 tree from the codes that actually appear in the data. The Chicago and London datasets are worked out in the companion software paper. --- ## Infant mortality in Rio de Janeiro We reproduce Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format — exactly the parallel-vector input contract that **treeSS** expects. ### Run the scan ```{r rj-data} library(treeSS) data(rj_mortality) data(rj_tree) c(rows = nrow(rj_mortality), municipalities = length(unique(rj_mortality$region_id)), tree_nodes = nrow(rj_tree), total_deaths = sum(rj_mortality$cases), total_births = sum(unique( rj_mortality[, c("region_id", "live_births")] )$live_births)) #> rows municipalities tree_nodes total_deaths total_births #> 1440 89 622 2981 219124 ``` The scan is one function call: ```{r rj-scan} result_rj <- treespatial_scan( cases = rj_mortality$cases, population = rj_mortality$live_births, region_id = rj_mortality$region_id, x = rj_mortality$x, y = rj_mortality$y, node_id = rj_mortality$node_id, tree = rj_tree, max_pop_pct = 0.50, nsim = 999, seed = 2016, n_cores = 4L ) print(result_rj) ``` The most likely cluster is the ICD-10 leaf code **P209** ("unspecified intrauterine hypoxia") concentrated in 18 municipalities of the north fluminense region. Twenty-seven deaths are observed against an expectation of 3.34 under uniform risk — a relative risk of $\sim 8$ — with a Monte Carlo $p$-value of $1/(\text{nsim} + 1) = 0.001$. This closely matches Table 7 of Cançado et al. (2025): the same 18 municipalities (by IBGE code), 27 observed deaths, expected $\approx 3.37$, log-likelihood ratio $LR = 38.48$ (we obtain $\approx 38.83$). The minor LR discrepancy is explained by a SIM database update between the paper's analysis and the version shipped here, and by the discontinuation of TCU population estimates after 2017; the conclusions are identical. ### Why the joint scan matters here To make the value of the joint search concrete, it is worth running the two simpler scans on exactly the same data. The purely *spatial* scan, ignoring the tree, returns a single 8-municipality cluster around greater Rio de Janeiro driven by the aggregate death count (Cançado et al. 2025, Table 5); the P209 cluster in the north of the state does not appear at all, because aggregating over all causes dilutes the cause-specific signal. We can verify this directly by aggregating to one row per municipality and running the circular scan on the totals: ```{r rj-spatial-only} rj_agg <- unique(rj_mortality[, c("region_id", "x", "y", "live_births")]) rj_agg$cases <- tapply(rj_mortality$cases, rj_mortality$region_id, sum)[as.character(rj_agg$region_id)] result_spatial <- circular_scan( cases = rj_agg$cases, population = rj_agg$live_births, region_id = rj_agg$region_id, x = rj_agg$x, y = rj_agg$y, max_pop_pct = 0.50, nsim = 999, seed = 2016 ) result_spatial$most_likely_cluster$region_ids ``` Conversely, the purely *tree-based* scan, ignoring geography, returns more than fifty significant cuts at the 5 % level on this dataset (Cançado et al. 2025, Table 6); the analyst is then left to apply a multiple-testing correction across tree branches and decide which to follow up. The joint scan returns a *single* most-likely $(z, g)$ pair with one Monte Carlo $p$-value — the family-wise error rate is already controlled by the joint maximisation. ### Multiple distinct clusters (paper-faithful) The paper's Section 5.1.1 specifies a filtering rule: a candidate (zone, branch) pair is retained only if its branch is disjoint from every previously retained branch (no ancestor/descendant relation), *or* its zone is disjoint from every previously retained zone (no region in common). This is implemented in `filter_clusters()`: ```{r rj-filter} filter_clusters(result_rj)[1:5, c("node_id", "n_regions", "cases", "expected", "llr")] ``` No additional Monte Carlo simulation is performed; the secondary clusters are read directly from `result_rj$secondary_clusters` and filtered for distinctness in $O(k^2)$ time. The top filtered candidates correspond to the rows of Table 7 of Cançado et al. (2025). ### Visualising the cluster `get_cluster_regions()` returns a tidy data.frame ready to merge with any polygon layer: ```{r rj-getregions} # Most likely cluster only (single map) cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE) # Top-2 distinct clusters (faceted map) cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE) ``` The package does not ship `sf` polygons for Rio de Janeiro — they are easy to fetch on demand from \CRANpkg{geobr}: ```{r rj-plot} library(ggplot2) library(geobr) library(sf) mun <- read_municipality(code_muni = "RJ", year = 2016) mun$code6 <- as.integer(substr(mun$code_muni, 1, 6)) cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE), unique(rj_mortality[, c("region_id", "ibge_code")]), by = "region_id") mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code") ggplot(mun_facet) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"), na.value = "gray95", na.translate = FALSE) + facet_wrap(~ panel, nrow = 1) + theme_minimal() + theme(legend.position = "none") ``` A complete annotated worked example, including the polygon download and faceted plots for both `filter_clusters()` and `sequential_scan()` results, is shipped at `system.file("examples", "example_brazil_rj.R", package = "treeSS")`. --- ## Secondary clusters when the MLC shadows weaker ones `filter_clusters()` extracts distinct clusters from a *single* run of the scan. It is fast (no additional Monte Carlo) and faithful to the paper. When the most likely cluster is overwhelmingly strong, however, it can *shadow* weaker secondary clusters: the excess cases inside the MLC inflate the expected counts everywhere else, weakening the likelihood ratios of any other genuine cluster on the same dataset. For that situation the package also offers `sequential_scan()`, adapted from Zhang, Assunção and Kulldorff (2010). It detects the MLC, removes the MLC's regions (optionally with a buffer of nearest neighbours) from the dataset, and re-runs the scan on the reduced data with a *fresh* Monte Carlo simulation. The procedure iterates until the current MLC is no longer significant at the user-specified $\alpha$ level. Because each iteration's $p$-value is computed against the simulated null distribution of the *current* (reduced) dataset, no multiple-testing correction is required. ```{r rj-seq} seq_rj <- sequential_scan( cases = rj_mortality$cases, population = rj_mortality$live_births, region_id = rj_mortality$region_id, x = rj_mortality$x, y = rj_mortality$y, node_id = rj_mortality$node_id, tree = rj_tree, max_iter = 5L, alpha = 0.05, nsim = 999, seed = 2016, max_pop_pct = 0.50, n_cores = 4L ) print(seq_rj) ``` The `clusters` data.frame has one row per iteration with the cluster identifier, log-likelihood ratio, fresh $p$-value, and a logical `significant` flag: ```{r rj-seq-table} seq_rj$clusters[, c("iteration", "node_id", "n_regions", "llr", "pvalue", "significant")] ``` Mapping the sequential result uses the same `get_cluster_regions()` generic. The complete plotting workflow is shipped in the same annotated example file. --- ## When to use which * Use `filter_clusters()` when you want to follow Cançado et al. (2025) exactly: one scan, then a non-overlap pass over the candidate pool. No extra computation. * Use `sequential_scan()` when the MLC may be hiding weaker secondary clusters that `filter_clusters()` cannot detect on the original (unreduced) data. The two answer different questions and are complementary; both are appropriate in different applied scenarios. --- ## Computational notes `treespatial_scan()` and `sequential_scan()` both accept `n_cores`, which distributes the Monte Carlo step across OpenMP threads in C++. Each thread carries its own `std::mt19937` generator seeded deterministically from the user-supplied `seed`, so the result is reproducible for any fixed pair `(seed, n_cores)`. The serial path (`n_cores = 1L`) uses R's `rmultinom` and is bit-identical to the pre-OpenMP implementation. For publication-critical work where simulated $p$-values must be bit-identical between machines, use `n_cores = 1L`. For exploratory work, `n_cores = 4` typically yields a 4–6× speed-up. --- ## Other examples shipped with the package | File | Purpose | |:-----|:--------| | `inst/examples/example_brazil_rj.R` | Full plotting workflow for the RJ analysis (this vignette) | | `inst/examples/example_florida.R` | Building the input from raw long-form data — see `vignette("florida")` | | `inst/examples/example_chicago.R` | Crime in Chicago, 2023 (discussed in the companion paper) | | `inst/examples/example_london.R` | Road collisions in London, 2022 (discussed in the companion paper) | --- ## References Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. *Environmental and Ecological Statistics*, 32, 953–978. [doi:10.1007/s10651-025-00670-w](https://doi.org/10.1007/s10651-025-00670-w) Kulldorff, M. (1997). A spatial scan statistic. *Communications in Statistics — Theory and Methods*, 26(6), 1481–1496. Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. *Biometrics*, 59(2), 323–331. Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. *Journal of Probability and Statistics*, 2010, 642379.