.Rd files that were previously maintained by hand (the three map/raw
datasets chicago_map, london_boroughs_map, fl_deaths,
filter_clusters(), and the eight print/summary methods) have been
consolidated into the roxygen blocks in R/data.R, R/filter_clusters.R,
and R/print.R, so devtools::document() no longer skips them. Content
from the hand-written pages (the st_simplify note and merge tip for
london_boroughs_map, the Crown-copyright attribution, and the fuller
fl_deaths examples) was preserved. No user-visible change to the
rendered documentation.
The three Monte Carlo routines (mc_treespatial_cpp, mc_spatial_cpp,
mc_treescan_cpp) now use a single native C++ implementation for every
n_cores >= 1. Previously n_cores = 1 took a separate code path that
used R's rmultinom() over NumericMatrix objects, while n_cores > 1
used a native std::mt19937 sampler over flat arrays.
Results are independent of n_cores. Each simulation draws from a
deterministic per-simulation seed (from R's RNG when seed is set), so
the simulated null distribution and the resulting p-value are identical
for any thread count given a fixed seed. n_cores changes only
wall-clock time.
p-values at n_cores = 1 are no longer
bit-identical to the pre-0.1.50 serial path (which used R's rmultinom()).
Observed statistics, most-likely clusters, and secondary-cluster
extraction are unaffected. Fix your seed to reproduce results.
aggregate_up() and max_llr_all_pairs().
example_chicago.R now uses the compositional population denominator
(total incidents per area) rather than pop_residential. With the
residential denominator the most likely cluster is a broad-spectrum
spatial hotspot reported at the tree root; the compositional denominator
asks which (crime category, area) combinations are over-represented and
returns a specific branch, which is the tree-spatial use the method is
designed for.
Small adjustments to the DESCRIPTION file
Small adjustments to the vignettes
The package now ships two vignettes:
vignette("introduction", package = "treeSS") — Rio de Janeiro
end-to-end, reproducing Section 5.2 of Cançado et al. (2025). This
was the previous introduction vignette, trimmed to RJ only.
vignette("florida", package = "treeSS") (new) — a pedagogical
walk-through of building the tree-spatial scan inputs from raw data
using the bundled fl_deaths dataset: building the ICD-10 tree
from the codes that actually appear in the data, downloading
county polygons + centroids from tigris, and assembling the
parallel-vector input contract that treespatial_scan() expects.
The Chicago and London datasets, previously discussed inline in the
introduction vignette, are now reserved for the companion software
paper.
The three bundled plotting examples for sequential_scan()
(example_brazil_rj.R, example_chicago.R, example_florida.R)
previously did a left join from the full map polygon set onto the
cluster table. When the shapefile contained polygons not present in
the analysis dataset (3 RJ municipalities missing from the
DATASUS/IBGE 89-municipality subset, for instance), those polygons
emerged with panel = NA, which facet_wrap rendered as an extra
empty panel labelled "NA".
The examples now cross-join the polygon set with the panel labels
first and then left-join the cluster information by (id, panel),
so every map polygon is drawn in every iteration panel — those that
fall outside the analysis dataset get the na.value colour (a
light grey), exactly as intended. No extra "NA" panel is produced.
The London example uses leaflet rather than facet_wrap and was
not affected.
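The join change can be illustrated with a small base-R sketch. The data frames, column names, and values below are made up for illustration; the bundled examples operate on map polygon objects and their own cluster tables, so this only mirrors the shape of the fix.

```r
# Hypothetical toy data: four map polygons, of which only 1-3 are in the analysis.
polygons <- data.frame(id = 1:4)
clusters <- data.frame(
  id      = c(1L, 2L, 2L, 3L),
  panel   = c("Iteration 1", "Iteration 1", "Iteration 2", "Iteration 2"),
  cluster = c("MLC", "MLC", "MLC", "MLC"),
  stringsAsFactors = FALSE
)

# Old approach: left join from the polygons onto the cluster table.
# Polygon 4 has no match, so its panel is NA and would get its own facet.
old_tab <- merge(polygons, clusters, by = "id", all.x = TRUE)

# New approach: cross-join the polygons with the panel labels first ...
grid <- expand.grid(
  id    = polygons$id,
  panel = unique(clusters$panel),
  stringsAsFactors = FALSE
)
# ... then left-join the cluster information by (id, panel).
# Every polygon now appears in every panel; unmatched rows only have
# cluster = NA, which a fill scale can map to its na.value colour.
new_tab <- merge(grid, clusters, by = c("id", "panel"), all.x = TRUE)
```

In the toy data, old_tab carries an NA panel for polygon 4, while new_tab has one row per (polygon, panel) combination and no NA panel.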
multicluster_scan() (added in 0.1.45 as an adaptation of Li, Wang,
Yang, Li and Lai 2011 to the tree-spatial setting) has been removed.
The function is gone, along with its C++ backend
(mc_multicluster_treespatial_cpp, mc_multicluster_spatial_cpp),
the get_cluster_regions.multicluster_scan S3 method, the
corresponding print / summary methods, all examples, and the
vignette subsection.
Rationale:
On real datasets with a concentrated signal (e.g. infant mortality
in Rio de Janeiro: 622 tree nodes, 5358 zones), the top-K
candidate pool was dominated by overlapping variants of a single
geographic neighbourhood, so the fast top-K disjoint-pair search
could not find a valid pair. The full-pool rescue path was too
slow to be practical (timing out on nsim = 999 with 4 cores).
The factorisation of the joint LLR used by Li et al. (2011) is exact under the Poisson model for circular scans; its extension to the tree-spatial setting was not formally established.
filter_clusters() (Cançado et al. 2025) and sequential_scan()
(Zhang, Assunção and Kulldorff 2010) together already cover the
practical secondary-cluster use cases with published, well-studied
statistical properties.
Users who want joint-cluster detection in the circular case can use the original implementation from Li et al. (2011) outside this package.
The package now offers two clearly-bounded approaches:
filter_clusters() — paper-faithful non-overlap criterion of
Cançado et al. (2025), Sec. 5.1.1, applied to the single-pass
candidate pool.
sequential_scan() — sequential adjustment of Zhang, Assunção and
Kulldorff (2010): detect MLC, remove its regions (with optional
buffer of nearest neighbours), re-run the scan on the reduced data
with a fresh Monte Carlo simulation; iterate until the current MLC
is no longer significant. Each iteration's p-value is correct under
the conditional argument in the paper, so no multiple-testing
correction is required.
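The sequential procedure can be sketched in a few lines of base R. This is a toy under loud assumptions, not the package's code: every single region is its own candidate zone, there is no tree and no buffer, and the names poisson_llr(), toy_scan() and sequential_sketch() are hypothetical. It only shows the detect, remove, re-simulate loop.

```r
# Poisson log-likelihood ratio for a zone, conditional on the total count.
poisson_llr <- function(c_in, e_in, total) {
  c_out <- total - c_in
  e_out <- total - e_in
  term <- function(obs, expd) ifelse(obs > 0, obs * log(obs / expd), 0)
  # Only zones with more cases than expected count as candidate clusters.
  ifelse(c_in > e_in, term(c_in, e_in) + term(c_out, e_out), 0)
}

# Toy scan: each region is a candidate zone; Monte Carlo p-value for the MLC.
toy_scan <- function(cases, pop, nsim) {
  total <- sum(cases)
  expected <- total * pop / sum(pop)
  obs_llr <- poisson_llr(cases, expected, total)
  best <- which.max(obs_llr)
  # Null replicas: redistribute the total over regions in proportion to pop.
  sims <- rmultinom(nsim, size = total, prob = pop)
  null_max <- apply(sims, 2, function(x) max(poisson_llr(x, expected, total)))
  p_value <- (1 + sum(null_max >= obs_llr[best])) / (nsim + 1)
  list(region = best, llr = obs_llr[best], p_value = p_value)
}

# Sequential loop: detect the MLC, drop its region, re-run with a fresh
# Monte Carlo simulation, and stop at the first non-significant MLC.
sequential_sketch <- function(cases, pop, nsim = 199, alpha = 0.05, max_iter = 5) {
  ids <- seq_along(cases)
  found <- integer(0)
  for (i in seq_len(max_iter)) {
    if (length(ids) < 2) break
    res <- toy_scan(cases[ids], pop[ids], nsim)
    if (res$p_value > alpha) break
    found <- c(found, ids[res$region])
    ids <- ids[-res$region]
  }
  found
}

set.seed(1)
pop <- rep(1000, 20)
cases <- rpois(20, 10)
cases[c(3, 12)] <- c(60, 45)   # two planted hotspots
found <- sequential_sketch(cases, pop)
```

With the two planted hotspots, the first two regions returned are the planted ones, strongest first; later iterations depend on the simulated background.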
Replaced the ad-hoc Holm-Bonferroni iterative_scan() with two methods
drawn directly from the published literature on multi-cluster spatial
scan statistics, adapted to the tree-spatial setting. The package now
offers three approaches to secondary-cluster detection, with the
choice driven by which type of shadowing the user wants to remove:
filter_clusters() (unchanged) -- the original non-overlap
criterion of Cançado et al. (2025) Sec. 5.1.1, applied to the
single-pass candidate pool.
sequential_scan() (new) -- the sequential adjustment of Zhang,
Assunção and Kulldorff (2010), adapted to tree-spatial / circular /
tree-only inputs. Detects the MLC, removes its regions (and an
optional buffer_size of nearest neighbours) from the dataset, and
re-runs the scan on the reduced data with a fresh Monte Carlo
simulation. Iterates until the MLC of the current reduced data is
no longer significant or max_iter is reached. Each iteration's
p-value is correct under the conditional argument of Section 3 of
the paper -- no post-hoc multiple-testing correction is applied or
required.
multicluster_scan() (new) -- the two-cluster joint statistic of
Li, Wang, Yang, Li and Lai (2011), adapted to tree-spatial and
circular scans. Builds the alternative as a joint presence of two
region-disjoint clusters; the joint LLR factorises into the sum of
the two single-cluster LLRs under Poisson, so the observed maximum
is found by sweeping the candidate pool. The Monte Carlo for the
joint statistic runs in C++ (new exports mc_multicluster_treespatial_cpp
and mc_multicluster_spatial_cpp) with the same OpenMP backend as
the other scans, so performance is on par with treespatial_scan().
The decision rule of Table 2 of the paper is applied: 0, 1, or 2
significant clusters are reported based on the joint p-value and a
re-evaluation of the weaker cluster on the reduced dataset.
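The "sweep the candidate pool" step for the joint statistic can be pictured with a brute-force sketch. The function best_disjoint_pair() and the toy inputs are hypothetical and do not reproduce the package's C++ search; the sketch only assumes that the joint LLR of two region-disjoint candidates is the sum of their single-cluster LLRs, as stated above.

```r
# Exhaustive search for the region-disjoint pair with the largest summed LLR.
# llr:     numeric vector of single-cluster LLRs, one per candidate
# regions: list of region-id vectors, one per candidate
best_disjoint_pair <- function(llr, regions) {
  best <- list(i = NA_integer_, j = NA_integer_, joint_llr = -Inf)
  n <- length(llr)
  if (n < 2) return(best)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      disjoint <- length(intersect(regions[[i]], regions[[j]])) == 0
      if (disjoint && llr[i] + llr[j] > best$joint_llr) {
        best <- list(i = i, j = j, joint_llr = llr[i] + llr[j])
      }
    }
  }
  best
}

# Toy pool: the two strongest candidates overlap, so they cannot be paired.
llr <- c(12.0, 11.5, 7.0, 3.0)
regions <- list(c(1, 2, 3), c(2, 3, 4), c(7, 8), c(9))
pair <- best_disjoint_pair(llr, regions)
```

In this toy pool the strongest candidate is paired with the third one, because the second overlaps it. A pool dominated by overlapping variants of one neighbourhood leaves few or no valid pairs.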
iterative_scan() and its print/summary/get_cluster_regions
methods have been removed. The Holm-Bonferroni "scan + zero cases +
re-scan" procedure is not part of the published methods we wanted to
offer; the sequential and multi-cluster scans above cover the
intended use cases and are grounded in the literature.
Internal helper .matrix_to_vectors() (previously used only by
iterative_scan) has been removed.
print.sequential_scan(), summary.sequential_scan()
print.multicluster_scan(), summary.multicluster_scan()
get_cluster_regions.sequential_scan(),
get_cluster_regions.multicluster_scan()
filter_clusters(), treespatial_scan(), and circular_scan()
cross-reference the new methods in @seealso.
inst/examples/ (Brazil/RJ,
Chicago, Florida, London) use sequential_scan() in place of the
removed iterative_scan() block.
tests/testthat/test-sequential-scan.R covering structure,
the max_iter stopping rule, the buffer mechanism, behaviour
under H0, and printing.
tests/testthat/test-multicluster-scan.R covering structure,
the stronger-versus-weaker ordering, region disjointness of the
returned pair, the significance decision rule, and printing.
tests/testthat/test-get-cluster-regions.R and
tests/testthat/test-binomial.R updated to drop their references
to iterative_scan().
Address the four items requested in the first-round CRAN review.
Single-quote software/API names per the CRAN cookbook: OpenMP is
now written as 'OpenMP' in the package description.
Reference:
https://contributor.r-project.org/cran-cookbook/description_issues.html#formatting-software-names
Add DOI links to the two references that were previously cited
without a link, using the CRAN-mandated
authors (year) <doi:...> form (no space after doi:, no space
inside the angle brackets):
Add \value tags (and the corresponding @return
roxygen blocks) to the seven print()/summary() method Rd
files flagged by CRAN. Each documents that the method invisibly
returns its input object unchanged and is called for its
printing side effect, with a description of the fields written
to the console (and, for summary() methods, the additional
fields beyond those of the matching print() method):
print.circular_scan.Rd
print.iterative_scan.Rd
print.tree_scan.Rd
print.treespatial_scan.Rd
summary.circular_scan.Rd
summary.tree_scan.Rd
summary.treespatial_scan.Rd
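As an illustration of the pattern these pages now document, a print method with a roxygen @return block might look like the sketch below. The class toy_scan and its fields are hypothetical and stand in for the package's real result classes; the wording of the tag is an assumption, not the text shipped in the package.

```r
#' Print a toy scan result
#'
#' @param x An object of class `toy_scan` (hypothetical, for illustration).
#' @param ... Ignored.
#' @return The input object `x`, invisibly and unchanged. The method is
#'   called for its side effect of printing the log-likelihood ratio and
#'   the p-value to the console.
#' @export
print.toy_scan <- function(x, ...) {
  cat("Toy scan result\n")
  cat("  LLR:    ", format(x$llr), "\n")
  cat("  p-value:", format(x$p_value), "\n")
  invisible(x)
}

res <- structure(list(llr = 12.3, p_value = 0.001), class = "toy_scan")
out <- withVisible(print(res))  # captures the value and its visibility
```

Returning invisible(x) keeps the object usable in pipelines without printing it a second time at the console.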
Reference:
https://contributor.r-project.org/cran-cookbook/docs_issues.html#missing-value-tags-in-.rd-files
generate_example_data() no longer sets a hardcoded seed within
the function: the default of the seed argument is now NULL
(previously 123L). When the user does not pass a seed, the
function draws from the user's session-level RNG state without
modifying it; when the user passes an explicit integer, the
existing save-and-restore logic (introduced in 0.1.43) still
applies. The \usage{} block and the \item{seed}{...}
description of the corresponding Rd file have been updated to
match. The roxygen example
(ex <- generate_example_data(seed = 42)) is unchanged: it
passes an explicit seed and so remains reproducible.
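The two seed behaviors can be sketched in base R as follows. The function draw_with_seed() is hypothetical and is not the package's internal code; it is one way to get the described behavior, assuming the session RNG state lives in .Random.seed in the global environment.

```r
draw_with_seed <- function(n, seed = NULL) {
  if (!is.null(seed)) {
    # Remember the caller's RNG state (absent in a fresh session).
    had_state <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
    old_state <- if (had_state) get(".Random.seed", envir = globalenv()) else NULL
    # Restore on exit, whether the function returns normally or via an error.
    on.exit({
      if (had_state) {
        assign(".Random.seed", old_state, envir = globalenv())
      } else {
        rm(".Random.seed", envir = globalenv())
      }
    }, add = TRUE)
    set.seed(seed)
  }
  # With seed = NULL this simply continues the session's RNG stream.
  runif(n)
}

set.seed(2026)
expected_next <- runif(1)   # what the session would draw next

set.seed(2026)
seeded <- draw_with_seed(3, seed = 42)
actual_next <- runif(1)     # unaffected by the seeded call
```

Here actual_next matches expected_next, and repeating the call with seed = 42 reproduces the same draws.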
Reference:
https://contributor.r-project.org/cran-cookbook/code_issues.html#setting-a-specific-seed
Testing a clean R CMD check --as-cran.
Added \source{} blocks to all three tree datasets, pointing at
the corresponding leaf-level dataset and at the data-raw/
build script in the GitHub repo.
get_cluster_regions(). Added @examples block.
@examples block to the roxygen comments.
Functions with a seed = ... argument no longer
silently overwrite the user's session-level RNG state. Previously,
calling treespatial_scan(..., seed = 42) after a set.seed(2026)
in the user's session would leave the RNG in a state determined by
the internal Monte Carlo loop, so any subsequent runif(),
sample(), etc. was no longer reproducible from the user's
set.seed(2026). Now the user's pre-existing RNG state is saved on
entry and restored on exit (whether the function returns normally
or via an error), so the seed argument affects only the result
of the call. Implementation is in two new internal helpers
.seed_save_and_set() and .seed_restore() in R/utils.R.
print.iterative_scan() now accepts max_show for API
consistency with the other three print methods. The default
behavior is unchanged (the table is printed without the
region_ids and leaf_ids columns to keep it compact); pass
max_show = -1L to include both columns.
cran-comments.md file.
remotes::install_github("allanvc/treeSS").
summary() methods for circular_scan, tree_scan, and
treespatial_scan now have proper roxygen descriptions and
explicitly document that the max_show argument added in 0.1.39
is forwarded to the corresponding print() method via
\code{...}. Each summary doc points to the matching print doc
for the full details.
The print methods now truncate long Leaf IDs and Regions lists
by default, in the style of tibble. The motivation is the Chicago
example: the most likely cluster turns out to be
the root of the FBI crime taxonomy (1900+ leaves), which under
the previous policy printed every single leaf, producing more than
10 pages of console output in the rendered PDF.
New argument max_show on print.treespatial_scan(),
print.tree_scan() and print.circular_scan(). Default is
10L. When a vector field exceeds this length, only the first
max_show values are shown and a tail of ... and N more is
appended. Pass max_show = -1L (or any value at least as large
as the field) to recover the previous full-output behavior.
The internal .cat_wrapped() helper gained the same max_show
argument (default 10L) and propagates it through the print
methods.
No changes to the underlying scan results: only the console / PDF
rendering of the result objects is affected. The full leaf and
region IDs are always available on result$most_likely_cluster$leaf_ids and
result$most_likely_cluster$region_ids for
programmatic use.
The choice of default mirrors tibble's behavior: enough to give
the reader a sense of the cluster contents, but not so much that
a single print() call dominates the document.
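The truncation rule can be sketched as a small helper. The function show_truncated() is hypothetical and is not the package's .cat_wrapped(); in particular, the exact separator and the wording of the tail are assumptions based on the description above.

```r
# Collapse a vector for printing, keeping at most max_show values.
# A negative max_show (e.g. -1L) disables truncation.
show_truncated <- function(x, max_show = 10L) {
  n <- length(x)
  if (max_show < 0L || n <= max_show) {
    return(paste(x, collapse = ", "))
  }
  shown <- paste(x[seq_len(max_show)], collapse = ", ")
  paste0(shown, ", ... and ", n - max_show, " more")
}

short <- show_truncated(1:12)                  # default: first 10 values plus a tail
full  <- show_truncated(1:12, max_show = -1L)  # previous full-output behavior
```

Only the printed string is shortened; the underlying vector is left intact, which matches the note that the full IDs stay available on the result object.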
treespatial_scan() for combined spatial and hierarchical
cluster detection.
circular_scan() for Kulldorff's circular spatial scan
statistic.
tree_scan() for the tree-based scan statistic.
build_zones(), aggregate_tree(),
filter_clusters().
print() and summary() methods for all scan result classes.