| Title: | Just Another Latent Space Network Clustering Algorithm |
|---|---|
| Description: | Fit latent space network cluster models using an expectation-maximization algorithm. Enables flexible modeling of unweighted or weighted network data (with or without noise edges), supporting both directed and undirected networks (with or without degree and strength heterogeneity). Designed to handle large networks efficiently, it allows users to explore network structure through latent space representations, identify clusters (i.e., community detection) within network data, and simulate networks with varying clustering, connectivity patterns, and noise edges. Methodology for the implementation is described in Arakkal and Sewell (2025) <doi:10.1016/j.csda.2025.108228>. |
| Authors: | Alan Arakkal [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-7001-493X>), Daniel Sewell [aut] (ORCID: <https://orcid.org/0000-0002-9238-4026>) |
| Maintainer: | Alan Arakkal <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 2.1.0 |
| Built: | 2026-05-07 08:57:21 UTC |
| Source: | https://github.com/a1arakkal/jane |
Fit a latent space cluster model, with or without noise edges, using an EM algorithm.
JANE(
  A,
  D = 2,
  K = 2,
  family = "bernoulli",
  noise_weights = FALSE,
  guess_noise_weights = NULL,
  model,
  initialization = "GNN",
  case_control = FALSE,
  DA_type = "none",
  seed = NULL,
  control = list()
)
A |
A square matrix or sparse matrix of class 'Matrix' (from the Matrix package) representing the adjacency matrix of the network of interest. |
D |
Integer (scalar or vector) specifying the dimension of the latent space (default is 2). |
K |
Integer (scalar or vector) specifying the number of clusters to consider (default is 2). |
family |
A character string specifying the distribution of the edge weights.
|
noise_weights |
A logical; if |
guess_noise_weights |
Only applicable if |
model |
A character string specifying the model to fit:
|
initialization |
A character string or a list to specify the initial values for the EM algorithm:
|
case_control |
A logical; if |
DA_type |
(Experimental) A character string to specify the type of deterministic annealing approach to use
|
seed |
(optional) An integer value to specify the seed for reproducibility. |
control |
A list of control parameters. See 'Details'. |
Isolates are removed from the adjacency matrix A. If an asymmetric adjacency matrix A is supplied for model %in% c('NDH', 'RS'), the user will be asked whether to proceed with converting A to a symmetric matrix (i.e., A <- 1.0 * ( (A + t(A)) > 0.0 )); this conversion is only possible if family = 'bernoulli'. Additionally, if a weighted network is supplied and noise_weights = FALSE, the network is converted to an unweighted binary network (i.e., (A > 0.0)*1.0) and a latent space cluster model is fit.
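The preprocessing described above can be illustrated in base R. This is a minimal sketch on a toy matrix, not the package's internal code: the object names (A_raw, A_bin, A_sym, keep, A_fit) are hypothetical, and the isolate rule used here (an actor with no edges in either direction) is an assumption.

```r
# Toy weighted, directed adjacency matrix (hypothetical data); actor 4 has no edges
A_raw <- matrix(c(0, 2, 0, 0,
                  0, 0, 3, 0,
                  1, 0, 0, 0,
                  0, 0, 0, 0),
                nrow = 4, byrow = TRUE)

# Weighted network -> unweighted binary network (noise_weights = FALSE case)
A_bin <- (A_raw > 0.0) * 1.0

# Symmetrize, as offered for model %in% c('NDH', 'RS')
A_sym <- 1.0 * ((A_bin + t(A_bin)) > 0.0)

# Drop isolates (assumed rule: no edges in either direction)
keep <- (rowSums(A_sym) + colSums(A_sym)) > 0
A_fit <- A_sym[keep, keep, drop = FALSE]
```

The fitted network can therefore have fewer actors than the supplied matrix, which matters when comparing results against external labels.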
control:
The control argument is a named list that the user can supply containing the following components:
verbose: A logical; if TRUE, additional information about the progress of the EM algorithm is printed (default is FALSE).
max_its: An integer specifying the maximum number of iterations for the EM algorithm (default is 1e3).
min_its: An integer specifying the minimum number of iterations for the EM algorithm (default is 10).
priors: A list of S3 class "JANE.priors" representing prior hyperparameter specifications (default is NULL). See specify_priors for details on how to specify the hyperparameters.
n_interior_knots: (only relevant for model %in% c('RS', 'RSR')) An integer specifying the number of interior knots used in fitting a natural cubic spline for degree heterogeneity (and connection strength heterogeneity if working with a weighted network) models (default is 5).
termination_rule: A character string specifying the termination rule used to determine convergence of the EM algorithm:
  'prob_mat': uses the change in the absolute difference in the cluster membership probability matrix between subsequent iterations (default)
  'Q': uses the change in the absolute difference in the objective function of the E-step evaluated using parameters from subsequent iterations
  'ARI': compares the classifications between subsequent iterations using the adjusted Rand index
  'NMI': compares the classifications between subsequent iterations using normalized mutual information
  'CER': compares the classifications between subsequent iterations using the classification error rate
tolerance: A numeric specifying the tolerance used for termination_rule %in% c('Q', 'prob_mat') (default is 1e-3).
tolerance_ARI: A numeric specifying the tolerance used for termination_rule = 'ARI' (default is 0.999).
tolerance_NMI: A numeric specifying the tolerance used for termination_rule = 'NMI' (default is 0.999).
tolerance_CER: A numeric specifying the tolerance used for termination_rule = 'CER' (default is 0.01).
n_its_start_CA: An integer specifying the iteration at which to start computing the change in cumulative averages (note: the change in the cumulative average of the latent position matrix is not tracked when termination_rule = 'Q') (default is 20).
tolerance_diff_CA: A numeric specifying the tolerance used for the change in the cumulative average of the termination_rule metric and of the latent position matrix (note: the change in the cumulative average of the latent position matrix is not tracked when termination_rule = 'Q') (default is 1e-3).
consecutive_diff_CA: An integer specifying the tolerance for the number of consecutive instances where the change in cumulative average is less than tolerance_diff_CA (default is 5).
quantile_diff: A numeric in [0,1] specifying the quantile used in computing the change in the absolute difference of the cluster membership probability matrix and the latent position matrix between subsequent iterations (default is 1, i.e., max).
beta_temp_schedule: (Experimental) A numeric vector specifying the temperature schedule for deterministic annealing (default is 1, i.e., deterministic annealing not utilized).
n_control: An integer specifying the fixed number of controls (i.e., non-links) sampled for each actor; only relevant when case_control = TRUE (default is 100 when case_control = TRUE and NULL when case_control = FALSE).
n_start: An integer specifying the maximum number of starts for the EM algorithm (default is 5).
max_retry: An integer specifying the maximum number of re-attempts if starting values cause issues with the EM algorithm (default is 5).
IC_selection: A character string specifying the information criterion used to select the optimal fit based on the combinations of K, D, and n_start considered:
  'BIC_model': BIC computed from the logistic regression or Hurdle model component
  'BIC_mbc': BIC computed from the model-based clustering component
  'ICL_mbc': ICL computed from the model-based clustering component
  'Total_BIC': sum of 'BIC_model' and 'BIC_mbc'
  'Total_ICL': sum of 'BIC_model' and 'ICL_mbc' (default)
sd_random_U_GNN: (only relevant when initialization = 'GNN') A positive numeric value specifying the standard deviation for the random draws from a normal distribution used to initialize the latent positions (default is 1).
max_retry_GNN: (only relevant when initialization = 'GNN') An integer specifying the maximum number of re-attempts for the GNN approach before switching to random starting values (default is 10).
n_its_GNN: (only relevant when initialization = 'GNN') An integer specifying the maximum number of iterations for the GNN approach (default is 10).
downsampling_GNN: (only relevant when initialization = 'GNN') A logical; if TRUE, downsampling is employed so that the numbers of links and non-links are balanced for the GNN approach (default is TRUE).
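As an illustration of how a few of these components fit together, the sketch below builds a control list from the documented names and mimics a 'prob_mat'-style check in base R. The helper prob_mat_change and the toy matrices are hypothetical, and the way JANE combines this check with min_its and the cumulative-average criteria is not reproduced here.

```r
# Control list using component names documented above
control <- list(termination_rule = "prob_mat",
                tolerance = 1e-3,
                quantile_diff = 1,
                max_its = 1e3,
                min_its = 10)

# Hypothetical helper: quantile of the element-wise absolute change between
# two cluster membership probability matrices from subsequent iterations
prob_mat_change <- function(Z_old, Z_new, quantile_diff = 1) {
  unname(stats::quantile(abs(Z_new - Z_old), probs = quantile_diff))
}

# Toy membership probabilities from two subsequent iterations (made-up values)
Z_old <- matrix(c(0.6, 0.4,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
Z_new <- matrix(c(0.7, 0.3,
                  0.2, 0.8), nrow = 2, byrow = TRUE)

change <- prob_mat_change(Z_old, Z_new, control$quantile_diff)

# Assumed comparison: stop once the change falls below the tolerance
converged <- change < control$tolerance
```

With quantile_diff = 1 the quantile is simply the maximum absolute change; a smaller value such as 0.95 would make the check less sensitive to a few slowly settling actors.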
Running JANE in parallel:
JANE integrates the future and future.apply packages to fit the various combinations of K, D, and n_start in parallel. The 'Examples' section below provides an example of how to run JANE in parallel. See plan and future.apply for more details.
Choosing the number of clusters:
JANE allows for the following model selection criteria to choose the number of clusters (smaller values are favored):
'BIC_model': BIC computed from logistic regression or Hurdle model component
'BIC_mbc': BIC computed from model based clustering component
'ICL_mbc': ICL (Biernacki et al. (2000)) computed from model based clustering component
'Total_BIC': Sum of 'BIC_model' and 'BIC_mbc', this is the model selection criterion proposed by Handcock et al. (2007)
'Total_ICL': (default) sum of 'BIC_model' and 'ICL_mbc', this criterion is similar to 'Total_BIC', but uses ICL for the model based clustering component, which tends to favor more well-separated clusters.
Based on simulation studies, Biernacki et al. (2000) recommend that when the interest in mixture modeling is cluster analysis, rather than density estimation, the ICL criterion should be favored over the BIC criterion, as the BIC criterion tends to overestimate the number of clusters. The BIC criterion is designed to choose the number of components in a mixture model rather than the number of clusters.
Warning: It is not certain whether it is appropriate to use the model selection criteria above to select D.
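The way the combined criteria are formed and used can be sketched with a small table. All numbers below are made up for illustration, and the data frame layout is hypothetical; it is not the structure of the IC_out component.

```r
# Hypothetical information criteria for three candidate fits (made-up numbers)
ic <- data.frame(K = c(2, 3, 4),
                 BIC_model = c(5120.4, 5011.7, 5009.9),
                 BIC_mbc   = c(640.2, 598.3, 601.5),
                 ICL_mbc   = c(655.0, 604.1, 630.8))

# Combined criteria as described above
ic$Total_BIC <- ic$BIC_model + ic$BIC_mbc
ic$Total_ICL <- ic$BIC_model + ic$ICL_mbc

# Smaller values are favored
best_K_BIC <- ic$K[which.min(ic$Total_BIC)]
best_K_ICL <- ic$K[which.min(ic$Total_ICL)]
```

In this toy table both criteria select the same K, but the gap between K = 3 and K = 4 is larger under 'Total_ICL', reflecting its preference for well-separated clusters.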
A list of S3 class "JANE" containing the following components:
input_params |
A list containing the input parameters for |
A |
The square sparse adjacency matrix of class 'dgCMatrix' used in fitting the latent space cluster model. This matrix can be different than the input A matrix as isolates are removed. |
IC_out |
A matrix containing the relevant information criteria for all combinations of |
all_convergence_ind |
A matrix containing the convergence information (i.e., 1 = converged, 0 = did not converge) and number of iterations for all combinations of |
optimal_res |
A list containing the estimated parameters of interest based on the optimal fit selected. It is recommended to use |
optimal_starting |
A list of S3 |
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725.
Handcock, M.S., Raftery, A.E., Tantrum, J.M., 2007. Model-based clustering for social networks. Journal of the Royal Statistical Society Series A: Statistics in Society 170, 301–354.
# Simulate network
mus <- matrix(c(-1,-1,1,-1,1,1),
              nrow = 3, ncol = 2, byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
                  diag(rep(7,2)),
                  diag(rep(7,2))),
                dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- 1.0
sim_data <- JANE::sim_A(N = 100L,
                        model = "NDH",
                        mus = mus,
                        omegas = omegas,
                        p = p,
                        params_LR = list(beta0 = beta0),
                        remove_isolates = TRUE)

# Run JANE on simulated data
res <- JANE::JANE(A = sim_data$A,
                  D = 2L,
                  K = 3L,
                  initialization = "GNN",
                  model = "NDH",
                  case_control = FALSE,
                  DA_type = "none")

# Run JANE on simulated data - consider multiple D and K
res <- JANE::JANE(A = sim_data$A,
                  D = 2:5,
                  K = 2:10,
                  initialization = "GNN",
                  model = "NDH",
                  case_control = FALSE,
                  DA_type = "none")

# Run JANE on simulated data - parallel with 5 cores (NOT RUN)
# future::plan(future::multisession, workers = 5)
# res <- JANE::JANE(A = sim_data$A,
#                   D = 2L,
#                   K = 3L,
#                   initialization = "GNN",
#                   model = "NDH",
#                   case_control = FALSE,
#                   DA_type = "none")
# future::plan(future::sequential)

# Run JANE on simulated data - case/control approach with 20 controls sampled for each actor
res <- JANE::JANE(A = sim_data$A,
                  D = 2L,
                  K = 3L,
                  initialization = "GNN",
                  model = "NDH",
                  case_control = TRUE,
                  DA_type = "none",
                  control = list(n_control = 20))

# Reproducibility
res1 <- JANE::JANE(A = sim_data$A,
                   D = 2L,
                   K = 3L,
                   initialization = "GNN",
                   seed = 1234,
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")
res2 <- JANE::JANE(A = sim_data$A,
                   D = 2L,
                   K = 3L,
                   initialization = "GNN",
                   seed = 1234,
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")

## Check if results match
all.equal(res1, res2)

# Another reproducibility example where the seed was not set.
# It is possible to replicate the results using the starting values due to
# the nature of EM algorithms
res3 <- JANE::JANE(A = sim_data$A,
                   D = 2L,
                   K = 3L,
                   initialization = "GNN",
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")

## Extract starting values
start_vals <- res3$optimal_starting

## Run JANE using extracted starting values, no need to specify D and K
## below as function will determine those values from start_vals
res4 <- JANE::JANE(A = sim_data$A,
                   initialization = start_vals,
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")

## Check if optimal_res are identical
all.equal(res3$optimal_res, res4$optimal_res)
S3 plot method for object of class "JANE".
## S3 method for class 'JANE'
plot(
  x,
  type = "lsnc",
  true_labels,
  initial_values = FALSE,
  zoom = 100,
  density_type = "contour",
  rotation_angle = 0,
  alpha_edge = 0.1,
  alpha_node = 1,
  swap_axes = FALSE,
  main,
  xlab,
  ylab,
  cluster_cols,
  remove_noise_edges = FALSE,
  ...
)
x |
|
type |
A character string to select the type of plot:
|
true_labels |
(optional) A numeric, character, or factor vector of known true cluster labels. Must have the same length as the number of actors in the fitted network, so the labels need to account for any isolates that were removed. |
initial_values |
A logical; if |
zoom |
A numeric value > 0 that controls the % magnification of the plot (default is 100%). |
density_type |
Choose from one of the following four options: 'contour' (default), 'hdr', 'image', and 'persp' indicating the density plot type. |
rotation_angle |
A numeric value that rotates the estimated latent positions and contours of the multivariate normal distributions clockwise (or counterclockwise if |
alpha_edge |
A numeric value in |
alpha_node |
A numeric value in |
swap_axes |
A logical; if |
main |
An optional overall title for the plot. |
xlab |
An optional title for the x axis. |
ylab |
An optional title for the y axis. |
cluster_cols |
An optional vector of colors for the clusters. Must have a length of at least |
remove_noise_edges |
(only applicable if |
... |
Unused. |
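The classification of actors into specific clusters is based on a hard clustering rule that assigns each actor to the cluster with the largest estimated cluster membership probability. Additionally, the actor-specific classification uncertainty is derived as 1 minus that largest membership probability.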
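A minimal base R sketch of such a rule, assuming it assigns each actor to the cluster with the largest membership probability (as the which.max call in the examples suggests). The matrix Z is made-up data, and the object names are hypothetical.

```r
# Hypothetical 4 x 3 cluster membership probability matrix (rows sum to 1)
Z <- matrix(c(0.90, 0.05, 0.05,
              0.20, 0.70, 0.10,
              0.40, 0.35, 0.25,
              0.10, 0.10, 0.80),
            nrow = 4, byrow = TRUE)

# Hard clustering: index of the largest probability in each row
cluster_labels <- apply(Z, 1, which.max)

# Classification uncertainty: 1 minus the largest probability in each row
uncertainty <- 1 - apply(Z, 1, max)
```

The third actor is assigned to cluster 1 but carries the highest uncertainty, since its probabilities are spread fairly evenly across clusters.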
The trace plot contains up to five unique plots tracking various metrics across the iterations of the EM algorithm, depending on the JANE control parameter termination_rule:
termination_rule = 'prob_mat': Five plots will be presented. In the top panel, the plot on the left presents the change in the absolute difference in the cluster membership probability matrix between subsequent iterations and, if noise_weights = TRUE, the change in the absolute difference in the edge weight cluster membership probability matrix between subsequent iterations. The exact quantile of the absolute difference plotted is presented in parentheses and is determined by the JANE control parameter quantile_diff. For example, with the default quantile_diff = 1, the values plotted are the maximum absolute difference in the cluster membership probability matrix (and potentially the edge weight cluster membership probability matrix) between subsequent iterations. The plot on the right of the top panel presents the absolute difference in the cumulative average of the absolute change in the cluster membership probability matrix (and potentially the edge weight cluster membership probability matrix) and in the matrix of latent positions across subsequent iterations (the absolute changes in all of these matrices are computed in the same manner as described previously). This metric is only tracked beginning at the iteration determined by the n_its_start_CA control parameter in JANE. Note that this plot may be empty if the EM algorithm converges before the n_its_start_CA-th iteration. Finally, the bottom panel presents ARI, NMI, and CER values comparing the classifications between subsequent iterations, respectively. Specifically, at a given iteration the actors are classified into clusters using the hard clustering rule of assigning each actor to the cluster with the largest membership probability, and given these labels from two subsequent iterations, the ARI, NMI, and CER are computed and plotted.
termination_rule = 'Q': Plots generated are similar to those described in the previous bullet point. However, instead of tracking the change in the cluster membership probability matrix (and potentially the edge weight cluster membership probability matrix) over iterations, the absolute difference in the objective function of the E-step evaluated using parameters from subsequent iterations is tracked. Furthermore, the cumulative average of the absolute change in the matrix of latent positions is no longer tracked.
termination_rule %in% c('ARI', 'NMI', 'CER'): Four plots will be presented. The top left panel presents a plot of the absolute difference in the cumulative average of the absolute change in the specific termination_rule employed and in the matrix of latent positions across iterations. As previously mentioned, if the EM algorithm converges before the n_its_start_CA-th iteration then this will be an empty plot. The other three plots present ARI, NMI, and CER values comparing the classifications between subsequent iterations, respectively.
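To make the iteration-to-iteration comparison concrete, the sketch below computes the adjusted Rand index between two label vectors using the standard contingency-table formula in base R. The function ari and the label vectors are hypothetical illustrations, not the routine the package calls, and the final comparison against a 0.999 threshold is an assumed reading of tolerance_ARI.

```r
# Standard adjusted Rand index from a contingency table (base R sketch)
ari <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))            # pairs grouped together in both
  sum_a <- sum(choose(rowSums(tab), 2))    # pairs grouped together in x
  sum_b <- sum(choose(colSums(tab), 2))    # pairs grouped together in y
  expected <- sum_a * sum_b / choose(n, 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# Hypothetical hard classifications from two subsequent iterations;
# the partition is unchanged, only the label names differ
labels_prev <- c(1, 1, 2, 2, 3, 3)
labels_curr <- c(2, 2, 3, 3, 1, 1)

ari_value <- ari(labels_prev, labels_curr)

# Assumed stopping check for termination_rule = 'ARI'
stable <- ari_value > 0.999
```

Because the index depends only on which actors are grouped together, relabeled but otherwise identical partitions score 1. The sketch returns NaN when both partitions place all actors in a single cluster.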
A plot of the network or trace plot of the EM run.
If an error interrupts the plotting process, the graphics device may be left in a state where par("new") = TRUE. This can cause subsequent plots to be overlaid. To reset the graphics state, call plot.new() or close and reopen the device with dev.off(); dev.new().
surfacePlot, adjustedRandIndex, classError, NMI
# Simulate network
mus <- matrix(c(-1,-1,1,-1,1,1),
              nrow = 3, ncol = 2, byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
                  diag(rep(7,2)),
                  diag(rep(7,2))),
                dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- 1.0
sim_data <- JANE::sim_A(N = 100L,
                        model = "NDH",
                        mus = mus,
                        omegas = omegas,
                        p = p,
                        params_LR = list(beta0 = beta0),
                        remove_isolates = TRUE)

# Run JANE on simulated data
res <- JANE::JANE(A = sim_data$A,
                  D = 2L,
                  K = 3L,
                  initialization = "GNN",
                  model = "NDH",
                  case_control = FALSE,
                  DA_type = "none")

# plot trace plot
plot(res, type = "trace_plot")

# plot network
plot(res)

# plot network - misclassified
plot(res, type = "misclassified", true_labels = apply(sim_data$Z_U, 1, which.max))

# plot network - uncertainty and swap axes
plot(res, type = "uncertainty", swap_axes = TRUE)

# plot network - but only show contours of MVNs
plot(res, swap_axes = TRUE, alpha_edge = 0, alpha_node = 0)

# plot using starting values of EM algorithm
plot(res, initial_values = TRUE)
Simulate an unweighted or weighted network, with or without noise edges, from a D-dimensional latent space cluster model with K clusters and N actors. The squared Euclidean distance is used (i.e., $\lVert U_i - U_j \rVert^2$), where $U_i$ and $U_j$ are the respective actors' positions in a D-dimensional social space.
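The generative idea can be sketched in base R for the simplest case (undirected, unweighted, no degree heterogeneity). This is a rough illustration under stated assumptions, not sim_A itself: it assumes a logistic link of the form beta0 minus the squared distance, and it assumes isotropic clusters with precision 7, mirroring the diag(rep(7,2)) matrices in the examples if those are read as precision matrices.

```r
set.seed(1)
N <- 30L; D <- 2L; K <- 3L

# Cluster means and mixture weights (same values as in the package examples)
mus <- matrix(c(-1, -1, 1, -1, 1, 1), nrow = K, ncol = D, byrow = TRUE)
p <- rep(1 / 3, K)

# Assumption: isotropic clusters with precision 7, i.e. sd = 1 / sqrt(7)
sd_k <- 1 / sqrt(7)

# Draw cluster memberships, then latent positions around the cluster means
z <- sample.int(K, N, replace = TRUE, prob = p)
U <- mus[z, , drop = FALSE] + matrix(stats::rnorm(N * D, sd = sd_k), N, D)

# Squared Euclidean distances between all pairs of actors
d2 <- as.matrix(stats::dist(U))^2

# Assumed link: logit P(A_ij = 1) = beta0 - squared distance
beta0 <- 1.0
prob <- stats::plogis(beta0 - d2)

# Sample the upper triangle and mirror it to obtain an undirected network
A <- matrix(0, N, N)
upper <- upper.tri(A)
A[upper] <- stats::rbinom(sum(upper), 1, prob[upper])
A <- A + t(A)
```

Actors that are close in the latent space are more likely to be linked, so the cluster structure of U shows up as denser blocks in A.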
sim_A(
  N,
  mus,
  omegas,
  p,
  model = "NDH",
  family = "bernoulli",
  params_LR,
  params_weights = NULL,
  noise_weights_prob = 0,
  mean_noise_weights,
  precision_noise_weights,
  remove_isolates = TRUE
)
N |
An integer specifying the number of actors in the network. |
mus |
A numeric |
omegas |
A numeric |
p |
A numeric vector of length |
model |
A character string specifying the type of model used to simulate the network:
|
family |
A character string specifying the distribution of the edge weights.
|
params_LR |
A list containing the parameters of the logistic regression model to simulate the unweighted network, including:
|
params_weights |
Only relevant when
|
noise_weights_prob |
A numeric in [0,1] representing the proportion of all edges in the simulated network that are noise edges (default is 0.0). |
mean_noise_weights |
A numeric representing the mean of the noise weight distribution. Only relevant when |
precision_noise_weights |
A positive, non-zero, numeric representing the precision (on the log scale) of the log-normal noise weight distribution. Only relevant when |
remove_isolates |
A logical; if |
The returned scalar q_prob represents the proportion of non-edges in the simulated network to be converted to noise edges, computed as $\frac{p_{noise} \, d}{(1 - d)(1 - p_{noise})}$, where $d$ is the density of the simulated network without noise and $p_{noise}$ is the inputted noise_weights_prob.
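The relationship between these quantities can be checked numerically. The sketch below assumes q_prob is chosen so that noise edges make up a share noise_weights_prob of all edges in the final network; solving that condition for the proportion of converted non-edges gives the expression used in the hypothetical helper q_prob_from. The values of d and p_noise are made up.

```r
# Hypothetical helper: proportion of non-edges to convert so that noise edges
# form a share p_noise of all edges, given noise-free density d
q_prob_from <- function(d, p_noise) {
  (p_noise * d) / ((1 - d) * (1 - p_noise))
}

d <- 0.05        # density of the network without noise (made-up value)
p_noise <- 0.1   # desired proportion of all edges that are noise edges

q <- q_prob_from(d, p_noise)

# Check: noise edges / (true edges + noise edges), per possible dyad
noise_share <- q * (1 - d) / (d + q * (1 - d))
```

Because sparse networks have many more non-edges than edges, only a small fraction of non-edges needs to be converted to reach a given noise share.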
A list containing the following components:
A |
A sparse adjacency matrix of class 'dgCMatrix' representing the "true" underlying unweighted network with no noise edges. |
W |
A sparse adjacency matrix of class 'dgCMatrix' representing the unweighted or weighted network, with or without noise. Note, if |
q_prob |
A numeric scalar representing the proportion of non-edges in the "true" underlying network converted to noise edges. See 'Details' for how this value is computed. |
Z_U |
A numeric |
Z_W |
A numeric |
U |
A numeric |
mus |
The inputted numeric |
omegas |
The inputted numeric |
p |
The inputted numeric vector |
noise_weights_prob |
The inputted numeric scalar |
mean_noise_weights |
The inputted numeric scalar |
precision_noise_weights |
The inputted numeric scalar |
model |
The inputted |
family |
The inputted |
params_LR |
The inputted |
params_weights |
The inputted |
mus <- matrix(c(-1,-1,1,-1,1,1),
              nrow = 3, ncol = 2, byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
                  diag(rep(7,2)),
                  diag(rep(7,2))),
                dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- 1.0

# Simulate an undirected, unweighted network, with no noise and no degree heterogeneity
JANE::sim_A(N = 100L,
            model = "NDH",
            mus = mus,
            omegas = omegas,
            p = p,
            params_LR = list(beta0 = beta0),
            remove_isolates = TRUE)

# Simulate a directed, weighted network, with degree and strength heterogeneity but no noise
JANE::sim_A(N = 100L,
            model = "RSR",
            family = "lognormal",
            mus = mus,
            omegas = omegas,
            p = p,
            params_LR = list(beta0 = beta0),
            params_weights = list(beta0 = 2,
                                  precision_weights = 1),
            remove_isolates = TRUE)

# Simulate an undirected, weighted network, with noise and degree and strength heterogeneity
JANE::sim_A(N = 100L,
            model = "RS",
            family = "poisson",
            mus = mus,
            omegas = omegas,
            p = p,
            params_LR = list(beta0 = beta0),
            params_weights = list(beta0 = 2),
            noise_weights_prob = 0.1,
            mean_noise_weights = 1,
            remove_isolates = TRUE)
A function that allows the user to specify starting values for the EM algorithm in a structure accepted by JANE.
specify_initial_values(
  A,
  D,
  K,
  model,
  family = "bernoulli",
  noise_weights = FALSE,
  n_interior_knots = NULL,
  U,
  omegas,
  mus,
  p,
  Z,
  beta,
  beta2,
  precision_weights,
  precision_noise_weights
)
A |
A square matrix or sparse matrix of class 'dgCMatrix' representing the adjacency matrix of the network of interest. |
D |
An integer specifying the dimension of the latent positions. |
K |
An integer specifying the total number of clusters. |
model |
A character string specifying the model:
|
family |
A character string specifying the distribution of the edge weights.
|
noise_weights |
A logical; if TRUE then a Hurdle model is used to account for noise weights, if FALSE simply utilizes the supplied network (converted to an unweighted binary network if a weighted network is supplied, i.e., (A > 0.0)*1.0) and fits a latent space cluster model (default is FALSE). |
n_interior_knots |
An integer specifying the number of interior knots used in fitting a natural cubic spline for degree heterogeneity (and connection strength heterogeneity if working with weighted network) models (i.e., 'RS' and 'RSR' only; default is |
U |
A numeric |
omegas |
A numeric |
mus |
A numeric |
p |
A numeric vector of length |
Z |
A numeric |
beta |
A numeric vector specifying the regression coefficients for the logistic regression model. Specifically, a vector of length |
beta2 |
A numeric vector specifying the regression coefficients for the zero-truncated Poisson or log-normal GLM. Specifically, a vector of length |
precision_weights |
A positive numeric scalar specifying the precision (on the log scale) of the log-normal weight distribution. Only relevant when |
precision_noise_weights |
A positive numeric scalar specifying the precision (on the log scale) of the log-normal noise weight distribution. Only relevant when |
To match JANE, this function will remove isolates from the adjacency matrix A and determine the total number of actors after excluding isolates. If this is not done, errors with respect to incorrect dimensions in the starting values will be generated when executing JANE.
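When preparing starting values by hand, it can help to size them from the number of actors that remain after isolates are dropped. The sketch below is a base R illustration with a toy matrix; the isolate rule (no edges in either direction) and all object names are assumptions rather than the package's internals.

```r
# Toy undirected network with 5 actors; actor 5 has no edges (hypothetical data)
A <- matrix(0, 5, 5)
A[1, 2] <- A[2, 1] <- 1
A[3, 4] <- A[4, 3] <- 1

# Number of actors remaining after excluding isolates (assumed rule)
N_fit <- sum((rowSums(A) + colSums(A)) > 0)

# Size the actor-level starting values with N_fit rather than nrow(A)
D <- 2L
K <- 3L
set.seed(1)
U <- matrix(stats::rnorm(N_fit * D), nrow = N_fit, ncol = D)
Z <- matrix(1 / K, nrow = N_fit, ncol = K)
```

Using nrow(A) instead of N_fit here would give U and Z one row too many for the network that is actually fit.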
Similarly, to match JANE, if an asymmetric adjacency matrix A is supplied for model %in% c('NDH', 'RS'), the user will be asked whether to proceed with converting A to a symmetric matrix (i.e., A <- 1.0 * ( (A + t(A)) > 0.0 )). Additionally, if a weighted network is supplied and noise_weights = FALSE, the network is converted to an unweighted binary network (i.e., (A > 0.0)*1.0).
A list of S3 class "JANE.initial_values" representing starting values for the EM algorithm, in a structure accepted by JANE.
# Simulate network
mus <- matrix(c(-1,-1,1,-1,1,1),
              nrow = 3, ncol = 2, byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
                  diag(rep(7,2)),
                  diag(rep(7,2))),
                dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- -1
sim_data <- JANE::sim_A(N = 100L,
                        model = "RSR",
                        mus = mus,
                        omegas = omegas,
                        p = p,
                        params_LR = list(beta0 = beta0),
                        remove_isolates = TRUE)

# Specify starting values
D <- 3L
K <- 5L
N <- nrow(sim_data$A)
n_interior_knots <- 5L

U <- matrix(stats::rnorm(N*D), nrow = N, ncol = D)
omegas <- stats::rWishart(n = K, df = D+1, Sigma = diag(D))
mus <- matrix(stats::rnorm(K*D), nrow = K, ncol = D)
p <- extraDistr::rdirichlet(n = 1, rep(3,K))[1,]
Z <- extraDistr::rdirichlet(n = N, alpha = rep(1, K))
beta <- stats::rnorm(n = 1 + 2*(1 + n_interior_knots))

my_starting_values <- JANE::specify_initial_values(A = sim_data$A,
                                                   D = D,
                                                   K = K,
                                                   model = "RSR",
                                                   n_interior_knots = n_interior_knots,
                                                   U = U,
                                                   omegas = omegas,
                                                   mus = mus,
                                                   p = p,
                                                   Z = Z,
                                                   beta = beta)

# Run JANE using my_starting_values (no need to specify D and K as function will
# determine those values from my_starting_values)
res <- JANE::JANE(A = sim_data$A,
                  initialization = my_starting_values,
                  model = "RSR")
A function that allows the user to specify the prior hyperparameters for the EM algorithm in a structure accepted by JANE.
specify_priors(
  D = 2,
  K = 2,
  model,
  family = "bernoulli",
  noise_weights = FALSE,
  n_interior_knots = NULL,
  a,
  b,
  c,
  G,
  nu,
  e,
  f,
  h,
  l,
  e_2,
  f_2,
  m_1,
  o_1,
  m_2,
  o_2
)
| Argument | Description |
|---|---|
| D | An integer specifying the dimension of the latent positions (default is 2). |
| K | An integer specifying the total number of clusters (default is 2). |
| model | A character string specifying the model: |
| family | A character string specifying the distribution of the edge weights. |
| noise_weights | A logical; if TRUE, a hurdle model is used to account for noise weights; if FALSE, the supplied network is used as is (converted to an unweighted binary network if a weighted network is supplied, i.e., (A > 0.0)*1.0) and a latent space cluster model is fit (default is FALSE). |
| n_interior_knots | An integer specifying the number of interior knots used in fitting a natural cubic spline for degree heterogeneity (and connection strength heterogeneity if working with weighted network) models (i.e., 'RS' and 'RSR' only; default is |
| a | A numeric vector of length |
| b | A strictly positive numeric scalar specifying the scaling factor on the precision of the multivariate normal prior on |
| c | A positive numeric scalar |
| G | A numeric positive definite |
| nu | A positive numeric vector of length |
| e | A numeric vector of length |
| f | A numeric positive definite square matrix of dimension |
| h | A positive numeric scalar |
| l | A strictly positive numeric scalar specifying the second shape parameter for the Beta prior on |
| e_2 | A numeric vector of length |
| f_2 | A numeric positive definite square matrix of dimension |
| m_1 | A positive numeric scalar |
| o_1 | A positive numeric scalar |
| m_2 | A positive numeric scalar |
| o_2 | A positive numeric scalar |
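The lengths of `e` and the dimension of `f` depend on the model and on `n_interior_knots`. The sketch below, in base R, encodes the lengths used in this manual's examples: `1 + (n_interior_knots + 1)` for "RS" and `1 + 2*(1 + n_interior_knots)` for "RSR". `n_lr_coefs` is a hypothetical helper, not a JANE function, and the intercept-only length of 1 for "NDH" is an assumption based on the "NDH" example supplying only `beta0`.

```r
# Hypothetical helper (not part of JANE): number of logistic-regression
# coefficients implied by the model and the number of interior knots.
n_lr_coefs <- function(model = c("NDH", "RS", "RSR"), n_interior_knots = 5L) {
  model <- match.arg(model)
  switch(model,
         NDH = 1L,                                   # assumed: intercept only
         RS  = 1L + (n_interior_knots + 1L),         # intercept + one spline basis
         RSR = 1L + 2L * (n_interior_knots + 1L))    # intercept + two spline bases
}

# Build e and f with matching dimensions for an "RS" model
n_coef <- n_lr_coefs("RS", n_interior_knots = 5L)
e <- rep(0.5, n_coef)
f <- diag(c(0.1, rep(0.5, n_coef - 1L)))
stopifnot(length(e) == nrow(f), nrow(f) == ncol(f))
```

Deriving both objects from a single count avoids a dimension mismatch between `e` and `f` when `n_interior_knots` changes.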
Prior on and (note: the same prior is used for ) :
Prior on :
For the current implementation we require that all elements of the nu vector be at least 1 to prevent negative mixture weights for empty clusters.
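The restriction on nu can be motivated with the standard MAP update for mixture weights under a Dirichlet prior. The sketch below is base R only and assumes, without confirmation from the package source, that JANE's M-step uses this textbook update; `map_weights` is a hypothetical name.

```r
# Standard MAP estimate of mixture weights under a Dirichlet(nu) prior:
#   p_k = (n_k + nu_k - 1) / (sum(n) + sum(nu) - K)
# where n_k is the expected number of actors in cluster k.
# Assumption: this matches the update used inside JANE.
map_weights <- function(n_k, nu) {
  (n_k + nu - 1) / (sum(n_k) + sum(nu) - length(nu))
}

n_k <- c(60, 40, 0)                     # third cluster is empty

w_ok  <- map_weights(n_k, nu = rep(2, 3))    # nu >= 1: every weight stays positive
w_bad <- map_weights(n_k, nu = rep(0.5, 3))  # nu < 1: the empty cluster goes negative

print(w_ok)
print(w_bad)
```

For an empty cluster the numerator reduces to `nu_k - 1`, so any `nu_k` below 1 yields a negative weight.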
Prior on :
Prior on :
Zero-truncated Poisson
Prior on :
Log-normal
Prior on :
Prior on :
Prior on :
Unevaluated calls can be supplied as values for specific hyperparameters. This is particularly useful when running JANE for multiple combinations of K and D. See 'examples' section below for implementation examples.
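The mechanism behind unevaluated calls can be illustrated in base R. In this sketch, `priors_template` and `resolve_priors` are hypothetical names, and evaluating each stored call against the current `D` and `K` is an assumption about what happens internally rather than a description of JANE's code.

```r
# Hyperparameters stored as unevaluated calls that refer to D and K
priors_template <- list(
  a  = quote(rep(1, D)),
  c  = quote(D + 1),
  G  = quote(10 * diag(D)),
  nu = quote(rep(2, K))
)

# Hypothetical resolver (illustration only): evaluate each call for one (D, K)
resolve_priors <- function(template, D, K) {
  lapply(template, function(expr) eval(expr, list(D = D, K = K)))
}

# The same template yields correctly sized hyperparameters for each D
for (D in 2:3) {
  pr <- resolve_priors(priors_template, D = D, K = 4L)
  cat("D =", D, "-> length(a) =", length(pr$a),
      ", dim(G) =", dim(pr$G), ", length(nu) =", length(pr$nu), "\n")
}
```

Because the calls are only evaluated once `D` and `K` are known, one specification can serve every combination in a grid such as `D = 2:5` and `K = 2:10`.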
A list of S3 class "JANE.priors" representing prior hyperparameters for the EM algorithm, in a structure accepted by JANE.
    # Simulate network
    mus <- matrix(c(-1,-1,1,-1,1,1), nrow = 3, ncol = 2, byrow = TRUE)
    omegas <- array(c(diag(rep(7,2)), diag(rep(7,2)), diag(rep(7,2))), dim = c(2,2,3))
    p <- rep(1/3, 3)
    beta0 <- 1.0
    sim_data <- JANE::sim_A(N = 100L,
                            model = "RS",
                            mus = mus,
                            omegas = omegas,
                            p = p,
                            params_LR = list(beta0 = beta0),
                            remove_isolates = TRUE)

    # Specify prior hyperparameters
    D <- 3L
    K <- 5L
    n_interior_knots <- 5L
    a <- rep(1, D)
    b <- 3
    c <- 4
    G <- 10*diag(D)
    nu <- rep(2, K)
    e <- rep(0.5, 1 + (n_interior_knots + 1))
    f <- diag(c(0.1, rep(0.5, n_interior_knots + 1)))
    my_prior_hyperparameters <- JANE::specify_priors(D = D,
                                                     K = K,
                                                     model = "RS",
                                                     n_interior_knots = n_interior_knots,
                                                     a = a,
                                                     b = b,
                                                     c = c,
                                                     G = G,
                                                     nu = nu,
                                                     e = e,
                                                     f = f)

    # Run JANE on simulated data using supplied prior hyperparameters
    res <- JANE::JANE(A = sim_data$A,
                      D = D,
                      K = K,
                      initialization = "GNN",
                      model = "RS",
                      case_control = FALSE,
                      DA_type = "none",
                      control = list(priors = my_prior_hyperparameters))

    # Specify prior hyperparameters as unevaluated calls
    n_interior_knots <- 5L
    e <- rep(0.5, 1 + (n_interior_knots + 1))
    f <- diag(c(0.1, rep(0.5, n_interior_knots + 1)))
    my_prior_hyperparameters <- JANE::specify_priors(model = "RS",
                                                     n_interior_knots = n_interior_knots,
                                                     a = quote(rep(1, D)),
                                                     b = b,
                                                     c = quote(D + 1),
                                                     G = quote(10*diag(D)),
                                                     nu = quote(rep(2, K)),
                                                     e = e,
                                                     f = f)

    # # Run JANE on simulated data using supplied prior hyperparameters (NOT RUN)
    # future::plan(future::multisession, workers = 5)
    # res <- JANE::JANE(A = sim_data$A,
    #                   D = 2:5,
    #                   K = 2:10,
    #                   initialization = "GNN",
    #                   model = "RS",
    #                   case_control = FALSE,
    #                   DA_type = "none",
    #                   control = list(priors = my_prior_hyperparameters))
    # future::plan(future::sequential)
S3 summary method for objects of class "JANE".
    ## S3 method for class 'JANE'
    summary(object, true_labels = NULL, initial_values = FALSE, ...)
| Argument | Description |
|---|---|
| object | |
| true_labels | (optional) A numeric, character, or factor vector of known true cluster labels. Must have the same length as the number of actors in the fitted network, so any isolates that were removed need to be accounted for (default is NULL). |
| initial_values | A logical; if |
| ... | Unused. |
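Because `true_labels` must match the actors in the fitted network, labels for removed isolates have to be dropped first. The following base R sketch shows one way to do this on a toy network. It assumes that an isolate is an actor with no incoming and no outgoing edges and that the remaining actors keep their original order. How JANE records which actors were dropped is not shown here, and all object names are illustrative.

```r
# Toy undirected network with 8 actors; actors 7 and 8 have no edges
N <- 8L
A <- matrix(0, nrow = N, ncol = N)
A[1, 2] <- A[2, 1] <- 1
A[3, 4] <- A[4, 3] <- 1
A[5, 6] <- A[6, 5] <- 1

# Known labels for all 8 actors, including the isolates
true_labels_all <- c(1, 1, 2, 2, 3, 3, 1, 2)

# Keep actors with at least one incoming or outgoing edge (assumed definition)
keep <- (rowSums(A != 0) + colSums(A != 0)) > 0
A_fit <- A[keep, keep]
true_labels <- true_labels_all[keep]

# The label vector now lines up with the network that would be fitted
stopifnot(length(true_labels) == nrow(A_fit))
```

Applying the same logical index to both the adjacency matrix and the label vector keeps the two aligned.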
A list of S3 class "summary.JANE" containing the following components (Note: N is the number of actors in the network, K is the number of clusters, and D is the dimension of the latent space):
| Component | Description |
|---|---|
| coefficients | A list containing the estimated coefficients from the logistic regression model (i.e., 'beta_LR') and, if relevant, the estimated coefficients from the zero-truncated Poisson or log-normal GLM (i.e., 'beta_GLM'). |
| U | A numeric |
| p | A numeric vector of length |
| mus | A numeric |
| omegas | A numeric |
| Z_U | A numeric |
| uncertainty | A numeric vector of length |
| cluster_labels | A numeric vector of length |
| Z_W | A numeric |
| q_prob | A numeric scalar representing the estimated proportion of non-edges in the "true" unobserved network that were converted to noise edges. |
| precision_weights | A numeric scalar representing the estimated precision (on the log scale) of the log-normal weight distribution. Only relevant for |
| precision_noise_weights | A numeric scalar representing the estimated precision (on the log scale) of the log-normal noise weight distribution. Only relevant for |
| IC | Information criteria values of the optimal fit selected, including |
| input_params | A list with the following components: |
| clustering_performance | (only if |
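The relationship between soft cluster memberships, hard labels, and a comparison against known labels can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation. It assumes that memberships form a matrix with one row per actor and one column per cluster (as suggested by the use of `apply(sim_data$Z_U, 1, which.max)` in the examples) and that uncertainty is one minus the largest membership probability, a common convention that is not confirmed here. The values are made up.

```r
# Toy soft-membership matrix: rows are actors, columns are clusters
Z <- rbind(c(0.90, 0.05, 0.05),
           c(0.80, 0.15, 0.05),
           c(0.10, 0.85, 0.05),
           c(0.20, 0.70, 0.10),
           c(0.05, 0.05, 0.90),
           c(0.30, 0.30, 0.40))
true_labels <- c("b", "b", "a", "a", "c", "c")

# Hard assignment: the cluster with the largest membership probability
cluster_labels <- apply(Z, 1, which.max)

# Assumed definition of uncertainty: 1 - largest membership probability
uncertainty <- 1 - apply(Z, 1, max)

# Cluster indices are arbitrary, so compare with a cross-tabulation
# rather than by testing the labels for equality
confusion <- table(true = true_labels, estimated = cluster_labels)
print(confusion)
```

A cross-tabulation is used because estimated cluster indices are only identified up to relabeling, so a perfect recovery appears as one nonzero cell per row rather than as matching label values.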
    # Simulate network
    mus <- matrix(c(-1,-1,1,-1,1,1), nrow = 3, ncol = 2, byrow = TRUE)
    omegas <- array(c(diag(rep(7,2)), diag(rep(7,2)), diag(rep(7,2))), dim = c(2,2,3))
    p <- rep(1/3, 3)
    beta0 <- 1.0
    sim_data <- JANE::sim_A(N = 100L,
                            model = "NDH",
                            mus = mus,
                            omegas = omegas,
                            p = p,
                            params_LR = list(beta0 = beta0),
                            remove_isolates = TRUE)

    # Run JANE on simulated data
    res <- JANE::JANE(A = sim_data$A,
                      D = 2L,
                      K = 3L,
                      initialization = "GNN",
                      model = "NDH",
                      case_control = FALSE,
                      DA_type = "none")

    # Summarize fit
    summary(res)

    # Summarize fit and compare to true cluster labels
    summary(res, true_labels = apply(sim_data$Z_U, 1, which.max))

    # Summarize fit using starting values of EM algorithm
    summary(res, initial_values = TRUE)