This function will find the most spatially relevant cluster label for each gene.
Usage
lasso_markers(
gene_mt,
cluster_mt,
sample_names,
keep_positive = TRUE,
coef_cutoff = 0.05,
background = NULL,
n_fold = 10
)
Arguments
- gene_mt
A matrix containing the transcript count in each grid. Each row refers to a grid, and each column refers to a gene. The column names must be specified and refer to the genes. This can be the output of the function get_vectors.
- cluster_mt
A matrix containing the number of cells of a specific cluster in each grid. Each row refers to a grid, and each column refers to a cluster. The column names must be specified and refer to the clusters. Please do not assign integers as column names. This can be the output of the function get_vectors.
- sample_names
A vector specifying the names of the samples.
- keep_positive
A logical flag indicating whether to return positively correlated clusters (TRUE) or negatively correlated clusters (FALSE).
- coef_cutoff
A positive number giving the coefficient cutoff value. Genes whose top cluster has a coefficient value smaller than the cutoff will be treated as non-marker genes. Default is 0.05.
- background
Optional. A matrix providing the background information. Each row refers to a grid, and each column refers to one category of background information. The number of rows must equal the number of rows in gene_mt and cluster_mt. It can be obtained by providing only the coordinate matrices cluster_info to the function get_vectors.
- n_fold
Optional. A positive number giving the number of folds used for cross-validation. This parameter is passed to cv.glmnet to calculate a penalty term for every gene.
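As an illustration of the input requirements above, the following minimal sketch builds two small matrices by hand and renames integer cluster labels before they would be passed on. The toy values and the "c" prefix are assumptions made for this example only; in practice both matrices would typically come from get_vectors.

```r
# Hypothetical toy matrices: 4 grids (rows); clusters were labelled with integers
gene_mt <- matrix(c(5, 0, 2, 1, 0, 3, 4, 0), nrow = 4,
                  dimnames = list(NULL, c("gene_A1", "gene_B1")))
cluster_mt <- matrix(c(3, 0, 1, 0, 0, 2, 2, 1), nrow = 4,
                     dimnames = list(NULL, c("1", "2")))

# Prefix integer-like cluster names so that they become non-numeric labels
is_int_name <- grepl("^[0-9]+$", colnames(cluster_mt))
colnames(cluster_mt)[is_int_name] <- paste0("c", colnames(cluster_mt)[is_int_name])

# Both matrices must describe the same grids, row by row, and carry column names
stopifnot(nrow(gene_mt) == nrow(cluster_mt),
          !is.null(colnames(gene_mt)),
          !is.null(colnames(cluster_mt)))
```

A check of this kind fails early, before any model is fitted, when the two matrices do not cover the same grids.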
Value
A list of two matrices with the following components:
lasso_top_result
A matrix with detailed information for each gene and the most relevant cluster label.
- gene
Gene name.
- top_cluster
The name of the most relevant cluster after thresholding the coefficients.
- glm_coef
The coefficient of the selected cluster in the generalised linear model.
- pearson
Pearson correlation between the gene vector and the selected cluster vector.
- max_gg_corr
A number showing the maximum Pearson correlation between this gene vector and all other gene vectors in the input gene_mt.
- max_gc_corr
A number showing the maximum Pearson correlation between this gene vector and every cluster vector in the input cluster_mt.
lasso_full_result
A matrix with detailed information for each gene and every cluster with a significant coefficient.
- gene
Gene name.
- cluster
The name of the significant cluster after thresholding the coefficients.
- glm_coef
The coefficient of the selected cluster in the generalised linear model.
- pearson
Pearson correlation between the gene vector and the selected cluster vector.
- max_gg_corr
A number showing the maximum Pearson correlation between this gene vector and all other gene vectors in the input gene_mt.
- max_gc_corr
A number showing the maximum Pearson correlation between this gene vector and every cluster vector in the input cluster_mt.
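The two correlation columns can be reproduced approximately with base R. The sketch below assumes that plain Pearson correlations between matrix columns are used and that a gene is not compared with itself; the toy matrices are made up for illustration, and the package's internal computation may differ.

```r
set.seed(1)
# Hypothetical toy inputs: 50 grids (rows), 2 clusters and 3 genes (columns)
cluster_mt <- cbind(A = rpois(50, 3), B = rpois(50, 3))
gene_mt <- cbind(gene_A1 = cluster_mt[, "A"] + rpois(50, 1),
                 gene_A2 = cluster_mt[, "A"] + rpois(50, 1),
                 gene_B1 = cluster_mt[, "B"] + rpois(50, 1))

# Gene-gene Pearson correlations; mask the diagonal so that a gene is not
# compared with itself
gg_corr <- cor(gene_mt, method = "pearson")
diag(gg_corr) <- NA
max_gg_corr <- apply(gg_corr, 1, max, na.rm = TRUE)

# Gene-cluster Pearson correlations (genes in rows, clusters in columns)
gc_corr <- cor(gene_mt, cluster_mt, method = "pearson")
max_gc_corr <- apply(gc_corr, 1, max)
```

Both results are named numeric vectors with one entry per gene, which matches the one-value-per-gene layout of the max_gg_corr and max_gc_corr columns.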
Details
This function takes the converted gene and cluster vectors from the function get_vectors and returns the most relevant cluster label for each gene. If there are multiple samples in the dataset, this function will find shared markers across different samples by including additional sample vectors in the input cluster_mt.
This function treats all input cluster vectors as features and fits a penalized linear model with lasso regularization for each gene vector. Clusters with a non-zero coefficient are selected, and these clusters are then used to fit a generalised linear model for this gene vector.
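The two-step procedure described above can be sketched for a single gene as follows. This is a minimal illustration, not the package's internal code: the simulated data, the Gaussian family, and the use of lambda.min as the penalty are assumptions.

```r
library(glmnet)

set.seed(100)
n_grid <- 200
# Hypothetical cluster vectors: cell counts per grid for three clusters
cluster_mt <- cbind(A = rpois(n_grid, 4),
                    B = rpois(n_grid, 4),
                    C = rpois(n_grid, 4))
# Hypothetical gene vector driven mainly by cluster A
gene_vec <- 2 * cluster_mt[, "A"] + rnorm(n_grid, sd = 1)

# Step 1: lasso-penalized linear model; the penalty is chosen by
# 10-fold cross-validation (assumption: lambda.min is used)
cv_fit <- cv.glmnet(x = cluster_mt, y = gene_vec, alpha = 1, nfolds = 10)
lasso_coef <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]  # drop intercept
selected <- names(lasso_coef)[lasso_coef != 0]

# Step 2: refit an unpenalized generalised linear model on the selected clusters
glm_df <- data.frame(gene = gene_vec, cluster_mt[, selected, drop = FALSE])
glm_fit <- glm(gene ~ ., data = glm_df, family = gaussian())
coef_tab <- summary(glm_fit)$coefficients
```

The rows of coef_tab hold the estimate and p-value of each selected cluster, which is the kind of information the glm_coef column and the significance filtering rely on.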
If the input keep_positive is TRUE, the clusters with a positive coefficient and a significant p-value will be saved in the output matrix lasso_full_result. The cluster with a positive coefficient and the minimum p-value will be regarded as the most relevant cluster for this gene and be saved in the output matrix lasso_top_result.
If the input keep_positive is FALSE, the clusters with a negative coefficient and a significant p-value will be saved in the output matrix lasso_full_result. The cluster with a negative coefficient and the minimum p-value will be regarded as the most relevant cluster for this gene and be saved in the output matrix lasso_top_result.
If there are no clusters with a significant p-value, the string "NoSig" will be returned for this gene.
The parameter background can be used to capture unwanted noise patterns in the dataset. For example, we can include negative control genes as a background cluster in the model. If the most relevant cluster selected for a gene matches one of the background "clusters", "NoSig" will be returned for this gene.
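The selection rules described above can be summarised in a small helper. pick_top_cluster is a hypothetical function written for this illustration; the 0.05 p-value threshold, the column names of the coefficient table, and the way coef_cutoff is applied are assumptions and may differ from the actual implementation.

```r
# Hypothetical helper, not part of the package: pick the top cluster for one gene
pick_top_cluster <- function(coef_tab, keep_positive = TRUE, coef_cutoff = 0.05,
                             background = character(0), p_cutoff = 0.05) {
  # Keep clusters with the requested coefficient sign and a significant p-value
  sign_ok <- if (keep_positive) coef_tab$glm_coef > 0 else coef_tab$glm_coef < 0
  sig <- coef_tab[sign_ok & coef_tab$p_value < p_cutoff, , drop = FALSE]
  if (nrow(sig) == 0) return("NoSig")
  # The most relevant cluster is the one with the minimum p-value
  top <- sig[which.min(sig$p_value), ]
  # Assumed handling: a weak coefficient or a background match yields "NoSig"
  if (abs(top$glm_coef) < coef_cutoff || top$cluster %in% background) {
    return("NoSig")
  }
  top$cluster
}

# Made-up coefficient table for one gene; "negctrl" plays the background role
coef_tab <- data.frame(cluster = c("A", "B", "negctrl"),
                       glm_coef = c(0.80, -0.30, 0.90),
                       p_value = c(1e-8, 1e-3, 1e-10),
                       stringsAsFactors = FALSE)

res_default <- pick_top_cluster(coef_tab)
res_background <- pick_top_cluster(coef_tab, background = "negctrl")
res_negative <- pick_top_cluster(coef_tab, keep_positive = FALSE)
```

In this toy table the background column has the smallest p-value, so declaring it as background turns the result for this gene into "NoSig" instead of reporting a noise pattern as a marker.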
Examples
set.seed(100)
# simulate coordinates for clusters
df_clA = data.frame(x = rnorm(n=100, mean=20, sd=5),
                    y = rnorm(n=100, mean=20, sd=5), cluster="A")
df_clB = data.frame(x = rnorm(n=100, mean=100, sd=5),
                    y = rnorm(n=100, mean=100, sd=5), cluster="B")
clusters = rbind(df_clA, df_clB)
clusters$sample="rep1"
# simulate coordinates for genes
trans_info = data.frame(rbind(cbind(x = rnorm(n=100, mean=20, sd=5),
                                    y = rnorm(n=100, mean=20, sd=5),
                                    feature_name="gene_A1"),
                              cbind(x = rnorm(n=100, mean=20, sd=5),
                                    y = rnorm(n=100, mean=20, sd=5),
                                    feature_name="gene_A2"),
                              cbind(x = rnorm(n=100, mean=100, sd=5),
                                    y = rnorm(n=100, mean=100, sd=5),
                                    feature_name="gene_B1"),
                              cbind(x = rnorm(n=100, mean=100, sd=5),
                                    y = rnorm(n=100, mean=100, sd=5),
                                    feature_name="gene_B2")))
# cbind() with a character column stores x and y as text, so convert them back
trans_info$x=as.numeric(trans_info$x)
trans_info$y=as.numeric(trans_info$y)
# window limits covering all transcripts and cells
w_x = c(min(floor(min(trans_info$x)),
            floor(min(clusters$x))),
        max(ceiling(max(trans_info$x)),
            ceiling(max(clusters$x))))
w_y = c(min(floor(min(trans_info$y)),
            floor(min(clusters$y))),
        max(ceiling(max(trans_info$y)),
            ceiling(max(clusters$y))))
# convert the coordinates into gene and cluster vectors over square bins
vecs_lst = get_vectors(trans_lst=list(rep1=trans_info),
                       cluster_info = clusters,
                       bin_type = "square",
                       bin_param = c(20,20),
                       all_genes = c("gene_A1","gene_A2","gene_B1","gene_B2"),
                       w_x = w_x, w_y = w_y)
# find the most relevant cluster for each gene
lasso_res = lasso_markers(gene_mt=vecs_lst$gene_mt,
                          cluster_mt = vecs_lst$cluster_mt,
                          sample_names=c("rep1"),
                          keep_positive=TRUE,
                          coef_cutoff=0.05,
                          background=NULL)