#clustering


Summer ☀️ read: a new paper on model-based clustering just appeared in Computo!

Julien Jacques and Brendan Thomas Murphy publish a new method for clustering multivariate count data. The method combines feature selection and clustering, and is based on conditionally independent Poisson mixture models and Poisson generalized linear models.

In simulations, the Adjusted Rand Index (ARI) of the model with selected variables is close to the optimal ARI obtained with the true clustering variables.
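For readers unfamiliar with the ARI, here is a minimal Python sketch of the standard pair-counting formula. It is not taken from the paper's R code; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items.

    Illustrative sketch only; not the implementation used in the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts and their marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the same partition with swapped label names.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy example: a partition that splits every true cluster.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the two partitions agree up to label names, values near 0 mean chance-level agreement, and negative values mean worse than chance.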

The paper and accompanying R code are available at computo-journal.org/published-

Don't miss the insightful new lecture from Dr. Alejandro Rodriguez Garcia of the Abdus Salam International Centre for Theoretical Physics (ICTP)!

In this one, Alex provides a comprehensive overview of various clustering methods, including flat, fuzzy, and hierarchical approaches. His lecture not only discusses the mathematical foundations of techniques like k-means and k-medoids but also highlights their practical applications across fields such as image recognition and data classification.

This lecture is an excellent opportunity to deepen your understanding of unsupervised learning and engage critically with advanced clustering methods.

Join Enabla to watch the lecture and interact with Dr. Rodriguez Garcia for free! Ask questions and spark discussions with both him and the rest of the Enabla community: enabla.com/pub/1109/about

Clustering Machine Learning Certification

🌐 Take the exam online: edchart.com/certificate/cluste
📛 Get your verified digital credential: credly.com/org/edchart-technol

EdChart now offers the Clustering Machine Learning Certification, recognized globally and trusted by professionals. Take the online exam from anywhere in the world, and pay only if you pass.

The Clustering Workbench of the Carrot2 search engine is working now. It can cluster search results with one of three algorithms: Lingo, STC, or k-means. STC is the Suffix Tree Clustering method, a fast, phrase-based method that groups documents based on common, frequent phrases. The screenshot shows search results using Lingo clustering for the query "survey of AI tools for systematic reviews."

search.carrot2.org/#/workbench

#research #academia #Carrot2
#systematicReview
#clustering #Lingo #STC #k-means

Exciting news, our paper is out!

"Behavioral Clusters and Lesion Distributions in Ischemic Stroke, Based on NIHSS Similarity Network" in Springer's Journal of Healthcare Informatics Research: rdcu.be/efgma

With my co-first-author Andrea Zanola and co-authors, we explore the relations between behavioral measures of impairment after stroke, and the underlying brain lesions.
Rather than focusing on covariances at the population level, we first cluster individual behavioral phenotypes, and then explore the typical and significant lesions of each cluster.

Our technique, Repeated Spectral Clustering, is performed on a similarity network (derived from the General Distance Measure, handy for ordinal scales!), and the partitions are statistically robust thanks to the aggregation of results from multiple random initializations.
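To illustrate the idea of aggregating partitions from multiple random initializations, here is a minimal Python sketch of one common approach, a co-association (consensus) matrix. This is an assumption for illustration, not necessarily the aggregation used in the paper; the function names, the `min_agreement` cutoff, and the toy runs are hypothetical.

```python
from itertools import combinations


def co_association(runs):
    """Fraction of runs in which each pair of items shares a cluster."""
    n = len(runs[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in runs:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(runs) for pair, c in counts.items()}


def consensus_partition(runs, min_agreement=0.5):
    """Link items that co-cluster in more than `min_agreement` of the runs,
    then return the connected components as consensus labels.

    Hypothetical aggregation rule, chosen for simplicity.
    """
    n = len(runs[0])
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), frac in co_association(runs).items():
        if frac > min_agreement:
            parent[find(i)] = find(j)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Three toy runs over four items; label names differ between runs,
# and the third run misplaces item 2.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
consensus = consensus_partition(runs)
```

Because the co-association matrix only records whether two items share a cluster, it is unaffected by label names changing between runs, which is what makes aggregation across random initializations straightforward.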

We end up with 5 clusters, 3 of which show well-known principal components of deficits (Left Motor, Right Motor, Language), and their associated lesions.

Interestingly, this multi-item and multimodal approach makes it possible to distinguish different etiologies for the same deficits, thanks to their different behavioral associations and the different lesions characterizing each cluster. Even when the single NIHSS measure is a bit "vague"...

We hope that popularizing the General Distance Measure, Repeated Spectral Clustering, and this clustering perspective alongside PCA / CCA studies can inspire multimodal approaches in other neuroscientific and biomedical domains!

Many thanks to our co-authors, Antonio Luigi Bisogno, Silvia Facchini, Lorenzo Pini, Manfredo Atzori and Maurizio Corbetta for data, analytic and medical insights, and their guidance throughout the whole process!

**OptimOTU: Taxonomically aware OTU clustering with optimized thresholds and a bioinformatics workflow for metabarcoding data**

arxiv.org/abs/2502.10350

To turn environmentally derived metabarcoding data into community matrices for ecological analysis, sequences must first be clustered into operational taxonomic units (OTUs). This task is particularly complex for data including large numbers of taxa with incomplete reference libraries. OptimOTU offers a taxonomically aware approach to OTU clustering. It uses a set of taxonomically identified reference sequences to choose optimal genetic distance thresholds for grouping each ancestor taxon into clusters which most closely match its descendant taxa. Then, query sequences are clustered according to preliminary taxonomic identifications and the optimized thresholds for their ancestor taxon. The process follows the taxonomic hierarchy, resulting in a full taxonomic classification of all the query sequences into named taxonomic groups as well as placeholder "pseudotaxa" which accommodate the sequences that could not be classified to a named taxon at the corresponding rank. The OptimOTU clustering algorithm is implemented as an R package, with computationally intensive steps implemented in C++ for speed, and incorporating open-source libraries for pairwise sequence alignment. Distances may also be calculated externally, and may be read from a UNIX pipe, allowing clustering of large datasets where the full distance matrix would be inconveniently large to store in memory. The OptimOTU bioinformatics pipeline includes a full workflow for paired-end Illumina sequencing data that incorporates quality filtering, denoising, artifact removal, taxonomic classification, and OTU clustering with OptimOTU. The OptimOTU pipeline is developed for use on high performance computing clusters, and scales to datasets with millions of reads per sample, and tens of thousands of samples.
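The threshold-selection idea in the abstract can be sketched in a few lines of Python. This is not the OptimOTU R package or its API: the sketch assumes single-linkage clustering and a pair-counting F-measure as the agreement score, and all function names, distances, and candidate thresholds are hypothetical.

```python
from itertools import combinations


def single_linkage_clusters(n, distances, threshold):
    """Join items whose pairwise distance is <= threshold (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in distances.items():
        if d <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def pair_f1(labels_true, labels_pred):
    """F-measure over item pairs: do clusters group the same pairs as taxa?"""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(distances, taxa, candidates):
    """Return (threshold, score) whose clusters best match the reference taxa.

    Ties are broken in favour of the smaller threshold.
    """
    n = len(taxa)
    scored = [
        (pair_f1(taxa, single_linkage_clusters(n, distances, t)), t)
        for t in candidates
    ]
    score, threshold = max(scored, key=lambda st: (st[0], -st[1]))
    return threshold, score


# Toy reference set: four sequences from two descendant taxa.
taxa = ["A", "A", "B", "B"]
distances = {
    (0, 1): 0.01, (2, 3): 0.02,
    (0, 2): 0.10, (0, 3): 0.11, (1, 2): 0.12, (1, 3): 0.10,
}
threshold, score = best_threshold(distances, taxa, [0.005, 0.015, 0.03, 0.2])
```

In this toy data, thresholds that are too small leave sequences of the same taxon unmerged, while thresholds that are too large lump both taxa together, so the score peaks at an intermediate value.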