Cluster Analysis

pytraj.cluster.kmeans(traj=None, mask='*', n_clusters=10, random_point=True, kseed=1, maxit=100, metric='rms', top=None, frame_indices=None, options='', dtype='ndarray')

Perform k-means clustering and return the cluster index for each frame.

Parameters:

traj : Trajectory-like or iterable that produces Frame

mask : str, default: * (all atoms)

n_clusters : int, default: 10

random_point : bool, default: True

kseed : int, default: 1

seed for the random number generator

maxit : int, default: 100

maximum number of iterations

metric : str, {‘rms’, ‘dme’}, default: ‘rms’

distance metric

top : Topology, optional, default: None

only need to provide this Topology if traj does not have one

frame_indices : {None, 1D array-like}, optional

If not None, perform clustering only for the given frame indices. Note that this is different from the sieve keyword.

options : str, optional

extra cpptraj options controlling output, sieve, ...

Sieve options::

[sieve <#> [random [sieveseed <#>]]]

Output options::

[out <cnumvtime>] [gracecolor] [summary <summaryfile>] [info <infofile>] [summarysplit <splitfile>] [splitframe <comma-separated frame list>] [clustersvtime <filename> cvtwindow <window size>] [cpopvtime <file> [normpop | normframe]] [lifetime] [sil <silhouette file prefix>]

Coordinate output options::

[ clusterout <trajfileprefix> [clusterfmt <trajformat>] ] [ singlerepout <trajfilename> [singlerepfmt <trajformat>] ] [ repout <repprefix> [repfmt <repfmt>] [repframe] ] [ avgout <avgprefix> [avgfmt <avgfmt>] ]
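Because options is a single string of cpptraj keywords, it can be convenient to assemble it from pieces. The following is a minimal sketch only: build_cluster_options is a hypothetical helper that is not part of pytraj, it covers just the sieve, summary, and info keywords listed above, and the file name is a placeholder.

```python
def build_cluster_options(sieve=None, sieveseed=None, summary=None, info=None):
    """Assemble a cpptraj option string (hypothetical helper, not part of pytraj)."""
    parts = []
    if sieve is not None:
        parts.append(f"sieve {sieve}")
        if sieveseed is not None:
            # In the sieve syntax above, sieveseed is nested under `random`.
            parts.append(f"random sieveseed {sieveseed}")
    elif sieveseed is not None:
        raise ValueError("sieveseed requires sieve")
    if summary is not None:
        parts.append(f"summary {summary}")
    if info is not None:
        parts.append(f"info {info}")
    return " ".join(parts)


options = build_cluster_options(sieve=5, sieveseed=1, summary="summary.dat")
print(options)  # sieve 5 random sieveseed 1 summary summary.dat
```

The resulting string would then be passed as kmeans(traj, options=options).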

Returns:

Clustering result. As the Examples show, its cluster_index attribute is a 1D numpy array giving the cluster index of each frame.

Notes

  • If the distance matrix is too large (you get a MemoryError), add a sieve number to options (see Examples).

  • Install libcpptraj with the -openmp flag to speed up this calculation.
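To get a feel for why sieving helps with memory, the size of a pairwise distance matrix can be estimated by hand. This is a rough sketch under stated assumptions: one entry per unordered pair of frames and 4 bytes per entry. Both figures are illustrative and not measured cpptraj behavior, and pairwise_matrix_bytes is a hypothetical helper.

```python
import math


def pairwise_matrix_bytes(n_frames, sieve=1, bytes_per_entry=4):
    """Rough size in bytes of a triangular pairwise-distance matrix.

    Assumes one entry per unordered frame pair and 4-byte entries;
    both are illustrative assumptions, not measured cpptraj behavior.
    """
    # With sieve s, only every s-th frame enters the matrix.
    n_used = math.ceil(n_frames / sieve)
    return n_used * (n_used - 1) // 2 * bytes_per_entry


print(pairwise_matrix_bytes(100_000))           # 19999800000 (about 20 GB)
print(pairwise_matrix_bytes(100_000, sieve=5))  # 799960000 (about 0.8 GB)
```

Under these assumptions a sieve of 5 shrinks the matrix by roughly a factor of 25, since the number of pairs grows with the square of the number of frames.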

Examples

>>> import pytraj as pt
>>> from pytraj.cluster import kmeans
>>> traj = pt.datafiles.load_tz2()
>>> # use default options
>>> cluster_data = kmeans(traj)
>>> cluster_data.cluster_index
array([8, 8, 6, ..., 0, 0, 0], dtype=int32)
>>> cluster_data.centroids
array([95, 34, 42, 40, 71, 10, 12, 74,  1, 64], dtype=int32)
>>> # update n_clusters
>>> data = kmeans(traj, n_clusters=5)
>>> # update n_clusters and cluster on CA atoms only
>>> data = kmeans(traj, n_clusters=5, mask='@CA')
>>> # specify distance metric
>>> data = kmeans(traj, n_clusters=5, mask='@CA', kseed=100, metric='dme')
>>> # add sieve number for less memory
>>> data = kmeans(traj, n_clusters=5, mask='@CA', kseed=100, metric='rms', options='sieve 5')
>>> # add sieve number for less memory, and specify random seed for sieve
>>> # (in the sieve syntax above, sieveseed follows the random keyword)
>>> data = kmeans(traj, n_clusters=5, mask='@CA', kseed=100, metric='rms', options='sieve 5 random sieveseed 1')
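
A per-frame cluster index such as cluster_data.cluster_index is often easier to use once it is grouped by cluster. The sketch below shows one way to do that in plain Python. The helper frames_by_cluster is hypothetical, and the input list is made-up data standing in for a real cluster_index array.

```python
from collections import defaultdict


def frames_by_cluster(cluster_index):
    """Map each cluster id to the list of frame numbers assigned to it."""
    groups = defaultdict(list)
    for frame, cluster in enumerate(cluster_index):
        # int() also accepts numpy integer scalars such as int32.
        groups[int(cluster)].append(frame)
    return dict(groups)


# Made-up stand-in for cluster_data.cluster_index
cluster_index = [2, 2, 1, 0, 0, 0]
groups = frames_by_cluster(cluster_index)
print(groups)  # {2: [0, 1], 1: [2], 0: [3, 4, 5]}
```

With a real result, frames_by_cluster(cluster_data.cluster_index) would give the frames belonging to each cluster, from which cluster populations follow as the list lengths.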