pytraj.cluster

pytraj.cluster.kmeans(traj=None, mask='*', n_clusters=10, random_point=True, kseed=1, maxit=100, metric='rms', top=None, frame_indices=None, options='', dtype='ndarray')

Perform k-means clustering and return the cluster index for each frame.

Parameters:

traj : Trajectory-like or iterable that produces Frame

mask : str, default: * (all atoms)

n_clusters : int, default: 10

random_point : bool, default: True

maxit : int, default: 100

maximum number of iterations

metric : str, {'rms', 'dme'}, default: 'rms'

distance metric

top : Topology, optional, default: None

only need to provide this Topology if traj does not have one

frame_indices : {None, 1D array-like}, optional

if not None, only perform clustering for the given frame indices. Note that this is different from the sieve keyword.

options : str, optional

extra cpptraj options controlling output, sieve, ...

Sieve options:
  [sieve <#> [random [sieveseed <#>]]]

Output options:
  [out <cnumvtime>] [gracecolor] [summary <summaryfile>] [info <infofile>]
  [summarysplit <splitfile>] [splitframe <comma-separated frame list>]
  [clustersvtime <filename> cvtwindow <window size>]
  [cpopvtime <file> [normpop | normframe]] [lifetime]
  [sil <silhouette file prefix>]

Coordinate output options:
  [ clusterout <trajfileprefix> [clusterfmt <trajformat>] ]
  [ singlerepout <trajfilename> [singlerepfmt <trajformat>] ]
  [ repout <repprefix> [repfmt <repfmt>] [repframe] ]
  [ avgout <avgprefix> [avgfmt <avgfmt>] ]
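
The options string is plain cpptraj keyword text, so it can be assembled programmatically. The sketch below is a hypothetical helper (build_cluster_options is not part of pytraj) that follows the sieve and output grammar listed above; the file name summary.dat is only an illustration.

```python
def build_cluster_options(sieve=None, random_sieve=False, sieveseed=None,
                          summary=None, info=None):
    """Assemble a cpptraj option string from a few of the keywords listed above.

    Hypothetical helper, not part of pytraj.
    """
    parts = []
    if sieve is not None:
        parts.append(f"sieve {sieve}")
        # Per the grammar, 'sieveseed' is nested under 'random'.
        if random_sieve:
            parts.append("random")
            if sieveseed is not None:
                parts.append(f"sieveseed {sieveseed}")
    if summary is not None:
        parts.append(f"summary {summary}")
    if info is not None:
        parts.append(f"info {info}")
    return " ".join(parts)


options = build_cluster_options(sieve=5, random_sieve=True, sieveseed=1,
                                summary="summary.dat")
print(options)  # sieve 5 random sieveseed 1 summary summary.dat
```

The resulting string would then be passed as kmeans(traj, options=options).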

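Returns:

Clustering result. In the examples below, the returned object exposes cluster_index (a 1D numpy array with one cluster label per frame) and centroids (a 1D numpy array with one entry per cluster).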

Notes

  • If the distance matrix is too large (you get a MemoryError), add a sieve number to options (see the examples).

  • Install libcpptraj with the -openmp flag to speed up this calculation.
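
To see why a sieve helps with memory, the following back-of-the-envelope sketch counts the unique frame pairs in a pairwise distance matrix. It assumes that one distance is stored per pair of kept frames and that each distance takes 4 bytes; both are assumptions made for illustration, and the trajectory length is hypothetical.

```python
import math


def pairwise_entries(n_frames, sieve=1):
    """Number of unique frame pairs when only every `sieve`-th frame is kept."""
    kept = math.ceil(n_frames / sieve)
    return kept * (kept - 1) // 2


n_frames = 100_000   # hypothetical trajectory length
bytes_per_entry = 4  # assumption: single-precision distances

for sieve in (1, 5, 10):
    entries = pairwise_entries(n_frames, sieve)
    print(f"sieve {sieve}: {entries} pairs, "
          f"{entries * bytes_per_entry / 1e9:.2f} GB")
```

Because the pair count grows with the square of the number of kept frames, a sieve of 5 cuts the estimate by roughly a factor of 25.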

Examples

>>> import pytraj as pt
>>> from pytraj.cluster import kmeans
>>> traj = pt.datafiles.load_tz2()
>>> # use default options
>>> cluster_data = kmeans(traj)
>>> cluster_data.cluster_index
array([8, 8, 6, ..., 0, 0, 0], dtype=int32)
>>> cluster_data.centroids
array([95, 34, 42, 40, 71, 10, 12, 74,  1, 64], dtype=int32)
>>> # update n_clusters
>>> data = kmeans(traj, n_clusters=5)
>>> # update n_clusters with CA atoms
>>> data = kmeans(traj, n_clusters=5, mask='@CA')
>>> # specify distance metric
>>> data = kmeans(traj, n_clusters=5, mask='@CA', kseed=100, metric='dme')
>>> # add sieve number for less memory
>>> data = kmeans(traj, n_clusters=5, mask='@CA', kseed=100, metric='rms', options='sieve 5')
>>> # add sieve number for less memory, with random sieving and a fixed sieve seed
>>> data = kmeans(traj, n_clusters=5, mask='@CA', kseed=100, metric='rms', options='sieve 5 random sieveseed 1')
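
Once the per-frame labels are available, cluster populations and memberships follow from simple counting. The sketch below uses a made-up label list standing in for cluster_data.cluster_index, so it runs without pytraj; the labels are not real results.

```python
from collections import Counter

# Made-up labels standing in for cluster_data.cluster_index (one label per frame).
cluster_index = [2, 2, 1, 1, 1, 0, 0, 0, 0, 2]

# Number of frames assigned to each cluster label.
populations = Counter(cluster_index)

# Frame indices belonging to each cluster, keyed by cluster label.
members = {}
for frame, label in enumerate(cluster_index):
    members.setdefault(label, []).append(frame)

for label, count in sorted(populations.items()):
    print(f"cluster {label}: {count} frames -> {members[label]}")
```

With a real result, list(cluster_data.cluster_index) could be substituted for the made-up list.
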

pytraj.cluster.dbscan(traj=None, mask='', options='', dtype='dataset')

Perform DBSCAN clustering. Limited support.

Parameters:

traj : Trajectory-like or any iterable that produces Frame

mask : str

atom mask

dtype : str

return data type

top : Topology, optional

options : str

additional cpptraj options

Notes

Call pytraj._verbose() to see more output. Turn it off with pytraj._verbose(False).

cpptraj manual:

Algorithms:
  [hieragglo [epsilon <e>] [clusters <n>] [linkage|averagelinkage|complete]
    [epsilonplot <file>]]
  [dbscan minpoints <n> epsilon <e> [sievetoframe] [kdist <k> [kfile <prefix>]]]
  [dpeaks epsilon <e> [noise] [dvdfile <density_vs_dist_file>]
    [choosepoints {manual | auto}]
    [distancecut <distcut>] [densitycut <densitycut>]
    [runavg <runavg_file>] [deltafile <file>] [gauss]]
  [kmeans clusters <n> [randompoint [kseed <seed>]] [maxit <iterations>]]
  [{readtxt|readinfo} infofile <file>]
Distance metric options: {rms | srmsd | dme | data}
  { [[rms | srmsd] [<mask>] [mass] [nofit]] | [dme [<mask>]] |
     [data <dset0>[,<dset1>,...]] }
  [sieve <#> [random [sieveseed <#>]]] [loadpairdist] [savepairdist] [pairdist <name>]
  [pairwisecache {mem | none}]
Output options:
  [out <cnumvtime>] [gracecolor] [summary <summaryfile>] [info <infofile>]
  [summarysplit <splitfile>] [splitframe <comma-separated frame list>]
  [clustersvtime <filename> cvtwindow <window size>]
  [cpopvtime <file> [normpop | normframe]] [lifetime]
  [sil <silhouette file prefix>]
Coordinate output options:
  [ clusterout <trajfileprefix> [clusterfmt <trajformat>] ]
  [ singlerepout <trajfilename> [singlerepfmt <trajformat>] ]
  [ repout <repprefix> [repfmt <repfmt>] [repframe] ]
  [ avgout <avgprefix> [avgfmt <avgfmt>] ]
Experimental options:
  [[drawgraph | drawgraph3d] [draw_tol <tolerance>] [draw_maxit <iterations>]]
Cluster structures based on coordinates (RMSD/DME) or given data set(s).
<crd set> can be created with the 'createcrd' command.

Examples

>>> import pytraj as pt
>>> traj = pt.datafiles.load_tz2()
>>> data = pt.cluster.dbscan(traj, mask='@CA', options='epsilon 1.7 minpoints 5')
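
To make the roles of epsilon and minpoints concrete, here is a toy sketch of the DBSCAN core-point test on 1D values. It is not cpptraj's implementation: real runs compare frames by the chosen distance metric rather than scalar differences, and the sketch assumes that a point counts as its own neighbour.

```python
def core_points(points, epsilon, minpoints):
    """Return indices of points with at least `minpoints` neighbours within `epsilon`.

    Assumption: a point counts as its own neighbour, as in the original DBSCAN
    description; cpptraj's exact convention is not stated in the text above.
    """
    cores = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if abs(p - q) <= epsilon)
        if neighbours >= minpoints:
            cores.append(i)
    return cores


# Toy 1D "coordinates": two dense groups and one isolated point.
points = [0.0, 0.5, 1.0, 1.2, 8.0, 8.3, 8.6, 20.0]

print(core_points(points, epsilon=1.7, minpoints=3))  # [0, 1, 2, 3, 4, 5, 6]
print(core_points(points, epsilon=1.7, minpoints=5))  # []
```

Raising minpoints or lowering epsilon makes the density requirement stricter, so fewer points qualify as cluster cores and more are treated as noise.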