Tips

process many files

Normally, user needs to follow below code to process many files

In [1]: import pytraj as pt

In [2]: import numpy as np

In [3]: template = 'tz2.%s.nc'

In [4]: flist = [template % str(i) for i in range(4)]

In [5]: print(flist)
['tz2.0.nc', 'tz2.1.nc', 'tz2.2.nc', 'tz2.3.nc']

In [6]: trajlist = [pt.iterload(fname, 'tz2.parm7') for fname in flist]

# calculate radgyr
In [7]: data = []

In [8]: for traj in trajlist:
   ...:     data.append(pt.radgyr(traj))
   ...: 

In [9]: data = np.array(data).flatten()

In [10]: print(data)
[  9.4451   9.5746  10.1923 ...,   7.4834   7.4324   7.6204]

However, pytraj offer shorter (and easy?) way to do

# load all files into a single TrajectoryIterator
In [11]: filelist = ['tz2.0.nc', 'tz2.1.nc', 'tz2.2.nc']

In [12]: traj = pt.iterload(filelist, 'tz2.parm7')

# perform calculation
In [13]: data = pt.radgyr(traj)

# that's it
In [14]: print(data)
[  9.4451   9.5746  10.1923 ...,   7.4834   7.4324   7.6204]

load from a list of files with frame stride

Supposed you have a list of 5 (or whatever) trajectories, you only want to load those files 1st to 100-th frames and skip every 10 frames. Below is a convention cpptraj input.

parm 2koc.parm7
trajin traj0.nc 1 100 10
trajin traj1.nc 1 100 10
trajin traj2.nc 1 100 10
trajin traj3.nc 1 100 10
trajin traj4.nc 1 100 10

In pytraj, you can specify frame_slice

import pytraj as pt
pt.iterload('traj*.nc', top='2koc.parm7', frame_slice=[(0, 100, 10),]*5)

# [(0, 100, 10),]*5 is equal to [(0, 100, 10), (0, 100, 10),(0, 100, 10),(0, 100, 10),(0, 100, 10),]

load specific frame numbers to memory

In [15]: import pytraj as pt

In [16]: frame_indices = [2, 4, 7, 51, 53]

# use ``load`` to load those frames to memory
In [17]: traj0 = pt.load('tz2.nc', 'tz2.parm7', frame_indices=frame_indices)

In [18]: traj0
Out[18]: 
pytraj.Trajectory, 5 frames: 
Size: 0.000025 (GB)
<Topology: 223 atoms, 13 residues, 1 mols, non-PBC>
           

# only loadd coordinates for specific atoms
In [19]: traj1 = pt.load('tz2.nc', 'tz2.parm7', frame_indices=frame_indices, mask='@CA')

In [20]: traj1
Out[20]: 
pytraj.Trajectory, 5 frames: 
Size: 0.000001 (GB)
<Topology: 12 atoms, 12 residues, 12 mols, non-PBC>
           

# or use ``iterload``
In [21]: frame_indices = [2, 4, 7, 51, 53]

In [22]: traj2 = pt.iterload('tz2.nc', 'tz2.parm7')

In [23]: traj2
Out[23]: 
pytraj.TrajectoryIterator, 101 frames: 
Size: 0.000503 (GB)
<Topology: 223 atoms, 13 residues, 1 mols, non-PBC>
           

In [24]: traj2[frame_indices, '@CA']
Out[24]: 
pytraj.Trajectory, 5 frames: 
Size: 0.000001 (GB)
<Topology: 12 atoms, 12 residues, 12 mols, non-PBC>

merge multiple trajectories to a single file

In [25]: import pytraj as pt

# load multiple files
In [26]: traj = pt.iterload(['tz2.0.nc', 'tz2.1.nc', 'tz2.2.nc'], top='tz2.parm7')

In [27]: traj.save('tz2_combined.nc', overwrite=True)

memory saving

If memory is critical, do not load all frames into memory.

# DO this (only a single frame will be loaded to memory)
In [28]: pt.radgyr(traj, frame_indices=[0, 200, 300, 301])
Out[28]: array([ 9.4451,  7.4324,  7.4834,  7.4324])

# DON'T do this if you want to save memory (all 4 frames will be loaded to memory)
In [29]: pt.radgyr(traj[[0, 200, 300, 301]])
Out[29]: array([ 9.4451,  7.4324,  7.4834,  7.4324])

In [30]: pt.iterframe(traj, frame_indices=[0, 200, 300, 301])
Out[30]: 
<FrameIterator with start=None, stop=None, step=None, n_frames=4, 
frame_indices=[0, 200, 300, 301], 
mask=None, autoimage=False, rmsfit=None, copy=False> 

In [31]: traj[[0, 200, 300, 301]]
Out[31]: 
pytraj.Trajectory, 4 frames: 
Size: 0.000020 (GB)
<Topology: 223 atoms, 13 residues, 1 mols, non-PBC>

Example: calculate pairwise rmsd for 3000 frames (~3.9 G) only costs 112.7 MB

$ python memory/pairwise_rmsd_shorter_version.py

pytraj.TrajectoryIterator, 3000 frames:
Size: 3.911264 (GB)
<Topology: 58329 atoms, 19202 residues, 19168 mols, PBC with box type = ortho>

Filename: memory/pairwise_rmsd_shorter_version.py

Line #    Mem usage    Increment   Line Contents
================================================
     6     30.1 MiB      0.0 MiB   @profile
     7                             def compute(mask='!:WAT,K+,Cl-', mat_type='half'):
     8     64.0 MiB     33.9 MiB       traj = pt.iterload('GAAC3.5000frames.nc', 'GAAC.topo', frame_slice=(0, 3000))
     9     64.1 MiB      0.1 MiB       print(traj)
    10    112.7 MiB     48.5 MiB       return pt.pairwise_rmsd(traj, mask=mask, mat_type=mat_type)

See also: Fancy indexing of trajectory

convert trajectory

# convert Amber netcdf to Charmm dcd file.
pt.iterload('traj.nc', 'prmtop').save('traj.dcd', overwrite=True)

pickle data

Sometimes you need to perform very long analysis (hours), you need to save the output to disk to do further analysis. You have options to save data to different files and write code to load the data back. However, you can use pytraj.to_pickle and pytraj.read_pickle to save the state of data. Check the example:

In [32]: traj3 = pt.datafiles.load_trpcage()

In [33]: data = pt.dssp(traj, ':3-7')

In [34]: data
Out[34]: 
(array(['THR:3', 'TRP:4', 'GLU:5', 'ASN:6', 'GLY:7'], 
       dtype='<U5'), array([['0', '0', '0', '0', '0'],
        ['0', '0', 'S', '0', '0'],
        ['0', '0', 'S', '0', '0'],
        ..., 
        ['0', '0', '0', '0', '0'],
        ['0', '0', 'S', '0', '0'],
        ['0', '0', 'S', '0', '0']], 
       dtype='<U1'), <pytraj.DatasetList with 8 datasets>
none_avg
[ 1.   0.8  0.8 ...,  1.   0.8  0.8]

para_avg
[ 0.  0.  0. ...,  0.  0.  0.]

anti_avg
[ 0.  0.  0. ...,  0.  0.  0.]
...

turn_avg
[ 0.  0.  0. ...,  0.  0.  0.]

bend_avg
[ 0.   0.2  0.2 ...,  0.   0.2  0.2])

In [35]: pt.to_pickle(data, 'my_data.pk')

# load the data's state back for further analysis
In [36]: pt.read_pickle('my_data.pk')
Out[36]: 
(array(['THR:3', 'TRP:4', 'GLU:5', 'ASN:6', 'GLY:7'], 
       dtype='<U5'), array([['0', '0', '0', '0', '0'],
        ['0', '0', 'S', '0', '0'],
        ['0', '0', 'S', '0', '0'],
        ..., 
        ['0', '0', '0', '0', '0'],
        ['0', '0', 'S', '0', '0'],
        ['0', '0', 'S', '0', '0']], 
       dtype='<U1'), <pytraj.DatasetList with 8 datasets>
none_avg
[ 1.   0.8  0.8 ...,  1.   0.8  0.8]

para_avg
[ 0.  0.  0. ...,  0.  0.  0.]

anti_avg
[ 0.  0.  0. ...,  0.  0.  0.]
...

turn_avg
[ 0.  0.  0. ...,  0.  0.  0.]

bend_avg
[ 0.   0.2  0.2 ...,  0.   0.2  0.2])

speed up calculation with parallel: using MPI

$ cat > radgyr_mpi.py <<EOF
import pytraj as pt

# import mpi4py to get rank of each core
from mpi4py import MPI
comm = MPI.COMM_WORLD

# load trajectory to each core. Use iterload to save memory
traj = pt.iterload('tz2.nc', 'tz2.parm7')

# compute radgyr by sending this method to pt.pmap_mpi function
data = pt.pmap_mpi(pt.radgyr, traj, '@CA')

# data is sent to first core (rank=0)
if comm.rank == 0:
    # save data
    # pt.to_pickle(data, 'data.pk')
    print(data)
EOF

$ # run
$ mpirun -n 4 python radgyr_mpi.py
# you can reload data by `pt.read_pickle('data.pk')

speed up calculation with parallel: multiple cores

# send pt.radgyr method to `pt.pmap` function
# need to specify the number of cores
# (choose wisely)
In [37]: pt.pmap(pt.radgyr, traj, '@CA', n_cores=4)
Out[37]: 
OrderedDict([('RoG_00000',
              array([ 8.1092,  7.7643,  8.0969, ...,  6.3806,  6.2414,  6.4899]))])

read cpptraj manual

This does not work with ipython-notebook but it’s still good for interactive ipython

In [106]: import pytraj as pt
In [107]: pt.info('radgyr')
        [<name>] [<mask1>] [out <filename>] [mass] [nomax] [tensor]
          Calculate radius of gyration of atoms in <mask>