From this dataset, we prepared another dataset by pruning structures that cover less than 95% of the residues of the whole chain. The protein lengths were obtained from UniProt [34]. There are 2,366 chains in this high-coverage single-chain dataset. For each chain, the fold class was assigned following CATH. In addition, we assigned biological unit information by referring to PISA [27]. This pruned dataset is shown in the insets of Fig 1A and 1B.
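The coverage criterion can be sketched as follows. This is a minimal illustration, not the code used in the study: the record layout and the field names `resolved_residues` and `full_length` are hypothetical, and the full length is assumed to be the UniProt sequence length.

```python
def filter_high_coverage(chains, min_coverage=0.95):
    """Keep chains whose resolved residues cover at least min_coverage
    of the full-length sequence (assumed here to be the UniProt length)."""
    kept = []
    for chain in chains:
        # Fraction of the full chain that is present in the structure.
        coverage = chain["resolved_residues"] / chain["full_length"]
        if coverage >= min_coverage:
            kept.append(chain)
    return kept


# Hypothetical records for illustration only.
example_chains = [
    {"id": "chainA", "resolved_residues": 96, "full_length": 100},
    {"id": "chainB", "resolved_residues": 80, "full_length": 100},
]
high_coverage = filter_high_coverage(example_chains)
```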

The representative set of single-chain protein structures was selected from a PISCES culled list with a resolution cutoff of 2.2 Å, an R factor cutoff of 0.2, and a pairwise sequence identity cutoff of 25% [33]. From the 7,260 chains in the list, we removed short chains with fewer than 40 amino acids. We also removed proteins that have a large spatial gap, i.e. structures having more than one cluster when Cα atoms were clustered with a 9 Å cutoff. We further removed 82 chains from the list because their sequences had more than 25% sequence identity to other chains. This process yielded a dataset of 6,841 non-redundant protein structures.
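The spatial-gap check can be illustrated with a short sketch. The text does not state which clustering algorithm was used, so the code below assumes single-linkage clustering, in which two Cα atoms belong to the same cluster if they are connected by a chain of atoms each within 9 Å of the next. The function names are hypothetical.

```python
from math import dist


def count_clusters(ca_coords, cutoff=9.0):
    """Count single-linkage clusters of C-alpha atoms.

    Two atoms are directly linked if their distance is <= cutoff (in Å);
    a cluster is a connected component of the resulting graph.
    """
    n = len(ca_coords)
    seen = [False] * n
    clusters = 0
    for start in range(n):
        if seen[start]:
            continue
        # New unvisited atom: start a new cluster and flood-fill from it.
        clusters += 1
        seen[start] = True
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if not seen[j] and dist(ca_coords[i], ca_coords[j]) <= cutoff:
                    seen[j] = True
                    stack.append(j)
    return clusters


def has_spatial_gap(ca_coords, cutoff=9.0):
    """A structure is flagged when its C-alpha atoms form more than one cluster."""
    return count_clusters(ca_coords, cutoff) > 1


# Toy coordinates: ten atoms spaced 3.8 Å apart, plus a distant second segment.
contiguous = [(3.8 * i, 0.0, 0.0) for i in range(10)]
gapped = contiguous + [(100.0 + 3.8 * i, 0.0, 0.0) for i in range(5)]
```

The quadratic neighbor search is adequate for single chains; a spatial grid or k-d tree would be a natural substitute for very large structures.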

From PDB, we identified structures that exist as a complex as defined in PISA and downloaded the first biological unit (BU). The same resolution, R factor, and length cutoffs as in the single-chain dataset were applied. A complex was considered redundant if there was another complex with the same number of chains and the corresponding chains between them had over 25% sequence identity. Among redundant complex entries, we chose the one with the highest resolution and the lowest R factor. This procedure yielded 5,326 complexes. Symmetry information for a complex was obtained from PDB if the BU of the complex had the same composition as in PDB. Symmetry information was obtained for 2,876 of the 5,326 complexes.

Protein surface shape representation

We used 3DZDs, mathematical rotation-invariant moment-based descriptors, to represent the surface shape of single-chain proteins and complexes. For a protein structure, a surface was constructed using the MSMS program [35] and then mapped to a 3D cubic grid of size $N^3$ ($N$ was set to 200). Protein size is not explicitly considered in the 3DZD calculation. However, in our previous study [23], we showed that it is rare for proteins of very different sizes to share a similar global surface shape. Moreover, we analyzed the chain length distribution in the single-chain shape space in Fig 4C and 4D. MSMS failed to generate a surface for two cases each in the single-chain dataset and the complex structure dataset, for which we used the MSROLL program [36] instead. Each voxel (a cube defined by the grid) is assigned either 1 or 0: 1 for a surface voxel located closer than 1.7 grid intervals to any triangle defining the protein surface, and 0 otherwise. This 3D grid of 1s and 0s was considered as a 3D function $f(\mathbf{x})$, for which a series is computed in terms of the Zernike-Canterakis basis [37], which is defined by the collection of functions

$$Z_{nl}^{m}(r,\vartheta,\varphi) = R_{nl}(r)\,Y_{l}^{m}(\vartheta,\varphi) \tag{1}$$

with $-l \le m \le l$, $0 \le l \le n$, and $(n-l)$ even. $Y_{l}^{m}(\vartheta,\varphi)$ are spherical harmonics. $R_{nl}(r)$ are radial functions defined by Canterakis, constructed so that $Z_{nl}^{m}(r,\vartheta,\varphi)$ are homogeneous polynomials when written in terms of Cartesian coordinates. 3D Zernike moments of $f(\mathbf{x})$ are defined as the coefficients of the expansion in this orthonormal basis, i.e. by the formula

$$\Omega_{nl}^{m} = \frac{3}{4\pi}\int_{|\mathbf{x}|\le 1} f(\mathbf{x})\,\overline{Z_{nl}^{m}(\mathbf{x})}\,d\mathbf{x} \tag{2}$$

To achieve rotation invariance, the moments are collected into $(2l+1)$-dimensional vectors $\Omega_{nl} = (\Omega_{nl}^{l}, \Omega_{nl}^{l-1}, \ldots, \Omega_{nl}^{-l})$, and the rotationally invariant 3D Zernike descriptors $F_{nl}$ are defined as norms of the vectors $\Omega_{nl}$. Thus

$$F_{nl} = \sqrt{\sum_{m=-l}^{l}\left|\Omega_{nl}^{m}\right|^{2}} \tag{3}$$

Index $n$ is called the order of the descriptor. The rotational invariance of 3D Zernike descriptors means, for example, that calculating $F_{nl}$ for a protein and for its rotated version yields the same result.
We used 20 as the order because it gave reasonable results in our previous work on protein 3D shape comparison [23,38–40]. A 3DZD with an order n of 20 represents a 3D structure as a vector of 121 invariants [23]. The similarity between two proteins X and Y was measured by the Euclidean distance $d_E$ between their 3DZDs, $d_E = \sqrt{\sum_{i=1}^{121}(X_i - Y_i)^2}$, where $X_i$ and $Y_i$ represent the ith invariant for proteins X and Y, respectively.
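The count of 121 invariants follows directly from the index constraints 0 ≤ l ≤ n ≤ 20 with (n − l) even, and the distance is a plain Euclidean norm of the difference vector. The sketch below checks both. The function names are illustrative and are not taken from the 3DZD software.

```python
from math import sqrt


def descriptor_indices(order=20):
    """List the (n, l) pairs that index the invariants F_nl up to a given order.

    A pair is valid when 0 <= l <= n <= order and (n - l) is even.
    """
    return [
        (n, l)
        for n in range(order + 1)
        for l in range(n + 1)
        if (n - l) % 2 == 0
    ]


def euclidean_distance(x, y):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("descriptor vectors must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


n_invariants = len(descriptor_indices(20))  # number of invariants at order 20
```

Each order n contributes floor(n/2) + 1 valid values of l, so the total for n = 0 to 20 is 2 × (1 + 2 + … + 10) + 11 = 121.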

To illustrate the characteristics of 3DZDs, we compared them against two other structure similarity measures, the Procrustes distance [41] and TM-Score [42]. The Procrustes distance is a root-mean-square deviation (RMSD) between corresponding points in two objects after an appropriate optimization of translation, rotation, and scaling. The smaller the Procrustes distance, the more similar the shapes are. TM-Score, on the other hand, is a common measure of the similarity of the main-chain conformations of proteins. TM-Score ranges from 0 to 1, with 1 for identical protein structures. Proteins within the same fold usually have a score above 0.5. The Euclidean distance of 3DZDs is usually below 10 for proteins of the same shape [23,39].

In S3 Fig, the Euclidean distance of 3DZDs and the Procrustes distance were compared on two datasets. Panel A compares pairs of 20 ellipsoids with increasing eccentricities, while panel B shows results on 1,278 single-chain protein pairs that have the same number of vertices in the surface representation. The two measures correlated well, with a correlation coefficient of 0.9784, for the ellipsoid dataset (S3A Fig), because surface points were systematically distributed in the same fashion for all the ellipsoids and thus corresponding points are easily matched when aligning two ellipsoids. On the other hand, the two measures often gave very different distances for protein shapes (S3B Fig), which typically happened when point correspondences did not even allow appropriate scaling of the two structures. In S3B Fig, there are many protein pairs that have different surface shapes, with a 3DZD Euclidean distance of over 10, but a small Procrustes distance of around 0.2. S3C and S3D Fig show such protein pairs. As shown, proteins in these pairs have very different shapes, which indicates that 3DZD performs more reasonably for comparing protein shapes. Indeed, the Procrustes distance has difficulty with protein shape comparison because corresponding surface points in two proteins need to be determined prior to the distance computation, and such correspondences are not available in general for protein surfaces. This is even more difficult when the two proteins have different numbers of surface points to be compared. 3DZD does not have this problem because it does not align points to points.

S4A and S4B Fig show the comparison between 3DZD and TM-Score. As shown, these two measures have virtually no correlation; the correlation coefficient was -0.1735. Panel B shows the density of the two measures. The highest density (yellow) was observed at a 3DZD distance of around 5 to 10 and a TM-Score of around 0.3, which is the score range for proteins with a similar surface shape but a different main-chain fold. As also shown in Table 1, there are cases in which proteins of different fold classes have a small 3DZD Euclidean distance. S4C and S4D Fig show two such examples, where two structures have a similar surface shape according to 3DZD but a very large difference in their main-chain conformations. These results are consistent with our earlier work, where we extensively compared 3DZD with conventional protein structure comparison methods [23].

The 3DZD files of the single-chain and the complex datasets are available in S1 Data. 3DZDs can also be computed for PDB files at the benchmark page of 3D-SURFER (http://kiharalab.org/3d-surfer/batch.php) [25,38].