# Structure Comparison Algorithm

Our method uses predominantly the goodness of geometric superposition as similarity measurement and considers the minimum unit of a match between two protein structures is a pair of residues. Protein structure is represented, in structural comparison, by its traces of alpha carbons. Thus, in our view, protein was an object composed of connected points in a 3-dimensional space. To find the most similar part between two proteins thereby is equivalent to obtain a list of matched pairs of cartesian points. Insertion, deletion, and sequence-order-dependent problems will not exist in our method.

Our residue-based algorithm can be described in two processes. In first process, a set of superpositions based on local similarity between two proteins were obtained through an efficient exhausting searches. In second process, found local similarities were then used as a starting point in an iterative procedure to locate the best global similarity between two proteins. In overall, our method is a fast, reliable protein structure comparison tool.

Since proteins are connected chains, we can superimpose every three consecutive alpha carbon atoms of one protein onto every three consecutive alpha carbon atoms of the other protein in order to find all possible local similarities between proteins. Thus, comparing two proteins said both with N residues, we have to perform N2 operations of comparison for exaustively searching good local matches. With Geometry Hashing algorithm, we are able to accomplish the same task described above by performing only N instead of N2 operations. The Geometry Hash algorithm is composed of two procedures, namely the "hash table construction" and "voting" procedure, respectively.

### Hash Table Construction

In construction of a hash table, an orthogonal reference frame (Rx, Ry, Rz) was calculated for every three consecutive alpha carbons in a protein A as following:

Rx = Ci - Cj
Vt = Ck - Cj
Rz = Rx X Vt
Ry = Rx X Rz

where Ci, Cj, and Ck are the coordinates of three consecutive alpha carbon atoms. Local structural environment of every frame was then implicitly registered in the table according their transformed structural invariants which were computed by projecting onto the orthogonal reference frame. Usually, all residues within a ball of 15 Angstrom were included to represent a local structure. For each entry in the table with address directly calculated from a structural invariants, both the frame identifier (Fi) and alpha carbon atom identifier (Ai) were stored. Since different reference frame might produces same structural invariants, we are very likely to have multiple entries in a single address. For example, if we include the central alpha carbom atom of the reference frame in the table, obviously all of them will occupy at the same address.

### Voting Process

In voting procedure, we repeat the same calculations for another protein B to obtain a address for every environment residue of every reference frame. However, instead of registering it into table we retrieve all entries at that address and update the voting records. For every frame pair (Fi,Fj), we keep the vote count Ni,j and a list of matched alpha carbon atom pairs, (Ai,Bj)n, n=1,Ni,j.

For any frame pair in which vote count exceeds a preset threshold, it is called a seed match. At this stage, we are able to detect local similarity between two proteins by geometric superposition criteria but no actual superimposition calculation was performed. A translational vector and rotational matrix for every seed match was calculated by a superposition procedure via its associated list of matched alpha carbon atom pairs. If a protein was compared with itself, we are going to obtain N-2 exactly the same seed matches. This imply similar seed matches are needed to be clustered and coalesced. After similar seed matches were grouped, a set of potential matches emerged and passed to the extending process.

### Extending Process

This is an iterative process. After one protein (or interface) was superimposed onto another protein, a new matched pair list was reassigned, in which the distance between a matched pair must be shorter than a preset threshold (3.5 Angstrom was used in this study). When one residue from a protein can be matched to many residues in another protein by distance criterion, the selection was always made in favor of some selected similarity score. Here a simple procedure was devised to select a match with improved connectivity score in protein structural comparison. When the similarity score converges within a threshold as compared with previous iteration, an optimal extended alignment was achieved and an extending process ends, otherwise iteration continues.

### Parameters used in the clustering

The first column gives the iteration cycle. There are six successive clustering cycles (A through F). The second column gives the number of interface clusters at the beginning and end of the iteration. For example, during iteration A, there were initially 21,686 interfaces (in the first cycle, this number is equal to the number of two-chain interfaces in the PDB). Using the similarities of the structures and sequences (with the parameters listed in columns 3-5) the number decreased to 16,446. The connectivity score takes into account the residue connectivity in the polypeptide chains. The score favors a match with consecutive residues. At the end of the sixth (final) cycle, we obtained 3,799 distinct clusters. After this cycle, members of each cluster had at least 0.5 connectivity score. There was no amino acid similarity constraint and the maximal size difference between interfaces was 50 residues.