Pfam is a database of protein families and domains that is widely used to analyse novel genomes and metagenomes and to guide experimental work on particular proteins and systems 1. Since its inception in 1998, Pfam has been designed to scale with the growth in the number of new protein sequences deposited 2. To achieve this scalability 2, each Pfam family has a seed alignment that contains a representative set of sequences for the entry 1. The seed alignments are used to build profile hidden Markov models (HMMs) that can be used to search any sequence database for homologues in a sensitive and accurate fashion 2. Homologues that score above the curated inclusion thresholds are aligned against the profile HMM to make a full alignment 2. By searching a protein sequence against the Pfam library of profile HMMs, you can determine which domains it carries 5. Pfam can also be used to analyse proteomes and to address questions about more complex domain architectures 5.
Some Pfam entries are grouped into clans 5. Pfam defines a clan as a collection of entries that have arisen from a single evolutionary origin 5. Evidence of their evolutionary relationship can take the form of similarity in tertiary structure or, when structures are not available, common sequence motifs 5.
The newest version, Pfam 36.0, contains a total of 20,795 families and 660 clans 4. Since the last release, 1,191 new families have been built, 28 families have been killed, and 5 new clans have been created 4. Additionally, around 1.5% of existing Pfam entries have been updated 4. The boundaries of 2,818 families have changed, 281 of them by more than 50 residues; most of these families were trimmed or split into domains, often because of improved information from accurate structural models 4.
Pfam families are divided into two categories, Pfam-A and Pfam-B 3.
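The workflow described above (seed alignment, profile, database search, inclusion threshold, full alignment) can be illustrated with a deliberately simplified sketch. It is not how Pfam is built: in place of a profile HMM it uses an ungapped position-specific scoring matrix, and the seed sequences, database sequences, uniform background frequencies, pseudocount and threshold are all made-up values chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency


def build_profile(seed_alignment, pseudocount=1.0):
    """Turn an ungapped seed alignment into per-column log-odds scores (bits)."""
    width = len(seed_alignment[0])
    if any(len(seq) != width for seq in seed_alignment):
        raise ValueError("all seed sequences must have the same length")
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*seed_alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(profile)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0 bits.
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


def full_alignment(profile, database, threshold):
    """Keep the matched region of every sequence scoring at or above threshold."""
    members = {}
    for name, sequence in database.items():
        score, start = best_hit(profile, sequence)
        if score >= threshold:
            members[name] = sequence[start:start + len(profile)]
    return members


# Hypothetical toy data, not taken from any real Pfam family.
SEED = ["ACDEF", "ACDEF", "ACDQF", "ACEEF"]
DATABASE = {
    "hit1": "MMACDEFKK",
    "hit2": "GGACDQFLL",
    "decoy": "WWWWWWWWWWW",
}
THRESHOLD = 8.0  # hypothetical stand-in for a curated inclusion threshold

profile = build_profile(SEED)
members = full_alignment(profile, DATABASE, THRESHOLD)
```

The point of the sketch is the division of labour: only the small seed alignment needs manual attention, while membership of the full alignment follows automatically from the profile and the threshold and can be recomputed whenever the sequence database grows.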
Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile HMMs built from the seed alignment, and an automatically generated full alignment that contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases 3. Pfam-B entries are automatically generated from the ProDom database and are each represented by a single alignment 3. ProDom originates from the early recognition that automated methods are needed to reach comprehensiveness in protein domain analysis; this comprehensiveness makes it a unique resource that usefully complements expert-derived databases such as Pfam 6. The use of representative seed alignments for Pfam-A families allows efficient and sustainable manual curation of alignments and annotation, while the automatic generation of full alignments and Pfam-B clusters ensures that Pfam is a comprehensive classification of protein families that scales effectively with the growth of the sequence databases 3.
Pfam type definitions divide entries into one of six types, which can help users select which Pfam families to use in their analyses 1. In particular, a large-scale screen of Pfam families has been carried out using the ncoils software to identify families with a high proportion of predicted coiled-coil, and after inspection of these families their types could be changed 1.
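The kind of screen described above can be sketched as a simple filter that shortlists families for manual inspection. The input format here is hypothetical and is not the actual ncoils output: each sequence is represented by a mask string in which "x" marks a residue predicted to be coiled-coil and "-" marks any other residue. The family names and the 0.5 cutoff are likewise assumptions made for illustration.

```python
def coiled_coil_fraction(masks):
    """Fraction of residues marked 'x' across all mask strings of one family."""
    total = sum(len(mask) for mask in masks)
    if total == 0:
        return 0.0
    flagged = sum(mask.count("x") for mask in masks)
    return flagged / total


def shortlist_families(family_masks, cutoff=0.5):
    """Return families whose coiled-coil fraction is at or above the cutoff.

    The result is only a shortlist: in the process described in the text,
    such families are inspected before their type is changed.
    """
    return sorted(
        family
        for family, masks in family_masks.items()
        if coiled_coil_fraction(masks) >= cutoff
    )


# Hypothetical prediction masks for two made-up families.
FAMILY_MASKS = {
    "FAM_A": ["xxxxxxxx--", "xxxxxxx---"],
    "FAM_B": ["----------", "--xx------"],
}

candidates = shortlist_families(FAMILY_MASKS)
```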
Pfam is now hosted by InterPro 7. InterPro is a bioinformatics resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites 8. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro Consortium 8. Pfam is one of these member databases. Different member databases use different methods to construct their signatures, and each has its own particular focus of interest: structural and/or functional domains, protein families, or protein features such as active sites or binding sites 9.
The InterPro homepage is easily accessible by clicking on the InterPro link in the Proteins panel on the EBI services page 8.
The Pfam website has been retired. The Pfam website codebase was first released over 20 years ago, and although it has been updated from time to time, some of its core functionality still dates back to its origins 10. In its current state it carries a lot of technical debt and is becoming harder to maintain 10. By retiring the website, the core Pfam effort can focus on producing the data 10, while the deployment and visualisation tasks are left to the InterPro website 10. InterPro was redesigned in recent years using up-to-date technologies, including a modern framework 10.
1 https://academic.oup.com/nar/article/49/D1/D412/5943818
2 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808889/
3 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238907/
4 https://xfam.wordpress.com/2023/09/18/pfam-36-0-release/
5 https://pfam-docs.readthedocs.io/en/latest/faq.html
6 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC539988/
7 Pfam is now hosted by InterPro (xfam.org)
8 https://www.ebi.ac.uk/training/online/courses/interpro-functional-and-structural-analysis/what-is-interpro/
9 https://www.ebi.ac.uk/training/online/courses/interpro-functional-and-structural-analysis/what-is-interpro/where-does-data-come-from/
10 https://xfam.wordpress.com/2022/08/04/pfam-website-decommission/